Learning Trading Negotiations Using Manually and Automatically Labelled Data


Heriberto Cuayáhuitl, Simon Keizer, Oliver Lemon
School of Mathematical and Computer Sciences, Heriot-Watt University, United Kingdom
{h.cuayahuitl s.keizer

Abstract

Strategic conversational agents often need to trade resources with their opponent conversants, and trading strategically can lead to better results. While rule-based or supervised agents can be used for such a purpose, here we explore a learning approach based on automatically labelled examples from human players for automatic trading in the game of Settlers of Catan. Our experiments are based on data collected from human players trading in text-based natural language. We compare the performance of Bayes Nets, Conditional Random Fields, and Random Forests on the task of ranking trading offers, trained from both manually labelled and automatically labelled data. Our experimental results show that our best agent trained on automatic labels outperformed its counterpart trained on manual labels (with moderate annotator agreement) in terms of (a) predicting human trading negotiations better, and (b) winning more games.

Keywords: strategic interaction; supervised learning; semi-supervised learning; automatic labelling; board games

I. INTRODUCTION

Strategic conversation does not assume full cooperation during the interaction between agents [1]. In this paper, we use a strategic card-trading board game to illustrate our approach. Board games with trading aspects aim not only at entertaining people, but also at training them in trading skills. Popular board games of this kind include Last Will, Settlers of Catan, and Power Grid, among others [2]. While these games can be played between humans, they can also be played between computers and humans. The trading behaviours of computer players are usually based on heuristics or optimisation methods.
The former include carefully tuned rules, and the latter include methods such as Monte-Carlo tree search [3] and reinforcement learning [4], [5], [6], or a combination of them [7]. However, their application is not trivial due to the complexity of the problem, e.g. large state-action spaces. On the one hand, unique situations in the game can be described by a number of variables (e.g. resources available), so that enumerating them would result in very large state spaces. On the other hand, the action space can also be large due to the wide range of unique negotiations (e.g. givable and receivable resources). While one can aim for optimising the whole game via compression of the search space, one can also aim for a specialised solution. This paper takes the latter route: it focuses on learning to trade only, rather than learning to play the whole game. In addition, while previous work has focused on optimising negotiation strategies [4], [3], our proposed approach focuses on learning human-like trading from human examples, despite the fact that in reality the best choice may not be the most human-like one, especially with non-expert player data.

Our scenario for strategic interaction is the game of Settlers of Catan, where players take the role of settlers on the fictitious island of Catan (see Figure 1). The board consists of 19 randomly arranged hexes: 3 hills, 3 mountains, 4 forests, 4 pastures, 4 fields and 1 desert. On this island, hills produce clay, mountains produce ore, pastures produce sheep, fields produce wheat, forests produce wood, and the desert produces nothing. In our setting, four players attempt to settle on the island by building settlements and cities connected by roads. To build, players need specific resource cards, for example: a road requires clay and wood; a settlement requires clay, sheep, wheat and wood; a city requires three ore cards and two wheat cards; and a development card requires ore, sheep and wheat.
Each player gets points, for example, by building a settlement (1 point) or a city (2 points), or by obtaining victory point cards (1 point each). A game consists of a sequence of turns, and each turn starts with a roll of the dice that can make the players obtain or lose resources (depending on the number rolled and the resources on the board). The player in turn can trade resources with the bank or other players, and can make use of available resources to build roads, settlements or cities. This game is highly strategic because players often face decisions about what resources to request and what resources to give away, which are influenced by what they need to build. A player can extend build-ups on locations connected to existing pieces, i.e. road, settlement or city, and all settlements and cities must be separated by at least 2 roads. The first player to reach 10 victory points wins and all others lose.

This paper extends our previous approach based on statistical inference for ranking trading negotiations [9], i.e. the exchange of resources for some others, from training on manually labelled data to training on automatically labelled data. We compare three statistical agents (Bayes Nets, Conditional Random Fields, and Random Forests) against rule-based and random agents as baselines, and show that our best agent, trained on automatically labelled data, performed better than its counterpart trained on manually labelled data.

Figure 1. Example board of the game Settlers of Catan [8]. The top-middle dialogue box is a chat interface that displays the game history, including trading offers and responses from all players.

II. RELATED WORK

Machine learning techniques for strategic trading games have received little attention to date. Notable exceptions have applied reinforcement learning to board games. [10] proposes reinforcement learning with multilayer neural networks for training an agent to play the game of Backgammon, and finds that agents trained with such an approach are able to match and even beat human performance. [4] proposes hierarchical reinforcement learning for automatic decision making on object-placing and trading actions in the game of Settlers of Catan. That work incorporates built-in knowledge for learning the behaviours of the game more quickly, and finds that the combination of learned and built-in knowledge is able to beat human players. [6] uses reinforcement learning in non-cooperative dialogue and focuses on a small 2-player trading problem with 3 resource types, but without using any real human dialogue data. This work showed that explicit manipulation moves (e.g. "I really need sheep") can be used to win when playing against adversaries who are gullible (i.e. they believe such statements), but also against adversaries who can detect manipulation and can punish the player for being manipulative [11]. More recently, [12] compare training policies against hand-crafted traders and supervised traders created from human players. They found that, rather than training trading policies on hand-crafted rule-based heuristics, a more successful approach is to train trading policies from a supervised classifier trained from human examples.

Related work on supervised learning using manually and automatically labelled data has reported divergent strategies.
One strategy has been to train classifiers for natural language processing (NLP) tasks using automatically extracted examples. For example, [13] train a classifier for discourse relations and report a classification accuracy of up to 93%. On the other hand, [14] compare classifiers for discourse relations trained from automatically extracted examples against those trained from manually labelled examples. The authors focus on a dataset with only moderate inter-annotator agreement (κ=0.592), and observe that classification accuracy drops substantially in the presence of ambiguous labels. The success of automatic labelling therefore seems to vary with the nature of the target dataset. In this paper, we present further evidence that automatic labelling can lead to good results.

Other supervised learning techniques have been applied to train automated agents that play board games, such as decision trees [15], preference learning [16], and deep neural networks [17]. Since statistical inference has received little attention in previous work, with some exceptions [17], [9], we argue that it can play an important role in training strategic agents with human-like behaviour. In addition, statistical traders have not been trained from automatically labelled data before, and our results suggest that this approach represents a state-of-the-art method for learning trading negotiations.

Other related work has been carried out in the context of automated non-cooperative dialogue systems, where an agent may act to satisfy its own goals rather than those of other participants [5]. The game-theoretic underpinnings of non-cooperative behaviour have also been investigated [18]. Such automated agents are of interest when trying to persuade, argue, or debate, or in the area of believable characters in video games and educational simulations [5], [19].
Another arena in which non-cooperative dialogue behaviour has been investigated is negotiation [20], where hiding information (and even outright lying) can be advantageous.

Despite the machine learning efforts applied to strategic interactive games, other forms of learning remain to be explored. They include not only direct but also inverse reinforcement learning to learn from trial and error, semi-supervised learning to learn from labelled and unlabelled data, unsupervised learning to learn from unlabelled data, multi-agent systems to learn behaviours considering the strategies of opponents, transfer learning so that agents do not have to be trained from scratch, and active learning to learn to ask what to do in uncertain situations while playing the game, among others; see [15], [21], [22] for an overview. Another direction to explore in strategic games is a combination of planning and learning, which has shown more promising results than either in isolation [17], [7]. A further direction to explore is end-to-end statistical training of language understanding [23], [24], game behaviour, and language generation [25], [26], [27] using a unified learning framework.

III. THE DATA AND TASK

We used a set of 32 logged games from 56 different players as described in [28]. Although they were carefully labelled by multiple annotators, they were difficult to annotate, as indicated by their moderate annotator agreement score of 0.62 according to the well-known kappa score [29]. The data correspond to 2512 trading negotiation events (also referred to as "training instances"), denoted as $D_m = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where the $x_i$ are vectors of features and the $y_i$ are class labels (i.e. givable resources). Our data set reports an average of 44.8 turns per player. An example trading negotiation in natural language in the game of Settlers of Catan is "I'll give anyone sheep for clay", which can be represented as follows, including the agent's available resources:

Givable(Sheep, all) ∧ Receivable(Clay, all) ∧ Resources(clay = 0, ore = 0, sheep = 4, wheat = 1, wood = 0) ∧ Buildups(roads = 2, settlements = 0, cities = 0).

From this illustrative example, $y_i$ = sheep (feature 14 in Table I) and $x_i = \{0, 0, 4, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0\}$ (features 1-13 in Table I). Although this representation may look simple at first sight, it supports approximately 2.6 billion possible (and unique) negotiation events. Even though the class label is only the givable, we use receivables as features in ranking all the possible offers, so all offers that include one givable and multiple receivables are in fact ranked (see footnote 2). Notice that not all of them are valid or legal at every point in time in the game.

Choosing the most human-like (in our case) trading negotiation can be seen as a ranking task, where we focus on computing a score representing the importance of each available trading negotiation (similar to the one above) in order to make the best choice, i.e. the most human-like.
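To make the representation above concrete, the following minimal sketch encodes the illustrative offer into the 13 input features and counts the combinations allowed by the feature domains of Table I. The function name `encode_offer`, the dictionary-based inputs, and the capping of counts at 7 are assumptions made for illustration, not the authors' implementation. The product of the domain sizes is consistent with the figure of roughly 2.6 billion quoted above, although whether the authors derived it in exactly this way is also an assumption.

```python
# Feature order follows Table I: f1-f5 resources, f6-f8 build-ups, f9-f13 receivables.
RESOURCES = ("clay", "ore", "sheep", "wheat", "wood")
BUILDUPS = ("roads", "settlements", "cities")


def encode_offer(resources, buildups, receivables):
    """Return the 13-value input vector x for one trading negotiation.

    Counts are capped at 7 to stay inside the {0...7} domains of Table I
    (an assumption about how larger counts would be handled).
    """
    has = [min(resources.get(r, 0), 7) for r in RESOURCES]    # f1-f5
    built = [min(buildups.get(b, 0), 7) for b in BUILDUPS]    # f6-f8
    rec = [1 if r in receivables else 0 for r in RESOURCES]   # f9-f13 (binary)
    return has + built + rec


# "I'll give anyone sheep for clay" with sheep = 4, wheat = 1 and 2 roads built.
x = encode_offer({"sheep": 4, "wheat": 1}, {"roads": 2}, {"clay"})
y = "sheep"  # class label (f14): the givable resource

# Eight features with 8 values each, five binary features, five possible givables.
space_size = 8 ** 8 * 2 ** 5 * 5

print(x, y, space_size)
```

Keeping the receivables binary, as in Table I, limits their contribution to the size of the space to a factor of 2^5 = 32.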
Because the ranking is learned from examples, the quality of our learning agents will depend on the quality of the examples provided. To rank such trading negotiation alternatives, we train a set of statistical classifiers based on the feature set described in Table I. Our set of features includes the resources available (features f1-f5), the build-ups (features f6-f8) with a default minimum of 0 and a maximum value of 7, the receivable resources in binary form to reduce data sparsity (features f9-f13), and the givable resource considered as the class prediction (feature f14).

ID    Feature          Domain     Description
f1    hasclay          {0...7}    Num. clay units available
f2    hasore           {0...7}    Num. ore units available
f3    hassheep         {0...7}    Num. sheep units available
f4    haswheat         {0...7}    Num. wheat units available
f5    haswood          {0...7}    Num. wood units available
f6    hasroads         {0...7}    Num. roads built so far
f7    hassettlements   {0...7}    Num. settlements built so far
f8    hascities        {0...7}    Num. cities built so far
f9    recclay          Binary     Clay offered by opponent?
f10   recore           Binary     Ore offered by opponent?
f11   recsheep         Binary     Sheep offered by opponent?
f12   recwheat         Binary     Wheat offered by opponent?
f13   recwood          Binary     Wood offered by opponent?
f14   givable          Resource   Clay/Ore/Sheep/Wheat/Wood

Table I. FEATURE SET FOR LEARNING TRADING NEGOTIATIONS FROM EXAMPLES.

Footnote 2: The feature set listed in Table I was chosen because it yielded the best performance in previous experiments from a pool of feature sets obtained by both manual and automatic feature selection. Other feature sets that we explored include smaller domains (only binary features), larger domains (non-binary features), smaller and larger sets of features, and multiple givables rather than a single one, among others.

An example subdialogue between players is shown in Table II. The first column shows the player IDs, where the fourth player was silent. Each game had four players in total.
The second column shows the messages typed and shown in the top-middle dialogue box of Figure 1. The third column shows the semantics of the textual messages. The last column shows the context of the trading negotiations, represented by features f1-f8 described in Table I. Subdialogues of this sort occur throughout the game and result in players accepting or rejecting trading offers from other players in turn.

Player   Message                               Semantics                          Context (f1...f8)
A        Anyone wants to trade wood for clay   Givable(wood) Receivable(clay)     0,0,0,2,3,4,2,0
A                                                                                 ,0,0,2,3,4,2,0
B        No-one wants wheat for clay?          Givable(wheat) Receivable(clay)    0,0,0,1,1,4,2,0
A        Wheat for clay?                       Givable(wheat) Receivable(clay)    0,0,1,2,2,4,2,0
C        Sheep for clay?                       Givable(sheep) Receivable(clay)    0,0,4,5,1,2,2,0
A        I got 1 sheep                         Give(sheep)                        0,0,1,2,2,4,2,0

Table II. EXAMPLE TRADING NEGOTIATIONS FROM HUMAN PLAYERS IN THE GAME OF SETTLERS OF CATAN.

IV. TRAINING APPROACHES

In this paper, we treat trading in strategic conversation as a classification task, where we train statistical classifiers either with manually labelled data (the typical approach) or with automatically labelled data (our proposed approach).

A. Training with Manually Labelled Data

To train statistical agents in a supervised manner, we first use only one data set of manually labelled trading examples $D_m = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where the $x_i$ are vectors of features and the $y_i$ are class labels. Each pair represents an instance used for training or testing by the learning methods described in Section V. See Figure 2(a) for an illustration.

B. Training with Automatically Labelled Data

We extend the previous approach by automatically re-labelling data set $D_m$ into $D_a = \{(x_1, y^p_1), \ldots, (x_N, y^p_N)\}$, where the $y^p_j$ are the labels predicted by the automatic labeller described below. This approach is motivated by the fact that it can generate potentially more useful data than its original source. We then use $D_a$ to train the statistical classifiers described in Section V. See Figure 2(b) for an illustration.

Figure 2. Illustration of training approaches using manually labelled data $D_m$ and automatically labelled data $D_a$. The latter is created by an automatic labeller, trained from the original source $D_m$, that re-labels the data; see Section IV-B for further details.

Our classifier for automatic labelling used as features the most common words in text-based trading messages from human players (see footnote 3), and the class labels were Givable and Receivable. This binary classifier used a Random Forest with 100 decision trees; see Section V-C for more details. Specifically, the word-level features included the most common words in the left-hand context of a resource in focus, and the most common words in the right-hand context of the same resource. In this way, the sentence "I give you sheep for clay" would be labelled as Givable(sheep) and Receivable(clay). In this illustrative example, the words "give" and "you" to the left of the resource "sheep" would be potentially relevant features for the class label Givable. Similarly, the word "for" to the left of the resource "clay" would be a potentially relevant feature for the label Receivable. In other words, our automatic labeller generated the semantics from raw text, as illustrated in columns 2 and 3 of Table II.

Footnote 3: The common words in text-based trading messages are defined as those whose frequency exceeds the average frequency of words and symbols (e.g. dots, question marks) in the training data.

We note that while manual labels referred to context beyond the sentence in a turn (e.g. one or more trades before the one in focus), our automatic labeller only referred to the local context of the sentence in focus. We also note that our automatic labels were agnostic about the players in focus, i.e. our automatic labeller did not take into account the sender and recipient players. Furthermore, we note that while the manual labels used 7 dialogue act types, our automatic labels focused on 3 dialogue act types; see Figures 3, 4, and 5. The smaller set of dialogue act types was used to reduce the complexity of the annotations.

Figure 3. High-level proportion of dialogue act types in manual and automatic labels.
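The word-level features used by the automatic labeller can be sketched as follows. This is an illustrative reconstruction only: the tokeniser, the window of two words on each side, and the names `tokenize` and `context_features` are assumptions, and the sketch omits both the frequency filter for common words and the Random Forest that would be trained on these features.

```python
import re

RESOURCES = {"clay", "ore", "sheep", "wheat", "wood"}


def tokenize(message):
    """Lower-case a chat message and split it into words and punctuation symbols."""
    return re.findall(r"[a-z']+|[?.!,]", message.lower())


def context_features(message, window=2):
    """Return one (resource, features) pair per resource mention in the message.

    Features mark which words occur within `window` tokens to the left (L:)
    or right (R:) of the resource in focus. The window size is an assumption.
    """
    tokens = tokenize(message)
    instances = []
    for i, token in enumerate(tokens):
        if token in RESOURCES:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            features = {f"L:{w}": 1 for w in left}
            features.update({f"R:{w}": 1 for w in right})
            instances.append((token, features))
    return instances


instances = context_features("I give you sheep for clay")
for resource, features in instances:
    print(resource, sorted(features))
```

For the example sentence, the instance for "sheep" carries the left-context words "give" and "you", and the instance for "clay" carries the left-context word "for", which matches the cues described above for the Givable and Receivable labels.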

Figure 4. Detailed proportion of dialogue act types in manual labels.

Figure 5. Detailed proportion of dialogue act types in automatic labels.

V. STATISTICAL TRADING AGENTS

We compare the performance of the following statistical classifiers with the aim of finding the best predictor of human-like trading negotiations: (i) Bayesian Networks, (ii) Conditional Random Fields, and (iii) Random Forests.

A. Learning to Trade with Bayesian Nets

Our Bayesian agent is defined by

$$P(x) = \prod_{i=1}^{n} P(x_i \mid pa(x_i)),$$

where $x = \{x_1, \ldots, x_n\}$ is a set of random variables describing the context of the game, $pa(\cdot)$ denotes the set of parent random variables, and every variable is associated with a conditional probability distribution $P(x_i \mid pa(x_i))$. Two main tasks are involved in the creation of our Bayes net. First, parameter learning involves the estimation of conditional probability distributions (discrete in our case) from the training data $D$ based on maximum likelihood estimation with smoothing. Second, structure learning involves inducing the dependencies of random variables based on the K2 algorithm; see [30] for details. Once the Bayes net has been trained, we use the junction tree algorithm [31] for probabilistic inference of trades. The most probable human-like trade is selected according to

$$y^* = \arg\max_{y \in Y} P(y \mid e(y)),$$

where the contextual information of givable $y$ is defined by $e(y) = \{f_1 = val_1, \ldots, f_n = val_n\}$ with features $f_i$.

B. Learning to Trade with CRFs

This agent treats trading as a sequence labelling task, in which a sequence of game environment inputs is labelled with appropriate givable resources to support trades. The task is therefore to find a mapping between (observed) features, including available resources, build-ups, and receivables, and a (hidden) sequence of givables. We use the linear-chain Conditional Random Field (CRF) model for predicting human-like trades in the game of Settlers of Catan.
This model defines the posterior probability distribution of labels (givables in our case) $y = \{y_1, \ldots, y_{|y|}\}$ given features $x = \{x_1, \ldots, x_{|x|}\}$ as

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \theta_k \phi_k(y_t, y_{t-1}, x_t) \right\},$$

where $Z(x)$ is a normalisation factor over all available vectors of contextual information $x$ such that the sum over all labellings is one. The parameters $\theta_k$ are weights associated with feature functions $\phi_k(\cdot)$, which are real values describing the label state $y_t$ at time $t$ based on the previous label state $y_{t-1}$ and features $x_t$. The parameters $\theta_k$ are set to maximise the conditional likelihood of sequences of givables in the training data set. They are estimated using the gradient descent algorithm. After training, labels can be predicted for new sequences of observations. The most likely trading offer is expressed as

$$y^* = \arg\max_{y} P(y \mid x),$$

which is computed using the Viterbi and A* search algorithms; see [32] for further details.

C. Learning to Trade with Random Forests

This agent is trained using an ensemble of trees, which are used to vote for the class prediction at test time [33], [34]. A random forest is an ensemble learning method that constructs a set of random decision trees at training time, and uses them to generate the most popular class. We compute the probability distribution of a human-like trade as

$$P(givable \mid evidence) = \frac{1}{Z} \prod_{b \in B} P_b(givable \mid evidence),$$

where givable refers to the class prediction, evidence refers to the observed features 1-13, $P_b(\cdot \mid \cdot)$ is the posterior distribution of the $b$-th tree, and $Z$ is a normalisation constant; see [35] for further details. In our experiments below, we fixed the number of decision trees to 100. Assuming that $Y$ is the set of givables at a particular point in time in the game, the most human-like trading offer (givable $y^*$) given the collected evidence (context of the game) is defined as

$$y^* = \arg\max_{y \in Y} P(y \mid evidence).$$

VI. EXPERIMENTS AND RESULTS

Our evaluation metrics for assessing the predictive power of human-like trading include classification accuracy and precision-recall. These metrics are part of our offline evaluation, which reports performance on held-out data.
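As a concrete reading of these offline metrics, the sketch below computes accuracy, precision, recall and F-measure for predicted givables. The macro-averaging over classes, the F-measure taken from the averaged precision and recall, and the function name `evaluate` are assumptions for illustration; the paper does not state which averaging scheme was used.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, precision, recall, f_measure) for two label sequences.

    Precision and recall are macro-averaged over the classes that occur in
    either sequence (an assumed convention, not necessarily the authors').
    """
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true label but was missed
    accuracy = sum(tp.values()) / len(gold)
    per_class_precision = [
        tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in labels
    ]
    per_class_recall = [
        tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0 for c in labels
    ]
    precision = sum(per_class_precision) / len(labels)
    recall = sum(per_class_recall) / len(labels)
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return accuracy, precision, recall, f_measure


# Toy held-out fold with four trading events (made-up labels, for illustration only).
gold = ["sheep", "sheep", "clay", "wood"]
predicted = ["sheep", "clay", "clay", "wood"]
accuracy, precision, recall, f_measure = evaluate(gold, predicted)
print(accuracy, precision, recall, f_measure)
```

In a cross-validation setting, a function of this kind would be applied to each held-out fold and the results averaged across folds.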

In addition, to assess the performance of our statistical classifiers while playing the game, we consider the following game-related metrics (in terms of averages): winning rate, victory points, offers made, successful offers, and pieces built. These metrics are part of our online evaluation, and are used to assess performance while playing the game of Settlers of Catan using a benchmark framework. Each of the classifiers below (Bayes Net, Conditional Random Field, Random Forest) was trained and evaluated in the same way. The only difference between models was the data source, i.e. manual labels or automatic labels.

A. Offline Evaluation

Table III shows the classification results of our statistical classifiers using the features listed in Table I, trained as described in Sections III, IV and V. Our evaluation used 10-fold cross-validation, i.e. average results over 10 rounds of 9 folds for training and 1 fold for validation. These folds mean that while our automatic labeller was trained on 90% of the manually labelled data, the remaining 10% was used for validation. The classification accuracy of the automatic labeller was 80.23% according to the cross-validation. For the evaluation in the next section, we choose the automatic labeller with the highest classification accuracy.

Our observations from Table III can be described as follows. First, all our statistical classifiers substantially outperform a majority baseline. Second, predicting human trading negotiations is a difficult task: our best classifier, the Random Forest, achieves a classification accuracy of 65.7% when training on manual labels, and 84.8% when training on automatic labels. Third, all classifiers trained on automatic labels perform better than their counterparts trained on manual labels. In other words, automatic labels help to predict human trading behaviour better than manual labels.
This result suggests that automatic labels are useful for data sets that are difficult to annotate, like ours, which reported a moderate annotator agreement in the manually labelled data [28]. Although this conclusion requires confirmation on other data sets, the next section reports an additional evaluation to confirm the good performance of the trained classifiers.

B. Online Evaluation

We also evaluated the statistical classifiers described in Sections III, IV and V by integrating them into the JSettlers benchmark framework [8] illustrated in Figure 1, where we use random and rule-based baseline negotiators (see footnote 4) as the opponents. It has to be noted that our evaluations here played strategic games at the semantic level, i.e. using dialogue acts such as those shown in column 3 of Table II. In addition, our trained agents were active only during the ranking of trading offers; the functionality of the rest of the game was based on the JSettlers framework [8]. We refer to this evaluation as online because the agents were used in the actual game to rank realistic trading negotiations. This means that all games were run using four automated agents: one statistical vs. three rule-based. We evaluate each classifier with 10,000 games in order to obtain significant comparisons despite the randomness exhibited in the game. Such a number of games has been shown to produce meaningful comparisons [36].

Footnote 4: The baseline trading agent referred to as rule-based included the following parameters in all agents (see [36] for further details): TRY N BEST BUILD PLANS:0, FAVOUR DEV CARDS:-5.

Table IV shows the results of our online evaluation, which we describe as follows. First, note that random behaviour is substantially worse than rule-based behaviour, and that more (successful) offers do not contribute to more winning. Second, it can be noted that the rule-based agents obtain a winning rate of 25% because four players of the same kind play against each other.
Third, it can also be noted that only some of our agents using the trained classifiers outperform the rule-based agents, resulting in more winning, more victory points, and more pieces built, but not necessarily more offers. Fourth, taking into account the classification results in the previous section, it can be inferred that higher classification accuracy on data from average human players does not imply better winning rates in the case of Conditional Random Fields and Bayes nets; it does so only in the Random Forest case. Similar effects with data from expert human traders remain to be investigated. Fifth, we can observe that the best results are obtained by the Random Forest trained on automatic labels $D_a$. It won 1% more games than the Random Forest trained on manual labels $D_m$. This difference was significant at p < 0.05 according to a two-tailed Wilcoxon Signed-Rank Test. This result suggests that automatic labels are more useful than manual labels for training trading negotiation agents, at least in the case of manual labels with a moderate annotator agreement. Manual labels with higher and even lower annotator agreement remain to be investigated.

Classifier   Accuracy   Precision   Recall   F-Measure
Majority Baseline
Conditional Random Field (man)
Bayesian Network (man)
Random Forest (man)
Conditional Random Field (auto)
Bayesian Network (auto)
Random Forest (auto)

Table III. OFFLINE EVALUATION: CLASSIFICATION ACCURACY AND PRECISION-RECALL RESULTS OF HUMAN TRADING NEGOTIATIONS IN SETTLERS OF CATAN. NOTATION: man = TRAINING ON MANUALLY LABELLED DATA $D_m$, AND auto = TRAINING ON AUTOMATICALLY LABELLED DATA $D_a$.

Comparison Between Trained Statistical Trader vs Opponent   Winning Rate (%)   Victory Points   Offers Made   Successful Offers   Pieces Built
Random (from legal offers) vs Rule-based
Rule-based vs Rule-based
Conditional Random Field (man) vs Rule-based
Bayesian Network (man) vs Rule-based
Random Forest (man) vs Rule-based
Conditional Random Field (auto) vs Rule-based
Bayesian Network (auto) vs Rule-based
Random Forest (auto) vs Rule-based

Table IV. ONLINE EVALUATION: GAME RESULTS COMPARING A STATISTICAL CLASSIFIER VS. THREE RULE-BASED TRADERS, I.E. FOUR PLAYERS IN TOTAL IN EACH GAME. EACH LINE SHOWS AVERAGE RESULTS OVER 10,000 TEST GAMES. NOTATION: man = TRAINING ON MANUALLY LABELLED DATA $D_m$, AND auto = TRAINING ON AUTOMATICALLY LABELLED DATA $D_a$.

VII. CONCLUSIONS AND FUTURE DIRECTIONS

The contribution of this paper is a learning approach for trading in strategic conversation, including an evaluation of statistical trading agents trained from manually and automatically labelled data. We have trained three statistical agents from manually and automatically labelled data, and then applied statistical inference for computing probabilistic scores for each trading negotiation. The obtained scores were used to rank the available trading negotiations, where the top choice (i.e. the most human-like) was used in the game. In an offline evaluation, the best classification result was obtained by a random forest classifier using automatic labels. In an online evaluation, the best agent (random forest) using automatic labels achieved a winning rate that was 1% better than its counterpart using manual labels with moderate annotator agreement. This result suggests that automatically labelled data should be considered for training statistical classifiers, especially if the initially labelled data does not have high inter-annotator agreement. This result is encouraging for training statistical agents from human examples that are difficult to annotate, in order to incorporate trainable behaviour in strategic conversational agents.
Future research avenues include: training trading agents that take into account richer contextual information, such as features from other players, and training them to play multiple games; training with other forms of machine learning, as commented in Section II; training agents not just from average players but from expert human traders in multiple domains; and evaluating trained agents against human players.

ACKNOWLEDGMENTS

Funding from the European Research Council (ERC) project STAC: Strategic Conversation is gratefully acknowledged. We would also like to thank the following members of the STAC project for helpful discussions: Markus Guhe, Eric Kow, Mihai Dobre, Ioannis Efstathiou, Wenshuo Tang, Verena Rieser, Alex Lascarides, and Nicholas Asher.

REFERENCES

[1] N. Asher and A. Lascarides, "Strategic conversation," Semantics and Pragmatics, vol. 6, no. 2, pp. 1-62, August
[2] M. McFarlin, "10 great board games for traders," Futures Magazine, Oct. 2013, great-board-games-for-traders. [Online]. Available: 10-great-board-games-for-traders
[3] I. Szita, G. Chaslot, and P. Spronck, "Monte-Carlo Tree Search in Settlers of Catan," in Proceedings of the 12th International Conference on Advances in Computer Games, ser. ACG 09. Berlin, Heidelberg: Springer-Verlag, 2010.
[4] M. Pfeiffer, "Reinforcement learning of strategies for Settlers of Catan," in International Conference on Computer Games: Artificial Intelligence, Design and Education.
[5] K. Georgila and D. Traum, "Reinforcement learning of argumentation dialogue policies in negotiation," in Proc. of INTERSPEECH.
[6] I. Efstathiou and O. Lemon, "Learning non-cooperative dialogue behaviours," in SIGDIAL.
[7] M. S. Dobre and A. Lascarides, "Online learning and mining human play in complex games," in IEEE Conference on Computational Intelligence and Games, CIG.
[8] R. Thomas and K. J. Hammond, "Java settlers: a research environment for studying multi-agent negotiation," in Intelligent User Interfaces (IUI), 2002.
[9] H. Cuayáhuitl, S. Keizer, and O. Lemon, "Learning to trade in strategic board games," in IJCAI Workshop on Computer Games (IJCAI-CGW).
[10] G. Tesauro, "Temporal difference learning and TD-gammon," Commun. ACM, vol. 38, no. 3.
[11] I. Efstathiou and O. Lemon, "Learning to manage risk in non-cooperative dialogues," in Proc. SEMDIAL.
[12] S. Keizer, H. Cuayáhuitl, and O. Lemon, "Learning Trade Negotiation Policies in Strategic Conversation," in Workshop on the Semantics and Pragmatics of Dialogue (goDIAL).
[13] D. Marcu and A. Echihabi, "An unsupervised approach to recognizing discourse relations," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
[14] C. Sporleder and A. Lascarides, "Using automatically labelled examples to classify rhetorical relations: an assessment," Natural Language Engineering, vol. 14, no. 3.
[15] J. Fürnkranz, "Machine learning in games: A survey," in Machines that Learn to Play Games, Chapter 2. Nova Science Publishers, 2000.
[16] T. P. Runarsson and S. M. Lucas, "Preference learning for move prediction and evaluation function approximation in Othello," IEEE Trans. Comput. Intellig. and AI in Games, vol. 6, no. 3.
[17] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, "Move Evaluation in Go Using Deep Convolutional Neural Networks," CoRR, vol. abs/
[18] N. Asher and A. Lascarides, "Commitments, beliefs and intentions in dialogue," in Proc. of SemDial, 2008.
[19] J. Shim and R. Arkin, "A Taxonomy of Robot Deception and its Benefits in HRI," in Proc. IEEE Systems, Man, and Cybernetics Conference.
[20] D. Traum, "Extended abstract: Computational models of non-cooperative dialogue," in Proc. of SIGdial Workshop on Discourse and Dialogue.
[21] H. Cuayáhuitl, M. van Otterlo, N. Dethlefs, and L. Frommberger, "Machine learning for interactive systems and robots: A brief introduction," in Proceedings of the 2nd Workshop on Machine Learning for Interactive Systems: Bridging the Gap Between Perception, Action and Communication, ser. MLIS 13. New York, NY, USA: ACM, 2013.
[22] O. Pietquin and M. Lopez, "Machine learning for interactive systems: Challenges and future trends," in Proceedings of the Workshop Affect, Compagnon Artificiel (WACAI).
[23] A. Cadilhac, N. Asher, F. Benamara, and A. Lascarides, "Grounding strategic conversation: Using negotiation dialogues to predict trades in a win-lose game," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
[24] H. Cuayáhuitl, N. Dethlefs, H. W. Hastie, and O. Lemon, "Barge-in effects in Bayesian dialogue act recognition and simulation," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013.
[25] O. Lemon, "Adaptive natural language generation in dialogue using Reinforcement Learning," in Proc. of the 12th SEMdial Workshop on the Semantics and Pragmatics of Dialogues, London, UK, June
[26] N. Dethlefs and H. Cuayáhuitl, "Hierarchical reinforcement learning for situated natural language generation," Natural Language Engineering, vol. 21.
[27] N. Dethlefs, H. W. Hastie, H. Cuayáhuitl, and O. Lemon, "Conditional random fields for responsive surface realisation using global features," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, 2013.
[28] S. Afantenos, N. Asher, F. Benamara, A. Cadilhac, C. Dégremont, P. Denis, M. Guhe, S. Keizer, A. Lascarides, O. Lemon, P. Muller, S. Paul, V. Rieser, and L. Vieu, "Developing a corpus of strategic conversation in The Settlers of Catan," in Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), Paris, France, 2012, hal
[29] J. Carletta, "Assessing Agreement on Classification Tasks: The Kappa Statistic," Computational Linguistics, vol. 22, no. 2.
[30] G. Cooper and E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data," Machine Learning, vol. 9, no. 4.
[31] F. G. Cozman, "Generalizing variable elimination in Bayesian networks," in Workshop on Probabilistic Reasoning in Artificial Intelligence, 2000.
[32] T. Kudo, "CRF++: Yet another CRF toolkit," Software available at crfpp.sourceforge.net.
[33] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32.
[34] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: data mining, inference and prediction, 2nd ed. Springer.
[35] A. Criminisi, J. Shotton, and E. Konukoglu, "Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning," Foundations and Trends in Computer Graphics and Vision, vol. 7, no. 2-3.
[36] M. Guhe and A. Lascarides, "Game strategies for The Settlers of Catan," in 2014 IEEE Conference on Computational Intelligence and Games, CIG 2014, Dortmund, Germany, August 26-29, 2014, pp. 1-8.