Solving the Game of Awari using Parallel Retrograde Analysis

Solving the Game of Awari using Parallel Retrograde Analysis John W. Romein and Henri E. Bal Vrije Universiteit, Faulty of Sienes, Department of Mathematis and Computer Siene, Amsterdam, The Netherlands john,bal @s.vu.nl Abstrat We have solved the game of awari, an anient Afrian board game that is played worldwide now. The game is a draw when both players play optimally. To solve awari, we omputed several databases that an be used jointly to selet the best move from any position that an our in a game. The largest database ontains 04 billion entries (78 gigabyte), and is muh larger than the largest (endgame) database for any game omputed so far. In total, we determined the results for 889 billion positions. We solved the game on a large omputer luster, using a new parallel searh algorithm that optimally uses the available resoures (proessors, memories, disks, and network). In this paper, we show how we solved awari and how we verified the results. We also present new insights into the game of awari. Keywords: awari, distributed game databases, distributed searh, retrograde analysis, superomputing Introdution With the progress of the state-of-the-art in proessor and network tehnology, parallel omputers an aept omputational hallenges that were inoneivable before. One suh hallenge is to solve the game of awari, a 3500-yearold board game that originates from Afria, whih is played all over the world now. Being the best-known variant of the lass of manala games (games played on a board with pits and stones that are sown around), awari has drawn muh attention from omputer siene and artifiial intelligene researhers [6], who have been studying tehniques to play the game for more than a deade [3]. We have solved awari on a modern, parallel omputer, using 44 proessors, a total of 7 GB main memory, and a fast interonnet (Myrinet). To fully solve the game, we had to determine the sore of 889.063.398.406 positions. The sore of a position exatly foretells the result at the end of the game when both players play optimally, and an be used to selet the best move. The sores are stored in a olletion of databases, one for eah number of stones. The largest database, whih ontains sores for eah reahable position with all 48 stones, has 04 billion entries and oupies 78 GB. This amply exeeds the largest database for any game omputed so far, both in number of positions and in storage size. For several reasons, it is not possible to ut this database into smaller, independent piees. Also, other attempts to solve the game were unsuessful (see Setion related work ). No single omputer has enough proessing power and memory to ompute the databases, but even for parallel omputers the problem is extremely hallenging. To reate the databases, we developed a new parallel algorithm that is based on retrograde analysis (RA). RA starts from all terminal positions in the game, of whih the outomes are trivially known, and reasons bakwards to the starting position of the game. The algorithm uses the main memories eonomially for frequently and randomly aessed data, and keeps terabytes of less frequently aessed intermediate results on disks. All proessors repeatedly inform eah other about positions for whih RA has (partially) determined the outome. This results in a huge amount of interproessor ommuniation, over a petabit (0 5 bits) in total, and terabytes of disk I/O. The rux of our algorithm is to make all this ommuniation and disk I/O asynhronous, so proessors do not have to wait and an ontinue doing useful work. This balaned and onurrent use of resoures (proessors, memory, disks, and network) leads to a highly effiient algorithm: no single resoure is an evident bottlenek. Solving the game took 5 hours, a surprisingly short time thanks to the many optimizations we applied. We put muh effort into verifiation of the databases. In an environment where we exeute quadrillions of instrutions, store terabytes of data, and ommuniate a petabit, we had to deal with possible hardware, operating system, and appliation faults. Writing a orret and effiient parallel appliation is diffiult, beause the desire to be as effiient as possible leads to many opportunities for rae onditions and other mistakes. Below, we explain the rules for awari and disuss previous work. Then, we explain the algorithm we use. Next, we

North f e d b a North f e d b a North f e d b a 5 5 5 3 5 9 6 9 3 5 4 A B C D E F South A B C D E F South A B C D E F South (a) Position to move from. move. South is to (b) After move from. Now North is to move. () Alternative move from. Stones from and are aptured. Figure. Example awari move. The numbers in the irles indiate the number of stones in the pits. elaborate on new insights into the game of awari, and we disuss verifiation measures. The awari rules Awari is played on a board where eah player owns 6 pits: ß for the player alled south, and ß for the player alled north (see Figure (a)). Eah player also owns an auxiliary pit on the right-hand side of the board, whih is used to keep aptured stones. In the initial position, all pits (exept the apture pits) are filled with four stones, thus the initial position has 48 stones. All stones have the same olor. The player to move hooses one of his own nonempty pits and removes all stones from the pit. The player then sows the stones ounterlokwise over the other pits, one stone into eah next pit, and skips the original pit if there are more than stones to sow. This is illustrated by Figure (b), where the three stones from are sown into,, and. After the sowing phase, stones an be aptured. If the last stone is sown into an enemy pit that ontains or 3 stones (after sowing), the stones are aptured. In this ase, if the seond last pit is also an enemy pit that ontains or 3 stones, they are aptured as well, and this proess is repeated lokwise. Captured stones are moved into the player s apture pit and remain there during the rest of the game. The player who aptures most stones wins the game. Figure () shows the result after making apture move from Figure (a). The stones are first sown into and, and subsequently aptured with the other stones from and. When a player annot move (i.e., all own pits are empty), the remaining stones are aptured by the opponent, ending the game. However, to avoid suh a situation early in the game, it is not allowed to do a move that leaves the opponent without ountermove, unless all moves eradiate the opponent. In Figure (a), this prohibits a move from, sine it eradiates the opponent and noneradiating moves are available. (The best move is from : south an win 7 ). Finally, a repeated position results in an even division of the remaining stones, oneptually splitting the last stone into two halves if there is an odd number of stones remaining. Unfortunately, the rules slightly differ from one ountry to another. For example, apturing all opponent s piees in a single move is sometimes allowed. We use the rules that are used by all omputer sientists sine the first publiation on omputer-aided play of awari [3]. Related work Despite the interest for games that omputer sientists have shown already sine the early days of omputing history, games are solved at rare oasions. Disregarding the solving of kalah (a muh simpler awari variant) in 000, it has been a long time sine the previous widely-known games were solved (nine-men s-morris in 993 [5], gomoku in 99 [], and onnet-4 in 988 []). Some games were solved analytially, others by brute fore, or by a ombination of both. A reent artile by Van den Herik et al. presents an overview of solved games [6]. So far, researh on awari has foused on endgame databases, sine they an play a deisive role in a game. In 995, Bal and Allis used a parallel retrograde analysis program to onstrut databases of up to stones on a distributed-memory mahine [4]. Five years later, van der Goot used a different parallel retrograde analysis variant to build the 40-stone database, on a large shared-memory mahine [3]. Linke and Marzetta published a sequential, memorylimited algorithm to ompute awari endgame databases [9]. As of June 00, a program implementing the algorithm has already run for many months, trying to ompute the 40- stone database. The algorithm requires two bits per position on disk (aessed sequentially), one bit per entry in main memory (aessed randomly), and some additional disk spae to store intermediate sores and apture information. With the 40 resp. 39-stone databases available, the latter two groups (independently) tried to solve awari by performing a large forward searh from the initial position down to positions that hit the onstruted endgame databases, a tehnique that has been used suessfully for games like kalah, nine-men s-morris (see below), and the sliding-tile puzzle. However, this failed for awari; the forward searh tree was too large and did not hit the databases fast enough, sine optimal play tends to postpone aptures quite long.

Databases have also been onstruted for other games. By far the most impressive ones are the hekers endgame databases built by the games group of the University of Alberta [7]. By June 00, the databases ontain all positions with up to 8 piees and some of the 9 and 0-piee positions, for a total of 3.3 trillion states. This is more than the total number of states for all awari databases, but for several reasons, the awari databases are harder to reate. First, the limiting fator is the size of the hardest subproblem (database) that must be solved at one. Awari annot be split into as many subproblems as hekers an be, and onsequently, the largest awari database ontains.5 times as many entries as the largest hekers database urrently solved. Seond, the awari databases ontain sores between -48 and 48, while the hekers databases only distinguish between win, draw, and loss, and thus oupy far less spae. Also, a hekers-speifi ompression tehnique redues the storage requirements of the reated databases to only 40 GB. The databases play an important role in Chinook, the undisputed world s strongest hekers-playing program. Gasser solved nine-men s-morris [5], also using a ombined forward and bakward searh. The bakward searh resulted in databases that ontained 7.67 billion states in total. Endgame databases for hess have also been built, for subgames with at most 6 piees on the board []. Although they have revealed remarkable new insights, they are more of theoretial interest than of pratial purpose, sine a hess game is usually deided well before the stage of the 6-piee endgame is entered. Retrograde analysis A position ontains aptured and remaining (nonaptured) stones. Although aptured stones do ontribute to the final outome of a position, the best move from a position does not depend on the aptured stones. We therefore only onsider the distribution of the remaining stones. This signifiantly redues the number of states to be searhed and stored, sine positions that only differ in aptured stones are now onsidered equal. The sore of a position depends on how the remaining stones will be divided eventually, after optimal play from both players. The sore is the differene in stones after suh a division: positive if the player to move will apture most of them, and negative if the other player will apture most. For example, the sore of the position in Figure (a) is +, beause optimal play on both sides divide the remaining stones 8 6 to south s advantage (sine south has already aptured four stones more than north, it will win +6). All reahable positions (where the apture pits are omitted) form a direted, yli graph alled the state spae. Figure shows a small part of the state spae for awari. The nodes represent positions, and the ars represent moves. The top node in the figure orresponds to the position in Figure (a). Eah ar is labeled with the pit name from whih the stones are sown, and possibly with the number South to move North to move South to move 0 a b 0 North to move 5 3 Figure. The searh of a simple state spae. of stones aptured during the move (in the figure, move B Figure (b) 6 3 F x 5 B 3 Figure (a) 3 D Figure () from the top node aptures 5 stones). The root of the state spae is the initial position and the leaves are terminal positions, where no moves are available (neither is shown in the figure). The numbers in the nodes are the sores. RA determines the sores bottom-up. The sores of the leaves follow from the rules of awari and equal the negated number of stones on the board, sine the remaining stones are taken by the opponent. For the interior nodes, the sore equals the maximum of the number of stones aptured minus the sore of the hild: Sore pµ max ¾Children pµ CapturedStones p µ Sore µµ The sores of the hildren are negated to reflet the opposed interests of the players. From the top node, move is better than, sine it results in a sore of (5 aptured stones minus hild sore 3) rather than 0. Eah best move is shown by a thik line. The figure shows a tree, but in fat, the state spae is a graph, sine nodes an (and usually do) have multiple parents. Moreover, the graph is yli sine there are move sequenes that yield repeated positions. Nodes in yles without outlet have sores of zero, sine they will be drawn by repetition. Our goal is to ompute the sores of all nodes, and in partiular that of the root node, to determine the gametheoretial sore of the entire game, using RA as bottomup searh tehnique. At eah bakward step, RA omputes the reverse moves to disover all parents of a node. When the sore of a node is established, the sore is reported to its parents. When a parent has reeived the sores from all its hildren, its sore is established. This proess is repeated reursively, until the sore of the root is determined. Various refinements for RA exist, often trading memory for omputation effort. 3

Awari databases Searhing the state spae at one go is impossible, due to its size. We therefore split the state spae into subgraphs that an be searhed one after another. The state spae is split with respet to the number of nonaptured stones in a position. For eah subgraph, we onstrut a database that ontains its sores. Eah n-stone awari position maps uniquely to a position in the n-stone database. Sine aptures annot be undone, the databases an be omputed one after another, starting with the 0-stone database and ending with the 48-stone database, skipping 47-stone positions sine they annot our in a game. n entries n entries n entries 0 5 4,368 43 73,7,65,3 6,375 44 90,57,649,586 78... 45,548,476,480 3 364 4 47,06,945,644 46 36,883,49,790 4,365 4 58,806,69,443 48 03,648,05,936 Table. Number of entries for some of the n-stone databases. We only store positions where south is to move; if north is to move, the position an be looked up in the database by rotating the board. We also omit the positions where all six pits of the nonmoving player are oupied. Those positions are unreahable (exept for the starting position), sine the nonmoving player always left at least one of his own pits empty during the previous move. Although hard to implement, omitting unreahable positions results in redutions in omputation time and database sizes of up to n n 5 7%. Eah n-stone database ontains positions (the number of possible stone permutations minus the number of unreahable permutations). Table lists the number of entries for some of the databases. The databases are indexed by Gödel numbers of the positions [3], whih enumerate all positions in a database. We modified the Gödel numbers to take unreahable positions into aount. Funtions exist to ompute the index number from a board and vie versa. Eah database entry stores a sore between -48 and +48 and oupies 7 bits. The larger databases do not fit in main memory: the 48-stone database requires 78 GB. Moreover, we need additional main memory during omputation of the database to store intermediate results. Unfortunately, it is not possible to split the databases into smaller piees (e.g., with respet to the number of stones in a partiular pit), beause interdependenies within a database (yles in the orresponding subgraph, that mutually influene the entries values) ompel that eah database is omputed at one go. Another ompliation is that the indexing sheme often maps parent and hild nodes in the state spae quite far from eah other in the database. Sine parents and hildren often interat, the databases are aessed randomly during onstrution. Storing the databases on disk and aessing the entries randomly is prohibitively slow. Sine we annot redue the number of entries in a database, we redue the number of bits per entry stored in main memory, as explained in the next setion. Memory-limited parallel retrograde analysis Several algorithms for retrograde analysis exist, eah with its own memory requirements and effiieny harateristis. Memory-limited algorithms trade memory requirements for inreased omputational effort, by storing as little as one bit per database entry in main memory, and the remainder on disk [8, 9]. These algorithms avoid random disk aesses and aess disk data only linearly, allowing transfers of large amounts of onseutive data. In ontrast, data in main memory may be aessed randomly. The hoie of the algorithm depends on the availability of main memory. For databases of up to 4 stones, we use the algorithm desribed by Bal and Allis [4], whih is omputationally the most effiient, but also the most memory-intensive algorithm, requiring 0 bits per entry in main memory. For the larger databases, we use a new parallel, ost-effiient algorithm that is almost three times slower, but requires five times less memory: eah entry requires only bits in main memory. The details of the algorithm are desribed in the sidebar (the ontents of the sidebar are appended at the end of this artile). The algorithm is parallelized by partitioning the databases over the mahines. To improve the load balane, the entries are interleaved over the mahines rather than partitioned in onseutive hunks (e.g., with 44 proessors, CPU stores entries, 46, 90,... ) [4]. The algorithm is diffiult to parallelize effiiently, beause the hildren and parents of a node are usually all loated at different proessors. Performing a synhronous remote lookup to hek the sore of a hild or parent is prohibitively expensive, beause the proessor has to wait for a reply, ausing it to idle most of the time. To avoid high ommuniation and synhronization overheads, we parallelized the searh in suh a way that ommuniation is highly asynhronous (nonbloking). Rather than shipping data, we migrate work to other proessors. Instead of performing remote lookups, a remote proessor must ontinue the work on a partiular node. As soon as a omputation requires data from a remote proessor, the work itself is transferred to and resumed on the remote proessor. Eah proessor uses a message reeive queue that ontains work to be proessed. As long as there is work in the reeive queue, the proessor aepts and performs the work. After the searh, a global termination detetion algorithm is used to hek that all work queues are empty and to ensure that no messages are in transit. This work-migrating sheme has important advantages. First, it keeps the proessors busy, sine they do not have to wait for (synhronous) reply messages. Seond, it allows ombining multiple piees of work into a single, large 4

network paket; this signifiantly redues the ommuniation overhead. Third, our algorithm is insensitive to network lateny, and fourth, we use the proessors and network devies onurrently by overlapping ommuniation and omputation. In top-down searh, we experiened the same benefits []; the sheme is appliable to any parallel graph searh algorithm. Sine we an only store bits per entry in main memory, the remaining state is kept on the loal disks. We only aess the files sequentially to avoid slow disk head seeks. To lower the overhead of disk I/O, we expliitly hint the operating system whih parts of the disk databases should be mapped into main memory. We prevent bloking on disk reads by aggressively prefething data from disk, so that the data are available when they are atually needed. After the data are used and possibly modified, we hint the operating system to lazily write bak dirty data and free up memory resoures, sine these data will not be used for a while. By properly integrating asynhronous disk I/O with the asynhronous parallel searh algorithm, all resoures are used onurrently, for optimal performane. The sheme is balaned in suh a way that none of the resoures is an apparent bottlenek. Performane results We omputed the databases on a luster with 7 dualproessor mahines, built by IBM. Eah mahine ontains two.00 GHz Pentium III proessors, GB main memory, and a 0 GB IDE disk. The system is onneted through Myrinet, a swithed network with Gb/s bidiretional links. The appliation is implemented in C and uses Myriom s GM ommuniation library. Constrution of the 48-stone database required 5. hours. It took 5 hours in total to ompute all databases. The appliation ommuniates heavily. During the proessing phase (where 50% of the time is spent), eah SMP node sent and reeived 0 to 30 MB/s, thus the Myrinet swith routed.4 to. GB/s, exluding protool overhead. In total, 30 TB (.0 petabit!) of data were ommuniated. The amount of disk I/O is estimated to be in the order of ten terabytes. New awari insights The newly onstruted databases revealed that awari is a draw, so being able to begin the game is neither an advantage, nor a disadvantage. Remarkably, the only good opening move is from the rightmost pit ( ); the other opening moves are losing. Figure 3 shows how the sores of the positions are distributed. The shape of the histogram reveals that most positions are entered around the bar sore = +. For random positions, the average sore is +.9, indiating an advantage for the player to move. 60.0% of the positions are winning, and.3 % are drawn. The different gray shades distinguish the number of stones on the boards. The alternating sizes of a partiular number of stones indiate an odd/even effet in whih positions with an even number of stones often have an even sore and vie versa. We also observed that apture moves are not always the best moves: for % of the positions where one has the hoie between apture and nonapture moves, it is better to do a nonapture move. The databases were used to replay and analyze the games at the Computer Olympiad 000, where the two strongest awari-playing programs were ompeting in a tournament [0]. Although it was believed that the programs played lose to perfetly, even in the final and deisive game, the eventual winner made 3 wrong moves before starting perfet endplay! In six out of eight games, both programs made mistakes while in their opening books; errors were made as early as at south s third move. Our databases thus signifiantly overtake the playing strength of the urrent world hampion. Sine we store sores rather than best moves in our databases, it is not always diretly lear whih move to take if there are multiple moves available that have the same best sore. There is a ompliation: if the sore is positive, one should avoid a repeated position; otherwise the remaining piees are divided and the indiated positive sore is not obtained. Oasionally this requires searhing the databases, but it is likely that this an always be done quikly [8]. Assuming this is true, we have strongly solved the game of awari (aording to the lassifiation of Allis []), meaning that for all positions that an our in a game, a strategy is known to obtain the game-theoretial sore of the position, under reasonable resoure onstraints. Our databases are aessible through a web server ( ØØÔ»» Û Ö º ºÚÙºÒÐ»). A Java applet with a graphial interfae an be used to look up positions, showing the sores for eah possible move. The applet an also be used to play a game, at an adjustable playing level. Database verifiation Hardware and software bugs ould silently orrupt the databases during onstrution. The following preautions minimize the hane that suh errors our. The hardware has some failities to detet and orret faults. The main memories, proessor ahes, and the memories on the Myrinet ards use error orreting odes (ECC). Moreover, the GM network software handles CRC errors for pakets that are orrupted on the fiber ables. The operating system detets and handles (nonfatal) CRC errors that our during data transport from and to the disks. We put muh effort in verifiation of the reated databases, sine writing a orret parallel program is a diffiult task, espeially when global termination detetion is involved. We therefore took the following measures: We made sure that the different retrograde searh implementations (those using resp. 0 bits per entry) reated exatly the same databases for up to 4 stones. 5

number of positions (billions) 00 80 60 40 0 48 stones 46 stones 45 stones 44 stones 43 stones up to 4 stones <-0-6 - -8-4 0 4 8 6 >0 sore Figure 3. Distribution of the positions sores. We wrote a separate program that heks whether the parent and hildren values in the databases are onsistent. We reomputed all databases on a different number of proessors (64 dual SMP nodes) and did a bitwise omparison with those generated by the 7 SMP nodes, whih showed that all entries were exatly the same. Smaller databases were reomputed over and over again, also using different numbers of proessors. This made us onfident that the programs are free from rae onditions, but it does not protet us against (sequential) bugs in the indexing funtions and the (un)move generators that generate the hildren and parents of a node. The indexing funtions are omplex due to the omission of unreahable positions, and generating parents is intriate due to awari s omplex apturing rules. We therefore also ompared our results to those by Linke and Marzetta (of up to 36 stones) : the results were the same. Conlusions and future milestones We solved the game of awari, and determined that the game is a draw when both players play optimally. We determined the sores for 889,063,398,406 positions, enumerating all states that an possibly our in a game. Due to the size of the 48-stone database and the random way the databases are aessed during onstrution, this ould not have been ahieved without a large omputer system with a well-balaned algorithm. Constrution of the databases took 5 hours on a 44-proessor GHz Pentium III luster, equipped with 7 GB main memory and a Gb/s network. To make the results omparable, we onstruted separate databases that inluded unreahable positions. We used a retrograde analysis algorithm that requires two bits per entry in main memory; the remaining data were kept on disk. The parallel algorithm is highly asynhronous, resulting in an appliation that profits from low ommuniation overhead, no idle proessors, network lateny hiding, and overlap in omputation, ommuniation, and disk I/O. We also implemented many optimizations to derease the exeution times. We put muh effort into verifiation of the databases, to make sure that they are orret. We ompared (partial) results from two algorithms, different numbers of proessors, and results obtained by other researhers, and verified everything using a separate program that heked the database onsisteny. After this milestone, the question arises whih game will be solved next, and when (see an artile by van den Herik et al. for a disussion [6]). We expet that hekers will be the first game solved, possibly within the urrent deade. Although its state-spae omplexity is six orders of magnitude larger than awari s (estimated at 0 8 []), a ombined top-down and bottom-up searh is likely to work here (beause the game ontains many fored aptures), and renders it unneessary to searh the entire state spae. Othello will be harder to solve, and hess will definitely not be solved in the foreseeable future, at least not using brute fore alone. Even though databases for these games require fewer bits per state and an be split into more disjoint subproblems than awari, their state-spae omplexities are too large (estimated at 0 8 and 0 46, respetively [6]). The awari databases, more statistis, and an (infallible) awari-playing program are aessible via ØØÔ»» Û Ö º ºÚÙºÒÐ». 6

Referenes [] J. D. Allen. A Note on the Computer Solution of Connet- Four. In D. N. L. Levy and D. F. Beal, editors, Heuristi Programming in Artifiial Intelligene : the First Computer Olympiad, pages 34 35. Ellis Horwood, Chihester, England, 989. [] L. V. Allis. Searhing for Solutions in Games and Artifiial Intelligene. PhD thesis, University of Limburg, Maastriht, the Netherlands, September 994. [3] L. V. Allis, M. van der Meulen, and H. J. van den Herik. Databases in Awari. In D. N. L. Levy and D. F. Beal, editors, Heuristi Programming in Artifiial Intelligene : the Seond Computer Olympiad, pages 73 86. Ellis Horwood, Chihester, England, 99. [4] H. E. Bal and L. V. Allis. Parallel Retrograde Analysis on a Distributed System. In Superomputing 95, San Diego, CA, Deember 995. [5] R. Gasser. Solving Nine-Man s-morris. Computational Intelligene, ():4 4, 996. [6] H. J. van den Herik, J. W. H. M. Uiterwijk, and J. van Rijswijk. Games Solved: Now and in the Future. Artifiial Intelligene, 34( ):77 3, January 00. [7] R. Lake, J. Shaeffer, and P. Lu. Solving Large Retrograde Analysis Problems Using a Network of Workstations. In H. J. van den Herik, I. S. Hershberg, and J. W. H. M. Uiterwijk, editors, Advanes in Computer Chess 7, pages 35 6. Universiteit Maastriht, the Netherlands, 994. [8] T. R. Linke. Exploring the Computational Limits of Large Exhaustive Searh Problems. PhD thesis, ETH Zurih, Swiss, June 00. [9] T. R. Linke and A. Marzetta. Large Endgame Databases with Limited Memory Spae. Journal of the International Computer Games Assoiation, 3(3):3 38, September 000. [0] J. W. Romein and H. E. Bal. Awari is Solved (note). Journal of the International Computer Games Assoiation, 5(3):6 65, September 00. [] J. W. Romein, H. E. Bal, J. Shaeffer, and A. Plaat. A Performane Analysis of Transposition-Table-Driven Sheduling in Distributed Searh. IEEE Transations on Parallel and Distributed Systems, 3(5):447 459, May 00. [] K. Thompson. Retrograde Analysis of Certain Endgames. Journal of the International Computer Chess Assoiation, 9(3):3 39, September 986. [3] R. van der Goot. Awari Endgame Databases. In Computers and Games, volume 063 of Leture Notes in Computer Siene, pages 87 95. Springer, 00. Appendix -bit parallel retrograde analysis Our parallel RA algorithm requires bits per entry in main memory, suffiiently few to ompute the 48-stone 7 ENUM State = (Win, DrawOrWin, Loss, Unknown); VAR States : ARRAY [.. NrEntries] OF State; VAR Sores : ARRAY [.. NrEntries] OF [ 48.. 48]; VAR Bound : [.. 48]; PROCEDURE DoCaptures(Index, Board) Best := 48; IF HasCapturesMoves(Board) THEN Best := BestCaptureSore(Board); IF Best >= Bound THEN States [Index] := Win; ELSE IF Best > Bound THEN States [Index] := DrawOrWin; ELSE IF NOT HasNonCaptureMoves(Board) THEN States [Index] := Loss; FOR Parent IN Parents(Board) DO States [BoardToIndex(Parent)] := Win; PROCEDURE PreProess() FOR Index IN.. NrEntries DO IF States [Index] = Unknown THEN DoCaptures(Index, IndexToBoard(Index)); PROCEDURE DoNonCaptures(Index, Board) IF States [Index] = Unknown AND AllNonCaptureMovesAreWinsForOpponent(Board) THEN States [Index] = Loss; FOR Parent IN NonCaptureParents(Board) DO ParentIndex := BoardToIndex(Parent); IF States [ParentIndex] <> Win THEN States [ParentIndex] := Win; FOR Gp IN NonCaptureParents(Parent) DO DoNonCaptures(BoardToIndex(Gp), Gp); PROCEDURE Proess() FOR Index IN.. NrEntries DO DoNonCaptures(Index, IndexToBoard(Index)); PROCEDURE PostProess() FOR Index IN.. NrEntries DO CASE States [Index] IS Win : INC(Sores [Index]); States [Index] := Unknown; DrawOrWin : ( nothing ) Loss : DEC(Sores [Index]); States [Index] := Unknown; Unknown : States [Index] := DrawOrWin; PROCEDURE RetrogradeAnalysis(NrStones) FOR Bound IN.. NrStones DO PreProess(); Proess(); PostProess(); Figure 4. Retrograde searh algorithm for bits per entry in main memory. database on our mahine. It is inspired by the sequential algorithm from Linke and Marzetta [9], but modified to take advantage of having two bits per entry in main memory instead of one, saving muh redundant work and many disk aesses. Sequentially, the algorithm roughly works as shown in Figure 4. The proedureê ØÖÓ Ö Ò ÐÝ omputes the sores in theæöëøóò -database (thus multiple invoations are needed to generate all databases). The funtion repeatedly determines for all n-stone positions whether their values are in a window (- ÓÙÒ, ÓÙÒ ). By trying multi-

iteration window positions. (-, ) ÏÄÏ ÏÄÄ.... (-, ) Ä Ï Ä... 3. (-3, 3) Ï Ä... total sore - 0 3 - -3... L W D U D 4 4 U W Figure 5. Determination of the sores for a 3-stone database. 3 3 U L ple windows, the sores an be derived indiretly. The initial window is (-, ); after eah iteration, the window is widened. A position is alled a win if its value is at least ÓÙÒ, a loss if it is at most - ÓÙÒ, and a draw otherwise. For example, if the algorithm determines that a position is a win in iteration 5, it has derived that the position s sore is at least 5. At the end of eah iteration, eah win inrements the position s sore by and eah loss derements the position s sore by. This is illustrated by Figure 5, where we horizontally list for eah position whether it is a win (Ï), draw ( ), or loss (Ä) after a partiular iteration. The sores are aumulated vertially. For example, if this value is 5, the position is ounted as a win during the first five iterations, and a draw during the remaining iterations. The algorithm in Figure 4 keeps an arrayëø Ø in inëóö main memory and an arrayëóö on disk. The former is aessed randomly and the latter sequentially. Entries ËØ Ø require two bits and are one ofï Ò, Ö ÛÇÖÏ Ò, ÄÓ, oríò ÒÓÛÒ.ËÓÖ aumulates the position s sores and will eventually ontain the final results. Eah entry requires 7 bits to store a value between -48 and 48. inëø Ø Initially, all entries inëø Ø areíò ÒÓÛÒand all values are 0. During eah iteration, three proessing phases are required to determine the win/draw/loss value of eah position. The preproessing phase deals with apture moves and lost positions. For brevity, we do not disuss the details, but the idea is that we initialize the entries from apture moves, for whih the sores are found in previously omputed databases, and from terminal positions (in awari, all terminal positions are lost positions). The proessing phase performs a san over all database entries. AnÍÒ ÒÓÛÒstate for whih all hildren areï Òs, is assigned aäó (no matter whih move is made, the opponent will win, thus the position is lost). Moreover, parents are assigned aï Ò, sine the opponent has a winning move available from those positions. Next, the grandparents undergo the same test, reursively. This is depited in Figure 6 (the numbers and fat arrows in the figure are referred to later). The position markedí Äis initiallyíò¹ ÒÓÛÒand its hildren areï Òs, thus the position itself is assigned aäó. Then, the parents that are not alreadyï Òs Atually, Ø ÔØÙÖ ËÓÖ uses a preomputed, auxiliary database that keeps values for apture moves (as suggested by Linke and Marzetta [9]) to avoid repeated aesses to the smaller databases. W W W Figure 6. Ations performed during the proessing phase. beomeï Òs now, and the grandparents reurse into the same proess. For a better understanding of the algorithm, it is important to see that allï Òs are determined through bakpropagation of lost positions. stillíò ÒÓÛÒor Ö ÛÇÖÏ Ò Also, Ö ÛÇÖÏ Òs an beomeï Òs, butäó es andï Òs retain their values during this phase. Unlike Linke and Marzetta s algorithm, the proessing phase does a single san over all database entries, sine it sans parents of entries that have just been set toï Òimmediately. The postproessing phase adds theï Òs and theäó es to theëóö ; entries that are are effetively a draw. Although all iterations are in fat independent, we perform them with an inreasing window size; this allows us to apply some optimizations when we prepareëø Ø for the next iteration: drawn positions remain drawn during the rest of the omputation, andï Òs andäó es are onverted toíò ÒÓÛÒ. A win an beome a win or a draw after the next iteration, and a loss an beome a loss or a draw. We parallelized the algorithm desribed above in the asynhronous way that is desribed in the main text. Finding out whether all hildren are wins or setting all parents to a win requires ommuniation between several proessors. This is illustrated by the fat arrows in Figure 6. The proessor owning the node markedí Ägenerates the first hild and sends a message to the proessor owning the first hild (this is message type in the figure). The reeiving proessor looks up the database entry for the first hild, and, if the entry denotes a win, inrementally generates the seond hild and sends the hild to the proessor ontaining the seond hild (again using message type ). If the entry denotes something other than a win, the message is dropped. When no more hildren an be generated, the proessor owning the original state is sent a message that the state is a loss (). Upon reeipt, the proessor updates the orresponding database entry, and sends messages to the 8

proessors owning its parents, that the parents are wins (3). When a proessor reeives suh a message, it looks up the parent to see if it is already a win. If it is not, it makes the entry a win, and sends messages to the grandparents to hek for unknown states (4), reursing the entire proess. If the message reeive queue of the proessor ontains work, the work is proessed, otherwise the proessor resumes the main loop in the proedureèöó. 9