
U.U.D.M. Project Report 2017:23

Degree project in mathematics, 15 credits
Supervisor: Jordi-Lluís Figueras
Examiner: Jörgen Östensson
June 2017

Monte Carlo methods applied to tree-structured decision processes

Mattias Bertolino, Arvi Jonnarth

Department of Mathematics

Uppsala University


Monte Carlo methods applied to tree-structured decision processes

Mattias Bertolino, Arvi Jonnarth

June 4, 2017

Abstract

The goal of this project is to investigate and implement methods capable of making decisions within a decision process, such as making moves in a game. The task is to find a statistical method or algorithm which can make qualified guesses for the next move without the use of any prior knowledge. Monte Carlo methods are a wide range of algorithms which make use of random simulations in order to arrive at a result. In this project a method called Monte Carlo Tree Search (MCTS), based on constructing a decision tree, is extensively studied. This method does not only perform naive random simulations but also keeps track of previous simulations and considers the exploration-exploitation trade-off. Two similar games with vastly different properties are used to test the performance of Monte Carlo Tree Search.

The first game used as a testing platform is Nim. It is chosen since it is a combinatorial game, like Chess and Go, and has an optimal strategy, which makes it a good benchmark. However, the goodness of a move is intrinsically badly scaled in Nim, since moves are either optimal or equally bad. The second game is developed from Nim in order to cope with these impracticalities by introducing stochasticity into the game. Two new strategies, suited to the stochastic version, are developed. These new strategies are tested against Monte Carlo-based methods, of which two are implemented. The first Monte Carlo-based method only performs naive random simulations, while the other is based on the more advanced Monte Carlo Tree Search. We find that in normal Nim the naive method performs better than MCTS, but when the stochasticity is introduced MCTS outperforms the naive method. Monte Carlo Tree Search provides a parallelizable strategy for solving problems with no a priori knowledge of the application, which makes it a versatile choice in fields such as finance, geolocation and healthcare.


Acknowledgments

Without the discussions held with Jordi-Lluís Figueras, our project mentor and supervisor, this work would not have been close to what it is now. We would therefore like to thank him for his guidance and support. We would also like to thank Ulla Ahonen-Jonnarth for her valuable feedback. All of your input has contributed considerably to this thesis.

Keywords

Monte Carlo Methods, Monte Carlo Tree Search (MCTS), Reinforcement Learning, Decision Tree, Nim, Impartial Games, Combinatorial Games, Stochastic Games


Contents

1 Introduction

2 Game theory
  2.1 Combinatorial games
  2.2 The Nim game
    2.2.1 The bitwise XOR operator
    2.2.2 The winning strategy
  2.3 Sprague and Grundy's conclusions
  2.4 Stochastic Nim
    2.4.1 Our first strategy
    2.4.2 Simplifications
    2.4.3 Our second strategy

3 Monte Carlo methods
  3.1 The Multi-Armed Bandit Problem
  3.2 Monte Carlo Tree Search

4 Implementation
  4.1 The r-player
  4.2 The p-player
  4.3 The q1-player
  4.4 The q2-player
  4.5 The s-player
  4.6 The x-player
  4.7 Measurements

5 Results
  5.1 Nim
  5.2 Stochastic Nim
  5.3 Benchmark testing

6 Discussion
  6.1 Choice of platform
  6.2 Performance of the s- and x-players
  6.3 About the benchmark players
  6.4 Improvements

7 Conclusions

A Derivation of the nim sums in our first strategy
B Derivation of our first strategy
C Derivation of our second strategy


1 Introduction

The goal of this project is to investigate and implement methods capable of making decisions within a decision process, such as making moves in a game. The task is to find a statistical method or algorithm which can make qualified guesses for the next move without the use of any heuristic knowledge. In many real-world applications, the full decision tree turns out to be either unknown or too large to expand. A statistical sampling algorithm based on Monte Carlo methods is therefore motivated by the same premises as the Monte Carlo methods used to evaluate volumes in high-dimensional spaces. The question we would like to address is how well a statistical algorithm without heuristic knowledge performs.

In complex decision-making problems with large search spaces, a method which by statistical sampling breaks the decision down into a sequence of simpler decisions is a good choice. Applying pure Monte Carlo methods to evaluate the outcome of a particular decision may converge slowly if the search depth is too large. It is therefore of interest to find a Monte Carlo-based algorithm which has a small error probability if it is stopped prematurely, and which converges to the best decision if left to run long enough.

Pure Monte Carlo methods used to evaluate a decision are expected to converge to the optimal result. However, in a decision process with a vast number of possible branches, there may be some decisions that never lead to any satisfactory result. It is therefore of interest to find a way to avoid wasting simulation time on decisions which clearly provide no useful information. Kocsis and Szepesvári [17] found satisfactory results by combining Monte Carlo simulations with a selection phase derived from the so-called Multi-Armed Bandit Problem. This guides the simulations in the Monte Carlo method towards a quasi-uniform distribution. The idea of combining pure Monte Carlo simulations with a guided selection phase is called Monte Carlo Tree Search.

Monte Carlo Tree Search gained public attention when the Google-developed AlphaGo beat the legendary Go player Lee Sedol in March 2016 [12]. The main difference between the tree search methods used by Kocsis and Szepesvári and by AlphaGo lies in the implementation of the selection phase.

The selection phase used by Kocsis and Szepesvári is a simple model which computes, for each child node, a value that accounts for the exploration-exploitation trade-off. AlphaGo, however, takes advantage of game heuristics by using two neural networks, a policy network and a value network [18]. Although the use of neural networks is an essential tool, MCTS remains a central part of the AlphaGo algorithm.

As recently as 27 May 2017, AlphaGo beat the current world champion Ke Jie with ease. After the victory, Demis Hassabis, co-founder and CEO of Google DeepMind, stated [13]: "We believe these general purpose algorithms could one day help scientists as they tackle some of the world's most complex problems, such as finding new cures for diseases, dramatically reducing energy consumption or inventing revolutionary new materials."

The performance evaluation of the proposed Monte Carlo-based strategies is done on two similar platforms (games) with immense differences in terms of mathematical properties: Nim, and a stochastic version of Nim. The choice of Nim is motivated partly by the fact that it is an impartial combinatorial game, which makes it related and influential to a range of other famous games such as Chess and Go. Nim is also highly scalable and, owing to its impartial properties, has a proven winning strategy which is easily implemented as a benchmark for the proposed algorithms.

There are, however, problems with the simplicity of Nim. In Nim, all possible moves except the optimal ones are equally bad, which makes it unsuitable as the stand-alone test frame for statistical algorithms. Thus, the Monte Carlo-based methods are tested in an extended version of Nim where stochasticity is added. This slight modification alters the game characteristics, making it both more complex and non-deterministic. Also, the goodness of the moves is presumed to form an ordered set, contrary to normal Nim, which makes it suitable as a complement for evaluation. In more complex combinatorial games such as Go, where all moves except the optimal moves are equally bad, there may still be a motivation to use the model that the moves form an ordered set, since the complexity hinders the finding of the optimal strategy. Thus, some moves are perceived to be better than others, even though this is not the case.

The introduction of stochasticity to the game makes the optimal strategy in normal Nim suboptimal. Therefore, two new strategies are developed which are suited to the stochastic version.

The Monte Carlo-based methods are tested against these new strategies in order to evaluate their performance.

Outline. In Section 2, descriptions of combinatorial and impartial games are given. There is also a description of normal and stochastic Nim, together with definitions, such as the nim sum, and key words needed to understand further discussions. In this section a presentation of stochastic Nim is given alongside two proposed strategies used as benchmarks to measure the performance of the Monte Carlo-based strategies. The two benchmark strategies, one of which is presented with an example, are derived from the optimal strategy known for normal Nim. In Section 3 a theoretical description of Monte Carlo methods, and Monte Carlo Tree Search in particular, is given. Section 4 contains detailed descriptions of each strategy that was implemented. In Section 5 the performance of the Monte Carlo-based methods is presented, together with results from evaluating the benchmarks. In Section 6 the results are discussed and further improvements are suggested. Finally, our conclusions are given in Section 7.

Contributions. Both authors have contributed significantly to the presented work. The work consisted, among other things, of developing theory, implementing the discussed ideas in C code, running and interpreting simulations, and finally summarizing the work in this paper. Giving credit to one individual author for any specific part of the work would not describe the workflow in a fair manner. The discussions held throughout the project were of distinct importance for the development of this work. The C code is open source and available on GitHub.¹

¹ The repository can be found at https://github.com/MBertolino/DegreeProjectMCTS.


2 Game theory

Game theory is the mathematical study of decision-making problems concerned with the interaction between two or more rational decision makers. Originally, mathematicians like John von Neumann and John Nash studied zero-sum games, in which two players compete and one player's win is the other player's loss. Nowadays, however, game theory covers a range of behavioural problems with applications in, for instance, economics and negotiations. One example of a behavioural problem is the Prisoner's Dilemma [21], a conflict between doing what is best for the individual and considering what might be best for the group.

2.1 Combinatorial games

Combinatorial games are two-player zero-sum games with perfect information. A combinatorial game can be defined recursively as a divided set of all possible states the players can move to,

G = {G_L | G_R},

where G_L and G_R are both sets of games, representing the possible moves of the left and the right player, and g = {∅ | ∅} is the base case, called the Endgame. In normal mode, the player making the last move wins, and in misère mode, the player making the last move loses.

A subset of combinatorial games are impartial, where impartial refers to the fact that the allowed moves are the same for both players and determined only by the position. As described below, a position in a combinatorial game can be classified as either winning or losing, and a player in a winning position can secure a victory. Combinatorial games can in principle be solved completely, but this often involves an extensive amount of calculation. Impartial games, however, may be solved more easily, and any impartial game can be regarded as an equivalent of Nim, as shown independently by Sprague [4] and Grundy [2].

In conclusion, Sjöstrand [22] informally classifies combinatorial games as those which satisfy the following conditions, where the last condition determines whether the game is played in normal or misère mode.

1. Two players make alternating moves.

2. The game is deterministic.

3. Both players have perfect information of the game.

4. The game always reaches an end, and never ends in a draw.

5. The last player to make a move wins or loses.

However, since the field of game theory is constantly growing, scholars are extending the definition of combinatorial games. In some cases one-player games such as Sudoku, and even zero-player games such as the cellular automaton Conway's Game of Life, are considered combinatorial games. In most cases, as well as in this paper, Chess and Go are considered combinatorial. Nim is moreover an impartial game, whereas Chess and Go are not, since both players cannot make the same moves; instead, Chess and Go are partisan games, the opposite of impartial. An example of a non-combinatorial game is Texas Hold'em, since it neither has perfect information nor is deterministic.

2.2 The Nim game

Nim is an impartial combinatorial game with useful mathematical properties and is equivalent to several other combinatorial games, as described below where the results of Sprague and Grundy are presented. While there are many variations of Nim, the version presented by Bouton [1] is constructed by placing three heaps with an arbitrary number of sticks in each heap, except that no two heaps may contain the same number of sticks.

In alternating turns, the players remove any positive number of sticks from any non-empty heap until there are no sticks left. Bouton presented the complete mathematical theory behind the game and provided a proof of a winning strategy, described in Section 2.2.2, which uses the nim sum defined in Definition 2.1 as the total bitwise exclusive or (XOR) of all heaps. To get a clear picture of the progression of the game, see Example 2.2.


2.2.1 The bitwise XOR operator

The bitwise XOR operator, denoted ⊕, is a logical operator that computes the exclusive disjunction between two numbers expressed in base two [3]. For a binary input of two bits, A and B, the exclusive disjunction is given by the truth table in Table 1. In Table 2, the output of x ⊕ y is presented for x, y ∈ [0, 5] ⊂ N, and in Example 2.1 a computation of the bitwise XOR between the integers 5 and 6 is exemplified.

Table 1: Truth table for the binary bitwise XOR.

    A  B  |  A ⊕ B
    1  1  |    0
    1  0  |    1
    0  1  |    1
    0  0  |    0

Table 2: Output of the bitwise XOR, x ⊕ y.

    x\y | 0  1  2  3  4  5
     0  | 0  1  2  3  4  5
     1  | 1  0  3  2  5  4
     2  | 2  3  0  1  6  7
     3  | 3  2  1  0  7  6
     4  | 4  5  6  7  0  1
     5  | 5  4  7  6  1  0

Example 2.1 (Computation of bitwise XOR). To compute the bitwise XOR between the integers 5 and 6, they are first expressed in base two,

5 = (101)_2, 6 = (110)_2.

The bitwise XOR is now computed by applying XOR, in accordance with Table 1, to each pair of bits in the corresponding position. Thus, the bitwise XOR of 5 and 6 is

5 ⊕ 6 = (101)_2 ⊕ (110)_2 = (011)_2 = 3.

Further, the bitwise XOR operator obeys the commutative and associative laws and has the fundamental property that

s ⊕ s = 0 for all s ∈ N.

Definition 2.1 (Nim sum). Let {s_i}, i = 1, ..., N, be the numbers of sticks in the N heaps, with binary expansions s_i = Σ_j a_ij 2^j for a_ij ∈ {0, 1}, and let b_j ≡ Σ_i a_ij (mod 2). The nim sum Ω is defined as

Ω := s_1 ⊕ s_2 ⊕ ... ⊕ s_N = Σ_j b_j 2^j.
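As a concrete illustration, here is a minimal sketch of the nim sum in C, the language used for the project's implementation; the function name nim_sum and its signature are ours, not taken from the authors' repository:

    #include <stddef.h>

    /* Nim sum (Definition 2.1): the bitwise XOR of all heap sizes.
     * heaps[i] holds the number of sticks in heap i. */
    unsigned nim_sum(const unsigned *heaps, size_t n_heaps)
    {
        unsigned omega = 0;
        for (size_t i = 0; i < n_heaps; i++)
            omega ^= heaps[i];  /* ^ is the bitwise XOR operator in C */
        return omega;
    }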

2.2.2 The winning strategy

Definition 2.2 (Winning position). Consider a set of sticks in N heaps. The set is called a winning position, or safe combination, as defined by Grundy [2], if the nim sum of the sticks in the heaps is equal to zero,

Ω = s_1 ⊕ s_2 ⊕ ... ⊕ s_i ⊕ ... ⊕ s_N = 0.

If a player leaves the board in a winning position, its opponent cannot leave the board in a winning position. Conversely, if the board is not in a winning position, a player can always leave the board in a winning position. Thus a player leaving the board in a winning position can always defend the board and win the game.

If the nim sum of all heaps is not equal to zero, the winning strategy is to find a heap i where the nim sum of that heap and Ω is less than the number of sticks in the heap,

Ω ⊕ s_i < s_i,

and to remove sticks from it such that the total nim sum becomes zero. It is always possible to remove sticks from this heap so as to reduce the nim sum of all heaps to zero and thereby maintain a winning position.
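A sketch of this strategy in C, building on the nim_sum function above (again our own naming, not the repository code): the function returns the index of a heap from which sticks can be removed to restore a zero nim sum, or -1 if no such move exists.

    /* Winning strategy (Section 2.2.2): find a heap i with
     * (omega XOR heaps[i]) < heaps[i] and remove sticks so that the
     * total nim sum becomes zero. *n_remove receives the number of
     * sticks to remove; the return value is the heap index, or -1 if
     * the nim sum is already zero and no winning move exists. */
    int winning_move(const unsigned *heaps, size_t n_heaps, unsigned *n_remove)
    {
        unsigned omega = nim_sum(heaps, n_heaps);
        if (omega == 0)
            return -1;  /* all moves are equally bad */
        for (size_t i = 0; i < n_heaps; i++) {
            unsigned target = omega ^ heaps[i];
            if (target < heaps[i]) {  /* leave 'target' sticks in heap i */
                *n_remove = heaps[i] - target;
                return (int)i;
            }
        }
        return -1;  /* not reached when omega != 0 */
    }

For the board (1, 2, 3) the nim sum is already zero, so any move breaks the winning position; for (1, 2, 4) the function finds Ω = 7 and reduces the third heap from 4 to 3 sticks.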


Example 2.2 (The winning strategy in normal Nim). In Figure 1, player 2 wins a game of Nim using the winning strategy against player 1. The game is set up with three heaps and six sticks, distributed with one stick in the first heap, two sticks in the second heap and three sticks in the third heap. Initially, the nim sum is zero since 1 ⊕ 2 ⊕ 3 = 0. Whenever player 1 removes any number of sticks, the nim sum is altered, and player 2 therefore manages to keep the nim sum at zero throughout the game.

Figure 1: An example of the game of Nim. Player 2 uses the winning strategy to maintain the nim sum at zero resulting in a win. The nim sum, Ω, is written under the board.

2.3 Sprague and Grundy’s conclusions

Grundy classifies the terminal states of a game as W or L, which determine a win or a loss for the player moving to that state. Since an impartial game forms a partially ordered set of states, with the terminal states as end conditions, it is possible to proceed backwards to find a winning strategy. Sprague denotes the corresponding classes Verluststellung, meaning losing position in German, and Gewinnstellung, meaning winning position, defined oppositely to Grundy's notation.

The difference is that Grundy connects a game state to the previous player, whereas Sprague connects a game state to the next player. If the nim sum is zero, Grundy would say that the board is in a winning position for the previous player, whereas Sprague would say that the board is in Verluststellung for the next player. In this paper, Grundy's classification and terminology are used rather than Sprague's alternative but equivalent formulation.

Definition 2.3 (The Grundy function). Let the Grundy function be Ω : N^N → N with the properties

1. any single move alters Ω(X),
2. if 0 ≤ ω < Ω(X), then Ω(X) can be decreased to ω in one move,
3. Ω(X_T) = 0 for all terminal states X_T.

It follows that a state X_i is a winning state if and only if Ω(X_i) = 0.

2.4 Stochastic Nim

Now consider a game with the same objective as normal-mode Nim, but where at each turn there is a probability r that a stick is added to the first heap. This makes transitions possible between states which were not connected in normal Nim. Note that the perturbation occurs directly after a player has made a move and before checking whether a player has won. This means that even if a player removes the last stick, the board may still be perturbed and the game continue. As in impartial games, both players can still reach the same game states, but the uncertainty introduced by the parameter r removes the property of perfect information, transforming the game into a non-combinatorial game. For small r, stochastic Nim closely resembles normal Nim. For r close to one, however, the game is completely different.

Thus, Bouton's results and the Sprague-Grundy conclusions are no longer applicable, and it is not possible to find a winning strategy using their results. We have therefore developed two alternative strategies, inspired by the winning strategy in normal Nim, to use when testing another strategy's performance in a game of stochastic Nim. These two new strategies, referred to as the q1-player and the q2-player in Section 4, both have in common that they account for the stochasticity of the game. Thus, these strategies can be seen as extensions of the winning strategy in normal Nim to the stochastic game.

Stochastic Nim can be viewed in a probabilistic sense, with the nim sum being a random variable not completely described by the visible sticks on the board. Using Sprague's and Grundy's notion, the game state would be seen as all the visible sticks in the heaps,

X = (s_1, ..., s_{n+1}),

with the strategy to remove sticks so that the nim sum of the visible sticks is zero.

The winning strategy in normal Nim only performs well for r ≪ 1, as it does not take the stochasticity into account. Instead, we model the game state X as having one stochastic heap with limited information and n deterministic heaps with perfect information. The heap with limited information is viewed as having α + z sticks, where α = s_1 are the visible sticks and z are the hidden sticks which, one at a time, are made visible with probability r right after each move. The heaps with perfect information are viewed as having β_1, ..., β_n sticks, where β_j = s_{j+1}. Thus, the state is expressed as

X_{α,β_1,...,β_n} = (α + z, β_1, ..., β_n),

which is written X_{α,β} for short. For example, the state X_{1,3,2} corresponds to a board with three heaps consisting of one visible stick in the first heap, three sticks in the second heap and two sticks in the third heap. Note that the state represents the board directly after a player has made its move and before a hidden stick has potentially become visible. This means that the board can still have one extra visible stick in the first heap when it is time for the next player to make its move.

The strategy described in Section 2.4.1 removes sticks so that the probability of reaching nim sum zero is maximized over the accessible states. The strategy described in Section 2.4.3 avoids the formulation of a nim sum and instead aims to estimate the probability that a certain state is a winning position. These models have in common that they take the stochasticity into account. The addition of a new stick is what we denote a perturbation. An example of a game of stochastic Nim is given in Example 2.3. As in Example 2.2 of the winning strategy, player 2 tries to maintain the nim sum at zero throughout the game. However, keeping the nim sum of the visible sticks on the board at zero does not guarantee a win.

Example 2.3 (A game of stochastic Nim). In Figure 2, a game of stochastic Nim is set up with one stick in the first heap, two sticks in the second heap and three sticks in the third heap. After the second move, another stick is added to the first heap (the red stick). In the rest of the game, no new stick is added, and hence player 1 wins. The nim sum of the visible sticks is written under the board, but it is of little help for finding the optimal move.

Figure 2: An example of player 2 applying the winning strategy in normal Nim, but losing because another stick is added (the red) due to the stochasticity of the game. The nim sum, Ω, is written under the board.


2.4.1 Our first strategy

The first strategy extends the notion that the optimal move in normal Nim leaves the opponent with a board whose nim sum is equal to zero. When the stochasticity is introduced, however, the nim sum cannot be computed explicitly from the visible sticks on the board. Instead we formulate the probability that a state has nim sum zero. In order to do so we use the concept of states of the game, which correspond to a board prior to a possible perturbation.

Using the notation described in the previous section, let X_{α,β} be a general state with α sticks in the stochastic heap and β_j sticks in the j-th deterministic heap, for j = 1, ..., n, where n is the number of deterministic heaps. We make the assumption that the total number of sticks in the stochastic heap can be written as α + z, where z describes the number of sticks that will appear in the heap as a result of the stochasticity throughout the game. We model the event of a stick being added as a hidden stick converting into a visible stick, so that the total number of sticks in the heap remains α + z.

The goal is to estimate the probability that the nim sum of a given state X_{α,β} has the value zero. Let us now formulate the probability that the nim sum is equal to ϕ.

Theorem 1 (Probability of the nim sum). The probability that the nim sum Ω is equal to ϕ, given the state X_{α,β}, is

P[Ω = ϕ | X_{α,β}] = P[z = (β_1 ⊕ ... ⊕ β_n ⊕ ϕ) − α | X_{α,β}].   (1)

Proof. By the definition of the nim sum, Definition 2.1, we get

Ω = (α + z) ⊕ β_1 ⊕ ... ⊕ β_n.   (2)

Since the inverse operation of ⊕ is ⊕ itself, the number of hidden sticks z in Equation (2) can be expressed as

z = (β_1 ⊕ ... ⊕ β_n ⊕ Ω) − α.

By substituting Ω with ϕ we get

Ω = ϕ ⇔ z = (β_1 ⊕ ... ⊕ β_n ⊕ ϕ) − α.

The probabilities that these equalities are fulfilled, given the state X_{α,β}, are then equal. This concludes the proof.

Now, consider a possible state X_{α̃,β̃} reachable from the state X_{α,β}, taking into account the perturbation as well as the legal moves available after the perturbation. Let us define the set of all reachable states.

Definition 2.4 (The set of all reachable states). Let X_{α,β} be the current state with K reachable states. Define

S = {(X_{α̃,β̃})_k}, k = 1, ..., K,

to be the set of all states reachable from the state X_{α,β}.

We can formulate the probability in Equation (1) by summing up all the possible reachable states and looking at the probability that z has changed according to the change in the board.

The probability can be written as

P[Ω = ϕ | X_{α,β}] = Σ_{k=1}^{K} w_k P[Ω = (α̃ + z̃) ⊕ β̃_1 ⊕ ... ⊕ β̃_n | (X_{α̃,β̃})_k],   (3)

where α̃, β̃_1, ..., β̃_n are the numbers of sticks in each heap of the new state, (X_{α̃,β̃})_k represents the k-th state reachable from X_{α,β}, and w_k is the probability that the k-th move is chosen. Note that in most cases β̃_j = β_j, since at most one of these values can be affected by the move that is made, and only the stochastic heap can be affected by the perturbations. Regarding z̃, it will be equal to z if no stick is revealed, and z − 1 if a stick is revealed.

From X_{α,β} there are a total of α + Σ_{j=1}^{n} β_j possible moves if no perturbation occurs, and α + Σ_{j=1}^{n} β_j + 1 possible moves if a perturbation occurs. These moves can be classified into four cases, namely:


1. perturbation does not occur and the player takes sticks from the stochastic heap,
2. perturbation does occur and the player takes sticks from the stochastic heap,
3. perturbation does not occur and the player takes sticks from the j-th deterministic heap,
4. perturbation does occur and the player takes sticks from the j-th deterministic heap.

Cases 1 and 2 only affect the first (stochastic) heap and lead to the new state X_{i,β}, where i is the number of sticks left in the first heap. Case 3 only affects one of the deterministic heaps and leads to the new state X_{α,β_1,...,i,...,β_n} (with β_j = i). Case 4 affects one of the deterministic heaps as well as the stochastic heap, where the number of sticks is increased by one; here, the new state is X_{(α+1),β_1,...,i,...,β_n} (with β_j = i).

In each move, the probability that a hidden stick is revealed, i.e. that α → α + 1 and z → z − 1, is r, and hence the probability that no perturbation occurs is (1 − r). z is then assumed to follow a binomial distribution with parameters m and r, where m is the number of moves left in the game,

z ∼ Bin(m, r).
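Written out, this modelling assumption gives the standard binomial probabilities for the number of hidden sticks revealed over the remaining m moves,

P[z = k] = C(m, k) r^k (1 − r)^{m−k}, k = 0, 1, ..., m,

where C(m, k) denotes the binomial coefficient.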

Since the number of moves left in a game is unknown, m can be estimated as the average game length. Let us now formulate how the nim sum must change such that z remains fixed.

Theorem 2 (Reachable nim sums). If the nim sum is equal to ϕ for the state X_{α,β}, then for a fixed z the nim sums in each of the four cases above are equal to

Ω_1 = (i + (b ⊕ ϕ) − α) ⊕ b,        i ∈ [0, α − 1],
Ω_2 = (i + (b ⊕ ϕ) − (α + 1)) ⊕ b,  i ∈ [0, α],
Ω_3 = Ω_4 = i ⊕ β_j ⊕ ϕ,            i ∈ [0, β_j − 1], j ∈ [1, n],

where

b = β_1 ⊕ ... ⊕ β_n.

For a proof, see Appendix A.

In general, the estimated probability that the nim sum is equal to ϕ for the state X_{α,β}, with α, β > 0, is thus recursively given by

P̂[Ω = ϕ | X_{α,β}] =
      (1 − r)/(α + Σ_{j=1}^{n} β_j) · Σ_{i=0}^{α−1} ŵ_i^{(1)} P[Ω = (i + (b ⊕ ϕ) − α) ⊕ b | X_{i,β}]
    + r/(α + 1 + Σ_{j=1}^{n} β_j) · Σ_{i=0}^{α} ŵ_i^{(2)} P[Ω = (i + (b ⊕ ϕ) − (α + 1)) ⊕ b | X_{i,β}]   (4)
    + (1 − r)/(α + Σ_{j=1}^{n} β_j) · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} ŵ_i^{(3)} P[Ω = i ⊕ β_j ⊕ ϕ | X_{α,β_1,...,i,...,β_n}]
    + r/(α + 1 + Σ_{j=1}^{n} β_j) · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} ŵ_i^{(4)} P[Ω = i ⊕ β_j ⊕ ϕ | X_{(α+1),β_1,...,i,...,β_n}],

where ŵ_i^{(k)}, k = 1, 2, 3, 4, are estimated weights for how likely each move is; see Appendix B for a derivation. The simplest assumption is ŵ_i^{(k)} = 1, which is motivated by the lack of prior knowledge of the outcome of the game. Now, since the board is perturbed with probability r in each move, the probability is recursively estimated by expanding the game tree down to the state X_{0,0}. The probability that the state X_{0,0} has nim sum ϕ is equal to

P[Ω = ϕ | X_{0,0}] = r^ϕ (1 − r),

since this is the probability that the board is perturbed exactly ϕ times.

A strategy can now be formulated as making the move which leaves the game in the state with the highest probability of having nim sum zero. This can be expressed as finding X_{α̃,β̃} ∈ S such that

X_{α̃,β̃} = arg max_{X_{α̃,β̃} ∈ S} { P̂[Ω = 0 | X_{α̃,β̃}] }.   (5)


2.4.2 Simplifications

Equation (4) gives a recursive estimate of the conditional probability that the nim sum is equal to ϕ. The ultimate goal is to find the move with the highest probability of yielding nim sum ϕ = 0. With the help of Bayes' theorem [20], the conditional probability at each stage can be reformulated as

P[Ω = 0 | X_{α̃,β̃}] = P[X_{α̃,β̃} | Ω = 0] P[Ω = 0] / P[X_{α̃,β̃}],

where the task is to find the three probabilities on the right-hand side. For a given starting board, the full sample space of the nim sum is known. Hence, in the optimization problem in Equation (5), the a priori probability of the nim sum being zero, P[Ω = 0], has no influence, since it does not depend on α and β. This simplifies the problem to

X_{α̃,β̃} = arg max_{X_{α̃,β̃} ∈ S} { P[X_{α̃,β̃} | Ω = 0] / P[X_{α̃,β̃}] }.

Further, the sample space of X_{α,β} is known. It thus remains to find the conditional probability of the board given a zero nim sum, P[X_{α,β} | Ω = 0].

Example 2.4 (Our first strategy). Consider the state X_{2,1} in a game with perturbation probability r. If a player applies the winning strategy of normal Nim, the chosen move would be to remove one stick from the first heap, reducing the nim sum from 3 to 0 as

Ω = 2 ⊕ 1 = 3 → 1 ⊕ 1 = 0.

However, we are interested in finding the state X_{α̃,β̃} which maximizes the estimated probability of getting nim sum zero. The fact that P[Ω = 0 | X_{2,0}] = 0, since (2 + z) ⊕ 0 ≠ 0 for all z ≥ 0, makes it sufficient to compare the estimated probabilities that X_{1,1} and X_{0,1} have nim sum zero. In this example we assume that all moves are weighted uniformly. The estimated probability that X_{1,1} has nim sum zero, computed recursively using Equation (4), is then given by

P̂[Ω = 0 | X_{1,1}] = (1 − r)/2 · ( P[Ω = 1 | X_{1,0}] + P[Ω = 1 | X_{0,1}] )
                   = (1 − r)/2 · ( (1 − r) P[Ω = 0 | X_{0,0}] + (1 − r) P[Ω = 0 | X_{0,0}] )
                   = (1 − r)/2 · ( (1 − r)² + (1 − r)² )
                   = (1 − r)³,

and the estimated probability that X_{0,1} has nim sum zero is given by

P̂[Ω = 0 | X_{0,1}] = (1 − r) P[Ω = 1 | X_{0,0}] + r/2 · ( P[Ω = 1 | X_{1,0}] + P[Ω = 1 | X_{0,1}] )
                   = r(1 − r)² + r/2 · ( (1 − r)² + (1 − r)² )
                   = 2r(1 − r)².

Our first strategy chooses the move with the highest probability of leading to nim sum zero. This is the move towards X_{0,1} for r > 1/3, since

P̂[Ω = 0 | X_{0,1}] = 2r(1 − r)² > (1 − r)³ = P̂[Ω = 0 | X_{1,1}] for all r > 1/3,

and the move towards X_{1,1} otherwise.
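As a concrete instance of this threshold (our arithmetic, using the expressions above): at r = 1/2 the estimates are P̂[Ω = 0 | X_{0,1}] = 2 · (1/2) · (1/4) = 1/4 and P̂[Ω = 0 | X_{1,1}] = (1/2)³ = 1/8, so the strategy deviates from the normal-Nim winning move, while at r = 1/4 we get (3/4)³ = 27/64 > 2 · (1/4) · (3/4)² = 9/32, and the strategy agrees with it.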

2.4.3 Our second strategy

One might argue that finding the probability that the nim sum is equal to zero might not yield the best result, since this might not translate into the optimal strategy in stochastic Nim. Even if the hidden sticks were known, the players cannot always position themselves such that a zero nim sum, including the hidden sticks, guarantees a win. For example, the state X_{0,0} generates a win not only if the nim sum is zero but whenever it is even.² Instead of formulating the probability that the nim sum has a given value, we can formulate the probability that a state X_{α,β} is a winning state. We will still have the same sums as in Equation (4), meaning that we still consider the same four cases as in Section 2.4.1. The difference in this strategy is that we sum up the probabilities that the states reachable from X_{α,β} generate a loss. Since the probability that a state generates a loss is equal to one minus the probability that it generates a win, we can write

P[win | X_{α,β}] =
      (1 − r)/(α + Σ_{j=1}^{n} β_j) · Σ_{i=0}^{α−1} w_i^{(1)} (1 − P[win | X_{i,β}])
    + r/(α + 1 + Σ_{j=1}^{n} β_j) · Σ_{i=0}^{α} w_i^{(2)} (1 − P[win | X_{i,β}])   (6)
    + (1 − r)/(α + Σ_{j=1}^{n} β_j) · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} w_i^{(3)} (1 − P[win | X_{α,β_1,...,i,...,β_n}])
    + r/(α + 1 + Σ_{j=1}^{n} β_j) · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} w_i^{(4)} (1 − P[win | X_{α+1,β_1,...,i,...,β_n}]).

The problem here is that the probability P[win | X_{α,β}] appears on both the right- and the left-hand side: on the right-hand side it is the last element of the second sum, that is, the term with i = α. By assuming uniform weights w_i^{(k)} = 1 for k = 1, 2, 3, 4, we can rewrite Equation (6) as an explicit formula for P̂[win | X_{α,β}], which yields

P̂[win | X_{α,β}] = (τ + 1)/(τ + 1 + r) · [ r/(τ + 1)
    + ( (1 − r)/τ + r/(τ + 1) ) · Σ_{i=0}^{α−1} (1 − P[win | X_{i,β}])
    + (1 − r)/τ · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} (1 − P[win | X_{α,β_1,...,i,...,β_n}])   (7)
    + r/(τ + 1) · Σ_{j=1}^{n} Σ_{i=0}^{β_j−1} (1 − P[win | X_{α+1,β_1,...,i,...,β_n}]) ],

where τ is the total number of sticks in the state X_{α,β},

τ = α + Σ_{j=1}^{n} β_j.

For a derivation, see Appendix C. The base case for X_{0,0} is then the probability that the board is perturbed an even number of times (including zero), which is

P[win | X_{0,0}] = (1 − r) + r²(1 − r) + r⁴(1 − r) + ...
                 = (1 − r) Σ_{m=0}^{∞} r^{2m}
                 = (1 − r) · 1/(1 − r²)
                 = 1/(1 + r).   (8)

² In the state X_{0,0}, only the z hidden sticks in the first heap remain on the board, and therefore the nim sum is equal to z. Since these sticks are revealed one at a time, one per move, a win is generated if z is even.
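A direct C transcription of Equation (7) with the base case (8) might look as follows. This is our own sketch (the names and layout are not the authors' repository code), and it runs in exponential time without memoization, consistent with the long running times reported for the q-players in Section 4.7:

    #include <stddef.h>

    /* Estimate P[win | X_{alpha,beta}] from Equation (7), uniform weights.
     * beta[] holds the n deterministic heaps and is modified in place
     * during the recursion (restored before returning). */
    double win_prob(unsigned alpha, unsigned *beta, size_t n, double r)
    {
        unsigned tau = alpha;
        for (size_t j = 0; j < n; j++)
            tau += beta[j];
        if (tau == 0)
            return 1.0 / (1.0 + r);  /* base case X_{0,0}, Equation (8) */

        double acc = r / (tau + 1);  /* term left by the self-reference */
        /* cases 1 and 2: moves in the stochastic heap */
        for (unsigned i = 0; i < alpha; i++)
            acc += ((1.0 - r) / tau + r / (tau + 1))
                   * (1.0 - win_prob(i, beta, n, r));
        /* cases 3 and 4: moves in the j-th deterministic heap */
        for (size_t j = 0; j < n; j++) {
            unsigned save = beta[j];
            for (unsigned i = 0; i < save; i++) {
                beta[j] = i;
                acc += (1.0 - r) / tau * (1.0 - win_prob(alpha, beta, n, r));
                acc += r / (tau + 1) * (1.0 - win_prob(alpha + 1, beta, n, r));
            }
            beta[j] = save;
        }
        return (tau + 1) / (tau + 1.0 + r) * acc;
    }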


3 Monte Carlo methods

Monte Carlo methods are a broad class of statistical sampling methods used in applications ranging from statistical physics and computational science to finance. The modern Monte Carlo methods were developed by the physicists Enrico Fermi, Stanisław Ulam and John von Neumann.

Although Fermi neither named the method nor published anything on it, he was the first to use Monte Carlo methods, to calculate neutron diffusion [7]. Later, in 1949, Ulam and Metropolis published a paper [8] on Monte Carlo methods for mathematical physics. The introduction of modern computing machines made it possible to efficiently generate uniformly distributed and uncorrelated random numbers.

Ulam and Metropolis showed that this was useful when dealing with integro-differential equations in a statistical manner. They exemplify the use of Monte Carlo methods by evaluating the volume of a region in a high-dimensional space and by evaluating the solution to the Fokker-Planck equation.

In Example 3.1, Monte Carlo methods are used to evaluate the volume of a region in a high-dimensional space.

Monte Carlo methods rely on the central limit theorem [11] and the law of large numbers [10]. The central limit theorem states that the error in the sample mean, µ, is proportional to n^{−1/2}, where n is the sample size; hence, Monte Carlo methods converge slowly. The law of large numbers states that if a stochastic process is sampled repeatedly, the mean value converges to the true expected value. The weak form of the law of large numbers, which is used extensively, is given below.

Theorem 3 (Law of large numbers, weak form). Let {X_i}, i = 1, ..., n, be observations of X and let X̄_n be their mean,

X̄_n = (X_1 + X_2 + ... + X_n)/n.

Then for any δ > 0 the sample average converges in probability to the expected value,

P[X̄_n − δ < µ < X̄_n + δ] → 1 as n → ∞.

Example 3.1 (Evaluating the volume of a region in a high-dimensional space). Consider a region D ⊂ [0, 1]^n ⊂ R^n with non-vanishing volume in an n-dimensional Euclidean space, located in the unit cube and defined by all points (x_1, x_2, ..., x_n) satisfying

f_1(x_1, x_2, ..., x_n) < 0;  f_2(x_1, x_2, ..., x_n) < 0;  ...  f_n(x_1, x_2, ..., x_n) < 0.   (9)

Evaluating the corresponding integrals would be tedious and hard to perform, as defining the volume by subdivision of the integrals leads to an evaluation of 10^n lattice points. Instead, the Monte Carlo approach is to evaluate a large sample of random points inside the unit cube; by the law of large numbers, the proportion of points which satisfy the inequalities in Equation (9) tends to the actual volume. Since the law of large numbers does not care about the dimension of the space, Monte Carlo methods are insensitive to dimension. For this reason Monte Carlo methods are well motivated in problems of high dimensionality, as opposed to methods whose convergence depends on the number of dimensions.
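The following self-contained C program sketches this procedure for one concrete choice of region: the part of the unit ball lying inside the unit cube, i.e. the single constraint f(x) = x_1² + ... + x_n² − 1 < 0 standing in for the system (9). The constraint, constants and names are ours, chosen only for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    #define DIM     10        /* dimension n of the unit cube */
    #define SAMPLES 1000000L  /* number of random points      */

    /* membership test: inside the unit ball centred at the origin */
    static int inside(const double *x)
    {
        double s = 0.0;
        for (int i = 0; i < DIM; i++)
            s += x[i] * x[i];
        return s < 1.0;
    }

    int main(void)
    {
        long hits = 0;
        double x[DIM];
        srand(1234);
        for (long k = 0; k < SAMPLES; k++) {
            for (int i = 0; i < DIM; i++)
                x[i] = (double)rand() / RAND_MAX;  /* uniform in [0,1] */
            hits += inside(x);
        }
        /* by the law of large numbers, the hit fraction tends to the
         * volume of the region, independently of DIM */
        printf("estimated volume: %f\n", (double)hits / SAMPLES);
        return 0;
    }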

3.1 The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem is a sequential decision-making problem where we are faced with K slot machines (one-armed bandits), each with its own reward distribution [14]. The problem is stated as follows: which of the K arms should be pulled in order to maximize the reward? After each visit, more information is gathered about the selected bandit, and as we visit more bandits we get a better picture of which of them gives a higher reward.

The problem is interesting since we have to weigh exploring new machines against exploiting machines that we are more certain generate a high reward. This is called the exploration-exploitation trade-off. A selection strategy has to be formulated which chooses one of the K decisions (pulling one arm), yielding a random reward following the distribution of the selected arm. In this paper, the selection strategy is based on the Upper Confidence bound applied to Trees (UCT) proposed by Kocsis and Szepesvári [17]; that is, choose the decision i* such that

i* = arg max_i { w_i/n_i + c √(ln t / n_i) },   (10)

(17)

where w_i is the number of positive outcomes for decision i, n_i is the number of simulations for decision i, and t is the total number of simulations for the current node. The parameter c is the exploration constant, which can be tuned to change the balance between exploration and exploitation. The first term is the rate of positive outcomes and corresponds to exploitation. The second term corresponds to exploration and takes a higher value if the specific decision has been chosen fewer times relative to the other possible decisions.
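In code, the UCT rule is a one-line score per child plus an arg max. The C sketch below is ours, with the common convention (not spelled out in the text) that unvisited decisions get an infinite score so that each arm is tried at least once:

    #include <math.h>
    #include <stddef.h>

    /* UCT value from Equation (10): w/n exploits, c*sqrt(ln t / n) explores. */
    double uct(double w, double n, double t, double c)
    {
        if (n == 0.0)
            return INFINITY;  /* force at least one visit per decision */
        return w / n + c * sqrt(log(t) / n);
    }

    /* Pick the child maximizing the UCT value; wins[i] and visits[i] are
     * the statistics of child i, t the parent's total simulation count. */
    size_t select_child(const double *wins, const double *visits,
                        size_t n_children, double t, double c)
    {
        size_t best = 0;
        double best_val = -INFINITY;
        for (size_t i = 0; i < n_children; i++) {
            double v = uct(wins[i], visits[i], t, c);
            if (v > best_val) { best_val = v; best = i; }
        }
        return best;
    }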

Monte Carlo Tree Search, described in Section 3.2, uses the ideas from the Multi-Armed Bandit Problem to guide its simulations towards a quasi-uniform distribution.

3.2 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a method which aims to statistically find the decision most likely to be the optimal one in a decision process. The idea is to iteratively build up a decision tree which contains statistics on the different combinations of decisions. When this method is applied to a game, each node of the tree corresponds to a legal move that a player can make.

In the most basic implementation only the decisions (or moves) that are currently available are considered. The tree is then simply a root with n child nodes where n is the total number of possible decisions. For each possible decision the process is then simulated k times with random decisions to gather statistics. When the simulations are done, the decision with the highest reward is chosen. This pure Monte Carlo strategy uses equally distributed simulations to evaluate the decision process and make a decision.

The next step is to consider not only the current set of possible decisions but also the decisions which follow the first one. In this case the decision tree will be one layer deeper than in the basic case. The number of nodes increases with the branching factor, which is equal to the number of decisions possible at a given time. The process of expanding every node in the tree can be repeated, for every possible combination of decisions, until the end of the process; the decision tree has then been exhausted. When the tree is exhausted every decision path is known, and a decision can be made which maximizes the possible reward.

This approach, however, poses a practical problem: the number of computations needed increases exponentially with the depth of the tree. The computation times therefore become too large for this method to be feasible, even for fairly small decision processes. To improve the convergence of the pure Monte Carlo implementation without relying on exhaustion, Chaslot et al. [16] proposed a way to guide the simulations of the Monte Carlo evaluation. This is done using a selection step, in which the choice of the decision to simulate is guided by computing the UCT value from Equation (10) of the Multi-Armed Bandit Problem. This way, the simulations used to evaluate the decision tree follow a quasi-uniform distribution. Each iteration of MCTS consists of the four steps described below and illustrated in Figure 3.

1. Selection, the tree is traversed down to a leaf node L using Equation (10) from the Multi-Armed Bandit Problem.

2. Expansion, the leaf node is expanded to create child nodes corresponding to all possible decisions from position L (unless this is an end position).

3. Simulation, the decision process is simulated from one of the newly created nodes with random decisions until the end is reached.

4. Backpropagation, the result of the simulation is stored and backpropagated up to the root.

When the iterations are done, the move with the highest visit count is chosen. A practical advantage of MCTS is that the simulations can be terminated at any time. This allows a stopping criterion to be formulated which can depend on either time or the certainty of the predicted best decision.


Figure 3: The four steps of MCTS.
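To make the bookkeeping in these four steps concrete, here is one possible node layout and the backpropagation step in C. The thesis does not publish its data structures at this point, so the struct and the sign-flipping convention below are our assumptions for a two-player, alternating-move game:

    #include <stddef.h>

    /* statistics stored per node of the decision tree */
    typedef struct Node {
        int           wins;        /* positive outcomes seen through here */
        int           visits;      /* times this node was traversed       */
        struct Node  *parent;      /* NULL at the root                    */
        struct Node **children;    /* NULL until the node is expanded     */
        size_t        n_children;
    } Node;

    /* Step 4, backpropagation: walk from the simulated node back to the
     * root, updating the statistics. result is 1 for a win and 0 for a
     * loss, flipped at every level since the players alternate. */
    void backpropagate(Node *node, int result)
    {
        while (node != NULL) {
            node->visits += 1;
            node->wins   += result;
            result = 1 - result;  /* one player's win is the other's loss */
            node = node->parent;
        }
    }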


4 Implementation

In this section we describe how we implemented the methods described in the previous sections. Since the Monte Carlo methods are based on performing many random simulations, we wrote the programs in the C language in order to increase performance. Several strategies, which we call players, were constructed to play both normal and stochastic Nim. Each player was written as a function whose input is the current board and whose output is a legal move.

The first strategy implemented was a simple player which chooses its move randomly; this player is called the r-player. Then we implemented the winning strategy, which removes sticks to leave the opponent with nim sum zero, as described in Section 2.2.2. For this player we also included a parameter, p, which is the probability that it plays by the winning strategy and performs an optimal move; otherwise the player makes a random move. This player is called the p-player (probability player). Two players were constructed which are based on Monte Carlo methods. The first one, called the s-player (statistical player), is a pure Monte Carlo method which for each possible legal move performs k random simulations of the game. The second one, called the x-player (extended statistical player), is based on the MCTS described in Section 3.2.

The p-player was mostly used as a benchmark to measure the performance of the s- and x-players in normal Nim. Since its strategy is not optimal in stochastic Nim, we implemented our own strategies, described in Sections 2.4.1 and 2.4.3, to be used as benchmarks in stochastic Nim. Our first strategy, which aims to maximize the probability that the nim sum is zero, is called the q1-player. Our second strategy is called the q2-player and aims to maximize the probability of a win.

Even though the goal was not to include any heuristic knowledge in the Monte Carlo-based strategies, we included one special case for all the players: if a game-ending move is possible, the player executes that move. This happens when all sticks are in the same heap, where the winning game-ending move is to take all the remaining sticks. The motivation for including this special case is that the move is trivial for a human player.

This motivation holds for normal Nim, but the move is no longer trivially optimal in stochastic Nim. However, we can show that it is still the optimal move regardless of r by formulating the probability that we win if we take the last stick and put our opponent in the state X_{0,0}. This is precisely the base case in Equation (8) of our second strategy in Section 2.4.3,

P[win | X_{0,0}] = 1/(1 + r).

Since 1/(1 + r) > 1/2 for all r ∈ [0, 1), we can conclude that the optimal move is to take all the sticks if they are in the same heap. The case r = 1 is not interesting, since the board is then always perturbed and no one can win.

All the implemented players are described in detail in the subsections below, which contain algorithms for each player; the players are summarized in Table 3. These algorithms return the heap, ρ, from which sticks are to be removed, as well as the number of sticks, σ, to be removed. Since all players handle the special case of performing the winning move when possible, it is excluded from the algorithms.

Table 3: Description of the players.

    Abbreviation | Description                 | Strategy                        | Benchmark
    r-player     | Random player               | Random moves                    | YES
    p-player     | Probability player          | Optimal move with probability p | YES
    q1-player    | Probability player          | Our first strategy              | YES
    q2-player    | Probability player          | Our second strategy             | YES
    s-player     | Statistical player          | Pure Monte Carlo method         | NO
    x-player     | Extended statistical player | Monte Carlo Tree Search         | NO

4.1 The r-player

The r-player is the simplest player: it chooses uniformly at random between all possible moves, so that each legal move has the same probability of being chosen regardless of the board. Since the number of possible moves equals the total number of sticks on the board, an integer is randomized between 1 and the total number of sticks; this number then represents one of the legal moves. Detailed pseudo-code is found in Algorithm 1, where τ is the total number of sticks.

Algorithm 1 r-player

    function ChooseMoveR
        Randomize n ∈ [1, τ] ⊂ N
        Initialize ρ = 1
        while n > {number of sticks in heap ρ} do
            n ← n − {number of sticks in heap ρ}
            ρ ← ρ + 1
        σ ← n
        return ρ, σ
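A C version of Algorithm 1 might look as follows (our sketch; note that rand() % tau has a slight modulo bias that a careful implementation would remove):

    #include <stdlib.h>

    /* Draw one of the tau legal moves uniformly and decode it into a heap
     * index *rho (0-based here) and a stick count *sigma. heaps[] are the
     * current heap sizes and tau their sum, assumed positive. */
    void choose_move_r(const unsigned *heaps, size_t n_heaps, unsigned tau,
                       size_t *rho, unsigned *sigma)
    {
        unsigned n = 1 + (unsigned)(rand() % tau);  /* n in [1, tau] */
        size_t h = 0;
        while (h + 1 < n_heaps && n > heaps[h]) {   /* walk past full heaps */
            n -= heaps[h];
            h++;
        }
        *rho = h;
        *sigma = n;
    }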

4.2 The p-player

The p-player takes as input a parameter, p, which is the probability that it makes the move that would be optimal in normal Nim. If the optimal move is not to be performed, the player makes a move at random following the pseudo-code in Algorithm 1. The optimal move follows the winning strategy in Section 2.2.2 and is described in detail in Algorithm 2.

Algorithm 2 p-player

    function ChooseMoveP
        Randomize γ ∈ [0, 1] ⊂ R
        X ← compute nim sum
        if γ > p or X = 0 then
            return random move
        else
            for ρ from 1 to {number of heaps} do
                n ← {number of sticks in heap ρ}
                if X ⊕ n < n then
                    σ ← n − (X ⊕ n)
                    return ρ, σ

4.3 The q1-player

The q1-player is an implementation of our first strategy, described in Section 2.4.1. The move with the highest estimated probability of producing a nim sum of zero is performed; the estimated probability is computed recursively according to Equation (4). The simplifications of Equation (4) using Bayes' theorem have not been considered in the implementation. Algorithm 3 summarizes the strategy. NimSumProb(ϕ) is an implementation of Equation (4) which recursively computes the estimated probability that the nim sum is equal to ϕ.

Algorithm 3 q1-player

    function ChooseMoveQ1
        for j from 1 to {number of heaps} do
            for i from 1 to {number of sticks in heap j} do
                prob(j, i) ← NimSumProb(0)
        ρ, σ ← arg max_{j,i} {prob(j, i)}
        return ρ, σ

4.4 The q2-player

The q2-player is an implementation of our second strategy, described in Section 2.4.3. The move is decided by computing which of the possible moves has the highest estimated probability of generating a win. The estimated probability is computed recursively according to Equation (7). Algorithm 4 summarizes how this is done. WinProb is an implementation of Equation (7) which recursively computes the estimated probability that a move generates a win.


Algorithm 4 q2-player

    function ChooseMoveQ2
        for j from 1 to {number of heaps} do
            for i from 1 to {number of sticks in heap j} do
                prob(j, i) ← WinProb
        ρ, σ ← arg max_{j,i} {prob(j, i)}
        return ρ, σ

4.5 The s-player

The s-player uses pure Monte Carlo simulations to evaluate the next move. The random simulations are distributed equally between the moves, and the decision is based on the statistics of the simulation results. Algorithm 5 describes in detail how the s-player was implemented, where k is the number of simulations performed and r is the perturbation rate. Simulation is a function that simulates the game with random moves and returns 1 if the simulation generated a win and 0 otherwise. The random simulations account for the stochastic behaviour of the first heap.

Algorithm 5 s-player

    function ChooseMoveS
        for κ from 1 to k do
            for j from 1 to {number of heaps} do
                for i from 1 to {number of sticks in heap j} do
                    stats(j, i) ← stats(j, i) + Simulation
        ρ, σ ← arg max_{j,i} {stats}
        return ρ, σ

    function Simulation
        Initialize win ← 1
        while board not empty do
            random move
            Randomize γ ∈ [0, 1] ⊂ R
            if γ < r then
                ρ_1 ← ρ_1 + 1
            win ← 1 − win
        return win
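A C sketch of the Simulation routine, reusing choose_move_r from the Section 4.1 sketch (our naming; heap index 0 is the stochastic heap ρ_1 of the pseudo-code):

    /* Play random moves until the board is empty; after every move a
     * stick is added to the stochastic heap with probability r. The win
     * flag is flipped once per move since the players alternate, so the
     * returned value is from the perspective of the player who moved
     * just before the simulation started. */
    int simulation(unsigned *heaps, size_t n_heaps, unsigned tau, double r)
    {
        int win = 1;
        while (tau > 0) {
            size_t rho;
            unsigned sigma;
            choose_move_r(heaps, n_heaps, tau, &rho, &sigma);
            heaps[rho] -= sigma;
            tau -= sigma;
            if ((double)rand() / RAND_MAX < r) {  /* perturbation */
                heaps[0] += 1;
                tau += 1;
            }
            win = 1 - win;
        }
        return win;
    }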

4.6 The x-player

The x-player implements Monte Carlo Tree Search as described in Section 3.2. It performs the four steps: selection, expansion, simulation and backpropagation. Algorithm 6 describes in detail how the x-player was implemented; k is the number of simulations performed. In this implementation the trade-off between exploration and exploitation, namely the c-parameter, is set to √2. MCTS is a function that updates the decision tree according to the four steps.


Algorithm 6 x-player

    function ChooseMoveX
        Initialize the root of the decision tree
        for κ from 1 to k do
            MCTS(current node)
        ρ, σ ← child node with highest visit count
        return ρ, σ

    function MCTS(node)
        if node is a leaf then
            expand node
            perform random simulation
            update node statistics
            return
        for each child node do
            compute UCT
        node_max ← arg max UCT
        MCTS(node_max)
        backpropagate result

4.7 Measurements

In order to evaluate the performance of the s- and x-players, we performed simulations where the two played against the r-, p-, q1- and q2-players. A player's performance can be measured by letting it play against the p-player: by computing the win rate for different values of p, the performance of the s- and x-players was measured without the presence of perturbation. In stochastic Nim, however, we did not vary the p-value, since the p-player no longer plays optimally. Instead we measured the win rate against the r-, p- and q-players while varying the perturbation rate r. In this case the p-value was set to 1.

In order to compare the s- and x-players in a fair way, we fixed the number of simulations per move to 1000. The s-player distributes its simulations equally between the moves, while the x-player distributes them according to the exploration-exploitation trade-off.

Since one game only gives a win or a loss, several games had to be played for each configuration in order to compute the win rate by the law of large numbers, Theorem 3. Furthermore, to decrease the variance, quite a large number of games had to be played. For the more computationally heavy algorithms, such as MCTS and the q-players, the simulation times reached many hours on a regular laptop.

Since the computation time for the q-players grows exponentially with the size of the board, due to their recursive structure, they could only be played on small boards. This is not ideal, since the point of the Monte Carlo methods is to perform a large number of simulations in order to decide on a move, and on a small board this exhausts the game. For this reason we performed simulations on both a small board and, for the other players, a big board. The small board consisted of 4 heaps and 8 sticks in total, and the big board of 12 heaps and 80 sticks in total. For the q1-player, however, the simulation times were orders of magnitude too large, even on the small board. For this reason, when playing with the q1-player we used a board with 4 heaps and 7 sticks in total, which reduced the computation times to a manageable level.

We did not want to limit the games to a specific board configuration, and for that reason a random board was created for each game, guaranteeing that each heap had at least one stick. To clarify, the small board, for example, had 8 sticks randomly distributed over 4 heaps.


5 Results

In this section we show our results from evaluating the s- and x-players in both normal Nim (Section 5.1) and stochastic Nim (Section 5.2) against the benchmark players r, p, q1 and q2. The benchmark players are also compared against each other in Section 5.3.

5.1 Nim

Figure 4 shows the win rate of the s- and x-players when playing against the p-player on both the small and the big board. The win rate is displayed as a function of the p-value. For each data point 10,000 games have been played, and the 50% win-rate line is displayed as a dashed line for clarity.


Figure 4: The win rate of the s- and x-players when playing against the p-player when varying the p-value on a board of (a) 4 heaps and 8 sticks, and (b) 12 heaps and 80 sticks.

5.2 Stochastic Nim

The figures in this section show the win rates of the s- and the x-players when varying the perturbation rate r. For each data point 10,000 games have been played, across all plots.

The win rates of the s- and x-players when playing against the r-player and the p-player are shown in Figures 5 and 6, respectively. Both figures contain two graphs, corresponding to games played on the small board and on the big board.

Figures 7 and 8 show the win rates of the s- and x-players when playing against the q1- and q2-players, respectively. As mentioned above, when playing against the q1-player a board with 4 heaps and 7 sticks was used; for the q2-player the regular small board of 4 heaps and 8 sticks was used.

Figure 9 shows the win rate of the x-player when playing against the s-player on both the small and the big board.



Figure 5: The win rate of the s- and x-players versus the r-player when varying the rate of perturbation, r, on a board of (a) 4 heaps and 8 sticks, and (b) 12 heaps and 80 sticks.


Figure 6: The win rate of the s- and x-players versus the p-player when varying the rate of perturbation, r, on a board of (a) 4 heaps and 8 sticks, and (b) 12 heaps and 80 sticks.

Figure 7: The win rate of the s- and x-players versus the q1-player when varying the rate of perturbation, r, on a board of 4 heaps and 7 sticks.


Figure 8: The win rate of the s- and x-players versus the q2-player when varying the rate of perturbation, r, on a board of 4 heaps and 8 sticks.


Figure 9: The win rate of the x-player versus the s-player when varying the rate of perturbation, r, on a board of (a) 4 heaps and 8 sticks, and (b) 12 heaps and 80 sticks.

5.3 Benchmark testing

The results presented in this section compare the four benchmark players r, p, q1 and q2. We investigated how they performed against each other and whether our strategies for the q1- and q2-players performed as expected.

In Figure 10 the win rate is shown as a heat map for two p-players playing against each other on both the small and the big board in normal Nim. This was done by varying the p-value of each player along its own axis. The heat maps consist of 500×500 data points, with 1000 games played for each data point. Contour lines are displayed to increase clarity.

The win rates of the q1- and q2-players versus the p-player, when varying the perturbation rate r, are shown in Figures 11 and 12, respectively. As mentioned before, a board of 4 heaps and 7 sticks was used with the q1-player, and the regular small board of 4 heaps and 8 sticks with the q2-player. In these graphs 10,000 games have been played for each data point.



Figure 10: The win rate in a game between two p-players on a board of (a) 4 heaps and 8 sticks, and (b) 12 heaps and 80 sticks in normal Nim. The win rate shown is that of player 1, whose p-value is on the y-axis.

Figure 11: The win rate of the q1-player versus the p-player when varying the rate of perturbation, r, on a board of 4 heaps and 7 sticks in stochastic Nim.

Figure 12: The win rate of the q2-player versus the p-player when varying the rate of perturbation, r, on a board of 4 heaps and 8 sticks in stochastic Nim.


6 Discussion

6.1 Choice of platform

The combinatorial properties of Nim make it a good platform for testing the performance of the Monte Carlo-based strategies. They also make it possible to compare the results to other combinatorial games such as Go, which has recently been used extensively as a testing platform for Monte Carlo-based methods. Go, as a combinatorial game, has an optimal strategy, but the immense complexity of the game makes it hard to find. The winning strategy in Nim, on the other hand, is known and easily implemented. The choice of Nim as a testing platform is also motivated by its clear scalability in both the number of dimensions and size, obtained by changing the number of heaps and the total number of sticks. This facilitates measuring the performance of the two Monte Carlo-based strategies, the s- and x-players. Since the use of Monte Carlo methods is motivated for large decision processes, the scalability of Nim helps ensure that the Monte Carlo-based strategies do not exhaust the process, while still allowing the strategies to be tested against the optimal strategy.

Nim has a non-uniform branching factor: the branching factor at a given node is equal to the total number of sticks left on the board. This implies that the complexity of Nim shrinks as the game progresses. At the beginning of a game Nim is highly complex, while at the end its complexity is minimal. The Monte Carlo-based strategies are therefore expected to exhaust the game as it progresses, even for large games. This, however, is to be expected in most sequential decision processes, where the impact of the first decisions is harder to examine than that of the last.

While the proven optimal strategy, the vast scalability and the equivalence to other impartial games are useful, Nim has some intrinsic properties that make it insufficient as the sole testing platform. Any suboptimal move in Nim is exactly as bad as all the other suboptimal moves. In other decision processes, however, one suboptimal move may be better than another. To test the Monte Carlo-based strategies on a platform which is presumed to have an ordering of the suboptimal moves, a stochastic version of Nim was developed. However, the stochasticity removes Nim's perfect information and its combinatorial character. It is thus beneficial to use both Nim and stochastic Nim as testing platforms in order to test the versatility of the Monte Carlo-based strategies, that is, whether the performance differs between the two environments.

6.2 Performance of the s- and x-players

The differences in performance between the s- and x-players are small but distinct. Interestingly, they perform differently depending on whether the stochasticity is present or not. In normal Nim the s-player, using pure Monte Carlo simulations, performs on average slightly better than the x-player using MCTS. This is probably a consequence of the fact that every move in Nim leads to either a winning or a losing position. The pure Monte Carlo simulations of the s-player therefore perform better in normal Nim than the guided Monte Carlo simulations in the MCTS of the x-player. The selection step, stemming from the Multi-Armed Bandit Problem, guides the simulations of the x-player, making them follow a quasi-uniform distribution. But if there is no clear ordering of which suboptimal moves are better, the guided statistical sampling introduces a bias and results in a worse statistical basis than equally distributed simulations. One could argue that some positions may have a higher proportion of possible optimal moves, increasing the probability that a random move is optimal; this, however, affects the game only a little.

Both the s- and x-players manage to perform better than a random player, that is, the p-player with parameter p equal to zero. At the other end of the spectrum, the win rates of the s- and x-players are zero for p = 1 on the big board, where the players have not exhausted the decision tree. This is explained by the fact that a single mistake at any point in the game results in a loss. When playing on the small board, both the s- and x-players manage to achieve a non-zero win rate even for p = 1. This can be explained by the fact that the game is small enough that the players can exhaust, or nearly exhaust, the decision tree. This is not desired, since the motivation for using a Monte Carlo-based method lies in cases where the search space is too large to exhaust.

References
