DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Modifying the learning process of the Expert Iteration algorithm

JOHAN SJÖBLOM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: October 10, 2019
Supervisor: Mika Cohen
Examiner: Mads Dam
School of Electrical Engineering and Computer Science
Host company: Totalförsvarets forskningsinstitut (FOI)

Swedish title: Modifiering av lärandeprocessen bakom algoritmen Expert Iteration


Acknowledgements

First of all, I would like to thank the Swedish Defense Research Agency for funding this thesis work and providing the resources necessary to complete it.

I also want to give a special thanks to my supervisor, Mika Cohen, for his guidance throughout the process, and particularly for many interesting and thought-provoking discussions, and to Farzad Kamrani, whose eye for detail significantly improved the quality of this report. Finally, a thank you to Thomas Anthony for inspiring this work and answering technical questions.


Abstract

This thesis sets out to improve the performance of the Expert Iteration (EXIT) algorithm - a simulation-based learning process to train an AI. EXIT and four different modifications are implemented and evaluated through a tournament of the board game Hex.

The results suggest that the training pipeline of EXIT could be significantly simplified without a loss of performance.


Sammanfattning

This thesis aims to improve the Expert Iteration (EXIT) algorithm, a simulation-based learning process for training an AI. EXIT and four different modifications are implemented and evaluated through a tournament in the board game Hex.

The results suggest that EXIT's training procedure can be simplified significantly without a loss of performance.


Contents

1 Introduction
    1.1 DeepMind’s 2016 breakthrough
    1.2 Superhuman performance tabula rasa
    1.3 Thesis purpose

2 Technical background
    2.1 General background
        2.1.1 Different approaches to learning
        2.1.2 Finite Markov decision processes
        2.1.3 Hex
    2.2 Monte Carlo tree search
        2.2.1 The Monte Carlo tree search framework
        2.2.2 In-tree policies
    2.3 AlphaGo - Using neural networks to guide Monte Carlo tree search

3 Expert Iteration
    3.1 The Expert Iteration framework
    3.2 Implementation details
        3.2.1 The apprentice
        3.2.2 The expert
    3.3 Training pipeline

4 Modifications to the Expert Iteration learning process
    4.1 The experiments
        4.1.1 Original Expert Iteration
        4.1.2 Experiment 1: Simplifying the training pipeline
        4.1.3 Experiment 2: Removing Monte Carlo tree search rollouts
        4.1.4 Experiment 3: Averaging several simulations for value network data generation
        4.1.5 Experiment 4: Acquiring value network data from the tree statistics of Monte Carlo tree search
    4.2 Head-to-head evaluation
    4.3 Implementation details

5 Results

6 Discussion and conclusions
    6.1 Experiment 1
    6.2 Experiment 2
    6.3 Experiment 3 and 4
    6.4 Concluding remarks

Bibliography

Chapter 1

Introduction

Learning - The acquisition of knowledge or skills through study, experience or being taught

Oxford English Dictionary

When a human is born it is capable of doing astoundingly little and knows barely anything. Yet adult humans can routinely carry out complicated tasks and amass large amounts of knowledge. The ability to learn, to adapt and improve behavior based on external stimuli, is one of the most fundamental factors behind the success of our species. Another such factor is our proficiency in creating and using tools. From the first stone tools a few million years ago to the technology of the last century, they have played an increasingly important role in improving our lives.

An essential step towards the tools of the future seems to be to transfer the skill of learning to our tools. Unfortunately, this has proven very hard. Humans are obviously capable of learning, but as we do not yet understand how the brain works, the process governing it is to a large degree unknown.

When a toddler learns to walk it has no explicit teacher. It has to learn from a combination of trial and error (experience) and imitation of others in its environment. These two fundamental approaches to learning, interaction and imitation, have spawned two fields of AI research: reinforcement learning (RL) and imitation learning (IL). However, over the last decade most success has been found in a third field: supervised learning.


Supervised learning algorithms require an (often very large) data set of labeled examples to train a function approximator. The methods of supervised learning have for a long time been focused on mathematical convenience rather than mimicking human learning, but interestingly, almost all of the last decade's remarkable progress has been due to a family of algorithms known as artificial neural networks, which, albeit simplistically, simulate how the neurons in a brain work.

1.1 DeepMind’s 2016 breakthrough

When Deep Blue defeated the world chess champion Kasparov in 1997, the Chinese board game Go was designated by many as the next grand challenge for AI. Go is harder than chess from an AI perspective primarily for two reasons:

• its search space is enormous, in fact there are more possible Go states than there are atoms in the universe, and

• the significance of a move may not be revealed until hundreds of moves later.

The complexity of Go renders traditional brute-force based search methods useless; a more refined, general AI is required.

It would take almost 20 years before AI finally reigned supreme in Go. In 2016, DeepMind researchers Silver et al. [1] presented the program AlphaGo, which went on to defeat the legendary professional Go player Lee Sedol. They had managed to combine ideas from reinforcement learning, imitation learning and supervised learning into a single powerful AI, and trained it with carefully crafted Go features, 30 million human professional games and a massive amount of self-play.

However, AlphaGo had a significant drawback: its reliance on a database of labeled examples and problem-specific feature extraction made it impossible to directly apply in other domains. Even though the professional Go-playing AI is an impressive engineering achievement, its general utility for humanity remains low.


1.2 Superhuman performance tabula rasa

The next major breakthrough came shortly after the release of AlphaGo. In 2017, both Silver et al. [2] and Anthony, Tian, and Barber [3] published algorithms which expanded the ideas behind AlphaGo to work tabula rasa, learning only through self-play. No human knowledge was required beyond the rules of the game. Silver et al. named their algorithm Alpha Zero; it would go on to beat AlphaGo 100-0 and achieve superhuman performance in the games of chess and shogi [4]. Anthony, Tian, and Barber named their algorithm Expert Iteration (EXIT) and applied it to the game of Hex, where it achieved state-of-the-art performance. Although the main underlying ideas behind Alpha Zero and EXIT are the same, there are a few differences (covered in [3]).

From a societal point of view, an AI capable of learning general tasks without human input would have extreme implications. Therefore, it is important to highlight that these algorithms are limited to domains which can be simulated quickly and accurately. Board games, and to an extent computer games, are examples where they excel, but real-world tasks such as diagnosing diseases or driving cars are out of reach with current technology.

1.3 Thesis purpose

The primary purpose of this thesis is to investigate whether the performance of the EXIT algorithm can be improved through small changes in its learning process. In this report, as in Anthony, Tian, and Barber [3], the learning task is confined to the board game of Hex, a simple environment well suited for experimentation. In the future, however, the host, the Swedish Defence Research Agency, hopes to be able to extend this work to combat simulations and tactical decision problems in military science.

EXIT was chosen over its DeepMind counterpart Alpha Zero primarily for two reasons:

• EXIT currently lacks a public implementation and the performance reported by Anthony, Tian, and Barber [3] has yet to be reproduced, and

• EXIT used orders of magnitude fewer computational resources than Alpha Zero.


Chapter 2

Technical background

The purpose of this chapter is to build the technical foundation required to understand and analyze the EXIT algorithm. The first section gives a general background, covering important learning concepts, a problem formalization and a short description of the board game Hex. The second section introduces the basic Monte Carlo tree search (MCTS) algorithm and different ways to improve it. Finally, the third section gives a short description of perhaps the most influential previous work, the AlphaGo algorithm.

For a more detailed background a good place to start would be Reinforcement Learning: An Introduction by Sutton and Barto [5].

2.1 General background

This section gives a brief introduction to:

• the different learning paradigms required to understand where EXIT comes from,

• the theory behind the Markov decision process, which is a mathematical framework for modeling sequential decision-making problems, and

• the board game Hex.

2.1.1 Different approaches to learning

In order for a program to be able to learn it needs access to experience. This experience can either be supplied to the program in precompiled data sets, or the program has to gather the experience itself (e.g. through simulation, observation or real-world interaction). The former category is further divided into two subcategories: supervised and unsupervised learning. The latter category is studied in the field of reinforcement learning.

The success of the EXIT framework can largely be attributed to its ability to combine ideas from supervised learning and reinforcement learning. An important niche of supervised learning which studies such intermediate algorithms is imitation learning.

Supervised and unsupervised learning

Supervised learning differs from unsupervised learning in that it requires the supplied data to be labeled (i.e. annotated with correct output labels by an external “supervisor”). The basic goal is to infer a function from input data to output labels which can generalize to unseen examples.

A lot of research has been devoted to this area, and some important algorithms include linear regression, naïve Bayes and support vector machines. However, lately the most impressive results have predominantly come from a set of algorithms known as artificial neural networks (ANN), and in particular deep neural networks. A lot of theory is required for a thorough understanding of ANNs; a popular place to start is Deep Learning by Goodfellow, Bengio, and Courville [6]. This topic is not further pursued here since the focus of this thesis is on data generation and how network evaluations are utilized, rather than on why ANNs work so well.

In unsupervised learning, the goal is often to find hidden structures or patterns in the (unlabeled) data. It is not as popular as supervised learning, but it has a few important usages, including clustering (e.g. k-means), feature extraction (e.g. autoencoders, principal component analysis) and generative adversarial networks.

Reinforcement learning

In reinforcement learning (RL), the general goal is to create an entity, often called an agent, able to maximize a numerical reward function by navigating through the surrounding environment. As opposed to supervised learning, the agent is not told which actions to take, and in the most general setting the results of actions and the reward function are unknown. A fundamental problem in RL is the trade-off between exploiting current knowledge and exploring to acquire new knowledge.

The canonical way to formalize RL problems, i.e. the interactions between an agent and its environment, is through a Markov decision process (MDP) model. The most basic parts of MDPs are introduced in the following section.

Imitation Learning

The goal of imitation learning (IL) is to train an apprentice to solve an RL problem with the help of an existing expert. The role of the expert, e.g. a human or a prohibitively slow algorithm, is to demonstrate how to satisfactorily carry out a specific task. The access to such an expert removes the reliance on a reward function and opens the door to the fast-converging methods of supervised learning.

However, supervised learning algorithms applied in the IL setting suffer an important drawback: samples drawn from the expert violate the i.i.d. (independent and identically distributed) assumption, leading to poor performance in theory and often in practice. Due to its imperfect actions, an apprentice might quickly find itself in a state the expert would never encounter. To remedy this problem, Ross, Gordon, and Bagnell [7] created the dataset aggregation (DAGGER) algorithm. In DAGGER, the training data is aggregated by iteratively querying the expert on states encountered by the current apprentice, as sketched below.
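A schematic of this aggregation loop, with assumed helper functions (rollout_states, expert_action, train) that are not part of the thesis code, might look as follows; Ross et al.'s full algorithm additionally mixes expert and apprentice actions during the rollouts.

    def dagger(apprentice, expert, n_iterations):
        """Schematic DAGGER loop: aggregate expert labels on states visited by the apprentice."""
        dataset = []
        for _ in range(n_iterations):
            # States are sampled with the *current* apprentice, so the training distribution
            # tracks the states the apprentice will actually encounter.
            states = rollout_states(apprentice)                         # assumed helper
            dataset += [(s, expert_action(expert, s)) for s in states]  # assumed helper
            apprentice = train(apprentice, dataset)                     # assumed helper
        return apprentice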

2.1.2 Finite Markov decision processes

The finite Markov decision process (MDP) is a mathematical formulation of a family of problems known as sequential decision making problems. It is a mathematically idealized form of the RL problem where an agent continuously interacts with its environment. The following introduction is a short summary based on the book Reinforcement Learning by Sutton and Barto [5].

Formally, the interactions between the agent and its environment are discretized into a sequence of time steps. At each time step $t$, the agent receives an observation of the environment's state, $s_t$, and a reward, $r_t$, and then executes an action, $a_t$. This creates a trajectory:

$$s_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, s_3, \ldots$$


which, if the task is episodic, eventually ends in a terminal state, or otherwise carries on indefinitely. The goal of the agent is to choose actions so that the expected accumulated reward is maximized.

Because the Markov assumption holds (i.e. the next state, $s_{t+1}$, and reward, $r_{t+1}$, depend only on the current state, $s_t$, and action, $a_t$), the dynamics of an MDP can be expressed as:

$$p(s', r \mid s, a) := \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$$

for all $s', s \in \mathcal{S}$, $r \in \mathcal{R}$, $a \in \mathcal{A}(s)$, where $\mathcal{S}$, $\mathcal{R}$ and $\mathcal{A}$ denote the sets of states, rewards and actions.

Because $p$ specifies a probability distribution, a three-argument function can be obtained by summing out the fourth argument, for example the state transition probabilities:

$$p(s' \mid s, a) := \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

Summing over both rewards and states, the expected reward of taking an action in a state is obtained:

$$r(s, a) := \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$

It is the goal of the agent to maximize this expected reward in the long term, i.e. to maximize the accumulated reward. In some settings a discount factor is introduced to bias the agent towards valuing short-term reward more than long-term, but for board games such as Hex the natural reward is zero in every state but the last, where a positive reward indicates a win and a negative reward indicates a loss.

Policy and value functions

The only influence an agent has over its reward is the sequence of actions it chooses. Therefore, a natural way to describe the behavior of an agent is through a function that defines a probability distribution over all actions for each state. This function is often called a policy and is denoted:

$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$$


If an agent is acting according to the policy $\pi$, the total expected return from a state is given by the value function:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{n} R_{t+k+1} \,\middle|\, S_t = s\right]$$

where $\mathbb{E}_\pi$ denotes the expectation over sequences generated by following the policy $\pi$, and $R_{t+n+1}$ is the reward after the final, terminal, state.

2.1.3 Hex

Hex is a two-player strategy board game. The players take turns placing a stone of their color, often black or white, on a hexagonal rhombus-shaped playing board. The winner is whoever forms a solid connection between their two opposing sides.

From an AI point of view Hex shares many attractive qualities with the more frequently used board game Go. Primarily:

• its rules are simple, deterministic and games always terminate with a binary outcome,

• its state space is discrete and finite,

• its board size is easily increased so that the state space is too big for brute force search algorithms (solving an arbitrary Hex position is PSPACE-complete [8]), and

• the long-term effect of moves can be hard to foresee, giving way to complex strategy.

Furthermore, Hex has a few advantages over Go. Hex is fully observable from the current stones, thus no history features need to be recorded. Moreover, adding stones to an already decided game does not change its outcome, allowing for quick random “rollouts” (see section 2.2). The win condition itself is also cheap to check, as illustrated by the sketch below.
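Determining the winner reduces to a connectivity query. The following is a minimal sketch (not the thesis implementation; the board representation is an assumption) that checks black's north-south connection with a union-find structure:

    class UnionFind:
        """Tiny union-find used to track connected groups of stones."""
        def __init__(self, size):
            self.parent = list(range(size))

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    # The six neighbours of a cell on the hexagonal rhombus board.
    HEX_NEIGHBOURS = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)]

    def black_has_won(board):
        """board[r][c] is 'b', 'w' or None; True if black connects the north and south sides."""
        n = len(board)
        uf = UnionFind(n * n + 2)              # two virtual nodes for the north/south edges
        north, south = n * n, n * n + 1
        for r in range(n):
            for c in range(n):
                if board[r][c] != 'b':
                    continue
                cell = r * n + c
                if r == 0:
                    uf.union(cell, north)
                if r == n - 1:
                    uf.union(cell, south)
                for dr, dc in HEX_NEIGHBOURS:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n and board[rr][cc] == 'b':
                        uf.union(cell, rr * n + cc)
        return uf.find(north) == uf.find(south)

The same check with colors and sides swapped gives white's win condition.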

2.2 Monte Carlo tree search

The following summary is, unless otherwise specified, based on Sutton and Barto [5].


Monte Carlo tree search (MCTS) is a general search framework building on Monte Carlo methods originally proposed in the 1940s, but it did not find much use until its big resurgence after a 2006 paper by Coulom [9]. MCTS builds an approximation of the optimal policy of a state, $\pi(a \mid s)$, through a combination of tree search and simulations. For over a decade it has been one of the most popular methods in many different challenging domains of search.

2.2.1 The Monte Carlo tree search framework

The basic algorithm is conceptually simple. It iteratively constructs a game tree (a directed graph where nodes represent game states and edges represent moves), starting with the current game state as root. When the algorithm is finished, the estimated value of each move will be the number of times its edge was traversed during construction.

Selection

The first step is selection. Each iteration starts from the root node, and the search tree is traversed based on an in-tree policy until an optimal expansion node is reached. If the optimal expansion node happens to be terminal, the second and third steps are skipped.

The in-tree policy is perhaps the most important part of the algorithm. It has to manage the conflict between exploitation and exploration, between immediate certain reward and possible higher long-term reward, when selecting nodes.

The choice of optimal expansion node, on the other hand, is often just the first node encountered which has children not already incorporated in the tree.

Expansion

In the second step, the selected node from the previous step is expanded with a new child which is added to the search tree. To create the new child a previously unexplored action is taken from the node.

In the simplest case, the action is chosen uniformly at random.

Simulation

In the third step, starting from the newly created node, a rollout (sometimes referred to as a playout) policy is followed until a terminal state is reached.


In the simplest case, a uniform random policy is used, but the literature gives many examples of how carefully crafted rollout policies, often incorporating domain knowledge, can improve the algorithm's performance.

Backpropagation

In the final step, backpropagation, the terminal state reached in the previous step is evaluated and the result added to all traversed (in-tree) nodes.

In the simplest case, the nodes only record the number of times they have been chosen by the in-tree policy (i.e. how many times they have been “visited” across iterations) and the average result of the respective rollouts.
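The four steps combine into a compact loop. Below is a minimal, generic sketch of that loop, written against an assumed Game interface (legal_moves, play, is_terminal, result) and parameterized by the in-tree policy; sign handling for the alternating players is deliberately omitted, and none of this is the thesis implementation.

    import random

    class Node:
        def __init__(self, state, parent=None):
            self.state = state                      # a Game instance (assumed interface)
            self.parent = parent
            self.children = {}                      # move -> Node
            self.untried = list(state.legal_moves())
            self.visits = 0                         # n(s)
            self.reward = 0.0                       # accumulated rollout reward

    def mcts(root_state, in_tree_policy, iterations=1000):
        root = Node(root_state)
        for _ in range(iterations):
            node = root
            # 1. Selection: follow the in-tree policy while the node is fully expanded.
            while not node.untried and node.children:
                node = in_tree_policy(node)
            # 2. Expansion: add one child for a previously untried action.
            if node.untried and not node.state.is_terminal():
                move = random.choice(node.untried)
                node.untried.remove(move)
                child = Node(node.state.play(move), parent=node)
                node.children[move] = child
                node = child
            # 3. Simulation: uniform random rollout until a terminal state.
            state = node.state
            while not state.is_terminal():
                state = state.play(random.choice(state.legal_moves()))
            outcome = state.result()                # assumed: reward from the root player's view
            # 4. Backpropagation: update visit counts and rewards along the traversed path.
            while node is not None:
                node.visits += 1
                node.reward += outcome
                node = node.parent
        # The estimated best move is the most visited edge from the root.
        return max(root.children, key=lambda m: root.children[m].visits)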

2.2.2 In-tree policies

As stated previously, the effectiveness of MCTS crucially depends on the employed in-tree policy. A successful in-tree policy has to balance the immediate reward of choosing actions greedily (exploitation) with the possibility of a higher long-term reward when discovering new actions to exploit (exploration).

A lot of different procedures to handle this exploitation/exploration dilemma have been proposed [10]; following is a small selection of interesting policies from the literature.

ε-greedy

One of the simplest approaches is to add a random exploring component to the greedy policy. A hyperparameter, ε, controls the extent of the exploration, such that an action is selected uniformly at random (as opposed to greedily) with probability ε.

Theoretically, as the number of iterations goes to infinity, the estimated values of actions converge to their real values. However, such asymptotic guarantees have been shown to have limited practical effectiveness.

Upper confidence bound for trees

Intuitively, a good policy should start out heavy on exploration and gradually, as the variance of its estimates decreases, pivot towards exploitation. Fortunately, such a policy was discovered by Auer, Cesa-Bianchi, and Fischer [11] for the closely related family of multiarmed bandit problems. It was called UCB (Upper Confidence Bounds), which became UCT (UCB applied to trees) when Kocsis and Szepesvári [12] applied it to MCTS.

Adopting the notation used in [3], the UCT bound is:

$$UCT(s, a) = \frac{r(s, a)}{n(s, a)} + c_b \sqrt{\frac{\log n(s)}{n(s, a)}}$$

where

• r(s, a) is the sum of all rewards obtained through simulations passing through that edge,

• n(s, a) is the number of simulations passing through that edge,

• n(s) is the number of simulations passing through that node, and

• $c_b$ is a constant controlling the amount of exploration.

The first term is simply the estimated value of the selected action. The second term attempts to control exploration by estimating the variance of the first term (i.e. starting out strong but decreasing as more rollout results are added to the first term). When action $a$ is selected both $n(s)$ and $n(s, a)$ are incremented and so the second term decreases, but when any other action is selected only $n(s)$ is incremented, and so the second term increases. Furthermore, because of the logarithm in the second term, its value decreases over time while retaining its asymptotic behavior.

The action selected by the UCT in-tree policy is the one with the maximum bound:

$$a = \operatorname*{argmax}_a \big[\, UCT(s, a) \,\big]$$
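Mapped onto the node statistics recorded during backpropagation, UCT selection is a short function over the children of a node. The sketch below reuses the Node class from the MCTS sketch above (an assumption) and the exploration constant reported later in the thesis:

    import math

    def uct_select(node, c_b=0.05):
        """Pick the child maximizing r(s,a)/n(s,a) + c_b * sqrt(log n(s) / n(s,a))."""
        def uct(child):
            exploit = child.reward / child.visits
            explore = c_b * math.sqrt(math.log(node.visits) / child.visits)
            return exploit + explore
        return max(node.children.values(), key=uct)

    # Usage with the earlier sketch: best_move = mcts(start_state, uct_select, iterations=10_000)
    # (every child has at least one visit when selection is applied, so there is no division by zero).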

Rapid action value estimation

One of the drawbacks of UCT is that the initial value estimates of actions have very high variance. To make reasonably accurate comparisons between actions, several samples of each action are required. In settings with a high branching factor, this severely limits the possible depth of the MCTS. To remedy this problem Gelly and Silver [13] came up with the rapid action value estimate (RAVE) policy.


The core of RAVE is the all-moves-as-first heuristic (AMAF, [14]). The idea behind AMAF is simple: if an action was strong at time step $t$ it was likely also strong at previous time steps $t' < t$. To incorporate RAVE in the search tree, two new statistics need to be saved in the nodes: $r_{RAVE}(s, a)$ and $n_{RAVE}(s, a)$. Similarly to UCT, $r_{RAVE}$ denotes the accumulated RAVE value estimate of a node and $n_{RAVE}$ its simulation count. However, while the UCT statistics are only updated when a simulation passes through them, the RAVE statistics are updated for every action that was taken subsequently as if it had been taken immediately.

Formally, after a sequence $s_1, a_1, s_2, a_2, \ldots, s_T$ the RAVE values are updated for each state $s_{t_1}$ in the sequence and every subsequent action $a_{t_2}$ (where $a_{t_2}$ is a valid action from $s_{t_1}$), with $t_1 < t_2$ and $\forall t < t_2,\, a_t \neq a_{t_2}$:

$$n_{RAVE}(s_{t_1}, a_{t_2}) := n_{RAVE}(s_{t_1}, a_{t_2}) + 1$$
$$r_{RAVE}(s_{t_1}, a_{t_2}) := r_{RAVE}(s_{t_1}, a_{t_2}) + R$$

where $R$ is the reward of the sequence. Note that in multi-agent environments, such as two-player games, actions by player A do not contribute to the RAVE statistics of player B.

As stated initially, RAVE introduces much quicker learning (lowering of variance), but it comes at the cost of some bias since its core assumption, the AMAF heuristic, is often inaccurate. To get the best of both UCT and RAVE there is a mixing schedule to combine them. The idea is that the low-variance (but inaccurate) RAVE values are useful while the UCT values still have high variance, but as the search continues the accuracy of the UCT values makes them preferable over the RAVE values.

Since the effectiveness of RAVE and the length of the search depend on the domain and setting, Gelly and Silver propose a general linear combination (again, with notation adopted from [3]):

$$\pi_{in\text{-}tree}(s, a) = \beta(s, a)\,\frac{r_{RAVE}(s, a)}{n_{RAVE}(s, a)} + \big(1 - \beta(s, a)\big)\,\frac{r(s, a)}{n(s, a)} + c_b \sqrt{\frac{\log n(s)}{n(s, a)}}$$

where $\beta(s, a)$ is the mixing schedule. A popular choice is:

$$\beta(s, a) = \sqrt{\frac{c_{RAVE}}{3n(s) + c_{RAVE}}}$$


This formulation of $\beta$ introduces a hyperparameter $c_{RAVE}$, sometimes referred to as an equivalence parameter since it stipulates the number of simulations before UCT and RAVE are given equal weight.

In the original paper, Gelly and Silver also included an exploration term in the RAVE term, but in a subsequent paper the authors decided to remove it, arguing that it is hard to justify explicit RAVE exploration: many actions will be evaluated by AMAF, regardless of which action is actually selected at turn t.

Similar to the case of pure UCT, the actual in-tree policy decision is simply the argmax of the $\pi_{in\text{-}tree}$ action bounds.
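As a concrete illustration, the combined bound for a single edge can be computed directly from the raw counts. The sketch below uses the equivalence parameter and exploration constant reported later in the thesis; the function and argument names are illustrative, not taken from the thesis code:

    import math

    def rave_beta(n_s, c_rave=3000):
        # Mixing schedule: close to 1 when n(s) is small, decaying towards 0 as s is visited more.
        return math.sqrt(c_rave / (3 * n_s + c_rave))

    def uct_rave_score(r, n, r_rave, n_rave, n_s, c_b=0.05, c_rave=3000):
        """Combined UCT/RAVE bound for one edge (s, a).

        r, n           -- UCT reward sum and visit count of the edge
        r_rave, n_rave -- RAVE (AMAF) reward sum and count of the edge
        n_s            -- visit count of the parent node s
        """
        b = rave_beta(n_s, c_rave)
        rave_term = r_rave / n_rave if n_rave > 0 else 0.0
        uct_term = r / n if n > 0 else 0.0
        exploration = c_b * math.sqrt(math.log(n_s) / n) if n > 0 else float('inf')
        return b * rave_term + (1 - b) * uct_term + exploration

The in-tree policy then simply takes the argmax of this score over the available actions.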

2.3 AlphaGo - Using neural networks to guide Monte Carlo tree search

Probably the most important previous work for understanding where EXIT came from is the paper Mastering the game of Go with deep neural networks and tree search [1], in which Silver et al. present the algorithm AlphaGo. AlphaGo was the first Go-playing AI ever to beat professional human players, and it gained worldwide recognition after defeating the legendary Go player Lee Sedol in a five-game match.

AlphaGo is essentially an intricate training pipeline to train neural networks capable of guiding an MCTS algorithm. In the first stage of training, a large database of professional human games is used to train two move-prediction networks:

• a large convolutional neural network (CNN) used to guide the in-tree policy, and

• a linear softmax network of small pattern features used to guide the rollouts.

In the second stage, the CNN from the previous stage is used in an RL scheme of self-play to generate new data. The new data is used to further improve the policy network of the first stage. In the third and final stage, the network resulting from the previous stage is again used in an RL scheme of self-play to generate more data. The new data is used to train a board-value network with an architecture similar to the policy network (but with a single tanh unit as output).

The in-tree policy used is:

$$a_t = \operatorname*{argmax}_a \big( Q(s_t, a) + u(s_t, a) \big)$$

where $Q(s_t, a)$ is a linear combination of the value network evaluation and the rollout value, and $u(s_t, a)$ is an exploration term proportional to the policy network value divided by the number of visits (so that it decays over time, just like UCT exploration).


Chapter 3

Expert Iteration

This thesis is primarily based on the paper Thinking Fast and Slow with Deep Learning and Tree Search wherein Anthony, Tian, and Barber [3] present both the framework Expert Iteration (EXIT) and specific implementations of the framework.

The first section of this chapter summarizes the EXIT framework while the second focuses on the implementation details of the algorithm which Anthony et al. found to work best.

3.1 The Expert Iteration framework

The EXIT framework can be seen as a marriage of imitation learning (IL) and RL. The goal of the combination is to retain both the quick convergence properties typical of supervised learning approaches such as IL and the generality and applicability to unknown domains typical of RL.

The main limitation of IL systems is their reliance on an existing expert. In many domains, such experts are either expensive, unreliable, or simply unavailable. Even in domains where experts are readily available, they effectively impose a ceiling on the possible performance of the system. The main idea behind EXIT is to remove this limitation through an expert improvement step based on RL techniques.


The EXIT framework in pseudo-code:

    π̂_0 = initial_policy()
    π_0 = build_expert(π̂_0)
    for i = 1; i ≤ max_iterations; i++ do
        S_i = sample_self_play(π̂_{i-1})
        D_i = {(s, imitation_learning_target(π_{i-1}(s))) | s ∈ S_i}
        π̂_i = train_policy(D_i)
        π_i = build_expert(π̂_i)
    end for

As can be inferred from the function names, π̂_i is the apprentice and π_i the expert. Similar to IL systems, the most important aspect of EXIT is the choice of expert and apprentice. Because the expert has to be able to improve itself based on the apprentice, Anthony et al. argue that the canonical setup is a tree-search expert using a neural network apprentice to assist its search.

Because the expert uses the apprentice to guide its search, it is automatically improved if the apprentice is improved. Therefore, as long as the expert can generate training data to improve the apprentice it will improve itself implicitly. Generating a single improved data sample is a simple two-step procedure: first a state is randomly selected from an apprentice-guided self-play game, and then an MCTS is launched from that position, resulting in an improved sample. Looking back at the pseudocode, this effectively voids the build_expert step, as the expert is automatically improved by improving the apprentice.

Another important choice is how to handle the generated data. In the basic formula given above the previous data set D_{i-1} is discarded every new iteration, but previous work has shown that it can be beneficial to instead aggregate data over iterations [7]. Indeed, Anthony et al. found this to be the case in their experiments.
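Put together, the framework reduces to a single loop over data generation and apprentice training. The following is a schematic sketch of that loop under assumed helper functions (self_play_states, mcts_policy_target, value_target, train_network); it mirrors the pseudocode above but is not the thesis implementation:

    import random

    def expert_iteration(network, n_iterations, samples_per_iteration, aggregate=True):
        """Schematic EXIT loop: the expert is an MCTS guided by the current apprentice."""
        dataset = []
        for _ in range(n_iterations):
            new_samples = []
            while len(new_samples) < samples_per_iteration:
                # 1. Pick a state from an apprentice-guided self-play game (assumed helper).
                state = random.choice(self_play_states(network))
                # 2. Query the expert: an apprentice-guided MCTS from that state (assumed helpers).
                policy = mcts_policy_target(state, network)
                value = value_target(state, network)
                new_samples.append((state, policy, value))
            # 3. Aggregate data over iterations (DAGGER-style) instead of discarding D_{i-1}.
            dataset = dataset + new_samples if aggregate else new_samples
            # 4. Train the apprentice (assumed helper); the expert improves implicitly
            #    because its search is guided by the better apprentice.
            network = train_network(network, dataset)
        return network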

3.2 Implementation details

In their experiments, Anthony et al. test several different versions of an MCTS expert and neural network apprentice. This thesis focuses on the combination they found to be most successful. What follows is a summary of that algorithm.


3.2.1 The apprentice

The apprentice is an ANN built up from a bulk of convolutional layers which finally splits into four different output heads:

• a black policy-head,

• a white policy-head,

• a black value-head, and

• a white value-head.

The policy-heads are referred to as “the policy network” and the value-heads as “the value network”.

Input features

The input format is borrowed from Young, Vasan, and Hayward [15]. The Hex board is first extended with a two-stone-thick border of dummy stones, with the stones in the extra rows on the west and east sides colored white and those on the north and south sides black. Corner stones belonging to both a white side and a black side are colored both white and black. This special edge padding essentially does not change the board state, but it helps make convolutions centered on (the original) board edges more meaningful. Finally, the board is split up into six separate channels; a 7x7 Hex board is thus transformed to the input layout 11x11x6 (a sketch of this encoding follows the channel list). The channels represent:

• all black stones,

• black stones connected to the north side,

• black stones connected to the south side,

• all white stones,

• white stones connected to the west side, and

• white stones connected to the east side.
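A minimal sketch of this encoding, assuming a board representation where board[r][c] is 'b', 'w' or None and marking the border stones only in the plain stone channels (a simplifying assumption), could look as follows:

    import numpy as np

    HEX_NEIGHBOURS = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)]

    def connected_to_edge(board, colour, seeds):
        """Cells of `colour` connected through same-colour chains to the given edge cells."""
        n = len(board)
        reached, stack = set(), [s for s in seeds if board[s[0]][s[1]] == colour]
        while stack:
            r, c = stack.pop()
            if (r, c) in reached:
                continue
            reached.add((r, c))
            for dr, dc in HEX_NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n and board[rr][cc] == colour:
                    stack.append((rr, cc))
        return reached

    def encode(board):
        """Encode an n x n Hex board as an (n+4) x (n+4) x 6 tensor with a two-stone border."""
        n = len(board)
        x = np.zeros((n + 4, n + 4, 6), dtype=np.float32)
        x[:2, :, 0] = x[-2:, :, 0] = 1.0      # north/south border rows: black stones
        x[:, :2, 3] = x[:, -2:, 3] = 1.0      # west/east border columns: white stones
        b_north = connected_to_edge(board, 'b', [(0, c) for c in range(n)])
        b_south = connected_to_edge(board, 'b', [(n - 1, c) for c in range(n)])
        w_west = connected_to_edge(board, 'w', [(r, 0) for r in range(n)])
        w_east = connected_to_edge(board, 'w', [(r, n - 1) for r in range(n)])
        for r in range(n):
            for c in range(n):
                if board[r][c] == 'b':
                    x[r + 2, c + 2, 0] = 1.0
                    x[r + 2, c + 2, 1] = float((r, c) in b_north)
                    x[r + 2, c + 2, 2] = float((r, c) in b_south)
                elif board[r][c] == 'w':
                    x[r + 2, c + 2, 3] = 1.0
                    x[r + 2, c + 2, 4] = float((r, c) in w_west)
                    x[r + 2, c + 2, 5] = float((r, c) in w_east)
        return x

For a 7x7 board this yields the 11x11x6 layout described above, with the corner border cells marked as both black and white.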


Convolutional bulk

The first part of the network consists of 13 convolutional layers using exponential linear units. The first eight convolutional layers, as well as layer 12, are identical: they zero-pad the input (preserving dimensions) and convolve it with 64 3x3 kernels with stride 1. Layers 9 and 10 differ in that they do not pad the input, and layers 11 and 13 do not pad the input and use a 1x1 kernel.

Since the topology of the board is a hexagonal grid, the normal square kernel is replaced by a hexagonal kernel (by zeroing the top-left and bottom-right corners).

Output heads

There are four parallel output heads after the convolutional bulk: a black policy output, a black value output, a white policy output and a white value output.

The policy outputs consist of a fully connected layer with softmax activation and the board size, $9 \times 9 = 81$, as output dimension. Prior to applying the softmax, illegal moves are removed. The value outputs consist of a fully connected layer with sigmoidal activation and a scalar output.

Training and prediction

The network is trained on mini-batches of 250 samples with the Adam optimizer and early stopping (3-epoch limit). The policy outputs are trained with the Kullback-Leibler (KL) divergence on tree-policy targets, the average tree policy of the MCTS at the root, giving the loss:

$$L = -\sum_a \frac{n(s, a)}{n(s)} \log[\pi(a \mid s)]$$

The value outputs are trained with the corresponding binary cross-entropy (KL) loss:

$$L = -z \log[V(s)] - (1 - z) \log[1 - V(s)]$$

where V is the network output and z the (binary) target.
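As a concrete check of the two losses, the snippet below evaluates them with NumPy for a single state; it restates the formulas above and is not the Keras training code:

    import numpy as np

    def policy_loss(visit_counts, policy_output, eps=1e-12):
        """Cross-entropy against the tree-policy target n(s,a)/n(s)."""
        target = visit_counts / visit_counts.sum()
        return -np.sum(target * np.log(policy_output + eps))

    def value_loss(z, v, eps=1e-12):
        """Binary cross-entropy between the outcome z in {0, 1} and the value output V(s)."""
        return -z * np.log(v + eps) - (1 - z) * np.log(1 - v + eps)

    # Example: three legal moves visited 700, 200 and 100 times by the MCTS.
    counts = np.array([700.0, 200.0, 100.0])
    pi = np.array([0.6, 0.25, 0.15])        # softmax output of the policy head
    print(policy_loss(counts, pi))          # ~0.82
    print(value_loss(1.0, 0.8))             # ~0.22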

3.2.2 The expert

The expert is an MCTS algorithm running for 10 000 iterations (with expansion threshold 1). The rollout policy is simply uniform random, but the in-tree policy utilizes both UCT and RAVE as well as policy and value estimates from the apprentice. The formula for the in-tree policy is:

$$\beta(s, a)\left[\frac{r_{RAVE}(s, a)}{n_{RAVE}(s, a)} + c_b \sqrt{\frac{\log n_{RAVE}(s)}{n_{RAVE}(s, a)}}\right] + \big(1 - \beta(s, a)\big)\left[\frac{r(s, a)}{n(s, a)} + c_b \sqrt{\frac{\log n(s)}{n(s, a)}}\right] + w_a \frac{\pi(a \mid s, \tau)}{n(s, a) + 1} + w_v \hat{Q}(s, a)$$

where $w_a$ and $w_v$ are weights, $\beta(s, a) = \sqrt{\frac{k}{3n(s) + k}}$ and $k$ is a constant. $\pi(a \mid s, \tau)$ is acquired from the policy head of the apprentice ($\tau$ is the temperature parameter of the softmax output) and $\hat{Q}(s, a)$ is the backed-up average of the apprentice value-head estimates at the edge $(s, a)$. The $\hat{Q}(s, a)$ values are backed up and averaged in the same way as the UCT values, $r(s, a)$.

The general hyperparameter values Anthony et al. found to work best with 10 000-iteration EXIT were the following (a sketch combining them with the formula above is given after the list):

• exploration constant $c_b = 0.05$,

• RAVE equivalence parameter $k = 3000$,

• policy network weight $w_a = 100$, and

• value network weight $w_v = 0.75$.
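Combining the formula with these hyperparameter values, the per-edge score can be written out directly; the sketch below is illustrative (the function and argument names are assumptions), with the policy probability and value estimate supplied by the apprentice network:

    import math

    C_B, K_RAVE, W_A, W_V = 0.05, 3000, 100, 0.75    # values reported by Anthony et al.

    def exit_edge_score(r, n, r_rave, n_rave, n_s, n_s_rave, pi_a, q_hat):
        """In-tree score for one edge (s, a) of the EXIT expert.

        r, n           -- UCT reward sum / visit count of the edge
        r_rave, n_rave -- RAVE reward sum / count of the edge
        n_s, n_s_rave  -- UCT / RAVE visit counts of the parent node s
        pi_a           -- apprentice policy-head probability pi(a | s, tau)
        q_hat          -- backed-up average of apprentice value-head estimates for the edge
        """
        b = math.sqrt(K_RAVE / (3 * n_s + K_RAVE))
        rave = (r_rave / n_rave + C_B * math.sqrt(math.log(n_s_rave) / n_rave)) if n_rave > 0 else 0.0
        uct = (r / n + C_B * math.sqrt(math.log(n_s) / n)) if n > 0 else 0.0
        return b * rave + (1 - b) * uct + W_A * pi_a / (n + 1) + W_V * q_hat

The expert then expands its search by repeatedly selecting the action with the highest score.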

3.3 Training pipeline

The training pipeline employed by Anthony, Tian, and Barber [3] can be divided into three steps:

1. randomly initialize the network and train it using a 1 000-iteration MCTS agent for self-play and a 10 000-iteration MCTS agent for expert play,

2. continue training according to the EXIT framework, i.e. using the network (apprentice) for self-play, but without using the value network, and finally,

3. switch on the value network and continue training.

In their experiment, roughly 10% of the training data was generated in step 1, 10% in step 2 and the remaining 80% in step 3.


Chapter 4

Modifications to the Expert Iteration learning process

The primary goal of this chapter is to propose variations that could improve the EXIT algorithm. An essential question is therefore what constitutes an improvement. In this thesis, the focus is on head-to-head performance and algorithm simplicity. It should be noted, however, that the comparisons are not completely fair, since the variations differ slightly in execution time and the pre-defined parameters might favor specific experiments.

Section 4.1 covers the different EXIT variations explored in this thesis, section 4.2 explains the relative performance evaluation and, finally, section 4.3 explains some implementation differences from the original EXIT algorithm presented in the previous chapter.

4.1 The experiments

Without swapping out the core components of EXIT, the MCTS expert and neural network apprentice, possible variations are primarily confined to MCTS policies, data generation and data aggregation methods. In this project, the original EXIT as well as four different variations were implemented.

The Hex board size was set to 7x7 and the different configurations were each trained for 15 iterations with 4 800 samples aggregated each iteration (totaling 72 000 samples).


4.1.1 Original Expert Iteration

Because of the small amount of data used in these experiments (compared to the experiments in [3]), giving a fair representation of the original EXIT is hard. A compromise was to allocate, out of the total 15 iterations, 1 iteration to step 1, 7 iterations to step 2, and 7 iterations to step 3.

4.1.2 Experiment 1: Simplifying the training pipeline

To greatly reduce the complexity of the original 3-step training procedure, the first two steps were discarded and all training iterations were spent on step 3.

Both the policy network and the value network were randomly initialized and switched on from the beginning. This is similar to how Alpha Zero works.

4.1.3 Experiment 2: Removing Monte Carlo tree search rollouts

In standard EXIT, the value of a newly expanded MCTS node is estimated through a combination of the value network and a random rollout. In this experiment, to reduce the complexity of the algorithm and possibly save computational resources, the random rollout was skipped. The hypothesis was that the time spent doing random rollouts could be better used searching deeper, guided solely by the policy and value networks. This is also in accordance with how Alpha Zero works. The formula for the in-tree policy simplifies to:

$$w_a \frac{\pi(a \mid s, \tau)}{n(s, a) + 1} + w_v \hat{Q}(s, a) + c_b \sqrt{\frac{\log n(s)}{n(s, a)}}$$

To guarantee exploration, the UCT exploration term was retained.

4.1.4 Experiment 3: Averaging several simulations for value network data generation

To generate a value network data sample, a single greedy rollout guided by the policy network is carried out. The sample is then labeled 1 on a win and 0 on a loss.

This rollout takes at most 49 moves, which is in stark contrast to the 10 000 iterations of MCTS used to generate a policy sample. Spending so little time on the presumably important value samples could be inefficient.


To find better sample estimates, 100 non-greedy rollouts guided by the policy network (softmax temperature set to 0.5) were averaged to produce each new sample. Since the performance of this experiment is primarily relevant relative to the first experiment, it used the same simplified training pipeline.

4.1.5 Experiment 4: Acquiring value network data from the tree statistics of Monte Carlo tree search

Another way to generate a value network data sample is to use the backed-up tree statistic as the value network target (i.e. $\hat{Q}(s, a)$ from section 3.2.2). This way, no extra computational power was required and the algorithm became even simpler, since the information was already available.

4.2 Head-to-head evaluation

Similar to both Anthony, Tian, and Barber [3] and Silver et al. [1][2], the head-to-head performance in this thesis was measured with Elo ratings¹. The agents played a round-robin tournament where each match consisted of 24 pre-set opening moves played as black and white respectively, so each match consisted of a total of 48 games.

Two different tournaments were run, one with the different MCTS agents and one with the different neural networks (using greedy move selection). Since the MCTS tournament was computationally heavy, the agents were only evaluated at three different points: before any training, at iteration 7 and at iteration 15.

¹ A tool created by R. Coulom was used: https://www.remi-coulom.fr/Bayesian-Elo/
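The ratings themselves were estimated with the Bayesian Elo tool referenced above, but the scale can be read with the standard Elo expected-score formula; the snippet below is only a reminder of what a rating gap means, not the estimator used for the figures:

    def elo_expected_score(rating_a, rating_b):
        """Expected score (win probability, draws counted as 0.5) of A against B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    # A 200-point Elo advantage corresponds to scoring roughly 76% against the weaker agent.
    print(round(elo_expected_score(1200, 1000), 2))    # 0.76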


4.3 Implementation details

The main problem overarching this thesis project was limited access to computational resources. To run the proposed experiments on a 9x9 board with the same amount of data as Anthony, Tian, and Barber [3] (roughly 8 000 000 samples per experiment) would take several hundred days on the available hardware. The main reasons behind this are:

1. 10 000 MCTS iterations are required to generate a single sample,

2. the source code for this project was written in Python, a relatively slow programming language, and

3. the computer cluster used was small, with 5-year-old CPUs and no GPUs.

To remedy this problem the number of data samples of each experiment was reduced to 72 000 and the board size lowered to 7x7. This both reduced the size of the CNN (lowering the X and Y dimensions of all layers reduced the number of trainable parameters from roughly 1 200 000 to 700 000) and made games slightly shorter. However, the UCT count of the “best” move after 10 000 iterations of MCTS was almost exactly the same, so the number of iterations was not reduced.

Unless otherwise specified, all parameter values were copied from Anthony, Tian, and Barber [3]. Some of the parameter values used during the experiments were probably sub-optimal, but unfortunately not enough computational resources were available to run parameter optimization.

To ensure a high-quality solution, the critical MCTS parts were inspired by the fuego/benzene project, the same project underlying arguably the most successful Hex AI of all time, MoHex [16][17]. The CNN was implemented using the TensorFlow/Keras library. To simplify the implementation, normal batch normalization was used in place of the proposed normalization propagation [18].


Chapter 5

Results

The results of the MCTS tournament are displayed in figure 5.1. Clearly, experiment 2 performed very poorly compared to the other agents. After completing its 15 iterations of training, experiment 2 could not even compete with the random-initialized versions of the other agents. For analysis of the other experiments, figure 5.2 is provided; it excludes experiment 2 and therefore shows a smaller y-scale. The performance of standard EXIT as well as experiments 1 and 3 seems almost indistinguishable, and while experiment 4 performs slightly worse it is not far behind. The more fine-grained results of the CNN tournament, displayed in figures 5.3 and 5.4, reinforce the relative performance picture painted by the MCTS tournament.

It should also be noted that the improvement curve of experiment 2 is not as reliable as the others. Even at its best, experiment 2 still lost to the worst of the other agents. Therefore its improvement is only measured through performance against other stages of itself, which could introduce self-play bias.

Finally, because of the small amount of data used and different board size, it is impossible to draw meaningful parallels between the performance of the standard EXIT evaluated here and the performance reported by Anthony, Tian, and Barber [3].


[Figure 5.1: Tournament results (Elo vs. training iteration) for the expert (MCTS) agents, including experiment 2.]

[Figure 5.2: Tournament results (Elo vs. training iteration) for the expert (MCTS) agents, excluding experiment 2.]


[Figure 5.3: Tournament results (Elo vs. training iteration) for the apprentice (CNN) agents, including experiment 2.]

[Figure 5.4: Tournament results (Elo vs. training iteration) for the apprentice (CNN) agents, excluding experiment 2.]


Chapter 6

Discussion and conclusions

This thesis project was hamstrung by long execution times and lack of computational resources. The results are therefore based on a, relatively speaking, minuscule amount of data, which severely limits their general applicability. It is of course possible, maybe even likely, that the relative performance of the experiments would be maintained during longer training sessions, but neural-network-based algorithms do not always follow intuition or simple reason. With that said, the results still carry some significance.

6.1 Experiment 1

Experiment 1 produced perhaps the most surprising and interesting result. Initially, it was hypothesized that it would perform noticeably worse than standard EXIT since it heavily simplified the training pipeline. Therefore, its on-par performance must be considered a success and calls into question the necessity of different training stages for the EXIT algorithm.

Anthony, Tian, and Barber [3] write that “With sufficiently large datasets, a value network can be learned to improve the expert further [...]” and while this is undoubtedly true, the results of experiment 1 suggest that the value network can be engaged from the start without a loss of performance.

6.2 Experiment 2

Unsurprisingly, experiment 2 did not perform well. While the idea of relying solely on data generated by the CNN is interesting, 72 000 samples are probably not enough to get beyond the starting point, where the algorithm only learns from examples in which a winning move is within the MCTS search depth.

For this experiment configuration to be successful, it would probably also require some changes to the data aggregation process. Specifically, discarding earlier samples after a number of iterations could help reduce the relatively large set of useless data generated by the slower initial training.

6.3 Experiment 3 and 4

Experiments 3 and 4 targeted the value network data generation. As these experiments do not meaningfully reduce the complexity or run time of the standard algorithm, their usage should be motivated by performance, but no clear advantage is evident from the data. On the contrary, experiment 4 seems to be detrimental and can probably be labeled a failure. Experiment 3 performs on par with the standard algorithm, and its relative performance should increase as the training process continues, but whether it would be worth the increase in execution time remains to be seen.

6.4 Concluding remarks

It is clear from this project that extra complexity and intuitively sound ideas do not necessarily translate into increased performance. This, together with the fact that conditions might change as the training progresses, makes formulating a one-size-fits-all solution very hard.

Looking forward, an interesting idea would be to incorporate an algorithm evaluation step into the training to automatically select the currently best configuration, including not only different MCTS parameters but also changes to the algorithm itself. Hopefully, making the EXIT implementation of this project public will help facilitate such efforts as well as help reproduce the results presented by Anthony, Tian, and Barber [3].


Bibliography

[1] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), p. 484.

[2] David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550.7676 (2017), p. 354.

[3] Thomas Anthony, Zheng Tian, and David Barber. “Thinking fast and slow with deep learning and tree search”. In: Advances in Neural Information Processing Systems. 2017, pp. 5360–5370.

[4] David Silver et al. “Mastering chess and shogi by self-play with a general reinforcement learning algorithm”. In: arXiv preprint arXiv:1712.01815 (2017).

[5] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. ISBN: 9780262039246.

[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

[7] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. “A reduction of imitation learning and structured prediction to no-regret online learning”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 627–635.

[8] Stefan Reisch. “Hex ist PSPACE-vollständig”. In: Acta Informatica 15.2 (1981), pp. 167–191.

[9] Rémi Coulom. “Efficient selectivity and backup operators in Monte-Carlo tree search”. In: International Conference on Computers and Games. Springer, 2006, pp. 72–83.

[10] Cameron B. Browne et al. “A survey of Monte Carlo tree search methods”. In: IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012), pp. 1–43.

[11] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time analysis of the multiarmed bandit problem”. In: Machine Learning 47.2-3 (2002), pp. 235–256.

[12] Levente Kocsis and Csaba Szepesvári. “Bandit based Monte-Carlo planning”. In: European Conference on Machine Learning. Springer, 2006, pp. 282–293.

[13] Sylvain Gelly and David Silver. “Combining online and offline knowledge in UCT”. In: Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 273–280.

[14] B. Brügmann. Monte Carlo Go. www.joy.ne.jp/welcome/igs/Go/computer/mcgo.tex. 1993.

[15] Kenny Young, Gautham Vasan, and Ryan Hayward. “NeuroHex: A deep Q-learning Hex agent”. In: Computer Games. Springer, 2016, pp. 3–18.

[16] Broderick Arneson, Ryan B. Hayward, and Philip Henderson. “Monte Carlo tree search in Hex”. In: IEEE Transactions on Computational Intelligence and AI in Games 2.4 (2010), pp. 251–258.

[17] Shih-Chieh Huang et al. “MoHex 2.0: A pattern-based MCTS Hex player”. In: International Conference on Computers and Games. Springer, 2013, pp. 60–71.

[18] Devansh Arpit et al. “Normalization propagation: A parametric technique for removing internal covariate shift in deep networks” (2016). arXiv: 1603.01431.

