
AlphaZero with Input Convex Neural Networks

SHUYUAN ZHANG

KTH ROYAL INSTITUTE OF TECHNOLOGY
Master in Machine Learning
Date: July 27, 2020
Supervisor: John Folkesson
Examiner: Hossein Azizpour
School of Electrical Engineering and Computer Science
Host company: RISE AB


Abstract

Modelling and solving real-life problems using reinforcement learning (RL) approaches is a typical and important branch of artificial intelligence (AI). For board games, AlphaZero has proved successful in games such as Go, Chess, and Shogi against professional human players and other AI counterparts. The basic components of the AlphaZero algorithm are MCTS tree search and deep neural networks for state value and policy prediction. These deep neural networks are designed to fit the mapping function between a state and its value/policy, making the initialization of the state value/policy more accurate. In this thesis project, we propose Convex-AlphaZero, which exploits a new prediction structure for the state value and policy, and test its feasibility by providing theoretical evidence and experimental results. Instead of using one feed-forward pass to get these values, our adaptation treats the problem as an optimization process by using input convex neural networks, which can model the state value as a convex function of the policy given the state (i.e. the game board configuration). The results of our experiments show that our method outperforms traditional mini-max approaches and is worth further research on applying it to games other than the Connect Four used in this thesis project.


Sammanfattning

(Swedish abstract, translated.) Modelling and solving real-life problems using reinforcement learning (RL) approaches is a typical and important branch of artificial intelligence (AI). For board games, AlphaZero has proved successful in games such as Go, Chess, and Shogi against professional human players and other AI counterparts. The basic components of the AlphaZero algorithm are MCTS tree search and deep neural networks for state value and policy prediction. These deep neural networks are designed to fit the mapping function between a state and its value/policy, making the initialization of the state value/policy more accurate. In this thesis project, we propose Convex-AlphaZero, which exploits a new prediction structure for the state value and policy, and test its feasibility by providing theoretical evidence and experimental results. Instead of using one feed-forward pass to get these values, our adaptation treats the problem as an optimization process by using input convex neural networks, which can model the state value as a convex function of the policy given the state (i.e. the game board configuration). The results of our experiments show that our method outperforms traditional mini-max approaches and is worth further research on applying it to games other than the Connect Four used in this thesis project.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Research Question
  1.4 Scope, challenges, and limitations
  1.5 Contributions
  1.6 Societal Impacts
  1.7 Ethical Considerations
  1.8 UN SDG Goals
  1.9 Acknowledgements
2 Background
  2.1 Reinforcement Learning
    2.1.1 Basics of RL
    2.1.2 Markov Decision Processes (MDP)
    2.1.3 Sampling Methods in RL
    2.1.4 Policy Optimization Using Policy Gradient Methods
    2.1.5 Model-based and Model-free RL
    2.1.6 Deep Reinforcement Learning
    2.1.7 Exploration and Exploitation
  2.2 AlphaZero Overall Description
    2.2.1 Deep Neural Network in AlphaZero
    2.2.2 Monte Carlo Tree Search in AlphaZero
    2.2.3 Playing
    2.2.4 Replay Buffer
  2.3 Input Convex Neural Networks
    2.3.1 Structure of ICNN
    2.3.2 Inference in ICNN
    2.3.3 Training in ICNN
    2.3.4 ICNN in AlphaZero
    2.3.5 Application of ICNNs in RL
  2.4 The Game of Connect Four
    2.4.1 Game Introduction
    2.4.2 Previous Solutions of the Game
3 Methods
  3.1 Input Convex Neural Network in AlphaZero
    3.1.1 Network Structure
    3.1.2 Inference
    3.1.3 Training
    3.1.4 Causality Reasoning
  3.2 Variance Reduction
    3.2.1 Uniform Policy
    3.2.2 Merging Matching States
  3.3 Other Details
    3.3.1 Game State Representation
    3.3.2 Extending Data Set
    3.3.3 Clipping And Normalizing Policies
4 Results
  4.1 Training Curves
  4.2 Player Strength Comparison
    4.2.1 Play with a Mini-max Agent
    4.2.2 AlphaZero vs Convex-AlphaZero
    4.2.3 Winning Rate under Different Decision Time
  4.3 Experiments about the ICNN
    4.3.1 Raw Network Performance
    4.3.2 Average Game Length
5 Discussion
  5.1 Comments on General Performance
  5.2 Comments on ICNN Performance
  5.3 Limitations
  5.4 Future Work
6 Conclusions


1 Introduction

Human-computer competition in board games has been a hot topic in computer science and artificial intelligence for decades. As early as 1948, Alan Turing showed the possibility of letting intelligent machines play games such as chess, bridge, and poker like a human [1]. Then, in 1956, the first chess program, Los Alamos [2], was developed by Paul Stein and Mark Wells for the MANIAC I computer, which opened the era of playing chess and other games with computers.

Mini-max [3] is a basic algorithm used in game-playing programs. The most challenging task for such programs is searching game states efficiently. Chess, one of the most famous board games, has about $10^{47}$ [4] possible game states. A complete search of such a large game space is impossible even with modern supercomputers. As a result, scientists tried hard to reduce the scale of the search, either by limiting the depth of the Mini-max search or by pruning [5]. The most successful implementation of a Mini-max based approach was Deep Blue [6], a chess program that defeated the reigning human world chess champion, Kasparov, in 1997.

When it comes to Go, however, these algorithms no longer work, because Go has about $10^{170}$ game states ($10^{123}$ times more than Chess), which makes the Mini-max based approach infeasible. Although the game of Go is extremely complex, it did not discourage scientists, who turned to more modern approaches such as reinforcement learning (RL) [7] and deep learning. The combination of RL and search algorithms finally gave birth to AlphaGo [8], which defeated the human world champion Lee Sedol in 2016.

1.1 Motivation

The direct motivation of this project is to propose and evaluate a new variant of AlphaZero, called Convex-AlphaZero, which treats the state value as a function of both the state and the policy using input convex neural networks. AlphaZero achieved state-of-the-art results when compared with other AI programs, but its network maps states directly to their values, ignoring the relationship between policies and values. Convex-AlphaZero may benefit from introducing an extra causal link between the current policy and the value using a two-input, one-output network. We would like to investigate the performance of the proposed method and see how the modification affects the agent's overall performance.

The motivation of this project goes beyond applying Convex-AlphaZero to board games. In real-life automatic control or planning, the amount of data can be extremely large, data pre-processing is expensive, and supervised learning often seems infeasible. However, reinforcement learning algorithms like AlphaZero, which do not need human labor to process the data and reduce computing costs through sampling, may introduce a new form of artificial intelligence to these practical areas.

1.2 Problem Definition

AlphaZero currently uses a feed-forward network to predict a given state's value and best policy. We change this network to an input convex neural network to model the state value in a different way. Input convex neural networks have been proven effective in areas such as multi-label classification, image completion, and continuous-action reinforcement learning. The main problem we want to discuss can be boiled down to:

How does replacing the convolutional neural network in AlphaZero with an input convex neural network affect the player strength and the winning rate against the original version when applied to the game of Connect Four?

1.3 Research Question

We would like to investigate the following questions in addition to the main problem defined in section 1.2:

• Does the inference process of ICNNs help to find a better policy after the network is sufficiently trained?

• Are ICNNs compatible with the Monte Carlo tree search algorithm, so that they do not interfere with each other when predicting state values and policies?

1.4 Scope, challenges, and limitations

We propose the method of Convex-AlphaZero together with its definitions and training techniques. Data and experiments are provided to validate the method's effectiveness. However, neither a strict mathematical proof nor sufficient experiments are provided in this thesis to fully confirm or reject the proposed modifications, and our experiments are limited to a single game (Connect Four). The main challenge of this project is the lack of computing resources: Convex-AlphaZero obtains policies by optimizing over a convex function, which costs much more than the single forward pass used in AlphaGo and AlphaZero. To cope with this, we have to limit the network size and the number of optimization steps, which can hurt model performance.

Due to this challenge, our project has some limitations. We do not systematically optimize hyperparameters to find the best settings for model performance, nor do we run enough independent experiments for statistical purposes; for most of the results we average over only 10 independent trials. Finally, we assume that our results on a single game can be transferred to other similar games, and thus do not apply the model to other game types.

1.5 Contributions

This thesis project puts forward the idea that, when applied to board games, reinforcement learning algorithms should consider the relationship between state value and policy. It provides a new model structure for game-playing reinforcement learning agents.

1.6 Societal Impacts

As our method proposes a new causal model for analyzing board games, it can be generalized to any Markov Decision Process with a similar structure in real life. This can help to further investigate current reinforcement learning algorithms and find potential ways to improve them, which has a positive effect on practical applications such as resource management and planning.

1.7 Ethical Considerations

Current AI algorithms may need huge amounts of data to achieve satisfactory performance, which can pose a threat to privacy. This project focuses on learning intelligent agents through self-exploration, without the need for an external dataset, so a successful implementation can reduce the reliance on personal data. Our method can also be used for social, economic, or ecological purposes, because it learns an intelligent agent that can be applied to social and ecological resource management.

1.8 UN SDG Goals

This thesis project is aligned with the UN Sustainable Development Goals.

1.9 Acknowledgements

This project is based on previous work [9] by Fredrik Carlsson and Joey Öhman from the KTH School of Electrical Engineering and Computer Science. I also want to thank Dr. Ather Gattami from RISE and Dr. John Folkesson from KTH for supervising my master's thesis.

2 Background

In this chapter we introduce the basic concepts of reinforcement learning, together with a detailed description of the AlphaZero algorithm and of input convex neural networks.

2.1 Reinforcement Learning

Reinforcement learning (RL) is a field of study concerned with teaching intelligent agents how to perform actions in an environment so as to maximize expected future rewards. It is considered one of the three main branches of Machine Learning (ML) [10], together with supervised learning and unsupervised learning. RL is widely used in academia and industry for various purposes, from game playing to the automatic control of dynamical systems.

2.1.1 Basics of RL

In reinforcement learning, there is an environment, which can be defined as a combination of state transition probabilities and reward functions. The environment dynamics decide how states transform into each other and how much feedback an agent can get when performing certain actions.

The agent is another basic component of reinforcement learning and serves as a synonym for "the algorithm used in the environment". The agent interacts with the environment by deciding which actions to take and receiving feedback (reward) from the environment.

Unlike traditional supervised learning, the data used in RL is obtained by the agent itself. In other words, there is no pre-obtained static data set, and the decisions made by the agent affect the amount and quality of the data it can get.

2.1.2 Markov Decision Processes (MDP)

The Markov Decision Process (MDP) is the most widely used mathematical model in RL. There are four fundamental elements in an MDP model: state, action, reward, and transition probability.

State means a distinctive configuration of the environment. The environment can only be in one state at a given time step, and may jump into other states according to some intrinsic mechanisms in future steps.

Action means a choice the agent can take to change the current state. When an action is performed, the current state often changes according to the transition probabilities.

Reward means the feedback an agent gets by performing a certain action in a state, according to the environment settings. The goal of every RL problem is to get as much total reward as possible in the long run.

Transition probabilities are a set of functions that define how the state changes when an action is taken by the agent. They define the dynamics of the MDP and often take the following form:

$$p(s' \mid s, a) = P\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\}, \quad \forall s, s' \in S, \ \forall a \in A \quad (2.1)$$

From the above four basic definitions we can further introduce the concepts of policy and state value.

Policy is a mapping function from each state to a probability of selecting each possible next action.

State value of a state s under policy π is the expected future reward if the agent starts in state s and keeps choosing actions according to the policy π. It has different forms in finite and infinite MDPs.

In a finite MDP, the state value function is defined in equation 2.2:

$$v_t^{\pi}(s) \equiv \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{T} R_{t+k+1} \,\middle|\, S_t = s\right] \quad (2.2)$$

In an infinite MDP, the number of time steps is infinite, so we need to add a discount factor γ to prevent infinite state values, as shown in equation 2.3:

$$v^{\pi}(s) \equiv \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} R_{t+k+1} \cdot \prod_{m=1}^{k} \gamma_{t+m} \,\middle|\, S_t = s\right] \quad (2.3)$$

The state-action value function, also known as the q function, is defined as the expected return starting from state s, taking action a, and then following policy π (in an infinite scenario), as shown in equation 2.4:

$$q^{\pi}(s, a) \equiv \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a] = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} R_{t+k+1} \cdot \prod_{m=1}^{k} \gamma_{t+m} \,\middle|\, S_t = s, A_t = a\right] \quad (2.4)$$

The ultimate goal of all RL algorithms is to find the best policy, i.e. the one that maximizes every state's value function. To evaluate a policy in an MDP, one can use the Bellman equation; to find the optimal policy, one can use the policy iteration [11] or value iteration [12] algorithm. Detailed descriptions of these algorithms are not covered in this thesis because they are far from the main topic.

2.1.3 Sampling Methods in RL

After fully modelling an RL problem as an MDP, one can in principle solve it using the value iteration or policy iteration algorithms mentioned above. However, this ideal situation rarely happens in real life: either the problem scale is too large to solve the MDP model completely, or some information about the environment is missing, so a perfect model is impossible. In these cases, evaluating and optimizing policies becomes harder, and sampling methods should be applied to evaluate a proposed policy.

The most straightforward sampling method is the Monte-Carlo method. The core of this method is to keep sampling state-action pairs in an episode according to the current policy, and then evaluate the policy and state values using the average returns of the sampled episodes, as shown in algorithm 1.

Algorithm 1 Monte-Carlo prediction algorithm

1: for episode i ∈ {1, ..., n} do
2:   generate an episode τ_i under policy π
3:   G = 0
4:   for t ∈ {T, T−1, ..., 0} do
5:     G = λG + R_{t,i}
6:     if S_{t,i} ∉ {S_{0,i}, ..., S_{t−1,i}} then
7:       V^{(i)}(S_{t,i}) = V^{(i−1)}(S_{t,i}) + (1/i)(G − V^{(i−1)}(S_{t,i}))
8:     end if
9:   end for
10: end for

Although simple and easy to deploy, the Monte-Carlo method has limitations: it only works on finite-length episodes and it does not exploit the Markovian nature of MDPs. To evaluate a policy in an infinite scenario, Temporal Difference (TD) methods should be used instead, as shown in algorithm 2.

Algorithm 2 TD(0) algorithm

1: Initialize V(s) = 0, ∀s ∈ S
2: for n ∈ {1, ..., N} do
3:   Initialize S
4:   while S is not terminal do
5:     A = action given by π
6:     Take action A, observe R, S'
7:     V(S) = V(S) + α[R + λV(S') − V(S)]
8:     S = S'
9:   end while
10: end for

Both methods listed above are sampling methods for policy and state evaluation. They update the state values with regard to the rewards in the sampled episodes. These methods need neither a complete understanding nor a thorough exploration of the environment, and are thus good solutions to the problems mentioned before.
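To make the update above concrete, the following Python sketch implements tabular TD(0) policy evaluation. It is only an illustration: the `env.reset()`/`env.step()` interface and the `policy` callable are assumptions made for this example, not part of the thesis code.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation, mirroring Algorithm 2 (sketch)."""
    V = defaultdict(float)                    # V(s) initialized to 0 for all states
    for _ in range(episodes):
        s = env.reset()                       # assumed interface: returns a hashable state
        done = False
        while not done:
            a = policy(s)                     # assumed: policy maps state -> action
            s_next, r, done = env.step(a)     # assumed: returns (next_state, reward, done)
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # TD(0) update from Algorithm 2, line 7
            s = s_next
    return V
```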

2.1.4 Policy Optimization Using Policy Gradient Methods

Policies can also be optimized directly with the help of parameterization. As in any other optimization problem, once we construct a parameterized policy, the final expected total reward can be expressed in terms of the parameters:

$$J(\theta) = \sum_{\tau} \pi_\theta(\tau) R(\tau) \quad (2.5)$$

Here τ denotes an episode sampled under the policy and $\pi_\theta(\tau)$ is the probability that episode τ is drawn under the policy.

To find the best policy, i.e. the one that maximizes the expected total reward function J(θ), one needs to optimize the function with respect to the parameters θ. Gradient methods are again good candidates for this task. Using an unbiased estimator of the gradient,

$$\nabla J(\theta) = \mathbb{E}\!\left[\left(\sum_{t=1}^{T} \nabla \log \pi_\theta(s_t, a_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \quad (2.6)$$

the optimization can be carried out directly. Policy gradient methods turn the problem of finding the best policy into an optimization problem, which can be solved by a variety of mathematical approaches.
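As an illustration of equation 2.6, the sketch below builds a surrogate loss for a single sampled episode whose gradient is the negated one-sample estimate of ∇J(θ), so that minimizing it performs gradient ascent on J(θ). The variable names and the assumption that `log_probs` were recorded with autograd enabled are illustrative, not part of the thesis implementation.

```python
import torch

def reinforce_surrogate_loss(log_probs, rewards):
    """Single-episode surrogate loss matching the estimator in equation 2.6 (sketch).

    log_probs: list of tensors log pi_theta(s_t, a_t) recorded during the rollout.
    rewards:   list of floats r(s_t, a_t) for the same episode.
    """
    episode_return = torch.tensor(sum(rewards), dtype=torch.float32)  # constant factor
    return -(torch.stack(log_probs).sum() * episode_return)
```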

2.1.5 Model-based and Model-free RL

Reinforcement learning can be further divided into two sub-categories [13]: model-based RL and model-free RL. In model-based RL [14], the agent has an explicit model of the environment, like the MDP mentioned before, with the expected future reward obtained from the underlying model; the world model, which contains information about the transition functions, is accessible to the agent. On the contrary, in model-free RL [15], an agent learns completely through trial and error, without prior information from a world model.

AlphaZero, the main topic of this project, is a model-based RL algorithm because we have an explicitly defined transition model of board games: the next game state can be obtained easily given the previous game state and the action. This property makes it possible to design a corresponding RL algorithm around the game model. More precisely, the model allows us to combine reinforcement learning with tree search algorithms, and this is exactly what AlphaZero does: it uses Monte Carlo Tree Search (MCTS) [16] to limit the scope of the search as well as to generate training samples for the deep neural networks.

2.1.6 Deep Reinforcement Learning

In traditional (model-free) reinforcement learning, the agent usually needs to learn a Q-function which evaluates the value of every state-action pair. Many algorithms, such as Q-learning [17] and SARSA [18], were developed for this. However, when applied to more complex problems, these traditional methods rarely work because the Q-table becomes too large to learn; in other scenarios, the problem lies in a continuous world and the Q-table becomes infinite. One way to solve this is to use function approximation methods to approximate the Q-function (Q-table). In deep RL, the approximation method used is a deep neural network, which tries to approximate the mapping function between states and Q values. One such deep (model-free) RL algorithm is deep Q-learning, as shown in algorithm 3.

Algorithm 3 Deep Q-Learning with ER and fixed targets

1: Initialize θ, φ, and replay buffer B
2: for every episode e = 1, ..., E do
3:   Initialize the first state s_1
4:   Test the model on test states and record results
5:   for every time step t = 1, ..., T do
6:     Select action a_t according to an ε-greedy policy
7:     Take action a_t and get r_t and s_{t+1}
8:     Store (s_t, a_t, r_t, s_{t+1}) in the replay buffer
9:     Randomly choose K samples (s_i, a_i, r_i, s'_i) from the buffer for training
10:    for i = 1, ..., K do
         y_i = r_i                             if the episode stops in s'_i
         y_i = r_i + λ max_b Q_φ(s'_i, b)      otherwise
11:      Update θ: θ = θ + α (y_i − Q_θ(s_i, a_i)) ∇_θ Q_θ(s_i, a_i)
12:    end for
13:  end for
14:  At the end of every episode, update the target network φ ← θ
15: end for

Q_θ(s_i, a_i) and Q_φ(s_i, a_i) in algorithm 3 are two Q functions approximated by neural networks with parameters θ and φ, respectively. The online network θ is trained by minimizing the empirical risk:

$$J(\theta) = \frac{1}{2} \sum_{i} \left(y_i - Q_\theta(s_i, a_i)\right)^2 \quad (2.7)$$
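The loss and the fixed-target construction of Algorithm 3 can be sketched as follows. The module and tensor names (`q_net`, `target_net`, the `batch` layout) are assumptions made for this example, not the thesis implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Empirical risk of equation 2.7 on one sampled mini-batch (sketch).

    q_net, target_net: modules mapping states to Q-values of shape (batch, n_actions).
    batch: tensors (s, a, r, s_next, done); a is a long tensor of action indices,
    done is a 0/1 float tensor.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q_theta(s_i, a_i)
    with torch.no_grad():                                     # fixed targets from phi
        max_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next               # y_i from Algorithm 3
    return 0.5 * F.mse_loss(q_sa, y, reduction="sum")         # J(theta)
```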

Deep Q-learning has proved successful in playing video games. In 2013, Volodymyr Mnih et al. [19] used the deep Q-learning algorithm presented above to learn policies directly from high-dimensional sensory inputs (game images). They tested the algorithm on various Atari games and obtained satisfactory results.

In AlphaZero, however, the deep RL algorithm takes another form. Because of its model-based properties, major modifications must be made, including changing the neural network structure and redefining the learning targets. The overall idea is still to approximate the value function and use the learnt model to predict actions for game states that have never been seen before. More details of deep RL in AlphaZero are given in section 2.2.

2.1.7 Exploration and Exploitation

Finding a balance between exploration and exploitation [20] is a constant problem for reinforcement learning algorithms. In a partially observable environment [21], the agent needs to not only take the optimal action according to known information (exploitation) but also deviate from the current optimal action and try different ones to gain new knowledge (exploration).

In Q-learning, exploration is done by choosing the next action according to an ε-greedy policy:

$$a = \begin{cases} \arg\max_b Q(s_t, b), & \text{with probability } 1 - \varepsilon, \\ \text{uniform}(A_{s_t}), & \text{with probability } \varepsilon \end{cases} \quad (2.8)$$

The optimal action given by the current Q-function is chosen with probability 1 − ε, while with probability ε an action is drawn uniformly from the action space A to ensure exploration.

In AlphaZero, exploration is performed during Monte Carlo tree search when choosing the next action:

$$a = \arg\max_a \big(Q(s, a) + U(s, a)\big) \quad (2.9)$$

Q(s, a) denotes the mean state value of the child node reached by performing action a in state s, and U(s, a) is derived from an upper confidence bound, which is high for nodes that are rarely visited. More details follow in section 2.2.

2.2 AlphaZero Overall Description

In 2016, AlphaGo [8] defeated the world champion Lee Sedol in Go, which opened a new era for deep reinforcement learning. The 'deep' part of AlphaGo consists of two neural networks, the policy network and the value network. The policy network was first trained on human expert positions and then optimized through self-play. The value network, as a counterpart of the policy network, was trained with the help of the policy network to maximize the winning rate. After the training procedure, both networks were used during play, together with Monte Carlo Tree Search to limit the scope of the search.

AlphaGo Zero [22] was developed later, in 2017, to free AlphaGo from learning from human expert moves. Instead of using two networks, AlphaGo Zero has only one network with two output heads. The network receives game states as input and outputs action probabilities p_i and the final winner v_i. In the MCTS self-play phase, AlphaGo Zero uses its current network to estimate action probabilities, expands nodes according to those probabilities, and then updates the values of game states for further training. In this way, AlphaGo Zero was able to learn completely by itself.

The third-generation AlphaGo, called AlphaZero [23], is nearly identical to AlphaGo Zero. The main difference is that it generalizes the algorithm of AlphaGo Zero to other board games, showing that the structure of CNN + RL + MCTS applies to a variety of board games.

2.2.1 Deep Neural Network in AlphaZero

The deep neural network used in AlphaZero is a convolutional neural network [24] with residual blocks [25]. The input is the game state and the output has two heads: one for the move probabilities (p) and one for the winning rate at that game state (v). The whole network can be expressed by equation 2.10:

$$(p, v) = f_\theta(s) \quad (2.10)$$

The equation shows that the neural network, parameterized by θ, approximates the mapping function between the input state s and the output (p, v). The structure of this network is shown in figure 2.1.

Figure 2.1: Structure of the deep NN in AlphaZero

To train this network, we generate training data by combining states s, search probabilities π, and the game winner z into training samples. The loss function is defined by equation 2.11:

$$l = (z - v)^2 - \pi^T \log p + c\|\theta\|^2 \quad (2.11)$$

The aim is to maximize the similarity of the policy vector p to the search probabilities π using a cross-entropy loss, while minimizing the mean-squared difference between the predicted winner v and the actual game winner z. The additional term $c\|\theta\|^2$ is an L2 penalty that prevents the network from overfitting.
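The loss in equation 2.11 can be written down directly in PyTorch; a minimal sketch is shown below. The argument names (`p_logits`, `v_pred`, `pi_target`, `z_target`, `params`) are placeholders chosen for this example, and in practice the L2 term is often delegated to the optimizer's weight decay instead.

```python
import torch

def alphazero_loss(p_logits, v_pred, pi_target, z_target, params, c=1e-4):
    """Value + policy + L2 loss of equation 2.11 for a batch of samples (sketch)."""
    value_loss = torch.mean((z_target - v_pred.squeeze(-1)) ** 2)        # (z - v)^2
    log_p = torch.log_softmax(p_logits, dim=1)
    policy_loss = -torch.mean(torch.sum(pi_target * log_p, dim=1))       # -pi^T log p
    l2 = sum(torch.sum(w ** 2) for w in params)                          # ||theta||^2
    return value_loss + policy_loss + c * l2
```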

2.2.2 Monte Carlo Tree Search in AlphaZero

Searching and evaluating game states is a primary task for AI algorithms that compete with humans in board games. Traditional methods like the Minimax approach [26] can always find the optimal move given enough time and space. In practice, however, such a complete search of the state tree is impossible because in most cases the number of possible game states is enormous. Take Go for example: there are $3^{19 \times 19}$ possible game states, which is more than the number of atoms in the observable universe.

AlphaZero therefore guides and truncates the search process using Monte Carlo Tree Search (MCTS) [27]. Once it reaches a state, the algorithm traverses the state's children by sampling from a distribution. The algorithm is specified in algorithm 4.

Algorithm 4 MCTS algorithm in AlphaZero

1: for iterations i = 1, ..., I do
2:   current_node = root
3:   while current_node.visits > 0 and not current_node.is_terminal do
4:     current_node = argmaxPUCT(current_node.children)
5:   end while                                        // Selection phase
6:
7:   if current_node.visits = 0 then
8:     (current_node.is_terminal, z) = Ter_Eval(current_node)
9:     if current_node.is_terminal then
10:      current_node.value = z
11:    else
12:      (p, v) = f_θ(s)
13:      current_node.value = v
14:      current_node.children = expand(current_node, p)
15:    end if
16:  end if
17:  value = current_node.value                       // Evaluation & expansion phase
18:
19:  current_node.visits++
20:  while not current_node.is_root do
21:    current_node = current_node.parent
22:    current_node.visits++
23:    current_node.value = current_node.value + value
24:  end while                                        // Back-propagation phase
25: end for
26:
27: π = normalizedPolicy(root.children)
28: v' = root.value / root.visits

The algorithm can be divided into three phases: selection phase, evaluation-expansion phase, and back propagation phase.

In the selection phase, the algorithm traverses iteratively from the root state until it encounters a state never visited before (a leaf node) or a terminal state. Whether a state is terminal is decided by the game rules, which can be checked automatically by a function. argmaxPUCT is the core function in MCTS, as shown in equation 2.12:

$$a = \arg\max_a \big(Q(s, a) + U(s, a)\big) \quad (2.12)$$

argmaxPUCT decides which action MCTS should choose at the current node in order to search deeper in the tree. It is affected by two terms: Q(s, a), the mean state value of the child node that will be reached if we perform a at state s, and U(s, a), which controls exploration, as shown in equation 2.13:

$$U(s, a) = c_{puct} \, P(s, a) \, \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1} \quad (2.13)$$

N(s, a) stands for the number of times the edge (s, a) has been traversed during tree search. Equation 2.13 is derived from Polynomial Upper Confidence Trees (PUCT) [28], and $c_{puct}$ is a coefficient that controls the degree of exploration. Edges $(s_i, a_i)$ that are seldom traversed during the search have smaller values of $N(s_i, a_i)$, which in turn increases their probability of being traversed in future iterations.

In the evaluation and expansion phase, the algorithm checks whether a node is a terminal node or a leaf node. For a terminal node, the state value is determined by the game rules. For a non-terminal leaf node, MCTS uses its current neural network to infer p and v, sets v as the state value of that node, and saves p for further expansion of the tree. To expand a non-terminal leaf node, the algorithm inserts all child nodes of that node into the search tree and initializes their statistics with p, as shown in equation 2.14:

$$N(s_{leaf}, a) = 0, \quad W(s_{leaf}, a) = 0, \quad Q(s_{leaf}, a) = 0, \quad P(s_{leaf}, a) = p_a \quad (2.14)$$

In the back-propagation phase, after the search has reached a pre-defined maximum depth, the state value is propagated back to the root, updating the average state values of all nodes along the search path. The update is performed according to equations 2.15 and 2.16:

$$N(s_t, a_t) = N(s_t, a_t) + 1, \quad W(s_t, a_t) = W(s_t, a_t) + v \quad (2.15)$$

$$Q(s_t, a_t) = \frac{W(s_t, a_t)}{N(s_t, a_t)} \quad (2.16)$$

For the root node, the state value and policy distribution are updated after all MCTS iterations have finished.
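The selection rule of equations 2.12-2.13 can be sketched in Python as follows. The `node.children` mapping and the per-child statistics N, Q, P mirror equation 2.14 but are an assumed interface, not the thesis code.

```python
import math

def puct_select(node, c_puct=1.5):
    """argmaxPUCT of Algorithm 4: pick the child maximizing Q(s,a) + U(s,a) (sketch)."""
    total_visits = sum(child.N for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        u = c_puct * child.P * math.sqrt(total_visits) / (child.N + 1)   # U(s, a), eq. 2.13
        score = child.Q + u                                              # eq. 2.12
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```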

2.2.3 Playing

After a sufficient number of MCTS iterations, AlphaZero performs an action according to the statistics of the root node (the current node), moves to the next game state, and starts a new MCTS search from the new game state. The actual action is drawn from the distribution given by equation 2.17:

$$\pi(a \mid s) = \frac{N(s, a)^{1/\tau}}{\sum_b N(s, b)^{1/\tau}} \quad (2.17)$$

τ is a temperature parameter which controls the degree of exploration. The algorithm has a higher probability of exploring (i.e. not choosing the action with the highest N(s, a)) when τ is high.
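A small sketch of drawing a move from equation 2.17 is given below; the `visit_counts` array and the helper name are assumptions for illustration.

```python
import numpy as np

def select_move(visit_counts, tau=1.0, rng=None):
    """Sample a move index from the tempered visit-count distribution of eq. 2.17 (sketch)."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau == 0:                       # zero temperature: play the most visited move
        return int(np.argmax(counts))
    probs = counts ** (1.0 / tau)
    probs /= probs.sum()               # pi(a|s)
    return int(rng.choice(len(counts), p=probs))
```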

2.2.4 Replay Buffer

Since we generate training samples by traversing the game tree, successive samples are often strongly correlated, which can heavily affect the convergence rate of our neural network.

To avoid this, we should maintain a replay buffer [29] of previous experiences (training samples). We first fill the buffer with samples, then after the number of samples exceeds a pre-defined threshold, we randomly sample mini-batches from the buffer, and update the network accordingly. The threshold needs to be chosen wisely to find a balance between training stability and training speed.

2.3 Input Convex Neural Networks

Neural networks are widely used in academia and industry because of their extraordinary ability to approximate any function [30] together with their strong generalization capability [31]. They have proved to be very effective in many AI areas such as computer vision, natural language processing, and reinforcement learning.

A typical feed-forward neural network consists of an input layer, followed by several hidden layers and an output layer. Non-linear activation functions, such as ReLU [32] or Sigmoid, are added between layers to introduce non-linearity and improve the network's expressiveness. To train a neural network, the most widely used method is back-propagation of errors using gradient descent [33]. Techniques like batch normalization [34], stochastic gradient descent [35], momentum [36], and sensible initialization methods [37] can be applied to improve a neural network's performance.

In 2017, Amos et al. proposed a new kind of network structure called the input convex neural network (ICNN) [38][39]. It is a neural network with a scalar output, which can be expressed by equation 2.18:

$$s = f(x, y; \theta) \quad (2.18)$$

In equation 2.18, x and y are inputs and s is the scalar output. The network is built in such a way that the output s is convex in (a subset of) the inputs y. We can therefore optimize over the convex inputs (y) given some fixed inputs (x). Fundamentally, this property allows us to perform inference in the network via optimization instead of a feed-forward pass, as shown in equation 2.19:

$$y = \arg\min_y f(x, y; \theta) \quad (2.19)$$

Now we can treat inference as a convex optimization problem, which can be very useful in some settings.

2.3.1 Structure of ICNN

In their paper, Amos et al. describe two kinds of input convex neural networks: Fully Input Convex Neural Networks (FICNN) and Partially Input Convex Neural Networks (PICNN).

A FICNN has the structure shown in figure 2.2, which can equivalently be expressed by equation 2.20:

$$z_{i+1} = g_i\big(W_i^{(z)} z_i + W_i^{(y)} y + b_i\big), \qquad f(y; \theta) = z_k \quad (2.20)$$

Figure 2.2: Structure of FICNN

For a FICNN, if all $W_{1:k-1}^{(z)}$ are non-negative and all activation functions are convex and non-decreasing, the output is a convex function of the input y.

A PICNN has the structure shown in figure 2.3, which can equivalently be expressed by equations 2.21-2.23 (equation 2.21, the update of the non-convex path $u_i$, follows the formulation in [38]):

$$u_{i+1} = \tilde{g}_i\big(\tilde{W}_i u_i + \tilde{b}_i\big) \quad (2.21)$$

$$z_{i+1} = g_i\Big(W_i^{(z)}\big(z_i \circ (W_i^{(zu)} u_i + b_i^{(z)})\big) + W_i^{(y)}\big(y \circ (W_i^{(yu)} u_i + b_i^{(y)})\big) + W_i^{(u)} u_i + b_i\Big) \quad (2.22)$$

$$f(x, y; \theta) = z_k, \qquad u_0 = x \quad (2.23)$$

The symbol ◦ denotes the Hadamard product [40], the element-wise product between two vectors. For a PICNN, if all $W^{(z)}$ terms are non-negative, the scalar output is a convex function of the vector y.
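A minimal sketch of how such a convexity constraint can be enforced in code is given below: the weights acting on the z-path are kept non-negative (here via a softplus reparameterization; the ICNN paper instead projects them to be non-negative after each update), and the activation is convex and non-decreasing. The layer is a simplified z-update without the u-gating of equation 2.22, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvexLayer(nn.Module):
    """Simplified convex z-path layer: z_{i+1} = relu(Wz z_i + Wy y), with Wz >= 0 (sketch)."""

    def __init__(self, z_dim, y_dim, out_dim):
        super().__init__()
        self.Wz_raw = nn.Parameter(torch.randn(out_dim, z_dim) * 0.1)  # unconstrained parameters
        self.Wy = nn.Linear(y_dim, out_dim)                            # weights on y: no constraint

    def forward(self, z, y):
        Wz = F.softplus(self.Wz_raw)             # non-negative weights on the convex path
        return F.relu(z @ Wz.t() + self.Wy(y))   # convex, non-decreasing activation
```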

2.3.2 Inference in ICNN

As mentioned before, inference in an ICNN is treated as a convex optimization problem. After we have trained the network and modeled the output as a convex function of (some of) the inputs, we can run a convex minimization algorithm to find a y that minimizes the scalar output. Common convex minimization algorithms, such as gradient descent with momentum / spectral step-size modifications [41][42] and bundle entropy methods [43], can be used for inference in ICNNs.

In general, a gradient descent step can be expressed as in equation 2.24:

$$\hat{y} \leftarrow \hat{y} - \alpha \nabla_y f(x, \hat{y}; \theta) \quad (2.24)$$

In practice, we should choose a good optimization method like AdaGrad [44] or ADAM [45].
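The sketch below performs this inference-by-optimization with Adam, as suggested above. The function `f(x, y)` is assumed to be a trained ICNN returning a scalar that is convex in `y`; the signature is an assumption made for this example.

```python
import torch

def icnn_infer(f, x, y_dim, steps=50, lr=0.1):
    """Minimize the ICNN output over the convex input y (equations 2.19 and 2.24) (sketch)."""
    y = torch.zeros(y_dim, requires_grad=True)    # initial guess for y
    opt = torch.optim.Adam([y], lr=lr)            # Adam in place of plain gradient descent
    for _ in range(steps):
        opt.zero_grad()
        out = f(x, y)                             # scalar output, convex in y
        out.backward()                            # gradient with respect to y
        opt.step()
    return y.detach()
```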

2.3.3 Training in ICNN

The goal of training an ICNN is that, for a given pair (x, y*), the network parameters are trained such that:

$$y^* \approx \arg\min_y \hat{f}(x, y; \theta) \quad (2.25)$$

This is a rather complicated optimization problem, but it can be simplified if we have a good estimate of the scalar output for a given input pair (x, y). In that case, we can set the scalar as the target and fit the network directly.

2.3.4 ICNN in AlphaZero

The main task of this thesis project is to incorporate an ICNN into the MCTS algorithm of AlphaZero. More specifically, we need to replace the original NN in AlphaZero with a PICNN.

In general, the idea is to model the actual winner z as a convex function of the input policy p: states are the non-convex inputs and policies are the convex inputs. More details of the actual method are presented in Chapter 3.

2.3.5 Application of ICNNs in RL

ICNNs have been shown to be useful when applied to continuous-action reinforcement learning tasks in OpenAI Gym [46]. In the original ICNN paper, the authors provided a systematic way of training ICNNs on RL problems and compared ICNNs to Deep Deterministic Policy Gradient (DDPG) [47] and Normalized Advantage Functions (NAF) [48], two state-of-the-art off-policy learning baselines. The results showed that ICNNs can outperform these two baselines on 6 of the 10 tasks tried in the experiment.

2.4 The Game of Connect Four

Due to limitations of computing resources and time, in this thesis project I chose Connect Four as the environment for experiments.

The game of Connect Four has a suitable state-space complexity of about $10^{13}$. It is less complicated than Go, Chess, or other well-known board games, and is thus a good choice for this specific project with limited resources. Nonetheless, it is still time-consuming to derive playing strategies with a naive approach, which makes it necessary to use more advanced methods on the game.

2.4.1 Game Introduction

Connect Four has a board size of 6 × 7. The game board is vertical, and both players drop plates from the top of it. Plates then fall down and occupy the lowest available space within the column. Figure 2.4 shows the layout of the game board.

Each player can choose to drop a plate into one of the 7 columns. If a column is fully occupied, players cannot drop plates into that column (an invalid action). The first player to form a horizontal, vertical, or diagonal line of four of their own plates wins the game.

2.4.2 Previous Solutions of the Game

The game of Connect Four was solved without deep architectures as early as 1995 by John Tromp. The method he used was an improved Mini-max algorithm with alpha-beta pruning, move ordering, and transposition tables. Although the method was near-optimal and very powerful, we can hardly say it was perfect: it took 5 years for the program to generate playing strategies.

Deep learning based approaches can maintain a high winning rate while significantly reducing the scale of the search, and it is interesting to see how the combination of MCTS and deep learning helps the agent form its playing strategies. This is one of the motivations for choosing Connect Four as our experiment platform.

3 Methods

Convex-AlphaZero works similarly to the original version: both rely on iterative Monte Carlo tree search to decide the next move in a board game. Each Monte Carlo tree search step has the following four phases: selection, evaluation, expansion, and back-propagation.

In this thesis project, the major modifications of the original AlphaZero algorithm are made in the evaluation phase, which decides a game state's value and policy when the search algorithm encounters that state for the first time. In this chapter, we show how Convex-AlphaZero works in detail, presenting not only the core ideas of Convex-AlphaZero, but also training techniques and methods that can reduce the model's variance.

3.1 Input Convex Neural Network in AlphaZero

3.1.1 Network Structure

Neural networks in AlphaZero are used to evaluate the value and policy of the state at an MCTS search node:

$$(p, v) = f_\theta(s) \quad (3.1)$$

As shown in equation 3.1, given a game state, the neural network outputs the value-policy tuple based on its current parameters. These two outputs are then used to update the statistics of MCTS search nodes and to generate training data for future model generations.

Getting these two outputs in the original AlphaZero is simple: one just needs to feed the game state to the network and collect the tuple (p, v) after one forward pass, as shown in figure 3.1.

Figure 3.1: Network structure in original AlphaZero

In Convex-AlphaZero, however, this is not possible because we now model the state value as a convex function of the (input) policy. A partially input convex neural network is used instead of the original convolutional network with two output heads. The structure of the network is shown in figure 3.2.

Figure 3.2: Network structure in Convex-AlphaZero

With this network structure, we can no longer get policies through a forward pass, because the policy is now one of the network inputs. The whole network can be seen as a mapping function from a [state, policy] tuple to a negative state value, as shown in equation 3.2:

$$(-v) = f_\theta(s, p) \quad (3.2)$$

One important point is that, since we want to model the state value as a convex function of the policy given the state, we use the negative state value as the output. In this way, we are able to find the policy that maximizes the state value (minimizes the negative state value) through a convex optimization process.

To make the network a partially input convex one, the connections between residual blocks and latent variables should be designed carefully, as shown in figure 3.3.

Figure 3.3: Connection types in PICNN

Furthermore, all weight matrices between the latent variables $z_i$ should be non-negative.

3.1.2 Inference

The inference of the state value and policy in a PICNN is a convex optimization problem, as stated before. Algorithm 5 below shows how to infer the state value and policy for a given state s.

Algorithm 5 Inference algorithm in Convex-AlphaZero

1: randomly initialize a policy vector p_1
2: for iterations i = 1, ..., I do
3:   Calculate the gradient of the output v_i with respect to p_i:
4:     ∇p_i = ∇_{p_i} f_θ(s, p_i)
5:   Update the policy according to the gradient:
6:     p_{i+1} = p_i − α ∇p_i
7: end for
8: p = softmax(p_{I+1})
9: v = f_θ(s, p)
10: value = −v, policy = p

In algorithm 5, the convex optimization method used is naive gradient descent. In practice, however, naive GD may cause numerical problems, so in this project the Adam optimizer was applied to optimize the policy vector given the model parameters and the game state.
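A sketch of Algorithm 5 with Adam is shown below. `f_theta(state, policy)` is assumed to be the trained PICNN returning the negative state value as a scalar that is convex in the policy; the function name and interface are illustrative assumptions, not the thesis implementation.

```python
import torch

def infer_value_policy(f_theta, state, n_actions=7, steps=30, lr=0.05):
    """Convex-AlphaZero inference: optimize the policy, then read off the value (sketch)."""
    p = torch.zeros(n_actions, requires_grad=True)    # initial policy vector p_1
    opt = torch.optim.Adam([p], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        neg_v = f_theta(state, p)                     # (-v) = f_theta(s, p_i), eq. 3.2
        neg_v.backward()                              # gradient of (-v) w.r.t. the policy
        opt.step()
    policy = torch.softmax(p, dim=0).detach()         # p = softmax(p_{I+1})
    value = -f_theta(state, policy).item()            # value = -(-v)
    return value, policy
```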

3.1.3 Training

During the self-play phase, AlphaZero keeps playing against itself while generating [state, value, policy] tuples as training data for the neural network. In Convex-AlphaZero, we still use this tuple as the training data, but the target and the input are different from those in AlphaZero.

In the original AlphaZero, the input is the game state, while the value and the policy are two separate targets for the network's two output heads: MSE loss is used to fit the value and cross-entropy loss is used to fit the policy.

In Convex-AlphaZero, there are two inputs: the state and the policy. In addition, before setting the value as the target, a negative sign is added, so the negative state value becomes the actual learning target. MSE loss is used to train the input convex neural network in Convex-AlphaZero.

3.1.4 Causality Reasoning

One of the reasons why we want to use a completely different network structure in Convex-AlphaZero is that we want to consider a different causality relationship between state, policy, and value, and see how the new causality model affects the whole system.

Traditionally, given a game state, the state's value and the corresponding policy are then decided. However, there is an alternative way to think about this. At a certain game state, a great policy will lead the player to a victory, thus giving the state a rather high value; on the contrary, a poor policy will assign a low value to the state. In other words, the choice of policy can affect the state value given the state. Under this causality framework, it is reasonable to use a network with the state and the policy as its inputs and the value as its output. To express the new model, we introduce the input convex neural network to add a direct causality link between policy and value. Moreover, the policy is the convex input of the negative state value. The nature of the ICNN allows us to perform a convex optimization step to find a policy that maximizes the value (minimizes the negative value) given the game state.

The different causality relationships in AlphaZero and Convex-AlphaZero are shown in figure 3.4.

Figure 3.4: Different causality models. (a) Causality in AlphaZero; (b) causality in Convex-AlphaZero.

3.2 Variance Reduction

In the self-play phase, AlphaZero starts from scratch with an empty MCTS search tree and a random network. That is to say, when optimizing policies in early generations using gradient methods, the optimized policy for a certain state may end up very far away from the 'ground truth' policy of that state, which can significantly increase the model variance and negatively affect the training of the neural network.

Another problem is that, throughout the whole self-play process, the agent may encounter the same game state several times, with a different state value and policy each time. Learning with changing targets can severely slow down the network's convergence rate, as well as increase the network's variance over multiple simulation rounds.

The problems listed above are instances of what affects the model's performance and consistency over different runs (i.e. 'variance'). In this part we discuss techniques that can reduce such variance.

3.2.1 Uniform Policy

To prevent the network from producing badly deviated policies in early generations, we propose an improved version of the inference algorithm. In early generations, the algorithm outputs a uniform policy instead of the gradient descent result, giving no guidance on the policy at all. This means that in early iterations the task of obtaining an appropriate policy label falls completely on the MCTS algorithm. After the network has been trained with sufficient samples, we start the convex optimization step and output optimized policies.

The procedure is described in algorithm 6:

Algorithm 6 Improved inference algorithm in Convex-AlphaZero

1: if generation < G then
2:   p_U = uniform policy
3:   v = f_θ(s, p_U)
4:   value = −v, policy = p_U
5: else
6:   randomly initialize a policy vector p_1
7:   for iterations i = 1, ..., I do
8:     Calculate the gradient of the output v_i with respect to p_i:
9:       ∇p_i = ∇_{p_i} f_θ(s, p_i)
10:    Update the policy according to the gradient using Adam:
11:      p_{i+1} = ADAM(p_i, ∇p_i, s, f_θ)
12:  end for
13:  p = softmax(p_{I+1})
14:  v = f_θ(s, p)
15:  value = −v, policy = p
16: end if

G is a hyper-parameter that controls the start time of the convex optimization process.

In practice, if the uniform policy is not applied in early generations, NaN problems are likely to occur during training: the model's loss becomes extremely large and is unlikely to drop later, making it meaningless to continue training.

3.2.2 Merging Matching States

Like many other deep reinforcement learning algorithms, AlphaZero has a replay buffer that stores training samples. To prevent the network from learning strong correlations between successive game states, samples are drawn from the replay buffer randomly for training.

However, the same state may be added to the replay buffer many times with varying policy and value labels, as the agent visits the same state frequently during thousands of Monte-Carlo iterations. As a result, the network is exposed to many data samples with the same state but different policy and value labels, which significantly increases its variance.

The solution is straightforward: samples with matching states are grouped together before each network parameter update, and the policy and value labels are averaged over all samples in the group, as shown in figure 3.5.

Figure 3.5: Example of merging duplicate states

Merging duplicate states plays an important role because it compresses the replay buffer into a more precise and compact form.
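A small sketch of this grouping step is shown below; the (state, policy, value) tuple layout and the numpy state representation are assumptions made for the example.

```python
from collections import defaultdict
import numpy as np

def merge_matching_states(samples):
    """Group samples by state and average their policy/value labels (section 3.2.2) (sketch)."""
    groups = defaultdict(list)
    for state, policy, value in samples:
        groups[state.tobytes()].append((state, np.asarray(policy), float(value)))
    merged = []
    for entries in groups.values():
        state = entries[0][0]
        avg_policy = np.mean([p for _, p, _ in entries], axis=0)   # averaged policy label
        avg_value = float(np.mean([v for _, _, v in entries]))     # averaged value label
        merged.append((state, avg_policy, avg_value))
    return merged
```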

In addition, early training samples are removed from the replay buffer regularly to ensure that the buffer contains the newest estimates given by the algorithm. Old training samples often carry biased estimates of the state value and policy, which is considered harmful because it forces the network to remember old and outdated information.

3.3 Other Details

3.3.1 Game State Representation

For game states, our network has an input layer of size 7 × 6 × 5. The first two dimensions are the size of the game board, while the third dimension, 5, indicates that we use 5 feature maps to fully represent all information in a single game state.

The contents of the 5 feature maps of one input state are shown in figure 3.6.

Figure 3.6: Example of feature maps

The first/third feature map shows where the yellow/red plates are located, while the second indicates all slots that are not occupied by plates of any colour. These three layers together can be interpreted as a one-hot encoding of each cell in the game board grid.

There are two more feature maps containing information about the next player. If it is yellow's turn, the fourth map is all ones and the fifth is all zeros; it is the opposite for red's turn, with all zeros in the fourth map and all ones in the fifth. By representing game states like this, we pass as much information as possible to the network, helping it capture more from the input and find latent connections between the different feature maps.
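A sketch of this five-plane encoding is shown below. The board convention (0 = empty, 1 = yellow, 2 = red) and the function name are assumptions made for the example.

```python
import numpy as np

def encode_state(board, yellow_to_move):
    """Encode a 6x7 Connect Four board into the 7x6x5 feature maps of section 3.3.1 (sketch)."""
    planes = np.zeros((6, 7, 5), dtype=np.float32)
    planes[:, :, 0] = (board == 1)                       # yellow plates
    planes[:, :, 1] = (board == 0)                       # empty slots
    planes[:, :, 2] = (board == 2)                       # red plates
    planes[:, :, 3] = 1.0 if yellow_to_move else 0.0     # turn-indicator planes
    planes[:, :, 4] = 0.0 if yellow_to_move else 1.0
    return planes.transpose(1, 0, 2)                     # reorder to the 7 x 6 x 5 layout
```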

3.3.2 Extending Data Set

When learning how to play Go, scientists managed to reduce the number of total game states by merging symmetric states. The total number of possible states in Go is extremely large so every effort should be taken to control the scale of the game.

In our case, however, we are doing quite the opposite: we have to restrict the number of MCTS iterations because of the limited computing resources. As a result, the number of training samples we get in each generation may not be enough, and a complicated network structure like the ResNet we use may suffer from overfitting in this situation.

To counter this, we add the mirrored version of each sample to the buffer, with the same value label and a reversed policy label, as shown in figure 3.7.

Figure 3.7: Example of duplicating training samples

By doubling the number of samples in the buffer, the problem of overfitting can be alleviated.
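This augmentation can be sketched as follows, assuming the (7, 6, 5) plane layout of section 3.3.1 with the column axis first and a policy vector of length 7; the layout and names are illustrative assumptions.

```python
import numpy as np

def augment_with_mirror(samples):
    """Add the left-right mirrored version of each (planes, policy, value) sample (sketch)."""
    augmented = []
    for planes, policy, value in samples:
        augmented.append((planes, policy, value))
        mirrored_planes = planes[::-1, :, :].copy()          # flip the column axis
        mirrored_policy = np.asarray(policy)[::-1].copy()    # reverse the 7 move probabilities
        augmented.append((mirrored_planes, mirrored_policy, value))
    return augmented
```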

3.3.3 Clipping And Normalizing Policies

The output policy should satisfy the following restrictions:

$$p = [a_1, a_2, a_3, a_4, a_5, a_6, a_7] \quad (3.3)$$

$$0 \le a_i \le 1, \quad \forall i \in \{1, 2, 3, 4, 5, 6, 7\} \quad (3.4)$$

$$\sum_i a_i = 1 \quad (3.5)$$

In AlphaZero, these restrictions are automatically satisfied by the softmax function in the policy output layer. If the policies are given by a convex optimization process, as in Convex-AlphaZero, we have to enforce the restrictions manually.

The simplest way is to directly apply a softmax function to the policy after the convex optimization step, which is appealing because of its simplicity. In practice, however, we clip the output policy onto its feasible set (i.e. set all negative elements to zero and all elements greater than one to one) before performing the softmax. The reason for doing this is that after the optimization step some elements of the policy vector may be extremely large or small, and the normalized policy could become almost deterministic (i.e. with 1 as one of the elements and 0 in all others). Clipping reduces the absolute difference between elements and creates a policy that enables exploration.
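A sketch of this clip-then-softmax step is given below; the helper name is hypothetical.

```python
import numpy as np

def clip_and_normalize(raw_policy):
    """Clip the optimized policy to [0, 1], then apply a softmax (section 3.3.3) (sketch)."""
    clipped = np.clip(np.asarray(raw_policy, dtype=np.float64), 0.0, 1.0)
    exp = np.exp(clipped - clipped.max())     # numerically stable softmax
    return exp / exp.sum()
```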

4 Results

In this chapter, we examine the empirical behavior of the proposed method, Convex-AlphaZero, by comparing it to two baselines: the original AlphaZero [23] and a strong Mini-max agent. To track the training process of the ICNN, we also present training curves accompanied by explanations. Some detailed results about the ICNN are covered in this part as well.

4.1 Training Curves

Here we present the training curves of the CNN in AlphaZero and the PICNN in Convex-AlphaZero. Both networks were trained during self-play, with 1000 MCTS iterations used to decide each move, and each generation was trained after gathering training samples from 2000 self-played games. The parameters were randomly initialized using Xavier initialization. The training curves are shown in figure 4.1.

With its two-head network, AlphaZero has two loss terms: the value loss is measured by the MSE between the output and the value label, while the policy loss is measured by the cross-entropy between the output and the policy label. For Convex-AlphaZero, there is only one output, the state value, and the MSE loss is used as the measurement.

Figure 4.1: Training loss of the CNN in AlphaZero (a) and the PICNN in Convex-AlphaZero (b) on game states of Connect Four. In both panels the x-axis is the number of model generations and the y-axis is the loss. Colors denote the loss type: in (a), blue is the value loss, orange the policy loss, and green the total loss; in (b), blue is the value loss (which equals the total loss).

From figure 4.1 we see that both networks converged after about 50 generations. In AlphaZero, the policy loss is much higher than the value loss, which is expected because the policy, a vector of length seven, is harder to learn and fit; the policy loss is the main component of the total loss. In Convex-AlphaZero, the only loss term, the value loss, dropped quickly and fell below 1.0000 after 20 generations. One should note that the losses of AlphaZero and Convex-AlphaZero were on different scales in early generations: the total loss of the CNN in AlphaZero was 2.0131 before training and dropped to 1.0835 after 100 generations, while the loss of the PICNN in Convex-AlphaZero was 13.2755 before training, nearly 7 times larger than the loss in AlphaZero. Such a large discrepancy can be explained by the different network structures. The PICNN has many restrictions on its parameters and complex connections between layers, so it is reasonable that the network had poor expressiveness before training. After training on millions of game state samples, the input convex network managed to model the value as a convex function of the state and policy, and the value loss was brought under control.

4.2 Player Strength Comparison

In this part we compare player strength by letting the agents play against each other. The experiment platform is Connect Four, which is a biased game: it has been mathematically proven that the first player has an advantage over the second. Connect Four is solved as a first-player win; with perfect play, the first player can force a win on or before the 41st move by starting in the middle column [49].

Because of this bias, we compare player strength with each agent playing as both the first and the second player, and report win/draw/loss ratios for each combination of opponents and play orders.

4.2.1 Play with a Mini-max Agent

To compare AlphaZero and Convex-AlphaZero with traditional methods, we let both models play against a Mini-max agent. The Mini-max agent is a strong one that looks 10 moves ahead, meaning it tries to win as early as possible or lose as late as possible, and it is guided by a heuristic function that counts the total number of plates in horizontal, vertical, or diagonal lines. For the MCTS-based agents, we used 500 MCTS iterations to search for a single move, and each generation was trained on 1000 self-played games.

The winning rate and draw rate of AlphaZero against Mini-max across model generations are shown in figure 4.2.

Figure 4.2 indicates that AlphaZero outperforms Mini-max quickly. By the 7th generation, AlphaZero achieved a 50% winning rate against Mini-max. When the model is fully trained (≥ 40 generations), AlphaZero guarantees a victory if it plays first; as the second player, AlphaZero reaches a 90% winning rate after 50 generations.

For Convex-AlphaZero, we performed exactly the same experiments, with 500 MCTS iterations per move and 1000 games in each generation. Furthermore, we used uniform policies before the 5th generation to counter NaN problems.

The winning rate and draw rate of Convex-AlphaZero against Mini-max across model generations are shown in figure 4.3.

Convex-AlphaZero performed poorly in early generations. This could be explained by the high MSE loss before the network had captured the correct information in the training samples. When playing as the first player, Convex-AlphaZero achieved a 96% winning rate against the Mini-Max agent after 50 generations. The winning rate of Convex-AlphaZero as the second player was 79% after 50 generations.

Figure 4.2: Winning rate of AlphaZero against Mini-Max. (a) AlphaZero as player 1; (b) AlphaZero as player 2. In both panels the x-axis is the AlphaZero model generation. In (a) the y-axis is the winning/draw rate of AlphaZero playing as player 1 against Mini-Max as player 2; in (b) it is the winning/draw rate of AlphaZero playing as player 2 against Mini-Max as player 1. Blue: winning rate of AlphaZero; orange: draw rate of AlphaZero.

Figure 4.3: Winning rate of Convex-AlphaZero against Mini-Max. (a) Convex-AlphaZero as player 1; (b) Convex-AlphaZero as player 2. In both panels the x-axis is the Convex-AlphaZero model generation. In (a) the y-axis is the winning/draw rate of Convex-AlphaZero playing as player 1 against Mini-Max as player 2; in (b) it is the winning/draw rate of Convex-AlphaZero playing as player 2 against Mini-Max as player 1. Blue: winning rate of Convex-AlphaZero; orange: draw rate of Convex-AlphaZero.

From the above data we notice that AlphaZero and Convex-AlphaZero achieved similar winning rates against the Mini-Max agent after 50 generations, with Convex-AlphaZero's rate slightly lower. However, the winning rate curve of Convex-AlphaZero was unstable compared with AlphaZero's, and the winning rate of Convex-AlphaZero in early generations was much lower.

4.2.2 AlphaZero vs Convex-AlphaZero

This section directly compares AlphaZero and Convex-AlphaZero with regard to player strength. We stored the model parameters of both AlphaZero and Convex-AlphaZero at different model generations and let the corresponding models play against each other.

When playing, both models selected a single move using 1000 MCTS iterations. The winning rate was calculated from the results of 100 independently played games of Connect Four.
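
The win/draw/loss percentages reported in this and the following sections come from repeated head-to-head matches; a minimal sketch of such an evaluation loop is shown below. The play_game helper is hypothetical and stands for one full game between two agents, returning +1, 0, or -1 from the first player's perspective.

    from collections import Counter

    def evaluate(play_game, agent_a, agent_b, num_games=100, a_plays_first=True):
        # Tally wins / draws / losses for agent_a over num_games independent games.
        tally = Counter()
        for _ in range(num_games):
            if a_plays_first:
                outcome = play_game(agent_a, agent_b)
            else:
                outcome = -play_game(agent_b, agent_a)   # flip sign to agent_a's view
            tally["win" if outcome > 0 else "draw" if outcome == 0 else "loss"] += 1
        # Return percentages of wins / draws / losses for agent_a.
        return {k: 100.0 * tally[k] / num_games for k in ("win", "draw", "loss")}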

Results are shown in Figure 4.4.

The winning rate of Convex-AlphaZero as player 1 against the original AlphaZero was 54% after 50 generations. However, when playing as player 2, Convex-AlphaZero achieved only an 18% winning rate. In general, Convex-AlphaZero showed lower player strength according to the experiments in this part.

4.2.3 Winning Rate under Different Decision Times

To further explore how the decision time affects the performance of AlphaZero and Convex-AlphaZero, we tested the winning rate of Convex-AlphaZero against the original AlphaZero with different numbers of MCTS iterations used to decide a single move.

We chose the 50th-generation models of both methods. The winning rate was calculated from the results of 100 independently played games of Connect Four. For each experiment, both methods searched for the same number of MCTS iterations to decide a single move. Each data slot records the averaged results of 10 independent trials. Results are shown in Table 4.1.

Figure 4.4: Winning rate of Convex-AlphaZero against AlphaZero. (a) Convex-AlphaZero as player 1; (b) Convex-AlphaZero as player 2. In both panels the x-axis is the model generation of both AlphaZero and Convex-AlphaZero. In (a) the y-axis is the winning/draw rate of Convex-AlphaZero playing as player 1 against AlphaZero as player 2; in (b) it is the winning/draw rate of Convex-AlphaZero playing as player 2 against AlphaZero as player 1. Blue: winning rate of Convex-AlphaZero; orange: draw rate of Convex-AlphaZero.

    # of MCTS iterations   Convex-AlphaZero as player 1   Convex-AlphaZero as player 2
    per move               W/T/L (%)                      W/T/L (%)
    100                    48.4 / 20.6 / 31.0             17.2 /  5.3 / 77.5
    200                    45.7 / 10.6 / 43.7             18.0 / 11.2 / 70.8
    300                    49.1 / 13.1 / 37.8             17.1 / 15.4 / 67.5
    400                    44.6 / 17.2 / 38.2             15.9 / 12.2 / 71.9
    500                    50.3 / 12.9 / 36.8             19.6 /  8.2 / 72.2
    600                    50.1 / 17.0 / 32.9             17.7 / 14.5 / 67.8
    700                    48.0 / 15.1 / 36.9             16.9 / 10.3 / 72.8
    800                    52.8 / 16.4 / 30.8             12.5 / 11.0 / 76.5
    900                    47.3 / 16.8 / 35.9             18.1 / 12.9 / 69.0
    1000                   49.1 / 15.3 / 35.6             19.3 / 13.6 / 67.1

Table 4.1: Winning rates with respect to the number of MCTS iterations per move. Each data slot is averaged over 10 independent trials.


From Table 4.1 we can see that, due to randomness and the exploration settings, the winning rate fluctuated between trials. The overall outcome was that Convex-AlphaZero could only narrowly beat AlphaZero as player one, while it lost most of the games when playing as player two. So, in general, Convex-AlphaZero is weaker than AlphaZero in Connect Four.

4.3 Experiments about the ICNN

In this section we present experimental results on different aspects of the ICNN used in Convex-AlphaZero, from raw network performance to its compatibility with the MCTS search algorithm. From these experiments we hope to gain some insight into how ICNNs work in the proposed method.

4.3.1 Raw Network Performance

To investigate how the model generation and the optimization process of the ICNN affect network performance, we let trained ICNNs of different generations play against a pre-trained feed-forward CNN from AlphaZero as the baseline. The games were played using raw network predictions, meaning that no MCTS searches were used and the policies were given directly by the network outputs. Both networks were trained during self-play and then extracted for comparison.

To show how the optimization process (i.e. the ICNN inference process) affected the final results, we plotted winning rate curves against the number of optimization steps.
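
To make the inference step concrete, the sketch below shows one way such an optimization loop could look, assuming a PICNN handle value_net(state, policy) whose scalar output is convex in the policy and set up so that lower is better (e.g. a negated value). The softmax reparameterization is just one simple way to stay on the probability simplex; projected gradient steps or the bundle entropy method proposed for ICNNs are alternatives, and the step count and learning rate here are illustrative.

    import torch

    def infer_policy(value_net, state, num_actions, steps=30, lr=0.1):
        # Gradient-based inference: the state and the network weights stay fixed
        # while the policy input is optimized. The policy is kept on the
        # probability simplex by parameterizing it with logits and a softmax.
        logits = torch.zeros(num_actions, requires_grad=True)
        optimizer = torch.optim.Adam([logits], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            policy = torch.softmax(logits, dim=-1)
            # Assumed convention: a lower network output corresponds to a better
            # policy, so the inner search is a minimization.
            objective = value_net(state, policy)
            objective.backward()
            optimizer.step()
        return torch.softmax(logits, dim=-1).detach()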

For statistical purposes, each winning rate presented in Figure 4.5 is an averaged result of 10 independently conducted experiments, each consisting of 100 independently played games.

Results are shown in Figure 4.5.

From Figure 4.5 we may draw the conclusion that the optimization process in the ICNN indeed tries to find the best policy for each state. As the number of optimization steps grows, the winning rate against the baseline increases, which means the optimization has a positive effect on network performance. Models from later generations also achieved better winning rates, which is the expected result.


Figure 4.5: Winning rate of the raw input convex NN of different generations against the raw CNN (generation 50). (a) ICNN as player 1; (b) ICNN as player 2. In both panels the x-axis is the number of optimization steps used by the ICNN to find the best policy for each encountered game state. In (a) the y-axis is the winning rate of the raw ICNN playing as player 1 against the raw CNN as player 2; in (b) it is the winning rate of the raw ICNN playing as player 2 against the raw CNN as player 1. Each point is the mean winning rate over 10 independent runs. Colors indicate the ICNN model generation used (blue: 10, orange: 20, green: 30, red: 40, purple: 50).


These results support the idea that using ICNNs to model the state value as a convex function of the policy given the state is feasible. The network successfully captured the relationship between values and policies, and convex optimization could be used for planning.

4.3.2 Average Game Length

Another metric for model evaluation is the length of the games played. When comparing two agents with nearly the same winning ratio against the baseline, the average game length can serve as an extra criterion: a long game often indicates that the two players' strengths do not differ much. The baseline used in this experiment was a 50th-generation AlphaZero (CNN + MCTS), with 500 MCTS iterations per move and 1000 games in each generation.
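
This statistic can be collected alongside the win/draw/loss tally; a minimal sketch is shown below, where the hypothetical play_game helper is assumed to also return the number of moves played in a game.

    def average_game_length(play_game, agent_a, agent_b, num_games=100):
        # play_game is a hypothetical helper returning (outcome, num_moves) per game.
        total_moves = 0
        for _ in range(num_games):
            _, num_moves = play_game(agent_a, agent_b)
            total_moves += num_moves
        return total_moves / num_games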

For statistical purposes, each data point presented is an averaged result of 10 independently conducted experiments, each consisting of 100 independently played games. Results are shown in Figure 4.6.

Figure 4.6 suggests that adding the MCTS algorithm to raw ICNNs boosts their performance, which also supports the compatibility of ICNNs with the MCTS algorithm.


Figure 4.6: Average game length of games played between the raw ICNN / ICNN+MCTS of different generations and AlphaZero (generation 50). (a) ICNN / ICNN+MCTS as player 1; (b) ICNN / ICNN+MCTS as player 2. In both panels the x-axis is the model generation of the proposed methods. In (a) the y-axis is the average length of games in which the raw ICNN / ICNN+MCTS played as player 1 against AlphaZero as player 2; in (b) it is the average length of games in which they played as player 2 against AlphaZero as player 1. Each point is the mean game length over 10 independent runs. Blue: raw ICNN agent; orange: ICNN + 500-iteration MCTS agent (i.e. Convex-AlphaZero).


5 Discussion

5.1 Comments on General Performance

As the experimental results suggest, Convex-AlphaZero still has a lot of room for improvement before it can outperform the original AlphaZero in general player strength in the game of Connect Four. In this part we try to find some possible explanations.

The first reason could be that the true value function of a [state, action] tuple in Connect Four, or in other board games, is not convex. Fitting a non-convex value function with an input convex neural network can hurt the final performance of the whole system, causing a low winning rate against the original method or some of its variants.

The second reason could be limited computing power. We could not set the number of MCTS iterations, the number of self-played games, or the number of gradient steps per optimization process very high, in order to keep the total running time of our programs under control. As a result, the statistics in the Monte Carlo tree nodes may not accurately reflect the state values and policies, or not enough samples were drawn for the model to fully capture all the information in the game.

5.2 Comments on ICNN Performance

From the raw network comparisons, we can see that the inference process of input convex neural networks can indeed optimize policies. The winning rate against the raw CNN grows as the number of optimization steps increases, which indicates that the optimization has a positive effect on network performance.
