
Using Reinforcement Learning for Games with Nondeterministic State Transitions


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/055--SE

Using Reinforcement Learning for Games with Nondeterministic State Transitions

Reinforcement Learning för spel med icke-deterministiska tillståndsövergångar

Max Fischer

Supervisor: Anders Fröberg
Examiner: Erik Berglund


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Given the recent advances within a subfield of machine learning called reinforcement learning, several papers have shown that it is possible to create self-learning digital agents, agents that take actions and pursue strategies in complex environments without any prior knowledge. This thesis investigates the performance of the state-of-the-art reinforcement learning algorithm proximal policy optimization when trained on a task with nondeterministic state transitions. The agent’s policy was constructed using a convolutional neural network and the game Candy Crush Friends Saga, a single-player match-three tile game, was used as the environment.

The purpose of this research was to evaluate whether the described agent could achieve a higher win rate than average human performance when playing the game of Candy Crush Friends Saga. The research also analyzed the algorithm’s generalization capabilities on this task. The results showed that all trained models perform better than a random policy baseline, thus showing it is possible to use the proximal policy optimization algorithm to learn tasks in an environment with nondeterministic state transitions. It also showed that, given the hyperparameters chosen, the agent was not able to perform better than average human performance.


Acknowledgments

I would like to thank King for giving me the opportunity to write my master thesis at your company, it has truly been a pleasure! A special thank you to my supervisor at King, Lele Cao, for all the time and effort you have put in to support my work. Also a big thanks to the rest of the AI R&D Team, Christian Schmidli, Sahar Asadi, Alice Karnsund, Philipp Eisen, Alex Nodet, Sami Purmonen, Michael Sjöberg and Erik Poromaa, for all the help and endless discussions about reward functions, implementations, hyperparameters etc. I also want to give an extra thanks to the tracking team for making me a better Super Smash Bros player!

Finally, a big thanks to my supervisor Anders Fröberg and examiner Erik Berglund at Linköping University for providing feedback on my research and answering all my questions.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables ix

1 Introduction 1
1.1 Motivation . . . 2
1.2 Aim . . . 3
1.3 Research Questions . . . 3
1.4 Delimitations . . . 3

2 Theory 4
2.1 Elements of Reinforcement Learning . . . 4

2.2 Policy Gradient Methods . . . 6

2.3 Proximal Policy Optimization . . . 7

2.4 Designing the Reward Function . . . 9

2.5 Feedforward Neural Networks . . . 9

2.6 Convolutional Neural Networks . . . 10

2.7 Training a Convolutional Neural Network . . . 14

2.8 Improve Performance of Reinforcement Learning Algorithms . . . 17

2.9 Game Board Representation . . . 17

2.10 Evaluation Metrics . . . 17

3 Method 19
3.1 Candy Crush Friends Saga . . . 19

3.2 Game Board Representation and Action Space . . . 22

3.3 Game Delimitations . . . 23

3.4 Agent Architecture . . . 23

3.5 Optimizer . . . 25

3.6 Improve Performance of Reinforcement Learning Algorithms . . . 27

3.7 Evaluation Metrics . . . 27

3.8 Experiments . . . 27

3.9 Hardware, Programming Language and Frameworks . . . 29

4 Results 31
4.1 Hyperparameter Search . . . 31

4.2 Training on Single Levels . . . 33


4.4 Generalization on Unseen Levels . . . 34

5 Discussion 35
5.1 Results . . . 35

5.2 Method . . . 37

5.3 Source Criticism . . . 39

5.4 The Work in a Wider Context . . . 39

6 Conclusion 41
6.1 Future Work . . . 42

Bibliography 43

A Appendix 48
A.1 Special Candy Combinations . . . 48

A.2 Scores . . . 49

A.3 Levels Used in Experiments . . . 50

A.4 Special Candy Creation . . . 51

A.5 Reward Function Search . . . 51

A.6 PPO Hyperparameter Search . . . 56


List of Figures

2.1 The interaction between the environment and the agent. . . 4

2.2 The policy and value function network in PPO. . . 7

2.3 Clipping of the surrogate objective [30]. . . 8

2.4 An artificial neuron. . . 10

2.5 A feedforward neural network. . . 11

2.6 A convolution operation. . . 12

2.7 Showing the effect of different learning rates. . . 15

2.8 Comparing grid search and random search [3]. . . . 17

2.9 Binary feature representation of black pawns in Chess. . . 18

3.1 Levels showing different characteristics of Candy Crush Friends Saga. . . 22

3.2 Game board tensor with 25 channels/feature planes. . . 23

3.3 Action space for a given state in CCFS. . . 24

3.4 High-level architecture of learning and playing. . . 24

3.5 The neural network architecture used in this research. . . 26

4.1 Scaled win rates for different reward functions . . . 32

4.2 Scaled win rates for different discounting factors γ . . . . 33

5.1 Development during training - Special candy reward function . . . 40

A.1 Levels used in experiments . . . 50

A.2 Illustrations of candy combinations to create special candies in the game of Candy Crush Friends Saga. . . 51

A.3 Episode reward - Reward function search. . . 51

A.4 Jam spread per action - Reward function search. . . 52

A.5 Score per action - Reward function search. . . 52

A.6 Episode length - Reward function search. . . 52

A.7 Episode length standard deviation - Reward function search. . . 53

A.8 Percentage special candies created - Reward function search. . . 53

A.9 Percentage color bombs created - Reward function search. . . 53

A.10 Percentage coloring candy created - Reward function search. . . 54

A.11 Percentage fish created - Reward function search. . . 54

A.12 Percentage horizontally striped created - Reward function search. . . 54

A.13 Percentage vertically striped created - Reward function search. . . 55

A.14 Percentage wrapped created - Reward function search. . . 55

A.15 Episode reward - PPO hyperparameter search. . . 56

A.16 Jam spread per action - PPO hyperparameter search. . . 56

A.17 Score per action - PPO hyperparameter search. . . 57

A.18 Episode length - PPO hyperparameter search. . . 57

A.19 Episode length standard deviation - PPO hyperparameter search. . . 58


A.21 Percentage special color bomb created - PPO hyperparameter search. . . 59

A.22 Percentage special coloring candy created - PPO hyperparameter search. . . 59

A.23 Percentage special fish created - PPO hyperparameter search. . . 60

A.24 Percentage special horizontally striped created - PPO hyperparameter search. . . . 60

A.25 Percentage special vertically striped created - PPO hyperparameter search. . . 61

A.26 Percentage special wrapped created - PPO hyperparameter search. . . 62

A.27 Scaled win rate - Training on single levels. . . 63

A.28 Episode reward - Training on single levels. . . 63

A.29 Jam spread per action - Training on single levels. . . 64

A.30 Score per action - Training on single levels. . . 64

A.31 Episode length - Training on single levels. . . 65

A.32 Episode length standard deviation - Training on single levels. . . 65

A.33 Percentage special candies created - Training on single levels. . . 66

A.34 Percentage special color bomb created - Training on single levels. . . 66

A.35 Percentage special coloring candy created - Training on single levels. . . 67

A.36 Percentage special fish created - Training on single levels. . . 67

A.37 Percentage special horizontally striped created - Training on single levels. . . 68

A.38 Percentage special vertically striped created - Training on single levels. . . 68


List of Tables

2.1 Activation functions used in convolutional neural networks. . . 13

3.1 Explanation of the different objectives in Candy Crush Friends Saga. . . 21

3.2 Explanation of all the friends in CCFS. . . 22

3.3 Channels/feature planes used in the input tensor. . . 25

3.4 Shared neural network architecture. . . 26

3.5 Policy neural network head architecture. . . 26

3.6 Value function neural network head architecture. . . 26

3.7 Information about the levels used in the experiments. . . 28

3.8 PPO hyperparameter default values. . . 28

3.9 Values used in discounting factor grid search. . . 29

3.10 Computational resources used in this research. . . 30

4.1 Scaled win rate comparison - Reward function search . . . 32

4.2 Scaled win rate comparison - PPO hyperparameter search . . . 32

4.3 Scaled win rate comparison - Training on single levels . . . 33

4.4 Scaled win rate comparison - Training on all levels . . . 34

4.5 Scaled win rate comparison - Generalization on unseen levels . . . 34

5.1 Discounted reward propagated back to actions during a winning game round, when γ = 0.8 is used. . . . 36

5.2 Scaled win rate comparison - Comparing to benchmarks . . . 37

5.3 Coefficients used in the special candy reward function . . . 38

A.1 The effect of combining different special candies in the game of Candy Crush Friends Saga. . . 48


1

Introduction

Games have been a part of human interaction since the early civilizations, with board-games such as Go1 from ancient China and Tic-Tac-Toe2 from ancient Egypt. Given the rise of the personal computer and later on the smartphone, digital games have become an integrated part of people's everyday life. As of 2018, the digital game market generated an astonishing $137.9 billion world-wide. Out of the total market revenue, mobile and tablet games were accountable for 51%, a share that is expected to grow in the upcoming years. [1]

Given the recent advances within a subfield of machine learning called reinforcement learning, several papers have shown that it is possible to create self-learning digital agents, agents that take actions and pursue strategies in complex environments without any prior knowledge. Some examples of games that have been mastered are Atari games [25, 30], Chess [35] and the Chinese board-game Go [32, 33].

Reinforcement learning is a class of algorithms that learns by “trial and error”. Given an environment, the agent interacts and takes actions resulting in positive and negative rewards received from the environment. The agent then learns what actions result in positive rewards and what actions result in negative rewards, given a certain environment state. By designing the rewards in a smart way, one can train the agent to perform tasks in the environment, such as winning a game of Chess, or finding its way out of a maze, purely by signaling good and bad behaviour. Sutton and Barto describe the concept in their book Reinforcement Learning - An Introduction as “...learning what to do – how to map situations to actions – so as to maximize a numerical reward signal” [37].

Board-games such as Chess and Go can be described as a process of strategic activity, where each player is trying to resist the opponent’s goals while forwarding their own. This requires the ability to evaluate the current state of the game board, plan ahead and decide on a good future game route. [27] A common factor of Chess, Go and many other zero-sum two-player board-games is the lack of randomness in the game dynamics. In these games, the difference between two consecutive states is small and deterministic given a certain action: even though the actual state transition is unknown to player A before player B makes his or her action, the state transition st → st+1, given action at, is known to player A. The ability

to compute different state transitions given the current action space At makes it possible to utilize long-term strategic planning when developing self-playing reinforcement learning algorithms, something that is crucial in the current state-of-the-art algorithm AlphaZero, used for playing Chess, Go and the Japanese game Shogi [34].

1 https://en.wikipedia.org/wiki/Go_(game)
2 https://en.wikipedia.org/wiki/Tic-tac-toe

One of the more popular genres of mobile and tablet games as of 2019 is single-player match-three tile games, with titles such as Bejeweled3 (2001) and Candy Crush Saga4 (2012).

These games are examples of grid-like games that have, in contrast to Chess and Go, nondeterministic state transitions due to a built-in randomness in the environment. When an action is performed in a match-three tile game, the game items involved disappear from the board and the removed tiles are replaced by new game items sampled from a pre-defined spawn rate. Even though the state transitions are nondeterministic, there is still a need for strategic planning, such as performing specific action combinations, to master match-three tile games. This raises an interesting question of how well the current state-of-the-art reinforcement learning algorithms perform when trained on a strategic game with nondeterministic state transitions.

One reinforcement learning algorithm that has gained a lot of attention since it was first published in 2017 is Proximal Policy Optimization (PPO) [30]. This method has shown good results on a variety of games: simple single-player games (a variety of Atari games [30]), difficult single-player games (Super Mario Bros [6]) and highly strategic multiplayer games (Dota 2 [26]).

This master thesis applies the Proximal Policy Optimization algorithm to play a nondeterministic match-three game. It investigates the possibility of using the algorithm to train a self-playing agent for this specific type of game. Moreover, the generalization of the approach is examined.

1.1

Motivation

Mobile and tablet games were accountable for 51% of the global digital game revenue of $137.9 billion world-wide, and one of the more popular genres of mobile and tablet games is the match-three tile game. When designing game levels, an important and central factor of the game-play is the objectives (also referred to as challenges) that the player faces; common examples of such objectives are time-related objectives and cleverness/logic-related objectives [8]. During the game design process, it is therefore important to test the objectives thoroughly before releasing the levels. An objective that is too hard to beat might cause players to stop playing and an objective that is too easy to solve might make the players bored. The process of testing could be automated with a human-like self-playing agent, something that is of great interest to companies developing games, as money could be saved and time to release could be shortened [14].

The research was conducted as a master thesis project at Midasplayer AB (known as King) in Stockholm, Sweden.

1.1.1

King

King is one of the largest mobile game developers in the world, with approximately 272 million monthly active users as of Q1 2019 and 2000 employees in 11 different locations. King’s most famous and popular game series is the Candy Crush Franchise, including four different match-three tile games: Candy Crush Saga, Candy Crush Soda Saga, Candy Crush Friends Saga and Candy Crush Jelly Saga.

Currently, when a new Candy Crush level is designed and developed, a supervised machine learning bot is used to test the game level, its difficulty and dynamics [14]. During training, the bot is shown game states (a 9x9 board grid with different game items on the tiles) from a large data set of states collected from human game-play. When the bot makes an action (i.e. switches the positions of two candies on the grid), the action is compared to the action made by a human. The bot’s decision is then adjusted if its action differs from the action of the human. In this way, the bot learns to play the game in a human-like fashion.

3 https://www.ea.com/en-gb/games/bejeweled/bejeweled-classic
4 https://king.com/game/candycrush

The Candy Crush games are continuously evolving and new dynamics are often added to the games (e.g. new special candies, blockers and objectives). When a new level is designed using completely new dynamics, there is no human game-play for the bot to learn from, thus the bot is incapable of testing the difficulty of the game level. Human game testers are then needed to a greater extent, leading to larger costs and longer lead times. It is not until after the level has been released to the public and human game-play data has been collected that the bot can learn the new dynamics. For this reason, it is of great interest for King and other gaming companies to develop bots that are capable of learning the game dynamics without human game-play data, which has been accomplished for other board-games by implementing different reinforcement learning algorithms.

1.1.2

General Application

As mentioned above, this master thesis investigates the possibility of training an agent with the Proximal Policy Optimization algorithm to play a nondeterministic match-three game. As a lot of other puzzle games share the dynamics of nondeterministic state transitions, the research serves a broader general interest.

1.2

Aim

The aim of this research is to investigate and contribute to the field of reinforcement learning by analyzing the performance and generalization of Proximal Policy Optimization when trained to play a nondeterministic match-three game.

1.3

Research Questions

1. Can a higher win rate be achieved when training an agent to play a nondeterministic match-three game with Proximal Policy Optimization compared to average human game-play performance?

2. What win rate can be achieved when training the agent on levels with a certain objective with Proximal Policy Optimization and testing the agent on unseen levels with the same objective?

1.4

Delimitations

To be able to train a reinforcement learning algorithm, the agent has to interact with an environment. In the case of games, that usually consists of the game code. Due to competition, very few gaming companies release the code of their games to the public. This research will therefore be restricted to only use the environment of Candy Crush Friends Saga, a match-three game produced by King.

Candy Crush Friends Saga consists of five different level objectives spread among more than 1300 levels. To limit the scope, this thesis examines the proposed approach on 5 different levels which have the same type of objective. Extending the model to more levels and objectives will be left to further research.

The research will also be restricted to training a reinforcement learning algorithm to play as well as possible. How to limit a reinforcement learning model to play in a human fashion and how to predict game-play difficulty from game-play is out of scope.


2

Theory

This chapter presents the theoretical concepts used in this research. First, Section 2.1 describes the different elements involved in reinforcement learning. Then, in Section 2.2, the general concept of policy gradient methods, a family of reinforcement learning algorithms, is explained. Section 2.3 presents an in-depth explanation of the proximal policy optimization algorithm, which is the focus of this thesis. Sections 2.5, 2.6 and 2.7 describe neural network techniques relevant to this study, feedforward neural networks and convolutional neural networks, as well as neural network training techniques. Sections 2.8, 2.9 and 2.10 cover how to improve the performance of reinforcement learning algorithms, commonly used game board representations and evaluation metrics.

2.1

Elements of Reinforcement Learning

Even though there exists a variety of very different reinforcement learning algorithms, most of them are comprised of the same elements: an agent, an environment, a value function, an action-value function, a policy and a reward signal. Another important concept is exploration versus exploitation.


2.1.1

Environment

At the center of the reinforcement learning algorithm lies the environment, which is the “world” the agent interacts with. Like in the real world, there is a concept of cause and effect, i.e. given a state st of the environment and an interaction at, the environment will return a new state st+1 and a reward signal rt. The new state may or may not be different from the previous state, depending on the action that was taken. [37] Due to complexity, the digital implementation of the environment is usually dense and only includes dynamics that affect the problem to be solved, e.g. the video game code when the objective is to train a self-playing bot or the stock market when trying to create a trading bot.

2.1.2

Agent

The agent (also called bot or robot) is a digital entity that interacts with the environment by taking actions by following a policy. The goal of the agent is to maximize the total cumulative reward that it gets from the environment. [37]

2.1.3

Policy

The policy, often denoted as π, can be viewed as the “brain” of the agent, as it determines what action at the agent shall take in a given state st of the environment. The policy is considered to be the core of the algorithm and is by itself sufficient to determine the behaviour of the agent [37]. The set of possible actions the policy can choose from is defined by the environment and is called the action space, denoted by A.

2.1.4

Reward Signal

When the policy has chosen an action at, the agent sends the action to the environment, which returns the next state st+1 and a reward signal rt. The reward signal is a real number based on a reward function designed to encourage the agent to take actions that either directly or indirectly fulfill an objective in the environment (e.g. winning a game or fulfilling a certain matching criteria). It is a short-term indicator of a good or bad action and the purpose of the reinforcement learning algorithm is to adjust its policy to take actions that maximize the numerical reward signal. [37]

2.1.5

Value Function

The value function Vπ(s) is the expected return, i.e. the total cumulative future reward in a given state s, when following the policy π. While the reward signal is a short-term indicator, the value function indicates how good a state of the environment is long-term. [37]

2.1.6

Action-Value Function

The action-value function Qπ(s, a) is the expected return of a given state s, when first taking action a in state s and then following the policy π. [37]

2.1.7

Advantage

The advantage A(s, a) can be interpreted as how much better (A(s, a) > 0) or worse (A(s, a) < 0) the action a taken in state s was, compared to if the policy π was completely followed.


2.1.8

Exploration Versus Exploitation

The exploration versus exploitation dilemma is an essential part of every reinforcement learning algorithm. It is the trade-off between exploiting the best currently known action in a given state, versus trying something new. If the agent constantly exploits its current knowledge, it might miss out on other, possibly better, actions. On the other hand, if the agent explores too much, it does not utilize the knowledge that it has gained in the past. [37]

2.2

Policy Gradient Methods

Policy gradient methods are a branch of reinforcement learning algorithms that learn a policy that is only dependent on a given state s and a set of parameters θ when taking an action a. In the case of the policy being modeled with a neural network (see section 2.6 Convolutional Neural Networks), the parameters θ are the weights of the network. The probability of taking action a with such a method in time step t, given state s and parameters θ, is:

π(a|s, θ) = Pr(At = a | St = s, θt = θ) (2.2)

Compared to many other branches of reinforcement learning algorithms, policy gradient methods do not depend on the value function when selecting actions, even though they might use a value function during training to update the policy. A policy gradient method that learns to approximate the value function and then uses it during training is often referred to as an actor-critic method. [37]

When training the policy, a numerical performance objective J(θ) is used to measure how well the algorithm performs, given the current set of policy parameters θ. When learning these parameters, the policy gradient method aims to maximize the objective J(θ) by computing a stochastic approximation of gradient ascent (see subsection 2.7.3 Gradient-Based Optimization):

θt+1 = θt + e ∇̂J(θt) (2.3)

The expectation of ∇̂J(θt) ∈ R^d, i.e. E[∇̂J(θt)], is the stochastic approximation of the gradient of the numerical performance objective with respect to the policy parameters θt [37]. e refers to the learning rate (see subsubsection 2.7.3.1 Learning Rate).

A natural choice of J(θ) is the expected return in each time step given a set of parameters:

J(θt) = E[Gt | θt] (2.4)

Gt = rt+1 + γ rt+2 + γ² rt+3 + ... = Σ_{k=0}^{N} γ^k rt+k+1 (2.5)

where γ is the discounting factor, 0 ≤ γ ≤ 1, and rt is the reward signal in time step t [37].

In 1992, Ronald J. Williams proposed an unbiased gradient estimator of E[Gt | θt]:

∇θ loge π(at|st, θt) (Gt − bt) (2.6)

where loge π(at|st, θt) is the natural logarithm of the policy and bt is a function called the baseline [39]. Subtracting the baseline from the return lowers the variance, which decreases the training time while keeping the expected value of the gradient the same. The only restriction is that the baseline cannot depend on the action a. [37] If an estimation of the value function V(st) is used as the baseline bt, the subtraction Gt − bt is an approximation of the advantage, given the state st and action at:

Gt − bt = Gt − V(st) ≈ A(st, at) (2.7)
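To make the estimator in Equation 2.6 concrete, the following is a minimal sketch (not taken from the thesis) of a REINFORCE-style gradient estimate with a value-function baseline. It assumes a simple linear softmax policy so that the log-policy gradient can be written analytically; the function and variable names are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = r_{t+1} + gamma*r_{t+2} + ..., computed backwards over one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_with_baseline_grad(theta, states, actions, rewards, baseline, gamma=0.99):
    """Gradient estimate of J(theta) for a linear softmax policy pi(a|s) ~ exp(s . theta[:, a]).

    `baseline(s)` plays the role of b_t in Eq. 2.6; subtracting it reduces variance
    without biasing the estimator. theta has shape (state_dim, n_actions).
    """
    grad = np.zeros_like(theta)
    G = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, G):
        logits = s @ theta                      # (n_actions,)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # grad of log pi(a|s) for a linear softmax policy: outer(s, one_hot(a) - probs)
        dlogpi = np.outer(s, -probs)
        dlogpi[:, a] += s
        grad += dlogpi * (g - baseline(s))      # (G_t - b_t) scaling from Eq. 2.6
    return grad / len(states)
```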


2.3

Proximal Policy Optimization

Proximal Policy Optimization (PPO) is a robust [6], data efficient and reliable policy gradient reinforcement learning algorithm that was first introduced by John Schulman et al. in 2017. The purpose was to introduce an algorithm that used the benefits of another policy gradient method called Trust Region Policy Optimization [29], but that was more general and simpler to implement. [30]

PPO is an actor-critic policy gradient method, which means that it learns both a policy π (actor) and a value function V(s) (critic) at the same time during training. The policy and the value function are typically modeled with a neural network where they share the same initial parameters, but then split to have their own outputs (see Figure 2.2). This actor-critic network takes the current state st of the environment as input and outputs both a vector yt with the same length as the action space A (the policy) and a single value V(st) (the approximated value in st). The policy network’s raw output is often referred to as logits. The policy then chooses the next action at+1 by sampling:

at+1 = argmax[yt − loge(−loge(u))], u ~ Uniform(0, 1) (2.8)

where yt is the logits vector in time step t. This sampling approach is called the Gumbel-Max trick [23]. It is equivalent to sampling from a softmax distribution, while no softmax transformation of the logits is needed. This trick is used as the logits are later needed for calculating the entropy bonus (see subsubsection 2.3.1.3 Entropy Bonus).
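As an illustration of the sampling step in Equation 2.8, a minimal sketch assuming the logits are available as a NumPy vector:

```python
import numpy as np

def sample_action_gumbel_max(logits, rng=np.random.default_rng()):
    """Sample an action index from unnormalized logits via the Gumbel-Max trick.

    Equivalent to sampling from softmax(logits), but no explicit softmax
    transformation of the logits is needed.
    """
    u = rng.uniform(low=1e-10, high=1.0, size=logits.shape)  # avoid log(0)
    gumbel_noise = -np.log(-np.log(u))
    return int(np.argmax(logits + gumbel_noise))
```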

Figure 2.2: The policy and value function network in PPO.

The algorithm iterates between a rollout phase and a learning phase. During the rollout phase, a predefined number of actions (time steps) are taken in the environment by applying the current policy. For each time step, the algorithm stores the input state st with the corresponding action at, the probability of action at, the gained reward signal rt, the logit y(at) for the chosen action and the approximated value V(st). With the gained reward signal and the approximated value, a generalized advantage estimation Ât is calculated [30]:

Ât = δt + (γλ)δt+1 + ... + (γλ)^(T−t+1) δT−1 (2.9)

δt = rt + γV(st+1) − V(st) (2.10)

where 0 ≤ γ ≤ 1 is the discounting factor, 0 ≤ λ ≤ 1 is a factor for the trade-off of bias and variance for the generalized advantage estimator and t ∈ {1, ..., T}. The information gathered during the rollout phase is then used as the data set for updating the policy and the value function, i.e. the weights of the network, during the learning phase. After the learning phase, the data is discarded and the process starts over again.
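The generalized advantage estimation in Equations 2.9 and 2.10 can be computed backwards over a rollout; a minimal sketch (a hypothetical helper, not the thesis's implementation):

```python
import numpy as np

def generalized_advantage_estimates(rewards, values, gamma, lam):
    """Compute GAE advantages A_hat_t for one rollout of length T.

    `values` must contain T+1 entries V(s_0), ..., V(s_T); the last entry
    bootstraps the value after the final stored transition.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. 2.10
        gae = delta + gamma * lam * gae                          # Eq. 2.9, rolled up backwards
        advantages[t] = gae
    return advantages
```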


Figure 2.3: Clipping of the surrogate objective [30]: (a) clipping when the advantage is positive, (b) clipping when the advantage is negative.

2.3.1

Objective

PPO learns its policy by maximizing a numerical performance objective J(θ) that is combined out of three sub-objectives:

Jt^(CLIP+VF+S)(θ) = Êt[Jt^CLIP(θ) − c1 Jt^VF(θ) + c2 S[πθ](st)] (2.11)

where Jt^CLIP(θ) is a clipped surrogate objective, Jt^VF(θ) is a squared-error value function loss and S[πθ](st) is an entropy bonus. 0 ≤ c1 ≤ 1 is the value function loss coefficient and 0 ≤ c2 ≤ 1 is the entropy bonus coefficient.

2.3.1.1 Clipped Surrogate Objective

The first term, Jt^CLIP(θ), is a clipped surrogate objective computed as follows:

Jt^CLIP(θ) = Êt[min(rt(θ)Ât, clip(rt(θ), 1 − e, 1 + e)Ât)] (2.12)

rt(θ) = πθ(at|st) / πθold(at|st) (2.13)

where πθ(at|st) is the probability of action at given state st under the new policy and πθold(at|st) is the probability of action at given state st under the old policy. rt(θ) makes up a probability ratio that tells how much more or less probable an action is in the new policy compared to the old policy. The general idea of the clipped surrogate objective is to multiply the probability ratio rt(θ) with the advantage estimation Ât. If Ât is positive, the action at was better than what the value function estimated and the policy will be updated in favour of action at, and vice versa. By multiplying Ât with rt(θ) instead of with loge πθ(at|st) as Williams proposed [39], one limits the size of the policy update, thus limiting the risk of adjusting the policy too much.

To limit the size of the policy update even more, a clipping function clip() is used together with a min() function. If the advantage Ât is positive, the probability ratio is clipped at 1 + e, and if the advantage Ât is negative, the probability ratio is clipped at 1 − e (see Figure 2.3). Thus, the clipping function will only clip the probability ratio when it would lead to an improved objective. [30]


2.3.1.2 Squared-Error Value Function Loss

The second term, Jt^VF(θ), is the squared-error loss of the value function:

Jt^VF(θ) = (Vθ(st) − Vt^targ)² (2.14)

where Vθ(st) is the approximated value in state st, given the parameters θ, and Vt^targ is the return in state st [30]:

Vt^targ = rt + γV(st+1) (2.15)

2.3.1.3 Entropy Bonus

The last term, S[πθ](st), is the entropy of policy π’s logits in a given state st. By maximizing this entropy bonus, the algorithm discourages the policy from prematurely converging towards highly-peaked logits, thus increasing the exploration [41, 24].
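Putting the three terms together, the following is a minimal NumPy sketch of the quantity in Equation 2.11 for a batch of stored time steps. It is an illustration only: in practice the objective is expressed in an automatic differentiation framework so it can be maximized with respect to θ, and the coefficient values shown are placeholders, not the ones used in this thesis.

```python
import numpy as np

def ppo_objective(new_probs, old_probs, advantages, values, value_targets,
                  logits, clip_eps=0.2, c1=0.5, c2=0.01):
    """Evaluate J^(CLIP+VF+S) from Eq. 2.11 over a batch of time steps."""
    # Clipped surrogate objective, Eq. 2.12-2.13
    ratio = new_probs / old_probs
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    j_clip = np.minimum(ratio * advantages, clipped * advantages)

    # Squared-error value function loss, Eq. 2.14
    j_vf = (values - value_targets) ** 2

    # Entropy bonus of the action distribution, computed from the logits
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=-1)

    return np.mean(j_clip - c1 * j_vf + c2 * entropy)
```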

2.4

Designing the Reward Function

As previously mentioned, the reward signal is a real number based on a reward function designed to encourage the agent to take actions that either directly or indirectly fulfill an objective in the environment. It is therefore of great importance to design the reward function in an unbiased way, without adding prior knowledge of how the objective shall be achieved. [37] If done wrong, one might experience the Cobra Effect, where the agent converges to learn something unexpected as a result of a poorly shaped reward function.

There are two main types of rewards: sparse rewards and non-sparse rewards. A sparse reward signal is only sent to the agent when it achieves the actual goal, e.g. winning a game or completing a task. When creating a self-playing game agent, a frequently used sparse reward is +1 for winning, -1 for losing and 0 for all intermediate steps [34, 32]. This approach is recommended by Sutton and Barto, as it directly rewards a success and punishes a failure, without adding any bias. [37]

One challenge that often arises when training an agent with a sparse reward is the low frequency of positive rewards early on in training, as the terminal success state may occur very rarely. This might lead to the agent aimlessly trying out different actions for a long period of time before receiving any positive reward, a problem referred to as the Plateau problem. A solution to this problem is to introduce a non-sparse reward signal, an intermediate reward signal that guides the agent towards the goal in a more effective way. A popular non-sparse reward signal when creating score-maximizing self-playing game agents is the delta score, i.e. the score gained by an action. The “danger” of non-sparse rewards is that it is tempting to inject prior knowledge by rewarding subgoal achievements, something that might lead to unexpected results. [37]
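As a small illustration of the two reward styles discussed above (the function names and arguments are hypothetical, not the reward functions evaluated in this thesis):

```python
def sparse_reward(episode_done, episode_won):
    """+1 for a win, -1 for a loss, 0 for every intermediate action."""
    if not episode_done:
        return 0.0
    return 1.0 if episode_won else -1.0

def delta_score_reward(previous_score, current_score):
    """Non-sparse reward: the score gained by the latest action."""
    return float(current_score - previous_score)
```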

Another alternative to solve the initially low frequency of positive rewards is reward shaping. Similar to curriculum learning (see subsection 2.8.1 Curriculum Learning), reward shaping focuses on gradually increasing the problem difficulty the agent is facing. In curriculum learning, this is achieved by gradually showing the agent more and more difficult tasks, while in reward shaping, the reward is transformed from an initial intermediate non-sparse reward to a sparse reward focusing only on the final goal. [37]

2.5

Feedforward Neural Networks

The policy π and value function V(s) of the PPO algorithm are modeled with a machine learning technique called a feedforward neural network. A feedforward neural network is a learning algorithm inspired by biological learning, and in some way imitates how learning happens in the brain. The purpose of the algorithm is to learn the best possible approximation of a function f*, where in the case of PPO, f* is the optimal policy and value function. [12]

The basic building block of a feedforward neural network is a unit called an artificial neuron. The neuron is made up of two operations. The first operation inserts the input data x1, ..., xD into a linear combination:

a = Σ_{i=1}^{D} wi xi + b0 (2.16)

where wi are the coefficients (weights) and b0 is the fixed offset (bias). The output a is known as the activation. [4]

The second operation takes the activation and inputs it into a nonlinear, differentiable function h(·), called the activation function [4]:

z = h(a) (2.17)

Figure 2.4: An artificial neuron.

When multiple artificial neurons are stacked on top of each other, a neural network layer is formed. The inputs x1, ..., xD are then fed into M linear combinations (one for each neuron in the layer) and each activation is inserted into the activation function:

aj = Σ_{i=1}^{D} w_ji^(p) xi + b_j0^(p) (2.18)

zj = h(aj) (2.19)

where (p) indicates that the parameter belongs to the pth layer of the network and j indicates the jth artificial neuron in layer p. When multiple interconnected layers are stacked one after another and the output from one layer’s neurons is the input to the neurons of the next layer, a fully connected feedforward neural network is constructed.
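Equations 2.18 and 2.19 amount to a matrix-vector product followed by an element-wise activation; a minimal sketch of one fully connected layer, assuming NumPy arrays:

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    """One feedforward layer: a_j = sum_i w_ji * x_i + b_j, followed by z_j = h(a_j).

    x: (D,) input vector, W: (M, D) weights for M neurons, b: (M,) biases.
    """
    a = W @ x + b          # activations, Eq. 2.18
    return activation(a)   # layer output, Eq. 2.19
```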

2.6

Convolutional Neural Networks

A convolutional neural network (CNN), first described by Yann LeCun in 1989 [22], is a special type of feedforward neural network designed to utilize a type of linear operation called a convolution. Due to the multidimensional nature of this operation, the CNN architecture is often used when the input data has a grid-like structure, like images, game boards and time-series data [12]. The technique gained increased popularity in 2012, when Alex Krizhevsky et al. won the image classification contest ILSVRC with AlexNet, a deep convolutional neural network. Their result lowered the state-of-the-art top-5 classification error rate by 10.9 percentage points, from 26.2 percent to 15.3 percent. [21] Since then, all winners of the ILSVRC contest have utilized the CNN architecture, pushing the state of the art within image classification even further [42, 38, 15, 31, 17].

Figure 2.5: A feedforward neural network.

2.6.1

Convolutional Layer

At the core of the convolutional neural network lies the convolution, a linear operation that is, in its general form, performed on the two functions x and w:

s(t) = ∫ x(a) w(t − a) da = (x ∗ w)(t) (2.20)

In the convolution used in the convolutional layer, the function x is a tensor (multidimensional array) called the input and the function w is a tensor called the kernel or filter. Each index in the kernel is a real-valued number that corresponds to the weights of the artificial neuron described under section 2.5 Feedforward Neural Networks. The convolution is executed by the kernel w sliding over spatial regions of the input x. For each region, it computes and outputs a linear combination of the weights in the kernel and the corresponding indices in the input (see Figure 2.6). When the convolution operates on a 2-D kernel tensor and a 2-D input tensor, the operation is defined as:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n) (2.21)

where I is the input and K is the kernel. The output S(i, j), referred to as the feature map, is then passed through the nonlinear, differentiable activation function described under section 2.5 Feedforward Neural Networks. As the input data often contains a lot of information, a common approach is to have multiple kernels in each layer, each with its own set of weights and biases. By creating multiple feature maps that hopefully extract different kinds of information from the input data x, the model becomes more effective. [4, 12] Another approach used in multiple recent state-of-the-art convolutional neural network architectures is to have an increasing number of kernels, layer by layer, when going deeper into the network [15, 36, 25].

One important idea of convolutional neural networks is sparse interactions (also called sparse weights). As the kernel is of a smaller dimension than the input, less memory is required and statistical efficiency is improved. Also, by letting the kernels slide over all spatial regions of the input, another important idea arises, parameter sharing. Only one set of parameters needs to be stored per kernel, which reduces the storage requirements for the parameters. [12]
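As an illustration of Equation 2.21, a direct (unvectorized) sketch of sliding a 2-D kernel over a 2-D input without padding; note that, like most deep learning libraries, it computes the cross-correlation variant rather than flipping the kernel:

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """Slide a 2-D kernel over a 2-D input and return the feature map (no padding)."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Linear combination of the kernel weights and the input region
            out[i, j] = np.sum(inp[i:i + kh, j:j + kw] * kernel)
    return out
```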


Figure 2.6: A convolution operation.

2.6.1.1 Activation Functions

Similar to the feedforward neural network, the output of the convolution in the CNN is fed into a nonlinear activation function, a stage that is often referred to as the detector stage. Among the more popular activation functions are the Rectified linear unit (ReLU), the Sigmoid function and the Hyperbolic tangent (tanh).

As seen in Table 2.1, the logistic sigmoid and hyperbolic tangent, referred to as sigmoidal units, flatten out when x increases or decreases. For this reason, the derivative of sigmoidal units, given a large positive or negative input x, is close to zero. Ian Goodfellow et al. therefore discourage the use of sigmoidal units in feedforward neural networks when combined with gradient-based learning, as it might result in vanishing gradients. They instead encourage the use of rectified linear units or their generalizations. [12]

2.6.1.2 Batch Normalization

To efficiently train a neural network, it is important that the input data has the same distribution throughout training, a property that also holds for the input to each individual convolutional layer within the network (as they can be looked upon as sub-networks). As the weights of the network are continuously changing during training, so is the input distribution to each individual layer. This phenomenon, referred to as the Internal Covariate Shift [18], slows down the training. A solution proposed by Sergey Ioffe and Christian Szegedy is batch normalization, where the minibatch B = {x1, ..., xm} (see subsection 2.7.2 Minibatch) inputted to each layer is normalized to have a mean of 0 and standard deviation of 1, thus fixing the input distribution to each layer.

2.6.2

Fully Connected Layer

When a convolutional neural network is used as a classifier (e.g. classify which action to take in a given state), multiple fully connected, feedforward layers (see section 2.5 Feedforward Neural Networks) are commonly inserted at the end of the network [21]. While the convolutional layers learn to extract valuable information into feature maps, the fully connected layers learn to classify the feature maps into predefined classes.

Rectified Linear Unit (ReLU): f(x) = 0 for x < 0, f(x) = x for x ≥ 0
Logistic Sigmoid: f(x) = σ(x) = 1 / (1 + e^(−x))
Hyperbolic Tangent: f(x) = tanh(x) = 2σ(2x) − 1

Table 2.1: Activation functions used in convolutional neural networks.

2.6.3

Padding

When performing a convolution, the nature of the operation shrinks the dimensions, thus making the output smaller than the input. To avoid this spatial shrinkage, a common technique is to make the input dimension larger by adding zeros along the edges, a process referred to as zero padding. A popular zero padding approach is to add just as many zeros as needed to give the output the same dimension as the input, an approach called same zero padding. [12]

2.6.4

Pooling

The purpose of the convolutional layers is to, given an input, extract feature maps that contain information at different abstraction levels. To make the layers approximately invariant to small translations of the input (e.g. small rotations), a pooling operation can be used. The pooling operation alters the outputs of a layer by outputting a summary statistic of the neighbouring outputs. Popular summary statistics are max pooling and average pooling, where max pooling outputs the maximum value of the neighbouring outputs, and average pooling outputs an average value of the neighbouring outputs. [12]


2.7

Training a Convolutional Neural Network

Training neural networks is a difficult task that differs from the traditional optimization, where the main goal is to minimize some cost function J directly. When optimizing a neural network, one is commonly interested in minimizing/maximizing some performance metrics P, such as Win rate, Accuracy or Precision. This is done indirectly by adjusting the network weights θ to minimize some loss function J(θ), that hopefully will improve the performance metrics P. [12]

2.7.1

Initialize Network Weights

Before the network training starts, the weights of the network have to be initialized to some values, a process that can have a tremendous effect on the training. The initialization can not only determine how fast the network converges, but, if done wrong, result in the training failing altogether due to numerical problems. [12]

One important characteristic of the initialization is that it has to be done asymmetrically between kernels that have the same input and activation function. This is usually achieved by sampling from either a Gaussian or a uniform distribution. [12] A widely used heuristic suggested by Glorot and Bengio in 2010 is the normalized initialization:

W_{i,j} ~ U(−√(6/(m+n)), +√(6/(m+n))) (2.22)

where m is the number of inputs and n is the number of outputs. [11]
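A small sketch of the normalized initialization in Equation 2.22 for a single weight matrix (the function name is illustrative):

```python
import numpy as np

def normalized_init(n_in, n_out, rng=np.random.default_rng()):
    """Sample an (n_in, n_out) weight matrix from U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out)))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))
```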

2.7.2

Minibatch

When updating the network weights θ to optimize some cost function J(θ), a common approach is to use a subset of the data set as input for each update, a so-called minibatch. The size of the minibatch is usually a power of 2, ranging from 16 to 256 data points. Even though larger minibatches tend to result in a more accurate approximation of the gradient [12], a recent study by Nitish Shirish Keskar et al. [19] showed that larger minibatches also lead to a degradation in generalization. As larger minibatches approximate the gradients more accurately, they tend to converge to sharper minima (worse generalization), while smaller minibatches approximate the gradients less accurately due to inherent noise, and thus converge to flatter minima (better generalization).

2.7.3

Gradient-Based Optimization

Optimizing a neural network refers to the minimization of a loss function J(θ). In the case of Proximal Policy Optimization, it maximizes the objective Jt^(CLIP+VF+S)(θ), something that is achieved by minimizing its negative counterpart:

J(θ) = −Jt^(CLIP+VF+S)(θ) = −Êt[Jt^CLIP(θ) − c1 Jt^VF(θ) + c2 S[πθ](st)] (2.23)

The most common approach used when optimizing the weights of a neural network is gradient-based optimization, methods that use the simple concept of gradients, a multi-variable generalization of the derivative. The gradient of J(θ) with respect to θ, i.e. ∇θJ(θ), represents the multi-variable slope of J in θ. ∇θJ(θ) contains the information of how a small change of the input θ will approximately affect the output J:

J(θ + e) ≈ J(θ) + e ∇θJ(θ) (2.24)

The technique of moving θ, i.e. the weights of the network, in a direction that minimizes J is called gradient descent. The goal is to find a global minimum, the values of θ where J(θ) has its lowest point. [12]


Figure 2.7: The effect of different learning rates: (a) too small, (b) good, (c) too large.

One of the most frequently used gradient descent algorithms within the machine learning field is the minibatch gradient descent algorithm. Instead of calculating the gradient and a weight update for each data sample separately, an average gradient is calculated over a minibatch (see subsection 2.7.2 Minibatch). By doing so, an unbiased approximation of the gradient can be achieved [12]:

θt+1 = θt − e ∇θ (1/m) Σ_{i=1}^{m} fi(θ) (2.25)

where e is the learning rate (step size) and m is the minibatch size.

2.7.3.1 Learning Rate

The learning rate (e) is a hyperparameter that, as mentioned in subsection 2.7.3 Gradient-Based Optimization, is the size of the steps taken in the direction of the gradient when performing gradient-based optimization [12]. Given a too large step size, one might miss the optimum, and given a too small step size, the learning process becomes slow (see Figure 2.7). Yoshua Bengio recommends a learning rate between 1 and 1e-6, but also points out that it is dependent on the model’s parametrization. [2] In the experiments conducted by Schulman et al. in the original Proximal Policy Optimization paper, initial learning rates of magnitude 10^-4 were used together with the gradient-based optimizer Adam (see subsubsection 2.7.3.2 Adam). In one of their experiments, Schulman et al. also used a linearly annealing learning rate of 2.5 x 10^-4 x α, where α was linearly annealed from 1 to 0 during learning. [30]

2.7.3.2 Adam

Adam, derived from Adaptive Moment Estimation, is an efficient first-order gradient-based optimization algorithm, first proposed by Diederik P. Kingma and Jimmy Lei Ba in 2014 [20]. It is the native optimization algorithm used in the original PPO paper, has been shown to perform better on multiple problems than minibatch gradient descent and is recommended by [28] when working with sparse input data (i.e. data containing mostly zeros).


Instead of using the same fixed learning rate e for all weights during an update, as minibatch gradient descent does, Adam utilizes the first and second moments of the gradients to compute an adaptive learning rate for each weight at each update:

gt = ∇θ ft(θt−1)
mt = β1 mt−1 + (1 − β1) gt
vt = β2 vt−1 + (1 − β2) gt²
m̂t = mt / (1 − β1^t)
v̂t = vt / (1 − β2^t)
θt = θt−1 − α m̂t / (√v̂t + e) (2.26)

where m0 = v0 = 0 and the recommended default values for the coefficients are β1 = 0.9, β2 = 0.999 and e = 1e-8.
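A minimal, stateful sketch of one Adam update following Equation 2.26 (not the implementation used in this research; the optimizer state is kept in a plain dictionary):

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to the parameter array `theta` given its gradient."""
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(theta)) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", np.zeros_like(theta)) + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)
```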

2.7.4

Hyperparameter Tuning

Hyperparameters are scalar values that affect the behaviour of machine learning algorithms such as neural networks and Proximal Policy Optimization. As opposed to other parameters, hyperparameters are not learned during training. In a convolutional neural network, the most common hyperparameters are the learning rate, the kernel size and the number of convolutional layers. [12] In the Proximal Policy Optimization algorithm, the hyperparameters are the clipping rate e, the value function coefficient c1, the entropy bonus coefficient c2, the discounting factor γ, the generalized advantage estimation parameter λ, the number of parallel workers and how many time steps the workers should run before updating the policy [30].

Hyperparameters can dramatically affect the behaviour of the algorithm and how it converges, thus requiring deep knowledge if manually set. A common alternative approach is to do an automated hyperparameter search, where an algorithm is used to determine a well-suited combination of hyperparameters.

2.7.4.1 Grid Search

Grid search is a structured way of testing hyperparameter combinations, often used when there are three or fewer hyperparameters. A small set of values is selected for each hyperparameter. A model is then trained and evaluated for every combination of hyperparameters and the combination that yields the best result is chosen. [12]

2.7.4.2 Random Search

A more efficient way of doing hyperparameter search is the random search suggested by James Bergstra and Yoshua Bengio, where the hyperparameters are randomly sampled within pre-defined ranges. Their research shows that in a given problem, some hyperparameters are often more sensitive to change than others, and thus more important to evaluate. If the designer of the algorithm knows the sensitivity of each hyperparameter and how they affect the machine learning algorithm, he or she can easily set up well-suited grids and do grid search. But there is often a lack of understanding regarding the effect of the hyperparameters and it is therefore hard to design a good grid search. As seen in Figure 2.8, out of 9 hyperparameter combinations, random search results in 9 trials of the important hyperparameter, while grid search only results in 3 trials. [3]
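A sketch of such a random search over two PPO-related hyperparameters; `train_and_evaluate` is a hypothetical stand-in for a full training run that returns, for example, a scaled win rate:

```python
import random

def random_search(train_and_evaluate, n_trials=9, seed=0):
    """Sample hyperparameters from predefined ranges and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-5, -3),   # log-uniform range
            "gamma": rng.uniform(0.8, 0.999),             # discounting factor range
        }
        score = train_and_evaluate(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best
```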


Figure 2.8: Comparing grid search (a) and random search (b) [3].

2.8

Improve Performance of Reinforcement Learning Algorithms

This section discusses approaches that have resulted in improvements in the performance of reinforcement learning and other machine learning algorithms.

2.8.1

Curriculum Learning

Curriculum learning is a machine learning approach inspired by a common concept in human learning: the idea of increasing complexity, where examples are presented in a meaningful sequential order. When training the neural network, instead of uniformly sampling training examples of varying difficulty, examples are sampled with increasing difficulty, starting with the simplest examples. This approach has shown good results on a variety of natural language processing and computer vision problems. [12]

2.8.2

Masking out Illegal Actions

In grid-like board-games such as Chess, Go and Candy Crush, not all actions are legal in any given state. For example, the rules of Go forbid a stone to be placed on top of another stone and the rules of Chess forbid the pawn to be moved backwards. Even though it is possible for a reinforcement learning algorithm to learn to distinguish between legal and illegal actions by giving a negative reward signal to illegal actions, it adds another dimension to the problem, thus making it harder to solve. One solution used by David Silver et al. is to mask out illegal actions by setting their probabilities to zero [35].
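One common way to realize this masking is to push the logits of illegal actions towards a very large negative value before sampling, so that their probabilities become (numerically) zero; a minimal sketch, assuming the environment can provide a boolean legality mask:

```python
import numpy as np

def mask_illegal_logits(logits, legal_mask):
    """Set the logits of illegal actions to a very large negative value.

    legal_mask: boolean array, True where the action is legal in the current state.
    Sampling (e.g. via the Gumbel-Max trick) from the masked logits then never
    selects an illegal action.
    """
    return np.where(legal_mask, logits, -1e9)
```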

2.9

Game Board Representation

When representing a rectangular, grid-based game board (such as Chess or Go), a frequently used representation is a tensor containing sparse, binary feature planes (see Figure 2.9). The height and width represent the game board, and each feature plane (depth) in the tensor represents a different characteristic of the game. Examples of common feature planes are the location and characteristics of the game pieces. [35, 14, 32]
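As an example of such a representation, a sketch that one-hot encodes an integer-coded board into binary feature planes; the integer coding and the number of planes are illustrative, not the exact channels used in this research:

```python
import numpy as np

def board_to_feature_planes(board, n_planes):
    """Turn an (H, W) board of integer tile codes into an (H, W, n_planes) binary tensor."""
    H, W = board.shape
    planes = np.zeros((H, W, n_planes), dtype=np.float32)
    for c in range(n_planes):
        planes[:, :, c] = (board == c)   # plane c marks the tiles holding code c
    return planes
```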

2.10

Evaluation Metrics

Match-three games are single-player games. Therefore, the ELO-rating [10] that is commonly used to evaluate self-playing agents in zero-sum two-player board-games such as Chess and Go is not applicable in match-three games [35, 34].


Figure 2.9: Binary feature representation of black pawns in Chess.

When playing a match-three game, a large score does not necessarily imply winning the game, as the level objective has to be fulfilled. Therefore, the win rate is a more representative metric to measure the performance of the agent:

win rate = number of wins / number of games played


3

Method

This chapter goes through the approaches used in this research. The first three sections explain the data used: the rules and dynamics of Candy Crush Friends Saga, the game board representation and action space, and the game delimitations. Sections 3.4 to 3.7 go through the model architecture, the optimizer, the performance-improvement approaches and the evaluation metrics used. Section 3.8 explains the experiments performed and Section 3.9 lists the hardware, programming language and frameworks used.

3.1

Candy Crush Friends Saga

The game used in this report was Candy Crush Friends Saga (CCFS), a match-three game produced by the mobile game developer King. It consists of a grid-like game board with the maximum visible dimension of 9x9, though it can be smaller, rectangular and have “holes” in the board, depending on the level (see Figure 3.1). Each tile in the game board holds one or more characteristics, where the main characteristics are the seven main candies: cyan, yellow, orange, green, red, purple or blue. See Table 3.3 for all tile characteristics used in this research.

3.1.1

Actions

In CCFS, an action consists of the player swapping two adjacent candies on the game board, either horizontally or vertically. The action is legal if it results in three or more adjacent candies of the same main type (called a match). When a legal action is performed, the matched candies disappear from the game board, and the candies above drop down to take their place. This behaviour might result in a drop-down cascade where the dropped candies once more align to create new legal actions. The cascade continues until the board is stable, i.e. there are no three or more adjacent candies of the same type on the game board. As the tiles at the very top do not have any candies vertically above them, the game engine drops new candies, sampled from a pre-defined spawn rate, to fill the tiles. The spawn rate is not known by the player, which makes the game and the state transitions nondeterministic. An illegal action is performed when the player swaps two candies in a way that does not result in three or more adjacent candies. The candies are then swapped back, resulting in the same game state as before, i.e. st = st+1.



Each level has a limited number of actions, and a game is won when the objective of the level is fulfilled within this action limit. One rollout of a level (from the first action to a terminal state) is referred to as an episode.

3.1.2 Blockers

Blockers are a special type of tile characteristic that either completely blocks a tile from being used or blocks the candy on the tile from moving on the board. Usually, a blocker requires a certain number of adjacent matches to disappear. When the blocker has disappeared, the tile can be treated as a normal one.

The blockers used in the levels evaluated in this research were the Liquorice swirl, the Liquorice lock and the Caramel cup. The Liquorice swirl is a one-layer blocker that is removed by an adjacent candy match. The Liquorice lock is a blocker that locks the candy underneath it, and it is not removed until the locked candy is used in a match. The Caramel cup blocker exists in six different versions, from a one-layer to a six-layer blocker. Each layer is removed by an adjacent match.

3.1.3 Special Candies

There are 6 different special candies available in CCFS: color bomb, vertically striped, horizontally striped, wrapped, fish and coloring candy.

3.1.3.1 Color Bomb

The color bomb is the only special candy that does not have any main color connected to it. It is created by matching 5 candies of the same color in a straight line. When a color bomb is combined with one of the main candies, it removes all candies on the board with the same color as that candy.

3.1.3.2 Vertically and Horizontally Striped

The vertically striped special candy is created by matching 4 candies of the same color in a straight column, and the horizontally striped special candy is created by matching 4 candies of the same color in a straight row. When used in an action, the vertically striped special candy removes all candies in its column (above and below it), while the horizontally striped special candy removes all candies in its row (to its left and right).

3.1.3.3 Wrapped

The wrapped special candy is created when 5 adjacent candies of the same color are matched in an L- or T-shape. When used in an action, the wrapped special candy explodes, thus removing candies in a 3x3 area around it. The wrapped candy then starts to flash and waits until the board is stable. Then it explodes a second time, again removing candies in a 3x3 area.

3.1.3.4 Fish

The fish special candy is created when 4 candies of the same color are matched in a 2x2 square. When used in an action, the fish swims away across the game board and removes a suitable candy, typically one related to the level objective.

3.1.3.5 Coloring Candy

The coloring candy is the most powerful special candy in the game. It is created by combining 5 candies of the same color in a straight line, with an additional candy of the same color aligned in a T-shape at the 3rd candy. When the coloring candy is combined with one of the main candies, each candy on the board with the same color as the main candy will change color to the color of the coloring candy.

3.1.3.6 Special Candy Combinations

Special candies can be combined to create even more powerful actions. A full list of special candy combinations can be found in Appendix Table A.1.

3.1.4 Objectives

There are five different objectives in CCFS: free the animals, free the octopuses, spread the jam, dunk the cookies and fill the empty hearts. See Table 3.1 for a full description of the objectives.

Free the animals: Animals are locked behind blockers and the objective is to match candies adjacent to the blocked tiles to remove the blockage and free the animals.

Free the octopuses: Octopuses are locked behind blockers and the objective is to match candies adjacent to the blocked tiles to remove the blockage and free the octopuses.

Spread the jam: Jam is spread over tiles on the game board and each time a candy is matched on top of the jam it spreads to more tiles. The objective is to cover the whole game board with jam.

Dunk the cookies: Cookies are dropped from the top of the game board and the player has to match candies until all the cookies make their way to the bottom of the game board.

Fill the empty hearts: One or many hearts are placed on the board with a given “heart-path”. For each candy removed adjacent to a heart, it moves one tile along its route. The objective is fulfilled when all hearts have arrived at the end of their paths.

Table 3.1: Explanation of the different objectives in Candy Crush Friends Saga.

3.1.5 Boosters

By completing in-game challenges or by in-game purchases, the player collects boosters. Boosters are special disposable items that can be used once to modify the game board in a positive way.

3.1.6 Friends

Before starting a level, the player gets to choose a friend. The friend is an in-game character that helps the player by modifying the game board in different positive ways, e.g. by adding special candies or removing blockers. There currently exist 5 different friends: Tiffi, Yeti, Nutcracker, Odus and Misty. See Table 3.2 for a full description of the friends.

3.1.7 Score

Each level has, apart from the objective, a score counter. When a legal action is performed and candies are matched, a score is gained. The score depends on the number of adjacent candies matched and the number of objectives that are fulfilled (e.g. jam spread). A full list of the scores can be found in Appendix Table A.2.



Tiffi: When the player has removed 10 red candies, Tiffi spawns three red fish candies randomly on the game board.

Yeti: When the player has removed 12 cyan candies, Yeti throws a wrapped candy at a random tile on the game board.

Nutcracker: When the player has removed 10 blue candies, Nutcracker attacks up to 5 blockers in a horizontal line.

Odus: When the player has removed 8 purple candies, Odus spawns both a vertical and a horizontal striped candy randomly on the game board.

Misty: When the player has removed 15 orange candies, Misty spawns one to four random special candies randomly on the game board.

Table 3.2: Explanation of all the friends in CCFS.

Figure 3.1: Levels showing different characteristics of Candy Crush Friends Saga: (a) Level 4, (b) Level 521, (c) Level 864.


3.2 Game Board Representation and Action Space

The CCFS game board was represented as a 3-dimensional tensor with 25 feature planes, as described in section 2.9 Game Board Representation. The 1st feature plane was completely covered in ones. Feature planes 2 to 24 represented different characteristics of the game board tiles, e.g. what type of main candy the tile holds and whether the tile holds a special candy. The last feature plane was filled with the level progression, written as a decimal percentage. A full list of the feature planes used can be found in Table 3.3.
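As an illustration, a tensor of this form could be assembled as in the sketch below. The function name and the ordering of the characteristic planes are placeholders (the real mapping is given in Table 3.3); only the overall structure, a constant plane of ones, binary characteristic planes and a plane filled with the level progression, follows the description above:

```python
import numpy as np

HEIGHT, WIDTH, NUM_PLANES = 9, 9, 25

def build_state_tensor(characteristic_planes, level_progression):
    """Stack the 25 feature planes describing one CCFS game state.

    characteristic_planes: list of 23 binary (9, 9) arrays, one per tile
                           characteristic (placeholder ordering).
    level_progression:     float in [0, 1], how far the level objective is.
    """
    assert len(characteristic_planes) == NUM_PLANES - 2
    planes = [np.ones((HEIGHT, WIDTH), dtype=np.float32)]            # plane 1: all ones
    planes += [p.astype(np.float32) for p in characteristic_planes]  # planes 2-24: tile characteristics
    planes.append(np.full((HEIGHT, WIDTH), level_progression,
                          dtype=np.float32))                         # plane 25: level progression
    # Channels-first ordering is used here purely by convention.
    return np.stack(planes, axis=0)                                  # shape (25, 9, 9)
```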



Figure 3.2: Game board tensor with 25 channels/feature planes.

In any given state, the action space contained 288 possible actions. Actions 0-71 were all horizontal actions where a left candy was swapped with an adjacent right candy. Actions 72-143 were all horizontal actions where a right candy was swapped with an adjacent left candy. Actions 144-215 were all vertical actions where a top candy was swapped with an adjacent bottom candy. Actions 216-287 were all vertical actions where a bottom candy was swapped with an adjacent top candy.

3.3 Game Delimitations

To limit the scope of the problem, the experiments were only performed on 5 levels with the spread the jam objective (see Table 3.7 for the list of levels) and no boosters were taken into account. Among the friend options available in CCFS, “Tiffi” was used in this research. “Tiffi” modifies the board by changing three red candies on the game board into red special fish candies once ten or more red candies have been removed.

The action space was also restricted by treating left-to-right and right-to-left swaps as equal, and likewise top-to-bottom and bottom-to-top swaps. This abstraction did not affect the game rules, but lowered the action space from 288 to 144 possible actions (see Figure 3.3).
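The resulting index-to-swap mapping can be made explicit with a small sketch. The exact index ordering used in the thesis is not stated, so row-major ordering is assumed here for illustration; with a 9x9 board this gives 9 x 8 = 72 horizontal and 8 x 9 = 72 vertical swaps, 144 actions in total:

```python
ROWS, COLS = 9, 9
NUM_HORIZONTAL = ROWS * (COLS - 1)   # 72 horizontal swaps
NUM_VERTICAL = (ROWS - 1) * COLS     # 72 vertical swaps

def decode_action(index):
    """Map an action index in [0, 143] to the two adjacent tiles to swap
    (row-major ordering, assumed for illustration)."""
    if index < NUM_HORIZONTAL:                       # actions 0-71: horizontal swaps
        row, col = divmod(index, COLS - 1)
        return (row, col), (row, col + 1)
    index -= NUM_HORIZONTAL                          # actions 72-143: vertical swaps
    row, col = divmod(index, COLS)
    return (row, col), (row + 1, col)

# Example: action 0 swaps (0, 0) with (0, 1); action 72 swaps (0, 0) with (1, 0).
print(decode_action(0), decode_action(72))
```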

3.4 Agent Architecture

On a high level, the agent architecture consisted of two parts: the PPO algorithm and the environment (see Figure 3.4). The environment was implemented in Gym, OpenAI’s toolkit for developing reinforcement learning algorithms [5]. When a Gym environment object was initialized, a docker container was spawned running the CCFS game code. The PPO algorithm then interacted with the Gym environment by taking actions.

On game start, the Gym environment reset the game by calling a reset API, resetting the CCFS game in the docker container. After reset, the game returned the initial state s_1. The initial state was then passed to the PPO algorithm, which evaluated the state and decided on an action, a_1. The chosen action was passed back to the Gym environment, which called a step API, passing the action to the CCFS game. The game executed the action and returned the next state, s_2. If an illegal action was performed, the game returned the same state as in the previous time step, s_t = s_{t+1}. The Gym environment then calculated a reward signal and passed the next state and the reward signal to the PPO algorithm, which again evaluated the state and decided on an action. This process was repeated until a terminal state was reached, either a win or a loss. When a terminal state was reached, the environment reset the game and the process started over again.

Figure 3.3: Action space for a given state in CCFS.

Figure 3.4: High-level architecture of learning and playing.
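A minimal sketch of how such an environment wrapper could look is given below. The game client and its reset_game/step_game calls, as well as the reward computation, are hypothetical placeholders standing in for the actual calls to the dockerized CCFS game; only the Gym reset/step structure mirrors the description above:

```python
import gym
import numpy as np
from gym import spaces

class CCFSEnv(gym.Env):
    """Hypothetical Gym wrapper around the dockerized CCFS game."""

    def __init__(self, game_client):
        # game_client is a placeholder object exposing the game's reset/step APIs.
        self.game = game_client
        self.observation_space = spaces.Box(low=0.0, high=1.0,
                                            shape=(25, 9, 9), dtype=np.float32)
        self.action_space = spaces.Discrete(144)

    def reset(self):
        state = self.game.reset_game()                   # placeholder reset API call
        return state                                     # initial state s_1

    def step(self, action):
        state, win, done = self.game.step_game(action)   # placeholder step API call
        reward = self._compute_reward(win, done)
        return state, reward, done, {}

    def _compute_reward(self, win, done):
        # Placeholder reward: +1 for a win, -1 for a loss, 0 otherwise.
        if not done:
            return 0.0
        return 1.0 if win else -1.0
```

The PPO algorithm would then interact with this object exactly as described: env.reset() to start an episode, followed by repeated env.step(action) calls until done is True.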

3.4.1 Policy and Value Function Network Architecture

As described in section 2.3 Proximal Policy Optimization, the policy and value function network shared the initial weights and was then split into two output networks, referred to as heads (see Figure 3.5). All the weights were initialized using the normalized initialization technique, as proposed by Glorot and Bengio [11] and described in subsection 2.7.1 Initialize Network Weights. The shared network was implemented as a convolutional neural network consisting of 5 convolutional layers, as described in subsection 2.6.1 Convolutional Layer.
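A sketch of such a shared network with two heads is shown below, assuming a PyTorch-style implementation. The filter count, kernel sizes and head layouts are illustrative placeholders, not the thesis's actual hyperparameters; only the overall structure (five shared convolutional layers, Glorot/Xavier-initialized weights, one policy head and one value head) follows the description above:

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, in_planes=25, board_size=9, num_actions=144, filters=64):
        super().__init__()
        # Five shared convolutional layers (filter count is illustrative).
        layers = []
        channels = in_planes
        for _ in range(5):
            layers += [nn.Conv2d(channels, filters, kernel_size=3, padding=1), nn.ReLU()]
            channels = filters
        self.shared = nn.Sequential(*layers)

        flat = filters * board_size * board_size
        # Policy head: logits over the 144 possible swaps.
        self.policy_head = nn.Sequential(nn.Flatten(), nn.Linear(flat, num_actions))
        # Value head: a single scalar estimate of the state value.
        self.value_head = nn.Sequential(nn.Flatten(), nn.Linear(flat, 1))

        # Glorot/Xavier (normalized) initialization of all weight matrices.
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        h = self.shared(x)
        return self.policy_head(h), self.value_head(h)
```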
