
DEGREE PROJECT IN THE FIELD OF TECHNOLOGY ENGINEERING PHYSICS
AND THE MAIN FIELD OF STUDY COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Scaling Reinforcement Learning Solutions For Game Playtesting

MATHIAS TÖRNQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY


Scaling Reinforcement Learning Solutions For Game Playtesting

MATHIAS TÖRNQVIST

Master in Computer Science
Date: June 16, 2020

Supervisor: Jörg Conradt

Industrial Supervisor: Sahar Asadi
Examiner: Erik Fransén

School of Electrical Engineering and Computer Science
Host company: King Digital Entertainment


Abstract

Games are commonly used as a playground for AI research, specifically in the field of Reinforcement Learning (RL). RL has shown promising results in developing intelligent agents to play a multitude of games. Previous work has explored how RL agents can be used in the process of playtesting in game development. This thesis investigates different aspects of the RL algorithm Deep Q-Network (DQN) that learns to play the match-three game Candy Crush Friends Saga (CCFS). This thesis also investigates two of the challenges in applying RL in the context of CCFS. First, different sampling strategies are explored to speed up the training of a DQN-based agent. With inspiration from Imitation Learning (IL), demonstrations of game play are incorporated in the DQN algorithm to speed up the training. Another challenge when doing research in RL is volatility in target metrics during the training phase. In this thesis, we investigate how different factors contribute to this volatility, and we propose three metrics to assess the agent during the training phase.


Sammanfattning

Games are a common playground for AI research, especially in the field of Reinforcement Learning (RL). RL has shown promising results in developing intelligent agents that play a multitude of different games. Previous work has explored how RL agents can be used in the playtesting process in game development. This thesis investigates different aspects of the RL algorithm Deep Q-Network (DQN), which learns to play the match-three game Candy Crush Friends Saga (CCFS). Two challenges in applying RL in the context of CCFS are investigated. First, we explore different sampling methods to speed up the training of a DQN-based agent. With inspiration from Imitation Learning (IL), demonstrations of gameplay are incorporated into the DQN algorithm to speed up the training process. The second challenge when doing research in RL is volatility in target metrics during the training phase. In this thesis, we investigate how different factors contribute to this volatility. We propose three metrics to evaluate the agent during the training phase.


Acknowledgments

I would like to start by thanking my academic supervisor, Jörg Conradt, at the division of Computational Science and Technology at the Royal Institute of Technology (KTH). I would also like to express my gratitude to Sahar Asadi, Lele Cao, Alice Karnsund, and Alex Nodet, from King Digital Entertainment, for your constant support and guidance. With your knowledge, expertise and academic support this research reached a higher level than I expected.

Francesco Lorenzo, I want to specifically thank you for sharing this master thesis experience at King with me. It has been a true pleasure to discuss, debate, learn, and play ping pong with you. Grazie Mille.

I want to thank my partner Frida Grunewald for the daily support and love. You carried me during my lowest points and I will be forever grateful for it.

Special thanks to my family for always caring for me as only a family can do.

Lastly, I want to thank everyone I met and worked with during my 6 years at KTH, especially Sebastian Angermund, Joar Nykvist, Mustafa Abedali, and Jonathan Gisslén. Thank you.

Stockholm, June 16, 2020

Mathias Törnqvist


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Research Questions
  1.4 Contribution
  1.5 Outline

2 Background and Related Work
  2.1 Artificial Intelligence
    2.1.1 Supervised and Unsupervised Learning
    2.1.2 Reinforcement Learning
    2.1.3 Imitation Learning
  2.2 Games
    2.2.1 Bots
    2.2.2 Playtesting
  2.3 Frameworks
    2.3.1 Benchmarking and Reproducibility
    2.3.2 Scaling Reinforcement Learning

3 Reinforcement Learning
  3.1 Fundamental Elements
    3.1.1 Problem Setting
    3.1.2 Model-Based or Model-Free
    3.1.3 On-Policy versus Off-Policy
    3.1.4 Temporal-Difference Learning
    3.1.5 Challenges
  3.2 Algorithms
    3.2.1 Q-learning
    3.2.2 Asynchronous Advantage Actor-Critic
    3.2.3 Proximal Policy Optimization
    3.2.4 Deep Q-Networks
  3.3 Sampling
  3.4 Learning Speedup
    3.4.1 Transfer Learning
    3.4.2 Distributional Learning
    3.4.3 Learning from Demonstration Data

4 Evaluation framework
  4.1 King Reinforcement Learning Platform
  4.2 Benchmark
  4.3 Standardization of Metrics

5 Method
  5.1 Environment
  5.2 Double Deep Q-Network
    5.2.1 Architecture
    5.2.2 Experience Replay
  5.3 Demonstrations
    5.3.1 Data Collection
    5.3.2 Pre-Training
    5.3.3 Deep Q-learning from Demonstrations

  8.3 Metrics and Benchmarking

9 Conclusions
  9.1 Summary and Conclusion
  9.2 Ethics, Sustainability and Societal Aspects
  9.3 Future Work

Bibliography

A Appendix


Acronyms

A3C Asynchronous Actor-Critic Agents.
AI Artificial Intelligence.
ALE Arcade Learning Environment.
API Application Program Interface.
BC Behavioral Cloning.
CCFS Candy Crush Friends Saga.
CCS Candy Crush Saga.
CNN Convolutional Neural Network.
DNN Deep Neural Network.
DP Dynamic Programming.
DQfD Deep Q-learning from Demonstration.
DDQN Double Deep Q-Network.
DQN Deep Q-Network.
ELU Exponential Linear Unit.
GCP Google Cloud Platform.
IL Imitation Learning.
IRL Inverse Reinforcement Learning.
KL Kullback–Leibler.
MCTS Monte Carlo Tree Search.
MDP Markov Decision Process.
ML Machine Learning.
PER Prioritized Experience Replay.
PPO Proximal Policy Optimization.
PSER Prioritized Sequence Experience Replay.
RELU Rectified Linear Unit.
RL Reinforcement Learning.
TD Temporal Difference.


Chapter 1

Introduction

1.1 Motivation

Richard S. Sutton and Andrew G. Barto [1] describe the seemingly simple insight that a Reinforcement Learning (RL) agent can learn through the environment that the agent interacts with. This algorithmic modus operandi set off the branch of RL research. Today RL has branched out into multiple subfields such as control theory and Artificial Intelligence (AI).

Games and RL have a mutually beneficial relationship in the sense that games provide a simulator to learn behaviours from, while RL provides games with bots, content generation, and playtesting. In 2012, the Arcade Learning Environment (ALE)[2] was developed by Bellemare et al. ALE is an object-oriented framework that enables researchers to develop AI agents through an interface to many of the classic games from the Atari 2600 console.

ALE presents significant research challenges for RL, model learning, model-based planning, Imitation Learning (IL), transfer learning, and intrinsic motivation. Games have come to be a commonly used playground for AI research, mostly because games simulate different aspects of the real world.

Playtesting is a process in game development that is used to understand player experience and to improve content quality. There are several types of playtests that pursue different goals, e.g. a playtest that concerns the difficulty of a level or a playtest that seeks to find game crashes[3]. Playtesting, in its simplest form, can be carried out by human testers who are given access to the new content before release. However, human playtesting adds high latency and costs to the development process. RL algorithms have shown promising results in learning to play games, achieving even better performance than professional players in multiple games[4][5].

Match-three games are a popular family of puzzle games that have seen an enormous increase in players over the last decade. This type of game is interesting for a number of reasons; match-three games usually consist of a wide variety of features and levels. However, applying RL to play match-three games faces a few challenges, two of them being: (1) volatility in target metrics during the training phase, and (2) failure to generalize to previously unseen settings[6]. When conducting large-scale research, tools and frameworks that promote reproducibility, benchmarking, and consistency in the research process can be beneficial to researchers.

1.2 Problem Definition

The problem of analyzing the training of deep RL algorithms is not straightforward for a number of reasons: (1) computational complexity: many algorithms are computationally expensive to train and the hyperparameter space is usually large; (2) reliability: results can vary heavily due to stochasticity, which in turn stresses the need for experimental comparisons to do an accurate analysis. This thesis emerges from the common RL algorithm Deep Q-Network (DQN)[7]. DQN has earlier been shown to successfully learn to play the game Candy Crush Friends Saga (CCFS)[8]. CCFS is the fourth and latest game in King's Candy Crush franchise. CCFS is a "match-three" game, a type of puzzle game where matching three or more tiles in a row removes them from the board and takes one step towards achieving the game objective. DQN has been shown to work on several other types of games[9]; in the original paper by Mnih et al. [7], the algorithm learns to play many of the ALE games. To mitigate volatility in the training phase of DQN, several approaches have been proposed; Anschel et al. proposed a variance-reduction extension to stabilize the training[10]. (3) use-case-specific limitations: when applying RL to a real use case, such as a match-three mobile game, other factors add to the complexity, and the algorithms and the research need to be adapted to an environment that usually is not built for doing RL research. This thesis approaches the problem by investigating the possibility of speeding up the training of RL algorithms playing the match-three mobile game CCFS.


1.3 Research Questions

This thesis aims to answer the following research questions:

1. What are the strongest contributing factors within the parameter space that have an impact on training of a DQN-based agent playing a match-three game?

2. Can we speed up the training of a DQN agent playing the game CCFS?
   (a) How does prioritized sampling perform against uniform sampling from the experience replay?
   (b) Can demonstrations of playing the game be used in pre-training without performance loss?

3. To what extent is it possible to industrialize RL in game playtesting?
   (a) How does the speedup of RL enable scalability?
   (b) What are good practices to consider when building a training framework for RL? (In particular, considering the industrial requirements of game playtesting in match-three games.)
   (c) Is it possible to benchmark consistently to achieve a comparable analysis of models?

1.4 Contribution

regards to a specific metric, together with an analysis of benchmarking, metrics, and frameworks. This contributes to the evaluation of research done within RL in an industrialized setting.

1.5 Outline

Chapter 2

Background and Related Work

This chapter starts with an introduction to AI and its subfields in Section 2.1, subfields that relate to general Machine Learning (ML). The aim is to present a holistic view of the topic and thereafter narrow it down. Thereafter, games together with AI are presented in Section 2.2; the section is divided into bots (see Section 2.2.1) and playtesting (see Section 2.2.2). Section 2.3 is the last part of this chapter and describes how frameworks are used to conduct research in ML, with a focus on reproducibility, generalization, and scaling, which are common challenges in RL.

2.1 Artificial Intelligence

The research brought up in this section is all related to the field of AI. Russell and Norvig [11] divide the field of AI into four approaches, namely thinking humanly, thinking rationally, acting humanly, and acting rationally. Acting rationally is the most relevant approach to the elements in this thesis. An object that acts in this setting is called an agent according to Russell and Norvig. The acting an agent performs is of a more complex nature than that of a normal computer program; the complex behaviour can be that the agent controls the steering of a car autonomously or that the agent plays a game of chess. The agent does so by trying to achieve the best expected outcome when pursuing a goal. The methods used to create agents such as these are an ever-changing story. The current state of the art in AI is hard to grasp due to the broad meanings, definitions, and problems that are included in the field of AI. However, in many problem settings, the state of the art has emerged from the AI field. The car manufacturer Tesla uses an autopilot in their cars, introduced with: "We believe that an approach based on advanced AI for vision and planning, supported by efficient use of inference hardware, is the only way to achieve a general solution to full self-driving."[12] Tesla's autopilot can arguably be at the front line of large-scale autonomous driving today. Tesla claims to use the state of the art from computer vision, decision theory, and planning theory, among others.

ML is a subfield of AI that learns automatically from data by building mathematical models that make predictions or take decisions. The following sections briefly mention different subfields of ML.

2.1.1 Supervised and Unsupervised Learning

Supervised ML can be explained as an adaptive mathematical model that uses data to learn a mapping from an input vector to a target vector. When introduced to new data, the model should predict its targets. The model generalizes from the training data and thus can make predictions on unlabeled data. An important concept in supervised ML is the bias-variance trade-off: a trade-off between the variance that comes from the data being finite and the bias error from incorrect assumptions in the chosen algorithm/model. Unsupervised ML uses training data without a set of corresponding target values. Without targets, the goal may be to discover clusters in the data by grouping similar data. Another problem is dimensionality reduction for the purpose of data pre-processing or visualization.

Over the spectrum of supervised and unsupervised learning, there are methods lying in between, such as semi-supervised learning[13], transfer learning[14], and one/few-shot learning[15].

2.1.2 Reinforcement Learning

2.1.3 Imitation Learning

Under circumstances where a reward is not directly available, IL can be applied to enable training an agent. IL (a.k.a. learning by demonstration) can be viewed as an extension of the Behavioral Cloning (BC) concept. IL is a method for learning sequential decision-making policies when interacting with an environment. The method learns the decision policy by imitating demonstrations. The agent can also query an "expert" and imitate the "expert's" decision. The reward can be hard to define depending on the task; an example is teaching a robot to wave its arm like a human, where IL works well as the task can be taught from the movements of a human. This technique can also be used if mimicking human behaviour in a game is desirable. In RL, IL is often used to initialize an agent and then learn beyond the imitated behaviours.

The main approaches to achieve IL include: (1) BC, (2) Inverse Reinforcement Learning (IRL) [16], and (3) adversarial learning [17].

The process in which an intelligent agent learns a specific behavior from another agent is usually called Behaviour Cloning or Learning from Observation. The behaviour is inspired by a type of human social learning that is named observational learning. BC has been applied in AI for a long time; an early example is ALVINN (Autonomous Land Vehicle In a Neural Network) from 1989[18], which used images from a vehicle driven by a human to imitate the driver's turning behaviour. An issue with BC is that an agent only learns the behaviour that is shown, so if the agent ends up in a position it has not seen before, it might fail to execute a correct decision.

The success of AlphaGo was partly due to IL, as the neural network policy was initialized from demonstrations[5]. The AlphaGo algorithm was the first to beat a human master in the game of Go.

2.2 Games

generates content and AI that model player behaviour.

2.2.1 Bots

In gaming, a bot refers to a computer program that controls a character or acts as the player. This thesis focuses on puzzle games, so the latter is relevant. The bot controls the different actions that are allowed to be taken in a game depending on the game state. Several approaches that can play games have been proposed and successfully deployed. A few of the most used methods are introduced below.

Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a tree search algorithm that is popular to use in board games. MCTS was the first algorithm to beat a professional player in a full-sized game of Go [5]. The MCTS method can be explained with the four phases selection, expansion, simulation, and backpropagation.

Selection: The algorithm selects the node in the tree with the highest win possibility. The selection is done by recursively descending the tree until the algorithm reaches a node with unvisited children.

Expansion: When that node is reached, the algorithm expands the tree with children of the node. The children represent new moves that might be played in the future.

Simulation: The algorithm then plays possible moves until the game ends and a payout is computed, which can be described as the score of the game round.

Backpropagation: Lastly, the parent nodes are updated with the new score, changing the tree.

An MCTS approach was explored by Poromaa [21] to play the game Candy Crush Saga (CCS) at King. A weakness of the approach is that MCTS needs a large number of simulations to find good moves.

Supervised Approaches

There exist several supervised approaches to act as bots in games. Clark et al. hypothesize that a better way to play a game is to rely on pattern recognition instead of brute-force computation. Their approach is to train a Deep Convolutional Neural Network (CNN) that predicts the moves expert Go players would play, including in positions that have not been used by the experts. An approach with a supervised bot has earlier been explored at King[3], where a CNN was trained on player data to predict the most "human" move.

RL

RL already has several success stories, one of them being the TD-gammon program developed by Gerald Tesauro [22] in 1995. The program is a Temporal Difference (TD) method and was trained by the specific method called TD-lambda[23]. The program explored novel strategies and achieved almost as good results as the top human players. A more recent example is OpenAI Five[4], which beat the world champions at Dota 2, a multiplayer real-time strategy game. OpenAI Five used a Proximal Policy Optimization (PPO) algorithm that trained for 10 months on their own distributed platform named Rapid. RL algorithms are today the foremost choice in current bot research at King; the most recent research has been on DQN[8] and PPO[24].

2.2.2 Playtesting

Playtesting can be described as the process where a game designer evaluates a game. A game can be prone to bugs and design flaws, and avoiding these before releasing a game is desired. Depending on the type of game, the playtesting process will be different. In open-world games, the environment can be huge, and finding all bugs before a release is not plausible, in comparison to releasing a small feature on a specific level in a platform game, where the feature can be thoroughly tested before release.

King

2.3 Frameworks

This section introduces how frameworks enable scientific reproducibility and algorithmic benchmarking in the field of ML. ML models are often trained to optimize a specific performance metric. A metric can be a proxy for the desired goal, which is why the specific capabilities of the algorithm need to be understood and why reasoning about the learned behaviours is of great importance: for example, measuring how the learned behaviour depends on different types of input, or its robustness against noise. Due to the complexity of different ML algorithms, post-hoc analysis and visualization of metrics are common. Multiple frameworks utilize different methods to perform analysis of RL agents, one of them being DeepMind's bsuite [26], which aims to assess the capabilities of RL agents with automation of evaluation and analysis on shared benchmarks.

2.3.1 Benchmarking and Reproducibility

Henderson et al. [27] investigated the issue of reproducibility in popular RL algorithms. In their study, the same algorithm with the same hyperparameters and implementation showed widely different results only when shifting random seeds. Difficulties with variance in the algorithms also contribute to the problem with reproducibility. Both Henderson et al. and Islam et al. [28] suggest guidelines to mitigate these problems; this thesis aims to show how benchmarking can shed light on the issues with reproducibility. A software framework with the ambition to perform reproducible RL is SLM Lab [29], motivated by the idea that a single codebase can produce an RL algorithm benchmark that enables comparing differences between algorithms, not between implementations and other noisy factors [28][30].

2.3.2 Scaling Reinforcement Learning

Generalization is desirable in intelligent agents. Unfortunately, agents failing to generalize beyond the specific environment they are trained on is common. Commonly, agents are trained in one environment and then evaluated on the same one, as in ALE [2], where agents usually are trained in this manner [7][31][32]. A drawback of training and evaluating on the same environment is that the agent is trained to optimize only the policy and not how well the agent generalizes. Several methods have been proposed to approach the generalization problem. Cobbe et al. [33] investigated the generalization problem by creating a new game environment called CoinRun, measuring whether an agent was capable of playing new levels of the game depending on how many different levels the agent was trained on. Packer et al. [6] propose an empirical evaluation of generalization to enable systematic comparisons, defining goals on multiple environments and then measuring the success rate percentage. The success rate is evaluated with agents trained in environments that are different than the ones tested on. The environments used are a subset of different tasks in the MuJoCo physics engine[34].


Chapter 3

Reinforcement Learning

This chapter introduces the reader to the theoretical frameworks that are used in this thesis. The chapter starts with the fundamental elements of RL in Section 3.1, including the main concepts relevant to this thesis. Among them are an explanation of the general RL problem in Section 3.1.1, what a model-free algorithm is in Section 3.1.2, the concept of On- and Off-Policy in Section 3.1.3, and what TD-learning is in Section 3.1.4. The idea is to introduce the foundational concepts that are needed for this thesis. The section is followed by a description of commonly used algorithms and their different elements in Section 3.2. Lastly, the two concepts of sampling (Section 3.3) and demonstrations (Section 3.4.3) are explained in detail, two major contributing factors to speeding up the learning process of an RL agent.

3.1 Fundamental Elements

We now introduce the problem setting and basic concepts of RL. RL can be described as a computational approach that tries to automate decision making through a learning process. The learning process is governed by specific goals set up to achieve wanted behaviours. According to Sutton [1], the modern field of RL is a mix between learning by trial and error, TD methods, and optimal control using DP with value functions.

3.1.1 Problem Setting

Explaining the RL setting in a problem framework is a common approach; a mathematically idealized form of the RL problem is the Markov Decision Process (MDP).

Figure 3.1: The agent-environment interaction loop.

An agent learns from its interaction with an environment, interactions that are governed by an action selection method. The environment can be viewed as anything that is not the agent, practically meaning all signals, sensors, data, et cetera, that the agent gets as input. The information from the environment can be divided into a state $s_t \in \mathcal{S}$ and a reward $r_t \in \mathcal{R}$, divided into discrete time steps $\{t;\ t \in \mathbb{N},\ t_{n+1} > t_n,\ t_0 = 0\}$. The state is the representation of the environment while the reward is a numerical value. The action is taken on the basis of the current state, $a_t \in \mathcal{A}(s_t)$. Figure 3.1 depicts the process in a diagram. On top of the aforementioned, RL can be described using four main components.

Reward

Firstly, the reward function yields the reward mentioned above and governs what the agent will learn, meaning that the agent will try to maximize the cumulative reward. A formal definition of the cumulative reward is the sum of rewards according to:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T \qquad (3.1)$$

The cumulative reward does not differentiate on when the reward is given; the direct reward $R_{t+1}$ is not discounted against $R_T$. A problem can also lack a final time step $T$, in which case the agent instead maximizes the discounted sum of rewards, defined as:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \qquad (3.2)$$

where $\gamma$ is a hyperparameter governing the strength of propagation and is bounded by $0 \le \gamma \le 1$.
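As a small illustration of Equation (3.2), the following Python sketch (a hypothetical helper, not part of the thesis) computes the discounted return of a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence R_{t+1}, ..., R_T."""
    g = 0.0
    for r in reversed(rewards):   # iterate backwards so each step adds one discount factor
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```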

Policy

Secondly, the policy: the action that the agent selects is based on a policy $\pi_t(a \mid s)$, the conditional probability of an action given the state. The policy gets updated according to the discounted cumulative reward that the agent seeks to optimize.

Value Function

Thirdly, a value function is the value of being in a specific state, meaning that the function estimates the cumulative reward for a state, which in turn depends on the policy. The value function is therefore defined with the policy as:

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]. \qquad (3.3)$$

The function thus gives, for a state $s$, the expected discounted cumulative reward when following a specific policy $\pi$.

3.1.2 Model-Based or Model-Free

Lastly, there are two different approaches regarding the environment. Firstly, if a model of the environment is available, then model-based methods are used. A model of the environment can be attained by knowing the rules, as in chess, or by learning the dynamics of the environment. In the latter case, the learned model is an approximated model that works as a proxy for the real environment. A model's transition probability distributions are used to compute the value function.


3.1.3 On-Policy versus Off-Policy

There are two different ways of using the policy during learning. An On-Policy method uses the current policy to make the decisions that gather new experiences while improving that same policy at the same time. An Off-Policy method, in contrast, makes decisions from another policy than the one the method is trying to learn; the idea is to explore all possible actions and learn the optimal decisions instead of getting stuck in local minima. The local-minima problem is mitigated in On-Policy methods by having a soft policy, which means that $\pi(a \mid s) > 0$ for all $a \in \mathcal{A}$ and all $s \in \mathcal{S}$, and that the method incrementally gets more deterministic in the late stage of training.

3.1.4 Temporal-Difference Learning

TD-Learning is a central idea in RL. In the same manner as Monte Carlo methods, TD-Learning samples from the environment. TD-Learning also uses current estimates to perform updates, as DP does; the latter concept is called bootstrapping. TD-learning is a model-free method and can learn from raw experiences. The most basic TD method uses the observed reward and the value function estimate to incrementally update the value function according to:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]. \qquad (3.4)$$

The bootstrapping comes from the fact that the estimate is computed from another estimate. Note that the estimated value function is denoted with a capital $V$ while the true value function is denoted with a minuscule $v$. The term in the square brackets in Equation (3.4) is the TD-error

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \qquad (3.5)$$
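A minimal sketch of the tabular TD(0) update in Equation (3.4), assuming a dictionary-based value table; the function name and interface are illustrative only:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update of Eq. (3.4); returns the TD-error of Eq. (3.5)."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}                                        # value estimates, default 0
delta = td0_update(V, s="s0", r=1.0, s_next="s1")
print(V["s0"], delta)                         # 0.1, 1.0
```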

3.1.5 Challenges

3.2 Algorithms

There exist many algorithms that comply with the above-mentioned techniques and types. A brief introduction to a subset of successful algorithms is given below.

3.2.1 Q-learning

Q-learning is a TD control, Off-Policy algorithm that made a big impact on the RL community[1]. The algorithm approximates the optimal action-value function

$$q_*(s, a) = \max_\pi q_\pi(s, a). \qquad (3.6)$$

The difference from the value function in Equation (3.3) is that the discounted expected cumulative reward given at a state $s$ also depends on the action $a$. The one-step Q-learning algorithm is defined as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right]. \qquad (3.7)$$

The algorithm is independent of the policy being followed, which makes the algorithm Off-Policy, a well-sought feature in the RL setting. Off-Policy methods enable the policy to train with batches, with the benefit of being more data-efficient.
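The one-step update in Equation (3.7) can be sketched for the tabular case as follows; the state/action names are placeholders, and in the thesis the action-value function is instead approximated by a neural network:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One-step Q-learning update, cf. Eq. (3.7)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)            # tabular action-value estimates, default 0
actions = [0, 1, 2]
a = random.choice(actions)        # e.g. an exploratory action choice
q_learning_update(Q, s="s0", a=a, r=1.0, s_next="s1", actions=actions)
print(Q[("s0", a)])               # 0.1 after a single update from zero estimates
```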

3.2.2 Asynchronous Advantage Actor-Critic

Asynchronous Actor-Critic Agents (A3C) was developed by DeepMind in 2016[35]. A3C incorporates the advantage function, defined as:

$$A(a_t, s_t) = Q(s_t, a_t) - V(s_t) \qquad (3.8)$$

The advantage function is the difference between the state-action value function and the state value function. A3C keeps a policy $\pi(a_t \mid s_t; \theta)$ and an estimated value function $V(s_t; \theta_v)$ to estimate the advantage function according to:

$$A(s_t, a_t; \theta, \theta_v) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v). \qquad (3.9)$$

A3C trains multiple agents in parallel, using the parallel exploration to update the parameters more effectively, as the updates will be less correlated than when training one agent in an online fashion.

The algorithm has been shown to be highly data efficient as well as highly time efficient [35].

3.2.3 Proximal Policy Optimization

PPO is a family of actor-critic algorithms developed by OpenAI[36]. PPO is On-Policy and learns both a policy $\pi$ and a value function $V$ in the same manner as A3C. PPO builds on the ideas of Trust Region Policy Optimization (TRPO)[37]; when maximizing an objective function, PPO does so without a constraint on the size of the policy update. Instead, PPO solves the unconstrained policy optimization problem with a penalty. The optimization problem to be solved is then

$$\underset{\theta}{\text{maximize}}\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]. \qquad (3.10)$$

Here the Kullback–Leibler (KL) divergence[38] is between the policy parameters $\theta_{\text{old}}$ before the update and the current policy parameters $\theta$, and $\beta$ is a coefficient that governs the size of the penalty. The first part of Equation (3.10) can lead to an excessive policy update; hence, the algorithm is usually modified with a clipped surrogate objective to penalize changes that are too large. PPO has been shown to be sample-efficient and robust[39]; the algorithm was successful in learning to play the game CCFS, and its capabilities have been explored in [24].

3.2.4 Deep Q-Networks

DQN was introduced by Mnih et al.[7] and obtained human-level performance in many games in the Atari 2600 game suite. As described earlier, the action-value function in Equation (3.7) is commonly estimated by a function approximator. In the DQN algorithm, a neural network is used as a nonlinear function approximator with weights $\theta$ and is often referred to as a Q-network,

$$Q(s, a; \theta) \approx Q^*(s, a) \qquad (3.11)$$

where $s$ is the state and $a$ is the action. The Q-network reduces the mean-squared error in the Bellman equation, as in Equation (3.7), to train its parameters $\theta$. The approximated target values are computed according to

$$y^{DQN} = r + \gamma \max_{a'} Q(s', a'; \theta_i^-) \qquad (3.12)$$

where $r$ is the reward, $\gamma$ is the discount factor, $s'$ is the next state, and $a'$ ranges over the actions available in that state. $\theta_i^-$ and $\theta_i$ denote the target network parameters and online network parameters, respectively. The target network parameters $\theta_i^-$ are from an earlier iteration and are used instead of the optimal target values $y = r + \gamma \max_{a'} Q(s', a')$ to train the network, yielding a sequential loss for each iteration $i$:

$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\left[\left(\mathbb{E}_{s'}[y \mid s, a] - Q(s, a; \theta_i)\right)^2\right] \qquad (3.13)$$

Practically, DQN is set up with two neural networks, where the one that approximates the target values copies the parameters of the Q-network every $\tau$ steps; if $\tau = 1$, that would lead to $\theta_i^- = \theta_{i-1}$. The other important feature used in DQN is an experience replay that stores all transitions.
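A sketch of how the targets in Equation (3.12) and the loss in Equation (3.13) could be computed for a mini-batch. The `q_online` and `q_target` callables are hypothetical function approximators returning per-action Q-values, and the terminal-state masking via `dones` is an implementation detail assumed here rather than stated in the text:

```python
import numpy as np

def dqn_targets(q_target, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-); y = r at terminal states (dones == 1)."""
    next_q = q_target(next_states)                       # shape (batch, n_actions)
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

def dqn_loss(q_online, states, actions, targets):
    """Mean squared error between the targets and Q(s, a; theta), cf. Eq. (3.13)."""
    q_sa = q_online(states)[np.arange(len(actions)), actions]
    return np.mean((targets - q_sa) ** 2)

# Tiny demo with a fake "network" that returns fixed Q-values for 3 actions.
fake_net = lambda s: np.tile([0.1, 0.5, 0.2], (len(s), 1))
y = dqn_targets(fake_net, rewards=np.array([1.0, 0.0]),
                next_states=np.zeros((2, 4)), dones=np.array([0.0, 1.0]))
print(y, dqn_loss(fake_net, np.zeros((2, 4)), np.array([1, 0]), y))
```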

Experience Replay

A common approach in RL methods is to use experienced transitions in place of knowledge of the expected transition[1]. Experienced transitions are incorporated in the DQN algorithm with an experience replay buffer. According to Lin [40], the experience replay is a way for an agent to remember experiences that it has previously had, and it enables experiences to be re-experienced multiple times. Experience replay works by storing every experience; in Lin's paper, an experience is a quadruple consisting of the state $s_t$, action $a_t$, reward $r$, and the next state $s_{t+1}$. The motivation to use a memory is to learn from valuable experiences more than once, as throwing the experiences away is wasteful.

Double DQN

Double Deep Q-Network (DDQN) is an extension of the vanilla DQN algorithm that makes the algorithm less prone to substantial overestimations [31]. In DDQN the approximated targets are modified to separate the values used for selection and evaluation of an action. The modification is done by rewriting Equation (3.12) as

$$y^{DoubleDQN} = r + \gamma\, Q\!\left(s', \underset{a'}{\arg\max}\, Q(s', a'; \theta_i);\ \theta_i^-\right). \qquad (3.14)$$
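The same batch computation for the Double DQN target in Equation (3.14) only changes which network selects the action (online) and which evaluates it (target); again a sketch with an assumed `q_online`/`q_target` interface:

```python
import numpy as np

def double_dqn_targets(q_online, q_target, rewards, next_states, dones, gamma=0.99):
    """y = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-), cf. Eq. (3.14)."""
    best = q_online(next_states).argmax(axis=1)          # action selection: online network
    next_q = q_target(next_states)                       # action evaluation: target network
    chosen = next_q[np.arange(len(best)), best]
    return rewards + gamma * (1.0 - dones) * chosen

online = lambda s: np.tile([0.3, 0.9], (len(s), 1))      # fake networks for illustration
target = lambda s: np.tile([0.4, 0.1], (len(s), 1))
print(double_dqn_targets(online, target, np.array([0.0]), np.zeros((1, 4)), np.array([0.0])))
# online picks action 1, which the target network values at 0.1 -> y = 0.99 * 0.1
```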


The Dueling Architecture

By factoring the neural network into two dueling streams, the factoring has been shown to improve generalized learning across actions. The factoring does not change the algorithm but is still proven to achieve better policies in games from the Atari 2600 suite[32]. Following the CNN, the dueling architecture consists of two parallel fully connected layers instead of the single one used in DQN, estimating the value function and the advantage function separately and thereafter combining them to produce the Q-estimates:

$$Q(s, a; \theta) = V(s; \theta, \alpha) + A(s, a; \theta, \beta) \qquad (3.15)$$

where $\theta$ are the parameters of the CNN, $\alpha$ are the parameters of the stream that estimates the value function, and $\beta$ are the parameters of the stream that estimates the advantage function. The idea is that the dueling architecture will learn which states are relevant, independent of knowing how an action will affect the value function in every state. The two streams are combined to estimate the Q-function; a main idea is that the value function stream enables the network to learn good states disregarding the actions available.

3.3 Sampling

With the aforementioned experience replay, the vanilla DQN samples transitions uniformly from the replay buffer. However, several other approaches have been proposed to mitigate the sample inefficiency that comes from uniform sampling: the uniform sampling approach is sample inefficient since it does not consider how valuable a sample is.

Prioritized Experience Replay


Prioritized Experience Replay (PER)[56] samples transitions from the replay buffer with a probability proportional to their priority,

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha} \qquad (3.16)$$

where $p_i$ is the priority of a sample and $\alpha$ is a constant that governs how much prioritization is used.

Two different methods to compute the prioritization are introduced: rank-based prioritization and proportional prioritization. The proportional prioritization is defined as

$$p_i = |\delta_i| + \epsilon \qquad (3.17)$$

where $\delta_i$ is the TD-error defined in Equation (3.5). $\epsilon$ is a constant that is bigger than zero; the idea is that the priority should not be zero for samples where the TD-error is zero, so that those samples can be sampled again. The rank-based prioritization is an indirect variant which computes the priority according to

$$p_i = \frac{1}{\text{rank}(i)} \qquad (3.18)$$

where the rank is obtained by sorting according to the TD-error $\delta_i$. The primary idea is that the rank-based method is insensitive to samples with small TD-error and therefore more robust. However, it is computationally heavier, as the rank-based method needs to sort all samples in the buffer after each update.
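A sketch of the two prioritization variants and the resulting sampling probabilities of Equations (3.16)–(3.18); the constants (`eps`, `alpha`) are illustrative values, not the ones used in the thesis:

```python
import numpy as np

def proportional_priority(td_errors, eps=1e-3):
    """p_i = |delta_i| + eps, cf. Eq. (3.17)."""
    return np.abs(td_errors) + eps

def rank_based_priority(td_errors):
    """p_i = 1 / rank(i), where rank 1 has the largest |TD-error|, cf. Eq. (3.18)."""
    order = np.argsort(-np.abs(td_errors))     # indices sorted by decreasing |delta|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)
    return 1.0 / ranks

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, cf. Eq. (3.16)."""
    scaled = priorities ** alpha
    return scaled / scaled.sum()

p = proportional_priority(np.array([0.5, 0.0, -2.0]))
idx = np.random.choice(len(p), size=2, p=sampling_probabilities(p))
print(p, rank_based_priority(np.array([0.5, 0.0, -2.0])), idx)
```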

Prioritized Sequential Experience Replay

An extension of the above approach was developed by Brittain et al., called Prioritized Sequence Experience Replay (PSER)[41]. The method is to decay the priority of a transition exponentially backwards over the preceding consecutive transitions. The decay is done with a coefficient $\rho$ for $n$ steps back in time:

$$p_{n-1} = \max\{p_n \cdot \rho^1,\ p_{n-1}\}$$
$$p_{n-2} = \max\{p_n \cdot \rho^2,\ p_{n-2}\} \qquad (3.19)$$
$$\cdots$$

The idea with PSER is that transitions preceding the win state should be sampled more frequently, capitalizing on the fact that the transitions leading to a win state are good to learn from. This is in contrast to PER, which does not take consecutive transitions into account, as the prioritization only depends on the TD-error.
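A minimal sketch of the PSER decay in Equation (3.19), applied to a list of priorities; the decay coefficient and window length are illustrative assumptions:

```python
def pser_decay(priorities, idx, rho=0.65, window=5):
    """Decay priorities of the `window` transitions preceding index `idx`, cf. Eq. (3.19).

    p_{n-k} = max(p_n * rho^k, p_{n-k}) for k = 1..window.
    """
    p_n = priorities[idx]
    for k in range(1, window + 1):
        j = idx - k
        if j < 0:
            break
        priorities[j] = max(p_n * rho ** k, priorities[j])
    return priorities

print(pser_decay([0.1, 0.1, 0.1, 2.0], idx=3, rho=0.5, window=3))
# -> [0.25, 0.5, 1.0, 2.0]
```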

3.4 Learning Speedup

Speeding up the learning of an RL agent enables faster and cheaper experimentation. Common speedup methods will be introduced below. However, regardless of the specific method, a common technique to speed up the training of deep neural networks is to parallelize the gradient computation, first presented by Zinkevich et al. [42].

3.4.1 Transfer Learning

In a broad sense, transfer learning is a method in ML to store and reuse learned behaviour. Learning an initial behaviour or starting from a pre-trained agent can be beneficial in several ways. First, the initial performance of an agent can be jumpstarted, resulting in a baseline behaviour to learn from and hopefully surpass with exploration methods. The time the agent takes to reach a pre-defined performance level on a specific task can be decreased if the transfer learning shows an asymptotic performance increase. Taylor and Stone did a survey of transfer learning in RL domains[43], investigating how to evaluate transfer learning in RL and its expected capabilities. Transfer learning methods were found to speed up the learning of agents, which is relevant for this thesis.

3.4.2 Distributional Learning

Distributional learning here refers to a collection of RL methods where training is distributed over parallel actors and learners. A bottleneck when training an RL agent is the number of states shown to the agent, limited either by the generation of samples from the environment or by the buffer size. Gorila is an RL framework developed by Google DeepMind [44] that uses parallel methods to increase the performance of the DQN algorithm. The framework uses multiple environments generating experiences and multiple learners that learn from those experiences, sharing a distributed neural network that computes the value function and a distributed experience buffer. An extension of the Gorila method is the Ape-X algorithm [45], which relies on the PER method to use significant data generated by parallel actors and discard non-significant experiences.

3.4.3 Learning from Demonstration Data

Learning from demonstration data is a type of transfer learning where an agent is able to quickly learn an initial strategy to interact with the environment, which in turn yields much faster training. Using demonstrations is also applicable to settings where an agent does not have access to a simulator and needs to be able to safely interact with the environment from the beginning. However, in this thesis, the focus is primarily on speeding up the training.

One of the first papers that successfully applied demonstrations to DDQN was Deep Q-learning from Demonstration (DQfD) by Google DeepMind [46]. One of the key aspects of DQfD is pre-training a DDQN agent with demonstrations; in their experimental setting, demonstrations from human gameplay on the Atari 2600 game suite were used. The method uses two buffers, one which is populated by demonstrations and another which is populated by interacting with the environment, as in the vanilla DDQN algorithm. The experiences are replaced by a sliding-window principle where the oldest experience is overwritten by the new experience. The two buffers have the same size, and the only difference is that the demonstration experiences are immutable.

What differs from the vanilla DDQN is a pre-training phase and a demonstration buffer. In the original paper, a large margin classification loss is added[47]:

$$J_E(Q) = \max_{a \in \mathcal{A}}\left[Q(s, a) + l(a_E, a)\right] - Q(s, a_E) \qquad (3.20)$$

where $a_E$ is the action from the demonstration in that state, together with a margin function

$$l(a_E, a) = \begin{cases} c, & a \neq a_E,\ c > 0 \\ 0, & a = a_E \end{cases} \qquad (3.21)$$

The supervised loss in Equation (3.20) makes the policy imitate the demonstrations. However, the loss also makes the network prone to over-fitting on the demonstrations; therefore, an L2 regularization loss is added to prevent overfitting. An n-step Q-learning loss is also added, which is supposed to propagate the values of the expert's trajectories. Finally, the full loss that is used to update the network is a combination of the four:

$$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L2}(Q) \qquad (3.22)$$
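The large margin classification loss of Equations (3.20)–(3.21) for a single state can be sketched as follows (the margin value `c` is an illustrative choice):

```python
import numpy as np

def large_margin_loss(q_values, expert_action, margin=0.8):
    """J_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), cf. Eqs. (3.20)-(3.21).

    `q_values` is the vector Q(s, .) for one state, `expert_action` is a_E.
    """
    l = np.full(len(q_values), margin)   # l(a_E, a) = c for a != a_E
    l[expert_action] = 0.0               # l(a_E, a_E) = 0
    return np.max(q_values + l) - q_values[expert_action]

# Zero when the expert action is already the highest-valued action by at least the margin.
print(large_margin_loss(np.array([0.2, 1.5, 0.1]), expert_action=1))   # 0.0
```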


Chapter 4

Evaluation framework

This chapter aims to explain how an evaluation framework is beneficial when conducting research in RL. A framework can support consistent and coherent research and emphasize the important characteristics that are desired. Section 4.1 introduces the King Reinforcement Learning Platform and is followed by an overview of benchmarking; different benchmarking approaches are needed depending on the task at hand (see Section 4.2). Lastly, Section 4.3 describes the process of standardizing experimentation. Together, these processes form a basis for the needs of both the scientific RL research process and the industrial RL research process.

4.1 King Reinforcement Learning Platform

The term platform is used here to refer to the software framework where RL research is performed at King, whereas a software framework refers to the abstraction of classes and methods of which a platform consists. Throughout this thesis, the word platform will be used to refer to both the software framework and the auxiliary artifacts.

RL consists of the algorithms that perform actions when interacting with an environment. RL algorithms pose different challenges depending on how they are designed. The Gorila[44] algorithm has parallel actors that each generate experiences; the algorithm also has parallel learners that learn from the centrally stored experiences. Storage, massive parallelization, and coordination of multiple learners are three tasks that can be hard by themselves. These three factors can arguably make the implementation of RL algorithms such as Gorila non-trivial. In order to make the research more accessible, a platform that supports a researcher in implementing RL algorithms has the potential to be highly beneficial. The many platforms that exist today are a proof of the benefits of platforms. Microsoft's Project Malmö[48], Facebook's ReAgent[49], and Coach, developed by Intel's Nervana Systems[50], are all well-known RL platforms. These platforms have very different focus: Project Malmö is connected to the game Minecraft, a complex environment that simulates a "real" world, while Facebook's ReAgent focuses on real-world decision-making in, for example, a recommendation system.

King has developed its own platform called the AI Level Production Platform. The AI Level Production Platform is mostly a software framework built for automation related to generating new levels in King games. This includes bots for playtesting, particularly RL algorithms that interact with the games. The platform works as a foundation to perform RL research and deploy agents that play games at King. In a research-intensive setup, a subset of supportive requirements would be:

• Result comparison
• Tracking of metrics
• Plug and play with different environments

4.2 Benchmark

RL usually involves a few stochastic pieces, both from an algorithmic perspective and from the environment an agent interacts with. This illustrates the need for benchmarks to evaluate performance. Dai and Berleant [51] summarize seven benchmarking principles:

1. Relevance: Benchmarks should measure important features.
2. Representativeness: Benchmark performance metrics should be broadly accepted by industry and academia.
3. Equity: All systems should be fairly compared.
4. Repeatability: Benchmark results should be verifiable.
5. Cost-effectiveness: Benchmark tests should be economical.
6. Scalability: Benchmark tests should measure from a single server to multiple servers.


Benchmarks are necessary but hard to define: a single general benchmark can hardly fit the broad set of sought-after properties. Continuous control tasks such as balancing a pole[34] differ a lot from teaching an RL agent to play Go[5]; systematically comparing and evaluating agents from those two domains is not feasible. There are projects aimed at specific tasks, such as rllab developed by OpenAI[52], which only evaluates continuous control tasks and argues the need for specific benchmarking. The requirement regarding result comparison listed in Section 4.1 covers compatible benchmarking over RL algorithms, thus motivating the need for developing benchmarking methods at King.

4.3 Standardization of Metrics

Standardizing performance metrics is one factor in investigating how RL agents perform. The metrics used will be introduced in Section 6.3. OpenAI has implemented multiple agents in a framework with the ambition to function as a general baseline to compare with[53]. This framework is mostly focused on the development of RL algorithms. The problems tackled by RL generally differ a lot, and metrics are not always interesting to compare against the RL community baselines. The need for internal baselines is also important, specifically when testing on an environment that is not open to the public. However, popular performance metrics can still be used and adapted. DeepMind's behaviour suite for RL (bsuite) defines several standardized metrics. Figure 4.1 displays how three different agents are compared to a random baseline on the bsuite metrics. The bsuite metrics are defined through a collection of experiments. This thesis focuses on metrics that aim to quantify different characteristics during the training phase. Implementing different standardized ways to compare agents, in the same manner as bsuite does, is a common approach. The RL framework Coach[50] implements its own visualization tool with a more flexible approach to metrics, while SLM Lab[29] defines multiple metrics related to a random baseline that are used to evaluate agents. The metrics proposed in this thesis are described in Section 6.3.


Chapter 5

Method

This chapter explains the methods that are used to answer the research questions defined in Section 1.3: what factors within the parameter space have an impact on training of a DQN-based agent playing a match-three game, and whether it is possible to speed up the training of a DQN-based agent playing the game CCFS. The chapter starts by describing the environment used (Section 5.1), the specific details regarding the game, and how an RL agent interacts with the game. This is followed by an explanation of DDQN, the main algorithm used throughout the thesis (Section 5.2). Lastly, Section 5.3 explains the method of incorporating demonstrations into the DDQN algorithm.

5.1 Environment

The environment used in this thesis is one of King's match-three puzzle games, CCFS. This section will briefly describe the game together with the features that are used in this thesis. In particular, this section will focus on explaining how the game's input and output are transformed to be used with an agent. The specific details of the state space, action space, and reward used will be described. CCFS consists of several levels (almost 3000 levels to this date). Each level has an objective which the player needs to achieve in the given number of moves available in order to win that level. In this chapter, only the objectives used within this thesis are explained.

Board

The game board consists of a 9 × 9 grid filled with candies of different types and colors. Three or more neighboring candies of the same color should be matched vertically or horizontally to progress toward the objective of a level. The matched tiles are refilled with candies cascading from the top. Matching more than three candies creates special candies with extra powers; a few candies have special powers affecting a larger area of the board and are thus called special candies.

Figure 5.1: A starting board of level 61 in CCFS.

Figure 5.1 shows a starting board of level 61 in CCFS. This is the specific level used in all experiments. Tiles with a red background indicate jam on the tile (the center tile in Figure 5.1).

State space


Figure 5.2: A subset of the feature planes, depicting the binary channels used to encode the board of a level in CCFS. (illustration inspired by the feature planes in [3]).

The state space of a level in CCFS is large. The initial state is determined by a random seed; the seed decides the initial placement of candies on the board and how candies will fall after being removed. This seed is deterministic, so if the same set of actions is played on two runs with the same game seed, the game will end in the same state. However, the seed pool is large, so if a random seed is chosen to initialize a board, then the game can be seen as non-deterministic due to the differences in behaviour a game round will exhibit.
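As an illustration of the binary feature-plane encoding, the sketch below builds a 9 × 9 × 17 state tensor. The channel names follow Table 5.1, but their ordering and the tile representation are assumptions made for this example, not the exact implementation used in the thesis:

```python
import numpy as np

# Assumed channel order (cf. Table 5.1): 13 candy-type planes, jam, ones, fall-down, void.
FEATURES = ["green", "red", "blue", "cyan", "orange", "purple", "yellow",
            "h_striped", "v_striped", "fish", "wrapped", "coloring", "color_bomb",
            "jam", "ones", "fall_down", "void"]

def encode_board(tiles):
    """Encode a 9x9 board into a 9x9x17 binary state tensor.

    `tiles` maps (row, col) -> set of feature names present on that tile.
    """
    state = np.zeros((9, 9, len(FEATURES)), dtype=np.float32)
    state[:, :, FEATURES.index("ones")] = 1.0          # constant plane of ones
    for (r, c), names in tiles.items():
        for name in names:
            state[r, c, FEATURES.index(name)] = 1.0
    return state

s = encode_board({(4, 4): {"red", "jam"}, (0, 0): {"blue"}})
print(s.shape)  # (9, 9, 17)
```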

Action space

The action space includes the available moves on a game board, defined as switching place between two adjacent candies. An image depicting the action space is shown in Figure 5.3. In total there are 288 possible actions to swap a candy with its neighbor on a 9 × 9 board. We reduce the action space by half since we do not differentiate between left or right and up or down swapping [3]. Not all actions are permitted in every state; the legal moves in a given state are the ones that result in a match-three or use a special candy with the features that the special candy possesses. Note that the action space makes no difference between swapping two candies from the left or the right side of an adjacent border. This is, however, relevant when playing the game, as special candies have different behaviour depending on how the candies are matched. This will not affect what is set out to be tested in this thesis, as all agents have the same action space.

Board features: Green, Red, Blue, Cyan, Orange, Purple, Yellow, Horizontally striped, Vertically striped, Fish, Wrapped candy, Coloring candy, Color bomb, Jam
Other layers: Ones, Candy fall-down, Void

Table 5.1: Features used in the game.
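To make the 288-to-144 reduction concrete, the sketch below enumerates each adjacent candy pair once (72 horizontal plus 72 vertical swaps); the indexing scheme is an assumption for illustration and not necessarily the encoding used in Figure 5.3:

```python
def enumerate_swaps(size=9):
    """Return the 144 undirected swap actions on a size x size board.

    Each action is ((r1, c1), (r2, c2)); swapping left/right or up/down over the
    same border is the same action, which halves the 288 directed swaps to 144.
    """
    actions = []
    for r in range(size):
        for c in range(size - 1):
            actions.append(((r, c), (r, c + 1)))      # horizontal swaps: 9 * 8 = 72
    for r in range(size - 1):
        for c in range(size):
            actions.append(((r, c), (r + 1, c)))      # vertical swaps: 8 * 9 = 72
    return actions

print(len(enumerate_swaps()))  # 144
```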

Figure 5.3: The encoding of the action space a ∈ {0, 1}^144 [3].

Objective

5.2 Double Deep Q-Network

The algorithm used in this thesis is based on DQN as introduced by Mnih et al.[7], with the DDQN extension introduced in Section 3.2.4. The DDQN extension proved to be stable and achieved high performance on CCFS[8]; the Dueling DQN was also explored. However, DDQN being less complex than the Dueling DQN, together with its better performance, guided our choice of algorithm.

Reward function

In the implementation of DQN in this thesis, we use the same reward function as the one previously suggested by Karnsund[8]. The reward function is called progressive jam and was developed to work with the Jam objective explained in Section 5.1. The progressive jam function was the best-performing reward function out of three.

$$R(s_t, a_t) = \begin{cases} \dfrac{\text{Number of tiles covered with jam}}{\text{board size}}, & \Delta_{\text{jam}} > 0 \\ 0, & \Delta_{\text{jam}} = 0 \end{cases} \qquad (5.1)$$

Here $s_t$ is the state and $a_t$ is the action at step $t$. The aim of progressive jam is to reward the agent in proportion to the jam spread on the board; a move that leads to a win will result in a reward of 1, since that corresponds to the same number of tiles covered with jam as the size of the board. If there is no change in the number of tiles covered in jam, then the reward is zero.

Figure 5.4: Two adjacent game board states, (a) state $s_t$ and (b) state $s_{t+1}$, depicting a match-three made with red candies resulting in jam spreading on the board. Left: before the match; right: the state after the match with the jam spread.
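A sketch of the progressive jam reward in Equation (5.1), assuming that the tile count in the numerator refers to the jam coverage after the move:

```python
def progressive_jam_reward(jam_before, jam_after, board_size=81):
    """Reward from Eq. (5.1): fraction of tiles covered with jam if jam spread, else 0."""
    delta_jam = jam_after - jam_before
    if delta_jam > 0:
        return jam_after / board_size     # equals 1.0 when the whole board is covered (a win)
    return 0.0

print(progressive_jam_reward(jam_before=3, jam_after=5))   # 5/81
```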

Action Selection

This thesis utilizes a common technique called $\epsilon$-greedy action selection, which is defined as

$$a_t = \begin{cases} \underset{b}{\arg\max}\ Q_t(s, b), & \text{with probability } 1 - \epsilon \\ \text{uniform}(\mathcal{A}(s)), & \text{with probability } \epsilon. \end{cases} \qquad (5.2)$$

The $\epsilon$-greedy selection method takes a random action from the available moves with probability $\epsilon$. Otherwise, with probability $1 - \epsilon$, the action with the maximum estimated value in the given state is chosen.
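A minimal sketch of Equation (5.2), restricted to the legal moves of the current state as described above; the Q-values are passed in as a plain dictionary for illustration:

```python
import random

def epsilon_greedy(q_values, legal_actions, epsilon=0.1):
    """Eq. (5.2): random legal action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_values[a])

action = epsilon_greedy({0: 0.2, 1: 0.7, 2: 0.1}, legal_actions=[0, 1, 2], epsilon=0.1)
print(action)   # usually 1, occasionally a random legal action
```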

5.2.1 Architecture

The architecture is based on previous work in which data was recorded from game rounds played by an MCTS agent and different architectures were thereafter evaluated for performance[21]. As is illustrated in Figure 5.5, the input state is the size of the board times the number of input features, 9 × 9 × 17. There are five convolutional layers with a kernel of 3 × 3; the stride is 1 and 35 filters are used per layer. The convolutional layers are followed by two fully connected layers, the first with 999 hidden nodes and the second with 144 nodes. All convolutional and fully connected layers use the Exponential Linear Unit (ELU) activation function [54]:

$$f(z, \alpha) = \begin{cases} \alpha(e^z - 1) & \text{for } z \le 0 \\ z & \text{for } z > 0 \end{cases} \qquad (5.3)$$

$\alpha$ is a hyperparameter that governs the value to which an ELU saturates for negative inputs. ELU achieved a higher validation accuracy than the more commonly used Rectified Linear Unit (RELU) activation in previous work on one of the Candy Crush games[25]. Lastly, the output layer is of size 144, which is the same as the number of actions that the agent can take, and represents the Q-values. The detailed specifications of the network visualized in Figure 5.5 are presented in Table 5.2.

Figure 5.5: The network architecture: the state input s is passed through five convolutional layers (Conv 1-Conv 5) and fully connected layers, producing the Q-values Q(s, ·).


Layer    Type    Size         Filters    Activation    Strides    Padding
Input    State   9 x 9 x 17   -          -             -          -
Conv1    Conv.   3 x 3        35         ELU           1          Same
Conv2    Conv.   3 x 3        35         ELU           1          Same
Conv3    Conv.   3 x 3        35         ELU           1          Same
Conv4    Conv.   3 x 3        35         ELU           1          Same
Conv5    Conv.   3 x 3        35         ELU           1          Same
FCL1     FCL     999          -          ELU           -          -
FCL2     FCL     144          -          ELU           -          -
Output   -       144          -          -             -          -

Table 5.2: Network specification.
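A sketch of the network in Table 5.2 written in PyTorch. The thesis does not state which deep learning framework was used, so the channel-first layout, `padding=1` as the "Same" padding, and the absence of an activation on the final 144-unit output are assumptions of this example:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Five 3x3 conv layers (35 filters, ELU) followed by FC layers of 999 and 144 units."""

    def __init__(self, in_channels=17, n_actions=144):
        super().__init__()
        convs, channels = [], in_channels
        for _ in range(5):
            convs += [nn.Conv2d(channels, 35, kernel_size=3, stride=1, padding=1), nn.ELU()]
            channels = 35
        self.conv = nn.Sequential(*convs)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(35 * 9 * 9, 999), nn.ELU(),
            nn.Linear(999, n_actions),            # output: one Q-value per action
        )

    def forward(self, x):                          # x: (batch, 17, 9, 9)
        return self.fc(self.conv(x))

q = QNetwork()
print(q(torch.zeros(1, 17, 9, 9)).shape)           # torch.Size([1, 144])
```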

Initialization

The weights of the DNN are assigned using the Xavier initializer, which sets every layer's weights within

$$\pm \frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}, \qquad (5.4)$$

where $n_{\text{in}}$ is the number of input connections and $n_{\text{out}}$ is the number of output connections. Xavier initialization is widely used to initialize neural networks; this approach has shown substantially faster convergence[55].

5.2.2 Experience Replay

Three different types of experience replay are used in this thesis:

1. PER[56]

2. PSER[41]

3. Uniform Experience Replay[7]

All three sampling approaches share a common buffer data structure, which consists of tuples with the current state S_t, action A_t, reward R_t, and the next state S_{t+1}.

The buffer stores a limited number of experiences; the buffer size is one hyperparameter of the algorithm. When the buffer is full, the oldest tuple is replaced with the newly experienced one, which is usually called a sliding window.


State  Action  Reward  Next state
S_1    A_1     R_1     S_2
S_2    A_2     R_2     S_3
...    ...     ...     ...
S_N    A_N     R_N     S_{N+1}

Table 5.3: The ordering of state, action, reward, and next state in the buffer. Each row represents a tuple in the buffer.
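The sliding-window behaviour described above can be captured with a fixed-length deque, as in the minimal sketch below; the capacity and batch size are illustrative, and the prioritized variants additionally store and update priorities, which is omitted here.

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Stores (S_t, A_t, R_t, S_{t+1}) tuples; the oldest tuple is evicted when full."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling; PER/PSER would sample proportionally to priorities instead.
        return random.sample(self.buffer, batch_size)
```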

5.3 Demonstrations

This section explains how pre-training with demonstrations derived from user data is implemented. As the first step, the buffer is extended with a fifth column to distinguish demonstrations from self-generated data.

State  Action  Reward  Next state  If demo
S_1    A_1     R_1     S_2         Bool
S_2    A_2     R_2     S_3         Bool
...    ...     ...     ...         ...
S_N    A_N     R_N     S_{N+1}     Bool

Table 5.4: The buffer when using demonstrations, containing the DQN tuple and a Boolean that distinguishes the persistent demonstrations in the buffer from the self-generated transitions.
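One way to realize the buffer of Table 5.4 is to keep demonstrations in a separate store that is never evicted and to tag every tuple with a Boolean flag; this bookkeeping is an assumption for illustration, not the exact implementation.

```python
from collections import deque

class DemoReplayBuffer:
    """Agent transitions use a sliding window; demonstration transitions are persistent."""

    def __init__(self, capacity=100_000):
        self.agent_buffer = deque(maxlen=capacity)  # evicts oldest agent data when full
        self.demo_buffer = []                       # demonstrations are never overwritten

    def add(self, state, action, reward, next_state, is_demo):
        transition = (state, action, reward, next_state, is_demo)
        (self.demo_buffer if is_demo else self.agent_buffer).append(transition)
```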

5.3.1 Data Collection

The demonstrations used are collected from a supervised bot that predicts the most human-like move given a state of the game. This bot was developed at King [3] and works in the following way:

1. A game is started, which is accessible through a game Application Program Interface (API).

2. The API is called to retrieve the state, which is then fed through the supervised bot.

3. The bot predicts the most human-like move.

4. The predicted move is played through the game API.

5. Steps 2–4 are repeated until the game round ends.

All state transitions are stored in the format of Table 5.3 with an additional binary value describing whether the sample is a demonstration. All demonstrations are then saved in persistent storage for reproducibility reasons.
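The collection procedure could be organized roughly as below; game_api and bot are hypothetical stand-ins for King's internal game API and the supervised bot, and the step() signature is an assumption.

```python
def collect_demonstrations(game_api, bot, num_episodes):
    """Roll out the supervised bot through the game API and record transitions,
    flagging every stored tuple as a demonstration."""
    demonstrations = []
    for _ in range(num_episodes):
        state = game_api.start_game()            # hypothetical call: start a game round
        done = False
        while not done:
            action = bot.predict(state)          # most human-like move for this state
            next_state, reward, done = game_api.step(action)  # hypothetical call
            demonstrations.append((state, action, reward, next_state, True))
            state = next_state
    return demonstrations
```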

5.3.2 Pre-Training

The demonstrations are used in a pre-training phase to train the agent in a supervised learning manner. A mini-batch of samples is drawn from the replay buffer according to the prioritized experience replay strategy, and the agent is then trained offline in the same manner as the vanilla DDQN but with the extended loss. The loss used during the pre-training phase is a modified version of Equation (3.22), where the n-step loss is omitted for easier comparison with the previous work done with DQN. The loss used is

J(Q) = J_{DDQN}(Q) + \lambda_1 J_E(Q) + \lambda_2 J_{L2}(Q), \qquad (5.5)

where J_E is defined in Equation (3.20) and J_DDQN in Equation (3.13). The pre-training is run for n steps, and each step uses one sample to pre-train the agent.
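A sketch of the combined pre-training loss in Equation (5.5), assuming J_E is the large-margin supervised loss of DQfD; the margin value and the loss weights are illustrative assumptions.

```python
import numpy as np

def demo_margin_loss(q_values, expert_action, margin=0.8):
    """Large-margin loss J_E for one demonstration: penalize any action whose
    Q-value (plus margin) exceeds the Q-value of the expert action."""
    penalties = np.full_like(q_values, margin, dtype=float)
    penalties[expert_action] = 0.0
    return np.max(q_values + penalties) - q_values[expert_action]

def pretraining_loss(td_loss, q_values, expert_action, l2_term, lam1=1.0, lam2=1e-5):
    """J(Q) = J_DDQN + lambda_1 * J_E + lambda_2 * J_L2, as in Equation (5.5)."""
    return td_loss + lam1 * demo_margin_loss(q_values, expert_action) + lam2 * l2_term
```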

5.3.3 Deep Q-learning from Demonstrations

To get a clearer perspective, consider Figure 5.6, which depicts training according to DQfD [46]. The process can be described as follows:

1. Pre-train the agent with the full loss on the demonstration data.

2. Play n steps in the environment.

3. Add each transition to the replay buffer.

4. Sample a batch of transitions from the two buffers as in PER [56].

5. If a demonstration transition is sampled, train the network with the extended loss; if an agent-generated transition is sampled, train only on the DDQN loss as in Equation (3.13).
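Putting the listed steps together, the online phase could look roughly like the sketch below. The agent, environment, and buffer objects and their methods are placeholders, and details such as the periodic target-network update are omitted.

```python
def train_dqfd(agent, env, buffer, num_steps, batch_size=32):
    """Online DQfD phase: act, store, sample with priorities, and apply the
    extended loss only to demonstration transitions."""
    state = env.reset()
    for _ in range(num_steps):
        action = agent.act(state)                     # epsilon-greedy action selection
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, is_demo=False)
        for s, a, r, s_next, is_demo in buffer.sample(batch_size):
            if is_demo:
                agent.update_with_extended_loss(s, a, r, s_next)   # Equation (5.5) terms
            else:
                agent.update_with_ddqn_loss(s, a, r, s_next)       # Equation (3.13) only
        state = env.reset() if done else next_state
```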


Chapter 6

Experiments

This chapter describes how the experiments are set up. Section 6.1 explains how the hyperparameters are selected. Then, an overview of the experiments is presented, including the sampling strategies and the setup for the demonstration experiments (Section 6.2). This is followed by the metrics used to evaluate the experiments (Section 6.3). Lastly, the specification of the setup, such as hardware, software frameworks, and programming language, is described in Section 6.4.

6.1 Hyperparameters

This thesis aims to answer whether the sampling strategy from a replay buffer can improve the speed and stability of the training phase. The primary focus is therefore not to achieve state-of-the-art performance, so no exhaustive hyperparameter search is carried out. The hyperparameters are primarily taken from the original DQN paper [7] with adjustments; a coarse hyperparameter search was performed on CCFS in [8], and those values are used here. We keep these hyperparameters constant over multiple runs to enable a fair comparison. In addition to the default DDQN hyperparameters, the ε-greedy action selection introduces the parameter ε.

The sampling methods PER [56] and PSER [41] introduce hyperparameters that were chosen according to the original papers.

The demonstration implementation also introduces hyperparameters, including the number of demonstrations to use, the number of pre-training steps, and how much the demonstrations should be weighted.

The hyperparameters are provided in the appendix [A.1,A.2,A.3,A.4].


6.2 Experiments

The experiments are divided into two groups: the first set of experiments aims to measure which factors contribute to volatility, and the second set explores how sampling strategies and demonstrations can be used to increase the speed and stability of training. This section presents the details of these two groups of experiments.

6.2.1 Contributing Factors

When it comes to the reliability of results, it is important to identify the factors contributing to the variance in the target metrics. We have identified five factors that are explored in this thesis: Game seed, Python seed, Tensorflow seed, High exploration, and Low exploration. Additionally, we train a baseline to act as a reference.

Game seed influences the game initialization and how the game plays out: the board is initialized with candies in certain places, and the seed determines the order in which new candies fall from the top during the game. Two different game seeds can lead to almost entirely different state spaces, thereby contributing greatly to volatility. The game seed can also affect the difficulty of a level, and some seeds make a level impossible to win. Python seed only dictates the random selection of actions from the available actions, according to Equation (5.2), and how the mini-batch is sampled from the experience replay buffer. Tensorflow seed changes the initialization of the neural network, as defined in Equation (5.4). This is expected to be a minor contributor to volatility since the Xavier initialization forces the weights to be initialized with similar values. High and Low exploration differ from the baseline by the value of ε in Equation (5.2): a small ε implies low exploration and a high ε implies high exploration. The idea is to investigate how much exploration contributes to volatility; we expect high exploration to result in higher volatility, since many of the samples in the buffer will come from random actions. Table 6.1 shows how the seeds and exploration are varied and how the experiments differ from the baseline.


Factor            Game seed      Python seed    Tensorflow seed  ε       Trials
Baseline          G1             P1             T1               0.01    5
Game seed         (G2, ..., G6)  P1             T1               0.01    5
Python seed       G1             (P2, ..., P6)  T1               0.01    5
Tensorflow seed   G1             P1             (T2, ..., T6)    0.01    5
Low exploration   G1             P1             T1               0.0001  5
High exploration  G1             P1             T1               0.5     5

Table 6.1: The setup used in the experiments related to factors contributing to volatility. All experiments are carried out on level 61 in CCFS. Every experiment is carried out in five trials.
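For reference, the three seeds of Table 6.1 could be pinned roughly as follows; set_game_seed is a hypothetical stand-in for the internal game API, the numpy seeding is an extra precaution not mentioned in the text, and the TensorFlow 2 seeding call is an assumption about the framework version.

```python
import random
import numpy as np
import tensorflow as tf

def fix_seeds(env, game_seed, python_seed, tf_seed):
    """Pin the three sources of randomness varied in Table 6.1."""
    env.set_game_seed(game_seed)   # hypothetical: board initialization and candy order
    random.seed(python_seed)       # action selection and mini-batch sampling
    np.random.seed(python_seed)
    tf.random.set_seed(tf_seed)    # network weight initialization, Equation (5.4)
```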

6.2.2 Sampling

This group of experiments investigates how prioritized sampling performs compared to sampling uniformly from a replay buffer. This is carried out by applying the two methods PER [56] and PSER [41], both described in Section 3.3. The experiments are divided into two settings: one uses multiple trials over the same seeds, while the other uses one trial over each of five different seed combinations. The fixed-seed setting investigates how the methods perform when all contributing factors explored in Section 7.1 are kept constant, while the multi-seed setting examines whether the methods generalize to other seeds, in order to avoid results that only hold for one specific setup.

Type     Game seed      Python seed    Tensorflow seed  Trials
PER      G1             P1             T1               5
PSER     G1             P1             T1               5
Uniform  G1             P1             T1               5
PER      (G2, ..., G6)  (P2, ..., P6)  (T2, ..., T6)    5
PSER     (G2, ..., G6)  (P2, ..., P6)  (T2, ..., T6)    5
Uniform  (G2, ..., G6)  (P2, ..., P6)  (T2, ..., T6)    5

Table 6.2: The setup used in the experiments related to sampling strategies.


6.2.3 Demonstrations

Lastly, the demonstration implementation uses 50,000 demonstrations and pre-trains for 640,000 steps. The seed setup is the same as the multi-seed setup in the sampling experiment, as shown in Table 6.3.

Type            Game seed      Python seed    Tensorflow seed  Trials
Demonstrations  (G2, ..., G6)  (P2, ..., P6)  (T2, ..., T6)    5
Uniform         (G2, ..., G6)  (P2, ..., P6)  (T2, ..., T6)    5

Table 6.3: The setup used in the experiments related to incorporating demonstrations into DDQN. All experiments are carried out on level 61 in CCFS. Every experiment is carried out in five trials.

6.3 Evaluation Metrics

Evaluation of the training quality and agent performance is carried out using three metrics: stability, consistency, and efficiency. The three proposed metrics are all normalized between zero and one, with one indicating the best score. Firstly, the average win rate is computed per episode and denoted

W_i, \quad \text{for } i = 1, 2, ..., m, \qquad (6.1)

where an episode is a full game round played until the level objective or the maximum number of moves is reached, and m is the last episode in a training run. The average win rate is the number of successfully completed game rounds divided by the total number of game rounds, and it is computed for each episode during training. The average win rate is scaled with a constant for legal reasons and is therefore called the average scaled win rate.
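As a reference for how this curve is produced, the running average scaled win rate could be computed as in the sketch below; the scaling constant is passed in symbolically since its value is not disclosed.

```python
def average_scaled_win_rate(episode_outcomes, scale):
    """episode_outcomes: 1 for a won episode, 0 for a lost one.
    Returns the cumulative win rate after each episode, multiplied by the
    (undisclosed) scaling constant."""
    rates, wins = [], 0
    for i, outcome in enumerate(episode_outcomes, start=1):
        wins += outcome
        rates.append(scale * wins / i)
    return rates
```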

To measure the agent's skill, the agent is compared to a random policy playing the game. The random policy chooses a random action at every step, and the scaled random win rate is defined as

W_{rand} = E[W_{\pi_{rand}}], \qquad (6.2)

where \pi_{rand} denotes the random policy.
