DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Monte Carlo Tree Search in Real Time Strategy Games with Applications to Starcraft 2

ARON GRANBERG


Monte Carlo Tree Search in Real Time Strategy Games with Applications to Starcraft 2

July 2, 2019

Author: Aron Granberg, arong@kth.se
Supervisor: Petter Ögren, KTH
Examiner: Joakim Gustafsson, KTH


Abstract

This thesis presents an architecture for an agent that can play the real-time strategy game Starcraft 2 (SC2) by applying Monte Carlo Tree Search (MCTS) together with genetic algorithms and machine learning methods. Alongside the MCTS search, a lightweight and accurate combat simulator for SC2 as well as a build order optimizer are presented as independent modules. While MCTS has been well studied for turn-based games such as Go and Chess, its performance has so far been less explored in the context of real-time games. Using machine learning and planning methods in real-time strategy games without requiring long training times has proven to be a challenge. This thesis explores how a model-based approach, built on the rules of the game, can be used to achieve a well-performing agent.

Sammanfattning

This thesis presents an architecture for a program that can play the real-time game Starcraft 2 (SC2) by using Monte Carlo Tree Search (MCTS) together with genetic algorithms and machine learning methods. Alongside the MCTS search, a fast and accurate combat simulator for SC2 and an optimization algorithm for build orders are also presented as separate modules.

MCTS has been studied extensively in turn-based games such as Go and Chess, but it has been explored less when it comes to real-time games. Using machine learning and planning algorithms in real-time strategy games without requiring long training times has proven to be a challenge. This thesis explores how a model-based approach, based on the rules of the game, can be used to create a well-performing program.


Contents

1 Introduction
  1.1 Research questions
  1.2 Ethics
  1.3 Starcraft 2
  1.4 Simulation environment

2 Theory
  2.1 Genetic algorithms
  2.2 Monte Carlo tree search
    2.2.1 Select
    2.2.2 Expand
    2.2.3 Simulate
    2.2.4 Propagate
    2.2.5 Using the MCTS result

3 Related work
  3.1 Reinforcement learning approaches
  3.2 Monte Carlo tree search
  3.3 Build order optimization
  3.4 Combat simulation
  3.5 Map analysis

4 Proposed architecture
  4.1 Deducing the enemy state
  4.2 Build order planning
    4.2.1 Gene representation
    4.2.2 Build order simulator
    4.2.3 Mutating build orders
    4.2.4 Comparing build orders
    4.2.5 Chrono boost
  4.3 Countering army compositions
    4.3.1 Gene structure
    4.3.2 Fitness function
    4.3.3 Build time approximation using neural networks
  4.4 MCTS planning
    4.4.1 Action space
    4.4.2 Game simulator
    4.4.3 Heuristic function
    4.4.4 Usage in the agent
  4.5 Combat simulation
    4.5.1 Targeting score
    4.5.2 Surrounding
    4.5.3 Picking the best target
    4.5.4 Attacking
    4.5.5 Timing effects
    4.5.6 Special units
    4.5.7 Limitations

5 Experiments
  5.1 Build order simulator vs in-game timings
    5.1.1 Results
    5.1.2 Discussion
  5.2 Build time approximation using neural networks
    5.2.1 Results
    5.2.2 Discussion
  5.3 Combat simulator
    5.3.1 Results
    5.3.2 Discussion
  5.4 Build order planning
    5.4.1 Results
    5.4.2 Discussion
  5.5 Overall performance
    5.5.1 Results
    5.5.2 Discussion

6 Conclusions

References

Glossary

DPS   Damage Per Second
MCTS  Monte Carlo Tree Search
RL    Reinforcement Learning
RTS   Real-Time Strategy
SC1   Starcraft 1
SC2   Starcraft 2


1 Introduction

In this section we outline the topic and scope of this thesis, and provide an overview of the game of Starcraft 2 (SC2).

Games have long been a platform for AI and machine learning research. They provide environments which are easily simulated, with well-defined actions and goals. Historically, the main approaches for writing agents that can play computer games have made heavy use of handcrafted strategies. Many simpler board games and computer games have since been mastered by more AI-centric approaches (Section 3); however, complex games such as SC2 remain a significant challenge. This thesis tackles the problem of writing a program that can play SC2 without resorting to strategies handcrafted by a human.

Improved AI algorithms in this area can be beneficial in the direct sense of allowing game designers to produce more challenging agents for their games. However, many of the solutions to problems found in games can be useful in other applications as well.

1.1 Research questions

In order to investigate the task of writing an agent to play SC2 we lay out three main research questions:

1. How can Monte Carlo Tree Search (MCTS) be used for high-level army control in a game like SC2, while taking into account that actions in SC2 are durative? Actions are durative in the sense that they take time to complete, unlike for example a move in Chess, which is instant.

2. How can genetic algorithms be used to create reasonable counter strategies to the enemy for the purposes of economy planning?

3. How can build orders in SC2 be optimized to most efficiently execute those counter strategies?

Limitations on scope

In the game of SC2, individual control of units, commonly called microing, can lead to better outcomes in battles. This thesis does not tackle that problem; the bot only uses high-level actions that concern larger groups of units, at the cost of slightly reduced strength in battles.

1.2 Ethics

This work deals with games and other simulations that have minor impact on the world as far as ethics are concerned. We have not found any ethical concerns regarding the work done in this thesis.


Figure 1: (a) A typical SC2 map; players start in opposite corners. (b) A screenshot from the SC2 game during a battle.

1.3 Starcraft 2

Starcraft 2 (SC2) is a Real-Time Strategy (RTS) game in which opposing players or teams play on a symmetrical map (see Figure 1a) with the goal of building an army to eliminate their opponent. Both players control up to several hundred units simultaneously. The players build up their bases in order to improve their economy, build army units, and battle the opponent frequently in smaller or larger battles (see Figure 1b). There are 3 races a player can choose in SC2 (Terran, Protoss and Zerg), which have completely distinct unit types and many unique game mechanics. Each race has around 40-50 distinct unit types when including buildings. For more information about SC2 we refer to [1].

Starcraft 1 (SC1) is the predecessor of SC2. Some papers cited in this thesis study SC1; as the two games are very similar, most of these approaches apply to SC2 as well.

1.4 Simulation environment

This thesis uses the full SC2 game as the simulation environment. The agent interfaces with the environment through an API [2] provided by Blizzard and Google DeepMind. The API allows the agent to issue commands to arbitrary units in a very flexible way, and also allows the agent to observe the game state.

2 Theory

This thesis makes use of both genetic algorithms and Monte Carlo tree search. This section gives a brief overview of those algorithms, to make it easier for the reader to follow the later sections.


Figure 2: Illustration of how a genetic algorithm can mutate and combine genes over four generations in order to produce greenish colors. Note how only the greenest genes are carried over to the next generation.

2.1 Genetic algorithms

A genetic algorithm [15] is an optimization method belonging to the larger group of evolutionary algorithms, which, as the name implies, are inspired by evolution. It is based on the concept of a group of individuals represented by their genes (the current generation), which are modified slightly (mutation) or combined (crossover) into new individuals. The best individuals in the generation, according to some objective function (the fitness function), are then moved to the next generation, and the remaining ones are discarded. This is repeated many times until a given iteration limit is reached. The best individual of the final generation is then returned as a good solution to the problem.

Mutation is typically done via a number of different mutation operators that are specific to the structure of the genes.

In Figure 2 an example can be seen in which colors are used. Starting from an initial generation of random colors the 3 colors closest to green are selected and randomly mutated or combined to form the next generation. In this case the mutation operator is simply tweaking the color a bit randomly and the crossover operator is blending the colors together in RGB space. In the next generation the 3 colors closest to green are yet again selected and used to form a new generation and so on. After just a few generations one can see that most of the genes look pretty green.

Genetic algorithms can be applied to pretty much any type of data, as long as it is possible to mutate it in small steps to produce a better solution. Better is defined here as producing a higher value when evaluated using the fitness function. Typically several different mutation operators and some form of crossover are used; however, one can use only mutation or only crossover if that is reasonable for the particular problem. A major benefit of genetic algorithms compared to many other optimization methods is that they are gradient free, i.e. there is no need to calculate the gradient of the fitness function with respect to the data. This makes the implementation simpler, allows much more freedom in the choice of fitness function, and allows optimization over non-differentiable data. However, for problems where a gradient exists and is not too hard to calculate, gradient-based methods typically perform better.
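To make the color example concrete, the following minimal Python sketch (illustrative only; the population size, mutation range and iteration count are arbitrary choices, not values from this thesis) evolves RGB colors towards green using the mutation and crossover operators described above:

    import random

    POP_SIZE, SURVIVORS, GENERATIONS = 12, 3, 30

    def fitness(color):
        # "Greenness": reward the green channel, penalize red and blue.
        r, g, b = color
        return g - (r + b) / 2

    def mutate(color):
        # Mutation operator: tweak each channel a bit randomly.
        return tuple(max(0, min(255, c + random.randint(-30, 30))) for c in color)

    def crossover(a, b):
        # Crossover operator: blend the two parent colors in RGB space.
        return tuple((ca + cb) // 2 for ca, cb in zip(a, b))

    population = [tuple(random.randrange(256) for _ in range(3)) for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # The best individuals survive; the rest are discarded and replaced
        # by mutated or combined copies of the survivors.
        survivors = sorted(population, key=fitness, reverse=True)[:SURVIVORS]
        offspring = [crossover(*random.sample(survivors, 2)) if random.random() < 0.5
                     else mutate(random.choice(survivors))
                     for _ in range(POP_SIZE - SURVIVORS)]
        population = survivors + offspring

    print(max(population, key=fitness))  # close to pure green (0, 255, 0)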


2.2 Monte Carlo tree search

Monte Carlo Tree Search (MCTS) is a game tree search algorithm [6]. Some better-known algorithms used for the same purpose are the minimax and alpha-beta algorithms [20].

For a given game, the MCTS algorithm solves the problem of exploring the search space of possible actions and approximately calculating what the optimal move is for the current player.

MCTS first became popular after being used with success for playing the game of Go [6], and has since been used in several board games, such as Game of the Amazons [14] and Hex [8], as well as some real-time games, such as Ms. Pac-Man [18]. MCTS has recently been combined with reinforcement learning in the game of Go to beat one of the best human players in the world [22].

The algorithm is summarized below. For illustration purposes we will assume a two-player game where the players take alternating turns, but this is not strictly necessary.

Assume we have a search tree where each node represents a game state and each edge represents an action. The starting game state is at the root of this tree, and initially the tree consists of just this root. In each game state we keep two additional values: W, the number of wins we have seen that include this state, and N, the number of times this state has been visited in total. The estimated win ratio R for a given state is then W/N. Since players take alternating turns in this game, the root node corresponds to player 1's turn, the root node's children to player 2's turn, and so on. The number of wins W is counted from the perspective of the player whose turn it is in that state. The algorithm then proceeds as follows; each step is described in detail further below.

1. Start with the current node set to the root node.

2. Select child nodes recursively until a leaf node is found. This leaf becomes the new current node.

3. If the current node has been visited more times than some threshold then expand the node and go back to step 2, otherwise ignore this step.

4. Simulate a game from the current node.

5. Propagate the simulation result to the top of the tree.

6. Continue from step 1 until some desired number of iterations have been run.

2.2.1 Select

From the current node, a child node is selected to move to. This can be done in multiple ways. The most common one is called UCT (Upper Confidence bound 1 applied to Trees) [10]. Given a child $n$ and current node $p$, the UCT formula says to select the child with the highest value of

$$\mathrm{UCT}(n) = \begin{cases} \infty & \text{if } N_n = 0 \\[4pt] c\sqrt{\dfrac{\ln N_p}{N_n}} + \dfrac{W_n}{N_n} & \text{otherwise} \end{cases}$$

In the formula, the second term corresponds to the win rate R; it is reasonable to preferentially search nodes that have a high win rate. The first term corresponds to exploration, and c is a constant that controls the amount of exploration. A higher value will make the algorithm try many different actions, while a low one will make it focus more on the most promising parts of the tree. c is commonly set empirically; however, $\sqrt{2}$ is a good starting point, as it is theoretically the optimal value [10]. The simulations are noisy, so the win rate is not necessarily the ground truth; therefore we need to occasionally explore alternative actions that may have a lower estimated win rate at that point in time. If some child nodes have never been visited, UCT says we should always prioritize those nodes over all other nodes.
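As a sketch, the UCT selection can be written directly from the formula (hypothetical Python; a node is assumed to carry the W and N statistics defined above and a list of children):

    import math

    def uct_score(parent, child, c=math.sqrt(2)):
        # Unvisited children get an infinite score and are always tried first.
        if child.N == 0:
            return math.inf
        exploration = c * math.sqrt(math.log(parent.N) / child.N)
        win_rate = child.W / child.N  # the R = W/N term
        return exploration + win_rate

    def select_child(parent, c=math.sqrt(2)):
        return max(parent.children, key=lambda n: uct_score(parent, n, c))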

2.2.2 Expand

If the current node (a leaf node) has been visited more times than a given threshold (i.e. $N > K$ for some user-defined value of K), then we add children to the node so that it is no longer a leaf. The node is expanded by adding one child node for each possible action; each child is initialized with the state we get when taking that action from the current state, and with $N = W = 0$.

2.2.3 Simulate

A simulation of the game is run from the current state, either until the end of the game is reached or for a few steps before applying a heuristic function. The actions taken by the players during the simulation can be random, or they can be controlled by some predefined policy. If the end of the game is reached, a win ($V = 1$), loss ($V = 0$) or tie ($V = 0.5$) for the current player is recorded. A heuristic function can instead be used to estimate the probability of winning from the current state. There are no theoretical constraints on the heuristic function, but a good one should return values close to 1 if the current player is likely to win and close to 0 if the opponent is likely to win. The values do not have to be between 0 and 1; however, the exploration coefficient c should be of around the same order of magnitude as the heuristic values.

2.2.4 Propagate

The result V from the simulation step is used to update the tree statistics. The current node's W is incremented by V and its N is incremented by 1. The result is then propagated upwards: we move to the parent node and increment its N and W variables as well. Note that since the player perspective in the parent node is different, the W update is not exactly the same: if the simulation indicated that the player in that node won, then in the parent node the current player lost. This propagation continues until we reach the root node.
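Tying the steps together, one iteration of the search could be sketched as follows (hypothetical Python; select_child is the UCT selection above, simulate stands for the rollout or heuristic evaluation of Section 2.2.3 and is assumed to return a value in [0, 1], and the node interface is assumed):

    def mcts_iteration(root, expand_threshold=4):
        # Steps 1-2: select recursively until a leaf node is reached.
        node = root
        while node.children:
            node = select_child(node)
        # Step 3: expand the leaf if it has been visited often enough,
        # then descend into one of the new children.
        if node.N > expand_threshold and not node.is_terminal():
            node.expand()  # one child per possible action, each with N = W = 0
            node = select_child(node)
        # Step 4: simulate from this state; V is from the perspective of
        # the player whose turn it is at this node.
        value = simulate(node.state)
        # Step 5: propagate upwards, flipping the perspective at each level.
        while node is not None:
            node.N += 1
            node.W += value
            value = 1 - value
            node = node.parent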


2.2.5 Using the MCTS result

When the MCTS search is done, the child of the root node with the highest value of N is chosen as the best action (not the highest R, as one might intuitively expect).

As one can note from this, the MCTS algorithm requires a model of the game dynamics, often called a forward simulator. This simulator can, given the current state and an action, determine what the next state of the game will be. This is usually easy for board games and simple computer games; however, it poses problems for an RTS game, as the simulator is often large and slow (essentially the full game), which makes the whole search hopelessly slow. In addition, the number of actions one can take in an RTS game like SC2 is very large, far more than could even be iterated over (in SC2 one can have hundreds of units; even the choice of which subset of those units to select to move brings the branching factor up to $2^{200} \approx 10^{60}$). It can therefore be necessary to use an abstracted state and action space to be able to plan in any reasonable time.

There are many variants of MCTS. One variant that is important for this thesis is a change to the UCT formula. In [22] a variant was used that incorporates priors into the formula. It is given by

$$\begin{cases} c\,P(n \mid p)\,\dfrac{\sqrt{\sum_{b \in C_p} N_b}}{1 + N_n} & \text{if } N_n = 0 \\[8pt] c\,P(n \mid p)\,\dfrac{\sqrt{\sum_{b \in C_p} N_b}}{1 + N_n} + \dfrac{W_n}{N_n} & \text{otherwise} \end{cases}$$

where $C_p$ is the set of all child nodes of $p$, and $P(n \mid p)$ is the prior probability of the action leading to state $n$ being the best one when in state $p$. How one arrives at this prior varies. One could incorporate handcrafted estimates based on a human expert, try to learn it using machine learning, or use any number of other methods. We can note one important thing: the score is no longer infinite for unexplored child nodes. This is important for large action spaces, because there may be a large number of actions that we can technically take while we also have a priori knowledge that most of them are very likely bad. With the plain UCT formula, the algorithm would have to visit every single unexplored child node before it could visit any child node more than once.
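A sketch of this prior-weighted score (hypothetical Python; priors maps each child to $P(n \mid p)$):

    import math

    def prior_score(parent, child, priors, c=1.0):
        # Sum of visits over all children of p (the b in C_p term).
        total_visits = sum(b.N for b in parent.children)
        exploration = c * priors[child] * math.sqrt(total_visits) / (1 + child.N)
        if child.N == 0:
            # No longer infinite: the prior decides which unexplored
            # actions are worth trying first.
            return exploration
        return exploration + child.W / child.N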

In a real-time game like SC2 the players do not take turns as such; however, from a planning perspective, alternating turns may be a good enough approximation to yield useful plans. MCTS has also been investigated with simultaneous moves, in [12] among others.

3 Related work

There are many ways to tackle planning in games, and games contain many different sub-tasks that can stand by themselves as the subject of a paper. In this section we give a brief overview of some ways of handling scenarios that occur in RTS games in particular, both for planning in the full game and for handling some specific sub-tasks.


3.1 Reinforcement learning approaches

Reinforcement Learning (RL) has risen in popularity during the last few years and has been successfully applied to a wide variety of tasks, with games being a common subject of study. RL has been applied to many board games, Go among others [22], and many computer games, including Super Mario, Space Invaders, Pong, Breakout [16] and Montezuma's Revenge [21].

A recent improvement on AI for SC is [24], in which a bot is successfully trained using reinforcement learning to defeat all the built-in bots in SC2. The strategies discovered by the agent were not very varied, but it is still impressive to be able to train an RL bot on the full SC2 game. Even more recently, Google's DeepMind has managed to train a much better agent that can even beat professional players [7]. There is no published paper with this result yet, so very few details are available. It is known, however, that a very large amount of computation power was required to train this agent: each agent they trained (their approach required training multiple agents with different strategies) played SC2 for about 200 years of game time, corresponding to millions of games, as an average SC2 game is between 10 and 15 minutes long.

In [13] a modular approach is taken, where different components of the bot are individually trained and combined with other components that are scripted by hand. The bot manages to beat all built-in bots in SC2 with a win rate of 50% or higher. They note that, surprisingly, their bot performed better with partial information (i.e. only the part of the map near the player can be observed, often called fog-of-war) than with perfect information.

3.2 Monte Carlo tree search

MCTS is a common algorithm to use for planning in games. See Section 2.2 for a brief overview of the algorithm.

In RTS games there have also been attempts at using MCTS. In [17] MCTS is used to play µRTS, a small RTS-like game built specifically for use as a research platform. They train a probabilistic model from replays of a strong player's games to guide the MCTS towards actions which seem reasonable for the current game state. This probabilistic model is very helpful in overcoming the very large branching factor in RTS games. The use of MCTS in this game, combined with the probabilistic model, was enough to significantly improve upon the state of the art for that game. Later, in [27], they extend this to SC1 with some success. MCTS is used for controlling units on a squad level (i.e. small groups of a few units at around the same position). Partial observability is disabled in their tests, and the agents have perfect information about the game state.

MCTS normally assumes perfect information; however, in [28] a variant is explored which handles hidden information.


3.3 Build order optimization

Many RTS games, and SC in particular, have an important concept called build orders. A build order is the order in which units and buildings are built in the game, disregarding placement, combat and other side effects. It is quite a useful abstraction for many problems in RTS games. Even most professional human players learn many build orders by heart and try to execute them to perfection in their games. This is possible because the beginning of the game has very little interaction between the players. Therefore the players can have a set plan for what they want to build and can follow it closely in most cases. One can take the separation further: even later in the game, the army composition and unit movement are quite decoupled from the base building. Therefore, to reduce the complexity of the game, one could separate base building and army management into two different tasks that one tries to plan for independently. This is of course not a perfect approximation; there are many cases where the army does matter for the base building and vice versa. For example, if the enemy has an army outside the player's base, then it is perhaps not the best time to send out a worker to try to expand the base, as it will just get killed immediately. This decoupling might nevertheless be very useful.

One paper which explores this subtask is [4], which uses a genetic algorithm to find the best build order. That is, given the current game state and a set of units that the player wishes to have, it tries to find the best sequence of buildings to build and units to produce that will lead to the player having those units in as short a time as possible. The build orders that are produced are near optimal, and the algorithm is fast enough to be used by a bot playing the game in real time. Another paper is [25], which takes a different approach and uses RL to find the best build order.

3.4 Combat simulation

There are other subtasks of RTS games that are useful for planning. A very important one is combat simulation. This allows the agent to predict the outcome of battles before engaging in them, or, during a battle, it can help to decide whether fleeing or staying is the best option. It is not always used as an explicit subtask when planning for RTS games; for example, approaches using RL usually ignore it completely. In MCTS, some sort of forward state simulator is needed, and a combat simulator can be a very important part of that. This is especially true in games like SC, because the game is designed to have quite prominent non-transitive unit strengths (e.g. unit A is better than B, which is better than C, which is better than A), and the army composition really matters for the outcome of the battle. These dynamics are not easily modelled without a combat simulator.

There is a trade-off between the accuracy of the simulator and the performance and abstractions one can use. Unit positions, for example, do matter quite a lot in SC; however, when planning for a battle that lies in the future, it is hard or impossible to know the unit positions at that time. Therefore it may be beneficial to use other approximations that assume common scenarios.

In [23] the outcomes of individual battles in SC1 are predicted with high accuracy. They also made an attempt to find an army composition that would win against a given opponent army, subject to some constraints such as the number of units it is allowed to use. In this task they were less successful, but the results were still good. No unit positions were considered. In [5] a combat simulator for SC1 called SparCraft is presented, which has subsequently been open sourced and used in several SC1 bots. This simulator has a high accuracy and includes unit positions but ignores collisions. Alpha-beta and MCTS search for planning during a single combat were also explored; however, a third, greedy search was found to perform better than both of them. In [26] an approach is taken where they try to learn combat models from replays of games. This model is compared to SparCraft, and it is found that the learned model performs similarly but is significantly faster to execute. The model does, however, not take into account special game mechanics such as healing, area damage (splash), or other special unit abilities.

Figure 3: Overview of the different modules of the agent and how they depend on each other.

3.5 Map analysis

The choice of state space for tasks such as these can be very important. For positions, it can often be good enough to simply use Cartesian coordinates; however, some tasks can benefit from a richer positioning system. One way to approach this is to decompose the map into larger regions, which reduces a position to a single integer indicating which region a point is inside. One paper that does this in depth is [19].

4 Proposed architecture

In this section we present how the agent is structured: how the problem of playing SC2 is broken down into modules, and how those modules fit together. We have chosen to break down the problem using a very high-level strategy, which can be summarized as follows:

1. Deduce a reasonable state for the enemy (Section 4.1).


2. Find an army composition that counters the enemy army as effectively as possible (Section 4.3). This takes into account the army and the buildings the agent has at that time, since those influence how quickly a new army can be built.

3. Find the best build order to build that army composition (Section 4.2).

4. Use MCTS to plan unit movement assuming the above build order is followed (Section 4.4).

The first 3 steps are performed at regular but sparse intervals, only once every few game minutes. This is because the plan that the build order optimizer generates tends to focus on the economy early in the build order and on producing units later; recalculating the build order too often biases the overall strategy towards economy, which leads to worse overall play compared to using a single build order for a longer time. The MCTS search in step 4, on the other hand, is performed once every few seconds.
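A sketch of how this scheduling could look in the agent's main loop (hypothetical Python; the module functions and the exact intervals are placeholders for the components described in the following sections):

    REPLAN_INTERVAL = 180.0  # steps 1-3: once every few game minutes
    MCTS_INTERVAL = 5.0      # step 4: once every few seconds

    class Agent:
        def __init__(self):
            self.last_replan = self.last_mcts = float("-inf")
            self.build_order = []

        def on_step(self, game_time, observation):
            if game_time - self.last_replan >= REPLAN_INTERVAL:
                enemy = deduce_enemy_state(observation)             # Section 4.1
                composition = counter_composition(enemy)            # Section 4.3
                self.build_order = plan_build_order(composition)    # Section 4.2
                self.last_replan = game_time
            if game_time - self.last_mcts >= MCTS_INTERVAL:
                actions = mcts_plan(observation, self.build_order)  # Section 4.4
                execute(actions)
                self.last_mcts = game_time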

In Figure 3 an overview of how the different modules fit together is shown.

The following sections will describe the individual modules in detail.

4.1 Deducing the enemy state

Since SC2 is a game with partial information, it is important to be able to make good guesses about the state of the enemy: for example, the army composition, the number of bases and where they are located, etc. This section describes a simple way to produce a prior that the other modules can use. It should be noted that this prior is not representative of the state of the art when it comes to predicting the enemy state. A prior of some kind is necessary for the rest of the modules to work; however, no significant effort has been spent on improving this module.

The countering module (Section 4.3) needs information about what army units the enemy is likely to have in the future. This is calculated by assuming that the enemy will primarily produce more of the same units that it already has. If the agent knows about a set of enemy units A of a given type that are alive, as well as a set of enemy units B of that type which are dead, then we assume that the enemy has a total of $|A| + \frac{|B|}{2}$ units of that type.

Before the agent has seen any enemy units at all, a prior is needed. This allows the agent to plan using an expectation of what a typical enemy has at a given stage of the game. For each race we define a hard-coded probability distribution over which units that race is likely to have in the early stages of the game:

Protoss: Zealot 1/6, Stalker 3/6, Adept 1/6, Archon 1/6
Zerg: Queen 1/10, Roach 4/10, Zergling 4/10, Hydralisk 1/10
Terran: Marine 5/10, Marauder 2/10, Hellion 3/10

We also define a unit count prior that depends on the game time T in seconds:

$$C_{\text{prior}}(T) = 25 \cdot \frac{200}{200 + T}$$


Figure 4: A build order for producing a single Zealot. The time when an action is started is listed below each action, in the format mm:ss.

To get the final unit count estimate for the enemy, we sum up all units based on the alive and dead counts, and then, as long as the total count is lower than $C_{\text{prior}}$, we sample single units from the unit probability distribution until we have enough.

Note that the prior decreases over time; it does not increase. This is because we want the prior to matter less further into the game. The prior starts at a high value, and not at zero, because it is used to determine how we should counter the enemy (Section 4.3). This counter only matters once we actually fight the enemy, so we really want to know what units to expect the enemy to have in the next battle.
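The estimation can be sketched as follows (hypothetical Python; alive and dead map unit types to observed counts, and the hard-coded distribution shown is the Protoss one from above):

    import random

    UNIT_PRIOR = {"Zealot": 1/6, "Stalker": 3/6, "Adept": 1/6, "Archon": 1/6}

    def c_prior(game_time):
        # Unit count prior; decreases so the prior matters less over time.
        return 25 * 200 / (200 + game_time)

    def estimate_enemy_units(alive, dead, game_time):
        # Assume the enemy keeps producing what it already had:
        # |A| + |B|/2 units per type.
        estimate = {u: alive.get(u, 0) + dead.get(u, 0) / 2
                    for u in set(alive) | set(dead)}
        # Top up by sampling single units from the prior distribution
        # until the total reaches C_prior.
        types, weights = zip(*UNIT_PRIOR.items())
        while sum(estimate.values()) < c_prior(game_time):
            u = random.choices(types, weights)[0]
            estimate[u] = estimate.get(u, 0) + 1
        return estimate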

4.2 Build order planning

In SC2 the problem of planning the order in which to build units and buildings is mostly separate from the rest of the game, so it can be handled independently. In the game, every unit/building has a cost in terms of minerals and vespene gas, a time to build, and a list of prerequisites (e.g. a Gateway building requires a Pylon building to have been built first).

An example of a build order can be seen in Figure 4. In that build order the agent starts by building a Pylon 4 seconds into the game, then at 9 seconds it builds a Probe, etc. Even in a build order this short there are many dependencies: for example, the Zealot cannot be started before the Gateway is finished, and the Gateway cannot be started before the Pylon is finished.

We formulate the problem as: what is the best build order, given that we want to produce a given number of units/buildings of some specified types from a given starting state? Best is left vague here, because it is not as simple as, for example, optimizing for the build order that takes the least time or requires the fewest resources (see Section 4.2.4). This problem formulation assumes that we know which units the agent wants to build and how many of them; the process of determining this goal is covered in Section 4.3.

The approach taken in this project is similar to the one taken in [11]; however, several important changes are made to improve performance and increase the usefulness of the build orders. An evolutionary algorithm where each individual corresponds to a build order is used to iteratively optimize the build orders to maximize a given objective function. The parts that differ from [11] are in particular the implicit dependencies, the event-based simulator (instead of a fixed time step), and the way the best build order is determined, all of which are described below. An overview of genetic algorithms is given in Section 2.1. The following sections discuss the details specific to this scenario.


4.2.1 Gene representation

In order to optimize build orders in SC2, one can use the literal sequence of actions in the build order as a gene. A gene could then for example be the sequence [Probe, Probe, Pylon] or [Probe, Pylon, Gateway, Zealot]. These genes have a variable length and their representation is very simple. Note that timing information is not included in the gene itself; it is simply a side result of the build order being executed in the game or in a simulator. The idea is that when a build order is executed, the agent executes the actions in sequence. If it reaches a point where it cannot execute an action, it simply waits until it has the necessary resources or tech requirements. Using a simulator, one can compare different build orders and see which are better and which are worse. For example, one fitness function could be to minimize the time at which the last item in the build order is finished. Note that unless otherwise specified, it is assumed that a build order starts from the starting state of the game, i.e. 12 workers and 1 Nexus/Hatchery/Command Center depending on the race.

There is, however, a significant problem with this representation: many genes represent build orders that cannot be executed due to dependencies. Even if one starts with a generation of build orders that are all valid, one will quickly get invalid build orders when the genes are mutated. For example, the build order [Gateway] would make the agent wait indefinitely, because building a Gateway requires a Pylon, but no Pylon is specified in the build order. There are a few ways to approach this problem. The first is to mutate build orders without taking dependencies into account and then discard build orders that turn out to be invalid; this is the approach taken in [11]. One can also try hard to ensure that any mutation preserves the requirements; however, this is both error prone and makes the code much more complex and slower. A third option, which we use here, is implicit dependencies. The idea is that a build order stored in a gene is pre-processed before it is executed, to ensure that it is valid regardless of what it contains. This pre-processing step consists of checking each item in the build order to see if all pre-conditions are fulfilled at that point. If this is not the case, the build order items necessary to fulfill the requirements are inserted right before that item. This procedure is then repeated recursively, so that requirements of the requirements can be fulfilled.

There are two types of requirements that affect the build orders. The first one is direct unit and building dependencies. These can come from the SC2 tech tree or due to the fact that a unit is produced in a given structure. A requirement of this type can also appear when a unit/building requires vespene gas to be built. If this is the case we add a requirement on the gas collector building (e.g. Refinery) corresponding to the agent’s race. This ensures that the agent will be able to get the resources it needs (even though it may take a long time). We handle the unit and building dependencies by keeping track of the unit and building counts at each point in the build order and inserting the necessary items based on the requirements. The unit and building counts can be calculated directly from the build order because each item in the build order corresponds to the count of that unit or building type being incremented by one.

The second one is the supply limit. In SC2 only a limited number of units can exist at the same time. The limit depends on how many buildings of a specific type the player has. For example, for the Protoss race each Pylon building increases the supply limit by 8. It is possible to track how close to this limit we get by looking at the building counts at each step in the build order. If at any point we would go over the limit, we add a new supply building right before that item in the build order.

To provide an example of all these effects, consider a build order with a single item, [Carrier]. This build order is converted to [Stargate, Carrier], because the Carrier is built in a Stargate. The Stargate itself requires the Cybernetics Core, which requires the Gateway, which in turn requires a Pylon, so those buildings are recursively added as well. The build order then becomes [Pylon, Gateway, Cybernetics Core, Stargate, Carrier]. The Stargate also requires vespene gas to produce, so a Refinery is added right before it. Finally, the Fleet Beacon is added right before the Carrier, because it is a tech requirement of the Carrier. The Fleet Beacon also requires the Cybernetics Core, but since that already exists, no further action is needed. The final build order is thus [Pylon, Gateway, Cybernetics Core, Refinery, Stargate, Fleet Beacon, Carrier].

As one can see, dependencies can add many steps to a build order. Using an evolutionary algorithm to evolve the above build order directly would take a lot more time than using implicit dependencies. Furthermore, implicit dependencies make the genes easier to manipulate through mutation. Consider the build order [Probe, Probe, Probe, Probe, Probe, Pylon, Gateway], and assume it turns out that producing the Gateway earlier would be beneficial. Without implicit dependencies, the Pylon would first have to be moved, which may lead to an intermediate, worse build order, and then the Gateway. With implicit dependencies the build order could instead look like [Probe, Probe, Probe, Probe, Probe, Gateway], with the Pylon implicit. The Gateway could then be repositioned and a Pylon added where necessary.
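The pre-processing step can be sketched like this (hypothetical Python; the REQUIREMENTS table is a tiny excerpt standing in for the full SC2 tech tree, producer and gas requirements, and supply buildings are omitted for brevity):

    REQUIREMENTS = {
        "Carrier":          ["Stargate", "Fleet Beacon", "Refinery"],
        "Stargate":         ["Cybernetics Core", "Refinery"],
        "Fleet Beacon":     ["Stargate"],
        "Cybernetics Core": ["Gateway"],
        "Gateway":          ["Pylon"],
    }

    def expand_implicit_dependencies(build_order, start_counts):
        counts = dict(start_counts)  # units/buildings present at the start
        result = []

        def ensure(item):
            # Recursively insert missing prerequisites right before the item.
            for req in REQUIREMENTS.get(item, []):
                if counts.get(req, 0) == 0:
                    ensure(req)
            result.append(item)
            counts[item] = counts.get(item, 0) + 1

        for item in build_order:
            ensure(item)
        return result

    # Reproduces the Carrier example above:
    # ['Pylon', 'Gateway', 'Cybernetics Core', 'Refinery', 'Stargate',
    #  'Fleet Beacon', 'Carrier']
    print(expand_implicit_dependencies(["Carrier"], {}))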

4.2.2 Build order simulator

In order to compare build orders, one needs to simulate them, both to determine how fast they are and to track other parameters that may be relevant for the fitness function. We do this using an event-based approach very similar to the fast-forwarding approach described in [4]. We briefly outline the algorithm here but refer to [4] for further details.

In the description below, units will refer to both movable units and buildings; the SC2 engine treats them identically, and thus it is easiest to treat them identically in the simulator as well. In order to simulate the build order, we keep a simulated state for the current time. The state contains the current time, the number of units of each type and how many of them are currently busy, future events, and the player's resources. There are two kinds of events: unit completion events and events that mark a unit as no longer busy. Associated with each event is the time when it will be triggered.

As a primitive, we consider what happens when simply simulating until a given time, without adding any additional events or modifying the state in any other way. It turns out that the only things that can happen during this time are that an event is triggered or that the player's resources change due to mining. We can estimate the mining speed of the agent using two values, minerals per second and vespene gas per second, with an approximation that only depends on the number of workers, the number of bases, and the number of gas harvesting structures. For simplicity we assume that when a gas harvesting structure has been built, the maximum number of workers (3) will always be assigned to it; this is pretty much always the case in real SC2 games as well. This mining speed approximation has the property that it only changes when an event happens, since that is the only time the number of workers or bases changes.

Thus we can simulate the game by fast forwarding to the time of the next event, and the resources gained during that time can be calculated. This is formalized as the function $S' = \mathrm{sim}(S, T)$, which returns the new state $S'$ after the state $S$ has been simulated until the time $T$.

Given that we want to execute an action, there are three limiting factors: a unit or building dependency is not satisfied (e.g. we want to build a Gateway but no Pylon is fully constructed yet), we do not have enough resources, or all units that can perform the action are busy (e.g. we want to produce a Zealot but all Gateways are busy producing other units). If a unit or building dependency is not satisfied for the action, we simulate until the next event chronologically and try again. Since we know that the build order can be performed (see Section 4.2.1), we will eventually reach a point where only resources or unit busyness can be the limiting factors. If all units are busy, we can similarly continue until the next event, and we will eventually reach a point where only resources can be the limiting factor. If only resources prevent the action from being executed, it is easy to calculate when we will have enough resources for the action, since resources increase as linear functions between event times. Thus we simulate until that point and execute the action at that time.

When executing an action, the procedure always follows the same basic form: one unit is marked as busy, e.g. a worker that is going to construct a building, and an event is added that will create the target unit after the build time. In almost all cases the unit stays busy until the target unit is finished; the exception is Probes, which only have to start the building process and can then go and do something else while the building constructs itself in the background. For Probes we assume that they stay busy for 6 seconds, a value chosen empirically after looking at replays of the game.

The algorithm can be summarized in the following pseudo-code.

    function SimulateBuildOrder(S, buildOrder)
        for item ∈ buildOrder do
            while not all requirements met for item do
                S ← sim(S, nextEventTime(S))
            end while
            if not enough resources for item then
                S ← sim(S, when(S, item))
            end if
            addEvents(S, item)
        end for
        return S
    end function

The when function calculates the first time at which there will be enough resources to perform the given action. The nextEventTime function returns the time of the next event chronologically in the state. The addEvents function adds new events for when the unit is completed and for marking the appropriate units as free again.
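For concreteness, a self-contained sketch of such an event-based simulator is given below (hypothetical Python; the costs, build times and mining speed model are illustrative stand-ins rather than real SC2 values, only minerals are tracked, and the build order is assumed to have been pre-processed as in Section 4.2.1):

    import heapq
    import itertools

    DATA = {  # unit -> (mineral cost, build time [s], producer)
        "Probe":   (50,  12, "Nexus"),
        "Pylon":   (100, 18, "Probe"),
        "Gateway": (150, 46, "Probe"),
        "Zealot":  (100, 27, "Gateway"),
    }
    tiebreak = itertools.count()  # keeps heap entries comparable

    def mining_speed(state):
        # Approximation depending only on the number of non-busy workers.
        return 0.9 * (state["units"].get("Probe", 0) - state["busy"].get("Probe", 0))

    def advance(state, t):
        # sim(S, T): fast forward to time t. Resources grow linearly and the
        # mining speed only changes when an event fires, so resources are
        # accrued piecewise between events.
        while state["events"] and state["events"][0][0] <= t:
            when, _, kind, unit = heapq.heappop(state["events"])
            state["minerals"] += mining_speed(state) * (when - state["time"])
            state["time"] = when
            if kind == "complete":
                state["units"][unit] = state["units"].get(unit, 0) + 1
            else:  # "free": the producing unit is no longer busy
                state["busy"][unit] -= 1
        state["minerals"] += mining_speed(state) * (t - state["time"])
        state["time"] = t

    def simulate_build_order(state, build_order):
        for item in build_order:
            cost, build_time, producer = DATA[item]
            # Wait until some producer exists and is free (a valid,
            # pre-processed build order guarantees this terminates)...
            while state["units"].get(producer, 0) - state["busy"].get(producer, 0) <= 0:
                advance(state, state["events"][0][0])
            # ...and until there are enough minerals (the "when" function).
            while state["minerals"] + 1e-9 < cost:
                eta = state["time"] + (cost - state["minerals"]) / max(mining_speed(state), 1e-9)
                next_event = state["events"][0][0] if state["events"] else eta
                advance(state, min(eta, next_event))
            state["minerals"] -= cost
            # Mark the producer busy and schedule the events ("addEvents").
            state["busy"][producer] = state["busy"].get(producer, 0) + 1
            heapq.heappush(state["events"],
                           (state["time"] + build_time, next(tiebreak), "complete", item))
            busy_for = 6 if producer == "Probe" else build_time  # Probes: 6 s
            heapq.heappush(state["events"],
                           (state["time"] + busy_for, next(tiebreak), "free", producer))
        return state

    start = {"time": 0.0, "minerals": 50.0, "events": [],
             "units": {"Probe": 12, "Nexus": 1}, "busy": {}}
    end = simulate_build_order(start, ["Pylon", "Gateway", "Zealot"])
    print(round(end["time"]), "seconds")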

4.2.3 Mutating build orders

In most genetic algorithms there is mutation, i.e. random changes to the genes which hopefully produce some genes that are better than the originals. Crossover is also popular, but for this task we found it to work poorly, as there is no straightforward way to combine build orders that does not produce very suboptimal ones most of the time. Mutation was found to work much better.

There are two primary mutation operators.

add-remove: For every item in the build order, there is a probability $p_a$ of the item being removed, and a probability $p_a$ of a new random action being inserted before the item. Care is taken such that an item cannot be removed if it is directly part of the goal for the optimizer (e.g. if the goal is to produce 5 Zealots, there must always be at least 5 Zealot actions in the build order).

move: For every item in the build order, there is a probability $p_m$ of the item being moved earlier or later. The number of steps that the item moves is sampled from $\lfloor \mathcal{N}(0, |\text{buildOrder}| \cdot \frac{1}{4}) \rfloor$, so that a small number of steps is more likely than a large one.

For every iteration of the genetic algorithm, both the move and add-remove mutation operators are applied. Using a grid search, the parameters that yielded the best build orders were found to be around $p_a = 0.05$ and $p_m = 0.025$.

Furthermore, there is an additional operator that does local optimization at every step in the build order. It tries to swap every pair of non-identical adjacent items in the build order to check if the new build order is better. It also tries to remove every single item that is not strictly required for achieving the goal. This operator is much more computationally expensive, so it is only applied once every 50 generations, as well as once to the best build order after the normal genetic algorithm has finished. Applying it once at the end helps to clean up any stray items that do not make the build order much worse but are not strictly necessary.

In Section 4.2.1 implicit dependencies are discussed. One issue with implicit dependencies is that it is not possible for the optimizer to reorder the implied dependencies. There is a chance of the add-remove operator adding a required dependency as a concrete item by chance, but this does not happen very often. All implicit dependencies are therefore added as concrete items to the build order after half the total number of iterations. This expansion step is only done once, and afterwards the build order may again be mutated to contain implicit dependencies.

To illustrate the benefit of this, consider the build order [Probe, Gateway, Zealot]. Note that there is an implicit Pylon between the Probe and the Gateway. Assume further that the optimal build order is [Pylon, Probe, Gateway, Zealot]. This build order cannot be represented while the Pylon is implicit, because the Pylon will then always be placed right before the Gateway. If, however, the Pylon is added as a concrete item, the build order can be reordered into the optimal one.
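The two operators can be sketched as follows (hypothetical Python; goal_counts gives the minimum number of each action that must remain, e.g. {"Zealot": 5}):

    import math
    import random

    P_A = 0.05   # add-remove probability p_a
    P_M = 0.025  # move probability p_m

    def mutate_add_remove(build_order, all_actions, goal_counts):
        counts = {}
        for item in build_order:
            counts[item] = counts.get(item, 0) + 1
        result = []
        for item in build_order:
            if random.random() < P_A:
                result.append(random.choice(all_actions))  # insert before item
            # An item may only be removed if the goal stays satisfied.
            if random.random() < P_A and counts[item] > goal_counts.get(item, 0):
                counts[item] -= 1
                continue
            result.append(item)
        return result

    def mutate_move(build_order):
        order = list(build_order)
        for i in range(len(order)):
            if random.random() < P_M:
                # floor(N(0, |buildOrder|/4)): small moves are most likely.
                steps = math.floor(random.gauss(0, len(order) / 4))
                j = min(max(i + steps, 0), len(order) - 1)
                order.insert(j, order.pop(i))
        return order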


4.2.4 Comparing build orders

Figure 5: A build order that produces 15 Adepts as quickly as possible.
Figure 6: A slightly slower build order that produces 15 Adepts while maintaining a stronger economy.

Ultimately one gets to the question of which build order is best. One of the simplest definitions is that the build order which builds the requested number of units in as short a time as possible is the best one. For practical use in a game this can, however, be a bad fitness function: when focusing only on fulfilling the goal, the player's overall economy suffers, and the player will be at a disadvantage if the game continues for longer than the duration of the build order.

Figure 7: Three build orders plotted by duration [s] against mining speed [1/s]. The dotted lines show the expected future mining speed if the agent only focuses on economy. The mining speed is the sum of the mineral and vespene gas mining speeds.

One can compare the build orders listed in Figures 5 and 6. Both produce 15 Adepts, and the one in Figure 5 does it slightly faster. If one looks at the mining rate of resources when each build order has completed, however, one sees that in Figure 5 the mining rate is 784 minerals/min + 160 gas/min, while in Figure 6 it is 958 minerals/min + 319 gas/min. In exchange for taking 15 seconds longer (about 6%), the build order in Figure 6 achieves a much stronger economy with 7 additional Probes. For reference, it takes about 12 seconds to train a single Probe. This trade-off can in many cases be worth it.

One candidate for a fitness function would be some linear combination of the completion time of the build order and the mining speed when it is finished. This is, however, problematic, because a mining speed increase is much more useful when the player has a very low overall mining speed than when the player already has a strong economy. Furthermore, what ultimately makes it unsuitable is that for any linear combination there exists a point where the player has a strong enough economy that it can increase the mining speed so much per unit of time that the best build order is infinitely long (i.e. building more workers increases the fitness more than the longer build order duration reduces it).

We propose to phrase it the following way: given two build orders, the shorter one is better than the longer one if it is possible to extend the shorter one to the same duration as the longer one such that the shorter one then has a higher mining speed. This is visualized in Figure 7. It is quite reasonable that the green build order is better than the blue one, as they have similar mining speeds but the green one is shorter. However, it may not be obvious which of the green and red build orders should be treated as better. Using the above definition, we can see that if the red build order is extended to the same duration as the green one, its mining speed will still be lower than the green one's, and therefore we conclude that the green one is better.


This requires us to estimate how the mining speed will increase as a build order is extended. To calculate this estimate, we assume that the best build order for increasing the mining rate follows the rule below, or a similar one for the Zerg and Terran races:

$$\text{action} = \begin{cases} \text{Pylon} & \text{if the currently available supply is} \le 2 \\ \text{Probe} & \text{otherwise} \end{cases}$$

This is reasonable, as it allows the player to continue building workers without ever being blocked by the supply limit in most cases. It is not a perfect rule, but for practical use cases over short time frames it is good enough.

Using this, we estimate how quickly the agent can increase its mining speed in the short term, by simply simulating the build order generated by the above rule for one minute of game time, beginning from the end of the existing build order. A minute is chosen because it is on the same scale as the variation in duration between different build orders, so the approximation of the mining speed increase will be representative. Denote the current mining speed $M$, the estimated increase in mining speed per unit of time $M'$, and the duration of the build order $T$. We can then compare two build orders $a$ and $b$ as follows:

$$a > b \Leftrightarrow \begin{cases} M_a + M'_a (T_b - T_a) \cdot K > M_b & \text{if } T_a < T_b \\ M_b + M'_b (T_a - T_b) \cdot K < M_a & \text{otherwise} \end{cases}$$

Intuitively, one can visualize this in the graph in Figure 7: the shorter build order is better than the longer one if the longer one lies below the dotted line extending from the shorter one.

The constant K is a user-defined parameter that can be used to tweak the behavior of the optimizer. Using the definitions in the text above, we would always have K = 1, or slightly higher to compensate for the fact that the build order rule mentioned above is not perfectly optimal. A value of K = ∞ corresponds to only focusing on the build order duration, and K = 0 corresponds to only focusing on the mining speed, which leads to infinitely long build orders as there is no duration penalty at all. Empirically we have found that a value of K ≈ 10 works well: this results in shorter build orders than K = 1, while still achieving significantly higher mining rates than K = ∞. A value of K = 1 also works well, but it is slightly more susceptible to early attacks from the enemy, as fewer military units will have been built by then.

The keen reader may notice that this is not a transitive relation; that is, there exist build orders A, B and C such that A < B and B < C, but C < A. This is an issue, as with a non-transitive relation there is no well-defined sorting order for the genes, and it is necessary to sort the genes in order to pick out the best ones that will survive to the next generation. In practice this rarely happens, however, as it primarily occurs when the build orders differ by a very large duration, in particular since the approximations used above do not hold over large time scales. For a more stable sort, we nevertheless first sort the build orders by their duration and then sort using the above relation with bubble sort. Bubble sort only compares adjacent elements in the list, which makes it much less likely to compare items with very large duration differences.
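A sketch of this comparison and sorting scheme (hypothetical Python; T, M and M_prime come from the build order simulator and the economy rule above):

    from dataclasses import dataclass

    K = 10  # duration/economy trade-off parameter

    @dataclass
    class BuildOrderResult:
        T: float        # duration [s]
        M: float        # mining speed at completion
        M_prime: float  # estimated mining speed increase per second

    def better(a, b):
        # a > b iff extending the shorter one to the same duration would
        # still leave it with a higher mining speed.
        if a.T < b.T:
            return a.M + a.M_prime * (b.T - a.T) * K > b.M
        return b.M + b.M_prime * (a.T - b.T) * K < a.M

    def sort_build_orders(orders):
        # Pre-sort by duration, then bubble sort with the (non-transitive)
        # relation; bubble sort only compares adjacent, similar-duration
        # build orders.
        orders = sorted(orders, key=lambda o: o.T)
        for i in range(len(orders)):
            for j in range(len(orders) - 1 - i):
                if better(orders[j + 1], orders[j]):
                    orders[j], orders[j + 1] = orders[j + 1], orders[j]
        return orders  # best build orders first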

4.2.5 Chrono boost

The Protoss race in SC2 has the so-called chrono boost ability, which is worth mentioning. This ability can be used once every 50 seconds for every Nexus building that the player has, and its effect is to speed up the production of a given building by 50% for 20 seconds. To handle this ability, we associate a boolean with every action in the build order. If the boolean is true, the simulator will try to use chrono boost on the given item; this may fail if the chrono boost was recently used for something else. During the mutation phase these booleans are randomly flipped, so that the optimizer eventually figures out which items chrono boost is best spent on.

A limitation of this approach is that it is not possible to apply chrono boost multiple times to a single action, which may be beneficial in some cases. However, we do not deem this a significant loss, as it is quite rarely done in practice.

4.3 Countering army compositions

In SC2 there is a very strong rock-paper-scissors-like dynamic among the unit types, i.e. a given unit is often strong against certain opponents but very weak against others. This makes it very important not just to build a large army, but to build a large army that is efficient against the opponent's army. To tackle this problem we use a combat simulator (see Section 4.5) together with a genetic algorithm to optimize for the best army composition that can beat the opponent's army, or at least our guess of the opponent's army composition (Section 4.1).

Since this is a genetic algorithm, it is very similar to what is described in Section 2.1; however, the gene representation and the mutation operators need to be specified for this application.

4.3.1 Gene structure

A suitable gene representation for this problem is a count of each unit type that should be included in the army. Since the player may already have some units, we define the gene as the additional units that should be created; this makes all non-negative unit counts valid, which is simpler to keep track of. An example of a gene could then be [Zealot: 5, Stalker: 3], where it is implied that all unit types other than Zealot and Stalker have a count of zero.

We let a gene contain values $G(u)$, the number of additional units of type $u$ that should be produced. Both mutation and crossover make sense for this representation. For mutation, we simply replace a unit count randomly with a sample from a geometric distribution that has the same mean as the current value plus 1. The 1 added to the mean ensures that if $G(u) = 0$ it does not get stuck there: positive values will also have a non-zero probability of being generated. Note that the geometric distribution $\mathrm{Geo}\left(\frac{1}{1+\mu}\right)$ generates non-negative integer values with a mean of $\mu$.

$$G(u) \leftarrow \begin{cases} \sim \mathrm{Geo}\left(\frac{1}{2+G(u)}\right) & \text{with probability } p \\ G(u) & \text{otherwise} \end{cases}$$

A mutation rate of $p = 0.1$ is used.

We also use crossover, in which two genes $(a, b)$ are combined to form a new gene. We do this by, for each unit count, picking one of the parents at random and using the value from that gene:

$$G_{\text{child}}(u) \leftarrow \begin{cases} G_a(u) & \text{with probability } 1/2 \\ G_b(u) & \text{with probability } 1/2 \end{cases}$$

One additional significant modification is made to the standard genetic algorithm. In order to make sure that the army compositions in the gene pool always win against the enemy, we perform a pre-processing step for each generation. A battle with the opposing army is simulated, and if the gene's army composition lost, all unit counts in it are scaled by 1.5. This is repeated until it no longer loses, or until a maximum number of iterations is reached. This can in some cases still fail to produce an army that wins against the opponent, in particular if the enemy has some unit which no unit in the army can attack; then no matter how much the army is scaled up, it will never be enough.
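These operators can be sketched as follows (hypothetical Python; simulate_combat stands in for the combat simulator of Section 4.5):

    import random

    P_MUT = 0.1  # mutation rate p

    def sample_geometric(mean):
        # Geo(1/(1+mean)): non-negative integers with the given mean.
        p = 1.0 / (1.0 + mean)
        k = 0
        while random.random() > p:
            k += 1
        return k

    def mutate(gene):
        # gene: unit type -> number of additional units to produce, G(u).
        return {u: sample_geometric(1 + n) if random.random() < P_MUT else n
                for u, n in gene.items()}

    def crossover(a, b):
        # Pick each unit count from one parent at random.
        return {u: (a if random.random() < 0.5 else b).get(u, 0)
                for u in set(a) | set(b)}

    def scale_until_winning(gene, enemy, max_iterations=5):
        # Pre-processing: scale all counts by 1.5 until the composition
        # beats the enemy (this can still fail, as noted above).
        for _ in range(max_iterations):
            if simulate_combat(gene, enemy).player_won:
                break
            gene = {u: max(n, round(n * 1.5)) for u, n in gene.items()}
        return gene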

4.3.2 Fitness function

There are a few important features one can consider in a fitness function. An important factor is how much it would cost to produce the army, as well as how long it would take to produce. Furthermore, it is better if most of our army survives rather than it being a very narrow victory. We also add a large penalty for not defeating the opposing army, in case the previously described scaling failed.

We formulate this fitness function in terms of resource costs. For each unit in our army, we reduce the fitness by the cost of that unit; additionally, if the unit takes damage, the fitness is reduced by its cost again, proportionally to how much damage it took. For enemy units we create a similar score.

These are not the only costs associated with producing the army, however. It may also be necessary to build additional tech buildings and unit production buildings. This is not straightforward to calculate, and it depends on which build order is used. We can, however, approximate it fairly well using a neural network, by assuming that we follow the optimal build order as calculated in Section 4.2. Using neural networks, we estimate the additional resources ($R_{\text{minerals}}$, $R_{\text{gas}}$) needed to produce the army, as well as the time it would take to produce it ($T_{\text{army}}$). This is outlined in Section 4.3.3.


Thus far we can calculate a few cost functions:

$$
\text{cost}_{\text{our}} = R_{\text{minerals}} + R_{\text{gas}} + \sum_{u \in \text{our units}} \text{cost}(u) \cdot \left(1 + \text{damageTakenFraction}(u)\right)
$$

$$
\text{cost}_{\text{enemy}} = \sum_{u \in \text{enemy units}} \text{cost}(u) \cdot \left(1 + \text{damageTakenFraction}(u)\right)
$$

where cost(u) is the sum of the mineral and vespene gas cost for the unit and damageTakenFraction(u) is the fraction of the unit's total health and shields that it took as damage during a simulated battle (i.e. 0 if the unit took no damage and 1 if it died).

To take the estimated time needed to produce the army into account, we modify $\text{cost}_{\text{enemy}}$ using a multiplier. The reasoning is that if it takes too long to produce our army, it will be less effective against the enemy, especially if the enemy attacks long before our army is ready.

Finally, we add a large negative score if the combat simulation was lost. This ensures that the optimization produces an army composition that beats the opponent.

$$
\text{cost}_{\text{loss}} =
\begin{cases}
C & \text{if the combat simulation was lost} \\
0 & \text{otherwise}
\end{cases}
$$

where C is a suitably large constant for this purpose, much larger than any other term in the fitness score.

Using this we can formulate the final fitness score

$$
F = -\text{cost}_{\text{our}} + \text{cost}_{\text{enemy}} \cdot \min\left(1,\ \frac{2 \cdot 30}{30 + T_{\text{army}}}\right) - \text{cost}_{\text{loss}}
$$

Thus we are optimizing for an army that costs little to produce and takes little damage, does not lose against the enemy and takes a short time to produce.
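Putting the pieces together, the fitness evaluation could be sketched as follows; battle_result is a hypothetical stand-in for the combat simulator's output, assumed to expose each unit's summed mineral and gas cost, its damage-taken fraction and whether the battle was lost:

```python
def fitness(battle_result, R_minerals, R_gas, T_army, C=1e9):
    # Resource cost of our army plus extra cost for damage taken.
    cost_our = R_minerals + R_gas + sum(
        u.cost * (1 + u.damage_taken_fraction) for u in battle_result.our_units)
    cost_enemy = sum(
        u.cost * (1 + u.damage_taken_fraction) for u in battle_result.enemy_units)
    # Discount the value of beating the enemy when the army takes long to produce.
    time_multiplier = min(1.0, 2 * 30 / (30 + T_army))
    # Large penalty (C) if the scaled composition still lost the battle.
    cost_loss = C if battle_result.lost else 0.0
    return -cost_our + cost_enemy * time_multiplier - cost_loss
```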

The multiplier for the enemy cost may seem somewhat arbitrary. However, having some method of optimizing for an army that takes a short time to produce turns out to be very important. For example, early in the game it makes no sense to try to produce a very late-game unit (e.g. a Carrier), even though a single one of those units might be able to beat the opponent's army at a relatively low cost. Furthermore, it is important to consider that the enemy does not necessarily wait until the player has built up its army; it can attack at any point, and then it is beneficial if the player already has something to defend with.

4.3.3 Build time approximation using neural networks

As mentioned in Section 4.3.2, we use a neural network to predict the time it would take to produce a given set of units and the additional resources it would require. We define additional resources as any non-economic buildings that are built, primarily tech buildings and buildings for producing units, but not things like worker units or a Nexus.

Since we already have a build order optimizer (see Section 4.2), one may ask why it is not used here, as it can provide exactly these answers. The reason is performance. Running the build order optimizer takes from around half a second up to one or two seconds, but the genetic algorithm has to evaluate a large number of army genes every iteration. This easily turns into thousands of queries, which would take far too long if the build order optimizer were used. Furthermore, exact information is not necessary; a good approximation is more than enough for the genetic algorithm to work well, so a neural network is a reasonable approach.

The neural network maps from two vectors of unit counts (one count for each possible unit type): one vector for the units the player already has and one for the additional units that we want to produce. The output of the network is 3 values: the time it would take to produce those units (in seconds), the additional minerals it would require and the vespene gas it would require. We train a separate network for each race; this allows a much smaller network, since no unit types are shared between the races.

The architecture is a stack of fully connected layers with leaky ReLU activations, with layer sizes 8N, 4N, 2N, N, 10, 10, 10, 10, 10, 5 and 5, followed by an output layer of size 3.

Figure 8: Neural network architecture. Each layer is labeled with its size. N is the number of unit/building types (around 35) for the race. Each layer is fully connected and has leaky ReLU activations.
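As an illustration, the stack in Figure 8 could be built as follows. This is a sketch in PyTorch (the thesis does not specify the framework used), assuming the 6N-dimensional flattened input described below:

```python
import torch.nn as nn

def make_network(N):
    # Layer sizes follow Figure 8; the 6N input size comes from the
    # flattened 2 x N x 3 encoding described in the text below.
    sizes = [6 * N, 8 * N, 4 * N, 2 * N, N, 10, 10, 10, 10, 10, 5, 5]
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.LeakyReLU()]
    layers.append(nn.Linear(sizes[-1], 3))  # time, minerals, vespene gas
    return nn.Sequential(*layers)
```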

The unit count inputs to the network are pre-processed slightly. Instead of inputting the raw unit count directly, we input 3 values for each count. The first is the raw unit count, the second is 1 if the unit count is greater than 0 and 0 otherwise, and the last is 1 if and only if the unit count is exactly 1. This makes it easier for the network to reason about, for example, whether it has at least one building of a particular kind (important for tech buildings, of which only one is needed) without being confused if it turns out that the player has many of them for some reason.

The input is therefore a 2×N×3 matrix (the 2 comes from providing the network with both the units that the player has and the ones we want to produce), which is flattened to a 6N vector.
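A sketch of this encoding, assuming numpy arrays of per-unit-type counts:

```python
import numpy as np

def encode_counts(have_counts, want_counts):
    # Each unit count c becomes the triple (c, c > 0, c == 1); the two
    # N-vectors are stacked to a 2 x N x 3 tensor and flattened to 6N.
    def triples(counts):
        c = np.asarray(counts, dtype=np.float32)
        return np.stack([c, (c > 0).astype(np.float32),
                         (c == 1).astype(np.float32)], axis=-1)
    return np.stack([triples(have_counts), triples(want_counts)]).ravel()
```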

Neural networks usually work better with default initialization methods if the output data has a magnitude of around 1; therefore we scale the target values (time, minerals and vespene gas) so that common values are around 1. The scale factors used were 1/1000, 1/5000 and 1/2000 for time, minerals and vespene gas respectively.

Training is done by sampling random starting states, i.e. which units, resources, etc. the player has, together with random target counts for which units we want to produce. We then use the build order optimizer to find a good build order and extract the build order duration and the necessary resources. The network was trained on this data using an L2 loss and an 80%-20% train-test split, using the Adam optimizer with a learning rate of 0.001 and a batch size of 512. The training and test sets together contained around 43000 build order samples. The results of this training can be seen in Section 5.2.
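A training loop along these lines could look as follows (a PyTorch sketch; the epoch count and data handling are assumptions, while the optimizer, learning rate, batch size and L2 loss come from the text):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train(model, inputs, targets, epochs=100):
    # inputs: (num_samples, 6N) tensors from the encoding above; targets
    # hold the scaled (time, minerals, gas) triples per sample.
    loader = DataLoader(TensorDataset(inputs, targets),
                        batch_size=512, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = F.mse_loss(model(x), y)  # L2 loss on scaled targets
            loss.backward()
            optimizer.step()
```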

4.4 MCTS planning

The agent uses MCTS to plan the high-level actions for the army movement. See Section 2.2 for an overview of how MCTS works. While MCTS is a very general algorithm, a low branching factor (i.e. number of possible actions) is preferred to keep search times low enough to be usable. We use a small set of high-level actions that move units around the map. The MCTS is run on a game simulator that abstracts SC2 so that it can be used for planning much more efficiently, while still keeping the parts relevant for combat reasonably accurate. As the exact state of the enemy is not known due to hidden information, a reasonable guess is made and the game simulator state is initialized with it.

For the MCTS search we have to consider how often the players can perform actions.

We let the players take alternating turns and simulate the game for a few seconds in between. To allow the search to span a sufficiently long time for a strategy to play out, we use time step durations that follow the formula ∆T = 3 + 0.2T, where T is the game time (in seconds) since the simulation started. Thus the time steps are initially spaced 3 seconds apart and then increase slowly over time. This gives the strategies more detail at the start of the simulation while still being able to represent what happens after a long time.
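A minimal sketch of this schedule (the horizon parameter is hypothetical):

```python
def turn_times(horizon):
    # Yields the simulated game times at which turns are taken, following
    # dT = 3 + 0.2 * T, with T in seconds since the simulation started.
    t = 0.0
    while t < horizon:
        yield t
        t += 3.0 + 0.2 * t

# e.g. list(turn_times(30)) -> [0, 3.0, 6.6, 10.9, 16.1, 22.3, 29.8]
```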

There are 3 key components to MCTS: the action space, the game simulator and the heuristic function. In the subsequent sections we will detail how they are defined for this agent.

4.4.1 Action space

In SC2 there is a very large number of actions that can theoretically be performed at any one time. To keep the branching factor low, it is necessary to choose a small subset of actions that is still expressive enough for the agent to play the game. The list below details the actions we have chosen for the agent. Each action is specified as a tuple (selector, action), indicating which units the action applies to and what they should do.

In the descriptions below S stands for the set of units selected by the selector.


Selector     Action                 Description
None         None                   Do nothing
Army         Attack closest enemy   Move S to the enemy closest to the average position of S
Idle army    Attack closest enemy   Same as above
Army         Consolidate            Move S to the average position of S
Idle army    Consolidate            Same as above
Army         Move to Pi             Move S to the pre-defined point Pi
Army         Move to own base       Move S to the closest building on the same team
Army         Attack enemy base      Move S to the closest enemy building

There are two selectors that are used: army and idle army. The army selector selects all army units and the idle army selector selects army units that currently have no order and are just standing still.

There are 3 predefined points P1, P2 and P3. These are hard-coded points of interest that help increase the expressiveness of the agent's actions. P1 is the center of the map, and P2 and P3 are the locations of the opponent's second and third expansions respectively. The first expansion, commonly called the natural expansion, is so close to the enemy base that a separate point for it would not add a significant amount of expressiveness.

When one of the above actions is executed, a search is done for all units that match the selector. The matching units are then given movement orders to a target point depending on the action. All actions are durative, since unit movement takes time; even if the None action is executed, many units may still be moving across the map.
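A sketch of how such selector-based dispatch might look; the simulator objects, their fields and the target-point computation are all hypothetical stand-ins for the agent's internals:

```python
from enum import Enum

class Selector(Enum):
    ARMY = 0
    IDLE_ARMY = 1

def select_units(units, selector):
    # Idle army units are army units that currently have no order.
    army = [u for u in units if u.is_army]
    if selector == Selector.IDLE_ARMY:
        return [u for u in army if u.order is None]
    return army

def execute(action, units):
    # Compute a target point for the action and give all matching units
    # a (durative) movement order towards it.
    selected = select_units(units, action.selector)
    target = action.target_point(selected)  # e.g. closest enemy, avg position
    for u in selected:
        u.order = ("move", target)
```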

Note that while the above actions may not seem to allow the agent to move to many different points on the map, actions can be executed in sequence to increase the expressiveness. For example, if the agent wants to position the army outside the main base, there is no action for this; however, if the army is currently in the base, it can first execute the action (Army, Attack enemy base), which makes the army move out of the base. Then, when the army is at a suitable distance from the base, the (Army, Consolidate) action makes it stop at that position. The agent in fact often does this in games.

4.4.2 Game simulator

Using the real SC2 game in the MCTS search is impractical, as the search has to simulate on the order of thousands of games every time it needs to make a decision. The real SC2 game is simply too slow for this. Instead we use a game simulator that implements an abstraction of the game. This allows us to simulate thousands of games per second while keeping the parts important for decision making reasonably accurate compared to the real game.
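A minimal sketch of the kind of state such a simulator might track, following the unit and group representation described in the next paragraph (all names and types are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class SimUnit:
    unit_type: int
    health: float
    shields: float
    energy: float = 0.0  # only used by a few special units (Section 4.5.6)

@dataclass
class UnitGroup:
    # Units at roughly the same position with the same movement target are
    # merged into a group storing a single position and target for all of them.
    position: Tuple[float, float]
    movement_target: Optional[Tuple[float, float]]
    units: List[SimUnit] = field(default_factory=list)
```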

The simulator represents each unit as having a position, health, shields, energy and a movement target. Energy is, however, only used for a very small number of special units (Section 4.5.6). For improved performance, the simulator combines units into groups if they are at around the same position and have the same movement target. The group stores a single position and movement target for all its units. Using groups also has a
