
Degree project in Technology, first cycle, 15 credits, Stockholm, Sweden 2020

Adversarial learning in a multi-agent environment

MAX TRUEDSSON

ERIK WESTERGREN

KTH


Bachelor in Computer Science
Date: June 8, 2020

Supervisor: Jörg Conradt
Examiner: Pawel Herman


Abstract

Reinforcement learning is one of the main paradigms of machine learning. In this paradigm learners (agents) are not explicitly told what to do; instead they must explore their environment on their own and determine which actions to take in order to maximize some numerical signal. Reinforcement learning has several challenges, one of which is the training time. The agents can require millions of training sessions to become proficient at their task, which is troublesome since for most systems it is costly to produce the data.

This thesis investigates whether it is beneficial for agents learning to play tag, regarding both their proficiency and their training speed, to train against other learning agents rather than non-learning agents. This was done by training the agents in tag against different opponents and performing benchmarks throughout the training. The benchmarks measured how many times the predators managed to tag the prey (runner) over several games.


Sammanfattning

Reinforcement learning is a paradigm within machine learning. In this paradigm the agents (the learners) are not given instructions on what to do; instead they must independently explore and experiment in order to learn a behaviour that maximizes some numerical signal. Reinforcement learning does, however, have several challenges and problems, one of which is training time. It can take millions of training sessions for the agents to become proficient at their task, which is problematic since producing the data is expensive for many systems.

This thesis investigates whether it is beneficial for agents learning to play tag, in terms of their playing proficiency and training speed, to be trained against other learning agents instead of non-learning agents. To investigate this, agents have been trained to play tag against different types of opponents, and several benchmarks have been performed during the training. The benchmarks counted how many times the predators caught the prey over several rounds of the game.

(7)

Contents

1 Introduction
  1.1 Research question
  1.2 Purpose
  1.3 Hypothesis
  1.4 Outline
2 Background
  2.1 Autonomous agents
  2.2 Reinforcement learning
  2.3 The simple tag environment
  2.4 The learning algorithm Maddpg
  2.5 Related work
3 Methods
  3.1 Setting up the tag environment
  3.2 Training the agents
  3.3 The non-learning agents' behavior
  3.4 Benchmarking
4 Results
  4.1 Scenario 1. Agent as both predator and prey
    4.1.1 AvA benchmarks
    4.1.2 AvPred benchmarks
    4.1.3 AvPrey benchmarks
  4.2 Scenario 2. Agent as prey
    4.2.1 AvPred benchmarks
  4.3 Scenario 3. Agent as predator
    4.3.1 AvPrey benchmark
  4.4 Scenario 4. Hybrid training against predator agents
    4.4.1 AvA benchmarks
    4.4.2 AvPred benchmarks
    4.4.3 AvPrey benchmarks
  4.5 Scenario 5. Hybrid training against prey agent
    4.5.1 AvA benchmarks
    4.5.2 AvPred benchmarks
    4.5.3 AvPrey benchmarks
5 Discussion
  5.1 Results
  5.2 Method
  5.3 Future research
6 Conclusions
Bibliography


Chapter 1

Introduction

Machine learning is an active research field today and is used in a variety of different areas. For example, Spotify uses machine learning to better personalize the user experience for their customers [1], and some recent machine learning methods are being used to detect and filter out spam emails [2].

Reinforcement learning is one of the three main paradigms in machine learning, the other two being supervised and unsupervised learning. In this paradigm the learners (usually called agents) are not told what to do; instead they must interact with their environment on their own and discover which actions will yield the best result [3].

Reinforcement learning is used in a variety of different areas and problems. In robotics, for example, reinforcement learning is a well-established approach to teach robots new skills and behaviors, partly because it enables robots to learn tasks humans cannot physically demonstrate or directly program [4]. Reinforcement learning has also been used to create programs that excel at games. One example of such a program is AlphaGo, a program that has managed to beat several professional players at the game Go [3].

However, reinforcement learning has several problems and challenges, one of which is the training time. It can require millions of training sessions for an agent to become proficient at its task. The more complex a situation and task are, the longer it will take for the agent to figure out the best possible behavior. This is problematic since for many real-world systems it is costly to produce the data [5], and as such it is of interest to reduce the required training time. The aim of this thesis is therefore to explore how the training time in a multi-agent system is affected depending on what kind of opponent the agents train against.

This thesis investigates the effect on training time when agents compete against other learning agents, as well as when they compete against non-learning agents with an already defined behavior. More specifically, this thesis explores the training time of agents playing a game of tag.

1.1 Research question

Is it beneficial for autonomous agents' training speed and proficiency in a game of tag to be trained against other learning agents rather than against non-learning agents with a predetermined behavior?

1.2 Purpose

While it might seem irrelevant to investigate a game as simple as tag, the findings and conclusions from it can possibly also be applied to other, more complex reinforcement learning problems. As stated in the introduction, producing data for reinforcement learning in many real-world systems is costly [5]. As such, it is of great importance to minimize the required training time.

1.3 Hypothesis


1.4 Outline


Chapter 2

Background

This chapter begins by providing the reader with an overview of autonomous agents, reinforcement learning and deep reinforcement learning. After that, the environment and the learning algorithm are described. Lastly, some related work is presented.

2.1 Autonomous agents

Autonomous agents are systems that can independently interact with their environment through their own sensors in order to accomplish some given or self-generated task(s). In that sense both humans and animals can be seen as autonomous agents [6]; however, this thesis focuses solely on artificial agents. In general, an autonomous agent acts in the following manner: the agent receives some input from the environment via its sensors and then decides which action(s) to perform. The agent carries out the action(s), receives new input from the environment, performs some new action(s), and so on [6].

2.2 Reinforcement learning

Reinforcement learning is a machine learning method that is essentially about learning what to do and how to map situations to actions. The end goal is to maximize a numerical reward. Unlike other machine learning methods, reinforcement learning does not tell the learner what actions to take; instead the learner must discover on its own which actions will yield the most reward. In more complex and challenging scenarios, the learner's actions might affect not only the immediate reward but also all subsequent rewards [3].

A reinforcement learning system usually consists of five elements: an agent, an environment, a policy, a reward function and a value function. In some cases there is also a model of the environment.

The agent is the learner and the environment is the world the agent is in and can interact with.

The policy defines the agent's current behavior and decision making. It can be seen as a mapping from the perceived state of the environment to the actions to take in that state.

The reward function defines the goal of the task. On each time step, after the agent has performed some action, the environment sends the agent a single number called the reward. The reward indicates the desirability of that state for the agent and is the primary basis for altering the agent's policy. If, for example, an action yields a low reward, then the policy may be changed to select some other action in a similar situation in the future.

The value function defines what is good for the agent in the long run. The value of a state is the total amount of reward an agent can expect to accumulate in the future from that state. While rewards are the primary basis for altering the policy, the values can be seen as a secondary basis.

Lastly, the model of the environment mimics the behavior of the environment, and its purpose is to aid the agent in planning which actions to take in the future. Given a state and an action, the model might predict the resulting state and reward. If the reinforcement learning system does not use a model, then the system is called a trial-and-error learner [3].
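To make the interplay between these elements concrete, the following is a minimal, hypothetical sketch of the agent-environment interaction loop described above. The Gym-style `reset`/`step` interface and the `policy` callable are illustrative assumptions, not the code used in the thesis.

```python
# Minimal sketch of the agent-environment loop (illustrative assumption,
# not the thesis implementation). `env` is assumed to follow a Gym-style
# interface and `policy` maps an observed state to an action.
def run_episode(env, policy, max_steps=50):
    """Run one episode and return the total reward collected by the agent."""
    state = env.reset()                             # initial perception of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                      # the policy maps state -> action
        state, reward, done, _ = env.step(action)   # environment returns new state and reward
        total_reward += reward                      # the reward is the primary learning signal
        if done:
            break
    return total_reward
```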

2.3 The simple tag environment

The environment used in this thesis to address the research question is called simple tag and is one of the multi-agent environments created by Ryan Lowe et al. [7].

Figure 2.1: The tag environment.


2.4 The learning algorithm Maddpg

The learning algorithm Maddpg was used in this thesis to train the agents and, just like the simple tag environment, it was created by Ryan Lowe et al. (2017). The name Maddpg stands for Multi-Agent Deep Deterministic Policy Gradient. Furthermore, this learning algorithm is configured to be used in conjunction with the simple tag environment [7].

The Maddpg learning algorithm utilizes deep reinforcement learning, which is a more advanced form of reinforcement learning that uses methods considered to be deep learning. The signifying feature of deep-learning methods is that they have multiple levels of representation, in other words multiple layers. At each layer the data is transformed and sent to the next layer, which is more abstract than its predecessor. When there are enough of these layers it becomes possible to learn more complex behaviors [8].
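As a rough illustration of what "multiple layers of representation" means in practice, the sketch below builds a small multi-layer policy network of the kind deep reinforcement learning methods are built on. The layer sizes, activations and the use of tf.keras are assumptions chosen for illustration and do not reflect the exact networks used by Maddpg in the thesis.

```python
# Illustrative multi-layer network (an assumption, not the thesis code):
# each Dense layer produces a more abstract representation of its input.
import tensorflow as tf

def build_policy_network(obs_dim, action_dim):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(obs_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(action_dim, activation="tanh"),  # action output
    ])
```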

2.5 Related work

It has been shown that reinforcement learning is a working method for teaching agents in a competitive multi-agent environment, as the agents have the capability to execute advanced strategies without human input. As an example, Bowen Baker et al. showed that their agents, while playing hide-and-seek, started using tools from the environment to aid them in completing their tasks. Furthermore, they also showed that a multitude of strategies can emerge from a competitive multi-agent environment. Lastly, they managed to show that agents can quickly learn to exploit more advanced environments in unintended ways, which is important to keep in mind when training agents and analyzing the results [9].


Chapter 3

Methods

This chapter begins by describing how the tag environment was set up. It then presents how the learning agents have been trained in different scenarios. After that, the behaviour of the non-learning agents used in some of the scenarios is described. Finally, the benchmarks performed throughout the training in the different training scenarios are presented.

3.1 Setting up the tag environment

Both the simple-tag environment and the learning algorithm Maddpg are written in Python and require libraries such as TensorFlow, Gym and NumPy. When setting up our experiments we used Python 3.7.2, TensorFlow 1.13.1, Gym 0.10.5 and NumPy 1.18.3. Furthermore, in order to automate the experiments, batch scripts were written and used. Finally, all of the experiments were run on our own personal Windows computers.

The simple-tag environment was chosen because it supported both learning and non-learning agents, which was a requirement for us to be able to explore the research question. The learning algorithm Maddpg was chosen since it was created and configured for the simple-tag environment.
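As a rough sketch of this setup, the snippet below creates the simple-tag environment the way the multiagent-particle-envs repository exposes it. The `make_env` helper and its behaviour are assumptions about that repository's interface, not the thesis scripts.

```python
# Rough sketch of creating the simple-tag environment (assumed interface from
# the multiagent-particle-envs repository, not the thesis scripts).
from make_env import make_env  # assumes the repository is on the Python path

env = make_env("simple_tag")    # thesis configuration: 3 predators, 1 prey, 2 obstacles
observations = env.reset()      # one observation vector per agent
print("number of agents:", len(observations))
```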

3.2 Training the agents

To determine whether it is better for the agents' training speed and proficiency at the game to be trained against other learning agents or against non-learning agents, the following scenarios have been investigated.


1. Standard training: Learning agent as both predator and prey.

2. Standard training: Learning agent as prey and non-learning agent as predator.

3. Standard training: Learning agent as predator and non-learning agent as prey.

4. Hybrid training: Learning agent as prey first trains against non-learning predator agents and then against learning predator agents.

5. Hybrid training: Learning agent as predator first trains against non-learning prey agents and then against learning prey agents.

In all of the above scenarios the environment was configured to have three predator agents, one prey agent and two obstacles, and all learning agents were trained using the Maddpg algorithm. Furthermore, in every scenario the learning agents were trained for a total of 250 000 episodes, meaning that in each training scenario the agents played the game 250 000 times. This number was chosen to make it possible for the learning agents to learn more complex behaviors, while still making it feasible, given the time frame, to run the scenarios multiple times for more reliable results.

The standard training scenarios were all performed ten times, and in all of them the environment was configured to have an episode length of 50. The episode length determines how many actions the agents will perform during the game, and as such it also determines the length of a game. Ideally it would have been better to run the scenarios more times in order to get more reliable results, but due to the study's time frame this was not possible. The episode length of 50 was chosen to give the agents plenty of time to play tag, in order to make sure that the learning agents learned more complex behaviors and not only what to do at the beginning of the game.
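The outer structure of a training run implied by this configuration can be sketched as follows. The trainer objects and their `act`/`observe` methods are hypothetical placeholders for the Maddpg learners and the scripted opponents, not the actual thesis code.

```python
# Hypothetical sketch of the outer training loop implied by the configuration
# above: 250 000 episodes, each 50 environment steps long. The trainer objects
# and their method names are illustrative assumptions, not the Maddpg code.
NUM_EPISODES = 250_000
EPISODE_LENGTH = 50

def train(env, trainers):
    for episode in range(NUM_EPISODES):
        obs = env.reset()
        for _ in range(EPISODE_LENGTH):
            # one action per agent (learning or scripted)
            actions = [trainer.act(o) for trainer, o in zip(trainers, obs)]
            new_obs, rewards, done, _ = env.step(actions)
            # learning agents store the transition and update their networks
            for i, trainer in enumerate(trainers):
                trainer.observe(obs[i], actions[i], rewards[i], new_obs[i])
            obs = new_obs
```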


In order to evaluate, measure and compare the agents' proficiency as well as their training speed, several benchmarks were performed throughout the training in all training scenarios.

3.3 The non-learning agents' behavior

The behavior of the non-learning agent as predator was naive: it simply tracked the prey and walked straight towards it. As such, it did not take into consideration where the other predators or the obstacles were.

Similar to the predator, the behavior of the non-learning prey was also naive, although a bit more complex. The prey tried to move away from all predators, but it was more strongly affected by nearby predators. This meant that the prey prioritized moving away from nearby predators over predators that were far away. The non-learning prey agent was also unable to move outside the "starting area".

This very naive behavior was chosen because it did not seem relevant to give the non-learning agents a more complex behaviour. The non-learning agents' purpose in this thesis is solely to aid the training and improve the learning agents' learning speed, not to be the most proficient agents themselves.
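A hedged sketch of these scripted behaviors is given below. The vector arithmetic, the inverse-square weighting and the arena-boundary check are illustrative assumptions about how such heuristics could be written, not the exact thesis implementation.

```python
# Hedged sketch of the naive scripted behaviors described above (assumptions,
# not the exact thesis code). Positions are NumPy float arrays.
import numpy as np

def predator_action(predator_pos, prey_pos):
    """Move straight towards the prey, ignoring other predators and obstacles."""
    direction = prey_pos - predator_pos
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else np.zeros_like(direction)

def prey_action(prey_pos, predator_positions, arena_radius=1.0):
    """Move away from all predators, weighting nearby predators more heavily."""
    flee = np.zeros_like(prey_pos)
    for p in predator_positions:
        offset = prey_pos - p
        dist = np.linalg.norm(offset) + 1e-6
        flee += offset / dist**2              # closer predators contribute more
    # crude stand-in for being unable to move past the "starting area"
    if np.linalg.norm(prey_pos) > arena_radius:
        flee = -prey_pos                      # steer back towards the centre
    norm = np.linalg.norm(flee)
    return flee / norm if norm > 0 else flee
```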

3.4 Benchmarking

In order to measure and compare how proficient the learning agents are at playing the game, several benchmarks were performed throughout the training in each training scenario. The benchmark tests were performed at intervals of 5 000 training episodes. This high frequency was chosen to track the performance of the agents more accurately and thereby gain more insight into the speed of the learning.

All performed benchmarks were done by letting the agents play the game 2000 times, during which the learning agents would not improve or learn. During all of these games the number of collisions (tags) the predators had with the prey was counted. The number of collisions with the prey is the metric used to compare and evaluate the different training scenarios.


The games during the benchmarks were also configured to have the same episode length as during training.
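A hypothetical sketch of this benchmark procedure is shown below: learning is frozen, 2 000 games are played, and the number of predator-prey collisions is summed. The `count_tags` callback, which extracts the number of collisions from the environment's step information, is an assumed helper for illustration.

```python
# Hypothetical sketch of the benchmarking procedure described above
# (run every 5 000 training episodes): play 2 000 games with learning
# frozen and count predator-prey collisions (tags).
BENCHMARK_GAMES = 2_000

def run_benchmark(env, policies, count_tags, episode_length=50):
    """`count_tags(info)` is an assumed callback returning the collisions for one step."""
    collisions = 0
    for _ in range(BENCHMARK_GAMES):
        obs = env.reset()
        for _ in range(episode_length):
            actions = [policy(o) for policy, o in zip(policies, obs)]
            obs, rewards, done, info = env.step(actions)
            collisions += count_tags(info)
    return collisions
```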

In total, three types of benchmarks have been performed; however, not all three types are applicable to every training scenario. In the first benchmark, AvA, which stands for agent vs agent, both the predators and the prey are learning agents. As such, this benchmark could not be used for training scenarios 2 and 3, since at no point during these scenarios are both sides played by learning agents.

In the second benchmark, AvPred, which stands for agent vs predator, the prey is a learning agent and the predators are non-learning agents. This benchmark has been used in all training scenarios with the exception of scenario 3, since in that scenario the prey is a non-learning agent and it would not be particularly interesting to evaluate the performance of non-learning agents playing against each other.

Lastly, in the third benchmark, AvPrey, which stands for agent vs prey, the predators are learning agents and the prey is a non-learning agent. Just like AvPred, this benchmark has been used for all scenarios except one, in this case scenario 2, since in that scenario the predators are non-learning agents.


Chapter 4

Results

This chapter first presents the results from the benchmarks done during the training of scenario one, where both predator and prey are played by agents. Following that are the results from the benchmarks done during the training of scenario two, where the prey is an agent and the predators are non-learning agents. Then the results of the benchmarks from scenario three are presented, where the predators are agents and the prey is a non-learning agent. Lastly the results from the benchmarks done during the hybrid training scenarios are presented.

As a reminder, in all of these training scenarios the agents have trained for a total of 250 000 episodes, which means they have played the game 250 000 times. Furthermore, the graphs in this chapter use a few abbreviations. AvA stands for agent versus agent and means that both the predators and the prey are learning agents. AvPrey means that the predators are learning agents and the prey is a non-learning agent. Lastly, AvPred means that the prey is a learning agent and the predators are non-learning agents.


4.1 Scenario 1. Agent as both predator and prey

In this scenario both the predator agents and the prey agent have been trained by playing against each other. Furthermore, in this training scenario all three types of benchmarks have been performed, and the following graphs showcase the average results from all of the performed benchmarks.

4.1.1 AvA benchmarks

Figure 4.1: Average results from benchmarks with Agent vs Agent.


4.1.2 AvPred benchmarks

Figure 4.2: Average results from benchmarks with Agent vs Predator.


4.1.3 AvPrey benchmarks

Figure 4.3: Average results from benchmarks with Agent vs Prey.


4.2 Scenario 2. Agent as prey

In this scenario the prey agent has been trained by playing against three non-learning predators. Moreover, in this scenario only one type of benchmark was performed, AvPred. The following graph presents the results from the benchmarks done throughout the training.

4.2.1 AvPred benchmarks

Figure 4.4: Results from benchmarks with Agent vs Predator.


4.3 Scenario 3. Agent as predator

In this scenario the predator agents have been trained by playing against a non-learning prey. Furthermore, in this scenario only one type of benchmark was performed, namely AvPrey. The following graph presents the results from the benchmarks done.

4.3.1 AvPrey benchmark

Figure 4.5: Average results from benchmarks with Agent vs Prey.


4.4 Scenario 4. Hybrid training against predator agents

In this scenario a prey agent is first trained by playing against non-learning predators for 125 000 episodes, after which the prey agent trains against learning predator agents for another 125 000 episodes. Moreover, in this scenario all three types of benchmarks have been performed throughout the training, and the following graphs showcase the results from the benchmarks.

It is important to acknowledge that the data from 5 000 to 125 000 episodes will be nonsensical for two out of the three benchmarks, as untrained learning agents are there benchmarked against learning or non-learning agents. The data before 130 000 episodes should therefore only be considered for the AvPred benchmark. After 125 000 episodes all data is significant in all benchmarks, as all agents are learning after that point.

4.4.1 AvA benchmarks


Figure 4.7: Average results from benchmarks with Agent vs Agent but only the AvA part.


4.4.2 AvPred benchmarks


4.4.3 AvPrey benchmarks


Figure 4.11: Average results from benchmarks with Agent vs Prey after 125 000 episodes. The number of collisions starts off low but quickly increases to about 7 000 collisions, after which the increase is slower. Eventually the number of collisions seems to plateau around 8 000. This shows that the predator agents quickly and steadily become better at catching the non-learning prey.

4.5 Scenario 5. Hybrid training against prey agent


4.5.1 AvA benchmarks


4.5.2 AvPred benchmarks


4.5.3 AvPrey benchmarks

Figure 4.16: Average results from benchmarks with Agent vs Prey. All the data is significant.


Chapter 5

Discussion

This chapter begins by discussing the results from the benchmarks done throughout all the different training scenarios. Following that comes a discussion regarding the method used to address the research question as well as its limitations. Finally, some potential future research is presented.

5.1 Results

The results from scenario one show that agents training against other learning agents become more proficient at the game than agents trained against non-learning agents, since both the predator and prey learning agents from scenario one perform better when benchmarked against the non-learning agents than the agents that were trained against the non-learning agents do. Furthermore, the results also show that the behaviors the agents have learned from facing each other are useful against similar but different opponents. Looking at AvPred (figure 4.2), it is especially interesting to see that the prey agent only needs about 20 000 training episodes to outperform the non-learning predators, in contrast to AvPrey (figure 4.3), which shows a much slower learning pace. A possible explanation for this large difference in learning speed is that the predator agents must learn to cooperate to catch the prey more effectively, which is a more difficult task than learning to run away from the predators. Another explanation is the reward system used: whenever a predator manages to catch the prey, all predators are rewarded, not only the predator that caught the prey. This could affect the learning, since another predator might perform a "bad" action at the same moment the prey is caught, which would encourage that predator to repeat the "bad" action. However, since all predator agents are also always rewarded based on their distance to the prey, it is unlikely that this would have such a large impact on the learning.

Comparing the AvPred benchmarks in scenarios 1 and 2 (figures 4.2 and 4.4), it is clear that the prey agent in scenario 1 has become much more proficient at the game than the prey agent in scenario 2, since the number of collisions in figure 4.2 is far lower and much more consistent across the benchmarks. Moreover, the prey agent in scenario 1 also learned much faster than the agent in scenario 2. From this it is clear that scenario 1 is better suited for training the prey agent than scenario 2. This outcome partly contradicts our hypothesis, which predicted that the prey training against non-learning agents would perform better in the early stages of training and eventually plateau. The actual result is far from this; furthermore, the AvPred benchmark graph is very volatile compared to the other graphs. From the data collected it is hard to deduce why the prey shows this unexpected and volatile behaviour, but we have two possible explanations. The first is that there is an unknown software error in the implementation of the non-learning predator agents used during training. The other potential cause is that the non-learning predators do not give the learning prey agent any room to explore new behaviours because of their relentless pursuit. If the second cause is the actual one, this would clearly show that scenario 2 is a bad method for training the prey agent, but more tests would have to be done in order to draw that conclusion.

When comparing AvPrey in scenarios 1 and 3 (figures 4.3 and 4.5), it is not as clear as in the AvPred comparison which scenario has led to the more proficient agents. It is surprising that the agents in scenario 3 start off with fewer catches than the agents in scenario 1. Furthermore, it is also surprising that the agents in scenario 3 manage to catch the prey more often than the agents in scenario 1 after around 65 000 episodes. These outcomes contradict our hypothesis, which predicted that the agents trained against other learning agents would become more proficient but perform worse in the early stages of training. However, it is reasonable that the predators trained against the non-learning prey perform worse in the beginning, since the predator agents do not yet have any idea of what to do and as such will not even chase the non-learning prey.

Since the predators in scenario 3 train against the same prey that is used in the AvPrey benchmark, it is reasonable that after extensive training these agents will have developed a behavior that effectively counters the specific behavior of the non-learning prey. Meanwhile, the predator agents in scenario 1 train against an agent whose behavior changes, and as such the predators in scenario 1 cannot learn a specific behavior that counters the non-learning prey. It is plausible that the predators trained in scenario 3 would perform worse against another type of prey agent with a different behavior; however, since these predators have not been tested against any other opponents, this is only speculation.

In general, all benchmark graphs from the hybrid scenarios are, after the switch of opponents, similar to the corresponding benchmark graphs in scenario 1. This is interesting because it shows that it is not beneficial, regarding proficiency, for agents to first train against non-learning agents and then against other learning agents. Furthermore, it is also intriguing that in most of the hybrid graphs the learning speed is reduced after the switch of opponent compared to the corresponding graphs in scenario 1. We believe this is because the agents learn solely to play against non-complex behaviors, and when they face more complex opponents they need to unlearn some of their already learned behavior.

Another observation from scenario 4 (AvPred hybrid) is that the AvA benchmark (figure 4.6) is close to 6 000 catches at the end of the training, compared to both scenario 1 AvA (figure 4.1) and scenario 5 (AvPrey hybrid, figure 4.12), which show different results. Scenario 1 fluctuates around 7 000 catches when it plateaus, and scenario 5 lies between 5 000 and 6 000. This means that the prey agents that first trained against non-learning agents before training against other learning agents outperform the prey agents that trained against other learning agents for the same amount of time.


After the switch to AvA training, the AvPred benchmark for scenario 4 (figure 4.8) shows a similar but slightly worse proficiency than the scenario 1 benchmark for AvPred (figure 4.2), and the descent is slower. This means that the training speed is lower even though the opponent for the agent does not change; it is also worth noting that the result was significantly worse in the benchmarks before the switch, when training against non-learning agents. This indicates that the prey is slightly worse at fleeing from the non-learning agents than in AvA. The AvPrey benchmark in scenario 4 (figure 4.10) reaches a value around 1 000 catches less than the AvPrey benchmark in scenario 1 (figure 4.3), which is significantly worse. It is also significantly worse in the beginning, but quickly reaches scenario 1's starting point. It did not show any sign of continuing to improve from this performance, which is worse than the AvA result after only 55 000 episodes. This shows that the predators that trained against an already trained prey are significantly worse at catching the prey than those in scenario 1.

The AvPrey benchmark for scenario 5 (figure 4.16) shows slightly fewer catches than scenario 1 (figure 4.3), which implies that the predators that have trained against an already intelligent agent perform worse against non-learning agents than the ones in scenario 1. This shows that the predators that first trained against a non-learning prey performed worse against the non-learning prey than the agents in scenario 1 did.

The observations above indicate that the agents that first trained against non-learning agents are not performing better; instead it is the agents that started their training against an already intelligent agent that performed worse, which improves the relative performance of the hybrid-trained agents compared to scenario 1. When the agents were tested directly they also performed at worse or similar levels. This makes us believe that the hybrid training's apparently better performance does not imply that those agents are better, but instead that their opponents are worse.


5.2 Method

The benchmarks that have been performed are something that could be improved. For example, the benchmarks only measured the number of collisions during several games of tag. While this metric is reasonable for a game based around collisions, it would probably have been good to run benchmarks measuring other metrics as well. As it stands, the data collected in our benchmarks gives a very limited insight into the agents' performance and learning. As such, it is hard to determine based on the benchmark data whether the agents are still learning but only finding behaviors with similar results, or whether an agent is stuck in one behavior. A potential metric that could have been interesting to track is the time until the first collision. Performing more types of benchmarks and keeping track of more statistics regarding the game would give more insight into the agents' proficiency at the game.

The choice of reward system for the agents was decided by testing both reward based on impact and reward based on distance. We came to the conclusion that the predator agents needed feedback when they performed an action that improved their odds of catching the prey, as otherwise the predators did not learn any behaviours that helped them catch the prey within the time frame of the training sessions. As such, we decided to use the distance to the prey as the reward for the predators. The prey did not need this incentive and is instead punished by getting caught; this system is more relevant for the prey, as it does not matter how close the predators are as long as the prey is not caught. A rough sketch of this reward structure is given below.

The most significant limitation of the method is probably that the agents trained in the different scenarios have not been tested against each other. Letting differently trained agents play against each other would have given more insight into which agents had become more proficient. Unfortunately, the environment used did not support this, and given our time frame it was not plausible for us to modify it or change environment, as it was already difficult to find an environment that provided the amount of control necessary to add our non-learning agents.
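To make the reward structure discussed above concrete, the sketch below illustrates a distance-based predator reward with a shared catch bonus and a prey that is only punished when caught. The constants and the exact shaping are illustrative assumptions, not the values used in the thesis or in the simple-tag environment.

```python
# Hedged sketch of the reward structure described above (assumed constants
# and shaping, not the thesis implementation).
import numpy as np

CATCH_BONUS = 10.0

def predator_reward(predator_pos, prey_pos, caught):
    distance = np.linalg.norm(prey_pos - predator_pos)
    reward = -distance                  # being closer to the prey gives a larger reward
    if caught:
        reward += CATCH_BONUS           # every predator shares the catch bonus
    return reward

def prey_reward(caught):
    return -CATCH_BONUS if caught else 0.0  # the prey is only punished for being caught
```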


It would also have been interesting to investigate whether the results would differ with other environments or learning algorithms. Doing this could provide additional insight into what type of opponent is beneficial for agents to train against. However, due to the time frame it was not possible for us to investigate other environments or learning algorithms. It would have been possible to investigate some other configuration of the environment, but we instead chose to repeat all of our training scenarios several times in order to get more reliable and accurate results.

5.3 Future research


Chapter 6

Conclusions

This thesis has investigated whether it is beneficial for autonomous agents' training speed and proficiency in a game of tag to be trained against other learning agents rather than against non-learning agents. Five different training scenarios with learning and non-learning agents have been investigated. From the benchmarks done throughout the training in all of the performed training scenarios, we have come to the conclusion that it is beneficial to train against other learning agents rather than against our naive implementation of non-learning agents. The non-learning agent training at best did not change the end result and at worst hampered both the learning speed and the performance of the agents. Scenario 1, which was agent vs agent, performed the best in training speed and performed well in the three different benchmarks, whereas scenarios 2 and 3 had poor results and either never reached consistency or did so more slowly than AvA. Scenarios 4 and 5 had better performance in some benchmarks, but this was shown to be the result of their opponents being worse rather than of the agents actually being better at the game. Taking all scenarios into consideration, we can conclude that the non-learning agents hampered both the learning speed and the proficiency.


Bibliography

[1] Spotify Engineering. For Your Ears Only: Personalizing Spotify Home with Machine Learning. URL: https://labs.spotify.com/2020/01/16/for-your-ears-only-personalizing-spotify-home-with-machine-learning/ (accessed: 19.04.2020).

[2] Emmanuel Gbenga Dada et al. "Machine learning for email spam filtering: review, approaches and open research problems". In: Heliyon 5.6 (2019), e01802. ISSN: 2405-8440. DOI: 10.1016/j.heliyon.2019.e01802. URL: http://www.sciencedirect.com/science/article/pii/S2405844018353404.

[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html.

[4] Petar Kormushev, Sylvain Calinon, and Darwin Caldwell. "Reinforcement Learning in Robotics: Applications and Real-World Challenges". In: Robotics 2.3 (July 2013), pp. 122–148. ISSN: 2218-6581. DOI: 10.3390/robotics2030122. URL: http://dx.doi.org/10.3390/robotics2030122.

[5] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of Real-World Reinforcement Learning. 2019. arXiv: 1904.12901 [cs.LG].

[6] Paul Davidsson. "Autonomous Agents and the Concept of Concepts". PhD thesis. 1996. ISBN: 91-628-2035-4.

[7] Ryan Lowe et al. "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments". In: Neural Information Processing Systems (NIPS) (2017).

[8] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. "Deep learning". In: Nature 521.7553 (2015), pp. 436–444. ISSN: 1476-4687. DOI: 10.1038/nature14539. URL: https://doi.org/10.1038/nature14539.

[9] Bowen Baker et al. "Emergent Tool Use From Multi-Agent Autocurricula". 2019. arXiv: 1909.07528 [cs.LG].

