DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

AI in Simulated 3D Environments.

Application to Cyber-Physical systems.

CLEMENT SEVY

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


AI in Simulated 3D Environments

Application to Cyber-Physical systems

CLEMENT SEVY

Master in Computer Science
Date: April 4, 2019

Supervisors: Stefano Markidis (KTH) & Pedro Faria (AKKA)
Examiner: Erwin Laure

Swedish title: AI i simulerade 3D-miljöer. Tillämpningar på cyber-fysiska system.

School of Electrical Engineering and Computer Science


Abstract

Over the past several years, UAVs (unmanned aerial vehicles), and autonomous systems in general, have become hot topics, both in academia and in industry. Indeed, the opportunities for the application of such technologies are vast, with the military and the infrastructure industry being the two most prominent cases.

Up until recently, autonomous systems showed quite little flexibility, their actions originating in well-defined programs that executed and replicated a given task, without much ability to adapt to new conditions in the surrounding environment. However, recent advances in AI and Machine Learning have made it possible to train computer algorithms with unprecedented effectiveness, which opened the door to cyber-physical systems that can show intelligent behaviour and decision-making capabilities.

Using simulated environments, one is now able to train such systems to exhibit decent performance in tasks whose complexity stumped state-of-the-art algorithms less than a decade ago. An approach that has proved extremely successful is Reinforcement Learning (RL).

In this thesis, we used it (along with other AI techniques) to train a virtual flying drone to perform two different tasks. The first one consists in having the drone fly towards a predefined object, no matter where it is placed. The second one is to have it fly in a manner that allows for the exploration of an unknown environment. We were able to combine both tasks: to find and head towards a specific target within an unknown environment, using only the drone's position relative to its take-off point and its camera, therefore without any environment-specific information. After a process of trial and error, we developed a framework for exploration on a plane, excluding movement on the yaw axis. In order to perform such tasks with a deep Q-network model, we had to retrieve a depth image, the relative position of the drone and a segmented image.

The results presented herein demonstrate that a drone can be trained to be reasonably performant in the aforementioned tasks. We achieved up to 81% accuracy in an unknown test environment for the first task, while achieving 98% accuracy in the training environment on the same task. This holds the promise of doing the same with other cyber-physical systems and for more complex tasks.


Sammanfattning

In recent years, unmanned aerial vehicles and autonomous systems in general have become hot topics, both in academia and in industry. Indeed, the opportunities for applying such technology are vast, with the military and the infrastructure industry as the two most prominent cases.

Until recently, autonomous systems offered little flexibility, since their actions derived from well-defined programs that carried out given, well-specified tasks, without much ability to adapt to new conditions in the surroundings. Recent advances in AI and Machine Learning have, however, made it possible to train computer algorithms with unprecedented effectiveness, which opened the door to cyber-physical systems that can exhibit intelligent behaviour and decision-making capabilities.

With the help of simulated environments, one can now train such systems to show acceptable performance on tasks whose complexity baffled state-of-the-art algorithms less than a decade ago.

One approach that has proved very successful is Reinforcement Learning (RL). In this degree project we used this method (together with other AI techniques) to "teach" a virtual flying drone to perform two different tasks. The first task consisted of making the drone fly towards a predefined object, regardless of where the object is placed. The second task involved making it fly in a way that enables the exploration of an unknown environment. We were able to combine both tasks, finding and steering towards a specific target within an unknown environment, by using only the drone's position relative to its starting point and its camera, therefore without any environment-specific information. After a trial process with a number of difficulties, we developed a framework for exploration in a plane, excluding movement on the yaw axis. In order to perform such tasks with a deep Q-network model, we needed to retrieve a depth image, the drone's relative position and a segmented image.

The results presented in this report show that a drone can be developed to become more "intelligent", achieving reasonable performance on the tasks mentioned above. We achieved up to 81% accuracy in an unknown test environment for one task, while achieving 98% accuracy in the training environment on the same task. This gives hope that it will be possible in the future to achieve similar results with other cyber-physical systems and for more complex tasks.


Acknowledgement

First of all, I would like to thank both of my supervisors: Stefano Markidis, who gave me guidance all along the way, and Pedro Faria, who gave me the opportunity to carry out my thesis at AKKA Research Benelux while working on cutting-edge and immersive technologies. My thanks also go to my examiner Erwin Laure and my opponent Jakob Tideström, who gave me great insights and feedback on my work.

I would also like to thank all the members of AKKA Research, particularly Dimitris, who gave me the opportunity to show the best of my work, and Kevin, who gave me deep insight into reinforcement learning.

Finally, I express my gratitude to my friends and my family, who always supported me during those intense months, and of course to Claire.


Contents

1 Introduction
  1.1 Research Question
  1.2 Contributions
  1.3 Ethics and Sustainability

2 Background
  2.1 Reinforcement Learning
    2.1.1 Markov Decision Process
    2.1.2 Bellman Equation
    2.1.3 Model-based Approach
    2.1.4 Model-Free Approach
  2.2 Deep Learning
    2.2.1 Neural Networks
    2.2.2 Convolutional Neural Networks
  2.3 Deep Reinforcement Learning
    2.3.1 Different Categories
    2.3.2 Deep Q Network

3 Related Work
  3.1 Autonomous Vehicle
  3.2 Reinforcement Learning for Autonomous Vehicle
  3.3 Computer Vision for Depth Images

4 Methods
  4.1 Simulated Environment
  4.2 Learning Process
    4.2.1 Go To Test
    4.2.2 Exploration Test
    4.2.3 Combination Test
  4.3 Project Overview

5 Results
  5.1 Go To Test
  5.2 Exploration Test
  5.3 Combination Test
  5.4 Some Obstacles

6 Discussion and Conclusion
  6.1 Challenges
    6.1.1 Limitations
  6.2 Conclusion
  6.3 Future Work

Bibliography


Chapter 1 Introduction

Nowadays, carefully programmed systems execute repetitive tasks in production lines and even in research environments. In recent years, robots that can be trained by repeated demonstration have made their way into certain markets. Since this is an area that still needs a lot of research and can benefit from recent advances in AI algorithms, namely from the field of Deep Learning, we propose to delve into this topic.

AI technology already has a huge impact on our society, economy, and work. It provides services and affects a wide range of sectors.

In healthcare, AI helps diagnostics and assists decisions. Microsoft's Hanover project consists of predicting effective cancer drug treatments for patients, while the startup Kheiron Medical diagnoses breast cancer in mammograms [31]. In marketing and advertisement, AI uses digital footprints to propose personalized experiences and helps companies understand their customer segmentation. In transportation, AI is widely used in autonomous driving with Waymo (powered by Google); GM and Tesla are key players in this growing market segment [9]. In financial institutions, AI participates by improving risk models and preventing financial crime and fraud [39]. In customer services, AI has allowed the development of chatbots and personal assistants such as Apple's Siri, Google Home or Amazon's Alexa [3].

AI also has applications in the military, agriculture, energy management, retail, logistics and robotics. This thesis will focus on the last topic.

Training AI algorithms for cyber-physical systems is a complex problem that is both expensive and time-consuming. Luckily, nowadays we can benefit from sophisticated software platforms that simulate 3D environments with diverse levels of realism. Those simulated environments are, and will increasingly become, a playground for the industry in the next years because they allow for testing and prototyping with reduced effort. This holds the promise of developing, testing and potentially deploying such algorithms in real scenarios.

Our work is focused on a specific kind of cyber-physical system: drone systems. Unmanned aerial vehicles (UAVs) can be used in different scenarios such as security monitoring, search and rescue missions or mapping and data collection, and all of these can be performed in several environments that are, most of the time, unknown. Thus, they are an interesting choice in a growing market.

While it is easy for humans to face unknown situations by relying on previous experiences, it is more difficult for classic AI algorithms to do the same. Therefore, our goal is to simulate human learning by teaching a drone in a simulated environment to perform three tasks. The first one, Go To, consists of going to a predefined target. The second one, Explore, consists of exploring an unknown environment, and the last one, Combination, is a combination of both, where the drone explores an unknown environment and goes toward the target once found.

This thesis is organized as follows: Chapter 2 explains the technical background that the reader needs in order to appreciate the work. Chapter 3 introduces the papers and works previously carried out in relation to this thesis. Chapter 4 describes our experiments and the methods used in order to achieve the results presented in Chapter 5. Chapter 6 adds a critical point of view and proposes several possible extensions of the thesis.

1.1 Research Question

The research question and general problem that this thesis tries to tackle is the following: "How to train AI algorithms for drone systems in simulated 3D environments?". We chose to work with UAV (unmanned aerial vehicle) systems because of the countless real-world scenarios in which they can be useful.

Taking advantage of having a simulated environment, we will examine an approach that, despite requiring some trial and error to succeed, can achieve great results. This approach is called Reinforcement Learning, a technique that usually works in tandem with neural networks.

Even though the tasks proposed in this thesis can be approached programmatically, using classic algorithms such as path-planning methods for the Exploration task, such solutions would lack flexibility and would not have much ability to adapt to new conditions. To enable UAV systems to show intelligent behaviour and decision-making capabilities, we chose reinforcement learning as the AI method. This decision is backed by the fact that reinforcement learning tends to simulate human learning through experience, and that humans are able to face unknown situations because they can rely on their previous experiences.

The challenges addressed are numerous, including the choice of the simulated 3D environment, the choice of the AI algorithm and the choice of the tasks that we would like to perform. The accuracy of the results is a real challenge here, since in the real world we cannot afford to have a cyber-physical system crashing into a wall as we can in a simulated environment. The long training times caused by the use of reinforcement learning must also be taken into account, but the results should make them worthwhile.

1.2 Contributions

This thesis contributes to the academic fields of AI, reinforcement learning and autonomous navigation by providing yet another attempt at navigating in an unknown environment, with basic sensors, in order to find a predefined target.

• In reinforcement learning, it supports existing methods, namely deep reinforcement learning and dueling double deep Q-networks, by providing evidence that such algorithms perform well in other complex environments.

• In autonomous navigation, it is a continuation of previous works, exploring a specific, and different, methodology in simple scenarios. Reinforcement learning has not been a popular choice in the majority of papers related to autonomous driving, even though it seems a natural choice to mimic human driving by imitating how humans learn. By demonstrating that autonomous navigation can be performed using reinforcement learning, we add a modest contribution to both fields.

It is by exploring the intersection of both fields that this thesis adds value to the academic field of AI.

Finally, this work supports the numerous papers in those domains and provides a specific solution with a method that can be extended. One could implement a different task with the proposed methodology, or apply the existing tasks to another cyber-physical system.

1.3 Ethics and Sustainability

This work is directly related to artificial intelligence and autonomous driving; therefore the ethical point of view needs to be considered. The AI field has raised several philosophical questions, and the one of interest here relates to the danger of intelligent systems and their ethical behavior. Some perceive AI as a threat to mankind that we would not be able to control [40][18]. This lack of control is caused by a lack of transparency of current algorithms. This thesis, used as research material, is an example of this lack of transparency: we created the rules to build the system but we are unable to fully understand the decision-making process. In autonomous driving, the concern is about liability and the well-known trolley problem: choosing whom to sacrifice to avoid killing other people. In March 2018, a self-driving Uber vehicle hit and killed a pedestrian, which once again raised the question of complete commitment to autonomous cyber-physical systems.


The ethical question in AI is an open discussion where summits are constantly held regarding autonomous cyber-physical systems and their regulations.


Chapter 2 Background

The human learning process needs to be modeled in order to be simulated by a cyber-physical system. As explained above, we will use reinforcement learning as the AI algorithm. To understand the complexity and the outputs of this thesis, the reader must first understand the fundamentals of reinforcement learning and its extensions. That is what we propose to do in this chapter by explaining the necessary technical background. The chapter first presents the fundamentals of reinforcement learning in finite dimensions, then introduces the reader to deep learning and neural networks, and finally presents some deep reinforcement learning methods that can be used in non-finite dimensions.

2.1 Reinforcement Learning

Reinforcement learning (RL) is a branch of Machine Learning that, alongside supervised and unsupervised learning, has had a tremendous impact in the field of Artificial Intelligence. The idea of RL is, like many other techniques, inspired by the way that humans and other animals are believed to learn (at least in part) from a very young age.

One can see some form of RL when a child learns to walk or run without a teacher. It does so through its mere interaction with the environment. One usually says that, despite there being a teaching signal (feedback from the aforementioned interaction), there is no direct supervisor, certainly not in the way it is usually thought of. It is a process of trial and error, using a reward function to guide the agent with a reward value as feedback. Usually, this feedback is delayed and the sequence in time matters. The agent has a goal, which is defined as the maximization of the expected cumulative reward. Some famous applications are presented by Google's DeepMind, where they teach a humanoid agent to walk (Heess et al. [15]) or play different Atari games, outperforming human experts, in Mnih et al. [30].

Reinforcement learning is based on the Agent-Environment model defined in Figure 2.1.

Figure 2.1: Interaction Agent-Environment. Source: Ashraf [2]

In this model we can define:

1. A set of states S describing the agent and the environment.

2. A set of actions A that the agent can perform.

3. A set of scalar values R defining the rewards.

At each time step t, the agent observes the state S_t ∈ S and the reward r_t ∈ R, chooses an action a_t ∈ A, and receives from the environment a new state S_{t+1} and a reward r_{t+1}.

From these interactions, we want the agent to learn a policy π : S → A that maximizes the cumulative reward. The cumulative reward, or return, G_t is expressed as:

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}    (2.1)


where γ is the discount factor that determines how important future rewards are to the agent; γ = 0 corresponds to an agent that only considers the most immediate reward.
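To make the role of γ concrete, the return can be computed directly from a recorded sequence of rewards. The following is a minimal illustrative sketch (the reward list and the value of γ are arbitrary, not taken from the thesis):

def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma^k * r_{t+k+1} for a finite list of future rewards."""
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 1.0], gamma=0.9))   # 1 + 0.9**3 * 1 = 1.729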

2.1.1 Markov Decision Process

A Markov Decision Process formally describes an environment for reinforcement learning in which the environment is fully observable. The following section is derived from the course of David Silver [4] and from Sutton and Barto [46].

A Markov Decision Process is a Markov reward process with decisions, defined by a tuple ⟨S, A, P, R, γ⟩ where

• S is a finite set of states

• A is a finite set of actions

• P is a state transition probability matrix, with P^a_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]

• R is a reward function, with R^a_s = E[r_{t+1} | S_t = s, A_t = a]

• γ is a discount factor

The main idea behind a Markov Decision Process is that "the future is independent of the past given the present". Formally, a state S_t is Markov if and only if

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

We define a policy π as π(a|s) = P[A_t = a | S_t = s]. This policy defines the behavior of an agent.

From this Markov Decision Process, we can extract two important values: the state-value function v_π and the action-value function q_π.

State-value function

We define the value function v_π(s) of a Markov Decision Process by

v_π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | S_t = s ]    (2.2)


Action-value function

We also define the action-value function q_π(s, a), which represents the expected return from state s when taking action a:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (2.3)

2.1.2 Bellman Equation

The previously defined values have a particular recursive relationship, called the Bellman equation. It can be expressed using v_π:

v_π(s) = Σ_{a ∈ A} π(a|s) ( R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_π(s') )    (2.4)

or using q_π:

q_π(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} Σ_{a' ∈ A} π(a'|s') q_π(s', a')    (2.5)

The best possible performance in the Markov Decision Process is achieved by maximizing one of the two values. We define the optimal state-value v_*(s) = max_π v_π(s) and the optimal action-value q_*(s, a) = max_π q_π(s, a).

We can demonstrate the following relationship between q_*(s, a) and v_*(s):

∀s ∈ S,  v_*(s) = max_a q_*(s, a)    (2.6)

The optimal policy π_* corresponds to the optimal value function,

∀s ∈ S,  π_* = arg max_π v_π(s)    (2.7)

and, because of Equation 2.6,

∀s ∈ S,  π_*(s) = arg max_a q_*(s, a)    (2.8)

We can then write the following recursive equations for the optimal values (Alzantot [1]):

v_*(s) = max_a ( R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_*(s') )    (2.9)

q_*(s, a) = R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} max_{a'} q_*(s', a')    (2.10)


2.1.3 Model-based Approach

When we have full knowledge of a finite Markov Decision Process, we can use dynamic programming to find the optimal policy. Based on equations 2.9 and 2.10, we will present the two main techniques used to find the Bellman optimum.

Value Iteration

Value Iteration is a method that iteratively improves the state-value v(s). The main idea is that if we know the exact value of each state, our decision is simple and consists of choosing the action that maximizes the expected return. The method is presented in Algorithm 1.

Initialize each state to random values v_0(s);
repeat
    For every state, update v_{i+1}(s) = max_{a ∈ A} ( R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_i(s') );
until convergence of v(s);

Algorithm 1: Value Iteration

We will not demonstrate the proof of convergence here.
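To illustrate Algorithm 1, the sketch below runs value iteration on a tiny, made-up two-state MDP; the transition probabilities P and rewards R are arbitrary and only serve to show the Bellman optimality backup.

import numpy as np

# Toy MDP: 2 states, 2 actions. P[a, s, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.1, 0.9], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # v_{i+1}(s) = max_a ( R(s,a) + gamma * sum_{s'} P(s'|s,a) v_i(s') )
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:   # stop when the values have converged
        v = v_new
        break
    v = v_new

print("optimal values:", v, "greedy policy:", q.argmax(axis=1))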

Policy Iteration

Value Iteration has a weakness: it optimizes the value function rather than the policy itself, and the policy can converge before the value does (Rai [36]). Therefore we will look at another method, called policy iteration, presented in Algorithm 2.

Initialize a policy π';
repeat
    Let π := π';
    Evaluate the values under π by solving v_π(s) = R^{π(s)}_s + γ Σ_{s' ∈ S} P^{π(s)}_{ss'} v_π(s');
    Improve the policy at each state: π'(s) := arg max_a ( R^a_s + γ Σ_{s' ∈ S} P^a_{ss'} v_π(s') );
until π = π';

Algorithm 2: Policy Iteration


Even though each iteration is more computationally expensive, policy iteration usually takes considerably fewer iterations to converge and is therefore more efficient.

2.1.4 Model-Free Approach

In the previous section, we explained how to solve a finite, known Markov Decision Process. In practice, however, we do not have such knowledge of the environment. The only option is then not to learn explicit models of the state transitions and the reward function, but to find an optimal policy directly from interactions with the environment. The literature presents two main methods to estimate the value function of an unknown Markov Decision Process: Monte Carlo methods and Temporal-Difference learning.

We will quickly present the basis of the two methods. The reader is encouraged to read about the maths behind them.

Monte Carlo Methods

Monte Carlo methods are a set of methods that learn directly from completed episodes of experience. The reward is only obtained at the end of an episode, and the value function is learned from sample returns of the Markov Decision Process. Since the sample return is averaged over episodes, all episodes must terminate. We update V(s) incrementally after each episode: for each state S_t with return G_t, we update V(S_t) using

V(S_t) := V(S_t) + α (G_t − V(S_t))

where α is a number smaller than 1 that plays the role of a learning rate and can be used to forget old episodes.

Temporal-Difference Learning

Temporal-Difference methods combine ideas from Monte Carlo methods and dynamic programming. They also learn directly from episodes, but the episodes can be incomplete. They use an estimated return R_{t+1} + γ V(S_{t+1}) at each time step, called the temporal-difference target. At each time step we update V(S_t) using

V(S_t) := V(S_t) + α (R_{t+1} + γ V(S_{t+1}) − V(S_t))


We can highlight that temporal-difference learning can learn before the end of an episode, from an incomplete sequence, while Monte Carlo cannot. Temporal-difference methods have shown better efficiency because they exploit the Markov property, which Monte Carlo methods do not.

Among temporal-difference methods, a famous one is Q-learning, which consists of learning the action-value function q(s, a) presented before. We update the Q function Q(s, a), an estimate of q_π(s, a), at each time step using:

Q(S_t, A_t) := Q(S_t, A_t) + α ( R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) )    (2.11)

Using this update rule, Q-learning is simple and is presented in Algorithm 3.

For each state s and action a, initialize Q(s, a) randomly;
Observe the initial state s;
repeat
    Pick and execute an action a;
    Observe the reward r and the new state s';
    Update Q(s, a) using: Q(s, a) := Q(s, a) + α ( r + γ max_{a'} Q(s', a') − Q(s, a) );
    s := s';
until terminated;

Algorithm 3: Q-learning
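Algorithm 3 translates almost line by line into code. The sketch below assumes a small discrete Gym environment (the environment name and the pre-0.26 Gym API, where reset() returns an observation and step() returns a 4-tuple, are assumptions); it is the tabular version, not the network-based variant used later in the thesis.

import numpy as np
import gym

env = gym.make("FrozenLake-v0")          # any small discrete environment works here
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, eps = 0.1, 0.95, 0.1

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, done, _ = env.step(a)
        # Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next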

On-Policy and Off-Policy

In these model-free approaches, we can further distinguish two types of methods: on-policy and off-policy. Q-learning is an example of an off-policy method, where the agent learns about a given policy π while the environment is explored by another policy µ. On-policy learning consists of learning about a given policy π while exploring the environment with that same policy π. SARSA [46] is an example of an on-policy algorithm.

2.2 Deep Learning

Deep learning is a part of the machine learning field that has evolved rapidly over the past years thanks to recent breakthroughs in computational power and parallelized computing on GPUs. We will only explain some of the basic concepts here. The reader is encouraged to read Goodfellow, Bengio, and Courville [12], which is a reference in the domain.

2.2.1 Neural Networks

A neural network is a cluster of nodes, called neurons, that is able to learn a complex function. The model is inspired by our understanding of the brain and its neuron cells. Each neuron computes a linear function, the weighted sum of its inputs plus a bias, and the output is produced by an activation function. Figure 2.2 shows the representation of a perceptron. A classic representation shows only the input and the output. There are plenty of activation functions; to name a few, we can cite ReLU, Softmax, Sigmoid and ISRU.

Figure 2.2: Representation of a Perceptron. Source: Kang [17]

To improve efficiency, we use hidden layers between the input and the output. These hidden layers learn and compute functions that are not controlled by the user, but they allow the network to learn complex functions. On each of their inputs, they apply the same kind of linear function with different weights. Figure 2.3 compares a single-layer perceptron with a multi-layer one. Each connection has its own weights, bias and activation function f. To compute the output of the hidden layer h_1's first neuron h_{11}, given the input X = (x_1, x_2, ..., x_m) with weights W_1 = (w_{11}, w_{12}, ..., w_{1m}) and bias b, we do the following steps:

z = Σ_{i=1}^{m} w_{1i} x_i + b    (2.12)

h_{11} = f(z)    (2.13)

Figure 2.3: Left: Single-Layer Perceptron. Right: Perceptron with Hidden Layer. Source: Kang [17]

So, a neural network is characterized by its architecture (number of layers, number of nodes on each layer), its weights and its biases. To update the weights and bias of each connection, we need data to evaluate "how far are we from the solution?", which represents our error. Usually we use the mean squared error and update the weights using backpropagation. This is how, today, we are able to approximate complex functions that cannot be defined in a traditional way.

2.2.2 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are similar to neural networks, but their layers are composed of convolutional layers. The inputs of a CNN are images, and we encode them using weights and biases in three dimensions instead of one: the width, the height, and the depth. Each convolutional layer uses filters that are applied to the input, as presented in Figure 2.4. These filters allow us to extract features from images such as edges, corners, objects, etc. Usually, we can also find some pooling layers in a CNN to downsample the dimensions of the resulting images.

Figure 2.4: Convolution of an image using a 5x5 filter. Source: [6]

As the literature shows, these Convolutional Neural Networks are getting deeper and deeper, with people stacking layer after layer, and improvements are made constantly. One breakthrough was deep residual learning, where the identity is passed to the activation function. ResNet is a good example of a network using this method: it was able to predict with 96.43% accuracy using 152 layers ([6]).

2.3 Deep Reinforcement Learning

In complex environments with complex inputs, such as camera frames, the set of states S has a high dimension. Therefore we are unable to compute the value functions seen previously and unable to find the optimal policy. We need to approximate those functions using deep neural networks. This combination of deep neural networks and reinforcement learning is called deep reinforcement learning.

2.3.1 Different Categories

Current deep reinforcement learning methods are separated into three categories: policy gradient methods, value iteration methods, and actor-critic methods. We will cite a few before focusing on the one used for this thesis.

• As seen previously, policy-based methods directly learn to optimize the policy without having to calculate the reward for each action. This works well in environments with infinite actions, and we use the total reward of episodes. Among those, we can cite the Monte Carlo policy gradient with the REINFORCE algorithm (Sutton et al. [47]) or the vanilla policy gradient. A new family of policy gradient methods emerged with Trust Region Policy Optimization (TRPO) in Schulman et al. [43], which has been improved and simplified with the method called Proximal Policy Optimization (Schulman et al. [42]) and has been able to beat Dota 2 players.

• Value-based methods estimate the reward for each action in a certain state and return the action with the highest reward. They have better convergence than policy gradient methods but work only with a finite set of actions, as we map each state-action pair to a value. Mnih et al. [30] was one of the first papers to tackle this issue for video games and implemented a deep Q network (DQN) that was able to play seven Atari games using the same network without any domain-specific knowledge provided. Then many methods appeared, such as Double Deep Q Network in Hasselt, Guez, and Silver [14] and Dueling Deep Q Network in Wang et al. [49], and other researchers worked on a single aspect of DQN, such as experience replay with prioritized experience replay in Schaul et al. [41]. We will detail and use DQN and its variants in this thesis.

• Actor-critic methods combine both and propose a distributed way to learn. The basic idea is that the actor, being policy-based, decides which action to take, and the critic, being value-based, tells the actor how good its action was and how it should adjust it. In this category, we can name A2C (Advantage Actor-Critic), its asynchronous, multiple-agent version A3C (Asynchronous Advantage Actor-Critic) (Mnih et al. [28]) and IMPALA (Espeholt et al. [10]), which have been able to make breakthroughs in AI for games. They have been compared using the Atari and DMLab-30 sets of environments.


2.3.2 Deep Q Network

In this section, we will present Deep Q Network (Mnih et al. [29]), an extension of the classic Q-learning, and its variants.

The Deep Q Network consists of approximating the Q function that represents the maximum discounted future reward given a state and an action. The optimal policy can be derived from this value using Equation 2.6. Because we cannot map the Markov Decision Process into a Q table, the Q network takes the state as input and returns a value for each action in the set A.

DeepMind, in Mnih et al. [29], proposed the architecture presented in Table 2.1, which combines convolutional layers and dense layers. The network learns the Q function and outputs the Q-value for each action in a given state.

Layer    Input      Filter Size  Stride  Num Filters  Activation  Output
conv1    84x84x4    8x8          4       32           ReLU        20x20x32
conv2    20x20x32   4x4          2       64           ReLU        9x9x64
conv3    9x9x64     3x3          1       64           ReLU        7x7x64
dense1   7x7x64     -            -       512          ReLU        512
dense2   512        -            -       18           Linear      18

Table 2.1: DeepMind architecture for Atari games
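For concreteness, the architecture of Table 2.1 can be written down in a few lines. The sketch below uses tf.keras purely as an illustration; it is not necessarily the framework used in the original work.

import tensorflow as tf
from tensorflow.keras import layers

def build_atari_dqn(num_actions=18):
    """Network of Table 2.1: three convolutional layers followed by two dense layers."""
    inputs = layers.Input(shape=(84, 84, 4))                          # 4 stacked 84x84 frames
    x = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)    # -> 20x20x32
    x = layers.Conv2D(64, 4, strides=2, activation="relu")(x)         # -> 9x9x64
    x = layers.Conv2D(64, 3, strides=1, activation="relu")(x)         # -> 7x7x64
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    q_values = layers.Dense(num_actions, activation="linear")(x)      # one Q-value per action
    return tf.keras.Model(inputs, q_values)

model = build_atari_dqn()
model.summary()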

Deep Q-learning consists of using Algorithm 3, where the update of the Q function is replaced as follows (Tambet Matiisen [48]):

1. Do a feedforward pass with state s to get Q(s, a) for all a ∈ A.

2. Do a feedforward pass for the next state s' and get max_{a'} Q(s', a').

3. Set the Q-value target for action a to r + γ max_{a'} Q(s', a').

4. Compute the loss using the Q-value target and update the weights using backpropagation.

Exploration versus Exploitation

Equation 2.6 claims that a perfect agent always chooses the action corresponding to the highest Q-value. However, this is only true when we have a good approximation of the Q-function, which is not the case before training. Always picking this action is called maximum exploitation, but as long as we do not know the optimal strategy, we want our agent to pick random actions and explore their effects. This is the exploration-exploitation dilemma.

An effective solution to this problem is a simple algorithm called ε-greedy exploration. It consists of choosing a random action with probability ε, to promote exploration, and otherwise choosing the action with the highest Q-value, therefore being greedy. Mnih et al. [29] starts with ε = 1 and decreases it linearly to 0.1 over time. This allows full exploration while the Q function is still random, and gradually exploits each strategy.
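A minimal sketch of ε-greedy selection with linear decay; the decay horizon is a hypothetical parameter, since the text only gives the start and end values of ε:

import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q_values, step):
    if np.random.rand() < epsilon_by_step(step):
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: greedy action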

Experience Replay

To improve the learning process, we use experience replay, which consists of storing part of the experiences ⟨s, a, r, s'⟩ in memory, where s is the state of the agent, a is the action taken by the agent, r is the reward obtained by taking action a in state s, and s' is the resulting state. During training, we use random batches from the replay memory instead of only the most recent experience. Doing this avoids convergence to a local minimum, as we would otherwise always be facing similar data, since states change only gradually over time.
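A replay memory can be as simple as a bounded buffer of ⟨s, a, r, s'⟩ tuples from which minibatches are drawn uniformly at random; a minimal sketch (the capacity of 50 000 matches Table 4.1 later in the thesis):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)     # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)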

Separate Target DQN

To improve stability in the training, we use a copy of the network, called the separate target DQN. It is used to generate the target Q-value that is used to compute the loss, and it is only periodically updated towards the primary Q-network. Recent work has shown that the stability of the learning process is further improved if it is updated slowly over time (Lillicrap et al. [25]). If the target network learned a policy µ_t, the Q target is updated with r + γ max_{a'} Q^{µ_t}(s', a').

Double DQN

It can happen that, in a DQN with a single network, Q_1 unevenly overestimates a given action. A method to counter this is to introduce another network Q_2 that works with the original one so that they keep each other in check (Hasselt, Guez, and Silver [14]). Both networks are constantly updated and they train each other, using the following Q-target values:

Q_1^{target}(s, a) = r + γ max_{a'} Q_2(s', a')
Q_2^{target}(s, a) = r + γ max_{a'} Q_1(s', a')    (2.14)
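The symmetric targets of Equation 2.14 can be computed for a minibatch as in the numpy sketch below; the terminal-state masking is a standard implementation detail that Equation 2.14 leaves implicit.

import numpy as np

def double_dqn_targets(q1_next, q2_next, rewards, dones, gamma=0.99):
    """Targets of Eq. 2.14: each network bootstraps on the other network's estimate.

    q1_next, q2_next: arrays of shape (batch, num_actions) with Q-values for the next states s'.
    """
    not_done = 1.0 - dones.astype(np.float32)     # no bootstrap term on terminal transitions
    target_for_q1 = rewards + gamma * not_done * np.max(q2_next, axis=1)
    target_for_q2 = rewards + gamma * not_done * np.max(q1_next, axis=1)
    return target_for_q1, target_for_q2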

Dueling DQN

The idea behind dueling DQN is that the Q-value can be decomposed as the sum of the value of being in a given state s, V(s), the value function, and the advantage of taking a given action a in that state s, A(s, a), the advantage function:

Q(s, a) = A(s, a) + V(s)

Therefore, dueling DQN consists of separating the two estimators into two different parts of the network and combining them only at the final layer (Wang et al. [49]). This is the network that we will use in this thesis.
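The value/advantage decomposition can be realized as two small heads on top of a shared feature extractor. The sketch below, again in tf.keras as an assumption, follows the aggregation of Wang et al. [49], where the advantage is centered by subtracting its mean so that V and A remain identifiable.

import tensorflow as tf
from tensorflow.keras import layers

def dueling_head(features, num_actions):
    """Split shared features into a state-value stream V(s) and an advantage stream A(s, a)."""
    value = layers.Dense(512, activation="relu")(features)
    value = layers.Dense(1)(value)                              # V(s)

    advantage = layers.Dense(512, activation="relu")(features)
    advantage = layers.Dense(num_actions)(advantage)            # A(s, a)

    # Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)), combined only at the final layer
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))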


Figure 2.5: Top: single Q-network introduced above. Bottom: dueling network with two streams that calculate the advantage and value functions. Source: Wang et al. [49]


Chapter 3

Related Work

This chapter reviews the state of the art in autonomous navigation and reinforcement learning. It introduces several papers and works that are closely related to this thesis. It first presents projects on autonomous vehicles before focusing on those using reinforcement learning. Finally, it extends to practical usage of depth estimation.

3.1 Autonomous Vehicle

Autonomous driving for vehicles using neural networks is a trending subject and has been for a long time. The first results on this subject were obtained with cars. In 2005, DAVE tried to avoid obstacles and learned to drive using human-supervised training (LeCun et al. [23]). Another example, LAGR, classified roads to predict traversability (Hadsell et al. [13]). The challenge with drones is that it is very difficult to relate the depth of pixels to the horizon. Some interesting results were obtained using imitation training in Ross et al. [38], or by generating a large dataset of how not to fly: Gandhi, Pinto, and Gupta [11] succeeded in avoiding obstacles by crashing their drone 11500 times. The trained drone is sometimes able to perform better than a human in tricky environments. One of the main tasks that are solved without reinforcement learning is collision avoidance and path planning. For robots, many algorithms have emerged that propose a trajectory while avoiding obstacles (Peng et al. [32]), and some are even computed in real time (Davis et al. [5], Lin and Goodrich [26]).


3.2 Reinforcement Learning for Autonomous Vehicle

Reinforcement learning has been used successfully in the past to perform simple tasks. It was first successful in teaching ground robots to find a battery charger using elementary tasks, such as following a wall or docking on a battery charger (Lin [27]).

On drones, the state of the art shows that reinforcement learning has been successful at stabilizing quadrotors by using simulated environments before moving to reality (Hwangbo et al. [16]). Polvara et al. [34] also use reinforcement learning to identify a marker and land a drone on it; using the down-looking camera as input, they are able to perform the task in a simulated environment and then scale it to reality. Kjell [19] was also able to use reinforcement learning in a simulated environment to move a UAV toward a target with external indicators. Scaling to reality is actually a challenging problem: the MIT team of Sertac, Thomas, Sayre-McCord, Winter, and Guerra [44] made a drone learn while moving in reality and "hallucinating" in a virtual environment, and successfully achieved simple tasks, like going through doors.

3.3 Computer Vision for Depth images

Even though 3D simulators are able to produce a depth image, the majority of accessible drones only have a monocular camera. This means that we cannot create a depth image using basic techniques.

Research on this subject is advanced. Pinard et al. [33] present a depth map inference system from monocular videos, based on a dataset that mimics aerial footage from UAVs, which can be ported to reality to cover UAVs' complex tasks. Xie et al. [50] use two networks to achieve obstacle avoidance for a ground robot, where the first one is a depth prediction network from RGB images. Kovács [21] also infers a depth image and succeeds in avoiding obstacles. One solution to infer this depth map is to use SLAM techniques (Leonard and Durrant-Whyte [24]); some even work with monocular cameras on a CPU (Engel, Schöps, and Cremers [8]).


Chapter 4 Methods

This chapter presents the technologies and methodologies applied to answer our research question. It first presents the chosen simulated environment, then introduces the two tasks that will be taught to the UAV, with their specific methods and inputs, and finally the overall architecture of the project.

4.1 Simulated Environment

As discussed previously, we will work in a simulated environment to avoid real crashes and to ease the learning process. Therefore, our criteria for picking the right environment are:

• Close to reality

• Scalable

• Easy-to-use API

The literature gives us several simulated environments to work with. Eikanger [7] presents the three main available simulators, which are JMavSim, AirSim, and Gazebo.

The first one is JMavSim, which is a lightweight multirotor simulator. JMavSim is said to be great for quick setup and hardware/firmware testing. However, it is not easy to incorporate extra sensors or obstacles.

The second one presented is AirSim (Shah et al. [45]), developed by Microsoft and running on the Unreal Engine 4. This is a quite new project (revealed in February 2017) that has a very realistic-looking simulation. It contains APIs for both C++ and Python. The simulator already implements a quadrotor with monocular and depth cameras. The drawback is that the software demands powerful GPUs to run.

The third one is Gazebo (Koenig and Howard [20]), a simulator developed by the University of Southern California. It integrates ROS (Robot Operating System) and can simulate any robot. Eikanger [7] states that it is relatively easy to add an obstacle or a generic sensor. Even though the realism is not as stunning as Microsoft's AirSim, it is lighter.

We chose to try the most promising simulators: Gazebo and AirSim.

As advertised, Gazebo, coupled with PX4 SITL, is lightweight and able to correctly simulate drones and environments. It is highly scalable but does not provide a Python API; we wrote a quick one in order to move our drone around via ROS. However, the open-source environments are not realistic enough. Regarding the reinforcement learning part, Zamora et al. [51] propose an OpenAI Gym wrapper around Gazebo, but only the TurtleBot, a ground robot with a lidar sensor, is maintained, and no previous academic work has been done on Gazebo using UAVs.

AirSim, based on Unreal Engine, needs more computational resources to run. However, it has a pre-built environment with an easy-to-use Python API able to interact with the drone and with the position of meshes in the whole environment. The physics is very close to reality, with drifting and balance effects that occur on a real drone. The camera output is also close to reality, with the possibility to get a depth view of the scene. Figure 4.1 displays a screenshot from the environment. Moreover, Kjell [19] proposes an OpenAI Gym wrapper for AirSim.

Because of our criteria, we chose to continue working with AirSim, using [19] as a base structure, which allowed us to have a working environment relatively quickly.


Figure 4.1: Left: snapshot of the Gazebo environment ([35]). Right: snapshot of the AirSim environment ([45])

4.2 Learning Process

To teach our drone how to perform a simple task in the simulated environment, we used the Dueling Double Deep Q-Network presented above. The architecture is inspired by Mnih et al. [29] and Kjell [19]. Each task has its specific parameters and network architecture, alongside its own set of actions.

Regarding the hyper-parameters, we chose the ones found in the literature ([29] and [19]) with some minor testing, but we did not proceed to fine-tuning. Table 4.1 presents the final common hyper-parameters chosen for the two presented tasks. The only difference is the discount factor: we chose the value 0.99 for the Go To task, where the expected future reward has a very high importance, and the value 0.70 for the Exploration task, where the expected future reward has less importance.

Hyper-parameter      Value    Description
minibatch size       32       Training batch size
replay memory size   50000    Size of the memory from which training data are taken
learning rate        0.00025  Learning rate used for the training
initial ε            1        Initial value of the ε-greedy exploration
final ε              0.1      Final value of the ε-greedy exploration

Table 4.1: Common hyper-parameters


4.2.1 Go To Test

The Go To task consists of going to a pre-defined object in the accessible field of view of the drone. Such a task can be done programmatically, without using reinforcement learning, but our goal here is to prove that our approach is able to teach the drone to achieve specific tasks.

We have two different inputs here, the segmentation image and the depth image, displayed in Figure 4.2. The first input is the depth camera image, which represents depth as a grayscale image of the drone's view; it is the one provided by AirSim, where we enhanced the contrast and cropped the part of interest. The second one consists of a camera view where a selected object is highlighted and the rest has a uniform color; this is the kind of input one gets by doing object detection on RGB frames. We trained our network in the default environment "block" and, for robustness, we randomly change the target position every episode.


Figure 4.2: Top: depth image; on the left the output of AirSim and on the right the preprocessed input of the network. Bottom: segmentation image; on the left the output of AirSim after marking a specific ball in the environment and on the right the preprocessed input of the network.

We have 3 possible actions for our drone, which moves only on a horizontal plane at a fixed height:

• Go 2 meters forward

• Rotate by 24° to the right

• Rotate by 30° to the left

Those restricted actions correspond to a one-floor environment use case and allow us to present this proof of concept. Further work could implement more complex actions. Our training environment is "block", provided by AirSim and presented in Figure 4.3.


Figure 4.3: Training environment for the Go To task

Our network, presented in Table 4.2, is composed of two branches of 3 convolutional layers, one branch for each input image, followed by a flattening operation and two dense layers that predict the action to execute. Since both of our inputs have a size of 20x54 px, our network has 1 992 643 parameters.

Layer         Input                        Input Size  Filter Size  Stride  Num Filters  Activation  Output
conv_depth_1  depth_input                  20x54       4x4          4       32           ReLU        32x5x13
conv_depth_2  conv_depth_1                 32x5x13     3x3          2       64           ReLU        15x2x64
conv_depth_3  conv_depth_2                 15x2x64     1x1          1       64           ReLU        15x2x64
conv_segm_1   segm_input                   20x54       4x4          4       32           ReLU        32x5x13
conv_segm_2   conv_segm_1                  32x5x13     3x3          2       64           ReLU        15x2x64
conv_segm_3   conv_segm_2                  15x2x64     1x1          1       64           ReLU        15x2x64
dense1        conv_depth_3 + conv_segm_3   3840        -            -       512          ReLU        512
dense2        dense1                       512         -            -       3            Linear      3

Table 4.2: Architecture used for the Go To task
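The two-branch structure of Table 4.2 can be sketched as a multi-input Keras model, one convolutional stack per image concatenated before the dense layers. Kernel sizes, strides and filter counts follow the table, but the framework choice, the padding and the exact tensor shapes are assumptions, so this is a structural sketch rather than a faithful re-implementation.

import tensorflow as tf
from tensorflow.keras import layers

def conv_branch(image_input):
    """One branch of Table 4.2: three convolutional layers applied to a 20x54 image."""
    x = layers.Conv2D(32, 4, strides=4, activation="relu", padding="same")(image_input)
    x = layers.Conv2D(64, 3, strides=2, activation="relu", padding="same")(x)
    x = layers.Conv2D(64, 1, strides=1, activation="relu", padding="same")(x)
    return layers.Flatten()(x)

depth_input = layers.Input(shape=(20, 54, 1), name="depth_input")
segm_input = layers.Input(shape=(20, 54, 1), name="segm_input")

merged = layers.Concatenate()([conv_branch(depth_input), conv_branch(segm_input)])
hidden = layers.Dense(512, activation="relu")(merged)
q_values = layers.Dense(3, activation="linear")(hidden)   # forward, rotate right, rotate left

go_to_model = tf.keras.Model([depth_input, segm_input], q_values)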

The reward function used depends on the distance to the target and is calculated at each step, which avoids sparse rewards. We defined the reward function as follows:

reward_t = (dist_{t-1} - dist_t) - 1

with dist_t being the distance to the target in meters. Therefore, our reward acts as follows:

• negative if the drone moves away from the target or stays at the same position

• close to 0 if the drone moves toward the target but not in a straight line

• positive if the drone moves toward the target at an efficient rate.
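A minimal sketch of this reward, assuming the sign convention implied by the bullet points (reducing the distance is rewarded):

def go_to_reward(dist_prev, dist_curr):
    """Reward of the Go To task: (dist_{t-1} - dist_t) - 1, with distances in meters.

    With a 2 m forward step the reward lies roughly in [-1, 1]: negative when moving
    away or standing still, near 0 when approaching indirectly, positive when
    approaching the target efficiently.
    """
    return (dist_prev - dist_curr) - 1.0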

4.2.2 Exploration Test

The Exploration task is more complex, since we want to find the best strategy to explore an unknown environment and avoid obstacles. This task is usually implemented using high-end sensors and SLAM methods. We will limit our research to planar exploration, meaning that we will not explore the environment along the vertical axis.

To achieve this Exploration task, we propose the creation of a real-time heatmap that represents the drone position and its comprehension of the environment. The heatmap is created with, as only input, the position of the drone over time and its orientation. Therefore it does not need a prior estimate of the environment itself and it does not use any high-end sensors, only the accelerometer of the drone. We chose to give the map a size of 330x330 px and, at each step, to draw a circle of radius 16 pixels at the drone position. Then an arrow is drawn on the image to represent the orientation of the drone. We merge and compress this heatmap with the depth image to provide one single image of 126x100 px, displayed in Figure 4.4.
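One possible way to build such a heatmap with OpenCV is sketched below. Only the 330x330 map size, the 16-pixel radius and the idea of drawing the position and orientation come from the text; the meters-to-pixels scale and the arrow length are hypothetical parameters.

import math
import numpy as np
import cv2

MAP_SIZE, RADIUS = 330, 16

def update_heatmap(heatmap, x_m, y_m, yaw_rad, scale=5.0, arrow_len=25):
    """Mark the drone position on the persistent map and draw its heading on a copy.

    x_m, y_m: position in meters relative to the take-off point (placed at the map center).
    """
    cx = int(MAP_SIZE / 2 + x_m * scale)
    cy = int(MAP_SIZE / 2 + y_m * scale)
    cv2.circle(heatmap, (cx, cy), RADIUS, color=255, thickness=-1)   # filled coverage disc
    frame = heatmap.copy()                                           # arrow only on the current frame
    tip = (int(cx + arrow_len * math.cos(yaw_rad)), int(cy + arrow_len * math.sin(yaw_rad)))
    cv2.arrowedLine(frame, (cx, cy), tip, color=128, thickness=2)    # orientation marker
    return heatmap, frame

heatmap = np.zeros((MAP_SIZE, MAP_SIZE), dtype=np.uint8)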


Figure 4.4: Left: actual depth image above the created heatmap. Right: input of the network: depth image and compressed heatmap.

We have 4 possible actions for this task:

• Go 4 meters forward

• Go 2 meters forward

• Rotate by 24° to the right

• Rotate by 30° to the left

The network structure only has one branch here, and each input is 4 consecutively stacked frames. The network is presented in Table 4.3 and has 7 396 068 trainable parameters.


Layer   Input       Filter Size  Stride  Num Filters  Activation  Output
conv1   126x100x4   4x4          4       32           ReLU        32x31x25
conv2   32x31x25    3x3          2       64           ReLU        15x15x64
conv3   15x15x64    1x1          1       64           ReLU        15x15x64
dense1  15x15x64    -            -       512          ReLU        512
dense2  512         -            -       4            Linear      4

Table 4.3: Architecture used for the Exploration task

For this task, we chose to keep a fixed-size, delimited environment derived from "block", and we randomly move all the meshes inside the delimited area at each step. This allows our agent to train itself on variable environments. Figure 4.5 presents a sample of training environments.

Figure 4.5: Sample of training environments for the Exploration task

Our reward function here is based on the heatmap. At each step, we count the number of newly created white pixels, X, which goes up to around 100 px (with physics approximations). Our reward function is then the following:

• -0.1 if X < 30 and the drone goes forward

• -4.0 if we have rotated during the 3 previous steps

• between -1 and 0 if we discover only a little of the environment (X/30 - 1 if X < 30)

• between 0 and 1 if we discover the environment (X/60 - 1/2 if 30 <= X < 90)

• 1 if we discover the environment at an efficient rate (X > 90)

Therefore, our reward function acts as follows:


• Negative when the drone does not move or does not discover more of the environment.

• Positive when the drone discovers the environment and extends the heatmap.

We also added some bonus points each time a discovery milestone is reached (5%, 10%, 15%, 20%, 30%, 40%, 50%, ...).
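Putting the rules above together, the shaping can be written as a single function. The precedence between overlapping cases and the size of the milestone bonus are assumptions, since the text lists the cases without ordering them:

def exploration_reward(new_pixels, moved_forward, rotated_last_3_steps,
                       passed_milestone, milestone_bonus=1.0):
    """Reward shaping for the Exploration task; new_pixels is X, the newly whitened pixels."""
    if rotated_last_3_steps:
        reward = -4.0
    elif new_pixels > 90:
        reward = 1.0                           # discovering at an efficient rate
    elif new_pixels >= 30:
        reward = new_pixels / 60.0 - 0.5       # between 0 and 1
    elif moved_forward:
        reward = -0.1
    else:
        reward = new_pixels / 30.0 - 1.0       # between -1 and 0
    if passed_milestone:                       # 5%, 10%, 15%, ... coverage milestones
        reward += milestone_bonus
    return reward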

4.2.3 Combination Test

The last task is a combination of the previous tasks. With both of our policies trained on the two different tasks with different networks, we propose a way to switch the policy depending on the input from the environment. With this, we are able to explore the environment until the target is found and then go to it. The switch is straightforward and consists of changing the network used in the simulation once the target is in the field of view of the drone, using the segmentation image, and the number of colored pixels is above a predefined threshold. Figure 4.6 displays a case where the switch will be made.

Figure 4.6: Switching case, based on the segmentation image once the number of target pixels is large enough.
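The switching rule itself reduces to counting target-colored pixels in the segmentation image; a hedged sketch, where the threshold value and the color encoding are assumptions:

import numpy as np

SWITCH_THRESHOLD = 200   # minimum number of target pixels before switching (hypothetical value)

def choose_policy(segmentation, target_color, explore_policy, go_to_policy):
    """Return the Go To policy once the target occupies enough of the segmentation image."""
    target_pixels = np.count_nonzero(np.all(segmentation == target_color, axis=-1))
    return go_to_policy if target_pixels >= SWITCH_THRESHOLD else explore_policy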


4.3 Project Overview

In this section, we will present the architecture of the overall project. This architecture is linked to the agent-environment model, where the AI module acts as the agent while the combination of AirSim and Unreal acts as the environment, as displayed in Figure 4.7.

Figure 4.7: Project Overview with main interactions

Here, Unreal, the game engine, constantly communicates with the drone simulator AirSim to simulate the environment, its physics, and the drone’s movement. Indeed, AirSim is built as an Unreal plugin.

We then call the AirSim API from Python to get the input data. As specified above, these are the depth camera, the segmentation camera and the drone's position relative to its take-off point. These raw inputs are then pre-processed and fed to the network, which outputs the Q-value for each possible action. The action with the highest Q-value, meaning the action that maximizes the expected cumulative reward, is then chosen and sent to the drone simulator, which changes the current state to the desired one using a PID controller. We can then call the AirSim API again to get the new state and iterate the process.
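The loop described here, read the sensors, pre-process, query the network, send the greedy action, can be summarized in a Gym-style sketch. The environment object stands for the OpenAI Gym wrapper around AirSim mentioned in Section 4.1 ([19]); its interface and the batching details of the model call are assumptions.

import numpy as np

def run_episode(env, q_function, max_steps=500):
    """One agent-environment episode: observe, pick the greedy action, act, repeat.

    env is assumed to expose the usual reset/step interface on top of the AirSim API;
    q_function maps a pre-processed observation to one Q-value per action.
    """
    obs = env.reset()                     # depth image, segmentation image, relative position
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:
        q_values = q_function(obs)        # forward pass of the (dueling double) DQN
        action = int(np.argmax(q_values)) # greedy = maximize expected cumulative reward
        obs, reward, done, info = env.step(action)   # AirSim executes it via its PID controller
        total_reward += reward
        step += 1
    return total_reward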


Chapter 5 Results

This chapter describes the results obtained on the three tasks, both in training and test environments. We will first present the results achieved on the Go To task, the Exploration task and the combination of both, before finally discussing some directly observable obstacles.

Each training lasted around 48 hours on a GTX 1070 GPU. Our networks have been assessed on the training environment and on a completely different environment, based on "Epic Zen Garden" from Epic Games, presented in Figure 5.1.

Figure 5.1: Modified Epic Zen Garden environment used for testing. Left: overview of the environment. Right: simulation view

5.1 Go To Test

The Go To task is very effective. The model was trained for 1750 episodes, and we tested the trained model on the block environment where we randomly moved the target, which is the ball. Figure 5.2 displays the metrics of the training over the episodes. The displayed metrics, duration in number of steps and reward per episode, are negatively correlated: the faster we reach the objective, the shorter the path is and the higher the reward is. Therefore, looking at the figure, we can say that the learning process worked: the reward increases over time while the duration decreases.

Figure 5.2: Training metrics. Left: duration of each episode, which decreases over time, meaning the objective is reached faster and faster. Right: reward of each episode, which increases over time.

We tested the network on the training environment over 100 episodes: the drone finds and goes to the ball 98 times when the ball is in its potential field of view. By potential field of view, we mean the following hypothesis: if the drone does a 360° turn, it will see the ball at least once. However, it is not able to move toward the ball when it is not in its potential field of view. This is the expected result since, with such a scenario and the input parameters presented above, we are in a non-Markov setting. A human would not do better.

In the test environment, the drone finds and goes to the ball 81 times out of 100. Some obstacles are difficult for it to detect, which will be discussed in Section 5.4, but since the inputs are not environment dependent, it is able to move toward the target.


5.2 Exploration Test

The Exploration task was trained for 2000 episodes, also on the block environment. For each episode, the environment is changed at the beginning. Figure 5.3 displays the metrics of the training. We can see that the network is learning the task: the map discovery metric increases over time along with the reward. However, it learns more slowly than the Go To task, probably because of its complexity. Figure 5.4 displays an example of the output of the Exploration task after 500 steps.

Figure 5.3: Training metrics. Left: relative map discovery (in percent), which increases over time, showing that the task is performed better and better. Right: reward of each episode, which slightly increases over time. An episode ends after reaching 500 steps, or after crashing.


Figure 5.4: Test of the Exploration network after 500 steps. Left: the environment. Right: the last input of the network with the actual size of the environment shown with dashes.

We tested the Exploration task on two different environments. The goal was to explore as much of the environment as possible before reaching 200 steps, or crashing. In order to have a point of comparison, we defined a non-expert performance as the maximum score over 3 runs of a human performing the task in an unknown environment. In the training environment, this non-expert discovered 29.91% of the map in 200 steps.

Over 50 episodes, we observe that the drone achieves a better result than our non-expert human 14 times. Figure 5.5 displays the distribution of the discovery over the episodes.


Figure 5.5: Histogram of the map discovery metric over 50 episodes in the training environment. The dashed line is the non-expert top performance after 3 episodes.

In the test environment, the non-expert human discovers, as the maximum score after 3 episodes, 64.38% of the map in 200 steps. Over 50 episodes, we observe that the drone achieves a better result than our non-expert human 8 times. Figure 5.6 displays the distribution of the discovery over the episodes.

Figure 5.6: Histogram of the map discovery metric over 50 episodes in the test environment. The dashed line is the non-expert top performance after 3 episodes.

These results show that the drone learned to move in an unknown environment, but it is not able to outperform a human on every episode with a limited number of steps. This shows that it does not perform exploration in an optimal way. However, it has been observed that the drone avoids obstacles and very often tries to reach unexplored areas.

5.3 Combination Test

We tested the combination task on several scenarios, where the ball is hidden from the drone at its starting point, and with an unlimited number of steps for the exploration phase. Some results are presented below.

Figure 5.7: Behavior of the drone in a given scenario. Star: starting point. Ball: target. Plain arrows: drone's movements using the Exploration policy. Dashed arrows: drone's movements using the Go To policy


Figure 5.8: Behavior of the drone in a given scenario. Star: starting point. Ball: target. Plain arrows: drone's movements using the Exploration policy. Dashed arrows: drone's movements using the Go To policy

After performing the test over 50 episodes in each environment, we observe that the drone reaches the hidden target 44 times in the training environment and 29 times in the test environment. This gives us an accuracy of 88% in the training environment and 58% in the test environment. Section 5.4 will explain why the drone performs poorly in the test environment.

As observed previously, the path taken is most often not the shortest one, which can be explained by the fact that the drone does not have any information regarding the direction of the target while exploring.

A video of an example in the Epic Zen Garden environment is available here: https://www.dropbox.com/s/mekg0hibgki0yir/combination.mp4?dl=0. Some snapshots of our tests are displayed in Figure 5.9 and Figure 5.10.


Figure 5.9: Snapshots of the Exploration and Go To agent in the block environment

Figure 5.10: Snapshots of the Exploration and Go To agent in the Zen environment

5.4 Some Obstacles

During our testing phase, we observed that the drone successfully finds a given object in an unknown environment. However, it failed to recognize some physical obstacles, as Figure 5.11 shows. We can see that some shapes are learned and well identified even though the agent had only seen corners and walls during training. The bottom part of the figure displays some cases where this identification fails.


Figure 5.11: Different obstacles in the new environment that differ from those in the training environment. Top: recognized obstacles. Bottom: non-recognized obstacles.


Chapter 6

Discussion and Conclusion

This chapter adds a critical point of view to our work. It first presents the challenges faced and the limitations discovered; it then concludes by giving an answer to our research question and proposes several possible extensions of the thesis.

6.1 Challenges

This thesis represents a huge challenge that comes with several obstacles. We will detail the main ones in this section.

The first obstacle faced is the hardware. Reinforcement learning requires a huge amount of data, and therefore the simulation needs to run for a long time. Using classic Atari games, as presented in the literature, requires less computational power than simulating a drone in a virtual environment. Running Unreal Engine is computationally expensive, and the simulation can only be sped up by 2 to 4 times without introducing collision or physics errors.

The long prototyping time is also a pain point: each training lasted around 48 hours, which is enormous with regard to the 20-week scope of the thesis. Several weeks were spent just training the model and correcting it.

Then, as in all reinforcement learning problems, the reward functions are the main difficulty. They were changed several times, producing more or less satisfying results, and could be fine-tuned even further.

