DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2016

Navigation of Mobile Robots in Human Environments with Deep Reinforcement Learning

BENJAMIN COORS

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Navigation of Mobile Robots in Human Environments

with Deep Reinforcement Learning

Degree Project in Computer Science and Communication, Second Cycle, 30 Credits

Master's Programme in Computer Science

Benjamin Coors
coors@kth.se

Supervisor: Pawel Herman

Examiner: Anders Lansner

Principal: Robert Bosch GmbH
Supervisors (Principal): Michael Herman, Tobias Gindele

June 15, 2016



Abstract

For mobile robots which operate in human environments it is not sufficient to simply travel to their target destination as quickly as possible. Instead, mobile robots in human environments need to travel to their destination safely, keeping a comfortable distance to humans and not colliding with any obstacles along the way. As the number of possible human-robot interactions is very large, defining a rule-based navigation approach is difficult in such highly dynamic environments. Current approaches solve this task by predicting the trajectories of humans in the scene and then planning a collision-free path.

However, this requires separate components for detecting and predicting human motion and does not scale well to densely populated environments. Therefore, this work investigates the use of deep reinforcement learning for the navigation of mobile robots in human environments. This approach is based on recent research on utilizing deep neural networks in reinforcement learning to successfully play Atari 2600 video games at human level. A deep convolutional neural network is trained end-to-end from one-dimensional laser scan data to command velocities. Discrete and continuous action space implementations are evaluated in a simulation and are shown to outperform a Social Force Model baseline approach on the navigation problem for mobile robots in human environments.


Sammanfattning (Summary): Navigation of Mobile Robots in Human Environments with Deep Reinforcement Learning

For mobile robots that maneuver in human environments it is not enough to reach the goal in the shortest possible time. Instead, they must reach the goal safely, keeping a comfortable distance to humans and without colliding with objects along the way. Since the number of ways a robot can interact with humans is large, it is difficult to define a rule-based navigation strategy. Current methods solve the problem by predicting human motion patterns in order to plan a collision-free path. This requires separate components for detecting and predicting human motion, and the method performs poorly in densely populated environments. This work investigates the use of deep reinforcement learning for navigating mobile robots in human environments. The method is based on research that used deep neural networks to play Atari 2600 games at human level. A convolutional neural network is trained to map one-dimensional laser range data to the steering commands of a robot. Implementations with discrete and continuous action spaces are evaluated in a simulation and perform better than a Social Force Model baseline for the navigation of mobile robots in human environments.


Contents

Contents
List of Figures
List of Tables
List of Symbols and Acronyms

1 Introduction
1.1 Task
1.2 Objectives and Goals
1.3 Contribution
1.4 Outline

2 Deep Reinforcement Learning
2.1 Artificial Neural Networks
2.1.1 Training Artificial Neural Networks
2.1.2 Convolutional Neural Networks
2.2 Reinforcement Learning
2.2.1 Markov Decision Processes
2.2.2 Value Functions
2.2.3 Q-learning
2.3 Reinforcement Learning with Neural Networks
2.3.1 Value Function Approximation With Neural Networks
2.3.2 Deep Q-Networks
2.3.3 Deep Deterministic Policy Gradient
2.3.4 Extensions

3 Methods
3.1 Task Scenario
3.2 Reward Function
3.3 Discrete Control Agent
3.4 Continuous Control Agent
3.5 Evaluation

4 Results
4.1 Baseline
4.2 Discrete and Continuous Agent
4.3 Learned Policy
4.4 Discussion

5 Conclusions

A Simulation
B Training Parameters

Bibliography


List of Figures

2.1 Structure of an artificial neuron
2.2 A multilayer neural network
2.3 Non-linear activation functions
2.4 Influence of learning rate η on gradient descent convergence
2.5 Non-saturating activation functions
2.6 Early stopping
2.7 Local receptive field
2.8 Max pooling operation
2.9 Architecture of a reinforcement learning agent
2.10 Neural network for action-value function approximation
2.11 Network architecture for K bootstrap samples
3.1 Screenshot of the simulated hallway environment
3.2 Simulated laser scan sample
3.3 Deep Q-network architecture
3.4 Continuous agent network architecture
4.1 Mean cumulative reward per epoch
4.2 Average number of target arrivals per epoch
4.3 Mean Q-value for a fixed data set
4.4 Absolute difference between Q-estimation and received return
4.5 Learned policy: Examples for human collision avoidance
4.6 Learned policy: Example for human collision
4.7 Learned policy: Example for human following
A.1 MORSE simulation screenshot


List of Tables

3.1 Reward function factors
3.2 Reward function constants
3.3 Deep Q-network architecture
3.4 Actor network architecture
3.5 Critic network architecture
4.1 Performance of the Social Force Model agent
B.1 Discrete agent parameters
B.2 Continuous agent parameters


List of Algorithms

1 Q-learning
2 Deep Q-learning with experience replay
3 Deep deterministic policy gradient


List of Symbols and Acronyms

Acronyms

A3C      Asynchronous Advantage Actor-Critic
DDPG     Deep Deterministic Policy Gradient
DPG      Deterministic Policy Gradient
DQN      Deep Q-Network
LSTM     Long Short-Term Memory
MLP      Multilayer Perceptron
MORSE    Modular Open Robots Simulation Engine
MDP      Markov Decision Process
NAF      Normalized Advantage Function
NFQCA    Neural Fitted Q-Iteration with Continuous Actions
NFQ      Neural Fitted Q-Iteration
ReLU     Rectified Linear Unit
ROS      Robot Operating System
SFM      Social Force Model

Symbols

α            Step-size parameter
A(s)         Set of actions possible in state s
a_t          Action at time t
ε            Probability of a random action in the ε-greedy policy
η            Learning rate
γ            Discount-rate parameter
L            Loss function
λ            Regularization parameter
N            Noise from a Gaussian distribution
O            Ornstein-Uhlenbeck process noise
ω            Angular velocity
P^a_{ss'}    Probability of transition from state s to state s' under action a
ϕ            Activation function
π            Policy, decision-making rule
π(a|s)       Probability of taking action a in state s under stochastic policy π
π(s)         Action taken in state s under deterministic policy π
Q*(s, a)     Value of taking action a in state s under the optimal policy
Q^π(s, a)    Value of taking action a in state s under policy π
Q, Q_t       Estimates of Q^π or Q*
R^a_{ss'}    Expected immediate reward on transition from s to s' under action a
R_t          Return (cumulative discounted reward) following time t
r_t          Reward at time t
S            Set of all states of the environment
s_t          State at time t
T            Final time step of an episode
t            Discrete time step
v            Linear velocity
V^π(s)       Value of state s under policy π
V*(s)        Value of state s under the optimal policy
V, V_t       Estimates of V^π or V*
w, θ         Neural network weights


Chapter 1

Introduction

As more and more mobile robots are being deployed in human environments such as hospitals or offices, there is a growing need for safe navigation strategies for mobile robots in complex dynamic environments. In human environments autonomous mobile robots must move around in a predictable and defensive manner to avoid collisions with humans. This is essential in order to build trust in mobile robots working alongside humans in the future. Ideally, mobile robots operating in human environments should not stand out but rather blend in with the human crowd by behaving somewhat 'human-like'. However, writing a rule-based approach for the robot to behave in such a way is a very difficult task due to the infinitely large number of possible human-robot interactions.

In the past, the majority of navigation approaches for mobile robots in dynamic environments relied on reactive collision avoidance methods, where a collision-free path is planned based on the current distance to static or dynamic obstacles in the environment (Burgard et al., 1999; Thrun et al., 2000).

More recently, navigation strategies for mobile robots in dynamic environments have been proposed, which plan a path according to predicted trajectories of humans in the scene (Thompson et al., 2009; Bennewitz et al., 2005; Kuderer et al., 2012).

After detecting humans in the proximity of the robot, these approaches predict human motion based on a hand-crafted motion model or on a model learned from a collected set of observations. A path is then planned which avoids collisions with any of the predicted human trajectories.

These approaches have the disadvantage that their computational complexity at runtime does not scale well with the number of humans in the scene, which makes them unsuited to densely populated environments. Additionally, prediction-based navigation approaches require developing and running separate components for human detection, human motion estimation and path planning.

In order to overcome these limitations, this work investigates the applicability of deep reinforcement learning algorithms to the navigation task in human environments. These approaches use deep convolutional networks to implement a socially aware navigation policy, directly mapping from raw sensory data to velocities.


Thus, they eliminate the need for separate human detection, motion prediction and collision avoidance components. Additionally, they scale well at runtime as their computational cost is independent of the number of humans in the scene.

Reinforcement learning enables an agent to learn its behavior through trial-and-error interactions with its environment. If the state space of the environment is small, it is sufficient for the agent to keep the information about the value of each action in each state in a lookup table, such as a Q-table (Watkins, 1989; Watkins and Dayan, 1992). However, this approach is no longer practical when the state or action space grows large or is continuous. In this case, it is necessary to use function approximators to estimate the values of the different state-action pairs.

Artificial neural networks are non-linear function approximators with good generalization properties. An early example of the successful use of artificial neural networks in reinforcement learning is TD-Gammon (Tesauro, 1994), which used a multilayer perceptron to play backgammon at an advanced human level of play.

However, after this early success the use of artificial neural networks in reinforcement learning stalled when Tsitsiklis and Van Roy (1997) demonstrated the possibility of divergence when non-linear function approximators are used in conjunction with temporal-difference reinforcement learning algorithms.

More recently, developments in machine learning have focused on learning good feature representations directly from high-dimensional input data with the help of deep neural networks, giving rise to the field of deep learning. Deep learning models have since been shown to outperform algorithms which use hand-crafted feature representations on complex problems such as image classification or speech recognition (Krizhevsky et al., 2012; Graves et al., 2013) and currently constitute the state of the art in these domains.

The emerging field of deep reinforcement learning has set out to combine these representation learning capabilities of deep learning with reinforcement learning algorithms to perform end-to-end learning directly from high-dimensional state data.

A prominent example of deep reinforcement learning is the deep Q-Network (DQN) algorithm which was introduced by Mnih et al. (2013, 2015).

The DQN algorithm combines a deep convolutional neural network with a variant of Q-learning. The deep Q-network takes as input high-dimensional sensory data and outputs the corresponding Q-values of a discrete set of actions. The authors apply the DQN algorithm to the domain of Atari 2600 video games, where they train an agent directly on raw video game data and demonstrate performance at human level.

However, one limitation of DQN is its restriction to discrete action spaces due to its greedy action selection policy of always selecting the action that maximizes the Q-function in the current state. To resolve this shortcoming of DQN, the deep deterministic policy gradient (DDPG) algorithm was developed by Lillicrap et al. (2015). It uses a similar network architecture to DQN but, instead of a single deep Q-network, it uses an actor-critic architecture with two deep convolutional networks to estimate the agent's policy and Q-function separately.


1.1 Task

The task is to navigate a mobile robot autonomously to a target destination in a human environment such as a hospital or an office. The mobile robot is assumed to be omnidirectional and to be equipped with a laser scanner as its main sensing device, which gives the robot a scan line within a 2D plane. Furthermore, the robot is assumed to have access to a map of its environment. Based on this map, the robot is expected to be able to localize globally by estimating its current position on the map. Additionally, a global planner is presumed to be present, which can calculate a path from the current position of the robot to a target destination. This path consists of a set of waypoints where two adjacent waypoints are always in sight of each other and not separated by static obstacles. Based on this information, a deep reinforcement learning agent is then tasked with learning to safely navigate each path segment by avoiding collisions as well as keeping a safe and comfortable distance to humans.

1.2 Objectives and Goals

This work has the goal of evaluating the applicability of deep reinforcement learning approaches to the problem of navigating a mobile robot in human environments.

In order to answer the research question on the applicability of deep reinforcement learning algorithms to this task, this work proposes a discrete and a continuous action agent based on the DQN and DDPG algorithms. Both algorithms use deep convolutional networks and perform end-to-end learning from sensory data input to a command velocity output, thereby eliminating the need for a separate human detector or human path predictor.

To determine whether a discrete or continuous action space approach is better suited to the problem of navigating a mobile robot in human environments, the two approaches are evaluated in a simulation and compared on several performance metrics, such as the mean cumulative reward the agents achieve during testing.

Furthermore, the agents' performance is compared to a Social Force Model baseline approach, which has previously been used to implement a socially aware robot navigation framework (Ferrer et al., 2013), to put the evaluation results into context.

1.3 Contribution

Reinforcement learning has been successfully used in previous works to implement a mobile robot navigation policy in static (Smart and Kaelbling, 2002; Huang et al., 2005) and dynamic (Jaradat et al., 2011) environments. However, to the author's knowledge, this work is the first to present and evaluate a deep reinforcement learning approach for the navigation of mobile robots in human environments, thereby making an important contribution to the field.


1.4 Outline

Chapter 2 gives the required background on the topic of deep reinforcement learning.

It starts by introducing the fundamentals of artificial neural networks and reinforcement learning as well as the notation used in this work. The chapter concludes with an in-depth introduction to the deep Q-network (DQN) and deep deterministic policy gradient (DDPG) algorithms as well as some extensions which have been proposed for DQN and DDPG.

In the following Chapter 3, the task scenario and reward function are presented in more detail. Furthermore, the chapter explains how the DQN and DDPG algorithms were adapted to the mobile robot navigation task in human environments.

Lastly, the chapter presents the performance metrics of the evaluation as well as the baseline approach, which is based on a Social Force Model.

Chapter 4 then evaluates the discrete and continuous agents on a number of performance metrics and compares their performance against the baseline Social Force Model approach. In addition, a learned policy of the discrete agent is visualized and the overall evaluation results are discussed.

Finally, Chapter 5 concludes this work by summarizing the evaluation results and presenting possible future work.


Chapter 2

Deep Reinforcement Learning

Deep reinforcement learning combines traditional reinforcement learning methods with deep neural network function approximators. This chapter gives a background on artificial neural networks (Section 2.1) and reinforcement learning (Section 2.2), before going into the details of (deep) neural network function approximation for reinforcement learning in Section 2.3.

2.1 Artificial Neural Networks

An artificial neural network consists of a number of neurons, simple processing units which receive inputs from other neurons and produce an output based on their inputs (Fausett, 1994). Figure 2.1a shows the architecture of a simple neuron with input x and output y.

Figure 2.1: Structure of an artificial neuron. (a) Without bias. (b) With bias.

To compute its output y, a neuron takes the weighted sum of its inputs $y_{in} = \sum_{i=1}^{N} w_i x_i$, where $x_i$ is an input to the neuron and $w_i$ the corresponding weight.


Using vector notation, the weighted sum can be written as $y_{in} = \mathbf{x} \cdot \mathbf{w}$. This weighted sum is then used as the input to an activation function $\varphi$, which computes the output of the particular neuron $y = \varphi(y_{in})$. A simple activation function is the step function with threshold t:

$$\varphi(x) = \begin{cases} 1 & \text{if } x \geq t \\ 0 & \text{if } x < t \end{cases} \qquad (2.1)$$

Instead of using a threshold t, a bias term is often used to let the network learn the optimal threshold by itself. To do so, an additional component $x_0 = 1$ is added to every neuron in the network, as visualized in Figure 2.1b. The corresponding weight $w_0$ is then referred to as the bias. This bias is included when computing the weighted input to a neuron $y_{in} = \sum_{i=0}^{N} w_i x_i$. From now on it will be assumed that every neuron uses a bias. In this case, the step activation function simplifies to:

$$\varphi(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad (2.2)$$
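As a concrete illustration of this computation, the following minimal NumPy sketch evaluates a single neuron with a bias weight and the step activation of equation 2.2; the input and weight values are arbitrary illustrative choices, not values from this thesis:

```python
import numpy as np

def step(x):
    # Step activation with threshold 0 (equation 2.2)
    return 1.0 if x >= 0.0 else 0.0

def neuron_output(x, w, w0):
    # Weighted sum of the inputs plus the bias, passed through the activation
    y_in = np.dot(x, w) + w0
    return step(y_in)

x = np.array([0.5, -1.0, 2.0])   # three example inputs
w = np.array([0.8, 0.2, -0.4])   # corresponding weights
print(neuron_output(x, w, w0=0.1))
```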

Neurons can be arranged in layers. A network that only consists of an input and output layer of neurons is called a single-layer network. The output of one neuron can be used as the input to another neuron to create a multilayer network of neurons.

Figure 2.2: A multilayer network with input x, hidden layer h and output y (input, hidden and output layers connected by weights $v_{ij}$ and $w_{jk}$).

Figure 2.2 shows a multilayer network, where the output of neurons h is used as the input to the output neurons y. In a multilayer network, the layers between the input and output layer are referred to as the hidden layers of the network. A multilayer network can have one or more hidden layers.


While single-layer networks are only able to model linearly separable problems, multilayer networks, which are often referred to as multilayer perceptrons (MLPs), are capable of solving nonlinear problems if a nonlinear activation function is used.

Common nonlinear activation functions are sigmoid functions such as the logistic function:

$$f(x) = \frac{1}{1 + \exp(-x)} \qquad (2.3)$$

An alternative is the hyperbolic tangent:

$$f(x) = \tanh(x) \qquad (2.4)$$

As can be seen in Figure 2.3, the value of the logistic function is always positive between 0 and 1, while the hyperbolic tangent is symmetric around the origin with a function value between −1 and 1.

Figure 2.3: Non-linear activation functions. (a) Logistic function. (b) Hyperbolic tangent.

All the networks shown so far have been feedforward networks, where the signal flows from the input to the output neurons in a forward direction. However, the network could also contain closed loops from a neuron back to itself. In this case the network is recurrent. The networks presented up to this point have also all been fully-connected. In fully-connected networks every neuron in each layer is connected to every neuron in the next layer.

2.1.1 Training Artificial Neural Networks

Most neural networks are trained using supervised learning, where the training data consists of a set of input data with corresponding targets. For a single-layer network a common supervised learning algorithm is the Delta rule (Widrow and Hoff, 1960).

The Delta rule is a gradient descent algorithm, which attempts to minimize the error in the output of the network.


The error can be defined as the sum of squared differences between the output $y_j$ and the target values $t_j$ of the jth output neuron of the network:

$$E = \frac{1}{2} \sum_{j=1}^{m} (t_j - y_j)^2 \qquad (2.5)$$

The weight update for the weight $w_{ij}$ between the ith input and the jth output is then calculated by taking the partial derivative of the error with respect to the weight, thereby moving the weights toward a minimum of the error function:

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad (2.6)$$

$$\frac{\partial E}{\partial w_{ij}} = -(t_j - y_j)\,\varphi'(y_{in,j})\, x_i \qquad (2.7)$$

where η is the learning rate, which controls the amount of weight adjustment at each training step.

The learning rate is an important hyperparameter of a neural network. An optimal learning rate $\eta_{opt}$ will ensure direct convergence to the minimum. Choosing the learning rate too small will result in very slow learning, while a learning rate which is too large can cause the system to oscillate or diverge. Figure 2.4 illustrates this behavior.

Figure 2.4: Influence of the learning rate η on gradient descent convergence (panels for $\eta < \eta_{opt}$, $\eta = \eta_{opt}$, $\eta > \eta_{opt}$ and $\eta > 2\eta_{opt}$).

Multilayer networks can be trained using backpropagation (Werbos, 1974; Rumelhart et al., 1986), a generalized version of the delta rule. Similar to the Delta rule, backpropagation is also a gradient descent optimization algorithm which minimizes the error function, which is sometimes also referred to as loss function, with respect to the weights.


Training a network with backpropagation involves three steps. At first, the prediction y of the neural network for a given input x is computed in the forward pass.

In a second step, the updates of the weights are computed during the backward pass, based on the difference between the predicted output of the network and the target output.

In a final step, the weights of the network are updated using the weight updates computed in the step before.

For a two layer, fully-connected feed-forward network as shown in Figure 2.2, which uses the total squared error loss function (equation 2.5), the backpropagation equations are:

Forward pass:

$$h_j = \varphi\Big(\sum_i v_{ji} x_i\Big) \qquad (2.8)$$

$$y_k = \varphi\Big(\sum_j w_{kj} h_j\Big) \qquad (2.9)$$

Backward pass:

$$\delta_k = (t_k - y_k)\,\varphi'(y_{in,k}) \qquad (2.10)$$

$$\delta_j = \sum_k \delta_k w_{kj}\,\varphi'(h_{in,j}) \qquad (2.11)$$

Weight update:

$$\Delta w_{kj} = \eta\, \delta_k h_j \qquad (2.12)$$

$$\Delta v_{ji} = \eta\, \delta_j x_i \qquad (2.13)$$
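As an illustration, here is a minimal NumPy sketch of one training step implementing equations 2.8–2.13 for such a two-layer network with logistic activations; the layer sizes, input data and learning rate are arbitrary choices, not values used in this thesis:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, V, W, eta=0.1):
    """One backpropagation step for a two-layer network (equations 2.8-2.13)."""
    # Forward pass
    h = sigmoid(V @ x)       # hidden activations (eq. 2.8)
    y = sigmoid(W @ h)       # network outputs (eq. 2.9)

    # Backward pass; for the logistic function phi'(z) = phi(z) * (1 - phi(z))
    delta_k = (t - y) * y * (1 - y)           # output deltas (eq. 2.10)
    delta_j = (W.T @ delta_k) * h * (1 - h)   # hidden deltas (eq. 2.11)

    # Weight updates (eqs. 2.12 and 2.13)
    W = W + eta * np.outer(delta_k, h)
    V = V + eta * np.outer(delta_j, x)
    return V, W

# Example: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))
W = rng.normal(size=(2, 4))
V, W = backprop_step(np.array([0.2, -0.5, 1.0]), np.array([1.0, 0.0]), V, W)
```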

Standard backpropagation updates the weights w of a neural network by performing gradient descent on the error surface of the complete dataset. This is known as batch learning. However, when the dataset is large, computing the output and error for the complete dataset can be slow. The alternative is to use backpropagation with stochastic gradient descent as suggested by Le Cun et al. (1998).

In stochastic gradient descent a subset of the training data of size m (mini-batch) is sampled at every training iteration to compute the error $E_m$:

$$w_{t+1} = w_t - \eta \frac{\partial E_m}{\partial w} \qquad (2.14)$$

Stochastic gradient descent has the advantage of updating the weights much more often and therefore being faster when compared to regular gradient descent.

However, unlike regular gradient descent, it is not guaranteed that a weight update in stochastic gradient descent follows the true gradient of the error surface.

A noisy update, which does not correspond to the true gradient, can have the negative effect of preventing full convergence to the minimum. On the other hand, it can also have the positive effect of jumping out of the basin of a local minimum into the basin of the global minimum. In many recent deep learning implementations, stochastic gradient descent has proved itself as an effective and efficient optimization method (Krizhevsky et al., 2012; Graves et al., 2013; Hinton et al., 2012; Hinton and Salakhutdinov, 2006).
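The mini-batch update of equation 2.14 then amounts to a simple training loop; in the following sketch, grad_E stands for any routine (such as backpropagation) that returns the gradient of the mini-batch error with respect to the weights, and is a hypothetical placeholder rather than a function defined in this work:

```python
import numpy as np

def sgd(w, data, grad_E, eta=0.01, batch_size=32, epochs=10, seed=0):
    """Minimal stochastic gradient descent loop implementing equation 2.14."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)               # shuffle the dataset each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            w = w - eta * grad_E(w, batch)       # w_{t+1} = w_t - eta * dE_m/dw
    return w
```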


Momentum

The convergence of stochastic gradient descent can be sped up by using momentum (Polyak, 1964). The idea behind momentum is to model the velocity of a ball which is rolling down the error surface. Once it has reached a certain speed, its momentum will keep it moving in the same direction even if there is a change in the direction of the gradient. Momentum is implemented by adding the previous weight update, scaled by a momentum parameter µ, to the current update:

$$\Delta w_{t+1} = -\eta \frac{\partial E}{\partial w_t} + \mu\, \Delta w_t \qquad (2.15)$$

An improved type of momentum was introduced by Sutskever et al. (2013) and is based on Nesterov's Accelerated Gradient (Nesterov, 1983). Instead of using the gradient at $w_t$, Nesterov's momentum uses the gradient at $w_t + \mu \Delta w_t$:

$$\Delta w_{t+1} = -\eta \frac{\partial E}{\partial (w_t + \mu \Delta w_t)} + \mu\, \Delta w_t \qquad (2.16)$$

This means that if the momentum term $\mu \Delta w_t$ is a poor update, the gradient at position $w_t + \mu \Delta w_t$ will point back towards $w_t$ and make a correction to the weight update.
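Both momentum variants can be written in a few lines; in this sketch, grad is assumed to be a function that returns ∂E/∂w at a given point, and the quadratic error in the example is purely illustrative:

```python
def momentum_step(w, delta_w, grad, eta=0.01, mu=0.9, nesterov=False):
    """One (Nesterov) momentum update, following equations 2.15 and 2.16.

    grad(w) must return the gradient dE/dw at the given point.
    """
    lookahead = w + mu * delta_w if nesterov else w
    delta_w_new = -eta * grad(lookahead) + mu * delta_w
    return w + delta_w_new, delta_w_new

# Example with a simple quadratic error E(w) = w^2, so dE/dw = 2w
w, delta_w = 5.0, 0.0
for _ in range(100):
    w, delta_w = momentum_step(w, delta_w, grad=lambda w: 2.0 * w, nesterov=True)
```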

Adaptive Learning Rate

In a multilayer network, the magnitude of the gradients and the fan-in of different layers can vary widely, and so can the appropriate learning rates. Therefore, the convergence of stochastic gradient descent can be improved by using individual adaptive learning rates instead of a constant global learning rate.

AdaGrad (Duchi et al., 2011) is one approach for an adaptive learning rate for stochastic gradient descent. It scales the learning rate of the weights by the accumulated squared gradients over all time steps t:

$$\Delta w_t = -\frac{\eta_0}{\sqrt{\sum_{T=1}^{t} g_T^2}}\, g_t \qquad (2.17)$$

where $\eta_0$ is a constant base learning rate and $g_t = \frac{\partial E_t}{\partial w}$ is the per-parameter gradient at time t. AdaGrad has the effect of reducing the learning rate of weights with high gradients and increasing the learning rates of weights with small gradients. A disadvantage of AdaGrad is that the effective learning rate monotonically decreases as the accumulated sum of gradients grows larger over time.

RMSProp, which was introduced by Tieleman and Hinton (2012), offers a solution to the problem of continually decaying learning rates. Instead of keeping a sum of gradients over all time, RMSProp only keeps a fixed number of past gradients.

This means that the denominator can no longer accumulate to infinity and learning is able to continue even after many time steps.


RMSProp implements the accumulation of gradients efficiently by keeping an exponential moving average of squared gradients $v_t$, where 0 < p < 1 is a decay constant:

$$v_t = p\, v_{t-1} + (1 - p)\, g_t^2 \qquad (2.18)$$

Similar to AdaGrad, RMSProp takes the square root of these accumulated gradients, which corresponds to taking the root mean square (RMS) of the moving average of the gradients up to time t. In order to prevent division by zero, a smoothing parameter $\epsilon$ is included in the update equation:

$$\Delta w_t = -\frac{\eta_0}{\sqrt{v_t} + \epsilon}\, g_t \qquad (2.19)$$

Adam (Kingma and Ba, 2014) extends RMSProp by not only including the second moment estimate $v_t$ in the computation but also a first moment estimate $m_t$, which is calculated as an exponential moving average of the gradient $g_t$:

$$m_t = q\, m_{t-1} + (1 - q)\, g_t \qquad (2.20)$$

where 0 < q < 1 is a decay constant similar to p. As both $v_t$ and $m_t$ are initialized to zeros, the first and second moment estimates are biased towards zero, especially during early time steps. This bias is corrected by adapting the base learning rate $\eta_0$ in the following way:

$$\eta_t = \eta_0 \frac{\sqrt{1 - p^t}}{1 - q^t} \qquad (2.21)$$

The full $\Delta w_t$ update equation for Adam is:

$$\Delta w_t = -\frac{\eta_t}{\sqrt{v_t} + \epsilon}\, m_t \qquad (2.22)$$
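Read together, equations 2.18–2.22 give the following update rule, sketched here with NumPy in the notation of this section; this is an illustrative reading of the equations, not a reference implementation:

```python
import numpy as np

def adam_step(w, g, m, v, t, eta0=0.001, q=0.9, p=0.999, eps=1e-8):
    """One Adam update for weights w given the gradient g at time step t >= 1."""
    m = q * m + (1 - q) * g                          # first moment (eq. 2.20)
    v = p * v + (1 - p) * g**2                       # second moment (eq. 2.18)
    eta_t = eta0 * np.sqrt(1 - p**t) / (1 - q**t)    # bias correction (eq. 2.21)
    w = w - eta_t * m / (np.sqrt(v) + eps)           # weight update (eq. 2.22)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w, m, v = adam_step(w, g=np.array([0.1, -0.3]), m=m, v=v, t=1)
```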

Non-saturating Activation Function

One problem when training networks with sigmoid activation functions is the vanishing gradients problem (Hochreiter, 1991). It describes the problem of the derivatives of the sigmoid function getting very small when the input to the function is a very large positive or negative value and its activations saturate. An alternative activation function, for which the derivatives do not saturate, is the Rectified Linear Unit (ReLU). It is shown in Figure 2.5a and is defined as:

$$f(x) = \max(0, x) = \begin{cases} x & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.23)$$

The problem of vanishing gradients does not exist with a ReLU as its partial derivative above zero is 1. The use of ReLUs greatly accelerates (stochastic) gradient descent. Krizhevsky et al. (2012) demonstrated that training a network with ReLUs can be up to six times faster than when a hyperbolic tangent is used.


Figure 2.5: Non-saturating activation functions. (a) ReLU function. (b) PReLU function.

As an additional advantage, the ReLU function is less expensive to compute as it can be implemented with simple thresholding. However, for values below zero the gradient of ReLU is zero, which can lead to "dead" units which never activate and whose weights are never updated.

To alleviate this problem, Maas et al. (2013) propose the leaky rectified linear unit (LReLU), which has a constant gradient of 0.01 below zero. The idea of leaky ReLUs is generalized by the parametric rectified linear unit (PReLU) (He et al., 2015b), which is plotted in Figure 2.5b. It replaces the constant gradient of 0.01 below zero with a learnable parameter a:

$$f(x) = \max(ax, x) = \begin{cases} x & \text{for } x > 0 \\ ax & \text{otherwise} \end{cases} \qquad (2.24)$$
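Both the ReLU and PReLU activations of equations 2.23 and 2.24 reduce to element-wise maximum operations; a minimal sketch:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), equation 2.23
    return np.maximum(0.0, x)

def prelu(x, a):
    # f(x) = max(ax, x) with learnable slope a for negative inputs, equation 2.24
    return np.maximum(a * x, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), prelu(x, a=0.1))
```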

Weight Initialization

Another parameter which influences the speed of convergence and whether gradient descent will converge to a local or global minimum is the initialization of the weights of a neural network. Glorot and Bengio (2010) propose to initialize the weights with values randomly drawn from a uniform distribution with mean zero and variance Var[W] of:

$$\text{Var}[W] = \frac{2}{n_{in} + n_{out}} \qquad (2.25)$$

This prevents the variance of the output of a neuron growing with the number of its inputs $n_{in}$. Similarly, it maintains the variance of the back-propagated gradients by also including the number of units in the next layer $n_{out}$.

However, initializing with the variance of equation 2.25 assumes a linear activation, which is not the case when a ReLU, LReLU or PReLU activation function is used. For a network which uses an activation function from the ReLU family, He et al. (2015b) propose to initialize the weights randomly from a Gaussian distribution with a variance Var[W] of:

$$\text{Var}[W] = \frac{2}{n_{in}} \qquad (2.26)$$
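A minimal sketch of the two initialization schemes of equations 2.25 and 2.26; the layer sizes are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(n_in, n_out):
    # Uniform with zero mean and variance 2 / (n_in + n_out) (eq. 2.25);
    # a U(-l, l) distribution has variance l^2 / 3, hence the limit below.
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    # Gaussian with variance 2 / n_in for ReLU-family activations (eq. 2.26)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W1 = glorot_uniform(784, 256)   # layer with a saturating activation
W2 = he_normal(256, 10)         # layer followed by a ReLU-family activation
```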


Batch Normalization

In a neural network, the input to each layer depends on the weights of all previous layers. If these weights change, the distribution of the input to that layer does, too.

This phenomenon is referred to by Ioffe and Szegedy (2015) as internal covariate shift. This is especially problematic when using saturating nonlinearities. Small changes to the weights in early layers of the network can then cause the gradients to vanish in later layers of the network when the weight changes cause the input to the sigmoid function to become very large (positive or negative). The deeper the network is, the more such weight changes can get amplified.

Because of the internal covariate shift phenomenon, great care must be taken when choosing the learning rate and weight initialization of a neural network so that learning is fast but does not diverge or slow down due to vanishing gradients.

Batch normalization (Ioffe and Szegedy, 2015) reduces the internal covariate shift by normalizing the input to a hidden layer's nonlinearity for each training mini-batch and thereby fixes the distribution of the layer's inputs.

Normalizing the inputs to a neural network by subtracting the mean and dividing by the standard deviation of the training data has long been a standard practice for training neural networks (Le Cun et al., 1998). Batch normalization applies the same principle to the hidden layers of the network by computing for each mini-batch of m samples the mean $\mu_B$ and variance $\sigma_B^2$ of the input $x_i$ to a layer. This input is then normalized to $\hat{x}_i$ and afterwards scaled and shifted with the parameters a and b, which are learned during training:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad (2.27)$$

$$\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \qquad (2.28)$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \qquad (2.29)$$

$$y_i = a\hat{x}_i + b \qquad (2.30)$$

The use of batch normalization makes it possible to train the network with higher learning rates and makes the training less reliant on the initial parameter values. When added to a state-of-the-art image classification model, Ioffe and Szegedy (2015) show that the use of batch normalization can help to speed up the training by more than an order of magnitude while achieving a higher final accuracy.
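For a single mini-batch, the forward pass of equations 2.27–2.30 can be sketched as follows; a and b play the role of the learned scale and shift parameters, and the example input is arbitrary:

```python
import numpy as np

def batch_norm_forward(x, a, b, eps=1e-5):
    """Batch normalization forward pass for one mini-batch (equations 2.27-2.30).

    x has shape (m, features); a and b are the learned scale and shift.
    """
    mu = x.mean(axis=0)                     # mini-batch mean (eq. 2.27)
    var = x.var(axis=0)                     # mini-batch variance (eq. 2.28)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized input (eq. 2.29)
    return a * x_hat + b                    # scale and shift (eq. 2.30)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(32, 4))
y = batch_norm_forward(x, a=np.ones(4), b=np.zeros(4))
```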

Regularization

An important aspect of neural networks is to ensure that the trained network not only performs well on data used during the training process but also generalizes to previously unseen samples.


To do so, the neural network must not overfit the training data. Overfitting describes the problem of not only learning the regularities of the training data but also modeling its noise. The simplest approach to prevent overfitting is to increase the amount and diversity of training data. However, this is often not possible as data sets have limited size and acquiring new training data can be costly.

Alternatively, the capacity of the neural network can be limited. One way of limiting the capacity of a neural network is to apply early stopping, where learning is stopped before the training error reaches a minimum. To enable early stopping cross-validation can be used, which splits the dataset into three separate subsets:

a training set, a validation set and a test set. In early stopping, training on the training set continues as long as the error on the validation set decreases. If the validation error starts to increase, the training is stopped as otherwise only specific regularities of the training set are learned. Figure 2.6 visualizes the concept.

Figure 2.6: Early stopping: stop training when the validation error increases.

Another approach to control the complexity of the model is to limit the magnitude of the weights by applying weight decay. In weight decay, a penalty term is added to the error function to penalize large weights. Equations 2.31 and 2.32 show two forms of weight decay, which are known as L1 (eq. 2.31) and L2 (eq. 2.32) regularization. Both L1 and L2 regularization lead to smaller weights, which results in a smoother model and prevents the model from fitting the sampling noise.

$$E_{L1} = E_0 + \lambda \lVert w \rVert_1 = E_0 + \lambda \sum_i |w_i| \qquad (2.31)$$

$$E_{L2} = E_0 + \lambda \lVert w \rVert_2^2 = E_0 + \lambda \sum_i w_i^2 \qquad (2.32)$$

In both equations a weight-dependent regularization term is added to the regular error function $E_0$. A regularization parameter λ controls the importance of the regularization term. To determine the ideal regularization parameter λ, cross-validation can again be used to find the λ-value with the smallest error on the validation set. While both L1 and L2 regularization lead to small weights, L1 regularization has the effect of pushing many weights to zero.
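In practice, the penalty of equations 2.31 and 2.32 is simply added to the unregularized loss before computing gradients, for example:

```python
import numpy as np

def regularized_loss(E0, weights, lam, kind="L2"):
    """Add an L1 (eq. 2.31) or L2 (eq. 2.32) weight-decay penalty to the error E0."""
    w = np.concatenate([wi.ravel() for wi in weights])
    if kind == "L1":
        return E0 + lam * np.sum(np.abs(w))
    return E0 + lam * np.sum(w ** 2)

loss = regularized_loss(E0=0.42, weights=[np.ones((2, 3)), np.ones(3)], lam=1e-4)
```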


2.1.2 Convolutional Neural Networks

When applying neural networks to work on raw pixel data for computer vision problems, one challenge is that the number of weights grows large even for small images when a regular fully-connected network architecture is used. Such a large number of weights increases the risk of overfitting and requires a large training set.

The work by Lecun et al. (1998) on convolutional neural networks is an important contribution to overcoming these problems. Convolutional neural networks are currently used in the state-of-the-art approaches to many problems in the field of computer vision such as semantic segmentation (Long et al., 2015), optical flow (Fischer et al., 2015) or image classification (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2015a; Szegedy et al., 2014).

Convolutional neural networks offer an alternative neural network architecture to fully-connected neural networks. Compared to fully-connected architectures, they reduce the number of weights and are relatively shift, scale, and distortion invariant.

This is achieved by three architectural ideas: local receptive fields, shared weights and spatial or temporal subsampling.

Local Receptive Fields

The idea of local receptive fields in convolutional neural networks is based on receptive fields in the visual cortex, which are locally-sensitive neurons in the visual system (Hubel and Wiesel, 1962). Instead of each neuron being connected to all neurons in the previous layer, a neuron in a convolutional layer is only connected to a local region in the spatial dimension of its input.


Figure 2.7: Local receptive field of dimension 5 × 5 × 3

The region of pixels a neuron connects to is called the local receptive field of the hidden neuron. If the input is an RGB color image, the input data as well as the local receptive field will have an additional depth dimension. While the spatial dimension F of a local receptive field may vary, it always covers the full depth dimension (1 for black and white, 3 for RGB). Neurons which look at the same local receptive field are stacked in depth in the hidden layer. Figure 2.7 visualizes this concept for a local receptive field of dimension 5 × 5 × 3.


The local receptive fields of adjacent hidden neurons at the same depth level are shifted by a certain number of pixels. This number is constant and is referred to as the stride S. This can be interpreted as sliding a receptive field of fixed size across the input image from left to right, starting in the top-left corner of the input image.

Thereby, the stride controls the spatial dimension of the output.

Shared Weights

Each neuron in the convolutional layer has weights to each pixel of its local receptive field as well as a bias. To keep the number of weights small, neurons at the same depth level in the convolutional layer share their weights and bias. Thereby, these neurons detect exactly the same feature. The intuition behind this is that a patch feature is not only useful at a single spatial location but at any location of the input.

This makes the feature detection invariant to translations in the image.

Neurons at the same depth in the convolutional layer form a feature map. A convolutional layer consists of several different feature maps. The shared weights and bias of one feature map are referred to as a filter. A convolutional layer can then be interpreted as a sliding dot product of a set of filters over the input data. The result of the convolution operation for one filter is called an activation map. The activation maps of all filters are stacked together to produce the output of one convolutional layer.

Subsampling

Besides consisting of convolutional layers, convolutional neural networks also commonly use pooling layers in order to reduce the amount of weights and thus control overfitting. Pooling layers are inserted periodically behind convolutional layers and resize the output of a convolutional layer through subsampling. While the depth dimension of the output remains unchanged, the spatial dimension is decreased by sliding a max operation with a certain stride and dimension over each feature map output (max-pooling). Figure 2.8 visualizes the effect of applying max-pooling to the output of a single feature map.

Figure 2.8: Max pooling operation with 2 × 2 filter and stride 2, applied to a single depth slice.
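A minimal sketch of this 2 × 2 max pooling operation with stride 2 on one feature map:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Apply a 2 x 2 max pooling operation with stride 2 to a single feature map."""
    h, w = feature_map.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h - h % 2, 2):
        for j in range(0, w - w % 2, 2):
            out[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))   # each output entry is the maximum of one 2x2 block
```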


2.2 Reinforcement Learning

Reinforcement learning (Sutton and Barto, 1998) is a branch of machine learning which describes the way an agent is able to learn its behavior through trial-and-error interactions with an environment. Figure 2.9 visualizes the general architecture of a reinforcement learning agent. At each discrete time step t the agent is able to perceive the current state of the environment $s_t \in S$, where S is the set of possible states of the environment. Based on the current state $s_t$ the agent selects and executes an action $a_t \in A(s_t)$, where $A(s_t)$ is the set of actions available in state $s_t$. After completing an action, the agent receives the reward $r_{t+1}$ for its action in the next time step and can observe the updated state of the environment $s_{t+1}$. The reward can be delayed so that it is not directly clear how beneficial each action is.

Figure 2.9: Architecture of a reinforcement learning agent (Sutton and Barto, 1998): the agent receives state $s_t$ and reward $r_t$ from the environment, executes action $a_t$, and observes $r_{t+1}$ and $s_{t+1}$.

The reward is defined by a reward function, which maps from perceived states of the environment to a numeric value, which indicates the desirability of that particular state. Overall, the goal of the agent is to maximize the expected return $R_t$, the total discounted reward an agent can expect from time step t:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \qquad (2.33)$$

The discount factor γ (0 ≤ γ ≤ 1) is a parameter which determines how highly future rewards are valued. It determines how short- or farsighted the agent acts. It can be intuitively interpreted as a measure of uncertainty about the future.
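For a finite sequence of future rewards, the return of equation 2.33 can be computed directly, for example:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite list of future rewards."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9^2 * 1 = 0.81
```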

In order to maximize its return, the agent follows a policy πt, which maps from its perceived state to probabilities of selecting each possible action and defines the agent’s behavior. The agent’s policy can be either deterministic (π(s)) or stochastic (π(a|s)).


2.2.1 Markov Decision Processes

One important aspect of the states in a reinforcement learning problem is that they are often assumed to have the Markov property (or to be Markov). The Markov property says that the environment response at t + 1 depends only on the current state $s_t$ and action $a_t$, but not on the history of states, actions and rewards. In other words, the future is independent of the past, given the present:

$$P[s_{t+1} = s', r_{t+1} = r \mid s_t, a_t] = P[s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, \dots, r_1, s_0, a_0] \qquad (2.34)$$

A reinforcement learning task which satisfies the Markov property is called a Markov decision process (MDP). It is an environment in which all states are Markov.

MDPs formally describe an environment for reinforcement learning where the environment is fully observable. An MDP is defined by a tuple $\langle S, A, P, R, \gamma \rangle$:

• S: Set of states s

• A: Set of actions a

• P: State transition probability matrix

• R: Reward function

• γ: Discount factor

Given a state s and an action a, the transition probability $P^a_{ss'}$ of an MDP defines the probabilities of the possible follow-up states s':

$$P^a_{ss'} = P[s_{t+1} = s' \mid s_t = s, a_t = a] \qquad (2.35)$$

Similarly, the reward function $R^a_{ss'}$ defines the expected value of the next reward for a state s, an action a and a follow-up state s':

$$R^a_{ss'} = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'] \qquad (2.36)$$

2.2.2 Value Functions

The state-value function $V^\pi(s)$ of an MDP is the expected return starting from state s and then following policy π:

$$V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s] \qquad (2.37)$$

It therefore corresponds to the long-term value of a state s. In the same way, the action-value function $Q^\pi(s, a)$ defines the value of taking action a in state s under policy π:

$$Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a] \qquad (2.38)$$


The Bellman expectation equations describe how the state-value and action-value function can be decomposed into an immediate reward plus a discounted value of the successor state:

$$V^\pi(s) = \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s] \qquad (2.39)$$

$$Q^\pi(s, a) = \mathbb{E}_\pi[r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a] \qquad (2.40)$$

One problem in reinforcement learning is to find the policy which achieves the greatest return. A policy which yields an expected return greater than or equal to that of all other policies π' for all states is called the optimal policy π*. There always exists at least one optimal policy which is better than or equal to all others.

All optimal policies share the same optimal state-value function $V^*(s)$, the maximum value function over all policies for all states s ∈ S:

$$V^*(s) = \max_\pi V^\pi(s) \qquad (2.41)$$

Similarly, all optimal policies also achieve the optimal action-value function $Q^*(s, a)$, which is the maximum action-value function over all policies for all states s ∈ S and all actions a ∈ A(s):

$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \qquad (2.42)$$

The optimal state-value function and the optimal action-value function are related by the Bellman optimality equations:

$$V^*(s) = \max_a Q^*(s, a) \qquad (2.43)$$

$$Q^*(s, a) = \mathbb{E}[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a] \qquad (2.44)$$

$$Q^*(s, a) = \mathbb{E}[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a] \qquad (2.45)$$

If $Q^*(s, a)$ is known, the optimal policy $\pi^*(a|s)$ is defined as the action which maximizes $Q^*(s, a)$:

$$\pi^*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in A} Q^*(s, a) \\ 0 & \text{otherwise} \end{cases} \qquad (2.46)$$

This optimal policy is said to be greedy with respect to the optimal action-value function. The optimal policy defines the agent's optimal actions without need of knowledge of the environment's dynamics ($R^a_{ss'}$, $P^a_{ss'}$).

2.2.3 Q-learning

A popular approach to find the optimal action-value function $Q^*(s, a)$ is Q-learning (Watkins, 1989; Watkins and Dayan, 1992), a temporal-difference (TD) method. Q-learning is a model-free approach, which means that it does not require knowledge of $R^a_{ss'}$ or $P^a_{ss'}$.


Q-learning does not wait for the final return $R_t$ to update its estimate of $Q(s_t, a_t)$. Instead, it updates the current estimate of $Q(s_t, a_t)$ at each time step with the difference between the current estimate of $Q(s_t, a_t)$ and the TD target. Q-learning's TD target is defined as the reward $r_{t+1}$, observed after performing $a_t$ in $s_t$, plus the discounted action-value $Q(s_{t+1}, a^*)$ of the follow-up state $s_{t+1}$, where $a^*$ is the action which maximizes Q in $s_{t+1}$:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\big] \qquad (2.47)$$

The complete Q-learning algorithm is shown below in Algorithm 1.

Algorithm 1 Q-learning (Sutton and Barto, 1998)
  Initialize Q(s, a) arbitrarily
  for each episode do
      Initialize s
      repeat
          Choose a from s using policy derived from Q
          Take action a, observe r, s'
          Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
          s ← s'
      until s is terminal
  end for
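A minimal tabular implementation of Algorithm 1 is sketched below; the env object is assumed to follow a Gym-style interface with reset() and step(a) methods over discrete states and actions, which is an assumption made for illustration and not part of this thesis:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Choose a from s using an epsilon-greedy policy derived from Q
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update (equation 2.47); no bootstrapping at terminal states
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```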

Q-learning is an off-policy method, which means that it learns an optimal policy regardless of the policy the agent is actually following. Several papers have shown that the learned action-value function Q(s, a) will converge to the optimal action-value function $Q^*(s, a)$ with probability one when the policy visits all state-action pairs infinitely often and an appropriate learning rate α is used (Watkins and Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis and Sutton, 1994).

An important aspect to consider when choosing the policy for Q-learning is the exploration-exploitation trade-off in reinforcement learning. On the one hand, an agent should choose the action which is known to maximize its reward (exploitation).

However, if an agent always makes the greedy choice of selecting the current best action, it can miss out on discovering a potentially better action. Therefore, an agent should also at times choose a suboptimal action or an action it has not tried before (exploration).

A simple approach to balance exploration and exploitation is the ε-greedy policy. Instead of always greedily selecting the action with the highest estimated action-value and thereby maximizing the immediate reward, ε-greedy randomly selects one of the m actions with a small probability. While the greedy action is chosen with probability 1 − ε, a random action is chosen with probability ε:

$$\pi(a|s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = \arg\max_{a \in A} Q(s, a) \\ \epsilon/m & \text{otherwise} \end{cases} \qquad (2.48)$$
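The ε-greedy rule of equation 2.48 assigns the following action probabilities, sketched here for a single row of Q-values:

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of equation 2.48: epsilon/m for every action,
    with the remaining 1 - epsilon placed on the greedy action."""
    m = len(q_values)
    probs = np.full(m, epsilon / m)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

rng = np.random.default_rng(0)
probs = epsilon_greedy_probs(np.array([0.1, 0.5, 0.2]), epsilon=0.1)
action = int(rng.choice(len(probs), p=probs))
```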


2.3 Reinforcement Learning with Neural Networks

For small problems, it is sufficient to maintain the estimated values of Q(s, a) in a lookup table with one entry for each state-action pair. However, for problems with large or continuous state and action spaces it is not feasible to use a table due to time and memory constraints. Instead, it is desirable to produce a good estimate of the action-value function from a limited subset of the state-action space. In other words, generalization is needed from experienced state-action pairs to unseen ones.

In order to do so, the value function can be estimated with a function approximator.

One popular approach to function approximation is to use artificial neural networks due to their ability to approximate non-linear functions.

2.3.1 Value Function Approximation With Neural Networks

When using neural networks for the approximation of the action-value function, it is possible to represent the value function for a large state space. The use of multilayer perceptrons as value functions has the advantage of good generalization as neural networks perform global approximation.

A simple architecture for a neural network based value function approximation is shown in Figure 2.10. In the figure, the current state $s_t$ and an action $a_t$ are used as inputs to the neural network. The output of the network corresponds to the approximated action-value Q. To choose the action $a_t$ with the highest action-value in a state $s_t$, $Q(s_t, a_t)$ needs to be computed with a forward pass through the network for each possible action.

Figure 2.10: Neural network with weights $\theta_{ij}$ that takes $s_t$ and $a_t$ as input and outputs $Q(s_t, a_t)$ for action-value function approximation.

To train a neural network to approximate the Q function, standard gradient descent techniques can be used to learn the weights. However, Tsitsiklis and Van Roy (1997) demonstrate that when using non-linear models, such as neural networks, as function approximators there is a risk of the learning process becoming unstable or diverging.


One reason for this instability is that, unlike local schemes such as a table, neural networks have a global approximation property. This property enables a weight update in a certain part of the state space to influence the values in other parts of the network. Thereby, an update can have the effect of making the network 'forget' knowledge it has learned from an earlier sample.

A solution to this problem is a technique known as experience replay (Lin, 1992).

It is based on the idea that it is very inefficient to use experiences only once and then throw them away, in particular if an experience only rarely occurs. Therefore, experience replay stores past experiences in a replay memory, where an experience is defined as a quadruple $\langle s, a, s', r \rangle$ consisting of a state s, an action a, a new state s' and a reward r. The memorized experiences are then presented to the learning algorithm more than once during training, preventing the network from forgetting previous knowledge. This approach is very easy to implement and will speed up the learning process which can otherwise be slow.

Additionally, high correlations between sequential training samples can cause the training to become unstable. In reinforcement learning, experiences are often generated sequentially, causing consecutive experiences to be highly correlated. However, optimization algorithms such as stochastic gradient descent generally assume that the training data is independently and identically distributed. Therefore, training a neural network online on sequential experiences can cause training to oscillate. Experience replay once again offers a solution as the replay memory decorrelates experiences and thereby stabilizes the training.
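A replay memory in this sense is essentially a bounded buffer of ⟨s, a, s', r⟩ tuples that is sampled uniformly at training time; a minimal sketch:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of <s, a, s', r> experience tuples."""

    def __init__(self, capacity=100000, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates consecutive experiences
        return self.rng.sample(list(self.buffer), batch_size)

memory = ReplayMemory(capacity=1000)
memory.add(s=0, a=1, s_next=2, r=0.5)
```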

Neural Fitted Q-Iteration (NFQ) by Riedmiller (2005) uses a form of experience replay to learn the action-value function with a multilayer perceptron. Based on the idea of Q-learning, NFQ is a model-free approach to learn the optimal Q-function by sampling state transitions. However, unlike Q-learning, NFQ does not update the Q-function on-line after every new state transition in a sample-by-sample manner.

Instead, NFQ stores previous tuples of experiences in memory and updates the neural value function with all transition tuples. This turns the reinforcement learning problem into a series of offline supervised learning tasks on sampled transition data and enables the use of efficient batch training methods.

2.3.2 Deep Q-Networks

Inspired by the success of deep learning in the field of computer vision, Mnih et al. (2013, 2015) proposed to use deep neural networks as function approximators for the value function in reinforcement learning problems. Their work introduces a new architecture called Deep Q-Networks (DQN), a deep convolutional network which is trained with a variant of Q-learning.

Instead of using a hand-crafted state space, the deep convolutional architecture makes it possible to train the network directly on high-dimensional raw video data. When applying their approach to Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al., 2012), the trained agent outperforms previous reinforcement learning approaches and even human players on some of the games.
