
Learning Goal-Directed Behaviour

MARCEL BINZ

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


MARCEL BINZ
marcelb@kth.se

Master in Machine Learning
Date: June 14, 2017
Supervisor: Florian Pokorny
Examiner: Danica Kragic

School of Computer Science and Communication


Abstract

Learning behaviour of artificial agents is commonly studied in the framework of Reinforcement Learning. Reinforcement Learning has gained increasing popularity in the past years. This is partially due to developments that enabled the use of complex function approximators, such as deep networks, in combination with the framework. Two of the core challenges in Reinforcement Learning are the correct assignment of credit over long periods of time and dealing with sparse rewards. In this thesis we propose a framework based on the notion of goals to tackle these problems. This work implements several components required to obtain a form of goal-directed behaviour, similar to how it is observed in human reasoning. This includes the representation of a goal space, learning how to set goals and finally how to reach them. The framework itself is built upon the options model, which is a common approach for representing temporally extended actions in Reinforcement Learning. All components of the proposed method can be implemented as deep networks and the complete system can be learned in an end-to-end fashion using standard optimization techniques. We evaluate the approach on a set of continuous control problems of increasing difficulty. We show that we are able to solve a difficult gathering task, which poses a challenge to state-of-the-art Reinforcement Learning algorithms. The presented approach is furthermore able to scale to complex kinematic agents of the MuJoCo benchmark.


Sammanfattning

Inlärning av beteende för artificiella agenter studeras vanligen inom Reinforcement Learning. Reinforcement Learning har på senare tid fått ökad uppmärksamhet, detta beror delvis på utvecklingen som gjort det möjligt att använda komplexa funktionsapproximerare, såsom djupa nätverk, i kombination med Reinforcement Learning. Två av kärnutmaningarna inom Reinforcement Learning är credit assignment-problemet över långa perioder samt hantering av glesa belöningar. I denna uppsats föreslår vi ett ramverk baserat på delvisa mål för att hantera dessa problem. Detta arbete undersöker de komponenter som krävs för att få en form av målinriktat beteende, som liknar det som observeras i mänskligt resonemang. Detta inkluderar representation av en målrymd, inlärning av målsättning, och till sist inlärning av beteende för att nå målen. Ramverket bygger på options-modellen, som är ett gemensamt tillvägagångssätt för att representera temporalt utsträckta åtgärder inom Reinforcement Learning. Alla komponenter i den föreslagna metoden kan implementeras med djupa nätverk och det kompletta systemet kan tränas end-to-end med hjälp av vanliga optimeringstekniker. Vi utvärderar tillvägagångssättet på en rad kontinuerliga kontrollproblem med varierande svårighetsgrad. Vi visar att vi kan lösa en utmanande samlingsuppgift, som tidigare state-of-the-art algoritmer har uppvisat svårigheter för att hitta lösningar. Den presenterade metoden kan vidare skalas upp till komplexa kinematiska agenter i MuJoCo-simuleringar.


Contents

1 Introduction
2 Related Work
   2.1 Feudal Reinforcement Learning
   2.2 Option Discovery
   2.3 Universal Value Function Approximators
   2.4 Alternative Hierarchical Approaches
   2.5 Intrinsic Motivation
3 Preliminaries
   3.1 Reinforcement Learning
      3.1.1 Markov Decision Processes
      3.1.2 Monte-Carlo Methods
      3.1.3 Temporal-Difference Learning
      3.1.4 Policy Gradient Theorem
      3.1.5 Asynchronous Advantage Actor-Critic (A3C)
   3.2 Function Approximation with Neural Networks
      3.2.1 Feed-forward Neural Networks
      3.2.2 Recurrent Neural Networks
      3.2.3 Gradient-Based Learning
      3.2.4 Attention Mechanisms
   3.3 Hierarchical Reinforcement Learning
      3.3.1 Options Framework
      3.3.2 Universal Value Function Approximators
4 Methods
   4.1 Hierarchical Reinforcement Learning with Goals
      4.1.1 Motivation
      4.1.2 Goal-Setting Policy
      4.1.3 Goal-Reaching Policy
      4.1.4 Interpretation as Options
   4.2 Connections to Related Approaches
5 Experiments
   5.1 Randomized Goals with Fixed Embeddings
      5.1.1 Task
      5.1.2 Experimental Details
      5.1.3 Results
      5.1.4 Discussion
   5.2 Learned Goals with Fixed Embeddings
      5.2.1 Task
      5.2.2 Experimental Details
      5.2.3 Results
      5.2.4 Discussion
   5.3 Learned Goals with Learned Embeddings
      5.3.1 Task
      5.3.2 Experimental Details
      5.3.3 Results
      5.3.4 Discussion
6 Conclusion and Future Work
Bibliography


We provide a short summary of the notation used in this thesis. In general, vectors are denoted with lowercase bold letters and matrices with uppercase bold letters.

s_t: Observation at time step t
a_t: Action at time step t
z_t: Activation in goal space at time step t
g_t: Goal at time step t
h_t: Hidden state of a neural network at time step t
S: Observation space
A: Action space
G: Goal space
p(s_{t+1} | s_t, a_t): Transition probability
r_t: Reward at time step t
R_t: Return following time step t
V^π(s): State value function of state s when following policy π
Q^π(s, a): Action value function of state s and action a when following policy π
A^π(s, a): Advantage function of state s and action a when following policy π
π: Policy
φ: Goal embedding
Ω: Set of all available options
ω: Single option from Ω
α: Learning rate
β: Entropy scaling factor
γ: Discount factor
θ: Set of parameters being learned
τ: Trajectory of states and actions
W: Weight matrix of a neural network layer
b: Bias vector of a neural network layer
y: Targets from supervised learning
ŷ: Predictions corresponding to y
E[·]: Expectation of a random variable
(·)_S: Subscript referring to the goal-setting policy
(·)_R: Subscript referring to the goal-reaching policy


1 Introduction

Being able to build computational agents that can perform complex tasks in unknown environments is of enormous value for our society. In general we would like to enable such systems to learn their behaviour from experience, as defining such behaviour explicitly is often impossible. Reinforcement Learning (RL) provides a general mathematical framework to model such agents. Recent advances in RL, in particular its combination with methods from Deep Learning, allowed these methods to scale to much more challenging problems. These problems include, amongst others, playing ATARI 2600 games [1], robotic manipulation [2] and mastering the game of Go [3].

Two of the key challenges in Reinforcement Learning are proper credit assignment over long periods of time and dealing with sparse rewards. To highlight these problems, let us inspect the results reported for different games of the well-established ATARI 2600 benchmark [4]. Here we notice that state-of-the-art methods [1, 5] are able to solve most reactive games, such as Ping-Pong or Breakout, easily. However, games that require long-term planning and abstract reasoning, such as Montezuma's Revenge, pose a much harder challenge for these algorithms.

Hierarchical Reinforcement Learning attempts to deal with these challenges by decomposing the problem into smaller sub-tasks, which are then solved independently. There are different ways in which this decomposition can be realized, and we will review some of these in later parts of this thesis. In this work we will focus on a form of hierarchical RL that is based on the notion of goals.

For this we will describe a framework that exhibits a form of goal-directed behaviour. In our approach we strive for as much generality as possible, meaning that all components should be learned from interaction with the world. These components include:

1. Learning the structure of a goal space
2. Setting goals in the given space
3. Reaching those goals through atomic actions

During the stages of this thesis we will relax some of the generality assumptions in order to build up an incremental solution that tackles all three components required to attain goal-directed behaviour.


It may not be directly obvious how to define goals in mathematical terms. It is however important to provide such a definition in order to implement the described mechanics. For our purposes goals are elements of an abstract goal space, such as a vector space.

These representations may be specified by the human designer or correspond to hidden states of a deep network. In general it is required that the agent can infer its current position in goal space, decide where it would like to be, and is able to perform actions to minimize the discrepancy between its current position and the goal.

Let us consider a few examples to be more concrete about this concept and highlight its potential advantages. In the first scenario our task is to maneuver an ant-like robot through a maze, and we are only rewarded for reaching the exit. While we have to learn about complex behaviour sequences to control the robot, the reward structure of the task depends only on its location in the maze. Therefore it is natural to consider a decomposition based on goals. Here the goal space could correspond to locations in the maze which the robot would like to reach. For other tasks it may be desirable to exhibit more complex structures. In a first-person shooter game goals could, in addition to moving to certain locations, correspond to high-level events, such as a kill on an opponent or collecting new weapons.

Reasoning about goals could provide means for more structured exploration in complex problems. This is especially important in tasks with sparse or even non-existing rewards. Such settings are challenging for current RL approaches, as these only perform exploration through sampling from stochastic policies, such as a Boltzmann distribution for discrete actions, or Gaussian distributions in the continuous case. However, in many tasks adding this kind of noise will never lead to meaningful action sequences.

Both Reinforcement Learning and Deep Learning are to some extent inspired by how humans learn to represent and solve problems. In a similar fashion we also observe forms of goal-directed reasoning in human behaviour. Indeed, psychologists and cognitive scientists often distinguish between two types of behaviour – habitual and goal-directed [6, 7]. There are also links to neuroscience, where for example the neurotransmitter dopamine is believed to regulate goal-directed behaviour [8] in the human brain. For a recent overview of connections between hierarchical RL, psychology and neuroscience we refer the reader to [9].

The main contributions of the presented work are the proposal and implementation of a framework for end-to-end learning of goal-directed behaviour, its application to continuous control problems, and an investigation of its different properties through experiments.

The remainder of this thesis is structured as follows. First we provide a broad overview of related work in Chapter 2. This is followed in Chapter 3 by a presentation of the required background techniques. Chapter 4 outlines the proposed framework and motivates its specific design choices. It also places the suggested technique in the context of similar approaches. We validate the approach in Chapter 5 through a set of experiments of increasing difficulty. This chapter also includes an analysis of the obtained results. We conclude this thesis in Chapter 6.


2 Related Work

A major part of the work presented in this thesis is based on the ability to apply Reinforcement Learning techniques in high-dimensional, continuous observation and action spaces. In these cases it is natural to employ complex function approximators such as deep networks. Combining these two methods proves challenging, however, due to stability problems during the learning process. Recent work on Deep Q-Networks (DQN) [1] showed how to overcome these issues when dealing with discrete actions. Since then these methods have been extended to the case of continuous control. Here we make extensive use of A3C [5], which is an on-policy actor-critic algorithm. There is however a wide range of other algorithms that could be applied instead, such as continuous DQN variants [10], Trust Region Policy Optimization [11], off-policy actor-critics [12, 13] or methods with deterministic policies [14, 15].

A3C itself has been extended in several ways. Generalized Advantage Estimation [16] can be employed as a technique to control bias and variance in the estimation of the advantage function. [12] proposes a fast and flexible trust region update method, which was tested empirically in combination with A3C. UNREAL [17] adds a set of unsupervised auxiliary tasks on top of A3C and shows significant improvements in stability, speed and performance.

The tasks we consider in this report are based on the MuJoCo physics simulator [18] and have been proposed in [19]. Since then they have been used in several research papers.

2.1 Feudal Reinforcement Learning

Hierarchical Reinforcement Learning is an area of diverse approaches. Out of all of these, Feudal Reinforcement Learning (FRL) [20] resembles the model proposed in this thesis the most. FRL is based on the metaphor of the hierarchical aspects of feudal fiefdom. It defines a framework consisting of a hierarchy of managers, in which higher ones are able to set sub-tasks for lower ones. Recently two models have been proposed which combine FRL with deep networks [21, 22]. The former [21] has been applied to the challenging ATARI 2600 game Montezuma's Revenge. It uses a two-level DQN hierarchy, in which the higher one operates over a hand-designed space of entities and relations.

[22] design an end-to-end trainable model that fits the FRL framework and learns its high-level goal space jointly with the task. Their model is evaluated on a set of challenging ATARI 2600 games, where it performs favorably against rivaling approaches. Also in this line of work is the HASSLE [23] algorithm, in which a high-level policy selects a sub-goal from a set of abstract high-level observations, which is obtained through an unsupervised vector quantization technique. Low-level policies are tasked with reaching these goals. In their case a limited set of low-level policies learns to associate itself automatically with the most suitable high-level observations. Recently a related model has been applied to continuous action spaces, where a simulated biped robot learns to dribble a soccer ball and to navigate through static as well as dynamic obstacles [24].

2.2 Option Discovery

The options framework [25] is a popular choice for modelling temporal abstractions in RL. An option is defined as a temporally extended sequence of actions with a termination criterion. As the framework itself does not describe how options arise, there is a large number of additional methods for learning them from experience. Identifying suitable sub-goals is often the first step when constructing options. Given this connection to goal-directed behaviour, we will briefly review some methods for option discovery in the following.

In the given setting Machado et al. propose to identify purposes that are just out of reach [26]. This is achieved through the use of singular value decomposition on changes in consecutive observations of the environment. The result is then transformed into sub-goals, which are learned through an intrinsic reward function. A related approach is taken in [27], where PCCA+ is used to perform a spectral clustering of the MDP.

A similar concept was proposed by the authors of [28]. Their approach relies on identifying bottlenecks in the observation space. In their case a bottleneck is defined as a state that is visited on many successful paths (such as a doorway). Discovered states are then turned into options through a learned policy based on an intrinsic reward function.

The option-critic architecture [29] extends the policy gradient theorem to options. This allows jointly learning internal policies along with policies over options. The approach is appealing as it only requires specifying the number of options, but no intrinsic rewards or sub-goals.

2.3 Universal Value Function Approximators

How do we provide the agent with information about which sub-goal it currently tries to reach? Most of the approaches outlined in this section learn an independent policy for each possible sub-goal. Universal Value Function Approximators (UVFA) [30], on the other hand, provide the ability to treat sub-goals as part of the observation. This enables generalization between different sub-goals and allows, in theory, the use of infinite sets of them.

UVFAs have been applied to several settings that include hand-designed, high-level goals. This includes visual navigation of robots [31, 32], where goals are locations on a map, and first-person shooter games [33], where goals correspond to the maximization of different game measurements (such as health, ammunition or kill score).


2.4 Alternative Hierarchical Approaches

There is also a large set of alternative approaches to hierarchical Reinforcement Learning besides those that are based on sub-goals and options. The MAXQ value function decomposition [34] decomposes the original MDP into a set of smaller MDPs in such a way that the original value function is transformed into an additive combination of the new ones. For this the programmer has to identify and define useful sub-tasks in advance.

The Hierarchy of Abstract Machines [35] model is built upon a set of non-deterministic finite state machines, whose transitions can trigger machines at lower levels. This setup constrains the actions that the agent can take in each state, allowing prior knowledge to be incorporated into the task at hand.

Macro-actions are a simpler instance of options, where whole action sequences are decided at the point of their initiation. They have been proposed in [36], and have since been investigated in other works such as [37]. STRAW [38] consists of a deep network that maintains a plan in the form of macro-actions. This plan can be adjusted at certain re-planning points. The whole model is trainable end-to-end and does not require any pseudo-rewards or additional hand-crafted information.

The Horde architecture [39] shows many similarities to the approach presented in this work. It consists of a large number of sub-agents, called demons. Each of the demons has its own policy and is responsible for answering a single goal-oriented question. These questions are posed in the form of different value functions. The whole system runs many demons in parallel, similar to how A3C is used in this thesis.

2.5 Intrinsic Motivation

Employing goal-based methods provides a mechanism for structured exploration of the environment. This objective can however also be realized through other means. It is often investigated through the lens of intrinsic motivation. Here one strategy is to place a bonus reward based on visitation statistics of states [40]. This bonus then encourages visiting states which have not been seen before. Recently, techniques have been proposed that are based on hashing [41] or density models [42, 43] to keep track of the visitation counts in continuous state spaces.

Yet another approach to obtain intrinsic motivation is via uncertainties in predictions. For example, we can model uncertainties with a Bayesian model [44] and then use Thompson sampling [45] to exploit them for exploration. Another alternative is to maximize the reduction in uncertainty of a dynamics model via the mutual information between the next state distribution and the model parameters [46], or to use surprise-based mechanisms [47, 48].


3 Preliminaries

To allow computational agents to adapt their behaviour based on experience, we first need to establish the mathematical foundations of interacting within an environment.

Different aspects of this process will be presented within this chapter.

The first part provides an overview of the theory of Reinforcement Learning (RL) as it is employed in later parts of this report. First the framework will be outlined and several of its properties will be highlighted. We then continue to describe how decisions can be modelled within the framework as Markov Decision Processes (MDP). This section will also define the quantities which should be optimized in order to learn desired behaviour. Most Reinforcement Learning methods alternate in some way between estimating the quality of the current behaviour in terms of value functions and improving the given behaviour. We present the two major techniques for learning to estimate value functions, namely Monte-Carlo methods and Temporal-Difference learning. Finally we will show how the obtained estimates can be used to update the agent's actions via the policy gradient theorem.

Traditional RL methods rely on table-based methods to keep track of their prior experience. These methods are however not applicable in high-dimensional, continuous spaces. Here one can instead resort to (parametrized) function approximators. In this setting neural networks¹ are a popular choice due to their ability to learn arbitrary functions under mild assumptions [49, 50]. As part of this chapter we will define both feed-forward and recurrent neural network models and show how these can learn from data using gradient-based updates.

We will conclude this chapter with a section on hierarchical Reinforcement Learning. For this we will outline the options framework [25], which allows incorporating high-level primitives into the traditional RL framework. We will also highlight a connection between options and semi-Markov Decision Processes, which allows transferring some theoretical guarantees for traditional RL to its hierarchical counterpart. Finally we will present Universal Value Function Approximators (UVFA), which enable generalization between a – possibly infinite – set of options.

¹ In this thesis we use the terms neural network and deep network interchangeably.


Figure 3.1: Interaction between agent and environment in the RL setting: the agent performs actions and receives observations and rewards in return. Figure adopted from [51].

3.1 Reinforcement Learning

The Reinforcement Learning (RL) [51] framework provides the foundations for learning sequential decision making through interaction with unknown environments. It assumes an agent that is able to interact with a given environment. At each discrete time step the agent perceives an observation from the environment and may interact with it by performing an action. This in turn influences the state of the environment and triggers a new observation as well as a reward signal; see Figure 3.1 for an illustration of this process. The agent is exposed to the environment and has to adapt its behaviour through a trial-and-error learning process such that the received rewards are maximized.

This formulation is extremely general, as it makes few assumptions, and thus it can be applied to a wide range of problems. Indeed, it includes for example supervised learning as a special case. However, in contrast to classical supervised tasks, Reinforcement Learning defines closed-loop systems, in which the learner's actions influence possible future perceptions. Furthermore it introduces an additional layer of complexity, as desired behaviour is not given in advance, but has to be figured out first. This means that the agent usually has to find a trade-off between exploration of the environment and exploitation of current knowledge.

Reinforcement Learning is partially inspired by biological learning systems and shows further connections to psychology and neuroscience. For example, it was shown that a form of Temporal-Difference error (see Section 3.1.3) is correlated with activations in the human brain [52]. This thesis will focus on RL methods which are applicable to high-dimensional, continuous state and action spaces. This section is largely based on [51] and we refer the reader to this book whenever references are omitted.
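To make the interaction loop concrete, the following minimal sketch runs one episode under the assumption of an OpenAI Gym-style interface (the benchmark tasks used later in this thesis expose such an API); the environment name and the random policy are illustrative placeholders only.

```python
# Minimal sketch of the agent-environment loop, assuming the classic Gym API.
import gym

env = gym.make("Pendulum-v0")          # placeholder continuous-control task
observation = env.reset()

episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()              # stand-in for pi(a_t | s_t)
    observation, reward, done, info = env.step(action)
    episode_return += reward                        # accumulate the reward signal
```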

3.1.1 Markov Decision Processes

In Reinforcement Learning the interaction between an agent and the environment is modelled as a Markov Decision Process (MDP), defined as a tuple (S, A, p, r, γ). Here only the discrete time step setting is considered. At time step t the agent receives a state of the environment s_t ∈ S, based on which it selects an action a_t ∈ A. In our case S and A will correspond to vector spaces of dimensions n_s and n_a respectively. The choice of the action triggers a reward signal r_t ∈ R. The agent's observations at the beginning of an episode are defined according to an initial state distribution p_1(s_1) and the environment's transitions are given by a transition distribution p(s_{t+1} | s_t, a_t), such that the Markov property is fulfilled (hence the name Markov Decision Process). In most cases we do not have access to these distributions, such that we have to either use them implicitly (the model-free case) or learn them from experience (the model-based case).

The agent maintains a policy π, which determines the choice of its actions. In most cases a stochastic policy π(a_t | s_t) is used. Stochastic policies have the advantage that they incorporate the possibility for exploration of the environment directly into the agent's behaviour. However, there are also notable exceptions which consider deterministic policies [14].

Let the return R^γ_t be the discounted sum of rewards from time step t onwards, R^γ_t = Σ_{k=t}^{∞} γ^{k−t} r_k. Note that by introducing a discount factor γ ∈ (0, 1], this formulation extends easily to continuing tasks, because the sum is bounded for γ < 1. As designing good policies in high-dimensional spaces by hand is challenging, we set out to learn them from experience. To enable learning of desired behaviour it is crucial to define a quantity that should be optimized. In Reinforcement Learning this quantity is commonly based on the expected return. Our goal is therefore to adapt the policy such that the expected return is maximized. Let V^π(s) be the state-value function, which is the expected return starting from state s and then following policy π. In MDPs this is specified by the following equation:

V^π(s) = E_π[R^γ_t | S_t = s; π]    (3.1)

V^π is a measure of the expected future reward for any state given a fixed policy π. An action-value function Q^π can be formulated accordingly as:

Q^π(s, a) = E_π[R^γ_t | S_t = s, A_t = a; π]    (3.2)

Similar to V^π, Q^π describes how valuable it is to take an action in a certain state. In discrete action spaces it is straightforward to derive the optimal policy given the optimal action-value function Q (through greedy selection of the action with the highest value). For continuous spaces one has to resort to different techniques, as the greedy choice is not applicable. To optimize the expected reward from any start state, it is useful to define the following performance objective:

J(π) = E_π[R^γ_1 | π]    (3.3)
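As a small illustration of the return defined above, the following sketch computes the discounted return R^γ_t for every time step of a finite trajectory; it is plain Python and assumes nothing beyond a list of sampled rewards.

```python
# Sketch: discounted returns R_t = sum_{k>=t} gamma^(k-t) r_k, computed backwards.
def discounted_returns(rewards, gamma=0.99):
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0] up to rounding
```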

Learning to select good actions can be expressed as a form of general policy iteration [51]. Policy iteration describes the process of alternating between estimating the current value function and improving the policy. To ensure faster convergence one usually learns value functions and the policy simultaneously. This however introduces non-stationary behaviour, as the estimate of the value function changes with the policy. This is one of the central issues when combining deep networks with Reinforcement Learning (usually referred to as Deep Reinforcement Learning).


Figure 3.2: Backup diagrams are a method to highlight which information an agent in the RL setting uses for learning. The horizontal axis describes the spectrum of n-step TD learning, ranging from 1-step TD to Monte-Carlo. The vertical axis shows one learning iteration starting from a certain state at the top. White circles indicate states and corresponding estimates for value functions, while black circles are choices of actions. Termination of an episode is denoted with grey squares. Figure taken from [51].

Most Reinforcement Learning algorithms rely on estimating value functions. Therefore the next sections will introduce the two major methods to achieve this. This will be followed by an explanation of how those estimates can be used in tasks with continuous action spaces.

3.1.2 Monte-Carlo Methods

In many interesting tasks one does not have direct access to the environment's dynamics. In this setting Monte-Carlo methods offer a simple way to estimate value functions given a policy. Monte-Carlo methods, as described here, are restricted to episodic tasks. The general idea of Monte-Carlo methods for value function estimation is to take sample experience from one episode and use the resulting rewards to compute sample returns. Those are then used to update the current estimate of the value function. In this thesis neural networks will be employed as the function approximation technique to keep track of this estimate. In this case we use the Monte-Carlo samples as targets for training a value function network. The Monte-Carlo learning process can be illustrated with the rightmost backup diagram in Figure 3.2.
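A minimal sketch of this idea is given below, assuming a hypothetical PyTorch value network `value_net` and an `optimizer` over its parameters; the sampled returns of one episode serve directly as regression targets.

```python
# Sketch: one Monte-Carlo update of a value-function network (assumed PyTorch model).
import torch
import torch.nn.functional as F

def monte_carlo_update(value_net, optimizer, states, rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):                      # sample returns R_t
        running = r + gamma * running
        returns.insert(0, running)
    states = torch.as_tensor(states, dtype=torch.float32)
    targets = torch.as_tensor(returns, dtype=torch.float32).unsqueeze(1)
    loss = F.mse_loss(value_net(states), targets)    # sample returns are the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```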

3.1.3 Temporal-Difference Learning

Temporal-Difference methods take a different path towards learning value functions. They are model-free, like Monte-Carlo methods, but differ in the targets they use for updating their estimates. Instead of waiting for the end of an episode, TD methods implement their updates based on other estimates – they bootstrap. In the simplest form, which is known as TD(0), the estimates for value functions are adjusted based on the TD error defined as:

δ_t = r_{t+1} + γ V^π(s_{t+1}) − V^π(s_t)    (3.4)

Using bootstrapping can be advantageous, because updates do not have to wait until the end of an episode. This enables faster learning, which can turn out to be crucial in many cases, as episodes often span over long periods. However, it comes at the cost of using potentially noisier estimates. When employing neural networks for function approximation we use r_{t+1} + γ V^π(s_{t+1}) as targets.

It is also possible to interpolate between TD and Monte-Carlo methods, leading to n-step TD learning. Here the idea is to bootstrap the estimate after observing n sample rewards. Figure 3.2 illustrates the spectrum of n-step TD learning, ranging from TD(0) to Monte-Carlo methods.
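For illustration, a tabular TD(0) update based on the error in Equation 3.4 could look as follows; the dictionary-based value table and the hyper-parameters are assumptions made for the sake of the example.

```python
# Sketch: tabular TD(0) update; V is a dict from (hashable) states to value estimates.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    target = r if terminal else r + gamma * V.get(s_next, 0.0)   # bootstrapped target
    delta = target - V.get(s, 0.0)                               # TD error (Equation 3.4)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```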

3.1.4 Policy Gradient Theorem

The policy gradient theorem [53] provides a gradient-based update rule for stochastic policies. It is a special case of Score Function Estimators [54] and is sometimes also referred to as the log-derivative trick. The theorem allows us to derive the gradient of J(π_θ), as defined in Equation 3.3, as follows:

∇_θ J(π_θ) = ∇_θ E_π[R^γ_1 | π_θ]
           = ∇_θ Σ_τ p(τ; θ) r(τ)
           = Σ_τ ∇_θ p(τ; θ) r(τ)
           = Σ_τ (p(τ; θ) / p(τ; θ)) ∇_θ p(τ; θ) r(τ)
           = Σ_τ p(τ; θ) (∇_θ p(τ; θ) / p(τ; θ)) r(τ)
           = Σ_τ p(τ; θ) ∇_θ log p(τ; θ) r(τ)
           ≈ (1/m) Σ_{i=1}^{m} ∇_θ log p(τ^(i); θ) r(τ^(i))    (3.5)

where in the last line the expectation is approximated with m sample trajectories, and τ^(i) represents the ith sample. Here we denote π_θ as a policy parametrized through θ ∈ R^{n_θ}. For convenience, we define τ to be a state-action sequence ((s_1, a_1), . . . , (s_T, a_T)) and r(τ) = Σ_{t=1}^{T} γ^{t−1} r_t. What remains to be specified is how to obtain ∇_θ log p(τ^(i); θ):


∇_θ log p(τ^(i); θ) = ∇_θ log [ Π_{t=1}^{T} p(s^(i)_{t+1} | s^(i)_t, a^(i)_t) π_θ(a^(i)_t | s^(i)_t) ]
                    = ∇_θ [ Σ_{t=1}^{T} log p(s^(i)_{t+1} | s^(i)_t, a^(i)_t) + Σ_{t=1}^{T} log π_θ(a^(i)_t | s^(i)_t) ]
                    = ∇_θ Σ_{t=1}^{T} log π_θ(a^(i)_t | s^(i)_t)    (3.6)

After combining Equations 3.5 and 3.6 we obtain the full version of the policy gradient theorem. The fact that the gradient of the objective does not depend on the gradient of the state distribution allows for a wide range of applications. A straightforward application of the theorem is the REINFORCE [55] algorithm, which uses sample returns R^γ_t as an estimate of expected future rewards. While using the sample return is an unbiased estimator, it exhibits high variance [16]. Several methods have been proposed to reduce its variance. One of these is the actor-critic architecture [53], which maintains a second model – the critic – to estimate Q^π. The critic learns to predict value functions using for example Temporal-Difference learning or Monte-Carlo methods (as described in the previous two sections). Yet another approach to reduce variance is to subtract a baseline. We will see an example of this in the following section.

The theorem comes with an intuitive explanation. The higher the estimate of the future rewards, the more we increase the probability of the actions that led to the outcome. For negative values, the probability is decreased.
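To connect the derivation to an implementation, the sketch below computes a REINFORCE-style loss for a Gaussian policy in PyTorch; minimizing it with a gradient-based optimizer follows the sampled gradient of Equation 3.5. The names `policy_net` (mapping states to action means) and `log_std` are assumptions of this example.

```python
# Sketch: REINFORCE loss for a Gaussian policy; its gradient is the sampled
# policy gradient of Equation 3.5 (negated for minimization).
import torch

def reinforce_loss(policy_net, log_std, states, actions, returns):
    mean = policy_net(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)       # log pi_theta(a_t | s_t)
    return -(log_probs * returns).mean()                 # return-weighted log-likelihood
```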

3.1.5 Asynchronous Advantage Actor-Critic (A3C)

Combining neural networks with Reinforcement Learning is known to be notoriously difficult [1]. This is due to the fact that learning algorithms for neural networks in general assume data samples that are independent and identically distributed – an assumption that usually does not hold in RL. Another reason is non-stationary targets, to which the network has to adapt. Deep Q-Networks (DQN) [1] constitute one of the first systems to overcome these issues through the use of experience replay [56] and target networks [1].

Another approach that was shown to be effective is to run multiple agents in an asynchronous manner. The asynchronous advantage actor-critic (A3C) [5] fits in this framework. It is a policy gradient method which launches multiple threads on the CPU in order to enable updates in the fashion of Hogwild! [57]. This means that experience from local models is used to update a shared model. Local models are then synchronized with the shared one from time to time. It was shown that this method can stabilize training and furthermore has the advantage of being applicable in the on-policy setting.

A3C uses an estimate of the advantage function:

A^π(s, a) = Q^π(s, a) − V^π(s)    (3.7)

for the return in Equation 3.5, where subtracting V^π(s) can be interpreted as establishing a baseline. The algorithm maintains two networks: one to approximate the policy π and one to learn about the state-value function V^π. Both networks are updated from experience either every T_max steps or when a terminal state is reached, using a mixture of n-step returns (see Section 3.1.3). It is also possible, and often desirable, to share parts of both networks. To encourage exploration an entropy regularization term [58] is added, leading to the gradient of the final objective:

∇_θ log π(a_t | s_t; θ)(R_t − V^π(s_t; θ)) + β H(π(s_t; θ))    (3.8)

where Q^π is approximated with the k-step sample return R_t = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V^π(s_{t+k}; θ). H denotes the entropy of the output distribution and β its weight. The exact procedure is given as pseudo-code in Algorithm 1.

Data: globally shared parameters θ and θ_v; thread-specific parameters θ' and θ'_v
t = 1
repeat
    Reset gradients: dθ ← 0 and dθ_v ← 0
    Synchronize parameters: θ' = θ and θ'_v = θ_v
    t_start = t
    Get current state s_t
    repeat
        Perform action a_t based on policy π
        Obtain r_t and s_{t+1} from the environment
        t ← t + 1
    until t − t_start == T_max or terminated
    if terminated then
        R = 0
    else
        R = V(s_t; θ'_v)
    end
    for i ∈ {t − 1, . . . , t_start} do
        R ← r_i + γR
        Accumulate actor gradients: dθ ← dθ + ∇_θ' log π(a_i | s_i; θ')(R − V(s_i; θ'_v))
        Accumulate critic gradients: dθ_v ← dθ_v + ∂(R − V(s_i; θ'_v))² / ∂θ'_v
    end
    Perform asynchronous update of θ using dθ and of θ_v using dθ_v
until converged

Algorithm 1: Pseudo-code for one thread in the A3C algorithm. θ denotes parameters of the policy and θ_v parameters of the value function estimator. Algorithm adopted from [5].

In the case of continuous actions the outputs of π are usually the statistics of a normal distribution with diagonal covariance matrix. The authors of [5] furthermore found it beneficial to share the statistics of the optimizer between the threads.
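As a rough sketch, the per-batch losses accumulated by one A3C worker (Equation 3.8) could be written as follows in PyTorch; `policy_net` is assumed to return a `torch.distributions.Normal` over actions, `value_net` a scalar value estimate, and the n-step returns are assumed to be precomputed as in Algorithm 1.

```python
# Sketch: A3C policy and value losses for one batch of worker experience.
import torch

def a3c_losses(policy_net, value_net, states, actions, n_step_returns, beta=0.01):
    values = value_net(states).squeeze(-1)
    advantages = n_step_returns - values.detach()        # R_t - V(s_t), no critic gradient
    dist = policy_net(states)                            # assumed Normal distribution
    log_probs = dist.log_prob(actions).sum(dim=-1)
    entropy = dist.entropy().sum(dim=-1)                 # exploration bonus H(pi)
    policy_loss = -(log_probs * advantages + beta * entropy).mean()
    value_loss = (n_step_returns - values).pow(2).mean() # critic regression
    return policy_loss, value_loss
```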


3.2 Function Approximation with Neural Networks

The classical method to keep track of value functions is to store experienced outcomes of visited states in a table. This approach has at least two major shortcomings: it does not allow generalization between similar states, and it is not directly applicable in continuous state spaces. In scenarios where the table-based setting cannot be employed, we can resort to function approximation techniques instead. The core concept behind function approximation in RL is to use a statistical model which is successively adjusted to fit the value function of the current policy.

There are several options when selecting the model class. Linear function approximators [59], as the traditional choice, are appealing because they enjoy theoretical guarantees. However, linear models are limited in the set of functions they can represent. Examples of more flexible models that have received attention within RL include Gaussian Processes [60, 61] and neural networks [1, 56].

This thesis focuses on the use of deep networks, which are outlined in the following subsections. We will start by defining feed-forward and recurrent networks. This is followed by explaining the principles behind backpropagation, which is the most common method for training deep networks. This section is largely based on [62], to which we refer the reader whenever references are omitted.

3.2.1 Feed-forward Neural Networks

We often find ourselves in the scenario where we want to determine a function y = f(x), but we do not have any access to f directly. We can, however, observe corresponding inputs x and outputs y. One approach to solve this problem is to set up a model f(x; θ) that is parametrized by θ. Once the model is specified, we want to adjust its parameters θ such that it becomes the best fit to a given set of samples. Ideally we aim for models that generalize well to unseen examples. Feed-forward neural networks are one class of models that can be used in such a setting.

Feed-forward neural networks are composed of a set of simple transformations that are chained together to obtain the final output. Each of these transformations is referred to as a layer. Intermediate transformations are called hidden layers. We denote the outputs of the lth layer as h^(l) and the depth, i.e. the total number of layers of the network, with L. The standard definition of one layer l is given by:

h^(l) = g^(l)(W^(l) h^(l−1) + b^(l))    (3.9)

where g^(l) is a non-linear activation function. Common choices include tanh, ReLU [63] and ELU [64]; see Figure 3.3 for a visualisation of their properties. W^(l) and b^(l) represent the parameters of a linear transformation. We denote the set of all of these as θ. h^(0) corresponds to the input x and h^(L) corresponds to the predicted output ŷ. The output size of the hidden layers can be adjusted and is commonly selected based on the difficulty of the task at hand.

The computations of a feed-forward network form a directed acyclic graph, meaning that no feedback connections are allowed. Extensions to include feedback connections are presented in the following section.
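A minimal sketch of such a network in PyTorch is shown below; the layer sizes and the choice of ELU activations are illustrative only.

```python
# Sketch: a feed-forward network with two hidden layers (Equation 3.9).
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(8, 64), nn.ELU(),    # h1 = g(W1 x + b1)
    nn.Linear(64, 64), nn.ELU(),   # h2 = g(W2 h1 + b2)
    nn.Linear(64, 2),              # output layer h(L), i.e. the prediction y_hat
)
```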


Figure 3.3: Visualisation of different activation functions. Left: tanh, middle: ReLU, right: ELU.

It can be shown that a feed-forward network with a single hidden layer of finite size is already universal [49, 50] under mild conditions, i.e. it can approximate any continuous function on a compact subset of R^n. There is, however, a lot of empirical evidence that creating deeper networks leads to better performance [65, 66], as well as some theoretical justification [67].

3.2.2 Recurrent Neural Networks

Recurrent neural networks (RNN) [68] generalize the formulation of the last section to include recurrent (i.e. feedback) connections. They allow us to model a sequence of inputs x = (x_1, . . . , x_T) with the same set of transformations at each step. This invariance property greatly reduces the number of parameters in the model, but assumes that characteristics are shared over the sequence, which often turns out to be the case. In general one layer in a recurrent network is described with Equation 3.10.

h_t = f(x_t, h_{t−1})    (3.10)

where f can be any function. For convenience of notation, all definitions in this section are presented for one layer. They could, however, be extended to multiple layers in a straightforward manner, as done in Equation 3.9. In the traditional one-layer (Elman) RNN [69], f is given by:

h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h)
ŷ_t = W_hy h_t + b_y    (3.11)

In recent years RNNs have achieved state-of-the-art performance in many sequence modelling tasks, such as machine translation [70] or speech recognition [71]. It is also natural to consider them in the RL setting, as the agent here deals with a constant stream of inputs.

It is well studied that RNNs suffer from vanishing gradients [72] when they are trained with first-order optimization techniques. One solution proposed to deal with this problem is the Long Short-Term Memory (LSTM) [73]. It can be interpreted as a specific function f that is computed in the following way:

[ f_t; i_t; o_t; g_t ] = W_xh x_t + W_hh h_{t−1}    (3.12)

c_t = σ(f_t) ⊙ c_{t−1} + σ(i_t) ⊙ tanh(g_t)    (3.13)

h_t = σ(o_t) ⊙ tanh(c_t)    (3.14)

Here σ is the logistic sigmoid function and ⊙ the element-wise multiplication. Considering input dimension n and hidden dimension m, this leads to W_xh ∈ R^{4m×n} and W_hh ∈ R^{4m×m}, which are model parameters that are initialized randomly and then learned from data. h_0 and c_0 correspond to the initial states, which are usually filled with zeros.

LSTMs introduce several changes compared to standard RNNs. The main component of the LSTM is the additive update of the cell state c_t, which allows for an easy flow of information (in both forward and backward passes). Furthermore they allow for fine-grained manipulation of the cell state via different gates. The input gate i_t controls how much new information is added, the forget gate f_t determines how much of the current information is dismissed, and the output gate o_t regulates which parts can leave the LSTM. For a more detailed explanation we refer the reader to [68].
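The sketch below runs a single LSTM layer over a short sequence using PyTorch's built-in cell, which implements the gated update of Equations 3.12-3.14; all dimensions and the dummy input are illustrative.

```python
# Sketch: unrolling an LSTM cell over a short input sequence.
import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=10, hidden_size=32)
h_t, c_t = torch.zeros(1, 32), torch.zeros(1, 32)   # initial states h_0, c_0
for x_t in torch.zeros(5, 1, 10):                   # dummy sequence of length 5
    h_t, c_t = cell(x_t, (h_t, c_t))                # gated update of (h_t, c_t)
```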

3.2.3 Gradient-Based Learning

So far we have discussed how our models are defined, but not yet how they are adjusted based on a data set (in the supervised setting) or on experience (in the RL setting). In neural networks this learning is most commonly performed through iterative, gradient-based optimization.

For this procedure the user specifies a task-dependent loss function L(θ), which is used to evaluate the current prediction ŷ^(i) of the ith sample against its ground truth y^(i). For example, if we are faced with a regression task, the mean squared error (MSE) is a sensible choice as a loss function:

L(θ) = (1/m) Σ_{i=1}^{m} ||y^(i) − ŷ^(i)||²_2
     = (1/m) Σ_{i=1}^{m} ||y^(i) − f(x^(i); θ)||²_2    (3.15)

where f denotes the function implemented by the chosen model with parameters θ.

This process of evaluating predictions of a neural network is often referred to as forward pass.

In order to employ gradient-based optimization techniques, we need to obtain the gradients of the defined loss w.r.t. the parameters of the model, ∇_θ L(θ). If all functions in our model are differentiable, these can be obtained with the backpropagation algorithm [74]. Backpropagation itself essentially boils down to an application of the chain rule, which can be applied recursively to obtain the derivatives w.r.t. the parameters of each layer in a neural network. This procedure is also referred to as the backward pass.

Figure 3.4: Left: an RNN with one hidden layer and activation function f. Right: the equivalent network unfolded over time. All components in the same row share the same set of weights.

In the plain version of stochastic gradient descent an update is performed by adapting the parameters in the direction that minimizes the loss function:

θ = θ − α ∇_θ L(θ)    (3.16)

where α is the step size or learning rate.

In practice this update is performed over a mini-batch of samples and the procedure is repeated until convergence. It is also common to employ more sophisticated optimizers that include momentum terms and adaptive learning rates, such as RMSprop [75] or Adam [76].
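A compact sketch of one such training step is given below; the linear model, the random mini-batch and the chosen optimizer are all placeholders.

```python
# Sketch: one mini-batch gradient step (Equations 3.15-3.16) with the Adam optimizer.
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 4), torch.randn(32, 1)       # illustrative mini-batch
loss = nn.functional.mse_loss(model(x), y)          # forward pass and loss L(theta)
optimizer.zero_grad()
loss.backward()                                     # backward pass (backpropagation)
optimizer.step()                                    # parameter update
```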

It may not be directly clear how to apply backpropagation to RNNs, due to their cyclic connections. The most prominent approach to calculate the gradients w.r.t. the parameters of an RNN is backpropagation through time (BPTT) [77]. It assumes that the weights remain constant while training on a sequence of inputs. Due to this the RNN can be unfolded as shown in Figure 3.4. This leads to a feed-forward network with lateral connections, to which we can apply standard backpropagation. When dealing with long sequences it is common to truncate the BPTT procedure. In these cases BPTT is performed every k steps for m steps (usually k = m).

The learning methods from this section can also be applied in a straightforward fashion to the RL setting. For this we replace the supervised targets ŷ with estimates of value functions, which are obtained through either Monte-Carlo (Section 3.1.2) or Temporal-Difference (Section 3.1.3) methods.

3.2.4 Attention Mechanisms

Attention mechanisms have recently gained popularity in sequence-to-sequence modelling tasks [78]. Their core component is a trained attention module that decides on which part of the representation it wants to focus. In the context of this thesis we will employ an attention mechanism to define suitable goal spaces. Therefore we will briefly review the fundamental concepts in this section.

The attention module outputs a vector a_t ∈ R^n, to which we refer as the attention weights. The attention weights describe how much attention we would like to pay to certain parts of a given input. Usually we define the attention weights such that they sum to one, which is achieved through the softmax function. Given an input that consists of k vectors of size n, X_t ∈ R^{k×n}, we define the attention mechanism as:

z_t = X_t a_t = X_t · exp(f_att(X_t, h_{t−1})) / Σ_i exp(f_att^(i)(X_t, h_{t−1}))    (3.17)

where f_att : R^{k×n} → R^n is a (potentially recurrent) neural network and f_att^(i) indicates the ith element of its output.

The variant in Equation 3.17 is commonly referred to as soft attention. It has the advantage that it is fully differentiable, meaning that it can be plugged into any existing architecture and be trained end-to-end with the rest of the network.

Soft attention attends to all parts of the input to varying degrees. In some cases we would like a behaviour that makes sharper decisions on which parts to attend to. This technique is referred to as hard attention [79, 80]. To obtain such behaviour we add an additional sampling step from the categorical distribution defined by the softmax in Equation 3.17. This means a_t will become a one-hot vector representing the sampled class, and thus the module will only attend to a specific part of the input while ignoring the rest.

Using hard attention has the downside that we cannot derive gradients through the attention module analytically anymore. However, we can still learn its parameters through RL methods, such as REINFORCE or its variance-reduced modifications.
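A small sketch of the soft-attention variant of Equation 3.17 is given below; the feed-forward stand-in for f_att and all dimensions are assumptions made for this example.

```python
# Sketch: soft attention over an input X_t of k vectors of size n (Equation 3.17).
import torch
import torch.nn as nn

k, n, hidden = 5, 16, 32
f_att = nn.Sequential(nn.Linear(k * n + hidden, 64), nn.Tanh(), nn.Linear(64, n))

X_t = torch.randn(k, n)                              # input vectors
h_prev = torch.randn(hidden)                         # previous hidden state h_{t-1}
scores = f_att(torch.cat([X_t.flatten(), h_prev]))   # f_att(X_t, h_{t-1})
a_t = torch.softmax(scores, dim=0)                   # attention weights, sum to one
z_t = X_t @ a_t                                      # z_t = X_t a_t
```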


3.3 Hierarchical Reinforcement Learning

Hierarchical Reinforcement Learning describes a set of methods that attempt to extend standard RL agents with some form of hierarchical structure. We have already mentioned some of these techniques in Chapter 2.

Learning good temporal abstractions is one of the key challenges in RL. It not only allows decomposing the task into sub-components that are easier to solve, but also enables reusing already learned solutions, potentially speeding up the training process by orders of magnitude.

In this section we will review the options framework [25], which is a popular choice for incorporating temporal abstractions into existing RL methods. Furthermore we outline Universal Value Function Approximators (UVFA) [30], which allow transferring knowledge between different options. Together these techniques form the core of the approach proposed in the following chapter.

3.3.1 Options Framework

Options [25] extend the agent’s available behaviour through a set of temporally extended actions. Instead of choosing a primitive action, the agents is allowed to select an option from a set Ω, which is then followed until a termination criterion is met. Examples for options include grasping an object, opening a door and also the agent’s primitive actions.

Mathematically an option ω is defined as a triple (I_ω, π_ω, β_ω), where I_ω ⊆ S is the initiation set, β_ω : S → [0, 1] the termination condition and π_ω : S → A the intra-option policy. For simplicity it is often assumed that all options are available everywhere, i.e. I_ω = S for all ω.

A set of options defined over an MDP constitutes a semi-MDP [25]. As the theory of semi-MDPs is closely related to that of MDPs, we can carry over many theoretical guarantees when dealing with options. The framework itself does, however, not define how suitable options arise. We have already reviewed several extensions that strive to discover options from experience in Chapter 2.

It is intuitive to think of options as a procedure for achieving a sub-goal, but it may not be immediately clear how such a behaviour emerges. Given that we know the set of suitable sub-goals, we can assign an option to each of these, along with an intrinsic reward function. The intrinsic reward should assign high values for achieving the sub-goal and low values otherwise. If we now learn the intra-option policies based on the intrinsic reward functions, we already obtain a type of goal-directed behaviour.
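The sketch below expresses this interpretation as a tiny data structure: each sub-goal gets an option that is available everywhere, terminates once the goal is reached, and pays an intrinsic reward for doing so. The distance function, threshold and policy argument are placeholders.

```python
# Sketch: a goal as an option (I_omega, pi_omega, beta_omega) with an intrinsic reward.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GoalOption:
    goal: Any
    policy: Callable                     # intra-option policy pi_omega: state -> action

    def initiation(self, state):         # I_omega = S, i.e. available everywhere
        return True

    def terminates(self, state, distance, eps=0.1):
        return distance(state, self.goal) < eps        # beta_omega

    def intrinsic_reward(self, state, distance):
        return 1.0 if self.terminates(state, distance) else 0.0
```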

3.3.2 Universal Value Function Approximators

The interpretation of options as sub-goals provided in the last paragraph would require learning a separate value function V_g(s) for each possible sub-goal g. Learning many independent value functions can be slow and does not scale well to settings with a large number of possible sub-goals.

To circumvent these issues, Universal Value Function Approximators (UVFA) [30] approximate the union of value functions for all goals with V(s, g; θ). In practice this means that the goal is provided as part of the input to the value function approximator. In the simplest case this is realized through concatenation of state and goal. UVFAs are able to exploit additional structure in the goal space and thus enable faster learning and better generalization. Furthermore they scale easily to large or even uncountable sets of goals.
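In code, the simplest form of a UVFA is just a value network whose input is the concatenation of state and goal, as in the hypothetical PyTorch module below.

```python
# Sketch: a UVFA V(s, g; theta) realized by concatenating state and goal.
import torch
import torch.nn as nn

class UVFA(nn.Module):
    def __init__(self, state_dim, goal_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))
```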


4 Methods

4.1 Hierarchical Reinforcement Learning with Goals

In this section we extend the RL framework to include an explicit goal-directed mechanism. For this the agent's internals are separated into two parts – a goal-setting policy π_S, which is responsible for setting goals, and a goal-reaching policy π_R, which has to reach the goals given by the goal-setting policy¹. An illustration describing the concept is given in Figure 4.1. The model is constructed in a way that both parts can be learned simultaneously through arbitrary RL methods. We will however argue later in this chapter that certain choices are more sensible than others. Note that similar architectures have been applied in prior work [21, 22, 81]. The relation to these methods will be further discussed in Section 4.2.

The goal-setting agent operates on some continuous goal space G ⊆ R^k. In the most general case it learns a function φ : S → R^k that maps from the current state s_t to the current activation in goal space z_t. We will also refer to φ(s_t) as the goal embedding in the following. In the initial state, and whenever the current goal was reached, a new goal is selected. We interpret setting goals as actions of the goal-setting policy, represented through π_S : G → R^k. This allows applying standard RL techniques to the goal-setting policy. The goal-setting agent's actions last over multiple time steps and its rewards are defined as the sum of external rewards from the environment along the path towards the goal. The goal-reaching policy π_R on the other hand acts through the agent's atomic actions at each time step and gets rewarded for reaching the goal set by the goal-setting policy. The complete method is outlined in Algorithm 2.

In the following we will first highlight possible advantages compared to standard RL approaches, explained with an idealized example. We describe specific design choices – for both goal-setting and goal-reaching policies – used during the experiments in this report.

We then provide a connection to the options framework, which provides some theoretical grounding for the approach. Finally we highlight connections and differences to related approaches.

¹ Other works also refer to these two policies as high- and low-level policies, or as manager and worker.


Figure 4.1: Illustration of the goal-based framework proposed in this thesis. Detailed information can be found in the corresponding subsections of this chapter.

4.1.1 Motivation

To motivate the presented approach, we highlight its advantages with a short example.

Consider an ant-like insect interacting in an environment with sparse rewards. The agent is controlled through the manipulation of forces. Traditional approaches initialize the agent with random actions and improve those actions upon receiving rewards. However, optimizing behaviour with such methods is impossible if random actions never lead to rewarding cases. Even if a sequence leads to a rare reward scenario, it might be hard to assign credit to the actions that actually caused the reward. Introducing a hierarchy as described above provides one possible approach to these problems and allows the agent to learn meaningful behaviour even in the absence of rewards.

To be more specific, take the mentioned ant agent, which has to solve a labyrinth task. The agent is only rewarded upon reaching a predefined final position. Applying standard RL algorithms will lead to failure, because random exploration in the atomic action space will never cause rewarding states. However, if a goal-setting policy were able to set goals based on the location of the agent, it would have the potential to set way points, which can simplify the search for a good policy. The goal-reaching policy would then be tasked with reaching these way points. This makes it easier to obtain initial rewards and continue learning from there.


Ideally we learn the desired goal space from experience; however, in some domains we can also consider the simpler case of using pre-designed embeddings and only learn transitions in the given space. Learning the goal embedding greatly increases the generality of the procedure, at the cost of introducing additional complexity. We will investigate both cases during later sections of this report.

repeat
    t = 1
    Reset goal
    Get initial state s_1 and goal embedding z_1
    repeat
        if no goal then
            Set goal g based on π_S and z_t
        end
        Perform action a_t based on g, s_t and π_R
        Get next state s_{t+1}, embedding z_{t+1} and extrinsic reward r_t
        Calculate r^R_t based on z_t and z_{t+1}
        Update r^S_t based on r_t
        if π_S update required then
            Update π_S based on experience
        end
        if π_R update required then
            Update π_R based on experience
        end
        if goal reached or g_max steps passed then
            Reset goal
        end
    until terminated
until converged

Algorithm 2: Pseudo-code for the goal-based framework proposed in this section.

4.1.2 Goal-Setting Policy

The goal-setting agent consists of a function φ, mapping to a continuous goal space, and a policy π_S. Both of these components are implemented through deep networks. The policy π_S becomes active at the beginning of an episode or whenever the current goal was just reached. It takes the output of the goal embedding φ as input and outputs a goal g, which is then passed to the goal-reaching policy.

Using continuous goal spaces allows backpropagating through all parts of the model. This means that all internal representations – including the goal space – can be learned in an end-to-end manner. Using a discrete set of goals, on the other hand, would require a sampling process inside the model, through which we cannot readily pass errors backwards (though there are techniques for doing so, see for example [82]).

We may consider constant horizons for each goal or terminate a goal immediately when it is reached. Here we choose to focus on the second option, even though the first one might have its own advantages. We denote the maximum time-span a goal is active in the following as g_max. The reward of the goal-setting policy is based on the extrinsic reward of the environment along the taken path towards the goal, defined as:

r^S_t = Σ_{n=1}^{k} r_{t+n}    (4.1)

where k is bounded from above by g_max.

We have now defined all components of the goal-setting MDP that are required to learn its policy. This means we can directly apply any RL algorithm to this problem. Here we decide to employ A3C (Section 3.1.5) as the method of choice.

It is worth mentioning that the state-transition distribution of the goal-setting MDP, p_S, is dependent on the behaviour of the goal-reaching policy. This means that there might be cases in which goals are not reached. In order to encourage learning based on the transitions of the goal-reaching policy, we propose a modified version of the update rule from Equation 3.8:

∇_θ_S log π_S(z_{t+k} | s_t; θ_S)(R_t − V^{π_S}(s_t; θ_S))    (4.2)

where the goal g_t is replaced with the reached location in goal space z_{t+k}.

The goal-setting policy is, by definition, exposed to less data than the goal-reaching policy, as it operates at a slower time-scale. This could make learning challenging. We will later discuss methods that could enable more data-efficient learning.

4.1.3 Goal-Reaching Policy

The goal-reaching policy π_R is, in contrast to the goal-setting policy, active at every time step and responsible for selecting the agent's primitive actions. Let n be the nth step after a new goal was set at time step t. At each time step t + n the agent receives information about the currently active goal (for example as relative or absolute values). In practice we just pass this information as an additional input to the policy and value networks. This process is a form of UVFA, as described in Section 3.3.2, where the set of possible goals is a continuous vector space. Therefore the agent is able to generalize between all possible goals.

The goal-reaching policies in this report will be learned with the A3C algorithm as described in Section 3.1.5, where policies are represented through deep networks. Network architectures will be described in more detail in later parts of this report. The goal-reaching policy adapts its behaviour to maximize an intrinsic reward r^R_{t+n}, which is based on some distance measure d between the current position in goal space z_{t+n} and the active goal g_t:

r^R_{t+n} = −d(z_{t+n}, g_t)    (4.3)

where the choice of d can be adapted to the given scenario. In practice we can also combine this reward with the extrinsic reward r_{t+n} from the environment. A similar technique, based on a linear combination of both terms, was shown to be beneficial in [22]. Here we will simply add all control-based rewards (if available) from r_{t+n} to r^R_{t+n}. The control-based rewards can include penalties for large actions, friction-based terms and survival bonuses.
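As a small illustration, the intrinsic reward of Equation 4.3 with a Euclidean distance in goal space could be computed as follows; adding a control-based term from the environment is optional, and its exact form is an assumption of this example.

```python
# Sketch: intrinsic reward for the goal-reaching policy (Equation 4.3),
# using the Euclidean distance as the measure d; control_reward is optional.
import numpy as np

def goal_reaching_reward(z_t_plus_n, g_t, control_reward=0.0):
    return -float(np.linalg.norm(z_t_plus_n - g_t)) + control_reward
```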

Name           Domain          Actions     Goals                  Learning            UVFA
HASSLE [23]    Navigation      discrete    ARAVQ                  Advantage Learning  ✗
Horde [39]     Mobile Robots   discrete    Demons                 GQ(λ)               ✗
h-DQN [21]     ATARI           discrete    Entities & Relations   DQN                 ✓
FuN [22]       ATARI, DeepLab  discrete    Learned                A3C*                ✓
DeepLoco [24]  Bullet          continuous  Footsteps              Actor-Critic        ✓
ours           MuJoCo          continuous  Learned                A3C                 ✓

Table 4.1: An overview of methods that use techniques similar to the ones proposed in this thesis. * Note that FuN also uses a modified policy gradient variant in the goal-setting policy.

4.1.4 Interpretation as Options

We can also view this goal-based approach in the light of the options framework of Section 3.3.1. Let us first assume that we have a discrete set of goals, each of which is learned with an independent intra-option policy π_ω. If we now set the initiation set I_ω to S, and β_ω to 1 if the goal is reached or g_max steps have passed and 0 otherwise, we obtain a complete, well-defined set of options Ω. However, we use the defined intrinsic reward r^R (Equation 4.3), instead of the extrinsic reward r, to learn the intra-option policies. If we now represent all policies in the set of options Ω with a UVFA (Section 3.3.2), we are able to share information between possible options, while simultaneously decreasing the number of adjustable parameters. This also enables the use of an infinite set of goals, such as coordinates on a map or activations of a deep network.

4.2 Connections to Related Approaches

There is a small set of prior work that directly considered models similar to the ones proposed in this thesis. We would like to take this section to state connections to these approaches more explicitly. We are aware of five prior approaches that employ some form of hierarchy to learn goal-directed behaviour in the RL setting. We highlight similarities and differences to the method proposed here in Table 4.1.

Inspecting Table 4.1 we note that only one other approach (FuN) is able to learn suitable goal spaces from interaction with the environment. However, FuN has only been applied to settings with discrete actions (although extensions to the continuous case are possible). DeepLoco represents the only other model that employed a goal-based technique for continuous control, but it is less general, as it includes a range of prior assumptions.
