Latent Representation of Tasks for Faster Learning in Reinforcement Learning
FELIX ENGSTRÖM
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: September 9, 2019
Supervisor: Johannes Stork
Examiner: Patric Jensfelt
Swedish title: Latent representation av uppgifter för snabbare inlärning i reinforcement learning
Abstract
Reinforcement learning (RL) is a field of machine learning (ML) which attempts to approach learning in a manner inspired by the human way of learning through rewards and penalties. As with other forms of ML, it is strongly dependent on large amounts of data, the acquisition of which can be costly and time consuming. One way to reduce the need for data is transfer learning (TL), in which knowledge stored in one model can be used to help in the training of another model. In an attempt at performing TL in the context of RL we have suggested a multitask Q-learning model. This model is trained on multiple tasks that are assumed to come from some family of tasks sharing traits. The model combines contemporary Q-learning methods from the field of RL with ideas from variational auto-encoders (VAEs), and thus suggests a probabilistically motivated model in which the Q-network is parameterized on a latent variable z ∈ Z representing the task. This is done in a way which is hoped to allow the use of the Z space to search for solutions when encountering new tasks from the same family of tasks. To evaluate the model we have designed a simple grid-world environment called HillWalk, and two models are trained, each on a separate set of tasks from this environment. The results are then evaluated by comparing against a baseline Q-learning model from the OpenAI project, as well as through an investigation of the final model's behaviour in relation to the latent variable z.
Sammanfattning
Reinforcement learning (RL) is an area of machine learning in which one aims to mimic human learning, where learning happens through interaction with an environment and is guided by these interactions in the form of positive and negative rewards. As with other forms of machine learning, RL is strongly dependent on access to large amounts of data, which can be laborious and time consuming to collect. One way to reduce the data needed to train a new model is transfer learning (TL), where knowledge can be transferred between models, thereby reducing the total amount of data required. In an attempt to perform TL in an RL context we have proposed a multitask Q-learning model. This model is trained on a set of tasks that are assumed to belong to some family of tasks in the sense that there are known similarities between the tasks. The model combines contemporary methods from the RL field with concepts from variational auto-encoders, and thereby proposes a probabilistic Q-network that is parameterized on a latent representation z of the tasks. This is done with the intention that it should allow Z to be used to find solutions to new tasks from the same family. To evaluate the model, a family of tasks and an environment, HillWalk, are defined. With the help of this, two training sets are created, and one model is trained on each. The results are then compared against a baseline algorithm from the OpenAI project, and the final model's behaviour in relation to the z parameter is examined.
1 Introduction
1.1 Introduction
1.2 Research Question
1.3 Limitations
2 Background
2.1 Artificial Neural Networks
2.1.1 Architectures
2.1.2 Learning
2.2 Reinforcement Learning
2.2.1 The Reinforcement Learning Problem
2.2.2 The Solution
2.3 Variational Auto Encoder
2.4 Transfer Learning
3 Related Work
3.1 Related Research
4 Method
4.1 The Model
4.2 Training
5 Experimental Setup
5.1 Environment
5.2 Training
5.2.1 Source Sets
5.2.2 Parameterized Multi-Task Q-Learning
5.2.3 Q-learning Baseline
5.2.4 Training Evaluation
5.2.5 Z-space Evaluation
6 Results
6.1 Training Results
6.2 Latent Space Evaluation
6.3 Discussion
6.3.1 Training
6.3.2 Latent Space
7 Conclusion
7.1 Summary and Conclusion
7.2 Future Research
7.3 Sustainability, Ethics and Societal Aspects
Bibliography
A Results
A.1 Training Results
A.1.1 Source Set 1
A.1.2 Source Set 2
A.2 Interpolation Results
A.2.1 Results from Interpolation 1
A.2.2 Results from Interpolation 2
A.2.3 Results from Random z
Introduction
1.1 Introduction
In recent years there has been a massive surge of popularity and interest in machine learning. Due to better computational resources and the rapidly growing access to data, the field of machine learning has seen great successes in domains such as natural language processing and image recognition. Most of the attention and progress has been in the area of supervised learning, where the approach is to learn from data with a provided solution, trying to use this data to learn in a way that allows for generalization onto new and unseen data.
Although useful in many situations, this method has its limitations.
Firstly, it demands the creation of labeled data, which can be very laborious or even infeasible; secondly, it is easy to see how it is limited in comparison to how humans learn, as they are capable of learning by trying, interacting with the task and changing their behavior in relation to positive and negative rewards. The family of algorithms trying to model this kind of learning is called reinforcement learning (RL).
RL is not a new field, but it has lately been boosted by the general surge of popularity and scientific progress in machine learning, and there have been some notable examples of success. Among the more pronounced successes one can, for example, see [26], which, using reinforcement learning algorithms combined with other methods, succeeded in beating the top Go players in the world, and [21], which has produced an algorithm that can outperform humans on a large number of Atari games.
However, in spite of these recent successes, there is much work to be done. In many of these approaches to RL, an agent that has learned a task is very specialized at this task; see for example the previously mentioned [26, 21]. This means that when searching for a solution to a task similar to one previously solved, this previous solution often cannot be used, and instead the agent needs to be trained from scratch.
Also, RL is limited, among other things, by the vast amounts of data needed to train the algorithms. This, combined with the very specialized nature of the agents, is troublesome.
One solution to the previously described problem is learning to solve tasks in a way that allows us to harness our knowledge about similarities between tasks, and to use the solutions of previously learned tasks as an initialization for the new agent. This concept is called transfer learning and has, for example, yielded great success in the field of image recognition [4].
An approach has previously been suggested in which one assumes that the tasks T_m from a family of tasks M can be parameterized by some latent variable z_m in such a way that one would be able to define a single policy π(a|s, z_m) applicable to all tasks in the family M [22]. Assuming that such a parametrization of the tasks exists and can be found would, given a new task T_new, allow us to use our knowledge about the similarity between the tasks, for example by searching only for the z_new corresponding to the new task.
That this would be helpful is not apparent, but a thought experiment can illustrate the potential gains. Imagine that we have a set of robotic arms. The arms are similar, but with small differences, such as the distance between the joints differing between the robots. Assume that we individually task these arms with trying to solve some single problem and let this be our family of tasks. Finding an optimal policy π(a|s, z), where z is assumed to parametrize the difference between the arms in this family of tasks, would, when faced with a new arm, allow us to search only in Z for the z_new representing the new arm's proportions.
Another scenario could be one where a set of identical arms are trying to push an object to a goal, and the goal differs between the tasks. Learning in the previously described manner would then allow us to search only for the proper task representation z. This thesis hopes to utilize the concept of parametrization in an attempt to provide the outline of a new method that learns in a way which allows the use of previous knowledge when facing new tasks.
1.2 Research Question
The goal of the thesis is to suggest an approach that allows for fast training when faced with a new task T_t by using knowledge acquired from previously encountered tasks {T_s1, ..., T_sn} with known similarities to T_t.
This thesis will aim at suggesting a method similar to that of [22], except that it will attempt to train the policy in a manner that gives us some knowledge about the distribution of the latent variable z. Such knowledge can hopefully be used to further speed up the adaptation to the new task. The aims of this thesis are thus:
• To suggest a more principled way of finding the distribution over Z and to evaluate this new method.
The hypothesis being that:
• A latent representation z_m can be found for every task T_m ∈ M, together with a function q(a, s, z), such that z_m encodes the features separating T_m from the other tasks in M, allowing q(a, s, z) to accurately decide the value of the (a, s) pair for each given task from the family M.
• Given that such a representation can be found, it can be used in the process of learning the policy for a new task T_new by searching only for a suitable z_new, harnessing the knowledge of the distribution over z.
1.3 Limitations
The main limitation is in the choice of the environment. The thesis
will only be evaluated on a single task designed to fulfill some desired
prerequisites. That also includes limiting the experiments to a discrete
action space.
Background
2.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are a family of algorithms loosely inspired by the human brain. There exists a plethora of ANNs with a variety of different uses. The following section confines itself to some essential types and concepts relevant to this thesis. The descriptions largely follow the explanations given in [10].
2.1.1 Architectures
Feedforward Neural Network
In this quintessential ANN one tries to approximate some function f*(x) by combining a number of functions f_1(x), f_2(x), ..., f_n(x) in a chain such that f*(x) ≈ f_n(f_{n-1}(... f_1(x))). One of the most common and basic types of feedforward networks is the multi-layered perceptron (MLP) with fully connected layers. For a fully connected MLP with n layers, the layers can be described as

\[ f_i(h_i) = g_i\left(W_i^{\top} h_i + b_i\right) = h_{i+1}, \tag{2.1} \]

for 0 ≤ i < n, with h_0 = x and h_n = o, where x is the input and o the output of the network. All but the last layer are usually referred to as hidden layers, because their output is hidden with regard to the final result. Thus, they are distinguished from the final layer, which is called the output layer.
Some observations can be made regarding the parameters W_i and b_i. Firstly, W_i and b_i, referred to as the weights and bias respectively, are the parameters by which the network is trained. Secondly, each W_i corresponds to an N_i × M_i matrix, where one dimension has to match the dimensionality of the layer's input h_i and the other determines the dimensionality of its output h_{i+1}; the latter is usually referred to as the width of the layer. Networks are often visualized with graphs similar to the one seen in Figure 2.1. The width of the hidden layers and the number of layers n in the network are two of the major design choices that need to be made when designing an MLP. Another such choice is the function g_i; this is called an activation function and is described later in Section 2.1.1.
Although the type of layers described above is prevalent, it should be said that other configurations are possible. One might, for example, have connections skipping layers in an attempt to reduce the number of parameters for the network to train, or the kind of connections seen in residual networks, where the output of a layer is combined with the output of the next layer [14].
Figure 2.1: Graph of a fully connected ANN with input X = [x_1, ..., x_5] and output Y = [y_1, ..., y_4]. It has two hidden layers and an output layer. In the graph, each column of nodes represents a layer, corresponding to a weight matrix W_i, and each node represents a unit, corresponding to a column w_{ji} of that layer.
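To make the layer definition in Equation 2.1 concrete, the following is a minimal sketch of a fully connected MLP forward pass in NumPy; the layer widths and the choice of ReLU activations are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def relu(x):
    # Rectified linear unit, used here as the activation g_i
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Compute h_{i+1} = g_i(W_i^T h_i + b_i) for each layer (cf. Equation 2.1)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W.T @ h + b
        # No activation on the output layer
        h = relu(z) if i < len(weights) - 1 else z
    return h

# Example: input of width 6, two hidden layers of width 4, output of width 2
rng = np.random.default_rng(0)
widths = [6, 4, 4, 2]
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n_out) for n_out in widths[1:]]
print(mlp_forward(rng.normal(size=6), weights, biases))
```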
Convolutional Neural Networks
Convolutional neural networks (CNNs) are a subfamily of feedforward neural networks, prevalently used in applications such as image classification, speech recognition and reinforcement learning [18, 21, 1]. They produce features by performing convolutions and pooling. Convolutional layers are typically used to exploit prior knowledge about the nature of the data, such as stationarity of statistics and locality of dependencies in the input [18].
Recurrent Neural Networks
Recurrent neural networks (RNNs) are in many ways similar to feedforward neural networks, but are more complex in that they allow outputs from layers in the network to be fed back as inputs to previous layers at future time steps.
Activation Function
As previously described, each hidden layer is separated from the next by a non-linear function called an activation function. The main purpose of these activation functions is to make the network non-linear. For a network with layers W_i^⊤ h_{i-1} + b_i = h_i and no activation functions, any two consecutive layers W_i^⊤ h_{i-1} + b_i = h_i and W_{i+1}^⊤ h_i + b_{i+1} = h_{i+1} could always be collapsed to

\[ W_{i+1}^{\top} W_i^{\top} h_{i-1} + W_{i+1}^{\top} b_i + b_{i+1} = W'^{\top} h_{i-1} + b', \]

where W'^⊤ = W_{i+1}^⊤ W_i^⊤ and b' = W_{i+1}^⊤ b_i + b_{i+1}, which in turn would allow us to recursively collapse any network of arbitrary depth into a single linear transform W'^⊤ x + b' = o.
There are numerous plausible activation functions. Previously the most common choice was either the logistic sigmoid function g(z) = σ(z) or the hyperbolic tangent function g(z) = tanh(z), but these days the rectified linear unit (ReLU), first described in [9], is considered to be the best default choice for a feedforward network, as it prevents saturation [10]. There are also variations on the ReLU, such as the leaky ReLU, g(z) = max(0, z) + min(αz, 0) for α equal to some small number like 0.01, which allows for the reactivation of units by giving deactivated units a small gradient [20].
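As a small illustration of these choices, here is a sketch of the sigmoid, tanh, ReLU and leaky ReLU activations in NumPy; the α = 0.01 value is the example mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(0, z) + min(alpha * z, 0): a small negative slope keeps a gradient
    # flowing through deactivated units
    return np.maximum(0.0, z) + np.minimum(alpha * z, 0.0)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
```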
Output Function
The last layer can be considered unique compared to the hidden layers in that it does not need to be followed by an activation function. That being said, a similar function is often applied, but this time it is done to adapt the output to the target data. There is a strong connection between the choice of output function and loss function, which together are of great importance to the network design. Using no output activation is, for example, common when modelling the target as a conditional Gaussian distribution when doing regression, while for a binary classification problem a sigmoid function is often used, thus maximizing the log-likelihood of a Bernoulli distribution. Many other choices are available, with the general goal of adapting the output of the last layer to fit the input space of the appropriate loss function.
Loss Function
As hinted in the previous section, the choice of loss function is important when constructing an ANN. For many parametric models a distribution p(y|x, θ) is defined, where y is the target data, x the input data and θ the model parameters, allowing for the use of a maximum-likelihood approach. Given this maximum-likelihood approach one can use the cross-entropy loss function, which is simply

\[ J(\theta) = -\mathbb{E}_{(x,y) \sim p_{\text{data}}}\left[\log p_{\text{model}}(y \mid x)\right]. \tag{2.2} \]

Further assuming that p(y|x, θ) = N(y; f(x; θ), I), which is common when performing regression, gives

\[ J(\theta) \propto \mathbb{E}_{(x,y) \sim p_{\text{data}}}\left[\lVert y - f(x; \theta) \rVert^{2}\right], \tag{2.3} \]

which amounts to calculating the mean square error (MSE) on the data.
The choice of the loss function is crucial and captures the assumptions made about the data, and there are numerous other choices for loss functions, all with their implications and interpretations. For more examples see [10].
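As a minimal illustration of Equations 2.2 and 2.3, the following sketch computes the MSE (proportional to the Gaussian negative log-likelihood) and the Bernoulli cross-entropy on a small batch; the data is made up purely for illustration.

```python
import numpy as np

def mse_loss(y, y_pred):
    # Proportional to the negative log-likelihood of y under N(y; y_pred, I)
    return np.mean((y - y_pred) ** 2)

def bernoulli_cross_entropy(y, p_pred, eps=1e-12):
    # Negative log-likelihood of binary targets under a Bernoulli model
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))

y_reg = np.array([0.5, -1.2, 3.0])
print(mse_loss(y_reg, np.array([0.4, -1.0, 2.5])))

y_cls = np.array([1, 0, 1])
print(bernoulli_cross_entropy(y_cls, np.array([0.9, 0.2, 0.7])))
```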
2.1.2 Learning
A topic not yet mentioned is how ANNs learn suitable values for their parameters. One of the downsides of many of these networks is that they are non-convex functions, making them hard to optimize.
Gradient Descent
As previously described, the networks will mostly be non-convex, and many methods for optimization will therefore be intractable. Luckily, gradient descent has proven to be a viable method for training the networks, in spite of the fact that it only performs local optimization.
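A minimal sketch of gradient descent on a simple quadratic loss, to make the update rule θ ← θ − α∇L(θ) concrete; the loss, learning rate and number of steps are arbitrary illustrative choices.

```python
import numpy as np

def loss(theta):
    # A simple loss with minimum at theta = (1, -2)
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2

def grad(theta):
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 2.0)])

theta = np.zeros(2)
alpha = 0.1  # step size
for _ in range(200):
    theta = theta - alpha * grad(theta)
print(theta, loss(theta))
```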
2.2 Reinforcement Learning
2.2.1 The Reinforcement Learning Problem
The concept of reinforcement learning (RL) is central to this thesis; the following section contains a brief description of RL and some essential terms related to it. The descriptions largely follow the explanations given in [28].
In RL, one is tasked with teaching an agent to act optimally in some environment. The agent's interaction with the environment is usually described in terms of observations, actions, and rewards, typically modelled in discrete time steps in the form of a Markov decision process (MDP). In this model the agent receives an observation o_t and a reward r_t at each time step t, and given these the agent chooses an action a_t, with which it interacts with the environment, yielding a new observation o_{t+1} and reward r_{t+1}. A diagram of this can be seen in Figure 2.2.
It should be noted that in much of the literature regarding RL, what is here called an observation is called a state. This choice has been made to make a distinction between the two, in order to account for the cases where the state of the environment is only partially observable. Each observation o is thus assumed to be generated by an underlying state s.
Figure 2.2: Graph of the RL agent-environment relationship. The crossed connection going into the environment represents the fact that the action a_t will send the environment into the state yielding observation o_{t+1}.
There are a number of essential parts to this model:
• The observation o ∈ O, where O, called the observation space, is the set of all possible observations.
• The action a ∈ A, where A, called the action space, is the set of all possible actions.
• The reward r ∈ R, where R, called the reward space, is the set of all possible rewards. It should be noted that r is often defined as a deterministic function r(o, a, o') or even r(o, a).
• The policy π by which the action a is chosen given an observation. In its most general definition it is viewed as a distribution π(a|o) over actions given an observation.
• The model p(r, o'|a, o), which defines the probability of transitioning to o' from o given action a and then receiving r as reward.
MDPs can be either episodic or non-episodic. An episodic MDP has a defined start state s_s and end state s_t such that p(s_t|s_s) ≠ 0, and after the end state is reached the episode ends. In non-episodic MDPs there is no such end, and the episodes go on forever. During training and evaluation, data is sampled from MDPs in sequences; these sequences are often called rollouts.
Another central concept is the discounted return at time t, denoted G_t, which is defined as

\[ G_t := \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}, \tag{2.4} \]

where 0 ≤ γ ≤ 1 is called the discount rate. A major purpose of the discount rate is to ensure that the return for non-episodic MDPs remains finite, but it can also be seen as modelling a certain amount of uncertainty in the model one is training, making it prefer rewards closer in time.
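A small sketch computing the discounted return of Equation 2.4 for a finite reward sequence; the rewards and γ = 0.9 are arbitrary example values.

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = sum_i gamma^i * r_{t+i}, computed here for a finite episode
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0, 5.0]))  # rewards from time t onwards
```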
With the concepts of model, policy and return, the value of an observation can be defined as the expected return given the policy. The value of an observation is thus dependent on the policy. That is, for a policy π the value is

\[ V^{\pi}(o) := \mathbb{E}_{\pi}\left[ G_t \mid O_t = o \right] \tag{2.5} \]
\[ = \sum_{a \in A} \pi(a \mid o) \sum_{o' \in O,\, r \in R} p(o', r \mid o, a)\left(r + \gamma V^{\pi}(o')\right). \tag{2.6} \]

One can also describe a Q-function. The Q-function is the value of a pair (o, a) of observation and action and is defined as

\[ Q^{\pi}(o, a) := \mathbb{E}_{\pi}\left[ G_t \mid O_t = o, A_t = a \right], \tag{2.7} \]

and describes the expected return given action a and observation o.

With this framework set up, a goal for the learning can be defined, namely to find the parameters ω for an optimal policy π*(a|o; ω) so as to maximize the expected return,

\[ \pi^{*}(a \mid o; \omega) = \arg\max_{\pi} V^{\pi}(o). \tag{2.8} \]
2.2.2 The Solution
Given the concepts described in the previous section, there are many approaches to RL. The following section describes a few of the major methods and concepts related to this topic.
Tabular Methods
Tabular methods are a family of methods that attempt to solve the RL problem under the assumption that the action space A and the observation space O are both small enough to allow for keeping track of the value function in the form of tables.
Dynamic Programming Dynamic programming algorithms assume that the model p(r, o'|a, o) is known, and look for the best policy by iteratively updating either a policy or a value function, in what is called policy iteration and value iteration respectively. The basic idea is that, if there is a policy π_k(a|o), one can find a value function V^{π_k}(o), which in turn can be used to find an improved policy π_{k+1}(a|o), and so on. Given 0 < γ < 1 this is guaranteed to converge to the best solution given an infinite number of iterations [28]. The changes in each iteration are made based on the current approximation, and the model thus bootstraps itself. Remember that the value function here is a lookup table, and one is thus guaranteed that changing the value of one observation will not affect the value of other observations.
Monte Carlo Methods In many cases the model will not be known, and finding it will be unfeasible. Monte Carlo (MC) methods handle this by approximating the values of states through sampling from the current policy and then updating the value estimates accordingly. The current value function is not used in the update; in contrast to many other algorithms, this is thus not a bootstrapping approach. On the other hand, a large amount of sampling is needed between every update of the value function.
There are various MC methods, and they differ, among other things, in how they handle multiple visits to the same state.
In a simple every-visit MC method (where every visit to a state is used to update the value function, in contrast to, for example, first-visit MC, where only the return from the first visit to a state is used) the update of the value function might, for example, look as follows:

\[ V(o_t) \leftarrow V(o_t) + \alpha\left[G_t - V(o_t)\right], \tag{2.9} \]

where G_t is a sampled return following time t, and α is a constant step-size parameter. That is, the value of the state is updated to be closer to the sampled return.
Temporal Difference Learning Extending the previous ideas, temporal difference (TD) learning uses samples from the current policy in combination with the current value function and bootstraps a new value for the value function. The central idea in TD learning is that one can update the value function using only the current value function, the last reward r_t, the previous observation o_t, and the new observation o_{t+1}. In a basic implementation, this update can be made as

\[ V(o_t) \leftarrow V(o_t) + \alpha\left[r_t + \gamma V(o_{t+1}) - V(o_t)\right], \tag{2.10} \]

which, in contrast to Equation 2.9, bootstraps the current value function in its update.
An important variant of TD learning is the Q-learning method, in which the action-value function Q(o, a) is used and updated according to

\[ Q(o_t, a_t) \leftarrow Q(o_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(o_{t+1}, a') - Q(o_t, a_t)\right]. \tag{2.11} \]

This method is especially well suited for tasks with discrete action spaces with finitely many actions, such that finding arg max_a Q(o, a) is feasible.
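A sketch of the tabular Q-learning update in Equation 2.11, using a dictionary as the Q-table and an ε-greedy behaviour policy; the environment interface (reset()/step(a)) and the hyperparameters are illustrative assumptions, and terminal-state handling is simplified.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # maps (observation, action) -> value
    for _ in range(episodes):
        o, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(o, act)])
            o_next, r, done = env.step(a)
            # Q-learning update (Equation 2.11); bootstrapping at terminal
            # states is ignored for brevity
            target = r + gamma * max(Q[(o_next, act)] for act in range(n_actions))
            Q[(o, a)] += alpha * (target - Q[(o, a)])
            o = o_next
    return Q
```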
Approximate Methods
Although useful in many cases, the tabular approach is limited, as many problems have observation spaces that are far too large for a table. Consider, for example, the input being images from a robot; the set of possible observations O would then consist of all possible constellations of pixel values. A common approach to this is to replace the table of values with an approximate function v_ω(o_t), parameterized by ω. With this approach, many of the previously described methods can be extended to handle large observation spaces. However, it should be noted that many of the guarantees for convergence no longer hold in the approximate case, primarily because when one changes the weights of the parameterized function to better reflect the value of one observation, this also affects the approximation of the values of other observations. This kind of generalization is due to the number of parameters of the function being much smaller than the size of the observation space O, which makes it unable to represent the value of every observation exactly. Generalization has its downsides, but it also makes these models quite powerful, as it allows us to learn things about observations that have not yet been seen.
Many of the most widely used learning methods for the approximate approaches use stochastic gradient descent (SGD). This will, in general, mean searching for a local optimum rather than a global one. A general description of the SGD update for the observation-value function is

\[ w_{t+1} = w_t + \alpha\left[U_t - \hat{v}(o_t, w_t)\right]\nabla \hat{v}(o_t, w_t), \tag{2.12} \]

where U_t is our target for the value function at time t, and \hat{v}(o_t, w_t) is our approximate function. The approximate function can be shown to converge to a local optimum as long as E[U_t] = v(o_t) for the true value function v(o_t), as it for example would if U_t were found through MC sampling. However, even where this does not hold, as in bootstrapping algorithms such as Q-learning, it has proven to be an efficient approach. The methods using unbiased targets are called SGD methods, and the ones with biased targets are in turn called semi-gradient methods.
Deep Q-learning A common choice for the parameterized function is an ANN. These methods, often called deep Q-networks (DQNs), are used in some of the more successful recent works. As they replace the tabular representation of the Q-function Q(o, a) with an ANN, often training it using SGD, they get around the problem of having to store the value of each observation. However, this is not without drawbacks, as the use of parameterized approximate functions introduces a dependency between the value estimates of different observations.
The DQN update step using SGD yields the following expression:

\[ \omega \leftarrow \omega + \alpha\left[r + \gamma \max_{a'} Q(o', a'; \omega) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega). \tag{2.13} \]

Several approaches and tricks have been shown to yield great improvements when applied to this method; [15] shows the results of using a set of them together in an ablation study. In this thesis, a few of these have been used, while others have been left for future study. The methods used for this study are the following (a short code sketch combining them is given after the list):
• Experience Replay: [21] suggests using experience replay, as originally suggested by [19], for DQN learning. This method uses a so-called replay buffer, in which the N last experiences are stored; at each training step, a batch of size k is sampled uniformly from this buffer. This helps to diminish the problems introduced by the strong temporal correlations between samples, as well as allowing for more efficient data usage. As an extension to this, prioritized experience replay has been suggested [24], speeding up training by sampling experiences based on some priority schedule, often using the last TD error of the samples.
• Double Deep Q-learning: Double Q-learning, suggested by [11], is an extension of the Q-learning method, reducing the value overestimation introduced by the max function in Equation 2.11. [12] show with their Double Deep Q-Network (DDQN) algorithm that the Double Q-learning method can be extended to the deep Q-learning setting, yielding great results. In the DDQN algorithm two separate Q-functions, Q(o, a; ω) and Q(o, a; ω'), are used, the latter often called the target network. The update step then uses the target network to approximate the value of the next observation, leading to the following update step:

\[ \omega \leftarrow \omega + \alpha\left[r + \gamma\, Q\!\left(o', \arg\max_{a'} Q(o', a'; \omega); \omega'\right) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega). \tag{2.14} \]

The original Double Q-learning scheme involves alternating the roles and updates of the two Q-functions, for example by randomly choosing which one acts as target in each iteration [11]. [12] use a simpler approach, having a dedicated target network that is a copy of the original Q-function, updated at an interval chosen as a hyperparameter.
• Multi-step Learning: Where one-step learning looks only at the current reward, and MC methods look at the return from an entire episode, the n-step method combines these by holding back for a certain number of steps and then using the return from those steps in combination with bootstrapping to approximate the value of an observation. This approach gives a new update function for an o seen at time t and o' seen at time t + n:

\[ \omega \leftarrow \omega + \alpha\left[G_{o:o'} + \gamma^{n} \max_{a'} Q(o', a'; \omega) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega), \tag{2.15} \]

where G_{o:o'} = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i+1} and n is the number of steps.
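The sketch below combines the three ingredients above — uniform experience replay, a Double-DQN target (Equation 2.14) and an n-step return (Equation 2.15) — for a Q-network represented by a generic callable q(obs) returning a vector of action values, with a separate target network; the interfaces and hyperparameters are assumptions made for illustration, not the thesis implementation.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):          # (o, a, n_step_return, o_next, n)
        self.buffer.append(transition)

    def sample(self, k):
        return random.sample(list(self.buffer), k)

def n_step_return(rewards, gamma=0.99):
    # G_{o:o'} = sum_i gamma^i r_{t+i+1} over the held-back rewards
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def ddqn_targets(batch, q_online, q_target, gamma=0.99):
    """Double-DQN targets with an n-step return for a batch of transitions."""
    targets = []
    for o, a, g, o_next, n in batch:
        a_star = int(np.argmax(q_online(o_next)))       # action chosen by the online net
        targets.append(g + gamma ** n * q_target(o_next)[a_star])
    return np.array(targets)
```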
2.3 Variational Auto Encoder
The following section gives a brief introduction to variational auto-encoders (VAEs) and mainly follows the model presented in [17].
VAEs are an approach to finding encoded representations of data using a variational Bayesian approach. The goal is to create an encoder that can map the data into a lower-dimensional representation space Z in a way that gives us knowledge about the prior p(Z).
Given a dataset X = {x_i}_{i=1}^N of N i.i.d. data points, one starts with the assumption that they are generated by some random process. Each data point is thus assumed to be generated from a conditional distribution p(x|z) depending on a random variable z, which in turn is distributed according to some prior distribution p(z). With this, the goal can be formulated as finding the distribution p(x|z) that maximizes the marginal probability

\[ p(x) = \int_{z \in Z} p(x \mid z)\, p(z)\, dz. \tag{2.16} \]
However, performing this integral is often intractable. To solve this, a recognition model q_φ(z|x) is introduced, with which Equation 2.16 can be rewritten as

\[ p(x) = \int_{z \in Z} q_{\phi}(z \mid x)\, \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)}\, dz \tag{2.17} \]
\[ = \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right]. \tag{2.18} \]

Taking the log of this, to avoid having to optimize the product, and using Jensen's inequality, this can then be rewritten as

\[ \log \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right] \geq \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right]. \tag{2.19} \]

After some rewriting, this leads to the final expression

\[ \log p(x_i) \geq \mathcal{L}(\phi, \theta, x_i), \tag{2.20} \]

with

\[ \mathcal{L}(\phi, \theta, x_i) = -D_{KL}\!\left(q_{\phi}(z \mid x_i)\,\|\,p_{\theta}(z)\right) + \mathbb{E}_{q_{\phi}(z \mid x_i)}\!\left[\log p_{\theta}(x_i \mid z)\right], \tag{2.21} \]

where the recognition model q_φ(z|x) can be shown to approach the true posterior p(z|x) as L(φ, θ, x_i) increases. Due to the i.i.d. assumption about the data X, the log-likelihood can be written as log p(X) = Σ_{i=1}^N log p(x_i), where each term can be implicitly optimized through the variational lower bound L(φ, θ, x_i). To find values for the parameters φ and θ, stochastic gradient descent can be used. To allow for this, z is re-parameterized using a differentiable function g_φ(x, ε) of an auxiliary noise variable ε ∼ p(ε), which allows us to approximate

\[ \mathbb{E}_{q_{\phi}(z \mid x_i)}\!\left[f(z)\right] \simeq \frac{1}{K} \sum_{k=1}^{K} f\!\left(g_{\phi}(x, \epsilon_k)\right), \tag{2.22} \]

for ε_k ∼ p(ε).
Under the given restrictions, a general stochastic gradient variational Bayes estimator has now been defined. What the actual implementation will look like depends largely on the choice of distribution for the recognition model q_φ(z|x), the prior p(z), and the kind of parameterized function to use. A common choice is to use an ANN as the parameterized function, let p(z) = N(0, I) and q_φ(z|x) = N(z; μ(x), σ²(x)I). Figure 2.3 shows the general design of such a VAE.
Figure 2.3: Graph depicting the ANN architecture for a VAE. Here X is the input and X' is the output; the encoder produces μ and σ, from which z is sampled using ε ∼ N(0, 1), and the decoder maps z back to X'.
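A minimal sketch of the reparameterization trick and the lower bound of Equation 2.21 for the common Gaussian choice above, using NumPy and a single Monte Carlo sample (K = 1); the encoder and decoder callables are stand-ins assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form D_KL( N(mu, sigma^2 I) || N(0, I) )
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def elbo(x, encoder, decoder):
    """One-sample estimate of L(phi, theta, x) from Equation 2.21."""
    mu, log_var = encoder(x)
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization: z = g(x, eps)
    x_recon = decoder(z)
    # Gaussian decoder with unit variance: log p(x|z) = -0.5 * ||x - x_recon||^2 + const
    log_lik = -0.5 * np.sum((x - x_recon) ** 2)
    return log_lik - kl_to_standard_normal(mu, log_var)
```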
2.4 Transfer Learning
Most ML algorithms are concerned with problems where the training and test data are assumed to be drawn from the same feature space and distribution. As a consequence, the model often needs to be retrained from scratch when the feature space or distribution changes [23], even though one often knows about similarities between previously learned models and the one to be trained. The field of transfer learning (TL) thus tries to decrease the amount of data needed when encountering a new task by using data and knowledge from previously seen tasks. TL has yielded success in many areas of supervised learning [23, 31], such as computer vision and language models [27, 5], and has received a fair amount of attention in the field of RL as well [29].
In its most general form, TL can be described as trying to transfer knowledge from one or more source tasks T_s in a source domain D_s onto another task or set of target tasks T_t solved in a target domain D_t. A domain can be said to consist of two parts: a feature space 𝒳 and a marginal distribution P(X) for X = {x_1, ..., x_n} ∈ 𝒳. Given a domain, a task T is defined to consist of two components: a label space Y and an objective predictive function f(·). Given a source domain D_s, a source task T_s, a target domain D_t and a target task T_t, the goal of TL is to help improve the learning of the target predictive function f_t(·) by using the knowledge in D_s and T_s, where D_s ≠ D_t or T_s ≠ T_t [23].
When using TL in an RL setting, the same concepts can be used to describe the TL task; however, one may be more explicit about what the different parts relate to, having the domain D describe an MDP in the form of a tuple D ≡ (S, A, R, p(r|s, s', a), p(s'|s, a)), where S is the observation space, A the action space, R the reward space, p(r|s, s', a) the reward function and p(s'|s, a) the transition function [16]. A task could then, for example, be described as T = (A, π*(·|s)), that is, trying to find the optimal policy for the MDP.
TL comes in a variety of shapes, as there are numerous choices regarding which parts of the model one allows to change between the source and target domain; in addition, one has to decide how the knowledge will be transferred. A common approach is the use of pre-trained networks, either as a way of guiding the early training of deep networks or as a way of reducing the amount of data needed for training [27, 5]. [29] shows, in the RL setting, how different approaches allow for changes in different parts of the MDP between source and target, and various ways in which the knowledge can be passed between models.
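As a sketch of the pre-trained-network style of transfer mentioned above, the snippet below copies weights from a source MLP into a target MLP and only continues training the final layer; the weight-list representation reuses the mlp_forward sketch from Section 2.1.1 and is an illustrative assumption, not a method from the thesis.

```python
def transfer_weights(source_weights, source_biases):
    """Initialize a target network by copying the source parameters."""
    target_weights = [W.copy() for W in source_weights]
    target_biases = [b.copy() for b in source_biases]
    return target_weights, target_biases

def finetune_last_layer(weights, biases, grad_W_last, grad_b_last, alpha=0.01):
    # Earlier layers are kept frozen; only the output layer is updated
    weights[-1] = weights[-1] - alpha * grad_W_last
    biases[-1] = biases[-1] - alpha * grad_b_last
    return weights, biases
```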
Related Work
3.1 Related Research
This thesis is concerned with learning an RL model parameterized on a task embedding; the following section contains a short survey of previous research with a similar focus.
There is a multitude of research on the topic of TL in RL; [29] gives an overview of older research. Recent research includes concepts such as meta-learning, domain randomization, and different types of parametrization.
In meta-learning, one tries to use the source tasks to learn how to learn. An example of this is [8], in which the learning is structured to learn a disentangled parametrization that allows for faster learning when encountering an unseen task, by training the model with a meta-objective. For tasks T_i sampled from p(T), with loss function L_{T_i}, a parameterized function f_θ and a one-step gradient update θ'_i = θ − α∇_θ L_{T_i}(f_θ), the meta-objective is for example defined as

\[ \min_{\theta} \sum_{T_i \sim p(T)} L_{T_i}\!\left(f_{\theta'_i}\right) = \sum_{T_i \sim p(T)} L_{T_i}\!\left(f_{\theta - \alpha \nabla_{\theta} L_{T_i}(f_{\theta})}\right). \tag{3.1} \]

The model is then optimized in relation to this meta-objective using SGD. This meta-objective can be described as trying to find the set of parameters θ that optimizes learning on the tasks from the distribution p(T).
Domain randomization is generally concerned with trying to transfer policies learned in simulation to the real world. An example of such a method is [30], in which an agent is trained on simulated data with domain randomization in a way which allows it to generalize to real-world tasks. More specifically, this is done by varying several features, for example lighting, the orientation of objects, and camera positioning, between each simulated sample used for training.
Parametrization has been used in various ways in RL. [25] trains a universal policy parameterized on goals, using multiple tasks with different goals to train a single policy that can generalize to new, unseen goals. This is done by performing end-to-end learning, training an MLP on data samples created by concatenating the goal state with the state and action. Notably, they find that it is more efficient to do this in a two-stage process, where in a first stage the goal and the state-action pair are separately embedded using matrix factorisation, and in a second stage these are combined through a simple linear function to generate the final output. Building upon the work of [25], Hindsight Experience Replay [2] is an approach to more efficiently use data gathered in tasks with well-defined target states. This is done by parametrizing the policy on the target and reusing the data with other states as targets. In addition to the data points generated in the rollouts, artificial data points are also added to the experience replay buffer. These artificial data points are generated by changing the target state to one reached during a rollout and adding previously seen transition tuples with the reward recalculated using this new target. [16] suggests a method in which a disentangled state representation is learned, allowing the agent to disregard uninformative changes between different tasks. In a three-stage process, they first train a disentangled representation of states using a VAE on the source tasks; in the second stage they use this encoder to train a policy on the source tasks; and finally, they transfer this policy onto a target task.
More similar to the approach suggested in this thesis are, for example, [22], [13] and [3], which all suggest a general approach for parameterizing the policy based on tasks. [22] learns the policy and the latent representation of the task simultaneously through SGD, together with an annealing scheme, in the hope of achieving a policy π that is smooth with respect to the latent representation z. This approach does not use any knowledge about the distribution p(z), and instead finds suitable values for z through further iterations of SGD. [13] uses an approach based on a variational bound, but in contrast to what is done in this thesis, they use entropy-regularized RL. [3] uses a loss function very similar to the one described in this thesis, but instead of Q-learning they perform policy-gradient learning to learn a policy for a continuous action space. Also, their approach differs from the method suggested in this thesis in that it requires policies for the source tasks, and thus only learns the actual parametrization of the policy.
Method
4.1 The Model
The following section describes a model for training a single policy for a set of tasks, parameterized on a latent variable z_t representing each task.
As has previously been pointed out, it might be beneficial to harness the knowledge about similarities between tasks to allow for faster learning when encountering new tasks. The assumption is that there exist families of tasks T where parts of the policy can be shared among the members. By disentangling the parts that can be shared between the members from those that cannot, the hope is to find a lower-dimensional space Z, encoding the parts that differ between the tasks, to search when encountering an unseen task τ_t ∈ T. Thus, given a family of tasks T, the goal is to find one single policy π_ω(o, z_i), where z_i is a latent representation of a task τ_i and o is an observation, such that π_ω(o, z_i) ≈ π*_{τ_i}(o), where π*_{τ_i}(o) is the optimal policy for the task τ_i. Also, it would be beneficial to know about the distribution p(Z) over Z, to further help in a search over the latent space Z like the one suggested in [22]. In contrast to [22], and inspired by the method described in [17], we suggest finding the latent representation z in a manner that gives us knowledge about its distribution p(Z).
As a first step towards this goal, a probabilistic interpretation of Q-learning is formulated. Under the assumption that r ∼ N(Q_ω(o, a) − γ max_{a'} Q_ω(o', a'), αI), and given an i.i.d. dataset D = {(o_i, a_i, r_i, o'_i)}_{i=0}^{N}, the Q-learning target

\[ \sum_{(o, a, r, o') \in D} \left(r + \gamma \max_{a'} Q_{\omega}(o', a') - Q_{\omega}(o, a)\right)^{2} \tag{4.1} \]

can be reinterpreted as trying to maximize the likelihood of R_D, as

\[ \log p(R_D \mid O_D, O'_D, A_D, \omega) = \sum_{(o, a, r, o') \in D} \log p(r \mid o, o', a, \omega) \]
\[ = \sum_{(o, a, r, o') \in D} \log \mathcal{N}\!\left(r;\, Q_{\omega}(o, a) - \gamma \max_{a'} Q_{\omega}(o', a'),\, \alpha I\right) \]
\[ \propto -\sum_{(o, a, r, o') \in D} \left(r + \gamma \max_{a'} Q_{\omega}(o', a') - Q_{\omega}(o, a)\right)^{2}. \tag{4.2} \]

In a manner analogous to the one suggested in [17], this expression can be assumed to depend on a latent variable z, which here is assumed to encode the tasks, and can thus be written as a marginal over the latent variable,
\[ \log p(r \mid o, o', a, \omega) = \log \int_{z \in Z} p(r, z \mid o, o', a, \omega)\, dz, \tag{4.3} \]

where Z is the set of all possible z. Again, following the method presented in [17], the expression can be bounded using a recognition model q(z) as

\[ \log p(r \mid o, o', a, \omega) \geq \mathbb{E}_{z \sim q(z)}\!\left[\log p(r \mid o, o', a, z, \omega)\right] - D_{KL}\!\left[q(z)\,\|\,p(z)\right]. \tag{4.4} \]

As the task each data point is sampled from is known, this information can be used in the hope that it helps during learning; the network is thus provided with a task ID, giving q(z|c), where c ∈ {1, ..., C} and C is the number of training tasks. This gives

\[ \mathbb{E}_{z \sim q(z \mid c)}\!\left[\log p(r \mid o, o', a, z, \omega)\right] - D_{KL}\!\left[q(z \mid c)\,\|\,p(z)\right]. \tag{4.5} \]

Combining Equations 4.2, 4.3 and 4.5 we get
\[ \log p(R \mid O, O', A, \omega) \geq \sum_{i=1}^{N} \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] - D_{KL}\!\left[q(z \mid c_i)\,\|\,p(z)\right]. \tag{4.6} \]
As in [17], the prior can be chosen as p(z) = N(0, I). This is motivated by the fact that any distribution over d dimensions can be represented by taking d normally distributed variables and mapping them through a sufficiently complicated function [7], which in this case is a neural network. Further, if the recognition model is chosen as q(z|c_i) = N(z; μ_i, σ_i²I), this allows D_KL[q(z|c_i) || p(z)] to be calculated in closed form [17]:

\[ D_{KL}\!\left[q(z \mid c_i)\,\|\,p(z)\right] = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right), \tag{4.7} \]

where σ_ij and μ_ij are the j-th elements of σ_i and μ_i respectively, and J is the dimensionality of the latent variable z.
For a single data point, we thus want to optimize

\[ \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right), \tag{4.8} \]

where the first part can be approximated as

\[ \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega), \tag{4.9} \]

yielding

\[ \mathcal{L}(\omega) = -\left( \frac{1}{L} \sum_{l=1}^{L} \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega) + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right) \right) \tag{4.10} \]

as the loss function.
Moreover, the assumption that r ∼ N(Q_ω(o, a) − γ max_{a'} Q_ω(o', a'), αI) gives

\[ \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega) = -\frac{1}{2}\left( \log(2\pi\alpha) + \frac{1}{\alpha}\left(r_i + \gamma \max_{a'} Q_{\omega}(o'_i, a', z_l) - Q_{\omega}(o_i, a_i, z_l)\right)^{2} \right). \tag{4.11} \]
Note that the expression γ max_{a'} Q_ω(o'_i, a', z_l) in Equation 4.11 corresponds to an approximation U(o') of the value of o', and can, in accordance with previous work, be replaced by a different target function U*(o'), which will be used when implementing the Double DQN [12].
4.2 Training
The following section describes how the previously described model can, in a first step, be trained on a set of source tasks T_source = {τ_1, ..., τ_k}.
For training, data is generated by taking actions in the environments using the current Q-function as policy and storing the resulting data points {o, a, o', r, c} in a replay buffer as described in [24]. The actions are chosen ε-greedily using the current policy and a linearly decreasing ε schedule. At each time step, one task is chosen and an action performed. The tasks are assured to be equally represented by choosing them according to a rotating schedule.
For each time step t, a batch of data points D_batch = {d_1, ..., d_n} of size n is sampled from the replay buffer according to weights based on the most recent loss of each data point, where d_i = (o_i, a_i, o'_i, r_i, c_i).
For all of the parameterized functions, ANNs are used as approximators and trained using SGD with the loss function described in Equation 4.10, where the likelihood term log p(r_i|o_i, o'_i, a_i, z_l, ω) is extended using the Double-Q- and multi-step-learning algorithms described in Section 2.2.2. Changing o and o' to refer to observations made at time t and t + n respectively, and the reward r to the cumulative return g_{o:o'} between these observations, the same steps as in Equation 4.2 give

\[ \log p(g_{o:o'} \mid o, o', a, z_l, \omega) = -\frac{1}{2}\left( \log(2\pi\alpha) + \frac{1}{\alpha}\left(g_{o:o'} + \gamma^{n} Q\!\left(o', \arg\max_{a'} Q(o', a'; \omega, z_l); \omega', z_l\right) - Q(o, a; \omega, z_l)\right)^{2} \right). \tag{4.12} \]

Figure 4.1 depicts the general architecture used for this thesis.
The network consists of two encoders: one variational encoder for the mapping from the task ID C to the latent representation Z, and an observation encoder, in our case an MLP, which transforms an observation o into a one-dimensional feature vector x; finally, there is a Q-function approximator. Usually, the encoding of the observation would be done implicitly by the Q-function, but here it is done explicitly to allow us to concatenate it with the latent representation z before inputting it to the Q-network.
Figure 4.1: Network architecture. The task ID C is passed through the task encoder (using noise ε ∼ N(0, 1)) to produce the latent representation Z; the observation O is passed through the state encoder to produce the feature vector X; X and Z are then fed to the Q-function, producing Q(O, A, Z), which enters the loss L(O, A, R, O', Z; ω) together with the reward R.