Latent Representation of Tasks for Faster Learning in Reinforcement Learning
FELIX ENGSTRÖM
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: September 9, 2019
Supervisor: Johannes Stork
Examiner: Patric Jensfelt
Swedish title: Latent representation av uppgifter för snabbare inlärning i reinforcement learning
Abstract
Reinforcement learning (RL) is a field of machine learning (ML) which attempts to approach learning in a manner inspired by the human way of learning through rewards and penalties. As with other forms of ML, it is strongly dependent on large amounts of data, the acquisition of which can be costly and time consuming. One way to reduce the need for data is transfer learning (TL), in which knowledge stored in one model can be used to help in the training of another model. In an attempt at performing TL in the context of RL we have suggested a multitask Q-learning model. This model is trained on multiple tasks that are assumed to come from some family of tasks sharing traits. The model combines contemporary Q-learning methods from the field of RL with ideas from variational auto-encoders (VAEs), and thus suggests a probabilistically motivated model in which the Q-network is parameterized on a latent variable z ∈ Z representing the task. This is done in a way which is hoped to allow the use of the Z space to search for solutions when encountering new tasks from the same family of tasks. To evaluate the model we have designed a simple grid-world environment called HillWalk, and two models are trained, each on a separate set of tasks from this environment. The results are then evaluated by comparing against a baseline Q-learning model from the OpenAI project, as well as through an investigation of the final model's behaviour in relation to the latent variable z.
Sammanfattning
Reinforcement learning (RL) is an area of machine learning in which one aims to mimic human learning, where learning happens through interaction with an environment and is guided by these interactions in the form of positive and negative rewards. As with other forms of machine learning, RL is strongly dependent on access to large amounts of data, which can be laborious and time consuming to collect. One way to reduce the data needed to train a new model is transfer learning (TL), where knowledge can be transferred between models, thereby reducing the total amount of data required. In an attempt to perform TL in an RL context we have proposed a multitask Q-learning model. This model is trained on a set of tasks that are assumed to belong to some family of tasks in the sense that there are known similarities between the tasks. The model combines contemporary methods from the RL field with concepts from variational auto-encoders, and thereby proposes a probabilistic Q-network that is parameterized on a latent representation z of the tasks. This is done with the intention that it should allow Z to be used to find solutions to new tasks from the same family. To evaluate the model, a family of tasks and an environment, HillWalk, are defined. With the help of this, two training sets are created, and one model is trained on each. The results are then compared against a baseline algorithm from the OpenAI project, and the final model's behaviour in relation to the z parameter is examined.
1 Introduction
1.1 Introduction
1.2 Research Question
1.3 Limitations
2 Background
2.1 Artificial Neural Networks
2.1.1 Architectures
2.1.2 Learning
2.2 Reinforcement Learning
2.2.1 The Reinforcement Learning Problem
2.2.2 The Solution
2.3 Variational Auto Encoder
2.4 Transfer Learning
3 Related Work
3.1 Related Research
4 Method
4.1 The Model
4.2 Training
5 Experimental Setup
5.1 Environment
5.2 Training
5.2.1 Source Sets
5.2.2 Parameterized Multi-Task Q-Learning
5.2.3 Q-learning Baseline
5.2.4 Training Evaluation
5.2.5 Z-space Evaluation
6 Results
6.1 Training Results
6.2 Latent Space Evaluation
6.3 Discussion
6.3.1 Training
6.3.2 Latent Space
7 Conclusion
7.1 Summary and Conclusion
7.2 Future Research
7.3 Sustainability, Ethics and Societal Aspects
Bibliography
A Results
A.1 Training Results
A.1.1 Source Set 1
A.1.2 Source Set 2
A.2 Interpolation Results
A.2.1 Results from Interpolation 1
A.2.2 Results from Interpolation 2
A.2.3 Results from Random z
Introduction
1.1 Introduction
In recent years there has been a massive surge of popularity and interest in machine learning. Due to better computational resources and the rapidly growing access to data, the field of machine learning has seen great successes in domains such as natural language processing and image recognition. Most of the attention and progress has been in the area of supervised learning, where the approach is to learn from data with a provided solution, trying to use this data to learn in a way that allows for generalization onto new and unseen data.
Although useful in many situations, this method has its limitations.
Firstly, it demands the creation of labeled data, which can be very laborious or even infeasible; secondly, it is easy to see how it is limited in comparison to how humans learn, as they are capable of learning by trying, interacting with the task and changing their behavior in relation to positive and negative rewards. The family of algorithms trying to model this kind of learning is called reinforcement learning (RL).
RL is not a new field, but it has lately been boosted by the general surge of popularity and scientific progress in machine learning, and there have been some notable examples of success. Among the more pronounced successes one can, for example, see [26], which, using reinforcement learning algorithms combined with other methods, succeeded in beating the top Go players in the world, and [21], which has produced an algorithm that can outperform humans on a large number of Atari games.
However, in spite of these recent successes, there is much work to be done. In many of these approaches to RL, an agent that has learned a task is very specialized at this task; see for example the previously mentioned [26, 21]. This means that when searching for a solution to a task similar to one previously solved, this previous solution often cannot be used, and instead the agent needs to be trained from scratch.
Also, RL is limited, among other things, by the vast amounts of data needed to train the algorithms. This, combined with the very specialized nature of the agents, is troublesome.
One solution to the previously described problem is learning to solve tasks in a way that allows us to harness our knowledge about similarities between tasks, and to use the solutions of previously learned tasks as an initialization for the new agent. This concept is called transfer learning and has, for example, yielded great success in the field of image recognition [4].
An approach has previously been suggested in which one assumes that the tasks T_m from a family of tasks M can be parameterized by some latent variable z_m in such a way that one would be able to define a single policy π(a|s, z_m) applicable to all tasks in the family M [22]. Assuming that such a parametrization of the tasks exists and can be found would, given a new task T_new, allow us to use our knowledge about the similarity between the tasks, for example by searching only for the z_new corresponding to the new task.
That this would be helpful is not apparent, but a thought experiment can illustrate the potential gains. Imagine that we have a set of robotic arms. The arms are similar, but with small differences, such as the distance between the joints differing between the robots. Assume that we individually task these arms with trying to solve some single problem and let this be our family of tasks. Finding an optimal policy π(a|s, z), where z is assumed to parametrize the difference between the arms in this family of tasks, would, when faced with a new arm, allow us to search only in Z for the z_new representing the new arm's proportions.
Another scenario could be one where a set of identical arms are trying to push an object to a goal, and the goal differs between the tasks. Learning in the previously described manner would then allow us to search only for the proper task representation z. This thesis hopes to utilize the concept of parametrization in an attempt to provide the outline of a new method that learns in a way which allows the use of previous knowledge when facing new tasks.
1.2 Research Question
The goal of the thesis is to suggest an approach that allows for fast training when faced with a new task T_t by using knowledge acquired from previously encountered tasks {T_s1, ..., T_sn} with known similarities to T_t.
This thesis will aim at suggesting a method similar to that of [22], except that it will attempt to train the policy in a manner that gives us some knowledge about the distribution of the latent variable z. Such knowledge can hopefully be used to further speed up the adaptation to the new task. The aims of this thesis are thus:
• To suggest a more principled way of finding the distribution over Z and to evaluate this new method.
The hypothesis being that:
• A latent representation z_m can be found for every task T_m ∈ M, together with a function q(a, s, z), such that z_m encodes the features separating T_m from the other tasks in M, allowing q(a, s, z) to accurately decide the value of the (a, s) pair for each given task from the family M.
• Given that such a representation can be found, it can be used in the process of learning the policy for a new task T_new by searching only for a suitable z_new, harnessing the knowledge of the distribution over z.
1.3 Limitations
The main limitation is in the choice of the environment. The thesis
will only be evaluated on a single task designed to fulfill some desired
prerequisites. That also includes limiting the experiments to a discrete
action space.
Background
2.1 Artificial Neural Networks
Artificial Neural Networks (ANNs) are a family of algorithms loosely inspired by the human brain. There exists a plethora of ANNs with a variety of different uses. The following section confines itself to some essential types and concepts relevant to this thesis. The descriptions largely follow the explanations given in [10].
2.1.1 Architectures
Feedforward Neural Network
In this quintessential ANN one tries to approximate some function f*(x) by combining a number of functions f_1(x), f_2(x), ..., f_n(x) in a chain such that f*(x) ≈ f_n(f_{n-1}(... f_1(x))). One of the most common and basic types of feedforward networks is the multi-layered perceptron (MLP) with fully connected layers. For a fully connected MLP with n layers, the layers can be described as

\[ f_i(h_i) = g_i\left(W_i^{\top} h_i + b_i\right) = h_{i+1}, \tag{2.1} \]

for 0 ≤ i < n, with h_0 = x and h_n = o, where x is the input and o the output of the network. All but the last layer are usually referred to as hidden layers, because their output is hidden with regard to the final result. Thus, they are distinguished from the final layer, which is called the output layer.
Some observations can be made regarding the parameters W_i and b_i. Firstly, W_i and b_i, referred to as the weights and bias respectively, are the parameters by which the network is trained. Secondly, each W_i corresponds to an N_i × M_i matrix, where one dimension has to match the dimensionality of the layer's input h_i and the other determines the dimensionality of its output h_{i+1}; the latter is usually referred to as the width of the layer. Networks are often visualized with graphs similar to the one seen in Figure 2.1. The width of the hidden layers and the number of layers n in the network are two of the major design choices that need to be made when designing an MLP. Another such choice is the function g_i; this is called an activation function and is described later in Section 2.1.1.
Although the type of layers described above is prevalent, it should be said that other configurations are possible. One might, for example, have connections skipping layers in an attempt to reduce the number of parameters for the network to train, or the kind of connections seen in residual networks, where the output of a layer is combined with the output of the next layer [14].
Figure 2.1: Graph of a fully connected ANN with input X = [x_1, ..., x_5] and output Y = [y_1, ..., y_4]. It has two hidden layers and an output layer. In the graph, each column of nodes represents a layer, corresponding to a weight matrix W_i, and each node represents a unit, corresponding to a column w_{ji} of that layer.
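To make the layer definition in Equation 2.1 concrete, the following is a minimal sketch of a fully connected MLP forward pass in NumPy; the layer widths and the choice of ReLU activations are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def relu(x):
    # Rectified linear unit, used here as the activation g_i
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Compute h_{i+1} = g_i(W_i^T h_i + b_i) for each layer (cf. Equation 2.1)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W.T @ h + b
        # No activation on the output layer
        h = relu(z) if i < len(weights) - 1 else z
    return h

# Example: input of width 6, two hidden layers of width 4, output of width 2
rng = np.random.default_rng(0)
widths = [6, 4, 4, 2]
weights = [rng.normal(size=(n_in, n_out)) for n_in, n_out in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n_out) for n_out in widths[1:]]
print(mlp_forward(rng.normal(size=6), weights, biases))
```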
Convolutional Neural Networks
Convolutional neural networks (CNNs) are a subfamily of feedforward neural networks, prevalently used in applications such as image classification, speech recognition and reinforcement learning [18, 21, 1]. They produce features by performing convolutions and pooling. Convolutional layers are typically used to exploit prior knowledge about the nature of the data, such as stationarity of statistics and locality of dependencies in the input [18].
Recurrent Neural Networks
Recurrent neural networks (RNNs) are in many ways similar to feedforward neural networks, but are more complex in that they allow outputs from layers in the network to be fed back as inputs to previous layers at future time steps.
Activation Function
As previously described, each hidden layer is separated from the next by a non-linear function called an activation function. The main purpose of these activation functions is to make the network non-linear. For a network with layers W_i^⊤ h_{i-1} + b_i = h_i and no activation functions, any two consecutive layers W_i^⊤ h_{i-1} + b_i = h_i and W_{i+1}^⊤ h_i + b_{i+1} = h_{i+1} could always be collapsed to

\[ W_{i+1}^{\top} W_i^{\top} h_{i-1} + W_{i+1}^{\top} b_i + b_{i+1} = W'^{\top} h_{i-1} + b', \]

where W'^⊤ = W_{i+1}^⊤ W_i^⊤ and b' = W_{i+1}^⊤ b_i + b_{i+1}, which in turn would allow us to recursively collapse any network of arbitrary depth into a single linear transform W'^⊤ x + b' = o.
There are numerous plausible activation functions. Previously the most common choice was either the logistic sigmoid function g(z) = σ(z) or the hyperbolic tangent function g(z) = tanh(z), but these days the rectified linear unit (ReLU), first described in [9], is considered to be the best default choice for a feedforward network, as it prevents saturation [10]. There are also variations on the ReLU, such as the leaky ReLU, g(z) = max(0, z) + min(αz, 0) for α equal to some small number like 0.01, which allows for the reactivation of units by giving deactivated units a small gradient [20].
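As a small illustration of these choices, here is a sketch of the sigmoid, tanh, ReLU and leaky ReLU activations in NumPy; the α = 0.01 value is the example mentioned above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(0, z) + min(alpha * z, 0): a small negative slope keeps a gradient
    # flowing through deactivated units
    return np.maximum(0.0, z) + np.minimum(alpha * z, 0.0)

z = np.linspace(-3, 3, 7)
print(relu(z))
print(leaky_relu(z))
```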
Output Function
The last layer can be considered unique compared to the hidden layers in that it does not need to be followed by an activation function. That being said, a similar function is often applied, but this time it is done to adapt the output to the target data. There is a strong connection between the choice of output function and loss function, which together are of great importance to the network design. Using no output activation is, for example, common when modelling the target as a conditional Gaussian distribution when doing regression, while for a binary classification problem a sigmoid function is often used, thus maximizing the log-likelihood of a Bernoulli distribution. Many other choices are available, with the general goal of adapting the output of the last layer to fit the input space of the appropriate loss function.
Loss Function
As hinted in the previous section, the choice of loss function is important when constructing an ANN. For many parametric models a distribution p(y|x, θ) is defined, where y is the target data, x the input data and θ the model parameters, allowing for the use of a maximum-likelihood approach. Given this maximum-likelihood approach one can use the cross-entropy loss function, which is simply

\[ J(\theta) = -\mathbb{E}_{(x,y) \sim p_{\text{data}}}\left[\log p_{\text{model}}(y \mid x)\right]. \tag{2.2} \]

Further assuming that p(y|x, θ) = N(y; f(x; θ), I), which is common when performing regression, gives

\[ J(\theta) \propto \mathbb{E}_{(x,y) \sim p_{\text{data}}}\left[\lVert y - f(x; \theta) \rVert^{2}\right], \tag{2.3} \]

which amounts to calculating the mean square error (MSE) on the data.
The choice of the loss function is crucial and captures the assumptions made about the data, and there are numerous other choices for loss functions, all with their implications and interpretations. For more examples see [10].
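As a minimal illustration of Equations 2.2 and 2.3, the following sketch computes the MSE (proportional to the Gaussian negative log-likelihood) and the Bernoulli cross-entropy on a small batch; the data is made up purely for illustration.

```python
import numpy as np

def mse_loss(y, y_pred):
    # Proportional to the negative log-likelihood of y under N(y; y_pred, I)
    return np.mean((y - y_pred) ** 2)

def bernoulli_cross_entropy(y, p_pred, eps=1e-12):
    # Negative log-likelihood of binary targets under a Bernoulli model
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_pred) + (1 - y) * np.log(1 - p_pred))

y_reg = np.array([0.5, -1.2, 3.0])
print(mse_loss(y_reg, np.array([0.4, -1.0, 2.5])))

y_cls = np.array([1, 0, 1])
print(bernoulli_cross_entropy(y_cls, np.array([0.9, 0.2, 0.7])))
```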
2.1.2 Learning
A topic not yet mentioned is how ANNs learn suitable values for their parameters. One of the downsides of many of these networks is that they are non-convex functions, making them hard to optimize.
Gradient Descent
As previously described, the networks will mostly be non-convex, and many methods for optimization will therefore be intractable. Luckily, gradient descent has proven to be a viable method for training the networks, in spite of the fact that it only performs local optimization.
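A minimal sketch of gradient descent on a simple quadratic loss, to make the update rule θ ← θ − α∇L(θ) concrete; the loss, learning rate and number of steps are arbitrary illustrative choices.

```python
import numpy as np

def loss(theta):
    # A simple loss with minimum at theta = (1, -2)
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 2.0) ** 2

def grad(theta):
    return np.array([2.0 * (theta[0] - 1.0), 4.0 * (theta[1] + 2.0)])

theta = np.zeros(2)
alpha = 0.1  # step size
for _ in range(200):
    theta = theta - alpha * grad(theta)
print(theta, loss(theta))
```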
2.2 Reinforcement Learning
2.2.1 The Reinforcement Learning Problem
The concept of reinforcement learning (RL) is central to this thesis; the following section contains a brief description of RL and some essential terms related to it. The descriptions largely follow the explanations given in [28].
In RL, one is tasked with teaching an agent to act optimally in some environment. The agent's interaction with the environment is usually described in terms of observations, actions, and rewards, typically modelled in discrete time steps in the form of a Markov decision process (MDP). In this model the agent receives an observation o_t and a reward r_t at each time step t, and given these the agent chooses an action a_t, with which it interacts with the environment, yielding a new observation o_{t+1} and reward r_{t+1}. A diagram of this can be seen in Figure 2.2.
It should be noted that in much of the literature regarding RL, what is here called an observation is called a state. This choice has been made to make a distinction between the two, in order to account for the cases where the state of the environment is only partially observable. Each observation o is thus assumed to be generated by an underlying state s.
Figure 2.2: Graph of the RL agent-environment relationship. The crossed connection going into the environment represents the fact that the action a_t will send the environment into the state yielding observation o_{t+1}.
There are a number of essential parts to this model:
• The observation o ∈ O, where O, called the observation space, is the set of all possible observations.
• The action a ∈ A, where A, called the action space, is the set of all possible actions.
• The reward r ∈ R, where R, called the reward space, is the set of all possible rewards. It should be noted that r is often defined as a deterministic function r(o, a, o') or even r(o, a).
• The policy π by which the action a is chosen given an observation. In its most general definition it is viewed as a distribution π(a|o) over actions given an observation.
• The model p(r, o'|a, o), which defines the probability of transitioning to o' from o given action a and then receiving r as reward.
MDPs can be either episodic or non-episodic. An episodic MDP has a defined start state s_s and end state s_t such that p(s_t|s_s) ≠ 0, and after the end state is reached the episode ends. In non-episodic MDPs there is no such end, and the episodes go on forever. During training and evaluation, data is sampled from MDPs in sequences; these sequences are often called rollouts.
Another central concept is the discounted return at time t, denoted G_t, which is defined as

\[ G_t := \sum_{i=0}^{\infty} \gamma^{i} r_{t+i}, \tag{2.4} \]

where 0 ≤ γ ≤ 1 is called the discount rate. A major purpose of the discount rate is to ensure that the return for non-episodic MDPs remains finite, but it can also be seen as modelling a certain amount of uncertainty in the model one is training, making it prefer rewards closer in time.
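A small sketch computing the discounted return of Equation 2.4 for a finite reward sequence; the rewards and γ = 0.9 are arbitrary example values.

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = sum_i gamma^i * r_{t+i}, computed here for a finite episode
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0, 5.0]))  # rewards from time t onwards
```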
With the concepts of model, policy and return, the value of an observation can be defined as the expected return given the policy. The value of an observation is thus dependent on the policy. That is, for a policy π the value is

\[ V^{\pi}(o) := \mathbb{E}_{\pi}\left[ G_t \mid O_t = o \right] \tag{2.5} \]
\[ = \sum_{a \in A} \pi(a \mid o) \sum_{o' \in O,\, r \in R} p(o', r \mid o, a)\left(r + \gamma V^{\pi}(o')\right). \tag{2.6} \]

One can also describe a Q-function. The Q-function is the value of a pair (o, a) of observation and action and is defined as

\[ Q^{\pi}(o, a) := \mathbb{E}_{\pi}\left[ G_t \mid O_t = o, A_t = a \right], \tag{2.7} \]

and describes the expected return given action a and observation o.

With this framework set up, a goal for the learning can be defined, namely to find the parameters ω for an optimal policy π*(a|o; ω) so as to maximize the expected return,

\[ \pi^{*}(a \mid o; \omega) = \arg\max_{\pi} V^{\pi}(o). \tag{2.8} \]
2.2.2 The Solution
Given the concepts described in the previous section, there are many approaches to RL. The following section describes a few of the major methods and concepts related to this topic.
Tabular Methods
Tabular methods are a family of methods that attempt to solve the RL problem under the assumption that the action space A and the observation space O are both small enough to allow for keeping track of the value function in the form of tables.
Dynamic Programming Dynamic programming algorithms assume that the model p(r, o'|a, o) is known, and look for the best policy by iteratively updating either a policy or a value function, in what is called policy iteration and value iteration respectively. The basic idea is that, if there is a policy π_k(a|o), one can find a value function V^{π_k}(o), which in turn can be used to find an improved policy π_{k+1}(a|o), and so on. Given 0 < γ < 1 this is guaranteed to converge to the best solution given an infinite number of iterations [28]. The changes in each iteration are made based on the current approximation, and the model thus bootstraps itself. Remember that the value function here is a lookup table, and one is thus guaranteed that changing the value of one observation will not affect the value of other observations.
Monte Carlo Methods In many cases the model will not be known, and finding it will be unfeasible. Monte Carlo (MC) methods handle this by approximating the values of states through sampling from the current policy and then updating the value estimates accordingly. The current value function is not used in the update; in contrast to many other algorithms, this is thus not a bootstrapping approach. On the other hand, a large amount of sampling is needed between every update of the value function.
There are various MC methods, and they differ, among other things, in how they handle multiple visits to the same state.
In a simple every-visit MC method (where every visit to a state is used to update the value function, in contrast to, for example, first-visit MC, where only the return from the first visit to a state is used) the update of the value function might, for example, look as follows:

\[ V(o_t) \leftarrow V(o_t) + \alpha\left[G_t - V(o_t)\right], \tag{2.9} \]

where G_t is a sampled return following time t, and α is a constant step-size parameter. That is, the value of the state is updated to be closer to the sampled return.
Temporal Difference Learning Extending the previous ideas, temporal difference (TD) learning uses samples from the current policy in combination with the current value function and bootstraps a new value for the value function. The central idea in TD learning is that one can update the value function using only the current value function, the last reward r_t, the previous observation o_t, and the new observation o_{t+1}. In a basic implementation, this update can be made as

\[ V(o_t) \leftarrow V(o_t) + \alpha\left[r_t + \gamma V(o_{t+1}) - V(o_t)\right], \tag{2.10} \]

which, in contrast to Equation 2.9, bootstraps the current value function in its update.
An important variant of TD learning is the Q-learning method, in which the action-value function Q(o, a) is used and updated according to

\[ Q(o_t, a_t) \leftarrow Q(o_t, a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(o_{t+1}, a') - Q(o_t, a_t)\right]. \tag{2.11} \]

This method is especially well suited for tasks with discrete action spaces with finitely many actions, such that finding arg max_a Q(o, a) is feasible.
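A sketch of the tabular Q-learning update in Equation 2.11, using a dictionary as the Q-table and an ε-greedy behaviour policy; the environment interface (reset()/step(a)) and the hyperparameters are illustrative assumptions, and terminal-state handling is simplified.

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)  # maps (observation, action) -> value
    for _ in range(episodes):
        o, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(o, act)])
            o_next, r, done = env.step(a)
            # Q-learning update (Equation 2.11); bootstrapping at terminal
            # states is ignored for brevity
            target = r + gamma * max(Q[(o_next, act)] for act in range(n_actions))
            Q[(o, a)] += alpha * (target - Q[(o, a)])
            o = o_next
    return Q
```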
Approximate Methods
Although useful in many cases, the tabular approach is limited, as many problems have observation spaces that are far too large for a table. Consider, for example, the input being images from a robot; the set of possible observations O would then consist of all possible constellations of pixel values. A common approach to this is to replace the table of values with an approximate function v_ω(o_t), parameterized by ω. With this approach, many of the previously described methods can be extended to handle large observation spaces. However, it should be noted that many of the guarantees for convergence no longer hold in the approximate case, primarily because when one changes the weights of the parameterized function to better reflect the value of one observation, this also affects the approximation of the values of other observations. This kind of generalization is due to the number of parameters of the function being much smaller than the size of the observation space O, which makes it unable to represent the value of every observation exactly. Generalization has its downsides, but it also makes these models quite powerful, as it allows us to learn things about observations that have not yet been seen.
Many of the most widely used learning methods for the approximate approaches use stochastic gradient descent (SGD). This will, in general, mean searching for a local optimum rather than a global one. A general description of the SGD update for the observation-value function is

\[ w_{t+1} = w_t + \alpha\left[U_t - \hat{v}(o_t, w_t)\right]\nabla \hat{v}(o_t, w_t), \tag{2.12} \]

where U_t is our target for the value function at time t, and \hat{v}(o_t, w_t) is our approximate function. The approximate function can be shown to converge to a local optimum as long as E[U_t] = v(o_t) for the true value function v(o_t), as it for example would if U_t were found through MC sampling. However, even where this does not hold, as in bootstrapping algorithms such as Q-learning, it has proven to be an efficient approach. The methods using unbiased targets are called SGD methods, and the ones with biased targets are in turn called semi-gradient methods.
Deep Q-learning A common choice for the parameterized function is an ANN. These methods, often called deep Q-networks (DQNs), are used in some of the more successful recent works. As they replace the tabular representation of the Q-function Q(o, a) with an ANN, often training it using SGD, they get around the problem of having to store the value of each observation. However, this is not without drawbacks, as the use of parameterized approximate functions introduces a dependency between the value estimates of different observations.
The DQN update step using SGD yields the following expression:

\[ \omega \leftarrow \omega + \alpha\left[r + \gamma \max_{a'} Q(o', a'; \omega) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega). \tag{2.13} \]

Several approaches and tricks have been shown to yield great improvements when applied to this method; [15] shows the results of using a set of them together in an ablation study. In this thesis, a few of these have been used, while others have been left for future study. The methods used for this study are the following (a short code sketch combining them is given after the list):
• Experience Replay: [21] suggests using experience replay, as originally suggested by [19], for DQN learning. This method uses a so-called replay buffer, in which the N last experiences are stored; at each training step, a batch of size k is sampled uniformly from this buffer. This helps to diminish the problems introduced by the strong temporal correlations between samples, as well as allowing for more efficient data usage. As an extension to this, prioritized experience replay has been suggested [24], speeding up training by sampling experiences based on some priority schedule, often using the last TD error of the samples.
• Double Deep Q-learning: Double Q-learning, suggested by [11], is an extension of the Q-learning method, reducing the value overestimation introduced by the max function in Equation 2.11. [12] show with their Double Deep Q-Network (DDQN) algorithm that the Double Q-learning method can be extended to the deep Q-learning setting, yielding great results. In the DDQN algorithm two separate Q-functions, Q(o, a; ω) and Q(o, a; ω'), are used, the latter often called the target network. The update step then uses the target network to approximate the value of the next observation, leading to the following update step:

\[ \omega \leftarrow \omega + \alpha\left[r + \gamma\, Q\!\left(o', \arg\max_{a'} Q(o', a'; \omega); \omega'\right) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega). \tag{2.14} \]

The original Double Q-learning scheme involves alternating the roles and updates of the two Q-functions, for example by randomly choosing which one acts as target in each iteration [11]. [12] use a simpler approach, having a dedicated target network that is a copy of the original Q-function, updated at an interval chosen as a hyperparameter.
• Multi-step Learning: Where one-step learning looks only at the current reward, and MC methods look at the return from an entire episode, the n-step method combines these by holding back for a certain number of steps and then using the return from those steps in combination with bootstrapping to approximate the value of an observation. This approach gives a new update function for an o seen at time t and o' seen at time t + n:

\[ \omega \leftarrow \omega + \alpha\left[G_{o:o'} + \gamma^{n} \max_{a'} Q(o', a'; \omega) - Q(o, a; \omega)\right]\nabla_{\omega} Q(o, a; \omega), \tag{2.15} \]

where G_{o:o'} = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i+1} and n is the number of steps.
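The sketch below combines the three ingredients above — uniform experience replay, a Double-DQN target (Equation 2.14) and an n-step return (Equation 2.15) — for a Q-network represented by a generic callable q(obs) returning a vector of action values, with a separate target network; the interfaces and hyperparameters are assumptions made for illustration, not the thesis implementation.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):          # (o, a, n_step_return, o_next, n)
        self.buffer.append(transition)

    def sample(self, k):
        return random.sample(list(self.buffer), k)

def n_step_return(rewards, gamma=0.99):
    # G_{o:o'} = sum_i gamma^i r_{t+i+1} over the held-back rewards
    return sum(gamma ** i * r for i, r in enumerate(rewards))

def ddqn_targets(batch, q_online, q_target, gamma=0.99):
    """Double-DQN targets with an n-step return for a batch of transitions."""
    targets = []
    for o, a, g, o_next, n in batch:
        a_star = int(np.argmax(q_online(o_next)))       # action chosen by the online net
        targets.append(g + gamma ** n * q_target(o_next)[a_star])
    return np.array(targets)
```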
2.3 Variational Auto Encoder
The following section gives a brief introduction to variational auto-encoders (VAEs) and mainly follows the model presented in [17].
VAEs are an approach to finding encoded representations of data using a variational Bayesian approach. The goal is to create an encoder that can map the data into a lower-dimensional representation space Z in a way that gives us knowledge about the prior p(Z).
Given a dataset X = {x_i}_{i=1}^N of N i.i.d. data points, one starts with the assumption that they are generated by some random process. Each data point is thus assumed to be generated from a conditional distribution p(x|z) depending on a random variable z, which in turn is distributed according to some prior distribution p(z). With this, the goal can be formulated as finding the distribution p(x|z) that maximizes the marginal probability

\[ p(x) = \int_{z \in Z} p(x \mid z)\, p(z)\, dz. \tag{2.16} \]
However, performing this integral is often intractable. To solve this, a recognition model q_φ(z|x) is introduced, with which Equation 2.16 can be rewritten as

\[ p(x) = \int_{z \in Z} q_{\phi}(z \mid x)\, \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)}\, dz \tag{2.17} \]
\[ = \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right]. \tag{2.18} \]

Taking the log of this, to avoid having to optimize the product, and using Jensen's inequality, this can then be rewritten as

\[ \log \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right] \geq \mathbb{E}_{z \sim q_{\phi}(z \mid x)}\!\left[ \log \frac{p(x \mid z)\, p(z)}{q_{\phi}(z \mid x)} \right]. \tag{2.19} \]

After some rewriting, this leads to the final expression

\[ \log p(x_i) \geq \mathcal{L}(\phi, \theta, x_i), \tag{2.20} \]

with

\[ \mathcal{L}(\phi, \theta, x_i) = -D_{KL}\!\left(q_{\phi}(z \mid x_i)\,\|\,p_{\theta}(z)\right) + \mathbb{E}_{q_{\phi}(z \mid x_i)}\!\left[\log p_{\theta}(x_i \mid z)\right], \tag{2.21} \]

where the recognition model q_φ(z|x) can be shown to approach the true posterior p(z|x) as L(φ, θ, x_i) increases. Due to the i.i.d. assumption about the data X, the log-likelihood can be written as log p(X) = Σ_{i=1}^N log p(x_i), where each term can be implicitly optimized through the variational lower bound L(φ, θ, x_i). To find values for the parameters φ and θ, stochastic gradient descent can be used. To allow for this, z is re-parameterized using a differentiable function g_φ(x, ε) of an auxiliary noise variable ε ∼ p(ε), which allows us to approximate

\[ \mathbb{E}_{q_{\phi}(z \mid x_i)}\!\left[f(z)\right] \simeq \frac{1}{K} \sum_{k=1}^{K} f\!\left(g_{\phi}(x, \epsilon_k)\right), \tag{2.22} \]

for ε_k ∼ p(ε).
Under the given restrictions, a general stochastic gradient variational Bayes estimator has now been defined. What the actual implementation will look like depends largely on the choice of distribution for the recognition model q_φ(z|x), the prior p(z), and the kind of parameterized function to use. A common choice is to use an ANN as the parameterized function, let p(z) = N(0, I) and q_φ(z|x) = N(z; μ(x), σ²(x)I). Figure 2.3 shows the general design of such a VAE.
Figure 2.3: Graph depicting the ANN architecture for a VAE. Here X is the input and X' is the output; the encoder produces μ and σ, from which z is sampled using ε ∼ N(0, 1), and the decoder maps z back to X'.
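A minimal sketch of the reparameterization trick and the lower bound of Equation 2.21 for the common Gaussian choice above, using NumPy and a single Monte Carlo sample (K = 1); the encoder and decoder callables are stand-ins assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form D_KL( N(mu, sigma^2 I) || N(0, I) )
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def elbo(x, encoder, decoder):
    """One-sample estimate of L(phi, theta, x) from Equation 2.21."""
    mu, log_var = encoder(x)
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization: z = g(x, eps)
    x_recon = decoder(z)
    # Gaussian decoder with unit variance: log p(x|z) = -0.5 * ||x - x_recon||^2 + const
    log_lik = -0.5 * np.sum((x - x_recon) ** 2)
    return log_lik - kl_to_standard_normal(mu, log_var)
```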
2.4 Transfer Learning
Most ML algorithms are concerned with problems where the training and test data are assumed to be drawn from the same feature space and distribution. As a consequence, the model often needs to be retrained from scratch when the feature space or distribution changes [23], even though one often knows about similarities between previously learned models and the one to be trained. The field of transfer learning (TL) thus tries to decrease the amount of data needed when encountering a new task by using data and knowledge from previously seen tasks. TL has yielded success in many areas of supervised learning [23, 31], such as computer vision and language models [27, 5], and has received a fair amount of attention in the field of RL as well [29].
In its most general form, TL can be described as trying to transfer knowledge from one or more source tasks T_s in a source domain D_s onto another task or set of target tasks T_t solved in a target domain D_t. A domain can be said to consist of two parts: a feature space 𝒳 and a marginal distribution P(X) for X = {x_1, ..., x_n} ∈ 𝒳. Given a domain, a task T is defined to consist of two components: a label space Y and an objective predictive function f(·). Given a source domain D_s, a source task T_s, a target domain D_t and a target task T_t, the goal of TL is to help improve the learning of the target predictive function f_t(·) by using the knowledge in D_s and T_s, where D_s ≠ D_t or T_s ≠ T_t [23].
When using TL in an RL setting, the same concepts can be used to describe the TL task; however, one may be more explicit about what the different parts relate to, having the domain D describe an MDP in the form of a tuple D ≡ (S, A, R, p(r|s, s', a), p(s'|s, a)), where S is the observation space, A the action space, R the reward space, p(r|s, s', a) the reward function and p(s'|s, a) the transition function [16]. A task could then, for example, be described as T = (A, π*(·|s)), that is, trying to find the optimal policy for the MDP.
TL comes in a variety of shapes, as there are numerous choices regarding which parts of the model one allows to change between the source and target domain; in addition, one has to decide how the knowledge will be transferred. A common approach is the use of pre-trained networks, either as a way of guiding the early training of deep networks or as a way of reducing the amount of data needed for training [27, 5]. [29] shows, in the RL setting, how different approaches allow for changes in different parts of the MDP between source and target, and various ways in which the knowledge can be passed between models.
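As a sketch of the pre-trained-network style of transfer mentioned above, the snippet below copies weights from a source MLP into a target MLP and only continues training the final layer; the weight-list representation reuses the mlp_forward sketch from Section 2.1.1 and is an illustrative assumption, not a method from the thesis.

```python
def transfer_weights(source_weights, source_biases):
    """Initialize a target network by copying the source parameters."""
    target_weights = [W.copy() for W in source_weights]
    target_biases = [b.copy() for b in source_biases]
    return target_weights, target_biases

def finetune_last_layer(weights, biases, grad_W_last, grad_b_last, alpha=0.01):
    # Earlier layers are kept frozen; only the output layer is updated
    weights[-1] = weights[-1] - alpha * grad_W_last
    biases[-1] = biases[-1] - alpha * grad_b_last
    return weights, biases
```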
Related Work
3.1 Related Research
This thesis is concerned with learning an RL model parameterized on a task embedding; the following section contains a short survey of previous research with a similar focus.
There is a multitude of research on the topic of TL in RL; [29] gives an overview of older research. Recent research includes concepts such as meta-learning, domain randomization, and different types of parametrization.
In meta-learning, one tries to use the source tasks to learn how to learn. An example of this is [8], in which the learning is structured to learn a disentangled parametrization that allows for faster learning when encountering an unseen task, by training the model with a meta-objective. For tasks T_i sampled from p(T), with loss function L_{T_i}, a parameterized function f_θ and a one-step gradient update θ'_i = θ − α∇_θ L_{T_i}(f_θ), the meta-objective is for example defined as

\[ \min_{\theta} \sum_{T_i \sim p(T)} L_{T_i}\!\left(f_{\theta'_i}\right) = \sum_{T_i \sim p(T)} L_{T_i}\!\left(f_{\theta - \alpha \nabla_{\theta} L_{T_i}(f_{\theta})}\right). \tag{3.1} \]

The model is then optimized in relation to this meta-objective using SGD. This meta-objective can be described as trying to find the set of parameters θ that optimizes learning on the tasks from the distribution p(T).
Domain randomization is generally concerned with trying to transfer policies learned in simulation to the real world. An example of such a method is [30], in which an agent is trained on simulated data with domain randomization in a way which allows it to generalize to real-world tasks. More specifically, this is done by varying several features, for example lighting, the orientation of objects, and camera positioning, between each simulated sample used for training.
Parametrization has been used in various ways in RL. [25] trains a universal policy parameterized on goals, using multiple tasks with different goals to train a single policy that can generalize to new, unseen goals. This is done by performing end-to-end learning, training an MLP on data samples created by concatenating the goal state with the state and action. Notably, they find that it is more efficient to do this in a two-stage process, where in a first stage the goal and the state-action pair are separately embedded using matrix factorisation, and in a second stage these are combined through a simple linear function to generate the final output. Building upon the work of [25], Hindsight Experience Replay [2] is an approach to more efficiently use data gathered in tasks with well-defined target states. This is done by parametrizing the policy on the target and reusing the data with other states as targets. In addition to the data points generated in the rollouts, artificial data points are also added to the experience replay buffer. These artificial data points are generated by changing the target state to one reached during a rollout and adding previously seen transition tuples with the reward recalculated using this new target. [16] suggests a method in which a disentangled state representation is learned, allowing the agent to disregard uninformative changes between different tasks. In a three-stage process, they first train a disentangled representation of states using a VAE on the source tasks; in the second stage they use this encoder to train a policy on the source tasks; and finally, they transfer this policy onto a target task.
More similar to the approach suggested in this thesis are, for example, [22], [13] and [3], which all suggest a general approach for parameterizing the policy based on tasks. [22] learns the policy and the latent representation of the task simultaneously through SGD, together with an annealing scheme, in the hope of achieving a policy π that is smooth with respect to the latent representation z. This approach does not use any knowledge about the distribution p(z), and instead finds suitable values for z through further iterations of SGD. [13] uses an approach based on a variational bound, but in contrast to what is done in this thesis, they use entropy-regularized RL. [3] uses a loss function very similar to the one described in this thesis, but instead of Q-learning they perform policy-gradient learning to learn a policy for a continuous action space. Also, their approach differs from the method suggested in this thesis in that it requires policies for the source tasks, and thus only learns the actual parametrization of the policy.
Method
4.1 The Model
The following section describes a model for training a single policy for a set of tasks, parameterized on a latent variable z_t representing each task.
As has previously been pointed out, it might be beneficial to harness the knowledge about similarities between tasks to allow for faster learning when encountering new tasks. The assumption is that there exist families of tasks T where parts of the policy can be shared among the members. By disentangling the parts that can be shared between the members from those that cannot, the hope is to find a lower-dimensional space Z, encoding the parts that differ between the tasks, to search when encountering an unseen task τ_t ∈ T. Thus, given a family of tasks T, the goal is to find one single policy π_ω(o, z_i), where z_i is a latent representation of a task τ_i and o is an observation, such that π_ω(o, z_i) ≈ π*_{τ_i}(o), where π*_{τ_i}(o) is the optimal policy for the task τ_i. Also, it would be beneficial to know about the distribution p(Z) over Z, to further help in a search over the latent space Z like the one suggested in [22]. In contrast to [22], and inspired by the method described in [17], we suggest finding the latent representation z in a manner that gives us knowledge about its distribution p(Z).
As a first step towards this goal, a probabilistic interpretation of Q-learning is formulated. Under the assumption that r ∼ N(Q_ω(o, a) − γ max_{a'} Q_ω(o', a'), αI), and given an i.i.d. dataset D = {(o_i, a_i, r_i, o'_i)}_{i=0}^{N}, the Q-learning target

\[ \sum_{(o, a, r, o') \in D} \left(r + \gamma \max_{a'} Q_{\omega}(o', a') - Q_{\omega}(o, a)\right)^{2} \tag{4.1} \]

can be reinterpreted as trying to maximize the likelihood of R_D, as

\[ \log p(R_D \mid O_D, O'_D, A_D, \omega) = \sum_{(o, a, r, o') \in D} \log p(r \mid o, o', a, \omega) \]
\[ = \sum_{(o, a, r, o') \in D} \log \mathcal{N}\!\left(r;\, Q_{\omega}(o, a) - \gamma \max_{a'} Q_{\omega}(o', a'),\, \alpha I\right) \]
\[ \propto -\sum_{(o, a, r, o') \in D} \left(r + \gamma \max_{a'} Q_{\omega}(o', a') - Q_{\omega}(o, a)\right)^{2}. \tag{4.2} \]

In a manner analogous to the one suggested in [17], this expression can be assumed to depend on a latent variable z, which here is assumed to encode the tasks, and can thus be written as a marginal over the latent variable,
\[ \log p(r \mid o, o', a, \omega) = \log \int_{z \in Z} p(r, z \mid o, o', a, \omega)\, dz, \tag{4.3} \]

where Z is the set of all possible z. Again, following the method presented in [17], the expression can be bounded using a recognition model q(z) as

\[ \log p(r \mid o, o', a, \omega) \geq \mathbb{E}_{z \sim q(z)}\!\left[\log p(r \mid o, o', a, z, \omega)\right] - D_{KL}\!\left[q(z)\,\|\,p(z)\right]. \tag{4.4} \]

As the task each data point is sampled from is known, this information can be used in the hope that it helps during learning; the network is thus provided with a task ID, giving q(z|c), where c ∈ {1, ..., C} and C is the number of training tasks. This gives

\[ \mathbb{E}_{z \sim q(z \mid c)}\!\left[\log p(r \mid o, o', a, z, \omega)\right] - D_{KL}\!\left[q(z \mid c)\,\|\,p(z)\right]. \tag{4.5} \]

Combining Equations 4.2, 4.3 and 4.5 we get
\[ \log p(R \mid O, O', A, \omega) \geq \sum_{i=1}^{N} \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] - D_{KL}\!\left[q(z \mid c_i)\,\|\,p(z)\right]. \tag{4.6} \]
As in [17], the prior can be chosen as p(z) = N(0, I). This is motivated by the fact that any distribution over d dimensions can be represented by taking d normally distributed variables and mapping them through a sufficiently complicated function [7], which in this case is a neural network. Further, if the recognition model is chosen as q(z|c_i) = N(z; μ_i, σ_i²I), this allows D_KL[q(z|c_i) || p(z)] to be calculated in closed form [17]:

\[ D_{KL}\!\left[q(z \mid c_i)\,\|\,p(z)\right] = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right), \tag{4.7} \]

where σ_ij and μ_ij are the j-th elements of σ_i and μ_i respectively, and J is the dimensionality of the latent variable z.
For a single data point, we thus want to optimize

\[ \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right), \tag{4.8} \]

where the first part can be approximated as

\[ \mathbb{E}_{z \sim q(z \mid c_i)}\!\left[\log p(r_i \mid o_i, o'_i, a_i, z, \omega)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega), \tag{4.9} \]

yielding

\[ \mathcal{L}(\omega) = -\left( \frac{1}{L} \sum_{l=1}^{L} \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega) + \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log(\sigma_{ij}^{2}) - \mu_{ij}^{2} - \sigma_{ij}^{2}\right) \right) \tag{4.10} \]

as the loss function.
Moreover, the assumption that r ∼ N(Q_ω(o, a) − γ max_{a'} Q_ω(o', a'), αI) gives

\[ \log p(r_i \mid o_i, o'_i, a_i, z_l, \omega) = -\frac{1}{2}\left( \log(2\pi\alpha) + \frac{1}{\alpha}\left(r_i + \gamma \max_{a'} Q_{\omega}(o'_i, a', z_l) - Q_{\omega}(o_i, a_i, z_l)\right)^{2} \right). \tag{4.11} \]
Note that the expression γ max_{a'} Q_ω(o'_i, a', z_l) in Equation 4.11 corresponds to an approximation U(o') of the value of o', and can, in accordance with previous work, be replaced by a different target function U*(o'), which will be used when implementing the Double DQN [12].
4.2 Training
The following section describes how the previously described model can, in a first step, be trained on a set of source tasks T_source = {τ_1, ..., τ_k}.
For training, data is generated by taking actions in the environments using the current Q-function as policy and storing the resulting data points {o, a, o', r, c} in a replay buffer as described in [24]. The actions are chosen ε-greedily using the current policy and a linearly decreasing ε schedule. At each time step, one task is chosen and an action performed. The tasks are assured to be equally represented by choosing them according to a rotating schedule.
For each time step t, a batch of data points D_batch = {d_1, ..., d_n} of size n is sampled from the replay buffer according to weights based on the most recent loss of each data point, where d_i = (o_i, a_i, o'_i, r_i, c_i).
For all of the parameterized functions, ANNs are used as approximators and trained using SGD with the loss function described in Equation 4.10, where the likelihood term log p(r_i|o_i, o'_i, a_i, z_l, ω) is extended using the Double-Q- and multi-step-learning algorithms described in Section 2.2.2. Changing o and o' to refer to observations made at time t and t + n respectively, and the reward r to the cumulative return g_{o:o'} between these observations, the same steps as in Equation 4.2 give

\[ \log p(g_{o:o'} \mid o, o', a, z_l, \omega) = -\frac{1}{2}\left( \log(2\pi\alpha) + \frac{1}{\alpha}\left(g_{o:o'} + \gamma^{n} Q\!\left(o', \arg\max_{a'} Q(o', a'; \omega, z_l); \omega', z_l\right) - Q(o, a; \omega, z_l)\right)^{2} \right). \tag{4.12} \]

Figure 4.1 depicts the general architecture used for this thesis.
The network consists of two encoders: one variational encoder for the mapping from the task ID C to the latent representation Z, and an observation encoder, in our case an MLP, which transforms an observation o into a one-dimensional feature vector x; finally, there is a Q-function approximator. Usually, the encoding of the observation would be done implicitly by the Q-function, but here it is done explicitly to allow us to concatenate it with the latent representation z before inputting it to the Q-network.
Figure 4.1: Network architecture. The task ID C is passed through the task encoder (using noise ε ∼ N(0, 1)) to produce the latent representation Z; the observation O is passed through the state encoder to produce the feature vector X; X and Z are then fed to the Q-function, producing Q(O, A, Z), which enters the loss L(O, A, R, O', Z; ω) together with the reward R.