
UPTEC F 18008

Degree project, 30 credits (Examensarbete 30 hp), April 2018

A Deep Reinforcement Learning Framework where Agents Learn a Basic form of Social Movement

Erik Ekstedt


Abstract

A Deep Reinforcement Learning Framework where Agents Learn a Basic form of Social Movement

Erik Ekstedt

For social robots to move and behave appropriately in dynamic and complex social contexts they need to be flexible in their movement behaviors. The natural complexity of social interaction makes this a difficult property to encode programmatically. Instead of programming these algorithms by hand it could be preferable to have the system learn these behaviors. In this project a framework is created in which an agent, through deep reinforcement learning, can learn how to mimic poses, here defined as the most basic case of social movements. The framework aimed to be as agent agnostic as possible and suitable for both real-life robots and virtual agents through an approach called "dancer in the mirror". The framework utilized a learning algorithm called PPO and, as a proof of concept, trained agents both in a virtual environment for the humanoid robot Pepper and in a physics simulation environment for virtual agents. The framework was meant to be a simple starting point that could be extended to incorporate more and more complex tasks. This project shows that the framework was functional for agents to learn to mimic poses in a simplified environment.

Subject reader (Ämnesgranskare): Ginevra Castellano. Supervisor (Handledare): Alex Yuan Gao.


Popular Science Summary (Populärvetenskaplig sammanfattning)

Humanity is becoming increasingly dependent on technology, and development is moving faster than ever before. A decade ago the first smartphone was introduced, and platforms such as Facebook and Youtube appeared and changed society forever. Given how quickly technology evolves, it is entirely possible that we will soon live in a society where social robots feel as natural as our smartphones do today; robots that can help us with everything from companionship to healthcare, emergency services and education.

When people interact and communicate in everyday life, that is, when we meet in person, we make extensive use of gestures and movements. We move in different ways depending on which social circle we are in or what kind of social situation is at hand; watching someone greet their best friends at a party looks very different from watching someone leave a funeral. We use our body language to clarify what we mean, and we can judge other people's moods by analyzing their posture and the way they carry themselves. If social robots are to be a natural part of society and interact and communicate with us humans, it would be advantageous if they had similar abilities. Social robots should be able to move in a natural way that adds something to the social interaction and makes people feel calm and safe, and their behavior should change depending on how others in the social context act. Social situations are dynamic by nature, which makes it difficult to program in advance the exact knowledge required to move in a way that humans find convincing. Instead of deciding how a robot should behave and programming in different types of movements, it would be better if the robot learned this by itself.

In recent years deep learning, a branch of machine learning based on neural networks, has shown great progress in many different areas. AI has become a pop-cultural concept and receives a lot of attention in the media, with news about everything from self-driving cars and personal assistants to cancer-diagnosing systems, and in most of these cases deep learning and neural networks are the underlying technology. Neural networks have existed since the 1940s, but it is only in recent years that they have become mainstream: only today is enough computing power available to enough people for these networks to deliver the results we now see are possible. Programs of this kind are now standard in everything from audio and image recognition to translating text between languages. The same technology also lies behind the programs that are now better than humans at games such as Go, Atari and chess. These programs have learned to play through a technique called reinforcement learning, which is about learning behavior in a way similar to how animals and humans learn.

Reinforcement learning uses terms such as agent, environment and reward. An agent interacts with its environment, where different actions yield different rewards depending on how good the action was. The agent tries a large number of different actions and, after a certain amount of training, learns what is best to do and what should be avoided. This is a general approach, and the behaviors the agent learns depend on the environment, the reward and the learning algorithm. Different environments with different reward systems give rise to agents that are good at different things.

In this project, an environment with an associated reward system is created in which an agent can learn to mimic another agent's pose. Mimicking another agent's pose is assumed here to be the most elementary form of social movement, and the plan is to build on this and introduce more and more complex tasks. In addition to the environment, a recent optimization algorithm, abbreviated PPO, was used to optimize the neural networks created to solve the task. In this implementation it is important that the environment is general, so that both entirely fictional virtual characters and real robots, such as the humanoid robot Pepper from Softbank Robotics, can be trained. The project implemented one environment based on the program Choregraphe, which can be used to control Pepper, and one environment based on the non-profit organization OpenAI's Roboschool, built on the physics simulator Bullet. What the environments have in common is the way agents in them learn to mimic another agent's pose.

Once the environments were functional, a few smaller experiments were carried out to see whether the algorithm, the environment, the reward system and the neural networks could be shown to handle the task of mimicking another agent's pose.

The results of these experiments show that it is possible to mimic poses in this way, in a simplified environment, but that more work is needed to make the environments more complex and relevant to realistic situations.


TABLE OF CONTENTS

1 Introduction
1.1 Setup
1.2 Dancer in the Mirror Approach
1.3 Research Questions
2 Background
2.1 Machine Learning
2.2 Artificial Neural Networks
2.3 Activation Functions
2.3.1 Sigmoidal Activation Function
2.3.2 ReLU
2.4 Backpropagation
2.4.1 Stochastic Gradient Descent
2.4.2 Adam
2.5 Architectures
2.5.1 Convolutional Neural Network
2.5.2 Recurrent Neural Networks
2.5.3 Hyperparameters
2.6 Reinforcement Learning
2.6.1 Value iteration
2.6.2 Policy Optimization
2.6.3 Actor-Critic Methods
2.6.4 Proximal Policy Optimization
2.6.5 Exploration vs Exploitation
2.7 Pepper
2.7.1 Choregraphe
2.8 OpenAI's Gym
2.8.1 Roboschool
3 Method
3.1 Learning Algorithm
3.2 Pepper Environment
3.3 Custom Roboschool Environment
3.4 Reward Function
3.5 Networks
3.5.1 Modular Approach
3.5.2 Semi Modular Approach
3.5.3 Combined Approach
3.6 Experiment
3.7 Custom Reacher Experiments
3.8 Custom Humanoid Experiments
3.9 Pepper Experiments
4 Results
4.1 Reward Function
4.2 Experiment
4.2.1 Reacher Environment
4.2.2 Humanoid Environment
4.3 Pose Evaluation
4.4 Pepper
4.5 Code
5 Discussion and Future Work
5.1 Custom Roboschool Environment
5.2 Pepper
5.3 Project
5.4 Future Work
6 Conclusion


1 Introduction

This project aims to construct a framework for training agents to learn a basic form of social movement through end-to-end deep reinforcement learning. In human social interactions individuals convey a lot of information through the movements of different body parts. We perform many detailed movements in the facial area and in the use of our arms, hands and overall posing. There is a wide variety of movements humans use when we engage in social interactions, ranging from fully conscious and explicit in their meanings all the way to movements that we perform unconsciously and are not aware of. We use the movement information of others as a way to infer the type of social interaction we are in, as well as the emotional state and intentions of the people we socialize with. Social movements are highly context dependent and the context changes over time. In other words, the contexts are dynamic and require that an agent is able to adapt its behavior based on cues in the social environment. Because of these nuances of movement in human social interaction, it is very hard to write programs that could simulate similar behavior for any virtual or actual robot.

In order to program these sorts of movements into machines, the types of movements humans perform have to be defined, but also when those movements are performed and in what context. This makes the problem highly complex and it could seem quite intractable to solve. Instead of trying to define everything in advance it would be preferable to have the robot learn these movements by itself. In recent years deep learning, a subset of machine learning, has shown great proficiency in a wide range of tasks. In 2012 AlexNet [1] won the ImageNet competition [2, 3] and deep learning algorithms have dominated the field of computer vision ever since.

These algorithms have been shown to be able to play Atari games [4], achieve superhuman performance in Go and chess [5, 6], and have improved the state of the art in fields like machine translation [7] and speech synthesis [8], to name a few. Deep learning algorithms are general in nature and learn by training on large amounts of data. The behaviors learned depend both on the type of data and on how the learning problem is stated. This project states the learning as a reinforcement learning problem, where an algorithm learns by implementing actions in a trial-and-error fashion. An agent implements actions inside an environment, the environment simulates the consequences of each action and returns a reward, the algorithm is updated based on this reward, and the process is repeated.

There are many available environments that focus on games, such as ALE [9], pygame [10], Starcraft [11] and Deeplab [12], but also environments such as Psychlab [13], House3d [14] and dialog management [15], which are specific to certain fields of research. A combination of the kinds of environments above is found in OpenAI's Gym [16], which is an attempt to make these kinds of environments more accessible to practitioners. In the future, when learning algorithms become better and more efficient, it will be the learning frameworks that define what behaviors systems may acquire. This project aims to implement a basic starting point for an environment with a focus on social movements.

This project creates a basic framework consisting of custom made environments with associated rewards, an implementation of a reinforcement learning algorithm and some basic neural network architectures, with the goal of learning social movements. The framework has a general design to make it as agent agnostic as possible through an approach referred to as dancer in the mirror. The term agent agnostic is used throughout the thesis; for a framework to have this property, any agent should be able to learn the specific task in the same way, and the specific traits of any agent should not affect the training at large. The main consideration is that the learning should depend only on the agent in question, without humans or labeled data in the loop. The framework consists of environments for both the humanoid robot Pepper [17], using Choregraphe [18], and a physics simulation environment based on OpenAI's Roboschool [19]. This project shows that the dancer in the mirror approach is functional as a basic framework for social movements by having a simplified agent learn to mimic poses. The project also indicates that this setup could work for more complex agents, such as Pepper and a virtual humanoid torso, given more time and effort for training.

1.1 Setup

Social movements are complex in general and therefore the problem is simplified by defining some minimum requirements as a basis. Coordination is needed in order to move in a controlled way to any target state, meaning that if an agent is to raise its arm it is, by virtue of coordination, able to do so. The other ability required for social movements is a notion of understanding or perception. In all social contexts it is very important to be able to perceive what the other agents are doing, so as to extract meaning and implement appropriate responses.

Language is very important in order to act accordingly in any interaction where communication is essential, but for social movements there are many contexts in which explicit communication may not be present. In this project the language aspect of social interaction is omitted and understanding is purely based on visual perception. Thus the two basic abilities an agent utilizes in this project are coordination and vision.


Agents can be of many shapes and forms and it is preferable if the learning framework is as agent agnostic as possible. To achieve this, little or no prior knowledge about the specific agents, concerning movement or vision, should be assumed. The learning loop should not depend on other actors such as humans or on labeled data. Because of the complexity of the problem and the many variables that combine to make up social movements, the design choices of the framework need to be carefully considered. By starting very simple and then incrementing the complexity of the learning framework, some problems might be avoided. In human interactions it is common to mimic poses and gestures [20], and the ability to mimic a pose could be seen as a first step in learning social movements. In this project the ability to mimic others, mainly the ability to reach the same pose as a target agent, is the first form of social movement to learn. In order to learn this ability for the general case, a first step is arguably to learn how to achieve poses through self training without any other agents in the loop, and this is the goal of the dancer in the mirror approach. This may then be made more complex by instead targeting a sequence of poses (movements), combining several sequences, adding other actors, and so on.

1.2 Dancer in the Mirror Approach

In this project the setup is analogous to that of a dancer practicing in front of a mirror, see Figure 1. The dancer initiates movements in order to reach a certain pose, observes how this is externally perceived and, by having a target pose in mind, modifies the movement until the desired pose has been learned. By simultaneously observing the external perception of the poses and the movements performed, the dancer learns how certain internal actions, the muscle control, correlate with certain external observations. After a sufficient amount of training in front of the mirror, the dancer generalizes this ability and is then able to mimic other dancers. The strength of this approach is that the setup is the same for any type of agent, virtual as well as real, and is not dependent on any specific abilities of the agent. The idea is that this setup may be implemented virtually in some simulation environment as well as in real life for actual physical robots. This setup is not dependent on the agent specific details of movement or appearance but could work across a wide range of agents.

To create a dancer in the mirror setup an agent needs the ability to move, the ability to perceive its external state and access to a target pose. A pose is the configuration an agent is in; it is explicitly defined in the internal joint state but may also be inferred from an external observation. All agents that can move have access to their internal joint states, which in this context refers to the virtual muscles of the agent, the actuators or motors. This means that movements are performed in the space spanned by these values. Reaching a pose is defined as follows: given an internal state representation of a target, perform actions in the internal state space and, to some degree of precision, reach the configuration specified by the target pose. This is the ability of coordination. The ability to translate an external observation of another agent's pose into the correlating internal joint state representation is referred to as understanding. Given an observation of another agent, the understanding ability translates the external observation into an internal state representation and the coordination ability implements suitable movements in order to reach that pose. Successfully combining the two abilities to reach the target pose is what this project defines as pose mimicking.

The training is implemented by defining a set of target poses, initializing the agent, implementing actions and receiving rewards. After a certain amount of time the target may be switched and the process repeated. During training all of the data is known, since the agent trains with itself, but during testing the agent mimics a target pose without having access to the target's internal state.

Figure 1: A dancer practicing in front of a mirror. A dancer implements movements, perceives an external observation of itself and, with a target pose in mind, executes movements to reach the desired pose. Photo credit: Mait Jüriado, Ballerina, via photopin. License.


1.3 Research Questions

This project aims to create suitable environments for agents to train in, to choose a learning algorithm along with neural network architectures, and then to investigate whether the dancer in the mirror approach is valid as a basic learning framework for social movements.

• Can the dancer in the mirror approach train agents to learn coordination and mimic poses?

• Can the humanoid robot Pepper learn to mimic poses in this way?

• And can this be generalized to mimic dynamic motions and more complex behaviors?

First the background material for the relevant concepts and algorithms is summarized. The method section then goes through what was done and what experiments were run, the results of which are shown in the results section. Thereafter follows a discussion about the results, lessons learned, problems that arose and future work.

2 Background

In this section the different concepts, algorithms and programs relevant to this project are explained. The section starts with machine learning in general and then specifically machine learning based on neural networks, how neural networks are built and how they are trained. This is followed by a summary of reinforcement learning, what types of learning algorithms there are and a more detailed explanation of the specific algorithm used in this project. The background section ends with summaries of the major software used, namely Choregraphe and OpenAI's Gym.

2.1 Machine Learning

Machine learning is an area of computer science that refers to algorithms which learn to approximate a function from data. This is in contrast to "regular" algorithms, which are defined in a static way to process data as instructed by the programmer. Machine learning is generally clustered into three different but overlapping fields of study called supervised learning, unsupervised learning and reinforcement learning. Supervised learning deals with labeled data, where each data point has an associated label; the goal is then to find an algorithm which correctly maps the data points to the correct labels. Common supervised learning problems are classification and regression. Unsupervised learning is the field of finding patterns or features in data with no associated labels; common techniques are clustering algorithms and anomaly detection. The amount of data available is rapidly increasing and unlabeled data is the most common type, which makes unsupervised learning a very interesting field for the future. The final part of machine learning is referred to as reinforcement learning and gets its inspiration from learning behaviors observed in animals and humans, where a trial-and-error approach with associated rewards is used in order to construct algorithms. A common type of parametric function used in all fields of machine learning is the neural network, and this project focuses exclusively on this type of machine learning.

Deep learning has received a lot of attention recently; the name refers to neural network algorithms that utilize many layers. These types of algorithms have proved very efficient in supervised learning problems with large sets of data, and many credit today's increasing attention to these algorithms to the ImageNet competition [2, 3] in 2012, which Krizhevsky et al. won with their deep neural network AlexNet [1].

2.2 Artificial Neural Networks

Artificial neural networks (ANNs) are functions built up of interconnected neurons. Today's popular ANNs are constructed from a certain type of artificial neuron that receives a vector input x, the sum of which is fed through a non-linearity called an activation function ϕ, to produce a scalar output y. For an N-dimensional input vector we have

y = \varphi\Big( \sum_{i=1}^{N} x_i \Big). \qquad (1)

All neurons, except the ones in the input layer, receive data from other neurons in the network, weighted by a particular weight associated with the connection between the specific neurons. A network which only takes input and feeds it through the network in a straightforward manner is called a feed forward net. This is in contrast to recurrent network structures, which are explained later. The simplest kind of feed forward network is the multi layered perceptron (MLP). The term perceptron originates from one of the first artificial neural network algorithms with the same name. The perceptron is defined in the same way as in equation 1, with a Heaviside step function as the non-linearity ϕ. In an MLP network all neurons in one layer are connected to all the neurons in the subsequent layer; this connection scheme is often referred to as "fully connected". The weights between specific connections in the network are commonly ordered in a matrix and multiplied by the vector output of one layer to produce the inputs to the next. Let W denote the matrix containing the weights of a layer, X be the input vector, b be a vector of bias values, and ϕ again be the activation function used by the neurons; then the vector output Y of a layer is defined as

Y = \varphi(WX + b). \qquad (2)

This output is then used as the input to the next layer until the final output is reached. In Figure 2 a schematic diagram of a network with two layers is shown. Here the arrows between the neurons are the weights and the circles are the neurons. The goal is to update the weight parameters so that the network converges towards the desired function.
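As an illustration of equation 2, the following is a minimal sketch, in PyTorch (the framework used later in this project), of a small fully connected network; the layer sizes and the tanh activation are arbitrary example choices, not taken from the thesis implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a two-layer fully connected network (equation 2 applied twice).
# Layer sizes are arbitrary example values.
class MLP(nn.Module):
    def __init__(self, in_dim=4, hidden_dim=64, out_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)   # holds the weight matrix W and bias b
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = torch.tanh(self.fc1(x))  # Y = phi(WX + b), with phi = tanh
        return self.fc2(h)           # output layer without activation

y = MLP()(torch.randn(1, 4))  # one forward pass on a random 4-dimensional input
```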

Figure 2: A schematic representation of an artificial neural network. The circles are neurons and the arrows are weighted connections. [image source [21]]

2.3 Activation Functions

The non-linearities in a neural network are required in order to model complex functions: several linear layers connected together in the way described above are always mathematically equivalent to one linear layer. The first activation used in the perceptron algorithm from the late 1950s was the Heaviside, or threshold, step function. Given any input, the neuron would either be off or, if the sum of the input was over a certain threshold, turn on. When neural networks became popular again in the 1980s, common activation functions were the logistic function and the hyperbolic tangent function.

2.3.1 Sigmoidal Activation Function

Sigmoid functions refer to functions of an s-shaped form and include functions such as the hyperbolic tangent and the logistic function. However, in machine learning the logistic function shown below is commonly referred to as the sigmoid activation. This function is defined as

A_{sigmoid}(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1}, \qquad (3)

which has a range of (0, 1), ∀x ∈ R; the graph of the function is shown in Figure 3a. Another popular activation function is the hyperbolic tangent which, like the logistic function, is also of a sigmoidal shape. The hyperbolic tangent is defined by

A_{tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad (4)

with a range of (−1, 1), ∀x ∈ R, and is shown in Figure 3b. Both the logistic and tanh functions are well suited for classification because of their steep gradient around zero. This property makes it easy for the behavior of the neuron to move away from zero to either side, making its behavior more distinct. They are both bounded; their ranges are finite and "small", meaning that the signals in the neural network won't blow up and become very large. However, the two functions have a very flat slope towards larger negative and positive values, so if the signal is far from zero the gradient information becomes small. This problem is called the vanishing gradient problem and can make training and convergence slow. The two functions differ only slightly in their behavior, mostly in the fact that tanh can output negative values and that its gradient is a little larger, but apart from this they behave very similarly.

2.3.2 ReLU

In order to avoid the vanishing gradient problem, and inspired by biology [22], the rectified linear unit, ReLU, works differently. The ReLU activation function is a max operation that is linear for inputs larger than 0 and 0 elsewhere,

A_{ReLU}(x) = \max(0, x). \qquad (5)

The graph of the function is shown in Figure 3c. This activation function has an infinite range and therefore makes it possible for the signal in the neural network to become large. However, because of the zero output for negative inputs, the signal in the network is sparse, meaning that not all neurons will feed the signal forward. In practice the ReLU activation has become very popular for convolutional neural networks.
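As a concrete reference for equations 3-5, the short sketch below evaluates the three activation functions with PyTorch; it is purely illustrative and not part of the thesis code.

```python
import torch

x = torch.linspace(-5.0, 5.0, steps=11)

sigmoid = 1.0 / (1.0 + torch.exp(-x))                                   # equation 3, range (0, 1)
tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))  # equation 4, range (-1, 1)
relu = torch.clamp(x, min=0.0)                                          # equation 5, max(0, x)

print(sigmoid[0].item(), tanh[0].item(), relu[0].item())  # at x = -5: near 0, near -1, exactly 0
```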

Figure 3: Activation functions. (a) The sigmoid (logistic) function. (b) The hyperbolic tangent function (tanh). (c) The rectified linear unit (ReLU) activation.

2.4 Backpropagation

Artificial neural networks are parametric functions with an artificial neural computational unit at their core. These neurons are connected to one another in specific ways defined by the type of architecture that is implemented. But regardless of what type of architecture is being used, all neural networks need to be trained in order to approximate the desired function. Training refers to updating the parameters of a network in order to optimize a specific function, commonly referred to as a loss, cost or objective function. The most common approach is to optimize the weights by slightly perturbing each weight in the direction of the negative gradient of the loss function with respect to that weight. The algorithm for updating a neural network through gradient information is called backpropagation, referring to the way the error gradient "flows" backwards from the final output layers of the network to the initial input layers. It is hard to say exactly who invented the backpropagation algorithm, because the general approach of using gradients to optimize a parametric function has been known in mathematics for a long time. For details regarding backpropagation or neural networks in general, the interested reader is referred to the overview in [23]. The basic optimization procedure underlying backpropagation is the stochastic gradient descent algorithm.

2.4.1 Stochastic Gradient Descent

Neural networks are commonly optimized by gradient-based optimization algorithms. There are many different varieties, but they are all based on the standard gradient descent algorithm referred to as stochastic gradient descent (SGD). In SGD the network trains on mini-batches, subsets of the total dataset, meaning that only a sample of the true gradient is computed, hence the name stochastic gradient. Let a neural network be parameterized by the vector θ and let f(θ) be the objective function that is being optimized. The gradient of the objective function with respect to θ is then given by ∇_θ f(θ), and the update rule for SGD is

\theta^{i}_{t+1} = \theta^{i}_{t} - \alpha \nabla_{\theta_i} f(\theta), \quad \forall \theta_i \in \theta, \qquad (6)

where α ∈ R is the learning rate and θ^{i}_{t} refers to the i-th parameter being updated at time t.
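For concreteness, here is a minimal sketch of one SGD step according to equation 6, on a toy quadratic objective; the learning rate and objective are arbitrary examples, not values from the thesis.

```python
import torch

theta = torch.randn(3, requires_grad=True)  # parameters
alpha = 0.1                                 # learning rate (example value)

loss = (theta ** 2).sum()   # toy objective f(theta) = ||theta||^2
loss.backward()             # backpropagation computes grad_theta f(theta)

with torch.no_grad():
    theta -= alpha * theta.grad   # theta_{t+1} = theta_t - alpha * grad (equation 6)
    theta.grad.zero_()            # clear the gradient before the next step
```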

2.4.2 Adam

The basic SGD in section 2.4.1 is the foundation for many different optimization algorithms, and a popular one is Adam [24]. Adam stands for adaptive moment estimation and is more complex than regular SGD but often performs better. The algorithm keeps moving estimates of the first and second moments of the gradient. These are the parameters m_t and v_t, where t denotes the time step of the stochastic function f_t(θ). The moments are defined by

m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta f_t(\theta), \qquad (7)
v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left( \nabla_\theta f_t(\theta) \right)^2, \qquad (8)
\hat{m}_t = \frac{m_t}{1 - \beta_1^{t}}, \qquad (9)
\hat{v}_t = \frac{v_t}{1 - \beta_2^{t}}, \qquad (10)

where β_1, β_2 ∈ [0, 1) are the exponential decay rates of the moments and \hat{m}_t and \hat{v}_t are the bias-corrected estimates. The explanation of why the moments need to be corrected is omitted in this summary and the interested reader is referred to the paper [24]. The final update rule for time step t is then defined as

\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad (11)

where ε is a small constant for computational stability. Informally, Adam is efficient because it scales the step taken at each update based on how noisy the gradient signal is, where noisier signals yield a smaller step size.
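The sketch below transcribes the Adam update (equations 7-11) directly for a single parameter vector; the hyperparameter defaults follow the Adam paper [24], and the function is illustrative only (the project itself uses Adam through PyTorch).

```python
import torch

# Direct transcription of the Adam update (equations 7-11) for one parameter vector.
def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment estimate, equation 7
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment estimate, equation 8
    m_hat = m / (1 - beta1 ** t)                   # bias correction, equation 9
    v_hat = v / (1 - beta2 ** t)                   # bias correction, equation 10
    theta = theta - alpha * m_hat / (torch.sqrt(v_hat) + eps)  # update, equation 11
    return theta, m, v

theta, m, v = torch.zeros(3), torch.zeros(3), torch.zeros(3)
theta, m, v = adam_step(theta, grad=torch.ones(3), m=m, v=v, t=1)
```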

2.5 Architectures

Neural network architectures work by combining the basic computational neuron, see equation 1, in different ways in order to optimize for specific uses. There are three prominent base architectures commonly used in machine learning, the first of which is the fully connected architecture explained in section 2.2. The other two are convolutional and recurrent neural networks.


2.5.1 Convolutional Neural Network

In computer vision convolutional neural networks have shown great efficiency and performance, and they are the most common architecture to use when processing visual inputs. They are inspired by the mammalian visual cortex and differ from the fully connected architecture by only connecting certain neurons in one layer to specific neurons in the next. In two-dimensional spatially correlated data, like the data in images and videos, the information in one part of the data arguably has little or nothing to do with data in distant parts, and this property is what convolutional networks utilize. Instead of connecting the output from one neuron to all neurons in the next layer, only a smaller subset of the neurons in the next layer is connected, a semi-connected neural network. Convolutional neural networks take this one step further and argue that the processing being done in one part of the data is also useful in other areas, and therefore use the same kernel of weights for all connections. Figuratively, one may imagine a kernel of weights traversing the two-dimensional data and taking the dot product between its weights and the subset of neurons it is hovering over. In more rigorous terms, a convolution operation is performed on the data. A schematic overview of this for a kernel size of 3x3 is shown in Figure 4a. The parameters of a convolutional layer are the size of the kernel, the stride and the number of kernels. The stride refers to the number of values the kernel skips: a stride of 1 calculates outputs for all values, a stride of 2 skips every other value, and so on.
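As a small illustration of the kernel and stride parameters, the sketch below applies a single convolutional layer in PyTorch; the channel counts and input size are arbitrary example values, not taken from the thesis.

```python
import torch
import torch.nn as nn

# One convolutional layer with a 3x3 kernel and stride 1, as in Figure 4a.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 64, 64)  # a batch with one 3-channel 64x64 image
features = conv(image)             # shape (1, 16, 64, 64); a stride of 2 would halve height and width
```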

2.5.2 Recurrent Neural Networks

Both the fully connected and the convolutional architectures are often implemented in a feed forward manner, meaning that the output only depends on the current input. In other words, these structures have no explicit internal memory apart from the connections between the neurons. For sequential data this can be a problem, because there is a dependence between the data points in a sequence which by definition cannot be modeled without any memory. This is what recurrent architectures aim to fix. Instead of only having data propagate forward, some data is stored internally for each neuron and is used in subsequent data passes; a qualitative schematic is shown in Figure 4b. This means that recurrent neural networks (RNNs) contain an internal memory and are thus better at processing sequential data. The output of the network is not only dependent on the current input data but on all data previously processed by the network. This makes recurrent models more difficult to train, because the error gradient must not only traverse from the output error back to the input layers but also back through time, due to the internal memory structure.

Figure 4: Two common neural architectures. (a) A convolutional layer with a 3 by 3 kernel; the schematic shows the output of the convolution for the dark blue value. (b) A schematic representation of a recurrent neural network, where the circles are neurons and the arrows are weighted connections; the curved arrows pointing back to the neurons represent the internal memory of the RNN. [image source [25]]

2.5.3 Hyperparameters

A neural network is defined by the types of basic architectures, explained above, used in the implementation, but also by the size of the network, the number of layers and neurons, and other relevant features. These are all parameter values that are set before training and are commonly referred to as hyperparameters.

2.6 Reinforcement Learning

Reinforcement learning states the machine learning problem in terms of an agent interacting with an environment through actions. After implementing an action, the agent receives information about the state of the environment through an observation and a reward. By repeatedly executing actions and receiving rewards, the agent's goal is to learn how to behave and what policy to implement in order to get the most reward. A schematic overview is shown in Figure 5a. In practice it is most common to state these problems as a Markov decision process, MDP.

Markov decision processes are a mathematical framework that defines how actions can transition an agent from one state to another. They are discrete time stochastic control processes, meaning that time is modeled as discrete and that the transitions between states in the MDP are stochastic in nature. Let S be a set of states s, A be a set of available actions a, and P(s'|s, a) be the transition probabilities of going from one state to another given an action. At time t the environment is in a particular state s_t, an action a_t is implemented, and then with some probability P(s_{t+1}|s_t, a_t) the environment transitions into s_{t+1} and outputs a reward R(s_{t+1}, s_t, a_t, ...). A small schematic example is shown in Figure 5b. The reward is given by some reward function that could depend on the current state, the implemented action, the subsequent state and more; this reward is often denoted r_t for convenience. In short, an MDP consists of four elements: a set of states S, a set of actions A, a state transition function P(s'|s, a) and a reward function R(·).

Figure 5: Schematic representations of reinforcement learning and Markov decision processes. (a) A reinforcement learning diagram: the agent (robot) implements an action in the environment and receives a reward and an observation. (b) A Markov decision process: the green circles represent states and the orange circles represent actions; the arrows represent probabilities and connect actions to the resulting states.

Reinforcement learning approaches can in general be divided into two major types, model-based and model-free. In a model-based approach there is a module that models the environment in order to simulate what could happen in the future given a specific action. This model can then be used by the policy to decide what actions to implement in order to achieve the best result. This is in contrast to model-free approaches, which use no explicit modeling of the environment. Both model-based and model-free approaches comprise three common classes of methods: value-based, policy-based and actor-critic, shown in Figure 6.

Figure 6: Two common reinforcement learning types are model-based and model-free. Both of these may utilize policy-based, value-based and actor-critic approaches.

In reinforcement learning there are some common functions, defined below. An agent interacts with an environment over discrete time steps. At each time step t, the agent receives a state s_t ∈ S and implements an action a_t ∈ A chosen according to a policy π : S → A in a stochastic manner, a_t ∼ π(a_t|s_t). The agent then receives a scalar reward r_t and the environment transitions to the next state s_{t+1}. This is iterated until the agent reaches a terminal state. The total accumulated return from time t is given by

R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}, \qquad (12)

where γ ∈ (0, 1] is referred to as the discount factor. The value function

V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s], \qquad (13)

is the expected total return obtained by following the policy π from state s. Similarly, the action value function, or Q-function,

Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a], \qquad (14)

is the expected return from selecting action a in state s and then following the policy π. The optimal value functions are then given by

Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a), \quad V^{*}(s) = \max_{\pi} V^{\pi}(s), \qquad (15)

which are the maximum state-action value and state value over all possible policies. In practice these functions are not known and are commonly approximated by a function with parameters θ. The goal is to maximize the expected return from each state s_t with respect to the parameters θ.
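To make the notation concrete, the sketch below runs a generic agent-environment loop with a random policy in the (2018-era) Gym API and computes the discounted return of equation 12; the environment name and the random policy are placeholders, not the setup used in this project.

```python
import gym

env = gym.make("CartPole-v1")   # placeholder environment
gamma = 0.99                    # discount factor

state, rewards, done = env.reset(), [], False
while not done:
    action = env.action_space.sample()            # a_t ~ pi(a_t | s_t), here a random policy
    state, reward, done, info = env.step(action)  # environment transition and reward r_t
    rewards.append(reward)

R = 0.0
for r in reversed(rewards):   # R_t = sum_k gamma^k * r_{t+k}, computed backwards
    R = r + gamma * R
print("discounted return from t = 0:", R)
```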


2.6.1 Value iteration

In value-based learning the goal is to approximate the optimal value function and then follow a greedy policy, choosing the action that yields the largest reward. Value iteration methods compute the optimal value function for a given MDP through an iterative process. These methods first initialize an arbitrary starting function V_0(s), which is then updated until the values for the states converge. It is common to use a temporal difference approach such as Q-learning in order to approximate the value function. Deep Q-learning got attention after Deepmind utilized it to solve several Atari environments [26]. However, other types of algorithms, which utilize policy methods or actor-critic methods, have been shown to be even more effective [27].

2.6.2 Policy Optimization

Policy optimization approaches work in a different way than value-based methods and try to find the best policy directly, without a value function approximation. To do this it is common in practice to perform gradient ascent on the objective function by estimating its gradient. These approaches are called policy gradient methods. Here the policy π_θ(a|s) is optimized directly. Let

U(\theta) = \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^{t} r(s_t); \pi_\theta \Big] = \sum_{\tau} Pr(\tau; \theta) R(\tau), \qquad (16)

be the total expected return from the environment conditioned on a policy π_θ, for any trajectory τ of length H. The factor Pr(τ; θ) is the probability of a certain trajectory given the parameters θ, and R(τ) is the cumulative reward collected following τ. The objective then becomes to maximize the function in equation 16 with respect to the parameters of the network,

\max_{\theta} U(\theta) = \max_{\theta} \sum_{\tau} Pr(\tau; \theta) R(\tau). \qquad (17)

In order to optimize equation 17 the gradient with respect to θ is needed. The gradient is computed and algebraically restructured in a way which will prove useful:

\nabla_\theta U(\theta) = \nabla_\theta \sum_{\tau} Pr(\tau; \theta) R(\tau)
= \sum_{\tau} \nabla_\theta Pr(\tau; \theta) R(\tau)
= \sum_{\tau} \frac{Pr(\tau; \theta)}{Pr(\tau; \theta)} \nabla_\theta Pr(\tau; \theta) R(\tau)
= \sum_{\tau} Pr(\tau; \theta) \frac{\nabla_\theta Pr(\tau; \theta)}{Pr(\tau; \theta)} R(\tau)
= \sum_{\tau} Pr(\tau; \theta) \nabla_\theta \log Pr(\tau; \theta) R(\tau). \qquad (18)

However, equation 18 is an analytical expression which depends on all possible trajectories, and in practical implementations these are not known. Instead the gradient is estimated by empirically averaging over m sampled trajectories, and equation 18 then becomes

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log Pr(\tau^{(i)}; \theta) R(\tau^{(i)}). \qquad (19)

In order to explicitly state how the gradient estimate relates to the policy, the trajectory probabilities are decomposed. For any trajectory i,

\nabla_\theta \log Pr(\tau^{(i)}; \theta) = \nabla_\theta \log \Big[ \prod_{t=0}^{H} P(s^{(i)}_{t+1} \mid s^{(i)}_{t}, a^{(i)}_{t}) \cdot \pi_\theta(a^{(i)}_{t} \mid s^{(i)}_{t}) \Big]
= \nabla_\theta \Big[ \sum_{t=0}^{H} \log P(s^{(i)}_{t+1} \mid s^{(i)}_{t}, a^{(i)}_{t}) + \sum_{t=0}^{H} \log \pi_\theta(a^{(i)}_{t} \mid s^{(i)}_{t}) \Big]
= \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a^{(i)}_{t} \mid s^{(i)}_{t}). \qquad (20)

Because the transition probabilities P(s^{(i)}_{t+1} | s^{(i)}_{t}, a^{(i)}_{t}) do not depend on θ, that term vanishes and only terms dependent on the policy π_θ remain. This is a very useful result and means that information about the environment dynamics is not needed in the policy gradient estimate. Equation 20 is inserted into equation 19 to produce

\hat{g} = \hat{\mathbb{E}}_t \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) R_t \big], \qquad (21)


which is the estimated policy gradient also called the ”Vanilla” policy gradient.

Algorithms which update policies in the direction of equation 21 are referred to as REINFORCE after Williams [28], who also showed that equation 21 is an unbiased estimate of the true gradient. Instead of using the actual returns, Williams [29] showed that the variance of the REINFORCE estimate can be reduced by subtracting a learned function, called a baseline, from the return, producing

\hat{g} = \hat{\mathbb{E}}_t \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) (R_t - b_t(s_t)) \big]. \qquad (22)

The baseline function can be modeled in many ways to decrease the variance, and one common and efficient approach is to use an estimate of the value function. This results in algorithms which both update the policy and estimate a value function. These approaches are called actor-critic methods.
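As a sketch of how equation 22 is typically turned into a training signal, the function below computes a REINFORCE loss with a learned value baseline; the tensor names and the value-loss weight are illustrative assumptions, not the thesis implementation.

```python
import torch

def reinforce_loss(log_probs, returns, values):
    # R_t - b_t(s_t), with the baseline given by the value estimate (detached so the
    # baseline does not receive policy gradients)
    advantages = returns - values.detach()
    policy_loss = -(log_probs * advantages).mean()  # descent on the negative of equation 22
    value_loss = (returns - values).pow(2).mean()   # regression of the baseline onto the returns
    return policy_loss + 0.5 * value_loss           # 0.5 is an arbitrary weighting

loss = reinforce_loss(torch.randn(32, requires_grad=True), torch.randn(32),
                      torch.randn(32, requires_grad=True))
loss.backward()
```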

2.6.3 Actor-Critic Methods

Actor-critic methods rely on the principles of both value-based and policy-based approaches combined. The actor is the policy π_{θ_a}(a|s) and the critic is the value function approximation V_{θ_c}(s). The parameters θ_a and θ_c are different if two separate approximators are used; however, both functions may also be approximated by a single parametric function. The most common way to define actor-critic algorithms is by the gradient estimate

\hat{g} = \hat{\mathbb{E}}_t \big[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \hat{A}_t \big], \qquad (23)

where \hat{A}_t is the advantage function estimate at time t. This advantage function can be defined in many different ways [30], but a common one is

\hat{A}_t = Q(a_t, s_t) - V(s_t), \qquad (24)

which compares the value of taking a particular action, Q(a_t, s_t), with the average expected value V(s_t). This is useful because if a certain action is associated with a better than average value the advantage is positive, and if it is worse the advantage is negative. It follows that actions associated with positive advantages will be encouraged and those associated with negative advantages will be discouraged. The algorithm used in this project is an actor-critic type algorithm called Proximal Policy Optimization.

2.6.4 Proximal Policy Optimization

Proximal policy optimization, PPO [31], is an actor-critic type algorithm that updates the policy by estimating the policy gradient and an advantage function. However, the gradient estimate differs slightly from that of ordinary actor-critic approaches. Instead of optimizing equation 23 it maximizes a surrogate loss. Let

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, \qquad (25)

be the ratio between the policy π_θ after updates and an older version π_{θ_old} before said updates. By replacing the logarithm of the policy in equation 23 with the ratio in 25, the conservative policy loss is defined as

L^{CPI}(\theta) = \hat{\mathbb{E}}_t \Big[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \hat{A}_t \Big] = \hat{\mathbb{E}}_t \big[ r_t(\theta) \hat{A}_t \big], \qquad (26)

which is the basis of the trust region policy optimization algorithm, TRPO [32]. However, instead of also adding a constraint to the optimization problem, as in TRPO, PPO uses a clipped version of the conservative policy loss in equation 26, defined as

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \Big[ \min\big( r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \big) \Big], \qquad (27)

which is the PPO objective. Here ε is referred to as the clip parameter, a hyperparameter set on initialization. In this project the PPO algorithm was implemented with a truncated version of the generalized advantage estimate [30] as defined in the PPO paper [31].

The training is implemented in two steps: exploration and the actual training. During the exploration phase the policy collects data for a certain number of steps, after which the generalized advantage estimate is calculated, and the policy is then trained for a certain number of epochs. Both the number of epochs and the number of steps taken during exploration are hyperparameters defined at initialization.
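Below is a minimal sketch of the clipped objective in equation 27, written as a loss to be minimized; the tensor names and the clip value are illustrative and do not mirror the exact code of the project.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)                 # r_t(theta), equation 25
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                     # negative of L^CLIP, equation 27

loss = ppo_loss(torch.randn(64, requires_grad=True), torch.randn(64), torch.randn(64))
loss.backward()
```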

2.6.5 Exploration vs Exploitation

The exploration versus exploitation dilemma is a concept in reinforcement learning which concerns how an agent should implement actions during training. Exploration is often done by inserting some randomness into the actions implemented, which makes the policy sometimes choose worse actions over better ones. This could yield lower returns in the short term, but the additional information gathered could help to learn better policies in the future. Exploitation, on the other hand, means choosing the actions that yield the highest returns in the short term, often also referred to as a greedy strategy. This yields the highest reward based on the current information the policy has gathered about the environment, but could leave many states unexplored, some of which could yield even larger returns. This effectively means that the policy could get stuck in local optima. There is no objectively best choice for the ratio between exploration and exploitation, and it is an open problem in reinforcement learning.

2.7 Pepper

Pepper [17] is a humanoid robot from Softbank Robotics that can interact with people and, according to the manufacturer, perceive emotions and adapt its behavior accordingly. Pepper has a 3D camera and two RGB cameras in his facial area and can recognize faces and objects. He also has four directional microphones which can detect the direction of sound sources and analyze the tone of voice and what is being said in order to model users' emotional context. Pepper moves around on three multi-directional wheels and uses 20 motors in order to perform fluent and human-like motions. He has the ability to move his head, back and arms, and has a built-in battery which gives Pepper about 12 hours of autonomous interaction.

Figure 7: The humanoid robot Pepper [33] from Softbank Robotics.

In this project access to Pepper's arms is the main focus. Each arm has 6 degrees of freedom, constrained to certain ranges, listed in Table 1 and shown in Figure 8.

2.7.1 Choregraphe

A recommended first step when working with Pepper is the program Choregraphe [18]. In this program users can set up an interaction pipeline, visualize the robot's movements, define new movements, and more. A running instance of Choregraphe is shown in Figure 9.


Table 1: Arm actuators and ranges [34]

Joint           | Name, Motion (rotation axis)             | Range (degrees)
RShoulderPitch  | Right shoulder joint, front and back (Y) | -119.5 to 119.5
RShoulderRoll   | Right shoulder joint, right and left (Z) | -89.5 to -0.5
RElbowYaw       | Right shoulder joint, twist (X')         | -119.5 to 119.5
RElbowRoll      | Right elbow joint (Z')                   | 0.5 to 89.5
RWristYaw       | Right wrist joint (X')                   | -104.5 to 104.5
LShoulderPitch  | Left shoulder joint, front and back (Y)  | -119.5 to 119.5
LShoulderRoll   | Left shoulder joint, right and left (Z)  | 0.5 to 89.5
LElbowYaw       | Left shoulder joint, twist (X')          | -119.5 to 119.5
LElbowRoll      | Left elbow joint (Z')                    | -89.5 to -0.5
LWristYaw       | Left wrist joint (X')                    | -104.5 to 104.5

In order to communicate with the robot from outside Choregraphe, the qi-python-API [35] was used. This framework makes it possible to programmatically control Pepper, which was relevant for this project.

2.8 OpenAI’s Gym

In the field of reinforcement learning there has been a lack of common environments for researchers to use as baselines for reinforcement learning algorithms. Small differences in environments can have a large impact on the performance of reinforcement learning algorithms, which makes them hard to compare and makes research results hard to reproduce. In an attempt to fill this void, OpenAI [36] constructed the Gym framework [16, 37]. Gym contains a collection of partially observable Markov decision processes (POMDPs) in environments categorized as classic control, algorithmic, Atari, board games and both 2D and 3D robot simulators. Gym is user friendly and has become a popular starting point for experimenting with reinforcement learning as well as for serious research. The Gym framework provides a simple and convenient way to create Gym wrappers for custom environments.

A Gym environment consists of a few specific functions, mainly the step, reset and render functions. The step function handles the transition between time steps in the environment: it takes an action as input and returns the state of the environment, a reward and information about whether the episode is done or not. The reset function is called at the start of training and when the environment resets between episodes; this function commonly only returns a state, which is the initial state of an episode. The last of the three most common Gym functions is the render function, which does exactly what its name implies and renders the episode.
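The skeleton below shows the shape of such a Gym wrapper; the observation and action spaces, the state contents and the reward are placeholders, not the environments described later in this thesis.

```python
import gym
from gym import spaces
import numpy as np

class PoseMimicEnv(gym.Env):
    """Minimal custom environment exposing the step, reset and render functions."""

    def __init__(self, n_joints=2):
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n_joints,), dtype=np.float32)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_joints,), dtype=np.float32)
        self.state = np.zeros(n_joints, dtype=np.float32)

    def reset(self):
        self.state = np.zeros_like(self.state)        # initial state of an episode
        return self.state

    def step(self, action):
        self.state = self.state + action              # simulate the consequence of the action
        reward = -float(np.square(self.state).sum())  # placeholder reward
        done = False
        return self.state, reward, done, {}

    def render(self, mode="human"):
        print(self.state)                             # minimal rendering
```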


Figure 8: The movement specification of Pepper's arms [34]. (a) Left arm. (b) Right arm.

2.8.1 Roboschool

In the OpenAI Gym framework there are many available simulation environments for reinforcement learning. Some of these are simple, like the "CartPole" environment seen in Figure 10, while others are more complex, like the Atari and robot simulation environments. When Gym was first released, its robot simulation environments all depended on the physics simulation engine Mujoco [38]. However, in order to use the Mujoco engine a license is required; for students one license for one computer can be obtained for free, but for other cases there are associated fees. This made the Gym environments dependent on Mujoco more difficult for practitioners to use and led to a demand for a more easily accessible alternative. A solution to this was the Roboschool environments shown in Figure 11. These environments do not depend on Mujoco but on Bullet [39], which is free to use.

3 Method

This section describes the work done in the project and how the experiments were set up and implemented. The first subsection describes the implementation of the reinforcement learning algorithm, followed by sections on how the custom environments were created and on the design of the reward function and the neural networks. The method section ends with the experiments used to investigate whether the approach was functional.


Figure 9: A view of an open instance of Choregraphe showing the main program to the left and the robot view to the right.

Figure 10: The Gym CartPole environment, where the goal is to balance a pole on a moving cart.

3.1 Learning Algorithm

The first focus of the project was to implement the learning algorithm, Proximal Policy Optimization, described in section 2.6.4. This algorithm was chosen because of its promising performance and efficiency on robotic simulation tasks [31]. The PPO algorithm was implemented in PyTorch [40, 41], a framework for machine learning with an emphasis on neural network optimization. The entire algorithm was implemented from scratch in order to have full control and to get practical experience of working with neural networks in a reinforcement learning context.

Figure 11: An example from the Roboschool environment showing a humanoid, a "hopper" and a "halfcheetah" doing a locomotion task.

The PPO algorithm requires specific data storage, exploration and training functions. The algorithm is a form of on-policy learning, which means that data sampled by following the current policy is used for training and then discarded in favor of new data sampled by the updated policy. The most relevant parts of the implementation are explained below, but for exact details the interested reader is referred to the PPO paper [31] and the implemented code [42].

First the ability to store data from several environments run in parallel was implemented. In this project the agents explore multiple instances of an environment at once, and the data from these instances are collected simultaneously after each environment has returned data from a single step. The PPO algorithm works by exploring the environment and collecting data for a fixed number of steps and then using these data points to compute the values relevant for training. It is important to keep track of data points from the different processes, because the values needed for training are temporally dependent, meaning that values from one process at one time step need an explicit connection to the values of that process at the next time step.

After the data storage was implemented, the exploration and training functions, which are specific to PPO, were created. The exploration function calls the step function in the environments, handles and transforms the returned data in appropriate ways and iterates until exploration is finished. PyTorch is an automatic differentiation framework: for the actual training one defines the specific PPO loss function, and the mechanisms of PyTorch then run backpropagation and update the network using an optimizer. The chosen optimizer for this project was Adam, explained in section 2.4.2.
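The sketch below illustrates the kind of per-process bookkeeping described above: rollout tensors indexed by (time step, process), so that returns can be propagated backwards through time within each process independently. Shapes and names are illustrative, not the actual storage class of the project.

```python
import torch

num_steps, num_procs, obs_dim = 128, 8, 10   # example rollout dimensions
gamma = 0.99

observations = torch.zeros(num_steps + 1, num_procs, obs_dim)
rewards = torch.zeros(num_steps, num_procs)
returns = torch.zeros(num_steps + 1, num_procs)

# Indexing by (time step, process) keeps the temporal connection within each process,
# so the discounted return is computed backwards through time for all processes at once.
for t in reversed(range(num_steps)):
    returns[t] = rewards[t] + gamma * returns[t + 1]
```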


3.2 Pepper Environment

The purpose of Pepper in this project was to enable training on a real robot. The specific traits of Pepper were not the focus, and the design of the Pepper implementation was meant to represent a setup for any generic robot. The many abilities intrinsic to Pepper, really his main functionalities such as emotion recognition, face recognition and object detection, are not exploited; only access to the actuators that control his movements is utilized. The Choregraphe suite rendered Pepper and simulated all the actions he implemented. The Pepper environment was written as a Gym wrapper, described briefly in section 2.8, where the step function would utilize the qi-python-API [35] to send actions to Choregraphe and then receive relevant information back. The goal was to define the actions of this environment as the actuator torques applied in the joints. The idea of using the direct torques as actions is that it requires no prior movement abilities and fulfills the criterion of being as agent agnostic as possible. The state returned would be the current configuration of the actuator values, an observation would be the pixel rendering of Pepper, and the reward would be a function of the distance between the current state and the target state.

Because of the way Pepper is constructed and programmed, it proved difficult to get control over the actual torques in the actuators. However, an absolute or a relative angle of each joint could be used as an action. In other words, one could define a set of angles and, regardless of the current configuration of Pepper, the intrinsic movement abilities would implement the movements necessary to reach the defined configuration. This would assume prior knowledge of movement and would therefore make the training less agent agnostic. To circumvent this, a small incremental angle was used to define an action. This incremental angle needed to be constrained in such a way as to make the increments small enough not to rely heavily on prior knowledge, but not so small that exploration in the reinforcement learning setup would be too constrained. The solution was to make each actuator action dependent on a fraction of its total angle range, referred to as the max angle. Another parameter in the qi-API defined how fast Pepper's movements were implemented, namely how much torque could be applied during a movement, and was a fraction between 0 and 1. The parameter values are listed in Table 2. A complete action was defined by the output of the policy network, a value between -1 and 1 for each actuator, multiplied by the specific max angle of the individual actuator.
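Below is a sketch of how a policy output in [-1, 1] could be mapped to small incremental joint angles; the two joint ranges are taken from Table 1 and the 5% fraction from Table 2, but the helper itself is an illustrative assumption, not the thesis code.

```python
import numpy as np

# Example joint ranges in degrees (from Table 1) and the max angle fraction (from Table 2).
JOINT_RANGES = {"RShoulderPitch": (-119.5, 119.5), "RShoulderRoll": (-89.5, -0.5)}
MAX_ANGLE_FRACTION = 0.05

def action_to_angle_increments(action):
    increments = {}
    for a, (name, (low, high)) in zip(action, JOINT_RANGES.items()):
        max_angle = MAX_ANGLE_FRACTION * (high - low)            # per-joint maximum increment
        increments[name] = float(np.clip(a, -1.0, 1.0)) * max_angle
    return increments                                            # relative angles to send over the qi API

print(action_to_angle_increments(np.array([0.5, -1.0])))  # e.g. {'RShoulderPitch': 5.975, ...}
```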

Movements were simulated in a window in Choregraphe called "robot view", but the pixel values could not be transmitted directly over the API. The solution was to write a script that found the coordinates of the "robot view" window and retrieved the pixels from the screen, i.e. the observations. Because of this, the "robot view" window always had to be visible, which made the rendering function of the Gym wrapper obsolete; the rendering took place in Choregraphe. The final thing to define was an initial starting state, and for convenience the already implemented "StandInit" pose, the built-in starting pose of Pepper, was used.
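One way such a screen-capture script could look is sketched below, here using the mss library to grab a fixed screen region. The window coordinates are placeholders; in the project the coordinates of the "robot view" window were located programmatically.

```python
import numpy as np
import mss

# Placeholder coordinates of the Choregraphe "robot view" window on screen.
ROBOT_VIEW = {"top": 100, "left": 100, "width": 256, "height": 256}

def get_observation():
    """Grab the pixels of the "robot view" region and return them as an RGB array."""
    with mss.mss() as sct:
        shot = sct.grab(ROBOT_VIEW)                   # raw BGRA screenshot of the region
    frame = np.array(shot)[:, :, :3][:, :, ::-1]      # drop the alpha channel, BGR -> RGB
    return frame
```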

Table 2: Action parameter values

Parameter            Value   Description
Max angle            5%      Fraction of the maximum angle range for each joint
Max speed fraction   10%     Actions are careful rather than strong

3.3 Custom Roboschool Environment

The custom roboschool environments are built on top of OpenAI's roboschool [19]. The custom environments created for this project consist of an agent that is static in space with the ability to move certain limbs. The environments are implemented from two directions: the definition of the physical environment simulated by Bullet [39], and the wrapper for Gym. The Bullet part is the actual physics simulation, and the entire environment for the 3D simulation is defined in an XML file. This file specifies where the joints and actuators are located, and how they connect to each other and to the other body parts.

Two custom environments were implemented: the custom reacher environment and the custom humanoid environment. The custom reacher environment is a customization of the reacher environment already implemented in roboschool, and the other is an implementation of a more robot-like humanoid. The custom reacher environment consists of one arm moving in a plane with 2 degrees of freedom (DoF). The 2 DoF come from the fact that the arm only has 2 joints, both of which can only rotate around one axis. There were two versions of this environment: one where two spherical targets were visible and the goal was to get the color-coded joints to the corresponding targets, and one with no explicit target objects, where instead a target image and target joint state values were set in the Gym part of the implementation, outside of the simulation. The custom reacher environment without the explicit targets is shown in Figure 12a. The other environment is a virtual upper torso of a humanoid, which has 6 DoF, moves in 3D space and is shown in Figure 12b.

Figure 12: Custom environments. (a) An observation from the custom reacher environment. (b) An observation from the custom humanoid environment.

When designing an environment for reinforcement learning it is important to note that every detail might play a part in the final outcome, and therefore every decision should be as well thought through as possible. In Bullet there are several physical objects that are attached to each other and can exert friction or dependencies on one another. The reacher environment was created to be as basic as possible: there was no friction, and the movements were defined in a plane perpendicular to gravity (no forces). The joint movements in the plane were unconstrained, the body parts could not interact with each other and no collisions were possible. The humanoid environment was meant to be more human-like and therefore more complex: the movements were constrained to model how humans can move, based on the humanoids defined in roboschool, collisions were made possible and the limbs could interact.

The actions for the two custom environments were torques, represented by continuous values between -1 and 1, one value for each actuator. The observation was the pixel values of the agent, and the state consisted of the coordinates of the joints, their velocities and the distance vector between the robot parts and the targets. The reward was calculated from the distance vector between the robot and the targets.
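Schematically, a step in such an environment could assemble these quantities as below. All helper methods (apply_torques, advance_simulation, read_joints, key_points, render_pixels) are hypothetical placeholders for the underlying Bullet and rendering calls, not the actual roboschool code.

```python
import numpy as np

def step(self, action):
    # One torque per actuator, clipped to the valid range [-1, 1].
    torques = np.clip(action, -1.0, 1.0)
    self.apply_torques(torques)                # hypothetical hook into the Bullet simulation
    self.advance_simulation()                  # hypothetical: step the physics forward one tick

    joint_pos, joint_vel = self.read_joints()  # hypothetical read-back of joint angles and velocities
    dist_vec = self.target_points - self.key_points(joint_pos)

    state = np.concatenate([joint_pos, joint_vel, dist_vec.ravel()])
    observation = self.render_pixels()         # pixel values of the agent
    reward = -np.linalg.norm(dist_vec)         # distance-based reward, see section 3.4
    done = False
    return state, reward, done, {"observation": observation}
```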

3.4 Reward Function

In reinforcement learning the reward function is what determines which actions and states are "good" or "bad", and changing the reward function changes the behavior that an agent in the environment learns. In this project the goal is for an agent to move its limbs such that specific key points on the agent align with specified target points. Therefore the reward is defined as a function of these key points, specifically the distance vector between them. However, there are many different ways to define the exact properties of this function, and a smaller qualitative experiment was conducted in order to decide which specific reward function to use. Three reward functions were tested in this project: the absolute distance reward, the velocity reward, and the latter with associated costs. The absolute reward function, R_a, is defined by

D(s, s_{target}) = \sqrt{\sum_{i=0}^{n} |s_i - s_{target,i}|^2}     (28)

R_{a,t}(s, s_{target}) = -D_t(s, s_{target}),     (29)

at time t, where the reward is the negative of the distance D_t(s, s_{target}), the L2-norm or Euclidean distance between the key parts on the robot and the targets. This reward function states that being far away from the target is worse than being close, and the maximum value, the best possible reward, is zero. The second reward function is the velocity reward, which includes a time dependency between subsequent states and is defined as

R_{v,t} = D_{t-1} - D_t.     (30)

This function states that it is good to move towards the target position and bad to move away from it. Moving towards the target will yield a positive reward, moving away a negative reward and not moving at all yields zero reward. The final reward function is an extension of the velocity reward function and adds penalties or cost to the reward. The velocity cost reward is then defined as

R_{c,t} = R_{v,t} - \mathrm{Costs},     (31)

where Costs is a function that may depend on the specific action taken, or a penalty for reaching a state defined as bad.
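The three candidate reward functions translate directly into code; a minimal sketch of equations (28)-(31) is given below. The exact form of the cost term is an assumption here (an action-dependent penalty, matching the 0.1·|action| cost used in the experiment described next).

```python
import numpy as np

def distance(state, target):
    # Eq. (28): Euclidean distance between the key points and their targets.
    return np.linalg.norm(state - target)

def absolute_reward(state, target):
    # Eq. (29): the best possible reward is zero, reached exactly at the target.
    return -distance(state, target)

def velocity_reward(prev_state, state, target):
    # Eq. (30): positive when moving towards the target, negative when moving away.
    return distance(prev_state, target) - distance(state, target)

def velocity_cost_reward(prev_state, state, target, action, cost_coef=0.1):
    # Eq. (31): the velocity reward minus an action-dependent cost.
    cost = cost_coef * np.abs(action).sum()
    return velocity_reward(prev_state, state, target) - cost
```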

An experiment was conducted on the custom reacher environment with spherical targets to inspect which reward function was the best candidate. The cost function for the velocity cost reward was defined as Costs = 0.1 · |action|. The different reward functions were trained using the same network architecture and for the same number of frames, and the policy with the highest test score was chosen to represent each reward function. These experiments were qualitative and not extensive, because the framework should function with any of the reward functions even though the learning efficiency might vary. During development it is vital to keep as many things as possible unchanged while adding new features, so instead of picking a reward function at random or spending time on extensive testing, three policies were trained and a reward function was chosen based on the recorded videos.
