Impact of observation noise and reward sparseness on Deep Deterministic Policy Gradient when applied to inverted pendulum stabilization

HARIS POLJO, ADAM BJÖRNBERG

Degree Project in Computer Science, DD142X (first cycle, 15 credits)
Date: June 4, 2019
Supervisor: Alexander Kozlov
Examiner: Örjan Ekeberg
KTH, School of Electrical Engineering and Computer Science
Stockholm, Sweden 2019
Swedish title: Effekten av observationsbrus och belöningsgleshet på Deep Deterministic Policy Gradient tillämpad på inverterad pendelstabilisering

Abstract

Deep Reinforcement Learning (RL) algorithms have been shown to solve complex problems. Deep Deterministic Policy Gradient (DDPG) is a state-of-the-art deep RL algorithm able to handle environments with continuous action spaces. This thesis evaluates how the DDPG algorithm performs in terms of success rate and results depending on observation noise and reward sparseness, using a simple environment. A threshold for how much Gaussian noise can be added to observations before algorithm performance starts to decrease was found between a standard deviation of 0.025 and 0.05. It was also concluded that reward sparseness leads to result inconsistency and irreproducibility, showing the importance of a well-designed reward function. Further testing is required to thoroughly evaluate the performance impact when noisy observations and sparse rewards are combined.

Sammanfattning

Deep Reinforcement Learning (RL) algorithms have been shown to be able to solve complex problems. Deep Deterministic Policy Gradient (DDPG) is a modern deep RL algorithm that can handle environments with continuous action spaces. This study evaluates how the DDPG algorithm performs in terms of success rate and results depending on observation noise and reward sparseness in a simple environment. A threshold for how much Gaussian noise can be added to observations before the algorithm's performance starts to decrease was found between a standard deviation of 0.025 and 0.05. It was also concluded that reward sparseness leads to inconsistent results and irreproducibility, which shows the importance of a well-designed reward function. Further tests are required to thoroughly evaluate the effect of combining noisy observations and sparse reward signals.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Scope and constraints

2 Background
  2.1 Reinforcement learning
    2.1.1 Value and return
    2.1.2 Value functions
    2.1.3 Bellman equations
    2.1.4 Learning Rate
    2.1.5 Exploration and exploitation
  2.2 Policy gradient methods
    2.2.1 Actor Critic
  2.3 Deep Neural Networks
  2.4 Deep Deterministic Policy Gradient (DDPG)
  2.5 Hyperparameter tuning
  2.6 Related work

3 Method
  3.1 Algorithm implementation
  3.2 Environment
    3.2.1 Hyperparameters
  3.3 Setup of experiments
    3.3.1 Noisy observations
    3.3.2 Sparse rewards
    3.3.3 Sparse rewards and noisy observations
  3.4 Evaluation

4 Results
  4.1 Noisy observations
  4.2 Sparse rewards
  4.3 Sparse rewards and noisy observations

5 Discussion
  5.1 Noisy observations
  5.2 Sparse rewards
  5.3 Noisy observations and sparse rewards
  5.4 Future work

6 Conclusions

Bibliography

A Hyperparameters

B Plots


Chapter 1 Introduction

For the novice computer science researcher, and perhaps for any mildly interested researcher, the momentum of deep Reinforcement Learning (RL) algorithms in recent years can hardly be escaped. In particular, model-free (black box) RL algorithms have been applied to solve a wide variety of complex tasks with astounding results [1]. These successes are, like many other recent machine learning achievements, in large part thanks to the utilization of deep neural networks, which enable accurate approximations of highly complex functions. In deep RL, these are applied in order to learn from trial-and-error interactions, and with model-free algorithms, only a reward function which defines desirable outcomes is required.

There are many great resources available for exploring and evaluating deep RL algorithms. OpenAI provides an extensive collection of open-source libraries, including the Gym framework for RL-ready simulations, and the Baselines library containing high-quality TensorFlow¹ implementations of state-of-the-art deep RL algorithms. Deep Deterministic Policy Gradient [2] is one such algorithm, able to effectively solve complex tasks in many different domains, for example controlling simulated robots and rockets [3].

These algorithms are usually trained with access to perfect knowledge about the environment. This facilitates reward function design and environment exploration accuracy, both essential for being able to solve tasks in an environment [4]. These algorithms may not work at all when applied to more complex environments which possess unfavorable properties, for example when environment observations stem from imperfect sensory input, or when a clear goal exists but there is no obvious measure of how close one is to achieving it.

¹ An end-to-end open source platform for machine learning.


1.1 Problem Statement

How does the OpenAI Baselines implementation of the Deep Deterministic Policy Gradient algorithm perform in terms of success rate and results depending on observation noise and reward sparseness?

1.2 Scope and constraints

To thoroughly evaluate a deep RL algorithm it has to be tested several times on the same environment. Depending on the complexity of the environment, training an algorithm can be very time-consuming, which becomes a problem when the same test has to be done several times. All experiments in this study are done with one algorithm implementation in a simple inverted pendulum stabilization simulation environment.


Chapter 2 Background

In this section, the core ideas behind RL are described, followed by some theory needed to understand the Deep Deterministic Policy Gradient (DDPG) algorithm. The section ends with related work regarding baseline implementations of deep RL algorithms and benchmarking.

2.1 Reinforcement learning

The idea of reinforcement learning algorithms is to have an agent interact with an environment and use trial-and-error to learn a decision-making rule (policy) with which it achieves a goal. In a basic setup, the interaction occurs in discrete time-steps and at each time-step t, the agent observes some representation of the environment's state s_t, selects an action a_t using its current policy, and receives a numerical reward signal r_t. This setup is formalized as a Markov Decision Process (MDP), commonly represented as a 4-tuple (S, A, P, R) where:

• S is the state space, the set of possible states

• A is the action space, the set of possible actions

• P(s_{t+1} | s_t, a_t) is the transition probability function, which returns the probability of transitioning to state s_{t+1} given the current state s_t and action a_t.

• R(s_t, a_t, s_{t+1}) is the reward function, which returns a numerical reward signal r_t ∈ ℝ measuring whether action a_t and the resulting state transition are good or bad.
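As a concrete illustration of this interaction loop, the sketch below steps a Gym environment with a placeholder random policy; the environment name, the episode length and the random policy are illustrative choices, assuming the classic Gym step API of the time (observation, reward, done, info).

    import gym

    # Agent-environment interaction: at each time-step t the agent observes s_t,
    # selects a_t with its policy, and receives a reward signal r_t.
    env = gym.make("Pendulum-v0")

    def policy(state):
        # Placeholder policy: sample a random action from the action space.
        return env.action_space.sample()

    state = env.reset()
    episode_return = 0.0
    for t in range(200):                              # one episode of 200 time-steps
        action = policy(state)                        # a_t
        state, reward, done, info = env.step(action)  # s_{t+1}, r_t
        episode_return += reward
        if done:
            break
    print("Undiscounted return:", episode_return)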


When a task is modeled as an MDP it is assumed to satisfy the Markov property, which states that the probability of any future state only depends on the current state of the environment [5]. In some applications, such as real-world problems involving sensory input, the agent is only able to partially observe the environment. For these, the trajectory of the process (a set of previously observed state-action pairs {…, (s_{t-1}, a_{t-1}), (s_t, a_t)}) may be needed to better describe the environment's true hidden state. Such problems fall under the more general framework of Partially Observed MDPs [6].

The state and action space can either be continuous or discrete. The game Snake has a discrete action space, the snake turns left or right, and the state space may also be discrete, for example a game board with 64×64 tiles. A car racing game might have a continuous action space, such as real-valued angles of the steering wheel, and a continuous state space, such as real-valued car positions and speeds. In Snake, the reward signal may be defined in terms of eating food and not crashing, and for the racing game, similar performance measures would be needed. Which action is selected by the agent in any given state is determined by its policy π(a|s), a mapping from states to actions. Achieving the goal, as defined by the reward function, is seen as finding an optimal or semi-optimal policy that maximizes the expected sum of future reward signals from any given state.

2.1.1 Value and return

The value of a state or state-action pair under a policy is the expected return when subsequent actions are selected by the policy forever or until termination.

Formally, the return G_t is defined as the undiscounted sum of all future reward signals:

G_t = r_{t+1} + r_{t+2} + r_{t+3} + …

Episodic tasks have clear episodes, shooting arrows at a target for example, while continuing tasks may go on indefinitely, balancing a plate on a stick for example. Longer episodes, and possibly infinitely long processes, can be dealt with by limiting the return to a horizon T:

G_t = r_{t+1} + r_{t+2} + r_{t+3} + … + r_T

or by discounting rewards obtained further in the future using a discount rate parameter γ:

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …


The latter is referred to as the discounted return and is the default approach in most cases, even in theory [5]. The value of γ is in [0, 1), which guarantees that the return G_t will converge [7].
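As a small worked example of the discounted return, the sketch below sums a sequence of rewards with a discount rate γ; the reward values and γ = 0.9 are illustrative and not taken from the experiments.

    def discounted_return(rewards, gamma):
        # G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
        g = 0.0
        for k, r in enumerate(rewards):
            g += (gamma ** k) * r
        return g

    # With gamma < 1 the sum converges even for very long reward sequences.
    print(discounted_return([-1.0, -1.0, -1.0], gamma=0.9))  # -1 - 0.9 - 0.81 = -2.71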

2.1.2 Value functions

A value function maps states and actions to their value, i.e. how much they are worth in terms of expected return. In general, both the transition probability function and the policy may be stochastic, so the expected return is an average of known or potential outcomes. The state-value function v_π(s) is defined as the expected discounted return under policy π starting from state s:

v_π(s) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]

The action-value function Q_π(s, a), also called Q-function, is defined as the expected discounted return under policy π after action a has been selected in state s:

Q_π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]

Intuitively, optimal policies maximize the value of any state and action, and thus share the same optimal value functions. The state-value function is suitable for policy evaluation since actions are exclusively selected by the policy, while the action-value function is suitable for policy improvement since it evaluates a potentially valuable state-action pair [5].

2.1.3 Bellman equations

Formulating the value functions recursively is quite straightforward and results in what are referred to as the Bellman equations, named after mathematician Richard E. Bellman. The value of a state or state-action pair is the reward signal received in the current time-step plus the discounted value of the state or state-action pair in the next time-step:

v_π(s_t) = E_π[ r_t + γ v_π(s_{t+1}) ]

Q_π(s_t, a_t) = E[ r_t + γ E_π[ Q_π(s_{t+1}, a_{t+1}) ] ]

Dynamic programming becomes a natural approach for solving RL problems.

In theory, the optimal value functions can be learned with an exhaustive search, and once learned, a greedy policy which always chooses the action with the most value according to the Bellman equations would be optimal. In practice, an exhaustive search is seldom feasible, so RL algorithms can instead aim to estimate the optimal value functions through experience [5].

2.1.4 Learning Rate

The weight of new experiences in comparison to old ones can be defined with a learning rate parameter α. For example, an estimated Q-function can be updated by averaging the old value of a state-action pair with the newly found value, weighted by α:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Applying this approach with a stochastic (ε-greedy) exploration policy results in the Q-learning algorithm, which serves as a base for more advanced algorithms, for example the Deep Deterministic Policy Gradient algorithm.
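A minimal sketch of this tabular update rule and an ε-greedy exploration policy is shown below; the discrete action set and the values of α, γ and ε are illustrative, and the thesis experiments themselves use continuous spaces rather than a table.

    import random
    from collections import defaultdict

    alpha, gamma, epsilon = 0.1, 0.99, 0.1      # illustrative hyperparameters
    actions = [0, 1]                            # illustrative discrete action set
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated value

    def select_action(state):
        # epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def q_update(s, a, r, s_next):
        # Q(s_t,a_t) <- Q(s_t,a_t) + alpha*[r_t + gamma*max_a Q(s_{t+1},a) - Q(s_t,a_t)]
        target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])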

2.1.5 Exploration and exploitation

In any RL algorithm there needs to be a tradeoff between exploration and exploitation. Exploration is about gathering more information, perhaps randomly, and exploitation is about taking advantage of what is already known.

RL algorithms are either on-policy or off-policy, depending on the exploration method. On-policy means that the algorithm explores the environment by sam- pling actions from the policy that is being learned. Off-policy means that the algorithm has one policy which is being learned to become the optimal policy, and another policy which is used for exploration [5].

2.2 Policy gradient methods

Instead of indirectly learning a policy by estimating optimal value functions, there are modern RL algorithms which search for an optimal or semi-optimal policy directly using policy gradient methods [2][8]. The intuition behind this approach is that a policy may be simpler to approximate than a value function, since the complexity of a value function by definition depends on the size of the state and action space. Policy gradient methods also offer stronger convergence guarantees, at the cost of not necessarily converging to the global optimum [9].


With policy gradient methods the policy is defined as a function parameterized by a vector θ ∈ ℝⁿ [5]. This parameterized policy can be described as a conditional probability π(a|s, θ), which says that the probability of the policy choosing action a depends on the given state s and the values of the parameters θ [5]. When θ is changed, the probability of choosing an action a in a given state s can also change. The question is then how to change θ so that the policy performs better.

The policy performs better when the expected total return increases. The total return can be calculated with the objective function J, which depends on the parameters θ of the policy π [10]:

J(θ) = Σ_{s ∈ S} d_π(s) v_π(s)

d_π(s) = lim_{t→∞} Pr(s_t = s | s_0, π)

The objective function depends on θ because changing θ can change the behavior of the policy, which in turn can change the expected total return.

Policy gradient methods work by calculating the gradient of the objective function with respect to the parameters θ of the policy π [11]. The gradient ∇J(θ) says in which direction the total expected reward increases the most. Gradient ascent can then be used to update the parameters θ with the formula θ_{t+1} = θ_t + α ∇J(θ_t), where the learning rate α determines how big a step is taken in the direction of the gradient. In theory, if the policy is repeatedly updated using gradient ascent, it improves and eventually maximizes the expected total return.
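The gradient-ascent update can be illustrated on a toy objective, as sketched below; the quadratic stand-in for J(θ) and the finite-difference gradient estimate are assumptions for illustration, not how policy gradients are estimated in practice.

    import numpy as np

    def J(theta):
        # Toy objective standing in for the expected total return; maximized at theta = 1.
        return -np.sum((theta - 1.0) ** 2)

    def grad_J(theta, eps=1e-5):
        # Finite-difference estimate of the gradient of J at theta.
        g = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            g[i] = (J(theta + d) - J(theta - d)) / (2 * eps)
        return g

    theta, alpha = np.zeros(3), 0.1
    for _ in range(100):
        theta = theta + alpha * grad_J(theta)   # theta_{t+1} = theta_t + alpha * grad J(theta_t)
    print(theta)                                # approaches the maximizer [1, 1, 1]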

2.2.1 Actor Critic

There is a class of algorithms that learn a policy and a value function concurrently [5]. The policy is called the actor and the value function is called the critic. The critic learns to approximate the value function, and the actor uses the critic to calculate the policy gradient, which is then used to update the policy.

2.3 Deep Neural Networks

An artificial neural network (ANN) is a function approximator inspired by how brains work. ANNs consist of several stacked layers of artificial neurons, where the neurons in one layer are all connected to the neurons in the next layer [12]. Every artificial neuron has an activation function, commonly the ReLU function [13]:

f(x) = max(0, x)

Every connection x_j to a neuron has an associated weight w_{ij}, and every neuron has a bias b_i. The output of the neuron can be calculated with the following formula:

y_i = f( b_i + Σ_{j=1}^n w_{ij} x_j )

By using the first layer as the input, the input can be propagated through the different layers to produce an output value from the last layer of the network [12].

The backpropagation algorithm is used to train the neural network to approximate a function [12]. Backpropagation works by calculating the error between the network's output for a given input and the wanted output. Using this error, the backpropagation algorithm updates the values of the weights and biases in the network so that the network better approximates the wanted function. An ANN is considered a deep neural network (DNN) if it has many layers and neurons, which is suitable for approximating more complex functions.
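A minimal NumPy sketch of the forward pass just described, assuming an illustrative two-layer network with randomly initialized weights and biases:

    import numpy as np

    def relu(x):
        # Activation function f(x) = max(0, x)
        return np.maximum(0.0, x)

    def layer(x, W, b):
        # y_i = f(b_i + sum_j w_ij * x_j), computed for all neurons of the layer at once
        return relu(W @ x + b)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 3)), np.zeros(16)   # 3 inputs -> 16 hidden neurons
    W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)    # 16 hidden neurons -> 1 output

    x = np.array([0.5, -0.2, 0.1])   # e.g. an observation
    hidden = layer(x, W1, b1)        # hidden layer activations
    output = W2 @ hidden + b2        # linear output layer
    print(output)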

2.4 Deep Deterministic Policy Gradient (DDPG)

Deep Deterministic Policy Gradient (DDPG) is a model-free deep RL algorithm for environments with continuous action spaces [2]. Model-free means that the algorithm only needs a reward function to learn to solve tasks in an environment. DDPG is also an off-policy, actor-critic method, implemented using two deep neural networks.

The first network is the actor, which is the learned policy μ_θ(s). The actor network takes the state as input and outputs one or several actions. Since the output of the actor network is deterministic, action noise is added to the output to help with exploration during training. The action noise can for example be defined as Gaussian noise.

The second network is the critic, which approximates the action-value function. It uses data sampled from a replay buffer D containing prior experiences. Each sample is a tuple containing the state s, action a, reward r, the next state s′ and a truth value d which says whether s′ is a terminal state. The sampled data and the Bellman equation are used to learn the action-value function. The critic network is trained by minimizing the mean-squared Bellman error (MSBE) given by the formula:

L(φ, D) = E_{(s,a,r,s′,d)∼D}[ ( Q_φ(s, a) − ( r + γ (1 − d) Q_{φ_targ}(s′, μ_{θ_targ}(s′)) ) )² ]

Minimizing this error means that the value predicted by the critic network and the target value given by r + γ (1 − d) Q_{φ_targ}(s′, μ_{θ_targ}(s′)) should be as close to each other as possible. The target is calculated using the target networks Q_{φ_targ}(s, a) and μ_{θ_targ}(s). The first target network has parameters that stay close to those of the critic network, which helps stabilize training. The second target network has parameters that stay close to those of the actor network. The target actor network is used to find an action which approximately maximizes Q_{φ_targ}(s, a); because the action space is continuous, it is not feasible to find the exact action which maximizes Q_{φ_targ}(s, a). The target network parameters are updated after some defined number of time-steps using the formula

φ_targ ← ρ φ_targ + (1 − ρ) φ

where ρ is a value between 0 and 1.
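The two computations just described, the MSBE target and the soft (Polyak) update of the target-network parameters, can be sketched as below; NumPy arrays stand in for network outputs and parameter vectors, and the function names, γ and ρ values are illustrative rather than the Baselines implementation.

    import numpy as np

    gamma, rho = 0.99, 0.995   # illustrative discount rate and Polyak coefficient

    def critic_targets(r, s_next, d, q_targ, mu_targ):
        # Target values for a minibatch: r + gamma * (1 - d) * Q_targ(s', mu_targ(s'))
        return r + gamma * (1.0 - d) * q_targ(s_next, mu_targ(s_next))

    def msbe(q_values, targets):
        # Mean-squared Bellman error over the sampled minibatch
        return np.mean((q_values - targets) ** 2)

    def polyak_update(params_targ, params):
        # phi_targ <- rho * phi_targ + (1 - rho) * phi
        return rho * params_targ + (1.0 - rho) * params

    # Example of the soft update with dummy parameter vectors:
    print(polyak_update(np.zeros(3), np.ones(3)))   # [0.005 0.005 0.005]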

2.5 Hyperparameter tuning

Deep RL algorithms have hyperparameters, which are parameters that change how the algorithm functions. Some examples of hyperparameters are the learning rate, the discount rate, and the number of nodes and layers in the DNNs. To make sure that a given algorithm performs well in a certain environment, hyperparameter tuning has to be conducted, where different values of the hyperparameters are used and evaluated to find which combinations work best [14].
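Such a tuning can, in its simplest form, be a grid search over candidate values, as sketched below; the parameter names, the value grids and the train_and_evaluate placeholder are hypothetical and not the procedure used for Appendix A.

    import itertools

    # Hypothetical value grids for a few common hyperparameters.
    grid = {
        "actor_lr":  [1e-4, 1e-3],
        "critic_lr": [1e-3, 1e-2],
        "gamma":     [0.95, 0.99],
    }

    def train_and_evaluate(actor_lr, critic_lr, gamma):
        # Placeholder for a real training run; a real implementation would train the
        # agent with these values and return e.g. the mean episode reward at the end.
        return -(actor_lr + critic_lr) - (1.0 - gamma)

    best_score, best_config = float("-inf"), None
    for values in itertools.product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = train_and_evaluate(**config)
        if score > best_score:
            best_score, best_config = score, config
    print(best_config, best_score)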

2.6 Related work

Many benchmarks of various baseline TensorFlow implementations of DDPG and other deep RL algorithms have been conducted on classical control problems, specifically on the OpenAI Gym simulation environments, one of which is inverted pendulum stabilization (Pendulum-v0). The creators of rllab benchmarked their implementations on many of the simulation environments available at OpenAI Gym [6]. One part of their benchmarking was on environments with noisy state representations and delayed actions. They added Gaussian noise with zero mean and a standard deviation of 0.1 to the state, and benchmarked several deep RL algorithms on the noisy environments. The results showed that the overall performance of the algorithms decreased compared to the results from training in environments with no noise and no action delay. They did not, however, test this on DDPG.

DDPG has also been evaluated in studies which measure the reproducibility of deep RL experiments using baseline algorithm implementations [15][14]. In these, several baseline implementations were tested, including OpenAI Baselines and rllab, using physics-engine-based environments available at OpenAI Gym. To ensure fairness, they ran five experimental trials for each evaluation. Islam et al. stated that "both intrinsic (e.g. random seeds, environment properties) and extrinsic sources (e.g. hyperparameters, codebases) of non-determinism" can affect the reproducibility of results [14].


Chapter 3 Method

3.1 Algorithm implementation

The OpenAI Baselines implementation of the DDPG algorithm will be used.

OpenAI's own Baselines repository contains implementations of several deep reinforcement learning algorithms, but lacks some features, such as creating plots of the data collected while training [16]. Therefore, another repository called Stable Baselines will be used. It is a fork of OpenAI Baselines with many useful features, including support for TensorBoard, which will be used to plot experimental data [17].
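A minimal sketch of how such a training run can be set up with Stable Baselines, assuming its DDPG interface at the time (TensorFlow 1 based) and using illustrative settings rather than the tuned hyperparameters listed in Appendix A:

    import gym
    import numpy as np
    from stable_baselines import DDPG
    from stable_baselines.ddpg.policies import MlpPolicy
    from stable_baselines.common.noise import NormalActionNoise

    env = gym.make("Pendulum-v0")
    n_actions = env.action_space.shape[-1]
    # Gaussian action noise for exploration (illustrative standard deviation).
    action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                     sigma=0.1 * np.ones(n_actions))

    model = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1,
                 tensorboard_log="./ddpg_pendulum_tensorboard/")
    model.learn(total_timesteps=200000)   # 1000 episodes of 200 time-steps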

3.2 Environment

The experiments will be run on the OpenAI Gym environment called Pendulum-v0 [18]. The environment consists of a pendulum and has a continuous state space, where states are represented as cos(θ), sin(θ) and θ̇ (the angular velocity), where θ ∈ ℝ is the angle of the pendulum.


The action space is also continuous, where an action a ∈ ℝ is a joint effort which applies force to the pendulum. The default reward function for the environment is defined as:

R_default(θ, a) = −(θ² + 0.1 θ̇² + 0.001 a²)

The goal of the agent is to point the pendulum upwards and remain at zero vertical angle, with the least rotational velocity and the least effort. The agent only receives negative rewards, with a maximum reward signal of zero, which would occur at upright equilibrium, and a minimum reward signal of −16.2736044. The task is continuing, as there is no guaranteed terminal state, so simulations are divided into fixed episodes of 200 time-steps. State transitions are deterministic, and the state and action spaces are relatively small, which greatly reduces the complexity of solving tasks in the environment.
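For reference, the default reward can be written directly as a function of θ, θ̇ and the action a, as sketched below; the angle_normalize helper mirrors the wrapping of θ into [−π, π] done in the Gym implementation, and the worst-case values used in the check (|θ| = π, |θ̇| = 8, |a| = 2) are the environment's bounds.

    import numpy as np

    def angle_normalize(theta):
        # Wrap the angle into [-pi, pi], as done in the Gym Pendulum implementation.
        return ((theta + np.pi) % (2 * np.pi)) - np.pi

    def default_reward(theta, theta_dot, action):
        # R_default(theta, a) = -(theta^2 + 0.1 * theta_dot^2 + 0.001 * a^2)
        return -(angle_normalize(theta) ** 2
                 + 0.1 * theta_dot ** 2
                 + 0.001 * action ** 2)

    # Worst case: pendulum hanging down at maximum speed and maximum torque.
    print(default_reward(np.pi, 8.0, 2.0))   # approximately -16.2736044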

This environment was chosen mainly due to its flexibility and the potential for drawing parallels to robotic control tasks. Simplicity was another factor, due to time constraints and insufficient computing resources for carrying out evaluations on more complex problems, which would take hours to solve. Limiting tests to one environment also has the benefit of letting us narrow down test result differences to specific environment properties.

3.2.1 Hyperparameters

The hyperparameters for DDPG that will be used are tuned for solving the Pendulum-v0 environment with no noise or changes to how sparse the rewards are. The hyperparameters can be found in appendix A.

3.3 Setup of experiments

The experiments are divided into three sets: one for testing the impact of noisy observations, one for testing the impact of sparse rewards, and one for testing the impact when noisy observations and sparse rewards are combined. All experiments are run five times in order to get results that reflect average performance, as there is result inconsistency between sessions. One run is 1000 episodes, which is 200k time-steps in the environment in total. The following subsections describe how the experiments were set up.


3.3.1 Noisy observations

This set of experiments aims to measure the impact of adding noise to the state representations (observations) seen by the agent. The noise is defined as Gaussian noise with mean 0 and a standard deviation σ which varies from experiment to experiment. For every time-step of training in the environment, noise is generated with the same dimension as the state dimension. The noise is then added to the state:

(cos(θ), sin(θ), θ̇) + (ε_1, ε_2, ε_3),    ε_1, ε_2, ε_3 ∼ N(0, σ)

Experiment   1 (no noise)   2        3       4      5       6      7
σ            N/A            0.0001   0.001   0.01   0.025   0.05   0.1

Table 3.1: Noise standard deviations for experiments 1-7
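One way to inject such per-time-step observation noise is a Gym observation wrapper, as sketched below; the wrapper class is our own illustration under the classic Gym API, not the exact code used for the experiments.

    import gym
    import numpy as np

    class NoisyObservationWrapper(gym.ObservationWrapper):
        """Adds zero-mean Gaussian noise with standard deviation sigma to every observation."""

        def __init__(self, env, sigma):
            super().__init__(env)
            self.sigma = sigma

        def observation(self, obs):
            noise = np.random.normal(loc=0.0, scale=self.sigma, size=obs.shape)
            return obs + noise

    # Example: experiment 5 uses sigma = 0.025.
    env = NoisyObservationWrapper(gym.make("Pendulum-v0"), sigma=0.025)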

3.3.2 Sparse rewards

This set of experiments aims to measure the impact of reducing how often the agent receives rewards. With the Pendulum-v0 environment, reward abundance can be gradually decreased by limiting rewards to a range around the goal state. This was accomplished using a reward limit parameter r_limit, resulting in the following reward function:

R(θ, a) = −16.2736044                    if |θ| > π · r_limit
R(θ, a) = R_default(θ, a) / r_limit      otherwise

If θ is outside the range [−π, π] · r_limit, the minimum reward is given; otherwise the reward is calculated using the default reward function.

Experiment   8     9     10    11
r_limit      0.8   0.6   0.4   0.2

Table 3.2: Reward limits for experiments 8-11
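This limited reward can be applied on top of the default environment with a Gym wrapper, as sketched below; recovering θ from the observation via arctan2 and the wrapper itself are our own illustration, not the exact experiment code.

    import gym
    import numpy as np

    MIN_REWARD = -16.2736044

    class SparseRewardWrapper(gym.Wrapper):
        """Gives the minimum reward when |theta| > pi * r_limit, otherwise the
        default reward divided by r_limit."""

        def __init__(self, env, r_limit):
            super().__init__(env)
            self.r_limit = r_limit

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            cos_theta, sin_theta, _ = obs
            theta = np.arctan2(sin_theta, cos_theta)
            if abs(theta) > np.pi * self.r_limit:
                reward = MIN_REWARD
            else:
                reward = reward / self.r_limit
            return obs, reward, done, info

    # Example: experiment 9 uses r_limit = 0.6.
    env = SparseRewardWrapper(gym.make("Pendulum-v0"), r_limit=0.6)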

3.3.3 Sparse rewards and noisy observations

This set of experiments aims to measure algorithm performance when noisy observations and sparse rewards are combined. These experiments are based on the results from the two previous sets. Noise is not tested above the upper bound (σ = 0.025) that DDPG could handle with the default reward function. The same goes for sparse rewards: reward limits below the lower bound (0.6), below which performance began to degrade without noise, are not tested.

Experiment   12       13      14     15
σ            0.0001   0.001   0.01   0.025
r_limit      0.8      0.8     0.8    0.8

Table 3.3: Noise standard deviations and reward limits for experiments 12-15

Experiment   16       17      18     19
σ            0.0001   0.001   0.01   0.025
r_limit      0.6      0.6     0.6    0.6

Table 3.4: Noise standard deviations and reward limits for experiments 16-19

3.4 Evaluation

Algorithm performance will be measured in terms of episode reward, which is the total reward received during one episode (200 time-steps) in the environment. Inverted pendulum stabilization is an unsolved environment, meaning there is no set point at which it can be considered solved. For the sake of this study, success (solved) will be defined as having the episode reward converge around −500 or higher before training has terminated. This threshold is derived from experiment 1, which will be used as reference. If the rate of convergence was consistent between all five sessions of an experiment, the mean episode reward across all sessions is plotted together with the reference experiment. Otherwise, all five sessions are plotted to visualize success rate and convergence time variability. Convergence time is the number of time-steps before success in a successful training session, and convergence time variability is defined in terms of the standard deviation of the convergence times in an experiment.
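The success criterion and convergence time could be computed from a session's episode rewards roughly as sketched below; the smoothing factor and threshold follow the values stated above, while the function itself, which takes the first crossing of the smoothed curve as the convergence point, is our own simplification.

    def convergence_time(episode_rewards, threshold=-500.0, smoothing=0.9,
                         steps_per_episode=200):
        """Return the time-step at which the smoothed episode reward first reaches
        the threshold, or None if the session never succeeds."""
        smoothed = None
        for i, r in enumerate(episode_rewards):
            smoothed = r if smoothed is None else smoothing * smoothed + (1 - smoothing) * r
            if smoothed >= threshold:
                return (i + 1) * steps_per_episode
        return None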


Chapter 4 Results

Here, the results from the three sets of experiments are presented. The y-axis of the plots in section 4.1 shows the mean episode reward, and in section 4.2 it shows the episode reward. The x-axis in both sections 4.1 and 4.2 shows the total training steps completed. The graphs were smoothed using TensorBoard's smoothing function with factor 0.9, which reduces the clutter seen in graphs of less successful training sessions. In section 4.3, the results are summarized using plots which show convergence times and success rates.

4.1 Noisy observations

This section shows six plots corresponding to experiments 2-7 proposed in the method section. Each shows the mean episode reward across all five sessions, since convergence times are consistent. Every plot also contains the result of experiment 1 (dark blue graph) as a reference for how well agents performed in a noisy environment compared to an environment with no noise.

From Figures 4.1, 4.2, 4.3 and 4.4 it can be seen that DDPG in a noisy environment, with Gaussian noise of standard deviation 0.0001 to 0.025, has a mean episode reward which is close to that of DDPG in a noise-free environment. In Figures 4.5 and 4.6, the mean episode reward increased compared to the mean episode reward at 0 time-steps, but the final result after 200k time-steps does not have a mean episode reward in the same region as when DDPG is used in a noise-free environment.


Figure 4.1: Orange graph corresponds to experiment 2 (σ = 0.0001)

Figure 4.2: Cyan graph corresponds to experiment 3 (σ = 0.001)

Figure 4.3: Red graph corresponds to experiment 4 (σ = 0.01)


Figure 4.4: Green graph corresponds to experiment 5 (σ = 0.025)

Figure 4.5: Gray graph corresponds to experiment 6 (σ = 0.05)

Figure 4.6: Pink graph corresponds to experiment 7 (σ = 0.1)


4.2 Sparse rewards

This section shows four plots corresponding to experiments 8-11 proposed in the method section. There is one figure per experiment, and each contains all five sessions to visualize success rate and convergence time variability.

In experiments 8 and 9, the success rate is almost 100%, except for an outlier (dark blue) in experiment 8. The average convergence time is higher in experiment 8, disregarding the outlier.

In experiments 10 and 11, the success rate dropped drastically. The algorithm did reach success in two of the training sessions for experiment 10, and all sessions were showing signs of progress. The agent was close to success in one of the sessions (dark blue) for experiment 11, and was showing progress in all sessions except one (cyan).

Figure 4.7: All five sessions of experiment 8, 80% success rate (r_limit = 0.8)

Figure 4.8: All five sessions of experiment 9, 100% success rate (r_limit = 0.6)


Figure 4.9: All five sessions of experiment 10, 40% success rate (r_limit = 0.4)

Figure 4.10: All five sessions of experiment 11, 0% success rate (r_limit = 0.2)

4.3 Sparse rewards and noisy observations

In this section, the outcomes of experiments 12-15 and 16-19 are summarized in the form of two graphs which show convergence times and success rates. The error bars show convergence time variability in terms of standard deviation. The eight plots corresponding to the individual experiments can be found in appendix B.


The success rates for experiments 12-15 with reward limit 0.8 were fairly high, even reaching 100% in experiment 15 in spite of having the largest noise standard deviation. The average success rate of these experiments is 75%. Convergence time variability seems almost constant, apart from experiment 14 (σ = 0.01), for which it is higher than average.

Figure 4.11: Convergence times and success rates for experiments 12-15

The success rates of experiments 16-19 with reward limit 0.6 are lower, with a 50% average. The highest success rate of 80% is seen in experiment 16 with the smallest noise standard deviation, and the second highest is seen in experiment 19 with the largest noise standard deviation. Convergence time variability is close between experiment 18 (σ = 0.01) and experiment 19 (σ = 0.025). It is lower for experiment 16 (σ = 0.0001), and it could not be computed for experiment 17 (σ = 0.001) since only one session was successful.

Figure 4.12: Convergence times and success rates for experiments 16-19


Chapter 5 Discussion

5.1 Noisy observations

The standard deviation of the noise does not have any significant effect on the mean episode reward as it increases from 0.0001 to 0.025. However, given the substantial decrease in mean episode reward when the standard deviation was increased to 0.05, there seems to exist a threshold between 0.025 and 0.05 where the mean episode reward starts to decrease. This threshold might differ if hyperparameter tuning were done separately for each experiment, as tuning is a major factor dictating whether the algorithm will perform well in an environment [14].

5.2 Sparse rewards

This set of experiments showed a correlation between reward sparseness, success rate, and convergence time variability. In figures 4.7 and 4.8 it can be seen that the majority of the sessions for these experiments managed to succeed; however, as the reward limit reached 0.4 and 0.2 there was a drastic shift in the rate of success. Convergence time variability steadily increases as rewards become more sparse. The main impact of reward sparseness is thus result inconsistency, and whether a run is successful or not may come down to randomly getting a series of desirable state-action trajectories in a row, so that the networks are updated enough to push these actions. Since this is a model-free algorithm [2] and reward sparseness affects performance, it can be concluded that a reward function which consistently gives useful reward signals is needed in order for DDPG to solve a task effectively and consistently. This poses a significant challenge for more complex environments and tasks in which perfect knowledge is unavailable, as designing a reward function may be more complicated and algorithm training more time-consuming [4].

5.3 Noisy observations and sparse rewards

The average success rate for each reward limit was lower in experiments 12-19 with noise than in experiments 8 and 9 without noise. The average success rate went from 80% to 75% for the experiments with reward limit 0.8, and from 100% to 50% for the experiments with reward limit 0.6. However, upon examining the result of experiment 15 with σ = 0.025 in figure 4.11, we can see that the success rate is higher compared to experiments with the same reward limit but lower noise, which is counterintuitive, as one would expect higher noise to lead to worse results.

Examining the success rates in figures 4.11 and 4.12, we can see that they are at their highest for the lowest (0.0001) and the highest (0.025) noise standard deviations. It is therefore possible that certain noise levels yield higher success rates. The convergence times in figure 4.11 seem to increase as noise increases, and the variability is approximately constant. The same cannot be said for the convergence times in figure 4.12, which appear more random. In summary, the results from these experiments are inconclusive, as they are too inconsistent to make any clear judgments with so few runs of each experiment. More runs per experiment would be needed to get a better estimate of the success rate, convergence time and variability for each case.

5.4 Future work

As mentioned, repeating the set of experiments which combine noisy observations and sparse rewards could possibly yield more conclusive results regarding the combined impact of these environment properties.

Due to a lack of computing power, and having to rerun the same experiment five times to get a better estimate of the performance of the algorithm in a certain environment, we could only run the experiments in a simple environment. If DDPG had failed in this simple environment, it would not have been worth trying in more complex environments, thereby avoiding wasted computing resources. However, since DDPG is able to manage Gaussian noise with mean 0 and standard deviation up to at least 0.025, it might be worth trying the same experiment in a more complex environment. A more complex environment would be one with more than the three-dimensional state space of the Pendulum-v0 environment. With a higher-dimensional state space, the total error of the observation would increase, which could mean that DDPG would not be able to handle the same noise level as it could in the Pendulum-v0 environment. Running the same experiment in a complex environment would clarify these questions.


Chapter 6

Conclusions

This thesis can conclude that the deep RL algorithm DDPG can handle at least Gaussian noise with mean 0 and standard deviation 0.025 in the simple environment Pendulum-v0, and that there exists a threshold somewhere between noise standard deviations 0.025 and 0.05 where the mean episode reward starts to decrease. Results become less consistent as rewards become more sparse, highlighting the importance of having a well-designed reward function.

Further testing is required to fully evaluate the impact when noisy observations and sparse rewards are combined.


Bibliography

[1] Nick Statt. OpenAI's Dota 2 AI steamrolls world champion e-sports team with back-to-back victories. 2019. URL: https://www.theverge.com/2019/4/13/18309459/openai-five-dota-2-finals-ai-bot-competition-og-e-sports-the-international-champion (visited on 05/02/2019).

[2] Timothy P. Lillicrap et al. "Continuous control with deep reinforcement learning". In: (2015).

[3] OpenAI Gym. Leaderboard. 2019. URL: https://github.com/openai/gym/wiki/Leaderboard (visited on 05/02/2019).

[4] Alex Irpan. Deep Reinforcement Learning Doesn't Work Yet. https://www.alexirpan.com/2018/02/14/rl-hard.html. 2018.

[5] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018.

[6] Yan Duan et al. "Benchmarking Deep Reinforcement Learning for Continuous Control". In: (2016).

[7] Eric Weisstein. Ratio Test. URL: http://mathworld.wolfram.com/RatioTest.html (visited on 05/01/2019).

[8] John Schulman et al. "Proximal Policy Optimization Algorithms". In: (2017).

[9] Jens Kober, J. Andrew Bagnell, and Jan Peters. "Reinforcement learning in robotics: A survey". In: The International Journal of Robotics Research 32.11 (2013), pp. 1238-1274. ISSN: 0278-3649.

[10] Richard S. Sutton et al. "Policy Gradient Methods for Reinforcement Learning with Function Approximation". In: Proceedings of the 12th International Conference on Neural Information Processing Systems. NIPS'99. Denver, CO: MIT Press, 1999, pp. 1057-1063. URL: http://dl.acm.org/citation.cfm?id=3009657.3009806.

[11] Lilian Weng. Policy Gradient Algorithms. 2018. URL: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#policy-gradient-algorithms (visited on 04/20/2019).

[12] Raúl Rojas. Neural Networks: A Systematic Introduction. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996. ISBN: 9783642610684.

[13] Ian Goodfellow. Deep Learning. Adaptive Computation and Machine Learning. 2016. ISBN: 9780262035613.

[14] Riashat Islam et al. "Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control". In: (2017).

[15] Peter Henderson et al. "Deep Reinforcement Learning that Matters". In: (2017).

[16] Prafulla Dhariwal et al. OpenAI Baselines. https://github.com/openai/baselines. 2017.

[17] Ashley Hill et al. Stable Baselines. https://github.com/hill-a/stable-baselines. 2018.

[18] OpenAI. Pendulum-v0. 2018. URL: https://gym.openai.com/envs/Pendulum-v0/ (visited on 04/23/2019).

Appendix A

Hyperparameters

Appendix B Plots

Figure B.1: All five sessions of experiment 12, 80% success rate (σ = 0.0001, r_limit = 0.8)

Figure B.2: All five sessions of experiment 13, 60% success rate (σ = 0.001, r_limit = 0.8)


Figure B.3: All five sessions of experiment 14, 60% success rate (σ = 0.01, r_limit = 0.8)

Figure B.4: All five sessions of experiment 15, 100% success rate (σ = 0.025, r_limit = 0.8)

Figure B.5: All five sessions of experiment 16, 80% success rate (σ = 0.0001, r_limit = 0.6)


Figure B.6: All five sessions of experiment 17, 20% success rate (σ = 0.001, r_limit = 0.6)

Figure B.7: All five sessions of experiment 18, 40% success rate (σ = 0.01, r_limit = 0.6)

Figure B.8: All five sessions of experiment 19, 60% success rate (σ = 0.025, r_limit = 0.6)


www.kth.se

TRITA-EECS-EX-2019:356
