
Adaptive network selection for moving agents using deep reinforcement learning

WILLIAM SKAGERSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: February 2, 2021

Supervisor: Somayeh Aghanavesi
Examiner: Carlo Fischione

School of Electrical Engineering and Computer Science
Host company: Telefonaktiebolaget LM Ericsson

Swedish title: Adaptivt nätverksval för agenter i rörelse med djup förstärkande inlärning


Abstract

With the rapid development and deployment of "Internet of Things" devices comes a new era of opportunities to increase the efficiency of our everyday lives.

Many of these devices rely on having an established network connection in order to operate at peak performance, but this requirement can be hard to guarantee in parts of the world with less developed infrastructure. There is therefore value in granting such devices more information, which could allow them to take proactive actions in order to ensure that they meet certain expectations.

One method is the ability to perform adaptive network selection, depending on both the availability of telecom operators within the region as well as their perceived performance. This paper outlines a methodology for the construction of an interactive environment from raw historical data which comes in the form of measurements already available in user equipment.

An algorithm is then trained by exploring said environment using reinforcement learning, under the premise of having only limited information about its current whereabouts and target destination. The objective of agents within the environment is to select network operators over the course of a specified geographical route in order to maximize the perceived network performance.

The results showcased that, given the existence of a policy that can grant an increase in the perceived performance, the algorithm will find such a policy. Under circumstances where it cannot, it will approximate the performance of the best available operator. These results showed promise for further development of methods that rely on this type of algorithmic behaviour, which could find interesting applications in the future, especially in areas where network infrastructure is still in development.


Sammanfattning

With the rapid development and adoption of "Internet of Things" devices comes a new era of benefits that can improve our everyday lives. Many of these devices depend on an established network connection in order to perform their function to the best of their ability, but this requirement can be harder to fulfil in regions with less developed infrastructure. There is therefore value in being able to provide the devices with more information, which could allow them to decide how to handle these situations if they are to reach their optimal capability.

One method for doing this is the ability to make adaptive network selections, depending on both the availability and the perceived quality of the networks. This paper introduces a method for creating an interactive environment from historical data obtained from measurements available in ordinary devices. An algorithm is then trained by exploring this environment with the help of reinforcement learning, under the assumption that it has no prior information beyond limited knowledge of its current location and destination. The goal of an agent is then to optimize the perceived network quality by selecting operators over a given geographical route.

The results showed that if there exists a policy that can give an improvement in performance, the algorithm will find it. Otherwise, it will approximate the quality of the best operator within the area. The results showed great potential for future work, and could be applied to areas where this type of algorithmic behaviour is desirable, especially in regions where the infrastructure is still under construction.


Contents

1 Introduction
1.1 Motivation
1.2 Research question & limitations
1.3 Report structure
2 Background
2.1 Reinforcement Learning
2.1.1 Markov Decision Process
2.1.2 Policies and value functions
2.1.3 Temporal-difference learning
2.1.4 Function approximation & neural networks
2.2 The H3 library
2.3 Cellular Networks
2.3.1 Operator roaming
2.3.2 Network performance metrics
2.4 Related works & state of the art
3 Methods
3.1 Data acquisition
3.2 Construction of the environment
3.2.1 Environment padding
3.2.2 Environment variations
3.2.3 Environment metrics
3.3 MDP formulation
3.4 DQN architecture & hyperparameters
3.5 Performance evaluation
4 Results
4.1 Adaptive operator selection
4.2 Adaptive satellite roaming
4.3 Evaluation comparison table
5 Discussion
5.1 Examination of the results
5.2 Limitations
5.2.1 Limitations of the implementation of the environment
5.2.2 Limitations of the experiments
5.2.3 Limitations on a real-world application
5.3 Future work
5.3.1 Alterations and additions to the environment
5.3.2 Modifications to the policy training
6 Conclusion
Bibliography


1 Introduction

1.1 Motivation

After a long period of research and development, the next generation cellular communication system, 5G, entered the standardization phase in 2018. Since then, there have been several small-scale deployments of the network technology, but there is still some way to go before it becomes available globally.

With 5G comes a large array of new applications and use cases, since the network technology has been developed with new requirements in mind under the guidance of the International Telecommunication Union [1]. Previous network technologies have been primarily focused on delivering human-centric communication, such as telephony, media services and internet connectivity.

5G instead has a new, expanded set of requirements that aim to cover so-called machine-type communication as well, which opens up enormous areas of application for Internet of Things communication devices [2]. One of these areas is the use of autonomous driving vehicles and unmanned aerial vehicles, which would benefit greatly from the increased capabilities of the 5G networks [3].

These types of applications do, however, bring additional challenges due to their tendency to travel over large geographical distances where the available network quality can vary greatly depending on the available infrastructure. Thus it can be desirable to provide application-level opportunities for handling circumstances of poor network quality (or the complete absence of it). This line of research has developed quite naturally with the massive increase in research and development around autonomous vehicles and similar use cases that have an interest in monitoring the quality of the local networks available to internal devices in order to ensure that the quality of service provided is at its peak [4] [5] [6].


1.2 Research question & limitations

This paper investigates whether it is possible to utilize machine learning techniques such as reinforcement learning to train agents to make adaptive network selections in problem instances that center around said agent moving over a geographical area. The goal is for an agent to be able to roam efficiently in order to maximize the established connection quality, given a path from a starting point to a given destination, in addition to a selection of telecom operators available over said area.

Note that this thesis investigates a problem definition that consists of unknown or incomplete information, in that the algorithm will not have any previous information prior to actually interacting with the problem environment.

In the instance that the algorithm would have complete information, there are other techniques that would likely be better suited, such as formulating the problem as a constraint-based optimization problem and solving it using dynamic programming [7]. The project involves some simplifications which may limit its applicability to real-world scenarios, but the main purpose of the thesis was to investigate the research question outlined above in an inquisitive manner. The main contributions of this thesis are the conversion procedure from sparse geographical data into a reinforcement learning environment, as well as the results from the experiments of training a policy within said environment.

1.3 Report structure

The report begins with a brief overview of reinforcement learning and its applications within the modern industry in order to provide the material needed for understanding the thesis and the research approach. For the sake of keeping the structure cohesive, some details regarding the training procedures of the learning algorithm, such as the backpropagation technique, are left out; the reader is instead referred to the related references and works. This choice was made since an understanding of the learning algorithm is not necessarily needed to understand the methodology or the related results. A thorough walk-through of the conversion from sparse geographical data into the reinforcement learning environment is then conducted. The network architecture and related hyperparameters are then outlined, followed by the results. Afterwards, a discussion is brought up which culminates in some speculation about real-world applications and possible future work.


2 Background

2.1 Reinforcement Learning

Reinforcement learning is one of the primary areas of modern machine learning and stems from dynamical systems theory. The methodology has found many uses, particularly in scenarios involving grand strategy, where some contributions led to an increased interest in the applications of reinforcement learning. One example that caused a large stir in the scientific community, as well as garnered a significant amount of attention to the field, was when AlphaGo, a machine developed by DeepMind for the specific purpose of playing the game Go, beat grandmaster Lee Sedol in 2016 [8]. Reinforcement learning thrives in situations where there is a naturally occurring "butterfly effect", where each action taken can have consequences far down the line, not just for the immediate situation. This makes it extremely adaptable, and it can be used in a wide variety of fields and problem types.

Reinforcement learning is typically utilized in applications for the automation of tasks and processes that require the training of a goal-oriented agent which exhibits a refined thought process when it interacts with an established environment [9]. The other two main areas of machine learning are supervised learning and unsupervised learning. All these techniques have a defined set of characteristics that distinguish them, particularly related to the techniques that allow the different algorithms to learn. Supervised learning consists of utilizing labeled data, and can be likened to showing the algorithm examples that it is supposed to learn from. In a classification scenario, it would mean that for each input vector x, there is a corresponding label vector y, and the algorithm's training process consists of learning the mapping f(x) = y, where f(x) is the machine learning algorithm. On the other end of the spectrum, where you


have a set of input vectors x but no target vectors with their corresponding labels, one may resort to unsupervised learning. These learning processes instead aim to discover patterns in unorganized data, such as clustering it into groups based on similarities [9].

Reinforcement learning, which is the primary concern for this thesis, is instead a method that aims to discover how an algorithm should act in an established environment, given that it will actively interact with the problem instance it is trying to learn from. It will then utilize the experiences obtained from said interactions in order to optimize itself and enhance its capabilities for future interactions. Normally the reinforcement learning algorithm is not given any specific information about the environment itself, and as such it must explore it by trial and error [9][10]. The nomenclature for an entity that aims to learn how to interact with a problem environment and execute actions within it is an agent. The goal of any agent that utilizes reinforcement learning is to find a way to maximize a cumulative reward over the course of a problem instance.

When the agent interacts with an environment, it participates in a form of feedback loop where it will observe a given state that the environment finds itself in, and select an action. This action is then taken in by the environment, which outputs a new state and the resulting reward from this particular interaction. A flowchart of the process can be viewed below in Figure 2.1.

Figure 2.1: The agent-environment interaction flowchart, based on the corresponding figure in [10].

Different reinforcement learning algorithms can then utilize the information embedded in this interaction loop in order to learn the mappings between state-action pairs and their associated rewards.
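To make the loop in Figure 2.1 concrete, the following minimal Python sketch runs one episode of agent-environment interaction. The env object with its gym-style reset/step interface and the RandomAgent placeholder are illustrative assumptions, not the thesis implementation or any specific library.

import random

class RandomAgent:
    """Placeholder agent that samples actions uniformly at random."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)

def run_episode(env, agent):
    state = env.reset()                          # observe the initial state s_0
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                # agent selects a_t given s_t
        state, reward, done, _ = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        total_reward += reward                   # accumulate the obtained reward
    return total_reward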


2.1.1 Markov Decision Process

One major assumption made about reinforcement learning problem instances is that they follow the Markov property. For simplicity, if we assume that we are working with a finite and discrete number of states and rewards (rather than continuous values and functions), we can say that for each timestep t, the environment transition performed as a result of the agent's actions can be conditioned solely on the current action a_t and the current state s_t [11]:

Pr[s_{t+1} = s', r_{t+1} = r | s_t, a_t] = Pr[s_{t+1} = s', r_{t+1} = r | s_t, a_t, r_t, ..., r_0, s_0, a_0]   (2.1)

If the Markov property holds, we can define the problem as a Markov decision process. A Markov decision process (MDP) is a type of formulation for a classic decision-making process, where a set of available actions enables one to influence the state of the environment, which results not only in immediate rewards but can also affect future rewards through the sequences of states that arise from previously selected actions.

This definition is essentially the same as the above explanation of reinforcement learning, but an MDP is a way to structure the problem into a formal definition. An MDP typically consists of a tuple (S, A, P, R, γ), where:

• S is the established state space, containing all permutations of different states that can occur within the established environment, ∀s, s ∈ S.

• A is the action space, which hosts all actions available to the agent which it can perform in the environment, ∀a, a ∈ A.

• P is the set of transition probabilities, each taking the form p(s'|s, a), denoting the probability of transitioning to the state s' from s with a reward r by performing action a in state s. Note that the transition is conditioned only on the current state and the selected action, and is independent of previous ones; thus the equation fulfills the Markov property.

• R denotes the reward function, which dictates the given reward for transitioning to a state s'. The reward function is typically defined as a function of the current state and the performed action, R(s, a).


• γ is the discount factor, which is a parameter utilized in certain algorithms to dictate how much the agent should prioritize the immediate reward in contrast to the cumulative reward obtained over time, and is bounded to [0, 1]. A simple situation for explaining the discount factor at work would be an agent that has to choose between obtaining a big lump reward for performing a task, versus many small recurring rewards (which eventually sum to be larger than the initial lump reward) from choosing to perform an investment instead. An agent with a low discount factor would favor the initial big reward, while an agent with a higher discount factor would favor the latter.

If both the state space S and the action space A are finite (discrete and bounded in size), the MDP can be classified as finite. Normally, the transition probabilities and the associated rewards are not known by the acting agent; if they were, the optimal solution could easily be calculated using dynamic programming [12]. Instead, the agent needs to explore the environment in order to obtain an understanding of how to act. The interaction flowchart from earlier in Figure 2.1 outlines an agent in an environment with a state s_t, which has awarded the agent the reward r_t. The agent then takes the action a_t, which results in the environment transitioning to a new state s_{t+1} with its associated reward r_{t+1}. This entire interaction step can be combined into a new tuple, (s_t, a_t, s_{t+1}, r_{t+1}), which is usually denoted as an experience tuple. All of these components are parts of the MDP. This information is what is utilized by reinforcement learning algorithms in order to derive a policy, i.e., how to act in the environment given the current state.

2.1.2 Policies and value functions

Using the definitions of the MDP formulation, if at a time step t the agent observes the state of the environment s ∈ S and takes an action a ∈ A, one can define this procedure as the agent in question following a policy π(a|s). The policy is what determines which action the agent should select upon observing the state s, and learning this policy is the core of the reinforcement learning procedure. More formally, we say that the policy defines the probability of the agent taking an action a conditioned on the current state s. Since we are working with probabilities, we also have that:

Σ_{a∈A} π(a|s) = 1,   ∀s ∈ S   (2.2)


Taking an action a_t in the state s_t will then transition the environment into a state s_{t+1}, according to the transition probabilities that are defined in the MDP formulation of the problem. From this interaction, the environment will also output a reward, r_t. At all stages, the state transition will satisfy the previously mentioned Markov property.

The goal of the reinforcement learning procedure is then to generate an optimal policy which maximizes the given reward function. The most common metric utilized is the discounted total reward over the course of a defined time horizon T, which is also called the total return, and can be defined as:

G_t = Σ_{i=t+1}^{T} γ^{i−t−1} r_i   (2.3)

where γ ∈ [0, 1] is the previously established discount factor. Note that the time horizon does not need to be finite, thus leaving open the possibility that T = ∞. In order to evaluate a policy, a concept called the value function is used to give a measurement of how rewarding a state is when following the policy π. This function can be expressed as

v_π(s) = E_π[G_t | s_t = s] = E_π[ Σ_{i=0}^{T} γ^i r_{t+i+1} | s_t = s ],   ∀s ∈ S   (2.4)

and expresses the expected return for an agent at state s which follows the policy π when selecting future actions. The state-action equivalent to the value function, called the q-function, can be expressed as

q_π(s, a) = E_π[G_t | s_t = s, a_t = a] = E_π[ Σ_{i=0}^{T} γ^i r_{t+i+1} | s_t = s, a_t = a ],   ∀s ∈ S, ∀a ∈ A   (2.5)

and defines the expected return for taking an action a in a state s and thereafter following the policy π. Both of these equations have a rather famous recursive definition, which we may obtain if we use the notation of the state s, the action a, the successor state of s defined as s', and the corresponding action performed in the successor state as a'. The value function can then be expressed as:

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) [r + γ v_π(s')]   (2.6)


with the corresponding q-function of

q_π(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ Σ_{a'} π(a'|s') q_π(s', a')]   (2.7)

which is named the Bellman equation after Richard Bellman [12]. The desired outcome of the reinforcement learning problem then becomes the task of obtaining the optimal state-value function:

v_*(s) = max_π v_π(s) = max_a q_{π_*}(s, a),   ∀s ∈ S, ∀a ∈ A   (2.8)

where * denotes the condition of optimality and π_* denotes an optimal policy. This is typically what is utilized during different learning schemes in order to arrive at an optimal policy for the specified problem formulation. Using this, we can derive what is called the Bellman optimality equation, which states that the value of a state under an optimal policy must equal the expected return for the best action from that state, or:

v_*(s) = max_a Σ_{s',r} p(s', r|s, a) [r + γ v_*(s')]   (2.9)

The equivalent Bellman optimality equation for q_* can also be expressed as

q_*(s, a) = Σ_{s',r} p(s', r|s, a) [r + γ max_{a'} q_*(s', a')]   (2.10)

A full derivation of the equations and the associated steps may be found in An introduction to reinforcement learning, 1998, by Sutton and Barto [10].
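As a concrete illustration of the Bellman optimality backup in Equation 2.9, the following Python sketch runs synchronous value iteration on a made-up two-state, two-action MDP with a known transition model. The model and its numbers are invented purely for intuition; in the setting of this thesis the model is unknown to the agent.

# Toy MDP: P[s][a] = list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

V = {s: 0.0 for s in P}                      # initial value estimates
for _ in range(100):                         # repeated Bellman optimality backups
    new_V = {}
    for s in P:
        new_V[s] = max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    V = new_V
print(V)                                     # approximates v_*(s) for each state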

In order to find the optimal state-action value function q_*, the reinforcement learning algorithm will need to obtain experiences within the environment. This presents a bit of a dilemma: in order to iterate towards alternative policies, the algorithm might need to select actions which contradict its supposed goal of maximizing the discounted total reward, or return. For many complex scenarios, there are pathways to maximizing the reward which require one to first take a step that is inherently a net loss of reward in order to advance to a successor state where more reward might be obtained. A simple example could be the act of sacrificing a pawn in chess in order to advance to a state where you can then capture a queen. If the policy is structured to be a greedy policy, this initial action would never be selected since it would view the loss of the pawn


as a worse reward than the alternatives. As such, a greedy algorithm may be defined as:

a_t = argmax_a π(s_t, a)   (2.11)

And the greedy algorithm will always select the best possible perceived action, which nets the highest reward from the perspective of the state s_t. A common modification that is done in order to force exploration of what may be a sub-optimal move at first (but which may net a higher overall reward down the line, as in the previous chess example) is to introduce the concept of exploration.

This is typically done by introducing a defined exploration rate, ε, which is the probability that the algorithm will take an arbitrary random action instead of following the greedy algorithm [13]. There is also the option of utilizing a variable exploration rate, which uses an exploration decay that is multiplied with the exploration factor, so that as T → ∞, ε → 0 (or alternatively ε → k for a defined lower bound k).
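A minimal Python sketch of ε-greedy action selection with a decaying exploration rate, as described above; the q_values callable, the decay rate and the lower bound are illustrative assumptions.

import random

def epsilon_greedy(q_values, state, n_actions, epsilon):
    # with probability epsilon: explore with a random action
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # otherwise: act greedily with respect to the current value estimates
    values = q_values(state)
    return max(range(n_actions), key=lambda a: values[a])

def decay_epsilon(epsilon, decay_rate=0.995, lower_bound=0.05):
    # multiplicative decay towards the lower bound k
    return max(epsilon * decay_rate, lower_bound)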

2.1.3 Temporal-difference learning

Temporal difference (TD) learning is one of the primary learning methodologies within reinforcement learning and is a model-free approach. It combines some of the themes and ideas of Monte Carlo methods as well as dynamic programming. TD methods typically utilize the experience tuples obtained through exploration, while making no assumptions about the mechanics of the environment. These algorithms also utilize a type of bootstrapping technique to estimate the value function without waiting for any final termination of the learning process. Following a policy π, a TD algorithm will generate experience in the form of the previously mentioned experience tuples, with the goal of using them to estimate the value function, v_π, and the action-value function, q_π.

There are two different types of TD-learning: on-policy and off-policy. The distinction between them is that an on-policy algorithm will explore the environment and update its estimates of the value function using the same policy. On the other hand, an off-policy learning rule makes a distinction between the behavioral policy used to explore the environment and generate experiences and the policy that is currently being evaluated, and as such they do not need to be the same.


TD(0)-learning

One of the most basic forms of TD-learning is a method named TD(0), whose name comes from its use of "one-step" iterative updates according to a specified rule. As such it will only conduct updates based on the value estimate of the next timestep. This rule can be expressed as:

V(s_t) ← V(s_t) + α[r_{t+1} + γ V(s_{t+1}) − V(s_t)]   (2.12)

where α is a selectively chosen learning rate, similar to the learning rate of other traditional machine learning algorithms. TD(0) is typically not used, however, because the computation of optimal actions given only the state can be quite expensive in complex environments. Instead, the more common algorithm utilized within reinforcement learning is the Q-learning method.

Q-learning

Q-learning is a rather famous off-policy TD control algorithm that estimates the state-action value function, rather than the value function for the state only.

This information is normally encoded in a table, which is updated according to sampled experiences from exploration. With a state s, action a, reward r, the action space A, a learning rate of η and a discount factor of γ, the update rule for Q-learning is defined as:

Q(s_t, a_t) ← Q(s_t, a_t) + η[r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t)]   (2.13)

In this method, the table Q directly approximates q_*, the optimal action-value function [10]. As mentioned, Q-learning is an off-policy learning algorithm, but the selected policy used for exploration still has a notable effect on the state-action pairs that it will explore and visit. Thus one should still exercise caution when selecting the behavioral policy so that it will periodically explore all state-action pairs in order to guarantee the convergence of the algorithm.

Using the same notation as for 2.13 with the addition of the state space S, the Q-learning algorithm in its procedural form can be expressed as the following:


Q-learning (off-policy TD control) for estimating π ≈ π_*

Algorithm parameters: learning rate α ∈ (0, 1], small ε > 0
Initialize Q(s, a) for all s ∈ S, a ∈ A(s), arbitrarily, except Q(terminal, ·) = 0
foreach episode do
    Initialize S
    foreach step of episode do
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
        S ← S'
    end foreach
end foreach
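A minimal Python sketch of the tabular procedure above, assuming a hypothetical episodic environment with a gym-style reset/step interface and a discrete action space; it is not the thesis implementation.

import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)     # Q(s, a), arbitrarily initialised to 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # behaviour policy: epsilon-greedy derived from Q
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, done, _ = env.step(action)
            # off-policy TD update towards the greedy (max) target
            target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q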

2.1.4 Function approximation & neural networks

Deep Q-learning

Q-learning suffers one major drawback which comes from its reliance on a table-like structure to host the state-action value pairs, as shown below in Figure 2.2.

Figure 2.2: Q-table lookup functionality.

Consider for a moment a situation where a state includes values that are not finite and discrete. Instead, you have a range of possible values that can be continuous, such as GPS coordinates (longitude, latitude). As a result, the state space S would grow to be obscenely large due to its definition of containing every permutation of states available within the environment. This would obviously cause problems for the established learning algorithms mentioned, since they rely on indexing the state-action pairs to create a policy. This


table would be required to be unfeasibly large to host all these permutations. The solution to this problem comes in the form of utilizing function approximation in order to reduce the effect of the large state or action spaces. Instead of having a lookup table that contains the state-action values, you now utilize a neural network to approximate the optimal action and its associated value given only the state. This means that it is no longer necessary to store every single state-action combination; instead, you train the network to approximate this relation, in a combination of function approximation and target optimization as in Figure 2.3.

Figure 2.3: Neural network approximation for the Q-table.

Artificial neural networks are a methodology within machine learning that aims to emulate a simplified version of a biological brain [14]. Neural networks are very useful due to their ability to theoretically approximate any form of continuous function, and as such are well known as a type of universal function approximator [9]. One of the most common structures of a neural network is the multilayer perceptron (MLP). An MLP consists of neurons, where each one is an individual processor capable of taking a form of input and producing a corresponding output. Neurons are organized in layers, where the outer layer, commonly referred to as the input layer, reacts to stimuli from the external environment and then propagates this through the network by weighted connections [14] [15]. Attached to the input layer is usually one or more layers called hidden layers, followed finally by an output layer. The neurons that these layers are composed of can be expressed as a mathematical


function defined as:

f(x, w_t) = φ( Σ_{i=1}^{d} w_{it} x_i + b_t )   (2.14)

which takes two parameters: the input x and the assigned weights of the neuron, w_t. The term d is the dimension of the input and b_t denotes the bias term. The symbol φ represents the so-called activation function, which is selected with the purpose of inducing non-linearity into the system.

The exact selection of which activation function to use in the different parts of a neural network depends both on the architecture of the network and on the problem itself. Some commonly selected activation functions include the Rectified Linear Unit (ReLU), Sigmoid and Hyperbolic tangent (Tanh). There is also the option of using a linear mapping.

Using the previous definition of a perceptron, activation function, and the concept of layers, a neural network (multilayer perceptron variant) is showcased below in Figure 2.4.

Figure 2.4: Overview of a multilayer perceptron.
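A small numpy sketch of Equation 2.14 and of a forward pass through a two-layer MLP with ReLU activations on the hidden layer and a linear output; the layer sizes and random weights are illustrative only.

import numpy as np

def neuron(x, w, b, phi=lambda z: np.maximum(z, 0.0)):
    # phi(sum_i w_i * x_i + b), here with ReLU as the activation phi
    return phi(np.dot(w, x) + b)

def mlp_forward(x, layers):
    # layers is a list of (W, b) pairs; ReLU on hidden layers, linear output layer
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(W @ h + b, 0.0)
    W_out, b_out = layers[-1]
    return W_out @ h + b_out

# Example: 4 inputs -> 8 hidden units -> 3 outputs (e.g. one value per action)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8)),
          (rng.normal(size=(3, 8)), np.zeros(3))]
print(mlp_forward(rng.normal(size=4), layers))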

Artificial neural networks belong to the class of supervised learning, where they are trained on samples that include an input pattern and its corresponding output values (target values). The weights are then adjusted using an optimization algorithm, where the most common one is stochastic gradient descent (SGD), or some variation of it such as Adam or RMSprop [9][16]. These algorithms utilize a loss function, which outputs a metric that gives a rough


estimate of how far away the current predictions coming out of the neural network are from the true target values. The choice of loss function varies depending on the problem type (classification, regression, etc.). The goal of the optimization algorithms is then to minimize the loss function by performing weight adjustments, which are calculated using the chain rule in a method called backpropagation [14].

Experience replay

The goal of deep Q-learning is ultimately to approximate the Q-table, Q(s, a), freeing us from the requirement of keeping a table with all the possible state-action permutations and instead relying on function approximation. In order to do this, stochastic gradient descent (SGD) is utilized to update the weights of the neural network in order to create the mapping between the input (states) and the output (Q-values). SGD relies on its data being independent and identically distributed across the batches it uses for calculating the weight updates, but in our RL environment we will be given a sequence of interactions between our agent and the environment, and each of the steps within this sequence has a high chance of being correlated. A popular method within deep Q-learning to combat this is the idea of keeping a memory buffer, to which obtained experiences are continually added. The buffer has a fixed capacity and is filled up as the agent interacts with the environment. Once the buffer is full, the oldest memories will be continually shifted out in favor of fresh ones. SGD will then sample uniformly at random from this buffer when selecting samples for the batch when a training step is to be commenced.
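A minimal Python sketch of such a memory buffer; the capacity and batch size are illustrative values, not the settings used in the thesis.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # once full, the oldest experiences are shifted out automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the temporal correlation of an episode
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)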

Shifting target values: Using a target network

Training a neural network that is employed in reinforcement learning requires a different approach compared to that of traditional neural networks in supervised learning. The target vectors that the neural network fits its weights to, in order to reduce the selected loss function, will be an estimation of the value function, rather than a static label that is always fixed. This effectively means that during the training of the neural network, the target vector will shift as the estimation changes. This can make the act of training a deep Q-network unstable, and can lead to oscillation and poor training performance. One method to counteract this is to utilize a target network. This method involves making a copy of the neural network (or more specifically, its weights), and using the copy to procure the target values to fit our model. While the target network will help the model train more steadily, it will also be continually relying on outdated information as we compute and apply our weight updates. Every now and then, one wants to copy over the current weights of the network in training to the target network. This is typically done every few episodes, dictated by a set parameter, which will be called the target network update ratio. This defines how many episodes will elapse before the model network has its weights copied over to the target network.
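A sketch of the periodic synchronisation described above, assuming Keras-style models exposing get_weights()/set_weights(); the update ratio of 10 episodes is an illustrative value, not the parameter used in the thesis.

TARGET_NETWORK_UPDATE_RATIO = 10   # episodes between synchronisations (illustrative)

def maybe_update_target(episode, model, target_model):
    # copy the online network's weights into the target network every few episodes
    if episode % TARGET_NETWORK_UPDATE_RATIO == 0:
        target_model.set_weights(model.get_weights())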

2.2 The H3 library

H3 is an open-source library created by the transportation company Uber. It offers functionality for indexing coordinates into a discrete space of hexagonal shapes [17]. The library acts as an alternative to geocode systems for representing geographical entities, such as geohash [18]. H3 contains numerous options for fine-tuning the indices that are created, the most notable one being the so-called resolution. The resolution defines the size of each cell that is created, and each of these cells can then be approximately subdivided into smaller cells (and thus a higher-resolution grid). The size of each hexagon is determined by this resolution parameter according to the table in Figure 2.5.

The library then allows for easy conversion of geographical data points in longitude/latitude into their associated cell values, creating a discrete set of cells. This can serve multiple different purposes, such as the example displayed on the official developer blog, detailing the conversion of sparse data into a cell-based heatmap, as in Figure 2.6 [17].

The library contains a plethora of methods to work with these cells, including pathfinding algorithms, cell-to-coordinate conversions (and vice versa), obtaining cell boundaries (in coordinates), up/downscaling resolutions, etc. More information can be found in the official documentation [19].
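The following sketch shows the kind of H3 operations referred to above, assuming the v3.x Python bindings (h3-py); the coordinates and resolution are arbitrary examples.

import h3  # assumes the v3.x Python bindings

resolution = 8                                        # cell size, see Figure 2.5
cell = h3.geo_to_h3(59.3293, 18.0686, resolution)     # (lat, lng) -> cell index
centre = h3.h3_to_geo(cell)                           # cell index -> centre (lat, lng)
boundary = h3.h3_to_geo_boundary(cell)                # hexagon boundary in coordinates

goal = h3.geo_to_h3(59.3326, 18.0649, resolution)
path = h3.h3_line(cell, goal)                         # cells on the line between the two
print(len(path), centre)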

2.3 Cellular Networks

This section briefly touches upon the concept of cellular connections, the act of roaming, and some metrics involved when measuring the perceived network quality of an established connection. The details are kept at a very high level, since not much in-depth knowledge is required for understanding the theoretical applications that this project seeks to explore, and as such many of the details may be abstracted for the purpose of the research question.


Figure 2.5: H3 Resolution Table, taken from Table of Cell Areas for H3 Resolutions in the H3 documentation [19]

Figure 2.6: H3 conversion, taken from Uber's engineering developer blog, H3: Uber's Hexagonal Hierarchical Spatial Index [17]

In the event that the act of adaptive roaming would become a tool to use in real-world applications, there are additional considerations that should be taken into account,


some of which are brought up in the discussion section. The nomenclature used in this thesis will be that telecom providers are referred to as operators, and the use of the word cell does not relate to cellular networks, but to the H3 cells that are used in the construction of the environment (Section 3.2).

2.3.1 Operator roaming

Roaming is a term within wireless network communication that refers to devices switching the currently selected network for another within their vicinity. The act of roaming from one network connection to another is a feature available in all modern devices that are meant to be able to access dedicated networks. This is typically confined to roaming between local networks, such as those provided by WiFi access points.

The principle of roaming between different cellular operators that provide network access through the use of 3G, 4G and, soon, 5G is quite different from the localized roaming between available local networks. It used to be that this principle was limited to specific types of devices with dedicated hardware, but this technology has started to extend to normal mobile devices with the widespread introduction of dual-SIM slots or eSIM compatibility [20][21].

When supported, this allows the use of multiple different operators (with variable data plans, country of origin, or simply for the purpose of having multiple numbers tied to a single phone).

In this thesis, we will assume that we can access devices capable of using this type of technology for the purpose of roaming freely between different telecom providers operating within a specific region. In reality, there would likely be delays during the transition between one operator and another, as well as charges for performing this act aggressively rather than for specific purposes. These attributes are somewhat simulated in the proposed environment with the use of a penalty for the action of roaming itself, but are otherwise overlooked for simplicity.

2.3.2 Network performance metrics

Network connections have a multitude of performance metrics that relate to different qualities. These metrics relate to different areas of interest, and what constitutes a "good network connection" may vary depending on the needs of the application. There are some performance metrics that are more prevalent in Long Term Evolution (LTE) networks, which will be used as prediction indicators for the reinforcement learning algorithm during testing. Reference Signal Received Power (RSRP) is a signal-strength-related metric for a specific cell that is typically used for ordinary handover decisions (the act of transferring a data session from one channel to another), by using it to rank a connection among available candidates in order to grasp their relative signal strength. This makes it a good candidate to utilize for network quality predictions. The formal definition of RSRP is the average power of the Resource Elements that carry cell-specific Reference Signals within the considered bandwidth; it typically ranges from -44 dBm to -140 dBm, with a less negative measurement representing a better signal quality [22].

RSRP can be viewed in a simplified manner by classifying it according to certain threshold levels, which is typically used to give a rough estimation of the network quality of a connection. An example is the set of values listed in Table 2.1.

Table 2.1: RSRP Metric quality classes

Quality          RSRP
Excellent        >= -80 dBm
Good             -80 dBm to -90 dBm
Poor             -90 dBm to -100 dBm
Bad / No signal  <= -100 dBm
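A direct Python encoding of the thresholds in Table 2.1; the numeric class labels 0-3 are an assumed mapping used only for illustration, and values exactly on a boundary are assigned to the higher class.

def classify_rsrp(rsrp_dbm):
    # thresholds follow Table 2.1
    if rsrp_dbm >= -80:
        return 3   # Excellent
    elif rsrp_dbm >= -90:
        return 2   # Good
    elif rsrp_dbm >= -100:
        return 1   # Poor
    return 0       # Bad / no signal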

Some other metrics of interest are CQI (Channel Quality Index) and SINR (Signal to Interference plus Noise Ratio). These measurements are defined according to standards within telecommunication technology, particularly from ETSI. More information can be found in the documents related to these standards (ETSI TS 136 214 V9.1.0) [23].

2.4 Related works & state of the art

The principle of applying machine learning to enhance the functionality, capability and efficiency of network-related applications is something that has been picking up interest over recent years. Some papers have been published on the matter, but the methodology involved depends highly on the problem specification.

The underlying field of machine learning has reached a point where many methods have proven their dependability and robustness often enough to be considered reliable and the current state of the art. Reinforcement learning in particular, which is heavily featured


in this thesis, has seen a large progression in its possible areas of application as a result of its evolution to utilize other machine learning algorithms, such as neural networks. Reinforcement learning has traditionally been well suited to problems that are hard to capture with standardized statistical models but that have a plethora of data and use-case scenarios available for the formulation and training of a data-driven model. Complications do arise quite often due to the use of continuous variables, causing hiccups in traditional reinforcement learning algorithms due to their poor scalability with large state or action spaces.

This has been alleviated with the use of deep learning, however, as presented earlier in the paper. One of the original sources of inspiration for this very thesis was the publication of a white paper by the 5GAA Automotive Association titled "Making 5G Proactive and Predictive for the Automotive Industry", which covers potential use cases for the ability to predict imminent changes in network quality for many different types of on-board systems (and the value of being able to take action in order to ensure a high quality of service). While the paper does not go into explicit details about any implementation procedures, it does make some suggestive hints that an implementation using machine learning would likely be able to meet the demands of these types of systems. As such, the creation of predictive and proactive agents that can be utilized in autonomous vehicles in order to improve the quality of service that they provide for the relevant users is something that is currently being looked at in the modern industry. The paper also goes into detail about the key concepts of predictive QoS, from data collection to delivering the final predictions, through the eyes of the automotive industry.

Another paper that gives insights into the needs and wants of the industry is a scientific survey which contains motivation and explanation for the use of deep reinforcement learning in the context of managing non-convex and complex optimization problems [24]. While a large segment of the paper is dedicated to explaining the workings of neural networks and reinforcement learning algorithms, in addition to how they can be integrated with one another to create powerful tools, there is also an in-depth look into the possible applications of such algorithms within the field of communication. One of the key notes is how many modern networks are becoming more decentralized and ad-hoc in their functionality, and thus need to make independent decisions based on the information available to them. One example mentioned is the case of an active agent selecting between available base stations depending on certain criteria, which is one area that closely resembles the topic of this thesis. This problem is referred to as "dynamic spectrum access" and has been


well studied due to its close relation with industries striving for the "Internet of Things" (IoT), which also has an inherent need for continuous and uninterrupted connection uptime. The main takeaway from the survey concerns the implementation possibilities of the technology rather than the implementations of the mentioned methodologies for the problems it presents.

Other examples of machine learning implementations related to predicting quality-related metrics in network technology exist, such as Casas, Pedro, et al., 2017, who tackle a problem related to making predictive analyses of the Quality of Experience (QoE) experienced by smartphone users using machine learning techniques [25]. The paper conceives a number of different models based on supervised machine learning by using passive measurements obtained from the devices in use, in combination with surveys that ask the users questions about the quality of service provided during the use of several applications. The measurements are then mapped to the feedback of the users, and this is used as a dataset for a number of supervised learning algorithms with varying degrees of success.

There are also examples of research conducted with the goal of implementing reinforcement learning algorithms in mobile communication networks, where researchers constructed a reinforcement learning algorithm capable of solving the task of bandwidth optimization depending on the requirements of the network [26]. They formulated the problem as a constrained Markov decision problem, and then implemented an agent capable of learning an optimal policy, given a sufficient amount of experience, by utilizing Q-learning. These are some examples of machine learning being implemented on network technologies of different characteristics; where this thesis aims to extend them is that it focuses on the implementation of a reinforcement learning agent capable of adaptive network selection, in order to create a basis for an algorithm which can be used to guarantee adequate quality of service for devices where the network infrastructure could be lacking or unknown.

Additionally, a big part of the thesis is the implementation of the related algorithms, and in particular the performance obtained by the neural network. Much work has been conducted within the field of data science related to the fine-tuning of network performance for various tasks, and this will very likely be utilized in this paper to get the best performance out of the algorithm.

There have been some recent (2019, 2020) works related to utilizing machine learning techniques to tackle optimization problems within network connectivity related to access points, which can be used to define an estimate of the current state of the art. One publication conducted an experiment where a reinforcement learning approach was selected to tackle the optimization problem of selecting access points in order to meet certain quality-of-service requirements [27]. They found that they could reduce the average access delay by 44.5% by utilizing Q-learning on what was structured as a constraint-based optimization problem. A recent IEEE paper utilized the widely used random forest machine learning algorithm to select features for better network coverage when selecting access points [28]. The general principle of recently published papers is that it is usually the already established modern machine learning algorithms, involving either supervised learning or reinforcement learning, that are utilized.

The exact implementation and the complexity of said implementation vary depending on the problem that is tackled, but generally these methods can be considered the state of the art when tackling optimization problems within the area of networking technology.

Ericsson, the company this thesis was conducted with, also performed some initial experiments with an environment structure that took on a graph-like form, which allowed agents that traversed through it to swap between different operators in an attempt to optimize the network connectivity. The problem was laid out as a simple reinforcement learning problem. This experiment served as the initial idea that eventually led to this thesis, which was to expand upon the original idea and add a multitude of functionalities while relying on real data obtained from measurements instead of an artificial dataset. This would then culminate in some trials to see if it was possible to train an agent to optimize the connectivity using reinforcement learning. Ericsson also recently filed a patent that relates to methodologies concerning the act of roaming in more intelligent manners, of which this thesis was one of the exploratory branches that were pursued [29].


3 Methods

3.1 Data acquisition

The data used in this thesis project was obtained through an agreement between the company Umlaut and Ericsson [30][31]. The nature of the data is crowdsourced, using an opt-in program where measurements are frequently collected from mobile devices with networking capabilities. The measurements collected contain information about network quality, throughput, latency, and other relevant metrics that could have an effect on the quality of service of an established connection, in addition to information about the hardware used, the chosen telecom provider, and the location of said measurement as a pair of longitude/latitude coordinates. The size of a typical data query depends both on the geographical region selected as well as the timestamp that is used. In more populated areas and/or countries, the number of data points can easily reach millions of samples. In this thesis, a small subset of the data is utilized, namely the geographical coordinates the data points were obtained from in addition to their related RSRP value. These values are used in this thesis as a rough estimate of the perceived network quality of a particular region within the reinforcement learning environment.

One thing of note is that, while in this particular project RSRP is solely used as the indicator for network quality, one could utilize any relevant metric with an attached geographical coordinate. This is further brought up in the discussion.

The crowdsourced nature of the dataset means that certain restrictions are imposed on how it can be displayed, in accordance with internal regulations originating from the agreement between Ericsson and Umlaut, in addition to laws within the European Union which relate to the protection of privacy and the use of personal data [32]. As such, the data may not be directly displayed or showcased under any circumstances (neither in the form of screenshots, samples nor the complete dataset). However, any statistical analysis, such as plots, in addition to the reinforcement learning environment that is created using said data, will be showcased.

3.2 Construction of the environment

The description of the data obtained from Umlaut given in Section 3.1 provided a brief overview of the data and its structure, but one major task in the process of this paper is the conversion of the sparse data available into a reinforcement learning environment. As previously mentioned, the data collected from the queries forms a dataset with geographical points and relevant metrics for measuring connectivity.

Using the H3 library, previously established in Section 2.2, one can obtain a discrete set of cells to act as the underlying ground truth for the environment in which the agents will be operating. The act of selecting H3 for this method, and thus a discrete environment with a finite set of cells, was an intentional choice, since relying on methods such as nearest neighbour would have created problems. A large portion of the geographical locations are very sparse on data, and would likely lead to wildly inaccurate assumptions about the metric values for a specific location unless the nearest neighbour selection were bounded to a specific length. But that would end up as a structure very similar to that of the H3 cell structure, except that the cells themselves would overlap with each other. There are also methods one can utilize to allow the direct use of sparse continuous data to form the operator layers that are used within the environment, such as function interpolation, but this has a tendency to scale poorly with extreme dataset sizes. The exact size of the pieces of data obtainable from Umlaut varies significantly with the region selected and the time-span used, but can easily go upwards of tens of millions in populated areas. For this reason, the idea of using a method that creates a discrete cell space was chosen, both for the purpose of simplification and for the ease of implementing and working with it, regardless of the size of the dataset that is utilized.

In order to specify the related metric of a position for a given operator, one can instead utilize the created H3 cells in order to obtain an averaged metric from the cell in which the coordinate is placed.
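A sketch of that aggregation step, assuming the v3.x h3 bindings and a list of (latitude, longitude, rsrp) samples for a single operator; the function and variable names are illustrative.

import h3
from collections import defaultdict
from statistics import mean

def aggregate_to_cells(samples, resolution=8):
    per_cell = defaultdict(list)
    for lat, lng, rsrp in samples:
        per_cell[h3.geo_to_h3(lat, lng, resolution)].append(rsrp)
    # one averaged metric per cell; the thesis additionally classifies the
    # values per cell with a majority vote (see Figure 3.4)
    return {cell: mean(values) for cell, values in per_cell.items()}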

The following figures will step through the creation process of an environment for a single selected operator with some data points in a small square area. Such an area could look like Figure 3.1.

Figure 3.1: Geographical data with a selected metric for one operator. Note that the data used for these diagrams is generated dummy data, simulating geographical coordinates with an attached metric value (normalized).

The area is then split up into a selection of hexagonal shapes, dictated by the resolution that is used (see the table in Figure 2.5 for hexagon sizes at the different resolutions). A rough drawing of an example cell division over the area with an arbitrary resolution could look like Figure 3.2. Normally the grid is populated with a cell for each location that contains at least one data point, but cells at the edges have been excluded in this particular example. That is to say, in a normal example there might occur "gaps" in the cell map for areas which contain no data points. The environment supports a couple of methods to handle this, which are explained in Section 3.2.1.


Figure 3.2: Cells are created for selected data points.

The cells are then iterated through (as in Figure 3.3) and are processed with a type of voting system to determine the class of the cell, which is used in the reward function. This value will be used as a rough approximation of the perceived network quality within the geographical region represented by the cell itself. This presents a certain level of simplification, since the sparse data is now converted into a discrete set of areas, rather than working on the continuous coordinates of the original data. However, given sufficient quantities of data, it is possible to utilize the H3 resolution factor to get a very high resolution and more closely approximate the real network connection metrics within the regions. The resolution factor is a parameter of the environment creation process and can be specified at application startup. Some alternatives to the discrete cell structure of the environment were considered, such as the nearest neighbour approach, but it was deemed that the discrete set of cells obtained from working with the H3 library was a good trade-off between realism and simplicity in order to contain the scope of the project.


Figure 3.3: Iterate over the cells to be included in the environment.

For each cell, the operator values are thresholded into categories depending on their value, in accordance with Table 2.1 on RSRP. Each classified data point then casts a vote to determine the classification of the cell (Bad, Ok, Good, Great). The class with the largest proportion of values dictates the classification of the cell. If a cell does not include any data points for a given operator, it will have a zero-type class, which is taken into account in the reward function. Figure 3.4 visualizes a simple voting process for the classification of a cell. The voting system came about as a result of the presence of a significant amount of noise in the dataset, meaning that there were many outliers. By using a voting system with a winner-takes-all approach, these outliers would naturally be filtered out in favor of the majority holder in terms of the quality classes. It was found to be universally true that the majority class would always be a very large proportion of the total data points within the cell (>50%) when using sufficient data quantities (>100 000 entries).


Figure 3.4: Votes for each class are counted depending on their value.

In this instance, we receive a majority vote of class 3, and that would be the ground truth used for that particular cell location within the current working environment. So for any pair of coordinates that falls within the boundaries of this cell, the reward function would use class 3 to determine the reward for using this particular operator at this location. How exactly the reward function is structured is shown in the MDP formulation in the upcoming Section 3.3.
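As a minimal sketch of this thresholding and voting step, the snippet below classifies individual RSRP samples and then takes a winner-takes-all vote per cell. The RSRP breakpoints shown are common placeholder values, not the actual thresholds of Table 2.1, and the function names are illustrative.

```python
from collections import Counter


def classify_rsrp(rsrp_dbm):
    """Map one RSRP sample (dBm) to a quality class: 1=Bad, 2=Ok, 3=Good, 4=Great."""
    if rsrp_dbm >= -80:
        return 4
    if rsrp_dbm >= -90:
        return 3
    if rsrp_dbm >= -100:
        return 2
    return 1


def classify_cell(rsrp_samples):
    """Winner-takes-all vote over all samples in a cell; 0 means no data for the operator."""
    if not rsrp_samples:
        return 0  # zero-type class, handled separately by the reward function
    votes = Counter(classify_rsrp(s) for s in rsrp_samples)
    return votes.most_common(1)[0][0]


# Example: a noisy cell where class 3 still wins the vote despite one outlier.
print(classify_cell([-85, -87, -86, -102, -84]))  # -> 3
```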

The noise mentioned as a motivation for the voting system can have many causes. Since the data queries are of a crowdsourced nature, there was an extremely large variance in the quality of the measurements, as any device could be affected by a number of external factors, such as:

• Location of the device.

• If obstacles exist between the device and the currently selected cell antenna.

• How many users are currently connected to the available cell antenna.


• The weather of the current day.

Using a majority vote system allowed for partially filtering out some of the noisier data points and outliers that were collected.

After this process has been completed, one is left with a graph structure of hexagonal cells which represent geographical areas with associated values for the metric currently under investigation. For the purpose of this thesis, the environments all utilize RSRP, but the same process can take any other signal metric for which one has data.

Since this work focuses on problem instances where the start and goal locations are known ahead of time, the methodology for generating a new problem instance for the agent to explore works as follows:

1. Two cell IDs are sampled uniformly at random from the available cells. The pair must also fulfill the condition that start ≠ goal.

2. A pathfinding algorithm then determines an available path between the two selected cells, and the path is stored inside the environment.

3. The agent’s current location is set to the start-cell.

Figure 3.5: 1: Select start and goal.

Figure 3.6: 2: Generate a path using h3.

The pathfinding algorithm used for this is H3's h3Line method, which draws a line between the two cells and selects all the cells in between in order to create a shortest-path route. In principle, any common pathfinding algorithm could be used as an alternative, such as breadth-first search, Dijkstra's algorithm, or A*. For the problem instances generated by this environment, the internal method provided by H3 was deemed sufficient.
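A minimal sketch of the instance-generation procedure is shown below, assuming the h3-py v3 bindings (where the h3Line functionality is exposed as h3.h3_line); the list `cell_ids` of populated cells and the function name are illustrative.

```python
import random

import h3  # h3-py v3 API; in v4 the call below is named grid_path_cells


def new_problem_instance(cell_ids):
    """Sample a start and goal cell uniformly at random (start != goal) and build the path."""
    start, goal = random.sample(cell_ids, 2)  # sampling without replacement guarantees start != goal
    path = h3.h3_line(start, goal)            # every cell on the line between start and goal
    return start, goal, path


# The agent's location is then initialized to path[0], i.e. the start cell.
```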

Interactions with the environment can be described as "stepping" through it, one cell at a time, until the goal destination is reached. The agent has no control over the actual movement but is instead tasked with selecting between the available operators along the way in order to maximize the accumulated reward. A simplified sketch of an environment with 3 operators available is showcased below in Figure 3.7.

Figure 3.7: A sketch of an environment with 3 operators in action.

The environment exposes several methods that handle the interaction with it: obtaining rewards, resetting it, and generating problem instances to solve. The naming convention closely follows that of the well-known OpenAI Gym, which is frequently consulted by people learning reinforcement learning in order to obtain example environments to test their algorithms on [33].
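The sketch below illustrates what such a Gym-like interface could look like. It is a simplified, self-contained stand-in where the reward is just the class of the selected operator in the next cell, not the thesis' actual reward function, and all names are illustrative.

```python
class OperatorSelectionEnv:
    """Minimal Gym-style environment: step through a fixed path, selecting an operator per cell."""

    def __init__(self, path, cell_classes, n_operators):
        self.path = path                  # list of cell IDs from start to goal
        self.cell_classes = cell_classes  # {cell_id: {operator_index: class}}
        self.n_operators = n_operators

    def reset(self):
        """Start a new episode at the first cell of the path and return the initial state."""
        self.position = 0
        self.operator = 0
        return self._state()

    def step(self, action):
        """Select operator `action` for the next cell, advance one cell, return (state, reward, done, info)."""
        self.operator = action
        self.position += 1
        cell = self.path[self.position]
        reward = float(self.cell_classes[cell].get(action, 0))  # 0 means no coverage for this operator
        done = self.position == len(self.path) - 1
        return self._state(), reward, done, {}

    def _state(self):
        """The 6-tuple state described in the MDP formulation of Section 3.3."""
        return (
            self.operator,
            self.path[0],                                            # start cell ID
            self.path[-1],                                           # goal cell ID
            self.path[self.position],                                # current cell ID
            self.path[min(self.position + 1, len(self.path) - 1)],   # next cell ID
            self.path[max(self.position - 1, 0)],                    # previous cell ID
        )
```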

3.2.1 Environment padding

In the data queries from Umlaut, one encounters many areas where no data is available for particular operators. How often this happens varies between regions, and with how large a timespan one is investigating, but the occurrences are many. When faced with such a situation, one can simply accept the cell as having no connection for that particular operator, but this can happen for two different reasons: either the cell actually has no data because the operator has no presence there, or the dataset currently being used simply has no recorded measurement for that operator in that particular region.

To allow some flexibility, a method named padding was introduced, giving the environment some freedom during the creation of operator classifications for the previously mentioned cells. Padding works by allowing a cell with no values to use the average of its neighbors' values as its own classification. An illustration of padding is showcased below in Table 3.1. Padding has an additional parameter, which dictates how many neighboring cells are required to be valid and contain the presence of the operator in question. This parameter is called minimumNeighbourhood, and can be used to restrict how freely padding is used.

Padding=False: The cell gets its value from the classification only. If no data points exist, it will be an empty cell.

Padding=True: If the cell has no data, it obtains one by averaging the values of its neighborhood (marked in yellow in the accompanying illustration), provided that at least minimumNeighbourhood neighbors are valid cells with the related operator present in them.

Table 3.1: Examples of the effect of different padding values.
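The padding rule can be sketched as follows, assuming the h3-py v3 bindings (hex_ring gives the ring of neighboring cells); the dictionary layout, the default threshold of 3, and the function name are illustrative rather than taken from the thesis code.

```python
import h3  # h3-py v3 API; hex_ring(cell, 1) returns the six surrounding cells


def pad_cell(cell_id, cell_classes, minimum_neighbourhood=3):
    """Return a padded class for an empty cell, or 0 if padding is not permitted."""
    neighbours = h3.hex_ring(cell_id, 1)
    values = [cell_classes[n] for n in neighbours if cell_classes.get(n, 0) > 0]
    if len(values) < minimum_neighbourhood:
        return 0  # too few valid neighbours: the cell stays empty
    return round(sum(values) / len(values))  # average of the valid neighbourhood values
```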

3.2.2 Environment variations

Outside of the different parameters which dictate the complexity and layout of the problem environment, it also comes in two primary variations which share the general structure defined in the earlier parts of Section 3.2, but with some minor alterations. These environments, or problem variations, will be referred to as adaptive operator selection and adaptive satellite roaming. The adaptive operator selection problem is the environment above with a set number of operators that the agent can roam to, with the goal of maximizing the cumulative reward over the duration of the generated problem instance. The adaptive satellite roaming problem works similarly, except that the number of operators is now limited to only 2. One of them is a single operator selected from the area, which is used to populate the values in the method defined earlier, but the second available operator is now a simulated satellite connection. This satellite operator is ever-present in all the cells, with none of the gaps that can occur for normal operators (due to either lack of data or lack of coverage), but with the catch that the reward is notably lower compared to a normal operator. The interest here is seeing whether the agent is capable of learning how to utilize the available satellite connection to compensate for any "gaps" that are present in the primary operator. As such, this problem definition is just a slight alteration of roaming with an arbitrary number of operators, but it is interesting due to the future development of technology such as SpaceMobile, which could enable devices to roam to a satellite connection without requiring dedicated hardware [6]. In the results, evaluations from both of these variations of the environment will be presented in their respective sections.

Figure 3.8: The alternative environment with an operator layer and a satellite layer.

3.2.3 Environment metrics

Within the environment, there are some metrics that are used to give context to the environment currently in use when the results are presented. These metrics are primarily related to the operators present within the cells of the environment and are called average operator value and coverage.


Average Operator Value

Average operator value is simply the average classification over all the cells in which the operator has a presence. Note that this does not include empty cells (whose class for the operator is 0). Since the classes are defined as integers within the set {1, 2, 3, 4}, the average operator value lies in [1, 4].
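Written out, with c_1, ..., c_n denoting the n cells in which the operator has a presence, this definition can be restated (in the same notation as Equation 3.2 below) as:

AvgVal(operator) = (class(c_1) + class(c_2) + ... + class(c_n)) / n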

Coverage

Coverage is defined as the percentage of the total number of cells within the environment in which the operator has a presence. So if the operator has values in n cells, and the total number of cells within the environment is m, coverage is expressed as the percentage:

coverage(%) = 100 ∗ n/m    (3.1)

When displaying results with padding, coverage will be displayed as "coverage(P)", with (P) meaning padded. Coverage will also be used to give some information on the relationship between operators within the environment. In particular, there are two relationships that will turn up for operator A and operator B, with coverage cov(A) and cov(B):

• The union of the coverage, cov(A) ∪ cov(B), is the number of cells they cover combined, divided by the total number of cells.

• The intersection of the coverage, cov(A) ∩ cov(B), is the number of cells they both have a presence in, divided by the total number of cells.

Both of these relations, like coverage, are expressed as a percentage (%).
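A minimal sketch of how these quantities could be computed, treating each operator's presence as a set of cell IDs; the function and variable names are illustrative.

```python
def coverage(op_cells, total_cells):
    """Coverage of one operator as a percentage of all cells in the environment (Equation 3.1)."""
    return 100.0 * len(op_cells) / total_cells


def coverage_union(cells_a, cells_b, total_cells):
    """Cells where operator A or operator B (or both) is present, as a percentage of all cells."""
    return 100.0 * len(cells_a | cells_b) / total_cells


def coverage_intersection(cells_a, cells_b, total_cells):
    """Cells where both operators are present, as a percentage of all cells."""
    return 100.0 * len(cells_a & cells_b) / total_cells
```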

Rating

A metric named operator rating was introduced to enable one to use a single number to give a rough estimate of how good a single operator performs within the currently established environment. This operator rating can be expressed as:

Rating(operator) = cov(operator) ∗ AvgVal(operator)    (3.2)

The rating gives a rough number to use when comparing different operators with each other.


3.3 MDP formulation

The problem can be viewed as a Markov decision process, with the relevant attributes from Section 2.1.1. The tuple (S, A, P, r, γ) has the following properties:

The state space, S

The state space is a tuple containing relevant variables for the current position of the agent in the context of the environment, as well as some additional information. This tuple acts as the input to the neural network model that approximates the Q-table. The information delivered from the environment is a 6-tuple with the following contents (a small sketch of this tuple follows the list):

• The currently selected operator

• The start cell ID.

• The goal cell ID.

• The current cell ID.

• The next cell ID (the cell to select an operator for).

• The previous cell ID, which was the location before the current cell ID.
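As a small sketch, the state could be represented as a named tuple; the field names are illustrative, and the thesis describes it as a plain tuple, with the order here simply following the list above.

```python
from collections import namedtuple

State = namedtuple("State", [
    "current_operator",  # the currently selected operator
    "start_cell",        # the start cell ID
    "goal_cell",         # the goal cell ID
    "current_cell",      # the current cell ID
    "next_cell",         # the next cell ID (the cell to select an operator for)
    "previous_cell",     # the previous cell ID, visited before the current cell
])
```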

The action space, A

The action space of the problem definition is the selection of available operators, and thus its size varies depending on how many different operators are available within the geographical location of the established environment. But the action space can always be defined as a set of operators, with |A| = |{1..n}| = n, n > 1, where n is the number of operators.

Transition Probabilities, P

The transition probabilities within this problem environment are completely deterministic, meaning that for a state s ∈ S and an action a ∈ A, transitioning to the resulting state s′ can be expressed as P(s′ | s, a) = 1.
