Reinforcement Learning for 5G Handover


Master Thesis in Statistics and Data Mining

Reinforcement Learning for 5G handover

by

Maxime Bonneau

2017-06

Department of Computer and Information Science

Division of Statistics


Supervisors:

Jose M. Peña (LiU) Joel Berglund (Ericsson) Henrik Rydén (Ericsson)

Examiner:


Linköping University Electronic Press

Upphovsrätt

This document is made available on the Internet – or its possible future replacement – from the date of publication, barring exceptional circumstances.

Access to the document implies permission for anyone to read it, to download it, to print out single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A subsequent transfer of the copyright cannot revoke this permission. All other use of the document requires the author's consent. To guarantee authenticity, security and accessibility, solutions of a technical and administrative nature are in place.

The author's moral rights include the right to be mentioned as the author to the extent required by good practice when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or individuality.

For additional information about Linköping University Electronic Press, see the publisher's home page http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet – or its possible

replacement – from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for

anyone to read, to download, or to print out single copies for his/her own use

and to use it unchanged for non-commercial research and educational purpose.

Subsequent transfers of copyright cannot revoke this permission. All other uses

of the document are conditional upon the consent of the copyright owner. The

publisher has taken technical and administrative measures to assure authenticity,

security and accessibility.

According to intellectual property law the author has the right to be

mentioned when his/her work is accessed as described above and to be protected

against infringement.

For additional information about the Linköping University Electronic Press

and its procedures for publication and for assurance of document integrity,

please refer to its www home page:

http://www.ep.liu.se/.


Contents

Contents iv

List of Figures vi

List of Tables viii

Abstract 3
Acknowledgments 4
1 Introduction 5
1.1 Background . . . 5
1.2 Objective . . . 8
2 Data 10
2.1 Data sources . . . 10
2.2 Data description . . . 10
3 Method 12
3.1 Related works . . . 12
3.2 Q-learning algorithm . . . 13
3.3 Combining reinforcement learning and artificial neural networks . . . 18
3.4 Visualisation . . . 23
3.5 Settings . . . 27
4 Results 31
4.1 Time considerations . . . 31
4.2 Proof of learning . . . 32
4.3 Convergence . . . 33
4.4 Penalising handover . . . 34
4.5 Two actions . . . 35
4.6 Heatmaps . . . 35
4.7 Contextual information . . . 38
4.8 Artificial neural networks and reinforcement learning . . . 42
5 Discussion 44
5.1 The work in a wider context . . . 44
5.2 General observations . . . 45
6 Conclusion 47
7 Appendix 48
7.1 Reducing the amount of data used for learning . . . 48
7.3 Results tables . . . 51
7.4 Verification of the consistency . . . 53


List of Figures

1.1 Simplified diagram of the LTE-A network . . . 6

1.2 Illustration of a handover . . . 7

1.3 Illustration of interaction between agent and environment in RL . . . 8

2.1 Map of the simulated network . . . 11

3.1 Most commonly used activation functions for hidden layers . . . 19

3.2 Simple ANN with one hidden layer of two nodes . . . 19

3.3 First ANN approximating action-value function . . . 21

3.4 Second ANN approximating action-value function . . . 22

3.5 Expected shape of the cumulative sum of the reward . . . 23

3.6 Expected shape of the evolution of the mean reward . . . 24

3.7 Plot to visualise the efficiency of the learning . . . 25

3.8 Barplots in complement of the graphs in 3.7 . . . 26

3.9 The column on the right shows the action selected . . . 28

3.10 Two different UEs with identical start and finish states . . . 29

3.11 Directions simplification for a UE . . . 30

3.12 Two different UEs with different start and finish points but similar directions . . . 30

4.1 Histogram of the time needed to run ten Q-Learning algorithms . . . 32

4.2 Cumulative sum of the reward . . . 32

4.3 Mean reward per epoch while learning with the first five UEs . . . 33

4.4 Evolution of the mean squared error through the number of iterations . . . 34

4.5 Test with a penalty factor of 25 . . . 34

4.6 Test with a penalty factor of 50 . . . 34

4.7 Q-learning algorithm using two actions . . . 35

4.8 Heatmap representation of a Q-table . . . 36

4.9 Heatmap representation of the number of visits . . . 36

4.10 Mean reward for each state-action pair . . . 37

4.11 Standard deviation of the reward for each action-state pair . . . 37

4.12 Testing indoor UEs . . . 38

4.13 Testing car UEs . . . 38

4.14 Testing walking UEs . . . 38

4.15 Learning with 1 UE . . . 39

4.16 Learning with 5 UEs . . . 39

4.17 Learning with 10 UEs . . . 40

4.18 Precision of 10 . . . 40

4.19 Precision of 1 . . . 40

4.20 No feature added for learning . . . 41

4.21 Direction used for learning . . . 41

4.22 Angle between UE and BS used for learning . . . 41


4.24 log-likelihood of the ANN . . . 43

4.25 Heatmap of the Q-table after learning with the first three car UEs . . . 43

4.26 Some output from the ANN trained with the first three car UEs . . . 43

7.1 Distribution of the mean RSRP difference (walking UEs) . . . 48

7.2 Distribution of the number of handovers (walking UEs) . . . 48

7.3 Distribution of the mean RSRP difference (car UEs) . . . 49

7.4 Distribution of the number of handovers (car UEs) . . . 49

7.5 Testing on walking UEs . . . 49

7.6 Testing on car UEs . . . 49

7.7 Testing indoor UEs . . . 53

7.8 Testing car UEs . . . 54

7.9 Testing walking UEs . . . 54

7.10 Learning with 1 UE . . . 55

7.11 Learning with 5 UEs . . . 55

7.12 Learning with 10 UEs . . . 56

7.13 Precision of 10 . . . 57

7.14 Precision of 1 . . . 57

7.15 No feature added for learning . . . 58

7.16 Direction used for learning . . . 58


List of Tables

2.1 Longitude of the first UE . . . 10

2.2 RSRPs for the first UE . . . 11

7.1 Table based Q-learning algorithm parameters . . . 50

7.2 Parameters of Q-learning algorithm using ANNs . . . 50

7.3 Results for each category with contextual information . . . 51

7.4 Results while learning with different numbers of UEs . . . 51

7.5 Results for the seventh to the eleventh UE when learning with 10 UEs . . . 52

7.6 Results with different precisions for rounding the RSRP . . . 52


Abbreviations and definitions

Abbreviation Meaning

3GPP Third Generation Partnership Project

ANN Artificial Neural Network

BS Base Station

CN Core Network

eNB eNodeB

E-UTRAN Evolved UMTS Terrestrial Radio-Access Network

logL log-likelihood

LTE-A Long Term Evolution Advanced

MDP Markov Decision Process

MME Mobility Management Entity

MSE Mean Squared Error

P-GW Packet data network GateWay

RL Reinforcement Learning

RSRP Reference Signal Received Power

S-GW Serving GateWay


Term Definition

Epoch: An epoch is all the time-steps from the start at a random state to the goal state. An epoch is thus composed of a succession of state-action-reward-state tuples.

Epoch index: The process of learning being a succession of epochs, the epoch index is the index given to an epoch according to its chronological order.

Iteration: An iteration is one step in the process of learning. An iteration consists of a succession of epochs, one for each UE used to learn. To be able to learn properly, the agent should repeat a certain number of iterations.


Abstract

The development of the 5G network is in progress, and one part of the process that needs to be optimised is the handover. This operation, which consists of changing the base station (BS) providing data to a user equipment (UE), needs to be efficient enough to be seamless. From the BS point of view, the operation should be as economical as possible while satisfying the UE's needs. In this thesis, the problem of 5G handover has been addressed, and the chosen tool to solve it is reinforcement learning. A review of the different methods proposed by reinforcement learning led to the restricted field of model-free, off-policy methods, more specifically the Q-learning algorithm. In its basic form, and used with simulated data, this method provides information on which kind of reward and which kinds of action-space and state-space produce good results. However, despite working on some restricted datasets, this algorithm does not scale well due to lengthy computation times. This means that the trained agent cannot use a lot of data for its learning process, and neither the state-space nor the action-space can be extended much, restricting the use of the basic Q-learning algorithm to discrete variables. Since the strength of the signal (RSRP), which is of high interest to match the UE's needs, is a continuous variable, a continuous form of Q-learning needs to be used. A function approximation method is then investigated, namely artificial neural networks. In addition to the lengthy computation time, the results obtained are not convincing yet. Thus, despite some interesting results obtained from the basic form of the Q-learning algorithm, the extension to the continuous case has not been successful. Moreover, the computation times make reinforcement learning applicable in this domain only on really powerful computers.


Acknowledgments

First, I would like to thank Ericsson AB, especially LinLab, for giving me the chance to work with them, and for providing me with an interesting thesis topic, along with relevant data and a perfect working environment and atmosphere.

Specifically, I would like to express my gratitude to Henrik Rydén and Joel Berglund, my supervisors at Ericsson, who have always been ready and eager to give me advice and answer my questions. Thanks also for teaching me how to lose at table hockey.

In the same vein, I would like to thank Jose M. Peña, my supervisor at Linköping University, for his opinions and thoughts on my work, and also for his time.

Thanks also to Andrea Bruzzone, my opponent, for his careful and thorough revision work.

I would also like to thank my family, who, despite the distance, have kept my motivation and my will at a high level by providing continuous encouragement.

A final thanks goes to the whole class of the master in Statistics and Data Mining for the friendly and good atmosphere during these two years, especially to Carro who has followed me at Ericsson in order to push me to do my best.


1

Introduction

This chapter first gives an introduction to telecommunication systems and reinforcement learning. Then the objective of this thesis is stated.

1.1

Background

Telecommunication systems

For the purpose of this master thesis, some insight into telecommunications is helpful. A complete description of the network is not necessary, but an overview of telecommunication systems and a description of the customer-facing part of the network will help the understanding.

The biggest companies in the telecommunications domain have competed step by step in the development of wireless networks, from the second generation, also known as 2G, up to the latest 4G generation, called LTE-A (Long Term Evolution Advanced). As explained by Dahlman et al. in [5], this network is composed mainly of the Core Network (CN) and the Evolved-UMTS Terrestrial Radio-Access Network (E-UTRAN) (see Figure 1.1). The CN makes the link between the Internet and the E-UTRAN. It is composed of several nodes, e.g. the Mobility Management Entity (MME), the Serving Gateway (S-GW), and the Packet Data Network Gateway (P-GW). The MME constitutes the control-plane node: it handles security keys, checks whether a User Equipment (UE) (a device able to communicate with the LTE-A network) can access the network, and establishes connections. The S-GW node is the user-plane node. There are several Base Stations (BSs) broadcasting information to the UEs, and the role of the S-GW node is to act as a mobility anchor when UEs move between these BSs. This node also collects information and statistics necessary for charging. The P-GW node is the one directly linked to the Internet, relaying it by providing IP addresses.


Figure 1.1: Simplified diagram of the LTE-A network

There is only one type of node used by the LTE-A radio-access network, and it is called eNodeB (eNB). An eNB is a logical node linked directly and wirelessly, through a beam, to a UE. A beam is a signal transmitted along a specific course, used to carry the data from the eNB to the UE. The connection between BS and UE is established after agreement from both parties. The transmission between a UE and a BS is called uplink when the UE communicates information to an eNB, and the opposite communication, from an eNB to a UE, is called downlink. On the other side, an eNB is linked to the CN: to the MME node by means of an S1 control-plane part and to the S-GW node by means of an S1 user-plane part. Among other possible implementations, the eNB is commonly implemented as a three-sector site, each sector spreading several beams. An eNB can be implemented as, but is not the same as, a BS. Despite this, eNB and BS will in this thesis be treated as equivalent and called BS, or node. This simplification in the language and in the technical precision does not change the results obtained later.

An important network procedure that needs to be described here is the handover. It should be noted that the following explanation is a simplification of the real process. While a UE can stand in the field of several beams at the same time, only one of these beams provides data to the UE, and it is called the serving beam. The strength of the signal received by a UE is called Reference Signal Received Power (RSRP) and is measured in decibel-milliwatts (dBm). While the UE is moving, the RSRP can fluctuate. When the RSRP becomes so low that the UE does not get a satisfactory connection, another beam should replace the serving beam in order to get a better RSRP. The handover is the operation consisting of switching the UE's serving beam from one BS to another (see Figure 1.2). It should be noted that a handover is a costly operation, which is why it should be performed efficiently. In fact, performing a handover is not only about switching the serving beam, but also about finding the best possible beam to switch to.

So far, only one study has broached the problem of handover optimisation for the 5G network. Ekman [7] used supervised learning (more specifically, random forest) to find how many candidate beams should be activated in order to find the best one to switch to. The results show that around 20 candidate beams are required in order for the best beam to be selected in 90% of the samples. This thesis is about the same topic, but will not try to answer the same question.


Figure 1.2: Illustration of a handover

5G requirements

The Third Generation Partnership Project (3GPP) is a collaboration between groups of telecommunications associations. Its main scope is to produce globally applicable specifications for all the generations of mobile networks starting from the 3G system.

The main requirements for the 5G system are the following [14]:

• Higher data rate
• Latency in the order of 1 ms
• High reliability
• Low-cost devices with very long battery life
• Network energy efficiency

Telecommunication companies need to respect these constraints while developing their network. Nevertheless, it may be complicated to satisfy all these requirements at the same time. A trade-off may be necessary between these requirements, since, for example, it is so far difficult to offer an extremely high data rate together with perfect reliability at low cost [18].

A lean design policy helps to reach these requirements, since it gives companies more freedom in the way they develop and manage their network, so they can produce more efficient products. This matters, for instance, when one knows that a BS is always fully supplied with power despite being in idle mode most of the time [8].

Reinforcement learning

Reinforcement learning (RL) is a subfield of machine learning, alongside supervised and unsupervised learning. Gaskett et al. described reinforcement learning in [9] as: "Reinforcement learning lies between the extremes of supervised learning, where the policy is taught by an expert, and unsupervised learning, where no feedback is given and the task is to find structure in data."

As explained by Sutton in [23], the basic idea of RL is to let an agent learn its environment through a trial-and-error process. At each state of its learning, the agent must choose an action, for which it receives a reward (see Figure 1.3). This reward may be positive, negative, or null. Depending on the received reward, the agent learns whether taking this action at the current state is advantageous or not. It is therefore essential that the rewards are properly defined, or the agent could learn a poor behaviour. Taking an action leads the agent to a new state, which may be the same as the previous one. The agent repeats this


process of state-action until it reaches a goal state. It can happen that there is no goal state; in this case the agent stops learning after a certain number of steps, defined by the user.

An epoch is the process between a randomly chosen first state and the goal state. In order to learn, the agent may need to repeat a rather large number of epochs, depending on the size of the environment. The aim of this process is that the agent visits all the possible states in the environment and takes all possible actions at each state several times, so that it eventually knows how to achieve a goal in the best possible way.

Figure 1.3: Illustration of interaction between agent and environment in RL [20]

Since its beginnings, RL has evolved considerably and has found many fields of application. Here are some examples of applications of RL, among many others:

• Controlling a hovering helicopter flight [13]
• Playing Atari games and beating human champions [12]
• Swinging up and balancing a pole [16]
• Training robot soccer teams [21]

1.2

Objective

As mentioned earlier, a handover is a costly operation. This is why the handover procedure needs to be improved, both from the UE and the BS point of view, and why the number of handovers should be reduced as much as possible. Concretely, the objective is to optimise the trade-off between the signal quality, the number of measurements needed to find a better beam, and the number of handovers. This master thesis proposes a new approach to this problem, using machine learning, and more specifically RL. The questions that will be broached in this thesis are the following:

• What kind of RL algorithm could address the 5G handover problem?
• Which features are needed to make this algorithm as efficient as possible?


• Is RL a good method in practice to find the optimal trade-off between signal quality, number of measurements and number of handovers?

The structure of this thesis is the following: Chapter 2 provides details on the data used for this work. Chapter 3 presents the methods used, along with the necessary theory. Chapter 4 then presents the main results. Chapter 5 discusses the outputs from the previous chapters. Chapter 6 draws conclusions from this work. Finally, Appendix 7 provides additional results and clarifications.


2

Data

In this chapter, the origin of the data is described, followed by the description of the available data.

2.1

Data sources

While developing a new network, it is not possible to collect real data since the hardware is not deployed. To compensate, Ericsson has developed a network simulator. BSs are deployed in a city model consisting of streets and buildings (see Figure 2.1) to simulate a network. For this simulation, each BS is divided into three sectors, each of them transmitting eight beams. UEs are simulated over a period of 60 seconds. At the beginning of the simulation, they are placed at a start point and then move on the map. Some of them move along the streets, possibly in a vehicle, while others move inside buildings, in order to model realistic UE movements. At each time step, the simulator provides the RSRP values spread from all the beams to all the UEs.

2.2

Data description

The simulation results contain the position of the UEs (see Table 2.1) in three-dimensional space (latitude, longitude, altitude), the position of the BSs in the same reference space, and the RSRP received by each UE from each beam. It should be noted that the latitude, longitude and altitude are expressed in the reference frame of the map, not in usual units like degrees. The values in Tables 2.1 and 2.2 have been rounded to three decimals for better readability.

Time (s)        0.5       0.6       0.7       0.8       0.9      ...      60
Longitude    90.356    90.424    90.495    90.566    90.636      ...      53.28

Table 2.1: Longitude of the first UE

It must be noted that the UE's position provided by the simulator is the exact location. However, in the real world, it is impossible to have perfect information about the position of the UE, because of the precision of the measurement tools (e.g. Global Navigation Satellite System or radio positioning).


Figure 2.1: Map of the simulated network.

The map represents streets (white), buildings (light grey), and open areas between buildings (dark grey). The green and red points represent sectors; each group of three sectors is a BS

All the RSRP values (expressed in dBm) are measured every 0.1 second in a time frame of 59.5 seconds, from 0.5 seconds after the beginning of the simulation to one minute after (see Table 2.2). This makes a total of 596 measurements per beam per UE during the simulation. There are 14 BSs, all of them steering 24 beams, which makes 336 values of RSRP per UE per time step. While a lot of different simulated UEs are available, only 600 have been loaded to start with. Later, these 600 UEs will be referred to as UE1, UE2, ..., UE600, or first UE, second UE, etc. To refer to several UEs, an expression like "the first three UEs" will be used, meaning UE1, UE2 and UE3. This order is neither a preference order nor a choice; it is simply the order in which they were simulated.

Time (s)        0.5        0.6        0.7        0.8        0.9      ...        60
Beam 1       -84.738    -85.235    -85.869    -85.938    -83.753     ...    -80.119
Beam 2       -85.844    -86.797    -89.941    -89.8159   -91.170     ...    -65.808
Beam 3       -98.353    -97.127    -97.082    -96.866    -93.078     ...    -63.796
...
Beam 336    -149.320   -146.096   -151.465   -149.321   -149.291     ...   -123.636

Table 2.2: RSRPs for the first UE

From all the simulated data presented in this chapter, other information such as the direction and the speed of a UE, the distance between a UE and a BS, the time since last measurement has been performed, etc., can be extracted.
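As an illustration of how such derived features could be computed from the simulator output, the sketch below assumes the positions of one UE are stored in a NumPy array of shape (596, 3) and the BS positions in an array of shape (14, 3); the array names, the random placeholder data, and the 0.1 s time step used here are assumptions for the example, not part of the simulator's actual interface.

    import numpy as np

    dt = 0.1                                      # simulator time step, in seconds
    rng = np.random.default_rng(0)
    ue_pos = rng.uniform(0, 100, size=(596, 3))   # placeholder for one UE's (x, y, z) track
    bs_pos = rng.uniform(0, 100, size=(14, 3))    # placeholder for the 14 BS positions

    # Speed: displacement between consecutive time steps divided by dt
    displacement = np.diff(ue_pos, axis=0)                 # shape (595, 3)
    speed = np.linalg.norm(displacement, axis=1) / dt

    # Direction: horizontal heading angle of the movement
    direction = np.arctan2(displacement[:, 1], displacement[:, 0])

    # Distance from the UE to every BS at every time step, shape (596, 14)
    dist_ue_bs = np.linalg.norm(ue_pos[:, None, :] - bs_pos[None, :, :], axis=2)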

It can be noted here that, for a given immobile UE receiving a signal from a given BS, the RSRP can change slightly over time. The reason is that interference can modify how the beam propagates, hence a change of RSRP over time.


3

Method

First, this chapter introduces some related works that have inspired this thesis. Then it presents the methods used, starting with the Q-learning algorithm. When needed, the theory necessary to understand a method is explained.

3.1

Related works

The three following paragraphs summarise results published by researchers who faced conditions close to the ones in this thesis. These examples have been selected because they present very good results and inspiring methods, or because they are among the rare ones to broach their topic, especially the work of Csáji.

Gaskett et al. [9] did pioneering work on extending the Q-learning algorithm to continuous state and action spaces. They made a submersible vehicle find its way to a random point, while the controller of the vehicle initially knew neither how to control it nor what the goal was. To do this, they used feedforward artificial neural networks combined with wire fitting in order to approximate the action-value function.

Mnih et al. [12] were able to train a single agent to play seven Atari 2600 games, using deep RL. They used batch updates of the Q-learning algorithm combined with approximation through deep neural networks to train their agent to play these Atari games without knowing the rules, simply by looking at the pixels of the screen. The results are impressive: the method outperforms the comparative methods on six out of seven games, and achieves better scores than a human expert on three out of seven games.

Csáji has shown in his Ph.D. thesis [4] that, using (ε, δ)-MDPs, the Q-learning algorithm can perform well in varying environments. In a scheduling process, an unexpected event appears and two restarts of the scheduling are compared: from scratch, or from the current action-value function. Reusing the result of the Q-learning algorithm in varying environments significantly reduces the time needed to reach the same result as before the unexpected event.


3.2

Q-learning algorithm

Markov Decision Process

A Markov Decision Process (MDP) is the framework of any reinforcement learning algorithm, as it describes the environment of the agent. According to Silver [20], an MDP is a tuple ⟨S, A, P, R, γ⟩ where:

• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix: $\mathcal{P}^{a}_{ss'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
• R is a reward function: $\mathcal{R}^{a}_{s} = \mathbb{E}(R_{t+1} \mid S_t = s, A_t = a)$, with $R_t$ the reward received at time t
• γ is a discount factor, γ ∈ [0, 1].

S, A, R and γ are designed by the user. P can be either defined, estimated, or ignored. If no state transition probability matrix has been defined, fewer RL methods are available. More details are provided later in Section 3.2.

The state transition probability matrix P defines transition probabilities from all states s to all successor states s′: for each action a,

$$\mathcal{P}^{a} = \begin{pmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{pmatrix},$$

where each row of the matrix sums to 1 and n is the number of states.

The discount factor gives the present value of future rewards. It is used in the definition of the return $G_t$, which is the total discounted return from time step t:

$$G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad (3.1)$$

where $R_t$ is the reward at time t. A γ close to 1 favours rewards in the long term, while with a γ close to 0 only the very next rewards have an effect on the return.

As suggested by its name and by the definition of P, the state sequence S1, S2, ... satisfies the Markov property. It means that, given the present state, the future does not depend on the past. Concretely, the observed state contains all the information needed for the future, even if the sequence leading to this state is lost.
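As a small numerical illustration of Equation 3.1, the snippet below computes the discounted return of a finite reward sequence with γ = 0.9; the reward values are made up for the example.

    # Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence
    gamma = 0.9
    rewards = [-84.7, -85.2, -85.9, -83.8]    # illustrative rewards (e.g. RSRP values)
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    print(G)   # -84.7 + 0.9*(-85.2) + 0.81*(-85.9) + 0.729*(-83.8)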

Greedy policy

In order to learn the best action to take at any state, the theory relies on the concept of return (see Equation 3.1). As one can see, the return takes into account all the future rewards, and an agent is expected to maximise its return. Since it would be extremely costly to compute the expected return for each state-action pair at a time t, some approximation is needed. That is where greedy policies come in. A greedy policy is a policy which considers that the action leading to the maximum return is the one giving the highest reward at the current state.

An agent is expected to choose the most rewarding action for a given state. However, to learn what actually is the best action, the agent needs to make a trade-off between exploration and exploitation. One does not want the agent to exploit only, because as soon as it has discovered a state-action pair, it would stick to this same pair, even if it is not the best


one. Conversely, an agent that only explores will take a random action in any state just to see what the reward is, which is obviously not a learning behaviour. In order to make the trade-off between exploration and exploitation, the agent will use a greedy policy which leads it to explore and exploit in an appropriate manner.

Some greedy policies, namely the ε-greedy policy and the optimistic initialisation policy, will be introduced in the following subsections.

ε-greedy policy

The concept of the ε-greedy policy is very simple. It states that the agent should follow a greedy policy 100(1 − ε)% of the time, otherwise a random action is selected, in order to favour exploration. There are some variations of this ε-greedy policy; for example, ε can change over time. For instance, one can assume that after some time the agent has a good perception of the environment. Then there is no real need to continue exploring, and ε can be set to zero after a certain number of epochs. It is also possible to make it decrease as a function of time, so that it tends to zero.
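A minimal sketch of ε-greedy action selection over a Q-table, assuming the table is a NumPy array indexed by (state, action); the table size and names are placeholders for illustration, not the thesis implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(q_table, state, epsilon=0.1):
        # With probability epsilon pick a random action (explore), else the greedy one (exploit)
        n_actions = q_table.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(q_table[state]))

    Q = np.zeros((10, 4))               # placeholder table: 10 states, 4 actions
    print(epsilon_greedy(Q, state=3))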

Optimistic initialisation policy

The idea of this policy is to set highly optimistic initial Q-values. Doing this, for any state and action, the reward is less than expected, therefore the learner will try another action the next time it reaches this state. This encourages the learner to explore all the state-action pairs several times, until the action-values converge.

About the algorithm

To start with, an MDP to apply the Q-learning algorithm on should be defined. The first MDP that has been set up is the following:

• The state-space S is all the combinations of a serving node and an RSRP.
• The action-space A consists of performing a handover to every single node.
• The reward is the RSRP of the target node.
• The discount factor is γ = 0.9.

These choices will be explained later. Moreover, the state transition probability matrix has not been defined, because this thesis will focus on the Q-learning algorithm. It should be noticed that by definition, the RSRP will always be negative. The higher the RSRP, the stronger the signal; consequently, a high RSRP will result in a high reward. Moreover, as explained in Section 2.2, the RSRP can change over time for an immobile UE. The reward is then stochastic, even if it does not change drastically.

It is important to notice here that unless some changes are explicitly stated, this MDP will be the base of any Q-learning algorithm used in this thesis.

Model-free and model-based methods

In Section 3.2, it was postulated that no state transition probability matrix would be defined. This is because the Q-learning algorithm is a model-free method.

The difference between model-free and model-based methods can be compared to the difference between habitual and goal-directed control of learned behavioural patterns [23]. Dickinson [6] showed that habits are the result of antecedent stimuli (model-free), while goal-directed behaviour is driven by its consequences (model-based).


Model-based methods use the knowledge provided by the state transition probability matrix in order to help the agent learn the most rewarding behaviour in the environment (e.g. the forward model [10]). On the contrary, in a model-free method the agent discovers its environment by trial-and-error. It builds habits by facing states and choosing actions several times. Model-free methods are of interest for this thesis, because the high number of states and actions used in this project makes it difficult to know the state transition probability matrix. The most famous model-free methods are probably SARSA [17] and Q-learning [25].

On or off-policy

Among the different model-free methods, two kinds can be distinguished: the on-policy methods and the off-policy methods.

To learn the best possible behaviour in its environment, a learner may need to optimise its policy. A policy π is a distribution over actions given states: π(a | s) = P(A_t = a | S_t = s). How an agent behaves is completely defined by its policy, because it is the rule that the agent follows to select actions.

In order to estimate this policy, two different kinds of models can be used: on-policy or off-policy. The clearest explanation of on-policy and off-policy is given by Sutton in [23]: "On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data."

To clarify these terms, the notion of action-value function needs to be introduced. It is the expected return from a starting state s, taking an action a and following a policy π:

$$q_{\pi}(s, a) = \mathbb{E}_{\pi}(G_t \mid S_t = s, A_t = a) \qquad (3.2)$$

The goal of many RL algorithms is to learn this action-value function. An action-value is also called a Q-value.

SARSA is an on-policy method because it uses the next state and the action actually taken by following the policy π to update its Q-value. In contrast, Q-learning is an off-policy method because it uses the next state and the action given by a greedy policy (see Section 3.2) to update its Q-value.
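The difference between the two update targets can be made concrete in a few lines. The sketch below contrasts them for a NumPy Q-table; the table size, parameter values and function names are placeholders chosen for this example.

    import numpy as np

    alpha, gamma = 0.1, 0.9
    Q = np.zeros((20, 4))                    # placeholder table: 20 states, 4 actions

    def sarsa_update(s, a, r, s_next, a_next):
        # On-policy: uses the action actually taken in s_next under the current policy
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

    def q_learning_update(s, a, r, s_next):
        # Off-policy: uses the greedy (maximum) action-value in s_next
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])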

Description

The Q-learning algorithm is a rather popular reinforcement learning algorithm, developed in 1989 by Watkins [25]. It has proved itself on a wide range of problems (e.g. [1], [22]). Thus, it is a highly interesting tool to start with in this project. It is a model-free, off-policy algorithm, whose idea is to update an action-value table, also called Q-table, until it converges to the optimum Q-values [24]. The action-value table is a matrix with the states standing as rows and the actions standing as columns. After the learning, this table can be used by the agent to take the best decision depending on the state it faces.

The Q-learning algorithm is implemented as presented in Algorithm 1, and will be detailed further.


Algorithm 1 Q-Learning Algorithm

1: Initialize Q(s, a) arbitrarily, ∀ s ∈ S, a ∈ A(s), and Q(terminal state, ·) = 0
2: for each epoch do
3:     Initialize S
4:     for each step of epoch do
5:         Choose A from S using a policy derived from Q (e.g., ε-greedy policy)
6:         Take action A, observe R, S′
7:         Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S′, a) − Q(S, A)]   (3.3)
8:         S ← S′
9:     until S is terminal

α is called the learning rate, α ∈ [0, 1]. It decides the weight to give to the newly acquired information compared to the old. An α close to 0 means that the agent hardly learns anything new (see Equation 3.3), while an α close to 1 makes it overwrite almost everything previously learnt. The latter can be useful in the case of a deterministic environment; however, this assumption is not verified in this thesis.

In the case where there is no goal state, a time limit for each epoch should be defined so that the agent does not keep learning on one epoch forever.

In words, the Q-learning algorithm works as follows: a Q-table is first randomly initialised, with the terminal-state row being set to 0. Then, for each epoch, a random state is chosen. For each time-step of each epoch, an action is taken according to the policy derived from Q. After this action is taken, the agent reaches a new state and gets the corresponding reward. The Q-table is then updated according to Equation 3.3.

This equation may need some clarification. It explains how to update the Q-table for the state-action pair (S, A). After taking the action A at the state S, the agent gets a reward R and reaches the state S′. The Q-table is updated using the reward and, because it is an off-policy algorithm, the best possible action-value that can be obtained at the state S′. γ tells how important the future steps are for the current action-value.

If the agent were a robot looking for the exit of a maze, it would repeat the Q-learning for a certain number of epochs. That is, it would start from a random point in the maze, look for the exit, and repeat this operation until it knows the whole maze. In the case of this thesis, the agent is not a UE, but the network, which is trying to figure out how to behave regarding all the UEs. That is why the agent should learn from a lot of UEs to know how to choose the best possible BS for each UE. To do so, iterations will be run. One iteration contains an epoch for each UE. For example, if the interest is on three UEs, one iteration will consist of three epochs run successively, one for each UE. An epoch is made of 596 steps, due to the way the UEs are simulated.
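A compact sketch of how Algorithm 1 could be run with this iteration/epoch structure is given below. The environment step function is only a stand-in for the simulator-driven transition and reward, and the table sizes, parameter values and names are placeholders for illustration, not the thesis code.

    import numpy as np

    n_states, n_actions = 238, 14              # sizes used later in this thesis
    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    steps_per_epoch = 596                      # one epoch = one simulated UE trajectory
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))        # optimistic start: all rewards are negative

    def env_step(state, action):
        # Placeholder for the simulator: returns (next_state, reward)
        return int(rng.integers(n_states)), -float(rng.uniform(40, 200))

    for iteration in range(100):               # each iteration runs one epoch per UE
        for ue in range(5):                    # e.g. learning with the first five UEs
            state = int(rng.integers(n_states))            # epoch starts at a random state
            for step in range(steps_per_epoch):
                if rng.random() < epsilon:                 # epsilon-greedy action selection
                    action = int(rng.integers(n_actions))
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward = env_step(state, action)
                # Equation 3.3: off-policy update towards the greedy value of next_state
                Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                             - Q[state, action])
                state = next_state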

Defining the features

First, a greedy policy should be chosen. In this thesis, the ε-greedy policy will mainly be used. The value chosen for ε is 0.1, which means that one out of ten actions will be selected randomly among the possible actions. This choice seems to be a good trade-off between exploration and exploitation, as confirmed by its common use and the successes encountered with it [23]. The advantage of this method is that it is suitable for any kind of reward.

In the frame of this work, all the rewards are negative. Because of this characteristic, the optimistic initialisation policy is a suitable alternative to the ε-greedy policy. By setting all the initial Q-values to zero, any reward will be lower than the Q-value. Then, for a certain state-action pair, the updated Q-value will decrease according to Equation 3.3. When this state is encountered again, the Q-value of this state-action pair is lower than all the other Q-values


for this state. Since the greedy selection picks the action with the maximum Q-value, this specific state-action pair is not visited again until it once more has the highest Q-value. Thus, there is no need for explicit exploration: all the state-action pairs are visited because each of them at some moment has the highest Q-value.

Since both methods are suitable for this work, the choice is made to favour the more widespread one, namely the ε-greedy policy.

Then, to implement a working Q-learning algorithm, several steps have been needed. The main issue while updating a table is the time it takes, which mainly depends on the size of the matrix. In fact, the bigger the matrix, the more steps are needed to visit each state-action pair enough times to reach convergence. Thus, the first challenge is to get as much information as possible while keeping the Q-table as small as possible. Since the rows represent the states and the columns represent the actions, the goal is to reduce the state-space and the action-space. After weighing up the pros and cons of many different spaces, the actions were chosen to be switching to a beam, and the states to be the combination of the serving beam and the RSRP measured from this beam.

To reduce the state-space and the action-space significantly, a node is chosen instead of a beam. Since, in the simulated map, each node spreads 24 beams, this greatly reduces both spaces. When a node is picked, the beam providing the best RSRP is chosen. This is an optimistic way of proceeding, but the idea behind this choice is to be able to reproduce it at the beam scale. If the state-space were made of beams, there would be no choice to make. There are 14 nodes on the simulated map, so the action-space is reduced to 14 actions, each of them being a handover to a given node. The nodes received an index during the simulation, and this index is used for the actions as well. For example, the first column of the Q-table is: switch to node 1. If the action selected is to switch to the serving node, then no operation is actually made.

Since the Q-learning algorithm uses a table, it is not possible to use a continuous variable like the RSRP. In order to use this essential variable anyway, the measured RSRP is rounded to the closest multiple of ten. For example, if the RSRP is -73.02, then it is rounded to -70. After looking at all the simulated values, it appears that all the RSRP values lie between -200 and -40. There are thus 17 possible values of rounded RSRP for each serving node, and there are 14 nodes, so there is a total of 14 × 17 = 238 states.
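A sketch of how a (serving node, RSRP) pair could be mapped to one of the 14 × 17 = 238 rows of the Q-table is given below; the exact indexing scheme used in the thesis is not specified, so this is only one possible encoding.

    def state_index(serving_node, rsrp, n_bins=17, rsrp_min=-200):
        # Map (serving node 1..14, RSRP in dBm) to a row index in 0..237
        rounded = int(round(rsrp / 10.0) * 10)        # round to the closest multiple of ten
        rounded = max(-200, min(-40, rounded))        # clip to the observed range
        rsrp_bin = (rounded - rsrp_min) // 10         # 0..16
        return (serving_node - 1) * n_bins + rsrp_bin

    print(state_index(1, -73.02))   # RSRP -73.02 rounds to -70, giving state index 13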

The reward initially chosen is the difference between the RSRP of the node targeted by the chosen action, also called the target node, and the RSRP received from the serving node. As mentioned earlier, even if the target node and the current serving node are the same, the reward can be different from zero, due to the evolution in time, and possibly in space, of the UE.

The parameters of the model are the learning rate α = 0.1 and the discount factor γ = 0.9. These values have been chosen according to the literature, in which they are widely used and work efficiently [23], but not only for that reason. A γ close to 1 has been chosen because a handover should be performed only if this operation is worthwhile in the long term. A γ close to 0 would make the agent consider that only the next step is important, and it would then perform a handover each time a better node is found.

Convergence

If, theoretically, this method is supposed to converge to the optimum Q-values, how can one be sure that the number of iterations was sufficient to let the Q-table converge? Apart from some visualisation tools that will be presented later, a way to verify the convergence has been devised. To be able to check that there is no more evolution of the Q-table, some intermediate values of the Q-table are saved during the learning. To measure the convergence, the mean squared element-wise difference between two successive Q-tables (MSE) is computed. At the


time-step t,

$$\mathrm{MSE}(t) = \frac{\sum_{i,j}\left(Q^{t-1}_{i,j} - Q^{t}_{i,j}\right)^2}{I \cdot J}, \qquad (3.4)$$

where $Q^t$ is the Q-table at time t, $Q_{i,j}$ is the element at the crossing of row i and column j, I is the size of the state-space and J is the size of the action-space.

Moreover, in case the MSE is still fluctuating due to changing rewards, a linear regression is applied to the last values of the MSE. The explanatory variable is the number of iterations and the response variable is the MSE. If the slope is close to 0, it means that the MSE has converged.
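This convergence check can be sketched as follows, assuming the intermediate Q-tables have been kept in a list; np.polyfit is used here for the linear regression of the MSE against the iteration index, which is one possible way to estimate the slope and not necessarily the thesis implementation.

    import numpy as np

    def mse_between(q_prev, q_next):
        # Equation 3.4: mean squared element-wise difference of two Q-tables
        return float(np.mean((q_prev - q_next) ** 2))

    def has_converged(q_snapshots, last=20, tol=1e-3):
        # Fit a line to the last MSE values; a near-zero slope suggests convergence
        mse = [mse_between(a, b) for a, b in zip(q_snapshots[:-1], q_snapshots[1:])]
        tail = np.asarray(mse[-last:])
        slope, _ = np.polyfit(np.arange(len(tail)), tail, 1)
        return abs(slope) < tol

    # Example with dummy snapshots that gradually stop changing
    snaps = [np.full((238, 14), -100.0 + 1.0 / (i + 1)) for i in range(50)]
    print(has_converged(snaps))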

3.3

Combining reinforcement learning and artificial neural networks

It has become a quite popular technique to integrate the use of Artificial Neural Networks (ANNs) in RL. According to the literature, these kinds of methods have met a lot of success (e.g. [23], [11]). Q-learning is one of the algorithms that have been successfully used with ANNs [23].

Theory of neural networks

Description

ANNs were defined by Dr. Robert Hecht-Nielsen in [3] as "a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs". ANNs, which are deeply inspired by the human neural network, usually aim at inferring functions or at performing classification from provided observations. An ANN is organised as a succession of layers, each layer being composed of one or several so-called nodes.

In order to introduce neural networks mathematically, a specific kind, used in this thesis, will be described: the feed-forward network, also called the multilayer perceptron. Consider x = (x_1, ..., x_D) as the input variable and y = (y_1, ..., y_K) as the output variable. First, M linear combinations of x are constructed:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}, \qquad (3.5)$$

where j = 1, ..., M and the superscript (1) indicates that the parameters $w_{ji}$ (the weights) and $w_{j0}$ (the biases) belong to the first layer of the ANN. The $a_j$ are called activations. Then an activation function h, which should be differentiable and nonlinear, is applied to the activations in order to give the hidden layer:

$$z_j = h(a_j) \qquad (3.6)$$

The hidden layer, composed of the M hidden units, is the layer between the input and the output layers. The activation function is often chosen to be the logistic sigmoid function or the 'tanh' function (see Figure 3.1).


Figure 3.1: Most commonly used activation functions for hidden layers

Then, K linear combinations of the hidden units are constructed:

$$a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0}, \qquad (3.7)$$

where k = 1, ..., K and the superscript (2) indicates that the parameters $w_{kj}$ and $w_{k0}$ belong to the second layer of the ANN. The $a_k$ are called output unit activations. Last, the outputs y are obtained by applying another activation function to the output unit activations. In this thesis, because ANNs will be used for regression, the activation function of the output layer is the identity: $y_k = a_k$.

A common representation of the ANN described above is given in Figure 3.2. In this particular case, D = 2, M = 2 and K = 1. An ANN with one hidden layer is called a two-layer network, because that is the number of layers having weights and biases to adapt. The network described previously is a two-layer feed-forward ANN, but a feed-forward ANN can have several hidden layers. It has to be noted that a feed-forward network does not contain any closed directed cycles: the links originate from one layer and reach the following layer, with no possible way back to a previous layer.

Figure 3.2: Simple ANN with one hidden layer of two nodes [19]. The weights are on the links between two consecutive layers
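A minimal NumPy sketch of the forward pass of such a two-layer feed-forward network (Equations 3.5–3.7) with a tanh hidden layer and an identity output, using the dimensions of Figure 3.2 (D = 2, M = 2, K = 1); the weight values are random placeholders and the code is only an illustration, not the thesis implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    D, M, K = 2, 2, 1                       # input, hidden and output sizes
    W1 = rng.normal(0, 0.01, size=(M, D))   # first-layer weights w_ji
    b1 = np.zeros(M)                        # first-layer biases  w_j0
    W2 = rng.normal(0, 0.01, size=(K, M))   # second-layer weights w_kj
    b2 = np.zeros(K)                        # second-layer biases  w_k0

    def forward(x):
        a = W1 @ x + b1        # Equation 3.5: activations of the hidden layer
        z = np.tanh(a)         # Equation 3.6: hidden units
        y = W2 @ z + b2        # Equation 3.7 with identity output (regression)
        return y

    print(forward(np.array([0.5, -1.0])))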


ANN training

The training is presented in a more general case, that is, not only for a network with one hidden layer. First, the generalised versions of Equations 3.5 and 3.6, valid for any layer, are:

$$a_j = \sum_{i} w_{ji} z_i, \qquad (3.8)$$

where $z_i$ is the activation of a unit sending a connection to unit j, and

$$z_j = h(a_j) \qquad (3.9)$$

To train an ANN, the idea is to provide inputs x and targets t = (t_1, ..., t_K). The targets are the values that one would like the ANN to output when x is provided as input. The inputs are placed in the input layer, then the network processes these inputs with the current parameters w following Equations 3.8 and 3.9. The output is denoted y. The goal is to minimise the error function
$$E(w) = \frac{1}{2}\sum_{n=1}^{K}\left(y(x_n, w) - t_n\right)^2.$$

To minimise E(w), the idea is to solve the equation ∇E(w) = 0. Since it is extremely complicated to find an analytical solution to this problem [2], it will be necessary to proceed by iterative numerical procedures. The error backpropagation (see Algorithm 2) is the most commonly used solution.

Here is a short clarification of this method; more details can be found in [2]. The error function can be written as a sum of terms, one for each data point in the training set: $E(w) = \sum_{n=1}^{K} E_n(w)$. The problem can then be restricted to evaluating $\nabla E_n(w)$ for each n. According to the chain rule for partial derivatives,

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\, z_i \;\;\text{(according to Equation 3.8)} = \delta_j z_i \;\;\text{(by notation)} \qquad (3.10)$$

By applying again the chain rule for partial derivative,

$$\delta_j = \sum_{k} \frac{\partial E_n}{\partial a_k}\,\frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_{k} w_{kj}\,\delta_k \;\;\text{(using 3.8, 3.9 and 3.10)} \qquad (3.11)$$

Algorithm 2 Error backpropagation

1: For an input vector x, apply a forward propagation through the network using 3.8 and 3.9 to find the activations of all the hidden and output units.
2: For the output units, evaluate δ_k = y_k − t_k
3: For each hidden unit, backpropagate the δ's using δ_j = h'(a_j) Σ_k w_kj δ_k
4: Evaluate the required derivatives using 3.10
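As a complement to the forward-pass sketch earlier, one hand-written backpropagation step for the same two-layer, tanh-hidden, identity-output network could look as follows; the learning rate, the single input/target pair and all names are illustrative assumptions, not the thesis code.

    import numpy as np

    rng = np.random.default_rng(1)
    D, M, K, lr = 2, 2, 1, 0.01
    W1, b1 = rng.normal(0, 0.01, (M, D)), np.zeros(M)
    W2, b2 = rng.normal(0, 0.01, (K, M)), np.zeros(K)

    x, t = np.array([0.5, -1.0]), np.array([-70.0])   # one made-up input/target pair

    # Forward pass (Equations 3.8 and 3.9)
    a1 = W1 @ x + b1
    z1 = np.tanh(a1)
    y = W2 @ z1 + b2

    # Backward pass (Algorithm 2)
    delta_out = y - t                                   # step 2: output deltas
    delta_hidden = (1 - z1 ** 2) * (W2.T @ delta_out)   # step 3: h'(a) = 1 - tanh(a)^2
    dW2, db2 = np.outer(delta_out, z1), delta_out       # step 4: required derivatives
    dW1, db1 = np.outer(delta_hidden, x), delta_hidden

    # Simple gradient-descent update of the weights and biases
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
    W1, b1 = W1 - lr * dW1, b1 - lr * db1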

Using artificial neural networks in the Q-learning algorithm

The principle of this Q-learning algorithm is still to update an action-value function until convergence to the optimum Q-values, following Algorithm 1. The action-value function is still updated according to Equation 3.3. However, instead of updating a Q-table, an ANN is trained to approximate the action-value function. The state and the action are given as


inputs, and the output is the Q-value corresponding to this state-action pair. The ANNs used to approximate the action-value function are presented below.

Using an ANN should make it possible to move to the continuous case, so there is no need to discretise the RSRP anymore. This results in a smaller number of input nodes, because there is one continuous variable instead of a categorical variable with, for example, 17 possible values and as many nodes. It should also result in a more accurate action-value function.

The first ANN created is a feedforward neural network, described previously in this section, with only one hidden layer of ten units. This ANN aims to approximate the action-value function, so a state-action pair is given as an input, and the action-value corresponding to this state-action pair is the output. The state-space S and the action-space A are the same as described in Section 3.2. Thus, the state-space is the combination of the categorical variable serving node, which can take 14 values, and the continuous variable RSRP. There will thus be 15 input nodes for the state. The 14 input nodes for the serving node are all set to 0 except the one corresponding to the serving node, which is set to 1. In the same way, the action is encoded with 14 nodes, all of them set to 0 except the one corresponding to the target node (see Figure 3.3). There are 15 nodes to represent the state and 14 nodes for the action, which means that there are 29 input nodes. The output being the Q-value of the input state-action pair, there is only one output node.

Figure 3.3: First ANN approximating action-value function, in the case of a state S: -70dBm received from the serving node 1, and an action A: perform a handover to node 14

The reward being the RSRP received from the target node, a penalty of 25 is applied to any action leading to a handover. This choice will be explained later in Sections 3.5 and 4.4.

The weights and biases of the layers of the ANN are initialised as follows: $w_{ji} \sim \mathcal{N}(0, 0.01)$.

The chosen activation function is the tanh function (see Figure 3.1). The values of γ, α and ε are chosen as previously, according to the discussion in Section 3.2. In order to train the ANN, an input dataset and a target dataset should be provided. However, there are no such datasets in this case. In fact, the target needs to be computed at each time-step of the learning following Equation 3.3. So it is not possible to have a complete target dataset, because the target depends on the current action-value given by the ANN. Thus, the ANN must be


adapted after every time-step, by providing the input, the output and the target computed depending on the input.

After a first try, it appears that this method is extremely time-consuming. In order to speed up the process, batch training is considered. It means that the weights are not updated after every iteration; instead, the inputs and targets are saved, and the ANN is updated with this set of values from time to time. The size of the batch is set to a quarter of an epoch, so there are four updates made for each epoch.

Despite this batch updating of the ANN, the training process remains extremely slow and thus very difficult to exploit within the time frame of a master thesis. So a new kind of ANN is built. The general architecture, a feed-forward ANN with one hidden layer of ten hidden units and a tanh activation function, is kept. The input layer is the state, and the output layer is the Q-value for each possible action. Thus, there are 15 input nodes, corresponding to the RSRP and the serving node, and 14 output nodes, corresponding to each action (see Figure 3.4).

Figure 3.4: Second ANN approximating the action-value function, in the case of a state S: -70 dBm received from the serving node 1. All the Q-values corresponding to all the actions compose the output
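A sketch of this second architecture in NumPy: 15 inputs (a one-hot serving-node vector plus the raw RSRP) and 14 outputs, one Q-value per possible handover target. The input encoding and the initialisation follow the description above, while the helper names and the rest of the code are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, n_hidden = 14, 10
    W1 = rng.normal(0, 0.01, size=(n_hidden, n_nodes + 1))   # 15 inputs -> 10 hidden units
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.01, size=(n_nodes, n_hidden))       # 10 hidden -> 14 Q-values
    b2 = np.zeros(n_nodes)

    def encode_state(serving_node, rsrp):
        # One-hot serving node (1..14) concatenated with the continuous RSRP value
        one_hot = np.zeros(n_nodes)
        one_hot[serving_node - 1] = 1.0
        return np.concatenate([one_hot, [rsrp]])

    def q_values(state):
        hidden = np.tanh(W1 @ state + b1)
        return W2 @ hidden + b2            # one Q-value per action (identity output)

    s = encode_state(serving_node=1, rsrp=-70.0)
    print(q_values(s).shape)               # (14,): the Q-value of every possible action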

Convergence

Computing the log-likelihood (logL) is one way to measure whether an ANN is converging. Assuming a normal distribution of the output of the ANN, it can be written:

$$\log L(t) = \log\left(\prod_{k=1}^{K} \frac{\exp\left(-\dfrac{\left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}}\right) \qquad (3.12)$$

where K is the size of the output layer, $\mathrm{net}_t$ is the output of the ANN at time-step t, $\mathrm{input}_t$ is the input at time-step t and $\mathrm{target}_t$ is the target at time-step t. [k] represents the k-th element, for example the output of the k-th node of the ANN. The interest is not the value of the log-likelihood itself, but its relative value. So all the constant terms can be grouped in variables X and Y:

$$\log L(t) = \sum_{k=1}^{K}\left[X\left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2 + Y\right] \qquad (3.13)$$

Finally, X and Y are removed from the expression of the log-likelihood, so that logL becomes a value proportional to the log-likelihood:

$$\log L(t) = \sum_{k=1}^{K}\left(\mathrm{net}_t(\mathrm{input}_t)[k] - \mathrm{target}_t[k]\right)^2 \qquad (3.14)$$

This value is supposed to decrease as the output values get closer to the target values.
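A sketch of this convergence proxy (Equation 3.14), tracked over training steps, is shown below; net_output and target stand for the ANN output and the Q-learning target at a given time-step and are placeholders here.

    import numpy as np

    def loglik_proxy(net_output, target):
        # Equation 3.14: sum of squared output-target differences (up to constants)
        return float(np.sum((np.asarray(net_output) - np.asarray(target)) ** 2))

    # Example: the tracked values should decrease as training progresses
    history = []
    for net_output, target in [([-70.2, -80.1], [-70.0, -80.0]),
                               ([-70.05, -80.02], [-70.0, -80.0])]:
        history.append(loglik_proxy(net_output, target))
    print(history)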

3.4

Visualisation

In order to see how well the learning performs, it is common to look at the curve of the cumulative sum of the reward, since the goal of the agent is to maximise its reward [23]. The cumulative sum of the reward is the sum of all the rewards obtained by the agent during the learning:

$$\mathrm{cumsum}(T) = \sum_{t=1}^{T} R_t, \qquad (3.15)$$

where $R_t$ is the reward obtained by the agent at time-step t. In fact, the cumulative sum of the mean reward per epoch will be plotted. That is, after an epoch, the rewards obtained during this epoch are summed and divided by the size of an epoch, which is always 596. In practice, the only difference is that the curve looks smoother and the values obtained are lower.

The expected behaviour is that, after taking some actions leading to excessively bad rewards, the agent should avoid taking these actions again, and after some time all the actions to avoid are known. It follows that the cumulative sum of the reward should first include low rewards, and thus decrease quickly; after some time these state-action pairs are not chosen anymore, so the slope becomes closer to zero (see Figure 3.5). In one iteration, there is one epoch of learning for each UE. The epoch index is the index given to an epoch according to its chronological order.
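A plotting sketch for this diagnostic, assuming the mean reward of each epoch has been stored while learning; the learning curve is synthetic and matplotlib is used purely for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder learning curve: mean reward per epoch, improving then flattening out
    mean_reward_per_epoch = -90 + 20 * (1 - np.exp(-np.arange(200) / 50.0))

    plt.plot(np.cumsum(mean_reward_per_epoch))
    plt.xlabel("Epoch index")
    plt.ylabel("Cumulative sum of the mean reward")
    plt.show()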


Another way to visualise that the agent actually learns is to plot the mean reward per epoch. At the end of each epoch, the mean reward obtained by the agent is computed, and the resulting graph shows the evolution of this mean epoch after epoch. The expected behaviour is that the mean reward starts quite low because of badly rewarding states. Then, these states are avoided because the agent knows they are not interesting for it, so the mean reward increases after some time of learning (see Figure 3.6).

Figure 3.6: Expected shape of the evolution of the mean reward

Barplots

If the agent has learnt, it is interesting to visualise how it uses its knowledge. The idea is to use the learnt Q-table and a simulated UE to see how BSs would provide a signal to this UE. To select an action, the agent picks the one with the highest Q-value for the current state. To see how it performs, two kinds of graphs are plotted. First (see Figure 3.7), the RSRP (on the left) and the serving node (on the right) are plotted for each time-step during a complete epoch. The serving node is the node spreading the serving beam. On each graph, there are two different lines. The first one, in orange, shows the case where the RSRP is maximum. The second, in blue, shows the situation induced by the use of the Q-table learnt by the agent. That is, for each time-step, the maximum RSRP and the corresponding serving node are displayed in order to see whether the learnt behaviour is close to the optimum or not. The simulation is made on one or several UEs that have been used to make the agent learn, plus at least one test UE, to see if the learnt Q-table can be used for unknown positions. All the UEs are plotted row by row.

The test UE is always the UE following the last UE used to learn in the simulation order. If there are several test UEs, they are the following ones. In case of different categories (see Section 3.5), the test UE is the one following the last UE in the same category.


Figure 3.7: Plot to visualise the efficiency of the learning

For the previous plot, the first five UEs have been used for the learning of the agent. Some UEs seem to perform well, like the first or the fourth UE, because the blue line (predicted values) is close to the orange line (optimal values) on the RSRP graph, and also because the blue line is quite stable on the graph of the serving node, meaning that few handovers are performed. On the contrary, the predicted RSRP values of the test UE are much lower than the optimal ones, and the serving node is constantly changing, showing that many handovers are performed.

In order to complete the information given by these graphs, the second kind of graph used is barcharts, which show the mean difference in RSRP between the optimal and the learnt choice of node, as well as the number of handovers (see Figure 3.8).


Figure 3.8: Barplots complementing the graphs in Figure 3.7

The first five UEs have been used for the learning of the agent on the previous graph. The analysis is the same, but more summarised. It is easier to see, for example, how many handovers have been performed.

In order to be more precise, the 90% confidence interval of the mean difference in RSRP is added to the graph for each UE. However, the distribution of the mean difference is unknown. To overcome this obstacle, the confidence intervals are computed using the bootstrap method. The idea of bootstrapping is to use resampling to estimate statistics, such as, in this case, the confidence interval.

Briefly, here is how the bootstrap works [15]. Suppose that the goal is to estimate a statistic m from a sample x_1, ..., x_n. A resample x*_1, ..., x*_n of the same size as the original sample is drawn with replacement, and m* is the statistic of interest computed from this resample. This operation is repeated many times. Then, according to the bootstrap principle, the variations of m and m* are approximately equal.

In this thesis, the bootstrap confidence interval is computed after resampling 10,000 times from the differences in RSRP. The lowest five percent and the highest five percent of the 10,000 means computed from the resamples are then removed. The remaining lowest value is the lower limit of the confidence interval, while the remaining highest value is the upper limit.
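A sketch of this procedure in Python (the exact implementation used in the thesis may differ):

```python
import numpy as np

def bootstrap_ci_90(differences, n_resamples=10_000, seed=0):
    """90% bootstrap confidence interval for the mean RSRP difference:
    resample with replacement, compute the mean of each resample, then drop
    the lowest and highest five percent of the 10,000 bootstrap means."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(differences, dtype=float)
    means = np.array([rng.choice(diffs, size=diffs.size, replace=True).mean()
                      for _ in range(n_resamples)])
    lower, upper = np.percentile(means, [5, 95])
    return lower, upper
```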

Heatmaps

A complementary technique can be used in order to visualise and interpret the obtained results, namely heatmaps. Firstly, they will be used to visualise the Q-table itself. For more readability, instead of plotting a single heatmap with all the states and all the actions, the different nodes are plotted individually. Thereby, one can see distinctly where the border between two nodes lies and thus understand the map better. The state-action pairs that have never been visited during the learning process are coloured in grey. For the other colours, the darker they are, the smaller the corresponding action-value. After the learning, one can guess from this heatmap which action will be taken at a given state: the one with the palest colour. The heatmap of the action-value table can be used to visualise which node will make the agent perform a handover to which other node. It can also become visually evident why two nodes are continuously alternating, like for the third UE in Figure 3.7, where nodes 5 and 12 alternate between time-steps 1 and 100.

In order to get even more information, the use of heatmaps is extended. For example, one can look at the number of times each state-action pair has been visited during the learning process. This allows seeing whether some actions are favoured when the agent is in a certain state, which would mean that the action has probably not been chosen randomly, but because the target node gives a strong signal. For more readability, a log-10 scale is used; otherwise only a few state-action pairs that have been visited extremely often would stand out. It means that in order to get the real number of visits, the following operation should be made:

if log_10(number of visits) = x, then number of visits = 10^x.

Moreover, since the reward can differ for the same state-action pair, it can be interesting to get an idea of the distribution of the rewards for each state-action pair over the complete learning process. Heatmaps can be used here as well, to plot the mean reward and the standard deviation of the reward for each state-action pair.
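As an illustration of the visit-count heatmap, here is a minimal matplotlib sketch (the mean and standard-deviation heatmaps follow the same pattern); the grey masking of never-visited pairs and the log-10 scale match the conventions described above:

```python
import copy
import numpy as np
import matplotlib.pyplot as plt

def plot_visit_counts(visit_counts):
    """Heatmap of the number of visits per state-action pair, on a log-10
    scale; never-visited pairs (count 0) are masked and shown in grey."""
    counts = np.asarray(visit_counts, dtype=float)
    log_counts = np.ma.masked_where(counts == 0,
                                    np.log10(np.where(counts == 0, 1, counts)))
    cmap = copy.copy(plt.cm.viridis)
    cmap.set_bad(color="grey")
    fig, ax = plt.subplots()
    im = ax.imshow(log_counts, aspect="auto", cmap=cmap)
    fig.colorbar(im, ax=ax, label="log10(number of visits)")
    ax.set_xlabel("action")
    ax.set_ylabel("state")
    return fig
```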

3.5 Settings

Penalising handover

As previously said, performing a handover is a costly action. This is why it can be judicious to penalise handovers in the model. Concretely, it means adding a penalty to the reward, the value of which has to be defined, whenever the action leads to selecting a node different from the current serving node. A simple way of choosing the value of this penalty factor is to train the model with different values, all other things being equal, including the use of a fixed seed to avoid differences due to randomness in the initialisation. The tested values range from 5 to 50 in steps of 5.

More precisely, the penalty factor is subtracted from the reward whenever a handover is performed, and so the updating function of the Q-learning algorithm 3.3 becomes:

Q(S, A) \leftarrow Q(S, A) + \alpha \big[ R - p \cdot \mathbf{1}_{TargetNode \neq CurrentNode} + \gamma \max_a Q(S', a) - Q(S, A) \big] \qquad (3.16)

where p is the penalty factor.
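A sketch of this update rule, with placeholder values for α, γ and p (the actual values used in the experiments are set elsewhere):

```python
import numpy as np

def q_update_with_handover_penalty(Q, s, a, reward, s_next,
                                   target_node, current_node,
                                   alpha=0.1, gamma=0.9, penalty=10.0):
    """One Q-learning update following Equation 3.16: the penalty p is
    subtracted from the reward whenever the chosen target node differs from
    the current serving node, i.e. when the action triggers a handover.
    Q is assumed to be a 2-D array indexed as [state, action]."""
    shaped_reward = reward - penalty * (target_node != current_node)
    td_target = shaped_reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```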

The selection of the best penalty factor implies a compromise: it is expected that the higher the penalty, the lower the number of handovers, but also the higher the mean difference in RSRP. A trade-off between mean difference and number of handovers therefore has to be made.

Two actions

Since choosing among 14 actions may be time-consuming, it can be pertinent to consider a smaller action-space. It does not seem possible to restrict it to fewer than two actions. Thus, the action-space described in 3.2 is updated as follows:

A = {Measure the RSRP from all the beams and switch to the best one; Do not perform a handover}

The second action could also be called "Do nothing", because not even a measurement is performed. In order to see which action is taken at each time-step, a new column of graphs has been added to the previous visualisation tool from Figure 3.7. It is the column on the right (see Figure 3.9). Action 0 corresponds to doing nothing, while action 1 consists in measuring all the beams. This makes it clear that measuring all the beams may be a costly operation, even if no handover is performed after taking action 1. To account for this, a penalty factor is added when the measuring action is taken. As this action involves measuring the power of many beams and performing a handover, the penalty factor has been set quite high: 50.
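As a sketch, the reward shaping for this two-action setting could look as follows (the cost of 50 is the one mentioned above; the function name is hypothetical):

```python
def shaped_reward_two_actions(raw_reward, action, measure_penalty=50.0):
    """Action 1 (measure all beams and switch to the best one) is penalised by
    a fixed cost, whether or not it results in an actual handover; action 0
    (do nothing) leaves the reward unchanged."""
    return raw_reward - measure_penalty if action == 1 else raw_reward
```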

Figure 3.9: The column on the right shows the action selected

It can be noticed in Figure 3.9, for the second, third and test UEs, that the action taken is often action 1. However, when looking at the column in the middle, it appears that no handover has been performed. This is because action 1 leads to measuring all the beams and switching to the best one, so if the best one is already the serving node, there is no handover to perform. Consequently, taking this action was not optimal in these cases.

Contextual information

A new idea is now exploited: the notion of contextual information, introduced by Sutton [23]. It means that some exterior knowledge is used to make the agent learn more efficiently. It would probably be very interesting to know what kind of UE is requesting a signal, for example whether it is a smartphone or a connected car, which brand it is, etc. Unfortunately, the simulator cannot recreate different kinds of terminals. But there is one type of knowledge of great interest for contextual information: the mobility of the UE. There are three categories of UEs: the ones used indoor, the ones used in a motorised vehicle outdoor, and the ones used otherwise outdoor. The contextual information is used to learn one Q-table for each category. Thereby, it is expected that the results are more accurate. Indeed, without contextual information, if two UEs are in the same state but one is indoor and the other outdoor, it is quite likely that for the same action the following state will not be the same.

For more convenience, and without reducing the understanding, the three categories listed above will be called indoor, car and walking UEs. The simulator first created a walking UE, then a car UE and finally an indoor UE, and then repeated this order. So the walking UEs are UEs 1, 4, 7, etc., the car UEs are UEs 2, 5, 8, etc. and the indoor UEs are UEs 3, 6, 9, etc.
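With this ordering, the contextual information can be implemented by keeping one Q-table per category and selecting it from the UE index; the sketch below (hypothetical helper names) assumes the 238-state, 14-action spaces used elsewhere in this chapter:

```python
import numpy as np

CATEGORIES = ("walking", "car", "indoor")

def ue_category(ue_index):
    """Map the 1-based UE index to its mobility category, following the
    simulation order described above (walking: 1, 4, 7, ...; car: 2, 5, 8, ...;
    indoor: 3, 6, 9, ...)."""
    return CATEGORIES[(ue_index - 1) % 3]

# One Q-table per category (238 states x 14 actions, as used in this chapter)
q_tables = {cat: np.zeros((238, 14)) for cat in CATEGORIES}

# During learning, the agent reads and updates the table of the UE's category
q = q_tables[ue_category(5)]   # UE 5 is a car UE
```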


Adding new features

Until now, a quite simple state-space has been used. As explained earlier, a bigger state-space would imply longer computation times. However, this additional time is accepted in order to try to discover features giving the agent better precision. In order to see if some other features can help the agent learn, two experiments have been carried out. The first one uses the direction of the UE, the second uses the angle between the UE and the serving node.

The idea of adding these features arose because using simply the combination of the RSRP and the serving node as the state was a little restrictive. In fact, if two UEs start at the same place with the same RSRP but move in different directions, their RSRPs can evolve in the same way while the node to switch to will be different (see Figure 3.10).

Figure 3.10: Two different UEs with identical start and finish states

The previous diagram introduces a very theoretical case. Two UEs are considered to start from the same point, so with the same RSRP. Then they move in different directions, but their RSRP stays exactly the same. When the RSRP becomes low, a handover should be performed, and the two UEs will not need the same beam to get a good signal strength, because they are quite far from each other.

Consider the direction in which the UE is moving. It would have been interesting to use the angle directly, but due to the use of the Q-learning algorithm, an approximation has to be made. To do so, the vertical is taken as 0° and the space around the UE is divided into a certain number of segments (see Figure 3.11). The UE being at the centre of the figure, it can go in one of eight directions. Eight has been chosen as a middle ground between not-too-wide segments and not too many additional states.
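A sketch of this discretisation, assuming the movement angle is measured clockwise from the vertical and that the segment boundaries start at 0° (one possible convention, not necessarily the one used in the simulator):

```python
import math

def movement_angle(x_prev, y_prev, x_curr, y_curr):
    """Direction of movement in degrees, measured clockwise from the vertical."""
    return math.degrees(math.atan2(x_curr - x_prev, y_curr - y_prev)) % 360.0

def direction_segment(angle_degrees, n_segments=8):
    """Map an angle in degrees to one of the eight segments, numbered 1 to 8."""
    width = 360.0 / n_segments          # 45 degrees per segment
    return int((angle_degrees % 360.0) // width) + 1

# A UE moving at 100 degrees from the vertical falls into segment 3
print(direction_segment(100.0))  # 3
```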


Figure 3.11: Directions simplification for a UE

But after testing this feature, it appeared that adding the direction did not solve all the issues caused by similar states, because similar states combined with a similar direction can still lead to different nodes to switch to (see Figure 3.12).

Figure 3.12: Two different UEs with different start and finish points but similar directions

On the previous diagram, another theoretical case is introduced. Two UEs start with an identical RSRP provided by the same serving node. The two UEs evolve in the same direction, and their RSRP stays the same while they are moving. When the RSRP becomes too low, a handover needs to be performed, but the ideal beam will not be the same for both UEs.

To solve this, the angle between the UE and the BS appears to be the best feature to avoid these issues of similar cases. Indeed, the previous case is resolved by using this feature. The simulated BSs are each composed of three sectors, whose coordinates are known. The same system of segments as for the direction is used (see Figure 3.11); if the three sectors are not in the same segment, the segment containing two of the sectors is used.

Adding a feature actually means changing the MDP described in 3.2. The state-space S becomes the combination of the serving node, the RSRP received from this node, and the new feature. As described previously, the new feature is represented by an integer between 1 and 8. For example, a state can be an RSRP of -70 dBm received from node 8, whose angle with the UE lies in segment 3. Adding one of these new features multiplies the size of the state-space by 8, giving 8 × 238 = 1904 states.
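A hypothetical encoding of this extended state, simply combining the index of the base state (serving node and RSRP level, 238 possibilities) with the new feature (segment 1 to 8):

```python
N_BASE_STATES = 238          # serving node combined with the RSRP level
N_SEGMENTS = 8               # the new feature: direction or angle segment

def extended_state_index(base_state, segment):
    """Unique index in [0, 8 * 238) for the pair (base_state, segment),
    with base_state in [0, 238) and segment in 1..8."""
    return (segment - 1) * N_BASE_STATES + base_state

N_STATES = N_SEGMENTS * N_BASE_STATES   # 1904 states, as stated above
```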
