
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

A Deep Learning Approach to Downlink User Throughput Prediction in Cellular Networks

DUDU XUECHEN ZUO


Abstract

A majority of the global population subscribe to mobile networks, also known as cellular networks. Thus, optimizing mobile traffic would bring benefits to many people. The available downlink user throughput in cellular networks is subject to heavy fluctuations, which leads to inefficient use of network capacity. The underlying network protocols address this issue by making use of adaptive content delivery strategies. An example of such a strategy is to maximize the video stream resolution with respect to the available bandwidth. However, the currently dominating solutions are reactive and hence take time to adapt to bandwidth changes. In this work, a deep learning framework for downlink user throughput prediction is proposed. Accurate throughput predictors could provide information about the future downlink bandwidth to the underlying protocols that would let them become proactive in their decision making and adapt faster to resource changes. The models are trained with novel loss functions that capture the different costs of overestimation and underestimation. They are based on feedforward and long short-term memory networks and achieve up to 79.4% accuracy.


Sammanfattning (Abstract in Swedish)

A majority of the world's population subscribes to mobile networks. This means that optimizing mobile data flows would benefit a large number of people worldwide. The user's mobile downlink bandwidth is subject to heavy fluctuations, which makes it harder to utilize the full capacity of the mobile networks. The underlying protocols handle this by using adaptive strategies for delivering the content requested by the user. One example of such a strategy is to adapt the resolution of a streamed video to the available bandwidth. The most common protocols, however, make decisions based on the user bandwidth in previous time steps, which makes the decision process reactive and therefore slow to adapt to changes. In this work, deep learning models trained to predict the user's downlink bandwidth are presented. With accurate bandwidth predictions, the protocols can instead become proactive in their decision making and thereby adapt their delivery strategy faster. The models are trained with tailored loss functions that reflect the different effects of overestimating and underestimating the bandwidth. They are of the feedforward and long short-term memory types and achieve a prediction accuracy of up to 79.4%.


Acknowledgement

I would like to thank Intel for being the host company of this master thesis. In particular, I would like to thank Jonas Svennebring, my supervisor at Intel, for his support and guidance during my work. He has trusted me to explore the research field independently at my own pace, yet has always been supportive in making time for my questions and concerns.

Furthermore, I would like to thank Qing He, my supervisor at KTH. It has been very valuable to receive advice and questions from an outside perspective. She has pointed out flaws that I would otherwise have overlooked.

Finally, I would like to thank my team at Intel for the warm welcome and the time spent together. You made me look forward to coming to the office every day.


Abbreviations

ARIMA  Autoregressive Integrated Moving Average
CNN  Convolutional Neural Network
CQI  Channel Quality Indicator
DNN  Deep Neural Network
FNN  Feedforward Neural Network
HMM  Hidden Markov Model
LSTM  Long Short Term Memory
LTE  Long Term Evolution
MAE  Mean Absolute Error
MLP  Multilayer Perceptron
MSE  Mean Squared Error
PRR  Packet Reception Ratio
NR  New Radio
RBF  Radial Basis Function
ReLU  Rectified Linear Unit
RF  Random Forest
RNN  Recurrent Neural Network
RSRP  Reference Signal Received Power
RSRQ  Reference Signal Received Quality
RSSI  Received Signal Strength Indicator
SNR  Signal To Noise Ratio
SVM  Support Vector Machine
SVR  Support Vector Regression
TTI  Transmission Time Interval
UE  User Equipment


Contents

1 Introduction
  1.1 Scope and contribution
  1.2 Outline
2 Background
  2.1 Machine learning
  2.2 Deep learning
  2.3 Cellular networks
  2.4 Downlink user throughput
  2.5 User throughput prediction
  2.6 Related work
  2.7 Neighbouring fields
  2.8 The prediction problem
3 Method
  3.1 Dataset
  3.2 Correlation analysis
  3.3 Data preprocessing
  3.4 Modelling the cell load
  3.5 Loss function and metrics
  3.6 The baselines and the models
  3.7 The covid-19 impact
4 Results
  4.1 Cell load model
  4.2 Model selection
  4.3 Loss penalty search
  4.4 Model generalization
5 Discussion
  5.1 Data generation
  5.2 The predictors
6 Conclusion


Chapter 1

Introduction

In November 2019 there were around 5.9 billion mobile subscribers around the globe and mobile traffic is expected to grow by 27% annually between 2019 and 2025 [1]. Thus, optimizing the utilization of mobile network capacity would benefit many people. In this work, deep learning applied to mobile network utilization optimization is explored. More specifically, a deep learning framework for downlink user throughput prediction in mobile wireless networks is proposed. With accurate throughput predictions, the underlying network protocols can maximize the amount of data they send to the users with respect to the available bandwidth. Thereby, the user experience is maximized.

The mobile user throughput is subject to significant fluctuations [2] [3]. The channel conditions of mobile networks, also known as cellular networks, vary due to their dependence on dynamic spatial and temporal factors. The physical environment causes the signals sent from and to cells to reflect, diffract and scatter. Signals interfere with each other. User mobility can cause big jumps in the achievable throughput when a User Equipment (UE) moves from one cell to another with different characteristics. These effects are present for both uplink and downlink data transmission. However, in this work the focus will be on downlink traffic and therefore the term user throughput will refer to the downlink case. Another word for the user throughput is the user bandwidth. Since both are widely used in the literature, the terms user throughput and user bandwidth will be used interchangeably in this report.

The throughput fluctuations lead to inefficient use of network capacity and a decrease in quality of experience [4]. Content providers address this issue by making use of protocols with adaptive delivery strategies. These protocols are present in both established areas such as video streaming and newly introduced ones such as remote cloud gaming. Given the same content with multiple resolutions, the underlying protocol can switch between them depending on historical information from the established connection to always match the current user throughput capacity. However, these protocols are reactive and hence they are slow at adapting to big and sudden capacity changes. This delay would be avoided if protocols could make accurate predictions of the future user throughput to base their decisions on. Such an approach would make the protocols proactive, which in turn would allow them to make decisions that better utilize the available throughput.

An interesting application area for proactive protocols is found in video streaming. Video streaming makes up a major proportion of the current mobile traffic. In 2019 video constituted 63% of the total mobile traffic and it is expected to grow 30% annually until 2025 to make up 76% of the total traffic [1]. The majority of video streaming services use the media streaming standard MPEG-DASH [5]. MPEG-DASH divides a video into a grid with time along one axis and resolution along the other axis. Different resolutions require different amounts of bandwidth for transmission. Depending on the time it took the user to retrieve previous video chunks, the resolution of the next chunk is adjusted. If the previous chunks were retrieved faster than expected, the resolution for the next chunk can be increased, while if they were retrieved slower than expected the resolution of the next chunk decreases. Thus, current methods for choosing among video resolutions in video streaming are based on historical data, that is, the resolution shown at time t depends on information gathered at time t-1 and backwards [2]. This means that the MPEG-DASH protocol makes its decisions in a reactive way. With user throughput prediction, content providers could move away from their current reactive solutions to become proactive and make the right delivery choices throughout the whole connection period.
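To make the reactive behaviour concrete, the snippet below sketches a simple last-throughput selection rule in Python: the next chunk gets the highest bitrate that fits the throughput observed for the previous chunk. The bitrate ladder and the function are illustrative assumptions only, not part of the MPEG-DASH standard or of the thesis implementation.

```python
# Illustrative sketch of a reactive rate selection rule: pick the highest
# available bitrate that fits the throughput observed for the previous chunk.
# The bitrate ladder below is a made-up example.
BITRATE_LADDER_KBPS = [300, 750, 1500, 3000, 6000]  # one entry per resolution

def next_chunk_bitrate(last_chunk_bits, last_download_seconds):
    """Choose the next chunk's bitrate from the previous chunk's throughput."""
    observed_kbps = last_chunk_bits / last_download_seconds / 1000
    feasible = [b for b in BITRATE_LADDER_KBPS if b <= observed_kbps]
    return feasible[-1] if feasible else BITRATE_LADDER_KBPS[0]

# Example: a 4 Mb chunk retrieved in 2 s gives 2000 kbps, so 1500 kbps is chosen.
print(next_chunk_bitrate(4_000_000, 2.0))
```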

Proactive protocols in video streaming have many advantages. User experience will improve as content providers can deliver the highest possible resolution with no stalls. If user throughput is expected to be low, the video can be pre-buffered and stalls are thereby avoided. On the other hand, if the user throughput is expected to be higher than the amount required for retrieving a video chunk with the highest available resolution, the number of buffered chunks can be held at a minimal level. Since users tend to jump between or within videos, this will reduce the amount of data being discarded without being watched. This in turn decreases network congestion and network load [5].

Zou et al. quantify how much video streaming would benefit from accurate user throughput prediction [4]. The authors conclude that current methods achieve only 68% - 89% of the optimal quality and hence fail to fully utilize the bandwidth. The main gap occurs at the beginning, where only 15% - 20% and 22% - 38% of the optimal quality are achieved during the first 32 and 64 seconds respectively. They found that video chunk resolution choices based only on bandwidth prediction lead to poor performance due to numerous and erratic resolution switches. However, combining short term bandwidth predictions with buffer occupancy and/or rate stability functions boosts the performance to achieve 96% of the optimal one.

Another example of how user throughput prediction can make protocols proactive is found in the newly established area of remote cloud gaming. In remote cloud gaming a server with powerful hardware hosts the games played on less powerful client devices. In contrast to conventional gaming, where the GPU sends the rendered frames through a frame buffer directly to the display, cloud gaming converts the rendered frames into a network stream that is sent to the client. Due to the large amount of information per frame, a transmission of raw frame data requires substantial bandwidth. To address this the servers make use of compression methods before sending the frames. At the client side, the network stream is decoded and sent through a frame buffer to the client display.

The compression methods rely on one or more compression parameters that determine to what extent the frame is compressed. The degree of compression in turn affects the quality of the decoded frame. A more compressed frame requires less bandwidth for transmission but leads to a lower frame quality and thereby a lower user experience. However, choosing a compression method that requires more bandwidth than available for transmission leads to lags in the game, which severely affects the player's ability to play well and reach good scores. By accurately predicting the available user throughput, the compression parameters can be chosen in a way that optimizes user experience.

To be of practical use, the bandwidth predictions must balance between being informative and stable. Big rapid fluctuations or spikes in the bandwidth often have no effect on the decision of the underlying protocol since it needs to stick to a lower resolution that fits the stable level of the bandwidth. Furthermore, one could argue that overestimating the bandwidth does more harm to the user experience than underestimating it. Again taking the video streaming scenario as an example, it is clear that underestimation leads to lower video quality while overestimation will cause the video to stall. For remote cloud gaming this effect is even clearer, where overestimation can cause players to lose the game. This preference should be encoded into the training of the prediction models.

Each cell of a cellular network has its own characteristics. Therefore, a prediction model trained on one cell with satisfactory test results will not necessarily generalize well enough to perform satisfactorily on another cell without being retrained. To make user throughput prediction feasible in practice, the prediction framework should be easy to deploy and train on different cells. This will require each cell to have its own dataset and hence it is of great importance that the training data consists of information that can be gathered by operators through existing pipelines. This would both enable easy deployment and eliminate the risk of data shortage.

In this work, several frameworks for downlink user throughput prediction are proposed. The models are deep learning based and make use of novel loss functions that try to reflect the preference between over- and underestimation. They are compared to baselines in the form of more traditional machine learning approaches and evaluated in terms of prediction accuracy and cell generality.

1.1 Scope and contribution

From previous work it can be concluded that accurate user throughput prediction can increase bandwidth utilization significantly [4]. This benefits content providers since it provides a better user experience to their customers. However, most protocols in use rely on historical throughput measurements and hence are making their decisions reactively. Moreover, they put equal emphasis on both over- and underestimation, failing to reflect the different consequences they cause.

The aim of this project is to explore if deep learning approaches can be applied to downlink user throughput prediction successfully. More specifically, the aim is to create and compare different deep learning frameworks for probability-guaranteed user throughput prediction. Apart from prediction accuracy, model performance is also measured in terms of how well it generalizes to different cells in the network. This evaluation will require datasets from different cells. To ensure the size and quality of these datasets, a statistical model that can generate more cell data is proposed. The contributions of this study are

• A deep learning framework for accurate downlink user throughput prediction that can easily be deployed and trained on different cells.

• A novel loss function that captures the needs of content providers and the different effects of overestimation and underestimation.

• A statistical model of the cell loads that can be used to generate data and thereby mitigate the risk of data shortage.

This project is done in collaboration with Intel, which is the host company of this master thesis. The data used is provided by one of Intel's customers, the South Korean company SK Telecom, and it is collected during the first half of 2020. More specifically, the data is collected partly from SKT's base stations and partly from mobile phones owned by Intel. Although the created deep learning frameworks will be trained and tested on different cells in the mobile network, the dataset used is limited to the geographical region of Seoul. Thus, the results are bounded to this region. Furthermore, this study is limited to UEs that stay within one given cell. This means that each cell has its own model, which is trained on cell-specific data. To obtain a full model of the downlink user throughput, models of how UEs are moving between different cells must be added.

When working with user data it is important to take privacy concerns into account. Data collected from UEs falls within the category of user data and can contain sensitive information such as user location, user IP etc. The privacy concerns for this thesis are minor since the UEs used for data collection belong to Intel itself. However, it is important to consider the privacy issues when selecting the feature set for the prediction models. The feature set must be kept at a level acceptable to users in order for them to share their data in a future application scenario.

1.2 Outline

The rest of the thesis is structured as follows. In the following chapter the theoretical background, related work and the details of the prediction problem are presented. Chapter 3 gives a detailed description of the methodology used. In chapter 4 the experimental results are provided and in chapter 5 they are carefully and extensively discussed. Chapter 5 also points out directions for future studies. Lastly, chapter 6 highlights the conclusions of this work.


Chapter 2

Background

2.1 Machine learning

Machine Learning is a branch of artificial intelligence. It is built on the idea that machines can learn from data by themselves with minimal human intervention. There exist various branches and modelling techniques within the field of machine learning. The model categories encountered in this work are Random Forests (RFs), Support Vector Regression (SVR) models and deep learning models.

Machine learning is commonly divided into supervised learning, unsupervised learning and reinforcement learning. Supervised learning makes use of labelled data to find function approximations from the input data to the target. It is called supervised since the labels of the dataset provide the correct output to each of its inputs. In other words, the computer needs a teacher or a supervisor that tells it the correct answer to learn. On the contrary, unsupervised learning methods use unlabelled data and let the models find patterns in it on their own. These models are commonly used for exploratory data analysis. In reinforcement learning, agents learn from interactions with their environment. By successively exploring the state space, the agent learns the behaviour that optimizes some predefined reward function. RFs and SVRs are examples of supervised learning techniques. Deep learning is present in all three of the machine learning categories.

An RF is an ensemble machine learning technique consisting of multiple independently trained decision trees. Ensembles make predictions by combining the individual predictions of each independent model. For regression tasks the final prediction is commonly given by an average and for classification tasks the final prediction is decided by voting. Figure 2.1 shows an example of an RF. Previous studies show that ensembles typically achieve better results than their individual components [6]. Like ensemble methods in general, RFs are produced by an overproduce-and-choose strategy. More specifically, this means that a large number of decision trees are produced in the overproduction phase and a subset of them is then selected to constitute the ensemble in the choice phase. In the overproduction phase RFs make use of bagging and boosting. Bagging helps reduce the variance since it makes use of bootstrap sampling techniques to create variations of the training dataset. Boosting reduces the ensemble bias since it randomizes the selection of independent features of individual models [6]. In the choice phase, a selection is made to create the ensemble. The selection is based on some criteria function such as the Mean Squared Error (MSE) loss.

Figure 2.1: A visualization of an RF.

SVR can be seen as the regression version of the classification model Support Vector Machine (SVM). SVMs use kernel functions to map the data from the input space to a higher dimensional feature space and then try to find hyperplanes that separate the feature space into the data classes with as wide a margin as possible. The margin is given by the distance from the hyperplane to the boundary lines, which are two parallel lines, one on either side of the hyperplane. The SVR also makes use of these kernel mappings and hyperplane separation. However, it has a different objective when fitting the hyperplane to the feature space compared to both the SVM classifier and other regression techniques. Unlike other common regression models, which try to minimize the error between the true and predicted values, SVRs define an acceptable error threshold in the form of a hyperparameter. This error threshold is then set to the width of the margin around the hyperplane. Instead of finding a hyperplane with as wide margins as possible like SVMs do, SVRs look for hyperplanes that can fit as many data points as possible within the margins. In other words, SVRs try to find a hyperplane that lets as many data points as possible be covered by the area created by the error threshold. The hyperplane is then used to make predictions on unseen data. A visualization of an SVR is shown in Figure 2.2. SVRs typically use Radial Basis Function (RBF) kernels for the mapping to a higher dimensional space [7].
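As a concrete reference for the two baseline families described above, the sketch below fits an RF regressor and an RBF-kernel SVR with scikit-learn. The synthetic data and the hyperparameters are placeholders, not the configurations used in the thesis.

```python
# Minimal sketch of the two baseline model families, using scikit-learn.
# Data and hyperparameters are placeholders, not those used in the thesis.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(500, 3)                      # e.g. SNR, power and cell load
y = 40 * (1 - X[:, 2]) + rng.randn(500)   # synthetic bandwidth target

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
svr = SVR(kernel="rbf", epsilon=0.5).fit(X, y)  # epsilon = the acceptable error threshold

print(rf.predict(X[:3]), svr.predict(X[:3]))
```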


Figure 2.2: Visualization of an SVR.

2.2 Deep learning

Deep learning is a class of machine learning and uses multiple-layer architectures to extract features on different abstraction levels from the raw data. These layers are usually organized in the form of a neural network. A basic Deep Neural Network (DNN) is a composition of several layers built up of weights. Each layer takes the output of its previous layer as input and typically applies some non-linear activation function to compute its own output. At the final layer this results in the prediction of the network, and this process is known as feed-forward propagation. During the training phase the prediction loss of the model is calculated via some pre-defined loss function. The loss is then back-propagated through the network via gradient descent to update the weights of each layer. This cycle of forward and backward propagation is repeated, reducing the loss until convergence. Models with this basic architecture are known as Feedforward Neural Networks (FNNs) or Multilayer Perceptrons (MLPs) and have previously been used to model data both within networking and other application areas [8] [9] [10]. A visualization of an FNN is shown in Figure 2.3. The red nodes are the input layer, the grey nodes represent two hidden layers and the blue nodes are the output layer.

Figure 2.3: An example of an FNN with two hidden layers.

FNNs can be used to solve both classification and regression tasks. For classification, the number of nodes in the output layer of the network should equal the number of classes in the dataset, with each output node representing one class. Typically, a softmax activation function is applied on the final layer. The class corresponding to the output node with the highest value is chosen as the prediction. For regression problems the output layer should consist of a single node. The node typically uses no activation function and its final value is the prediction of the network.
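A minimal Keras FNN for a regression task of this kind could look as follows. The layer sizes and optimizer are illustrative assumptions, not the architectures evaluated later in the thesis.

```python
# A minimal Keras FNN regressor with two hidden layers and a single linear
# output node, assuming three input features. Sizes are illustrative only.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(32, activation="relu", input_shape=(3,)),  # hidden layer 1
    Dense(32, activation="relu"),                    # hidden layer 2
    Dense(1),                                        # single output node, no activation
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))
```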

However, FNNs treat every input data point independently and take no account of the order in which the points are fed into the model. Hence, they are unable to capture temporal dependencies. To address this issue, the Recurrent Neural Network (RNN) was introduced. RNNs are FNNs with additional edges spanning over adjacent time steps. At a given time t the RNN receives both the current data point and the hidden states of previous data points. This means that the output at the next time step depends not only on the current input data but also on previous output data. Thus, the RNN is provided with the notion of time. RNNs are hard to train since they experience vanishing or exploding gradients that occur when errors are backpropagated over multiple time steps. This issue gets bigger when modelling longer time dependencies. Thus, standard RNNs fail to capture long term memory. To address this problem the Long Short Term Memory (LSTM) network was introduced [11].

The LSTM model is a special instance of RNNs with the nodes of its hidden layers replaced by memory cells. Each memory cell is built up by nodes and gates connected in a certain way, see Figure 2.4. Gates are distinctively used in LSTM networks and are sigmoid activations applied to the bypassing flow. They are denoted with σ in Figure 2.4. If a value of the gate is zero, the flow is cut off, while if a value is one all flow is passed on. The LSTM cell takes a cell state vector c_t, a hidden state h_t and a sample x_t of the dataset as inputs and outputs the cell state vector c_{t+1} and the hidden vector h_{t+1} of the next time step. The vectors c_{t+1} and h_{t+1} are then fed into the next LSTM unit. Typically, the vectors are initialized as c_0 = 0 and h_0 = 0. The building blocks of the LSTM memory cell are presented in the list below.

Figure 2.4: The LSTM cell.

• The input node takes the current input data and previous hidden states into the memory cell and runs the weighted sum of the inputs through an activation function, typically a tanh activation.

• The input gate decides how much flow from the input node to let through.

• The internal state has linear activation and contains a self-connected recurrent edge of fixed weight 1 to ensure that the gradient can pass through many time steps without exploding or vanishing.

• The forget gate provides the network with the ability to flush the content of the internal state.

• The output gate decides how much of the flow from the internal state to let through and applies some activation function to the flow to produce the final output of the cell.

The LSTM network has become one of the most successful and widely used RNN models and it has achieved state of the art results in many sequential modelling tasks [11].
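For comparison with the FNN sketch above, a minimal Keras LSTM regressor operating on short input sequences could be set up as below. The lag of two and the layer size are illustrative assumptions, not the exact configuration used in the thesis.

```python
# A minimal Keras LSTM regressor over a lag window of two time steps with
# three features per step. Sizes are illustrative only.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lag, n_features = 2, 3
model = Sequential([
    LSTM(32, input_shape=(lag, n_features)),  # memory cells over the lag window
    Dense(1),                                 # bandwidth prediction
])
model.compile(optimizer="adam", loss="mse")
# Training data must be shaped as (n_samples, lag, n_features).
```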

2.3 Cellular networks

Cellular networks are the foundation of mobile communication [12]. They consist of multiple transmitters. The coverage areas of the transmitters divide the surrounding area into cells. Each cell is assigned a range of frequencies and served by its own antenna. The antennas are organized in base stations and hence each cell belongs to a base station. Neighbouring cells are allocated to different frequencies to avoid interference. However, cells located sufficiently far away can be assigned the same frequency domain. Ideally, all adjacent base station antennas should be equidistant to a given base station antenna since it simplifies the task of deciding when and where to move the connection of a UE. In theory this can be achieved by shaping the cells as hexagons. However, in practice perfect hexagonal cells are not used due to environmental and hardware limitations.

Each cell in a cellular network has its own characteristics. These are mainly determined by the following factors.

• The carrier frequency of the cell, which determines how well signals can pass through obstacles.

• The width of the frequency band where a wider band allows for more bandwidth.


• The generation (3G, 4G, 5G etc.) implemented on the cell and the standard (Long Term Evolution (LTE), LTE Advanced, New Radio (NR) etc.) implemented on the cell.

• The transmit power of the base station, which determines the coverage of the cell. Typically rural areas have large cells with wider coverage while cities have smaller cells with less coverage.

• The cell location, which has impact on the radio environment and the degree of interference.

• The backhaul, that is the capacity of the backbone network connecting the cell to the core network and onwards to the Internet.

The large number of influential factors makes cells more or less unique. At a given time a cell can serve zero, one or more mobile users. The control unit of a base station handles the communication between UEs and that base station. In turn the base stations communicate with each other. With this infrastructure, UEs on cellular networks can communicate with each other.

2.4 Downlink user throughput

In this project the user throughput is defined as the bits per second a UE on the cellular network can receive. This concept is also commonly called the available user throughput or the bandwidth. The base station allocates proportions of the available cell throughput to the UEs within that cell. The more throughput a UE is allocated, the more data it can send or receive per time step.

The available downlink user throughput is mainly affected by [12] [13]

• the characteristics of the cell the UE is connected to.

• the UE position relative to the base station and the environment in which both of them are located.

• the number of other UEs connected to the same cell and their activities on the network.

• the encoding scheme of the transmitted data such as quadrature amplitude modulation.

Unfortunately, many of the above factors are not commonly accessible. This makes the task of predicting the downlink user throughput challenging, and it is an open research question.

2.5 User throughput prediction

User throughput prediction can be divided into three groups, namely formula-based methods, history-based methods and machine learning methods [3]. The formula-based methods rely on knowledge and assumptions of the behaviour of the underlying protocol. However, matters are complicated by protocol differences and protocol updates. History-based approaches make throughput predictions based on past measurements and cover methods such as the autoregressive integrated moving average (ARIMA). Machine learning approaches include models such as RF, SVR or DNN. These machine learning models constitute the current state of the art. The categories are not mutually exclusive. For example, there are also hybrid models that combine the history-based and machine learning approaches. In the remainder of this section some formula-based methods for user throughput prediction are presented and discussed. Next, various history and machine learning based approaches are presented in the related work section. A more exhaustive overview of how machine learning has been applied to networking in general can be found in [14] and a detailed summary of how deep learning in particular is applied to the networking field can be found in [15].

An estimate of the user throughput can be inferred from radio performance statistics available in commercial LTE base station products [13]. Under the assumption that fair sharing is applied, the instantaneous throughput equals the maximum achievable throughput with a single user connected to the cell divided by the number of active users. Thus, the average user throughput T_UE is given by

T_UE = E[T / X]

where the random variable X denotes the number of active users and the random variable T denotes the maximum achievable throughput when a single user is connected to the cell. In turn, the maximum throughput T is dependent on many factors such as the UE location, interference from other cells, and the number of transmit antennas. However, it is reasonable to assume that T is independent of the number of active users in the cell [13], which yields

T_UE = E[T] E[1/X].

The factor E[1/X] is usually not an available radio counter for operators, which in turn means that T_UE cannot be calculated directly. Instead, one can use the average number of active UEs E[X] to get an estimate T_Sch of T_UE as

T_Sch = E[T] / E[X]

which is known as the scheduled throughput [13].
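The gap between the scheduled throughput E[T]/E[X] and the average user throughput E[T]E[1/X] can be illustrated numerically. The snippet below uses made-up distributions for T and X; by Jensen's inequality the scheduled throughput underestimates the average user throughput.

```python
# Numerical illustration (with made-up distributions) of the gap between the
# scheduled throughput E[T]/E[X] and the average user throughput E[T]*E[1/X].
import numpy as np

rng = np.random.RandomState(1)
T = rng.uniform(20, 60, size=100_000)    # max achievable throughput, Mbps
X = rng.randint(1, 11, size=100_000)     # number of active users, 1..10

t_ue = T.mean() * (1.0 / X).mean()       # E[T] * E[1/X]
t_sch = T.mean() / X.mean()              # E[T] / E[X], the scheduled throughput
print(f"T_UE  = {t_ue:.2f} Mbps")
print(f"T_Sch = {t_sch:.2f} Mbps")       # smaller than T_UE by Jensen's inequality
```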

Another popular formula for calculating user throughput is based on the distance d from the user to the transmitter and the number N of other users within the coverage area of the transmitter [16]. This formula is given by

T_UE = g_T(γ, N) = T0 η_γ / N

where T0 is a parameter specific to the cellular system, γ = 10 log10(ξ) is the signal to interference and noise ratio expressed in dB, and η_γ is the achievable rate for that signal to interference and noise ratio. In turn, ξ is given by

ξ = ξ0 r / d^α

where α is the path-loss exponent, ξ0 is a technology specific parameter and r is the fast fading gain. Thus, calculating the average user throughput with this approach is also associated with the problem of the expected value E[1/N] being unavailable.

The achievable user throughput over a cellular network depends on all the components involved in the transmission process. These components are the UE, the radio link, the cell, the core network and the server the UE is talking to. It is commonly claimed that the bottleneck is located at the wireless link [2]. Under this assumption it is enough to predict the achievable throughput over the wireless link to get a prediction of the end to end throughput. With 5G approaching it is likely that the bottleneck will move away from the air interface. However, predicting the achievable downlink throughput over the wireless link will still be interesting since it can be used as a component in the end to end throughput prediction problem [2].

2.6 Related work

There exist various attempts to predict downlink user throughput with machine learning approaches. To meet the requirements of light-weight models that make fast predictions, previous studies have made use of RFs. Samba et al. examine user throughput prediction features that they group into four families [2]. The feature families are UE categories and cell features such as frequency bands on the UE, physical layer features such as received power, context information such as UE distance to the cell, and RAN measurements such as average cell throughput. The authors conduct correlation analysis between the features and the throughput for feature selection. Then they use the selected features in an RF model to solve the prediction task. They show that the two feature families that increased prediction accuracy the most are physical layer features and RAN features and that the two feature families are complementary in the sense that using both of them achieves the highest prediction accuracy. Yue et al. use a similar approach for cellular link throughput prediction [17]. They identify five types of low layer features that are correlated with cellular throughput and use them together with upper layer features in an RF model. Their results show that even though historical data is the most important feature, combining it with lower layer information increases prediction performance.

Kousias et al. conduct a study that compares different user throughput prediction models [18]. They compare the performance of RFs with the performance of SVR and multiple linear regression. They start off from over 76 features and use interpolation to adjust for inconsistencies in time granularities in the data. Next, they do feature extraction to find the minimal set of relevant features. They find that uplink features never improve downlink throughput prediction and should therefore never be included. They also show that there exists a trade-off between accuracy and over-the-air data consumption. The authors claim that current throughput prediction methods make use of a significant amount of TCP traffic for given time durations. Although this yields reliable results, the method is inefficient for mobile subscriptions with limited data plans. Their results show that a 39.7% reduction in data consumption corresponds to a median absolute percentage error of 5.55% and that a reduction of 95.15% in data consumption corresponds to a median absolute percentage error of 20%. The authors point out deep learning as a direction of future research since they think that deep learning models might increase prediction accuracy. They share this expectation with other researchers. Wei et al. use a Hidden Markov Model (HMM) in combination with a Gaussian mixture model to predict throughput [19]. They use the mixture model for clustering to find the states of the HMM and achieve better results than traditional regression methods. They also point to deep learning models as the direction of future studies. Zhang et al. conduct an extensive survey of how deep learning has been applied to mobile and wireless networks [15]. The authors point out that deep neural networks have achieved outstanding results in fields such as computer vision and natural language processing. They summarize previous work regarding throughput prediction and highlight the potential of deep learning in relation to network related prediction problems.

However, previous attempts to model downlink user throughput with deep neural networks are sparse. Rehman et al. use correlation analysis to find a set of relevant features for conducting user throughput prediction and cell throughput prediction respectively [9]. They settle for 13 features and implement a deep neural network with 40 layers and 13 nodes in each layer. Wei et al. use historical data in combination with sensor data such as cell ID, time, location and Received Signal Strength Indicator (RSSI) to do throughput predictions for moving users [20]. They divide their model into a user movement pattern identification part and a throughput prediction part. The prediction is made with an LSTM. In another of their studies, the authors compare several models for user throughput prediction in the context of adaptive video streaming [21]. They establish a trace-based emulation environment to be able to evaluate model performance quantitatively under the same artificial conditions. Their results indicate that the LSTM is the best performing model.

Schmid et al. introduce location independent models, one based on a regular DNN and one based on an LSTM network [10]. To provide evidence of the location independence, the training set and test set are collected from different geographical regions. They also conduct careful correlation analysis, both between network variables and along the time axis, to find relevant features. The authors perform careful hyperparameter analysis to increase network performance. They find that for the root mean square loss the LSTM performed better than the DNN.

User throughput can be seen as an indicator of the user link quality. Sun et al. propose an alternative link quality metric, namely the Packet Reception Ratio (PRR) [22]. PRR represents the number of received packets over the number of transmitted packets. However, the authors argue that Signal to Noise Ratio (SNR) is a more stable metric than PRR and claim that PRR can be directly inferred from SNR. Thus, they propose a link quality estimation algorithm consisting of a wavelet neural network that estimates SNR. They decompose the SNR into a time-varying non-linear part and a non-stationary random part. The two parts are separately processed before they are fed into the wavelet neural network for probability-guaranteed SNR prediction. The estimation of PRR is then obtained from the SNR prediction.

2.7 Neighbouring fields

Due to the sparsity of previous studies on user throughput prediction, it is interesting to widen the literature review to also include neighbouring fields with similar challenges. One such field is network traffic prediction. Like the user throughput, the network traffic experiences both complex spatial and temporal dependencies. Within this field traditional ML methods such as RF and SVR are no longer the dominating approach. Instead, previous work has utilized various kinds of deep learning models such as DNNs, Convolutional Neural Networks (CNNs), LSTMs and autoencoders.

Nie et al. combine a deep belief network with a Gaussian model to predict network traffic [23]. The authors use the discrete wavelet transform to decompose the traffic into one low-pass component that represents the long-term time dependencies and one high-pass component that represents the short-term irregular fluctuations. They then model the low-pass component with the deep belief network and the high-pass component with the Gaussian model. The Gaussian parameters are learned by maximum likelihood estimation. By combining predictions from both separate models, the authors obtain the prediction of the network traffic. Similar to Nie et al., many authors propose hybrid models that utilize more than one modelling technique. However, many authors choose a different decomposition of the network traffic, namely the decomposition into one temporal component and one spatial component.

One common and effective way of modelling temporal data is through the RNN. However, due to their problems with exploding or vanishing gradients, the LSTM has become a more popular choice among researchers for modelling the temporal component of network traffic [15]. Alawe et al. compare the performance of an LSTM model implemented for control plane cellular network load prediction with a basic DNN [24]. They make the assumption that 10% of the load constitutes control plane traffic and hence only use 10% of each load data point. The authors turn the problem into a classification task by dividing the load into ten intervals. Their models achieve 80% and 90% accuracy for the DNN and LSTM respectively.

Huang et al. compare three deep learning models, a 3D-CNN, an LSTM and a combined CNN-LSTM model, applied to mobile traffic forecasting [25]. More specifically, the authors try to predict maximum, minimum and average traffic load. The authors use the neural networks as feature extractors and pass the found features to a multitask regression model for the final prediction. They argue that joint training with a multitask approach benefits overall prediction performance. CNNs with their grid structures capture spatial dependencies well. However, 2D-CNNs alone lack the ability to take temporal dependencies into account. On the contrary, a 3D-CNN can capture the time dependency along its additional dimension and it has been successful in video analysis [25]. Similarly, a CNN-LSTM model captures both temporal and spatial features. The authors further compare their deep learning approaches with traditional methods such as ARIMA. Their results show that the combined CNN-LSTM model performs best for all prediction tasks with 70-80% test accuracy.

Wang et al. have another approach for using a deep learning hybrid model to capture temporal and spatial dependencies in traffic load prediction [26]. The authors conduct a correlation analysis in both the temporal and spatial domain to show that the correlation is non-zero. Next, they divide the geographical region at which their data is collected into a grid structure and map each base station to one unique square of the grid. The authors use autoencoders to model spatial dependencies and do feature extraction. The found features are then fed into an LSTM model which makes the final traffic load prediction. They compare their approach with results produced by the traditional methods ARIMA and SVR and show that their model performs significantly better. However, their approach requires preprocessing of the data fed into their neural networks by mapping each cell into a square of the geographical grid. This limits its generality and introduces additional errors. To address these limitations and to also take discrete factors such as the day of the week into account, Feng et al. propose an end-to-end framework that they call DTP for traffic prediction [27]. The DTP model consists of two parts, a general feature extraction part and a sequential modelling part. The feature extractor in turn consists of two modules, one spatial correlation extractor and one discrete embedding module. For the sequential modelling an LSTM network is used. Their results show that DTP outperforms ARIMA by more than 40% and basic RNN models by more than 12%.

The benefits of deep learning models are coupled with an increase in computational cost. He et al. study machine learning and statistical methods for encrypted user traffic prediction. The authors evaluate two classes of models, namely ARIMA models and LSTM models, for online payload prediction of flows and aggregates. The traffic parameters were limited to those that can be extracted from encrypted traffic. The authors look at both video and non-video traffic. They find that LSTM networks achieve very good results at the cost of significantly heavier training computations [28].

Yet another way of decomposing the traffic prediction problem is found in [29]. There, He et al. propose a meta-learning scheme for user level traffic prediction over short time horizons. The authors combine a set of specialized predictors with a master policy for choosing among the predictors. Each specialized predictor is optimized towards a certain kind of traffic prediction. They use deep reinforcement learning and evaluate their model on both video and non-video traffic traces. They find that their meta-learning scheme outperforms other state of the art methods [29].

To summarize, deep learning models are the current state of the art for both downlink user throughput prediction and the neighbouring field of network traffic prediction. While the throughput prediction field mainly utilizes simpler FNNs and LSTMs for the prediction task, the network traffic prediction field has achieved state of the art results with more complex and modern model architectures. Previous studies mainly rely on correlation analysis for feature extraction. Compared to more traditional deep learning fields such as image classification, the number of features used is small. Since user throughput is a continuous variable, most previous studies have used the deep learning frameworks for regression rather than classification.

2.8 The prediction problem

Researchers within the field of user throughput prediction point out deep learning as the future research direction, and previous studies in adjacent fields have already achieved state of the art results with deep learning frameworks. These factors motivate the focus of this project, which is to apply deep learning to user throughput prediction. After reviewing related work, the conclusion was drawn that FNNs and LSTMs seem most promising and hence those models with different architectures and loss functions will be used. With historical and current radio link information collected from cells and UEs, the aim is to predict the downlink user throughput of the UEs. To train and evaluate the models they need to be provided with the true throughput values along with the input data. Thus, a labelled dataset is needed and the models belong to the category of supervised learning. To reflect that underestimation of the throughput is preferred over overestimation, a novel loss function for model training is proposed. This loss is a modified version of the widely adopted MSE loss and the modification can easily be generalized to work for other conventional loss functions such as the Mean Absolute Error (MAE) loss. Furthermore, a model evaluation metric that captures the needs of content providers is proposed.
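As an illustration of how such an asymmetric loss can be implemented, the sketch below weights squared errors more heavily when the model overestimates the bandwidth. The penalty factor alpha and the exact form are assumptions made here for illustration; the loss actually used is defined in Chapter 3.

```python
# Sketch of an asymmetric MSE loss for Keras that penalizes overestimation
# (prediction above the true bandwidth) more than underestimation. The factor
# `alpha` and this exact form are illustrative assumptions, not necessarily
# the loss function defined in Chapter 3.
import tensorflow as tf

def asymmetric_mse(alpha=2.0):
    """Squared error, scaled by `alpha` whenever the prediction is too high."""
    def loss(y_true, y_pred):
        err = y_pred - y_true
        weight = tf.where(err > 0.0, alpha * tf.ones_like(err), tf.ones_like(err))
        return tf.reduce_mean(weight * tf.square(err))
    return loss

# model.compile(optimizer="adam", loss=asymmetric_mse(alpha=2.0))
```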


Chapter 3

Method

A summary of the method is as follows. The dataset is formed by mapping cell collected data to UE collected data through the date and time of the collection. Correlation analysis is used for feature selection. The dataset is preprocessed, normalized and split into training and validation sets. It is then used to train the baselines (an SVR and an RF) and the deep learning models (an FNN and an LSTM). The code implementation is made in Python 3.6 and the models are implemented with Keras and Scikit-learn.

3.1 Dataset

The dataset is collected in collaboration with Intel and their customer SK Telecom, also simply known as SKT, which is the main wireless telecommunications operator in South Korea. The data is gathered both from cells in SKT's cellular networks and from UEs connected to given cells in the same network. All cells are located in the geographical region of Seoul. The UEs stay fixed at a given location and are connected to one of the cells while making measurements. In each cell the average uplink load and downlink load are recorded every ten seconds. These values are stored together with a cell ID and the date and time of the measurement. However, due to technical issues the dataset contains irregular holes of various sizes where no measurements are recorded. Every minute the UEs run their measurements by downloading 10 Mb of data. The throughput of this download is logged and serves as an estimate for the downlink bandwidth. In addition, the UEs also record the cell ID, UE ID, the signal power at the beginning and the end of the measurement, the SNR at the beginning and end of the measurement, the date and time of the measurement start and the latency. All the collected variables, their notations and descriptions are summarized in Table 3.1. One UE measurement takes approximately one second. 25% of the data is held out to use as a validation set and the rest of the data is used for training.


Variable   Description
bw         The downlink bandwidth
power1     The signal strength before the bandwidth measurement
power2     The signal strength after the bandwidth measurement
snr1       The signal to noise ratio before the bandwidth measurement
snr2       The signal to noise ratio after the bandwidth measurement
latency    The latency of the download
datetime   The date and time at the start of the bandwidth measurement
dl         The downlink cell load

Table 3.1: The notation and description of each variable in the collected dataset.

Since cells differ in their characteristics, the same cell load represents different cell bandwidths in different cells. Thus, each cell needs its own corresponding dataset. To obtain these, the first step is to group the UE data by cell IDs. Next, the downlink cell load of a given cell is mapped through date and time to its corresponding entry in the dataset labelled with its cell ID. The uplink load is dropped in accordance with previous studies such as [18]. This mapping is made for each cell at which data is collected. To adjust for the extra load the UE measurements are causing, the assumption is made that the cell load during the measurement is the same as the cell load right before the measurement starts. Due to the holes in the cell datasets such a mapping does not always exist. If a matching cell load is lacking, the UE entry is simply dropped. This decreased the size of all the datasets to approximately half their original sizes.
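The mapping described above could be expressed with pandas roughly as in the sketch below. The file names are hypothetical and the column names follow Table 3.1; the exact join logic of the thesis code may differ.

```python
# Sketch of the UE-to-cell mapping: group UE measurements by cell ID and attach
# the most recent cell load recorded before each measurement start, dropping
# entries that fall into holes in the cell data. File names are hypothetical.
import pandas as pd

ue = pd.read_csv("ue_measurements.csv", parse_dates=["datetime"])
cell = pd.read_csv("cell_loads.csv", parse_dates=["datetime"])

parts = []
for cell_id, ue_group in ue.groupby("cell_id"):
    loads = cell[cell["cell_id"] == cell_id].sort_values("datetime")
    part = pd.merge_asof(
        ue_group.sort_values("datetime"), loads[["datetime", "dl"]],
        on="datetime", direction="backward",
        tolerance=pd.Timedelta("10s"),   # no match within 10 s counts as a hole
    )
    parts.append(part.dropna(subset=["dl"]))

dataset = pd.concat(parts, ignore_index=True)
```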

The measurements were made on four di↵erent cells located in the same office area and on one UE in form of a Samsung S9 cellphone. After combining the cell and UE data two out of the four datasets were deemed to be of sufficient size for further analysis. They were approximately of size 1,400 and 1,200 data points respectively

3.2 Correlation analysis

To determine the set of independent features, correlation analysis is used. More specifically, correlation matrices for each of the two datasets are computed to find a suitable set of independent variables for the bandwidth prediction. Furthermore, the autocorrelation of the bandwidth itself is computed to evaluate its temporal dependency and the optimal lag. The computed correlation matrix for the bigger dataset is shown in Table 3.2. The other correlation matrix follows a similar pattern but with slightly lower overall correlation between the target and the independent variables. Furthermore, the correlation between bandwidth and cell load for the bigger dataset is visualized in a scatter plot in Figure 3.1. This correlation is particularly interesting since the cell load is the only variable retrieved from another dataset and requires additional computing power to be included as an independent feature.


          bw      power1  power2  SNR1    SNR2    latency  dl
bw        1.00    0.07    0.07    0.46    0.47    -0.10    -0.50
power1    0.07    1.00    0.94    0.19    0.19    -0.06    -0.09
power2    0.07    0.94    1.00    0.19    0.19    -0.06    -0.09
SNR1      0.46    0.19    0.19    1.00    0.99    -0.16    -0.80
SNR2      0.47    0.19    0.19    0.99    1.00    -0.17    -0.80
latency   -0.10   -0.06   -0.06   -0.16   -0.17   1.00     0.14
dl        -0.50   -0.09   -0.09   -0.80   -0.80   0.14     1.00

Table 3.2: The correlation matrix computed for the largest of the cell-specific datasets. The variables from left to right are bandwidth, power at the beginning of the UE measurement, power at the end of the UE measurement, SNR at the beginning of the UE measurement, SNR at the end of the UE measurement, latency, and downlink cell load.

From Table 3.2 it can be determined that none of the radio link features are strongly correlated with the bandwidth. The correlation matrix shows that the bandwidth is moderately correlated with the downlink cell load and with the SNR. However, it also shows that the downlink cell load and the SNR experience a moderate correlation with each other, which reduces the explanatory effects each single variable has on the bandwidth. This is likely due to the fixed position of the UE. When the UE stays fixed, changes in SNR are likely caused by changes in the number of active UEs on the cell, which in turn has a direct effect on the downlink load. If the UE were to move within the cell, the SNR and downlink load would most likely be less correlated. This would increase the explanatory power of the independent variables. Furthermore, the matrix shows a small correlation between bandwidth and latency and between bandwidth and power. Based on the correlation analysis it can be concluded that downlink cell load and SNR should serve as independent variables in the bandwidth prediction models while latency and power should not be included. However, the formula-based throughput prediction models suggest that power has an impact on the user throughput for moving UEs. Since the aim is to produce a model that can be generalized to moving UEs within a given cell, and since deep learning models are relatively insensitive to redundant features, the decision is made to include power as a feature even though the correlation is low. Thus, the independent features used were SNR, power and downlink cell load.

In Table 3.2 it can further be noticed that SNR1 and SNR2 are almost perfectly correlated with each other. This is expected since the measurements are taken in succession with only a few seconds in between on a UE at a fixed location. Thus, the SNR is likely to stay constant or change only slightly. Adding both SNR1 and SNR2 will therefore provide little or no additional information, which led to the decision to only include one of the measurements or an average of them in the set of independent variables. The same reasoning is applicable to power1 and power2.


Figure 3.1: Scatter plot of bandwidth towards cell load.

Next, the autocorrelation of the bandwidth for different lags is computed. The autocorrelation values with their corresponding lags are shown in Table 3.3. From the table it can be seen that a lag of two yields the highest autocorrelation. Thus, the decision is made to create an LSTM model with lag equal to two.

lag              1     2     3     4
autocorrelation  0.36  0.39  0.36  0.37

Table 3.3: The autocorrelation values of the bandwidth computed with different lags.
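Values like those in Tables 3.2 and 3.3 can be computed directly with pandas, as sketched below for a merged, cell-specific dataframe `dataset` with the columns of Table 3.1.

```python
# Sketch of how the correlation matrix and the bandwidth autocorrelation can be
# computed with pandas; `dataset` is assumed to hold one merged cell-specific
# dataset with the columns listed in Table 3.1.
corr_matrix = dataset[["bw", "power1", "power2", "snr1", "snr2", "latency", "dl"]].corr()
print(corr_matrix.round(2))

for lag in range(1, 5):
    print(lag, round(dataset["bw"].autocorr(lag=lag), 2))
```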

3.3 Data preprocessing

The data is preprocessed to make it compatible with the neural networks and to improve network performance. In the mapping function between UE and cell-specific data, an option is added to map not only the most recent cell load to the UE entry but the n most recent loads in n separate columns. To evaluate how the time granularity of the cell load measurements affects prediction performance, another option allows aggregation of cell load data points: instead of mapping one cell load entry to one UE measurement, an average of the n closest cell loads is mapped to the UE measurement. This is interesting because, if coarser time granularities do not decrease prediction performance, the efficiency of both data gathering and data analysis can be increased. However, repeated experiments show that mapping a single cell load at the original time granularity of ten seconds yields the best results, and hence these settings were used.
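The following is a minimal sketch of such a mapping, assuming both datasets are pandas DataFrames with a timestamp column; all file, column and function names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: one row per UE measurement / per ten-second cell load sample.
ue = pd.read_csv("ue_dataset.csv", parse_dates=["timestamp"]).sort_values("timestamp")
cell = pd.read_csv("cell_load.csv", parse_dates=["timestamp"]).sort_values("timestamp")

def map_cell_load(ue, cell, n=1):
    """Attach the average of the n most recent cell load samples to each UE entry."""
    mapped = ue.copy()
    loads = []
    for t in mapped["timestamp"]:
        recent = cell.loc[cell["timestamp"] <= t, "dl"].tail(n)
        loads.append(recent.mean())
    mapped["dl"] = loads
    return mapped

# n = 1 at the original ten-second granularity gave the best results in the experiments.
ue_mapped = map_cell_load(ue, cell, n=1)
```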

As the next step, the outliers in the bandwidth are handled. Examining Figure 3.1 (a) it can be seen that the spread in the bandwidth is larger towards smaller cell loads. Moreover, it can be noted that bandwidths of 40 Mbps or more are rare. As discussed before, single peaks in the user throughput have no impact on the decision of the underlying protocol. In other words, there is no interest in being able to predict those single spikes. On the contrary, it is preferable to let those predictions stay on the same level as the bandwidths surrounding the peak. Thus, these outliers could potentially be removed from the dataset without affecting the performance of the prediction models. However, one should be careful when removing outliers since they can contain important information that would be lost. Repeated experiments show that this is indeed the case: when the outliers were removed, both the validation loss and the validation accuracy worsened significantly, and hence the decision is made not to remove any data points.

Finally, the features of the merged datasets are preprocessed one by one. The dates and times are simply dropped since the order of the entries is considered more interesting than the absolute date and time. Recall that the UE dataset contains two measurements each of power and SNR, one at the start of the UE measurement and one at the end of it. The two entries of each feature are merged into a single feature by taking their average. Furthermore, each feature column is normalized to the range zero to one in order to better fit the neural network models. The cell loads, which are measured in percent, can simply be divided by 100 to become normalized. The power, SNR and bandwidth are normalized by

$$\frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x_i \in x$ is an entry in the feature column to be normalized. To make the feature values centred around zero instead of one half they can simply be shifted down by 0.5. A summary of the chosen features and how they and the target variable were preprocessed can be found in Table 3.4.

Variable   Preprocessing procedure
bw         normalized
power      normalized mean of power1 and power2
snr        normalized mean of snr1 and snr2
dl         divided by 100 to go from percent to decimal

Table 3.4: A summary of the target variable and the set of chosen features together with descriptions of how each of them was preprocessed.
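A minimal sketch of this preprocessing, assuming the merged data sits in a pandas DataFrame with the (hypothetical) column names used above:

```python
import pandas as pd

def preprocess(df):
    """Apply the preprocessing of Table 3.4; column names are hypothetical."""
    out = pd.DataFrame()
    out["bw"] = df["bw"]
    out["power"] = (df["power1"] + df["power2"]) / 2   # mean of the two measurements
    out["snr"] = (df["SNR1"] + df["SNR2"]) / 2
    out["dl"] = df["dl"] / 100.0                       # percent -> decimal
    for col in ["bw", "power", "snr"]:                 # min-max normalization to [0, 1]
        out[col] = (out[col] - out[col].min()) / (out[col].max() - out[col].min())
    return out
```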


3.4 Modelling the cell load

Next, a statistical model of the cell load is created. The reason for this is two-fold. Firstly, such a model could be used for cell load interpolation to fill in the holes in the cell load dataset. This would mitigate the risk of data shortage. Secondly, such a model could enable the retrieval of more fine-grained cell load data. Recall that the cell loads provided by SKT are averaged over ten seconds. However, in practice the load changes every Transmission Time Interval (TTI). By creating a statistical model with samples drawn at a finer time granularity, such as per second, and then aggregating the data up to ten-second averages, one can obtain the probability distribution of cell loads at the finer time granularity.

The decision was made to build the statistical model for a time interval of approximately 15 minutes, which roughly corresponds to 100 data points from the cell load dataset. Initially, the assumption is made that during a given 15-minute time period the cell load approximately follows a normal distribution; the idea is that this assumption could be relaxed later on. Next, the cell load dataset is visualized in histograms to get a better understanding of it. This visualization is made both for the whole dataset and for randomly chosen subsets of size 100.

Figure 3.2: Histograms of the cell load dataset. (a) Histogram of the whole dataset. (b) Histogram of normalized sample datasets.

Figure 3.2 (a) shows the histogram of the whole cell load dataset. It shows that very high and very low cell loads are more common than moderate loads. This is expected since the cell is located in an office area. During office hours the load is high since people are working, and during the night it is low since workers have left the area. Only in the morning and late afternoon or evening, when many people are either arriving to or leaving the area, should the cell load be moderate. Figure 3.2 (b) shows the histogram of 10,000 subsets of the dataset, each consisting of 100 sequential data points. The subsets are randomly drawn from the cell load dataset and each subset is normalized with respect to its sample mean and sample standard deviation. The histogram indicates that within the given time interval the cell load follows a normal distribution. This suggests that our assumption holds and that the cell load over such time periods should be modelled with some normally distributed noise.
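A minimal sketch of how the normalized subsets behind Figure 3.2 (b) can be produced, assuming the ten-second cell loads are available as a one-dimensional array (the file name is hypothetical):

```python
import numpy as np

loads = np.loadtxt("cell_load.txt")        # hypothetical file with ten-second cell loads
rng = np.random.default_rng()

pooled = []
for _ in range(10_000):
    start = rng.integers(0, len(loads) - 100)
    subset = loads[start:start + 100]      # 100 sequential points, roughly 15 minutes
    pooled.append((subset - subset.mean()) / subset.std())

pooled = np.concatenate(pooled)            # values pooled into the histogram of Fig. 3.2 (b)
```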

Next, the statistical model of the cell load is created. As an initial naive approach a pure Gaussian noise model is created. In this model, each generated data point is independently drawn from a Gaussian distribution. The parameters of the Gaussian are determined by the data points at the beginning and end of the hole in the dataset. To relax the assumption of purely Gaussian cell loads an independence rate $\alpha$ is introduced. The idea is that the cell load at the current time step is partly dependent on the load at the previous time step and partly affected by the noise. More specifically, the formula for generating data to fill the $i$th hole is given by

$$\text{load}_t = (1 - \alpha)\,\text{load}_{t-1} + \alpha x_t$$

where $0 \le \alpha \le 1$ and

$$x_t \sim \mathcal{N}(\mu_i, \sigma_i).$$

This is a type of autoregressive model. Choosing an independence rate of $\alpha = 1$ reduces the model to the naive approach.

To determine the mean $\mu_i$ and standard deviation $\sigma_i$ of the Gaussian noise, 10,000 random subsets of 100 data points are drawn from each cell load dataset. Each subset consists of sequential data. For each subset, the sample mean and sample standard deviation are calculated. The sample mean versus sample standard deviation is then visualized in a scatter plot and a curve is fitted to the data points in order to find a relationship between the two Gaussian parameters. The curve represents the sample standard deviation as a function of the sample mean. When a hole needs to be filled, the parameter $\mu_i$ is set to the average of the two end data points of the hole. The fitted curve is then used to find $\sigma_i$. With this procedure, the Gaussian parameters are determined and the cell load data can be generated.
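A minimal sketch of the whole procedure, assuming the ten-second cell loads are available in a file and using a hypothetical independence rate of $\alpha = 0.3$; the second-degree polynomial for $\sigma_i$ as a function of $\mu_i$ follows the curve fitting reported in Section 4.1.

```python
import numpy as np

rng = np.random.default_rng()
loads = np.loadtxt("cell_load.txt")            # hypothetical file with ten-second cell loads

# Fit sigma as a second-degree polynomial of mu from 10,000 random 100-point subsets.
means, stds = [], []
for _ in range(10_000):
    start = rng.integers(0, len(loads) - 100)
    subset = loads[start:start + 100]
    means.append(subset.mean())
    stds.append(subset.std())
std_from_mean = np.poly1d(np.polyfit(means, stds, deg=2))

def fill_hole(load_before, load_after, length, alpha=0.3):
    """Generate `length` cell loads for a hole using the autoregressive model."""
    mu = (load_before + load_after) / 2         # mean from the hole's two end points
    sigma = max(float(std_from_mean(mu)), 0.0)  # fitted curve; clamp negative values
    prev, generated = load_before, []
    for _ in range(length):
        prev = (1 - alpha) * prev + alpha * rng.normal(mu, sigma)
        generated.append(prev)
    return generated
```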

3.5 Loss function and metrics

As discussed previously, in many application scenarios underestimation is preferred over overestimation. A common loss function used for regression problems is the MSE loss, which is given by

$$\text{loss}_{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(x_n^{\text{true}} - x_n^{\text{pred}}\right)^2$$

where $x_n^{\text{true}}$ and $x_n^{\text{pred}}$ are the true and predicted values respectively and $N$ is the size of the test dataset. Thus, the conventional MSE loss function assigns equal weight to overestimation and underestimation. To favour underestimation, a penalty factor is applied to the MSE loss each time overestimation occurs. This yields the modified loss function

$$\text{loss}^{*}_{MSE} = \frac{1}{N}\sum_{n=1}^{N} c^2\left(x_n^{\text{true}} - x_n^{\text{pred}}\right)^2, \qquad c = \begin{cases} 1 & \text{if } x_n^{\text{true}} - x_n^{\text{pred}} \ge 0 \\ p & \text{otherwise} \end{cases}$$

where $p > 1$ is the penalty factor. Comparing the two losses, it can be seen that the MSE loss with penalty reduces to the conventional MSE loss if $p = 1$. The larger $p$ gets, the more overestimation is penalized and the more underestimation is favoured.

Another popular loss function used in regression is the MAE loss. It is given by

$$\text{loss}_{MAE} = \frac{1}{N}\sum_{n=1}^{N}\left|x_n^{\text{true}} - x_n^{\text{pred}}\right|$$

where the variables are defined as for the MSE loss above. With the same reasoning, a modified version of the MAE is obtained as

$$\text{loss}^{*}_{MAE} = \frac{1}{N}\sum_{n=1}^{N} c\left|x_n^{\text{true}} - x_n^{\text{pred}}\right|, \qquad c = \begin{cases} 1 & \text{if } x_n^{\text{true}} - x_n^{\text{pred}} \ge 0 \\ p & \text{otherwise} \end{cases}$$

again with variables defined as for the MSE loss.
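A minimal sketch of these penalized losses as custom Keras/TensorFlow loss functions; the function names are hypothetical and $p$ is assumed to be a plain Python float.

```python
import tensorflow as tf

def mse_with_penalty(p):
    """Modified MSE: overestimation (prediction above the true value) is scaled by p**2."""
    def loss(y_true, y_pred):
        c = tf.where(y_true - y_pred >= 0.0, 1.0, p)
        return tf.reduce_mean(tf.square(c * (y_true - y_pred)))
    return loss

def mae_with_penalty(p):
    """Modified MAE: overestimation is scaled by p."""
    def loss(y_true, y_pred):
        c = tf.where(y_true - y_pred >= 0.0, 1.0, p)
        return tf.reduce_mean(c * tf.abs(y_true - y_pred))
    return loss

# Usage: model.compile(optimizer="rmsprop", loss=mse_with_penalty(2.0))
```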

This way of defining the loss has the disadvantage that the loss score can no longer be converted to the same unit as the target variable. When using the conventional MSE, the error in the same unit as the target can be obtained by simply taking the square root of the MSE loss, and when using the MAE no conversion at all is needed. However, the penalty factor in the modified loss functions complicates such conversions. To be able to compare different loss functions and different penalty factors, a model performance evaluation metric is defined. This metric is defined in terms of an accuracy score. A prediction is said to be accurate if and only if

$$0 \le x_n^{\text{true}} - x_n^{\text{pred}} \le K$$

where K is a ceiling determined by the use-case requirements. In words this means that a prediction is said to be accurate if it is equal to or below the true value with a maximum underestimation distance of K. The accuracy metric is then given by

$$\text{accuracy} = \frac{\text{number of accurate predictions}}{\text{total number of predictions}}.$$

This way of defining accuracy introduces an additional hyperparameter, namely K. The smaller K is the more rigid the requirements get.
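A minimal NumPy sketch of this accuracy metric (the function name is hypothetical):

```python
import numpy as np

def accuracy(y_true, y_pred, K):
    """Fraction of predictions at or below the true value, by at most K."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean((diff >= 0) & (diff <= K))
```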

With these definitions of loss and accuracy it is informative to decompose the error rate into an overestimation part, the fraction of predictions that overestimate the true value, and an underestimation part, the fraction of predictions that fall more than K below the true value. A shift in the penalty will lead to both a change in the total accuracy and a balance shift between the decomposed error rates. When the penalty factor is increased, the overestimation error should decrease while the underestimation error should increase.

3.6 The baselines and the models

Two baseline models are implemented, namely an RF and an SVR. The RF consists of 100 decision trees and the splitting criterion used is the MSE. The SVR is implemented with the RBF kernel and the error threshold is set to 0.1 on the normalized scale. It makes use of L2 regularization and a stopping criterion of 0.001 in the hyperplane optimization.
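A minimal sketch of such baselines, assuming scikit-learn is used (the library is not stated in the text); X_train and y_train are placeholders for the preprocessed features and the normalized bandwidth.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# 100 trees with an MSE split criterion ("squared_error" in recent scikit-learn versions).
rf = RandomForestRegressor(n_estimators=100, criterion="squared_error")

# RBF kernel, epsilon-insensitive tube of 0.1 on the normalized scale, tolerance 0.001.
svr = SVR(kernel="rbf", epsilon=0.1, tol=1e-3)

rf.fit(X_train, y_train)
svr.fit(X_train, y_train)
```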

The deep learning models implemented are an FNN and an LSTM, both for the task of regression. The FNN consists of an input layer, two hidden layers with 8 and 4 nodes respectively, and an output layer with a single node. Rectified Linear Unit (ReLU) activation is used at each hidden layer, and no activation is used at the output layer. It is a fully connected feed-forward network. The LSTM has a similar architecture to make the two deep learning frameworks more comparable. It consists of an input layer, two stacked LSTM layers with 8 and 4 nodes respectively, and an output layer with a single node. It uses a lag of two, determined from the autocorrelation analysis, and the same activation function on the hidden layers as the FNN.
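A minimal Keras sketch of the two architectures, under the assumption of three input features (SNR, power, downlink cell load) and a lag of two for the LSTM; the ReLU activation inside the LSTM layers follows the description above, not the original code.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 3   # snr, power, downlink cell load
lag = 2          # from the autocorrelation analysis

fnn = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1),                                   # linear output for regression
])

lstm = keras.Sequential([
    layers.Input(shape=(lag, n_features)),
    layers.LSTM(8, activation="relu", return_sequences=True),
    layers.LSTM(4, activation="relu"),
    layers.Dense(1),
])

# The penalized losses from Section 3.5 would be plugged in here in place of plain MSE.
fnn.compile(optimizer="rmsprop", loss="mse")
lstm.compile(optimizer="rmsprop", loss="mse")
```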

The deep learning models are trained until convergence for 50 epochs with a batch size of 64. Each of them is trained in four versions, with the conventional and the modified MSE and MAE losses, and both networks use the Keras RMSprop optimizer. Model performance is measured with the accuracy metric defined above. To find the optimal penalty factor in the modified MSE and MAE losses, each neural network is trained independently with different penalty factors. The total accuracy, underestimation error and overestimation error are then plotted as a function of the penalty factor. The optimal penalty is determined by the highest total accuracy achieved. However, one might prefer another penalty to get a better balance between overestimation and underestimation depending on the application domain. To make the results more robust, the accuracy and error rates for each given penalty factor are determined by an ensemble of 100 independently trained networks. The values plotted are the average accuracy and error rates over the networks of the ensemble.

3.7 The covid-19 impact

The covid-19 pandemic had a significant impact on the data collection. Due to the relatively early virus outbreak in South Korea, the data distribution was altered from mid February until the end of this project. The work-from-home policies adopted by many firms skewed the collected cell load data towards very low loads. This effect is visualized in Figure 3.3. Note that the scales of the axes in the cell load histograms are different in order to show the shapes of the plots more distinctly. Also note that Figure 3.3 (a) is the same as Figure 3.2 (a) and is included a second time to enable easier comparison. The lower cell loads increased the fluctuations in the bandwidth, since with few active users connected to a given cell one UE can get allocated as high a proportion of the cell capacity as needed. This reduced the correlation between the bandwidth and most independent variables. Furthermore, Figure 3.3 shows a decrease in the dataset sizes. The collaboration team at SKT could not access the OSS to fix technical issues as easily as before the outbreak. This led to more frequent holes in the cell load datasets and to problems training the LSTM network, which requires sequential data.

Due to these issues the decision was made to only use data collected before the covid-19 outbreak altered the distribution of the cell load data. However, this significantly reduced the size of the dataset to approximately 3,500 UE data points. After mapping the UE data to the cell load data, the datasets were reduced to 1,400 and 1,200 points respectively. The sizes of the datasets also limit the complexity of the deep learning models.

Figure 3.3: The cell load distribution (a) before and (b) after the covid-19 outbreak.

Recall from the correlation matrix in Table 3.2 that none of the independent variables was very strongly correlated with the bandwidth. Thus, the decision was made to extend the number of collected UE features. More specifically, Reference Signal Received Power (RSRP), Reference Signal Received Quality (RSRQ) and Channel Quality Indicator (CQI) were included. As for the old features, measurements of these three newly added features were taken right before and right after each UE data download. Thus, they were collected at the same time as power and SNR. The additional features were included at approximately the same time as the covid-19 outbreak started to impact the distribution of the collected data. Therefore, the additional features could not be included when training the models. However, it can be noted that a correlation analysis conducted on the altered dataset showed that RSRP and CQI were weakly correlated with the bandwidth while RSRQ was moderately correlated with the bandwidth.


Chapter 4

Results

4.1 Cell load model

The results of the Gaussian parameter search for the statistical model are shown in Figure 4.1. The figure shows the scatter plot of the cell load sample mean versus the cell load sample standard deviation, together with the second-degree polynomial curve fitted to the scattered data points, for two different cells.

Figure 4.1: The plots show the sample standard deviation as a function of sample mean for a set of 10,000 randomly chosen subsets of the cell load datasets of (a) Cell 1 and (b) Cell 2. Each subset contains 100 sequential data points.

The standard deviation of both cells follows a clear pattern, peaking at medium cell loads and getting smaller towards both ends of the cell load spectrum. This is reasonable since the cells on which the measurements are made are located in an office area. This means that the cell load at night should be constantly low and the cell load during office hours should be constantly high. Thus, the cell load within these time intervals should experience less variation.
