
UPTEC F 17005

Degree project, 30 credits (Examensarbete 30 hp), February 2017

Multiple time-series forecasting on mobile network data using an RNN-RBM model

Arvid Bäärnhielm


Abstract

Multiple time-series forecasting on mobile network data using an RNN-RBM model

Arvid Bäärnhielm

The purpose of this project is to evaluate the performance of a forecasting model based on a multivariate dataset consisting of time series of traffic characteristic performance data from a mobile network. The forecasting is made using machine learning with a deep neural network. The first part of the project involves the adaptation of the model design to fit the dataset, and is followed by a number of simulations where the aim is to tune the parameters of the model to give the best performance. The simulations show that with well-tuned parameters, the neural network performs better than the baseline model, even when using only a univariate dataset. If a multivariate dataset is used, the neural network outperforms the baseline model even when the dataset is small.

ISSN: 1401-5757, UPTEC F 17005. Examiner: Tomas Nyberg. Subject reader: Justin Pearson. Supervisor: Tor Kvernvik


Popular science summary (Populärvetenskaplig sammanfattning)

As technological progress has made it possible to produce ever faster computers, while more and more data is collected and stored, the exciting research field of Machine Learning has emerged. Machine Learning is part of the larger research field of artificial intelligence, and the goal is to use large amounts of data and carefully tuned algorithms to create advanced models that can find patterns in the collected data. With such large amounts of data, the human brain is no longer capable of seeing these patterns on its own.

This project investigates the possibility of assembling a model from Machine Learning algorithms to analyse statistical data collected from individual cells in a mobile network. The data is collected in the form of time series, where values are accumulated and stored at regular time intervals over a long period, and where the data comes from several cells and covers several different statistical measures. The goal is to investigate whether the model can be trained to predict values for future time intervals in the time series, initially one time interval into the future, by letting the model analyse the historical data.

The training, or optimization, is done by feeding a large amount of data into the model, where both the historical data and the future value are known. A number of parameters in the model are then fine-tuned with the help of an optimization function, so that the model reproduces the known future value as accurately as possible, for a large number of different values simultaneously. The adjustment is done stepwise and automatically, by testing the accuracy after each adjustment. The model's ability to accurately estimate future values is then tested against a roughly equally large set of separate data, where both the historical data and the future value are known, but where only the historical data is fed into the model. The accuracy is compared with that of a simpler model, to obtain a measure of the quality of the model.

The model has been trained and tested on a few different sets of data. The purpose is partly to investigate how the accuracy is affected by the amount of historical data, but mainly to investigate whether there are dependencies between geographically distributed cells and between different types of statistical measures. The data has therefore been extended successively, from a single statistical measure in a single cell to finally containing several different types of statistical measures from several different cells.

The conclusion from the simulations that have been carried out is that the model shows great potential for producing accurate forecasts of future values. At the same time, there are several suggestions for improvements to the model's design that could increase the accuracy further. There are therefore good reasons to perform more extensive tests to further investigate the potential of the model.


Acknowledgements

I would like to express my sincere appreciation and deepest thanks to my supervisor Tor Kvernvik and my second supervisors Tony Larsson and Johan Haraldsson at Ericsson for all the support and engagement throughout this master's thesis. I am forever grateful for the opportunity to come to Ericsson and to be able to work in such an interesting field as Machine Learning. I would also like to extend my greatest gratitude to my subject reader Justin Pearson at the Department of Information Technology at Uppsala University. Your help and encouragement during times of struggle have been essential for my work. Finally, I would like to thank my wife and family. Your love and support are my foundation in life.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Time series
  1.2 Time series analysis and forecasting models
  1.3 Motivation
2 Theory
  2.1 Mobile Network infrastructure
  2.2 Holt-Winter
  2.3 Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM)
    2.3.1 Recurrent Neural Network (RNN)
    2.3.2 Restricted Boltzmann Machine (RBM)
    2.3.3 RNN and RBM combined as RNN-RBM
    2.3.4 Binary versus real-valued data
3 Methodology
  3.1 Software
  3.2 Dataset
    3.2.1 Dataset 1
    3.2.2 Dataset 2
  3.3 Parameter tuning
  3.4 Evaluation method
    3.4.1 Residual sum of squares - RSS
    3.4.2 Area under the curve - AUC
  3.5 Code
4 Results
  4.1 Binary-valued input data
  4.2 Real-valued input data
    4.2.1 Baseline forecasts
    4.2.2 Single cell and single counter
    4.2.3 Multiple cells and single counter
    4.2.4 Single cell and multiple counters
    4.2.5 Multiple cells and multiple counters
5 Analysis
  5.1 Input data
  5.2 Hidden units
  5.3 Batch size
6 Conclusion
  6.1 Summary of thesis achievements
  6.2 Future Work
Bibliography
A Software
  A.1 Python
  A.2 Theano
  A.3 PyCharm
  A.4 Git
B RSS and AUC tables


List of Tables

B.1 Table showing all results from the simulations using a single cell and a single counter as input data.
B.2 Table showing all results from the simulations using multiple cells and a single counter as input data.
B.3 Table showing all results from the simulations using a single cell and multiple counters as input data.
B.4 Table showing all results from the simulations using multiple cells and multiple counters as input data.


List of Figures

2.1 A historical mobile network to the left, with a single basestation covering a large number of users, and a heterogeneous mobile network to the right, consisting of a large number of small cells connected to each basestation. Image source: [Hal]
2.2 An illustration of the load at cells in housing and business areas during morning commuting. The top row shows activity in green, mostly at the housing areas, while the middle row shows activity mostly along the roads, and the bottom row shows activity mainly at the business areas.
2.3 An illustration of the load at cells along a highway when users pass by the cells. The load rises at Cell A first, followed by Cell B, and Cell C.
2.4 An illustration of the load during one week. The top graph shows a smooth curve, with cycles that are easy to spot, while the bottom graph shows a rougher curve, where the cycles are less obvious.
2.5 An RNN model unfolded in time. The bottom layer is the input, the top layer is the output and the middle layer is the hidden state, dependent on the input and the previous hidden state. Image source: [LBH15]
2.6 A graphical description of an RBM with the visible layer at the bottom and the hidden layer at the top. Image source: [Deea]
2.7 A graphical illustration of t-step Gibbs sampling. Note that the last hidden step in the figure should be labeled $h^{(t-1)}$ and not $h^{(t)}$, to follow the pattern. Image source: [Deea]
2.8 A graphical illustration of the RNN-RBM model. The bottom layer is the RNN implementation and the top two layers are the RBM implementation. Image source: [Deeb] [BLBV12]
3.1 The confusion matrix presents a visualization of how well a model manages to classify a set of values. The correctly classified values are shown on the diagonal from the top left corner to the bottom right corner, while the incorrectly classified values are presented in the other fields. Image source: [dS]
3.2 The ROC space presents, graphically, the performance of a classification model. The dots represent the rate of true positives, TPR, versus the rate of false positives, FPR. A value close to the upper left corner, with high TPR and low FPR, indicates a good classifier.
3.3 The results from different choices of threshold for the classification model create a curve. The area under the curve, AUC, as well as the shape of the curve, gives an indication of the performance of the model.
4.1 The left and middle columns show the forecasts made using the RNN-RBM model with 100 and 1000 hidden units in the RNN layer, respectively. The right column shows the forecasts made using the Holt-Winter model. The rows correspond to dividing the input data into different numbers of percentiles, with 5 on top, 20 in the middle and 100 at the bottom. The blue line is the real data and the red line is the forecasted data.
4.2 One-step forecast made using the Holt-Winter model (red) on unchanged data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.
4.3 One-step forecast made using the Holt-Winter model (red) on modified data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.
4.4 One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.5 One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.6 One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.7 One-step forecast made using the RNN-RBM model (red) with multiple cells, a single counter and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.8 One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.9 One-step forecast made using the RNN-RBM model (red) with a single cell, multiple counters and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.10 One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
4.11 One-step forecast made using the RNN-RBM model (red) with multiple cells, multiple counters and all history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.
5.1 Figure of the forecast giving the overall best AUC value. The performance counter is the number of packets transferred and is shown in blue, with the forecast in red.
5.2 Two figures showing how the RSS value and the AUC value depend on the input data.
5.3 Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RBM layer.
5.4 Two figures showing how the RSS value and the AUC value depend on the number of hidden units in the RNN layer.
5.5 Two figures showing how the RSS value and the AUC value depend on the batch size.


Chapter 1

Introduction

This project aims to evaluate the performance of a multivariate forecasting model on a multivariate dataset consisting of time series data. The performance of the evaluated model is compared to that of a baseline model. In this chapter, a brief explanation of time series is given in Section 1.1, followed by a deeper explanation of time series analysis and the state-of-the-art models in use, in Section 1.2. The chapter ends with Section 1.3, which motivates the usefulness of the project.

1.1 Time series

A time series is a sequence of numerical data points listed in time order. Most commonly the data points are distributed evenly in time, with equal spacing between successive data points over the entire time series. Many kinds of data can be gathered into time series, as long as there is a time dependence in the data. Often there is a desire to be able to predict and forecast the future behaviour of the time series. Weather forecasts are probably the most familiar type of time series forecasting, where multiple time series of historical data from spatially distributed data sources, and a number of different characteristics, such as temperature, pressure, humidity, etc., are combined into a large dataset. These added dimensions create a multivariate time series, in contrast to a univariate time series, which consists of only one time series. The different data sources used in this project are described further in Section 3.2.

1.2 Time series analysis and forecasting models

Time series analysis and forecasting have been used in many different areas and fields for a long time, and the interest and the possible applications have grown over time. Apart from the already mentioned weather forecasting, time series analysis is used to predict traffic and congestion [MYWW15], [GPP+09], [TN07], to predict data traffic and load in mobile networks [SPSM12], [WGLP10], [Bru00], [YMJ11], to predict movements of people [JLP15], [DZ07], and to classify time series (e.g., labeling sentences as grammatically correct or incorrect), a task that is related to prediction and where there is a mutual benefit in combining the tasks [HS03], to name a few applications. A number of different algorithms and models have been designed and used to predict and forecast time series; a few of these are described below.

The authors of [Bru00] and [TN07] suggest the use of Holt-Winter's algorithm, an exponential smoothing algorithm where the impact of historical data on the forecasted data decays exponentially. The algorithm builds on the premise that time series can be decomposed into three components: baseline, linear trend, and seasonal trend, where all components are presumed to evolve over time. The algorithm is able to perform multi-step prediction and is fairly simple in its design. In [GPP+09] the authors propose the use of a multiplicative seasonal Autoregressive Integrated Moving Average (ARIMA) model, also called Box-Jenkins. ARIMA is a generalized version of the Autoregressive Moving Average (ARMA) model, and both models are used for prediction, but also to better understand time series by simplifying the behavior of the series. The authors of [WGLP10] use an improved variation of the Support Vector Machine (SVM) algorithm, the Least Squares Support Vector Machine (LS-SVM), which is shown to be more efficient and accurate than the ARIMA algorithm. In [DZ07] the authors propose yet another algorithm, called the Scale-Free Echo State Network (SHESN), a variation of the Echo State Network (ESN) algorithm. The SHESN algorithm makes use of clustering to further increase the performance of the forecasting. The clustering is implemented using a naturally evolving dynamic state reservoir, unlike the ESN algorithm, which uses a completely random dynamic state reservoir. The authors of [YMJ11] suggest the Prior knowledge based Clustered Complex ESN (PCCESN) as an even better algorithm than SHESN, also using a naturally evolving dynamic state reservoir to implement clustering, but with a more adaptive implementation. The authors of [JLP15] suggest the use of the Cluster-Aided Mobility Projector (CAMP) algorithm, which also uses clustering to increase the performance of the forecasting. This algorithm is used to predict trajectories, and the clustering implementation makes the algorithm perform very well even with very short previous trajectories. In both [RCSR07] and [GSM+15] the authors use the K-means clustering algorithm to find spatial patterns in time series data. This can be useful in a forecasting algorithm, since it could replace the large amount of cell data with a much smaller number of cluster data and greatly increase the efficiency of the algorithm.

The authors of [MYWW15] suggest the use of a Recurrent Neural Network combined with a Restricted Boltzmann Machine (RNN-RBM) to predict and forecast time series. When predicting traffic congestion inside a city, the accuracy could reach as high as 88%. This was accomplished by using data from spatially distributed time series of traffic speed on a number of roads in the city, where the speeds were collected from GPS data from a great number of taxis. The ability to find dependencies both spatially and temporally distinguishes the RNN-RBM model from the other algorithms and models mentioned.

The different models all have their advantages and disadvantages and are better suited for different types of tasks and problems, using different datasets. In this project the possible benefits of using the multivariate properties of the RNN-RBM model, designed by [BLBV12], will be evaluated. The model has been chosen for evaluation based on the promising results described in [MYWW15], which addresses a problem similar to the one in this project. For comparison, the univariate Holt-Winter model will be used as a baseline, mainly due to its simple design and its ability to capture seasonality in the data. Both models are further described in Chapter 2.

1.3 Motivation

For Ericsson, time series analysis and forecasting is a useful tool in many different fields and for many different datasets. One important area where time series analysis and forecasting can have great impact is when trying to forecast the mobile traffic in the mobile network. A good forecast of the mobile traffic could help in a number of ways to improve the performance of the network. A few possible use-cases where forecasts of mobile traffic could be helpful are listed below.

• Load balancing

By knowing beforehand when there is about to be high load in the network, measures can be taken to prevent an overload situation. Data can be preallocated, users can be rerouted to nearby cells, and other measures can be taken. This will help improve the Quality of Service (QoS) for all users.

• Energy savings

For most cells in the network, the data traffic will at times be very low, e.g. during night time. In some cases, especially where the cells are close enough that their respective covered areas overlap, there is not always a need for all components, or even all cells, to be active. If cells could be partially or fully deactivated when they are not needed, energy could be saved while also extending the lifetime of the components. However, to avoid switching the components on and off repeatedly, it is important to have knowledge of the amount of data traffic a sufficient amount of time into the future.


• Anomaly detection

The ability to detect anomalies in the data traffic can, among other things, help detect components that are close to breaking down, and help reduce downtime by making it possible to switch the component before it breaks down. Since an anomaly is simply a divergence from the expected value, a good forecast will help improve the anomaly detection.

The granularity of the forecasts, as well as how far into the future the forecasts are reliable, will affect the impact of the forecasts and which applications can benefit from them. For some applications a granularity of several hours could be enough, while for others the granularity has to be as fine as milliseconds. The different demands on granularity limit the model in different ways. A forecast at millisecond level will of course demand that the model is fast enough to make the forecasts in time. A less granular forecast will also in itself reach further into the future, but for very short timespans a forecast far into the future will be difficult to produce.


Chapter 2

Theory

In this chapter the infrastructure of the mobile network and the dependencies between different parts of the network are explained in Section 2.1, followed by an explanation of the two models used in this project. The baseline model, the Holt-Winter model, is described in Section 2.2, and the RNN-RBM model is described in Section 2.3.

2.1 Mobile Network infrastructure

The infrastructure of the mobile network consists of a large number of connected base stations that transfer the data between the users and the network. Historically, each basestation single-handedly served a large number of users. Today, however, each basestation is connected to a large number of smaller cells that each connects to a number of users [Hal].

Figure 2.1 shows a graphical representation of the connections between the basestation and the cells, and how the network has evolved from consisting only of basestations to adding more and more cells.

Figure 2.1: A historical mobile network to the left, with a single basestation covering a large number of users, and a heterogeneous mobile network to the right, consisting of a large number of small cells connected to each basestation. Image source: [Hal]

The cells can have different designs and properties depending on where they are located and their purpose. The covered area can be large or small, and overlaps between covered areas are common. This helps create redundancy in the network, so that if a cell or basestation breaks down, a nearby cell or basestation can still provide coverage to the area.

Each cell collects statistical data on how many users are connected and the amount of data traffic they generate. When users move around, they will eventually move outside the range of the cell they are connected to and connect to another cell that is closer. This causes the number of users connected to each cell to change over time. This movement is not random, but follows certain patterns, e.g. from housing areas to business areas in the morning and back in the evening, or along a highway or a railway during commuting hours. This can be seen in Figure 2.2, where the movement from housing areas to business areas is indicated by the colouring of the basestations. In the top row, the active basestations are mainly at the housing areas and nearby roads. In the middle row, the active basestations are along the roads and approaching the business areas. In the bottom row, the active basestations are mainly at the business areas. Figure 2.3 illustrates the load at the cells along a highway. The car moves from top to bottom, passing the three cells and causing the activity to increase and decrease at different times. Both these figures are, however, very simplified and only give a basic picture of the spatial dependencies in the mobile network.

Figure 2.2: An illustration of the load at cells in housing and business areas during morning commuting. The top row shows activity in green, mostly at the housing areas, while the middle row shows activity mostly along the roads, and the bottom row shows activity mainly at the business areas.

Figure 2.3: An illustration of the load at cells along a highway when users pass by the cells. The load rises at Cell A first, followed by Cell B, and Cell C.

In addition to the spatial dependence, there is also a strong temporal dependence in the network traffic, as can be seen in Figure 2.4. The daily cycle starts with low traffic at night, followed by a quick rise in the morning to a plateau of high traffic during the day, and finally a slow decline in the evening back to the low traffic at night. The weekly cycle shows higher load during workdays and lower load during weekends. The top graph shows a smooth curve, where the cycles are easy to see (it is even possible to spot a decrease in activity during lunch hours), while the bottom graph is an example where the cycles are less obvious.

Figure 2.4: An illustration of the load during one week. The top graph shows a smooth curve, with cycles that are easy to spot, while the bottom graph shows a rougher curve, where the cycles are less obvious.


2.2 Holt-Winter

The additive Holt-Winter seasonal model with exponential smoothing is a model used for time series forecasting [HA] [BD06] [Bru00]. The model builds on the premise that the time series data can be decomposed into three components:

1. Baseline:
$$a_t = \alpha (y_t - c_{t-m}) + (1 - \alpha)(a_{t-1} + b_{t-1}) \tag{2.1}$$

2. Linear trend ("slope"):
$$b_t = \beta (a_t - a_{t-1}) + (1 - \beta) b_{t-1} \tag{2.2}$$

3. Seasonal trend:
$$c_t = \gamma (y_t - a_{t-1} - b_{t-1}) + (1 - \gamma) c_{t-m} \tag{2.3}$$

where $y_t$ is the true value at time $t$, $m$ is the seasonality of the time series, and $\alpha$, $\beta$, and $\gamma$ are adaptation parameters of the model with values ranging from 0 to 1. The outputs $a_t$, $b_t$, and $c_t$ correspond to the baseline, the linear trend, and the seasonal trend, respectively. Both the seasonal parameter $m$ and the adaptation parameters $\alpha$, $\beta$ and $\gamma$ can be calculated in a number of ways.¹ The sum of the three components $a_t$, $b_t$ and $c_{t+1-m}$ gives the forecasted value for time $t + 1$ as

$$\hat{y}_{t+1} = a_t + b_t + c_{t+1-m} \tag{2.4}$$

where $\hat{y}_t$ is the forecasted value for time $t$. The initial values for the three components are calculated using

¹ For this project the seasonality is hard-coded to 24 hours, which translates to 96 time steps, and the adaptation parameters are calculated automatically by the code referenced in Section 3.5.


$$a_{m-1} = \frac{1}{m} \sum_{i=0}^{m} y_i \tag{2.5}$$

$$b_{m-1} = \frac{1}{m^2} \sum_{i=0}^{m} (y_{i+m} - y_i) \tag{2.6}$$

$$c_i = y_i - a_0 \quad \text{for } 0 \leq i < m \tag{2.7}$$

giving the model an update period from $t = m$ to the end of the time series.
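To make the recursion concrete, the following is a minimal Python sketch of the additive Holt-Winter one-step forecast in equations 2.1-2.4, with initialization per equations 2.5-2.7. The fixed smoothing parameters are illustrative assumptions; in the project they are fitted automatically by the code referenced in Section 3.5, and the initial trend here sums over one season rather than $m + 1$ terms.

```python
import numpy as np

def holt_winters_one_step(y, m=96, alpha=0.3, beta=0.03, gamma=0.5):
    """One-step-ahead additive Holt-Winter forecasts (eqs. 2.1-2.4).

    y: 1-D numpy array of observations; m: season length (96 = one day
    of 15-minute intervals). Returns forecasts for t = m .. len(y)-1.
    """
    a = y[:m].mean()                       # eq. 2.5 (approximate)
    b = (y[m:2 * m] - y[:m]).sum() / m**2  # eq. 2.6 (approximate)
    c = list(y[:m] - a)                    # eq. 2.7, seasonal components
    forecasts = np.full(len(y), np.nan)
    for t in range(m, len(y)):
        forecasts[t] = a + b + c[t % m]    # eq. 2.4, forecast for time t
        a_prev, b_prev = a, b
        a = alpha * (y[t] - c[t % m]) + (1 - alpha) * (a_prev + b_prev)       # eq. 2.1
        b = beta * (a - a_prev) + (1 - beta) * b_prev                         # eq. 2.2
        c[t % m] = gamma * (y[t] - a_prev - b_prev) + (1 - gamma) * c[t % m]  # eq. 2.3
    return forecasts
```

Calling `holt_winters_one_step(series)` on a daily-seasonal series yields one forecast per time step from $t = m$ onwards, matching the update period stated above.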

2.3 Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM)

The Recurrent Neural Network-Restricted Boltzmann Machine (RNN-RBM) model differs from many other models in that it uses multivariate dependencies. The model is a combination of the RNN model and the RBM model. To better understand the combined RNN-RBM model, an explanation of each of these models is given in Sections 2.3.1 and 2.3.2, respectively, followed by an explanation of how the models combine into the RNN-RBM model in Section 2.3.3.

2.3.1 Recurrent Neural Network (RNN)

Recurrent Neural Networks (RNNs) are models where the idea is to make use of sequential information [Bri]. Since time series data is in many cases temporally dependent on previous data, traditional neural networks will fail, since they assume that all inputs (and outputs) are independent of each other. An RNN makes use of the temporal dependence and uses previous computations as input to each new computation.


Figure 2.5: An RNN model unfolded in time. The bottom layer is the input, the top layer is the output and the middle layer is the hidden state, dependent on the input and the previous hidden state. Image source: [LBH15]

Figure 2.5 shows a typical RNN model unfolded in time. The bottom layer is the input layer, where $x_t$ is the input at time $t$. The middle layer is the hidden layer, where $s_t$ is the hidden state at time $t$. $s_t$ is calculated based on the input $x_t$ and the previous hidden state $s_{t-1}$ as

$$s_t = f(U x_t + W s_{t-1}) \tag{2.8}$$

where the function $f$ is usually a nonlinearity such as tanh or ReLU². The initial value of $s$, used to calculate the first hidden state, is usually set to all zeroes. The top layer is the output layer, where $o_t$ is the output at time $t$. $o_t$ is usually calculated using the softmax function as

$$o_t = \mathrm{softmax}(V s_t). \tag{2.9}$$

The parameters $U$, $V$ and $W$ are updated during training to achieve a desirable output. When used for forecasting of time series, the output should represent the input at time $t + 1$, and the output from input $x_t$ is therefore often labeled $o_{t+1}$.

² Rectified linear unit
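The forward pass in equations 2.8-2.9 can be sketched in a few lines of numpy; the tanh nonlinearity and the matrix shapes here are illustrative assumptions, not the project's exact configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(xs, U, V, W):
    """Unfold an RNN over an input sequence xs (eqs. 2.8-2.9).

    xs: array of shape (T, n_in); U: (n_hidden, n_in);
    W: (n_hidden, n_hidden); V: (n_out, n_hidden).
    Returns the hidden states and the outputs for every time step.
    """
    s = np.zeros(W.shape[0])              # initial hidden state, all zeroes
    states, outputs = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)        # eq. 2.8 with f = tanh
        states.append(s)
        outputs.append(softmax(V @ s))    # eq. 2.9
    return np.array(states), np.array(outputs)
```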


Figure 2.6: A graphical description of an RBM with the visible layer at the bottom and the hidden layer at the top. Image source: [Deea]

2.3.2 Restricted Boltzmann Machine (RBM)

Restricted Boltzmann Machines (RBMs) are energy-based models that have been used as generative models of many different types of data, including high-dimensional temporal sequences such as video, motion capture data or speech [Hin10]. The model includes one hidden layer $h = (h_1, h_2, \ldots, h_{n_h})^T$ and one visible layer $v = (v_1, v_2, \ldots, v_{n_v})^T$, where $n_v$ and $n_h$ are the number of visible and hidden units respectively. A graphical description is shown in Figure 2.6. All units in each layer are connected to all units in the other layer, but no connections exist between units in the same layer. Each unit can take the value 1 or 0, where 1 corresponds to the unit being activated.

The energy of the model is a scalar value associated with each configuration of the variables of interest and can be calculated by

$$E(v, h) = -a^T v - b^T h - h^T W v \tag{2.10}$$

where $a$ and $b$ are bias vectors connected to $v$ and $h$ respectively, and $W$ is a weight matrix between the layers. Each pair of a visible and a hidden vector can be assigned a probability by

$$p(v, h) = \frac{1}{Z} e^{-E(v,h)} \tag{2.11}$$

where $Z$, which normalizes equation 2.11, is the sum over all possible pairs of visible and hidden vectors,

$$Z = \sum_{v,h} e^{-E(v,h)}. \tag{2.12}$$

The conditional probability that a hidden unit $h_j$ is activated given the visible vector $v$, and the conditional probability that a visible unit $v_i$ is activated given the hidden vector $h$, can be calculated by

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i w_{i,j}\Big) \tag{2.13}$$

$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j h_j w_{i,j}\Big) \tag{2.14}$$

respectively, where $\sigma(x)$ is the logistic sigmoid function $1/(1 + e^{-x})$. Since no connections exist between units in the same layer, the conditional probabilities factorize as

$$P(h \mid v) = \prod_j P(h_j = 1 \mid v) \tag{2.15}$$

$$P(v \mid h) = \prod_i P(v_i = 1 \mid h). \tag{2.16}$$

To find the parameters $\theta = (W, a, b)$ in equation 2.10, the RBM is required to maximize the probability of the training set $V$ by

$$\arg\max_\theta \prod_{v \in V} P(v) \tag{2.17}$$

which is equivalent to maximizing the log-likelihood of $P(v)$. This is commonly done using gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood) as

$$\theta = \theta + \eta \frac{\partial \ln P(v)}{\partial \theta} \tag{2.18}$$


Figure 2.7: A graphical illustration of t-step Gibbs sampling. Note that the last hidden step in the figure should be labeled $h^{(t-1)}$ and not $h^{(t)}$, to follow the pattern. Image source: [Deea]

where $\eta$ is the learning rate and the partial derivative is calculated as

$$\frac{\partial \ln P(v)}{\partial \theta} = -\left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{P(h \mid v)} + \left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_{P(v,h)} \tag{2.19}$$

where $\langle \cdot \rangle_P$ denotes the expectation value with respect to the probability distribution $P$. Solving equation 2.17 directly is computationally expensive, and a better way is to use the Contrastive Divergence (CD) approach [Hin02] as

$$\frac{\partial \ln P(v)}{\partial w_{i,j}} \approx P(h_i = 1 \mid v^{(0)})\, v_j^{(0)} - P(h_i = 1 \mid v^{(t)})\, v_j^{(t)}$$

$$\frac{\partial \ln P(v)}{\partial a_i} \approx v_i^{(0)} - v_i^{(t)} \tag{2.20}$$

$$\frac{\partial \ln P(v)}{\partial b_i} \approx P(h_i = 1 \mid v^{(0)}) - P(h_i = 1 \mid v^{(t)})$$

where $v^{(t)}$ is the result of $t$-step Gibbs sampling. One step of Gibbs sampling is performed using equations 2.13 and 2.14 respectively, setting $h_j$ and $v_i$ equal to 1 randomly based on the calculated probabilities. By repeating this process $t$ times, $v^{(t)}$ can be obtained, as shown graphically in Figure 2.7.
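As a concrete illustration, the following is a minimal numpy sketch of one CD-1 update ($t = 1$) for a binary RBM, following equations 2.13, 2.14 and 2.20. The learning rate and the use of a single training vector are illustrative assumptions; in practice updates are averaged over mini-batches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, rng, eta=0.01):
    """One Contrastive Divergence (CD-1) step for a binary RBM.

    v0: binary visible vector; W: (n_v, n_h) weight matrix;
    a, b: visible and hidden bias vectors; rng: np.random.default_rng().
    Returns the parameters updated per eq. 2.20 with t = 1.
    """
    ph0 = sigmoid(b + v0 @ W)                    # eq. 2.13 on the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(a + h0 @ W.T)                  # eq. 2.14, reconstruction
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)                    # eq. 2.13 on the sample
    # Approximate gradients (eq. 2.20) and gradient-ascent update (eq. 2.18).
    W += eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += eta * (v0 - v1)
    b += eta * (ph0 - ph1)
    return W, a, b
```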


Figure 2.8: A graphical illustration of the RNN-RBM model. The bottom layer is the RNN implementation and the top two layers are the RBM implementation. Image source: [Deeb] [BLBV12]

2.3.3 RNN and RBM combined as RNN-RBM

Nicolas Boulanger-Lewandowski, together with Yoshua Bengio and Pascal Vincent at Université de Montréal, combined the RNN and RBM models into an RNN-RBM model, as a generalization of the Recurrent Temporal RBM (RTRBM) [BLBV12]. The purpose was to further utilize the forecasting capability of the two models and to create a model that allows more freedom in describing the temporal dependencies involved. The model extends the RNN model by adding an RBM at each time step. The output layer of the RNN, as described in Figure 2.5, is no longer a direct representation of the visible units intended to forecast, but instead provides the parameters for the RBM model. This can be seen graphically in Figure 2.8. The bottom layer constitutes the RNN model and the top two layers constitute the RBM model. The model consists of nine parameters: $W$, $b_v$ and $b_h$ as part of the RBM model; $W_{uu}$, $W_{vu}$, $u^{(0)}$ and $b_u$ as part of the RNN model; and $W_{uh}$ and $W_{uv}$ to connect them.

The initial values for the matrices $W$, $W_{uu}$, $W_{vu}$, $W_{uh}$ and $W_{uv}$ can be set to small random normalized values, and the initial values for the bias vectors $b_v$, $b_h$, $b_u$ and $u^{(0)}$ can be set to zero. The dimensions of the parameters are given by the number of units in the visible layer, $n_v$, the number of hidden units in the RBM layer, $n_h$, and the number of hidden units in the RNN layer, $n_{hr}$. The numbers of hidden units in the RBM layer and the RNN layer are set by evaluating a number of parameter combinations.

The bias vectors for the RBM model, $b_v^{(t)}$ and $b_h^{(t)}$, are updated through the hidden units of the RNN layer, $u^{(t-1)}$, as

$$b_v^{(t)} = b_v + W_{uv} u^{(t-1)} \tag{2.21}$$

$$b_h^{(t)} = b_h + W_{uh} u^{(t-1)} \tag{2.22}$$

where $b_v$ and $b_h$ are the initial bias vectors for the visible and hidden units in the RBM layer. The vector $u^{(t)}$ represents the hidden units of the RNN layer at time $t$ and is calculated as

$$u^{(t)} = f(b_u + W_{uu} u^{(t-1)} + W_{vu} v^{(t)}) \tag{2.23}$$

where $f$ is an activation function and $b_u$ is the initial bias vector for the hidden units in the RNN layer. The activation function is suggested by [BLBV12] to be the $\sigma$ function, while [Deeb] suggests the tanh function. The training iteration of the model is based on the following scheme:

1. Generate the hidden units $u^{(t)}$ for the RNN model using equation 2.23 on the set of visible units.

2. Update the bias vectors $b_v^{(t)}$ and $b_h^{(t)}$ using equations 2.21 and 2.22 respectively for $u^{(t-1)}$, and perform $n$-step Gibbs sampling to obtain a representation of the visible units $v^{(t)*}$.

3. Calculate the log-likelihood gradient using the CD approach described in equation 2.20 with respect to $W$, $b_v^{(t)}$ and $b_h^{(t)}$.

4. Propagate the gradient with respect to $b_v^{(t)}$ and $b_h^{(t)}$ backwards in time to obtain the gradients with respect to $W_{uu}$, $W_{vu}$, $W_{uv}$, $W_{uh}$, $b_v$, $b_h$ and $b_u$.

The forecasted value $v^{(t+1)}$ is then obtained by first constructing the bias vectors $b_v^{(t+1)}$ and $b_h^{(t+1)}$ using equations 2.21 and 2.22, and then performing $t$-step Gibbs sampling, with $v^{(t+1)}$ initiated as zero, until convergence.

2.3.4 Binary versus real-valued data

The model developed in [BLBV12] was designed to predict and generate MIDI sequences by learning both the temporal dependencies and the chord conditional distribution. This 2-dimensional dependence can be adapted and applied to spatially distributed data and can in theory be extended to handle dependencies in $n$ dimensions. The MIDI sequences are represented by a binary vector of length 88, where each index corresponds to a note in the MIDI spectrum. A one represents an active note, whereas a zero represents an inactive note. The problem to be solved there is therefore a binary problem, where a binary RBM design is used. The problem in this project is, in contrast, a real-valued problem. To be able to apply this problem to the proposed model, either the data has to be transformed to fit the model, or the model has to be adapted to fit the data.


Chapter 3

Methodology

This chapter aims to give an explanation of how the project has been performed and the tools that have been used. The chapter starts with Section 3.1, where the software that has been used is described, followed by Section 3.2, which describes the datasets that have been used for all simulations and evaluations. Section 3.3 gives a brief explanation of the training process of the model, followed by a description of the evaluation method in Section 3.4. The chapter ends with Section 3.5, explaining the source of the code used in the project.

3.1 Software

A significant part of the project has been dedicated to programming the model and running simulations of the forecasts. Programming a Machine Learning model can be done in many different programming languages, where MATLAB, R and Python are some of the most popular ones. At Ericsson Research, Python is a widely used language, and since Python is also the language that the base model is programmed in, the choice of language was obvious. To aid in the advanced Machine Learning, a numerical computation library called Theano [TARA+16] was used. Python, Theano and the other software that has been used in the project are described in Appendix A.

3.2 Dataset

Two different datasets from two different capital cities have been used for testing the model during the project. During the first weeks, a dataset consisting of counters of a number of different traffic characteristic performance data from a number of cells in a major capital city was used, Dataset 1. It was used primarily to evaluate the basic functionality of the RNN-RBM model and the baseline model. In the later part of the project, a dataset consisting of counters of a number of different traffic characteristic performance data from a number of cells in a small capital city was used, Dataset 2. This dataset was used for the final testing and evaluation of the performance of the model, depending on the different modifications that have been made to the model. Since the datasets consist of a number of different cells and a number of different data counters, there is a 3-dimensional dependence for the forecast model: temporal, spatial, and between data counters. However, the dimensionality has been reduced to two dimensions by folding the data counters and the spatiality into one dimension, due to time limitations on extending the model to handle more than two dimensions. This causes a loss of information about dependencies between similar data counters or data counters in the same cell, and treats all dependencies between different time series as equal.

3.2.1 Dataset 1

Dataset 1 consists of data from more than 7000 cells from the central parts of a major capital city, including all types of city districts. The data is aggregated over 15-minute intervals, giving 96 data points per day, for a period of 30 days. There are four different data counters included in the set, covering

• the amount of data traffic

• the number of data requests

• the number of calls

• the number of SMS sent

during each time interval. The data is aggregated over all users and normalized, to ensure the integrity of all users, while still keeping the relative values consistent.

3.2.2 Dataset 2

Dataset 2 consists of data from 10 cells in the semi-central parts of a small capital city, including mainly business areas. The 10 cells have been selected as the cells with the largest amount of data transferred, so that irregularities in the data could be kept as low as possible. The data is aggregated over 15-minute intervals, giving 96 data points per day, for a period of 122 days. There are five different data counters included in the set, covering

• the amount of downloaded data

• the amount of uploaded data

• the number of packets sent

• the number of calls

• the number of SMS sent

during each time interval. The data is aggregated over all users but not normalized.

The data covers a period of 122 days. However, a number of days in the middle of the interval had varying amounts of missing data. Some days even had to be disregarded due to too much missing data, while others could be repaired by filling in the missing data with interpolated data. Only days with limited missing data were repaired, so as not to risk interfering with the performance of the model. This led to the data being split into two separate groups, where the first group consisted of 47 days and was used for training, while the second group consisted of 54 days and was used for testing. In between these were 21 days that were disregarded due to too much missing data.

The data in Dataset 2 is transformed by taking the logarithm of each data point, in an attempt to make the dataset more linear. This can only be done if no values are zero, which is guaranteed by changing all zeroes in the input data to a very small number larger than zero. The dataset is then normalized to mean 0 and standard deviation 1. The normalization parameters are obtained from the training data and are individual for each time series. The same parameters are then used when normalizing the test data, meaning that the mean and standard deviation will differ slightly from 0 and 1 respectively for the test data. The parameters are also used when de-normalizing the forecasted data, all to ensure that all data follows the same range.
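A minimal sketch of this preprocessing pipeline is shown below; the epsilon used to replace zeroes before taking the logarithm is an illustrative assumption, since the thesis does not state the exact value.

```python
import numpy as np

EPS = 1e-6  # assumed placeholder for zero values before the log transform

def fit_transform(train, test):
    """Log-transform and z-normalize per time series (rows = series).

    Normalization parameters are fitted on the training part only and
    reused for the test part, as described above.
    """
    train = np.log(np.where(train == 0, EPS, train))
    test = np.log(np.where(test == 0, EPS, test))
    mean = train.mean(axis=1, keepdims=True)
    std = train.std(axis=1, keepdims=True)
    return (train - mean) / std, (test - mean) / std, mean, std

def denormalize(forecast, mean, std):
    # Invert the normalization and the log transform for the forecasts.
    return np.exp(forecast * std + mean)
```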

To ensure the data is anonymized, all data has been modified and no true values are presented in the report. All relative differences are, however, kept intact.

3.3 Parameter tuning

The performance of the model depends on the tuning of a few parameters: the number of hidden units in the RBM layer, the number of hidden units in the RNN layer, and the batch size during training. The performance is also affected by the amount of historical data used in the forecasting process. For each additional parameter value that should be evaluated, a complete set of simulations together with all sets of values for the other parameters is needed. This means that the number of simulations needed for evaluation


will scale very fast with the number of parameter values. The total number of simulations scales as

$$S = \prod_{i=1}^{n} p_i \tag{3.1}$$

where $S$ is the number of simulations, $p_i$ is the number of values for parameter $i$, and $n$ is the total number of different parameters. Because of this, a few initial simulations have been made to try to pinpoint the range containing the optimal parameter values. In the next step, the simulations have been made in a more systematic way, where a few different values in the assumed optimal range have been evaluated. Three different numbers of hidden units in the RBM layer and three different numbers of hidden units in the RNN layer have been evaluated in this way.
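As an illustration of equation 3.1, the grid below uses the concrete parameter values evaluated in this project (Sections 3.3 and 4.2); the run_simulation driver is hypothetical and stands in for one training run.

```python
from itertools import product

# Parameter grids evaluated in the project.
rbm_hidden = [100, 300, 1000]
rnn_hidden = [1000, 2000, 5000]
batch_days = [1, 2]

grid = list(product(rbm_hidden, rnn_hidden, batch_days))
print(len(grid))  # S = 3 * 3 * 2 = 18 simulations (eq. 3.1)

for n_h, n_hr, days in grid:
    # run_simulation is a hypothetical driver for one training run:
    # run_simulation(n_hidden_rbm=n_h, n_hidden_rnn=n_hr, batch_size=96 * days)
    pass
```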

When training the RNN-RBM model, the input data needs to be separated into smaller batches, both to reduce the runtime and to improve the performance of the training. The batches cover all cells and all counters, but are limited in the time dimension. One batch at a time is fed into the model, until all batches have been processed. Between each batch the error is calculated and the parameters are updated. When all batches have been processed, the training is repeated over all batches for a number of cycles, or epochs, until the error has converged. The size of the batches will impact the performance of the model; too small a size risks not capturing all dependencies, while too large batches risk making the model too hard, or even impossible, to train. The best choice of batch size has been briefly evaluated during the project by testing both 1 day and 2 days of data per batch. Some testing has also been done to artificially connect all batches, so that the risk of not learning all temporal dependencies is minimized.

During testing, the model needs some initial historical data to be able to make a forecast of the next value. The amount of historical data used has been chosen in two different ways. First, the initial data has been set to a fixed amount of data for each single forecast, in a so-called "Sliding window". This means that the accuracy of the forecast should not depend on which time step is being forecasted. The other approach is to use all available historical data in the test set as initial data for the forecast. This means that the initial data will be larger for forecasts at time steps near the end of the test set than at the beginning. The accuracy of the forecast therefore has a chance of being better towards the end of the test set. This approach will hereafter be referred to as "Full history".

3.4 Evaluation method

The forecasts made by the RNN-RBM model have been made as one-step rolling forecasts. A one-step rolling forecast is a combination of multiple forecasts, where each forecast is for a single time step beyond the input data, and where the input data is extended between the forecasts to include the known data for one additional time step, which makes the model create a forecast one time step after the last forecast, hence the "rolling". The input data for each individual forecasted value has been set, for the first set of tests, to a fixed amount of data, and then, for the second set of tests, to all available test data prior to the value being predicted. This is also the way the Holt-Winter model operates on the input data, with the restriction that a minimum of one season of data is needed for the Holt-Winter model. Each complete forecast has then been evaluated and compared with the corresponding baseline forecast using two different performance values: RSS, which should be as low as possible, and AUC, which should be as high as possible (with 1 as an upper limit), as explained below.


3.4.1 Residual sum of squares - RSS

The RSS value is used to compare continuous values and is calculated by taking the square of the difference between the forecasted value and the real value for each data point and summing over all data points, as given by

$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{3.2}$$

where $y_i$ is the real value to be forecasted and $\hat{y}_i$ is the forecasted value. The RSS value gives a comparable measure of how close the forecasted data is to the real data; the lower the RSS value, the better. The value depends on the range of the data, since the equation only takes the square of each difference. This makes it impossible to use as a comparable value for forecasts of different sources of data. However, the RSS value is a simple way of comparing forecasts of the same data source.
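Equation 3.2 is a one-liner in numpy:

```python
import numpy as np

def rss(y_true, y_pred):
    # Residual sum of squares (eq. 3.2).
    return np.sum((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```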

3.4.2 Area under the curve - AUC

The AUC value is used to compare the performance of classified data. When evaluating the performance of a classification model it is common to use the confusion matrix. The confusion matrix for a binary classification, e.g. whether a cell in the network is overloaded or not, can be seen in Figure 3.1. The confusion matrix can be extended to multiple classes, but in this project only the binary case is relevant.

On the diagonal from the top left corner to the bottom right corner are all values that were correctly classified. All other boxes contain the values that were misclassified. In the binary case the correct classifications are labeled as True Positive, $TP$, where a positive value was correctly classified as a positive value, and True Negative, $TN$, where a negative value was correctly classified as a negative value. The incorrect classifications are similarly labeled as False Positive, $FP$, where a negative value was incorrectly classified as a positive value,


Figure 3.1: The confusion matrix presents a visualization of how well a model manages to classify a set of values. The correctly classified values are shown on the diagonal from the top left corner to the bottom right corner, while the incorrectly classified values are presented in the other fields. Image source: [dS]

and False Negative, $FN$, where a positive value was incorrectly classified as a negative value. From these values a number of additional evaluation values can be calculated. Among them are the True Positive Rate, $TPR$, given by

$$TPR = \frac{TP}{P} \tag{3.3}$$

where $P$ is the number of positive values in the dataset, and the False Positive Rate, $FPR$, given by

$$FPR = \frac{FP}{N} \tag{3.4}$$

where $N$ is the number of negative values in the dataset. When plotting these values with $TPR$ on the y-axis and $FPR$ on the x-axis, a curve, called the ROC curve (Receiver Operating Characteristic), can be constructed by connecting the points $(0, 0)$, $(FPR, TPR)$ and $(1, 1)$, as can be seen in Figure 3.2.

Figure 3.2: The ROC space presents, graphically, the performance of a classification model. The dots represent the rate of true positives, $TPR$, versus the rate of false positives, $FPR$. A value close to the upper left corner, with high $TPR$ and low $FPR$, indicates a good classifier.

The area under this curve is then called the $AUC$ value (Area Under the Curve) and is an indication of how well the model is able to classify the values. An $AUC$ value below 0.5 means that the model performs worse than random and can be improved simply by reversing the predictions (as the bottom curve in Figure 3.2). An $AUC$ value of 0.5 means the model performs on par with a random guess, indicating a bad classifier. The closer the $AUC$ value is to 1, the closer the model is to a perfect classifier. Graphically, this is when the $(FPR, TPR)$ point is located in the upper left corner.

When making a classification, a threshold has to be set to separate the predicted negative values from the positive values. This threshold can be varied to maximize the performance of the model. If the model predicts too many false positives, the threshold could be increased, and conversely, if the model predicts too few true positives, the threshold could be decreased. The performance of different thresholds can be captured in an extended ROC plot, where all $(FPR, TPR)$-pairs are plotted, as can be seen in Figure 3.3.

Figure 3.3: The results from different choices of threshold for the classification model create a curve. The area under the curve, $AUC$, as well as the shape of the curve, gives an indication of the performance of the model.

By connecting all points, a smoother ROC curve can be obtained that shows how the performance depends on the thresholds. The closer the curve is to the edges and the top left corner, the better the model is at making correct classifications. The $AUC$ value connected to this plot represents the performance of the model better than when only considering one $(FPR, TPR)$-pair. The choice of threshold is very important for the performance of the model; however, the smoother the curve is, the less sensitive it is to the choice of threshold. The bottom curve in Figure 3.3 shows the results of a model that is a bad classifier. The curve crosses the random curve several times and is very sensitive to the choice of threshold. The shape of the curve, as well as the $AUC$ value of 0.52, indicates that the predictions might be more or less random. The upper curve, however, shows the results of a model that is able to classify the data fairly well, with an $AUC$ value of 0.7, well above the random value of 0.5. The shape is also smooth, which indicates that it is not very sensitive to the choice of threshold.
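The full-curve AUC described above can be computed directly from a real-valued forecast by binarizing the true series with an overload threshold and using the raw forecast as the classification score; the threshold choice below is an illustrative assumption, not the project's exact procedure.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def forecast_auc(y_true, y_forecast, overload_threshold):
    """AUC for overload detection derived from a real-valued forecast.

    The true series is binarized with the overload threshold; the raw
    forecast is used as the score, so sweeping the score threshold
    traces out the ROC curve (Section 3.4.2).
    """
    labels = (np.asarray(y_true) > overload_threshold).astype(int)
    fpr, tpr, _ = roc_curve(labels, np.asarray(y_forecast))
    return auc(fpr, tpr)
```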


3.5 Code

The execution of this project and all simulations has been made possible by using code developed by other groups. All baseline simulations have been conducted using code freely available for download at [Que]. That code has not been modified in any way. For the simulations using the RNN-RBM model, another set of code has been used. The code is freely available for download at [BL] and is explained further in [BLBV12]. It has been modified to fit the dataset. The modified version, together with all other code, is available to Ericsson in the internal Git system.


Chapter 4

Results

The forecasts have been made using two different approaches, and the results of each approach have been compared to the forecast made by the baseline model, Holt-Winter. The results of the first approach, where the input data is transformed to fit the original design of the model, are presented in Section 4.1. The results of the second approach, where instead the model is adapted to fit the input data, are presented in Section 4.2. All comparisons are made for the same cell, randomly chosen from the set, called Cell A, and for the same datatype, download, also randomly chosen.

4.1 Binary-valued input data

The RNN-RBM model used in this project is designed to work with binary data representing a MIDI file. The model takes a binary vector with 88 values as input, where each point in the vector corresponds to a certain note in a MIDI sequence, called a "piano-roll"¹. There are dependencies between the notes, where certain combinations of notes, e.g. combinations that form chords, have a high probability of occurring, while other combinations have a very low probability of occurring. There are also temporal dependencies, where the probability that one combination of notes will follow another varies depending on the combinations.

¹ https://en.wikipedia.org/wiki/Piano_roll#In_digital_audio_workstations

The early testing of the model on the mobile network data are made by converting the real-valued data in the dataset for the download data of Cell A into binary vectors, where the data is converted into np different percentile values using a function in the python library numpy where the output value from the function for each data point will be the index of the percentile that the data is connected to. The index is then used as the base for creating a vector with length np+1 where the first point corresponds to a missing value in the input data and the following data points correspond to the different percentiles.

All positions in the vector are zero except the one whose index corresponds to the percentile value. A forecast made in this way cannot express that a predicted value is “close” to the correct value; it can only be judged as correct or incorrect.
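A minimal sketch of this encoding is given below. The thesis code is not reproduced here; the function name and the specific numpy percentile routines are assumptions that merely follow the description above, with index 0 reserved for missing values:

    import numpy as np

    def to_percentile_onehot(series, n_p):
        # Percentile edges estimated from the non-missing data points.
        edges = np.nanpercentile(series, np.linspace(0, 100, n_p + 1))
        onehot = np.zeros((len(series), n_p + 1), dtype=np.int8)
        for t, x in enumerate(series):
            if np.isnan(x):
                onehot[t, 0] = 1  # index 0 marks a missing value
            else:
                # Map the value to its percentile bin (1 .. n_p).
                idx = min(np.searchsorted(edges[1:], x) + 1, n_p)
                onehot[t, idx] = 1
        return onehot

With n_p = 5, for example, each data point becomes a 6-dimensional binary vector, matching the binary input format the model was originally designed for.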

The number of percentiles can be chosen freely. A low number of percentiles removes more of the information in the input data but at the same time makes the output easier to forecast, in the sense that there are fewer options to choose from. Conversely, with a higher number of percentiles more information is preserved in the input data, while it becomes harder to forecast the correct value due to the larger number of options. No matter how many percentiles are used, some information is bound to be lost2.

Three different numbers of percentiles have been chosen for evaluation: 5, 20, and 100, to show how the performance of the model varies with the number of percentiles. For each of these, a forecast made by the Holt-Winter model has also been produced. For the RNN-RBM model the number of hidden units in the RBM layer has been set to 200, and the number of hidden units in the RNN layer has been set to 100 and 1000 in two different simulations. The one-step rolling forecasts for two consecutive days using the RNN-RBM model with the two different set-ups can be seen next to the Holt-Winter forecast in Figure 4.1.

2Of course, even continuous data is discrete when digitalized, which in practice gives an upper limit to the number of percentiles.

Figure 4.1: The left and the middle columns show the forecasts made using the RNN-RBM model with 100 and 1000 hidden units in the RNN layer, respectively. The right column shows the forecasts made using the Holt-Winter model. The rows correspond to dividing the input data into different numbers of percentiles: 5 on top, 20 in the middle and 100 at the bottom. The blue line is the real data and the red line is the forecasted data.

As can be seen in Figure 4.1, the forecasts using the RNN-RBM model are mostly random, independent of how many percentiles the data is divided into or how many hidden units the model uses in the RNN layer. The Holt-Winter model notably outperforms the RNN-RBM model, with an RSS value about an order of magnitude lower.

4.2 Real-valued input data

In the following, main part of the project, the input data is not transformed; instead, the RNN-RBM model is adapted to fit the input data. In this way no information in the input data is lost. The initial testing is made on one single cell with one single datatype, as described in Section 4.2.2, giving only one time series as input data. The tests then proceed to first handle multiple cells and a single counter, as described in Section 4.2.3, then a single cell and multiple counters, as described in Section 4.2.4. Lastly, the combination of multiple cells and multiple counters is tested, as described in Section 4.2.5. Before these sections, a short description of the baseline forecasts is given in Section 4.2.1.

In all tests, the number of hidden units in both the RBM layer and the RNN layer has been varied between simulations to evaluate which numbers produce the best results. For the RBM layer the number of hidden units has been 100, 300 and 1000 units.

For the RNN layer the values have been 1000, 2000 and 5000 units. The batch size during training has also been varied between 1 and 2 days, corresponding to 96 and 192 time steps, to evaluate whether the results were affected. The initial data during testing has been set both to a “Sliding window” of 14 days and to “Full history”, using all available historical data.
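The two test configurations can be sketched as follows (illustrative code, not the project's implementation; the dataset's 15-minute resolution gives 96 time steps per day, as noted above):

    STEPS_PER_DAY = 96  # 15-minute intervals

    def initial_data(series, t, mode="sliding", window_days=14):
        # History handed to the model before forecasting time step t:
        # either a fixed 14-day sliding window or the full history.
        if mode == "sliding":
            start = max(0, t - window_days * STEPS_PER_DAY)
            return series[start:t]
        return series[:t]  # "Full history"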

For each of the tests the results have been evaluated using the RSS value and the AUC value, as described in Section 3.4, and presented in a set of plots. The limits of the y-axis have been fixed to ease the comparisons; as a result, some of the forecast plots reach above the limit. Each plot consists of a complete one-step rolling forecast at the bottom, with a zoom-in over two days in the upper left corner and an ROC curve in the upper right corner. The initial 14 days and the last day have been cut away from the forecast, the first 14 days to let the forecasts converge and the last day due to noisy data, giving a total of 39 forecasted days.
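The RSS evaluation can be summarized by the following sketch (assumed names; the trimming constants mirror the description above):

    import numpy as np

    STEPS_PER_DAY = 96

    def trimmed_rss(y_true, y_pred, skip_days=14, drop_days=1):
        # Discard the first 14 days (convergence) and the last day
        # (noisy data) before computing the residual sum of squares.
        lo = skip_days * STEPS_PER_DAY
        hi = len(y_true) - drop_days * STEPS_PER_DAY
        residuals = np.asarray(y_true[lo:hi]) - np.asarray(y_pred[lo:hi])
        return float(np.sum(residuals ** 2))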

Figure 4.2: One-step forecast made using the Holt-Winter model (red) on unchanged data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.

To be able to calculate the ROC curve and the corresponding AUC value for this problem, it first has to be converted into a classification problem. In these tests, the classification task has been defined as predicting when the amplitude of the input data is in the top 10th percentile of all input data.
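A sketch of this conversion, with illustrative stand-in data, is shown below; the observed series defines the binary labels and the forecasted amplitudes act as the classifier scores:

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    rng = np.random.default_rng(0)
    y_true = rng.random(96 * 39)                          # stand-in for the observed series
    y_pred = y_true + rng.normal(0.0, 0.1, y_true.size)   # stand-in forecast

    # A time step is a positive example when the observed amplitude is
    # in the top 10th percentile of all input data.
    labels = (y_true >= np.percentile(y_true, 90)).astype(int)
    fpr, tpr, _ = roc_curve(labels, y_pred)
    print("AUC =", auc(fpr, tpr))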

4.2.1 Baseline forecasts

Two baseline forecasts, using the Holt-Winter model, have been created for comparison.

Figure 4.2 shows a forecast made with the input data unchanged. This procedure, however, produces negative values, which have manually been set to zero since the real data is never negative. Figure 4.3 shows a forecast made with input data that has been transformed by taking the logarithm of each data point and then normalizing over the entire dataset. This produces a forecast with no negative values.
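The second preprocessing variant can be sketched as follows. The exact normalization used is not specified beyond “logarithm, then normalized over the entire dataset”, so the zero-mean/unit-variance choice and the small epsilon guarding against zero values are assumptions:

    import numpy as np

    EPS = 1e-6  # assumption: guards the logarithm against zero values

    def transform(x):
        # Log-transform, then normalize over the entire dataset.
        y = np.log(x + EPS)
        return (y - y.mean()) / y.std(), y.mean(), y.std()

    def inverse_transform(y, mean, std):
        # Inverting the transform can never produce a negative value,
        # unlike forecasting on the raw data, where negative
        # predictions must be clipped to zero by hand.
        return np.maximum(np.exp(y * std + mean) - EPS, 0.0)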

Figure 4.3: One-step forecast made using the Holt-Winter model (red) on modified data and the corresponding RSS value. The real data is shown in blue. The top right corner shows the ROC plot with the corresponding AUC value.

When comparing the two forecasts, the RSS value is slightly better for the first (1.75 versus 1.76), which would indicate that the first forecast is closer to the real data than the second. However, when looking at the AUC value in the ROC plot, the second is slightly better (0.774 versus 0.770), which would indicate that the second forecast classifies the top 10 % amplitudes slightly better than the first. The differences are very small in both cases, however, and the two forecasts can more or less be regarded as identical.

4.2.2 Single cell and single counter

When making forecasts using only one single cell and one single counter, the RNN-RBM model operates on data with the same dimensionality as the Holt-Winter model. Since one of the strengths of the RNN-RBM model lies in its ability to handle multivariate input data, the results from these forecasts are not in themselves enough to determine which of the models is most accurate.

All results from the simulations, with both a limited amount of input data (7 days prior to the forecasted data point) and an unlimited amount of input data (all days prior to the forecasted data point), can be seen in Table B.1 in Appendix B. The results from two of the simulations producing the best results are shown in Figure 4.4 and Figure 4.5. Both these simulations used 100 hidden units in the RBM layer, 2000 hidden units in the RNN layer and a batch size of 2 days of data per batch. The real data is shown in blue and the forecasted data in red. Two of the days are zoomed in to give a better view of the individual data points.

Figure 4.4: One-step forecast made using the RNN-RBM model (red) with a single cell, a single counter and 7 days of history as input. The real data is shown in blue and the corresponding RSS value is shown above the plot. The top right corner shows the ROC plot with the corresponding AUC value.

As can be seen from the figures, and further in Table B.1, the RNN-RBM model performs better than the baseline model in terms of the AUC value, both for a limited and for an unlimited amount of input data. Only when using an unlimited amount of input data, however, is the RNN-RBM model able to outperform the baseline model in terms of the RSS value.

4.2.3 Multiple cells and single counter

When making forecasts using multiple cells but only one single counter, the spatial dependence should come into play and help produce better forecasts. The input data is now two-dimensional, in contrast to the one-dimensional data used in the baseline forecast.
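The step from univariate to multivariate input amounts to stacking the per-cell series into a matrix, as in the sketch below (stand-in data; in the actual tests the series are the download counters of the selected cells):

    import numpy as np

    rng = np.random.default_rng(1)
    n_steps, n_cells = 96 * 56, 4   # e.g. 56 days of 15-minute samples

    # One series per cell; stacking them column-wise gives a
    # (time_steps, n_cells) matrix, so every time step is a vector and
    # the model can exploit dependencies between cells.
    series_per_cell = [rng.random(n_steps) for _ in range(n_cells)]
    X = np.stack(series_per_cell, axis=1)
    print(X.shape)  # (5376, 4)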

The spatial dependence is not equally strong between all pairs of cells, but depends on
