
Machine Learning for Forecasting Signal Strength in Mobile Networks

Academic year: 2021



Master Thesis in Statistics and Data Mining

Machine Learning for Forecasting

Signal Strength in Mobile Networks

Nikolajs Prihodko

Division of Statistics and Machine Learning

Department of Computer and Information Science


Supervisor

Oleg Sysoyev

Examiner



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circum-stances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the con-sent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping Uni-versity Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.



Abstract

In this thesis we forecast the future signal strength of base stations in mobile networks. Better forecasts might improve handover of mobile phones between base stations, thus improving overall user experience. Future values are forecast using a series of past signal strength measurements. We use vector autoregression (VAR), a multilayer perceptron (MLP), and a gated recurrent unit (GRU) network. The hyperparameters of these models, including the set of lags, are optimised using Bayesian optimisation (BO) with Gaussian process (GP) priors. In addition to BO of the VAR model, we optimise the set of lags in it using a standard bottom-up and top-down heuristic. Both approaches result in similar predictive mean squared error (MSE) for the VAR model, but BO requires fewer model estimations. The GRU model provides the best predictive performance of the three models. However, none of the models (VAR, MLP, or GRU) achieves the accuracy required for practical applicability of the results. Therefore, we suggest adding more information to the model or reformulating the problem.


Acknowledgments

I would like to express my gratitude to my Linköping University thesis supervisor Oleg Sysoyev for advice and productive discussions. I would also like to thank my Ericsson supervisor Jan Lindgren for keeping his hand on the pulse of the thesis. Furthermore, I would like to extend my gratitude to Linnea Faxén for patiently helping me solve any problems that I encountered, and to László Hévizi for the great support with the simulator.

Moreover, I would like to thank Caroline Svahn for encouragement and support with the thesis, and Torrin Raoufi-Danner for thoroughly reviewing it and giving valuable suggestions. Lastly, I would like to mention that this thesis would have been impossible without the community behind numerous open-source and open-access projects.


Contents

Abstract
Acknowledgments
Contents
1 Introduction
  1.1 Background
  1.2 Literature review
  1.3 Objective
  1.4 Limitations
2 Data
  2.1 Data source
  2.2 Raw data
  2.3 Preprocessing
  2.4 Sample description
3 Methods
  3.1 Justification
  3.2 Linear models
  3.3 Non-linear models
  3.4 Hyperparameter optimisation
  3.5 Forecasting
  3.6 Performance evaluation
  3.7 Summary
4 Results
  4.1 Stationarity of data
  4.2 Hyperparameter optimisation
  4.3 Performance comparison
5 Discussion
  5.1 Results
  5.2 Methods
  5.3 Suggestion for further research
  5.4 Ethical issues
6 Conclusions
Bibliography
A Appendix
  A.1 List of abbreviations
  A.2 Bayesian optimisation
  A.3 Performance metrics
  A.4 Prediction intervals
  A.5 Relation between dB and dBm


1 Introduction

1.1 Background

Over the last few decades mobile communication has advanced from being a luxury service for a small minority of wealthy individuals into a commodity good available to most of the world’s population (Dahlman, Parkvall & Skold, 2013, p. 1). Although the technology has changed substantially, for example in the move from analogue to digital signals, its main principles remain the same.

Mobility is made possible by the mobile phone, called user equipment (UE) in the context of 4G networks, always being connected to a network (Dahlman et al., 2013, p. 301). The cellular land mobile network consists of a collection of individual base stations (BSs). A BS is a fixed-location transceiver that communicates directly with a UE by sending data to and receiving data from it. A BS is capable of transmitting the data to other base stations, to which other UEs are connected, thus enabling communication between UEs in different geographical locations. An area served by a BS is called a cell (Stüber, 2017). This idea is illustrated in Figure 1.1.


Figure 1.1: Illustration of a mobile network. A user equipment (UE) is connected to a base station (BS) and communicates with it. BSs transmit data among each other, thus allowing UEs in different cells to communicate with each other. A cell is a region served by a base station.


When initially accessing the network, a UE performs a cell search and connects to the cell with the best available connection. However, UEs tend to move, which can make the initially connected cell suboptimal or even unavailable. In order to maintain a persistent connection, a UE should estimate the signal reception quality of the neighbouring cells and perform either a handover to another cell or a cell reselection when its signal becomes stronger. The difference between the two procedures is that the former happens when a UE is active (e.g. during an ongoing call), while the latter happens when the phone is idle (turned on but not actively communicating with the network). Moreover, a handover requires active decision making by the network, while cell reselection is handled by the UE.

In this thesis we focus on the possibilities of using machine learning to improve handover (not reselection) procedures.

1.1.1 Handover

Every handover has a certain probability of failure, which can result in a dropped call or a lost internet connection. A major task in a cellular network is to minimise this probability and to determine when to perform a handover by observing the connection quality (Stüber, 2017, p. 593). If the signal strength or quality deteriorates too rapidly and is not caught by the monitoring system, a drop occurs, which degrades the user experience. On the other hand, if there are too many handovers, the capacity of the network is reduced by the increased mobile traffic used for control purposes. This also worsens the user experience. Moreover, each handover causes a short interruption of the data stream, which adds up to a noticeable decrease in throughput when handovers occur too often.

A handover is mainly governed by two manually predefined parameters: the margin (also called signal strength difference or hysteresis) and the time-to-trigger (Stüber, 2017, p. 596). The handover is performed when the signal strength from a candidate cell is stronger than that of the serving cell by a certain number of decibels (the margin) during a certain period of time (the time-to-trigger). This is illustrated in Figure 1.2. The decision to perform a handover is made by the network. When a measurement fulfils the configured reporting criteria (e.g., signal strength below a threshold), a UE starts sending measurement reports (e.g., signal strength of the serving and the neighbouring cells) to the serving cell, which decides whether and to which cell a handover should be performed. The strength of the received signal is usually measured in terms of reference signal received power (RSRP) in 4G networks. RSRP is expressed in decibel-milliwatts (dBm), i.e. the signal power ratio measured in decibels (dB) relative to a reference power of 1 milliwatt (mW).


Figure 1.2: Illustration of handover design parameters. The handover is performed when the signal strength from a candidate cell is stronger by a certain amount of decibels (margin) than the serving cell during a certain period of time (time-to-trigger).
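The margin/time-to-trigger rule can be sketched in a few lines of code. This is an illustrative sketch, not code from the thesis: the function name, the 3 dB margin, and the four-sample time-to-trigger window are all assumed values.

```python
def should_handover(serving_rsrp, candidate_rsrp, margin_db=3.0, time_to_trigger=4):
    """Trigger a handover when the candidate cell's RSRP has exceeded the
    serving cell's RSRP by `margin_db` decibels for `time_to_trigger`
    consecutive measurement periods (series are ordered oldest to newest)."""
    if len(serving_rsrp) < time_to_trigger:
        return False
    recent = zip(serving_rsrp[-time_to_trigger:],
                 candidate_rsrp[-time_to_trigger:])
    return all(c > s + margin_db for s, c in recent)

# The candidate overtakes the serving cell and stays more than 3 dB
# stronger for the last four samples, so a handover is triggered.
serving = [-95.0, -98.0, -101.0, -103.0, -105.0, -107.0]
candidate = [-100.0, -99.0, -97.0, -97.0, -96.0, -95.0]
should_handover(serving, candidate)  # -> True
```

A real network evaluates this condition continuously; the sketch only checks it once over a fixed window of reported measurements.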


There are two types of handovers: vertical and horizontal (Miao, Zander, Sung & Slimane, 2016).

• A vertical handover occurs when a UE switches to a different frequency served by the same base station. This can happen, for example, due to load balancing: a UE is handed over to another, less busy frequency, thus improving the quality of service.

• A horizontal handover occurs when a UE is handed over to another base station within the same frequency.

In this thesis we focus on horizontal handovers because they are caused by UE mobility.

1.1.2 Problems with handovers

Handovers, as mentioned above, are both risky and costly. They are risky because an unsuccessful handover leads to a drop in connection, and recovering from a drop takes more time than the handover procedure itself. Handovers are costly because they involve a large amount of control traffic, which could otherwise be used for increased cell capacity. Therefore, it is important to perform handovers at optimal times and to avoid unnecessarily frequent handovers.

In an urban environment there are obstructions, such as buildings, between a base station and a UE and around them. These obstacles affect the signal and cause signal fading (Chia & Warburton, 1990). There are multiple types of fading; for example, shadow fading occurs when the signal is shadowed by an obstacle. This effect is especially noticeable when a UE turns around a corner and the signal deteriorates rapidly, which is called the corner effect. Murase, Symington and Green (1991) have shown that in such conditions a signal can deteriorate by as much as 25 dB over 10 meters, which is substantial and enough to cause a drop in connection.

It is important to adjust the handover parameters such that a handover is performed before the signal has deteriorated. However, the current approach of setting the same handover parameters for every UE within the cell might be suboptimal. In the same geographical position and with the same RSRP values, it might be best to perform a handover for one UE but not for another. The optimal choice depends on, for example, the direction of movement: for one UE the signal will become shadowed (e.g., due to turning around a corner), while for another UE it is best to keep the connection (e.g., due to straight movement). Therefore, predicting future RSRP values for each particular UE could result in better handovers.

1.2 Literature review

Most of the relevant work on optimising handover parameters using statistical and machine learning modelling can be grouped into four main directions of research:

1. prediction of the number of handovers,
2. prediction of the next serving cell,
3. signal quality prediction,
4. grouping similar UEs and optimising parameters for each group.

1.2.1 Prediction of the number of handovers

This group of papers focuses on optimising handover parameters at the cell rather than the UE level.


Vy, Tung and Lin (2017) cluster cells with similar handover behaviour together and then optimise handover parameters for each of the clusters. Handover patterns change in both time and location. For example, cells along highways have more connection attempts than cells in buildings, even if the traffic is the same. Moreover, some cells experience a larger load on weekdays (cells in office areas), while others experience it on weekends (cells in residential areas).

Therefore, the authors cluster cells using the average hourly, daily, and weekly handover attempts, with k-means as the clustering algorithm. Then, within each cluster, they use linear and polynomial autoregressive, neural network (NN), and Gaussian process (GP) models to forecast the number of handover attempts.

The authors show reasonable predictive power of these models, but do not explain how predicting hourly handover attempts can help optimise handover parameters.

1.2.2 Movement prediction

This group of papers attempts to predict the next cell to which a UE will connect.

Michaelis and Wietfeld (2006) seek to predict the next cell to which a UE will be connected based on the history of the cells to which it was previously connected (the UE trace). They simulate UE movement in a simple topology: six areas connected by streets. Most UEs select one of the areas and proceed to it, while the other UEs move randomly to introduce noise into the model.

The authors study the impact of the choice of algorithm and of the length of the history of previously connected cells on prediction accuracy. They consider k-nearest neighbour (k-NN), decision tree (DT) and support vector machine (SVM) algorithms.

Their findings suggest that there is no clear difference in the performance of these algorithms. The history length, however, has a critical impact on performance. There is a noticeable jump in accuracy when the history length is increased from one period (the currently connected cell) to two (the current and the previous one): from 50% to more than 80% accuracy. This happens because the combination of the current and the previously connected cells captures the direction of movement. Increasing the history to more than two cells has a marginal impact on accuracy. For example, using eight previous cells boosts accuracy by less than 10 percentage points compared to using two cells.

Moreover, the authors consider a few voting schemes to combine the three algorithms. These schemes allow acting on a prediction only when multiple algorithms are in agreement with each other. This further reduces the error rate; however, it comes at the expense of not making any prediction when the prediction is uncertain. Thus, the resulting accuracy number is inflated.

Chen, Mériaux and Valentin (2013) attempt to predict the next connected cell using information about previously connected cells and the series of signal strength measurements reported by a UE within the currently connected cell. They simulate UE movement in both a simple Manhattan-grid scenario and a realistic scenario of the city of Frankfurt.

One of the important modelling assumptions is that when a UE enters a cell, it randomly chooses a speed and one of the predefined trajectories to move inside the cell.

The authors train a separate classification SVM for each pair of previous and current serving cell. Inputs to the model are normalised (to ensure the same length) vectors of signal strength measurements; the output is the next connected cell.

Compared to the method using only the UE trace, using signal strength information improves accuracy. In the Manhattan-grid scenario, using 60% of the signal strength series, they achieve an accuracy of 100%. In the more realistic Frankfurt scenario, they achieve an accuracy of 95% using 75% of the data gathered within a cell.


1.2.3 Signal quality prediction

This group of papers attempts to predict the quality of the signal depending on which cell a UE connects to.

Ali, Baldo, Mangues-Bafalluy and Giupponi (2016) attempt to predict which base station will provide better quality of service in the presence of obstacles. They simulate UE movement on a map with three base stations. A UE starts its movement from one of them. As it moves, the signal deteriorates and a handover to one of the remaining two base stations should be performed. One of these base stations has a stronger signal, but depending on the UE trajectory it might become shadowed by an obstacle. Therefore, in some cases it is optimal to connect to the stronger base station, while in other cases it is best to connect to the weaker, but not shadowed, base station. While moving, a UE downloads a file, and if it chooses the base station that later becomes shadowed, the download is interrupted.

The authors then train two NNs for each of the base stations. One NN predicts whether a file will be downloaded successfully given the RSRP readings triggered by an A3 event¹, and the other predicts the speed of the download, given its success. By adjusting handovers using the information provided by the NNs, the authors manage to increase the number of completed downloads by 75% and the speed of downloads by 84%.

1.2.4 Categorising UE trajectories

This group of papers attempts to identify groups of UEs with similar behaviour and optimise handover parameters for each group separately.

Hegazy and Nasr (2015) split users into four groups based on their speed (slow and fast) and network usage (real-time and not real-time). They then optimise handover parameters for each of the groups separately; however, they do not introduce any algorithm to identify which group a user belongs to, but rather assume that it is known.

Sas, Spaey and Blondia (2015) attempt to predict the mobility of users by assigning them to one of the groups and assuming that users in a group are homogeneous. They simulate the movement of UEs on a grid with multiple macro- and microcells. They collect series of RSRP readings from A4 events² for each of the users. Then they assign each of the UEs to one of the reference UEs, which makes the algorithm similar to a k-NN algorithm with k = 1. Matching happens using an author-developed Modified Dynamic Time Warping similarity measure (Sas, Spaey & Blondia, 2014). This is a modified version of a well-known technique for finding an optimal alignment between two time series, adjusted to overcome some of the drawbacks of the original algorithm.

It is assumed that a UE will have similar RSRP readings to the reference UE to which it was assigned. This allows optimising handover for each group separately. Reference UEs are selected at random, and not more than 20 are allowed. Using this method, the number of unnecessary handovers is reduced to a varying degree depending on the UE speed.

Castro-Hernandez and Paranjape (2017) also attempt to divide users into similar groups. However, they use clustering instead of reference UEs and simulate movement in a realistic indoor environment. They focus on RSRP readings from A2 events³. These events are triggered on the edge of the cell, where a potential handover should be performed. The series of RSRP readings are recorded and then clustered with the k-means algorithm, using shapelets⁴ as the similarity measure between time series. The optimal value of k is trained as part of the method.

The authors manage to successfully cluster RSRP trajectories such that each cluster corresponds to one of the exits in the indoor environment. Since each exit has a different RSRP trajectory (e.g., some exits have a quick deterioration of RSRP values, while others have a

¹ An A3 event is triggered when a target cell becomes better than a threshold by a margin.
² An A4 event is triggered when a target cell becomes better than the serving cell by a margin.
³ An A2 event is triggered when the serving cell becomes worse than a threshold.

⁴ Shapelets are subsequences of time series that attempt to capture the most discriminative shape features of the series.


gradual decrease), it should be possible to improve handover by optimising policies for each of the clusters.

1.3 Objective

The most relevant approach for identifying potential sudden drops in the signal strength is found in the papers by Sas et al. (2015) and Castro-Hernandez and Paranjape (2017). They cluster UEs into groups and then optimise handover parameters within each cluster. However, this can be viewed as implicitly using past RSRP values to predict future RSRP values: past observations are used for cluster assignment, and future RSRP values are inferred from the mean (or reference) RSRP value at each point in time in the assigned cluster.

Therefore, in this thesis we attempt to predict future RSRP values from past readings for each UE individually. Moreover, instead of building local models for each of the groups, we try building a global but possibly more complex model. Thus, we attempt to answer the following questions:

1. How accurately is it possible to predict future RSRP values using past RSRP values in an urban environment?

2. Which of the models achieves the best predictive performance?

3. Is the achieved predictive performance good enough for improving handovers?

1.4 Limitations

This thesis has a number of constraints. First, we use simulated data, because real-life data is either non-existent or hard to obtain.

Second, we make a number of simplifications within the simulator:

• we do not model interference caused by the communication between UEs and cells;
• we only consider macrocells with three cells per base station, which is a simpler topology than the one typically deployed in Long-Term Evolution (LTE) networks in densely populated areas;
• we do not take into account the thermal noise floor: even very weak signals are recorded;
• we measure RSRP values of all cells within the simulation, while the LTE standard supports only up to eight neighbours.


2 Data

2.1 Data source

All of the data used in the analysis comes from a simulator that simulates UE movement and the corresponding signal strength measurements.

2.1.1 Simulator

Data comes from a real-time simulator used for preliminary assessment of radio access network parameters and for machine learning experiments. The simulator does not model interference caused by communication between UEs, but it has a sophisticated signal propagation model. A map of signal strengths is generated prior to the simulation and is rasterised with a pixel size equal to a 5-by-5 meter area. The simulation is run for 3.5 hours with 100 UEs on a 1 km² area. The map consists of streets of different widths and buildings of different sizes. The city map used in the simulation is shown in Figure 2.1; the modelled area is 1 by 1 km. There are 14 BSs: seven centre and seven surrounding BSs, approximately equally spread over the map. Each BS has three antennae, and therefore there are three series of RSRP values associated with it. Each antenna serves a certain sector, but because each sector has its own RSRP value, we do not distinguish between cells and sectors of a single BS. Thus, we assume that each BS serves three different cells. The locations of the antennae are shown in Figure 2.1.

Base stations are divided into centre macrocells and surrounding macrocells. The centre macrocells are the ones of interest, and we model only their signal strengths. The surrounding macrocells are present to account for signal interference caused by the surrounding environment. Thus, we model 21 cells (3 cells for each of the 7 centre BSs).

2.1.2 UE movement

A UE randomly chooses a point on the map and proceeds to it using the fastest path identified by Dijkstra’s algorithm (Dijkstra, 1959). When a UE arrives at the destination, it spends a random amount of time indoors, moving slower and randomly. It then randomly picks another destination in a different part of the map¹ and proceeds to it.
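Dijkstra’s algorithm finds the fastest path on a weighted graph of roads. A minimal sketch follows; the toy road graph and function are illustrative, not simulator data.

```python
import heapq

def dijkstra(graph, source, target):
    """Minimal Dijkstra sketch for the fastest path on a road graph.
    `graph` maps a node to {neighbour: travel_time}; returns the total
    travel time and the node path (or inf and [] if unreachable)."""
    queue = [(0.0, source, [source])]  # (cost so far, node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, w in graph.get(node, {}).items():
            if nbr not in visited:
                heapq.heappush(queue, (cost + w, nbr, path + [nbr]))
    return float("inf"), []

# Toy example: travel times between intersections A..D.
roads = {"A": {"B": 2.0, "C": 5.0}, "B": {"C": 1.0, "D": 4.0}, "C": {"D": 1.0}}
cost, path = dijkstra(roads, "A", "D")  # -> 4.0, ["A", "B", "C", "D"]
```

The simulator presumably operates on a far larger road graph derived from the city map, but the principle is the same.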


Figure 2.1: City map and BS locations used in the simulation. Each BS has three cells (sectors).

When outdoors, UEs move faster on wider roads than on narrower ones to simulate different traffic conditions. Moreover, a UE slows down before making sharp turns, which makes its movement more realistic.

2.2 Raw data

Data comes from a 3.5-hour long simulation with 100 U Es.

Approximately every 0.1 seconds a measurement of RSRP between each UE and each cell is recorded. The measurements are sampled in real time, and therefore the exact time difference between them is affected by other processes running on the system². The recorded measurements are written in a dictionary format, where each element contains all RSRP measurements for one UE at a single point in time. Example of an entry:

{"user_id": 1045,
 "pos": [623.5, 658.8],
 "rsrp_list": [-124.851, -155.912, ..., -152.913],
 "ts": 1520947901404000}

In this entry, user_id identifies a UE, pos contains coordinates expressed in meters, rsrp_list contains the RSRP values of each of the cells, and ts is the number of microseconds since the start of the year 1970 (Unix time).

Raw data is then converted into a table format with a single observation (all RSRP measurements) per UE. An example of this table can be found in Table 2.1.
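This conversion can be sketched as follows, assuming entries shaped like the example above. The function and column names are ours, not prescribed by the thesis.

```python
def entries_to_rows(entries):
    """Flatten simulator log entries (one dict per UE per time step) into
    flat table rows: timestamp in seconds, UE id, x/y coordinates, and one
    column per cell's RSRP value."""
    rows = []
    for e in entries:
        row = {
            "timestamp_s": e["ts"] / 1e6,  # microseconds since 1970 -> seconds
            "ue_id": e["user_id"],
            "x_m": e["pos"][0],
            "y_m": e["pos"][1],
        }
        row.update({f"cell_{i}": v for i, v in enumerate(e["rsrp_list"])})
        rows.append(row)
    return rows

entry = {"user_id": 1045, "pos": [623.5, 658.8],
         "rsrp_list": [-124.851, -155.912, -152.913],
         "ts": 1520947901404000}
entries_to_rows([entry])[0]["cell_0"]  # -> -124.851
```

A list of such row dictionaries loads directly into a tabular structure like Table 2.1.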

² The actual distribution is described in Section 2.4.

Table 2.1: Example of data converted into the table format. RSRP values are in dBm.

Timestamp, s    UE ID   x, m    y, m    Cell 0    Cell 1    ...   Cell 21
1520947901.46   1099    462.4   638.6   −103.60   −156.49   ...   −154.27
1520947901.46   1100    623.5   658.8   −124.85   −103.72   ...   −152.91
1520947901.56   1001    446.8   491.6   −122.08   −149.33   ...   −171.57


2.3 Preprocessing

We preprocess the raw data in table format before modelling starts: we downsample it to decrease the frequency, then add noise and standardise it.

2.3.1 Downsampling

The measurements are obtained at (almost) regular intervals of 0.1 seconds. This is unreasonably frequent: performing measurements that often would quickly drain a UE’s battery. Since the simulator does not allow for less frequent reporting, we keep only every 10th observation, thus obtaining measurements at a frequency of 1 second. According to industry experts at Ericsson, this frequency is more realistic.

2.3.2 Noise

The RSRP readings in the simulator are deterministic and precomputed for every location before the start of the simulation. This is unrealistic; therefore, we add Gaussian noise with a standard deviation of 1 dBm, a value suggested as reasonable by industry experts at Ericsson.

2.3.3 Standardising

Before estimating the models, we scale each individual variable to have mean zero and standard deviation one. Since all our variables lie in a comparable range, this is done mostly to reduce numerical errors, because the distribution of floating-point numbers is denser in regions closer to zero (Gentle, 2009, p. 91).
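The three preprocessing steps can be sketched together. The step size (every 10th observation), noise level (1 dBm), and per-variable standardisation come from the text above; the function name and the fake input data are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(series, step=10, noise_sd=1.0):
    """Preprocessing sketch for one UE's RSRP matrix (time steps x cells,
    0.1 s apart): keep every `step`-th observation (-> 1 s frequency),
    add Gaussian noise with `noise_sd` dBm standard deviation, and
    standardise each cell's series to zero mean and unit variance."""
    x = np.asarray(series, dtype=float)[::step]      # downsample to 1 Hz
    x = x + rng.normal(0.0, noise_sd, size=x.shape)  # measurement noise
    x = (x - x.mean(axis=0)) / x.std(axis=0)         # standardise per cell
    return x

raw = rng.uniform(-160, -90, size=(1000, 21))  # fake RSRP log, 21 cells
clean = preprocess(raw)
# clean.shape == (100, 21); each column now has zero mean and unit variance
```

In practice the standardisation statistics would be computed on the training portion only and reused for test data; the sketch standardises the whole array for brevity.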

2.4 Sample description

In order to understand the data better, we describe the sample after implementing the preprocessing steps.

As we have mentioned earlier, the actual time difference between observations might not be strictly 1 second; therefore, we investigate how far off this difference is. The number of observations per UE during the whole simulation is shown in Figure 2.2. We see that the UEs have different but very similar numbers of observations. Moreover, according to the histogram of time differences between two consecutive measurements (Figure 2.3a), the absolute majority of differences lies within the interval between 1.00 and 1.01 seconds, a variation of less than 1%. Thus, the varying time differences between consecutive observations are unlikely to significantly affect the results.

The histogram of the average UE speed between consecutive observations is shown in Figure 2.3b. The distribution contains a peak around 5 km/h, which corresponds to walking speed, and a decaying tail with speeds of up to 70 km/h, similar to the velocities of moving vehicles. Therefore, the simulation resembles a scenario in which UEs walk most of the time when travelling between locations, but occasionally take vehicles.

Figure 2.4 shows the distribution of RSRP values of each cell. A row contains cells that belong to a single BS. We see that all distributions are different, but they form roughly bell-shaped curves and lie within comparable ranges. Therefore, the values seem to be reasonable. The heatmap of the most popular locations on the map is shown in Figure 2.5. We see that UEs move along the same paths, which correspond to streets.


Figure 2.2: Histogram of the number of observations per UE. The total number of UEs is 100.


Figure 2.3: Histograms of (a) time differences and (b) UE speeds between consecutive observations of the same UE.


Figure 2.4: Histogram of RSRP values per cell. Each row contains RSRP readings from a single BS. The three columns within each row represent the three cells (sectors) of a BS.


Figure 2.5: Heatmap of the most common UE locations overlaid on the simulation map. A frequency of 0 means that no UE has ever been in the location, and a frequency of 1 implies that all UEs have always been in the location.


3 Methods

In this section we justify the choice of the models and present the notation. We then build and compare three types of models. First, we use linear vector autoregression (Lütkepohl, 2005), described in Section 3.2; we also discuss issues of stability and stationarity. Then we explore two types of non-linear models (Section 3.3): feed-forward neural networks and recurrent neural networks. For feed-forward neural networks we use a multilayer perceptron (Bishop, 2006, p. 228) with rectified linear units (Hahnloser, Sarpeshkar, Mahowald, Douglas & Seung, 2000) as activation functions, and for recurrent neural networks we use networks with gated recurrent units (Cho et al., 2014). All models are also compared to a persistence forecast, in which RSRP values are assumed to be persistent, i.e. all future values are assumed to be the same as the current one.

We train the neural networks using Nesterov-accelerated adaptive moment estimation (Nadam) (Dozat, 2016). In order to prevent overfitting, we apply dropout (Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov, 2014). Additionally, we use early stopping of training (Bishop, 2006, p. 259) to further decrease overfitting and reduce training time.

All three types of models depend on hyperparameters, which are optimised using Bayesian optimisation (BO) with GP priors (Snoek, Larochelle & Adams, 2012), described in Section 3.4.1. For the vector autoregression (VAR) model we also optimise the set of lags using a more traditional bottom-up and top-down heuristic.

The models are then compared using traditional performance metrics such as mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). The models' certainty in their predictions is estimated using the average prediction interval (PI) width of the forecasts. We then assess whether the estimated coverage probability of the PI is close to the desired one.

3.1 Justification

We have time series of RSRP readings from different cells. Therefore, we explore which time series models are better for predicting future RSRP values using past RSRP measurements. We deliberately do not use spatial coordinates: although the position data is present in the simulator, it is sensitive and hard to obtain in real life.

As there is strong reason to believe that one cell carries information useful for predicting



Figure 3.1: Contour lines represent areas with the same signal strength. Points A and B have the same signal strength of the middle cell; however, future signal strengths are likely to develop differently. Distinguishing between points A and B is possible by incorporating information about signal strength from surrounding cells.

signal strength from multiple cells, we are more likely to distinguish between different locations that would alternatively have had the same RSRP. This is useful in the presence of obstacles. For example, in Figure 3.1, UEs in points A and B have the same signal strength from the middle cell; however, given the movement direction of the UEs, the future RSRP values from the middle cell are likely to be different: the signal will deteriorate more slowly for UE A than for UE B. Using the surrounding cells on the left and on the right of the middle cell allows us to distinguish between points A and B, potentially improving predictions for each of the cases. We denote a vector of signal strengths at time t as $y_t$ and model it as a function of its lagged values:

$$y_t = f(y_{t-1}, y_{t-2}, \dots) + \varepsilon_t, \quad (3.1)$$

where $y_t$ corresponds to the RSRP measurements in a single row of the data in tabular form (Table 2.1), and $\varepsilon_t$ is noise.

We know that in the simulated data there are no differences between UEs. Moreover, in reality it is unlikely that we can know the differences between them a priori; therefore, we do not distinguish among them and use the same model for all UEs. We do ensure, however, that for a given t the set of measurements $\{y_t, y_{t-1}, \dots, y_{t-p}\}$ belongs to the same UE, where p is the longest time dependency for which a model can account.

3.2 Linear models

We start by modelling RSRP values using a linear VAR model, because this is a standard approach for modelling multivariate time series, common due to its relative simplicity.

3.2.1 Multivariate linear regression

According to Lütkepohl (2005, p. 4), multivariate time series models are used when the current value of a variable, $y_{1,t}$, depends not only on its own past observations, $y_{1,t-1}, y_{1,t-2}, \dots$, but also on the past values of other variables, $y_{2,t-1}, y_{3,t-1}, \dots, y_{2,t-2}, \dots$. More formally,

$$y_{1,t} = f_1(y_{1,t-1}, \dots, y_{K,t-1}, y_{1,t-2}, \dots, y_{K,t-2}, \dots) + u_t, \quad (3.2)$$

where K is the total number of variables and $u_t$ is stochastic noise.

Assuming that $f(\cdot)$ is linear, we can express (3.2) as

$$y_{1,t} = \nu_1 + a_{1,1}y_{1,t-1} + a_{1,2}y_{1,t-2} + \dots, \quad (3.3)$$


where $\nu_1$ is an intercept term.

For easier notation, it is useful to express (3.3) in the vector form,

$$y_t = \nu + A_1 y_{t-1} + A_2 y_{t-2} + \cdots + u_t, \quad (3.4)$$

where

$$y_t = (y_{1,t}, y_{2,t}, \dots, y_{K,t})^T \quad (3.5)$$

is a $K \times 1$ vector of realizations of a random process at time t, and

$$A_i = \begin{pmatrix} a_{1,1,i} & a_{1,2,i} & \dots & a_{1,K,i} \\ a_{2,1,i} & a_{2,2,i} & \dots & a_{2,K,i} \\ \vdots & \vdots & \ddots & \vdots \\ a_{K,1,i} & a_{K,2,i} & \dots & a_{K,K,i} \end{pmatrix} \quad (3.6)$$

is a $K \times K$ matrix of autoregressive coefficients of lag i, with $a_{k,l,i}$ denoting the linear dependence of $y_k$ on the i-th lag of $y_l$, and

$$\nu = (\nu_1, \nu_2, \dots, \nu_K)^T \quad (3.7)$$

is a $K \times 1$ bias vector, and $u_t$ is a $K \times 1$ vector of multivariate noise.

3.2.1.1 Estimation

Assuming that the noise follows a diagonal Gaussian distribution, the Yule–Walker and maximum likelihood estimators result in the same multivariate least squares estimator (Lütkepohl, 2005, pp. 85, 90).

In order to estimate the coefficients, we want to express the model in the following form

$$Y = BZ + U. \quad (3.8)$$

This can be achieved by essentially reshaping the coefficients and variables from (3.4). For a model with K variables and T observations they can be reshaped in the following way:

$$Y = (y_1, \dots, y_T) \quad [K \times T], \quad (3.9)$$
$$B = (\nu, A_1, A_2, \dots, A_p) \quad [K \times (Kp+1)], \quad (3.10)$$
$$Z = (Z_0, Z_1, \dots, Z_{T-1}) \quad [(Kp+1) \times T], \quad (3.11)$$
$$Z_t = (1, y_t, \dots, y_{t-p+1})^T \quad [(Kp+1) \times 1], \quad (3.12)$$
$$U = (u_1, \dots, u_T) \quad [K \times T], \quad (3.13)$$
where the dimensions of each matrix are given in square brackets.

Then the multivariate least squares estimate is
$$\hat{B} = Y Z^T (Z Z^T)^{-1}. \quad (3.14)$$
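As an illustrative sketch (not the code used in the thesis), the estimator (3.14) can be computed directly with NumPy; the function name and data layout are our own:

```python
import numpy as np

def estimate_var(y, p):
    """Multivariate least squares estimate of a VAR(p), eq. (3.14).

    y : array of shape (T_total, K), one observation vector per row.
    Returns B_hat of shape (K, K*p + 1): columns are (nu, A_1, ..., A_p).
    """
    T_total, K = y.shape
    T = T_total - p                      # number of usable time points
    Y = y[p:].T                          # K x T, eq. (3.9)
    # Z_t = (1, y_t, ..., y_{t-p+1})^T, eq. (3.12); columns Z_0, ..., Z_{T-1}
    Z = np.ones((K * p + 1, T))
    for i in range(p):
        Z[1 + i * K:1 + (i + 1) * K, :] = y[p - 1 - i:T_total - 1 - i].T
    # B_hat = Y Z^T (Z Z^T)^{-1}, eq. (3.14)
    return Y @ Z.T @ np.linalg.inv(Z @ Z.T)
```

For a quick sanity check, simulating a stable VAR(1) with low noise and re-estimating it recovers the intercept and coefficient matrix closely.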

3.2.1.2 Stationarity and stability

The basic assumption about a VAR process is that it is stable and stationary (Lütkepohl, 2005, p. 13). In practice, this assumption is useful, for example, for deriving asymptotic properties of the estimators. In our case it is desirable because a stationary process also exhibits the following properties useful for forecasting:


• it does not have a unit root, therefore it is not a random walk, which means that better predictions than $y_t = y_{t-1}$ are possible;
• it converges to a long-run average, which implies that it is not explosive, meaning that the process does not grow or decrease exponentially.

Stability  Formally, a VAR(1) process,

$$y_t = \nu + A_1 y_{t-1} + u_t, \quad (3.15)$$

is stable if all eigenvalues of $A_1$ have modulus less than 1 (Lütkepohl, 2005, p. 13).

This can be easily extended to a VAR(p) by expressing the VAR(p) process as a VAR(1) process (Lütkepohl, 2005, p. 15):

$$y_t = \nu + A y_{t-1} + u_t, \quad (3.16)$$

where we redefine the variables $y_t$, $\nu$, $A$, $u_t$ as

yt=      yt yt´1 .. . yt´p+1      loooomoooon Kpˆ1 ν=      ν 0 .. . 0      loomoon Kpˆ1 A=        A1 A2 . . . Ap´1 Ap IK 0 . . . 0 0 0 IK . . . 0 0 .. . ... . .. ... ... 0 0 . . . IK 0        loooooooooooooooooooomoooooooooooooooooooon KpˆKp ut=      ut 0 .. . 0      loomoon Kpˆ1 . (3.17)

Stability of a VAR process also implies stationarity (Lütkepohl, 2005, p. 25). Therefore, we only test the resulting model for stability.
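The stability condition can be checked numerically by building the companion matrix of (3.17) and inspecting its eigenvalues; a minimal sketch (the function name is ours):

```python
import numpy as np

def is_stable(A_list):
    """Check VAR(p) stability via the companion matrix of eq. (3.17).

    A_list : list of p coefficient matrices A_1, ..., A_p, each K x K.
    Returns True if all eigenvalues have modulus less than 1.
    """
    K = A_list[0].shape[0]
    p = len(A_list)
    A = np.zeros((K * p, K * p))
    A[:K, :] = np.hstack(A_list)          # top block row: A_1 ... A_p
    A[K:, :-K] = np.eye(K * (p - 1))      # identity blocks below
    return bool(np.all(np.abs(np.linalg.eigvals(A)) < 1))
```

For example, a VAR(2) with $A_1 = 0.5 I$ and $A_2 = 0.4 I$ is stable (the companion eigenvalues are roughly 0.93 and −0.43 per dimension), while a VAR(1) with an eigenvalue of 1.1 is not.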

Stationarity  Although Sims, Stock and Watson (1990) argue for estimating a VAR in levels even if the individual series contain unit roots, hence are not stationary, the general approach is to estimate a VAR in levels only if there are cointegrating relationships or if the individual series are stationary (Lütkepohl, 2005, p. 396). Therefore, we test whether individual realisations of the RSRP time series processes are stationary. We do that using the augmented Dickey–Fuller (ADF) test (Dickey & Fuller, 1979) with automatic lag selection.

The idea of the test is to estimate a (univariate) regression of the form
$$\Delta y_t = \alpha_0 + \gamma y_{t-1} + \sum_{i=2}^{p} \beta_i \Delta y_{t-i+1} + \varepsilon_t, \quad (3.18)$$

and test whether $\gamma = 0$ using the t-statistic calculated with ordinary least squares (OLS) standard errors (Enders, 2014, pp. 207–208). If $\gamma$ is 0, the series contains a unit root and is not stationary. Note, however, that the critical values are different from the ones used in OLS inference and are obtained using Monte Carlo simulations (Dickey & Fuller, 1979).

The order of augmented lags, p, is selected using an information criterion. We use the Akaike information criterion (AIC), as suggested by Seabold and Perktold (2010), with the upper bound of p equal to
$$p_{\max} = 12 \left( \frac{N}{100} \right)^{1/4}. \quad (3.19)$$

Since the ADF test is suitable only for consecutive univariate time series, we perform it for every combination of a UE and a cell. Then we create a histogram of the p-values, on which we base our judgement.


3.2.1.3 Ridge penalisation

We propose an approach for making unstable models stable. A model is stable if all eigenvalues of A in (3.16) are less than one in modulus. If the model is not stable, we can force the coefficients to be smaller in absolute value until all eigenvalues are less than one in modulus.

OLS in VAR minimises the squared error
$$\min_B \|BZ - Y\|_2^2, \quad (3.20)$$

where the notation follows (3.8). $\ell_2$ regularisation adds an additional $\ell_2$ norm of the coefficient matrix to the minimised quantity,
$$\min_B \|BZ - Y\|_2^2 + \lambda \|B\|_2^2. \quad (3.21)$$

Coefficients of A with larger absolute values result in a larger $\|B\|_2^2$ term. Thus, minimising the quantity in (3.21) gives preference to solutions with smaller coefficients compared to (3.20), making it more probable that all eigenvalues of A in (3.16) are less than 1 in modulus.

There exists a closed-form solution for (3.21) (Hoerl & Kennard, 1970):
$$\hat{B} = Y Z^T (Z Z^T + \lambda I)^{-1}. \quad (3.22)$$

By gradually increasing the penalty parameter, we can find the smallest λ such that the model is stable. In the limit λ → ∞, all coefficients become 0, which implies a stable model. Thus, this approach is guaranteed to result in a stable solution given a large enough λ.
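The search for a stabilising penalty can be sketched by combining the ridge estimator (3.22) with the companion-matrix stability check; the doubling schedule below is an assumption for illustration, not necessarily the schedule used in the thesis:

```python
import numpy as np

def ridge_var(Y, Z, lam):
    """Ridge estimate of eq. (3.22): B = Y Z^T (Z Z^T + lam I)^{-1}."""
    m = Z.shape[0]
    return Y @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(m))

def smallest_stabilising_lambda(Y, Z, K, p, lam0=1e-4, factor=2.0):
    """Increase the penalty until the companion matrix of eq. (3.17)
    has all eigenvalues inside the unit circle."""
    lam = lam0
    while True:
        B = ridge_var(Y, Z, lam)
        A_blocks = B[:, 1:]                    # drop the intercept column
        comp = np.zeros((K * p, K * p))
        comp[:K, :] = A_blocks
        comp[K:, :-K] = np.eye(K * (p - 1))
        if np.all(np.abs(np.linalg.eigvals(comp)) < 1):
            return lam, B
        lam *= factor
```

On an explosive univariate series (coefficient 1.05), OLS reproduces the unstable coefficient exactly, while the loop returns a penalty that shrinks it below 1 in modulus.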

3.2.1.4 Lag selection

Since the lag order of a VAR(p) is unknown, we have to estimate it. It is important to estimate it as well as possible, because too large a value of p can decrease the forecasting accuracy (Lütkepohl, 2005, p. 136), while too low a value of p might result in not capturing important information.

If the upper bound is known, then we can test all models within the possible values of p and choose the model which describes our data best using the likelihood-ratio test (Lütkepohl, 2005, p. 138).

If we are interested in having a good predictive model, then the standard approach is to choose the model based on one of the information criteria: AIC, the Bayesian information criterion (BIC), or the Hannan–Quinn information criterion (HQC) (Lütkepohl, 2005, pp. 146–153). These information criteria measure the goodness of fit while penalising for the degrees of freedom, thus reducing overfitting.

Bottom-up and top-down  It is possible that lags of larger order still contain useful information. However, given a large number of variables, we would also like to keep the total number of lags used in the model as low as possible to avoid estimating too many coefficients. Thus, we want to estimate a model with constraints; more specifically, we want to set some of the coefficients to zero a priori. It is not possible to test all possible combinations of restrictions due to computational limitations. For example, for a bivariate VAR(4) process, the number of possible restriction combinations is 65 540. Therefore, a few heuristics have been developed. The commonly used ones are the top-down and bottom-up strategies (Lütkepohl, 2005, pp. 207–211). These approaches are not exclusive, and Lütkepohl (2005) suggests using the bottom-up approach first, because it is likely not to require estimation with all the parameters, followed by top-down lag selection. The algorithm is summarised in Algorithm 1. Moreover, instead of testing one coefficient at a time, we restrict all coefficients of lag i to be zero. Additionally, we substitute an information criterion with the mean squared error (MSE) (described in Section 3.6.2) on the validation data set (the split into training, validation, and test data sets is described in Section 3.6.1), because this measure is also indicative of the goodness of fit but is more consistent with the other approaches used in the thesis.

Algorithm 1: Pseudocode of a bottom-up and top-down approach for non-consecutive lag selection, using MSE.

1: l ← ∅
2: Estimate model with lags l = ∅.  ▷ No covariates, bias only.
3: MSE_min ← MSE of the estimated model.
4: for i ∈ {1, 2, ..., p} do  ▷ Bottom-up.
5:   l_i ← l ∪ {i}
6:   Estimate VAR model using l_i lags.
7:   MSE_i ← MSE of the estimated model.
8:   if MSE_i < MSE_min then
9:     l ← l_i
10:     MSE_min ← MSE_i
11:   end if
12: end for
13: for i ∈ l do  ▷ Top-down.
14:   l_i ← l ∖ {i}
15:   Estimate model using l_i lags.
16:   MSE_i ← MSE of the estimated model.
17:   if MSE_i < MSE_min then
18:     l ← l_i
19:     MSE_min ← MSE_i
20:   end if
21: end for
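Algorithm 1 can be sketched in Python as follows; `val_mse` stands in for estimating a VAR on the training set and scoring it on the validation set:

```python
def select_lags(p_max, val_mse):
    """Bottom-up then top-down lag selection (Algorithm 1).

    val_mse : callable mapping a set of lags to a validation MSE
    (a stand-in for fitting and scoring a real VAR model).
    Returns the selected lag set and its MSE.
    """
    lags = set()
    best = val_mse(lags)                  # bias-only model
    for i in range(1, p_max + 1):         # bottom-up: try adding each lag
        cand = lags | {i}
        mse = val_mse(cand)
        if mse < best:
            lags, best = cand, mse
    for i in sorted(lags):                # top-down: try removing each lag
        cand = lags - {i}
        mse = val_mse(cand)
        if mse < best:
            lags, best = cand, mse
    return lags, best
```

With a toy objective in which lags 1 and 3 are genuinely informative and every extra lag costs a small penalty, the procedure recovers exactly {1, 3}.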

Bayesian optimisation  Since the lags used in the model can be seen as hyperparameters, it is also possible to estimate the best combination using Bayesian optimisation (BO), described in Section 3.4.1.

Although BO is a common approach for hyperparameter optimisation of machine learning models, to our knowledge it has not been used for lag selection in frequentist VAR before. Most probably, this is because BO is most useful when function evaluations take a substantial amount of time (Snoek et al., 2012), while VAR training is typically fast relative to more complex models, such as NNs. However, we have a relatively large data set, so VAR training takes a significant amount of time. Moreover, this is useful for consistent comparison with the other models considered in Section 3.3.

3.3 Non-linear models

For non-linear modelling we choose to use NN models, because they have shown good forecasting performance and, unlike traditional time-series models such as the autoregressive integrated moving average (ARIMA), they do not assume linear autocorrelation (Oancea & Cristian Ciucu, 2014).

Time-series NNs describe a non-linear relationship between the current value of the variables and its lagged values (Samarasinghe, 2006, p. 446). This can be described as

$$y_t = f(y_{t-1}, y_{t-2}, \dots, y_{t-p}) + \varepsilon_t. \quad (3.23)$$

Note the similarity with (3.2).

Several neural network models have been proposed for modelling time series, but most of them can be divided into two groups (Samarasinghe, 2006, p. 446):


(a) Feed-forward network.
(b) Recurrent (feedback) network.

Figure 3.2: Illustration of the difference between a static feed-forward network and a recurrent (feedback) network. z represents a hidden unit.

• time-lagged feed-forward neural networks,
• recurrent (feedback) neural networks (RNN).

The main difference between the two approaches is in the way the autocorrelations are modelled. Feed-forward networks account for short-term memory explicitly by having lagged values as inputs. Recurrent neural networks have only the last lag as input, and try to build an autocorrelation structure internally, using feedback connections. For example, output of the hidden layer at t ´ 1 can be used as additional input to the hidden layer at t. This is illustrated in Figure 3.2.

We use a multilayer perceptron (MLP) (Section 3.3.1.1) as the feed-forward neural network, and a gated recurrent unit (GRU) network (Section 3.3.2.2) as the recurrent neural network.

3.3.1 Feed-forward neural networks

Time-lagged feed-forward neural networks are similar to other feed-forward networks with the only difference being that instead of other variables we use lagged values of the variable itself as inputs (Samarasinghe, 2006, p. 448).

3.3.1.1 Multilayer perceptron

Feed-forward neural networks are essentially static because we account for time externally, and therefore we can use traditional feed-forward networks such as a multilayer perceptron (MLP) (Samarasinghe, 2006, p. 449).

In general, an MLP model is a non-linear function from a set of inputs $\{x_i\}$ to a set of outputs $\{y_k\}$, which is governed by trainable parameters w (Bishop, 2006, p. 228). It can be described by a series of functional transformations.

For a single hidden layer model with D inputs, M hidden units, and N outputs, we construct M linear combinations of the input variables $\{x_i\}_{i=1}^{D}$. These combinations are called activations, a:
$$a_j^{(1)} = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \quad (3.24)$$


where $j = 1, \dots, M$, and the superscript (1) indicates the first layer. For simplification, the bias term, $w_{j0}^{(1)}$, can be absorbed into the set of $\{w\}$ parameters by setting $x_0 = 1$:
$$a_j^{(1)} = \sum_{i=0}^{D} w_{ji}^{(1)} x_i. \quad (3.25)$$

Each of the activations is non-linearly transformed using a differentiable non-linear activation function, h, to obtain hidden units, z,

$$z_j^{(1)} = h\left(a_j^{(1)}\right). \quad (3.26)$$

The hidden units are then linearly combined to produce outputs:

$$y_k = f\left( \sum_{j=0}^{M} w_{kj}^{(o)} z_j^{(1)} \right), \quad (3.27)$$

where $k = 1, \dots, N$, a bias term is contained in w, the superscript (o) indicates the output layer, and $f(\cdot)$ is either an identity in the case of regression, or a softmax (normalised exponential),

$$f_{\text{softmax}}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}, \quad (3.28)$$

in case of classification to ensure that the output can be interpreted as probabilities. Combining equations (3.25–3.27) we get the following expression:

$$y_k(x, w) = f\left( \sum_{j=0}^{M} w_{kj}^{(o)} h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right). \quad (3.29)$$

However, instead of computing outputs in (3.27), we can use hidden units as inputs to the second layer, thus giving rise to a two-layer perceptron:

$$y_k(x, w) = f\left( \sum_{j_2=0}^{M_2} w_{kj_2}^{(o)} h\left( \sum_{j_1=0}^{M_1} w_{j_2 j_1}^{(2)} h\left( \sum_{i=0}^{D} w_{j_1 i}^{(1)} x_i \right) \right) \right), \quad (3.30)$$

where we introduce subscripts for j and M to emphasize that the number of units can differ between different layers. These layers can be stacked any number of times, hence, multilayer perceptron.
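Equation (3.30) translates directly into code. The following sketch uses ReLU (3.45) for h and an identity output f, with biases absorbed via a leading constant 1 as in (3.25); the weight shapes are our own convention:

```python
import numpy as np

def mlp_forward(x, W1, W2, Wo):
    """Two-hidden-layer MLP forward pass of eq. (3.30), identity output.

    W1 : (M1, D + 1), W2 : (M2, M1 + 1), Wo : (N, M2 + 1);
    the extra first column of each matrix holds the bias weights.
    """
    h = lambda a: np.maximum(0.0, a)            # ReLU activation, eq. (3.45)
    z1 = h(W1 @ np.concatenate(([1.0], x)))     # first hidden layer
    z2 = h(W2 @ np.concatenate(([1.0], z1)))    # second hidden layer
    return Wo @ np.concatenate(([1.0], z2))     # identity output (regression)
```

With tiny hand-picked weights the nested sums can be verified by hand, e.g. x = [1], W1 = [[0, 1]], W2 = [[0, 2]], Wo = [[1, 3]] gives the output 1 + 3·2 = 7.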

3.3.2 Recurrent neural networks

The main difference between feed-forward and recurrent neural networks is the presence of feedback connections that are designed to remember past states of the network and feed them back into it (Samarasinghe, 2006, p. 485). We illustrate this concept using a simple Elman network as an example, and then describe the more advanced gated recurrent unit (GRU) model that we use for RSRP prediction.

3.3.2.1 Elman network

The Elman network is a simple and illustrative RNN. In its original form it is similar to a single layer perceptron with the addition of context units (Elman, 1990). These units are set to the hidden unit values and are fed into the hidden units at the next time step together with external inputs. Thus, the context units remember the previous internal state, and the hidden units map the internal state and external inputs to outputs. For a network with a D-dimensional input



Figure 3.3: Elman network unrolled to create a directed acyclic graph.

space and M hidden units, we first compute activations. The activation $j = 0, \dots, M$ at time t, $a_{t,j}$, is a linear combination of the inputs $y_{t-1}$ and the previous hidden units,

$$a_{t,j} = \sum_{i=0}^{D} w_{ji} y_{t-1,i} + \sum_{k=0}^{M} w_{jk} z_{t-1,k}. \quad (3.31)$$

Then we non-linearly transform them to obtain current hidden units,

$$z_{t,j} = h\left(a_{t,j}\right), \quad (3.32)$$

and the output is a linear combination of M hidden units,

$$y_{t,m} = \sum_{l=0}^{M} w_{ml} z_{t,l}, \quad (3.33)$$

where the subscript $m = 0, \dots, D$ indicates an output dimension.

Figure 3.2b contains a graphical representation of the Elman network, and Figure 3.3 contains the Elman network unrolled to create a directed acyclic graph.
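One time step of the Elman network, eqs. (3.31)–(3.33), can be sketched as follows (bias terms are omitted for brevity; the text absorbs them into the j = 0 index):

```python
import numpy as np

def elman_step(y_prev, z_prev, W_in, W_rec, W_out, h=np.tanh):
    """One Elman step: eqs. (3.31)-(3.33).

    y_prev : previous output / external input, z_prev : previous hidden
    (context) state.  Returns the new output and hidden state.
    """
    a = W_in @ y_prev + W_rec @ z_prev   # activations, eq. (3.31)
    z = h(a)                             # new hidden state, eq. (3.32)
    y = W_out @ z                        # output, eq. (3.33)
    return y, z
```

With identity-like 1x1 weights and no recurrent input, the output is simply tanh of the input, which makes the data flow of Figure 3.3 easy to trace.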

3.3.2.2 Gated recurrent unit

It is hard to estimate long-term dependencies in RNNs, such as Elman networks, because error gradients between distant time steps vanish (Bengio, Simard & Frasconi, 1994). A number of attempts to solve this issue exist. One of the most successful is the introduction of the long short-term memory (LSTM) unit (Hochreiter & Schmidhuber, 1997).

A gated recurrent unit (GRU) is similar to an LSTM, but is simpler and requires less computational power (Cho et al., 2014). It is a form of activation function that is able to adaptively remember and forget the state.

A GRU is different from a simple context unit that remembers the internal state (described in Section 3.3.2.1) in containing a so-called reset gate and an update gate. The reset gate, $r_j$, is
$$r_j = \sigma\left( [W_r x]_j + [U_r z^{(t-1)}]_j \right), \quad (3.34)$$
where $(t-1)$ denotes the time period, σ is the logistic sigmoid function, $[\cdot]_j$ denotes the j-th element of a vector, x is the input, $z^{(t-1)}$ is the previous hidden state, and $W_r$ and $U_r$ are weight matrices.


The update gate, $g_j$, is
$$g_j = \sigma\left( [W_g x]_j + [U_g z^{(t-1)}]_j \right). \quad (3.35)$$

The actual activation is
$$z_j^{(t)} = g_j z_j^{(t-1)} + (1 - g_j) \tilde{z}_j^{(t)}, \quad (3.36)$$

where the superscript (t) denotes a time period, and the candidate hidden state, $\tilde{z}_j^{(t)}$, is
$$\tilde{z}_j^{(t)} = \tanh\left( [W x]_j + [U (r \odot z^{(t-1)})]_j \right), \quad (3.37)$$
where $\odot$ is the Hadamard (element-wise) product.

The idea behind the gates is that when the reset gate is close to zero, the hidden state 'forgets' the previous state. The update gate selects whether, and by how much, the hidden state should be updated with the current value of the hidden unit.
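A single GRU step, eqs. (3.34)–(3.37), written out directly; the weight names follow the text, and this is an illustrative sketch rather than a framework implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, z_prev, Wr, Ur, Wg, Ug, W, U):
    """One GRU step, eqs. (3.34)-(3.37)."""
    r = sigmoid(Wr @ x + Ur @ z_prev)            # reset gate, eq. (3.34)
    g = sigmoid(Wg @ x + Ug @ z_prev)            # update gate, eq. (3.35)
    z_tilde = np.tanh(W @ x + U @ (r * z_prev))  # candidate state, eq. (3.37)
    return g * z_prev + (1.0 - g) * z_tilde      # new state, eq. (3.36)
```

The gating behaviour is easy to check: if the update gate saturates at 1 (large $W_g x$), the previous hidden state is passed through unchanged.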

3.3.3 Training

For an MLP, the values of the weight vector w can be found by maximising the likelihood of the observed data over the parameter values (Bishop, 2006, p. 232). Assuming that the data has a Gaussian distribution, maximising its likelihood is the same as minimising the sum of squared errors, thus
$$\varepsilon(w) = \sum_{n=1}^{N} \left( \hat{y}(x_n, w) - y_n \right)^2, \quad (3.38)$$
$$w^{*} = \arg\min_w \varepsilon(w), \quad (3.39)$$
where $w^{*}$ is the vector of optimal weights, and $\varepsilon(w)$ is the minimised error function.

Since the activation functions are differentiable, (3.30) is also differentiable and constructed out of multiple nested functions. Therefore, we can find its derivatives using the chain rule and use gradient descent to minimise the squared error (Bishop, 2006, p. 240):
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla \varepsilon(w^{(\tau)}), \quad (3.40)$$
where η is the learning rate, and τ is the iteration.

For gradient descent we need to know the value of the gradient. In the context of neural networks, its computation is often referred to as back-propagation (Bishop, 2006, p. 241). We forward propagate and compute the estimate $\hat{y}$. Then we back-propagate and compute the gradient of the error function (3.38) with respect to the weights.

RNNs are also trained using gradient descent methods. Gradients are computed using back-propagation through time. It is an extension of the standard back-propagation algorithm and involves unrolling the network into a directed acyclic graph, thus creating a feed-forward neural network (Haykin, 2009, p. 808). After this step it is no different from the usual back-propagation.

3.3.3.1 Rectified linear unit

In general, any non-linear differentiable function can be used as an activation function. Common choices of activation function include the sigmoid (Bishop, 2006, p. 227),
$$f(x) = \frac{e^x}{e^x + 1}, \quad (3.41)$$


Figure 3.4: Illustration of common activation functions.

or hyperbolic tangent:

$$f(x) = \tanh(x). \quad (3.42)$$

However, recently the research focus has been on non-saturating activation functions (Arora, Basu, Mianjy & Mukherjee, 2016).

Loosely speaking, a saturating function is a function that maps to a certain interval. For example, the sigmoid function maps to (0, 1) and the hyperbolic tangent maps to (−1, 1).

A non-saturating function is a function that tends to infinity. More formally, a non-saturating function, $f(\cdot)$, is a function such that
$$\left| \lim_{z \to +\infty} f(z) \right| = +\infty \quad (3.43)$$
or
$$\left| \lim_{z \to -\infty} f(z) \right| = +\infty, \quad (3.44)$$

and a saturating function is a function that is not non-saturating.

When a saturating function is saturated, its derivative is very close to 0. For example, for x = 10, which is not an unrealistic value, the derivative of the sigmoid is $4.5 \times 10^{-5}$. This is a problem for gradient descent optimisation because, according to (3.40), the rate of descent is directly related to the derivative, and with saturated activation functions learning happens slowly.

According to Arora et al. (2016), one of the simplest and most widely used non-saturating activation functions is the rectified linear unit (ReLU), first proposed by Hahnloser et al. (2000) and coined by Nair and Hinton (2010). A network with ReLUs can be trained up to six times faster than a network which uses the hyperbolic tangent as its activation function (Krizhevsky, Sutskever & Hinton, 2017). Therefore, we use ReLU. It is defined as

$$f(x) = \max(0, x), \quad (3.45)$$

which is illustrated in Figure 3.4.

He, Zhang, Ren and Sun (2015) introduce a weight initialisation scheme for ReLU units that outperforms other initialisation schemes. We use it for training the MLP model.

3.3.3.2 Nadam

Vanilla gradient descent from (3.40) has a number of drawbacks, such as slow learning and practical infeasibility for large data sets (Ruder, 2016). Stochastic gradient descent (SGD) and minibatch gradient descent attempt to solve these issues. The difference compared to vanilla gradient descent is that instead of computing gradients using the entire data set and then


updating parameters, gradients and parameter updates are computed for each small subset of the data. The subset size is one in the case of SGD and any number smaller than the size of the training data set in the case of minibatch². When the entire data set has been used once for estimating gradients and updating coefficients, a training epoch is complete.

SGD does not guarantee convergence and has a number of problems which need to be addressed. The most important ones are the tendency to get stuck in local optima or saddle points and the choice of an appropriate learning rate (Ruder, 2016). The former is addressed by introducing momentum, and the latter by dynamically adjusting the learning rate for each parameter. A recent popular method, Nadam (Dozat, 2016), attempts to solve both these issues simultaneously, therefore we opt for using it.

Nesterov momentum  In order to avoid local optima and accelerate gradients in the relevant direction, a momentum term is used. Vanilla momentum takes into account the previous gradient: the weight update combines the current gradient with an exponentially decaying average of the previous gradients (Qian, 1999):
$$v^{(\tau)} = \gamma v^{(\tau-1)} + \eta \nabla \varepsilon(w^{(\tau-1)}), \quad (3.46)$$
$$w^{(\tau)} = w^{(\tau-1)} - v^{(\tau)}. \quad (3.47)$$

Nesterov momentum, or Nesterov accelerated gradient (Nesterov, 1983), is an extension of vanilla momentum that attempts to be forward-looking. Instead of computing gradients at the current value of the parameters, it approximates the parameter values at the next step, $w^{(\tau-1)} - \gamma v^{(\tau-1)}$, and then computes gradients with respect to these approximate future parameter values:
$$v^{(\tau)} = \gamma v^{(\tau-1)} + \eta \nabla \varepsilon\left( w^{(\tau-1)} - \gamma v^{(\tau-1)} \right), \quad (3.48)$$
$$w^{(\tau)} = w^{(\tau-1)} - v^{(\tau)}. \quad (3.49)$$

Adam  Adaptive moment estimation (Adam) builds on top of simple momentum and adjusts the learning rate for each parameter individually (Kingma & Ba, 2014). The idea is that it keeps track of an exponentially decaying average of past squared gradients. This allows decreasing the learning rate more for parameters that are updated frequently than for parameters that are updated less frequently.

Let us denote the gradient at iteration τ as $g^{(\tau)}$,

$$g^{(\tau)} = \nabla \varepsilon(w^{(\tau)}), \quad (3.50)$$

then exponentially decaying averages can be expressed as

$$m^{(\tau)} = \beta_1 m^{(\tau-1)} + (1 - \beta_1) g^{(\tau)}, \quad (3.51)$$
$$v^{(\tau)} = \beta_2 v^{(\tau-1)} + (1 - \beta_2) \left( g^{(\tau)} \right)^2, \quad (3.52)$$
where $\beta_1$ and $\beta_2$ are decay rates that are usually close to 1. Then the parameter update at each step is

$$w^{(\tau+1)} = w^{(\tau)} - \frac{\eta}{\sqrt{v^{(\tau)}} + \epsilon}\, m^{(\tau)}, \quad (3.54)$$
where $\epsilon$ is some small value to avoid division by zero.

²In the literature, the minibatch algorithm is also often referred to as SGD (Ruder, 2016), and we follow this convention.

Moreover, the authors note that because both vectors m and v are initialised with 0, they are biased towards 0 at the initial steps. In order to counteract this, they propose a bias correction:

$$m^{(\tau)} := \frac{m^{(\tau)}}{1 - \beta_1^{\tau}}, \quad (3.55)$$
$$v^{(\tau)} := \frac{v^{(\tau)}}{1 - \beta_2^{\tau}}. \quad (3.56)$$

This increases their values at early steps, when $\beta_i^{\tau}$ is close to 1.

Combining Adam with Nesterov momentum  Nesterov-accelerated adaptive moment estimation (Nadam) is essentially Adam with Nesterov momentum (Dozat, 2016):

$$w^{(\tau+1)} = w^{(\tau)} - \frac{\eta}{\sqrt{v^{(\tau)}} + \epsilon} \left( \beta_1 m^{(\tau)} + \frac{(1 - \beta_1) g^{(\tau)}}{1 - \beta_1^{\tau}} \right), \quad (3.57)$$

where $m^{(\tau)}$ is from (3.55), $v^{(\tau)}$ from (3.56), and $g^{(\tau)}$ is the gradient as defined in (3.50). Since this optimisation method combines the benefits of both Nesterov momentum and an adaptive learning rate, we use Nadam. Moreover, we use the parameter values suggested by Dozat (2016).
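One Nadam update, combining (3.51)–(3.57), can be sketched as follows; the default hyperparameter values here are common choices for illustration, not necessarily those of Dozat (2016):

```python
import numpy as np

def nadam_step(w, g, m, v, tau, eta=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Nadam parameter update; tau is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * g                # eq. (3.51)
    v = beta2 * v + (1 - beta2) * g ** 2           # eq. (3.52)
    m_hat = m / (1 - beta1 ** tau)                 # bias correction, eq. (3.55)
    v_hat = v / (1 - beta2 ** tau)                 # bias correction, eq. (3.56)
    # eq. (3.57): Nesterov-style mix of the corrected momentum and raw gradient
    w_new = w - eta / (np.sqrt(v_hat) + eps) * (
        beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** tau))
    return w_new, m, v
```

Iterating this update on the gradient of a simple quadratic drives the parameter towards its minimum at 0.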

3.3.3.3 Early-stopping

Training is the iterative reduction of the error function with respect to the training data set. This error is a non-increasing function of the iteration (Bishop, 2006, p. 259). However, the error measured on an independent validation data set tends to decrease at first, but increase afterwards. Therefore, in order to achieve optimal generalisation ability of the model, training should be stopped at the iteration which results in the smallest validation error. This can be seen as a form of regularisation, because the model moves from realisations of simpler models to more complex ones (Haykin, 2009, p. 173).

Moreover, this procedure makes optimisation of hyperparameters more time-efficient, because the model is not trained after it starts overfitting the data.

In reality, the validation error does not evolve smoothly between epochs (Haykin, 2009, p. 174). In order to solve this problem, a so-called patience parameter is introduced (Chollet et al., 2015). It governs how many epochs we allow the validation error to grow before we stop training. This procedure is described in Algorithm 2.

Algorithm 2: Illustration of the patience parameter in early stopping.

1: i ← 0
2: ε_min ← ∞
3: while i < patience do
4:   Train a model for one epoch.
5:   Compute validation error, ε.
6:   if ε < ε_min then
7:     ε_min ← ε
8:     i ← 0  ▷ Reset counter.
9:   else
10:     i ← i + 1
11:   end if
12: end while
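Algorithm 2 maps directly onto a small training loop; here `train_epoch` stands in for one epoch of real training followed by evaluation on the validation set:

```python
def train_with_patience(train_epoch, patience):
    """Early stopping with a patience counter (Algorithm 2).

    train_epoch : callable that trains for one epoch and returns the
    validation error.  Returns the best error and the epoch it occurred at.
    """
    i, eps_min, best_epoch, epoch = 0, float('inf'), 0, 0
    while i < patience:
        eps = train_epoch()
        epoch += 1
        if eps < eps_min:
            eps_min, best_epoch, i = eps, epoch, 0   # reset the counter
        else:
            i += 1
    return eps_min, best_epoch
```

With the validation-error sequence 5, 4, 4.5, 3, 3.1, 3.2, 3.3 and patience 3, training stops after the third consecutive increase, and the best error 3 (epoch 4) is reported.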


3.3.3.4 Dropout

NNs are very flexible models and are therefore prone to overfitting the training data. There are traditional methods to address this issue, such as $\ell_1$ and $\ell_2$ regularisation or early stopping. However, dropout has proved to be especially advantageous with NNs (Srivastava et al., 2014).

The main idea behind dropout is to randomly remove units together with their incoming and outgoing connection from the network for each minibatch in the training sample. After the connections are dropped, we compute gradients and update weights only for the units that remain in this thinner model. During the validation and test phase, all connections are present.

Dropout can be seen as a regularisation method, because it prevents weights from co-adapting too much. It also can be seen as an approximation of averaging predictions over exponentially many different less complex models (Srivastava et al., 2014).

For RNNs, Pham, Bluche, Kermorvant and Louradour (2014) suggest applying dropout only to the input and output layers and provide evidence that dropout of recurrent units disrupts model performance. Therefore, we apply dropout to all layers in the MLP and only to the input and output layers in the GRU. The dropout rate is one of the hyperparameters that we optimise.

We consider dropout values between 0 and 0.5. Srivastava et al. (2014) suggest that a dropout rate of 0.5 is often optimal, while we are also interested in smaller values because training without dropout is faster.
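A sketch of (inverted) dropout applied to a layer's activations; the rescaling by 1/(1 − rate) at training time is a common implementation choice that the text does not spell out:

```python
import numpy as np

def dropout_mask(z, rate, rng, training=True):
    """Randomly zero units with probability `rate` during training.

    Surviving units are scaled up by 1/(1 - rate), so activations keep
    the same expected value and no rescaling is needed at test time.
    """
    if not training or rate == 0.0:
        return z                       # all connections present at test time
    keep = rng.random(z.shape) >= rate
    return z * keep / (1.0 - rate)
```

On a vector of ones with rate 0.5, the output contains only 0s and 2s, and its mean stays close to 1.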

3.4 Hyperparameter optimisation

The models described above depend on the following hyperparameters:
• VAR: lags.
• MLP: lags, architecture (number of hidden layers and number of hidden units in each of them), dropout.
• GRU: sequence length, architecture, dropout.

There are a few common approaches to hyperparameter optimisation, such as grid search and random search. In this work we consider hyperparameter optimisation in the framework of BO with Gaussian process (GP) priors.

We minimise the loss function, mean squared error, during the hyperparameter optimisation process.

3.4.1 BO with GP priors

Bayesian optimisation (BO) is similar to other kinds of optimisation in the sense that it optimises a function f(x) over a set of possible x values (Snoek et al., 2012). The difference is that it approximates f(x) with a probabilistic model and exploits information obtained from this approximation to better choose the next point for function evaluation. The price for this is that determining this point takes more computational power than in other approaches such as gradient descent. Therefore, this type of optimisation makes sense when function evaluations are expensive, for example when each evaluation requires training a machine learning algorithm.
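The choice of the next evaluation point is usually driven by an acquisition function; a common one is expected improvement. The sketch below is our illustration (the thesis does not fix a specific acquisition function here) and scores a candidate from its GP posterior mean and standard deviation when minimising f:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected improvement of a candidate point for *minimisation*.

    `mu` and `sigma` are the GP posterior mean and standard deviation at
    the candidate; `f_best` is the best (lowest) value observed so far.
    """
    if sigma <= 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # phi(z)
    return (f_best - mu) * cdf + sigma * pdf
```

The candidate with the highest expected improvement is then evaluated next.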

We treat the hyperparameters as x and a performance metric as f(x), which allows us to use BO to choose evaluation points wisely. The algorithm is summarised in Algorithm 3.


Algorithm 3: Summary of BO.

1: Sample a few initial points, x(τ+1), from the possible values of the hyperparameters.
2: repeat
3:    x(τ) ← x(τ+1)
4:    Train the model using these hyperparameters, x(τ).
5:    Obtain the performance metric, f(x(τ)).
6:    Approximate the true relationship between x and f(x) using Gaussian process (GP) regression.
7:    Choose the next set of hyperparameters, x(τ+1), by selecting ‘promising’ points (e.g. points that are expected to deliver the best improvement in performance) according to the GP regression model.
8: until ‖x(τ+1) − x(τ)‖ < ε    ▷ Convergence criterion.

3.4.1.1 Gaussian process

‘A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution’ (Rasmussen, 2004, p. 13). A GP is fully specified by its mean, m(x), and covariance, k(x, x′), functions for any two inputs x and x′:

m(x) = E[f(x)],  (3.58)

k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))],  (3.59)

and is denoted as

f(x) ∼ GP(m(x), k(x, x′)).  (3.60)

The random variables represent the value of the function f(x) at location x, where x ranges over the possible values of the covariates.

Prediction The GP prediction has a closed-form solution that depends on the training data, the kernel function, and the hyperparameters (Appendix A.2.1).
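To make this closed form concrete, here is a pure-Python sketch of the posterior mean k*ᵀ(K + σₙ²I)⁻¹y for one-dimensional inputs. The squared-exponential kernel and all names below are our illustrative choices, not the thesis's implementation:

```python
import math

def sq_exp(a, b, ell=1.0):
    """Illustrative squared-exponential kernel exp(-(a - b)^2 / (2 ell^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * ell ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior_mean(X, y, x_star, noise=1e-6):
    """Posterior mean k_*^T (K + noise * I)^{-1} y at the test input x_star."""
    K = [[sq_exp(xi, xj) + (noise if i == j else 0.0)
          for j, xj in enumerate(X)] for i, xi in enumerate(X)]
    alpha = solve(K, y)
    return sum(sq_exp(x_star, xi) * a for xi, a in zip(X, alpha))
```

With near-zero observation noise the posterior mean interpolates the training points, which is a quick sanity check on the formula.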

Kernel Different choices of covariance function, k(x, x′), exist. Snoek et al. (2012) suggest using the Matérn 5/2 kernel (Rasmussen, 2004, p. 84) for optimisation problems:

k_{Matérn,5/2}(r) = σ_f² (1 + √5·r/ℓ + 5r²/(3ℓ²)) exp(−√5·r/ℓ),  (3.61)

where ν, ℓ > 0 (here ν = 5/2), K_ν is the modified Bessel function appearing in the general Matérn kernel, and r is the Euclidean distance between the two points. The GP hyperparameter ℓ determines the smoothness of the functions. Instead of setting the same ℓ for all dimensions, we can optimise it for each dimension individually, thus allowing functions in some dimensions to be smoother than in others. This approach is called automatic relevance determination (ARD) (Appendix A.2.3).
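A direct transcription of eq. (3.61), with optional per-dimension length-scales in the ARD style (the function and argument names are ours):

```python
import math

def matern52(x1, x2, sigma_f=1.0, lengthscales=None):
    """Matérn 5/2 kernel of eq. (3.61); with per-dimension `lengthscales`
    this is the ARD variant, otherwise a shared length-scale of 1 is used."""
    if lengthscales is None:
        lengthscales = [1.0] * len(x1)
    # ARD: each dimension is scaled by its own length-scale before
    # computing the scaled Euclidean distance r / ell.
    r = math.sqrt(sum(((a - b) / l) ** 2
                      for a, b, l in zip(x1, x2, lengthscales)))
    s = math.sqrt(5.0) * r
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)
```

The kernel equals σ_f² at r = 0 and decays monotonically with distance; enlarging a dimension's length-scale makes the function smoother (more correlated) along that dimension.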

GP hyperparameter optimisation The other hyperparameters of the GP, such as σ_f² and ℓ, are optimised by maximising the marginal likelihood (Rasmussen, 2004, p. 109). For GP regression, a closed-form marginal likelihood exists (Appendix A.2.4) and can be optimised using numerical methods (Rasmussen, 2004, p. 113).
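For reference, the closed form in question is the standard log marginal likelihood of GP regression (our transcription of the textbook result in Rasmussen, 2004):

```latex
\log p(\mathbf{y} \mid X) =
  -\tfrac{1}{2}\,\mathbf{y}^{\top} K_y^{-1} \mathbf{y}
  -\tfrac{1}{2}\log \lvert K_y \rvert
  -\tfrac{n}{2}\log 2\pi,
\qquad K_y = K + \sigma_n^2 I,
```

where K is the kernel matrix over the n training inputs and σ_n² is the observation-noise variance. Maximising this expression in σ_f² and ℓ (which enter through K) and in σ_n² yields the fitted GP hyperparameters.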
