DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Influence of different frequencies order in a multi-step LSTM forecast for crowd movement in the domains of transportation and retail

MANUEL CADARSO SALAMANCA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

School of Electrical Engineering and Computer Science
Kungliga Tekniska Högskolan

Influence of different frequencies order in a multi-step LSTM forecast for crowd movement in the domains of transportation and retail

MASTER THESIS
ICT Innovation

Author: Manuel Cadarso Salamanca
Supervisor: Sarunas Girdzijauskas
Examiner: Henrik Boström

Course 2017-2018


Acknowledgements

I would like to thank my family and friends for their support during these two years of the master, and my girlfriend Ainara, because without her I would not have been able to do any of this.


Sammanfattning

Denna avhandling presenterar ett tillvägagångssätt för att förutspå förflyttning inom folkmassor med hjälp av LSTM-neurala nätverk. Specifikt analyseras inflytandet som olika frekvenser av tidsserier har på både prognosen för folkmassorna och designen i arkitekturen inom transport och handel. Arkitekturen påverkas även då frekvensändringar provocerar fram en ökning eller minskning i datamängd och arkitekturen därför bör anpassas. Tidigare forskning inom prognoser relaterade till folkmassor har huvudsakligen fokuserat på att förutspå folkmassans nästa förflyttning snarare än att definiera mängden människor på en specifik plats under ett specifikt tidsspann. Dessa studier har använt olika tekniker som till exempel Random Forest eller Feed Forward neurala nätverk för att ta reda på inflytandet som de olika frekvenserna har över prognosens resultat. Denna avhandling tillämpar istället LSTM-neurala nätverk för analysering av detta inflytande och använder specifika fältrelaterade tekniker för att hitta de bästa parametrarna för att förutspå framtida välstånd i folkmassor. Resultatet visar att frekvensordningen i en tidsserie tydligt påverkar resultatet av prognoserna inom transport och handel, och att detta inflytande är positivt när frekvensordningen av tidsserierna kan fånga upp frekvensens form i prognosen. Därför, med frekvensordningen i åtanke, visar resultaten i prognoserna för de analyserade platserna en förbättring på 40% för SMAPE och 50% för RMSE jämfört med inhemska tillvägagångssätt och andra tekniker. Utöver detta visar de även att det finns ett samband mellan frekvensordningen och komponenterna i arkitekturerna.

Nyckelord: LSTM, frekvens, prognos, neurala nätverk


Abstract

This thesis presents an approach to predicting crowd movement in defined places using LSTM neural networks. Specifically, it analyses the influence that different time series frequencies have on both the crowd forecast and the design of the architecture in the domains of transportation and retail. The architecture is also affected because changes in the frequency provoke an increment or decrement in the quantity of data and, therefore, the architecture should be adapted. Previous research in the field of crowd prediction has mainly focused on anticipating the next movement of the crowd rather than defining the amount of people during a specific range of time in a particular place. These studies have used different techniques, such as Random Forest or feed-forward neural networks, to find out the influence that the different frequencies have on the results of the forecast. This thesis, however, applies LSTM neural networks to analyse this influence and uses specific field-related techniques to find the best parameters for forecasting future crowd movement. The results show that the order of the frequency of a time series clearly affects the outcomes of the predictions in the fields of transportation and retail, this influence being positive when the order of the frequency of the time series is able to catch the shape of the frequency of the forecast. Therefore, taking the order of the frequency into account, the results of the forecast for the analysed places show an improvement of 40% for SMAPE and 50% for RMSE compared to the Naive approach and other techniques. Furthermore, they point out that there is a relation between the order of the frequency and the components of the architectures.

Key words: LSTM, frequency, forecast, neural network


Contents

List of Figures
List of Tables

1 Introduction
   1.1 Background
   1.2 Problem
   1.3 Purpose
   1.4 Goal
      1.4.1 Benefits, Ethics and Sustainability
         1.4.1.1 Benefits
         1.4.1.2 Ethics
         1.4.1.3 Sustainability
   1.5 Methodology
   1.6 Delimitations
   1.7 Outline

2 Extended background
   2.1 Nature of the data
      2.1.1 Human phenomenon
      2.1.2 Time series
   2.2 Forecast models for time series
      2.2.1 General concepts of the models
         2.2.1.1 Multi-step Forecast
         2.2.1.2 Sliding window
      2.2.2 Naive approach
      2.2.3 ARIMA models
      2.2.4 Machine learning
         2.2.4.1 Random forest
         2.2.4.2 Deep Learning - Neural networks
         2.2.4.3 LSTM networks

3 Methods
   3.1 Chosen method
   3.2 Data Preparation
      3.2.1 Stationary
      3.2.2 Featuring scale
   3.3 Frequency order
      3.3.1 Resampling
      3.3.2 Time steps
      3.3.3 Neural network and change order of the frequency
   3.4 Neural network architecture
   3.5 Metrics
      3.5.1 SMAPE
      3.5.2 RMSE
   3.6 Experiment Design

4 Results
   4.1 Impact of different frequencies in Multi-step forecast
   4.2 Comparison with another places prediction
   4.3 Comparison with other techniques

5 Conclusion and Future work
   5.1 Review of the goals
   5.2 Different order
   5.3 Research question
   5.4 Future work

Bibliography


List of Figures

2.1 Sliding window adapted from [1].
2.2 Comparison of the autocorrelation of a time series before applying the differencing method (left illustration) and after the differencing (right illustration). Adapted from [2].
2.3 Example of a decision tree algorithm adapted from [3].
2.4 Graphical representation of the architecture of a multilayer neural network adapted from [4].
2.5 Example of three memory blocks connected following the first architecture [5]. Figure adapted from [6].
2.6 Explanation of the internal mechanism of a memory block with the popular modification of the peepholes [7]. Figure adapted from [6].
3.1 Representation of how the antennas work. Own elaboration.
3.2 Station from January until May. Own elaboration.
3.3 Promenade from January until May. Own elaboration.
3.4 Mall from January until May. Own elaboration.
3.5 Autocorrelation of the Mall before applying the differencing. Own elaboration.
3.6 Histogram of the Mall before applying the differencing. Own elaboration.
3.7 Autocorrelation of the Mall after applying the differencing. Own elaboration.
3.8 Histogram of the Mall after applying the differencing. Own elaboration.
3.9 Representation of the Tanh activation function. Adapted from [8].
3.10 Representation of the time series of the Mall with a frequency order of five minutes. Own elaboration.
3.11 Representation of the time series of the Mall with a frequency order of one hour. Own elaboration.
3.12 How to divide a dataset for a regression problem. Own elaboration.
4.1 Standard architecture of the LSTM models defined for a frequency order of one hour. Own elaboration.


List of Tables

3.1 Comparison of the statistics for the Station in a different order.
4.1 Comparison of the performance of the network with Dropout and L2 regularization.
4.2 Hyperparameters search. The number of time steps is based on an hourly frequency order.
4.3 Best configuration of hyperparameters for each frequency order.
4.4 Results of each frequency order model with a similar architecture.
4.5 Comparison of each different method for the Station.
4.6 Comparison of each different method for the Station.
4.7 Comparison of each different method for the Station.
4.8 Comparison of each different method for the Promenade.


CHAPTER 1

Introduction

1.1 Background

The growth of population in the metropolitan areas of the world during the last century is a problem [9][10] that has affected, in one way or another, every kind of business, public demonstration and national festivity throughout the years. One of the most common problems that the people in charge of these businesses have to deal with is how to allocate their resources; in other words, they have to approximate the number of customers or attendants at their events or shops at any given time.

The world is more competitive than it has ever been, hence a small advantage can make the difference between failure and success in any business, service or project. Therefore, being able to estimate the proper number of clients or resources that a company may have during a specific working day would increase a company's capacity to be competitive. However, this situation has become even more complicated for two reasons. First, as Duranton & Puga [9] mention, since 1920 the average growth of citizens per decade for the 366 most significant metropolitan areas in the USA was 17.9%.

This population growth in the cities has further complicated the foresight of attendance at massive events because of the behaviour of crowds [11].

Second, the arrival of social networks has modified the paradigm of how we meet and relate to other people. Nowadays, an event may gather thousands of people with a simple tweet or a Facebook event in a few minutes [12][13].

Although the overall situation seems complicated, several techniques have been applied during the last century to confront these kinds of problems. The most common approach is the use of predictive models based on historical data to try to discover the hidden behaviour and the trend of these events. However, there are several options in terms of prediction, hence a good approach to narrow the general prediction problem down is to define and study the workloads used for this thesis. Depending on the nature of the data, some models will be able to achieve better performance than others. This project specifically investigates crowd movement in the domains of transportation and retail; the specific characteristics of these areas are explained in the next sections.


Therefore, all the models will be defined taking these two specific workloads into account.

On the other hand, it is also crucial to determine the horizon that the forecast has to cover. Forecasting the number of users in a specific area for the next ten minutes is not the same as forecasting the next week. The further the horizon that the forecast has to predict, the more complicated it becomes to hit the proper future behaviour and, therefore, to achieve good results. It was decided that the forecast horizon would be the next twenty-four hours of users, with the results displayed at an hourly resolution. Having defined the aim of this project, the following lines show which techniques best fit these requirements. One of the first techniques applied in this field was the ARIMA model. ARIMA models were created in the 1970s and, since then, have been one of the most useful methods for dealing with forecast problems in a time series context [14][15]. However, finding the best parameters for these models is a complicated process [16], and some specific conditions [17] are required in order to apply them.

An alternative to this traditional approach is Deep Learning [4]. These techniques have been applied in several different fields, from image recognition to weather prediction.

What makes this approach interesting is its capacity to learn from historical data and the hidden structures of the time series. The best-known technique in this field is the neural network [4]. Indeed, there is a specific neural network that has the property of remembering short- and long-term sequences; this characteristic is exactly what this project demands. This neural network is known as the LSTM network [18]. Another remarkable feature of LSTM networks is that they can easily be applied to any time series without requiring major transformations [19].

Thus, taking all the benefits of LSTM networks into account, this project will use this technique to model the problem of a multi-step forecast for different time series frequencies over the defined forecast horizon and for the specific domains presented in this section.

1.2 Problem

Most of the algorithms and studies previously proposed in the field of crowd behaviour have mainly focused on trying to predict the next movement of crowds [20][21][22].

However, in this thesis, the phenomenon to be treated is the evolution of crowd movement in defined places. Therefore, the scope of this project is to create a model for forecasting the number of people in a place at a defined time in the future. As mentioned in the background, LSTM networks are the chosen approach for facing the forecasting problem.

The technique is already described in several publications and papers [18][7], several of which emphasize how to model most of the variables and features of the models. At present, there are works that explain everything, from


the hyperparameters of the neural networks and the number of layers to the best activation functions for a specific problem. Before presenting the knowledge gap, it is important to explain the concept of frequency order. Frequency order is defined as the number of events equally repeated per unit of time; a change of order is an increment or decrement of the events per unit of time. For example, resampling a five-minute series to an hourly series changes its frequency order.

After carrying out a few experiments, it was discovered that time series with different orders could show huge differences in results, using similar architectures and the same data at a different frequency order. Therefore, it was found that, in the field of LSTM models, there is a knowledge gap concerning how the frequency order of a time series influences the configuration of the neural network and the results of the forecast.

The next step after this finding was to verify whether other techniques had confronted this problem or whether, on the contrary, this matter had not been treated before. It was found that ARIMA models have studied these kinds of problems [23]. Also, the most common neural network approach (feed-forward) has been used to study the relationship between the frequency order of the data and the forecast results [24]. Although that paper is based on feed-forward neural networks instead of LSTM networks, useful and applicable conclusions from it can be used as the baseline for this thesis project.

1.3 Purpose

This thesis will discuss how different frequency orders in the domains of transportation and retail impact the results of the forecast of an LSTM model. The relation between the frequency order of the time series and the architectural features of the LSTM models will also be studied in depth.

Therefore, based on the previous statements and having identified a knowledge gap, which this project will study, the research question is the following:

"Can the frequency order of a time series influence the forecast results and the architecture design of the LSTM model for the fields of transportation and retail in a phenomenon of crowd movement?"

Besides that, the obtained model is going to be compared with the Naive approach and other techniques in order to measure its performance.

1.4 Goal

The goals that this master thesis expects to achieve are the following:

1. Create a multi-step LSTM model that is exportable to different frequency orders.

2. Investigate which is the best approach for changing the frequency order.


3. Compare the results that different frequency order models achieve for the same time series data with a similar architecture.

4. Define which is the best metric for comparing completely different time series frequency orders.

5. Compare the results of the LSTM models with other forecast techniques such as the Naive approach, ARIMA models and Random Forest.

6. Find a method for determining the time steps of the LSTM model.

1.4.1. Benefits, Ethics and Sustainability

1.4.1.1. Benefits

In academic terms, this project will be interesting for people who would like to learn more about how the frequency order affects a multi-step LSTM model in the specific domains explained above. Moreover, from a company perspective, this model might be used to provide a service that measures customer demand and helps other companies use that information to their benefit.

1.4.1.2. Ethics

Two thousand eighteen will be remembered as the year in which one of the most significant laws in terms of data protection arose in Europe: the GDPR [25].

This law provides more rights to users, giving them the possibility of asking which of their data a company is keeping, or of demanding that all the existing data in their profile be erased. Another important fact is that companies must now ensure that they store and keep the data safe and protected. Companies that do not meet these requirements face huge fines.

Specifically, this project utilises sensitive data about the locations of Telia customers. Therefore, the first operation that must be done before any other is to anonymise the users' data. Basically, a double hash function has been applied for every user to protect their information. Also, the real information about the places and the number of people in them will not be explicitly provided, in order to preserve the security of that information. Hence, the information about the areas has been modified to be unrecognizable.

1.4.1.3. Sustainability

One of the multiple applications that this model provides in terms of sustainability has to do with environmental purposes. One clear example would be estimating the amount of rubbish that a massive event could produce over a few days. The data regarding the number of estimated attendants from the LSTM models could be processed to calculate the number of bins required during the event and the number of dustmen needed afterwards. In this way, LSTM models can help reduce the environmental impact that these kinds of events produce in the places where they are held.


Another possible use might be related to the field of retail. For instance, a restaurant might know the estimated customer demand for a specific day and could therefore measure the amount of food needed. This would save a lot of food that would otherwise likely be wasted or expire.

Considering all the previous cases, and based on the sustainable development goals studied in the research methodology course, this project mainly supports goal number eleven: Sustainable cities and communities [26].

1.5 Methodology

Due to the nature of the research question, several different approaches could be chosen to answer it. However, a practical criterion for choosing the appropriate approach is to evaluate the available resources of the project and decide which approach fits the data of the thesis best.

On the other hand, once the data sources were defined, the next step was to proceed with the methods. According to Håkansson [27], choosing the right methods is one of the most critical parts of conducting a research project. This phase makes the difference between outstanding work and mediocre research. For a master thesis, it is convenient to apply a combined approach of qualitative and quantitative methods to cover the entire field and the research questions created by this project.

Based on the research question, the philosophical assumption that best fits this project is Positivism [27]. This philosophical trend assumes that reality is objectively given and independent of the observer and instruments. Researchers test theories, usually in a deductive manner, to increase the predictive understanding of a phenomenon. In this project, the starting point is the document [24], and from there all remaining doubts are addressed through other complementary papers. Moreover, several tests will be done, and they will have to be passed in order for the hypotheses to be stated as valid.

Basically, the question will be answered by running several experiments, keeping the variables that are not tested fixed and changing the target feature. For example, to answer the first part of the research question about the influence of different frequency orders on the results of the forecast, the experiment will fix all variables and just change the frequency. After that, the output of the forecast, which is based on the metrics, will be analysed in depth. These results are quantitative, since the metrics provide a numerical result.

In addition, to answer the second part of the research question, which presents a knowledge gap about how the different frequencies affect the design of the architecture, more layers will be added, and the best proportion will be defined by running several tests. The best proportion will be selected based on the metrics; hence these experiments are quantitative too.

Finally, the performance of the LSTM model will be studied by comparing it to other models such as the Naive approach. For this experiment, all the variables will be set up in the same way, and the only characteristic that changes will


be the technique. The results will be classified based on the metrics, therefore this test is quantitative as well.

1.6 Delimitations

This project uses data that comes from a specific domain (tracking of the antennas), describes a particular human phenomenon (movement of people in a specific place), forecasts a particular horizon (the next twenty-four hours) and is focused on two areas (transportation and retail). Therefore, the project is limited by the requirements of the data.

Therefore, the project outcomes are only going to be fully useful and exportable to other projects that have exactly the same kind of conditions. For example, if the forecast horizon were changed, the conclusions might not be applicable. On the other hand, it would be interesting to test all the findings of this thesis in a project with roughly similar features and observe whether the same conclusions can be reached.

Another point that must be highlighted is that all the data used in this master thesis came from the city of Oslo. Like the previous point, this fact limits the capacity to export the conclusions of the project to other cities: the project uses data from just one city, which is not enough to extend the results to a broader solution.

Thirdly, deep learning models demand a lot of data to be accurate and to understand the hidden structures of the time series. Therefore, it is essential to possess enough computational power to run the models. In this case, Amazon Web Services has been used to develop the models. Specifically, the models were developed using thirty-two virtual CPUs and one NVIDIA Tesla K80 GPU. The execution of each model usually needs around eight hours to obtain the desired results.

Although several model architectures and hyperparameters have been tested, it is impossible to try all the possible combinations of components and hyperparameters. Thus, each of the variables that the model could choose was selected based on theoretically grounded knowledge. From a practical point of view, it is likely that another model not tested in this project could reach better results than the ones presented.

1.7 Outline

As has already been outlined, the first chapter of this thesis is the introduction, which includes the following subsections: background, problem, purpose, goal, delimitations and the outline of the project. The next chapter is the extended background. This section relates the evolution of forecast techniques, from the simplest multi-step forecast approach up to the LSTM models. Moreover, all the theoretical concepts related to the human phenomenon of crowd movement in a particular place, and how this behaviour can be modelled as a time series, are studied in this section.


Secondly, chapter 3 explains which scientific methods have been chosen to adequately shape the LSTM models and how they can be used to investigate the influence of different time series frequencies. Chapter 4 presents the results obtained by the models at different frequencies and places, comparing them with each other; furthermore, a comparison with other forecast techniques is made. In chapter 5, the results obtained from the models are discussed, covering the different perspectives that these results have generated during the entire process. Finally, the last chapter presents the conclusion and future work.


CHAPTER 2

Extended background

2.1 Nature of the data

2.1.1. Human phenomenon

Human behaviour is a field that has been widely investigated from thousands of different perspectives, from how humans learn in the early stages of life [28] to human sleep behaviour [29]. Narrowing the domain down, this project is interested in understanding whether human beings are creatures of habit or whether, on the contrary, human behaviour does not follow any kind of pattern. The majority of studies on this topic agree that the human being is a creature of habit [30][31]. Having confirmed this point, the next fact that must be demonstrated is the relationship between human habit and the prediction of future human behaviour [32]. The paper Predicting fruit consumption: the role of habits, previous behaviour and mediation effects [33] studied the influence of previous habit and behaviour on the future consumption of fruit. The conclusion was clear: habit and previous behaviour are important predictor variables for forecasting future behaviour. Therefore, it is confirmed that there is a connection between previous human behaviour and future prediction in several fields. The data that this thesis is going to use reflects the movement of people in a specific place during a range of time.

Consequently, and in relation to the previous affirmations, the question that may arise for the reader is the following:

"Is the human behaviour an important predictive fac- tor in the forecast of the movement of people in a par- ticular place?"

There are dozens of studies that prove the relationship between previous human behaviour and the prediction of the movement of people in the future [20][21][22]. Although the knowledge exposed in those papers is more related to the prediction of the next movement, the knowledge base is the same for the problem of this master thesis.


Finally, it can be affirmed that human beings do like routines and habits, and based on that statement it can be concluded that the historical record of the movement of people in a place is enough information for predicting the number of people in that specific area in the future.

Another important fact that must be commented on is that the chronological order of the data is a key property of this data. This property represents the main feature of the model that is going to be presented in the next point and used throughout the entire thesis. This model is known as the time series, and all its attributes and features will be explained in the following points.

2.1.2. Time series

The data collected for this master thesis is the historical record of the movement of people in three specific areas. According to Brockwell and Davis [34], a time series is a set of observations recorded at specific times with a particular frequency. Based on this definition, the historical data of this project, which is an ordered sequence of user counts with a frequency order of five minutes, fits the statistical model called a time series.

Time series have been used for different purposes during the last decades. The most important uses of these models have mainly been identifying, analysing and understanding the observed data, and predicting future values of the series. Based on the nature of their frequency, time series can be classified into different groups: discrete-time series and continuous-time series. Discrete-time series are formed by a set of variables that occur at different points in time and are repeated with a specific frequency. On the other hand, continuous-time series are obtained when the observations are recorded continuously over some time interval.

Time series are complex structures, and good performance is not achieved by interpreting raw data. For a better understanding of the full picture of this model, it is recommended to decompose the time series into four specific components, defined in the following way by the authors [34]:

1. Trend: the long-term pattern of a time series. It represents the underlying level and, depending on the behaviour, can be considered positive in the case of an increasing long-term pattern, or negative in the contrary case.

2. Cyclical: a cyclic pattern exists when data exhibit rises and falls that are not of a fixed period.

3. Seasonal: a seasonal pattern exists when a series is influenced by seasonal factors (e.g., the quarter of the year, the month, or the day of the week). Seasonality is always of a fixed and known period.

4. Irregular component or noise: describes the random behaviour, in other words the irregular influences.

Using an additive model, the formula is:

Time series = Trend + Seasonal + Cyclical + Noise    (2.1)
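To make the decomposition concrete, the following is a minimal sketch applied to a synthetic hourly count series, assuming the statsmodels library is available; note that seasonal_decompose returns only trend, seasonal and residual parts, so the cyclical component of equation (2.1) ends up folded into the trend and the residual.

```python
# Minimal sketch: additive decomposition of a synthetic hourly series.
# statsmodels has no separate cyclical part; it is absorbed by the
# trend and residual components.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

index = pd.date_range("2018-01-01", periods=24 * 60, freq="1H")
counts = pd.Series(
    100
    + 10 * np.sin(2 * np.pi * np.arange(len(index)) / 24)  # daily seasonality
    + np.random.normal(0, 3, len(index)),                  # irregular noise
    index=index,
)

parts = seasonal_decompose(counts, model="additive", period=24)
trend, seasonal, residual = parts.trend, parts.seasonal, parts.resid
```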


Besides this previous manner of classifying time series, there is another remarkable way of ordering them, based on the stationarity properties of the time series. Stationary time series demand that all statistical features, such as the trend, the variance and the autocorrelation structure, do not change over time. On the contrary, time series that do not fulfil these conditions are considered non-stationary [35]. The main advantage of satisfying this attribute comes from these two reasons:

1. It is easier to obtain meaningful sample statistics such as means, variances, and correlations with other variables [35].

2. Most statistical prediction methods for time series are based on the premise that the time series is stationary, or approximately stationary [35]. This point about approximating a non-stationary time series to a stationary one will be expanded in the methodology section.

Based on the previous definitions, this project is forming a discrete-time series, because the movement of the number of people is recorded every five minutes. Time series of daily or shorter time intervals are accepted as high-frequency data [24]. This frequency presents different features compared to low-frequency time series [36]. In this case, depending on which range of time is to be forecasted, the order of the time series needs to be adapted. The largest change of frequency order that the models will down-sample is from five minutes to an hourly approach; therefore, all of the frequencies shown in this project are considered high frequencies. The other point that is essential to stand out among the applications of time series is their capacity to be used for predictive purposes. This point will be studied in depth in the next section.
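Before moving on, the sketch below illustrates the change of frequency order mentioned above: down-sampling a synthetic five-minute count series to an hourly one with pandas. It is only an illustration of the operation, not the actual pipeline of this thesis.

```python
# Minimal sketch: down-sampling from a five-minute to an hourly
# frequency order. Summing keeps the total number of observed users
# per hour; .mean() would instead keep the average five-minute level.
import numpy as np
import pandas as pd

index = pd.date_range("2018-01-01", periods=12 * 24 * 7, freq="5min")
counts_5min = pd.Series(np.random.poisson(30, len(index)), index=index)

hourly = counts_5min.resample("1H").sum()  # 12 five-minute bins per hour
```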

2.2 Forecast models for time series

Time series models mainly use the historical data of a time series for forecasting future results [15]. Usually, these forecast models are focused only on the forecast of the next observation; these kinds of predictions are called one-step forecasts. However, this project is interested in predicting the next twenty-four hours of movement of users, which means that more than one step is necessary.

Therefore, a multi-step forecasting approach is needed [37]. In the next points, the models will be described from the simplest multi-step forecast approach up to the LSTM models, presenting the advantages and downsides of each.

2.2.1. General concepts of the models

2.2.1.1. Multi-step Forecast

As was briefly commented in the introduction of this section, the analysis and forecast that this thesis aims at is a multi-step forecast. The company is interested


in knowing the movement of people at least for the next day. Thus, the expected results are the next twenty-four hours after the last day of data. Since a multi-step forecast is necessary, the most common methods of the domain are commented on:

• Recursive: in this strategy, a single model f is trained to perform a one-step-ahead forecast, which is applied repeatedly, feeding each prediction back as input for the next step [38].

• Direct: consists of forecasting each horizon independently of the others [38].

The first approach usually achieves good performance for datasets without noise. However, our data is a representation of human movement in a specific place, so any collateral variable, such as a day off, will introduce noise into the frequency. Taking this into account, the method that is going to be applied in all cases is the direct approach. In addition, the direct method has the property of not accumulating errors: if under any circumstance one of the forecasts obtains a biased result, that situation will not affect the next ones [38].
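The following is a minimal sketch of the direct strategy under simple assumptions: a plain linear regressor stands in for the forecaster, and the window and horizon sizes are placeholders rather than the values tuned in this thesis.

```python
# Minimal sketch: the direct multi-step strategy trains one independent
# model per horizon step, so no forecast error is fed forward.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_direct(series, window=48, horizon=24):
    models = []
    for h in range(1, horizon + 1):
        X, y = [], []
        for t in range(window, len(series) - h + 1):
            X.append(series[t - window:t])  # past `window` observations
            y.append(series[t + h - 1])     # value exactly h steps ahead
        models.append(LinearRegression().fit(np.array(X), np.array(y)))
    return models

def predict_direct(models, last_window):
    # Each model forecasts its own horizon step independently.
    return np.array([m.predict(last_window[None, :])[0] for m in models])

series = np.sin(np.arange(500) / 10.0)  # stand-in for the count data
forecast_24h = predict_direct(fit_direct(series), series[-48:])
```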

2.2.1.2. Sliding window

This method bases its foundations on the correlation between past steps of the time series and the next forecast [39]. In the time series field, it is also called the "lag method". Determining the size of the window, or the size of the lag, is an important task. If the window size is too short, noise dominates and it is hard to catch the underlying dynamics properly. Nevertheless, if the window size is too long, all of the points become sparsely located in the state area. The next illustration shows an example of how the sliding window method works:

Figure 2.1:Sliding window adapted from [1].

To determine the correct window size, a study of the partial lag correlation will be done, and the result found will be applied to all the non-deep-learning techniques. In the methodology section, the methods applied to the neural networks will be explained.
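A minimal sketch of the lag framing follows; the function name and the lag of three used in the example are illustrative, since the real window size is chosen from the partial lag correlation as described above.

```python
# Minimal sketch: turning a univariate series into supervised (X, y)
# pairs with a sliding window ("lag method").
import numpy as np

def sliding_window(series, lag):
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]  # the value right after each window
    return X, y

X, y = sliding_window(np.arange(10.0), lag=3)
# X[0] = [0., 1., 2.] -> y[0] = 3.0, and so on.
```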


2.2.2. Naive approach

The baseline, and the simplest model that might be created in a multi-step forecast approach, is the Naive approach. This method basically forecasts based on the last observed values, without applying any other operation [40]. This is supported by the fact that human behaviour follows habits, and these habits are mathematically translated into a strong seasonal component. The main advantages of this model are the following:

• It is the simplest approach, and therefore does not need much time for building the model.

• It can obtain good results in the short term.

• It is a good starting point for measuring the improvement of other forecast techniques.

Nonetheless, the long-term results do not achieve enough quality, as it is more complex to catch the trend over the long term. Even so, this model is useful for simplifying and summarizing a multi-step forecast problem.
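As a sketch of how simple this baseline is, the function below implements a seasonal-naive forecast under the assumption of hourly data with a daily habit cycle: the next twenty-four hours simply repeat the last observed twenty-four hours.

```python
# Minimal sketch: a seasonal-naive baseline with no fitted parameters.
import numpy as np

def naive_forecast(history, horizon=24):
    # The last observed day is returned as the prediction for the next one.
    return np.asarray(history)[-horizon:].copy()
```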

2.2.3. ARIMA models

ARIMA models have been one of the most established forecast techniques for time series since their creation in the 1970s [15]. ARIMA stands for Autoregressive Integrated Moving Average; the method is also sometimes called Box–Jenkins. It tries to catch the autocorrelation of the time series: basically, the model extrapolates "past results" to forecast the future. It is also important to emphasize that ARIMA models are able to carry out a multi-step forecast. To apply ARIMA models, it is required that the studied time series be stationary, or at least treatable as stationary [17]. The requirements for stationarity are: no trend or seasonality, and a constant level, variance and autocorrelation.

ARIMA models are formed by three important elements: the differencing, or I(d) component; the autoregressive process, AR(p); and the moving average process, MA(q). The first parameter that must be adjusted is the differencing. Differencing is a transformation used for removing the trend, and it can turn non-stationary time series into stationary ones [41]. Basically, it is the subtraction of consecutive values, repeated several times until the root component is erased. This transformation is also used in other techniques to try to stationarize the time series. The following picture shows an example of this transformation applied to the Dow Jones:


Figure 2.2: Comparison of the autocorrelation of a time series before applying the differencing method (left illustration) and after the differencing (right illustration). Adapted from [2].
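Complementing the figure, the sketch below shows first-order differencing on a synthetic pandas Series; it illustrates the transformation itself, not the exact preprocessing used later in the thesis.

```python
# Minimal sketch: first-order differencing, the subtraction of
# consecutive values described above, and its inverse.
import numpy as np
import pandas as pd

counts = pd.Series(np.cumsum(np.random.normal(1, 5, 500)))  # trending data

differenced = counts.diff().dropna()              # y'_t = y_t - y_{t-1}
restored = differenced.cumsum() + counts.iloc[0]  # inverts the transform
```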

The next step of the process is to find the appropriate values of p and q based on the study of the autocorrelation function (ACF) and the partial autocorrelation function (PACF).
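A minimal sketch of this workflow with statsmodels follows; the series is synthetic and the ARIMA order (2, 1, 2) is purely illustrative, since in practice p and q are read from where the PACF and ACF cut off.

```python
# Minimal sketch: ACF/PACF inspection followed by an illustrative
# ARIMA(2, 1, 2) fit; d=1 applies one round of differencing internally.
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

index = pd.date_range("2018-01-01", periods=24 * 60, freq="1H")
counts = pd.Series(np.random.poisson(100, len(index)), index=index)

plot_acf(counts, lags=48)   # the ACF cut-off suggests the MA order q
plot_pacf(counts, lags=48)  # the PACF cut-off suggests the AR order p

model = ARIMA(counts, order=(2, 1, 2)).fit()
forecast = model.forecast(steps=24)  # multi-step: the next 24 hours
```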

The principal advantage of this model is its capacity to forecast future behaviour. Nevertheless, it has downsides, such as the compulsory stationarity condition for the investigated time series, the amount of time needed to reach the best parameters of the ARIMA model for shaping the time series, and the tremendous demand for past data in order to detect past patterns and extrapolate them to the forecast [42][36].

Regarding the topic of how frequency influences a forecast, it has already been studied that a particular frequency of data may positively or negatively influence the results of ARIMA predictions [42]. Therefore, depending on the domain, some frequency orders could be better than the default one.

2.2.4. Machine learning

Machine Learning is defined by Dr. Nikolic as the science of getting computers to act without being explicitly programmed, but instead letting them learn a few tricks on their own [43]. This field gives data science researchers thousands of techniques for dealing with several kinds of problems. This project will focus on one of the branches of ML called supervised learning. The main characteristic, and the differential factor of SL compared with other techniques, is the fact that the input data and the output data are labelled and provided from the beginning. The goal is to create a function that learns from the labelled data and can perform predictions for new inputs. There are two subfields within supervised learning:

• Classification: the output variables belong to a category. For example, imagine that we are interested in classifying patients from a hospital


in a binary classification (the most common case); the possible categories would be "Sick" or "Healthy" [44].

• Regression: the output variables are real values, for instance the temperature of a place [44].

Based on the above definitions, this project perfectly suits the second option. There are several techniques for facing the problem of multi-step forecasting of a time series, but only the most relevant ones will be commented on: random forest and neural networks.

2.2.4.1. Random forest

Random forest is a machine learning ensemble technique that can be used for solving time series problems [45]. This method has gained popularity during the last years due to its capacity to be employed either for classification or prediction problems [45]. Random forests are based on decision trees; therefore, it is important to first understand the mechanism of decision trees before starting with random forests.

Decision trees are structures based on a tree graph that defines branches with conditions based on the features of the data. These conditions have the duty of filtering and delimiting the result of the model, trying to catch the structure of the data. Each branch will be considered a leaf of the tree. In the regression context, the last leaf of the tree ends up in a numerical result. This last leaf is reached once the maximum depth of the tree is touched or there is a minimum set of points. The principal problem that decision trees suffer from is their nature of over-specializing the conditions, which provokes over-fitting. To clarify how this method operates, an example of a decision tree for calculating the temperature is represented:

Figure 2.3:Example of a decision tree algorithm adapted from [3].


Although decision trees are a powerful method, they are not enough for achieving the best outcomes. However, if thousands of decision trees are combined to work on the same forecast problem, the probability of reaching better results increases greatly. This combination of decision trees is known as the random forest. The method uses Breiman's "bagging" idea and the random selection of features to build a collection of decision trees with controlled variation. A formal definition of random forest is provided by Breiman [46]: random forests for regression are formed by growing trees depending on a random vector Θ, such that the tree predictor h(x, Θ) takes on numerical values, as opposed to class labels. The output values are numerical, and we assume that the training set is independently drawn from the distribution of the random vector (X, Y). The mean-squared generalization error for any numerical predictor h(x) is then E_{X,Y}(Y − h(X))^2. The main advantages of random forest are the following:

• Good predictions can be achieved without spending much time on the configuration of the parameters (few parameters).

• It has a mechanism for selecting the important features of the models.

On the other hand, this model suffers from the same issue as its predecessor, causing overfitting in the results [47]. Regarding the main topic of this thesis, time series forecasting, the random forest method has been used in different study cases and comparisons for time series [48][49]. However, there is no clear study about the influence of different time series frequency orders on the results of the forecast. Hence, for comparing the performance that random forest can achieve depending on the different orders of the time series, the frequencies used in the performance comparison will be those suggested for the LSTM networks, or the ones that the ARIMA models suggest based on the partial correlation.
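The sketch below shows this setup under simple assumptions: lag features built with the sliding window of section 2.2.1.2 and illustrative hyperparameters, not the configuration tuned in the experiments.

```python
# Minimal sketch: random forest regression on sliding-window lag features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
series = rng.poisson(100, size=2000).astype(float)  # stand-in count data

lag = 24
X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
next_value = forest.predict(series[-lag:][None, :])  # one-step prediction
```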

2.2.4.2. Deep Learning - Neural networks

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations. These methods aim to learn feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. Automatically learning features at multiple levels of abstraction allows a system to learn complex functions mapping the input to the output directly from data, without depending completely on human-crafted features [50]. This technique stands in contrast to traditional task-specific algorithms. It has become one of the most used techniques during the last years because of its capacity to be employed in different fields such as computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics and others. Being able to solve both supervised and unsupervised problems is a crucial feature that has boosted the applicability of this method. In addition, the other factor that has earned deep learning this popularity is its ability to process a tremendous volume of raw data [4].

Deep learning models are mainly based on artificial neural networks. Neural networks try to simulate the way human brains learn. According to


Schmidhuber [51], a formal definition of a neural network is the following: a standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment; other neurons get activated through weighted connections from previously active neurons. Some neurons may influence the environment by triggering actions. Learning, or credit assignment, is about finding weights that make the NN exhibit desired behaviour, such as driving a car. Depending on the problem and how the neurons are connected, such behaviour may require long causal chains of computational stages, where each stage transforms (often in a non-linear way) the aggregate activation of the network. A graphical representation of a neural network is the following:

A standard neural network (NN) consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neu- rons get activated through sensors perceiving the environment, other neurons get activated through weighted connections from previously active neurons. Some neurons may influence the environment by triggering actions. Learning or credit assignment is about finding weights that make the NN exhibit desired behaviour, such as driving a car. Depending on the problem and how the neurons are con- nected, such behaviour may require long causal chains of computational stages, where each stage transforms (often in a non-linear way) the aggregate activation of the network. A graphical representation of how the neural networks look like will be the following:

Figure 2.4: Graphical representation of the architecture of a Multilayer neural network adapted from [4].

Based on their properties, neural networks may be split into three different classes of structures:

• Feed-forward neural network: these networks are the simplest ones and were created in the 1960s [52]. The principal differential point of these networks is their incapacity for creating cycles. In other words, the information follows just one direction: from the input through the hidden layers to the output. This kind of composition is also known as a perceptron. The most common feed-forward architecture is the multilayer perceptron. Each layer of the network, without counting the input nodes, has a nonlinear activation function, and the network uses backpropagation as a supervised learning technique. An important characteristic of the architecture of this network is that each node in a specific layer connects to every node in the following layer with a specific weight. This circumstance is known as fully connected.

• Convolutional neural network: these networks are obtaining incredible results in the area of computer vision, basically because they can take advantage of the convolution


operation (mainly filters) produced by the convolution layers. The convolution simulates the behaviour of a neuron in response to visual stimuli [53]. Another distinguishing feature of these nets is that, in the majority of cases, the whole architecture of the network is not fully connected, which reduces the number of parameters and the complexity of the network.

• Recurrent neural network: the main attribute of the structure of this network is the possibility for the nodes to be connected in both directions, therefore creating loops in the network [54]. This attribute is completely opposite to a feed-forward neural network. The capacity for creating loops allows the hidden state produced by a previous input to be applied together with the current input to generate the current output. In other words, it uses memory to obtain more accurate results. This property is extremely useful for catching the behaviour of something that follows a sequence. Time can be interpreted as a finite sequence of events; hence, this kind of architecture performs well on problems related to time.

Considering the class of problem that this project would like to face, and the previous statements, the neural network that works best is the recurrent neural network. However, recurrent neural networks have a significant known issue when they try to remember long sequences. This problem is called the vanishing gradient, and it frequently impairs the accuracy of long-term forecast models [5]. To understand this problem, it is first necessary to comprehend the backpropagation method of neural networks. This method essentially calculates the gradient of the loss function, and its output is used to update the weights of the neural network; these computations moderate the scope of learning of the network. With this concept clear, the vanishing gradient problem present in recurrent neural networks can be explained. At every time step during training, the neural network uses the same weights to calculate events, and this multiplication is performed during back-propagation too. The further the sequence moves backwards, the bigger or smaller the errors become. Thus, the network has trouble remembering events from far away in the sequence and makes predictions based only on the most recent ones. This issue is solved in the evolution of the recurrent neural network called Long Short-Term Memory [18]. Apart from that, LSTM neural networks were also chosen for their proven ability to beat time series problems.

2.2.4.3. LSTM networks

Long Short-Term Memory networks are a particular class of recurrent neural networks, able to learn long and short-term structures. These networks were created in 1997 by Hochreiter and Schmidhuber [18]. The main peculiarity of LSTMs is their ability to remember information over long periods of time. The element responsible for providing memory is called the "memory block". The following lines explain the internal mechanism of this component.


Figure 2.5: Example of three memory blocks connected following the first architecture [5]. Figure adapted from [6].

First of all, let's explain the inputs that this memory block gets. The structure receives three sources of information: two carrying the information of the previous cell, and one carrying the current input. The top horizontal line controls the cell state. There are three gates, each combining a sigmoid neural network layer and a pointwise multiplication operation, in charge of letting information into the cell state. The first sigmoid layer is also called the forget gate layer. It looks at the output of the previous block and the new input vector, and generates a number between zero and one for each number in the cell state that contains information from the previous memory block. A one means keeping the information, while a zero means discarding it. The mathematical function is:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (2.2)

The next step selects which new information is going to be included in the cell state. This operation is performed through a sigmoid layer, which decides the new values, and a hyperbolic tangent (tanh) layer, which generates a vector with the candidates D_t.

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (2.3)

D_t = tanh(W_c · [h_{t−1}, x_t] + b_c)    (2.4)

The following step multiplies the previous state by f_t, forgetting the useless information, and combines the result with the candidates D_t, scaled by i_t, which determines how much each state value is updated.

C_t = f_t ∗ C_{t−1} + i_t ∗ D_t    (2.5)

Finally, the output is determined. This output is based on the cell state, but it is a filtered version. First, a sigmoid layer chooses which parts of the cell state are going to be the output. Then, the cell state is put through a tanh layer and multiplied by the output of the sigmoid gate, so that only the interesting parts are output.

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)    (2.6)


h_t = o_t ∗ tanh(C_t)    (2.7)
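As a compact illustration, the sketch below implements equations (2.2)–(2.7) as a single forward step in NumPy. It is a minimal sketch rather than the thesis implementation: the function and parameter names are hypothetical, and the peephole variant described next is not included.

```python
# Minimal sketch: one LSTM forward step following equations (2.2)-(2.7).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)       # (2.2) forget gate
    i_t = sigmoid(W_i @ z + b_i)       # (2.3) input gate
    D_t = np.tanh(W_c @ z + b_c)       # (2.4) candidate values
    C_t = f_t * C_prev + i_t * D_t     # (2.5) new cell state
    o_t = sigmoid(W_o @ z + b_o)       # (2.6) output gate
    h_t = o_t * np.tanh(C_t)           # (2.7) new hidden state
    return h_t, C_t

# Illustrative shapes: hidden size 4, input size 3.
rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 7)) if i % 2 == 0 else np.zeros(4)
          for i in range(8)]
h_t, C_t = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), *params)
```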

The LSTM described above is the most common LSTM model, although there are other LSTM models with some modifications. A popular modification was created by Gers and Schmidhuber in 1999 [7]. The main point of this alternative architecture is the management of the memory from previous blocks: in this case, the memory is an input into the gates too. This modification is known as the peephole connection, and it will be used in the implementation of this project.

Figure 2.6: Explanation of the internal mechanism of a memory block with the popular modification of the peepholes [7]. Figure adapted from [6].

Having summarized all the advantages and explained the methods that LSTM models use to be able to remember, the next step will be to investigate how the frequency affects the output in a multi-step forecast and how the frequency modifies the architecture of a neural network.


CHAPTER 3

Methods

In this chapter, a study of the data is presented, detailing all the transformations necessary for preparing the data for the experiments. Moreover, each variable that the LSTM models utilize is set up based on previous studies, leaving just the frequency as the variable to play around with, in order to investigate its influence on the forecast for the domains of transportation and retail.

3.1 Chosen method

The research question may be answered by different approaches; therefore, one of the most important parts of the project is to select the appropriate one for solving the problem. This problem was summarized in the research question, so the key to choosing a method will mainly be based on the nature of this question. The most common methods in this kind of project are the following:

• Theoretical method: a theory is formulated to explain, predict, and understand phenomena and, in many cases, to challenge and extend existing knowledge within the limits of critical bounding assumptions [55][27]. One of the main strengths of the theoretical method over the empirical one is its ability to define boundaries for the problem and therefore to control the results better. However, this method usually needs a deep investigation of the field to elaborate the premises on which the problem will be solved. This fact might stall the investigation if solid assumptions cannot be found, consuming more time than expected.

• Empirical method: is based on observed and measured phenomena, and derives knowledge from actual experience rather than from theory or belief. It is usually divided into two parts: data collection and analysis [56]. The main advantage of this method is its capacity to adapt to changing situations, as well as the possibility of overcoming theoretical limits, ending up in new findings. On the other hand, this method can assume incorrect premises as correct, reaching inaccurate conclusions [57].

Both approaches could be applied to answer the research question, so the possible options will now be shown. Based on the research question, the theoretical

21

(33)

22 Methods

method would need to study how other techniques have addressed this problem and extract conclusion for building the premises. On the other hand, the em- pirical method would focus on creating different tests for covering the different scenarios that the research question proposes, analyse the results and generating conclusions. Due to this project have a lot of real data regarding the problem and the multiples scenarios can be designed for solving all the gaps in the re- search question. Therefore, it will make more sense to use an empirical approach instead of the theoretical. In addition, a theoretical approach might not be ap- plied, because there are no previous studies in the matter of how frequency affect LSTM and even there are just a few investigations on how the frequency affects the forecast in other techniques and it would be difficult to find assumptions.

Having chosen the empirical method as the research approach, it is necessary to analyse its components in depth and how it will answer the research question. As introduced in the first paragraph of this section, the empirical approach is composed of two parts. The first one explains how the data was collected. The next section of the document fully explains this process, but it can be briefly summarized as follows: the data was obtained from the antennas that provide connectivity to users in the three places of the city previously defined. It should be highlighted that these three places belong to the specific domains that Telia wants to study: transportation and retail. One of the main reasons for choosing those domains was that the phenomenon of movement variation is larger there than in other domains.

For that reason, all the places that did not belong to the required domains were filtered out and not tracked. In addition, the data used in this project can be classified as quantitative, because the studied variable counts the number of people in a specific place during a finite period of time. The variable is therefore numerically quantified, fitting the requirements of quantitative research.

The second part, as explained before, consists of the analysis and the extraction of conclusions from the created experiments; these results can be evaluated either quantitatively or qualitatively. In this project, the performance of the different experiments is assessed by comparing their results through the metrics of the model. For example, the best learning rate of the neural network is chosen by testing different values and selecting the one that obtains the best performance. Thus, it can be concluded that the evaluation and analysis parts of this empirical method are also quantitative. Finally, it is important to mention which metrics were considered and which ones were actually chosen for the project. Since the main purpose of this thesis is to find out whether the frequency order of a time series affects the result of the forecast in an LSTM model, the desired metric should be able to compare the results of different models without losing information. The first candidate is MAPE, the Mean Absolute Percentage Error, which meets this requirement. However, MAPE has drawbacks, such as producing biased comparisons and being unable to handle zero values [58]. Therefore, it was decided to use sMAPE, which fixes the problems of MAPE and allows a fair comparison among results from different models. On the other hand, the project also needed to measure the accuracy of the model; for that task, RMSE was chosen instead of other metrics such as MSE or MAE.

RMSE works better when the time series contains a great number of outliers [59], which fits the fact that the datasets contain unexpected and sudden rush hours.
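For reference, a minimal sketch of the two chosen metrics; the thesis does not spell out its exact sMAPE variant, so the common mean-of-absolutes denominator is assumed here:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric MAPE in percent (assumed variant; see lead-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

def rmse(y_true, y_pred):
    """Root Mean Squared Error; weights large errors, such as missed
    rush-hour spikes, more heavily than MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```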

3.2 Data Preparation

The data used for the predictive models was collected from the beginning of 2018; more precisely, the process covers the first day of the year to the first day of May. The antennas of Telia monitor every five minutes which users are connected to them. This scheduled behaviour allows creating a historical record of the number of users connected per antenna at any specific time. Moreover, knowing the coverage of the antennas enables calculating the number of users in a defined place: the exact number of users in a specific place is the sum of the users of all the antennas that give coverage to the defined area. The next illustration shows a graphical example of how this technology works:

Figure 3.1: Representation of how the antennas work. Own elaboration.
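As an illustration of this aggregation (the file name, column names and antenna IDs below are hypothetical, since the real schema is not public), the count for a place at each timestamp is the sum over its covering antennas:

```python
import pandas as pd

# Hypothetical schema: one row per antenna per 5-minute snapshot.
logs = pd.read_csv("antenna_logs.csv", parse_dates=["timestamp"])

# Antennas assumed to cover the studied place (illustrative IDs).
place_antennas = {"A12", "A13", "A17"}

crowd = (logs[logs["antenna_id"].isin(place_antennas)]
         .groupby("timestamp")["connected_users"]
         .sum())  # people in the place per 5-minute slot
```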

The three examined places were chosen because they model different crowd behaviours. The three places are: a metro station, a mall and a promenade.

These examples represent core business cases of the domains of retail and transportation. Therefore, the external variables of the time series will be completely different for the metro station (transportation) and for the promenade and mall (retail). For security reasons, the real locations of the places will not be specified.

Nevertheless, the most important feature is common to all of them: they capture the same kind of phenomenon, the movement of people in a place, so the hidden structures are the same. For example, each of them has a rush hour, although most probably the rush hour differs among them.

Each of the three places is represented by around 35,000 samples. Doing the maths on the theoretical number of samples that the dataset should contain leads to the conclusion that the datasets have almost no missing data: from the beginning of the year until the first of May there are 121 days, and multiplying the number of days by the hours per day and the samples per hour gives 121 × 24 × 12 = 34,848 theoretical samples, practically the same number. However, some columns presented null values and had to be discarded for the final model.


Before applying any transformation to the data, it is important to carry out one of the simplest studies of a time series: the visual analysis [60]. This analysis can be used as a starting point of the study; for example, it can reveal whether there are any patterns within the week, possible outliers, the general trend, etc.

Every week of the time series is represented by a different colour, and the three figures of the places respect the same colour pattern. These features make the visual analysis much easier:

Figure 3.2: Station from January until May. Own elaboration.

Figure 3.3: Promenade from January until May. Own elaboration.

Figure 3.4: Mall from January until May. Own elaboration.

The first anomaly that can be observed from the plots presented above is that all the places experience a significant decrease in the movement of people during the same week in the middle of April. Studying the variable explains this odd behaviour: that week was Easter. People usually take advantage of this vacation week and leave the city for a few days, making it the most irregular week of the dataset.

Another remarkable fact of the time series is shown in figure 3.4: the mall presents the most regular behaviour of the three places. This can be explained by the idea that humans are animals of habit: people usually go to the supermarket or the gym on the same days of the week and at similar times. On the other hand, the most irregular of the chosen places is the station. From a theoretical perspective, the station is the place most affected by external and irregular variables such as the weather, traffic jams, festivities, etc. Nonetheless, even this place shows repeated behaviours among the weeks of the time series. For example, from the fourth week until the eleventh week, the behaviour pattern during the days of the week is quite similar, albeit in a different order: there is a positive trend from Monday to Thursday, then a small dip at the weekend and finally another increase on Sunday. It is also worth mentioning that the promenade area is close to some residential areas, which explains why its value almost never decreases to zero.

The main conclusions that can be extracted from this first look at the data are that there are repetitive crowd behaviour structures in the time series, that one week could be considered an outlier, and that there is no remarkable trend in any of the time series.

3.2.1. Stationarity

In the background section, the idea that time series usually need to be stationary in order to obtain good results was introduced [17]. Although it is not as mandatory as for the ARIMA model, it is advisable for LSTM neural networks that the time series behave as stationary, or close to stationary.

The first value that should be defined is the lag of the time series. This value was determined based on the nature of the time series: the order of the frequency is one sample every five minutes, so each hour contains twelve samples, and multiplying these twelve cycles by the twenty-four hours of a day gives 12 × 24 = 288 samples per day. This lag is enough to capture the daily behaviour of the time series. The whole analytical process will be explained using the data from the time series of the Station; these operations have also been applied to the other places. The first step is to investigate the autocorrelation and the histogram:

Figure 3.5: Autocorrelation of the Mall before applying the differencing. Own elaboration.

Figure 3.6: Histogram of the Mall before applying the differencing. Own elaboration.
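Plots of this kind can be reproduced with pandas and statsmodels; the sketch below uses a synthetic daily-seasonal series as a stand-in, since the real data is not public:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Synthetic stand-in: daily seasonality at 5-minute resolution (288/day).
t = np.arange(5000)
series = pd.Series(100 + 50 * np.sin(2 * np.pi * t / 288)
                   + np.random.randn(5000))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
plot_acf(series, lags=288, ax=ax1)  # one day of lags
series.hist(bins=50, ax=ax2)        # histogram to judge the Gaussian shape
plt.show()
```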


The current autocorrelation shows a clearly non-stationary series, with almost all lags exceeding the confidence interval of the autocorrelation [61]. In the ideal case, the majority of the lags drop to zero early on. Moreover, the shape of the histogram is far from following the required Gaussian bell [61]. Therefore, it can be affirmed that the current state of the series corresponds to a non-stationary time series. However, there are mechanisms to approximate a stationary situation from a non-stationary one.

Differencing is a well-known transformation that can provide stationarity to a time series [41]. Differencing helps stabilize the mean of a time series by removing changes in its level, thus eliminating trend and seasonality.

This operation basically consists of calculating the difference between consecutive values. The order of the differencing is determined by the number of times this operation has to be carried out. The formula of first-order differencing is:

Y'_t = Y_t - Y_{t-1} \qquad (3.1)
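In pandas, first-order differencing is a single call; a minimal sketch with a toy series (the real counts are not public), including the inverse operation needed to map forecasts back to the original scale:

```python
import pandas as pd

# Toy stand-in for the raw 5-minute counts.
idx = pd.date_range("2018-01-01", periods=10, freq="5min")
series = pd.Series(range(10), index=idx, dtype=float)

diffed = series.diff().dropna()  # Y'_t = Y_t - Y_{t-1}, eq. (3.1)

# Inverting the differencing: accumulate and add the first known value.
restored = series.iloc[0] + diffed.cumsum()  # recovers series[1:]
```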

Figure 3.7 shows that the autocorrelation of the transformed time series has only a few peaks outside the confidence interval; the lags of the autocorrelation drop very quickly after applying the differencing. This behaviour is much closer to stationary. Moreover, the histogram has changed and is now similar to a pure Gaussian bell. Therefore, it can be concluded that the first order of differencing was successful and has turned the non-stationary time series into a stationary one.

Figure 3.7: Autocorrelation of the Mall after applying the differencing. Own elaboration.

Figure 3.8: Histogram of the Mall after applying the differencing. Own elaboration.

To verify the obtained results, the Dickey–Fuller test is performed to check the stationarity of the time series. This test was defined by Said and Dickey [62] as a test to determine whether a time series is stationary or, more specifically, whether the null hypothesis of a unit root can be rejected.

In other words, the null hypothesis of the test states that there is a unit root in the time series and that it is therefore not stationary; the alternative hypothesis is that the time series is stationary. After performing the test, the null hypothesis must be rejected, because the obtained p-value is smaller than 0.05, which confirms the alternative hypothesis.
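A minimal sketch of the test with statsmodels (random noise stands in for the differenced series so the snippet runs end to end):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

diffed = np.random.randn(1000)  # stand-in for the differenced series

adf_stat, p_value = adfuller(diffed)[:2]
print(f"ADF statistic = {adf_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Unit-root null rejected: the series can be treated as stationary.")
```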

Finally, being now stationary, the time series is ready to be used in the models.


3.2.2. Feature scaling

A common technique applied in most neural networks is scaling of the dataset. This method normalizes the data; when the data is normalized, it becomes more regular, which makes forecasting easier for predictive models. Apart from that, some activation functions demand scaled data in order to work properly [18]. This project, for example, uses the dedicated LSTM layer in the neural network, previously described as a Memory block, which by default uses the hyperbolic tangent (tanh) as its activation function.

By default, this function outputs values between -1 and 1.

Figure 3.9: Representation of the Tanh activation function. Adapted from [8].

In addition, scaling the data makes gradient descent converge much faster, which speeds up the algorithm [63]. Hence, taking all these advantages into account, scaling is going to be used. The formula of the scaling method is the following:

x' = \frac{x - \text{average}(x)}{\max(x) - \min(x)} \qquad (3.2)
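A minimal sketch of equation (3.2); note that in practice the scaling statistics should be computed on the training split only and reused on the test split, to avoid leaking future information:

```python
import numpy as np

def scale(x):
    """Mean normalization following equation (3.2): centres the data
    around zero, matching the (-1, 1) output range of tanh."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())
```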
