
Time Series Forecasting of House Prices: An evaluation of a Support Vector Machine and a Recurrent Neural Network with LSTM cells

BACHELOR’S THESIS IN STATISTICS

Uppsala University, Department of Statistics

Authors: Fredrik Hansson and Jako Rostami

Supervisor: Sebastian Ankargren

Examiner: Professor Johan Lyhagen

24 May 2019


Abstract

In this thesis, we examine the performance of different forecasting methods. We use data of monthly house prices from the larger Stockholm area and the municipality of Uppsala between 2005 and early 2019 as the time series to be forecast. Firstly, we compare the performance of two machine learning methods, the Long Short-Term Memory and the Support Vector Machine methods. The two methods' forecasts are compared, and the model with the lowest forecasting error, measured by three metrics, is chosen to be compared with a classic seasonal ARIMA model. We find that the Long Short-Term Memory method is the better performing machine learning method for a twelve-month forecast, but that it still does not forecast as well as the ARIMA model for the same forecast period.

Keywords: machine learning, cross-validation, seasonality, sliding window, sequential model, supervised learning


Table of Contents

Abstract
List of Tables
List of Figures
Terminology
1 Introduction
2 Research question
   2.1 Research question
   2.2 Data
   2.3 Restrictions from a macroeconomic point of view and statistical motivation
3 Literature Review
4 Methodology
   4.1 Support Vector Machine
   4.2 Recurrent Neural Network
      4.2.1 Background
      4.2.2 A basic Recurrent Neural Network
      4.2.3 LSTM
   4.3 SARIMA
   4.4 Cross-validation and evaluation metrics
   4.5 Hyperparameter tuning
   4.6 Model building approach
   4.7 Software and computational setup
5 Empirical findings
   5.1 Data exploration
   5.2 Results
      5.2.1 Support Vector Machine
      5.2.2 Long Short-Term Memory
      5.2.3 SARIMA
      5.2.4 Machine learning model comparison and model choice
      5.2.5 Best machine learning model vs. SARIMA
6 Discussion & Conclusion
   6.1 Discussion
   6.2 Conclusion
7 References
Appendices
   A Figures
   B Kernels


List of Tables

Table 5.1: Hyperparameters and respective kernel for Uppsala.
Table 5.2: Hyperparameters and respective kernel for the larger Stockholm area.
Table 5.3: LSTM hyperparameters and RMSE for both areas.
Table 5.4: Optimal hyperparameters for seasonal ARIMA and their respective cross-validation RMSE and overall AIC.
Table 5.5: Model comparison of LSTM and SVM.
Table 5.6: Model comparison of LSTM and SARIMA.


List of Figures

Figure 4.1: A graph of a simple SVM with slack variables.
Figure 4.2: A flowchart of a basic artificial neural network with a single hidden layer and binary outputs.
Figure 4.3: A simple RNN with an input layer, a single hidden layer, and an output layer and its recurrent architecture.
Figure 4.4: An LSTM cell with visual representations of its internal structure and the recurrent connections.
Figure 5.1: House prices of the larger Stockholm area and Uppsala from January 2014 till April 2019. Source: Svensk Mäklarstatistik.
Figure 5.2: Seasonal plot of houses in the larger Stockholm area.
Figure 5.3: Seasonal plot of houses in Uppsala.
Figure 5.4: Cross-validation with out-of-sample predictions for the larger Stockholm area.
Figure 5.5: Cross-validation with out-of-sample predictions for Uppsala.
Figure 5.6: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of the larger Stockholm area.
Figure 5.7: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of Uppsala.


Terminology

Abbreviations

NN      Neural Network
ANN     Artificial Neural Network
CNN     Convolutional Neural Network
DNN     Deep Neural Network
RNN     Recurrent Neural Network
LSTM    Long Short-Term Memory
ARIMA   Autoregressive Integrated Moving Average
SARIMA  Seasonal Autoregressive Integrated Moving Average
RBF     Radial Basis Function
MAPE    Mean Absolute Percentage Error
SMAPE   Symmetric Mean Absolute Percentage Error
RMSE    Root Mean Square Error
SVM     Support Vector Machine
SVR     Support Vector Regression

Vocabulary in machine learning

Words that occur frequently in this paper and in the machine learning community are given a terminology translation in Table 1 below. The reader is expected to be familiar with the statistical terms prior to reading this paper.

Table 1: Terminology glossary

Machine learning     Statistics
Back-propagation     Chain rule of partial derivatives
Epochs               Number of times the regression coefficients get updated using one or more observations
Hyperparameter       Model parameter independent of the data
Input/Feature        Independent variable
Output               Dependent variable
Training             Fitting
Training set         In-sample set
Test set             Out-of-sample set


1 Introduction

When looking at a time series, the question of what happens next often comes up: it is of interest to be able to predict the future of the series, and there are many models to choose from when making forecasts. A time series is made up of quantitative observations of one or more measurable characteristics of an individual entity, taken at multiple points in time. It is often characterized by trend, seasonality, stationarity, and auto-correlation (Avishek and Prakash, 2017). Interest rates, stock prices, weather indices, and population over time are all examples of possible time series to analyse. In this paper, forecasting future house prices is of interest. Since a lot of money is involved in the housing market, being able to predict future prices is of great interest to many, and how to do this well is an active area of research (Yu et al, 2018).

A common method used to make forecasts of time series is the ARIMA model, and it will be used in this paper. Besides the ARIMA model, we will also employ two machine learning methods to make forecasts. Machine learning is an interdisciplinary field that shares common threads with the mathematical fields of statistics, information theory, game theory, and optimization. Given the emerging research in the machine learning field over the last two decades, machine learning models have established themselves as serious contenders to classical statistical models in the forecasting community (Bontempi et al, 2013). Time series data is explained by a process that is, most of the time, unknown to the analyst. According to Alpaydin (2010), the niche of machine learning is detecting patterns and regularities under the assumption that identifying the complete process may not be possible. Machine learning is considered to be born out of the idea that instead of programming machines for specific tasks, they should be able to learn themselves (Khan et al, 2019).

Machine learning methods can be categorized into methods using supervised and unsupervised learning techniques (Ghatak, 2017). Supervised learning techniques make predictions from the data, where the supervisor compares the correct answers to the predicted answers. Unsupervised learning techniques gather information about the structure of the data: they explore the structure of the data and shed light on it. This paper explores two machine learning methods, both of which use supervised learning. The two methods are the Support Vector Machine (SVM) and the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). The RNN with LSTM will be referred to as LSTM in this paper.


In this paper, we will present our research question and the data we are going to use followed by the previous literature that shaped the paper. We will explain how the SVM, LSTM, and ARIMA models work and how we build the models as well as how we find the best hyperparameters to use. Our results will be presented, first comparing the SVM and LSTM models and then the better of those two models with the ARIMA model. We will then discuss the results and the strengths and weaknesses of the models.


2 Research question

2.1 Research question

For this paper, we have real-world data on house prices in Sweden from Svensk Mäklarstatistik. The structure of the thesis is such that the data provided is first analyzed visually; thereafter, the two supervised machine learning models are applied to the time series data. After that, the performance of the two machine learning models' forecasts is evaluated and compared to see which one performs better. Once the best-performing machine learning model has been chosen, it is compared to a seasonal ARIMA model, and the forecasting performance of both final models is evaluated and compared to determine whether the classic model or the machine learning model is better.

• Which of the chosen machine learning techniques is better at forecasting house prices and is it better than the classical ARIMA time series model?

2.2 Data

Svensk Mäklarstatistik has provided the dataset for this paper. Svensk Mäklarstatistik is a company that collects data on estate sales through realtors. Their services provide realtors, media, firms, government agencies, and universities with data and statistics (Svensk Mäklarstatistik, 2019). The dataset we have received contains the average final prices that detached houses were sold for. We have data for the larger Stockholm area and Uppsala municipality. The data is monthly and ranges from January 2005 to March 2019 with no missing data.

2.3 Restrictions from a macroeconomic point of view and statistical motivation

There could be many different variables that could be useful for trying to predict future house prices in addition to time and previous values. However, this paper is not concerned with trying to find the best predictor for future house prices. In 2018, Yu et al compared an LSTM with 13 characteristic variables and time variables to an LSTM based only on time. The percentage error for the LSTM with characteristic and time variables was 10.35%, while it was 2% for the LSTM based on time only. This motivates us to drop the macroeconomic part of house prices and solely focus on the temporal part of the univariate time series data. For the above reasons, we chose not to add any variables other than time and past prices to the models.

3 Literature Review

When the data is complex, neural networks are useful because they are able to learn any function with a single hidden layer, where the hidden layer can be considered a black box that performs calculations. Because of this, they are known as universal function approximators (Maini and Sabri, 2017). RNNs have a built-in memory such that they can keep track of what happened historically even when it is not all visible at once. Any neural network involving recurrence can be considered an RNN (Goodfellow et al, 2016).

Yu et al (2018) designed several models to evaluate their forecasting performance, using monthly housing price trends to predict future housing prices in Beijing. They found that a Convolutional Neural Network (CNN) had a lower percentage error than an LSTM when accounting for house characteristic variables, and that an LSTM was better when employed on the time series depending on time only.

Noticeable was their seasonal ARMA with a percentage error of 3.1% and their LSTM-1 model with a percentage error of 2%. Continuing with China, the identification of real estate cycles can be seen as a recurrent cycle, but with irregular fluctuations in the rate of total return for all real estate properties, according to Zhang et al (2015). When the authors introduced neural networks into identifying real estate cycles in China, they used an Artificial Neural Network (ANN) to determine cycles on the Chinese real estate market. They used data from 1993-2008 for training and forecast the years 2009-2011.

In 2009 China's real estate market reached its peak; it went into recession in 2010 and reached its trough in 2011. The authors found that the neural network's performance is generally consistent with market reality. The Chinese market has cyclical characteristics and fluctuates more frequently after 2008, with economic events in 2009 and 2011 explaining the cyclical fluctuations. In their conclusion, they suggest that applications of neural networks to real estate cycles can be enriched by international comparisons, and that examining global real estate cycles with neural networks is a potential research area.

Farman, Khan, and Jan (2018) provide insight into how deep learning in neural networks can be used for finance, given the irregular and complex behaviour of the economic domain with its large number of factors. The authors mention that one type of neural network suitable for detecting patterns in time series is the RNN, which can retain information over a longer time period, enabling it to recognize patterns in sequential data; it was specifically developed to recognize patterns in time series data. Other neural networks are not suitable for sequential data where the sequence is time-dependent, which makes the RNN a preferable choice for time series compared to CNNs and DNNs. One example of neural networks applied to time series models is the paper by Kohzadi et al (1996).

The authors provide a neural network which can account for non-linear relationships and compare it to ARIMA models, which cannot. They used monthly live cattle and wheat prices from 1950 through 1990 and repeated the experiment seven times for successive three-year periods. The neural network model showed 27 percent and 56 percent lower mean squared error than the ARIMA model. The absolute mean error and the mean absolute percent error were also lower for the neural networks compared to the ARIMA models. In their conclusion, the authors mention that one of the reasons the neural networks performed better may be non-linear or chaotic behaviour in the data which cannot be fully captured by the ARIMA model. Further on, they claim that because they used past prices, the methods can be applied to other forecasting problems such as stocks or other financial prices.

In a paper by Amrani and Zaytar (2016), the LSTM model was used for time series weather prediction with the final goal of producing two types of models per city in Morocco: forecasts of 24 and 72 hours worth of weather for nine cities in total, with about 15 years of hourly meteorological data to train the neural network. While the authors mention the chaotic nature of the atmosphere and the massive computational power required, their results showed that LSTM neural networks can be considered a better alternative for forecasting general weather conditions. In their conclusion, they claim that the success of their model suggests it could be used on other weather-related problems.

In addition to the consideration of the nature of the data, Walid and Alamsyah (2017), in their work with RNNs for forecasting time series, mention that RNNs are often used to predict and estimate issues related to electricity and can describe the cause of the swelling of electrical load experienced by the Indonesian state-owned electricity company. The authors highlight the necessity of electricity to humans and the issues the Indonesian state-owned electricity company has with not being able to provide continuous power to its customers. While they used RNNs for forecasting time series, the nature of their data was different from that of the previously mentioned papers, and the study was a comparison between different models of RNNs rather than a comparison of neural networks to other methods. The previously mentioned studies have a few things in common: irregular patterns in their data, time-dependent data, and NNs that can compute the underlying structure of their data. This tells us that NNs can learn any function.

So far we have mentioned a few papers regarding neural networks and their applications in fields with random to cyclical behaviour, such as electricity consumption, economics and finance, and meteorological data. These articles highlight the complexity that neural networks can handle and how they perform in forecasting. In a paper by Ahmed et al (2010), the authors compared eight machine learning models used for time series forecasting on monthly time series competition data, which is used as a benchmark for testing and comparing forecasting methods. One model they used is Support Vector Machine Regression, also called Support Vector Regression, which is also used in this thesis. They used lagged values, differencing, and moving averages to compare overall performance among the eight different models, considered in their basic form. According to the authors, their findings suggest that some models are better than others but the performance results vary: some time series data favour some models more than others, and preprocessing the data can also have a significant impact on performance. The Support Vector Machine is one of the top models in classification but did not perform as well on regression in time series; it is at the top for some time series while for others it has a high percentage error.

Levis and Papageorgiou (2005) presented a paper on customer demand forecasting using Support Vector Regression on time series data. They use a three-step algorithm whose last step is recursive and constructed from semi-deterministic past data: the algorithm predicts one future value to be used as the most recent value for the following prediction. This means that each predicted value is only used for one following time period, but it also replaces actual past demand in a recursive manner for the next predicted time period, because there is no actual past demand data point for the first predicted time period. Their model performs well on non-linear data, and their prediction accuracy, measured as 100 − MAPE, is over 93% for all their examples.

According to the authors, their model not only avoids overfitting but also interprets the underlying pattern of their customer demand data.

Vapnik et al (1999) used Support Vector Regression for time series prediction, comparing it to a Radial Basis Function network using generated data from the Santa Fe competition. The SVR performed better than the RBF network on stationary data but not on nonstationary data, and the authors mention that nonstationarities should be considered before the actual prediction. Hao and Yu (2006) use a non-stationary financial time series to predict the next 40 time periods based on 100 training examples from a dataset of the Shanghai Stock Exchange. They used a modified SVR that penalized recent errors more than distant errors and by doing so received lower RMSE and MAE. The authors claim that their modified SVR performs better than a standard SVR and traditional time series forecasting models when it comes to forecasting the stock composite. However, they do not present a comparison between their model and the models they claim superiority over, and they ignore the non-stationarity issue mentioned by Vapnik et al (1999).

Having presented several research papers applying neural networks and Support Vector Machines, our expectation is that they will perform well on time series. Depending on the preprocessing for the LSTM and the SVR, the outcomes may vary if we do not acknowledge their intrinsic differences. Both handle non-linear data well, depending on how the data is preprocessed. Ultimately, both models are complex in nature, and claiming a superior one beforehand is not possible given how they can perform when their unique structures are taken into consideration. This means that they are to be modified with respect to their complexity and their differences when approximating non-linearity.


4 Methodology

4.1 Support Vector Machine

A support vector machine is, in its simplest form, a linear classifier that separates data into two classes. It does this by drawing a line through the data, where data points on one side of the line are part of one class and data points on the other side are part of another class. However, this line can, of course, be drawn in many ways and can separate the data arbitrarily, which would make for a poor method. The support vector machine separates the data by maximizing the distance between the line and the nearest data points on either side of the line. The result is that the two resulting classes are as different from each other as possible. This produces a tube that separates the data, where the data points at the edge of the tube are called support vectors (Flach, 7.3, 2012).

We are not, however, looking to separate the data into two groups; we want to be able to make accurate time series forecasts. To do this, we need to modify the model slightly. If the data has more than one dimension, or if it is not linear, we have to map it into a high-dimensional feature space and then use the feature space in our further calculations. We represent this mapping with the function ϕ(x), where x has been mapped into a high-dimensional feature space. The feature space is unknown at this point but will eventually be replaced with a so-called kernel function. When x is in the correct form, we can use the SVM to do a linear regression to make predictions (Okasha, 2014).

Figure 4.1: A graph of a simple SVM with slack variables.


Below, we see the equation we want to solve to be able to make predictions:

\[ y(x) = W^{T}\phi(x) + b \]

where the coefficient W is the vector perpendicular to our regression line and the coefficient b is the intercept. The ϕ indicates a high-dimensional feature space. Notice that this equation is similar to a simple linear regression (Okasha, 2014). To solve the above equation, we introduce a set of so-called slack variables. The slack variables measure the distance from a data point to the boundary of the tube (see Figure 4.1). The slack variables are positive for every data point but are zero for data points on the rim of the tube and for correctly classified data points. In SVM regression we allow data points to lie within the tube. The consequence of this is that maximizing the tube size, as we do in the normal SVM case, results in a tube of infinite size, since the data points are allowed to be inside the tube. Because of this, we introduce slack variables into the model. The slack variables are minimized at the same time as the tube size is maximized. When the tube grows larger, more data points are misclassified and more slack variables are needed. A balance, therefore, needs to be struck between tube size and slack variable minimization. To do this we introduce a hyperparameter called cost that determines how important it is to minimize the slack variables (Flach, 7.3, 2012).

The final SVM equation becomes:

\[ y(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\big(\phi(x_i), \phi(x)\big) + b \quad (4.1) \]

where α and α* are Lagrange multipliers and where we can express (ϕ(x_i), ϕ(x)) = K(x_i, x). This is called a kernel function. The kernel function can take many forms depending on the dimensionality of the input data (Okasha, 2014). The function ϕ is unknown to us, but we can use the kernel function to find a suitable feature space to map our data onto. A kernel function calculates the inner product of the kernel variables in the feature space (Ruping, 2001). The kernel function is not a fixed function and looks different for different data. Choosing the right kernel function is important to ensure the accuracy of the forecast (Okasha, 2014). There are many kernels to choose from, and our kernel will be chosen on the basis of which one produces the best forecast, through trial and error. The kernel function is, therefore, another hyperparameter that needs to be determined. The kernel functions that will be tested in this report are the linear, radial-basis, sigmoid, and polynomial kernels. Their equations can be seen in the appendix.
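To make the kernel comparison concrete, the sketch below fits an epsilon-SVR with each of the four candidate kernels using the e1071 package referenced later in this thesis. The simulated series, the time-only feature, and the placeholder cost and gamma values are assumptions for illustration, not the settings behind the reported results.

```r
# Hedged sketch: fitting an epsilon-SVR with the four candidate kernels via
# e1071::svm. Data and hyperparameter values are placeholders.
library(e1071)

set.seed(1)
df <- data.frame(t = 1:60)
df$price <- 4e6 + 2e4 * df$t + 3e5 * sin(2 * pi * df$t / 12) + rnorm(60, 0, 1e5)

kernels <- c("linear", "polynomial", "radial", "sigmoid")  # "radial" = radial-basis
fits <- lapply(kernels, function(k)
  svm(price ~ t, data = df, type = "eps-regression", kernel = k,
      cost = 1, gamma = 0.1))                              # placeholder values

# In-sample RMSE per kernel; the real comparison uses the sliding windows
# described in Section 4.4 instead of in-sample errors.
setNames(sapply(fits, function(f)
  sqrt(mean((df$price - predict(f, df))^2))), kernels)
```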


4.2 Recurrent Neural Network

4.2.1 Background

Artificial neural networks are inspired by the design of the brain. They are models of computation that can be viewed as formal models, with equations and statements of which parts are to be used, and they consist of a large number of basic computing neurons connected to each other in a complex communication network (Shalev-Shwartz and Ben-David, 2014, Chapter 20). Hence, an ANN mimics the learning behaviour of the brain without being programmed for a specific task, as mentioned earlier. This is where a supervisor enters a teaching position and interacts with the learner, i.e. the network, and the environment; this is known as supervised learning.

Figure 4.2: A flowchart of a basic artificial neural network with a single hidden layer and binary outputs.

As seen in Figure 4.2 above, the NN is feed-forward and is composed of an input layer, a hidden layer with three cells and an activation function in each cell, and an output layer with an activation function that determines the output of the cell. An activation function works as a transformer in the network: it takes the input values and transforms them into an output. In the figure above, a sigmoid function, which is a special case of the logistic function used in logistic regression, transforms the outputs into binary outputs between 0 and 1. The hidden layer works as a learning box and as a filter, depending on the nature of the data and the structure of the hidden layer.

4.2.2 A basic Recurrent Neural Network

RNNs are part of the family of artificial neural networks with a recurrent architecture. The recurrence involved is the use of previous outputs as new inputs, such that they are recurring, meaning that they occur one or more times in the calculation of new outputs. They are similar to feed-forward networks such as the one explained in Figure 4.2. A simple RNN takes not only the current input but also previous information as inputs, through recurring connections. This gives importance to previous events, meaning that the events at t-2 and t-1 will influence decisions taken at time t. See Figure 4.3 for the intuition behind an RNN. The basic RNN is known as a vanilla RNN or a simple RNN and has been applied in several disciplines such as natural language processing and tourism forecasting (Bianchi et al, 2017). RNNs are sequential models that process a sequential input or a sequence of variables (Goodfellow et al, 2016). Their main signifier is that they can use all of the previous inputs for each output, giving the notion of memory in the neural network (Graves, 2012, Chapter 3).

Figure 4.3: A simple RNN with an input layer, a single hidden layer, and an output layer and its recurrent architecture.
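As a concrete illustration of this recurrence, the base-R sketch below runs a short input sequence through a single vanilla RNN step repeatedly; the random weights and the scaled input values are assumptions for illustration only, not trained parameters.

```r
# Hedged sketch: one vanilla RNN recurrence in base R. Weights are random
# placeholders, not fitted parameters.
set.seed(2)
n_hidden <- 3                                          # three hidden cells, as in Figure 4.3
W_xh <- matrix(rnorm(n_hidden), n_hidden, 1)           # input-to-hidden weights
W_hh <- matrix(rnorm(n_hidden^2), n_hidden, n_hidden)  # hidden-to-hidden (recurrent) weights
W_hy <- matrix(rnorm(n_hidden), 1, n_hidden)           # hidden-to-output weights
b_h  <- rep(0, n_hidden); b_y <- 0

rnn_step <- function(x_t, h_prev) {
  h_t <- tanh(W_xh %*% x_t + W_hh %*% h_prev + b_h)    # new hidden state
  y_t <- as.numeric(W_hy %*% h_t + b_y)                # linear output for regression
  list(h = h_t, y = y_t)
}

x <- c(0.20, 0.25, 0.30)        # a short, scaled input sequence
h <- rep(0, n_hidden)           # initial hidden state
for (t in seq_along(x)) {
  out <- rnn_step(x[t], h)
  h <- out$h                    # the hidden state recurs into the next time step
}
out$y                           # output at the final time step
```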

4.2.3 LSTM

For an RNN with LSTM architecture, the subnets, i.e. the memory blocks, are recurrently connected. These memory blocks consist of cells with four gates, where the gates are representations of activation functions (Graves, 2012). The cells in the memory block, also known as neurons, can be simply viewed as a small house with an entrance and an exit. The doors in the house represent the four gates in the memory block; one cell represents one passage through the house, and two cells represent passing through the same house, entering again just to exit again: it is a recurring passage. The gates are called the forget gate, the input gate, the output gate, and a fourth one which we will call the gate gate; they are multiplicative gates applied element-wise, known as Hadamard products. The original work on LSTMs and the details of the algorithm by Hochreiter and Schmidhuber (1997) are simplified here for the forward propagation. Figure 4.4 shows that the memory block has three inputs: x_t, h_{t-1}, and c_{t-1}.

We start by simplifying the input vector x_t and the hidden state, where the hidden state at time t-1 is defined as h_{t-1}. Let x_t and h_{t-1} be column vectors of size m×1 and n×1. Concatenate them to receive a column vector v of size u×1, where v = [x_t, h_{t-1}] = [v_1 v_2 ... v_u] and u = m + n. Continue to the gates in Figure 4.4, where the long-term memory at the previous time step, defined as c_{t-1}, is used as the long-term memory at time t before it is updated by multiplication with the forget gate (i). The output value from the previous time step is sent into the current time step through the forget gate (i), where it tells the long-term memory at time t what to remember and what to forget through a sigmoid function. The sigmoid function outputs numbers between 0 and 1, where 1 means "fully store this" and 0 means "fully forget this". To avoid exploding numbers when the network undergoes many computations, the tanh function keeps the values in [-1, 1] to make sure the network is steady and stable.

Other multiplications inside the gates are done through weight matrices and the concatenated vector. Every gate consists of a set of parameters called weights. Similar to regression coefficients, they decide how important the values at the current time step and past time steps are. For this, weight matrices are introduced at each gate and multiplied with the concatenated column vector v. The weight matrices are also updated through back-propagation to lower the error they produce from training, until they can no longer reduce their training error.

Figure 4.4: An LSTM cell with visual representations of its internal structure and the recurrent connections.


Now compute the product of the weight matrix and the concatenated vector, add the bias by element-wise summation, and finally run the result through the sigmoid function.

\[ F_t = \sigma(W_F \times v + b_F) \qquad \text{(i) The forget gate output} \]

Similarly for the input gate, which learns which parts of the hidden state at t-1 and the input x_t are worth using and saving.

\[ I_t = \sigma(W_I \times v + b_I) \qquad \text{(ii) The input gate output} \]

The gate gate decides which other candidates are suitable by transforming the values to the range [-1, 1]. Once the input gate and the gate gate are multiplied with each other, the sigmoid output of the input gate determines which information to keep from the tanh output. This also protects the network from what is known as exploding gradients during back-propagation for LSTMs, ensuring network stability over multiple time steps.

\[ G_t = \tanh(W_G \times v + b_G) \qquad \text{(iii) The gate gate output} \]

The old long-term memory is updated by multiplying the old long-term memory with the forget gate and adding the input gate multiplied by the gate gate.

\[ c_t = F_t \times c_{t-1} + I_t \times G_t \qquad \text{(iv) The new updated long-term memory, also called the cell state} \]

The output gate decides what to output and is based on the previous memory block and the given input at the current time step. This is without the updated long-term memory, which means that the output gate contains past information.

\[ O_t = \sigma(W_O \times v + b_O) \qquad \text{(v) The output gate output} \]

Passing the updated cell state through a tanh function and multiplying it with the sigmoid output tells the hidden state what information to carry on. Our memory block at time t has now been updated and is used for predictions at time t+1 and for the final output at time t. The new hidden state at time t, which also works as the output at time t, is

\[ h_t = O_t \times \tanh(c_t) \qquad \text{(vi) The new hidden state output} \]

Finally, our predicted output value, for regression output, is run through a linear transformation so that the output values become unbounded. The memory block process explained in this section gives the reader a notion of what the Long Short-Term Memory stands for. As demonstrated, c_t is the long memory, which remembers all previous operations, whilst O_t is the short-term memory, which remembers the previous outputs. Hence, Long Short-Term Memory is the process of the memory block with recurring connections through time. The recurrence involved is through c_t and h_t when carried on to the next time step, where past hidden states and past information recur in the next memory block.
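To make the forward pass concrete, the base-R sketch below implements one memory-block update following equations (i)-(vi); the sizes and the random weights are assumptions for illustration, not the trained network used later in the thesis.

```r
# Hedged sketch: one LSTM memory-block update following equations (i)-(vi).
# Weights and inputs are random placeholders, not trained values.
set.seed(3)
m <- 1; n <- 4; u <- m + n                 # input size, hidden size, concatenated size
sigmoid <- function(z) 1 / (1 + exp(-z))

W_F <- matrix(rnorm(n * u), n, u); b_F <- rep(0, n)   # forget gate weights
W_I <- matrix(rnorm(n * u), n, u); b_I <- rep(0, n)   # input gate weights
W_G <- matrix(rnorm(n * u), n, u); b_G <- rep(0, n)   # gate gate weights
W_O <- matrix(rnorm(n * u), n, u); b_O <- rep(0, n)   # output gate weights

lstm_step <- function(x_t, h_prev, c_prev) {
  v   <- c(x_t, h_prev)                    # concatenated [x_t, h_{t-1}]
  F_t <- sigmoid(W_F %*% v + b_F)          # (i)   what to forget
  I_t <- sigmoid(W_I %*% v + b_I)          # (ii)  what to store
  G_t <- tanh(W_G %*% v + b_G)             # (iii) candidate values
  c_t <- F_t * c_prev + I_t * G_t          # (iv)  updated cell state (element-wise)
  O_t <- sigmoid(W_O %*% v + b_O)          # (v)   what to output
  h_t <- O_t * tanh(c_t)                   # (vi)  new hidden state / output
  list(h = h_t, c = c_t)
}

state <- lstm_step(x_t = 0.3, h_prev = rep(0, n), c_prev = rep(0, n))
state$h                                    # hidden state carried to time t + 1
```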

4.3 SARIMA

A commonly used way to forecast time series is to use a SARIMA model or one of its many variants. SARIMA stands for the Seasonal Autoregressive Integrated Moving-Average process. The SARIMA model consists of lags from previous time periods. These lags are past values of the dependent variable as well as past values of independent, identically distributed random variables with mean 0 and variance σ_e^2 (Cryer, Chan, 4.2-4.3, 2008). The lags are subject to a set of weights that determines their importance to the value being predicted. Some time series are not stationary, which causes problems when we want to build the SARIMA model. To remedy this, we take differences of the series until it becomes stationary (Cryer, Chan, 10.1, 2009). Some dependent variables depend not only on variables from the previous time period but also on variables at past recurring time intervals. This is called a seasonal element. A common seasonal element is what happened in the current month a year ago (Cryer, Chan, 5.2, 2008). In Equation 4.2 we see a SARIMA model.

\[ (1 - \phi_1 B)(1 - \Phi_1 B^{12})(1 - B)(1 - B^{12})Y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12})e_t \quad (4.2) \]

Equation 4.2 is a SARIMA model with a seasonal element at lag 12. B is the backshift operator. φ contains the weights for past lagged values of the dependent variable, and Φ contains the weights for past seasonal lagged values of the dependent variable. (1−B) is the first difference and (1−B^12) is the first seasonal difference. θ is the weight for past random variables and Θ is the weight for past seasonal random variables. Y_t is the dependent variable value we want to predict, and e_t are independent, identically distributed random variables with mean 0 and variance σ_e^2.
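As an illustration of this model class, the sketch below fits the SARIMA form of Equation 4.2 with the forecast package used later in this thesis. The simulated series is a stand-in for the real data, and the orders shown are those of Equation 4.2, not the tuned models reported in Section 5.

```r
# Hedged sketch: fitting the SARIMA(1,1,1)(1,1,1)[12] form of Equation 4.2 with
# the forecast package. `prices` is a simulated placeholder series.
library(forecast)

set.seed(4)
prices <- ts(4e6 + cumsum(rnorm(171, 1e4, 5e4)) +
               3e5 * sin(2 * pi * (1:171) / 12),
             start = c(2005, 1), frequency = 12)      # monthly from January 2005

fit <- Arima(prices,
             order    = c(1, 1, 1),                               # phi_1, (1 - B), theta_1
             seasonal = list(order = c(1, 1, 1), period = 12))    # Phi_1, (1 - B^12), Theta_1

fc <- forecast(fit, h = 12)   # twelve-month-ahead forecast
summary(fc)
```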


4.4 Cross-validation and evaluation metrics

Hastie et al (2017) describe the method of k-fold cross-validation as randomly dividing the dataset into k groups of approximately equal size, where the first group is treated as a test set while training is done on the remaining k − 1 groups. However, because we are dealing with time series data, we have to approach the problem with a slightly different method. We do not have the possibility of dividing the dataset into a training set, a test set, and a validation set because of insufficient data, and given our models it is more appropriate to use the walk-forward/sliding-windows testing routine described by Kaastra and Boyd (1996) with a training and a test set (Hastie et al, Ch.7, 2013). Hyndman and Athanasopoulos (2018) suggest that a one-step or multi-step rolling forecast origin may also be used; one must consider the size and nature of the data set. Using monthly time series data suggests that a sliding window or a rolling forecast origin are the methods to use when cross-validating. Given the machine learning models, sliding windows try to simulate real-life data and test model robustness by retraining on a large out-of-sample dataset. They also allow quicker adaptation to changing market conditions when using neural networks (Kaastra and Boyd, 1996).

The sliding-window cross-validation is done by splitting the dataset into a training period and a test period. The length and number of windows are defined by the periods we train and test on. This makes the sliding-window method a sort of k-fold method for time series. To evaluate the global cross-validation over the sliding windows, the overall RMSE of the forecast errors is given in Equation 4.3.

\[ \mathrm{RMSE}_{\mathrm{overall}} = \sqrt{\frac{\sum_{j=1}^{k} \mathrm{MSE}_j}{k}} \quad (4.3) \]

where MSE_j is the mean of the squared forecast errors in sliding window j, j = 1, 2, ..., k.
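A minimal sketch of this sliding-window scheme and the overall RMSE of Equation 4.3 is given below. The window lengths mirror the setup used later (four-year training, one-year test, two-year offset), but the series and the placeholder linear-trend model are assumptions made only to keep the sketch self-contained.

```r
# Hedged sketch: sliding-window cross-validation and the overall RMSE of
# Equation 4.3. The data and the simple trend model are placeholders.
set.seed(5)
y <- 4e6 + cumsum(rnorm(171, 1e4, 5e4))   # hypothetical monthly prices

train_len <- 48; test_len <- 12; offset <- 24   # 4-year train, 1-year test, 2-year step
starts <- seq(1, length(y) - train_len - test_len + 1, by = offset)

mse_j <- sapply(starts, function(s) {
  train_idx <- s:(s + train_len - 1)
  test_idx  <- (s + train_len):(s + train_len + test_len - 1)
  fit  <- lm(y[train_idx] ~ train_idx)                        # placeholder model
  pred <- predict(fit, newdata = data.frame(train_idx = test_idx))
  mean((y[test_idx] - pred)^2)                                # MSE_j for window j
})

rmse_overall <- sqrt(mean(mse_j))   # Equation 4.3
rmse_overall
```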

However, Hyndman and Koehler (2005) bring up the sensitivity of the RMSE to outliers; thus two other metrics, which are scale-independent, are included to ensure a good comparison between different methods. The MAPE, Mean Absolute Percentage Error,

\[ \mathrm{MAPE} = \frac{1}{T - T_0 + 1} \sum_{t=T_0}^{T} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right| \times 100, \qquad 1 < T_0 < T \quad (4.4) \]

and the Symmetric Mean Absolute Percentage Error (SMAPE),

\[ \mathrm{SMAPE} = \frac{1}{T - T_0 + 1} \sum_{t=T_0}^{T} \frac{200\,\lvert \hat{Y}_t - Y_t \rvert}{\lvert Y_t \rvert + \lvert \hat{Y}_t \rvert}, \qquad 1 < T_0 < T \quad (4.5) \]

will be used in this paper. In Equations 4.4 and 4.5, T_0 is the first forecast month and T is the final forecast month. In our case, T is always the twelfth month we forecast.

These metrics are not optimal measurements of forecast accuracy: the MAPE puts a heavier penalty on positive errors than on negative errors, and the SMAPE is not symmetric as its name suggests, since for the same value of Y_t the quantity 200|Ŷ_t − Y_t| / (|Y_t| + |Ŷ_t|) penalizes low forecasts more heavily than high forecasts (Hyndman and Koehler, 2005). Therefore, using three different metrics to measure the forecast errors should guide us better than using only one. The RMSE acts primarily as an in-model forecast error metric and secondarily as a between-model metric, while the MAPE and the SMAPE are primarily between-model forecast error metrics for model comparison and secondarily clarify the overall comparison together with the RMSE.
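For concreteness, the sketch below writes the three metrics as small R helpers and evaluates them on a hypothetical 12-month forecast; the numbers are invented for illustration only.

```r
# Hedged sketch: RMSE, MAPE (Equation 4.4), and SMAPE (Equation 4.5) as small
# helper functions; y and yhat are invented 12-month values.
rmse  <- function(y, yhat) sqrt(mean((y - yhat)^2))
mape  <- function(y, yhat) 100 * mean(abs((y - yhat) / y))
smape <- function(y, yhat) mean(200 * abs(yhat - y) / (abs(y) + abs(yhat)))

y    <- c(5.2, 5.4, 5.1, 4.9, 5.0, 5.3, 5.6, 5.5, 5.4, 5.2, 5.1, 5.0) * 1e6
yhat <- c(5.1, 5.3, 5.2, 5.0, 5.1, 5.2, 5.4, 5.6, 5.3, 5.3, 5.0, 5.1) * 1e6

c(RMSE = rmse(y, yhat), MAPE = mape(y, yhat), SMAPE = smape(y, yhat))
```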

4.5 Hyperparameter tuning

Searching for the most optimal model is a time-consuming process and leads to the search for methods that find the configuration performing best on the evaluation metrics chosen for the training and testing sets. This is logical, since it reduces the time spent on manual trial and error and instead creates an efficient computing mechanism to obtain optimal configurations.

Hyperparameters are the parameters that are independent of the data and can be considered to be the buttons on a black box. They are used to control the model and are defined as those parameters that cannot be learned directly from the data (Ghatak, 2017). Tuning them means tuning the model with respect to its performance on the metrics chosen for training and testing. Three methods of hyperparameter tuning will be considered in this paper: manual configuration, random hyperparameter search, and grid search.

Manual configuration might seem unnecessary or contradictory, since we have mentioned how time-consuming it is to search for the optimal model. Regardless, it is used to build any model and is the foundation of model building in general: build the model first and then consider finding the best configuration through other methods.

Grid search is a method that finds the hyperparameters of interest: it searches over all possible configurations and gives you the best one. However, grid search can be computationally demanding and sometimes requires a lot of time and computational resources; the more parameters that are tested, the more demanding it becomes. Because of this, a random hyperparameter search can be preferable when the computing is heavy. Even though random search is unlikely to find the best configuration in the end, it will give better configurations using fewer iterations: perhaps after 10 iterations random search gives the better configuration, but grid search will give the best configuration after a full search. This is where one has to consider the computational side and the time available. In this paper we will use the following hyperparameters for the LSTM:

Batch size = number of training examples to pass through in one epoch; this will be set to 1
Hidden layers = number of memory blocks in the network
Time steps = the number of steps in time run through the RNN, over which it memorizes; for this paper this is set to 1
Features = number of variables in every time step; in a univariate time series this is 1
Recurrent dropout = a fraction, between 0 and 1, of how much of the hidden states to drop
Cells = number of cells in which the procedure of the LSTM memory block is repeatedly executed
Optimizer = an algorithm to optimize the weights to reduce the mean square error; AdaDelta will be used with default settings in Keras
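A hedged sketch of how such an LSTM can be specified with the Keras R interface is given below; the unit counts and dropout rates are illustrative placeholders rather than the tuned values reported later, and x_train/y_train are assumed to be the preprocessed arrays described in Section 4.6.

```r
# Hedged sketch: a two-layer LSTM in the Keras R interface with the
# hyperparameters listed above (batch size 1, 1 time step, 1 feature,
# recurrent dropout, AdaDelta). Unit counts are placeholders.
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 24, input_shape = c(1, 1),       # (time steps, features)
             recurrent_dropout = 0.1,
             return_sequences = TRUE) %>%             # feed the sequence to layer 2
  layer_lstm(units = 12, recurrent_dropout = 0.5) %>%
  layer_dense(units = 1)                              # linear output for regression

model %>% compile(loss = "mse", optimizer = "adadelta")

# x_train is assumed to have shape (samples, 1, 1) and y_train shape (samples, 1):
# model %>% fit(x_train, y_train, epochs = 100, batch_size = 1)
```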

For the SVR:

C = a constant that determines the importance of slack variable minimization versus tube diameter maximization
Gamma = a constant used in the radial-basis, sigmoid, and polynomial kernels
Coef = an intercept constant used in the sigmoid and polynomial kernels
Degree = a constant used in the polynomial kernel

In the case of the SVR, we will use grid search as an initial way of finding good values for the hyperparameters and then try to improve on the grid search result by searching for better hyperparameters manually. This is done by trying to lower the RMSE by changing the hyperparameters one by one: each hyperparameter is increased and/or decreased until the RMSE is no longer lowered, after which we move on to the next hyperparameter and do the same thing. When all hyperparameters have gone through this process, each is increased and decreased again to see if a lower RMSE can be found. This process continues until a better value cannot be found.
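The sketch below illustrates this two-stage procedure for the SVR: a coarse grid search with e1071::tune.svm followed by a manual one-at-a-time refinement scored on a held-out year. The grids, the toy window, and the scoring helper are assumptions for illustration, not the actual search used for the results.

```r
# Hedged sketch: coarse grid search with tune.svm, then manual one-at-a-time
# refinement around the grid winner. Data and grids are placeholders.
library(e1071)

set.seed(6)
df <- data.frame(t = 1:48)
df$price <- 4e6 + 2e4 * df$t + rnorm(48, 0, 1e5)      # one toy training window

# Stage 1: grid search over cost and gamma for the radial-basis kernel
grid <- tune.svm(price ~ t, data = df, kernel = "radial",
                 cost = 10^(-2:2), gamma = 10^(-3:1))
best <- grid$best.parameters

# Stage 2: nudge one hyperparameter at a time, keeping a change only if the
# RMSE on a held-out final year of the window decreases
score <- function(cost, gamma) {
  fit  <- svm(price ~ t, data = df[1:36, ], kernel = "radial",
              cost = cost, gamma = gamma)
  pred <- predict(fit, newdata = df[37:48, ])
  sqrt(mean((df$price[37:48] - pred)^2))
}
score(best$cost, best$gamma)       # baseline from the grid search
score(best$cost * 2, best$gamma)   # example of one manual step on cost
```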

4.6 Model building approach

Designing a machine learning model for time series forecasting can be a time-consuming procedure and might demand a structured building approach. An eight-step design methodology for a neural network forecasting model on financial time series data has been presented by Kaastra and Boyd (1996); it may require revisiting previous steps instead of being a single pass. Utilizing this methodology for the LSTM and the SVR in this thesis will be beneficial in terms of time, but also in tuning the models, where we can look back at previous steps.

Step one: Variable selection

The first step of building the model will be selecting which variables to use in the model.

The simplest neural network model uses lagged values of the dependent variable(s) or its first differences (Kaastra and Boyd, 1996). The SVR will use time as the only independent variable while the LSTM will use lagged values.
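A small sketch of this variable selection is shown below: a time-index design for the SVR and lagged input/output pairs reshaped for the LSTM. The series is a placeholder, and the single-lag setup mirrors the one-time-step configuration listed in Section 4.5.

```r
# Hedged sketch: step-one variable selection. The SVR uses the time index as
# its only feature; the LSTM uses lagged values arranged as supervised pairs.
set.seed(7)
y <- 4e6 + cumsum(rnorm(171, 1e4, 5e4))   # placeholder monthly price series

# SVR design: time as the single independent variable
svr_df <- data.frame(t = seq_along(y), price = y)

# LSTM design: pairs (y_{t-1} -> y_t), reshaped to Keras' expected
# (samples, time steps, features) array with one time step and one feature
x_lag <- y[-length(y)]
y_out <- y[-1]
x_arr <- array(x_lag, dim = c(length(x_lag), 1, 1))
dim(x_arr)    # 170 samples, 1 time step, 1 feature
```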

Step two: Data collection

Data on monthly house prices for detached houses has been provided by the company Svensk Mäklarstatistik.

Step three: Data preprocessing

The data provided will go through preprocessing, considering the most suitable procedure for the LSTM and the SVR. This preprocessing might involve procedures like taking the logarithm of the process, normalizing the values, and/or applying a square root transformation. We will use the square root transformation and the normalization procedure.
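A sketch of this preprocessing, assuming min-max normalization after the square-root transformation, is given below together with the inverse transform needed to map forecasts back to prices; the exact scaling used in the thesis may differ.

```r
# Hedged sketch: square-root transformation followed by min-max scaling to
# [0, 1], plus the inverse transform for mapping forecasts back to prices.
set.seed(8)
y <- 4e6 + cumsum(rnorm(171, 1e4, 5e4))   # placeholder monthly prices

y_sqrt <- sqrt(y)
rng    <- range(y_sqrt)
y_norm <- (y_sqrt - rng[1]) / (rng[2] - rng[1])          # scaled to [0, 1]

inverse_transform <- function(z) ((rng[2] - rng[1]) * z + rng[1])^2
all.equal(inverse_transform(y_norm), y)                  # TRUE: round trip works
```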

Step four: Training, testing, and validation sets

Ideally, for a large data set it is better to divide the data into three parts: a training set, a test set, and a validation set. But in situations with insufficient data, as for the dataset in this thesis, the data will only be split into a training set and a test set. We will use a walk-forward sliding-windows testing routine as mentioned by Kaastra and Boyd (1996), also named a rolling forecast by Hyndman and Athanasopoulos (2018, Chapter 3).

Step five: Neural network paradigm

In machine learning, hyperparameters are the parameters that are independent of the learning process and are configured before learning begins. The LSTM can be viewed as an artificial neural network that produces an output at each time step. Some RNNs also have hidden layers that do not produce output but are instead used to refine the learning process by sending data to the processes that give the output. RNN models with more hidden layers are called deeper, and they are advantageous according to Goodfellow et al (2016). However, Kaastra and Boyd (1996) mention the danger of overfitting when increasing the depth, i.e. increasing the number of hidden layers, and bring up that one or two hidden layers are widely used and have performed well. The SVM has a number of hyperparameters that need to be determined, and they are chosen based on model performance. The hyperparameters of both models will be determined in the results section of this report.

Step six: Evaluation criteria

To decide which model is the better one we need to be able to evaluate their performance.

We will look at the overall RMSE of the sliding-window forecasts. The model with the lowest average forecast error will be determined to be the better model.

Step seven: Model training

The supervised learning method will take place by passing training data to the LSTM and the SVR. Optimization of the models by trial and error and with different tuning methods will be carried out to find the most suitable models.

Step eight: Implementation

Once steps 1 through 7 are finished, the final LSTM and SVR models will be implemented. These are then the definitive models used for forecasting.

4.7 Software and computational setup

R, a statistical programming tool, has been used for the work in this paper. Python, another programming language, has been integrated into the R environment through the Anaconda platform for the purpose of utilizing two libraries that are written in Python code.

The main libraries used are:

Keras - a machine learning library written in Python, running on TensorFlow.

TensorFlow GPU - a machine learning library written in Python, utilizing the GPU instead of the CPU.

caret - a statistical learning package written in R for classification and regression.

e1071 - a statistical learning package written in R for Support Vector Machines.

tfruns - a hyperparameter tuning package written in R for Keras and TensorFlow.

forecast - a time series library written in R.


Computer setup for the software and libraries used.

Computer specifications

Specification   PC-1                   PC-2
CPU             i5-9600K               i5-7500
GPU             GeForce RTX 2060 6GB   GeForce GTX 1060 6GB
RAM             16 gigabytes           8 gigabytes


5 Empirical findings

5.1 Data exploration

An interesting aspect of the house prices is how they change over a long period of time. By visualizing the house prices for the larger Stockholm area and the municipality of Uppsala, one can get a glimpse of any occurring pattern. Figure A.1 (see appendix) shows a stable increase in house prices over the period from 2005 to 2019, where Stockholm saw a jump in prices around 2015. Uppsala saw a jump in prices just before 2015 and has, like Stockholm, had steadily increasing prices over the entire period. When viewing the period from 2014 to 2019 for Stockholm and Uppsala, a pattern is revealed (see Figure 5.1). There is an increasing seasonal trend for houses in Stockholm, with troughs as well; the seasonal element persists through this period. Looking at Uppsala, a seasonal trend can be seen in the first two years, eventually becoming cyclical between 2016 and 2019. The deepest trough for Stockholm occurs in the middle of 2016, after which prices recovered by increasing all the way into 2017 before going into a trough again, this time not as deep as the previous year. For Uppsala, similar behaviour is detected but it was short-lived: it goes into a seasonal pattern after 2016 that continues until 2018, before returning to behaviour similar to before 2016.

Figure 5.1: House prices of the larger Stockholm area and Uppsala from January 2014 till April 2019. Source: Svensk Mäklarstatistik.


Visually plotting the preprocessed data and looking at the autocorrelation for Stockholm and Uppsala, we see that both of them show some non-negligible correlation with past periods. As seen in Figures A.2 and A.3, both autocorrelation functions decrease until they hit lag 12, where the autocorrelation increases before decreasing again. A 12-month lag therefore seems to be an appropriate past value to use for the LSTM. We will look into the data further to try to visualize and detect its underlying elements. Observing where the troughs appear (see Figures 5.2 and 5.3) reveals that the month of July is responsible for them.

Figure 5.2: Seasonal plot of houses in the larger Stockholm area.

Figure 5.3: Seasonal plot of houses in Uppsala.

The seasonal plots also tell us how the values behave monthly over a longer time period. The monthly values for Stockholm behave similarly from year to year, whereas for Uppsala the monthly values behave differently over a longer time period, but July still remains the significant trough over the measured time. Decomposing the preprocessed data of Stockholm and Uppsala shows that both have a seasonal and a trend element (see Figures A.4 and A.5). Looking at the trend component, there is a pattern of increasing prices, while the seasonal component shows 12-month seasonal fluctuations with troughs in the middle of the year for both Stockholm and Uppsala.
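The sketch below shows how such seasonal plots, a decomposition, and the autocorrelation function can be produced with the forecast package; the simulated series is only a stand-in for the real data, so the plots will not reproduce the figures referenced here.

```r
# Hedged sketch: seasonal plot, classical decomposition, and autocorrelation
# with the forecast package on a simulated stand-in series.
library(forecast)

set.seed(9)
prices <- ts(4e6 + 2e4 * (1:171) + 3e5 * sin(2 * pi * (1:171) / 12) +
               rnorm(171, 0, 1e5),
             start = c(2005, 1), frequency = 12)

ggseasonplot(prices)      # monthly behaviour year over year (cf. Figures 5.2-5.3)
plot(decompose(prices))   # trend and 12-month seasonal components (cf. A.4-A.5)
Acf(prices)               # autocorrelation, with the bump at lag 12 (cf. A.2-A.3)
```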

5.2 Results

5.2.1 Support Vector Machine

The tables below show the tests of the linear, polynomial, radial-basis, and sigmoid kernels for the SVM. The tests were run on houses in Uppsala and Stockholm using four sliding windows. Each sliding window had a training period of four years and a test period of one year, and the next window was offset by two years. The grid search method did not yield the lowest-RMSE model; it suggested a radial-basis kernel with a higher RMSE than all the kernels presented below. To find these kernels, we used the alternative method presented earlier, where we find the hyperparameters manually. As we can see, the radial-basis kernel seems to perform best for both areas.

Table 5.1: Hyperparameters and respective kernel for Uppsala.

Hyperparameter   Linear   Polynomial   Radial-basis   Sigmoid
RMSE             292072   289668       274222         385975
Cost             0.075    0.77         6              100
Gamma            -        0.016        1.02           10000
Degree           -        1            -              -
Coef             -        1            -              1

Table 5.2: Hyperparameters and respective kernel for the larger Stockholm area.

Hyperparameter   Linear   Polynomial   Radial-basis   Sigmoid
RMSE             300854   300853.7     280994         378205
Cost             0.51     0.12         44             1
Gamma            -        5            0.08           3000
Degree           -        1            -              -
Coef             -        100          -              1


5.2.2 Long Short-Term Memory

For the LSTM model, trial and error was used to test for the optimal hyperparameters. One of the manually configured settings outperformed the random hyperparameter search in terms of RMSE. Table 5.3 below shows the optimal hyperparameter configurations, obtained with a random sample of 20 percent of all the combinations and with a manual tuning configuration. In this case, 45 out of 225 combinations were searched for both Stockholm and Uppsala; however, a manual configuration was found to be the optimal one for Uppsala.

Table 5.3: LSTM hyperparameters and RMSE for both areas.

Hyperparameter          Metro-Stockholm   Uppsala
Optimizer               Adadelta          Adadelta
RMSE                    286234            331505
Hidden layers           2                 2
Cells (1)               24                24
Cells (2)               12                24
Recurrent dropout (1)   0.1               0.1
Recurrent dropout (2)   0.5               0.4
Tuning                  Random search     Manual

For Stockholm, the random hyperparameter search yields a better configuration than the manual search did. The proposed configuration is two hidden layers with 24 units in the first layer, 12 units in the second layer, and a recurrent dropout of 10 percent in the first layer and 50 percent in the second layer. For Uppsala, the optimal configuration was not found by the random hyperparameter search. Instead, manual testing gave the best configuration, which had two hidden layers, 24 units in the first layer, 24 units in the second layer, and a recurrent dropout of 10 percent in the first layer and 40 percent in the second layer. The training of the LSTM for Stockholm went through 100 epochs, as this was tested to be optimal for the current sampling plan. By visually inspecting the cross-validation for Stockholm in Figure 5.4, it can be seen that the LSTM captures a few of the seasonal occurrences in terms of troughs and peaks. It is visually appealing since it seems to capture the structure of the data, with low RMSE on the window sets. For Uppsala, the LSTM has a hard time learning from the data, and the underlying structure is difficult to remember. On all the sliding windows, the predictions on the unseen one-year windows look poor, but the RMSE is not so bad that the configuration is disregarded. This LSTM had optimal training at 50 epochs, and this is seen especially in the fourth sliding window, where the model performed badly compared to the previous three samples, as seen in Figure 5.5 below.


Figure 5.4: Cross-validation with out-of-sample predictions for the larger Stockholm area.

Figure 5.5: Cross-validation with out-of-sample predictions for Uppsala.


5.2.3 SARIMA

For the seasonal ARIMA model, the autocorrelation and partial autocorrelation functions were examined manually to choose the right values of the hyperparameters. By testing several configurations, the optimal model for Stockholm was the one with 2 past lags, first-order differencing, 0 past noise lags, 0 past seasonal lags, a first-order seasonal difference, and 2 past seasonal noise lags. Uppsala follows the same configuration except that it does not have a first-order difference.

Table 5.4: Optimal hyperparameters for seasonal ARIMA and their respective cross-validation RMSE and overall AIC.

Hyperparameter     Metro-Stockholm   Uppsala
Order (p,d,q)      (2,1,0)           (2,0,0)
Seasonal (P,D,Q)   (0,1,2)[12]       (0,1,2)[12]
RMSE               154933            163600
Average AIC        938.3             987.9

Both models have been differenced such that the slowly decaying autocorrelation function of the residuals decays faster and the partial autocorrelation function of the residuals is uncorrelated at lag 12; the plots of these functions can be found in the appendix, see Figures A.10-A.13.

As seen in Figures A.6 and A.7, the SARIMA models have a low overall RMSE for the sliding-windows cross-validation sampling plan. The model for Uppsala performs visually poorly but the RMSE is still low, whilst the model for Stockholm performs well both visually and by inspection of the RMSE.

5.2.4 Machine learning model comparison and model choice

Since the radial-basis kernel performed best in the sliding-windows test for the SVM, it is also applied for the final forecasts seen below. The forecast is made with the first 159 months of our data as training data, forecasting 12 months ahead; the RMSE shown is the RMSE of the 12 forecast months. The forecast for the LSTM is done in the same way and uses the optimal configuration: the first 159 months are used as training data and a forecast is made 12 months ahead. To be clear, the RMSE, MAPE, and SMAPE shown in Table 5.5 are the forecast error metrics of the 12-month forecast.

As can be seen in Table 5.5 below, the LSTM has a lower RMSE than the SVM for both Uppsala and Stockholm. The LSTM model will, therefore, be compared to the SARIMA model. The SVM performs poorly on both areas when inspected visually in Figures A.8 and A.9, compared to the LSTM, which has both lower metric values and looks visually satisfying.


Table 5.5: Model comparison of LSTM and SVM.

Support Vector Machine vs Long Short-Term Memory

Model    Area              MAPE    SMAPE   RMSE
LSTM     Metro-Stockholm   4.29%   4.18%   286124
LSTM     Uppsala           7.52%   7.11%   340303
SVM      Metro-Stockholm   7.05%   6.72%   434589
SVM      Uppsala           8.32%   8.05%   373270

5.2.5 Best machine learning model vs. SARIMA

By direct comparison of the RMSE in the table below, we see that the SARIMA models have a lower RMSE than the LSTM models. Table 5.6 shows the evaluation metrics for both models. In Figures 5.6 and 5.7 below, we also see how the SARIMA model forecasts the unseen 12-month data better than the LSTM.

Table 5.6: Model comparison of LSTM and SARIMA.

SARIMA vs Long Short-Term Memory

Model    Area              MAPE    SMAPE   RMSE
LSTM     Metro-Stockholm   4.29%   4.18%   286124
LSTM     Uppsala           7.52%   7.11%   340303
SARIMA   Metro-Stockholm   2.70%   2.68%   158438
SARIMA   Uppsala           6.04%   5.79%   280297


Figure 5.6: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of the larger Stockholm area.

Figure 5.7: Comparison of forecasts on the period from 2018-04-01 to 2019-03-01 on average house prices per month of Uppsala.


6 Discussion & Conclusion

6.1 Discussion

As seen in the results section, the SARIMA models perform better than both the LSTM and SVM models. The LSTM models, in turn, perform better than the SVM models, and especially outperform the SVM in the Stockholm time series, where the difference in RMSE, MAPE, and SMAPE is very large: the SVM errors are almost twice as large as those of the LSTM. We now look at the differences between the LSTM and SVM models more closely.

Figures A.8 and A.9, which show the forecasts of the LSTM and SVM models, give more insight into why the error measurements are worse for the SVM. The LSTM clearly tries to fit the data: it follows the series reasonably well but stumbles somewhat in the later part of the Uppsala forecast. The SVM models, on the other hand, make an almost linear forecast. When predicting the Uppsala data the SVM finds a downward trend during the forecast period and therefore comes close to the LSTM prediction, but it is limited by its linear behaviour, cannot predict beyond the trend, and consequently suffers a high error.

In the Stockholm forecast, the LSTM performs far better than the SVM as previously mentioned. The SVM tries to find a linear trend again but this time it misjudges the trend and goes off in a different direction than the actual time series does. This is the reason that the error measurements for the forecast are so poor. The LSTM forecast does not do this and instead somewhat captures the behavior of the time series.

What are the problems with the SVM model and how can the SVM model be improved?

A likely problem with the SVM model is that the optimal hyperparameters have not been found. The grid search method used to find the hyperparameters failed to find the optimal parameters and was easy to beat by manually searching for better ones. A problem with manual tuning is that it is very easy to land in a local minimum that is mistaken for the global minimum, i.e. the optimal set of hyperparameters. This is especially true for the kernels that use more than one hyperparameter, since the number of possible combinations of parameter values increases manyfold. A computer-based method should be able to beat the manual method used in this paper, since it can try far more combinations than any human could. Unfortunately, the grid search method found in the packages we have used was not very good. A solution could be to build our own computer-based method to find the best set of hyperparameters; however, time constraints prevented us from doing this.
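For reference, a simple computer-based search of the kind discussed here could be written with the tune() function in the e1071 package. The grids, the time-index input, and the simulated series below are illustrative assumptions rather than the settings used in the thesis, and tune() uses random k-fold cross-validation by default rather than the sliding-window scheme.

library(e1071)

# Placeholder series and a simple time-index input (assumption for illustration).
y <- 5e6 + cumsum(rnorm(159, sd = 5e4))
x <- data.frame(t = seq_along(y))

# Grid search over cost, gamma, and epsilon for a radial-basis SVM regression.
tuned <- tune(svm, train.x = x, train.y = y,
              type = "eps-regression", kernel = "radial",
              ranges = list(cost    = 10^(-1:3),
                            gamma   = 10^(-3:1),
                            epsilon = c(0.01, 0.1, 0.5)))

tuned$best.parameters  # best combination found by the grid search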

Another problem with the SVM model is that the hyperparameters suggested after finishing the sliding-windows process are not the best for the final forecast. The sliding-windows process suggested using the radial-basis kernel to forecast both time series, yet when the final forecast was made with a kernel other than the radial-basis kernel, with the hyperparameters suggested by the sliding-windows process, we sometimes obtained more accurate forecasts; in some cases these forecasts vastly outperformed the radial-basis kernel. This means that the sliding-windows process used is not finding the best model for the final forecast. It is possible that we are using too few windows and that more windows are needed to find a set of hyperparameters that fits the time series better than our current parameters. The downside is that this becomes more computationally demanding, especially if a computer-based method is used to find the hyperparameters; however, the SVM was not very demanding with four windows, so the number of windows could probably be increased without much issue.

A third way to increase the forecast accuracy could be to include not only the time series itself as input but also lagged values of the series. We know from the data visualization part of this report that there is a seasonal dependence in the time series, and suggesting that dependence to the model as input could improve it. This was also not done due to time constraints. Lastly, the LSTM has an advantage over the SVM since it uses lagged values as input and not only the raw time series. This makes it easier for the LSTM to take seasonality into account, and since we know that there is a seasonal element, this gives the LSTM model an edge over the SVM.
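The lagged-input idea mentioned above could, for instance, be implemented with embed(); the choice of lags 1 and 12 is an assumption meant to reflect the monthly seasonality, not a specification tested in the thesis.

# Build a design matrix of lagged prices for the SVM (sketch).
prices <- 5e6 + cumsum(rnorm(171, sd = 5e4))   # placeholder price series

lagged <- embed(prices, 13)                    # row t: prices[t], prices[t-1], ..., prices[t-12]
dat <- data.frame(y     = lagged[, 1],
                  lag1  = lagged[, 2],
                  lag12 = lagged[, 13])

fit <- e1071::svm(y ~ lag1 + lag12, data = dat,
                  type = "eps-regression", kernel = "radial")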

The LSTM clearly outperforms the SVM, and its hyperparameters are of a different nature. The model needs to know for how long past information should be retained or dropped in future predictions and, at the same time, how much of that past information should be retained or dropped. Further, the model needs an appropriate number of hidden layers to avoid overfitting. The fact that two hidden layers were used for both areas, with good cross-validation performance, is in line with Kaastra and Boyd (1996). Comparing the cross-validation predictions for the LSTM with the forecasting performance on the 12-month period from April 2018 to March 2019, the model performs unexpectedly well given that it underperformed on the fourth sliding-window forecast for Uppsala, which also covers the most recent data in the sampling plan. This makes it harder to evaluate the optimal hyperparameters when revisiting the model-building steps: on the one hand the model seems reliable for multi-step forecasting, on the other hand it can create a confirmation bias where it appears that the optimal hyperparameters have been found when it may be a case of "beginner's luck". This is a problem when computational resources are insufficient and a grid search for the best configuration cannot be run; it is then easy to fall into the trap of believing that the optimal forecasting model has been created. One way to overcome this problem would be to preprocess the data in the same way as for the SARIMA model. How would the model behave under stationarity? This would, for example, create a time series whose properties do not change over time, which would make the model easier to interpret since those properties would be predictable over time.
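The preprocessing suggested here could be as simple as applying the same differencing as in the SARIMA specification before the series is fed to the LSTM; a sketch is shown below, with the back-transformation of the forecasts omitted and a simulated placeholder series.

# Difference towards stationarity before training the LSTM (sketch).
prices <- ts(5e6 + cumsum(rnorm(171, sd = 5e4)), start = c(2005, 1), frequency = 12)

d_prices <- diff(diff(prices, lag = 12))   # seasonal difference, then a first difference
scaled   <- scale(as.numeric(d_prices))    # standardise the differenced series for the network

# LSTM forecasts made on this scale must afterwards be unscaled and the
# differences cumulatively added back to obtain price-level forecasts.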

6.2 Conclusion

While the SVM model performs poorly, the LSTM model shows much greater promise as a forecasting tool for time series. Building an LSTM is a sensitive balance between the bias-variance tradeoff and the time it takes to configure the model, and it can easily fail to give accurate forecasts. This was shown in the cross-validation plan for Uppsala, where the out-of-sample predictions exhibited high bias and low variance on the fourth sliding window. Making no assumptions about the data is a major advantage of the LSTM over the SARIMA model, but it also makes the LSTM more vulnerable to difficult data. Our data preprocessing may not have been enough to counter such difficulties, with reduced forecast accuracy as a consequence. Another reason for the lower accuracy may be the choice of hyperparameters.

It is likely that the hyperparameters chosen by the random search are not the best possible, all the more so since manually chosen parameters turned out to be the best ones found. As in the SVM case, a better method for finding hyperparameters would most likely yield higher forecast accuracy. The model might also be unable to handle strong seasonal patterns or heavy noise, or it may need statistical assumptions about the data to be imposed before it is configured. Comparing the LSTM to a classical time series model on a small dataset shows how robust the SARIMA is.

The LSTM is beaten on all three metrics: RMSE, MAPE, and SMAPE. The SARIMA model was also far less labor-intensive to construct, although this is probably partly due to our previous lack of experience with machine learning methods. In short, the best machine learning method used in this paper was not better than the SARIMA model.

This contradicts the results of the study by Yu et al. (2018), whose best LSTM model had a MAPE of 2% while their seasonal ARMA had a MAPE of 3.1%. This paper instead shows a seasonal ARIMA with a percentage error of 2.70% (MAPE) and 2.68% (SMAPE) beating an LSTM with a percentage error of 4.29% (MAPE) and 4.18% (SMAPE). The difference between this paper and the study mentioned seems to depend on which ARIMA model is employed for comparison, the structure of the data, and the model-building approach.

This does not, however, imply that the LSTM can be disregarded. It has shown itself to be a valid model that requires no assumptions, but further research may be needed to make it more accurate for time series forecasting.


7 References

Ahmed, N. K., Atiya, A. F., Gayar, N. E., El-Shishiny, H. (2010). An empirical comparison of machine learning models for time series forecasting. Econometric Reviews, 29(5), 594-621.

Alpaydin, E. (2010). Introduction to machine learning (2nd ed.). Cambridge, Mass: MIT Press.

Amiri, S., von Rosen, D., Zwanzig, S. (2009). The SVM approach for B.J models. Revstat Statistical Journal, 7(1), 23-36.

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., Jenssen, R. (2017). Recurrent neural networks for short-term load forecasting: An overview and comparative analysis. New York: Springer.

Bontempi, G., Taieb, S. B., Borgne, Y. (2013). Machine learning strategies for time series forecasting. Business Intelligence, 62-77.

Cryer, J. D., Chan, K. (2008). Time series analysis: With applications in R (2nd ed.). New York: Springer.

Flach, P. (2012). Machine Learning. The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press.

Ghatak, A., (2017). Machine learning with R. Singapore: Springer Singapore.

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Graves, A. (2012). Supervised sequence labelling with recurrent neural networks. Heidelberg; New York: Springer.

Hao, W., Yu, S. (2006). Support vector regression for financial time series forecasting. In: Wang, K., Kovacs, G. L., Wozny, M., Fang, M. (eds) Knowledge Enterprise: Intelligent Strategies in Product Design, Manufacturing, and Management. PROLAMAT 2006. IFIP International Federation for Information Processing, vol 207. Springer, Boston, MA.
