
Forecasting the OMXS30 – a comparison between ARIMA and LSTM

By David Andréasson & Jesper Mortensen Blomquist

Department of Statistics

Uppsala University

Supervisor: Yukai Yang

(2)

Abstract 

Machine learning is a rapidly growing field with more and more applications being proposed every year, including but not limited to the financial sector. In this thesis, historical adjusted closing prices from the OMXS30 index are used to forecast the corresponding future values using two different approaches: one using an ARIMA model and the other using an LSTM neural network. The forecasts are made on three different time intervals: 90, 30 and 7 days ahead. The results show that the LSTM model performs slightly better when forecasting 90 and 30 days ahead, whereas the ARIMA model has comparable accuracy on the seven day forecast.

Keywords: machine learning, deep learning, neural networks, RNN, time series, stock market, index.

(3)

Contents 

1 Introduction
2 Background
2.1 Time Series
2.1.1 Stochastic Processes
2.1.2 Stationarity
2.1.3 The Moving Average Process
2.1.4 The Autoregressive Process
2.1.5 Unit Root Processes and the Random Walk
2.1.6 Differencing and Integration Order
2.1.7 Autoregressive Integrated Moving Average
2.2 Neural Networks
2.2.1 Feedforward Neural Networks
2.2.2 Recurrent Neural Networks
2.2.3 Vanishing/Exploding Gradient Problem
2.3 Long Short-Term Memory Network
2.3.1 Activation Functions
2.3.2 LSTM Cell Structure
3 Data
4 Methods
4.1 ARIMA Specification
4.1.1 Akaike Information Criterion
4.1.2 Fitted Models
4.2 LSTM Specification
4.2.1 Training a Neural Network
4.2.2 Hyperparameters
4.2.2.1 Layers
4.2.2.2 Neurons
4.2.2.3 Loss Function
4.2.2.4 Validation Data
4.2.2.5 Optimizer
4.2.2.6 Batch Sizes
4.2.2.7 Number of Epochs
4.2.2.8 Regularization
4.3 Error Measures for Evaluation
4.3.1 Root Mean Square Error (RMSE)
4.3.2 Mean Absolute Error (MAE)
5 Results
6 Discussion
7 Conclusion
References

1 Introduction

The stock market is a volatile place filled with uncertainty. It is also a place where incredible amounts of money change hands every day, in the hope that the transactions made will generate profits for investors. If it were possible to navigate this volatility and accurately forecast the movements of the market, it would create an opportunity to acquire great wealth for those able to make such projections.

There are several types of models that can be used to attempt to forecast financial time series data such as the stock market. In this thesis the objective is to compare forecasts made using a classic time series model with forecasts from a machine learning model, and see which one performs best. It is entirely possible that neither model produces any meaningful results if there are no relevant patterns to draw on. This is a real possibility considering the nature of the data in question: for the stock market, there is to this day no conclusive answer as to whether it is predictable at all.

Although stock options – which in some sense form the basis of quantitative finance – have existed since the 17th century, it was not until the 20th century that the field really took a giant leap forward. Some work on the properties of financial markets had been done by mathematicians in the late 1800s, but it did not gain much relevance until the middle of the next century, when more and more research on the topic was performed. The big revolution, however, occurred in 1973 when Black and Scholes published their paper on the pricing of options, which in turn caused a ripple effect on the interest in derivatives trading and created the market in its current form (Cesa, 2017). Although this thesis is not on the topic of options trading, this provides some context on the growth of financial engineering and quantitative analysis.

Coinciding with this is the rise of artificial intelligence, machine learning and specifically neural networks. The first theory on the topic was formulated by McCulloch and Pitts (1943). This theory was of course very different from the networks of today, but it formed an important foundation. Research on the topic remained relatively stagnant until the 1980s, when several papers created renewed interest in the field. In the past ten years, very significant progress has been made, with the development of more advanced recurrent neural networks and deep feedforward neural networks as examples of this (Macukow, 2016). Many of the earlier applications of neural networks were related to fields such as genetics, psychology and engineering, but they proved just as useful in the field of finance, gaining widespread use (McNelis, 2005).

In his book A Random Walk Down Wall Street, initially published in 1973, Burton G. Malkiel (2015) argues that the stock markets are completely random. He proposes the hypothesis that past data has essentially no bearing on future values, suggesting that a coin flip could forecast the market just as successfully. If Malkiel's argument is as correct as it is compelling, it would be impossible to forecast the future closing prices of the OMXS30 with any accuracy based on historical closing prices. In spite of this, financial trading today is in large part done using computer algorithms, and it is not uncommon to use statistical models to try to forecast future stock values.

Applying and testing the power of neural networks in finance has become increasingly common in the academic world in recent years. Chen et al. (2015) tested this on the Chinese stock market and found that the model's results outperformed random chance. Nelson et al. (2017) performed a similar study on the Brazilian market and found that the model could predict whether a stock would rise or fall in the near future with an accuracy of approximately 55%.

The purpose of this thesis is to examine whether the Long Short-Term Memory (LSTM) neural network is able to forecast the movements of the stock market more accurately than a more classical method, the autoregressive integrated moving average (ARIMA), which has historically been used to try to forecast movements in time series data. On initial review it seems that an LSTM would likely outperform a classic model, given the number of parameters that are taken into consideration. This, however, is not definitive and needs to be tested on real data, specifically on the Swedish stock market, which is significantly less studied than and has different characteristics from the US markets.

Therefore, the research question is the following: does an ARIMA or an LSTM model perform better when forecasting future values of the OMXS30 index 90 days, 30 days and 7 days ahead?

The thesis is structured as follows: in section 2, some theory behind the ARIMA and the LSTM is given, to provide the reader with a basic understanding of time series analysis and neural networks. In section 3, the data used in the report is described. In section 4, the methods behind the model specification are outlined. The results are then presented in section 5, followed by a discussion and a conclusion in sections 6 and 7 respectively.

2

Background 

In this section, the core concepts on which the thesis is based are explained. Some relevant theory behind time series analysis and neural networks is presented to give an understanding of how the models described in the following sections are specified.

2.1 Time Series 

A time series is a sequence of numerical data points in successive time order. Examples of time series include a country's GDP, the air temperature at a specific place and the value of a stock. Time series analysis can be used to forecast future values by finding patterns in historical time series data and extrapolating them into the future (Box et al., 2016).

2.1.1 Stochastic Processes 

A stochastic process is a sequence of random variables. The time series and the stochastic process relate to each other in the sense that the time series is a realization of a stochastic process. In this thesis, the stochastic process generates the data points which we observe in the stock index. In other words, the time series is the result of sampling once from every random variable in the stochastic process (Cryer and Chan, 2008; Box et al., 2016).

2.1.2 Stationarity 

One of the central assumptions that usually is made about stochastic processes is stationarity. This assumption is important in time series analysis because it makes it easier to make statistical inferences about the data; in essence the majority of time series tests have stationarity as a prerequisite ​(Manuca and Savit, 1996)​. A stochastic process is stationary if its statistical properties are constant over time. The strict interpretation of this is that all random variables making up the stochastic process follow the same distribution. However, it is often practically impossible to know whether or not this is the case. Instead, there is a more forgiving interpretation which is commonly used, called second-order stationarity (Cryer and Chan, 2008).

When checking for second-order stationarity, the first two moments are considered. For a process to be second-order stationary, the following three conditions must be fulfilled:

$E(Y_t) = \mu \quad \text{for all } t$, (1)

$\text{Var}(Y_t) = \gamma_0 < \infty \quad \text{for all } t$, (2)

$\text{Cov}(Y_t, Y_{t-k}) = \gamma_k \quad \text{for all } t \text{ and } k$, (3)

where $\mu$ denotes the mean, $\gamma$ the covariance, $t$ is an arbitrary point in time and $k$ is some lag length. In other words, a stochastic process is second-order stationary if the mean is constant over time, the variance is finite and constant over time, and the covariance between two observations does not depend on the specific time points but rather on the distance between them (Cryer and Chan, 2008).

Many stochastic processes are not stationary. For example, it is reasonable to assume that all random variables in a stochastic process describing monthly ice cream sales in Sweden would not have the same mean. The mean is likely much higher in July than in January because, on average, more ice cream is sold in the summer. Such a process would therefore not be stationary because it is affected by seasonality.

2.1.3 The Moving Average Process 

One of the most common stochastic processes used in time series analysis is the moving average (MA) process. The value of the MA process depends on current and previous values of a random shock term. An important characteristic of the MA process is that it is always stationary (Box et al., 2016).

The general MA process is denoted MA(q), where q stands for the order of the process. The process can be written

$Y_t = e_t - \theta_1 e_{t-1} - \theta_2 e_{t-2} - \dots - \theta_q e_{t-q}$, (4)

where $Y_t$ is the value of the process at some arbitrary point in time $t$, $e_t$ to $e_{t-q}$ are the random shock terms and $\theta_1$ to $\theta_q$ are the parameters. This means that the value of an MA(q) at time $t$ depends on the value of the shock term at time $t$, as well as all the shock terms $q$ timesteps back. So, an MA process of order one is called an MA(1), and the value of the process is dependent on the shock term at the current time and the shock term from the time directly before. For an MA(2), the value of the process depends on the shock term today and at the two previous points in time (Cryer and Chan, 2008; Box et al., 2016).

2.1.4 The Autoregressive Process 

Another widely used time series process is the autoregressive (AR) process. The value of this process at some time is dependent on the value of the process at previous timesteps, along with a random shock term at the current time. Unlike the MA process, the AR is not always stationary (Box et al., 2016).

The general AR process is denoted AR(p), where p is the order of the process. The process can be written

$Y_t = \phi_0 + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + e_t$, (5)

where $Y_t$ is the value of the process at some arbitrary time $t$, $e_t$ is the shock term at time $t$ and $\phi_0$ to $\phi_p$ are the parameters. So, the value of an AR(p) is dependent on the value of the process at the $p$ previous timesteps, plus the value of the shock term at the current time. For example, an AR process of order one is abbreviated AR(1), and the value of the process depends on the value at the time step right before, and the shock term today. The value of an AR(2) depends on the value at the two previous timesteps as well as today's shock term (Cryer and Chan, 2008; Box et al., 2016).
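To make the two processes concrete, below is a minimal simulation sketch in Python; the Gaussian shock terms and the parameter values are purely illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
e = rng.normal(size=n)                 # random shock terms e_t

# MA(1): Y_t = e_t - theta_1 * e_{t-1}  (always stationary)
theta1 = 0.6
ma1 = e.copy()
ma1[1:] -= theta1 * e[:-1]

# AR(1): Y_t = phi_0 + phi_1 * Y_{t-1} + e_t  (stationary here since |phi_1| < 1)
phi0, phi1 = 0.0, 0.7
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi0 + phi1 * ar1[t - 1] + e[t]
```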

2.1.5 Unit Root Processes and the Random Walk 

A unit root process is a process that contains a stochastic trend in the time series data. In contrast to a process with no unit root, which over time reverts towards its mean, a unit root process does not revert; shocks have permanent effects, which makes the process hard to forecast (Levendis, 2018).

A random walk is an example of a unit root process. The random walk assumes that in every period the time series takes a new, completely random step; if the series increases or decreases, it is not down to fundamental factors but rather complete chance. A random walk can have a drift or no drift, depending on whether the mean change per period is non-zero or zero: a non-zero mean gives a drift and vice versa (Cryer and Chan, 2008).

2.1.6 Differencing and Integration Order 

Non-stationary time series processes can be made stationary. One way to do this is differencing. Differencing means transforming the process so that each data point becomes the difference between consecutive values instead of the actual value. This way, if a stochastic trend exists in a non-stationary process, it can be removed, which stabilizes the mean and can therefore make the process stationary (Cryer and Chan, 2008).
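As an illustration, the sketch below differences a simulated random walk and checks stationarity with an augmented Dickey-Fuller test from statsmodels; the thesis itself does not describe a unit root test, so this is only an assumed illustration of the idea.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
walk = pd.Series(np.cumsum(rng.normal(size=1000)))   # a random walk, integrated of order one

diffed = walk.diff().dropna()                         # first difference removes the stochastic trend

# Small p-values suggest stationarity: expect a large value for the level
# series and a very small one for the differenced series.
print("p-value, level series:      ", adfuller(walk)[1])
print("p-value, differenced series:", adfuller(diffed)[1])
```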

If a process is differenced once to form another process, the original process is said to be integrated of order one. Since differencing is used to achieve stationarity, the order of integration states the number of times differencing is required in order to obtain the desired stationarity (Hamilton, 1994).

2.1.7 Autoregressive Integrated Moving Average 

The autoregressive integrated moving average (ARIMA) process is another commonly used time series process. It is a generalization of the autoregressive moving average (ARMA) process, which combines the previously mentioned AR and MA processes into one. As opposed to the simple ARMA model, the ARIMA can be used in cases where the data are not stationary, because differencing of the data, represented by the I in the model, can be applied in order to make it stationary (Box et al., 2016).

The general ARIMA model is denoted ARIMA(p,d,q), where p is the order of the AR part, q is the order of the MA part, and d is the integration order. For example, for an ARIMA(1,1,1) the data has been differenced once, and the model has an AR(1) and an MA(1) part. An ARIMA(0,0,1) is the same as an MA(1), and an ARIMA(2,0,0) is an AR process of order two (Pankratz, 1983).

2.2 Neural Networks 

A neural network (NN) is a network of multiple artificial neurons which are connected to each other and arranged in a structure containing different layers. NNs are often said to be loosely modelled after the human brain (Dreyfus, 2005). A neuron is a function of some input values: it can be seen as a unit which takes some inputs and from them calculates an output. It does this by multiplying each input by a weight, forming a linear combination of these products, and adding a constant, or bias. This linear combination can be written as

$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$, (6)

where $w_1$ to $w_n$ are the weights, $x_1$ to $x_n$ are the inputs and $b$ is the bias. To this, the neuron applies a so called activation function, which converts the number $z$ to a desired format. This results in the neuron's output $\hat{y}$ (Michelucci, 2018), which is

$\hat{y} = f(z)$, (7)

where ​f​ is some activation function.
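A minimal Python sketch of equations (6) and (7); the input and weight values are purely illustrative assumptions.

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """A single artificial neuron: weighted sum plus bias, passed through an activation f."""
    z = np.dot(w, x) + b        # equation (6)
    return f(z)                 # equation (7)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.1, 0.4, -0.2])   # weights (illustrative values)
print(neuron(x, w, b=0.05))
```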

There are three different types of layers in an NN. The first layer is called the input layer, which is where the data are fed into the model and passed on to the neurons in the next layer. The input layer does not actually consist of neurons, since it does not perform any calculations on the inputs. The last layer is the output layer, which is where the network’s final output is generated. Between the input and the output layers are hidden layers, consisting of neurons which perform calculations on the data (Dreyfus, 2005; El-Amir and Hamdy, 2019). Using these layers of neurons, an NN learns associations between input and output values, generating outputs which can be used for dealing with classification and forecasting problems (Purkait, 2019). The structure of a simple neural network is visualised in Figure 1.

2.2.1 Feedforward Neural Networks 

An NN where the information moves only in a forward direction from the input, through the hidden layers and to the output layer, is called a feedforward network (Michelucci, 2018). This was the first kind of NN that was created. Today, a specific type of these, called convolutional NNs, is widely used for classification problems such as image recognition (Witten et al., 2016).

2.2.2 Recurrent Neural Networks 

More advanced than the feedforward NN presented above is the recurrent neural network (RNN). The neurons in this type of network do not only use the present values as input, but they also use their own output from the calculation at the previous timestep. RNNs can be said to have memory because they keep information through multiple timesteps in the network. In short, this means that an RNN takes sequences and timesteps into account. Therefore, this type of network is suitable to use when dealing with sequential data, such as in speech recognition, machine translation and time series forecasting (Witten et al., 2016; Chollet and Allaire, 2018).

2.2.3 Vanishing/Exploding Gradient Problem 

Understanding the simplest form of RNN is useful for learning about the networks which are well suited for modelling and widely used today. However, the simple RNN has a specific flaw which makes it not very useful: it suffers from vanishing and exploding gradients. Over short distances in the data, the RNN can predict based on what it has already seen, but as the distances grow it suffers from either vanishing or exploding gradients. The former means that, due to the properties of an RNN, when a value below one is propagated through the layers it decreases exponentially until it becomes close to zero, removing any opportunity for the model to learn. The exploding gradient is the opposite problem: if the value is above one it grows very quickly until it becomes numerically unusable, also rendering the model useless, although in a different manner (El-Amir and Hamdy, 2019).

2.3 Long Short-Term Memory Network

To remedy the problem with vanishing and exploding gradients, the first Long Short-Term Memory (LSTM) network was put forward by Sepp Hochreiter and Jürgen Schmidhuber (1997). The LSTM is a kind of recurrent neural network, with a more advanced structure than the simple RNN. This allows the network to keep information that is considered relevant in the model, and to forget information that is not (Greff et al., 2017).

2.3.1 Activation Functions 

To understand the calculations performed in an LSTM cell, some knowledge of activation functions is needed. These are mathematical functions whose purpose is to transform values into other values which are better to work with. In the case of the LSTM, two activation functions are used: the sigmoid (σ) function and the hyperbolic tangent (tanh) function. The sigmoid function converts an input value into a value between zero and one, where a value of one means that the network keeps the input in its entirety while a zero is the equivalent of the network completely forgetting it (El-Amir and Hamdy, 2019). The sigmoid function is used in the input, output and forget gates of the LSTM cell (Witten et al., 2016). The tanh activation function is similar to the sigmoid except that the possible outputs range from minus one to one instead of zero to one. The tanh activation function is beneficial in that negative inputs are mapped to negative values, and inputs of zero are mapped to zero (El-Amir and Hamdy, 2019). Both activation functions are plotted in Figure 2.

The sigmoid function converts values using the calculation

$\sigma(x) = \frac{1}{1 + e^{-x}}$, (8)

whereas the tanh function uses the formula

$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, (9)

where $x$ is the input value.

2.3.2 LSTM Cell Structure

The perhaps most central part of an LSTM network is its cell, which is made up of several different components. In Figure 3, the lines represent the transmission of a vector in the direction of the arrow. When a line diverges, the values are not split; they are copied so that both branches carry the same vector. There are three inputs to an LSTM cell: the memory from the previous timestep, represented by $c_{t-1}$, the activation or output from the previous timestep, represented by $h_{t-1}$, and the new input values at the current time, $x_t$ (Purkait, 2019). The red boxes show the activation functions that are used in the so called gates.

The gates are where the different operations take place. From left to right in Figure 3, they are the forget gate, input gate, update gate and output gate (El-Amir and Hamdy, 2019). The forget gate is important in making the LSTM unique: it decides whether the value from previous timesteps should be kept and incorporated in the next timestep or forgotten. This is especially effective when dealing with long-term dependencies. The circles represent pointwise operations, in this case addition and multiplication (Purkait, 2019).

The calculations performed in the LSTM cell are shown in the following equations:

$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$, (10)

$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$, (11)

$\tilde{c}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$, (12)

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, (13)

$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$, (14)

$h_t = o_t \odot \tanh(c_t)$, (15)

where the $W$ matrices are the weights, $h_{t-1}$ is the input from the previous timestep, $x_t$ is the input at the current time and the $b$ vectors are the biases. Equation (10) shows the operation at the forget gate. Equation (11) shows the calculation performed at the input gate, which determines what values will be updated. This is followed by (12), the update gate, where a vector of potential new memory is created. In (13) the two previous results are combined and merged with the previous memory in order to obtain the new memory. Equation (14) decides the output, which is finally multiplied with the current memory in (15) (El-Amir and Hamdy, 2019).
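As a concrete illustration of equations (10) to (15), a minimal NumPy sketch of one cell step is given below; the dimensions and the random parameter values are assumptions made only so the example runs, not the network used in the thesis.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # equation (8)

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step, equations (10)-(15); W and b hold the four gates' parameters."""
    z = np.concatenate([h_prev, x_t])           # the cell works on [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])          # (10) forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])          # (11) input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])      # (12) candidate new memory
    c_t = f_t * c_prev + i_t * c_tilde          # (13) updated cell memory
    o_t = sigmoid(W["o"] @ z + b["o"])          # (14) output gate
    h_t = o_t * np.tanh(c_t)                    # (15) new hidden state / output
    return h_t, c_t

# Tiny example with random parameters (illustrative only)
rng = np.random.default_rng(2)
n_in, n_hidden = 1, 4
W = {k: rng.normal(size=(n_hidden, n_hidden + n_in)) for k in "fico"}
b = {k: np.zeros(n_hidden) for k in "fico"}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_cell_step(np.array([0.3]), h, c, W, b)
```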

3 Data

The data used in this thesis is from the OMX Stockholm 30 (OMXS30) index, which is one of the most commonly used stock indices in Sweden. It is an index which measures the value of Sweden's 30 largest stocks. The historical data are obtained from Yahoo! Finance in the form of a Comma Separated Value (.csv) file and include the opening price of the market, the high and low for the time interval, the closing price and the adjusted closing price. The measure of interest in this thesis for determining the future value of the index is the adjusted closing price. This is the price adjusted for stock splits and dividend distributions (Yahoo Finance, 2020). This measure is chosen to ensure that no such corporate actions affect the value of the index in a misleading way.

To be able to forecast a long period ahead in time, a large sample of data is used. The 90 day forecast uses data from the past 10 years, while the 30 and seven day forecasts use data from the past three years and one year, respectively. For all three time horizons, daily data are used in the model.
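As an illustration of this step, a minimal loading sketch is shown below; the file name, the column labels (Yahoo! Finance's usual CSV export layout) and the way the train/test split is formed are assumptions rather than details stated in the thesis.

```python
import pandas as pd

# Assumed file name and Yahoo! Finance column layout:
# Date, Open, High, Low, Close, Adj Close, Volume
prices = pd.read_csv("OMXS30.csv", parse_dates=["Date"], index_col="Date")
adj_close = prices["Adj Close"]            # the series used for all forecasts

# One plausible way to form the split: hold out the last 90 observations as test data
train_90, test_90 = adj_close[:-90], adj_close[-90:]
```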

4 Methods

In this chapter, the methods used in order to be able to answer the research question are presented. This is done by explaining how the models are specified and how the different parameters for each of these are chosen.

4.1 ARIMA Specification 

At the onset of the model specification, the data are split into a training set and a test set. Three different ARIMA models are fitted, one for each time horizon. There are many different models used for financial time series; however, since the adjusted closing prices are used, ARIMA is chosen because its differencing component allows it to model non-stationary data. The models are fitted in the statistical software R, and the auto.arima function is used to select which models are used. The function returns the best model according to some information criterion, which in this case is chosen to be the Akaike information criterion (AIC).
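The thesis performs this step with R's auto.arima; a roughly equivalent sketch in Python, using the pmdarima package and a synthetic stand-in series so the example is self-contained, could look as follows.

```python
import numpy as np
import pmdarima as pm

# Stand-in for the training part of the adjusted closing prices (see section 3);
# a simulated random walk is used here only so that the example runs.
rng = np.random.default_rng(3)
train = 1500 + np.cumsum(rng.normal(scale=10, size=500))

# Order selection by AIC, mirroring the auto.arima workflow described above
model = pm.auto_arima(train, information_criterion="aic", seasonal=False, stepwise=True)
print(model.order)                        # selected (p, d, q)
forecast = model.predict(n_periods=90)    # 90-day-ahead forecast
```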

4.1.1 Akaike Information Criterion 

An information criterion is a method used to find the time series model that best fits the data. This is done by taking the negative log likelihood for several different models, and adding a penalty function which penalizes models with more parameters. The best fitting model is the one with the lowest score. AIC is one of the most commonly used information criteria, its calculation being

$\text{AIC} = -2\log(L) + 2k$, (16)

where $k$ denotes the number of parameters in the model and $L$ is the maximized likelihood (Cryer and Chan, 2008). Using an information criterion to determine the correct model is important: if the order of the chosen model is too low, the model will not be consistent, and if the order is too high, the variance will increase (Shibata, 1976).
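As a worked illustration of equation (16), with made-up log-likelihoods and parameter counts:

```python
def aic(log_likelihood, k):
    """Equation (16): AIC = -2*log L + 2k, where k is the number of parameters."""
    return -2.0 * log_likelihood + 2.0 * k

# The candidate with the lowest AIC is preferred (numbers are illustrative only).
print(aic(log_likelihood=-1052.3, k=2))   # 2108.6
print(aic(log_likelihood=-1050.9, k=4))   # 2109.8
```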

4.1.2 Fitted Models 

Based on the process detailed in the two previous sections, the three ARIMA models are fitted to the data. This results in models with slightly different constructions. Notably, the one used for the 90 day forecast is specified as an ARIMA(0,1,0), which simply means that the process generating the data in this case seems to be a random walk.

The model for the 30 day forecast is an ARIMA(1,0,3) with non-zero mean, which means that in contrast to the two other models, it is not differenced. In other words, it is a simple ARMA(1,3), containing both an autoregressive and a moving average component.

The final model for the seven day forecast is an ARIMA(2,1,0) which in practice means that it is an AR(2) that is integrated of order one.

4.2 LSTM Specification 

In the sections below, the different aspects of the LSTM NN specification are described. The models are created using the Keras application programming interface in TensorFlow, which is an open source library for machine learning written in Python (Purkait, 2019). It is run using the cloud based software Google Colaboratory.

4.2.1 Training a Neural Network 

For an NN to be able to make accurate forecasts, it has to optimize the weights and biases in the neurons so that the output is close to the actual values that the network is tested on. When the training starts, random weights are used in the network, and the output generated from them is used to make an initial forecast. The weights are then updated as the network trains, reducing the errors on the training data and learning its pattern, so that the updated weights can be used to forecast unseen data (El-Amir and Hamdy, 2019).

4.2.2 Hyperparameters 

Unlike in many other statistical methods, there are multiple parameters in an NN which are not estimated by the model. Instead, the network is constructed by manually specifying a number of so called hyperparameters. These include, for example, the number of neurons the network is comprised of, the number of layers it contains and which methods are used to train the network (Michelucci, 2018). Specifying the hyperparameters is not an exact science. Generally, there are no rules for what the different hyperparameters should be; Dreyfus (2005) even suggests that their values are not of great importance. For this thesis, the hyperparameters are in some cases set using rules of thumb and sometimes through trial and error. The following sections explain the different hyperparameters and clarify how they have been selected.

4.2.2.1 Layers 

An NN consists of a number of layers. The number of layers is chosen when designing the model, and there are no clear answers to how many there ideally should be. Earlier theoretical work suggested that only one hidden layer, a so called shallow neural network, was sufficient to solve any problem (Cybenko, 1989). However, this became a debated issue over time. The premise was still true, one layer was enough, but performance could be enhanced by adding more layers, making it a deep neural network (Pascanu et al., 2014). Since the so called stacked LSTM has become more widespread in its usage, it is implemented in the model for this thesis, using three hidden layers. Of these hidden layers, two are LSTM layers and one is a Dense layer, where the LSTM layers perform the calculations shown in equations (10) to (15), and the Dense layer performs the simpler linear combinations in equations (6) and (7).

4.2.2.2 Neurons

In each layer of an NN there is a number of neurons, which is manually specified for every layer. The optimal number of neurons in the hidden layers is somewhat up for debate; some papers have used up to 250 neurons in each layer (Macukow, 2016). Other empirical results suggest that the number of neurons does not affect performance by much (Huang et al., 2015).

For the model in this thesis, a middle ground is chosen with 50, 50 and 25 neurons for the respective layers. Different values are tried for every layer through trial and error, while keeping the other hyperparameters constant. The result, which can be seen in Figure 4, shows that the differences in the loss function are small when comparing various numbers of neurons (note that the y-axis in Figure 4 is truncated). The chosen numbers of neurons produce slightly better results on the training loss than the other options. The output layer consists of one single neuron, because only one output value at a time is forecast in time series problems such as this one.
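Putting the layer and neuron choices together, a minimal Keras sketch of the described architecture might look as follows; the look-back window length and input shape are assumptions, and the dropout layers anticipate the regularization choice described in section 4.2.2.8.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

timesteps = 60                          # assumed look-back window length (not stated in the thesis)

model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(timesteps, 1)),  # first LSTM layer, 50 neurons
    Dropout(0.5),
    LSTM(50),                                                     # second LSTM layer, 50 neurons
    Dropout(0.5),
    Dense(25),                                                    # Dense hidden layer, 25 neurons
    Dropout(0.5),
    Dense(1),                                                     # single output neuron
])
model.compile(optimizer="adam", loss="mse")   # loss and optimizer as in sections 4.2.2.3 and 4.2.2.5
```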

Figure 4. MSE loss for different numbers of neurons.

4.2.2.3 Loss Function

When training an LSTM model, a measure of the loss has to be defined. For example, when specifying hyperparameters such as the number of epochs (covered more in depth below), minimizing the loss is an intuitive way to do so. In time series based problems, the Mean Square Error (MSE) is considered a good choice, which is why it is chosen as the loss function in this thesis (Capelo, 2018; Chollet and Allaire, 2018).

4.2.2.4 Validation Data 

When using an LSTM NN, the data are typically not only split into training and testing data but also into a validation set. Among the purposes of the validation set is to help tune the hyperparameters (Witten et al., 2016). It is also important to study when plotting the learning curves of the model. The ideal result is then to obtain two curves that converge after having the loss function drop rapidly as the model is trained (Wei, 2015).

4.2.2.5 Optimizer 

Training NNs is done using a so called optimizer, an algorithm whose purpose is to minimize the loss function by finding a (local) minimum of that function. Knowing the actual values in the training data, the optimizer iteratively tries to end up where the loss function is minimized. The process is performed in a three step loop: the first step is to obtain the function value and gradient via forward and backward propagation; the next step is to propose a new step and increment depending on the current step; the third and final step is to incorporate said increment into the original function before repeating the process a certain number of times (Lv et al., 2017). The concept of an optimizer is not exclusive to NNs or LSTMs but is a widespread concept in many fields. In this thesis, the optimizer Adam is used, as it has been proven to be very efficient and has been widely recommended (Kingma and Ba, 2015; Ruder, 2017; Michelucci, 2018). It handles large datasets well in addition to being appropriate to use with non-stationary data. Seeing as this thesis focuses on the closing price of the market, a measure that might not result in a stationary time series, Adam fits these requirements well. Adam is used with the default configuration parameters recommended by its creators (Kingma and Ba, 2015).

4.2.2.6 Batch Sizes 

In training an LSTM model, a batch is a set of one or more samples that are passed through the model together (Michelucci, 2018). In a hypothetical scenario with a training set of 500 samples, a batch size of 250 would divide the set into two different batches when training the model. There is no definitive rule of thumb for how large a batch size should be, but if one uses a mini batch – i.e. a batch that is larger than one single sample but smaller than the entire training set – 32 has proved to be a good choice, and it is therefore used for the network specified in this thesis (Reimers and Gurevych, 2017).

4.2.2.7 Number of Epochs 

An LSTM NN is trained over the training data a number of times in order to minimize the error of the model. When the entire training set has been passed through the model once, that is defined as an epoch (Michelucci, 2018). Unlike for the batch size, there is no rule of thumb for how many epochs should be processed; it depends on the characteristics of the data. Looking at Figure 5, one can clearly see that during the initial epochs the MSE decreases quite dramatically before it levels off; between epoch 10 and epoch 500 there is no noticeable further decrease in the MSE loss. Based on this, 100 epochs is chosen as a fitting number to ensure the loss is minimized, since further training does not seem to improve the network while being more computationally expensive.
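Continuing the architecture sketch above, training with these settings could look like the following; X_train, y_train, X_val and y_val are hypothetical arrays of input windows and next-step targets, not names from the thesis.

```python
history = model.fit(
    X_train, y_train,                   # hypothetical training windows and targets
    validation_data=(X_val, y_val),     # held-out validation set (section 4.2.2.4)
    epochs=100,                         # chosen number of epochs
    batch_size=32,                      # chosen batch size
)
# history.history["loss"] and history.history["val_loss"] give the curves of Figure 5.
```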

Figure 5. MSE loss function for the training and validation data used for the 90 day forecast, over a different number of epochs.

4.2.2.8 Regularization 

Problems related to overfitting are common in many statistical methods, and this is also the case for LSTM NNs. Overfitting means that the model learns the characteristics of the data set it is trained on but cannot generalize outside the training data, and therefore cannot accurately forecast data it has not seen before (Chollet and Allaire, 2018).

There are numerous ways to try to mitigate overfitting, and thus improve generalization in an LSTM. The methods for doing this are known as regularization techniques (Chollet and Allaire, 2018). In the networks created for this thesis, dropout is used as the regularization technique since it is a method that has been proven to produce very good results (Srivastava et al., 2014; Baldi and Sadowski, 2013). It works by randomly selecting nodes with some probability and dropping them, along with their connections, from the network during training. Many of these thinned networks are trained, and at testing time they are combined into one network, which can generalize better than networks without dropout (Srivastava et al., 2014).

In this thesis, dropout is used between all layers, except between the input layer and the first hidden layer, since no calculations are made in the input layer (Dreyfus, 2005). The dropout rate, the specified probability that each unit will be dropped, is set to 0.5. According to Srivastava et al. (2014), this rate seems to work well for the intermediate layers in many different kinds of networks.

Looking again at Figure 5, the curves for both the training and validation data decrease almost simultaneously until they stabilize at a very low value. After they have stabilized, there is no renewed increase in the validation error further along in training either. Had that been the case, e.g. if the validation error had started to increase after epoch 100 or at any other point, it would have signalled that the model is overfitting, which is not the case here.

4.3 Error Measures for Evaluation 

To be able to draw conclusions about which model makes better forecasts, the forecasts have to be evaluated. In this thesis, this is done using two different error measures which are described below. The Root Mean Square Error (RMSE) is a common measure used when determining the accuracy and rate of error for different models. However, critics have suggested that the Mean Absolute Error (MAE) is a superior measure when evaluating a model (Willmott and Matsuura, 2005). Even though the MAE might be superior, given that the RMSE is widely used, both are used to evaluate the ARIMA and LSTM models.

4.3.1 Root Mean Square Error (RMSE) 

The RMSE computes the mean of the squared differences between the observed and forecasted values, and takes the square root of that value. This can also be written as

$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$, (17)

where $\hat{y}_i$ are the forecasted values, $y_i$ the observed values and $n$ is the number of forecasts.

4.3.2 Mean Absolute Error (MAE) 

The MAE has similar properties to the RMSE but instead of the squared differences, it uses the absolute values. The MAE is the mean of the absolute differences between the forecasted and observed values, or

$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$, (18)

where $\hat{y}_i$ are the forecasted values, $y_i$ the observed values and $n$ is the number of forecasts.
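A small sketch of equations (17) and (18), with purely illustrative numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Equation (17): root mean square error."""
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def mae(y_true, y_pred):
    """Equation (18): mean absolute error."""
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

# Illustrative values only
y_true = [1702.0, 1689.5, 1695.3]
y_pred = [1710.2, 1680.0, 1700.1]
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```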


5 Results

Table 1. Error measures (RMSE and MAE) for the ARIMA and LSTM model forecasts.

Forecast horizon   RMSE (ARIMA)   RMSE (LSTM)   MAE (ARIMA)   MAE (LSTM)
90 days            163.66         104.66        146.69        84.20
30 days            53.99          38.50         38.37         30.32
7 days             11.44          7.66          6.52          6.21

As can be seen in Table 1, both error measures used in this thesis, the RMSE and the MAE, show relatively high values, indicating a large degree of error in the forecasts, primarily for the two longer horizons. A 90 day ARIMA forecast of the OMXS30 index has a very low chance of hitting the actual value; as shown in the appendix, the model selected on the AIC is an ARIMA(0,1,0), i.e. a random walk, and hence essentially not predictable. Compared to the 90 day forecast, the 30 day ARIMA forecast clearly shows a lower error: the RMSE is about a third, and the MAE almost a fourth, of the 90 day values. While these are still quite high, the model is at least somewhat accurate on this time interval.

The seven day ARIMA forecast, on the other hand, shows a much lower error than the two longer horizons; especially the MAE is low, indicating that on a shorter horizon the accuracy of the model is relatively high. A clear trend can be seen here: the RMSE of the seven day ARIMA forecast is almost fifteen times smaller than that of the 90 day forecast, showing that the model becomes more accurate the shorter the forecasting horizon is.

In Table 1, one can also see that the RMSE for the 90 day LSTM forecast is high, although quite a bit lower than that of the ARIMA forecast. The MAE follows the same pattern in terms of how much it differs between the two models. With the same hyperparameters as for the 90 day forecast, the RMSE and MAE for the 30 day forecast are dramatically lower.

For the 30 day forecast, the LSTM still outperforms the ARIMA. For the seven day forecast, on the other hand, both values are much closer to those of the ARIMA on the same interval, but both the RMSE and the MAE are still slightly lower for the LSTM.

6 Discussion

Looking at the results, they seem to match the initial hypothesis: the LSTM outperforms the ARIMA model on the 90 and 30 day forecasts. However, on the seven day forecast, especially the MAE values are very close, which is contrary to the expectations before constructing the two models. One possible, if not likely, explanation for this could be the very random nature of the stock market. Even though it might not be the equivalent of a coin flip, past data conceivably have a very small effect on the present and future value of a given index. This theory is reinforced by the fact that, based on the AIC score, the time series process seems to be a random walk. It is noteworthy, however, that despite this, the LSTM outperforms the ARIMA by a margin that is not inconsiderable.

As detailed in the Methods section, overfitting is a problem in many NNs and was a primary concern at the onset of this thesis. Dropout is the method chosen to mitigate this, and the hope is naturally that it works. Figure 5 shows a good fit by present standards for an NN, suggesting that the risk of overfitting is successfully combated.

As touched upon above, the stock market is extremely complex in its nature. Although past prices are not necessarily entirely insignificant, relying completely on the closing price of the market in previous days is a very narrow approach. It does not take into consideration any fundamental aspects, such as the profitability of the companies of which the index consists. Shocks in the market due to unforeseen events are also impossible to forecast. If one were able to accurately incorporate indicators of this, the model would likely be much more accurate. The fact that the OMXS30 is the only index modelled limits the scope of this thesis; the conclusions one would draw from applying the same model and hyperparameters to e.g. the NASDAQ-100 would perhaps differ, at least to some degree.

Although the concept of a network that trains itself on past data in order to make forecasts at first glance seems a perfect fit for stock markets, this is a quite naive point of view. This model, or any similar one that bases its forecasts solely on past data, will likely not be able to accurately and consistently forecast the future value of the market. If one were to employ an investment or trading strategy on this basis, it might not end catastrophically, but it would not be the revolutionizing success it would have been had the model been able to make accurate forecasts. That is not to say that neural networks and machine learning do not have a use; quite the contrary, in fields where forecasts can be made on past data, the ability to spot indicators ahead of time is extremely powerful. As for their use in finance, considering that finance professionals have been incorporating machine learning more and more with each passing year, they clearly have some relevance, although they are not a near-magic entity that can forecast future values based solely on past data.

A possible improvement that could be made to better forecast the market would be to create a model that takes into consideration not only the historical prices but also other factors, such as macroeconomic indicators, the intrinsic value of the companies that compose the index and possibly some sort of sentiment indicator.

7 Conclusion

This thesis was written with the objective of learning more about different time series forecasting techniques and comparing their results, as well as gaining more insight into the workings of the Swedish stock market. This was done by fitting three ARIMA models and three LSTM neural networks to the same data from the OMXS30 index and evaluating their performance on 90, 30 and 7 day horizons.

Possible improvements on the model that would make it slightly more sophisticated would be to try and accurately incorporate the fundamentals of the companies in the index as well as leading macroeconomic indicators and some sort of sentiment analysis.

In the thesis, it is shown that in this case the LSTM models perform better than the ARIMA in forecasting future index values on the 90 and 30 day horizons. On the seven day forecast, the ARIMA and the LSTM performed very comparably, which was contrary to the initial hypothesis. Furthermore, it is possible that the results would be different if other datasets were used. Forecasts for one dataset are presented in this thesis, but if the same methods were applied to other stock indices or other time series data, other conclusions might be drawn. Using data from another stock market with a different nature might yield different results; only analyzing the Swedish market therefore limits the scope of the conclusions that can be drawn from this thesis. A conclusion can nevertheless be drawn about the relative effectiveness of the two models, where the LSTM neural network consistently seems to outperform the ARIMA. However, considering that neither model makes highly accurate forecasts, using either one to try to turn a profit trading seems a futile endeavour.

References 

Baldi, P. and Sadowski, P. (2013). Understanding Dropout. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2:2814-2822.

Box, G. E. P., Jenkins, G. M., Reinsel, G. C. and Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control. John Wiley & Sons, Hoboken, 5th edition.

Capelo, L. (2018). ​Beginning Application Development with TensorFlow and Keras: Learn to Design, Develop, Train, and Deploy TensorFlow and Keras Models As Real-World Applications. ​Packt Publishing Limited, Birmingham, 1st edition.

Cesa, M. (2017). A brief history of quantitative finance.​ Probability, Uncertainty and Quantitative Risk,​ 2(1):1-16.

Chen, K., Zhou, Y. and Dai, F. (2015). A LSTM-based method for stock returns prediction: A case study of China stock market. ​2015 IEEE International Conference on Big Data,​ 2823-2824.

Chollet, F. and Allaire, J. J. (2018). ​Deep Learning with R​. Manning Publications Co., New York, 1st edition.

Cryer, J. D. and Chan, K. (2008). ​Time Series Analysis: With Applications in R​. Springer, New York, 2nd edition.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. ​Mathematics of Control, Signals, and Systems​, 2(4):303-314.

Dreyfus, G. (2005). ​Neural Networks: Methodology and Applications​. Springer, Berlin Heidelberg, 1st edition.

El-Amir, H. and Hamdy, M. (2019). ​Deep Learning Pipeline: Building a Deep Learning Model with TensorFlow​. Apress, Berkeley, 1st edition.

Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R. and Schmidhuber, J. (2017). LSTM: A Search Space Odyssey. ​IEEE Transactions on Neural Networks and Learning Systems, ​28(10):2222-2232.

Hamilton, J. D. (1994). ​Time Series Analysis​. Princeton University Press, Princeton, 1st edition.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. ​Neural computation​, 9(8):1735-1780.

Huang, Z., Xu, W. and Yu, K. (2015). Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

Levendis, J. D. (2018). ​Time Series Econometrics: Learning Through Replication​. Springer, Cham, 1st edition.

Lv, K., Jiang, S. and Li, J. (2017). Learning gradient descent: Better generalization and longer horizons. ​Proceedings of the 34th International Conference on Machine Learning​, 70:2247-2255.

Macukow, B. (2016). Neural Networks – State of Art, Brief History, Basic Models and Architecture. Computer Information Systems and Industrial Management​, 9842:3-14.

Malkiel, B. G. (2015). A Random Walk Down Wall Street: The Time-Tested Strategy for Successful Investing. W.W. Norton & Company, New York, 11th edition.

Manuca, R. and Savit, R. (1996). Stationarity and nonstationarity in time series analysis. Physica D: Nonlinear Phenomena, 99(3):134-161.

McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. ​Bulletin of mathematical biophysics​, 5:115-133.

McNelis, P. D. (2005). ​Neural Networks in Finance: Gaining Predictive Edge in the Market​. Academic Press, San Diego, 1st edition.

Michelucci, U. (2018). ​Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks​. Apress, Berkeley, 1st edition.

Nelson, D. M., Pereira, A. C. and de Oliveira, R. A. (2017). Stock market's price movement prediction with LSTM neural networks. ​2017 International Joint Conference on Neural Networks​, 1419-1426.

OMX Stockholm 30, index quote. ​Yahoo! Finance.​ Retrieved 2020-04-07, from: https://finance.yahoo.com/quote/%5EOMX/

Pankratz, A. (1983). ​Forecasting with Univariate Box-Jenkins Models: Concepts and Cases​. Wiley, New York, 1st edition.

Purkait, N. (2019). ​Hands-On Neural Networks with Keras: Design and create neural networks using deep learning and artificial intelligence principles​. Packt Publishing, Birmingham, 1st edition.

Reimers, N. and Gurevych, I. (2017). Optimal hyperparameters for deep lstm-networks for sequence labeling tasks. ​arXiv preprint arXiv:1707.06799​.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. ​arXiv preprint arXiv:1609.04747​.

Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. ​Biometrika​, 63(1):117-126.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. ​The journal of machine learning research​, 15(1):1929-1958.

Willmott, C. J. and Matsuura, K. (2005). Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research, 30(1):79-82.

Witten, I. H., Frank, E., Hall, M.A. and Pal, C.J. (2016). ​Data Mining: Practical Machine Learning Tools and Techniques​. Elsevier, Amsterdam, 4th edition.

What is the adjusted close? ​Yahoo! Help,​ Retrieved 2020-04-07, from: https://help.yahoo.com/kb/.html


Appendix

Figure 6. Forecast from an ARIMA(0,1,0) 90 days ahead.

Table 2. Error measures from the 90 day ARIMA(0,1,0) forecast.

Figure 7. Forecast from an ARIMA(1,0,3) with non-zero mean, 30 days ahead.

Table 3. Error measures from the 30 day ARIMA(1,0,3) with non-zero mean forecast.

Figure 8. Forecast from an ARIMA(2,1,0) with drift, 7 days ahead.

Table 4. Error measures from the 7 day ARIMA(2,1,0) with drift forecast.

Figure 9. Forecast from an LSTM NN 90 days ahead.

Table 5. Error measures from the 90 day LSTM NN forecast.

Figure 10. Forecast from an LSTM NN 30 days ahead.

Table 6. Error measures from the 30 day LSTM NN forecast.

Figure 11. Forecast from an LSTM NN 7 days ahead.

Table 7. Error measures from the 7 day LSTM NN forecast.
