CAN DEEP LEARNING BEAT TRADITIONAL ECONOMETRICS IN FORECASTING OF REALIZED VOLATILITY?

(1)

CAN DEEP LEARNING BEAT TRADITIONAL ECONOMETRICS IN FORECASTING OF REALIZED VOLATILITY?

Submitted by Filip Bj¨ ornsj¨ o

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Masters degree

in Statistics in the Faculty of Social Sciences

Supervisor Yukai Yang

Spring, 2020

(2)

ABSTRACT

Volatility modelling is a field dominated by classic Econometric methods such as the Nobel Prize winning Autoregressive conditional heteroskedasticity (ARCH) model. This paper therefore investigates if the field of Deep Learning can live up to the hype and outperform classic Econometrics in forecasting of realized volatility. By letting the Heterogeneous AutoRegressive model of Realized Volatility with multiple jump components (HAR-RV-CJ) represent the Econometric field as benchmark model, we compare its efficiency in forecasting realized volatility to four Deep Learning models. The results of the experiment show that the HAR-RV-CJ performs in line with the four Deep Learning models: Feed Forward Neural Network (FNN), Recurrent Neural Network (RNN), Long Short Term Memory network (LSTM) and Gated Recurrent Unit Network (GRU). Hence, the paper cannot conclude that the field of Deep Learning is superior to classic Econometrics in forecasting of realized volatility.

(3)

1 Introduction

Volatility is a measure of great importance for banks, financial institutions and investors. The measure plays an important role within risk management modelling, financial derivatives pricing and hedging theory. Stay- ing ahead with precise forecasting models of volatility can therefore make huge difference for practitioners.

The importance of the discipline is reflected in historical research where it is one of the most covered fields within Econometrics. A lot of famous papers such as Engle (1982), Bollerslev (1986) and Corsi (2009) has been published on the topic throughout the years. A central factor for research in the discipline is volatility clustering, which is is a well established property of financial return series, dating all the way back to Mandel- brot (1963) and Fama (1965). The serial dependency in volatility makes it suitable for time series modelling and a large variety of models are in literature applied to the problem. Nobel Prize winning Autoregressive conditional heteroskedasticity (ARCH) framework by Engle (1982) and generalized by Bollerslev (1986), is a classic approach which dominates early studies on the subject. A lot of discussion regarding the (G)ARCH framework surround its ability to forecast and critics mean that the model has little practical use in this area, see; Cumby et al. (1993) and Figlewski (1997). Andersen & Bollerslev (1998) defends the framework in their study and found that that (G)ARCH generates good volatility forecasts in response to critics. So there is some degree of fragmentation in historical research. The (G)ARCH family treats volatility as latent, which for long was a general view, but with the digital transformations of the markets and increasing availability of intraday financial data, Andersen & Bollerslev (1998) proposed that the daily volatility can be measured by the square root of the sum of intraday squared log returns; Realized Volatilty (RV).

Volatility did no longer have to be treated as a latent variable, which opened up for a new set of models in the area and Andersen et al. (2003) showed that simple times series models applied to RV strongly outperformed the (G)ARCH family in terms of forecasting. The Autoregressive Fractionally Integrated Moving Average (ARFIMA) proposed by Granger & Joyeux (1980) has been frequently used in studies of the realized volatility. Thanks to its long memory properties, it is suitable for modelling time series which are serial dependent and was found superior to GARCH by Ma, Li, Zhao & Luo (2012). ARFIMA is in Corsi (2004) compared to the Heterogeneous AutoRegressive model of Realized Volatility (HAR-RV) first proposed by Fulvio Corsi, which is based on the heterogeneous market hypothesis proposed by M¨uller et al. (1993). HAR-RV, which is an additive cascade model with different volatility components, showed on remarkable forecasting abilities of realized volatility for its simplicity and performed as good as the complicated ARFIMA. Izzeldin, Hassan, Pappas & Tsionas (2019) confirms the close forecasting results between the methods on realized volatility for equities. The simplicity of the HAR-RV model and its ability to capture the long memory behavior of volatility, even if not belonging to the family of long memory models, is what makes it unique. Inspired by Barndorff-Nielsen and Shephard (2003) results showing that realized volatility can be decomposed into a continuous and discontinuous sample path, Andersen et al. (2007) further developed the HAR-RV framework allowing for discontinuities, i.e a sudden jump in volatility due macroeconomic news. Andersen refers to the model with one ”jump” component as HAR-RV-J and multiple as HAR-RV-CJ.

The study showed that the components improved the forecasting performance in comparison with standard model propes by Corsi (2004). This is is also the case in Wen, Zhao, Zhang & Hu (2019), where the HAR model with multiple jump components has best ability when forecasting realized volatility of oil prices.

In order to capture asymmetrical effects in volatility shocks (negative shocks have more impact then positive according to Black (1976)) and non-linear patterns, classic econometric models have often been combined with

(5)

Deep Learning by researchers. For example, Donaldson & Kamstra (1997) applied a semi non-parametric GARCH model, making use of Artificial Neural Networks (ANN) to capture leverage effects and Hajizadeh et al. (2012) improved an EGARCH by an augmented model with ANN. The HAR-RV-J has also been combined with Deep Learning by Liu, Pantelous & Mettenheim (2018), where a Recurrent Neural Network (RNN) is compared to the HAR-RV-J and a hybrid model. The RNN forecasts for realized volatility is most accurate in the study, which also is the case in Bucci (2019), where the Recurrent Neural Networks are superior to ARFIMA.

In past research of realized volatility modelling, Artificial Neural Networks are often used in hybrid models combining some Econometric framework with the non-linear properties of Deep Learning. This report therefore aims to contribute to literature by comparing Deep Learning with Econometrics, in order to examine if the if the hyped field of Deep Learning can beat classical statistics in forecasting of realized volatility.

For this objective, an experimental setup has been carefully designed in order to answer if Deep Learning really is superior to Econometrics for the problem. In the setup, a model which has proven strong predictive accuracy of realized volatility in historical research from the Econometric field is chosen. The chosen candi- date; HAR-RV-CJ is used as benchmark and compared to four Deep Learning models; Feed Forwards Neural Network (FNN), Fully Connected Recurrent Neural Network (RNN), Long Short Term Memory Network (LSTM) and Gated Recurrent Unit Network (GRU). All models are estimated based on the variables set by the HAR-RV-CJ framework, in order to create an as fair comparison as possible. Forecasts of S&P500, DAX30 and N225 realized volatility are computed and loss measures in combination with Diebold-Mariano test statistics are interpreted in order to answer the thesis objective. The idea is to let the Econometric model set the ”playing rules” by choosing variables of use and train the Artificial Neural Networks with standard settings and no optimization of hyperparameters. If the Deep Learning models still would outperform the Econometric, results would indicate the families superiority.

In next section the data used in order to examine the thesis objective is introduced in combination with some asymptotic proofs necessary to define the variables in the HAR-RV-CJ framework. Thereafter theoretical background for the Econometric and Deep Learning models is presented, with a method section describing the experimental setup in detail following. The results are based on methodology described in the method section and a conclusion part is included together with a discussion of results. Note that Appendix A includes descriptive statistics and Appendix B some results.

(6)

2 Data

The data used in this paper has been collected from Oxford-Man Institute’s realized library, published by Heber, Gerd, Lunde, Shephard & Sheppard (2009). The library includes realized measures for 31 major indexes, based on intraday data obtained from Thomson Reuters. For the purpose of this report, three major indexes has been chosen:

• S&P 500 - Stock market index of the 500 largest companies listed on stock exchanges in the U.S.

Collected data ranging between: 2001-01-03 - 2020-03-11.

• DAX 30 - German Stock index of the 30 largest companies traded on Frankfurt Stock Exchange.

Collected data ranging between: 2001-01-03 - 2020-03-11.

• NIKKIE 225 - Weighted index of 255 large companies listed on the Tokyo Stock Exchange. Collected data ranging between: 2001-01-03 - 2020-03-11.

The measures that has been collected for each index are:

• The Realized daily Volatility (RV)

• The Bipower Variation (BV)

Realized Volatility is actually the square root of Realized Variation, but we will treat these terms inter- changeably in line with Anderson et al. (2007). The measure is calculated by taking the sum of intraday log returns

RVt=

M

X

j=1

r²_t,j, (1)

where rt,j = Xt,j − Xt,j−1 and Xt,j is the intraday log price for day t and intraday period j. Keep in mind that when M increase the length of the intraday periods decrease. Daily Realized Volatility based on 5-minute intraday log returns was collected, which corresponds to M = 102 for a normal market day with 8.5 hours of trading. Bipower Variation (BV) is calculated as following

BV_t= µ⁻²₁ M M − 1

X^M

j=2

|r_t,j||r_t,j−1|, (2)

where µ1 = E(|Z|) =p2/π, hence the expected absolute value of a standard normal variable Z and rt,j

are once again 5-minute log returns. BV will be used in combination with RV in order to define one of the explanatory variables in next section.

(7)

2.1 Defining Variables

The variables of use in this report are based on the HAR-RV-CJ framework by Anderson et al. (2007) and rely heavily on asymptotic properties of Realized Volatility (RV) and Bipower Variation (BV). The first variable is the jump component J_t, which measures discontinues in the underlying sample path, i.e sudden jumps in volatility. The variable is defined as

J_t= M axh

RV_t− BVt, 0i

. (3)

Proof of this can be found in Appendix A. This definition does though include a lot of noise, which we want to eliminate in order to get a better approximation of the true jump effects in volatility. Andersson et al. (2007) handle the noise by applying results from Barndorff-Nielsen & Shephard (2003) and Huang &

Tauchen (2005) to form a Z statistic based on RV and BV in order to find significant jump effects:

Zt= RV_t⁻¹(RVt− BVt) q

(µ⁻⁴₁ 2µ⁻²₁ + M⁻¹(_BV^{T Q}^t2 t )

∼ N (0, 1). (4)

With help from (4), Andersson et al. (2007) define the ”significant” jump effects as values with Z score larger then the cutoff Z1−a for some level of significance a. Hence, the new jump component is defined with indicatior function I(·) as

Jt= I(Zt> Z1−a)(RVt− BVt). (5)

As this study does not have access to the Tri-power Quarticity (TQ) variable in (4), we will apply an alternative approach of removing noise from the jump component as in (5). It is though still important to consider the procedure used by Andersson et al. (2007), in order to follow the set up of the HAR-RV-CJ framework from the original paper as close as possible. The method proposed for finding ”significant” jump effects in this paper, is simply based on observed deviation from the sample mean and is defined as

X_t= RV_t− BVt (6)

J_t= I(X_t> ¯X + 2σ_x)(X_t), (7)

where σxand ¯X are the standard deviation and mean of the full sample X and I(·) is an indicator function.

This way, the jump component is cleaned from noise just as in the original papers procedure. This definition of the jump component might though lead to bias in forecasting results, as we make use of information that has not yet been observed when defining the jump component. But as this is a comparison study and all models get the same amount of information, this possible problem is disregarded.

(8)

In line with Anderson et. al (2007) framework for the HAR-RV-CJ model, we can now define the variables that all models will make use of. let Ct−p represent the rolling average of realized volatility based on the previous p days

C_t−p=1 p

RV_t+ RV_t−1+, , , +RV_t+1−p

= 1 p

p

X

i=1

RV_t+1−i, (8)

and Jt−p be the rolling average of the jump component defined in equation (7) for the previous p days

Jt−p= 1 p

Jt+ Jt−1+, , , +Jt+1−p

= 1 p

p

X

i=1

Jt+1−i. (9)

The final variables are then C_t, C_t−5, C_t−22and J_t, J_t−5, J_t−22, hence the daily, weekly and monthly average of realized volatility and jump component for S&P500, DAX30 and N225. Note that Ct is equivalent to RV_t and and the target variable of interest is RV_t+k. Descriptive statistics of all variables can be found in Appendix B. We can note that realized volatility for DAX30 has largest standard deviation and that N225 has lowest.

(9)

3 Theory

This section will at first give a brief explanation of the two families which are up for comparison; Econometrics and Deep Learning. The frameworks will then be explained in more detail, with the mathematical background and discussion regarding parameters. The Diebold-Mariano test used to determine differences in accuracy between competing forecasts is also described in detail.

3.1 Econometric model

Econometric models are simply a set of statistical methods used within economics. In broad terms, this can be anything from modelling GDP growth with OLS regression, to forecasting sales figures for a company the next coming month with some time series approach. This report will however focus on the subset of models within Econometrics that deal with serial dependencies. As mentioned in the related work part, there are numerous methods that could be applied to the problem of forecasting realized volatility but the one chosen is the Heterogeneous Autoregressive model of Realized Volatility with multiple jump components (HAR-RV-CJ) designed by Corsi (2004) and Andersen et al. (2007). This choice is motivated by historical research that suggest its superiority to other econometric models and that the HAR-RV-CJ framework is fixed and independent of underlying process, which facilitates comparison with the Deep Learning models.

ARFIMA could have been used due to its in line forecasting results with the HAR-RV-CJ (in research), but as the best fitting ARFIMA is dependent of the underlying time series, the fixed framework of HAR-RV-CJ is to prefer for comparison reasons.

3.2 Deep Learning models

Deep learning is a subset of methods within the family of Machine Learning that exploit multiple levels of non-linear information, in order to model complex relationships among data (Deng, Yu (2014)). The methods within the field are characterized by their multiple-layer architecture and learning can be supervised, semi- supervise or unsupervised. This report will though focus on supervised learning with Artificial Neural Networks (ANN) and its different applications. As realized volatility is serial dependent, it makes sense to use the class of Artificial Neural Networks that has built in memory functionalities; Recurrent Neural Networks (RNN). These are special designed to deal with modelling that require knowledge about previous information, such as text, speech or general time series problems. Three Recurrent Neural Networks, with different memory designs, will be used; Fully Connected Recurrent Network, Long Short Term Memory Network, by Hochreiter & Schmidhuber (1997), Gated Recurrent Network, by Cho et. al (2014). A regular Feed Forward Neural Network (FNN) will also be applied.

(10)

3.3 HAR model of Realized Volatility (HAR-RV-CJ)

The HAR-RV-CJ will be applied in this paper in line with Andersen et al. (2007). As mentioned, this is an additive cascade model, based on three volatility and three jump components aiming to capture continuities and discontinuities characteristics of realized volatility. The model has the following structure

RVt+k= a + βCDCt+ βCWCt−5+ βCMCt−22

+ βJ DJt+ βJ WJt−5+ βJ MJt−22+ t+h,

(10)

where Ct, Ct−5, Ct−22 and Jt, Jt−5, Jt−22 are the rolling daily, weekly and monthly averages of the realized volatility and the jump component defined in section 2.1. Hence, the model is built on moving averages, aiming to capture both short and long term effects.

3.3.1 Estimation

HAR-RV-CJ is a linear regression model and estimation is done by Ordinary Least Square (OLS) which aims to minimize the sum of squared residuals. Let

X = (1, Ct, Ct−5, Ct−22, Jt, Jt−5, Jt−22), be a set of (n − k) × 1 arrays and

Y = (RVt+k, RVt+k−1, ..., RVt+k−n)⁰, be the target array of realized volatility. Then OLS estimate

β β

β = (a, βCD, βCW, βCM, βJ D, βJ W, βJ M)⁰ by

β β

β = (X⁰X)⁻¹X⁰Y. (11)

Hence, the estimates from (10) are used in (11) for forecasting RV_t+k. The assumptions for OLS estimation are:

• Linearity in the parameters

• The observations are identical independently distributed

• E(|X) = 0

• No multi-collinearity

• Homoscedastic error term

As we are only interested in the models forecasting accuracy and not in the models actual estimates, we will not examine if these hold.

(11)

3.4 Artificial Neural Networks

Some researchers argue that Artificial Neural Networks (ANN) are inspired by the human brain, where a set of neurons co-operate in handling and processing information. In an ANN, neurons or units as they are called are stimulated by taking some input, processing it and passing on a new bit of information to the next layer of units, which repeats the procedure. By increasing the number of layers, we get a multilevel system, that in the end will generate some output. In the simplest of neural network architectures, it can be viewed as a general regression problem, where the single unit takes on some input x = [1, x1, x2, ..., xn], process it and generate output ˆy as in Figure 1.

Figure 1: An Artificial Neural Network in the simplest of structures, where inputs x is multiplied by a weight w and processed by the single unit in order to produce an output.

This can be expressed as σ(w0+ w1x1+ ... + wnxn) = ˆy, where σ(·) is the activation function, hence the way units process inputs and wi are the weights we want to estimate. To account for a multilevel structure, with multiple units and hidden layers a general formula for unit i in layer l can be expressed as

h^(l)_i = σ_l w_0,i^(l)+ w_1,i^(l)x₁+ ... + w_n,i^(l)x_n, (12)

where σlis the activation function of the l:th layer. Note that x is only input for layer one (h = 1), for h > 1 previous layer of units is instead used. As every unit has its own n × 1 array of weights, every full layer has its own n × m matrix of weights and n × 1 array of bias terms:

W^(l)=

"w

(l)

11 · · · w_1m^(l) ... . .. ... w^(l)_n1 · · · wnm^(l)

#

, b^(l)=

" w

(l) 01

... w_0m^(l)

# .

Where W^(l) is a weight matrix and b^(l) is a bias term for layer l. These matrices can help describe a neural networks forward pass, hence they way the network process inputs x and generate outputs ˆy.

(12)

This can mathematically be described as:

h⁽¹⁾= σ1 W⁽¹⁾x+b⁽¹⁾) h⁽²⁾= σ2 W⁽²⁾h⁽¹⁾+b⁽²⁾)

...

h^(L−1)= σL−1 W^(L−1)h^(L−2)+b^(L−1)) ˆ

y = σL W^(L)h^(L−1)+b^(L))

(13)

Where σi, i = 1, ..., L is the layer specific activation function. Note that equation (13) allows for p × 1 array output which can be useful depending on problem, but for the purpose of this report it will be single float value. In supervised learning output ˆy will be mapped against the true target value y. Based on this, the Artificial Neural Network updates the way it process information by making adjustments to the weight values in order to better map outputs to the real values. How this procedure works, will be covered in section 3.4.2.

3.4.1 Activation Function

Activation functions are simply used to convert raw inputs into some desired format. Artificial Neural Networks use these, to propagate information from one layer’s units to the next. This not only provides a foundation for the multi-level system to capture nonlinearities, but also fulfills the purpose of controlling the distribution of internal values and outputs. There are numerous activation function designed for different problems, but only those relevant to this report will be covered here. The simplest of activation functions, is the linear which basically takes an input and outputs the same

g(x) = x. (14)

A more complex activation function is Sigmoid which squeeze the inputs into outputs in the range [0, 1].

The function is defined as

σ(x) = 1

1 + e^−x. (15)

When outputs in the range [−1, 1] are required, the Tanh function is applicable as it squeeze inputs into the desired range. The activation function is defined as

tanh(x) = 2

1 + e^−2x − 1. (16)

These function are not only important in how information processed between the network’s layers, but also how the Artificial Neural Network update its learning process to map the targets better.

3.4.2 Training a Neural Network

When training an Artificial Neural Network we search for the optimal set of internal weight values in order to map predictions from the feed forward pass (described in section 3.4) to the real values as close as possible.

The search for the optimal weights is done by a backpropagation algorithm, which base its search on a loss

(13)

function. So a neural network consists of a feed forward pass, which takes some input, passes it through the multilevel system and generates an output and a backwards pass which updates the systems weights in order to minimize the loss function, hence find the optimal weight set. A loss function thereby consists of inputs, system weights and real values. In case of using Mean Squared Error (MSE) as loss function, this can be described as

f_W(x) = ˆy, W^(l), b^(l)∈ W (17)

L(ˆy, y) = 1 p

p

X

i=1

y_i− ˆy_i²

. (18)

In order to find the optimal weight set, the backwards pass make use of some optimization technique based on the gradient of the loss function with respect to every weight in the network

∂L(ˆy, y)

∂W = ∇. (19)

The simplest technique is called gradient descent and the idea is to update the weights of the system with the opposite sign of the gradient multiplied with the learning rate γ. This can be describe mathematically as

W_new= W_old− γ∂L(ˆy, y)

∂W_old . (20)

Repeating the procedure in (20) for every input data point will lead us (in best case) towards the global minimum of the loss function and thereby the optimal set of weights. Though, this technique is very computational heavy as it requires the gradient to be calculated for every iteration. A solution to this is Stochastic Gradient Descent (SGD) which uniformly draws mini-batches from the input data, and base the calculation in (20) on the average loss from each batch. This reduce the computational burden in trade for slightly lower convergence rate per epoch.

A lot of research has been done in recent years regarding stochastic gradient based optimization. Dif- ferent Optimizers has been developed in order to tackle problems such as local minimums, learning rate and computational efficiency. AdaGrad (Duchi J, Hazan E, Singer Y. (2011)), RMSProp (Tieleman T, Hinton GE (2012)) and the combination of the two Adam (Kingma, Lei Ba (2015)) are examples of some frequently used optimizers which aims to tackle the problems for stochastic optimization in different ways and has shown on robust results.

3.4.3 Regularization

As sufficiently large Neural Networks are able to fit training data almost perfectly, networks are prone to overfitting. Regularization fills the purpose of combating overfitting by applying different techniques.

Dropout is a regularization technique which decrease the risk of the networks units co-adapting, hence creating complicated relationships, exclusively depending on each other. By adding a dropout rate 0 < p < 1 to each layer, only 1 − p percent of the layers units will be randomly sampled and included in that specific

(14)

forward and backwards pass. This decrease the convergent rate, but increase the networks ability to capture robust patterns in the data. Note that dropout is only applied when training the network. Early stopping is another useful technique to prevent overfitting. The idea is to stop training before the network start overfitting its weights to the sample. This can be done by stopping if training loss keeps on decreasing, but validation loss starts increasing. This indicates that the network is just adapting its weights to the training sample and it might be a good idea to stop training the model.

3.5 Feed-forward Neural Network (FNN)

A Feed-Forward Network (FNN) is the simplest of network architectures. The system operates with a forward and backwards pass exactly as described in section 3.4. The structure is presented in Figure 2 (no bias unit included).

Figure 2: Feed Forward Neural Network structure without bias unit. Describes the flow from inputs to output. Note that structure only serves the purpose to visualize a Neural Network and does not correspond to FNN later used.

As earlier explained, the inputs are passed through the system which generates some output given the internal weights and activation functions. The flow can be described as:

h⁽¹⁾= σ(W⁽¹⁾xt+ b⁽¹⁾) (21)

ˆ

y = (W⁽²⁾h⁽¹⁾+ b⁽²⁾) (22)

Where σ is the Sigmoid function and a linear activation is applied to the output layer. The Feed-Forward network processes inputs without respect to earlier information. This is why recurrent neural networks with memory functions in theory are better suited for problems including serial dependencies like realized volatility.

(15)

3.6 Recurrent Neural Network (RNN)

In contrast to standard Feed Forward networks, the class of Recurrent Neural Networks (RNN) considers previous information when predicting the current. A standard RNN cell simply combine the previous hidden state h_t−1 with current information x_t and pass a new set of information on to the next state. At state 0, there does not exist a hidden state yet, so a random vector is passed. A recurrent cell is presented in Figure 3.

Figure 3: Flow through a recurrent cell, where previous stored information from hidden state ht−1is combined with current xt. The yellow box, symbolise a hidden layer with tanh() as activation function. Image by Mingxian Lin is licensed under CC BY-SA 4.0.

The yellow box in Figure 3 symbolise a hidden layer with tanh as activation function. The flow trough the cell can mathematically be described as:

h_t= tanh(W_xx_t+ W_hh_t−1+ b_h) o_t= tanh(W_xx_t+ W_hh_t−1+ b_h)

(23)

To complete the Recurrent Neural Network, an output layer is combined with the recurrent cell. This can be described as following applying a linear activation function:

ˆ

y = (Wzot+ bz) (24)

By excluding the previous hidden state, this would be a standard Feed Forward Network. The Tanh activation function is standard for RNN’s as it it has solid properties counteracting the vanishing/exploding gradient problem, see: Hochreiter, J. (1991).

(16)

3.7 Long Short Term Memory (LSTM)

Long Short Term Memory (LSTM) Neural Networks where introduced by Hochreiter & Schmidhuber (1997) and are designed to efficiently deal with long term dependencies and the exploding/vanishing gradient problem of standard Recurrent Neural Networks. A LSTM cell does not only consider x_tand the previous hidden state ht−1, but also cell state ct−1. The information stored in cell state is updated at every forward pass by the forget gate and input gate. These gates decide what information to forget and add to the new cell state ct. Further, hidden state is updated by the output gate, in combination with updated the cell state. The flow is presented in Figure 4.

Figure 4: Flow through a LSTM cell, where previous stored information from hidden state h_t−1 and cell state c_t−1 are combined with current x_t in order to output current hidden state and cell state. The yellow boxes symbolise hidden layers with Sigomoid and Tanh activation functions. The plus sign symbolise element wise addition and the multiplication sign the Hadamard product. Image by Mingxian Lin is licensed under CC BY-SA 4.0

The plus sign in Figure 4 symbolise element wise addition and the multiplication sign the Hadamard product.

The yellow boxes are layers, with corresponding activation function Sigmoid and Tanh. The flow through the LSTM cell can be described mathematically as following where f_t is the forget gate, i_t the input gate and otthe output gate:

ft= σ(Wf xxt+ Wf hht−1+ bf) it= σ(Wixxt+ Wihht−1+ bi) o_t= σ(W_oxx_t+ W_ohh_t−1+ b_o)

˜

c_t= tanh(W_cxx_t+ W_chh_t−1+ b_c) c_t= f_t c_t−1+ i_t ˜c_t

ht= ot tanh(ct)

(25)

In order to complete the LSTM Neural Network, the cell is combined with an output layer:

ˆy = (Wzht+ bz). (26)

Note that a linear activation function is for simplicity used for the output layer.

(17)

3.8 Gated Recurrent Unit (GRU)

Gate Recurrent Unit cells by Cho et. al (2014) can be seen as a lighter version of LSTM that still address the problem of vanishing/exploding gradients of classical recurrent neural network. In contrast to the LSTM cell, a Gated Recurrent Unit does not include a cell state. It does though rely on an update gate, which works similar to a LSTM’s forget gate in the way that it determines what information to pass on. Further a GRU operates with a reset gate, which regulate how much past information to forget. The full architecture is presented in Figure 5.

Figure 5: Flow through a GRU cell, where previous stored information hidden state h_t−1is combined with current information x_tin order to output a current hidden state. Yellow boxes symbolise hidden layers with Sigmoid and Tanh activation functions. Plus sign symbolise element wise addition and the multiplication sign the Hadamard product. Image by Mingxian Lin is licensed under CC BY-SA 4.0.

In Figure 5 the plus sign symbolise element wise addition and the multiplication sign the Hadamard product.

The yellow boxes symbolise layers with corresponding sigmoid and tanh activation functions. The full flow through the GRU cell can be described mathematically as following, where ut() is the update gate and rt() the reset gate:

r_t= σ(W_rxx_t+ W_rhh_t−1+ b_r) u_t= σ(W_uxx_t+ W_uhh_t−1+ b_u)

h˜_t= tanh(W_hxx_t+ W_hh(r_t h_t−1) + b_h) ht= (ht−1 ut) + (1 − ut)˜ht

(27)

In order to complete the GRU Neural Network, the cell is combined with an output layer:

ˆy = (Wzht+ bz). (28)

Note that a linear activation function is for simplicity used for the output layer.

(18)

3.9 Diebold-Mariano Test

In order to determine differences in accuracy between competing forecasts, a test proposed by Diebold &

Mariano (1995) which relies on asymptotic properties of normality will be used. The one sided Diebold- Mariano (DM) test, state the null hypothesis forecast 1 has equal or better accuracy then forecast 2. To find the test statistic, we let {ˆyi,t}^T_t=1, i = 1, 2 be competing forecasting series and {ei,t}^T_t=1, i = 1, 2 be corresponding squared forecasting errors (squared difference between predicted values and actual). Then the loss differential series is defined as

dt= e1,t− e2,t (29)

Assuming covariance stationary and short memory for {dt}^T_t=1, we can rely on following asymptotic results to compute the Diebold-Mariano Test statistic

√

T ( ¯d − µ) → N

0, 2πfd(0)

, (30)

where

f_d(0) = 1 2π

∞

X

k=−∞

γ_d(k), (31)

is the spectral difference of the loss differential series

γ_d(k) = E(d_t− µ)(d_t−k− µ), (32)

where k is the forecasting horizon. relying on these results, the Diebold-Marianio test statistic is

SDM = d¯ q2π ˆf_d(0)

T

∼ N (0, 1), (33)

where ˆf_d(0) is a consistent estimator of f_d(0). Hence, the level a test rejects the null of forecast 1 having equal or better accuracy then forecast 2, for

S_DM > Z_1−a. (34)

(19)

4 Method

This section will describe the data processing, model setup and testing methodology applied to answer the thesis objective. The experiment setup is focused on making the comparison of models as fair as possible.

4.1 Data Processing

As earlier mentioned, the input data for all models are the variables from the HAR-RV-CJ framework described in section 2.1. In line with Anderson et. al (2007), no normalization or standardization will be done. Though, a min-max scalar will be applied in order to transform the data into unit format. This in line with recommendations for neural networks data preparation and will make loss measures for the three indexes comparable.

4.2 Model setup

The Deep Learning models are implemented using the Python package Keras, utilizing Tensorflow as backend.

How many hidden layers to make use of when setting up the network structure is often a point of discussion.

There are no clear guidelines and applying some type of ”trial and error” validation is often used to determine the number of layers and units. Making use of this approach would though not be optimal for comparison.

Therefore, all networks presented in this report will make use of one hidden layer, including ten units. This is based on Donaldson and Kamstra (1996) proofs that a network with one hidden layer works as an universal approximator. Hence, given that the layer includes a sufficient number of units the network can approximate a wide range of linear and non-linear relationships. All other settings are in line with description in Section 3.5, 3.6, 3.7 & 3.8 with stated functions for layers and a linear activation function for the output layer. A dropout rate of 0.2 is applied and the recurrent neural networks are given a look-back period of ten time steps. Further the Adam optimizer is used in combination with mean squared error (MSE) as loss function.

4.3 Forecasting

A rolling window approach is applied for forecasting. The models are fit on windows consisting of w = 66 and w = 253 tensors of input data at each time step t. Window sizes correspond to three months and one year of market data. One and five days ahead predictions are computed based on the fit at each time step.

Accordingly, the forecast series {ˆyt+k}^{T −k}_t=w+1 is returned based on predictions from models fit to the set of tensors Xt = {xt, xt−1, ..., xt−w}, with target y_t+k = {yt+k, yt+k−1, ..., yt−w+k} at each time step t. The following forecasting series will be computed and used in the results part:

• {ˆyt+1}^{T −1}_t=66+1, w = 66 and k = 1

• {ˆy_t+5}^{T −5}_t=66+1, w = 66 and k = 5

• {ˆyt+1}^{T −1}_t=253+1, w = 253 and k = 1

• {ˆyt+5}^{T −5}_t=253+1, w = 253 and k = 5

Hence, four forecast series are returned for each model (HAR-RV-CJ, FNN, RNN, LSTM, GRU) and underlying realized volatility time series (S&P500, DAX30, N225). The deep learning models are trained until convergence of the loss function on each window.

(20)

4.4 Evaluation

For comparison of the forecasts the measures Root Mean Squared Error

RM SE = v u u t 1 N

N

X

i=1

ˆ y_i− y_i2

(35)

and Mean Absolute Error

M AE = 1 N

N

X

i=1

yˆi− yi

, (36)

are used. Generally, the RMSE penalize large forecasting errors to a greater degree then the MAE, since the RMSE is based on a weighted average where large errors get a relativiely larger weight. Hence, forecasting error outliers will affect the RMSE more then the MAE. As comparing loss functions for competing forecasts does not give strong evidence of a model outperforming the other, the one sided Diebold-Mariano Test described in section 3.9 is used as complementary results to provide statistical evidence. In order to fit the purpose of this report; to determine if Deep Learning outperforms Econometrics in forecasting of realized volatility, we state the null hypothesis: HAR-RV-CJ has equal or better forecasting accuracy then Deep Learning model x. Where model x corresponds to FNN, RNN, LSTM or GRU. We then reject the one sided level a test when

S_DM > Z_1−a. (37)

We assume that assumptions for the DM-test of covariance stationarity and short memory are fulfilled and that our samples are sufficiently large to rely on asymptotic properties. Note that the null hypothesis is window size, forecasting horizon and model specific. The loss measures in combination with a Diebold- Mariano Test gives a complete test procedure to examine the forecasting results. In order to evaluate the strength of results and match the purpose of this report, we define the following three possible outcomes of results:

• No evidence - Econometric model performs in line with Deep Learning models overall.

• Some evidence - Some Deep Learning model(s) consistently outperforms the Econometric and reject the null for all indexes, forecasting horizons and window sizes.

• Strong Evidence - All Deep Learning models consistently outperforms the Econometric and reject the null for all indexes, forecasting horizons and window sizes.

We reject the null on at least the 10 % level and outperform in terms of lower RMSE and MAE.

(21)

5 Results

This section will present the results that the conclusions of this report rely on. Index specific results in combination with a summary are presented according to measures and tests from section 4.0.

5.1 S&P500 results

Loss measures based on forecasts for S&P500 realized volatility are presented in Table 1. Models are trained/fitted on rolling window size of three months (w = 66) and one year (w = 253). One day ahead (k = 1) and five days ahead (k = 5) forecasts are presented. For extensive description of methodology see Section 4.3.

S&P500

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 RMSE

HAR-RV-CJ 0.0307 0.0274 0.0283 0.0266 FNN 0.0258 0.0248 0.0269 0.0265 RNN 0.0320 0.0282 0.0336 0.0300 LSTM 0.0234 0.0241 0.0256 0.0241 GRU 0.0234 0.0221 0.0281 0.0263 MAE

HAR-RV-CJ 0.0084 0.0075 0.0093 0.0089 FNN 0.0099 0.0093 0.0093 0.0097 RNN 0.0132 0.0104 0.0137 0.0112 LSTM 0.0078 0.0079 0.0090 0.0087 GRU 0.0075 0.0074 0.0091 0.0093

Table 1: The table presents the forecasting losses for given model, window size w and forecasting horizon k. The measures Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are presented. The lowest loss for specific window and horizon is highlighted.

From Table 1 we note that GRU and LSTM models generate best forecasting results with lowest losses for the different window sizes and forecasting horizons. The GRU perform best on one day ahead forecasting, while LSTM on five days ahead. Overall, it is clear that the models gain predictive power when fit on the larger windows of w = 253, compared to w = 66. Further, we note that one day ahead forecasts are generally more accurate then five days ahead, which is as expected. When comparing the Deep Learning models to the Econometric, the RNN is the only one outperformed by HAR-RV-CJ RMSE wise, but when comparing the MAE the Econometric performs better the FNN & RNN and in line with LSTM & GRU.

Hence, the HAR-RV-CJ seems less competitive RMSE wise, which suggest that the model might generate large forecasting errors occasionally (outliers), as RMSE is penalized to a larger to degree then MAE in such case. No Deep Learning model consistently outperforms the Econometric for for the different forecasting

(22)

horizons or window sizes.

In order to back up the results in Table 1 with statistical evidence, we make use of the one sided Diebold- Mariano Test with null hypothesis: HAR-RV-CJ has equal or better forecasting accuracy then model x.

Where x is FNN, RNN, LSTM & GRU. The test statistics are presented in Table 2.

Diebold-Mariano Test S&P500

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 FNN 1.3307^∗ 0.8619 1.0014 0.0368 RNN -0.4396 -0.2888 -2.7629 -2.0721 LSTM 2.0121^∗∗ 1.1451 1.8051^∗∗ 1.4432^∗ GRU 2.0146^∗∗ 1.7007^∗∗ 0.1568 0.3167

Table 2: The table presents the test statistics for the one sided Diebold-Mariano Test for a given model, window size w and forecast horizon k, with the null hypothesis: HAR-RV-CJ has equal or better forecasting accuracy then model x. ***,

** and * indicate that we reject the null on the significance level 1 %, 5 % and 10

%.

The null hypothesis is rejected for some window sizes and forecasting horizons. Though, no model consistently rejected the null. The best performers from Table 1; GRU and LSTM, also reject that HAR-RV-CJ has equal or better forecasting accuracy to some extent in Table 2. Combining results from Table 1 & 2, we do not find evidence that suggest any Deep Learning models superiority to the HAR-RV-CJ in forecasting of S&P500 realized volatility, as no model consistently rejects the null and outperforms over different window size and forecasting horizon.

(23)

5.2 DAX30 Results

Loss measures based on forecasts for DAX30 realized volatility are presented in Table 3. Models are trained/fitted on rolling window size of three months (w = 66) and one year (w = 253). One day ahead (k = 1) and five days ahead (k = 5) forecasts are presented. For extensive description of methodology see Section 4.3.

DAX30

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 RMSE

From Table 3, we note that the LSTM and GRU models performs best with lowest overall losses. In line with previous results, the predictive power increase with window size. When examining the RMSE for HAR-RV- CJ, FNN and RNN, we find that five day ahead forecasts are more accurate then one day ahead. Though, this is not the case for HAR-RV-CJ and FNN when checking the MAE, where the results are as expected.

These results may indicate a poor fit and forecasts being no better then random or large forecasting errors for k = 1 affecting the RMSE to a larger degree then MAE. When comparing the Deep Learning models to the Econometric, RNN is again the only model outperformed by HAR-RV-CJ RMSE wise, while comparing MAE the Econometric model outperforms FNN & RNN and in line with LSTM & GRU. Hence, results suggest that no Deep Learning model consistently outperforms the Econometric for for the different forecasting horizons or window sizes. For k = 1 and w = 253 the Econometric model generates the lowest MAE score.

The lack of competitiveness RMSE- but not MAE wise again suggests that the HAR-RV-CJ model might suffer from large forecasting errors occasionally.

(24)

In order to back up the results in Table 3 with statistical evidence, we make use of the one sided Diebold- Mariano Test with null hypothesis: HAR-RV-CJ has equal or better forecasting accuracy then model x.Where x is FNN, RNN, LSTM GRU. The test statistics are presented in Table 4.

Diebold-Mariano Test DAX30

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 FNN 1.6013^∗ 0.2069 1.6907^∗∗ 0.3174 RNN 0.1564 -2.3713 -0.7707 -2.7116 LSTM 1.7913^∗∗ 1.1929 2.2325^∗∗ 1.1358 GRU 1.7871^∗∗ 1.264 2.0433^∗∗ 1.9625^∗∗

%.

From Table 4 we note that the null hypothesis is mainly rejected for w = 66 and not consistently over forecasting horizon and window size for any model. GRU is closest to achieve consistent results, where only the one day ahead forecasts with w = 253 cannot reject the HAR-RV-CJ having equal or better forecasting accuracy. Combining results from Table 3 & 4, we do not find evidence that suggest any Deep Learning models superiority to the HAR-RV-CJ in forecasting of DAX30 realized volatility, as no model consistently rejects the null and outperforms over different window size and forecasting horizon.

(25)

5.3 N225 Results

Loss measures based on forecasts for N225 realized volatility are presented in Table 5. Models are trained/fitted on rolling window size of three months (w = 66) and one year (w = 253). One day ahead (k = 1) and five days ahead (k = 5) forecasts are presented. For extensive description of methodology see Section 4.3.

N225

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 RMSE

Examining the results in Table 5, we note that the LSTM model forecasts with best precision with overall lowest losses. The same pattern as in previous results is present with increasing predictive power for larger window size, but also one day ahead forecasts overall being more accurate then five days ahead for both RMSE and MAE. Comparing the Deep Learning models to the Econometric, we note that the RNN is only model outperformed by HAR-RV-CJ MAE wise, but when comparing RMSE the Econometric models forecasting accuracy better then RNN & FNN and line with LSTM and GRU. Hence, results suggest that no Deep Learning model consistently outperforms the Econometric for for the different forecasting horizon sor window sizes. For k = 5 and w = 253, the Econometric model generates the lowest MAE.

(26)

In order to back up the results in Table 5 with statistical evidence, we make use of the one sided Diebold- Mariano Test with null hypothesis: HAR-RV-CJ has equal or better forecasting accuracy then model x.Where x is FNN, RNN, LSTM GRU. The test statistics are presented in Table 6.

Diebold-Mariano Test N225

k = 1 k = 5

w = 66 w = 253 w = 66 w = 253 FNN 1.8859^∗∗ 0.1806 1.2856^∗ -0.604 RNN -0.7133 -4.7381 -1.7763 -5.3883 LSTM 1.7862^∗∗ 0.5881 1.723^∗∗ 1.7986^∗∗

GRU 1.44578^∗ 0.0184 1.2787 1.643^∗

%.

The null hypothesis is not rejected consistently over forecasting horizon and window size for any model.

LSTM is closest to achieve consistent results, where only the one day ahead forecasts with w = 253 cannot reject the HAR-RV-CJ having equal or better forecasting accuracy. Combining results from Table 5 & 6, we do not find evidence that suggest any Deep Learning models superiority to the HAR-RV-CJ in forecasting of N225 realized volatility, as no model consistently rejects the null and outperforms over different window size and forecasting horizon.

5.4 Summarising results

The average losses of the forecasting results for S&P500, DAX30 and N225 realized volatility are presented in Appendix B. When examining the results RMSE wise, LSTM performs best among the models when fit on the smaller window size, while GRU on larger. The MAE results however indicate that the HAR-RV- CJ model is superior for the larger window size, while GRU and LSTM on smaller. Overall, the results vary depending on loss measure, window size and forecasting horizon. The Long Short Term Memory and Gated Recurrent Unit networks consistently are among the best performers, while the Econometric model performance diverge depending on measure and performs relatively poor on the smaller window size. The Diebold-Mariano Tests confirms this, where the null hypothesis of the HAR-RV-CJ model having equal or better forecasting accuracy then model x, was rejected for a large majority of the tests on smaller sample sizes, but not for larger. When summarising the forecasting results of the three realized volatility series, we find some statistical evidence in all cases of Deep Learning models outperforming the Econometric when examined independently, for specific window size and forecasting horizon. However, we do not find any Deep Learning model which consistently outperforms the Econometric or rejects the null for neither S&P500, DAX30 or N225 realized volatility forecasts. Also taking the strong average MAE results for the HAR-RV- CJ into account it is hard to argue the superiority of Deep Learning to Econometrics in forecasting realized volatility given the experimental setup.

(27)

6 Conclusion

This report aims to investigate if the hyped field of Deep Learning can outperform traditional Economet- rics in forecasting of realized volatility. The Hetergenous Autoregressive model of Realized Volatility with multiple jump components (HAR-RV-CJ) was chosen as Econometric model due to historical research indicating its superiority within the Econometric family in forecasting of realized volatility. Feed Forward, Recurrent, LSTM and GRU neural networks where chosen to represent Deep Learning in the experimental setup. Realized measures of volatility for S&P500, DAX30 & N225 was collected and variables in line with the HAR-RV-CJ framework was defined and used as estimation data for all models. Applying a rolling window approach, one and five days ahead forecasts was computed and mapped versus true values. By comparing loss measures and backing up with statistical evidence from the one sided Diebold-Mariano Test we conclude that the results does not give strong enough evidence to state that Deep Learning outperforms Econometrics in forecasting of realized volatility given the experimental setup.

Even if some evidence of Deep Learning models outperforming the HAR-RV-CJ for specific forecasting horizon, rolling window size and index is present, the evidence is not consistent and varies depending on index. Further, interpretation of the summarising results showed on strong MAE performance on average by the Econometric model, indicating that the model posses predictive power in line with or better then the Deep Learning models. Worth noting is that the neural network where trained to minimize the Mean Squared Error (MSE), which may be a reason why they perform better competitive wise comparing RMSE to MAE. The inconsistency in results and strong performance on average by the Econometric model makes it infeasible to argue superiority of any Deep Learning model to the HAR-RV-CJ in the experimental setup.

Though, the Deep Learning models seems to be less sensitive to size of window, with similar forecasting results independently of size compared to the Econometric model where differences in results are notable.

The HAR-RV-CJ also seems to suffer from generating large forecasting errors occasionally, as RMSE penalize errors to a larger to degree then MAE and the model lack competitiveness RMSE, but not MAE wise in some cases. Large forecasting errors might occur for various reasons, like sudden volatility spikes fooling the model to believe more volatility is expected etc. The results thereby may indicate that the Deep Learning models are more robust forecasting when unexpected movements in realized volatility occur.

To summarise, the Deep Learning models seems less sensitive to sample size and more robust in forecasting generally then the HAR-RV-CJ. However, this cannot be generalized to the full Econometric Family as different models have different characteristics. Though, we can conclude that the experiment does not show evidence of the Deep Learning family outperforming Econometrics in forecasting of realized volatility, as neither of FNN, RNN, LSTM or GRU consistently reject the null in the Diebold Mariano test or outperforms the HAR-RV-CJ for different window size or forecasting horizon in forecasting of S&P500, DAX30 and N225 realized volatility.

(28)

7 Discussion

The conclusions of this report rely on an isolated experiment where we let the HAR-RV-CJ model set the

”playing rules” as the Deep Learning models was estimated based on variables from the HAR-RV-CJ framework. Further, no modification of original structures was done, one hidden layer with ten units was chosen in line with Donaldson and Kamstra (1996) and no optimization of hyperparameters was performed. Hence, the Deep Learning models was estimated with settings far from optimal, which was not the case for the HAR-RV-CJ where only modifications of the estimation data could increase the complexity and improve the accuracy of the model. This was also the point as it would make evidence of superiority stronger if the Deep Learning models would have been more accurate overall. In this case, they were not but in another experimental setup where Deep Learning models are better adjusted to suit the problem it is likely that they might be, as in Liu, Pantelous Mettenheim (2018) and Bucci (2019).

Deep Learning models are dependent of using the right activation functions, number of layers, units, loss and optimization for the specific underlying problem. So finding the right set of parameters is itself an optimization problem. In this report, it is clear that the Fully Connected Recurrent Neural Network (RNN) suffers from miss-specification. The RNN performs worse then the Feed Forward Neural Network (FNN) overall, which theoretically should not be be the case due to serial dependencies in realized volatility. This could possibly be due to the vanishing/exploding gradient problem but more likely due to the Tanh activation function applied to the hidden layer. as Tanh squeeze values into the range [−1, 1] and the RNN was the only model occasionally predicting negative values (Realized volatility > 0), the RNN’s predictive power would probably improve using Sigmoid with range [0, 1]. This adjustment was not done due to comparison reasons, as it would be unfair customizing one model to the underlying problem. But it shows that there is a lot of room for improvement for the Deep Learning models which might not be a case for the HAR-RV-CJ.

Why the Econometric model struggles competing RMSE wise but not when comparing MAE is startling. As Ordinary Least Square (OLS) is used for estimation, which aims to minimize the sum of squared residuals, the model is estimated minimizing the Mean Squared Error such as the neural networks. Theoretically, the Econometric model should therefore be competitive RMSE wise. The results are hard to explain, but might be due to the neural networks minimizing the MSE more efficiently in a non-linear manner to a larger cost of MAE then the linear model. Of course, minimising the MSE decrease the MAE but the relationship is not perfectly correlated or even linear. So for future studies to avoid similar results one might consider evaluation measures that are not influenced by the models estimation process, in order to find the most robust model comparative wise and avert possible biases when interpreting results. In order to find stronger results then this study, future studies should also include additional Econometric models, test larger window sizes and perhaps also include Convolutional Neural Networks (CNN) to the comparison.

(29)

References

Andersson, T.G. Bolerslev, T. (1998). ”Answering the skeptics: yes, standard volatility models do provide accurate forecasts*”. International Economic Review, Vol. 39, No. 4.

Andersson, T.G., Bolerslev, T. Diebold, F.X, (2007). ”Roughing it up: including jump component in the measurement, modeling and forecasting of return volatility”. Review of Economics and Statistics, Vol. 89 Issue 4, p.701-720.

Barndorff-Nielsen,O.E., & Shephard, N., (2003). ”Power and Bipower Variation with Stochastic Volatility and Jumps”. Journal of Financial Econometrics, Vol. 2, Issue 1,1–37.

Bucci, A. (2019). ”Realized Volatility Forecasting with Neural Networks”. Department of Eco- nomics and Social Sciences, Universit`a Politecnica delle Marche.

Corsi, F. (2004). ”A simple long memory model of realized volatility”. manuscript, University of Lugano.

Corsi, F. (2005). ”Measuring and Modelling Realized Volatility: from Tick-by-tick to Long Mem- ory”. PhD dissertation, University of lugano.

Corsi, F. (2009). ”A Simple Approximate Long-Memory Model of Realized Volatility”. Journal of Financial Econometric,Vol. 7, No. 2, 174–196

Cumby, R., S. Figlewski, J. Hasbrouck, (1993). ”Forecasting Volatility and Correlations with EGARCH Models”. Journal of Derivatives, 51-63.

Deng, L. Yu, D. (2014). Deep Learning: Methods and Applications. Now Publishers.

Diebold,F.X & Mariano,R.S. (1995). ”Comparing Predictive Accuracy”. Journal of Business and Economic Statistics, vol.13, 253-265.

Donaldson, G.R., Kamstra, M., 1996b. Forecast Combining with Neural Networks. Journal of Forecasting 15, 49–61

Duchi, J., Hazan, E. & Singer, Y. (2011). ”Adaptive subgradient methods for online learning and stochastic optimization”. The Journal of Machine Learning Research, 12, 2121–2159.

FIGLEWSKI, S., (1997). ”Forecasting Volatility”. Financial Markets, Institutions and Instruments vol 6, 1-88.

Granger, C. W. J. Joyeux, R. (1980). ”An Introducation to Long-memory time series models and fractional differencing”. Journal Of Time Series Analysis, Vol. 1, No.1, 15-29

(30)

Hajizadeh, E., Seifi, A., Turksen, M. H. F. B. (2012). ”A hybrid modeling approach for forecasting the volatility of SP 500 index return”. Expert Systems with Applications,vo. 39, 431–436.

Heber, Gerd, Asger Lunde, Shephard, N., Sheppard,K., (2009). ”Oxford-Man Institute’s realized library”, version 0.3, Oxford-Man Institute, University of Oxford.

Hochreiter, J. (1991). ”Untersuchungen zu dynamischen neuronalen Netzen”. Diploma thesis, In- stitut fur Informatik, Lehrstuhl Prof. Brauer, Technische Universitat Munchen.

Hochreiter, S. Schmidhuber, J. (1997). ”Long Short-Term Memory”. Neural Computation no.

8.

Lei Ba. J., & Kingma,D.P. (2015). ”ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION”.

Conference paper at ICLR 2015.

Liu, F., Pantelous,A.A., Von Mettenheim, H-J,. (2018). ”Forecasting and Trading High Frequency Volatility on Large Indices”. Quantitative Finance, Volume 18, Issue 5, pp. 737-748, 2018.

Liu,.L.Y., Patton, A.J., Sheppard, K., (2015). ”Does Anything Beat 5-Minute RV? A Compari- son of Realized Measures Across Multiple Asset Classes”. SSRN Electronic Journal, 187(1)

Ma Y., Li X., Zhao J. Luo D. (2012). Technology for Education and Learning. Springer, 427- 434

Ma Y., Li X., Zhao J. Luo D. (2012). Technology for Education and Learning. Springer, 427-434 (Using ARFIMA Model to Calculate and Forecast Realized Volatility of High Frequency Stock Market Index Data)

Schittenkopf, C., Dorffner, G. Dockner, E.J., (2000). ”Forecasting time-dependent conditional densities: a semi non-parametric neural network approach”. Journal of Forecasting, vol. 19, 355–374.

Tieleman, T. & Hinton G.E., (2012). Lecture 6.5 - RMSProp, COURSERA: Neural Networks. for Machine Learning. Technical report.

(31)

8 Appendix A

The full proof of finding and defining the jump component in section 2.1 equation (3) is in this section presented. We will closely follow Anderson et al. (2007). We let p(t) denote a logarithmic price at time t and express it in terms of a stochastic different equation

dp(t) = µ(t)dt + σ(t)dW (t) + k(t)dq(t), 0 ≤ t ≤ T (38)

Where µ(t) is a continuous variation process, σ(t) a strictly positive continuous volatility process with well defined limits, w(t) is a Brownian motion, q(t) a counting process and k(t) the size of discrete jumps in the logarithmic price process. The quadratic variation for the cumulative return process r(t) = p(t) − p(0) is then:

[r, r]t= Z t

0

σ²(s) ds + X

0<s≤t

k²(s). (39)

In absence of jumps, the quadratic variation is a consistent estimator of the Integrated Variance σ²(t) Andersson & Bolerslev (1998), Barndorff-Nielsen & Shephard (2003), Andersson et al. (2007). Applying this to Realized Volatility we first define the sampling of intraday returns as rt,δ = p(t) − p(t − δ). Then we have that

RVt+1(δ) =

1/δ

X

j=1

r²_t+j∗δ,δ (40)

Then from Anderson et al. The RV converge to the quadratic variation when the intraday sampling frequency goes to infinity and the BV to the integrated variance. Hence, when δ → ∞

RV_t+1(δ) → Z t+1

t

σ²(s) ds + X

t<s≤t+1

k²(s) (41)

BVt+1(δ) → Z t+1

t

σ²(s) (42)

Hence, we can find the discontinuous jump component by subtracting the RV with the BV:

RVt+1− BVt+1= Z t+1

t

σ²(s) ds + X

t<s≤t+1

k²(s) − Z t+1

t

σ²(s) = X

t<s≤t+1

k²(s) (43)

We then define the jump component in line with Anderson et al. (2007) by

Jt+1= max[RVt+1− BVt+1, 0] (44)

CAN DEEP LEARNING BEAT TRADITIONAL ECONOMETRICS IN FORECASTING OF REALIZED VOLATILITY?