
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Deep Neural Networks to Ensure the Quality of Calculated Yield Curves in Banking

ANNA EKLIND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Deep Neural Networks to Ensure the Quality of Calculated Yield Curves in Banking

ANNA EKLIND

Master in Computer Science
Date: July 2, 2020

Supervisor: Hamid Reza Faragardi
Examiner: Pawel Herman

School of Electrical Engineering and Computer Science
Host company: Svenska Handelsbanken

Swedish title: Djupa neurala nätverk för att säkerställa kvaliteten av beräknade avkastningskurvor inom bankväsendet


Abstract

Yield curves are of great importance within the financial sector and are, among other things, used as indicators of future economic growth. A curve that is upward sloping implies that investors expect positive economic growth, whereas a downward sloping curve is considered a warning of a forthcoming recession. It is critical that these curves are actual reflections of the market. Unexpected changes in some parts of the curves should only occur if there have been actual changes in the market; however, this is not always the case, and the curves are therefore continuously monitored and maintained. A potential solution to further ensure the quality of the curves is the application of deep neural networks.

The purpose of this study is to examine whether deep architectures are capable of predicting yield curves accurately. If this can be shown, the predictions can further be used to detect anomalies in yield curves estimated by the banks. Three models are compared in short-term and long-term predictions of yield curves: the Random Walk approach (RW), serving as the baseline and point of reference, a Long Short-Term Memory network (LSTM), and a Temporal Convolutional Network (TCN). The latter two have shown state-of-the-art results within time series forecasting and sequence modelling tasks and were therefore chosen for further investigation in this study.

According to the experiments of this study, the RW approach was most accurate in one-day-ahead predictions; however, the method was statistically outperformed by the deep architectures at longer forecast horizons. For instance, in the case of 120-days-ahead forecasts, the TCN showed an increase of 82% in performance (Root Mean Squared Error) in comparison with the RW approach, and the LSTM network an increase of 56%. It was concluded that the RW approach should be the default option in the case of one-day-ahead forecasts, but that deep architectures have great potential in providing further assurance of the quality of yield curves in the case of longer forecast horizons.


Sammanfattning

Yield curves are important within the financial sector and are used, among other things, as indicators of future economic growth. An upward-sloping curve suggests that investors expect positive economic growth, whereas a downward-sloping curve is regarded as a warning of a coming recession. It is important that these curves are actual reflections of the market. Sudden changes in some parts of the curves should only occur if there have been actual changes in the market, but this is not always the case; the curves are therefore continuously monitored and maintained. A potential solution to further ensure the quality of the curves is the use of deep neural networks.

The purpose of this study is to examine whether deep architectures are capable of predicting yield curves with good precision. If this can be shown, the predictions could further be used to detect anomalies in yield curves estimated by the banks. Three models are compared: the Random Walk approach (RW), which serves as the point of reference, a Long Short-Term Memory network (LSTM), and a Temporal Convolutional Network (TCN). The latter two have shown state-of-the-art results within time series forecasting and sequence modelling tasks and were therefore chosen for further application in this study.

According to the experiments of this study, the RW approach was the most accurate in generating one-day-ahead predictions, but was statistically outperformed by the deep architectures for longer forecast horizons. The temporal network showed an increase in prediction performance (Root Mean Squared Error) of 82% compared with the RW approach for 120-days-ahead predictions, and the LSTM network an increase of 56%. It was concluded that the RW approach should be the default option in the case of one-day-ahead predictions, but that deep architectures have great potential in providing further assurance of the quality of yield curves in the case of longer forecast horizons.


Contents

1 Introduction
  1.1 Objective and Research Questions
  1.2 Scope
  1.3 Outline
2 Background
  2.1 Bond
  2.2 Yield Curve
    2.2.1 Modeling the Yield Curve
  2.3 Time Series Forecasting
    2.3.1 Random Walk Forecast
  2.4 Artificial Neural Networks
    2.4.1 Learning Rate
    2.4.2 Regularization
  2.5 Recurrent Neural Networks
    2.5.1 Long Short-Term Memory Networks
  2.6 Convolutional Neural Networks
    2.6.1 Temporal Convolutional Networks
3 Related Work
  3.1 Modeling Yield Curves
  3.2 Deep Architectures for Financial Time Series Forecasting
  3.3 Temporal Convolutional Neural Networks for Time Series Forecasting
4 Methods
  4.1 Data
    4.1.1 Pre-processing
    4.1.2 Data Split
  4.2 Models
    4.2.1 LSTM
    4.2.2 TCN
  4.3 Hyperparameter Search
    4.3.1 LSTM
    4.3.2 TCN
  4.4 Evaluation
5 Results
  5.1 Hyperparameter Search
    5.1.1 LSTM
    5.1.2 TCN
  5.2 Forecasts
    5.2.1 Bank of England
    5.2.2 Svenska Handelsbanken
6 Discussion
  6.1 Hyperparameter Search
  6.2 Model Comparison
  6.3 Critical Evaluation
  6.4 Validity Discussion
    6.4.1 Construct Validity
    6.4.2 Internal Validity
    6.4.3 Conclusion Validity
  6.5 Ethics and Sustainability
  6.6 Societal Aspects
  6.7 Future Work
7 Conclusion
Bibliography
A Hyperparameter Search
  A.1 Hyperparameter Settings TCN
  A.2 Ablation Study TCN
  A.3 Hyperparameter Settings TCN 2.0
  A.4 LSTM Network with Three Hidden Layers
  A.5 Number of Epochs
    A.5.1 Bank of England
    A.5.2 Svenska Handelsbanken
B Forecasts
  B.1 H-Days-Ahead Predictions
    B.1.1 Bank of England
    B.1.2 Svenska Handelsbanken


Chapter 1 Introduction

A yield curve is a graph which visualizes the relationship between bond yields and maturities (borrowing periods). Yield curves are used to evaluate present or future cash flows over time and constitute indicators of future economic growth. They are of great importance within finance and it is critical that the calculations of these curves are actual reflections of the market [1].

A yield curve is constructed by the use of a mathematical model which consists of a number of parameters. Those parameters are estimated in such a way that the differences between the theoretical yields and the ones observed in the market are minimized. The curves are continuously updated as the market changes by calibrating the model’s parameters accordingly. Scripts are run frequently to ensure that the calculated curves are accurate reflections of the market and indicate if the theoretical values differ significantly from the ones observed.

A potential solution to further ensure the quality of the curves is the application of Deep Neural Networks (DNNs). If DNNs can be shown to generate accurate forecasts of yield curves by utilizing historic curve patterns, those forecasts could further be used to detect anomalies in the calculated curves.

Recurrent Neural Networks (RNNs) with Long Short-Term Memory units (LSTM) are a category of DNNs designed to capture temporal dynamics with long-term dependencies in sequences of data, making them appropriate for forecasting time series and hence for predicting yield curves [2].

Deep LSTMs have been used successfully for speech recognition tasks [3, 2], anomaly detection [4] and financial time series forecasting [5, 6]. However, Convolutional Neural Networks (CNNs) have started to outperform RNNs such as LSTMs on a range of tasks and data sets [7]. One architecture that has shown state-of-the-art results, and which practitioners within deep learning have started to use for sequence modeling, is the Temporal Convolutional Network (TCN) [7, 8, 9]. The architecture of TCNs entails longer memory in comparison with recurrent architectures and enables parallel computations which are not possible when using recurrent architectures [7].

TCNs have not been used within financial time series forecasting before, and the area is far less explored than that of LSTMs and their application areas. Previous results indicate that TCNs constitute a strong approach for sequence modeling tasks, both in terms of prediction performance and computational cost [7, 8, 9].

Furthermore, there is a general lack of research where deep architectures have been used to forecast yield curves, despite their great potential within the field. Only one previous work was found where a deep LSTM had been used to forecast yield curves [10]. More research where TCNs are shown to outperform LSTMs is needed if TCNs are to be considered the natural starting point for sequence modeling tasks. This study therefore investigates and compares the potential gain in applying a deep LSTM and a TCN, respectively, for forecasting yield curves.

1.1 Objective and Research Questions

The objective of this study is to determine whether deep neural networks can be used to ensure the quality of calculated yield curves and, furthermore, whether a TCN is a better approach than an LSTM network for this specific problem. The degree project aims to examine and answer the following research questions in order to fulfill the objectives.

RQ1. Are deep architectures, designed for time series forecasting, capable of outperforming the Random Walk approach in predicting yield curves?

RQ2. Is a Temporal Convolutional Neural Network more accurate than a deep LSTM network in predicting yield curves?

1.2 Scope

Yield curves are potentially very difficult to forecast in such a way that a profitable trading strategy can be based upon the forecasts; that is, however, not the scope of this study. The host company is interested in whether a curve used at a certain point in time is an accurate representation of the prices reflected in the market. In addition, this study solely examines the difference between the models in terms of prediction performance, not in terms of computational cost.

1.3 Outline

The thesis is structured as follows. Chapter 2 provides a theoretical background with relevant information needed to follow the remainder of this study. Chapter 3 accounts for related work, such as the usage of deep architectures within financial time series forecasting. The methods of this study, which concern the establishment and evaluation of the chosen models, are justified and described in Chapter 4. Chapter 5 presents the obtained results. Finally, Chapter 6 discusses the main findings of the study, answers the stated research questions, reflects upon the method choices, and discusses the validity of the study. Some ethical issues are considered and the chapter is concluded with suggestions for future work.


Chapter 2 Background

This chapter starts with a brief introduction of the financial terms used throughout the thesis, in which yield and interest rate are used interchangeably. An introduction to time series and time series forecasting is also provided, followed by a theoretical background of the models adopted in this thesis.

2.1 Bond

Governments and corporations that need to raise money without taking a loan from a bank can issue and sell bonds. A bond is, in short, a loan issued by any of the mentioned parties. Those who choose to invest in bonds get paid, commonly semiannually, in terms of interest until the maturity of the bond is reached. On the date of the final repayment, the maturity, the whole loan is usually repaid [1].

2.2 Yield Curve

The relationship between the yields and the different maturities of bonds with equal credit quality is visualized by a graph known as a yield curve. The curve can take on different shapes, three of which are illustrated in Figure 2.1.

When the long-term interest rates are higher than the short-term rates, i.e., the yield of an investment is higher in the long term since it entails greater risk than investing short term, the curve is upward sloping and is referred to as a normal yield curve [1]. Normal is the most common shape and is associated with positive economic growth [11].

When the long-term interest rates are lower than the short-term rates, the curve adopts an inverted shape. Interest rates have a tendency to fall in response to a slowdown in the economy, and an inverted yield curve is therefore often perceived as a leading indicator of negative economic growth.

A flat yield curve commonly occurs when there is a transition between the other two shapes and means that the long-term rates and the short-term rates are essentially the same [1].

Figure 2.1: The graph is not based on real data, but rather provides an illustration of some of the different shapes a yield curve can adopt. The maturity is shown in terms of years to maturity.

2.2.1 Modeling the Yield Curve

The Nelson-Siegel model and its extension, the Nelson-Siegel-Svensson model, are two models which are widely used by central banks to model yield curves [12]. The models are parametric, meaning that the information required to predict future values from the current value is contained in the models' parameters.

The Nelson-Siegel model was first introduced in 1987 by Nelson and Siegel [13] and was further developed by Svensson [14] in 1994. The former is shown in Equation 2.1 and is defined by four parameters: $\beta_0$, $\beta_1$, $\beta_2$ and $\tau$. The yield $y(m)$ of maturity $m$ is hence obtained by:

$$
y(m) = \beta_0 + \beta_1\left[\frac{1 - \exp(-m/\tau)}{m/\tau}\right] + \beta_2\left[\frac{1 - \exp(-m/\tau)}{m/\tau} - \exp(-m/\tau)\right] \tag{2.1}
$$

$\beta_0$ controls the level of the curve, $\beta_1$ its slope, and $\beta_2$ the curvature. The last term of the equation adds a hump to the curve, whose maximum is determined by $\tau$.


The model is simple and capable of capturing many of the shapes a yield curve adopts over time. The extension includes two more parameters that add an additional hump to the curve, hence making the model more flexible [15, 16].

The parameters can be estimated by applying least squares; the goal is to find the parameters which minimize the difference between the theoretical rates and the observed ones [15]:

$$
\min_{\text{parameters}} \sum (\hat{y} - y)^2 \tag{2.2}
$$

$\hat{y}$ denotes the rate given by the model and $y$ the rate observed in the market.
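The estimation described above can be made concrete with a short sketch that fits the four Nelson-Siegel parameters to observed yields using scipy's least-squares solver. This is a minimal illustration under assumed sample data; it is not the estimation procedure used by the host company, and the initial guess and bounds are arbitrary choices.

```python
import numpy as np
from scipy.optimize import least_squares

def nelson_siegel(m, beta0, beta1, beta2, tau):
    """Nelson-Siegel yield for maturity m (Equation 2.1)."""
    x = m / tau
    decay = (1 - np.exp(-x)) / x
    return beta0 + beta1 * decay + beta2 * (decay - np.exp(-x))

def fit_nelson_siegel(maturities, observed_yields):
    """Estimate (beta0, beta1, beta2, tau) by minimizing squared residuals (Equation 2.2)."""
    def residuals(params):
        b0, b1, b2, tau = params
        return nelson_siegel(maturities, b0, b1, b2, tau) - observed_yields

    # Rough initial guess: level = long rate, slope = short minus long, no curvature
    x0 = np.array([observed_yields[-1], observed_yields[0] - observed_yields[-1], 0.0, 1.0])
    result = least_squares(residuals, x0,
                           bounds=([-np.inf, -np.inf, -np.inf, 1e-6], np.inf))
    return result.x

# Hypothetical example: maturities in years and observed market yields (percent)
maturities = np.array([0.25, 0.5, 1, 2, 5, 10, 30])
observed = np.array([0.5, 0.6, 0.8, 1.1, 1.6, 2.0, 2.3])
print(fit_nelson_siegel(maturities, observed))
```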

2.3 Time Series Forecasting

A time series is a set of observations in which each observation is recorded at a specific point in time [17]; the number of passengers on a specific airline route each month or the number of vehicles sold every quarter are examples of such. A significant feature of most time series is that adjacent observations are dependent, a dependence which is of great importance in time series forecasting. A time series can be described by means of three main components: trend, seasonality, and cyclic behavior. Trend refers to when a time series exhibits a long-term increase or decrease, whereas seasonality expresses variations in a time series due to seasonal factors such as the time of the year or the week of the month. Seasonal patterns occur with specific regular intervals, while rises and falls of a time series with non-regular intervals indicate cyclic behavior [18].

Time series analysis is used to further understand the data at hand and concerns the quantification of essential features in data. The aim of time series analysis is, to a great extent, to analyze and explain the aforementioned dependencies between observations in the data. The information gained from conducting time series analysis can further be used to choose an appropriate model for time series forecasting [19, 20], which is the application of a model to forecast future values based on observations made in the past.

Autoregressive (AR) models such as the Autoregressive Integrated Moving Average (ARIMA) and Vector Autoregression (VAR) are traditional approaches for time series forecasting. AR models are linear, which means that they assume that the current value in a time series is a linear combination of the previous observations [18]. However, when a time series is non-linear, which is often the case in real-world applications, the effectiveness of AR models is limited since they cannot capture non-linear relationships in data. Artificial Neural Networks (ANNs), on the other hand, make no such assumptions of linearity and are capable of modeling and forecasting non-linear data, hence constituting a strong alternative and competitor to the traditional forecasting models [21, 22].

2.3.1 Random Walk Forecast

Simple forecasting methods are often used as benchmarks in the development of new forecasting methods. If a newly developed method performs worse than the simple alternative it can be concluded that the model is not worth considering further and hence should be abandoned.

The naive forecasting approach is an example of such a method. It predicts all future values to be the value of the last observation, see Equation 2.3. The approach is optimal when a time series follows a random walk and is therefore also referred to as the random walk forecast (RW). Data following a random walk is said to be unpredictable, and the best prediction that can be made for the future is therefore the latest observed value.

The RW approach is widely used for financial and economic data, since it is not uncommon that such data follows a random walk, and it provides a good point of reference when evaluating other methods for economic time series forecasting [18].

$$
\hat{y}_{T+h|T} = y_T \tag{2.3}
$$

$h$ denotes the forecast horizon and $y_T$ the last observation.

Equation 2.3 accounts for the RW approach in its simplest form; however, other versions exist as well, such as the RW with drift. The drift represents the average change from one period to another and adds another term to the equation of the naive method accordingly [23]:

$$
\hat{y}_{T+h|T} = y_T + h\,\frac{y_T - y_1}{T - 1} \tag{2.4}
$$
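A minimal NumPy sketch of Equations 2.3 and 2.4 is shown below; the example series is made up and only illustrates the mechanics.

```python
import numpy as np

def random_walk_forecast(history, horizon):
    """Naive forecast: repeat the last observation h steps ahead (Equation 2.3)."""
    return np.full(horizon, history[-1])

def random_walk_with_drift(history, horizon):
    """Naive forecast plus the average historical change per step (Equation 2.4)."""
    drift = (history[-1] - history[0]) / (len(history) - 1)
    steps = np.arange(1, horizon + 1)
    return history[-1] + steps * drift

series = np.array([1.2, 1.3, 1.25, 1.4, 1.5])
print(random_walk_forecast(series, 3))    # [1.5 1.5 1.5]
print(random_walk_with_drift(series, 3))  # adds a drift of 0.075 per step
```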

2.4 Artificial Neural Networks

An ANN is a computational model which is biologically inspired and an attempt to model the capability of the nervous system to process information [24, 25]. It can be described as a directed graph composed of computational units, nodes, which are connected through directed edges. These connections make it possible to transmit information to other nodes in the network, and each edge has a weight connected to it which describes its relative importance. In biological terms, the computational units represent neurons and the weights of the edges correspond to the strength of the synapses separating the neurons in the brain [26].

Figure 2.2 shows a model of a computational unit. It has three input connections, weights associated with each connection, and an output. The computational unit produces a single output by taking the dot product of the input and the corresponding weights. The dot product is then passed through an activation function f, which determines whether the computational unit gets activated or not, and the output is emitted from the unit.

Figure 2.2: The structure of an artificial neuron/computational unit.

The activation function introduces non-linearity to the network and enables the model to solve complex problems which require non-linear mappings between input features and output values. Sigmoid, ReLU and Tanh are instances of activation functions commonly encountered in practice [27], of which ReLU is the default recommendation [28]. An activation function takes a real-valued number as input, performs a mathematical operation on it, and squashes it into a predefined range. The range depends on the activation function: with Sigmoid the resulting output is between 0 and 1, and with Tanh between -1 and 1. ReLU returns 0 if the input is negative, otherwise it returns the input unchanged. For further details about the different activation functions and their respective advantages and disadvantages, see CS231n Convolutional Neural Networks for Visual Recognition [27].
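For illustration, the three activation functions mentioned above can be written in a few lines of NumPy; this is a generic sketch, not code from the thesis.

```python
import numpy as np

def sigmoid(x):
    # Squashes the input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes the input into the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Returns 0 for negative input, the input itself otherwise
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), tanh(x), relu(x))
```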

The way the edges connect to the nodes in a network divides ANNs into two categories, feed-forward networks and recurrent networks. Figure 2.3 shows a simple feed-forward network with two layers of hidden nodes, referred to as a Multi Layer Perceptron (MLP). The graph does not contain any cycles, i.e., the output of a node is not fed back into the same node but is fed forward through the network. The choice of network and category depends on the characteristics of the problem to be solved. Feed-forward networks are commonly used for classification and regression tasks.

Figure 2.3: An MLP with two hidden layers.

In order to solve a problem efficiently, the network has to learn the underlying patterns of the problem. The learning procedure involves updating the weights of the network repeatedly to present an accurate solution to the problem [29], a procedure commonly referred to as the training procedure. One of the most extensively used methods for this purpose is the backpropagation algorithm (BP). BP adjusts the weights of a network such that the error, quantified in a loss function, is minimized. The loss function is basically a measure of the difference between the predicted output of the network and the actual values. Mean Squared Error (MSE) is an example of such a measure widely used within the field:

$$
MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{2.5}
$$

$n$ denotes the number of predictions, $\hat{y}_i$ the predicted value, and $y_i$ the actual value for observation $i$.


BP adopts gradient descent, an iterative optimization algorithm which exploits the derivatives of a given function to decide in which direction the parameters of the function should be updated to reach the function's minima [30, 28].

Thus, data is fed into the network and propagated forward through the hidden layer(s) to the output layer. Computations are made at each layer and the predictions of the network, the class scores, are retrieved in the last layer. The predicted output is compared to the ground truth in terms of the loss function. The loss function is back-propagated through the network and the partial derivatives of the loss function with respect to each of the network's weights are calculated. The weights are then updated according to gradient descent, in the negative direction of the slope/gradient [26, 31].

2.4.1 Learning Rate

The learning rate is an important parameter included in the weight updates of a network. It determines how much, in addition to in which direction, the weights are updated during the training procedure. An appropriate learning rate is crucial for the performance of a network: setting it too low entails slow convergence of the loss function, while setting it too high might lead to minima being missed. Thus, finding the optimal learning rate is of great importance in training ANNs [25, 27].

RMSprop, Adagrad, and Adam are all extensions of gradient descent and are examples of optimization algorithms which are commonly used within deep learning. They are adaptive optimizers, meaning that the learning rate is dynamic and updated during training, and hence does not need to be manually tuned. In addition, the learning rate is adapted to each parameter of the network, which makes the methods suitable for sparse data [32]: infrequent features are given higher learning rates and frequent ones lower learning rates, allowing information about rare features to be captured [33].

2.4.2 Regularization

It is not uncommon within machine learning that models perform well on training data but yield bad results on unseen data. This is an undesirable feature that can be countered by using regularization. Regularization strategies are specifically designed to increase the generalization ability of a network and hence reduce the test error [28].


Dropout is an example of regularization and means that a number of non-output units are randomly dropped at each iteration of the training procedure, i.e., they are temporarily removed from the network along with their incoming and outgoing connections. The probability with which a unit is dropped is determined by the dropout rate, a predefined value between 0 and 1. Thus, a dropout rate set to 0.1 means that there is a 10% chance for each unit to be dropped during training [28, 34].

Another example of regularization is Early Stopping, in which the training procedure is halted if no further improvement on the validation set has been made in a predefined number of consecutive iterations. Whenever the error on the validation set improves during training, the parameters of the model are saved, so that when the training procedure is stopped, the best model can be returned rather than the most recent one [28].
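Both techniques are available directly in Keras, the framework later used in this thesis (Section 4.2). The snippet below is a generic sketch: the layer sizes and dropout rate are placeholder values, and the patience of 20 epochs merely mirrors the setting mentioned in Section 4.2.

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(40,)),
    layers.Dropout(0.1),   # each unit has a 10% chance of being dropped per update
    layers.Dense(40),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for 20 consecutive epochs
# and restore the weights from the best epoch seen so far.
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stopping])
```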

2.5 Recurrent Neural Networks

Recurrent neural networks have, in contrast to feed-forward networks, feedback connections, i.e., cycles in the computational graph. This architectural feature makes RNNs capable, in theory, of mapping the entire history of previous states to each output [26]. The internal memory of a recurrent architecture enables the network to capture dependencies between observations, which is crucial when modeling sequences or forecasting time series. Figure 2.4 shows a simple RNN and its appearance when it is unfolded. It has an input layer, a hidden layer, and an output layer. The result retrieved in the previous time step is fed back into the network and is used to produce the next output.

Figure 2.4: An RNN and its unfolded version.

Backpropagation Through Time (BPTT) is applied to update the weights of an RNN. However, the learning algorithm causes vanishing or, more rarely, exploding gradients when propagated over many stages/time steps [28]. The depth of an RNN is defined when it is unfolded and is equal to the length of the input sequence. Thus, when the gradient is back-propagated through the layers/time steps, it can vanish, resulting in extremely small weight updates. In the worst case, this can cause the network to stop learning so that no further improvements in performance are made. Due to this problem, RNNs have difficulties learning long-term dependencies.

2.5.1 Long Short-Term Memory Networks

The Long Short-Term Memory Network (LSTM) was introduced by Hochreiter and Schmidhuber [35] in 1997 and was developed to overcome the vanishing gradient problem. The recurrent nodes of the hidden layer in an RNN are replaced by memory blocks in an LSTM, see Figure 2.5. Each memory block has a memory cell $c$, a node which is self-connected, and three gates which control the information flow into and out of the cell: an input gate, an output gate, and a forget gate [28, 31].

The input gate protects the information stored in the cell from interference by irrelevant input and the output gate protects other nodes from interference by currently irrelevant information stored in the cell [26, 35]. Initially, the recurrent connection of the cell had its weight fixed at 1.0 to achieve constant error flow and thus avoiding the vanishing gradient problem.

However, in 2000 the forget gate was introduced by Gers et al. [36]. The forget gate resets the memory block of an LSTM by controlling the weight associated with the cell; the block is reset once its information is out of date. This functionality was developed to avoid a scenario where the values of the cells grow indefinitely and eventually cause the network to break down. The forget gate was proven effective and is considered standard today.


Figure 2.5: An overview of a memory block. The previous state $h_{t-1}$ is combined with the current input $x_t$ and is then passed through the different gates of the block.

2.6 Convolutional Neural Networks

A Convolutional Neural Network (CNN) is an ANN which is typically used for image processing, e.g., object detection or image classification. The architecture of a CNN includes three types of layers: convolutional layers, pooling layers, and fully connected layers. The network is usually built by alternating convolutional and pooling layers and contains one fully connected layer.

The convolutional layer makes use of filters to create feature maps that contain extracted features describing the input data. Filters are learnable weights shared within a layer and work as feature detectors. One filter is a detector of a specific feature, thus several filters are needed within a layer to detect more than one feature of the input data. The shape of a filter is defined in terms of width, height, and depth, whereby the depth of a filter has to match the depth of the input volume. The width and height of the filter, the filter size, cover an area of the input data which is referred to as the receptive field.

Feature maps are produced by sliding/convolving filters over the input. Figure 2.6 shows a filter of size 2x2 which is convolved over the input with a stride length of 1, i.e., the filter is moved by one pixel at a time. The corresponding feature map created by the operation is shown to the right in the figure.


Figure 2.6: The creation of a feature map.

Pooling layers are applied after the convolutional layers with the purpose of reducing the number of parameters in the network. Max pooling is a widely used method for this purpose: a 2x2 filter with stride length 2 is convolved over the input and the maximum value in each sub-region is kept. This procedure results in a reduced feature map [27], see Figure 2.7. Besides reducing the parameters of the network, the pooling operation also makes the network less prone to overfitting. The pooled feature maps preserve the relative positions of the features rather than their exact ones, thus if a feature is shifted in the input, or some other kind of distortion occurs, the feature map's output will remain largely the same [37, 38].

Figure 2.7: The result of max pooling.

The fully connected layer takes an input with the same size as the output of the previous layer and outputs the class scores. The network is trained with backpropagation and the weights (filters) are updated accordingly. The more convolutional layers that are included in the network, the more high-level/abstract features can be extracted from the input [37].


2.6.1 Temporal Convolutional Networks

Temporal Convolutional Networks (TCNs) are characterized by two main features: the convolutions of the network are causal, and the network can take an input sequence of any length and map it to an output sequence of the same length.

Causal convolutions ensure that the prediction made by the network at time step $t$ does not depend on any future time steps $t+1, \ldots, T$. However, when the network is supposed to capture long-term dependencies, i.e., exhibit a large receptive field, causal convolutions alone become infeasible: a very deep architecture with many layers or large filters is needed for this purpose, which entails a prohibitively heavy computational cost [7].

TCNs employ dilated convolutions to counter this problem. The filters are convolved over greater areas than their sizes by skipping input elements according to a dilation rate, which defines how many input elements are skipped between those the filter is applied to. Stacking dilated causal convolutional layers gives the network a large receptive field without growing too deep [39]. Figure 2.8 illustrates a network with causal dilated convolutions and a filter size of 2. The dilation rate is increased exponentially with the level of the network, i.e., the current depth $i$.

Figure 2.8: A dilated causal convolutional network with filter size 2. The dilation of each layer is given by $d = 2^i$, where $i$ denotes the current level of the network.

Since the filters are shared within the layers in TCNs, the convolutions can be done in parallel, enabling the network to process an input sequence as a whole, in contrast to RNNs where the input sequence is processed sequentially, one step at a time. The possibility to parallelize the computations is a great advantage of TCNs [7].
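Keras supports causal, dilated convolutions directly through its Conv1D layer, so a stack like the one in Figure 2.8 can be sketched as below. This is a simplified illustration without residual connections; the filter count, number of levels, and input shape are arbitrary assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dilated_causal_stack(seq_len, n_features, n_levels=4, filters=32, kernel_size=2):
    """Stack of causal convolutions with dilation 2**i at level i (cf. Figure 2.8)."""
    inputs = keras.Input(shape=(seq_len, n_features))
    x = inputs
    for i in range(n_levels):
        x = layers.Conv1D(
            filters=filters,
            kernel_size=kernel_size,
            padding="causal",        # output at time t depends only on inputs up to t
            dilation_rate=2 ** i,    # skip 2**i - 1 steps between filter taps
            activation="relu",
        )(x)
    outputs = layers.Dense(n_features)(x[:, -1, :])  # predict the next time step
    return keras.Model(inputs, outputs)

model = dilated_causal_stack(seq_len=20, n_features=40)
model.summary()
```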


Chapter 3 Related Work

The application of deep learning within finance has received a great amount of attention in recent years, particularly when it comes to stock price prediction, but almost no research has been published where deep architectures have been applied to model the yield curve, even though the area itself has been extensively researched. This chapter introduces research related to this study, grouped into three categories. Furthermore, the application of previous research in the work of this degree project is accounted for.

3.1 Modeling Yield Curves

The Nelson-Siegel approach for forecasting yield curves was further developed by Diebold and Li [40] in 2006 into a dynamic version. The model was named Dynamic Nelson-Siegel (DNS) and showed encouraging results in both short-term and long-term predictions in comparison with other approaches. One-year-ahead forecasts with the DNS approach were superior to assuming a random walk. However, in 2008, research conducted by Guidolin and Thornton [41] was published in the journal of the European Central Bank. The research showed that the use of a DNS model when modeling yield curves was not significantly better than assuming an RW, not even for long forecast horizons.

In this thesis the RW is adopted as the baseline model, using a naive forecasting approach. The model has proven difficult to outperform in several studies [42, 43, 44] and is widely used for financial and economic data as a benchmark when evaluating the performance of other methods [18].


3.2 Deep Architectures for Financial Time Series Forecasting

A comprehensive literature review of financial time series forecasting with deep learning was published in 2019 [45]. The aim of the paper was to provide a snapshot of the current progress within the field of deep architectures to forecast financial time series and to describe the characteristics of each architecture to facilitate the choice of models for researchers and practitioners. The results of the literature review showed that the most studied financial application was stock price prediction and the category of ANNs which dominated among the models was RNNs, with LSTMs as the preferred choice. No work concerning modeling yield curves was identified by the study.

However, an attempt at modeling yield curves with an LSTM network was made in 2019 by Christoph Gerhart et al. In their study [10], a deep LSTM network was developed to predict future yields with multiple yield curves as input. The implemented network consisted of two LSTM layers and was compared with the RW approach. The LSTM network outperformed the RW in most cases except in one-day-ahead forecasts. However, both methods presented very low RMSEs in both short-term and long-term predictions, i.e., they managed to produce extremely accurate forecasts, and the authors concluded that their method constituted a robust approach to forecasting yield curves.

Unlike the work of [10], where a multiple-curve approach was chosen in which three different curves were analysed simultaneously, this work adopts a single-curve approach where only one yield curve is considered at a time. However, while [10] studied one maturity at a time, this project analyses all maturities simultaneously with the aim of capturing possible dependencies among them. Problems where the value of a feature depends not only on its past values but also on the values of other features are referred to as multivariate problems. The dependency between the features may be of great importance and is therefore used to predict future values [17].

3.3 Temporal Convolutional Neural Networks for Time Series Forecasting

In 2018 the authors of [7] suggested that using recurrent architectures as the starting point when working with sequence modelling tasks should be reconsidered. They conducted a systematic evaluation and an extensive comparison of networks such as LSTMs and TCNs over a set of different tasks and datasets, tasks which had commonly been used to evaluate the performance of recurrent architectures. The TCN outperformed the recurrent architectures in the great majority of the tasks and was concluded to be a promising approach for sequence modeling tasks.

Another study of TCNs was made in 2019 in the context of forecasting short-term traffic flow, where a TCN was compared to other models, including LSTMs, on the same task [9]. The architecture of the network, through long memory and the capability of modeling long time scales up to entire sequences, made it possible to capture long-term patterns and achieve great prediction performance. The TCN demonstrated 95% accuracy in predicting the short-term traffic flow, an increase of 15% in comparison with the LSTM network, which achieved a forecasting accuracy rate of 80%.

The work of this degree project is the first to apply a TCN to financial time series forecasting, specifically in modeling yield curves. In contrast to [7] and [9], the problem of this degree project is of a multivariate nature.


Chapter 4 Methods

This chapter accounts for the methods applied in this thesis. The first section describes the data sets that were used and how they were pre-processed and split. Some details concerning the models are then provided, followed by a description of the two different hyperparameter optimization strategies that were considered. The chapter ends by describing the time-series-specific evaluation method that was chosen for the project and the statistical tests that were used to ensure that the obtained results were not just due to chance.

4.1 Data

The models were evaluated on two historical data sets of daily yield curve estimates. One of them was supplied by Svenska Handelsbanken and ranged from 2012-11-13 to 2020-02-11, resulting in 1765 yield curves/observations. Each curve consisted of 40 points, with interest rates given for maturities up to 10958 days, i.e., 30 years from the date of the given curve.

The other data set was public and obtained through Bank of England's website (https://www.bankofengland.co.uk/statistics/yield-curves). The data ranged from 2009-01-02 to 2015-12-31, resulting in 1830 observations. Each curve consisted of interest rates given across 60 maturities, specified in number of months to maturity.

The data provided by Svenska Handelsbanken is denoted SHB throughout the rest of the thesis, and the data obtained from Bank of England BOE.


4.1.1 Pre-processing

Initially, the data was normalized by re-scaling the values of the data sets into a range between 0 and 1. Normalization transforms all features of a data set to the same scale and results in a more stable training procedure and a better performing network [46]. MinMax normalization was used in this project, which utilizes the minimum and maximum values of the training set to scale the data. The scaler was fit on the training data and was then applied to the test data [47].
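One common way to perform this scaling is scikit-learn's MinMaxScaler, fitted on the training data only. The thesis does not state the exact implementation, so the snippet below is merely a plausible sketch with placeholder data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# train and test are arrays of shape (n_days, n_maturities); placeholder data here
train = np.random.rand(100, 40)
test = np.random.rand(25, 40)

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # min/max estimated from the training data only
test_scaled = scaler.transform(test)        # the same scaling applied to the test data

# Predictions can later be mapped back to the original scale:
# predictions = scaler.inverse_transform(predictions_scaled)
```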

When applying machine learning to time series forecasting, the data has to be turned into a supervised learning problem before the training procedure can be conducted [48]. The reconstruction of the data was done by iterating over all observations in the time series and setting the observation at time step $t+1$ as the target for the previous time steps $t-n, \ldots, t$. The number of previous time steps, $n$, used to predict the value of the next time step is commonly referred to as the window size or look-back period.

Thus, with a window size set to 3 time steps, the first three observations in the data sets got the fourth observation as target. The window was then moved one step further in the time series and the corresponding subset covered by the window got the fifth observation as target. This procedure was repeated until the end of the data sets was reached.
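A minimal sketch of this windowing step is shown below; the function name and the placeholder data are illustrative and not taken from the thesis code.

```python
import numpy as np

def make_windows(series, window_size):
    """Turn a (n_days, n_maturities) series into (input window, next-day target) pairs."""
    X, y = [], []
    for t in range(len(series) - window_size):
        X.append(series[t : t + window_size])  # observations t, ..., t + window_size - 1
        y.append(series[t + window_size])      # the following observation is the target
    return np.array(X), np.array(y)

curves = np.random.rand(1765, 40)   # placeholder standing in for daily yield curves
X, y = make_windows(curves, window_size=20)
print(X.shape, y.shape)             # (1745, 20, 40) (1745, 40)
```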

4.1.2 Data Split

The data was split into training and test sets with an 80/20 ratio. The training sets were further divided into training and validation sets, also with an 80/20 ratio. The division of data was done with respect to the temporal order in which the observations had been observed. The validation sets were used for the evaluation of the hyperparameter search. During the final evaluations, both the training and the validation sets were used, to avoid a time gap between the training and test data.

4.2 Models

The LSTM network and the TCN were constructed with Keras, using TensorFlow as backend. Adam [49] was used as optimizer since it is one of the most widely used optimization algorithms and was identified as the optimizer in several related works [7, 10, 50, 51, 52]. Dropout and Early Stopping were used as regularization techniques during the training procedure.


The patience for Early Stopping was set to 20 epochs and different dropout rates were examined during the hyperparameter search, see section 4.3.1 for further details.

A linear activation function was applied in the last layer of each network and Mean Squared Error (MSE) was used as loss function since it is the default loss function for time series forecasting problems [18].

4.2.1 LSTM

The inner states of the LSTM units in the network were reset after each batch of data had been processed, meaning that the internal states resulting from processing one batch of data were not provided as initial states for the next batch [53].

The LSTM network was built by stacking LSTM layers, and each layer was followed by a dropout layer. Architectures with one input layer, two or three LSTM layers, and a fully connected output layer were examined to find the optimal network. Additional parameters constituting the network are accounted for in section 4.3.1.
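A sketch of the two-layer variant in Keras might look as follows. The hidden sizes, dropout rates, and learning rate are placeholders (they happen to mirror the BOE settings later reported in Table 5.1); the values actually used were selected by the hyperparameter search in Section 4.3.1.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(window_size, n_maturities, units=(128, 256), dropout_rates=(0.05, 0.4)):
    """Stacked LSTM: each LSTM layer is followed by dropout, ending in a linear output layer."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(window_size, n_maturities)))
    model.add(layers.LSTM(units[0], return_sequences=True))  # pass the full sequence on
    model.add(layers.Dropout(dropout_rates[0]))
    model.add(layers.LSTM(units[1]))                          # last LSTM returns the final state only
    model.add(layers.Dropout(dropout_rates[1]))
    model.add(layers.Dense(n_maturities, activation="linear"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0005), loss="mse")
    return model

model = build_lstm(window_size=20, n_maturities=40)
```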

4.2.2 TCN

The TCN was implemented by adopting residual blocks instead of plain convolutional layers, as in the work of [7] where TCNs were shown to outperform RNNs. Residual blocks were first introduced in [54] to counter the degradation problem, in which error rates had been shown to grow with the number of stacked layers in a network, and to stabilize the training procedure of deep architectures.

Normally, an ANN learns the mapping $H$ between an input $x$ and an output $y$ according to $H(x) = y$. In residual learning, however, the difference between the mapping applied to the input and the input itself, $F(x) = H(x) - x$, is learned. This mapping is referred to as the residual mapping and, rearranged, gives $H(x) = F(x) + x$. The residual mapping has been shown to be easier to optimize than the original one and allows deeper architectures to achieve the performance they are capable of in theory.

Figure 4.1 illustrates a residual block and its components according to how it is adopted in [7], with the exception of the choice of normalization technique. The block has two dilated causal convolutional layers followed by normalization and non-linearity in terms of ReLU. Dropout is used as regularization. A 1x1 filter is convolved over the input $x$ to get the input and the resulting output $F(x)$ in the same dimensions, so that an element-wise addition of the two can be performed. The result $H(x)$ is then passed through ReLU.

Figure 4.1: The components of a residual block.

Batch Normalization (BN) [55] was used as normalization in the residual blocks instead of Weight Normalization (WN) [56] as in [7], since it was already available in Keras. BN serves the same main purpose as WN, namely to ease the optimization of deep architectures. When using BN, the training procedure is accelerated by normalizing the output of each neuron across the mini-batch before the non-linearity is applied, which allows for higher learning rates. WN, on the other hand, normalizes the weights of a layer. The usage of BN did not, however, favor the performance of the TCN and was later excluded.

The TCN was built by stacking residual blocks. The optimal number of residual blocks was examined in the hyperparameter search.
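One possible Keras rendering of a residual block following this description (without Batch Normalization, which was eventually excluded) is sketched below. It is an interpretation of Figure 4.1, not the exact implementation, and the filter count, kernel size, and dropout rate are placeholder values.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size, dilation_rate, dropout_rate):
    """Two dilated causal convolutions with ReLU and dropout, plus a 1x1-matched skip connection."""
    shortcut = x
    for _ in range(2):
        x = layers.Conv1D(filters, kernel_size, padding="causal",
                          dilation_rate=dilation_rate)(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(dropout_rate)(x)
    # Match the channel dimension of the input so the element-wise addition is valid
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, kernel_size=1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([x, shortcut]))

inputs = keras.Input(shape=(20, 40))   # window of 20 days, 40 maturities
x = residual_block(inputs, filters=48, kernel_size=8, dilation_rate=1, dropout_rate=0.05)
x = residual_block(x, filters=48, kernel_size=8, dilation_rate=2, dropout_rate=0.05)
outputs = layers.Dense(40)(x[:, -1, :])
model = keras.Model(inputs, outputs)
```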


4.3 Hyperparameter Search

Hyperparameter optimization was conducted for both the LSTM network and the TCN. The purpose was to find the parameter settings which minimized the loss function, i.e., yielded the lowest Root Mean Squared Error (RMSE), on the validation set. RMSE is further described in section 4.4.

Two strategies were considered for the hyperparameter optimization: Grid Search and Random Search. Grid Search is, according to [28], common practice when the number of hyperparameters to examine is three or fewer. A set of values is determined for each hyperparameter when conducting a grid search, and all possible combinations of these sets of values are then explored during the search to find the optimal parameter configuration. However, the computational cost of Grid Search grows exponentially with the number of hyperparameters to be examined and it was therefore deemed infeasible in this project.

The random approach was chosen instead, wherein random combinations of hyperparameters were evaluated, thus making the search less exhaustive. Random Search has been shown [57] to produce models which are equally good or even better than those obtained from grid search, within a fraction of the time.
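In code, a random search is essentially a loop that samples one value per hyperparameter from the listed options, trains a model, and keeps the best configuration. The sketch below is generic: the search space mirrors Table 4.1, and train_and_evaluate is an assumed stand-in for fitting a network and returning its validation RMSE.

```python
import random

search_space = {
    "hidden_nodes": [32, 64, 128, 256],
    "batch_size": [24, 32, 64, 128],
    "dropout_rate": [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
    "learning_rate": [0.0005, 0.001],
}

def sample_config(space):
    # Draw one value per hyperparameter independently at random
    return {name: random.choice(options) for name, options in space.items()}

def train_and_evaluate(config):
    # Stand-in for fitting a network and computing its validation RMSE;
    # replace with actual model training.
    return random.random()

best_rmse, best_config = float("inf"), None
for _ in range(100):                  # 100 iterations, as in Section 5.1
    config = sample_config(search_space)
    rmse = train_and_evaluate(config)
    if rmse < best_rmse:
        best_rmse, best_config = rmse, config

print(best_config, best_rmse)
```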

4.3.1 LSTM

The hyperparameters presented in Table 4.1 were listed as options for the LSTM network. Random Search was conducted for each window size. The number of hidden nodes and the dropout rate were both randomly chosen for each layer.

Table 4.1: Search space LSTM network.

Parameter Settings

Number of hidden layers [2, 3]

Number of hidden nodes [32, 64, 128, 256]

Batch size [24, 32, 64, 128]

Dropout rate [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

Initial learning rate [0.0005, 0.001]

4.3.2 TCN

The receptive field of a TCN is determined by the kernel size, the number of stacked residual blocks, and the last dilation in the dilation list according to

receptive field = kernel size · residual blocks · last dilation.

The receptive field should be at least as great as the maximum sequence length fed into the network, which corresponds to the number of past time steps that are used to predict the next one. Different parameter settings were therefore carefully considered for the kernel size, the number of stacked residual blocks, and the dilations. The different combinations are listed in Tables A.1-A.3 in Appendix A and were sampled from at random during the hyperparameter search. Besides those parameters, the parameters and settings presented in Table 4.2 were included in the search as well.

Table 4.2: Search space TCN.

Parameter Settings

Number of filters [32, 64, 128, 256]

Batch size [24, 32, 64, 128]

Dropout rate [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

Initial learning rate [0.0005, 0.001]

4.4 Evaluation

The models, with the optimal hyperparameter settings, were fit on the given training set and then evaluated on the test set to obtain out-of-sample forecasts indicating the ability of the models to forecast on new data.

Root Mean Squared Error (RMSE) was used as evaluation metric for the performance of the networks since it is one of the most commonly used performance measures in time series forecasting [18]:

$$
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \tag{4.1}
$$

$n$ denotes the number of observations, $y_i$ the actual observation for sample $i$, and $\hat{y}_i$ the corresponding prediction made by the network. Note that RMSE is a scale-dependent performance measure and hence cannot be used to compare performance across different data sets, only between models evaluated on the same data [18].

The models were evaluated according to the fixed-origin evaluation methodology, in which the origin denotes the time step $t$ from which all forecasts are generated (the last time step of the training set). By applying the fixed-origin evaluation, one forecast per observation (yield curve) in the test set is generated, each with origin $t$; the forecasts thus reach from $t+1$ to $t+h$, where $h$ denotes the last observation in the test set and corresponds to the longest forecast horizon.

The method entails some shortcomings: it produces only one forecast per forecast horizon and hence yields only one forecast error per horizon. In addition, the errors of the test set, i.e., of all forecast horizons, are often combined to obtain a measurement of the model's performance. However, the errors tend to grow as the forecast horizon increases, which means that errors with different variances are averaged, providing a potentially misleading performance measure [18, 58, 59].

To counter the problems of the fixed-origin approach, the models were fit and evaluated across five trials for each configuration to obtain averaged RMSEs. More trials could have been conducted to get even more robust performance measures; however, five trials were deemed feasible in this study with regard to the computational resources at hand and the time constraints. Furthermore, the evaluations and comparisons of the models were done for each forecast horizon separately, to provide fair performance measures.

To evaluate whether the difference in performance between the models was statistically significant, the Friedman test was chosen [60, 61]. The test can be described as a non-parametric counterpart of the one-way Analysis of Variance (ANOVA) and was chosen since ANOVA's assumption of homogeneity of variance (equal treatment variances) could not be made in this study [62].

The Friedman test utilizes ranks to determine whether there is any statistically significant difference between the means of three groups or more. Each group is ranked according to its performance over a number of test attempts. The ranks of the groups are summed group-wise and those sums are further used to obtain an F-value, i.e., the test statistic [60]. The null hypothesis states that all means of the examined population are equal and is rejected if the obtained F-value is larger than the critical value of F. The F-value was calculated according to the following equation:

$$
F = \frac{12}{Nk(k+1)}\sum R^2 - 3N(k+1) \tag{4.2}
$$

$N$ denotes the number of subjects, in this case the number of trials, $k$ the number of treatments, i.e., groups, and $R^2$ the squared rank sum of a given group.
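A direct implementation of Equation 4.2 is sketched below; it ranks the per-trial errors of each model and computes the test statistic. The sample RMSE values are made up and only illustrate the mechanics.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_statistic(errors):
    """errors: array of shape (n_trials, n_models); a lower error gives a better (lower) rank."""
    n_trials, k = errors.shape
    ranks = np.vstack([rankdata(row) for row in errors])  # rank the models within each trial
    rank_sums = ranks.sum(axis=0)                          # R for each model
    return 12.0 / (n_trials * k * (k + 1)) * np.sum(rank_sums ** 2) - 3.0 * n_trials * (k + 1)

# Hypothetical RMSEs for three models over five trials
rmse = np.array([
    [0.30, 0.20, 0.15],
    [0.31, 0.22, 0.16],
    [0.29, 0.21, 0.14],
    [0.32, 0.19, 0.17],
    [0.30, 0.23, 0.15],
])
print(friedman_statistic(rmse))  # compare against the critical value (6.4 in this study)
```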

If the test indicates statistical significance, a post-hoc test can be conducted to locate the origin of the significance. Thus, in case of a rejected null hypothesis, pairwise comparisons were executed by applying the Nemenyi post-hoc test to determine the best performing model, yielding a number of p-values [62].


Chapter 5 Results

This chapter first presents the results obtained by the hyperparameter search followed by the results of applying the optimal hyperparameter settings in terms of one-day-ahead and h-days-ahead predictions for each model and data set.

5.1 Hyperparameter Search

Random Search was conducted for each window size and data set and was run for 100 iterations. Five trials were carried out for each configuration and the results presented in this section are therefore averaged RMSEs. The training procedure lasted for 100 epochs except in the case of early stopping and the validation set was used for evaluation of the hyperparameter search.

Furthermore, significance testing was conducted for the LSTM network and the TCN, respectively, for both data sets, to examine whether different window sizes had any impact on model performance and to determine which window size to proceed with for each model and data set. The significance level, alpha, was set to 0.05 for all tests, and the critical value of F was obtained from a table of critical F-values for the Friedman test (https://psych.unl.edu/psycrs/handcomp/hcfried.PDF) and set to 6.4 accordingly.

5.1.1 LSTM

The optimal hyperparameter settings for the LSTM network, trained and evaluated on the BOE data set, are shown in Table 5.1 and were the same for all window sizes; the validation losses are presented in Table 5.2. The Friedman test resulted in an F-value of 2.6, thus the corresponding null hypothesis could not be rejected: no statistically significant difference in performance due to the window size could be identified in the case of the BOE data set. However, the smaller the window size, the more observations can be utilized for training and evaluation, and the smallest look-back period of 20 days was therefore chosen to proceed with.

Table 5.1: Optimal hyperparameter settings for the BOE data set, includes all window sizes.

Layers Hidden nodes Dropout rates Learning rate Batch size

2 [128, 256] [0.05, 0.4] 0.0005 32

Table 5.2: Validation losses, BOE data set.

Window size RMSE

20 0.029

50 0.036

100 0.032

In the case of the SHB data set, the optimal hyperparameter settings varied depending on the window size. Those settings are accounted for in Table 5.3 and the resulting validation losses are presented in Table 5.4. The Friedman test did not indicate any statistically significant difference between the different window sizes; an F-value of 0.4 was obtained. Thus, the smallest window size was chosen to proceed with for the SHB data set as well.

Table 5.3: Optimal hyperparameter settings, SHB data set.

Window size Layers Hidden nodes Dropout rates Learning rate Batch size

20 2 [256, 32] [0.3, 0.05] 0.001 128

50 2 [32, 128] [0.5, 0.3] 0.0005 32

100 2 [128, 256] [0.5, 0.05] 0.001 128

Table 5.4: Validation losses, SHB data set.

Window size RMSE

20 0.044

50 0.044

100 0.045


An LSTM network with three hidden layers was examined as well for both data sets; those results can be found in Appendix A.4. However, a network with two hidden layers yielded lower losses in both cases and was therefore chosen to continue with.

5.1.2 TCN

Table 5.5 presents the optimal hyperparameter settings that were found during Random Search for the TCN when using the BOE data set. The optimal hyperparameter settings varied for the different window sizes, but the same number of filters was shown to be optimal for all. The validation loss for each window size is shown in Table 5.6.

Table 5.5: Optimal hyperparameter settings, BOE data set.

Window size Filters Kernel size Residual blocks Dilations Dropout rate Learning rate Batch size

20 128 4 6 [1, 2, 4] 0.2 0.001 64

50 128 4 5 [1, 2, 4, 8, 16] 0.2 0.001 64

100 128 3 20 [1, 2, 4, 8] 0.5 0.0005 24

Table 5.6: Validation losses, BOE data set.

Window size RMSE

20 0.063

50 0.072

100 0.079

However, the settings presented above resulted in errors that grew large for the h-step-ahead forecasts. Thus, other hyperparameter settings were examined, using the settings of a similar time series forecasting problem as a starting point. The results of those experiments are presented in Appendix A, section A.2. The optimal hyperparameter settings found through the re-framed search for the BOE data set are presented in Table 5.7 and cut the validation losses by almost 50% for all window sizes, see Table 5.8.

Table 5.7: Optimal hyperparameter settings, BOE data set.

Filters Kernel size Residual blocks Dilations Dropout rate Learning rate Batch size

48 8 8 [2^i, i ∈ {0, ..., 7}] 0.05 0.001 256


Table 5.8: Validation losses, BOE data set.

Window size RMSE

20 0.032

50 0.044

100 0.036

The change with the greatest impact on the validation loss was the exclusion of Batch Normalization (BN). The difference in performance for each window size, with and without BN, is presented in Table 5.9 below.

Table 5.9: RMSE with and without the use of BN, BOE data set.

Window size BN W/O BN

20 0.096 0.044

50 0.085 0.049

100 0.088 0.049
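For illustration, a TCN residual block in line with the settings in Table 5.7 (48 filters, kernel size 8, dilation rates 2^i, dropout 0.05 and, as discussed above, no Batch Normalization) could be sketched in Keras as below. The activation function, the skip-connection details and the read-out of the last time step are assumptions and may differ from the implementation used here.

from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters=48, kernel_size=8, dilation=1, dropout=0.05):
    # Two dilated causal convolutions with dropout, plus a skip connection;
    # no Batch Normalization, in line with the results in Table 5.9.
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation, activation="relu")(x)
    y = layers.SpatialDropout1D(dropout)(y)
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation, activation="relu")(y)
    y = layers.SpatialDropout1D(dropout)(y)
    if x.shape[-1] != filters:
        x = layers.Conv1D(filters, 1)(x)  # 1x1 convolution to match channels
    return layers.Add()([x, y])

def build_tcn(window_size=20, n_maturities=60, n_blocks=8):
    inputs = keras.Input(shape=(window_size, n_maturities))
    x = inputs
    for i in range(n_blocks):             # dilations 2^0, 2^1, ..., 2^(n_blocks - 1)
        x = residual_block(x, dilation=2 ** i)
    x = layers.Lambda(lambda t: t[:, -1, :])(x)  # keep the last time step
    outputs = layers.Dense(n_maturities)(x)      # predict the full curve
    return keras.Model(inputs, outputs)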

Thus, when the hyperparameter search was conducted with the SHB data, the optimal hyperparameter settings found with the BOE data were used as a basis when defining the parameters to choose from, see section A.3 of Appendix A for further details. The same trend could be observed for the SHB data: the exclusion of BN cut the validation losses by almost 50%. Table 5.10 presents the optimal hyperparameter settings found for each window size and Table 5.11 the corresponding validation losses.

Table 5.10: Optimal hyperparameter settings, SHB data set.

Window size Filters Kernel size Residual blocks Dilations Dropout rate Learning rate Batch size

20 24 16 9 [2^i, i ∈ {0, ..., 8}] 0.3 0.002 128

50 48 16 8 [2^i, i ∈ {0, ..., 7}] 0.2 0.002 256

100 48 14 8 [2^i, i ∈ {0, ..., 7}] 0.2 0.002 64

Table 5.11: Validation losses, SHB data set.

Window size RMSE

20 0.047

50 0.046

100 0.047

The Friedman test resulted in an F-value of 2.8 for the BOE data set and 1.6 for the SHB data set, neither of which exceeded the critical value of 6.4. Thus, a window size of 20 days was chosen to proceed with for both data sets, as for the LSTM network.


5.2 Forecasts

The LSTM network and the TCN were re-trained on the complete training sets and evaluated on the test sets by conducting 1-, 20-, 60-, 120- and 240-days-ahead predictions, respectively. The results for each data set are presented in the following subsections. The Friedman test was again applied to examine whether the differences in performance between the models were of any statistical significance. The significance level was set to 0.05 for all tests and a critical F-value of 6.4 was used.

The LSTM network and the TCN were fitted for 600 epochs for both data sets. Initially, the optimal number of epochs was examined for each model and data set; those results are presented in section A.5 of Appendix A. The lowest losses were achieved with 600 epochs of training, which was therefore chosen. No training was needed for the baseline due to the naive forecasting approach, but the model was evaluated on the same test sets. Five trials were carried out for each model and data set, so the RMSEs presented in this section are averaged over five runs.
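The naive baseline and the evaluation metric can be summarized by the sketch below, which assumes the test-set curves are stored as a NumPy array of shape (n_days, n_maturities) in time order; the array layout and function names are illustrative.

import numpy as np

def random_walk_forecast(curves, horizon):
    # The h-days-ahead forecast of a curve is simply the last observed curve,
    # i.e. the forecast for day t + horizon is the curve observed on day t.
    return curves[:-horizon]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def evaluate_baseline(curves, horizons=(1, 20, 60, 120, 240)):
    # RMSE of the random-walk forecast for each forecast horizon.
    return {h: rmse(curves[h:], random_walk_forecast(curves, h)) for h in horizons}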

5.2.1 Bank of England

Table 5.12 shows the resulting F-values for all forecast horizons in the case of the BOE data set; all F-values were greater than the critical value of F. Thus, the null hypothesis was rejected for each forecast horizon: a statistically significant difference between the means of the models in terms of performance was identified for all forecast horizons.

Table 5.12: Obtained F-values in comparison of models.

1 day 20 days 60 days 120 days 240 days

8.4 8.4 7.6 8.4 7.6

Pairwise comparisons were then executed by applying the Nemenyi post-hoc test to examine the origins of the differences. Statistically significant differences could be identified between the baseline and the LSTM network in the case of 1-, 20- and 60-days-ahead forecasts. In terms of the longer forecast horizons, statistically significant differences were indicated between the baseline and the TCN, see Table 5.13. However, no statistically significant difference could be noted between the LSTM network and the TCN for any forecast horizon.


Table 5.13: p-values for each pair of models.

h-days-ahead [RW, LSTM] [RW, TCN] [LSTM, TCN]

1 0.012 0.601 0.139

20 0.012 0.139 0.601

60 0.031 0.069 0.900

120 0.139 0.012 0.601

240 0.069 0.031 0.900
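As an illustration of how pairwise p-values of this kind can be obtained, the sketch below uses the scikit-posthocs package; the thesis does not state which implementation produced the values in Table 5.13, so this choice is an assumption.

import numpy as np
import scikit_posthocs as sp

def nemenyi_pairwise(rw_rmse, lstm_rmse, tcn_rmse):
    # Each argument holds the five test-set RMSEs of one model for a given
    # forecast horizon; rows are trials (blocks), columns are the models.
    data = np.column_stack([rw_rmse, lstm_rmse, tcn_rmse])
    return sp.posthoc_nemenyi_friedman(data)  # DataFrame of pairwise p-values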

Table 5.14 presents the prediction performances of the models. Considering the p-values reported above, the baseline significantly outperformed the LSTM network in one-day-ahead predictions, while the LSTM network outperformed the baseline in 20- and 60-days-ahead predictions. Finally, the TCN beat the baseline in the longest forecast horizons.

Table 5.14: RMSE for each model and forecast horizon.

h-days-ahead RW LSTM TCN

1 0.018 0.036 0.024

20 0.059 0.020 0.030

60 0.179 0.027 0.027

120 0.492 0.057 0.043

240 0.203 0.031 0.027

Figures 5.1 to 5.3 indicate how the models managed to predict the maturity of one month across all observations in the BOE test set. There are 59 more such figures, since one observation/yield curve is constituted by the yield of a bond given across 60 maturities. The curves illustrated in Figures 5.4 to 5.6 are the predictions on the first observation in the test set, whereas Figures 5.7 to 5.9 show the predictions on the 60th observation in the test set.


Figure 5.1: RW forecast on the test set, time to maturity = 1 month.

Figure 5.2: LSTM forecast on the test set, time to maturity = 1 month.


Figure 5.3: TCN forecast on the test set, time to maturity = 1 month.

Figures 5.4 to 5.6 illustrate the one-day-ahead forecasts of the models on the BOE data set. The time to maturity is given in months and extends to 60 months, i.e., 5 years from the specified date. From the figures, it is difficult to distinguish the quality of the forecasts of the baseline from those of the LSTM network. The TCN follows the actual curve very closely up to a maturity of 30 months, after which it starts oscillating to a greater extent.

Figure 5.4: RW forecast one-day-ahead, date = 2014-08-14.


Figure 5.5: LSTM forecast one-day-ahead, date = 2014-08-14.

Figure 5.6: TCN forecast one-day-ahead, date = 2014-08-14.

Figures 5.7 to 5.9 show the 60-days-ahead forecasts of the models. The baseline overestimates the interest rate from a maturity of 5 months onward. The LSTM network and the TCN, respectively, managed to model the curve quite closely; however, the prediction of the TCN oscillates slightly, as in the one-day-ahead forecast. The curves for the other forecast horizons are found in section B.1.1 of Appendix B.


Figure 5.7: RW forecast 60-days-ahead, date = 2014-11-06.

Figure 5.8: LSTM forecast 60-days-ahead, date = 2014-11-06.


Figure 5.9: TCN forecast 60-days-ahead, date = 2014-11-06.

To summarize: considering Figures 5.5, 5.6, 5.8 and 5.9, the LSTM network generated curves that were smoother than those of the TCN, which oscillated slightly more. This despite the fact that the TCN seemed to capture the increasing trend better than the LSTM network for a maturity of 1 month, see Figures 5.2 and 5.3.

5.2.2 Svenska Handelsbanken

The same experiments were conducted for the SHB data set. A statistically significant difference between the models could be noted in the case of all forecast horizons, see Table 5.15.

Table 5.15: Obtained F-values in comparison of models.

1 day 20 days 60 days 120 days 240 days

10 10 10 10 7.6

The results of the post-hoc tests are presented in Table 5.16 and show that there was a statistically significant difference between the baseline and the LSTM network in the case of the one-day-ahead predictions, i.e., the baseline outperformed the LSTM network in this task, see Table 5.17. In terms of 20- and 60-days-ahead predictions, the TCN significantly outperformed the LSTM network. In addition, the TCN statistically beat the baseline in 120- and 240-days-ahead predictions, respectively.

