
Master thesis, 30 credits

Department of Mathematics and Mathematical Statistics

IMPUTATION AND GENERATION OF MULTIDIMENSIONAL MARKET DATA

Master Thesis

Tobias Wall & Jacob Titus


Imputation and Generation of Multidimensional Market Data

Tobias Wall, towa0029@student.umu.se

Jacob Titus, jati0007@student.umu.se

Copyright © by Tobias Wall and Jacob Titus, 2021.

All rights reserved.

Supervisors: Jonas Nylén, Nasdaq Inc., and Armin Eftekhari, Umeå University

Examiner: Jianfeng Wang, Umeå University

Master of Science Thesis in Industrial Engineering and Management, 30 ECTS

Department of Mathematics and Mathematical Statistics

Umeå University

SE-901 87 Umeå, Sweden

Equal contribution. The order of the contributors' names was chosen by a bootstrapping procedure in which the names were drawn 100 times.


Abstract

Market risk is one of the most prevailing risks to which financial institutions are exposed. The most popular approach to quantifying market risk is Value at Risk. Organisations and regulators often require a long historical horizon of the affecting financial variables to estimate the risk exposures. A long horizon stresses the completeness of the available data, something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods will be measured with respect to both price and risk metric replication. Two different use cases are evaluated: missing values randomly placed in the time series, and consecutive missing values at the end-point of a time series. Five models are applied to each use case.

For the first use case, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from use case two is ambiguous. Still, we can conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the downstream risk metrics, implying that they failed to replicate the fat-tailed property of the price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall


Sammanfattning

Market risk is one of the most significant risks to which financial institutions are exposed. The most popular way to quantify market risk is through Value at Risk. Organisations and supervisory authorities often require a long historical horizon for the market variables concerned in these calculations. A long horizon increases the risk of incompleteness in the available data, something risk applications need to handle.

The goal of this thesis is to evaluate and propose methods to impute financial time series. The performance of the methods is measured with respect to both price and risk metric replication. Two different scenarios are evaluated: values randomly missing in the time series, and consecutive missing values at the end of a time series. In total, five models are applied to each scenario.

In the first scenario, the results show that all models perform better than the naive approach. The Lasso model lowered the price replication error by 35% compared to the naive model. The result from the second scenario is ambiguous. We can still conclude that all models performed better than the naive model with respect to risk metric replication. In general, all models systematically underestimated the risk metrics, which suggests that they failed to replicate the fat-tailed property of the distribution of price movements.

Keywords: Time Series Imputation, Financial Time Series, Machine Learning, Deep Learning, Value at Risk, Expected Shortfall


Acknowledgement

We would like to extend our gratitude to Jonas Nylén, Anders Stäring, Markus Nyberg, and Oskar Janson at Nasdaq Inc., who have given us the opportunity to do this thesis work as well as provided supervision and support throughout the entire project.

We would also like to thank our supervisor at the Department of Mathematics and Mathematical Statistics, Assistant Professor Armin Eftekhari, for guidance and valuable advice during the project.

Finally, we would like to thank our families and friends for their support and words of encouragement throughout our time at Umeå University, which comes to an end with the completion of this thesis.

Tobias Wall and Jacob Titus

Umeå, May 26, 2021


Contents

1 Introduction
  1.1 Problem Definition
  1.2 Dataset

2 Background
  2.1 Market Risk
    2.1.1 Value at Risk
    2.1.2 Expected Shortfall
  2.2 Financial Variables
    2.2.1 Futures
    2.2.2 Discount Rates
    2.2.3 Foreign Exchange Rates
    2.2.4 Options
    2.2.5 Volatility
  2.3 Financial Time Series
    2.3.1 Stylised Facts
  2.4 Related Work
    2.4.1 Autoregressive Models
    2.4.2 State-Space Models
    2.4.3 Expectation Maximisation
    2.4.4 Key Points

3 Theory
  3.1 Nearest Neighbour Imputation
  3.2 Linear Interpolation
  3.3 Lasso
  3.4 Random Forest
  3.5 Bayesian Inference
    3.5.1 Bayes' Rule
    3.5.2 Multivariate Normal Distribution
    3.5.3 Conditional Distribution
    3.5.4 Bayesian Linear Regression
    3.5.5 Feature Space Projection
    3.5.6 The Kernel Trick
  3.6 Gaussian Processes
    3.6.1 Choice of Covariance Function
    3.6.2 Optimising the Hyperparameters
  3.7 Artificial Neural Networks
    3.7.1 Multilayer Perceptron
    3.7.2 Training Neural Networks
  3.8 Recurrent Neural Networks
    3.8.1 Long-Short Term Memory
  3.9 Convolutional Neural Networks
  3.10 WaveNet
  3.11 Batch Normalisation

4 Method
  4.1 Notation
  4.2 Problem Framing
    4.2.1 Use Case One
    4.2.2 Use Case Two
  4.3 Dataset
  4.4 Data Preparation
    4.4.1 Handling of Missing Values
    4.4.2 Converting to Prices
    4.4.3 Training and Test Split
    4.4.4 Sliding Windows and Forward Validation
  4.5 Data Post-Processing
  4.6 Experiment Design
  4.7 Evaluation
    4.7.1 Mean Absolute Scaled Error
    4.7.2 Relative Deviation of VaR
    4.7.3 Relative Deviation of ES
  4.8 Models
    4.8.1 Nearest Neighbour Imputation
    4.8.2 Linear Interpolation
    4.8.3 Lasso
    4.8.4 Random Forest
    4.8.5 Gaussian Process
    4.8.6 Multilayer Perceptron
    4.8.7 WaveNet
    4.8.8 SeriesNet

5 Results
  5.1 Use Case One
  5.2 Use Case Two

6 Discussion and Reflection
  6.1 Risk Underestimation
  6.2 Time Component
  6.3 Fallback Logic
  6.4 Error Measures
  6.5 Complexity
  6.6 Use Case Framing
  6.7 Excluded Models
  6.8 Improvements and Extensions

7 Conclusion

Appendices
  Appendix A Removed Holidays
  Appendix B Dataset
  Appendix C Stylised Facts
  Appendix D Example of a WaveNet-architecture
  Appendix E Explanatory Data Analysis
  Appendix F Asset Class Results Use Case One
  Appendix G Asset Class Results Use Case Two
  Appendix H Example of Imputation


Chapter 1

Introduction

Market risk is one of the most prevailing risks to which financial institutions are exposed. It refers to the potential losses that investments incur due to uncertainties in market variables [24]. Risk management is all about identifying, quantifying, and analysing these risks to decide how market risk exposures should be avoided, accepted, or hedged. The most common approach to quantifying market risk is to look at how the affecting market variables, e.g. prices, have moved historically and use that knowledge to infer how large losses could become in the future.

Value at Risk, henceforth VaR, is one of the most widely used market risk metrics. There are several different ways to calculate VaR, but we will focus on a non-parametric approach using historical simulations from observed market data.

VaR aims to make the following statement about an investment: "We are X percent certain that we will not lose more than V dollars in time T". Suppose we would like to calculate the 1-day 99% VaR of a USD 1 000 000 investment in the American stock index S&P500¹, using seven years of historical prices from 2014 to the end of 2020. Then, start by computing the daily price returns over the given period, find the return at the 1st percentile, and multiply that return by the current value of the investment. This yields a 1-day 99% VaR of USD 32 677. But what if the price series were incomplete over the specific period, with missing price data on several days?

Assume our dataset lacked the desired long-term data and only five years of history were available, hence from the beginning of 2016, as illustrated by the dashed line in Figure 1.1. Calculating the 1-day 99% VaR from 2016 onwards results in a value of USD 35 675, which is 9.17% higher than for the complete dataset. This discrepancy is intuitive when analysing the price and logarithmic return processes presented in Figure 1.1. The period from February 19th to March 23rd of 2020 was turbulent in many ways, but mainly, it was the start of the COVID-19 pandemic.

The financial markets fell, with the S&P500 dropping 34% and the Swedish stock index OMX30 dropping 31%, leaving no markets unaffected. The period contains two "Black Mondays", the 9th and 16th of March, when markets fell 8% and 13% respectively, and one "Black Thursday" on the 12th of March, when markets fell 10%. This stressed market period has a great impact on the VaR metric: leaving a period of normal market conditions out of the calculation will increase VaR due to the enlarged weighted contribution of the stressed period. As of 2021, 18 of the companies included in the S&P500 index were not founded before 2016 [44] and were thereby not publicly listed. They would all lack the desired long-term market data that we specified for our VaR calculation.

¹ S&P500 is a stock index containing 500 large companies listed on the Nasdaq Stock Exchange and the New York Stock Exchange that represent the American industry.

Figure 1.1: S&P 500's price process (a) and corresponding one-day logarithmic returns (b) from January 2nd, 2014 to December 31st, 2020. The black dashed line marks January 1st, 2016.

The absence of long-term market data is one frequent issue that needs to be dealt with when assessing market risk metrics from historical simulations. Another common situation when dealing with multiple instruments is a sparse dataset where single or a few consecutive data points are missing. This could, e.g., happen due to operational failure at the market, caused by broken sensors, bugs, or data collection failure. But the main cause is varying business days between different markets. Suppose we have a portfolio with exposures to both the S&P500 index and the Hong Kong stock index Hang Seng, HSI². The Hong Kong market is subject to the Chinese public holidays, which are not aligned with the American holidays that affect the S&P500. E.g., during the Chinese New Year, occurring every year in February, the Hong Kong market is closed for three days³. The same holds for the Chinese National Day on the 1st of October and Buddha's Birthday on the 30th of April [22]. All of these will imply a missing HSI price in our portfolio. Figure 1.2 displays the missing prices for the HSI during the Chinese New Year in 2019. There are several approaches to tackle this problem. The simplest one would be to exclude all dates where the market data is incomplete and then feed the remaining data to the downstream risk application. The drawback of such a naive approach is that it ignores all known price movements of the other observed market variables.

To avoid missing information, we need to develop a method to fill the missing prices.

There are several ways to fill the missing window of the HSI price series, but what effect will they have on the VaR metric? The nearest reference points to the price process are the 4th and 8th of February; given the high autocorrelation between adjacent prices, would not a straight line between the references make a good prediction? That would be good reasoning if one were only interested in a fair estimate of the missing prices. Still, such an approach minimises the largest relative price movement, which will most likely lead to a systematic underestimation of the VaR metric. Another naive method is to fill the values with the nearest known price observation; hence, the price on the 4th becomes the estimate for the 5th and 6th, and the price on the 8th estimates the price on the 7th⁴. Contrary to the straight-line approach, this method maximises the price movement between two days, while being constrained to estimates bounded between the references and flattening all other movements. This approach will perhaps not imply a biased VaR metric, but it will surely overestimate the number of horizontal movements.

² The HSI index contains the largest companies on the Hong Kong stock market and is an indicator of the overall performance of the Hong Kong stock market.

³ Depending on the day of the week on which New Year's Day occurs.

Figure 1.2: S&P 500's and HSI's price processes during the Chinese New Year, 2019. The red area marks the missing price period of HSI.

A common drawback of the two above-described methods is that they only use information from the two reference prices and are further restricted to take values that fall between these two references. In reality, the price process is not restricted to lie between the two reference points, as observed for the S&P 500's process in the red-marked area in Figure 1.2, where values fall both higher and lower than either of the two bordering references. Thus far, we have only mentioned problems that affect a single missing price process. Most of the time, investors are interested in their risk exposure given a set of positions, to account for netting effects that their joint movements may cause. For example, assume we filled the missing prices of the HSI series with the exact opposite movement of the S&P 500 for each respective time step. Given the same weighted contributions in our portfolio, such an approach would yield zero-return scenarios for the portfolio. Such a strong correlation between price movements is seldom seen over longer horizons. Still, market variables often possess both long-term and temporal correlations that investors exploit to hedge their exposures.

⁴ It is customary to use the value corresponding to the prior time point when the missing value is equally distant from the two reference points.


1.1 Problem Definition

This thesis will examine methods to fill missing values in multidimensional financial time series in the context of risk metric applications. The aim is to provide reasoning about, and suggest, which models should be applied when imputing an incomplete, multidimensional time series. In a broader context, the project aims to improve client risk metrics to support risk management decisions. The performance of the methods will be evaluated both from a price replication point of view and with respect to their effect on the downstream risk application, which puts more attention on the estimated price movements.

Specifically, we will investigate two distinct use cases that contextualise different situations of an incomplete dataset when calculating VaR:

1. Single or a few missing data points causing a sparse time series. This makes it an interpolation problem, with reference points before and after the missing data points and with information from reference channels, e.g., other market variables, for the time of interest. This use case may arise when a portfolio contains assets traded on different exchanges with differing business days, or simply due to data loss.

2. Consecutive missing data points over a longer horizon at the end-point of a time series. This makes it an extrapolation problem for that particular time series. Still, there may be reference channel data, e.g., other market variable data, for the period of interest. This use case may arise when a portfolio contains assets that have not been on the market during the full period, e.g., initial public offerings (IPOs) of corporate stocks or newly created derivative instruments.

1.2 Dataset

The analysis in this thesis will be based on a dataset provided by Nasdaq but sourced from Refinitiv⁵. The dataset contains historical market data depicting 35 different financial variables, given at a daily frequency. The dataset stretches from 2014 to the beginning of 2021. Four different types of financial variables are included in the dataset: futures, discount rates, foreign exchange rates, and implied volatilities.

Due to confidentiality, the dataset will not be presented in full.

⁵ Refinitiv is a provider of market data and infrastructure for financial institutions.


Chapter 2

Background

This section will explain financial markets, risk measures, and related work on financial time series imputation. The section briefly introduces financial markets and market risk, especially the risk measures Value at Risk and Expected Shortfall.

Moving on to the financial variables covered in this thesis, we describe how instruments are traded in the market and explain the risk measure of volatility and its connection to options and the implied volatility surface. We continue with the stylised facts, where fundamental properties of financial price processes are presented and exemplified, and finish with an introduction to imputation for time series data in general, explaining why these traditional approaches may fail in our financial setting.

2.1 Market Risk

Financial markets play a key role in modern society. They enable the interaction between buyers and sellers of financial instruments and underpin fundamental pillars of the global economy, such as the provision of liquidity, risk management, and asset pricing. All entities operating in the financial markets are also exposed to risks. Financial portfolios depend on several financial variables that affect the value of their assets. This risk is called market risk and is one of the main risks financial institutions are subjected to [24].

Traders often quantify and manage market risk using the Greek metrics for a smaller set of investments. However, a financial institution's portfolio generally depends on hundreds or thousands of financial variables. Presenting many Greek metrics will not give a holistic view of the current risk exposure to senior management or regulating authorities. As a response, Value at Risk and Expected Shortfall were developed to give a single value indicative of the total risk of a portfolio [24].


Figure 2.1: Value at Risk equals the profit and loss at the α-percentile, whereas Expected Shortfall equals the mean of all profits and losses greater than or equal to the α-percentile.

2.1.1 Value at Risk

Value at Risk is a risk measure that aims to make the following statement for a portfolio [24]:

"We are α-% certain that we will not lose more than $V in time T."

As previously mentioned, the VaR calculations in this thesis are based on historical simulations, a non-parametric method that uses historical market data to calculate the T-period profit and loss scenarios incurred by a given set of financial holdings over a fixed historical period.

To calculate the α-% T-day VaR, let t ∈ {t_0 − h, …, t_0} denote a specific day, where day t_0 is today and h is the specified historical horizon. Assume the portfolio consists of d assets with corresponding position quantities w_i, i ∈ {1, …, d}. Further assume that all asset prices depend on the market variables x and have their individual price functions F_i(x)¹. For any given day, the T-period market variable scenario can be calculated as

s_t = \frac{x_{t-T} - x_t}{x_{t-T}}, \quad t \in \{t_0 + T, \dots, h\}. \qquad (2.1)

For every market variable scenario, one can now calculate its incurred profit and loss, PnL, on today's portfolio value as² [24]

\mathrm{PnL}_t = \sum_{i=1}^{d} w_i F_i(x_{t_0}) - \sum_{i=1}^{d} w_i F_i(s_t x_{t_0}), \qquad (2.2)

where the first term denotes today's portfolio value and the second term the portfolio value under the t-th day's market variable scenario. Compute the PnLs for all t ∈ {T + 1, …, h}, sort them in ascending order, and pick the PnL at the α percentile to be the α-% T-day VaR [24]. In Figure 2.1, the red coloured bar illustrates the α-% VaR in the PnL distribution.

¹ E.g., the Black-Scholes formula, which depends on the interest rate, the price of the underlying, etc. The inclusion of a pricing function allows flexibility when calculating different scenarios.

² Note that losses are represented as positive PnLs; conversely, profits are negative.


There are three parameters of the VaR model: the confidence level α, the scenario period T, and the historical price period h. Organisational or regulatory standards set the values of these parameters. The confidence level α is usually between 95% and 99.9%, and the historical period h is usually between 1 and 7 years. The scenario period T is typically set to one day but depends on the liquidity of the investment [24].
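To make the procedure concrete, the sketch below computes a historical-simulation VaR for a portfolio of linear positions. It is a minimal, hedged illustration only: the price functions F_i are assumed to be the identity, the scenario prices are obtained by applying the observed T-day relative price moves to today's prices, and the function names and data layout are our own choices rather than anything prescribed by the thesis.

```python
import numpy as np

def historical_var(prices, w, alpha=0.99, T=1):
    """Historical-simulation VaR for linear positions.

    prices : (h, d) array of daily prices for d market variables, oldest first
    w      : (d,) position quantities
    Returns the alpha-level T-day VaR and the PnL array
    (losses positive, profits negative, as in the text).
    """
    x_today = prices[-1]
    # Scenario prices: today's prices scaled by the observed T-day price ratios
    scenarios = x_today * prices[T:] / prices[:-T]
    # PnL of today's portfolio under each scenario (losses positive)
    pnl = w @ x_today - scenarios @ w
    var = np.quantile(pnl, alpha)  # PnL at the alpha percentile
    return var, pnl

# Example with synthetic data: one asset, a USD 1 000 000 position
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal((1750, 1)), axis=0))
w = np.array([1_000_000 / prices[-1, 0]])
var, pnl = historical_var(prices, w, alpha=0.99, T=1)
print(f"1-day 99% VaR: {var:,.0f} USD")
```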

2.1.2 Expected Shortfall

The expected shortfall, henceforth ES, is similar to VaR but aims to quantify the expected loss given a scenario that violates the α-percentile threshold. Thus, it tries to make the following statement for a portfolio:

"If things get bad, how bad does it get?"

This thesis will focus on the non-parametric, historical simulation approach when calculating the ES metric. The parameters are the same as for VaR and are usually set within the same intervals. The ES is assessed by calculating the PnLs for a fixed historical horizon and sorting them in ascending order. Then, ES equals the mean of all PnLs larger than or equal to the α-percentile PnL. Figure 2.1 illustrates how VaR and ES differ.
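Given a PnL array such as the one produced by the VaR sketch above, ES follows directly as the mean of the tail at or beyond the VaR threshold; again a hedged sketch rather than the thesis implementation.

```python
import numpy as np

def historical_es(pnl, alpha=0.99):
    """Expected Shortfall: the mean of all PnLs at or beyond the alpha-percentile PnL."""
    var = np.quantile(pnl, alpha)
    return pnl[pnl >= var].mean()

# Example with an arbitrary PnL sample (losses positive)
pnl = np.random.default_rng(0).standard_normal(1_000)
print(historical_es(pnl, alpha=0.99))
```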

2.2 Financial Variables

Measures depicting the financial markets are often referred to as financial variables.

They are, in general, sourced through the trade information of marketed financial instruments. Below is a description of the financial variables that are relevant to this thesis.

2.2.1 Futures

A futures contract is an exchange-traded derivative that is an agreement to buy or sell an asset, called the underlying, at a future time point T , for a specific price K.

A futures contract is the standardised version of a forward contract, which, instead of being publicly traded on an exchange, is agreed upon between two parties outside of an exchange, i.e., Over-The-Counter. Since the futures contract needs to be standardised, a contract includes [23]:

i) An underlying asset.

ii) A contract size.

iii) How the asset will be delivered.

iv) When the asset will be delivered.

2.2.2 Discount Rates

The common denominator for valuation in financial markets is the interest rate. An interest rate defines how much a borrower of funds has to pay back to the lender of those funds. There are several different interest rates quoted on the market, regardless of currency. The most important interest rate in the pricing of derivatives is the interest rate used for discounting the expected cash flows, called the discount rate. The so-called risk-free rate is the most used discount rate when pricing derivatives and is usually assumed to be an interbank offered rate, which is the interest rate that banks are charged when taking short-term loans. It is important to note that these rates are used as the risk-free rate even though they are not risk-free [23].

2.2.3 Foreign Exchange Rates

Foreign Exchange rates, commonly referred to as FX rates, denote the value of one currency relative to another. E.g., an exchange rate of 0.8 for EUR to USD means that EUR 0.8 can be exchanged for USD 1.0, or equivalently that USD 1.0 can be exchanged for EUR 0.8. In this case, the price relation of USD to EUR is 1.25.

An FX rate can be traded as it is, called spot, with delivery in the coming days, or as the underlying in instruments like futures and options. FX rates are commonly used to hedge cash flows denominated in another currency, which can be done through both futures and forwards [5][43].

2.2.4 Options

An options contract is a derivative that gives the holder the option to buy or sell an asset, called the underlying, for a specific price, K. This is different from a futures contract, where the holder of the contract is obliged to some action. The two most common forms of options are the American and the European option. European options can only be exercised at a future time point T, called the maturity. In contrast, the American counterpart can be exercised at any time up until T. The two most basic options are call and put options. The call option gives the holder the right to buy the underlying asset for a price K at or before maturity T, whereas the put option gives the holder the right to sell the underlying asset for a price K at or before maturity T. This leads to the following pay-offs for the holder of a call option C and a put option P,

C = \max(s - K, 0), \quad P = \max(K - s, 0), \qquad (2.3)

where s is the price of the underlying. The price of a European option is usually estimated by the Black-Scholes formula, which is formally defined as:

Definition 2.2.1. The price C of a European call option with strike price K and time to maturity T is given by the Black-Scholes formula:

C = s N(d_1) - K e^{-rT} N(d_2), \qquad (2.4)

where s is the price of the underlying asset, N(·) the cumulative distribution function of the standard normal distribution, r the risk-free rate, and

d_1 = \frac{\ln(s/K) + (r + \sigma^2/2)\,T}{\sigma \sqrt{T}}, \quad d_2 = d_1 - \sigma \sqrt{T}, \qquad (2.5)

where σ is the volatility of the underlying asset.


A European put option is then priced through the put-call parity. American options have no known closed-form solution but can be priced through binomial trees or simulation methods. Options traded in the market are often American options, but European options are easier to analyse due to the closed-form Black-Scholes pricing formula [23].
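As a small, hedged illustration of Definition 2.2.1, the snippet below prices a European call with the Black-Scholes formula and the corresponding put via put-call parity; the numbers are arbitrary and not taken from the thesis.

```python
from math import exp, log, sqrt
from scipy.stats import norm

def black_scholes_call(s, K, T, r, sigma):
    """European call price under Black-Scholes, Equations (2.4)-(2.5)."""
    d1 = (log(s / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return s * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

def black_scholes_put(s, K, T, r, sigma):
    """European put via put-call parity: P = C - s + K * exp(-rT)."""
    return black_scholes_call(s, K, T, r, sigma) - s + K * exp(-r * T)

print(black_scholes_call(s=100, K=105, T=0.5, r=0.01, sigma=0.2))
print(black_scholes_put(s=100, K=105, T=0.5, r=0.01, sigma=0.2))
```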

There is a certain lingo concerning options. Out-of-the-money (OTM), at-the-money (ATM), and in-the-money (ITM) are terms that refer to the intrinsic value of the option, i.e., how much the option would be worth if it were exercised today.

For a call option, the terminology is as follows [23]:

i) A call option is OTM if s < K which makes its intrinsic value 0.

ii) A call option is ATM if s ≈ K which makes its intrinsic value ≈ 0.

iii) A call option is ITM if s > K which makes its intrinsic value s − K.

2.2.5 Volatility

The volatility of an asset price, usually denoted σ, is the variability of the return of that asset. Since volatility is the only parameter in the Black-Scholes formula that is not observed in the market, it is often the centre of attention when dealing with options and other derivatives. There is no single concept of volatility, but the ones that are the subject of this thesis are historical volatility and implied volatility. The historical volatility is simply the standard deviation of the log returns of a time series. The implied volatility is the volatility implied by the market price of an option priced by the Black-Scholes formula. E.g., if C is the price of a European call option observed in the market, the implied volatility, σ_imp, is the volatility that solves the implicit equation

C = C_{BS}(s, K, T, r, \sigma_{imp}).

By plotting the implied volatilities of options with the same maturity, T, against different strike prices, K, one obtains the volatility smile. By plotting the implied volatilities of options with the same strike price, K, against their different maturities, T, one obtains the volatility term structure. One way of creating an implied volatility surface is by combining the volatility smiles with the volatility term structure [23][20]. However, in our dataset, the implied volatility surface is created by combining the volatility term structure with the different deltas of the existing options. The delta, ∆, is one of the Greeks and denotes how the option's value changes with respect to the price of the underlying asset; for a European call option, it is the derivative of C with respect to s, ∆ = ∂C/∂s.

Using the volatility surface created from existing contracts on the market, it is possible to price options at any strike price, K, and maturity, T, using interpolation and extrapolation techniques. One needs to be careful, though, not to introduce arbitrage opportunities [15].
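The implicit equation C = C_BS(s, K, T, r, σ_imp) has no closed-form solution, but it can be solved numerically with a root finder. The sketch below uses SciPy's brentq together with the black_scholes_call helper defined above; it is an illustrative assumption of how one might back out σ_imp, not the procedure used to build the dataset.

```python
from scipy.optimize import brentq

def implied_volatility(c_market, s, K, T, r, low=1e-4, high=5.0):
    """Solve C_BS(s, K, T, r, sigma) = c_market for sigma by root finding."""
    objective = lambda sigma: black_scholes_call(s, K, T, r, sigma) - c_market
    return brentq(objective, low, high)

# Round-trip check: recover the volatility used to price an option
c = black_scholes_call(s=100, K=105, T=0.5, r=0.01, sigma=0.25)
print(implied_volatility(c, s=100, K=105, T=0.5, r=0.01))  # approximately 0.25
```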



Figure 2.2: (a) A stationary time process with the same distribution over any given time interval. (b) A mean varying process where the expected value is dependent on time and hence, is not stationary. (c) A variance varying process where the variability depends on time and hence, is not stationary.

2.3 Financial Time Series

This thesis covers historical market data, which is structured as time series data.

Time series are defined by sequential data points given in successive order. The time series x observed at time points t = 1, …, n is usually written as \{x_t\}_{t=1}^{n}. In our case, the data is considered in discrete time on a once-per-day basis. A common approach is to consider the observed time series data as a realisation of a stochastic process of a random variable, X [10]. In time series analysis, an important question is whether or not the process of X is strictly and/or weakly stationary. Intuitively, stationarity means that the statistical properties of X do not change over time [9]. For a strictly stationary series, the joint probability distribution function of the sequence X = {X_{t−i}, …, X_t, …, X_{t+i}} is independent of t and i,

E(X_t) = \mu, \quad Var(X_t) = \sigma^2, \quad \forall t, \qquad (2.6)

with the autocorrelation only dependent on i,

\rho_i = \frac{Cov(X_{t-i}, X_t)}{\sqrt{Var(X_t)\,Var(X_{t-i})}} = \frac{\zeta_i}{\zeta_0}, \qquad (2.7)

where Cov(X_{t−i}, X_t) is the autocovariance. A time series is said to be weakly stationary, or covariance stationary, if its mean and autocovariances are time independent [9], i.e.,

E(X_t) = \mu < \infty \ \forall t, \quad Var(X_t) = \sigma^2 < \infty \ \forall t, \quad Cov(X_t, X_{t-i}) = \zeta_i < \infty \ \forall t, i. \qquad (2.8)

This means that the autocovariances depend only on the time interval between time points and not on the observation time [10]. Figure 2.2 presents an example with one stationary and two non-stationary processes.

As previously mentioned, a central question in time series analysis is whether the time series is stationary or not. Analysing the standard price process of a financial instrument on the market, say the S&P500 in Figure 1.1, it is clear that the statistical properties change over time and that the series is neither strictly nor weakly stationary. For example, the mean, µ, of the price process is time-dependent, which means that the time series has a trend component that violates any assumption of stationarity. A standard way to deal with this is to instead operate on the differences between time points, ∆x_t = x_t − x_{t−1} [10]. In finance, the usual way of overcoming this problem is to operate on the movements of the price process. Let x_t denote the price of an asset at time t; then

r_t = \log\left(\frac{x_t}{x_{t-1}}\right) \qquad (2.9)

is the logarithmic return of the price process, henceforth log return. Log returns are the most usual way to work with financial time series due to their nice properties, e.g., additivity [9], and will be the returns used in this thesis.

Figure 2.3: (a) Autocorrelation function of S&P500's log returns from lag 1 to 40 with a 95% confidence interval. (b) Autocorrelation function of S&P500's prices from lag 1 to 40 with a 95% confidence interval.

Figure 2.3 shows an example of the autocorrelation function of a price process and a log return process of the S&P500 with a 95% confidence interval. As depicted in Figure 2.3a, the price process has a very high autocorrelation component, which makes it more difficult for some model architectures to parse the crucial signals from the underlying process due to the low signal-to-noise ratio [3]. This further motivates using log returns, as the autocorrelation is smaller, depicted in Figure 2.3b, and a lot of "noise" is removed. However, removing a large part of the autocorrelation comes at the expense of removing the internal memory of the price process; see [36] for a discussion on the trade-off between stationarity and memory.
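For concreteness, log returns and their autocorrelation can be computed in a few lines. This is a generic sketch with pandas and statsmodels, assuming a daily price series; the synthetic data stands in for the thesis's actual dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf

# Synthetic daily price series standing in for, e.g., S&P500 closes
rng = np.random.default_rng(1)
prices = pd.Series(100 * np.exp(np.cumsum(0.01 * rng.standard_normal(1000))))

log_returns = np.log(prices / prices.shift(1)).dropna()  # Equation (2.9)

# Autocorrelation functions for lags 0..40, as in Figure 2.3
acf_prices = acf(prices, nlags=40)
acf_returns = acf(log_returns, nlags=40)
print(acf_prices[1], acf_returns[1])  # prices: close to 1, returns: close to 0
```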

2.3.1 Stylised Facts

The presence of autocorrelation in the price process is one of the statistical properties of a financial time series. More generally, the statistical properties that asset returns share over a wide range of assets, markets, and time periods are the so-called stylised facts. Independent studies have observed the stylised facts within finance over various instruments, markets, and periods [7]. Overall, it is a well-known fact that asset returns exhibit behaviour belonging to an ever-changing probability distribution. Regardless of asset type, sampling frequency, market, and period, the stylised facts can be summarised as: volatility clustering, fat tails, and non-linear dependence [9][7].



Figure 2.4: (a) The distribution of S&P500’s log returns vs a standard Normal distribution as a histogram with 100 bins. Numbers are shown for Fisher’s kurtosis and skewness of the distribution. (b) A QQ-plot of S&P500’s log returns vs a standard normal distribution.

Fat tails refers to the property that the returns' probability distribution exhibits larger positive and negative values than a normal distribution. An example of fat tails in S&P500's log returns is shown in Figure 2.4. Figure 2.4a shows that the log returns of the S&P500 have significantly greater kurtosis, 19.08 vs 0, than the normal distribution. Also, when the quantiles of the log returns are plotted against a standard normal, as in Figure 2.4b, the log returns show properties of fat tails. If the distribution were normal, the plot would depict a straight line.

Volatility clustering refers to the property that volatility in the market tends to come in clusters, i.e., there is a positive autocorrelation between volatility measures. This means that if a financial variable is volatile today, there is an increased probability that it will be volatile tomorrow. An example of this can be found in Figure 2.5b, depicting the rolling-window volatility of the log returns of the S&P500 and HSI.

The stylised fact of non-linear dependence concerns the dependence between financial return processes and how it changes according to current market conditions. Two return processes that move somewhat independently in normal market conditions can show a high temporal correlation during financially stressed periods, i.e., the prices drop together. As an example of the ever-changing dependence between financial returns, the rolling-window correlation between the log returns of the S&P500 and HSI is shown in Figure 2.5a.

To conclude this section, many theories in finance, such as portfolio theory and derivative pricing, are built on the assumption that returns are normally distributed, and all of them break down if the normality assumption is violated. In particular, in risk management and risk calculations, an assumption of asset returns being normally distributed leads to a substantial underestimation of risk [9]. The stylised facts should thus be carefully considered when modelling a financial time series. Implied volatility surfaces and discount rates have their own specific stylised facts, which can be found in Appendix C.
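The volatility and correlation estimates in Figure 2.5 are based on an exponentially weighted moving average (EWMA) with λ = 0.94. A hedged sketch of such an estimator for two return series is shown below; the variable names and the synthetic data are our own, not taken from the thesis.

```python
import numpy as np

def ewma_cov(returns_a, returns_b, lam=0.94):
    """Rolling EWMA covariance estimate between two return series."""
    cov = np.zeros(len(returns_a))
    cov[0] = returns_a[0] * returns_b[0]
    for t in range(1, len(returns_a)):
        cov[t] = lam * cov[t - 1] + (1 - lam) * returns_a[t] * returns_b[t]
    return cov

rng = np.random.default_rng(2)
r1, r2 = 0.01 * rng.standard_normal((2, 750))

vol1 = np.sqrt(ewma_cov(r1, r1))   # rolling EWMA volatility, cf. Figure 2.5b
vol2 = np.sqrt(ewma_cov(r2, r2))
corr = ewma_cov(r1, r2) / (vol1 * vol2)  # rolling EWMA correlation, cf. Figure 2.5a
```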



Figure 2.5: (a) Estimated rolling daily correlation between S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94. (b) Estimated rolling daily volatility of S&P500 and HSI log returns from 1st of January 2018 to the 1st of January 2021 through the multivariate EWMA model proposed in [9] with λ = 0.94.

2.4 Related Work

The problem of missing data in time series is not limited to finance but is a broad problem in many application domains such as healthcare, meteorology, and traffic engineering [6]. Instead of missing values originating from damaged equipment, unexpected accidents, or human error, the missing values in the financial time series most commonly depend on whether the market is open or if the asset exists in the market.

The literature on imputation of pure financial time series data is sparse, but there has been work on autoregressive models [29][2], agent-based modelling [37], reinforcement learning [11] and Gaussian Processes [45]. However, there is no reason to believe that the methods used differ substantially from other domains. Much previous work has been done using statistical and machine learning approaches when imputing missing values in time series. The most common ones seem to be autoregressive, state-space, and expectation maximisation models³ [12].

2.4.1 Autoregressive Models

An autoregressive model is a model where the output depends, often assumed linearly, on its own history and an error term. An autoregressive model can, e.g., try to predict a stock's price tomorrow based on its price today. A common model in the univariate case with a linear relationship is the Autoregressive-Moving-Average model, ARMA(p, q), which has a pth order autoregressive part, AR(p), and a qth order moving-average part, MA(q). The assumptions behind the ARMA model hold if the AR(p) part is stationary [30]. When it comes to distributional assumptions, one usually assumes that the process is normally distributed by assuming that the error term, ε_t, is an independently and identically distributed (i.i.d.) random variable, ε_t ∼ N(0, 1). However, this is flexible, and, e.g., a Student's t-distribution could be assumed instead. The ARMA framework is flexible towards extensions with exogenous variables and to a multivariate setting [8]. However, in finance, if the assumptions behind the ARMA model held, meaning that tomorrow's log return could be written as a linear combination of the p previous log returns and q previous error terms, it would quickly be noticed and exploited⁴.

³ We have deliberately disregarded methods like median and mean imputation since financial time series often are non-stationary.

⁴ At least according to the Efficient Market Hypothesis.

The Generalised Autoregressive Conditional Heteroskedasticity (GARCH) model was introduced to overcome some of these problems. The GARCH(p, q) process models the conditional variance as if it were given by an ARMA process. From this model, one can show that subsequent log returns are uncorrelated but dependent, have fat tails, create volatility clusters, and have an unconditional long-term variance. It is thus able to recreate some of the essential stylised facts. The GARCH model removes some of the crucial assumptions on the log return process that ruined it for the ARMA model, but the GARCH model still assumes that the volatility of the process is stationary. Extending GARCH to the multivariate case can be very hard and troublesome [9].

2.4.2 State-Space Models

We continue with the regression-based approaches to the imputation of time series data. Simply put, a state-space model assumes a latent process, let us call it z_t, which evolves over time. This process z_t is not observable but drives another process, x_t, that is observable. Random factors may drive the evolution of z_t and its dependence on x_t; thus, this is a probabilistic model. The state-space model consists of describing the latent state over time and its dependence on the observable processes. These models overcome some of the problems with stationarity affecting the ARMA models. An example of a state-space model is the family of models called Kalman filters.

The disadvantage of these models is that one needs to make assumptions about the dynamical system being modelled and assumptions on the noise affecting the system [10][8].

2.4.3 Expectation Maximisation

Unlike the previously described methods, Expectation Maximisation (EM) methods are not necessarily regression-based. The EM method consists of two steps, an Expectation step and a Maximisation step. First, one assumes a statistical model and a distribution of the data; the statistical model could, e.g., be an AR(p) model [29] or simply a normal distribution. Then the two steps are performed iteratively, imputing the missing time points with the statistical model to maximise the probability of the missing time points belonging to the time series [12]. In these methods, the data does not have to be considered as time-dependent data.

2.4.4 Key Points

A common drawback of the above-described approaches is that they often make strong assumptions about the missing values and may not take temporal relationships in the data into account. They thus treat the time series as non-time-dependent structured data, which may not suit financial time series, known to have a low signal-to-noise ratio and temporal correlations to other processes [3][12]. However, several deep learning approaches have become successful in imputing time series data, with Generative Adversarial Networks (GANs) and Recurrent Neural Networks (RNNs) at the forefront [12][6], and at least GANs have been successful when working with financial time series [13].

Imputation of financial time series is difficult due to the complex dynamics described in Section 2.3.1, the low signal-to-noise ratio, and the fact that the same signal can come from several different sources with a temporal effect and with varying strength [3]. Thus, when imputing financial time series, a model should preferably be non-parametric, take temporal relationships with its own channel and reference channels into account, and recreate the stylised facts.


Chapter 3

Theory

In this section, we present the theory needed to understand the models that are used later. The outline is intended to follow the complexity of the models introduced, i.e., starting with less complex linear models and extending to more complex and highly non-linear models.

First, two standard imputation methods, nearest neighbour imputation and linear interpolation, are introduced. Then, the standard linear model and its regularised version, the Lasso, are described along with their fundamental properties. Moving away from linearity, Random Forests are presented by first describing decision trees and their weakness of easily overfitting the data, and then how using multiple trees in a Random Forest can mitigate this. Later, laying the foundation for Gaussian Processes, Bayesian inference is introduced and exemplified as the weight-space view of regression through Bayesian linear regression, before moving to non-linear modelling by introducing projections into the feature space and the kernel trick. Connecting this previous work, Gaussian Processes are then introduced as the function-space view of regression; different covariance functions are presented, and it is described how to choose the corresponding hyperparameters.

Then artificial neural networks are introduced, starting with their vital building blocks and how they learn from data. The notion of artificial neural networks is then expanded to recurrent neural networks, which specialise in processing sequences and learning long-term dependencies, and continues with convolutional neural networks, which specialise in data with grid-like topology, and how they differ from regular neural networks. An example of the successful adoption of convolutional neural networks on non-stationary time series is the WaveNet architecture, developed by DeepMind in 2016. Lastly, the section finishes with how one might speed up convergence when training deep neural networks.

3.1 Nearest Neighbour Imputation

A simple approach to fill the missing values is to estimate them to equal the closest observed data point. This method is referred to as Nearest Neighbour Imputation (NNI) and is illustrated in Figure 3.1a. To formulate it in a mathematical setting, assume that y_t is missing and let \mathbf{t} denote all time points with observed values. The NNI method, also referred to as the naive method, then estimates y_t as

\hat{y}_t = y_{t^*}, \quad \text{where } t^* = \underset{\tau \in \mathbf{t}}{\operatorname{argmin}} \, |\tau - t|. \qquad (3.1)

The implication of NNI in an extrapolation setting is a flat line from the last observed value.

Figure 3.1: (a) Illustration of how the Nearest Neighbour Imputation method works. The blue dots are observed data points, and the red dots are nearest neighbour predictions. (b) Illustration of how the linear interpolation method works. The blue dots are observed data points and the red dots are the linear interpolation predictions.
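A hedged sketch of NNI for a one-dimensional series is shown below; with pandas this amounts to interpolating with the "nearest" method, and the forward-fill covers the flat-line extrapolation case at the end of the series.

```python
import numpy as np
import pandas as pd

def nearest_neighbour_impute(series: pd.Series) -> pd.Series:
    """Fill NaNs with the value at the nearest observed time point (Equation 3.1).

    Interior gaps use the nearest observation in time; trailing gaps are
    forward-filled, giving the flat-line extrapolation described in the text.
    """
    return series.interpolate(method="nearest").ffill().bfill()

y = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
print(nearest_neighbour_impute(y).tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0]
```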

3.2 Linear Interpolation

The Linear Interpolation (LI) method is another simple method to fill missing values in a sequence. Like the NNI method, the missing data points are imputed based on the nearest observed values. The LI method predicts missing values based on the straight line between the closest observed points in time; see Figure 3.1b for an example. Assume y_t is missing, where y_{t−a} and y_{t+b} are the nearest previous and next observed data points. Thus, it is an interpolation problem where t − a < t < t + b. The LI method estimates y_t as

\hat{y}_t = y_{t-a} + \frac{a\,(y_{t+b} - y_{t-a})}{a + b}. \qquad (3.2)
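Equation (3.2) is what pandas' linear interpolation computes for equally spaced time points; a minimal, hedged sketch:

```python
import numpy as np
import pandas as pd

y = pd.Series([1.0, np.nan, np.nan, 4.0])
print(y.interpolate(method="linear").tolist())  # [1.0, 2.0, 3.0, 4.0]

# Equivalently, Equation (3.2) by hand for the point one step after y_{t-a}:
y_prev, y_next, a, b = 1.0, 4.0, 1, 2   # the references are a + b = 3 steps apart
print(y_prev + a * (y_next - y_prev) / (a + b))  # 2.0
```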

3.3 Lasso

Consider the standard linear model,

y = w_0 + X^\top w + \varepsilon, \qquad (3.3)

where y is the response variable, X ∈ R^{p×n} is the set of explanatory variables, p is the number of variables, n is the number of observations, w ∈ R^{p×1} is the vector of regression coefficients, and ε is the error term containing the noise of the linear model. Given an estimate of the regression coefficients, the predicted response is given by

\hat{Y} = w_0 + X^\top w. \qquad (3.4)

In practice, w is chosen as the regression coefficients that minimise the sum of squared residuals (RSS) in the least-squares fitting procedure [14],

\sum_{i=1}^{n} \left( y_i - w_0 - x_i^\top w \right)^2. \qquad (3.5)

The Least Absolute Shrinkage and Selection Operator, henceforth Lasso, is a shrinkage method that is very similar to simple linear regression, but instead of only penalising the RSS term, the Lasso adds a regularisation of the regression coefficients. The Lasso coefficients, w_L, are the ones that minimise

\sum_{i=1}^{n} \left( y_i - w_0 - x_i^\top w \right)^2 + \lambda \sum_{i=1}^{p} |w_i|, \qquad (3.6)

for any λ ∈ R_+. The last term is the coefficient penalty, which shrinks the coefficient values towards zero. If λ = 0, the resulting estimates equal the ones obtained from Equation 3.5. The larger λ becomes, the higher the coefficient penalty and the more the coefficients shrink towards zero. The term \sum_{i=1}^{p} |w_i| is also called the ℓ1-norm of w. Lasso has three main advantages over the simple RSS-minimising model [14]:

i) It can significantly lower the coefficient variance and thus be less prone to overfitting.

ii) It is not restricted to the setting n ≥ p, i.e., when the number of observations is greater than the number of features.

iii) It shrinks many of the coefficients to exactly zero, yielding a sparse set of explanatory variables and thus increasing the model's interpretability.
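A hedged sketch of fitting Equation (3.6) with scikit-learn is given below; the regularisation strength and the synthetic data are purely illustrative and not the thesis's configuration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 30))         # n = 200 observations, p = 30 features
true_w = np.zeros(30)
true_w[:3] = [1.5, -2.0, 0.7]              # only three features actually matter
y = X @ true_w + 0.1 * rng.standard_normal(200)

# alpha corresponds to lambda in (3.6), up to scikit-learn's 1/(2n) scaling of the RSS term
model = Lasso(alpha=0.1)
model.fit(X, y)
print(np.sum(model.coef_ != 0))            # most coefficients are shrunk exactly to zero
```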

3.4 Random Forest

Random Forest (RF) is a popular machine learning method for both classification and regression tasks. In contrast to the Lasso, an RF does not assume a linear relationship between the response and explanatory variables and can thus learn more complex, non-linear patterns.

The building block of an RF is the decision tree. Building a decision tree for regression tasks is performed in a two-step procedure [14]:

i) Sequentially segment the predictor space X ∈ {X_1, X_2, …, X_d} into J disjoint, non-overlapping regions R ∈ {R_1, R_2, …, R_J}.

ii) Associate each R_i with a prediction value. Explicitly, that is the average response value of all training observations that fall in R_i.

The goal of step i) is to find regions R that minimise the RSS given by

\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2, \qquad (3.7)

where \hat{y}_{R_j} is the average response for the training data within the jth region. In reality, there is an infinite number of ways to segment the data, and evaluating them all is not feasible. Therefore, Recursive Binary Splitting is applied; a top-down, greedy approach to segmenting the feature space. It starts at the top of the tree and successively splits the data, where each split results in two new branches. The split is greedy since it only regards the partition that yields the most significant reduction in RSS for that particular step [14].

Figure 3.2: Schematic overview of a Random Forest model with m decision trees. Red dots indicate a certain decision path within a tree. The final prediction is the average of all individual tree estimates.

Decision trees are easily fitted and interpreted but have the disadvantage of high variance. That is, they are susceptible to overfitting the training data. RF is an approach that compromises on the bias-variance trade-off to gain better model performance. In short, an RF contains multiple decision trees and uniformly aggregates their individual predictions into a final prediction. When building an individual decision tree, one starts by extracting a bootstrap sample from the training set. Then, at each split, a random sample of m predictors is chosen as split candidates. Thus, the split is restricted to use only the m predictors sampled at that split step, and this creates a greater variety amongst the decision trees in the random forest, which has proven to yield better performance [14]. An illustration of an RF model is presented in Figure 3.2, where the arrows show how the data flows through the model. The hyperparameters of a random forest include:

i) Total number of decision trees. Typically, performance converges when the number of trees grows beyond a certain value.

ii) Number of randomly drawn features that are considered at each split of a decision tree.

iii) Maximum depth per decision tree.
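As a hedged illustration, a regression Random Forest with these hyperparameters exposed can be fitted with scikit-learn as below; the parameter values are arbitrary examples, not those used in the thesis.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)

rf = RandomForestRegressor(
    n_estimators=300,    # i) total number of decision trees
    max_features=5,      # ii) features drawn as split candidates at each split
    max_depth=10,        # iii) maximum depth per decision tree
    bootstrap=True,      # each tree is grown on a bootstrap sample
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:3]))  # average of the individual tree predictions
```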

3.5 Bayesian Inference

Performing statistical inference using Bayes' Rule to update the probability of a hypothesis as more information becomes available is called Bayesian inference. More specifically, the variable θ is treated as a random variable, and one assumes an initial guess about the distribution of θ, called the prior distribution. When more information becomes available, the distribution of θ is updated through Bayes' Rule, and the updated distribution is called the posterior distribution.

3.5.1 Bayes’ Rule

Definition 3.5.1. For data X and variable θ, Bayes' rule tells one how to update one's prior beliefs about the variable θ, given the data X, to a posterior belief, according to

p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}, \qquad (3.8)

where p(θ | X) is the posterior probability, p(θ) is the prior probability, p(X | θ) is called the likelihood, and p(X) is called the evidence.

The evidence is also called the marginal likelihood. The term likelihood is used for the probability that a model generates the data. The maximum a posteriori (MAP) estimate is the estimate that maximises the posterior probability,

\theta_{MAP} = \underset{\theta}{\operatorname{argmax}} \; p(\theta \mid X). \qquad (3.9)

3.5.2 Multivariate Normal Distribution

A p-dimensional normally distributed vector x^\top = [x_1, x_2, \dots, x_p] with mean µ and covariance Σ has the distribution x ∼ N(µ, Σ). The corresponding probability density function is

f(x) = \frac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right). \qquad (3.10)

The following is true for a multivariate normally distributed random vector x [27]:

i) Linear combinations of the components of x are normally distributed.

ii) All subsets of the components of x are normally distributed.

iii) Zero covariance implies that the components are independently distributed.

iv) The conditional distributions of the components are normally distributed.

3.5.3 Conditional Distribution

Lemma 3.5.1. Let X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} be distributed as N(µ, Σ) with

\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},

and |Σ_22| > 0. Then the conditional distribution of x_1, given that x_2 = x_2, is normal and has the following distribution,

x_1 \mid x_2 = x_2 \sim N\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right). \qquad (3.11)

For an example of how to obtain this result, see [27]. Note that the conditional covariance Σ_11 − Σ_12 Σ_22^{-1} Σ_21 does not depend on the conditioned variable.


3.5.4 Bayesian Linear Regression

In a Bayesian setting, following the notation in [39], the standard linear regression model is

y = X^\top w + \varepsilon, \qquad (3.12)

where w is the vector of weights, or parameters, of the regression model, X is the input matrix, and ε is assumed to be i.i.d. normally distributed noise, ε ∼ N(0, σ_n^2). Note that X has dimension p × n, where p is the number of features and n the number of observations. The probability density of the observations given the weights can be written as

p(y \mid w, X) = \prod_{i=1}^{n} p\left(y_i \mid x_i^\top w\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\left( -\frac{\left(y_i - x_i^\top w\right)^2}{2\sigma_n^2} \right) = \frac{1}{\left(2\pi\sigma_n^2\right)^{n/2}} \exp\left( -\frac{1}{2\sigma_n^2} \left\| y - X^\top w \right\|^2 \right) \sim N\left(X^\top w,\, \sigma_n^2 I\right), \qquad (3.13)

where \|z\| denotes the ℓ2-norm of the vector z. Given a prior distribution over the weights, w ∼ N(0, Σ_p), using Bayes' rule and writing out the likelihood and the prior distribution, the posterior is

p(w \mid X, y) \propto \exp\left( -\frac{1}{2\sigma_n^2} \left(y - X^\top w\right)^\top \left(y - X^\top w\right) \right) \exp\left( -\frac{1}{2} w^\top \Sigma_p^{-1} w \right) \propto \exp\left( -\frac{1}{2} (w - \bar{w})^\top \left( \frac{1}{\sigma_n^2} X X^\top + \Sigma_p^{-1} \right) (w - \bar{w}) \right), \qquad (3.14)

where \bar{w} = \sigma_n^{-2} \left( \sigma_n^{-2} X X^\top + \Sigma_p^{-1} \right)^{-1} X y. With A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}, the distribution can be written as

p(w \mid X, y) \sim N\left( \frac{1}{\sigma_n^2} A^{-1} X y,\; A^{-1} \right). \qquad (3.15)

The mean is the maximum a posteriori (MAP) estimate of w and thus the most probable weights of the underlying function.

When making predictions with the model, the possible values of the weights are averaged, weighted by their respective posterior probability. Namely, to get the predictive distribution of the function value, f_*, at a test point x_*, one computes the average of the output of all possible linear models created by the weights with respect to the posterior given in Equation 3.15,

p(f_* \mid x_*, X, y) = \int p(f_* \mid x_*, w)\, p(w \mid X, y)\, dw = N\left( \frac{1}{\sigma_n^2} x_*^\top A^{-1} X y,\; x_*^\top A^{-1} x_* \right). \qquad (3.16)

In Figure 3.3 there is an example of Bayesian linear regression where the weights of the regression models are drawn from the prior and posterior distributions. This view of regression can be described as the weight-space view of regression and allows for limited flexibility if a linear function cannot correctly describe the output.
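The posterior in Equation (3.15) and the predictive distribution in Equation (3.16) are straightforward to compute with linear algebra. The sketch below is a minimal numpy illustration under assumed values of σ_n and Σ_p, following the column-wise p × n convention for X used in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 2, 50
X = rng.standard_normal((p, n))                     # p x n input matrix
true_w = np.array([0.5, -1.0])
sigma_n = 0.1
y = X.T @ true_w + sigma_n * rng.standard_normal(n)

Sigma_p = np.eye(p)                                  # prior covariance, w ~ N(0, Sigma_p)
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)    # A = sigma_n^-2 X X^T + Sigma_p^-1
A_inv = np.linalg.inv(A)

w_mean = A_inv @ X @ y / sigma_n**2                  # posterior mean, Equation (3.15)
w_cov = A_inv                                        # posterior covariance

x_star = np.array([1.0, 2.0])                        # test input
f_mean = x_star @ w_mean                             # predictive mean, Equation (3.16)
f_var = x_star @ A_inv @ x_star                      # predictive variance
print(w_mean, f_mean, f_var)
```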


Figure 3.3: (a) Joint density plot of values drawn from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (b) 200 lines drawn with weights sampled from the prior distribution of the weights, w ∼ N(0, σ²I) with σ = 2. (c) Joint density plot of values drawn from the posterior distribution of the weights, Equation 3.15. (d) 200 lines drawn with weights sampled from the posterior distribution of the weights, Equation 3.15.


3.5.5 Feature Space Projection

A linear model is restricted to linear relationships between the response and feature variables and will perform poorly on non-linear data. To account for non-linear relationships, one could make projections into the feature space using some basis functions. For example, a scalar projection into the space of powers would lead to a polynomial regression model. A vital advantage of this is that if the projections are made onto fixed functions, the model is still linear in its parameters and therefore analytically tractable.

Let φ(x_i) = (φ_1, …, φ_N) be a basis function that maps a p-dimensional vector into an N-dimensional feature space, and let the matrix Φ(X) be the collection of the columns φ(x_i) for all instances in the training set. The model is then

f(x_i) = \phi(x_i)^\top w, \qquad (3.17)

where w now is an N × 1 vector. Following the same method as described before, it can be shown that the expression for the predictive distribution is the same as in Equation 3.16, with the exception that all X are replaced by Φ(X), i.e.,

f_* \mid x_*, X, y \sim N\left( \frac{1}{\sigma_n^2}\, \phi(x_*)^\top A^{-1} \Phi y,\; \phi(x_*)^\top A^{-1} \phi(x_*) \right), \qquad (3.18)

where Φ = Φ(X) and A = \sigma_n^{-2} \Phi \Phi^\top + \Sigma_p^{-1}. To make the implementation more computationally efficient, it is often rewritten as

f_* \mid x_*, X, y \sim N\Big( \phi_*^\top \Sigma_p \Phi \left(K + \sigma_n^2 I\right)^{-1} y,\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi \left(K + \sigma_n^2 I\right)^{-1} \Phi^\top \Sigma_p \phi_* \Big), \qquad (3.19)

where φ(x_*) = φ_* and K = \Phi^\top \Sigma_p \Phi are used for shorter notation. Note that this expression is equivalent to the expression for the conditional distribution presented in Equation 3.11. For a detailed explanation of the derivation, see [39].

3.5.6 The Kernel Trick

In Equation 3.19, one can see that the feature space enters in the form φ(x)^⊤ Σ_p φ(x'), regardless of whether x and x' originate from the training or the test data. One can also see that φ(x)^⊤ Σ_p φ(x') is an inner product with respect to Σ_p. Since Σ_p is positive definite, it can be written as \left(\Sigma_p^{1/2}\right)^2 = \Sigma_p, and using the singular value decomposition \Sigma_p = U D U^\top one can write \Sigma_p^{1/2} = U D^{1/2} U^\top. By defining ψ(x) = \Sigma_p^{1/2} \phi(x), the kernel can be written as

k(x, x') = \psi(x) \cdot \psi(x'). \qquad (3.20)

If an expression is defined only in terms of inner products in the input space, one can use the kernel trick and lift the inputs into the feature space by using the kernel, k(x, x'), to replace the inner products. The kernel trick is convenient when it is more beneficial to compute the kernel than the feature vector, as in Gaussian Processes, where the kernel is the centre of interest rather than its corresponding feature space.


3.6 Gaussian Processes

Definition 3.6.1. A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian Process (GP) is determined by its mean and covariance function. Let m(x) denote the mean function and k(x, x') the covariance function of a GP such that

m(x) = E[f(x)], \quad k(x, x') = E\big[ (f(x) - m(x))\,(f(x') - m(x')) \big]; \qquad (3.21)

then the GP is written as

f(x) \sim \mathcal{GP}\left( m(x),\, k(x, x') \right). \qquad (3.22)

The mean function, m(x), can be any real-valued function and is often set to 0 by demeaning the observations. The covariance function k(x, x'), also known as the kernel function, can be any function that satisfies Mercer's condition [39]. With a specified mean and covariance function, an implied distribution over functions is created. To sample from this distribution, let x_* be a number of input points; then a random Gaussian vector can be drawn from the distribution

f_* \sim N\left( m(x_*),\, k(x_*, x_*) \right), \qquad (3.23)

and the generated values can be understood as functions of the inputs [39]. The covariance function, k, models the joint variability of the GP random variables, i.e. the function values, and returns the covariance between pairs of inputs. Thus, the joint distribution of the training data, f, and the test data, f_*, according to the prior distribution is

\begin{pmatrix} f \\ f_* \end{pmatrix} \sim N\left( 0,\; \begin{pmatrix} k(x, x) & k(x, x_*) \\ k(x_*, x) & k(x_*, x_*) \end{pmatrix} \right). \qquad (3.24)

From the conditioning property of the Gaussian distribution described in Equation 3.11, the posterior distribution of the test data, f_*, is

f_* \mid x_*, x, f \sim N\big( k(x_*, x)\,k(x, x)^{-1} f,\; k(x_*, x_*) - k(x_*, x)\,k(x, x)^{-1} k(x, x_*) \big). \qquad (3.25)

This holds when assuming no noise in the underlying process and its distribution. To obtain a similar result with noise, as for the linear model in Section 3.5.4, one can add a noise parameter to the covariance function and instead get the prior distribution

\begin{pmatrix} y \\ f_* \end{pmatrix} \sim N\left( 0,\; \begin{pmatrix} k(x, x) + \sigma_n^2 I_n & k(x, x_*) \\ k(x_*, x) & k(x_*, x_*) \end{pmatrix} \right), \qquad (3.26)

and when making inference one makes use of the following posterior distribution,

f_* \mid x_*, x, y \sim N\big( k(x_*, x)\left(k(x, x) + \sigma_n^2 I_n\right)^{-1} y,\; k(x_*, x_*) - k(x_*, x)\left(k(x, x) + \sigma_n^2 I_n\right)^{-1} k(x, x_*) \big). \qquad (3.27)

In contrast to the weight-space view in Section 3.5.4, one can yield the same results by making inference directly in the function space, and the GP is therefore used
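The posterior in Equation (3.27) can be computed directly with a few lines of numpy. The sketch below is a hedged illustration using a squared-exponential kernel, which is one common choice; the thesis's actual covariance functions and hyperparameter choices are the subject of Section 3.6.1.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of scalar inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 25)                      # training inputs
y = np.sin(x) + 0.1 * rng.standard_normal(25)  # noisy training targets
x_star = np.linspace(0, 5, 100)                # test inputs
sigma_n = 0.1

K = rbf_kernel(x, x) + sigma_n**2 * np.eye(len(x))   # k(x, x) + sigma_n^2 I_n
K_s = rbf_kernel(x_star, x)                          # k(x_*, x)
K_ss = rbf_kernel(x_star, x_star)                    # k(x_*, x_*)

K_inv = np.linalg.inv(K)
post_mean = K_s @ K_inv @ y                          # Equation (3.27), mean
post_cov = K_ss - K_s @ K_inv @ K_s.T                # Equation (3.27), covariance
print(post_mean[:3], np.diag(post_cov)[:3])
```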
