DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
Spectral Portfolio Optimisation
with LSTM Stock Price Prediction
NANCY WANG
Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics (120 credits) KTH Royal Institute of Technology year 2017
TRITA-SCI-GRU 2020:083 MAT-E 2020:046
Royal Institute of Technology
School of Engineering Sciences
KTH SCI
ABSTRACT
Nobel Prize-winning modern portfolio theory (MPT) is considered one of the most important and influential economic frameworks within finance and investment management. MPT assumes investors to be risk-averse and uses the variance of asset returns as a proxy of risk to maximise the performance of a portfolio. Successful portfolio management thus relies on accurate risk estimates and asset return predictions. Risk estimates are commonly obtained through traditional asset pricing factor models, which allow the systematic risk to vary over the time domain but not in the frequency space. This approach can impose limitations in, for instance, risk estimation. To tackle this shortcoming, interest in applications of spectral analysis to financial time series has increased lately; among others, the novel spectral portfolio theory and the spectral factor model demonstrate enhanced portfolio performance through spectral risk estimation [1][11]. Moreover, stock price prediction has always been a challenging task due to its non-linearity and non-stationarity. Meanwhile, machine learning has been successfully implemented in a wide range of applications where it is infeasible to accomplish the needed tasks traditionally. Recent research has demonstrated significant results in single stock price prediction by artificial LSTM neural networks [6][34]. This study aims to evaluate the combined effect of these two advancements in a portfolio optimisation problem and optimise a spectral portfolio with stock prices predicted by LSTM neural networks. To do so, we began with mathematical derivation and theoretical presentation, and then evaluated the portfolio performance generated by the spectral risk estimates and the LSTM stock price predictions, as well as the combination of the two. The results demonstrate that the LSTM predictions alone performed better than the combination, which in turn performed better than the spectral risk alone.
Keywords: Artificial Neural Network, LSTM, Spectral factor model, Portfolio
SAMMANFATTNING
The Nobel Prize-winning modern portfolio theory (MPT) is without doubt one of the most successful investment frameworks within the financial world and investment strategies. MPT assumes that investors are risk-averse and approximates risk exposure with the variance of the assets' returns. The key to successful portfolio management is therefore good risk estimates and good predictions of asset prices. Risk estimation is usually carried out through the traditional asset pricing models, which allow the risk to vary in time but not in the frequency space. This limitation causes, among other things, a larger error in risk estimation. To tackle this, interest in applications of spectral analysis to financial time series has increased in recent years. In particular, a new approach to the problem is the recently introduced spectral portfolio theory and the spectral factor model, which demonstrated increased portfolio performance through spectral risk estimation [1][11]. Meanwhile, prediction of stock prices has long been a major challenge due to their non-linear and non-stationary properties, while machine learning has been used to solve otherwise intractable tasks. Recent studies have demonstrated significant results in stock price prediction using artificial LSTM neural networks [6][34]. This work investigates the combined effect of these two advances in a portfolio optimisation problem by optimising a spectral portfolio with future returns predicted by an LSTM neural network. The work begins with mathematical derivations and a theoretical introduction, and then studies the portfolio performance generated by the spectral risk, the LSTM stock price predictions, and a combination of the two. The results show that the LSTM prediction alone performed better than the combination, which in turn performed better than the spectral risk estimation alone.
Keywords: Artificial neural networks, LSTM, Spectral factor model,
ACKNOWLEDGEMENTS
My sincerest gratitude goes to my supervisors Lars Meuller and Fredrik Viklund for their guidance, support and enthusiastic engagement. Lars has always provided multiple choices during the planning of this paper, and he has consistently allowed this paper to be my work while being supportive and enlightening within areas of his expertise. Fredrik has been very encouraging and kind, and he has given valuable and constructive suggestions during the planning and development of this work. It has been a delightful experience to have them as my supervisors.
Moreover, I would like to direct my appreciation to AP4 for offering this thesis opportunity. It has been a very educational experience.
Contents
1 Introduction 10
1.1 Background . . . 10
1.2 Objective . . . 13
2 Preliminary 14
2.1 Time Series Analysis . . . 14
2.1.1 Time Series . . . 14
2.1.2 Stationarity . . . 15
2.1.3 Spectral Density and Periodogram . . . 17
2.1.4 Time Series Models . . . 20
2.1.5 The Projection Theory . . . 24
2.1.6 Wold Decomposition . . . 26
2.2 Multivariate Time Series Analysis . . . 28
2.2.1 Introduction to Vectorial Time Series . . . 28
2.2.2 Multivariate Time Series Models . . . 29
2.2.3 Multivariate Wold Decomposition . . . 31
2.3 Extended Wold Decomposition . . . 33
2.3.1 Formal Derivation . . . 33
2.4 Spectral Decomposition Methods . . . 36
2.4.1 Parametric VAR(p) Method . . . 36
2.4.2 Nonparametric VMA(q) Method . . . 37
2.4.3 Scale Interpretation and Decomposition Example . . . 38
3 Spectral Factor Model 41
3.1 Two-Component Decomposition . . . 41
4 Spectral Portfolio-Optimisation Framework 45
4.1 Markowitz Optimisation Theory . . . 45
4.2 Spectral Risk . . . 46
4.3 Asset Pricing Factor Models . . . 47
4.3.1 Macro-Economical Factor Models . . . 47
4.3.2 Fundamental Factor Models . . . 48
4.4 Example of Risk and Beta Decomposition . . . 50
5 LSTM 53
5.1 Deep LSTM Network . . . 53
5.1.1 Introduction to Recurrent Neural Networks . . . 53
5.1.2 LSTM Neural Network . . . 55
5.2 Global Stock Prediction Example . . . 57
5.3 Hyperparameter Tuning . . . 58
5.4 Prediction Accuracy Evaluation . . . 63
6 Model Design and Data 66
6.1 Model Design . . . 66
6.2 Data . . . 67
7 Empirical Evaluation 68
8 Discussion 74
Symbol Glossary
N {0, 1, 2, . . . }
R the set of real numbers
Z the set of integers
Z+ the set of positive integer numbers
C the set of complex numbers
E[X] the expectation of random variable X
Cov(X, Y) the covariance of the random variables X and Y
Var(X) the variance of the random variable X
L the back shift operator
a′ the transpose of vector a
A−1 the inverse of matrix A
List of Tables
1 Scale interpretation . . . 38
2 Spectral risk decomposition . . . 51
3 Spectral betas & weights . . . 51
4 Prediction MSE for single-layered LSTM . . . 59
5 Prediction MAPE for single layered NNs . . . 59
6 Model summary of the final LSTM model . . . 60
7 Prediction MSE for 2 layer NNs . . . 61
8 Prediction MAPE for 2 layer NNs . . . 61
9 Prediction errors of dropout rates . . . 62
10 Prediction outliers - error distribution . . . 64
11 Prediction outliers - data overview . . . 64
12 Data overview . . . 67
13 Portfolio return overview - CAPM . . . 69
14 Portfolio return overview - FF3 . . . 70
15 Portfolio return overview - Carhart . . . 70
16 Portfolio return overview - FF5 . . . 71
List of Figures
1 Time series de-trending . . . 17
2 Google stock before DFT . . . 19
3 Power spectrum of Google stock . . . 19
4 Reconstruction from partial signal 1 . . . 20
5 Reconstruction from partial signal 2 . . . 20
6 Random variable projection . . . 26
7 Extended Wold decomposition - original data . . . 39
8 Distribution of spectral components . . . 39
9 Spectral orthogonal components . . . 40
10 Spectral betas & risk example - data visualisation . . . 50
11 Unit structure of a RNN . . . 55
12 Unit structure of the LSTM neural network . . . 56
13 One-month ahead global stock prediction by deep LSTM neural network . . . 57
14 Errors of different dropout rates. . . 62
15 Stock prediction error including outliers . . . 63
16 Outliers visualisation . . . 65
17 Stock prediction error excluding outliers . . . 65
18 Optimisation model design . . . 66
19 Portfolio return box plot - 10% target return . . . 71
20 Portfolio return box plot - 15% target return . . . 72
1 Introduction
1.1 Background
Data analysis has become well developed both in theory and in practice during the past decades. As large datasets become available and computational techniques advance like never before, discovering new quantitative methods and refinements of classical investment problems has become crucial for investors to remain competitive. One of the essential tasks within investment management is portfolio management. A portfolio is a collection of investment assets such as stocks, bonds and commodities. Portfolio management refers to the process of determining an investment strategy such that the return is maximised over the investment period. In other words, the main challenge within portfolio management is to allocate a given capital within a portfolio such that the return is maximal for a given acceptable risk level.
What is risk? In asset pricing theory, there are two types of risk, namely systematic and unsystematic risk. Unsystematic risk can be eliminated through diversification, i.e. by dividing the capital among various kinds of assets and spreading the risk so that individual risks cancel one another out. According to MPT, diversification is accomplished by choosing assets whose returns are uncorrelated with one another, which makes the covariance matrix of the asset returns a good proxy for risk. Systematic uncertainty is, on the contrary, non-diversifiable. It is, therefore, vital to find an optimal strategy which maximises the portfolio return without exposure to unnecessary risk. By Markowitz portfolio theory [25], this is a quadratic optimisation problem with two inputs, namely the expected asset returns and the risk estimate, i.e. the covariance matrix of asset returns. For this reason, the problem is also called a mean-variance problem. The key to successful portfolio optimisation is thus accurate risk estimation and asset return prediction.
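To make the mean-variance problem concrete, the following sketch solves the equality-constrained problem of minimising w′Σw subject to a target expected return and fully invested weights, via its Lagrangian closed form. It is purely illustrative: the expected returns and covariance matrix below are hypothetical numbers, not data used in this thesis.

```python
import numpy as np

def mean_variance_weights(mu, sigma, target_return):
    """Minimum-variance weights achieving a given expected return.

    Solves min_w w' Sigma w  s.t.  w' mu = target_return, sum(w) = 1,
    using the closed form w = Sigma^{-1} A' (A Sigma^{-1} A')^{-1} b
    (short positions allowed).
    """
    n = len(mu)
    A = np.vstack([mu, np.ones(n)])      # stacked constraint matrix
    b = np.array([target_return, 1.0])
    sigma_inv = np.linalg.inv(sigma)
    return sigma_inv @ A.T @ np.linalg.solve(A @ sigma_inv @ A.T, b)

mu = np.array([0.08, 0.12, 0.10])        # hypothetical expected returns
sigma = np.array([[0.04, 0.01, 0.00],    # hypothetical covariance matrix
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.06]])
w = mean_variance_weights(mu, sigma, 0.10)
```

By construction the weights satisfy both constraints exactly; varying the target return traces out the efficient frontier.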
Applications of spectral analysis to such problems were, however, limited until recently. During the past years, there has been a rebirth of interest in applying spectral analysis to financial and economic time series. Among others, Croux et al. proposed a frequency-based measure to describe the co-movement of co-stationary processes [15], and Barunik et al. used frequency decomposition to examine the pricing of market risks in the cross-section of asset returns at various time horizons [16]. In particular, Chomesh et al. [11] introduced a spectral portfolio theory in 2016 and decomposed alphas, betas and risk into sums of periodic functions. Bandi et al. [1] advanced this further and presented the spectral factor model in 2019. The authors performed optimisation on their brand-new spectral portfolio and demonstrated that the spectral decomposition leads to better risk estimates in a mean-variance optimisation problem.
Another equally important input for the mean-variance problem is the prediction of future asset returns. Due to the stochastic nature of asset returns, with their non-linearity and non-stationarity, it is not a simple task to forecast asset returns with high precision, especially over long forecasting periods. Asset returns are commonly predicted by fundamental or technical analysis. Fundamental analysis evaluates a company's past performance as well as the credibility of its accounts, while technical analysis predicts stock prices solely from trends in the historical data. These analyses have been successful and are well established among investors. However, some shortcomings are yet to be overcome. In particular, fundamental analysis requires an extensive amount of work and acts as a long-term strategy. Technical analysis, in turn, is a short-term strategy due to the non-linearity and non-stationarity of asset prices.
LSTM neural networks have recently been applied with success to forecast major global stock indices [6], and Ta et al. performed stock price prediction with significant results in the context of portfolio optimisation [34].
1.2 Objective
With the background presented above, this paper aims to study the combined effect of spectral portfolio theory and LSTM-based asset price prediction. This goal can be divided into two parts. The first part is to understand the spectral risks and betas, how they appear in a spectral factor model and how they act in a spectral portfolio context. This part comprises both theoretical and practical understanding of the spectral factor model, which includes but is not limited to the mathematical derivation and empirical portfolio optimisation with financial datasets. The second part is to explore the possibility of long-horizon time series forecasting by an LSTM neural network. This part consists of a theoretical understanding of the LSTM neural network and hyperparameter tuning to set up an optimal prediction model. Finally, to achieve the objective of this study, a mean-variance portfolio was optimised with the risk estimates evaluated by the spectral factor model and the expected asset returns predicted by the LSTM neural network, as well as the combination of the two. Thereafter, the portfolio performances were evaluated and compared between the different models.
2 Preliminary
2.1 Time Series Analysis
In this section, we introduce some basic time series properties and relevant concepts that will be needed for the financial time series studies in this paper.
2.1.1 Time Series
Let us begin by defining a stochastic process:
Definition 1. A stochastic process is a family of random variables {X_t, t ∈ T} defined on a probability space (Ω, F, P).

The set T above is known as the parameter set or the index set; some common index sets are Z+, Z, R+ and R. A stochastic process with T ⊂ Z is usually called a time series. Moreover, a stochastic process is usually described by its means and covariances, occasionally under some general assumptions.
Definition 2. For {X_t, t ∈ T} a stochastic process with Var(X_t) < ∞, the mean and covariance functions are defined as µ_X(t) = E[X_t] and γ_X(r, s) = Cov(X_r, X_s), respectively, where {t, r, s} ∈ T.
Definition 3. A realisation is a function f : t → x_t which assigns to the random variable X_t a realisation x_t at each point in time t. We denote such a realisation by {x_t}.
We call the realisation or its underlying stochastic process a time series. Loosely speaking, a time series is a sequence of observations {xt} whose element xt is an
observed outcome of the random variable Xt at time t. In general, observations
can be done randomly over a continuous time interval or at some fixed discrete time points. In this paper, we restrict ourselves to equidistant discrete time series since we aim to study financial time series which are commonly daily, monthly or quarterly data.
The main focus of time series analysis is to understand the relationship between observations. Often, we assume some parametric trend and de-trend the data before further analysis. Consider the classical time series decomposition:
$$X_t = m_t + s_t + Y_t, \qquad (1)$$
where
• $m_t$: a trend component, commonly a slowly changing function,
• $s_t$: a seasonal component, commonly a periodic function,
• $Y_t$: a zero-mean time series.
Some common trends are:
• The linear parametric trend ($s_t = 0$):
$$X_t = \underbrace{\beta_0 + \beta_1 t}_{m_t} + Y_t.$$
• The quadratic parametric trend ($s_t = 0$):
$$X_t = \underbrace{\beta_0 + \beta_1 t + \beta_2 t^2}_{m_t} + Y_t.$$
• The seasonal parametric trend ($m_t = 0$, $\Omega \in \mathbb{R}$ denotes the period):
$$X_t = \underbrace{\beta_0 + \beta_1 \sin\frac{2\pi t}{\Omega} + \beta_2 \cos\frac{2\pi t}{\Omega}}_{s_t} + Y_t.$$
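As a small illustration of de-trending with a parametric trend (a sketch on simulated data, not the thesis dataset), a linear trend can be fitted by least squares and subtracted to obtain an estimate of the zero-mean component $Y_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(120, dtype=float)       # e.g. 120 monthly observations

# Simulated series with linear trend m_t = 2 + 0.5 t plus noise Y_t
x = 2.0 + 0.5 * t + rng.normal(0.0, 1.0, t.size)

# Fit the linear parametric trend by least squares: x ≈ beta0 + beta1 t
beta1, beta0 = np.polyfit(t, x, deg=1)

# De-trended residual: an estimate of the zero-mean component Y_t
y = x - (beta0 + beta1 * t)
```

The residual `y` has zero sample mean by construction of the least-squares fit, mirroring the requirement that $Y_t$ be a zero-mean series.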
2.1.2 Stationarity
When studying a time series, we often look at its stationarity, i.e. whether its statistical properties remain constant over time. Mathematically, there are different levels of stationarity. Here we introduce strict and weak stationarity.
Definition 4. A stochastic process {X_t, t ∈ Z} is strictly stationary if the distributions of (X_{t_1}, …, X_{t_k}) and (X_{t_1+h}, …, X_{t_k+h}) are the same ∀ t_1, …, t_k, h ∈ Z.
Definition 5. For a stochastic process {Xt, t ∈ Z}, if
(i) Var(Xt) < ∞ ∀ t ∈ Z,
(ii) µX(t) = µ ∀ t ∈ Z,
(iii) γX(r, s) = γX(r + t, s + t) ∀ r, s, t ∈ Z,
then the process is said to be weakly stationary or covariance stationary.
From Definition 5, one can see that it is convenient to consider the covariance in terms of r − s and define γ_X(h) ≡ γ_X(h, 0), where h is commonly known as the lag. A weakly stationary process is often studied through its autocovariance and autocorrelation functions.
Definition 6. For a covariance stationary stochastic process {X_t, t ∈ T}, the autocovariance and autocorrelation functions are defined as γ_X(h) = Cov(X_{t+h}, X_t) and ρ_X(h) = γ_X(h)/γ_X(0), respectively.
One of the simplest covariance stationary process is the white noise which is defined as follows.
Definition 7. A process {X_t, t ∈ Z} is said to be a white noise with mean µ and variance σ², written {X_t} ∼ WN(µ, σ²), if E[X_t] = µ, γ(0) = σ² and γ(h) = 0 otherwise.
One example of a strictly stationary process is the Gaussian process, defined as follows.

Definition 8. The process {X_t, t ∈ Z} is said to be Gaussian if all finite-dimensional distributions of {X_t} are multivariate normally distributed.
Most time series cannot be considered strictly stationary. For this reason, much of the literature drops the word weakly and simply uses stationary for weakly stationary processes without further specification. We adopt this custom in the rest of this paper.
Theorem 1. The covariance function γ_X(h) of a stationary process {X_t, t ∈ Z} is an even positive semi-definite function. Conversely, for any even positive semi-definite function there exists a stationary time series with this function as its covariance function.

Theorem 1 follows from the definition of the covariance function of a stationary process. The proof of the theorem can be found in e.g. [28].
Let us re-consider the classical time series decomposition of Equation (1) with s_t = 0. One way to extract the deterministic component m_t such that Y_t is zero-mean is to de-trend the data, for instance by taking logarithmic differences, as in Figure 1.

Figure 1: Monthly adjusted closing price of Amazon.com, Inc. in the upper plot and the de-trended log difference plot at the bottom.
2.1.3 Spectral Density and Periodogram
A time series is often characterised by its mean and covariance. However, it is not always a simple task to work with the covariance function in the time domain. Sometimes, a workaround is to look into the frequency space, or reciprocal space, of the time series and study its spectral density.
Assume the autocovariance function γ of a time series {X_t, t ∈ Z} is summable in the sense that $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$. Then the spectral density f of the time series is defined as
$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \gamma(h), \qquad -\pi \le \lambda \le \pi.$$
It then follows by taking the inverse Fourier transform that
$$\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} \frac{1}{2\pi} \sum_{l=-\infty}^{\infty} e^{i(h-l)\lambda} \gamma(l)\, d\lambda = \frac{1}{2\pi} \sum_{l=-\infty}^{\infty} \gamma(l) \int_{-\pi}^{\pi} e^{i(h-l)\lambda}\, d\lambda = \gamma(h).$$
Let us consider Equation (1) with a seasonal parametric trend and zero mean, and let $\omega = 2\pi/\Omega$:
$$X_t = \beta_1 \sin(\omega t) + \beta_2 \cos(\omega t) + Y_t,$$
where $\beta_1$ and $\beta_2$ are uncorrelated random variables with zero means and common variance $\sigma^2$. We focus on the periodic component (taking $Y_t \equiv 0$) and note that $E[X_t] = 0$; the covariance function is then
$$\begin{aligned}
\gamma_X(h) &= E[X_{t+h} X_t] \\
&= E[\beta_1 \beta_2 (\cdots)] + E[\beta_1^2 \sin(\omega(t+h))\sin(\omega t)] + E[\beta_2^2 \cos(\omega(t+h))\cos(\omega t)] \\
&= 0 + \sigma^2 \sin(\omega(t+h))\sin(\omega t) + \sigma^2 \cos(\omega(t+h))\cos(\omega t) \\
&= \sigma^2 \{\cos(\omega(t+h))\cos(\omega t) + \sin(\omega(t+h))\sin(\omega t)\} = \sigma^2 \cos(\omega h).
\end{aligned}$$
Expressed in complex form, we get the spectral representation of $X_t$:
$$\gamma_X(h) = \frac{\sigma^2}{2}\left(e^{-ih\omega} + e^{ih\omega}\right). \qquad (2)$$
If the process is covariance stationary, then by the Wiener–Khinchin theorem [35] the power spectral density S of the process is
$$S(\nu) = \sum_{h=-\infty}^{\infty} \gamma_X(h) e^{-i 2\pi \nu h},$$
where ν denotes the frequency. In words, the Wiener–Khinchin theorem states that the power spectrum of a covariance stationary stochastic process is the Fourier transform of its autocovariance function, providing a direct link between the time domain and the frequency domain.
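The spectral viewpoint can be illustrated numerically: the periodogram, the squared magnitude of the discrete Fourier transform, estimates the power spectrum and recovers the dominant frequency of a noisy sinusoid. This is a sketch on simulated data; the frequency and noise level are arbitrary choices, not related to the thesis data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
nu0 = 0.125                              # true frequency, cycles per sample

# A noisy sinusoid: one strong periodic component plus white noise
x = np.sin(2 * np.pi * nu0 * np.arange(n)) + 0.5 * rng.normal(size=n)

# Periodogram: squared DFT magnitude, an estimate of S(nu)
spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
freqs = np.fft.rfftfreq(n)               # frequencies in cycles per sample

peak = freqs[np.argmax(spectrum)]        # should sit near nu0
```

The peak of the periodogram lands on the grid frequency closest to the true frequency of the periodic component, despite the added noise.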
Figure 2: Monthly return of the Google stock before decomposition; the data spans January 2005 to December 2019.
Figure 4: Reconstruction from the partial signal below 1 cycle per year. Note that the 2008 financial crisis is visible in this spectrum.
Figure 5: Reconstruction from the signal between 2 and 3 cycles per year. Note that the 2008 financial crisis is no longer visible in the curve.
2.1.4 Time Series Models
Definition 9. A time series model is a specification of the joint distribution of
{Xt} for which {xt} is a realisation.
One can think of a time series model as a model of the observations, i.e. a model of the data {x_t}. We proceed to introduce some of the standard time series models.
Definition 10. The process {X_t, t ∈ Z} is said to be a moving average of order q, MA(q), if
$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \qquad \{Z_t\} \sim WN(0, \sigma^2),$$
where {θ_1, …, θ_q} ∈ R.
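A defining property of an MA(q) process is that its autocovariance function vanishes beyond lag q, which is easy to check by simulation (a sketch with hypothetical coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.6, 0.3])             # hypothetical MA(2) coefficients
sigma = 1.0
n = 200_000

z = rng.normal(0.0, sigma, n + len(theta))   # white noise {Z_t}
# X_t = Z_t + theta_1 Z_{t-1} + theta_2 Z_{t-2}
x = z[2:] + theta[0] * z[1:-1] + theta[1] * z[:-2]

# Sample autocovariance at lag 3: an MA(q) process has gamma(h) = 0
# for every |h| > q, so this should be close to zero.
gamma3 = np.mean((x[3:] - x.mean()) * (x[:-3] - x.mean()))
```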
Definition 11. The process {X_t, t ∈ Z} is said to be a linear process if it has the representation
$$X_t = \sum_{k=-\infty}^{\infty} \psi_k Z_{t-k}, \qquad \{Z_t\} \sim WN(0, \sigma^2), \quad \text{where} \quad \sum_{k=-\infty}^{\infty} |\psi_k| < \infty.$$
Theorem 2. A linear process {X_t, t ∈ Z} with representation
$$X_t = \sum_{k=-\infty}^{\infty} \psi_k Z_{t-k}, \qquad \{Z_t\} \sim WN(0, \sigma^2),$$
is stationary with µ_X(t) = 0, autocovariance function
$$\gamma_X(h) = \sum_{k=-\infty}^{\infty} \psi_k \psi_{k+h} \sigma^2,$$
and spectral density
$$f_X(\lambda) = \frac{\sigma^2}{2\pi} |\psi(e^{-i\lambda})|^2, \qquad \text{where} \quad \psi(x) = \sum_{k=-\infty}^{\infty} \psi_k x^k.$$
Proof. We start by computing the mean and autocovariance function. Given {Z_t} ∼ WN(0, σ²), we have µ_X(t) = Σ_k ψ_k E[Z_{t−k}] = 0, and
$$\begin{aligned}
\gamma_X(h) &= E[X_{t+h} X_t] = E\Big[\Big(\sum_{k=-\infty}^{\infty} \psi_k Z_{t+h-k}\Big)\Big(\sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}\Big)\Big] \\
&= \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \psi_k \psi_j E[Z_{t+h-k} Z_{t-j}] = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} \psi_{k+h} \psi_j E[Z_{t-k} Z_{t-j}] \\
&= \sum_{k=-\infty}^{\infty} \psi_{k+h} \psi_k E[Z_{t-k}^2] = \sum_{k=-\infty}^{\infty} \psi_k \psi_{k+h} \sigma^2.
\end{aligned}$$
Moreover, using the definition of the spectral density and γ_X(h), we have
$$\begin{aligned}
f_X(\lambda) &= \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \gamma_X(h) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda} \sum_{k=-\infty}^{\infty} \psi_k \psi_{k+h} \sigma^2 \\
&= \frac{\sigma^2}{2\pi} \sum_{h=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_k e^{ik\lambda}\, \psi_{k+h} e^{-i(k+h)\lambda} = \frac{\sigma^2}{2\pi} \Big(\sum_{k=-\infty}^{\infty} \psi_k e^{ik\lambda}\Big) \Big(\sum_{m=-\infty}^{\infty} \psi_m e^{-im\lambda}\Big) \\
&= \frac{\sigma^2}{2\pi} \psi(e^{i\lambda})\, \psi(e^{-i\lambda}) = \frac{\sigma^2}{2\pi} |\psi(e^{-i\lambda})|^2.
\end{aligned}$$
Definition 12. The process {X_t, t ∈ Z} is said to be an autoregressive moving average process of order (p, q), ARMA(p, q), if it is stationary and if
$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q},$$
where {Z_t} ∼ WN(0, σ²). In addition, we say that {X_t} is an ARMA(p, q) process with mean µ if {X_t − µ} is an ARMA(p, q) process.

It is common to write the ARMA(p, q) process in the alternative form
$$\phi(L) X_t = \theta(L) Z_t, \qquad t \in \mathbb{Z},$$
where
$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p, \qquad \theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,$$
and the backshift operator (also called lag operator) L is defined by $L^j X_t = X_{t-j}$.
Specifically, we have an MA(q) process if p = 0, and if q = 0 we have an autoregressive process defined as follows.

Definition 13. The process {X_t, t ∈ Z} is said to be an autoregressive process of order p, AR(p), if it is stationary and if
$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t, \qquad \{Z_t\} \sim WN(0, \sigma^2).$$
We say that {X_t} is an AR(p) process with mean µ if {X_t − µ} is also an AR(p) process.

Definition 14. An ARMA(p, q) process defined by the equations
$$\phi(L) X_t = \theta(L) Z_t, \qquad \{Z_t\} \sim WN(0, \sigma^2),$$
is said to be causal if there exist constants {ϕ_j} satisfying $\sum_{j=0}^{\infty} |\varphi_j| < \infty$ and
$$X_t = \sum_{j=0}^{\infty} \varphi_j Z_{t-j}, \qquad t \in \mathbb{Z}.$$
In other words, causality is a relationship between {X_t} and {Z_t}; it is not a property of the process {X_t} alone. An equivalent way to express causality is to require that
$$\mathrm{Cov}(X_t, Z_{t+j}) = 0, \qquad \forall\, j \in \mathbb{Z}^+.$$
Theorem 3. Let {X_t} be an ARMA(p, q) process for which φ and θ have no common zeros. Then {X_t} is causal if and only if φ(z) ≠ 0 ∀ |z| ≤ 1, z ∈ C. The coefficients {ϕ_j} are uniquely determined by the identity
$$\sum_{j=0}^{\infty} \varphi_j z^j = \frac{\theta(z)}{\phi(z)}.$$
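The identity $\sum_j \varphi_j z^j = \theta(z)/\phi(z)$ yields the causal coefficients by formal power-series division, i.e. the recursion $\varphi_j = \theta_j + \sum_k \phi_k \varphi_{j-k}$ with $\theta_0 = 1$. A sketch; the helper name and example coefficients are our own illustration:

```python
import numpy as np

def causal_coefficients(phi, theta, n):
    """First n coefficients of theta(z)/phi(z) for a causal ARMA process.

    phi   -- AR coefficients (phi_1, ..., phi_p) of phi(z) = 1 - sum phi_i z^i
    theta -- MA coefficients (theta_1, ..., theta_q) of theta(z) = 1 + sum theta_j z^j
    Uses the recursion varphi_j = theta_j + sum_k phi_k varphi_{j-k}.
    """
    varphi = np.zeros(n)
    for j in range(n):
        t_j = 1.0 if j == 0 else (theta[j - 1] if j - 1 < len(theta) else 0.0)
        ar_part = sum(phi[k - 1] * varphi[j - k]
                      for k in range(1, min(j, len(phi)) + 1))
        varphi[j] = t_j + ar_part
    return varphi

# AR(1): phi(z) = 1 - 0.5 z, theta(z) = 1  =>  varphi_j = 0.5^j
coeffs = causal_coefficients([0.5], [], 6)
```

For the causal AR(1) the recursion reproduces the familiar geometric coefficients $\varphi_j = \phi_1^j$.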
2.1.5 The Projection Theory
Before proceeding further, we need to familiarise ourselves with projection theory in vector spaces. It is assumed that the reader is familiar with the definition and basic properties of Hilbert spaces; see [37] for a more detailed introduction.
Consider the random variables X_1, …, X_n, Y, all with finite mean and variance. We would like to predict Y in terms of X_1, …, X_n. We restrict ourselves to linear prediction and aim to find an estimator
$$\hat{Y} = a_0 + a_1 X_n + \cdots + a_n X_1$$
that minimises the mean square error E[(Y − Ŷ)²].

Denote E[X_i] = µ_i, E[Y] = µ, γ = (γ_1, …, γ_n)′ = (Cov(X_n, Y), …, Cov(X_1, Y))′, and
$$\Gamma = \begin{pmatrix} \gamma_{1,1} & \cdots & \gamma_{1,n} \\ \vdots & & \vdots \\ \gamma_{n,1} & \cdots & \gamma_{n,n} \end{pmatrix} = \begin{pmatrix} \mathrm{Cov}(X_n, X_n) & \cdots & \mathrm{Cov}(X_n, X_1) \\ \vdots & & \vdots \\ \mathrm{Cov}(X_1, X_n) & \cdots & \mathrm{Cov}(X_1, X_1) \end{pmatrix}.$$
Note that the indices are purposely reversed. For simplicity we can then write
$$\hat{Y} - \mu = a_0 + a_1 (X_n - \mu_n) + \cdots + a_n (X_1 - \mu_1).$$
Now
$$\begin{aligned}
E[(Y - \hat{Y})^2] &= E[(Y - \mu - a_0 - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1))^2] \\
&= a_0^2 - 2 a_0\, E[(Y - \mu) - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1)] \\
&\quad + E[((Y - \mu) - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1))^2] \\
&= a_0^2 + E[((Y - \mu) - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1))^2],
\end{aligned}$$
where the middle term vanishes because all the variables are centred.
Hence E[(Y − Ŷ)²] is minimised in a_0 by setting a_0 = 0, and it remains to minimise over a_1, …, a_n. Differentiating,
$$\begin{aligned}
\frac{\partial E[(Y - \hat{Y})^2]}{\partial a_i} &= \frac{\partial E[((Y - \mu) - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1))^2]}{\partial a_i} \\
&= -2\, E[(X_{n-i+1} - \mu_{n-i+1})((Y - \mu) - a_1(X_n - \mu_n) - \cdots - a_n(X_1 - \mu_1))] \\
&= -2 \{\mathrm{Cov}(X_{n-i+1}, Y) - a_1 \mathrm{Cov}(X_{n-i+1}, X_n) - \cdots - a_n \mathrm{Cov}(X_{n-i+1}, X_1)\} \\
&= -2 \Big(\gamma_i - \sum_{j=1}^{n} a_j \gamma_{ij}\Big) = 0, \qquad i = 1, \ldots, n.
\end{aligned}$$
Hence E[(Y − Ŷ)²] is minimised if $\gamma_i = \sum_{j=1}^{n} a_j \gamma_{ij}$ for i = 1, …, n. In vector form, we can write γ = Γa, or a = Γ⁻¹γ if Γ is an invertible matrix.
Now suppose there is another vector a* associated with an estimator Ŷ* such that γ = Γa*, where a* is not necessarily identical to a. Then
$$\mathrm{Var}(\hat{Y} - \hat{Y}^*) = (a - a^*)' \Gamma (a - a^*) = (a - a^*)' (\gamma - \gamma) = 0.$$
Thus Ŷ is the unique estimator of Y when Γ is nonsingular. Note that our calculation does not require {µ, µ_1, …, µ_n} = 0. Moreover, the predictor Ŷ is determined by
$$\mathrm{Cov}(Y - \hat{Y}, X_i) = 0 \qquad \forall\, i = 1, \ldots, n. \qquad (3)$$
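The normal equations a = Γ⁻¹γ and the orthogonality property (3) can be verified on simulated data (a sketch; the linear model generating Y below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs = 100_000

# Hypothetical example: Y = 2 X1 - X2 + noise; predict Y from (X1, X2)
X = rng.normal(size=(n_obs, 2))
Y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n_obs)

# Normal equations Gamma a = gamma, built from sample covariances
Gamma = np.cov(X, rowvar=False)              # Cov(X_i, X_j)
gamma = np.array([np.cov(X[:, i], Y)[0, 1] for i in range(2)])
a = np.linalg.solve(Gamma, gamma)            # a = Gamma^{-1} gamma

Y_hat = X @ a
# Projection property (3): the residual is uncorrelated with each X_i
resid_cov = np.cov(np.column_stack([Y - Y_hat, X]), rowvar=False)[0, 1:]
```

The recovered coefficients sit close to (2, −1), and the sample covariance between the residual and each predictor is zero up to floating-point error, exactly as Equation (3) requires.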
Consider the space L²(Ω, F, P) of all random variables W defined on the probability space (Ω, F, P) such that E[W²] < ∞. Let the inner product of two random variables in this space be defined as ⟨W_1, W_2⟩ = E[W_1 W_2], and consider the Hilbert space generated by the random variables,
$$H = \{a_0 Y + a_1 X_n + \cdots + a_n X_1,\; a_i \in \mathbb{R}\ \forall\, i \in \{0, 1, \ldots, n\}\},$$
and its observable subspace
$$M = \{a_1 X_n + \cdots + a_n X_1,\; a_i \in \mathbb{R}\ \forall\, i \in \{1, \ldots, n\}\}.$$
For two zero-mean random variables W_1 and W_2 in H, W_1 ⊥ W_2 is equivalent to Cov(W_1, W_2) = 0, which in turn implies Var(W_1 + W_2) = Var(W_1) + Var(W_2). Equation (3) states precisely that Y − Ŷ ⊥ M; for this reason, it is often referred to as the projection theorem. We illustrate this relation in Figure 6; note that Ŷ ∈ M ⊂ H and Y − Ŷ ∈ M⊥ ⊂ H.
Figure 6: Illustration of the projection theorem for stochastic variables. Intuitively, this is the same as vector projection although the orthogonality between Y − ˆY and ˆY is expressed by Cov(Y − ˆY , ˆY ) = 0.
In the upcoming chapters, we will use the following notation without further specification:
• M = sp{W_1, …, W_n}: the Hilbert space spanned by {W_1, …, W_n};
• Ŷ = P_{sp{W_1,…,W_n}} Y: the projection of Y on M;
• E[|Y − Ŷ|²] = Var(Y − Ŷ) = ||Y − Ŷ||²: the variance of the error between Y and Ŷ.
2.1.6 Wold Decomposition
Let {X_t} be a zero-mean stationary process. Denote M_n = sp{X_t, t ≤ n}, σ² = E[|X_{n+1} − P_{M_n} X_{n+1}|²] and $M_{-\infty} = \bigcap_{n=-\infty}^{\infty} M_n$. The Hilbert space M_{−∞} is commonly known as the infinite past.
Definition 15. The process {X_t} is called deterministic if σ² = 0, or equivalently if X_t ∈ M_{−∞} ∀ t ∈ Z.

It should be pointed out that the space M_n is the observable Hilbert space; being deterministic means that the mean square error between the one-step-ahead value X_{n+1} and the projection of X_{n+1} onto the entire past is zero. The opposite extreme, a process with no deterministic component at all, is also of interest.

Definition 16. The process {X_t} is called purely non-deterministic if M_{−∞} = {0}.
Theorem 4. (The Wold Decomposition Theorem) Any zero-mean covariance stationary process {X_t} with σ² > 0 can be expressed as
$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t,$$
where
• ψ_0 = 1 and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,
• {Z_t} ∼ WN(0, σ²) and {V_t} is deterministic,
• Z_t ∈ M_t, V_t ∈ M_{−∞} ∀ t ∈ Z,
• E[Z_t V_s] = 0 ∀ t, s ∈ Z.
Moreover, the sets {ψ_j}, {Z_t} and {V_t} are uniquely determined by {X_t} and the conditions above.
2.2 Multivariate Time Series Analysis
2.2.1 Introduction to Vectorial Time Series
In practice, it is often beneficial to analyse multiple time series simultaneously, partly because some models require numerous variables to be analysed at the same time, and partly because the covariance and correlation between variables are better captured in a multivariate model. The models treated later in this study involve modelling and analysing multiple financial time series simultaneously. Therefore, we study some multivariate time series concepts in this section.
Let {X_t} = {…, X_{t−1}, X_t, X_{t+1}, …} be an m-dimensional stochastic process consisting of random vectors. Denote X_t = (X_{1,t}, X_{2,t}, …, X_{m,t})′ the random vector on R^m, i.e. {X_t} consists of m component time series {X_{1,t}}, …, {X_{m,t}} evolving simultaneously in time. Moreover, {X_t} is covariance stationary if every component time series is covariance stationary. We define the multivariate mean as
$$\mu_t \equiv E[X_t] = \begin{pmatrix} E[X_{1,t}] \\ \vdots \\ E[X_{m,t}] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_m \end{pmatrix},$$
the covariance matrix as
$$\Gamma_0 = \mathrm{Var}(X_t) = \mathrm{Cov}(X_t) = \begin{pmatrix} \mathrm{Cov}(X_{1,t}, X_{1,t}) & \cdots & \mathrm{Cov}(X_{1,t}, X_{m,t}) \\ \vdots & & \vdots \\ \mathrm{Cov}(X_{m,t}, X_{1,t}) & \cdots & \mathrm{Cov}(X_{m,t}, X_{m,t}) \end{pmatrix},$$
and the correlation matrix as
$$R_0 = \mathrm{Corr}(X_t) = D^{-1/2}\, \Gamma_0\, D^{-1/2}, \qquad D = \mathrm{diag}(\Gamma_0),$$
where both Γ_0 and R_0 are symmetric matrices. In terms of lags, the lag-h cross-covariance matrix is
$$\Gamma_h = \mathrm{Cov}(X_t, X_{t-h}) = \begin{pmatrix} \mathrm{Cov}(X_{1,t}, X_{1,t-h}) & \cdots & \mathrm{Cov}(X_{1,t}, X_{m,t-h}) \\ \vdots & & \vdots \\ \mathrm{Cov}(X_{m,t}, X_{1,t-h}) & \cdots & \mathrm{Cov}(X_{m,t}, X_{m,t-h}) \end{pmatrix}.$$
Similarly, the lag-h cross-correlation matrix is defined as
$$R_h = \mathrm{Corr}(X_t, X_{t-h}) = D^{-1/2}\, \Gamma_h\, D^{-1/2}, \qquad D = \mathrm{diag}(\Gamma_0).$$
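The lag-h cross-covariance matrix can be estimated from data as a sample average. A sketch on a simulated bivariate series in which the first component leads the second by one step (the construction is hypothetical):

```python
import numpy as np

def cross_covariance(x, h):
    """Sample lag-h cross-covariance matrix Gamma_h = Cov(X_t, X_{t-h}).

    x -- array of shape (T, m), one row per time point
    """
    xc = x - x.mean(axis=0)
    T = x.shape[0]
    return xc[h:].T @ xc[:T - h] / (T - h)

rng = np.random.default_rng(4)
T = 50_000
z = rng.normal(size=(T, 2))
# Bivariate series: component 0 is white noise, component 1 echoes
# component 0 with a one-step delay plus its own noise.
x = np.column_stack([z[:, 0], np.roll(z[:, 0], 1) + z[:, 1]])

gamma1 = cross_covariance(x, 1)          # estimate of Cov(X_t, X_{t-1})
```

Element (2, 1) of Γ₁ picks up the lead-lag dependence (close to 1 here), while the own-lag entry of the white-noise component stays near 0.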
2.2.2 Multivariate Time Series Models
Definition 17. The process {X_t} is said to be a vector autoregressive moving-average process of order (p, q), VARMA(p, q), if it is stationary and if
$$X_t - \Phi_1 X_{t-1} - \cdots - \Phi_p X_{t-p} = Z_t + \Theta_1 Z_{t-1} + \cdots + \Theta_q Z_{t-q},$$
where Φ_p ≠ 0, Θ_q ≠ 0 and {Z_t} ∼ WN(0, Σ). We say that {X_t} is a VARMA(p, q) process with mean µ if {X_t − µ} is a VARMA(p, q) process.

The VARMA(p, q) process has an alternative form in terms of the backshift operator:
$$\Phi(L) X_t = \Theta(L) Z_t, \qquad t \in \mathbb{Z},$$
where
$$\Phi(z) = I_m - \Phi_1 z - \cdots - \Phi_p z^p, \qquad \Theta(z) = I_m + \Theta_1 z + \cdots + \Theta_q z^q,$$
and Φ(L), Θ(L) are m × m matrices whose elements are lag polynomials of order less than or equal to p and q, respectively.
If p = 0, there is no autoregressive part and we have a simple vector moving-average process, VMA(q). If q = 0, there is no moving-average part and we have a simple vector autoregressive process, VAR(p).
Let {X_t} be an m-variate random process. The basic p-lag vector autoregressive model, denoted VAR(p), has the form
$$X_t - \Phi_1 X_{t-1} - \Phi_2 X_{t-2} - \cdots - \Phi_p X_{t-p} = Z_t, \qquad (4)$$
where the coefficient matrices {Φ_i}_{i=1}^p are of size m × m and the white noise Z_t ∼ WN(0, Σ) is of size m × 1.

In terms of the lag operator L, the VAR(p) takes the form Φ(L)X_t = Z_t, where Φ(L) = I_m − Φ_1 L − ⋯ − Φ_p L^p. The VAR(p) process is stable if det(I_m − Φ_1 z − ⋯ − Φ_p z^p) ≠ 0 ∀ z ∈ C, |z| ≤ 1. Assuming the process has been initialised in the infinite past M_{−∞}, the stable VAR(p) process is stationary and ergodic with time-invariant means, variances and autocovariances.
In fact, we can write any VAR(p) process to VAR(1) in the companion form
if we define Yt = Xt Xt−1 ... ... Xt−p+1 , Λ = A1 A2 · · · Ap Im 0 · · · 0 0 Im 0 · · · 0 ... ... ... ... ... 0 · · · 0 Im 0 , Et = Zt 0 ... ... 0 .
The matrix Λ is commonly known as the companion matrix of the VAR(p) process. In this form, the VAR(p) process is stable if and only if all the eigenvalues of the companion block matrix have modulus less than one. Note that if Xt is stable,
then it is also covariance stationary.
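To make the companion construction concrete, here is a minimal Python sketch (the helper names `companion_matrix` and `is_stable` are ours, not from the thesis) that builds $\Lambda$ from the coefficient matrices and checks stability via the eigenvalues of the companion matrix:

```python
import numpy as np

def companion_matrix(Phi):
    """Build the companion matrix of a VAR(p) process from its
    list of m x m coefficient matrices Phi = [Phi_1, ..., Phi_p]."""
    p = len(Phi)
    m = Phi[0].shape[0]
    Lam = np.zeros((m * p, m * p))
    Lam[:m, :] = np.hstack(Phi)          # top block row: Phi_1 ... Phi_p
    Lam[m:, :-m] = np.eye(m * (p - 1))   # identity blocks on the sub-diagonal
    return Lam

def is_stable(Phi):
    """VAR(p) is stable iff all companion eigenvalues lie inside the unit circle."""
    eig = np.linalg.eigvals(companion_matrix(Phi))
    return bool(np.all(np.abs(eig) < 1))

# A bivariate VAR(2) with small coefficients is stable:
Phi = [0.5 * np.eye(2), 0.1 * np.eye(2)]
print(is_stable(Phi))  # True
```

For this diagonal example the companion eigenvalues solve $\lambda^2 = 0.5\lambda + 0.1$ per component, giving roots of modulus below one.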
The VAR$(p)$ model is advantageous in the sense that it is flexible and easy to fit in common multivariate time series analysis, and it has proven useful for representing and forecasting economic and financial time series. However, it should be noted that this basic VAR$(p)$ model may not always be sufficient to capture the essential characteristics of the data. One extended version is
$$X_t = c + A_1 X_{t-1} + A_2 X_{t-2} + \cdots + A_p X_{t-p} + \Psi Q_t + \Phi Z_t + \epsilon_t,$$
where $Q_t$ represents the deterministic components, $Z_t$ represents the stochastic exogenous variables and $\Psi, \Phi$ are the associated coefficient matrices.
Definition 18. A process $\{X_t\}$ with $\Phi(L)X_t = \Theta(L)Z_t$ is called causal with respect to $\{Z_t\}$ if and only if there exists a sequence of absolutely summable matrices $\{\Psi_j\}_{j \in \mathbb{N}}$, $\sum_{j=0}^{\infty} |\Psi_j| < \infty$, such that
$$X_t = \sum_{j=0}^{\infty} \Psi_j Z_{t-j}.$$
Theorem 5. Let $\{X_t\}$ be a VARMA$(p,q)$ process with $\Phi(L)X_t = \Theta(L)Z_t$ and assume that
$$\det(\Phi(z)) \neq 0 \quad \forall z \in \mathbb{C} \text{ with } |z| \leq 1.$$
Then the stochastic difference equation $\Phi(L)X_t = \Theta(L)Z_t$ has exactly one stationary solution, with causal representation
$$X_t = \sum_{j=0}^{\infty} \Psi_j Z_{t-j},$$
where the sequence of matrices $\{\Psi_j\}$ is absolutely summable and the matrices are uniquely determined by the identity $\Phi(z)\Psi(z) = \Theta(z)$.
Here we note a special case which is crucial in our study: if $q = 0$, i.e. $\Theta(L) = I_m$, then we have a VAR$(p)$ process, namely
$$\Phi(L)X_t = I_m Z_t = Z_t \iff X_t = \Phi(L)^{-1} Z_t, \quad (5)$$
and moreover
$$\Phi(L)\Psi(L) = I_m \iff \Psi(L) = \Phi(L)^{-1}.$$
If we restrict ourselves further to $p = 1$ and consider the VAR(1) process
$$Y_t = \Lambda Y_{t-1} + E_t,$$
we can then write recursively
$$Y_t = \Lambda Y_{t-1} + E_t = \Lambda(\Lambda Y_{t-2} + E_{t-1}) + E_t = \cdots = E_t + \Lambda E_{t-1} + \Lambda^2 E_{t-2} + \Lambda^3 E_{t-3} + \cdots = \sum_{j=0}^{\infty} \Lambda^j E_{t-j} = \Big(\sum_{j=0}^{\infty} \Lambda^j L^j\Big) E_t,$$
where $\Lambda^0 = I$. Combining this with Equation (5), we have the following ways of expressing $Y_t$:
$$Y_t = \Lambda(L)^{-1} E_t = \sum_{j=0}^{\infty} \Lambda^j E_{t-j} = \Big(\sum_{j=0}^{\infty} \Lambda^j L^j\Big) E_t. \quad (6)$$
In particular, we note that
$$\Lambda(L)^{-1} = \sum_{j=0}^{\infty} \Lambda^j L^j. \quad (7)$$
2.2.3 Multivariate Wold Decomposition
Any covariance stationary time series $\{X_t\}$ can be decomposed as
$$X_t = V_t + \sum_{k=0}^{\infty} \Psi_k Z_{t-k},$$
where
• $\{V_t\}$ is an $m$-dimensional linearly deterministic process;
• $\Psi_0 = I$ and $\sum_{k=0}^{\infty} \Psi_k \Psi_k' < \infty$;
• $\{Z_t\}$ is a multivariate white noise process with
  – $E[Z_t] = 0$,
  – $\mathrm{Var}(Z_t) = E[Z_t Z_t'] = \Sigma$,
  – $\mathrm{cov}(Z_t, Z_{t-k}) = E[Z_t Z_{t-k}'] = 0 \;\forall\, k \in \mathbb{Z}^+$,
  – $\mathrm{cov}(Z_t, V_{t-k}) = 0 \;\forall\, k \in \mathbb{Z}$.
2.3 Extended Wold Decomposition
In this chapter, we put forward the theoretical derivation of the extended Wold decomposition first introduced in 2017 by Cerreia-Vioglio et al. [10]. This representation is obtained in the time domain, but it offers a tool to analyse frequency-specific properties simultaneously. In their derivation, they used Hilbert A-modules [8], which are a generalisation of Hilbert spaces. Ortu et al. observed and justified that the extended Wold decomposition is the Discrete Haar Transform (DHT) of the original innovations of a classical Wold decomposition [29]. In their study, they introduced a decomposition method by transforming the moving average representation of a zero-mean covariance stationary process. Their extended Wold decomposition is based on a scaling operator that averages process innovations. In particular, each component at scale $j$ is an infinite moving average driven by a sequence of orthogonal innovations located on a time grid whose unit spacing is $2^j$ times the original unit grid. The DHT is a wavelet transform based on the Haar function; here we simply state that the DHT is suitable for analysing the localised features of a signal due to the orthogonality of the Haar function. For more details, see for instance Chapters 5 and 6 in [36].
In our derivation, we start by presenting the extended Wold representation for a univariate stochastic process, in the same manner as Ortu et al. [29]. The univariate case gives a good mathematical understanding of the extended decomposition itself. After that, we present the multivariate case, which is needed in the application of the spectral factor model to portfolio analysis.
2.3.1 Formal Derivation
Let $\{X_t\}$ be a real-valued zero-mean covariance-stationary process defined on the space $L^2(\Omega, \mathcal{F}, P)$. The Wold decomposition of $X_t$ is then
$$X_t = \sum_{k=0}^{\infty} \psi_k Z_{t-k},$$
where $\{Z_t\}$ is a unique white noise process and $\{\psi_k\} \subset \mathbb{R}$ is a square-summable sequence with $\psi_k = E[X_t Z_{t-k}]$. The coefficients $\{\psi_k\}$ are named $\{\alpha_k\}$ in [10]; however, we adapt the notation to the context of this paper so it is easier for the reader to follow.
From the innovations $\{Z_t\}$ we define the scale-$j$ detail process via the Discrete Haar Transformation (DHT), $\{Z_t^{(j)}\}$, as
$$Z_t^{(j)} = \frac{1}{\sqrt{2^j}}\left(\sum_{i=0}^{2^{j-1}-1} Z_{t-i} - \sum_{i=0}^{2^{j-1}-1} Z_{t-2^{j-1}-i}\right).$$
We then consider a sub-series of $\{Z_t^{(j)}\}$ defined on the support $S_t^{(j)} = \{t - k2^j,\ k \in \mathbb{Z}\}$. The process $\{Z_t^{(j)}\}$ is a moving average of order $2^j - 1$ with respect to the innovations $\{Z_t\}$. In common economic and financial time series, there is some correlation between the scale variables $Z_{t-k2^j}^{(j)}$ and $Z_{\tau-k2^j}^{(j)}$ with $|t - \tau| \leq 2^j - 1$. However, on the support $S_t^{(j)}$, each sub-series $\{Z_{t-k2^j}^{(j)},\ k \in \mathbb{Z}\}$ is a white noise process.
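The scale-$j$ DHT of the innovations can be sketched in Python as follows; the helper name `haar_innovations` is ours, and the loop is deliberately naive for clarity:

```python
import numpy as np

def haar_innovations(z, j):
    """Scale-j Discrete Haar Transform of an innovation series z:
    Z_t^{(j)} = 2^{-j/2} (sum_{i<2^{j-1}} z_{t-i} - sum_{i<2^{j-1}} z_{t-2^{j-1}-i}).
    Returns an array aligned with z; entries before index 2^j - 1 are NaN."""
    h = 2 ** (j - 1)
    out = np.full(len(z), np.nan)
    for t in range(2 * h - 1, len(z)):
        first = z[t - h + 1 : t + 1].sum()            # sum of z_{t-i}, i = 0..h-1
        second = z[t - 2 * h + 1 : t - h + 1].sum()   # sum of z_{t-h-i}, i = 0..h-1
        out[t] = (first - second) / np.sqrt(2 ** j)
    return out

rng = np.random.default_rng(0)
z = rng.standard_normal(1024)
z1 = haar_innovations(z, 1)   # scale 1: (z_t - z_{t-1}) / sqrt(2)
```

At scale $j = 1$ the formula reduces to the familiar Haar detail $(z_t - z_{t-1})/\sqrt{2}$, which the last line computes.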
To be more explicit, we decompose the space $H_t(Z)$ of infinite moving averages whose underlying white noise process is $\{Z_t\}$ into orthogonal subspaces $W_t^{(j)}$, where
$$H_t(Z) = \left\{\sum_{k=0}^{\infty} a_k Z_{t-k} : \sum_{k=0}^{\infty} a_k^2 < \infty\right\}, \qquad W_t^{(j)} = \left\{\sum_{k=0}^{\infty} b_k^{(j)} Z_{t-k2^j}^{(j)} : b_k^{(j)} \in \mathbb{R}\right\}, \quad j \in \mathbb{Z}^+.$$
It is evident that $X_t$ is contained in the space $H_t(Z)$, since it is a linear summation of the innovations $\{Z_t\}$, which are contained in $H_t(Z)$. Let $X_t^{(j)}$ denote the projection of $X_t$ on the subspace $W_t^{(j)}$, $j \in \mathbb{Z}^+$; we can then re-express $X_t$ as
$$X_t = \sum_{j=1}^{\infty} X_t^{(j)}.$$
For fixed $t$, the components $X_t^{(j)}$ are orthogonal to each other by construction. In addition, since $X_t^{(j)} \in W_t^{(j)}$, it can be expressed as
$$X_t^{(j)} = \sum_{k=0}^{\infty} \varphi_k^{(j)} Z_{t-k2^j}^{(j)},$$
for some real coefficient sequence $\{\varphi_k^{(j)}\}$ such that $\sum_{k=0}^{\infty} \big(\varphi_k^{(j)}\big)^2 < \infty$. The coefficients $\{\varphi_k^{(j)}\}$ can be obtained by projecting $X_t$ on the subspace spanned by the corresponding scale-$j$ innovations.
Thus, we get the extended Wold representation of $X_t$ as
$$X_t = \sum_{j=1}^{\infty} X_t^{(j)} = \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \varphi_k^{(j)} Z_{t-k2^j}^{(j)}. \quad (9)$$
After some algebraic manipulation of Equation (8) we arrive at
$$\varphi_k^{(j)} = \frac{1}{\sqrt{2^j}}\left(\sum_{i=0}^{2^{j-1}-1} \psi_{k2^j+i} - \sum_{i=0}^{2^{j-1}-1} \psi_{k2^j+2^{j-1}+i}\right). \quad (10)$$
We note that the coefficients in the extended Wold representation are also the DHT of the coefficients of the classical Wold representation.
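This observation can be checked numerically. Below is a minimal sketch (the helper name is ours) computing Equation (10) for AR(1)-style Wold coefficients $\psi_k = a^k$:

```python
import numpy as np

def extended_wold_coeffs(psi, j, K):
    """DHT of classical Wold coefficients (Equation (10)):
    phi_k^{(j)} = 2^{-j/2} (sum_i psi_{k 2^j + i} - sum_i psi_{k 2^j + 2^{j-1} + i}),
    with i running over 0..2^{j-1}-1. psi is assumed long enough."""
    h = 2 ** (j - 1)
    phi = np.empty(K)
    for k in range(K):
        base = k * 2 ** j
        phi[k] = (psi[base : base + h].sum()
                  - psi[base + h : base + 2 * h].sum()) / np.sqrt(2 ** j)
    return phi

# AR(1)-style Wold coefficients psi_k = a^k:
a = 0.8
psi = a ** np.arange(64)
phi1 = extended_wold_coeffs(psi, j=1, K=8)
# At scale 1: phi_k^{(1)} = (psi_{2k} - psi_{2k+1}) / sqrt(2)
```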
Consider now the multidimensional case. Specifically, let us consider an $m$-dimensional covariance stationary process $\{X_t\} = \{\ldots, X_{t-1}, X_t, X_{t+1}, \ldots\}$ with zero mean. Denote by $X_t = (X_{1,t}, X_{2,t}, \ldots, X_{m,t})'$ the random vector on $\mathbb{R}^m$. The classical vectorial Wold representation is
$$X_t = \begin{pmatrix} X_{1,t} \\ \vdots \\ X_{m,t} \end{pmatrix} = \sum_{k=0}^{\infty} \Psi_k Z_{t-k}.$$
From the univariate case, we get the analogous multivariate extended Wold representation
$$X_t = \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \varphi_k^{(j)} Z_{t-k2^j}^{(j)}, \quad (11)$$
where the extended Wold coefficients are
$$\varphi_k^{(j)} = \frac{1}{\sqrt{2^j}}\left(\sum_{i=0}^{2^{j-1}-1} \Psi_{k2^j+i} - \sum_{i=0}^{2^{j-1}-1} \Psi_{k2^j+2^{j-1}+i}\right), \quad (12)$$
and the innovation process at scale $j$ is
$$Z_t^{(j)} = \frac{1}{\sqrt{2^j}}\left(\sum_{i=0}^{2^{j-1}-1} Z_{t-i} - \sum_{i=0}^{2^{j-1}-1} Z_{t-2^{j-1}-i}\right).$$
2.4 Spectral Decomposition Methods
In this section, we present two methods to decompose a vectorial time series into frequency-specific orthogonal components. The decomposition is conducted in the time domain; the components, however, lie in orthogonal frequency spaces.
2.4.1 Parametric VAR(p) Method
Given a time series vector, one can find the order of autoregression using an order selection program, e.g. the Statsmodels package for Python users. With $p$ obtained from order selection, we can then consider an $n$-variate VAR$(p)$ process and write it as
$$x_t = \sum_{i=1}^{p} A_i x_{t-i} + z_t,$$
where $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{n,t})'$. The error process $z_t$ is a white noise process with zero mean and diagonal covariance matrix $E(z_t z_t') = \Sigma_z$. As we have seen in Chapter (2.2.2), this process can be rewritten as a VAR(1) process in companion form:
$$X_t = \Lambda X_{t-1} + Z_t. \quad (14)$$
Estimates of the parameter matrix $\hat{\Lambda}$ and the error process $\hat{Z}_t$ can be obtained by the Ordinary Least Squares (OLS) method. Once this is done, we check the convergence condition in Theorem (5), i.e. we check that $\det(I_{np} - \hat{\Lambda}z) \neq 0 \;\forall z \in \mathbb{C}$ with $|z| \leq 1$. This is readily done by asserting that the eigenvalues of the matrix $\hat{\Lambda}$ all have modulus strictly less than one. Given that the assertion holds, we can express $X_t$ in the Wold representation
$$X_t = \sum_{i=0}^{\infty} \Psi_i Z_{t-i} = \sum_{i=0}^{\infty} \hat{\Lambda}^i \hat{Z}_{t-i}, \quad (15)$$
with the zeroth coefficient $\hat{\Lambda}^0 = I_{np}$. Furthermore, recalling the extended Wold representation from Chapter (2.3), we can express $X_t$ in terms of the estimates as
$$X_t = \sum_{j=1}^{\infty} X_t^{(j)} = \sum_{j=1}^{\infty} \sum_{k=0}^{\infty} \hat{\varphi}_k^{(j)} \hat{Z}_{t-k2^j}^{(j)}, \quad (16)$$
where
$$\hat{Z}_t^{(j)} = \frac{1}{\sqrt{2^j}}\left(\sum_{i=0}^{2^{j-1}-1} \hat{Z}_{t-i} - \sum_{i=0}^{2^{j-1}-1} \hat{Z}_{t-2^{j-1}-i}\right). \quad (18)$$
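The OLS estimation and stability check above can be sketched with plain NumPy (rather than the Statsmodels helpers mentioned earlier); the simulated VAR(1) and all variable names are our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.5, 0.1], [0.0, 0.4]])   # true VAR(1) coefficient matrix
x = np.zeros((2000, 2))
for t in range(1, 2000):
    x[t] = A @ x[t - 1] + 0.1 * rng.standard_normal(2)

# OLS estimate of Lambda in X_t = Lambda X_{t-1} + Z_t:
Y, X = x[1:], x[:-1]
Lam_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
Z_hat = Y - X @ Lam_hat.T                # estimated innovations for the DHT step

# Stability: all eigenvalues of Lambda-hat must have modulus < 1.
stable = bool(np.all(np.abs(np.linalg.eigvals(Lam_hat)) < 1))
print(stable)  # expect True for this stable simulation
```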
2.4.2 Nonparametric VMA(q) Method
In this section, we introduce a nonparametric method to decompose a multivariate covariance-stationary process into frequency-specific components representing its evolution in time and frequency. In contrast to the parametric approach, which is built on the VAR(p) model, the nonparametric method is built on the moving average process. In particular, we use the Discrete Haar Transform to transform the time series into its orthogonal frequency components.
Consider a multivariate time series $\{X_t\}$; we first construct its moving averages
$$\pi_t^{(j)} = \frac{1}{2^j} \sum_{i=0}^{2^j - 1} X_{t-i}.$$
In particular, we note $\pi_t^{(j)} = X_t$ for $j = 0$. We then define the $j$-th frequency component of $X_t$ to be
$$X_t^{(j)} = \pi_t^{(j-1)} - \pi_t^{(j)}.$$
For a fixed $J \in \mathbb{Z}$, $J \geq 1$, we have
$$X_t = \sum_{j=1}^{J} X_t^{(j)} + X_t^{\mathrm{res}},$$
where the residual term $X_t^{\mathrm{res}} = \pi_t^{(J)}$.
In order to gain some intuition, we observe firstly that the moving average $\pi_t^{(j)}$ is a $2^j$-period moving average of $X_t$ and satisfies the recursion
$$\pi_t^{(j)} = \frac{\pi_t^{(j-1)} + \pi_{t-2^{j-1}}^{(j-1)}}{2}.$$
Now, since $X_t^{(j)}$ is the difference between the moving averages of length $2^{j-1}$ and $2^j$, it captures the fluctuations that are included in the $2^{j-1}$ moving average but not in the $2^j$ moving average. To be more explicit, $X_t^{(j)}$ captures the fluctuations whose half-life belongs to the time interval $[2^{j-1}, 2^j)$. Hence, the residual term $\pi_t^{(J)}$ captures the remaining fluctuations with half-life of at least $2^J$ periods.
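A minimal NumPy sketch of this nonparametric decomposition (the helper name `spectral_components` is ours): it drops the first $2^J - 1$ observations so that every moving average is well defined, and the telescoping sum of components plus residual recovers the original series:

```python
import numpy as np

def spectral_components(x, J):
    """pi^{(j)}_t is the 2^j-period moving average of x,
    X^{(j)}_t = pi^{(j-1)}_t - pi^{(j)}_t, residual = pi^{(J)}_t.
    Returns (components, residual), aligned from index 2^J - 1 on."""
    T = len(x)
    start = 2 ** J - 1                 # burn-in so all averages exist
    pi = [x[start:]]                   # pi^{(0)}_t = x_t
    for j in range(1, J + 1):
        w = 2 ** j
        pi.append(np.array([x[t - w + 1 : t + 1].mean() for t in range(start, T)]))
    comps = [pi[j - 1] - pi[j] for j in range(1, J + 1)]
    return comps, pi[J]

rng = np.random.default_rng(2)
x = rng.standard_normal(256)
comps, resid = spectral_components(x, J=3)
recon = sum(comps) + resid             # telescoping: equals x (after burn-in)
```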
2.4.3 Scale Interpretation and Decomposition Example
In this section, we illustrate the spectral decomposition with an example. In particular, we implement the nonparametric method on the monthly returns of the S&P 500 index. To interpret the decomposition, we present the scale interpretation in Table (1). We remark that the base can be replaced by an arbitrary integer greater than one; for instance, one can change the base to 5 for daily data, and the cycle intervals become [1, 5), [5, 25), . . . days.
Scale    Cycle (period)
j = 1    [1, 2)
j = 2    [2, 4)
j = 3    [4, 8)
j = 4    [8, 16)
j = 5    [16, 32)
j = 6    [32, 64)
...      ...
j = J    [2^J, ∞)
Table 1: Interpretation of the scale j. Recall that the frequency is the inverse of the period length; hence, larger j corresponds to lower frequency and vice versa. The maximum frequency resolution is not limited: if the historical data are short, a smaller J is preferred, while if one is interested in studying the characteristics of low-frequency components, one can fix a larger J. Furthermore, the period has the same unit as the data; for instance, if the data are daily, then the cycle is counted in days.
A scatter plot of the original data is presented in Figure (7). In order to preserve as much information as possible for the study of the components, the mean was not removed before the decomposition. It can be seen from the plot that there is some cyclic behaviour when the large spikes are disregarded.
Figure 7: Monthly return of the S&P 500 index before decomposition; the data cover May 1960 to December 2019.
3 Spectral Factor Model
In this section, we introduce and derive the spectral factor model presented by Bandi et al. (2019) [1]. We shall show that the spectral factor model provides a handy tool that allows systematic risk to vary across frequencies while remaining in the time domain.
First, let us recall the one-factor linear model for two covariance-stationary time series $Y_t, X_t$, which obeys
$$Y_t = \alpha + \beta X_t + Z_t, \quad (19)$$
where $Z_t \sim WN(0, \sigma^2)$ and $X_t$ is called the factor. The traditional beta is then given by
$$\beta = \frac{\mathrm{Cov}(X_t, Y_t)}{\mathrm{Var}(X_t)}.$$
3.1 Two-Component Decomposition
In order to gain some intuition behind the multi-component decomposition, we first decompose the factor model by splitting each time series into two orthogonal components: one component with business cycles longer than $p$ periods and one with cycles shorter than $p$ periods, where $p = 2^j$, $j \in \mathbb{Z}$, $j \geq 1$. To be more specific, we let $Y_t = Y_t^+ + Y_t^-$, where $Y_t^+$ is the component of $Y_t$ with economic cycles longer than $p$ periods and $Y_t^-$ is the component with business cycles shorter than $p$ periods; analogously, we decompose $X_t$ as $X_t = X_t^+ + X_t^-$.
Due to the orthogonality, we have
$$\mathrm{Cov}(X_t^+, X_t^-) = \mathrm{Cov}(Y_t^+, Y_t^-) = \mathrm{Cov}(X_t^+, Y_t^-) = \mathrm{Cov}(X_t^-, Y_t^+) = 0.$$
Hence, we can write Equation (19) as
$$Y_t = Y_t^+ + Y_t^- \qquad\qquad (20)$$
$$\;\; = \alpha + \beta^+ X_t^+ + \beta^- X_t^- + Z_t. \qquad (21)$$
Due to the orthogonality of the components of $\{X_t, Y_t\}$ and the linearity of the covariance, the partial effects $\beta^\pm$, called spectral betas, can be obtained either by projecting $Y_t$ on the orthogonal components of $X_t$, or by projecting the orthogonal components of $Y_t$ on the corresponding components of $X_t$. We proceed to show that the traditional $\beta$ is a superposition of the spectral betas:
$$\beta = \frac{\mathrm{Cov}(X_t, Y_t)}{\mathrm{Var}(X_t)} = \frac{\mathrm{Cov}(X_t^+ + X_t^-,\; Y_t^+ + Y_t^-)}{\mathrm{Var}(X_t)} = \frac{\mathrm{Cov}(X_t^+, Y_t^+) + \mathrm{Cov}(X_t^-, Y_t^-)}{\mathrm{Var}(X_t)} = \underbrace{\frac{\mathrm{Var}(X_t^+)}{\mathrm{Var}(X_t)}}_{\equiv w^+} \underbrace{\frac{\mathrm{Cov}(X_t^+, Y_t^+)}{\mathrm{Var}(X_t^+)}}_{\equiv \beta^+} + \underbrace{\frac{\mathrm{Var}(X_t^-)}{\mathrm{Var}(X_t)}}_{\equiv w^-} \underbrace{\frac{\mathrm{Cov}(X_t^-, Y_t^-)}{\mathrm{Var}(X_t^-)}}_{\equiv \beta^-} = w^+\beta^+ + w^-\beta^-,$$
where $w^+, w^- \in [0, 1]$ may be considered as the weights of the spectral betas; note that $w^+ + w^- = 1$.
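A quick numerical check of the identity $\beta = w^+\beta^+ + w^-\beta^-$: the two components below are made exactly orthogonal in sample so the identity holds exactly (all variable names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
xp = rng.standard_normal(T)
xm = rng.standard_normal(T)
# Centre and enforce exact sample orthogonality of the two components:
xp -= xp.mean(); xm -= xm.mean()
xm -= (xm @ xp / (xp @ xp)) * xp

beta_p, beta_m = 2.0, 0.5
x = xp + xm
y = beta_p * xp + beta_m * xm            # Y built from the two components

cov = lambda a, b: a @ b / T             # components are mean-zero
beta = cov(x, y) / cov(x, x)             # traditional beta
w_p = cov(xp, xp) / cov(x, x)            # spectral weights
w_m = cov(xm, xm) / cov(x, x)
print(np.isclose(beta, w_p * beta_p + w_m * beta_m))  # True
```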
3.2 Multi-Component Decomposition
Analogous to the two-component decomposition, we can decompose the single factor model into multiple components. Consider a zero-mean covariance stationary process $X_t$ with extended Wold representation $X_t = \sum_{j=1}^{\infty} X_t^{(j)}$; Equation (19) can then be written as
$$Y_t = \alpha + \beta X_t + Z_t = \alpha + \sum_{j=1}^{\infty} \beta_j X_t^{(j)} + Z_t,$$
where the index $j$ is associated with a specific frequency range and the components $X_t^{(j)}$ are orthogonal to one another.
Moreover, if $Y_t$ is also a zero-mean covariance-stationary process, we can project $Y_t$ orthogonally onto the same subspaces as the orthogonal components of $X_t$, i.e.
$$Y_t = \sum_{j=1}^{\infty} Y_t^{(j)},$$
such that
$$\mathrm{Cov}(X_t^{(i)}, X_t^{(j)}) = \mathrm{Cov}(Y_t^{(i)}, Y_t^{(j)}) = \mathrm{Cov}(X_t^{(i)}, Y_t^{(j)}) = 0 \quad \text{if } i \neq j,$$
for all $i, j \in \mathbb{Z}^+$. We can then decompose $\beta$ as
$$\beta = \frac{\mathrm{Cov}(X_t, Y_t)}{\mathrm{Var}(X_t)} = \frac{\mathrm{Cov}(X_t^{(1)} + X_t^{(2)} + \cdots,\; Y_t^{(1)} + Y_t^{(2)} + \cdots)}{\mathrm{Var}(X_t)} = \frac{\mathrm{Cov}(X_t^{(1)}, Y_t^{(1)}) + \mathrm{Cov}(X_t^{(2)}, Y_t^{(2)}) + \cdots}{\mathrm{Var}(X_t)} = \underbrace{\frac{\mathrm{Var}(X_t^{(1)})}{\mathrm{Var}(X_t)}}_{\equiv w_1} \underbrace{\frac{\mathrm{Cov}(X_t^{(1)}, Y_t^{(1)})}{\mathrm{Var}(X_t^{(1)})}}_{\equiv \beta_1} + \underbrace{\frac{\mathrm{Var}(X_t^{(2)})}{\mathrm{Var}(X_t)}}_{\equiv w_2} \underbrace{\frac{\mathrm{Cov}(X_t^{(2)}, Y_t^{(2)})}{\mathrm{Var}(X_t^{(2)})}}_{\equiv \beta_2} + \cdots = w_1\beta_1 + w_2\beta_2 + \cdots = \sum_{j=1}^{\infty} w_j \beta_j,$$
where $w_j \in [0, 1]$, $\forall j \in \mathbb{Z}^+$, are the spectral weights satisfying the condition $\sum_{j=1}^{\infty} w_j = 1$, and we obtain the analogous definition of the spectral betas as in the two-component case, namely
$$\beta_j = \frac{\mathrm{Cov}(X_t^{(j)}, Y_t)}{\mathrm{Var}(X_t^{(j)})} = \frac{\mathrm{Cov}(X_t^{(j)}, Y_t^{(j)})}{\mathrm{Var}(X_t^{(j)})}. \quad (22)$$
To summarise: the beta of any bivariate zero-mean covariance stationary process $\{X_t, Y_t\}$ that has a Wold representation can be decomposed into frequency-specific spectral betas such that
$$\beta = \sum_{j=1}^{\infty} w_j \beta_j, \quad (23)$$
where the spectral weights satisfy
$$w_j = \frac{\mathrm{Var}(X_t^{(j)})}{\mathrm{Var}(X_t)}, \quad \text{such that} \quad \sum_{j=1}^{\infty} w_j = 1. \quad (24)$$
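The multi-component identities (23) and (24) can likewise be verified numerically. Below, exactly orthogonal columns from a QR factorisation stand in for the components $X_t^{(j)}$ (an illustration under that assumption, not the actual Wold components; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)
T, J = 500, 4
# QR gives exactly orthogonal columns to play the role of X^{(j)}:
Q, _ = np.linalg.qr(rng.standard_normal((T, J)))
comps = [Q[:, j] * (j + 1.0) for j in range(J)]   # different variance per scale
betas_true = np.array([1.5, -0.5, 2.0, 0.25])

x = sum(comps)
y = sum(b * c for b, c in zip(betas_true, comps))

cov = lambda a, b: a @ b / T
w = np.array([cov(c, c) for c in comps]) / cov(x, x)       # spectral weights
beta_j = np.array([cov(c, y) / cov(c, c) for c in comps])  # spectral betas
beta = cov(x, y) / cov(x, x)                               # traditional beta

print(np.isclose(w.sum(), 1.0), np.isclose(beta, w @ beta_j))  # True True
```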
In practice, we are far from needing an infinite number of components: partly because the computation is costly, and partly because such a fine division of the frequency space does not necessarily yield information of interest. Suppose now that we wish to decompose our factor model into $J$ frequency ranges, $J \in \mathbb{Z}$, $J \geq 1$, and let the $(J+1)$-th range denote the residual in the frequency space; we can then write the factor model as
$$Y_t = \alpha + \sum_{j=1}^{J} \beta_j X_t^{(j)} + \beta_{J+1}\Big(X_t - \sum_{j=1}^{J} X_t^{(j)}\Big) + Z_t \quad (25)$$
$$\;\;\; = \alpha + \sum_{j=1}^{J} \beta_j X_t^{(j)} + \beta_{J+1} X_t^{(J+1)} + Z_t, \quad (26)$$
where we have defined the frequency residual $X_t^{(J+1)}$ as
$$X_t^{(J+1)} \equiv X_t - \sum_{j=1}^{J} X_t^{(j)}. \quad (27)$$
By definition, the residual term $X_t^{(J+1)}$ is orthogonal to $X_t^{(j)}$, $j \in \{1, \ldots, J\}$.
4 Spectral Portfolio Optimisation Framework
The single factor model presented in the previous sections can be extended to a multi-factor model by induction. In this section, we consider a portfolio optimisation problem implementing the spectral factor model, to outline the framework of spectral portfolio optimisation.
4.1 Markowitz Optimisation Theory
Consider a portfolio with $N$ assets and let $R_i$ denote the return of asset $i$ during the time interval $[t-1, t)$, with $R = (R_1, \ldots, R_N)'$, $\mu = E[R]$ and $\Sigma = \mathrm{Cov}(R, R')$. The portfolio optimisation problem is to find the optimal weight $w_i$ of each asset $i = 1, \ldots, N$ such that the portfolio return is maximised. It is common to impose $\sum_i w_i = 1$. Recall that we can go both long and short on the assets if $w_i \in [-1, 1]$. For a portfolio that does not allow short sales, one can constrain the weights such that $w_i \in [0, 1]$.
In this setup, the portfolio return $R_p$ is also a random variable, with mean $E[R_p] = \mu'w$ and variance $\mathrm{Var}(R_p) = w'\Sigma w$. Given a target portfolio return $\mu^* \in \mathbb{R}$, by Markowitz portfolio theory [14] the optimal portfolio weights can be obtained by solving the following quadratic problem:
$$\hat{w} = \arg\min_{w}\; w'\Sigma w \quad \text{subject to} \quad \mu'w = \mu^*,\; w \geq 0 \;\text{and}\; \mathbf{1}'w = 1, \quad (28)$$
where the condition $w \geq 0$ applies to portfolios that do not allow short-selling. As we can see from the problem setup, for a fixed target portfolio return $\mu^*$, the portfolio optimisation problem has two key inputs, namely the covariance matrix $\Sigma$ and the expected asset return vector $\mu$.
When solving the above problem with a quadratic programming package, it is rewritten in the standard quadratic form:
$$\arg\min_{w}\; w'\Sigma w, \qquad \begin{bmatrix} \mu' \\ \mathbf{1}' \end{bmatrix} w = \begin{bmatrix} \mu^* \\ 1 \end{bmatrix}, \qquad -I_N w \leq 0,$$
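As an illustration of solving (28), here is a sketch using `scipy.optimize.minimize` with the SLSQP method on toy inputs; the thesis does not prescribe a particular solver, and all numbers below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Toy inputs: expected returns mu and covariance Sigma for 3 assets.
mu = np.array([0.05, 0.08, 0.12])
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu_target = 0.08

cons = [{"type": "eq", "fun": lambda w: mu @ w - mu_target},  # mu'w = mu*
        {"type": "eq", "fun": lambda w: w.sum() - 1.0}]       # 1'w = 1
bounds = [(0.0, 1.0)] * 3                                     # w >= 0: no shorting
res = minimize(lambda w: w @ Sigma @ w, x0=np.ones(3) / 3,
               bounds=bounds, constraints=cons, method="SLSQP")
w_opt = res.x   # minimum-variance weights for the target return
```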
4.2 Spectral Risk
We proceed further to risk decomposition in factor models. To be more specific, we present the framework for decomposing the risk covariance $\Sigma$ in a spectral factor model with an arbitrary number of factors. Consider a factor model with $M$ factors denoted $f_m$, $m \in \{1, \ldots, M\}$; the return of asset $i$ is then
$$R_i = \alpha_i + \sum_{m=1}^{M} \beta_{i,m} f_m + \epsilon_i, \quad i \in \{1, \ldots, N\}, \quad (29)$$
where $\beta_{i,m}$ is the sensitivity of asset $i$ to factor $f_m$ and $\epsilon_i$ is the idiosyncratic return of asset $i$. We can then write the returns of all assets in the portfolio as
$$\begin{pmatrix} R_1 \\ \vdots \\ R_N \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{pmatrix} + \begin{pmatrix} \beta_{1,1} & \cdots & \beta_{1,M} \\ \vdots & \ddots & \vdots \\ \beta_{N,1} & \cdots & \beta_{N,M} \end{pmatrix} \begin{pmatrix} f_1 \\ \vdots \\ f_M \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_N \end{pmatrix},$$
or in compact matrix form
$$R = \alpha + \beta f + \epsilon. \quad (30)$$
It is common to assume the asset-specific shocks to be zero-mean and cross-sectionally uncorrelated, i.e. $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ if $i \neq j$; hence $E[\epsilon\epsilon'] = D$ is a diagonal matrix. Let $V$ be the covariance matrix of the factors; by defining $\beta$ as the least-squares projection $\beta = \mathrm{Cov}(R, f)V^{-1}$, it follows that $\mathrm{Cov}(f, \epsilon) = 0$. The covariance of the returns is then
$$\Sigma = \mathrm{Cov}(R, R') = \beta V \beta' + D. \quad (31)$$
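Equation (31) can be illustrated by simulation: with factors of unit variance ($V = I$) and homoskedastic idiosyncratic noise, the model covariance $\beta V \beta' + D$ matches the sample covariance of simulated returns for large $T$ (toy numbers, our own):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M, T = 4, 2, 20000
beta = rng.standard_normal((N, M))
f = rng.standard_normal((T, M))            # factors, so V = I here
eps = 0.3 * rng.standard_normal((T, N))    # idiosyncratic shocks, D = 0.09 I
R = f @ beta.T + eps

Sigma_model = beta @ beta.T + 0.09 * np.eye(N)   # beta V beta' + D
Sigma_sample = np.cov(R, rowvar=False)           # empirical Cov(R, R')
err = np.max(np.abs(Sigma_model - Sigma_sample)) # shrinks as T grows
```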
Suppose we would like to decompose the factors into $J$ orthogonal frequency-specific components; then we can write Equation (30) as
$$R = \alpha + \sum_{j=1}^{J} \left(\beta^{(j)} f^{(j)} + \epsilon^{(j)}\right). \quad (32)$$
Due to the orthogonality of the components and the linearity of the covariance, the covariance matrix of the returns becomes
$$\Sigma = \sum_{j=1}^{J} \Sigma^{(j)}, \quad \text{where} \quad \Sigma^{(j)} = \beta^{(j)} V^{(j)} \beta^{(j)\prime} + D^{(j)}. \quad (33)$$
From the relationships $\sum_{j=1}^{J} \beta^{(j)} f^{(j)} = \beta f$ and $\sum_{j=1}^{J} \epsilon^{(j)} = \epsilon$, we see that the
4.3 Asset Pricing Factor Models
In the previous section, we presented the spectral risk decomposition in a factor model. However, we did not specify the factors $f_m$ in Equation (29). In this section, we introduce some of the most common asset pricing factor models. In general, an asset pricing model specifies two sources of volatility for the return of an asset: one systematic and one unsystematic. The systematic uncertainty is represented by the non-diversifiable risk, while the unsystematic uncertainty is expressed by an asset-specific volatility that is diversifiable. Ideally, a reasonable investor would like to diversify her portfolio such that the unsystematic risk is eliminated and her exposure to the systematic risk is minimal. However, systematic risk is not directly observable, which leads to the search for appropriate approximations. These approximative observable series are commonly known as factors. The coefficient $\beta$ before a factor indicates the degree to which an asset's return is correlated with the factor; in other words, $\beta$ is simply an indicator of an asset's vulnerability to the systematic risk. If $\beta = 1$ for an asset, the asset is said to be in phase with the overall market. If $\beta > 1$, the asset is more volatile than the market; conversely, the asset is less risky than the market if $\beta < 1$. In general, factors can be categorised into macroeconomic, statistical and fundamental factors [13]. Statistical factors are found by statistical methods such as factor analysis and principal component analysis. Here, we restrict ourselves to standard macroeconomic and fundamental factor models.
4.3.1 Macroeconomic Factor Models
A macroeconomic factor, as the name indicates, is a realised value of a macro variable. Examples of macroeconomic factors are market risk, inflation, industrial production, interest rates and unemployment rates [12]. The simplest macroeconomic factor models are the Single Index Market (SIM) model developed by Sharpe in 1963 [31] and the well-known Capital Asset Pricing Model (CAPM). Mathematically, the SIM model is the factor model in Equation (29) with $M = 1$; under the same mathematical assumptions, the return of asset $i$ between time $t-1$ and $t$ is
$$R_i = \alpha_i + \beta_i R_m + \epsilon_i,$$
where
• Ri : The excess return of asset i
• Rm : The excess return of the market
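For illustration, the SIM coefficients $\alpha_i, \beta_i$ can be estimated by ordinary least squares; a small simulated sketch (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 2000
rm = 0.02 * rng.standard_normal(T)            # market excess return
beta_true, alpha_true = 1.3, 0.001
ri = alpha_true + beta_true * rm + 0.01 * rng.standard_normal(T)

# OLS for alpha_i and beta_i in R_i = alpha_i + beta_i R_m + eps_i:
X = np.column_stack([np.ones(T), rm])
alpha_hat, beta_hat = np.linalg.lstsq(X, ri, rcond=None)[0]
```

With $\hat\beta_i > 1$, this simulated asset would be classified as more volatile than the market, matching the interpretation of $\beta$ above.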