A Multi-Factor Stock Market Model with Regime-Switches, Student's T Margins, and Copula Dependencies


Linköping University | Department of Management and Engineering
Master's thesis, 30 ECTS | Applied Physics and Electrical Engineering - Applied Mathematics
November 20, 2017 | ISRN: LIU-IEI-TEK-A--17/02968--SE

A Multi-Factor Stock Market Model

with Regime-Switches, Student's T

Margins, and Copula Dependencies

Authors: Adnan Berberovic, Alexander Eriksson
Supervisors: Jörgen Blomvall, Pontus Schröder
Examiner: Mathias Henningsson
Linköping University, SE-581 83 Linköping, Sweden, +46 (0)13 28 10 00


The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for their own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owners. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the authors have the right to be mentioned when their work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/. © Adnan Berberovic and Alexander Eriksson.


Abstract

Investors constantly seek information that provides an edge over the market. One of the conventional methods is to find factors which can predict asset returns. In this study we improve the Fama and French Five-Factor model with Regime-Switches, Student's t distributions and copula dependencies. We also add price momentum as a sixth factor and add a one-day lag to the factors. The Regime-Switches are obtained from a Hidden Markov Model with conditional Student's t distributions. For the return process we use factor data as input, Student's t distributed residuals, and Student's t copula dependencies. To fit the copulas, we develop a novel approach based on the Expectation-Maximisation algorithm. The results are promising, as the quantiles for most of the portfolios show a good fit to the theoretical quantiles. Using a sophisticated Stochastic Programming model, we back-test the predictive power over a 26-year period out-of-sample. Furthermore, we analyse the performance of different factors during the different market regimes.

Acknowledgements

First and foremost, we want to express our sincere gratitude to Jörgen Blomvall at the Department of Management and Engineering at Linköping University for discussing and providing ideas, helping with some of the technical details, and being available whenever guidance was asked for. Second, we want to thank Pontus Schröder and Carnegie Investment Bank for providing us with ideas, information and the necessary data. A lot of interesting discussions were held with Pontus both during planning, from which an idea sprung, and during the working phase of the thesis, which resulted in a very interesting and fun project overall. Finally we, of course, want to express our thanks to all the researchers who contributed to the body of literature in the field and served as an inspiration.


Contents

Nomenclature

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Delimitations

2 Methodology
  2.1 Regime-Switching
    2.1.1 Model selection
  2.2 Factors
    2.2.1 Momentum
  2.3 Portfolio Allocation
  2.4 Avoiding over-fitting

3 Theory
  3.1 Probability and Statistics
    3.1.1 Normal distribution
    3.1.2 Student's t distribution
    3.1.3 Cumulative distribution function
    3.1.4 Parameter estimation of probability distributions
  3.2 Copulas
    3.2.1 Copula models
    3.2.2 Copula selection
  3.3 Stochastic Processes
    3.3.1 Auto-regressive process
    3.3.2 Moving average process
    3.3.3 ARMA-process
  3.4 Markov Chains
  3.5 Hidden Markov Models
    3.5.1 Forward-Backward algorithm
    3.5.2 The Viterbi algorithm and other alternatives
    3.5.3 Parameter estimation
    3.5.4 Baum-Welch
    3.5.5 Other algorithms
    3.5.6 Model selection
    3.5.7 Pseudo-residuals
  3.6 Factor model
  3.7 Asset allocation
    3.7.1 Markowitz' framework
    3.7.2 Mean-variance with Quadratic transaction costs
    3.7.3 Stochastic Programming
  3.8 Numerical methods
    3.8.1 Newton Methods
    3.8.2 Primal-dual interior point method
  3.9 Evaluation
    3.9.1 Relative return measures
    3.9.2 Information measures
    3.9.3 Timing tests

4 Implementation
  4.1 Data
  4.2 Hidden Markov Model
  4.3 Return process
    4.3.1 Copula inference
    4.3.2 Random initial guesses
  4.4 Portfolio optimisation
  4.5 Back-test
  4.6 Evaluation
  4.7 R Programming Language

5 Results and analysis
  5.1 Hidden Markov Model
  5.2 Return Process
  5.3 Portfolio Allocation

6 Discussion
  6.1 Recommendations for Improvements
  6.2 Ethical aspects

Bibliography


Nomenclature

In order of appearance

EM  Expectation Maximisation.
AIC  Akaike Information Criterion.
BIC  Bayesian Information Criterion.
r  Factor returns.
f_i  Asset factors.
ε, ξ  Residual terms, usually associated with a subscript.
µ  Mean reversion level or first moment.
Φ  Factor coefficient matrix, associated with factors f.
L  Factor loading matrix, associated with factors f.
w.r.t.  With respect to.
VAR  Vector auto-regressive.
p  Dimension of the probability distributions.
y  Input data.
θ  Parameter space for probability distributions.
f(y; θ), pdf  Probability density function.
Σ  Matrix of the second moment, or scale parameter for a distribution.
δ(·)  The squared Mahalanobis distance of a distribution.
ν  Degrees of freedom.
Γ(·)  The Gamma function.
F(y; θ), cdf  Cumulative distribution function.
MLE  Maximum Likelihood Estimation.
L(θ)  Likelihood function.
θ̂_MLE  Maximum Likelihood Estimate.
l(θ)  Log-likelihood function.
log(·)  Natural logarithm.
C(·)  Denotes a copula function.
U_n  Standard uniformly distributed random variable (U(0, 1)).
Φ(·)  p-variate standard normal cdf.
W  Correlation matrix.
|·|  Determinant operator if the argument is a matrix, cardinality operator if the argument is a set.
c(·)  Copula density.
A^T  Transpose of a matrix A.
I  Identity matrix.
t(·)  Standardised p-variate Student's t cdf.
φ(·)  Generator function.
IFM  Inference for Margins.
AR(p)  Auto-regressive process of order p.
MA(q)  Moving average process of order q.
ARMA(p, q)  Auto-regressive moving average process of order p, q.
HMM  Hidden Markov Model.
P  Transition probability matrix for a Markov Chain.
p_{i,j}  Elements in P denoting transition probabilities from state i to j.
S_t  The observed state in a Markov Chain at time t.
n  Number of states in the Markov Chain.
P(X|Y)  Probability of X, given Y.
Π  State space distribution of a Markov chain.
σ  The standard deviation of a uni-variate distribution.
Θ  The space of all possible parameter values for θ.
m  Number of points in time at which observations have been made.
Y_k^m  Y_k, ..., Y_m, a set of historical output data.
α(·)  Forward density.
β(·)  Backward density.
Q(·)  Auxiliary function.
Ψ(·)  Digamma function.
k  Number of parameters, |θ|.
CAPM  Capital Asset Pricing Model.
APT  Arbitrage Pricing Theory.
R(t)  Portfolio return at time t.
E[·]  Expected value operator.
Var[·]  Variance operator.
x(t)  Portfolio weights at time t.
γ  Risk aversion parameter.
U(·)  Utility function.
X(t)  Investor's wealth at time t.
SP  Stochastic Programming.
J(·)  Jacobian matrix function.
PDIP  Primal-dual interior point.
⪯, ⪰  Element-wise inequality operators.
1  Vector of ones.
L  Lagrangian function.
KKT  Karush-Kuhn-Tucker.
◦  Hadamard product (element-wise multiplication).
diag(v)  A matrix with the vector v on the diagonal in an otherwise zero matrix.
α  Excess return.
β  Market beta.
R_f  Risk-free return.
R_a  Return on asset a.
σ_D  Downside volatility.
TI  Treynor index.
E-G  Elton-Gruber.
CML  Capital market line.
TE  Tracking error.
IR  Information ratio.
OT  Observation time.
c_i  Asset prices in scenario i.
h  Number of assets, including cash.
TC_i(t)  Transaction cost given scenario i at time t.
ρ  Transaction cost parameter.
t_c  t-statistic at confidence level c.
R  R programming language.
Mkt-RF  Market-minus-risk-free rate.
SMB  Small-minus-Big (Size).
HML  High-minus-Low (Book-to-Market).
RMW  Robust-minus-Weak (Operating Profitability).
CMA  Conservative-minus-Aggressive (Investment).


Chapter 1

Introduction

Practitioners in the field of finance have long talked of bull and bear market phases whereas, for quite some time, the academic field used linear models for the returns of assets. Hamilton (1989) proposed that economic cycles can be modelled with a Markov-Switching Model, which could capture the non-linearity of Gross Domestic Product (GDP) time series. This worked remarkably well. Since then a body of literature has developed that seeks to explain the behaviour of asset returns using similar approaches.

Rydén, Teräsvirta, and Åsbrink (1998) evaluate the ability of a Hidden Markov Model to capture the stylized facts of daily return series found in Granger and Ding (1994) and Granger and Ding (1995). They find that a Hidden Markov Model with conditional normal distributions, which constitutes a mixture model, is able to capture the stylized facts well. The only stylized fact which is not easily captured by the model is the slowly decaying auto-correlation function of absolute returns.

Ang and Bekaert (2002) apply a Regime-Switching Model to portfolio allocation, and study the impact of different regimes on investment decisions. They find that it can be costly to ignore regimes. Ang and Bekaert (2003) find that regime-switching asset allocation strategies that are not limited to equities, but can also invest in other assets such as bonds, outperform a non-regime-switching mean-variance portfolio out-of-sample.

Bulla (2011) ventures outside of the commonly used conditional Gaussian models and instead studies a model with conditional Student's t distributions. He finds that models with at least one Student's t component are able to capture the stylized facts in Rydén, Teräsvirta, and Åsbrink (1998) at least as well as, or better than, the Gaussian models. It is also found that these models are preferred to the standard Gaussian models by the Bayesian Information Criterion.

In Bulla et al. (2011) Regime-Switching Models are used to evaluate a simple investment strategy: either invest all of the capital in an index, or invest all of the capital in cash. This is then compared to the performance of the index. The result, out-of-sample and accounting for transaction costs, is that the simple Regime-Switching strategy outperforms the indices by up to two percentage points. The standard deviation is significantly lower for all indices, and for NASDAQ the standard deviation is more than halved.

Gatumel and Ielpo (2014) study the number of regimes in different asset classes. They find that for exchange rates, commodities, US investment grade bonds and EMU grade bonds, two states are sufficient to accurately describe the returns of the assets. However, for equities three states are needed, and for high-yield bonds five states are needed. They find that forecast performance improves when the number of states is not assumed ex ante to be two. Jiang and Fang (2015) use a Bayesian Markov Model for monthly data from the S&P 500 starting in 1926 and find that the Markov Model that best fits the data uses four different states. Ma et al. (2011) use a Regime-Switching Multi-Factor Model to optimise the alpha of a market-neutral portfolio of sector exchange traded funds. Using the Bayesian Information Criterion, they find that three regimes are optimal. The strategy outperforms the benchmark, providing higher returns and lower volatility, mostly during the bull market, but there is also a small out-performance during the transition period. They do, however, not have any bear market periods in the out-of-sample period.

Takahiro and Naoki (2015) develop a Regime-Switching Three-Factor Model, with factors according to Fama and French (1993). They also adapt it to the trading model developed in Gârleanu and Pedersen (2013), finding an optimal portfolio with regard to trading costs as a combination of the current portfolio and the no-transaction-cost optimal portfolio. They assume that the states of the market can be known ex post with certainty, which is realistic because in practice the smoothed probabilities are very often close to 0 or 1. This allows them to find the historical performance of different factors during different regimes, which can then be used to forecast future returns.

1.1 Background

The idea for the thesis stems from a cooperation with Carnegie Investment Bank AB. Carnegie provides services, such as analysis and order execution, for large institutional clients. Carnegie believes that investigating the performance of different styles of equities during different phases of the world market may provide information for finding great investment strategies. Carnegie proposed that a comprehensive factor study be conducted with respect to market regimes, in order to mathematically capture the process of stock factors in different regimes, and has been generous in providing us with data as well as valuable advice.

1.2 Purpose

Our purpose is to evaluate Regime-Switching Factor Models and the performances of different factors in different market regimes, as well as their contribution to equity returns.

We contribute to the body of literature by building upon the work in Takahiro and Naoki (2015): we replace the conditional Gaussian distribution with a Student's t distribution, use more factors, and conduct the study on a longer time period using daily instead of weekly data.

1.3 Delimitations

The studied asset classes are restricted to equities. Furthermore, equities are grouped into portfolios, in order to deal with the large size of the universe.

The studied equity universe is gathered from Kenneth French's Database¹ as factor portfolios, based on each factor's performance as well as the market cap for each equity on NYSE, AMEX and NASDAQ, with daily frequency from 1972-01-03 to 2017-03-22.

To find a suitable Regime-Switching Model, we use S&P 500 data, daily from 1971, which is gathered from Bloomberg.


Chapter 2

Methodology

The methodology in this study can briefly be summarised as follows: first, we implement a Hidden Markov Model and use it to split the studied period into different regimes; second, we evaluate the performances of different factors during these regimes using a factor model; third, we evaluate the predictive power of the earlier results by constructing an optimal portfolio for each time point at which an investment decision can be made, and comparing its performance to the S&P 500 index, as illustrated by the flowchart in figure 2.1.

Figure 2.1: Basic overview of the framework (Regime-Switching component → Factor model → Asset allocation → Result evaluation).

In this section we first briefly introduce the Regime-Switching concept, then the Factor Model and the associated factors, and finally the portfolio allocation strategy.

2.1 Regime-Switching

The Regime-Switching model in this thesis will focus on finding the most probable regime or, more commonly, state, that is currently observed. A state in this case models the kind of market behaviour we currently observe, and with this we also find the probabilities that each state is the one currently observed.


Given a history of an observed variable, and some Regime-Switching Model with associated parameters, there is the problem of finding the unobserved states for that history. Since the states are unobserved, they can generally not be found with certainty; instead we find the so-called smoothing probabilities, and the process of finding them is called smoothing. When doing this we consider the entire history of the observed variable, which is natural because we want to derive, ex post, as much information as possible from the data set. The most used algorithm for this is called the Forward-Backward algorithm, and it finds the most likely state at any time. There is also the Viterbi algorithm, which finds the most likely sequence of states. In order to get estimates of the transition probabilities, statistical inference is used. Algorithms require an ex ante assumption about the number of states, and will find the properties of the states by solving a problem equivalent to, or an approximation of, the maximum likelihood estimation, which is impractical to estimate directly.

One such algorithm is the Baum-Welch algorithm, an Expectation-Maximisation (EM) algorithm adapted to Hidden Markov Models, which creates an expectation function for the log-likelihood, which it then maximises. Another approach is direct numerical maximisation, which includes Newton-type methods. There are also hybrid approaches, which use an EM algorithm until some stopping criterion is fulfilled, and then use a fast method of direct numerical maximisation. (Bulla, 2011)

2.1.1 Model selection

When there are multiple models to select from, such as models with a different number of states, we need a means of determining which one is the best. Often, likelihood maximisation is used for model selection. However, this fails when we are considering models with differing numbers of parameters, since it is likely to over-fit. Instead, we need some other means of determining which model is best. There are two primary candidates, which are based on likelihood but give a penalty proportional to the number of parameters: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).

AIC tries to minimise divergence from the true model and BIC tries to maximise the posterior probability of the model. BIC tends to punish complexity, as measured by the number of parameters, more harshly. For those who consider the "true model" to be infinite-dimensional, AIC is the preferred measure, and for those who consider the "true model" to be finite-dimensional, BIC is preferred. (Yang, 2005)


2.2 Factors

Louis K. C. Chan and Lakonishok (1998) argue in their research that by finding a small set of factors that explain most of the common shared variation, one can use these to find risk premia. Basing our approach on this statement and all of the above, we expect that, for each regime, certain factors will clearly outperform other factors.

In Takahiro and Naoki (2015) the classical Fama-French three-factor model is used: Size, Value, and Market risk (Beta). In Fama and French (2015) this is extended to: Size, Book-to-Market, Operating Profitability, Investment, and Beta. We further expand on this by adding a sixth factor, Momentum, to our model.

2.2.1 Momentum

Jegadeesh and Titman (1993) performed one of the first investigations of price momentum and noticed an anomaly in buying stocks that are currently performing well and selling those that are not: momentum portfolios performed extraordinarily well, given that the stocks had already been performing well during the past six months. Chan, Jegadeesh, and Lakonishok (1996) and Muga and Santamaria (2007) take this discussion further and argue that momentum itself is one of the main anomalies that keeps contradicting the market efficiency theory proposed by Fama and French. Market efficiency implies that any pattern that yields abnormal returns will eventually fade, but that appears not to be the case with momentum. Strangely enough, this phenomenon remains a strong investment strategy today, which is why it is of great interest to research this factor and find any connection between it, market regimes, and returns.

2.3 Portfolio Allocation

Using the approach of Takahiro and Naoki (2015), we will find a target portfolio which maximises the expected returns obtained from the Regime-Switching Factor Model. We will then invest in an optimal portfolio which maximises the future utility. The assets that the model invests in are not the equities in the universe directly, but portfolios of them. The portfolios will be formed as bivariate sorts on size vs. the remaining factors, as well as sorts based on factor performance (split in percentiles of 0-30, 30-70 and 70-100), resulting in 24 portfolios, as shown in figure 2.2.


Figure 2.2: Visualisation of the portfolio sorts. Portfolios are bivariate sorts on size and one of Value, Investment, Profitability and Momentum, resulting in a total of 24 portfolios.

2.4 Avoiding over-fitting

When estimating parameters for a stochastic process you are bound to be subject to some degree of over-fitting, which always poses a problem. This is why it is important to separate the data set into two sets: one which is used for modelling and back-testing (in-sample) and one which is only used for back-testing (out-of-sample), that is, to test the significance of the results and see whether or not the proposed model does in fact suffer from a high degree of over-fitting.


Chapter 3

Theory

This section will provide the necessary theory required to develop a Hidden Markov Model for financial time series, fit a regime-switching factor model to this HMM, and explain how the results will be obtained.

We first introduce some basic probability theory as well as two distributions, the normal and the Student's t distribution. We then introduce some copula theory as well as some stochastic processes. The Forward-Backward algorithm, which is used to calculate the probabilities of being in a certain state at an earlier point in time, is then discussed in detail. This requires a fully specified model, but the algorithm is utilised in the Baum-Welch algorithm, which can be used to find the fully specified model.

We continue by defining the model for the returns, which depends on the current regime. This is followed by describing how we determine the parameters for the regime-switching return process. Finally we show the optimisation model that we will use to find optimal investments with transaction costs and no short selling using this regime-switching model, as well as comparing it to the corresponding optimal choices for a model without regime switching. This is done in order to evaluate the regime-switching multi-factor model.

3.1 Probability and Statistics

When working with Hidden Markov Models, the most common distribution used to model the outputs in each state is the normal distribution. Hidden Markov Models are a special case of finite mixture models, i.e. finite sums of some distribution. Finite mixture models are quite flexible, in the sense that we can use conditional distributions such as the normal distribution and still have a marginal distribution with skewness and kurtosis.

3.1.1 Normal distribution

The p-variate normal distribution is given by the probability density function (pdf)

f(y; \mu, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}\delta(y, \mu, \Sigma)}, \quad (3.1)

where µ is the mean, Σ is the covariance matrix and

\delta(y; \mu, \Sigma) = (y - \mu)^\top \Sigma^{-1} (y - \mu), \quad (3.2)

is the squared Mahalanobis distance, a generalisation of variance to multivariate distributions.

The distribution is commonly used. It has been employed in finance for a long time, and a lot of modelling has been done with it. However, it has been shown empirically that the normal distribution is not a good fit for financial time series, since the tail is not fat enough, as can be seen in figure 3.1. However, as a conditional distribution in finite mixture models it can produce a marginal distribution with an appropriately fat tail.

3.1.2 Student's t distribution

The p-variate Student's t distribution is given by

f(y; \mu, \Sigma, \nu) = \frac{\Gamma\left(\frac{\nu+p}{2}\right) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2}\, \Gamma\left(\frac{\nu}{2}\right) \left[1 + \frac{1}{\nu}\delta(y, \mu; \Sigma)\right]^{\frac{\nu+p}{2}}}, \quad (3.3)

where \Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx, or \Gamma(n) = (n - 1)! for positive integers n, is the gamma function, µ is the location parameter, Σ is the scaling parameter and ν is the degrees of freedom. If ν > 1, then µ is the mean of the distribution, and if ν > 2, then Σν/(ν − 2) is the covariance matrix. (Liu and Rubin, 1995)

The distribution is elliptic, like the normal distribution, and in fact

\lim_{\nu \to \infty} f(y; \mu, \Sigma, \nu) = (2\pi)^{-p/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}\delta(y, \mu; \Sigma)}, \quad (3.4)

which is the probability density function of the multivariate normal distribution. For finite ν the Student's t distribution has fatter tails than the normal distribution.



Figure 3.1: Comparison of the pdf of the normal distribution and some Student's t distributions with different degrees of freedom (df = 1, 3, 8, 30).

3.1.3 Cumulative distribution function

In addition to the pdf, which gives us the probability distribution, each p-variate probability distribution also has an associated cumulative distribution function (cdf), defined as

F(y; \theta) = \int_{-\infty}^{y_1} \cdots \int_{-\infty}^{y_p} f(y; \theta)\, dy_p \cdots dy_1 \quad (3.5)

where y = (y_1, \ldots, y_p) is the input and θ is the parameter set.

3.1.4 Parameter estimation of probability distributions

In order to fit collected data to a chosen model (probability distribution), it is necessary to compute the parameters θ that give the best fit of the data to the model. One of the most used methods for this is Maximum Likelihood Estimation (MLE). MLE is performed by maximising the so-called likelihood function L (which is chosen as the joint pdf of the model's distribution) with respect to the model parameters θ,

L(\theta) = \prod_{i=1}^{p} f_i(y_i; \theta), \quad (3.6)

\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} L(\theta). \quad (3.7)

However, this can be reduced to a much less computationally intensive problem by taking the logarithm of L; since the logarithm is a strictly increasing function, the optimal solution remains the same. We define the log-likelihood function simply as

l(\theta) := \log L(\theta) = \sum_{i=1}^{p} \log f_i(y_i; \theta) \quad (3.8)

and then solve

\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} l(\theta). \quad (3.9)
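
To make the estimation procedure concrete, the following is a minimal sketch of (3.8)-(3.9) in R (the language used in this thesis, see section 4.7), fitting a univariate location-scale Student's t distribution by numerical maximisation of the log-likelihood. The synthetic data and starting values are illustrative assumptions.

set.seed(1)
y <- rt(1000, df = 5) * 0.02               # synthetic "return" series

# Negative log-likelihood; sigma and nu are log-transformed so the
# optimiser can work on an unconstrained parameter space
neg_loglik <- function(par, y) {
  mu <- par[1]; sigma <- exp(par[2]); nu <- exp(par[3])
  # dt() is the standardised t density; rescale for location and scale
  -sum(dt((y - mu) / sigma, df = nu, log = TRUE) - log(sigma))
}

fit <- optim(c(mean(y), log(sd(y)), log(5)), neg_loglik, y = y)
c(mu = fit$par[1], sigma = exp(fit$par[2]), nu = exp(fit$par[3]))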

3.2 Copulas

Working with multivariate distributions is harder than working with univariate distributions. Therefore we introduce copula theory, which allows us to exploit that univariate distributions are often much easier to deal with, and creates a bridge between the univariate distributions and a multivariate distribution. A copula C : [0, 1]^p → [0, 1] is a function that describes the co-dependence between random variables by allowing us to express a joint probability distribution as a function of its marginals (Cherubini, Luciano, and Vecchiato, 2004). Copulas are designed as a joint distribution of uniforms,

C(U_1, \ldots, U_p) = C(F_1(X_1), \ldots, F_p(X_p)) \quad (3.10)

where U_n = F_n(X_n) are standard uniformly distributed random variables (U(0, 1)) for any continuous random variable X_n, according to the probability-integral transform, for which a proof follows. Let Y = F_X(X), where X is any continuous random variable. Then

F_Y(y) = P(Y \le y) = P(F_X(X) \le y) = P(X \le F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y \quad (3.11)

which is the cdf of a standard uniform distribution (Y ∼ U(0, 1)). Additionally, there are some requirements for a function C(U_1, \ldots, U_p) to be a copula:


1. The function must be grounded, i.e. C(U_1, \ldots, 0, \ldots, U_p) = 0.

2. The function satisfies C(1, \ldots, U_n, \ldots, 1) = U_n. This requirement means that, if all events except for X_n have happened, then the joint probability is of course U_n.

3. C(U_1, \ldots, U_p) must be p-increasing.¹ (Cherubini, Luciano, and Vecchiato, 2004)

Furthermore, a copula always lies within the Fréchet bounds, that is,

\max\left(\sum_{i=1}^{p} U_i - p + 1,\; 0\right) \le C(U_1, \ldots, U_p) \le \min(U_1, \ldots, U_p). \quad (3.12)

These upper and lower bounds are called the maximum and minimum copula.

Let x = (x_1, \ldots, x_p) and u = (u_1, \ldots, u_p). Sklar's theorem states that every joint cumulative distribution function can be expressed as a copula function of its marginals,

F(x) = C(F_1(x_1), \ldots, F_p(x_p)), \quad (3.13)

as well as that the copula function can be expressed as

C(u) = F(F_1^{-1}(u_1), \ldots, F_p^{-1}(u_p)). \quad (3.14)

Furthermore, each copula has an associated copula density,

c(F_1(x_1), \ldots, F_p(x_p)) = \frac{\partial^p C(F_1(x_1), \ldots, F_p(x_p))}{\partial F_1(x_1) \cdots \partial F_p(x_p)} \quad (3.15)

such that the joint probability density function f can be expressed in terms of the copula density and its marginal densities,

f(x) = c(F_1(x_1), \ldots, F_p(x_p)) \prod_{i=1}^{p} f_i(x_i). \quad (3.16)

(Cherubini, Luciano, and Vecchiato, 2004)

3.2.1 Copula models

There are numerous copula models that can be used, each with different properties, such as the Gaussian copula, the Student's t copula, and different types of Archimedean copulas. Depending on what kind of data needs to be processed, some copula models work better than others.

¹ The definition of a generally p-increasing function is quite complicated and out of scope for this thesis.


Gaussian copula. The p-variate Gaussian copula function is expressed as

C_W^{Ga}(u) = \Phi_W(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p)) \quad (3.17)

where Φ_W(·) is the multivariate standard normal cdf with correlation matrix W, such that C_W^{Ga}(u) generates the standard normal joint distribution function when the margins are standard normal. Furthermore, the canonical representation (from 3.16) of the p-variate Gaussian copula is

f(x) = \frac{1}{(2\pi)^{p/2} |W|^{1/2}} \exp\left(-\frac{1}{2} x^\top W^{-1} x\right) = c_W^{Ga}(\Phi(x_1), \ldots, \Phi(x_p)) \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x_i^2\right) \quad (3.18)

where |·| is the determinant operator and c_W^{Ga} is the copula density. The copula density can then be written as

c_W^{Ga}(\Phi(x_1), \ldots, \Phi(x_p)) = \frac{\frac{1}{(2\pi)^{p/2} |W|^{1/2}} \exp\left(-\frac{1}{2} x^\top W^{-1} x\right)}{\prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x_i^2\right)} \quad (3.19)

which can be further reduced to

c_W^{Ga}(\Phi(x_1), \ldots, \Phi(x_p)) = \frac{1}{|W|^{1/2}} \exp\left(-\frac{1}{2}\left(x^\top W^{-1} x - \sum_{i=1}^{p} x_i^2\right)\right) \quad (3.20)

= \frac{1}{|W|^{1/2}} \exp\left(-\frac{1}{2} x^\top (W^{-1} - I)\, x\right) \quad (3.21)

where I is the identity matrix of proper dimension. Letting u_n = \Phi(x_n) and consequently x_n = \Phi^{-1}(u_n), we can write the copula density as

c_W^{Ga}(u_1, \ldots, u_p) = \frac{1}{|W|^{1/2}} \exp\left(-\frac{1}{2} \xi^\top (W^{-1} - I)\, \xi\right) \quad (3.22)

where ξ = (\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p))^\top.

(Cherubini, Luciano, and Vecchiato, 2004)
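
As a check on (3.22), the density is straightforward to evaluate numerically. The following is a minimal sketch in R; the correlation matrix W and the evaluation point u are illustrative assumptions.

gauss_copula_density <- function(u, W) {
  xi <- qnorm(u)                           # xi = (Phi^{-1}(u_1), ..., Phi^{-1}(u_p))
  A  <- solve(W) - diag(length(u))         # W^{-1} - I
  det(W)^(-1/2) * exp(-0.5 * drop(t(xi) %*% A %*% xi))
}

W <- matrix(c(1.0, 0.6,
              0.6, 1.0), 2, 2)             # assumed 2x2 correlation matrix
gauss_copula_density(c(0.3, 0.7), W)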

Student's t copula. Similarly to the Gaussian copula, the p-variate Student's t copula is defined as

C_{W,\nu}^{t}(u) = t_{W,\nu}(t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_p)) \quad (3.23)

where W is the correlation matrix, ν is the degrees of freedom, and t_ν(·) is the standardised p-variate Student's t cumulative distribution function.

From the canonical representation of the p-variate Student's t copula, the associated copula density is

c_{W,\nu}^{t}(u) = |W|^{-1/2}\, \frac{\Gamma\left(\frac{\nu+p}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left(\frac{\Gamma\left(\frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu+1}{2}\right)}\right)^{p} \frac{\left(1 + \frac{1}{\nu}\, \xi^\top W^{-1} \xi\right)^{-\frac{\nu+p}{2}}}{\prod_{i=1}^{p} \left(1 + \frac{\xi_i^2}{\nu}\right)^{-\frac{\nu+1}{2}}} \quad (3.24)

where ξ = (t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_p)), and Γ(·) is the gamma function as defined in (3.3).

(Cherubini, Luciano, and Vecchiato, 2004)

Archimedean copulas. In order to define Archimedean copulas, we must first define generator functions, φ(u). A function φ(u) is called a generator function if and only if it is continuous, decreasing, convex and φ(1) = 0. A generator is called a strict generator if lim_{u→0} φ(u) = ∞. Furthermore, the pseudo-inverse of φ(u) is defined as

\varphi^{[-1]}(t) = \begin{cases} \varphi^{-1}(t), & 0 \le t \le \varphi(0) \\ 0, & \varphi(0) \le t \le \infty. \end{cases} \quad (3.25)

Given a generator function φ(u) whose pseudo-inverse is strictly monotonic on [0, ∞], a p-variate Archimedean copula is defined as

C^{Ar}(u) = \varphi^{[-1]}\left(\sum_{i=1}^{p} \varphi(u_i)\right). \quad (3.26)

For the copula density it is necessary to compute the derivative in (3.15), as φ(u_i) itself is not a distribution, so the canonical representation cannot be used directly to compute the copula density. Usually these generator functions have one or two parameters which need to be estimated, and the parameter space for each copula is decided such that the pseudo-inverse is strictly monotonic.² (Cherubini, Luciano, and Vecchiato, 2004)

Table 3.1 below presents three Archimedean copulas commonly used in finance.

² For example, these generators are usually selected as the inverses of the Laplace transforms of cumulative distribution functions. For a more in-depth explanation of generator functions, please refer to Cherubini, Luciano, and Vecchiato (2004), pp. 121-123, 150.


Table 3.1: Three Archimedean copulas commonly used in finance.

Name     φ(u)                           φ^{[-1]}(t)                       C(u)                                                          α
Gumbel   (−log u)^α                     exp(−t^{1/α})                     exp(−(Σ_{i=1}^p (−log u_i)^α)^{1/α})                          α > 1
Clayton  u^{−α} − 1                     (t + 1)^{−1/α}                    (Σ_{i=1}^p u_i^{−α} − p + 1)^{−1/α}                           α > 0
Frank    log((e^{−αu}−1)/(e^{−α}−1))    −(1/α) log(1 + e^t(e^{−α}−1))     −(1/α) log(1 + Π_{i=1}^p (e^{−αu_i}−1)/(e^{−α}−1)^{p−1})      α > 0 (p ≥ 3)

(Cherubini, Luciano, and Vecchiato, 2004)

3.2.2 Copula selection

Obviously it is not optimal to just pick a copula and run with it, since some copulas might be better suited to some data sets than others. So naturally there is interest in testing different models to see which one performs better than the others. To achieve this, each copula model's parameters are first fitted using maximum likelihood estimation (MLE). Then, with the parameters obtained from the MLE, the likelihood is compared between the models, and the one with the highest likelihood is selected. Consider the log-likelihood function

l_c(\theta) = \sum_{t=1}^{m} \log c(F_1(x_{1t}), \ldots, F_p(x_{pt})) + \sum_{t=1}^{m} \sum_{i=1}^{p} \log f_i(x_{it}) \quad (3.27)

where m is the number of observations, given that [x_{1t}, \ldots, x_{pt}], t ∈ (1, \ldots, m), counts as one observation. Given the marginal distributions and a copula, we can then find the copula density, which together with the marginals makes it possible to solve

\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} l_c(\theta) \quad (3.28)

where θ̂_MLE is the ML-estimator of θ. This is however a highly computationally intensive optimisation problem, so a method called Inference for Margins (IFM) was developed, which has a significantly lower complexity but only solves the optimisation problem approximately. IFM is done in two steps, which basically split up l_c(θ) in two parts; first maximising the sum of the marginal densities only,

\hat{\theta}_1 = \arg\max_{\theta_1} \sum_{t=1}^{m} \sum_{i=1}^{p} \log f_i(x_{it}; \theta_1), \quad (3.29)

and then, given θ̂_1, maximising the sum of copula densities in order to yield a final estimate for θ,

\hat{\theta}_2 = \arg\max_{\theta_2} \sum_{t=1}^{m} \log c(F_1(x_{1t}; \hat{\theta}_1), \ldots, F_p(x_{pt}; \hat{\theta}_1); \theta_2), \quad (3.30)

which gives us the parameters θ̂_IFM = (θ̂_1, θ̂_2) as the approximate solution to (3.28). Since IFM yields an approximation of θ̂_MLE, it is possible to use θ̂_IFM as a starting point when solving (3.28), making the exact method slightly quicker while still obtaining the exact maximum likelihood estimate. (Cherubini, Luciano, and Vecchiato, 2004)
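
The following is a minimal sketch of the two-step IFM procedure (3.29)-(3.30) in R, for the special case of Student's t margins and a Gaussian copula; the synthetic data and the margin parameterisation are illustrative assumptions, not the implementation used later in this thesis.

set.seed(2)
x <- cbind(rt(500, df = 6) * 0.010,
           rt(500, df = 6) * 0.015)        # synthetic bivariate return data

# Step 1 (3.29): fit each margin separately by MLE
fit_margin <- function(xi) {
  nll <- function(p) -sum(dt((xi - p[1]) / exp(p[2]), df = exp(p[3]), log = TRUE) - p[2])
  optim(c(mean(xi), log(sd(xi)), log(5)), nll)$par
}
pars <- apply(x, 2, fit_margin)

# Step 2 (3.30): transform to uniforms with the fitted margins, then
# estimate the copula parameter; for the Gaussian copula the sample
# correlation of the probit-transformed data is a standard estimator
u  <- sapply(1:2, function(j) pt((x[, j] - pars[1, j]) / exp(pars[2, j]),
                                 df = exp(pars[3, j])))
xi <- qnorm(u)
W_hat <- cor(xi)
W_hat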

3.3 Stochastic Processes

The factors taken into consideration while developing the regime-switching process will be used to model a factor process as a stochastic process. There are different possibilities and choices when it comes to determining a stochastic process to base the factor model on, so one has to consider what kind of properties these stochastic processes have, and compare them to the properties of financial factors. Mills and Markellos (2008) describe different suitable stochastic models for financial time series, each with its own pros and cons. Furthermore, a class of stochastic processes called Markov Chains will be brought up, together with an in-depth description of Hidden Markov Models in the next section, as the regime-switching model is based on such processes.

3.3.1 Auto-regressive process

An auto-regressive (AR) process X_t of order p, denoted AR(p), is defined as

X_t = a + \sum_{i=1}^{p} b_i X_{t-i} + \epsilon_t, \quad (3.31)

where a is a constant, b_i are coefficients on each previous value of the process, and ε_t is a noise term; such processes are widely used in finance as well as other disciplines. In order for the process to have a finite variance, the noise needs to have finite first and second moments.

This process is useful when it is of interest to model a linear stochastic behaviour that depends on previous values of itself. The noise from the first observation in the process has an impact on the value of the process at any time point t > 0, since each previous value in the process X_{t-i} keeps the noise from its own observation. This implies that, depending on the coefficients b_i, this impact will either grow or decay for future observations, and since it is unrealistic that information far back in time has a huge impact on present or future observations, the restriction |b_i| < 1 is often imposed, which is called a stationarity condition. (Mills and Markellos, 2008)

Note that an AR(p) process depends on previous values of the noise, but only indirectly (see (3.31)); X_t depends on X_{t-1} and ε_t, and X_{t-1} depends on X_{t-2} and ε_{t-1}, and so on. A consequence of this is that an AR process can be rewritten as an infinite-order moving average process (moving average processes are covered in the next section). For example, the AR(1) process can be written as

X_t = a + b_1 X_{t-1} + \epsilon_t = a + b_1(a + b_1 X_{t-2} + \epsilon_{t-1}) + \epsilon_t = a(b_1 + 1) + b_1^2 X_{t-2} + b_1 \epsilon_{t-1} + \epsilon_t = \ldots = a\left(1 + \sum_{j=1}^{\infty} b_1^j\right) + \epsilon_t + \sum_{j=1}^{\infty} b_1^j \epsilon_{t-j}. \quad (3.32)

If the process is to be feasible, then |b_1| < 1; otherwise this representation shows that X_t clearly diverges.
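
A minimal sketch in R of the AR(1) recursion in (3.31), under the assumed parameter values below; for |b_1| < 1 the simulated process fluctuates around its mean a/(1 − b_1).

set.seed(3)
a <- 0.1; b1 <- 0.7; m <- 500
eps <- rnorm(m)
x <- numeric(m)
x[1] <- a / (1 - b1)                       # start at the process mean
for (t in 2:m) x[t] <- a + b1 * x[t - 1] + eps[t]
mean(x)                                    # close to a / (1 - b1) = 1/3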

3.3.2 Moving average process

Consider the following process

Y_t = c + \sum_{i=1}^{q} d_i \epsilon_{t-i} + \epsilon_t \quad (3.33)

where c is a constant, d_i are coefficients and ε_t is noise. This is called a moving average process of order q, denoted MA(q). Note that this type of process does not depend on previous values of itself, but rather only on the noise. In contrast, an AR(p) process depends on previous values of itself, and consequently indirectly on previous values of the noise, as covered in the previous section.

Just like the AR(1) process has an MA(∞) representation, Mills and Markellos (2008) show that for the MA(1) process, if |d_1| < 1, then the process is invertible and can be rewritten as an AR(∞) process,

Y_t = c + \sum_{i=1}^{\infty} \pi_i Y_{t-i} + \epsilon_t \quad (3.34)

where the weights π_i are absolutely summable, i.e. \sum_{i=1}^{\infty} |\pi_i| < \infty.

3.3.3 ARMA-process

An auto-regressive moving average (ARMA) process is a combination of the previous two models. An ARMA(p, q) process is defined as

Z_t = a + \sum_{i=1}^{p} b_i Z_{t-i} + \sum_{i=1}^{q} c_i \epsilon_{t-i} + \epsilon_t. \quad (3.35)

This model allows for both auto-regression and moving average modelling to be used in one single process.

3.4 Markov Chains

A Markov Chain is a collection of different probabilistic states and a corresponding transition probability matrix

P = \begin{pmatrix} p_{1,1} & \cdots & p_{1,n} \\ \vdots & \ddots & \vdots \\ p_{n,1} & \cdots & p_{n,n} \end{pmatrix}, \quad p_{i,j} = P(s_{t+1} = j \mid s_t = i), \quad (3.36)

where s_t is the state at time t. Hence a system that can be described with a Markov Model transitions through different states, and the probability of the next state depends only on the current state.

A Hidden Markov Model assumes the existence of a Markov Chain with a specific number of states which are not directly observable; instead another variable is observed, which depends on the state of the system, and the states have to be inferred from this output variable. A Regime-Switching Model is a generalisation of a Hidden Markov Model which allows the output variables to depend on their previous values. However, in some financial literature the terms appear to be used as synonyms. (Cappé, Moulines, and Ryden, 2005)
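
A minimal sketch in R of simulating a path from a two-state Markov Chain, with rows of P indexing the current state and columns the next state; the transition probabilities are illustrative assumptions.

set.seed(4)
P <- matrix(c(0.95, 0.05,
              0.10, 0.90), nrow = 2, byrow = TRUE)
m <- 1000
s <- integer(m); s[1] <- 1
for (t in 2:m) s[t] <- sample(1:2, 1, prob = P[s[t - 1], ])
table(s) / m                               # empirical state frequencies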


3.5 Hidden Markov Models

Consider a Markov Chain with n states S_1, \ldots, S_n and an n × n transition probability matrix P, where element p_{i,j} is the probability that we end up in S_j after being in state S_i, P(S_j|S_i).

If this Markov Chain is not observable, but instead we have some output variable Y which depends only on the state, then the model is a Hidden Markov Model. If the output variable Y_t at time t depends not only on the state S(t) at time t, but also on previous observations, then the model is more generally a Regime-Switching Model. Given a history of output variables Y_1, \ldots, Y_m, we want to learn P(S_t = j|Y_1, \ldots, Y_m), called the smoothed probability of being in state j, for each j ∈ {1, \ldots, n}. We also need an initial probability distribution Π = (π_1, \ldots, π_n), where π_j is the probability that the initial state is j.

In addition to these parameters, we will also have a set of parameters associated with the conditional distribution of Y. In the case of the normal distribution we have the mean µ and the variance σ². In the case of the univariate Student's t distribution the parameters are the location parameter µ, the scale parameter σ², and the degrees of freedom ν.

Thus we have a parameter set θ = (Π, P, λ), where λ denotes the parameters belonging to the conditional distributions. We associate θ with Θ, the space of all possible parameter values for the model. We will assume that the process is stationary, i.e. it has a stationary distribution and the initial distribution Π is the stationary distribution.

It is quite common to model returns in time series with normal distributions. However, Bulla (2011) found that by substituting the normal distribution, which is standard for Hidden Markov Models, with the Student's t distribution, it is possible to increase the model's robustness towards outliers, as well as increase the persistence of its states. That this is a possible improvement should not be too hard to see, since the Student's t distribution can be seen as a generalisation of the normal distribution, if high enough degrees of freedom are allowed.

For compactness, we will henceforth use the notation Y_k^m to denote Y_k, \ldots, Y_m, or just Y^m when k = 1. For realisations of the above variables we use lower-case notation.

Now we will present some approaches for estimating the states of a Hidden Markov Model, given a full specification. We first present the Forward-Backward algorithm in detail, and then we briefly present some alternatives.


3.5.1 Forward-Backward algorithm

Given a fully specified Hidden Markov Model θ with n states and a history y^m, we can calculate the smoothed probabilities P(s_t|y^m) with the Forward-Backward algorithm. Using the approach in Ephraim and Merhav (2002), we define the forward density as α(s_t, y^t) = P(s_t, y^t), i.e. the probability of the state at t being s_t and the sequence of observations up until then being y^t. This can be calculated using the recursion

\alpha(s_t, y^t) = \begin{cases} f_Y(y_t|s_t) \sum_{k=1}^{n} \alpha(k, y^{t-1})\, p_{k,s_t}, & t = 2, \ldots, m \\ \pi_{s_1} f_Y(y_1|s_1), & t = 1 \end{cases} \quad (3.37)

where f_Y(y_t|s_t) is the conditional probability density of y_t given s_t. We define the backward density as β(y_{t+1}^m|s_t) = P(y_{t+1}^m|s_t), i.e. given state s_t at t, the probability of observing the rest of the observation sequence. It is given by

\beta(y_{t+1}^m|s_t) = \begin{cases} \sum_{k=1}^{n} \beta(y_{t+2}^m|k)\, p_{s_t,k}\, f_Y(y_{t+1}|k), & t = 1, \ldots, m-1 \\ 1, & t = m. \end{cases} \quad (3.38)

We can then calculate the smoothed probability as

P(s_t|y^m) = \frac{\alpha(s_t, y^t)\, \beta(y_{t+1}^m|s_t)}{\sum_{k=1}^{n} \alpha(k, y^t)\, \beta(y_{t+1}^m|k)}. \quad (3.39)

That this is a probability should be intuitive; all quantities are positive, and it obviously lies in the range [0, 1]. That it is the correct probability is less obvious. We have that

P(s_t, y^m) = P(s_t, y^t, y_{t+1}^m) = P(s_t, y^t)\, P(y_{t+1}^m|s_t) = \alpha(s_t, y^t)\, \beta(y_{t+1}^m|s_t). \quad (3.40)

Hence it follows from the Kolmogorov definition of conditional probability, P(A|B) = P(A ∩ B)/P(B). Furthermore, the forward densities can be used to calculate the likelihood function of the sequence,

L(y^m) = \sum_{s_m=1}^{n} \alpha(s_m, y^m). \quad (3.41)

However, for large t the quantity α(s_t, y^t) tends towards 0 or ∞. Hence we introduce a normalising quantity

N_t = \sum_{s_t=1}^{n} \alpha(s_t, y^t). \quad (3.42)

We let α̂(s_1, y^1) = P(s_1|y^1), and for t > 1 we let α̂(s_t, y^t) = α(s_t, y^t)/N_t. For the backward densities we let β̂(y_{m+1}^m|s_m) = 1 and β̂(y_{t+1}^m|s_t) = β(y_{t+1}^m|s_t)/N_{t+1}. From this we obtain

P(s_t|y^m) = \hat{\alpha}(s_t, y^t)\, \hat{\beta}(y_{t+1}^m|s_t). \quad (3.43)

We can also calculate the bivariate joint smoothed probability

P(S_t = i, S_{t+1} = j|y^m) = \frac{\alpha(i, y^t)\, p_{i,j}\, f_Y(y_{t+1}|j)\, \beta(y_{t+2}^m|j)}{\sum_{k=1}^{n} \sum_{l=1}^{n} \alpha(k, y^t)\, p_{k,l}\, f_Y(y_{t+1}|l)\, \beta(y_{t+2}^m|l)} \quad (3.44)

which is necessary for the E-step in the Baum-Welch algorithm.
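
To illustrate the recursions (3.37)-(3.43), the following is a minimal sketch in R of the scaled Forward-Backward algorithm for a two-state HMM with conditional normal densities; the parameters and the simulated series are illustrative assumptions. The rows of the smoothed-probability matrix sum to one, and the scalings recover the log-likelihood (3.41) without underflow.

set.seed(5)
n <- 2; m <- 300
P  <- matrix(c(0.95, 0.05,
               0.10, 0.90), n, n, byrow = TRUE)
Pi <- c(0.5, 0.5); mu <- c(0.001, -0.002); sigma <- c(0.01, 0.03)

s <- integer(m); s[1] <- 1                 # simulate states and observations
for (t in 2:m) s[t] <- sample(1:n, 1, prob = P[s[t - 1], ])
y <- rnorm(m, mu[s], sigma[s])

B <- sapply(1:n, function(j) dnorm(y, mu[j], sigma[j]))   # f_Y(y_t | s_t = j)

alpha <- matrix(0, m, n); N <- numeric(m)  # scaled forward pass, (3.37) and (3.42)
alpha[1, ] <- Pi * B[1, ]
N[1] <- sum(alpha[1, ]); alpha[1, ] <- alpha[1, ] / N[1]
for (t in 2:m) {
  alpha[t, ] <- B[t, ] * (alpha[t - 1, ] %*% P)
  N[t] <- sum(alpha[t, ]); alpha[t, ] <- alpha[t, ] / N[t]
}

beta <- matrix(0, m, n); beta[m, ] <- 1    # scaled backward pass, (3.38)
for (t in (m - 1):1)
  beta[t, ] <- (P %*% (B[t + 1, ] * beta[t + 1, ])) / N[t + 1]

smoothed <- alpha * beta                   # P(s_t = j | y^m), as in (3.43)
loglik   <- sum(log(N))                    # log-likelihood from the scalings
head(smoothed); loglik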

3.5.2 The Viterbi algorithm and other alternatives

Given a fully specified model, one may also be interested in calculating the most likely sequence of states, arg max P(s^m|y^m). This is called the Viterbi path, and it can be calculated using the Viterbi algorithm. One of the reasons this is useful, for certain applications, is that it is sometimes possible that the individually most likely states do not produce a reasonable path. It is possible, for instance, that for some t, P(s_t^{t+1}|y^m) = 0. Instead, there may be interest in excluding such paths, and finding the most likely path as a whole.

Cappé, Moulines, and Ryden (2005) also provide several Monte Carlo approaches, such as the Metropolis-Hastings algorithm, which can be applied to find the states of a Hidden Markov Model. These algorithms allow us to consider a broader range of models. They do not rely on analytical solutions and they do not suffer as much from the curse of dimensionality as numerical approaches.

3.5.3 Parameter estimation

Assuming instead that we do not have a fully specified model, but at least know the number of states of the model, we are tasked with finding some suitable θ ∈ Θ. Define the likelihood function, very generally, as

L(\theta) = \int f(s, y^m; \theta)\, \lambda_\sigma(ds), \quad (3.45)

where λ_σ is a σ-finite measure on (S, 𝒮), S is the sample space of S, and f(·, y^m; θ) is the probability density with respect to λ_σ. Then the θ that is most likely to have generated the observations is the one that maximises (3.45). Note that this formulation covers the discrete as well as the continuous case, and also that it is computable if the quantities in the Forward-Backward algorithm are computable. If the reader is not familiar with σ-algebras and measures, for our discrete-state, discrete-time process this may be written as

L(\theta) = \prod_{s=1}^{n} \prod_{t=1}^{m} f(s, y_t; \theta). \quad (3.46)

This can be hard to maximise, especially if m is large or y has a high dimension. Instead, other algorithms exist which maximise this indirectly, such as the Expectation-Maximisation algorithm. The majority of this section is devoted to a special case of the Expectation-Maximisation algorithm, the Baum-Welch algorithm, commonly used in speech recognition. Common to all Expectation-Maximisation algorithms is that they have an E-step, where an auxiliary function is calculated, and an M-step, where the auxiliary function is maximised. (Cappé, Moulines, and Ryden, 2005)

3.5.4 Baum-Welch

Assume that (3.45) is computable. Define the auxiliary function, like Ephraim and Merhav (2002), as

Q(\theta, \theta^l) = E_{\theta^l}\left[\log f(S^m, y^m; \theta) \mid y^m\right]. \quad (3.47)

Then we have that

l(\hat{\theta}) - l(\theta^l) = \log \frac{f(y^m; \hat{\theta})}{f(y^m; \theta^l)} = \log E_{\theta^l}\left[\frac{f(S^m, y^m; \hat{\theta})}{f(S^m, y^m; \theta^l)} \,\middle|\, y^m\right] \ge E_{\theta^l}\left[\log \frac{f(S^m, y^m; \hat{\theta})}{f(S^m, y^m; \theta^l)} \,\middle|\, y^m\right] = Q(\hat{\theta}, \theta^l) - Q(\theta^l, \theta^l), \quad (3.48)

where θ̂ is a candidate set of parameters and θ^l is the current guess of the parameter set, with the help of Jensen's inequality. This is important because it means that the improvement in log-likelihood is always at least as great as the improvement in the auxiliary function. We only have an equality if the numerator and the denominator are equal. We obtain the new estimate as

\theta^{l+1} = \arg\max_{\hat{\theta} \in \Theta} Q(\hat{\theta}, \theta^l). \quad (3.49)

When dealing with a specific conditional distribution, Q(θ̂, θ^l) and the solution to the M-step can be given explicitly. For the univariate Student's t distribution, Bulla (2011) gives us that

Q(\theta, \theta^l) = \sum_{i=1}^{n} P(S_1 = i|y^m) \log \pi_i + \sum_{i,j=1}^{n} \sum_{t=1}^{m} P(S_t = i, S_{t+1} = j|y^m) \log p_{i,j} + \sum_{i=1}^{n} \sum_{t=1}^{m} P(S_t = i|y^m) \log f_Y(y_t|S_t = i) \quad (3.50)

and from Peel and McLachlan (2000) we learn that a new estimate of the mean can be obtained from

\mu_j^{l+1} = \frac{\sum_{t=1}^{m} y_t\, u_j^l(t)\, P(S_t = j|y^m)}{\sum_{t=1}^{m} u_j^l(t)\, P(S_t = j|y^m)} \quad (3.51)

and a new estimate of the scaling factor from

v_j^{l+1} = \frac{\sum_{t=1}^{m} u_j^l(t)\, (y_t - \mu_j^{l+1})^2\, P(S_t = j|y^m)}{\sum_{t=1}^{m} P(S_t = j|y^m)} \quad (3.52)

where

u_j^l(t) := \frac{\nu_j^l + 1}{\nu_j^l + \left(\frac{y_t - \mu_j^l}{\sigma_j^l}\right)^2}, \quad (3.53)

where σ = v^{1/2} is the standard deviation. ν_j^{l+1} is obtained as the unique solution to

\frac{1}{\sum_{t=1}^{m} P(S_t = j|y^m)} \left[\sum_{t=1}^{m} \left(\log u_j^l(t) - u_j^l(t)\right) P(S_t = j|y^m)\right] - \Psi\left(\tfrac{1}{2}\nu_j^{l+1}\right) + \log\left(\tfrac{1}{2}\nu_j^{l+1}\right) + 1 + \Psi\left(\frac{\nu_j^l + 1}{2}\right) - \log\left(\frac{\nu_j^l + 1}{2}\right) = 0, \quad (3.54)

where Ψ(x) = \frac{1}{\Gamma(x)} \frac{\partial \Gamma(x)}{\partial x} is the Digamma function. This equation cannot be solved explicitly with respect to ν_j^{l+1}, hence a numerical estimation procedure is necessary.

The equations for the normal distribution are obtained by removing u from (3.51) and (3.52). The rest of the equations are the same, except that there is no degrees-of-freedom parameter.

For a stationary Hidden Markov Model, Bulla and Berzel (2008) propose that the transition probabilities and the initial distribution for the next iteration can be obtained by maximising

\max_{P} \left(\sum_{i=1}^{n} \psi_1(i) \log \pi_i + \sum_{i,j=1}^{n} \sum_{t=1}^{m-1} \xi_t(i, j) \log p_{ij}\right) \quad \text{subject to} \quad \Pi P = \Pi, \quad \sum_{i=1}^{n} \pi_i = 1, \quad (3.55)

where ψ_t(i) := P(S_t = i|y^m, θ), Π = (π_1, \ldots, π_n) and ξ_t(i, j) := P(S_t = i, S_{t+1} = j|y^m, θ). This step is also hard to solve, but Newton-type algorithms can be used.

If instead one assumes that the Hidden Markov Model is heterogeneous, we obtain the new estimates for Π and P separately. Zucchini and MacDonald (2009) give us that

\pi_j^{l+1} = \frac{P(S_1 = j|y^m)}{\sum_{i=1}^{n} P(S_1 = i|y^m)} \quad (3.56)

and

p_{i,j}^{l+1} = \frac{\sum_{k=2}^{m} P(S_{k-1} = i, S_k = j|y^m)}{\sum_{k=2}^{m} P(S_{k-1} = i|y^m)}. \quad (3.57)

Hence, we can obtain an n-state Regime-Switching Model using the following general procedure.

1. Pick a start guess for the parameters θ^{l=1} ∈ Θ.

2. (The E-step) Calculate Q(θ, θ^l) and the smoothed probabilities using the Forward-Backward algorithm for θ^l.

3. (The M-step) Use the re-estimation equations to find θ^{l+1}. Go back to step 2 if some stopping criterion Q(θ^{l+1}, θ^l)/Q(θ^l, θ^l) − 1 < ε is not fulfilled. Alternatively, Q can be substituted with the log-likelihood, which can be obtained easily from the forward quantities.
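
To make the re-estimation equations concrete, the following is a minimal sketch in R of the weight (3.53) and the updates (3.51)-(3.52) for a single univariate Student's t component; as simplifying assumptions, all smoothed probabilities are set to one (a single state) and ν is held fixed rather than solved for via (3.54).

set.seed(6)
y  <- rt(2000, df = 4) * 0.02 + 0.001      # synthetic data
nu <- 4                                    # degrees of freedom, held fixed
mu <- mean(y); v <- var(y)
for (l in 1:50) {
  u  <- (nu + 1) / (nu + (y - mu)^2 / v)   # weights, (3.53)
  mu <- sum(u * y) / sum(u)                # location update, (3.51)
  v  <- sum(u * (y - mu)^2) / length(y)    # scale update, (3.52)
}
c(mu = mu, sigma = sqrt(v))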

3.5.5 Other algorithms

The appeal of the EM algorithm is that the maximisation in the M-step is simple. It can however sometimes be complicated. Meng and Rubin (1993) introduce the Expectation Conditional Maximisation (ECM) algorithm, which maximises the parameters in multiple steps. In the so-called CM-step, some subset of the parameters is fixed, and the rest of the parameters are maximised conditionally on these parameters being fixed. It can thus be possible to obtain analytic solutions where it is not possible with the EM algorithm. It may also be computationally simpler, even if there are no analytic solutions.

When the likelihood (3.45) is not computable, ECM is not usable. In this case there exist some other approaches. One of these is Monte Carlo EM, which alters the E-step. Instead of Q(θ, θ^l), a simulated approximation is calculated. The function to maximise is then changed to

\hat{Q}(\theta, \theta^l) := \frac{1}{M} \sum_{i=1}^{M} \log f(\xi_i; \theta), \quad (3.58)

where M is the Monte Carlo sample size and ξ_i, i ∈ 1, \ldots, M, are i.i.d. realisations of P(y; θ).

Sometimes even this is intractable, because it can be hard to draw i.i.d. variables from the distribution. In that case one can use Markov Chain Monte Carlo EM approaches as well as Bayesian approaches. (Cappé, Moulines, and Ryden, 2005)

3.5.6 Model selection

In order to estimate the fitness of the models, multiple methods can be used, the most obvious of which is comparing the likelihood of the models. The drawback of this is that it is likely to select a model with too many parameters (which, in the case of Hidden Markov Models, increase with the number of states). Let us illustrate this with a thought experiment: consider some model of some phenomenon, which is the true model. Say that this model has k parameters. If we add another parameter with coefficient 0, we have a model with k + 1 parameters, and the same likelihood. Hence we can always add parameters without reducing the likelihood, as long as we can fit the new model to data well. Considering that we also have some noise in the data, it should be very plausible that the model with the extra parameter could have a non-zero coefficient, and a higher likelihood, due to better fitting the noisy data. Thus we can carry on adding parameters ad infinitum.

For this reason it is necessary to punish models with more parameters. Two commonly used methods are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). AIC was introduced by Akaike (1973), where he shows that maximising the log-likelihood is equivalent to maximising an information theoretic quantity. He then uses information theory to derive AIC, which is given as

AIC = 2k - 2 \log L(\theta), \quad (3.59)

where k = |θ| is the number of parameters of the model. BIC is introduced by Schwarz (1978) as

BIC = k \log m - 2 \log L(\theta), \quad (3.60)

where m is the number of data points as before, and this is shown to be, given a few assumptions, asymptotically optimal. For both of these it is important that the θ which maximises L(·) is chosen, otherwise an incorrect number of parameters may be chosen. For both criteria lower values are desirable. The methods differ somewhat; clearly BIC punishes complexity more harshly. As mentioned, which method is preferable depends on the practitioner's belief about the "true model".
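
Both criteria are trivial to compute once the maximised log-likelihood is available; a minimal sketch in R, with assumed illustrative numbers:

loglik <- -1234.5                          # maximised log L(theta)
k      <- 11                               # number of free parameters |theta|
m      <- 5000                             # number of observations
c(AIC = 2 * k - 2 * loglik,
  BIC = k * log(m) - 2 * loglik)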


3.5.7 Pseudo-residuals

Having selected a model, it is important to determine that the model is good enough. This is hard to do with likelihood, AIC and BIC. Commonly, quantile-quantile plots are used, but for models whose distribution changes over time, such as a Hidden Markov Model, this cannot be done directly. Instead, Zucchini and MacDonald (2009) suggest that we transform the returns using the distribution function. If we have the distribution function F_X of some time-changing model, we can use that

u = F_X(x) \sim U(0, 1). \quad (3.61)

We can then do a standard-normal quantile-quantile plot of Φ^{-1}(u), where Φ^{-1} is the inverse cumulative distribution function of the standard normal distribution.
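
A minimal sketch in R of the pseudo-residual check (3.61) for a time-changing two-component normal mixture; the smoothed weights and component parameters are illustrative assumptions, standing in for a fitted HMM.

set.seed(7)
m <- 500
w1 <- runif(m); w2 <- 1 - w1               # assumed smoothed state probabilities
s  <- ifelse(runif(m) < w1, 1, 2)
y  <- rnorm(m, 0, c(0.01, 0.03)[s])

u <- w1 * pnorm(y, 0, 0.01) + w2 * pnorm(y, 0, 0.03)   # u = F_X(x), per t
qqnorm(qnorm(u)); abline(0, 1)             # points should lie near the diagonal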

3.6 Factor model

Ross (1976) proposes an extension of the CAPM (which only takes market risk into account), named Arbitrage Pricing Theory (APT), to a generalised factor model

r_k = a_k + \sum_{i=1}^{n} b_{k,i} f_i + \epsilon_k, \quad (3.62)

where r_k is the return for asset k calculated from a set of factors f_i, b_{k,i} is its sensitivity to factor f_i, a_k is the intercept, and ε_k is a noise term with zero mean. Fama and French (1993) extend the CAPM based on Ross (1976) with two factors related to size and book-to-market equity. Fama and French (2015) extend their work by adding two additional factors. The challenge is to avoid applying factors to the APT ad hoc in order for the intercept a_k to be as close to zero as possible (the closer it is to zero, the better the factors are in terms of explaining returns), and to actually find factors that explain asset returns with high significance, which Fama and French (1993) and Fama and French (2015) do quite rigorously. To relate to the purpose, we are interested in finding in which regimes factors explain most of the returns, in order to find a regime-switching multi-factor model that has an intercept a_k as close to zero as possible across all regimes.

Takahiro and Naoki (2015) model the dynamics of factors using the vector auto-regressive process

f(t + 1) = \mu_{S(t+1)} + \Phi_{S(t+1)} f(t) + \xi_{S(t+1)}(t + 1), \quad (3.63)

where f(t) is a vector of factors that can be used to predict returns, µ_{S(t+1)} is a mean-reversion level, Φ_{S(t+1)} is a matrix of coefficients and ξ_{S(t+1)} is a noise vector, where the three latter are dependent on regime S(t + 1). In their study ξ is Gaussian, whereas we let ξ be Student's t-distributed.

They then use the factors to model the excess returns of the assets,

r(t + 1) = L_{S(t)} f(t) + \epsilon_{S(t)}(t + 1), \quad (3.64)

where L_{S(t)} is a matrix giving us the assets' exposures to the factors and ε_{S(t)} is another noise vector, during regime S(t).

While we are interested in excess returns, we are not interested in excess returns over the market, as we want to find a factor model that attributes as much of the returns as possible to factors, rather than finding alpha (excess returns). We will use the same model for total returns as Takahiro and Naoki (2015) do for excess returns over the market, with the only difference being that r(t) are excess returns (stock returns minus the risk-free rate) instead of excess returns calculated from the CAPM.
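
A minimal sketch in R simulating the regime-switching factor dynamics (3.63) with Student's t noise; the number of factors, the per-regime parameters and the transition matrix are illustrative assumptions.

set.seed(8)
K <- 2                                     # number of factors
mu  <- list(c(0.02, 0.01), c(-0.03, 0.00)) # per-regime mean-reversion levels
Phi <- list(diag(0.1, K), diag(0.3, K))    # per-regime coefficient matrices
nu  <- c(8, 4); sc <- c(0.01, 0.03)        # per-regime t noise parameters
P <- matrix(c(0.97, 0.03,
              0.05, 0.95), 2, 2, byrow = TRUE)

m <- 500; s <- integer(m); s[1] <- 1
f <- matrix(0, m, K)
for (t in 2:m) {
  s[t]   <- sample(1:2, 1, prob = P[s[t - 1], ])
  xi     <- rt(K, df = nu[s[t]]) * sc[s[t]]          # Student's t noise
  f[t, ] <- mu[[s[t]]] + Phi[[s[t]]] %*% f[t - 1, ] + xi
}
head(f)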

3.7 Asset allocation

In order to show the efficiency of a regime-switching model, it is natural to compare a trading model with regime switching against one without. Herein we present a few ideas for portfolio optimisation which could be used to evaluate the regime-switching model.

Throughout the section, keep in mind that from (3.64) the portfolio returns are $R(t+1) = x(t)^\top \left( L_{S(t)} f(t) + \epsilon_{S(t)}(t+1) \right)$, with mean and variance

$$\mathrm{E}[R(t+1)] = x(t)^\top L_{S(t)} f(t), \qquad \mathrm{Var}[R(t+1)] = x(t)^\top \boldsymbol{\Sigma}_{\epsilon,S(t)}\, x(t), \qquad (3.65)$$

where $x(t)$ is a vector of weights for the different assets in the portfolio.

3.7.1 Markowitz' framework

Markowitz (1952) introduced the notion that a portfolio should not be selected according to the expected returns of the assets alone, since doing so does not lead to the diversification that is observed in practice. Instead he suggested that portfolios be chosen such that the expected returns are considered in relation to their variance.


We define a utility function in this case as

$$U(x(t), \gamma) := x(t)^\top \mathrm{E}[r_{S(t)}] - \frac{1}{2}\gamma\, x(t)^\top \mathrm{Var}[r_{S(t)}]\, x(t), \qquad (3.66)$$

where $\gamma$ is a risk-aversion parameter, which tells us how much utility we lose from variance. Models which seek to maximise this and similar functions are commonly referred to as Mean-Variance models, and are common in the literature. Often, they are extended with a transaction cost.
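In the unconstrained case the maximiser of (3.66) has a familiar closed form: setting the gradient $\mathrm{E}[r_{S(t)}] - \gamma \mathrm{Var}[r_{S(t)}]\, x(t)$ to zero gives $x^* = \frac{1}{\gamma} \mathrm{Var}[r_{S(t)}]^{-1} \mathrm{E}[r_{S(t)}]$. A minimal sketch:

```python
import numpy as np

def mean_variance_weights(mu, Sigma, gamma):
    # Solve Sigma x = mu, then scale by 1/gamma: higher risk aversion
    # (larger gamma) shrinks all positions towards zero.
    return np.linalg.solve(Sigma, mu) / gamma
```

In practice a budget constraint and long-only restrictions are usually added, in which case a numerical solver such as the primal-dual interior point method of Section 3.8.2 is needed.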

3.7.2 Mean-variance with Quadratic transaction costs

Takahiro and Naoki (2015) adapt the Mean-Variance style utility function of Gârleanu and Pedersen (2013), with quadratic transaction costs, to Hidden Markov models. In particular, they model the utility as

$$U(x(t), \gamma_{S(t)}) = x(t)^\top L_{S(t)} f(t) - \frac{\gamma_{S(t)}}{2}\, x(t)^\top \boldsymbol{\Sigma}_{r,S(t)}\, x(t) - \frac{1}{2}\, \Delta x(t)^\top B_{S(t)}\, \Delta x(t), \qquad (3.67)$$

where $\Delta x(t) = x(t) - x(t-1)$ and $B_{S(t)}$ is a matrix of transaction costs during regime $S(t)$.

They then calculate the portfolio $x(t)$ by maximising

$$\mathrm{E}\left[ \sum_{t=m}^{\infty} \rho^{t-m-1}\, U(x(t), \gamma_{S(t)}) \,\middle|\, w(m-1), S(m), f(m) \right] \qquad (3.68)$$

with respect to $x(t)$, where $\rho$ is the discount factor, and $f(t)$ and $S(t)$ are simulated for $t > m$. This is done in the following way.

1. Observe $f(t)$ and $S(t)$ at $t$.

2. Determine $x(t)$ and re-balance $x(t-1)$ to $x(t)$.

3. Increase $t$ by one step. Choose $S(t)$ randomly using the transition matrix $P$.

4. Find $f(t)$ for the new $t$ by randomly selecting a value from its conditional distribution.

They then find a semi-analytic solution for $x(t)$.

Notice that $x(t)$ depends on its future values, and that the optimisation requires maximising the objective function over the entire sequence of portfolio weights $x(m), x(m+1), \ldots$. Hence, with this approach we optimise the portfolio with respect to a specific path for $f(t)$. When deriving the semi-analytic solution, shorting is allowed. If we impose the restriction that $x(t) \geq 0$, then it is plausible that we will not be able to find a semi-analytic solution.
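For illustration, a single simulated path can be scored against the objective; averaging such scores over many paths gives a Monte Carlo estimate of (3.68). The per-regime parameter containers below are hypothetical placeholders.

```python
import numpy as np

def period_utility(x, x_prev, L_s, f, Sigma_s, B_s, gamma_s):
    # One-period utility (3.67): expected return minus the variance
    # penalty and the quadratic cost of the re-balancing trade
    dx = x - x_prev
    return (x @ (L_s @ f)
            - 0.5 * gamma_s * x @ Sigma_s @ x
            - 0.5 * dx @ B_s @ dx)

def discounted_objective(path, weights, params, rho, x_init):
    """Sum of rho^(t-m-1) U(x(t), gamma_{S(t)}) along one path, as in (3.68).
    path: (s, f) pairs from a simulation; params[s] = (L, Sigma, B, gamma);
    x_init plays the role of the holdings in place at t = m."""
    total, x_prev = 0.0, x_init
    for t, ((s, f), x) in enumerate(zip(path, weights)):
        L_s, Sigma_s, B_s, gamma_s = params[s]
        # enumerate starts at t = 0, matching the exponent t - m - 1 = -1 at t = m
        total += rho ** (t - 1) * period_utility(x, x_prev, L_s, f, Sigma_s, B_s, gamma_s)
        x_prev = x
    return total
```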


3.7.3 Stochastic Programming

A popular and very effective approach for optimising decisions based on stochastic behaviour (such as portfolio returns) is stochastic programming (SP). By modelling the objective function as the expectation of the power-utility function,

$$\mathrm{E}[U(X(t), \gamma_{S(t)})] = \begin{cases} \mathrm{E}\left[ \dfrac{X(t)^{\gamma_{S(t)}}}{\gamma_{S(t)}} \right], & \text{if } \gamma_{S(t)} \neq 0,\ \gamma_{S(t)} \leq 1, \\ \mathrm{E}\left[ \log(X(t)) \right], & \text{if } \gamma_{S(t)} = 0, \end{cases} \qquad (3.69)$$

where $X(t)$ is the investor's wealth at time $t$ and $\gamma_{S(t)}$ is the investor's risk-aversion coefficient, we obtain a model that takes both expected value and risk into account, and that is configurable with respect to the risk level $\gamma_{S(t)}$. Choosing $\gamma_{S(t)} = 0$ yields the special case of the log-utility function. Luenberger (1997) shows that with the log-utility function, the investor maximises the expected growth rate. Note that this strategy implies that the investor takes substantial risk, and will every now and then lose money. It is only in the long run that the investor is guaranteed a maximum growth rate, so $\gamma_{S(t)} = 0$ is recommended for long-term investors.
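A direct transcription of (3.69), with the expectation taken as a sample mean over simulated wealths (the names are ours):

```python
import numpy as np

def expected_power_utility(wealth, gamma):
    """wealth: array of simulated wealths X(t); gamma <= 1 is the risk
    aversion coefficient; gamma = 0 selects the log-utility special case."""
    w = np.asarray(wealth, dtype=float)
    if gamma == 0:
        return np.mean(np.log(w))
    return np.mean(w ** gamma / gamma)
```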

In order to successfully find optimal portfolio weights, it is imperative that a good estimate of expected future returns is used. The factor models used for estimating expected future returns will be combined with Monte Carlo simulation and antithetic sampling³ in order to generate scenarios for the SP solver. Monte Carlo simulation is a method in which many sample paths are generated, and the average over these sample paths is taken as the estimate of the future expected value. Together with antithetic sampling, the variance of the simulations is also reduced, yielding a more confident value each time the expectation is calculated. Antithetic sampling means that for each sample $c + X$ drawn from a distribution with centre $c$, the sample $c - X$ is added to the set of drawn samples as well, which is why antithetic sampling is only possible for symmetric distributions.
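A small sketch of antithetic sampling for a Gaussian scenario generator (names are hypothetical): every draw $c + X$ is paired with its mirror image $c - X$, so the pair average is exactly the centre.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def antithetic_normal(centre, cov, n_pairs):
    """Return 2 * n_pairs scenarios; each draw is mirrored around the centre."""
    X = rng.multivariate_normal(np.zeros(len(centre)), cov, size=n_pairs)
    return np.vstack([centre + X, centre - X])   # mirrored pairs share one X
```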

By performing the above simulation and taking the expected value over the sample paths, we obtain a corresponding deterministic problem, which we can solve with conventional methods. However, it is also important that the method is based on an effective algorithm, in order to solve the problem in a reasonable amount of time. An example of an efficient class of algorithms is the primal-dual interior point methods. Such a solver uses the structure in the objective function as well as in the conditions in order to significantly reduce the number of operations needed to find optimal solutions.

³ Latin Hypercube sampling would be better but unfortunately cannot be used. Latin Hypercube sampling means that the value set of the CDF is split into h equidistant subsets, where one sample is drawn from each subset, in order to guarantee a better representation of the entire probability distribution. However, since we are drawing samples with a copula, Latin Hypercube sampling is not feasible.

3.8 Numerical methods

In order to find the solutions to some of the equations introduced in the theory, numerical methods are necessary. We herein introduce methods suitable for solving these problems. We first introduce Newton and related methods, which are suitable for solving systems of equations. Thereafter we introduce primal-dual interior point methods, which are used to solve optimisation problems.

3.8.1 Newton Methods

Deuflhard (2011) provides extensive coverage of Newton methods, which are used to solve general non-linear equations $f(x) = 0$. They are often based on the ordinary Newton method, which involves linearising $f(x)$ at a starting point $x^{(0)}$ by evaluating the Jacobian $J(x^{(0)})$, and finding where that linear approximation equals zero; the solution $x^{(1)}$ is then used as the starting point for the next iteration. This continues until a stopping criterion is fulfilled, usually defined as the update being "small enough" or the objective function value being "close enough" to zero.

The ordinary Newton method is described by the following algorithm.

0. Choose an initial point $x^{(k)}$, $k = 0$.

1. Calculate the Jacobian $J(x)$ at $x = x^{(k)}$.

2. Find the point $x^{(k+1)}$ for the next iteration by first solving $J(x^{(k)}) \Delta x^{(k)} = -f(x^{(k)})$ for $\Delta x^{(k)}$, where $\Delta x^{(k)} = x^{(k+1)} - x^{(k)}$, and then calculating $x^{(k+1)}$ from there.

3. Check the stopping criterion. If $\Delta x^{(k)}$ is small enough according to some tolerance level, stop the algorithm and use $x^{(k+1)}$ as the solution. Otherwise, set $k := k + 1$ and go to step 1.

In the special case where the equation only has one variable, the method is called Newton-Raphson.
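A compact implementation of the algorithm above (an illustrative sketch with no safeguards such as damping or line search):

```python
import numpy as np

def newton(f, J, x0, tol=1e-10, max_iter=100):
    """Solve f(x) = 0 with the ordinary Newton method.
    f: R^n -> R^n; J: x -> the n x n Jacobian at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        dx = np.linalg.solve(J(x), -f(x))   # step 2: J(x) dx = -f(x)
        x = x + dx
        if np.linalg.norm(dx) < tol:        # step 3: update "small enough"
            break
    return x

# Scalar Newton-Raphson example: the positive root of x^2 - 2 = 0
root = newton(lambda x: np.array([x[0] ** 2 - 2.0]),
              lambda x: np.array([[2.0 * x[0]]]),
              x0=[1.0])   # converges to sqrt(2)
```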

When the Jacobian does not exist or takes too many resources to compute numerically, a heuristic method can be used instead, where a heuristic function $M(x)$, which approximates the Jacobian but is computationally much cheaper to evaluate, is used in place of $J(x)$. Otherwise the procedure remains the same.

3.8.2 Primal-dual interior point method

Primal-dual interior point (PDIP) methods are an efficient class of algorithms that can solve large-scale optimisation problems, such as the one presented in this thesis. Consider a general optimisation problem

$$\begin{aligned} \min_x\quad & f(x) \\ \text{subject to}\quad & Ax - a = 0, \\ & Bx - b \preceq 0, \end{aligned} \qquad (3.70)$$

where $f(x)$ is a non-linear function and $\preceq, \succeq$ are element-wise inequality operators. Interior point methods rely on adding a barrier function, such as a logarithmic barrier, which penalises points that are too close to the infeasible set given by the inequality condition. Let us first introduce the slack variable $s$ for the inequality condition:

$$\begin{aligned} \min_x\quad & f(x) \\ \text{subject to}\quad & Ax - a = 0, \\ & Bx - b + s = 0, \\ & s \geq 0. \end{aligned} \qquad (3.71)$$

Adding a barrier function to the objective function, the problem in (3.71) becomes

$$\begin{aligned} \min_x\quad & f(x) - \mathbf{1}^\top \log(s) \\ \text{subject to}\quad & Ax - a = 0, \\ & Bx - b + s = 0, \end{aligned} \qquad (3.72)$$

where $\mathbf{1}$ is a vector of ones of appropriate size. Notice that as $s$ approaches 0, the barrier function gets increasingly large and diverges. If we impose a (small) coefficient $\mu$ on the barrier function, we can reduce the effect it has on the objective function, although it will still diverge as $s$ approaches 0. We rewrite (3.72) as

$$\begin{aligned} \min_x\quad & f(x) - \mu \mathbf{1}^\top \log(s) \\ \text{subject to}\quad & Ax - a = 0, \\ & Bx - b + s = 0, \end{aligned} \qquad (3.73)$$


where $\mu$ is chosen small in order to reduce the effect of the barrier function in the feasible set, while still penalising values close to the infeasible set given by the inequality condition (Boyd and Vandenberghe, 2004).

Now we introduce the concept of Lagrangian relaxation. If we modify (3.73) by moving the constraints into the objective function and attaching a weight coefficient to each term, we get the relaxed optimisation problem

$$\min_x\ L(x, s, \nu_a, \nu_b) = f(x) - \mu \mathbf{1}^\top \log(s) - \nu_a^\top (Ax - a) - \nu_b^\top (Bx + s - b), \qquad (3.74)$$

where $L$ is the Lagrangian function for (3.73), $\nu_a, \nu_b$ are dual variables, $s$ is a slack variable and $\mu$ is the barrier coefficient. Since we already used a barrier function to relax the inequality constraint, here we have only performed a Lagrangian relaxation of the equality conditions (Boyd and Vandenberghe, 2004).

The problem in (3.74) is now a non-linear, unconstrained optimisation problem, which can be solved by finding the solutions to the Karush-Kuhn-Tucker (KKT) conditions; these are sufficient for a pair of primal and dual optimal points (Boyd and Vandenberghe, 2004).

The KKT conditions are first-order conditions, and in the case of a logarithmic barrier the complementary slackness condition differs slightly from the usual one, as we will see shortly. The conditions for (3.74) can be derived from the following equations:

$$\begin{aligned} \text{Primal feasibility}\quad & \begin{cases} \nabla_{\nu_a} L(x, s, \nu_a, \nu_b) = 0 \\ \nabla_{\nu_b} L(x, s, \nu_a, \nu_b) = 0 \end{cases} \\ \text{Dual feasibility}\quad & \nabla_x L(x, s, \nu_a, \nu_b) = 0 \\ \text{Complementary slackness}\quad & \nabla_s L(x, s, \nu_a, \nu_b) = 0, \end{aligned} \qquad (3.75)$$

which in the case of (3.74) become

$$\begin{cases} Ax - a = 0 \\ Bx + s - b = 0 \\ \nabla_x f(x) - A^\top \nu_a - B^\top \nu_b = 0 \\ \mu \mathbf{1} + \nu_b \circ s = 0, \end{cases} \qquad (3.76)$$

where $\circ$ is the Hadamard product (element-wise multiplication). The last condition differs from the usual complementarity condition, $\nu_b \circ s = 0$, because we imposed a barrier function for our slack variable. Note that as $\mu$ tends towards zero, the complementary slackness condition approaches $\nu_b \circ s = 0$.
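To make the machinery concrete, the sketch below applies one common PDIP scheme to the quadratic case $f(x) = \frac{1}{2} x^\top Q x + c^\top x$: Newton's method on the system (3.76), a fraction-to-boundary step keeping $s > 0$ and $\nu_b < 0$, and a geometric reduction of $\mu$. It is a bare-bones illustration under those assumptions, not the solver used in this thesis.

```python
import numpy as np

def pdip_qp(Q, c, A, a, B, b, tol=1e-8, max_iter=50):
    """Primal-dual interior point sketch for
       min 0.5 x'Qx + c'x  subject to  Ax = a,  Bx <= b,
    iterating Newton steps on the KKT system (3.76). Illustrative only."""
    n, me, mi = len(c), len(a), len(b)
    x, s = np.zeros(n), np.ones(mi)            # keep the slack s > 0
    nu_a, nu_b = np.zeros(me), -np.ones(mi)    # keep nu_b < 0, cf. (3.76)
    mu = 1.0                                   # barrier coefficient
    Z = np.zeros
    for _ in range(max_iter):
        # Residuals of (3.76), with grad f(x) = Qx + c
        r1 = A @ x - a
        r2 = B @ x - b + s
        r3 = Q @ x + c - A.T @ nu_a - B.T @ nu_b
        r4 = mu + nu_b * s
        if max(np.linalg.norm(r, np.inf) for r in (r1, r2, r3, r4)) < tol and mu < tol:
            break
        # Newton system in (dx, ds, dnu_a, dnu_b)
        K = np.block([
            [Q,           Z((n, mi)),    -A.T,        -B.T        ],
            [A,           Z((me, mi)),   Z((me, me)), Z((me, mi)) ],
            [B,           np.eye(mi),    Z((mi, me)), Z((mi, mi)) ],
            [Z((mi, n)),  np.diag(nu_b), Z((mi, me)), np.diag(s)  ],
        ])
        d = np.linalg.solve(K, -np.concatenate([r3, r1, r2, r4]))
        dx, ds = d[:n], d[n:n + mi]
        dnu_a, dnu_b = d[n + mi:n + mi + me], d[n + mi + me:]
        # Fraction-to-boundary step: stay strictly inside s > 0, nu_b < 0
        alpha = 1.0
        for v, dv in ((s, ds), (-nu_b, -dnu_b)):
            neg = dv < 0
            if neg.any():
                alpha = min(alpha, 0.995 * np.min(-v[neg] / dv[neg]))
        x, s = x + alpha * dx, s + alpha * ds
        nu_a, nu_b = nu_a + alpha * dnu_a, nu_b + alpha * dnu_b
        mu *= 0.2                              # drive the barrier towards zero
    return x
```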
