
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

Forecasting foreign exchange rates with large regularised factor models

JESPER WELANDER



Master's Thesis in Financial Mathematics (30 ECTS credits)
Master Programme in Mathematics (120 credits)
Royal Institute of Technology, year 2016
Supervisor at Lynx Asset Management: Tobias Rydén

Supervisor at KTH: Jimmy Olsson
Examiner: Boualem Djehiche

TRITA-MAT-E 2016:63 ISRN-KTH/MAT/E--16/63--SE

Royal Institute of Technology

School of Engineering Sciences


Abstract


Sammanfattning

Vector autoregressive (VAR) models for time series analysis of high-dimensional data tend to suffer from overparameterisation, since the number of parameters in the models grows quadratically with the number of included predictors. In these cases, lower-dimensional structural assumptions are often imposed through factor models or regularisation. Factor models reduce the dimension of the model by projecting the observations onto a lower-dimensional subspace of common factors and may be preferred if the predictors are collinear. Regularisation reduces overfitting by penalising certain properties of the model's estimated parameters and may be preferred when, for example, only a smaller number of predictors are assumed to be important.


Acknowledgements

I am very grateful to Tobias Rydén, my supervisor at Lynx Asset Management, for introducing me to the topic and for his invaluable support and patience. Throughout the process, Tobias has provided insightful feedback and advice and inspired me to continue developing the ideas in this thesis.

Furthermore, I’d like to express my gratitude to Lynx Asset Management for their welcoming work environment, for sharing their knowledge and time and for providing the required computational resources. Without them, this thesis would not have been possible.


Contents

1 Introduction
2 VAR models and variable selection
  2.1 Vector autoregressive (VAR) models
  2.2 Static factor models
  2.3 Regularised VAR estimation
3 Subspace extraction methods
  3.1 Principal component analysis
  3.2 Partial least squares
  3.3 Canonical correlation analysis
  3.4 Continuum regression
  3.5 Dynamic PLS
  3.6 Extending the dynamic approach to generalised feature extraction
  3.7 Dynamic CCA
4 Regularisation terms
  4.1 Ridge penalisation
  4.2 Lasso penalisation
  4.3 Structured variants of the lasso penalty
5 The regularised factor model
  5.1 The model
  5.2 Motivation
  5.3 Implementation
6 Performance evaluation
  6.1 Data description
  6.2 Data preparation
  6.3 Forecast evaluation
  6.4 Parameter selection
7 Empirical results
  7.1 Foreign exchange rate prediction without regularisation
  7.2 Foreign exchange rate prediction with the regularised factor model
8 Conclusion
Bibliography
Appendices
A Optimisation using ADMM
B Algorithms
  B.1 Lasso and its structured variants
  B.2 PLS
  B.3 Continuum power regression
  B.4 Dynamic PLS
  B.5 Dynamic CCA
C Data, supplementary information
D Portfolio evaluation
  D.1 Portfolio strategy
  D.2 Markowitz portfolio selection
  D.3 Multi period portfolio selection with transaction costs


Chapter 1

Introduction

We consider a vector autoregressive (VAR) model with lag q for h steps ahead forecasting,

$$y_{t+h} = A_1^T y_t + A_2^T y_{t-1} + \ldots + A_q^T y_{t-q+1} + \varepsilon_t, \quad t = 1, \ldots, T, \qquad (1.1)$$

where $y_t = \{y_{ti}\}_{i=1}^p$ are vectors of observations at time t for p different predictors and $A_i \in \mathbb{R}^{p \times p}$ are matrices with model parameters (transition matrices). All else being equal it is reasonable to assume that a forecast of a certain time series could be improved by increasing the number of correlated predictors p in the model. More information should give improved predictions. However, the parameter space for estimating the transition matrices in the VAR model increases quadratically with the number of included predictors ($qp^2$). As a direct result, the number of observations required for estimation, n, grows by the same relative proportion and there could be some value in restricting the solution space of the model. Two commonly used methods for estimating large autoregressive models are dimensional reduction by projection onto latent factors and regularised VAR estimation by penalisation of the transition matrices $A_i$.

The dimension reduction approach is based on the assumption that there exist some common factors or features, z, of the different time series that explain some of the predictor dynamics. In these cases, each observation may be split into common shocks and idiosyncratic components by, for example, projecting the observations onto a latent subspace that approximates these common factors. If this projection is linear there will exist a matrix $G \in \mathbb{R}^{p \times d}$ that projects lagged variables in the time series to a d-dimensional subspace of features such that

$$z_i = G^T y_i. \qquad (1.2)$$

Hence, our factor model would take the form

$$y_{t+h} = B_1^T z_t + B_2^T z_{t-1} + \ldots + B_q^T z_{t-q+1} + \varepsilon_t, \qquad (1.3)$$

where $B_i \in \mathbb{R}^{d \times p}$. Approximate factors in high-dimensional settings may, for example, be estimated by principal components [32]. Apart from reducing the dimension of the model, the separation of idiosyncratic and systematic effects in these models may aid investigation of impacts of market-wide systematic shocks and reduce some of the noise in the time series. Additionally, investigating underlying factors rather than the realised time series may increase the understanding of the modelled financial system.

As an alternative to dimension reduction, Basu et al. [5] and Davis et al. [11], among others, have investigated regularisation to reduce the model solution space. That is, given that the model (1.1) may be estimated by minimising a loss function L(A), a regularised optimisation objective may take the form

$$\min_A \; L(A; Y) + \lambda R(A), \qquad (1.4)$$

where R(A) is a regularisation term that penalises certain characteristics of the model transition matrices and $Y = \{y_t\}_{t=1}^T$. For instance, if only certain variables are assumed important or relevant for prediction it might be suitable to penalise dense solutions to encourage sparsity. Apart from reducing the propensity to overfit, regularisation terms may utilise prior knowledge about the solution structure, for instance knowledge that certain series are uncorrelated or knowledge of the time dependence of lagged variables.

Both approaches have drawbacks. Notably, factor models are generally sensitive to model specifications and sufficient dimension reduction in high-dimensional settings may cause information loss, while the regularised VAR estimation may be sensitive to noise and collinearity in the data and may be computationally complex. To alleviate these problems, while hopefully retaining some of their benefits, a two-step estimation procedure for high-dimensional VAR models is proposed. Firstly, the solution space is reduced by projecting the predictor variables onto lower-dimensional common factors. Secondly, a time series model with the constructed factors is estimated with regularisation by e.g. the lasso.

The proposed method is related to the methods of both Davis et al. [11] and Paul et al. [28]. Both methods promote sparsity through a two-step approach, the differences being that Davis et al. focus on screening for variable selection and that Paul et al. focus on i.i.d. (non time-series) observations.

It is not certain that maximising variance, as in principal component analysis, one of the more commonly used methods for approximate factor analysis, will provide the best subspace for predictive purposes, and it is possible that factor model forecasts might be improved by using supervised methods where the relation of the predictors to the response is taken into account.

In the second step of the proposed method, we will use methods that do not impose any structural knowledge such as the ridge and lasso regularisations but also methods that force the weights to conform to a priori structural knowledge. These methods include the fused lasso that penalises differences of sequential coefficient weights of lagged covariates in the transition matrices of the time series model and the group lasso that penalises groups of variables.

The results in this thesis are divided into two parts. Firstly, we investigate whether the predictive performance of the approximate factor model may be improved by using supervised subspace projections that take the response variables into account. Secondly, we investigate the combination of dimension reduction with regularisation in order to reduce overfitting and encourage sparsity (including structural sparsity). The prediction performance is estimated by forecasting foreign exchange rate returns (daily, weekly and monthly forecasts).

There are three main contributions in this thesis. Firstly, we show that forecasts, at least in the case of foreign exchange rates, may be improved by making the latent subspace conditional on the output. Secondly, we generalise earlier results on a variant of PLS with dynamic coefficient weights and show that this dynamic approach might improve subspace extraction. Thirdly, we show that there may be benefits of combining factor models with regularised time series estimation.

The thesis is structured as follows. First, some background of VAR as well as factor models and their estimation will be provided along with a brief overview of current research within this area. Then we will discuss some of the subspace projection methods and regularisation terms that will be used to test the proposed model. Subsequently, we will describe our proposed model, the data set and method used for performance evaluation before presenting our empirical results.

Delimitations

This thesis is limited to static factor models. Since static factor model forecasts have been found to be similar to the performance of dynamic models [6], the static model limitation should not restrict performance significantly.

Furthermore, both the fields of subspace projection and regularised estimation are very active and there exists a large number of methods in each field. Rather than finding the best projection and the best regularisation, the purpose of this thesis is to investigate the combination of projection and regularisation in a time series setting. For this reason, neither the theoretical background nor the analysis will be exhaustive. We will only consider some well-known methods that represent a variety of different objectives to highlight their differences. Notably, we will not include the elastic net alternative to the regular lasso method. While the elastic net mitigates the problem of lasso-based regularisers with collinear data [43], we will investigate whether using subspace projections may reduce the problem with collinearity instead. Additionally, adding an l2 penalisation term to the lasso-based regularisers would increase the number of hyperparameters that need to be calibrated.


Chapter 2

VAR models and variable selection

This chapter begins by describing the general vector autoregressive model. We then present the factor model and the regularised VAR estimation approaches to high-dimensional time series analysis.

2.1 Vector autoregressive (VAR) models

Let us define a p-dimensional vector autoregressive series with lag q, VAR(q), for h steps ahead prediction as the series satisfying

$$y_{t+h} = A_1^T y_t + A_2^T y_{t-1} + \ldots + A_q^T y_{t-q+1} + \varepsilon_t, \quad t = 1, \ldots, T, \qquad (2.1)$$

where $y_t = \{y_{ti}\}_{i=1}^p$ are (column) vectors of observations, $A_i \in \mathbb{R}^{p \times p}$ are transition matrices that capture the temporal relationship between variables and $\varepsilon_t$ is a p-dimensional white noise process with zero mean and nonsingular covariance matrix $\Sigma_u$. Alternatively, with mean adjusted data the above model may be represented by,

$$Y = X_g A + E, \qquad (2.2)$$

where Y is the matrix of stacked responses, $X_g$ is the matrix of lagged observations, A is the matrix of stacked transition matrices and E is the matrix of noise terms.

In a general case, under the assumption that the process mean is known and that the data is demeaned, the VAR model may be estimated by least squares [27]. Using representation (2.2) a least squares estimation is found by

$$\min_A \; ||Y - X_g A||_F, \qquad (2.3)$$

where $||B||_F^2 = \mathrm{trace}(B^* B)$ denotes the Frobenius norm of a matrix B. Under the assumption of a stable process and if E represents white noise, Lütkepohl [27] shows that the asymptotic properties of the least squares estimate are

$$\sqrt{T}\left(\mathrm{vec}(\hat{A}) - \mathrm{vec}(A)\right) \xrightarrow{d} \mathcal{N}\left(0, (E[Y Y^T])^{-1} \otimes \Sigma_\varepsilon\right),$$

where the definition of a stable VAR model is given below along with the definition of the closely linked stationarity property. In the above equation, vec denotes the linear transformation of the matrix to a column vector.

Definition 1. (VAR stability) The VAR process is stable if

$$\det A(z) = \det(I_K - A_1 z - \ldots - A_p z^p) \neq 0 \quad \forall \, z \in \mathbb{C}, \ |z| \leq 1.$$

Definition 2. (Stationarity) A process is (weakly) stationary if its first and second moments do not change over time, that is

$$E[y_t] = \mu \quad \forall t,$$
$$E[(y_t - \mu)(y_{t-h} - \mu)^T] = \Gamma_y(h) = \Gamma_y(-h)^T.$$

Any stable VAR(p) process $y_t$ is stationary [27].
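To make the least squares estimation in (2.3) concrete, the following is a minimal NumPy sketch that stacks the lagged observations and solves the Frobenius-norm problem directly; the function names and the simulated data are illustrative and not taken from the thesis.

```python
import numpy as np

def build_lagged(y, q, h):
    """Stack q lags of the series y (shape T x p) into a regressor matrix.

    Returns X_g with rows [y_t, y_{t-1}, ..., y_{t-q+1}] and the matching
    h-step-ahead responses Y.
    """
    T, p = y.shape
    rows, targets = [], []
    for t in range(q - 1, T - h):
        rows.append(np.concatenate([y[t - i] for i in range(q)]))
        targets.append(y[t + h])
    return np.array(rows), np.array(targets)

def var_least_squares(y, q, h=1):
    """Least squares estimate of the VAR(q) transition matrices, cf. (2.3)."""
    X_g, Y = build_lagged(y, q, h)
    A, *_ = np.linalg.lstsq(X_g, Y, rcond=None)   # minimises ||Y - X_g A||_F
    return A                                      # shape (q*p, p): stacked A_i

# Example with simulated data: 30 predictors, 5 lags, 1-step-ahead forecasts.
rng = np.random.default_rng(0)
y = rng.standard_normal((500, 30))
A_hat = var_least_squares(y, q=5, h=1)
y_pred = build_lagged(y, 5, 1)[0] @ A_hat
```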

2.2 Static factor models

Factor models are based on the assumption that the constructed panel of time series data contains information about some unobservable common features. That is, for a process $\{y_{it} \mid i \in \mathbb{N}, t \in \mathbb{Z}\}$, we assume that there exists a mutually orthogonal decomposition

$$y_{it} = \chi_{it} + \xi_{it}, \qquad (2.4)$$

where $\chi$ denotes the common and $\xi$ the idiosyncratic part. The above decomposition is denoted the general dynamic factor representation of $y_{it}$ [18].

Under the assumption that the common component may be described as a linear combination of factors F, we obtain the general static factor model,

$$\begin{cases} \chi_{it} = b_{i1} F_{1t} + \ldots + b_{ir} F_{rt} \\ F_t = A_1 F_{t-1} + \ldots + A_p F_{t-p} + R U_t \end{cases} \qquad (2.5)$$

where $U_t$ represents white noise, A are transition matrices describing a stationary autoregression and R is a matrix of rank q [18]. The term static refers to the common component being contemporaneously added. That is, the latent variables are estimated using loading constants (in the time domain) as opposed to loading filters (in the frequency domain) as in the dynamic factor model [2]. Note that the usage of the term factor model varies and that we use the term in a wide sense. For small models, exact static factors may be estimated using Gaussian maximum likelihood or by the method of moments [34]. For higher dimensional models, factors may be estimated by approximate methods. Stock and Watson show that principal component analysis (PCA), one of the more frequently used methods for factor estimation, provides consistent estimates of the factors for large numbers of predictor variables and observations, $p, T \to \infty$ [33]. Additionally, PCA provides stable factor estimates in cases with structural instability such as regime shifts [36].

The principal component analysis method for determining the factors in a factor model is unsupervised. That is, the factor estimation does not take the response variables into account. By being independent of the variables we want to forecast, unsupervised factors may not necessarily be the most relevant factors for prediction [1] and it is possible that prediction performance could be improved by supervised factor estimation. A large number of supervised approaches have been investigated in earlier literature but two in particular are interesting for the purposes of this thesis. Firstly, among other authors, Bair et al. [3] as well as Bai and Ng [1] consider supervised principal component analysis to extract more relevant factors, where the authors select subsets of variables that are deemed more predictive. The less predictive variables are downweighted or discarded before the PCA step. Secondly, as an alternative to PCA, Chun and Keles [10] as well as Fuentes et al. [16] extract factors with high covariance to the output variable through partial least squares (PLS).

Regarding stacked models (models where multiple lags of the dependent variables are incorporated into one lag of factors), authors such as Stock and Watson have found that in the case of principal component based factors, stacked models tend to perform worse than models that estimate factors based on contemporaneous values of the predictors [33][35]. While supervised methods seem less sensitive to stacked specifications, they may also perform less than optimally with long lag series and Fuentes et al. [16] have found that sparsity assumptions in the PLS factor extraction may improve prediction performance.

2.3 Regularised VAR estimation

In high-dimensional settings, the least squares estimation of the VAR model may become unstable due to overfitting. The regularised VAR estimation attempts to alleviate the problem of overfitting by penalising certain features during the model estimation. In the case of lasso regularisation, for example, where the absolute values of parameter estimates are penalised, the underlying assumption is that the model contains several unimportant predictors that could be set to zero. Consequently, the lasso penalty promotes sparsity in the model.

Given observations $Y = \{y_t\}_{t=1}^T$, a convex loss function $L : \mathbb{R}^n \times \mathcal{Y}^n \to \mathbb{R}$ and a regularisation term $R : \mathbb{R}^n \to \mathbb{R}$, a general regularised multivariate estimation may be formulated as solving,

$$\min_A \; L(A; Y) + \lambda R(A), \qquad (2.6)$$

where A is a matrix of model parameters and $\lambda > 0$ is a constant indicating the desired degree of penalisation.

A number of studies have investigated regularised estimation of high-dimensional VAR models. For instance, Hsu et al. investigate VAR estimation with subset selection by the lasso regulariser [20]. Similarly, Davis et al. estimate a lasso-regularised VAR with a least likelihood loss function [11]. Furthermore, Song and Bickel investigate regularised VAR estimation using the group lasso regulariser that imposes structural sparsity [31]. Lastly, Li et al. constructed additive combinations of dynamic factor models with lasso estimates in order to find a solution corresponding to a combination of a low-rank and a sparse transition matrix [24].


Chapter 3

Subspace extraction methods

In this chapter we will present some frequently used methods of extracting common feature subspaces. Furthermore, we discuss a dynamic variant of the partial least squares method and find a generalisation of the dynamic approach that is used to construct a dynamic variant of canonical correlation analysis.

3.1 Principal component analysis

Principal component analysis (PCA) is an unsupervised non-parametric linear method for dimensionality reduction that is based on finding feature directions that maximise variance [15]. That is, feature directions $g_k$ are found through

$$\begin{aligned} \max_{g_k} \quad & g_k^T X^T X g_k \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k, \end{aligned} \qquad (3.1)$$

where $X \in \mathbb{R}^{T \times p}$ is an input matrix with p predictors that are observed at T different times,

$$X = \begin{bmatrix} y_T^T \\ \vdots \\ y_1^T \end{bmatrix}.$$

The conditions on $g_l$ are used for extracting multiple principal components and make sure that the kth principal component is orthogonal to any earlier components [19]. The optimal feature directions are the eigenvectors of $X^T X$, where $g_k$ is the normalised eigenvector corresponding to the kth largest eigenvalue [8].
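As an illustration of (3.1), the feature directions can be computed from the eigendecomposition of $X^T X$. A minimal sketch, assuming demeaned data; the function name is illustrative.

```python
import numpy as np

def pca_directions(X, d):
    """Return the d leading PCA feature directions of X (T x p), cf. (3.1).

    The columns of G are the normalised eigenvectors of X^T X ordered by
    decreasing eigenvalue, i.e. the directions of maximal variance.
    """
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]          # d largest
    return eigvecs[:, order]                       # shape (p, d)

# The factors (latent variables) are the projections z_t = G^T y_t.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 30))
G = pca_directions(X - X.mean(axis=0), d=3)
Z = X @ G                                          # T x d factor series
```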


3.2 Partial least squares

Similarly to PCA, the partial least squares (PLS) method uses orthogonal projection to latent variables, but PLS is supervised and maximises the covariance with an output matrix Y [29]. That is, PLS finds feature directions, $g_k$, by solving

$$\begin{aligned} \max_{g_k, h_k} \quad & g_k^T X_g^T Y h_k \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & h_k^T h_k = 1 \\ & g_k^T X_g^T X_g g_l = 0, \quad l < k. \end{aligned} \qquad (3.2)$$

Note that (3.2) is the SIMPLS implementation of PLS. One common alternative is NIPALS. While the two methods are equivalent in the univariate case, they do differ in the multivariate case. The main difference for the purposes of this thesis is that SIMPLS maximises the multivariate covariance while NIPALS-PLS2 (the multivariate version of NIPALS) does not [12]. For this reason, we will use and refer to the SIMPLS variant whenever the PLS method is mentioned unless explicitly defined otherwise. The SIMPLS and NIPALS algorithms are included in the appendix.

In a time series setting, the above formulation would result in all lagged observations $X_i, X_{i-1}, \ldots$ of the model being jointly projected onto one lag of factors $F_i$. Instead, we want a non-stacked model where each lag is projected to one set of factors in addition to wanting a common set of factors for each lag. That is, for each $X_i$ we want one set of corresponding factors $F_i$ such that any changes in factors over time may be analysed. In a time series setting this corresponds to the lags of a VAR(q) being projected onto a lower dimensional VAR(q) (a VAR with an equal number of lags but with fewer predictors) rather than being projected onto a VAR(1). Hence we use the modified PLS objective,

$$\begin{aligned} \max_{g_k, h_k} \quad & \sum_{i=1}^q g_k^T X_i^T Y h_k \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & h_k^T h_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k, \end{aligned} \qquad (3.3)$$

where q is the number of lags in the model. The solutions to (3.3) are feature directions that may be used to project each lagged observation onto the corresponding lagged factors.
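For the first feature direction, the covariance objective is maximised by the leading singular vectors of the cross product matrix, with the cross products of the individual lags summed as in (3.3). The sketch below only shows this first direction and omits the deflation needed for subsequent components; names are illustrative.

```python
import numpy as np

def first_pls_direction(X_lags, Y):
    """First feature direction of the modified PLS objective (3.3).

    X_lags is a list of the lagged data matrices X_1, ..., X_q (each T x p)
    and Y is the T x m response matrix. The pair (g, h) maximising
    sum_i g^T X_i^T Y h under unit-norm constraints is given by the leading
    singular vectors of the summed cross product matrix.
    """
    S = sum(Xi.T @ Y for Xi in X_lags)             # p x m cross product
    U, s, Vt = np.linalg.svd(S)
    g, h = U[:, 0], Vt[0, :]                       # leading singular pair
    return g, h

rng = np.random.default_rng(2)
X_lags = [rng.standard_normal((500, 30)) for _ in range(5)]
Y = rng.standard_normal((500, 30))
g1, h1 = first_pls_direction(X_lags, Y)
```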

3.3 Canonical correlation analysis

Canonical correlation analysis (CCA) is a supervised method that finds feature directions maximising the correlation between linear combinations of the inputs and the outputs. The feature directions, $g_k$, of CCA are found by solving

$$\begin{aligned} \max_{g_k, h_k} \quad & g_k^T X_g^T Y h_k \\ \text{subject to} \quad & g_k^T X_g^T X_g g_k = 1 \\ & h_k^T Y^T Y h_k = 1 \\ & g_k^T X_g^T X_g g_l = 0, \quad l < k \\ & h_k^T Y^T Y h_l = 0, \quad l < k. \end{aligned} \qquad (3.4)$$

Note that the $h_k$ orthogonality constraint is not strictly necessary but follows implicitly from the orthogonality constraint on $g_k$. The solution vector $g_k$ is the normalised eigenvector corresponding to the kth largest eigenvalue of the matrix [8]

$$(X^T X)^{-1} X^T Y (Y^T Y)^{-1} Y^T X,$$

while $h_k$ is the normalised eigenvector of

$$(Y^T Y)^{-1} Y^T X (X^T X)^{-1} X^T Y.$$
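The eigenvector characterisation above translates directly into code. A minimal sketch, assuming $X^T X$ and $Y^T Y$ are invertible (a small diagonal term could be added otherwise); names are illustrative.

```python
import numpy as np

def cca_directions(X, Y, d):
    """Return the d leading CCA feature directions, cf. (3.4).

    g_k is the eigenvector of (X^T X)^{-1} X^T Y (Y^T Y)^{-1} Y^T X for the
    k-th largest eigenvalue, normalised so that g_k^T X^T X g_k = 1.
    """
    Sxx, Syy, Sxy = X.T @ X, Y.T @ Y, X.T @ Y
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1][:d]
    G = eigvecs[:, order].real
    for k in range(G.shape[1]):                    # enforce g^T Sxx g = 1
        G[:, k] /= np.sqrt(G[:, k] @ Sxx @ G[:, k])
    return G

rng = np.random.default_rng(3)
X, Y = rng.standard_normal((500, 30)), rng.standard_normal((500, 30))
G = cca_directions(X, Y, d=3)
```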

Similarly to the time series modification of PLS, equally weighted contemporaneous CCA feature directions may be estimated by,

$$\begin{aligned} \max_{g_k, h_k} \quad & \sum_{i=1}^q g_k^T X_i^T Y h_k \\ \text{subject to} \quad & g_k^T X^T X g_k = 1 \\ & h_k^T Y^T Y h_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k \\ & h_k^T Y^T Y h_l = 0, \quad l < k. \end{aligned} \qquad (3.5)$$

3.4 Continuum regression

Given the similarity between the PCA and PLS regressions, Stone et al. [37] proposed continuum regression (CR) as a dynamic combination of OLS, PCA and PLS through varying the desired weights between the correlation and variance terms. Their objective may be written on the form

$$\begin{aligned} \max_{g_k, h_k} \quad & g_k^T X^T Y h_k \, (g_k^T X^T X g_k)^{\alpha/(1-\alpha)} \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & h_k^T h_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k. \end{aligned} \qquad (3.7)$$

Hence, $\alpha = 0$ gives OLS, $\alpha = 0.5$ gives PLS, and $\alpha \to 1$ will give PCA. Using $X_w = \sum_{i=1}^q X_i$, the equally weighted contemporaneous extension of CR is given by,

$$\begin{aligned} \max_{g_k, h_k} \quad & g_k^T X_w^T Y h_k \, (g_k^T X_w^T X_w g_k)^{\alpha/(1-\alpha)} \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & h_k^T h_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k. \end{aligned} \qquad (3.8)$$

The Stone et al. formulation of CR is rather computationally complex to estimate. As a response, De Jong et al. proposed the continuum power regression [13], which passes through OLS, PLS and PCR similarly to (3.7). However, the continuum power regression is just an approximation of the above objective and the exact path between the three regression variants OLS, PLS and PCR differs. A thorough comparison is considered outside the scope of this thesis and may be found in [22]. Since we cannot determine that the path in the original formulation is better than the power variant and since the continuum power regression formulation is more computationally efficient, we will only use and refer to that method whenever CR is mentioned. The continuum power regression algorithm is similar to the SIMPLS algorithm and is included in the appendix.

3.5 Dynamic PLS

In the above supervised subspace extraction methods, the relation between the first lagged predictors and the response variable as well as the relation between the last lagged predictors and the response will be considered equally important. Under the assumption that different lags will have different explanatory power (more recent lags might have more explanatory power than less recent lags) an equally weighted subspace will risk discarding potentially relevant information. For this reason, Li et al. introduce a dynamic extension of PLS (DPLS) that captures this time-varying dependency between the predictors and the response variables [23].

The dynamic variant is constructed as a modification of the outer PLS model,

$$\begin{aligned} \max_{g_k, h_k, \beta_k} \quad & \left( g_k^T X_1^T \beta_1 + \ldots + g_k^T X_q^T \beta_q \right) Y h_k \\ \text{subject to} \quad & g_k^T g_k = 1 \\ & h_k^T h_k = 1 \\ & \beta_1^2 + \beta_2^2 + \ldots + \beta_q^2 = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k. \end{aligned} \qquad (3.9)$$

The weights $\beta_i$ result in a contemporaneous projection G that may describe the relation to the output better than equally weighting all lagged covariates and that does not require some a priori assumption about the time dependence of the latent variables. Li et al. solve the modified outer model through an iterative search. For each iteration, $g_k$ and $\beta_k$ are estimated through the eigenproblems,

$$(I_q \otimes g_k)^T X_g^T Y Y^T X_g (I_q \otimes g_k) \beta_k = \lambda_\beta \lambda_c \beta_k \qquad (3.10)$$
$$(\beta_k \otimes I_q)^T X_g^T Y Y^T X_g (\beta_k \otimes I_q) g_k = \lambda_g \lambda_c g_k. \qquad (3.11)$$

That is, starting with a random and normalised $g_k$, the eigenproblems are successively solved until $\beta_k$ and $g_k$ converge. Apart from this iterative step, Li et al. base their dynamic PLS algorithm on NIPALS. Since NIPALS does not extract subspaces that maximise covariance in the multivariate case, we modify the DPLS algorithm to correspond more closely to the SIMPLS algorithm. The modification changes the deflation in the original algorithm. Instead of directly calculating the deflated data matrices for $X_g$ and Y, an orthogonal projector is used to deflate the cross product matrix $S = X_g^T Y$. Both the original NIPALS based DPLS and our modified algorithm are added to the appendix.
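A minimal sketch of the alternating eigenproblem iteration (3.10)-(3.11) for the first dynamic direction is given below. It assumes that the columns of $X_g$ are ordered lag by lag and, for dimensional consistency, uses $\beta_k \otimes I_p$ in the second eigenproblem; it omits the SIMPLS-style deflation discussed above, and the names are illustrative.

```python
import numpy as np

def dpls_first_direction(X_g, Y, q, p, n_iter=100, tol=1e-8, seed=0):
    """First dynamic PLS direction via the alternating eigenproblems
    (3.10)-(3.11). X_g (T x q*p) holds the lags ordered [X_1, ..., X_q]."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(p)
    g /= np.linalg.norm(g)
    C = X_g.T @ Y                                   # cross product (q*p, m)
    M = C @ C.T                                     # X_g^T Y Y^T X_g
    for _ in range(n_iter):
        g_old = g.copy()
        # eigenproblem for beta given g, cf. (3.10)
        Pg = np.kron(np.eye(q), g[:, None])         # (q*p, q)
        _, vecs = np.linalg.eigh(Pg.T @ M @ Pg)
        beta = vecs[:, -1]                          # largest eigenvalue
        # eigenproblem for g given beta, cf. (3.11) with beta kron I_p
        Pb = np.kron(beta[:, None], np.eye(p))      # (q*p, p)
        _, vecs = np.linalg.eigh(Pb.T @ M @ Pb)
        g = vecs[:, -1]
        # stop when g has converged (up to sign flips of the eigenvector)
        if min(np.linalg.norm(g - g_old), np.linalg.norm(g + g_old)) < tol:
            break
    return g, beta

rng = np.random.default_rng(10)
q, p = 5, 30
X_g = rng.standard_normal((500, q * p))
Y = rng.standard_normal((500, 30))
g1, beta1 = dpls_first_direction(X_g, Y, q, p)
```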

3.6 Extending the dynamic approach to generalised feature extraction

To investigate the dynamic approach for other subspace projection methods we show that the iterations (3.10) and (3.11) hold in a more general case, largely following the outline of the DPLS solution by Li et al. [23]. We start with the latent multivariate regression framework by Burnham and Viveros [8], modified to allow dynamic objectives,

$$\begin{aligned} \underset{g_k, h_k, v_1}{\text{maximise}} \quad & (v_1 \otimes g_k)^T M_1 h_k \\ \text{subject to} \quad & g_k^T M_2 g_k = 1 \\ & h_k^T M_3 h_k = 1 \\ & v_1^T v_1 = 1 \\ & g_k^T M_4 g_l = 0, \quad l < k, \end{aligned} \qquad (3.12)$$

where the parameters $M_k$ corresponding to the subspace extraction methods in this thesis are listed in Table 3.1. Note that the continuum regression is excluded as we have chosen to focus on the already established Burnham and Viveros framework, and including CR would require an extension. Matrices $M_2, M_3, M_4$ are assumed to be invertible. The Lagrangian of (3.12) is given by

$$J = (v_1^T \otimes g_k^T) M_1 h_k + \tfrac{1}{2}\lambda_g (1 - g_k^T M_2 g_k) + \tfrac{1}{2}\lambda_h (1 - h_k^T M_3 h_k) + \tfrac{1}{2}\lambda_v (1 - v_1^T v_1). \qquad (3.13)$$

The orthogonality constraint in (3.12) will be handled through deflation of the cross product matrix and is excluded. The Lagrangian is maximised by,

$$\frac{\delta J}{\delta v_1} = (I_q \otimes g_k^T) M_1 h_k - \lambda_v v_1 = 0 \qquad (3.14)$$
$$\frac{\delta J}{\delta g_k} = (v_1^T \otimes I_m) M_1 h_k - \lambda_g M_2 g_k = 0 \qquad (3.15)$$
$$\frac{\delta J}{\delta h_k} = M_1^T (v_1^T \otimes g_k^T)^T - \lambda_h M_3 h_k = 0. \qquad (3.16)$$

By substitution of $h_k$ in (3.14) by (3.16), we get

$$(I_q \otimes g_k^T) M_1 M_3^{-1} M_1^T (v_1^T \otimes g_k^T)^T = \lambda_h \lambda_v v_1 \qquad (3.17)$$
$$\Leftrightarrow (I_q \otimes g_k^T) M_1 M_3^{-1} M_1^T (I_q \otimes g_k^T)^T v_1 = \lambda_h \lambda_v v_1. \qquad (3.18)$$

Then, substituting $h_k$ in (3.15) by (3.16),

$$(v_1^T \otimes I_m) M_1 M_3^{-1} M_1^T (v_1^T \otimes g_k^T)^T = \lambda_g \lambda_h M_2 g_k \qquad (3.19)$$
$$\Leftrightarrow M_2^{-1} (v_1^T \otimes I_m) M_1 M_3^{-1} M_1^T (v_1^T \otimes I_m)^T g_k = \lambda_g \lambda_h g_k. \qquad (3.20)$$

Equations (3.18), (3.20) and (3.25) are eigenproblems, meaning that the optimal solutions $v_1$ and $g_k$ correspond to the eigenvectors of the left hand side matrices. These equations do also depend on one another. Thus, we investigate the optimum of the Lagrangian by inserting the derivative (3.14), obtaining

$$J_{max} = \frac{1}{\lambda_h} v_1^T (I_q \otimes g_k^T) M_1 M_3^{-1} M_1^T (I_q \otimes g_k^T)^T v_1 = \lambda_v v_1^T v_1 = \lambda_v$$

and

$$J_{max} = \frac{1}{\lambda_h} g_k^T (v_1^T \otimes I_m) M_1 M_3^{-1} M_1^T (v_1^T \otimes I_m)^T g_k = \lambda_g g_k^T M_2 g_k = \lambda_g,$$

where $v_1^T v_1 = 1$ and $g_k^T M_2 g_k = 1$ are taken from the optimisation constraints. Finally, we insert the derivative (3.16) into the Lagrangian,

$$J_{max} = \frac{1}{\lambda_v} h_k^T M_1^T (I_q \otimes g_k^T)^T (I_q \otimes g_k^T) M_1 h_k. \qquad (3.24)$$

This expression may be simplified by using the substitution of $v_1$ from (3.14) in (3.16), that is

$$M_1^T (I_q \otimes g_k^T)^T (I_q \otimes g_k^T) M_1 h_k = \lambda_h \lambda_v M_3 h_k. \qquad (3.25)$$

Thus, we get

$$J_{max} = \lambda_h h_k^T M_3 h_k = \lambda_h.$$

The optimal solution must then be given where $J_{max} = \lambda_v = \lambda_g = \lambda_h$ and Li et al. note that this solution may be determined iteratively. The iterative solution is found by starting with a randomised and normed vector $g_k$, then iteratively solving equations (3.18) and (3.20) until convergence. It is possible, however, that the algorithm could converge to a local optimum. In this case, the iterations could be restarted a number of times after convergence (with different random initial vectors $g_k$) before choosing the optimal $g_k$ as the vector corresponding to the largest eigenvalue out of these trials.

Table 3.1: Latent regression framework parameters for PCA, PLS, CCA, DPLS and DCCA. '-' indicates an inactive constraint.


3.7 Dynamic CCA

Using the generalisation of the dynamic model, we construct a dynamic variant of CCA,

$$\begin{aligned} \max_{g_k, h_k, \beta_k} \quad & (\beta_k^T \otimes g_k^T) X_g^T Y h_k \\ \text{subject to} \quad & g_k^T X^T X g_k = 1 \\ & h_k^T Y^T Y h_k = 1 \\ & \beta_k^T \beta_k = 1 \\ & g_k^T X^T X g_l = 0, \quad l < k. \end{aligned} \qquad (3.26)$$

As shown in the previous section, the first feature direction $g_1$ may be found in the same manner as in the DPLS. Furthermore, since sequential feature directions of CCA may be determined using deflation of the data matrices [30], we follow the deflation by Li et al. [23]. That is, for the latent variables $t_k = X_k^{(i)} g_k$ and $t_{gk} = X_g(\beta_k \otimes g_k)$, the input and output matrices are deflated by,

$$X_{k+1} = X_k - t_k p_k^T, \qquad Y_{k+1} = Y_k - t_{gk} q_k^T,$$

with

$$p_k = \frac{X^T t_k}{t_k^T t_k}, \qquad q_k = \frac{Y^T t_{gk}}{t_{gk}^T t_{gk}}.$$


Chapter 4

Regularisation terms

In this chapter we will discuss some regularisation terms. The terms will be presented in the context of regularised VAR estimation as given by the objective function,

$$\min_A \; L(A; Y) + \lambda R(A), \qquad (4.1)$$

where R(A) is the regularisation term and L(A; Y) is a loss function.

4.1 Ridge penalisation

The ridge or $\ell_2$ regulariser aims to alleviate the issue with overfitting through $\ell_2$ norm penalisation of the regression weights matrix. That is,

$$R(A) = ||A||_F^2 = \sum_{i,j} a_{i,j}^2. \qquad (4.2)$$

Since spurious correlations between the predictors and the response variables can cause excessive coefficient weights, the ridge regression may reduce estimation errors by shrinking the estimated coefficients. Ridge-regularised problems may sometimes be solved analytically. For example, the solution to the regularised least squares problem

$$\min_A \; ||Y - XA||_F^2 + \lambda ||A||_F^2$$

is given by

$$\hat{A} = (X^T X + \lambda I)^{-1} X^T Y.$$
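A minimal sketch of the closed-form ridge solution above; the function name and the simulated data are illustrative.

```python
import numpy as np

def ridge_solution(X, Y, lam):
    """Closed-form solution of min_A ||Y - XA||_F^2 + lam * ||A||_F^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((200, 50)), rng.standard_normal((200, 5))
A_hat = ridge_solution(X, Y, lam=1.0)
```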

4.2 Lasso penalisation

The lasso or $\ell_1$ regulariser performs simultaneous shrinkage and variable selection and is defined as,

$$R(A) = ||A||_1 = \sum_{i,j} |a_{i,j}|. \qquad (4.3)$$

Practically, the lasso encourages sparsity by limiting the number of non-zero coefficients in the matrix A, in contrast to the $\ell_2$ norm that performs shrinkage and tends to keep unimportant coefficients small rather than zero. By restricting the number of non-zero coefficients, lasso-regularised regressions perform well in cases where the number of parameters is greater than the number of observations. However, the lasso is unstable in cases with highly correlated variables [39]. If variables are correlated, the lasso tends to pick only one, and a small difference in data may have significant impacts on parameter estimates. An additional $\ell_2$ penalisation term could be added to (4.3) in order to make the lasso more robust as in the elastic net [43]. However, this regulariser is beyond the scope of this thesis.
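The thesis solves the regularised problems with ADMM (Appendix A); as a simpler stand-in, the sketch below minimises the lasso objective by proximal gradient descent (ISTA) to illustrate the sparsifying effect of the penalty. All names are illustrative.

```python
import numpy as np

def soft_threshold(A, tau):
    """Elementwise soft-thresholding, the proximal map of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimise ||Y - XA||_F^2 + lam * ||A||_1 by proximal gradient (ISTA)."""
    step = 0.5 / np.linalg.norm(X.T @ X, 2)        # 1 / Lipschitz constant
    A = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ A - Y)             # gradient of the loss
        A = soft_threshold(A - step * grad, step * lam)
    return A

rng = np.random.default_rng(5)
X, Y = rng.standard_normal((200, 50)), rng.standard_normal((200, 5))
A_sparse = lasso_ista(X, Y, lam=5.0)
print((np.abs(A_sparse) > 1e-8).mean())            # fraction of non-zero weights
```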

4.3 Structured variants of the lasso penalty

The lasso penalty (4.3) does not make any structural assumptions about the estimated model. If the model does have an a priori known structure it might be possible to further reduce overfitting by including this structure in the penalty term. One variant of the lasso penalty is designed for models where there is a group structure between parameters. In this case, it might be appropriate to exclude or include the groups of parameters simultaneously [42]. For G groups of variables, where $A_g$ are the parameters related to the variables in group g, the group lasso is given by

$$R(A) = \sum_{g=1}^G ||A_g||_F. \qquad (4.4)$$

This regulariser tends to make the elements in a group either all zero or all non-zero. An additional lasso penalty can be added if the problem requires sparsity within groups [25], giving the sparse group lasso,

$$R(A) = (1 - \alpha) \sum_{g=1}^G ||A_g||_F + \alpha ||A||_1, \qquad (4.5)$$

where $\alpha$ indicates the desired degree of sparsity. In a time series setting, the group lasso and its sparse equivalent may be used to group temporally related variables together. For example, a group may be defined as the lags of a certain predictor. In this case, all parameters related to an unimportant predictor will become zero simultaneously, reducing the tendency to overfit in cases where, for instance, one certain lag of one variable is spuriously correlated to the output.

Another structural assumption is that temporally adjacent coefficients should have similar values. For sequential coefficient weights of lagged covariates, the sparse fused lasso penalises the magnitude of their difference. That is,

$$R(A) = (1 - \alpha) ||D_k A||_1 + \alpha ||A||_1, \qquad (4.6)$$

where $D_k$ denotes a difference operator over the lag-ordered coefficients and $\alpha$ indicates the desired degree of sparsity.
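To make the structured penalties concrete, the sketch below evaluates (4.4)-(4.6) for a coefficient matrix whose rows are ordered by lag. The first-order difference matrix used for $D_k$ and the group definition (blocks of 5 lags, as used in Section 6.4) are illustrative assumptions, as are the function names.

```python
import numpy as np

def group_lasso(A, groups):
    """R(A) = sum_g ||A_g||_F, cf. (4.4); groups are lists of row indices."""
    return sum(np.linalg.norm(A[g, :]) for g in groups)

def sparse_group_lasso(A, groups, alpha):
    """R(A) = (1-alpha) * sum_g ||A_g||_F + alpha * ||A||_1, cf. (4.5)."""
    return (1 - alpha) * group_lasso(A, groups) + alpha * np.abs(A).sum()

def sparse_fused_lasso(A, alpha):
    """R(A) = (1-alpha) * ||D A||_1 + alpha * ||A||_1, cf. (4.6), with D a
    first-order difference operator over the lag-ordered rows of A."""
    D = np.diff(np.eye(A.shape[0]), axis=0)        # (rows-1) x rows differences
    return (1 - alpha) * np.abs(D @ A).sum() + alpha * np.abs(A).sum()

# Example: 20 lagged coefficients grouped into blocks of 5 lags.
rng = np.random.default_rng(6)
A = rng.standard_normal((20, 3))
groups = [list(range(i, i + 5)) for i in range(0, 20, 5)]
print(group_lasso(A, groups), sparse_group_lasso(A, groups, 0.5), sparse_fused_lasso(A, 0.5))
```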


Chapter 5

The regularised factor model

This chapter introduces the proposed regularised factor model and its motivation, and outlines the implementation of the model in this thesis.

5.1 The model

To combine the benefits of factor models and regularised time series estimation we propose a two-step approach. In the first step, a lower-dimensional subspace projection matrix is estimated (possibly by a supervised or dynamic method as discussed in Chapter 3) and the predictors X1, . . . , Xq are projected onto the latent subspace to form factors. In the second step, the parameters of the factor model are estimated using a regularised loss function.

Definition 3. (Two step factor model estimation)

Step 1. Latent variable projection: Given model specific $v_1$ and $M_k$, for example as defined in Table 3.1, find the projection matrix G with columns $g_k$ by,

$$\begin{aligned} \max_{g_k, h_k, v_1} \quad & (v_1 \otimes g_k)^T M_1 h_k \\ \text{subject to} \quad & g_k^T M_2 g_k = 1 \\ & h_k^T M_3 h_k = 1 \\ & v_1^T v_1 = 1 \\ & g_l^T M_4 g_k = 0, \quad k < l. \end{aligned} \qquad (5.1)$$

Step 2. Factor model estimation: Using G to project the observations, estimate the time series with regularisation,

$$\min_B \; ||Y - [X_1 G, \ldots, X_q G] B||_F^2 + N \lambda_n R(B), \qquad (5.2)$$

where R is a regularisation term and N is the number of observations. A different loss function could be used but the investigation in this thesis is limited to the least squares loss.
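A minimal sketch of the two-step estimation in Definition 3, using a PCA projection in Step 1 and a ridge penalty in Step 2 for brevity; any projection from Chapter 3 and any penalty from Chapter 4 could be substituted, and all names and simulated data are illustrative.

```python
import numpy as np

def regularised_factor_model(X_lags, Y, d, lam):
    """Two-step estimation sketch (Definition 3): PCA projection, then a
    ridge-regularised regression on the lagged factors."""
    # Step 1: latent variable projection (here: PCA on the pooled lags).
    X_stack = np.vstack(X_lags)
    eigvals, eigvecs = np.linalg.eigh(X_stack.T @ X_stack)
    G = eigvecs[:, np.argsort(eigvals)[::-1][:d]]     # p x d projection matrix

    # Step 2: regularised estimation on factors [X_1 G, ..., X_q G], cf. (5.2).
    F = np.hstack([Xi @ G for Xi in X_lags])          # T x (q*d) factor regressors
    B = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)
    return G, B

def forecast(X_lags, G, B):
    """Forecasts from the fitted regularised factor model."""
    F = np.hstack([Xi @ G for Xi in X_lags])
    return F @ B

rng = np.random.default_rng(7)
X_lags = [rng.standard_normal((500, 30)) for _ in range(5)]
Y = rng.standard_normal((500, 30))
G, B = regularised_factor_model(X_lags, Y, d=3, lam=1.0)
Y_hat = forecast(X_lags, G, B)
```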

5.2 Motivation

We hypothesise that a method which combines factor models with regularisation may improve forecasting results over either regularised VAR models or static factor models by lessening the impact of some of their individual drawbacks while retaining some of their benefits. In particular, there are four main potential advantages to the proposed approach.

Firstly, high-dimensional factor models with a large number of lags may require either stacking or considerable dimension reduction to sufficiently reduce the parameter space. Stacked models have been shown to perform less than optimally in previous research. Additionally, with a limited number of observations, any projection that reduces the solution space to a manageable size (one that reduces the propensity to overfit) may discard valuable information. It is possible that this loss of information may be lessened by avoiding excessive dimensional reduction and letting the regularisation term handle overfitting by encouraging sparse estimates.

Secondly, factor models are sensitive to model specification in the number of included lags and ascertaining a correct lag length in a VAR model may be exceedingly hard with noisy data. In these conditions, a long lag length may be necessary since omitted variables may cause greater bias than overfitting by including superfluous variables. Combining factor models with regularised estimation may lessen the impact of model specification errors by allowing inclusion of superfluous lags as regularisation could lower their parameter weights.

Thirdly, as opposed to factor models, a regularised VAR model may fail to correctly account for systematic market-wide shocks and collinearity between series may result in a model with covariate weight instability for lasso-based regularisers. Even for non-lasso regularisers, failure to take latent variables into account may cause dense model networks. Additionally, regularisation terms such as the lasso are rather sensitive to the presence of non-i.i.d. noise in the independent variable [9]. By separating the common factors from the idiosyncratic, the projection onto common factors may allow the regularised factor model to correctly account for systematic shocks and may reduce both noise and collinearity (though some collinearity will remain between subsequent lags of factors in the model).


5.3 Implementation


Chapter 6

Performance evaluation

6.1 Data description

The data consist of daily exchange rate spot prices between Jan 3, 2000 and Aug 31, 2015 for 30 currency pairs. Exact pairs are listed in the appendix. The starting date is chosen to avoid any impact of the 1997 Asian financial crisis that caused some changes in exchange rate regimes. By limiting the data to observations after 2000, it is assumed that the Asian countries have had time to recover and that foreign exchange markets have stabilised. The daily observations are synchronised and taken at the same time each day (at 22:00 GMT). The simultaneous snapshot of spot prices could result in the spot prices being taken at moments of high liquidity for some currencies and low liquidity for others due to intraday variation in traded volumes. This liquidity difference might have a price impact but any impact should be lower than the possible bias of having unsynchronised spot prices as these lead to non-modelled correlation between observations.

The currency pairs include both higher and lower volume currencies and represent a variety of exchange-rate regimes, albeit none with a fixed peg to another currency over the full investigated period. The inclusion of non-floating currencies is motivated by the assumption that any non-pegged currencies contain their own idiosyncratic terms and that this additional information could improve subspace estimates. It should be noted that the USDMYR and USDCNY currency pairs were both pegged to the USD until July 2005, i.e. for about a third of the investigated time period. However, the same reasoning as for the non-free-floating currencies can be used for their inclusion. Additionally, both were among the 30 most traded currencies as of 2013 [4] (the CNY being the 9th most traded), and it is assumed that changes in these currencies might have significant impacts on other currencies after 2005.

No particular structure or sufficient significant lag length could be determined for any of the assets when investigating sample autocorrelations. The autocorrelation coefficients around lag 10 are similar in size to the coefficients around lag 100 and just a bit smaller around lag 500. A plot of the sample autocorrelations is added to the appendix. For brevity, we restrict the plot to the USDEUR currency pair since the autocorrelations of all time series were similar. As seen in the figure, almost all autocorrelations are within one standard deviation and the few outliers do not seem statistically significant as the neighbouring lags are small. Excluding low lags, the impact of two subsequent lags should be somewhat similar.

As seen in Table C.2 in the appendix, most currencies are roughly equally distributed on the positive/negative sides. Notable exceptions are USDRON and USDBRL where there is a 10 percentage point difference. Additionally, USDCNY and USDMYR have a large percentage of zero returns as a result of the aforementioned currency peg before July 2005.

6.2 Data preparation

As some of the investigated methods do assume stationarity of the time series, we only investigate log returns of the spot prices. Additionally, to avoid any impacts of varying volatilities over time, the log returns are normalised by the 60 day close-to-close volatility. That is, given a vector of prices p, normalised log returns are obtained by

$$r_t = \frac{\ln(p_t / p_{t-1})}{\sqrt{\frac{1}{60}\sum_{i=t-59}^{t} \ln(p_i / p_{i-1})^2}}. \qquad (6.1)$$

While a measure such as Yang-Zhang volatility could be preferred as it is drift independent and more efficient than the close-to-close volatility [41], the investigated dataset did not contain the daily high and low prices that the Yang-Zhang volatility requires. The normalised log return time series are displayed in Figure 6.1. As shown in the figure, the returns of most assets seem indistinguishable from white noise.

Despite the USDMYR and USDCNY currencies being pegged for some of the investigated period, there is some very slight noise in the data (in the order of $10^{-4}$).
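A minimal sketch of the normalisation in (6.1), assuming a vector of daily spot prices and computing the 60-day close-to-close volatility as the root mean square of the returns in the window; names and simulated prices are illustrative.

```python
import numpy as np

def normalised_log_returns(prices, window=60):
    """Log returns scaled by the trailing close-to-close volatility, cf. (6.1)."""
    r = np.diff(np.log(prices))                    # log returns
    out = np.full_like(r, np.nan)
    for t in range(window - 1, len(r)):
        vol = np.sqrt(np.mean(r[t - window + 1:t + 1] ** 2))
        out[t] = r[t] / vol
    return out

rng = np.random.default_rng(8)
prices = 1.1 * np.exp(np.cumsum(0.005 * rng.standard_normal(1000)))
r_norm = normalised_log_returns(prices)
```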


6.3 Forecast evaluation

The forecasting performance of the regularised factor model will be estimated using nine different model setups. These are pairwise combinations of three forecasting horizons, 1 day, 1 week (5 business days) and 1 month (21 business days), with 3 maximum lag lengths, 1 month (21 business days), 3 months (63 business days) and 6 months (126 business days). The different forecasting horizons are chosen to highlight the performance of the regularised factor model with different predictor dynamics. For example, it is believed that a time series model for one step ahead forecasts should require a smaller number of lags than a model that forecasts 21 days ahead as signals should decay faster in the shorter horizons. Furthermore, the choice of the three maximum lag lengths for each model is partly due to the assumption that it is generally better to include extraneous lags than to exclude actually significant ones and partly due to the fact that we want to investigate the degree to which the regularisation step may lower the propensity to overfit.

The performance of the model is estimated through a rolling schedule with an expanding time window. In essence, we use a time window containing the observations at times 1, . . . , t for estimating the model and forecasting the next n number of h step ahead returns. The time window is then expanded to t + n and the model is re-estimated before calculating the subsequent set of predictions. For the out-of-sample tests, we use n = 1. The frequent re-estimation of the model should reduce the impact of any spuriously good/bad estimates and allow better comparison between methods. The prediction accuracy is measured for each of the h steps ahead predictions by the mean square error relative to the MSE of a random walk (rMSE) and the hit rate (HR) of the signs of the predictions. That is,

$$\mathrm{rMSE}(\hat{y}) = \frac{\sum_{i=1}^N (\hat{y}_i - y_i)^2}{\sum_{i=1}^N y_i^2}, \qquad \mathrm{HR} = \frac{1}{N}\sum_{i=1}^N I_{(\mathrm{sign}(\hat{y}_i)=\mathrm{sign}(y_i))}.$$

We use the same expanding time window for all model specifications. It is conceivable that the large time windows may lower prediction accuracy for the short term prediction models as the short term market dynamics may be less stable than the longer term dynamics. However, we want to avoid changing the dataset for estimation as this might impact estimation errors and may limit comparison between methods.
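A sketch of the expanding-window evaluation and the two accuracy measures, assuming a generic fit/predict pair for the model under test; the random-walk benchmark corresponds to a zero-return forecast, and all names are illustrative.

```python
import numpy as np

def expanding_window_eval(y, fit, predict, t0, h=1, n=1):
    """Expanding-window out-of-sample evaluation as described above.

    fit(y_train) returns a fitted model; predict(model, y_train) returns the
    h-step-ahead forecast for the next period. Returns rMSE (relative to a
    random-walk forecast of zero returns) and the hit rate of the signs.
    """
    preds, actuals = [], []
    t = t0
    while t + h < len(y):
        model = fit(y[:t])
        preds.append(predict(model, y[:t]))
        actuals.append(y[t + h - 1])
        t += n                                     # expand the window by n steps
    preds, actuals = np.array(preds), np.array(actuals)
    rmse = np.sum((preds - actuals) ** 2) / np.sum(actuals ** 2)
    hit_rate = np.mean(np.sign(preds) == np.sign(actuals))
    return rmse, hit_rate

# Example with a trivial mean forecast on a univariate return series.
rng = np.random.default_rng(9)
returns = rng.standard_normal(1500) * 0.01
rmse, hr = expanding_window_eval(
    returns, fit=lambda tr: tr.mean(), predict=lambda m, tr: m, t0=756)
print(rmse, hr)
```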

6.4 Parameter selection

The model hyperparameters are selected by in-sample cross validation, using the same rolling estimation scheme for the model calibration as for the out-of-sample performance evaluation. Since the cross validation step is time intensive we forecast n = 5 steps before reconstructing the model parameters instead of reconstructing the model for each step. The in-sample performance is tested with each set of hyperparameters. The optimal hyperparameters are determined as the hyperparameters that provide the highest hit rate, although this generally tended to coincide with the lowest rMSE.

The minimum time window is set to 756 days (three years) and represents a trade-off between the number of remaining observations for training and validation and the minimum number required for decent estimates in order to avoid skewing cross-validation results. That is, a smaller window may result in weaker estimates and likely requires more penalisation than a longer time window. Restricting the minimum to 756 days leaves just below 1000 days of observations (about 4 years of data) for validation.

Given that the set of possible integer hyperparameters in the dimension reduction step is relatively small, we test all possible integers. For the non-integer $\lambda$ regularisation coefficients, a set of 20 logarithmically equally spaced values between estimated minimum and maximum values is constructed. The minimum values are chosen as the value where the parameter estimates are indistinguishable from the unpenalised estimates (to 4 decimal places), and the maximum values are the values where the penalisation term fully dominates. Some a priori assumptions are made for the sparse group lasso and the sparse fused lasso that contain multiple hyperparameters in order to avoid an exponentially increasing number of cross-validation trials. We set the hyperparameter for the sparsity inducing norm in the sparse group lasso and the sparse fused lasso to $\alpha = 0.5$. Additionally, we define the groups in the group lasso as the successive collections of 5 days of lagged predictors.

Furthermore, one of the goals of this thesis is to compare different subspace extraction methods. For this reason, we believe that it is important that the number of factors does not differ between the methods. For example, if we use different numbers of projected latent factors for the different methods, results could either be caused by one method finding more informative latent factors or by the difference in the number of factors. Hence, a constant number of factors is used for all first stage projections. The constant was chosen as the one providing the best in-sample performance for the majority of the regularised factor models.


Chapter 7

Empirical results

In this chapter, the forecasting performance of the regularised factor model is presented and discussed. To limit the number of combinations of subspace extraction methods and penalty terms that is tested, the first section is restricted to performance evaluation of the subspace extraction methods without any regularisation. The results of the combined approach of regularisation with subspace projection are then presented in the subsequent section for a selected set of subspace projection methods.

An AA-BBBB convention is used for identifying different model specifications, where AA is the regularisation term and BBBB is the subspace extraction method. The label NA denotes no regularisation or no projection, FL denotes the sparse fused lasso and GL denotes the sparse group lasso. The remaining labels should be self-explanatory.

7.1 Foreign exchange rate prediction without regularisation

Examining the results in Tables 7.1-7.3, it is evident that the unprojected models tend to overfit as shown by the large rMSE estimates. This is especially clear in the long lag time series. The dimensionality reduced approaches perform better as the rMSE estimates tend towards 1 for most specifications, and in the 21 step ahead and 21 step lag models, even being below 1 by a decent amount. However, the longest lag model (126 days) did seem too large as the MSE for all model specifications significantly exceeds the random walk MSE.

Comparing the unsupervised and the supervised static subspace projection methods, the PCA performs well for models with a high step ahead to lag ratio, especially when including 126 lags, but it is surpassed by the supervised methods in cases with shorter lags and longer horizons. The result is expected since the latent subspace is equally weighted for all static supervised models. Under the assumption that the factors that predict short term changes in exchange rates decay relatively quickly, an excessive emphasis is placed on finding common factors for all lags and predictive information is discarded. Within the set of supervised static approaches, the continuum regression seems to have the most consistently good performance. This is likely a result of the CR being able to tend towards the PCA whenever only the few most recent lags contain the majority of predictive information, while it may tend towards supervised methods that tend to dominate performance when most lags are important.

Table 7.1: Mean estimation errors for all assets using 21 lagged observations.

             1 step ahead       5 steps ahead      21 steps ahead
Model        Hit rate  rMSE     Hit rate  rMSE     Hit rate  rMSE
NA-NA        0.5237    1.1573   0.5158    1.1923   0.5285    1.1196
NA-CCA       0.5123    1.0144   0.5292    0.9997   0.5606    0.9459
NA-CR        0.5237    1.0039   0.5228    1.0130   0.5491    0.9555
NA-PCA       0.5299    1.0005   0.5174    1.0192   0.5334    0.9747
NA-PLS       0.5264    1.0004   0.5257    1.0127   0.5489    0.9525
NA-DCCA      0.5327    0.9921   0.5281    1.0005   0.5626    0.9454
NA-DPLS      0.5347    1.0008   0.5241    1.0170   0.5493    0.9530

Table 7.2: Mean estimation errors for all assets using 63 lagged observations.

             1 step ahead       5 steps ahead      21 steps ahead
Model        Hit rate  rMSE     Hit rate  rMSE     Hit rate  rMSE
NA-NA        0.5152    2.0561   0.5071    2.1684   0.5228    1.9046
NA-CCA       0.5123    1.0815   0.5191    1.0687   0.5267    1.0342
NA-CR        0.5286    1.0593   0.5296    1.0600   0.5434    1.0289
NA-PCA       0.5277    1.0708   0.5227    1.0938   0.5308    1.0574
NA-PLS       0.5310    1.0616   0.5281    1.0601   0.5430    1.0313
NA-DCCA      0.5284    1.0488   0.5169    1.0673   0.5258    1.0316
NA-DPLS      0.5351    1.0668   0.5255    1.0777   0.5338    1.0574

Table 7.3: Mean estimation errors for all assets using 126 lagged observations (hit rate and rMSE for 1, 5 and 21 steps ahead).


7.2 Foreign exchange rate prediction with the regularised factor model

The results in the previous section showed that the dynamic PLS and CCA variants perform very well in the small model scenario. However, the performance of all supervised models did degenerate in the largest model and they were outperformed by the PCA regarding hit rate. Seeing as the performance of the dynamic variants was rather similar, we will only investigate one of them with regularisation. We choose to focus on DPLS primarily due to the more frequent usage of the PLS method than the CCA method in recent research. Furthermore, the PCA method is included due to the good performance of PCA in the large model and due to the PCA being commonly used in the approximate factor model. Similarly to the previous section, MSE relative to the random walk estimates and hit rates are provided for each model specification.

Comparing methods, the jointly dimension reduced and regularised models almost consistently outperform models that are only regularised and the DPLS subspace models generally outperform the PCA models. The differences between methods are typically small for short lags and horizons and increase in both the lag and horizon dimensions. The good forecasting performance of non-reduced models for short term predictions could be caused by the short horizon specifications being considerably more dependent on recent observations than on older ones, while the 21 step ahead prediction should be more evenly dependent on all lags. Since a smaller number of variables is important in the former case, there may be less value in dimension reduction in order to reduce overfitting. Similarly, with a smaller number of important variables we may see a greater loss of predictive information caused by the dimension reduction.

Regarding regularisation terms, the lasso based variants generally outperform the ridge regularised models. Within the lasso group, there seems to be an increase in performance when imposing structural sparsity as the fused lasso and the group lasso tend to provide better results than the lasso, although the difference is small. Furthermore, the performance of the group lasso is very slightly better in general than the fused lasso when combined with the DPLS, but on the other hand the fused lasso performs better than the group lasso when both are combined with PCA. Any difference between the two regularisation terms does not seem sufficiently significant to recommend one over the other.

Table 7.4: Mean estimation errors for all assets using 21 lagged observations.

             1 step ahead       5 steps ahead      21 steps ahead
Model        Hit rate  rMSE     Hit rate  rMSE     Hit rate  rMSE
NA-DPLS      0.5347    1.0008   0.5241    1.0170   0.5493    0.9530
NA-NA        0.5237    1.1573   0.5158    1.1923   0.5285    1.1196
NA-PCA       0.5299    1.0005   0.5174    1.0192   0.5334    0.9747
FL-DPLS      0.5355    0.9643   0.5280    0.9868   0.5510    0.9782
FL-NA        0.5354    0.9641   0.5305    0.9968   0.5458    0.9996
FL-PCA       0.5316    0.9651   0.5252    0.9878   0.5501    0.9816
GL-DPLS      0.5301    0.9665   0.5264    0.9855   0.5482    0.9501
GL-NA        0.5372    0.9680   0.5268    0.9952   0.5320    0.9474
GL-PCA       0.5314    0.9676   0.5211    0.9881   0.5489    0.9597
L1-DPLS      0.5312    0.9661   0.5353    0.9962   0.5493    0.9502
L1-NA        0.5353    0.9664   0.5266    0.9924   0.5277    0.9758
L1-PCA       0.5326    0.9671   0.5246    0.9873   0.5496    0.9569
L2-DPLS      0.5291    0.9798   0.5269    0.9860   0.5529    0.9985
L2-NA        0.5314    0.9761   0.5223    0.9861   0.5472    0.9570
L2-PCA       0.5316    0.9804   0.5272    0.9989   0.5426    0.9517

Table 7.5: Mean estimation errors for all assets using 63 lagged observations (hit rate and rMSE for 1, 5 and 21 steps ahead).

Table 7.6: Mean estimation errors for all assets using 126 lagged observations (hit rate and rMSE for 1, 5 and 21 steps ahead).

Chapter 8

Conclusion

As seen in the results, the regularised factor model achieves good prediction performance and better than random walk MSE for all forecasting horizons. It is also shown that these forecasts are sufficient for trading, at least for short horizons. Additionally, three general conclusions may be drawn from the results.

Firstly, factor models may benefit from supervised subspace extraction. However, the supervised methods are less general than the unsupervised methods. Since the actual dynamics between the subspace and the output variable may differ depending on the modelled time series, supervised subspaces may only be preferable to unsupervised subspaces in some cases.

Secondly, supervised subspace extraction methods may be improved by adding a dynamic term as both the DCCA and the DPLS methods tended to outperform their static counterparts. The improvement in prediction performance is especially large in settings where the factors may decay quickly compared to the number of included lags, such as in short term forecasting.

Thirdly, the usage of dimension reduction along with regularisation may improve results over either separate approach. While the improvement in prediction performance is small for short horizon forecasts with few lags, it increases with longer forecasting horizons compared to regularisation and increases with longer lag length compared to dimension reduction. The difference between horizons may be caused by a greater number of variables with predictive power in the longer horizon model, which increases the value of dimension reduction. The difference between lag lengths is likely caused by a reduced propensity to overfit due to the regularisation.

Some potential improvements are left for future research. For instance, under the assumption that the exchange rate dynamics between countries vary over time, the model could likely benefit from dynamic or time-varying parameters, especially given the large time windows that were used in this thesis. Also, forecasts of some foreign exchange rates seemed consistently weaker than others. For this reason, it could be interesting to investigate inference for the model. Lastly, the dynamic


Bibliography

[1] Jushan Bai and Serena Ng. 2008. Forecasting economic time series using targeted predictors. Journal of Econometrics, 146(2):304–317.

[2] Jushan Bai and Peng Wang. 2016. Econometric Analysis of Large Factor Models. Annual Review of Economics, 8.

[3] Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. 2006. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137.

[4] Bank for International Settlements. 2013. Triennial Central Bank Survey: Foreign exchange turnover in April 2013: preliminary global results. (April).

[5] Sumanta Basu and George Michailidis. 2015. Regularized estimation in sparse high-dimensional time series models, volume 43.

[6] Jean Boivin and Serena Ng. 2005. Understanding and comparing factor-based forecasts. International Journal of Central Banking, 1(3):117–151.

[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2010. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1–122.

[8] Alison J. Burnham and Roman Viveros. 1996. Frameworks for latent variable multivariate regression. Journal of Chemometrics, 10(1):31–45.

[9] Xiaohui Chen, Z. Jane Wang, and Martin J. McKeown. 2010. Asymptotic analysis of robust LASSOs in the presence of noise with large variance. IEEE Transactions on Information Theory, 56(10):5131–5149.

[10] Hyonho Chun and Sündüz Keleş. 2010. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. pages 3–25.

[11] Richard A. Davis, Pengfei Zang, and Tian Zheng. 2012. Sparse vector autoregressive modeling. pages 1–39.


[12] Sijmen De Jong. 1993. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18(3):251–263.

[13] Sijmen De Jong, Barry M. Wise, and N. Lawrence Ricker. 2001. Canonical partial least squares and continuum power regression. Journal of Chemometrics, 15(2):85–100.

[14] Mario Forni, Marc Hallin, Marco Lippi, and Paolo Zaffaroni. 2015. Dynamic factor models with infinite-dimensional factor spaces: One-sided representations. Journal of Econometrics, 185(2):359–371.

[15] Ildiko E. Frank and Jerome H. Friedman. 1993. A Statistical View of Some Chemometrics Regression Tools. Technometrics, 35(2):109–135.

[16] Julieta Fuentes, Pilar Poncela, and Julio Rodríguez. 2014. Sparse Partial Least Squares in Time Series for Macroeconomic Forecasting. Journal of Applied Econometrics, 30:576–595.

[17] Nicolae Gârleanu and Lasse Heje Pedersen. 2013. Dynamic Trading with Predictable Returns and Transaction Costs. The Journal of Finance, 68(6):2309–2340.

[18] Marc Hallin and Marco Lippi. 2013. Factor models in high-dimensional time series: A time-domain approach. Stochastic Processes and their Applications, 123(7):2678–2695.

[19] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning, 2nd edition. Springer-Verlag, New York.

[20] Nan-Jung Hsu, Hung-Lin Hung, and Ya-Mei Chang. 2008. Subset selection for vector autoregressive processes using Lasso. Computational Statistics & Data Analysis, 52:3645–3657.

[21] Henrik Hult, Filip Lindskog, Ola Hammarlid, and Carl Johan Rehn. 2012. Risk and portfolio analysis.

[22] Henk A. L. Kiers and Age K. Smilde. 2006. A comparison of various methods for multivariate regression with highly collinear variables. Statistical Methods and Applications, 16:193–228.

[23] Gang Li, Baosheng Liu, S. Joe Qin, and Donghua Zhou. 2011. Quality relevant data-driven modeling and monitoring of multivariate dynamic processes: the dynamic T-PLS approach. IEEE Transactions on Neural Networks, 22(12):2262–2271.


[25] Yanming Li, Bin Nan, and Ji Zhu. 2015. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics, 71(June):354–363.

[26] Po-Ling Loh and Martin J. Wainwright. 2012. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Annals of Statistics, 40(3):1637–1664.

[27] Helmut Lütkepohl. 2005. New Introduction to Multiple Time Series Analysis, 1st edition. Springer-Verlag.

[28] Debashis Paul, Eric Bair, Trevor Hastie, and Robert Tibshirani. 2008. "Preconditioning" for feature selection and regression in high-dimensional problems. Annals of Statistics, 36(4):1595–1618.

[29] Roman Rosipal. 2006. Overview and Recent Advances in Partial Least Squares. pages 34–51.

[30] Sanjay K. Sharma, Uwe Kruger, and George W. Irwin. 2006. Deflation based nonlinear canonical correlation analysis. 83:34–43.

[31] Song Song and Peter J. Bickel. 2011. Large vector auto regressions. arXiv preprint arXiv:1106.3915, pages 1–28.

[32] James H. Stock and Mark W. Watson. 2002. Forecasting Using Principal Components From a Large Number of Predictors. Journal of the American Statistical Association, 97(460).

[33] James H. Stock and Mark W. Watson. 2002. Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics, 20(2):147–162.

[34] James H. Stock and Mark W. Watson. 2006. Forecasting with many predictors. 1(05):515–554.

[35] James H. Stock and Mark W. Watson. 2012. Generalized shrinkage methods for forecasting using many predictors. Journal of Business & Economic Statistics, 30(4):481–493.

[36] James H. Stock and Mark W. Watson. 2007. Forecasting in Dynamic Factor Models Subject To Structural Instability. The Methodology and Practice of Econometrics. A Festschrift in Honour of David F. Hendry, (August 2007):173–205.

[37] M. Stone and R. J. Brooks. 1990. Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. Journal of the Royal Statistical Society. Series B (Methodological).

[38] Robert Tibshirani. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.

[39] Robert Tibshirani. 2015. Statistical learning with sparsity: The lasso and generalizations. Chapman and Hall/CRC.

[40] Svante Wold, Nouna Kettaneh-Wold, and Bert Skagerberg. 1989. Nonlinear PLS modeling. Chemometrics and Intelligent Laboratory Systems, 7:53–65.

[41] Dennis Yang and Qiang Zhang. 2000. Drift independent volatility estimation based on high, low, open, and close prices. The Journal of Business, 73(3):477–492.

[42] Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.

[43] Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.
