
Matematiska Institutionen

Department of Mathematics

Master’s Thesis

Forecasting the Equity Premium and Optimal Portfolios

Johan Bjurgert and Marcus Edstrand

Reg Nr: LITH-MAT-EX--2008/04--SE

Linköping 2008

Matematiska institutionen, Linköpings universitet


Forecasting the Equity Premium and Optimal Portfolios

Department of Mathematics, Linköpings universitet

Johan Bjurgert and Marcus Edstrand

LITH-MAT-EX--2008/04--SE

Handledare: Dr Jörgen Blomvall, MAI, Linköpings universitet
            Dr Wolfgang Mader, risklab GmbH
Examinator: Dr Jörgen Blomvall, MAI, Linköpings universitet


Avdelning, Institution / Division, Department: Division of Mathematics, Department of Mathematics, Linköpings universitet, SE-581 83 Linköping, Sweden

Datum / Date: 2008-04-15

Språk / Language: Engelska/English

Rapporttyp / Report category: Examensarbete

URL för elektronisk version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-11795

ISRN: LITH-MAT-EX--2008/04--SE


Abstract

The expected equity premium is an important parameter in many financial models, especially within portfolio optimization. A good forecast of the future equity premium is therefore of great interest. In this thesis we seek to forecast the equity premium, use it in portfolio optimization and then give evidence on how sensitive the results are to estimation errors and how the impact of these can be minimized. Linear prediction models are commonly used by practitioners to forecast the expected equity premium, with mixed results. Choosing only the model that performs best in-sample for forecasting does not take model uncertainty into account. Our approach is to still use linear prediction models, but to also take model uncertainty into consideration by applying Bayesian model averaging. The predictions are used in the optimization of a portfolio with risky assets to investigate how sensitive portfolio optimization is to estimation errors in the mean vector and covariance matrix. This is performed by using a Monte Carlo based heuristic called portfolio resampling.

The results show that the predictive ability of linear models is not substantially improved by taking model uncertainty into consideration. This could mean that the main problem with linear models is not model uncertainty, but rather too low predictive ability. However, we find that our approach gives better forecasts than just using the historical average as an estimate. Furthermore, we find some predictive ability in the GDP, the short term spread and the volatility for the five years to come. Portfolio resampling proves to be useful when the input parameters in a portfolio optimization problem are suffering from vast uncertainty.

Keywords: equity premium, Bayesian model averaging, linear prediction, estimation errors, Markowitz optimization


Acknowledgments

First of all we would like to thank risklab GmbH for giving us the opportunity to write this thesis. It has been a truly rewarding experience. We are grateful for the many inspirational discussions with Wolfgang Mader, our supervisor at risklab. He has also provided us with valuable comments and suggestions. We thank our supervisor at LiTH, Jörgen Blomvall, for his continuous support and feedback. Finally we would like to acknowledge our opponent Tobias Törnfeldt, for his helpful comments.

Munich, April 2008
Johan Bjurgert and Marcus Edstrand


Contents

1 Introduction
   1.1 Objectives
   1.2 Problem definition
   1.3 Limitations
   1.4 Contributions
   1.5 Outline

I Equity Premium Forecasting using Bayesian Statistics

2 The Equity Premium
   2.1 What is the equity premium?
   2.2 Historical models
   2.3 Implied models
   2.4 Conditional models
   2.5 Multi factor models
   2.6 A short summary of the models
   2.7 What is a good model?
   2.8 Chosen model

3 Linear Regression Models
   3.1 Basic definitions
   3.2 The classical regression assumptions
   3.3 Robustness of OLS estimates
   3.4 Testing the regression assumptions

4 Bayesian Statistics
   4.1 Basic definitions
   4.2 Sufficient statistics
   4.3 Choice of prior
   4.4 Marginalization
   4.5 Bayesian model averaging
   4.6 Using BMA on linear regression models

5 The Data Set and Linear Prediction
   5.1 Chosen series
   5.2 The historical equity premium
   5.3 Factors explaining the equity premium
   5.4 Testing the assumptions of linear regression
   5.5 Forecasting by linear regression

6 Implementation
   6.1 Overview
   6.2 Linear prediction
   6.3 Bayesian model averaging
   6.4 Backtesting

7 Results
   7.1 Univariate forecasting
   7.2 Multivariate forecasting
   7.3 Results from the backtest

8 Discussion of the Forecasting

II Using the Equity Premium in Asset Allocation

9 Portfolio Optimization
   9.1 Solution of the Markowitz problem
   9.2 Estimation error in Markowitz portfolios
   9.3 The method of portfolio resampling
   9.4 An example of portfolio resampling
   9.5 Discussion of portfolio resampling

10 Backtesting Portfolio Performance
   10.1 Backtesting setup and results

11 Conclusions

Bibliography

A Mathematical Preliminaries
   A.1 Statistical definitions
   A.2 Statistical distributions

B Code
   B.1 Univariate predictions
   B.2 Multivariate predictions
   B.3 Merge time series
   B.4 Load data into Matlab from Excel
   B.5 Permutations
   B.7 setSubColumn
   B.8 Portfolio resampling
   B.9 Quadratic optimization


List of Figures

3.1 OLS by means of projection
3.2 The effect of outliers
3.3 Example of a Q-Q plot
4.1 Bayesian revising of probabilities
5.1 The historical equity premium over time
5.2 Shapes of the yield curve
5.3 Q-Q plot of the one-step lagged residuals for factors 1-9
5.4 Q-Q plot of the one-step lagged residuals for factors 10-18
5.5 Lagged factors 1-9 versus returns on the equity premium
5.6 Lagged factors 10-18 versus returns on the equity premium
6.1 Flowchart
6.2 User interface
7.1 The equity premium from the univariate forecasts
7.2 Likelihood function values for different g-values
7.3 The equity premium from the multivariate forecasts
7.4 Backtest of univariate models
7.5 Backtest of multivariate models
9.1 Comparison of efficient and resampled frontier
9.2 Resampled portfolio allocation when shorting allowed
9.3 Resampled portfolio allocation when no shorting allowed
9.4 Comparison of estimation error in mean and covariance


List of Tables

2.1 Advantages and disadvantages of discussed models
3.1 Critical values for the Durbin-Watson test
5.1 The data set and sources
5.2 Basic statistics for the factors
5.3 Outliers identified by the leverage measure
5.4 Jarque-Bera test of normality
5.5 Durbin-Watson test of autocorrelation
5.6 Principle of lagging time series for forecasting
5.7 Lagged R² for univariate regression
7.1 Forecasting statistics in percent
7.2 The univariate model with highest probability over time
7.3 Out-of-sample R², R²_os,uni, and hit ratios, HR_uni
7.4 Forecasting statistics in percent
7.5 The multivariate model with highest probability over time
7.6 Forecasts for different g-values
7.7 Out-of-sample R², R²_os,mv, and hit ratios, HR_mv
9.1 Input parameters for portfolio resampling
10.1 Portfolio returns over time


Nomenclature

The most frequently used symbols and abbreviations are described here.

Symbols

µ̄           Demanded portfolio return
β_{i,t}      Beta for asset i at time t
β_t          True least squares parameter at time t
β̂_t          Least squares estimate at time t
µ            Asset return vector
Ω_t          Information set at time t
Σ            Estimated covariance matrix
Σ̂            Sampled covariance matrix
û_t          Least squares sample residual at time t
u_t          Population residual in the least squares model at time t
λ_{m,t}      Market m price of risk at time t
C            Covariance matrix
I_n          The identity matrix of size n × n
w            Weights of assets
D_{i,t}      Dividend for asset i at time t
r_{f,t}      Riskfree rate at time t to t + 1
r_{m,t}      Return from asset m at time t
cov[X]       Covariance of the random variable X
var[X]       Variance of the random variable X
tr[X]        The trace of the matrix X
E[X]         Expected value of the random variable X

Abbreviations

aHEP   Average historical equity premium
BMA    Bayesian model averaging
DJIA   Dow Jones Industrial Average
EEP    Expected equity premium
GDP    Gross domestic product
HEP    Historical equity premium
IEP    Implied equity premium
OLS    Ordinary least squares
REP    Required equity premium


Chapter 1

Introduction

The expected equity risk premium is one of the single most important economic variables. A meaningful estimate of the premium is critical to valuing companies and stocks and for planning future investments. However, the only premium that can be observed is the historical premium.

Since the equity premium is shaped by overall market conditions, factors influencing market conditions can be used to explain the equity premium. Although predictive power usually is low, the factors can also be used for forecasting. Many of the investigations undertaken typically set out to determine a best model, consisting of a set of economic predictors, and then proceed as if the selected model had generated the equity premium. Such an approach ignores the uncertainty in model selection, leading to overconfident inferences that are riskier than one thinks they are. In our thesis we will forecast the equity premium by computing a weighted average of a large number of linear prediction models using Bayesian model averaging (BMA), so that model uncertainty is taken into account.

Having forecasted the equity premium, the key input for asset allocation optimization models, we conclude by highlighting the main pitfalls in the mean-variance optimization framework and present portfolio resampling as a way to arrive at suitable allocation decisions when the input parameters are very uncertain.


1.1 Objectives

The objective of this thesis is to build a framework for forecasting the equity premium and then implement it to produce a functional tool for practical use. Further, the impact of uncertain input parameters in mean-variance optimization shall be investigated.

1.2 Problem definition

By means of BMA and linear prediction, what is the expected equity premium for the years to come and how is it best used as an input in a mean-variance optimization problem?

1.3 Limitations

The practical part of this thesis is limited to the use of US time series only. However, the theoretical framework is valid for all economies.

1.4 Contributions

To the best knowledge of the authors, this is the first attempt to forecast the equity premium using Bayesian model averaging with the priors specified later in the thesis.

1.5 Outline

The first part of the thesis is about forecasting the equity premium, whereas the second part discusses the importance of parameter uncertainty in portfolio optimization.

In chapter 2 we present the concept of the equity premium, usual assumptions thereof and associated models. Chapter 3 describes the fundamental ideas of linear regression and its limitations. In chapter 4 we first present basic concepts of Bayesian statistics and then use them to combine the properties of linear prediction with Bayesian model averaging. Having defined the forecasting approach, we turn in chapter 5 to the factors explaining the equity premium. Chapter 6 addresses the implementation of the theory. Chapter 7 presents our results, and a discussion thereof is found in chapter 8. In chapter 9 we investigate the impact of estimation error on portfolio optimization. In chapter 10 we evaluate the performance of a portfolio when using the forecasted equity premium and portfolio resampling. With chapter 11 we conclude our thesis and make propositions for future investigations and work.


Part I: Equity Premium Forecasting using Bayesian Statistics


Chapter 2

The Equity Premium

In this chapter we define the concept of the equity premium and present some models that have been used for estimating the premium. At the end of the chapter, a table summing up advantages and disadvantages of the different models is provided. The chapter concludes with a motivation for why we have chosen to work with multi factor models and a summary of criteria for a good model.

2.1 What is the equity premium?

As defined by Fernández [32], the equity premium can be split up into four different concepts. These concepts hold for single stocks as well as for stock indices. In our thesis the emphasis is on stock indices.

• historical equity premium (HEP): historical return of the stock market over the riskfree asset

• expected equity premium (EEP): expected return of the stock market over the riskfree asset

• required equity premium (REP): incremental return of the market portfolio over the riskfree rate required by an investor in order to hold the market portfolio, or the extra return that the overall stock market must provide over the riskfree asset to compensate for the extra risk

• implied equity premium (IEP): the required equity premium that arises from a pricing model and from assuming that the market price is correct

The HEP is observable on the financial market and is equal for all investors.¹ It is calculated by

$$\mathrm{HEP}_t = r_{m,t} - r_{f,t-1} = \left(\frac{P_t}{P_{t-1}} - 1\right) - r_{f,t-1} \qquad (2.1)$$

¹This is true as long as they use the same instruments and the same time resolution.


where $r_{m,t}$ is the return on the stock market, $r_{f,t-1}$ is the rate on a riskfree asset from $t-1$ to $t$, and $P_t$ is the stock index level.

A widely used measure for $r_{m,t}$ is the return on a large stock index. For the second asset $r_{f,t-1}$ in (2.1), the return on government securities is usually used. Some practitioners use the return on short-term treasury bills; some use the returns on long-term government bonds. Yields on bonds instead of returns have also been used to some extent. Despite the indisputable importance of the equity premium, a general consensus on exactly which assets should enter expression (2.1) does not exist. Questions like "Which stock index should be used?" and "Which riskfree instrument should be used and which maturity should it have?" remain unanswered.
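For concreteness, the computation in (2.1) can be sketched in a few lines. This is a minimal illustration in Python (the thesis's own implementation, per appendix B, is in Matlab); the price and rate series below are hypothetical placeholders, not the thesis data set.

```python
import numpy as np

# Hypothetical stock index levels P_t and riskfree rates r_{f,t-1}
# (the rate set at t-1 and earned from t-1 to t); illustrative values only.
P = np.array([100.0, 103.0, 101.5, 107.0, 110.2])
rf = np.array([0.004, 0.004, 0.005, 0.005])   # r_{f,t-1} for t = 1,...,4

# Equation (2.1): HEP_t = (P_t / P_{t-1} - 1) - r_{f,t-1}
hep = (P[1:] / P[:-1] - 1.0) - rf

print(hep)          # realized equity premium per period
print(hep.mean())   # arithmetic average HEP, a common proxy for the EEP
```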

The EEP is made up of the market's expectations of future returns over a riskfree asset and is therefore not observable in the financial market. Its magnitude and the most appropriate way to produce estimates thereof is an intensively debated topic among economists. The market expectations shaping the premium are based on, at least, a non-negative premium and to some extent also average realizations of the HEP. This would mean that there is a relation between the EEP and the HEP. Some authors (e.g. [9], [21], [37] and [42]) even argue that there is a strict equality between the two, whereas others claim that the EEP is smaller than the HEP (e.g. [45], [6] and [22]). Although investors have different opinions on what is the correct level of the expected equity premium, many basic financial books recommend using 5-8%.²

The required equity premium (REP) is important in valuation since it is the key to determining the company’s required return on equity.

If one believes that prices on the financial markets are correct, then the implied equity premium (IEP) would be an estimate of the expected equity premium (EEP). We now turn to presenting models being used to produce estimates of the different concepts.

2.2 Historical models

Probably the most used method among practitioners is to use the historical realized equity premium as a proxy for the expected equity premium [64], thereby implicitly following the relationship HEP = EEP.

Assuming that the historical equity premium is equal to the expected equity premium can be formulated as

$$r_{m,t} = E_{t-1}[r_{m,t}] + e_{m,t} \qquad (2.2)$$


where $e_{m,t}$ is the error term, the unexpected return. The expectation is often computed as the arithmetic average of all available values for the HEP. In equation (2.2), it is assumed that the errors are independent and have a mean of zero. The model then implies that investors are rational and the random error term corresponds to their mistakes. It is also possible to model more advanced errors. For example, an autoregressive error term might be motivated since market returns sometimes exhibit positive autocorrelation. An AR(1) model then implies that investors need one time step to learn about their mistakes. [64]

The model has the advantages of being intuitive and easy to use. The drawbacks, on the other hand, are not few. Apart from the usual problems with time series, such as sample length and outliers, the model suffers from problems with longer periods where the riskfree asset has a higher average return than the equity. Clearly, this is not plausible since an investor expects a positive return in order to invest.

2.3 Implied models

Implied models for the equity premium make use of the assumption EEP = IEP and are used in much the same way as investors use the Black-Scholes formula backwards to solve for implied volatility. The advantage of implied models is that they provide time-varying estimates of the expected market returns since prices and expectations change over time. The main drawback is that the validity is bounded by the validity of the model used. Lately, the inverse Black-Litterman model has attracted interest, see for instance [67]. Another more widely used model is the Gordon dividend growth model, which is further discussed in [11]. Under certain assumptions it can be written as

$$P_{i,t} = \frac{E[D_{i,t+1}]}{E[r_{i,t+1}] - E[g_{i,t+1}]} \qquad (2.3)$$

where $E[D_{i,t+1}]$ is next year's expected dividend, $E[r_{i,t+1}]$ the required rate of return and $E[g_{i,t+1}]$ the company's expected growth rate of dividends from today until infinity.

Assuming that CAPM³ holds, the required rate of return for stock $i$ can be written as

$$E[r_{i,t}] = r_{f,t} + \beta_{i,t} E[r_{m,t} - r_{f,t}] \qquad (2.4)$$

By combining the two equations, approximating dividends as $E[D_{i,t+1}] = (1 + E[g_{i,t+1}])D_{i,t}$, assuming that $E[r_{f,t+1}] = r_{f,t+1}$ and aggregating over all assets, we can now solve for the expected market risk premium

$$E[r_{m,t+1}] = (1 + E[g_{m,t+1}])\frac{D_{m,t}}{P_{m,t}} + E[g_{m,t+1}] = (1 + E[g_{m,t+1}])\,\mathrm{DivYield}_{m,t} + E[g_{m,t+1}] \qquad (2.5)$$

where $E[r_{m,t+1}]$ is the expected market risk premium, $D_{m,t}$ is the sum of dividends from all companies, $E[g_{m,t+1}]$ is the expected growth rate of the dividends from today to infinity⁴, and $\mathrm{DivYield}_{m,t}$ is the current market price dividend yield. [64]
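A small numerical sketch of (2.5): given an assumed dividend yield and dividend growth rate (both hypothetical numbers, not estimates from the thesis), the implied expected market return follows directly, and subtracting a riskfree rate gives an implied premium.

```python
# Hypothetical inputs to the Gordon growth expression (2.5)
div_yield = 0.025   # DivYield_{m,t}, current market dividend yield
g = 0.05            # E[g_{m,t+1}], expected dividend growth to infinity
rf = 0.04           # riskfree rate

expected_return = (1.0 + g) * div_yield + g   # equation (2.5)
implied_premium = expected_return - rf        # IEP used as an EEP estimate
print(expected_return, implied_premium)
```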

One criticism of using the Gordon dividend growth model is that the result depends heavily on what number is used for the expected dividend growth rate, and thereby the problem is shifted to forecasting the expected dividend growth rate.

2.4 Conditional models

Conditional models refer to models conditioning on the information investors use to estimate the risk premium, thereby allowing for time-varying estimates. On the other hand, the information set $\Omega_t$ used by investors is not observable on the market and it is not clear how to specify a method that investors use to form their expectations from the data set.

As an example of such a model, the conditional version of the CAPM implies the following restriction for the excess returns

$$E[r_{i,t}|\Omega_{t-1}] = \beta_{i,t} E[r_{m,t}|\Omega_{t-1}] \qquad (2.6)$$

where the market beta is

$$\beta_{i,t} = \frac{\mathrm{cov}[r_{i,t}, r_{m,t}|\Omega_{t-1}]}{\mathrm{var}[r_{m,t}|\Omega_{t-1}]} \qquad (2.7)$$

and $E[r_{i,t}|\Omega_{t-1}]$ and $E[r_{m,t}|\Omega_{t-1}]$ are expected returns on asset $i$ and the market portfolio conditional on investors' information set $\Omega_{t-1}$⁵.

Observing that the ratio $E[r_{m,t}|\Omega_{t-1}]/\mathrm{var}[r_{m,t}|\Omega_{t-1}]$ is the market price of risk $\lambda_{m,t}$, measuring the compensation an investor must receive for a unit increase in the market return variance [55], yields the following expression for the market portfolio's expected excess returns

$$E[r_{m,t}|\Omega_{t-1}] = \lambda_{m,t}(\Omega_{t-1})\,\mathrm{var}[r_{m,t}|\Omega_{t-1}]. \qquad (2.8)$$

By specifying a model for the conditional variance process, the equity premium can be estimated.

⁴$E[r_{m,t+1}] > E[g_{m,t+1}]$
⁵Both returns are in excess of the riskless rate of return $r_{f,t-1}$ and all returns are measured …


2.5 Multi factor models

Multi factor models make use of correlation between equity returns and returns from other economic factors. By choosing a set of economic factors and by determining the coefficients, the equity premium can be estimated as

$$r_{m,t} = \alpha_t + \sum_j \beta_{j,t} X_{j,t} + \varepsilon_t \qquad (2.9)$$

where the coefficients $\alpha$ and $\beta$ usually are calculated using the least squares method (OLS), $X$ contains the factors and $\varepsilon$ is the error.

The most prominent candidates of economic factors used as explanatory variables are the dividend to price ratio and the dividend yield (e.g. [60], [12], [28], [40] and [51]), the earnings to price ratio (e.g. [13], [14] and [48]), the book to market ratio (e.g. [46] and [58]), short term interest rates (e.g. [40] and [1]), yield spreads (e.g. [43], [15] and [29]), and more recently the consumption-wealth ratio (e.g. [50]). Other candidates are dividend payout ratios, corporate or net issuing ratios and beta premia (e.g. [37]), the term spread and the default spread (e.g. [2], [15], [29] and [43]), the inflation rate (e.g. [30], [27] and [19]), value of high and low beta stocks (e.g. [57]) and aggregate financing activity (e.g. [3]).

Goyal and Welch [37] showed that most of the mentioned predictors performed worse out-of-sample than just assuming that the equity premium had been constant. They also found that the predictors were not stable, that is, their importance changes over time. Campbell and Thompson [16] on the other hand found that some of the predictors, with significant forecasting power in-sample, generally have better out-of-sample forecast power than a forecast based on the historical average.
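To make the multi factor setup (2.9) concrete, the sketch below estimates a lagged-factor regression by OLS and produces a one-step-ahead forecast. The data are randomly generated stand-ins, not the factor set of chapter 5, and the one-step lagging mimics the principle later summarized in table 5.6.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 120, 3
X = rng.normal(size=(T, k))                          # factor series X_{j,t}
y = 0.02 + X @ np.array([0.5, -0.3, 0.2]) + 0.1 * rng.normal(size=T)

# Lag the factors one step: X_{t-1} explains the premium r_{m,t}
X_lag = np.column_stack([np.ones(T - 1), X[:-1]])
y_next = y[1:]

# OLS estimates of (alpha, beta_1, ..., beta_k) in (2.9)
coef, *_ = np.linalg.lstsq(X_lag, y_next, rcond=None)

# One-step-ahead forecast from the latest factor observation
forecast = np.concatenate(([1.0], X[-1])) @ coef
print(coef, forecast)
```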


2.6 A short summary of the models

Table 2.1. Advantages and disadvantages of the discussed models.

Historical
  Advantages: Intuitive and easy to use.
  Disadvantages: Might have problems with longer periods of negative equity premium. Doubtful whether the past is an indicator for the future.

Implied
  Advantages: Relatively simple to use. Provides time-varying estimates for the premium.
  Disadvantages: The validity of the estimates is bounded by the validity of the used model. Assumes market prices are correct.

Conditional
  Advantages: Provides time-varying estimates for the premium.
  Disadvantages: The information used by investors is not visible on the market. Models for determining how investors form their expectations from the information are not unambiguous.

Multi factor
  Advantages: High model transparency and results that are easy to interpret.
  Disadvantages: Doubtful whether the past is an indicator for the future. Forecasts are only possible for a short time horizon, due to lagging.


2.7 What is a good model?

These are model criteria that the authors, inspired by Vaihekoski [64], consider important for a good estimate of the equity premium:

Economical reasoning criteria

• The premium estimate should be positive for most of the time
• Model inputs should be visible at the financial markets
• The estimated premium should be rather smooth over time, because investor preferences presumably do not change much over time
• The model should provide different premium estimates for different time horizons, that is, take investors' "time structure" into account

Technical reasoning criteria

• The model should allow for time variation in the premium
• The model should make use of the latest time t observation
• The model should provide a precision of the estimated premium
• It should be possible to use different time resolutions in the data input

2.8 Chosen model

All model categories previously stated are likely to be useful in estimating the equity premium. In our thesis we have chosen to work with multi factor models because they are intuitively more straightforward than both implied and conditional models; all model inputs are visible on the market and it is perfectly clear from the model how different factors add up to the equity premium. Furthermore, it is easy to add constraints to the model, which enables the use of economic reasoning as a complement to pure statistical analysis.


Chapter 3

Linear Regression Models

First we summarize the mechanics of linear regressions and present some formulas that hold regardless of which statistical assumptions are made. Then we discuss different statistical assumptions about the properties of the model and the robustness of the estimates.

3.1 Basic definitions

Suppose that a scalar $y_t$ is related to a vector $x_t \in \mathbb{R}^{k\times 1}$ and a noise term $u_t$ according to the regression model

$$y_t = x_t^\top \beta + u_t. \qquad (3.1)$$

Definition 3.1 (Ordinary least squares, OLS) Given an observed sample $(y_1, y_2, \ldots, y_T)$, the ordinary least squares estimate of $\beta$ (denoted $\hat\beta$) is the value that minimizes the residual sum of squares

$$V(\beta) = \sum_{t=1}^{T} \varepsilon_t^2(\beta) = \sum_{t=1}^{T} (y_t - \hat y_t)^2 = \sum_{t=1}^{T} (y_t - x_t^\top\beta)^2$$

(see [38]).

Theorem 3.1 (Ordinary least squares estimate) The OLS estimate is given by

$$\hat\beta = \Big[\sum_{t=1}^{T} x_t x_t^\top\Big]^{-1}\Big[\sum_{t=1}^{T} x_t y_t\Big] \qquad (3.2)$$

assuming that the matrix $\sum_{t=1}^{T} x_t x_t^\top \in \mathbb{R}^{k\times k}$ is nonsingular (see [38]).

Proof: The result is found by differentiation,

$$\frac{dV(\beta)}{d\beta} = -2\sum_{t=1}^{T} x_t(y_t - x_t^\top\beta) = 0,$$

and the minimizing argument is thus

$$\hat\beta = \Big[\sum_{t=1}^{T} x_t x_t^\top\Big]^{-1}\Big[\sum_{t=1}^{T} x_t y_t\Big]. \qquad \square$$

Often, the regression model is written in matrix notation as

$$y = X\beta + u, \qquad (3.3)$$

where

$$y \equiv \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X \equiv \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}, \quad u \equiv \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.$$

A perhaps more intuitive way to arrive at equation (3.2) is to project y on the column space of X.

Figure 3.1. OLS by means of projection

The vector of the OLS sample residuals, $\hat u$, can then be written as $\hat u = y - X\hat\beta$. Consequently, the minimum of the loss function $V(\beta)$ for the least squares problem can be written $\min_\beta V(\beta) = \hat u^\top\hat u$.

Since $\hat y$, the projection of $y$ on the column space of $X$, is orthogonal to $\hat u$,

$$\hat u^\top \hat y = \hat y^\top \hat u = 0. \qquad (3.4)$$

In the same way, the OLS sample residuals are orthogonal to the explanatory variables in $X$,

$$X^\top \hat u = 0. \qquad (3.5)$$

(33)

3.1 Basic definitions 19

Now, substituting $\hat y = X\beta$ into (3.4) yields

$$(X\beta)^\top(y - X\beta) = 0 \;\Leftrightarrow\; \beta^\top(X^\top y - X^\top X\beta) = 0.$$

By choosing the nontrivial solution for $\beta$, and by noticing that if $X$ is of full rank, then the matrix $X^\top X$ is also of full rank, we can compute the least squares estimator by inverting $X^\top X$,

$$\hat\beta = (X^\top X)^{-1}X^\top y. \qquad (3.6)$$

The OLS sample residual $\hat u$ shall not be confused with the population residual $u$. The vector of OLS sample residuals can be written as

$$\hat u = y - X\hat\beta = y - X(X^\top X)^{-1}X^\top y = [I_n - X(X^\top X)^{-1}X^\top]y = M_X y. \qquad (3.7)$$

The relationship between the two errors can now be found by substituting equation (3.3) into equation (3.7):

$$\hat u = M_X(X\beta + u) = M_X u. \qquad (3.8)$$

The difference between the OLS estimate $\hat\beta$ and the true parameter $\beta$ is found by substituting equation (3.3) into (3.6):

$$\hat\beta = (X^\top X)^{-1}X^\top[X\beta + u] = \beta + (X^\top X)^{-1}X^\top u. \qquad (3.9)$$
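The identities (3.4)-(3.9) are easy to verify numerically. The sketch below, with randomly generated data purely for illustration, computes $\hat\beta$ by (3.6) and checks the orthogonality relations and the annihilator form (3.7).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # with intercept
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # equation (3.6)
y_hat = X @ beta_hat
u_hat = y - y_hat                                  # OLS sample residuals

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # M_X from (3.7)

print(np.allclose(X.T @ u_hat, 0.0))    # orthogonality (3.5)
print(np.allclose(y_hat @ u_hat, 0.0))  # orthogonality (3.4)
print(np.allclose(M @ y, u_hat))        # u_hat = M_X y, equation (3.7)
```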

Definition 3.2 (Coefficient of determination) The coefficient of determination, $R^2$, is defined as the fraction of variance that is explained by the model,

$$R^2 = \frac{\mathrm{var}[\hat y]}{\mathrm{var}[y]}.$$

If we let $X$ include an intercept, then (3.5) also implies that the fitted residuals have a zero mean, $\frac{1}{n}\sum_{i=1}^n \hat u_i = 0$. Now we can decompose the variance of $y$ into the variance of $\hat y$ and $\hat u$,

$$\mathrm{var}[y] = \mathrm{var}[\hat y + \hat u] = \mathrm{var}[\hat y] + \mathrm{var}[\hat u] + 2\,\mathrm{cov}[\hat y, \hat u].$$

Rewriting the covariance as

$$\mathrm{cov}[\hat y, \hat u] = E[\hat y \hat u] - E[\hat y]E[\hat u]$$

and using $\hat y \perp \hat u$ and $E[\hat u] = 0$, we can write $R^2$ as

$$R^2 = \frac{\mathrm{var}[\hat y]}{\mathrm{var}[y]} = 1 - \frac{\mathrm{var}[\hat u]}{\mathrm{var}[y]}.$$


Since OLS minimizes the sum of squared fitted errors, which is proportional to $\mathrm{var}[\hat u]$, it also maximizes $R^2$.

By substituting the estimated variances, $R^2$ can be written as

$$R^2 = \frac{\mathrm{var}[\hat y]}{\mathrm{var}[y]} = \frac{\frac{1}{n}\sum_{i=1}^n (\hat y_i - \bar y)^2}{\frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2} = \frac{\sum_{i=1}^n \hat y_i^2 - n\bar y^2}{\sum_{i=1}^n y_i^2 - n\bar y^2} = \frac{(X\hat\beta)^\top(X\hat\beta) - n\bar y^2}{y^\top y - n\bar y^2} = \frac{y^\top X(X^\top X)^{-1}X^\top y - n\bar y^2}{y^\top y - n\bar y^2}$$

where the identity used is calculated as

$$\sum_{i=1}^n (x_i - \bar x)^2 = \sum_{i=1}^n \Big[x_i^2 - \frac{2}{n}x_i\sum_{j=1}^n x_j + \frac{1}{n^2}\Big(\sum_{j=1}^n x_j\Big)^2\Big] = \sum_{i=1}^n x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^n x_i\Big)^2 = \sum_{i=1}^n x_i^2 - n\bar x^2.$$


3.2 The classical regression assumptions

The following assumptions¹ are used for later calculations:

1. $x_t$ is a vector of deterministic variables
2. $u_t$ is i.i.d. with mean 0 and variance $\sigma^2$ ($E[u] = 0$ and $E[uu^\top] = \sigma^2 I_n$)
3. $u_t$ is Gaussian$(0, \sigma^2)$

Substituting equation (3.3) into equation (3.6) and taking expectations using assumptions 1 and 2 establishes that $\hat\beta$ is unbiased,

$$\hat\beta = (X^\top X)^{-1}X^\top[X\beta + u] = \beta + (X^\top X)^{-1}X^\top u \qquad (3.10)$$

$$E[\hat\beta] = \beta + (X^\top X)^{-1}X^\top E[u] = \beta \qquad (3.11)$$

with covariance matrix given by

$$E[(\hat\beta - \beta)(\hat\beta - \beta)^\top] = E[(X^\top X)^{-1}X^\top uu^\top X(X^\top X)^{-1}] = (X^\top X)^{-1}X^\top E[uu^\top]X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}. \qquad (3.12)$$

When $u$ is Gaussian, the above calculations imply that $\hat\beta$ is Gaussian. Hence, the preceding results imply

$$\hat\beta \sim N(\beta, \sigma^2(X^\top X)^{-1}).$$

It can further be shown that under assumptions 1, 2 and 3, $\hat\beta$ is BLUE², that is, no unbiased estimator of $\beta$ is more efficient than the OLS estimator $\hat\beta$.

¹As treated in [38]
²Best linear unbiased estimator


3.3 Robustness of OLS estimates

The most serious problem with OLS is non-robustness to outliers. One single bad point will have a strong influence on the solution. To remedy this one can discard the worst fitting data point and recompute the OLS fit. In figure 3.2, the black line illustrates the result of discarding an outlier.

Figure 3.2. The effect of outliers

Deleting an extreme point can be justified by arguing that outliers are rare and practically unpredictable, so that deleting them strengthens the predictive power. Sometimes extreme points correspond to extraordinary changes in economies, and depending on context it might be more or less justified to discard them.

Because outliers do not necessarily produce larger residuals, they might be easy to overlook. A good measure of the influence of a data point is its leverage.

Definition 3.3 (Leverage) To compute leverage in ordinary least squares, the hat matrix $H$ is given by $H = X(X^\top X)^{-1}X^\top$, where $X \in \mathbb{R}^{n\times p}$ and $n \geq p$.

Since $\hat y = X\hat\beta = X(X^\top X)^{-1}X^\top y$, the leverage measures how an observation estimates its own predicted value. The diagonal elements $h_{ii}$ of $H$ contain the leverage measures and are not influenced by $y$. A rule of thumb [39] for detecting outliers is that $h_{ii} > \frac{2(p+1)}{n}$ signals a high leverage point, where $p$ is the number of columns in the predictor matrix $X$ aside from the intercept and $n$ is the number of observations. [39]
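The leverage rule of thumb translates directly into code. The following sketch, on hypothetical data and with a helper name of our own choosing, computes the diagonal of the hat matrix and flags observations with $h_{ii} > 2(p+1)/n$.

```python
import numpy as np

def high_leverage(X):
    """Return leverages h_ii and indices where h_ii > 2(p+1)/n.

    X is the n-by-(p+1) predictor matrix including an intercept column,
    so p counts the predictors aside from the intercept."""
    n, cols = X.shape
    p = cols - 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix H
    h = np.diag(H)                          # leverage measures
    return h, np.where(h > 2.0 * (p + 1) / n)[0]

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
X[0, 1] = 8.0                               # plant one extreme observation
h, flagged = high_leverage(X)
print(flagged)                              # observation 0 should be flagged
```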


3.4 Testing the regression assumptions

Unfortunately assumption 2 can easily be violated for time series data since many time series exhibit autocorrelation, resulting in the OLS estimates being inefficient, that is, they have higher variability than they should.

Definition 3.4 (Autocorrelation function) The $j$th autocorrelation of a covariance stationary process³, denoted $\rho_j$, is defined as its $j$th autocovariance divided by the variance,

$$\rho_j \equiv \frac{\gamma_j}{\gamma_0}, \quad \text{where } \gamma_j = E[(Y_t - \mu)(Y_{t-j} - \mu)]. \qquad (3.13)$$

Since $\rho_j$ is a correlation, $|\rho_j| \leq 1$ for all $j$. Note also that $\rho_0$ equals unity for all covariance stationary processes.

A natural estimate of the sample autocorrelation $\rho_j$ is provided by the corresponding sample moments,

$$\hat\rho_j \equiv \frac{\hat\gamma_j}{\hat\gamma_0}, \quad \text{where } \hat\gamma_j = \frac{1}{T}\sum_{t=j+1}^{T}(Y_t - \bar y)(Y_{t-j} - \bar y), \quad j = 0, 1, 2, \ldots, T-1, \quad \bar y = \frac{1}{T}\sum_{t=1}^{T} Y_t.$$

Definition 3.5 (Durbin-Watson test) The Durbin-Watson test statistic is used to detect the presence of autocorrelation in the residuals from a regression analysis and is defined by

$$DW = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \qquad (3.14)$$

where the $e_t$, $t = 1, 2, \ldots, T$, are the regression analysis residuals.

The null hypothesis of the statistic is that there is no autocorrelation, $\rho = 0$, against the alternative that there is autocorrelation, $\rho \neq 0$. Durbin and Watson [23] derive lower and upper bounds for the critical values, see table 3.1.

ρ = 0  → DW ≈ 2   (no correlation)
ρ = 1  → DW ≈ 0   (positive correlation)
ρ = −1 → DW ≈ 4   (negative correlation)

Table 3.1. Critical values for the Durbin-Watson test.


One way to check assumption 3 is to plot the underlying probability distribution of the sample against the theoretical distribution. Figure 3.3 is called a Q-Q plot.

Figure 3.3. Example of a Q-Q plot

For a more detailed analysis the Jarque-Bera test, a goodness-of-fit measure of departure from normality based on skewness and kurtosis, can be employed.

Definition 3.6 (Jarque-Bera test) The test statistic $JB$ is defined as

$$JB = \frac{n}{6}\Big(S^2 + \frac{(K-3)^2}{4}\Big) \qquad (3.15)$$

where $n$ is the number of observations, $S$ is the sample skewness and $K$ is the sample kurtosis, defined as

$$S = \frac{\frac{1}{n}\sum_{k=1}^{n}(x_k - \bar x)^3}{\big(\frac{1}{n}\sum_{k=1}^{n}(x_k - \bar x)^2\big)^{3/2}}, \qquad K = \frac{\frac{1}{n}\sum_{k=1}^{n}(x_k - \bar x)^4}{\big(\frac{1}{n}\sum_{k=1}^{n}(x_k - \bar x)^2\big)^{2}}$$

where $\bar x$ is the sample mean.

Asymptotically $JB \sim \chi^2(2)$, which can be used to test the null hypothesis that the data are from a normal distribution. The null hypothesis is a joint hypothesis of the skewness being 0 and the kurtosis being 3, since samples from a normal distribution have an expected skewness of 0 and an expected kurtosis of 3. The definition shows that any deviation from these expectations increases the JB statistic.
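Both test statistics are simple to compute from a residual series. The sketch below implements (3.14) and (3.15) directly; it is illustrative only, and critical values and p-values are not computed here.

```python
import numpy as np

def durbin_watson(e):
    # Equation (3.14): squared first differences over squared residuals
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def jarque_bera(x):
    # Equation (3.15): JB = n/6 * (S^2 + (K - 3)^2 / 4)
    n = len(x)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    S = np.mean(d ** 3) / m2 ** 1.5   # sample skewness
    K = np.mean(d ** 4) / m2 ** 2     # sample kurtosis
    return n / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(3)
e = rng.normal(size=200)      # i.i.d. Gaussian residuals for illustration
print(durbin_watson(e))       # near 2: no autocorrelation
print(jarque_bera(e))         # small: no evidence against normality
```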


Chapter 4

Bayesian Statistics

First, we introduce fundamental concepts of Bayesian statistics and then we provide tools for calculating posterior densities, which are crucial to our forecasting.

4.1 Basic definitions

Definition 4.1 (Prior and posterior) If $M_j$, $j \in J$, are considered models, then for any data $D$,

$p(M_j)$, $j \in J$, are called the prior probabilities of the $M_j$, $j \in J$,
$p(M_j|D)$, $j \in J$, are called the posterior probabilities of the $M_j$, $j \in J$,

where $p$ denotes probability distribution functions (see [5]).

Definition 4.2 (The likelihood function) Let $x = (x_1, \ldots, x_n)$ be a random sample from a distribution $p(x; \theta)$ depending on an unknown parameter $\theta$ in the parameter space $A$. The function $l_x(\theta) = \prod_{i=1}^{n} p(x_i; \theta)$ is called the likelihood function.

The likelihood function is then the probability that the values $x_1, \ldots, x_n$ are in the random sample. Mind that the probability density is written as $p(x; \theta)$. This is to emphasize that $\theta$ is the underlying parameter and will not be written out explicitly in the sequel. Depending on context we will also refer to the likelihood function as $p(x|\theta)$ instead of $l_x(\theta)$.

Theorem 4.1 (Bayes's theorem) Let $p(y, \theta)$ denote the joint probability density function (pdf) for a random observation vector $y$ and a parameter vector $\theta$, also considered random. Then according to usual operations with pdf's, we have

$$p(y, \theta) = p(y|\theta)p(\theta) = p(\theta|y)p(y)$$

and thus

$$p(\theta|y) = \frac{p(\theta)p(y|\theta)}{p(y)} = \frac{p(\theta)p(y|\theta)}{\int_A p(y|\theta)p(\theta)\,d\theta} \qquad (4.1)$$


with $p(y) \neq 0$. In the discrete case, the theorem is written as

$$p(\theta|y) = \frac{p(\theta)p(y|\theta)}{p(y)} = \frac{p(\theta)p(y|\theta)}{\sum_{i\in A} p(y|\theta_i)p(\theta_i)}. \qquad (4.2)$$

The last expression can be written as follows:

$$p(\theta|y) \propto p(\theta)p(y|\theta)$$

$$\text{posterior pdf} \propto \text{prior pdf} \times \text{likelihood function} \qquad (4.3)$$

Here, $p(y)$, the normalizing constant needed to obtain a proper distribution in $\theta$, is discarded and $\propto$ denotes proportionality. The use of the symbol $\propto$ is explained in the next section.

Figure 4.1 highlights the importance of Bayes's theorem and shows how the prior information enters the posterior pdf via the prior pdf, whereas all the sample information enters the posterior pdf via the likelihood function.

Figure 4.1. Bayesian revising of probabilities

Note that an important difference between Bayesian statistics and classical Fisher statistics is that the parameter vector $\theta$ is considered to be a stochastic variable rather than an unknown parameter.
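A tiny numerical illustration of (4.2)-(4.3): with two candidate models, the posterior probabilities are the normalized products of priors and likelihoods. The numbers below are assumed purely for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.5])        # p(M_1), p(M_2)
likelihood = np.array([0.8, 0.3])   # p(D|M_1), p(D|M_2), assumed values

posterior = prior * likelihood      # proportional form (4.3)
posterior /= posterior.sum()        # normalize by p(D), as in (4.2)
print(posterior)                    # approx. [0.727, 0.273]
```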

4.2 Sufficient statistics

A sufficient statistic can be seen as a summary of the information in data, where redundant and uninteresting information has been removed.

Definition 4.3 (Sufficient statistic) A statistic $t(x)$ is sufficient for an underlying parameter $\theta$ precisely if the conditional probability distribution of the data $x$, given the statistic $t(x)$, is independent of the parameter $\theta$ (see [17]).

In short, the definition states that $\theta$ cannot give any further information about $x$ if $t(x)$ is sufficient for $\theta$, that is, $p(x|t, \theta) = p(x|t)$.

Neyman's factorization theorem provides a convenient characterization of a sufficient statistic.


Theorem 4.2 (Neyman's factorization theorem) A statistic $t$ is sufficient for $\theta$ given $y$ if and only if there are functions $f$ and $g$ such that

$$p(y|\theta) = f(t, \theta)g(y)$$

where $t = t(y)$ (see [49]).

Proof: For a proof, see [49].

Here, $t(y)$ is the sufficient statistic and the function $f(t, \theta)$ relates the sufficient statistic to the parameter $\theta$, while $g(y)$ is a $\theta$-independent normalization factor of the pdf.

It turns out that many of the common statistical distributions have a similar form. This leads to the definition of the exponential family.

Definition 4.4 (The exponential family) A distribution is from the one-parameter exponential family if it can be put into the form

$$p(y|\theta) = g(y)h(\theta)\exp[t(y)\Psi(\theta)].$$

Equivalently, if the likelihood of $n$ independent observations $y = (y_1, y_2, \ldots, y_n)$ from this distribution is of the form

$$l_y(\theta) \propto h(\theta)^n \exp\Big[\sum t(y_i)\Psi(\theta)\Big],$$

then it follows immediately from theorem 4.2 that $\sum t(y_i)$ is sufficient for $\theta$ given $y$.

Example 4.1: Sufficient statistics for a Gaussian

For a sequence of independent Gaussian variables with unknown mean $\mu$,

$$y_t = \mu + e_t \sim N(\mu, \sigma^2), \quad t = 1, 2, \ldots, N,$$

$$p(y|\mu) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big[-\frac{1}{2\sigma^2}(y_t - \mu)^2\Big] = \underbrace{\exp\Big[-\frac{1}{2\sigma^2}\Big(N\mu^2 - 2\mu\sum y_t\Big)\Big]}_{=f(t,\mu)}\;\underbrace{(2\pi\sigma^2)^{-N/2}\exp\Big[-\frac{1}{2\sigma^2}\sum y_t^2\Big]}_{=g(y)}$$

so $t(y) = \sum y_t$ is sufficient for $\mu$ given $y$.


4.3 Choice of prior

Suppose our model $M$ of a set of data $y$ is parameterized by $\theta$. Our knowledge about $\theta$ before $y$ is measured (given) is quantified by the prior pdf, $p(\theta)$. After measuring $y$ the posterior pdf is available as $p(\theta|y) \propto p(y|\theta)p(\theta)$. It is clear that different assumptions on $p(\theta)$ lead to different inferences $p(\theta|y)$.

A good rule of thumb for prior selection is that your prior should represent the best knowledge available about the parameters before looking at data. For example, the number of scores in a football game cannot be less than zero and is less than 1000, which justifies setting your prior equal to zero outside this interval. In the case that one does not have any information, a good idea might be to use an uninformative prior.

Definition 4.5 (Jeffreys prior) Jeffreys prior $p_J(\theta)$ is defined as proportional to the square root of the determinant of the Fisher information matrix of $p(y|\theta)$,

$$p_J(\theta) \propto |J(\theta|y)|^{1/2} \qquad (4.4)$$

where

$$J(\theta|y)_{i,j} = -E_y\Big[\frac{\partial^2 \ln p(y|\theta)}{\partial\theta_i\,\partial\theta_j}\Big]. \qquad (4.5)$$

The Fisher information is a way of measuring the amount of information that an observable random variable $y = (y_1, \ldots, y_n)$ carries about a set of unknown parameters $\theta = (\theta_1, \ldots, \theta_n)$. The notation $J(\theta|y)$ is used to make clear that the parameter vector $\theta$ is associated with the random variable $y$ and should not be thought of as conditioning. A perhaps more intuitive way¹ to write (4.5) is

$$J(\theta|y)_{i,j} = \mathrm{cov}_\theta\Big[\frac{\partial}{\partial\theta_i}\ln p(y|\theta),\; \frac{\partial}{\partial\theta_j}\ln p(y|\theta)\Big] \qquad (4.6)$$

Mind that the Fisher information is only defined under certain regularity conditions, which are further discussed in [24]. One might wonder why Jeffreys made his prior proportional to the square root of the determinant of the Fisher information matrix. There is a perfectly good reason for this: consider a transformation of the unknown parameters $\theta$ to $\psi(\theta)$; then if $K$ is the matrix $K_{ij} = \partial\theta_i/\partial\psi_j$,

$$J(\psi|y) = KJ(\theta|y)K^\top$$

and hence the determinant of the information satisfies

$$|J(\psi|y)| = |J(\theta|y)||K|^2.$$

Because $|K|$ is the Jacobian, and thus does not depend on $y$, it follows that $p_J(\theta) \propto |J(\theta|y)|^{1/2}$ provides a scale-invariant prior, which is a highly desirable property for a reference prior. In Jeffreys' own words, "any arbitrariness in the choice of parameters could make no difference to the results".

¹Remember that $\mathrm{cov}[x, y] = E[(x - \mu_x)(y - \mu_y)]$.


Example 4.2

Consider a random sample $y = (y_1, \ldots, y_n) \sim N(\theta, \phi)$, with mean $\theta$ known and variance $\phi$ unknown. The Jeffreys prior $p_J(\phi)$ for $\phi$ is then computed as follows:

$$L(\phi|y) = \ln p(y|\phi) = \ln\Big(\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\phi}}\exp\Big[-\frac{(y_i-\theta)^2}{2\phi}\Big]\Big) = -\frac{1}{2\phi}\sum_{i=1}^{n}(y_i-\theta)^2 - \frac{n}{2}\ln\phi + c$$

$$\frac{\partial^2 L}{\partial\phi^2} = -\frac{1}{\phi^3}\sum_{i=1}^{n}(y_i-\theta)^2 + \frac{n}{2\phi^2}$$

$$\Rightarrow\; -E\Big[\frac{\partial^2 L}{\partial\phi^2}\Big] = \frac{1}{\phi^3}E\Big[\sum_{i=1}^{n}(y_i-\theta)^2\Big] - \frac{n}{2\phi^2} = \frac{n\phi}{\phi^3} - \frac{n}{2\phi^2} = \frac{n}{2\phi^2}$$

$$\Rightarrow\; p_J(\phi) \propto |J(\phi|y)|^{1/2} \propto \frac{1}{\phi}$$

A natural question that arises is what choices of priors generate analytical expressions for the posterior distribution. This question leads to the notion of conjugate priors.

Definition 4.6 (Conjugate prior) Let $l$ be a likelihood function $l_y(\theta)$. A class $\Pi$ of prior distributions is said to form a conjugate family if the posterior density

$$p(\theta|y) \propto p(\theta)l_y(\theta)$$

is in the class $\Pi$ for all $y$ whenever the prior density is in $\Pi$ (see [49]).

There is a minor complication with the definition and a more rigorous definition is presented in [5]. However, the definition states the key principle in a clear enough manner.


Example 4.3

Let $x = (x_1, \ldots, x_n)$ have independent Poisson distributions with the same mean $\lambda$. Then the likelihood function $l_x(\lambda)$ equals

$$l_x(\lambda) = \prod_{i=1}^{n}\frac{\lambda^{x_i}}{x_i!}e^{-\lambda} = \frac{\lambda^{t}e^{-n\lambda}}{\prod_{i=1}^{n}x_i!} \propto \lambda^{t}e^{-n\lambda}$$

where $t = \sum_{i=1}^{n} x_i$ and by theorem 4.2 is sufficient for $\lambda$ given $x$.

If we let the prior of $\lambda$ be in the family $\Pi$ of constant multiples of chi-squared random variables, $p(\lambda) \propto \lambda^{v/2-1}e^{-S_0\lambda/2}$, then the posterior is also in $\Pi$:

$$p(\lambda|x) \propto p(\lambda)l_x(\lambda) = \lambda^{t+v/2-1}e^{-\frac{1}{2}(S_0+2n)\lambda}$$

The distribution of $p(\lambda)$ is explained in appendix A.2.
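Numerically, the conjugate update of example 4.3 amounts to bookkeeping of exponents: the prior $\lambda^{v/2-1}e^{-S_0\lambda/2}$ is a gamma density with shape $v/2$ and rate $S_0/2$, and the posterior has shape $t + v/2$ and rate $(S_0 + 2n)/2$. A minimal sketch with simulated data follows; the prior settings are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(lam=3.0, size=20)   # Poisson data with unknown mean lambda

v, S0 = 2.0, 1.0                     # prior: lambda^{v/2-1} e^{-S0*lambda/2}
t = x.sum()                          # sufficient statistic t = sum x_i

shape = t + v / 2.0                  # posterior shape, from example 4.3
rate = (S0 + 2 * len(x)) / 2.0       # posterior rate
print(shape / rate)                  # posterior mean, close to the true 3.0
```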

Conjugate priors are useful in computing posterior densities. Although there are not that many priors that are conjugate, there might be a risk of overuse since data might be better described by another distribution that is not conjugate.

4.4 Marginalization

A useful property of conditional probabilities is the possibility to integrate out undesired variables. According to usual operations on pdf's we have

$$\int p(a, b)\,db = p(a).$$

Analogously, for any likelihood function of two or more variables, marginal likelihoods with respect to any subset of the variables can be defined. Given the likelihood $l_y(\theta, M)$, the marginal likelihood $l_y(M)$ for model $M$ is

$$l_y(M) = p(y|M) = \int p(y|\theta, M)p(\theta|M)\,d\theta.$$

Unfortunately marginal likelihoods are often very difficult to calculate and numerical integration techniques might have to be employed.

4.5 Bayesian model averaging

To explain the powerful idea of Bayesian model averaging (BMA) we start with an example.

Example 4.4

Suppose we are analyzing data and believe that they arise from a set of probability distributions or models $\{M_i\}_{i=1}^{k}$. For example, the data might consist of a normally distributed outcome $y$ that we wish to predict future values of. We also have two other outcomes, $x_1$ and $x_2$, that covary with $y$. Using the two covariates as predictors of $y$ offers two models, $M_1$ and $M_2$, as explanations for what values $y$ is likely to take on in the future. A naive approach to deciding what future value of $y$ should be used might be to simply average the two estimates. But if one of the models suffers from bad predictive ability, then the average of the two estimates is not likely to be especially good. Bayesian model averaging solves this issue by weighting the estimates $\hat y_1$ and $\hat y_2$ by how likely the models are,

$$\hat y = p(M_1|\mathrm{Data})\hat y_1 + p(M_2|\mathrm{Data})\hat y_2.$$

Using theory from the previous chapters it is possible to compute the probability $p(M_i|\mathrm{Data})$ for each model.

We now treat the averaging more mathematically. Let $\Delta$ be a quantity of interest; then its posterior distribution given data $D$ is

$$p(\Delta|D) = \sum_{k=1}^{K} p(\Delta|M_k, D)\,p(M_k|D). \qquad (4.7)$$

This is a weighted average of the posterior probability where each model $M_k$ is considered. The posterior probability for model $M_k$ is

$$p(M_k|D) = \frac{p(D|M_k)p(M_k)}{\sum_{l=1}^{K} p(D|M_l)p(M_l)}, \qquad (4.8)$$

where

$$p(D|M_k) = \int p(D|\theta_k, M_k)p(\theta_k|M_k)\,d\theta_k \qquad (4.9)$$

is the marginalized likelihood of the model $M_k$ with parameter vector $\theta_k$ as defined in section 4.4. All probabilities are implicitly conditional on $\mathcal{M}$, the set of models being considered. The posterior mean and variance of $\Delta$ are given by

$$\xi = E[\Delta|D] = \sum_{k=1}^{K} \hat\Delta_k\, p(M_k|D) \qquad (4.10)$$

$$\phi = \mathrm{var}[\Delta|D] = E[\Delta^2|D] - E[\Delta|D]^2 = \sum_{k=1}^{K}\big(\mathrm{var}[\Delta|D, M_k] + \hat\Delta_k^2\big)p(M_k|D) - E[\Delta|D]^2 \qquad (4.11)$$

where $\hat\Delta_k = E[\Delta|D, M_k]$.
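Equations (4.10) and (4.11) combine per-model forecasts into a single mean and variance. A minimal sketch, with invented forecasts, variances and posterior weights:

```python
import numpy as np

delta_hat = np.array([0.04, 0.06, 0.02])   # E[Delta | D, M_k], assumed
var_k = np.array([0.010, 0.020, 0.015])    # var[Delta | D, M_k], assumed
w = np.array([0.5, 0.3, 0.2])              # p(M_k | D), sums to one

xi = np.sum(w * delta_hat)                            # posterior mean (4.10)
phi = np.sum(w * (var_k + delta_hat ** 2)) - xi ** 2  # posterior var (4.11)
print(xi, phi)
```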


4.6 Using BMA on linear regression models

Here, the key issue is the uncertainty about the choice of regressors, that is, the model uncertainty. Each model $M_j$ is of the previously discussed form $y = X_j\beta_j + u$, $y \sim N(X_j\beta_j, \sigma^2 I_n)$, where the regressors $X_j \in \mathbb{R}^{n\times p}$ $\forall j$, with the intercept included, correspond to the regressor set $j \in J$ specified in chapter 5. The quantity $y$ is the given data and we are interested in the quantity $\Delta$, the regression line.

$$p(y|\beta_j, \sigma^2) = l_y(\beta_j, \sigma^2) = \Big(\frac{1}{2\pi\sigma^2}\Big)^{n/2}\exp\Big[-\frac{1}{2\sigma^2}(y - X_j\beta_j)^\top(y - X_j\beta_j)\Big]$$

By completing the square in the exponent, the sum of squares can be written as

$$(y - X\beta)^\top(y - X\beta) = (\beta - \hat\beta)^\top X^\top X(\beta - \hat\beta) + (y - X\hat\beta)^\top(y - X\hat\beta),$$

where $\hat\beta = (X^\top X)^{-1}X^\top y$ is the OLS estimate. That the equality holds is proved by multiplying out the right hand side and checking that it equals the left hand side. As pointed out in section 3.1, $(y - X\hat\beta)$ is the residual vector $\hat u$, and its sum of squares divided by the number of observations less the number of covariates is known as the residual mean square, denoted by $s^2$. It is convenient to denote $n - p$ as $v$, known as the degrees of freedom of the model:

$$s^2 = \frac{\hat u^\top\hat u}{n - p} = \frac{\hat u^\top\hat u}{v} \;\Rightarrow\; \hat u^\top\hat u = vs^2.$$

Now we can write the likelihood as

$$l_y(\beta_j, \sigma^2) \propto (\sigma^2)^{-\frac{p_j}{2}}\exp\Big[-\frac{1}{2\sigma^2}(\beta_j - \hat\beta_j)^\top(X_j^\top X_j)(\beta_j - \hat\beta_j)\Big] \times (\sigma^2)^{-\frac{v_j}{2}}\exp\Big[-\frac{v_j s_j^2}{2\sigma^2}\Big].$$

The BMA analysis requires the specification of prior distributions for the parameters $\beta_j$ and $\sigma^2$. For $\sigma^2$ we choose an uninformative prior,

$$p(\sigma^2) \propto 1/\sigma^2, \qquad (4.12)$$

which is the Jeffreys prior as calculated in example 4.2. For $\beta_j$ the g-prior, as introduced by Zellner [68], is applied,

$$p(\beta_j|\sigma^2, M_j) \sim f_N(\beta_j|0,\; \sigma^2 g(X_j^\top X_j)^{-1}), \qquad (4.13)$$

where $f_N(w|m, V)$ denotes a normal density on $w$ with mean $m$ and covariance matrix $V$. The expression $\sigma^2(X^\top X)^{-1}$ is recognized as the covariance matrix of the OLS estimate, and the prior covariance matrix is then assumed to be proportional to the sample covariance with a factor $g$ which is used as a design parameter. An increase of $g$ makes the distribution more flat and therefore gives higher posterior weights to large absolute values of $\beta_j$.


As shown by Fernandez, Ley and Steel [33], the following three theoretical values of $g$ lead to consistency, in the sense of asymptotically selecting the correct model:

• $g = 1/n$: the prior information is roughly equal to the information available from one data observation
• $g = k/n$: here, more information is assigned to the prior as the number of predictors $k$ grows
• $g = k^{1/k}/n$: now, less information is assigned to the prior as the number of predictors grows

To arrive at a posterior probability of the models given data, we also need to specify the prior distribution of each model $M_j$ over $\mathcal{M}$, the space of all $K = 2^p - 1$ models:

$$\forall M_j \in \mathcal{M}: \quad p(M_j) = p_j, \;\; j = 1, \ldots, K, \qquad p_j > 0, \qquad \sum_{j=1}^{K} p_j = 1.$$

In our application we chose $p_j = 1/K$, so that we have a uniform distribution over the model space, since we at this point have no reason to favor one model over another. Now, the priors chosen have the tractable property of yielding an analytical expression for $l_y(M_j)$, the marginal likelihood.

Theorem 4.3 (Derivation of the marginal likelihood) Using the above specified priors, the marginalized likelihood function is given by

$$l_y(M_j) = \int p(y|\beta_j, \sigma^2, M_j)\,p(\sigma^2)\,p(\beta_j|\sigma^2, M_j)\,d\beta_j\,d\sigma^2 = \frac{\Gamma(n/2)}{\pi^{n/2}(g+1)^{p/2}}\Big(y^\top y - \frac{g}{1+g}\,y^\top X_j(X_j^\top X_j)^{-1}X_j^\top y\Big)^{-n/2}.$$

Proof:

$$l_y(M_j, \beta_j, \sigma^2) = p(y|\beta_j, \sigma^2, M_j)\,p(\beta_j|\sigma^2, M_j)\,p(\sigma^2)$$

$$= (2\pi\sigma^2)^{-n/2}\exp\Big[-\frac{1}{2\sigma^2}\big(v_j s_j^2 + (\beta_j - \hat\beta_j)^\top(X_j^\top X_j)(\beta_j - \hat\beta_j)\big)\Big] \times (2\pi\sigma^2)^{-p/2}|Z_0|^{1/2}\exp\Big[-\frac{1}{2\sigma^2}(\beta_j - \bar\beta_j)^\top Z_0(\beta_j - \bar\beta_j)\Big] \times \frac{1}{\sigma^2}$$

To integrate the expression we start by completing the square in the exponents. Here, we do not write out the index on the variables. Mind that $Z_0$ is used instead of writing out the g-prior:

$$(\beta - \hat\beta)^\top X^\top X(\beta - \hat\beta) + (\beta - \bar\beta)^\top Z_0(\beta - \bar\beta) = (\beta - B_1)^\top(X^\top X + Z_0)(\beta - B_1) + (\hat\beta - \bar\beta)^\top\big((X^\top X)^{-1} + Z_0^{-1}\big)^{-1}(\hat\beta - \bar\beta),$$

where $B_1 = (X^\top X + Z_0)^{-1}(X^\top X\hat\beta + Z_0\bar\beta)$ and the identity $X^\top X(X^\top X + Z_0)^{-1}Z_0 = ((X^\top X)^{-1} + Z_0^{-1})^{-1}$ has been used. Now we can write $l_y(M_j, \beta_j, \sigma^2)$ as

$$\frac{1}{\sigma^2}\,(2\pi\sigma^2)^{-(n+p)/2}\exp\Big[-\frac{1}{2\sigma^2}S_1\Big]\exp\Big[-\frac{1}{2\sigma^2}(\beta_j - B_1)^\top A_1(\beta_j - B_1)\Big]$$

where

$$S_1 = v_j s_j^2 + (\hat\beta_j - \bar\beta_j)^\top\big((X_j^\top X_j)^{-1} + Z_0^{-1}\big)^{-1}(\hat\beta_j - \bar\beta_j), \qquad A_1 = Z_0 + X_j^\top X_j.$$

The second exponent is the kernel of a multivariate normal density² and integrating with respect to $\beta$ yields

$$\frac{1}{\sigma^2}\,(2\pi\sigma^2)^{-n/2}|Z_0|^{1/2}|A_1|^{-1/2}\exp\Big[-\frac{1}{2\sigma^2}S_1\Big]$$

which in turn is the kernel of an inverted Wishart density³. We now integrate with respect to $\sigma^2$, resulting in

$$l_y(M_j) = (2\pi)^{-n/2}|Z_0|^{1/2}|A_1|^{-1/2}S_1^{-n/2}\,c_0(n' = n+2,\, p' = 1) \times k$$

where $k$ is a proportionality constant canceling in the posterior expression. To obtain the marginal likelihood we substitute $Z_0$ with the inverse of the g-prior covariance factor, $\frac{1}{g}(X_j^\top X_j)$, where $\sigma^2$ has been integrated out. With $\bar\beta_j = 0$,

$$S_1^{-n/2} = \Big(v_j s_j^2 + \hat\beta_j^\top\big((1+g)(X_j^\top X_j)^{-1}\big)^{-1}\hat\beta_j\Big)^{-n/2} = \Big((y - X_j\hat\beta_j)^\top(y - X_j\hat\beta_j) + \frac{1}{1+g}\,\hat\beta_j^\top(X_j^\top X_j)\hat\beta_j\Big)^{-n/2} = \Big(y^\top y - \frac{g}{1+g}\,y^\top X_j(X_j^\top X_j)^{-1}X_j^\top y\Big)^{-n/2}$$

$$|Z_0|^{1/2} = \Big|\tfrac{1}{g}X_j^\top X_j\Big|^{1/2} = (1/g)^{p/2}|X_j^\top X_j|^{1/2}, \qquad |A_1|^{-1/2} = \frac{1}{(1 + 1/g)^{p/2}|X_j^\top X_j|^{1/2}}, \qquad c_0(n' = n+2,\, p' = 1) = 2^{n/2}\Gamma(n/2).$$

And finally we arrive at

$$l_y(M_j) = \frac{\Gamma(n/2)}{\pi^{n/2}(g+1)^{p/2}}\Big(y^\top y - \frac{g}{1+g}\,y^\top X_j(X_j^\top X_j)^{-1}X_j^\top y\Big)^{-n/2}. \qquad \square$$

²For a definition, see Appendix A.
³For a definition, see Appendix A.
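The closed form of theorem 4.3 makes a full BMA pass cheap to implement: evaluate $\log l_y(M_j)$ for every regressor subset, then normalize under the uniform model prior as in (4.8). The sketch below does this on simulated data; the subset enumeration, the choice $g = n$ (recall that a larger $g$ gives a flatter prior) and the omission of an intercept are our own illustrative assumptions, not the thesis's settings.

```python
import itertools
import math
import numpy as np

def log_marginal(y, X, g):
    """log l_y(M_j) from theorem 4.3 with the g-prior."""
    n, p = X.shape
    proj = y @ X @ np.linalg.solve(X.T @ X, X.T @ y)
    ssr = y @ y - g / (1.0 + g) * proj
    return (math.lgamma(n / 2.0) - (n / 2.0) * math.log(math.pi)
            - (p / 2.0) * math.log(g + 1.0) - (n / 2.0) * math.log(ssr))

rng = np.random.default_rng(4)
n, k = 100, 4
F = rng.normal(size=(n, k))                # candidate predictors
y = 0.5 * F[:, 0] + rng.normal(size=n)     # only predictor 0 matters

g = float(n)   # design parameter of (4.13); larger g = flatter prior
models = [s for r in range(1, k + 1)
          for s in itertools.combinations(range(k), r)]   # 2^k - 1 models
logml = np.array([log_marginal(y, F[:, list(s)], g) for s in models])

# Posterior model probabilities (4.8) under the uniform prior p(M_j) = 1/K
w = np.exp(logml - logml.max())
w /= w.sum()
for s, wj in sorted(zip(models, w), key=lambda t: -t[1])[:3]:
    print(s, round(float(wj), 3))          # top models by posterior weight
```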
