Örebro University
School of Business and Economics
Statistics, advanced level thesis, 15hp
Superviser: Sune Karlsson
Examiner: Niklas Karlsson
Spring 2014
BAYESIAN AND FREQUENTIST HYPOTHESIS
TESTS OF HETEROSCEDASTICITY
WeiWei Feng
1981-‐09-‐01
Abstract
The homoscedasticity assumption is important for the classical linear regression. This assumption is often violated in time series data, cross section data or panel data. In order to address this issue, the generalized linear regression or feasible generalized linear regression is suggested. However, applying those methods requires either
knowledge of structure of variances or estimation of this structure. Before running into such a complex process, testing the heteroscedasticity is not only important but also necessary. In this paper, we used the Monte Carlo simulation to compare both the size and the power of Bayesian hypothesis test with the frequentist hypothesis tests for heteroscedasticity. The Bayesian hypothesis in this case is unpractical and less effective comparing with frequentist hypothesis tests. However, the Bayesian heteroscedasticity test could be possibly improved and the Bayesian heteroscedasticity model could be applied in other situations such as estimations of parameters or the structure of variances.
Keywords: Bayesian Hypothesis test, the White test, the Breusch-‐Pagan Test, the
Koenkar-‐Basset Test, the Gibbs sampling, the Metropolis-‐hasting sampler, the marginal likelihood simulation.
Table of Contents
Introduction... 1
The frequentist hypothesis tests ... 3
The White test...3
The Breusch-‐Pagan Test...4
The Koenker-‐Basset Test...6
The Bayesian Hypothesis test... 7
The heteroscedasticity Bayesian model Mheter and the homoscedasticity Mhom...9
Heteroscedasticity model Mheter ...9
Homoscedasticity model Mhom... 13
Bayes factor simulation ... 14
The tests for convergence... 19
The size and the power ... 21
In the frequentist hypothesis test framework... 21
In the Bayesian hypothesis test framework ... 21
Monte Carlo Simulation ... 23
Data generating process... 23
Parameters for the prior distributions in Bayesian models... 25
Starting points in Bayesian Markov Chain Monte Carlo simulation ... 29
Other parameters in simulation study... 29
Monte Carlo Results... 30
Conclusion and Discussion ... 33
References... 34
Appendix ... 35
Introduction
When testing the assumption of homoscedasticity in the classical linear regression, a number of the literatures can be gathered within the classical statistic. White (1980) is the most cited article, the well-‐known Breusch and Pagan (1979) and an improved test proposed by Koenker (1981), Godfrey (1996) and Koenker and Bassett (1982). On the other hand, there are few literatures about the Bayesian hypothesis test for
heteroscedasticity. Richard Startz (2012) constructed Bayesian heteroscedasticity-‐ Robust Standard Errors. Differing from his model, a hierarchical Bayesian model is constructed in order to conduct a Bayesian hypothesis test for heteroscedasticity. Both the size and the power will be compared with three common used frequentist
hypothesis tests: the White’s test (White, 1980), the Breusch-‐Pagan test (Breusch and Pagan, 1979), and the Koenkar-‐Basset test (Koenker and Bassett, 1982).
In the classical linear regression model, one of the important assumptions is
homoscedasticity and nonautocorrelation. It means each disturbance, εi, has the same finite variance, σ2, and is uncorrelated with any other disturbances. This assumption
limits the generality of the model and is often violated in the real world.
Heteroscedasticity means that disturbances have different variances, which often occurs in time-‐series data, cross-‐section data or panel data. In this paper, nonautocorrelation is assumed for simplicity. The linear regression model with heteroscedasticity can be expressed as follow:
y= Xβ + ε E[ε X] = 0 E[ε ε X] = σ′ 2Λ = Σ σ2Λ =σ2 λ1 … 0 ! " ! 0 # λn ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟ = σ1 2 … 0 ! " ! 0 # σn 2 ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟ (1)
Where y is a vector of dependent variables, X is a k× nmatrix of k independent variables
including the constant term, n is number of observations, and β is a vector of
coefficients. While ε is a vector of disturbances and assumed normally distributed in ordinary situations. Σ is a variance and covariance matrix.Λis a positive definite matrix where Λ = Ι means homoscedasticity.
Even with heteroscedasticity, under the assumption E[ε X] = 0 and the normality assumption of the disturbance, the least squares estimator ˆβ = ( ′X X)−1X y is still ′
unbiased, consistent and asymptotically normally distributed. However, the estimates are not BLUE (best linear unbiased estimator) as showed in the Gauss-‐Markov Theorem. The variance estimator s2
(X X)′ −1is incorrect and leads to unreliable confidence interval
estimation and hypothesis tests. Estimation of the asymptotic covariance matrix then could be based on
var( ˆβ X)= ( ′X X)−1X (s′ 2Λ)X( ′
X X)−1 (2)
If Λ’s structure is known, the model could be transformed to a homoscedasticity model and ordinary least square (OLS) is still BLUE. In a more common case, the structure is impossible to know. So ˆΛ is estimated according to a different approach called the method of feasible generalize least squares (FGLS). (Green, 2012)
Before rushing into such a complex process, it is best to test whether heteroscedasticity exists in the data. This paper mainly focuses on comparing both the size and the power between frequentist heteroscedasticity tests and the Bayesian hypothesis test.
The frequentist hypothesis tests
The three common used classical tests are the White test (White, 1980), the Breusch-‐ Pagan test (Breusch and Pagan, 1979), and the Koenkar-‐Basset test (Koenker and Bassett, 1982).
They are based on the idea that OLS is a consistent estimator of β even in the presence of heteroscedasticity. In this case, the OLS residuals will preserve information, although not perfectly, about the heteroscedasticity. Therefore, the tests are applied to the OLS residuals in order to detect heteroscedasticity in the data.
The above-‐mentioned three tests are all derived from the Lagrange multiplier test (LM). The LM test assumes a properly scaled Lagrange multiplier has an asymptotically
normal distribution. The LM test statistic is a quadratic form of the Lagrange multiplier. The test statistic, under the null hypothesis, is then chi-‐squared distributed.
The White test
The White test for heteroscedasticity has a basic idea: if the model is homoscedastic, then the disturbances are randomly distributed and have no relationship with any independent variables and their combinations such as squared form or a cross product form. The test statistic is derived based on the data with no heteroscedasticity, and then the least squares estimator of variance gives a consistent estimator of var[b X] as in the equation (2) (Green, 2012, p.315).
Null hypothesis H0: Homoscedasticity (σi2 =σ2 for all i in model (1))
Alternative hypothesis H1: heteroscedasticity (not all σi
2 are equal)
The testing process can be simplified as follows:
2) Compute the OLS residuals, e1,e2,...,en. e is the OLS estimator of ε in the model (1).
3) Regress ei
2 against a constant, all independent variables, their squared form, and
their cross products.
4) Compute R2 from this “ auxiliary regression” in Step 3.
5) Compare nR2 to the critical value from the Chi-‐squared distribution with p
degrees of freedom.
The test statistic is nR2 ∼ χp2−1, where p is the number of the regressor in this “ auxiliary
regression” including the constant in step 3 and n is the number of observations.
The White test is very general and the specific assumption about the nature of the heteroscedasticity is not necessary to make. However, because of its generality, it has a serious shortcoming. The test may reveal heteroscedasticity, but does not detect the reason for it. This causes some trouble when applying the generalized least squares (GLS) method, for which we need to know the structure of the variance. At the same time, the degrees of freedom grow rapidly with the number of independent variables. Hence, it may be just appropriate for the relatively large sample size.
One modification can be done in order to lower the degrees of freedom. In step 3, we regress e2 against ˆy and ˆy2
instead, where ˆy = Xb , b is the OLS estimator of the parameters in model (1). In this way, the regressors actually include all independent variables, their squared form and their cross products. In this method, the degrees of freedom decrease to 2 and the test statistic is nR2∼ χ2
(2) . This method is used instead of the original White test in our study.
The Breusch-‐Pagan Test
The Breusch-‐Pagan Test is derived from an LM test with the hypothesis that σi
2=σ2
f (Zα) , where Z is a p × n matrix of special combination of independent
variables, squared form of independent variables and the cross products of independent variables that is suspected to cause the heteroscedasticity, p is the number of regressors
including a constant and n is the number of observations, and f is a function. The model is homoscedastic if α = 0, α is a vector of parameters in the equation. The test can be carried out with a simple regression.
Null hypothesis H0: Homoscedasticity (α = 0)
Alternative hypothesis H1: Heteroscedasticity (α ≠ 0)
1) Regress the dependent variable y with all the independent variables X.
2) Compute the OLS residuals, e1,e2,...,en. e is the vector with all OLS estimator of disturbanceεi(i= 1,2,...,n)in the model (1) and estimates the variance of disturbance as e e / n′ .
3) Regress e2
/ (e e′
n ) against Z, e
2is the vector of all the squared e
i and compute the ESS (explained sum of square) of the regression. The test statistic ESS is
asymptotically distributed asχp−1
2 , p is the number of regressors in Z including a
constant. This test statistic ESS can be directly computed with the matrix
formLM = 1
2[g Z(′ Z Z )′
−1Zg]; g is the vector of observations of g
i= ei
2/ (e e / n)′ −1.
(Breusch and Pagan, 1979)
4) Compare the value of the test statistic from step 4 to the critical value from the Chi-‐squared distribution with p-‐1 degrees of freedom. Homoscedasticity is rejected if the value exceeds the critical value.
The Breusch-‐Pagan Lagrange multiplier test is claimed to be sensitive to the normality assumption. Violation of this assumption does not guarantee that the variance of εi2is
equal to 2σ4 under homoscedasticity. In this situation, the test statistic is incorrect.
(Green, 2012, p.276)
The Koenker-‐Basset Test
Koenker and Basset (1982) has the same H0 and H1and almost the same process as the Breusch-‐Pagan test.
Null hypothesis H0: Homoscedasticity (α = 0)
Alternative hypothesis H1: Heteroscedasticity (α ≠ 0)
However, it suggests a more robust estimator of the variance of εi
2:V = 1 n (ei 2− ′e e n ) 2 i=1 n
∑
. With this change the test statistic in Koenker and Basset test becomesLM = [1 V](e 2− ′e e n ′)Z( ′Z Z ) −1 ′ Z (e2− ′e e
n ) , which is asymptotically distributed as χp−1
2 .
Where p is the number of regressors in Z (has the same structure as in the Breusch-‐ Pagan test) including a constant (Green, 2012). Other steps are the same as the Breusch-‐ Pagan test.
Under normality, the Koenker-‐Basset test has the same limiting distribution as the Breusch-‐Pagan statistic, but violating the normality, it can be a more powerful test. And if Z has the same structure as the regressor in the white test, the Koenker-‐Basset test is algebraically the same as the White test (Waldman, 1983).
In the simulation part of the Breusch-‐Pagan test and the Koenker-‐Basset test, Z is some partition of combination of independent variables, square of independent variables and cross product of independent variables.
The Bayesian Hypothesis test
In the frequentist hypothesis tests framework, there are two hypotheses: the Null hypothesis H0 and the alternative hypothesis H1. A decision is made based on the collected data. The Null hypothesis is either rejected in favor of the alternative
hypothesis, or accepted. In Bayesian hypothesis testing, there could be more than two hypotheses.
The Bayesian hypothesis test will be explained in the way that only two of the hypothesis will be chosen between.
The model can be generally expressed as follows:
p(yθMi, Mi)-‐-‐ A data likelihood distribution
p(θMi Mi)-‐-‐A prior distribution under the model Mi (3)
p(Mi) -‐-‐ A prior probability that the model is “correct”
The models can be different on the data likelihood distribution, a prior distribution or both. In this paper, the heteroscedasticity model and the homoscedasticity model are different in both the data distribution and prior distributions.
Before explaining the Bayesian hypothesis test, the loss function is introduced first.
L(Mi, Mj)= 0 when model Mi is “ correct” L(Mi, Mj)> 0 when model Mi is “ wrong”, i≠ j
The posterior expected loss from choosing model i based on the data yois
E[L(Mi, Mj)]= L(Mi, Mj)p(Mj
j
∑
yo) , the model with the smallest posterior expected
loss would be chosen. yo means the collected data.
In a situation with only two hypotheses, the expected lost for model Mi:
E[L(Mi, Mj)]= L(Mi, Mj)p(Mj yo)
The expected loss for model Mj = E[L(Mj, Mi)]= L(Mj, Mi)p(Mi y o
If L(Mi, Mj)p(Mj yo)< L(M
j, Mi)p(Mi y
o) , the model M
i will be chosen over model Mj based on the data information yo.
In this inequality, the p(Mi y o
) is the posterior model probability which follows Bayes
rules P(Mi y)= p(y Mi)p(Mi) p(y Mj)p(Mj) j
∑
(4) p(y Mi)= p(y∫
θMi) p(θMi Mi)dθ (5)By putting the equality (4) into the inequality, because the denominators are the same for both models, we get:
Bij = p(yo Mi) p(yo Mj) > L(Mi, Mj)p(Mj) L(Mj, Mi)p(Mi) (6)
Bij is a Bayes factor. When it is larger than the right-‐hand side in the inequality (6), the model Mi is preferred. (Gelman et al., 2004)
In the classical hypothesis test, the null hypothesis is special which requires
overwhelming evidence in order to reject it. However, in the Bayesian hypothesis test framework, all hypotheses are treated equally. In this situation, assigning a large prior probability to one hypothesis or specifying a large loss from erroneously choosing the other hypothesis make it possible to simulate the sense of null hypothesis as in the frequentist hypothesis test. The left-‐hand side of the inequality (6) could be considered as “the test statistic” and the right-‐hand side of the inequality (6) could be considered as “the critical value” as in the frequentist hypothesis test framework. The critical value is estimated with particular Bayesian hypothesis tests to simulate the similar effect as the significance level.
To consider the Bayesian hypothesis test for heteroscedasticity, two different models are involved in the paper: Mheteris the heteroscedasticity model and Mhom is the model for homoscedasticity.
H0: Homoscedasticity H1: Heteroscedasticity
If the Bayes factorBheter/hom=
f (yo Mheter) f (yo
Mhom)
> c(c is the critical value) is true, reject the null hypothesis.
The heteroscedasticity Bayesian model Mheter and the homoscedasticity Mhom
Heteroscedasticity model Mheter
The heteroscedasticity model is a hierarchical Bayesian model:
yβ,Σ ∼ N(Xβ,Σ) β ∼ N(β,Ω) Σα,w∼ log N(Xα,w2 ) α ∼ N(α,Δ) w2 ∼ IG(ν,ρ) (7)
Where y is a vector of n dependent variables, X is a n× k matrix of independent
variables including a constant term, β is a vector of k parameters. β is the parameters vector and Ω is the variance and covariance matrix for the β prior distribution. Σ is the diagonal variance matrix and its prior distribution is a lognormal distribution with parameters α and w2 which have their own prior distributions. α and Δ are the
parameters of the α prior distribution, while v and ρ are parameters of the w2 prior
means inverse gamma distribution. The model is a hierarchical model because the parameters in the Σ have their own distributions.
The data likelihood distribution:
f (yβ,Σ, X) = (2π)− n2 Σ−12exp[−1
2(y− Xβ ′)Σ −1
(y− Xβ) (8)
This is the likelihood function for linear regression. The details of every part are expressed in the model (7).
Σ = σ1 2 … 0 ! " ! 0 # σn 2 ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟
σ1,σ2,...,σn are different which represents the heteroscedasticity in the linear regression.
The conjugate prior for β is
p(β)= (2π)−k2 Ω−12exp[−1
2(β−β ′)Ω
−1
(β−β)] (9)
Where k is the number of parameters including a constant.
The Σ is assumed to have a lognormal distribution and expected value is Xα :
lnσ2 = Xα+ u
u∼ N(0,w2
) (10)
Where σ2is vector of all the σ
i
2.
The likelihood function for Σ is
The conjugate prior distribution for α is: p(α)= (2π) −k 2 Δ−12exp[−1 2(α−α ′)Δ −1 (α −α)] (12)
The conjugate prior distribution for w2 is:
p(w2 )= ρ v Γ(ν)(w 2 )−(v+1)exp(− ρ w2) (13)
By multiplying all the parts in this hierarchical model, the posterior distribution becomes
p(β,Σ,α,w2
y)∝ f (yβ,Σ)* p(β)* f (Σα,w2
)* p(α)* p(w2
) (14)
In this situation, it is easy to get the conditional posterior distributions for all
parameters and the Gibb sampling can be applied to get marginal distributions for all parameters (Gelman et al., 2004). The details of Gibb sampling are explained in the part of the Bayes factor simulation.
1) Conditioning on Σ,α and w2, the conditional posterior for β is
p(β y, X,Σ,α,w2)∝ exp[−1 2(y− Xβ ′)Σ −1 (y− Xβ)* exp[−1 2(β−β ′)Ω −1 (β−β)] (15)
This is the kernel of the normal distributions, so we have
β y, X,Σ,α,w2 ∼ N(β,Ω)
Ω = ( ′X Σ−1X+ Ω−1)−1 β = Ω( ′X Σ−1y+ Ω−1β)
(16)
2) Conditioning on β,α and w2, the conditional posterior for Σ is:
p(Σ y, X,β,α,w2 )∝ (2π)− n2 Σ−12exp[−1 2(y− Xβ ′)Σ −1(y− Xβ)]* (2πw2 )− n2 Σ−1exp[− 1 2w2(lnσ 2− Xα ′)(lnσ2− Xα )] = (2πw)−n Σ−32 * exp{−1 2[(y− Xβ ′)Σ −1(y− Xβ) − 1 w2(lnσ 2− Xα ′)(lnσ2− Xα )]} (17) There is no closed form of this distribution, and the normalized constant part is impossible to calculate.
3) Conditioning on the Σ,w2and β , the conditional posterior for α is:
p(α y, X,β,Σ,w2)∝ exp[− 1 2w2(lnσ 2− Xα ′)(lnσ2− Xα)]* exp[−1 2(α −α ′)Δ −1 (α−α)] = exp{−1 2[w12(lnσ 2− Xα ′)(lnσ2− Xα )+ (α−α ′)Δ−1(α−α)]} (18) This is the kernel of the normal distribution:
α y, X,Σ,β,w2 ∼ N(α,Δ) Δ = ( 1 w2 X X′ + Δ −1 )−1 α = Δ( 1 w2 X ln′ σ 2+ Δ−1α ) (19) 4) Conditioning on the Σ,β and α , the conditional posterior for w2 is:
p(w2 y, X,β,α,Σ) ∝ (w2)−(v+n2+1)exp(− ρ
w2)exp[− 1 2w2(lnσ
2− Xα ′)(lnσ2− Xα )] (20)
This is the kernel for the inverse gamma distribution: w2 y, X,Σ,β,α ∼ IG(v,ρ) v= v +n 2 ρ=ρ+1 2(lnσ 2− Xα ′)(lnσ2− Xα ) (21)
Homoscedasticity model Mhom
With homoscedasticity, the Σ = γ2 … 0 ! " ! 0 # γ2 ⎛ ⎝ ⎜ ⎜ ⎜ ⎞ ⎠ ⎟ ⎟ ⎟
, the disturbances have the same
variance. The likelihood function in (8) can be simplified as:
f (yβ,γ2, X)= (2πγ2)− n2exp[− 1
2γ2(y− Xβ ′)(y − Xβ)] (22) The model becomes simpler as follows:
yβ,γ2 ∼ N(Xβ,γ2 ) β ∼ N(β,Ω) γ2 ∼ IG(τ,ϕ) (23)
The prior for β is the same as the one that was used in the Mheter and γ
2 is similar to
w2
with different parameters:
p(β)= (2π) −k 2 Ω−12exp[−1 2(β−β ′)Ω −1 (β−β)] (24) p(γ)= ϕ τ Γ(τ)(γ 2 )−(τ +1)exp(− ϕ γ2) (25)
Conditioning on each other, the posteriors for β and γ are:
p(β y, X,γ2)∝ exp[−2γ12(y− Xβ ′)(y − Xβ)]exp[−12(β − β ′)Ω
−1(β − β)] β y, X,γ2 ∼ N(β,Ω) Ω = ( ′X X γ2 + Ω −1 )−1 β = Ω( ′X y γ2 + Ω −1β ) (26) p(γ2 y, X,β) ∝ (γ2)−(τ+1)(γ2)−n/2exp(−ϕ γ2)exp[− 1 2γ2(y− Xβ ′)(y − Xβ)]
γ2 y, X,β ∼ IG(τ,ϕ) τ = τ + n 2 ϕ = ϕ +1 2(y− Xβ ′)(y − Xβ) (27)
Bayes factor simulation
In the Bayesian hypothesis test, the Bayes factor (6) plays an important role. In order to calculate the Bayes factor, the marginal likelihood is needed which is obtained by integrating over the prior distribution as in equation (11). In the high dimension parameter space (more than three parameters in the model), it is almost impossible to do this integrating most of times.
In order to estimate the marginal likelihoods, Chib (1995) suggested a method based on the Gibbs sampler, and Chib and Jeliazkov (2001) generalized this to Metropolis-‐Hasting samplers. Both these two samplers are based on Markov Chain Monte Carlo simulation and the full conditional posterior is needed in order to perform the simulation. In this paper, both methods are used to estimate the marginal likelihoods.
The key insight of those methods is based on the Bayes rules.
p(θMi y o , Mi)= f (yo Mi,θMi)p(θMi Mi) p(yo Mi) , where p(θMi y o
, Mi) is the posterior distribution for parameters θMi in the model Mi and the meaning of other parts are described in the
model expression (3). The marginal likelihood p(yo Mi) is actually the normalizing constant of the posterior density. Therefore, when the data is known, this is the same constant for any points inside the parameter space. The marginal likelihood is changed to the right side, and taking logarithms on both sides can avoid underflow in
computation. Estimating the marginal likelihood becomes:
log[ p(yo Mi)]= log[ f (y o Mi,θMi * )]+ log[ p(θMi * Mi)]− log[ p(θMi * yo , Mi)] (28) In this way, the calculation of the marginal likelihood is reduced to find the left three parts with a single point θM*i. This could be any point in the parameter space. Normally,
the high-‐density point is used to increase the efficiency. The first two parts in the left-‐ hand side of equation (28) are easy to calculate because the distributions and the data are both known. The last part of equality (28) is mostly focused on in the simulation.
1) Simulation for marginal likelihood in Heteroscedasticity model Mheter:
log( p(yα,β,Σ,w2 , Mheter))= log( f (y oα* ,β* ,Σ* ,w2* )+ log( f (Σ*α* ,w2* )× p(α* )× p(β* )× p(w2* )) − log(p(α* ,β* ,Σ* ,w2* yo )) (29) In the heteroscedasticity model, the marginal likelihood for yocan be calculated by equation (29). Where * means the fixed point. If α*,β*,Σ*and w2*are all fixed points, the first two parts f (yoα*,β*,Σ*,w2*) andf (Σ*α*,w2*)× p(α*)× p(β*)× p(w2*) are just
simple calculations. Chib (1995) suggests a method to estimate the posterior in the context of Gibb MCMC sampling. In the heteroscedasticity model, the posterior
p(α*
,β*
,Σ*
,w2*
yo
) needs to be estimated. Then, applying the law of total probability, we
have:
p(α*,β*,Σ*,w2* yo)= p(Σ* yo)× p(β* Σ*, yo)× p(α* Σ*,w2, yo)× p(w2* Σ*,α*, yo) (30)
Where p(Σ*
yo
) is the marginal posterior density for special point Σ*, which can use the
ordinary Gibb sampler based on the full conditional distributions for all parameters.
p(β* Σ*
, yo) and p(α* Σ*
, yo
) are both the marginal posterior density for special points β* and α* conditioning on Σ*.
p(w2* Σ*
,α*
, yo
) is the marginal posterior density for
w2*conditioning on Σ* and α*.
The mixed stage Gibb sampler and Metropolis-‐Hastings (Chib and Jeliazkov, 2001) are conducted to estimate the marginal likelihood for the heteroscedasticity model.
(A) Estimate p(Σ*
yo
)
1. Specify the initial values for Σ0,α0,β0 and w0 2
2. Conduct the Metropolis-‐Hastings method for sampling from (17) because there is no closed form for the posterior distribution of Σ .
a) Draw the proposal pointsΣp from an independent multivariable normal distribution MVN(Σ0,ω ). ω affects the acceptance rate that is the proportion of
proposal Σp being accepted during next steps b), c) and d).
b) Calculate p=exp{− 1 2[(y− Xβ0 ′)Σp −1 (y− Xβ0)− 1 w02(lnσp 2− Xα 0 ′)(lnσp 2− Xα 0)]} exp{−1 2[(y− Xβ0 ′)Σ0 −1 (y− Xβ0)− 1 w02(lnσ0 2− Xα 0 ′)(lnσ0 2− Xα 0)]}
c) Draw a value q from the uniform distribution U(0,1) , where U means the uniform distribution.
d) If p> q , set the Σ1 value toΣp. Otherwise, keep the old value Σ0.
3. Sample β1 from the posterior multinomial distribution (16) with the value Σ1
from step 2.
4. Sample α1 from the posterior multinomial distribution (19) with the value Σ1
from step 2 and initial value w0 2.
5. Sample w1
2 from the posterior inverse gamma distribution (21) with the value Σ 1
from step 2 and the value α1 from step 4.
6. Go back to step 2 with the values Σ1,α1,β1 and w1 2.
7. Repeat step 2 to step 6 N times including burn-‐in time B till all the marginal distributions converge.
Burn-‐in time is the sufficient time to throw away before the Markov Chain is converging to the marginal distributions.
8. Find the high-‐density value Σ* in the converged marginal distribution for Σ and
(B) Estimate p(β* Σ*, yo)
1. Sample β1 from the posterior multi-‐normal distribution (16) with the special value Σ* from the part (A).
2. Repeat step 1 N times including burn-‐in time B till the marginal distribution converges.
3. Find the high-‐density value β* and its estimated density p(β* Σ*
, yo
) with the
converged marginal distribution.
(C) Estimate p(α* Σ*
,w2
, yo
)
1. Sample α1 from the posterior multi-‐normal distribution (19) with the special value Σ* from part (A) and the initial value w
0 2.
2. Sample w1
2 from the posterior inverse gamma distribution (21) with the value Σ*
from part (A) and the value α1 from step 1. 3. Go back to step 1 with the value w1
2
.
4. Repeat step 1 to step 3 N times including burn-‐in time B till all the marginal distributions converge.
5. Find the high-‐density value α* and its estimated density
p(α* Σ*,w2, yo).
(D) Estimate p(w2* Σ*
,α*
, yo
)
1. Sample w12 from the posterior inverse gamma distribution (21) with the value
Σ* from part (A) and the value α* from part (C).
2. Repeat step 1 N times including burn-‐in time B till all the marginal distribution converges.
3. Find the high-‐density value w2* and its estimated density
p(w2* Σ*
,α*
, yo
).
(E) Estimate p(yα,β,Σ,w2
, Mheter) The p(yα,β,Σ,w2
, Mheter)is estimated by putting all the special parameters Σ*,α*,β*,w2* and the densities p(Σ*
yo ), p(β* Σ* , yo ), p(α* Σ* ,w2 , yo ), p(w2* Σ* ,α* , yo ) into
the equation (29) and (30).
2) Homoscedasticity model
log( p(yβ,γ2, Mheter))= log( f (y o β* ,γ2*)+ log(p(β*)× p(γ2*))− log(p(β*,γ2* yo)) p(β* ,γ2* yo )= p(β* yo )× p(γ2* β* , yo ) (31)
The homoscedasticity model is simpler with only two parameters while they all have a closed form. Gibbs sampling can be used here to estimate the marginal likelihood (Gelman et al., 2004).
(A) Estimate p(β* yo)
1. Set initial value γ0.
2. Sample β1 from the posterior multi-‐normal distribution (26) with the initial value
γ0.
3. Sample γ1
2 from the posterior inverse gamma distribution (27) with the value β 1
from step 2.
4. Go back to step 2 with the value γ12 from step 3.
5. Repeat step 2 to step 4 N times including burn-‐in time B till all the marginal distributions converge.
6. Find the high-‐density value β* and its estimated density p(β*
yo
).
(B) Estimate p(γ2* β*, yo)
1. Sample γ1
2 from the posterior inverse gamma distribution (27) with the value β*
from part (A).
2. Repeat step 1 N times including burn-‐in time B till all the marginal distributions converge.
3. Find the high-‐density value γ 2* and its estimated density
p(γ2* β*, yo) .
(C) Calculate p(yβ,γ2
, Mhom) according to equation (31) in the same way as for the
heteroscedasticity part.
3) Bayes factor is Bheter/hom =
p(yα,β,Σ,w2, Mheter)
p(yβ,γ2
, Mhom) , in which p(yβ,γ
2
, Mhom) and
p(yα,β,Σ,w2
, Mheter) are separately estimated in part 1) and part 2).
The tests for convergence
The successful Markov Chain Monte Carlo simulation is mainly based on if the
simulation can actually converge to the target distributions. Before estimating the Bayes Factor, it is necessary to perform the tests for convergence and further decide the
proper burn-‐in draws.
One way to check if the Markov Chain has converged is to see how well the chain is mixing or moving around in the parameter space. If the chain is taking a long time to move around the parameter space, then it will take longer to converge.
Some visual inspections can be conducted to see how well the chain mixing and it needs to be performed for every parameters. Trace plots, running mean plots and
autocorrelation plots can be used as visual inspections. There are also a lot of diagnostics for testing convergence. The Heidelberg and Welch diagnostic
(Heidelberger and Welch, 1983) together with visual diagnostic running mean and autocorrelation plot will be used in this paper.
The Heidelberg and Welch diagnostic has two part.
The first part is a stationary test. It calculates the test statistic to accept or reject the null hypothesis: the Markov chain is from a stationary distribution. The test is applied first to the whole chain, if the null hypothesis is rejected, discarding the 10%, 20%, etc., of the chain until either the null hypothesis is accepted, or 50% of the chain is discarded which indicates a longer running is needed. If the null hypothesis is accepted, the number of iterations to keep and burn-‐in times are reported.
The second part is the half-‐width test, which uses the portion of the chain passed the stationary test to calculate a 95% confidence interval for the mean. The half-‐width of the interval is compared with the estimated mean. If the ratio between the half-‐width interval and the estimated mean is lower than a value ξ , the half-‐width test is passed. Otherwise, the length of the passed chain is not enough to estimate the sufficient accuracy mean.
The running mean plot is a plot of the iteration against the mean of all draws up to that iteration. The law of Large Numbers concerns the stability of the mean as the sample size increases. If the Chain converges, the running mean plot should show than the mean is moving tightly around a fixed value as the sample size increases.
The lag k autocorrelation ρkis the correlation between every draw and its kth lag. ρk = (xi− x i=1 n−k
∑
)(xi+k− x) (xi− x) 2 i=1 n∑
The kth lag autocorrelation is expected to be smaller as k increases. If autocorrelation is still relatively high for higher k, it indicates slow mixing.
For Metropolis-‐Hasting Markov Chain Monte Carlo, the acceptance rate is also important diagnostic for convergence. The acceptance rate measures the proportion of new
parameters from a proposal distribution being accepted. It suggests the proper acceptance rate between 20% and 40% (Gelman et al., 2004).
The size and the power
In the frequentist hypothesis test framework
Size is the probability of rejecting H0 when it is true. In order to evaluate the size of a frequentist hypothesis test, generate N times data under the null hypothesis and calculate proportion of rejections of H0. Proportion should be close to the significant level α .
Power is the probability of rejecting H0 when it is false. To evaluate power, generate
data under the alternative hypothesis H1 N times and calculate proportion of rejections of H0.
When controlling the size around the significance level, the bigger power implies better performance of a test.
In the Bayesian hypothesis test framework
Recall from the part Bayesian hypothesis test:
H0: Homoscedasticity H1: Heteroscedasticity
If Bayes factor Bheter/hom=
f (y Mheter)
f (y Mheter)> c (c is the critical value), the null hypothesis is rejected. For simulating the significance level in the frequentist hypothesis test, the N times Bayes Factor is calculated. The critical value c corresponds to the “significance level α ” when classical test is chosen. The N should be large enough to represent a general situation. Generate Bayes Factor N times with data under null hypothesis H0
and find the “ critical value c”. The proportion, when the Bayes Factors are bigger than c, should be equal to the significance level α . The power is the proportion when the
Bayes Factors are larger than c with the data generated under the alternative hypothesis
H1N times.
In the Mheter and Mhom the Bayes Factor is actually affected by the sample size because the number of elements in the diagonal of Σ in the heteroscedasticity model Mheter is growing as the sample size is growing. The bigger the size gets, the smaller the Bayes Factor will become. So the different sample sizes have different “ critical value c” in our Bayesian hypothesis test.
Monte Carlo Simulation
Data generating process
In order to simplify the problem without losing generality, four independent variables from normal distributions are used in the simulation. The heteroscedasticity appeared normally in panel data in which R2 is normally quite low. Therefore, R2 is chosen
around 30% to control the coefficients β and α in (32), where β = (β0,β1,β2,β3,β4) and
α = (α0,α1,α2,α3,α4) .
Here is the model used to generate data:
yi =β0+β1xi1+β2xi 2+β3xi 3+β4xi 4 +εi εi ∼ N(0,σi) σi 2= exp(α 0+α1xi1+α2xi 2+α3xi 3+α4xi 4) (32)
To make things simple, the parameter β = (1,1,1,1,1) is chosen here and the x1∼ N(1,1) , x2 ∼ N(1,1.2) , x3∼ N(1,1.5) and x4 ∼ N(1,1.8) are generated, where N stands for normal
distribution. In order to control the R2around 30%, the variance of the disturbances
should be around 18. In other words, the expected value of the variance in model (32) exp(α0+α1+α2+α3+α4) is around 18. So α = (2,0.3,0.2,0.25,0.25) is chosen in the
simulation study for the data with heteroscedasticity. With this parameter α , even when no heteroscedasticity is in the data, the R2is still quite low around 50%.
Three different comparisons are performed in our study: different sample size, normality assumption and the structure of variances.
1) Small sample size behavior is normally interesting because the asymptotically distribution of the test statistic is violated when the sample size is small. In this special case, the degrees of freedom can grow dramatically if the test statistic is estimated by regressing variances against all the independent variables, their squared forms and their cross products.