Working Paper 2016:1
Department of Statistics

On the equivalence of confidence interval estimation based on frequentist model averaging and least-squares of the full model in linear regression

Working Paper 2016:1

April 2016

Department of Statistics

Uppsala University

Box 513

SE-751 20 UPPSALA

SWEDEN

Working papers can be downloaded from www.statistics.uu.se

Title: On the equivalence of confidence interval estimation based on frequentist model averaging and least-squares for the full model in linear regression
Author: Sebastian Ankargren & Shaobo Jin

On the equivalence of confidence interval estimation based on frequentist model averaging and least squares of the full model in linear regression

Sebastian Ankargren and Shaobo Jin

Department of Statistics, Uppsala University

April 5, 2016

Abstract

In many applications of linear regression models, model selection is vital. However, randomness due to model selection is commonly ignored in post-model selection inference. In order to account for the model selection uncertainty in these linear models, least squares frequentist model averaging has been proposed recently. In this paper, we show that the confidence interval from model averaging is asymptotically equivalent to the confidence interval from the full model. Furthermore, we demonstrate that this equivalence also holds in finite samples if the parameter of interest is a linear function of the regression coefficients.

Keywords: Asymptotic equivalence · Linear model · Local asymptotics · Model selection uncertainty · Post-selection inference

Mathematics Subject Classification (2000): 62E20 · 62J05 · 62J99 · 62F12

1 Introduction

In many situations in applied statistics, there is no model structure known to the researcher, but a set of plausible alternatives. Commonly, to pick one model from this set of candidates, information criteria such as AIC (Akaike, 1974) and BIC (Schwarz, 1978) are used. This can have detrimental effects on subsequent inference since the model selection step itself is stochastic and thus subject to uncertainty, an uncertainty which inference post-model selection usually fails to account for. This issue is studied

sebastian.ankargren@statistics.uu.se, shaobo.jin@statistics.uu.se


in detail by e.g. Pötscher (1991); Kabaila (1995); Kabaila and Leeb (2006); Leeb and Pötscher (2005); also see Berk et al (2013) for a promising way of doing inference post-model selection.

It is often by this inability of model selection to incorporate the randomness attached to it in post-selection inference that frequentist model averaging is motivated, advocating that it may serve as a better, more truthful alternative. For example, it is claimed that model averaging “compromises across a set of competing models, and in doing so, incorporates model uncertainty into the conclusions about the unknown parameters” (Wan et al, 2010, p. 277). Hjort and Claeskens (2003) developed a frequentist model averaging machinery under the likelihood framework. They proposed a confidence interval that captures the randomness in model selection. However, Wang and Zhou (2013) showed that the proposed interval is asymptotically equivalent to the confidence interval obtained from the full model, and that the finite sample confidence intervals are also equivalent if the parameter of interest is a linear combination of regression coefficients of a varying-coefficient partially linear model, which is regarded as the semi-parametric framework by the authors.

The focus of this paper is the linear model using least squares. Least squares-based model averaging has previously been studied by, for example, Hansen (2007), Wan et al (2010), Hansen and Racine (2012), Liu and Okui (2013), Ando and Li (2014) and Cheng et al (2015), with a focus on prediction. Liu (2015) developed asymptotic distribution theory for frequentist model averaging estimators under the least squares framework, including results for Mallows and jackknife model averaging (Hansen, 2007; Hansen and Racine, 2012), and proposed a confidence interval in the spirit of Hjort and Claeskens (2003). In this article, we show that the inference we can make about the unknown parameters using the distributional theory of Liu (2015) is in fact equivalent to what we can make using just the model including all possible covariates, either asymptotically or both in finite samples and asymptotically, depending on the nature of the parameter of interest. From this perspective, there is no additional gain in turning to model averaging. Thus, this paper extends the work by Kabaila and Leeb (2006) and Wang and Zhou (2013) in that their criticism geared towards interval estimation based on Hjort and Claeskens (2003) is also valid for interval estimation in linear models based on Liu (2015).

The remainder of the paper is organized as follows. Section 2 briefly reviews the key results in Liu (2015), Section 3 discusses the equivalence of the confidence intervals, and Section 4 concludes. Proofs are placed in Appendix A.

2 Frequentist model averaging

In this section we briefly review the necessary aspects of Liu (2015). Assume the linear regression model

y = Xβ + Zγ + e (1)

in which $y = (y_1, \dots, y_n)'$ is the $n \times 1$ response vector, $X = (x_1, \dots, x_n)'$ the $n \times p$ matrix of core regressors, $Z = (z_1, \dots, z_n)'$ the $n \times q$ matrix of auxiliary regressors, and $e = (e_1, \dots, e_n)'$ the $n \times 1$ vector of errors, so that $x_i$ and $z_i$ are $p \times 1$ and $q \times 1$ vectors. By the formulation of the model, the core regressors in $X$ are necessarily included in all models, but those in $Z$ are potential and subject to scrutiny. This does not restrict applications, however, as $X$ may be an empty matrix or simply a vector of ones such that the intercept is always included. The error term is assumed to have a zero conditional mean, $E(e_i | x_i, z_i) = 0$, but is allowed to be both heteroskedastic and autocorrelated.

Collect the regressors and the parameters in $H = (X, Z)$ and $\theta = (\beta', \gamma')'$, respectively. The regression model (1) may then be written as

$y = H\theta + e.$

Suppose now that there is a set of $M$ candidate models. If this set consists of all nested models, then $M = q + 1$, and if it consists of all submodels, then $M = 2^q$. Whatever the choice of model set, each model $m$ is assumed to contain a unique combination of $0 \le q_m \le q$ regressors from $Z$, such that the $m$th model's subset of auxiliary regressors can be written as $Z_m = Z\Pi_m'$, where $Z_m$ is $n \times q_m$ and $\Pi_m'$ is a $q \times q_m$ selection matrix. Since all $M$ models also include the full set of core regressors $X$, model $m$ includes in total $p + q_m$ regressors.

For ease of notation and with no loss of generality, let the full model be ordered last in the set of models, such that model $M$ is the full model including all regressors. The least squares estimator of $\theta$ in this model is

$\hat\theta_M = (H'H)^{-1}H'y,$

whereas for submodel $m$, the estimator of $\theta_m = (\beta', \gamma_m')' = (\beta', \gamma'\Pi_m')'$, a $(p + q_m) \times 1$ vector, is

$\hat\theta_m = (H_m'H_m)^{-1}H_m'y, \quad (2)$

where $H_m = (X, Z_m) = (X, Z\Pi_m')$. For later use we also define $\theta^{(m)} = (\beta', \gamma_m'\Pi_m)'$, a $(p + q) \times 1$ vector in which there are zeros placed in the positions corresponding to excluded variables in model $m$. Similarly, $\hat\theta^{(m)} = (\hat\beta_m', \hat\gamma^{(m)\prime})' = (\hat\beta_m', \hat\gamma_m'\Pi_m)'$, where $\hat\theta^{(m)}$ is $(p + q) \times 1$ and $\hat\gamma_m$ is the last $q_m$ elements of $\hat\theta_m$ in (2).
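As a concrete illustration of this notation, the Python sketch below (numpy only; all data and parameter values are simulated and hypothetical) builds $S_m$ from a selection matrix $\Pi_m$, fits submodel $m$ by least squares on $H_m = HS_m$, and zero-fills the estimate to obtain $\hat\theta^{(m)} = S_m\hat\theta_m$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 1, 2                      # hypothetical sizes: one core, two auxiliary regressors
X = np.ones((n, p))                      # core regressor: intercept only
Z = rng.normal(size=(n, q))
H = np.hstack([X, Z])
y = H @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

def S_matrix(Pi):
    """S_m = [[I_p, 0], [0, Pi']] for a q_m x q selection matrix Pi."""
    qm = Pi.shape[0]
    return np.block([[np.eye(p), np.zeros((p, qm))],
                     [np.zeros((q, p)), Pi.T]])

def theta_filled(Pi):
    """Zero-filled submodel estimate theta_hat^(m) = S_m theta_hat_m."""
    Sm = S_matrix(Pi)
    theta_m, *_ = np.linalg.lstsq(H @ Sm, y, rcond=None)  # regressors H_m = (X, Z Pi')
    return Sm @ theta_m

Pi_1 = np.array([[1.0, 0.0]])            # submodel keeping only the first auxiliary regressor
theta_1 = theta_filled(Pi_1)             # entry of the excluded regressor is exactly zero
theta_M = theta_filled(np.eye(q))        # Pi_M = I_q gives the full model
print(theta_1, theta_M)
```

Note that the zero in the excluded position of $\hat\theta^{(m)}$ is exact, not an estimate of zero: it comes from the structure of $S_m$, not from the data.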

For the asymptotic analysis, local asymptotics are employed. This framework entails the following assumption.

Assumption 1. $\gamma = \gamma_n = \delta/\sqrt{n}$, where $\delta$ is an unknown local parameter vector.

Since the omitted variable bias in all models but the full model is of order $O(1/\sqrt{n})$ under Assumption 1, $\sqrt{n}(\hat\theta_m - \theta_m)$ does not diverge. Instead, we have the following:

$\sqrt{n}(\hat\theta_M - \theta) \xrightarrow{d} N(0, Q^{-1}\Omega Q^{-1}),$
$\sqrt{n}(\hat\theta_m - \theta_m) \xrightarrow{d} N(A_m\delta, Q_m^{-1}\Omega_m Q_m^{-1}), \quad (3)$

where $A_m = Q_m^{-1}S_m'QS_0(I_q - \Pi_m'\Pi_m)$ and

$S_0 = \begin{pmatrix} 0_{p \times q} \\ I_q \end{pmatrix}, \quad S_m = \begin{pmatrix} I_p & 0_{p \times q_m} \\ 0_{q \times p} & \Pi_m' \end{pmatrix},$

$Q = \begin{pmatrix} Q_{xx} & Q_{xz} \\ Q_{zx} & Q_{zz} \end{pmatrix} = \begin{pmatrix} E(x_i x_i') & E(x_i z_i') \\ E(z_i x_i') & E(z_i z_i') \end{pmatrix},$

$\Omega = \begin{pmatrix} \Omega_{xx} & \Omega_{xz} \\ \Omega_{zx} & \Omega_{zz} \end{pmatrix} = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \begin{pmatrix} E(x_i x_j' e_i e_j) & E(x_i z_j' e_i e_j) \\ E(z_i x_j' e_i e_j) & E(z_i z_j' e_i e_j) \end{pmatrix},$

$Q_m = S_m' Q S_m, \quad \Omega_m = S_m' \Omega S_m.$

The limiting distributions in (3) are results of the following assumption.

Assumption 2. As $n \to \infty$, $n^{-1}H'H \xrightarrow{p} Q$ and $n^{-1/2}H'e \xrightarrow{d} R \sim N(0, \Omega)$.

Suppose the ultimate goal is to make inference regarding the true value of a known, smooth function $\mu: \mathbb{R}^{p+q} \mapsto \mathbb{R}$ evaluated at the (unknown) parameters; that is, we are interested in the true value of $\mu(\theta)$. This is called the focus parameter. The model averaging estimator is a weighted average of the estimators of the focus parameter $\mu(\theta)$ in the individual submodels. The averaging estimator is

$\bar\mu = \sum_{m=1}^{M} w(m|\hat\delta)\hat\mu_m,$

where $w(m|\hat\delta)$ denotes the data-dependent weight for model $m$ and $\hat\mu_m$ should be understood as

$\hat\mu_m := \mu(\hat\theta^{(m)}) = \mu(\hat\beta_m, \hat\gamma^{(m)}),$

i.e. the function $\mu$ at the estimated values of $\beta$ and $\gamma_m$, with the $q - q_m$ elements in $\gamma$ not included in $\gamma_m$ set to 0.
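To make the averaging estimator concrete, here is a minimal self-contained sketch with a nested candidate set ($M = q + 1$) and fixed illustrative weights. The weights, data, and choice of focus parameter are all hypothetical; the actual procedure uses data-dependent weights $w(m|\hat\delta)$, which we do not implement here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 150, 1, 2
X = np.ones((n, p))                      # core regressor: intercept
Z = rng.normal(size=(n, q))
H = np.hstack([X, Z])
y = H @ np.array([2.0, 0.4, 0.1]) + rng.normal(size=n)

# Nested candidate set: model m keeps the first q_m auxiliary regressors, M = q + 1.
def theta_filled(qm):
    est, *_ = np.linalg.lstsq(H[:, :p + qm], y, rcond=None)
    th = np.zeros(p + q)
    th[:p + qm] = est                    # zeros for the excluded auxiliary regressors
    return th

def mu(theta):                           # an illustrative linear focus parameter c'theta
    return theta[0] + theta[1]

mu_hats = np.array([mu(theta_filled(qm)) for qm in range(q + 1)])

# Fixed illustrative weights summing to one (placeholder for w(m | delta_hat)).
w = np.array([0.2, 0.3, 0.5])
mu_bar = w @ mu_hats                     # the averaging estimator mu_bar
print(mu_bar)
```

Since the weights lie on the simplex, $\bar\mu$ is a convex combination of the submodel estimates; putting all weight on model $M$ recovers the full-model estimate.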

The local parameter, $\delta$, can be unbiasedly, but not consistently, estimated asymptotically in the sense that

$\hat\delta = \sqrt{n}\hat\gamma_M \xrightarrow{d} R_\delta = \delta + S_0'Q^{-1}R \sim N(\delta, S_0'Q^{-1}\Omega Q^{-1}S_0).$

Let $D_\theta$ be the derivative of $\mu$ with respect to the full parameter vector $\theta$, evaluated at the null point $(\beta', 0')'$. The key theorem in Liu (2015) underlying the results for a frequentist model averaging confidence interval for $\mu$ is stated next.

Theorem 1 (Liu, 2015). If $w(m|\hat\delta) \xrightarrow{d} w(m|R_\delta)$ as $n \to \infty$ and Assumptions 1–2 hold, then

$\sqrt{n}(\bar\mu - \mu) \xrightarrow{d} D_\theta' Q^{-1} R + D_\theta' \left( \sum_{m=1}^{M} w(m|R_\delta) C_m \right) R_\delta,$

where $C_m = (P_m Q - I_{p+q}) S_0$ and $P_m = S_m (S_m' Q S_m)^{-1} S_m'$.


Proof. See Liu (2015).

The proposed confidence interval rests on the idea of replacing limit quantities by finite sample counterparts. Thus, for sufficiently large $n$, it should be that

$\frac{\sqrt{n}(\bar\mu - \mu) - \hat D_\theta' \sum_{m=1}^{M} w(m|\hat\delta) \hat C_m \hat\delta}{\hat\kappa} \overset{\text{app.}}{\sim} N(0, 1),$

where $\hat\kappa$ is a consistent estimator of

$\kappa = \sqrt{D_\theta' Q^{-1} \Omega Q^{-1} D_\theta}.$

Therefore, a confidence interval with asymptotic coverage of $100(1 - \alpha)\%$ is

$\left( \bar\mu - b(\hat\delta) - z_{1-\alpha/2} \frac{\hat\kappa}{\sqrt{n}},\; \bar\mu - b(\hat\delta) + z_{1-\alpha/2} \frac{\hat\kappa}{\sqrt{n}} \right), \quad (4)$

where $b(\hat\delta) = \hat D_\theta' \sum_{m=1}^{M} w(m|\hat\delta) \hat C_m \hat\gamma_M$ and $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution.
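As an illustration of how such an interval is computed, the sketch below forms interval (4) in the degenerate case where all weight is placed on the full model, so that $\hat C_M = 0$ and $b(\hat\delta) = 0$. It estimates $\kappa$ with the heteroskedasticity-robust sandwich $\hat Q^{-1}\hat\Omega\hat Q^{-1}$, assuming independent errors for simplicity (the paper also allows autocorrelation, which would call for a HAC-type estimator instead). Data and focus parameter are hypothetical.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n, p, q = 300, 1, 1
X = np.ones((n, p))                      # core regressor: intercept
Z = rng.normal(size=(n, q))
H = np.hstack([X, Z])
y = H @ np.array([1.0, 0.2]) + (1 + 0.5 * np.abs(Z[:, 0])) * rng.normal(size=n)

theta_M, *_ = np.linalg.lstsq(H, y, rcond=None)
resid = y - H @ theta_M

Q_hat = H.T @ H / n
Q_inv = np.linalg.inv(Q_hat)
# Heteroskedasticity-robust estimate of Omega: sum of e_i^2 h_i h_i' / n.
Omega_hat = (H * (resid**2)[:, None]).T @ H / n

c = np.array([0.0, 1.0])                 # focus: the auxiliary coefficient, mu = c'theta
kappa = np.sqrt(c @ Q_inv @ Omega_hat @ Q_inv @ c)

alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)
center = c @ theta_M
lower = center - z * kappa / np.sqrt(n)
upper = center + z * kappa / np.sqrt(n)
print((lower, upper))
```

The interval is symmetric about the point estimate with half-width $z_{1-\alpha/2}\hat\kappa/\sqrt{n}$, which is the common length shared by (4) and the full-model interval discussed in the next section.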

3 Equivalence of confidence intervals

In this section we establish the equivalence of frequentist model averaging and full model confidence intervals.

3.1 Asymptotic equivalence

Consider the case where there is a fixed model and this model is the full model. By (3) and the delta method, the asymptotic distribution of the estimator of the focus parameter is

$\sqrt{n}\left( \mu(\hat\theta_M) - \mu(\theta) \right) \xrightarrow{d} N(0, D_\theta' Q^{-1} \Omega Q^{-1} D_\theta).$

The confidence interval for $\mu$ in the full model, based upon a finite-sample approximation to this asymptotic distribution, is

$\left( \mu(\hat\theta_M) - z_{1-\alpha/2} \frac{\hat\kappa}{\sqrt{n}},\; \mu(\hat\theta_M) + z_{1-\alpha/2} \frac{\hat\kappa}{\sqrt{n}} \right), \quad (5)$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. Note that the interval in (5) has the same length as the interval in (4). It therefore follows that if the centers of the two intervals are asymptotically the same, then the intervals are asymptotically equivalent. That is, equivalence holds if $\bar\mu = \hat\mu_M + b(\hat\delta) + o_p(n^{-1/2})$. The existence of this particular relation is our main result, and it is established in the following theorem. The proof is placed in Appendix A.

Theorem 2. The frequentist model averaging and full model confidence intervals in (4) and (5), respectively, are asymptotically equivalent.


Theorem 2 suggests that for large sample sizes, there are no gains in terms of interval estimation in using model averaging. A confidence interval based on least squares estimation of the full model instead offers the more compelling alternative, given its simplicity.

3.2 Small sample equivalence

Theorem 2 shows that the confidence interval (5) is asymptotically equivalent to (4) for a general smooth and real-valued function $\mu(\theta)$. However, nothing is said about the finite sample properties, in which there may be substantial differences. In this section, we also establish small sample equivalence when $\mu(\theta) = c'\theta$, where $c$ is a vector of known scalars. That is, the focus parameter is some fixed linear function of the regression coefficients.

If the focus parameter is $\mu(\theta) = c'\theta$, the center of the full model confidence interval (5) is $c'\hat\theta_M$, and the center of the model averaging confidence interval (4) is

$c' \left[ \sum_{m=1}^{M} w(m|\hat\delta)\hat\theta^{(m)} - \sum_{m=1}^{M} w(m|\hat\delta)\hat C_m \hat\gamma_M \right] = c' \left[ \hat\theta_M + \sum_{m=1}^{M-1} w(m|\hat\delta)\left( \hat\theta^{(m)} - \hat\theta_M - \hat C_m \hat\gamma_M \right) \right], \quad (6)$

since $\hat D_\theta' = c'$. The equality holds because $\hat C_M = 0$ and $\hat\theta^{(M)} = \hat\theta_M$. Based on the center (6), we can establish the following theorem.

Theorem 3. If $\mu(\theta) = c'\theta$, where $c$ is a $(p + q) \times 1$ constant vector, the confidence interval (5) is equivalent to the confidence interval (4) in finite samples.
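The finite-sample relation behind this result, $\hat\theta^{(m)} = \hat\theta_M + \hat C_m\hat\gamma_M$ for every submodel $m$ (established term by term in Appendix A), can be checked numerically. The sketch below uses simulated data and arbitrary weights on the simplex; with sample moments $\hat Q = H'H/n$, the model averaging center coincides with $c'\hat\theta_M$ to machine precision.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, p, q = 100, 1, 2
X = rng.normal(size=(n, p))
Z = rng.normal(size=(n, q)) + 0.5 * X    # auxiliary regressors correlated with the core
H = np.hstack([X, Z])
y = H @ np.array([1.0, 0.3, -0.2]) + rng.normal(size=n)

Q_hat = H.T @ H / n                      # sample analogue of Q
theta_M, *_ = np.linalg.lstsq(H, y, rcond=None)
gamma_M = theta_M[p:]
S0 = np.vstack([np.zeros((p, q)), np.eye(q)])

fills, Cms = [], []
for r in range(q + 1):                   # all 2^q submodels
    for subset in combinations(range(q), r):
        Pi = np.eye(q)[list(subset), :]  # q_m x q selection matrix
        Sm = np.block([[np.eye(p), np.zeros((p, r))],
                       [np.zeros((q, p)), Pi.T]])
        theta_m, *_ = np.linalg.lstsq(H @ Sm, y, rcond=None)
        fills.append(Sm @ theta_m)       # theta_hat^(m): zero-filled estimate
        Pm = Sm @ np.linalg.inv(Sm.T @ Q_hat @ Sm) @ Sm.T
        Cms.append((Pm @ Q_hat - np.eye(p + q)) @ S0)

# theta_hat^(m) = theta_hat_M + C_m_hat gamma_hat_M holds exactly for every m.
max_dev = max(np.abs(th - (theta_M + Cm @ gamma_M)).max()
              for th, Cm in zip(fills, Cms))

c = np.array([1.0, -2.0, 0.5])           # a linear focus mu(theta) = c'theta (illustrative)
w = rng.dirichlet(np.ones(len(fills)))   # arbitrary weights: equivalence holds for any
b_hat = c @ sum(wi * Cm for wi, Cm in zip(w, Cms)) @ gamma_M
center_ma = w @ np.array([c @ th for th in fills]) - b_hat
print(max_dev, center_ma - c @ theta_M)
```

The Dirichlet draw emphasizes that the agreement of the two centers does not depend on how the weights are chosen, only on the linearity of the focus parameter.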

Theorem 3 says that the full model produces the same confidence interval as model averaging under some restricted cases. However, Theorem 3 says nothing about the small sample equivalence in other cases, e.g., $\beta_1/\beta_2$ or $\exp\beta_1$. We conjecture that the small sample equivalence is likely to fail for focus parameters not of the form $\mu(\theta) = c'\theta$. This is illustrated in the following example.

Example 1. Consider a model with two regressors,

$y = X\beta + Z\gamma + e,$

where $\gamma = \delta/\sqrt{n}$. From the above discussion, the confidence interval of $\beta$ from the full model is the same as that from model averaging, both asymptotically and in finite samples. In particular, from Theorem 2 we know that, for any differentiable function $h$, the confidence interval of $h(\beta)$ from the full model is asymptotically the same as that from model averaging. However, the above discussion says nothing about the finite sample equivalence. In this example, consider $h(\beta) = \beta^2$. The center of the model averaging confidence interval is

$w(1|\hat\delta)\hat\beta_1^2 + \left( 1 - w(1|\hat\delta) \right)\hat\beta_M^2 - 2w(1|\hat\delta)\hat\beta_M \hat Q_{xx}^{-1}\hat Q_{xz}\hat\gamma_M.$

Observe that $\hat\beta_1 = \hat Q_{xx}^{-1}X'y/n$,

$\hat\beta_M = \frac{\hat Q_{zz}X'y/n - \hat Q_{xz}Z'y/n}{\hat Q_{xx}\hat Q_{zz} - \hat Q_{xz}^2}, \quad \text{and} \quad \hat\gamma_M = \frac{-\hat Q_{xz}X'y/n + \hat Q_{xx}Z'y/n}{\hat Q_{xx}\hat Q_{zz} - \hat Q_{xz}^2}.$

Then, the difference between the two centers is $w(1|\hat\delta)\left[ \hat\beta_1^2 - \hat\beta_M^2 - 2\hat\beta_M \hat Q_{xx}^{-1}\hat Q_{xz}\hat\gamma_M \right]$, which is zero if and only if $w(1|\hat\delta) = 0$ or $\hat\beta_1^2 - \hat\beta_M^2 - 2\hat\beta_M \hat Q_{xx}^{-1}\hat Q_{xz}\hat\gamma_M = 0$. The former means that the full model receives the total weight. Plugging in the expressions for $\hat\beta_1$, $\hat\beta_M$, and $\hat\gamma_M$, the latter satisfies

$\hat\beta_1^2 - \hat\beta_M^2 - 2\hat\beta_M \hat Q_{xx}^{-1}\hat Q_{xz}\hat\gamma_M = \frac{1}{n^2} \times \frac{\hat Q_{xz}^2\left( \hat Q_{xz}X'y - \hat Q_{xx}Z'y \right)^2}{\hat Q_{xx}^2\left( \hat Q_{xx}\hat Q_{zz} - \hat Q_{xz}^2 \right)^2},$

which is zero only if $\hat Q_{xz}X'y = \hat Q_{xx}Z'y$ or $\hat Q_{xz} = 0$. If $X$ and $Z$ are correlated, the two intervals are likely to be different.
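The calculation in Example 1 can be reproduced numerically. The sketch below uses simulated correlated regressors and a fixed illustrative weight $w(1|\hat\delta) = 0.4$ (all numbers hypothetical); it confirms that the difference between the two centers equals $w(1|\hat\delta)$ times the displayed closed form and is nonzero when $\hat Q_{xz} \ne 0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80
x = rng.normal(size=n)                   # single core regressor (no intercept)
z = 0.6 * x + rng.normal(size=n)         # correlated auxiliary regressor, so Qxz != 0
y = 1.5 * x + (0.5 / np.sqrt(n)) * z + rng.normal(size=n)

qxx, qxz, qzz = x @ x / n, x @ z / n, z @ z / n
a, b = x @ y / n, z @ y / n
det = qxx * qzz - qxz**2

beta_1 = a / qxx                         # submodel estimate (z excluded)
beta_M = (qzz * a - qxz * b) / det       # full model estimates
gamma_M = (-qxz * a + qxx * b) / det

w1 = 0.4                                 # illustrative weight on the submodel
center_ma = (w1 * beta_1**2 + (1 - w1) * beta_M**2
             - 2 * w1 * beta_M * (qxz / qxx) * gamma_M)
center_full = beta_M**2

# The bracketed quantity and its closed-form expression from Example 1.
E = beta_1**2 - beta_M**2 - 2 * beta_M * (qxz / qxx) * gamma_M
closed_form = (qxz**2 * (qxz * (x @ y) - qxx * (z @ y))**2
               / (n**2 * qxx**2 * det**2))
print(center_ma - center_full, E, closed_form)
```

Since the closed form is a ratio of squares, the gap is nonnegative and vanishes only in the knife-edge cases identified above, so for correlated regressors the two intervals generically differ in finite samples.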

4 Conclusion

In this work, we have studied the frequentist model averaging confidence interval for linear models proposed by Liu (2015). Using techniques similar to those of Wang and Zhou (2013), our key result is that the model averaging confidence interval is asymptotically equivalent to the interval obtained by use of least squares estimation in the full model. Furthermore, we also provide the stronger result that the intervals are exactly equivalent in finite samples if $\mu$, the function defining the focus parameter, is linear.

While the results are disadvantageous to model averaging, they should be interpreted with care. Numerous studies have illustrated impressive performance of model averaging in terms of prediction. This strand of the literature is separate from what we consider and thus remains unaffected by the conclusion herein. In fact, the good predictive performance may instead suggest that with other methods, we may be able to increase the efficiency of our confidence intervals by resorting to model averaging. This, however, we leave for further research.

A Proofs

Proof of Theorem 2. A proof technique similar to that used by Wang and Zhou (2013) is also employed here, but applied to our different problem setting.

As already noted in the text, the intervals have the same length, and thus what needs to be proved is that the centers are asymptotically equal. To this end, let $R \sim N(0, \Omega)$. Lemma 1 in Liu (2015) indicates that

$\sqrt{n}(\hat\theta_m - \theta_m) = Q_m^{-1}S_m'QS_0(I_q - \Pi_m'\Pi_m)\delta + Q_m^{-1}S_m'R + o_p(1), \quad (7)$

where $\theta_m = (\beta', \delta_m'/\sqrt{n})'$. For simplicity and without loss of generality, assume that the set of models is ordered in such a way that $M$ denotes the full model. If $m = M$, then $\Pi_M = I$, $S_M = I$, and $Q_M^{-1} = Q^{-1}$, so what remains is

$\sqrt{n}(\hat\theta_M - \theta) = Q^{-1}R + o_p(1),$

where $\theta = (\beta', \delta'/\sqrt{n})'$. Thus, $R = \sqrt{n}Q(\hat\theta_M - \theta) + o_p(1)$, and (7) becomes

$\sqrt{n}(\hat\theta_m - \theta_m) = Q_m^{-1}S_m'QS_0(I_q - \Pi_m'\Pi_m)\delta + Q_m^{-1}S_m'Q\sqrt{n}(\hat\theta_M - \theta) + o_p(1). \quad (8)$

By rules for inverses of block matrices (e.g., Theorem 8.5.11 in Harville, 1997),

$Q_m^{-1} = \begin{pmatrix} Q_{xx}^{-1} + Q_{xx}^{-1}Q_{xz}\Pi_m'K_m\Pi_mQ_{zx}Q_{xx}^{-1} & -Q_{xx}^{-1}Q_{xz}\Pi_m'K_m \\ -K_m\Pi_mQ_{zx}Q_{xx}^{-1} & K_m \end{pmatrix}, \quad (9)$

where

$K_m = \left( \Pi_m\left( Q_{zz} - Q_{zx}Q_{xx}^{-1}Q_{xz} \right)\Pi_m' \right)^{-1} = \left( \Pi_m K^{-1}\Pi_m' \right)^{-1} \quad (10)$

with $K = \left( Q_{zz} - Q_{zx}Q_{xx}^{-1}Q_{xz} \right)^{-1}$. Plugging (9) into $Q_m^{-1}S_m'Q$ yields

$Q_m^{-1}S_m'Q = \begin{pmatrix} I_p & A_m \\ 0 & B_m \end{pmatrix},$

where $A_m = Q_{xx}^{-1}Q_{xz}\left( I_q - \Pi_m'K_m\Pi_mK^{-1} \right)$ and $B_m = K_m\Pi_mK^{-1}$. Thus, (8) can be expressed as

$\sqrt{n}(\hat\theta_m - \theta_m) = \begin{pmatrix} A_m \\ B_m \end{pmatrix}(I_q - \Pi_m'\Pi_m)\delta + \begin{pmatrix} I_p & A_m \\ 0 & B_m \end{pmatrix}\sqrt{n}(\hat\theta_M - \theta) + o_p(1).$

Recall that $\hat\theta_m = (\hat\beta_m', \hat\delta_m'/\sqrt{n})'$ and $\theta_m = (\beta', \delta_m'/\sqrt{n})'$, where $\delta_m = \Pi_m\delta$. Then

$\sqrt{n}(\hat\beta_m - \beta) = A_m\delta + \sqrt{n}(\hat\beta_M - \beta) + A_m\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) + o_p(1), \quad (11)$

$\sqrt{n}\left( \hat\gamma_m - \frac{\delta_m}{\sqrt{n}} \right) = B_m(I_q - \Pi_m'\Pi_m)\delta + B_m\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) + o_p(1), \quad (12)$

where we have used that $A_m\Pi_m'\Pi_m = 0$ and $(I_q - \Pi_m'B_m)(I_q - \Pi_m'\Pi_m) = I_q - \Pi_m'B_m$.

The Taylor expansion of $\sqrt{n}\left( \mu(\hat\beta_m, \Pi_m'\hat\gamma_m) - \mu(\beta, \delta/\sqrt{n}) \right)$ around $(\beta, \delta/\sqrt{n})$ is

$\sqrt{n}\left( \mu(\hat\beta_m, \Pi_m'\hat\gamma_m) - \mu\left( \beta, \frac{\delta}{\sqrt{n}} \right) \right) = \left( \frac{\partial\mu}{\partial\beta} \right)'\sqrt{n}(\hat\beta_m - \beta) + \left( \frac{\partial\mu}{\partial\gamma} \right)'\sqrt{n}\left( \Pi_m'\hat\gamma_m - \frac{\delta}{\sqrt{n}} \right) + o_p(1), \quad (13)$

where the partial derivatives are evaluated at $(\beta, 0_{q\times 1})$ because of the continuous mapping theorem. Likewise, for the full model,

$\sqrt{n}\left( \mu(\hat\beta_M, \hat\gamma_M) - \mu\left( \beta, \frac{\delta}{\sqrt{n}} \right) \right) = \left( \frac{\partial\mu}{\partial\beta} \right)'\sqrt{n}(\hat\beta_M - \beta) + \left( \frac{\partial\mu}{\partial\gamma} \right)'\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) + o_p(1). \quad (14)$


We then get

$\begin{aligned}
\sqrt{n}\bar\mu &= \sum_{m=1}^{M} w(m|\hat\delta)\,\sqrt{n}\mu(\hat\beta_m, \Pi_m'\hat\gamma_m) \\
&= \sum_{m=1}^{M} w(m|\hat\delta)\left[ \sqrt{n}\mu\left( \beta, \frac{\delta}{\sqrt{n}} \right) + \left( \frac{\partial\mu}{\partial\beta} \right)'\sqrt{n}(\hat\beta_m - \beta) + \left( \frac{\partial\mu}{\partial\gamma} \right)'\sqrt{n}\left( \Pi_m'\hat\gamma_m - \frac{\delta}{\sqrt{n}} \right) \right] + o_p(1) && \text{(by (13))} \\
&= \sqrt{n}\mu\left( \beta, \frac{\delta}{\sqrt{n}} \right) + \left( \frac{\partial\mu}{\partial\beta} \right)'\sqrt{n}(\hat\beta_M - \beta) + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\beta} \right)'A_m\left[ \delta + \sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \right] \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'\sqrt{n}\left( \Pi_m'\hat\gamma_m - \frac{\delta}{\sqrt{n}} \right) + o_p(1) && \text{(by (11))}.
\end{aligned}$

Rewriting $\Pi_m'\hat\gamma_m - \frac{\delta}{\sqrt{n}} = \Pi_m'\left( \hat\gamma_m - \frac{\delta_m}{\sqrt{n}} \right) - (I_q - \Pi_m'\Pi_m)\frac{\delta}{\sqrt{n}}$ and using (12),

$\begin{aligned}
\sqrt{n}\bar\mu &= \sqrt{n}\mu\left( \beta, \frac{\delta}{\sqrt{n}} \right) + \left( \frac{\partial\mu}{\partial\beta} \right)'\sqrt{n}(\hat\beta_M - \beta) + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\beta} \right)'A_m\left[ \delta + \sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \right] \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'\left[ \Pi_m'B_m\left( (I_q - \Pi_m'\Pi_m)\delta + \sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \right) - (I_q - \Pi_m'\Pi_m)\delta \right] + o_p(1).
\end{aligned}$

Applying (14) yields

$\begin{aligned}
\sqrt{n}\bar\mu &= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) - \left( \frac{\partial\mu}{\partial\gamma} \right)'\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\beta} \right)'A_m\left[ \delta + \sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \right] \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'\left[ \Pi_m'B_m\left( (I_q - \Pi_m'\Pi_m)\delta + \sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \right) - (I_q - \Pi_m'\Pi_m)\delta \right] + o_p(1) \\
&= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \sum_{m=1}^{M} w(m|\hat\delta)\left[ \left( \frac{\partial\mu}{\partial\beta} \right)'A_m + \left( \frac{\partial\mu}{\partial\gamma} \right)'\left( \Pi_m'B_m - I_q \right) \right]\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\left[ \left( \frac{\partial\mu}{\partial\beta} \right)'A_m + \left( \frac{\partial\mu}{\partial\gamma} \right)'\Pi_m'B_m(I_q - \Pi_m'\Pi_m) \right]\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1) \\
&= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \sum_{m=1}^{M} w(m|\hat\delta)\,D_\theta'\begin{pmatrix} A_m \\ \Pi_m'B_m - I_q \end{pmatrix}\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\,D_\theta'\begin{pmatrix} A_m \\ \Pi_m'B_m(I_q - \Pi_m'\Pi_m) \end{pmatrix}\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1). \quad (15)
\end{aligned}$

Further, using

$C_m = \left( S_mQ_m^{-1}S_m'Q - I \right)S_0 = \begin{pmatrix} A_m \\ \Pi_m'B_m - I_q \end{pmatrix} \quad (16)$

in (15) yields

$\begin{aligned}
\sqrt{n}\bar\mu &= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\sqrt{n}\left( \hat\gamma_M - \frac{\delta}{\sqrt{n}} \right) \\
&\quad + D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)\begin{pmatrix} A_m \\ \Pi_m'B_m(I_q - \Pi_m'\Pi_m) \end{pmatrix}\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1). \quad (17)
\end{aligned}$

The center of the confidence interval proposed in Liu (2015) is $\bar\mu - b(\hat\delta)$, where $b(\hat\delta) = \hat D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)\hat C_m\hat\gamma_M$ with $\hat\delta = \sqrt{n}\hat\gamma_M$. Writing $\sqrt{n}\left( \hat\gamma_M - \delta/\sqrt{n} \right) = \sqrt{n}\hat\gamma_M - \delta$ in (17), we have

$\begin{aligned}
\sqrt{n}\bar\mu &= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\sqrt{n}\hat\gamma_M - \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\delta \\
&\quad + D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)\begin{pmatrix} A_m \\ \Pi_m'B_m(I_q - \Pi_m'\Pi_m) \end{pmatrix}\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1) \\
&= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\sqrt{n}\hat\gamma_M \\
&\quad + D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)\begin{pmatrix} 0 \\ I_q - \Pi_m'B_m\Pi_m'\Pi_m \end{pmatrix}\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1) \\
&= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\sqrt{n}\hat\gamma_M \\
&\quad + \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta - \sum_{m=1}^{M} w(m|\hat\delta)\left( \frac{\partial\mu}{\partial\gamma} \right)'(I_q - \Pi_m'\Pi_m)\delta + o_p(1) && \text{(by (10))} \\
&= \sqrt{n}\mu(\hat\beta_M, \hat\gamma_M) + \left[ D_\theta'\sum_{m=1}^{M} w(m|\hat\delta)C_m \right]\sqrt{n}\hat\gamma_M + o_p(1).
\end{aligned}$

Hence, $\bar\mu = \hat\mu_M + b(\hat\delta) + o_p(n^{-1/2})$, and the intervals are asymptotically equivalent.

Proof of Theorem 3. Also for this proof, it suffices to show that the intervals have the same centers, as the lengths are equal. Applying equation (9) leads to

$\hat\theta_m = \frac{1}{n}\begin{pmatrix} \left( \hat Q_{xx}^{-1} + \hat Q_{xx}^{-1}\hat Q_{xz}\Pi_m'\hat K_m\Pi_m\hat Q_{zx}\hat Q_{xx}^{-1} \right)X'y - \hat Q_{xx}^{-1}\hat Q_{xz}\Pi_m'\hat K_m\Pi_mZ'y \\ -\hat K_m\Pi_m\hat Q_{zx}\hat Q_{xx}^{-1}X'y + \hat K_m\Pi_mZ'y \end{pmatrix},$

where the first block corresponds to $\hat\beta_m$ and the second block corresponds to $\hat\gamma_m$. Note that

$\hat\theta_M + \hat C_m\hat\gamma_M = \begin{pmatrix} \hat\beta_M + \hat A_m\hat\gamma_M \\ \Pi_m'\hat B_m\hat\gamma_M \end{pmatrix},$

so the difference of the centers satisfies

$c'\left[ \sum_{m=1}^{M-1} w(m|\hat\delta)\left( \hat\theta^{(m)} - \hat\theta_M - \hat C_m\hat\gamma_M \right) \right] = c'\left[ \sum_{m=1}^{M-1} w(m|\hat\delta)\begin{pmatrix} \hat\beta_m - \hat\beta_M - \hat A_m\hat\gamma_M \\ \hat\gamma^{(m)} - \Pi_m'\hat B_m\hat\gamma_M \end{pmatrix} \right]. \quad (18)$

First, for the upper block,

$\begin{aligned}
\hat\beta_m - \hat\beta_M - \hat A_m\hat\gamma_M &= \left( \hat Q_{xx}^{-1} + \hat Q_{xx}^{-1}\hat Q_{xz}\Pi_m'\hat K_m\Pi_m\hat Q_{zx}\hat Q_{xx}^{-1} \right)\frac{X'y}{n} - \hat Q_{xx}^{-1}\hat Q_{xz}\Pi_m'\hat K_m\Pi_m\frac{Z'y}{n} \\
&\quad - \left( \hat Q_{xx}^{-1} + \hat Q_{xx}^{-1}\hat Q_{xz}\hat K\hat Q_{zx}\hat Q_{xx}^{-1} \right)\frac{X'y}{n} + \hat Q_{xx}^{-1}\hat Q_{xz}\hat K\frac{Z'y}{n} \\
&\quad - \hat Q_{xx}^{-1}\hat Q_{xz}\left( I - \Pi_m'\hat K_m\Pi_m\hat K^{-1} \right)\left( -\hat K\hat Q_{zx}\hat Q_{xx}^{-1}\frac{X'y}{n} + \hat K\frac{Z'y}{n} \right) = 0_{p\times 1}.
\end{aligned}$

Second, for the lower block,

$\begin{aligned}
\hat\gamma^{(m)} - \Pi_m'\hat B_m\hat\gamma_M &= \Pi_m'\left( \hat\gamma_m - \hat B_m\hat\gamma_M \right) \\
&= \Pi_m'\left[ \left( -\hat K_m\Pi_m\hat Q_{zx}\hat Q_{xx}^{-1}\frac{X'y}{n} + \hat K_m\Pi_m\frac{Z'y}{n} \right) - \hat B_m\left( -\hat K\hat Q_{zx}\hat Q_{xx}^{-1}\frac{X'y}{n} + \hat K\frac{Z'y}{n} \right) \right] \\
&= \Pi_m'\left( \hat B_m\hat K - \hat K_m\Pi_m \right)\left( \hat Q_{zx}\hat Q_{xx}^{-1}X' - Z' \right)\frac{y}{n} = 0_{q\times 1},
\end{aligned}$

where the last equality holds because $\hat B_m\hat K = \hat K_m\Pi_m\hat K^{-1}\hat K = \hat K_m\Pi_m$. Thus, the difference between the centers in (18) may be further seen to be

$c'\left[ \sum_{m=1}^{M-1} w(m|\hat\delta)\left( \hat\theta^{(m)} - \hat\theta_M - \hat C_m\hat\gamma_M \right) \right] = c'\left[ \sum_{m=1}^{M-1} w(m|\hat\delta)\begin{pmatrix} 0_{p\times 1} \\ 0_{q\times 1} \end{pmatrix} \right] = 0.$

Consequently, the confidence intervals for a linear function $\mu(\theta) = c'\theta$ of the parameters based on either frequentist model averaging or on least squares in the full model are equivalent, even in finite samples.

References

Akaike H (1974) A new look at the statistical model identification. IEEE Automat Contr 19(6):716–723

Ando T, Li KC (2014) A model-averaging approach for high-dimensional regression. J Am Stat Assoc 109(505):254–265

Berk R, Brown L, Buja A, Zhang K, Zhao L (2013) Valid post-selection inference. Ann Stat 41(2):802–837

Cheng TCF, Ing CK, Yu SH (2015) Toward optimal model averaging in regression models with time series errors. J Econometrics 189(2):321–334

Hansen BE (2007) Least squares model averaging. Econometrica 75(4):1175–1189

Hansen BE, Racine JS (2012) Jackknife model averaging. J Econometrics 167(1):38–46

Harville DA (1997) Matrix algebra from a statistician's perspective. Springer, New York

Hjort NL, Claeskens G (2003) Frequentist model average estimators. J Am Stat Assoc 98(464):879–899

Kabaila P (1995) The effect of model selection on confidence regions and prediction regions. Economet Theor 11(3):537–549

Kabaila P, Leeb H (2006) On the large-sample minimal coverage probability of confidence intervals after model selection. J Am Stat Assoc 101(474):619–629

Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Economet Theor 21(1):21–59

Liu CA (2015) Distribution theory of the least squares averaging estimator. J Econometrics 186(1):142–159

Liu Q, Okui R (2013) Heteroscedasticity-robust Cp model averaging. Economet J 16(3):463–472

Pötscher BM (1991) Effects of model selection on inference. Economet Theor 7(2):163–185

Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

Wan ATK, Zhang X, Zou G (2010) Least squares model averaging by Mallows criterion. J Econometrics 156(2):277–283

Wang H, Zhou SZF (2013) Interval estimation by frequentist model averaging. Commun Stat Theory 42(23):4342–4356
