

Umeå School of Business and Economics
Spring semester 2015

Misspecification and inference

A review and case studies

Author: Gabriel Wallin


Abstract


Summary (Sammanfattning)

Title: Misspecification and inference - a review and case studies.

When drawing conclusions from data, one generally has to make certain assumptions. A common assumption is that the parametric model describing the behavior of the random phenomenon under study is correctly specified. If this is not the case, certain inference methods, for example the maximum likelihood method, cannot be used. This thesis examines and presents some of the results for misspecified parametric models in order to illustrate the consequences of misspecification and how the parameter estimates are affected. A main question examined is whether it is still possible to learn something about the true parameter even when the model is misspecified. An important result presented is that the so-called quasi-maximum likelihood estimate from a misspecified model converges almost surely to the parameter that minimizes the distance between the true model and the estimation model. It is shown how this parameter in certain situations converges to the true parameter multiplied by a scalar. This result is also illustrated for a situation not covered by any theorem. In addition, a general class of estimators called M-estimators is presented. These are used to extend the theory of misspecified models, and an example is presented where the theory of M-estimators is applied.

Popular science summary (Populärvetenskaplig sammanfattning)


Contents

1 Introduction
  1.1 Purpose of the thesis
  1.2 Outline of the thesis
2 Theory part 1
  2.1 Maximum likelihood
  2.2 Kullback-Leibler Information Criterion
  2.3 Quasi Likelihood
3 Simulation study 1
  3.1 Design A
  3.2 Results A
  3.3 Design B
  3.4 Results B
4 Theory part 2
  4.1 M-estimators
5 Simulation study 2
  5.1 Design
  5.2 Results
6 Final recommendations
7 Discussion
8 Acknowledgements
References


1 Introduction

In statistical modeling we are interested in the underlying structure that generates the data. In that sense, we assume that there actually exists a true model that fully describes the data generating process (DGP). It then follows that the parametric model¹ used to make inferences will be a more or less adequate description of the DGP. In statistical research, model specification, model diagnostics and optimal model choice have been widely investigated; see for instance [4, 1] and [25]. This thesis instead takes the approach that we have ended up with an estimation model that is not an adequate description of the DGP. We will call this type of estimation model misspecified. A natural question then arises: What happens to the parameter estimates when the parametric model is misspecified? Is there still any information regarding the true parameters that can be found when making inference based on a misspecified model? As pointed out in King and Roberts [14]:

Models are sometimes useful but almost never exactly correct, and so working out the theoretical implications for when our estimators still apply is fundamental to the massive multidisciplinary project of statistical model building.

Several papers that have investigated misspecified models have proposed robust alternatives for estimation, e.g. robust standard errors. Among others, Huber [12, 13] gave robust alternatives to the least squares estimate for regression models, and both White [32] and Eicker [5] investigated a covariance matrix estimator that is robust against heteroskedasticity. To have a broader understanding of statistical modeling it is also reasonable to add knowledge about the properties of misspecified models, since this situation is very likely to occur as soon as we wish to model a random phenomenon. This thesis will exemplify and illustrate some of the results regarding inference for misspecified models and show situations where there still is information that can be gained about the true parameter. Our starting point is that we are facing two (possibly) different models: a true, unknown model and an estimation model. This means that the theory presented here is a complement to the literature regarding optimal model choice and model specification. This could in fact be seen as one joint theory, where model specification and optimal model choice, model diagnostics and the results for misspecified models all are parts of the theory of statistical modeling.

1.1 Purpose of the thesis

This thesis reviews parts of the theory for misspecified parametric models when conducting statistical inference about an unknown parameter. The results will be exemplified and illustrated using simulation. Two main questions that will be given extra attention are:

• How will the parameter estimates be affected when the estimation model is misspecified?

• Given a misspecified model, are there any special cases where there is still some information that can be found about the true model?

1.2 Outline of the thesis

The thesis is organized as follows:

• Section 2 starts with an overview of a common method of statistical inference, the maximum likelihood method. It then gives a definition of the Kullback-Leibler information criterion (KLIC) and describes its role in the theory of model misspecification. Then another likelihood-type estimator, the quasi-maximum likelihood estimator (QMLE), is proposed with motivation from the definition of the KLIC. Some results for the estimator are given, together with a simple illustration of how the KLIC and the QMLE are connected.

• Section 3 uses simulations to illustrate some of the results for the QMLE.

• Section 4 introduces a broader class of so-called M-estimators, together with a discussion of the contribution of M-estimation theory to the asymptotic theory of misspecified models.

• Section 5 illustrates some of the results for M-estimators using simulation.


2 Theory part 1

2.1 Maximum likelihood

A parametric model, or parametric family of distributions, is a collection of probability distributions that can be described by a finite number of parameters. It is intended to describe the probability mass function (p.m.f.) or probability density function (p.d.f.) of a random variable. Consider a random variable whose functional form of the p.m.f. or p.d.f. is known but where the distribution depends on an unknown parameter² θ that takes values in a parameter space Θ. If we for example know that the random variable X is described by $p_X(x;\theta) = e^{-\theta}\theta^x/x!$, $x = 0, 1, 2, \ldots$, and that $\theta \in \Theta = \{\theta : 0 < \theta < \infty\}$, it still might be the case that we need to specify the most probable p.m.f. of X. This means that we are interested in a specific member of the family of distributions contained in the parametric model $\{p_X(x;\theta), \theta \in \Theta\}$, and thus we must estimate the parameter θ. One common estimator of θ in this type of setting is the maximum likelihood estimator (MLE). Let $X_1, X_2, \ldots, X_n$ be a random sample of size n of independent and identically distributed (i.i.d.) random variables with realizations denoted by $x_1, \ldots, x_n$. If we denote the density of X as $g_X(x;\theta)$, regarded as a function of the unknown parameter θ, the likelihood function is defined as

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} g_{X_i}(x_i;\theta).$$

To get the MLE of θ, the likelihood function, or more commonly, the natural logarithm of the likelihood function, is maximized.

Given suitable regularity conditions³, the method of maximum likelihood gives estimates that have several appealing properties such as efficiency [7], consistency [30, 3] and asymptotic normality [3]. The last property means that

$$\sqrt{n}(\hat{\theta}_{MLE} - \theta) \xrightarrow{d} N_p(0, I(\theta)^{-1}), \quad \text{as } n \to \infty,$$

where $N_p$ is a p-variate normal distribution and $I(\theta)$ is the Fisher information matrix given by

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta^T}\log g_X(x;\theta)\right)\left(\frac{\partial}{\partial\theta}\log g_X(x;\theta)\right)\right],$$

which we can rewrite with the use of the score $s(X;\theta) = \frac{\partial}{\partial\theta}\log g_X(x;\theta)$, so that

$$I(\theta) = E\left[s(X;\theta)^T s(X;\theta)\right] = -E\left[\frac{\partial^2\log g_X(x;\theta)}{\partial\theta^2}\right].$$

The Bernoulli distribution can be used as an example. Let $X_i \sim \text{Bernoulli}(p)$, $i = 1, \ldots, n$, so that the p.m.f. is given by $p_X(x;p) = p^x(1-p)^{1-x}$ and $\log p_X(x;p) = x\log p + (1-x)\log(1-p)$. The score becomes

$$s(X;p) = \frac{X}{p} - \frac{1-X}{1-p},$$

and

$$-s'(X;p) = \frac{X}{p^2} + \frac{1-X}{(1-p)^2}.$$

So

$$I(p) = -E\left(s'(X;p)\right) = \frac{1}{p(1-p)},$$

and thus $\text{Var}(\hat{p}) = I(p)^{-1} = p(1-p)$. ∎

Since the variance of the MLE reaches $I(\theta)^{-1}$ asymptotically, and since $V(\hat{\theta}) \geq I(\theta)^{-1}$ in general, there does not exist an estimator with lower asymptotic variance. We call this property of the MLE efficiency.

²The parameter of interest could of course be vector-valued. Throughout the thesis we will make no difference in notation or in use of the term between a parameter that is vector-valued and a parameter that is not.
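This efficiency property can be checked by simulation. Below is a minimal sketch in R (the software used for the computations in this thesis [20]); the values of p, n and the number of replicates are arbitrary choices:

```r
# Sketch: compare the simulated variance of the Bernoulli MLE p_hat = mean(X)
# with the Cramer-Rao bound I(p)^{-1}/n = p(1 - p)/n.
set.seed(1)
p <- 0.3; n <- 200; reps <- 10000

p_hat <- replicate(reps, mean(rbinom(n, 1, p)))
var(p_hat)       # simulated variance of the MLE
p * (1 - p) / n  # efficiency bound, approx. 0.00105
```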

So far we have assumed that the functional form of the p.m.f. or p.d.f. is known. However, there are situations when this is not a reasonable assumption. When the true model is unknown we have to use an estimation model, thereby running the risk of misspecifying the model. A natural question then becomes what happens to the MLE under model misspecification. To investigate this, we start by defining a measure of the discrepancy of the estimation model from the true model, called the Kullback-Leibler information criterion (KLIC).

2.2 Kullback-Leibler Information Criterion

Before defining the KLIC, we first briefly discuss what is meant by information. It is closely related to what is sometimes called the surprise of an event and to the information theory formalized by Shannon [26]. Say that you flip an unfair coin with probability 0.2 of heads and probability 0.8 of tails. The message that you will receive heads will then give you a lot of information, while a message that you will receive tails will not give you much information; with a probability of 0.8, tails is almost what you expect. With this reasoning we can say that if some event is very unlikely to occur, the message that it will occur gives us a lot of information, and vice versa. We could use $I_p = \log(1/p)$, where p denotes the probability of an event, as an information function; the information decreases with increasing p and vice versa, just as in our example of the coin flip. If the probability of the event changes from p to q we could measure the information value of the change by $I_p - I_q$, where $I_q = \log(1/q)$, so that $I_p - I_q = \log(q/p)$. Expressed as an expected information value we have that

$$E(I_p - I_q) = \sum_{i=1}^{n} q_i \log\left(\frac{q_i}{p_i}\right). \quad (2.1)$$

Kullback and Leibler [15] used Shannon's ideas of information to define a divergence measure, or information criterion, that measures the difference between two probability distributions g and f:

$$D(g : f) = E_g\left[\log\left(\frac{g(x)}{f(x;\theta)}\right)\right]. \quad (2.2)$$

Note that the expectation is taken with respect to g and that $D(g : f(x;\theta)) \geq 0$. It can further be shown that $D(g : f(x;\theta)) = 0$ if and only if f = g. The KLIC can be seen as the information that has been lost when using f, the probability distribution that we have assumed, to approximate g, the true and unknown probability distribution that generates the data. Rényi [22] showed that the KLIC, as in the opening example of this section, can be thought of in terms of information, i.e. the KLIC can be seen as the information gained when carrying out an experiment on a random phenomenon. White [34] describes this as the information gained when the experimenter is told that the observed phenomenon is described by g and not f, which was the initial belief.

For the continuous case we can write the KLIC as

For the continuous case we can write the KLIC as

$$D(g : f(x;\theta)) = \int g(x)\log\left(\frac{g(x)}{f(x;\theta)}\right)dx \quad (2.3)$$
$$= \int g(x)\log(g(x))\,dx - \int g(x)\log(f(x;\theta))\,dx, \quad (2.4)$$

where the similarity with Equation 2.1 is apparent. The KLIC is not a metric since $D(f : g) \neq D(g : f)$, i.e. the distance from f to g is not the same as the distance from g to f, meaning that the KLIC cannot be used as a goodness-of-fit measure in the usual sense. A simple example of how the KLIC is calculated is given in Example 1.

Example 1

[Figure 2.1: The densities for Example 2 ("Correct and wrong model"): the two normal densities with mean 0, sd 1 and mean 2, sd 1.]

One way of quantifying the distance between the models in Figure 2.1 is to calculate the KLIC, which for this case is given by

$$D(g : f(x;\theta)) = \int_{-\infty}^{\infty} g(x)\log\left(\frac{g(x)}{f(x)}\right)dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)\log\left(\frac{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)}{\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-2)^2}{2}\right)}\right)dx$$
$$= \int_{-\infty}^{\infty} g(x)\log\left(e^{\frac{1}{2}(x-2)^2 - \frac{x^2}{2}}\right)dx = \int_{-\infty}^{\infty} g(x)(2 - 2x)\,dx = 2\int_{-\infty}^{\infty} g(x)\,dx - 2\int_{-\infty}^{\infty} x\,g(x)\,dx = 2 \times 1 - 2 \times 0 = 2.$$
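The value of this integral can also be verified numerically. A minimal R sketch, assuming only the two densities from the example:

```r
# Sketch: numerically verify the KLIC between the true N(0, 1) density g
# and the misspecified N(2, 1) density f, using Equation 2.3.
g <- function(x) dnorm(x, mean = 0, sd = 1)
f <- function(x) dnorm(x, mean = 2, sd = 1)

klic_integrand <- function(x) g(x) * log(g(x) / f(x))
integrate(klic_integrand, lower = -Inf, upper = Inf)  # returns 2
```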


Viewing the KLIC as a goodness-of-fit measure, we can think of a situation where we would like to compare two estimation models $f_1$ and $f_2$ against the true model g to evaluate which one gets closest. We could write the mean KLIC difference as

$$\bar{I} = D(g : f_1) - D(g : f_2) = \int g\log\left(\frac{g}{f_1}\right)dx - \int g\log\left(\frac{g}{f_2}\right)dx = \int g\log\left(\frac{f_2}{f_1}\right)dx,$$

where the right-hand side of the last equality can be estimated using data, even though g is unknown. We then have three potential scenarios:

1. $\bar{I} = 0$, meaning that $f_1$ and $f_2$ are equally good approximations of g
2. $\bar{I} > 0$, meaning that $f_2$ is a better approximation of g than $f_1$
3. $\bar{I} < 0$, meaning that $f_1$ is a better approximation of g than $f_2$.

From this we can choose the best model, i.e. the model that minimizes the distance to the true model.
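Since the last integral is an expectation under g, it can be estimated by a sample average even when g is unknown. A small R sketch with hypothetical choices of g, f1 and f2:

```r
# Sketch: estimate I = D(g : f1) - D(g : f2) from data. The observations
# come from the unknown g; only the candidate densities are evaluated.
set.seed(1)
x  <- rnorm(1000, mean = 0, sd = 1)          # sample from g = N(0, 1)
f1 <- function(x) dnorm(x, mean = 2,   sd = 1)
f2 <- function(x) dnorm(x, mean = 0.5, sd = 1)

I_hat <- mean(log(f2(x) / f1(x)))            # sample analogue of the integral
I_hat                                        # > 0 here, so f2 is closer to g
```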

2.3 Quasi Likelihood

In contrast to the situation described in Subsection 2.1, there are situations where we don't have any prior knowledge of the true functional form when we want to model a random phenomenon. When this is the case, we have to start by specifying the functional form of an estimation model and then estimate the parameter from it. If the parametric model that we specify includes the true model, the problem of inference reduces to estimating the parameter θ, which we can do consistently with the MLE.

We are interested in the parameter θ in f, which needs to be estimated from our observations x that are realizations of a random variable X with unknown density function g. Our objective should intuitively be to minimize (2.2) by an appropriate choice of θ. Indeed, Akaike [1] argued that a natural estimator of θ would be the parameter that minimizes the KLIC, i.e. the parameter minimizing the distance between the true and the false density. We define this parameter as

$$\theta_* = \underset{\theta\in\Theta}{\arg\min}\; E\left[\log\left(\frac{g(x)}{f(x;\theta)}\right)\right]. \quad (2.5)$$

By comparing equations 2.3 and 2.4, and since $D(g : f) \geq 0$, we see that choosing θ to minimize (2.3) is the same as choosing θ to maximize

$$\tilde{L}(\theta) = \int g(x)\log(f(x;\theta))\,dx = E\left[\log f(X;\theta)\right].$$

Finally, since this expectation can, by the law of large numbers, be approximated by the sample average $n^{-1}\sum_{i=1}^{n}\log f(X_i;\theta) \equiv L_n(X;\theta)$, our minimization problem of (2.3) reduces to

$$\max_{\theta\in\Theta} L_n(X;\theta) \equiv n^{-1}\sum_{i=1}^{n}\log f(X_i;\theta), \quad (2.6)$$

where we call the solution to (2.6) the quasi-maximum likelihood estimator (QMLE). White [33] has shown that the solution of (2.6) exists and is unique, and has furthermore given the following key result.

Theorem 2. $\hat{\theta}_n \xrightarrow{a.s.} \theta_*$ as $n \to \infty$, where $\hat{\theta}_n$ is the parameter vector that solves $\max_{\theta\in\Theta} L_n(X;\theta)$. ∎

So if our objective is to find a parameter estimate that minimizes the KLIC, Theorem 2 establishes that this is indeed what we are doing when we use the QMLE. White [33] calls the QMLE the estimator that "...minimizes our ignorance about the true structure", and he⁴ furthermore showed that

$$\sqrt{n}(\hat{\theta} - \theta_*) \xrightarrow{d} N(0, C(\theta_*)). \quad (2.7)$$

To define C(θ) for a parameter θ we first need to define the Hessian

$$A_{jk}(\theta) = E\left[\frac{\partial^2\log f(X_i,\theta)}{\partial\theta_j\,\partial\theta_k}\right]$$

and the square of the gradient,

$$B_{jk}(\theta) = E\left[\frac{\partial\log f(X_i,\theta)}{\partial\theta_j}\cdot\frac{\partial\log f(X_i,\theta)}{\partial\theta_k}\right].$$

Now,

$$C(\theta) = A(\theta)^{-1}B(\theta)A(\theta)^{-1}, \quad (2.8)$$

and we furthermore have that $C(\hat{\theta}) \xrightarrow{a.s.} C(\theta_*)$. C(θ) is often estimated with the so-called sandwich estimator. If we specify the parametric family correctly, then $-A(\theta) = B(\theta)$, meaning that $C(\theta) = -A(\theta)^{-1} = B(\theta)^{-1}$, where $B(\theta)^{-1} = I(\theta)^{-1}$; thus the sandwich estimator reduces to $I(\theta)^{-1}$, giving the efficient variance of the MLE. We will return to this estimator in Section 5.
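As an illustration of (2.8), the sketch below fits a probit model to hypothetical data generated from a logit model and compares the model-based covariance with the sandwich covariance; the R package sandwich computes the empirical $A_n^{-1}B_nA_n^{-1}$ estimate for a fitted model object:

```r
# Sketch: model-based versus sandwich covariance under a misspecified link.
# The data-generating values are illustrative assumptions.
library(sandwich)

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 * x))           # true model: logit link
fit <- glm(y ~ x, family = binomial("probit")) # misspecified link

vcov(fit)      # model-based covariance, valid only under correct specification
sandwich(fit)  # robust sandwich covariance, valid under misspecification
```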


[Figure 2.2: The Kullback-Leibler information criterion plotted for different values of µ for the misspecified model.]

The figure shows how the KLIC attains its minimum at the mean of the true model N(0, 1), and how the KLIC grows both to the left and to the right of the minimum value.

In a sense, Figure 2.2 illustrates the difference between the MLE and the QMLE. The MLE is based on the true model and thus the KLIC is equal to zero, asymptotically. White's result does not say that the QMLE reaches the minimum value of the KLIC, but that the KLIC will be minimized given the data and the misspecification. A natural question is what θ∗ can be in relation to θ, the true parameter. Can we, even if we misspecify the model, extract some information about the true parameter?

Li and Duan [16] investigate misspecified generalized linear models (GLMs) and state a proportionality result for the coefficient estimates. If Y is the outcome of interest, $E[Y] = g^{-1}(X\theta)$, where $g^{-1}$ is the inverse of the link function g that connects the linear predictor Xθ to the outcome. Li and Duan give the following result when the link function g is misspecified.

Theorem 3. The estimated coefficients converge almost surely to the true parameter vector times a scalar factor, i.e. $\hat{\theta} \xrightarrow{a.s.} \gamma\theta$, where $\hat{\theta}$ is given by the QMLE. ∎

The result of Theorem 3 means that it is possible to get unbiased estimates of the ratio of the regression coefficients, since

$$\frac{\gamma\theta_l}{\gamma\theta_m} = \frac{\theta_l}{\theta_m}, \quad l \neq m.$$

This could for example be of interest in applied research where one is interested in the relative effect of two treatments on an outcome.

Theorem 3 can be seen in light of the problem that it usually is not enough to base the choice of link function solely on the data [6]. Figure 2.3 illustrates a binary data set with 100 observations where the correct link function, the logit, is compared against the probit link and the complementary log-log link. It is apparent that the logit and the probit link functions in particular are close to each other. The following section will illustrate how a misspecified link function can affect the estimated parameters.

[Figure 2.3: Fitted probabilities with different link functions. The solid line is the correct logit link function, $(1 + \exp(-X))^{-1}$; the dashed line is the probit link function, φ(X), where φ is the standard normal distribution function; and the dotted line is the complementary log-log link function, $1 - \exp(-\exp(X))$.]

3 Simulation study 1


3.1 Design A

Two normally distributed random variables are generated, $X_1 \sim N(4, 2)$ and $X_2 \sim N(5, 2)$. Also, a Bernoulli distributed random variable is generated, $T \sim \text{Bern}(e(X))$, where

$$e(X) = \frac{1}{1 + \exp(-0.3X_1 + 0.24X_2)}.$$

The misspecified model is

$$h(X) = \phi(\beta_0 + \beta_1X_1 + \beta_2X_2),$$

where φ denotes the standard normal distribution function. The scale parameter γ is estimated by $\hat{\gamma}_1 = \hat{\beta}_1/\beta_1$ and $\hat{\gamma}_2 = \hat{\beta}_2/\beta_2$, and the estimates are expected to get closer to each other when the sample size increases. We will use three different sample sizes, n = 100, n = 500 and n = 1000, each with 1000 replicates.
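A minimal R sketch of one replicate of Design A (the thesis uses 1000 replicates per sample size). The parameterization N(µ, σ²) and the sign convention of the linear predictor, chosen so that β1 = −0.3 and β2 = 0.24 as in Table 1, are assumptions:

```r
# Sketch of one replicate of Design A: true logit model, probit estimation.
set.seed(1)
n  <- 1000
x1 <- rnorm(n, 4, sqrt(2))                       # assuming N(mu, sigma^2)
x2 <- rnorm(n, 5, sqrt(2))
t  <- rbinom(n, 1, plogis(-0.3 * x1 + 0.24 * x2))

correct <- glm(t ~ x1 + x2, family = binomial("logit"))   # true link
wrong   <- glm(t ~ x1 + x2, family = binomial("probit"))  # misspecified link

gamma1 <- coef(wrong)["x1"] / -0.3   # scale estimates of Theorem 3
gamma2 <- coef(wrong)["x2"] /  0.24
c(coef(correct)[-1], gamma_diff = unname(gamma2 - gamma1))
```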

3.2 Results A


[Figure 3.1: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.]

[Figure 3.2: QQ-plots of the QMLE when the link function is misspecified, for 500 observations and 1000 replicates.]

Table 1: Comparison between the correctly specified model and a misspecified model, Design A.

    n      Specification                β1     β2     β̂1       β̂2      γ̂2 − γ̂1
    100    Correctly specified model   −0.3   0.24   −0.324   0.246
    500    Correctly specified model   −0.3   0.24   −0.299   0.245
    1000   Correctly specified model   −0.3   0.24   −0.300   0.240
    100    Misspecified link function  −0.3   0.24   −0.196   0.149   −0.034
    500    Misspecified link function  −0.3   0.24   −0.183   0.150    0.015
    1000   Misspecified link function  −0.3   0.24   −0.183   0.147    0.001

We see that the MLE gives coefficient estimates that get closer to the true coefficients with increasing sample size, and that the estimates coincide with the true values for a sample size of n = 1000 (rounded off at the third decimal). For the QMLE we see that β̂1 is overestimated and β̂2 is underestimated for every sample size. The gamma estimates γ̂1 and γ̂2 get closer to each other for increasing sample size and differ only in the third decimal for a sample size of n = 1000, providing an empirical illustration of Theorem 3.

3.3 Design B

In a second simulation design, three uniformly distributed random variables are generated, $X_1, X_2, X_3 \sim U(0, 1)$. Also, a Bernoulli distributed random variable is generated, $T \sim \text{Bern}(e(X))$, where

$$e(X) = \frac{1}{1 + \exp(2X_1 + X_2 - 3X_3)}.$$

Two misspecified models are used,

$$m(X) = \frac{1}{1 + \exp(\beta_1X_1 + \beta_2X_2)}$$

and

$$n(X) = \phi(\beta_0 + \beta_1X_1 + \beta_2X_2),$$

where m(X) omits the covariate $X_3$ and n(X) both omits $X_3$ and misspecifies the link function. A sketch of this design is given below.
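As with Design A, the sign convention of the linear predictor below is an assumption chosen so that the fitted coefficients of the correct model match Table 2:

```r
# Sketch of one replicate of Design B: omitted covariate (m), and omitted
# covariate plus wrong link (n).
set.seed(1)
n  <- 1000
x1 <- runif(n); x2 <- runif(n); x3 <- runif(n)
t  <- rbinom(n, 1, plogis(2 * x1 + x2 - 3 * x3))

correct <- glm(t ~ x1 + x2 + x3, family = binomial("logit"))
m_fit   <- glm(t ~ x1 + x2,      family = binomial("logit"))   # m(X)
n_fit   <- glm(t ~ x1 + x2,      family = binomial("probit"))  # n(X)
```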


3.4 Results B

As in Design A, the coefficient estimates in the QQ-plots in Figures 3.3, 3.4 and 3.5 seem to follow the straight line reasonably well, giving empirical support to the distributional limit results of the MLE and the QMLE, respectively.

[Figure 3.3: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.]

[Figure 3.4: QQ-plots for the QMLE of the two coefficients, using m(X), for 500 observations and 1000 replicates.]

[Figure 3.5: QQ-plots for the QMLE of the two coefficients, using n(X), for 500 observations and 1000 replicates.]

Table 2 gives the true coefficients, the MLE estimates and the QMLE estimates for both misspecifications. As in Table 1, the difference between the gamma estimates is displayed.

Table 2: Comparison between the correctly specified model and two different misspecified models, Design B.

    n      Specification                      β1   β2   β3    β̂1      β̂2      γ̂2 − γ̂1
    100    Correctly specified model          2    1    −3    2.281   1.167
    500    Correctly specified model          2    1    −3    2.016   1.038
    1000   Correctly specified model          2    1    −3    2.008   0.995
    100    Omitted variable                   2    1    −3    1.884   0.965    0.023
    500    Omitted variable                   2    1    −3    1.717   0.875    0.017
    1000   Omitted variable                   2    1    −3    1.707   0.849   −0.004
    100    Wrong link and omitted variable    2    1    −3    1.156   0.593    0.015
    500    Wrong link and omitted variable    2    1    −3    1.063   0.540    0.009
    1000   Wrong link and omitted variable    2    1    −3    1.058   0.525   −0.004

For the first misspecified model, m(X), the coefficient estimates are biased, but the gamma estimates get closer to each other with increasing sample size, also giving an empirical illustration of Theorem 3.

The second misspecified model of Design B, n(X), deals with a situation that is not covered by any theorem, since we both omit a covariate and misspecify the link function. The parameter estimates underestimate the true coefficients but, as has been the case for the other misspecifications, the scale estimates are close for every sample size and get closer with increasing sample size. Hence it seems like we have an up-to-scale convergence for this setting as well.

So far we have been concerned with the QMLE and have stated and illustrated some of its characteristics. Next, we will look not only at one type of estimator but at a whole class of estimators.

4 Theory part 2

4.1 M-estimators

A parametric model in general includes two parts, a systematic part and a random part. For linear regression models the researcher needs to specify both the conditional mean of the outcome variable given the explanatory variables (the systematic part) and the distribution of the error term (the random part). For GLMs we also need to specify a link function that links the systematic part to the random part, a misspecification of which was studied in Section 3. Both the random part and the systematic part are important when constructing the likelihood to be used when conducting inference, and both parts can be misspecified [2]. As has been pointed out in Section 2, it can be questioned how likely it is to know the complete functional form of the parametric model but not θ, and so far we have been concerned with the question of how the inference is affected if the model is misspecified. In addition to the results presented so far in the thesis, several suggestions have been proposed over the years that deal with model assumption violations. Huber [10] for instance introduced the so-called robust statistics, whose purpose was to adjust classical inference methods so that they would not be sensitive to violations of the model assumptions, e.g. outliers and deviations from normality. His proposed estimator was a special case of a broader class of estimators. As can be noted, several estimators are given by minimizing a certain function. The quasi-maximum likelihood estimator for instance is given by maximizing $\prod_{i=1}^{n} f(X_i;\theta)$ (which is equivalent to minimizing $-\prod_{i=1}^{n} f(X_i;\theta)$). Huber used this idea and generalized it so that it did not only include one estimator but a whole class of estimators [10]. If we consider $X_1, X_2, \ldots, X_n$ that are i.i.d.⁵ random variables with distribution function F, a 1 × p parameter vector θ and a known p × 1 function ψ independent of i and n, an M-estimator⁶ then satisfies

$$\sum_{i=1}^{n}\psi(X_i, \hat{\theta}) = 0. \quad (4.1)$$


We can redefine θ∗, the parameter that minimizes the KLIC, using M-estimation theory, so that θ∗ is the parameter solving

$$E_F\,\psi(X_1, \theta_*) = \int \psi(x, \theta_*)\,dF(x) = 0. \quad (4.2)$$

If there exists a unique solution to (4.2), then in general $\hat{\theta} \xrightarrow{p} \theta_*$ as $n \to \infty$, where $\hat{\theta}$ is the solution to (4.1) [27]. Furthermore, it can be shown that

$$\sqrt{n}(\hat{\theta} - \theta_*) \xrightarrow{d} N(0, V(\theta_*))$$

as $n \to \infty$, where $V(\theta_*) = A(\theta_*)^{-1}B(\theta_*)A(\theta_*)^{-1}$, the sandwich matrix of (2.8). To estimate $V(\theta_*)$ we use the empirical sandwich estimator given by

$$V_n(X;\hat{\theta}) = A_n(X;\hat{\theta})^{-1}B_n(X;\hat{\theta})A_n(X;\hat{\theta})^{-1}, \quad (4.3)$$

where

$$A_n(X;\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left(-\psi'(X_i, \hat{\theta})\right) \quad \text{and} \quad B_n(X;\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\psi(X_i, \hat{\theta})\,\psi(X_i, \hat{\theta})^T.$$
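As a concrete illustration, the sketch below computes Huber's location M-estimator [10] by solving (4.1) with a root finder and then estimates its variance with the empirical sandwich estimator (4.3); the data and the tuning constant k are hypothetical choices:

```r
# Sketch: Huber's location M-estimator and its empirical sandwich variance.
set.seed(1)
x <- c(rnorm(95), rnorm(5, mean = 10))   # contaminated normal: 5% outliers
k <- 1.345

psi  <- function(x, theta) pmin(pmax(x - theta, -k), k)    # Huber psi
dpsi <- function(x, theta) as.numeric(abs(x - theta) < k)  # equals -psi'(x, theta)

# Solve the estimating equation (4.1)
theta_hat <- uniroot(function(th) sum(psi(x, th)), range(x))$root

# Empirical sandwich estimator (4.3), scalar case
A_n <- mean(dpsi(x, theta_hat))
B_n <- mean(psi(x, theta_hat)^2)
V_n <- B_n / A_n^2
sqrt(V_n / length(x))   # standard error of theta_hat
```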

Huber [10, 11] derived the asymptotic properties of the M-estimator, and because of its general appearance, the class of M-estimators includes several other classes of estimators. We will give three examples.

Example 1 - Ordinary least squares

The first example is the least-squares estimator. Consider the linear regression model $Y = X^T\beta + \epsilon$, where Y is the n × 1 response vector, X is a p × n matrix of explanatory variables measured on n observations, β is a p × 1 coefficient vector and $\epsilon \sim N(0, 1)$ is an error term. We estimate the regression coefficients by $\hat{\beta} = (X^TX)^{-1}X^TY$. This can also be rewritten as an M-estimator by letting $\psi(Y_i, X_i, \hat{\beta}) = (Y_i - X_i^T\hat{\beta})X_i$, so that we get

$$\sum_{i=1}^{n}\psi(Y_i, X_i, \hat{\beta}) = \sum_{i=1}^{n}(Y_i - X_i^T\hat{\beta})X_i = 0,$$

where we, by solving for $\hat{\beta}$, get that $\hat{\beta} = (X^TX)^{-1}X^TY$.
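A quick numerical check of this equivalence on hypothetical data:

```r
# Sketch: verify that the OLS fit solves sum_i (Y_i - X_i' beta_hat) X_i = 0.
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))                    # intercept plus one covariate
Y <- X %*% c(1, 2) + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)  # (X'X)^{-1} X'Y
colSums(as.vector(Y - X %*% beta_hat) * X) # ~ 0 up to numerical error
```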

∎

Example 2 - QMLE

The QMLE of Subsection 2.3 also is an M-estimator. Looking at the QMLE, which is given by the parameter θ solving

$$\max_{\theta\in\Theta} L_n(X_n, \theta) \equiv n^{-1}\sum_{i=1}^{n}\log f(X_i, \theta),$$

it can first be concluded that this is the same as minimizing $-L_n(X_n, \theta)$. Thus, by letting

$$\psi(x, \hat{\theta}) = \frac{\partial\log f(X;\hat{\theta})}{\partial\hat{\theta}^T},$$

the optimization problem of the QMLE can be re-expressed as

$$\sum\psi(x, \hat{\theta}) = \sum\frac{\partial\log f(X;\hat{\theta})}{\partial\hat{\theta}^T} = 0,$$

meaning that the QMLE indeed is an M-estimator. ∎

Example 3 - Causal inference

In the statistical theory of causal inference we are interested in estimating the average causal effect (ACE) of an intervention or treatment on an outcome of interest. If $Y_1$ denotes the potential outcome under treatment and $Y_0$ the potential outcome under non-treatment, the causal effect would be the difference $Y_1 - Y_0$⁷. The fundamental problem of causal inference is that we wish to estimate the ACE for every person in the population, which is impossible since we for every individual only observe either $Y_1$ or $Y_0$ [9]. Therefore the outcomes $Y_1$ and $Y_0$ are called potential, and the goal of inference changes to estimating the population average treatment effect, $\tau = E(Y_1 - Y_0)$, which can be identified under certain conditions. Because of background covariates that confound the relationship between the treatment and the outcome, different estimators have been proposed that take this problem into account. Several estimators use the so-called propensity score, defined as the conditional probability of receiving the treatment given the covariates, $P(T = 1|X) \equiv e(X)$, where T is an indicator variable equal to one if a person has received the treatment and zero if not, and X is a covariate vector. Rosenbaum and Rubin [23] have shown that given $(Y_1, Y_0) \perp\!\!\!\perp T|X$, it is sufficient to condition on the propensity score to achieve balance on the covariates for individuals in the different treatment groups that have the same propensity score, i.e. $X \perp\!\!\!\perp T|e(X)$. Usually the propensity score is unknown and has to be estimated. One common way is to assume that e(X) can be described by a parametric model and use logistic regression to estimate it. We express this as

$$P(T = 1|X) = \frac{\exp(X^T\beta)}{1 + \exp(X^T\beta)},$$

and the coefficients can be estimated by e.g. maximum likelihood. Usually the treatment variable T is modeled as a sequence of independent Bernoulli trials with treatment probability e(X). The likelihood of the coefficients is then given by

$$L(\beta|T) = \prod_{i=1}^{n} e(X_i, \beta)^{T_i}(1 - e(X_i, \beta))^{1-T_i},$$

with log-likelihood

$$l(\beta|T) = \sum_{i=1}^{n}\left[\log(1 - e(X_i, \beta)) + T_i\log\left(\frac{e(X_i, \beta)}{1 - e(X_i, \beta)}\right)\right].$$

To get the coefficient estimates we take the derivative of the log-likelihood function with respect to β,

$$\frac{\partial\log(1 - e(X_i, \beta))}{\partial\beta} = -\frac{\partial e(X_i, \beta)}{\partial\beta}\,\frac{1}{1 - e(X_i, \beta)} = -\frac{e(X_i, \beta)}{e(X_i, \beta)(1 - e(X_i, \beta))}\,\frac{\partial e(X_i, \beta)}{\partial\beta} \quad (4.4)$$

and

$$\frac{\partial}{\partial\beta}\left[T_i\log\left(\frac{e(X_i, \beta)}{1 - e(X_i, \beta)}\right)\right] = \frac{T_i}{e(X_i, \beta)(1 - e(X_i, \beta))}\,\frac{\partial e(X_i, \beta)}{\partial\beta}. \quad (4.5)$$

Combining the equations in (4.4) and (4.5), setting the sum equal to zero and solving for β gives the MLE of the coefficients. Expressed as an M-estimator we have that

$$\sum_{i=1}^{n}\psi(T_i, X_i, \beta) = \sum_{i=1}^{n}\frac{T_i - e(X_i, \beta)}{e(X_i, \beta)(1 - e(X_i, \beta))}\,\frac{\partial}{\partial\beta}e(X_i, \beta) = 0. \quad (4.6)$$

This is indeed what was done in Simulation 1, where the model (which could have been a model for e(X)) was misspecified in the link function. Observe though that in equation (4.6) it is assumed that e(X) is correctly specified.

When interested in estimating τ, the next step could be to use the estimate of e(X) in an estimator of the ACE. One such proposed estimator of the ACE is the inverse probability weighting (IPW) estimator. If the observed outcome is defined as $Y = TY_1 + (1 - T)Y_0$, the IPW estimator is expressed as

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_iY_i}{e(X_i)} - \frac{(1-T_i)Y_i}{1 - e(X_i)}\right]. \quad (4.7)$$

A proof that τ can be identified from the observed data using the IPW estimator is given in the appendix. This estimator can be expressed as an M-estimator by letting

$$g(x) = \frac{T_iY_i}{e(X_i)} - \frac{(1-T_i)Y_i}{1 - e(X_i)}.$$

We then express the IPW estimator as the solution to $\sum\psi(g(x), \hat{\tau}) = \sum(g(x) - \hat{\tau}) = 0$, because breaking $\hat{\tau}$ out of the summation we get that $n\hat{\tau} = \sum g(x)$ and finally that

$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^{n}g(x_i) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_iY_i}{e(X_i)} - \frac{(1-T_i)Y_i}{1 - e(X_i)}\right].$$
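A minimal R sketch of (4.7) with an estimated logistic propensity score model; the data-generating design below is a hypothetical choice:

```r
# Sketch: IPW estimate of the ACE with an estimated propensity score.
set.seed(1)
n  <- 2000
x  <- rnorm(n)
tr <- rbinom(n, 1, plogis(0.8 * x))   # treatment assignment
y  <- 2 * tr + x + rnorm(n)           # true ACE = 2

ps    <- glm(tr ~ x, family = binomial("logit"))
e_hat <- fitted(ps)                   # estimated propensity scores

tau_hat <- mean(tr * y / e_hat - (1 - tr) * y / (1 - e_hat))  # Equation (4.7)
tau_hat
```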

If the propensity score is estimated, the estimating equations of (4.7) and (4.6) can be stacked into one system:

$$\sum_{i=1}^{n}\psi(T_i, Y_i, X_i, \beta, \tau) = \begin{pmatrix} \sum_{i=1}^{n}\left[\dfrac{T_iY_i}{e(X_i, \beta)} - \dfrac{(1-T_i)Y_i}{1 - e(X_i, \beta)} - \tau\right] \\[2ex] \sum_{i=1}^{n}\dfrac{T_i - e(X_i, \beta)}{e(X_i, \beta)(1 - e(X_i, \beta))}\dfrac{\partial}{\partial\beta}e(X_i, \beta) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \quad (4.8)$$

The equation system (4.8) thus includes a parametric part and a nonparametric part, displaying the flexibility of the M-estimator in that it enables equations to be stacked onto each other to, together, yield the sought estimator. This is called a partial M-estimator [27], and its properties are similar to those in the general approach given by Randles [21], which concerns statistics that contain estimated parameters. ∎

In Subsection 2.3 we established that the QMLE has the KLIC-minimizer as its almost sure limit, but this limit is not necessarily the true parameter. Assuming the same (incorrect) parametric family but now using M-estimation theory, we in general have a parameter estimate that is a unique solution to (4.1). We have shown that this estimator could for instance be the QMLE, and stated that $\hat{\theta} \xrightarrow{p} \theta_*$ in general, where θ∗ is the solution of (4.2). So what is θ∗? Equation (4.2) does not say that θ∗ will be the true parameter, only that it is a solution to this equation⁸. So, for a misspecified model, using the fact that the QMLE is an M-estimator, the θ∗ given by Equation (4.2) will be equal to the KLIC-minimizer given by (2.5).

A contribution of M-estimation theory is the insight that in situations where the model is misspecified, the variance of the estimator will no longer equal the inverse of the Fisher information. Thus, we have to use the sandwich estimator given by (4.3) to estimate the variance, which White [33] showed for the QMLE and Huber [11] for a whole class of M-estimators. To exemplify how the empirical sandwich estimator works, we revisit simulation design A of Simulation 1, conducted in Section 3. The covariance matrix of the approximated sampling distribution for the correctly specified model using a sample size of n = 100 is given by

$$\hat{\Sigma} = \begin{pmatrix} 0.645 & -0.054 & -0.080 \\ -0.054 & 0.018 & -0.004 \\ -0.080 & -0.004 & 0.020 \end{pmatrix}$$

and the empirical sandwich estimate of the covariance matrix for the same sample size for the misspecified estimation model is given by

$$\hat{\Sigma}_{\text{sand}} = \begin{pmatrix} 0.234 & -0.021 & -0.028 \\ -0.021 & 0.006 & -0.001 \\ -0.028 & -0.001 & 0.006 \end{pmatrix}.$$

Clearly the estimated covariance matrices differ. It is important to point out that if the estimated covariance matrix differs substantially from the empirical sandwich covariance matrix, this is a clear indication that the model used is misspecified and that model diagnostics are called for [14]. As this is the case for the above example, the advice would be to respecify the model to try to get a better fit. In this way, the sandwich estimator is used as a diagnostic tool.

5 Simulation study 2

In a final simulation study we revisit Example 3 of Section 4 to illustrate the consequences of model misspecification when estimating the ACE using the IPW estimator. We also use the fact that the IPW estimator is an M-estimator to calculate the sandwich estimate of the standard deviation and compare it with the standard deviation of the approximate sampling distribution. For the IPW estimator to be an unbiased estimator of the ACE, ê(X) has to be correctly specified. This means that the simulation study will display the amount of bias that a misspecified propensity score model introduces in the IPW estimator when estimating the ACE.

5.1 Design

Two uniformly distributed random variables are generated, $X_1, X_2 \sim U(2, 4)$. The potential outcomes are

$$Y_1 = 1 + 3X_1 + 4X_2 + \epsilon_1,$$
$$Y_0 = 2 + 2X_1 + 3X_2 + \epsilon_0,$$

where $\epsilon_t \sim N(0, 1)$, $t = 0, 1$, meaning that $\tau = E(Y_1 - Y_0) = 5$. Using n independent Bernoulli trials, the treatment variable is generated as $T \sim \text{Bern}(e(X_1, X_2))$, with the probability of being treated

$$P(T = 1|X_1, X_2) = e(X_1, X_2) = \frac{1}{1 + \exp(-0.5 + 1.5X_1 - 1.1X_2)}.$$

A misspecified model is generated as

$$q(X_1, X_2) = \phi(\beta_0 + \beta_1X_1 + \beta_2X_2).$$

The ACE will be estimated using the IPW estimator given by (4.7). Lunceford and Davidian [17] have derived the estimates of the matrices A and B for the IPW estimator and stated the sandwich estimator of the variance of the IPW estimator as $n^{-2}\sum_{i=1}^{n}(\cdots)$, with one of its components given by

$$\hat{E}^{-1} = \frac{1}{n}\sum_{i=1}^{n}\hat{e}(X_i)(1 - \hat{e}(X_i))X_iX_i^T.$$

The ACE will for both propensity score models be estimated using sample sizes n = 500, n = 1000 and n = 3000, with 1000 replicates for each sample size.
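A minimal R sketch of one replicate of this design (the thesis uses 1000 replicates per sample size):

```r
# Sketch of Simulation study 2: IPW with a correctly specified logit
# propensity model versus a misspecified probit model. True tau = 5.
set.seed(1)
n  <- 1000
x1 <- runif(n, 2, 4); x2 <- runif(n, 2, 4)
y1 <- 1 + 3 * x1 + 4 * x2 + rnorm(n)
y0 <- 2 + 2 * x1 + 3 * x2 + rnorm(n)

e  <- 1 / (1 + exp(-0.5 + 1.5 * x1 - 1.1 * x2))  # true propensity score
tr <- rbinom(n, 1, e)
y  <- tr * y1 + (1 - tr) * y0                    # observed outcome

ipw <- function(e_hat) mean(tr * y / e_hat - (1 - tr) * y / (1 - e_hat))

e_true  <- fitted(glm(tr ~ x1 + x2, family = binomial("logit")))
e_wrong <- fitted(glm(tr ~ x1 + x2, family = binomial("probit")))

c(true_ps = ipw(e_true), wrong_ps = ipw(e_wrong))
```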

5.2 Results

The entries in Table 3 give the bias and standard deviation of the IPW estimator using the correctly specified and the misspecified propensity score model, respectively. For every sample size we approximate the sampling distribution of the estimator and calculate the standard deviation (SD). This computation is compared against the sandwich estimator. In addition, the mean squared error (MSE) is calculated using the variance of the approximated sampling distribution.

Table 3: Results for the IPW estimator when misspecifying the link function. SD, standard deviation; SDsandwich, standard deviation from the sandwich estimator; MSE, mean squared error; PS, propensity score.

    n      Specification   Bias     SD      SDsandwich   MSE
    500    PS true         −0.043   0.765   1.116        0.587
    500    PS false         0.166   0.807   1.129        0.679
    1000   PS true         −0.010   0.513   0.547        0.263
    1000   PS false         0.193   0.523   0.553        0.311
    3000   PS true          0.001   0.292   0.192        0.085
    3000   PS false         0.212   0.294   0.194        0.131

The results of Table 3 show that the misspecified propensity score model makes the IPW estimator biased, while the IPW estimator that uses the correctly specified propensity score model gives estimates that are close to the true value. Using the misspecified propensity score model, the bias of the IPW estimator increases with increasing sample size. Both these results are in accordance with those given in [29], where it is investigated how the ACE estimate of the IPW estimator (among others) is affected by different types of misspecifications of e(X).

The standard deviations of the approximated sampling distributions are similar whether or not the propensity score model is misspecified. Lastly we see that the MSE of the IPW estimator is smaller when e(X) is correctly specified, for every sample size.

Theoretically, for a correctly specified model the variance estimated by the sandwich estimator should coincide with the inverse of the Fisher information, the variance limit of the MLE. The distance between the sandwich estimate and the standard deviation of the approximate sampling distribution decreases when the sample size increases from n = 500 to n = 1000, but when the sample size increases from n = 1000 to n = 3000, the distance increases. It thus seems like the sandwich estimator does not converge.

6 Final recommendations

White [33] noted that since $-A(\theta_*) = B(\theta_*)$ only when the model is correctly specified, we can test for model misspecification by testing the null hypothesis $A(\theta_*) + B(\theta_*) = 0$, where $A(\theta_*)$ and $B(\theta_*)$ can be consistently estimated by $A(\hat{\theta})$ and $B(\hat{\theta})$, respectively. He called this the information matrix test. He furthermore adjusted the Hausman test [8], which basically measures the distance between the MLE and the QMLE. Since this distance asymptotically reaches zero for correctly specified models but generally not otherwise, it ought to indicate when the QMLE is inconsistent. We refer to [33] for a formal review and the derived test statistics of the two tests. As stated in the beginning of this thesis, the theory of misspecified models can be seen as one component of the theory of statistical modeling. To connect further to that way of thinking about the presented theory, we state White's recommendations when building an estimation model (an informal illustration of step 2 is sketched after the list):

1. Use maximum likelihood to estimate the parameters of the model.
2. Apply the information matrix test to check for model misspecification.
3. If the null hypothesis is not rejected we can go on with our MLE estimates; if it is rejected, we investigate the misspecification with a Hausman test.
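The sketch below only illustrates the idea behind step 2: under the null hypothesis $-A(\theta_*) = B(\theta_*)$, so the two matrix estimates should be close. It is not White's formal test statistic; the bread() and meat() functions of the R package sandwich are used to estimate −A and B, and the data are hypothetical:

```r
# Rough sketch of the idea behind the information matrix test.
# solve(bread(fit)) estimates -A(theta_hat) and meat(fit) estimates B(theta_hat).
library(sandwich)

set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(1 + x))

fit_ok  <- glm(y ~ x, family = binomial("logit"))   # correctly specified
fit_bad <- glm(y ~ x, family = binomial("probit"))  # misspecified link

solve(bread(fit_ok));  meat(fit_ok)    # close to each other
solve(bread(fit_bad)); meat(fit_bad)   # tend to differ under misspecification
```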


7 Discussion

The purpose of this thesis was to review parts of the theory for misspecified parametric models. The thesis started with a discussion of model misspecification for parametric models using the KLIC. From that, the QMLE was derived. It was stated that the QMLE is √n-convergent and also converges almost surely to the parameter which minimizes the KLIC, given the model and the data. For certain cases, the QMLE also converges up to scale towards the true parameter. These results were illustrated in Simulation 1. A broader class of estimators, M-estimators, was then introduced and its properties stated. Lastly, a second simulation study was conducted where the consequences of model misspecification for a partial M-estimator, the IPW estimator, were illustrated.

This thesis has stated results leading up to a parameter estimate that has limit θ∗. One of the main questions that this thesis aims to answer is whether we can learn something about θ, the true parameter that we actually are interested in, even in cases where the estimation model is misspecified. We have shown three situations where we, in spite of a misspecified model, can get unbiased estimates of the ratio of the parameters of interest and in that way gain knowledge about the true parameter. Interestingly, we have by simulation found a situation not covered by any theorem where this result also seems to hold. The up-to-scale convergence result also implies that misspecification of the link function will not cause any problem for hypothesis testing when testing whether the coefficient of interest is equal to zero. But the up-to-scale convergence result has only been stated to hold for certain situations, meaning that there is a lack of results for other types of misspecifications. If we for example omit a variable in a linear regression model with normally distributed errors, we might neither be able to give an unbiased estimate of the true coefficients nor of the coefficient ratio. This means that, even though the results presented in this thesis have contributed to the theory of statistical modeling, we don't have complete knowledge of what we can and should do when facing a misspecified model. This is thus possible content for further studies.

As implied in [14], readers should be suspicious when the robust and the standard errors of the sample differ. It is more reasonable to use the robust covariance matrix as a model check than as routine. King and Roberts [14] even argue that model diagnostics should be performed to the extent that the choice between classical and robust standard errors will not matter for the inference to be conducted. This is more restrictive than the recommendation of White, and the author at least thinks that it is important to report the steps made in the analysis and what restrictions these eventually put on the results.

A narrow confidence interval from an efficient but misspecified estimator may simply surround the wrong value. An estimator giving a broader interval might instead actually include the true parameter value. Efficiency is a desirable property of an estimator, but for misspecified models it could mean that you will just get a narrow interval around the wrong value.

Finally, this thesis has shed some light on the very basic assumptions of statistical modeling, in that we assume a true model that gives a complete description of the data generating process; that there, for example, exists a true parameter quantifying the relationship between an explanatory variable and the outcome of interest. This might not be true, yet we think of it in this way in order to meaningfully discuss the estimation model specified to gain knowledge about some phenomenon of interest.


8 Acknowledgements


References

[1] Akaike, H. [1973], Information theory and an extension of the likelihood principle, in 'Proceedings of the Second International Symposium on Information Theory'.

[2] Boos, D. D. and Stefanski, L. A. [2013], Essential statistical inference, Springer-Verlag New York.

[3] Cramér, H. [1946], Mathematical methods of statistics, Princeton University Press.

[4] Durbin, J. and Watson, G. S. [1950], 'Testing for serial correlation in least squares regression: I', Biometrika 37(3/4), pp. 409-428.

[5] Eicker, F. [1967], Limit theorems for regressions with unequal and dependent errors, in 'Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics', University of California Press, Berkeley, Calif., pp. 59-82.

[6] Faraway, J. J. [2006], Extending the linear model with R - Generalized linear, mixed effects and nonparametric models, CRC Press.

[7] Fisher, R. A. [1922], 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 222(594-604), 309-368.

[8] Hausman, J. A. [1978], 'Specification tests in econometrics', Econometrica 46(6), pp. 1251-1271.

[9] Holland, P. W. [1986], 'Statistics and causal inference', Journal of the American Statistical Association 81(396), pp. 945-960.

[10] Huber, P. J. [1964], 'Robust estimation of a location parameter', The Annals of Mathematical Statistics 35(1), pp. 73-101.

[11] Huber, P. J. [1967], 'The behavior of maximum likelihood estimates under nonstandard conditions', in 'Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability'.

[12] Huber, P. J. [1973], 'Robust regression: Asymptotics, conjectures and Monte Carlo', The Annals of Statistics 1(5), pp. 799-821.

[13] Huber, P. J. [1981], Robust statistics, Wiley, New York.

[14] King, G. and Roberts, M. E. [2014], 'How robust standard errors expose methodological problems they do not fix, and what to do about it', Political Analysis pp. 1-21.

[15] Kullback, S. and Leibler, R. A. [1951], 'On information and sufficiency', The Annals of Mathematical Statistics 22(1), pp. 79-86.

[16] Li, K.-C. and Duan, N. [1989], 'Regression analysis under link violation', The Annals of Statistics 17(3), pp. 1009-1052.

[17] Lunceford, J. K. and Davidian, M. [2004], 'Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study', Statistics in Medicine 23(19), 2937-2960.

[18] Manski, C. F. [1988], 'Identification of binary response models', Journal of the American Statistical Association 83(403), pp. 729-738.

[19] Perez-Cruz, F. [2008], Kullback-Leibler divergence estimation of continuous distributions, in '2008 IEEE International Symposium on Information Theory', pp. 1666-1670.

[20] R Core Team [2013], R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/

[21] Randles, R. H. [1982], 'On the asymptotic normality of statistics with estimated parameters', The Annals of Statistics 10(2), pp. 462-474.

[22] Rényi, A. [1961], On measures of entropy and information, in 'Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics', University of California Press, Berkeley, Calif., pp. 547-561.

[23] Rosenbaum, P. R. and Rubin, D. B. [1983], 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), pp. 41-55.

[24] Ruud, P. A. [1983], 'Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models', Econometrica 51(1), pp. 225-228.

[25] Schwarz, G. [1978], 'Estimating the dimension of a model', The Annals of Statistics 6(2), pp. 461-464.

[26] Shannon, C. E. [1948], 'A mathematical theory of communication', Bell System Technical Journal 27(3), 379-423.

[27] Stefanski, L. A. and Boos, D. D. [2002], 'The calculus of M-estimation', The American Statistician 56(1), 29-38.

[28] Viele, K. [2007], 'Nonparametric estimation of Kullback-Leibler information illustrated by evaluating goodness of fit', Bayesian Analysis 2(2), 239-280.

[30] Wald, A. [1949], 'Note on the consistency of the maximum likelihood estimate', The Annals of Mathematical Statistics 20(4), pp. 595-601.

[31] Wasserman, L. [2004], All of statistics - A concise course in statistical inference, 1 edn, Springer-Verlag New York.

[32] White, H. [1980], 'A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity', Econometrica 48(4), pp. 817-838.

[33] White, H. [1982], 'Maximum likelihood estimation of misspecified models', Econometrica 50(1), pp. 1-25.


A Proof of identifiability of the IPW estimator

To prove that the ACE can be identified from the observed data using the IPW estimator, we want to show that

$$E\left[\frac{TY}{e(X)} - \frac{(1-T)Y}{1 - e(X)}\right] = E\left[Y_1 - Y_0\right].$$

We will use that the observed outcome is defined as $Y = TY_1 + (1 - T)Y_0$ and the following assumptions:

A.1 $(Y_1, Y_0) \perp\!\!\!\perp T|X$
A.2 $0 < P(T = 1|X) < 1$.

We start by showing that $E\left[\frac{TY}{e(X)}\right] = E[Y_1]$:

$$E\left[\frac{TY}{e(X)}\right] = E\left[\frac{TY_1}{e(X)}\right] = E\left[E\left[\frac{TY_1}{e(X)}\,\Big|\,X\right]\right] = E\left[\frac{1}{e(X)}E\left[TY_1|X\right]\right] = E\left[\frac{1}{e(X)}E[T|X]\,E[Y_1|X]\right] = E\left[E[Y_1|X]\right] = E[Y_1],$$

where the first equality follows from the definition of Y, the second from the law of total expectation, the factorization $E[TY_1|X] = E[T|X]E[Y_1|X]$ from assumption A.1, and the cancellation from $E[T|X] = e(X)$, which is nonzero by A.2. Next we have that

$$E\left[\frac{(1-T)Y}{1 - e(X)}\right] = E\left[\frac{(1-T)Y_0}{1 - e(X)}\right] = E\left[E\left[\frac{(1-T)Y_0}{1 - e(X)}\,\Big|\,X\right]\right] = E\left[\frac{1}{1 - e(X)}E\left[(1-T)Y_0|X\right]\right] = E\left[\frac{1}{1 - e(X)}E[(1-T)|X]\,E[Y_0|X]\right] = E\left[E[Y_0|X]\right] = E[Y_0].$$

Combining the two results gives $E\left[\frac{TY}{e(X)} - \frac{(1-T)Y}{1 - e(X)}\right] = E[Y_1] - E[Y_0] = E[Y_1 - Y_0] = \tau$. ∎

References

Related documents

The second row of panels confirms some other well known properties of the (A)IPW estimators: the bias due to misspecification of the missingness mecha- nism for IPW, the

With this in mind we are ready to study some specific abstract logics. These are divided into three primary categories: infinitary logics, quantifier extensions and higher

Finally the conclusion of this work is that data of the AS350B1 was accurately transferred from HOST to FlightLab at least for the main isolated rotor and the fuselage

those in which knowledge is represented through belief bases instead of logic theories, and those in which the object of the epistemic change does not get the priority over the

set-theoretic boolean operations union, intersection and complement have first- order semantic analogues in the form of disjunction, conjunction and negation, it is not hard to see

The technique of the proof, which is based on encoding Turing ma- chine computations as finite structures, was reused by Fagin some 25 years later to prove his result putting

The model did not suggest Örebro Airport as the optimal choice for the flying resources in the second case, which implies that the decision to use this airport as the

Re-examination of the actual 2 ♀♀ (ZML) revealed that they are Andrena labialis (det.. Andrena jacobi Perkins: Paxton &amp; al. -Species synonymy- Schwarz &amp; al. scotica while