
Evaluating the Use of Ridge Regression and Principal Components in Propensity Score Estimators under Multicollinearity

Sarah Gripencrantz

Department of Statistics, Uppsala University

Supervisors: Johan Lyhagen and Ronnie Pingel

2014


Abstract

Multicollinearity can be present in the propensity score model when estimating average treatment effects (ATEs). In this thesis, logistic ridge regression (LRR) and principal components logistic regression (PCLR) are evaluated as alternatives to ML estimation of the propensity score model. ATE estimators based on weighting (IPW), matching and stratification are assessed in a Monte Carlo simulation study to evaluate LRR and PCLR.

Further, an empirical example of using LRR and PCLR on real data under multicollinearity is provided. Results from the simulation study reveal that under multicollinearity and in small samples, the use of LRR reduces bias in the matching estimator compared to ML. In large samples PCLR yields the lowest bias and typically the lowest MSE across all estimators. PCLR matched ML in bias under IPW estimation and in some cases had lower bias. The stratification estimator was heavily biased compared to matching and IPW, but both its bias and MSE improved when PCLR was applied, and in some cases under LRR. In the empirical example, the specification with PCLR was usually the most sensitive when a strongly correlated covariate was included in the propensity score model.

Keywords: Causal Inference, Propensity Score, IPW estimator, Stratification, Matching, Logistic Ridge Regression, Principal Components Logistic Regression


Contents

1 Introduction
2 Theory
  2.1 The Causal Inference Framework
    2.1.1 Modelling the Propensity Score
  2.2 Alternatives to ML Estimation of the Propensity Score
    2.2.1 Logistic Ridge Regression
    2.2.2 Principal Components Logistic Regression
  2.3 Estimators of the Average Treatment Effect
    2.3.1 Inverse Probability Weighting Estimators
    2.3.2 The Stratification Estimator
    2.3.3 The Matching Estimator
3 Simulation Study
  3.1 Simulation Design
  3.2 Results
    3.2.1 Results for the IPW Estimator
    3.2.2 Results for the Matching Estimator
    3.2.3 Results for the Stratification Estimator
4 Empirical Example
  4.1 Data
  4.2 Model
  4.3 Results
5 Discussion
6 Appendix

1 Introduction

Developed in the 1970s, the Causal Inference framework is today a widely accepted method for estimating causal treatment effects. The ideas were initially expressed by Neyman (1923) and Fisher (1926) concerning randomized experiments and were later extended to the setting of observational studies by Rubin (1974). The framework is often referred to as the Rubin Causal Model (RCM) or counterfactual model. Its simplicity and its extension to non-randomized experiments are among its strengths, making it applicable in fields as diverse as economics, medicine and psychology. The aim of the RCM is to estimate the causal effect of a treatment relative to another treatment, or to the lack of treatment. Depending on the study, the concept of treatment can vary widely and should not be thought of only as a typical medical experiment with treatment and control groups, although this terminology is sometimes used. Imbens and Wooldridge (2009) give examples of treatments being "job search assistance programs, educational programs, vouchers, laws or regulations, medical drugs, environmental exposure, or technologies".

In a controlled experiment the treatment assignment mechanism is random, making the treatment and error term independent and implying that any differences between units are due to treatment status only. The difference in the measured outcome variable between the treatment and control group is then an unbiased estimate of the causal effect of the treatment (Rubin, 1974). Because of this unbiasedness, randomization is clearly the desired and ideal situation, but as Rubin (1974) points out, it is not always achievable in practice, for ethical reasons or because randomization can be expensive. Observational studies, in comparison, are susceptible to pitfalls related to the non-randomization of the treatment. Some individuals decide themselves whether to participate in a program, and often that choice is related to the possible benefit of participating, also referred to as the problem of self-selection (Wooldridge, 2008). Further, not all variables influencing the treatment may be observable. This leads to bias in estimators of average treatment effects, and in the worst case the causal treatment effect is not identified. Under the assumption of strong ignorability, which is satisfied if the treatment is independent of the potential outcomes conditional on the covariates (unconfoundedness) and if all units have a chance of actually receiving the treatment (overlap), the average treatment effect can be identified and estimated in observational data (Rosenbaum and Rubin, 1983).

Several estimators of the causal treatment effect have been developed for the situation of an observational study. One way of estimating the effect is to adjust for pretreatment differences in the confounding variables, trying to create the randomness lacking in observational studies.

As shown by Rosenbaum and Rubin (1983), it is enough to use the propensity score (PS), the probability of assignment to treatment given the covariates, to control for differences in pretreatment variables. Estimators based on this approach are evaluated in this thesis. Controlling solely for the propensity score reduces the problem to one dimension, but it also raises the question of how to model the propensity score, which is almost never observed in observational studies. Since it is a probability, it is often specified as a probit or logit model.

The logit model is less sensitive than the probit model to distributional assumptions (Månsson and Shukur, 2011) and has been preferred in the literature. Variable selection is also an issue in modelling the propensity score. Overfitting the model by including many pretreatment variables can be a way to justify the unconfoundedness assumption (Millimet and Tchernis, 2009; Waernbaum, 2010), as well as an effort to escape omitted variable bias. In doing so, it is important to understand if and how possible correlation among the variables affects estimators of the causal treatment effect. Assuming the propensity score is modelled using logistic regression, its parameters are typically estimated by maximum likelihood (ML). As noted by Kibria et al. (2012), the mean squared error of the parameter vector can be inflated under multicollinearity; the authors demonstrate this in a simulation study (Kibria et al., 2012). Hence, the ML estimator can be unreliable under multicollinearity. It is not perfectly clear whether this effect carries over to propensity score estimators of causal treatment effects. This thesis aims to enhance the understanding of this topic by investigating how multicollinearity among covariates in the propensity score model affects three estimators of the average treatment effect (ATE). This is assessed using results from a Monte Carlo simulation and from an empirical study. Two alternatives to ML estimation designed to deal with multicollinearity, principal components logistic regression (PCLR) and logistic ridge regression (LRR), are considered in the thesis. These two alternatives are compared to ML estimation with the aim of seeing how PCLR and LRR perform under different degrees of multicollinearity and whether the ATE estimators can be improved, in terms of reduced bias, by applying PCLR and LRR.

Section 2 presents the theory of the RCM, the ATE estimators and the theory of the two alternatives to ML estimation of the PS. Section 3 outlines the simulation design and its results. In Section 4 the estimators and the alternatives to ML estimation are applied in an empirical example. The thesis concludes with a discussion in Section 5.


2 Theory

This section presents the underlying framework of the RCM, following the notation of Imbens and Wooldridge (2009). The theory behind PCLR and LRR is presented in Section 2.2. The estimators considered in this thesis, and some of their properties, are given in Section 2.3.

2.1 The Causal Inference Framework

Suppose we want to assess the causal effect of a treatment; it might be a job training program or some new legislation. There are observations on $N$ individuals. The indicator variable $W_i$ denotes whether individual $i$, $i = 1, \dots, N$, received treatment or not: $W_i = 0$ if individual $i$ was not assigned treatment and $W_i = 1$ if individual $i$ was exposed to treatment. $N_0$ is the number of untreated individuals and $N_1$ the number of treated individuals, so that $N_0 + N_1 = N$. Also observed for each individual $i$ is a $K$-vector of pretreatment covariates, $X_i$, where $X$ denotes the $N \times K$ matrix of covariates for all $N$ individuals.

Essential to the RCM is the idea of potential outcomes, $Y_i(W_i)$, and their distinction from the observed or realized outcome. $Y_i(0)$ represents the potential outcome that would be realized if individual $i$ is not treated and $Y_i(1)$ denotes the potential outcome if individual $i$ is treated. Clearly, as Holland (1986) points out and refers to as the Fundamental Problem of Causal Inference, it is impossible to observe both $Y_i(0)$ and $Y_i(1)$. Once one of them is observed, the other is the counterfactual outcome, representing what the outcome would have been if a treated individual had not been treated, or vice versa. The observed outcome is denoted by $Y_i$. Linking the concepts yields

$$Y_i = Y_i(W_i) = Y_i(0)(1 - W_i) + Y_i(1)W_i = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases} \tag{1}$$

The causal effect of the treatment is defined as the difference between the potential outcomes with and without treatment, $Y_i(1) - Y_i(0)$. Considered in this thesis is the average treatment effect, defined as the population average causal effect
$$\tau \equiv E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]. \tag{2}$$
Another estimand is the average treatment effect on the treated (ATT), which is the ATE for the treated individuals only. The ATE can also be defined for subgroups of the population by conditioning on certain covariates; such effects are called local average treatment effects (LATE) and were first introduced by Imbens and Angrist (1994).

Depending on the study, it might be more interesting to consider some of these effects, in particular if the treatment is not available to all units in the population. For the scope of this thesis it is enough to consider the ATE. For an extended discussion of treatment effect estimators, see for example Imbens and Wooldridge (2009).

In the context of a randomized experiment, the ATE can be estimated because the assignment mechanism is random, see Rubin (1974). However, in an observational study the mechanism is not randomized. Accordingly, the assumption of strong ignorability must be added in order to identify the ATE.

Assumption 1 (Strong Ignorability).

(i) $(Y_i(1), Y_i(0)) \perp\!\!\!\perp W_i \mid X_i$ (Unconfoundedness)

(ii) $0 < P(W_i = 1 \mid X_i) < 1$ (Overlap)

Assumption 1 of strong ignorability has two parts. Part (i), unconfoundedness, concerns the nature of the assignment mechanism: given the covariates, whether an individual receives treatment is assumed to be independent of the potential outcomes. Part (ii), referred to as overlap, states that the propensity score is bounded between zero and one and never attains these values. This means that for any setting of the covariates there is always a chance of observing both treated and untreated units (Wooldridge, 2008). Assumption 1 is crucial for identifying and estimating the ATE and is maintained throughout the thesis.

2.1.1 Modelling the Propensity Score

The probability of assignment to treatment, given the covariates, is defined as

$$e(X) \equiv P(W = 1 \mid X), \quad 0 < e(X) < 1, \tag{3}$$
and is referred to as the propensity score. Rosenbaum and Rubin (1983) show that
$$X_i \perp\!\!\!\perp W_i \mid e(X_i), \tag{4}$$

that is, the conditional distribution of the observed covariates $X_i$ given $e(X_i)$ is the same among treated and untreated individuals. This means that the propensity score can serve as a tool for adjusting for confounding covariates, and it turns out to be key in estimating causal treatment effects. The central result in Rosenbaum and Rubin (1983) is that under Assumption 1 of strong ignorability,
$$(Y_i(1), Y_i(0)) \perp\!\!\!\perp W_i \mid e(X_i). \tag{5}$$
In other words, if treatment is independent of the potential outcomes given the covariates, then treatment is independent of the potential outcomes given the propensity score. Hence, to control for confounding variables and produce unbiased estimates of the ATE, it is enough to control for the propensity score. In a non-randomized experiment the propensity score needs to be estimated, typically using a logit or probit model. In this thesis it is modelled with a logit model, for consistency with the rest of the literature. Following Kibria et al. (2012), the propensity score is modelled as follows. The dependent variable in the logistic regression is the treatment variable $W_i$, which is assumed to be Bernoulli distributed, $W_i \sim Be(\pi_i)$, taking the value one for a treated unit and zero for an untreated unit. The parameter $\pi_i$ is the expected value of $W_i$ given the covariates. The propensity score is modelled as

$$e(X_i) = P(W_i = 1 \mid X_1 = x_{i1}, \dots, X_p = x_{ip}) = \frac{\exp(\beta_0 + \sum_{j=1}^p x_{ij}\beta_j)}{1 + \exp(\beta_0 + \sum_{j=1}^p x_{ij}\beta_j)}, \tag{6}$$
where $x_{ij}$ is the $ij$th element of the $n \times p$ matrix $X$ of observed explanatory covariates and $\beta_j$ is an element of the $p \times 1$ vector $\beta$ of unknown parameters. Estimation of $\beta$ is mostly done using maximum likelihood (ML) together with iteratively weighted least squares. As given in Månsson and Shukur (2011), the ML estimator of $\beta$ is
$$\hat{\beta}_{ML} = (X'\hat{W}X)^{-1}X'\hat{W}\hat{z}, \tag{7}$$
where $\hat{W}$ is an $n \times n$ diagonal matrix with diagonal entries $\hat{\pi}_i(1 - \hat{\pi}_i)$, and $\hat{z}$ is an $n \times 1$ vector with $i$th entry
$$\hat{z}_i = \log(\hat{\pi}_i) + \frac{w_i - \hat{\pi}_i}{\hat{\pi}_i(1 - \hat{\pi}_i)}. \tag{8}$$
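To make the estimation concrete, here is a minimal R sketch (not from the thesis; the data are simulated and all variable names are placeholders) of ML estimation of the logit propensity score model in (6) via glm():

```r
# ML estimation of a logistic propensity score model, cf. (6)-(7).
# W is a 0/1 treatment indicator; X1, X2, X3 are hypothetical covariates.
set.seed(1)
n  <- 250
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
W  <- rbinom(n, 1, plogis(0.2 * X1 + 0.2 * X2 + 0.3 * X3))

fit_ml <- glm(W ~ X1 + X2 + X3, family = binomial(link = "logit"))
ps_ml  <- fitted(fit_ml)  # estimated propensity scores e-hat(X_i)
```

Internally, glm() fits this model by iteratively weighted least squares, matching the estimator in (7).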

2.2 Alternatives to ML Estimation of the Propensity Score

Two alternative ways of estimating the propensity score, designed to reduce problems of multicollinearity, are presented in this section. Section 2.2.1 discusses ridge regression and Section 2.2.2 presents estimation based on principal components. Both methods were first introduced for linear regression models, but in this thesis they are applied to the logistic regression model of the propensity score.


2.2.1 Logistic Ridge Regression

Ridge regression (RR) was introduced by Hoerl and Kennard (1970) as a solution to the problem of unstable ordinary least squares (OLS) estimates under multicollinearity in multiple linear regression. Consider the standard linear regression
$$Y = X\beta + \varepsilon, \tag{9}$$
where $Y$ is an $n \times 1$ vector of observed response values, $X$ is the $n \times p$ matrix of observations on $p$ covariates, $\beta$ is a $p \times 1$ vector of unknown parameters to be estimated and $\varepsilon$ is an $n \times 1$ vector of normally distributed error terms with mean zero and finite variance. The OLS estimator is $\hat{\beta}_{OLS} = (X'X)^{-1}X'Y$. Under multicollinearity, when some of the regressors can be expressed as linear combinations of the other variables, the OLS assumption of full rank is violated: $X'X$ approaches singularity and its inverse becomes unstable or undefined. This creates imprecise parameter estimates with large variances, so some variables may appear insignificant in the presence of other covariates even though they do explain variation in the dependent variable in the population. The idea of RR is to adjust the OLS estimator by adding an increment to $X'X$, forcing it to be non-singular. Doing so introduces bias into the estimator, but makes it more precise in terms of MSE; that is, the RR estimator trades increased bias for reduced variance. For the linear regression model in (9), the RR estimator is
$$\hat{\beta}_{RR} = (X'X + kI_p)^{-1}X'Y, \quad k \ge 0, \tag{10}$$
where $k$ is referred to as the ridge parameter. Hoerl and Kennard (1970) show that there exists $k \ge 0$ such that the MSE of the RR estimator is less than the MSE of the OLS estimator.

Since the 1970s the literature has mainly focused on developing estimators for $k$, in particular in the context of linear regression models (Månsson and Shukur, 2011). Examples of applying ridge regression in a logistic model to reduce problems of multicollinearity are Schaefer et al. (1984), Le Cessie and van Houwelingen (1992), Månsson and Shukur (2011) and Kibria et al. (2012). As discussed earlier, logistic regression estimation is performed by maximum likelihood, with the estimator defined in (7). The logistic ridge regression (LRR) estimator is defined in analogy with (10), adding bias in the form of $kI_p$:
$$\hat{\beta}_{LRR} = (X'\hat{W}X + kI_p)^{-1}X'\hat{W}X\,\hat{\beta}_{ML}. \tag{11}$$

A value of the ridge parameter must be chosen so that the decrease in variance outweighs the increase in bias. Many such estimators have been proposed for different degrees of correlation among the regressors; see for example Månsson and Shukur (2011) and Kibria et al. (2012).
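As a rough practical sketch: the thesis itself uses the ridge package (Cule, 2014), but a ridge-penalized logistic regression can also be fitted with glmnet by setting alpha = 0. Note that glmnet's lambda is a penalty weight playing the role of $k$ in (11), not necessarily on an identical scale. Reusing the simulated data from the sketch above:

```r
# Logistic ridge regression for the propensity score: L2-penalized logit.
library(glmnet)

X <- cbind(X1, X2, X3)                    # covariate matrix from the ML sketch
fit_lrr <- glmnet(X, W, family = "binomial",
                  alpha  = 0,             # alpha = 0 selects the ridge penalty
                  lambda = 0.3)           # one fixed value of the ridge parameter
ps_lrr <- as.numeric(predict(fit_lrr, newx = X, type = "response"))
```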

2.2.2 Principal Components Logistic Regression

Using principal components (PCs) to overcome problems with multicollinearity relies on the property that PCs are orthogonal to each other. The use of principal components also reduces the dimension of the data, since usually fewer PCs than original variables are used. The PCs are linear combinations of the covariates, and the maximum number of PCs equals the number of covariates. The first PC explains the maximum amount of variance, the second explains the maximum amount of variance not accounted for by the first PC, and so on, until the last PC. Together they account for the total variance in the data.

The PC technique is commonly used in linear regression to escape multicollinearity, where it is referred to as principal component regression (PCR), introduced by Massy (1965). Implementing the PC technique for the propensity score model means implementing it in the setting of logistic regression. Aguilera et al. (2006) proposed that using PCs as covariates in a logistic regression model can improve the estimation of the model parameters under multicollinearity. A typical problem arises when deciding how many PCs to use in the model. Since the variance accounted for by the discarded PCs is typically used as a measure of the information lost in the data reduction (Sharma, 1996), one approach is to choose a reduced number of PCs that accounts for most of the variance in the data. Other approaches are possible. As Aguilera et al. (2006) point out, the PCs that account for most of the variance need not be the ones most strongly related to the response variable. Aguilera et al. (2006) therefore propose a forward stepwise addition of PCs based on the conditional likelihood ratio test to find the optimal number of PCs to include in the logistic regression, so that the explanatory ability of the PCs is taken into account. Using results from simulation studies, Aguilera et al. (2004, 2006) show that this method is preferable to the natural order of explained variance because it gives better parameter estimates with a smaller number of PCs.

Some basic results of principal component analysis (PCA) are reviewed here and then extended to the case of logistic regression, following the notation of Aguilera et al. (2006).


Let $x = (x_1, x_2, \dots, x_p)'$ be a $p \times 1$ vector of covariates and suppose we have $n$ observations on these variables in the matrix $X_{n \times p}$, with column vectors denoted $X_1, X_2, \dots, X_p$. For simplicity and without loss of generality the observations are assumed to be centred, so that $\bar{x}_1 = \bar{x}_2 = \cdots = \bar{x}_p = 0$ and the corresponding sample covariance matrix is given by $S_{p \times p} = \frac{1}{n-1}X'X$. The matrix $S$ can be diagonalized as $S = V\Delta V'$, where the columns of $V$, $v_j$, $j = 1, \dots, p$, are the eigenvectors of $S$ and $\Delta = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$ contains the corresponding eigenvalues, ordered so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. The first sample principal component is defined as
$$Z_1 = Xv_1, \quad \text{with elements } z_{j1} = v_{11}x_{j1} + v_{12}x_{j2} + \cdots + v_{1p}x_{jp}, \quad j = 1, \dots, n, \tag{12}$$
and the complete matrix of PCs can be written as
$$Z_{n \times p} = (Z_1 \mid Z_2 \mid \dots \mid Z_p) = XV. \tag{13}$$
The eigenvalue $\lambda_1$ is the variance associated with the first PC, $Z_1$, which explains a proportion of the total variance given by
$$\left(\frac{\lambda_1}{\sum_{j=1}^p \lambda_j}\right) \times 100. \tag{14}$$
The logistic regression can now be formulated using the principal components of the observation matrix $X$. Expressing (6) in terms of PCs yields
$$e(X_i) = \frac{\exp(\beta_0 + \sum_{j=1}^p \sum_{k=1}^p z_{ik}v_{jk}\beta_j)}{1 + \exp(\beta_0 + \sum_{j=1}^p \sum_{k=1}^p z_{ik}v_{jk}\beta_j)} = \frac{\exp(\gamma_0 + \sum_{k=1}^p z_{ik}\gamma_k)}{1 + \exp(\gamma_0 + \sum_{k=1}^p z_{ik}\gamma_k)}, \tag{15}$$
where $z_{ik}$, $i = 1, \dots, n$, $k = 1, \dots, p$, are the elements of the PC matrix $Z = XV$ and $\gamma_k = \sum_{j=1}^p v_{jk}\beta_j$, $k = 1, \dots, p$. Postmultiplying (13) by $V'$ reconstructs the observation matrix as $X = ZV'$. Using this, the logistic model can be expressed in matrix form as
$$L = \bar{X}\beta = \bar{Z}\bar{V}'\beta = \bar{Z}\gamma, \tag{16}$$
where $\bar{Z} = (\mathbf{1}_{n \times 1} \mid Z)$ and $\bar{V} = \begin{pmatrix} 1 & 0_{1 \times p} \\ 0_{p \times 1} & V \end{pmatrix}$. A reduced number of PCs is chosen and the new parameters are then estimated by ML using only the PCs that remain in the model; for more details see Aguilera et al. (2006).

2.3 Estimators of the Average Treatment Effect

This section presents the three ATE estimators that will be evaluated under multicollinearity.

They all make use of the important result in (5) and thus are all based on the propensity score.


Regression methods, for example OLS, can also be used to estimate average treatment effects, but they suffer from problems due to the linear approximation of the regression function, as discussed in Imbens and Wooldridge (2009). This can lead to bias if the regression function is misspecified, which may be even more severe under collinearity (Imbens and Wooldridge, 2009). The first two estimators are a selection of the estimators discussed in Lunceford and Davidian (2004), and the third is proposed by Abadie and Imbens (2006) and Abadie and Imbens (2012).

2.3.1 Inverse Probability Weighting Estimators

The use of inverse probability weighting (IPW) estimators is based on the idea that the propensity score can be used to impute the missing values, the counterfactual outcomes that are unobservable. The IPW estimator of the ATE is defined in Lunceford and Davidian (2004) as

$$\hat{\tau}_{IPW} = \left(\sum_{i=1}^n \frac{W_i}{\hat{e}(X_i)}\right)^{-1} \sum_{i=1}^n \frac{W_i Y_i}{\hat{e}(X_i)} - \left(\sum_{i=1}^n \frac{1 - W_i}{1 - \hat{e}(X_i)}\right)^{-1} \sum_{i=1}^n \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)}, \tag{17}$$
where $\hat{e}(X_i)$ is the estimated propensity score for individual $i$. The large sample distribution of $\hat{\tau}_{IPW}$ is shown in Lunceford and Davidian (2004) to be
$$\sqrt{N}(\hat{\tau}_{IPW} - \tau) \xrightarrow{d} N(0, \sigma^2_{IPW}). \tag{18}$$
The variance of the IPW estimator, $\sigma^2_{IPW}$, is
$$\sigma^2_{IPW} = E\left[\frac{(Y_1 - \mu_1)^2}{e(X)} + \frac{(Y_0 - \mu_0)^2}{1 - e(X)}\right], \tag{19}$$

where $\mu_w = E(Y_w)$ for $w = 0, 1$. The IPW estimator defined in (17) belongs to a wide class of consistent semiparametric estimators, defined and discussed in Robins et al. (1994). It is consistent if the model for the propensity score is correctly specified; see Lunceford and Davidian (2004). If there are considerable differences in covariate distributions between the treated and control groups, estimated propensity scores close to zero or one tend to occur (Imbens and Wooldridge, 2009). Observations with such extreme propensity scores can receive very large weights, making the estimator imprecise, as pointed out by Imbens and Wooldridge (2009).
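A direct transcription of (17) into R (a sketch; Y, W and ps are placeholder vectors for the outcome, treatment and estimated propensity scores):

```r
# IPW estimator of the ATE with normalised weights, cf. (17).
ipw_ate <- function(Y, W, ps) {
  mu1 <- sum(W * Y / ps) / sum(W / ps)                          # weighted treated mean
  mu0 <- sum((1 - W) * Y / (1 - ps)) / sum((1 - W) / (1 - ps))  # weighted control mean
  mu1 - mu0
}
```

The normalising sums in the two leading factors are what keep extreme weights from pushing each weighted mean outside the range of the observed outcomes.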

2.3.2 The Stratification Estimator

The stratification estimator, also known as subclassification, builds on the result in (5), by which it is enough to control for the propensity score in order to balance the data and obtain the ATE. The idea is that individuals sharing the same propensity score can be compared directly and without bias or, as expressed by Lunceford and Davidian (2004): "Because treatment exposure is essentially at random for individuals with the same propensity value, we expect mean comparisons within this group to be unbiased".

First, $\hat{\beta}_{ML}$ is obtained from (7) using ML estimation, and the estimated propensity score $\hat{e}(X_i, \hat{\beta}_{ML})$ is calculated for each individual. Based on the sample quantiles of $\hat{e}(X_i)$, $K$ strata are formed such that the $j$th cutpoint $\hat{q}_j$, $j = 1, \dots, K$ (with $\hat{q}_0 = 0$ and $\hat{q}_K = 1$), satisfies that the proportion of $\hat{e}(X_i) \le \hat{q}_j$ is approximately $j/K$; that is, the strata contain approximately equal numbers of observations. Within each stratum, the sample mean difference between treated and untreated units is calculated, and the ATE is then found by weighting these differences by the proportion of observations in each stratum.

Let $\hat{Q}_j = (\hat{q}_{j-1}, \hat{q}_j]$ denote the $j$th of the $K$ strata, let $n_j = \sum_{i=1}^n 1(\hat{e}_i \in \hat{Q}_j)$ be the number of individuals in stratum $j$, and let $n_{1j} = \sum_{i=1}^n W_i 1(\hat{e}_i \in \hat{Q}_j)$ be the number of treated units in stratum $j$. The stratification estimator is then given by
$$\hat{\tau}_S = \sum_{j=1}^K \frac{n_j}{n} \left[ \frac{1}{n_{1j}} \sum_{i=1}^n W_i Y_i 1(\hat{e}_i \in \hat{Q}_j) - \frac{1}{n_j - n_{1j}} \sum_{i=1}^n (1 - W_i) Y_i 1(\hat{e}_i \in \hat{Q}_j) \right]. \tag{20}$$
The choice of the number of strata is a trade-off between bias and variance: a large number of strata decreases bias but increases variance, and vice versa. Choosing $K = 5$ strata is often advocated as a benchmark, with a reported bias reduction of at least 90 % in Cochran (1968), a choice later also proposed in Rosenbaum and Rubin (1983). As pointed out by Lunceford and Davidian (2004), because the observations within each stratum have only approximately the same propensity score, some confounding may remain between individuals in the same stratum. Even if the propensity score model is correctly specified, this inability to remove all confounding means that the stratification estimator is in general not consistent for the ATE (Lunceford and Davidian, 2004). The large sample distribution of $\sqrt{N}(\hat{\tau}_S - \tau)$ is normal with mean zero and a variance that depends heavily on the estimation steps used in obtaining the estimator and, as noted, is generally not equal to the ideal variance that would apply if all parameters were known (Lunceford and Davidian, 2004). Although this estimator clearly has some drawbacks, it is retained in this thesis to reflect the different ways of estimating the propensity score. Ways of improving the estimator are proposed by, e.g., Lunceford and Davidian (2004).
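A sketch of (20) with quantile-based strata; it returns NA when a stratum lacks treated or control units, which is exactly the overlap failure reported for this estimator in Section 3.2.3 (all names are hypothetical):

```r
# Stratification estimator of the ATE, cf. (20), with K quantile-based strata.
strat_ate <- function(Y, W, ps, K = 5) {
  cuts    <- quantile(ps, probs = seq(0, 1, length.out = K + 1))
  stratum <- cut(ps, breaks = cuts, include.lowest = TRUE, labels = FALSE)
  n <- length(Y)
  terms <- sapply(seq_len(K), function(j) {
    in_j <- stratum == j
    # A stratum with only treated or only control units makes (20) undefined.
    if (sum(W[in_j]) == 0 || sum(1 - W[in_j]) == 0) return(NA_real_)
    (sum(in_j) / n) * (mean(Y[in_j & W == 1]) - mean(Y[in_j & W == 0]))
  })
  sum(terms)  # NA if any stratum was degenerate
}
```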


2.3.3 The Matching Estimator

The matching estimator is similar to the IPW estimator in that it seeks to impute the counterfactual outcomes, those that are not observed once an outcome is realized but are needed in order to estimate the ATE. It differs in how this is done. As the name reveals, the idea is to match the unobserved potential outcome with a unit from the opposite group that shares the same features in terms of the estimated propensity score. If unit $i$ is treated, the missing potential outcome is $Y_i(0)$, and the imputed observation, the match, is found by searching the control group for a unit with a propensity score similar to that of unit $i$.

Following Abadie and Imbens (2006), the matching is done with replacement, so that a unit can be used as a match several times. Suppose that the propensity score is modelled as a logit model (6) and estimated by maximum likelihood (7), so that $\hat{e}_i(X)$ is the estimated propensity score for unit $i$. Given that $M$ matches per unit are used,¹ $\mathcal{M}(i)$ is the set of $M$ matches for unit $i$, ordered by their distance in terms of the propensity score; thus

$$\mathcal{M}(i) = \left\{ j = 1, \dots, N : W_j = 1 - W_i,\ \sum_{l: W_l = 1 - W_i} 1\big(|\hat{e}_i(X) - \hat{e}_l(X)| \le |\hat{e}_i(X) - \hat{e}_j(X)|\big) \le M \right\}. \tag{21}$$
The missing potential outcomes are imputed as
$$\hat{Y}_i(0) = \begin{cases} Y_i & \text{if } W_i = 0, \\ \frac{1}{M}\sum_{j \in \mathcal{M}(i)} Y_j & \text{if } W_i = 1, \end{cases} \tag{22}$$
and
$$\hat{Y}_i(1) = \begin{cases} \frac{1}{M}\sum_{j \in \mathcal{M}(i)} Y_j & \text{if } W_i = 0, \\ Y_i & \text{if } W_i = 1. \end{cases} \tag{23}$$

This leads to the matching estimator given by
$$\hat{\tau}_M = \frac{1}{N} \sum_{i=1}^N \big(\hat{Y}_i(1) - \hat{Y}_i(0)\big) \tag{24}$$
$$= \frac{1}{N} \sum_{i=1}^N (2W_i - 1) \left( Y_i - \frac{1}{M} \sum_{j \in \mathcal{M}(i)} Y_j \right). \tag{25}$$
The asymptotic distribution of $\hat{\tau}_M$ is given in Abadie and Imbens (2012) as
$$\sqrt{N}(\hat{\tau}_M - \tau) \xrightarrow{d} N(0, \sigma^2_M). \tag{26}$$

¹ For the definition to make sense it is assumed that $N_0 \ge M$ and $N_1 \ge M$.


In general, matching on a scalar such as the propensity score is not consistent (Abadie and Imbens, 2006), but using a single match gives the least possible bias. The variance can be decreased by increasing the number of matches (Imbens and Wooldridge, 2009).
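The thesis implements this estimator with the Matching package (Sekhon, 2013), as stated in Section 3. A minimal usage sketch, reusing W, the covariates and the ML propensity scores ps_ml from the earlier sketches; the outcome Y and its coefficients are placeholders invented here for illustration:

```r
# Propensity score matching with one match, with replacement, cf. (21)-(25).
library(Matching)

# Placeholder outcome for illustration only (tau = 1 is an arbitrary value):
Y <- 1 * W + 1.6 * X1 + 2 * X2 + 2 * X3 + rnorm(n)

m_out <- Match(Y = Y, Tr = W, X = ps_ml,   # match on the estimated PS
               M = 1, replace = TRUE, estimand = "ATE")
m_out$est  # matching estimate of the ATE
```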

3 Simulation Study

To evaluate the possible improvements from using LRR and PCLR under multicollinearity, a Monte Carlo simulation with 1000 replicates is performed. The aim is to find out how these alternatives work under different types of correlation and different specifications of the true propensity score model. The performance of the estimators is evaluated for different correlations and sample sizes, and is assessed in terms of bias and mean squared error (MSE). The bias is defined as

$$\mathrm{Bias}(\hat{\tau}) = \frac{1}{r} \sum_{q=1}^r \hat{\tau}_q - \tau, \tag{27}$$
where $r$ is the number of replicates and $\hat{\tau}_q$ is the estimated ATE in replication $q$. The MSE is calculated as
$$\mathrm{MSE}(\hat{\tau}) = \frac{1}{r} \sum_{q=1}^r (\hat{\tau}_q - \tau)^2. \tag{28}$$
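In R, (27) and (28) are one-liners over the vector of replicate estimates (a sketch; tau_hat and tau are placeholder names):

```r
# Monte Carlo bias and MSE over r replicate estimates, cf. (27)-(28).
mc_bias <- function(tau_hat, tau) mean(tau_hat) - tau
mc_mse  <- function(tau_hat, tau) mean((tau_hat - tau)^2)
```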

The propensity score model is specified in three ways. First, a logistic regression as described in (6) is estimated by regular maximum likelihood; this case reflects the situation in which the multicollinearity is not dealt with. The resulting propensity scores are used to estimate the ATE by IPW, stratification and matching, using the R package Matching (Sekhon, 2013). Second, the PCLR solution is implemented by estimating the propensity score with two PCs. The method of Aguilera et al. (2006) is not implemented in this study because the number of covariates here is small compared to their study, in which 10 covariates are used, and it is not certain that the effect of using the conditional likelihood ratio test to choose the number of PCs would be as clear as in their study. Third, the LRR solution is applied with different values of the ridge parameter: one case with an automatically chosen ridge parameter from the R package ridge (Cule, 2014), and other cases with $k = (0, 0.1, 0.3, 0.5, 0.7, 0.9)'$. Details of the automatic selection can be found in Cule and De Iorio (2013). The matching estimator is implemented using one match and with replacement, as described in Section 2.3.3. The simulation is performed using R version 3.0.0 (R Core Team, 2014).


3.1 Simulation Design

Three continuous covariates are considered in the design: $X_1$, $X_2$ and $X_3$. These are true confounders, meaning that they are correlated with both the treatment $W$ and the outcome $Y$. Assume that $X = (X_1, X_2, X_3)' \sim N(0, \Sigma)$, where
$$\Sigma = \begin{pmatrix} 1 & \rho & 0 \\ \rho & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{29}$$
According to Ryan (1997), pairwise correlations can be used to indicate the degree of multicollinearity in logistic models with continuous covariates. Thus, $\rho$ is varied over $(0, 0.3, 0.6, 0.9, 0.95, 0.99)$ to reflect the severity of the multicollinearity. The true model for the treatment is generated by a linear model,
$$z = \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 = \gamma'X, \tag{30}$$
where the intercept has been set to zero for simplicity. The true propensity score is specified using the logistic model as
$$e(\gamma'X) = P(W = 1 \mid X_1, X_2, X_3) = \frac{\exp(\gamma'X)}{1 + \exp(\gamma'X)}. \tag{31}$$
Each treatment indicator $W_i$ is Bernoulli distributed with success probability equal to the true propensity score. The outcomes are generated by
$$Y = \tau W + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon, \tag{32}$$
with $\varepsilon \sim N(0, 1)$ and uncorrelated with the covariates. The parameter vector is set to $\beta = (1.6, 2, 2)'$ and the ATE is assumed to be constant. The sample sizes are 250, considered small, and 1000, considered moderate; since multicollinearity is a small sample problem, larger sample sizes are not considered. Three different designs of the propensity score are considered.

These designs correspond to different degrees of extremity of the propensity score distribution and are governed by the specification of the parameter vector $\gamma$. In the first design the parameter vector in the true propensity score is $\gamma = (0.2, 0.2, 0.3)'$, corresponding to a mild propensity score. The second design has $\gamma = (0.4, 0.4, 0.3)'$, corresponding to a moderate propensity score, and in the third design $\gamma = (0.8, 0.8, 0.3)'$ reflects an extreme propensity score. Only the first two elements of $\gamma$, $\gamma_1$ and $\gamma_2$, are varied, since the multicollinearity is due to the correlation between $X_1$ and $X_2$. These specifications govern the overlap between treated and untreated units: the more extreme the propensity score, the less overlap there is between the propensity scores conditional on treated units and those conditional on control units. Moreover, the more extreme the propensity score, the larger the proportion of very small and very large PS values; examples of this are seen in Figures 3-14 in Appendix. The specification is expected to affect the estimators in different ways. First, the IPW estimator is sensitive to PS values close to zero and one. The matching estimator is susceptible to lack of overlap, since it seeks a unit with the opposite treatment but a similar propensity score; if the overlap is poor there may be a considerable distance to the closest propensity score, resulting in a far from close match. Lastly, the stratification estimator is also affected by the amount of overlap, since it divides the observations into subgroups; with little or no overlap these groups may contain only treated or only control units, making it impossible to estimate the ATE.
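A sketch of one replicate of this data-generating process in R. The true ATE $\tau$ is constant in the design, but its numeric value is not restated in this section, so tau = 1 below is a placeholder:

```r
# One replicate of the simulation design in (29)-(32).
library(MASS)  # for mvrnorm

one_replicate <- function(n = 250, rho = 0.9,
                          gamma = c(0.2, 0.2, 0.3),  # mild design
                          beta  = c(1.6, 2, 2),
                          tau   = 1) {               # placeholder ATE
  Sigma <- diag(3)
  Sigma[1, 2] <- Sigma[2, 1] <- rho              # covariance matrix (29)
  X <- mvrnorm(n, mu = rep(0, 3), Sigma = Sigma)
  e <- plogis(X %*% gamma)                       # true propensity score (30)-(31)
  W <- rbinom(n, 1, e)                           # Bernoulli treatment assignment
  Y <- tau * W + X %*% beta + rnorm(n)           # outcome model (32)
  data.frame(Y = as.numeric(Y), W = W, X)
}

dat <- one_replicate(rho = 0.99)                 # e.g. severe multicollinearity
```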

3.2 Results

This section presents the results of the Monte Carlo simulation. The tables show selected results; complete results can be found in Appendix, Tables 7-12.

As an example of how the density of the propensity scores transforms under multicollinearity for the different estimation methods, consider Figures 1 and 2. These are kernel densities of the propensity scores for all replications. The grey and black dashed lines show the true densities of the propensity scores conditional on the control and treated units, respectively, and the blue and red lines are the estimated densities, conditional on control and treated units respectively. The figures show the cases of zero and 0.99 correlation between $X_1$ and $X_2$ for the mild propensity score with sample size 1000.

[Figure 1: nine kernel density panels (ML, PCLR, LRR, and LRR with k = 0, 0.1, 0.3, 0.5, 0.7, 0.9), each plotting density against PS.]

Figure 1: Density of propensity scores under different estimation procedures. Scores from all replications with mild PS and n = 1000, ρ = 0. Blue line = f(ê(X)|W = 0), red line = f(ê(X)|W = 1), grey line = f(e(X)|W = 0), black line = f(e(X)|W = 1).

The larger the ridge parameter, the more the density is shrunk towards the middle. In the case of no correlation, shown in Figure 1, the estimated PS using ML is very close to the true PS densities. This is also the case for k = 0, which should coincide with ML estimation, although some differences in the results were noticed. Clearly, with LRR and PCLR (and LRR with other values of k) the densities of the estimated PS are more concentrated towards the middle and do not resemble the true densities.

[Figure 2: the same nine kernel density panels as in Figure 1, for ρ = 0.99.]

Figure 2: Density of propensity scores under different estimation procedures. Scores from all replications with mild PS and n = 1000, ρ = 0.99. Blue line = f(ê(X)|W = 0), red line = f(ê(X)|W = 1), grey line = f(e(X)|W = 0), black line = f(e(X)|W = 1).

In Figure 2 the correlation is 0.99 and the transformed densities of LRR and PCLR have evened out; in particular, PCLR is almost identical to the true PS. This suggests that under strong multicollinearity, the propensity scores estimated with these two methods are almost untransformed. In general, for the large samples, the more extreme the specification of the true propensity score model, the closer PCLR and LRR seem to get to the true densities. For the small samples the effect is not as obvious. In the mild case with 0.99 correlation, PCLR comes close to the true density and LRR is not as close, but in the extreme case the shape of the LRR densities is closer to the true PS densities than that obtained by PCLR. Using values other than the computed ridge parameter consistently results in densities that are far from the true PS densities. Figures for mild, moderate and extreme PS and both sample sizes are available in Appendix, Figures 3-14.

3.2.1 Results for the IPW Estimator

The bias of the estimator under multicollinearity depends on the sample size as well as the design of the propensity score. In the moderate design and for the small sample, using PCLR resulted in a lower bias of the IPW estimator in one case, compared to using ML estimation. In most cases of high multicollinearity (ρ ≥ 0.6) the bias of the IPW estimator using PCLR was equal to, or just slightly larger than, the bias obtained using ML. At the same time, the MSE was typically lower with the PCLR method under high multicollinearity, especially in small samples, and the difference was largest in the extreme design. The difference was smaller in the larger samples, and the MSE of PCLR in the large sample in Table 1 was actually higher than for ML, although not by much, with the gap decreasing as the correlation increases. For this design and sample size the LRR method had the lowest MSE, although this was not observed in any other case. The LRR method of estimating the PS was not found to be successful for the IPW estimator; its bias and MSE were never lower than those of PCLR or ML, except in the mentioned case.

Table 1: Bias and MSE of the IPW Estimator under extreme PS.

n = 250
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML       PCLR     Ridge
0       0.0292   0.9339   0.5050      0.1595   1.6157   0.4349
0.3     0.0463   0.0876   0.4722      0.2911   0.2734   0.4015
0.6     0.1061   0.1136   0.5597      0.3817   0.3642   0.5533
0.9     0.1467   0.1495   0.5240      0.5470   0.5252   0.6432
0.95    0.1588   0.1607   0.5404      0.5641   0.5436   0.6714
0.99    0.1828   0.1823   0.5585      0.5534   0.5374   0.6812

n = 1000
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML       PCLR     Ridge
0       -0.0006  0.9311   0.1948      0.0425   1.6503   0.1675
0.3     0.0052   0.0136   0.1360      0.0829   0.0818   0.0902
0.6     0.0074   0.0063   0.1522      0.1578   0.1590   0.1570
0.9     0.0072   0.0056   0.1185      0.2584   0.2617   0.2462
0.95    0.0093   0.0076   0.1227      0.2756   0.2767   0.2626
0.99    0.0066   0.0051   0.1212      0.2970   0.2976   0.2811


For the case shown in Table 1, it was found in the large sample that under high correlation (ρ ≥ 0.6) the gains from using PCLR rather than ML in the IPW estimator were much larger than in the milder designs of the PS. PCLR outperformed ML in terms of bias from ρ = 0.6 onwards. The bias of the IPW estimator with the ML-estimated PS was typically smaller in the larger sample for each design of the PS, and the same was observed for the PCLR method. In general, the increase in sample size brought the bias of PCLR closer to the bias of ML; an example of this is seen in Table 1. Moreover, the more extreme the PS, the better the performance of the PCLR-estimated IPW.

A somewhat surprising finding is that using LRR with the ridge parameter set to zero has the best performance in terms of bias in the large samples for the mild and moderate PS, and matched ML in the extreme case. A ridge parameter of zero was expected to be equivalent to ML estimation, but although close, the propensity scores were not equal to those obtained by ML. An explanation for this might be that, when obtaining the propensity scores with LRR, the covariates were scaled so that the correlation matrix had unit diagonal. This means that the observations are actually transformed, which might cause the difference in propensity scores between ML and LRR with k = 0. However, it should be noted that for zero correlation the MSE of ML was lower than with k = 0. That k = 0 beat both ML and LRR with the computed ridge parameter was also seen in some cases for the other estimators, but will not be commented on further.

PCLR was found to be the best alternative to ML estimation for the IPW estimator in terms of MSE. In particular in the large samples, the bias of the IPW estimator using this method is very close to, and in some cases (under very high correlation or in the extreme PS) lower than, the bias using ML estimation of the PS.

3.2.2 Results for the Matching Estimator

For the matching estimator in the mild design of the PS, the LRR method produced lower bias than ML in the small sample as long as $X_1$ and $X_2$ were correlated. Using ridge parameters k = 0.1 and k = 0.3 gave lower bias than ML under correlation, and k = 0.3 proved even better than the computed ridge parameter of LRR, having the lowest bias of all cases under very high correlation (ρ ≥ 0.9). In the large sample, LRR resulted in lower bias than ML in the matching estimator under very high correlation (ρ ≥ 0.9), and for the same correlations PCLR gave biases lower than or very close to those of ML. Further, the parameter value k = 0.1 gave lower bias than ML under very high multicollinearity and had the lowest bias of all PS estimation methods at a correlation of 0.99.

In terms of MSE, PCLR is the only method that outperforms ML under high correlation.

In the moderate design for the small sample, using LRR produced lower bias than ML under correlation. Surprisingly, for k = 0.1 and k = 0.3 the bias was lower than for ML under low correlation but not under high; this might be a small sample effect, as it was not observed in the larger sample. In the large sample of the moderate design, PCLR was better than ML under high correlation (ρ ≥ 0.95), producing lower bias in the matching estimator. For LRR, the bias was lower than ML at a correlation of 0.95, and in the larger sample LRR also led to lower bias than ML under zero correlation and at 0.95. As in the mild design of the PS, PCLR is the only method under multicollinearity with an MSE close to or lower than that of ML for both sample sizes.

The results for the extreme design of the PS are displayed in Table 2. Using LRR resulted in lower bias in the matching estimator than ML for all correlations in the small sample. The difference is not striking and might be due to small sample effects, as it is not as obvious in the larger sample. As seen in the table, the MSE of the matching estimator using LRR is higher than that of ML, something that was noted in general. PCLR gives the lowest MSE under high correlation in both samples. In the large sample, LRR works well under zero correlation and under very high correlation, 0.95 and 0.99. As Table 2 reveals, PCLR estimation of the PS results in lower bias than ML under high correlation in both samples.

Table 2: Bias and MSE of the Matching Estimator under extreme PS.

n = 250
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML       PCLR     Ridge
0       0.0947   0.9663   0.0857      0.0750   1.6141   0.2291
0.3     0.1299   0.1658   0.1247      0.0940   0.1069   0.2340
0.6     0.1701   0.1719   0.1681      0.1147   0.1142   0.2490
0.9     0.2116   0.2084   0.2064      0.1402   0.1388   0.2688
0.95    0.2183   0.2182   0.2163      0.1478   0.1446   0.2740
0.99    0.2337   0.2249   0.2222      0.1535   0.1491   0.2763

n = 1000
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML       PCLR     Ridge
0       0.0388   0.9553   0.0343      0.0151   1.6634   0.0519
0.3     0.0504   0.0563   0.0518      0.0191   0.0200   0.0483
0.6     0.0654   0.0621   0.0687      0.0254   0.0243   0.0566
0.9     0.0766   0.0776   0.0829      0.0301   0.0302   0.0609
0.95    0.0859   0.0827   0.0838      0.0338   0.0314   0.0621
0.99    0.0902   0.0844   0.0866      0.0345   0.0324   0.0619


In the small samples under high correlation, and for every design of the true PS, LRR resulted in lower bias of the matching estimator than ML. In particular, in the extreme design LRR gave lower bias for all correlations, even under no correlation. This suggests that LRR can be an alternative to ML estimation in small samples in terms of bias. In the large samples of the extreme and moderate designs of the propensity score, the PCLR method proved better than both ML and LRR under strong multicollinearity, with lower bias and smaller MSE. The MSE of the matching estimator using PCLR is typically lower than that of ML under high correlation, while the LRR method generally had a high MSE, although it decreased in the larger samples.

3.2.3 Results for the Stratification Estimator

The bias of the stratification estimator was much higher in all designs and sample sizes than that of the IPW and matching estimators. This was not unexpected, since it is not consistent in general. Although high in comparison, the PCLR method produced lower bias and lower MSE for this estimator than ML in all cases and sample sizes; an example is seen in Table 3. Using LRR was also a better option than ML in some instances, especially as the PS became more extreme and in the large samples. As the extremity of the PS increases, the bias becomes higher, something seen for the IPW and matching estimators as well, and increasing the sample size tends to decrease the bias.

Table 3: Bias and MSE of the Stratification Estimator under extreme PS.

n = 250
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML          PCLR        Ridge
0       17.1570  11.8441  17.2619     421.2894    216.8620    405.8961
0.3     23.0678  21.7211  22.7165     824.7589    665.1244    701.2720
0.6     28.8436  27.4190  28.7512     1234.4954   1080.5518   1197.6811
0.9     35.6185  33.4001  35.6181     1950.5786   1593.8336   1878.0213
0.95    36.8016  34.5205  36.6071     2077.6522   1737.9322   2017.7035
0.99    37.5421  35.4620  36.7575     2135.1577   1821.4782   1996.0321

n = 1000
                 Bias                           MSE
ρ       ML       PCLR     Ridge       ML          PCLR        Ridge
0       14.5641  10.5908  14.5918     221.7978    130.3609    223.6684
0.3     18.8770  18.8451  18.8871     374.1558    372.8482    375.1739
0.6     23.6250  23.5193  23.5878     589.8433    583.3981    587.7940
0.9     28.9076  28.7477  28.9106     889.8400    877.9186    889.4018
0.95    29.7658  29.6210  29.7545     944.1416    932.9786    942.9792
0.99    30.5297  30.3597  30.5143     994.5572    980.9251    992.6420


For this estimator there was a problem with the overlap assumption, in particular when applying the different types of logistic ridge regression, but in one case (the small sample in the extreme design) also with ML and PCLR estimation of the propensity scores. The transformed propensity scores predicted treatment status perfectly, so that, for example, the first stratum contained only untreated units and the last stratum only treated units. This is harmful for the stratification estimator, since it weights by the number of treated units in each stratum (see Equation (20)), so the result becomes infinite. Therefore the bias and MSE are calculated only over the successful replicates, where this problem did not occur. The problem seems to diminish as the sample size increases. The number of replications used to calculate the bias and MSE for the stratification estimator is shown in Table 13 in Appendix.

4 Empirical Example

The empirical example is a replication of Millimet and Tchernis (2009), extended by applying the PCLR and LRR methods when estimating the propensity scores. I want to thank Millimet and Tchernis for access to the data set used in Millimet and Tchernis (2009). The data are described briefly in Section 4.1 together with variable descriptions, Section 4.2 presents the models to be estimated, and the results are given in Section 4.3.

4.1 Data

The study concerns the average treatment effect on three environmental indicators of being a member of the General Agreement on Tariffs and Trade (GATT)/World Trade Organization (WTO). The data are at country level and come from Frankel and Rose (2005). The observations were collected in 1990, before the creation of the WTO, and in 1995, the year when the WTO replaced GATT.² Table 4 contains a description of the covariates, the outcome variables (environmental indicators) and the treatment. Figure 15 in Appendix illustrates the distributions of the covariates and Table 14 in Appendix provides some descriptive statistics.

² http://wto.org/english/thewto_e/whatis_e/wto_dg_stat_e.htm


Table 4: Variable description.

Covariates
  X1        Real GDP per capita.
  X2        Polity: index of how democratic or autocratic the country is, ranging from -10 to 10, defined as Democracy - Autocracy.
  X3        Land area divided by total population.
  X4        Democracy: index of how democratic a country is, ranging from 0 to 10.
  log(X1)   Log of real GDP per capita.
  D         Dummy variable for year; takes the value 1 for 1995 and 0 for 1990.

Outcome variables
  Y1        Industrial CO2 emissions in metric tons per capita.
  Y2        Annual deforestation, average percentage change from 1990 to 1995.
  Y3        Energy depletion: product of unit resource rents and the extracted physical quantities of fossil fuel energy, in percent of GDP.

Treatment
  W         GATT/WTO membership.

Notes: The data consist of 223 observations after removing missing values. The environmental indicators and country-level controls are from Frankel and Rose (2002) and Frankel and Rose (2005). GATT/WTO membership data are from Rose (2004) and Rose (2005) (http://faculty.haas.berkeley.edu/arose/).

The variables Autocracy and Democracy are index variables describing the governmental status of a country, taking integer values from zero to 10. The sample correlations between the covariates are shown in Table 5. The variable X3 has no strong correlation with any of the other covariates. Naturally, X1 and log(X1) are strongly correlated, but they are never used together in any of the models. The striking feature of the table is the correlation of 0.98 between X2 and X4, which arises because X2 (Polity) is defined as the difference between X4 (Democracy) and Autocracy.

Table 5: Correlation among the covariates.

          X1     log(X1)  X2     X3     X4
X1        1      0.90     0.54   0.01   0.60
log(X1)   0.90   1        0.61   0.00   0.65
X2        0.54   0.61     1      0.02   0.98
X3        0.01   0.00     0.02   1      0.03
X4        0.60   0.65     0.98   0.03   1


4.2 Model

Two designs of the propensity score are considered. The first uses three covariates:
$$z = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \gamma_4 D_4 + \gamma_5 D_5, \tag{33}$$
and in the second design $X_4$ is added:
$$z = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 X_3 + \gamma_4 X_4 + \gamma_5 D_5 + \gamma_6 D_6. \tag{34}$$
The indicator variables, $D_4$, $D_5$ in (33) and $D_5$, $D_6$ in (34), denote membership of GATT/WTO and year, in line with Millimet and Tchernis (2009). As noted earlier, the variables $X_2$ and $X_4$ have a correlation of 0.98, so adding $X_4$ to the model of the PS introduces strong multicollinearity. This makes it interesting to estimate the ATE using LRR and PCLR. Further, the number of covariates and the sample size are very similar to the design in the simulation study.

In their article, Millimet and Tchernis (2009) trim the propensity scores by using only observations with estimated PS values in [0.05, 0.95]. The reason for the trimming is the distribution of the PS, which contains some observations above 0.95, which is high in the context of the IPW estimator. Another way of handling this is to take the logarithm of X1 and use it as a regressor instead of X1. This restricts the range of the PS to the desired interval without losing observations, which might be a better approach considering the small sample size. The propensity scores are estimated using the mentioned trim, no trim, the logarithm of X1, PCLR, LRR, and LRR with the same values of the ridge parameter as before.

The ATE of GATT/WTO membership on the environmental indicators is then estimated using the IPW, matching and stratification estimators, as sketched below.
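The trimming rule is a one-line filter in R (a sketch; Y, W and ps_hat are placeholders for the empirical outcome, treatment and estimated propensity scores):

```r
# Keep only observations with estimated PS in [0.05, 0.95], as in
# Millimet and Tchernis (2009), before estimating the ATE.
keep    <- ps_hat >= 0.05 & ps_hat <= 0.95
Y_trim  <- Y[keep]
W_trim  <- W[keep]
ps_trim <- ps_hat[keep]
```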

4.3 Results

The results are presented in Table 6 as point estimates of the ATE. This serves as a sensitivity analysis of the behaviour of the estimates as a highly correlated covariate is introduced into the model according to Equation (34). For the different models only the point estimates are considered, without standard errors, owing to difficulties in obtaining these under LRR.

Figures 16-17 in Appendix show balance plots of the propensity scores. The overlap is substantial before matching for cases using trim, no trim and log(X1). For the case with PCLR
