Missing Data - A Gentle Introduction

(1)

Missing Data - A Gentle Introduction

Submitted by

Vilgot Österlund

A thesis submitted to the Department of Statistics in partial

fulfillment of the requirements for a two-year Master of Arts degree

in Statistics in the Faculty of Social Sciences

Supervisor

Ronnie Pingel

(2)

Abstract

This thesis provides an introduction to methods for handling missing data. A thorough review of earlier methods and the development of the field of missing data is provided. The thesis present the methods suggested in today’s literature, multiple imputation and maximum likelihood estimation. A simulation study is performed to see if there are cir-cumstances in small samples when any of the two methods are to be preferred. To show the importance of handling missing data, multiple imputation and maximum likelihood are compared to listwise deletion. The results from the simulation study does not show any crucial differences between multiple imputation and maximum likelihood when it comes to point estimates. Some differences are seen in the estimation of the confidence intervals, talking in favour of multiple imputation. The difference is decreasing with an increasing sample size and more studies are needed to draw definite conclusions. Further, the results shows that listwise deletion lead to biased estimations under a missing at ran-dom mechanism. The methods are also applied to a real dataset, the Swedish enrollment registry, to show how the methods work in a practical application.

Keywords : Missing data, Small samples, Multiple imputation, Maximum likelihood, Listwise deletion, Missing at random, Missing completely at random, Linear regression, Logistic regression.

(3)

1 Introduction

Researchers have studied and developed methods to handle missing data problems for a long time. Listwise deletion, mean imputation and other previously commonly used procedures have all been shown to be worse than more sophisticated methods. Despite the progress, it is still common to ignore missing data in a wide range of fields and thereby also ignoring the consequences of doing so. Rousseau et al. (2012) reviewed 68 quantitative articles published in The British Educational Research Journal between the years 2003 to 2007. Their results showed that 29 articles did not report information about missing data in a satisfying manor. Out of the 39 articles who did report missing data, only 16 actually mentioned how the missing data was handled. 14 of these articles used listwise deletion. It seems to be a gap between applied researchers and the recommendations from statisticians. The reason for this gap i probably lack of statistical knowledge in terms of the methods that handle missing data and also a lack of awareness for what the consequences are when ignoring an appropriate method to handle missing data. The aim of this thesis is therefore to provide an easily accessible introduction to commonly used methods for missing data. The sample size in quantitative research within the social science field varies and a review by Marszalek et al. (2011) showed that it is not unusual to have sample sizes below 20. Missing data in small samples have been addressed in articles, but a comparison of different methods in small samples are - with some exceptions, see for instance McNeish (2017) - relatively rare. By giving a thorough introduction to missing data methods and compare the methods under different settings with small sample sizes, this thesis can contribute both to applied researchers and to the statistics field. To make the ambition of the thesis more clear, it should include a review of methods to handle missing data and answer the following questions:

• Does the parameter of interest, data type, missing mechanism or level of missing data affect which of the presented methods to prefer in small samples?

• Are the results changing with an increasing sample size?

The next section gives the fundamental knowledge that is needed to follow the simulation study performed later in the thesis. It also includes a brief recap about the development of the research within the field of missing data. Earlier methods are presented to serve as a smooth transition to the methods that Schafer and Graham (2002) describes as state of

(5)

the art. A simulation study is presented in section 3 to answer the research questions. The study is performed with three different sample sizes in the case of 1) a continuous outcome variable and 2) a binary outcome variable. A dataset from the Swedish enrollment registry (Inskrivningsarkivet, INSARK) has been made availible for the thesis. In section 4, the dataset serve as an example on how the presented methods can be applied in practice. The advantages and drawbacks of the methods are partly shown in the simulation study and on the application on a real data, but they are also discussed in section 5. Interested readers can consult Endners (2010) for a pedagogical and complete introduction to the field of missing data.

2 Background

Research about missing data problems have been around for about a century. For in-stance, Wilks (1932) is considered to be the architect behind mean imputation (Endners, 2010). Today’s research is mainly based on ideas and terminology established in the 1970s. Rubin (1976) established a system to classify different types of missing data pat-terns. The system is still used in the literature and gives the foundation to all research within the field of missing data. The classification system assumes three different types of pattern to the missing data. These are Missing not at random (MNAR), Missing at ran-dom (MAR) and Missing completely at ranran-dom (MCAR). Missing not at ranran-dom means that the probability of a missing value is related to the missing value itself. Missing at random means that the probability of a missing value in one variable is related to other observed variables. Missing completely at random means that the missing values in one variable appear random, given the data. The missing values are not related to other observed variables and not to the missing value itself.

A formal definition of the different patterns is given by van Buuren (2018). Let X denote a n × p matrix of a collected sample where the elements of X is denoted by xij (i = 1, ..., n, j = 1, ..., p). R, the response matrix, denotes a n × p indicator matrix where the element rij takes the value 1 if xij is observed and 0 if the value of xij is missing. Xobs denotes the observed data and Xmis denotes the true value of the elements with missing data. There are of course values corresponding to the missing values of X but they are unknown to the collector. Hence, a complete - but unknown - data are given by

(6)

taking Xobs and Xmis together, X = (Xobs, Xmis). The distribution of the R matrix might depend on the missing value itself (Xmis), on other observed values (Xobs) or on other unobserved variables. If φ denotes unobserved variables that affect the distribution of R, the definition of missing not at random is:

P (R = 0 | Xobs, Xmis, φ).

Hence, the probability of a non-response can depend on the missing value itself, on other observed variables and on other unobserved variables, φ. A definition of missing at random is:

P (R = 0 | Xobs, φ),

where the probability of a non-response can depend on other observed variables or on unobserved variables φ. Finally, a definition of missing completely at random is:

P (R = 0 |φ),

where the probability of a non-response only depend on unobserved variables φ.

The possibility to handle the missing data is different depending on the type of missing data. In general, today’s methods can only consistently deal with data that are MCAR or MAR. It is beyond the scope of this thesis to describe methods that tries to handle MNAR data. When tackling a missing data problem, it can be good to keep in mind that it is not preferably the missing data point themselves that are of interest. Most often our actual interest is to estimate some parameter of a population, for instance the mean.

2.1 Earlier Methods

This section is devoted to earlier methods to handle missing data. It is techniques that have been proved to perform worse than newer, more sophisticated methods. An intro-duction to these methods gives a background to the development of the field of missing data. It also makes it easier to understand the methods that are suggested in the liter-ature today. All methods covered in this section are different kinds of single imputation techniques. Imputation refers to methods that impute values to fill in and replace the missing values. Single comes from the fact that these methods replace each missing value with one new value. An advantage of single imputation techniques is that it gives a com-plete data set. This makes the analysis as simple as it would have been if there had been

(7)

no missing values in the data. However, there are potential drawbacks when using single imputation techniques. We will see that single imputation can give biased estimations and that it attenuates standard errors.

To easier follow how the presented methods work, all methods will be applied to a made-up example. The example assumes that there is a positive correlation between a person’s result on a standardized test and their income. It is further assumed that a person with a low result on the test is less likely to answer in surveys. It is of interest to estimate the effect of the test score on the income. There is a register with all test scores, but the information about income has to be collected through a survey. The chosen model to estimate the effect is a linear regression model presented in the following equation:

Inc = β0+ β1(T S),

where Inc is the income, T S is the test score and β1 is the effect of the test score on the income. 20 people are randomly chosen for the survey. However, not all participants answer the survey which leaves the researcher with a dataset as presented in Table 1. This made-up example is an example of a missing at random pattern since the probability of a missing value in one variable is related to another variable. The test score and income data has been generated from a bivariate normal distribution with mean and covariance matrix: µ = 50 50 ! and Σ = 25 25 50 ! . (1)

The true coefficients are calculated with the following equations: β1 = σX,Y σ2 X = 25 25 = 1, (2) β0 = µY − β1µX = 50 − 1 × 50 = 0, (3) σ_{Y |X}2 = σ_Y2 − β2 1σ 2 X = 50 − 1 2_{× 25 = 25,} ₍₄₎

Where X referees to the variable test score and Y to the income. A variance in income of 25 and a covariance with the test score of 25 gives a true population effect of 1. On average, a one point higher test score leads to a one unit higher income. The first and second column of Table 1 shows the test score and income for each person in the sample. The values have been rounded to two decimals. As can be seen in the third column, not all participants answered the survey. Although they did not answer the survey, they still

(8)

have an income. The fact that the income for some of the participants is unknown raises some issues for the researcher. Should the analysis be implemented only on observations without missing values, so-called complete cases, or does that lead to a loss of valuable information?

Table 1: Made-up example with test score and income. Each row corresponds to one person.

Arithmetic Stochastic

Test Observed mean Regression regression

score Income Income imputation imputation imputation

41.80 36.37 - 53.31 43.73 44.77 44.22 41.54 - 53.31 45.85 45.73 44.81 48.97 - 53.31 46.37 46.20 45.48 47.51 47.51 47.51 47.51 47.51 48.16 52.19 - 53.31 49.32 48.39 48.99 46.13 - 53.31 50.05 56.30 49.35 54.21 54.21 54.21 54.21 54.21 49.59 48.06 48.06 48.06 48.06 48.06 49.76 54.53 54.53 54.53 54.53 54.53 50.07 43.42 - 53.31 51.00 51.89 50.42 44.41 44.41 44.41 44.41 44.41 50.66 54.33 - 53.31 51.52 49.45 52.19 49.87 49.87 49.87 49.87 49.87 52.22 49.30 - 53.31 52.89 48.69 52.31 52.00 52.00 52.00 52.00 52.00 53.79 54.33 54.33 54.33 54.33 54.33 54.09 59.12 - 53.31 54.53 56.38 55.79 63.42 63.42 63.42 63.42 63.42 59.33 59.06 59.06 59.06 59.06 59.06 61.73 59.06 59.06 59.06 59.06 59.06

(9)

Figure 1: Three different types of imputation. Blue triangles are observed values and red dots are imputed values.

Looking at Table 1, the missing data in the last three columns have been replaced with imputed values. Each of the three columns corresponds to a specific imputation method. The differences between the methods can also be seen in Figure 1. Arithmetic mean imputation replace all missing values with the mean of the observed values. Since the mean of the observed income is 53.31, all missing values are replaced with the value 53.31. This method gives a complete dataset but has disadvantages that clearly surpasses the convenience that comes with such a simple imputation technique. With all values replaced by the same number, the variation in the data will be lower than it should have been if the data had been complete. This can be seen in Table 2 where the estimated mean and standard error of the income are shown. The estimated effect of the test score and the residual standard error are also presented in the table. The true population parameters are shown in the first column. It is easy to see the drawbacks of mean imputation when the population values are compared to the values estimated from the data with mean imputed values. Mean imputation does not take other covariates into account and since this is an example of a missing at random problem, the estimated mean will be biased. We can also see that the standard errors and the estimated effect are underestimated. In regression imputation, a regression equation is estimated to replace the missing values with predicted values. The idea is based on the fact that covariates tend to be correlated. The first step in regression imputation is to estimate a regression equation to predict the missing values. The regression equation is estimated on the complete cases. Hence, it is estimated on the eleven persons that took part in the survey. Equation 5 shows the

(10)

estimated regression model.

Inci = ˆβ0+ ˆβ1(T Si) = 6.97 + 0.88(T Si), (5) where Inci is the predicted income for person i. T Si is the test score for person i. The regression equation estimates the effect of test score on income to 0.88. This means that a one point increase in test score leads to a 0.88 unit increase in the predicted income. This can be seen in the forth column of Table 1 and in the second plot in Figure 1, where the imputed values are directly related to the observed test score. This is the mayor drawback with regression imputation. Since the imputed values follows a straight line, the variation will be lower than it would have been if the data had been complete. This will lead to underestimated standard errors and overestimated correlations. In Table 2 we can see that it is mainly the estimated standard errors that deviate from the population parameters.

A similar technique that tries to handle the attenuated variability of regression imputation is stochastic regression imputation. It works the same way as in Equation 5 but adds a random residual term to the predicted value. The random residual follow a normal distribution with mean zero and variance equal to the residual variance in the complete case regression. The residual variance of Equation 5 is equal to 12.27. Hence, a random number from a normal distribution with mean zero and variance 12.27 have been added to the values imputed by Equation 5. The imputed values are shown in the last column of Table 1 and can also be seen the last figure in Figure 1. As expected, the values imputed by stochastic regression follows the imputed values from regression imputation, but some are larger and some are smaller. The estimated parameters from the imputed data are shown Table 2: Estimated mean and standard error of the income, along with the estimated β1

and residual standard errors for each method.

Arithmetic Stochastic

Population mean Regression regression

parameter imputation imputation imputation ˆ µInc 50 53.31 51.59 51.71 ˆ σ2 Inc 50 17.14 26.24 28.23 ˆ β1 1 0.44 0.88 0.84 ˆ σ2 Res 25 13.29 8.49 12.17

(11)

in Table 2. Of all methods presented in this section, stochastic regression imputation is the sole technique which gives unbiased parameters estimates with a missing at random data pattern (Endners, 2010). However, the technique attenuates the residual standard errors, just like all single imputation techniques. This increase the risk of rejecting a true null hypothesis. The attenuated standard errors comes from the fact that the data set with imputed values is treated as a complete data set. Because of the uncertainty that comes with missing data, an analysis of a complete data set should give smaller standard errors than an analysis on a similar data set with missing values. There are bootstrapping methods to handle the biased standard errors. However, these methods are often less convenient than multiple imputation or maximum likelihood methods.

2.2 Multiple Imputation

Multiple imputation is the first of the two missing data techniques that Schafer and Gra-ham (2002) describes as state-of-the-art. It relies on Bayesian theory but is related to stochastic regression imputation. The idea of multiple imputation based on Bayesian the-ory was published by Rubin (1978). A few years later, the method was further presented in the book Multiple imputation for nonresponse in surveys (Rubin, 1987). The book includes a description of rules that later has been known as Rubin’s rules. The procedure of multiple imputation can be described in three steps; imputation, analysis and pooling. In the imputation step, m datasets are genereated. Each of the m imputations consists of an iterative procedure with an imputation step (the I-Step) and a posterior step (the P-step). The I-step works in the same way as the method described earlier, stochastic regression imputation. All complete cases are used to estimate a model for imputing the missing values. Just like in stochastic regression, a random value with mean zero and a standard error equal to the residual standard error is added to the imputed value. In the P-step we use the imputed data to estimate a mean vector and a covariance matrix. In a Bayesian framework, the parameter estimates are seen as random variables. The posterior distribution describes the distribution of the parameter estimate. For instance, the posterior distribution of the estimate of a mean vector from a multivariate normal distribution is MN(ˆµ,_NΣˆ). Hence, the mean estimates are normally distributed around their respective estimate. Following from this idea, a new covariance matrix is achieved

(12)

by drawing a sample of size one from the posterior distribution of the covariance matrix. The posterior distribution of the covariance matrix is W−1_{(N − 1, ˆ}_Λ)_{, an inverse Wishart} distribution with N −1 degrees of freedom and sum of squares and cross products matrix

ˆ

Λ, where ˆΛ = (N − 1) ˆΣ. The sample gives a new covariance matrix, Σ∗, which is used to draw a new sample of the mean vector. Hence, a sample of size one from a multivariate normal distribution with mean ˆµand covariance matrix Σ∗ is drawn. This gives a new mean vector, µ∗_{. The new parameters µ}∗ _{and Σ}∗ _{are used to estimate the regression} coefficients in the next I-step, using Equation 2-4. Because of the random deviation of µ∗ _{and Σ}∗ _{from ˆ}

µ and ˆΣ, the new coefficients will differ from the coefficients in the previous I-step. A new complete dataset is given by imputing the missing values using the regression coefficients and adding a random residual term with mean zero and variance ˆ

σ2

Y |X. The complete dataset is carried forward to the next round of the P-step, where a new sample of Σ∗ _{and µ}∗ _{is drawn from the data. This process continues until µ}∗ _and Σ∗ are independent from the initial estimations of ˆµ and ˆΣ. A final imputed dataset is achieved in the I-step, using the last versions of µ∗ _{and Σ}∗_{. This procedure referees} to one of the m imputations. Hence, this process is repeated m times, which gives m different copies of the data. Because of the random draws of µ∗_{and Σ}∗_{, together with the} random residual in stochastic regression, each of the datasets will have different estimates of the missing values.

Recall the earlier example where we tried to estimate the effect of test score on income. If we let m = 20, we will have 20 different columns similar to the last column in Table 1. The analysis step is made on each of the m generated data sets. Hence, if we recall Table 2, we will have 20 estimates of each parameter. This means 20 mean estimates, 20 estimates of the variance, 20 estimates of β1 and 20 estimates of the residual variance. The final estimate, the multiple imputation point estimate, of each parameter is given in the pooling step:

¯ θ = 1 m m X t=1 ˆ θt, (6)

where θ is the parameter of interest. ˆθt is the parameter estimate from data set t and ¯θ is the final estimate. The formula in Equation 6 is the same as the normal formula for the sample mean. What makes multiple imputation superior to single stochastic regression imputation is that it does not attenuate the variance. It do so by accounting for the

(13)

uncertainty that comes with missing values. The calculation of the sample variance consists of the within-imputation variance (VW) and the between imputation variance (VB). The formula for calculating the total sampling variance (VT) of ˆθ is given in Equation 7: VT = 1 m m X t=1 SE_t2 | {z } VW + 1 m − 1 m X t=1 (ˆθt− ¯θ)2 | {z } VB + 1 m−1 Pm t=1(ˆθt− ¯θ)2 m | {z } VB m = 1 m m X t=1 SE_t2+ 1 + 1 m 1 m − 1 m X t=1 (ˆθt− ¯θ)2 (7) where SE2

t is the squared standard error of the tth imputed data set. θ is the parameter of interest. ˆθt is the estimate from the tth imputed data set and ¯θ denotes the overall mean across all m imputed data sets. Thus, the total variance is a sum of the mean variance estimation from the m datasets and the variation in point estimations between the m datasets.

2.3 Maximum Likelihood Estimation

Maximum likelihood is the second of the two missing data techniques that Schafer and Graham (2002) describes as state-of-the-art. A good summary of the main objective of maximum likelihood estimation is given by Endners (2010):

"The goal of maximum likelihood is to identify the population parameters that have the highest probability of producing the sample data."

Just like the quotation says, maximum likelihood aim to estimate the population param-eters. Hence, the first step is to specify a distribution for the population. The example in this section will assume a multivariate normal distribution. Let us again assume that we, with help from the test score, want to estimate the mean of the income for the population. To use maximum likelihood estimation, we need the likelihood function. The likelihood function give us the probability of observing a certain value given some specific distribu-tion parameters. The likelihood for observadistribu-tion i for the multivariate normal distribudistribu-tion is: Li = 1 (2π)k2 | Σ | 1 2 exp −1 2(Xi− µ) T Σ−1(Xi− µ) ,

(14)

and the covariance matrix of the included variables and Xi are the observed values for observation i. Usually we want to estimate µ and Σ with our sample, but let us assume that we already know the true values of these parameters. Since we generated the data we actually do know the true values of the parameters and they can be seen in Equation 1. As all parameters are known, the only term in the likelihood that may vary are the observed values. Observed values close to the mean parameter yields a larger likelihood. Observed values further away from the mean corresponds to a smaller likelihood. It is intuitive that the most likely value to observe are values equal to the mean. With a sample of size n, the probability of drawing a certain sample equals the product of all n likelihoods: L = n Y i=1 1 (2π)k2 | Σ | 1 2 exp −1 2(Xi− µ) T Σ−1(Xi− µ) . (8)

But what do we do when the parameters are unknown and we want to estimate them with our sample? Well, when we have drawn our sample and the parameters are unknown, the only things that may vary are our estimations of the parameters. Remember that the likelihood function for a given sample in Equation 8 is the probability of drawing the particular sample, given the parameters µ and Σ. What we want to do is to find the values of µ and Σ that gives the largest probability, we want to maximise the likelihood function. Maximising a function requires us to find the point where the first derivative equals to zero and control that it is a maximum by checking that the second derivative is negative. Differentiating Equation 8 is difficult because it involves the products of all individual likelihoods. Computing the natural logarithm of the likelihood function give us a sum of individual log-likelihoods, which is easier to differentiate:

logL = n X i=1 log ( 1 (2π)k2 | Σ | 1 2 exp −1 2(Xi− µ) T_Σ−1 (Xi− µ) ) , (9)

which can be rewritten to: logL = −nk 2log(2π) − n 1 2log | Σ | − 1 2 n X i=1 (Xi− µ)TΣ−1(Xi− µ). (10) Although the metric of the likelihood and the log-likelihood differs, they have the same min- and max points. To find the maximum likelihood estimation of µ we differentiate

(15)

the log-likelihood with respect to µ. ∂logL ∂µ = − 1 2 ∂Pn_i=1(Xi− µ)TΣ−1(Xi− µ) ∂µ = −1 2(−2Σ −1 n X i=1 (Xi− µ)) = Σ−1 n X i=1 (Xi− µ).

Solving for µ gives the maximum likelihood estimation of µ: µ =

Pn i=1Xi

n ,

which of course equals the usual formula for the mean. To control that we have really found a maximum we check the second derivative of the log-likelihood:

∂2_logL ∂µ2 =

∂Σ−1Pn_i=1(Xi− µ) ∂µ

= −Σ−1,

which is negative and we conclude that we have found the maximum likelihood estimator of µ.

Although the analytical solution to the estimation of the mean vector is not very difficult to find, it can quickly become complicated to differentiate other functions. Instead of finding the maximum likelihood estimation analytically, it can be found numerically. Remember that the goal of maximum likelihood estimation is to find the parameter estimation that gives the largest probability to generate the given sample. One could calculate the log-likelihood for all possible values of the parameter and then chose the value that gives the largest log-likelihood. However, the number of possible values will increase dramatically with more parameters. In these cases, an iterative optimization algorithm has to be used to minimize the computational burden. A detailed review of how optimization algorithms works is beyond the scope of this thesis. A very simplified description is that an iterative optimization algorithm starts by assigning some initial values to the parameters to calculate the function value. The iterative part comes from the fact that the values of the parameters are updated and the function is calculated again. If the function increased compared to the previous iteration, the chosen values are closer to the maximum than the previous ones. The maximum likelihood estimation

(16)

is reached by repeatedly updating and calculating the function value until it does not change. The default algorithm in the R-function optim() is presented by Nelder and Mead (1965) (R Core Team 2019). Interested readers can consult Eliason (1993) for a good overview of frequent algorithms.

Maximum likelihood theory can be used also in the presence of missing data. The first papers who suggested to handle missing data with maximum likelihood was presented in the 1950s (Edgett, 1956; Anderson, 1957). However, the major progress came in the 1970s with papers by Hartley and Hocking (1971), Orchard and Woodbury (1972), Beale and Little (1975) and Dempster et al. (1977). Recall the made-up example in section 2.1 with the data in Table 1. The log-likelihood for the complete cases, i.e. the observations with no missing values, is presented earlier in Equation 10. The log-likelihood for observation i with missing data is:

logLi = − ki 2log(2π) − 1 2log|Σi|− 1 2(Yi− µi) T_Σ−1 i (Yi− µi), (11)

where ki is the number of data points without any missing data. The only difference between this equation and the complete data log-likelihood in Equation 10 is the subscript i on k, Σ and µ. However, the subscript is of great importance because it means that the content and size of the parameter matrices can vary between observations. For each observation, it allows the log-likelihood computation to only rely on variables with complete data. We use the first and last observation in Table 1 to exemplify, assuming µ and Σ are known. The last observation is a complete case and the log-likelihood is:

−2 2log(2π) − 1 2log 25 25 25 50 − 1 2 61.73 − 50 59.06 − 50 !T 25 25 25 50 ! 61.73 − 50 59.06 − 50 ! = −7.95, while the first observation has missing values and the corresponding log-likelihood is therefore: −1 2log(2π25) − 1 2 (41.8 − 50)2 25 = −3.87.

Had these observations been the entire sample, the sample log-likelihood would have been −7.95 + (−3.87) = −11.82. When the parameters are unknown, the maximum likelihood estimation are the values of µ and Σ that gives the largest value of the sample log-likelihood. In practice, the estimation of the unknown parameters is performed by the expectation maximization algorithm (EM algorithm).

(17)

The EM algorithm is an iterative optimization algorithm that was developed in the 1970s, with Dempster et al. (1977) as an important contributor. It is an iterative procedure in two steps, the E-step and the M-step. It starts with an initial estimate of the covariance matrix and the mean vector, using only complete cases. In the E-step, the mean vector and covariance matrix are used to predict the variables with missing values. This is similar to the regression imputation described earlier in section 2.1 and gives a complete data with observed and imputed values. The purpose of the M-step is to calculate new estimates of the covariance matrix and mean vector on the complete data. The new updated values are used when the E-step is performed again, creating a new complete data set to use in the M-step. This process continues until the change between iterations in the sample log-likelihood is below some threshold. The estimated effect is calculated with the achieved covariance matrix according to Equation 2-4. The standard errors of the estimated coefficients can be achieved via the observation information matrix (also known as the observed Fisher information matrix) or via bootstrapping. It has been shown that the observed information matrix gives correct standard errors under both MCAR and MAR (Kenward and Molenberghs, 1998). However, we will estimate the standard errors via bootstrapping.

The idea of Bootstrapping was first introduced by Efron (1979). The idea is that more information about the sample can be achieved by re-sampling with replacement from the observed sample. The name comes from the phrase "to pull oneself up by one’s bootstrap". It referees to something which is brilliant but at the same time absurd and impossible. Assume that we have the sample presented in Table 1. With the EM algo-rithm, we have estimations of the mean vector, the covariance matrix and the coefficients. By drawing a sample of size n (with n equal to the original sample size) from the original sample with replacement, we get a new sample. Since it is a sample with replacement, some observations will be drawn at multiple times. This also leads to the fact that some observations will be excluded from the new sample. The coefficients are estimated with the new sample through the same methods as earlier. This process is repeated B times, which gives B estimates of the coefficients. We get the standard errors via the usual formula for the standard deviation:

ˆ θSE = 1 B − 1 B X i=1 (ˆθi− ¯θ)2, (12)

(18)

where θi is the estimate from the ith _{bootstrap sample and ¯θ is the mean estimate of all} bootstrap samples.

3 Simulation Study

The purpose of the simulation study is to investigate if there are circumstances in small samples when any of the presented methods are to be preferred. Further, it is of interest to evaluate if the results are changing when the sample size increase. In practice, this is done by comparing multiple imputation and maximum likelihood on two different types of data. Three levels of missing data are used, as well as three different levels of the sample size. Both data structures have one explanatory variable which follows a normal distribution. The missing values are located in the dependent variable and follows a MCAR- and MAR mechanism. The first data structure has a continuous dependent variable and the second has a binary dependent variable. We are interested in estimating the mean of the dependent variable, as well as the effect of the explanatory variable. The effect is estimated through a linear regression model in the continuous case and in the binary case it is estimated through a logistic regression model. As a comparison to multiple imputation and maximum likelihood, the corresponding estimates using listwise deletion are also presented.

The simulation study is performed using the statistical software R (R Core Team 2019) and requires the additional packages boot (Canty and Ripley, 2019; Davison and Hinkley, 1997), MASS (Venables and Ripley, 2002), MCMCpack (Martin et al., 2011), mice (van Buuren and Groothuis-Oudshoorn, 2011) and tidyverse (Wickham et al., 2019). The code is attached in Appendix 1.

3.1 Previous Research

There are multiple studies which have examined the performance of multiple imputation and maximum likelihood through simulation studies. However, most of the published articles are not specifically focused on small sample sizes. With small samples, Graham and Schafer (1999) showed that multiple imputation is more or less unbiased. However, a potential issue with small sample sizes and multiple imputation is that multiple imputa-tion only can rely on complete observaimputa-tions to impute values. McNeish (2017) compares

(19)

two types of multiple imputation and maximum likelihood in a simulation study with small samples and different levels of missing data. Under a MAR pattern, the results shows that the Type-I error rate of a null effect inflates much faster for maximum likeli-hood compared to the two types of multiple imputation. A concern that McNeish (2017) mention is that the impact of the prior distribution on the posterior distribution increase with a smaller sample, something which affect multiple imputation. Hughes et al. (2016) discuss and compare different imputation variance estimators. The results from their simulation study shows that Rubin’s multiple imputation variance estimator can give biased estimates when assumptions are not fulfilled.

A rule of thumb when it comes to logistic regression is the one in ten rule (Harrell Jr. et al. (1984); Peduzzi et al. (1996)), which states that there should be at least ten observations of each outcome for every explanatory variable. With a sample size of 25 and 50% missing values, this rule will obviously be violated. Logistic regression is based on maximum likelihood, where the maximum likelihood estimation is obtained through an iterative optimization algorithm. One problem that therefore may occur is that the optimization algorithm does not converge. Meeyai (2016) perform a simulation study on missing data with logistic regression. The results suggest that the sample size is of greater importance than the rate of missing data. For 10- and 50% missing data, a sample size of at least 40 and 100 is suggested. A difference from this thesis is that the missing values are located in one of the independent variables. Other methods have been presented instead of the usual logistic regression, mainly based on a method called predictive mean matching (Rubin (1986); Little (1988); Munnich and Rässler (2005)).

3.2 Simulation Design

Recall the data and example used in section 2.1. Our ambition is that the explanatory variable, test score, should explain the outcome variable, income. In the continuous case, the dependent variable follows a normal distribution conditioned on the test score. To generate the continuous data we use the exact same procedure as presented earlier in section 2.1. Hence, the data will follow the same structure as the first two columns in Table 1. In the binary case, where there are only two possible outcomes, we can assume that the outcome variable corresponds to whether the respondent has a job or

(20)

not. In this case the possible outcomes are Yes (Employed, outcome variable = 1) and No (Unemployed, outcome variable = 0). A higher test score increase the odds of being employed. The binary data are generated from a multivariate normal distribution where the outcome variable is treated as the log-odds of success. The following mean and covariance matrix are used to generate the data:

µ = 50 0 ! and Σ = 25 2.5 0.25 ! .

The given covariance matrix gives a correlation between the two variables equal to one. Hence, all observations fall on a straight line. The first variable, with mean 50, is our explanatory variable. If we recall Equation 2 and 3 we conclude that β0 = -5 and β1 = 0.1. However, our dependent variable is supposed to be binary. What we have generated are the log-odds of being employed. We achieve the probability of being employed with the formula: P (Yi = 1) = eβ0+β1Xi 1 + eβ0+β1Xi = eOdds(Yi=1) 1 + eOdds(Yi=1),

where Xi referees to the test score for observation i. Now the values of the dependent variable lies between zero and one. Each value correspond to the probability that obser-vation i are employed. A higher value of the explanatory variable gives a value closer to one. To finally achieve our binary outcome variable, we sample n bernoulli trials with the probability of success equal to P (Yi = 1). An example of the data can be seen in Table 3. It is important to mention that we only observe the test score and whether the person is employed or not. We are however, interested in estimating the effect of test score on the log-odds of being employed. A logistic regression model with a logit link would have been sufficient had the data been complete.

(21)

Table 3: An example of how the binary data are generated.

TS Log-odds P(Yi = 1) Outcome

44.58 -0.54 0.37 0 47.47 -0.25 0.44 1 47.85 -0.21 0.45 0 48.61 -0.14 0.47 1 52.73 0.27 0.57 1 52.82 0.28 0.57 1 52.87 0.29 0.57 0 54.45 0.45 0.61 1 56.04 0.60 0.65 1 61.73 1.17 0.76 0

We are interested in estimating two parameters; the mean of the dependent variable and the effect of the explanatory variable on the dependent variable. In the presence of missing values in the continuous data, we will rely on the methods presented in section 2.2 and 2.3. In the continuous case, the true mean is µ = 50 and the true model that we try to estimate is:

Y = β0 + β1X,

where β1 = 1. The procedure is the same in the binary case, but with a logistic regression model to impute the missing values and estimate β1. The true mean is µ = 0.5 and the true model is:

log P (Y = 1) 1 − P (Y = 1) = β0+ β1X, where β1 = 0.1.

The sample size is set to 25, 100 and 400 and the simulation is performed both under a MCAR- and a MAR pattern. Three levels of missing data are used, 10%, 30% and 50%. Maximum likelihood based methods are compared with multiple imputation models. The missing values are removed from the complete data using two different procedures. With a MCAR pattern, the removed values are chosen randomly. The number of removed values are calculated as the sample size times the share of missing data. If the product of the sample size and the level of missing data is not equal to an integer, the number of missing values has been randomly sampled to any of the two closest integers. Under the MAR pattern, the probability of a missing value is calculated depending on the explanatory

(22)

variable with the formula: P (Answer|Xi) = log exp(Xi) exp(quantile∗(X) 4 1 + log exp(Xi) exp(quantile∗(X) 4 , (13)

where Xi is the value of the explanatory variable for observation i and quantile∗_(X) specifies at which value the probability of an answer should be equal to 0.5. For 50% missing data, the value is set to the median of X. For 10% and 30% missing data, the quantile is not as intuitive. Through a numerical calculation, a value which would give an average probability of answering equal to 90% and 70% is the 1.45%- and 18.8%-quantile. Hence, the 1.45%-quantile and 18.8%-quantile of the sample are the respective values which will give the probability of answering to 0.5. A lower value of Xi gives a smaller probability of answering and a higher value gives a larger probability. A missing value is assigned to a person if a Bernoulli trial with the calculated probability comes out as a fail. Since the probability of a missing value has to lie between zero and one for all values of test score, the number of missing values can not be controlled. There is therefore a small chance that no - or all - values are missing. However, the quantiles in Equation 13 is set such that the mean probability of a missing value is equal to the desired level of missing data. This ensures that the mean level of missing data across several generated datasets should be close to the requested level. 1,000 datasets are generated for each missing mechanism, data structure, sample size and level of missing data. 300 bootstrap replications are used to estimate the standard errors of the estimates from maximum likelihood. For multiple imputation, the number of imputed datasets, m, are set to 20.

3.3 Results

The results of the simulation study can be seen in Table 4 and 5. The tables are grouped on data structure, sample size, level of missing data and missing mechanism. For each grouping level, the mean estimate from multiple imputation (MI), maximum likelihood (ML) and listwise deletion (LD) of µ and β1 has been calculated. µ corresponds to the mean of the dependent variable. The mean width of the estimated 95% confidence interval are also shown. A 95% confidence interval of the point estimate should, on average, cover the true parameter in 95% of the time. The column coverage referees to the share of times when the estimated confidence interval covered the true parameter. Looking at Table 4,

(23)

we can see that the point estimates of multiple imputation and maximum likelihood are more or less equal to the true parameter (µN orm = 50, µBin = 0.5). The estimate only severely deviates from the true mean in the binary case with 30- and 50% missing data and a sample size of 25. Maximum likelihood is consistently more biased than multiple imputation on the binary data with a sample size of 25. For listwise deletion, the point estimates are biased under the MAR pattern and the bias is increasing with an increasing level of missing data. Looking at the results from the normal data, it can be seen that maximum likelihood seem to produce more narrow confidence intervals compared to multiple imputation, but the difference diminish with an increasing sample size. The width of the confidence interval on the binary data is similar for the two methods. With a sample size of 25 and a high level of missing data maximum likelihood produces wider confidence intervals. However, the difference is small and vanish when the sample size increase. For the two methods, a MAR pattern yields a wider confidence interval compared to a MCAR pattern, both on the normal data and on the binary data. Compared to multiple imputation and maximum likelihood, listwise deletion produce wider confidence intervals with a MCAR pattern and similar or more narrow intervals with a MAR patter. On the normal data with a sample size of 25, the coverage of multiple imputation lies between 94.9-96.4%. The coverage seem to stabilize around 95% when the sample size increase. For maximum likelihood on the normal data, the coverage lies around 95% on all sample sizes. Looking at the binary data, there are no distinct differences in coverage between multiple imputation and maximum likelihood with a sample size of 100 or 400. However, with a sample size of 25 and 50% missing data, the coverage of maximum likelihood drops to 90-91% which can be compared to the multiple imputation coverage of 94-95%. This is despite the fact that the corresponding confidence intervals of maximum likelihood is wider. However, the lower coverage rate is likely due to the larger bias. The coverage of listwise deletion is around 95% as long as the missing mechanism is MCAR. The biased estimations under the MAR pattern yields a much lower coverage, 0.6% in the most extreme case.

(24)

Table 4: Mean estimate of µ and the average width of the estimated 95% confidence interval, together with the mean coverage.

Settings µˆ Width of CI Coverage

Data Sample size Level Mechanism True µ MI ML LD MI ML LD MI ML LD

Normal 25 10% MAR 50 49.94 49.95 50.70 6.31 6.08 6.23 0.962 0.955 0.930 Normal 25 10% MCAR 50 49.99 49.99 49.99 5.99 5.82 6.11 0.949 0.946 0.947 Normal 25 30% MAR 50 50.08 50.08 51.52 7.02 6.71 6.81 0.957 0.946 0.860 Normal 25 30% MCAR 50 50.03 50.03 50.04 6.60 6.27 7.03 0.957 0.943 0.954 Normal 25 50% MAR 50 50.05 50.05 52.26 8.50 8.36 7.95 0.953 0.945 0.795 Normal 25 50% MCAR 50 49.92 49.92 49.92 7.81 7.35 8.67 0.964 0.956 0.963 Normal 100 10% MAR 50 50.00 50.00 50.61 2.93 2.90 2.92 0.959 0.957 0.865 Normal 100 10% MCAR 50 50.06 50.06 50.06 2.89 2.86 2.96 0.959 0.958 0.959 Normal 100 30% MAR 50 49.97 49.97 51.44 3.24 3.18 3.21 0.955 0.957 0.602 Normal 100 30% MCAR 50 50.03 50.03 50.03 3.12 3.07 3.36 0.950 0.948 0.952 Normal 100 50% MAR 50 50.05 50.05 52.45 3.90 3.76 3.77 0.953 0.946 0.293 Normal 100 50% MCAR 50 50.00 50.01 50.04 3.50 3.41 3.99 0.942 0.947 0.943 Normal 400 10% MAR 50 50.02 50.01 50.58 1.44 1.44 1.44 0.953 0.948 0.635 Normal 400 10% MCAR 50 50.00 50.00 50.00 1.43 1.42 1.46 0.945 0.948 0.944 Normal 400 30% MAR 50 50.00 50.00 51.47 1.59 1.58 1.58 0.944 0.942 0.048 Normal 400 30% MCAR 50 50.00 50.00 49.99 1.54 1.53 1.66 0.954 0.957 0.960 Normal 400 50% MAR 50 50.00 50.00 52.39 1.89 1.87 1.86 0.956 0.955 0.006 Normal 400 50% MCAR 50 50.00 50.00 50.00 1.72 1.70 1.97 0.953 0.946 0.952 Binary 25 10% MAR 0.5 0.499 0.498 0.514 0.443 0.440 0.452 0.949 0.948 0.958 Binary 25 10% MCAR 0.5 0.496 0.496 0.496 0.432 0.427 0.438 0.957 0.952 0.968 Binary 25 30% MAR 0.5 0.496 0.490 0.527 0.487 0.493 0.504 0.947 0.934 0.937 Binary 25 30% MCAR 0.5 0.487 0.486 0.488 0.480 0.488 0.507 0.960 0.953 0.971 Binary 25 50% MAR 0.5 0.492 0.478 0.547 0.555 0.586 0.596 0.938 0.899 0.906 Binary 25 50% MCAR 0.5 0.483 0.480 0.485 0.540 0.572 0.616 0.948 0.914 0.930 Binary 100 10% MAR 0.5 0.499 0.499 0.513 0.212 0.211 0.212 0.940 0.941 0.932 Binary 100 10% MCAR 0.5 0.501 0.501 0.501 0.208 0.207 0.209 0.949 0.946 0.957 Binary 100 30% MAR 0.5 0.500 0.499 0.533 0.243 0.243 0.238 0.954 0.953 0.921 Binary 100 30% MCAR 0.5 0.499 0.499 0.499 0.235 0.234 0.238 0.947 0.938 0.953 Binary 100 50% MAR 0.5 0.504 0.501 0.558 0.301 0.308 0.284 0.942 0.938 0.873 Binary 100 50% MCAR 0.5 0.500 0.501 0.501 0.273 0.276 0.284 0.945 0.940 0.939 Binary 400 10% MAR 0.5 0.499 0.499 0.512 0.104 0.104 0.104 0.956 0.954 0.942 Binary 400 10% MCAR 0.5 0.500 0.500 0.500 0.104 0.103 0.104 0.949 0.941 0.943 Binary 400 30% MAR 0.5 0.502 0.501 0.535 0.122 0.121 0.117 0.936 0.938 0.784 Binary 400 30% MCAR 0.5 0.499 0.499 0.499 0.117 0.116 0.118 0.954 0.956 0.959 Binary 400 50% MAR 0.5 0.500 0.499 0.557 0.153 0.154 0.139 0.944 0.946 0.630 Binary 400 50% MCAR 0.5 0.501 0.501 0.501 0.137 0.136 0.139 0.943 0.944 0.941

(25)

Table 5 shows the mean of the estimations of β1. On the normally distributed data, the mean of the point estimate is more or less equal to the true parameter (β1 = 1) for all methods and for all grouping levels. With a binary dependent variable and sample size 25, the mean estimations of β1 are all clearly deviating from the true parameter (β1 = 0.1). The deviation is decreasing with an increasing sample size and there are no distinct differences between any of the three methods. Looking at the width of the confidence interval on the normal data, it can be seen that maximum likelihood is consistently producing a more narrow interval than multiple imputation. However, the difference is very small. On the binary data with a sample size of 25, the width of the confidence intervals are varying a lot. With a MAR pattern, multiple imputation seems to produce a reasonable width with 10% missing data. With a MCAR pattern, the width seems to be plausible also with 30% missing data. With a MAR pattern or a higher level of missing data, the produced width of the confidence interval indicates convergence problems. The same problems seems to occur for all levels of missing data for maximum likelihood. With a larger sample size, the two methods produce reasonable widths, but the interval produced by multiple imputation are consistently more narrow. With an increasing sample size, the difference diminish. The width of the confidence intervals produced by listwise deletion are similar to those produced by multiple imputation, both on the normal and binary data. Looking at the coverage on the normal data, multiple imputation have a coverage around 95% on all grouping levels. For maximum likelihood, the coverage lies between 92.9-95.1% with a sample size of 25. With a higher sample size, the coverage lies betwen 93.8-95.7%. The coverage on the binary data are all above 97.3%. This, together with the biased estimations, indicates that the estimated confidence intervals are too wide. The coverage of multiple imputation lies around 95% with a sample size of 100 and above. For maximum likelihood, the coverage lies between 96.9-98.1% with a sample size of 100. This is likely due to the wider confidence intervals compared to multiple imputation. The coverage gets closer to 95% with a sample size of 400. Listwise deletion produce similar width as multiple imputation.

(26)

Table 5: Mean estimate of β1 and the average width of the estimated 95% confidence interval, together with the mean coverage.

Settings _β1ˆ _{Width of CI} _Coverage

Data Sample size Level Mechanism True β1 MI ML LD MI ML LD MI ML LD

Normal 25 10% MAR 1.0 1.007 1.004 1.004 1.039 1.029 1.020 0.950 0.938 0.944 Normal 25 10% MCAR 1.0 0.991 0.992 0.992 0.935 0.930 0.927 0.944 0.932 0.940 Normal 25 30% MAR 1.0 0.992 0.992 0.992 1.251 1.248 1.218 0.945 0.929 0.935 Normal 25 30% MCAR 1.0 1.000 1.000 1.000 1.088 1.075 1.066 0.950 0.943 0.949 Normal 25 50% MAR 1.0 0.990 0.988 0.988 1.547 1.703 1.489 0.952 0.945 0.945 Normal 25 50% MCAR 1.0 1.021 1.020 1.020 1.392 1.515 1.344 0.942 0.951 0.950 Normal 100 10% MAR 1.0 1.000 1.001 1.001 0.456 0.449 0.453 0.950 0.949 0.953 Normal 100 10% MCAR 1.0 1.004 1.004 1.004 0.423 0.418 0.422 0.953 0.949 0.950 Normal 100 30% MAR 1.0 1.002 1.002 1.002 0.542 0.527 0.534 0.960 0.939 0.955 Normal 100 30% MCAR 1.0 0.996 0.996 0.996 0.489 0.477 0.485 0.952 0.947 0.952 Normal 100 50% MAR 1.0 1.004 1.004 1.004 0.677 0.648 0.662 0.956 0.941 0.954 Normal 100 50% MCAR 1.0 0.994 0.994 0.994 0.588 0.571 0.580 0.954 0.945 0.956 Normal 400 10% MAR 1.0 1.001 1.001 1.001 0.221 0.218 0.220 0.953 0.956 0.957 Normal 400 10% MCAR 1.0 0.999 0.999 0.999 0.208 0.207 0.208 0.944 0.938 0.946 Normal 400 30% MAR 1.0 0.998 0.999 0.999 0.263 0.260 0.262 0.943 0.947 0.949 Normal 400 30% MCAR 1.0 0.999 0.999 0.999 0.238 0.235 0.236 0.963 0.959 0.965 Normal 400 50% MAR 1.0 0.999 0.999 0.999 0.323 0.317 0.320 0.947 0.949 0.957 Normal 400 50% MCAR 1.0 1.000 0.999 0.999 0.284 0.279 0.280 0.958 0.957 0.962 Binary 25 10% MAR 0.1 0.128 0.131 0.131 0.53 38.72 0.54 0.985 0.999 0.991 Binary 25 10% MCAR 0.1 0.115 0.115 0.115 0.44 18.82 0.44 0.985 1.000 0.986 Binary 25 30% MAR 0.1 0.173 0.263 0.263 72.40 131.83 255.83 0.997 0.999 0.999 Binary 25 30% MCAR 0.1 0.105 0.108 0.108 0.52 45.27 0.53 0.990 1.000 0.989 Binary 25 50% MAR 0.1 1.066 3.237 3.237 1701.30 207.74 5733.36 0.997 0.973 0.997 Binary 25 50% MCAR 0.1 0.594 0.674 0.674 899.26 135.24 2183.67 0.993 0.983 0.997 Binary 100 10% MAR 0.1 0.104 0.105 0.105 0.202 0.215 0.201 0.951 0.969 0.958 Binary 100 10% MCAR 0.1 0.107 0.107 0.107 0.191 0.203 0.191 0.956 0.967 0.958 Binary 100 30% MAR 0.1 0.105 0.106 0.106 0.239 0.260 0.238 0.961 0.974 0.963 Binary 100 30% MCAR 0.1 0.102 0.102 0.102 0.216 0.235 0.217 0.962 0.973 0.965 Binary 100 50% MAR 0.1 0.109 0.110 0.110 0.298 0.467 0.301 0.962 0.981 0.960 Binary 100 50% MCAR 0.1 0.109 0.110 0.110 0.266 0.308 0.268 0.952 0.975 0.959 Binary 400 10% MAR 0.1 0.102 0.102 0.102 0.096 0.097 0.096 0.950 0.958 0.957 Binary 400 10% MCAR 0.1 0.100 0.100 0.100 0.091 0.092 0.091 0.947 0.949 0.947 Binary 400 30% MAR 0.1 0.099 0.100 0.100 0.114 0.115 0.113 0.958 0.958 0.962 Binary 400 30% MCAR 0.1 0.100 0.100 0.100 0.104 0.105 0.103 0.952 0.955 0.946 Binary 400 50% MAR 0.1 0.103 0.103 0.103 0.140 0.144 0.140 0.943 0.951 0.952 Binary 400 50% MCAR 0.1 0.101 0.101 0.101 0.125 0.127 0.123 0.945 0.953 0.952

(27)

3.4 Conclusions

The results shows that multiple imputation and maximum likelihood perform differently under some circumstances. However, with one exception, there are no crucial differences in point estimations. The exception is on the binary data when the sample size is 25. Under that sample size, both methods gives a point estimate of the mean which is lower than the true parameter value. This is most likely due to the problems of estimating the effect, which can be seen in Table 5. The point estimate from maximum likelihood is most biased. However, it is difficult to draw definite conclusions from these results since the multiple imputation estimate is also biased. The main difference between the two compared methods is when it comes to estimating the width of the confidence intervals. On the normal data, the width of the confidence interval of the mean for multiple im-putation is always wider than the corresponding interval from maximum likelihood. The difference is most distinct with a sample size of 25. Due to time limitations, it has not been possible to further investigate the significance of this difference. The situation is the opposite on the binary data with a sample size of 25 and 100, when multiple imputation has the most narrow intervals.

When it comes to estimate β1, there are also differences in the width of the confidence intervals. On the normal data, it seems like maximum likelihood produce more narrow confidence intervals compared to multiple imputation. Again, the significance of the difference has not been evaluated, but the results talks in favour of multiple imputation. It is clear that a sample size of 25 (with some missing values) is too low to estimate a logistic regression model. These results also supports the results in Meeyai (2016), where a minimum sample size of 40 is suggested with 10% missing data. The main advantage of multiple imputation is that it seems to perform well with a sample size of 100, while maximum likelihood needs a sample size of 400 to perform as you would expect.

Reconnecting to the research questions, the results suggest that multiple imputation should in general be preferred in front of maximum likelihood. Because of the thin-ner confidence intervals, there could be an idea to use maximum likelihood if the only interest is to estimate the mean of a normally distributed variable with missing data. However, the coverage of multiple imputation is still close to 95%, why it could still be used. Changing the number of imputed datasets in multiple imputation or increasing the

(28)

number of bootstrap replications in the maximum likelihood estimation may also lead to smaller differences between the methods. If it is shown that multiple imputation signifi-cantly produce wider intervals than maximum likelihood, it could be caused by problems discussed in Hughes et al. (2016) or McNeish (2017), but its difficult to say. The results are also indicating that an increasing sample size leads to a smaller difference between the two methods. Hence, with a sample size of 400 and above, the two methods are more or less equivalent. Although the set-up of the study in McNeish (2017) differs from this study, the conclusion to prefer multiple imputation in front of maximum likelihood is the same.

4 Application to a Real Dataset

This part of the thesis will show how the presented methods can be applied in practice. The Swedish enrollment registry is a register containing information from the enrollment to the Swedish military service. The register holds information about all enrolled men born between 1951 and 1978. All observations are anonymized but includes information about month- and year of birth. In total, the dataset includes 148 variables, ranging from vision problems to how eligible the person is to serve as an officer. To resemble a true research question, the enrollment registry will be treated as the population. Assume that it is of interest to estimate the mean height in the population and how much the weight affects the height. Given the interest, the researcher would draw a random sample from the population and collect the weight and height. By calculating the sample mean and by regressing the height on the weight, the researcher would get an estimate of the mean height as well as the effect of the weight. To follow the simulation study as much as possible, a binary outcome variable is also created. It takes the value 1 if observation i is above the mean height in the population and 0 if below. The mean of this variable corresponds to the share of the population who are above the mean height. The effect of the weight on the binary outcome variable can be estimated with a logistic regression model.

We will repeatedly draw samples of size 25, 100 and 400 from the 1978 cohort in the enrollment registry. The only variables that are collected is the weight and the height. The settings will be identical to the settings in the simulation study. 10-, 30- or 50%

(29)

missing values are assigned to the height variable. Under the MCAR pattern, the missing values are randomly assigned. Under the MAR pattern, the probability of a missing value is depending on the weight and is calculated with Equation 13. The mean height in the 1978 cohort is 179.35 centimetres. Hence, the binary outcome variable will take the value 1 if the height is above 179.35. The true proportion of individuals above 179.35 centimetres is 0.491. Multiple imputation and maximum likelihood estimation are used to estimate a point estimate and a confidence interval for the mean and the estimated effect. By estimating a confidence interval for the two parameters, we should on average cover the true population mean in 95% of the samples. Just like in the simulation study, the methods will also be compared to listwise deletion. In the continuous case, the true model that we try to estimate is:

Y = β0 + β1X,

where β1 = 0.251. X is the weight and Y is the height. The procedure is the same in the binary case, but with a logistic regression model to impute the missing values and estimate β1. The true model is:

log P (Y = 1) 1 − P (Y = 1) = β0+ β1X,

where β1 = 0.074. X is the weight and Y is a binary variable taking the value 1 if the height is above 179.35 centimetres.

Table 6 and 7 shows the mean estimates and the coverage rate from 1,000 simulations. Comparing the two methods, the results are similar to the results in the simulation study. There are no distinct differences between the methods when it comes to the point estimates, but some differences in the width of the confidence intervals. A remarkably low coverage under the MCAR pattern indicates that the required distributional assumptions are not fulfilled. An even lower coverage rate under the MAR pattern is likely due to the strong correlation between the weight and height variable. The strong correlation has probably lead to a MNAR pattern, which no method of today can handle.

(30)

Table 6: Mean estimate of µ and the average width of the estimated 95% confidence interval, together with the mean coverage.

Settings µˆ Width of CI Coverage

Data Sample size Level Mechanism True µ MI ML LD MI ML LD MI ML LD

(31)

Table 7: Mean estimate of β and the average width of the estimated 95% confidence interval, together with the mean coverage.

Settings _βˆ _{Width of CI} _Coverage

Data Sample size Level Mechanism True β MI ML LD MI ML LD MI ML LD

(32)

5 Discussion

If we ignore the results and conclusions drawn earlier, there are general pros and cons for the presented methods. An advantage of multiple imputation over maximum likelihood is that it is more user friendly and therefore easier to implement. Maximum likelihood can be difficult to implement if there are auxiliary variables that explain the missing values but which should not be included in the final model. On the other hand, maximum likelihood always provides the same point estimate on a given data. Due to the random residuals and the sampling of µ∗ and Σ∗, the multiple imputation point estimate will never be exactly the same, even though the given data are the same. This thesis has however come to similar conclusions as McNeish (2017), which talks in favour of using multiple imputation in cases of missing data in small samples. The results also supports the conclusions drawn in Meeyai (2016), where a lower limit of the sample size is suggested in case of missing data with logistic regression. There are however other methods to impute binary data with. One could use different versions of predictive mean matching as presented by Rubin (1986), Little (1988) and Munnich and Rässler (2005). It should also be possible to use linear discriminant analysis or other methods of classification to impute the missing values with. If this would improve the analysis and how it should be implemented in a multiple imputation settings is for other studies to examine.

The results shows that listwise deletion gives biased estimates of µ under a MAR pattern. The coverage decreases when the sample size is increasing. This is due to the more narrow confidence intervals that comes with a larger sample. With a biased estimator, a larger sample does not always help, which is seen is this case. The estimation of the mean is unbiased under a MCAR pattern, but the wider confidence intervals makes it inappropriate compared to multiple imputation and maximum likelihood. In a case where it is of interest to compare the mean of two populations or between two cohorts -listwise deletion requires a much larger difference to conclude that there is a statistically significant difference between the two populations. If the only interest is to estimate β1, the results indicates that listwise deletion is just as good as multiple imputation. However, multiple imputation or maximum likelihood are likely to perform better if there are more than one explanatory variable.

(33)

Looking at the general performance across all settings, the results of this thesis suggest that - with small samples - multiple imputation are to be preferred in front of maximum likelihood. Yet, the significance of the difference in performance has not been evaluated and further studies are needed to draw definite conclusions and recommendations. A higher number of simulations is one way to further investigate the performance, but some changes in design are also needed. Adding more covariates in the models and compare the methods under hypothesis testing are two ways to expand the study. Applying the methods to real data is also important. One interesting thing to do in an applied study would be to look at published articles and investigate if the results change depending on the missing data method. This leads to the final suggestion of this thesis. In the presence of missing data, a sensitivity analysis should always be performed. By comparing the results of different missing data methods, one will get an idea of the importance of the missing values.

(34)

References

Anderson, T. W. (1957). Maximum likelihood estimates for a multivariate normal dis-tribution when some observations are missing. Journal of the American Statistical Association 52(278), 200–203.

Beale, E. and R. Little (1975, 09). Missing values in multivariate analysis. Journal of the Royal Statistical Society: Series B (Methodological) 37, 129–145.

Canty, A. and B. D. Ripley (2019). boot: Bootstrap R (S-Plus) Functions. R package version 1.3-22.

Davison, A. C. and D. V. Hinkley (1997). Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press. ISBN 0-521-57391-2.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39(1), 1–38.

Edgett, G. L. (1956). Multiple regression with missing observations among the indepen-dent variables. Journal of the American Statistical Association 51 (273), 122–131. Efron, B. (1979, 01). Bootstrap methods: Another look at the jackknife. Ann.

Statist. 7(1), 1–26.

Eliason, S. R. (1993). Maximum likelihood estimation: logic and practice, Volume no. 07-096. Newbury Park, Calif: Sage.

Endners, Craig, K. (2010). Applied Missing Data Analysis. New York: Guilford Publi-cations.

Graham, J. and J. Schafer (1999). On the performance of multiple imputation for mul-tivariate data with small sample size. In R. H. Hoyle (Ed.), Statistical strategies for small sample research, pp. 1–29. Thousand Oaks, Calif;London;: Sage Publications. Harrell Jr., F. E., K. L. Lee, R. M. Califf, D. B. Pryor, and R. A. Rosati (1984). Regression

modelling strategies for improved prognostic prediction. Statistics in Medicine 3 (2), 143–152.

(35)

Hartley, H. O. and R. R. Hocking (1971). The analysis of incomplete data. Biomet-rics 27(4), 783–823.

Hughes, R., J. Sterne, and K. Tilling (2016). Comparison of imputation variance estima-tors. Statistical Methods in Medical Research 25 (6), 2541–2557. PMID: 24682265. Kenward, M. G. and G. Molenberghs (1998, 09). Likelihood based frequentist inference

when data are missing at random. Statist. Sci. 13 (3), 236–247.

Little, R. J. A. (1988). Missing-data adjustments in large surveys. Journal of Business Economic Statistics 6(3), 287–296.

Marszalek, J., C. Barber, J. Kohlhart, and C. Holmes (2011, 04). Sample size in psycho-logical research over the past 30 years. Perceptual and motor skills 112, 331–48. Martin, A. D., K. M. Quinn, and J. H. Park (2011). MCMCpack: Markov chain monte

carlo in R. Journal of Statistical Software 42 (9), 22.

McNeish, D. (2017). Missing data methods for arbitrary missingness with small samples. Journal of Applied Statistics 44(1), 24–39.

Meeyai, S. (2016, 01). Logistic regression with missing data: A comparison of handling methods, and effects of percent missing values. Journal of Traffic and Logistics Engi-neering.

Munnich, R. T. and S. Rässler (2005). Prima: A new multiple imputation procedure for binary variables. Journal of Official Statistics 21 (2), 325–341.

Nelder, J. and R. Mead (1965, 01). A simplex method for function minimization. Com-puter J. 7, 308–313.

Orchard, T. and M. A. Woodbury (1972). A missing information principle: theory and applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statis-tics and Probability, Volume 1: Theory of StatisStatis-tics, Berkeley, Calif., pp. 697–715. University of California Press.

Peduzzi, P., J. Concato, E. Kemper, T. R. Holford, and A. R. Feinstein (1996). A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 49(12), 1373–1379.

Missing Data - A Gentle Introduction