Comparing Variable Selection Algorithms On Logistic Regression – A Simulation

Kevin Singh Sandhu

Bachelor’s thesis in Statistics

Supervisor Philip Fowler

2021


Comparing Variable Selection Algorithms On Logistic Regression – A Simulation

Abstract

When we try to understand why some schools perform worse than others, whether Covid-19 has struck some demographics harder than others, or whether income correlates with increased happiness, we may turn to regression to better understand how these variables are related. To capture the true relationships between variables we may use variable selection methods, in order to ensure that the variables which have an actual effect are included in the model.

Choosing the right method for variable selection is vital. Without it there is a risk of including variables which have little to do with the dependent variable, or of excluding variables that are important. Failing to capture the true effects would paint a picture disconnected from reality and give a false impression of the relationships under study. To mitigate this risk, a simulation study has been conducted to find out which variable selection algorithms to apply in order to make more accurate inference. The algorithms tested are stepwise regression, backward elimination and lasso regression. Lasso performed worst when applied to a small sample but best when applied to larger samples. Backward elimination and stepwise regression had very similar results.

Keywords: stepwise regression, lasso regression, backward elimination, Monte Carlo simulation, replication crisis.


Contents

1. Introduction
2. Background
2.1 Lasso Regression
2.2 Stepwise and Backward Elimination
3. Method
3.1 Software
4. Results
5. Discussion
6. Bibliography


1. Introduction

The field of statistical analysis is an important one. It can give us key insights into understanding our world and allows us to make qualified guesses about where it is heading. These guesses can then lay the foundation for everything from policy proposals to deciding who should be given the Covid-19 vaccine first. A statistical model's ability to support good analysis is tied to how well it captures reality. This often means that the method used to model reality must capture the relevant variables and the strength of the relationships between them.

The challenge of correctly including relevant variables and excluding noise variables can be addressed by applying variable selection algorithms. However, many researchers instead use their own expertise when deciding which variables to include. Variable selection algorithms can be very useful when subject knowledge is limited, or to capture effects which were not previously seen as relevant. Also, when there are many variables, a variable selection algorithm may be essential, as it can be hard for humans to accurately capture the relevant relationships in highly complex circumstances (Smith, 2018). If these variable selection methods are incorrectly applied, problems like overfitting and underfitting may occur in the model.

The aim of this study is to compare variable selection methods in order to further understand their limitations, building on previous research and providing new information about the usage of these algorithms. The methods will be applied to logistic regression because it is a commonly used method within statistical research: logistic regression is widely used in machine learning, the social sciences and medical research. Because of its importance, it is of great interest to understand which variable selection algorithms are best suited when one wants to apply logistic regression.

The methods which will be examined are stepwise regression, backward elimination and lasso regression. The motivation for including stepwise regression is that it is a commonly used method; part of the reason is its ease of use and the fact that it is included in most standard statistical software packages (Grogan and Elashoff, 2016). The same applies to backward elimination. Because these methods are commonly used, it is of interest to build further on the research into their limitations, to better understand when they should be used and when other methods are more suitable. Lasso regression is also studied because the method achieved good results in another study (Grogan and Elashoff, 2016).

In this study, all variable selection algorithms will be tested on their ability to accurately select variables depending on sample size and estimator strength. With a randomly generated dataset, unique in its structure, this simulation will shed some more light on how well stepwise regression and backward elimination perform compared to lasso regression. It will also provide further guidance in choosing the right variable selection method when conducting statistical analysis with logistic regression, in order to get the best possible results.

In the next section, the importance of the research area is explained further, followed by an explanation of how stepwise regression, backward elimination and lasso regression work, and of some problems these methods face. After that, the study goes into more detail regarding how the simulations have been constructed. Finally, the results from the simulation are presented, followed by a discussion of the best method and of the limitations of the study.

2. Background

The task of a variable selection algorithm is to correctly identify which variables should be included in the model. This largely means avoiding both underfitting and overfitting.

Underfitting occurs when variables that should be in the model are not included. This may result in a poor model, as it will be unable to uncover the structure of the data. If important relationships between variables are left out, useful information goes missing and the model loses some of its explanatory ability. Underfitting can also result in biased estimates of the coefficients (Everitt and Skrondal, 2010).

The other problem is the importance of excluding noise variables. If too many variables are included in the model, several problems may occur. One such problem is overfitting, which occurs when a model tailored to one specific dataset performs badly when applied to another dataset. The problem is caused by including variables which have no real relationship with the dependent variable (Everitt and Skrondal, 2010).

In order to improve variable selection and minimize the risk of overfitting, Burnham and Anderson (2002) proposed that some bias could be added to the estimators in order to improve the model's out-of-sample performance. This results in more stable parameter estimates across different samples, because the variance decreases; this is one side of the bias-variance trade-off. By increasing bias in this way the model becomes less flexible, which results in better performance when the model is applied to new datasets (Kohavi and Wolpert, 1996).

This means that the risk of problems associated with overfitting, such as misleading R-squared values and regression coefficients, is reduced (McNeish, 2015). One model built on this premise is lasso regression (Tibshirani, 1996). It is a method which has commonly been used for variable selection, as it mitigates the risk of overfitting (Tibshirani, 1996). Partly for this reason, lasso is generally considered a good method for variable selection, and it has therefore been included in this study (Musoro et al., 2014). Backward elimination and stepwise regression do not lower variance in the same way as lasso, and both methods are known to have problems associated with overfitting (McNeish, 2015). Their problems do not end there: stepwise regression is also known to inflate type-1 error rates (Wilkinson, 1979), to handle correlated variables poorly (Grafen and Hails, 2002) and to exclude variables which have a causal effect on the dependent variable (Smith, 2018).

Given this information and the literature on the limitations of stepwise methods, one might think that lasso will always be the better model. However, Hastie et al. (2020) found that stepwise algorithms sometimes performed better than lasso, which goes to show that there is still information to be uncovered.

However, what does all this lead to? A model is created and perhaps the risk of both underfitting and overfitting is high, but what are the actual consequences? Grogan and Elashoff (2016) explain how today's crisis of non-reproducible research may be connected to the problems of bad statistical modelling, namely underfitting and overfitting. Reproducible research means that one could replicate a previous study with the same method and get similar results; after a few accurate replications the information can be regarded as scientific knowledge. The problem of non-reproducible research seems to affect every scientific branch, from psychology to physics. Hardest hit seem to be the social sciences, and the consequences can be disastrous (Blaszczynski and Gainsbury, 2019). One example from the USA is that some regulations against pollution are being rolled back with the argument that the underlying research is not reproducible (Oreskes, 2018). The problem thus has political consequences, but they do not end there: non-reproducible research may waste large amounts of research funding, and it can also harm the general public's perception of science. If research can no longer be trusted, where does that leave us? In a world full of questions with no one to answer them.

In many areas of research, scientists are not familiar with the limitations of statistical methods and may use them when they are not suitable. This is an important issue considering that methods such as stepwise regression are the dominant variable selection algorithms applied in epidemiology (Walter and Tiemeier, 2009). The question which must be answered is: when is it not the best option? In this paper some problems with stepwise regression and backward elimination have been stated, but more information about their limitations is needed, as it may give researchers more knowledge about which methods to apply and, even more importantly, when to apply them. It is also of great importance to gain more information about when methods such as lasso regression may be a suitable alternative.

2.1 Lasso Regression

One method commonly used for variable selection is lasso regression (Least Absolute Shrinkage and Selection Operator). It works by performing regularization, a technique applied to avoid overfitting. It does this by reducing the number of parameters and thus simplifying the model, which often makes interpretation easier. It adds bias to the estimated parameters, which results in more stable parameter estimates across different samples because the variance decreases. This means that the model will be better at generalizing (Glen, 2015).

Lasso applies L1 regularization, adding a penalty equal to the absolute value of the magnitude of the coefficients. By doing this it limits the size of the coefficients, and some variables can be eliminated entirely because their coefficients are set to exactly zero. The strength of the penalty is controlled by a tuning parameter, often called λ. A higher λ means that more variables will be eliminated, as the penalty is larger (Glen, 2015).
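For reference, this penalized estimation problem can be written out explicitly. The formula below is the standard lasso-penalized logistic regression objective (it is not stated in this form in the original text): the first term is the average negative log-likelihood of the logistic model and the second term is the L1 penalty scaled by λ.

$$\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;\left\{-\frac{1}{n}\sum_{i=1}^{n}\Big[y_i\,(\beta_0 + x_i^{\top}\beta) - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big)\Big] + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\right\}$$

Setting λ = 0 recovers ordinary maximum-likelihood logistic regression, while increasing λ shrinks more coefficients exactly to zero.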

In this simulation study λ is decided by cross-validation.
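As the thesis's own code is only available on request, the following is a minimal sketch of how such a cross-validated lasso fit could look in R with glmnet; the toy data, the object names and the choice of lambda.min (rather than lambda.1se) are illustrative assumptions.

library(glmnet)   # lasso via coordinate descent (Friedman et al., 2010)

# toy data: x is a matrix of candidate predictors, y a binary 0/1 response
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5, dimnames = list(NULL, paste0("X", 1:5)))
y <- rbinom(100, 1, plogis(x[, 1]))

# cross-validated lasso for logistic regression (alpha = 1 gives the L1 penalty)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# coefficients at the cross-validated lambda; lasso "selects" the non-zero ones
coefs    <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")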


2.2 Stepwise and Backward Elimination

One of the most commonly used methods for variable selection is stepwise regression (Grogan and Elashoff, 2016). The stepwise methods which have been applied in this paper are stepwise regression with dual direction and backward elimination. Stepwise regression with dual direction works by starting with a limited part of the total model. After that, it adds and removes explanatory variables at every iteration, trying different combinations of variables until the AIC, or a similar criterion, can no longer be improved (Efroymson, 1960).

One contrast between backward elimination and dual direction is that, with dual direction, variables which have been deleted can be re-introduced at a later iteration. Backward elimination works by starting with the full model and removing variables from there until it finds the model with the lowest AIC. Stepwise regression with dual direction also typically uses AIC to find the best model. AIC is short for Akaike Information Criterion and is a statistic which reflects how well a model balances fit against complexity: a lower AIC indicates a better model, and both algorithms keep trying changes to the set of variables until no change lowers the AIC further.
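For a model with k estimated parameters and maximized likelihood L̂, AIC = 2k − 2 ln(L̂), so a variable is only kept if it improves the likelihood enough to offset the penalty on model size. As a rough sketch of how the two procedures can be run with the step function from R's stats package (the data frame dat and the variable names are assumptions, not the thesis's actual code):

# full and empty logistic regression models; dat is an assumed data frame with Y and X1-X10
full  <- glm(Y ~ ., data = dat, family = binomial)
empty <- glm(Y ~ 1, data = dat, family = binomial)

# backward elimination: start from the full model and only drop terms (AIC-based by default)
back <- step(full, direction = "backward", trace = 0)

# dual-direction stepwise: start small, then add and remove terms until the AIC stops improving
both <- step(empty, scope = formula(full), direction = "both", trace = 0)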

Some problems with stepwise regression methods that have not been mentioned previously are that the estimated coefficients are biased (Tibshirani, 1996) and that the methods produce confidence intervals for effects and predicted values which are falsely narrow (Altman and Andersen, 1989).

3. Method

To examine the strength of each variable selection algorithm, a simulation with 1000 repetitions was conducted. In this study, strength means how well each method selects the relevant variables. The simulation was run on three different sample sizes: 50, 250 and 1000. By varying the sample size, one can attain more information about when applying these methods is more or less suitable.

This simulation was based on 10 variables and an intercept. The 10 variables were randomly generated from normal distributions with no multicollinearity and with different means and standard deviations; the means and standard deviations can be seen in Table 1. Among these 10 there are three variables which have an effect on the dependent variable. These are marked with an asterisk (X1*, X2* and X3*) so that one can easily distinguish them from the noise variables (X4, X5, X6, X7, X8, X9, X10). In the equation below n is the sample size. The equation which the probability of Y is built on is:

Equation 1.

$$\text{equation} = 1.3\left(X_1 - \frac{\sum X_1}{n}\right) + 1.5\left(X_2 - \frac{\sum X_2}{n}\right) + 1.8\left(X_3 - \frac{\sum X_3}{n}\right)$$

Y is a binary variable from the binomial distribution. The probability is defined as:

Equation 2.

$$P = \frac{1}{1 + e^{-\text{equation}}}$$

Table 1. Mean and standard deviations for all variables generated in the simulation

Variables Mean Sd

X1* 60 1

X2* 18 1

X3* 50 1

X4 85 4

X5 100 6

X6 70 5

X7 24 2

X8 45 3

X9 30 2

X10 16 1

In the simulation, stepwise regression and backward elimination selected the relevant variables by finding the model with the smallest AIC. The variables chosen by lasso were those whose coefficients were non-zero.
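To make the data-generating process concrete, the sketch below shows one possible way to implement a single iteration in R, following the description above (coefficients 1.3, 1.5 and 1.8 on the centred X1*, X2* and X3*, and the means and standard deviations from Table 1). The object names are illustrative; this is not the thesis's actual code.

# means and standard deviations from Table 1 (X1-X3 are the true regressors)
means <- c(60, 18, 50, 85, 100, 70, 24, 45, 30, 16)
sds   <- c( 1,  1,  1,  4,   6,  5,  2,  3,  2,  1)
n     <- 50                                  # also run with 250 and 1000

# ten independent normally distributed predictors, i.e. no multicollinearity
set.seed(2021)
X <- sapply(1:10, function(j) rnorm(n, means[j], sds[j]))
colnames(X) <- paste0("X", 1:10)

# Equation 1: linear predictor built from the centred true regressors
eta <- 1.3 * (X[, 1] - mean(X[, 1])) +
       1.5 * (X[, 2] - mean(X[, 2])) +
       1.8 * (X[, 3] - mean(X[, 3]))

# Equation 2: logistic probability, then a binary outcome
p   <- 1 / (1 + exp(-eta))
Y   <- rbinom(n, 1, p)
dat <- data.frame(Y, X)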


3.1 Software

All simulations and analyses were conducted in R version 3.1.2 (http://www.R-project.org).

The simulations of stepwise regression and backward elimination used the step function from the base R package stats (R Core Team, 2020). Lasso regression was applied with the cv.glmnet function from the R package glmnet (Friedman et al., 2010).

Code is available on request.

4. Results

In Table 2 the regressors are X1*, X2* and X3* whilst the remaining variables are noise.

Correct in this context means the proportion of iterations in which a method identified the regressors as regressors and the noise variables as noise. If a regressor is excluded, that counts as a failure; if a noise variable is selected into the model, that also counts as a failure.
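As a small illustration of how these proportions could be tallied (again a sketch with assumed object names, not the thesis's code), suppose selected_list holds, for one method, the vector of variable names selected in each iteration:

# two example iterations; in the study there were 1000 per method and sample size
selected_list <- list(c("X1", "X2", "X3", "X5"), c("X2", "X3"))

true_vars  <- paste0("X", 1:3)
noise_vars <- paste0("X", 4:10)

# share of iterations in which each true regressor was (correctly) included
include_rate <- sapply(true_vars,  function(v) mean(sapply(selected_list, function(s) v %in% s)))

# share of iterations in which each noise variable was (correctly) excluded
exclude_rate <- sapply(noise_vars, function(v) mean(sapply(selected_list, function(s) !(v %in% s))))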

In the first simulation (n = 50) lasso had some difficulties accurately choosing the explanatory variables (X1*, X2*, X3*). Out of 1000 iterations it failed to include X1* 207 times, meaning that lasso selected X1* correctly in 79.3% of the iterations. X2* and X3* fared a bit better, with success rates of 86.9% and 94.8% respectively. In line with the hypothesis, the stronger the effect an explanatory variable has on Y, the better lasso is at including it in the model; the effects on Y from X1*, X2* and X3* were 1.3, 1.5 and 1.8. For comparison, the hit rate on the noise variables was between 88.3% and 89.9%, a slight improvement over X1* and X2* but a bit worse than X3*. For the smallest sample size, lasso thus performed best at selecting X3*, which may be connected to that variable having the largest effect on the dependent variable; apart from that, lasso was better at discarding noise variables than at including X1* and X2*. With a larger sample size, lasso performed much better at selecting the true regressors.


With a sample size of 250, lasso accurately selected X1*, X2* and X3* 100% of the time. There was also a noticeable improvement on the noise variables, which were excluded with a success rate ranging from 92.2% to 94.2%. There is no visible pattern which might explain why some noise variables had a higher success rate; this is likely due to chance.

When lasso was applied to the largest sample, the success rate for the true regressors was again 100%, identical to the sample size of 250. Lasso also performed slightly better at excluding noise variables from the model, with a hit rate between 96.4% and 98%. This was the best result in the study.

Starting with the smallest sample, stepwise regression performed quite well at including the true regressors in the model. The success rates for accurately selecting X1*, X2* and X3* were 93.5%, 98% and 99.5% respectively. The same pattern is distinguishable here: the regressor with the largest effect on Y had the highest probability of being included in the model. While stepwise performed better than lasso at including the true regressors, it fared worse at excluding the noise variables; the success rate ranged between 77.5% and 80.4%, which was worse than lasso (88.3-89.9%). With a larger sample size stepwise performed better, achieving a success rate of 100% for all true regressors, and it also improved at excluding the noise variables. Despite this improvement, the method still performed worse than lasso at the same stage: while both included the regressors at every iteration, lasso was better at excluding the noise variables.

When applied to the largest sample, stepwise again included all true regressors, but there was no noticeable improvement in excluding the noise variables compared to a sample size of 250.

Backward elimination achieved almost identical results to stepwise regression, with only a small deviation at the smallest sample size (n = 50; see Table 2). Both methods performed better than lasso when applied to the smallest sample, but when the sample size increased lasso performed better.


Table 2. Results from the simulation over all sample sizes displayed in share of correctly selected variables

n = 50

X1* X2* X3* X4 X5 X6 X7 X8 X9 X10

Backward 93.4% 98% 99.5% 79.6% 77.5% 78.9% 78.4% 79.2% 80.5% 78.4%

Stepwise 93.5% 98% 99.5% 79.6% 77.5% 78.9% 78.3% 79.2% 80.4% 78.3%

LASSO 79.3% 86.9% 94.8% 88.3% 88.5% 89.9% 89.5% 89% 88.9% 89.4%

n = 250

X1* X2* X3* X4 X5 X6 X7 X8 X9 X10

Backward 100% 100% 100% 83.5% 84.5% 82.7% 85.4% 82.7% 82.6% 83.8%

Stepwise 100% 100% 100% 83.5% 84.5% 82.7% 85.4% 82.7% 82.6% 83.8%

LASSO 100% 100% 100% 93.2% 93.3% 92.4% 93.8% 92.7% 92.2% 94.2%

n = 1000

X1* X2* X3* X4 X5 X6 X7 X8 X9 X10

Backward 100% 100% 100% 84.9% 83.2% 83.8% 84.1% 83.6% 83.7% 83.9%

Stepwise 100% 100% 100% 84.9% 83.2% 83.8% 84.1% 83.6% 83.7% 83.9%

LASSO 100% 100% 100% 98% 96.8% 96.5% 97% 97.2% 96.9% 96.4%

Note: Percentages reflect how often each method selected the correct variables. For X1*, X2* and X3* this is the share of iterations in which the variable was included in the model; for the remaining variables it is the share of iterations in which they were excluded from the model.


5. Discussion

We have found that stepwise regression and backward elimination performed better than lasso when applied to a small sample, and the two gave almost identical results. Lasso performed better when the sample size increased, and it kept improving as the sample grew. Stepwise and backward elimination improved moderately between the smallest sample size of 50 and a sample size of 250, but beyond that point there was no clear indication of further improvement. This study has thus demonstrated that, when applying variable selection algorithms to logistic regression with a dataset consisting of few independent variables, lasso performed better when the sample size was large, whereas stepwise regression and backward elimination performed better for a small sample size. This is under the condition of no multicollinearity.

For further studies, the simulation could be conducted under multicollinearity and with varying numbers of variables. It would also be of interest to analyze how each algorithm performs when applied to massive datasets with many variables. A big data application is of interest due to the increasing importance of the field and the fact that stepwise models are commonly used there despite their limitations (Smith, 2018).

The dataset generated in this simulation consists of 10 independent variables drawn from normal distributions, and there is no multicollinearity in the model. These conditions are uncommon in real data, and because of this the study provides limited information. The stepwise regression and backward elimination procedures used the AIC, which is one possible criterion; other criteria which have not been explored are BIC and AICc.

This simulation compared three methods, but there are many others which may be of interest to study, for example elastic net, neural networks, gradient boosting and support vector machines.


6. Bibliography

Altman, D. G. and Andersen, P. K. (1989). Bootstrap investigation of the stability of a Cox regression model. Statistics in Medicine, 8: 771-783.

Blaszczynski, A., Gainsbury S.M. (2019). Editor’s note: replication crisis in the social sciences, International Gambling Studies, 19(3): 359-361.

Burnham, K.P. and Anderson, D.R. (2002). Model Selection and Inference: A Practical Information-Theoretic Approach. 2nd Edition, Springer-Verlag, New York.

Derksen, S., & Keselman, H. J. (1992). Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45(2): 265–282.

Efroymson, M. A. (1960). Multiple Regression Analysis. In Ralston, A. and Wilf, H. S. (eds.), Mathematical Methods for Digital Computers. John Wiley, New York.

Everitt B.S., Skrondal A. (2010). Cambridge Dictionary of Statistics, Cambridge University Press, New York, Fourth Edition, 310-362

Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1): 1-22.

Glen, S. (2015). Lasso Regression: Simple Definition. Elementary Statistics for the rest of us, 24 September, [online] available at <https://www.statisticshowto.com/lasso-regression/> [Accessed 6 May 2021].

Grafen, A. and Hails, R. (2002). Modern Statistics for the Life Sciences. Oxford University Press, Oxford.

Grogan, T.R. and Elashoff, D.A. (2017). A simulation based method for assessing the statistical significance of logistic regression models after common variable selection procedures.

Hastie, T., Tibshirani. R. and Tibshirani, RJ. (2020). Rejoinder: Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons, Statistical Science, 35(4): 625-626

Kohavi, R. and Wolpert, D. (1996). Bias Plus Variance Decomposition for Zero-One Loss Functions. ICML

McNeish, D.M. (2015). Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences. Multivariate Behavioral Research, 50(5): 471-484.

Musoro, J.Z., Zwinderman, A.H., Puhan, M.A., ter Riet, G. and Geskus, R.B. (2014). Validation of prediction models based on lasso regression with multiply imputed data. BMC Medical Research Methodology, 14: 116.

Oreskes, N. (2018). Beware: transparency rule is a Trojan Horse, Nature, 557: 469

R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, URL: https://www.R-project.org/

Smith, G. (2018). Step away from stepwise. Journal of Big Data, 5: 32.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58(1): 267-288.

Walter, S., Tiemeier, H. (2009). Variable selection: Current practice in epidemiological studies, European Journal of Epidemiology, 24(12):733-6

Wilkinson, L. (1979). Tests of significance in stepwise regression. Psychological Bulletin, 86(1): 168–174.
