
ON THE JACKKNIFE AVERAGING OF GENERALIZED LINEAR MODELS

Submitted by
Valentin Zulj

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree in Statistics in the Faculty of Social Sciences

Supervisor
Shaobo Jin


Abstract

Frequentist model averaging has started to grow in popularity, and it is considered a good alternative to model selection. It has recently been applied favourably to generalized linear models, where it has mainly been used to aid the prediction of probabilities. The performance of averaging estimators has largely been compared to that of models selected using AIC or BIC, without much discussion of model screening. In this paper, we study the performance of model averaging in classification problems, and evaluate performances with reference to a single prediction model tuned using cross-validation. We discuss the concept of model screening and suggest two methods of constructing a candidate model set: averaging over the models that make up the LASSO regularization path, and the so-called LASSO-GLM hybrid. By means of a Monte Carlo simulation study, we conclude that model averaging does not necessarily offer any improvement in classification rates. In terms of risk, however, we see that both methods of model screening are efficient, and their errors are more stable than those achieved by the cross-validated model of comparison.

Key words: Model averaging, jackknife, generalized linear models, model screening, classification


Contents

1 Introduction
2 Generalized Linear Models
3 Jackknife Model Averaging
  3.1 Jackknife Weighting
  3.2 Selection of Candidate Models
    3.2.1 LASSO Regularization Path
    3.2.2 LASSO-GLM Hybrid
    3.2.3 Singleton Models
4 Monte Carlo Simulation Study
  4.1 Data Generation
    4.1.1 First Simulation Design
    4.1.2 Second Simulation Design
  4.2 Simulation Set-up
  4.3 Results
    4.3.1 First Design
    4.3.2 Second Design
5 Discussion
6 Conclusions
References
Appendices


1 Introduction

In statistical modelling, the issue of model selection has been studied far and wide. As a result, many methods for choosing a particular model have emerged, the most prevalent of which are Akaike's [1973] information criterion (AIC) and the Bayesian information criterion (BIC) of Schwarz [1978]. Model averaging, however, is an alternative modelling strategy, which entails using a weighted average of candidate models rather than predictions from one single model selected by some model selection criterion. The intuition of the model averaging approach is straightforward: it strives to make use of the models that are nearly as good as the one chosen in model selection, hoping that these candidate models will utilize parts of the signal that are not picked up by the selected model.

The strategy of model averaging has been widely researched, and both Bayesian and frequentist approaches have been developed. This paper concerns the frequentist branch of model averaging, although an overview of Bayesian advances in the field is given by Hoeting et al. [1999]. The literature regarding frequentist model averaging (FMA) is extensive, and the method has been applied to a wide array of statistical models, the main one being the ordinary linear regression model. Theoretical developments in weight estimation include the Mallows model average estimator of Hansen [2007] and the jackknife model averaging (JMA) estimator introduced by Hansen and Racine [2012]. A model averaging estimator constructed within the local asymptotic framework is presented in Hjort and Claeskens [2003], while averaging of regularized regression models is studied in Schomaker [2012] and Zhao et al. [2018].

Further, the use of FMA has started to gain traction in the context of generalized linear models (GLM). Averaging in the local asymptotic framework is exemplified by Charkhi et al. [2016], while the Kullback-Leibler loss is used to estimate the weights in Zhang et al. [2016] and Ando and Li [2017]. A review of the FMA framework is given in Mitra et al. [2019]. To our knowledge, however, no attempt has been made to use the jackknife model averaging estimator in the GLM setting, apart from the M-fold cross-validation utilized for discrete choice models in Zhao et al. [2019].

In Zhao et al. [2019] model averaging is used as a tool for estimating probabilities of success. To that end, efforts are made to prove that the averaging estimator attains the same asymptotic squared error as the infeasible best averaging estimator, and thus it is said to be optimal. The optimality, however, is only proved to be valid for the accuracy of probability predictions, and the paper does not raise any questions regarding the use of averaging estimators to solve classification problems. The application of averaging in the classification setting is relevant since binary regression models are often used to classify the outcome of categorical trials. Further, the performance of the averaging estimators is only compared to that of models chosen using AIC and BIC. The models of comparison are not cross-validated, or tuned for the purpose of prediction, so the comparison is not entirely fair since averaging estimators are favoured by design. Lastly, there is little discussion of model screening, or reduction of the model space. The use of vast model spaces poses particular problems to averaging estimators, since much computational efficiency is lost when too many candidate models have to be estimated. Seeing as the estimation of candidate models is the most computationally intensive part of the averaging procedure, a computationally efficient method for model screening is crucial in the application of averaging estimators to everyday problems.

Hence, we intend to study whether the optimality discussed above translates to the context of classification. Also, we strive to compare JMA to methods that are developed with the specific purpose of performing well in prediction, and, finally, to present and discuss a set of methods to use in model screening. In short, our contribution can be characterized by three points:

(i) Study the performance of JMA in classification problems

(ii) Compare JMA to approaches other than model selection using AIC or BIC

(iii) Develop a method for efficient model screening in JMA for GLM

The remainder of this paper is organized as follows. In Section 2, we provide a theoretical overview of generalized linear models, as well as a description of how logistic regression can be used for classification. In Section 3, we present the theory of jackknife averaging and weight estimation, and also discuss the selection of candidate models. In Section 4, we conduct a Monte Carlo simulation study, in which we simulate binary outcomes and use jackknife averaging of logistic regression models to classify the outcomes of the simulated individuals. In Section 5, we discuss the findings of our simulation study, and in Section 6 we state the conclusions we draw from the paper.

2 Generalized Linear Models

Generalized linear models (GLM) form a regression framework that is commonly used for modelling outcomes that are not normally distributed, but drawn from any one of the distributions that make up the exponential dispersion family. Suppose $y_i$, $i \in \{1, \dots, n\}$, is an outcome drawn from a distribution with probability density $f(y_i)$. Then, if $f(y_i)$ can be written as

$$
f(y_i) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} - c(y_i, \phi) \right\}, \qquad (1)
$$

the distribution is part of the exponential dispersion family. Here, $a_i(\cdot)$, $b(\cdot)$, and $c(\cdot)$ are all known functions, while $\theta_i$ is the so-called natural parameter. The GLM makes use of a link function, $g$, which connects the expectation of the outcome to some linear predictor $x_i^\top \beta$, where $x_i$ is a $p$-dimensional observation vector and $\beta$ is a vector of unknown coefficients. Let $E[Y_i] = \mu_i$; the link function is then applied so that $g(\mu_i) = x_i^\top \beta$. As per Nelder and Wedderburn [1972], a consistent estimate, $\hat{\beta}$, of $\beta$ can be found numerically using Fisher scoring. Thus, the estimate $\hat{\beta}$ can be used to estimate the mean of any exponential dispersion distribution.

For binary outcomes, the logistic regression model is commonly used to classify the response of certain individuals. Now, let $y_i$ be a binary outcome of a draw from the Bernoulli distribution having $\pi_i \in [0, 1]$ as the probability of success, meaning that $\mu_i = \pi_i$. Then, the logit link function can be employed within the framework presented above in order to connect the linear predictor to $\pi_i$, so that

$$
\log\left( \frac{\pi_i}{1 - \pi_i} \right) = x_i^\top \beta, \qquad \text{and} \qquad \pi_i = \frac{\exp\{x_i^\top \beta\}}{1 + \exp\{x_i^\top \beta\}}.
$$

The probabilities can be estimated as

$$
\hat{\pi}_i = \frac{\exp\{x_i^\top \hat{\beta}\}}{1 + \exp\{x_i^\top \hat{\beta}\}},
$$

where $\hat{\beta}$ is computed by use of the method presented above. Using $\hat{\pi}_i$, each observation can be classified as

$$
\hat{y}_i =
\begin{cases}
1, & \text{if } \hat{\pi}_i \geq 0.5 \\
0, & \text{otherwise.}
\end{cases}
$$
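In base R, the estimation and classification steps above can be sketched in a few lines; this is a minimal sketch, assuming a data frame dat with a binary response column y and the regressor columns (names are illustrative):

fit <- glm(y ~ ., family = binomial, data = dat)  # beta-hat via iterative Fisher scoring
pi_hat <- predict(fit, type = "response")         # estimated success probabilities
y_hat <- as.integer(pi_hat >= 0.5)                # the 0.5 cut-off classification rule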


3 Jackknife Model Averaging

Jackknife model averaging was first formulated by Hansen and Racine [2012] as an alternative to model selection in ordinary linear regression modelling. The method entails using a weighted average of predictions from a set of candidate models, rather than relying on a single set of predictions rendered by a model deemed to be the best by use of some model evaluation criterion.

Suppose there is a set, $\mathcal{M} = \{1, \dots, M\}$, of candidate models to average over, and that $X^{(m)}$ gives the design matrix for the specified model $m \in \mathcal{M}$. Further, let $X^{(m)}_{[-i]}$ denote the design matrix of the $m$:th model, from which the $i$:th observation has been removed, and $\hat{\beta}^{(m)}_{[-i]}$ be the coefficient vector obtained by fitting a logistic regression using $X^{(m)}_{[-i]}$. Then

$$
P_i^{(m)} = \frac{\exp\{x_i^{(m)\top} \hat{\beta}^{(m)}_{[-i]}\}}{1 + \exp\{x_i^{(m)\top} \hat{\beta}^{(m)}_{[-i]}\}}
$$

is the predicted probability of $Y_i = 1$, computed with the $i$:th observation removed. Also, define

$$
P^{(m)} = \left( P_1^{(m)}, \dots, P_n^{(m)} \right)^\top
$$

as a vector collecting predictions for all individuals, made using the $m$:th model. Then, the jackknife averaging estimator is given by

$$
P(w) = \sum_{m=1}^{M} w^{(m)} P^{(m)} = Pw, \qquad (2)
$$

where $w \in \mathcal{W}$ is a vector of weights whose $m$:th element determines the weight of the $m$:th candidate model, and $P = \left( P^{(1)}, \dots, P^{(M)} \right)$. The set, $\mathcal{W}$, from which the weights are chosen can be defined in several ways, placing different restrictions on the elements $w^{(m)}$. Here, the set is defined by the unit simplex $\mathcal{W} = \left\{ w \in [0,1]^M : \mathbf{1}^\top w = 1 \right\}$. Using the jackknife estimator given in (2), the jackknife residual is defined as

$$
e(w) = y - P(w) = \sum_{m=1}^{M} w^{(m)} e^{(m)} = ew,
$$

where $e^{(m)} = y - P^{(m)}$ and $e = \left( e^{(1)}, \dots, e^{(M)} \right)$ collects the jackknife residuals of all candidate models.


3.1 Jackknife Weighting

As mentioned above, the weights used in averaging predictions can be defined and acquired in a range of ways, the simplest being to give all candidate models equal weight. In jackknife weighting, however, a cross-validation criterion, depending on the weights, is formulated using the jackknife residual. The criterion is defined as

$$
CV_n(w) = e(w)^\top e(w) = w^\top S w, \qquad (3)
$$

where $S = e^\top e$ is an $M \times M$ matrix of squared errors. The rightmost component of (3) is a quadratic program, which can be efficiently minimized using existing algorithms, for instance the ipop function in the kernlab package written for R [Karatzoglou et al., 2004]. Hence, $w$ can be estimated as

$$
\hat{w} = \operatorname*{argmin}_{w \in \mathcal{W}} \; w^\top S w.
$$
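The whole weighting step can be sketched directly in R. The sketch below assumes a data frame dat with a binary response y and a list cand_formulas of candidate model formulas (both hypothetical names); it computes the leave-one-out probabilities by brute force, for clarity rather than speed, and then solves the quadratic program with kernlab's ipop:

library(kernlab)

## Leave-one-out predicted probabilities for one candidate model
jackknife_probs <- function(data, formula) {
  sapply(seq_len(nrow(data)), function(i) {
    fit <- glm(formula, family = binomial, data = data[-i, ])
    predict(fit, newdata = data[i, , drop = FALSE], type = "response")
  })
}

P <- sapply(cand_formulas, jackknife_probs, data = dat)  # n x M matrix of P^(m)
e <- dat$y - P                 # jackknife residuals, e^(m) = y - P^(m)
S <- crossprod(e)              # S = e'e, an M x M matrix

## ipop minimizes c'w + (1/2) w'Hw subject to b <= Aw <= b + r and l <= w <= u,
## so H = 2S recovers CV_n(w) = w'Sw on the unit simplex.
M <- ncol(P)
sol <- ipop(c = rep(0, M), H = 2 * S,
            A = matrix(1, 1, M), b = 1, r = 0,
            l = rep(0, M), u = rep(1, M))
w_hat <- primal(sol)           # estimated jackknife weights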

3.2 Selection of Candidate Models

There is no convention as to how the candidate models included in the set $\mathcal{M}$ should be chosen, and a variety of suggestions have been presented in previous papers. The simplest way of selecting candidate models is to use all distinct models that can be formed from the regressors at hand. Suppose, for instance, that each $x_i$ is an observation on $q$ variables. Then, the number of distinct models that can be formed, including an intercept-only model, is $2^{q+1}$. For large $q$, however, the fitting and averaging of $2^{q+1}$ models comes at great computational cost, and a device for reducing the model space is needed in order for model averaging to be a feasible alternative. One way of filtering out candidate models is to select a set of candidate models that perform well according to some measure of model fit. For instance, in Wan et al. [2014], the candidate set is defined as the five models that attain the smallest BIC values. This strategy also leads to time-consuming computations, since all possible models need to be fitted in order to determine which ones are best. Hence, a range of methods have been devised in order to reduce the time, and computational power, required to apply model averaging.

A simple way of constructing a candidate model set of smaller dimension is to use nested models. The set of regressors employed in the $j$:th nested model is always a subset of the variables used in the nested model indexed $j - 1$, since one regressor is always excluded in moving from one model to the next. As indicated by the results of Liang et al. [2011], the predictions and errors obtained when using nested averaging depend on the order in which the variables are removed in the nesting process, so a specific ordering must be chosen in order to apply it. As a result, candidate sets that do not give rise to such variation in results are preferred.

Another way of forming the candidate set is through the use of shrinkage estimators, such as ridge regression or the LASSO. The shrinkage estimators impose penalties on the coefficients of the estimated regression models, meaning that they can be thought of as tools for model selection. This kind of approach is applied favourably to linear regression models in Schomaker [2012] and Zhao et al. [2018], seeing as their shrinkage averaging estimators achieve smaller squared errors than single model shrinkage estimators, as well as ordinary linear models. Adding to that, Zhao et al. [2018] prove that shrinkage averaging estimators attain the same asymptotic risk as the infeasible best averaging estimator. The successes of shrinkage averaging in the context of linear models have encouraged us to apply the same scheme in logistic regression. Hence, two of the model screening methods applied in this paper are variations on this theme.

This paper considers three model screening approaches aimed at determining the contents of $\mathcal{M}$. As mentioned above, two of them are closely related to shrinkage averaging. The first method is a direct replication of shrinkage averaging, applied in logistic regression. The second method entails estimating all unique model forms suggested by the LASSO, and then averaging over them. We call this approach LASSO-GLM hybrid averaging. The final approach is averaging over all possible one-variable models, commonly known as Singleton averaging. This approach is discussed in Charkhi et al. [2016]. The rest of this section is devoted to presenting the three strategies in further detail.

3.2.1 LASSO Regularization Path

Consider the $\ell_1$ (LASSO) regularization scheme, which provides the means of computing

$$
\hat{\beta}_{\ell_1} = \operatorname*{argmin}_{\beta} \; \frac{1}{N} \sum_{i=1}^{N} \ell(\beta; x_i) + \lambda \|\beta\|_1
$$

for any given value of the penalty parameter $\lambda$. Here, $\ell(\beta; x_i)$ is the negative log-likelihood of observation $i$. Using the regularization algorithm, a unique $\hat{\beta}_{\ell_1}$ can be found for every $\lambda \in \Lambda = \{\lambda_0, \lambda_1, \dots, \lambda_{L-1}\}$, $\lambda_0 = 0$, forming a set of $L$ models whose parameter vectors are regulated using the penalty parameters. The LASSO strategy allows elements of $\hat{\beta}_{\ell_1}$ to be shrunk exactly to 0, making it a suitable tool for model selection. Using this fact, the candidate set, $\mathcal{M}$, is defined to contain these $L$ models, and averaging over all models in $\mathcal{M}$ thus entails averaging over the entire LASSO regularization path.
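As a sketch of how such a candidate set can be produced with the glmnet package (which Section 4 uses to generate $\Lambda$), where x is a hypothetical design matrix and y a binary response:

library(glmnet)

path <- glmnet(x, y, family = "binomial", nlambda = 500)  # cap mirrors Section 4.2
lambda_seq <- path$lambda   # the sequence Lambda chosen by glmnet
                            # (note: glmnet's path does not include lambda = 0 exactly)

## One column of fitted probabilities per candidate model along the path
P_path <- predict(path, newx = x, type = "response")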


3.2.2 LASSO-GLM Hybrid

Depending on the distance between two consecutive elements of $\Lambda$, the differences between some candidate models along the regularization path may be negligible. In the case where many candidate models are very similar, averaging over the entire regularization path may not be the most efficient form of model averaging, since the same information is unnecessarily included several times. Because of this, the regularization path can be used to determine the model forms of the candidate models, rather than the models themselves. That way, the problem of averaging over many very similar models is circumvented, since every model in the candidate set is made up of a distinct combination of regressors. Hence, this new approach averages over a smaller set of distinct and relevant models, hoping to prove more accurate and efficient. The approach is similar to the LARS-OLS hybrid discussed in Efron et al. [2004], and as such we call it the LASSO-GLM hybrid.

Suppose the LASSO is estimated for every $\lambda \in \Lambda$. Then, $\mathcal{M}$ is formed as the set of unique combinations of regressors appearing along the regularization path. Logistic regression models are fitted to the model forms in $\mathcal{M}$, and jackknife averaging is then applied to the estimated models. For instance, consider a scenario where there are three different regressors available, namely $x_1$, $x_2$, and $x_3$. For four different values of the tuning parameter $\lambda$, suppose the regularization algorithm suggests the following model forms:

Tuning parameter   Suggested model form
λ1                 β0 (intercept only)
λ2                 β0 + β1x1 + β3x3
λ3                 β0 + β1x1 + β3x3
λ4                 β0 + β1x1 + β2x2 + β3x3

Then, three unique model forms are used for averaging: one intercept-only, one using $x_1$ and $x_3$, and one using all regressors. Logistic regression models are fitted using the regressors chosen in each model form, and a weight is attributed to each logistic regression.
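Continuing the hypothetical x, y, and path from the sketch in Section 3.2.1, the screening step deduplicates the supports along the path and refits each unique model form as an unpenalized logistic regression:

betas <- as.matrix(path$beta)   # p x L coefficient matrix along the path

## One support (index set of active regressors) per lambda; keep the unique forms
supports <- unique(lapply(seq_len(ncol(betas)),
                          function(l) which(betas[, l] != 0)))

## Refit a plain logistic regression to each unique model form
fits <- lapply(supports, function(idx) {
  if (length(idx) == 0)
    glm(y ~ 1, family = binomial)   # intercept-only form
  else
    glm(y ~ ., family = binomial,
        data = data.frame(y = y, x[, idx, drop = FALSE]))
})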

3.2.3 Singleton Models

Singleton models are intuitive and easily fitted. Suppose, again, that there is a set of $q$ regressors available for consideration. Then, in the set of Singleton models, the $j$:th model simply consists of an intercept term and the $j$:th regressor, meaning that the dimension of the model space is shrunk from $2^{q+1}$ to just $q$. Every model is made up of a single regressor, hence the name Singleton model. Averaging over Singleton models is viewed as a viable alternative to averaging over the nested models above due to the theoretical results presented in Charkhi et al. [2016]. In the local asymptotic framework, the order of the nested models does not affect the results when minimal MSE weighting is applied, and Singleton averaging produces the same predictions (and squared errors) as nested model averaging [Charkhi et al., 2016]. The Singleton averaging approach performs well in their simulation study, hence we decide to consider the approach in our study to see if its application is as fruitful with jackknife weights in the GLM setting. It is, however, worth noting that the weighting method applied in this paper differs from that used by Charkhi et al. [2016]. First, we do not assume that averaging takes place in the local asymptotic framework. Second, the sum-to-one restriction is applied in Charkhi et al. [2016], but the individual weights are not required to be positive.

4 Monte Carlo Simulation Study

In this section, we conduct a Monte Carlo simulation study aiming to compare the performances of our jackknife averaging approaches to those of a series of alternative modelling strategies. In order to study the agility of the averaging estimators under different conditions, we consider two simulation designs: one in which the response is generated using power-decaying coefficients, and one where the model is given regressors that are not relevant in the generation of the outcome.

4.1 Data Generation

To generate the data, we let

$$
\eta_i = c \cdot x_i^\top \theta, \qquad i \in \{1, \dots, n\},
$$

be the linear predictor. Here, $\theta = (\theta_1, \dots, \theta_p)$ is a $p \times 1$ vector of coefficients, while $x_i$ is a $p$-dimensional observation sampled from $N_p(\mu, \Sigma)$, and $c$ is a constant used to regulate the strength of the signal sent through the regressors.

To determine what values $c$ should take, we consider the latent representation of the logistic regression model, given by

$$
y_i^* = c \cdot x_i^\top \theta + \varepsilon_i,
$$

where $y_i^*$ is an unobservable, continuous outcome which is expressed as $y_i \in \{0, 1\}$ through $y_i = I(y_i^* > 0)$, and $\varepsilon_i$ follows a standard logistic distribution with variance $\pi^2/3$. Hence, we determine the theoretical variance contribution of $x_i$ to the outcome as

$$
R^2 = \frac{c^2 \theta^\top \Sigma \theta}{c^2 \theta^\top \Sigma \theta + \pi^2/3}.
$$

Using numerical methods, we find values of $c$ that make $R^2 \in \{0.1, 0.2, \dots, 0.8\}$. Finally, we define

$$
p_i = P(Y_i = 1) = \frac{\exp\{\eta_i\}}{1 + \exp\{\eta_i\}},
$$

and sample the outcomes, $y_i$, from Bernoulli($p_i$).

4.1.1 First Simulation Design

In the first simulation design we modify the example set by Hansen [2007]. In doing so, we let

$$
\theta_j = (-1)^{j+1} \sqrt{2\alpha}\, j^{-\alpha - \frac{1}{2}}, \qquad j \in \{1, \dots, 1000\},
$$

meaning that the coefficients are decaying in power, converging to 0 at a rate we control using the constant $\alpha \in \{0.5, 1, 1.5\}$. Unlike Hansen, we include the alternating term $(-1)^{j+1}$ and exclude the intercept. We do so in order to make sure that no single regressor dominates the signal input to $\eta_i$, and to keep the probabilities, $p_i$, from taking values too close to 1. We let $\mu = 0_{1000}$ and $\Sigma = I_{1000}$. We try to mimic a scenario where we cannot observe all information relevant to generating the data by only considering the first 14 regressors when fitting the models.

4.1.2 Second Simulation Design

In the second design we follow one of the approaches presented in Zhao et al. [2019], with a slight modification, and define

$$
\theta = (0.5, 0.6, 0, 0, 0, 0, 0, 0),
$$

meaning that only two out of eight sampled regressors are used in generating the linear predictor. Further, we let $\mu = 0_8$, $\operatorname{diag}(\Sigma) = 1_8$, and $\Sigma_{kl} = 0.3^{|k-l|}$ for $k \neq l$, where $k, l \in \{1, \dots, 8\}$. When fitting the models, we consider all eight regressors even though six of them have no true impact on the outcome.


4.2 Simulation Set-up

We let $n = 1100$, and divide each sample into a training set consisting of 100 observations ($n_t = 100$) and an evaluation set made up of 1000 observations ($n_e = 1000$). Moreover, we restrict the length of $\Lambda$ to 500, meaning that there will be no more than 500 candidate models to consider after regularization. We use the $\Lambda$ generated by the glmnet package, as discussed in Friedman et al. [2010]. In every experiment, we run 10 000 iterations for each value of $c$. In total, this renders 80 000 iterations for each $\alpha$ in the first design, and 80 000 iterations in the second design.

In addition to the three approaches to model averaging, we also fit a logistic regression model using all regressors considered (14 in the first design, and 8 in the second), as well as a cross-validated LASSO model, as suggested by the lambda.min option of the cv.glmnet function. As mentioned in the introduction, cross-validated models are not commonly used for comparison in the literature. However, since cross-validation is frequently used to tune prediction models, we include the cross-validated LASSO in our paper in order to test our averaging approach against a relevant tuning method. We compare the performances of the five methods using two comparative metrics: the misclassification rate and the risk. For each method, we compute the misclassification rate as

$$
\frac{1}{n_e \cdot T} \sum_{t=1}^{T} \sum_{i=1}^{n_e} I\left[ y_{it} \neq I\left( p_{it}(w) > 0.5 \right) \right],
$$

and the risk as

$$
\frac{1}{n_e \cdot T} \sum_{t=1}^{T} \sum_{i=1}^{n_e} \left( p_{it} - p_{it}(w) \right)^2,
$$

where $p_i(w)$ is the $i$:th element of $P(w)$, and $T = 10\,000$ and $n_e = 1000$ give the number of replications and evaluation observations, respectively.
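Per replication, the two metrics reduce to simple means. A sketch, where y_eval are the realized evaluation outcomes, p_true the true probabilities, and p_hat the averaged predictions $p_{it}(w)$ (all names illustrative):

misclass <- mean(y_eval != as.integer(p_hat > 0.5))  # share of misclassified outcomes
risk     <- mean((p_true - p_hat)^2)                 # mean squared probability error

Averaging these quantities over the $T$ replications gives the reported rates and risks.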

Finally, we conduct a series of hypothesis tests to determine whether the differences found in the simulations can be deemed significant. We test the differences in classification rates using parametric paired proportions tests and $\chi^2$-tests. We use the $\chi^2$-test to compare the classification rate achieved by hybrid averaging against those achieved by the full model, the cross-validated LASSO, and regularization path averaging. Further, we test the differences in risk using Wilcoxon's signed-rank test for paired observations [Wilcoxon, 1945], as well as the Friedman [1937] test. In Wilcoxon's test, we let the alternative hypothesis state that hybrid LASSO averaging produces smaller risk than the method of comparison when testing it against the full model and the cross-validated LASSO, but when testing it against regularization path averaging we let the alternative hypothesis state that path averaging attains the smaller risk.
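Both tests are available in base R; a minimal sketch for one such comparison, where risk_hybrid and risk_full are hypothetical vectors of per-replication risks:

## One-sided Wilcoxon signed-rank test: hybrid averaging yields smaller risk
wilcox.test(risk_hybrid, risk_full, paired = TRUE, alternative = "less")

## Friedman test on the paired per-replication risks (rows = replications)
friedman.test(cbind(hybrid = risk_hybrid, full = risk_full))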

4.3 Results

In this section, we present the results rendered by the Monte Carlo simulations designed above. For large $c$ there were some convergence issues in fitting the GLMs, meaning that the number of valid replications dropped to roughly 8 500 for the largest $c$. We account for this shortfall in the testing of hypotheses. We present misclassification rates and risks as defined in Section 4.2, as well as p-values of the tests regarding risk. The simulation results are displayed in Figures 1 (first design) and 2 (second design), and the hypothesis testing p-values are given in tables in each subsection. Due to the number of individual tests, we report the p-values of the tests regarding classification rates in an appendix to the paper.

4.3.1 First Design

The right column of Figure 1 illustrates a measure of the distance between observed and fitted probabilities. The plots suggest that some averaging estimators generate smaller risk than single model methods, seeing as both regularization path averaging and LASSO-GLM hybrid averaging (hybrid averaging) uniformly outperform the full model in terms of risk. The risk of the path averaging strategy looks to be smaller than that of hybrid averaging in most cases. It is, however, worth noting that the cross-validated single model LASSO performs on a par with both shrinkage averaging variations for small values of $R^2$. Averaging over Singleton models yields very small risks when $R^2$ is small, but the risk increases steadily as a function of $R^2$. Overall, hybrid- and path averaging seem to be most robust to changes in signal strength, with both achieving stable, low-risk patterns.

The hypothesis tests, whose p-values are given in Tables 1 and 2, corroborate the results presented in the figure. The p-values show that the risk of the hybrid averaging estimator is significantly smaller than that of the full model for all values of $R^2$. The cross-validated LASSO, however, achieves significantly smaller risk than hybrid averaging for small $R^2$, while hybrid averaging performs better as $R^2$ grows. The path averaging strategy produces significantly smaller risk than hybrid averaging and thus, by extension, it also significantly outperforms the full model and the cross-validated LASSO generally. Thus far, the results are largely in line with earlier research, and the risk plots are similar to their counterparts in Hansen and Racine [2012]. As for the misclassification rate, the differences between methods are not all that clear.


Figure 1: Display of the results obtained from the first simulation design. Each row gives the results for one specific $\alpha$. The left column plots the misclassification rates while the right column shows risks. The x-axes show the values of $c$ (bottom) and the corresponding $R^2$ (top).


Table 1: Wilcoxon test p-values. Averaging of the hybrid LASSO as compared to the method stated in the leftmost column, for different values of $R^2$ (columns).

Method  α     0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8
Full    0.5   0        0        0        0        0        0        0        0
        1     0        0        0        0        0        0        0        0
        1.5   0        0        0        0        0        0        0        0
CV      0.5   1        <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05
        1     1        <1e-05   <1e-05   <1e-05   <1e-05   0        0        0
        1.5   1        0.0002   <1e-05   <1e-05   <1e-05   0        0        0
Path    0.5   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   0        0        0
        1     <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05
        1.5   <1e-05   0        <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   0.0007

Table 2: Friedman test p-values. Averaging of the hybrid LASSO as compared to the method stated in the leftmost column, for different values of $R^2$ (columns).

Method  α     0.1       0.2      0.3      0.4      0.5      0.6      0.7      0.8
Full    0.5   0         0        0        0        0        0        0        0
        1     0         0        0        0        0        0        0        0
        1.5   0         0        0        0        0        0        0        0
CV      0.5   <1e-05*   0.58     <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05
        1     <1e-05*   0.018    <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05
        1.5   <1e-05*   0.001    <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05
Path    0.5   <1e-05    <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   0
        1     <1e-05    <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   0.001
        1.5   <1e-05    <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   <1e-05   0.04

* In these particular instances, the test results indicate that the cross-validated LASSO model produces smaller risk than averaging over unique model forms.

The plots shown in the first column of Figure 1 are rather difficult to interpret, and the results are best reported in terms of hypothesis testing results; see Appendix A for tabulated results. For $\alpha = 0.5$ the $\chi^2$-test does not yield any significant results, indicating that neither hybrid nor path averaging can be shown to achieve a better classification rate than the full model or the cross-validated LASSO. These claims are supported by the paired proportions tests, whose results are similar. The only significant result is that Singleton averaging acts as the worst classifier for $R^2 \geq 0.6$. For $\alpha = 1$, the results of the tests are not consistent. The $\chi^2$-tests suggest that both variants of shrinkage averaging are significantly better than the full model for some $R^2$, while the paired proportions test does not produce any significant results of that kind. Finally, for $\alpha = 1.5$, the results of the $\chi^2$-tests suggest that hybrid averaging performs better than the full model when $R^2 \geq 0.2$, and that there are no differences between hybrid- and path averaging. The paired proportions tests produce similar results.

Hence, the results from the first simulation design are ambiguous. While there are some indications that the optimality of averaging estimators does, indeed, translate to classification problems, in particular for large $\alpha$ and $R^2$, there are cases in which the optimality does not necessarily seem to hold.

4.3.2 Second Design

In the second design, we simulate the outcome using two influential variables. In the fitting of models, however, we consider six further variables that have no real bearing on the outcome, in order to compare the methods in a setting where they have to filter through data that are not relevant.

Figure 2: Display of the results obtained from the second simulation design. The left panel plots the misclassification rate while the right panel shows the risk. The x-axes show the values of $c$ (bottom) and the corresponding $R^2$ (top).

The results of the simulations run using the second design are similar to those of the first, with one notable exception, according to Figure 2. Both hybrid and regularization path averaging uniformly outperform the full model when it comes to the risk measure, and the two averaging approaches also produce similar risk curves. Further, in terms of risk, both shrinkage averaging variants, Singleton averaging, and the cross-validated LASSO are the best performers for small $R^2$. The risks attained by the two latter models, however, increase with $R^2$, and for moderate and large $R^2$ both Singleton averaging and the cross-validated LASSO are riskier than the full model. Hence, the performance of the cross-validated LASSO is very different in this design, seeing as it does not outperform the full model on all occasions. The risks of hybrid- and path averaging, on the other hand, decrease steadily as $R^2$ grows. Overall, the two averaging strategies stemming from the regularization path look to be the most robust, and produce the smallest risk.

The differences in risk are once more tested using both Friedman and Wilcoxon tests, and the results are given in Table 3. As can be derived from the table, both tests imply that there is a significant difference in risk between the hybrid- and path averaging approaches and the full model. Judging by the curves in Figure 2, and the one-sided nature of the alternative hypothesis used in the Wilcoxon test, both approaches are deemed to produce uniformly smaller risk than the full model. The tests show that the hybrid- and path averaging estimators are both better than the cross-validated LASSO. For $R^2 \in \{0.1, 0.2\}$ path averaging produces smaller risk than hybrid averaging, while the relationship is reversed for $R^2 \geq 0.3$.

Table 3: Hypothesis testing p-values. Averaging of the hybrid LASSO as compared to the method stated in the second column, for different values of $R^2$ (columns).

Test       Alt.   0.1      0.2      0.3      0.4      0.5      0.6      0.7      0.8
Wilcoxon   Full   0        0        0        0        0        0        0        0
           CV     <1e-04   <1e-04   0        0        0        0        0        0
           Path   <1e-04   <1e-04   0.99     1        1        1        1        1
Friedman   Full   0        0        0        0        0        0        0        0
           CV     0.0005   <1e-04   <1e-04   0        0        0        0        0
           Path   <1e-04   <1e-04   <1e-04   <1e-04   <1e-04   <1e-04   <1e-04   <1e-04

When it comes to classification rates, there are no consistent suggestions of significant differences between the approaches, and the results are similar to those presented for $\alpha = 0.5$ in the first design. For $R^2 = 0.8$, Pearson's $\chi^2$-test suggests that hybrid averaging manages a significantly higher classification rate than the full model, with a p-value of roughly 0.04. This claim is not supported by the paired proportions test. The paired proportions test suggests that both hybrid- and path averaging are significantly better classifiers than Singleton averaging for $R^2 = 0.8$. On the whole, however, the results are not supportive of the claim that the optimality of the jackknife averaging estimator translates to the classification setting, even though the risks behave roughly as expected.


5 Discussion

In general, previous studies on model averaging led us to expect the averaging estimators to outperform the full model in terms of risk, as in Hansen [2007] and Hjort and Claeskens [2003], and the empirical results tend to support this assertion in most of the cases studied. Indeed, both hybrid averaging and the regularization path estimator perform better than the full model at all times. The one averaging estimator that does not always seem to follow theoretical expectations is the one making use of Singleton models; a possible reason for this is discussed later on.

In this paper, we extract the regularization path from the output of the glmnet function. Our main reason for doing so is that we can think of no way to determine the smallest penalty parameter value for which all coefficients are shrunk to 0. If there were a method for analytically finding this threshold penalty, averaging could be carried out as in Schomaker [2012] and Zhao et al. [2018], using a regularization path made up of equidistant penalty parameter values. We are, however, bound to use the path provided by glmnet. The algorithm used in glmnet iterates over different values of the penalty parameter until some error measure remains relatively unchanged over a certain number of iterations, meaning that the models rendered by parameter values close to the end of the penalty parameter sequence may be all but identical. It is conceivable that this creates a scenario where models are conflated and weights are not distributed optimally among relevant candidate models. Hence, this issue leaves room for improvement.

As mentioned above, both strategies discussed thus far attain uniformly lower risk measures than the full model. For Singleton averaging, however, this is not the case. Even though the Singleton approach can produce very small risks, sometimes even the smallest of all methods, Singleton averaging seems to be particularly volatile, in terms of risk, for small $\alpha$, and more stable as $\alpha$ grows. Further, the risk produced by the Singleton averaging estimator looks to be a uniformly increasing function of $c$. We believe this to be caused by omitted variable bias, which can be explained by the different values given to the constants $\alpha$ and $c$. As mentioned in Section 4, $\alpha$ is used to control the rate at which the coefficients used in data generation converge to 0, while $c$ is used as a tool for scaling the signal supplied by each individual regressor. Hence, when $\alpha$ is small, more regressors contribute notably to the generation of outcomes, and when $c$ is large, the importance of each regressor is inflated. Thus, in fitting models using only one regressor at a time, a substantial amount of bias is introduced through the unused variables, and the Singleton averaging strategy turns out to be riskier.

The risk plots are, moreover, similar to those reported by Hansen and Racine [2012]. Hence, it seems that the squared error characteristics of jackknife averaging estimators translate to the context of logistic regression. These characteristics, however, do not necessarily seem to extend to classification rates. One reason for this may be the fact that the cross-validation criterion used in our jackknife weighting approach is formulated with respect to the risk rather than the classification rate, meaning that the benefits of averaging might be tuned towards the risk measure by default. As of yet, we have not been able to identify an efficient cross-validation criterion that works on the classification rate, and thus there is clear room for improvement of the method.

Further, the absence of improvement in terms of classification rate might be a result of the classification rule we apply, in which we use a predicted probability of 0.5 as the cut-off value. The risk illustrated on the right hand side of Figures 1 and 2 is defined as the mean squared deviation of fitted probabilities from the true ones. In general, the risk of the full model seems to hover around 0.04, which means that, on average, the true probability can be found within a range of $\sqrt{0.04} = 0.2$ of the fitted probability. Hence, the true probability of an individual needs to lie in $[0.3, 0.7]$ for misclassification to be likely. Since the full model seems to be the riskiest of the stable models, any individual with a probability larger than 0.7 or smaller than 0.3 runs a smaller risk of being classified differently by the different approaches. This might explain why the order of difference between the approaches is larger for the risk measure than for misclassification rates, if the true probabilities fall outside this 'range of misclassification'.

Finally, the performance of Singleton averaging needs to be placed in context. The Singleton approach is studied in Charkhi et al. [2016], in which it produces results that are more promising than the ones obtained in this paper. This discrepancy in results might be down to many factors, but we believe there are two reasons that are particularly interesting. The first is that Charkhi et al. [2016] generate outcomes from the local asymptotic framework, which places strong restrictions on the generating coefficients. This is not something we take into account in our simulations. The second is that our approach to model averaging restricts the weights to be positive and smaller than 1. This constraint is not applied by Charkhi et al. [2016], where some weights are allowed to be negative as long as the sum of all weights is equal to 1.

6 Conclusions

At the outset of this paper we listed three ways in which the paper is meant to contribute to the realm of jackknife model averaging, and the simulations carried out in the paper have given us a better understanding of all three issues. The results provide no clear evidence that the optimality of the jackknife averaging estimator translates to the classification rate, at least not uniformly. There are scenarios in which the classification rate is in fact improved through the use of model averaging, but we cannot state any definitive conclusions as to when that is actually the case. In terms of model screening and the comparison of methods, the results suggest that LASSO-GLM hybrid averaging and LASSO regularization path averaging are the most suitable approaches. They consistently produce smaller risks than the cross-validated LASSO, and their risk graphs are very stable, as opposed to that of the Singleton averaging approach. Thus, we conclude that model averaging can indeed be used in the GLM framework in order to lower the risk of prediction models tuned using cross-validation, and that hybrid- and path averaging look to be the most stable ways of doing so. It is important to note that the results of our simulation study should only be viewed as an indication, and not a formal proof of a generalized property of the jackknife averaging estimator. Thus, the paper leaves room for the construction of a general proof in support of the claim that either of the two shrinkage-based averaging estimators attains the same asymptotic risk as the infeasible best averaging model.


References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, 73:1033–1055.

Ando, T. and Li, K.-c. (2017). A weight-relaxed model averaging approach for high-dimensional generalized linear models. The Annals of Statistics, 45(6):2654–2679.

Charkhi, A., Claeskens, G., and Hansen, B. E. (2016). Minimum mean squared error model averaging in likelihood models. Statistica Sinica, 26(2):809–840.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22.

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701.

Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75(4):1175–1189.

Hansen, B. E. and Racine, J. S. (2012). Jackknife model averaging. Journal of Econometrics, 167(1):38–46.

Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98(464):879–899.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4):382–401.

Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1–20.

Liang, H., Zou, G., Wan, A. T. K., and Zhang, X. (2011). Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association, 106(495):1053–1066.

Mitra, P., Lian, H., Mitra, R., Liang, H., and Xie, M.-g. (2019). A general framework for frequentist model averaging. Science China Mathematics, 62(2):205–226.

Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society. Series A (General), 135(3):370–384.


Schomaker, M. (2012). Shrinkage averaging estimation. Statistical Papers, 53(4):1015–1034.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461– 464.

Wan, A. T. K., Zhang, X., and Wang, S. (2014). Frequentist model averaging for multinomial and ordered logit models. International Journal of Forecasting, 30(1):118–128.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.

Zhang, X., Yu, D., Zou, G., and Liang, H. (2016). Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association, 111(516):1775–1790.

Zhao, S., Liao, J., and Yu, D. (2018). Model averaging estimator in ridge regression and its large sample properties. Statistical Papers, (Advance online publication).

Zhao, S., Zhou, J., and Yang, G. (2019). Averaging estimators for discrete choice by m-fold cross-validation. Economics Letters, 174:65–69.


Appendix A

Testing Classification Rates

Table A1: Results of testing for differences in classification rates in the first design, using the paired proportions test. Small p-values indicate significant differences in classification rates.

Test             α     0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
Hybrid v. Path   0.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
                 1     1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.994
                 1.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Hybrid v. Sing   0.5   1.000  1.000  0.839  0.150  0.008  <1e-4  <1e-4  <1e-4
                 1     1.000  1.000  1.000  1.000  1.000  1.000  0.248  0.003
                 1.5   1.000  0.600  0.398  0.553  0.848  1.000  1.000  1.000
Hybrid v. Full   0.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
                 1     1.000  0.402  0.220  0.165  0.194  0.205  0.196  0.121
                 1.5   1.000  0.085  0.026  0.022  0.020  0.017  0.012  0.012
Hybrid v. CV     0.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
                 1     1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.994
                 1.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Path v. Sing     0.5   1.000  1.000  1.000  0.150  0.005  <1e-4  <1e-4  <1e-4
                 1     1.000  1.000  1.000  1.000  1.000  1.000  0.207  0.002
                 1.5   1.000  0.532  0.423  0.678  1.000  1.000  1.000  1.000
Path v. Full     0.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
                 1     1.000  0.802  0.255  0.164  0.174  0.166  0.156  0.101
                 1.5   1.000  0.125  0.019  0.009  0.007  0.006  0.004  0.006
Path v. CV       0.5   1.000  1.000  1.000  1.000  1.000  1.000  0.980  0.874
                 1     1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.994
                 1.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Sing v. Full     0.5   1.000  1.000  1.000  0.707  0.049  0.000  <1e-4  <1e-4
                 1     1.000  0.091  0.027  0.052  0.281  1.000  1.000  0.813
                 1.5   0.766  0.001  <1e-4  <1e-4  <1e-4  0.000  0.004  0.062
Sing v. CV       0.5   1.000  1.000  1.000  1.000  0.149  0.002  <1e-4  <1e-4
                 1     1.000  1.000  1.000  1.000  1.000  1.000  0.838  0.048
                 1.5   1.000  0.328  0.342  0.553  0.848  1.000  1.000  1.000
Full v. CV       0.5   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
                 1     1.000  1.000  0.609  0.513  0.613  0.883  0.697  0.730
                 1.5   1.000  0.243  0.037  0.022  0.022  0.024  0.024  0.052


Table A2: Results of testing for differences in classification rates in the second design, using the paired proportions test. Small p-values indicate significant differences in classification rates.

Test             0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
Hybrid v. Path   1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Hybrid v. Sing   1.000  1.000  1.000  1.000  1.000  1.000  0.192  0.005
Hybrid v. Full   1.000  1.000  1.000  1.000  0.995  0.879  0.520  0.275
Hybrid v. CV     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Path v. Sing     1.000  1.000  1.000  1.000  1.000  1.000  0.277  0.009
Path v. Full     1.000  1.000  1.000  1.000  1.000  1.000  0.672  0.390
Path v. CV       1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Sing v. Full     1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.770
Sing v. CV       1.000  1.000  1.000  1.000  1.000  1.000  0.679  0.046
Full v. CV       1.000  1.000  1.000  1.000  1.000  1.000  1.000  0.770

Table A3: Results of testing for differences in classification rates using the $\chi^2$-test. Small p-values indicate a difference between classification rates.

Test            Design    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
Hybrid v. Full  α = 0.5   0.921  0.572  0.477  0.476  0.563  0.685  0.759  0.896
                α = 1     0.311  0.045  0.024  0.021  0.022  0.023  0.022  0.020
                α = 1.5   0.138  0.009  0.003  0.003  0.003  0.002  0.001  0.001
                Second    0.764  0.305  0.175  0.120  0.099  0.088  0.065  0.039
Hybrid v. CV    α = 0.5   0.232  0.253  0.329  0.339  0.310  0.383  0.325  0.403
                α = 1     0.220  0.414  0.600  0.611  0.566  0.506  0.480  0.396
                α = 1.5   0.236  0.640  0.892  0.988  0.962  0.906  0.815  0.633
                Second    0.235  0.341  0.381  0.382  0.418  0.465  0.460  0.468
Hybrid v. Path  α = 0.5   0.776  0.728  0.847  1.000  0.913  0.782  0.694  0.549
                α = 1     0.658  0.726  0.928  0.975  0.949  0.920  0.915  0.919
                α = 1.5   0.617  0.872  0.904  0.792  0.747  0.737  0.760  0.835
                Second    0.808  0.725  0.774  0.815  0.845  0.878  0.871  0.844
