Discrete Choice Modeling based on Utility Theory to Explain Response Propensity in Sampling Surveys

Author: Mattias Holm

(930705)

Fall 2019

Independent Project I, Master Thesis, 15 credits Statistics

Örebro University, School of Business

Supervisor: Thomas Laitila

Abstract

A discrete choice model based on utility theory is constructed to describe behaviors affecting an individual’s response propensity in sampling surveys. With goods and leisure as constraints, and with variables affecting both the response rate and the survey variable entering the utility function, a threshold for the discrete decision to participate or not participate in a survey is derived. Using household expenditure survey data from Statistics Sweden, the approach is tested in practice, and the results display a highly significant correspondence between the theoretical model and the empirical findings. The conclusions are that the discrete choice model seems to be an appropriate approach for modeling behaviors affecting a single individual’s response propensity and that it can be a useful foundation for other, more complex modeling of and adjustment for nonresponse.

Keywords: Response propensity, response propensity factor, nonresponse, utility theory, discrete choice model

Acknowledgement: Many thanks to Professor and supervisor Thomas Laitila for contributions both in the theoretical and empirical elements, as well as guidance and support during the entire thesis process.

1. Introduction

Reasons for nonresponse and methods to reduce it have been an increasing concern for several decades (Brick, 2013; Tourangeau & Plewes, 2013). Still, there is considerable uncertainty about how nonresponse affects the final estimates. Brick (2013) identifies three major themes in nonresponse research: mechanisms that cause nonresponse, data collection methods to reduce nonresponse, and statistical methods adjusting for nonresponse. The main target of the latter is to minimize the bias caused by nonresponse, but it cannot be seen as isolated from the other two themes (Groves, 2006; Groves et al., 2000). Aspects such as personal incentives, the impression given by the interview or the survey presentation, and how appealing the survey subject is can affect an individual’s response propensity differently within and between surveys, while at the same time being correlated with attributes affecting the survey variable (Groves et al., 2000; Fjelkegård & Persson, 2012). This calls for auxiliary information to assist in reducing the covariance between the response propensity and the survey variable (Groves, 2006). However, the application of this information is challenging and differs between surveys.

In a meta-analysis by Peytcheva and Groves (2009), demographic variables used as auxiliary information in the examined studies display large differences in mean values between respondents and nonrespondents (although not statistically tested). Brick (2013), however, with regard to Peytcheva and Groves (2009), states that demographic variables might not always be effective in reducing nonresponse bias. The statements are not contradictory, but statistical adjustment needs to be considered if the distribution (or at least the mean value) of the auxiliary variables differs between respondents and nonrespondents. These somewhat inconsistent claims can be interpreted as a need for new approaches when modeling for nonresponse adjustment.

The aim of this study is to describe the behavior behind response propensity by theoretically concretizing these mechanisms with a discrete choice model based on utility theory. The discrete choice is a binary decision to participate or not to participate in a survey, where the choice is assumed to fall on the outcome yielding the highest perceived utility. The theoretical application incorporates a response propensity factor, interpreted as a function of both measurable and latent variables affecting the response rate, which should be able to effectively reduce the covariance between the response propensity and the survey variable (Groves, 2006). Utility theory from both a qualitative statistical perspective (Groves et al., 2000; AAPOR, 2014; Fjelkegård & Persson, 2012) and a microeconometric perspective (Train & McFadden, 1978; Jara-Diaz, 1998; Gärling et al., 1998), as well as nonresponse bias research (e.g. Peytcheva & Groves, 2009; Groves, 2006; Brick, 2013), forms the basis for the approach in this study.

Using data from Statistics Sweden’s Household expenditure survey (HUT), the empirical models display significant results and correspond well to the theoretical model. The model specification including age and income as variables of the response propensity factor seems to be the most appropriate of the models tested, and the Hosmer and Lemeshow test indicates a good fit for the applied data. The main conclusion is that illustrating an individual’s perceived utility with a discrete choice model seems to be a suitable approach when modeling behaviors affecting response propensity. Even though the application is limited to a single individual’s choice to participate in a survey or not, the methodology can be applied as a foundation for other and more complex modeling of response propensity in sampling surveys.

Previous research within the field of nonresponse and the applied utility theory are reviewed in the second part of this first section. In section two, the theoretical nonresponse modeling process is described. In section three, the data are presented and the data collection process in HUT is explained. This is followed by the modeling of the binary choice model in section four. The models are then empirically tested in section five and discussed in section six with respect to model fit and areas for development of the theoretical approach. Final conclusions are given in the seventh section.

1.1. Previous research

In research papers that treat general adjustment methods, such as Lundström and Särndal (1999), correct auxiliary information is often stated to be required to reduce nonresponse bias, and such information can do more harm than good if defined or used incorrectly. The demand for modeling approaches to nonresponse is thus evident, and one such approach is presented by Groves et al. (2000). The authors present the Leverage-saliency theory for survey response propensity, with an illustrative concept: an individual has attributes affecting the decision to participate in a survey, where the leverage of each attribute varies between individuals. These attributes can be made more or less salient in the survey request, for instance, by a strong social purpose described in the survey instructions, or by an interviewer actively advertising a specific reward. Accordingly, a prominent attribute can have a significant role in a particular survey and, conversely, no effect if the attribute is omitted. This is undoubtedly one explanation as to why research on survey participation has been hard to replicate, as Groves et al. (2000) state, resulting in few trusted and consistent ways of measuring the effects of different attributes. This speaks in favor of individualized (or at least subgroup-specific) survey formatting in terms of both appearance and message to increase the response propensity, which, on the other hand, may complicate bias identification and reduction in the estimation process.

Tolonen et al. (2006) review the response rate in the Finnish adult health behavior survey from 1978 to 2002, where gender, age, marital status, and education level are all shown to affect the response rate. Further, the differences between the categories of each variable change little and are moderately consistent over time, which supports the findings, while the overall response rate declines over time. It is, however, important to highlight that the survey design affects individuals differently, and consequently also differently between surveys (Groves et al. 2006). Even though the Finnish adult health behavior survey cannot represent all cases and survey designs, the results demonstrate that demographic variables are valuable tools for nonresponse bias adjustment.

Many studies have examined the relation between nonresponse bias and nonresponse rate from several angles. Peytcheva and Groves (2009) conducted a meta-analysis of 23 studies to review whether nonresponse bias in demographic variables is related to the nonresponse rate. No strong evidence was found, indicating that nonresponse rates have a small effect on the corresponding biases. One reason for these results can be connected to the conclusions in Groves et al. (2000), namely that comparing the explanatory power of demographics between studies might not be advisable. It is, however, crucial to consider these variables as adjustment tools, as previously noted.

Treating time as a factor of response propensity seems reasonable, as answering a survey can stretch over a long period of time. The actual effect on the response rate is, however, hard to assess, mainly because the perceived time to complete the survey does not increase linearly with the actual time (Vercruyssen, Putte, & Stoop, 2011). Consequently, the effect of the time to complete a survey can be assumed to interact with other factors affecting the decision to participate, both those caused by the interviewer and the survey design and those related to individual incentives and demographics. Further, in a single survey, the time is (approximately) constant across individuals, but one can still expect that the time effort is perceived differently. However, if the auxiliary information is specified correctly, the included variables can be assumed to capture each individual’s perceived time effort, and the effect of the time to complete the survey on the response propensity can then be treated as constant in a single survey.

The theory of social exchange was one of the first recognized approaches adapted to qualitative nonresponse theory and is explained by Don Dillman in his book from 1978 (Fjelkegård & Persson, 2012). The theory assumes an exchange situation where an individual (respondent) strives to maximize the profit and minimize the loss. The trade-off can be applied to situations other than the exchange of money and goods, such as the survey decision to participate or not participate, where the outcome consequently is a perceived utility rather than a quantitative utility. The social exchange theory can be adapted to a discrete choice model, where the outcome depends on the discrete choice as a function of the utility. The foundation of the model was constructed by Train and McFadden (1978) and can be applied to most kinds of decisions between qualitatively diverse commodities (Jara-Diaz, 1998). The utility is maximized with respect to working time, which is a component of both the goods and the leisure constraint. Depending on the discrete choice, a different cost and time will affect the utility. The choice in this study is either to respond or not to respond, and the theoretical application is explained in detail in the next section.

2. Theoretical framework

In this section, the discrete choice model under the utility function constraints is derived, followed by an interpretation of a threshold equation for the choice to participate in a survey or not.

2.1. Discrete choice model

A discrete choice model based on classic utility theory is applied to quantitatively map the characteristics of response behavior. The discrete choice for each individual is either to respond or not to respond. If an individual decides to respond, the individual obtains a utility of $U_{1}$, and otherwise a utility of $U_{0}$; the alternative with the highest utility output is selected. The utility functions are quantified by goods consumption G, leisure time L (Train & McFadden, 1978; Jara-Diaz, 1998; Gärling et al., 1998), and an added response propensity factor X, which can be interpreted as a function of both measurable and latent variables. The approach is summarized as follows:

$$\max\; U(G, L, X) \tag{1}$$

Under the constraints

$$G = wW + E + c \tag{2}$$

$$L = \tau - W - t \tag{3}$$

where w is the wage and W is working time. E is income other than wage, and c is economic incentives to respond (e.g. money and lottery tickets). Leisure is a function of total time available $\tau$, working time, and the time to complete the response t. Working time affects goods consumption and leisure time in opposite directions, where a higher W implies increased G and decreased L. Thus, the trade-off between goods consumption and leisure time depends on the optimal value of W. To solve for the optimal working time, U is maximized with respect to W. Given that U takes the Cobb-Douglas form $U = K G^{1-\beta} L^{\beta} X^{R}$, where K is a constant, $\beta \in [0, 1]$ is a parameter that weights an individual’s preferences between G and L, and R is equal to 1 if an individual responds and 0 if not, the expression for the optimal working time $W^{*}$ is:

$$W^{*} = (\tau - t)(1 - \beta) - \frac{E + c}{w}\,\beta \tag{4}$$
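The step from the constrained problem (1)-(3) to equation (4) can be sketched as follows. This is a standard first-order-condition argument reconstructed from the definitions above, not quoted from the thesis, and it assumes an interior solution ($0 < W^{*} < \tau - t$), $w > 0$, and $0 < \beta < 1$.

```latex
\begin{align*}
U(W) &= K\,(wW + E + c)^{1-\beta}\,(\tau - W - t)^{\beta}\,X^{R} \\
\frac{\partial U}{\partial W}
  &= K X^{R}\, G^{-\beta} L^{\beta-1}\,\bigl[(1-\beta)\,w\,L - \beta\,G\bigr] = 0 \\
(1-\beta)\,w\,(\tau - t - W) &= \beta\,(wW + E + c) \\
W^{*} &= (\tau - t)(1-\beta) - \frac{E + c}{w}\,\beta
\end{align*}
```

Since K and $X^{R}$ enter only as multiplicative factors, they drop out of the condition, which is why the optimal working time does not depend on the response propensity factor.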

As seen in equation (4), $W^{*}$ is a function of $\tau - t$, $E + c$, and w. By substituting G and L in the utility function with $G(W^{*})$ and $L(W^{*})$, respectively, $U^{*}$ is the maximized utility function given $W^{*}$ and the constraints on G and L, and takes the form:

$$U^{*} = K(1 - \beta)^{1-\beta}\beta^{\beta}w^{-\beta}[w(\tau - t) + (E + c)]X^{R} \tag{5}$$
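Equations (4) and (5) can be checked numerically. The sketch below is only an illustration: the parameter values (a wage of 100, a two-week horizon of 336 hours, 5 hours of response time, and so on) are hypothetical and are not taken from HUT, and the brute-force grid search is simply an independent check of the closed forms.

```python
import math

def utility(W, w, E, c, tau, t, beta, X=1.0, R=1, K=1.0):
    """Cobb-Douglas utility U = K * G^(1-beta) * L^beta * X^R under (2) and (3)."""
    G = w * W + E + c          # goods constraint (2)
    L = tau - W - t            # leisure constraint (3)
    return K * G ** (1 - beta) * L ** beta * X ** R

def optimal_work(w, E, c, tau, t, beta):
    """Closed-form optimal working time from equation (4)."""
    return (tau - t) * (1 - beta) - (E + c) / w * beta

def max_utility(w, E, c, tau, t, beta, X=1.0, R=1, K=1.0):
    """Closed-form maximized utility from equation (5)."""
    return (K * (1 - beta) ** (1 - beta) * beta ** beta * w ** (-beta)
            * (w * (tau - t) + (E + c)) * X ** R)

# Hypothetical values, chosen only for illustration
params = dict(w=100.0, E=2000.0, c=0.0, tau=336.0, t=5.0, beta=0.4)
W_star = optimal_work(**params)

# Brute-force search over feasible working times (step 0.01) as a check
grid = [i * 0.01 for i in range(1, int((params["tau"] - params["t"]) * 100))]
W_grid = max(grid, key=lambda W: utility(W, **params))

print(W_star, W_grid)
print(math.isclose(utility(W_star, **params), max_utility(**params)))
```

Under these assumed values the grid maximizer lands on the grid point nearest to the closed-form $W^{*}$, and evaluating the utility at $W^{*}$ reproduces equation (5).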

Further, the individual’s choice to respond or not to respond yields the two different utility functions $U_{1}^{*}$ and $U_{0}^{*}$:

$$U_{1}^{*} = K(1 - \beta)^{1-\beta}\beta^{\beta}w^{-\beta}[w(\tau - t) + (E + c)]X, \quad R = 1 \tag{6}$$

$$U_{0}^{*} = K(1 - \beta)^{1-\beta}\beta^{\beta}w^{-\beta}[w\tau + E], \quad R = 0 \tag{7}$$

The distinction between the functions clarifies two things: first, the time to complete the response t and the incentive to respond c make no contribution to $U_{0}^{*}$ (i.e. t = c = 0), and second, the response propensity factor has no direct effect on $U_{0}^{*}$. It is, however, central to note that even though an individual chooses not to respond, the response propensity is affected by X, as shown further down in equation (9). As an example, connected to the Leverage-saliency theory, social interest in the survey or the survey results probably has a positive effect on the propensity to respond (Groves, Singer & Corning, 2000; Fjelkegård & Persson, 2012). In this regard, it is also important to note that the utility is an individual’s perceived utility of responding or not responding to a survey. Further, an individual will choose to participate in the survey if $U_{1}^{*}$ is greater than $U_{0}^{*}$:

$$U_{1}^{*} > U_{0}^{*} \tag{8}$$

By rearranging equation (8) and setting the threshold for survey participation to zero, the interpretation of the factors affecting the response propensity in a practical situation becomes fairly convenient:

$$w[X(\tau - t) - \tau] + [X(E + c) - E] > 0 \tag{9}$$

Assuming that the response propensity factor X is neutral (X = 1), and that $\tau - t$ and c are constant across all individuals, wage has a negative effect on survey participation, since (9) then reduces to $c - wt > 0$. This implies that c needs to be high, and since economic incentives are often zero, other income under these circumstances (e.g. interest income or welfare) is required to compensate for w, where a low t requires less input of E + c. In some cases, both E and c will be equal to zero and have no effect on the utility of responding, which moreover implies no influence by wage, whereas the magnitude of X determines whether the outcome is positive or negative. Thus, the outcome will depend on the response propensity factor X and its contribution to the relation between wage and other income, where X contributes positively if it is positive². Further, equation (9) displays a main effect from w and E as linear functions, whereas X can be interpreted as an interaction effect with time, wage, economic incentives, and other income. Hence, an optimistic attitude towards survey participation will decrease the perceived time effort, indicated by an increased response propensity, affecting both terms on the left-hand side positively.
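The role of X in threshold (9) can be illustrated with a small numerical sketch. All input values are hypothetical (they are not HUT data), and `threshold_9` is a helper name introduced here only for illustration.

```python
def threshold_9(w, E, c, tau, t, X):
    """Left-hand side of equation (9); the individual responds when it is > 0."""
    return w * (X * (tau - t) - tau) + (X * (E + c) - E)

# Hypothetical two-week setting: tau = 336 hours, t = 5 hours, no incentive
base = dict(E=2000.0, c=0.0, tau=336.0, t=5.0)

# With a neutral factor (X = 1) the expression collapses to c - w*t,
# so a higher wage pushes the threshold further below zero
neutral_low = threshold_9(w=50.0, X=1.0, **base)    # -250.0
neutral_high = threshold_9(w=200.0, X=1.0, **base)  # -1000.0

# A factor slightly above one can tip the decision towards responding
positive = threshold_9(w=50.0, X=1.05, **base)

print(neutral_low, neutral_high, positive)
```

In this assumed setting a neutral individual never responds without an incentive, whereas a modest increase in X is enough to make the left-hand side positive, which mirrors the interaction interpretation of X given above.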

Wage will only affect the response propensity from a continuous perspective, since the left-hand side will be further from zero when w is large, and w cannot be less than zero. Artsev et al. (2008) empirically conclude, with data from the Israel household expenditure survey (HET), that nonresponse in the survey of family expenditure is a decreasing convex function of income (i.e., response is an increasing concave function). This suggests including a quadratic factor of wage in an estimation process, in line with the structure of the threshold equation. If wage, however, is equal to zero, which is not too rare, this entails that working time is equal to zero as well³. An individual’s utility then no longer depends on the trade-off between the two constraints G and L since w and W are omitted⁴, which implies that the maximization problem does not need to be solved. Due to these different preconditions, the utilities of responding and not responding when wage is equal to zero need to be added to equation (9). Thus, by directly substituting G and L into the utility function, taking the same Cobb-Douglas form as above, the utility expressions for responding and not responding when wage is equal to zero take the following forms, respectively:

$$U_{1,w=0} = K(E + c)^{1-\beta}(\tau - t)^{\beta}X, \quad R = 1 \tag{10}$$

$$U_{0,w=0} = KE^{1-\beta}\tau^{\beta}, \quad R = 0 \tag{11}$$

² X can, by definition, be negative, but this would always imply a negative outcome and is therefore not discussed in this theoretical application.

³ Volunteer work and other sorts of unpaid work are assumed to be chosen as leisure time.

⁴ Thus, when wage is equal to zero, $G = E + c$ and $L = \tau - t$.

In the same manner as in equation (8), an individual will choose to participate in the survey if $U_{1,w=0}$ is greater than $U_{0,w=0}$. By rearranging the condition and setting the right-hand side equal to zero, the following expression is obtained:

$$X - \frac{1}{\left(1 + \frac{c}{E}\right)^{1-\beta}\left(1 - \frac{t}{\tau}\right)^{\beta}} > 0 \tag{12}$$

As seen in equation (12), the choice depends on an individual’s fixed value of $\beta$. An individual with $\beta$ higher than 0.5, i.e. one who values leisure more than goods, will choose not to participate in a survey more often than an individual with $\beta$ lower than 0.5, ceteris paribus. However, the value of $\beta$ becomes less important when the ratios $c/E$ and $t/\tau$ are close to zero⁵. In the common case when c is zero, the left factor in the denominator is equal to one, whereas the time to complete a survey relative to the total time available can be assumed to be small. This causes the second term on the left-hand side to be close to one across all individuals, so that it can be treated as a constant. By denoting this constant M, and integrating the case when wage is equal to zero into equation (9), the following threshold equation for an individual to respond is obtained:
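How close the second term of equation (12) stays to one can be sketched numerically. The values below are hypothetical (no incentive, 5 hours of response time out of 336 available hours) and `m_constant` is an illustrative helper name; E must be positive for the ratio c/E to be defined.

```python
def m_constant(c, E, t, tau, beta):
    """Second term of equation (12): 1 / ((1 + c/E)^(1-beta) * (1 - t/tau)^beta)."""
    return 1.0 / ((1 + c / E) ** (1 - beta) * (1 - t / tau) ** beta)

# Hypothetical values: c = 0, t = 5 hours out of tau = 336 hours
values = {
    beta: m_constant(c=0.0, E=2000.0, t=5.0, tau=336.0, beta=beta)
    for beta in (0.2, 0.5, 0.8)
}

for beta, m in values.items():
    print(beta, round(m, 4))
```

Under these assumptions the term increases with $\beta$ but remains within about two percent of one across the three preference weights, which is the sense in which it may be treated as an approximately constant M.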

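$$w(X(\tau - t) - \tau) + \mathbf{1}[w > 0](X(E + c) - E) + \mathbf{1}[w = 0](X - M) > 0 \tag{13}$$

Both the case when wage is equal to zero and the case when it is greater than zero are included in equation (13), and by denoting

$$d = \begin{cases} 1, & w = 0 \\ 0, & w > 0 \end{cases} \qquad \text{and} \qquad \bar{d} = \begin{cases} 1, & w > 0 \\ 0, & w = 0 \end{cases}$$

(13) can be written as:

$$X[w(\tau - t) + \bar{d}(E + c) + d] - [w\tau + \bar{d}E] - dM > 0 \tag{14}$$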

⁵ See Figure A1 in the Appendix for an illustration of the right term on the left-hand side (denoted M) with different ...

The interpretation of (14) is the same as that of (9), with the added element that the response propensity factor needs to be larger than M for an individual to respond when wage is equal to zero. Equation (14) will be the foundation for modeling the binary outcome of responding or not responding to a survey in section four.
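The way equation (14) nests both wage cases can be sketched in a few lines. The function name and all input values are hypothetical, and M is simply set to one for the illustration.

```python
def threshold_14(w, E, c, tau, t, X, M):
    """Left-hand side of equation (14), using the indicators d and d_bar."""
    d = 1 if w == 0 else 0        # zero-wage indicator
    d_bar = 1 - d                 # positive-wage indicator
    Hc = w * (tau - t) + d_bar * (E + c)
    H = w * tau + d_bar * E
    return X * (Hc + d) - H - d * M

common = dict(E=2000.0, c=0.0, tau=336.0, t=5.0, X=1.05, M=1.0)

# w > 0: identical to the left-hand side of equation (9)
employed = threshold_14(w=50.0, **common)

# w = 0: collapses to X - M, as in equation (12) with the constant M
zero_wage = threshold_14(w=0.0, **common)

print(employed, zero_wage)
```

For a positive wage the indicator d switches off the M term, and for a zero wage every term involving w, E and c vanishes, leaving only the comparison between X and M.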

2.2. Response propensity factor

The variables in the response propensity factor should interact meaningfully with the variables in the utility model and be suitable for reducing the covariance between response propensity and the survey variable (Groves, 2006). X must, therefore, have an effect on both the response propensity and the survey variable. Variables assumed to have these properties, and which are available for the empirical application to HUT, are briefly discussed in this section. A difference between men and women, as well as between those with and without children, is to be expected in terms of overall time pressure (Mattingly & Sayer, 2006), which in turn might affect the perceived utility of answering a survey negatively, due to less leisure time. Further, gender, age, and education level all seem to affect response propensity and thereby need to be controlled for, as concluded by, for example, Tolonen et al. (2006), and might additionally capture underlying attributes affecting the response propensity. As previously mentioned, Artsev et al. (2008) concluded that the response rate is a function of income, which could therefore also be included when modeling response propensity in the HUT survey. The variables mentioned above are possible candidates to include in a final empirical model to reduce the nonresponse bias. The modeling of the response propensity factor is presented in conjunction with the modeling of the binary outcome, and is integrated into the binary model, as equation (14) indicates.

3. Data

The empirical study is based on Statistics Sweden’s data from HUT 2007. In this section, the collection method and the specification of the data from HUT are explained. The variables included in the models presented later are also described.

3.1. Household expenditure Survey (HUT)

Statistics Sweden’s survey HUT serves to describe the expenditure on goods and services for different reporting groups (different combinations of individuals living in the household) (Fridlund Karlsson, 2008). HUT is a sampling survey with data collected in a continuous twelve-month process, where each randomly selected household participates during a four-week period (two weeks of data gathering). The implementation of the survey follows a collection protocol, where each participant is both interviewed regarding the household’s larger expenditures, purchases, and capital goods during the last twelve months, and asked to keep a cash book⁶.

⁶ Each participant should keep track of all household purchases in a cash book or save receipts during a 14-day period. A household’s larger expenditures and purchases concern, for example, bought and sold furniture, telephone costs, and insurance costs. The exact information collected by Statistics Sweden can be found in Fridlund Karlsson (2008).

The sample consists of 4000 households, uniformly distributed over the weeks of the twelve-month process, yielding 52 equally sized sub-samples (Fridlund Karlsson, 2008). The sampling frame is the Swedish register of the total population (RTB), where at least one individual resident in each household needs to be aged 0-79 years for the household to have a nonzero probability of being selected. There are ten different household groups in the survey, of which Single with children and Single without children are used in this study, where a child is an individual aged 0-18 years living in the household. Only these two household groups are used because of the inherent assumption of individual utility in the theoretical approach. Household groups that include cohabiting adults (more than one individual over 18 years old) can be assumed to have a combined utility of answering a survey, composed of two or more individual utilities, which is more challenging to construct and interpret. Consequently, these groups cannot be supported by the theoretical approach in this study. This should not be seen as a drawback, but rather as a scaled-down problem that offers a more approachable starting point for this new approach to modeling and examining response propensity. Households in which an individual aged 19-32 lives with an individual younger than 19 are omitted, i.e., an individual needs to have been 13 years or older when having a child or children.

3.2. Variables

The Swedish household expenditure data from 2007 contain 1 181 individuals who are living as singles, of whom 490 participated in the survey. Table 1 below presents descriptive statistics of the included variables for all 1 181 individuals in this study. The auxiliary information for all individuals is available from the registers of Statistics Sweden.

Table 1. Descriptive statistics of data from HUT 2007.

| N = 1 181 | Median | Mean | SD | Min. | Max. |
|---|---|---|---|---|---|
| R | 0 | 0.41 | 0.49 | 0 | 1 |
| gender | 1 | 0.53 | 0.50 | 0 | 1 |
| age | 43 | 45.81 | 17.15 | 19 | 79 |
| nchildren | 0 | 0.36 | 0.78 | 0 | 5 |
| inc | 6 477 | 7 276 | 5 081 | -2 806 | 74 533 |
| w | 62.05 | 85.5 | 92.65 | 0 | 586.59 |
| E | 2 573 | 4 206 | 6 378 | 0 | 84 495 |
| d | 0 | 0.35 | 0.48 | 0 | 1 |
| H | 25 126 | 32 933 | 29 935 | 0 | 210 321 |
| Hc | 25 016 | 32 762 | 29 754 | 0 | 209 572 |
| gender(Hc+d) | 4 129 | 16 163 | 24 571 | 0 | 168 969 |
| age(Hc+d) | 962 919 | 1 384 248 | 1 339 360 | 22 | 10 404 516 |
| nchildren(Hc+d) | 0 | 14 394 | 44 247 | 0 | 838 287 |
| inc(Hc+d) | 1.64E+08 | 3.42E+08 | 6.59E+08 | -228 100 | 1.56E+10 |

In this study, age, gender, nchildren, and inc are used as variables in the response propensity factor. As seen in the table, 35 percent of the individuals have zero wage, and a few individuals (eight of them) have a negative income. Values of E, other income, below zero are truncated, whereas the two-week disposable income, inc, can be negative. The variables H and Hc are generated to simplify the empirical model presented in section four and correspond to terms in (14), where Hc is $w(\tau - t) + \bar{d}(E + c)$ and H is $w\tau + \bar{d}E$. Since the economic incentive, c, is zero in HUT, the time to complete the response, t, is the variable contributing to the slightly lower values of Hc than H. Further, each of the four variables included in the response propensity factor is multiplied by Hc + d, to follow the specification of equation (14).
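The construction of the derived regressors described above can be sketched as follows. This is an illustration only: the thesis does not state the values used for $\tau$ and t, so `tau=336.0` and `t=5.0` are assumptions, the example persons are made up, and truncation of E is interpreted here as setting negative values to zero.

```python
def build_regressors(person, tau, t, c=0.0):
    """Derive d, H, Hc and the interaction terms of equation (14) for one person."""
    w = person["w"]
    E = max(person["E"], 0.0)     # assumption: negative other income is set to zero
    d = 1 if w == 0 else 0        # zero-wage indicator
    d_bar = 1 - d
    Hc = w * (tau - t) + d_bar * (E + c)
    H = w * tau + d_bar * E
    row = {"d": d, "H": H, "Hc": Hc}
    for name in ("age", "gender", "nchildren", "inc"):
        row[f"{name}(Hc+d)"] = person[name] * (Hc + d)
    return row

# Two made-up individuals: one without wage, one with wage and negative E
no_wage = build_regressors(
    dict(w=0.0, E=3000.0, age=22, gender=1, nchildren=0, inc=5000.0),
    tau=336.0, t=5.0)
with_wage = build_regressors(
    dict(w=100.0, E=-50.0, age=40, gender=0, nchildren=2, inc=7000.0),
    tau=336.0, t=5.0)

print(no_wage)
print(with_wage)
```

For the zero-wage person both H and Hc are zero and each interaction term equals the underlying variable itself, since Hc + d = 1.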

4. Modeling

Both the binary outcome model and the model of the response propensity factor are presented in this section. In the empirical process, a probit model will be used to estimate the response propensity, i.e. the probability that an individual responds.

The model of the response propensity factor is constructed as a multiple linear regression with a set of variables that explain both the response propensity and the survey variable:

$$X = \beta_{0} + \beta_{1}X_{1} + \dots + \beta_{k}X_{k} \tag{15}$$

In equation (15), there are k explanatory variables with k corresponding parameters $\beta$ and an intercept $\beta_{0}$. By substituting X into a probit model taking the shape of equation (14), the probability to respond, R = 1, can be modeled as:

$$P(R = 1) = \Phi\left(\alpha + \beta_{0}\mathit{Hc} + \beta_{0}^{*}d + \beta_{1}X_{1}[\mathit{Hc} + d] + \dots + \beta_{k}X_{k}[\mathit{Hc} + d] + \beta_{k+1}H\right) \tag{16}$$

The above equation is a general probit model of the theoretical equation (14)⁸. From model (16), three models are constructed for empirical testing. The first model (1) includes age, gender, number of children, and disposable income as variables of the response propensity factor. Additionally, all four of these variables are also included as main effects in model (1), as the model serves as a comparison to models (2) and (3), which follow the theoretical formulation. In these two models, all variables from the response propensity factor that are included as main effects in model (1) are omitted, and further, nchildren(Hc+d) and gender(Hc+d) are excluded in model (3).

To evaluate the goodness of fit of the models, the likelihood ratio test (LR test) will be used to compare the models. The other two models are nested in model (1), and consequently two LR tests will be performed. The Hosmer and Lemeshow test will also be conducted to evaluate how well the predicted values correspond to the true values of R.

⁸ The derivation of equation (14) into the general probit model expression in (16) is shown in the appendix.
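The substitution that links equation (14) to the index in (16) can be sketched as below. This is a reconstruction from equations (14) and (15) and is not a reproduction of the appendix derivation; it uses $\mathit{Hc} = w(\tau - t) + \bar{d}(E + c)$ and $H = w\tau + \bar{d}E$.

```latex
\begin{align*}
X\,(\mathit{Hc} + d) - H - dM
  &= (\beta_{0} + \beta_{1}X_{1} + \dots + \beta_{k}X_{k})(\mathit{Hc} + d) - H - dM \\
  &= \beta_{0}\,\mathit{Hc} + (\beta_{0} - M)\,d
     + \sum_{j=1}^{k} \beta_{j}X_{j}\,(\mathit{Hc} + d) - H
\end{align*}
```

Read against (16), this suggests that $\beta_{0}^{*}$ corresponds to $\beta_{0} - M$ and that the coefficient $\beta_{k+1}$ on H is expected to be negative, while the intercept $\alpha$ has no counterpart in (14). As in any probit model, the coefficients are identified only up to the scale normalization of the index, so the correspondence concerns signs and relative magnitudes rather than exact values.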

5. Results

The estimated parameters for all three models are presented in this section, together with the results of the LR tests and the Hosmer and Lemeshow test.

Table 2. Results of three probit model estimates, LR test, and Hosmer and Lemeshow test.

| | (1) | (2) | (3) |
|---|---|---|---|
| (Intercept) | -0.6376 *** | -0.4370 *** | -0.4090 *** |
| Hc | 0.01002 * | 0.00539 *** | 0.00520 *** |
| d | -0.3658 *** | -0.2612 ** | -0.2791 ** |
| age(Hc+d) | 2.765E-07 ** | 3.573E-07 *** | 3.636E-07 *** |
| gender(Hc+d) | -2.273E-06 | 1.252E-06 | |
| nchildren(Hc+d) | 1.687E-06 | 1.113E-06 | |
| inc(Hc+d) | -3.662E-10 ** | -3.379E-10 ** | -2.811E-10 ** |
| H | -0.00996 * | -0.00536 *** | -0.00518 *** |
| age | 0.0036 | | |
| gender | 0.2314 * | | |
| nchildren | -0.0350 | | |
| inc | -4.599E-05 | | |
| N | 1181 | 1181 | 1181 |
| Log-likelihood (parameters) | -766.8022 (12) | -770.3618 (8) | -771.3858 (6) |
| P-value LR test | - | 0.1297 | 0.1644 |
| P-value Hosmer and Lemeshow test | - | - | 0.3775 |

Note: The significance levels are denoted as: *** = 0.01, ** = 0.05, * = 0.1. P-values for the LR test are for models (2) and (3), each compared with model (1).

Model (1) displays several insignificant parameters, among them three of the added main effects from the response propensity factor: age, nchildren, and inc. Comparing the signs of the estimated parameters with equation (14), Hc (positive), H (negative), and d (negative) all correspond to the theory in all three models. In model (2), where the main effects of X are removed, the estimates are overall more significant than in model (1), except for d. The estimates for models (2) and (3) are fairly similar, and the exclusion of the insignificant parameters gender(Hc+d) and nchildren(Hc+d) does not seem to affect the outcome; these variables thereby appear to have no effect on the response propensity. Thus, despite the significant estimated intercept, model (3) matches equation (14) well. In the scatter plot of the fitted values from model (3) against disposable income in the Appendix (Figure A3), the increasing concave pattern is consistent with the empirical results from Artsev et al. (2008).
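To show how the model (3) estimates translate into a response probability, the sketch below evaluates the probit index of equation (16) for a purely illustrative individual placed at the Table 1 medians (not an actual respondent). The coefficients are the rounded values printed in Table 2; because the Hc and H coefficients are nearly equal in size with opposite signs and are multiplied by values around 25 000, rounding alone can shift the index noticeably, so the result should be read as indicative only.

```python
import math

def probit_probability(index):
    """Standard normal CDF, Phi(index), computed via the error function."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

# Rounded model (3) estimates as printed in Table 2
coef = {"intercept": -0.4090, "Hc": 0.00520, "d": -0.2791,
        "age(Hc+d)": 3.636e-07, "inc(Hc+d)": -2.811e-10, "H": -0.00518}

def response_probability(Hc, H, d, age, inc):
    """P(R = 1) from equation (16) with age and inc in the propensity factor."""
    index = (coef["intercept"] + coef["Hc"] * Hc + coef["d"] * d
             + coef["age(Hc+d)"] * age * (Hc + d)
             + coef["inc(Hc+d)"] * inc * (Hc + d)
             + coef["H"] * H)
    return probit_probability(index)

# Illustrative individual at the Table 1 medians
p = response_probability(Hc=25016, H=25126, d=0, age=43, inc=6477)
print(round(p, 3))
```

With these inputs the predicted probability comes out at about 0.45, which is of the same order as the observed response rate of 0.41 in Table 1.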

The results of the LR tests show that model (1) does not fit the data significantly better than either model (2) or model (3), that is, when the main effects age, nchildren, gender, and inc are removed from the model. When X is specified “more correctly” (as the results indicate), entering only age(Hc+d) and inc(Hc+d), the P-value (0.1644) is even higher for model (3) than for model (2). The P-value from the Hosmer and Lemeshow test suggests that model (3) fits the data adequately, since the null hypothesis that the fitted values agree with the observed response rates cannot be rejected.
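The LR test P-values in Table 2 can be recomputed from the reported log-likelihoods and parameter counts. The sketch below assumes the usual chi-square reference distribution with degrees of freedom equal to the difference in the number of parameters; the closed-form tail probability used here is valid for even degrees of freedom only, which covers both comparisons.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))

def lr_test(loglik_full, k_full, loglik_restricted, k_restricted):
    """Return the LR statistic, its degrees of freedom and the P-value."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    df = k_full - k_restricted
    return stat, df, chi2_sf_even_df(stat, df)

# Log-likelihoods and parameter counts as reported in Table 2
stat_2, df_2, p_2 = lr_test(-766.8022, 12, -770.3618, 8)   # model (1) vs (2)
stat_3, df_3, p_3 = lr_test(-766.8022, 12, -771.3858, 6)   # model (1) vs (3)

print(round(stat_2, 4), df_2, round(p_2, 4))
print(round(stat_3, 4), df_3, round(p_3, 4))
```

The recomputed values agree with the reported P-values of 0.1297 and 0.1644, which indicates that the tests use 4 and 6 degrees of freedom, respectively.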

6. Discussion

The similarities between the empirical models, especially specification (3), and the theoretical equation (14) suggest that a utility function with discrete choice modeling might be a suitable approach for nonresponse adjustment. From a modeling perspective, a drawback is the significant intercept, which is not a component of (14). This could indicate a shortcoming of the model, but also an error in the interpretation of a variable or attributes missing from the response propensity factor. One example of such a missing attribute is education, since the response rate can differ between levels of education (Tolonen et al., 2006), and education can affect the survey variable, household expenditure, through its correlation with income.

The definition and measurement of the variables are also a subject of discussion, for example the truncation of other income (E) and whether it should be allowed to be negative. Other income, such as interest on capital, can be negative, as can liabilities, but is here bounded below at zero, while disposable income is allowed to be negative. The negative values of E are, however, both few and small and will not affect the final estimates to any great extent in this empirical study, but it should be further investigated whether a truncation at zero is appropriate. Another consideration is the wage measure, since some individuals in the population, and thereby in the sample, work part-time, which is not considered in this study because the information is missing. If, for example, wage affects the response propensity positively, which is possibly the case under the survey circumstances, an hourly wage that is defined too high can contribute to an underestimation of the positive effect on the participation probability, ceteris paribus.

In the HUT survey, the measurement of time can be set in relative terms, since each participant has to keep a cash book and take part in interviews with Statistics Sweden over a two-week period, and the time to complete the survey is small relative to this period, which is displayed in Figure A3 in the Appendix. The relatively short accumulated time to complete the survey during the two-week participation period is nevertheless controlled for, which causes the small difference between the estimated parameters of H and Hc in all models. In other sampling surveys, the time to complete the survey relative to the total time might be harder to model and evaluate, and the assumption that M is constant across all individuals might not hold. Further, Vercruyssen et al. (2011), with support from Mattingly and Sayer (2006) (who argue that free time has the potential to reduce time pressure), conclude that less free time increases the nonresponse reasons “too busy” and “have no time”, measured with objective indicators of free time on weekdays and weekends. Thus, the actual time to complete the survey might be neither quantifiable nor a good indicator and could be treated as constant across individuals, whereas the perceived time effort, quantified by, for example, free time, could serve as an indirect measurement. The variables included in the model could, however, be assumed to capture this effect, but more research is needed on this subject.

A factor that is ignored in the model is that some of the children of the individuals in the sample also have an income, which is not included as disposable income in the model. One could argue that it should be included, since this factor affects the survey variable. However, the decision on whether and how children's income should be incorporated needs further evaluation, since the utility gained from answering the survey is assumed to concern the single individual's choice to respond or not, which might be independent of their children's income. A further consideration is whether the age of the children should be included as a factor, since different age groups might demand different amounts of work and time from the parent, so that both actual and perceived time could be affected differently.

As the theoretical derivation of the discrete choice model has been the main focus of this study, the response propensity factor is simply modeled as a linear function of the explanatory variables. Even though the linear specification and the included variables fit the data well in the practical application, the response propensity factor could also be modeled in other ways. This is one of the next steps in improving the model fit for both general and survey-specific nonresponse adjustment, together with the choice of reasonable auxiliary variables that fulfill the assumption of correlation with both the attributes affecting the response rate and the survey variable, in order to reduce estimation bias.

7. Conclusions

The discrete choice model includes factors that, as is often the case, do not have a given correct specification, and X and t in particular are both challenging and crucial to determine. Nevertheless, this new theoretical framework seems to be a suitable approach to concretize and assess the factors affecting individual response propensity in sampling surveys. The significant results and the close similarity between the empirical results and the theoretical model demonstrate a practical fit and motivate further application in nonresponse adjustment. The model is restricted to a single individual’s utility but can be used as a foundation for other and more sophisticated modeling of response propensity.


References

AAPOR. 2014. Current Knowledge and Considerations Regarding Survey Refusals. AAPOR Task Force on Survey Refusals.

Artsev, Y., Yitzhaki, S., Schechtman, E. 2008. Who Does Not Respond in the Household Expenditure Survey. Journal of Business & Economic Statistics. 26 (3), 329–344.

Brick, J. M. 2013. Unit Nonresponse and Weighting Adjustments: A Critical Review. Journal of Official Statistics. 29 (3), p. 329–353.

Dillman, D. 1978. Mail and Telephone Surveys: The Total Design Method. New Jersey: John Wiley & Sons, Inc.

Fjelkegård, L., Persson, A. 2012. Varför bortfall? ur urvalspersonernas perspektiv (främst). Internal report from Statistics Sweden. Unpublished.

Fridlund Karlsson, Å. 2008. Hushållens Utgifter (HUT) 2007. Official statistical report produced by Statistics Sweden 2008-06-04. https://www.scb.se/he0201 [collected 2019-11-29].

Groves, R.M., Singer, E., Corning, A. 2000. Leverage-Saliency Theory of Survey Participation: Description and an Illustration. Public Opinion Quarterly. 64 (3), p. 299–308.

Groves, R.M. 2006. Nonresponse rates and nonresponse bias in household surveys. Public Opinion Quarterly. 70 (5), p. 646–675.

Gärling, T., Laitila, T., Westin, K. 1998. Theoretical foundations of choice modeling. Oxford: Elsevier science ltd.

Jara-Diaz, S.R. 1998. Time and Income in Travel Demand: Towards a Microeconomic Activity Framework. Universidad de Chile.

Lundström, S., Särndal, C-E. 1999. Calibration as a Standard Method for Treatment of Nonresponse. Journal of Official Statistics. 15 (2), p. 305-327.

Mattingly, M.J., Sayer, L.C. 2006. Under Pressure: Gender Differences in the Relationship Between Free Time and Feeling Rushed. Journal of Marriage and Family, 68 (1), p. 205-221.

McFadden, D., Train, K. 1978. The goods/leisure tradeoff and disaggregate work trip mode choice models. Transportation Research. 12 (5), p. 349-353.

Peytcheva, E., Groves, R.M. 2009. Using Variation in Response Rates of Demographic Subgroups as Evidence of Nonresponse Bias in Survey Estimates. Journal of Official Statistics. 25, p. 193-201.


Vercruyssen, A., van de Putte, B., Stoop, I.A.L. 2011. Are They Really Too Busy for Survey Participation? The Evolution of Busyness and Busyness Claims in Flanders. Journal of Official Statistics. 27 (4), p. 619-632.

Tolonen, H., Helakorpi, S., Talala, K., Helasoja, V., Martelin, T., Prättälä, R. 2006. 25-year trends and socio-demographic differences in response rates: Finnish adult health behaviour survey. European Journal of Epidemiology. 21, p. 409-415.

Tourangeau, R., Plewes, T. J. 2013. Nonresponse in Social Science Surveys: A Research Agenda. Washington, D.C.: The National Academies Press. ISBN: 978-0-309-27247-6.


Appendix

Figure A1. Scatter plot of M and β with eleven different values of β between 0 and 1.

Note: $M = \dfrac{1}{\left(1 + \frac{c}{E}\right)^{1-\beta}\left(1 - \frac{t}{\tau}\right)^{\beta}}$, where t = 4 (estimated total time to respond, in hours), τ = 336 (two weeks in hours), and c = 0.
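The values plotted in Figure A1 can be approximated with a short sketch. It assumes that M has the form 1 / [(1 + c/E)^(1−β) (1 − t/τ)^β], which is a reading of the figure note rather than a formula confirmed elsewhere in this section; the function name and the placeholder value of E are hypothetical. With c = 0 the value of E has no influence.

```python
def threshold_m(beta: float, t: float = 4.0, tau: float = 336.0,
                c: float = 0.0, e: float = 1.0) -> float:
    """Assumed form of M: 1 / ((1 + c/E)^(1-beta) * (1 - t/tau)^beta)."""
    return 1.0 / ((1.0 + c / e) ** (1.0 - beta) * (1.0 - t / tau) ** beta)


# Eleven values of beta between 0 and 1, as in Figure A1.
for i in range(11):
    beta = i / 10
    print(f"beta = {beta:.1f}  M = {threshold_m(beta):.5f}")
```

Under these settings M stays very close to one over the whole range of β, which is in line with treating M as approximately constant across individuals in the HUT application.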

Table A2. Description of the variables in Table 1.

Variable          Description
R                 Response indicator, 1 if respond, 0 otherwise
gender            1 if female, 0 if male
age               Age
nchildren         Number of children under 19 years old in the household
inc               Disposable income during the time span to respond (2 weeks in HUT)
w                 Wage per hour (yearly wage income divided by 1700)
E                 Other income than wage income during the time span to respond (2 weeks in HUT). Calculated as disposable income minus wage income minus tax. Minimum value is set to zero
d                 1 if wage is equal to zero, otherwise 0
H                 wτ + d̄E
Hc                w(τ − t) + d̄(E + c)
gender(Hc+d)      gender multiplied by Hc+d
age(Hc+d)         age multiplied by Hc+d
nchildren(Hc+d)   nchildren multiplied by Hc+d
inc(Hc+d)         inc multiplied by Hc+d
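As an illustration of how the derived variables in Table A2 could be constructed, the following sketch builds w, E, d, H, and Hc for one individual. It assumes that d̄ = 1 − d, which is not stated in this section, and uses t = 4 and τ = 336 from the note to Figure A1; the function name and the input values are hypothetical.

```python
def build_terms(yearly_wage: float, other_income: float,
                tau: float = 336.0, t: float = 4.0, c: float = 0.0) -> dict:
    """Construct the model terms of Table A2 for a single individual."""
    w = yearly_wage / 1700.0           # wage per hour
    e = max(other_income, 0.0)         # other income, truncated at zero
    d = 1 if w == 0 else 0             # indicator of zero wage
    d_bar = 1 - d                      # ASSUMPTION: d-bar is the complement of d
    h = w * tau + d_bar * e
    hc = w * (tau - t) + d_bar * (e + c)
    return {"w": w, "E": e, "d": d, "H": h, "Hc": hc}


def interaction(x: float, terms: dict) -> float:
    """Interaction regressor x(Hc+d), e.g. age(Hc+d) or inc(Hc+d)."""
    return x * (terms["Hc"] + terms["d"])


person = build_terms(yearly_wage=340000.0, other_income=2000.0)
print(person, interaction(45.0, person))
```

Under the assumption d̄ = 1 − d, an individual with zero wage gets H = Hc = 0 and d = 1, so the interaction terms reduce to the covariate itself.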


Figure A3. Scatter plot of fitted value of model (3) and disposable income.

Note: Response probability is fitted values of model (3) and Income is two weeks disposable income. One high extreme value of income is removed for illustrative purpose.

A4. Derivations from the Cobb-Douglas utility function and forward to equation (9).

Each expression denoted by (A.#) corresponds to the expression denoted with the same number as in the text, e.g. (A.1) is equal to (1).

$$\max U(G, L, X) \tag{A.1}$$

Insert G and L into the Cobb-Douglas utility function:

$$U = K\,(wW + E + c)^{1-\beta}\,(\tau - W - t)^{\beta}\,X^{R}$$

$$\ln(U) = \ln(K) + (1-\beta)\ln(wW + E + c) + \beta\ln(\tau - W - t) + R\ln(X)$$

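$$\frac{\partial \ln(U)}{\partial W} = \frac{w(1-\beta)}{wW + E + c} - \frac{\beta}{\tau - W - t}$$

Set equal to zero and solve for W to maximize the function:

$$\frac{w(1-\beta)}{wW + E + c} = \frac{\beta}{\tau - W - t}$$

$$w(\tau - W - t)(1-\beta) = (wW + E + c)\beta$$

$$w(\tau - t)(1-\beta) - wW(1-\beta) = wW\beta + (E + c)\beta$$

$$wW = w(\tau - t)(1-\beta) - (E + c)\beta$$

$$W^{*} = (\tau - t)(1-\beta) - \frac{E + c}{w}\beta \tag{A.4}$$

Substituting $W^{*}$ into G and L:

$$G(W^{*}) = w\left[(\tau - t)(1-\beta) - \frac{E + c}{w}\beta\right] + E + c = w(\tau - t)(1-\beta) + (E + c)(1-\beta)$$

$$L(W^{*}) = \tau - \left[(\tau - t)(1-\beta) - \frac{E + c}{w}\beta\right] - t = (\tau - t)\beta + \frac{E + c}{w}\beta$$

and further into the Cobb-Douglas utility function:

$$U^{*} = K\left[w(\tau - t)(1-\beta) + (E + c)(1-\beta)\right]^{1-\beta}\left[(\tau - t)\beta + \frac{E + c}{w}\beta\right]^{\beta}X^{R}$$

$$U^{*} = K(1-\beta)^{1-\beta}\beta^{\beta}\left[w(\tau - t) + (E + c)\right]^{1-\beta}\left[(\tau - t) + \frac{E + c}{w}\right]^{\beta}X^{R}$$

$$U^{*} = K(1-\beta)^{1-\beta}\beta^{\beta}\left[w(\tau - t) + (E + c)\right]^{1-\beta}\left(w^{-1}\left[w(\tau - t) + (E + c)\right]\right)^{\beta}X^{R}$$

$$U^{*} = K(1-\beta)^{1-\beta}\beta^{\beta}w^{-\beta}\left[w(\tau - t) + (E + c)\right]X^{R} \tag{A.5}$$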
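The closed forms (A.4) and (A.5) can be checked numerically. The sketch below evaluates the log utility at W* and at nearby values and compares the utility at W* with the closed-form maximum; all parameter values (w, E, β, X and so on) are hypothetical and chosen only for illustration.

```python
import math


def log_utility(W, w, E, c, tau, t, beta, K=1.0, X=1.0, R=1):
    """ln(U) for the Cobb-Douglas utility with G = wW + E + c and L = tau - W - t."""
    return (math.log(K) + (1 - beta) * math.log(w * W + E + c)
            + beta * math.log(tau - W - t) + R * math.log(X))


def optimal_work(w, E, c, tau, t, beta):
    """W* from (A.4)."""
    return (tau - t) * (1 - beta) - (E + c) / w * beta


def max_utility(w, E, c, tau, t, beta, K=1.0, X=1.0, R=1):
    """U* from (A.5)."""
    return (K * (1 - beta) ** (1 - beta) * beta ** beta * w ** (-beta)
            * (w * (tau - t) + E + c) * X ** R)


# Hypothetical parameter values.
args = dict(w=200.0, E=2000.0, c=0.0, tau=336.0, t=4.0, beta=0.6)
w_star = optimal_work(**args)
u_at_w_star = math.exp(log_utility(w_star, X=1.1, **args))
print(w_star, u_at_w_star, max_utility(X=1.1, **args))
```

In this setting the utility evaluated at W* agrees with (A.5), and moving W away from W* in either direction lowers the log utility, as expected for an interior maximum.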

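Substituting (6) and (7) into (8) gives:

$$K(1-\beta)^{1-\beta}\beta^{\beta}w^{-\beta}\left[w(\tau - t) + (E + c)\right]X > K(1-\beta)^{1-\beta}\beta^{\beta}w^{-\beta}\left[w\tau + E\right]$$

$$\left[w(\tau - t) + (E + c)\right]X - w\tau - E > 0$$

$$w\left[X(\tau - t) - \tau\right] + \left[X(E + c) - E\right] > 0 \tag{A.9}$$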
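The participation condition (A.9) can be written as a simple decision rule, and solving it for X gives the smallest response propensity factor at which responding yields higher utility. The sketch below is illustrative only: the function names are hypothetical, the defaults t = 4 and τ = 336 are taken from the note to Figure A1, and the input values are made up.

```python
def responds(X: float, w: float, E: float, c: float = 0.0,
             tau: float = 336.0, t: float = 4.0) -> bool:
    """True if condition (A.9) holds, i.e. responding gives higher utility."""
    return w * (X * (tau - t) - tau) + (X * (E + c) - E) > 0


def threshold_x(w: float, E: float, c: float = 0.0,
                tau: float = 336.0, t: float = 4.0) -> float:
    """Value of X at which (A.9) holds with equality."""
    return (w * tau + E) / (w * (tau - t) + E + c)


x_min = threshold_x(w=200.0, E=2000.0)
print(x_min, responds(1.02, 200.0, 2000.0), responds(1.0, 200.0, 2000.0))
```

With c = 0 and t > 0 the threshold is slightly above one, so an individual responds only if the response propensity factor X more than compensates for the time spent on the survey.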

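A5. Derivations from equation (14) into the general expression of the probit model (16).

By denoting:

$$\mathit{Hc} = w(\tau - t) + \bar{d}(E + c), \qquad H = w\tau + \bar{d}E$$

Equation (14) can be written as:

$$X[\mathit{Hc} + d] - H - Md > 0$$

If X is written as the general multiple linear regression as in (15), a general probit model can be expressed as:

$$P(R = 1) = \Phi\left(\alpha + (\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)[\mathit{Hc} + d] + \beta_{k+1}H + \beta_{k+2}d\right)$$

$$P(R = 1) = \Phi\left(\alpha + \beta_0[\mathit{Hc} + d] + \beta_1 X_1[\mathit{Hc} + d] + \dots + \beta_k X_k[\mathit{Hc} + d] + \beta_{k+1}H + \beta_{k+2}d\right)$$

$$P(R = 1) = \Phi\left(\alpha + \beta_0\mathit{Hc} + \beta_0 d + \beta_1 X_1[\mathit{Hc} + d] + \dots + \beta_k X_k[\mathit{Hc} + d] + \beta_{k+1}H + \beta_{k+2}d\right)$$

By denoting:

$$\beta_0 d + \beta_{k+2}d = (\beta_0 + \beta_{k+2})d = \beta_0^{*}d$$

The model looks as follows:

$$P(R = 1) = \Phi\left(\alpha + \beta_0\mathit{Hc} + \beta_0^{*}d + \beta_1 X_1[\mathit{Hc} + d] + \dots + \beta_k X_k[\mathit{Hc} + d] + \beta_{k+1}H\right)$$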
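To show how the final probit expression would be evaluated for one individual, the following sketch computes the response probability from a set of coefficients. The coefficient values are hypothetical and are not estimates from the thesis, and the final term β_{k+1}H is included on the assumption that it carries over unchanged from the preceding steps of the derivation.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF, Phi(z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def response_probability(alpha, beta0, beta0_star, betas, beta_h,
                         x, hc, h, d) -> float:
    """P(R = 1) for the probit model with interaction terms X_j * (Hc + d)."""
    index = (alpha + beta0 * hc + beta0_star * d
             + sum(b * xj * (hc + d) for b, xj in zip(betas, x))
             + beta_h * h)
    return std_normal_cdf(index)


# Hypothetical coefficients; x holds (age, inc) as in setting (3).
p = response_probability(alpha=0.2, beta0=1e-4, beta0_star=0.1,
                         betas=[1e-7, 1e-9], beta_h=-1e-4,
                         x=[45.0, 12000.0], hc=68400.0, h=69200.0, d=0)
print(round(p, 4))
```

Because H and Hc are measured in currency units over two weeks, the coefficients on these terms are expected to be numerically small, which is why very small illustrative values are used here.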
