Design-based and Model-assisted estimators using Machine learning methods : Exploring the k-Nearest Neighbor metod applied to data from the Recreational Fishing Survey

(1)

DESIGN-BASED AND MODEL-ASSISTED ESTIMATORS

USING MACHINE LEARNING METHODS

- Exploring the k-Nearest Neighbor method applied to data from the Recreational Fishing Survey

Rikard Gard

28 january 2019

Master thesis, 15 ECTS

Örebro University, School of Business Supervisor: Yuli Liang

(2)

Abstract

This thesis explores the k-Nearest Neighbor (k-NN) method within the context of the design-based and model-assisted difference estimator and compares it to the GREG es-timator. Data from the Recreational Fishing Survey, provided by Statistics Sweden, is used. The survey is a complex design of several samples with two phases and different strata in each phase. It is shown that the k-NN estimator does not perform as good as the generalized regression (GREG) estimator for correcting nonresponse bias.

It is shown that the k-NN perform relatively well when the entire sample is observed but not in the presence of nonresponse. One reason for this is due to the curse of dimen-sionality since the survey samples used consist of too few observations with respect to the possible combinations of the auxiliary variables in the input vector. Another reason is that the k-NN method performs bad when nonresponse is present and requires better models for handling the nonresponse in combination with using the design weighted k-NN method.

Three different versions of the k-NN is analyzed where the design weighted k-NN per-forms best over the sample set and the unweighted k-NN perper-forms best over the response set. Although the difference for the different versions is marginal and the confidence intervals overlap.

The perspective of agencies of Production of Official Statistics is considered where comparability and efficient production is necessary. Comparability in practice often means that the auxiliary variables need to be categorical and there is a problem with using the k-NN method when all, or most, auxiliary variables are categorical. In fact, in this case the k-NN simply translate into the group mean model of the design-based post-stratified estimator if K is smaller than the number of observations in the smallest group. Only now a random sample (of size K) from each group is considered for the predictions.

Keywords— Machine Learning, Model-Assisted, Design-Based, Estimator, k-Nearest Neigh-bor (k-NN), Generalized Regression (GREG), Cross-Validation, Nonresponse Bias, Recreational Fishing Survey, Production of Official Statistics

(4)

1 Introduction

Nonresponse means that elements selected from a survey are eligible for the survey, do not provide the requested information, or provide information that is not usable (Bethlehem, Jelke and Cobben, Fannie and Schouten, Barry 2011) and methods are needed that handle and counteract the nonresponse bias that might arise. Bethlehem (1988) wrote that the proper use of auxiliary variables can reduce nonresponse bias so it is of interest to find new and better methods that utilize auxiliary variables. The term proper means that the models, given the auxiliary variables, explain both the target variable and the nonresponse behavior. And it should come as no surprise to the reader that nonresponse in sample surveys is a problem, not only for the Production of Official Statistics (POS), but the entire realm of statistics that relies on the respondents of the samples.

The recent decades of technological and algorithmic developments has paved the way for the application of Machine Learning (MachL) methods within the design-based framework. Ad-vanced and computational intensive MachL-methods can derive complex models that explain the relation between the auxiliary variables and the target variable(s).

This thesis explores the k-NN method within the design-based framework as a method for prediction in the difference estimator and it is compared to the GREG estimator. Strengths and weaknesses of the k-NN estimator is explored and analyzed. The k-NN is explored from the perspective of a National Statistical Institute (NSI) and data from the Recreational Fishing Survey (RFS) is used.

Breidt & Opsomer (2017) has contributed with a theoretical framework for the application of MachL methods for model-assisted estimators of finite population parameters which in this thesis will be called MachL estimators. The main idea is that a method derived with MachL methods from the sample, denoted ˆm(⋅), is used in the classical difference estimator where the predictions of the target variable are calculated using the MachL method for the auxiliary variables, denoted as ˆyk = ˆm(xk). To study the validity of the MachL estimators Breidt &

Opsomer (2017) provided a framework for which the design-based and model-assisted estimators asymptotic properties could be analyzed. This thesis is based on the theoretical framework that they introduced and it will be described in more detail later on.

1.1 Background

One of the main challenges today for survey sample statistics concerns the effects of nonresponse which can introduce bias and imprecision in the estimates. Nonresponse has been a challenge since the introduction of survey sampling and Hansen & Hurwitz (1946) proposed a method based on two-phase sampling to address the challenge.

There are roughly two ways for dealing with nonresponse in sample surveys which is pre-vention and adjustment techniques as described by Bethlehem (1988) and the focus is on the latter. If the nonresponse is missing at random the bias is of no concern (Lohr 2009). If the nonresponse is missing at random given covariates it is necessary to control for those covariates and in practice it is difficult to identify all covariates and a good underlying model. The problem of nonresponse arises when there is a selection bias due to the response variable(s) correlation with the response probability.

The main focus is to explore the k-NN estimator and how it deals with nonresponse bias. A survey known for its problematic nonresponse structure is the RFS where there is a known selection bias, people who fish are more inclined to answer the survey than people who do not fish which is shown in Table 1. The underlying model for the response probabilities is hard to derive since there does not seem to be any auxiliary variables in the administrative registers

(5)

at Statistics Sweden (SCB) that explain the tendency for people to fish or not. So the RFS data is a good candidate for exploring the k-NN estimator and how it handles the nonresponse bias. The following table includes the response probabilities for each sample and each period of measurement.

Table 1: Response rates for the RFS by panel and measurement period

Panel Period 1 Period 2 Period 3

10 No fishing / Nonresponse 10 Fished 38 % 66 % NA NA NA NA 11 No fishing / Nonresponse 11 Fished 43 % 68 % 40 % 62 % NA NA 12 No fishing / Nonresponse 12 Fished 41 % 75 % 38 % 68 % 33 % 63 % 13 No fishing / Nonresponse 13 Fished NA NA 40 % 76 % 38 % 66 % 14 No fishing / Nonresponse 14 Fished NA NA NA NA 32 % 69 % New sample 44 % 41 % 41 %

As is seen in Table 1 there is a relation between if a person fished in previous measurement periods and the tendency to answer the survey later on. A person selected for the first time in period 1 will be in panel 13 as either a person that fished or a person that did not fish / nonresponse when measured for period 2 where the response rate was 40 % if they did not fish and 76 % if the person did fish.

There are extensive literature on methods to handle the nonresponse such as Cochran (1977) who introduced a deterministic model for the response probability. Oh & Scheuren (1983) intro-duced the quasi randomization model for the response probability. Bethlehem (1988) introintro-duced a general framework to study the behavior of estimators given nonresponse by introducing re-sponse probabilities. Another method, which is considered in this thesis, is the Rere-sponse Homo-geneity Model (RHG) as described by Särndal et al. (1992) where response groups are identified and the responders within each group is treated as if they have the same response rate have been used in practice. There is an overview of estimation techniques for surveys with nonresponse where the calibration approach is promoted written by Särndal & Lundström (2005).

These contributions has led to a general praxis for statistical agencies on how to handle nonresponse in survey sampling where either the calibration approach is used or the GREG estimator and the RHG model is used. The hope is that any underlying model for the response probabilities correctly represent the nonresponse and that the calibration approach or GREG estimator solve the nonresponse problem. But it is only a hope and relies on assumptions about the response probabilities that are hard to test (Kalton & Kasprzyk 1986).

1.2 Scope

The goal of this thesis is to explore the k-NN method which can be used in the difference estimator to see if it can be used to reduce nonresponse bias. The work is confined to the perspective of agencies of POS where the auxiliary vector considered often consists of categorical variables. It is also of interest to see if the estimated nonresponse bias can be used for model specification as a measurement of fit.

(6)

Since the goal is to see if the k-NN estimator can be used to reduce nonresponse bias it is important to find methods for estimating the nonresponse bias. One way is to find a proxy variable from administrative registers that is similar to the target variable that can be matched for the whole sample set. However there are no proxy variables for the target variables related to fishing since there are no administrative registers with that type of information. Instead the income variable from the Income and Taxation register has been matched for the sample and response set which is used for estimating the nonresponse bias. This limits the use of the work for the target variables of the fishing survey but it makes it possible to explore if the k-NN method can be used to reduce the nonresponse bias.

1.3 Production of Official Statistics (POS)

Often some given groups of auxiliary variables is of interest for users of a particular survey. The user, or researcher rather, might want to know, for example, if there is a difference in income between gender or people with different educations. The user might want to compare the income for the groups with estimates from other surveys with the same groups and population, this is referred to as comparability. There are certain conditions estimators need to meet so that agencies of POS can implement them in production to meet requirements of comparability for different surveys. The most important is that the estimators need to sum up to population totals for the groups of the auxiliary variables. The method for this is called calibration and the most common type of auxiliary variable in POS is categorical. Some methods work well for categorical variables while other methods may not and it is of interest in this thesis to explore the k-NN from this perspective.

Another requirement for POS is effectiveness of the production. Some surveys include thou-sands of variables and it is not reasonable to create separate weights for each so it is necessary that the calibrated weights are not a function of the target variables so that the weights can be used for several target variables. It is not within the scope of this thesis to meet these conditions but worth mentioning since the RFS is for official statistics and produced by Statistics Sweden and consists of several different target variables and most MachL methods do not fulfill any of the two criteria. So comparability and effectiveness is at odds, for the time being until the problem is solved, using MachL methods for design-based and model-assisted estimators.

1.4 Machine learning and Statistics

In the field of Statistics, specifically survey sampling, the goal is estimation of characteristics (such as totals, means, variance, correlation and regression parameters) for a target population. Here two important distinctions need to be made between model-based survey sampling and design-based survey sampling as described by Sterba (2009). In the model-based framework sample is drawn (or assume that the observations are drawn) from an infinite population while in the design-based framework a sample is drawn from a finite population (a frame meant to identify the target population). The goal of each discipline over the history has been different; while model-based statistics has dealt with inference of causality (such as correlations and regression parameters), design-based statistics has focused on descriptive inference (such as totals, means and variances). Both disciplines deal with inference about some characteristic of the target population.

One important contribution to Statistics is that of Särndal et al. (1992) where they introduced the framework for model-assisted survey sampling which is different from that of model-based because the inference about a finite population characteristic does not depend on the underlying model but is only assisted by it. The classical example is when a linear regression model is fit to

(7)

the survey within the design-based framework. This is an example when two disciplines within the same field merge and new theoretical frameworks, terms and concepts emerge which can lead to misunderstandings and disagreement over the meaning of the words. In any case, even under the model-assisted framework, when in the presence of nonresponse, there will always be an underlying assumption about the model of the response probability that leads to the conclusion that the MachL estimators in this thesis are both model-dependent and model-assisted.

The main goal of Machine Learning, is not to estimate population characteristics, but to create a model used for prediction of some target variable given auxiliary information or to categorize data into groups. Bishop (2006) divides Machine Learning methods into:

1. Supervised problems - With the goal to predict values (regression) or classify (classification) given a test and training set that include the target variable (label) and an input vector of auxiliary variables (features). The model is trained over the training set while evaluated against the test set. The result is a model that can predict or classify new unknown observations given that the auxiliary variables are observed for the new observations. 2. Unsupervised problems - With the goal to discover groups, also known as clusters, or to

determine distributions of the data.

These two problems are not new to the field of Statistics and in fact the field of MachL has developed new terminology for similar problems that the field of Statistics has dealt with for a while (O’Connor 2008). However there is one fundamental difference between MachL and Statistics which is statistical inference. Statistical inference requires assumptions. A statistical model is a set of assumptions about how the observed data was generated. MachL methods do not rely on statistical models thus no assumptions however it is possible to combine MachL methods with statistical models. This mixture of statistical models and MachL methods has in recent years created a new area in Statistics called Statistical Learning (James et al. 2017).

This thesis only consider supervised problems because the goal is to predict the values for a target variable used in the model-assisted and design-based difference estimator. When MachL methods are applied within this context they are referred to as MachL estimators.

1.5 Disposition

The following disposition of the thesis consist of the design and data chapter where the design of the RFS is explained to provide an understanding of the design because it is easier to understand the choice of estimator if one understands the design first. Also the sample data is explained and the auxiliary information that is used for the auxiliary vector. The theory and method chapter provides the foundation of design-based inference and the corresponding asymptotic analysis to understand the use of MachL methods in the difference estimator. The theory explained is then applied to the RFS for which all computations are done. The GREG estimator and the k-NN estimator is explained and necessary equations are provided that are used for this thesis. The theory and method chapter is ended with a section on how nonresponse bias can be calculated given a proxy target variable which is known for both the sample set and the response set. The chapter results and analysis includes estimates for each version of the k-NN method with corresponding cross-validation to show the reader how each method performs. This chapter is ended with the final result presented in Table 3 with estimates for the optimal choice of k. The last chapter discussion is about the conclusions of the work done and the strength and weakness of the k-NN method in the context of Production of Official Statistics.

(8)

2 Design and data

In this chapter the design, the auxiliary variables as well as the register variables of the Recre-ational Fishing Survey (RFS) will be presented.

2.1 Sample design

The design of the RFS produced at Statistics Sweden can be considered as a complex design since it consists of four independent samples where each sample is a two-phased sample with different strata in each phase conditioned on the target variable from earlier measurement periods. And this is only for one measurement period. The complexity of the design can be seen in Figure 1 later on.

The frame population consists of all people who were registered in the Swedish Population Register 2017. During one year there are three periods of measurements: January - April (period 1), May - August (period 2) and September - December (period 3). In each of these periods each sample consists of 4 independent samples that are weighted together to get the final estimate of each period. One sample is a stratified random sample and the other three samples are panel samples which means they are two-phased samples where the first phase always consist of a stratified sample from earlier measurement periods. The second phase for each of the three samples are also stratified but on the target variable if a person did some recreational fishing last year.

(9)

Figure 1: Schematic figure of the sample design

2.2 Data

The data used in this thesis consists of the measurements for year 2017 period 2 from the RFS. Auxiliary variables (age, gender, education, region, birth country and living in a large city) and the proxy variable (income) has been collected for the target population (individuals between 17 and 80 years) from the administrative registers Total Population Register (TPR) for 2017, Education Register for 2017 and the Income and Taxation Register (IoT) for 2016.

Some imputation has been done for the auxiliary variables and the proxy variable for the sample. Income has been set to 0 if no information could be found in IoT (not to be confused with the Internet of Things). The other auxiliary variables has been set to it is mode if there were missing values. These imputations should not interfere with the interpretation of the results when the result is analyzed.

(10)

3 Theory and Method

The framework of the design-based and model-assisted estimators will be described since this is the framework on which the difference estimator is derived from. This is followed by a section about survey asymptotic analysis on which modern prediction methods are justified within the difference estimator. This framework will then be described as applied to the design of the RFS and each of the prediction methods, k-NN and linear regression, will be described as special cases of the difference estimator. The model specification process for finding an optimal K for the k-NN method is explained with the 5-folds cross-validation method used.

3.1 Design-based and model-assisted estimators

One major contribution to the design-based approach is that of Särndal et al. (1992). The design-based approach means random samples are drawn from a finite population. This is the foundation for deriving estimators where the target variables are considered fixed and the samples drawn are realized by probabilistic selection schemes.

Denote y_k as the target variable for the k:th element in the finite population denoted U = {1, 2, ..., k, ..., N}. The target parameter is the total and denoted as ty = ∑

k∈U

yk.

The sampling design, denoted p(s), is the probability of selecting some sample s of a spe-cific outcome. The random nature of the sample can be described as a function of the sample membership indicator

Ik=⎧⎪⎪⎨⎪⎪

⎩

1, if k∈ s 0, otherwise For k, l ∈ U, let π_k = E[I_k] = P(k ∈ s) = ∑

s∃k

p(s) be the first-order inclusion probabilities where s∃k denotes that the sum is over those samples s that contain the given k, and π_kl = E[IkIl] = P(k, l ∈ s) = ∑

s∃k

p(s) the second-order inclusion probabilities. Horvitz & Thompson (1952) introduced their well-known HT estimator that utilizes the sampling design by inverse probability weighting HT(y) = ∑ k∈s yk πk = ∑_k_∈U yk Ik πk (1) which is a design-unbiased estimator for ty.

The variance estimator has the form ˆ V(HT(y)) = ∑ k,l∈U Cov(Ik, Il) ykylIkIl πkπlπkl (2)

When nonresponse is presence in a survey the HT estimator is a bad choice unless the original inclusion probabilities are adjusted. The response homogeneity group model (RHG) described in Särndal et al. (1992) will be used to deal with the nonresponse. There are other methods to adjust for nonresponse with one being the calibration approach given in Särndal & Lundström (2005) but this will not be considered. The RHG model means that individuals within the same response groups, which can be set to the original design stratum, are assumed to be homogeneous with respect to the response probabilities. This means each inclusion probability is adjusted according to the responses within each stratum so from now on the first-order and second-order inclusion probabilities will be denoted as πnr_k and πnr_kl and summed over the response set r. It is also necessary to change the sample membership indicator to the response membership indicator

(11)

defined as

Ik=⎧⎪⎪⎨⎪⎪

⎩

1, if k∈ r 0, otherwise

The HT estimator does not make use of any auxiliary variables. However, it is possible to improve the precision and accuracy by utilizing auxiliary information known for the population and the sample as both Bethlehem (1988) and Särndal & Lundström (2005) write. One well-known estimator is the difference estimator which can be used together with models that rely on MachL methods as explained by Breidt & Opsomer (2017).

The vector of auxiliary variables is denoted by x_k which is known for the sample s and the population U and the total is denoted as tx= ∑

k∈Uxk

. A method m(⋅) for predicting yk fromxkis

denoted as y_k= m(x_k) where m(x_k) does not depend on the sample. In general the method will depend on the sample but it is necessary to show this distinction when the estimated method is introduced later on. Now denoted the difference estimator as

DIF F(y, m) = ∑ k∈U m(xk) + ∑ k∈r yk− m(xk) πnr_k (3)

which is exactly unbiased if the response model is correct. The variance estimator is ˆ V(DIFF(y, m)) = ∑ k,l∈U Cov(I_knr, I_lnr)yk− m(xk) πnr_k yl− m(xl) π_lnr I_knrI_lnr πnr_kl . (4)

3.2 Survey asymptotic and methods of prediction

In the previous section a general framework for the difference estimator was described where a model m(⋅) which is independent of the sample is used. However, in practice, it is not very likely to obtain such a model so the method needs to be estimated from the sample, denoted

ˆ

m(⋅) and referred to as the estimated method. To motivate the switch to the estimated method it is necessary to understand the asymptotic behavior. Breidt & Opsomer (2017) provides a theoretical framework for the asymptotic analysis of design-based estimators on which this thesis is based upon.

They start with the asymptotic properties of the estimator HT(y). They provide the con-ditions that need to be met and show that the HT(y) is design consistent. Given the regularity conditions of probability sampling designs and well-behaved limits of the second order finite population moments, together with earlier established fact that the HT-estimator is unbiased, they show that the HT-estimator is design mean square consistent, i.e. design consistent. The regularity conditions can also be found in Särndal et al. (1992) but not described within the framework of asymptotic analysis. Other conditions concern if the design is measurable which taken together with the assumption of asymptotic normality implies that normal confidence in-tervals can be constructed. These asymptotic results carry over to the difference estimator since the difference estimator is a shifted version of the HT-estimator. This, they argue, is necessary in order to obtain asymptotic properties of model-assisted estimators.

Given that the asymptotic properties are met it is possible to describe the difference estimator with the estimated method ˆm(⋅) as

DIF F(y, ˆm) = ∑ k∈U ˆ m(xk) + ∑ k∈r yk− ˆm(xk) πnr k = DIFF(y, m) + (remainder) (5)

This means that the difference estimator, with the estimated method, is an unbiased estimator except for the remainder term. Different estimated methods can be used and the challenge is to

(12)

show that the remainder is negligible, which can be done with smoothing conditions and Taylor approximations and depends on the chosen method (Breidt & Opsomer 2017). However, if the remainder is shown to be negligible then the model-assisted difference estimator inherits the asymptotic properties and will be asymptotically unbiased.

3.3 GREG estimator

The GREG estimator is a special case of the difference estimator where the estimated method used in the difference estimator for all k∈ U is

ˆ m(xk) = J ∑ j=1 ˆ βjxjk (6)

where J is the number of auxiliary variables and the ˆβj is estimated from the sample by ˆβ =

( ˆβ1, ..., ˆβJ)′ = ( ∑ k∈s xkx’k σ2 kπk) −1 ∑ k∈s xkyk σ2

kπk as described by Särndal et al. (1992) as the weighted

least-squares estimator of β with σ_k= 1. It is also shown that the GREG estimator is approximately unbiased for large samples.

Breidt & Opsomer (2017) examine the asymptotic properties of the GREG estimator and show that the rate of the remainder term in Equation (5) is O_p(N(Nπ∗_N)−1) = o_p(N(Nπ_N∗)−1/2) which implies that the GREG estimator behaves asymptotically like the difference estimator and it is asymptotically unbiased and mean square consistent.

In it’s most basic form the computation of the ˆβ estimates does not rely on any underlying assumptions or any statistical models. It is simply a mathematical minimization problem as described by Casella & Berger (2002). It is possible to define a loss function for minimizing the ordinary least square and iterate through estimates, given a learning rate set by the statistician, and obtain the same results. This is simply the linear regression where the ordinary least square is minimized by a machine learning algorithm. This is where the estimates by machine learning methods and statistical methods converge. So it is possible to argue that the method for linear regression is a MachL method, as long as no assumptions about underlying statistical models are made. However linear regression has been applied in the context of design-based and model-assisted estimators for a while. The well-known GREG estimator is thoroughly described by Särndal et al. (1992) where the linear regression is motivated by an assumption that the values of yk is stochastic and in fact generated from a statistical model.

The choice of auxiliary variables comes from the process of model specification and evaluation in finding the best model fit. This is a cumbersome procedure where often p2 models has to be checked, with p being the number of variables, and evaluated by different criteria such as Akaike information criteria (AIC), Bayesian information criteria (BIC), mean square error (MSE), cross-validation (CV) or adjusted R2 (James et al. 2017). With access to more than several 1000 variables from administrative registers it is not possible to follow standard procedure since it would take too long. The perspective of agencies of POS is taken into consideration which means some categorical variables need to be present to satisfy the quality component comparability. Due to this an auxiliary vector of “standard” auxiliary variables often used in the estimations are considered.

3.4 k-NN estimator

The k-Nearest Neighbor (k-NN) method can be used to predict quantitative variables by finding the k nearest neighbors and taking the average of the neighborhood. The distance measurement

(13)

used for finding the nearest neighbors, represented by L_k, is the Euclidean distance written on the form √_n ∑ i=1(qi− pi) 2 _{where q}

i and pi are auxiliary variables from training data and test data.

The number of K neighbors are set in a process of finding the best fit of the model for the test and training data. The unweighted formula for the k-NN is

ˆ m(xk) = 1 K ∑_l_∈L k yk (7)

however this method is designed for identically distributed y_k′s. In the RFS (and many other sample surveys) the inclusion probabilities are not identical. To handle this the sampling design weights are considered, as is done for the linear regression, by applying the inclusion probabilities. The adjustment weights w_k = 1 means no consideration is done to the distance where w_k =

1

n

∑

i=1(qik−pik)

2 means observations with smaller distance has a larger prediction impact than larger

distances. The formula for this k-NN method is

ˆ m(xk) = ∑ l∈Lk yk∗wk πk ∑ l∈Lk wk πk (8)

Baffetta et al. (2009) applied the standard k-NN method in the difference estimator. Even though no proofs for the validity of the estimator were given for why the remainder term in Equation (5) could be considered negligible they provided an intuitive argument which was justified by Monte Carlo simulations.

To find the best choice of k of the k-NN it is necessary to do model assessment. The model assessment approach considered for estimating the test error is named m-folds cross-validation (CV) method (James et al. 2017). This approach randomly divides the data into m= 5 different groups, or folds, of approximately equal size. One fold is first used used as a validation set and the other m-1 folds are used as a training set. The Mean Square Error (MSE) is then computed for the validation set and the method is repeated m times until all folds have been used as a validation set. The m-fold CV estimate is then computed by taking the average of the MSEs

CV_(m)= 1 m m ∑ i=1 M SEi (9)

3.5 Difference estimator for the RFS

The estimators described so far has been for a general design consisting of one phase. In this section an estimator for the sampling design of the RFS will be derived. The RFS is a composite estimator consisting of four independent two-phase sampling designs with different strata in each phase. The first phase for each sample is denoted as s_ah with the sampling design denoted by pah(⋅) and the second phase sample is denoted sg, a sub-sample selected from sa, with sampling

design p(⋅∣sa). Designs for both the response set and the sample set is considered, however for

simplicity only the subscript of the response set r is used. The first phase consists of 13 different strata denoted with h while the second phase strata denoted by g is conditioned on whether or not the individual fished or did not fish in the first phase which means there are two stratum in the second phase.

The notation for the difference estimator from Särndal et al. (1992) for a two phased sample is adjusted to the notation of Breidt & Opsomer (2017) with the application of the RHG model,

(14)

where the response groups are set to the design strata, for the inclusion probabilities gives the following formula: DIF F(y, ˆm1, ˆm2) = ∑ U ˆ m1(xk) + ∑ sa ˆ m2(xk) − ˆm1(xk) πak + ∑r yk− ˆm2(xk) π_knr∗ (10)

The π_ak represent the inclusion probability for the first phase and π_knr∗ = π_akπnr_k_∣s

a where

πnr_k_∣s

a is the nonresponse adjusted inclusion probability for the second phase conditional on the

first phase sample. And the methods can be applied for different auxiliary information for each phase. However this thesis will only consider utilizing auxiliary information on the population level and the response set which means that ˆm1(xk) = ˆm2(xk) so the second term is zero and

also the subscript 1 can be dropped:

DIF F(y, ˆm) = ∑ U ˆ m(xk) + ∑ r yk− ˆm(xk) πnr∗ k (11) with the corresponding variance estimator as

ˆ V(DIFF(y, ˆm)) = H ∑ h=1 k∑∈rh ∑ l∈rh ∆akl πnr∗ kl Dk πak Dl πal+ H ∑ h=1k∑∈rh ∑ l∈rh ∆kl∣sa πnr kl∣sa ˇ ˇ DkDˇˇl (12)

where the covariance terms are ∆akl= πakl− πakπaland ∆kl∣sa= π

nr kl∣sa− π nr k∣saπ nr l∣sa. The difference

Dk is calculated as yk− ˆm(xk) and the inverse probability weighted difference is ˇˇDk= Dk/πknr∗.

The first-order inclusion probabilities are calculated by πak = n_Nah

h, π nr k∣sa= mg nah and π nr∗ k = πakπ nr k∣sa

where N_h is the size of the population for each stratum, n_ah is the size of the first phase sample with stratum h, n_ag is the size of the first phase sample with stratum g for the second phase and mg is the number of responders in stratum g of the second phase sample. The second-order

inclusion probabilities are summarized in the following table

Table 2: Inclusion probabilities for each stratum and phase Condition πakl πnr_kl_∣s_a k= l nah Nh mg nag k≠ l k∈ sah & l∈ sah k∈ sg & l∈ sg nah Nh nah−1 Nh−1 mg nag mg−1 nag−1 k≠ l k∈ sah & l∈ sah′ k∈ sg & l∈ sg nah Nh n_ah′ N h′ mg nag mg−1 nag−1 k≠ l k∈ sah & l∈ sah k∈ sg & l∈ sg′ nah Nh nah−1 Nh−1 mg nag m_g′ n ag′ k≠ l k∈ sah & l∈ sah′ k∈ sg & l∈ sg′ nah Nh n ah′ N h′ mg nag m g′ n ag′ 1)_{The condition column means that if that}

con-dition is met then the inclusion probability is calculated as described for that row

The estimator that has been described is for one sample of the RFS but there are four independent samples used for estimating one measurement period. Särndal et al. (1992) calls

(15)

the method for utilizing more than one sample for the same measurement period the composite estimator. It is described for the RFS as:

ˆ ty = 4 ∑ i=1 (wi)DIFF(y, ˆmi) (13) where ∑4 i=1

wi= 1 and the weights are set proportional to the number of respondents. The

corre-sponding variance estimator is ˆ V(ˆty) = 4 ∑ i=1 (wi)2Vˆ(DIFF(y, ˆmi)) (14)

There are no covariance terms since the samples are independent.

3.6 Bias estimation

The source of inspiration for measuring the effect of bias due to nonresponse is a nonresponse analysis for the Labour Force Surveys (LFS) done by Statistics Sweden (2018). The idea of the method is to use a proxy variable from administrative registers that is relevant to the target variable of interest. It is a straightforward formula for estimating the relative nonresponse bias;

ˆ

RB( ˆθr) = ( ˆθr− ˆθs)/ ˆθs where ˆθr and ˆθs are the estimates for the response set and the sample set

given the MachL estimator.

For the LFS there are a handful of different auxiliary variables that could be of use but for the RFS there are no auxiliary information about the amount of fish caught, fishing licenses or anything remotely similar. So it is not possible to use this method to estimate the nonresponse bias for any fishing variables. Instead income is considered as the target (proxy) variable. This allows for exploring the k-NN method with respect to the nonresponse bias. The information of the estimated nonresponse bias may be valueable for model assessment.

(16)

4 Results

This chapter contains the results of the different MachL estimators as well as the results from the model specification process explained in the k-NN estimator section. An optimal choice of K with respect to the closest estimate to the true population total is shown and the chapter ends with a result table for different selections of the k-NN method as well as the estimate from the GREG estimator. The theory from earlier chapter is applied to the data from the RFS. The auxiliary vector for the models consists of gender, age, education, birth country, region and whether the individual lives in a large city or not.

One of the goals of this thesis has been to explore the k-NN method in the difference estimator and compare it to the GREG estimator. The different methods have been evaluated by 5-folds cross-validation to find the best choice of K for the income variable in the RFS. Three different versions of the k-NN was analyzed in this thesis: 1) The unweighted k-NN 2) The design weighted k-NN 3) The distance and design weighted k-NN.

4.1 Unweighted k-NN estimator

The unweighted k-NN estimator predicts the target variable (income) with the method from Equation (7). The upper graph in Figure 2 shows an increase in the estimates given the response set for K > 25 which means that the estimates will be more biased with larger K. The lowest values of the relative nonresponse do not correspond to the most accurate estimates.

Figure 2: Unweighted k-NN estimator

The estimated relative nonresponse bias increases from K ≈ 2 mainly due to increasing bias for the response set estimates. The patterns are different for the response set and sample set meaning different K can, and should, be used.

(17)

Figure 3: 5-fold cross-validation of the unweighted k-NN method for the sample set

Figure 4: 5-fold cross-validation of the unweighted k-NN method for the response set

The estimated CV for the sample set and the response set show a decreasing trend and it is clear that lower CV are estimated from the sample set than the response set. Given the cross-validation it is not always clear what value of K to choose since different trends are observed for the different samples of the survey. But when K is around 20 then for three out of the four samples the estimated CV begin to increase.

(18)

4.2 Design weighted k-NN estimator

The design weighted k-NN estimator predicts the target variable with the method from Equation (8) with w_k = 1. Figure 5 show an increasing trend of the estimates given the response set for K > 12 which mean the estimates will be more biased with larger K. For the sample set it is K> 23.

Figure 5: Design weighted k-NN estimator

The upper graph in Figure 5 shows estimates from the sample set relatively closely to the true population total. It can be seen in Table 3 that the true value is indeed covered by the estimated confidence interval. The closest estimates to the population total given the sample set is for K= 23. The pattern for the relative nonresponse bias is similar to that of the unweighted k-NN. In fact it is similar for all different versions of the k-NN method.

(19)

Figure 7: 5-fold cross-validation of the design weighted k-NN method for the response set

The cross-validation graphs show similar trends as for the unweighted k-NN where the optimal K seems to lie around 30.

4.3 Design and distance weighted k-NN estimator

The design weighted k-NN estimator predicts the target variable with the method from Equation

(8) where w_k = 1

(√n

∑

i=1(qik−pik) 2)

2. Figure 8 shows an increasing trend of the estimates given the

response set for K> 33 which mean the estimates will be more biased with larger K while larger K gives more accurate estimates when the whole sample set i used.

Figure 8: Design and distance weighted k-NN estimator

(20)

for the other k-NN methods however the trend decrease more slowly meaning larger K will be necessary in the model specification.

Figure 9: 5- fold cross-validation of the design and distance weighted k-NN method for the sample set

Figure 10: 5-fold cross-validation of the design and distance weighted k-NN method for the response set

Both Figure 9 and Figure 10 show the CV of the same decreasing trend but for different levels. This is the same pattern as for the other k-NN methods. The model seems to be optimal with K around 30 according to the estimated CV.

(21)

To clearly see the difference between the different versions of the k-NN estimators a graph of the estimates from the response set is shown where it is clear then the difference is marginal since the estimated confidence intervals overlap which can be seen from Table 3.

Figure 11: Estimates of the response set

It can be seen from Figure 11 the unweighted k-NN methods performs best for reducing the nonresponse bias but still not as good as the GREG estimator.

(22)

The estimates in Figure 12 show similar decreasing trends with larger K and it is also clear that the design weighted k-NN estimator computes estimates closest to the true value. However none of the k-NN estimators are as close as the GREG estimator.

4.4 Comparison

Table 3: Comparison of the estimates

Estimator Auxiliary vector Sample set Response set Relative Bias

Population (true) - 2.211 -

-HT - 2.213 ± 0.096 2.452 ± 0.171 10.8%

GREG estimator Gender Age Bcountry

Education Region Metropol 2.221 ± 0.069 2.275 ± 0.125 2.5%

k-NN estimator (K=23) Unweighted

Gender Age Bcountry

k-NN estimator (K=23) Design weighted

Gender Age Bcountry

k-NN estimator (K=23) Distance and design weighted

Gender Age Bcountry

1)_{All estimates are written in 10}12

2)_{The intervall of the standard errors estimate is a 95 % confidence interval}

If the whole sample is used then all estimated confidence intervals from the different estimators cover the true value of the population but the HT and GREG estimators are more accurate than the k-NN since the values are more accurate. If the response set is used then only the estimated confidence intervals from the GREG estimator cover the true value of the population. The GREG estimator also has the smallest relative nonresponse bias.

It is clear from Table 3 that the GREG estimator performs better than both the k-NN estimator and the HT estimator. The linear regression applied in the difference estimator is a robust method and is not as sensitive as the k-NN when there are few observations or when nonresponse is present. So the GREG estimator seems a better choice for reducing nonresponse bias. The k-NN estimators performs better than the HT estimator with respect to both accuracy, precision and nonresponse bias.

5 Discussion

This thesis explore three different versions of the k-NN method used in the difference estimator and compares them to both the HT estimator and the GREG estimator. The design weighted k-NN seems promising and performs quite well, when the whole sample set is used, with respect to accuracy and precision as is seen from Table 3. However it does not perform well in the presence of nonresponse. The GREG estimator obtains the most accurate estimates and from the results it is clearly the best estimator. The unweighted k-NN estimator performs better over the response set than any of the other k-NN estimators but the difference is marginal and the confidence intervals overlap.

There are some possible reasons why the k-NN estimator does not perform that well. The first being that the real relationship between the auxiliary variables and the predicted value is linear. In this case the k-NN method fails to obtain a good working model, since it models a non-linear relation, for predicting the target variable. This can be solved by other distance measures for

(23)

the different auxiliary variables. Weight functions can be considered where the importance, or distance, between the values of some categorical variable are used to create homogeneous groups of the neighborhoods but this is a cumbersome analysis. It is however possible to specify better models for the k-NN method than is specified in this thesis.

Another reason for why the k-NN does not perform that well is that if there are few ob-servations, or too many auxiliary variables, then MachL methods might suffer ”the curse of dimensionality” (James et al. 2017). This means that the prediction for the k-NN of one neigh-borhood will be estimated by the ”closest neighneigh-borhood” where some, or all, groups are selected at random. Selecting a random neighborhood as the closest group of prediction is not a sound method. In fact James et al. (2017) has written that parametric models (such as the linear regression) will tend to outperform non-parametric models (such as the k-NN method). From the results in Table 3 the linear regression outperforms the k-NN method.

Another reason why the k-NN performs badly is that it fails to create a good model in the presence of nonresponse. The nonresponse is handled by the RHG model where each stratum is considered as the response homogeneous groups. The assumption that all individuals answer with the same probability in each strata is of course a naive assumption in this case and it seems necessary to create better RHG groups. It seems is if both the k-NN and HT estimators are both more sensitive to the response mechanism than the GREG estimator when dealing with nonresponse since all estimators have the same underlying RHG model.

The k-NN estimator is less robust than the GREG when dealing with categorical variables and few observations. It seems as if the k-NN method does not manage to handle the nonresponse which can be seen by the difference between estimates from the sample set and estimates of the response set. To solve this it is possible to reduce the auxiliary vector to only a few categorical variables however since the variables are categorical the k-NN method will be reduced to a group mean model with random sampling within each group, given that K<= ng for each group where

ng is the number of observations in group g which is the neighborhood of the k-NN. This is

actually described as the group mean model and the post-stratified estimator in Särndal et al. (1992) except that now only K observations are used from each group. If K> g then observations from other groups are used for the group mean and the next neighbor K in line needs to have predictive power for the group. This may or may not be the case depending on how the relation between the auxiliary variables distances are modeled in the k-NN method. This thesis apply a secondary sorting criteria of a random uniform variable to avoid any problem of bias due to the sorting of the input data. Suffice to say the k-NN method may work well but it is not a good method when all of the variables considered are categorical and if there are few observations. The strength of the k-NN lies in utilizing auxiliary variables that are numerical where the relation is non-linear. For the purpose of Production of Official Statistics the k-NN seems as a poor choice especially since comparability is desired between surveys. Another problem is when combinations of auxiliary variables are present in the population but not in the sample, given some auxiliary variables. Then the k-NN will always predict values from groups that may or may not have any predictive power. It is important to design surveys with respect to the auxiliary vectors of interest to ensure that all categorical variables have some information for the sample.

Other combinations of the auxiliary variables age, gender, education, birth country, living in a large city and region have been considered and tested in this thesis but no result has been presented. The reason is that the estimates did not become any more accurate which would indicate that the biggest problem of the k-NN is how it deals with the nonresponse and that a better RHG model is required. It might be possible to find better models with more quantitative auxiliary variables however since it is common, and sometimes necessary, to consider only categorical auxiliary variables in POS this has not been considered. In this case the GREG

(24)

estimator performs well and should be used.

The estimated relative nonresponse bias does not work as a measurement of fit. For example the lowest estimated nonresponse bias in Figure 8 does not correspond to the most accurate estimates. One of the goals of this thesis was to test if the estimated nonresponse bias could be a good selection criteria for model specification and it is clear that it is not a good measurement of fit.

During the course of this thesis the importance of exploring the administrative registers became clear. New and more relevant auxiliary variables that explain nonresponse and the target variables are needed. An estimator is only as good as the auxiliary information allows.

It is important to explore the unsupervised MachL methods; for example cluster methods and factor methods. Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome (2017) write about Principal Component Regression that creates new variables, that are orthogonal to each other, used in the linear regression. This method could be used to create new variables that are used in the GREG estimator. This area of application together with exploring a lot of other auxiliary variables is a good field for future studies.

References

Baffetta, F., Fattorini, L., Franceschi, S. & Corona, P. (2009), ‘Design-based approach to k-nearest neighbours technique for coupling field and remotely sensed data in forest surveys’, Remote Sens. Environ.113, 463–475.

Bethlehem, J. G. (1988), ‘Reduction of Nonresponse Bias Through Regression Estimation’, Jour-nal of Official Statistics 4(03), 251–260.

Bethlehem, Jelke and Cobben, Fannie and Schouten, Barry (2011), Handbook of Nonresponse in Household Surveys, John Wiley & Sons, Inc., New Jersey.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer, New York.

Breidt, F. J. & Opsomer, J. D. (2017), ‘Model-Assisted Survey Estimation with Modern Predic-tion Techniques’, Statistical Science32(02), 190–205.

Casella, G. & Berger, R. L. (2002), Statistical inference, Brooks/Cole, Cengage Learning, United Kingdom.

Cochran, W. G. (1977), Sampling Techniques, John Wiley & Sons, Inc, New York.

Hansen, M. H. & Hurwitz, W. N. (1946), ‘The problem of non-response in sample surveys’, Journal of the American Statistical Association 41, 517–529.

Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome (2017), The Elements of Statistical Learning - Data Mining, Inference, and Prediction, Springer.

Horvitz, D. G. & Thompson, D. J. (1952), ‘A generalization of sampling without replacement from a finite universe’, Journal of the American Statistical Association 47, 663–685.

James, G., Witten, D., Hastie, T. & Tibshirani, R. (2017), An Introduction to Statistical Learn-ing, Springer, New York.

Kalton, G. & Kasprzyk, D. (1986), ‘The treatment of missing survey data’, International Statis-tical Review 12, 1–16.

(25)

Lohr, S. L. (2009), Sampling: Design and Analysis, Brooks/Cole, Cengage Learning, 20 Channel Center Street, Boston.

O’Connor, B. (2008), ‘Statistics vs. Machine Learning, fight! ’.

https://brenocon.com/blog/2008/12/statistics-vs-machine-learning-fight/.

Oh, H. L. & Scheuren, F. J. (1983), ‘Weighting adjustment for unit nonresponse’, Incomplete Data in Sample Surveys: Academic Press 02, 143–184.

Särndal, C.-E. & Lundström, S. (2005), Estimation in Surveys with Nonresponse, John Wiley & Sons, Ltd, Chichester.

Särndal, C.-E., Swensson, B. & Wretman, J. (1992), Model Assisted Survey Sampling, Springer, New York.

Statistics Sweden (2018), Analysis on nonresponse bias for the Swedish Labour Force Surveys (LFS), Statistics Sweden, Stockholm.

Sterba, S. K. (2009), ‘Alternative Model-Based and Design-Based Frameworks for Inference From Samples to Populations: From Polarization to Integration’, Multivariate Behav Res. 44(06), 711–740.

(26)

A

Code for the input data

Listing 1: Input data

############################## Reading and c l e a n i n g d a t a ################################## # Data p r o c e s s i n g −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− # i n s t a l l . p a c k a g e s ( " r e a d x l " ) library ( readxl )

setwd( "P: /Data/PMU/MIO/RikardG/ Studier / U p p s a t s 2 0 1 8 s t a t i s t i k /Data" ) d a t a f i s k <− read_e x c e l ( " d a t a f i s k_clean . xlsx " )

d a t a f i s k$kon <− as . character ( as . character ( d a t a f i s k $kon ) )

d a t a f i s k$ a l d e r s l u t <− as . numeric( as . character ( d a t a f i s k $ a l d e r s l u t ) ) d a t a f i s k$SStad <− as . character ( as . numeric( d a t a f i s k $SStad ) )

d a t a f i s k$ f l a n d <− as . character ( as . numeric( d a t a f i s k $ fland ) ) d a t a f i s k$ g i f t <− as . character ( as . numeric( d a t a f i s k $ g i f t ) )

d a t a f i s k$ u t b i l d n i n g <− as . character ( as . numeric( d a t a f i s k $ utbildning ) ) #d a t a f i s k$ f2a <− as . character ( as . numeric ( d a t a f i s k $f2a ) )

# F u n c t i o n f o r t h e mode , u s e d f o r i m p u t a t i o n getmode <− function ( v ) {

u n i q v <− unique( v )

u n i q v [which .max( tabulate (match(v , uniqv ) ) ) ] }

# d a t a c l e a n i n g and i m p u t a t i o n

d a t a f i s k$CSFVI [ i s . na( d a t a f i s k $CSFVI) ] <− 0 d a t a f i s k$ f2a [ i s . na( d a t a f i s k $ f2a ) ] <− 1

d a t a f i s k$kon [ i s . na( d a t a f i s k $kon ) ] <− getmode ( d a t a f i s k $kon )

d a t a f i s k$agecat [ i s . na( d a t a f i s k $agecat ) ] <− getmode ( d a t a f i s k $agecat ) d a t a f i s k$ a l d e r s l u t [ i s . na( d a t a f i s k $ a l d e r s l u t ) ] <− getmode ( d a t a f i s k $

a l d e r s l u t )

d a t a f i s k$ u t b i l d n i n g [ i s . na( d a t a f i s k $ u t b i l d n i n g ) ] <− getmode ( d a t a f i s k $ u t b i l d n i n g )

d a t a f i s k$Region [ i s . na( d a t a f i s k $Region ) ] <− getmode ( d a t a f i s k $Region ) d a t a f i s k$SStad [ i s . na( d a t a f i s k $SStad ) ] <− getmode ( d a t a f i s k $SStad ) d a t a f i s k$ f l a n d [ i s . na( d a t a f i s k $ f l a n d ) ] <− getmode ( d a t a f i s k $ fland ) d a t a f i s k$ g i f t [ i s . na( d a t a f i s k $ g i f t ) ] <− getmode ( d a t a f i s k $ g i f t ) # rename

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )=="Summa_Tot_4_12" ] <− " SummaTot412"

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" stratum_f a s 2 " ] <− " s t r a t u m f a s 2 "

(27)

s t r a t u m f a s 1 "

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )=="SUN2000Niva_Grov" ] <− " SUN2000NivaGrov "

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" Storan_f a s 1 " ] <− " Storanfas1 " colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" l i l l a n_f a s 1 " ] <− " l i l l a n f a s 1 " colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" svarande_f a s 1 " ] <− "

s v a r a n d e f a s 1 "

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" Storan_f a s 2 " ] <− " Storanfas2 " colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" l i l l a n_f a s 2 " ] <− " l i l l a n f a s 2 " colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" Svarande_f a s 2 " ] <− "

S v a r a n d e f a s 2 "

colnames ( d a t a f i s k ) [ colnames ( d a t a f i s k )==" c a l_weight " ] <− " calweight " # d a t a f i s k <− cbind ( d a t a f i s k , model . matrix ( ~ kon − 1 , data = d a t a f i s k

) )

# d a t a f i s k <− cbind ( d a t a f i s k , model . matrix ( ~ aldergroup − 1 , data = d a t a f i s k ) )

p o p t o t <− read_e x c e l ( " p o p u l a t i o n s t o t a l e r . xlsx " ) # i n s t a l l . p a c k a g e s ( " d p l y r " )

library ( data . table ) # l i b r a r y ( d p l y r )

c o m b i n a t i o n s <− expand . grid (

kon=unique ( poptot$kon ) ,

u t b i l d n i n g=unique ( poptot$ u t b i l d n i n g ) , Region=unique ( poptot$Region ) ,

SStad=unique ( poptot$SStad ) , f l a n d=unique ( poptot$ f l a n d ) , a l d e r s l u t=unique ( poptot$ a l d e r s l u t ) ) h e l p d a t a <− merge( combinations , po pto t , by . x = c ( ’ kon ’ , ’ a l d e r s l u t ’ , ’ u t b i l d n i n g ’ , ’ Region ’ , ’ SStad ’ , ’ f l a n d ’ ) , by . y = c ( ’ kon ’ , ’ a l d e r s l u t ’ , ’ u t b i l d n i n g ’ , ’ Region ’ , ’ SStad ’ , ’ f l a n d ’ ) , a l l=T)

colnames ( helpdata ) [ colnames ( helpdata )=="summa_ink " ] <− "inkomst" h e l p d a t a$kon <− as . character ( as . numeric( helpdata$kon ) )

h e l p d a t a$ u t b i l d n i n g <− as . character ( as . numeric( helpdata$ utbildning ) ) h e l p d a t a$SStad <− as . character ( as . numeric( helpdata$SStad ) )

h e l p d a t a$ f l a n d <− as . character ( as . numeric( helpdata$ fland ) ) h e l p d a t a$ antal [ i s . na( helpdata$ antal ) ] <− 0

h e l p d a t a$inkomst [ i s . na( helpdata$inkomst ) ] <− 0 h e l p d a t a <− subset ( helpdata , antal >0)

(28)

# i n s t a l l . p a c k a g e s ( "dummy" )

# i n s t a l l . p a c k a g e s ( " dummies " ) l i b r a r y ( dummies ) library (dummy)

a u x c a t <− c ( ’ kon ’ , " utbildning " , " fland " , "SStad" , "Region" )

h e l p d a t a <− cbind ( helpdata ,dummy: :dummy( helpdata [ auxcat ] , i n t=TRUE) ) d a t a f i s k <− cbind ( datafisk ,dummy: :dummy( d a t a f i s k [ auxcat ] , i n t=TRUE) )

set . seed (10) # c r e a t e d e s i g n w e i g h t s d a t a f i s k$ f i r s t i n k l 1 <− d a t a f i s k $ l i l l a n f a s 1 / d a t a f i s k $ S t o r a n f a s 1 d a t a f i s k$ f i r s t i n k l 2 r <− d a t a f i s k $Svarandefas2 / d a t a f i s k $ S t o r a n f a s 2 d a t a f i s k$ f i r s t i n k l 2 s <− d a t a f i s k $ l i l l a n f a s 2 / d a t a f i s k $ S t o r a n f a s 2 d a t a f i s k$ f i r s t i n k l s t a r r <− d a t a f i s k $ f i r s t i n k l 1 ∗ d a t a f i s k $ f i r s t i n k l 2 r d a t a f i s k$ f i r s t i n k l s t a r s <− d a t a f i s k $ f i r s t i n k l 1 ∗ d a t a f i s k $ f i r s t i n k l 2 s d a t a f i s k$ d k s t a r r <− 1 / d a t a f i s k $ f i r s t i n k l s t a r r d a t a f i s k$ dk s t ar s <− 1 / d a t a f i s k $ f i r s t i n k l s t a r s d a t a f i s k$random <− runif (nrow( d a t a f i s k ) ,0 ,1)

d a t a f i s k <− d a t a f i s k [ order ( d a t a f i s k $random ) , ] #Sort the data by random v a r i a b l e f o r t h e knn d i s t a n c e measure

# F I s h i n g V a r i a b l e

d a t a f i s k$ f i s k <− i f e l s e ( d a t a f i s k $f2a == 1 , 0 , 1)

# S p l i t t i n g up t h e d e s i g n i n t o e a c h i n d e p e n d e n t s a m p l e s a m p l e l i s t_s <− vector (mode = " l i s t " )

s a m p l e l i s t_s [ [ 1 ] ] <− subset ( datafisk , startomgang==11) s a m p l e l i s t_s [ [ 2 ] ] <− subset ( datafisk , startomgang==12) s a m p l e l i s t_s [ [ 3 ] ] <− subset ( datafisk , startomgang==13) s a m p l e l i s t_s [ [ 4 ] ] <− subset ( datafisk , startomgang==14) s a m p l e l i s t_r <− vector (mode = " l i s t " )

s a m p l e l i s t_r [ [ 1 ] ] <− subset ( datafisk , startomgang==11 & d a t a f i s k $ c a l w e i g h t >=0)

(29)

c a l w e i g h t >=0) t e s t d a t a <− s a m p l e l i s t_r [ [ 4 ] ] c o m p o s i t e_weights <− c ( 0 . 1 3 , 0 . 0 7 , 0 . 0 9 , 0 . 7 1 )

(30)

B

Code for the k-NN

Listing 2: k-NN estimator # i n s t a l l . p a c k a g e s ( " p d i s t " ) library ( p d i s t ) library ( ggplot2 ) library ( gridExtra )

a u x h e l p <− grep ( "_" , names( helpdata ) , value=TRUE) #k <− 5 # S e l e c t t h e k−nearest neighbours k n n l i s t_r <− vector ( ) k n n l i s t_s <− vector ( ) k l i s t <− vector ( ) b i a s l i s t <− vector ( ) #W r i t t e n i n l o o p form i n s t e a d o f a b o v e for ( k in 23) {

# KNN−method f o r the sample set , p r e d i c t i n g the population p o p u l a t i o n_s <− vector (mode = " l i s t " ) h e l p d a t a_s <− vector (mode = " l i s t " ) for ( i in 1 : length ( s a m p l e l i s t_s ) ) { c a l c d a t a <− s a m p l e l i s t_s [ [ i ] ] [ c ( " dkstars " , "CSFVI" ) ] m a t r i x 1 <− s a m p l e l i s t_s [ [ i ] ] [ c ( " a l d e r s l u t " , auxhelp ) ] p o p u l a t i o n_s [ [ i ] ] <− helpdata [ c ( " a l d e r s l u t " , auxhelp ) ] h e l p d a t a_s [ [ i ] ] <− helpdata

d i s t m a t r i x <− as . matrix ( pdist ( matrix1 , population_s [ [ i ] ] ) ) for ( j in 1 :nrow( population_s [ [ i ] ] ) ) {

x <− sort ( distmatrix [ , j ] , method=" quick " , decreasing=FALSE, index . return=TRUE)$ i x [ 1 : k ]

temp <− calcdata [ x , ]

temp$ d i s t <− distmatrix [ x , j ] temp$ d i s t [ temp$ dist <=0] <− 1 #temp$ weight <− 1 / ( temp$ d i s t )^2

#temp$ weight <− 1 # I f t h i s i s a c t i v e t h e n d i s t a n c e w e i g h t i s s e t t o 1

#h e l p d a t a_s [ [ i ] ] $pred [ j ] <− (sum( temp$CSFVI ∗ temp$ d k s t a r s ∗ temp$ weight ) ) / sum( temp$ d k s t a r s ∗ temp$ weight ) # I f t h i s i s

a c t i v e t h e n t h e d e s i g n w e i g h t s a r e u s e d .

h e l p d a t a_s [ [ i ] ] $pred [ j ] <− sum( temp$CSFVI) / k # I f t h i s i s a c t i v e t h e n t h e u n w e i g h t e d k−NN i used .

} }

(31)

# KNN−method f o r the response set , p r e d i c t i n g the population p o p u l a t i o n_r <− vector (mode = " l i s t " ) h e l p d a t a_r <− vector (mode = " l i s t " ) for ( i in 1 : length ( s a m p l e l i s t_r ) ) { c a l c d a t a <− s a m p l e l i s t_r [ [ i ] ] [ c ( " dkstarr " , "CSFVI" ) ] m a t r i x 1 <− s a m p l e l i s t_r [ [ i ] ] [ c ( " a l d e r s l u t " , auxhelp ) ] p o p u l a t i o n_r [ [ i ] ] <− helpdata [ c ( " a l d e r s l u t " , auxhelp ) ] h e l p d a t a_r [ [ i ] ] <− helpdata

d i s t m a t r i x <− as . matrix ( pdist ( matrix1 , population_r [ [ i ] ] ) ) for ( j in 1 :nrow( population_r [ [ i ] ] ) ) {

temp$ d i s t <− distmatrix [ x , j ] temp$ d i s t [ temp$ dist <=0] <− 1 #temp$ weight <− 1 / ( temp$ d i s t )^2 #temp$ weight <− 1

#h e l p d a t a_r [ [ i ] ] $pred [ j ] <− (sum( temp$CSFVI ∗ temp$ d k s t a r r ∗ temp$ weight ) ) / sum( temp$ d k s t a r r ∗ temp$ weight )

h e l p d a t a_r [ [ i ] ] $pred [ j ] <− sum( temp$CSFVI) / k }

}

################### E s t i m a t o r − The estimate f o r the population # p o p u l a t i o n from s a m p l e model

p r e d_pop_s <− l i s t ( )

for ( i in 1 : length ( s a m p l e l i s t_s ) ) {

p r e d_pop_s [ [ i ] ] <− helpdata_s [ [ i ] ] $pred ∗ helpdata_s [ [ i ] ] $ antal }

# p o p u l a t i o n from r e s p o n s e s e t model p r e d_pop_r <− l i s t ( )

for ( i in 1 : length ( s a m p l e l i s t_r ) ) {

p r e d_pop_r [ [ i ] ] <− helpdata_r [ [ i ] ] $pred ∗ helpdata_r [ [ i ] ] $ antal }

# KNN−method f o r the sample set , p r e d i c t i n g the the samples sample_s <− vector (mode = " l i s t " )

c a l c d a t a <− s a m p l e l i s t_s [ [ i ] ] [ c ( " dkstars " , "CSFVI" ) ] m a t r i x 1 <− s a m p l e l i s t_s [ [ i ] ] [ c ( " a l d e r s l u t " , auxhelp ) ] sample_s [ [ i ] ] <− matrix1

(32)

for ( j in 1 :nrow(sample_s [ [ i ] ] ) ) {

temp$ d i s t <− distmatrix [ x , j ] temp$ d i s t [ temp$ dis t <=0] <− 1 #temp$ weight <− 1 / ( temp$ d i s t )^2 #temp$ weight <− 1

#s a m p l e l i s t_s [ [ i ] ] $pred [ j ] <− (sum( temp$CSFVI ∗ temp$ d k s t a r s ∗ temp$ weight ) ) / sum( temp$ d k s t a r s ∗ temp$ weight )

s a m p l e l i s t_s [ [ i ] ] $pred [ j ] <− sum( temp$CSFVI) / k }

}

# KNN−method f o r the sample set , p r e d i c t i n g the the samples sample_r <− vector (mode = " l i s t " )

c a l c d a t a <− s a m p l e l i s t_r [ [ i ] ] [ c ( " dkstarr " , "CSFVI" ) ] m a t r i x 1 <− s a m p l e l i s t_r [ [ i ] ] [ c ( " a l d e r s l u t " , auxhelp ) ] sample_r [ [ i ] ] <− matrix1

d i s t m a t r i x <− as . matrix ( d i s t ( matrix1 , method=" euclidean " ) ) for ( j in 1 :nrow(sample_r [ [ i ] ] ) ) {

temp$ d i s t <− distmatrix [ x , j ] temp$ d i s t [ temp$ dist <=0] <− 1 #temp$ weight <− 1 / ( temp$ d i s t )^2 #temp$ weight <− 1

#s a m p l e l i s t_r [ [ i ] ] $pred [ j ] <− (sum( temp$CSFVI ∗ temp$ d k s t a r r ∗ temp$ weight ) ) / sum( temp$ d k s t a r r ∗ temp$ weight )

s a m p l e l i s t_r [ [ i ] ] $pred [ j ] <− sum( temp$CSFVI) / k }

}

# s a m p l e p a r t o f t h e D i f f e r e n c e a s t i m a t o r from t h e s a m p l e s e t p r e d_sample_s <− l i s t ( )

p r e d_sample_s [ [ i ] ] <− sum( ( s a m p l e l i s t_s [ [ i ] ] $CSFVI− s a m p l e l i s t_s [ [ i ] ]$pred )/ s a m p l e l i s t_s [ [ i ] ] $ f i r s t i n k l s t a r s )

}

# r e s p o n s e p a r t o f t h e D i f f e r e n c e a s t i m a t o r from t h e s a m p l e s e t p r e d_sample_r <− l i s t ( )

(33)

p r e d_sample_r [ [ i ] ] <− sum( ( s a m p l e l i s t_r [ [ i ] ] $CSFVI− s a m p l e l i s t_r [ [ i ] ]$pred )/ s a m p l e l i s t_r [ [ i ] ] $ f i r s t i n k l s t a r r ) } # F i n a l e s t i m a t e o f t h e p o p u l a t i o n t o t a l i n l c u d i n g d i f f e r e n c e e s t i m a t e # Sample s e t t h a t_l i s t_s <− l i s t ( ) for ( i in 1 : length ( s a m p l e l i s t_s ) ) {

t h a t_l i s t_s [ [ i ] ] <− sum( pred_pop_s [ [ i ] ] ) + pred_sample_s [ [ i ] ] }

# Response s e t

t h a t_l i s t_r <− l i s t ( )

t h a t_l i s t_r [ [ i ] ] <− sum( pred_pop_r [ [ i ] ] ) + pred_sample_r [ [ i ] ] }

# Sample s e t t h a t_s <− 0

temp <− that_l i s t_s [ [ i ] ] ∗composite_weights [ [ i ] ] t h a t_s <− that_s + temp

}

# r e s p o n s e s e t t h a t_r <− 0

temp <− that_l i s t_r [ [ i ] ] ∗composite_weights [ [ i ] ] t h a t_r <− that_r + temp

}

# R e l a t i v e B i a s L i n e a r r e g r e s s i o n

b i a s_reg <− ( ( that_r−that_s )/that_s )∗100 k n n l i s t_r [ k ] <− that_r

k n n l i s t_s [ k ] <− that_s b i a s l i s t [ k ] <− bias_reg k l i s t [ k ] <− k

(34)

C

Code for the GREG

Listing 3: GREG estimator

# L i n e a r r e g r e s s i o n −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− # T h i s s c r i p t c r e a t e s a model from l i n e a r r e g r e s s i o n a s i n p u t t o t h e d i f f e r e n c e e s t i m a t o r # D e f i n e t h e r e g r e s s i o n model h e r e # Sample s e t m o d e l l i s t_s <− l i s t ( ) for ( i in 1 : length ( s a m p l e l i s t_s ) ) { m o d e l l i s t_s [ [ i ] ] <− lm(CSFVI~ factor ( kon ) + factor ( u t b i l d n i n g ) + factor ( Region ) + factor ( SStad ) + factor ( f l a n d ) + a l d e r s l u t , data=s a m p l e l i s t_s [ [ i ] ] , weights=( s a m p l e l i s t_s [ [ i ] ] $ dk st ar s ) ) } # Response s e t m o d e l l i s t_r <− l i s t ( ) for ( i in 1 : length ( s a m p l e l i s t_r ) ) { m o d e l l i s t_r [ [ i ] ] <− lm(CSFVI~ factor ( kon ) + factor ( u t b i l d n i n g ) + factor ( Region ) + factor ( SStad ) + factor ( f l a n d ) + a l d e r s l u t , data=s a m p l e l i s t_r [ [ i ] ] , weights=( s a m p l e l i s t_r [ [ i ] ] $ d k s t a r r ) ) }

################### E s t i m a t o r − The estimate f o r the population # p o p u l a t i o n from s a m p l e model

p r e d_pop_s <− l i s t ( )

p r e d_pop_s [ [ i ] ] <− predict ( m o d e l l i s t_s [ [ i ] ] , newdata = helpdata ) ∗ h e l p d a t a$ antal

}

Design-based and Model-assisted estimators using Machine learning methods : Exploring the k-Nearest Neighbor metod applied to data from the Recreational Fishing Survey