
Comparisons of methods for generating conditional Poisson samples and Sampford samples

Anton Grafström

Master’s thesis in Mathematical Statistics

June 2005


Anton Grafström: Jämförelse av metoder för att generera betingade Poissonstickprov och Sampfordstickprov. Institutionen för matematik och matematisk statistik, Umeå universitet, 901 87 Umeå.

© Anton Grafström, 2005

This thesis was written for the degree of filosofie magisterexamen (Master of Science) in Mathematical Statistics. The work was done at D level and comprises 20 credits.

Supervisor: Lennart Bondesson. Examiner: Leif Nilsson.

Anton Grafström: Comparisons of methods for generating conditional Poisson samples and Sampford samples. Department of Mathematics and Mathematical Statistics, Umeå University, 901 87 Umeå.

© Anton Grafström, 2005

This is a thesis for the degree of Master of Science in Mathematical Statistics.

Supervisor: Lennart Bondesson Examiner: Leif Nilsson


Abstract

Methods for conditional Poisson sampling (CP-sampling) and Sampford sampling are compared, with focus on the efficiency of the methods. The efficiency is investigated by simulation in different sampling situations. The comparison is of interest since new methods for both CP-sampling and Sampford sampling were introduced by Bondesson, Traat & Lundqvist in 2004. The new methods are acceptance-rejection methods that use the efficient Pareto sampling method. They are found to be very efficient and useful in all situations. The list sequential methods for both CP-sampling and Sampford sampling are also found to be efficient, especially if many samples are to be generated.

Key words: acceptance rate, acceptance-rejection, conditional Poisson sam- pling, list sequential sampling, Pareto sampling, Sampford sampling, simulation


Contents

1 Introduction
2 Sampling methods
  2.1 Pareto sampling
  2.2 Conditional Poisson Sampling
    2.2.1 CP-reject (method 1)
    2.2.2 CP-with replacement (method 2)
    2.2.3 CP-list sequential (method 3)
    2.2.4 CP-sampling via Pareto sampling (method 4)
  2.3 Sampford Sampling
    2.3.1 Sampford with replacement (method 1)
    2.3.2 Sampford reject (method 2)
    2.3.3 Sampford list sequential (method 3)
    2.3.4 Sampford sampling via Pareto sampling (method 4)
3 Simulation and results
  3.1 Sampling from the MU284 population
  3.2 Results for the MU284 population
  3.3 Sampling from a big population
  3.4 Results for the big population
4 Comparing adapted CP-sampling and Sampford sampling
  4.1 Results for the Sampford-Hajek population
5 Conclusions

References

Appendix A. True and adapted inclusion probabilities
Appendix B. Matlab functions
Appendix C. Results for the Sampford-Hajek population


1 Introduction

In sampling it is often important to use an efficient sampling method; otherwise it may take too long to generate samples. Methods for two types of sampling designs are compared. Both designs are used to get a sample of fixed size n from a population of size N when the inclusion probabilities are unequal. The inclusion probability for a unit in the population is the probability for that unit to be included in a selected sample. The first design is conditional Poisson sampling (CP-sampling), which is a modification of Poisson sampling. Each unit i in the population is included with a given probability p_i, but only samples of size n are accepted. The probability p_i is called the target inclusion probability for unit i. The true inclusion probability for unit i is the probability for unit i to be included in a sample that is accepted. CP-sampling gives true inclusion probabilities that only approximately coincide with the p_i's. The second design is Sampford sampling, which gives true inclusion probabilities that coincide exactly with the given inclusion probabilities. Four methods for CP-sampling and four methods for Sampford sampling are compared. The following questions are of interest: which method is the most efficient, and which method is easiest to implement? Some previous studies in this area have been done. Öhlund compared some methods for CP-sampling in his master's thesis (Öhlund, 1999) and found a list sequential method to be efficient.

Now there exist new methods for both CP-sampling and Sampford sampling, which were introduced by Bondesson, Traat & Lundqvist (2004). These new methods use Pareto sampling, which was introduced by Rosén (1997a,b). The methods are acceptance-rejection (A-R) methods and they use the fact that the Pareto sampling design is very close to the design of both CP-sampling and Sampford sampling. A Pareto sample, which is rapidly generated, can be adjusted to become a true CP-sample or a true Sampford sample by the use of an A-R filter. It is believed that the new methods will be more efficient than the older methods and that they will work well in all situations.

In section 2 there is a description of each of the sampling methods. Then in section 3, the methods are tested by simulation in some different sampling situations. Adapted CP-sampling and Sampford sampling are compared in section 4. Adapted CP-sampling is CP-sampling with adapted p_i's to get correct inclusion probabilities. We believe that the multivariate distributions of the inclusion vector for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples. The conclusions are presented in section 5.


2 Sampling methods

The sampling methods are described in this section. We start with the Pareto sampling method since it will be used in the new methods later.

2.1 Pareto sampling

Pareto sampling (Rosén, 1997a,b) is used to select a sample of fixed size n from a population of size N. Let λ_i be the given target inclusion probability for unit i, with Σ_{i=1}^N λ_i = n. The method works as follows.

Generate U_1, U_2, ..., U_N, where the U_i are independent U(0,1) variables. Then calculate the Pareto ranking variables

    Q_i = [U_i/(1−U_i)] / [λ_i/(1−λ_i)]

for each unit. Select the n units with the smallest Q-values as a Pareto sample of fixed size n. The true inclusion probabilities will be approximately λ_i.
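The whole procedure fits in a few lines. The thesis's own implementations are Matlab functions (appendix B); the following Python sketch, with the hypothetical name `pareto_sample`, is only meant to mirror the steps above:

```python
import numpy as np

def pareto_sample(lam, n, rng=None):
    """Draw one Pareto sample of fixed size n.

    lam -- array of target inclusion probabilities with sum(lam) = n.
    Returns the indices of the n selected units.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = np.asarray(lam, dtype=float)
    u = rng.uniform(size=lam.size)                  # independent U(0,1)
    # Pareto ranking variables Q_i = [U_i/(1-U_i)] / [lam_i/(1-lam_i)]
    q = (u / (1.0 - u)) / (lam / (1.0 - lam))
    return np.argsort(q)[:n]                        # n smallest Q-values
```

Note that no rejection step is needed: every call returns a sample of exactly n units.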

2.2 Conditional Poisson Sampling

CP-sampling (Hajek, 1981) is used to select a sample of a fixed size n from a population of size N. Let p_i be the given target inclusion probability for unit i, i = 1, ..., N. Each unit i in the population is included with probability p_i, but only samples of size n are accepted. Usually it is assumed that Σ_{i=1}^N p_i = n, since that maximizes the probability of getting samples of size n. The assumption Σ_{i=1}^N p_i = n is not restrictive: if it is not satisfied, the p_i's can be transformed to satisfy that condition (Hajek, 1981, p. 66, Broström and Nilsson, 2000). When using CP-sampling, the true inclusion probabilities will only be approximately p_i. However, there is a possibility to adapt the p_i's to obtain correct inclusion probabilities (Dupacova, 1979). A new and fast method for calculating the adapted p_i's, which was introduced in a revised version of Bondesson, Traat & Lundqvist (2004), can be found in appendix A. When using adapted p_i's the design will be called adapted CP-sampling.


2.2.1 CP-reject (method 1)

This method for CP-sampling can be found in Hajek (1981). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Also, let I_i be independent Be(p_i) distributed inclusion variables (i.e. Bernoulli distributed with success probability p_i). Then unit i is included in the sample if I_i = 1. Simulate I_i for i = 1, ..., N and accept the sample as a CP-sample if Σ_{i=1}^N I_i = n. Repeat the procedure until a sample is accepted. The acceptance rate using this method is

    AR_CP1 = P(Σ_{i=1}^N I_i = n) ∼ [2π Σ_{i=1}^N p_i(1−p_i)]^(−1/2)   as Σ_{i=1}^N p_i(1−p_i) → ∞.

A proof of this can be found in Hajek (1981, p. 68).
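This rejective procedure can be sketched directly (a Python sketch with our own function name `cp_reject`; the thesis's implementation is a Matlab function):

```python
import numpy as np

def cp_reject(p, n, rng=None):
    """CP-method 1: simulate independent Bernoulli(p_i) inclusion
    indicators and accept the outcome only when exactly n units
    are included."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    while True:
        I = rng.uniform(size=p.size) < p     # I_i = 1 with probability p_i
        if I.sum() == n:                     # accept only samples of size n
            return I.astype(int)
```

The expected number of iterations is 1/AR_CP1, which is why the method degrades for large populations.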

2.2.2 CP-with replacement (method 2)

This second method for CP-sampling can also be found in Hajek (1981). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Draw n units with replacement, where unit i is drawn with probability p̄_i ∝ p_i/(1−p_i) and Σ_{i=1}^N p̄_i = 1, i.e.

    p̄_i = c · p_i/(1−p_i),   where c^(−1) = Σ_{i=1}^N p_i/(1−p_i).

If all n units are distinct, the sample is accepted as a CP-sample. Let A be the event that all n drawn units are distinct; the acceptance rate for this method is then

    AR_CP2 = P(A) ∼ n! [Σ_{i=1}^N p_i/(1−p_i)]^(−n) [Π_{i=1}^N (1−p_i)]^(−1) [2π Σ_{i=1}^N p_i(1−p_i)]^(−1/2)

as Σ_{i=1}^N p_i(1−p_i) → ∞. A proof of this can be found in Hajek (1981, pp. 70-71).
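A Python sketch of this with-replacement scheme (function name `cp_with_replacement` is ours; the thesis uses Matlab):

```python
import numpy as np

def cp_with_replacement(p, n, rng=None):
    """CP-method 2: draw n units with replacement with probabilities
    pbar_i proportional to p_i/(1-p_i); accept the draw only if all
    n units are distinct (the event A)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    pbar = p / (1.0 - p)
    pbar /= pbar.sum()                        # the constant c normalizes pbar
    while True:
        draw = rng.choice(p.size, size=n, replace=True, p=pbar)
        if np.unique(draw).size == n:         # all distinct: accept
            return np.sort(draw)
```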


2.2.3 CP-list sequential (method 3)

This is the list sequential method that Öhlund (1999) found to be efficient. The method can also be found in Traat, Bondesson & Meister (2004). The method uses the definition of conditional probability. Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Also, let I_i be independent Be(p_i) distributed random inclusion variables. Then the inclusion variables I_i can be successively generated from the conditional distributions

    P(I_i = x | Σ_{j=i}^N I_j = n − n_{i−1}),   x = 0, 1,

where n_{i−1} = Σ_{j=0}^{i−1} I_j and I_0 = 0; thus we always get a sample of size n. The conditional probabilities can be written as

    P(I_i = 1 | Σ_{j=i}^N I_j = n − n_{i−1}) = P(I_i = 1) · P(Σ_{j=i+1}^N I_j = n − n_{i−1} − 1) / P(Σ_{j=i}^N I_j = n − n_{i−1})

for i = 1, 2, ..., N−1. For i = N there are only two cases: if Σ_{j=1}^{N−1} I_j = n − 1 then P(I_N = 1) = 1, and if Σ_{j=1}^{N−1} I_j = n then P(I_N = 1) = 0. In order to use this formula, one first has to calculate the probabilities P(Σ_{j=i}^N I_j = k) for all i = 1, 2, ..., N and, for each i, for k from max(0, n − (i − 1)) to min(N − i + 1, n).

This can be done with the following recursive formula.

(1) First calculate P(Σ_{j=N}^N I_j = 0) = 1 − p_N and P(Σ_{j=N}^N I_j = 1) = p_N.

(2) For i = N − 1 down to i = 1 calculate

    P(Σ_{j=i}^N I_j = k) = p_i · P(Σ_{j=i+1}^N I_j = k − 1) + (1 − p_i) · P(Σ_{j=i+1}^N I_j = k)   if k > 0,

and

    P(Σ_{j=i}^N I_j = k) = (1 − p_i) · P(Σ_{j=i+1}^N I_j = k)   if k = 0.

In each step i, the calculations are done for k from max(0, n − (i − 1)) to min(N − i + 1, n). The maximum number of these probabilities that has to be calculated is N²/4 + N. Fortunately, these probabilities need only be calculated once; they can then be used to generate as many samples as needed. The calculation may still be too time-consuming if N and n are large. It is then possible to calculate only some of the probabilities exactly and use normal approximations for the rest of them.
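A Python sketch of the exact variant (recursive table plus sequential generation) may make the indexing easier to follow. For simplicity it stores the full table of P(S_i = k) rather than only the restricted k-range, and the function names are ours:

```python
import numpy as np

def cp_sum_probs(p):
    """Table of P(S_i = k), S_i = I_i + ... + I_N, computed with the
    recursion above; entry [i, k].  The full table is kept instead of
    only k = max(0, n-(i-1)), ..., min(N-i+1, n)."""
    p = np.asarray(p, dtype=float)
    N = p.size
    P = np.zeros((N + 2, N + 1))
    P[N + 1, 0] = 1.0                        # empty sum equals 0
    for i in range(N, 0, -1):                # i = N, N-1, ..., 1
        P[i, 0] = (1 - p[i - 1]) * P[i + 1, 0]
        for k in range(1, N - i + 2):
            P[i, k] = p[i - 1] * P[i + 1, k - 1] + (1 - p[i - 1]) * P[i + 1, k]
    return P

def cp_list_sequential(p, n, P, rng=None):
    """Generate one CP-sample of size n from a precomputed table P."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    N = p.size
    I = np.zeros(N, dtype=int)
    taken = 0
    for i in range(1, N + 1):
        need = n - taken                     # units still to select
        if need == 0:
            break
        # P(I_i = 1 | S_i = need) = p_i P(S_{i+1} = need-1) / P(S_i = need)
        prob = p[i - 1] * P[i + 1, need - 1] / P[i, need]
        if rng.uniform() < prob:
            I[i - 1] = 1
            taken += 1
    return I
```

Once the table is built, each additional sample costs only one pass over the list, which is why this method shines when many samples are needed.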

Let S_i = Σ_{j=i}^N I_j; then

    P(S_i = k) ≈ (1/(√(2π) σ_i)) exp{−(k − µ_i)²/(2σ_i²)},   1 ≤ i ≤ N,

where µ_i = Σ_{j=i}^N p_j and σ_i² = Σ_{j=i}^N p_j(1−p_j). This approximation works well if the p_j's are close to 0.5. If the p_j's are ordered so that

    |p_1 − 0.5| ≤ |p_2 − 0.5| ≤ ... ≤ |p_N − 0.5|,

then the probabilities P(S_1 = k), ..., P(S_m = k) can be approximated and P(S_{m+1} = k), ..., P(S_N = k) can be calculated exactly.

Example 1 (Normal approximation of probabilities). Let S = Σ_{i=1}^{20} I_i, where the I_i are independent and Be(1/4) distributed, i.e. S ∈ Bin(20, 1/4). The true probabilities P(S = k) for k = 0, 1, ..., 20 can be approximated by the value of the probability density function of a normally distributed random variable with µ = Σ_{i=1}^{20} E(I_i) = 5 and σ² = Σ_{i=1}^{20} Var(I_i) = 3.75. We see in figure 1 that the approximation is rather close even for this small example.

[Figure 1: plot of P(S = k) against k, for k = 0, ..., 20.]

Figure 1: Example of normal approximation of probabilities. The true probabilities are represented by dots and approximated by the value of the probability density function of a normally distributed random variable.
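The example can be checked numerically; `binom_pmf` and `normal_pdf` below are our own helper names:

```python
import math

# Exact Bin(20, 1/4) probabilities versus the normal density with
# mu = 5 and sigma^2 = 3.75, as in example 1.
n_trials, p = 20, 0.25
mu = n_trials * p                      # 5
var = n_trials * p * (1 - p)           # 3.75

def binom_pmf(k):
    return math.comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)

def normal_pdf(k):
    return math.exp(-(k - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# print a few values around the mode to compare the two
for k in range(3, 8):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))
```

At the mode k = 5 the exact probability is about 0.2023 and the normal density about 0.2060, consistent with the figure.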


2.2.4 CP-sampling via Pareto sampling (method 4)

This is the new method for CP-sampling that was introduced by Bondesson, Traat & Lundqvist (2004). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. First a Pareto sample is generated with λ_i = p_i, i = 1, ..., N. Then the Pareto sample is either rejected or accepted as a CP-sample using the probability functions for the Pareto and CP designs. Let I = (I_1, I_2, ..., I_N) be the vector of random inclusion variables, i.e. I_i ∈ {0, 1}, and if I_i = 1 then unit i is sampled. Also, let |I| = Σ_{i=1}^N I_i = n be the sample size. The probability functions p(x) = P(I = x) for the designs can then be written as

    p_CP(x) = C_CP Π p_i^(x_i) (1−p_i)^(1−x_i),   |x| = n,

and, if we set λ_i = p_i,

    p_Par(x) = Π p_i^(x_i) (1−p_i)^(1−x_i) × Σ c_k x_k,   |x| = n,

where

    c_0 = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) dx   and   c_k = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) · 1/(1+τ_k x) dx,

with τ_i = p_i/(1−p_i). The sums and products are taken over the integers 1, 2, ..., N. The constant C_CP is found from the normalizing condition Σ_{x:|x|=n} p(x) = 1. We also have C_CP ≈ √(2πd) for large values of d = Σ p_i(1−p_i). The c_k's can be calculated exactly or approximated using Laplace approximations. One Laplace approximation of the c_k's is

    c_k ≈ c*_k = ∫_{−∞}^∞ exp{−log(1+τ_k) − y τ_k/(1+τ_k) − y²/(2σ_k²)} dy = (1−p_k) √(2π) σ_k exp{σ_k² p_k²/2},

where σ_k² = 1/(d + p_k(1−p_k)). This approximation can be calibrated to be better as follows:

    c*(cal)_k = (N−n) c*_k / (Σ_i c*_i) · c*_0 = (1−p_k)(N−n) σ_k exp{σ_k² p_k²/2} / [Σ_i (1−p_i) σ_i exp{σ_i² p_i²/2}] · c*_0,

where c_0 can be calculated exactly or approximated by c*_0 = √(2π/d). See Bondesson, Traat & Lundqvist (2004) for a more thorough description of these Laplace approximations.

Now let us consider when we can accept a Pareto sample as a CP-sample. Let p_1(·) and p_2(·) be two probability functions. If there exists a constant B such that p_1(x) ≤ B p_2(x) for all x, then a sample from p_2(·) can be generated and accepted as a sample from p_1(·) if U ≤ p_1(x)/(B p_2(x)), where U ∈ U(0,1). The A-R technique can be found in Ross (2002, pp. 53-54).

If p_1(·) = p_CP(·) without C_CP and p_2(·) = p_Par(·), then the constant B must be chosen so that 1 ≤ B Σ c_k x_k for all x. If the probabilities p_i, i = 1, ..., N, are given in increasing order, then the c_k's will decrease. The best choice of B will be B^(−1) = Σ_{k=m}^N c_k, where m = N − n + 1.

The conditional acceptance rate for accepting a Pareto sample as a CP-sample is

    CAR(x) = 1 / (B Σ c_k x_k).

Thus a generated Pareto sample with λ_i = p_i will be accepted as a CP-sample if U ≤ CAR(x), where U ∈ U(0,1). The unconditional acceptance rate will be

    AR = Σ_{x:|x|=n} CAR(x) p_Par(x) ≈ (1/d) Σ_{k=m}^N (1−p_k) ≥ [1/(1−n/N)] · (1/n) Σ_{k=m}^N (1−p_k),

where m = N − n + 1 and d = Σ p_i(1−p_i). See Bondesson, Traat & Lundqvist (2004) for details.
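Putting the pieces together, a Python sketch of CP-method 4 with the (uncalibrated) Laplace approximation of the c_k's might look as follows. It assumes the p_i are given in increasing order, as required for the choice of B, and all names are ours:

```python
import numpy as np

def cp_via_pareto(p, n, rng=None):
    """CP-method 4 sketch.  p must be sorted in increasing order so that
    the Laplace-approximated c_k decrease and B^{-1} = c_m + ... + c_N
    with m = N - n + 1."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    N = p.size
    d = np.sum(p * (1 - p))
    sig2 = 1.0 / (d + p * (1 - p))               # sigma_k^2
    # c*_k = (1 - p_k) sqrt(2 pi) sigma_k exp(sigma_k^2 p_k^2 / 2)
    c = (1 - p) * np.sqrt(2 * np.pi * sig2) * np.exp(sig2 * p**2 / 2)
    Binv = c[N - n:].sum()                       # sum of the n smallest c_k
    while True:
        u = rng.uniform(size=N)
        q = (u / (1 - u)) / (p / (1 - p))        # Pareto ranking variables
        x = np.zeros(N, dtype=int)
        x[np.argsort(q)[:n]] = 1                 # Pareto sample
        if rng.uniform() < Binv / np.dot(c, x):  # CAR(x) = 1/(B sum c_k x_k)
            return x
```

Since CAR(x) is close to 1 for most samples, few Pareto samples are wasted, which is the source of the method's efficiency.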

2.3 Sampford Sampling

Sampford sampling is also used to select a sample of size n from a population of size N. Let π_i be the given inclusion probability for unit i, with Σ_{i=1}^N π_i = n. When using Sampford sampling the true inclusion probabilities will be exactly π_i, which is a big advantage of this design. The probability function for the Sampford design can be written as

    p_Sampf(x) = C_S Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ (1−π_k) x_k,   |x| = n.

The Sampford sampling design is very close to the Pareto sampling design, even closer than the CP-sampling design is.

2.3.1 Sampford with replacement (method 1)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. The remaining n − 1 units are drawn with replacement according to the probabilities p̄_i ∝ π_i/(1−π_i), where Σ_{i=1}^N p̄_i = 1. If all n units are unique the sample is accepted; if not, the procedure is repeated from the beginning. The approximative acceptance rate for this method can be derived by conditioning on the first selected unit and then using an approximation similar to Hajek's acceptance rate for CP-method 2. Let p̄^(i)_k be the drawing probabilities given that i is the first selected unit and unit i is not included again. Then

    p̄^(i)_k = p̄_k / (1 − p̄_i)   and   Σ_{k:k≠i} p̄^(i)_k = 1   for all i.

The corresponding target inclusion probabilities π^(i)_k (k ≠ i) need to be calculated for all i such that p̄^(i)_k ∝ π^(i)_k/(1−π^(i)_k) and Σ_{k:k≠i} π^(i)_k = n − 1. The calculation of the π^(i)_k's can be done with some numerical method like Newton-Raphson. Then the approximative acceptance rate can be written as

    AR_S1 ∼ Σ_{i=1}^N (n−1)! [Σ_{k:k≠i} π^(i)_k/(1−π^(i)_k)]^(−(n−1)) [Π_{k:k≠i} (1−π^(i)_k)]^(−1) [2π Σ_{k:k≠i} π^(i)_k(1−π^(i)_k)]^(−1/2) · (1−p̄_i)^(n−1) · π_i/n

as Σ_{i=1}^N π_i(1−π_i) → ∞.
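The drawing scheme itself (without the acceptance-rate calculation, which would need the Newton-Raphson step for the π^(i)_k) can be sketched in Python as follows; the function name is ours:

```python
import numpy as np

def sampford_with_replacement(pi, n, rng=None):
    """Sampford method 1 sketch: first unit drawn with probabilities
    pi_i/n, the remaining n-1 units with probabilities proportional to
    pi_i/(1-pi_i), all with replacement; accept only if all n drawn
    units are distinct."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    pbar = pi / (1 - pi)
    pbar /= pbar.sum()
    while True:
        first = rng.choice(pi.size, p=pi / n)
        rest = rng.choice(pi.size, size=n - 1, replace=True, p=pbar)
        units = np.append(rest, first)
        if np.unique(units).size == n:
            return np.sort(units)
```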

2.3.2 Sampford reject (method 2)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. Let unit j be the first selected unit. Then independent I_i ∈ Be(π_i) are generated, i = 1, 2, ..., N. If Σ_{i=1}^N I_i = n − 1 and I_j = 0, then the sample consisting of j and {i : I_i = 1} is accepted. Otherwise the procedure is repeated from the beginning. The probabilities

    π̃_i = π_i / (1 + a·(1−π_i)),   where a = 1 / Σ_{i=1}^N π_i(1−π_i),

can be used in the second step to increase the acceptance rate. Then Σ_{i=1}^N π̃_i ≈ n − 1. The approximative acceptance rate for this method can be derived by conditioning on the first selected unit and then using an approximation similar to Hajek's acceptance rate for CP-method 1. We have

    AR_S2 ∼ Σ_{i=1}^N [2π Σ_{j:j≠i} π_j(1−π_j)]^(−1/2) exp{−(n − 1 − Σ_{j:j≠i} π_j)² / (2 Σ_{j:j≠i} π_j(1−π_j))} · (1−π_i) · π_i/n

as Σ_{i=1}^N π_i(1−π_i) → ∞.
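A Python sketch of this procedure, using the rescaled probabilities π̃_i in the Bernoulli step (function name ours):

```python
import numpy as np

def sampford_reject(pi, n, rng=None):
    """Sampford method 2 sketch.  First unit j drawn with probabilities
    pi_i/n; then independent Bernoulli trials with the rescaled
    probabilities pi_i / (1 + a(1 - pi_i)), a = 1/sum(pi_i(1 - pi_i));
    accept if exactly n-1 successes and I_j = 0."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    a = 1.0 / np.sum(pi * (1 - pi))
    pt = pi / (1 + a * (1 - pi))       # sum(pt) is approximately n - 1
    while True:
        j = rng.choice(pi.size, p=pi / n)
        I = rng.uniform(size=pi.size) < pt
        if I.sum() == n - 1 and not I[j]:
            I[j] = True                # final sample: j plus the successes
            return np.flatnonzero(I)
```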


2.3.3 Sampford list sequential (method 3)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. Let unit j be the first selected unit. Then the CP-list sequential method is used to select the remaining n − 1 units from the whole population (including unit j), given that Σ_{i=1}^N I_i = n − 1. If unit j is selected again, the procedure is repeated from the beginning. This means that the necessary probabilities for sums are calculated only once; the procedure can then be repeated until a sample of n unique units is found (or until several samples are found).

2.3.4 Sampford sampling via Pareto sampling (method 4)

This is the new method for Sampford sampling that was introduced by Bondesson, Traat & Lundqvist (2004). The method, which is similar to CP-sampling via Pareto sampling, works as follows. First a Pareto sample is generated with λ_i = π_i, i = 1, ..., N. Then the sample is either rejected or accepted as a Sampford sample using the probability functions for the Pareto and Sampford designs. The probability functions for the designs can be written as

    p_Sampf(x) = C_S Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ (1−π_k) x_k,   |x| = n,

and, if we set λ_i = π_i,

    p_Par(x) = Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ c_k x_k,   |x| = n,

where

    c_k = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) · 1/(1+τ_k x) dx,

with τ_i = π_i/(1−π_i). The products and sums are taken over the integers 1, 2, ..., N, and the constant C_S is found from the normalizing condition Σ_{x:|x|=n} p(x) = 1. The c_k's can be calculated exactly, or we can use the same Laplace approximations for the c_k's as described for CP-sampling via Pareto sampling.

To decide whether we shall accept the generated Pareto sample as a Sampford sample, we can use the same method as for CP-sampling via Pareto sampling and let p_1(·) and p_2(·) be two probability functions. If there exists a constant B such that p_1(x) ≤ B p_2(x) for all x, then a sample from p_2(·) can be generated and accepted as a sample from p_1(·) if U ≤ p_1(x)/(B p_2(x)), where U ∈ U(0,1).

In this case we let p_1(·) be p_Sampf(·) without C_S and p_2(·) be p_Par(·). Then we must find B such that

    Σ (1−π_k) x_k ≤ B Σ c_k x_k   for all x, |x| = n.

This holds if 1 − π_k ≤ B c_k for all k, so B can be chosen as

    B = max_k {(1−π_k)/c_k} = 1 / min_k {c_k/(1−π_k)}.

The conditional acceptance rate will be

    CAR(x) = Σ (1−π_k) x_k / (B Σ c_k x_k).

Thus we will accept a Pareto sample as a Sampford sample if U ≤ CAR(x), where U ∈ U(0,1). The unconditional acceptance rate will be

    AR = Σ_{x:|x|=n} CAR(x) p_Par(x).

If we use the Laplace approximation c*(cal)_k of the c_k's and also use the approximation C_S ≈ √(2π/d), where d = Σ π_i(1−π_i), then the acceptance rate can be written as

    AR_Lapl = (N−n) min_k {σ_k exp{σ_k² π_k²/2}} / Σ_i (1−π_i) σ_i exp{σ_i² π_i²/2},

where σ_k² = 1/(d + π_k(1−π_k)). See Bondesson, Traat & Lundqvist (2004) for details.
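A Python sketch analogous to the CP case; again the c_k's are Laplace-approximated and all names are ours:

```python
import numpy as np

def sampford_via_pareto(pi, n, rng=None):
    """Sampford method 4 sketch with Laplace-approximated c_k:
    c*_k = (1-pi_k) sqrt(2 pi) sigma_k exp(sigma_k^2 pi_k^2 / 2),
    B = max_k (1-pi_k)/c_k,
    CAR(x) = sum_k (1-pi_k) x_k / (B sum_k c_k x_k)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    N = pi.size
    d = np.sum(pi * (1 - pi))
    sig2 = 1.0 / (d + pi * (1 - pi))
    c = (1 - pi) * np.sqrt(2 * np.pi * sig2) * np.exp(sig2 * pi**2 / 2)
    B = np.max((1 - pi) / c)
    while True:
        u = rng.uniform(size=N)
        q = (u / (1 - u)) / (pi / (1 - pi))      # Pareto ranking variables
        x = np.zeros(N, dtype=int)
        x[np.argsort(q)[:n]] = 1                 # Pareto sample
        car = np.dot(1 - pi, x) / (B * np.dot(c, x))
        if rng.uniform() < car:
            return x
```

Since 1 − π_k ≤ B c_k holds elementwise, CAR(x) ≤ 1 for every sample, so the accept step is well defined.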

3 Simulation and results

The sampling methods have been implemented as Matlab functions, which can be found in appendix B. For CP-sampling via Pareto sampling and Sampford sampling via Pareto sampling we have used the calibrated Laplace approximation for the calculation of the c_k's. In the list sequential methods (CP-method 3 and Sampford method 3) all necessary probabilities for sums are calculated exactly. This is done with the recursive formula, which can be found as a separate Matlab function in the appendix.

The methods are first tested on a relatively small population with two different sample sizes. Then a larger population is used, where the differences are more apparent.


3.1 Sampling from the MU284 population

The population that consists of the 284 municipalities of Sweden is called the MU284 population and can be found in Särndal, Swensson & Wretman (1992, pp. 652-659). We use the variable P85, which is the population size of a municipality in the year 1985. Sampling is performed proportional to the population size (P85) of each municipality. The sample sizes n = 10 and n = 50 are used. For each sample size, 1000 samples were generated with each method. For n = 50, four municipalities are always included when sampling proportional to size; the actual sample size then becomes 46. Also, for n = 50, Sampford method 1 and CP-method 2 are excluded since their acceptance rates are too low. As an example, the acceptance rate for CP-method 2 is approximately AR_CP2 ≈ 0.0000016, which means that on average more than 625000 samples have to be generated before one sample is accepted. It is superfluous to include these methods since they are not comparable with the other methods and the simulation would take a long time.

A method that is slow to generate one sample can be very fast in generating many samples if the preliminary calculations are time-consuming. Therefore, the simulation is split into two cases.

Case 1: We are interested in the mean time to get one sample and we want to estimate the acceptance rate. In this case all preliminary calculations are repeated each time for those methods which require such calculations. All necessary probabilities for sums in the list sequential methods are calculated exactly. The population is relatively small, so there are not too many probabilities to calculate.

Table 1 shows the mean times to generate one sample and the acceptance rate.

Case 2: If we know that we need to generate many samples from the population, then it is not necessary to repeat the preliminary calculations each time. Here we are interested in the mean time to generate one sample when the preliminary calculations are already performed. The times for the preliminary calculations are also of interest. Only methods with a significant time for preliminary calculations are included in case 2, i.e. CP-method 3, CP-method 4, Sampford method 3 and Sampford method 4. For the remaining methods, the results are the same as in case 1. Table 2 contains the results for case 2.


3.2 Results for the MU284 population

Table 1: Results for the MU284 population (case 1). The mean time to get one sample and the standard error of the mean time are in milliseconds. The lowest mean time in each category is *-marked. ÂR_Sim is the acceptance rate in this simulation and ÂR is the approximate theoretical acceptance rate, calculated with the formulas presented in section 2.

Method   n    Mean time   SE mean time   ÂR_Sim   ÂR
CP 1     10   1.20        0.10           0.136    0.136
CP 2     10   4.43        0.18           0.127    0.133
CP 3     10   1.05        0.14           1        1
CP 4     10   0.99*       0.10           0.867    0.868
S 1      10   3.80        0.16           0.157    0.159
S 2      10   1.66        0.12           0.123    0.124
S 3      10   1.13        0.11           0.870    -
S 4      10   0.95*       0.09           0.995    0.994
CP 1     50   8.73        0.30           0.067    0.071
CP 2     50   -           -              -        -
CP 3     50   2.52        0.14           1        1
CP 4     50   1.12*       0.10           0.792    0.785
S 1      50   -           -              -        -
S 2      50   10.33       0.32           0.051    0.050
S 3      50   3.09        0.16           0.696    -
S 4      50   0.99*       0.10           1        0.999

We start by looking at the results for case 1, which can be found in table 1. For the smaller sample size n = 10, there is not a big difference between the methods. CP-method 4 seems to be the most efficient of the CP-methods and Sampford method 4 seems to be the most efficient of the Sampford sampling methods. For n = 50, CP-method 2 and Sampford method 1 were excluded since their acceptance rates were too low. The times for the preliminary calculations in CP-method 3 and Sampford method 3 (the list sequential methods) are starting to show as an increased mean time. We can see that the mean times for CP-method 4 and Sampford method 4 are rather independent of the sample size n. We also see that the approximate theoretical acceptance rates are very close to the acceptance rates in this simulation.

In case 2 for the MU284 population (table 2), when the preliminary calculations are already performed, we see that CP-method 3 has a lower mean time than CP-method 4 for both n = 10 and n = 50. For the Sampford sampling methods (Sampford method 3 and Sampford method 4) there is not a significant difference: for n = 10, Sampford method 3 is the fastest, and for n = 50 Sampford method 4 is the fastest of the two.


Table 2: Results for the MU284 population (case 2). This table contains the times for the preliminary calculations and the mean times to generate one sample after such calculations are performed. The lowest mean time in each category is *-marked. All times are in milliseconds.

Method   n    Prel. calc.   Mean time   SE mean time
CP 3     10   0.76          0.19*       0.05
CP 4     10   0.44          0.61        0.08
S 3      10   0.76          0.50*       0.08
S 4      10   0.39          0.60        0.08
CP 3     50   2.33          0.30*       0.05
CP 4     50   0.53          0.64        0.08
S 3      50   2.25          0.63        0.08
S 4      50   0.42          0.57*       0.07

3.3 Sampling from a big population

To really demonstrate the differences in efficiency we need a large population and a large sample. In this example N = 10000 will be the population size and n = 2000 will be the sample size. Let the target inclusion probabilities for CP-sampling and the inclusion probabilities for Sampford sampling be p1 = 0.1, p2 = 0.15, p3 = 0.2, p4 = 0.25, p5 = 0.3, where each p-value is used for 2000 units.

The acceptance rates for CP-method 2 and Sampford method 1 are too low for these methods to be used in this example. For each of the other methods, 100 samples were generated. As before, the simulation is divided into two cases, since a method that is slow to generate one sample can be very fast in generating many samples if the preliminary calculations are time-consuming.

Case 1: In this case all preliminary calculations are repeated each time. Since the list sequential methods have a lot of probabilities that need to be calculated, they would be slow if all probabilities were calculated exactly each time. To make them work faster, normal approximation of the probabilities for sums of more than 1000 variables has been used¹; probabilities for sums of 1000 variables or less have been calculated exactly. This makes these two methods quite fast and very useful, even for larger populations and samples. For CP-method 4 and Sampford method 4 we used the calibrated Laplace approximation for the calculation of the c_k's, recalculated each time. Table 3 shows the mean times to get one sample and also the acceptance rates.

¹The Matlab functions for these variants of the methods are not included in the appendix.


Case 2: We are interested in the mean time to generate one sample when the preliminary calculations are already performed, and also in the time of the preliminary calculations. In this case all probabilities for sums in the list sequential methods have been calculated exactly, since they only need to be calculated once.

For CP-method 4 and Sampford method 4 we still used the calibrated Laplace approximation for the calculation of the c_k's, but now they are calculated only once. Methods with a significant time for preliminary calculations are included in case 2; for the remaining methods, the results are the same as in case 1. Table 4 contains the results for case 2.

3.4 Results for the big population

Table 3: Results for the big population (case 1). The mean time to get one sample and the standard error of the mean time are in milliseconds. The lowest mean time in each category is *-marked. ÂR_Sim is the acceptance rate in this simulation and ÂR is the approximate theoretical acceptance rate, calculated with the formulas presented in section 2.

Method   Mean time   SE mean time   ÂR_Sim   ÂR
CP 1     8740        937            0.009    0.010
CP 2     -           -              -        -
CP 3     693         2              1        1
CP 4     30*         1              0.893    0.903
S 1      -           -              -        -
S 2      8377        884            0.008    0.008
S 3      810         22             0.746    -
S 4      27*         1              1        1.000

The results for the big population and case 1, where the preliminary calculations are repeated each time, can be found in table 3. CP-method 4 is approximately 20 times faster than CP-method 3, even though CP-method 3 has been sped up with normal approximation of the probabilities for sums of more than 1000 variables. CP-method 4 is also about 300 times faster than CP-method 1. The results are similar for the Sampford methods, where Sampford method 4 is about 30 times faster than Sampford method 3 and 300 times faster than Sampford method 2.

If we turn to case 2 for the big population, which can be found in table 4, we can see that the changes are more significant than in the smaller MU284 population.

Now the list sequential methods have the lowest mean times. A simple calculation shows that if more than 600 CP-samples are to be generated, then CP-method 3 is the most efficient (i.e. has the least total time); otherwise CP-method 4 is the most efficient. Again, the results for the Sampford sampling methods are similar: if more than 800 samples are to be generated, then Sampford method 3 is the most efficient; otherwise Sampford method 4 is the most efficient.

Table 4: Results for the big population (case 2). This table contains the times for the preliminary calculations and the mean times to generate one sample after such calculations have been performed. The lowest mean time in each category is ∗-marked. All times are in milliseconds.

Method   Prel. calc.   Mean time   SE mean time
CP 3     6631          4*          1
CP 4     80            15          1
S 3      6831          6*          1
S 4      70            14          1

Remark 1. The results for the list sequential methods in table 3 and table 4 cannot be compared, since approximation was used in table 3. When using approximation there are far fewer probabilities to calculate in the preliminary calculations.

4 Comparing adapted CP-sampling and Sampford sampling

The content of this section is a deviation from the main topic of this work. We compare adapted CP-sampling and Sampford sampling. Adapted CP-sampling is CP-sampling with adapted p_i's to get correct inclusion probabilities π_i. In this section we regard I and all other vectors as column vectors. It is suspected that the multivariate distributions of the inclusion vector I for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples. To see if this is the case, we compare the distributions of some linear combinations a^T I for adapted CP-sampling and Sampford sampling. This is done for the small Sampford-Hajek population with N = 10 and n = 5 (Sampford, 1967, Hajek, 1981, p. 90). It can be mentioned that the well-known Horvitz-Thompson estimator is

    Ŷ_HT = Σ_{i=1}^N (Y_i/π_i) I_i,

which is a linear form and can be written as Σ_{i=1}^N a_i I_i = a^T I.


Let Σ_CP and Σ_S be the covariance matrices of I for adapted CP-sampling and Sampford sampling. Then the variance of a^T I can be written as Var_CP(a^T I) = a^T Σ_CP a for adapted CP-sampling and Var_S(a^T I) = a^T Σ_S a for Sampford sampling. The matrices Σ_CP and Σ_S both have diagonal elements π_i(1−π_i), so the sum of the eigenvalues λ_i of both Σ_CP and Σ_S is Σ_{i=1}^N π_i(1−π_i). This implies that the variance of a^T I for adapted CP-sampling is not uniformly smaller than the same variance for Sampford sampling, and vice versa.

4.1 Results for the Sampford-Hajek population

Let N = 10, n = 5 and π = (0.2 0.25 0.35 0.4 0.5 0.5 0.55 0.65 0.7 0.9).

First the adapted p_i s for CP-sampling are calculated with a proportional fitting method, which can be found in appendix A. A formula for calculation of true inclusion probabilities for the CP-design can also be found in appendix A. The true inclusion probabilities and the adapted inclusion probabilities for CP-sampling are listed in table 5. Adapted inclusion probabilities have been calculated up to p_i^(4), which gives true inclusion probabilities π_i^(4) very close to π_i. Thus we use p_i^(4) as the adapted p_i s.

Table 5: True inclusion probabilities and adapted p_i s

Unit   π_i   p_i^(1)  π_i^(1)  p_i^(2)  π_i^(2)  p_i^(3)  π_i^(3)  p_i^(4)  π_i^(4)
 1     .2    .22286   .20079   .22200   .19993   .22207   .20001   .22207   .20000
 2     .25   .27184   .25045   .27135   .24994   .27141   .25001   .27140   .25000
 3     .35   .36503   .34993   .36509   .35000   .36509   .35000   .36509   .35000
 4     .4    .41001   .39986   .41014   .40002   .41012   .40000   .41012   .40000
 5     .5    .49842   .50007   .49833   .50000   .49833   .50000   .49833   .50000
 6     .5    .49842   .50007   .49833   .50000   .49833   .50000   .49833   .50000
 7     .55   .54250   .55020   .54229   .54999   .54230   .55000   .54230   .55000
 8     .65   .63203   .65008   .63194   .65001   .63192   .65000   .63193   .65000
 9     .7    .67815   .69980   .67835   .70005   .67830   .69999   .67831   .70000
10     .9    .88075   .89875   .88219   .90006   .88212   .90000   .88212   .90000

The covariance matrix of I can be calculated by use of 2nd order inclusion probabilities. The 2nd order inclusion probabilities for CP-sampling are defined as

π_ij = P(I_i = 1, I_j = 1 | S_N = n) = p_i p_j · P(S_N^(i,j) = n − 2) / P(S_N = n),

where S_N = Σ_{i=1}^N I_i and S_N^(i,j) = S_N − I_i − I_j. Then the covariance will be Cov_CP(I_i, I_j) = π_ij − π_i π_j for i ≠ j and Var_CP(I_i) = π_i(1 − π_i).
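These quantities can be computed directly from the Poisson binomial distribution of S_N. A Python sketch (the helper names and the example probabilities are mine); a convenient sanity check is the identity Σ_{i≠j} π_ij = n(n − 1), which holds for any fixed-size design:

```python
import numpy as np

def pb_pmf(p):
    # pmf of the Poisson binomial sum of independent Be(p_i) variables;
    # f[k] = P(S = k)
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_second_order(p, n, i, j):
    # pi_ij = p_i p_j P(S_N^(i,j) = n-2) / P(S_N = n), as in the formula above
    p = np.asarray(p, dtype=float)
    rest = np.delete(p, [i, j])
    return p[i] * p[j] * pb_pmf(rest)[n - 2] / pb_pmf(p)[n]

# toy example (probabilities are illustrative only)
p = [0.3, 0.5, 0.7, 0.5]
total = sum(cp_second_order(p, 2, i, j)
            for i in range(4) for j in range(4) if i != j)
# total equals n(n - 1) = 2 up to rounding error
```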

Since this is a small population, the 2nd order inclusion probabilities for Sampford sampling are calculated by summing the probabilities of all samples (of size n) that include both unit i and unit j, i.e.

π_ij = Σ_{x : |x| = n} x_i x_j · p_Sampf(x).

As for CP-sampling, the covariances for Sampford sampling are calculated as Cov_S(I_i, I_j) = π_ij − π_i π_j for i ≠ j and Var_S(I_i) = π_i(1 − π_i). The covariance matrices for both adapted CP-sampling and Sampford sampling have been calculated exactly and they are very close to each other. The matrices can be found in appendix C.
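For a population this small the enumeration above is cheap. The Python sketch below assumes the Sampford probability function p_Sampf(s) ∝ (Σ_{i∈s}(1 − p_i)) Π_{i∈s} p_i/(1 − p_i) with Σ p_i = n (Sampford, 1967); function names and the example probabilities are mine. A good check is that the diagonal of the resulting matrix, the first-order inclusion probabilities, equals the p_i exactly:

```python
from itertools import combinations
import numpy as np

def sampford_pij(p, n):
    # brute-force second-order inclusion probabilities for Sampford sampling
    p = np.asarray(p, dtype=float)
    N = len(p)
    odds = p / (1 - p)
    samples = list(combinations(range(N), n))
    # unnormalized Sampford probability of each size-n sample
    w = np.array([(1 - p[list(s)]).sum() * odds[list(s)].prod()
                  for s in samples])
    w /= w.sum()
    pij = np.zeros((N, N))
    for s, ws in zip(samples, w):
        for i in s:
            for j in s:
                pij[i, j] += ws
    return pij  # diagonal holds the first-order inclusion probabilities

p = np.array([0.3, 0.5, 0.7, 0.5])  # target probabilities summing to n = 2
pij = sampford_pij(p, 2)
```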

The distributions of the linear combinations a_k^T I are compared for adapted CP-sampling and Sampford sampling, where a_k = λ_k^(−1/2) b_k, b_k is an eigenvector of Σ_CP and λ_k is the corresponding eigenvalue. This is done for the eigenvectors corresponding to the 9 largest eigenvalues, i.e. k = 1, ..., 9 with λ_1 ≥ λ_2 ≥ ... ≥ λ_9.
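The standardization a_k = λ_k^(−1/2) b_k makes Var(a_k^T I) = a_k^T Σ_CP a_k = λ_k/λ_k = 1, which is why the σ̂ columns of table 6 are all close to 1. A small Python sketch of this construction with a made-up positive definite Σ (not the thesis's matrix; its diagonal is π_i(1 − π_i) for π = 0.2, 0.25, 0.35):

```python
import numpy as np

# toy covariance matrix with diagonal pi_i(1 - pi_i) for pi = 0.2, 0.25, 0.35
Sigma = np.array([[ 0.16,  -0.05,  -0.03],
                  [-0.05,  0.1875, -0.04],
                  [-0.03, -0.04,   0.2275]])
lam, B = np.linalg.eigh(Sigma)   # eigenvalues ascending, eigenvectors as columns
order = np.argsort(lam)[::-1]    # reorder: largest eigenvalue first
lam, B = lam[order], B[:, order]
A = B / np.sqrt(lam)             # column k is a_k = lam_k^(-1/2) b_k
variances = [A[:, k] @ Sigma @ A[:, k] for k in range(3)]  # all equal 1
```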

We used 20000 CP-samples and 20000 Sampford samples in this example. The CP-samples were generated with the adapted p_i s. Let y_k^CP = a_k^T I_CP for adapted CP-sampling and y_k^S = a_k^T I_S for Sampford sampling. Histograms for the first linear combination (k = 1) can be found in figure 2 and histograms for the remaining linear combinations can be found in appendix C. The resulting means of the linear combinations can be found in table 6.

[Figure: two histograms of the 20000 simulated values on the range −2.5 to 2.5; left panel "Adapted CP, k = 1", right panel "Sampford, k = 1".]

Figure 2: Distribution of y_k^CP and y_k^S for k = 1

The simulated means and distributions of y_k^CP and y_k^S are very close for all k, even for this small population. It was expected that the differences between the distributions would be small, but not this small.

Table 6: Results for the Sampford-Hajek population

 k    ȳ_k^CP   σ̂_k^CP   ȳ_k^S   σ̂_k^S
 1    -0.00     1.00     -0.01    1.00
 2    -0.07     1.00     -0.07    1.00
 3    -0.23     1.00     -0.22    1.00
 4     0.12     1.00      0.11    1.00
 5     0.46     1.00      0.45    1.00
 6    -0.31     0.99     -0.29    1.00
 7     0.66     0.99      0.63    1.00
 8     0.50     1.00      0.51    1.00
 9     1.36     1.00      1.35    1.00

5 Conclusions

The first question of interest was the efficiency of the different sampling methods. We found that CP-method 2 and Sampford method 1 (the methods that use draws with replacement) are efficient only when n is much smaller than N; they become inefficient very quickly as the sample size n increases. CP-method 1 and Sampford method 2 are efficient when sampling from a small population, and the acceptance rate for these methods decreases as the population size increases. CP-method 3 and Sampford method 3 (the list sequential methods) are special. If all probabilities for sums are calculated exactly, the times for the preliminary calculations increase rapidly as the sample size n and the population size N increase. However, once the preliminary calculations have been performed, the methods are very efficient. If we instead use normal approximation of probabilities for sums of many variables, the preliminary calculations are fast and the remaining part of the sampling procedure is a little slower. Thus it is more efficient to use these methods with normal approximation if the goal is to generate only one sample.

CP-method 4 and Sampford method 4 (the new methods) are very efficient in all situations. We have used Laplace approximation of the c_k s for these methods. The approximation is very good and it makes these methods faster than if the c_k s are calculated exactly. The Laplace approximation is easy to implement. The time it takes to generate a sample with these methods is rather independent of the sample size n. These new methods are the most efficient in general, but not always. If the population and the sample size are not too big, the list sequential methods are very efficient and useful (Öhlund, 1999). The list sequential methods might even be the most efficient if many samples are to be generated.

Another question of interest is the effort needed to implement the methods. The effort can be measured in different ways; two such measures are the complexity of the method and the time it takes to implement. The latter depends on the choice of programming language and on the person who implements the methods. I found that, given a good description of the methods, there is not a big difference in the effort needed to implement them. The little additional time it takes to implement a more efficient method, such as Sampford sampling via Pareto sampling (or CP-sampling via Pareto sampling), will save more time later on.

Linear combinations of the inclusion vector I for adapted CP-sampling and Samp- ford sampling were compared in section 4. The distributions of the simulated lin- ear combinations were very close for the small Sampford-Hajek population. This gives an indication that the multivariate distributions of the inclusion vector I for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples.


Acknowledgements

I would like to thank my supervisor, professor Lennart Bondesson, for all the help and valuable advice regarding this thesis. He has also shared many of his ideas with me and for that I am grateful. I would also like to thank Leif Nilsson for valuable comments on my writing.

References

Bondesson, L., Traat, I., Lundqvist, A. (2004). Pareto Sampling versus Sampford and Conditional Poisson Sampling. Research Report No. 6 2004, Department of Mathematical Statistics, Umeå University.

Broström, G. & Nilsson, L. (2000). Acceptance-Rejection sampling from the conditional distribution of independent discrete random variables, given their sum. Statistics 34, 247-257.

Dupacova, J. (1979). A note on Rejective Sampling, Contributions to Statistics, Jaroslav Hajek Memorial Volume. Reidel, Holland and Academia, Prague, 71-78.

Hajek, J. (1981). Sampling from a finite population. Marcel Dekker, New York.

Öhlund, A. (1999). Jämförelse av olika metoder att generera Bernoullifördelade slumptal givet deras summa (Comparisons of different methods to generate Bernoulli distributed random numbers given their sum). Master's thesis, Department of Mathematical Statistics, University of Umeå, Sweden.

Rosén, B. (1997a). Asymptotic theory for order sampling. J. Statist. Plann. Inference 62, 135-158.

Rosén, B. (1997b). On sampling with probability proportional to size. J. Statist. Plann. Inference 62, 159-191.

Ross, S.M. (2002). Simulation, third edition. Academic Press, San Diego.

Sampford, M.R. (1967). On sampling without replacement with unequal proba- bilities of selection. Biometrika 54, 499-513.

Särndal, C-E., Swensson, B. & Wretman, J. (1992). Model assisted survey sampling. Springer-Verlag, New York.

Traat, I., Bondesson, L. & Meister, K. (2004). Sampling design and sample selection through distribution theory. J. Statist. Plann. Inference 123, 395-413.


Appendix A

True inclusion probabilities for CP-sampling

Since the true inclusion probabilities for CP-sampling will only approximately equal the target inclusion probabilities p_i, it might be of interest to calculate the true inclusion probabilities. This can be done by the following formula:

π_i^CP = P(I_i = 1 | Σ I_j = n) = p_i · P(S_{N−1}^(i) = n − 1) / P(S_N = n),

where S_N = Σ I_j, S_{N−1}^(i) = S_N − I_i and I_i ∈ Be(p_i).
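The formula can be evaluated with the Poisson binomial pmf of S_N. A Python sketch (helper names and the example probabilities are mine); since Σ_i π_i^CP = E[S_N | S_N = n] = n, the computed values must sum to n:

```python
import numpy as np

def pb_pmf(p):
    # pmf of S = sum of independent Be(p_i) variables; f[k] = P(S = k)
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_true_inclusion(p, n):
    # pi_i^CP = p_i * P(S_{N-1}^(i) = n-1) / P(S_N = n)
    p = np.asarray(p, dtype=float)
    denom = pb_pmf(p)[n]
    return np.array([p[i] * pb_pmf(np.delete(p, i))[n - 1] / denom
                     for i in range(len(p))])

p = np.array([0.3, 0.5, 0.7, 0.5])   # illustrative target probabilities
pi_cp = cp_true_inclusion(p, 2)      # sums to n = 2
```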

Adapted target inclusion probabilities for CP-sampling

Adapted CP-sampling is CP-sampling with adapted p_i s to get correct inclusion probabilities π_i. The adapted p_i s can be calculated as follows.

First p_i^(1), i = 1, 2, ..., N, are chosen such that Σ p_i^(1) = n and

p_i^(1) / (1 − p_i^(1)) ∝ (π_i / (1 − π_i)) e^(−π_i/d), i = 1, 2, ..., N,

where d = Σ π_i(1 − π_i). This approximation of the p_i s is based on the assumption that the Sampford probability function is close to the desired CP probability function, and also on linearization.

The approximation can be improved by an iterative proportional fitting algorithm. Choose p_i^(2) such that Σ p_i^(2) = n and

p_i^(2) / (1 − p_i^(2)) ∝ (p_i^(1) / (1 − p_i^(1))) × (π_i / (1 − π_i)) / (π_i^(1) / (1 − π_i^(1))), i = 1, 2, ..., N,

where π_i^(1), i = 1, 2, ..., N, are the true inclusion probabilities corresponding to p_i^(1). This step can be iterated to get very exact p_i s.
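The two steps above can be sketched in Python (the helper names and the bisection on the normalizing constant are my own choices; the true inclusion probabilities are computed with the formula from the previous subsection). Run on the Sampford-Hajek π of section 4.1, the iteration converges to values close to the p_i^(4) column of table 5:

```python
import numpy as np

def pb_pmf(p):
    # pmf of the sum of independent Be(p_i) variables
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_true_inclusion(p, n):
    # true CP inclusion probabilities (formula in the previous subsection)
    p = np.asarray(p, dtype=float)
    denom = pb_pmf(p)[n]
    return np.array([p[i] * pb_pmf(np.delete(p, i))[n - 1] / denom
                     for i in range(len(p))])

def odds_to_p(odds, n):
    # find c by bisection so that sum_i c*odds_i / (1 + c*odds_i) = n
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        c = np.sqrt(lo * hi)
        if (c * odds / (1 + c * odds)).sum() < n:
            lo = c
        else:
            hi = c
    return c * odds / (1 + c * odds)

def adapted_p(pi, n, iters=4):
    # iterative proportional fitting of the p_i s, as described above
    pi = np.asarray(pi, dtype=float)
    d = (pi * (1 - pi)).sum()
    p = odds_to_p(pi / (1 - pi) * np.exp(-pi / d), n)    # p^(1)
    for _ in range(iters):
        t = cp_true_inclusion(p, n)                      # current true pi_i
        p = odds_to_p(p / (1 - p) * (pi / (1 - pi)) / (t / (1 - t)), n)
    return p

pi = np.array([.2, .25, .35, .4, .5, .5, .55, .65, .7, .9])
p = adapted_p(pi, 5)   # close to the p^(4) column of table 5
```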


Appendix B

Function discrete_ar.m

This function generates a discrete random variable according to the distribution p by acceptance-rejection. Input is p, where Σ_{i=1}^N p_i = 1, and pmax = max_i p_i.

function [x] = discrete_ar(p,pmax)
found = 0;
while found == 0
    % propose a uniformly chosen index
    x = floor(rand*length(p))+1;
    % accept with probability p(x)/pmax
    if rand < p(x)/pmax
        found = 1;
    end
end

Function ps.m

This is the recursive formula used to calculate probabilities needed in the list sequential methods. Input is the vector p of target inclusion probabilities and the sample size n. Output is a matrix m of size N × (n + 1), where m(i, k + 1) = P(Σ_{j=i}^N I_j = k).

function [m] = ps(p,n)
N = length(p);     % population size
m = zeros(N,n+1);  % the matrix
% calculate the first probabilities
% needed for the recursion formula
m(N,1) = 1-p(N);
m(N,2) = p(N);
% recursion
i = N-1;
while i >= 1
    % lowest possible value for the sum
    low_k = max(0,n-(i-1));
    % highest possible value for the sum
    high_k = min(N-i+1,n);
    for k=low_k:high_k
        if k > 0
            m(i,k+1) = p(i)*m(i+1,k)+(1-p(i))*m(i+1,k+1);
        else
            m(i,k+1) = (1-p(i))*m(i+1,k+1);
        end
    end
    i = i-1;
end

Function pareto.m

This function generates a Pareto sample of size n. Input is the vector p of target inclusion probabilities, where Σ_{i=1}^N p_i = n, and the sample size n.

function [sample] = pareto(p,n)
N = length(p);  % population size
u = rand(1,N);  % N U(0,1) random numbers
% Create Pareto distributed ranking variables q(i)
one = ones(1,N);
q = (u./(one-u))./(p./(one-p));
% Sort q and take the n first elements of the index vector,
% which holds the old index of each element in its new position
[Y,I] = sort(q);
sample = I(1:n);
sample = sort(sample);
