
Comparisons of methods for generating conditional Poisson samples and Sampford samples

Anton Grafström

Master’s thesis in Mathematical Statistics

June 2005


Anton Grafström: Jämförelse av metoder för att generera betingade Poissonstickprov och Sampfordstickprov. Institutionen för matematik och matematisk statistik, Umeå universitet, 901 87 Umeå.

© Anton Grafström, 2005

This thesis was written for the degree of filosofie magisterexamen (Master of Science) in Mathematical Statistics. The work was done at D level and comprises 20 credits.

Supervisor: Lennart Bondesson. Examiner: Leif Nilsson.

Anton Grafström: Comparisons of methods for generating conditional Poisson samples and Sampford samples. Department of Mathematics and Mathematical Statistics, Umeå University, 901 87 Umeå.

© Anton Grafström, 2005

This is a thesis for the degree of Master of Science in Mathematical Statistics.

Supervisor: Lennart Bondesson Examiner: Leif Nilsson


Abstract

Methods for conditional Poisson sampling (CP-sampling) and Sampford sampling are compared, with focus on the efficiency of the methods. The efficiency is investigated by simulation in different sampling situations. The comparison is of interest since new methods for both CP-sampling and Sampford sampling were introduced by Bondesson, Traat & Lundqvist in 2004. The new methods are acceptance-rejection methods that use the efficient Pareto sampling method. They are found to be very efficient and useful in all situations. The list sequential methods for both CP-sampling and Sampford sampling are also found to be efficient, especially if many samples are to be generated.

Key words: acceptance rate, acceptance-rejection, conditional Poisson sam- pling, list sequential sampling, Pareto sampling, Sampford sampling, simulation


Contents

1 Introduction
2 Sampling methods
  2.1 Pareto sampling
  2.2 Conditional Poisson Sampling
    2.2.1 CP-reject (method 1)
    2.2.2 CP-with replacement (method 2)
    2.2.3 CP-list sequential (method 3)
    2.2.4 CP-sampling via Pareto sampling (method 4)
  2.3 Sampford Sampling
    2.3.1 Sampford with replacement (method 1)
    2.3.2 Sampford reject (method 2)
    2.3.3 Sampford list sequential (method 3)
    2.3.4 Sampford sampling via Pareto sampling (method 4)
3 Simulation and results
  3.1 Sampling from the MU284 population
  3.2 Results for the MU284 population
  3.3 Sampling from a big population
  3.4 Results for the big population
4 Comparing adapted CP-sampling and Sampford sampling
  4.1 Results for the Sampford-Hajek population
5 Conclusions

References

Appendix A. True and adapted inclusion probabilities
Appendix B. Matlab functions
Appendix C. Results for the Sampford-Hajek population


1 Introduction

In sampling it is often important to use an efficient sampling method; otherwise it may take too long to generate samples. Methods for two types of sampling designs are compared. Both designs are used to get a sample of fixed size n from a population of size N when the inclusion probabilities are unequal. The inclusion probability for a unit in the population is the probability for that unit to be included in a selected sample. The first design is conditional Poisson sampling (CP-sampling), which is a modification of Poisson sampling. Each unit i in the population is included with a given probability p_i, but only samples of size n are accepted. The probability p_i is called the target inclusion probability for unit i. The true inclusion probability for unit i is the probability for unit i to be included in a sample that is accepted. CP-sampling gives true inclusion probabilities that only approximately coincide with the p_i's. The second design is Sampford sampling, which gives true inclusion probabilities that coincide exactly with the given inclusion probabilities. Four methods for CP-sampling and four methods for Sampford sampling are compared. The following questions are of interest: which method is the most efficient, and which method is easiest to implement? Some previous studies in this area have been done. Öhlund compared some methods for CP-sampling in his master's thesis (Öhlund, 1999) and found a list sequential method to be efficient.

Now there exist new methods for both CP-sampling and Sampford sampling, which were introduced by Bondesson, Traat & Lundqvist (2004). These new methods use Pareto sampling, which was introduced by Rosén (1997a,b). The methods are acceptance-rejection (A-R) methods and they use the fact that the Pareto sampling design is very close to the design of both CP-sampling and Sampford sampling. A Pareto sample, which is rapidly generated, can be adjusted to become a true CP-sample or a true Sampford sample by the use of an A-R filter. It is believed that the new methods will be more efficient than the older methods and that they will work well in all situations.

In section 2 there is a description of each of the sampling methods. Then in section 3, the methods are tested by simulation in some different sampling situations. Adapted CP-sampling and Sampford sampling are compared in section 4. Adapted CP-sampling is CP-sampling with adapted p_i's to get correct inclusion probabilities. We believe that the multivariate distributions of the inclusion vector for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples. The conclusions are presented in section 5.


2 Sampling methods

The sampling methods are described in this section. We start with the Pareto sampling method since it will be used in the new methods later.

2.1 Pareto sampling

Pareto sampling (Rosén, 1997a,b) is used to select a sample of fixed size n from a population of size N. Let λ_i be the given target inclusion probability for unit i, with Σ_{i=1}^N λ_i = n. The method works as follows.

Generate U_1, U_2, ..., U_N, where the U_i are independent U(0,1) variables. Then calculate the Pareto ranking variables

    Q_i = [U_i/(1−U_i)] / [λ_i/(1−λ_i)]

for each unit. Select the n units with the smallest Q-values as a Pareto sample of fixed size n. The true inclusion probabilities will be approximately λ_i.
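The whole procedure fits in a few lines. The thesis's own implementations are Matlab functions (appendix B); the following Python sketch, with the hypothetical name `pareto_sample`, is only meant to mirror the steps above:

```python
import numpy as np

def pareto_sample(lam, n, rng=None):
    """Draw one Pareto sample of fixed size n.

    lam -- array of target inclusion probabilities with sum(lam) = n.
    Returns the indices of the n selected units.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = np.asarray(lam, dtype=float)
    u = rng.uniform(size=lam.size)                  # independent U(0,1)
    # Pareto ranking variables Q_i = [U_i/(1-U_i)] / [lam_i/(1-lam_i)]
    q = (u / (1.0 - u)) / (lam / (1.0 - lam))
    return np.argsort(q)[:n]                        # n smallest Q-values
```

Note that no rejection step is needed: every call returns a sample of exactly n units.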

2.2 Conditional Poisson Sampling

CP-sampling (Hajek, 1981) is used to select a sample of a fixed size n from a population of size N. Let p_i be the given target inclusion probability for unit i, i = 1, ..., N. Each unit i in the population is included with probability p_i, but only samples of size n are accepted. Usually it is assumed that Σ_{i=1}^N p_i = n, since that maximizes the probability of getting samples of size n. The assumption Σ_{i=1}^N p_i = n is not restrictive: if it is not satisfied, the p_i's can be transformed to satisfy that condition (Hajek, 1981, p. 66, Broström and Nilsson, 2000). When using CP-sampling, the true inclusion probabilities will only be approximately p_i. However, there is a possibility to adapt the p_i's to obtain correct inclusion probabilities (Dupacova, 1979). A new and fast method for calculating the adapted p_i's, which was introduced in a revised version of Bondesson, Traat & Lundqvist (2004), can be found in appendix A. When using adapted p_i's the design will be called adapted CP-sampling.


2.2.1 CP-reject (method 1)

This method for CP-sampling can be found in Hajek (1981). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Also, let I_i be independent Be(p_i) distributed inclusion variables (i.e. Bernoulli distributed with success probability p_i). Then unit i is included in the sample if I_i = 1. Simulate I_i for i = 1, ..., N and accept the sample as a CP-sample if Σ_{i=1}^N I_i = n. Repeat the procedure until a sample is accepted. The acceptance rate using this method is

    AR_CP1 = P(Σ_{i=1}^N I_i = n) ∼ [2π Σ_{i=1}^N p_i(1−p_i)]^(−1/2)   as Σ_{i=1}^N p_i(1−p_i) → ∞.

A proof of this can be found in Hajek (1981, p. 68).
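This rejective procedure can be sketched directly (a Python sketch with our own function name `cp_reject`; the thesis's implementation is a Matlab function):

```python
import numpy as np

def cp_reject(p, n, rng=None):
    """CP-method 1: simulate independent Bernoulli(p_i) inclusion
    indicators and accept the outcome only when exactly n units
    are included."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    while True:
        I = rng.uniform(size=p.size) < p     # I_i = 1 with probability p_i
        if I.sum() == n:                     # accept only samples of size n
            return I.astype(int)
```

The expected number of iterations is 1/AR_CP1, which is why the method degrades for large populations.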

2.2.2 CP-with replacement (method 2)

This second method for CP-sampling can also be found in Hajek (1981). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Draw n units with replacement, where unit i is drawn with probability p̄_i ∝ p_i/(1−p_i) and Σ_{i=1}^N p̄_i = 1, i.e.

    p̄_i = c · p_i/(1−p_i),   where c^(−1) = Σ_{i=1}^N p_i/(1−p_i).

If all n units are distinct, the sample is accepted as a CP-sample. Let A be the event that all n drawn units are distinct; the acceptance rate for this method is then

    AR_CP2 = P(A) ∼ n! [Σ_{i=1}^N p_i/(1−p_i)]^(−n) [Π_{i=1}^N (1−p_i)]^(−1) [2π Σ_{i=1}^N p_i(1−p_i)]^(−1/2)

as Σ_{i=1}^N p_i(1−p_i) → ∞. A proof of this can be found in Hajek (1981, pp. 70-71).
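A Python sketch of this with-replacement scheme (function name `cp_with_replacement` is ours; the thesis uses Matlab):

```python
import numpy as np

def cp_with_replacement(p, n, rng=None):
    """CP-method 2: draw n units with replacement with probabilities
    pbar_i proportional to p_i/(1-p_i); accept the draw only if all
    n units are distinct (the event A)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    pbar = p / (1.0 - p)
    pbar /= pbar.sum()                        # the constant c normalizes pbar
    while True:
        draw = rng.choice(p.size, size=n, replace=True, p=pbar)
        if np.unique(draw).size == n:         # all distinct: accept
            return np.sort(draw)
```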


2.2.3 CP-list sequential (method 3)

This is the list sequential method that Öhlund (1999) found to be efficient. The method can also be found in Traat, Bondesson & Meister (2004). The method uses the definition of conditional probability. Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. Also, let I_i be independent Be(p_i) distributed random inclusion variables. Then the inclusion variables I_i can be successively generated from the conditional distributions

    P(I_i = x | Σ_{j=i}^N I_j = n − n_{i−1}),   x = 0, 1,

where n_{i−1} = Σ_{j=0}^{i−1} I_j and I_0 = 0; thus we always get a sample of size n. The conditional probabilities can be written as

    P(I_i = 1 | Σ_{j=i}^N I_j = n − n_{i−1}) = P(I_i = 1) · P(Σ_{j=i+1}^N I_j = n − n_{i−1} − 1) / P(Σ_{j=i}^N I_j = n − n_{i−1})

for i = 1, 2, ..., N−1. For i = N there are only two cases: if Σ_{j=1}^{N−1} I_j = n − 1 then P(I_N = 1) = 1, and if Σ_{j=1}^{N−1} I_j = n then P(I_N = 1) = 0. In order to use this formula, one first has to calculate the probabilities P(Σ_{j=i}^N I_j = k) for all i = 1, 2, ..., N and, for each i, for k from max(0, n − (i − 1)) to min(N − i + 1, n).

This can be done with the following recursive formula.

(1) First calculate P(Σ_{j=N}^N I_j = 0) = 1 − p_N and P(Σ_{j=N}^N I_j = 1) = p_N.

(2) For i = N − 1 down to i = 1 calculate

    P(Σ_{j=i}^N I_j = k) = p_i · P(Σ_{j=i+1}^N I_j = k − 1) + (1 − p_i) · P(Σ_{j=i+1}^N I_j = k)   if k > 0,

and

    P(Σ_{j=i}^N I_j = k) = (1 − p_i) · P(Σ_{j=i+1}^N I_j = k)   if k = 0.

In each step i, the calculations are done for k from max(0, n − (i − 1)) to min(N − i + 1, n). The maximum number of these probabilities that has to be calculated is N²/4 + N. Fortunately, these probabilities need only be calculated once; they can then be used to generate as many samples as needed. The calculation may still be too time-consuming if N and n are large. It is then possible to calculate only some of the probabilities exactly and use normal approximations for the rest of them.
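A Python sketch of the exact variant (recursive table plus sequential generation) may make the indexing easier to follow. For simplicity it stores the full table of P(S_i = k) rather than only the restricted k-range, and the function names are ours:

```python
import numpy as np

def cp_sum_probs(p):
    """Table of P(S_i = k), S_i = I_i + ... + I_N, computed with the
    recursion above; entry [i, k].  The full table is kept instead of
    only k = max(0, n-(i-1)), ..., min(N-i+1, n)."""
    p = np.asarray(p, dtype=float)
    N = p.size
    P = np.zeros((N + 2, N + 1))
    P[N + 1, 0] = 1.0                        # empty sum equals 0
    for i in range(N, 0, -1):                # i = N, N-1, ..., 1
        P[i, 0] = (1 - p[i - 1]) * P[i + 1, 0]
        for k in range(1, N - i + 2):
            P[i, k] = p[i - 1] * P[i + 1, k - 1] + (1 - p[i - 1]) * P[i + 1, k]
    return P

def cp_list_sequential(p, n, P, rng=None):
    """Generate one CP-sample of size n from a precomputed table P."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    N = p.size
    I = np.zeros(N, dtype=int)
    taken = 0
    for i in range(1, N + 1):
        need = n - taken                     # units still to select
        if need == 0:
            break
        # P(I_i = 1 | S_i = need) = p_i P(S_{i+1} = need-1) / P(S_i = need)
        prob = p[i - 1] * P[i + 1, need - 1] / P[i, need]
        if rng.uniform() < prob:
            I[i - 1] = 1
            taken += 1
    return I
```

Once the table is built, each additional sample costs only one pass over the list, which is why this method shines when many samples are needed.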

Let S_i = Σ_{j=i}^N I_j; then

    P(S_i = k) ≈ (1/(√(2π) σ_i)) exp{−(k − µ_i)²/(2σ_i²)},   1 ≤ i ≤ N,

where µ_i = Σ_{j=i}^N p_j and σ_i² = Σ_{j=i}^N p_j(1−p_j). This approximation works well if the p_j's are close to 0.5. If the p_j's are ordered so that

    |p_1 − 0.5| ≤ |p_2 − 0.5| ≤ ... ≤ |p_N − 0.5|,

then the probabilities P(S_1 = k), ..., P(S_m = k) can be approximated and P(S_{m+1} = k), ..., P(S_N = k) can be calculated exactly.

Example 1 (Normal approximation of probabilities). Let S = Σ_{i=1}^{20} I_i, where the I_i are independent and Be(1/4) distributed, i.e. S ∈ Bin(20, 1/4). The true probabilities P(S = k) for k = 0, 1, ..., 20 can be approximated by the value of the probability density function of a normally distributed random variable with µ = Σ_{i=1}^{20} E(I_i) = 5 and σ² = Σ_{i=1}^{20} Var(I_i) = 3.75. We see in figure 1 that the approximation is rather close even for this small example.

[Figure 1: plot of P(S = k) against k, for k = 0, ..., 20.]

Figure 1: Example of normal approximation of probabilities. The true probabilities are represented by dots and approximated by the value of the probability density function of a normally distributed random variable.
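The example can be checked numerically; `binom_pmf` and `normal_pdf` below are our own helper names:

```python
import math

# Exact Bin(20, 1/4) probabilities versus the normal density with
# mu = 5 and sigma^2 = 3.75, as in example 1.
n_trials, p = 20, 0.25
mu = n_trials * p                      # 5
var = n_trials * p * (1 - p)           # 3.75

def binom_pmf(k):
    return math.comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)

def normal_pdf(k):
    return math.exp(-(k - mu)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# print a few values around the mode to compare the two
for k in range(3, 8):
    print(k, round(binom_pmf(k), 4), round(normal_pdf(k), 4))
```

At the mode k = 5 the exact probability is about 0.2023 and the normal density about 0.2060, consistent with the figure.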


2.2.4 CP-sampling via Pareto sampling (method 4)

This is the new method for CP-sampling that was introduced by Bondesson, Traat & Lundqvist (2004). Let the target inclusion probability for unit i be p_i, with Σ_{i=1}^N p_i = n. First a Pareto sample is generated with λ_i = p_i, i = 1, ..., N. Then the Pareto sample is either rejected or accepted as a CP-sample using the probability functions for the Pareto and CP designs. Let I = (I_1, I_2, ..., I_N) be the vector of random inclusion variables, i.e. I_i ∈ {0, 1}, and if I_i = 1 then unit i is sampled. Also, let |I| = Σ_{i=1}^N I_i = n be the sample size. The probability functions p(x) = P(I = x) for the designs can then be written as

    p_CP(x) = C_CP Π p_i^(x_i) (1−p_i)^(1−x_i),   |x| = n,

and, if we set λ_i = p_i,

    p_Par(x) = Π p_i^(x_i) (1−p_i)^(1−x_i) × Σ c_k x_k,   |x| = n,

where

    c_0 = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) dx   and   c_k = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) · 1/(1+τ_k x) dx,

with τ_i = p_i/(1−p_i). The sums and products are taken over the integers 1, 2, ..., N. The constant C_CP is found from the normalizing condition Σ_{x:|x|=n} p(x) = 1. We also have C_CP ≈ √(2πd) for large values of d = Σ p_i(1−p_i). The c_k's can be calculated exactly or approximated using Laplace approximations. One Laplace approximation of the c_k's is

    c_k ≈ c*_k = ∫_{−∞}^∞ exp{−log(1+τ_k) − y τ_k/(1+τ_k) − y²/(2σ_k²)} dy = (1−p_k) √(2π) σ_k exp{σ_k² p_k²/2},

where σ_k² = 1/(d + p_k(1−p_k)). This approximation can be calibrated to be better as follows:

    c*(cal)_k = (N−n) c*_k / (Σ_i c*_i) · c*_0 = (1−p_k)(N−n) σ_k exp{σ_k² p_k²/2} / [Σ_i (1−p_i) σ_i exp{σ_i² p_i²/2}] · c*_0,

where c_0 can be calculated exactly or approximated by c*_0 = √(2π/d). See Bondesson, Traat & Lundqvist (2004) for a more thorough description of these Laplace approximations.

Now let us consider when we can accept a Pareto sample as a CP-sample. Let p_1(·) and p_2(·) be two probability functions. If there exists a constant B such that p_1(x) ≤ B p_2(x) for all x, then a sample from p_2(·) can be generated and accepted as a sample from p_1(·) if U ≤ p_1(x)/(B p_2(x)), where U ∈ U(0,1). The A-R technique can be found in Ross (2002, pp. 53-54).

If p_1(·) = p_CP(·) without C_CP and p_2(·) = p_Par(·), then the constant B must be chosen so that 1 ≤ B Σ c_k x_k for all x. If the probabilities p_i, i = 1, ..., N, are given in increasing order, then the c_k's will decrease. The best choice of B will be B^(−1) = Σ_{k=m}^N c_k, where m = N − n + 1.

The conditional acceptance rate for accepting a Pareto sample as a CP-sample is

    CAR(x) = 1 / (B Σ c_k x_k).

Thus a generated Pareto sample with λ_i = p_i will be accepted as a CP-sample if U ≤ CAR(x), where U ∈ U(0,1). The unconditional acceptance rate will be

    AR = Σ_{x:|x|=n} CAR(x) p_Par(x) ≈ (1/d) Σ_{k=m}^N (1−p_k) ≥ [1/(1−n/N)] · (1/n) Σ_{k=m}^N (1−p_k),

where m = N − n + 1 and d = Σ p_i(1−p_i). See Bondesson, Traat & Lundqvist (2004) for details.
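Putting the pieces together, a Python sketch of CP-method 4 with the (uncalibrated) Laplace approximation of the c_k's might look as follows. It assumes the p_i are given in increasing order, as required for the choice of B, and all names are ours:

```python
import numpy as np

def cp_via_pareto(p, n, rng=None):
    """CP-method 4 sketch.  p must be sorted in increasing order so that
    the Laplace-approximated c_k decrease and B^{-1} = c_m + ... + c_N
    with m = N - n + 1."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    N = p.size
    d = np.sum(p * (1 - p))
    sig2 = 1.0 / (d + p * (1 - p))               # sigma_k^2
    # c*_k = (1 - p_k) sqrt(2 pi) sigma_k exp(sigma_k^2 p_k^2 / 2)
    c = (1 - p) * np.sqrt(2 * np.pi * sig2) * np.exp(sig2 * p**2 / 2)
    Binv = c[N - n:].sum()                       # sum of the n smallest c_k
    while True:
        u = rng.uniform(size=N)
        q = (u / (1 - u)) / (p / (1 - p))        # Pareto ranking variables
        x = np.zeros(N, dtype=int)
        x[np.argsort(q)[:n]] = 1                 # Pareto sample
        if rng.uniform() < Binv / np.dot(c, x):  # CAR(x) = 1/(B sum c_k x_k)
            return x
```

Since CAR(x) is close to 1 for most samples, few Pareto samples are wasted, which is the source of the method's efficiency.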

2.3 Sampford Sampling

Sampford sampling is also used to select a sample of size n from a population of size N. Let π_i be the given inclusion probability for unit i, with Σ_{i=1}^N π_i = n. When using Sampford sampling the true inclusion probabilities will be exactly π_i, which is a big advantage of this design. The probability function for the Sampford design can be written as

    p_Sampf(x) = C_S Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ (1−π_k) x_k,   |x| = n.

The Sampford sampling design is very close to the Pareto sampling design, even closer than the CP-sampling design is.

2.3.1 Sampford with replacement (method 1)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. The remaining n − 1 units are drawn with replacement according to the probabilities p̄_i ∝ π_i/(1−π_i), where Σ_{i=1}^N p̄_i = 1. If all n units are unique the sample is accepted; if not, the procedure is repeated from the beginning. The approximative acceptance rate for this method can be derived by conditioning on the first selected unit and then using an approximation similar to Hajek's acceptance rate for CP-method 2. Let p̄^(i)_k be the drawing probabilities given that i is the first selected unit and unit i is not included again. Then

    p̄^(i)_k = p̄_k / (1 − p̄_i)   and   Σ_{k:k≠i} p̄^(i)_k = 1   for all i.

The corresponding target inclusion probabilities π^(i)_k (k ≠ i) need to be calculated for all i such that p̄^(i)_k ∝ π^(i)_k/(1−π^(i)_k) and Σ_{k:k≠i} π^(i)_k = n − 1. The calculation of the π^(i)_k's can be done with some numerical method like Newton-Raphson. Then the approximative acceptance rate can be written as

    AR_S1 ∼ Σ_{i=1}^N (n−1)! [Σ_{k:k≠i} π^(i)_k/(1−π^(i)_k)]^(−(n−1)) [Π_{k:k≠i} (1−π^(i)_k)]^(−1) [2π Σ_{k:k≠i} π^(i)_k(1−π^(i)_k)]^(−1/2) · (1−p̄_i)^(n−1) · π_i/n

as Σ_{i=1}^N π_i(1−π_i) → ∞.
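The drawing scheme itself (without the acceptance-rate calculation, which would need the Newton-Raphson step for the π^(i)_k) can be sketched in Python as follows; the function name is ours:

```python
import numpy as np

def sampford_with_replacement(pi, n, rng=None):
    """Sampford method 1 sketch: first unit drawn with probabilities
    pi_i/n, the remaining n-1 units with probabilities proportional to
    pi_i/(1-pi_i), all with replacement; accept only if all n drawn
    units are distinct."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    pbar = pi / (1 - pi)
    pbar /= pbar.sum()
    while True:
        first = rng.choice(pi.size, p=pi / n)
        rest = rng.choice(pi.size, size=n - 1, replace=True, p=pbar)
        units = np.append(rest, first)
        if np.unique(units).size == n:
            return np.sort(units)
```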

2.3.2 Sampford reject (method 2)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. Let unit j be the first selected unit. Then independent I_i ∈ Be(π_i) are generated, i = 1, 2, ..., N. If Σ_{i=1}^N I_i = n − 1 and I_j = 0, then the sample consisting of j and {i : I_i = 1} is accepted. Otherwise the procedure is repeated from the beginning. The probabilities

    π̃_i = π_i / (1 + a·(1−π_i)),   where a = 1 / Σ_{i=1}^N π_i(1−π_i),

can be used in the second step to increase the acceptance rate. Then Σ_{i=1}^N π̃_i ≈ n − 1. The approximative acceptance rate for this method can be derived by conditioning on the first selected unit and then using an approximation similar to Hajek's acceptance rate for CP-method 1. We have

    AR_S2 ∼ Σ_{i=1}^N [2π Σ_{j:j≠i} π_j(1−π_j)]^(−1/2) exp{−(n − 1 − Σ_{j:j≠i} π_j)² / (2 Σ_{j:j≠i} π_j(1−π_j))} · (1−π_i) · π_i/n

as Σ_{i=1}^N π_i(1−π_i) → ∞.
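A Python sketch of this procedure, using the rescaled probabilities π̃_i in the Bernoulli step (function name ours):

```python
import numpy as np

def sampford_reject(pi, n, rng=None):
    """Sampford method 2 sketch.  First unit j drawn with probabilities
    pi_i/n; then independent Bernoulli trials with the rescaled
    probabilities pi_i / (1 + a(1 - pi_i)), a = 1/sum(pi_i(1 - pi_i));
    accept if exactly n-1 successes and I_j = 0."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    a = 1.0 / np.sum(pi * (1 - pi))
    pt = pi / (1 + a * (1 - pi))       # sum(pt) is approximately n - 1
    while True:
        j = rng.choice(pi.size, p=pi / n)
        I = rng.uniform(size=pi.size) < pt
        if I.sum() == n - 1 and not I[j]:
            I[j] = True                # final sample: j plus the successes
            return np.flatnonzero(I)
```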


2.3.3 Sampford list sequential (method 3)

The first unit is drawn with replacement according to the probabilities π_i/n, i = 1, 2, ..., N. Let unit j be the first selected unit. Then the CP-list sequential method is used to select the remaining n − 1 units from the whole population (including unit j), given that Σ_{i=1}^N I_i = n − 1. If unit j is selected again, the procedure is repeated from the beginning. This means that the necessary probabilities for sums are calculated only once; the procedure can then be repeated until a sample of n unique units is found (or until several samples are found).

2.3.4 Sampford sampling via Pareto sampling (method 4)

This is the new method for Sampford sampling that was introduced by Bondesson, Traat & Lundqvist (2004). The method, which is similar to CP-sampling via Pareto sampling, works as follows. First a Pareto sample is generated with λ_i = π_i, i = 1, ..., N. Then the sample is either rejected or accepted as a Sampford sample using the probability functions for the Pareto and Sampford designs. The probability functions for the designs can be written as

    p_Sampf(x) = C_S Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ (1−π_k) x_k,   |x| = n,

and, if we set λ_i = π_i,

    p_Par(x) = Π π_i^(x_i) (1−π_i)^(1−x_i) × Σ c_k x_k,   |x| = n,

where

    c_k = ∫_0^∞ x^(n−1) Π (1+τ_i)/(1+τ_i x) · 1/(1+τ_k x) dx,

with τ_i = π_i/(1−π_i). The products and sums are taken over the integers 1, 2, ..., N, and the constant C_S is found from the normalizing condition Σ_{x:|x|=n} p(x) = 1. The c_k's can be calculated exactly, or we can use the same Laplace approximations for the c_k's as described for CP-sampling via Pareto sampling.

To decide whether we shall accept the generated Pareto sample as a Sampford sample, we can use the same method as for CP-sampling via Pareto sampling and let p_1(·) and p_2(·) be two probability functions. If there exists a constant B such that p_1(x) ≤ B p_2(x) for all x, then a sample from p_2(·) can be generated and accepted as a sample from p_1(·) if U ≤ p_1(x)/(B p_2(x)), where U ∈ U(0,1).

In this case we let p_1(·) be p_Sampf(·) without C_S and p_2(·) be p_Par(·). Then we must find B such that

    Σ (1−π_k) x_k ≤ B Σ c_k x_k   for all x, |x| = n.

This holds if 1 − π_k ≤ B c_k for all k, so B can be chosen as

    B = max_k {(1−π_k)/c_k} = 1 / min_k {c_k/(1−π_k)}.

The conditional acceptance rate will be

    CAR(x) = Σ (1−π_k) x_k / (B Σ c_k x_k).

Thus we will accept a Pareto sample as a Sampford sample if U ≤ CAR(x), where U ∈ U(0,1). The unconditional acceptance rate will be

    AR = Σ_{x:|x|=n} CAR(x) p_Par(x).

If we use the Laplace approximation c*(cal)_k of the c_k's and also use the approximation C_S ≈ √(2π/d), where d = Σ π_i(1−π_i), then the acceptance rate can be written as

    AR_Lapl = (N−n) min_k {σ_k exp{σ_k² π_k²/2}} / Σ_i (1−π_i) σ_i exp{σ_i² π_i²/2},

where σ_k² = 1/(d + π_k(1−π_k)). See Bondesson, Traat & Lundqvist (2004) for details.
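A Python sketch analogous to the CP case; again the c_k's are Laplace-approximated and all names are ours:

```python
import numpy as np

def sampford_via_pareto(pi, n, rng=None):
    """Sampford method 4 sketch with Laplace-approximated c_k:
    c*_k = (1-pi_k) sqrt(2 pi) sigma_k exp(sigma_k^2 pi_k^2 / 2),
    B = max_k (1-pi_k)/c_k,
    CAR(x) = sum_k (1-pi_k) x_k / (B sum_k c_k x_k)."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.asarray(pi, dtype=float)
    N = pi.size
    d = np.sum(pi * (1 - pi))
    sig2 = 1.0 / (d + pi * (1 - pi))
    c = (1 - pi) * np.sqrt(2 * np.pi * sig2) * np.exp(sig2 * pi**2 / 2)
    B = np.max((1 - pi) / c)
    while True:
        u = rng.uniform(size=N)
        q = (u / (1 - u)) / (pi / (1 - pi))      # Pareto ranking variables
        x = np.zeros(N, dtype=int)
        x[np.argsort(q)[:n]] = 1                 # Pareto sample
        car = np.dot(1 - pi, x) / (B * np.dot(c, x))
        if rng.uniform() < car:
            return x
```

Since 1 − π_k ≤ B c_k holds elementwise, CAR(x) ≤ 1 for every sample, so the accept step is well defined.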

3 Simulation and results

The sampling methods have been implemented as Matlab functions, which can be found in appendix B. For CP-sampling via Pareto sampling and Sampford sampling via Pareto sampling we have used the calibrated Laplace approximation for the calculation of the c_k's. In the list sequential methods (CP-method 3 and Sampford method 3) all necessary probabilities for sums are calculated exactly. This is done with the recursive formula, which can be found as a separate Matlab function in the appendix.

The methods are first tested on a relatively small population with two different sample sizes. Then a larger population is used, where the differences are more apparent.


3.1 Sampling from the MU284 population

The population that consists of the 284 municipalities of Sweden is called the MU284 population and can be found in Särndal, Swensson & Wretman (1992, pp. 652-659). We use the variable P85, which is the population size of a municipality in the year 1985. Sampling is performed proportional to the population size (P85) of each municipality. The sample sizes n = 10 and n = 50 are used. For each sample size, 1000 samples were generated with each method. For n = 50, four municipalities are always included when sampling proportional to size; the actual sample size then becomes 46. Also, for n = 50, Sampford method 1 and CP-method 2 are excluded since their acceptance rates are too low. As an example, the acceptance rate for CP-method 2 is approximately AR_CP2 ≈ 0.0000016, which means that on average more than 625000 samples have to be generated before one sample is accepted. It is superfluous to include these methods since they are not comparable with the other methods and the simulation would take a long time.

A method that is slow to generate one sample can be very fast in generating many samples if the preliminary calculations are time-consuming. Therefore, the simulation is split into two cases.

Case 1: We are interested in the mean time to get one sample and we want to estimate the acceptance rate. In this case all preliminary calculations are repeated each time for those methods which require such calculations. All necessary probabilities for sums in the list sequential methods are calculated exactly. The population is relatively small, so there are not too many probabilities to calculate.

Table 1 shows the mean times to generate one sample and the acceptance rate.

Case 2: If we know that we need to generate many samples from the population, then it is not necessary to repeat the preliminary calculations each time. Here we are interested in the mean time to generate one sample when the preliminary calculations are already performed. The times for the preliminary calculations are also of interest. Only methods with a significant time for preliminary calculations are included in case 2, i.e. CP-method 3, CP-method 4, Sampford method 3 and Sampford method 4. For the remaining methods, the results are the same as in case 1. Table 2 contains the results for case 2.


3.2 Results for the MU284 population

Table 1: Results for the MU284 population (case 1). The mean time to get one sample and the standard error of the mean time are in milliseconds. The lowest mean time in each category is *-marked. ÂR_Sim is the acceptance rate in this simulation and ÂR is the approximate theoretical acceptance rate, calculated with the formulas presented in section 2.

Method   n    Mean time   SE mean time   ÂR_Sim   ÂR
CP 1     10   1.20        0.10           0.136    0.136
CP 2     10   4.43        0.18           0.127    0.133
CP 3     10   1.05        0.14           1        1
CP 4     10   0.99*       0.10           0.867    0.868
S 1      10   3.80        0.16           0.157    0.159
S 2      10   1.66        0.12           0.123    0.124
S 3      10   1.13        0.11           0.870    -
S 4      10   0.95*       0.09           0.995    0.994
CP 1     50   8.73        0.30           0.067    0.071
CP 2     50   -           -              -        -
CP 3     50   2.52        0.14           1        1
CP 4     50   1.12*       0.10           0.792    0.785
S 1      50   -           -              -        -
S 2      50   10.33       0.32           0.051    0.050
S 3      50   3.09        0.16           0.696    -
S 4      50   0.99*       0.10           1        0.999

We start by looking at the results for case 1, which can be found in table 1. For the smaller sample size n = 10, there is not a big difference between the methods. CP-method 4 seems to be the most efficient of the CP-methods and Sampford method 4 seems to be the most efficient of the Sampford sampling methods. For n = 50, CP-method 2 and Sampford method 1 were excluded since their acceptance rates were too low. The times for the preliminary calculations in CP-method 3 and Sampford method 3 (the list sequential methods) are starting to show as an increased mean time. We can see that the mean times for CP-method 4 and Sampford method 4 are rather independent of the sample size n. We also see that the approximate theoretical acceptance rates are very close to the acceptance rates in this simulation.

In case 2 for the MU284 population (table 2), when the preliminary calculations are already performed, we see that CP-method 3 has a lower mean time than CP-method 4 for both n = 10 and n = 50. For the Sampford sampling methods (Sampford method 3 and Sampford method 4) there is not a significant difference: for n = 10, Sampford method 3 is the fastest, and for n = 50 Sampford method 4 is the fastest of the two.


Table 2: Results for the MU284 population (case 2). This table contains the times for the preliminary calculations and the mean times to generate one sample after such calculations are performed. The lowest mean time in each category is *-marked. All times are in milliseconds.

Method   n    Prel. calc.   Mean time   SE mean time
CP 3     10   0.76          0.19*       0.05
CP 4     10   0.44          0.61        0.08
S 3      10   0.76          0.50*       0.08
S 4      10   0.39          0.60        0.08
CP 3     50   2.33          0.30*       0.05
CP 4     50   0.53          0.64        0.08
S 3      50   2.25          0.63        0.08
S 4      50   0.42          0.57*       0.07

3.3 Sampling from a big population

To really demonstrate the differences in efficiency we need a large population and a large sample. In this example N = 10000 will be the population size and n = 2000 will be the sample size. Let the target inclusion probabilities for CP-sampling and the inclusion probabilities for Sampford sampling be p1 = 0.1, p2 = 0.15, p3 = 0.2, p4 = 0.25, p5 = 0.3, where each p-value is used for 2000 units.

The acceptance rates for CP-method 2 and Sampford method 1 are too low for these methods to be used in this example. For each of the other methods, 100 samples were generated. As before, the simulation is divided into two cases, since a method that is slow to generate one sample can be very fast in generating many samples if the preliminary calculations are time-consuming.

Case 1: In this case all preliminary calculations are repeated each time. Since the list sequential methods have a lot of probabilities that need to be calculated, they would be slow if all probabilities were calculated exactly each time. To make them work faster, normal approximation of the probabilities for sums of more than 1000 variables has been used¹; probabilities for sums of 1000 variables or less have been calculated exactly. This makes these two methods quite fast and very useful, even for larger populations and samples. For CP-method 4 and Sampford method 4 we used the calibrated Laplace approximation for the calculation of the c_k's, recalculated each time. Table 3 shows the mean times to get one sample and also the acceptance rates.

¹The Matlab functions for these variants of the methods are not included in the appendix.


Case 2: We are interested in the mean time to generate one sample when the preliminary calculations are already performed, and also in the time of the preliminary calculations. In this case all probabilities for sums in the list sequential methods have been calculated exactly, since they only need to be calculated once.

For CP-method 4 and Sampford method 4 we still used the calibrated Laplace approximation for the calculation of the c_k's, but now they are calculated only once. Methods with a significant time for preliminary calculations are included in case 2; for the remaining methods, the results are the same as in case 1. Table 4 contains the results for case 2.

3.4 Results for the big population

Table 3: Results for the big population (case 1). The mean time to get one sample and the standard error of the mean time are in milliseconds. The lowest mean time in each category is *-marked. ÂR_Sim is the acceptance rate in this simulation and ÂR is the approximate theoretical acceptance rate, calculated with the formulas presented in section 2.

Method   Mean time   SE mean time   ÂR_Sim   ÂR
CP 1     8740        937            0.009    0.010
CP 2     -           -              -        -
CP 3     693         2              1        1
CP 4     30*         1              0.893    0.903
S 1      -           -              -        -
S 2      8377        884            0.008    0.008
S 3      810         22             0.746    -
S 4      27*         1              1        1.000

The results for the big population and case 1, where the preliminary calculations are repeated each time, can be found in table 3. CP-method 4 is approximately 20 times faster than CP-method 3, even though CP-method 3 has been sped up with normal approximation of the probabilities for sums of more than 1000 variables. CP-method 4 is also about 300 times faster than CP-method 1. The results are similar for the Sampford methods, where Sampford method 4 is about 30 times faster than Sampford method 3 and 300 times faster than Sampford method 2.

If we turn to case 2 for the big population, which can be found in table 4, we can see that the changes are more significant than in the smaller MU284 population.

Now the list sequential methods have the lowest mean times. A simple calculation shows that if more than 600 CP-samples are to be generated, then CP-method 3 is the most efficient (i.e. has the least total time); otherwise CP-method 4 is the most efficient. Again, the results for the Sampford sampling methods are similar: if more than 800 samples are to be generated, then Sampford method 3 is the most efficient; otherwise Sampford method 4 is the most efficient.

Table 4: Results for the big population (case 2). This table contains the times for the preliminary calculations and the mean times to generate one sample after such calculations have been performed. The lowest mean time in each category is ∗-marked. All times are in milliseconds.

Method   Prel. calc.   Mean time   SE mean time
CP 3     6631          4*          1
CP 4     80            15          1
S 3      6831          6*          1
S 4      70            14          1

Remark 1. The results for the list sequential methods in table 3 and table 4 cannot be compared, since approximation was used in table 3. When using approximation there are far fewer probabilities to calculate in the preliminary calculations.

4 Comparing adapted CP-sampling and Sampford sampling

The content of this section is a deviation from the main topic of this work. We compare adapted CP-sampling and Sampford sampling. Adapted CP-sampling is CP-sampling with adapted p_i's to get correct inclusion probabilities π_i. In this section we regard I and all other vectors as column vectors. It is suspected that the multivariate distributions of the inclusion vector I for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples. To see if this is the case, we compare the distributions of some linear combinations a^T I for adapted CP-sampling and Sampford sampling. This is done for the small Sampford-Hajek population with N = 10 and n = 5 (Sampford, 1967, Hajek, 1981, p. 90). It can be mentioned that the well-known Horvitz-Thompson estimator is

    Ŷ_HT = Σ_{i=1}^N (Y_i/π_i) I_i,

which is a linear form and can be written as Σ_{i=1}^N a_i I_i = a^T I.


Let Σ_CP and Σ_S be the covariance matrices of I for adapted CP-sampling and Sampford sampling. Then the variance of a^T I can be written as Var_CP(a^T I) = a^T Σ_CP a for adapted CP-sampling and Var_S(a^T I) = a^T Σ_S a for Sampford sampling. The matrices Σ_CP and Σ_S both have diagonal elements π_i(1−π_i), so the sum of the eigenvalues λ_i of both Σ_CP and Σ_S is Σ_{i=1}^N π_i(1−π_i). This implies that the variance of a^T I for adapted CP-sampling is not uniformly smaller than the same variance for Sampford sampling, and vice versa.

4.1 Results for the Sampford-Hajek population

Let N = 10, n = 5 and π = (0.2 0.25 0.35 0.4 0.5 0.5 0.55 0.65 0.7 0.9).

First the adapted p_i s for CP-sampling are calculated with a proportional fitting method, which can be found in appendix A. A formula for calculation of true inclusion probabilities for the CP-design can also be found in appendix A. The true inclusion probabilities and the adapted inclusion probabilities for CP-sampling are listed in table 5. Adapted inclusion probabilities have been calculated up to p_i^(4), which gives true inclusion probabilities π_i^(4) very close to π_i. Thus we use p_i^(4) as the adapted p_i s.

Table 5: True inclusion probabilities and adapted p_i s

Unit   π_i   p_i^(1)  π_i^(1)  p_i^(2)  π_i^(2)  p_i^(3)  π_i^(3)  p_i^(4)  π_i^(4)
 1     .2    .22286   .20079   .22200   .19993   .22207   .20001   .22207   .20000
 2     .25   .27184   .25045   .27135   .24994   .27141   .25001   .27140   .25000
 3     .35   .36503   .34993   .36509   .35000   .36509   .35000   .36509   .35000
 4     .4    .41001   .39986   .41014   .40002   .41012   .40000   .41012   .40000
 5     .5    .49842   .50007   .49833   .50000   .49833   .50000   .49833   .50000
 6     .5    .49842   .50007   .49833   .50000   .49833   .50000   .49833   .50000
 7     .55   .54250   .55020   .54229   .54999   .54230   .55000   .54230   .55000
 8     .65   .63203   .65008   .63194   .65001   .63192   .65000   .63193   .65000
 9     .7    .67815   .69980   .67835   .70005   .67830   .69999   .67831   .70000
10     .9    .88075   .89875   .88219   .90006   .88212   .90000   .88212   .90000

The covariance matrix of I can be calculated by use of 2nd order inclusion probabilities. The 2nd order inclusion probabilities for CP-sampling are defined as

π_ij = P(I_i = 1, I_j = 1 | S_N = n) = p_i p_j · P(S_N^(i,j) = n − 2) / P(S_N = n),

where S_N = Σ_{i=1}^N I_i and S_N^(i,j) = S_N − I_i − I_j. Then the covariance will be Cov_CP(I_i, I_j) = π_ij − π_i π_j for i ≠ j and Var_CP(I_i) = π_i(1 − π_i).
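These quantities can be computed directly from the Poisson binomial distribution of S_N. A Python sketch (the helper names and the example probabilities are mine); a convenient sanity check is the identity Σ_{i≠j} π_ij = n(n − 1), which holds for any fixed-size design:

```python
import numpy as np

def pb_pmf(p):
    # pmf of the Poisson binomial sum of independent Be(p_i) variables;
    # f[k] = P(S = k)
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_second_order(p, n, i, j):
    # pi_ij = p_i p_j P(S_N^(i,j) = n-2) / P(S_N = n), as in the formula above
    p = np.asarray(p, dtype=float)
    rest = np.delete(p, [i, j])
    return p[i] * p[j] * pb_pmf(rest)[n - 2] / pb_pmf(p)[n]

# toy example (probabilities are illustrative only)
p = [0.3, 0.5, 0.7, 0.5]
total = sum(cp_second_order(p, 2, i, j)
            for i in range(4) for j in range(4) if i != j)
# total equals n(n - 1) = 2 up to rounding error
```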

Since this is a small population, the 2nd order inclusion probabilities for Sampford sampling are calculated by summing the probabilities of all samples (of size n) that include both unit i and unit j, i.e.

π_ij = Σ_{x : |x| = n} x_i x_j · p_Sampf(x).

As for CP-sampling, the covariances for Sampford sampling are calculated as Cov_S(I_i, I_j) = π_ij − π_i π_j for i ≠ j and Var_S(I_i) = π_i(1 − π_i). The covariance matrices for both adapted CP-sampling and Sampford sampling have been calculated exactly and they are very close to each other. The matrices can be found in appendix C.
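For a population this small the enumeration above is cheap. The Python sketch below assumes the Sampford probability function p_Sampf(s) ∝ (Σ_{i∈s}(1 − p_i)) Π_{i∈s} p_i/(1 − p_i) with Σ p_i = n (Sampford, 1967); function names and the example probabilities are mine. A good check is that the diagonal of the resulting matrix, the first-order inclusion probabilities, equals the p_i exactly:

```python
from itertools import combinations
import numpy as np

def sampford_pij(p, n):
    # brute-force second-order inclusion probabilities for Sampford sampling
    p = np.asarray(p, dtype=float)
    N = len(p)
    odds = p / (1 - p)
    samples = list(combinations(range(N), n))
    # unnormalized Sampford probability of each size-n sample
    w = np.array([(1 - p[list(s)]).sum() * odds[list(s)].prod()
                  for s in samples])
    w /= w.sum()
    pij = np.zeros((N, N))
    for s, ws in zip(samples, w):
        for i in s:
            for j in s:
                pij[i, j] += ws
    return pij  # diagonal holds the first-order inclusion probabilities

p = np.array([0.3, 0.5, 0.7, 0.5])  # target probabilities summing to n = 2
pij = sampford_pij(p, 2)
```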

The distributions of the linear combinations a_k^T I are compared for adapted CP-sampling and Sampford sampling, where a_k = λ_k^(−1/2) b_k, b_k is an eigenvector of Σ_CP and λ_k is the corresponding eigenvalue. This is done for the eigenvectors corresponding to the 9 largest eigenvalues, i.e. k = 1, ..., 9 with λ_1 ≥ λ_2 ≥ ... ≥ λ_9.
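The standardization a_k = λ_k^(−1/2) b_k makes Var(a_k^T I) = a_k^T Σ_CP a_k = λ_k/λ_k = 1, which is why the σ̂ columns of table 6 are all close to 1. A small Python sketch of this construction with a made-up positive definite Σ (not the thesis's matrix; its diagonal is π_i(1 − π_i) for π = 0.2, 0.25, 0.35):

```python
import numpy as np

# toy covariance matrix with diagonal pi_i(1 - pi_i) for pi = 0.2, 0.25, 0.35
Sigma = np.array([[ 0.16,  -0.05,  -0.03],
                  [-0.05,  0.1875, -0.04],
                  [-0.03, -0.04,   0.2275]])
lam, B = np.linalg.eigh(Sigma)   # eigenvalues ascending, eigenvectors as columns
order = np.argsort(lam)[::-1]    # reorder: largest eigenvalue first
lam, B = lam[order], B[:, order]
A = B / np.sqrt(lam)             # column k is a_k = lam_k^(-1/2) b_k
variances = [A[:, k] @ Sigma @ A[:, k] for k in range(3)]  # all equal 1
```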

We used 20000 CP-samples and 20000 Sampford samples in this example. The CP-samples were generated with the adapted p_i s. Let y_k^CP = a_k^T I_CP for adapted CP-sampling and y_k^S = a_k^T I_S for Sampford sampling. Histograms for the first linear combination (k = 1) can be found in figure 2 and histograms for the remaining linear combinations can be found in appendix C. The resulting means of the linear combinations can be found in table 6.

[Figure: two histograms of the 20000 simulated values on the range −2.5 to 2.5; left panel "Adapted CP, k = 1", right panel "Sampford, k = 1".]

Figure 2: Distribution of y_k^CP and y_k^S for k = 1

The simulated means and distributions of y_k^CP and y_k^S are very close for all k, even for this small population. It was expected that the differences between the distributions would be small, but not this small.

Table 6: Results for the Sampford-Hajek population

 k    ȳ_k^CP   σ̂_k^CP   ȳ_k^S   σ̂_k^S
 1    -0.00     1.00     -0.01    1.00
 2    -0.07     1.00     -0.07    1.00
 3    -0.23     1.00     -0.22    1.00
 4     0.12     1.00      0.11    1.00
 5     0.46     1.00      0.45    1.00
 6    -0.31     0.99     -0.29    1.00
 7     0.66     0.99      0.63    1.00
 8     0.50     1.00      0.51    1.00
 9     1.36     1.00      1.35    1.00

5 Conclusions

The first question of interest was the efficiency of the different sampling methods. We found that CP-method 2 and Sampford method 1 (the methods that use draws with replacement) are efficient only when n is much smaller than N; they become inefficient very quickly as the sample size n increases. CP-method 1 and Sampford method 2 are efficient when sampling from a small population, and the acceptance rate for these methods decreases as the population size increases. CP-method 3 and Sampford method 3 (the list sequential methods) are special. If all probabilities for sums are calculated exactly, the times for the preliminary calculations increase rapidly as the sample size n and the population size N increase. However, once the preliminary calculations have been performed, the methods are very efficient. If we instead use normal approximation of probabilities for sums of many variables, the preliminary calculations are fast and the remaining part of the sampling procedure is a little slower. Thus it is more efficient to use these methods with normal approximation if the goal is to generate only one sample.

CP-method 4 and Sampford method 4 (the new methods) are very efficient in all situations. We have used Laplace approximation of the c_k s for these methods. The approximation is very good and it makes these methods faster than if the c_k s are calculated exactly. The Laplace approximation is easy to implement. The time it takes to generate a sample with these methods is rather independent of the sample size n. These new methods are the most efficient in general, but not always. If the population and the sample size are not too big, the list sequential methods are very efficient and useful (Öhlund, 1999). The list sequential methods might even be the most efficient if many samples are to be generated.

Another question of interest is the effort needed to implement the methods. The effort can be measured in different ways; two such measures are the complexity of the method and the time it takes to implement. The latter depends on the choice of programming language and on the person who implements the methods. I found that, given a good description of the methods, there is not a big difference in the effort needed to implement them. The little additional time it takes to implement a more efficient method, such as Sampford sampling via Pareto sampling (or CP-sampling via Pareto sampling), will save more time later on.

Linear combinations of the inclusion vector I for adapted CP-sampling and Samp- ford sampling were compared in section 4. The distributions of the simulated lin- ear combinations were very close for the small Sampford-Hajek population. This gives an indication that the multivariate distributions of the inclusion vector I for adapted CP-sampling and Sampford sampling are very close to each other, even for small populations and samples.


Acknowledgements

I would like to thank my supervisor, professor Lennart Bondesson, for all the help and valuable advice regarding this thesis. He has also shared many of his ideas with me and for that I am grateful. I would also like to thank Leif Nilsson for valuable comments on my writing.

References

Bondesson, L., Traat, I., Lundqvist, A. (2004). Pareto Sampling versus Sampford and Conditional Poisson Sampling. Research Report No. 6 2004, Department of Mathematical Statistics, Umeå University.

Broström, G. & Nilsson, L. (2000). Acceptance-Rejection sampling from the conditional distribution of independent discrete random variables, given their sum. Statistics 34, 247-257.

Dupacova, J. (1979). A note on Rejective Sampling, Contributions to Statistics, Jaroslav Hajek Memorial Volume. Reidel, Holland and Academia, Prague, 71-78.

Hajek, J. (1981). Sampling from a finite population. Marcel Dekker, New York.

Öhlund, A. (1999). Jämförelse av olika metoder att generera Bernoullifördelade slumptal givet deras summa (Comparisons of different methods to generate Bernoulli distributed random numbers given their sum). Master's thesis, Department of Mathematical Statistics, University of Umeå, Sweden.

Rosén, B. (1997a). Asymptotic theory for order sampling. J. Statist. Plann. Inference 62, 135-158.

Rosén, B. (1997b). On sampling with probability proportional to size. J. Statist. Plann. Inference 62, 159-191.

Ross, S.M. (2002). Simulation, third edition. Academic Press, San Diego.

Sampford, M.R. (1967). On sampling without replacement with unequal proba- bilities of selection. Biometrika 54, 499-513.

Särndal, C-E., Swensson, B. & Wretman, J. (1992). Model assisted survey sampling. Springer-Verlag, New York.

Traat, I., Bondesson, L. & Meister, K. (2004). Sampling design and sample selection through distribution theory. J. Statist. Plann. Inference 123, 395-413.


Appendix A

True inclusion probabilities for CP-sampling

Since the true inclusion probabilities for CP-sampling will only approximately equal the target inclusion probabilities p_i, it might be of interest to calculate the true inclusion probabilities. This can be done by the following formula:

π_i^CP = P(I_i = 1 | Σ I_j = n) = p_i · P(S_{N−1}^(i) = n − 1) / P(S_N = n),

where S_N = Σ I_j, S_{N−1}^(i) = S_N − I_i and I_i ∈ Be(p_i).
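The formula can be evaluated with the Poisson binomial pmf of S_N. A Python sketch (helper names and the example probabilities are mine); since Σ_i π_i^CP = E[S_N | S_N = n] = n, the computed values must sum to n:

```python
import numpy as np

def pb_pmf(p):
    # pmf of S = sum of independent Be(p_i) variables; f[k] = P(S = k)
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_true_inclusion(p, n):
    # pi_i^CP = p_i * P(S_{N-1}^(i) = n-1) / P(S_N = n)
    p = np.asarray(p, dtype=float)
    denom = pb_pmf(p)[n]
    return np.array([p[i] * pb_pmf(np.delete(p, i))[n - 1] / denom
                     for i in range(len(p))])

p = np.array([0.3, 0.5, 0.7, 0.5])   # illustrative target probabilities
pi_cp = cp_true_inclusion(p, 2)      # sums to n = 2
```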

Adapted target inclusion probabilities for CP-sampling

Adapted CP-sampling is CP-sampling with adapted p_i s to get correct inclusion probabilities π_i. The adapted p_i s can be calculated as follows.

First p_i^(1), i = 1, 2, ..., N, are chosen such that Σ p_i^(1) = n and

p_i^(1) / (1 − p_i^(1)) ∝ (π_i / (1 − π_i)) e^(−π_i/d), i = 1, 2, ..., N,

where d = Σ π_i(1 − π_i). This approximation of the p_i s is based on the assumption that the Sampford probability function is close to the desired CP probability function, and also on linearization.

The approximation can be improved by an iterative proportional fitting algorithm. Choose p_i^(2) such that Σ p_i^(2) = n and

p_i^(2) / (1 − p_i^(2)) ∝ (p_i^(1) / (1 − p_i^(1))) × (π_i / (1 − π_i)) / (π_i^(1) / (1 − π_i^(1))), i = 1, 2, ..., N,

where π_i^(1), i = 1, 2, ..., N, are the true inclusion probabilities corresponding to p_i^(1). This step can be iterated to get very exact p_i s.
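The two steps above can be sketched in Python (the helper names and the bisection on the normalizing constant are my own choices; the true inclusion probabilities are computed with the formula from the previous subsection). Run on the Sampford-Hajek π of section 4.1, the iteration converges to values close to the p_i^(4) column of table 5:

```python
import numpy as np

def pb_pmf(p):
    # pmf of the sum of independent Be(p_i) variables
    f = np.array([1.0])
    for pi in p:
        f = np.convolve(f, [1 - pi, pi])
    return f

def cp_true_inclusion(p, n):
    # true CP inclusion probabilities (formula in the previous subsection)
    p = np.asarray(p, dtype=float)
    denom = pb_pmf(p)[n]
    return np.array([p[i] * pb_pmf(np.delete(p, i))[n - 1] / denom
                     for i in range(len(p))])

def odds_to_p(odds, n):
    # find c by bisection so that sum_i c*odds_i / (1 + c*odds_i) = n
    lo, hi = 1e-12, 1e12
    for _ in range(200):
        c = np.sqrt(lo * hi)
        if (c * odds / (1 + c * odds)).sum() < n:
            lo = c
        else:
            hi = c
    return c * odds / (1 + c * odds)

def adapted_p(pi, n, iters=4):
    # iterative proportional fitting of the p_i s, as described above
    pi = np.asarray(pi, dtype=float)
    d = (pi * (1 - pi)).sum()
    p = odds_to_p(pi / (1 - pi) * np.exp(-pi / d), n)    # p^(1)
    for _ in range(iters):
        t = cp_true_inclusion(p, n)                      # current true pi_i
        p = odds_to_p(p / (1 - p) * (pi / (1 - pi)) / (t / (1 - t)), n)
    return p

pi = np.array([.2, .25, .35, .4, .5, .5, .55, .65, .7, .9])
p = adapted_p(pi, 5)   # close to the p^(4) column of table 5
```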


Appendix B

Function discrete_ar.m

This function generates a discrete random variable according to the distribution p by acceptance-rejection. Input is p, where Σ_{i=1}^N p_i = 1, and pmax = max_i p_i.

function [x] = discrete_ar(p,pmax)
found = 0;
while found == 0
    % propose a uniformly chosen index
    x = floor(rand*length(p))+1;
    % accept with probability p(x)/pmax
    if rand < p(x)/pmax
        found = 1;
    end
end

Function ps.m

This is the recursive formula used to calculate probabilities needed in the list sequential methods. Input is the vector p of target inclusion probabilities and the sample size n. Output is a matrix m of size N × (n + 1), where m(i, k + 1) = P(Σ_{j=i}^N I_j = k).

function [m] = ps(p,n)
N = length(p);     % population size
m = zeros(N,n+1);  % the matrix
% calculate the first probabilities
% needed for the recursion formula
m(N,1) = 1-p(N);
m(N,2) = p(N);
% recursion
i = N-1;
while i >= 1
    % lowest possible value for the sum
    low_k = max(0,n-(i-1));
    % highest possible value for the sum
    high_k = min(N-i+1,n);
    for k=low_k:high_k
        if k > 0
            m(i,k+1) = p(i)*m(i+1,k)+(1-p(i))*m(i+1,k+1);
        else
            m(i,k+1) = (1-p(i))*m(i+1,k+1);
        end
    end
    i = i-1;
end

Function pareto.m

This function generates a Pareto sample of size n. Input is the vector p of target inclusion probabilities, where Σ_{i=1}^N p_i = n, and the sample size n.

function [sample] = pareto(p,n)
N = length(p);  % population size
u = rand(1,N);  % N U(0,1) random numbers
% Create Pareto distributed ranking variables q(i)
one = ones(1,N);
q = (u./(one-u))./(p./(one-p));
% Sort q and take the n first elements of the index vector,
% which holds the old index of each element in its new position
[Y,I] = sort(q);
sample = I(1:n);
sample = sort(sample);
