
U.U.D.M. Project Report 2020:52

Degree project in mathematics, 30 credits. Supervisor: Rolf Larsson

Examiner: Julian Külshammer. December 2020

Department of Mathematics Uppsala University

Clustering using k-means algorithm in

multivariate dependent models with factor structure

Dimitris Dineff


Contents

1 Introduction
2 k-means Clustering
3 Model
  3.1 A model in dimension 2
  3.2 A model in dimension 5
  3.3 A model in dimension 7
4 Model Selection & Simulations
  4.1 Model Selection
  4.2 Dimension 5
  4.3 Dimension 7
  4.4 Observations and Comparison
5 Empirical Example
6 Conclusion
7 Appendix
  7.1 Dimension 5
  7.2 Dimension 7


1 Introduction

One of the major growing topics in various fields of science is machine learning. This rapid development is mainly due to the huge need for interpretation and analysis of the countless data that arise and are collected on a daily basis. Machine learning is divided into two big categories, supervised and unsupervised. In unsupervised machine learning, there is no supervisor who provides the correct values of an output; we only have input data and try to learn regularities from them. There is a structure to the input space such that certain patterns occur more often than others, and we want to see what generally happens and what does not. In statistics, this is called density estimation. One method for density estimation is clustering, where the aim is to find clusters or groupings of the input ([8]).

One of the major clustering approaches is based on the sum of squares criterion and on the algorithm that is today well known under the name 'k-means' ([9]). The k-means algorithm is the most widely used clustering method. It constructs a partition of a set of objects into k clusters that minimizes some objective function, usually a squared error function, which implies round-shaped clusters. The input parameter k is fixed and must be given in advance, which limits its applicability to streaming and evolving data ([8]).

The models to which we are going to apply k-means are described thoroughly in a relevant paper ([2]). These are dependent models with factor structure containing discrete data, which we generated in Matlab. Working with k-means clustering requires determining the input parameter k. In this paper, we select k depending on the factor structure of our dependent models. For example, suppose we have a model where a number of variates are described as a linear combination of a factor U1 and some independent random variables, while the rest of the variables are described as a linear combination of a second factor U2 and some other independent random variables. Then we 'divide' our model into two groups, one where the variables are linked through factor U1 and another where the variables are linked through factor U2; thus, we select k equal to two. If we also had some variables that were only equal to some independent random variables, we would have had a third group, and as a result we would select k equal to three.

First, we begin with a simple dependent Poisson model in dimension 2 ([3]), and then we construct dependent, discrete Poisson, binomial, and mixed Poisson and binomial factor models in dimensions 5 and 7.

Our goal is to explore with k-means clustering which model structures are easier to find than others. We do this by calculating the accuracy for all the models, which we exhibit in a simulation study. Also, we use different parameters for the factors and the variables and investigate how the accuracy is affected. Last but not least, we compare our performance with that of a relevant paper ([2]).

Finally, a few words about the structure. In Section 2, we present the k-means method. In Section 3, we analyze the models and their components. In Section 4, we present the results of our simulations. In Section 5, we perform k-means clustering on an empirical example with ordinal data, previously analyzed by Jöreskog ([7]).

Finally, Section 6 contains the conclusion.


2 k-means Clustering

Clustering is a data analysis technique that, when applied to a set of heterogeneous items, identifies homogeneous subgroups as defined by a given model or measure of similarity. One feature of clustering is that the process is unsupervised, that is, there is no predefined grouping that the clustering seeks to reproduce. In unsupervised machine learning, only the inputs are available and the task is to reveal aspects of the underlying distribution of the input data. Clustering is a technique for exploratory data analysis and is used increasingly in preliminary analyses of large data sets of medium and high dimensionality as a method of selection, diversity analysis and data reduction ([4]).

If a data set is analyzed in an iterative way, such that at each step a pair of clusters is merged or a single cluster is divided, the result is hierarchical, with a parent-child relationship being established between clusters at each successive level of the iteration. If the data set is analyzed to produce a single partition of the items resulting in a set of clusters, the result is non-hierarchical ([4]).

The k-means method is a non-hierarchical relocation clustering technique in which each item is assigned to the cluster having the nearest centroid (mean). A relocation method is one in which items are moved from one cluster to another, to try to improve on the initial estimation of the clusters. The relocation is typically accomplished by improving a cost function describing the "goodness" of each resultant cluster ([4]). The method solves the clustering problem, which amounts to grouping similar items. First, we choose k initial cluster centers, which are called centroids. At the second step, the algorithm computes the point-to-centroid distances of all observations to each centroid. At the third step, it assigns each observation to the cluster with the closest centroid. After that, it computes the average of the observations in each cluster to obtain k new centroid locations. Finally, it repeats the second through fourth steps until the cluster assignments do not change or the maximum number of iterations is reached ([6]). The final assignment of items to clusters will be, to some extent, dependent upon the initial partition or the initial selection of seed points ([1], p. 696).
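The steps above can be sketched as follows. This is a minimal Python illustration of the relocation loop (the thesis simulations were done in Matlab); the function name and defaults are ours.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Lloyd's algorithm: the four steps described above, repeated until stable."""
    rng = np.random.default_rng(rng)
    # Step 1: choose k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: point-to-centroid distances for all observations.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each observation to the closest centroid.
        labels = d.argmin(axis=1)
        # Step 4: new centroid = average of the observations in each cluster.
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments no longer change
        centroids = new_centroids
    return labels, centroids
```

With well-separated data the recovered partition does not depend on the random initialization; in general, as noted above, it may.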

The k-means clustering algorithm amounts to selecting the clusters R_1, ..., R_k such that the sum of pairwise squared Euclidean distances within each cluster is minimized:

arg min_{R_1,...,R_k} Σ_{j=1}^{k} (1/|R_j|) Σ_{x,x' ∈ R_j} ||x − x'||_2^2    (1)

where |Rj| is the number of data points in cluster Rj.

The intention of (1) is to select the clusters such that all the points within each cluster are as similar as possible.

Solving (1) is equivalent to selecting the clusters such that the squared distance to the cluster center, summed over all data points, is minimized:

arg min_{R_1,...,R_k} Σ_{j=1}^{k} Σ_{x ∈ R_j} ||x − µ_j||_2^2    (2)

where µ_j is the center of cluster R_j, i.e. the average of all data points x in R_j ([5]).
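The two formulations have the same minimizers because, within each cluster, the pairwise sum in (1) equals exactly twice the centroid sum in (2). A quick numerical check of this identity (our own illustration, in Python):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(8, 3))   # one cluster of 8 points in R^3
mu = R.mean(axis=0)           # its centroid

# Cluster term of (1): pairwise form, summed over ordered pairs (x, x')
pairwise = sum(np.sum((x - y) ** 2) for x in R for y in R) / len(R)
# Cluster term of (2): squared distances to the centroid
centroid = np.sum((R - mu) ** 2)

assert np.isclose(pairwise, 2 * centroid)
```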


3 Model

The general model is a model with factor structure, as we can see below:

Y_1 = X_1
...
Y_{n_0} = X_{n_0}
Y_{n_0+1} = U_1 + X_{n_0+1}
...
Y_{n_0+n_1} = U_1 + X_{n_0+n_1}
Y_{n_0+n_1+1} = U_2 + X_{n_0+n_1+1}
...
Y_{n_0+n_1+n_2} = U_2 + X_{n_0+n_1+n_2}
...
Y_{n_0+...+n_{k-1}+1} = U_k + X_{n_0+...+n_{k-1}+1}
...
Y_{n_0+...+n_k} = U_k + X_{n_0+...+n_k}    (3)

where N = n_0 + ... + n_k, U_1, ..., U_k are the factors, Y_1, ..., Y_N the dependent variables and X_1, ..., X_N the independent ones. All the variables follow the Poisson distribution. The type of the model is (n_1, n_2, ..., n_k, 1, ..., 1), where there are n_0 ones at the end.
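The generating scheme of model (3) can be sketched as follows. This is an illustrative Python version (the thesis data were generated in Matlab); the function name, arguments, and parameter values are our own.

```python
import numpy as np

def simulate_factor_model(sizes, n_singletons, lam, mu, n_obs, rng=None):
    """Draw n_obs samples of (Y_1, ..., Y_N) from model (3).

    sizes: (n_1, ..., n_k), group sizes of variables sharing a factor;
    n_singletons: n_0, variables with no factor;
    factors U_i ~ Poisson(lam), noise X_j ~ Poisson(mu)."""
    rng = np.random.default_rng(rng)
    cols = []
    # Singleton variables: Y_j = X_j
    for _ in range(n_singletons):
        cols.append(rng.poisson(mu, size=n_obs))
    # Grouped variables: Y_j = U_i + X_j, with U_i shared within group i
    for n_i in sizes:
        U = rng.poisson(lam, size=n_obs)
        for _ in range(n_i):
            cols.append(U + rng.poisson(mu, size=n_obs))
    return np.column_stack(cols)

# e.g. the (3, 2) model in dimension 5
Y = simulate_factor_model(sizes=(3, 2), n_singletons=0,
                          lam=0.5, mu=0.5, n_obs=1000, rng=1)
```

Variables sharing a factor are positively correlated, which is what the clustering later exploits.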

3.1 A model in dimension 2

Here, we have the Karlis bivariate model ([3]), which is:

Y_1 = U + X_1
Y_2 = U + X_2    (4)

where U, X_1, X_2 are independent non-negative valued Poisson variables.

We want to estimate the parameters of the above model by maximum likelihood. Let f(u; λ) and g(x; µ_j) be the probability mass functions of U and X_1, X_2, respectively. We have a set of observation pairs (y_11, y_12), ..., (y_n1, y_n2).


Since Y_1 and Y_2 are conditionally independent given U, and U, X_1, X_2 follow the Poisson distribution, the likelihood is:

L(λ, µ_1, µ_2) = Π_{i=1}^{n} Σ_{u=0}^{min(y_i1, y_i2)} [λ^u exp{−λ}/u!] [µ_1^{y_i1−u} exp{−µ_1}/(y_i1−u)!] [µ_2^{y_i2−u} exp{−µ_2}/(y_i2−u)!]

= exp{−n(λ + µ_1 + µ_2)} Π_{i=1}^{n} Σ_{u=0}^{min(y_i1, y_i2)} [λ^u/u!] [µ_1^{y_i1−u}/(y_i1−u)!] [µ_2^{y_i2−u}/(y_i2−u)!]    (5)

After taking the logarithm of the right hand side of the above expression, we can numerically maximize only over the parameter λ by inserting µ̂_1 = ȳ_1 − λ̂, µ̂_2 = ȳ_2 − λ̂. This is a result of the following proposition.

Proposition 1. The parameters that maximize (5), λ̂, µ̂_1, ..., µ̂_m, satisfy the equalities

ȳ_k = µ̂_k + λ̂,  k = 1, 2, ..., m,    (6)

where ȳ_k = (1/n) Σ_{i=1}^{n} y_ik for all k ([2]).
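Concretely, the profile maximization can be sketched as follows: the log of (5) is evaluated on a grid of λ with µ̂_1 = ȳ_1 − λ and µ̂_2 = ȳ_2 − λ substituted, as the proposition suggests. This is a Python sketch (the thesis used Matlab); the rates 0.5, 0.7, 0.9 and the grid are illustrative only.

```python
import numpy as np
from math import lgamma, exp, log

def loglik(lam, mu1, mu2, y):
    """Log of the likelihood (5) of the bivariate Poisson factor model."""
    ll = 0.0
    for y1, y2 in y:
        # finite latent sum: u can be at most min(y1, y2)
        s = sum(exp(u * log(lam) - lgamma(u + 1)
                    + (y1 - u) * log(mu1) - lgamma(y1 - u + 1)
                    + (y2 - u) * log(mu2) - lgamma(y2 - u + 1))
                for u in range(min(y1, y2) + 1))
        ll += log(s) - (lam + mu1 + mu2)
    return ll

# Simulate from Y1 = U + X1, Y2 = U + X2 with illustrative rates
rng = np.random.default_rng(0)
n = 2000
U = rng.poisson(0.5, n)
y = np.column_stack([U + rng.poisson(0.7, n), U + rng.poisson(0.9, n)])

# Profile over lambda only, inserting mu_k = ybar_k - lambda (Proposition 1)
ybar1, ybar2 = y.mean(axis=0)
grid = np.linspace(0.01, min(ybar1, ybar2) - 0.01, 60)
lam_hat = max(grid, key=lambda l: loglik(l, ybar1 - l, ybar2 - l, y))
```

The grid search stands in for any one-dimensional optimizer; the point is that only λ needs to be searched.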

3.2 A model in dimension 5

Let's introduce the (3, 2) model.

Y_1 = U_1 + X_1
Y_2 = U_1 + X_2
Y_3 = U_1 + X_3
Y_4 = U_2 + X_4
Y_5 = U_2 + X_5

We will estimate the parameters of the above model by maximum likelihood. Let f(u; λ_i) and g(x; µ_j) be the probability mass functions of U_i, i = 1, 2, and X_j, j = 1, ..., 5, respectively. We have n five-dimensional observations (y_11, ..., y_15), ..., (y_n1, ..., y_n5). Since Y_1, Y_2 and Y_3 are conditionally independent given U_1, and Y_4, Y_5 are conditionally independent given U_2, the likelihood is:

L(λ_1, λ_2, µ_1, ..., µ_5) = Π_{i=1}^{n} Σ_{u1=0}^{min(y_i1, y_i2, y_i3)} Σ_{u2=0}^{min(y_i4, y_i5)} f(u1; λ_1) f(u2; λ_2) g(y_i1−u1; µ_1) g(y_i2−u1; µ_2) g(y_i3−u1; µ_3) g(y_i4−u2; µ_4) g(y_i5−u2; µ_5)

= Π_{i=1}^{n} Σ_{u1} Σ_{u2} [λ_1^{u1} exp{−λ_1}/u1!] [λ_2^{u2} exp{−λ_2}/u2!] [µ_1^{y_i1−u1} exp{−µ_1}/(y_i1−u1)!] [µ_2^{y_i2−u1} exp{−µ_2}/(y_i2−u1)!] [µ_3^{y_i3−u1} exp{−µ_3}/(y_i3−u1)!] [µ_4^{y_i4−u2} exp{−µ_4}/(y_i4−u2)!] [µ_5^{y_i5−u2} exp{−µ_5}/(y_i5−u2)!]

= exp{−n(λ_1 + λ_2 + Σ_{j=1}^{5} µ_j)} Π_{i=1}^{n} Σ_{u1} Σ_{u2} [λ_1^{u1} λ_2^{u2} µ_1^{y_i1−u1} µ_2^{y_i2−u1} µ_3^{y_i3−u1} µ_4^{y_i4−u2} µ_5^{y_i5−u2}] / [u1! u2! (y_i1−u1)! (y_i2−u1)! (y_i3−u1)! (y_i4−u2)! (y_i5−u2)!]

After taking the logarithm of the above expression, we numerically maximize over the parameters λ_1, λ_2, using Proposition 1.
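As a numerical illustration, the double-sum log-likelihood can be evaluated directly; the Python sketch below (the thesis used Matlab; the helper name and all parameter values are ours) couples y_1, y_2, y_3 through u_1 and y_4, y_5 through u_2.

```python
import numpy as np
from math import lgamma, exp, log

def loglik_32(lam1, lam2, mu, y):
    """Log-likelihood of the (3,2) Poisson factor model:
    u1 is shared by y1..y3, u2 by y4, y5 (double latent sum)."""
    mu = np.asarray(mu, dtype=float)
    ll = 0.0
    for row in y:
        s = 0.0
        for u1 in range(int(min(row[:3])) + 1):
            for u2 in range(int(min(row[3:])) + 1):
                us = (u1, u1, u1, u2, u2)
                t = (u1 * log(lam1) - lgamma(u1 + 1)
                     + u2 * log(lam2) - lgamma(u2 + 1))
                for yj, muj, uj in zip(row, mu, us):
                    t += (yj - uj) * log(muj) - lgamma(yj - uj + 1)
                s += exp(t)
        # exp{-(lam1 + lam2 + sum mu_j)} contributes once per observation
        ll += log(s) - (lam1 + lam2 + mu.sum())
    return ll
```

On simulated data the log-likelihood is (with high probability, for a reasonable sample size) larger at the true parameters than at a distant alternative, which is a simple sanity check of the formula.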


3.3 A model in dimension 7

Let's introduce the (3, 2, 1, 1) model.

Y_1 = U_1 + X_1
Y_2 = U_1 + X_2
Y_3 = U_1 + X_3
Y_4 = U_2 + X_4
Y_5 = U_2 + X_5
Y_6 = X_6
Y_7 = X_7

We will estimate the parameters of the above model by maximum likelihood. Let f(u; λ_i) and g(x; µ_j) be the probability mass functions of U_i, i = 1, 2, and X_j, j = 1, ..., 7, respectively. We have n seven-dimensional observations (y_11, ..., y_17), ..., (y_n1, ..., y_n7). Since Y_1, Y_2 and Y_3 are conditionally independent given U_1, Y_4, Y_5 are conditionally independent given U_2, and Y_6, Y_7 are independent, the likelihood is:

L(λ_1, λ_2, µ_1, ..., µ_7) = Π_{i=1}^{n} Σ_{u1=0}^{min(y_i1, y_i2, y_i3)} Σ_{u2=0}^{min(y_i4, y_i5)} f(u1; λ_1) f(u2; λ_2) g(y_i1−u1; µ_1) g(y_i2−u1; µ_2) g(y_i3−u1; µ_3) g(y_i4−u2; µ_4) g(y_i5−u2; µ_5) g(y_i6; µ_6) g(y_i7; µ_7)

= Π_{i=1}^{n} Σ_{u1} Σ_{u2} [λ_1^{u1} exp{−λ_1}/u1!] [λ_2^{u2} exp{−λ_2}/u2!] [µ_1^{y_i1−u1} exp{−µ_1}/(y_i1−u1)!] [µ_2^{y_i2−u1} exp{−µ_2}/(y_i2−u1)!] [µ_3^{y_i3−u1} exp{−µ_3}/(y_i3−u1)!] [µ_4^{y_i4−u2} exp{−µ_4}/(y_i4−u2)!] [µ_5^{y_i5−u2} exp{−µ_5}/(y_i5−u2)!] [µ_6^{y_i6} exp{−µ_6}/y_i6!] [µ_7^{y_i7} exp{−µ_7}/y_i7!]

= exp{−n(λ_1 + λ_2 + Σ_{j=1}^{7} µ_j)} Π_{i=1}^{n} Σ_{u1} Σ_{u2} [λ_1^{u1} λ_2^{u2} µ_1^{y_i1−u1} µ_2^{y_i2−u1} µ_3^{y_i3−u1} µ_4^{y_i4−u2} µ_5^{y_i5−u2} µ_6^{y_i6} µ_7^{y_i7}] / [u1! u2! (y_i1−u1)! (y_i2−u1)! (y_i3−u1)! (y_i4−u2)! (y_i5−u2)! y_i6! y_i7!]    (7)

After taking the logarithm of the above expression, we numerically maximize over the parameters λ_1, λ_2, using Proposition 1 again.


4 Model Selection & Simulations

4.1 Model Selection

Here, the model selection is simple. Our goal is to see which model structures are more easily found in the correct form using k-means. We will present tables with the accuracy for each model in dimensions 5 and 7, and after that we will compare our results with those from Larsson's paper ([2]).

Larsson ([2]) uses the AIC as a proposed method for model selection. Due to the large number of potential models, he focuses on finding models where one factor loads on each variable. The approach is the same for all dimensions.

Considering dimension 7, his model selection algorithm starts by computing the AIC for the independence model and comparing it with all the (2,1,...,1) models. If the independence model has the lowest AIC, the algorithm stops. If not, he estimates all (3,1,...,1) models where the pair of variables that had the same factor in the first step is joined by one of the other variables, as well as all (2,2,1,...,1) models where he adds a new pair of variables consisting of any two that were not in the first pair. If none of the (3,1,...,1) or (2,2,1,...,1) models that he tried is better than the previously chosen (2,1,...,1) model, the algorithm stops and chooses the previous model. If not, it continues to test new models in the way described above.

4.2 Dimension 5

We started simulating with the models (5) and (1, 1, 1, 1, 1), and every time we achieved 100% accuracy. The simulations have been done with 100,000 replications.

In the (4,1) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, X5∼ P (1) and U1∼ P (λ).

In the (3,2) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, 5 and Ui∼ P (λ), i = 1, 2.

In the (3,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, X4,5 ∼ P (1) and U1∼ P (λ).

In the (2,2,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, X5∼ P (1) and Ui∼ P (λ), i = 1, 2.

In the (2,1,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, X3,4,5∼ P (1) and U1∼ P (λ).

The above choices are explained by the fact that, for the variables that are linked with a factor, it holds that:

E[Y ] = E[U ] + E[X] (8)

For the variables that are not linked with a factor, it holds :

E[Y ] = E[X] (9)

The variables follow the Poisson distribution with parameters µ = 0.5 and λ = 0.5.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           47.20     51.44     51.92      52.00       52.03
(3,2)           66.82     74.76     76.41      77.00       79.27
(3,1,1)         36.91     41.69     43.40      44.70       44.97
(2,2,1)         45.81     53.27     55.80      56.45       57.19
(2,1,1,1)       47.08     52.97     55.24      56.93       57.33
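A minimal way to reproduce an entry of such a table is sketched below in Python (the thesis simulations were done in Matlab). It assumes, as one plausible reading of the procedure, that the five variables are clustered as points in R^n via their observation vectors, with k = 2 chosen from the factor structure; accuracy is the share of replications in which the true grouping {Y1, Y2, Y3}, {Y4, Y5} of the (3,2) model is recovered. Function names and defaults are ours.

```python
import numpy as np

def kmeans_labels(P, k, rng, n_init=3, iters=50):
    """Tiny k-means (Lloyd) with a few random restarts; returns labels."""
    best_lab, best_obj = None, np.inf
    for _ in range(n_init):
        C = P[rng.choice(len(P), k, replace=False)].astype(float)
        for _ in range(iters):
            lab = np.linalg.norm(P[:, None] - C[None], axis=2).argmin(axis=1)
            newC = np.array([P[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                             for j in range(k)])
            if np.allclose(newC, C):
                break
            C = newC
        lab = np.linalg.norm(P[:, None] - C[None], axis=2).argmin(axis=1)
        obj = ((P - C[lab]) ** 2).sum()
        if obj < best_obj:
            best_lab, best_obj = lab, obj
    return best_lab

def accuracy_32(lam, mu, n, reps, seed=0):
    """Share of replications where k-means (k = 2) recovers {Y1,Y2,Y3}, {Y4,Y5}."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        U1, U2 = rng.poisson(lam, n), rng.poisson(lam, n)
        # rows = variables, columns = the n observations
        Y = np.vstack([U1 + rng.poisson(mu, n) for _ in range(3)]
                      + [U2 + rng.poisson(mu, n) for _ in range(2)])
        lab = kmeans_labels(Y, 2, rng)
        hits += (len(set(lab[:3])) == 1 and len(set(lab[3:])) == 1
                 and lab[0] != lab[3])
    return hits / reps
```

Consistent with the tables, accuracy is high when λ is large relative to µ (strong within-group correlation) and low in the opposite regime.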

The variables follow the Poisson distribution with parameters µ = 0.2 and λ = 0.8.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           79.06     80.89     82.27      84.25       85.92
(3,2)           96.75     98.42     99.32     100         100
(3,1,1)         69.98     72.39     73.78      74.50       74.69
(2,2,1)         80.17     83.04     85.39      89.94       90.23
(2,1,1,1)       76.89     79.91     81.55      86.62       87.02


The variables follow the Poisson distribution with parameters µ = 0.8 and λ = 0.2.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           21.59     26.00     31.87      43.53       43.51
(3,2)           24.34     32.29     43.88      64.86       64.84
(3,1,1)         13.59     17.44     22.75      33.96       34.08
(2,2,1)         13.18     19.72     27.68      44.95       45.63
(2,1,1,1)       21.41     26.48     32.30      44.65       45.27

In the (4,1) model, the variables Xj∼ Bin(n, p1), j = 1, 2, 3, 4, X5∼ Bin(n, p2) and U1∼ Bin(n, p3).

In the (3,2) model, the variables Xj∼ Bin(n, p1), j = 1, 2, 3, 4, 5 and Ui∼ Bin(n, p3), i = 1, 2.

In the (3,1,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, 3, X4,5∼ Bin(n, p2) and U1∼ Bin(n, p3).

In the (2,2,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, 3, 4, X5∼ Bin(n, p2) and Ui∼ Bin(n, p3), i = 1, 2.

In the (2,1,1,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, X3,4,5∼ Bin(n, p2) and U1∼ Bin(n, p3).

The variables follow the binomial distribution with parameters n = 5, p1 = 0.1, p2 = 0.2 and p3 = 0.1.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           44.97     49.60     51.02      50.82       50.82
(3,2)           67.83     74.80     76.56      76.85       79.50
(3,1,1)         34.34     39.71     41.46      42.37       42.39
(2,2,1)         44.82     52.53     54.38      55.42       55.47
(2,1,1,1)       44.35     50.10     52.58      54.18       54.42

The variables follow the binomial distribution with parameters n = 10, p1 = 0.05, p2 = 0.1 and p3 = 0.05.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           46.10     50.31     51.44      51.38       51.48
(3,2)           67.47     74.47     76.20      77.19       79.24
(3,1,1)         28.77     36.52     39.24      40.53       41.38
(2,2,1)         45.29     52.96     54.85      56.14       56.34
(2,1,1,1)       46.26     51.64     53.93      55.43       55.97

The variables follow the binomial distribution with parameters n = 10, p1= 0.08, p2= 0.1 and p3= 0.02.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           21.03     25.89     31.71      43.57       43.95
(3,2)           25.97     34.83     47.05      65.09       65.34
(3,1,1)         13.74     17.84     23.18      34.09       34.09
(2,2,1)         14.81     20.66     28.93      45.27       45.23
(2,1,1,1)       21.43     26.13     32.34      44.43       44.88

The variables follow the binomial distribution with parameters n = 10, p1= 0.02, p2= 0.1 and p3= 0.08.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           77.19     79.04     80.35      80.60       79.03
(3,2)           96.40     98.12     99.13     100         100
(3,1,1)         69.73     70.44     71.81      72.64       71.43
(2,2,1)         78.59     81.45     83.15      88.64       88.07
(2,1,1,1)       74.96     77.65     79.22      82.56       78.59

Also, we can calculate a confidence interval for the estimated proportions using the binomial distribution and the number r of replicates; that is, an interval that covers the probability p that k-means finds the right model, given the estimate p̂. For example, for the model (4,1) when X_j ∼ Bin(5, 0.1), U_i ∼ Bin(5, 0.1), at n = 25 the estimated proportion is 0.4497. Thus, we have:

p̂ ± 1.96 √(p̂(1 − p̂)/r)    (10)

= 0.4497 ± 1.96 √(0.4497(1 − 0.4497)/100,000) = [0.4466, 0.4528]

is a 95% confidence interval for p.
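Formula (10) is straightforward to compute; a small Python helper (the function name is ours) reproduces the interval above.

```python
import math

def wald_ci(p_hat, r, z=1.96):
    """Normal-approximation (Wald) CI for a proportion estimated from r replicates."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / r)
    return p_hat - half, p_hat + half

# The example above: p_hat = 0.4497 from 100,000 replications
lo, hi = wald_ci(0.4497, 100_000)  # close to [0.4466, 0.4528]
```

With r = 100,000 replications the intervals are very narrow, which is why the accuracies in the tables are reported to two decimals.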

In the next table, we calculated a few more confidence intervals.

Model       Distributions                                 n = 100            n = 1000           n = 10,000
(4,1)       Xj ∼ Bin(5, 0.1), Ui ∼ Bin(5, 0.1)            [0.5071, 0.5133]   [0.5051, 0.5113]   [0.5051, 0.5113]
(4,1)       Xj ∼ Bin(10, 0.05), Ui ∼ Bin(10, 0.05)        [0.5113, 0.5175]   [0.5107, 0.5169]   [0.5117, 0.5179]
(4,1)       Xj ∼ Bin(10, 0.08), Ui ∼ Bin(10, 0.02)        [0.3142, 0.3200]   [0.4326, 0.4388]   [0.4284, 0.4346]
(3,1,1)     Xj ∼ Bin(10, 0.08), Ui ∼ Bin(10, 0.02)        [0.2292, 0.2344]   [0.3380, 0.3438]   [0.3380, 0.3438]
(2,2,1)     Xj ∼ Bin(10, 0.08), Ui ∼ Bin(10, 0.02)        [0.2865, 0.2921]   [0.4496, 0.4558]   [0.4492, 0.4554]
(4,1)       Xj ∼ Bin(10, 0.02), Ui ∼ Bin(10, 0.08)        [0.8010, 0.8060]   [0.8036, 0.8085]   [0.7878, 0.7928]
(3,1,1)     Xj ∼ Bin(10, 0.02), Ui ∼ Bin(10, 0.08)        [0.7928, 0.7209]   [0.7236, 0.7292]   [0.7115, 0.7171]
(2,2,1)     Xj ∼ Bin(10, 0.02), Ui ∼ Bin(10, 0.08)        [0.8292, 0.8338]   [0.8844, 0.8884]   [0.8787, 0.8827]
(2,1,1,1)   Xj ∼ Bin(10, 0.02), Ui ∼ Bin(10, 0.08)        [0.7897, 0.7947]   [0.8233, 0.8280]   [0.7834, 0.7884]

Going back to the simulations, in the next table we have:

In the (4,1) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, X5∼ P (1) and U1∼ Bin(n, p1).

In the (3,2) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, 5 and Ui∼ Bin(n, p1), i = 1, 2.

In the (3,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, X4,5 ∼ P (1) and U1∼ Bin(n, p1).

In the (2,2,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, X5 ∼ P (1) and Ui ∼ Bin(n, p1), i = 1, 2.

In the (2,1,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, X3,4,5∼ P (1) and U1∼ Bin(n, p1).

The variables Xj follow the Poisson distribution with parameter µ = 0.5 and the factors Ui follow the binomial distribution with parameters n = 10, p1 = 0.05.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           46.83     50.68     51.55      51.83       51.89
(3,2)           65.39     73.20     75.33      75.45       75.98
(3,1,1)         36.74     41.46     43.72      44.34       44.73
(2,2,1)         44.97     52.10     55.05      56.40       56.98
(2,1,1,1)       47.00     52.58     55.00      56.74       57.26

Here, Xj follow the Poisson distribution with parameter µ = 0.2 and Ui follow the binomial distribution with parameters n = 10, p1 = 0.08.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           78.05     80.05     81.34      82.46       84.13
(3,2)           96.07     97.79     98.97     100         100
(3,1,1)         69.64     72.08     74.20      76.80       79.49
(2,2,1)         79.60     82.40     84.95      90.50       91.68
(2,1,1,1)       77.26     80.07     82.30      92.04       99.09

Here, Xj follow the Poisson distribution with parameter µ = 0.8 and Ui follow the binomial distribution with parameters n = 10, p1 = 0.02.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           21.17     25.77     31.78      43.56       43.41
(3,2)           23.60     31.92     43.16      65.02       64.81
(3,1,1)         13.61     17.34     22.62      33.83       34.23
(2,2,1)         13.91     19.47     27.32      45.03       45.38
(2,1,1,1)       21.65     26.19     32.01      45.01       45.11

In the (4,1) model, the variables Xj∼ Bin(n, p1), j = 1, 2, 3, 4, X5∼ Bin(n, p2) and U1∼ P (µ).

In the (3,2) model, the variables Xj∼ Bin(n, p1), j = 1, 2, 3, 4, 5 and Ui∼ P (µ), i = 1, 2.

In the (3,1,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, 3, X4,5∼ Bin(n, p2) and U1∼ P (µ).

In the (2,2,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, 3, 4, X5∼ Bin(n, p2) and Ui∼ P (µ), i = 1, 2.

In the (2,1,1,1) model, the variables Xj ∼ Bin(n, p1), j = 1, 2, X3,4,5∼ Bin(n, p2) and U1∼ P (µ).

Here, Xj follow the binomial distribution with parameters n = 10, p1 = 0.05 and Ui follow the Poisson distribution with parameter µ = 0.5.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           46.72     50.61     51.73      51.71       51.68
(3,2)           68.75     75.66     77.39      78.99       80.38
(3,1,1)         37.15     41.90     43.50      44.34       44.72
(2,2,1)         46.25     53.43     55.51      56.25       56.43
(2,1,1,1)       46.48     51.67     53.93      55.83       56.06

Here, Xj follow the binomial distribution with parameters n = 10, p1 = 0.08 and Ui follow the Poisson distribution with parameter µ = 0.2.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           21.27     25.93     32.10      43.24       43.44
(3,2)           26.29     35.56     47.39      65.40       65.19
(3,1,1)         13.85     18.18     23.63      34.10       34.30
(2,2,1)         15.07     20.83     29.40      45.22       45.20
(2,1,1,1)       21.67     26.23     32.77      44.30       44.95

Here, Xj follow the binomial distribution with parameters n = 10, p1 = 0.02 and Ui follow the Poisson distribution with parameter µ = 0.8.

Accuracy (%) by sample size:

Model          n = 25    n = 50    n = 100   n = 1000   n = 10000
(4,1)           78.26     80.14     81.53      82.63       84.19
(3,2)           97.06     98.68     99.49     100         100
(3,1,1)         69.27     71.24     72.36      72.23       70.51
(2,2,1)         79.45     81.86     84.06      87.90       86.78
(2,1,1,1)       75.12     77.26     78.47      78.49       75.94

[Plots of accuracy versus sample size, drawn with the Matlab command semilogx, which plots the x- and y-coordinates using a base-10 logarithmic scale on the x-axis and a linear scale on the y-axis, are omitted here.]

4.3 Dimension 7

At first, we started simulating with the models (7) and (1,1,1,1,1,1,1), and we achieved 100% accuracy in all of them. All the simulations have been done with 100,000 replications.

In the (6,1) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, 5, 6, X7∼ P (1) and U1∼ P (λ).

In the (5,2) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, 5, 6, 7 and Ui∼ P (λ), i = 1, 2.

In the (5,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, 5, X6,7 ∼ P (1) and U1∼ P (λ).

In the (4,3) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, 5, 6, 7 and Ui∼ P (λ), i = 1, 2.

In the (4,2,1) model, the variables Xj ∼ P (µ) , j = 1, 2, 3, 4, 5, 6, X7∼ P (1) and Ui∼ P (λ), i = 1, 2.

In the (4,1,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, X5,6,7∼ P (1) and U1∼ P (λ).

In the (3,3,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, 5, 6, X7∼ P (1) and Ui∼ P (λ), i = 1, 2.

In the (3,2,2) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, 5, 6, 7 and Ui ∼ P (λ), i = 1, 2, 3.

In the (3,2,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, 5, X6,7∼ P (1) and Ui∼ P (λ), i = 1, 2.

In the (3,1,1,1,1) model, the variables Xj∼ P (µ), j = 1, 2, 3, X4,5,6,7 ∼ P (1) and U1∼ P (λ).

In the (2,2,2,1) model, the variables Xj ∼ P (µ), j = 1, 2, 3, 4, 5, 6, X7∼ P (1) and Ui∼ P (λ), i = 1, 2, 3.

In the (2,2,1,1,1) model, the variables Xj∼ P (µ), j = 1, 2, 3, 4, X5,6,7∼ P (1) and Ui∼ P (λ), i = 1, 2.

In the (2,1,1,1,1,1) model, the variables Xj ∼ P (µ), j = 1, 2, X3,4,5,6,7∼ P (1) and U1∼ P (λ).

The variables Xj follow the Poisson distribution with parameter µ = 0.5 and the variables Ui also follow the Poisson distribution with parameter λ = 0.5.

Accuracy (%) by sample size:

Model           n = 25    n = 50    n = 100   n = 1000   n = 10000
(6,1)            33.74     37.79     38.55      38.80       38.94
(5,2)            55.02     63.80     66.87      69.80       70.72
(5,1,1)          18.70     22.56     24.02      24.72       25.29
(4,3)            65.67     75.88     79.74      81.86       82.22
(4,2,1)          26.84     33.54     36.04      37.90       38.39
(4,1,1,1)        15.16     19.19     20.87      22.21       22.46
(3,3,1)          29.07     36.85     40.11      41.80       41.69
(3,2,2)          37.12     48.72     52.03      53.94       56.08
(3,2,1,1)        20.37     27.60     29.91      31.96       32.34
(3,1,1,1,1)      18.53     23.73     25.96      27.67       28.23
(2,2,2,1)        17.87     22.56     24.52      24.83       25.48
(2,2,1,1,1)      21.46     28.48     32.82      34.67       35.26
(2,1,1,1,1,1)    34.21     40.69     43.69      46.61       47.35

The variables Xj follow the Poisson distribution with parameter µ = 0.2 and the variables Ui also follow the Poisson distribution with parameter λ = 0.8.

Accuracy (%) by sample size:

Model           n = 25    n = 50    n = 100   n = 1000   n = 10000
(6,1)            66.23     69.97     72.09      72.58       72.50
(5,2)            92.59     93.55     93.73      92.94       92.92
(5,1,1)          51.28     54.27     55.86      57.44       57.65
(4,3)            97.65     98.88     99.38      99.97      100
(4,2,1)          65.17     68.32     70.07      74.01       76.10
(4,1,1,1)        46.69     49.91     52.29      55.52       57.71
(3,3,1)          67.04     70.30     71.39      72.61       72.70
(3,2,2)          83.12     86.56     88.09      89.51       89.55
(3,2,1,1)        56.16     59.99     62.85      66.68       67.30
(3,1,1,1,1)      51.13     54.59     56.80      59.66       59.89
(2,2,2,1)        64.89     67.19     68.92      72.22       72.16
(2,2,1,1,1)      58.02     62.27     65.74      73.04       73.50
(2,1,1,1,1,1)    66.71     70.09     72.88      78.40       78.87

References
