On Asymptotic Properties of Principal Component Analysis

U.U.D.M. Project Report 2020:36

Degree project in mathematics, 15 credits. Supervisor: Silvelyn Zwanzig. Examiner: Martin Herschend. June 2020

Department of Mathematics Uppsala University


Johannes Graner


Abstract

This paper looks at the asymptotic behavior of principal component analysis as the sample size increases. Specifically, the number of components required to retain a given proportion of variance in the sample is investigated for independent and identically distributed observations from a normal distribution. The number of components required for three special cases is estimated using Monte Carlo integration, and the results are compared to those from an asymptotic test for the number of components required.


Contents

Introduction

1 Principal component analysis
  1.1 Singular Value Decomposition
  1.2 Deriving the components

2 Random Matrix Theory
  2.1 Wishart distribution

3 Number of required components
  3.1 Methodology
  3.2 Unstructured Covariance
    3.2.1 Sequence with linear decay
    3.2.2 Sequence with exponential decay
  3.3 Structured Covariance
  3.4 Analysis


Introduction

When collecting data, many different types of measurements are often taken for the same object. For example, when investigating the performance of racing cars, one could measure the cars' weights, the sizes of the engines and the force applied by the brakes, among other things. Many different measurements increase the variance in the data, but at the same time high-dimensional data is more difficult to analyze, and each additional measurement can be thought of as adding another dimension to the data.

There exist many techniques for simplifying high-dimensional data and one of them is Principal Component Analysis (PCA) where the dimensionality of the data is reduced by projections. However, it is not always evident how much the data can be simplified without losing too much of the variance in the data.

That problem is investigated in this thesis, specifically focusing on how much the dimensionality can be reduced depending on the sample size. First, some theory and background about PCA is covered, and then two different methods of measuring the number of components (dimensions) required for normally distributed data are presented. In later sections, the number of components required is investigated for a few special cases and a comparison between the two methods is made.

1 Principal component analysis

Given a sample of size n of a p-dimensional random variable X, one could imagine that most of the inherent randomness would be explained well by fewer than p linear combinations of the components of X.

Expressed more formally, for $X \in \mathbb{R}^p$ we try to find $Y \in \mathbb{R}^q$ such that $q < p$, $y_i = \sum_{j=1}^{p} a_{ij} x_j$ for $i = 1, \dots, q$, and we don't lose a significant amount of variance by considering samples of Y instead of samples of X. Since the "amount of randomness" can be measured by variances and covariances, it could be fruitful to consider the covariance matrix of X in order to find the best linear combinations $y_i$. It makes sense to be interested in maximizing randomness, since a constant variable could be completely understood by just one observation. Hence, it is reasonable to choose $y_i$ with large variances, and since we want each additional $y_i$ to only add new information, the linear combinations $y_i$ are chosen such that their covariances are minimized. The $y_i$ chosen in this manner are called the principal components of X, but in order to make this argument sufficiently rigorous, we first need some basic properties of vectors and matrices.

1.1 Singular Value Decomposition

Let $[\sigma_{ij}] = \Sigma := \mathrm{Cov}(X)$ be the covariance matrix of X, where $\sigma_{ii} = \mathrm{Var}(X_i)$, $i = 1, \dots, p$, and $\sigma_{ij} = \mathrm{Cov}(X_i, X_j)$ for $i \ne j$. Since $\mathrm{Cov}(U, V) = \mathrm{Cov}(V, U)$ for any scalar-valued random variables U, V, the matrix Σ is symmetric. If the components of X are a.s. linearly independent, then for any $a \in \mathbb{R}^p \setminus \{0\}$, $0 < \mathrm{Var}(a^T X) = a^T \Sigma a$, so Σ is positive definite.

Having Σ both symmetric and positive definite, we use the spectral theorem to factorize $\Sigma = E \Lambda E^T$, where

$$E = \begin{pmatrix} e_1 & e_2 & \dots & e_p \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_p \end{pmatrix}$$

and $e_i$ is the eigenvector of Σ corresponding to eigenvalue $\lambda_i$. Since Σ is symmetric and positive definite, we also know that all eigenvalues are real and positive, so we order them as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$.

In order to properly define and construct the principal components, an additional lemma is needed. The lemma together with its proof can be found in [2] on page 83.

Lemma 1.1. For a positive definite matrix $B \in M_{p \times p}(\mathbb{R})$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ and corresponding eigenvectors $e_1, e_2, \dots, e_p$ such that $e_i^T e_i = 1$ for all $i$,

$$\max_{x \ne 0} \frac{x^T B x}{x^T x} = \lambda_1 \quad (\text{attained for } x = e_1),$$

$$\max_{x \perp e_1, \dots, e_k} \frac{x^T B x}{x^T x} = \lambda_{k+1} \quad (\text{attained for } x = e_{k+1}),$$

where $x \perp y \iff x^T y = 0$.

1.2 Deriving the components

With these tools, we are able to determine the best choices of $y_i$.

Definition 1.1. Given a random vector $X = (X_1, X_2, \dots, X_p)^T$, the Principal Components of X are the random variables

$$Y_i = \frac{a_i^T X}{\sqrt{a_i^T a_i}} = \frac{\sum_{j=1}^p a_{ij} X_j}{\|a_i\|}, \qquad i = 1, \dots, p,$$

such that

$$\mathrm{Var}(Y_i) = \max_{a \in A_i} \mathrm{Var}\!\left(\frac{a^T X}{\sqrt{a^T a}}\right),$$

where $\|a\| = \sqrt{a^T a}$, $A_1 = \mathbb{R}^p \setminus \{0\}$ and $A_i = \left\{a \in \mathbb{R}^p \setminus \{0\} : \mathrm{Cov}\!\left(\frac{a^T X}{\sqrt{a^T a}},\, Y_j\right) = 0 \ \forall j < i \right\}$ for $i = 2, \dots, p$.

The principal components of X are the p uncorrelated linear combinations with maximal variances, which is precisely what was sought. In order to construct the components in practice, the vectors $a_i$ are required. If X has covariance matrix $\mathrm{Cov}(X) = \Sigma$, then $\mathrm{Var}(Y_i) = \frac{a_i^T \Sigma a_i}{a_i^T a_i}$ and $\mathrm{Cov}(Y_i, Y_j) = \frac{a_i^T \Sigma a_j}{\|a_i\| \cdot \|a_j\|}$. When comparing the formula for $\mathrm{Var}(Y_i)$ with the expression that is maximized in Lemma 1.1, one can see that they are very similar. Assuming the components of X are a.s. linearly independent, Σ is both symmetric and positive definite, so Lemma 1.1 can be used to determine the principal components of X.


Theorem 1.2. Suppose X is a random vector with a.s. linearly independent components and covariance matrix Σ. Then the principal components of X are given by $Y_i = e_i^T X$ and $\mathrm{Var}(Y_i) = \lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ are the eigenvalues of Σ with corresponding normalized eigenvectors $e_1, e_2, \dots, e_p$.

Proof. To get $Y_1$, we use Lemma 1.1 to see that

$$\max_{a \ne 0} \mathrm{Var}\!\left(\frac{a^T X}{\sqrt{a^T a}}\right) = \max_{a \ne 0} \frac{a^T \Sigma a}{a^T a} = \lambda_1,$$

which is attained by $a = e_1$. Now, suppose $Y_i = e_i^T X$ for $i \le k - 1$. We see that for $i \le k - 1$, $\mathrm{Cov}(Y_i, Y_k) = 0 \iff e_i^T \Sigma a_k = 0$, and since Σ is symmetric and positive definite, $e_i^T \Sigma = \lambda_i e_i^T$, so $\mathrm{Cov}(Y_i, Y_k) = 0 \iff e_i^T a_k = 0$. Hence $Y_k$ is uncorrelated with all previous principal components if and only if $a_k \perp e_1, \dots, e_{k-1}$. We now use Lemma 1.1 once again to determine that

$$\max_{a_k \perp e_1, \dots, e_{k-1}} \mathrm{Var}(Y_k) = \max_{a_k \perp e_1, \dots, e_{k-1}} \frac{a_k^T \Sigma a_k}{a_k^T a_k} = \lambda_k$$

and that this maximum is attained for $a_k = e_k$. So, by the induction principle, $Y_i = e_i^T X$ and $\mathrm{Var}(Y_i) = \lambda_i$ for all $i = 1, \dots, p$.

We remark that if there are repeated eigenvalues (i.e. $\lambda_i = \lambda_j$ for some $i \ne j$), then the principal components are not unique. This problem can be resolved by noting that it is possible to find an orthogonal basis for the eigenspace corresponding to the repeated eigenvalue and using those basis vectors (in any order) to determine the principal components corresponding to that eigenvalue.

So far we have only considered the population principal components. Thankfully, the theory carries over to the sample case as well, merely requiring that we consider $\hat{\Sigma} = S = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T$ for an i.i.d. sample $X_1, X_2, \dots, X_n$. The eigenvalues and eigenvectors of S are then used to find the sample principal components.

Now that we have the principal components of X, we would like to analyze the distributions of the eigenvalues and eigenvectors of S, since they determine the principal components.
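The sample recipe above (estimate S, take its eigendecomposition, project the centered sample onto the eigenvectors) can be sketched in a few lines. A Python/NumPy illustration follows; the sample and covariance matrix are invented for the example, and this is not the thesis's own code (the thesis's computations were done in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. sample of n observations of a p-dimensional normal vector X
n, p = 500, 3
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# sample covariance matrix S = 1/(n-1) * sum_i (X_i - Xbar)(X_i - Xbar)^T
S = np.cov(X, rowvar=False)

# eigendecomposition of S; eigh returns eigenvalues in ascending order,
# so reverse to get lambda_1 >= ... >= lambda_p
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]

# sample principal components Y_i = e_i^T (X - Xbar)
Y = (X - X.mean(axis=0)) @ E

# the components are uncorrelated and Var(Y_i) = lambda_i, as in Theorem 1.2
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))   # True
```

The final check mirrors Theorem 1.2 in its sample form: the covariance matrix of the projected data is diagonal with the eigenvalues of S on the diagonal.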

2 Random Matrix Theory

Since $X_1, \dots, X_n$ is a sample of observations of the random vector X, the estimated covariance matrix S is inherently random as well. We therefore need some theory for matrices whose elements are random, called random matrices. The theory of random matrices is very broad, so we will focus only on what is needed, which in this case is the distribution of S. If X were allowed to have any distribution, there would be very little control over the distribution of S, so we consider only the case where X follows a normal distribution, since there is a lot of literature covering this case.


2.1 Wishart distribution

Let $Z_1, \dots, Z_m \stackrel{iid}{\sim} N_p(0, \Sigma)$. Then $S = \sum_{j=1}^m Z_j Z_j^T \sim W_p(m, \Sigma)$, where $W_p(m, \Sigma)$ is the p-dimensional central Wishart distribution with scale matrix Σ and m degrees of freedom.

Since $ZZ^T = [z_i z_j]_{ij}$, it is clear that S is symmetric, and we observe that if $x \in \mathbb{R}^p$, then $x^T Z Z^T x = (Z^T x)^T (Z^T x) = (Z^T x)^2 \ge 0$, with equality only if $Z^T x = 0$. For S, this only happens if $Z_j^T x = 0$ for all $j = 1, \dots, m$, so if $m \ge p$, then $x^T S x = 0 \iff x = 0$ and we can conclude that S is a.s. positive definite whenever $m \ge p$ (which will always be the case in this investigation). The density of the aforementioned Wishart-distributed random matrix S is as follows:

$$f_S(X) = \begin{cases} \dfrac{|X|^{(m-p-1)/2}\, e^{-\mathrm{tr}(\Sigma^{-1} X)/2}}{2^{mp/2}\, |\Sigma|^{m/2}\, \Gamma_p(m/2)}, & |X| > 0 \\[6pt] 0, & \text{otherwise} \end{cases} \tag{1}$$

where $|X| = \det(X)$ and $\Gamma_p$ is the multivariate gamma function. We will not need to work with the density directly, but there are some properties of the Wishart distribution that will prove very useful. One of these is the expectation, $E(S) = m\Sigma$. This also implies that if we let $U_j = \frac{1}{\sqrt{m}} Z_j$, then $\tilde{S} = \sum_{j=1}^m U_j U_j^T \sim W_p(m, \frac{1}{m}\Sigma)$ would have expectation $E(\tilde{S}) = \Sigma$.
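The construction and the expectation property are easy to check by simulation. A small Python sketch (the dimensions and scale matrix are invented for the illustration and are not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

p, m = 4, 10                                   # m >= p
Sigma = np.diag([3.0, 2.0, 1.0, 0.5])

# Z_1, ..., Z_m iid ~ N_p(0, Sigma); the rows of Z are the Z_j^T
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
S = Z.T @ Z                                    # sum of outer products Z_j Z_j^T

print(np.allclose(S, S.T))                     # S is symmetric
print(bool(np.all(np.linalg.eigvalsh(S) > 0))) # and positive definite (m >= p)

# E(S) = m * Sigma: the average over many replications approaches m * Sigma
reps = 20_000
Zr = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, m))
S_mean = np.einsum('rmi,rmj->ij', Zr, Zr) / reps
print(np.allclose(S_mean, m * Sigma, atol=1.5))
```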

The property that we will be most interested in and find most useful is the fact that since a Wishart matrix is a.s. positive definite, all of its eigenvalues are real and positive.

This means that they can be ordered as σ1 ≥ · · · ≥ σp > 0 and that it is meaningful to discuss not only the joint distribution of the eigenvalues, but also that of the largest q eigenvalues. This has been explored in a paper by Zanella et al. [6] and there the following form for the joint distribution of the eigenvalues of S is presented.

$$f_\sigma(x) = \begin{cases} K\,|\Phi(x)| \cdot |\Psi(x, \sigma)| \prod_{l=1}^p \xi(x_l), & x_1 > \dots > x_p > 0 \\ 0, & \text{otherwise} \end{cases}$$

$$K = \frac{\prod_{i=1}^p (i-1)! \; |\Sigma|^{-m}}{|V_2(\sigma)| \prod_{j=1}^p (m-j)! \prod_{k=1}^p (p-k)!},$$

$$\Phi(x) = [x_j^{i-1}]_{ij}, \qquad \Psi(x, \sigma) = [e^{-x_j/\sigma_i}]_{ij}, \qquad V_2(\sigma) = [(-\sigma_j)^{1-i}]_{ij}, \qquad \xi(x) = x^{m-p}, \tag{2}$$

where σ is a vector of the ordered eigenvalues of Σ. If the eigenvalues $\sigma_j$ are not distinct, the determinant $|V_2(\sigma)| = 0$. According to [6], this can be fixed by considering the limit $\lim_{\sigma_2, \dots, \sigma_l \to \sigma_1} \frac{|\Psi(x, \sigma)|}{|V_2(\sigma)|}$. Even if Σ does not have distinct eigenvalues, the eigenvalues of S are a.s. distinct, since if $x_i = x_j$ for some $i \ne j$, then $f_\sigma(x) = 0$. The density written in this form looks fairly complicated, so we try to simplify it. We note that Φ and $V_2$ seem to have a similar structure.


Definition 2.1. Let $a = [a_1, a_2, \dots, a_p]^T \in \mathbb{R}^p$. Then $A = [a_j^{i-1}]_{ij}$, $i, j = 1, \dots, p$, is the $p \times p$ Vandermonde matrix corresponding to a.

Both Φ and $V_2$ have this structure ($V_2(\sigma) = [(-\sigma_j)^{1-i}]_{ij} = [(-\frac{1}{\sigma_j})^{i-1}]_{ij}$), so they are both $p \times p$ Vandermonde matrices, whose determinants have fairly simple forms.

Lemma 2.1. Let $a = [a_1, a_2, \dots, a_p]^T \in \mathbb{R}^p$ and let A be the $p \times p$ Vandermonde matrix corresponding to a (i.e. $A = [a_j^{i-1}]_{ij}$). Then

$$\det(A) = \prod_{1 \le i < j \le p} (a_j - a_i) = \prod_{j=2}^{p} \prod_{i=1}^{j-1} (a_j - a_i).$$

Proof. Let $A_n$ be the Vandermonde matrix that is constructed when considering $\tilde{a}_n = [a_{p-n+1}, \dots, a_p]$ for $n = 2, \dots, p$. Then

$$\det(A_2) = \det\begin{pmatrix} 1 & 1 \\ a_{p-1} & a_p \end{pmatrix} = a_p - a_{p-1},$$

so the result holds for $n = 2$. In order to make an induction argument, assume that the result holds for $n = k$, i.e. that $\det(A_k) = \prod_{1 \le i < j \le k} (a_j - a_i)$ (with indices taken within $\tilde{a}_k$). We then get the following:

$$\det(A_{k+1}) = \begin{vmatrix} 1 & 1 & \dots & 1 & 1 \\ a_{p-k} & a_{p-k+1} & \dots & a_{p-1} & a_p \\ a_{p-k}^2 & a_{p-k+1}^2 & \dots & a_{p-1}^2 & a_p^2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{p-k}^k & a_{p-k+1}^k & \dots & a_{p-1}^k & a_p^k \end{vmatrix}
= \begin{vmatrix} 1 & 0 & \dots & 0 \\ a_{p-k} & a_{p-k+1} - a_{p-k} & \dots & a_p - a_{p-k} \\ a_{p-k}^2 & a_{p-k+1}^2 - a_{p-k}^2 & \dots & a_p^2 - a_{p-k}^2 \\ \vdots & \vdots & \ddots & \vdots \\ a_{p-k}^k & a_{p-k+1}^k - a_{p-k}^k & \dots & a_p^k - a_{p-k}^k \end{vmatrix}$$

$$= \begin{vmatrix} 1 & 0 & \dots & 0 \\ 0 & a_{p-k+1} - a_{p-k} & \dots & a_p - a_{p-k} \\ 0 & (a_{p-k+1} - a_{p-k})\,a_{p-k+1} & \dots & (a_p - a_{p-k})\,a_p \\ \vdots & \vdots & \ddots & \vdots \\ 0 & (a_{p-k+1} - a_{p-k})\,a_{p-k+1}^{k-1} & \dots & (a_p - a_{p-k})\,a_p^{k-1} \end{vmatrix}
= \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \begin{vmatrix} 1 & \dots & 1 \\ a_{p-k+1} & \dots & a_p \\ \vdots & \ddots & \vdots \\ a_{p-k+1}^{k-1} & \dots & a_p^{k-1} \end{vmatrix}$$

$$= \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \cdot |A_k| = \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \prod_{1 \le i < l \le k} (a_l - a_i) = \prod_{1 \le i < j \le k+1} (a_j - a_i).$$

The result follows from induction on k.

Note that in the proof, the first matrix equality is given by subtracting the first column from all the other columns, and the second equality is given by subtracting $a_{p-k}$ times the row just above from each row, starting from the bottom so that this operation does not interfere with the later iterations. The last equality is true since $\prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k})$ is the product of the sequence $a_{p-k+1} - a_{p-k}, \dots, a_p - a_{p-k}$ and therefore contains exactly the factors missing from $\prod_{1 \le i < j \le k}(a_j - a_i)$ to get $\prod_{1 \le i < j \le k+1}(a_j - a_i)$. The above proof was inspired by [5].

In Zanella et al. [6], $V_2 = [-\sigma_j^{1-i}]_{ij}$ (instead of $[(-\sigma_j)^{1-i}]_{ij}$ as in Equation (2)), but this produces a negative density for some values of p. Since the inclusion of the parentheses only changes the sign of the determinant and not its absolute value, this discrepancy can only affect the sign of the density, which has to be non-negative everywhere. This investigation will therefore use the form given in Equation (2), as this produces a non-negative density. The discrepancy is illustrated in the following example.

Example 2.2. Let $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $m = 3$. Then $p = 2$, $\sigma_1 = 3$, $\sigma_2 = 1$. Let $V_{2,Z}$ be the form of $V_2(\sigma)$ given in Zanella et al. and $V_{2,C}$ the corrected form in Equation (2). Then

$$|V_{2,Z}| = \det\begin{pmatrix} -1 & -1 \\ -\tfrac{1}{3} & -1 \end{pmatrix} = \tfrac{2}{3}, \qquad |V_{2,C}| = \det\begin{pmatrix} 1 & 1 \\ -\tfrac{1}{3} & -1 \end{pmatrix} = -\tfrac{2}{3},$$

so there is a discrepancy in the sign of the determinant. In order to determine which of them is right, we let $x = [x_1, x_2]^T$ with $x_1 > x_2 > 0$ and get

$$|\Phi(x)| = x_2 - x_1 < 0, \qquad \prod_{l=1}^{2} \xi(x_l) = x_1 \cdot x_2 > 0,$$

$$\Psi(x) = \begin{pmatrix} e^{-x_1/3} & e^{-x_2/3} \\ e^{-x_1} & e^{-x_2} \end{pmatrix} \implies |\Psi(x)| = e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}.$$

Since $x_1 > x_2 > 0$, we get $\frac{x_1}{3} + x_2 - (x_1 + \frac{x_2}{3}) = -\frac{2x_1}{3} + \frac{2x_2}{3} < 0$, so $-(\frac{x_1}{3} + x_2) > -(x_1 + \frac{x_2}{3})$ and hence $|\Psi(x)| > 0$, since $e^x$ is monotonically increasing in x. Further,

$$K = \frac{\prod_{i=1}^{2} (i-1)! \; |\Sigma|^{-3}}{|V_2(\sigma)| \prod_{j=1}^{2} (3-j)! \prod_{k=1}^{2} (2-k)!} = \frac{1 \cdot 3^{-3}}{|V_2(\sigma)| \cdot 2 \cdot 1} = \frac{1}{54 \cdot |V_2(\sigma)|} \implies K_Z = \frac{1}{36}, \quad K_C = -\frac{1}{36}.$$

Combining these results, $f_{\sigma,Z}(x) = K_Z (x_2 - x_1)\big(e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}\big) x_1 x_2 < 0$ and $f_{\sigma,C}(x) = K_C (x_2 - x_1)\big(e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}\big) x_1 x_2 > 0$.

Since fσ is a density function, fσ(x) ≥ 0 for any value of x. Hence, KZ cannot be the right form of K. When implementing the density function with KC in Matlab, the integral over the whole domain is 1 for several different choices of p, Σ and m which suggests that this is indeed the correct form of K.

Using Lemma 2.1, we see that $|\Phi(x)| = \prod_{j=2}^p \prod_{i=1}^{j-1} (x_j - x_i)$ and

$$|V_2(\sigma)| = \prod_{j=2}^p \prod_{i=1}^{j-1} \left( -\frac{1}{\sigma_j} + \frac{1}{\sigma_i} \right) = \prod_{j=2}^p \prod_{i=1}^{j-1} \frac{\sigma_j - \sigma_i}{\sigma_i \sigma_j}.$$

Further, K contains the same factor in both the numerator and the denominator: $\prod_{i=1}^p (i-1)! = \prod_{k=1}^p (p-k)!$, which follows by observing that they are products of the same sequence but in reversed order, and multiplication is commutative. Another simplification that can be made is $|\Sigma| = \prod_{j=1}^p \sigma_j$, and by combining all of these simplifications we arrive at the following form for the density of the eigenvalues:

$$f_\sigma(x) = |\Psi(x)| \cdot \prod_{j=1}^p \left( \frac{x_j^{m-p}}{\sigma_j^m (m-j)!} \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right) \tag{3}$$

$$= |\Psi(x)| \cdot \prod_{j=1}^p \left( \frac{1}{x_j^p (m-j)!} \left(\frac{x_j}{\sigma_j}\right)^m \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right),$$

where the empty product that occurs when $j = 1$ is interpreted as being 1. An interesting observation resulting from this form of the density is that the density of the eigenvalues of a Wishart matrix does not depend on the exact form or structure of Σ, only on the eigenvalues of Σ.
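Equation (3) can be implemented directly. The Python sketch below (the thesis's own implementation of the density was in Matlab; the evaluation point is invented) computes the eigenvalue density for given σ and m, checks its positivity on the ordered domain, and checks the scaling property $f_{a\sigma}(ax) = a^{-p} f_\sigma(x)$ of Lemma 2.3 below:

```python
import math
import numpy as np

def eigenvalue_density(x, sigma, m):
    """Joint density (3) of the ordered eigenvalues x_1 > ... > x_p > 0 of a
    W_p(m, Sigma) matrix, where sigma holds the ordered, distinct eigenvalues
    of Sigma."""
    x, sigma = np.asarray(x, float), np.asarray(sigma, float)
    p = len(sigma)
    # |Psi| with Psi = [exp(-x_j / sigma_i)]_{ij}
    val = np.linalg.det(np.exp(-x[None, :] / sigma[:, None]))
    for j in range(p):                   # 0-based: j corresponds to j + 1 in the text
        val *= x[j] ** (m - p) / (sigma[j] ** m * math.factorial(m - j - 1))
        for i in range(j):               # empty product when j = 0
            val *= sigma[i] * sigma[j] * (x[j] - x[i]) / (sigma[j] - sigma[i])
    return val

sigma = np.array([3.0, 1.0])             # eigenvalues of Sigma as in Example 2.2
m, a = 3, 2.0
x = np.array([4.0, 1.5])                 # a point with x_1 > x_2 > 0

f = eigenvalue_density(x, sigma, m)
print(f > 0)                                               # the density is positive
print(np.isclose(eigenvalue_density(a * x, a * sigma, m),  # Lemma 2.3:
                 a ** (-2) * f))                           # f_{a sigma}(ax) = a^-p f(x)
```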

The simplified form of fσ(x) makes it quite easy to see what effect scaling has on the density.


Lemma 2.3. Let $f_\sigma(x)$ be the density of the ordered eigenvalues of a Wishart distribution $W(m, \Sigma)$, where Σ is a $p \times p$ covariance matrix whose vector of ordered eigenvalues is σ. Let $a > 0$ be a constant. Then

$$f_{a\sigma}(ax) = a^{-p} f_\sigma(x).$$

Proof. First note that $|\Psi_{a\sigma}(ax)| = |\Psi_\sigma(x)|$, since $\Psi_{a\sigma}(ax) = [e^{-(ax_j)/(a\sigma_i)}]_{ij} = [e^{-x_j/\sigma_i}]_{ij} = \Psi_\sigma(x)$. So,

$$f_{a\sigma}(ax) = |\Psi_{a\sigma}(ax)| \cdot \prod_{j=1}^p \left( \frac{1}{(ax_j)^p (m-j)!} \left(\frac{ax_j}{a\sigma_j}\right)^m \prod_{i=1}^{j-1} \frac{(a\sigma_i)(a\sigma_j)(ax_j - ax_i)}{a\sigma_j - a\sigma_i} \right)$$

$$= |\Psi_\sigma(x)| \cdot \frac{1}{a^{p^2}} \prod_{j=1}^p \left( \frac{1}{x_j^p (m-j)!} \left(\frac{x_j}{\sigma_j}\right)^m a^{2(j-1)} \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right)$$

$$= \frac{a^{2\sum_{j=1}^{p-1} j}}{a^{p^2}} f_\sigma(x) = \frac{a^{p(p-1)}}{a^{p^2}} f_\sigma(x) = \frac{a^{p^2 - p}}{a^{p^2}} f_\sigma(x) = a^{-p} f_\sigma(x).$$

3 Number of required components

In this section it is assumed that $X_1, X_2, \dots, X_m$ is an i.i.d. sample from $N_p(\mu, \Sigma)$. As mentioned previously, the sample covariance matrix in this case is $S = \frac{1}{m-1} \sum_{j=1}^m (X_j - \bar{X})(X_j - \bar{X})^T$. S follows a Wishart distribution $W_p(m-1, \frac{1}{m-1}\Sigma)$, as proven in e.g. [1], page 255. The estimator S is consistent, meaning that $S \xrightarrow{p} \Sigma$ as $m \to \infty$, due to the Law of Large Numbers, as shown in e.g. [2], page 186. S is also clearly unbiased, since $E(S) = (m-1)\frac{1}{m-1}\Sigma = \Sigma$. The loss of a degree of freedom is due to "using it" to calculate the sample mean $\bar{X}$ instead of using the population mean µ, which generally is not known a priori. However, this does not present a problem, since we can consider $\tilde{m} = m - 1$ without losing any properties of the distribution. Since S follows a Wishart distribution, the knowledge about the distribution of eigenvalues for Wishart distributions is used to investigate the number of principal components required to not lose more than a specified amount of the variance in the sample.

The density for the eigenvalues is $f_\sigma$ as in (3). If the proportion of variance we are prepared to give up is $\varepsilon$, then we are interested in whether or not the eigenvalues $\sigma_1 \ge \dots \ge \sigma_p > 0$ of the actual covariance matrix Σ fulfill

$$R_q(\Sigma) := \frac{\sum_{j=1}^q \sigma_j}{\sum_{i=1}^p \sigma_i} = \frac{1}{\mathrm{tr}(\Sigma)} \sum_{j=1}^q \sigma_j \ge 1 - \varepsilon. \tag{4}$$

The notation $R_q(\Sigma)$ is not standard, but this ratio will come up often, so it is prudent to give it a name. Note that $R_q(\Sigma)$ is invariant under scaling of Σ by a positive constant $a > 0$.

A commonly used alternative to Rq(Σ) is the elbow plot in which the eigenvalues are visualized in a graph and the number of eigenvalues used is determined by where the graph stops decreasing sharply. A comparison between the elbow plot and Rq(Σ) can be seen in Figure 1 where the eigenvalues are σ = [10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3]. It can be difficult to determine how much of the variance is lost when just using the elbow plot and in Figure 1, one might only retain about 60% of the variance if using the elbow plot to gauge how many components are required. For this reason, this investigation uses Rq instead which allows us to lose only up to a specified proportion of the variance.

[Figure 1 shows an elbow plot of the eigenvalues $\sigma_j$ against $j$ alongside the cumulative proportions $R_1 = 0.22$, $R_2 = 0.39$, $R_3 = 0.5$, $R_4 = 0.6$, $R_5 = 0.7$, $R_6 = 0.79$, $R_7 = 0.87$, $R_8 = 0.93$, $R_9 = 1$.]

Figure 1: Comparison between elbow plot and $R_q(\Sigma)$
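The $R_q$ values in Figure 1 follow directly from the definition (4); a quick Python check using the figure's eigenvalues:

```python
import numpy as np

sigma = np.array([10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3])   # eigenvalues in Figure 1

# R_q = (sigma_1 + ... + sigma_q) / tr(Sigma) for q = 1, ..., p
R = np.cumsum(sigma) / sigma.sum()
print(np.round(R, 2))      # 0.22, 0.39, 0.5, 0.6, 0.7, 0.79, 0.87, 0.93, 1.0

# smallest q that loses no more than eps = 0.1 of the variance
eps = 0.1
print(int(np.argmax(R >= 1 - eps)) + 1)    # 8
```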

We are now ready to formulate the main question of this investigation: what values of Σ, m, q, ε and α fulfill $P(R_q(S) \ge 1 - \varepsilon) \ge 1 - \alpha$, where $S \sim W_p(m, \frac{1}{m}\Sigma)$?

When trying to answer this question, we consider two cases. In the first case no structure on the covariance matrix Σ is assumed, and in the second case we consider a special structure on the covariance matrix. Since structures of Σ that do not affect the eigenvalues of Σ also do not affect the distribution of eigenvalues for S, the section about structured covariance matrices will focus on a useful case where Σ does not have trivial eigenvalues.

For small values of m and p (the dimension of Σ) we will be able to use the density (3) to get exact probabilities (or at least numerical approximations of said probabilities) while for larger values we have to rely on asymptotic results in order to estimate the probabilities. In all cases, the eigenvalues of Σ are distinct so the formula for the density of the eigenvalues in Equation (2) can be used directly.

Such an asymptotic result can be found in [1] on pp. 479-480; it can be slightly adjusted to get the following result.

Theorem 3.1. With the same setup as previously and the null hypothesis

$$H_0: R_q(\Sigma) < 1 - \varepsilon, \tag{5}$$

a test with approximate significance α for large m is given by rejecting $H_0$ if

$$\sqrt{m}\,\big(1 - \varepsilon - R_q(S)\big) < Z(\alpha) \sqrt{2\left(\frac{\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_1^2 + \dots + \hat\sigma_q^2) + 2\left(\frac{1-\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_{q+1}^2 + \dots + \hat\sigma_p^2)} \tag{6}$$

where S is the estimator of the covariance matrix Σ, $\hat\sigma_j$ are the ordered eigenvalues of S, and $Z(\alpha)$ is the α-quantile of the standard normal distribution.

A proof of this theorem is outlined in [1] on page 480, using Theorem 4.2.3, a proof of which can be found in [3] (where it appears as Theorem A in Section 3.3). This proof is quite lengthy and will thus not be written here.

By studying (6), it is evident that an approximate p-value for the test is given by isolating $Z(\alpha)$ and applying the distribution function of $Z \sim N(0, 1)$ to both sides. Let

$$T_{\varepsilon,q}(S) = \frac{1 - \varepsilon - R_q(S)}{\sqrt{2\left(\frac{\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_1^2 + \dots + \hat\sigma_q^2) + 2\left(\frac{1-\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_{q+1}^2 + \dots + \hat\sigma_p^2)}}.$$

Then $p = \Phi\big(\sqrt{m} \cdot T_{\varepsilon,q}(S)\big)$, where Φ is the distribution function of the standard normal distribution $N(0,1)$, is the approximate p-value of the test for the observed value of the statistic. Under $H_0$, $T_{\varepsilon,q}(\Sigma)$ is positive, so the test is one-directional, testing whether $T_{\varepsilon,q}(S)$ is negative with a large enough margin. As with the exact method, we observe that the test only depends on the eigenvalues of Σ (estimated by the eigenvalues of the sample covariance matrix S), since $\mathrm{tr}(S) = \sum_{j=1}^p \hat\sigma_j$ and $R_q(S)$ are uniquely determined by the eigenvalues of S.
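In code, the statistic and its approximate p-value can be sketched as follows (a Python rendering of Theorem 3.1; the eigenvalues are the Figure 1 values, reused purely as an illustration):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def T_stat(eigs, q, eps):
    """T_{eps,q}(S), computed from the ordered eigenvalues of S."""
    eigs = np.sort(np.asarray(eigs, float))[::-1]
    tr = eigs.sum()
    Rq = eigs[:q].sum() / tr
    var = (2 * (eps / tr) ** 2 * np.sum(eigs[:q] ** 2)
           + 2 * ((1 - eps) / tr) ** 2 * np.sum(eigs[q:] ** 2))
    return (1 - eps - Rq) / sqrt(var)

def p_value(eigs, q, eps, m):
    """Approximate p-value Phi(sqrt(m) * T_{eps,q}(S)) for H0: R_q(Sigma) < 1 - eps;
    small values reject H0, i.e. q components are deemed sufficient."""
    return NormalDist().cdf(sqrt(m) * T_stat(eigs, q, eps))

eigs = [10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3]
print(p_value(eigs, q=8, eps=0.1, m=100) < 0.05)   # True: 8 components suffice
print(p_value(eigs, q=3, eps=0.1, m=100) < 0.05)   # False: 3 components do not
```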

Since S is consistent, $T_{\varepsilon,q}(S) \xrightarrow{p} T_{\varepsilon,q}(\Sigma)$, and we want to show that the test is consistent as well. A test is consistent if $p \to 0$ under $H_1: \neg H_0$ as $m \to \infty$. Σ is independent of m, and $\sqrt{m} \cdot T_{\varepsilon,q}(\Sigma)$ is monotonic in m, increasing if $T_{\varepsilon,q}(\Sigma) > 0$ and decreasing if $T_{\varepsilon,q}(\Sigma) < 0$. Further, the denominator of $T_{\varepsilon,q}(\Sigma)$ is always positive, since $\mathrm{tr}(\Sigma) > 0$ and $\varepsilon \in (0, 1)$. This means that

$$\lim_{m\to\infty} p = \begin{cases} \lim_{m\to\infty} F_Z\big(\sqrt{m}\,|T_{\varepsilon,q}(\Sigma)|\big), & T_{\varepsilon,q}(\Sigma) > 0 \\ \lim_{m\to\infty} F_Z\big(-\sqrt{m}\,|T_{\varepsilon,q}(\Sigma)|\big), & T_{\varepsilon,q}(\Sigma) \le 0 \end{cases} = \begin{cases} 1, & T_{\varepsilon,q}(\Sigma) > 0 \\ 0, & T_{\varepsilon,q}(\Sigma) \le 0 \end{cases} \tag{7}$$

The sign of $T_{\varepsilon,q}(\Sigma)$ is independent of the denominator, and the numerator is positive if and only if $R_q(\Sigma) < 1 - \varepsilon$, so (7) reduces to

$$\lim_{m\to\infty} p = \begin{cases} 1, & R_q(\Sigma) < 1 - \varepsilon \\ 0, & R_q(\Sigma) \ge 1 - \varepsilon \end{cases} \tag{8}$$

which shows that the test is consistent.

Remark 3.2. Although it would not affect the test result in the limit, scaling the eigenvalues of Σ by a constant $a > 0$ (thus also scaling the random matrix S by the same constant) does affect the test result for finite values of m. Substituting the eigenvalues $a\hat\sigma_j$ of $aS$ and $\mathrm{tr}(aS) = a \cdot \mathrm{tr}(S)$ into the definition of the statistic and simplifying yields

$$T_{\varepsilon,q}(aS) = \sqrt{a} \cdot T_{\varepsilon,q}(S).$$

Thus, scaling the eigenvalues by $a > 0$ has the same effect on the test as instead scaling m by the same constant a, since $\sqrt{am} \cdot |T_{\varepsilon,q}(S)| = \sqrt{m} \cdot |\sqrt{a} \cdot T_{\varepsilon,q}(S)| = \sqrt{m} \cdot |T_{\varepsilon,q}(aS)|$.

Now, only one obstacle remains before the analysis can begin. For smaller values of m, where we use the density to get exact results, we have not yet considered how to calculate the probability that q eigenvalues are enough, i.e. $P(R_q(S) \ge 1 - \varepsilon)$. We note that this probability can be written in the form of an integral,

$$P(R_q(S) \ge 1 - \varepsilon) = \int_{D_p} f_\sigma(x)\, I_{R_q}(x)\, dx \tag{9}$$

where $D_p = \{x \in \mathbb{R}^p : x_1 > \dots > x_p > 0\}$ and $R_q = \left\{x \in \mathbb{R}^p : \frac{\sum_{i=1}^q x_i}{\sum_{j=1}^p x_j} \ge 1 - \varepsilon \right\}$. This integral would not be very pleasant to solve analytically since, among other complications, it contains the factor $|\Psi(x)|$, the determinant of a matrix whose elements are exponential functions of $x_1, \dots, x_p$. Further, it is a p-dimensional integral, so deterministic numerical methods quickly become infeasible as p grows. We therefore compute the integral using Monte Carlo methods.


3.1 Methodology

The Monte Carlo method used is an independent Monte Carlo integration with the following algorithm, which is a slightly adapted version of Algorithm 2.1 in [4], page 27.

Algorithm 3.3. Given a sample size $N \in \mathbb{Z}^+$ and an integral of the form $\mu = \int_D h(x) f(x)\, dx < \infty$, where $h: D \to \mathbb{R}$ is an integrable function on D and f is the density function of a random variable X supported on D:

1. For $i = 1$ to N:
   (a) Draw $x_i \sim X$ independently of all prior draws.
   (b) If $x_i \notin D$, return to (a). Otherwise, continue.
2. Estimate $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N h(x_i)$.

In the case of (9), the domain of integration $D_p$ is unbounded, since there is no limit on how large $x_1$ can be. It is therefore necessary to impose some limit on $x_1$ when estimating the integral using a computer. A limit L is chosen and the integral in (9) is estimated using $D_L^p = D_p \cap [0, L]^p = \{x \in \mathbb{R}^p : L \ge x_1 > \dots > x_p > 0\}$, together with the uniform distribution $U_{D_L^p}$ on $D_L^p$. The density function of $U_{D_L^p}$ is $f_{U_{D_L^p}}(x) = \frac{1}{\mathrm{vol}(D_L^p)} I_{D_L^p}(x)$, where $\mathrm{vol}(D_L^p)$ is the generalized volume of $D_L^p$.

Proposition 3.4. Let $D_L^p = \{x \in \mathbb{R}^p : L \ge x_1 > \dots > x_p > 0\}$. Then $\mathrm{vol}(D_L^p) = \frac{L^p}{p!}$.

Proof. The volume of $D_L^p$ is the integral of the constant function 1 over $D_L^p$, and $\int_{\{x_1, x_2 : b > x_1 > x_2 > a\}} f(x)\, dx = \int_a^b \int_a^{x_1} f(x)\, dx_2\, dx_1$, so we get the following:

$$\mathrm{vol}(D_L^p) = \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-2}} \int_0^{x_{p-1}} dx_p\, dx_{p-1} \cdots dx_2\, dx_1 = \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-2}} x_{p-1}\, dx_{p-1} \cdots dx_2\, dx_1$$

$$= \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-3}} \frac{x_{p-2}^2}{2}\, dx_{p-2} \cdots dx_2\, dx_1 = \cdots = \int_0^L \frac{x_1^{p-1}}{(p-1)!}\, dx_1 = \frac{L^p}{p!}$$
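Proposition 3.4 can also be verified by simulation: a uniform point in $[0, L]^p$ falls in the ordered region with probability $1/p!$, since all p! orderings of the coordinates are equally likely. A small Python check (illustrative values):

```python
import math
import numpy as np

rng = np.random.default_rng(2)

p, L, N = 4, 2.0, 200_000
X = rng.uniform(0, L, size=(N, p))

# fraction of points with x_1 > x_2 > ... > x_p, times the cube volume L^p
ordered = np.all(X[:, :-1] > X[:, 1:], axis=1)
vol_estimate = ordered.mean() * L ** p

print(abs(vol_estimate - L ** p / math.factorial(p)) < 0.05)   # True
```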

Since the integrand in (9) is a density function multiplied with an indicator function which is identically 1 for $q = p$ (as $R_p = \mathbb{R}^p$), it is known that $\int_{D_p} f_\sigma(x) I_{R_p}(x)\, dx = 1$, which can be used to calibrate the limit L. If the integral is estimated to be much less than 1 (for example ≈ 0.9) using a certain value of L and $q = p$, this suggests that a significant probability mass lies outside $D_L^p$, and hence the value of L is increased to make sure that the only error comes from the approximation by a finite sum in Algorithm 3.3. It is also desirable that L be as small as possible, since for large enough L very little probability mass is contained in $D_{L'}^p \setminus D_L^p$ when $L' > L$, making the algorithm less efficient for $L'$, as more of the sampled points contribute very little to the estimate of the integral. When a suitable value for L is found, one sample $x^{(1)}, \dots, x^{(N)}$ is drawn from $U_{D_L^p}$ and used to estimate the probabilities $P(R_q(S) \ge 1 - \varepsilon)$ for all $q = 1, \dots, p$ by

$$P(R_q(S) \ge 1 - \varepsilon) \approx \frac{\mathrm{vol}(D_L^p)}{N} \sum_{i=1}^N f_\sigma(x^{(i)})\, I_{R_q}(x^{(i)}).$$

Note that if the eigenvalues and the argument are both scaled by $a > 0$ in the density of the eigenvalues, as in Lemma 2.3, then L can be scaled by the same constant a to get the same estimate of the integral, since

$$\int_{D_{aL}^p} f_{a\sigma}(ax)\, I_{R_q}(ax)\, dx = \frac{\mathrm{vol}(D_{aL}^p)}{\mathrm{vol}(D_L^p)} \int_{D_L^p} a^{-p} f_\sigma(x)\, I_{R_q}(x)\, dx = a^p a^{-p} \int_{D_L^p} f_\sigma(x)\, I_{R_q}(x)\, dx = \int_{D_L^p} f_\sigma(x)\, I_{R_q}(x)\, dx,$$

where the property $\forall a > 0 : x \in R_q \implies ax \in R_q$ is used. By testing different values for L, it was discovered that $L = 1 + \frac{2}{5}m$ seemed to work well whenever the largest eigenvalue was 1, and this value was therefore used in all calculations.

For the asymptotic results, the methodology is slightly different. The goal is to investigate how many eigenvalues are required for a specific m. The following algorithm provides a way to see what a typical test result would look like for many different values of m.

Algorithm 3.5. Given p, σ, ε, α, a lower limit $m_0$ and an upper limit M for m, let Σ be the diagonal matrix with $\sigma_1, \dots, \sigma_p$ on the diagonal. For $m = m_0$ to M:

1. Draw $S_m \sim W(m, \frac{1}{m}\Sigma)$ independently from all other draws and compute the eigenvalues $s_m = (s_{1,m}, \dots, s_{p,m})$ such that $s_{1,m} \ge \dots \ge s_{p,m}$, where the inequalities are a.s. strict.
2. Compute the test statistics $T_{\varepsilon,q}(S_m)$ for $q = 1, \dots, p$ using the eigenvalues computed in the previous step.
3. The number of required components for m is then $\min\{q \in \{1, \dots, p\} : F_Z(\sqrt{m} \cdot T_{\varepsilon,q}(S_m)) < \alpha\}$.

When the true eigenvalues are known, one can compare the results from this algorithm with the number of eigenvalues actually required according to the test in Theorem 3.1, by calculating the test statistic of the true eigenvalues and using the same method as in step 3 to compute the number of required eigenvalues. Due to the Law of Large Numbers, consistency and unbiasedness, this is equivalent to looking at the average of a large sample. This way, one can see how reliable the test is and form a heuristic about how many eigenvalues are actually required given a test result.
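One pass of Algorithm 3.5 for a single m can be sketched in Python as follows (the Wishart draw is built from normal vectors as in Section 2.1, the statistic follows Theorem 3.1, and the eigenvalue sequence is an invented example with linear decay):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

rng = np.random.default_rng(3)

def required_components(sigma, m, eps=0.1, alpha=0.05):
    """Draw S_m ~ W_p(m, Sigma/m) for Sigma = diag(sigma) and return
    min{q : F_Z(sqrt(m) * T_{eps,q}(S_m)) < alpha}, as in step 3."""
    sigma = np.asarray(sigma, float)
    p = len(sigma)
    # S_m = (1/m) * sum_j Z_j Z_j^T with Z_j ~ N_p(0, diag(sigma))
    Z = rng.normal(size=(m, p)) * np.sqrt(sigma)
    s = np.sort(np.linalg.eigvalsh(Z.T @ Z / m))[::-1]   # ordered eigenvalues
    tr = s.sum()
    for q in range(1, p + 1):
        Rq = s[:q].sum() / tr
        var = (2 * (eps / tr) ** 2 * np.sum(s[:q] ** 2)
               + 2 * ((1 - eps) / tr) ** 2 * np.sum(s[q:] ** 2))
        T = (1 - eps - Rq) / sqrt(var)
        if NormalDist().cdf(sqrt(m) * T) < alpha:
            return q
    return p   # fall back to all components if the test never rejects

sigma = [1.0, 0.8, 0.6, 0.4, 0.2]           # linear decay, largest eigenvalue 1
print(required_components(sigma, m=1000))   # 4: R_4(Sigma) ~ 0.93 is first above 0.9
```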

In order to reduce the number of variables involved, we only consider the case with $\varepsilon = 0.1$ and $\alpha = 0.05$, meaning that no more than 10% of the variance is lost in at least 95% of cases. When estimating the integral (9) using Algorithm 3.3, $N = 10^8$ is used.

3.2 Unstructured Covariance

It was previously noted that the number of components required only depends on the underlying covariance matrix through the eigenvalues of said matrix, so the unstructured covariance case boils down to investigating the probability (9) given a decreasing sequence of positive eigenvalues.


We provide two examples of such sequences and how the number of components required is affected by the difference of the sequences and by changes in m. The sequences are scaled such that the largest eigenvalue is 1, since $R_q$ is invariant under constant scaling of σ. Scaling both the eigenvalues and the argument in $f_\sigma(x)$ by $\frac{1}{m}$ (i.e. using $f_{\sigma/m}(x/m)$) is equivalent to scaling the scale matrix Σ in the Wishart distribution $W(m, \Sigma)$ by $\frac{1}{m}$, which is precisely what is needed to get the desired distribution of S. Note that even if some other scaling were applied to the eigenvalues and the argument in $f_\sigma(x)$, say $f_{a\sigma}(ax)$, the probability would still be the same, as discussed in the previous Section.

The two sequences consist of one with linear decay and one with exponential decay. Both sequences have their respective values of $P(R_q \ge 1 - \varepsilon)$ calculated exactly for $m = 10, 15, 20, 30, 40, 50$, while for larger values of m the asymptotic result is used.

3.2.1 Sequence with linear decay

Let $\sigma_{p,\mathrm{lin}} = \left(1, \frac{p-1}{p}, \dots, \frac{1}{p}\right)$ be the eigenvalues of some symmetric and positive definite $p \times p$ matrix $\Sigma_{p,\mathrm{lin}}$. $S_{p,\mathrm{lin}} \sim W_p(m, \frac{1}{m}\Sigma_{p,\mathrm{lin}})$ is the estimate of $\Sigma_{p,\mathrm{lin}}$, and the following tables contain the estimated probabilities $P(R_q(S_{p,\mathrm{lin}}) \ge 1 - \varepsilon)$ for $2 \le p \le 9$, $1 \le q \le p$ and $\varepsilon = 0.1$.

q\p |     2      |     3      |      4      |      5      |      6      |      7      |      8      |      9
 1  | 3.88·10^-3 | 2.84·10^-7 | 5.64·10^-12 | 3.03·10^-17 | 2.07·10^-23 | 0           | 0           | 0
 2  | 1          | 0.39       | 6.72·10^-3  | 1.77·10^-5  | 1.67·10^-8  | 4.22·10^-12 | 1.03·10^-16 | 1.59·10^-26
 3  |            | 1          | 0.98        | 0.46        | 4.55·10^-2  | 1.54·10^-3  | 1.87·10^-5  | 1.57·10^-7
 4  |            |            | 1           | 1           | 1           | 0.75        | 0.32        | 6.46·10^-2
 5  |            |            |             | 1           | 1.01        | 0.97        | 1.04        | 1.1
 6  |            |            |             |             | 1.01        | 0.97        | 1.04        | 1.11
 7  |            |            |             |             |             | 0.97        | 1.04        | 1.11
 8  |            |            |             |             |             |             | 1.04        | 1.11
 9  |            |            |             |             |             |             |             | 1.11

Table 1: probabilities for m = 10 (cells with q > p are empty; estimates slightly exceeding 1 reflect Monte Carlo estimation error)

q\p |     2      |     3       |      4      |      5      |      6      |      7      |      8      |      9
 1  | 2.34·10^-4 | 5.26·10^-11 | 4.31·10^-19 | 1.76·10^-28 | 1.93·10^-40 | 1.37·10^-56 | 0           | 0
 2  | 1          | 0.19        | 1.05·10^-4  | 2·10^-9     | 4.39·10^-15 | 3.57·10^-21 | 1.79·10^-28 | 0
 3  |            | 1           | 0.94        | 9.94·10^-2  | 3.55·10^-4  | 1.4·10^-7   | 1.02·10^-11 | 8.89·10^-17
 4  |            |             | 1           | 1           | 0.86        | 0.16        | 3.29·10^-3  | 1.35·10^-5
 5  |            |             |             | 1           | 1           | 0.99        | 0.93        | 0.4
 6  |            |             |             |             | 1           | 0.99        | 1.01        | 1.01
 7  |            |             |             |             |             | 0.99        | 1.01        | 1.01
 8  |            |             |             |             |             |             | 1.01        | 1.01
 9  |            |             |             |             |             |             |             | 1.01

Table 2: probabilities for m = 15 (cells with q > p are empty; estimates slightly exceeding 1 reflect Monte Carlo estimation error)
