On Asymptotic Properties of Principal Component Analysis

U.U.D.M. Project Report 2020:36

Degree project in mathematics, 15 credits. Supervisor: Silvelyn Zwanzig. Examiner: Martin Herschend. June 2020

Department of Mathematics Uppsala University


Johannes Graner


Abstract

This paper looks at the asymptotic behavior of principal component analysis as the sample size increases. Specifically, the number of components required to retain a given proportion of variance in the sample is investigated for independent and identically distributed observations from a normal distribution. The number of components required for three special cases is estimated using Monte Carlo integration, and the results are compared to those from an asymptotic test for the number of components required.


Contents

Introduction

1 Principal component analysis
  1.1 Singular Value Decomposition
  1.2 Deriving the components

2 Random Matrix Theory
  2.1 Wishart distribution

3 Number of required components
  3.1 Methodology
  3.2 Unstructured Covariance
    3.2.1 Sequence with linear decay
    3.2.2 Sequence with exponential decay
  3.3 Structured Covariance
  3.4 Analysis


Introduction

When collecting data, many different types of measurements are often taken for the same object. For example, when investigating the performance of racing cars, one could measure the cars' weights, the sizes of the engines and the force applied by the brakes, among other things. Many different measurements increase the variance in the data, but at the same time high-dimensional data is more difficult to analyze, and each additional measurement can be thought of as adding another dimension to the data.

There exist many techniques for simplifying high-dimensional data and one of them is Principal Component Analysis (PCA) where the dimensionality of the data is reduced by projections. However, it is not always evident how much the data can be simplified without losing too much of the variance in the data.

That problem is investigated in this thesis, specifically focusing on how much the dimensionality can be reduced depending on the sample size. First, some theory and background about PCA is covered, and then two different methods of measuring the number of components (dimensions) required for normally distributed data are presented. In later sections, the number of components required is investigated for a few special cases and a comparison between the two methods is made.

1 Principal component analysis

Given a sample of size n of a p-dimensional random variable X, one could imagine that most of the inherent randomness would be explained well by fewer than p linear combinations of the components of X.

Expressed more formally, for $X \in \mathbb{R}^p$ we try to find $Y \in \mathbb{R}^q$ such that $q < p$, $y_i = \sum_{j=1}^{p} a_{ij} x_j$ for $i = 1, \dots, q$, and we don't lose a significant amount of variance by considering samples of Y instead of samples of X. Since the "amount of randomness" can be measured by variances and covariances, it could be fruitful to consider the covariance matrix of X in order to find the best linear combinations $y_i$. It makes sense to be interested in maximizing randomness, since a constant variable could be completely understood by just one observation. Hence, it is reasonable to choose $y_i$ with large variances, and since we want each additional $y_i$ to only add new information, the linear combinations $y_i$ are chosen such that their covariances are minimized. The $y_i$ chosen in this manner are called the principal components of X, but in order to make this argument sufficiently rigorous, we first need some basic properties of vectors and matrices.

1.1 Singular Value Decomposition

Let $[\sigma_{ij}] = \Sigma := \mathrm{Cov}(X)$ be the covariance matrix of X, where $\sigma_{ii} = \mathrm{Var}(X_i)$, $i = 1, \dots, p$, and $\sigma_{ij} = \mathrm{Cov}(X_i, X_j)$ for $i \ne j$. Since $\mathrm{Cov}(U, V) = \mathrm{Cov}(V, U)$ for any scalar-valued random variables U, V, the matrix Σ is symmetric. If the components of X are a.s. linearly independent, then for any $a \in \mathbb{R}^p \setminus \{0\}$, $0 < \mathrm{Var}(a^T X) = a^T \Sigma a$, so Σ is positive definite.

Having Σ both symmetric and positive definite, we use the spectral theorem to factorize $\Sigma = E \Lambda E^T$, where

$$E = \begin{pmatrix} e_1 & e_2 & \dots & e_p \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_p \end{pmatrix}$$

and $e_i$ is the eigenvector of Σ corresponding to eigenvalue $\lambda_i$. Since Σ is symmetric and positive definite, we also know that all eigenvalues are real and positive, so we order them as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$.

In order to properly define and construct the principal components, an additional lemma is needed. The lemma together with its proof can be found in [2] on page 83.

Lemma 1.1. For a positive definite matrix $B \in M_{p \times p}(\mathbb{R})$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ and corresponding eigenvectors $e_1, e_2, \dots, e_p$ such that $e_i^T e_i = 1$ for all $i$,

$$\max_{x \ne 0} \frac{x^T B x}{x^T x} = \lambda_1 \quad (\text{attained for } x = e_1),$$

$$\max_{x \perp e_1, \dots, e_k} \frac{x^T B x}{x^T x} = \lambda_{k+1} \quad (\text{attained for } x = e_{k+1}),$$

where $x \perp y \iff x^T y = 0$.

1.2 Deriving the components

With these tools, we are able to determine the best choices of $y_i$.

Definition 1.1. Given a random vector $X = (X_1, X_2, \dots, X_p)^T$, the Principal Components of X are the random variables

$$Y_i = \frac{a_i^T X}{\sqrt{a_i^T a_i}} = \frac{\sum_{j=1}^p a_{ij} X_j}{\|a_i\|}, \qquad i = 1, \dots, p,$$

such that

$$\mathrm{Var}(Y_i) = \max_{a \in A_i} \mathrm{Var}\!\left(\frac{a^T X}{\sqrt{a^T a}}\right),$$

where $\|a\| = \sqrt{a^T a}$, $A_1 = \mathbb{R}^p \setminus \{0\}$ and $A_i = \left\{a \in \mathbb{R}^p \setminus \{0\} : \mathrm{Cov}\!\left(\frac{a^T X}{\sqrt{a^T a}},\, Y_j\right) = 0 \ \forall j < i \right\}$ for $i = 2, \dots, p$.

The principal components of X are the p uncorrelated linear combinations with maximal variances, which is precisely what was sought. In order to construct the components in practice, the vectors $a_i$ are required. If X has covariance matrix $\mathrm{Cov}(X) = \Sigma$, then $\mathrm{Var}(Y_i) = \frac{a_i^T \Sigma a_i}{a_i^T a_i}$ and $\mathrm{Cov}(Y_i, Y_j) = \frac{a_i^T \Sigma a_j}{\|a_i\| \cdot \|a_j\|}$. When comparing the formula for $\mathrm{Var}(Y_i)$ with the expression that is maximized in Lemma 1.1, one can see that they are very similar. Assuming the components of X are a.s. linearly independent, Σ is both symmetric and positive definite, so Lemma 1.1 can be used to determine the principal components of X.


Theorem 1.2. Suppose X is a random vector with a.s. linearly independent components and covariance matrix Σ. Then the principal components of X are given by $Y_i = e_i^T X$ and $\mathrm{Var}(Y_i) = \lambda_i$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p > 0$ are the eigenvalues of Σ with corresponding normalized eigenvectors $e_1, e_2, \dots, e_p$.

Proof. To get $Y_1$, we use Lemma 1.1 to see that

$$\max_{a \ne 0} \mathrm{Var}\!\left(\frac{a^T X}{\sqrt{a^T a}}\right) = \max_{a \ne 0} \frac{a^T \Sigma a}{a^T a} = \lambda_1,$$

which is attained by $a = e_1$. Now, suppose $Y_i = e_i^T X$ for $i \le k - 1$. We see that for $i \le k - 1$, $\mathrm{Cov}(Y_i, Y_k) = 0 \iff e_i^T \Sigma a_k = 0$, and since Σ is symmetric and positive definite, $e_i^T \Sigma = \lambda_i e_i^T$, so $\mathrm{Cov}(Y_i, Y_k) = 0 \iff e_i^T a_k = 0$. Hence $Y_k$ is uncorrelated with all previous principal components if and only if $a_k \perp e_1, \dots, e_{k-1}$. We now use Lemma 1.1 once again to determine that

$$\max_{a_k \perp e_1, \dots, e_{k-1}} \mathrm{Var}(Y_k) = \max_{a_k \perp e_1, \dots, e_{k-1}} \frac{a_k^T \Sigma a_k}{a_k^T a_k} = \lambda_k$$

and that this maximum is attained for $a_k = e_k$. So, by the induction principle, $Y_i = e_i^T X$ and $\mathrm{Var}(Y_i) = \lambda_i$ for all $i = 1, \dots, p$.

We remark that if there are repeated eigenvalues (i.e. $\lambda_i = \lambda_j$ for some $i \ne j$), then the principal components are not unique. This problem can be resolved by noting that it is possible to find an orthogonal basis for the eigenspace corresponding to the repeated eigenvalue and using those basis vectors (in any order) to determine the principal components corresponding to that eigenvalue.

So far we have only considered the population principal components. Thankfully, the theory carries over to the sample case as well, merely requiring that we consider $\hat{\Sigma} = S = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T$ for an i.i.d. sample $X_1, X_2, \dots, X_n$. The eigenvalues and eigenvectors of S are then used to find the sample principal components.

Now that we have the principal components of X, we would like to analyze the distributions of the eigenvalues and eigenvectors of S, since they determine the principal components.
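The sample recipe above (estimate S, take its eigendecomposition, project the centered sample onto the eigenvectors) can be sketched in a few lines. A Python/NumPy illustration follows; the sample and covariance matrix are invented for the example, and this is not the thesis's own code (the thesis's computations were done in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. sample of n observations of a p-dimensional normal vector X
n, p = 500, 3
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# sample covariance matrix S = 1/(n-1) * sum_i (X_i - Xbar)(X_i - Xbar)^T
S = np.cov(X, rowvar=False)

# eigendecomposition of S; eigh returns eigenvalues in ascending order,
# so reverse to get lambda_1 >= ... >= lambda_p
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]

# sample principal components Y_i = e_i^T (X - Xbar)
Y = (X - X.mean(axis=0)) @ E

# the components are uncorrelated and Var(Y_i) = lambda_i, as in Theorem 1.2
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))   # True
```

The final check mirrors Theorem 1.2 in its sample form: the covariance matrix of the projected data is diagonal with the eigenvalues of S on the diagonal.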

2 Random Matrix Theory

Since $X_1, \dots, X_n$ is a sample of observations of the random vector X, the estimated covariance matrix S is inherently random as well. We therefore need some theory for matrices whose elements are random, called random matrices. The theory of random matrices is very broad, so we will focus only on what is needed, which in this case is the distribution of S. If X were allowed to have any distribution, there would be very little control over the distribution of S, so we consider only the case where X follows a normal distribution, since there is a lot of literature covering this case.


2.1 Wishart distribution

Let $Z_1, \dots, Z_m \stackrel{iid}{\sim} N_p(0, \Sigma)$. Then $S = \sum_{j=1}^m Z_j Z_j^T \sim W_p(m, \Sigma)$, where $W_p(m, \Sigma)$ is the p-dimensional central Wishart distribution with scale matrix Σ and m degrees of freedom.

Since $ZZ^T = [z_i z_j]_{ij}$, it is clear that S is symmetric, and we observe that if $x \in \mathbb{R}^p$, then $x^T Z Z^T x = (Z^T x)^T (Z^T x) = (Z^T x)^2 \ge 0$, with equality only if $Z^T x = 0$. For S, this only happens if $Z_j^T x = 0$ for all $j = 1, \dots, m$, so if $m \ge p$, then $x^T S x = 0 \iff x = 0$ and we can conclude that S is a.s. positive definite whenever $m \ge p$ (which will always be the case in this investigation). The density of the aforementioned Wishart-distributed random matrix S is as follows:

$$f_S(X) = \begin{cases} \dfrac{|X|^{(m-p-1)/2}\, e^{-\mathrm{tr}(\Sigma^{-1} X)/2}}{2^{mp/2}\, |\Sigma|^{m/2}\, \Gamma_p(m/2)}, & |X| > 0 \\[6pt] 0, & \text{otherwise} \end{cases} \tag{1}$$

where $|X| = \det(X)$ and $\Gamma_p$ is the multivariate gamma function. We will not need to work with the density directly, but there are some properties of the Wishart distribution that will prove very useful. One of these is the expectation, $E(S) = m\Sigma$. This also implies that if we let $U_j = \frac{1}{\sqrt{m}} Z_j$, then $\tilde{S} = \sum_{j=1}^m U_j U_j^T \sim W_p(m, \frac{1}{m}\Sigma)$ would have expectation $E(\tilde{S}) = \Sigma$.
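The construction and the expectation property are easy to check by simulation. A small Python sketch (the dimensions and scale matrix are invented for the illustration and are not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(1)

p, m = 4, 10                                   # m >= p
Sigma = np.diag([3.0, 2.0, 1.0, 0.5])

# Z_1, ..., Z_m iid ~ N_p(0, Sigma); the rows of Z are the Z_j^T
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
S = Z.T @ Z                                    # sum of outer products Z_j Z_j^T

print(np.allclose(S, S.T))                     # S is symmetric
print(bool(np.all(np.linalg.eigvalsh(S) > 0))) # and positive definite (m >= p)

# E(S) = m * Sigma: the average over many replications approaches m * Sigma
reps = 20_000
Zr = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, m))
S_mean = np.einsum('rmi,rmj->ij', Zr, Zr) / reps
print(np.allclose(S_mean, m * Sigma, atol=1.5))
```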

The property that we will be most interested in and find most useful is the fact that since a Wishart matrix is a.s. positive definite, all of its eigenvalues are real and positive.

This means that they can be ordered as σ1 ≥ · · · ≥ σp > 0 and that it is meaningful to discuss not only the joint distribution of the eigenvalues, but also that of the largest q eigenvalues. This has been explored in a paper by Zanella et al. [6] and there the following form for the joint distribution of the eigenvalues of S is presented.

$$f_\sigma(x) = \begin{cases} K\,|\Phi(x)| \cdot |\Psi(x, \sigma)| \prod_{l=1}^p \xi(x_l), & x_1 > \dots > x_p > 0 \\ 0, & \text{otherwise} \end{cases}$$

$$K = \frac{\prod_{i=1}^p (i-1)! \; |\Sigma|^{-m}}{|V_2(\sigma)| \prod_{j=1}^p (m-j)! \prod_{k=1}^p (p-k)!},$$

$$\Phi(x) = [x_j^{i-1}]_{ij}, \qquad \Psi(x, \sigma) = [e^{-x_j/\sigma_i}]_{ij}, \qquad V_2(\sigma) = [(-\sigma_j)^{1-i}]_{ij}, \qquad \xi(x) = x^{m-p}, \tag{2}$$

where σ is a vector of the ordered eigenvalues of Σ. If the eigenvalues $\sigma_j$ are not distinct, the determinant $|V_2(\sigma)| = 0$. According to [6], this can be fixed by considering the limit $\lim_{\sigma_2, \dots, \sigma_l \to \sigma_1} \frac{|\Psi(x, \sigma)|}{|V_2(\sigma)|}$. Even if Σ does not have distinct eigenvalues, the eigenvalues of S are a.s. distinct, since if $x_i = x_j$ for some $i \ne j$, then $f_\sigma(x) = 0$. The density written in this form looks fairly complicated, so we try to simplify it. We note that Φ and $V_2$ seem to have a similar structure.


Definition 2.1. Let $a = [a_1, a_2, \dots, a_p]^T \in \mathbb{R}^p$. Then $A = [a_j^{i-1}]_{ij}$, $i, j = 1, \dots, p$, is the $p \times p$ Vandermonde matrix corresponding to a.

Both Φ and $V_2$ have this structure ($V_2(\sigma) = [(-\sigma_j)^{1-i}]_{ij} = [(-\frac{1}{\sigma_j})^{i-1}]_{ij}$), so they are both $p \times p$ Vandermonde matrices, whose determinants have fairly simple forms.

Lemma 2.1. Let $a = [a_1, a_2, \dots, a_p]^T \in \mathbb{R}^p$ and let A be the $p \times p$ Vandermonde matrix corresponding to a (i.e. $A = [a_j^{i-1}]_{ij}$). Then

$$\det(A) = \prod_{1 \le i < j \le p} (a_j - a_i) = \prod_{j=2}^{p} \prod_{i=1}^{j-1} (a_j - a_i).$$

Proof. Let $A_n$ be the Vandermonde matrix that is constructed when considering $\tilde{a}_n = [a_{p-n+1}, \dots, a_p]$ for $n = 2, \dots, p$. Then

$$\det(A_2) = \det\begin{pmatrix} 1 & 1 \\ a_{p-1} & a_p \end{pmatrix} = a_p - a_{p-1},$$

so the result holds for $n = 2$. In order to make an induction argument, assume that the result holds for $n = k$, i.e. that $\det(A_k) = \prod_{1 \le i < j \le k} (a_j - a_i)$ (with indices taken within $\tilde{a}_k$). We then get the following:

$$\det(A_{k+1}) = \begin{vmatrix} 1 & 1 & \dots & 1 & 1 \\ a_{p-k} & a_{p-k+1} & \dots & a_{p-1} & a_p \\ a_{p-k}^2 & a_{p-k+1}^2 & \dots & a_{p-1}^2 & a_p^2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ a_{p-k}^k & a_{p-k+1}^k & \dots & a_{p-1}^k & a_p^k \end{vmatrix}
= \begin{vmatrix} 1 & 0 & \dots & 0 \\ a_{p-k} & a_{p-k+1} - a_{p-k} & \dots & a_p - a_{p-k} \\ a_{p-k}^2 & a_{p-k+1}^2 - a_{p-k}^2 & \dots & a_p^2 - a_{p-k}^2 \\ \vdots & \vdots & \ddots & \vdots \\ a_{p-k}^k & a_{p-k+1}^k - a_{p-k}^k & \dots & a_p^k - a_{p-k}^k \end{vmatrix}$$

$$= \begin{vmatrix} 1 & 0 & \dots & 0 \\ 0 & a_{p-k+1} - a_{p-k} & \dots & a_p - a_{p-k} \\ 0 & (a_{p-k+1} - a_{p-k})\,a_{p-k+1} & \dots & (a_p - a_{p-k})\,a_p \\ \vdots & \vdots & \ddots & \vdots \\ 0 & (a_{p-k+1} - a_{p-k})\,a_{p-k+1}^{k-1} & \dots & (a_p - a_{p-k})\,a_p^{k-1} \end{vmatrix}
= \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \begin{vmatrix} 1 & \dots & 1 \\ a_{p-k+1} & \dots & a_p \\ \vdots & \ddots & \vdots \\ a_{p-k+1}^{k-1} & \dots & a_p^{k-1} \end{vmatrix}$$

$$= \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \cdot |A_k| = \prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k}) \prod_{1 \le i < l \le k} (a_l - a_i) = \prod_{1 \le i < j \le k+1} (a_j - a_i).$$

The result follows from induction on k.

Note that in the proof, the first matrix equality is given by subtracting the first column from all the other columns, and the second equality is given by subtracting $a_{p-k}$ times the row just above from each row, starting from the bottom so that this operation does not interfere with the later iterations. The last equality is true since $\prod_{j=2}^{k+1} (a_{p-(k+1)+j} - a_{p-k})$ is the product of the sequence $a_{p-k+1} - a_{p-k}, \dots, a_p - a_{p-k}$ and therefore contains exactly the factors missing from $\prod_{1 \le i < j \le k}(a_j - a_i)$ to get $\prod_{1 \le i < j \le k+1}(a_j - a_i)$. The above proof was inspired by [5].

In Zanella et al. [6], $V_2 = [-\sigma_j^{1-i}]_{ij}$ (instead of $[(-\sigma_j)^{1-i}]_{ij}$ as in Equation (2)), but this produces a negative density for some values of p. Since the inclusion of the parentheses only changes the sign of the determinant and not its absolute value, this discrepancy can only affect the sign of the density, which has to be non-negative everywhere. This investigation will therefore use the form given in Equation (2), as this produces a non-negative density. The discrepancy is illustrated in the following example.

Example 2.2. Let $\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $m = 3$. Then $p = 2$, $\sigma_1 = 3$, $\sigma_2 = 1$. Let $V_{2,Z}$ be the form of $V_2(\sigma)$ given in Zanella et al. and $V_{2,C}$ the corrected form in Equation (2). Then

$$|V_{2,Z}| = \det\begin{pmatrix} -1 & -1 \\ -\tfrac{1}{3} & -1 \end{pmatrix} = \tfrac{2}{3}, \qquad |V_{2,C}| = \det\begin{pmatrix} 1 & 1 \\ -\tfrac{1}{3} & -1 \end{pmatrix} = -\tfrac{2}{3},$$

so there is a discrepancy in the sign of the determinant. In order to determine which of them is right, we let $x = [x_1, x_2]^T$ with $x_1 > x_2 > 0$ and get

$$|\Phi(x)| = x_2 - x_1 < 0, \qquad \prod_{l=1}^{2} \xi(x_l) = x_1 \cdot x_2 > 0,$$

$$\Psi(x) = \begin{pmatrix} e^{-x_1/3} & e^{-x_2/3} \\ e^{-x_1} & e^{-x_2} \end{pmatrix} \implies |\Psi(x)| = e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}.$$

Since $x_1 > x_2 > 0$, we get $\frac{x_1}{3} + x_2 - (x_1 + \frac{x_2}{3}) = -\frac{2x_1}{3} + \frac{2x_2}{3} < 0$, so $-(\frac{x_1}{3} + x_2) > -(x_1 + \frac{x_2}{3})$ and hence $|\Psi(x)| > 0$, since $e^x$ is monotonically increasing in x. Further,

$$K = \frac{\prod_{i=1}^{2} (i-1)! \; |\Sigma|^{-3}}{|V_2(\sigma)| \prod_{j=1}^{2} (3-j)! \prod_{k=1}^{2} (2-k)!} = \frac{1 \cdot 3^{-3}}{|V_2(\sigma)| \cdot 2 \cdot 1} = \frac{1}{54 \cdot |V_2(\sigma)|} \implies K_Z = \frac{1}{36}, \quad K_C = -\frac{1}{36}.$$

Combining these results, $f_{\sigma,Z}(x) = K_Z (x_2 - x_1)\big(e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}\big) x_1 x_2 < 0$ and $f_{\sigma,C}(x) = K_C (x_2 - x_1)\big(e^{-(\frac{x_1}{3} + x_2)} - e^{-(x_1 + \frac{x_2}{3})}\big) x_1 x_2 > 0$.

Since fσ is a density function, fσ(x) ≥ 0 for any value of x. Hence, KZ cannot be the right form of K. When implementing the density function with KC in Matlab, the integral over the whole domain is 1 for several different choices of p, Σ and m which suggests that this is indeed the correct form of K.

Using Lemma 2.1, we see that $|\Phi(x)| = \prod_{j=2}^p \prod_{i=1}^{j-1} (x_j - x_i)$ and

$$|V_2(\sigma)| = \prod_{j=2}^p \prod_{i=1}^{j-1} \left( -\frac{1}{\sigma_j} + \frac{1}{\sigma_i} \right) = \prod_{j=2}^p \prod_{i=1}^{j-1} \frac{\sigma_j - \sigma_i}{\sigma_i \sigma_j}.$$

Further, K contains the same factor in both the numerator and the denominator: $\prod_{i=1}^p (i-1)! = \prod_{k=1}^p (p-k)!$, which follows by observing that they are products of the same sequence but in reversed order, and multiplication is commutative. Another simplification that can be made is $|\Sigma| = \prod_{j=1}^p \sigma_j$, and by combining all of these simplifications we arrive at the following form for the density of the eigenvalues:

$$f_\sigma(x) = |\Psi(x)| \cdot \prod_{j=1}^p \left( \frac{x_j^{m-p}}{\sigma_j^m (m-j)!} \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right) \tag{3}$$

$$= |\Psi(x)| \cdot \prod_{j=1}^p \left( \frac{1}{x_j^p (m-j)!} \left(\frac{x_j}{\sigma_j}\right)^m \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right),$$

where the empty product that occurs when $j = 1$ is interpreted as being 1. An interesting observation resulting from this form of the density is that the density of the eigenvalues of a Wishart matrix does not depend on the exact form or structure of Σ, only on the eigenvalues of Σ.
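Equation (3) can be implemented directly. The Python sketch below (the thesis's own implementation of the density was in Matlab; the evaluation point is invented) computes the eigenvalue density for given σ and m, checks its positivity on the ordered domain, and checks the scaling property $f_{a\sigma}(ax) = a^{-p} f_\sigma(x)$ of Lemma 2.3 below:

```python
import math
import numpy as np

def eigenvalue_density(x, sigma, m):
    """Joint density (3) of the ordered eigenvalues x_1 > ... > x_p > 0 of a
    W_p(m, Sigma) matrix, where sigma holds the ordered, distinct eigenvalues
    of Sigma."""
    x, sigma = np.asarray(x, float), np.asarray(sigma, float)
    p = len(sigma)
    # |Psi| with Psi = [exp(-x_j / sigma_i)]_{ij}
    val = np.linalg.det(np.exp(-x[None, :] / sigma[:, None]))
    for j in range(p):                   # 0-based: j corresponds to j + 1 in the text
        val *= x[j] ** (m - p) / (sigma[j] ** m * math.factorial(m - j - 1))
        for i in range(j):               # empty product when j = 0
            val *= sigma[i] * sigma[j] * (x[j] - x[i]) / (sigma[j] - sigma[i])
    return val

sigma = np.array([3.0, 1.0])             # eigenvalues of Sigma as in Example 2.2
m, a = 3, 2.0
x = np.array([4.0, 1.5])                 # a point with x_1 > x_2 > 0

f = eigenvalue_density(x, sigma, m)
print(f > 0)                                               # the density is positive
print(np.isclose(eigenvalue_density(a * x, a * sigma, m),  # Lemma 2.3:
                 a ** (-2) * f))                           # f_{a sigma}(ax) = a^-p f(x)
```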

The simplified form of fσ(x) makes it quite easy to see what effect scaling has on the density.


Lemma 2.3. Let $f_\sigma(x)$ be the density of the ordered eigenvalues of a Wishart distribution $W(m, \Sigma)$, where Σ is a $p \times p$ covariance matrix whose vector of ordered eigenvalues is σ. Let $a > 0$ be a constant. Then

$$f_{a\sigma}(ax) = a^{-p} f_\sigma(x).$$

Proof. First note that $|\Psi_{a\sigma}(ax)| = |\Psi_\sigma(x)|$, since $\Psi_{a\sigma}(ax) = [e^{-(ax_j)/(a\sigma_i)}]_{ij} = [e^{-x_j/\sigma_i}]_{ij} = \Psi_\sigma(x)$. So,

$$f_{a\sigma}(ax) = |\Psi_{a\sigma}(ax)| \cdot \prod_{j=1}^p \left( \frac{1}{(ax_j)^p (m-j)!} \left(\frac{ax_j}{a\sigma_j}\right)^m \prod_{i=1}^{j-1} \frac{(a\sigma_i)(a\sigma_j)(ax_j - ax_i)}{a\sigma_j - a\sigma_i} \right)$$

$$= |\Psi_\sigma(x)| \cdot \frac{1}{a^{p^2}} \prod_{j=1}^p \left( \frac{1}{x_j^p (m-j)!} \left(\frac{x_j}{\sigma_j}\right)^m a^{2(j-1)} \prod_{i=1}^{j-1} \frac{\sigma_i \sigma_j (x_j - x_i)}{\sigma_j - \sigma_i} \right)$$

$$= \frac{a^{2\sum_{j=1}^{p-1} j}}{a^{p^2}} f_\sigma(x) = \frac{a^{p(p-1)}}{a^{p^2}} f_\sigma(x) = \frac{a^{p^2 - p}}{a^{p^2}} f_\sigma(x) = a^{-p} f_\sigma(x).$$

3 Number of required components

In this section it is assumed that $X_1, X_2, \dots, X_m$ is an i.i.d. sample from $N_p(\mu, \Sigma)$. As mentioned previously, the sample covariance matrix in this case is $S = \frac{1}{m-1} \sum_{j=1}^m (X_j - \bar{X})(X_j - \bar{X})^T$. S follows a Wishart distribution $W_p(m-1, \frac{1}{m-1}\Sigma)$, as proven in e.g. [1], page 255. The estimator S is consistent, meaning that $S \xrightarrow{p} \Sigma$ as $m \to \infty$, due to the Law of Large Numbers, as shown in e.g. [2], page 186. S is also clearly unbiased, since $E(S) = (m-1)\frac{1}{m-1}\Sigma = \Sigma$. The loss of a degree of freedom is due to "using it" to calculate the sample mean $\bar{X}$ instead of using the population mean µ, which generally is not known a priori. However, this does not present a problem, since we can consider $\tilde{m} = m - 1$ without losing any properties of the distribution. Since S follows a Wishart distribution, the knowledge about the distribution of eigenvalues for Wishart distributions is used to investigate the number of principal components required to not lose more than a specified amount of the variance in the sample.

The density for the eigenvalues is $f_\sigma$ as in (3). If the proportion of variance we are prepared to give up is $\varepsilon$, then we are interested in whether or not the eigenvalues $\sigma_1 \ge \dots \ge \sigma_p > 0$ of the actual covariance matrix Σ fulfill

$$R_q(\Sigma) := \frac{\sum_{j=1}^q \sigma_j}{\sum_{i=1}^p \sigma_i} = \frac{1}{\mathrm{tr}(\Sigma)} \sum_{j=1}^q \sigma_j \ge 1 - \varepsilon. \tag{4}$$

The notation $R_q(\Sigma)$ is not standard, but this ratio will come up often, so it is prudent to give it a name. Note that $R_q(\Sigma)$ is invariant under scaling of Σ by a positive constant $a > 0$.

A commonly used alternative to Rq(Σ) is the elbow plot in which the eigenvalues are visualized in a graph and the number of eigenvalues used is determined by where the graph stops decreasing sharply. A comparison between the elbow plot and Rq(Σ) can be seen in Figure 1 where the eigenvalues are σ = [10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3]. It can be difficult to determine how much of the variance is lost when just using the elbow plot and in Figure 1, one might only retain about 60% of the variance if using the elbow plot to gauge how many components are required. For this reason, this investigation uses Rq instead which allows us to lose only up to a specified proportion of the variance.

[Figure 1 shows an elbow plot of the eigenvalues $\sigma_j$ against $j$ alongside the cumulative proportions $R_1 = 0.22$, $R_2 = 0.39$, $R_3 = 0.5$, $R_4 = 0.6$, $R_5 = 0.7$, $R_6 = 0.79$, $R_7 = 0.87$, $R_8 = 0.93$, $R_9 = 1$.]

Figure 1: Comparison between elbow plot and $R_q(\Sigma)$
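The $R_q$ values in Figure 1 follow directly from the definition (4); a quick Python check using the figure's eigenvalues:

```python
import numpy as np

sigma = np.array([10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3])   # eigenvalues in Figure 1

# R_q = (sigma_1 + ... + sigma_q) / tr(Sigma) for q = 1, ..., p
R = np.cumsum(sigma) / sigma.sum()
print(np.round(R, 2))      # 0.22, 0.39, 0.5, 0.6, 0.7, 0.79, 0.87, 0.93, 1.0

# smallest q that loses no more than eps = 0.1 of the variance
eps = 0.1
print(int(np.argmax(R >= 1 - eps)) + 1)    # 8
```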

We are now ready to formulate the main question of this investigation: what values of Σ, m, q, ε and α fulfill $P(R_q(S) \ge 1 - \varepsilon) \ge 1 - \alpha$, where $S \sim W_p(m, \frac{1}{m}\Sigma)$?

When trying to answer this question, we consider two cases. In the first case no structure on the covariance matrix Σ is assumed, and in the second case we consider a special structure on the covariance matrix. Since structures of Σ that do not affect the eigenvalues of Σ also do not affect the distribution of eigenvalues for S, the section about structured covariance matrices will focus on a useful case where Σ does not have trivial eigenvalues.

For small values of m and p (the dimension of Σ) we will be able to use the density (3) to get exact probabilities (or at least numerical approximations of said probabilities) while for larger values we have to rely on asymptotic results in order to estimate the probabilities. In all cases, the eigenvalues of Σ are distinct so the formula for the density of the eigenvalues in Equation (2) can be used directly.

Such an asymptotic result can be found in [1] on pp. 479-480; it can be slightly adjusted to get the following result.

Theorem 3.1. With the same setup as previously and the null hypothesis

$$H_0: R_q(\Sigma) < 1 - \varepsilon, \tag{5}$$

a test with approximate significance α for large m is given by rejecting $H_0$ if

$$\sqrt{m}\,\big(1 - \varepsilon - R_q(S)\big) < Z(\alpha) \sqrt{2\left(\frac{\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_1^2 + \dots + \hat\sigma_q^2) + 2\left(\frac{1-\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_{q+1}^2 + \dots + \hat\sigma_p^2)} \tag{6}$$

where S is the estimator of the covariance matrix Σ, $\hat\sigma_j$ are the ordered eigenvalues of S, and $Z(\alpha)$ is the α-quantile of the standard normal distribution.

A proof of this theorem is outlined in [1] on page 480, using Theorem 4.2.3, a proof of which can be found in [3] (where it appears as Theorem A in Section 3.3). This proof is quite lengthy and will thus not be written here.

By studying (6), it is evident that an approximate p-value for the test is given by isolating $Z(\alpha)$ and applying the distribution function of $Z \sim N(0, 1)$ to both sides. Let

$$T_{\varepsilon,q}(S) = \frac{1 - \varepsilon - R_q(S)}{\sqrt{2\left(\frac{\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_1^2 + \dots + \hat\sigma_q^2) + 2\left(\frac{1-\varepsilon}{\mathrm{tr}(S)}\right)^{2} (\hat\sigma_{q+1}^2 + \dots + \hat\sigma_p^2)}}.$$

Then $p = \Phi\big(\sqrt{m} \cdot T_{\varepsilon,q}(S)\big)$, where Φ is the distribution function of the standard normal distribution $N(0,1)$, is the approximate p-value of the test for the observed value of the statistic. Under $H_0$, $T_{\varepsilon,q}(\Sigma)$ is positive, so the test is one-directional, testing whether $T_{\varepsilon,q}(S)$ is negative with a large enough margin. As with the exact method, we observe that the test only depends on the eigenvalues of Σ (estimated by the eigenvalues of the sample covariance matrix S), since $\mathrm{tr}(S) = \sum_{j=1}^p \hat\sigma_j$ and $R_q(S)$ are uniquely determined by the eigenvalues of S.
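In code, the statistic and its approximate p-value can be sketched as follows (a Python rendering of Theorem 3.1; the eigenvalues are the Figure 1 values, reused purely as an illustration):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

def T_stat(eigs, q, eps):
    """T_{eps,q}(S), computed from the ordered eigenvalues of S."""
    eigs = np.sort(np.asarray(eigs, float))[::-1]
    tr = eigs.sum()
    Rq = eigs[:q].sum() / tr
    var = (2 * (eps / tr) ** 2 * np.sum(eigs[:q] ** 2)
           + 2 * ((1 - eps) / tr) ** 2 * np.sum(eigs[q:] ** 2))
    return (1 - eps - Rq) / sqrt(var)

def p_value(eigs, q, eps, m):
    """Approximate p-value Phi(sqrt(m) * T_{eps,q}(S)) for H0: R_q(Sigma) < 1 - eps;
    small values reject H0, i.e. q components are deemed sufficient."""
    return NormalDist().cdf(sqrt(m) * T_stat(eigs, q, eps))

eigs = [10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3]
print(p_value(eigs, q=8, eps=0.1, m=100) < 0.05)   # True: 8 components suffice
print(p_value(eigs, q=3, eps=0.1, m=100) < 0.05)   # False: 3 components do not
```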

Since S is consistent, $T_{\varepsilon,q}(S) \xrightarrow{p} T_{\varepsilon,q}(\Sigma)$, and we want to show that the test is consistent as well. A test is consistent if $p \to 0$ under $H_1: \neg H_0$ as $m \to \infty$. Σ is independent of m, and $\sqrt{m} \cdot T_{\varepsilon,q}(\Sigma)$ is monotonic in m, increasing if $T_{\varepsilon,q}(\Sigma) > 0$ and decreasing if $T_{\varepsilon,q}(\Sigma) < 0$. Further, the denominator of $T_{\varepsilon,q}(\Sigma)$ is always positive, since $\mathrm{tr}(\Sigma) > 0$ and $\varepsilon \in (0, 1)$. This means that

$$\lim_{m\to\infty} p = \begin{cases} \lim_{m\to\infty} F_Z\big(\sqrt{m}\,|T_{\varepsilon,q}(\Sigma)|\big), & T_{\varepsilon,q}(\Sigma) > 0 \\ \lim_{m\to\infty} F_Z\big(-\sqrt{m}\,|T_{\varepsilon,q}(\Sigma)|\big), & T_{\varepsilon,q}(\Sigma) \le 0 \end{cases} = \begin{cases} 1, & T_{\varepsilon,q}(\Sigma) > 0 \\ 0, & T_{\varepsilon,q}(\Sigma) \le 0 \end{cases} \tag{7}$$

The sign of $T_{\varepsilon,q}(\Sigma)$ is independent of the denominator, and the numerator is positive if and only if $R_q(\Sigma) < 1 - \varepsilon$, so (7) reduces to

$$\lim_{m\to\infty} p = \begin{cases} 1, & R_q(\Sigma) < 1 - \varepsilon \\ 0, & R_q(\Sigma) \ge 1 - \varepsilon \end{cases} \tag{8}$$

which shows that the test is consistent.

Remark 3.2. Although it would not affect the test result in the limit, scaling the eigenvalues of Σ by a constant $a > 0$ (thus also scaling the random matrix S by the same constant) does affect the test result for finite values of m. Substituting the eigenvalues $a\hat\sigma_j$ of $aS$ and $\mathrm{tr}(aS) = a \cdot \mathrm{tr}(S)$ into the definition of the statistic and simplifying yields

$$T_{\varepsilon,q}(aS) = \sqrt{a} \cdot T_{\varepsilon,q}(S).$$

Thus, scaling the eigenvalues by $a > 0$ has the same effect on the test as instead scaling m by the same constant a, since $\sqrt{am} \cdot |T_{\varepsilon,q}(S)| = \sqrt{m} \cdot |\sqrt{a} \cdot T_{\varepsilon,q}(S)| = \sqrt{m} \cdot |T_{\varepsilon,q}(aS)|$.

Now, only one obstacle remains before the analysis can begin. For smaller values of m, where we use the density to get exact results, we have not yet considered how to calculate the probability that q eigenvalues are enough, i.e. $P(R_q(S) \ge 1 - \varepsilon)$. We note that this probability can be written in the form of an integral,

$$P(R_q(S) \ge 1 - \varepsilon) = \int_{D_p} f_\sigma(x)\, I_{R_q}(x)\, dx \tag{9}$$

where $D_p = \{x \in \mathbb{R}^p : x_1 > \dots > x_p > 0\}$ and $R_q = \left\{x \in \mathbb{R}^p : \frac{\sum_{i=1}^q x_i}{\sum_{j=1}^p x_j} \ge 1 - \varepsilon \right\}$. This integral would not be very pleasant to solve analytically since, among other complications, it contains the factor $|\Psi(x)|$, the determinant of a matrix whose elements are exponential functions of $x_1, \dots, x_p$. Further, it is a p-dimensional integral, so deterministic numerical methods quickly become infeasible as p grows. We therefore compute the integral using Monte Carlo methods.


3.1 Methodology

The Monte Carlo method used is an independent Monte Carlo integration with the following algorithm, which is a slightly adapted version of Algorithm 2.1 in [4], page 27.

Algorithm 3.3. Given a sample size $N \in \mathbb{Z}^+$ and an integral of the form $\mu = \int_D h(x) f(x)\, dx < \infty$, where $h: D \to \mathbb{R}$ is an integrable function on D and f is the density function of a random variable X supported on D:

1. For $i = 1$ to N:
   (a) Draw $x_i \sim X$ independently of all prior draws.
   (b) If $x_i \notin D$, return to (a). Otherwise, continue.
2. Estimate $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N h(x_i)$.

In the case of (9), the domain of integration $D_p$ is unbounded, since there is no limit on how large $x_1$ can be. It is therefore necessary to impose some limit on $x_1$ when estimating the integral using a computer. A limit L is chosen and the integral in (9) is estimated using $D_L^p = D_p \cap [0, L]^p = \{x \in \mathbb{R}^p : L \ge x_1 > \dots > x_p > 0\}$, together with the uniform distribution $U_{D_L^p}$ on $D_L^p$. The density function of $U_{D_L^p}$ is $f_{U_{D_L^p}}(x) = \frac{1}{\mathrm{vol}(D_L^p)} I_{D_L^p}(x)$, where $\mathrm{vol}(D_L^p)$ is the generalized volume of $D_L^p$.

Proposition 3.4. Let $D_L^p = \{x \in \mathbb{R}^p : L \ge x_1 > \dots > x_p > 0\}$. Then $\mathrm{vol}(D_L^p) = \frac{L^p}{p!}$.

Proof. The volume of $D_L^p$ is the integral of the constant function 1 over $D_L^p$, and $\int_{\{x_1, x_2 : b > x_1 > x_2 > a\}} f(x)\, dx = \int_a^b \int_a^{x_1} f(x)\, dx_2\, dx_1$, so we get the following:

$$\mathrm{vol}(D_L^p) = \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-2}} \int_0^{x_{p-1}} dx_p\, dx_{p-1} \cdots dx_2\, dx_1 = \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-2}} x_{p-1}\, dx_{p-1} \cdots dx_2\, dx_1$$

$$= \int_0^L \int_0^{x_1} \cdots \int_0^{x_{p-3}} \frac{x_{p-2}^2}{2}\, dx_{p-2} \cdots dx_2\, dx_1 = \cdots = \int_0^L \frac{x_1^{p-1}}{(p-1)!}\, dx_1 = \frac{L^p}{p!}$$
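Proposition 3.4 can also be verified by simulation: a uniform point in $[0, L]^p$ falls in the ordered region with probability $1/p!$, since all p! orderings of the coordinates are equally likely. A small Python check (illustrative values):

```python
import math
import numpy as np

rng = np.random.default_rng(2)

p, L, N = 4, 2.0, 200_000
X = rng.uniform(0, L, size=(N, p))

# fraction of points with x_1 > x_2 > ... > x_p, times the cube volume L^p
ordered = np.all(X[:, :-1] > X[:, 1:], axis=1)
vol_estimate = ordered.mean() * L ** p

print(abs(vol_estimate - L ** p / math.factorial(p)) < 0.05)   # True
```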

Since the integrand in (9) is a density function multiplied with an indicator function which is identically 1 for $q = p$ (as $R_p = \mathbb{R}^p$), it is known that $\int_{D_p} f_\sigma(x) I_{R_p}(x)\, dx = 1$, which can be used to calibrate the limit L. If the integral is estimated to be much less than 1 (for example ≈ 0.9) using a certain value of L and $q = p$, this suggests that a significant probability mass lies outside $D_L^p$, and hence the value of L is increased to make sure that the only error comes from the approximation by a finite sum in Algorithm 3.3. It is also desirable that L be as small as possible, since for large enough L very little probability mass is contained in $D_{L'}^p \setminus D_L^p$ when $L' > L$, making the algorithm less efficient for $L'$, as more of the sampled points contribute very little to the estimate of the integral. When a suitable value for L is found, one sample $x^{(1)}, \dots, x^{(N)}$ is drawn from $U_{D_L^p}$ and used to estimate the probabilities $P(R_q(S) \ge 1 - \varepsilon)$ for all $q = 1, \dots, p$ by

$$P(R_q(S) \ge 1 - \varepsilon) \approx \frac{\mathrm{vol}(D_L^p)}{N} \sum_{i=1}^N f_\sigma(x^{(i)})\, I_{R_q}(x^{(i)}).$$

Note that if the eigenvalues and the argument are both scaled by $a > 0$ in the density of the eigenvalues, as in Lemma 2.3, then L can be scaled by the same constant a to get the same estimate of the integral, since

$$\int_{D_{aL}^p} f_{a\sigma}(ax)\, I_{R_q}(ax)\, dx = \frac{\mathrm{vol}(D_{aL}^p)}{\mathrm{vol}(D_L^p)} \int_{D_L^p} a^{-p} f_\sigma(x)\, I_{R_q}(x)\, dx = a^p a^{-p} \int_{D_L^p} f_\sigma(x)\, I_{R_q}(x)\, dx = \int_{D_L^p} f_\sigma(x)\, I_{R_q}(x)\, dx,$$

where the property $\forall a > 0 : x \in R_q \implies ax \in R_q$ is used. By testing different values for L, it was discovered that $L = 1 + \frac{2}{5}m$ seemed to work well whenever the largest eigenvalue was 1, and this value was therefore used in all calculations.

For the asymptotic results, the methodology is slightly different. The goal is to investigate how many eigenvalues are required for a specific m. The following algorithm provides a way to see what a typical test result would look like for many different values of m.

Algorithm 3.5. Given p, σ, ε, α, a lower limit $m_0$ and an upper limit M for m, let Σ be the diagonal matrix with $\sigma_1, \dots, \sigma_p$ on the diagonal. For $m = m_0$ to M:

1. Draw $S_m \sim W(m, \frac{1}{m}\Sigma)$ independently from all other draws and compute the eigenvalues $s_m = (s_{1,m}, \dots, s_{p,m})$ such that $s_{1,m} \ge \dots \ge s_{p,m}$, where the inequalities are a.s. strict.
2. Compute the test statistics $T_{\varepsilon,q}(S_m)$ for $q = 1, \dots, p$ using the eigenvalues computed in the previous step.
3. The number of required components for m is then $\min\{q \in \{1, \dots, p\} : F_Z(\sqrt{m} \cdot T_{\varepsilon,q}(S_m)) < \alpha\}$.

When the true eigenvalues are known, one can compare the results from this algorithm with the number of eigenvalues actually required according to the test in Theorem 3.1, by calculating the test statistic of the true eigenvalues and using the same method as in step 3 to compute the number of required eigenvalues. Due to the Law of Large Numbers, consistency and unbiasedness, this is equivalent to looking at the average of a large sample. This way, one can see how reliable the test is and form a heuristic about how many eigenvalues are actually required given a test result.
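One pass of Algorithm 3.5 for a single m can be sketched in Python as follows (the Wishart draw is built from normal vectors as in Section 2.1, the statistic follows Theorem 3.1, and the eigenvalue sequence is an invented example with linear decay):

```python
import numpy as np
from math import sqrt
from statistics import NormalDist

rng = np.random.default_rng(3)

def required_components(sigma, m, eps=0.1, alpha=0.05):
    """Draw S_m ~ W_p(m, Sigma/m) for Sigma = diag(sigma) and return
    min{q : F_Z(sqrt(m) * T_{eps,q}(S_m)) < alpha}, as in step 3."""
    sigma = np.asarray(sigma, float)
    p = len(sigma)
    # S_m = (1/m) * sum_j Z_j Z_j^T with Z_j ~ N_p(0, diag(sigma))
    Z = rng.normal(size=(m, p)) * np.sqrt(sigma)
    s = np.sort(np.linalg.eigvalsh(Z.T @ Z / m))[::-1]   # ordered eigenvalues
    tr = s.sum()
    for q in range(1, p + 1):
        Rq = s[:q].sum() / tr
        var = (2 * (eps / tr) ** 2 * np.sum(s[:q] ** 2)
               + 2 * ((1 - eps) / tr) ** 2 * np.sum(s[q:] ** 2))
        T = (1 - eps - Rq) / sqrt(var)
        if NormalDist().cdf(sqrt(m) * T) < alpha:
            return q
    return p   # fall back to all components if the test never rejects

sigma = [1.0, 0.8, 0.6, 0.4, 0.2]           # linear decay, largest eigenvalue 1
print(required_components(sigma, m=1000))   # 4: R_4(Sigma) ~ 0.93 is first above 0.9
```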

In order to reduce the number of variables involved, we only consider the case with $\varepsilon = 0.1$ and $\alpha = 0.05$, meaning that no more than 10% of the variance is lost in at least 95% of cases. When estimating the integral (9) using Algorithm 3.3, $N = 10^8$ is used.

3.2 Unstructured Covariance

It was previously noted that the number of components required only depends on the underlying covariance matrix through the eigenvalues of said matrix, so the unstructured covariance case boils down to investigating the probability (9) given a decreasing sequence of positive eigenvalues.


We provide two examples of such sequences and how the number of components required is affected by the difference of the sequences and by changes in m. The sequences are scaled such that the largest eigenvalue is 1, since $R_q$ is invariant under constant scaling of σ. Scaling both the eigenvalues and the argument in $f_\sigma(x)$ by $\frac{1}{m}$ (i.e. using $f_{\sigma/m}(x/m)$) is equivalent to scaling the scale matrix Σ in the Wishart distribution $W(m, \Sigma)$ by $\frac{1}{m}$, which is precisely what is needed to get the desired distribution of S. Note that even if some other scaling were applied to the eigenvalues and the argument in $f_\sigma(x)$, say $f_{a\sigma}(ax)$, the probability would still be the same, as discussed in the previous Section.

The two sequences consist of one with linear decay and one with exponential decay. Both sequences have their respective values of $P(R_q \ge 1 - \varepsilon)$ calculated exactly for $m = 10, 15, 20, 30, 40, 50$, while for larger values of m the asymptotic result is used.

3.2.1 Sequence with linear decay

Let $\sigma_{p,\mathrm{lin}} = \left(1, \frac{p-1}{p}, \dots, \frac{1}{p}\right)$ be the eigenvalues of some symmetric and positive definite $p \times p$ matrix $\Sigma_{p,\mathrm{lin}}$. $S_{p,\mathrm{lin}} \sim W_p(m, \frac{1}{m}\Sigma_{p,\mathrm{lin}})$ is the estimate of $\Sigma_{p,\mathrm{lin}}$, and the following tables contain the estimated probabilities $P(R_q(S_{p,\mathrm{lin}}) \ge 1 - \varepsilon)$ for $2 \le p \le 9$, $1 \le q \le p$ and $\varepsilon = 0.1$.

q\p |     2      |     3      |      4      |      5      |      6      |      7      |      8      |      9
 1  | 3.88·10^-3 | 2.84·10^-7 | 5.64·10^-12 | 3.03·10^-17 | 2.07·10^-23 | 0           | 0           | 0
 2  | 1          | 0.39       | 6.72·10^-3  | 1.77·10^-5  | 1.67·10^-8  | 4.22·10^-12 | 1.03·10^-16 | 1.59·10^-26
 3  |            | 1          | 0.98        | 0.46        | 4.55·10^-2  | 1.54·10^-3  | 1.87·10^-5  | 1.57·10^-7
 4  |            |            | 1           | 1           | 1           | 0.75        | 0.32        | 6.46·10^-2
 5  |            |            |             | 1           | 1.01        | 0.97        | 1.04        | 1.1
 6  |            |            |             |             | 1.01        | 0.97        | 1.04        | 1.11
 7  |            |            |             |             |             | 0.97        | 1.04        | 1.11
 8  |            |            |             |             |             |             | 1.04        | 1.11
 9  |            |            |             |             |             |             |             | 1.11

Table 1: probabilities for m = 10 (cells with q > p are empty; estimates slightly exceeding 1 reflect Monte Carlo estimation error)

q\p |     2      |     3       |      4      |      5      |      6      |      7      |      8      |      9
 1  | 2.34·10^-4 | 5.26·10^-11 | 4.31·10^-19 | 1.76·10^-28 | 1.93·10^-40 | 1.37·10^-56 | 0           | 0
 2  | 1          | 0.19        | 1.05·10^-4  | 2·10^-9     | 4.39·10^-15 | 3.57·10^-21 | 1.79·10^-28 | 0
 3  |            | 1           | 0.94        | 9.94·10^-2  | 3.55·10^-4  | 1.4·10^-7   | 1.02·10^-11 | 8.89·10^-17
 4  |            |             | 1           | 1           | 0.86        | 0.16        | 3.29·10^-3  | 1.35·10^-5
 5  |            |             |             | 1           | 1           | 0.99        | 0.93        | 0.4
 6  |            |             |             |             | 1           | 0.99        | 1.01        | 1.01
 7  |            |             |             |             |             | 0.99        | 1.01        | 1.01
 8  |            |             |             |             |             |             | 1.01        | 1.01
 9  |            |             |             |             |             |             |             | 1.01

Table 2: probabilities for m = 15 (cells with q > p are empty; estimates slightly exceeding 1 reflect Monte Carlo estimation error)
