### U.U.D.M. Project Report 2020:36

Degree project in mathematics, 15 credits. Supervisor: Silvelyn Zwanzig. Examiner: Martin Herschend. June 2020

### Department of Mathematics Uppsala University

## On Asymptotic Properties of Principal Component Analysis

### Johannes Graner

### Abstract

This paper looks at the asymptotic behavior of principal component analysis as the sample size increases. Specifically, the number of components required to retain a given proportion of variance in the sample is investigated for independent and identically distributed observations from a normal distribution. The number of components required is estimated for three special cases using Monte Carlo integration, and the results are compared to those from an asymptotic test for the number of components required.

### Contents

Introduction

1 Principal component analysis

1.1 Singular Value Decomposition

1.2 Deriving the components

2 Random Matrix Theory

2.1 Wishart distribution

3 Number of required components

3.1 Methodology

3.2 Unstructured Covariance

3.2.1 Sequence with linear decay

3.2.2 Sequence with exponential decay

3.3 Structured Covariance

3.4 Analysis

### Introduction

When collecting data, many different types of measurements are often taken for the same object. For example, when investigating the performance of racing cars, one could measure the cars' weights, the size of the engines and the force applied by the brakes, among other things. Taking many different measurements increases the variance in the data, but at the same time each additional measurement can be thought of as adding another dimension to the data, and high-dimensional data is more difficult to analyze.

There exist many techniques for simplifying high-dimensional data and one of them is Principal Component Analysis (PCA) where the dimensionality of the data is reduced by projections. However, it is not always evident how much the data can be simplified without losing too much of the variance in the data.

That problem is investigated in this thesis, specifically focusing on how much the dimensionality can be reduced depending on the sample size. First, some theory and background about PCA is covered, and then two different methods of measuring the number of components (dimensions) required for data which follows a normal distribution are presented. In later sections, the number of components required is investigated for a few special cases and a comparison between the two methods is made.

### 1 Principal component analysis

Given a sample of size n of a p-dimensional random variable X, one could imagine that most of the inherent randomness would be explained well by fewer than p linear combinations of the components of X.

Expressed more formally, for X ∈ R^{p} we try to find Y ∈ R^{q} such that q < p, y_{i} = ∑^{p}_{j=1}a_{ij}x_{j}, i = 1, . . . , q, and we don't lose a significant amount of variance by considering samples of Y instead of samples of X. Since the "amount of randomness" can be measured by variances and covariances, it could be fruitful to consider the covariance matrix of X in order to find the best linear combinations y_{i}. It makes sense to be interested in maximizing randomness, since a constant variable could be completely understood by just one observation. Hence, it is reasonable to choose y_{i} with large variances, and since we want each additional y_{i} to only add new information, the linear combinations y_{i} are chosen such that their covariances are minimized. y_{i} chosen in this manner are called the principal components of X, but in order to make this argument sufficiently rigorous, we first need some basic properties of vectors and matrices.

### 1.1 Singular Value Decomposition

Let [σ_{ij}] = Σ := Cov(X) be the covariance matrix of X, where σ_{ii} = Var(X_{i}), i = 1, . . . , p and σ_{ij} = Cov(X_{i}, X_{j}), i ≠ j. Since Cov(U, V) = Cov(V, U) for any scalar-valued random variables U, V, Σ is symmetric. If the components of X are a.s. linearly independent, then for any a ∈ R^{p}\{0}, 0 < Var(a^{T}X) = a^{T}Σa, so Σ is positive definite.

Having Σ both symmetric and positive definite, we use the spectral theorem to factorize Σ = EΛE^{T}, where

E = [e_{1} e_{2} . . . e_{p}], Λ = diag(λ_{1}, λ_{2}, . . . , λ_{p}),

and e_{i} is the eigenvector of Σ corresponding to the eigenvalue λ_{i}. Since Σ is symmetric and positive definite, we also know that all eigenvalues are real and positive, so we order them as λ_{1} ≥ λ_{2} ≥ · · · ≥ λ_{p} > 0.

In order to properly define and construct the principal components, an additional lemma is needed. The lemma together with its proof can be found in [2] on page 83.

Lemma 1.1. For a positive definite matrix B ∈ M_{p×p}(R) with eigenvalues λ_{1} ≥ λ_{2} ≥ · · · ≥ λ_{p} > 0 and corresponding eigenvectors e_{1}, e_{2}, . . . , e_{p} such that ∀i : e^{T}_{i}e_{i} = 1,

max_{x≠0} (x^{T}Bx)/(x^{T}x) = λ_{1} (attained for x = e_{1}),

max_{x⊥e_{1},...,e_{k}} (x^{T}Bx)/(x^{T}x) = λ_{k+1} (attained for x = e_{k+1}), k = 1, . . . , p − 1,

where x ⊥ y ⇐⇒ x^{T}y = 0.
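Lemma 1.1 can be illustrated numerically; the following sketch (the matrix, dimension and seed are arbitrary choices, not taken from the text) checks that the Rayleigh quotient x^{T}Bx/x^{T}x is maximized by the leading eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = M @ M.T + np.eye(4)          # symmetric positive definite by construction

lam, E = np.linalg.eigh(B)       # eigh returns eigenvalues in ascending order
lam, E = lam[::-1], E[:, ::-1]   # reorder so lam[0] >= ... >= lam[-1]

def rayleigh(x):
    return (x @ B @ x) / (x @ x)

# lambda_1 is attained at e_1 and never exceeded by any other direction.
print(np.isclose(rayleigh(E[:, 0]), lam[0]))
xs = rng.standard_normal((1000, 4))
print(max(rayleigh(x) for x in xs) <= lam[0] + 1e-9)
# For directions orthogonal to e_1, the maximum drops to lambda_2.
print(np.isclose(rayleigh(E[:, 1]), lam[1]))
```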

### 1.2 Deriving the components

With these tools, we are able to determine the best choices of y_{i}.

Definition 1.1. Given a random vector X = [X_{1}, X_{2}, . . . , X_{p}]^{T}, the Principal Components of X are the random variables

Y_{1} = a^{T}_{1}X/√(a^{T}_{1}a_{1}) = (∑^{p}_{j=1}a_{1j}X_{j})/||a_{1}||,

Y_{2} = a^{T}_{2}X/√(a^{T}_{2}a_{2}) = (∑^{p}_{j=1}a_{2j}X_{j})/||a_{2}||,

⋮

Y_{p} = a^{T}_{p}X/√(a^{T}_{p}a_{p}) = (∑^{p}_{j=1}a_{pj}X_{j})/||a_{p}||,

such that Var(Y_{i}) = max_{a∈A_{i}} Var(a^{T}X/√(a^{T}a)), where ||a|| = √(a^{T}a), A_{1} = R^{p}\{0} and A_{i} = {a ∈ R^{p}\{0} : Cov(a^{T}X/√(a^{T}a), Y_{j}) = 0 ∀j < i} for i = 2, . . . , p.

The principal components of X are the p uncorrelated linear combinations with maximal variances, which is precisely what was sought. In order to construct the components in practice, the vectors a_{i} are required. If X has covariance matrix Cov(X) = Σ, then Var(Y_{i}) = (a^{T}_{i}Σa_{i})/(a^{T}_{i}a_{i}) and Cov(Y_{i}, Y_{j}) = (a^{T}_{i}Σa_{j})/(||a_{i}|| · ||a_{j}||). When comparing the formula for Var(Y_{i}) with the expression that is maximized in Lemma 1.1, one can see that they are very similar. Assuming the components of X are a.s. linearly independent, Σ is both symmetric and positive definite, so Lemma 1.1 can be used to determine the principal components of X.

Theorem 1.2. Suppose X is a random vector with a.s. linearly independent components and covariance matrix Σ. Then the principal components of X are given by Y_{i} = e^{T}_{i}X and Var(Y_{i}) = λ_{i}, where λ_{1} ≥ λ_{2} ≥ · · · ≥ λ_{p} > 0 are the eigenvalues of Σ with corresponding normalized eigenvectors e_{1}, e_{2}, . . . , e_{p}.

Proof. To get Y_{1}, we use Lemma 1.1 to see that

max_{a≠0} Var(a^{T}X/√(a^{T}a)) = max_{a≠0} (a^{T}Σa)/(a^{T}a) = λ_{1},

which is attained by a = e_{1}. Now, suppose Y_{i} = e^{T}_{i}X for i ≤ k − 1. We see that for i ≤ k − 1, Cov(Y_{i}, Y_{k}) = 0 ⇐⇒ e^{T}_{i}Σa_{k} = 0, and since Σ is symmetric and positive definite, e^{T}_{i}Σ = λ_{i}e^{T}_{i}, so Cov(Y_{i}, Y_{k}) = 0 ⇐⇒ e^{T}_{i}a_{k} = 0. Hence Y_{k} is uncorrelated with all previous principal components if and only if a_{k} ⊥ e_{1}, . . . , e_{k−1}. We now use Lemma 1.1 once again to determine that

max_{a_{k}⊥e_{1},...,e_{k−1}} Var(Y_{k}) = max_{a_{k}⊥e_{1},...,e_{k−1}} (a^{T}_{k}Σa_{k})/(a^{T}_{k}a_{k}) = λ_{k},

and that this maximum is attained for a_{k} = e_{k}. So, by the induction principle, Y_{i} = e^{T}_{i}X and Var(Y_{i}) = λ_{i} for all i = 1, . . . , p.

We remark that if there are repeated eigenvalues (i.e. λ_{i} = λ_{j} for some i ≠ j), then the principal components are not unique. This is resolved by finding an orthogonal basis for the eigenspace corresponding to the repeated eigenvalue and using those basis vectors (in any order) to determine the principal components corresponding to that eigenvalue.

So far we have only considered the population principal components. Thankfully, the theory carries over to the sample case as well, merely requiring that we consider Σ̂ = S = (1/(n − 1)) ∑^{n}_{i=1}(X_{i} − X̄)(X_{i} − X̄)^{T} for an i.i.d. sample X_{1}, X_{2}, . . . , X_{n}. The eigenvalues and eigenvectors of S are then used to find the sample principal components.

Now that we have the principal components of X, we would like to analyze the dis- tributions of the eigenvalues and eigenvectors of S since they determine the principal components.
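The sample construction described above can be sketched in a few lines; the data-generating covariance matrix and the sample size below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
# Hypothetical i.i.d. sample of a p-dimensional normal vector.
Sigma_true = np.array([[4.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)

# Sample covariance S with the 1/(n-1) normalization used in the text.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)

# Eigenvalues (descending) and eigenvectors of S give the sample components.
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]
Y = Xc @ E                       # sample principal components Y_i = e_i^T X

# Their sample covariance is diagonal with the eigenvalues on the diagonal.
print(np.round(np.cov(Y.T), 3))
```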

### 2 Random Matrix Theory

Since X_{1}, . . . , X_{n} is a sample of observations of the random vector X, the estimated covariance matrix S is inherently random as well. We therefore need some theory for matrices whose elements are random, called random matrices. The theory for random matrices is very broad, so we will focus only on what is needed, which in this case is the distribution of S. If X were allowed to have any distribution, there would be very little control over the distribution of S, so we consider only the case where X follows a normal distribution, since there is a lot of literature covering this case.

### 2.1 Wishart distribution

Let Z_{1}, . . . , Z_{m} ^{iid}∼ N_{p}(0, Σ). Then S = ∑^{m}_{j=1}Z_{j}Z_{j}^{T} ∼ W_{p}(m, Σ), where W_{p}(m, Σ) is the p-dimensional central Wishart distribution with scale matrix Σ and m degrees of freedom.

Since ZZ^{T} = [z_{i}z_{j}]_{ij}, it is clear that S is symmetric, and we observe that if x ∈ R^{p}, then x^{T}ZZ^{T}x = (Z^{T}x)^{T}(Z^{T}x) = ||Z^{T}x||^{2} ≥ 0, with equality only if Z^{T}x = 0. For S, this only happens if Z_{j}^{T}x = 0 ∀j = 1, . . . , m, so if m ≥ p, then a.s. x^{T}Sx = 0 ⇐⇒ x = 0, and we can conclude that S is a.s. positive definite whenever m ≥ p (which will always be the case in this investigation). The density for the aforementioned Wishart-distributed random matrix S is as follows,

f_{S}(X) = { |X|^{(m−p−1)/2}e^{−tr(Σ^{−1}X)/2} / (2^{mp/2}|Σ|^{m/2}Γ_{p}(m/2)), |X| > 0; 0, otherwise }    (1)

where |X| = det(X) and Γ_{p} is the multivariate gamma function. We will not need to work with the density directly, but there are some properties of the Wishart distribution that will prove very useful. One of these is the expectation, E(S) = mΣ. This also implies that if we let U_{j} = (1/√m)Z_{j}, then S̃ = ∑^{m}_{j=1}U_{j}U_{j}^{T} ∼ W_{p}(m, (1/m)Σ) would have expectation E(S̃) = Σ.
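A small simulation confirms that S̃ averages to Σ; the particular Σ, p and m below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m, reps = 3, 50, 2000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 0.5]])
C = np.linalg.cholesky(Sigma)

# Average S_tilde = (1/m) sum_j Z_j Z_j^T over many replications.
acc = np.zeros((p, p))
for _ in range(reps):
    Z = C @ rng.standard_normal((p, m))   # columns are i.i.d. N_p(0, Sigma)
    acc += Z @ Z.T / m
print(np.round(acc / reps, 2))            # close to Sigma
```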

The property that we will be most interested in and find most useful is the fact that since a Wishart matrix is a.s. positive definite, all of its eigenvalues are real and positive.

This means that they can be ordered as σ_{1} ≥ · · · ≥ σ_{p} > 0 and that it is meaningful
to discuss not only the joint distribution of the eigenvalues, but also that of the largest
q eigenvalues. This has been explored in a paper by Zanella et al. [6] and there the
following form for the joint distribution of the eigenvalues of S is presented.

f_{σ}(x) = { K|Φ(x)| · |Ψ(x, σ)| ∏^{p}_{l=1}ξ(x_{l}), x_{1} > · · · > x_{p} > 0; 0, otherwise }

K = (∏^{p}_{i=1}(i − 1)!) |Σ|^{−m} / (|V_{2}(σ)| ∏^{p}_{j=1}(m − j)! ∏^{p}_{k=1}(p − k)!),

Φ(x) = [x^{i−1}_{j}]_{ij},    (2)

Ψ(x, σ) = [e^{−x_{j}/σ_{i}}]_{ij},

V_{2}(σ) = [(−σ_{j})^{1−i}]_{ij},

ξ(x) = x^{m−p},

where σ is the vector of ordered eigenvalues of Σ. If the eigenvalues σ_{j} are not distinct, the determinant |V_{2}(σ)| = 0. According to [6], this can be fixed by considering the limit lim_{σ_{2},...,σ_{l}→σ_{1}} |Ψ(x, σ)|/|V_{2}(σ)|. Even if Σ does not have distinct eigenvalues, the eigenvalues of S are a.s. distinct, since if x_{i} = x_{j} for some i ≠ j, then f_{σ}(x) = 0. The density written in this form looks fairly complicated, so we try to simplify it. We note that Φ and V_{2} seem to have a similar structure.

Definition 2.1. Let a = [a_{1}, a_{2}, . . . , a_{p}]^{T} ∈ R^{p}. Then A = [a^{i−1}_{j}]_{ij}, i, j = 1, . . . , p, is the p × p Vandermonde matrix corresponding to a.

Both Φ and V_{2} have this structure (V_{2}(σ) = [(−σ_{j})^{1−i}]_{ij} = [(−1/σ_{j})^{i−1}]_{ij}), so they are both p × p Vandermonde matrices, whose determinants have fairly simple forms.

Lemma 2.1. Let a = [a_{1}, a_{2}, . . . , a_{p}]^{T} ∈ R^{p} and let A be the p × p Vandermonde matrix corresponding to a (i.e. A = [a^{i−1}_{j}]_{ij}). Then

det(A) = ∏_{1≤i<j≤p}(a_{j} − a_{i}) = ∏^{p}_{j=2}∏^{j−1}_{i=1}(a_{j} − a_{i}).

Proof. Let A_{n} be the Vandermonde matrix that is constructed when considering ã_{n} = [a_{p−n+1}, . . . , a_{p}]^{T} for n = 2, . . . , p. Then

det(A_{2}) = det[1 1; a_{p−1} a_{p}] = a_{p} − a_{p−1},

so the result holds for n = 2. In order to make an induction argument, assume that the result holds for n = k, i.e. that det(A_{k}) = ∏_{1≤i<j≤k}(a_{j} − a_{i}), where the indices run over the entries of ã_{k}. We then get the following:

det(A_{k+1}) = det
| 1            1              . . .  1            1          |
| a_{p−k}      a_{p−k+1}      . . .  a_{p−1}      a_{p}      |
| a^{2}_{p−k}  a^{2}_{p−k+1}  . . .  a^{2}_{p−1}  a^{2}_{p}  |
| ⋮            ⋮              . . .  ⋮            ⋮          |
| a^{k}_{p−k}  a^{k}_{p−k+1}  . . .  a^{k}_{p−1}  a^{k}_{p}  |

= det
| 1            0                          . . .  0                        0                      |
| a_{p−k}      a_{p−k+1} − a_{p−k}        . . .  a_{p−1} − a_{p−k}        a_{p} − a_{p−k}        |
| a^{2}_{p−k}  a^{2}_{p−k+1} − a^{2}_{p−k} . . . a^{2}_{p−1} − a^{2}_{p−k} a^{2}_{p} − a^{2}_{p−k} |
| ⋮            ⋮                          . . .  ⋮                        ⋮                      |
| a^{k}_{p−k}  a^{k}_{p−k+1} − a^{k}_{p−k} . . . a^{k}_{p−1} − a^{k}_{p−k} a^{k}_{p} − a^{k}_{p−k} |

= det
| 1  0                                    . . .  0                                0                            |
| 0  a_{p−k+1} − a_{p−k}                  . . .  a_{p−1} − a_{p−k}                a_{p} − a_{p−k}              |
| 0  (a_{p−k+1} − a_{p−k})a_{p−k+1}       . . .  (a_{p−1} − a_{p−k})a_{p−1}       (a_{p} − a_{p−k})a_{p}       |
| ⋮  ⋮                                    . . .  ⋮                                ⋮                            |
| 0  (a_{p−k+1} − a_{p−k})a^{k−2}_{p−k+1} . . .  (a_{p−1} − a_{p−k})a^{k−2}_{p−1} (a_{p} − a_{p−k})a^{k−2}_{p} |
| 0  (a_{p−k+1} − a_{p−k})a^{k−1}_{p−k+1} . . .  (a_{p−1} − a_{p−k})a^{k−1}_{p−1} (a_{p} − a_{p−k})a^{k−1}_{p} |

= ∏^{k+1}_{j=2}(a_{p−(k+1)+j} − a_{p−k}) · det
| 1  0                . . .  0                0            |
| 0  1                . . .  1                1            |
| 0  a_{p−k+1}        . . .  a_{p−1}          a_{p}        |
| ⋮  ⋮                . . .  ⋮                ⋮            |
| 0  a^{k−2}_{p−k+1}  . . .  a^{k−2}_{p−1}    a^{k−2}_{p}  |
| 0  a^{k−1}_{p−k+1}  . . .  a^{k−1}_{p−1}    a^{k−1}_{p}  |

= ∏^{k+1}_{j=2}(a_{p−(k+1)+j} − a_{p−k}) · det
| 1                . . .  1                1            |
| a_{p−k+1}        . . .  a_{p−1}          a_{p}        |
| ⋮                . . .  ⋮                ⋮            |
| a^{k−2}_{p−k+1}  . . .  a^{k−2}_{p−1}    a^{k−2}_{p}  |
| a^{k−1}_{p−k+1}  . . .  a^{k−1}_{p−1}    a^{k−1}_{p}  |

= ∏^{k+1}_{j=2}(a_{p−(k+1)+j} − a_{p−k}) · det(A_{k}) = ∏^{k+1}_{j=2}(a_{p−(k+1)+j} − a_{p−k}) ∏_{1≤i<l≤k}(a_{l} − a_{i})

= ∏_{1≤i<j≤k+1}(a_{j} − a_{i}).

The result follows from induction on k.

Note that in the proof, the first matrix equality is given by subtracting the first column from all the other columns, and the second equality is given by subtracting from each row a_{p−k} times the row just above it, starting from the bottom so that this operation does not interfere with the later iterations. The last equality is true since ∏^{k+1}_{j=2}(a_{p−(k+1)+j} − a_{p−k}) is the product of the sequence a_{p−k+1} − a_{p−k}, . . . , a_{p} − a_{p−k} and therefore contains exactly the factors missing from ∏_{1≤i<j≤k}(a_{j} − a_{i}) to get ∏_{1≤i<j≤k+1}(a_{j} − a_{i}). The above proof was inspired by [5].
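The determinant formula of Lemma 2.1 is easy to check numerically; the vector a below is an arbitrary example:

```python
import numpy as np
from itertools import combinations
from math import prod

a = np.array([2.0, 3.0, 5.0, 7.0])
p = len(a)
A = np.vander(a, increasing=True).T   # A[i, j] = a_j^i, i.e. the matrix [a_j^{i-1}]_{ij}

lhs = np.linalg.det(A)
rhs = prod(a[j] - a[i] for i, j in combinations(range(p), 2))
print(lhs, rhs)   # both equal 240 for this choice of a
```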

In Zanella et al. [6], V_{2} = [−σ^{1−i}_{j}]_{ij} (instead of [(−σ_{j})^{1−i}]_{ij} as in Equation (2)), but this produces a negative density for some values of p. Since the inclusion of the parentheses only changes the sign of the determinant and not its magnitude, this discrepancy can only affect the sign of the density, which has to be non-negative everywhere. This investigation will therefore use the form given in Equation (2), as this produces a non-negative density. The discrepancy is illustrated in the following example.

Example 2.2. Let Σ = [2 1; 1 2] and m = 3. Then p = 2, σ_{1} = 3, σ_{2} = 1. Let V_{2,Z} be the form of V_{2}(σ) given in Zanella et al. and V_{2,C} be the corrected form in Equation (2). Then

|V_{2,Z}| = det[−1 −1; −1/3 −1] = 2/3,

|V_{2,C}| = det[1 1; −1/3 −1] = −2/3,

so there is a discrepancy in the sign of the determinant. In order to determine which of them is right, we let x = [x_{1}, x_{2}]^{T} with x_{1} > x_{2} > 0 and get

|Φ(x)| = x_{2} − x_{1} < 0,

∏^{2}_{l=1}ξ(x_{l}) = x_{1} · x_{2} > 0,

Ψ(x, σ) = [e^{−x_{1}/3} e^{−x_{2}/3}; e^{−x_{1}} e^{−x_{2}}] =⇒ |Ψ(x)| = e^{−(x_{1}/3+x_{2})} − e^{−(x_{1}+x_{2}/3)}.

Since x_{1} > x_{2} > 0, we get x_{1}/3 + x_{2} − (x_{1} + x_{2}/3) = −2x_{1}/3 + 2x_{2}/3 < −2x_{1}/3 + 2x_{1}/3 = 0 =⇒ −(x_{1}/3 + x_{2}) > −(x_{1} + x_{2}/3) =⇒ |Ψ(x)| > 0, since e^{x} is a monotonically increasing function in x. Finally,

K = (∏^{2}_{i=1}(i − 1)!) |Σ|^{−3} / (|V_{2}(σ)| ∏^{2}_{j=1}(3 − j)! ∏^{2}_{k=1}(2 − k)!) = 1 · 3^{−3} / (|V_{2}(σ)| · 2 · 1) = 1/(54 · |V_{2}(σ)|)

=⇒ K_{Z} = 1/36, K_{C} = −1/36.

Combining these results, f_{σ,Z}(x) = K_{Z}(x_{2} − x_{1})(e^{−(x_{1}/3+x_{2})} − e^{−(x_{1}+x_{2}/3)})x_{1}x_{2} < 0 and f_{σ,C}(x) = K_{C}(x_{2} − x_{1})(e^{−(x_{1}/3+x_{2})} − e^{−(x_{1}+x_{2}/3)})x_{1}x_{2} > 0.

Since f_{σ} is a density function, f_{σ}(x) ≥ 0 for any value of x. Hence, K_{Z} cannot be the right form of K. When implementing the density function with K_{C} in Matlab, the integral over the whole domain is 1 for several different choices of p, Σ and m, which suggests that this is indeed the correct form of K.
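The normalization check mentioned above can be reproduced for the p = 2, m = 3 example (here in Python with SciPy rather than Matlab; a sketch, with K_{C} computed from 1/(54 · |V_{2,C}(σ)|)):

```python
import numpy as np
from scipy.integrate import dblquad

# p = 2, m = 3, sigma = (3, 1) as in Example 2.2.
V2C = np.array([[1.0, 1.0], [-1.0 / 3.0, -1.0]])
K_C = 1.0 / (54.0 * np.linalg.det(V2C))    # negative constant

def f(x2, x1):
    # f_{sigma,C}(x) = K_C * |Phi| * |Psi| * xi(x1) * xi(x2) on x1 > x2 > 0
    phi = x2 - x1
    psi = np.exp(-(x1 / 3 + x2)) - np.exp(-(x1 + x2 / 3))
    return K_C * phi * psi * x1 * x2

# Integrate over the ordered cone x1 > x2 > 0.
total, _ = dblquad(f, 0, np.inf, 0, lambda x1: x1)
print(total)   # approximately 1
```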

Using Lemma 2.1, we see that |Φ(x)| = ∏^{p}_{j=2}∏^{j−1}_{i=1}(x_{j} − x_{i}) and

|V_{2}(σ)| = ∏^{p}_{j=2}∏^{j−1}_{i=1}(−1/σ_{j} + 1/σ_{i}) = ∏^{p}_{j=2}∏^{j−1}_{i=1}(σ_{j} − σ_{i})/(σ_{i}σ_{j}).

Further, K contains the same factor in both the numerator and denominator: ∏^{p}_{i=1}(i − 1)! = ∏^{p}_{k=1}(p − k)!, which can be seen by observing that they are products of the same sequence but in reversed order, and since multiplication is commutative, the two products are equal. Another simplification that can be made is that |Σ| = ∏^{p}_{j=1}σ_{j}, and by combining all of these simplifications we arrive at the following form for the density of the eigenvalues:

f_{σ}(x) = |Ψ(x)| · ∏^{p}_{j=1} ( x^{m−p}_{j}/(σ^{m}_{j}(m − j)!) ∏^{j−1}_{i=1} σ_{i}σ_{j}(x_{j} − x_{i})/(σ_{j} − σ_{i}) )    (3)

= |Ψ(x)| · ∏^{p}_{j=1} ( 1/(x^{p}_{j}(m − j)!) (x_{j}/σ_{j})^{m} ∏^{j−1}_{i=1} σ_{i}σ_{j}(x_{j} − x_{i})/(σ_{j} − σ_{i}) ),

where the empty product that occurs when j = 1 is interpreted as being 1. An interesting observation that results from this form of the density is that the density for the eigenvalues of a Wishart matrix does not depend on the exact form or structure of Σ, instead depending only on the eigenvalues of Σ.

The simplified form of f_{σ}(x) makes it quite easy to see what effect scaling has on the density.

Lemma 2.3. Let f_{σ}(x) be the density of the ordered eigenvalues of a Wishart distribution W(m, Σ), where Σ is a p × p covariance matrix whose vector of ordered eigenvalues is σ, and let a > 0 be a constant. Then

f_{aσ}(ax) = a^{−p}f_{σ}(x).

Proof. First note that |Ψ_{aσ}(ax)| = |Ψ_{σ}(x)|, since Ψ_{aσ}(ax) = [e^{−(ax_{j})/(aσ_{i})}]_{ij} = [e^{−x_{j}/σ_{i}}]_{ij} = Ψ_{σ}(x). So,

f_{aσ}(ax) = |Ψ_{aσ}(ax)| · ∏^{p}_{j=1} ( 1/((ax_{j})^{p}(m − j)!) (ax_{j}/(aσ_{j}))^{m} ∏^{j−1}_{i=1} (aσ_{i})(aσ_{j})(ax_{j} − ax_{i})/(aσ_{j} − aσ_{i}) )

= |Ψ_{σ}(x)| · (1/a^{p^{2}}) ∏^{p}_{j=1} ( 1/(x^{p}_{j}(m − j)!) (x_{j}/σ_{j})^{m} ∏^{j−1}_{i=1} a^{2}σ_{i}σ_{j}(x_{j} − x_{i})/(σ_{j} − σ_{i}) )

= |Ψ_{σ}(x)| · (1/a^{p^{2}}) ∏^{p}_{j=1} ( 1/(x^{p}_{j}(m − j)!) (x_{j}/σ_{j})^{m} a^{2(j−1)} ∏^{j−1}_{i=1} σ_{i}σ_{j}(x_{j} − x_{i})/(σ_{j} − σ_{i}) )

= |Ψ_{σ}(x)| · (a^{2∑^{p−1}_{j=1}j}/a^{p^{2}}) ∏^{p}_{j=1} ( 1/(x^{p}_{j}(m − j)!) (x_{j}/σ_{j})^{m} ∏^{j−1}_{i=1} σ_{i}σ_{j}(x_{j} − x_{i})/(σ_{j} − σ_{i}) )

= (a^{2∑^{p−1}_{j=1}j}/a^{p^{2}}) f_{σ}(x) = (a^{2·p(p−1)/2}/a^{p^{2}}) f_{σ}(x) = (a^{p^{2}−p}/a^{p^{2}}) f_{σ}(x) = a^{−p}f_{σ}(x).
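Lemma 2.3 can be verified numerically for small p by implementing density (3) directly; the values of σ, x, m and a below are arbitrary illustrations:

```python
import numpy as np
from math import factorial

def f_sigma(x, sigma, m):
    # Density (3) of the ordered eigenvalues of W(m, Sigma), evaluated at
    # x[0] > ... > x[p-1] > 0; sigma holds the ordered eigenvalues of Sigma.
    p = len(sigma)
    Psi = np.exp(-np.outer(1.0 / sigma, x))   # Psi[i, j] = exp(-x_j / sigma_i)
    val = np.linalg.det(Psi)
    for j in range(p):
        val *= x[j] ** (m - p) / (sigma[j] ** m * factorial(m - j - 1))
        for i in range(j):
            val *= sigma[i] * sigma[j] * (x[j] - x[i]) / (sigma[j] - sigma[i])
    return val

sigma = np.array([3.0, 1.0])
x = np.array([2.5, 0.7])
m, a = 3, 5.0
lhs = f_sigma(a * x, a * sigma, m)
rhs = a ** (-len(sigma)) * f_sigma(x, sigma, m)
print(lhs, rhs)   # equal up to floating point error
```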

### 3 Number of required components

In this section it is assumed that X_{1}, X_{2}, . . . , X_{m} is an i.i.d. sample from N_{p}(µ, Σ). As mentioned previously, the sample covariance matrix in this case is S = (1/(m − 1)) ∑^{m}_{j=1}(X_{j} − X̄)(X_{j} − X̄)^{T}. S follows a Wishart distribution W_{p}(m − 1, (1/(m − 1))Σ), as proven in e.g. [1], page 255. The estimator S is consistent, meaning that S →^{p} Σ as m → ∞, due to the Law of Large Numbers, as shown in e.g. [2], page 186. S is also clearly unbiased, since E(S) = (m − 1) · (1/(m − 1))Σ = Σ. The loss of a degree of freedom is due to "using it" to calculate the sample mean X̄ instead of using the population mean µ, which generally is not known a priori. However, this does not present a problem, since we can consider m̃ = m − 1 without losing any properties of the distribution. Since S follows a Wishart distribution, the knowledge about the distribution of eigenvalues for Wishart distributions can be used to investigate the number of principal components required to not lose more than a specified amount of the variance in the sample.

The density for the eigenvalues is f_{σ} as in (3). If the proportion of variance we are prepared to give up is ε, then we are interested in whether or not the eigenvalues σ_{1} ≥ · · · ≥ σ_{p} > 0 of the actual covariance matrix Σ fulfill

R_{q}(Σ) := (∑^{q}_{j=1}σ_{j}) / (∑^{p}_{i=1}σ_{i}) = (1/tr(Σ)) ∑^{q}_{j=1}σ_{j} ≥ 1 − ε.    (4)

The notation R_{q}(Σ) is not standard, but this ratio will come up often, so it is prudent to give it a name. Note that R_{q}(Σ) is invariant with respect to scaling Σ by a positive constant a > 0.

A commonly used alternative to R_{q}(Σ) is the elbow plot, in which the eigenvalues are visualized in a graph and the number of eigenvalues used is determined by where the graph stops decreasing sharply. A comparison between the elbow plot and R_{q}(Σ) can be seen in Figure 1, where the eigenvalues are σ = [10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3]. It can be difficult to determine how much of the variance is lost when just using the elbow plot; in Figure 1, one might only retain about 60% of the variance if using the elbow plot to gauge how many components are required. For this reason, this investigation uses R_{q} instead, which allows us to lose only up to a specified proportion of the variance.

[Figure: elbow plot of the eigenvalues σ_{j} against j, alongside the cumulative ratios R_{1} = 0.22, R_{2} = 0.39, R_{3} = 0.5, R_{4} = 0.6, R_{5} = 0.7, R_{6} = 0.79, R_{7} = 0.87, R_{8} = 0.93, R_{9} = 1.]

Figure 1: Comparison between elbow plot and R_{q}(Σ)
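The ratios shown in Figure 1 are straightforward to reproduce from the listed eigenvalues:

```python
import numpy as np

sigma = np.array([10, 7.5, 5, 4.6, 4.6, 4, 3.4, 3, 3])  # eigenvalues in Figure 1
R = np.cumsum(sigma) / sigma.sum()                       # R_q for q = 1, ..., 9
print(np.round(R, 2))
# Smallest q retaining at least 90% of the variance (eps = 0.1):
q_min = int(np.argmax(R >= 0.9)) + 1
print(q_min)   # 8
```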

We are now ready to formulate the main question in this investigation: what values of Σ, m, q, ε and α fulfill P(R_{q}(S) ≥ 1 − ε) ≥ 1 − α, where S ∼ W_{p}(m, (1/m)Σ)?

When trying to answer this question, we consider two cases. In the first case, no structure on the covariance matrix Σ is assumed, and in the second case we consider a special structure on the covariance matrix. Since structures of Σ that do not affect the eigenvalues of Σ also do not affect the distribution of eigenvalues for S, the section about structured covariance matrices will focus on a useful case where Σ does not have trivial eigenvalues.

For small values of m and p (the dimension of Σ) we will be able to use the density (3) to get exact probabilities (or at least numerical approximations of said probabilities) while for larger values we have to rely on asymptotic results in order to estimate the probabilities. In all cases, the eigenvalues of Σ are distinct so the formula for the density of the eigenvalues in Equation (2) can be used directly.

Such an asymptotic result can be found in [1] on pages 479-480; it can be slightly adjusted to get the following result.

Theorem 3.1. With the same setup as previously and the null hypothesis

H_{0} : R_{q}(Σ) < 1 − ε,    (5)

a test with approximate significance α for large m is given by rejecting H_{0} if

√m(1 − ε − R_{q}(S)) < Z_{(α)} √( (2ε^{2}/tr(S))(σ̂_{1}^{2} + · · · + σ̂_{q}^{2}) + (2(1 − ε)^{2}/tr(S))(σ̂_{q+1}^{2} + · · · + σ̂_{p}^{2}) ),    (6)

where S is the estimator for the covariance matrix Σ, σ̂_{j} are the ordered eigenvalues of S and Z_{(α)} is the α-quantile of the standard normal distribution.

A proof of this theorem is outlined in [1] on page 480, using Theorem 4.2.3, a proof of which can be found in [3] (where said theorem is Theorem A in Section 3.3). This proof is quite lengthy and will thus not be written here.

By studying (6), it is evident that an approximate p-value for the test is given by isolating Z_{(α)} and applying the distribution function of Z ∼ N(0, 1) on both sides. Let

T_{ε,q}(S) = (1 − ε − R_{q}(S)) / √( (2ε^{2}/tr(S))(σ̂_{1}^{2} + · · · + σ̂_{q}^{2}) + (2(1 − ε)^{2}/tr(S))(σ̂_{q+1}^{2} + · · · + σ̂_{p}^{2}) ).

Then p = P(T_{ε,q}(S) < T_{obs}) ≈ φ(√m · T_{obs}), where φ is the distribution function of the standard normal distribution N(0, 1), is the approximate p-value of the test when the value of the observed statistic is T_{obs}. Under H_{0}, T_{ε,q}(Σ) is positive, so the test is one-directional, testing whether T_{ε,q}(S) is negative with a large enough margin. As with the exact method, we observe that the test only depends on the eigenvalues of Σ (estimated by the eigenvalues of the sample covariance matrix S), since tr(S) = ∑^{p}_{j=1}σ̂_{j} and R_{q}(S) are uniquely determined from the eigenvalues of S.

Since S is consistent, T_{ε,q}(S) →^{p} T_{ε,q}(Σ), and we want to show that the test is consistent as well. A test is consistent if p → 0 under H_{1} : ¬H_{0} as m → ∞. Σ is independent of m, and √m · T_{ε,q}(Σ) is monotonic in m: increasing if T_{ε,q}(Σ) > 0 and decreasing if T_{ε,q}(Σ) < 0. Further, the denominator of T_{ε,q}(Σ) is always positive, since tr(Σ) > 0 and ε ∈ (0, 1). This means that

lim_{m→∞} p = { lim_{m→∞} F_{Z}(√m · |T_{ε,q}(Σ)|), T_{ε,q}(Σ) > 0; lim_{m→∞} F_{Z}(−√m · |T_{ε,q}(Σ)|), T_{ε,q}(Σ) ≤ 0 } = { 1, T_{ε,q}(Σ) > 0; 0, T_{ε,q}(Σ) ≤ 0 }    (7)

The sign of T_{ε,q}(Σ) is independent of the denominator, and the numerator is positive if and only if R_{q}(Σ) < 1 − ε, so (7) reduces to

lim_{m→∞} p = { 1, R_{q}(Σ) < 1 − ε; 0, R_{q}(Σ) ≥ 1 − ε }    (8)

which shows that the test is consistent.

Remark 3.2. Although it would not affect the test result in the limit, scaling the eigenvalues of Σ by a constant a > 0 (thus also scaling the random matrix S by the same constant) does affect the test result for finite values of m:

T_{ε,q}(aS) = (1 − ε − R_{q}(aS)) / √( (2ε^{2}/tr(aS))((aσ̂_{1})^{2} + · · · + (aσ̂_{q})^{2}) + (2(1 − ε)^{2}/tr(aS))((aσ̂_{q+1})^{2} + · · · + (aσ̂_{p})^{2}) )

= (1 − ε − R_{q}(S)) / √( (2ε^{2}/(a · tr(S)))a^{2}(σ̂_{1}^{2} + · · · + σ̂_{q}^{2}) + (2(1 − ε)^{2}/(a · tr(S)))a^{2}(σ̂_{q+1}^{2} + · · · + σ̂_{p}^{2}) )

= (1 − ε − R_{q}(S)) / ( √a · √( (2ε^{2}/tr(S))(σ̂_{1}^{2} + · · · + σ̂_{q}^{2}) + (2(1 − ε)^{2}/tr(S))(σ̂_{q+1}^{2} + · · · + σ̂_{p}^{2}) ) )

= (1/√a) · T_{ε,q}(S).

Thus, scaling the eigenvalues by a > 0 has the same effect on the test as instead scaling m by 1/a, since √(m/a) · |T_{ε,q}(S)| = √m · |(1/√a) · T_{ε,q}(S)| = √m · |T_{ε,q}(aS)|.

Now, only one obstacle remains before the analysis can begin. For smaller values of m, where we use the density to get exact results, we have not yet considered how to calculate the probability that q eigenvalues are enough, i.e. P(R_{q}(S) ≥ 1 − ε). We note that this probability can be written in the form of an integral,

P(R_{q}(S) ≥ 1 − ε) = ∫_{D^{p}} f_{σ}(x)I_{R_{q}}(x)dx    (9)

where D^{p} = {x ∈ R^{p} : x_{1} > · · · > x_{p} > 0} and R_{q} = {x ∈ R^{p} : (∑^{q}_{i=1}x_{i})/(∑^{p}_{j=1}x_{j}) ≥ 1 − ε}. This integral would not be very pleasant to solve analytically, since it, among other complications, contains the factor |Ψ(x)|, which is the determinant of a matrix whose elements are exponential functions of x_{1}, . . . , x_{p}. Further, it is a p-dimensional integral, so deterministic numerical methods quickly become infeasible as p grows. We therefore compute the integral using Monte Carlo methods.

### 3.1 Methodology

The Monte Carlo method used is independent Monte Carlo integration, with the following algorithm, a slightly adapted version of Algorithm 2.1 in [4] on page 27.

Algorithm 3.3. Given a sample size N ∈ Z^{+} and an integral of the form µ = ∫_{D}h(x) · f(x)dx < ∞, where h : D → R is some integrable function on D and f is the density function for a random variable X which is supported on D,

1. For i = 1 to N:

(a) Draw x_{i} ∼ X independently from all prior draws.

(b) If x_{i} ∉ D, return to (a). Otherwise, continue.

2. Estimate µ̂ = (1/N) ∑^{N}_{i=1}h(x_{i}).
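A minimal sketch of Algorithm 3.3 follows; the helper names and the toy integrand are hypothetical, not taken from [4], and the rejection step never triggers here since the draws are already supported on D:

```python
import numpy as np

def mc_integrate(h, draw, in_domain, N, rng):
    # Algorithm 3.3: mu-hat = (1/N) sum h(x_i), with x_i drawn from the
    # density f; draws falling outside D are rejected and redrawn.
    total = 0.0
    for _ in range(N):
        x = draw(rng)
        while not in_domain(x):   # step 1(b)
            x = draw(rng)
        total += h(x)
    return total / N

# Toy check: mu = integral of x^2 * f(x) dx = E[X^2] = 7/12 for X ~ U(0.5, 1).
rng = np.random.default_rng(3)
est = mc_integrate(h=lambda x: x ** 2,
                   draw=lambda r: r.uniform(0.5, 1.0),
                   in_domain=lambda x: 0.5 < x < 1.0,
                   N=50_000, rng=rng)
print(est)   # close to 7/12 = 0.5833...
```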

In the case of (9), the domain of integration D^{p} is unbounded, since there is no limit on how large x_{1} can be. It is therefore necessary to impose some limit on x_{1} when estimating the integral using a computer. A limit L is chosen, the integral in (9) is estimated using D^{p}_{L} = D^{p} ∩ [0, L]^{p} = {x ∈ R^{p} : L ≥ x_{1} > · · · > x_{p} > 0}, and the uniform distribution U_{D^{p}_{L}} on D^{p}_{L} is used. The density function of U_{D^{p}_{L}} is f_{U}(x) = (1/vol(D^{p}_{L}))I_{D^{p}_{L}}(x), where vol(D^{p}_{L}) is the generalized volume of D^{p}_{L}.

Proposition 3.4. Let D^{p}_{L} = {x ∈ R^{p} : L ≥ x_{1} > · · · > x_{p} > 0}. Then vol(D^{p}_{L}) = L^{p}/p!.

Proof. The volume of D^{p}_{L} is the integral over D^{p}_{L} of the constant function 1, and ∫_{{x_{1},x_{2} : b>x_{1}>x_{2}>a}} f(x)dx = ∫^{b}_{a} ∫^{x_{1}}_{a} f(x)dx_{2}dx_{1}, so we get the following:

vol(D^{p}_{L}) = ∫^{L}_{0} ∫^{x_{1}}_{0} · · · ∫^{x_{p−2}}_{0} ∫^{x_{p−1}}_{0} dx_{p}dx_{p−1} . . . dx_{2}dx_{1}

= ∫^{L}_{0} ∫^{x_{1}}_{0} · · · ∫^{x_{p−2}}_{0} (x_{p−1} − 0) dx_{p−1} . . . dx_{2}dx_{1}

= ∫^{L}_{0} ∫^{x_{1}}_{0} · · · ∫^{x_{p−3}}_{0} (x^{2}_{p−2} − 0)/2 dx_{p−2} . . . dx_{2}dx_{1}

= · · · = ∫^{L}_{0} (x^{p−1}_{1} − 0)/(p − 1)! dx_{1} = (L^{p} − 0)/p! = L^{p}/p!
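Proposition 3.4 says that D^{p}_{L} occupies a fraction 1/p! of the cube [0, L]^{p}, which a quick simulation confirms (p, L and the sample size are arbitrary):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(4)
p, L, N = 4, 2.0, 200_000
pts = rng.uniform(0.0, L, size=(N, p))
# A point of the cube lies in D_L^p iff its coordinates are strictly decreasing.
frac = np.mean(np.all(np.diff(pts, axis=1) < 0, axis=1))
print(frac, 1 / factorial(p))   # both close to 1/24
```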

Since the integrand in (9) for q = p is just the density function (R_{p} = R^{p}, so the indicator function is identically 1), it is known that ∫_{D^{p}} f_{σ}(x)I_{R_{p}}(x)dx = 1, which can be used to calibrate the limit L. If the integral is estimated to much less than 1 (for example ≈ 0.9) using a certain value of L and q = p, this suggests that a significant amount of probability mass is outside D^{p}_{L}, and hence the value of L is increased to make sure that the only error comes from the approximation by a finite sum in Algorithm 3.3. It is also desirable that L be as small as possible, since for large enough L, very little probability mass is contained in D^{p}_{L′}\D^{p}_{L} when L′ > L, making the algorithm less efficient for L′, since more of the sampled points will contribute very little to the estimate of the integral. When a suitable value for L is found, one sample x^{(1)}, . . . , x^{(N)} is drawn from U_{D^{p}_{L}} and used to estimate the probabilities P(R_{q}(S) ≥ 1 − ε) for all q = 1, . . . , p by P(R_{q}(S) ≥ 1 − ε) ≈ (vol(D^{p}_{L})/N) ∑^{N}_{i=1}f_{σ}(x^{(i)})I_{R_{q}}(x^{(i)}), which corresponds to taking h = vol(D^{p}_{L})f_{σ}I_{R_{q}} in Algorithm 3.3. Note that if the eigenvalues and the argument are both scaled by a > 0 in the density of the eigenvalues, as in Lemma 2.3, then L can be scaled by the same constant a to get the same estimate of the integral, since

∫_{D^{p}_{aL}} f_{aσ}(ax)I_{R_{q}}(ax)dx = (vol(D^{p}_{aL})/vol(D^{p}_{L})) ∫_{D^{p}_{L}} a^{−p}f_{σ}(x)I_{R_{q}}(x)dx = a^{p}a^{−p} ∫_{D^{p}_{L}} f_{σ}(x)I_{R_{q}}(x)dx = ∫_{D^{p}_{L}} f_{σ}(x)I_{R_{q}}(x)dx,

where the property ∀a > 0 : x ∈ R_{q} =⇒ ax ∈ R_{q} is used. By testing different values for L, it was discovered that L = 1 + 25/m seemed to work well whenever the largest eigenvalue was 1, and this value was therefore used in all calculations.
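Putting the pieces together for a small case: the sketch below uses the closed-form p = 2, m = 3, σ = (3, 1) density from Example 2.2 (unscaled, so the box limit L and the sample size here are arbitrary choices rather than the calibrated values used in the thesis):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(5)
p, m, eps, L, N = 2, 3, 0.1, 30.0, 400_000

def f_sigma(x1, x2):
    # Closed-form density of the ordered eigenvalues for p = 2, m = 3,
    # sigma = (3, 1), as in Example 2.2 (positive on x1 > x2 > 0).
    psi = np.exp(-(x1 / 3 + x2)) - np.exp(-(x1 + x2 / 3))
    return (x1 - x2) * psi * x1 * x2 / 36.0

# Uniform points on D_L^2 = {L >= x1 > x2 > 0} by rejection from [0, L]^2.
U = rng.uniform(0.0, L, size=(N, 2))
U = U[U[:, 0] > U[:, 1]]
x1, x2 = U[:, 0], U[:, 1]

vol = L ** p / factorial(p)                       # Proposition 3.4
cal = vol * np.mean(f_sigma(x1, x2))              # q = p calibration, ~1
prob = vol * np.mean(f_sigma(x1, x2) * (x1 / (x1 + x2) >= 1 - eps))
print(cal, prob)   # calibration close to 1; prob estimates P(R_1(S) >= 0.9)
```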

For the asymptotic results, the methodology is slightly different. The goal is to investigate how many eigenvalues are required for a specific m. The following algorithm provides a way to see what a typical test result would look like for many different values of m.

Algorithm 3.5. Given p, σ, ε, α, a lower limit m_{0} and an upper limit M for m, let Σ be the diagonal matrix with σ_{1}, . . . , σ_{p} on the diagonal.

• For m = m_{0} to M:

1. Draw S_{m} ∼ W(m, (1/m)Σ) independently from all other draws and compute the eigenvalues s_{m} = (s_{1,m}, . . . , s_{p,m}) such that s_{1,m} ≥ · · · ≥ s_{p,m}, where the inequality is a.s. strict.

2. Compute the test statistics T_{ε,q}(S_{m}) for q = 1, . . . , p using the eigenvalues computed in the previous step.

3. The number of required components for m is then min{q ∈ {1, . . . , p} : F_{Z}(√m · T_{ε,q}(S_{m})) < α}.

When the true eigenvalues are known, one can compare the results from this algorithm with the actual number of eigenvalues required according to the test in Theorem 3.1, by calculating the test statistic of the true eigenvalues and using the same method as in step 3 to compute the number of required eigenvalues. Due to the Law of Large Numbers, consistency and unbiasedness, this is equivalent to looking at the average of a large sample. This way, one can see how reliable the test is and form a heuristic about how many eigenvalues are actually required given a test result.
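A sketch of Algorithm 3.5 follows, using SciPy's Wishart sampler; the spectrum σ and the values of m are hypothetical, and the statistic T_{ε,q} is implemented as read from Theorem 3.1:

```python
import numpy as np
from scipy.stats import wishart, norm

def required_components(sig_hat, m, eps=0.1, alpha=0.05):
    # Step 3 of Algorithm 3.5: smallest q with F_Z(sqrt(m) * T_{eps,q}) < alpha.
    tr = sig_hat.sum()
    R = np.cumsum(sig_hat) / tr
    for q in range(1, len(sig_hat) + 1):
        num = 1 - eps - R[q - 1]
        var = (2 * eps ** 2 * np.sum(sig_hat[:q] ** 2)
               + 2 * (1 - eps) ** 2 * np.sum(sig_hat[q:] ** 2)) / tr
        if norm.cdf(np.sqrt(m) * num / np.sqrt(var)) < alpha:
            return q
    return len(sig_hat)

rng = np.random.default_rng(6)
sigma = np.array([1.0, 0.8, 0.6, 0.4, 0.2])   # hypothetical spectrum
for m in (20, 100, 500):
    S = wishart.rvs(df=m, scale=np.diag(sigma) / m, random_state=rng)
    sig_hat = np.sort(np.linalg.eigvalsh(S))[::-1]
    print(m, required_components(sig_hat, m))
```

Applied to the true eigenvalues with a very large m, the function recovers the number of components for which R_{q}(Σ) ≥ 1 − ε (here q = 4), which is the comparison described above.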

In order to reduce the number of variables involved, we only consider the case with ε = 0.1 and α = 0.05, meaning that no more than 10% of the variance is lost in at least 95% of cases. When estimating the integral (9) using Algorithm 3.3, N = 10^{8} is used.

### 3.2 Unstructured Covariance

It was previously noted that the number of components required only depends on the underlying covariance matrix through the eigenvalues of said matrix, so the unstructured covariance case boils down to investigating the probability (9) given a decreasing sequence of positive eigenvalues.

We provide two examples of such sequences and investigate how the number of components required is affected by the choice of sequence and by changes in m. The sequences are scaled such that the largest eigenvalue is 1, since R_{q} is invariant under scaling of σ by a constant. Scaling both the eigenvalues and the argument in f_{σ}(x) by 1/m (i.e. using f_{(1/m)σ}((1/m)x)) is equivalent to scaling the scale matrix Σ in the Wishart distribution W(m, Σ) by 1/m, which is precisely what is wanted to get the desired distribution of S. Note that even if some other scaling were applied to the eigenvalues and the argument in f_{σ}(x), say f_{aσ}(ax), the probability would still be the same, as discussed in the previous Section.

The two sequences consist of one with linear decay and one with exponential decay. Both sequences have their respective values of P(R_{q} ≥ 1 − ε) calculated exactly for m = 10, 15, 20, 30, 40, 50, while for larger values of m the asymptotic result is used.

### 3.2.1 Sequence with linear decay

Let σ_{p,lin} = [1, (p − 1)/p, . . . , 1/p]^{T} be the eigenvalues of some symmetric and positive definite p × p matrix Σ_{p,lin}. S_{p,lin} ∼ W_{p}(m, (1/m)Σ_{p,lin}) is the estimate of Σ_{p,lin}, and the following tables contain the estimated probabilities P(R_{q}(S_{p,lin}) ≥ 1 − ε) for 2 ≤ p ≤ 9, 1 ≤ q ≤ p and ε = 0.1.

| q \ p | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|
| 1 | 3.88·10^{−3} | 2.84·10^{−7} | 5.64·10^{−12} | 3.03·10^{−17} | 2.07·10^{−23} | 0 | 0 | 0 |
| 2 | 1 | 0.39 | 6.72·10^{−3} | 1.77·10^{−5} | 1.67·10^{−8} | 4.22·10^{−12} | 1.03·10^{−16} | 1.59·10^{−26} |
| 3 |  | 1 | 0.98 | 0.46 | 4.55·10^{−2} | 1.54·10^{−3} | 1.87·10^{−5} | 1.57·10^{−7} |
| 4 |  |  | 1 | 1 | 1 | 0.75 | 0.32 | 6.46·10^{−2} |
| 5 |  |  |  | 1 | 1.01 | 0.97 | 1.04 | 1.1 |
| 6 |  |  |  |  | 1.01 | 0.97 | 1.04 | 1.11 |
| 7 |  |  |  |  |  | 0.97 | 1.04 | 1.11 |
| 8 |  |  |  |  |  |  | 1.04 | 1.11 |
| 9 |  |  |  |  |  |  |  | 1.11 |

Table 1: probabilities for m = 10

| q \ p | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.34·10^{−4} | 5.26·10^{−11} | 4.31·10^{−19} | 1.76·10^{−28} | 1.93·10^{−40} | 1.37·10^{−56} | 0 | 0 |
| 2 | 1 | 0.19 | 1.05·10^{−4} | 2·10^{−9} | 4.39·10^{−15} | 3.57·10^{−21} | 1.79·10^{−28} | 0 |
| 3 |  | 1 | 0.94 | 9.94·10^{−2} | 3.55·10^{−4} | 1.4·10^{−7} | 1.02·10^{−11} | 8.89·10^{−17} |
| 4 |  |  | 1 | 1 | 0.86 | 0.16 | 3.29·10^{−3} | 1.35·10^{−5} |
| 5 |  |  |  | 1 | 1 | 0.99 | 0.93 | 0.4 |
| 6 |  |  |  |  | 1 | 0.99 | 1.01 | 1.01 |
| 7 |  |  |  |  |  | 0.99 | 1.01 | 1.01 |
| 8 |  |  |  |  |  |  | 1.01 | 1.01 |
| 9 |  |  |  |  |  |  |  | 1.01 |

Table 2: probabilities for m = 15