DEGREE PROJECT IN MATHEMATICAL STATISTICS, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Estimating the intrinsic dimensionality of high dimensional data

JOAKIM WINIGER


Master's Thesis in Mathematical Statistics (30 ECTS credits)
Master Programme in Applied and Computational Mathematics (120 credits)
Royal Institute of Technology, year 2015
Supervisor at KTH was Timo Koski
Examiner was Timo Koski

TRITA-MAT-E 2015:08
ISRN-KTH/MAT/E--15/08-SE

Royal Institute of Technology
School of Engineering Sciences (KTH SCI)


Abstract

This report presents a review of some methods for estimating what is known as intrinsic dimensionality (ID). The principle behind intrinsic dimensionality estimation is that it is frequently possible to find some structure in data which makes it possible to re-express the data using fewer coordinates (dimensions). The main objective of the report is to solve a common problem: given a (typically high-dimensional) dataset, determine whether some of its dimensions are redundant, and if so, find a lower dimensional representation of it.

We introduce different approaches for ID estimation, motivate them theoretically and compare them using both synthetic and real datasets. The first three methods estimate the ID of a dataset, while the fourth finds a low dimensional version of the data. This is a useful order in which to organize the task: given an estimate of the ID of a dataset, one can construct a simpler version of the dataset using this number of dimensions.


Sammanfattning

This report provides a review of different methods for estimating the intrinsic dimension (ID). The principle behind the notion of ID is that it is often possible to find structure in data that makes it possible to re-express the same data using fewer coordinates (dimensions). The purpose of this project is to solve a common problem: given a (typically high-dimensional) dataset, determine whether the number of dimensions is redundant, and if so, find a representation of the dataset that has a smaller number of dimensions.

We introduce different approaches for estimating the intrinsic dimension, go through the theory behind them, and compare their results on both synthetic and real datasets. The first three methods estimate the intrinsic dimension of the data, while the fourth finds a lower-dimensional version of a dataset. This order is practical for the purpose of the project: once we have an estimate of the intrinsic dimension of a dataset, we can use this estimate to construct a simpler version of the dataset with that number of dimensions.


Contents

1 Introduction
  1.1 Background
    1.1.1 Concentration of norms and distances
    1.1.2 Spheres inscribed in cubes
    1.1.3 Outer layer of a sphere
  1.2 Outline of the report
2 Methods
  2.1 Grassberger-Procaccia Algorithm
    2.1.1 Discussion
  2.2 U-statistics based method
    2.2.1 Theoretical aspects
    2.2.2 Method
  2.3 Maximum Likelihood method
    2.3.1 Relation to the Hill estimator
    2.3.2 Asymptotic properties
  2.4 A Monte Carlo method
  2.5 Isomap
    2.5.1 Multidimensional Scaling
    2.5.2 Approximating geodesics
    2.5.3 Floyd-Warshall's algorithm
3 Synthetic Dataset
  3.1 Spherical data
  3.2 Three-dimensional spherical data embedded in R^m, d = 2
  3.3 Swiss Roll Manifold
4 Real Datasets
  4.1 Gene expression Analysis
    4.1.1 Description of data
    4.1.2 Maximum Likelihood estimator
    4.1.3 Isomap
  4.2 Image data
  4.3 Binary Data
5 Conclusions and final remarks
6 Appendix
  6.1 Definitions
    6.1.1 Topology and Geometry
    6.1.2 Probability theory
  6.2 Theorems


Chapter 1

Introduction

This report introduces some well-known methods of estimating the intrinsic dimensionality of a dataset. The notion of intrinsic dimensionality (ID) is that it may be possible to re-express a dataset, without loss of information, using fewer coordinates (dimensions). The smallest number of coordinates required is then the intrinsic dimensionality.

An introductory example of a dimensionality reduction method is principal component analysis (PCA), often attributed to Karl Pearson (1901). For a thorough introduction and discussion of PCA, see [1]. The PCA approach to dimensionality reduction is limited to the cases where the underlying pattern of the data is linear: for example, if the data essentially reside in some hyperplane, PCA will effectively find this structure, but if the data are located on a sphere, PCA will not be able to remove any redundant dimensions.

When PCA fails, one approach to dimensionality reduction is to assume that the relevant data lie on an embedded nonlinear manifold within the higher-dimensional space. Such methods are known as manifold learning or simply nonlinear dimensionality reduction (NLDR).

1.1 Background

When treating high dimensional data, reducing the number of dimensions can be hugely beneficial for statistical and computational purposes. The underlying difficulty is known as the curse of dimensionality and refers to the fact that it becomes exponentially more difficult to analyse and store data as the dimension increases.

In addition to requiring substantial computational power, high dimensional vectors have several interesting and sometimes counter-intuitive properties.

1.1.1 Concentration of norms and distances

A common feature of high-dimensional data is that distances between high dimensional vectors concentrate, making many tasks such as clustering or classification difficult, since the dissimilarity of the nearest object and the farthest object may be virtually equal. Demartines [3] proved that for a random vector $X \in \mathbb{R}^D$ with iid components, the euclidean norm $\|\cdot\|_2$ concentrates in the following way:
$$\mathbb{E}[\|X\|_2] = \sqrt{aD - b} + O(D^{-1}), \qquad \mathrm{Var}[\|X\|_2] = b + O(D^{-1/2}),$$


where $a$ and $b$ do not depend on $D$. The conclusion is that for large $D$, the variance of $\|X\|_2$ is much smaller than $\mathbb{E}[\|X\|_2]$. François and Wertz [4] showed that the same phenomenon occurs for arbitrary Minkowski norms, and that the assumption of independent and identically distributed components of $X$ merely makes the proofs less technical.
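This concentration phenomenon is easy to observe numerically. The short simulation below is a minimal sketch (the sample size, the U(0,1) components, and the chosen values of D are illustrative assumptions, not taken from [3]): it prints the mean and standard deviation of the euclidean norm for growing D.

```python
import numpy as np

rng = np.random.default_rng(0)
for D in (10, 100, 1000, 10000):
    X = rng.uniform(size=(5000, D))       # 5000 random vectors with iid U(0,1) components
    norms = np.linalg.norm(X, axis=1)     # euclidean norms ||X||_2
    print(D, round(norms.mean(), 2), round(norms.std(), 3))
# The mean grows like sqrt(D) while the standard deviation stays roughly constant,
# which is exactly the behaviour described by the expansion above.
```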

1.1.2 Spheres inscribed in cubes

The volume of a sphere of radius $r$ in $d$ dimensions is given by
$$V_d(r) = \frac{\pi^{d/2} r^d}{\Gamma(d/2 + 1)}. \qquad (1.1)$$
If this sphere is inscribed in a cube, whose volume is $(2r)^d$, the proportion of volume occupied by the sphere, in relation to the total volume occupied by the cube, is given by $\frac{\pi^{d/2}}{\Gamma(d/2 + 1)\, 2^d}$. Interestingly,
$$\lim_{d \to \infty} \frac{\pi^{d/2}}{\Gamma(d/2 + 1)\, 2^d} = 0,$$
and already at $d = 10$, the fraction is merely 0.0025.
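As a quick numerical check, the ratio above can be evaluated directly; the following snippet is a small sketch using only Python's standard library.

```python
from math import gamma, pi

def ball_to_cube_ratio(d):
    """Fraction of the cube (2r)^d occupied by the inscribed d-dimensional ball."""
    return pi ** (d / 2) / (gamma(d / 2 + 1) * 2 ** d)

for d in (2, 3, 10, 20):
    print(d, ball_to_cube_ratio(d))
# d = 10 already gives roughly 0.0025, and the ratio keeps shrinking towards zero.
```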

1.1.3 Outer layer of a sphere

If we take two concentric spheres $S_r^d$ and $S_R^d$ of radii $r$ and $R$, respectively, such that $R > r$, we get from (1.1) that the proportion of volume occupied by the spherical shell, defined as the set difference $S_R^d \setminus S_r^d$, is given by
$$\frac{V_d(R) - V_d(r)}{V_d(R)} = 1 - \left(\frac{r}{R}\right)^d.$$
Since this ratio approaches 1 as $d \to \infty$, nearly all volume is confined to the spherical shell. This means, for example, that points drawn uniformly from a high-dimensional ball will be found close to the surface.

1.2 Outline of the report

In chapter 2, four different approaches to dimensionality reduction are introduced. The first three methods estimate, in different ways, the intrinsic dimensionality of an input dataset. This is to be contrasted with the fourth method, Isomap, which goes a step further and finds a lower-dimensional embedding of non-linear, high dimensional data.

In chapter 3, the methods are applied to synthetic datasets of known intrinsic dimensionality and other known features. The purpose of this chapter is to verify that the methods produce acceptable results, by comparing the outputs to what is already known about the data.


Chapter 2

Methods

2.1 Grassberger-Procaccia Algorithm

The Grassberger-Procaccia (GP) algorithm was introduced in 1983 [6], and we will introduce it with a heuristic argument which goes as follows.

For an iid sample $X_1, \ldots, X_n$ of $m$-dimensional observations from a continuous density function $f(x)$, the fraction of points falling within a small distance $\epsilon$ from some point $x$ is roughly equal to $f(x)\, \epsilon^m V(m)$, where $V(m)$ is the volume of the unit sphere in $m$ dimensions. This means that we have the following approximation of the number of observations within distance $\epsilon$ from each sample point:
$$\text{average no. of points less than } \epsilon \text{ away from } X_i \approx \epsilon^m n f(X_i) V(m), \qquad (2.1)$$
which depends on $m$ and $\epsilon$ through $\epsilon^m$. If the data have intrinsic dimensionality $d < m$, the approximation holds with $d$ in place of $m$.

To describe the method, introduce the probability measure $\mu$ and let $x$ and $y$ be independent $m$-dimensional random vectors distributed according to $\mu$. Consider the correlation integral $C(\epsilon)$ defined by
$$C(\epsilon) := \int d\mu(x) \int d\mu(y)\, \mathbf{1}\{\|x - y\| \leq \epsilon\}, \qquad (2.2)$$
where $\mathbf{1}(A)$ is the indicator function of the set $A$. From this definition, it follows that
$$C(\epsilon) = \Pr(\|x - y\| \leq \epsilon). \qquad (2.3)$$
By the introductory argument, $C(\epsilon)$ should increase like a power law with the ID as exponent, i.e.
$$C(\epsilon) \sim \epsilon^d. \qquad (2.4)$$
In the paper [2], the authors call the quantity $d$ the correlation dimension, defined as, whenever the limit exists,
$$d := \lim_{\epsilon \to 0} \frac{\log C(\epsilon)}{\log \epsilon}. \qquad (2.5)$$
Now suppose we have $n$ iid observations $X_1, X_2, \ldots, X_n$ from $\mu$. An unbiased estimator of the correlation integral $C(\epsilon)$ is the sample correlation sum $\hat{C}(\epsilon)$ defined by
$$\hat{C}(\epsilon) := \binom{n}{2}^{-1} \sum_{i > j} \mathbf{1}\{\|X_i - X_j\| \leq \epsilon\}. \qquad (2.6)$$
If $\hat{C}(\epsilon)$ is calculated for different values of $\epsilon$, it is possible to estimate $d$ using $\hat{d}$, defined as the slope of the trend line of the measurements $\{(\log \epsilon_j, \log \hat{C}(\epsilon_j))\}_{j=1}^{k}$.
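To make the procedure concrete, the sketch below computes the sample correlation sum (2.6) on a grid of $\epsilon$ values and takes the slope of the fitted log-log trend line as $\hat{d}$. It is a minimal sketch assuming a NumPy array X of shape (n, m); the particular grid of $\epsilon$ values (quantiles of the pairwise distances) is an illustrative choice, not something prescribed above.

```python
import numpy as np

def correlation_dimension(X, n_scales=10):
    """Grassberger-Procaccia estimate: slope of log C(eps) versus log eps."""
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(-1))
    pair_d = dists[np.triu_indices(n, k=1)]            # all pairwise distances, i > j

    # Scales between a small and a moderate quantile of the pairwise distances,
    # so that C(eps) > 0 for every scale in the grid.
    eps = np.logspace(np.log10(np.quantile(pair_d, 0.01)),
                      np.log10(np.quantile(pair_d, 0.5)), n_scales)
    C = np.array([(pair_d <= e).mean() for e in eps])   # sample correlation sum (2.6)

    # Slope of the trend line through (log eps_j, log C(eps_j)).
    slope, _ = np.polyfit(np.log(eps), np.log(C), 1)
    return slope
```

Restricting the fit to small $\epsilon$ matters, since the power law (2.4) is an $\epsilon \to 0$ statement; the upper end of the grid is therefore kept well below the diameter of the data.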


2.1.1 Discussion

There is an inherent problem in this approach: as $\epsilon \to 0$, there will be no pairs $(X_i, X_j)$ at most $\epsilon$ apart from each other in a finite sample, and thus $\hat{C}(\epsilon)$ will equal zero. We remedy this in the next section by replacing the indicator function in (2.6) with a decreasing function of the distance $\|X_i - X_j\|$, a so-called isotropic kernel function.

2.2 U-statistics based method

2.2.1 Theoretical aspects

This method is introduced in [3]. We use the same notation as in the GP algorithm, with $X_1, \ldots, X_n$ being $m$-dimensional observations, and $X$ and $Y$ $m$-dimensional iid vectors with common probability density function $p(x)$. We assume that the observations $X_1, \ldots, X_n$ lie on a $d$-dimensional submanifold $S \subset \mathbb{R}^m$. The approach is to replace the indicator function in (2.6) with a function $k^d$, defined as
$$k^d(\|X_i - X_j\|^2) := \frac{1}{h^d}\, k\big(\|X_i - X_j\|^2 / h^2\big), \qquad (2.7)$$
for a kernel function $k : \mathbb{R}_+ \to \mathbb{R}$ satisfying $\|k\|_\infty = b < \infty$. Here the norm $\|\cdot\|_\infty$ denotes the uniform norm, defined for real functions $f : S \to \mathbb{R}$ by $\|f\|_\infty = \sup\{|f(x)| : x \in S\}$.

Next, the statistic $U_{n,h}(k^d)$ is defined as
$$U_{n,h}(k^d) := \binom{n}{2}^{-1} \sum_{i=1}^{n} \sum_{j=i+1}^{n} k^d\big(\|X_i - X_j\|^2\big). \qquad (2.8)$$

The expectation of $U_{n,h}(k^d)$ is given by
$$\mathbb{E}[U_{n,h}(k^d)] = \mathbb{E}\big[k^d(\|X - Y\|^2)\big] = \int\!\!\int k^d\big(\|x - y\|^2\big)\, p(x)\, p(y)\, dx\, dy. \qquad (2.9)$$
As $h \to 0$, this expectation becomes
$$\lim_{h \to 0} \mathbb{E}[U_{n,h}(k^d)] = C_1 \int p^2(x)\, dx, \qquad (2.10)$$
where $C_1 = \int k(\|y\|^2)\, dy$. Also, the variance of each term is finite:
$$\mathrm{Var}\big[k^d(\|X - Y\|^2)\big] = \sigma^2 < \infty. \qquad (2.11)$$
Moreover, the following convergence theorems are given in [3]:

Theorem 1. Assume that $nh^d \to \infty$. Then, under the stated assumptions,
$$U_{n,h}(k^d) \xrightarrow{P} \mathbb{E}[U_{n,h}(k^d)] \quad \text{as } n \to \infty \text{ and } h \to 0. \qquad (2.12)$$

Theorem 2. Assume that the stronger condition $nh^d / \log n \to \infty$ holds. Then, under the stated assumptions,
$$U_{n,h}(k^d) \xrightarrow{a.s.} \mathbb{E}[U_{n,h}(k^d)] \quad \text{as } n \to \infty \text{ and } h \to 0. \qquad (2.13)$$

We give here a brief sketch of the proof of Theorem 1.

The convergence relies on the fact that $U_{n,h}(k^d)$ can be rewritten as an average of sums of bounded and independent random variables, which means that Bernstein's inequality (see appendix) can be applied. In fact, the random variables considered are even identically distributed. The first step is to rewrite $U_{n,h}(k^d)$ in order to apply Bernstein's inequality. Consider the sum
$$S = \frac{1}{n!} \sum_{n,n} \frac{k^d\big(\|X_{i_1} - X_{i_2}\|^2\big) + \ldots + k^d\big(\|X_{i_{2[n/2]-1}} - X_{i_{2[n/2]}}\|^2\big)}{[n/2]}, \qquad (2.14)$$
where $[x]$ denotes the integer part of $x$ and $\sum_{n,n}$ means that the sum is taken over all $n!$ permutations $(i_1, \ldots, i_n)$ of $(1, \ldots, n)$. $S$ is in fact equal to $U_{n,h}$; the difference is that each term $k^d(\|X_i - X_j\|^2)$, $i \neq j$, in $U_{n,h}$ enters the calculation of $S$ several times, but the end result is the same. Let $S_i$, $i = 1, \ldots, n!$, denote the $i$:th summand of $S$, i.e.
$$S_i = \frac{k^d\big(\|X_{i_1} - X_{i_2}\|^2\big) + \ldots + k^d\big(\|X_{i_{2[n/2]-1}} - X_{i_{2[n/2]}}\|^2\big)}{[n/2]}. \qquad (2.15)$$
Next, we bound each $S_i$ using Bernstein's inequality. As it is formulated in the appendix, we need a uniform bound on each summand's deviation from its mean. Since $\|k\|_\infty = b$, this bound is given by $|b - \mathbb{E}[U_{n,h}]|$. Applying Bernstein's inequality to $S_i$ here gives
$$\Pr\big(|S_i - \mathbb{E}[S_i]| > t\big) < \exp\left( \frac{-[n/2]\, t^2}{2\sigma^2 + \frac{2}{3}\, |b - \mathbb{E}[U_{n,h}]|\, t} \right). \qquad (2.16)$$
This is also a bound on the sum over all permutations of the indices $(1, \ldots, 2[n/2])$. To see this, write
$$\Pr\big(|S - \mathbb{E}[S]| > t\big) \leq \Pr\Big( \max_{1 \leq i \leq n!} |S_i - \mathbb{E}[S_i]| > t \Big) < \exp\left( \frac{-[n/2]\, t^2}{2\sigma^2 + \frac{2}{3}\, |b - \mathbb{E}[U_{n,h}]|\, t} \right). \qquad (2.17)$$
Now, letting $n \nearrow \infty$ yields convergence in probability.

A useful corollary, applicable to both theorems, is that if $k^d$ in (2.7) is replaced by $k^l(\|X_i - X_j\|^2) := \frac{1}{h^l}\, k(\|X_i - X_j\|^2 / h^2)$ for some integer $l > 0$, the limit (in probability and almost surely, respectively) of $U_{n,h}$ changes according to the following corollary:

Corollary 1.
$$\lim_{\substack{n \to \infty \\ h \to 0}} U_{n,h}(k^l) = \begin{cases} \infty & \text{if } l > d \\ 0 & \text{if } l < d \\ \mathbb{E}[U_{n,h}] & \text{if } l = d. \end{cases} \qquad (2.18)$$
This corollary is easily established by writing
$$U_{n,h}(k^l) \xrightarrow{P} \frac{C_1}{h^{l-d}} \int p^2(x)\, dx \quad \text{as } n \to \infty, \qquad (2.19)$$
and as $h \to 0$, the right-hand side diverges if $l > d$ and converges to 0 if $l < d$.

2.2.2 Method

The method is based on the convergence in probability of the statistic $U_{n,h}$ as well as Corollary 1. The usefulness of Corollary 1 is that, since $d$ is in general not known, we may calculate $U_{n,h}(k^l)$ for different $l$'s in $\{l_{\min}, \ldots, l_{\max}\}$, where $l_{\min}$ can be assumed to be smaller than $d$. Then, since the limits in probability of the $U_{n,h}(k^l)$'s are zero when $l < d$, finite and non-zero when $l = d$, and equal to $\infty$ when $l > d$, we can use the following approximation for large $n$, which follows from Corollary 1:
$$\log U_{n, h_l(n)}(k^l) \approx \log\left( C_1 \int p^2(x)\, dx \right) + (d - l) \log h_l(n). \qquad (2.20)$$

In order to reduce variance in the estimates, we follow the procedure in the paper [2] and calculate the U-statistic for different subsamples of the original dataset. In what follows, we let $N$ denote the size of the original data, and $n$ the size of a subsample.

For each plausible $l$, we calculate $U_{n, h_l(n)}$ for $n \in \{[N/5], [N/4], \ldots, N\}$. Then we perform a weighted least squares regression of $\log U_{n, h_l(n)}$ on $\log h_l(n)$. The weights are designed so that larger emphasis is given to subsamples of larger size. Since the slope of the log-log diagram is $d - l$, where $d$ is the true intrinsic dimensionality, we pick the dimension estimate as the value of $l$ that gives the smallest slope in absolute value. In this project, we have used, as in the paper [2], the kernel function $k(x) = (1 - x)_+ = \max(1 - x, 0)$.

Convergence rate for the bandwidth h

As stated in Theorem 1, in order to have convergence in probability, the kernel bandwidth $h$ has to fulfil
$$n h^l \to \infty \quad \text{as } n \to \infty \text{ and } h \to 0. \qquad (2.21)$$
This means that $h$ cannot approach 0 arbitrarily fast, or else (2.21) will not be satisfied. This can be solved by first considering some function $f : \mathbb{N} \to \mathbb{R}$ defined by
$$f(n) = n h^l, \qquad (2.22)$$
and solving for $h$ yields
$$h(n) = \left( \frac{f(n)}{n} \right)^{1/l}. \qquad (2.23)$$
Since we require $h \to 0$, $f(n)$ must approach $\infty$ slower than $n$. Therefore we can, for example, set, for some $c > 0$,
$$f(n) = \frac{1}{c^l} \log n, \qquad (2.24)$$
which gives $h$ as a function of $l$ and $n$:
$$h_l(n) = \frac{1}{c} \left( \frac{\log n}{n} \right)^{1/l}. \qquad (2.25)$$
For the entire dataset of size $N$ we pick
$$h(N) = \frac{1}{N} \sum_{i=1}^{N} d(X_i), \qquad (2.26)$$
where $d(X_i)$ is the nearest-neighbour distance of the point $X_i$. This yields
$$c = \frac{1}{h(N)} \left( \frac{\log N}{N} \right)^{1/l}, \qquad (2.27)$$
which gives for $h_l(n)$:
$$h_l(n) = h(N) \left( \frac{N \log n}{n \log N} \right)^{1/l}. \qquad (2.28)$$
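The pieces of this section can be assembled into the following sketch: the kernel $k(x) = (1 - x)_+$, the bandwidth schedule (2.28), the statistic (2.8), and a weighted log-log regression over subsamples, with the estimate chosen as the $l$ giving the smallest absolute slope. The subsample sizes and the $\sqrt{n}$ regression weights are illustrative assumptions; the text only requires that larger subsamples receive larger weight.

```python
import numpy as np

def u_statistic(X, h, l):
    """U_{n,h}(k^l) from (2.8) with the kernel k(x) = max(1 - x, 0)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sq = sq[np.triu_indices(len(X), k=1)]              # squared distances, i > j
    return np.mean(np.maximum(1.0 - sq / h ** 2, 0.0)) / h ** l

def estimate_id_ustat(X, l_values=range(1, 11), seed=0):
    """Pick the l whose log-log regression slope is closest to zero."""
    N = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    hN = D.min(axis=1).mean()                          # h(N) from (2.26)

    sizes = [N // 5, N // 4, N // 3, N // 2, N]        # subsample sizes
    rng = np.random.default_rng(seed)
    best_l, best_abs_slope = None, np.inf
    for l in l_values:
        log_h, log_u, w = [], [], []
        for n in sizes:
            h = hN * (N * np.log(n) / (n * np.log(N))) ** (1.0 / l)   # (2.28)
            idx = rng.choice(N, size=n, replace=False)
            u = u_statistic(X[idx], h, l)
            if u > 0:
                log_h.append(np.log(h)); log_u.append(np.log(u)); w.append(np.sqrt(n))
        if len(log_h) >= 2:
            slope = np.polyfit(log_h, log_u, 1, w=w)[0]
            if abs(slope) < best_abs_slope:
                best_l, best_abs_slope = l, abs(slope)
    return best_l
```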


2.3 Maximum Likelihood method

In this section, we follow the arguments in the paper [8] and use a Binomial-to-Poisson approximation for the number of observations falling within a predetermined distance from each sample point.

We again use the notation that $X_1, \ldots, X_n$ are iid observations in $\mathbb{R}^m$, where each $X_j$ is in fact a lower dimensional observation $Y_j \in \mathbb{R}^d$ embedded in $\mathbb{R}^m$ via a smooth embedding $i : \mathbb{R}^d \to \mathbb{R}^m$, such that $X_j = i(Y_j)$. The smoothness assumption ensures that close neighbours are preserved. Furthermore, the observations $Y_i$ are sampled from some $d$-dimensional density function $f(x)$, which can be assumed constant in a ball of radius $R$ around $x$. For fixed $t$, the quantity $N(t, x)$, defined as
$$N(t, x) = \sum_{i=1}^{n} \mathbf{1}\{\|X_i - x\| \leq t\}, \qquad (2.29)$$
is a binomial random variable counting the number of observations within distance $t$ from the point $x$. The success probability $p$ is thus given by
$$p = \Pr\{\|X_i - x\| \leq t\}. \qquad (2.30)$$
Next we use the following approximation:
$$\Pr\{\|X_i - x\| \leq t\} \approx f(x) V(d)\, t^d, \qquad (2.31)$$
where $V(d) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$ is the volume of the $d$-dimensional unit ball. The approximation uses the assumption that $f(x)$ is constant in a small region around $x$, and that the proportion of sample points falling within distance $t$ from $x$ is roughly $f(x)$ times the volume of the corresponding ball.

With this in mind, we want to consider $N(t, x)$ for varying $t$, and approximate it by an inhomogeneous Poisson process with intensity $\lambda(t)$ given by
$$\lambda(t) = \frac{dp}{dt} = f(x) V(d)\, d\, t^{d-1}. \qquad (2.32)$$
This formula for the intensity makes sense, since for an inhomogeneous Poisson process the number of occurrences between two times $t_1$ and $t_2$ has a Poisson distribution with mean $\lambda_{[t_1, t_2]}$ given by
$$\lambda_{[t_1, t_2]} = \int_{t_1}^{t_2} \lambda(t)\, dt = f(x) V(d) \big(t_2^d - t_1^d\big), \qquad (2.33)$$
which agrees with the success probability (2.31) for an occurrence in the corresponding region.

The observed jumps of the Poisson process $\{N(t, x),\ 0 \leq t \leq R\}$ are labelled $t_0, t_1, \ldots, t_n$, where $t_0 = 0$ and $t_n = R$. The likelihood for the observations is thus
$$\lambda(t_1) e^{-\lambda_{[t_0, t_1]}}\, \lambda(t_2) e^{-\lambda_{[t_1, t_2]}} \cdots \lambda(t_n) e^{-\lambda_{[t_{n-1}, t_n]}} = \prod_{i=1}^{n} \lambda(t_i) e^{-\lambda_{[t_{i-1}, t_i]}}. \qquad (2.34)$$
This gives the log-likelihood
$$\ell(\lambda) = \sum_{i=1}^{n} \Big( \log \lambda(t_i) - \lambda_{[t_{i-1}, t_i]} \Big). \qquad (2.35)$$
The key parameters are $d$, the dimension, and the density $f(x)$. To simplify the calculations we take $\varphi = \log f(x)$. Using the Riemann-Stieltjes integral, $\ell$ can be written as
$$\ell(d, \varphi) = \int_0^R \log \lambda(t)\, dN(t, x) - \int_0^R \lambda(t)\, dt \qquad (2.36)$$


$$= n\varphi + n \log V(d) + n \log d + (d - 1) \sum_{i=1}^{n} \log(t_i) - e^{\varphi} V(d) R^d. \qquad (2.37)$$
Setting $\partial \ell / \partial \varphi = 0$ yields
$$0 = n - e^{\varphi} V(d) R^d, \qquad (2.38)$$
thus $e^{\varphi} = f(x) = n / (V(d) R^d)$. This gives
$$\ell(d) = n \log n - n \log R^d + n \log d + (d - 1) \sum_{i=1}^{n} \log(t_i) - n. \qquad (2.39)$$
Setting $\partial \ell / \partial d = 0$ gives
$$0 = -n \log R + \frac{n}{d} + \sum_{i=1}^{n} \log t_i, \qquad (2.40)$$
and solving for $d$ yields
$$\hat{d}^{ML}_R = \left[ \frac{1}{n} \sum_{i=1}^{n} \log \frac{R}{t_i} \right]^{-1}. \qquad (2.41)$$
In practice it may be more convenient to fix $T_k(x)$, the distance to the $k$:th nearest neighbour of $x$, rather than $R$. Then the observed jumps are simply the nearest-neighbour distances $T_1(x), \ldots, T_k(x)$, and $N(x, T_k(x)) = k - 1$, which gives
$$\hat{d}^{ML}_k(X_j) = \left[ \frac{1}{k - 1} \sum_{i=1}^{k-1} \log \frac{T_k(X_j)}{T_i(X_j)} \right]^{-1}. \qquad (2.42)$$
The estimator (2.42) estimates the ID locally around the point $X_j$. In practice, however, we average over the estimate at every data point $X_i$, $i = 1, \ldots, n$.
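A compact sketch of the averaged estimator is given below. It assumes a NumPy array X of shape (n, m) and uses a brute-force distance matrix rather than a nearest-neighbour index, which is fine for the sample sizes considered in this report.

```python
import numpy as np

def mle_dimension(X, k=10):
    """Average over all points of the local ML estimates (2.42)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)                 # exclude each point itself
    T = np.sort(D, axis=1)[:, :k]               # T[i, j-1] = distance to the j:th neighbour of X_i
    # Local estimate at each point: [ (1/(k-1)) sum_i log(T_k / T_i) ]^{-1}.
    local = (k - 1) / np.log(T[:, -1:] / T[:, :-1]).sum(axis=1)
    return local.mean()
```

Figures 3.2 and 4.1 show how sensitive the averaged estimate is to the choice of $k$, so in practice it is worth computing it over a whole range of neighbourhood sizes rather than a single $k$.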

2.3.1 Relation to the Hill estimator

A related estimator is due to Hill [7]. Interest lies in power law density functions $f(z)$ of the type
$$f(z) \approx c z^{-a}, \quad z \to \infty, \qquad (2.43)$$
where the value of the exponent $a$ is to be estimated. Given iid observations $Z_1, \ldots, Z_m$ from $f(z)$, condition on the $n$th order statistic $Z_{(n)}$ and assume that
$$\Pr\left( \frac{Z_{(i)}}{Z_{(n)}} \leq z \,\Big|\, Z_{(n)} \right) = c z^a, \quad 0 < z < 1,\ c > 0,\ i = 1, \ldots, n - 1. \qquad (2.44)$$
The maximum likelihood estimate of $a$ under the model (2.44) is given by
$$\hat{H}_{m,n} = -\left[ \frac{1}{m - 1} \sum_{i=1}^{m} \log \frac{Z_{(i)}}{Z_{(m)}} \right]^{-1}. \qquad (2.45)$$
In the context of ID estimation, we take $Z_i = \|X_i - X_{i-1}\|$. The distributional assumption (2.44) is validated by the approximation (2.1). The difference between the estimators (2.42) and (2.45) is that in the former, we estimate the ID locally around $X_j$ and then average over all data points, while in the latter a single global estimate is computed from the ordered distances of the whole sample.

2.3.2 Asymptotic properties

As described in the paper [8], if one assumes that the Binomial and Poisson approximations are exact, we have that for a fixed dimension $d$, $G = d \sum_{j=1}^{k-1} \log(T_k / T_j)$ has a Gamma distribution with shape parameter $k - 1$ and scale parameter 1. Using properties of the Inverse Gamma distribution (see appendix), we have that
$$\mathbb{E}[G^{-1}] = 1/(k - 2),$$
and thus $d\, G^{-1} (k - 2)$ has expectation $d$. Also note that $d\, G^{-1} (k - 2)$ is equal to $\frac{k-2}{k-1}\, \hat{d}^{ML}_k(x)$. As this analysis is asymptotic in $k$, the factor $\frac{k-2}{k-1}$ is not of grave importance. Hence we have that
$$\mathbb{E}[\hat{d}^{ML}_k(x)] = d, \qquad (2.46)$$
and again, by the Inverse Gamma distribution, we get that
$$\mathrm{Var}[\hat{d}^{ML}_k(x)] = \frac{d^2}{k - 3}. \qquad (2.47)$$

2.4 A Monte Carlo method

In [9], the authors discuss an approach to validating the estimates using the traditional non-parametric bootstrap. The advantage of this method is that, given a dataset $X = (X_1, \ldots, X_n)$ and a plug-in estimate $t(X)$, we can assess the uncertainty in the estimate $t(X)$ without making too many assumptions about the distribution of the data. The only assumption is that the $(X_1, \ldots, X_n)$ are independent and identically distributed [5]. The idea is to create new datasets, called bootstrap datasets, by drawing randomly with replacement from the existing data. Then the estimate is calculated for every such new dataset, yielding an empirical distribution for the statistic $t(X)$.

The non-parametric bootstrap is described below:

Algorithm 1 Non-parametric Bootstrap
1: select B, the number of resampled datasets
2: for i = 1, . . . , B do
3:     draw $X_i^* = (X_{1i}^*, \ldots, X_{ni}^*)$ with replacement from $X = (X_1, \ldots, X_n)$
4:     set $t_i^* \leftarrow t(X_i^*)$
5: end for
6: retain the sequence $\{t_i^*\}_{i=1}^{B}$

Using the sequence $\{t_i^*\}_{i=1}^{B}$, we may construct confidence intervals for the estimate $t(X)$.

For the estimators in the previous two sections, the bootstrap method as described above cannot be used. The reason is that when drawing the dataset $(X_1^*, \ldots, X_n^*)$ with replacement from $X = (X_1, \ldots, X_n)$, we usually end up with replicates in the resampled dataset, which yield zero as the nearest neighbour distance. This is a problem since both the maximum likelihood method and the correlation integral method use logarithms of nearest neighbour distances.

The authors in [9] proposed a modification of the bootstrap, applicable to the maximum-likelihood estimator and the correlation integral method. The bootstrap datasets are of size 2n rather than n, and the distances are calculated using consecutive differences, as described below.

1. From $X_1, \ldots, X_n$, draw, with replacement, $X_1^*, \ldots, X_{2n}^*$.
2. Calculate the distances $Z_k^* = \|X_{2k}^* - X_{2k-1}^*\|$, $k = 1, \ldots, n$, and form the order statistics $Z_{(1)}^*, \ldots, Z_{(n)}^*$.

Given the ordered distances $Z_{(1)}^*, \ldots, Z_{(n)}^*$, we estimate the dimension using Hill's estimator $\hat{H}_{m,n}$.
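A sketch of this modified bootstrap is given below. The helper hill_estimate is a hypothetical implementation of (2.45) that uses all of the ordered distances with the largest one as reference; how many upper order statistics to use, and the number of replicates B, are left open in the text and are illustrative assumptions here.

```python
import numpy as np

def hill_estimate(dists):
    """Hill-type estimate of the ID from a sample of distances, cf. (2.45)."""
    z = np.sort(np.asarray(dists))
    z = z[z > 0]                                   # guard against zero distances
    return (len(z) - 1) / np.log(z[-1] / z[:-1]).sum()

def modified_bootstrap(X, B=100, seed=0):
    """Bootstrap distribution of the ID estimate using consecutive differences."""
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=2 * n)       # step 1: draw 2n points with replacement
        Xs = X[idx]
        Z = np.linalg.norm(Xs[1::2] - Xs[0::2], axis=1)   # step 2: consecutive differences
        estimates[b] = hill_estimate(Z)
    return estimates

# A 95% bootstrap confidence interval, as in Figures 3.4 and 4.6, is then
# lo, hi = np.percentile(modified_bootstrap(X), [2.5, 97.5])
```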


2.5 Isomap

The Isomap method is an amalgamation of two methods, multidimensional scaling (MDS) and the Floyd-Warshall algorithm for approximating geodesic distances on manifolds. We start by introducing multidimensional scaling as a simple procedure based on distance matrices for obtaining low-dimensional embeddings of data. Specifically, we show that the rank of the output lower-dimensional data matrix is equal to that of the original data matrix. The discussion is based on the paper [10]. Subsequently we describe how geodesic distances can be approximated using euclidean distances, and how to find them in practice. Then we combine the two by taking the approximated geodesics as the input distance matrix in the MDS scheme, as in the original paper [14].

2.5.1 Multidimensional Scaling

Theory and notation

Multidimensional scaling aims to find a lower dimensional dataset $Y_1, \ldots, Y_n$ while preserving the distance matrix $\Delta^2$ of our $m$-dimensional dataset $X_1, \ldots, X_n$. The distance matrix $\Delta^2$ is defined by
$$\Delta^2 = \begin{pmatrix} d^2(X_1, X_1) & d^2(X_1, X_2) & \cdots & d^2(X_1, X_n) \\ d^2(X_2, X_1) & d^2(X_2, X_2) & \cdots & d^2(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ d^2(X_n, X_1) & d^2(X_n, X_2) & \cdots & d^2(X_n, X_n) \end{pmatrix}, \qquad (2.49)$$

where $d(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is some distance function on the set of possible outcomes $\mathcal{X}$ (here $\mathcal{X} = \mathbb{R}^m$) satisfying, for every $x, y, z$ in $\mathcal{X}$:

1. non-negativity: $d(x, y) \geq 0$
2. identity of indiscernibles: $d(x, y) = 0$ if and only if $x = y$
3. symmetry: $d(x, y) = d(y, x)$
4. subadditivity: $d(x, z) \leq d(x, y) + d(y, z)$

To simplify notation further on, we take

$$X = \begin{pmatrix} X_1^T \\ \vdots \\ X_n^T \end{pmatrix} \in \mathbb{R}^{n \times m} \qquad (2.50)$$
as the data matrix, which has the $m$-dimensional observations as row vectors. We shall also use the matrix of inner products, also known as the Gram matrix $G$:
$$G = XX^T = \begin{pmatrix} X_1^T X_1 & X_1^T X_2 & \cdots & X_1^T X_n \\ X_2^T X_1 & X_2^T X_2 & \cdots & X_2^T X_n \\ \vdots & \vdots & \ddots & \vdots \\ X_n^T X_1 & X_n^T X_2 & \cdots & X_n^T X_n \end{pmatrix}. \qquad (2.51)$$

We also use the notation $X_i^T = (x_{i1}, \ldots, x_{im})$ for the components of the vector $X_i$. If $d(\cdot, \cdot)$ is the euclidean distance, we get that
$$d^2(X_i, X_j) = \|X_i - X_j\|^2 = X_i^T X_i - 2 X_i^T X_j + X_j^T X_j. \qquad (2.52)$$

In view of the last equality (2.52), we can write the distance matrix $\Delta^2$ as
$$\Delta^2 = \mathrm{diag}^*(G)\, \mathbf{1}_n^T - 2G + \mathbf{1}_n\, \mathrm{diag}^*(G)^T, \qquad (2.53)$$
where $\mathbf{1}_n$ is the vector of ones in $n$ dimensions, $\mathbf{1}_n^T = (1, \ldots, 1)$ ($n$ copies), and $\mathrm{diag}^*(\cdot) : \mathbb{R}^{n \times n} \to \mathbb{R}^n$ creates an $n$-dimensional vector consisting of the diagonal elements of an $n \times n$ matrix.

The last equation highlights an important point: given a kernel matrix $G = XX^T$, a distance matrix $\Delta^2$ can be obtained using (2.53). However, the correspondence is not one-to-one, since one can arbitrarily translate and rotate a dataset while preserving the distances between points, so different data matrices (and hence different Gram matrices) yield the same distance matrix $\Delta^2$.

For many reasons it is preferable to use the centered data matrix $X_c$ rather than the original matrix $X$. Centering a vector $a$ with $a^T = (a_1, \ldots, a_n)$ can be described in terms of a matrix $P_c = I - P_1 = I_n - \frac{\mathbf{1}_n \mathbf{1}_n^T}{\|\mathbf{1}_n\|^2}$. This can be made clear from the following:
$$P_c a = (I - P_1) a = a - P_1 a, \qquad (2.54)$$
where $P_1 a$ becomes simply
$$P_1 a = \frac{1}{\|\mathbf{1}_n\|^2} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} = \frac{1}{n} \begin{pmatrix} \sum_{i=1}^n a_i \\ \vdots \\ \sum_{i=1}^n a_i \end{pmatrix} = \begin{pmatrix} \mu_a \\ \vdots \\ \mu_a \end{pmatrix}, \qquad (2.55)$$
where $\mu_a = \frac{1}{n} \sum_{i=1}^n a_i$.

Centering a vector with identical elements yields the zero vector, hence $P_c \mathbf{1}_n = 0$. Furthermore, $P_c$ is symmetric. Combining these two insights, we get that
$$P_c \Delta^2 P_c = P_c\, \mathrm{diag}^*(G) \underbrace{\mathbf{1}_n^T P_c}_{=0} - 2 P_c G P_c + \underbrace{P_c \mathbf{1}_n}_{=0} \mathrm{diag}^*(G)^T P_c = -2 P_c X X^T P_c = -2 P_c X (P_c X)^T = -2 X_c X_c^T, \qquad (2.56)$$
where $X_c = P_c X$. Thus, if we let $B = X_c X_c^T$,
$$B = P_c \left( -\tfrac{1}{2} \Delta^2 \right) P_c. \qquad (2.57)$$

Now $B$ is positive semi-definite, since for any $y \in \mathbb{R}^n$,
$$y^T B y = y^T X_c X_c^T y = (X_c^T y)^T (X_c^T y) = \|X_c^T y\|^2 \geq 0. \qquad (2.58)$$

This fact implies that the eigenvalue decomposition of $B$,
$$B = Q \Lambda Q^T, \qquad (2.59)$$
where $Q \in \mathbb{R}^{n \times n}$ has the eigenvectors of $B$ as columns and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal matrix of ordered eigenvalues, can be further decomposed as
$$B = Q \Lambda^{1/2} \Lambda^{1/2} Q^T = (Q \Lambda^{1/2})(Q \Lambda^{1/2})^T, \qquad (2.60)$$
where $\Lambda^{1/2}$ is also real-valued and diagonal, with diagonal entries equal to the square roots of those of $\Lambda$. The matrix $Q \Lambda^{1/2}$ is used to obtain our lower dimensional dataset $Y = (Y_1, \ldots, Y_n)$.

The following theorem is useful to see this:

Theorem 3. The (column) rank of $X_c$ is equal to that of $Q \Lambda^{1/2}$.

Proof. Denote the rank of $X_c$ by $q \in \{1, \ldots, n\}$. Then $\mathrm{rank}(B) = \mathrm{rank}(X_c X_c^T) = q$. Since $B = Q\Lambda^{1/2}(Q\Lambda^{1/2})^T$, it follows that $\mathrm{rank}(Q\Lambda^{1/2}) = q$. Since the columns of $Q$ are linearly independent by definition, $\mathrm{rank}(Q) = n$, and it follows that $\mathrm{rank}(\Lambda^{1/2}) = q$, since
$$q = \min\big(\underbrace{\mathrm{rank}(Q)}_{=n},\ \mathrm{rank}(\Lambda^{1/2})\big), \qquad (2.61)$$
which forces $\mathrm{rank}(\Lambda^{1/2}) = q$.

Then, since $\Lambda^{1/2}$ has $q$ non-zero diagonal entries, $Q\Lambda^{1/2}$ has $q$ non-zero column vectors, and we can pick the reduced dataset $\hat{Y}$ as
$$\hat{Y} = Q \Lambda^{1/2} \begin{pmatrix} I_{q \times q} \\ 0_{(n-q) \times q} \end{pmatrix}. \qquad (2.62)$$
In practice, we (almost surely) never encounter data matrices with rank less than $n$. Therefore, in the eigendecomposition of the kernel matrix $B = X_c X_c^T$ we will focus on the $d$ dominant eigenvalues, and take the low dimensional dataset $\hat{Y}$ as
$$\hat{Y} = Q \Lambda^{1/2} \begin{pmatrix} I_{d \times d} \\ 0_{(n-d) \times d} \end{pmatrix}. \qquad (2.63)$$

To choose $d$, we need a quantification of the error incurred by dropping dimensions. The Frobenius norm $\|\cdot\|$ of a matrix is the square root of the sum of squares of all components of the matrix. Applied to the centered data matrix $X_c$ we get
$$\|X_c\|^2 = \sum_{i=1}^{n} \|x_i^c\|^2 = \sum_{i=1}^{n} x_i^{cT} x_i^c, \qquad (2.64)$$
where $x_i^c$ is the $i$:th row of $X_c$, written as a column vector. Since the summands $x_i^{cT} x_i^c$ are the elements along the diagonal of the kernel matrix $B = X_c X_c^T$, we get that
$$\|X_c\|^2 = \mathrm{tr}(B). \qquad (2.65)$$
If we use the eigendecomposition of $B$, $B = Q \Lambda Q^T$, we get
$$\mathrm{tr}(B) = \mathrm{tr}(Q \Lambda Q^T) = \mathrm{tr}(\Lambda \underbrace{Q^T Q}_{=I}) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{n} \lambda_i. \qquad (2.66)$$
Similarly,
$$\|\hat{Y}\|^2 = \sum_{i=1}^{d} \lambda_i. \qquad (2.67)$$
The proportion of variation explained by $d$ dimensions is thus
$$\frac{\|\hat{Y}\|^2}{\|X_c\|^2} = \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i}, \qquad (2.68)$$
and the error $e(d)$ can be quantified as
$$e(d) = 1 - \frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i}. \qquad (2.69)$$

Another way of quantifying the error is to look at how well the dissimilarities were maintained, i.e.
$$e_\Delta(d) = \sum_{i > j} \Big( \{\Delta^2\}_{i,j} - \|Y_i - Y_j\|^2 \Big)^2. \qquad (2.70)$$
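The double-centering step (2.57), the eigendecomposition (2.59)-(2.60), the truncation (2.63) and the error measure (2.69) translate almost line by line into the following sketch of classical MDS; it takes a matrix of squared distances Delta2 and a target dimension d as input.

```python
import numpy as np

def classical_mds(Delta2, d):
    """Classical MDS embedding from a squared-distance matrix Delta2."""
    n = Delta2.shape[0]
    Pc = np.eye(n) - np.ones((n, n)) / n        # centering matrix P_c
    B = Pc @ (-0.5 * Delta2) @ Pc               # B = P_c (-1/2 Delta^2) P_c, eq. (2.57)

    lam, Q = np.linalg.eigh(B)                  # B = Q Lambda Q^T
    order = np.argsort(lam)[::-1]               # sort eigenvalues in decreasing order
    lam, Q = lam[order], Q[:, order]

    lam_pos = np.clip(lam, 0.0, None)           # clip tiny negative eigenvalues to zero
    Y = Q[:, :d] * np.sqrt(lam_pos[:d])         # rows of Y are the embedded points, eq. (2.63)

    err = 1.0 - lam_pos[:d].sum() / lam_pos.sum()   # error e(d), eq. (2.69)
    return Y, err
```

With the squared euclidean distances of (2.49) as input, this is essentially PCA on the centered data; Isomap instead feeds in the squared approximate geodesic distances constructed in the next subsection.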


2.5.2 Approximating geodesics

The MDS method takes as input the distance matrix $\Delta^2 = \{d(X_i, X_j)^2\}$. If the data $X_1, \ldots, X_n$ lie on some curved manifold $M$, e.g. a sphere, using a euclidean distance matrix will not uncover the underlying pattern. Instead, distances on the sphere must be used. We describe the theory from the original paper [14] in detail.

A geodesic $d_M(x, y)$ is the shortest path on a manifold between two points on the manifold. It can be approximated using the sum of (short) euclidean distances. The idea behind the approximation method is that since the manifold locally resembles a euclidean space, small euclidean distances will be close to small geodesics.

Definition 1 (Geodesic). The geodesic $d_M(x, y)$ between two points $x$ and $y$ on a manifold $M$ is defined by
$$d_M(x, y) = \inf_{\gamma} \big( \mathrm{length}(\gamma) \big), \qquad (2.71)$$
where $\gamma$ varies over the set of smooth arcs on $M$ connecting $x$ to $y$, and
$$\mathrm{length}(\gamma) = \int \|\gamma'(t)\|\, dt. \qquad (2.72)$$
A geodesic is thus what we intuitively think of when we speak of distances on curvy objects. Distances on the surface of the earth are a natural example of geodesics.

The approximation of the geodesic $d_M$ using the sum of small euclidean distances is theoretically justified in two steps. We will use the approximator $d_G$ as the sum of a number $p$ of small euclidean distances:
$$d_G(x, y) = \min_{p} \big( \|x_0 - x_1\| + \ldots + \|x_{p-1} - x_p\| \big) \qquad (2.73)$$
and the auxiliary distance measure $d_S$
$$d_S(x, y) = \min_{p} \big( d_M(x_0, x_1) + \ldots + d_M(x_{p-1}, x_p) \big). \qquad (2.74)$$
Next we have two theorems which give the results $d_M \approx d_S$ and $d_S \approx d_G$, leading to $d_G \approx d_M$. The proofs are provided in full detail in [16].

Theorem 4. Assume that for every point $m \in M$ there exists a sample point $x_i$ such that $d_M(m, x_i) < \delta$. Then, for $\epsilon, \delta > 0$ such that $4\delta < \epsilon$,
$$d_M(x, y) \leq d_S(x, y) \leq (1 + 4\delta/\epsilon)\, d_M(x, y). \qquad (2.75)$$

Proof. The first inequality follows from the triangle inequality. To show the second inequality, let $\gamma$ be any piecewise smooth arc connecting $x$ to $y$ such that $\mathrm{length}(\gamma) = \ell$. Note that $d_M(x, y) \leq \ell$. Now, if $\ell \leq \epsilon - 2\delta$, $x$ and $y$ are connected by an edge in $G$, which is a valid path in our graph. If $\ell > \epsilon - 2\delta$, we split up $\gamma$ into a sequence of points $\{\gamma_0, \ldots, \gamma_p\}$ with $\gamma_0 = x$ and $\gamma_p = y$. We thus consider the path
$$x, \gamma_1, \ldots, \gamma_{p-1}, y, \qquad (2.76)$$
where $\mathrm{length}(\gamma_i, \gamma_{i+1}) = \ell_1 = \epsilon - 2\delta$ for $i = 1, \ldots, p - 2$, and $d_M(x, \gamma_1) = d_M(\gamma_{p-1}, y) = \ell_0$ with $\frac{\epsilon - 2\delta}{2} \leq \ell_0 \leq \epsilon - 2\delta$. Thus $\ell_0$ and $\ell_1$ are such that $2\ell_0 + (p - 2)\ell_1 = \ell$. By assumption, for each $\gamma_i$ there exists a data point $X_i$ such that $d_M(\gamma_i, X_i) < \delta$. Now, using the triangle inequality for $d_M(\cdot, \cdot)$ we obtain
$$d_M(X_i, X_{i+1}) \leq d_M(X_i, \gamma_i) + d_M(\gamma_i, \gamma_{i+1}) + d_M(\gamma_{i+1}, X_{i+1}) \leq \ell_1 + 2\delta = \epsilon = \frac{\ell_1\, \epsilon}{\epsilon - 2\delta}. \qquad (2.77)$$
Similarly,
$$d_M(x, X_1) \leq d_M(x, \gamma_1) + d_M(\gamma_1, X_1) \leq \ell_0 + \delta \leq \frac{\ell_0\, \epsilon}{\epsilon - 2\delta}, \qquad (2.78)$$
since $\ell_0 \geq (\epsilon - 2\delta)/2$. Combining these inequalities yields an upper bound on the total length of the path $x, X_1, \ldots, X_{p-1}, y$ in $G$:
$$d_M(x, X_1) + d_M(X_1, X_2) + \ldots + d_M(X_{p-2}, X_{p-1}) + d_M(X_{p-1}, y) \leq \qquad (2.79)$$
$$\leq \frac{2\ell_0\, \epsilon}{\epsilon - 2\delta} + \frac{(p - 2)\ell_1\, \epsilon}{\epsilon - 2\delta} = \frac{\ell\, \epsilon}{\epsilon - 2\delta} = \ell\, \frac{1}{1 - \frac{2\delta}{\epsilon}} < \ell\left(1 + \frac{4\delta}{\epsilon}\right), \qquad (2.80)$$
where the last inequality stems from the fact that for $x \in (0, 1/2]$, $1/(1 - x) < 1 + 2x$. Taking the infimum over all arcs $\gamma$ yields $\ell = d_M(x, y)$, and thus the desired result is obtained:
$$d_S(x, y) \leq \left(1 + \frac{4\delta}{\epsilon}\right) d_M(x, y). \qquad (2.81)$$

To show that dS ≈ dG , we need to introduce some further concepts from differential geometry.

Definition 2 (Minimum radius of curvature). The minimum radius of curvature $r_0(M)$ of a manifold $M$ is defined by
$$r_0 = \frac{1}{\sup_{\gamma, t} \|\gamma''(t)\|}, \qquad (2.82)$$
where $\gamma(t)$ varies over all unit-speed geodesics in $M$, that is, curves $\gamma(t)$ for which $\|\gamma'(t)\| = 1$ for all $t$ in $D_\gamma$.

The minimum radius of curvature can be interpreted as a measure of how curly or entangled the manifold is, with a small r0 meaning that the manifold curls sharply.

Definition 3 (Minimum branch separation). The minimum branch separation $s_0(M)$ is defined as the largest positive number for which $\|x - y\| < s_0$ implies that $d_M(x, y) \leq \pi r_0$ for any $x, y \in M$.

The minimum branch separation can be understood as an upper bound on the pairwise distance between observations in the input space for which the corresponding distance on the manifold is bounded by $\pi r_0$. We also take the following lemma without proof:

Lemma 5. If the geodesic distance between $x$ and $y$ satisfies $d_M(x, y) = \ell \leq \pi r_0$, then
$$2 r_0 \sin\!\left(\frac{\ell}{2 r_0}\right) \leq \|x - y\| \leq \ell. \qquad (2.83)$$

Corollary 2 (to Lemma 5). Since $\sin(t) \geq t - t^3/6$ for $t \geq 0$, we get the weaker version
$$\ell\big(1 - \ell^2/(24 r_0^2)\big) \leq \|x - y\| \leq \ell, \qquad (2.84)$$
as well as the even more weakened version
$$\frac{2}{\pi}\, \ell \leq \|x - y\| \leq \ell. \qquad (2.85)$$
We are ready for the next theorem.

Theorem 6. Suppose $x_i, x_{i+1} \in M$ are such that $\|x_i - x_{i+1}\| \leq \frac{2}{\pi} r_0 \sqrt{24\lambda}$ for some $\lambda > 0$ and $\|x_i - x_{i+1}\| \leq s_0$. Suppose further that the geodesic $d_M(x_i, x_{i+1})$ is of length $\ell$. Then
$$(1 - \lambda)\, \ell \leq \|x_i - x_{i+1}\| \leq \ell. \qquad (2.86)$$

Proof. From the second assumption we get that $\ell \leq \pi r_0$. Then we take the weakest version of the lemma (2.85) together with the first assumption to obtain
$$\frac{2}{\pi}\, \ell \leq \|x_i - x_{i+1}\| \leq \frac{2}{\pi} r_0 \sqrt{24\lambda}, \qquad (2.87)$$
so that $\ell \leq r_0 \sqrt{24\lambda}$. Solving for $1 - \lambda$ we get $1 - \lambda \leq 1 - \frac{\ell^2}{24 r_0^2}$, which together with (2.84) gives the desired
$$(1 - \lambda)\, \ell \leq \left(1 - \frac{\ell^2}{24 r_0^2}\right) \ell \leq \|x_i - x_{i+1}\| \leq \ell. \qquad (2.88)$$

Since dM ≈ dS and dS ≈ dG, we get the desired result dG≈ dM.

2.5.3 Floyd-Warshall's algorithm

Finding $d_G(X_i, X_j)$ in practice can be cast as a graph analysis problem of finding, for each pair of nodes $(i, j)$, the sequence of edges with minimum sum of edge weights, where the graph $G = (V, E)$ has the data points $X_1, \ldots, X_n$ as nodes and the euclidean distance $\|X_i - X_j\|$ as the edge weight $w(i, j)$ whenever $\|X_i - X_j\|$ is less than or equal to some threshold $\epsilon$. If $\|X_i - X_j\| > \epsilon$, we set $w(i, j) = \infty$. This is because we do not want to shortcut the manifold, which makes us allow only euclidean distances smaller than some $\epsilon$ that should be chosen so that $\|X_i - X_j\| \leq \epsilon$ implies $\|X_i - X_j\| \approx d_M(X_i, X_j)$. Figure 2.1 shows why the choice of $\epsilon$ is crucial for a correct approximation of geodesics.

In the original paper [14], the authors propose Dijkstra's and Floyd-Warshall's algorithms. In this report we will focus on the latter.

Figure 2.1: A dataset where the underlying manifold structure has been circumvented by choosing a too large $\epsilon$. The distance covered by the red line is too large, and hence the geodesic is vastly underestimated.

The Floyd-Warshall algorithm is an example of a dynamic programming algorithm, which is applicable to problems containing what are called overlapping subproblems and optimal substructures. In our setting, these features materialize in the fact that we may, for each pair $(i, j)$, reduce the problem to finding the optimal path in a subset of all possible paths. Then, this subset is expanded and the optimal path between each pair $(i, j)$ in the expanded subset is tested against the optimal path in the previous subset.


Algorithm 2 Floyd-Warshall Algorithm
1: shortestPath(i, j) ← w(i, j) for all pairs (i, j)
2: for k = 1, . . . , n do
3:     for i = 1, . . . , n do
4:         for j = 1, . . . , n do
5:             if shortestPath(i, j) > shortestPath(i, k) + shortestPath(k, j) then
6:                 shortestPath(i, j) ← shortestPath(i, k) + shortestPath(k, j)
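In code, the $\epsilon$-neighbourhood graph described above and Algorithm 2 combine into a few lines; together with the MDS routine this completes the Isomap pipeline. The relaxation loop is written here in a vectorized form, one intermediate node k at a time, which is equivalent to the pseudo-code above.

```python
import numpy as np

def geodesic_distances(X, eps):
    """Approximate geodesics: eps-neighbourhood graph plus Floyd-Warshall."""
    diff = X[:, None, :] - X[None, :, :]
    W = np.sqrt((diff ** 2).sum(-1))         # euclidean edge weights w(i, j)
    W[W > eps] = np.inf                      # no edge if the points are more than eps apart

    D = W.copy()
    for k in range(len(X)):                  # Algorithm 2: relax every pair through node k
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D                                 # entries left at inf lie in different components

# Isomap then amounts to feeding the squared geodesic distances into classical MDS,
# e.g. Y, err = classical_mds(geodesic_distances(X, eps) ** 2, d)
```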


Chapter 3

Synthetic Dataset

Here, some synthetic datasets are introduced. They are chosen so that their intrinsic dimensionalities are already known. The purpose is to assess the efficacy of the introduced methods.

3.1 Spherical data

A synthetic dataset $\{X_1, \ldots, X_N\}$ with $X_i = (x_{1i}, \ldots, x_{mi})^T \in \mathbb{R}^m$ has been constructed by simulating uniformly in the polar and azimuthal angles, $\varphi_i \overset{iid}{\sim} U(0, 2\pi)$ and $\theta_i \overset{iid}{\sim} U[0, \pi]$, $i = 1, \ldots, N$, and applying the common transformation
$$\begin{cases} X_{1i} = \rho \cos(\theta_i) \sin(\varphi_i) \\ X_{2i} = \rho \sin(\theta_i) \sin(\varphi_i) \\ X_{3i} = \rho \cos(\varphi_i), \end{cases} \qquad (3.1)$$
which creates points on a 3-dimensional sphere of radius $\rho$. Note, however, that the points are not uniformly distributed on the sphere, but concentrated around the poles. In Figure 3.1, scatterplots in 2 and 3 dimensions of such a dataset of sample size $N = 2000$ are shown. The GP algorithm gives an ID estimate of 1.8524 and the U-statistic based method estimates the ID as 2 (recall that this method only estimates integer dimensions).

The maximum likelihood estimate of the ID using nearest neighbours 1 through 200 yields the estimates shown in Figure 3.2.


Figure 3.2: ML estimates from the spherical dataset, taking averages over the first through the 200th nearest neighbours.

We apply the ISOMAP method using the parameters $\rho = 10$ and $\epsilon = 1.5$. The result from ISOMAP is a 2-dimensional representation.


Figure 3.4: Empirical Bootstrap distribution of the Hill estimator with B = 100, yielding the Bootstrap 95% confidence interval [1.8035, 1.9266] for the ID of the 3d spherical data.

3.2 Three-dimensional spherical data embedded in $\mathbb{R}^m$, $d = 2$

We take the spherical data from the previous section and append to each point $m - 3$ coordinates containing zero-mean noise with small variance, i.e.
$$X_{4i}, \ldots, X_{mi} \overset{iid}{\sim} N(0, \sigma^2). \qquad (3.2)$$

The sample correlation integral and the estimates using the U-statistic method, for different numbers of embedding dimensions $m$ (in the range 4 to 100) and noise standard deviations in the range 0.01 to 0.1, are shown in Figure 3.5.


Figure 3.6: Maximum Likelihood estimation of ID on the 3-dimensional sphere embedded in $m$ dimensions, for different $m$. The different curves represent estimations when using different values of $\sigma$, the standard deviation of the noise. The top curve has $\sigma = 0.1$, the second $\sigma = 0.09$, the third $0.08$, and so on until $\sigma = 0.01$ for the bottom curve.

3.3 Swiss Roll Manifold

We consider random points on a spiralling band, which is the 3-dimensional surface defined by
$$\{(x, y, z) : x = (t + 1)\cos(t),\ y = (t + 1)\sin(t)\}. \qquad (3.3)$$
This surface has intrinsic dimensionality 2, namely the parameter $t$ and the height $z$ of the swiss roll.
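A minimal way to sample such points, under the assumption that the parameter t and the height z are drawn uniformly over finite ranges (the ranges themselves are not specified in the text), is:

```python
import numpy as np

def swiss_roll(N=2000, t_max=3.0 * np.pi, height=10.0, seed=0):
    """Random points on the swiss roll surface (3.3); the parameter ranges are assumed."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, t_max, N)
    z = rng.uniform(0.0, height, N)
    return np.column_stack(((t + 1.0) * np.cos(t),
                            (t + 1.0) * np.sin(t),
                            z))
```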


Figure 3.8: Left: log correlation sum plotted against $\log \epsilon$ for various sample sizes $N$. Right: maximum likelihood estimates of the true dimension 2 for different sample sizes $N$ and different numbers of nearest neighbours considered.

Sample size    Correlation Integral    U-statistic
200            1.6082                  2
500            2.0536                  2
1000           1.9645                  2
2000           2.0841                  2

When implementing ISOMAP, the key parameter is $\epsilon$. Choosing $\epsilon$ too large means that we may "shortcut" the underlying manifold.


Chapter 4

Real Datasets

4.1 Gene expression Analysis

Gene expression is the process by which the information in the DNA sequence of a gene is transmitted to the different functions and products of the cell. The end product of a gene expression is typically a protein, but some genes may have an RNA molecule as end product. It is the most basic level at which the phenotype (observable traits) of an organism is derived from its genotype (its inherent genetic properties).

B-cell lymphomas are a group of blood cell tumors that affect B-cells, a type of lymphocyte in the humoral immunity of the adaptive immune system.

4.1.1 Description of data

The dataset consists of 4026 observations that are 64-dimensional. The purpose of this analysis is to reduce this dimensionality. As a first step, the U-statistic and correlation sum methods give 14 and 11.3610, respectively. This is, of course, a remarkable reduction from 64.

4.1.2 Maximum Likelihood estimator

The ML estimator described in chapter 2 can be used as a first estimate of the underlying dimensionality of the data. Since it merely finds the dimensionality and not a low-dimensional embedding of the data, it is not of much use on its own. Nevertheless, its distributional characteristics are known and thus one may calculate confidence regions for estimates and test hypotheses. The ML estimator that has been derived in section 2.3 was given by

$$\hat{d}^{ML}_k(x) = \left[ \frac{1}{k - 1} \sum_{i=1}^{k-1} \log \frac{T_k(x)}{T_i(x)} \right]^{-1}, \qquad (4.1)$$
where $x$ is an arbitrary point in the higher dimensional space around which we estimate the intrinsic dimension. In practice, we calculate the estimator using the observations $X_1, \ldots, X_N$ as $x$, and then take the average to produce an overall estimate:
$$\hat{d}^{ML}_k(X) := \frac{1}{N} \sum_{i=1}^{N} \hat{d}^{ML}_k(X_i). \qquad (4.2)$$

The number $k$, representing how many neighbours we consider, also needs to be chosen.


Figure 4.1: Maximum Likelihood estimation of intrinsic dimensionality for the gene expression data.

4.1.3 Isomap

Guided by our exploratory analysis of the data using the maximum likelihood estimator and the U-statistics based method, we can see that the intrinsic dimensionality seems to lie somewhere around 15, which of course is a remarkable reduction from 64.

One of the key parameters in ISOMAP is $\epsilon$, which determines which vertices are connected in the graph. The pair $(i, j)$ is connected with an edge if $\|X_i - X_j\| < \epsilon$, and the corresponding edge weight is $\|X_i - X_j\|$. The theoretical background regarding $\epsilon$ is that we assume $\epsilon < s_0$, where $s_0$ is the minimum branch separation, which in turn means that $\|X_i - X_j\| < \epsilon$ implies $d_M(X_i, X_j) < \pi r_0$, where $r_0$ is the minimum radius of curvature. Thus, the larger $\epsilon$, the less curvy we allow the manifold to be. Another decision that needs to be addressed is how many dimensions we will map the dataset to.


Figure 4.3: Residual variance when using Isomap with $\epsilon = 4$ to reduce the data to dimensions 1 through 10.


Figure 4.5: Residual variance when using Isomap with $\epsilon = 10$ to reduce the data to dimensions 1 through 10.

4.2 Image data

In this section we analyse the face dataset considered in the original Isomap paper [14]. The images contain the same face, but photographed from different angles. This means that the dataset has three intrinsic dimensions by construction, namely the polar angle representing the rotation around the object, an azimuthal angle for the up-down pose, and the direction of the lighting. The images have 4096 pixels and we have 698 images. Some of the faces are shown in figure 4.2.

Figure 4.6: Empirical Bootstrap distribution of the Hill estimator with B = 100, yielding the Bootstrap 95% confidence interval [3.55, 5.12] for the ID of the image data.


Figure 4.7: Faces


4.3 Binary Data

Binary data occur in many applications and numerous approaches have been taken to dimensionality estimation and reduction in binary data. There is no consensus as to which distance metric to use for binary data dimensionality reduction. Typically, $L_1$ or $L_2$ distances are used. The methods that have been discussed in this project can all be used for binary data, but the interpretation of dimensionality is not as clear. In [13], the authors extend the notion of correlation dimension using $L_1$ distances and tailor the GP algorithm for binary datasets.

We use a dataset consisting of 507 binary vectors of dimension 994. These are strains of Vibrionaceae (a family of Proteobacteria) analysed using FAFLP (Fluorescent Amplified Fragment Length Polymorphism, a tool used in genetics research). In [15] the dataset was clustered into 69 clusters, many of which represent known species of Vibrionaceae. The clustering was made using Ward's method, which minimizes the within-cluster variance. The residual variance from Isomap is presented in Figure 4.9. We use both the $L_1$ and $L_2$ distances. The residual variance becomes reasonably small already at 2 dimensions; accordingly, a two-dimensional Isomap configuration is presented in Figure 4.10.

Figure 4.9: Residual variance from Isomap projection of the 994-dimensional binary data, using $L_1$ and $L_2$ distance metrics.

For the Hill/ML estimator, the dimension estimates vary considerably depending on the distance measure. This was expected, since the $L_2$ distance is bounded by the $L_1$ distance. More specifically, consider the scaling relation (2.1), i.e. $P(\|x - y\| \leq \epsilon) \sim \epsilon^d$, use the fact that $\|x - y\|_2 \leq \|x - y\|_1$, and fix some $\epsilon$. One then sees that the dimension estimate should be smaller when using $L_1$ in place of $L_2$.

This dataset has dimension 994, yet our methods reduce this number to well below 20. The authors of [15] found 69 clusters. This indicates that the data are strongly correlated, which might explain why it was possible to reduce the dimensionality to such a large extent. In addition, figure 4.12 shows that at least 10% of the 994 dimensions are somewhat redundant, indicating that it should be possible to reduce the dimensionality by at least 10%. Another reason which might explain why the data have such small intrinsic dimensionality is that distances between binary vectors are bounded by the number of dimensions. This is trivially seen by calculating the $L_1$ distance between a vector of all ones and a vector of all zeros: the $L_1$ distance is then equal to the number of dimensions. Furthermore, the $L_2$ distance is bounded by the $L_1$ distance, by the triangle inequality. This means that binary vectors can never be very far apart relative to the number of dimensions.

Figure 4.10: Two-dimensional Isomap projections of the 994-dimensional binary data, using $L_1$ and $L_2$ distance metrics.

Figure 4.11: Bootstrap histograms of Hill's estimator (MLE) for the binary dataset, using 500 replicate datasets and the $L_1$ distances (left) and $L_2$ distances (right), together with the corresponding confidence intervals.
(41)

Chapter 5

Conclusions and final remarks

Grassberger-Procaccia

The Grassberger-Procaccia algorithm (or correlation integral) is based on the rather simple approximation (2.1), relating the distribution of nearest neighbours to the intrinsic dimensionality, assuming only that the density of observations can be approximated as constant on a small enough scale. This method's theoretical justification is based on the limit of $\Pr\{\|X - Y\| \leq \epsilon\}$ as $\epsilon \to 0$, but in practice there are no pairs $X_i, X_j$ less than $\epsilon$ apart from each other in a finite sample. In spite of this, the method produces acceptable results for synthetic datasets, and results that conform to the other methods for real life data.

U-statistics

This method solves the theoretical issue in the GP algorithm by replacing the indicator function in (2.6) by a decreasing kernel function. As a consequence, the underlying theory is much more complicated. The way in which the method was used calculates the ID estimate as an integer. This is not a strong restriction in the context of ID estimation since the dimension should be an integer in order to find lower-dimensional embeddings of data. It should be noted that the algorithm can easily be generalized to use non-integer dimensions. As far as numerical results are concerned, we get the expected outputs when using synthetic data, but we can see that the method is less conservative than both the GP algorithm and the maximum likelihood estimate.

Maximum-likelihood method

The maximum likelihood method is based on the same approximation (2.1) and uses the distributional assumption to derive a likelihood function of the data. The specific version which we considered is rather involved, since it calculates an ID estimate around each observation, making it a local ID estimate. We then average over all the local estimates, which shows that the ML estimator is similar to the Hill estimator $\hat{H}_{m,n}$. The results obtained were as expected for synthetic data and are, in general, close to the ones obtained with the correlation integral.

close to the ones obtained with the correlation integral.

Isomap


Chapter 6

Appendix

6.1 Definitions

6.1.1 Topology and Geometry

Definition 4 (Topological space). If $X$ is a set and $\tau$ a collection of subsets of $X$, then the pair $T = (X, \tau)$ is a topological space if the following conditions hold:

1. The empty set $\emptyset$ is in $\tau$
2. $X \in \tau$
3. If $\{\tau_i\}_{i=1}^{n}$ is a finite collection of sets in $\tau$, then the intersection $\bigcap_{i=1}^{n} \tau_i$ is in $\tau$
4. If $\{\tau_i\}_{i \geq 1}$ is a (possibly infinite) collection of sets in $\tau$, then the union $\bigcup_{i \geq 1} \tau_i$ is in $\tau$

Definition 5 (Homeomorphism). A function $f : X \to Y$ is a homeomorphism if the following hold:

1. $f$ is a bijection
2. $f$ is continuous
3. The inverse function $f^{-1}$ is continuous

Definition 6 (Manifold). A manifold of dimension $n$ is a topological space $M = (X, \tau)$ for which there exists, for every $x \in X$, a $\tau_i \in \tau$ containing $x$ for which there exists a homeomorphism $f : \tau_i \to B^n$, where $B^n$ is the $n$-dimensional euclidean ball:
$$B^n = \{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \mid x_1^2 + x_2^2 + \cdots + x_n^2 < 1\}.$$

Definition 7 (Submanifold). A submanifold is a manifold embedded in a euclidean space. For example, a sphere is a two-dimensional manifold embedded in three-dimensional euclidean space. The sphere is therefore a submanifold of $\mathbb{R}^3$.

6.1.2 Probability theory

Definition 8 (Inverse Gamma distribution). The Inverse Gamma distribution can be characterized as the distribution of the reciprocal of a random variable with a Gamma distribution.

Let $X$ be distributed according to a Gamma distribution with shape parameter $k$ and scale parameter $\theta$, meaning that $X$ has density $f(x; k, \theta)$ given by
$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\, \theta^k}, \quad x > 0.$$
It then holds that $X^{-1} \sim \text{Inv-Gamma}(k, \theta^{-1})$,
$$\mathbb{E}[X^{-1}] = \frac{\theta^{-1}}{k - 1} \quad \text{for } k > 1, \qquad \mathrm{Var}[X^{-1}] = \frac{\theta^{-2}}{(k - 1)^2 (k - 2)} \quad \text{for } k > 2.$$

6.2 Theorems

Theorem 7 (Bernstein's inequality). Let $X_1, \ldots, X_n$ be independent random variables such that for all $i$, $\Pr(|X_i - \mathbb{E}[X_i]| < m) = 1$ and $\mathrm{Var}[X_i] = \sigma^2$. Then, for every $t > 0$,
$$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^{n} \big(X_i - \mathbb{E}[X_i]\big) \right| > t \right) \leq \exp\left( \frac{-n t^2}{2\sigma^2 + \frac{2}{3}\, m\, t} \right).$$

Bibliography

[1] A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511, 3 February 2000.

[2] J.-Y. Audibert & M. Hein. Intrinsic Dimensionality Estimation of Submanifolds in R^d. Max Planck Institute for Biological Cybernetics, Tübingen, Germany; CERTIS, ENPC, Paris, France. 2005.

[3] P. Demartines. Analyse de Données par Réseaux de Neurones Auto-Organisés. PhD dissertation, Institut National Polytechnique de Grenoble, Grenoble, France, 1994 (in French).

[4] D. François, V. Wertz. The Concentration of Fractional Distances. IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 7, July 2007.

[5] B. Efron. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 1979, Vol. 7, No. 1, 1-26.

[6] P. Grassberger & I. Procaccia (1983). Measuring the Strangeness of Strange Attractors. Department of Chemical Physics, Weizmann Institute of Science, Rehovot 76100, Israel.

[7] B.M. Hill. A simple general approach to inference about the tail of a distribution. Annals of Statistics 3:1163-1174, 1975.

[8] E. Levina and P.J. Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. Advances in Neural Information Processing Systems 17. 2005.

[9] T. Mikosch and Q. Wang. A Monte Carlo Method for Estimating the Correlation Exponent. Journal of Statistical Physics, Vol. 78, Nos. 3/4, 1995.

[10] M. Šarić, C.H. Ek and D. Kragić. Dimensionality Reduction via Euclidean Distance Embeddings. KTH, Royal Institute of Technology. 2011.

[11] R. Serfling. Approximation Theorems of Mathematical Statistics. John Wiley & Sons, Inc., 1980.

[12] J. Shlens. A Tutorial on Principal Component Analysis. Google Research, Mountain View CA, 2014.

[13] N. Tatti, T. Mielikäinen, A. Gionis and H. Mannila. What is the Dimension of Your Binary Data? ICDM 2006, Proceedings of the Sixth International Conference on Data Mining, 603-612.

[14] J.B. Tenenbaum, V. de Silva, J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500): 2319-2323, 22 December 2000.


