APPLIED PHYSICS AND ELECTRONICS
UMEÅ UNIVERSITY, SWEDEN
DIGITAL MEDIA LAB
Document distances using the Zipf distribution and a novel metric
Apostolos A. Georgakis
1 Dept. of Applied Physics and Electronics
Umeå University, SE-90187 Umeå, Sweden
e-mail: apostolos.georgakis@tfe.umu.se
H. Li
Dept. of Applied Physics and Electronics, Umeå University
SE-90187 Umeå, Sweden
e-mail: haibo.li@tfe.umu.se
DML Technical Report: DML-TR-2003:01
ISSN Number: 1652-8441
Report Date: December 1, 2003
1 This work was supported by the European Union Research Training Network (RTN) “MUHCI: Multi-modal Human Computer Interaction” (HPRN-CT-2000-00111).
A novel metric is proposed in the present report for evaluating the goodness of fit between the distribution functions of two samples. We extend the use of the proposed criterion to the case of the generalized Zipf distribution. A detailed mathematical analysis of the proposed metric, which is embedded in a hypothesis test, is also provided.
Keywords
Zipf distribution, n-gram frequencies, Bhattacharyya metric
1 Introduction
In a plethora of natural phenomena the distribution of the characteristic under consideration is heavily skewed. For example, biological, ecological, and chemical systems sometimes exhibit an exponentially decaying model. Web-site popularity, web access statistics, Internet traffic, and the population and growth of cities also comply with the same decaying model. Furthermore, many references can be found in bibliometrics, informetrics, and library science. A plethora of distributions capable of modeling the above phenomena exists in the literature, the most prevalent among them being the well-known Zipf distribution [1, 3].
The Zipf distribution relies on an empirical law discovered by Estoup in 1916 and named after the Harvard linguistics professor G. K. Zipf (1902-1950). The distribution relates the frequency of occurrence of an event α to the rank m_α of the event, when the rank is determined by the above frequency of occurrence. The relationship is the power-law function:

P(\alpha) \sim \frac{1}{m_\alpha^{\theta}} \quad (1)

with the exponent θ close to unity. The probability distribution in Eq. (1) is an instance of a power law. Zipf's law is an experimental law, not a theoretical one, and the causes of Zipfian distributions in real life are a matter of some controversy. Nevertheless, Zipfian distributions are commonly observed in many kinds of phenomena.
Initially the Zipf distribution was confined to the linguistic community, where it associated the frequency of a word in a document with its rank [4, 7]. The prerequisite for the above law to be applicable in linguistics is that the document be fairly large.
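The rank-frequency law of Eq. (1) is easy to check empirically on any word-frequency list. The sketch below is our own illustration, not part of the report; the toy corpus and function names are invented. It estimates the exponent θ by a least-squares fit of log-frequency against log-rank:

```python
import math
from collections import Counter

def zipf_exponent(tokens):
    """Estimate the Zipf exponent theta of Eq. (1) via a least-squares
    fit of log(frequency) against log(rank)."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # P(alpha) ~ 1 / rank^theta

# toy corpus whose counts decay exactly like 60/rank
text = ("the " * 60 + "of " * 30 + "and " * 20 + "to " * 15
        + "in " * 12 + "a " * 10).split()
theta = zipf_exponent(text)
```

On real documents the fit is only approximate, which is one reason the law requires fairly large documents.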
2 Document distance
It is generally accepted that the contextual “similarity” between documents (regardless of their size) can be based on their structural textual elements, namely the words forming these documents. This fact is the basic principle behind the vector space model (VSM) [6]. In the VSM, the available textual data are encoded into a numerical form and are represented by numerical vectors. Furthermore, it is generally agreed that the contextual similarity between documents carries over to their vectorial representation. Since the Zipf distribution of a document employs the frequencies of the words forming that particular document, it is justified to evaluate the contextual similarity based on the numerical encoding produced by that distribution.
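As a concrete illustration of such an encoding, two documents can be mapped to word-frequency vectors over a common vocabulary. This is a minimal sketch of ours, not the report's implementation; all names are invented:

```python
from collections import Counter

def word_pmf(doc, vocab):
    """Encode a document as a vector of word relative frequencies over a
    fixed vocabulary (a minimal VSM-style encoding)."""
    counts = Counter(doc.split())
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]

docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted(set(" ".join(docs).split()))
vectors = [word_pmf(d, vocab) for d in docs]
```

Each vector sums to one, so it can be treated as the word-probability distribution that the Zipf model is fitted to.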
A novel distance measure is provided in the current chapter. In Appendix A it is proven that the proposed distance measure is also a metric. This metric is used to evaluate the similarity between Zipf-distributed vectors. The suggested metric can easily be proven to be computationally cheaper (faster) than the Euclidean distance: for two N_w-dimensional vectors, the computational cost of the suggested metric is N_w multiplications, a bit-shift operation, and N_w additions, compared to the N_w multiplications and (2N_w - 1) additions of the Euclidean distance.
Furthermore, exploiting the fact that the vectors under consideration are distributed according to Zipf's law enables us to extend the suggested metric in the direction of a statistical hypothesis test. The hypothesis under consideration is whether two Zipf-distributed vectors, and consequently two documents, are similar or not.
For this reason a detailed distribution for the proposed metric is provided, along with a detailed proof. Two distribution tables for the proposed metric are also supplied, to make the chapter self-contained.
In what follows, section 2.1 provides a description of the proposed metric, and section 2.2 describes the process of incorporating the Zipf distribution into the proposed metric; it also provides a detailed proof for the evaluation of the distribution associated with the proposed metric. Finally, section 2.3 provides the hypothesis test for the evaluation of the similarity between two Zipf-distributed vectors.
2.1 Proposed metric
Let us suppose that X_N = {x_1, x_2, ..., x_N} is a collection of N_w-dimensional random vectors, where x_i = (x_{i1}, x_{i2}, ..., x_{iN_w})^T with cumulative probability density function f_x(i). Let also x_{im} denote the univariate random variable with distribution function f_i(m), where f_i(m) corresponds to the probability of the m-th element of the i-th vector, that is, f_i(m) = P(x_{im}), where

\sum_{m=1}^{N_w} P(x_{im}) = 1. \quad (2)
We further assume that the probabilities in Eq. (2) follow the Zipf distribution. In order to assess whether two vectors drawn independently from the set X_N are of the same “shape”, one needs to compare their distribution functions. For this purpose a novel metric is introduced. Let x_i and x_j denote two vectors randomly drawn from the set X_N (x_i, x_j ∈ X_N). The hypothesis whose validity is to be tested is:

H_0: The two cumulative distribution functions are “identical” ⇒ f_x(i) = f_x(j), or equivalently

H_0: f_i(m) ≅ f_j(m), for almost every m,

against the negation of H_0. If the null hypothesis is true, the population distributions are identical and the two samples are drawn from the same population, meaning that the vectors x_i and x_j should be regarded as instances of the same population. Therefore, allowing for statistically negligible sampling variations, under H_0 there should be reasonable agreement between the two distributions. The proposed criterion between the i-th and j-th distributions, henceforth denoted by D_ij, is defined as:
D_{ij} = h(x_i, x_j, x_i \circ x_j) = \left( x_i + x_j + g(x_i, x_j) \right) \left( x_i + x_j + g(x_i, x_j) \right)^T, \quad (3)

where the notation (∘) denotes the Hadamard product between two vectors and g(x_i, x_j) denotes the N_w-dimensional vector whose k-th element is \sqrt{P(x_{ik}) P(x_{jk})} (the square root of the Hadamard product of the vectors x_i and x_j).
From Eq. (3), the following form for the variable D_ij derives:

D_{ij} \triangleq \sum_{m=1}^{N_w} \frac{\left( f_i(m) + \sqrt{f_i(m) f_j(m)} \right)^2}{f_i(m)} \quad (4)

= \sum_{m=1}^{N_w} \frac{f_i^2(m) + f_i(m) f_j(m) + 2 f_i(m) \sqrt{f_i(m) f_j(m)}}{f_i(m)}

= \sum_{m=1}^{N_w} \left( f_i(m) + f_j(m) + 2 \sqrt{f_i(m) f_j(m)} \right)

= \sum_{m=1}^{N_w} \left( f_i(m) + f_j(m) \right) + \sum_{m=1}^{N_w} 2 \sqrt{f_i(m) f_j(m)}

= 2 + 2 \sum_{m=1}^{N_w} \sqrt{f_i(m) f_j(m)} \quad (5)

= 2 + 2 L_{ij} \quad (6)
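The quantity appearing in Eq. (6) is the Bhattacharyya coefficient L_ij of the two pmfs, so the criterion can be computed in a few lines. A minimal sketch of ours (function names are invented):

```python
import math

def d_metric(f_i, f_j):
    """Proposed criterion of Eq. (6): D_ij = 2 + 2 * L_ij, where
    L_ij = sum_m sqrt(f_i(m) * f_j(m)) is the Bhattacharyya coefficient."""
    l_ij = sum(math.sqrt(p * q) for p, q in zip(f_i, f_j))
    return 2.0 + 2.0 * l_ij

# two toy pmfs over the same support
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
d_pq = d_metric(p, q)
```

Identical pmfs attain the maximum D_ij = 4, while pmfs with disjoint supports attain the minimum D_ij = 2, matching the bounds established later in the report.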
From Eq. (5) it is evident that only the square roots of x_im and x_jm are needed. Therefore, instead of storing the actual values of x_im and x_jm, one can retain only their square roots. In that way there is no need
Figure 1: The divergence viewed under different metrics (θ_i = 1.35, θ_j = 1.55). The grayed area corresponds to the divergence measured by: (a) the Euclidean distance, which relies on the shaded area between the distribution functions, and (b) the proposed metric, which is based on the shaded area in the bottom left side of the plot.
to evaluate the square roots each time the value of the random variable D_ij is needed, thus limiting the computational cost to just N_w multiplications, a bit-shift operation (the multiplication by 2), and N_w additions.
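The storage trick above can be sketched as follows (our illustration; names are invented): the square roots are computed once per vector, and every later evaluation of D_ij is a plain inner product followed by a doubling:

```python
import math

def store_roots(f):
    """Pre-compute and store sqrt(f(m)) once per vector."""
    return [math.sqrt(p) for p in f]

def d_from_roots(r_i, r_j):
    """D_ij from stored roots: N_w multiplications, N_w additions and
    one doubling (the bit-shift mentioned in the text)."""
    return 2.0 + 2.0 * sum(a * b for a, b in zip(r_i, r_j))

f_i = [0.5, 0.25, 0.125, 0.125]
f_j = [0.25, 0.25, 0.25, 0.25]
d = d_from_roots(store_roots(f_i), store_roots(f_j))
```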
Appendix A proves that the proposed distance measure also satisfies the three properties of a metric, so it will henceforth be referred to as a metric.
From Eq. (25) it is obvious that D_ij ∈ [2, 4], where D_ij equals four when f_i(m) = f_j(m), ∀m. On the other hand, D_ij equals two only in the extreme case where the distributions of the i-th and j-th RVs have disjoint supports, that is:

P(x_{im}) \begin{cases} \neq 0, & \text{when } P(x_{jm}) = 0 \\ = 0, & \text{elsewhere} \end{cases} \quad \forall m \quad (7)

in which case the product f_i(m) f_j(m) equals zero and therefore D_ij attains the value two. The closer the pdf of the i-th RV is to the pdf of the j-th RV, the larger the value of L_ij and, subsequently, the closer the value of D_ij is to four. So the hypothesis test mentioned earlier is transformed into:
H_0: D_ij is statistically equal to four
H_1: the negation of H_0
It must be noted here that Eq. (4) resembles the Chi-square goodness-of-fit test proposed by Pearson, but there is no other resemblance to that particular test. In fact, since the Chi-square test uses the maximum divergence between the distributions under consideration, it might lead to unexpected results when the distributions differ in just two samples out of the N_w samples comprising the N_w-dimensional vectors.
Figure 1 depicts the areas used by the proposed metric and the Euclidean distance in evaluating the similarity between the distributions1.
2.2 The Zipf distribution and the proposed metric
In order to evaluate the hypothesis test mentioned in section 2.1, the probability density function of the random variable D_ij must be computed. In doing so, one must first determine the probability of the random variable
1The vectors used in this figure were artificially generated.
x_{im}. For the case under consideration the probability of the random variable is:

f_i(m) = \frac{1}{m^{\theta_i} H_{N_w,\theta_i}}, \quad (8)

where θ_i is a parameter dependent on the data set under consideration and H_{N_w,θ_i} is the so-called N_w-th Harmonic number of order θ_i, a normalizing factor equal to:

H_{N_w,\theta_i} = \sum_{m=1}^{N_w} \frac{1}{m^{\theta_i}}. \quad (9)
Equation (8) is the well-known generalized Zipf distribution [1].
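Eqs. (8) and (9) translate directly into code. The sketch below is ours (names invented) and builds the pmf for a given N_w and θ:

```python
def zipf_pmf(n_w, theta):
    """Generalized Zipf pmf of Eq. (8); the normalizer is the harmonic
    number H_{N_w, theta} of Eq. (9)."""
    h = sum(1.0 / m ** theta for m in range(1, n_w + 1))
    return [1.0 / (m ** theta * h) for m in range(1, n_w + 1)]

f = zipf_pmf(100, 1.35)
```

The resulting probabilities sum to one and decrease strictly with the rank m, as required by the power law.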
The first step towards the computation of the distribution of the variable D_ij is to evaluate the distribution of the elements of the random vector z_{ij} = (z_{ij1}, z_{ij2}, ..., z_{ijN_w}) = x_i ∘ x_j = (x_{i1}x_{j1}, x_{i2}x_{j2}, ..., x_{iN_w}x_{jN_w}). Since the formation of the m-th element of z_{ij} requires multiplying the corresponding m-th elements of both x_i and x_j, this leads to the following: P(z_{ijm}) = P(x_{im}x_{jm}). In the previous expression the random variable x_{im} is independent of the variable x_{jm}, since they refer to two different random vectors, which leads to: P(z_{ijm}) = P(x_{im})P(x_{jm}).
For the evaluation of the probability of z_{ijm} it is first necessary to determine the cdf for a given number m, where m ∈ ℕ. Let this distribution be denoted by F_ij(m):

F_{ij}(m) = P(\text{up to the } m\text{-th element of } z_{ij}) \quad (10)

= F_i(m) \cdot F_j(m) = \sum_{s=1}^{m} f_i(s) \cdot \sum_{t=1}^{m} f_j(t)

= \sum_{s=1}^{m} P(x_{is}) \cdot \sum_{t=1}^{m} P(x_{jt})

= \sum_{s=1}^{m} \frac{1}{s^{\theta_i} H_{N_w,\theta_i}} \cdot \sum_{t=1}^{m} \frac{1}{t^{\theta_j} H_{N_w,\theta_j}}

= \frac{1}{H_{N_w,\theta_i} H_{N_w,\theta_j}} \cdot \sum_{s=1}^{m} \frac{1}{s^{\theta_i}} \cdot \sum_{t=1}^{m} \frac{1}{t^{\theta_j}} \quad (11)

where F_i(m) and F_j(m) are the cdfs of the i-th and j-th RVs respectively. The next step is to find the pdf of the random variable z_{ijm}, that is:
f_{ij}(m) = P(z_{ijm}) = F_{ij}(m) - F_{ij}(m-1)

= a_{ij} \left( \sum_{s=1}^{m} \sum_{t=1}^{m} \frac{1}{s^{\theta_i}} \cdot \frac{1}{t^{\theta_j}} - \sum_{s=1}^{m-1} \sum_{t=1}^{m-1} \frac{1}{s^{\theta_i}} \cdot \frac{1}{t^{\theta_j}} \right) \quad (12)
Figure 2: The probability density function of the Zipf distribution for N_w = 100 and for (a) θ_i = 1.35, (b) θ_j = 1.55, and the pdf of the product z*_{ijm}: (c) 3D display and (d) 2D projection.
where a_{ij} denotes the fraction 1/(H_{N_w,θ_i} H_{N_w,θ_j}). From (12) it follows that:

f_{ij}(m) = a_{ij} \left( \frac{1}{m^{\theta_i+\theta_j}} + \frac{1}{m^{\theta_i}} \sum_{t=1}^{m-1} \frac{1}{t^{\theta_j}} + \frac{1}{m^{\theta_j}} \sum_{s=1}^{m-1} \frac{1}{s^{\theta_i}} \right)

= a_{ij} \left( \frac{1}{m^{\theta_i+\theta_j}} + \frac{H_{N_w,\theta_j}}{m^{\theta_i}} F_j(m-1) + \frac{H_{N_w,\theta_i}}{m^{\theta_j}} F_i(m-1) \right) \Rightarrow

f_{ij}(m) = \begin{cases} a_{ij}, & m = 1 \\[4pt] a_{ij} \left( \dfrac{1}{m^{\theta_i+\theta_j}} + \dfrac{H_{N_w,\theta_j}}{m^{\theta_i}} F_j(m-1) + \dfrac{H_{N_w,\theta_i}}{m^{\theta_j}} F_i(m-1) \right), & \forall m \in \{2, \ldots, N_w\} \\[4pt] 0, & \text{elsewhere} \end{cases} \quad (13)
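The closed form of Eq. (13) can be cross-checked numerically against the defining cdf difference of Eq. (12). The sketch below is ours (function names invented):

```python
def zipf_pmf(n_w, theta):
    """Generalized Zipf pmf, Eq. (8)."""
    h = sum(1.0 / m ** theta for m in range(1, n_w + 1))
    return [1.0 / (m ** theta * h) for m in range(1, n_w + 1)]

def product_pmf_cdf(n_w, ti, tj):
    """f_ij(m) as the cdf difference of Eq. (12)."""
    fi, fj = zipf_pmf(n_w, ti), zipf_pmf(n_w, tj)
    out, big_fi, big_fj, prev = [], 0.0, 0.0, 0.0
    for m in range(n_w):
        big_fi += fi[m]
        big_fj += fj[m]
        out.append(big_fi * big_fj - prev)
        prev = big_fi * big_fj
    return out

def product_pmf_closed(n_w, ti, tj):
    """f_ij(m) from the closed form of Eq. (13)."""
    fi, fj = zipf_pmf(n_w, ti), zipf_pmf(n_w, tj)
    hi = sum(1.0 / m ** ti for m in range(1, n_w + 1))
    hj = sum(1.0 / m ** tj for m in range(1, n_w + 1))
    aij = 1.0 / (hi * hj)
    out, big_fi, big_fj = [], 0.0, 0.0   # big_f* hold F(m-1)
    for m in range(1, n_w + 1):
        out.append(aij * (1.0 / m ** (ti + tj)
                          + hj * big_fj / m ** ti
                          + hi * big_fi / m ** tj))
        big_fi += fi[m - 1]
        big_fj += fj[m - 1]
    return out

f_cdf = product_pmf_cdf(50, 1.35, 1.55)
f_closed = product_pmf_closed(50, 1.35, 1.55)
```

Both forms agree term by term, and the resulting pmf sums to one because the cdf differences telescope to F_i(N_w) F_j(N_w) = 1.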
Figure 2 depicts the process of obtaining the distribution of the random variable z_{ijm}.
After the computation of the pdf of z_{ijm} it is necessary to compute the density function of the random variable √z_{ijm}, since D_ij is a linear combination of √z_{ijm}. Let z*_{ijm} denote the square root of z_{ijm}, that is, z*_{ijm} = √z_{ijm}, where m ∈ {1, 2, ..., N_w}. Since the sample space of the RV z_{ijm} is the set Z_1 = {1, 2, ..., N_w}, the sample space corresponding to z*_{ijm} is the set Z_2 = {1, √2, ..., √N_w}. It must be noted here that the cardinality of the set Z_2 is equal to N_w, since each element of Z_2 is the square root of an element of Z_1. Since z_{ijm} is a discrete RV, the RV z*_{ijm} has the same pdf as the RV z_{ijm} [5]; if f*_{ij}(m) denotes the pdf of the RV z*_{ijm}, then f*_{ij}(m) = f_{ij}(m), ∀m.
The final step is to evaluate the pdf of the random variable L_{ij} = \sum_{m=1}^{N_w} \sqrt{z_{ijm}} = \sum_{m=1}^{N_w} z^*_{ijm}. For a large value of N_w, and due to the central limit theorem (CLT), the pdf of the above sum tends toward the normal distribution with mean value μ and variance σ² [5]. The mean value is:

\mu = E[L_{ij}] = E\left[ \sum_{m=1}^{N_w} z^*_{ijm} \right] = \sum_{m=1}^{N_w} E[z^*_{ijm}] = N_w E[z^*_{ijm}] = N_w \sum_{m=1}^{N_w} \sqrt{m}\, f^*_{ij}(m)

= N_w a_{ij} \sum_{m=1}^{N_w} \sqrt{m} \left( \frac{1}{m^{\theta_i+\theta_j}} + \frac{H_{N_w,\theta_j}}{m^{\theta_i}} F_j(m-1) + \frac{H_{N_w,\theta_i}}{m^{\theta_j}} F_i(m-1) \right)

= N_w a_{ij} \sum_{m=1}^{N_w} \left( \frac{1}{m^{(\theta_i+\theta_j)-0.5}} + \frac{H_{N_w,\theta_j}}{m^{\theta_i-0.5}} F_j(m-1) + \frac{H_{N_w,\theta_i}}{m^{\theta_j-0.5}} F_i(m-1) \right)

= N_w a_{ij} \left( H_{N_w,(\theta_i+\theta_j)-0.5} + H_{N_w,\theta_j} \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-0.5}} + H_{N_w,\theta_i} \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-0.5}} \right) \quad (14)
and the variance is:

\sigma^2 = E\left[ (L_{ij} - \mu)^2 \right] = E\left[ (L_{ij})^2 \right] - \mu^2

= E\left[ \left( \sum_{m=1}^{N_w} z^*_{ijm} \right)^2 \right] - \mu^2

= E\left[ \sum_{m=1}^{N_w} (z^*_{ijm})^2 + 2 \sum_{\substack{m_1,m_2=1 \\ m_1 \neq m_2}}^{N_w} z^*_{ijm_1} z^*_{ijm_2} \right] - \mu^2

= \sum_{m=1}^{N_w} E\left[ (z^*_{ijm})^2 \right] + 2 \sum_{\substack{m_1,m_2=1 \\ m_1 \neq m_2}}^{N_w} E\left[ z^*_{ijm_1} z^*_{ijm_2} \right] - \mu^2

= N_w E\left[ (z^*_{ijm})^2 \right] + 2 N_w (N_w - 1) E\left[ z^*_{ijm_1} z^*_{ijm_2} \right] - \mu^2 \quad (15)

At this point, and without loss of generality, the RVs z^*_{ijm_1} and z^*_{ijm_2} can be regarded as independent. Under this postulate:

E\left[ z^*_{ijm_1} z^*_{ijm_2} \right] = E\left[ z^*_{ijm_1} \right] E\left[ z^*_{ijm_2} \right] \quad (16)
The first term on the right-hand side of the variance equation is equal to:

E\left[ (z^*_{ijm})^2 \right] = a_{ij} \sum_{m=1}^{N_w} (\sqrt{m})^2 \left( \frac{1}{m^{\theta_i+\theta_j}} + \frac{H_{N_w,\theta_j}}{m^{\theta_i}} F_j(m-1) + \frac{H_{N_w,\theta_i}}{m^{\theta_j}} F_i(m-1) \right)

= a_{ij} \left( H_{N_w,(\theta_i+\theta_j)-1} + H_{N_w,\theta_j} \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-1}} + H_{N_w,\theta_i} \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-1}} \right) \quad (17)
whereas the second term equals:

E\left[ z^*_{ijm_1} \right] E\left[ z^*_{ijm_2} \right] = \mu^2 \quad (18)
So the total variance of the random variable L_ij is:

\sigma^2 = N_w a_{ij} \left( H_{N_w,(\theta_i+\theta_j)-1} + H_{N_w,\theta_j} \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-1}} + H_{N_w,\theta_i} \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-1}} \right) + \left[ 2N_w(N_w-1) - 1 \right] \mu^2

= N_w a_{ij} \left( H_{N_w,(\theta_i+\theta_j)-1} + H_{N_w,\theta_j} \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-1}} + H_{N_w,\theta_i} \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-1}} \right)
+ N_w a_{ij} \left[ 2N_w(N_w-1) - 1 \right] \left( H_{N_w,(\theta_i+\theta_j)-0.5} + H_{N_w,\theta_j} \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-0.5}} + H_{N_w,\theta_i} \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-0.5}} \right)

= N_w a_{ij} \left( H_{N_w,(\theta_i+\theta_j)-1} + \left[ 2N_w(N_w-1) - 1 \right] H_{N_w,(\theta_i+\theta_j)-0.5} \right)
+ 2 a_{ij} N_w^2 (N_w-1) H_{N_w,\theta_j} \left( \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-1}} + \sum_{m=1}^{N_w} \frac{F_j(m-1)}{m^{\theta_i-0.5}} \right)
+ 2 a_{ij} N_w^2 (N_w-1) H_{N_w,\theta_i} \left( \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-1}} + \sum_{m=1}^{N_w} \frac{F_i(m-1)}{m^{\theta_j-0.5}} \right) \quad (19)

which is equal to:

\sigma^2 = N_w a_{ij} \sum_{m=1}^{N_w} \frac{1 + \left[ 2N_w(N_w-1) - 1 \right] m^{-0.5}}{m^{(\theta_i+\theta_j)-1}}
+ 2 a_{ij} N_w^2 (N_w-1) H_{N_w,\theta_j} \sum_{m=1}^{N_w} F_j(m-1) \frac{1 - m^{-0.5}}{m^{\theta_i-1}}
+ 2 a_{ij} N_w^2 (N_w-1) H_{N_w,\theta_i} \sum_{m=1}^{N_w} F_i(m-1) \frac{1 - m^{-0.5}}{m^{\theta_j-1}} \quad (20)
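The closed forms for the first two moments, Eqs. (14) and (17), can be verified numerically against the defining sums over the pmf of z_ijm. The sketch below is ours (names invented; μ is the mean of L_ij as in the text):

```python
import math

def harmonic(n_w, t):
    """Generalized harmonic number H_{N_w, t} of Eq. (9)."""
    return sum(1.0 / m ** t for m in range(1, n_w + 1))

def moments_closed(n_w, ti, tj):
    """mu = E[L_ij] per Eq. (14) and E[(z*_ijm)^2] per Eq. (17)."""
    hi, hj = harmonic(n_w, ti), harmonic(n_w, tj)
    aij = 1.0 / (hi * hj)
    s_mu = harmonic(n_w, ti + tj - 0.5)
    s_2 = harmonic(n_w, ti + tj - 1.0)
    big_fi = big_fj = 0.0   # hold F_i(m-1), F_j(m-1)
    for m in range(1, n_w + 1):
        s_mu += hj * big_fj / m ** (ti - 0.5) + hi * big_fi / m ** (tj - 0.5)
        s_2 += hj * big_fj / m ** (ti - 1.0) + hi * big_fi / m ** (tj - 1.0)
        big_fi += 1.0 / (m ** ti * hi)
        big_fj += 1.0 / (m ** tj * hj)
    return n_w * aij * s_mu, aij * s_2

n_w, ti, tj = 200, 1.35, 1.55
mu, second = moments_closed(n_w, ti, tj)

# direct evaluation from the pmf of z_ijm (cdf difference, Eq. (12))
hi, hj = harmonic(n_w, ti), harmonic(n_w, tj)
f_prod, big_fi, big_fj, prev = [], 0.0, 0.0, 0.0
for m in range(1, n_w + 1):
    big_fi += 1.0 / (m ** ti * hi)
    big_fj += 1.0 / (m ** tj * hj)
    f_prod.append(big_fi * big_fj - prev)
    prev = big_fi * big_fj
mu_direct = n_w * sum(math.sqrt(m + 1) * p for m, p in enumerate(f_prod))
second_direct = sum((m + 1) * p for m, p in enumerate(f_prod))
```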
Finally, the pdf of the RV D_ij = 2 + 2L_ij has to be computed. Given the fact that L_ij is normally distributed, we obtain the following pdf for the RV D_ij:

f_{D_{ij}}(t) = \frac{1}{2}\, f_{L_{ij}}\!\left( \frac{t-2}{2} \right) = \frac{1}{\sqrt{8\pi}\,\sigma} \exp\!\left( -\frac{1}{8\sigma^2} (t - 2 - 2\mu)^2 \right) \quad (21)

where μ and σ are the expected value and the standard deviation of the random variable L_ij.
But since the random variable D_ij is confined to the interval [2, 4] (D_ij ∈ [2, 4]), Eq. (21) obviously underestimates the true pdf of D_ij. The accurate form of the pdf is:

f_{D_{ij}}(t) = \begin{cases} 0, & -\infty \le t \le 2 \\[4pt] \dfrac{\exp\!\left( -\frac{1}{8\sigma^2}(t-2-2\mu)^2 \right)}{\displaystyle\int_2^4 \exp\!\left( -\frac{1}{8\sigma^2}(t-2-2\mu)^2 \right) dt}, & 2 \le t \le 4 \\[4pt] 0, & 4 \le t \le +\infty \end{cases} \quad (22)

which is the so-called truncated normal distribution [5]. Equation (22) can be simplified into the following form:

f_{D_{ij}}(t) = \begin{cases} \dfrac{\exp\!\left( -\frac{1}{8\sigma^2}\left(t-2(1+\mu)\right)^2 \right)}{\sqrt{2\pi}\,\sigma \left[ \operatorname{erf}\!\left( \frac{x-2(1+\mu)}{2\sqrt{2}\,\sigma} \right) \right]_{x=2}^{4}}, & 2 \le t \le 4 \\[4pt] 0, & \text{elsewhere} \end{cases} \quad (23)

where \operatorname{erf}(a) = \frac{2}{\sqrt{\pi}} \int_0^a e^{-t^2}\, dt denotes the so-called error function [2].
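The truncated-normal pdf of Eqs. (22)-(23) can be evaluated with the error function available in most standard libraries. The sketch below is ours, and the values of μ and σ are arbitrary placeholders, not values derived from real data:

```python
import math

def f_dij(t, mu, sigma):
    """Truncated-normal pdf of D_ij on [2, 4], Eq. (23); the normalizer
    is expressed through the error function."""
    if not 2.0 <= t <= 4.0:
        return 0.0
    z = lambda x: (x - 2.0 * (1.0 + mu)) / (2.0 * math.sqrt(2.0) * sigma)
    num = math.exp(-((t - 2.0 * (1.0 + mu)) ** 2) / (8.0 * sigma ** 2))
    den = math.sqrt(2.0 * math.pi) * sigma * (math.erf(z(4.0)) - math.erf(z(2.0)))
    return num / den

mu, sigma = 0.9, 0.05          # placeholder moments of L_ij
step = 0.0005
area = sum(f_dij(2.0 + k * step, mu, sigma)
           for k in range(int(2.0 / step) + 1)) * step
```

The Riemann sum confirms that the density integrates to one over [2, 4].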
2.3 Hypothesis test evaluation
To assess whether the i-th and j-th distributions are of the same “shape”, the RV D_ij will be employed in a hypothesis test. The hypothesis is as follows:
H_0: the i-th and j-th pdfs are statistically identical, which corresponds to D_ij → 4.
H_1: the i-th and j-th pdfs are not identical.
Given a pre-defined significance level α, the rejection region for the above hypothesis test is formulated as follows:

\alpha = P(D_{ij} \le z_\alpha) = \int_2^{z_\alpha} f_{D_{ij}}(t)\, dt

= \frac{\operatorname{erf}\!\left( \frac{1+\mu}{\sqrt{2}\,\sigma} - \frac{z_\alpha}{2\sqrt{2}\,\sigma} \right) - \operatorname{erf}\!\left( \frac{1+\mu}{\sqrt{2}\,\sigma} - \frac{1}{\sqrt{2}\,\sigma} \right)}{\operatorname{erf}\!\left( \frac{1+\mu}{\sqrt{2}\,\sigma} - \frac{2}{\sqrt{2}\,\sigma} \right) - \operatorname{erf}\!\left( \frac{1+\mu}{\sqrt{2}\,\sigma} - \frac{1}{\sqrt{2}\,\sigma} \right)}. \quad (24)
In Eq. (24) the only unknown is the parameter z_α. After evaluating the parameter z_α, the null hypothesis is accepted if the expression z_α ≤ D_ij is true; otherwise it is rejected. Figure 3 depicts the distribution of the RV D_ij along with the support regions for the hypothesis H_0 and the alternative hypothesis H_1.
Appendix B provides a brief description of the computation of the critical values for the acceptance or rejection of the null hypothesis, along with two tables of critical values computed by the proposed Eq. (27).
Figure 3: The support regions for the null and the alternative hypothesis for the RV D_ij for N_w = 2000, θ_i = 1.35 and θ_j = 1.55 at a significance level of α = 0.90. (Important remark: although the above graph implies a uniform distribution, this is not true; the slope of the line in the graph approaches zero but is still significantly different from that value.)
3 Conclusions
The present report provides a preliminary mathematical analysis of a novel metric, also introduced in the report, for the evaluation of the contextual similarity between documents. The proposed metric is computationally cheaper than the Euclidean distance, which is often employed in similar tasks. Further investigation will be performed in the direction of the bias of the introduced metric, that is, whether the proposed metric is biased or not.
A Is it a metric?
In order to prove that the proposed statistic, D_ij, is also a distance metric, the following properties have to be proven:
Positiveness: Since f_i(m) and f_j(m) for m = 1, 2, ..., N_w contain the total probability mass of the i-th and j-th RVs, the following stems out:

0 \le x_{im} < 1 \text{ and } \sum_{m=1}^{N_w} x_{im} = 1, \qquad 0 \le x_{jm} < 1 \text{ and } \sum_{m=1}^{N_w} x_{jm} = 1

\Rightarrow 0 \le x_{im} x_{jm} \le 1 \Rightarrow 0 \le \sqrt{x_{im} x_{jm}} \le 1 \Rightarrow 0 \le \sum_{m=1}^{N_w} \sqrt{x_{im} x_{jm}} \le 1

\Rightarrow 0 \le 2L_{ij} \le 2 \Rightarrow 2 \le 2 + 2L_{ij} \le 4 \Rightarrow 2 \le D_{ij} \le 4 \quad (25)

In the case where i = j, L_{ii} = \sum_{m=1}^{N_w} \sqrt{x_{im} x_{im}} = \sum_{m=1}^{N_w} x_{im} = 1 \Rightarrow D_{ii} = 2 + 2L_{ii} = 4.
Symmetry: Since x_{im} x_{jm} = x_{jm} x_{im} ⇒ D_ij = D_ji.
Triangular inequality: In order to prove the triangular inequality it suffices to show that:

D_{ij} + D_{jm} \ge D_{im} \Rightarrow 2 + 2L_{ij} + 2 + 2L_{jm} \ge 2 + 2L_{im} \Rightarrow 1 + L_{ij} + L_{jm} \ge L_{im} \quad (26)

which is obvious since L_ij, L_jm ≥ 0 and L_im ≤ 1.
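The bounds and symmetry above are easy to confirm numerically. The sketch below (ours; names invented) draws random pmfs and checks Eq. (25) together with the symmetry property:

```python
import math
import random

def d_metric(p, q):
    """D_ij = 2 + 2 * sum_m sqrt(p_m q_m), as in Eq. (6)."""
    return 2.0 + 2.0 * sum(math.sqrt(a * b) for a, b in zip(p, q))

def random_pmf(n, rng):
    """A random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
p, q = random_pmf(8, rng), random_pmf(8, rng)
```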
B Critical values
The critical values for the hypothesis test associated with the RV D_ij are computed using the following:

\int_0^{\frac{1}{2\sqrt{2}\,\sigma}\left( 2(1+\mu) - z_\alpha \right)} e^{-t^2}\, dt = \alpha \frac{\sqrt{\pi}}{2} \operatorname{erf}\!\left( \frac{\mu - 1}{\sqrt{2}\,\sigma} \right) + (1-\alpha) \frac{\sqrt{\pi}}{2} \operatorname{erf}\!\left( \frac{\mu}{\sqrt{2}\,\sigma} \right)

\Rightarrow z_\alpha = 2(1+\mu) - 2\sqrt{2}\,\sigma\, \operatorname{erfinv}\!\left( \alpha \operatorname{erf}\!\left( \frac{\mu-1}{\sqrt{2}\,\sigma} \right) + (1-\alpha) \operatorname{erf}\!\left( \frac{\mu}{\sqrt{2}\,\sigma} \right) \right), \quad (27)
where erfinv is the inverse of the error function [2]. Using Eq. (27) and pre-defined significance levels, two tables of critical values for the hypothesis test were computed. Table I² corresponds to a significance level of α = 0.90 when the dimensionality of the feature vectors is N_w = 2000, whereas Table II³ corresponds to a significance level of α = 0.95 under the same feature vector dimensionality.
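Eq. (27) needs the inverse error function. Where no library routine is available, erf can be inverted by bisection, since it is strictly increasing. The sketch below is ours, with placeholder values of μ, σ and α:

```python
import math

def erfinv(y):
    """Invert math.erf by bisection (a simple stand-in for a library
    erfinv routine)."""
    lo, hi = -6.0, 6.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_value(alpha, mu, sigma):
    """z_alpha per Eq. (27)."""
    s = math.sqrt(2.0) * sigma
    y = alpha * math.erf((mu - 1.0) / s) + (1.0 - alpha) * math.erf(mu / s)
    return 2.0 * (1.0 + mu) - 2.0 * s * erfinv(y)

z_a = critical_value(0.90, 0.9, 0.05)   # placeholder mu and sigma
```

The returned z_α lies inside the support [2, 4] of D_ij and satisfies the erf relation that Eq. (27) inverts.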
2 The values of the table are a scaled version of the original values: original value = 3.8 + scaled value × 10⁻⁹.
3 The values of the table are a scaled version of the original values: original value = 3.8 + scaled value × 10⁻⁹.
References
[1] References on Zipf's law. http://linkage.rockefeller.edu/wli/zipf.
[2] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, 10th edition, 1974.
[3] L. A. Adamic. Zipf, Power-laws, and Pareto - a ranking tutorial. http://ginger.hpl.hp.com/shl/papers/ranking/ranking.html.
[4] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
[5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, 1984.
[6] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[7] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
Table I: Distribution table (critical values) for the RV D_ij for N_w = 2000 at α = 0.90 (10% Confidence Level). The row label is θ_i; the entries in each row correspond to θ_j = θ_i + 0.05, θ_i + 0.10, ..., 2.00.

1.05: 1.2081 1.2117 1.2146 1.2183 1.2234 1.2256 1.2277 1.2314 1.2314 1.2328 1.2365 1.2372 1.2365 1.2372 1.2379 1.2379 1.2394 1.2387 1.2394
1.10: 1.2161 1.2197 1.2263 1.2299 1.2336 1.2372 1.2408 1.2430 1.2452 1.2459 1.2481 1.2488 1.2488 1.2503 1.2503 1.2510 1.2510 1.2503
1.15: 1.2263 1.2328 1.2379 1.2438 1.2481 1.2510 1.2547 1.2576 1.2590 1.2612 1.2627 1.2641 1.2649 1.2656 1.2659 1.2663 1.2659
1.20: 1.2394 1.2467 1.2539 1.2583 1.2634 1.2667 1.2710 1.2736 1.2761 1.2776 1.2790 1.2805 1.2805 1.2823 1.2823 1.2820
1.25: 1.2561 1.2619 1.2685 1.2754 1.2812 1.2852 1.2889 1.2929 1.2958 1.2972 1.2983 1.2998 1.3001 1.3016 1.3012
1.30: 1.2725 1.2816 1.2885 1.2954 1.3012 1.3056 1.3092 1.3132 1.3151 1.3172 1.3187 1.3201 1.3205 1.3216
1.35: 1.2929 1.3023 1.3100 1.3161 1.3231 1.3282 1.3322 1.3362 1.3380 1.3409 1.3427 1.3442 1.3445
1.40: 1.3151 1.3238 1.3329 1.3405 1.3471 1.3529 1.3569 1.3609 1.3634 1.3656 1.3674 1.3682
1.45: 1.3383 1.3489 1.3580 1.3660 1.3733 1.3787 1.3831 1.3871 1.3898 1.3918 1.3935
1.50: 1.3642 1.3751 1.3853 1.3938 1.4005 1.4060 1.4113 1.4144 1.4173 1.4196
1.55: 1.3915 1.4027 1.4135 1.4216 1.4287 1.4344 1.4389 1.4424 1.4451
1.60: 1.4195 1.4313 1.4418 1.4502 1.4575 1.4628 1.4673 1.4706
1.65: 1.4484 1.4604 1.4704 1.4784 1.4851 1.4900 1.4942
1.70: 1.4769 1.4882 1.4975 1.5059 1.5119 1.5170
1.75: 1.5046 1.5153 1.5241 1.5315 1.5370
1.80: 1.5301 1.5402 1.5483 1.5544
1.85: 1.5539 1.5626 1.5697
1.90: 1.5752 1.5824
1.95: 1.5930
Table II: Distribution table (critical values) for the RV D_ij for N_w = 2000 at α = 0.95 (5% Confidence Level). The row label is θ_i; the entries in each row correspond to θ_j = θ_i + 0.05, θ_i + 0.10, ..., 2.00.

1.05: 6.3825 6.3970 6.4116 6.4552 6.4480 6.4625 6.4771 6.4916 6.5062 6.4989 6.5207 6.5280 6.5207 6.5207 6.5425 6.5425 6.5280 6.5498 6.5425
1.10: 6.4188 6.4334 6.4698 6.4843 6.5134 6.5207 6.5498 6.5571 6.5644 6.5716 6.5935 6.6007 6.6007 6.6080 6.5935 6.6080 6.6007 6.6007
1.15: 6.4625 6.5134 6.5280 6.5571 6.5935 6.6080 6.6226 6.6299 6.6517 6.6517 6.6662 6.6590 6.6808 6.6808 6.6808 6.6844 6.6808
1.20: 6.5425 6.5789 6.6153 6.6444 6.6808 6.6844 6.7099 6.7244 6.7390 6.7463 6.7499 6.7608 6.7608 6.7645 6.7645 6.7608
1.25: 6.6226 6.6590 6.6990 6.7317 6.7608 6.7826 6.8045 6.8227 6.8372 6.8445 6.8518 6.8554 6.8590 6.8663 6.8736
1.30: 6.7172 6.7608 6.7936 6.8409 6.8736 6.8918 6.9100 6.9354 6.9391 6.9573 6.9609 6.9645 6.9718 6.9827
1.35: 6.8227 6.8736 6.9063 6.9464 6.9900 7.0046 7.0373 7.0482 7.0628 7.0773 7.0846 7.0919 7.0992
1.40: 6.9464 6.9864 7.0373 7.0810 7.1101 7.1392 7.1610 7.1828 7.1937 7.2083 7.2119 7.2228
1.45: 7.0591 7.1210 7.1646 7.2083 7.2483 7.2738 7.3029 7.3211 7.3338 7.3484 7.3520
1.50: 7.2010 7.2556 7.3102 7.3538 7.3938 7.4193 7.4448 7.4666 7.4811 7.4921
1.55: 7.3393 7.4029 7.4593 7.4993 7.5393 7.5703 7.5957 7.6121 7.6285
1.60: 7.4902 7.5557 7.6066 7.6539 7.6921 7.7231 7.7413 7.7613
1.65: 7.6430 7.7049 7.7594 7.8067 7.8377 7.8631 7.8868
1.70: 7.7922 7.8559 7.9050 7.9468 7.9814 8.0050
1.75: 7.9413 7.9959 8.0432 8.0814 8.1123
1.80: 8.0778 8.1287 8.1705 8.2051
1.85: 8.2015 8.2506 8.2833
1.90: 8.3142 8.3506
1.95: 8.4070