
Factorisation of Latent Variables in Word Space Models

Studying redistribution of weight on latent variables

DAVID ÖDLING & ARVID ÖSTERLUND

Bachelor’s Thesis at CSC, KTH
Supervisor at Gavagai: Magnus Sahlgren
Supervisor at KTH: Hedvig Kjellström
Examiner: Mårten Olsson


Abstract

The ultimate goal of any DSM is a scalable and accurate representation of lexical semantics.

Recent developments due to Bullinaria & Levy (2012) and Caron (2001) indicate that the accuracy of such models can be improved by redistributing weight on the principal components. However, this method is poorly understood and rarely replicated, due to the computationally expensive dimensionality reduction and the puzzling nature of the results. This thesis aims to explore the nature of these results. Beginning by reproducing the results in Bullinaria & Levy (2012), we move on to deepen the understanding of these results, quantitatively as well as qualitatively, using various forms of the BLESS test, and juxtapose these with previous results.

The main results of this thesis are the verification of the 100% score on the TOEFL test and a score of 91.5% on a paradigmatic version of the BLESS test. Our qualitative tests indicate that the redistribution of weight away from the first principal components affects word categories slightly differently, which partly explains the improvements in the TOEFL and BLESS results. While we do not find any significant relation between word frequencies and the weight distribution, we do find an empirical relation for the optimal weight distribution.

Based on these results, we suggest a range of further studies to better understand these phenomena.


Faktorisering av Latenta Variabler i Ordrumsmodeller

(Swedish abstract, translated.) The goal of any distributional semantic model (DSM) is a scalable and accurate representation of semantic relations. Recent findings by Bullinaria & Levy (2012) and Caron (2001) indicate that performance can be improved considerably by redistributing weight from the principal components with the highest variance towards the lower ones. Why the method works is, however, still unclear, partly because of the high computational cost of the PCA and partly because the results run counter to earlier practice.

We begin by replicating the results in Bullinaria & Levy (2012) and then examine them in more depth, both quantitatively and qualitatively, using the BLESS test.

The main results of this study are a verification of the 100% score on the TOEFL test and a new result of 91.5% on a paradigmatic variant of the BLESS test. Our results suggest that redistributing weight away from the first principal components changes the balance between the semantic relations, which partly explains the improvement in the TOEFL results. Furthermore, in agreement with earlier results, we find no significant relation between word frequencies and the weight redistribution.

Based on these results, we propose a series of experiments that could give further insight into these interesting results.


Contents

1 Introduction
1.1 Motivation
1.2 Problem Statement

2 Preliminaries
2.1 The Distributional Hypothesis
2.2 Word Space Model
2.3 Corpus
2.3.1 Zipf's Law
2.4 Mutual Information
2.5 Measure of Similarity
2.6 Evaluation of Word Space Models
2.7 Principal Component Analysis
2.7.1 Singular Value Decomposition
2.8 Related Work

3 Contribution
3.1 Refined Problem Statement
3.2 Method
3.2.1 Word Space Model Construction
3.2.2 Verification
3.2.3 Qualitative Tests
3.2.4 Frequency Norm Relation
3.2.5 Statistical Validity
3.3 Results
3.3.1 Verification
3.3.2 Qualitative Tests
3.3.3 Frequency Norm Distribution
3.3.4 Equivalent Transformations
3.3.5 Reductio Ad Absurdum

4 Discussion
4.1 Summary of Results
4.2 Statistical Validity
4.3 Evaluation of Method
4.4 Outlook

Bibliography

Appendix A


Chapter 1

Introduction

1.1 Motivation

The ability to read is arguably more important than ever before. At the very core of all forms of language lies a set of semantic relations containing the linguistic relations between words. In essence, it is these patterns that form words into sentences and enable us to derive meaning from words. Hence, the idea to encapsulate these relations in computational models has existed since the dawn of the computer, and the study of such models is commonly referred to as distributional semantic models, DSMs.

Vector space models, or word space models, WSMs (Sahlgren 2006), are a particularly successful class of such models and are the focus of this thesis. There is a myriad of different such models, see Turney & Pantel (2010), with various applications. But, for the purpose of this thesis, we shall discuss some seemingly general phenomena in the factorisation of a large class of such models, namely those that derive semantic structures from the frequencies of word occurrences in similar contexts.

Recent developments involving redistribution of weight on the principal components provide large performance improvements (Caron 2001, Bullinaria & Levy 2012). However, the method is poorly understood and rarely replicated, due to the computationally expensive dimensionality reduction and the puzzling nature of the results. Nonetheless, if better understood, the method may be applicable to a wide variety of WSMs.

1.2 Problem Statement

The first aim is to further establish the rather surprising result, due to Bullinaria & Levy (2012) and Caron (2001), that a redistribution of weight on the PCs away from those with the highest variance provides a better representation of lexical semantics. The second is to attempt to pinpoint which properties of the WSM underlie these puzzling results.

The reader familiar with these articles may move directly to chapter 3.


Chapter 2

Preliminaries

This chapter aims to provide a gentle introduction to the relevant theory and previous works aimed at our peers of the Engineering Physics programme.

2.1 The Distributional Hypothesis

The prevalent way of defining meaning for a WSM is based on the following observation: words that occur in the same contexts tend to have similar meanings (Harris 1970). Consider the following sentence: Arvid likes green apples. By looking at many similar sentences, this way of extracting semantic information has proven particularly apt at forming syntagmatic and paradigmatic relations (Sahlgren 2006).

Paradigmatic ↓ / Syntagmatic →

Arvid   likes     green        apples
David   craves    blue         cheese
She     detests   everything   else

Table 2.1. An example of syntagmatic and paradigmatic relations.

As seen from the example in Table 2.1, syntagmatic relations hold between words that co-occur in a text. Paradigmatic relations, on the other hand, hold between words that can occur in the same position in similar contexts but that do not co-occur themselves. Even though it is possible that the WSM representation contains many other, more specific, types of linguistic relations, we will focus on synonyms and, more generally, paradigmatic relations.

The types of relations encapsulated in a WSM depend on how the model constructs the vector representation of a word from its context; this is studied in detail by Sahlgren (2006).


2.2 Word Space Model

As introduced by the distributional hypothesis, the idea is to harness this contextual similarity by using the context words as a basis in a vector space representation. Consider a set of words W = {w_1, ..., w_n} and a set of context words C = {c_1, ..., c_m}. The WSM representation of semantic similarities is created by registering an occurrence of a word w_i together with a set of context words c_j, ..., c_k through a corresponding increment of the projection of w_i on the c_j, ..., c_k basis vectors. In other words, each cell f_ij in the matrix representation F holds the co-occurrence count of the word w_i with the context word c_j. Moreover, co-occurrences are counted within a context window. As shown in Bullinaria & Levy (2012), a small and symmetric window is effective for paradigmatic relations. Furthermore, for pragmatic reasons we will use W = C, making the co-occurrence matrix F ∈ R^{n×n} symmetric.

Consider a co-occurrence matrix created with a 1-1 window from two similar sentences, as illustrated in Table 2.2 below.

“Arvid likes green apples”

“Arvid hates green cheese”

         Arvid  likes  green  apples  hates  cheese
Arvid      0      1      0      0       1      0
likes      1      0      1      0       0      0
green      0      1      0      1       1      1
apples     0      0      1      0       0      0
hates      1      0      1      0       0      0
cheese     0      0      1      0       0      0

Table 2.2. An example co-occurrence matrix, with word window 1.

As we can see from this example, each word receives a spatial representation in an n-dimensional word space.
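As a minimal sketch of how such a co-occurrence matrix can be built, the snippet below counts neighbours within a symmetric 1-1 window; the function and variable names are illustrative assumptions, not the implementation used in the thesis.

```python
from collections import defaultdict

import numpy as np

def cooccurrence_matrix(sentences, window=1):
    """Build a symmetric word-word co-occurrence matrix from tokenised sentences."""
    vocab, counts = {}, defaultdict(float)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            vocab.setdefault(w, len(vocab))
            # count neighbours within the symmetric window around position i
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vocab.setdefault(tokens[j], len(vocab))
                    counts[(w, tokens[j])] += 1
    F = np.zeros((len(vocab), len(vocab)))
    for (w, c), n in counts.items():
        F[vocab[w], vocab[c]] = n
    return F, vocab

F, vocab = cooccurrence_matrix(["Arvid likes green apples", "Arvid hates green cheese"])
print(F)  # reproduces the counts in Table 2.2 (rows and columns in order of first occurrence)
```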

2.3 Corpus

A lot of text is needed to construct a robust WSM; such large and structured collections of text are commonly referred to as corpora. There are many different types of corpora from a large variety of sources. Many include information in the form of a tag on each word specifying its word class or part of speech.

As a large part of this thesis involves reproducing the results of Bullinaria & Levy (2012), we will use the plain text version of the ukWaC corpus (Baroni et al. 2009) to build our co-occurrence matrix. It consists of over two billion words and was built by web crawling the .uk domain. The quality of the sources can be questioned, and this is important to keep in mind later on, as the corpus forms the basis for all semantic relations in the WSM. However, as large size tends to counterbalance the relative influence that single texts can have on the results of an analysis (Biber et al. 1998), the low quality of the crawled text in ukWaC is largely balanced out by the sheer size of the corpus.

However, before the corpus can be used for most purposes, it needs to be preprocessed. For starters, converting words to lowercase and removing multiple white-spaces and punctuation are common practice. Moreover, lemmatisation¹ and frequency filtering, often implying the removal of stopwords with low information content such as the, of, this, a, are also common.

1 Morphing words into word stems; looks, looking, looked → look



2.3.1 Zipf's Law

One common feature of all natural languages is the empirical relation between word rank and frequency. Zipf's law states that, given some corpus of natural language utterances, the frequency f of any word decreases as a power of its rank k in an ordered frequency table.

f \approx c \, k^{s} \qquad (2.1)

Figure 2.1. Zipf-like distribution of the preprocessed ukWaC corpus.

The Zipf-like frequency distribution of the first 50000 words of our preprocessed corpus is shown in Figure 2.1, with the fit f ≈ 1.224 · 10^8 · k^{−0.867}.
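As a sketch of how such an exponent can be estimated, one can fit a straight line in log-log space to the rank-frequency data; the frequencies below are synthetic placeholders, not the ukWaC counts.

```python
import numpy as np

def fit_zipf(frequencies):
    """Fit f ≈ c * k**s by linear regression in log-log space.

    Frequencies are sorted in descending order internally; returns (c, s).
    """
    freqs = np.asarray(sorted(frequencies, reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    s, log_c = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    return np.exp(log_c), s

# toy example with an artificial power-law tail
c, s = fit_zipf(1e6 * np.arange(1, 1001, dtype=float) ** -0.87)
print(c, s)  # approximately 1e6 and -0.87
```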


2.4 Mutual Information

There are multiple reasons for the prevalence of weighting the raw frequency data. Primarily, it is a matter of pragmatism: it works surprisingly well for a large span of WSMs (Turney & Pantel 2010). Moreover, there exist convincing psychological and linguistic arguments in its favour (Church & Hanks 1990). Finally, a purely information-theoretic description based on entropy leads directly to the same formulation, namely that the mutual information I of two random variables X and Y is given by the ratio of their joint probability to the product of their marginal probabilities.

I(X, Y) = \frac{P(X \cap Y)}{P(X) \cdot P(Y)} \qquad (2.2)

The maximum likelihood estimate of I based on a set of observations of x ∈ X, and y ∈ Y is presented below and will form a basis for the weighting function.

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \, \log_2 \frac{p(x, y)}{p(x)\, p(y)} \qquad (2.3)

The transformed co-occurrence information has a normalising effect on the disproportionate representation of highly frequent semantic relations caused by Zipf's frequency distribution. Moreover, removing negative values provides computational advantages: maintaining a positive definite matrix is advantageous when performing a dimensionality reduction, and sparser data generally improves computational speed. Thus, we will constrain the output to positive values and use the form called positive pointwise mutual information, PPMI.

In terms of co-occurrence frequencies, we get the following.

ppmi(f_{ij}) =
\begin{cases}
\log \dfrac{ f_{ij} \left( \sum_{ij} f_{ij} \right)^{2} }{ \sum_{i} f_{ij} \, \sum_{j} f_{ij} \, \sum_{ij} f_{ij} } & \text{if the logarithm is positive} \\[1ex]
0 & \text{otherwise}
\end{cases} \qquad (2.4)

Due to the computational requirements and nonlinearity of this particular form, it is not suitable for incremental reweighting, but it has a couple of approximate forms. For a more comprehensive study, see for example Manning et al. (1999) or Landauer et al. (2013).
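A sketch of the PPMI weighting in equation (2.4), expressed with row, column and grand totals of the co-occurrence matrix; dense NumPy is used for clarity, whereas a sparse implementation would be needed at the scale used in this thesis.

```python
import numpy as np

def ppmi(F):
    """Positive pointwise mutual information weighting of a co-occurrence matrix F."""
    total = F.sum()
    row = F.sum(axis=1, keepdims=True)   # sum over j of f_ij
    col = F.sum(axis=0, keepdims=True)   # sum over i of f_ij
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((F * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0         # cells with f_ij = 0 (or empty rows/columns)
    return np.maximum(pmi, 0.0)          # keep only the positive values
```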



2.5 Measure of Similarity

As seen in Section 2.2 a WSM can be seen as a high dimensional vector space. This is the foundation of all WSMs; to exploit the advantages of this high dimensionality. Consider for example the distribution of angles Θj between a homogeneously distributed set of points Xi ∈ Sn on a unit hypersphere Sn. The probability density function, p(θ) for θ ∈ Ω(Θ) can be expressed as follows (Cai et al. 2013) :

p(\theta) = \frac{1}{\sqrt{\pi}} \, \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)} \, \sin^{n-2}\theta \qquad (2.5)

As in the well-known case n = 3, the maximum is found at θ = π/2 for all n, while the proportion of mass around it grows quickly with n, as shown in Figure 2.2.

Figure 2.2. Distribution of angles between uniformly distributed points on Sn.

The idea is to exploit this abundance of orthogonality to separate words with different meanings. One way to do this is to simply define a Minkowski distance metric dist(w_i, w_j) as the basis for a similarity measure, since sim(w_i, w_j) = 1 / dist(w_i, w_j). However, the ordinary Minkowski distance metric is not commonly applied, as more complex measures give better results. For the purpose of this thesis we will use the angle between the vectors as similarity measure, since it effectively reduces the impact of frequency effects.

sim(w_i, w_j) = \cos(w_i, w_j) = \frac{w_i \cdot w_j}{|w_i| \, |w_j|} \qquad (2.6)

Another important property of the cosine measure is that it is invariant under unitary transformations F \to F U, where U ∈ O(n), as utilised in Caron (2001).

For a more thorough review, see for example Weeds et al. (2004).
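A sketch of the cosine similarity in equation (2.6), computed for all pairs of row vectors at once; the function name is illustrative.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return unit @ unit.T
```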


2.6 Evaluation of Word Space Models

Having a measure of similarity is useless without a way of evaluating its performance. As reviewed in Sahlgren (2006), there are many different types of tests that can be used for this purpose. They can be divided into five different categories: information retrieval, synonym tests, word-sense disambiguation, lexical priming data and knowledge assessment. Many of these are more of applications where WSMs could be used, rather than specifically designed tests to evaluate a certain property of the model.

As the first part of this thesis focuses on understanding the results of Bullinaria & Levy (2012), and in particular their improvements on the TOEFL test, the relevant benchmarks are tests that evaluate paradigmatic relations.

The TOEFL benchmark was introduced to the computational linguistics community by Landauer & Dumais (1997). It is a synonym test with 80 multiple-choice questions, each with four answers where one is the correct synonym to the question.

However, as pointed out in Baroni & Lenci (2011), the TOEFL test was not designed to benchmark WSMs; it was originally designed to test the understanding of English as a second language. For example, there are mostly psychological difficulties behind some of the choices; consider for example the questions in Table 2.3 below.

Question     Correct    Incorrect
physician    doctor     nurse, pharmacist, chemist
easygoing    relaxed    farming, boring, frontier
grin         smile      joke, rest, exercise

Table 2.3. Examples of TOEFL questions.

The first question in the table is an example where all answers have clear paradigmatic relations to the question word, while the other two questions have more varied answers. For example, there is no clear paradigmatic relation between easygoing and farming. Another obvious drawback of the TOEFL test is that it only tests for synonyms; clearly, this is not a good representation of paradigmatic relations in general.

As suggested in Baroni & Lenci (2011), in order to examine the performance of a WSM a test needs to single out a specific semantic relation; in particular, it should use a data set that explicitly and reliably encodes the semantic information for a given evaluation criterion. Hence, the authors designed the BLESS data set. It consists of word tuples with information about the type of linguistic relation between the target word and each of its associated words. An example of such relations is presented in equation 2.7. This way, one can see what kinds of relations the distributional semantic model emphasizes.



Alligator:
    aggressive  ∈ attri
    frog        ∈ coord
    attack      ∈ event
    beast       ∈ hyper
    mouth       ∈ mero
    courthouse  ∈ random
                                                  (2.7)

The above semantic relations to the association word are described below.

Hypernym relations, abbreviated hyper, are defined as words or phrases that refer to broad categories or general concepts. In other words, beast is a broader category that can describe the more precise term alligator; hence, beast is a hypernym of alligator.

Hyponym is the opposite of hypernym relations, that is, words that belong to a broader category. For example, alligator is a hyponym of beast.

Co-hyponym relations, abbreviated coord, are words or phrases that share a hypernym with the word in question. For example, frog and alligator belong to the same broader category of animals and therefore share hypernyms.

Mero is a word that is a part, a component, an organ or a member related to the word in question. For example, a mouth is a part of an alligator and a motor is a component of a car.

Event is a word that is an action, an activity, a happening or an event related to the word in question. For example, an alligator can attack and a fridge can break.

Random words are judged by humans to be completely unrelated to the word in question. For example, alligator and courthouse have no sound relation to each other, just as turtles and philosophy have few logical connections.

By measuring the similarity between the target word and the best word in each semantic category, we get a quantitative measure of which category is best represented in the WSM.
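A sketch of this scoring idea: for each target word, take the highest cosine similarity within each BLESS category and compare the categories. The data structures are illustrative assumptions; the real BLESS file format and evaluation script differ.

```python
import numpy as np

def best_category(target_vec, candidates_by_category):
    """Return the BLESS category whose best candidate is most similar to the target.

    `candidates_by_category` maps a category name (hyper, coord, mero, ...) to a
    list of candidate word vectors.
    """
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    best = {cat: max(cos(target_vec, v) for v in vecs)
            for cat, vecs in candidates_by_category.items()}
    return max(best, key=best.get), best
```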

2.7 Principal Component Analysis

Due to the extremely high dimensionality of the raw co-occurrence matrix, dimensionality reduction is imperative for most practical applications. For our purposes, the basis of all information queries is a scalar product, and thus of O(sd), where s is a sparseness factor and d is the context dimension. Hence, by reducing the context dimension d, one achieves a proportional improvement in speed and an even greater reduction in memory requirements. Thus, in general, we face the following minimisation problem.

\min_{\hat{F} \in \mathbb{R}^{n \times k}} \, |F - \hat{F}| \qquad (2.8)

where \hat{F} denotes the rank-k approximation of F.

That is, we obtain a basis {x̂_1, ..., x̂_k} that is a linear combination of the original basis {x_1, ..., x_n} and that re-expresses the information in a lower dimensional set. Equivalently, in the language of information theory, the goal is to find a basis such that the mutual information I(X_i, X_j) = 0 for i ≠ j (Landauer & Dumais 1997).

Assuming Gaussian-like distributions, a canonical way of achieving this is to maximise the variance of the data in the new basis (Ding 1999). This enables an ordering of the components by variance, hence the name principal components, or simply PCs.

Hence, for sufficiently Gaussian distributions p(w_i | c_j) \approx e^{w_i \cdot c_j} / Z(c_j),

\min_{\hat{F} \in \mathbb{R}^{n \times k}} |F - \hat{F}| \;\Leftrightarrow\; \max_{x_j, x_k \in \hat{x}_i} \sum_{j,k} (x_j - x_k)^2, \quad i \in \{1, \ldots, k\} \qquad (2.9)

where Z(c) is a normalising factor.


Luckily, as we shall see in the next section, the spectral theorem provides an elegant solution to this problem.

Even though the assumption of a joint Gaussian distribution provides good results, it does not actually hold in reality. Hence, there is a large class of more advanced statistical models based on multinomial distributions, referred to as topic models, which we will not have time to consider; see Hofmann (1999) or Zhang et al. (2013).

2.7.1 Singular Value Decomposition

Our co-occurrence matrices are positive and symmetric and therefore have, by the spectral theorem, a spectral decomposition with an ordered set of positive eigenvalues and an orthogonal set of eigenvectors. Hence the similarity matrix F F^T, which is positive and symmetric for an even larger class of matrices F, has a spectral decomposition.

F F^T = \sum_j \lambda_j \, v_j v_j^T \qquad (2.10)

where \lambda_1 \geq \ldots \geq \lambda_w are the eigenvalues and \{v_1, \ldots, v_w\} the corresponding orthonormal eigenvectors. Moreover, with the log-likelihood l(\hat{x}_j) as defined below, a straightforward maximisation of l yields \hat{x}_j = v_j.

l(\hat{x}_j) = \log \prod_i p(x_i \mid \hat{x}_j) = \hat{x}_j^T F^T F \hat{x}_j - n \log Z(\hat{x}_j) \qquad (2.11)

In other words, the optimal PCA representation of F is the eigenbasis \{v_1, \ldots, v_w\}. Moreover, it can be shown that \sigma_i^2 = \lambda_i, enabling an ordering in accordance with PCA. Rewriting the original co-occurrence matrix F in this basis can be done as below:

F = U \Sigma V^T \qquad (2.12)

where U = (v_1, \ldots, v_w), \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_w), and V is a unitary matrix mapping the original basis of F into its eigenbasis. Hence, by simply choosing the first k eigenvalues and their respective vectors we have the central result:

\arg\min_{\mathrm{rank}(\hat{F}) = k} |F - \hat{F}|_F = U_k \Sigma_k V_k^T \qquad (2.13)

where \hat{F} is the best rank-k approximation of F in the Frobenius norm (Murphy 2012). This is commonly referred to as truncated singular value decomposition, or SVD, and has an efficient algorithmic implementation of O(snk), where s is a sparsity factor, n the number of words and k the rank.

This linear projection onto the eigenbasis does not only provide an efficient compression of the sparse co-occurrence data, but also has an abstracting effect in the creation of the latent semantic structures, improving performance as shown in Bullinaria & Levy (2012).

Moreover, the same study also shows that the SVD decoupling drastically improves robustness in the presence of noise.

Finally, using the sim(wi, wj) = cos(wi, wj) measure of similarity, V is redundant due to invariance under unitary transformations. We will therefore represent the principal components of ˆF in its most compact form ˆF ≡ U Σ without any further comment.
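A sketch of the truncated SVD step and the compact U_k Σ_k representation used from here on. SciPy's sparse svds is one way of making this feasible for a 50000 × 50000 matrix; the function name and the default value of k are assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def reduce_to_pcs(F_ppmi, k=1400):
    """Return the k-dimensional representation U_k * Sigma_k of a PPMI-weighted matrix."""
    U, S, _Vt = svds(csr_matrix(F_ppmi), k=k)
    order = np.argsort(S)[::-1]      # svds returns singular values in ascending order
    U, S = U[:, order], S[order]
    return U * S                     # rows are the reduced word vectors
```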

2.8 Related Work

Caron (2001) introduced a class of LSA scoring functions based on a renormalisation of the latent variables through an exponent factor p ∈ R:

U \Sigma \to U \Sigma^{p} \qquad (2.14)



Using several benchmarks, such as the TOEFL test, Caron showed that the conventional value p = 1 was not optimal. But due to large differences between the optima, no further conclusions about the nature of this effect were provided.

Bullinaria & Levy (2012) began by showing that neither stoplists nor stemming has any significant effect on performance on the TOEFL test. They then, most importantly, moved on to further investigate the results from Caron (2001). Using a symmetric 1-1 window and PPMI weighting, the optimal exponent parameter was, with strong statistical significance, p < 1, in agreement with the results in Caron (2001). Moreover, because this amounts to a redistribution of weight towards the lower-variance PCs, they showed that similar effects can be achieved by simply removing the first few PCs. A highlight of the results is a new state-of-the-art performance on the infamous TOEFL test of 100%. Due to the increase in performance amongst all categories when using some method of reducing the contribution of the highest-variance PCs, Bullinaria and Levy concluded that these components do not represent the desired lexical semantic information.
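A sketch of the two reweighting schemes discussed above, applied to the U_k Σ_k representation. The parameter values in the comments are examples only, and the assumption is that S is sorted in descending order.

```python
import numpy as np

def caron_p(U, S, p=0.25):
    """Caron (2001): rescale the principal components by Sigma**p instead of Sigma."""
    return U * (S ** p)

def remove_first_pcs(U, S, n_removed=100):
    """Bullinaria & Levy (2012): drop the n highest-variance components entirely."""
    return U[:, n_removed:] * S[n_removed:]

# example usage on the factors of a truncated SVD (S sorted in descending order):
# X1 = caron_p(U, S, p=-0.25)
# X2 = remove_first_pcs(U, S, n_removed=379)
```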


Chapter 3

Contribution

After the quick survey of the relevant methods and previous work, we are now ready to narrow down our problem statement and present our contribution.

3.1 Refined Problem Statement

As introduced in Bullinaria & Levy (2012) and Caron (2001), we seek to understand why the redistribution of weight away from the first PCs improves synonym relations in WSMs.

Furthermore, by extrapolating from the previous results, we seek a deeper understanding of the results.

In particular, using the BLESS test, we analyse these effects both quantitatively, through statistics of paradigmatic relations in the test, and qualitatively, through violin plots of the word categories.

From these observations we seek to extract some heuristic explanation for the apparent structure behind the refactorization of latent variables.

3.2 Method

With the refined problem statement formulated, we are now ready to state the means by which we are to answer its questions. The tests presented below are those we found most fruitful in this pursuit.

3.2.1 Word Space Model Construction

We begin the process of creating the WSM from the preprocessed data as follows. In the same spirit as Bullinaria & Levy (2012), we performed as simplistic a preprocessing of the ukWaC corpus as possible (Section 2.3): from the untagged version of the corpus we removed punctuation and converted all letters to lowercase.

The next step is to construct the raw co-occurrence matrices. These are constructed either from part of the corpus or from the entire corpus, with either a 1-1 or a 2-2 context window.

Common to all these matrices is that they contain only the 50000 most frequent words (Section 2.2), as counted over the entire corpus. PPMI weighting (Section 2.4) was then applied to the matrix. Finally, we performed a dimensionality reduction in the form of PCA using SVD on the matrix (Section 2.7.1). This way of building WSMs will be used as the basis for all the following tests.


3.2.2 Verification

Beginning with the same parameters as the best result in Bullinaria & Levy (2012), with the context window set to 1-1 and the entire corpus as data, we perform a dimensionality reduction to 5000 PCs, F_{50000×50000} → F̂_{50000×5000}. These parameters are taken from the best results in Bullinaria & Levy (2012), which received 100% on the TOEFL test (Section 2.6) using both the Caron P transform and the PC removal scheme (Section 2.8). With the aim of replicating these results, the Caron P transform as well as the PC removal scheme is applied to F̂ and benchmarked with the TOEFL test.

In order to provide a fruitful juxtaposition of the results, statistical validity is of primary concern. Hence, we also split the corpus into 40 different parts (2.5% of the corpus each), more than the 12 used in Bullinaria & Levy (2012), in order to truly anchor the validity of the results.

However, to compensate for the lowered word frequencies in these sub-corpora, we double the window size to 2-2. Even though a small window size fosters good paradigmatic relations, as shown by Bullinaria & Levy (2012) the difference between 1-1 and 2-2 is small compared to the doubled frequency count. Hence, it is a matter of pragmatism, as it provides a good balance between computational speed and statistical robustness. Using a 1400 PCs representation, both the Caron P transform and the PC removal scheme are applied and benchmarked on the TOEFL test.

To verify that these results are equivalent to those with a 1-1 context window, we perform the same tests on larger parts of the corpus, from 5% to 100%, with the same number of PCs (1400) as for the smaller 2.5% parts. Finally, to compare these with the best results of Bullinaria & Levy (2012), we also provide one benchmark of the 5000 PCs representation to see whether we obtain the same 100% score.

3.2.3 Qualitative Tests

The next step, after the verification of the previous results, is to provide a qualitative analysis.

This is done using the BLESS test (Section 2.6), as it provides a more robust paradigmatic benchmark and most importantly facilitates qualitative analysis in the form of classification into semantic classes.

We begin by conducting a quantitative version of the BLESS test by applying both the Caron P transform and the PC removal scheme. This is done with the same WSMs that were tested with the TOEFL test in the previous section. The test was conducted in accordance with Baroni & Lenci (2011), where an answer is counted as correct when either a hypernym or co-hyponym relation gets the highest cosine similarity to the target word. This way, we restrict our analysis to hypernym and co-hyponym relations since they are, like the synonyms of the TOEFL test, paradigmatic relations.

To deepen our understanding of these results, we begin by scrutinising how the Caron P transform and the PC removal scheme affect the cosine similarity measures of the different semantic classes. In particular, based on the results in Section 3.3.2, we suspect that co-hyponym and hypernym relations change more under the weight redistribution relative to the other classes.

Hence, by normalising the cosine measures of the best result Θ_i of each category i with the mean µ and standard deviation σ of all categories, we end up with a distribution of normalised cosine measures Θ̂_i for each class.

\hat{\Theta} = \frac{\Theta - \mu}{\sigma} \qquad (3.1)

To analyse how these distributions ˆΘi change with the redistribution, we use violin density plots. We begin by applying this analysis on the representation with the best result and where the redistribution of weight is the largest compared to the original, which we will see in Section 3.3.2 will be the 5000 PCs representation under Caron P transform with p = −0.25.



As the Caron P transform redistributes the weight on all PCs, and the best answers from each category form a very small set of words, it is difficult to draw general conclusions. Hence, to ensure that the words entering the statistics are the same, we perform the same violin-distribution visualisation as before but include all words from each category, regardless of cosine measure. Moreover, we compare the best result from the PC removal scheme both with the corresponding removed PCs in isolation and with the original distribution. This way we hope to get a more representative and more easily analysed picture of how the entire WSM changes.

3.2.4 Frequency Norm Relation

We will conduct quantitative measurements of the relation between co-occurrence frequencies and optimal eigenvalue distribution to establish an empirical relationship.

As we will see from the results in Section 3.3.2, we have compelling reasons to investigate the relationship between target word frequency and cosine measure under redistribution of PC weight. In particular, we will test the hypothesis that the answers which the WSM gets wrong are transformed differently under the redistribution of weight on the PCs.

This categorisation of words into frequency ranges also enables a comparison between the words in the BLESS and TOEFL tests. To compensate for the low number of incorrect answers in both tests, we gather this data for all 40 parts of the corpus, with and without applying their respective optimal Caron P transform. The results for the TOEFL and BLESS tests are then compared using density plots of the cosine measures for an appropriate choice of frequency intervals.

3.2.5 Statistical Validity

To test the statistical significance of the results, we will use a paired two-tailed t-test between the results before and after the transformation. In other words, we test the null hypothesis that the difference between two sets of results has a mean value of zero. Where appropriate, we will also use other statistical measures to characterise the results, but as we are mostly interested in qualitative results, we will not go into their statistical significance in detail.
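A sketch of this significance test using SciPy's paired t-test across sub-corpus scores; the score arrays below are placeholders, not measured values.

```python
import numpy as np
from scipy import stats

# one TOEFL (or BLESS) score per sub-corpus, before and after a transform (placeholder values)
scores_before = np.array([0.70, 0.68, 0.73, 0.71])
scores_after = np.array([0.78, 0.75, 0.80, 0.77])

t_stat, p_value = stats.ttest_rel(scores_after, scores_before)  # paired, two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```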


3.3 Results

Using the introduced methods we here present our results in chronological order.

3.3.1 Verification

To begin, we verify that each step in the factorisation of the word co-occurrence frequencies has the desired effect. Using the same 1-1 window as in Bullinaria & Levy (2012), we are able to reach the impressive 100% TOEFL score. As seen in Table 3.1, the improvement is substantial: from 62.5% on the untreated WSM to 100% when removing the first 379 PCs from the PPMI-transformed version in the 5000 PCs representation.

Raw      PPMI      PCA: 5000 PCs   Caron P = −0.4   PCs 379-5000
62.5%    81.25%    85%             93.75%           100%

Table 3.1. Verification of TOEFL results using window 1-1.

Based on this promising first result, we move on to strengthen its statistical validity. Using 1400 PCs of the PPMI-weighted 40 parts of the ukWaC corpus, processed with the doubled window size of 2-2, we apply the same procedure of eliminating one principal component after the other and run the TOEFL test. As shown in Figure 3.1, we obtain a large improvement by removing about 100 PCs. The statistical significance of the improvement is far above five sigma on the paired two-sided t-test described in Section 3.2.5.

Figure 3.1. TOEFL score for removed PCs and under the Caron P transformation. The light green band indicates the three sigma confidence interval around the mean. For larger figures, see Appendix A.

Again using the 40 subsets of the corpus, we perform the Caron P transform from Caron (2001). From Figure 3.1 it is clear that a significantly better score is obtained with p ≠ 1. Specifically, there is a clear and statistically significant improvement at the five sigma level, as in the case of the removal of PCs.

In line with Bullinaria & Levy (2012), we also aimed for a 100% score with a 2-2 window, to ensure that the doubled window size does not interfere with the phenomena behind these results. Hence, we increased the number of latent dimensions to 5000 PCs and started to remove them in descending order. The results are shown on the left in Figure 3.2 and are as good as, or even better than, those for the 1-1 window.

An equivalent score is achieved using the Caron P transform, and we actually obtain a clear improvement compared to the 1-1 window. We will not investigate the discrepancy between the 1-1 and 2-2 results, as our main interest is to understand the reason behind the phenomenon.



Figure 3.2. TOEFL score for the removed PCs scheme and the Caron P transform on the 5000 PCs representation.

As a final sanity check we also verified that the positive effect of a larger corpus as shown by Bullinaria & Levy (2012) remains valid under the Caron P transform. As shown in Figure 3.3 below, the results behave roughly as expected.

Figure 3.3. TOEFL score for 1400 PC representation of different corpora sizes under the Caron P transform. For larger figures, see Appendix A.

Even though the sample size is small, our results indicate a similar percentage improvement using the Caron P transform on different sizes of corpora. Moreover, the optimal value of p is clearly lower with a larger corpus: from around 0 at 5% to between −3 and −2 for larger corpus sizes. But as the optimal value is less interesting than the actual cause of this phenomenon, we move on.

The unquestionably positive impact of redistributing weight away from the PCs of highest variance provides plenty of motivation to study this effect more thoroughly. As the 400 words of the TOEFL test are all rather peculiar (they are chosen to be hard for humans, with no consideration of the difficulty level for computers), we introduce a broader and more qualitative analysis of the effects of the studied methods in the next section.


3.3.2 Qualitative Tests

As discussed in Section 2.6, the words in the BLESS test span a wider frequency range and a wider class of words than the TOEFL test. By scoring the model as correct when it chooses paradigmatic relations over syntagmatic ones, we still test similar properties to the TOEFL test: the ability of the model to form paradigmatic relations. Hence, when either of the two classes of paradigmatic relations in the test, hypernyms and co-hyponyms, is chosen over the other semantic relations, the model is scored as correct (Section 3.2.3).

As before, we use the 40 parts of the corpus with a 2-2 context window and 1400 PCs and start removing the highest-variance PCs in descending order. The results are shown on the left in Figure 3.4.

Figure 3.4. BLESS score with removed PCs and under Caron P transformation for the 1400 PCs representation. The light green band shows the three sigma confidence interval around the mean. For larger figures, see Appendix A.

As seen, the results are not as positive as for the TOEFL test but remain statistically significant at the five sigma level with the paired t-test.

This clearly greater sensitivity to the optimal parameter should have a significant impact on the Caron P transform, as it changes the weight on all PCs simultaneously. Hence, as before, we continue with the 40 parts of the corpus in the 1400 PC representation and apply the Caron P transform.

As shown on the right in Figure 3.4, there is no longer a clear improvement. Nevertheless, a paired two-sided t-test still gives a statistically significant difference at the three sigma level. Even though this is not surprising considering the result of removing PCs, it is still a quite remarkable difference compared to the TOEFL results.

Albeit puzzled by these results, we continue by examining whether the size of the corpus makes any difference. Based on the TOEFL results, we would expect the optimum for a larger corpus to lie at a larger number of removed PCs. Fortunately, this is also the case, as shown in Figure 3.5. Interestingly, there is clearly a larger improvement in performance for the Caron P transform than for the PC removal scheme.

Based on the quantitative increase in cosine similarity implied by stronger paradigmatic relations compared to syntagmatic ones, we expect to be able to see a relative difference between categories under the redistribution of weight. This difference between the categories of the BLESS test is best illustrated by violin plots based on the maximum values of each category for each of the questions in the BLESS test, where the width of the violin represents the normalised probability density of cosine measures in each category, as described in Section 3.2.3. As stated, we expect some difference between how hypernym (hyper) and co-hyponym (coord) similarities are transformed compared to the rest of the relations; the results are shown in Figure 3.6.



Figure 3.5. BLESS score for the PC removal scheme and the Caron P transform for different corpus sizes.

However, as seen in the figure, apart from the clear dominance of co-hyponym relations both before and after the Caron P transform, there are only small, subtle changes. The difference between the distributions is that the cosine measures for co-hyponym and hypernym relations have slightly larger variance from their outliers. As it is the relative difference between each semantic category and a specific word that matters in the BLESS test, there may be larger structural changes responsible for the performance increase that are not captured by the density plot. Although not visible in the normalised plots, the most striking change before and after the Caron P transform is a clear shift towards lower similarity: the cosine measures are all closer to 0. This is illustrated in Figure 3.7, where the paradigmatic relations are merged into one distribution and the cosine measures are not normalised. Moreover, to facilitate the analysis, incremental removals of PCs are applied instead of the Caron P transform.

Even though the hypernym relations do in fact increase slightly relative to the removed PCs, they all overlap with each other's interquartile ranges.

Figure 3.6. Top: 5000 PCs representation of entire corpus. Bottom: Caron P transform with p = −0.25.


Figure 3.7. Violin plot of hyper and co-hyponym relations versus the rest, using the 5000 PC representation of the entire corpus.

Moreover, the greater number of outliers in the hypernym category might have some positive effect but, compared to the original tail of the co-hyponym distribution, the effect is small.

As there is no clear pattern that could be responsible for the large quantitative improvement on the BLESS and TOEFL tests, we continue by looking at the set of PCs with the best results according to the PC removal scheme. That is, we look at the components which we remove: the first 120 PCs in isolation. Moreover, to ensure that we do not miss any large structural changes, we include all word associations and not only the best from each category. The result is illustrated using violin plots of the normalised cosine distributions in Figure 3.8.

Based on the three violin plots, it seems as if the top 120 PCs contain a higher level of co-hyponym relations than the lower ones. Moreover, the distribution of coord relations in the representation without the top 120 PCs is most accurately described as the inverse of these. However, compared to the mean and interquartile ranges, there is a larger tail which deviates more significantly than in the distribution for the first 120 PCs. In essence, it seems as if there is a more fundamental reason underlying this phenomenon, as it is the very combination of PCs that yields the best representation.

Despite having seen that the first PCs contain some valuable information, and that there are some small differences between how the classes of the BLESS test are affected by the redistribution of weight away from them, it is far from clear why we get these results.

Another way of quantifying the difference between the words is their frequency in the corpus. As the BLESS test contains a larger set of words over a wider range of frequencies, this could perhaps also shed some light on the difference from the TOEFL results. The relation between frequency and the eigenvalue distribution is studied in the next section.

3.3.3 Frequency Norm Distribution

The frequency versus rank distribution of the entire corpus is, as shown in Section 2.3, well described by a Zipf distribution with exponent s = −0.87. But, as shown below in Figure 3.9, the smaller sets of TOEFL and BLESS words are less so.

Figure 3.8. BLESS targets versus categories from the 1400 PCs representation of the entire corpus. Top: first 120 PCs, middle: all but the first 120 PCs, bottom: all PCs.

Figure 3.9. Rank of words.

As the very foundation of the WSM is to count co-occurrence frequencies, we expect the norm distribution of the raw context vectors to follow roughly the same exponential decay as the corpus. As seen in Figure 3.10, this is indeed the case: the norms of the raw vectors decline with rank with an exponent of −0.86. However, as seen in the same figure, the PPMI weighting drastically reduces this relation, and the best-fit exponent is s = −0.14. Interestingly, the eigenvalue distribution, λ_i = σ_i², follows almost the same distribution as the word frequencies. This may of course be a coincidence, but as seen in Figure 3.10, it is not only the exponent of the decline that matches, but the magnitude as well.

Figure 3.10. Norm.

Based on the unclear changes amongst the word categories, it is thus natural to postulate that there might be some relation between word frequency and the optimal weight on the PCs. Even though it is clear that the PPMI weighting does a good job of removing frequency effects, the eigenvalue distribution suggests that the latent semantic structure might still be skewed by high-frequency words. Hence, we postulate that, since all words forming the semantic structures are drawn from a Zipf distribution, there may be word pairs that remain skewed despite the PPMI normalisation.

Hence, even though this effect is not visible in the Frobenius norm of the PPMI due to the normalised columns, the effect might still exist as indicated by the eigenvalue distribution.

This hypothesis was also considered by Bullinaria & Levy (2012), who initially postulated that the removal of the first PCs is equivalent to using a stopped corpus¹. To test this hypothesis, we classify the cosine measures of the words which our model answers incorrectly based on their frequency range. As discussed in Section 3.2.4, this categorisation of incorrect words enables a comparison between the incorrect answers on the TOEFL test and those of the BLESS test. As the incorrect answers are few, we use the 40 sub-corpora from the earlier tests to improve statistical significance, and the Caron P transform to maximise the weight redistribution. The results are shown in Figure 3.11.

In agreement with Bullinaria & Levy (2012), separating the words by frequency does not reveal any difference in behaviour under the Caron P transform. However disappointing these results may be, they all point to some very interesting topological properties of WSMs.

The eigenvalue distribution shows a great similarity with that of undirected graphs of the internet, as discussed by Mihail & Papadimitriou (2002). In fact, the eigenvalue distribution has been shown to be an effective benchmark for models trying to construct similar graph structures. However, one large difference is that the domain of the exponential decline is much larger here. This probably implies that there is even more structure and signal to be found in the semantic structures of language than there is between websites.

Further studies are needed on the eigenvalue distribution and we suspect that some insights from random matrix theory may be fruitful in this goal. We leave this for further studies and move on to one last observation that may be of help in a more general setting.

¹ A corpus from which stopwords have been removed.



Figure 3.11. The incorrect answers in each test form a distribution of cosine measures between the correct word pairs. Low: f < 100, intermediate: 100 ≤ f < 500, high: f > 500. Top: BLESS results, bottom: TOEFL results. Left: 1400 PCs, right: Caron p = −0.25. For larger figures, see Appendix A.

3.3.4 Equivalent Transformations

Judging by our results, it is not likely that there is an easy way to find a general expression for the optimal redistribution of weight on the PCs for a given application. The large variance in the optimal parameter for a given task suggests that neither the PC removal scheme nor the Caron P transform is optimal.

Consider the common practice of using the first PCs representing 80% of the total eigenvalue mass as a sufficiently good representation of the raw data.

\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \approx 80\% \qquad (3.2)

There are many similar methods, but in general they all vary greatly in effectiveness depending on the type of data (Peres-Neto et al. 2005).

However, as shown in Bullinaria & Levy (2012), for all practically achievable numbers of PCs, the more the better. Inspired by this measure and guided by the results from Sections 3.3.1-2, we postulate an inverse version of the 80/20 rule for WSMs. Given a computational and practical limit m on the number of PCs, with weights Λ = {σ_1, ..., σ_m}, the optimal redistribution of weight is such that the components {σ_l, ..., σ_m} bearing 80% of the original mass come to carry about 20% of the new total mass. In other words, the function f : Λ → Λ̂ performing this redistribution is such that:

\frac{\sum_{i=l}^{m} \hat{\sigma}_i}{\sum_{i=1}^{m} \hat{\sigma}_i} \approx 20\% \qquad (3.3)

In this formulation, we can consider the Caron P transform and the removal of PCs scheme as special cases of such transforms.


f(\Lambda) =
\begin{cases}
f_{\mathrm{caron}} : \; f(\sigma_i) = \sigma_i^{\,p} \quad \forall i, \; p \in \mathbb{R} \\
f_{\mathrm{bullinaria}} : \; f(\sigma_i) = (1 - \delta_F(i)) \, \sigma_i \quad \forall i, \; F = \{1, \ldots, l\}
\end{cases} \qquad (3.4)

where δ_F(i) is the generalised Kronecker delta (indicator) for the set F of the first PCs.

To test this claim, we form this quotient for the distributions of weights at the optimal parameters of the Caron P transform and the PC removal scheme, for both the BLESS and TOEFL tests. The results are shown in Figure 3.12.

Figure 3.12. Left: TOEFL test, right: BLESS test, top: PC removal scheme, bottom: Caron P transform. For larger figures, see Appendix A.
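A sketch of the quotient in equation (3.3): find the components that carry 80% of the original mass and measure their share after a transform f. The assumption here is that the spectrum is sorted in descending order, so the 80% set is the head of the distribution; the synthetic spectrum in the example is a placeholder.

```python
import numpy as np

def mass_quotient(sigma, transformed, share=0.8):
    """Share of the transformed weight carried by the components that held
    `share` of the original weight (sigma assumed sorted in descending order)."""
    sigma = np.asarray(sigma, dtype=float)
    transformed = np.asarray(transformed, dtype=float)
    cumulative = np.cumsum(sigma) / sigma.sum()
    head = cumulative <= share          # components holding ~80% of the original mass
    if not head.any():                  # degenerate case: the first component alone exceeds `share`
        head[0] = True
    return transformed[head].sum() / transformed.sum()

# example: Caron P with p = -0.25 applied to a synthetic power-law spectrum
sigma = np.arange(1, 1401, dtype=float) ** -1.0
print(mass_quotient(sigma, sigma ** -0.25))  # a value near 0.2 would support the rule
```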

As seen, the results point in favour of this measure: there are only slight differences between the optimal mass distributions of the Caron P transformation and the PC removal scheme, for either the TOEFL or the BLESS test. In an attempt to use this apparent structure, we revisit the methods of redistributing weight on PCs used so far. The Caron P transform and the PC removal scheme represent two extremes of redistributing the weight. The removal of the first PCs simply erases all information they contain. But, as we have seen, they do contain some valuable information, so systematically removing such information does not seem feasible in real applications. While the Caron P transform provides just as good results, it has been shown to be more sensitive to the exponent parameter p. Hence, it would be nice to have something in between.

As scaling the entire distribution by the same constant does not change the relative weight distribution, normalising the spectrum before exponentiation provides a good starting point for a more general transformation.

f_I : \; f(\sigma_i) = \left( \frac{\sigma_i}{I} \right)^{p} \quad \forall i, \quad p, I \in \mathbb{R} \qquad (3.5)

Choosing I to be a value in the lower part of the eigenvalue distribution has a particularly desirable effect. For values p > 0, the transform has the same effect as the Caron P transform. Moreover, for p < 0, the transform decreases the weight on the components ranked before I much faster than on those after it, and hence moves towards removing the PCs before I completely. This effect is entirely due to the exponential distribution of weight. The difference between the original Caron P transform and this normalised version is shown in Figure 3.13.

Figure 3.13. Distribution of weights for different values of p ∈ [−1, 1], each designated by a separate colour, for the Caron P transform (left) and the mean normalised version (right).

It turns out that the mean value of the variance is often a good indicator of where the optimal number of removed components lies. Hence, choosing this value, I = \frac{1}{m}\sum_{i=1}^{m} \sigma_i, gives a new and more pragmatic way of redistributing the weight on the PCs. As shown in Figure 3.14, the results on the entire corpus for a range of PCs show that this method is more robust to the weighting parameter p for different numbers of dimensions.

Figure 3.14. Mean normalised Caron P transform on the TOEFL and BLESS tests
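A sketch of the mean-normalised variant defined in equation (3.5), with I set to the mean singular value; the function name is an assumption.

```python
import numpy as np

def mean_normalised_caron_p(U, S, p=0.25):
    """Caron P transform applied after dividing the spectrum by its mean value I."""
    I = S.mean()
    return U * ((S / I) ** p)
```

Because the division by I leaves the relative weights (and hence the cosine measures) unchanged for p = 1, the only effect of the normalisation is to change how the exponent acts on components above and below the mean.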

The most important quality of this transform is that it is less sensitive to the choice of exponent compared to the Caron P transform, and hence perhaps more practical for applications.


3.3.5 Reductio Ad Absurdum

This section aims to illustrate some of the bizarre properties of this representation by taking the removal scheme ad absurdum. As discussed in Karlgren et al. (2008), the topology of semantic information in WSMs is most aptly characterised as local structures of low fractal dimension embedded in a more global filament structure.

Considering the global nature of PCA, it is perhaps not too surprising that this linear projection onto the variance-scaled eigenbasis constituting the PCs does not constitute a canonical way of describing semantic relations in text. This fact is clearly illustrated in Figure 3.15 below.

Figure 3.15. TOEFL score versus removed PCs.

As seen in the figure, a large part of the semantic information appears to be found many orders of magnitude below the largest PC. While it may be the case that this effect depends on the particular properties of the TOEFL test, it is still likely that this feature is characteristic of this particular topology. In essence, this suggests that if the local structure of word spaces could be exploited, there are potentially huge gains in efficiency to be found in such a representation.


Chapter 4

Discussion

This thesis has investigated the effects of redistributing weight on the latent dimensions. In particular, by extrapolating from previous results in Bullinaria & Levy (2012), we have analysed how different semantic categories are affected under such transformations. While our results illustrate some interesting properties, we could not derive any general conclusions from them.

In particular, we are unable to answer all our research questions, but provide a number of interesting results that encourage future studies.

4.1 Summary of Results

We succeeded in reproducing the 100% score on the TOEFL test using three different ways of redistributing the weight: the Caron P transform, the PC removal scheme, and a new normalised version of the Caron P transform. Using the same transformations we received a score of 91.5% on our paradigmatic version of the BLESS test. Apart from this, the BLESS categories gave little help in understanding why the redistribution of weight on PCs is effective.

We also found an interesting similarity between the eigenvalue distribution and the Zipf distribution of word frequencies. This led us to believe that the redistribution of weight acted differently depending on the frequency of the words, but our tests found no such difference for either the TOEFL or the BLESS words. Finally, we also note that these redistributions can be characterised by an inverse 80/20 rule: the optimal redistribution of weight is such that the top 80% of the original distribution amounts to about 20% of the mass after the transformation.

4.2 Statistical Validity

Using a paired two-sided t-test, we have consistently been able to strengthen the hypothesis that the redistribution of weight provides a statistically significant increase in performance, at a confidence level of at least three sigma, on both the BLESS and TOEFL tests. Since the extent of this increase varied greatly, it was not analysed further. The same applies to the specific parameters of each optimum. However, these are studied from the point of view of 80/20 quotients in Section 3.3.4. Even though these quotients have been tested outside the realm of the 1400 PCs representation of the 40 sub-corpora, further tests are needed for a stronger statistical significance of the result.

One interesting property of the PC removal scheme on both the BLESS and TOEFL tests was the high variance in the results (Figures 3.1 & 3.4). A considerable portion of the results lie far beyond the three sigma confidence interval. This suggests that it may be more appropriate to consider them as drawn from a Cauchy distribution than from a normal distribution.

Finally, the small qualitative differences between the normalised cosine measures of each semantic category in the BLESS test can all be considered significant due to the properties of the cosine measure in high dimensional vector spaces (Section 2.5).


4.3 Evaluation of Method

In this section we qualitatively discuss the various ways our choice of method could have affected the validity and scope of our results.

We begin in chronological order with the preprocessing, which could have been done more thoroughly. Since we did not remove stopwords from our corpus, the primary concern is that the high-frequency stopwords contaminate the semantic relations (Section 2.3). However, our results show that the PPMI weighting removes this effect (Section 3.3.3). Another potential mistake concerning the data was the use of an untagged and unstemmed corpus. As both the TOEFL and BLESS tests are based on word stems, this may have had a negative impact on the test results. Our final concern with the preprocessing is that we used an American version of the TOEFL test while our corpus was derived from British websites. This has probably also had a negative impact on the test results. It could have been averted in the preprocessing by translating the TOEFL words into British English. But since both the American and British versions of the TOEFL words were within the frequency span of the first 50000 words, this effect was probably smaller than the absence of stemming.

The effects of splitting the corpus into 40 parts should perhaps have been investigated more thoroughly, in particular since most TOEFL words were in the lower end of the frequency spectrum. This was compensated for by increasing the window size to 2-2, but still, the smallest word frequency in any of the sub-corpora was 10. Hence, the results for the sub-corpora could be somewhat misleading. It might have been wiser to use a frequency cut-off and remove the questions where not all words were present in the model. Luckily, the results of analysing the questions that the model got wrong, and in particular their frequencies, calmed our nerves slightly, and we do not believe the absence of a frequency cut-off could have played a significant role in the effectiveness of redistributing the weight on PCs (Section 3.3.3).

The choice of focusing the analysis on the 1400 PCs representation was due to its nice trade-off between speed and evaluation performance. However, as seen for the range included in this thesis, 120-5000 PCs, the more the better (at least for WSMs based on 100% of the corpus).

Moreover, as the focus of this thesis has been the redistribution of weight on PCs, the validity of our results beyond the studied range strongly depends on the continuation of the same eigenvalue distribution outside this range. One way to detect such a change in distribution would be if it approaches that of a completely random counterpart, as discussed in Peres-Neto et al. (2005).

Another important aspect of the PCA dimensionality reduction using SVD is that it aims to capture the global latent structures using a linear projection onto the eigenbasis. It is far from the only way to perform dimensionality reduction and probably not the most effective. The results presented about the redistribution of weight from SVD do not have any explicit connection to other methods. However, as discussed in the outlook, an obvious continuation of the study is to see whether, and to what extent, this is a general phenomenon for WSMs.

Regarding the distribution of cosine measures between the semantic categories of the BLESS test (Section 3.3.2), we do not know of any factor other than the weight redistribution that could have skewed the distributions. Moreover, as there was little difference between the complete distribution of relations and the ones that were answered incorrectly, there is little room for any deviating pattern in the distribution of correct answers. What is less clear is whether the performance increase on both the TOEFL and BLESS tests can be explained by the small qualitative difference in how the semantic categories are transformed under the weight redistribution. The same applies to the frequency categorisation, where there is some evidence of a relationship to the frequency distribution (Section 3.2.3), but the test showed no such correlation. While we do not know of a more effective method, there might be more conclusions to be drawn from our tests; we leave this for further research.



In general, we believe the BLESS test provided a good way to get a deeper understanding of which semantic relations are responsible for the TOEFL results. But as the categorisation tests it provided were largely inconclusive, there may be other semantic tests that could provide a more fruitful way of examining the weight redistribution. We still find the results from the BLESS test compelling and more statistically sound than the TOEFL results, and they have fulfilled a large part of their purpose by strengthening the results from Bullinaria & Levy (2012) and Caron (2001).

Finally, the observed inverse 80/20 rule may be an ineffective measure. But, as the introduction of the robust mean-normalised Caron P transform was based on this observation, it may also turn out to be a handy heuristic. While it was the only additional weight function with this property to be properly analysed, there may be more exotic functions that provide even better results. Hence, more tests are needed to fully confirm whether there is any merit in the inverse 80/20 rule.

4.4 Outlook

As illustrated by the previous section, a recurring element of this thesis has been its narrow scope. For each step in the WSM construction, from preprocessing to the choice of similarity measure and frequency weighting, there are many more options to consider. Hence, there are plenty of changes in both parameters and data over which the domain of the observed effects could differ. Moreover, considering the unclear nature of the results, it is still possible that they are accounted for by some representation-invariant feature of WSMs and could thereby have counterparts in a more general setting.

We suggest a continuation of our study on four fronts. Based on our procedure, we suggest one separate study of each parameter ceteris paribus¹. First, consider more advanced preprocessing tools, such as stemming, for a narrower and cleaner set of semantic relations.

Secondly, investigate whether other methods of dimensionality reduction, such as the non-linear counterparts of PCA, exhibit counterparts to the studied effect of SVD. Thirdly, carry out a more complete study of the optimal parameters in relation to the inverse 80/20 rule. Such a study could perhaps be fruitful in conjunction with a more radical study of transformations involving non-linear alternatives to the linear scaling. Finally, further tests are needed on other types of linguistic relations to test the scope of these results amongst different semantic categories.

¹ With all else held constant.


References
