
Second cycle, 30 credits. Stockholm, Sweden, 2018.

Topic discovery and document similarity via pre-trained word embeddings

SIMIN CHEN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Abstract

Throughout history, humans have continued to generate an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity.

In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power.

Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it allows for topic discovery of the document. On three labeled public data sets, our model achieved k-nearest neighbor classification accuracy comparable to that of five state-of-the-art baseline models. Furthermore, from these three data sets we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis:

Can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate the document similarity?

Keywords

Document similarity; document representation; word embedding; natural language processing; topic modeling

Sammanfattning

Throughout history, humans continue to create a growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these large collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus and then compute the distance between two document vectors as the document similarity.

In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate the discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power.

Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it enables topic discovery for the document. On three labeled public data sets, our model achieved k-nearest neighbor classification accuracy comparable to that of five state-of-the-art baseline models. Furthermore, from these three data sets we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis:

Can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate the document similarity?


Acknowledgments

I would like to first express my gratitude to my supervisor Prof. Sarunas Girdzijauskas for his guidance and help throughout this research. It was his full support that enabled me to pursue my research interest and carry out the work on the chosen research topic. His insightful feedback was invaluable to the achievements of this research.

I would like to thank my examiner Prof. Henrik Boström for his detailed feedback on this thesis. His remarks not only made it possible to take the quality of this thesis to a higher level but also helped me become a better researcher.

Finally, I would like to thank my family for supporting me in everything I needed during my pursuit of the master's degree.

Contents

1 Introduction
1.1 Background
1.1.1 Document representation
1.1.2 Word embeddings
1.1.3 From word embedding to document similarity
1.1.4 Distance metrics for document similarity
1.2 Problem
1.2.1 A word clustering strategy
1.3 Purpose
1.4 Goal
1.5 Benefits, ethics, and sustainability
1.6 Methodology
1.7 Delimitation
1.8 Outline

2 Extended background
2.1 Natural language processing
2.2 Document representation
2.2.1 Bag-of-words
2.2.2 Term frequency-inverse document frequency
2.2.3 Latent Semantic Indexing
2.2.4 Latent Dirichlet Allocation
2.2.5 Common document similarity measures
2.2.6 Summary
2.3 Word embedding
2.3.1 Word2vec
2.4 Soft cosine similarity
2.5 Word Mover's Distance
2.5.1 Word centroid distance
2.5.2 k-nearest neighbor classification
2.6 Clustering analysis
2.6.1 k-means clustering
2.6.2 Spectral clustering
2.6.3 Internal cluster validation criteria

3 Methodology
3.1 Research methods
3.2 Data collection
3.3 Data analysis
3.3.1 Verifying the word clustering assumption
3.3.2 Evaluating the effectiveness
3.3.3 Evaluating the efficiency and interpretability with the algorithms
3.3.4 Evaluating the interpretability with an experiment
3.4 Model design
3.5 Experimental design
3.5.1 Experiment 1: verifying the assumption
3.5.2 Experiment 2: effectiveness of the model
3.5.3 Experiment 3: interpretability of the document representation

4 Verifying the word clustering assumption
4.1 A simple synthetic document
4.2 A complex synthetic document
4.3 A short real-world document
4.4 A long real-world document
4.5 Discussion

5 The proposed model
5.1 Logistic word importance
5.1.1 Word cluster importance
5.2 Topic modeling with word embeddings
5.2.1 Notations
5.2.2 Building the nBTE representation
5.2.3 A view of data compression
5.3 Soft cosine document similarity

6 Performance evaluation
6.1 Efficiency
6.2 Effectiveness
6.2.1 Overall results
6.2.2 Comparing the baseline models with our model
6.2.3 Comparing WMD with the soft cosine similarity
6.3 Interpretability
6.3.1 Comparing with LDA
6.3.2 More sample documents

7 Conclusion
7.1 Discussion
7.1.1 Word clusters in the word2vec vector space
7.1.2 Effective similarity measure
7.1.3 Efficient similarity estimation
7.1.4 Interpretable document representation
7.2 Drawbacks and future work

Bibliography
Appendices
A Sample documents
A.1 Sample document from BBCNews data set
A.2 Sample document from BBCSport data set
A.3 Sample document from news20group data set
B Used software

Chapter 1

Introduction

Written language enables humans to store complex information and communicate with each other across time and space. The ancient Chinese threw tortoise shells into the fire and read the crack signs on the shells in order to tell fortunes [29]. We could regard these tortoise shells as the earliest form of Chinese documents, which talk about fortune. Throughout history, humans have continued to generate an ever-growing volume of documents, not only about fortune, but also about our joys and sufferings, every destroyed and flourishing civilization, every magnificent technology, thousands of splendid ideologies and religions, our theories about the universe, competing economic doctrines, and more.

With the advancement of information technology, these documents are digitized and maintained in large data centers and on personal devices at an unprecedented speed. Due to the fast-increasing volume of documents, manual handling of various document-related tasks is no longer feasible. We now rely on computer programs to automatically process these vast collections of documents in various applications.

Many of these applications involve comparing the similarity between documents. A common information retrieval (IR) task is, given a specific document, querying a ranked list of documents that are most similar to it. This requires a quantitative measure of document similarity. Such a measure is also useful for other tasks like document classification and clustering, where a natural strategy is to assign the same label to similar documents [2].

Due to its importance in real-life applications, document similarity is a central research theme in the fields of information retrieval and natural language processing [6]. Researchers in these fields have proposed various models for document representation and document similarity, which have been successfully applied to tasks like document retrieval and document classification. These models typically first learn a vector representation for each document using a large corpus of documents [31], and then measure the distance between two document vectors as the document similarity.

Recently, significant progress has been made on word representation models [32][38]. These models are able to learn high-quality vector representations for millions of words using a corpus of hundreds of billions of words [13]. These word vectors are shown to accurately capture the similarity between words and are associated with abstract concepts [32][33]. Researchers have also made these learned word vectors publicly accessible [13][21][11]. This sheds new light on a bottom-up approach to document similarity: we could directly infer the topics of documents and compare their similarity using the publicly available word vectors, without the need for an external corpus. This is intuitive since a document is constructed from words. If we understand the individual words well, we should also be able to understand a document without referencing other documents.

In this thesis, we further investigate the feasibility of this bottom-up approach. We aim to propose a new document representation model that is useful for topic discovery and document similarity, requiring only the publicly available word vectors.

1.1 Background

In order to quantitatively measure document similarity, we need 1) an appropriate representation of documents and 2) a suitable distance metric. Both ingredients are important for an effective similarity measure. On the one hand, the representation of documents should capture as much information as possible. On the other hand, the distance metric should be meaningful for the chosen representation.

1.1.1 Document representation

Traditional IR methods like BOW and TF-IDF, discussed in sections 2.2.1 and 2.2.2, transform a document into a sparse vector representation whose components refer to unique words [43]. In the past decades, machine learning researchers have proposed more advanced models that learn dense vector representations of documents.

Two notable models are Latent Semantic Indexing (LSI) [7] and Latent Dirichlet Allocation (LDA) [4]. LSI achieves significant vector compression by decomposing the word-document matrix. Deerwester et al. showed that the resulting latent vector space leads to improved performance in the document retrieval task [7]. In contrast, LDA takes a probabilistic approach to modeling documents [4]. In the LDA model, the words are probabilistically grouped into topics and a document is modeled as a distribution over the underlying set of topics.

All these models transform the documents into vectors using a corpus. We could then compute the cosine similarity between two document vectors as the document similarity [2].

1.1.2 Word embeddings

In recent years, significant progress has been made on the problem of word representation [32][38]. The celebrated word2vec model [32] can efficiently learn a dense vector representation of real values for each word. Such vectors are also called word embeddings. Mikolov et al. demonstrated that both syntactic and semantic relationships between words are preserved in the word2vec vector space [32]. Semantically similar words tend to have similar embeddings, and such similarity is quantified using the cosine similarity. A famous example that demonstrates the power of the word2vec embeddings is that vector("king") − vector("man") + vector("woman") results in a vector that is closest to vector("queen") in terms of cosine similarity [32]. This shows that the word2vec embeddings can not only capture the semantic similarity between words but also associate each word with some abstract concept.

Figure 1.1 (taken from [25]): An illustration of the Word Mover's Distance. All non-stop words (bold) of both documents are embedded into a word2vec space. The distance between the two documents is the minimum cumulative distance that all words in document 1 need to travel to exactly match document 2.
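As an aside, the analogy above can be reproduced with the published pre-trained vectors. A minimal sketch, assuming gensim is installed and the Google News embeddings mentioned in [13] have been downloaded locally; the file name and path are illustrative, not part of this thesis:

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News embeddings (the path is illustrative).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# vector("king") - vector("man") + vector("woman") is closest to vector("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```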

Besides its encouraging performance, the word2vec model is also efficient to train [32][34]. Google has published its pre-trained word2vec embeddings for millions of words using a corpus of up to hundreds of billions of words [13].

1.1.3 From word embedding to document similarity

Kusner et al. made a successful attempt at using the pre-trained Google word embeddings to estimate document similarity. They proposed the Word Mover's Distance (WMD) [25], which measures the similarity between two documents as the minimum cumulative distance that the embedded words of one document need to "travel" to reach the embedded words of the other document. An illustration of the WMD distance can be found in Figure 1.1.

WMD models a document as a weighted cloud of its embedded words, therefore requiring only the pre-trained word embeddings. Kusner et al. showed that WMD achieved an unprecedentedly low k-nearest neighbor document classification error rate compared with a set of state-of-the-art baseline models including LSI and LDA [25]. This result suggests that it is not only feasible but also effective to compare document similarity using only the pre-trained word embeddings.

1.1.4 Distance metrics for document similarity

Kusner et al. used the Earth Mover's Distance (EMD) to estimate the distance between two weighted clouds of words. The best average time complexity of solving EMD is O(n^3 log n), where n is the number of unique words in the respective documents [25]. However, there are other applicable distance metrics [44][19]. Sidorov et al. proposed the soft cosine similarity [44], with time complexity O(n^2), for measuring document similarity. The soft cosine similarity assumes that the similarity between word features is known, which holds true for word embeddings. Sidorov et al. showed that the soft cosine similarity is an effective distance metric in a document retrieval case study, where they used the Levenshtein distance to estimate word similarity [44].

1.2 Problem

The best average time complexity of WMD is O(n^3 log n) [25]. The computational overhead could become a significant issue considering that a document may contain hundreds to thousands of words. Another drawback of WMD is that the resulting document representation has poor interpretability compared to LDA. The components of the LDA vector representation refer to different topics and the entries to their weights, therefore allowing us to interpret a document as a mixture of different topics. However, WMD simply models a document as a cloud of all its words. It is hard to interpret the resulting document representation to gain any structured information about the content of the document.

1.2.1 A word clustering strategy

A natural strategy to improve the time complexity of WMD is to use fewer embedded vectors to represent a document. This is feasible considering that a document contains similar words that express the same contextual meaning, e.g. synonyms ("country", "nation", "state"), variants of the same root word ("denote", "denotation", "denoting"), and words that describe the same topic ("elections", "parliament", "politics"). If we could find such word clusters in a document, we could then represent each cluster using a single embedded vector that is most representative of its member words. Therefore, we could represent a document as a weighted cloud of k such vectors, where k is the number of word clusters. When k is significantly smaller than n, e.g. k = O(√n), the time complexity of WMD could be improved to O(n^1.5 log n), which is sub-quadratic. If we use the soft cosine similarity to estimate the similarity between two clouds of embedded words, the time complexity can be further improved to O(n). However, we note that it is not trivial to find such semantically coherent word clusters. Therefore the overall complexity depends on the algorithms for finding these word clusters.
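To make the strategy concrete, here is a minimal sketch of the word clustering step, assuming each unique word of a document already has a pre-trained embedding; the input format, the function name, and the choice of k-means are illustrative and do not prescribe the final model:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_document_words(word_vectors):
    """Group a document's unique words into roughly sqrt(n) clusters.

    `word_vectors` is a dict mapping each unique word of the document to its
    pre-trained embedding (a 1-D numpy array); this input format is illustrative.
    """
    words = list(word_vectors)
    X = np.stack([word_vectors[w] for w in words])
    k = max(1, int(np.sqrt(len(words))))          # k = O(sqrt(n)), as discussed above
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    clusters = {}
    for word, label in zip(words, labels):        # collect member words per cluster
        clusters.setdefault(label, []).append(word)
    return clusters                               # each cluster is a candidate topic
```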

Another benefit of this word clustering approach is topic discovery. We could examine the topic of each cluster by querying its member words. The normalized size of each cluster may also be a naive estimate of its contribution to the content of the document (we shall see in Chapter 4, however, that the normalized cluster size is not an appropriate estimate of this contribution). In this sense, we could produce an interpretable document representation.

1.3 Purpose

To the best of our knowledge, the aforementioned word clustering strategy for document representation and document similarity has not been studied in the literature. We consider it worthy of further investigation due to its potential benefits in terms of time complexity and interpretability. This leads to our main research question:

Can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate the document similarity?

The above research question is concerned with three aspects, namely 1) effectiveness, 2) efficiency and 3) interpretability.

To be more specific,

1. effectiveness means that the performance for document similarity is comparable with state-of-the-art models (the selected state-of-the-art baseline models are listed in section 3.5.2).

2. efficiency means that the time complexity of the overall model for document similarity should be O(n^2 log n), so that it is better than WMD by an order in the exponent.

3. interpretability means that the resulting document representation can be interpreted as a mixture of multiple topics, each weighted by its contribution to the content of the document.

Note that the term interpretability itself can be interpreted differently and may concern many different aspects, e.g. model interpretability. In this thesis, we focus on the effectiveness and efficiency aspects and only informally address the interpretability aspect by limiting it to our model's ability to discover topics.

The answer to this research question will advance the field of document representation and enable further research on document understanding from words, without the need for a corpus.

1.4 Goal

The goal of this thesis is in line with the purpose discussed in the previous section, which is centered around the raised research question. To fully answer the research question, we need to achieve the following goals:

1. Propose a document representation model that is based on the word clustering strategy, requiring only the pre-trained word embeddings but not a corpus.

2. Propose a similarity measure for the resulting document representation.

3. Analyze and evaluate the proposed model and similarity measure with respect to the three concerned aspects, namely efficiency, effectiveness, and interpretability.

In this work, we focus on the efficiency and effectiveness aspects, while only informally addressing the interpretability aspect. To fully address the interpretability aspect, we would need a more sophisticated definition of interpretability. The evaluation of interpretability is also challenging since there is no consensus on what interpretability implies. Instead, we informally address the interpretability aspect by analyzing the algorithmic steps of the proposed model in order to explain how we can interpret the resulting document representation as a mixture of topics. We also demonstrate its topic discovery ability on one sample document (the sample document used by Blei et al. to demonstrate the topic discovery ability of LDA [4]), compared with LDA.

To address the efficiency aspect, we analyze the average time complexity of the proposed model and document similarity measure. To address the effectiveness aspect, we evaluate the performance of the document similarity measure on multiple public data sets, compared with a set of state-of-the-art baseline models.

1.5 Benefits, ethics, and sustainability

The potential benefit of our work is significant for many document-related applications. Humans rely on documents to maintain our knowledge base about the world. Take academia for example: we continue to use papers to record every scientific breakthrough. Due to the ever-increasing volume of our knowledge base, we rely on smart algorithms to retrieve desired information. The effectiveness and efficiency of these algorithms directly impact the productivity of human society as a whole: the less time we spend on retrieving the desired information, the more time we have for productive activities.

We could also take the perspective of a machine to view the potential benefit. With our proposed method, machines are able to understand documents in the absence of an external corpus. This enables machines to undertake a larger set of document processing tasks where such a corpus is often not available. On the other hand, the knowledge of words learned by one machine can be transferred to many other machines. This makes the use of computational resources much more sustainable.

This work also indirectly addresses one central ethical issue of artificial intelligence (AI), namely that AIs become biased when trained on biased data sets [5]. In the context of document understanding, many models require either labeled training data or an external corpus. These models can easily be abused if the user chooses to supply biased training data or a biased corpus. However, our model requires only the pre-trained word embeddings. The AI community could agree on one central unbiased corpus with which the word embeddings are trained, e.g. the corpora used by Google [13], Facebook [11] and Stanford [21], thereby effectively minimizing the risk of biased usage of our model.

1.6 Methodology

To the best of our knowledge, there is no rigorous probabilistic explanation of the word2vec vector space [12][27]. The most comprehensive study of the statistical implications of the word2vec model that we found is from Levy et al. [27], who argued that the word2vec model implicitly performs a matrix decomposition of a word co-occurrence statistics matrix. However, there is still no formal probabilistic model proposed for the word2vec embeddings. Therefore a rigorous analytic approach is not applicable. Instead, the conclusions of this thesis are based on a scientific study using an empirical and experimental approach to achieve scientific validity.

The word clustering strategy discussed in section 1.2.1 is based on the assumption that semantically similar words form clusters in the word2vec vector space. To verify this assumption, we conduct an exploratory experiment on both synthetic and real-world documents. This exploratory experiment aims to generate empirical evidence that supports the word clustering assumption and informs the detailed design of the proposed model. We qualitatively evaluate the semantic coherence of the word clusters by intuitively examining their member words. We also quantitatively evaluate the word clusters using several cluster validation criteria.

Based on the results of this exploratory experiment, we propose 1) a document representation model that is able to discover the topics of a document, and 2) a similarity measure for the resulting document representation. We present the detailed algorithmic steps of 1) and 2) so that we can argue for their efficiency in terms of average time complexity, and for their interpretability by explaining how the topics of a document can be revealed from the resulting representation. We also qualitatively evaluate the discovered topics on several sample documents to demonstrate the interpretability of the resulting document representation in terms of topic discovery.

We quantitatively evaluate the effectiveness of the proposed model for document similarity in the context of the k-nearest neighbor document classification task, as proposed by Kusner et al. [25]. The k-nearest neighbor classification algorithm determines the label of an unseen document d as the majority label among its closest neighbors. If the document similarity measure is indeed effective, these closest neighbors will be similar to d, and therefore the classification error rate will be low. We compare the performance of our model with a set of state-of-the-art baseline models to argue its relative effectiveness. We use publicly available data sets so that the experimental results can be reproduced. Since topic discovery is an important feature of our model, we selected three public data sets that are suitable for topic discovery.

The details of the research methodology can be found in Chapter 3.

1.7 Delimitation

In this thesis, we only experiment with the pre-trained Google word embeddings; other pre-trained word embeddings are not considered. This means that the performance of our model may differ when other pre-trained word embeddings are used.

We only informally address the interpretability aspect of our model, since there is no consensus on what interpretability is, nor on how it should be evaluated.

1.8 Outline

This thesis is organized as follows. In Chapter 2, we introduce the relevant theory and related work. In Chapter 3, we present the details of the research methodology. In Chapter 4, we present and discuss the results of the first experiment. We then propose a document representation model and a document similarity measure in Chapter 5. In Chapter 6, we present the evaluation results of the second experiment. In Chapter 7, we discuss these evaluation results and conclude the thesis by directly answering the raised research question. We also propose several directions for future work in Chapter 7.

Chapter 2

Extended background

This chapter introduces the extended background of our research.

Document similarity is tightly connected with the problem of document representation, which is a central research topic in the area of natural language processing. Therefore, we first briefly introduce the area of natural language processing, with a focus on the problem of document representation. We then introduce the word2vec model with a focus on its training objective. After that, we introduce two document similarity measures that may be suitable for our model. Finally, we introduce clustering analysis, the task of grouping similar objects into clusters, and cover a number of clustering algorithms that may be suitable for discovering word clusters in the word2vec vector space.

2.1 Natural language processing

Natural language processing (NLP), a.k.a. computational linguistics, is the sub-field of computer science concerned with using computational techniques to understand and reproduce human language content [16]. The earliest research in NLP traces back to the 1950s, when the idea of automatic machine translation attracted research attention [18]. The early NLP systems were mostly based on complex sets of hand-written rules and were not able to deal with variation, ambiguity, and context-dependent interpretation of the language. Starting in the late 1980s, with increasing computational power, more and more researchers started applying a variety of statistical methods to automatically learn the abstract rules from a large corpus [16]. This direction of research is also known as statistical NLP (SNLP).

At the core of any NLP task is the important issue of natural language understanding [6]. The raw text needs to be transformed into an appropriate representation that can be processed by computer programs. In the following sections, we discuss the notable methods for document representation in chronological order.

2.2 Document representation

Researchers in the field of information retrieval (IR) have made significant progress on the problem of document representation [2]. The basic paradigm proposed by IR researchers is to learn a vector representation for the documents using an external corpus [4]. Starting from the simplest bag-of-words model, we gradually discuss more complex models for document representation.

2.2.1 Bag-of-words

The bag-of-words (BOW) model represents each document as a multi-set of words, disregarding grammar and ordering [43] but preserving the number of occurrences of each word. Take the following document for example,

Bob transferred eight bitcoins to Alice, and Alice transferred back five bitcoins.

its BOW multi-set representation is

{Bob, Alice², transferred², bitcoins², eight, five, to, and, back}

where the superscripts equal the multiplicity of the respective word. We can then build a BOW vector of real values where each dimension refers to a unique word w and the entry refers to the frequency of w in the document. The BOW vector of the above document is (1, 2, 2, 2, 1, 1, 1, 1, 1), assuming an ordered list of unique words that matches the components of the BOW vector. It is common to normalize the BOW vector by dividing each entry by the total frequency, here 12. The normalized BOW vector of the above sentence is then (1/12, 2/12, 2/12, 2/12, 1/12, 1/12, 1/12, 1/12, 1/12).
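A minimal sketch of building this normalized BOW representation in Python; the naive whitespace tokenization is only for illustration:

```python
from collections import Counter

doc = ("Bob transferred eight bitcoins to Alice, "
       "and Alice transferred back five bitcoins.")
tokens = [t.strip(",.") for t in doc.split()]   # naive tokenization

bow = Counter(tokens)                           # BOW multi-set with multiplicities
total = sum(bow.values())                       # 12 tokens in this example
normalized_bow = {w: count / total for w, count in bow.items()}

print(normalized_bow["transferred"])            # 2/12 ≈ 0.167
```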

The BOW model takes into account the word occurrences in a document, which to some extent reflects the content of the document. Intuitively, two documents with similar word occurrences should also be similar in content. However, in the BOW model all words are equally important. This is clearly not realistic considering that, in certain domains, words like "economics" and "monetary" should have much higher discriminating power than words like "therefore" and "although". A remedy to this problem is to weight each word by its term frequency-inverse document frequency, a method which we discuss in the next section.

2.2.2 Term frequency-inverse document frequency

The term frequency-inverse document frequency (TF-IDF) method [43] (1983) is a step forward from BOW. It assumes access to a corpus D consisting of a collection of documents. It modifies the BOW vector by offsetting the word frequency with the inverse document frequency, which measures the discriminating power of a word.

Term frequency

The term frequency tf(t, d) is defined as the frequency of term t in a document instance d. It is a measure of the contribution of a word to a document.

$$\mathrm{tf}(t, d) = \text{frequency of } t \text{ in } d \tag{2.1}$$

Inverse document frequency

The inverse document frequency is defined as

$$\mathrm{idf}(t, D) = \log \frac{|D| + 1}{|\{d \in D : t \in d\}| + 1} \tag{2.2}$$

The inverse document frequency is a measure of how much information a word provides. A high inverse document frequency indicates that a word is rare and therefore likely more discriminating. In contrast, a low inverse document frequency indicates that the word is common across the corpus and therefore carries little discriminating power.

TF-IDF

Combining the term frequency and inverse document frequency, we can produce a composite weight for each term in a document, defined as

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{2.3}$$

By this definition, the TF-IDF score of a term t in a document d is highest if t occurs frequently in d while occurring in only a few documents of D. In this way, discriminating words are considered to be more important.

Term-document matrix

Let R be a dictionary consisting of all the unique words in D. We can then build a term-document matrix X ∈ R^{|R|×|D|} where the j-th column is the TF-IDF vector representation of document D_j and the i-th row contains the TF-IDF scores of the i-th word across all |D| documents. In this way, a document d of arbitrary length can be represented as a fixed-length |R|-dimensional vector where the i-th dimension refers to the i-th word in dictionary R and the entry refers to the TF-IDF score of that word in d.
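A small sketch of computing the TF-IDF term-document matrix for a toy corpus, following equations (2.1)-(2.3); the toy documents are purely illustrative:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency, eq. (2.1)."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Inverse document frequency, eq. (2.2)."""
    n_containing = sum(term in doc for doc in corpus)
    return math.log((len(corpus) + 1) / (n_containing + 1))

def tfidf(term, doc_tokens, corpus):
    """Composite TF-IDF weight, eq. (2.3)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus: each document is a list of tokens.
corpus = [["economics", "monetary", "policy"],
          ["football", "match", "policy"],
          ["monetary", "union", "economics"]]
vocab = sorted({w for doc in corpus for w in doc})

# Term-document matrix X: row i = term vocab[i], column j = document j.
X = [[tfidf(term, doc, corpus) for doc in corpus] for term in vocab]
```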

Compared to the BOW vector, TF-IDF is better at identifying discriminating words by incorporating word-occurrence statistics from a corpus. However, 1) the TF-IDF vectors are rather sparse, since their dimension equals the vocabulary size; such a vector could easily reach hundreds of thousands of entries with a sufficiently large corpus. 2) TF-IDF also reveals little intra- and inter-document structure [4]: synonymous terms are regarded as totally different words even though they have the same meaning, while polysemous terms are regarded as words expressing the same meaning. 3) Besides, TF-IDF is often not suitable for measuring document similarity due to the frequent near-orthogonality of its vectors [25]. The near-orthogonality of TF-IDF vectors means that a given query TF-IDF vector may be far away from all other vectors in terms of cosine similarity, therefore producing poor document similarity scores.

As we discuss in the next section, the Latent Semantic Indexing model tries to address the aforementioned issues by constructing a latent vector space for the documents and words.

2.2.3 Latent Semantic Indexing

The Latent Semantic Indexing model (LSI) [7] (1990), sometimes referred to as Latent Semantic Analysis (LSA), finds a lower-rank approximation to the term-document matrix M ∈ R^{v×d} using Singular Value Decomposition (SVD). Without diving into the details of the linear algebra, LSI finds a lower-rank approximation C' that can be decomposed into three matrices,

$$C' = U \Sigma V \tag{2.4}$$

where U has dimensions (v × k), Σ has dimensions (k × k), and V has dimensions (k × d).

We can then construct a new matrix W as

$$W = U \Sigma \tag{2.5}$$

with dimensions (v × k), where each row is a word vector, and a new matrix D as

$$D = \Sigma V \tag{2.6}$$

with dimensions (k × d), where each column is a document vector.

The end result of SVD is that all documents and words are represented as vectors of reduced dimension k. In this way, LSI is able to achieve significant vector compression. Deerwester et al. also showed that the resulting latent vector space of documents reveals better inter-document structure and that the latent vector space of words captures some linguistic notions such as synonymy and polysemy [7].

The LSI model uses linear algebra techniques to achieve significant dimension reduction, with the hope that the resulting latent vectors can capture semantic similarity. Even though LSI has been shown to be successful, it still lacks a solid theoretical foundation, which hinders a more principled study of the model itself [17]. To address this issue, researchers have proposed probabilistic approaches to model the latent topics of a document, which have a sound statistical foundation [17][4]. We discuss one such notable generative probabilistic model in the next section.
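A minimal sketch of LSI via truncated SVD with numpy, assuming X is a term-document matrix (for example, the TF-IDF matrix built in the previous sketch); the function name and the choice of k are illustrative:

```python
import numpy as np

def lsi(X, k):
    """Rank-k LSI of a term-document matrix X (terms x documents)."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]   # keep top k singular values
    W = U_k @ S_k      # eq. (2.5): rows are k-dimensional word vectors
    D = S_k @ V_k      # eq. (2.6): columns are k-dimensional document vectors
    return W, D

W, D = lsi(X, k=2)     # X from the TF-IDF sketch above; k = 2 is illustrative
```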

2.2.4 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) [4] (2003) takes a probabilistic approach to modeling documents. The words are probabilistically grouped into a set of topics T of size k, and a document is modeled as a distribution over this underlying set of topics. Similarly to LSI and TF-IDF, LDA assumes access to a corpus D consisting of a collection of documents and a dictionary R consisting of all unique words in D. LDA also assumes that the document is in the BOW multi-set representation [4], therefore disregarding the grammar and word order in a document. In the language of probability theory, this is the assumption of exchangeability for the words in a document [1].

Before introducing the details of the generative process for a document, we first define several probability distributions.

Poisson distribution:

The Poisson distribution is a discrete probability distribution parameterized by a positive value λ. Its probability mass function is given by

$$f(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!} \tag{2.7}$$

Dirichlet distribution:

The Dirichlet distribution Dir(α) is a family of continuous multivariate probability distributions parameterized by a vector α = (α_1, ..., α_k) of positive values. The order of a Dirichlet distribution is |α| = k. The probability density function of Dir(α) of order k is given by

$$f(x_1, \ldots, x_k \mid \alpha_1, \ldots, \alpha_k) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \tag{2.8}$$

where

$$B(\alpha) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)} \tag{2.9}$$

and

$$\sum_{i=1}^{k} x_i = 1 \;\wedge\; x_i \geq 0, \quad \forall i \in [1, k] \tag{2.10}$$

Multinomial distribution:

The Multinomial distribution is a generalization of the binomial distribution, parameterized by a vector θ = (θ_1, ..., θ_k). The probability mass function of Multinomial(θ) of order k is given by

$$f(x_1, \ldots, x_k \mid \theta_1, \ldots, \theta_k) = \frac{\Gamma\!\left(\sum_{i=1}^{k} x_i + 1\right)}{\prod_{i=1}^{k} \Gamma(x_i + 1)} \prod_{i=1}^{k} \theta_i^{x_i} \tag{2.11}$$

The overview of the generative process of LDA is as follows: we first choose the length N of a document d, then choose a topic distribution for d. We then iteratively generate a word by first choosing a topic according to the topic distribution, and then choosing a word instance according to the word distribution of the chosen topic.

Now we describe the probabilistic details of each of these steps.

The LDA model assumes that the length of a document follows Poisson(λ). For a document d, we draw a random variable N from Poisson(λ) as the length of the document. Note that the Poisson assumption is not critical to the following generative process, and more realistic length distributions can be used [4].

LDA uses the Dirichlet distribution as a prior to assign a topic distribution to a document d by drawing a sample θ from Dir(α), where θ_i is the probability of mentioning topic T_i in d. Note that LDA assumes the number of topics k is known and fixed. The choice of the Dirichlet distribution as prior is based on a number of convenient mathematical properties, including the fact that it is conjugate to the Multinomial distribution [4]. These properties facilitate the parameter estimation algorithms used for training the LDA model.

For each word w_i in d, LDA uses Multinomial(θ) to draw a random topic z_i from the topic distribution θ. We denote the resulting topics of all words as z = (z_1, ..., z_N).

We have provided a probabilistic model to select the topic distribution of a document and thereafter the topic of each individual word in the document. However, we have not yet generated any word instance for document d. To this end, we define a matrix β ∈ R^{k×|R|}, where β_{i,j} = p(R_j = 1 | T_i). Given a topic T_i, we can then sample a word w from Multinomial(η), where η is the word distribution for topic T_i, determined by the i-th row of β.

In summary, LDA assumes the following generative process for any document d in corpus D.

1. Choose N ∼ Poisson(λ) to be the number of words in d.

2. Choose θ ∼ Dir(α) to be the topic distribution of d.

3. For each of the N words w_n:

a) Choose a topic z_n ∼ Multinomial(θ).

b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
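A minimal numpy sketch of this generative process; the Dirichlet parameters alpha, the row-stochastic topic-word matrix beta, the vocabulary list, and the Poisson rate are illustrative inputs, not values prescribed by LDA:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def generate_document(alpha, beta, vocabulary, lam=100):
    """Sample one document following steps 1-3 above.

    alpha: Dirichlet parameters of length k; beta: (k x |R|) matrix whose rows
    are word distributions per topic; vocabulary: list of |R| words.
    """
    k = len(alpha)
    N = rng.poisson(lam)                               # 1. N ~ Poisson(lambda)
    theta = rng.dirichlet(alpha)                       # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)                     # 3a. topic z_n ~ Multinomial(theta)
        w = rng.choice(len(vocabulary), p=beta[z])     # 3b. word ~ Multinomial(beta[z])
        words.append(vocabulary[w])
    return words
```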

LDA uses a variational Bayes approximation of the posterior distribution to infer the hidden variables θ and z for each document [4]. Blei et al. also introduced an EM algorithm for empirical Bayes parameter estimation of α and β [4]. The variational inference and EM algorithms are not relevant to our research, therefore we do not introduce them in this thesis. A detailed description of these methods can be found in [4].

The resulting LDA representation for a document d is a k-dimensional vector where the i-th entry refers to the probability of d mentioning topic T_i. We can then interpret a document as expressing content about k topics, with a different weight assigned to each topic. Besides, we can examine the content of each topic by querying its top words using the multinomial distribution matrix β. Blei et al. demonstrated on a sample document that LDA learns topic groups that contain mostly semantically coherent words [4].

Compared to LSI, LDA has a solid statistical foundation [4]. It also provides better interpretability and modularity: LDA can be readily extended to continuous data or other non-multinomial data by modifying its probabilistic modules [4]. Similarly to LSI, we can view LDA as a dimension reduction technique which compresses the length of the document vector to k.

2.2.5 Common document similarity measures

We have discussed various models for transforming documents into vectors in some latent vector space. This way of representing documents as fixed-length vectors is known as the vector space model. In the vector space, each dimension refers to a unique feature. In the case of the BOW and TF-IDF models, the dimensions refer to the unique words, and the entries of a vector capture the relative importance of these words in a document. For LSI and LDA, the dimensions represent latent concepts.

Using the vector space model, we can estimate the similarity between two documents as the distance between their vector representations. The two most common distance measures are the cosine similarity and the Euclidean distance [43].

Euclidean distance

Given two vectors v and w of dimension n, their Euclidean distance is defined as

$$\mathrm{Euclidean}(v, w) = \sqrt{\sum_{i=1}^{n} (v_i - w_i)^2} \tag{2.12}$$

Cosine similarity

Given two vectors v and w, the cosine similarity is defined as

$$\mathrm{cosine}(v, w) = \frac{v \cdot w}{\lVert v \rVert\, \lVert w \rVert} \tag{2.13}$$

where $v \cdot w$ denotes the dot product between v and w, i.e. $\sum_{i=1}^{n} v_i w_i$, and $\lVert v \rVert$ denotes the Euclidean length, i.e. $\sqrt{\sum_{i=1}^{n} v_i^2}$.

The cosine similarity measures the difference in direction between two vectors. From this view, cosine similarity is preferred over the Euclidean distance in situations where we are only concerned with what concept a document is about, but not how strongly it is oriented towards that concept [43].
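Both measures are one-liners with numpy; a sketch implementing equations (2.12) and (2.13):

```python
import numpy as np

def euclidean(v, w):
    """Euclidean distance, eq. (2.12)."""
    return np.sqrt(np.sum((np.asarray(v) - np.asarray(w)) ** 2))

def cosine(v, w):
    """Cosine similarity, eq. (2.13)."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
```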

2.2.6 Summary

We have discussed a wide range of notable document representation models. Except for the simplest BOW model, they all share the same paradigm, which aims to build a fixed-length vector representation for a document using a corpus. Compared to BOW and TF-IDF, more advanced models such as LSI and LDA achieve significant dimension reduction and are able to capture the semantic similarity between individual words. In these models, words are also represented as fixed-length vectors in some latent vector space, enabling us to calculate the semantic similarity between words.

Besides the fixed-length paradigm, these models all share the BOW assumption, which disregards word order. However, word order may provide valuable contextual information that is useful for estimating the semantic similarity between words. Consider two intuitive situations: 1) if two words frequently occur in neighboring positions in a document, we may conjecture that these two words are often used to express the same contextual meaning; 2) if two words frequently occur at the same position in a sentence, we may conjecture that these two words are substitutable (synonyms). These conjectures are formally formulated in the distributional hypothesis of Harris [15], which states that words appearing in similar contexts have similar meanings.

In the next section, we exploit this line of thought and discuss several models that are specifically designed for building dense vector representations of words, also known as word embeddings.

2.3 Word embedding

A word embedding is a dense vector representation of a word [27]. We call a model that transforms an input word into a fixed-length dense vector a word embedding model. In section 2.2.3, we showed that LSI can map a word into a latent vector space where the semantic similarity between words is preserved. We may regard LSI as a successful word embedding model.

Many modern word embedding models use training methods inspired by neural-network language modeling [27]. Such models were shown to perform significantly better than LSI for preserving linear regularities among words [35]. One of the earliest neural word embedding models was proposed by Bengio et al. [3] (2003), where a feedforward neural network is used to learn word vector representations from the linguistic context in which a word occurs. Most of the newest word embedding methods extend the work of Bengio and rely on a neural network architecture [32]. One notable model was recently proposed by Mikolov et al. [32] (2013) that is both efficient to train and provides state-of-the-art results on various linguistic tasks [27]. This model is also known as word2vec. In the next section, we discuss the word2vec model in detail.

2.3.1 Word2vec

Like many other word embedding models, the word2vec model is based on the distributional hypothesis in linguistics, which states that semantically similar words have similar distributions in large samples of language data [15]. In other words, words that occur in similar contexts have similar meanings.

In the word2vec model, the context c of a word w is defined as the list of its neighboring words in a document, fixed by a window size k. For instance, if k = 5, the context c includes the five previous and five following words of w. The vector representations of words are then learned by maximizing the conditional probability p(c|w) or p(w|c) over all (w, c) pairs.
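A sketch of extracting (word, context) pairs with a window of size k; the tokenized input is illustrative:

```python
def context_pairs(tokens, k=5):
    """Yield (word, context) pairs where the context consists of the
    k previous and k following words, as described above."""
    for i, w in enumerate(tokens):
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        yield w, context

pairs = list(context_pairs("the cat sat on the mat".split(), k=2))
```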

Mikolov et al. proposed two model architectures, namely the continuous bag-of-words model (CBOW), which maximizes p(w|c), and the continuous skip-gram model (skip-gram), which maximizes p(c|w) [32]. Mikolov et al. also proposed an efficient training method called negative sampling for the skip-gram model (SGNS) [33].

Mikolov et al. argued that SGNS improves both the training speed and the quality of the word embeddings [33]. Since Google used the SGNS model to train its word embeddings, we focus on discussing the SGNS model.

Skip-gram

Given a corpus, we denote W to be the set of words and C to be the set of contexts in the corpus. We denote C(w) to be all the contexts where w appeared.

Now consider a word-context pair (w, c), where w ∈ W and c ∈ C. Let P(c|w) be the probability that (w, c) is observed in the corpus given word w. This conditional probability is modeled as

$$P(c \mid w) = \frac{e^{\vec{w} \cdot \vec{c}}}{\sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i}} \tag{2.14}$$

where $\vec{w}$ and $\vec{c}$ are the vector representations of word w and context c to be learned. Note that this is essentially a multinomial logistic model that assumes a linear dependency between $\vec{w}$ and $\vec{c}$.

The objective of the skip-gram model is to find parameters $\vec{w}$ and $\vec{c}$ that maximize the average log probability. The objective function is therefore given by

$$\arg\max_{\vec{w}, \vec{c}} \; \frac{1}{N} \sum_{w \in W} \sum_{c \in C(w)} \log P(c \mid w) = \sum_{w \in W} \sum_{c \in C(w)} \Big( \log e^{\vec{w} \cdot \vec{c}} - \log \sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i} \Big) \tag{2.15}$$

Equation 2.15 makes an important assumption: by optimizing this objective function, similar words would eventually have similar word vectors. As noted by Goldberg et al. [12], this assumption has no sound statistical foundation and it is unclear why it holds.

The objective function shown in equation 2.15 is expensive to compute because of the term $\sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i}$, which iterates over all contexts c ∈ C. This becomes extremely costly when the corpus contains billions of sentences. To address this issue, Mikolov et al. proposed the negative sampling method, which optimizes a slightly different objective function.

Negative sampling

We again consider a word-context pair (w, c). We denote #(w, c) to be the number of times the pair is observed in the corpus, and #c the number of occurrences of c. Let P(D = 1 | w, c) be the probability that (w, c) is observed in the corpus. This conditional probability is modeled as

$$P(D = 1 \mid w, c) = \sigma(\vec{w}, \vec{c}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{c}}} \tag{2.16}$$

This is essentially a binomial logistic model. The negative sampling method aims to maximize $\sigma(\vec{w}, \vec{c})$ for observed word-context pairs while minimizing it for k randomly sampled unobserved pairs. These unobserved pairs are regarded as negative samples, hence the name "negative sampling". Note that an assumption made here is that if a pair (w, c) is not observed in the corpus, then it is indeed a negative sample. Furthermore, the contexts c of the negative samples are drawn according to the empirical noise distribution P_noise(c) = #c / |Q|, where Q is the set of observed (w, c) pairs.

Therefore the objective function is given by

$$\ell = \sum_{w \in W} \sum_{c \in C(w)} \Big[ \#(w, c) \log \sigma(\vec{w}, \vec{c}) + k \cdot \mathbb{E}_{c_i \sim P_{\mathrm{noise}}(c_i)} \big[ \log \sigma(-\vec{w}, \vec{c}_i) \big] \Big] \tag{2.17}$$

By optimizing equation 2.17, we can expect words appearing in similar contexts to have similar embeddings. However, a formal proof is not available [27]. Nevertheless, empirical results have shown that the cosine similarity between two word embeddings is an effective measure of semantic similarity [32][33].
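A sketch of the per-pair contribution to the negative-sampling objective in equation (2.17), written with numpy; the word and context vectors are assumed to be given (e.g. current parameter estimates), and the function is illustrative rather than a full training loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(w_vec, c_vec, negative_c_vecs):
    """Contribution of one observed (w, c) pair: log sigma(w.c) plus
    log sigma(-w.c_i) for each of the k sampled negative contexts,
    negated so it can be minimized by gradient descent."""
    positive = np.log(sigmoid(np.dot(w_vec, c_vec)))
    negative = sum(np.log(sigmoid(-np.dot(w_vec, c_i))) for c_i in negative_c_vecs)
    return -(positive + negative)
```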

2.4 Soft cosine similarity

In the previous section, we saw that the semantic similarity between words can be measured as the cosine similarity between their word2vec embeddings. In this section, we discuss the soft cosine similarity which could utilize this property for measuring document similarity.

Recall that in the vector space of BOW and TF-IDF, the dimensions refer to unique word features. This becomes problematic when two documents share no common words even though the words are semantically similar. Their BOW and TF-IDF vectors would be orthogonal and their cosine similarity is zero.

To address this problem, Sidorov et al. proposed the soft cosine similarity [44]. Given two vectors a and b of length n, the soft cosine similarity between a and b is defined as

$$\mathrm{soft\_cosine}(a, b) = \frac{\sum_{i,j=1}^{n} s_{i,j}\, a_i b_j}{\sqrt{\sum_{i,j=1}^{n} s_{i,j}\, a_i a_j}\; \sqrt{\sum_{i,j=1}^{n} s_{i,j}\, b_i b_j}} \tag{2.18}$$

where $s_{i,j}$ is the similarity between components i and j.

This formula generalizes the cosine similarity by taking the basis vectors into account. Let us denote the basis vectors by $e_1, e_2, \ldots, e_n$; we can then rewrite the vector $a = (a_1, \ldots, a_n)$ as

$$a = \sum_{i=1}^{n} a_i e_i \tag{2.19}$$

Since the dot product is bilinear and $|e_i| = 1$ in the BOW and TF-IDF vector space, we have

$$\begin{aligned}
a \cdot b &= \Big(\sum_{i=1}^{n} a_i e_i\Big) \cdot \Big(\sum_{j=1}^{n} b_j e_j\Big) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j\, (e_i \cdot e_j) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \cdot \mathrm{cosine}(e_i, e_j) \cdot |e_i| \cdot |e_j| \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} s_{i,j}\, a_i b_j
\end{aligned} \tag{2.20}$$

which leads to the numerator in formula 2.18. Similarly, we can derive $a \cdot a$ and $b \cdot b$, which lead to the denominator in formula 2.18.

The soft cosine similarity allows us to compute the similarity between two BOW vectors while taking word similarity into account. Sidorov et al. used the Levenshtein distance between two words as $s_{i,j}$. Instead, we can use the cosine similarity between the word2vec embeddings of the two words.
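A direct numpy implementation of equation (2.18); the feature-similarity matrix S is assumed to be precomputed, for example from cosine similarities between word embeddings:

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine similarity between feature vectors a and b, eq. (2.18).

    S[i, j] holds the similarity between features i and j."""
    a, b, S = np.asarray(a, float), np.asarray(b, float), np.asarray(S, float)
    numerator = a @ S @ b
    denominator = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return numerator / denominator
```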

2.5 Word Mover’s Distance

In the previous section, we showed that the soft cosine similarity between two BOW vectors can naturally take into account the word similarity. Here we discuss another method that is successful in measuring document similarity using the words of the documents.

Word Mover’s Distance (WMD) is a distance function between documents, pro- posed by Kusner et al [25] (2015). WMD models a document as a weighted point cloud of its words in the word2vec space. The distance between document A and B is the minimum cumulative distance that the words from A need to travel to match exactly the point cloud of B [25]. The weight of each point is the normalized term frequency of the corresponding word.

Let $d^1, d^2$ be the normalized BOW vectors of two documents of dimension n, where component i refers to word $w_i$. We denote c(i, j) to be the distance required for word $w_i$ to move to $w_j$. WMD takes c(i, j) to be the Euclidean distance between the word embeddings of $w_i$ and $w_j$, provided by a pre-trained word embedding model. Let us define a matrix $T \in \mathbb{R}^{n \times n}$ where $T_{i,j}$ is the "mass" of $w_i$ in $d^1$ that travels to $w_j$ in $d^2$. The Word Mover's Distance is estimated by solving the following linear program.

$$\begin{aligned}
\text{minimize} \quad & \sum_{i,j=1}^{n} T_{i,j}\, c(i, j) \\
\text{subject to:} \quad & \sum_{j=1}^{n} T_{i,j} = d^1_i, \quad \forall i \in \{1, \ldots, n\} \\
& \sum_{i=1}^{n} T_{i,j} = d^2_j, \quad \forall j \in \{1, \ldots, n\}
\end{aligned} \tag{2.21}$$

The first constraint states that the outgoing "mass" from word $w_i$ in $d^1$ equals its weight $d^1_i$. The second constraint states that the incoming "mass" to word $w_j$ in $d^2$ equals its weight $d^2_j$. These two constraints together guarantee that all the "mass" of the words in $d^1$ is moved to the words in $d^2$, while each word in $d^2$ receives exactly its own weight.

This optimization problem can be cast as computing the Earth Mover's Distance (EMD) [42]. We do not discuss the details of EMD further since they are not relevant to our research. However, we note that the EMD problem can be solved efficiently, with a best average time complexity of O(n^3 log n).
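A sketch of solving the linear program (2.21) with scipy; this is a plain LP formulation for illustration, not the specialized EMD solver mentioned above:

```python
import numpy as np
from scipy.optimize import linprog

def wmd(d1, d2, C):
    """Word Mover's Distance between two normalized BOW vectors d1 and d2
    by solving the linear program (2.21). C[i, j] is the Euclidean distance
    between the embeddings of word i and word j."""
    n = len(d1)
    cost = np.asarray(C, float).reshape(-1)          # flatten T into n*n variables
    A_eq, b_eq = [], []
    for i in range(n):                               # outgoing mass from word i = d1[i]
        row = np.zeros((n, n))
        row[i, :] = 1
        A_eq.append(row.reshape(-1))
        b_eq.append(d1[i])
    for j in range(n):                               # incoming mass to word j = d2[j]
        col = np.zeros((n, n))
        col[:, j] = 1
        A_eq.append(col.reshape(-1))
        b_eq.append(d2[j])
    result = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    return result.fun                                # minimum cumulative travel cost
```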

2.5.1 Word centroid distance

The average word embedding (AWE) representation of a document of n words is defined as

$$\mathrm{AWE} = \frac{1}{n} \sum_{i=1}^{n} \vec{w}_i \tag{2.22}$$

which is simply the unweighted average of all the word embeddings of a document. Kusner et al. defined the word centroid distance (WCD) between two documents as the distance between the corresponding AWE vectors [25]. It is shown that WCD is a lower bound of WMD and has performance comparable with WMD in a k-nearest neighbor document classification task [25].
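A sketch of equation (2.22) and the word centroid distance, assuming each document is given as a list of its word embeddings:

```python
import numpy as np

def awe(word_vectors):
    """Average word embedding of one document, eq. (2.22)."""
    return np.mean(np.stack(word_vectors), axis=0)

def wcd(doc1_vectors, doc2_vectors):
    """Word centroid distance: Euclidean distance between the AWE vectors."""
    return np.linalg.norm(awe(doc1_vectors) - awe(doc2_vectors))
```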

2.5.2 k-nearest neighbor classification

Kusner et al. argued for the effectiveness of WMD in comparison with several baseline models in a k-nearest neighbor document classification task. In this task, a collection of labeled documents is divided into a training set D and a test set S. A k-nearest neighbor (k-NN) classifier is then trained on D and predicts the labels of the documents in S. The performance is evaluated in terms of error rate.

The k-NN classifier works by retrieving the top k most similar documents to an unseen document d and assigning it the majority label among the retrieved documents. Thus the performance of the k-NN classifier depends on the effectiveness of the document similarity measure, and the k-NN classifier serves as a surrogate for evaluating the effectiveness of the models.

Kusner et al. selected a set of state-of-the-art baseline document representation models including LSI and LDA. For all the baseline models, the distance metric for the k-NN classifier is the Euclidean distance. It is shown that WMD outperforms all the baseline models [25]. This result shows that we can effectively estimate document similarity using only the pre-trained word embeddings.
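A minimal sketch of such a k-NN classifier with a pluggable document distance function (for example, the WMD implementation sketched above); the function is illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(test_doc, train_docs, train_labels, distance, k=5):
    """Predict the label of an unseen document as the majority label among
    its k nearest training documents under the given distance function."""
    dists = [distance(test_doc, doc) for doc in train_docs]
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```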

2.6 Clustering analysis

Clustering analysis is a widely used technique in machine learning and pattern recognition [36]. Clustering is an unsupervised task which groups a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups [8]. In section 2.3, we saw that the semantic similarity between words can be estimated using their word embeddings. This suggests that we could find semantically coherent word clusters using suitable clustering algorithms.

There are many clustering algorithms available [10], such as k-means [24], spectral clustering [30], Gaussian mixture models [39], DBSCAN [9], and so on. The purpose of this thesis is not to find the best clustering algorithm for word clustering. Therefore, we only introduce several notable clustering algorithms that may be suitable for finding good word clusters.

2.6.1 k-means clustering

Assuming a set of d-dimensional vectors $v = \{v_1, \ldots, v_n\}$ and a number of clusters k, k-means clustering aims to minimize the within-cluster sum of squares. Formally, the objective is to find

$$\arg\min_{C} \sum_{i=1}^{k} \sum_{v \in C_i} \lVert v - c_i \rVert^2 \tag{2.23}$$

where $c_i$ is the centroid of cluster $C_i$.

Lloyd's algorithm is commonly used to solve the above optimization problem [24]. Its steps are shown below.

1. Randomly generate k centers $c_1, \ldots, c_k$.

2. For each vector in v, find its closest center $c_i$ and assign it to cluster $C_i$.

3. Update each center $c_i$ to be the centroid of all vectors in cluster $C_i$.

4. Repeat steps 2) and 3) until the cluster assignments do not change.

By the definition of the objective function, the k-means algorithm measures the dissimilarity between objects as the Euclidean distance. The WMD distance also measures the semantic dissimilarity between words as the Euclidean distance between their word embeddings, and it achieved good performance in a k-nearest neighbor document classification task. Therefore, we expect k-means to find clusters containing semantically similar words. The time complexity of k-means is O(nkt), where n is the number of vectors, k the number of clusters, and t the number of iterations.
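A compact numpy sketch of Lloyd's algorithm following steps 1-4 above; initializing the centers from random data points is an illustrative choice:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm for k-means on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # 1. initial centers
    for _ in range(n_iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # 2. assign to closest center
        new_centers = np.array([X[labels == i].mean(axis=0)   # 3. recompute centroids
                                if np.any(labels == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):                 # 4. stop when assignments settle
            break
        centers = new_centers
    return labels, centers
```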
