SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018
Topic discovery and document similarity via pre-trained word embeddings
SIMIN CHEN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
Throughout history, humans have generated an ever-growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these vast collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity.
In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate these discovered topics and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different importance weights based on their relative discriminating power.
Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it allows for topic discovery in the document. On three labeled public data sets, our model achieved k-nearest neighbor classification accuracy comparable with five state-of-the-art baseline models. Furthermore, from these three data sets, we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis:
Can we construct an interpretable document represen- tation by clustering the words in a document, and effectively and efficiently estimate the document similarity?
Keywords
Document similarity; document representation; word embedding; natural language processing; topic modeling
Throughout history, humans have continued to create a growing volume of documents about a wide range of topics. We now rely on computer programs to automatically process these large collections of documents in various applications. Many applications require a quantitative measure of document similarity. Traditional methods first learn a vector representation for each document using a large corpus, and then compute the distance between two document vectors as the document similarity.
In contrast to this corpus-based approach, we propose a straightforward model that directly discovers the topics of a document by clustering its words, without the need for a corpus. We define a vector representation called normalized bag-of-topic-embeddings (nBTE) to encapsulate the discovered topics, and compute the soft cosine similarity between two nBTE vectors as the document similarity. In addition, we propose a logistic word importance function that assigns words different weights based on their relative discriminating power.
Our model is efficient in terms of average time complexity. The nBTE representation is also interpretable, as it enables topic discovery in the document. On three labeled public data sets, our model achieved nearest-neighbor classification accuracy comparable with five state-of-the-art models. Furthermore, from these three data sets we derived four multi-topic data sets where each label refers to a set of topics. Our model consistently outperforms the state-of-the-art baseline models by a large margin on these four challenging multi-topic data sets. Together, these results answer the research question of this thesis:
Can we construct an interpretable document representation by clustering the words in a document, and effectively and efficiently estimate the document similarity?
Acknowledgments
I would like to first express my gratitude to my supervisor Prof. Sarunas Girdzijauskas for his guidance and help throughout this research. It was his full support that enabled me to pursue my research interest and carry out the work of the chosen research topic. His insightful feedback was invaluable to the achievements of this research.
I would like to thank my examiner Prof. Henrik Boström for his detailed feedback on this thesis. His remarks not only made it possible to take the quality of this thesis to a higher level but also helped me become a better researcher.
Finally, I would like to thank my family for supporting me in everything I needed during my pursuit of the master's degree.
Contents

1 Introduction
  1.1 Background
    1.1.1 Document representation
    1.1.2 Word embeddings
    1.1.3 From word embedding to document similarity
    1.1.4 Distance metrics for document similarity
  1.2 Problem
    1.2.1 A word clustering strategy
  1.3 Purpose
  1.4 Goal
  1.5 Benefits, ethics, and sustainability
  1.6 Methodology
  1.7 Delimitation
  1.8 Outline
2 Extended background
  2.1 Natural language processing
  2.2 Document representation
    2.2.1 Bag-of-words
    2.2.2 Term-frequency inverse-document-frequency
    2.2.3 Latent Semantic Indexing
    2.2.4 Latent Dirichlet Allocation
    2.2.5 Common document similarity measures
    2.2.6 Summary
  2.3 Word embedding
    2.3.1 Word2vec
  2.4 Soft cosine similarity
  2.5 Word Mover's Distance
    2.5.1 Word centroid distance
    2.5.2 k-nearest neighbor classification
  2.6 Clustering analysis
    2.6.1 k-means clustering
    2.6.2 Spectral clustering
    2.6.3 Internal cluster validation criteria
3 Methodology
  3.1 Research methods
  3.2 Data collection
  3.3 Data analysis
    3.3.1 Verifying the word clustering assumption
    3.3.2 Evaluating the effectiveness
    3.3.3 Evaluating the efficiency and interpretability with the algorithms
    3.3.4 Evaluating the interpretability with an experiment
  3.4 Model design
  3.5 Experimental design
    3.5.1 Experiment 1: verifying the assumption
    3.5.2 Experiment 2: effectiveness of the model
    3.5.3 Experiment 3: interpretability of the document representation
4 Verifying the word clustering assumption
  4.1 A simple synthetic document
  4.2 A complex synthetic document
  4.3 A short real-world document
  4.4 A long real-world document
  4.5 Discussion
5 The proposed model
  5.1 Logistic word importance
    5.1.1 Word cluster importance
  5.2 Topic modeling with word embeddings
    5.2.1 Notations
    5.2.2 Building the nBTE representation
    5.2.3 A view of data compression
  5.3 Soft cosine document similarity
6 Performance evaluation
  6.1 Efficiency
  6.2 Effectiveness
    6.2.1 Overall results
    6.2.2 Comparing the baseline models with our model
    6.2.3 Comparing WMD with the soft cosine similarity
  6.3 Interpretability
    6.3.1 Comparing with LDA
    6.3.2 More sample documents
7 Conclusion
  7.1 Discussion
    7.1.1 Word clusters in the word2vec vector space
    7.1.2 Effective similarity measure
    7.1.3 Efficient similarity estimation
    7.1.4 Interpretable document representation
  7.2 Drawbacks and future work
Bibliography
Appendices
A Sample documents
  A.1 Sample document from BBCNews data set
  A.2 Sample document from BBCSport data set
  A.3 Sample document from news20group data set
B Used software
Chapter 1
Introduction
Written language enables humans to store complex information and communicate with each other across time and space. The ancient Chinese threw tortoise shells into fire and read the crack signs on the shells in order to tell fortunes [29]. We could regard these tortoise shells as the earliest form of Chinese documents, which talk about fortune. Throughout history, humans have continued to generate an ever-growing volume of documents, not only about fortune, but also about our joys and sufferings, every destroyed and flourishing civilization, every magnificent technology, thousands of splendid ideologies and religions, our theories about the universe, competing economic doctrines, and more.
With the advancement of information technology, these documents are digitized and maintained in large data centers and on personal devices at an unprecedented speed. Due to the fast-increasing volume of documents, manual handling of various document-related tasks is no longer feasible. We now rely on computer programs to automatically process these vast collections of documents in various applications.
Many of these applications involve comparing the similarity between documents. A common information retrieval (IR) task is, given a specific document, to query a ranked list of the documents most similar to it. This requires a quantitative measure of document similarity. Such a measure is also useful for other tasks like document classification and clustering, where a natural strategy is to assign the same label to similar documents [2].
Due to its importance in real-life applications, document similarity is a central research theme in the fields of information retrieval and natural language processing [6]. Researchers in these fields have proposed various models for document representation and document similarity, which have been successfully applied to tasks like document retrieval and document classification. These models typically first learn a vector representation for each document using a large corpus of documents [31], and then measure the distance between two document vectors as the document similarity.
Recently, significant progress has been made on word representation models [32][38]. These models are able to learn high-quality vector representations for millions of words using a corpus of hundreds of billions of words [13]. These word vectors have been shown to accurately capture the similarity between words and to be associated with abstract concepts [32][33]. Researchers have also made these learned word vectors publicly accessible [13][21][11]. This sheds new light on a bottom-up approach to document similarity: we could directly infer the topics of documents and compare their similarity using the publicly available word vectors, without the need for an external corpus. This is intuitive since a document is constructed from words. If we understand the individual words well, we should also be able to understand a document without referring to other documents.
In this thesis, we further investigate the feasibility of this bottom-up approach. We aim to propose a new document representation model that is useful for topic discovery and document similarity, requiring only the publicly available word vectors.
1.1 Background
In order to quantitatively measure document similarity, we need 1) an appropriate representation of documents, and 2) a suitable distance metric. Both ingredients are important for an effective similarity measure. On one hand, the representation of documents should capture as much information as possible. On the other hand, the distance metric should be meaningful for the chosen representation.
1.1.1 Document representation
Traditional IR methods like BOW and TF-IDF, as discussed in sections 2.2.1 and 2.2.2, transform a document into a sparse vector representation whose components refer to unique words [43]. In the past decades, machine learning researchers have proposed more advanced models that learn dense vector representations of documents.
Two notable models are Latent Semantic Indexing (LSI) [7] and Latent Dirichlet Allocation (LDA) [4]. LSI achieves significant vector compression by eigendecomposing the word-document matrix. Deerwester et al. showed that the resulting latent vector space leads to improved performance in the document retrieval task [7]. In contrast, LDA takes a probabilistic approach to modeling documents [4]. In the LDA model, words are probabilistically grouped into topics, and a document is modeled as a distribution over the underlying set of topics.
All these models transform the documents into vectors using a corpus. We could then compute the cosine similarity between two document vectors as the document similarity [2].
1.1.2 Word embeddings
Figure 1.1: Figure taken from [25]. An illustration of the word mover's distance. All non-stop words (bold) of both documents are embedded into a word2vec space. The distance between the two documents is the minimum cumulative distance that all words in document 1 need to travel to exactly match document 2.

In recent years, significant progress has been made on the problem of word representation [32][38]. The celebrated word2vec model [32] can efficiently learn a dense vector representation of real values for each word. Such vectors are also called word embeddings. Mikolov et al. demonstrated that both syntactic and semantic relationships between words are preserved in the word2vec vector space [32]. Semantically similar words tend to have similar embeddings, and such similarity is quantified using the cosine similarity. A famous example that demonstrates the power of the word2vec embeddings is that vector("king") − vector("man") + vector("woman") results in a vector that is closest to vector("queen") in terms of cosine similarity [32]. This shows that the word2vec embeddings not only capture the semantic similarity between words but also associate each word with some abstract concept.
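The analogy arithmetic above can be sketched in a few lines. The 4-dimensional toy embeddings below are an illustrative assumption (the pre-trained Google vectors are 300-dimensional dense vectors learned from a corpus); only the vector arithmetic and the cosine ranking mirror the actual word2vec behavior:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d embeddings (an assumption for illustration); the dimensions
# loosely encode (royalty, maleness, femaleness, humanness).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 1.0]),
    "queen": np.array([0.9, 0.1, 0.8, 1.0]),
    "man":   np.array([0.1, 0.9, 0.1, 1.0]),
    "woman": np.array([0.1, 0.1, 0.9, 1.0]),
}

# vector("king") - vector("man") + vector("woman") ...
target = emb["king"] - emb["man"] + emb["woman"]
# ... is closest to vector("queen") in terms of cosine similarity.
best = max(emb, key=lambda w: cosine(target, emb[w]))
print(best)  # -> queen
```

With real pre-trained embeddings the same ranking is computed over millions of vocabulary entries rather than four toy entries.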
Besides its encouraging performance, the word2vec model is also efficient to train [32][34]. Google has published its pre-trained word2vec embeddings for millions of words, trained on a corpus of up to hundreds of billions of words [13].
1.1.3 From word embedding to document similarity
Kusner et al. made a successful attempt at using the pre-trained Google word embeddings to estimate document similarity. They proposed the Word Mover's Distance (WMD) [25], which measures the similarity between two documents as the minimum amount of distance that the embedded words of one document need to "travel" to reach the embedded words of the other document. An illustration of the WMD distance can be found in Figure 1.1.
WMD models a document as a weighted cloud of its embedded words, therefore requiring only the pre-trained word embeddings. Kusner et al. showed that WMD achieved an unprecedentedly low k-nearest neighbor document classification error rate compared to a set of state-of-the-art baseline models including LSI and LDA [25]. This result suggests that it is not only feasible but also effective to compare document similarity using only the pre-trained word embeddings.
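Exact WMD requires an optimal-transport solver, but Kusner et al. also describe a cheap relaxed lower bound in which each word simply moves all of its weight to its single nearest word in the other document. A minimal sketch of that relaxed variant (the 2-d "embeddings" and weights below are made-up illustrative values, not real word vectors):

```python
import numpy as np

def relaxed_wmd(X, Y, wx):
    """One-sided relaxed Word Mover's Distance (a lower bound on WMD,
    per Kusner et al.): every word of document 1 moves all of its
    weight to its nearest word in document 2.
    X: (n1, d) word embeddings of doc 1
    Y: (n2, d) word embeddings of doc 2
    wx: (n1,) normalized word weights of doc 1"""
    # Pairwise Euclidean distances between the two word clouds.
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Each word travels to its nearest counterpart.
    return float(np.sum(wx * dists.min(axis=1)))

# Tiny illustrative example with made-up 2-d "embeddings".
doc1 = np.array([[0.0, 0.0], [1.0, 0.0]])
doc2 = np.array([[0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])
print(relaxed_wmd(doc1, doc2, weights))  # each word travels distance 1 -> 1.0
```

Because the relaxation drops the constraint that document 2's words must be fully "covered", it runs in O(n^2) rather than O(n^3 log n), at the cost of underestimating the true distance.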
1.1.4 Distance metrics for document similarity
Kusner et al. used the Earth Mover's Distance (EMD) to estimate the distance between two weighted clouds of words. The best average time complexity of solving EMD is O(n^3 log n), where n is the number of unique words in the respective documents [25]. However, there are other applicable distance metrics [44][19]. Sidorov et al. proposed the soft cosine similarity [44], of time complexity O(n^2), for measuring document similarity. The soft cosine similarity assumes that the similarity between word features is known, which holds true for word embeddings.
Sidorov et al. showed that the soft cosine similarity is an effective distance metric in a document retrieval case study, where they used the Levenshtein distance to estimate word similarity [44].
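The soft cosine similarity generalizes the ordinary cosine similarity by coupling different dimensions through a word-similarity matrix S. A minimal sketch (the vocabulary and the 0.7 similarity between "play" and "game" are illustrative assumptions; in practice S would come from word-embedding or Levenshtein similarities):

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine similarity (Sidorov et al.): like cosine similarity,
    but the matrix S couples related features.
    a, b: feature vectors (e.g. word counts)
    S: (n, n) matrix with S[i][j] = similarity of feature i and j."""
    def soft_dot(u, v):
        return float(u @ S @ v)  # the O(n^2) generalized inner product
    return soft_dot(a, b) / np.sqrt(soft_dot(a, a) * soft_dot(b, b))

# Two one-word documents over the vocabulary ["play", "game", "weather"];
# "play" and "game" are assumed 70% similar (illustrative numbers).
S = np.array([[1.0, 0.7, 0.0],
              [0.7, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([1.0, 0.0, 0.0])  # document containing "play"
b = np.array([0.0, 1.0, 0.0])  # document containing "game"
print(soft_cosine(a, b, S))    # ~0.7, where plain cosine would give 0.0
```

With S equal to the identity matrix the formula reduces exactly to the ordinary cosine similarity, which is why the two orthogonal documents above would then score 0.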
1.2 Problem
The best average time complexity of WMD is O(n^3 log n) [25]. The computational overhead could become a significant issue considering that a document may contain hundreds to thousands of words. Another drawback of WMD is that the resulting document representation has poor interpretability compared to LDA. The components of the LDA vector representation refer to different topics and the entries give their weights, therefore allowing us to interpret it as a mixture of different topics. WMD, however, simply models a document as a cloud of all its words. It is hard to interpret the resulting document representation to gain any structured information about the content of the document.
1.2.1 A word clustering strategy
A natural strategy to improve the time complexity of WMD is to use fewer embedded vectors to represent a document. This is feasible considering that a document contains similar words that express the same contextual meaning, e.g. synonyms ("country", "nation", "state"), variants of the same root word ("denote", "denotation", "denoting"), and words that describe the same topic ("elections", "parliament", "politics"). If we could find such word clusters in a document, we could then represent each cluster using a single embedded vector that is most representative of the member words. Therefore, we could represent a document as a weighted cloud of k such vectors, where k is the number of word clusters. When k is significantly smaller than n, e.g. k = O(√(2n)), the time complexity of WMD could be improved to O(n^1.5 log n), which is sub-quadratic. If we use the soft cosine similarity to estimate the similarity between two clouds of embedded words, the time complexity can be further improved to O(n). However, we note that it is not trivial to find such semantically coherent word clusters. Therefore the overall complexity depends on the algorithms for finding these word clusters.
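The clustering step of this strategy can be sketched with a plain k-means over the word embeddings. This is a minimal sketch under stated assumptions: the 2-d toy vectors stand in for 300-dimensional word2vec embeddings, k is fixed rather than chosen by a validation criterion, and a real implementation would use a tuned clustering algorithm (see section 2.6):

```python
import numpy as np

def cluster_words(vectors, k, iters=20, seed=0):
    """Minimal k-means over word embeddings: a sketch of grouping a
    document's words into k semantically coherent clusters.
    Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k randomly chosen word vectors.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each word vector to its nearest centroid.
        d = np.linalg.norm(vectors[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels

# Toy 2-d "embeddings": two obvious topical groups of three words each.
vecs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # group A
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])  # group B
cents, labels = cluster_words(vecs, k=2)
weights = np.bincount(labels) / len(labels)  # naive cluster weights
print(labels, weights)
```

Each resulting centroid would serve as the single representative vector of its cluster, and the normalized cluster sizes give the naive weights of the resulting weighted cloud of k vectors.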
Another benefit of this word clustering approach is topic discovery. We could examine the topic of each cluster by querying its member words. The normalized size of each cluster may also be a naive estimate of its contribution to the content of the document¹. In this sense, we could produce an interpretable document representation.
1.3 Purpose
To the best of our knowledge, the aforementioned word clustering strategy for document representation and document similarity has not been studied in the literature. We consider it worthy of further investigation due to its potential benefits in terms of time complexity and interpretability. This leads to our main research question:
Can we construct an interpretable document representation by clus- tering the words in a document, and effectively and efficiently esti- mate the document similarity?
The above research question is concerned with three aspects, namely 1) effectiveness, 2) efficiency and 3) interpretability.
To be more specific,
1. effectiveness means that the performance for document similarity is comparable with the state-of-the-art models².
2. efficiency means that the time complexity of the overall model for document similarity should be O(n^2 log n), a factor of n better than WMD.
3. interpretability means that the resulting document representation can be interpreted as a mixture of multiple topics, each weighted by its contribution to the content of the document.
Note that the term interpretability can itself be interpreted differently and may concern many different aspects, e.g. model interpretability. In this thesis, we focus on the effectiveness and efficiency aspects and only informally address the interpretability aspect by limiting it to our model's ability to discover topics.
The answer to this research question will advance the field of document representation and enable further research on document understanding from words, without the need for a corpus.
1.4 Goal
The goal of this thesis is in line with the purpose discussed in the previous section, which is centered around the raised research question. To be able to fully answer the research question, we need to achieve the following goals:
¹ We shall see in Chapter 4 that the normalized size of a cluster is not an appropriate estimate of its contribution.
² The selected state-of-the-art baseline models are listed in section 3.5.2.
1. propose a document representation model that is based on the word clustering strategy, requiring only the pre-trained word embeddings but not a corpus.
2. propose a similarity measure for the resulting document representation.
3. analyze and evaluate the proposed model and similarity measure with respect to the three concerned aspects, namely efficiency, effectiveness, and interpretability.
In this work, we focus on the efficiency and effectiveness aspects, and only informally address the interpretability aspect. Fully addressing the interpretability aspect would require a more sophisticated definition of interpretability. The evaluation of interpretability is also challenging, since there is no consensus on what interpretability implies. Instead, we informally address the interpretability aspect by analyzing the algorithmic steps of the proposed model in order to explain how the resulting document representation can be interpreted as a mixture of topics. We also demonstrate its ability to discover topics on one sample document³, compared with LDA.
To address the efficiency aspect, we analyze the average time complexity of the proposed model and document similarity measure. To address the effectiveness aspect, we evaluate the performance of the document similarity measure on multiple public data sets, compared with a set of state-of-the-art baseline models.
1.5 Benefits, ethics, and sustainability
The potential benefit of our work is significant for many document-related applications. Humans rely on documents to maintain our knowledge base about the world. Take academia, for example: we continue to use papers to record every scientific breakthrough. Due to the ever-increasing volume of our knowledge base, we rely on smart algorithms to retrieve desired information. The effectiveness and efficiency of these algorithms directly impact the productivity of human society as a whole: the less time we spend on retrieving the desired information, the more time we have for productive activities.
We could also view the potential benefit from the perspective of a machine. With our proposed method, machines are able to understand documents in the absence of an external corpus. This enables machines to undertake a larger set of document processing tasks where such a corpus is often not available. On the other hand, the knowledge of words learned by one machine can be transferred to many other machines. This makes the use of computational resources much more sustainable.
This work also indirectly addresses one central ethical issue of artificial intelligence (AI): that AIs become biased when trained on biased data sets [5]. In the context of document understanding, many models require either labeled training data or an external corpus. These models can easily be abused by a user who chooses to supply biased training data or a biased corpus. However, our model requires only the pre-trained word embeddings. The AI community could agree on one central unbiased corpus with which the word embeddings are trained, e.g. the corpora used by Google [13], Facebook [11], and Stanford [21], thereby effectively minimizing the risk of biased usage of our model.

³ We select the sample document which was used by Blei et al. to demonstrate the topic discovery ability of LDA [4].
1.6 Methodology
To the best of our knowledge, there is no rigorous probabilistic explanation of the word2vec vector space [12][27]. The most comprehensive study of the statistical implications of the word2vec model that we found is by Levy et al. [27], who argued that the word2vec model implicitly performs a matrix decomposition of a word co-occurrence statistics matrix. However, no formal probabilistic model has been proposed for the word2vec embeddings, so a rigorous analytic approach is not applicable. Instead, the conclusions of this thesis will be based on a scientific study using an empirical and experimental approach to achieve scientific validity.
The word clustering strategy discussed in section 1.2.1 is based on the assumption that semantically similar words form clusters in the word2vec vector space. To verify this assumption, we conduct an exploratory experiment on both synthetic and real-world documents. This exploratory experiment aims to generate empirical evidence that supports the word clustering assumption, and also informs the detailed design of the proposed model. We qualitatively evaluate the semantic coherence of the word clusters by intuitively examining their member words. We also quantitatively evaluate the word clusters using several cluster validation criteria.
Based on the results of this exploratory experiment, we propose 1) a document representation model that is able to discover the topics of a document, and 2) a similarity measure for the resulting document representation. We present the detailed algorithmic steps of both so that we can argue their efficiency, in terms of average time complexity, and their interpretability, by explaining how we can reveal the topics of a document using the resulting document representation. We also qualitatively evaluate the discovered topics on several sample documents to demonstrate the interpretability of the resulting document representation in terms of topic discovery.
We quantitatively evaluate the effectiveness of the proposed model for document similarity in the context of the k-nearest neighbor document classification task, as proposed by Kusner et al. [25]. The k-nearest neighbor classification algorithm determines the label of an unseen document d as the majority label among its closest neighbors. If the document similarity measure is indeed effective, these closest neighbors will be similar to d, and therefore the classification error rate will be low. We compare the performance of our model with a set of state-of-the-art baseline models to argue its relative effectiveness. We use publicly available data sets so that the experimental results can be reproduced. Since topic discovery is an important feature of our model, we selected three public data sets that are suitable for topic discovery.
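The evaluation procedure above can be sketched as a generic majority-vote classifier parameterized by any document similarity function. This is a minimal sketch: the 1-d "documents", the labels, and the negative-distance similarity are illustrative assumptions standing in for real documents and a measure such as the soft cosine similarity between nBTE vectors:

```python
from collections import Counter

def knn_predict(query, corpus, labels, similarity, k=3):
    """k-nearest neighbor classification: label a query document with
    the majority label among its k most similar training documents.
    `similarity(a, b)` can be any document similarity function."""
    # Rank training documents by decreasing similarity to the query.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: similarity(query, corpus[i]),
                    reverse=True)
    # Majority vote among the k nearest neighbors.
    top = [labels[i] for i in ranked[:k]]
    return Counter(top).most_common(1)[0][0]

# Toy 1-d "documents" with an illustrative similarity (negative distance).
docs = [0.0, 0.2, 0.1, 5.0, 5.2]
labels = ["sport", "sport", "sport", "politics", "politics"]
sim = lambda a, b: -abs(a - b)
print(knn_predict(0.15, docs, labels, sim, k=3))  # -> sport
```

A low error rate of this classifier on held-out documents is then taken as evidence that the plugged-in similarity measure ranks genuinely similar documents highly.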
The details of the research methodology can be found in chapter 3.
1.7 Delimitation
In this thesis, we only experiment with the pre-trained Google word embeddings; other pre-trained word embeddings are not considered. This means that the performance of our model may differ when other pre-trained word embeddings are used.

We only informally address the interpretability aspect of our model, since there is no consensus on what interpretability is, nor on how it should be evaluated.
1.8 Outline
This thesis is organized as follows. In Chapter 2, we introduce the relevant theory and related work. In Chapter 3, we present the details of the research methodology. In Chapter 4, we present and discuss the results of the first experiment. We then propose a document representation model and a document similarity measure in Chapter 5. In Chapter 6, we present the evaluation results of the second experiment. In Chapter 7, we discuss these evaluation results and conclude this thesis by directly answering the raised research question. We also propose several directions for future work in Chapter 7.
Chapter 2
Extended background
This chapter introduces the extended background of our research.
Document similarity is tightly connected with the problem of document representation, which is a central research topic in the area of natural language processing. Therefore, we first briefly introduce the area of natural language processing, with a focus on the problem of document representation. We then introduce the word2vec model with a focus on its training objective. After that, we introduce two document similarity measures that may be suitable for our model. Finally, we introduce clustering analysis, the task of grouping similar objects into clusters, and cover a number of clustering algorithms that may be suitable for discovering word clusters in the word2vec vector space.
2.1 Natural language processing
Natural language processing (NLP), a.k.a. computational linguistics, is the sub-field of computer science concerned with using computational techniques to understand and reproduce human language content [16]. The earliest research in NLP traces back to the 1950s, when the idea of automatic machine translation attracted research attention [18]. The early NLP systems were mostly based on complex sets of hand-written rules and were unable to deal with variation, ambiguity, and context-dependent interpretation of language. Starting in the late 1980s, with increasing computational power, more and more researchers began applying a variety of statistical methods to automatically learn the abstract rules from a large corpus [16]. This direction of research is also known as statistical NLP (SNLP).
At the core of any NLP task is the important issue of natural language understanding [6]. The raw text needs to be transformed into an appropriate representation that can be processed by computer programs. In the following sections, we discuss the notable methods for document representation in chronological order.
2.2 Document representation
Researchers in the field of information retrieval (IR) have made significant progress on the problem of document representation [2]. The basic paradigm proposed by IR researchers is to learn a vector representation of documents using an external corpus [4]. Starting from the simplest bag-of-words model, we gradually discuss more complex models for document representation.
2.2.1 Bag-of-words
The bag-of-words (BOW) model represents each document as a multi-set of words, disregarding grammar and ordering [43] but preserving the number of occurrences of each word. Take the following document for example,
Bob transferred eight bitcoins to Alice, and Alice transferred back five bitcoins.
its BOW multi-set representation is,
{Bob, Alice², transferred², bitcoins², eight, five, to, and, back}

where the superscripts denote the multiplicity of the respective word. We can then build a BOW vector of real values where each dimension refers to a unique word w and the entry is the frequency of w in the document. The BOW vector of the above document is (1, 2, 2, 2, 1, 1, 1, 1, 1), assuming an ordered list of unique words that matches the components of the BOW vector. It is common to normalize the BOW vector by dividing each entry by the total frequency. The normalized BOW vector of the above sentence is then (1/12, 2/12, 2/12, 2/12, 1/12, 1/12, 1/12, 1/12, 1/12).
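The construction above can be sketched in a few lines. The whitespace-plus-punctuation-stripping tokenizer below is a simplifying assumption; real pipelines use a proper tokenizer:

```python
from collections import Counter

def bow_vector(text):
    """Build a normalized bag-of-words vector as {word: relative frequency}."""
    # Naive tokenization: split on whitespace, strip trailing punctuation,
    # lowercase (an assumption; real systems use a proper tokenizer).
    tokens = [w.strip(".,").lower() for w in text.split()]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = ("Bob transferred eight bitcoins to Alice, "
       "and Alice transferred back five bitcoins.")
vec = bow_vector(doc)
print(vec["transferred"])  # 2 occurrences out of 12 tokens -> 2/12
```

The resulting dictionary is the sparse, order-free representation described above; stacking such vectors over a shared vocabulary yields the real-valued BOW vectors.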
The BOW model takes into account the word occurrences in a document, which to some extent reflect the content of the document. Intuitively, two documents with similar word occurrences should also be similar in content. However, in the BOW model, all words are equally important. This is clearly not realistic considering that, in certain domains, words like "economics" and "monetary" should have much higher discriminating power than words like "therefore" and "although". A remedy to this problem is to weight each word by its term-frequency inverse-document-frequency, a method which we discuss in the next section.
2.2.2 Term-frequency inverse-document-frequency
The term-frequency inverse-document-frequency (TF-IDF) method [43] (1983) is a step forward from BOW. It assumes access to a corpus D consisting of a collection of documents. It modifies the BOW vector by offsetting the word frequency by the inverse document frequency, which measures the discriminating power of a word.
Term-frequency
The term frequency tf(t, d) is defined as the frequency of word t in a document instance d. It is a measure of the contribution of a word to a document.

tf(t, d) = \text{frequency of } t \text{ in } d \quad (2.1)

Inverse document frequency
The inverse document frequency is defined as,

idf(t, D) = \log \frac{|D| + 1}{|\{d \in D : t \in d\}| + 1} \quad (2.2)
The inverse document frequency is a measure of how much information a word provides. A high inverse document frequency indicates that a word is rare and therefore likely more discriminating. In contrast, a low inverse document frequency indicates that the word is common across the corpus and therefore carries little discriminating power.
TF-IDF
Combining the term frequency and inverse document frequency, we can produce a composite weight for each term in a document, defined as,
tfidf(t, d, D) = tf(t, d) \times idf(t, D) \quad (2.3)

By definition, the TF-IDF score of a word t in a document d is highest when t occurs frequently in d while occurring in only a few documents in D. In this way, discriminating words are considered more important.
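The computation in equations 2.1-2.3 can be sketched directly in Python. The three-document toy corpus is a made-up example, and the +1 smoothing follows the definition in equation 2.2.

```python
import math
from collections import Counter

def tf(t, d):
    """Term frequency: number of occurrences of t in document d (a token list)."""
    return Counter(d)[t]

def idf(t, D):
    """Smoothed inverse document frequency of t over corpus D (equation 2.2)."""
    df = sum(1 for d in D if t in d)  # number of documents containing t
    return math.log((len(D) + 1) / (df + 1))

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

corpus = [["monetary", "policy", "economics"],
          ["therefore", "monetary", "supply"],
          ["therefore", "although", "policy"]]

# "economics" appears in one document while "therefore" appears in two, so
# for equal term frequency "economics" receives the higher TF-IDF weight.
score_rare = tfidf("economics", corpus[0], corpus)
score_common = tfidf("therefore", corpus[1], corpus)
```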
Term-document matrix
Let R be a dictionary consisting of all the unique words in D. We can then build a term-document matrix X ∈ R^{|R|×|D|} where the j-th column is the TF-IDF vector representation of document d_j and the i-th row contains the TF-IDF scores of the i-th word across all |D| documents. In this way, a document d of arbitrary length can be represented as a fixed-length |R|-dimensional vector where the i-th dimension refers to the i-th word in dictionary R and the entry refers to the TF-IDF score of the i-th word in d.
Compared to the BOW vector, TF-IDF is better at identifying discriminating words by incorporating word-occurrence statistics from a corpus. However, it has several shortcomings. 1) TF-IDF vectors are rather sparse, since their dimension equals the vocabulary size; such vectors can easily reach hundreds of thousands of entries with a sufficiently large corpus. 2) TF-IDF reveals little intra- and inter-document structure [4]: synonymous terms are regarded as totally different words even though they have the same meaning, while a polysemous term is regarded as expressing the same meaning in every context. 3) TF-IDF vectors are often not suitable for document similarity measures due to their frequent near-orthogonality [25]. The near-orthogonality of TF-IDF vectors means that a given query TF-IDF vector may be far away from all other vectors in terms of cosine similarity, therefore producing poor document similarity scores.
As we shall discuss in the next section, the Latent Semantic Indexing model tries to address the aforementioned issues by constructing a latent vector space for the documents and words.
2.2.3 Latent Semantic Indexing
The Latent Semantic Indexing model (LSI) [7] (1990), sometimes referred to as Latent Semantic Analysis (LSA), finds a lower-rank approximation to the term-document matrix M ∈ R^{v×d} using Singular Value Decomposition (SVD). Without diving into the details of the linear algebra, LSI finds a rank-k approximation C′ that can be decomposed into three matrices,
C′ = U \Sigma V \quad (2.4)
where U has dimensions (v × k), Σ(k × k) and V (k × d).
We can then construct a new matrix W as
W = U Σ (2.5)
with dimension (v × k), where each row refers to a word vector, and a new matrix D as
D = ΣV (2.6)
with dimension (k × d), where each column refers to a document vector.
The end result of SVD is that all documents and words are represented as vectors of reduced dimension k. In this way, LSI is able to achieve significant vector compression. Deerwester et al. also showed that the resulting latent vector space of documents reveals better inter-document structure, and that the latent vector space of words captures linguistic notions such as synonymy and polysemy [7].
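The decomposition in equations 2.4-2.6 can be reproduced with a truncated SVD in NumPy. The 4×3 term-document matrix below is a made-up toy example standing in for TF-IDF scores.

```python
import numpy as np

# Toy term-document matrix X: v = 4 words (rows), d = 3 documents (columns).
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

W = U_k @ S_k           # (v x k): each row is a latent word vector
D = S_k @ Vt_k          # (k x d): each column is a latent document vector
X_k = U_k @ S_k @ Vt_k  # the rank-k approximation C' of X
```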
The LSI model uses linear algebra techniques to achieve significant dimension reduction, with the hope that the resulting latent vectors capture semantic similarity. Even though LSI has been shown to be successful, it lacks a solid theoretical foundation, which hinders a more principled study of the model itself [17].
To address this issue, researchers have proposed to use a probabilistic approach to model the latent topics of a document, which has a sound statistical foundation [17][4]. We shall discuss one such notable generative probabilistic model in the next section.
2.2.4 Latent Dirichlet Allocation
The Latent Dirichlet Allocation (LDA) model [4] (2003) takes a probabilistic approach to modeling documents. The words are probabilistically grouped into a set of topics T of size k, and a document is modeled as a distribution over this underlying set of topics.
Similarly to LSI and TF-IDF, LDA assumes to have access to a corpus D consisting of a collection of documents and a dictionary R consisting of all unique words in D.
LDA also assumes that the document is in the BOW multi-set representation [4], therefore disregarding the grammar and word order in a document. In the language of probability theory, this is the assumption of exchangeability for the words in a document [1].
Before introducing the details of the generative process for a document, we first give definition to several probability distributions.
Poisson distribution:
The Poisson distribution is a discrete probability distribution parameterized by a positive value λ. The probability mass function is given by
f(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} \quad (2.7)
Dirichlet distribution:
The Dirichlet distribution Dir(α) is a family of continuous multivariate probability distributions parameterized by a vector α = (α_1, ..., α_k) of positive values. The order of a Dirichlet distribution is |α| = k. The probability density function of Dir(α) of order k is given by:
f(x_1, \dots, x_k \mid \alpha_1, \dots, \alpha_k) = \frac{1}{B(\alpha)} \prod_{i=1}^{k} x_i^{\alpha_i - 1} \quad (2.8)

where

B(\alpha) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)} \quad (2.9)

and

\sum_{i=1}^{k} x_i = 1 \;\wedge\; x_i \geq 0, \;\forall i \in [1, k] \quad (2.10)

Multinomial distribution:
The Multinomial distribution is a generalization of the binomial distribution, parameterized by a vector θ = (θ_1, ..., θ_k). The probability mass function of Multinomial(θ) of order k is given by:

f(x_1, \dots, x_k \mid \theta_1, \dots, \theta_k) = \frac{\Gamma\left(\sum_{i=1}^{k} x_i + 1\right)}{\prod_{i=1}^{k} \Gamma(x_i + 1)} \prod_{i=1}^{k} \theta_i^{x_i} \quad (2.11)
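All three distributions above are available in NumPy's random generator, so drawing from them can be sketched in a few lines. The parameter values here (λ = 8, α = (0.5, 0.5, 0.5)) are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Poisson: e.g. a document length, with rate lambda = 8
N = rng.poisson(8)

# Dirichlet: a point on the k-simplex, here k = 3
alpha = np.array([0.5, 0.5, 0.5])
theta = rng.dirichlet(alpha)   # each theta_i >= 0 and theta sums to 1

# Multinomial: category counts for N draws from the distribution theta
counts = rng.multinomial(N, theta)
```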
The overview of the generative process of LDA is as follows: we first choose the length N of a document d, then choose a topic distribution for d. We then iteratively generate a word by first choosing a topic according to the topic distribution, and then choosing a word instance according to the word distribution of the chosen topic.
Now we describe the probabilistic details of each of these steps.
The LDA model assumes that the length of a document follows Poisson(λ). For a document d, we draw a random variable N from Poisson(λ) as the length of the document. Note that the Poisson assumption is not critical to the following generative process and more realistic length distribution can be used [4].
LDA uses the Dirichlet distribution as prior to assign a topic distribution to a document d by drawing a sample θ from Dir(α), where θ i is the probability of mentioning topic T i in d. Note that LDA assumes the number of topics k is known and fixed. The choice of using Dirichlet distribution as prior is based on a number of convenient mathematical properties of Dirichlet distribution including the fact that it is a conjugate to the Multinomial distribution [4]. These properties facilitate the parameter estimation algorithms used for training the LDA model.
For each word w_i in d, LDA uses Multinomial(θ) to draw a random topic z_i from the topic distribution θ. We denote the resulting topics of all words as z = (z_1, ..., z_N).
We have provided a probabilistic model to select the topic distribution of a document, thereafter select the topic of each individual word in the document.
However, we still have not yet generated any word instance for document d. To this end, we define a matrix β ∈ R k×|R| , where β i,j = p(R j = 1|T i ). Therefore, given a topic T i , we can sample a word w from Multinomial(η), where η is the word distribution for topic T i , determined by the i th row of β.
In summary, LDA assumes the following generative process for any document d in corpus D.
1. Choose N ∼ Poisson(λ) to be the number of words in d.
2. Choose θ ∼ Dir(α) to be the topic distribution of d.
3. For each of the N words w_n:

a) Choose a topic z_n ∼ Multinomial(θ).

b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
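The generative process above can be sketched as a small sampler. The vocabulary, number of topics, and all parameter values below are hypothetical toy choices; in the real model β is learned, not sampled at random.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["market", "price", "goal", "team", "vote", "party"]  # toy dictionary R
k, lam = 3, 8                       # number of topics, Poisson rate
alpha = np.full(k, 0.5)             # Dirichlet prior over topics
# beta[i] is the word distribution of topic T_i (each row sums to 1);
# sampled randomly here purely for illustration.
beta = rng.dirichlet(np.ones(len(vocab)), size=k)

def generate_document():
    N = rng.poisson(lam)            # 1. document length
    theta = rng.dirichlet(alpha)    # 2. topic distribution of d
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)             # 3a. topic of this word
        w = rng.choice(len(vocab), p=beta[z])  # 3b. word from topic z's distribution
        words.append(vocab[w])
    return words

doc = generate_document()
```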
LDA uses a variational Bayes approximation of the posterior distribution to infer the hidden variables θ and z for each document [4]. Blei et al. also introduced an EM algorithm for empirical Bayes parameter estimation of α and β [4]. The variational inference and EM algorithms are not directly relevant to our research, so we do not introduce them in this thesis; a detailed description of these methods can be found in [4].
The resulting LDA representation for a document d is a k-dimensional vector where the i-th entry refers to the probability of d mentioning topic T_i. We can then interpret a document as expressing content about k topics, with a different weight assigned to each topic. Besides, we can examine the content of each topic by querying its top words using the multinomial probability matrix β. Blei et al. demonstrated that, on a sample document, LDA learns topic groups that contain mostly semantically coherent words [4].
Compared to LSI, LDA has a solid statistical foundation [4]. It also provides better interpretability and modularity: LDA can be readily extended to continuous data or other non-multinomial data by modifying its probabilistic modules [4]. Similarly to LSI, we can view LDA as a dimension-reduction technique that compresses the document vector to length k.
2.2.5 Common document similarity measures
We have discussed various models for transforming documents into vectors in some latent vector space. This way of representing documents as fixed-length vectors is known as vector space model. In the vector space, each dimension refers to a unique feature. In the case of BOW and TF-IDF model, the dimensions refer to the unique words and the entries of a vector capture the relative importance of these words in a document. As for LSI and LDA, the dimensions represent some latent concepts.
Using the vector space model, we can estimate the document similarity between two documents as the distance between their vector representations. The two most common distance measures are the cosine similarity and Euclidean distance [43].
Euclidean distance
Given two vectors v and w of dimension d, their Euclidean distance is defined as,

Euclidean(v, w) = \sqrt{\sum_{i=1}^{d} (v_i - w_i)^2} \quad (2.12)
Cosine similarity
Given two vectors v and w of dimension d, the cosine similarity is defined as

cosine(v, w) = \frac{v \cdot w}{\lVert v \rVert \lVert w \rVert} \quad (2.13)

where v \cdot w denotes the dot product between v and w, i.e. \sum_{i=1}^{d} v_i w_i, and \lVert v \rVert denotes the Euclidean norm \sqrt{\sum_{i=1}^{d} v_i^2}.
The cosine similarity measures the difference in direction between two vectors.
From this view, cosine similarity is preferred over the Euclidean distance in situations where we are only concerned with what concepts a document is about, not how strongly it expresses those concepts [43].
2.2.6 Summary
We have discussed a wide range of notable document representation models. Except for the simplest BOW model, they all share the same paradigm, which aims to build a fixed-length vector representation for a document using a corpus. Compared to BOW and TF-IDF, more advanced models such as LSI and LDA achieve significant dimension reduction and are able to capture the semantic similarity between individual words. In these models, words are also represented as fixed-length vectors in some latent vector space, enabling us to calculate the semantic similarity between words.
Besides the fixed-length paradigm, these models all share the BOW assumption, which disregards word order. However, word order may provide valuable contextual information that is useful for estimating semantic similarity between words. Let us consider two intuitive situations: 1) if two words frequently occur in neighboring positions in a document, we may conjecture that these two words are often used to express the same contextual meaning; 2) if two words frequently occur at the same position in a sentence, we may conjecture that these two words are substitutable (synonyms). These conjectures are formally captured by the distributional hypothesis of Harris [15], which states that words appearing in similar contexts have similar meanings.
In the next section, we shall exploit this line of thought and discuss several models that are specifically designed for building dense vector representations for words, also known as word embeddings.
2.3 Word embedding
A word embedding is a dense vector representation of a word [27]. We call a model that transforms an input word into a fixed-length dense vector a word embedding model. In section 2.2.3, we showed that LSI can map a word into a latent vector space where the semantic similarity between words is preserved. We may therefore regard LSI as an early word embedding model.
Many of the modern word embedding models use training methods inspired by neural-network language modeling [27]. Such models were shown to perform significantly better than LSI for preserving linear regularities among words [35]. One of the earliest neural word embedding models was proposed by Bengio et al. [3] (2003), where a feedforward neural network is used to learn word vector representations from the linguistic contexts in which words occur. Most of the newest word embedding methods extend the work of Bengio and rely on a neural network architecture [32]. One notable model was recently proposed by Mikolov et al. [32] (2013) that is both efficient to train and provides state-of-the-art results on various linguistic tasks [27]. This model is also known as word2vec. In the next section, we discuss the word2vec model in detail.
2.3.1 Word2vec
Like many other word embedding models, the word2vec model is based on the distributional hypothesis in linguistics, which states that semantically similar words have similar distributions in large samples of language data [15]. In other words, words that occur in similar contexts have similar meanings.
In the word2vec model, the context c of a word w is defined as the list of its neighboring words in a document, fixed by a window size k. For instance, if k = 5, the context c includes the five previous and five following words of w. The vector representations of words are then learned by maximizing the conditional probability p(c|w) or p(w|c) over all (w, c) pairs.
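Extracting the (w, c) training pairs for a given window size can be sketched as follows; `context_pairs` is a hypothetical helper name, not part of any word2vec implementation.

```python
def context_pairs(tokens, k):
    """Yield (word, context_word) pairs using a window of k words on each side."""
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield (w, tokens[j])

pairs = list(context_pairs(["the", "cat", "sat", "on", "mat"], 2))
# With k = 2, "sat" is in the context of "cat", but "on" is outside
# the window of "the".
```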
Mikolov et al. proposed two model architectures, namely the continuous bag-of-words model (CBOW), which maximizes p(w|c), and the continuous skip-gram model (skip-gram), which maximizes p(c|w) [32]. Mikolov et al. also proposed an efficient training method called negative sampling for the skip-gram model (SGNS) [33].
Mikolov et al. argued that SGNS improves both the training speed and the quality of the word embeddings [33]. Since Google used the SGNS model to train its word embeddings, we focus on discussing the SGNS model.
Skip-gram
Given a corpus, we denote W to be the set of words and C to be the set of contexts in the corpus. We denote C(w) to be all the contexts where w appeared.
Now consider a word-context pair (w, c) where w ∈ W and c ∈ C. Let P (c|w) be the probability that (w, c) is observed in D given word w. This conditional probability is modeled as
P(c \mid w) = \frac{e^{\vec{w} \cdot \vec{c}}}{\sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i}} \quad (2.14)

where \vec{w} and \vec{c} are the vector representations of word w and context c to be learned. Note that this is essentially a multinomial logistic model that assumes a linear dependency between \vec{w} and \vec{c}.
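Equation 2.14 is a softmax over the entire context set C, which can be sketched with NumPy. The word and context vectors below are random toy values, and the dimensions are arbitrary.

```python
import numpy as np

def p_c_given_w(w_vec, C_vecs):
    """P(c_i | w) for every context vector (row of C_vecs), per equation 2.14."""
    scores = C_vecs @ w_vec   # one dot product per context vector
    scores -= scores.max()    # shift scores for numerical stability
    e = np.exp(scores)
    return e / e.sum()        # normalize over the whole context set C

rng = np.random.default_rng(2)
w = rng.standard_normal(5)          # a word vector of dimension 5
C = rng.standard_normal((100, 5))   # 100 context vectors
p = p_c_given_w(w, C)               # p[i] = P(c_i | w); p sums to 1
```

The normalization over all of C in the last line is exactly the term that makes this objective expensive for large vocabularies.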
The objective of the skip-gram model is to find parameters \vec{w} and \vec{c} that maximize the average log probability. The objective function is therefore given by

\arg\max_{\vec{w}, \vec{c}} \frac{1}{N} \sum_{w \in W} \sum_{c \in C(w)} \log P(c \mid w) = \frac{1}{N} \sum_{w \in W} \sum_{c \in C(w)} \left( \vec{w} \cdot \vec{c} - \log \sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i} \right) \quad (2.15)
Equation 2.15 makes an important assumption: by optimizing this objective function, similar words would eventually have similar word vectors. As noted by Goldberg et al. [12], this assumption has no sound statistical foundation and it is unclear why it holds.
The objective function shown in equation 2.15 is expensive to compute because of the term \sum_{c_i \in C} e^{\vec{w} \cdot \vec{c}_i}