
Institutionen för systemteknik
Department of Electrical Engineering

Master's thesis (Examensarbete)

The Use of Distributional Semantics in Text Classification Models

Comparative performance analysis of popular word embeddings

Master's thesis carried out in Computer Vision at the Institute of Technology at Linköping University

by Tobias Norlund

LiTH-ISY-EX--16/4926--SE

Linköping 2016

Department of Electrical Engineering
Linköpings tekniska högskola
Linköpings universitet


The Use of Distributional Semantics in Text Classification Models

Comparative performance analysis of popular word embeddings

Master's thesis carried out in Computer Vision
at the Institute of Technology at Linköping University

by

Tobias Norlund

LiTH-ISY-EX--16/4926--SE

Supervisors: Kristoffer Öfjäll, ISY, Linköpings universitet
             Magnus Sahlgren, Gavagai

Examiner: Michael Felsberg, ISY, Linköpings universitet


Avdelning, Institution (Division, Department): Datorseende (Computer Vision), Department of Electrical Engineering, SE-581 83 Linköping

Datum (Date): 2016-03-13

Språk (Language): Engelska / English

Rapporttyp (Report category): Examensarbete (Master's thesis)

ISRN: LiTH-ISY-EX--16/4926--SE

Titel (Title): Nyttan av Distributionell Semantik i Textklassificeringsmodeller / The Use of Distributional Semantics in Text Classification Models

Författare (Author): Tobias Norlund


Summary

In the field of Natural Language Processing, supervised machine learning models have proven successful at solving classification tasks such as sentiment analysis and text categorization. The classical way of representing the text fed to these models has been the high-dimensional so-called Bag-Of-Words representation. Lately, however, it has often been replaced by low-dimensional dense word vectors which, together with new models, have also reported groundbreaking results. Since few studies have made a fair comparison of these different representations and the classification performance they yield, this thesis tries to fill that gap. In particular, we seek insight into how different popular unsupervised pre-trained word vectors differ in that respect. In addition, we take a closer look at the Random Indexing representation and propose two ways to fine-tune its word vectors while training the classification task. The results show that although the low-dimensional pre-trained representations often have practical computational benefits and have shown very good performance, they do not necessarily outperform the classical representations in all settings.


Abstract

In the field of Natural Language Processing, supervised machine learning is commonly used to solve classification tasks such as sentiment analysis and text categorization. The classical way of representing the text has been to use the well known Bag-Of-Words representation. Lately, however, low-dimensional dense word vectors have come to dominate the input to state-of-the-art models. Since few studies have made a fair comparison of the models' sensitivity to the text representation, this thesis tries to fill that gap. We especially seek insight into the impact various unsupervised pre-trained vectors have on the performance. In addition, we take a closer look at the Random Indexing representation and try to optimize it jointly with the classification task. The results show that while low-dimensional pre-trained representations often have computational benefits and have also reported state-of-the-art performance, they do not necessarily outperform the classical representations in all cases.


Acknowledgments

I would like to dedicate this work to my family for their consistent support throughout my studies. I also want to thank my supervisor Magnus Sahlgren for his support and for giving me the opportunity to immerse myself in this very interesting field.

Finally, a big thanks to my examiner, academic supervisors and my two opponents for giving me great constructive feedback on this thesis!

Stockholm, March 2016 Tobias Norlund


Contents

1 Introduction
   1.1 Background
   1.2 Problem Formulation
   1.3 Approach
   1.4 Related Work

2 Theory
   2.1 Notation
   2.2 Word Embeddings Theory
   2.3 Random Indexing
   2.4 Pointwise Mutual Information (PMI)
   2.5 Convolutional Neural Networks

3 Random Indexing modifications
   3.1 Backpropagation into word embeddings
   3.2 Parameterize embeddings
       3.2.1 att ri implementation details
       3.2.2 Motivation: SimLex

4 Experiments
   4.1 Word embeddings
   4.2 Models
   4.3 Datasets
   4.4 Experimental Setup
   4.5 Results

5 Discussion
   5.1 Future Work and Conclusions

Bibliography


1

Introduction

1.1

Background

In Natural Language Processing (NLP), the ultimate goal is to get a computer to understand and use natural (human) languages just as fluently as any human is capable of. For this to be possible, we need to seek a structured representation of language that is capable of encapsulating the huge variability and flexibility that natural language holds. The representation should unambiguously be able to represent meaning, which leads us to the philosophical question of what the meaning of 'meaning' actually is. This is something that has been thoroughly investigated in the linguistic community, for example in [Wittgenstein, 1953, Karlgren and Sahlgren, 2001].

Since the ultimate structured representation of natural language is yet to be seen, the NLP problem has been divided into many different subtasks. Often, these tasks more or less correspond to real world applications or constitute building blocks that, when composed, solve more high-level tasks. Most of these tasks fall under either Natural Language Understanding, which deals with reading comprehension, or Natural Language Generation, which aims at generating natural text from a machine data structure. Examples of common tasks are:

• Topic Modelling: Given a large collection of text documents, find latent topics that recur and determine to what degree each document corresponds to each topic.

• Part-Of-Speech Tagging: Assigns a "part-of-speech tag" (e.g. noun, verb, adverb etc.) to each word in a sentence. The tags are then usually used in more high-level NLP tasks.


• Sentiment Analysis: Extracts subjective information from a set of documents. Sentiment Analysis determines to what degree a text document conforms to a certain concept, such as positive or negative.

• Named Entity Recognition: Seeks to identify and possibly disambiguate parts of text into pre-defined classes, such as 'Person', 'Date', 'Location', 'Organisation' etc. The output could look something like:

[Microsoft]Company acquired the successful [Swedish]Country game producer [Mojang]Company in [2014]Year

• Parsing: Performs a grammatical analysis of a sentence that tries to find (and resolve) possible grammatical ambiguities. For example, "Peter shot an elephant in his pajamas" has two grammatically correct meanings; one may be argued to be more likely than the other. Parsing usually results in what is called a parse tree, which can also be used to support other, more high-level NLP tasks.

• Question Answering: A typical high-level NLP task where the system is given a question and queries an underlying knowledge base for information.

• Machine Translation: Automatically translate a text document from one language to another.

In the search for solutions to some of these problems, one idea is to represent the semantic meaning of words in a Vector Space Model (VSM), and previous research has shown that this may perform favorably [Sahlgren, 2006, Deerwester et al., 1990, Schütze, 1992, Lund and Burgess, 1996]. A VSM supplies each word with a vector representation and, depending on the task, these representations are used in various different ways.

Several different approaches to construct these word vectors have been proposed, and one that is motivated from a linguistic standpoint is the Random Indexing representation [Karlgren and Sahlgren, 2001, Sahlgren, 2005]. The Random Indexing representation is explained more in depth in section 2.3 and will be further investigated throughout this thesis.

1.2

Problem Formulation

In recent times, supervised machine learning has been successfully applied to several classical NLP tasks, including Sentiment Analysis, Text Categorization, Part-Of-Speech Tagging and Named Entity Recognition. Various different models have been investigated, such as classical models like Naive Bayes, Support Vector Machines and ensemble methods. Recently, in particular (deep) neural methods such as Convolutional Neural Networks [Collobert et al., 2011, Kim, 2014, Kalchbrenner et al., 2014] and Recursive Neural Networks [Socher et al., 2013b] have reported outstanding performance.


Figure 1.1: A conceptual illustration of a common supervised learning setup in NLP. Word embeddings are fed into a model which is trained and evaluated on a dataset

It is common to represent the input text's words by vectors that are either pre-trained or randomized and use them as input to these supervised models. Such a supervised setup in NLP can conceptually be described as constituting three parts:

• Some input text, which in this thesis will be represented as word vectors (or word embeddings)

• A supervised learning model which takes the embeddings as input

• A labeled dataset which is used to train and evaluate the model and the embeddings

This conceptual setup is illustrated in figure 1.1. The thesis aims to make a quantitative comparison of commonly used embeddings on two models using standard benchmark classification datasets. We seek insight into the impact different popular embeddings have on the classification performance.

As a side objective, we will also take a closer look at the Random Indexing embedding. Nilsson tried to "adjust" the Random Indexing embeddings through what he calls a bow ri mlp (Bag-Of-Words Random Indexing Multi-Layered Perceptron), which however did not give any improvements [Nilsson, 2015]. In this thesis, another approach is taken to optimize the Random Indexing embeddings to a specific task.

1.3

Approach

The comparative study will be performed by means of a set of experiments. An experiment in this thesis is to run the setup in figure 1.1 using a specific embedding, model and dataset. The experiments are more thoroughly explained in chapter 4. In chapter 3 we propose two different ways of "updating" the Random Indexing embeddings jointly with the task, which are also included in the experiments.


1.4

Related Work

Word embeddings or word space models can be described as a special kind of Distributional Semantic Models (DSM) which use linguistic items as context [Sahlgren, 2006]. They have earlier been used in, for example, language modelling [Bengio et al., 2003], syntactic parsing [Socher et al., 2013a] and sentiment analysis [Socher et al., 2013b]. DSMs have also successfully been applied in paraphrasing, named entity recognition and topic modelling, amongst others.

The aim of word embeddings is to let the word vectors represent the meaning of the words. Words of similar meaning could then be represented by vectors that lie close to each other according to some distance metric, and unrelated words by orthogonal vectors. For example, a sentiment analysis system in a financial context might have learned that word combinations such as "big profit" correlate with a positive sentiment. Now, if the system observes the combination "huge profit" without having seen "huge" in the sentiment training set, the hope is that if the vector for "huge" lies close to the vector for "big", the system might be able to generalize and still classify the document as positive. The question is how to construct these word representations (also known as word embeddings) with the ultimate goal of optimizing the performance of various NLP tasks. This is more deeply covered in section 2.2.

The classic approach to construct word embeddings is by using Latent Semantic Indexing (LSI) (also denoted Latent Semantic Analysis) [Deerwester et al., 1988], which is explained in section 2.2. To overcome some scaling problems with LSI, Random Indexing was proposed [Karlgren and Sahlgren, 2001, Sahlgren, 2005]. Recently, the Skip-Gram model along with the Continuous Bag Of Words (CBOW) model [Mikolov et al., 2013a,b] have been widely adopted in the field. Implemented in the word2vec1 tool, they provide a fast and performant way of learning word embeddings. The objective of the Skip-Gram model is to model the probabilities p(w_j|w_i) using word embeddings, which are to be optimized. That is, given a focus word w_i, how likely is w_j to appear in the context of w_i (more about this in chapter 2). Skip-Gram iteratively reads the corpus one word at a time and successively optimizes the embeddings using Stochastic Gradient Descent to better predict the local context that is seen.

An interesting result of the Skip-Gram model is the so-called analogy property of the word vectors. It turns out that the vector operation

vector(King) - vector(Man) + vector(Woman)

results in a vector close to vector(Queen). It captures not only such semantic relations, but also syntactic ones like:

vector(Apple) - vector(Apples) + vector(Car)


which ends up close to vector(Cars). This has been shown to be due to the log-bilinear relationship between the word embeddings and the output probabilities in the model objective [Mikolov et al., 2013a].

The GloVe model was explicitly designed to embed word relations as linear relations [Pennington et al., 2014]. Whereas the Skip-Gram and Random Indexing models are optimized incrementally based on local context windows, the classical LSI operates on a global scale and optimizes its objective with the complete aggregated co-occurrence matrix at hand. GloVe aims to combine the two philosophies by keeping the log-bilinear embedding-probability relationship from Skip-Gram while optimizing on the globally aggregated contexts instead of one local context at a time.

Recently, such pre-trained word representations have been used as input to supervised learning models, with the intuition that their inherent semantic properties should assist the training and help generalization. Many such supervised models are compositional models, which somehow combine representations of words into a document-level representation and then finally predict a class using it.

David Nilsson [Nilsson, 2015] used the centroid (sum) of Random Indexing embeddings as input to a neural network (multi-layered perceptron) which was trained on various classification tasks. He also tried to update the embeddings using a backpropagation scheme.

[Collobert et al., 2011] utilizes a convolutional neural network (CNN) to solve word-tagging tasks including Part-Of-Speech Tagging, Named Entity Recognition, Chunking and Semantic Role Labeling. Inputs to the network are randomized word embeddings which are updated using backpropagation. Convolutional neural networks were initially developed for image recognition, but have turned out to perform well in NLP as well. A walkthrough of a common CNN architecture for NLP can be found in section 2.5.

The network proposed by [Collobert et al., 2011] only supports convolutional filters of one size. [Kim, 2014] builds upon the model to also allow for multiple-sized filters and applied it to various text classification tasks. In addition, the word embeddings are initialized to the pre-trained embeddings from [Mikolov et al., 2013a,b]. The model turns out to be very performant and achieves state-of-the-art results on popular sentiment and categorization datasets.

A more in-depth sensitivity analysis of the model proposed by [Kim, 2014] is performed in [Zhang and Wallace, 2015]. Like this thesis, it compares the performance of the CNN for sentence classification using different embeddings. The two pre-trained embeddings compared are however trained on highly different data sources: the Google word2vec (Skip-Gram) vectors were trained on a proprietary Google News dataset of 100 billion words, whereas the GloVe vectors were trained on a web data corpus of 840 billion words. To make a fair and consistent comparison of embeddings, their underlying training data need to be the same. In this thesis, all embeddings (with one exception) are trained on a Wikipedia dump from 2010, which is also more deeply covered in chapter 4.


2

Theory

2.1

Notation

The notation that will be used throughout this thesis is defined here. Matrices are denoted by bold capital letters, vectors by bold lowercase letters, and scalars, functions as well as individual words by lowercase letters.

Definition 2.1 (Corpus and Vocabulary). We define a corpus T as a (finite or infinite) sequence of words drawn from a finite set called the vocabulary V:

    T = w_1, w_2, w_3, \dots, w_t, \dots          (2.1)

    w_t \in V          (2.2)

The corpus is usually divided into a number of documents or articles, which we do not regard in this thesis.

We aim to construct a d-dimensional Vector Space Model (VSM) for our word embeddings:

Definition 2.2 (Vector Space Model for Word Embeddings). We first define a mapping function m from a word w_t to a d-dimensional vector, denoted word embedding v_t:

    v_t := m(w_t) \in \mathbb{R}^d          (2.3)

Each word embedding is constructed as a linear combination of \tilde{d} frame vectors e_j:

    v_i = \sum_{j=1}^{\tilde{d}} \lambda_j e_j, \quad i = 1, \dots, |V|          (2.4)

We will use subscript i when indexing in the vocabulary, and t when indexing in the corpus. Note that the frame B = (e_j)_{j=1}^{\tilde{d}} could be either under-represented (\tilde{d} < d), span the whole space (\tilde{d} = d) or be over-represented (\tilde{d} > d).

2.2

Word Embeddings Theory

From the definition of the VSM, we can now describe the simplest embedding possible, known as the Bag-Of-Words (BOW) embedding.

In the BOW embedding the number of dimensions equals the size of our vocabulary (d = |V|), which in general yields a very high-dimensional vector. The frame B is defined to be the natural basis which spans the whole d-dimensional space (\tilde{d} = d). That means that every word gets its own dedicated dimension, and the word embedding for a particular word is constructed by setting that word's corresponding dimension to 1, while keeping all other dimensions zero. This yields a very sparse representation, and the word embeddings also form an orthogonal basis of R^d. The usual way of representing a document is by simply summing up the word vectors for all words in the document. One could also apply different weightings such as TF-IDF [Jones, 1972] to account for the fact that words are unequally distributed.
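As a concrete illustration, the following short Python sketch (toy vocabulary and illustrative names, not code from the thesis) builds such summed BOW document vectors.

import numpy as np

vocab = ["a", "cat", "dog", "sat", "the"]
idx = {w: i for i, w in enumerate(vocab)}

def bow_document_vector(tokens):
    # Sum the one-hot word vectors of all tokens into one |V|-dimensional document vector
    v = np.zeros(len(vocab))
    for w in tokens:
        if w in idx:
            v[idx[w]] += 1.0
    return v

print(bow_document_vector("the cat sat".split()))   # -> [0. 1. 0. 1. 1.]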

Even though the BOW representation produces good results in many cases, it does not capture any semantic information since all embeddings are mutually orthogonal. Also, due to the curse of dimensionality it gets hard for a learning system to generalize well. To push the boundary further, one idea has been to supply more semantic meaning with the vectors.

We want to embed meaning into the word vectors in some way such that it would assist a classifier. How do you do that? What semantic properties of words do we want to embed, and how can those be learned in an unsupervised way from a large text corpus?

The Distributional Hypothesis

From a linguistic point of view, the distributional hypothesis states that words occurring in the same contexts tend to have similar meaning [Sahlgren, 2006]. The hypothesis has inspired most popular word embeddings to date, and such models also go under the common name Distributional Semantic Models (DSM). The idea is to use the distributions of words in large amounts of text to represent their meaning. How do we concretize that?

In this thesis we are going to use the definition of context as described in Definition 2.3.


Figure 2.1: Example of context when k = 2. The context of the word "sat" is underlined. "sat" is the focus word and "fat", "cat", "on" and "the" are context words.

Definition 2.3 (Context). We define the set of contexts C to be the vocabulary set:

    C := V          (2.5)

Consider a corpus word occurrence w_t at a certain corpus position t. We denote it as the focus word. We now define the context (or context words) of w_t, denoted c(w_t), as a multiset of the k left-most and the k right-most words to the focus word:

    c(w_t) := \{ w_{t-k}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+k} \}          (2.6)

That is, the closest 2k words appearing around the word in question. It is common to say that c(w_t) is the words appearing within a k + k window around the focus word (the k nearest words to the left plus the k nearest words to the right).

We define the aggregated contexts of a word w_i to be the set of all words appearing within a k + k window for all occurrences of w_i in a corpus.

An illustration of context drawn from a corpus can be seen in figure 2.1. But how do we realize this in practice? How do we construct word embeddings using the distributional hypothesis stated above?

The simplest way is by constructing a so-called co-occurrence matrix C. The co-occurrence matrix is a matrix with |V| (the vocabulary size) rows and the total number of contexts |C| columns. In our case, the number of contexts also equals the vocabulary size, which makes C a square and symmetric matrix. Each cell C_ij contains the count of how many times a word w_j has appeared within a k + k window of word w_i in the corpus.
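To make the construction concrete, here is a minimal Python sketch (toy corpus and illustrative names, not code from the thesis) that accumulates such a co-occurrence matrix with a k + k window.

import numpy as np

corpus = "the fat cat sat on the mat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
k = 2

C = np.zeros((len(vocab), len(vocab)), dtype=int)
for t, w in enumerate(corpus):
    for l in range(-k, k + 1):
        if l != 0 and 0 <= t + l < len(corpus):
            C[idx[w], idx[corpus[t + l]]] += 1   # count context word within the window

# Row i of C can then be read as a |V|-dimensional word vector for vocab[i].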

Remark 2.4. Other definitions of context are common. For example, one could define context on a document level instead of a word level. The set of contexts C is then defined to be the set of documents in the corpus. Then C_ij contains the number of times word i appeared in document j, and \tilde{d} (the total number of contexts) would then be the number of documents.


By the distributional hypothesis, the rows of words that occur in similar contexts will look similar. We can interpret the rows of the co-occurrence matrix as word vectors in a d = |V| dimensional vector space

    v_i := C_{i:}          (2.7)

where C_{i:} denotes the ith row of C.

Inspecting equation (2.4) in the light of equation (2.7), the λ_j's correspond to the number of times word w_j has appeared in the context of word w_i. The frame vectors e_j are one-hot (or BOW) vectors constituting a natural basis of the |V| dimensional space.

In Latent Semantic Indexing (LSI) [Deerwester et al., 1988], where context is defined as in remark 2.4, you perform a Singular Value Decomposition (SVD) of the co-occurrence matrix. By only keeping the d most significant singular values, we find a d < |V| dimensional subspace onto which you project the embeddings from equation (2.7). However, |V| could be in the millions, which quickly becomes a practical problem when you try to decompose a (possibly) millions-by-millions matrix.

LSI also has other shortcomings. If the corpus or vocabulary changes over time, you have to re-sample the co-occurrence matrix and perform the costly decomposition again to update your embeddings. An incremental decomposition would be preferable, and this is exactly what the Random Indexing algorithm aims to solve.
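The following numpy sketch (toy dimensions, assumed variable names; not the thesis implementation) illustrates the LSI idea of a truncated SVD: only the d largest singular values are kept, and the rows are projected down to d dimensions.

import numpy as np

C = np.random.rand(1000, 1000)    # placeholder co-occurrence matrix (|V| x |C|)
d = 50                            # target dimensionality

U, S, Vt = np.linalg.svd(C, full_matrices=False)
embeddings = U[:, :d] * S[:d]     # d-dimensional word embeddings, one row per word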

2.3

Random Indexing

As pointed out in section 2.2, the Random Indexing algorithm [Sahlgren, 2005] intends to get around the shortcomings of Latent Semantic Indexing, which mainly are the construction of the (possibly) very large co-occurrence matrix as well as the process of decomposing it. Random Indexing does both at the same time, in an incremental fashion. Because of that, it allows for an ever-changing corpus and vocabulary, which are attractive properties in practice. We will describe the Random Indexing algorithm in terms of the Vector Space Model defined in Definition 2.2.

Random Indexing can be simply explained in terms of the co-occurrence matrix C and equation (2.4). We only re-define the frame B:

1. The embedding dimensionality d is set to something significantly smaller than |V|. In this thesis d = 2,000 is used.

2. The number of frame vectors remains at \tilde{d} = |V|, which yields an over-represented frame. We re-define the frame vectors e_i according to the following: for every i = 1, ..., \tilde{d}, let e_i have ε non-zero elements, where half of them are −1 and half of them +1. This makes the frame a "nearly" orthogonal sparse frame.


We can now, in an incremental fashion, construct low-dimensional word embeddings as we pass through the corpus, effectively avoiding storing and decomposing the huge co-occurrence matrix. The Random Indexing algorithm is formalized in pseudo-code in Algorithm 1.

Remark 2.5. In Random Indexing, the frame vectors e_i and word embeddings v_i are usually denoted index vectors and context vectors respectively. They will from now on be referred to as such.

Algorithm 1: The Random Indexing algorithm
Input:  A corpus T
        d: Embedding target dimensionality
        ε: Number of non-zero elements in index vectors
        k: Context window size
Result: Word embeddings (context vectors) v_i

 1  begin
        /* Init context and index vectors */
 2      v_i ← 0 ∈ R^d                 ∀i = 1, ..., |V|
 3      e_i ← RandIndexVector()       ∀i = 1, ..., |V|
 4      foreach w_t in corpus T do
            /* Left context */
 5          for l ← 1 to k do
 6              v_t ← v_t + e_{t−l}
 7          end
            /* Right context */
 8          for l ← 1 to k do
 9              v_t ← v_t + e_{t+l}
10          end
11      end
12  end

13  Function RandIndexVector() is
14      e ← 0 ∈ R^d
15      for n ← 1 to ε/2 do
16          e[randInt(1, d)] ← +1
17      end
18      for n ← 1 to ε/2 do
19          e[randInt(1, d)] ← −1
20      end
21      return e
22  end
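For concreteness, here is a compact Python sketch of Algorithm 1 (an in-memory toy corpus, no frequency weighting; the non-zero positions of the index vectors are drawn without replacement, a simplifying assumption).

import numpy as np

def random_index_vector(d, eps, rng):
    # Sparse ternary index vector with eps non-zero elements, half +1 and half -1
    e = np.zeros(d)
    pos = rng.choice(d, size=eps, replace=False)
    e[pos[: eps // 2]] = +1.0
    e[pos[eps // 2:]] = -1.0
    return e

def random_indexing(corpus, d=2000, eps=10, k=2, seed=0):
    rng = np.random.default_rng(seed)
    vocab = {w: i for i, w in enumerate(dict.fromkeys(corpus))}
    index = np.stack([random_index_vector(d, eps, rng) for _ in vocab])
    context = np.zeros((len(vocab), d))
    for t, w in enumerate(corpus):
        for l in range(-k, k + 1):
            if l != 0 and 0 <= t + l < len(corpus):
                # Add the context word's index vector to the focus word's context vector
                context[vocab[w]] += index[vocab[corpus[t + l]]]
    return vocab, context

vocab, ctx = random_indexing("the fat cat sat on the mat".split())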


Remark 2.6. Note that if we set d = |V| and define the frame vectors e_i to be one-hot vectors instead of the vectors generated by RandIndexVector(), we recover the embeddings from equation (2.7), using the same algorithm.

Furthermore, if we stack the index vectors as rows to form a matrix, denoted W ∈ R^{\tilde{d} \times d},

    W_{i:} := e_i, \quad i = 1, \dots, |V|          (2.8)

then W defines a transformation from the BOW frame to the Random Indexing frame. Since W is randomized, the Random Indexing algorithm is equivalent to what is known as a random projection, and hence the name Random Indexing.

To summarize: the Random Indexing context vectors are constructed by iteratively stepping through a corpus, word by word. Initialize the context vectors as zero vectors and the index vectors as described above. For each word w_t in the corpus, add its context's index vectors to w_t's context vector.

Additionally, to make frequent words (so-called stop words like "the", "a", "an" etc.) less influential in describing the contexts of words, one might weight the index vectors before adding them to the context vector. In this thesis, when mentioned, the context vectors are updated using the following entropy weighting function:

    v_t \leftarrow v_t + \sum_{l=-k, \, l \neq 0}^{k} h(t, l) \, e_{t+l}          (2.9)

    h(t, l) = e^{-c f_{t+l} / |V|}          (2.10)

where e_{t+l} is the index vector corresponding to w_{t+l}, c is a constant, f_{t+l} is the corpus frequency of word w_{t+l}, and |V| is the size of the vocabulary. This would modify Algorithm 1 by introducing the weighting function h(t, l) on lines 6 and 9.

2.4

Pointwise Mutual Information (PMI)

Recall the co-occurrence matrix from section 2.2. It is common in NLP to transform the co-occurrence matrix to what is called a Pointwise Mutual Information (PMI) matrix. PMI is a common measure of association in statistics and information theory and is defined as:

    \mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x) p(y)}          (2.11)

where x and y are samples from discrete random variables X and Y. It essentially tells how much information lies in the coincidence that both x and y were observed. The denominator can be seen as the probability of observing x and y at random, and naturally, if the variables are independent, the numerator equals the denominator, which gives a PMI of 0. If the joint probability p(x, y) is greater than that of observing the variables together at random, the PMI will become positive.

From a frequentist probability point of view, the co-occurrence matrix can easily be turned into a matrix of PMI values by estimating the probabilities as:

    p(w_i, w_j) = \frac{C_{ij}}{\sum_{i'j'} C_{i'j'}}          (2.12)

    p(w_i) = \frac{\sum_j C_{ij}}{\sum_{i'j'} C_{i'j'}}          (2.13)

    p(w_j) = \frac{\sum_i C_{ij}}{\sum_{i'j'} C_{i'j'}}          (2.14)

    C^{\mathrm{PMI}}_{ij} = \log \frac{p(w_i, w_j)}{p(w_i) p(w_j)}          (2.15)

However, a problem arises when we have unobserved co-occurrences. If C_ij = 0 for any i or j, the probability will be estimated as zero, which in turn will yield a PMI of −∞. It is often not reasonable to believe the probabilities to actually be 0 just because they are not observed in our corpus, which is why it is common to resort to a hack to overcome the problem. One alternative is to 'hallucinate' counts for the unobserved word-context pairs, by for example setting them to 1 instead of 0:

    C_{ij} := 1 \quad \forall (i, j) \in \{(i, j) : C_{ij} = 0\}          (2.16)

This is usually known as 'smoothing'. Another common alternative is to transform the PMI matrix C^{PMI} into a Positive PMI (PPMI) matrix. Essentially this disregards all negative values and sets them to zero:

    C^{\mathrm{PPMI}}_{ij} := 0 \quad \forall (i, j) \in \{(i, j) : C^{\mathrm{PMI}}_{ij} < 0\}          (2.17)

The constraint introduces the claim that no p(w_i, w_j) can be smaller than the probability of observing the pair at random (p(w_i) p(w_j)), the viability of which can be argued. We will however apply the PPMI transformation when using PMI word embeddings in this thesis.
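A small numpy sketch of the PPMI transformation in equations (2.12)-(2.17) could look as follows (the count matrix is a toy example; only the PPMI clamp is applied, no count smoothing).

import numpy as np

def ppmi(C):
    total = C.sum()
    p_ij = C / total
    p_i = C.sum(axis=1, keepdims=True) / total
    p_j = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):    # unobserved pairs give log(0) = -inf
        pmi = np.log(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0.0)           # clamp negative (and -inf) values to zero

C = np.array([[4., 1., 0.],
              [1., 2., 1.],
              [0., 1., 3.]])
print(ppmi(C))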

2.5

Convolutional Neural Networks

The Convolutional Neural Network (CNN) was initially invented as an image classification model. It is inspired by the biological processes in the visual cortex of cats [Hubel and Wiesel, 1968], and more specifically, the cells it contains. The cells are known to be sensitive to local sub-regions of the visual field and can thus be seen as local feature detectors. Those are then connected in a complex manner to combine their local information into a collective picture of what is seen.

Figure 2.2: Illustration of the CNN with one convolutional layer with 4 feature maps, a max-pooling layer and lastly a fully connected layer.

The CNN models this cell arrangement through a set of linear filters, which act as such local feature detectors. The filters are convolved over the input in optionally multiple layers and lastly connected through a fully connected neural network to represent the complex arrangement of the cells. Even though the CNN was originally designed for two-dimensional image data, it has with some modifications been successfully applied to NLP as well. This section aims at describing the architecture used in this thesis, originally proposed by [Kim, 2014].

The inputs are the stacked word embeddings of the text document to be classified, as shown in figure 2.2. The resulting matrix is then convolved with a set of n two-dimensional filters

    w_i \in \mathbb{R}^{h_i \times d}, \quad i = 1, \dots, n          (2.18)

The two-dimensional filters stretch the whole width of the embedding matrix. The heights h_i may vary from filter to filter, and in figure 2.2 two filters of height 2 and 3 respectively are illustrated. Here, the convolutions omit the border effects and only include the positions where the filters fully cover the embeddings. This results in n so-called feature maps, which are one-dimensional since the filter widths equal the embedding length (d). We now apply max-pooling, which essentially just takes the maximum value of each feature map and can be seen to reflect the amount each feature detector has fired. These n maximum values constitute the input to a fully connected neural network layer that is used to decide the class of the document.

As for regularization, dropout is employed, which randomly drops out connections in the ultimate (fully connected) layer. That is, during forward propagation, each weight in the fully connected layer has a probability p of being set to zero. This prevents co-adaptation of the feature maps. Additionally, the l2-norm of each filter vector w_i is constrained: whenever it exceeds a threshold s, the vector is rescaled so that its l2-norm equals s.
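A minimal PyTorch sketch of such an architecture is given below. It is not the implementation used in the thesis; the filter heights and counts follow table 4.4, while the ReLU non-linearity and all names are assumptions, and the max-norm constraint on the filters is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, emb_dim=300, filter_heights=(3, 4, 5), n_per_height=100,
                 n_classes=2, dropout=0.5):
        super().__init__()
        # One 2D convolution per filter height; each filter spans the full embedding width d
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_per_height, kernel_size=(h, emb_dim)) for h in filter_heights)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_per_height * len(filter_heights), n_classes)

    def forward(self, x):
        # x: (batch, seq_len, emb_dim) -- the stacked word embeddings of each document
        x = x.unsqueeze(1)                                          # (batch, 1, seq_len, emb_dim)
        maps = [F.relu(conv(x)).squeeze(3) for conv in self.convs]  # n one-dimensional feature maps
        pooled = [m.max(dim=2).values for m in maps]                # max-pooling over each feature map
        h = self.dropout(torch.cat(pooled, dim=1))                  # dropout before the final layer
        return self.fc(h)                                           # fully connected classification layer

model = TextCNN()
logits = model(torch.randn(8, 50, 300))   # a batch of 8 documents of 50 words each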


3

Random Indexing modifications

Since the Random Indexing word embeddings are constructed unsupervised by parsing a large corpus, it is likely that the resulting embeddings are very general. If the corpus is very large and contains a broad range of topics, embeddings of words with multiple senses will end up as a combination of all of those senses. This may lower the expressiveness of the embeddings if they are going to be used in a very specific domain. Take the example of training a text categorization classifier within a financial context. Corpus occurrences of the words "bank" and "stock" in the senses of 'large collection' and 'inventory' may act as noise in this domain and clutter up the "bank" and "stock" embeddings.

One could think of other examples of this. Different word classes (nouns, verbs, adjectives, determiners etc.) may have a variable importance depending on the task at hand. In sentiment analysis we seek the degree to which a text document is associated with a concept such as 'positive' or 'negative'. It is then plausible that adjectives play a more important role in describing the contexts of words.

The aforementioned cases exemplify what Nilsson describes as 'redundant features' of the embeddings [Nilsson, 2015]. The hypothesis is that the performance of any classifier would increase if the embeddings were jointly trained with the classifier.

There are multiple ways of jointly training the embeddings with the classifier. In this thesis we focus on neural classifiers, where backpropagation can be applied to optimize the model parameters. We are going to use two distinct strategies to implement this hypothesis.


3.1

Backpropagation into word embeddings

When backpropagation is used as the optimization strategy, one can also treat the input as parameters to update. It is straightforward to take the derivatives of the objective function with respect to the input and apply stochastic gradient descent updates just as for the model parameters. This strategy will be referred to as Stochastic Gradient Descent Random Indexing, sgd ri.

This is almost equivalent to what is denoted BOW RI MLP in [Nilsson, 2015]. By feeding Bag-Of-Words features to a d-dimensional hidden (input) layer while having the word embeddings as the weights, the same thing is accomplished as backpropagating into the embeddings, as long as there is no non-linearity at the hidden (input) layer.
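A minimal PyTorch sketch of the sgd ri idea follows (toy sizes and illustrative names, not the thesis implementation): the pre-trained Random Indexing vectors are loaded into a trainable embedding layer, so the classification loss is backpropagated into the word vectors as well as into the model weights.

import numpy as np
import torch
import torch.nn as nn

vocab_size, d = 10_000, 2_000                # toy vocabulary size; the thesis uses d = 2,000
ri_vectors = np.random.randn(vocab_size, d)  # placeholder for the pre-trained RI context vectors

emb = nn.Embedding.from_pretrained(
    torch.as_tensor(ri_vectors, dtype=torch.float32), freeze=False)  # freeze=False: input is trainable
mlp = nn.Sequential(nn.Linear(d, 120), nn.Sigmoid(), nn.Linear(120, 1), nn.Sigmoid())
optimizer = torch.optim.SGD(list(emb.parameters()) + list(mlp.parameters()), lr=0.1)

token_ids = torch.tensor([[1, 5, 42, 7]])    # one toy document as word indices
doc_vec = emb(token_ids).sum(dim=1)          # summed document vector, as in the mlp model
loss = nn.functional.binary_cross_entropy(mlp(doc_vec), torch.ones(1, 1))
loss.backward()                              # gradients also reach emb.weight (the word vectors)
optimizer.step()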

3.2

Parameterize embeddings

The other strategy is to parameterize the word embeddings in some way, and optimize those parameters jointly with the task using backpropagation. The Random Indexing algorithm weights context words' influence based on their relative frequency according to equation (2.10). We would, however, like to weight context words not only depending on relative frequency but also on word class, to tackle the second example from above. This would require the whole corpus to be Part-Of-Speech tagged (to resolve each word's class), which could be a cumbersome task, especially for non-English corpora since Part-Of-Speech taggers may not be available. However, since word classes in many languages tend to show up in the same relative positions to each other, it could be enough to weight just depending on the context word's relative window position. In linguistics, this is what is commonly studied in 'word order typology'.

To describe the parametrization, recall the Random Indexing algorithm where we look in a streaming fashion at each word and its context in the corpus. The context vectors are constructed by summing up all the index vectors of all words occurring within a window around the focus word.

We believe relative positions within the window may have varying importance in describing the contexts of words, given the task. Look at figure 3.1 for an illustration. We will try to parameterize the window positions such that the index vectors are weighted conditioned on their relative position and the focus word before being added to the context vector.

This can be seen as a modification to equation (2.10), where we add an additional factor:

    h(t, l) = \theta_l^{w_t} \, e^{-c f_{t+l} / |V|}          (3.1)

This weighting can be seen as a form of attention mechanism, since we aim to focus our attention on certain parts of the word's contexts depending on the task at hand. This strategy will be referred to as Attention Random Indexing (att ri).

Figure 3.1: Illustration of the att ri parameterization where the window size is 2 + 2. Each word has its own set of weights.

To show that this parametrization is plausible, we fit the weights to optimize the embeddings' performance on a semantic similarity dataset called SimLex. This additional motivation can be found in section 3.2.2.

3.2.1

att ri implementation details

This section aims to explain some details of how the att ri modification was implemented. Recall equation (2.9), which describes how the context vectors are updated for each corpus occurrence. Assuming a finite corpus, the final context vectors can then be expressed as:

    v_i = \sum_{l=-k, \, l \neq 0}^{k} \; \sum_{t=1, \, w_t = w_i}^{|T|} h(t, l) \, e_{t+l}          (3.2)

With the expression from (3.1) inserted, it becomes:

    v_i = \sum_{l=-k, \, l \neq 0}^{k} \; \sum_{t=1, \, w_t = w_i}^{|T|} \theta_l^{w_t} \, e^{-c f_{t+l}/|V|} \, e_{t+l}          (3.3)

By careful inspection, the \theta_l^{w_t} can be moved outside the inner sum, while swapping the word index to i since w_t = w_i:

    v_i = \sum_{l=-k, \, l \neq 0}^{k} \theta_l^{w_i} \underbrace{\sum_{t=1, \, w_t = w_i}^{|T|} e^{-c f_{t+l}/|V|} \, e_{t+l}}_{\tilde{v}_l^i}          (3.4)

The rewrite now allows the inner sums to be calculated before fitting the \theta_l^{w_i}'s, which makes the algorithm much more efficient. In practice, this means we aggregate a context vector \tilde{v}_l^i for each relative window position l, for each word w_i.


Stacking these 2k context vectors as the columns of a matrix V_i and collecting the \theta_l^{w_i}'s in a vector yields:

    V_i = \begin{pmatrix} \tilde{v}_{-k}^i & \dots & \tilde{v}_{-1}^i & \tilde{v}_{+1}^i & \dots & \tilde{v}_{+k}^i \end{pmatrix}          (3.5)

    \theta_i = \begin{pmatrix} \theta_{-k}^{w_i} & \dots & \theta_{-1}^{w_i} & \theta_{+1}^{w_i} & \dots & \theta_{+k}^{w_i} \end{pmatrix}^T          (3.6)

Equation (3.4) can now be rewritten as a matrix-vector multiplication:

    v_i = V_i \theta_i          (3.7)

In other words, this suggests that instead of aggregating context vectors v_i according to (2.9), we aggregate matrices V_i upon parsing the corpus. The word vectors are then calculated as a multiplication with a parameter vector θ_i according to (3.7). Note that when θ_i = 1 you recover the vanilla Random Indexing embeddings.
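A small numpy sketch of equation (3.7) (placeholder data, illustrative names): with θ_i = 1 the plain Random Indexing vector is recovered, while a fitted θ_i re-weights the window positions.

import numpy as np

d, k = 2000, 10
V_i = np.random.randn(d, 2 * k)         # placeholder for the aggregated \tilde{v}_l^i columns
theta_i = np.ones(2 * k)                # initialisation used in the thesis (theta = 1)

v_vanilla = V_i @ theta_i               # equals V_i.sum(axis=1): plain Random Indexing
theta_i_fitted = np.random.rand(2 * k)  # stand-in for weights learned by backpropagation
v_att = V_i @ theta_i_fitted            # att ri embedding for this word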

3.2.2

Motivation: SimLex

To further motivate the viability of the proposed parameterization, we hereby present an experiment to show its expressiveness. SimLex [Hill et al., 2014] is a similarity dataset comprised of 1,000 word pairs. Each pair has associated annotations such as part-of-speech tag, concreteness of the words (1 to 7) and a similarity rating (from 0 to 10). The similarity scores are constructed to reflect true similarity rather than relatedness between words. This means that, for example, antonyms (as opposed to synonyms) are given a low similarity score, even though they can be argued to relate to each other. As an example of this, the word pair ('plane', 'jet') receives a similarity score of 8.1 while pairs like ('leg', 'arm') receive only 2.88.

It is common to quantify two words' similarity by their vectors' mutual distance, according to some distance metric. One could argue it desirable to have a word space model in which mutual distances reflect the similarity scores provided by SimLex.

Typically, word spaces are evaluated against SimLex by computing the Spearman Rank Correlation between the word vector distances and the SimLex scores. The Spearman Rank Correlation measures to which degree the relationship between the two can be described by a monotonic function. It returns a value between −1 and +1, where the limit cases occur whenever the relationship is perfectly monotonic.
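For illustration, such an evaluation can be computed with scipy (the arrays here are made-up examples, not thesis results):

import numpy as np
from scipy.stats import spearmanr

cosine_sims = np.array([0.71, 0.12, 0.55, 0.30])   # model similarities for four word pairs
simlex_scores = np.array([8.1, 2.88, 6.0, 3.5])    # corresponding SimLex ratings
rho, p_value = spearmanr(cosine_sims, simlex_scores)
print(f"Spearman rank correlation: {rho:.2f}")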

To motivate the suggested parameterization of Random Indexing, an experiment was performed to see how much the Spearman Rank Correlation could be improved by fitting the θ_i's such that the cosine similarity between the word vectors would correspond to the SimLex scores.

Formally, we seek to minimize the following objective function:

    \min_{\theta_*} \sum_{(w_i, w_j) \in S} \underbrace{\tfrac{1}{2} \left( \cos \alpha_{ij} - s(w_i, w_j) \right)^2}_{f(w_i, w_j)}          (3.8)

where (w_i, w_j) ∈ S corresponds to each word pair in SimLex, s(w_i, w_j) is the SimLex similarity score for the word pair (scaled to [0, 1]) and cos α_ij is the cosine similarity between the words' corresponding vectors:

    \cos \alpha_{ij} = \frac{v_i^T v_j}{\|v_i\|_2 \|v_j\|_2}          (3.9)

where v_i and v_j are w_i's and w_j's corresponding word vectors, calculated as in equation (3.7). Since this is a non-convex problem, Stochastic Gradient Descent is applied as the optimization strategy. Calculating the gradient of f with respect to θ_i and θ_j is straightforward:

    \frac{\delta f}{\delta \theta_i} = \left( \cos \alpha_{ij} - s(w_i, w_j) \right) \frac{\delta \cos \alpha_{ij}}{\delta \theta_i}          (3.10)

    \frac{\delta f}{\delta \theta_j} = \left( \cos \alpha_{ij} - s(w_i, w_j) \right) \frac{\delta \cos \alpha_{ij}}{\delta \theta_j}          (3.11)

Applying the chain rule, the gradient of cos α_ij becomes:

    \frac{\delta \cos \alpha_{ij}}{\delta \theta_i} = \frac{\delta \cos \alpha_{ij}}{\delta v_i} \frac{\delta v_i}{\delta \theta_i}, \qquad
    \frac{\delta \cos \alpha_{ij}}{\delta v_i} = \frac{v_j \|v_i\|_2 - v_i \frac{v_i^T v_j}{\|v_i\|_2}}{\|v_i\|_2^2 \|v_j\|_2}, \qquad
    \frac{\delta v_i}{\delta \theta_i} = V_i          (3.12)

The expression for \frac{\delta \cos \alpha_{ij}}{\delta \theta_j} is the same, but with the subscripts interchanged.

We now apply Stochastic Gradient Descent to optimize the θ_i's iteratively:

    \theta_i^{(t+1)} = \theta_i^{(t)} - \eta \frac{\delta f}{\delta \theta_i}, \qquad
    \theta_j^{(t+1)} = \theta_j^{(t)} - \eta \frac{\delta f}{\delta \theta_j}          (3.13)

Table 3.1: Hyper-parameter settings for the att ri embeddings

    Parameter   Value   Description
    d           2,000   Dimensionality
    k           10      Window size
    c           60      Constant in frequency weight function
    ε           10      Number of non-zero elements in index vectors, randomly drawn from {−1, +1}

Table 3.2: Results of the SimLex experiment. The Spearman correlation is drastically improved with optimized θ_i's

                                 Avg. error   Spearman Correlation
    Initial θ_i's (θ_* = 1)      0.28         0.21
    Optimized θ_i's              0.19         0.62

This procedure was performed using V_* matrices generated by parsing a Wikipedia dump from 2010 with the Random Indexing hyper-parameters listed in table 3.1. The θ_i's were initialized to one-vectors (θ_* = 1) and updated according to equation (3.13) with a learning rate η = 1.0 until convergence.

The results are summarized in table 3.2. We can see that the Spearman correlation is drastically improved with the optimized θ_i's. This experiment can be seen as, for each word w_i, finding a linear combination in the column space of V_i that optimizes the cosine similarity of the word vectors to match the SimLex similarity scores. It is remarkable that optimizing the θ_i's in the relatively small 20-dimensional (R^{2k}) subspaces of the full word space (R^{2000}) yields such a big improvement.

This finding should motivate the parameterization as viable for improving the performance in text classification as well.
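A condensed Python sketch of this fitting procedure is given below (toy data and illustrative names, not the thesis code); it applies the updates in (3.13) with the gradients from (3.10) and (3.12).

import numpy as np

d, k, eta = 2000, 10, 1.0
rng = np.random.default_rng(0)
V = {w: rng.standard_normal((d, 2 * k)) for w in ["plane", "jet", "leg", "arm"]}   # placeholder V_i matrices
theta = {w: np.ones(2 * k) for w in V}                                             # theta initialised to 1
pairs = [("plane", "jet", 8.1 / 10), ("leg", "arm", 2.88 / 10)]                    # SimLex scores scaled to [0, 1]

def d_cos_d_v(vi, vj):
    # Gradient of the cosine similarity with respect to vi, as in equation (3.12)
    ni, nj = np.linalg.norm(vi), np.linalg.norm(vj)
    return (vj * ni - vi * (vi @ vj) / ni) / (ni ** 2 * nj)

for _ in range(100):
    for wi, wj, s in pairs:
        vi, vj = V[wi] @ theta[wi], V[wj] @ theta[wj]
        err = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj)) - s              # (cos - s)
        theta[wi] -= eta * err * (V[wi].T @ d_cos_d_v(vi, vj))                     # update (3.13) for theta_i
        theta[wj] -= eta * err * (V[wj].T @ d_cos_d_v(vj, vi))                     # and for theta_j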


4

Experiments

In this chapter, the experiments and their constituents are further explained. Two suggested enhancements to the Random Indexing embeddings for classification tasks are also hypothesized, implemented and evaluated. Referring to figure 1.1, each experiment can be seen as combining an embedding, a model and a benchmark dataset. The experiment then consists of training the selected model, using the selected word embedding as input, on the classification task defined by the selected dataset.

In total, 8 distinct embeddings, including the two Random Indexing modifications, are combined with 2 models and evaluated on 2 benchmark classification datasets. For consistency, all embeddings are pre-trained on the same underlying corpus: a Wikipedia dump from 2010 of 1.2 billion tokens.

4.1

Word embeddings

Eight distinct types of embeddings are evaluated in the study:

• bow - Bag-Of-Words

We will use simple Bag-Of-Words embeddings, which will also work as a baseline for the other embeddings.

• rand - Random

2000-dimensional dense embeddings randomly drawn from a uniform distribution U(−0.25, 0.25) will also be used as a baseline.

• pmi - Pointwise Mutual Information

We are going to use word embeddings from a PPMI matrix, as explained in section 2.4, constructed from the Wikipedia corpus concatenated with the training set of the IMDB corpus1. However, to constrain the dimensionality of the word vectors, the columns of C^PPMI corresponding to the 50,000 most frequent words are kept, while all others are removed.

• ri - Random Indexing

Random Indexing embeddings, as described in section 2.3, will be evaluated. The parameters used are listed in table 4.1. Each vector is normalized to be of unit l2 length.

Table 4.1: Hyper-parameter settings for the Random Indexing embedding

    Parameter   Value   Description
    d           2,000   Dimensionality
    k           2       Window size
    c           60      Constant in frequency weight function
    ε           10      Number of non-zero elements in index vectors, randomly drawn from {−1, +1}

• sgd ri - Stochastic Gradient Descent Random Indexing

As described in section 3.1, Random Indexing embeddings that are updated using stochastic gradient descent will be used in the study. The initial embeddings are the same as the ri embeddings above.

• att ri - Attention Random Indexing

As described in section 3.2, the parameterized Random Indexing embeddings will also be used in the study. The word weights θ_i will all be initialized to 1 and updated using backpropagation along with the model parameters. The Random Indexing parameters are listed in table 3.1. Note that the window size k is increased to 10 to be able to capture context information further out.

• sg - Skip-Gram

Skip-Gram embeddings, as proposed in [Mikolov et al., 2013a,b] and trained by the word2vec2 tool, are also included in the study. The parameters are listed in table 4.2.

• gl - GloVe

The GloVe embedding was proposed in [Pennington et al., 2014] and was trained with the hyper-parameter settings listed in table 4.3.

1 http://ai.stanford.edu/∼amaas/data/sentiment/
2 https://code.google.com/p/word2vec/


Table 4.2: Hyper-parameter settings for the Skip-Gram embedding. k is chosen as 2 for consistency with ri. d is set to 300, which is a commonly used dimensionality for Skip-Gram. The other values are the default values of the word2vec tool.

    Parameter       Value   Description
    d               300     Dimensionality
    k               2       Window size
    negative        5       Number of negative samples per positive
    down sampling   no      Whether down sampling is applied
    α               0.025   Initial learning rate
    iter            5       Number of iterations

Table 4.3: Hyper-parameter settings for the GloVe embedding. k is chosen as 2 for consistency with the ri embedding and d as 300 for consistency with sg. The other values are the default values of the GloVe tool.

    Parameter   Value   Description
    d           300     Dimensionality
    k           2       Window size
    iter        15      Number of training iterations
    x-max       10      Cutoff in weighting function
    α           0.75    Constant in exponent of weighting function


Figure 4.1: Illustration of the mlp model. The word embeddings are summed to form a document vector, which is then fed into a neural network. The predicted class is treated as 0 if the output node's value is < 0.5, else 1.

4.2

Models

We are only going to look at neural models in this thesis, since backpropagation seamlessly allows for optimizing our Random Indexing modification parameters jointly with the model.

• mlp - Multi-Layered Perceptron

We use a standard neural network [Rumelhart et al., 1986] with one hidden layer of 120 nodes with sigmoid activations and one sigmoid output unit. All word vectors are normalized to an l2 norm of 1 and naively summed into a document vector, as seen in figure 4.1. The weights in the neural network are also l2 regularized with a constant factor of λ = 0.001.

• cnn - Convolutional Neural Network

We will use the model proposed by [Kim, 2014] which implements a convolutional neural network for Natural Language Processing, further explained in section 2.5. The hyper-parameters used are listed in table 4.4. Like the mlp model, the word embeddings are also normalized to unit length.

4.3

Datasets

The two datasets that the embeddings and the models are going to be trained and evaluated on are listed here. Both of them are benchmark sentiment datasets in the field.


Table 4.4: Hyper-parameter settings for the cnn model

    Parameter   Value   Description
    n           300     Number of filters; 100 each of height 3, 4, 5
    p           0.5     Dropout rate
    s           3       Filter max l2-norm

Remark 4.1. Note that the embeddings were pre-trained (unsupervised) on Wikipedia data, which is not to be confused with the supervised task specific datasets mentioned here.

• pl05 - Pang & Lee, Sentence Polarity Dataset v1.0

Pang and Lee, Sentence Polarity Dataset v1.0 is a sentiment analysis dataset comprised of 10,662 short movie reviews released by Rotten Tomatoes3. The dataset was introduced in [Pang and Lee, 2005] and will be referred to as pl05. The sentences are to be classified as either positive or negative, and this is hence a binary classification problem. Experiments using this dataset are split into 25% test and 75% train/validation sets and evaluated by 5-fold cross validation on the training/validation set. We make two consecutive runs, in total 10 trainings, and report their maximum, minimum and mean accuracy as well as their standard deviation. The same splits are used in all experiments with pl05.

• sst - Stanford Sentiment Treebank

Stanford Sentiment Treebank is an extension of pl05 with train/validation/test splits provided. The dataset also provides fine-grained labels (very positive, positive, neutral, negative, very negative). In this study we have however omitted the neutral labels and treated it as a binary classification problem by merging the very positive, positive, very negative and negative classes into two. We report the maximum, minimum and mean accuracy as well as the standard deviation of 10 consecutive runs using the provided train/val/test splits.

4.4

Experimental Setup

A conceptual illustration of the experimental setup can be seen in figure 1.1. We aim to run all possible combinations of embedding, model and dataset and present the results in section 4.5. Some combinations were however not suitable to run and have thus been excluded from the results. Those include the high-dimensional embeddings (bow and pmi) combined with the cnn model. The reason is the huge increase in model parameters such a combination would yield, which simply did not fit in the memory of the hardware used.

Statistics of the sizes of the two datasets and their vocabulary intersections with the embeddings are found in table 4.5.

Table 4.5: Statistics of the embeddings and datasets after tokenization. N: dataset size, |V|: vocabulary size, |V_pre|: vocabulary size present in the given embedding.

                      pl05      sst
    N                 10,662    9,613
    |V|               18,765    16,186
    |V_pre^bow|       17,794    15,898
    |V_pre^rand|      17,794    15,898
    |V_pre^pmi|       13,725    12,595
    |V_pre^ri|        17,863    15,936
    |V_pre^sgd ri|    17,863    15,936
    |V_pre^att ri|    17,862    15,935
    |V_pre^sg|        17,794    15,898
    |V_pre^gl|        17,794    15,898

4.5

Results

The results of all experiments are shown in table 4.6. The accuracies for the pl05 dataset are slightly lower than for sst, which is consistent with the results of others [Nilsson, 2015, Kim, 2014, Zhang and Wallace, 2015].

Model comparison

Comparing the mlp with the cnn, the latter consistently outperformed the former by about 3-6 points. This is not surprising, and one major difference is that the cnn does take the word order into consideration whereas the mlp does not. Another interesting observation is the fact that the variances are notably higher in the mlp model. Take the pmi mlp sst experiment for example: even though 8 of the 10 runs had an accuracy around ∼80%, two of them scored as little as ∼65%. The reason for this must be that the neural network gets stuck in severe local minima from which it cannot escape. In the cnn, dropout was applied as a regularizer at its fully connected layer, which may have contributed to its variances being so much lower. Since the objective of this thesis was to compare the embeddings rather than the models, this reasoning will not be developed any further.


Table 4.6: Experiment results. Mean accuracies in %. The parentheses contain the maximum achieved accuracy, the standard deviation and the minimum achieved accuracy of the consecutive runs

    Emb + Model    pl05                                 sst
    bow mlp        73.99 (↑ 74.57, ±0.72, ↓ 72.62)      77.96 (↑ 79.68, ±1.58, ↓ 75.67)
    rand mlp       68.23 (↑ 69.58, ±0.98, ↓ 66.32)      69.89 (↑ 71.11, ±0.83, ↓ 68.64)
    pmi mlp        74.40 (↑ 76.89, ±2.71, ↓ 69.34)      77.07 (↑ 81.16, ±5.69, ↓ 64.36)
    ri mlp         72.45 (↑ 74.91, ±2.99, ↓ 66.54)      75.13 (↑ 77.98, ±2.54, ↓ 73.75)
    sgd ri mlp     73.62 (↑ 74.16, ±0.77, ↓ 71.42)      77.91 (↑ 78.80, ±0.82, ↓ 76.11)
    att ri mlp     72.45 (↑ 74.83, ±2.20, ↓ 68.90)      78.03 (↑ 79.63, ±1.60, ↓ 74.46)
    sg mlp         73.84 (↑ 74.76, ±1.14, ↓ 71.57)      77.27 (↑ 79.57, ±3.13, ↓ 70.02)
    gl mlp         71.29 (↑ 73.67, ±1.67, ↓ 68.60)      76.26 (↑ 77.32, ±1.66, ↓ 71.61)
    bow cnn        -                                    -
    rand cnn       72.12 (↑ 72.91, ±0.50, ↓ 71.28)      76.99 (↑ 77.94, ±0.76, ↓ 75.50)
    pmi cnn        -                                    -
    ri cnn         76.18 (↑ 76.60, ±0.35, ↓ 75.51)      81.83 (↑ 82.72, ±0.39, ↓ 81.39)
    sgd ri cnn     75.67 (↑ 76.26, ±0.63, ↓ 74.22)      81.31 (↑ 81.89, ±0.38, ↓ 80.78)
    att ri cnn     77.55 (↑ 78.08, ±0.31, ↓ 77.09)      81.77 (↑ 82.64, ±0.60, ↓ 80.66)
    sg cnn         77.92 (↑ 78.34, ±0.24, ↓ 77.55)      83.44 (↑ 84.00, ±0.51, ↓ 82.00)
    gl cnn         77.35 (↑ 77.77, ±0.34, ↓ 76.79)      81.56 (↑ 82.06, ±0.34, ↓ 80.78)

BOW embeddings

The classical Bag-Of-Words embedding performs surprisingly well despite its lack of inherent semantic information. With the second highest score on both mlp pl05 and mlp sst, it is among the most performant word embeddings here. However, as mentioned earlier, it does have scaling limitations when used with the cnn model. Since our cnn had 300 filters of size {3, 4, 5} × d, it quickly becomes very memory demanding, which is also why it could not be run. Nevertheless, this experiment was performed in another study, yielding an accuracy of 77.83 on pl05 and 79.80 on sst, which is in line with our other cnn results [Zhang and Wallace, 2015].

PMI embeddings

The PPMI-transformed embeddings have the same issues as the bow embedding due to their high dimensionality, which is also why they were not run with the cnn model. An interesting finding is that the Skip-Gram algorithm is indeed factorizing a shifted PMI matrix [Levy and Goldberg, 2014], which makes it reasonable to believe their results should be similar. However, due to the high variances of the mlp model, it is hard to draw any such conclusions even though the mean accuracies of pmi and sg look similar.


RI embeddings

The results of the vanilla Random Indexing embeddings provide a baseline for the two proposed "adjusted" embeddings. Inspecting the ri mlp numbers, they fall around the average among the other embeddings.

SGD RIembeddings

When the ri embeddings were allowed to be updated using stochastic gradi-ent descgradi-ent, the the hope was that the prediction accuracy would increase. This was the case in the sgd ri mlp experiments, at least in terms of the mean accuracy. However, looking at the sgd ri cnn results, the accuracies actually went down. Nevertheless, the updating seems to stabilize the results; the sgd ri mlpvariances are significantly smaller compared to ri mlp.

Another interesting observation, though not reported here, was that convergence was much faster in all experiments with sgd ri compared to ri: the peak validation accuracy was reached in significantly fewer epochs using updateable embeddings. This does not, however, necessarily mean that the model generalizes better.

One could also wonder what impact the updating of the embeddings really had. A visualization of the embeddings before and after the updates is therefore shown in the Appendix. The updated embeddings are projected onto their two major principal axes and plotted in figure A.2. For comparability, the original embeddings are projected onto the same axes and plotted in figure A.1. Recall that the sst dataset consists of movie reviews, and by inspection we can see some tendencies that positive words ("best", "positive", "great", "excellent", etc.) show up on the left while words that are negative in this context ("repetitive", "poor", "obvious", "irrelevant", "unnecessary", etc.) end up to the right. This pattern is hard to see in the original embeddings, and should thus be due to the updating.
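A minimal sketch of such a projection is given below, assuming the embeddings are available as matrices with one row per word; the principal axes are fitted on the updated embeddings so that the originals can be projected onto the same axes, as in figures A.1 and A.2. The placeholder matrices and the choice of scikit-learn and matplotlib are ours, not necessarily the thesis setup.

```python
# Minimal sketch: project embeddings onto the two major principal axes of the
# *updated* embeddings and reuse those axes for the originals.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
updated = rng.normal(size=(1000, 2000))    # stand-in for the fine-tuned embeddings
original = rng.normal(size=(1000, 2000))   # stand-in for the embeddings before training

pca = PCA(n_components=2).fit(updated)     # axes defined by the updated vectors
proj_updated = pca.transform(updated)
proj_original = pca.transform(original)    # same axes, for comparability

plt.scatter(proj_updated[:, 0], proj_updated[:, 1], s=2)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```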

ATT RI embeddings

The att ri embeddings performed well, with the highest mean accuracy on the sst dataset using the mlp model. All θi were initialized to all-ones vectors and updated using backpropagation. The resulting weights for some adjectives, determiners and nouns are plotted in figures 4.2, 4.3 and 4.4 respectively. A more thorough analysis of these results is found in chapter 5.

[Figure 4.2: The learned weights for four adjectives ("good", "reliable", "positive", "bad"), plotted as weight against relative window position (-10 to 10).]
[Figure 4.3: The learned weights for four determiners ("the", "and", "of", "but").]
[Figure 4.4: The learned weights for four nouns ("tolkien", "show", "tv", "warfare").]

SG embeddings

The Skip-Gram embeddings score the best accuracy in terms of both peak and average performance with the cnn model, on both datasets. This is interesting since Skip-Gram has also previously been shown to perform well with the CNN. The discussion is continued in chapter 5.

GL embeddings

The GloVe embeddings, despite their similarities to the Skip-Gram embeddings, consistently underperformed Skip-Gram by a few points. This is in contrast to the experiments in [Zhang and Wallace, 2015], where the difference was minor.


5 Discussion

Part of the aim of this thesis was to gain more insight into the impact different word embeddings have on the performance of text classification. We constrained ourselves to binary sentiment analysis of short movie reviews, using two neural supervised learning models.

To make a fair comparison of the different embeddings, they were all trained on the same background corpus, a Wikipedia dump from 2010 of 1.2 billion tokens. The one exception was the pmi embedding, whose background corpus, for unfortunate practical reasons, also included the unsupervised training data of the IMDB dataset¹.

In the results of our experiments, Skip-Gram came out on top in the cnn model, whereas pmi and the proposed att ri embeddings performed best in the mlp model. However, there are a number of things worth pointing out here.

First of all, the performance disparities among the embeddings are in most cases so small that the embeddings are practically equivalent. This is especially the case for the mlp experiments, where the variances reach over two points in many cases. It is therefore doubtful, based on these results, to conclude that any single embedding is superior to the others in the general case. To further support this, there are also many inter-embedding differences that are excluded from this study, for example how much influence the embedding dimensionality per se has on the classification performance and variance. sg and gl had a dimensionality of 300 whereas rand, ri, sgd ri and att ri had 2,000. What influence that might have had remains unknown. It is plausible that the variance increases with higher dimensionality due to a much larger parameter space with more local minima.

¹ http://ai.stanford.edu/~amaas/data/sentiment/


We also compared the 'classical' high-dimensional embeddings (bow, pmi) to the dense low-dimensional ones (rand, ri, sgd ri, att ri, sg, gl). The bow has the advantage of being extremely simple and requiring no unsupervised pre-training. The disadvantage is its high dimensionality, which makes it impractical for use in state-of-the-art models such as the cnn. Inspecting the mlp results, there is no significant disparity, since all numbers fall within each other's standard deviations, with rand as the exception.

This suggests that the semantic knowledge embedded in the low-dimensional dense embeddings does not assist the classifier significantly. On the other hand, the randomized rand vectors clearly underperform all other embeddings, which argues in the other direction. [Kim, 2014, Zhang and Wallace, 2015] also managed to push the boundaries up to 80.10 for pl05 and 84.88 for sst using Skip-Gram embeddings pre-trained on a much larger 100 billion token Google News dataset. We believe this increased performance is partly due to the bigger dataset. Another factor could be that the language style in news articles is more similar to the movie reviews than Wikipedia is, arguably yielding better-suited embeddings.

Updating the Skip-Gram embeddings for the cnn, just like in sgd ri, has also been studied in [Zhang and Wallace, 2015]. They saw a performance boost of about ∼0.8%, unlike our sgd ri cnn results, which instead fell. This could be due to the ri embeddings being of higher dimensionality than sg, yielding a larger and harder parameter space to optimize.

ATT RI analysis

When the att ri parametrization was proposed, the hypothesis was that certain relative window positions would be more important in describing the context of a word than others. Moreover, the further away from the focus word you get, the less related the context word is in general. We hoped the system would learn this, as well as pick up deviations to handle, for example, word classes differently. We initialized all window weights θi to all-ones vectors and updated them using backpropagation for the task at hand, sentiment analysis in our case.

The learned weights for some adjectives, determiners and nouns are plotted in figures 4.2, 4.3 and 4.4 respectively. We hoped to see a 'hill'-like curve tapering towards zero at the edges. This shape is partially apparent for some of the words, for example "and", "of", "good" and "bad", but for most words the weights are almost unchanged. We believe this could be due to the vanishing gradient problem, where the gradient seems to vanish deeper down the model. In addition, the less common a word is in the training set, the less it is updated.

By inspecting the l1 norm of the weight vectors, we get a hint of the words' relative importance given the task. We can see that the l1 norm for the words
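A minimal sketch of this kind of inspection follows, assuming the learned window weights are available as one vector of position weights per word; the words and weight values below are made-up placeholders, not the weights learned in the thesis.

```python
# Minimal sketch: rank words by the l1 norm of their learned window-weight
# vectors as a proxy for task relevance. All values are made-up placeholders.
import numpy as np

learned_weights = {                       # word -> weights over window positions
    "good": np.array([0.3, 0.9, 1.6, 2.0, 1.7, 0.8, 0.2]),
    "bad":  np.array([0.2, 1.0, 1.7, 1.9, 1.6, 0.9, 0.3]),
    "the":  np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]),   # unchanged from init
}

importance = {w: np.abs(v).sum() for w, v in learned_weights.items()}
for word, l1 in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}  l1 = {l1:.2f}")
```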
