Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data


MARTIN EKLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: June 27, 2018
Supervisor: Håkan Lane
Examiner: Olof Bälter
Swedish title: Utvärdering av Metoder för Extraktion av Särdrag och Förbehandling av Data för Multi-Taggning av Textdata


Abstract

This thesis aims to investigate how different feature extraction methods applied to textual data affect the results of multi-label classification. Two different Bag of Words extraction methods are used, specifically the Count Vector and the TF-IDF approaches. A word embedding method is also investigated, called the GloVe extraction method.

Multi-label classification can be useful for categorizing items, such as pieces of music or news articles, that may belong to multiple classes or topics. The effect of using different pre-processing methods is also investigated, such as the use of N-grams, stop-word elimination, and stemming. Two different classifiers, an SVM and an ANN, are used for multi-label classification using a Binary Relevance approach. The results indicate that the choice of extraction method has a meaningful impact on the resulting classifications, but that no one method consistently outperforms the others. Instead, the results show that the GloVe extraction method performs the best for the recall metrics, while the Bag of Words methods perform the best for the precision metrics.


Summary (Sammanfattning)

This work aims to investigate what effect different methods for extracting features from textual data have when the features are used for multi-label classification of that data. Two methods based on Bag of Words are examined, namely the Count Vector method and the TF-IDF method. A method that uses word embeddings, called the GloVe method, is also examined. Multi-label classification can be useful when the data, for example pieces of music or news articles, can belong to several classes or topics. The use of several pre-processing methods is also examined, such as N-grams, elimination of uninteresting words (stop-words), and the transformation of words with different inflections into a common stem (stemming). Two classifiers, an SVM and an ANN, are used for the multi-label classification by means of a method called Binary Relevance. The results show that the choice of feature extraction method plays a meaningful role in the resulting multi-label classification, but that no single method gives the best results across all tests. Instead, the results indicate that the GloVe-based extraction method performs best on the recall metrics, while the Bag of Words methods perform best on the precision metrics.

Contents

1 Introduction
1.1 Problem Statement
1.2 Scope
1.3 Objective

2 Background
2.1 Multi-Label Classification
2.1.1 Methods
2.2 Content-Based Recommendation
2.2.1 Exploitation and Exploration
2.3 Pre-Processing
2.3.1 Tokenization and N-grams
2.3.2 Stemming
2.3.3 Stop-Word Elimination
2.4 Feature Extraction
2.4.1 Bag of Words
2.4.2 Term Frequency-Inverse Document Frequency
2.4.3 Word Embeddings and GloVe
2.5 Classifiers
2.5.1 Support Vector Machine
2.5.2 Artificial Neural Networks
2.6 Related Work
2.6.1 Word Embeddings for Single-Label Classification
2.6.2 Bag of Words for Multi-Label Classification
2.6.3 Research Gap

3 Method
3.1 Dataset
3.2 Pre-Processing
3.3 Extraction Methods
3.3.1 TF-IDF
3.3.2 Bag of Words/Count Vector
3.3.3 GloVe
3.4 Classifiers
3.5 Evaluation
3.5.1 Precision
3.5.2 Recall
3.5.3 F-Score

4 Results
4.1 TF-IDF
4.2 Count Vector
4.3 GloVe
4.4 Effect of N-grams
4.5 Stop-Words
4.6 Stemming
4.7 Best Scores

5 Discussion
5.1 Effects of Preprocessing
5.1.1 N-grams
5.1.2 Stop-Word Elimination
5.1.3 Stemming
5.2 Extraction Methods
5.2.1 TF-IDF and Count Vector
5.2.2 Bag of Words vs. Word Embeddings
5.3 Concerns
5.4 Future Work

6 Conclusion

Bibliography

A Appended Material

Chapter 1 Introduction

The amount of digital content available on the Internet is steadily growing, and it can prove challenging to make the best use of the vast amount of data that is available. One way to handle this problem is to classify data into different categories, thus giving a better overview of what kinds of data are available. This can, for instance, enable users of a news site to filter for the articles that they are interested in, or enable users on social media to locate photos of themselves.

In order not to have to do this procedure of categorization manually, one can instead employ machine learning techniques to automate the process. Such techniques usually require distinguishing features of an object in order to be able to classify it.

There are several different ways to extract features from different types of data, and different methods may prove suitable in different situations. For instance, when classifying images of apples and oranges, a crude extraction method could be to take the average pixel values of the images. If the average pixel value is close to orange, then the image would be classified as an orange, and if it is closer to green then it would be classified as an apple. This method would however most likely also classify a tiger as an orange, since it does not take any feature other than color into consideration. By choosing a more appropriate extraction method this could hopefully be avoided. Instead of only taking the average pixel values into account, one could also consider features such as the shape, size and texture of the object. Taking these features into account, an image of a tiger would most likely not be classified as an orange.

Categorized data items can, among other things, be used to provide recommendations of similar items to a user depending on which categories the user has previously shown an interest in. A multi-label approach could somewhat mitigate the problem of Exploitation and Exploration described in section 2.2.1. This could be done by recommending items that might not be the ones most similar to the user's viewing history, but that still lie relatively close to other categories that the user has previously shown an interest in.

1.1 Problem Statement

The aim of this report is to evaluate how different methods for feature extraction affect a multi-label classification problem with textual data.

The effect of different pre-processing methods is also investigated. If the choice of extraction method has a significant impact on the final classification results, then it might be better to choose the right extraction method rather than spend too much time optimizing the classifier itself.

The questions that this report will attempt to answer are:

• Which one of the three feature extraction methods (Count Vector, TF-IDF and GloVe) performs the best?

• Does one of these feature extraction methods perform the best even when used with different classifier models?

• Are commonly used pre-processing methods always useful when applied to different feature extraction methods?

1.2 Scope

The main goal of this thesis is to examine the effect of feature extraction methods on multi-label classification of textual data. Three feature extraction methods are evaluated in conjunction with two classifier models. It would be possible to include more classifier models, but two should be enough to examine whether or not the extraction methods yield similar results when used with different classifiers.

The resulting classifications may in turn aid in building a simple recommendation system, but it is outside the scope of this thesis to evaluate such a system. Instead this work is to be viewed as examining the classification foundation for such a system. This work is being done in association with the Swedish Pensions Agency (Pensionsmyndigheten), who are interested in developing a prototype for a recommender system.

1.3 Objective

The objective of this report is to investigate how different approaches to feature extraction affect the results of multi-label classification. The aim of any classification task is to obtain results that are as good as possible. If the choice of feature extraction method has a great impact, it might be worth spending more time on selecting an appropriate feature extraction method rather than on optimizing a certain classifier. This report also investigates whether it is always useful to apply commonly used pre-processing methods to text data when different extraction methods are used. Hopefully the results of this work could prove useful for better classification of multi-label data.


Chapter 2 Background

2.1 Multi-Label Classification

In single-label classification a data sample is assigned only one label (or category) from a set of disjoint labels. If the size of this set is equal to two then the task is known as binary classification. When the size of the set of labels is greater than two the task is called multi-class classification [12].

When it comes to multi-label classification, a data-point can be assigned multiple labels. This approach to classification is useful when the data to be classified can belong to multiple separate classes [12]. One example is music categorization, where a song can often relate to multiple different genres.

Giving news articles multiple different labels can enable users to sort out only the articles relating to topics that they are interested in [6]. With single-label classification, an article that relates to several topics would appear only under the topic it is most strongly related to, and would thus be omitted from the others. For instance an article about North Korea and the Olympics might be strongly related to sports, but would also be of interest for someone wanting to read articles about global politics. Having multiple labels for news articles can also provide a foundation for a recommender system [6]. Multi-label classification has also proven to be useful for protein function classification and semantic classification of images [12].


2.1.1 Methods

There are two main approaches for multi-label classification, known as algorithm adaptation methods and problem transformation methods.

Algorithm adaptation methods transform a specific algorithm to be able to handle multi-label data. This approach is algorithm dependent, and it can prove challenging to adapt an algorithm to the multi-label classification task. An example of this method is the ML-kNN algorithm [24], which extends the well-known kNN (k-Nearest Neighbors) algorithm to handle multi-label data. kNN classifies a sample based on the classes of the k nearest neighbors of the sample.

Problem transformation methods on the other hand, are algorithm independent. Instead this approach uses several single-label classification tasks together to solve a multi-label classification problem, meaning that it is possible to take any single-label classification algorithm and use it for multi-label classification. One of the most popular problem transformation methods is called Binary Relevance. Using this method, one binary classifier is trained for each individual label. Each of these binary classifiers is then applied to the dataset, assigning either true or false for each data sample in the dataset depending on whether the classifier predicted that sample to belong to the label or not [12]. This is illustrated in figure 2.1. These results are then aggregated, meaning that a single data sample may be given multiple labels.

One possible drawback to this approach is that it does not take into account that some labels may be correlated with others. For instance a news article related to finance might be more likely to also be related to politics than to sports.
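The Binary Relevance procedure described above can be sketched in a few lines. The sketch below is only an illustration: the nearest-centroid base classifier, the two-dimensional feature vectors and the label names are assumptions made for the example and are not the classifiers or data used in this thesis. Any single-label binary classifier could take the place of `CentroidClassifier`.

```python
from typing import Dict, List, Sequence, Set

Vector = Sequence[float]


class CentroidClassifier:
    """Toy binary classifier (an assumption for this sketch):
    predicts True if the sample is nearer to the centroid of the positives."""

    def fit(self, X: List[Vector], y: List[bool]) -> "CentroidClassifier":
        self.centroids = {}
        for cls in (True, False):
            rows = [x for x, target in zip(X, y) if target == cls]
            # Average each feature over the rows belonging to this class.
            self.centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x: Vector) -> bool:
        def sq_dist(centroid: Vector) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroid))

        return sq_dist(self.centroids[True]) < sq_dist(self.centroids[False])


def fit_binary_relevance(
    X: List[Vector], Y: List[Set[str]], labels: List[str]
) -> Dict[str, CentroidClassifier]:
    """Train one independent binary classifier per label."""
    return {
        label: CentroidClassifier().fit(X, [label in y for y in Y])
        for label in labels
    }


def predict_binary_relevance(
    models: Dict[str, CentroidClassifier], x: Vector
) -> Set[str]:
    """Aggregate the per-label true/false decisions into a set of labels."""
    return {label for label, model in models.items() if model.predict(x)}


# Made-up feature vectors and label sets, for illustration only.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.8]]
Y = [{"finance"}, {"finance"}, {"sports"}, {"sports"}, {"finance", "sports"}]
models = fit_binary_relevance(X, Y, labels=["finance", "sports"])
predicted = predict_binary_relevance(models, [0.9, 0.9])
```

Because each label is handled by its own classifier, a sample can receive zero, one or several labels, and no information about label correlation is shared between the classifiers.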

2.2 Content-Based Recommendation

One approach for how to recommend an item to a user is what is called content-based recommendation. What this entails is that a user is recommended an item based on a profile of that user, and the properties of the items that can be recommended [16]. For instance a system recommending a movie to a user might take attributes such as genre, director, writer and actors into account. If a user has previously shown an interest in movies directed by Martin Scorsese, the recommender system may try to find other Martin Scorsese movies to recommend to the user. This is an example using structured data, where items are described by the same set of attributes and it is known which values these attributes can have [16].

Figure 2.1: Illustration of the Binary Relevance approach.

In contrast to structured data there is also unstructured data, such as news articles, where there are no predetermined attributes which can take on well-defined values. Instead we only have the free-flowing text of the articles, and the text of two separate articles can differ greatly from each other both in length and content. It can therefore be harder to deal with unstructured data compared to structured data due to the complexities of natural language. Some of the problems that arise are due to factors such as polysemy (words can have several different meanings) and synonymy (different words can have the same meaning) [16].

2.2.1 Exploitation and Exploration

When recommending an item to a user, there is a trade-off between what is called Exploitation and Exploration. Exploitation means taking advantage of (or exploiting) the information available in the user’s profile. Exploration, on the other hand, means recommending items in which the user has not shown as great an interest as in some other items [21].

By relying too much on the Exploitation approach, a user might only receive recommendations that are very similar to previously visited items. This may result in a sort of echo chamber, where the user is recommended items so similar to each other over and over again that the recommendations become close to useless. An example of this is a user of a website selling books who has shown interest in a couple of books by Stephen King, and who is then recommended nothing but more books written by Stephen King. The user could most likely have found these books without the help of the recommendation system.

In contrast, when relying too much on Exploration a user might receive recommendations that have little, or nothing, to do with the items that the user has previously shown an interest in. An extreme example is a user who is recommended nothing but Tolstoy’s "War and Peace" and similar literature after having shown interest in nothing but cookbooks.

Finding a suitable balance between Exploitation and Exploration would ensure that the user is recommended relevant items that the user will actually find useful, while at the same time also being recommended items that the user otherwise probably would not have found on their own [21].

2.3 Pre-Processing

Previous research has shown that applying pre-processing methods to unstructured textual data may have a noticeable effect on resulting text mining [2]. The purpose of these pre-processing methods is to transform the data into a more manageable format for the feature extractor, and also to remove superfluous information.

2.3.1 Tokenization and N-grams

The process of tokenization divides the text data into pieces (or tokens), and often also removes certain special characters such as apostrophes, commas and periods [2]. The text is also often normalized to be lower-case only. The tokens usually consist of either a single word or what is called an N-gram, meaning that N consecutive words are combined into a single token. The idea is to preserve some of the information that is stored in the order of the words. For example, when using 2-grams the semantic difference between "cat calling" and "cat food" is preserved, whereas it would be lost when using 1-grams. As an illustration, the sentence "These are my tokens" would be split into the following tokens when using 2-grams: "These are", "are my", "my tokens". This approach has proven to yield better classification results in some cases [23].
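The 2-gram example above can be reproduced with a short function. This is a minimal sketch that assumes whitespace tokenization; the function name `ngrams` is chosen for the illustration and does not refer to any particular library.

```python
from typing import List


def ngrams(text: str, n: int) -> List[str]:
    """Split text on whitespace and join every run of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("These are my tokens", 2)
unigrams = ngrams("These are my tokens", 1)
```

With `n = 1` the function returns the individual words, and a text with fewer than `n` words yields no tokens at all.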

2.3.2 Stemming

Applying stemming to a word means breaking it down to its root, or stem. By using this technique words that have the same meaning, but different forms, are transformed into their common stem [2].

The idea behind employing this technique before feature extraction is to capture distinguishing features that otherwise might have been lost due to infrequency. For instance a short article containing the words "educator", "educational" and "educated" exactly once probably relates to education. However, since these three words are different from each other they might not be recognized as distinguishing features for the article. Instead they would be recognized as three separate features. By utilizing stemming these three words might be transformed into a common stem, thus enabling the possibility to recognize the three separate words as a single distinguishing feature.
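To make the idea concrete, the sketch below strips a few suffixes so that the three example words collapse into one feature. It is a deliberately naive illustration and not a published stemming algorithm: the suffix list and the minimum stem length are hand-picked assumptions for this example.

```python
# Hand-picked suffixes for the example words only; longest suffix first.
SUFFIXES = ("ional", "or", "ed")


def strip_suffix(word: str, min_stem: int = 3) -> str:
    """Remove the first matching suffix, unless the remaining stem is too short."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = {w: strip_suffix(w) for w in ("educator", "educational", "educated")}
```

All three words map to the same string, so a count-based extractor would see one feature occurring three times instead of three features occurring once each.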

There are several different algorithms that perform stemming, one of the more popular ones being the Porter stemming algorithm [19]. The Porter stemming algorithm works in part by rule-based suffix stripping of words. For instance the words "operate", "operation" and "operating" will all be stemmed to their common stem "oper". However, the possibility exists that words that do not share the same meaning get stemmed to the same stem.

2.3.3 Stop-Word Elimination

Words that are common across all documents in a corpus are not very useful for distinguishing documents from each other. Some of these can therefore be removed through a process called stop-word elimination. Conjunctions (for, and, but) and pre/post-positions (in, of, with) are examples of stop-words [2].
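Stop-word elimination amounts to filtering tokens against a fixed list. The sketch below uses only the six example words mentioned above as its stop-word list, which is an assumption for the illustration; a practical list would be considerably longer.

```python
from typing import List

# Only the examples from the text; real stop-word lists are much larger.
STOP_WORDS = {"for", "and", "but", "in", "of", "with"}


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Keep the tokens that are not stop-words (case-insensitive comparison)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


kept = remove_stop_words("Stocks rose in Asia and Europe".split())
```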

2.4 Feature Extraction

In most articles the actual text is both high-dimensional and unstructured, meaning that every unique word can be seen as a separate dimension and that different articles are structured in different ways. This can make it difficult to apply many classifiers to the raw text. Therefore one can first extract the most distinguishing features of a text, thus reducing dimensionality. This process is called feature extraction, or text mining.

2.4.1 Bag of Words

One of the simplest types of feature extraction models is called Bag of Words. The name Bag of Words refers to the fact that this model does not take the order of the words into account. Instead one can imagine that every word is put into a bag, where the ordering of the words gets lost. Although there exist a few different variations of this model, the most common one is to simply count the number of occurrences of each word within a document and keep the result in a vector (hereafter referred to as a count vector) [2]. This way the frequencies of the terms remain intact, although grammar and order are lost [15]. Another approach is to instead have a binary vector keeping track of whether or not a word exists within a document. However, this approach loses the multiplicity of the words in addition to the order and grammar.
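The two variants can be compared on a small example. The sketch below is an illustration only; the vocabulary and the document are made up, and the function names are chosen for this example.

```python
from collections import Counter
from typing import List


def count_vector(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def binary_vector(tokens: List[str], vocabulary: List[str]) -> List[int]:
    """1 if the vocabulary term occurs in the document, otherwise 0."""
    present = set(tokens)
    return [int(term in present) for term in vocabulary]


vocabulary = ["cat", "dog", "food"]
tokens = "dog food cat food food".split()
counts = count_vector(tokens, vocabulary)
binary = binary_vector(tokens, vocabulary)
# Reversing the document changes the word order but not the count vector.
same_bag = count_vector(list(reversed(tokens)), vocabulary) == counts
```

The count vector records that "food" occurs three times, while the binary vector only records that it occurs at all.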

2.4.2 Term Frequency-Inverse Document Frequency

One method that has proven itself to be both simple and effective for feature extraction is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is an information retrieval technique that can be used to determine the relevance of terms in documents in relation to a query [6]. In this case it can be used for feature extraction by determining which terms in a document are most distinguishing for that document.

This method can also be viewed as a form of Bag of Words model, since it does not take grammar or order into consideration.

TF-IDF consists of two steps, first calculating the term frequency (TF), and then calculating the inverse document frequency (IDF). There are several variants of both of these parts.

One variant of TF (formally defined in equation 2.1) works by first calculating how many times a term occurs in a document, just as is done for the count vector. The reasoning here is that words that frequently occur in a document are probably more important than words that do not occur frequently. The result is then normalized by dividing it by the number of words in the whole document. This normalization is done in order to prevent a bias towards longer documents, so that we get the frequency with which the term occurs and not just the raw count of the term [13].

$$
tf_{t,d} = \frac{n_{t,d}}{\sum_{k} n_{k,d}} \tag{2.1}
$$

Where $n_{t,d}$ is the number of times that term $t$ occurs in document $d$, and $\sum_{k} n_{k,d}$ is the total number of occurrences of all terms in document $d$.

To calculate the IDF part of the formula, one variant (formally defined in equation 2.2) is to take the total number of documents in the corpus and divide it by the number of documents where the term appears. The logarithm of this ratio is then taken. The IDF part of the formula acts as a form of weight-assigner, giving more weight to terms that appear in few documents and less weight to terms that appear in most documents.

$$
idf_{t} = \log \frac{|D|}{|D_t|} \tag{2.2}
$$

Where $|D|$ is the total number of documents, and $|D_t|$ is the number of documents where the term $t$ appears.

By multiplying the TF part and the IDF part for a certain term, we get a measure of how distinguishing that term is. In a corpus of news articles, the word “company” would probably have a fairly high TF-IDF score in articles related to business, since it would occur often in those articles. It would not, however, be very common in the remaining documents, since news articles relating to sports, culture, and other topics would probably not contain that particular term very often. In contrast, a common word like “today” will probably get a fairly low TF-IDF score, since it is likely to appear in news articles relating to every news topic, and is therefore not a very distinguishing term [13].
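The "company" versus "today" reasoning can be checked numerically with equations 2.1 and 2.2. The sketch below is an illustration under stated assumptions: the four-document corpus is made up, the natural logarithm is used because the base is not specified, and a term that appears in no document is not handled.

```python
import math
from typing import List

Document = List[str]


def tf(term: str, doc: Document) -> float:
    """Equation 2.1: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term: str, corpus: List[Document]) -> float:
    """Equation 2.2: log of (number of documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term: str, doc: Document, corpus: List[Document]) -> float:
    return tf(term, doc) * idf(term, corpus)


# Made-up corpus: "today" is in every document, "company" in two of four.
corpus = [
    "the company reported profit today".split(),
    "the team won the match today".split(),
    "the company hired staff today".split(),
    "the orchestra played today".split(),
]
score_company = tf_idf("company", corpus[0], corpus)
score_today = tf_idf("today", corpus[0], corpus)
```

Since "today" appears in every document its IDF is log(1) = 0, so its TF-IDF score is zero regardless of how often it occurs, while "company" receives a positive score of (1/5) · log(4/2).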

One of the weaknesses of this method, and similar Bag of Words methods in general, is that it does not take context and synonyms into consideration. It would for instance not take into account that the terms “Priest” and “Reverend” refer to a similar subject. Nor would it take into account that the term “Apple” could refer to either the company Apple Inc. or the fruit depending on the context.


2.4.3 Word Embeddings and GloVe

When using word embedding models, words are represented in a real-valued vector space [22]. A good word embedding would ideally represent words in such a way that two different words with similar semantic meanings would have similar vector space representations. Other linguistic relationships between different words can also be preserved. For instance, when using these vector-space representations for words, the operation "King - Man + Woman" yields a result that is very similar to the vector-space representation for the word "Queen" [18]. These sorts of relationships are further illustrated in figure 2.2, where it is possible to see that the gender-based differences between several different pairs of terms with the same semantic meaning are very similar to each other.

Words closest to "frog"
0  frog
1  frogs
2  toad
3  litoria
4  leptodactylidae
5  rana
6  lizard
7  eleutherodactylus

Table 2.1: The nearest neighbors of the word "frog" in a GloVe model (Litoria, Leptodactylidae, Rana and Eleutherodactylus are all genera or families of frogs) [9].

One commonly used word embedding model is GloVe (Global Vectors). GloVe is a statistical unsupervised learning model which uses a co-occurrence matrix to generate vector-space representations of words [18]. This is done by calculating how frequently different words co-occur within a specified context-window in a corpus. Dimensionality reduction is then performed on this co-occurrence matrix. Depending on the size of the corpus used for training, this method can require a large amount of memory and be computationally expensive.
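The first step, counting co-occurrences within a context-window, can be sketched as follows. This is a simplified illustration and not the GloVe implementation: every pair inside the window is given the same weight, which is an assumption of this sketch, and the sentence and window size are made up. The subsequent step that turns the counts into word vectors is not shown.

```python
from collections import Counter
from typing import List, Tuple


def cooccurrence_counts(tokens: List[str], window: int = 2) -> Counter:
    """Count ordered (word, context word) pairs at most `window` positions apart."""
    counts: Counter = Counter()
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the frog sat near the toad".split()
counts = cooccurrence_counts(tokens, window=1)
```

With a window of one, "the" co-occurs with "frog", "near" and "toad", whereas "frog" and "toad" never co-occur because they are too far apart. For a realistic vocabulary the full matrix of such counts becomes very large, which is the source of the memory cost mentioned above.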

Another example of a popular word embedding is the Word2vec model, which is a neural network-based tool developed by Google [15].

Figure 2.2: Illustration of linear substructures when using GloVe [9].

2.5 Classifiers

There are several different types of classifiers, with different benefits and drawbacks often depending on the problem to be solved.

2.5.1 Support Vector Machine

Since textual data is often high-dimensional, there is a risk of overfitting. Support Vector Machines (SVMs) have built-in overfitting protection, which is one of the reasons why they have been shown to work well for text categorization [11].

The idea behind SVMs is to find a linear hyperplane (or decision boundary) which separates the data-points of one class from the rest, and to do this in such a way that the margin between them is maximized, as illustrated in figure 2.3. With a maximized margin, the likelihood of a future misclassification is lower than if the margin had been smaller [1]. If the data is not linearly separable it is possible to map the data into a higher-dimensional space by utilizing what is called the kernel trick. In this higher-dimensional space we can then find an optimal separating hyperplane [3].

Figure 2.3: SVM with maximized margin [10].
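How a linear hyperplane classifies a point, and how its margin is measured, can be sketched as follows. The sketch does not train an SVM: the weight vector `w` and bias `b` are hypothetical values chosen for the illustration, and the margin width 2/||w|| assumes the usual scaling in which the margin hyperplanes are w·x + b = +1 and w·x + b = -1.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score w.x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w: Sequence[float]) -> float:
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters, not learned from data.
w, b = [3.0, 4.0], -5.0
side_a = classify(w, b, [3.0, 4.0])
side_b = classify(w, b, [0.0, 0.0])
width = margin_width(w)
```

Maximizing the margin corresponds to making the norm of `w` as small as possible while all training points remain on the correct side of their margin hyperplane.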

2.5.2 Artificial Neural Networks

The structure of an Artificial Neural Network (ANN) is inspired by the structure of biological brains. Several nodes, or neurons, compose a network capable of solving classification problems after sufficient training [3]. The neurons have what are called activation functions, which determine whether or not a neuron should be activated based on the input to that neuron.

The network structure is often divided into several interconnected layers. There are three different types of layers: input layers, hidden layers and output layers. The input data for the network is fed into the input layer. Each hidden layer takes weighted input from the previous layer in the topology of the network and feeds it forward, either to another hidden layer or to the output layer. The output layer takes the data that it receives and transforms it into the final output of the whole network [3]. An example of such a network is illustrated in figure 2.4.

Figure 2.4: Example of an ANN with one hidden layer [4].
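The feed-forward computation of a network with one hidden layer can be sketched as follows. This is an illustration only: the sigmoid activation function, the layer sizes and all weights are assumptions made for the example, and no training is performed.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Activation function squashing the weighted input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Each row of `weights` holds the incoming weights of one neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b) -> List[float]:
    """Feed the input through the hidden layer and then the output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)


# Hypothetical weights: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]
output = forward([1.0, 1.0], hidden_w, hidden_b, out_w, out_b)
```

For a binary decision, such as the per-label decisions in Binary Relevance, the single output value could be thresholded, for example at 0.5.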

2.6 Related Work

2.6.1 Word Embeddings for Single-Label Classification

In [15], the authors apply a Word2vec-based word embedding technique for feature extraction in order to solve a single-label classification problem in the text domain. They compare their results against a TF-IDF extraction technique, and also investigate the use of stop-words. Their results indicate that the Word2vec extraction technique on its own did not outperform TF-IDF. However, a combination of Word2vec and TF-IDF yields the highest accuracy in some cases, and then only by a small margin. Their results also show that the use of stop-words does not always increase performance: in order to outperform a TF-IDF extraction method with stop-words, the authors had to combine Word2vec and TF-IDF without using stop-words.


2.6.2 Bag of Words for Multi-Label Classification

In [20] a number of different Bag of Words methods are investigated for multi-label classification of news articles. Several multi-label classification methods are investigated in conjunction with different classifiers. The results show that using Binary Relevance as a multi-label classification method, combined with an SVM-classifier, performs well when used on their dataset. However after a feature selection method is applied, another method (Calibrated Label Ranking) performs marginally better.

2.6.3 Research Gap

None of the previous works examined evaluates the GloVe extraction method, nor do any of them use word embedding methods for multi-label classification. This report attempts to cover this gap by evaluating the GloVe extraction method against Bag of Words methods when used for multi-label classification. The use of pre-processing methods has been examined in previous works, but not as extensively as in this report.


Chapter 3 Method

3.1 Dataset

Since the dataset provided by the Swedish Pensions Agency ultimately proved to be too small, the publicly available multi-label dataset "Reuters-21578, Distribution 1.0" was used [14]. This dataset is based on newswire data recorded on the Reuters newswire in 1987. It has been used as a benchmark in several research papers on information retrieval, text classification and natural language processing [7] [11]. Therefore the results of this work are hopefully comparable to other similar works within the area, without having to account for unique datasets.

The dataset contains 21,578 documents which have been categorized manually into 120 different topics. For this thesis a commonly used subset of the dataset has been employed. This subset is referred to as the ModApte split, and contains 9,603 training documents and 3,299 test documents [14]. However, when only categories that have at least one training document and one test document are taken into consideration, 90 categories remain. This subset of the ModApte split is sometimes referred to as the R(90) subset. After this pruning there are 7,769 training documents and 3,019 test documents, with an average label cardinality (number of labels per document) of 1.2. The division into training set and test set was based on the dates on which the data was recorded.
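Label cardinality is the average number of labels assigned to each document. The short sketch below shows the calculation on five made-up label sets; the topic names are chosen in the style of the Reuters topics but the documents are hypothetical and are not taken from the dataset.

```python
from typing import List, Set


def label_cardinality(label_sets: List[Set[str]]) -> float:
    """Average number of labels per document."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Hypothetical label sets: four single-label documents and one with two labels.
labels = [{"earn"}, {"grain", "wheat"}, {"acq"}, {"crude"}, {"earn"}]
cardinality = label_cardinality(labels)
```

A cardinality close to 1 means that most documents carry a single label and only a minority carry several.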

The documents are not uniformly distributed over the topics. The five least populated topics only have one training sample and one test sample each, while roughly a third of the documents belong to the most common topic.

One possible problem with the dataset is that some data samples that were supposed to have been assigned topics manually had not been assigned any [14]. This may prove to be a source of error if the multi-label classifier assigns a label to a data sample that is actually correct, but that had not been assigned during the manual classification of the dataset.

3.2 Pre-Processing

The first pre-processing step taken is tokenization of the text. Next all tokens are converted to lower-case, and then all tokens are stemmed. The Porter stemmer [19] implemented in the Natural Language Toolkit (NLTK) package [5] has been used for this project. Finally, characters not present in the English alphabet are filtered out so that no periods, commas or other special characters are present in the tokens. Without doing this, the tokens "Apple" and "Apple!" would be regarded as separate tokens.
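The order of these steps can be sketched as a small pipeline. This is an illustration and not the code used in the project: whitespace tokenization is an assumption, and the stemmer is passed in as a function that defaults to leaving tokens unchanged so that the sketch has no external dependencies. A real stemmer, such as the `stem` method of NLTK's `PorterStemmer`, could be supplied instead.

```python
import re
from typing import Callable, List


def preprocess(text: str, stem: Callable[[str], str] = lambda token: token) -> List[str]:
    tokens = text.split()                                 # 1. tokenize (whitespace, assumed)
    tokens = [t.lower() for t in tokens]                  # 2. convert to lower-case
    tokens = [stem(t) for t in tokens]                    # 3. stem (identity placeholder by default)
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]   # 4. keep only the letters a-z
    return [t for t in tokens if t]                       # drop tokens that became empty


tokens = preprocess("Apple! shares rose 3 %, Apple said.")
```

After the final filtering step, "Apple!" and "Apple" both become "apple", and tokens that consisted only of digits or punctuation disappear.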

Additionally, the use of N-grams requires another step. Both 2-grams and 3-grams are investigated, meaning that either two or three consecutive tokens are aggregated into a single token. N-grams are investigated since, as mentioned in section 2.3.1, they are able to preserve some of the semantic information that exists in the ordering of the words in a sentence. The use of N-grams up to size 3 was deemed to be sufficient in order to examine the effects of using increasing sizes of N-grams.

The effects of not using stop-word elimination and of not using stemming are also investigated. For these experiments only 1-grams are used, since they yielded the best results in the preceding N-gram experiments.

3.3 Extraction Methods

3.3.1 TF-IDF

The TF-IDF extractor is implemented using the scikit-learn (Sklearn) toolkit [17]. The extractor is applied to the pre-processed data, giving each document a vectorized representation based on the TF-IDF scores of the terms within each document. The length of these vectors is set to at most 100, meaning that at most 100 of the terms that have the highest TF-IDF scores are used. These vectorized representations are collected into a matrix which is then used as input for the classifiers.

3.3.2 Bag of Words/Count Vector

The Count Vector is similar to the TF-IDF vector, but it does not take the length of the current document, nor other documents within a class, into consideration. It simply counts the number of times that each word occurs within a document and stores the results in a vector. A maximum feature length of 100 is also chosen for this extraction method.
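Both Bag of Words extractors can be sketched with scikit-learn as shown below. The toy documents and variable names are hypothetical, and the use of the `max_features` parameter to limit the vector length is an assumption about how the limit of 100 was applied. Note that `max_features` in scikit-learn keeps the terms with the highest term frequency across the corpus, which may differ from a selection by TF-IDF score.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy pre-processed documents (hypothetical); tokens are joined by spaces.
docs = ["oil price rise", "oil export fall", "wheat price rise rise"]

MAX_FEATURES = 100  # upper bound on the vector length

tfidf_vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)
count_vectorizer = CountVectorizer(max_features=MAX_FEATURES)

# Each row is one document, each column one term of the vocabulary.
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
count_matrix = count_vectorizer.fit_transform(docs)

# Raw count of the term "rise" in the third document.
rise_column = count_vectorizer.vocabulary_["rise"]
rise_count = count_matrix[2, rise_column]
```

The toy corpus has only six distinct terms, so both matrices have six columns; on a real corpus the number of columns would be capped at `MAX_FEATURES`.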

3.3.3 GloVe

The GloVe feature extractor is implemented by employing a pre-trained GloVe model [9]. The model has been trained on data from Wikipedia and Gigaword 5 [9]. This data contains 6 billion tokens and a vocabulary of 400,000 words. The length of the vectorized word representations is chosen to be 100 in order for this approach to be fairly compared with the TF-IDF and Count Vector methods. However, pre-trained models using lengths of 50, 200, or 300 are also available [9].

As a feature extraction method, the resulting vectorized representation of the whole document is the average of all vectorized word representations of the words in the document. Since this approach takes both semantics and context into consideration, no use of N-grams is applied to the data for this method.
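The averaging step can be sketched as follows. The sketch assumes the plain-text GloVe format, in which each line holds a word followed by its vector components separated by spaces. The function names, the two-dimensional toy vectors, and the handling of words missing from the vocabulary (they are skipped, and a document without any known word receives a zero vector) are assumptions for illustration only.

```python
def parse_glove(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


def document_vector(tokens, embeddings, dim):
    """Average the word vectors of all tokens found in the vocabulary."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # Assumption: a document without known words maps to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy embeddings with 2 dimensions instead of 100 (hypothetical values).
embeddings = parse_glove(["oil 1.0 0.0", "price 0.0 1.0"])
doc_vec = document_vector(["oil", "price", "unknownword"], embeddings, dim=2)
```

For the pre-trained model, the lines would be read from the downloaded vector file instead of a list of strings.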

3.4 Classifiers

All classifiers are implemented using the problem transformation approach described in section 2.1.1, more specifically the Binary Relevance method. Binary Relevance was used since, as mentioned in section 2.6.2, it has proven to yield good results. It is also easier to implement than an Algorithm Adaptation approach.

Two different classifiers, an SVM classifier and an ANN classifier, were chosen in order to examine whether different extraction techniques might have different effects depending on the classifier used. The SVM classifier has, as mentioned in section 2.5.1, proven to perform well when used for text categorization. The ANN was chosen mostly to have something to compare the SVM against. The SVM classifier is implemented with a linear kernel. The ANN classifier is implemented using one hidden layer of 100 nodes.
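One way to set up Binary Relevance with the two classifiers is sketched below using scikit-learn. Wrapping a binary classifier in `OneVsRestClassifier` and fitting it on a label indicator matrix trains one independent binary classifier per label, which corresponds to Binary Relevance. The choice of `LinearSVC` for the linear-kernel SVM, the `max_iter` and `random_state` values, and the toy data are assumptions; the thesis does not state which classes or settings were used.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC

# Toy feature matrix (6 documents, 2 features) and label indicator matrix
# (6 documents, 2 labels). A document may carry both labels.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
              [0.1, 0.9], [1.0, 1.0], [0.9, 0.8]])
Y = np.array([[1, 0], [1, 0], [0, 1],
              [0, 1], [1, 1], [1, 1]])

# One linear SVM per label.
svm_br = OneVsRestClassifier(LinearSVC())

# One ANN per label, each with a single hidden layer of 100 nodes.
ann_br = OneVsRestClassifier(
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
)

svm_br.fit(X, Y)
ann_br.fit(X, Y)

# Predictions are indicator matrices with the same shape as Y.
svm_pred = svm_br.predict(X)
ann_pred = ann_br.predict(X)
```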

3.5 Evaluation

Two different types of metrics that are commonly used when evaluating supervised classifiers are macro-average metrics and micro-average metrics. Both types usually consist of scores for precision, recall and a combination of the two called an F-score. The macro-average metrics are calculated on a class basis, and therefore do not take the size of the class into consideration. The micro-average metrics, on the other hand, are calculated on a document basis, meaning that larger classes will have a greater impact on the resulting metrics than smaller classes will [8].

Employing both of these types of metrics means that we can evaluate the performance of the classifier on both smaller and larger classes, which can often be a useful thing to do since data is not always uniform in practice.

3.5.1 Precision

Precision measures how many of the data samples assigned to a class actually belong to it. A data sample determined to be of a certain class can either be a true positive (TP), meaning that it was correctly classified, or a false positive (FP), meaning that it was incorrectly classified. Precision is formally defined as [3]:

\[ P = \frac{TP}{TP + FP} \tag{3.1} \]

3.5.2 Recall

Recall takes into account data points which should have been assigned to a certain class but were not, that is, false negatives (FN). The formal definition of recall is [3]:

\[ R = \frac{TP}{TP + FN} \tag{3.2} \]


3.5.3 F-Score

Both precision and recall are important metrics to take into consideration. They are therefore often combined into a single metric called an F-score, which can be defined as [3]:

\[ F = \frac{2RP}{R + P} \tag{3.3} \]
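The difference between micro- and macro-averaging can be illustrated with a small sketch built on equations 3.1 to 3.3. The function names, the toy label matrices, the convention of returning 0 when a denominator is zero, and the computation of the macro F-score as the mean of the per-class F-scores are assumptions made for this example.

```python
def per_class_counts(y_true, y_pred):
    """Return a (TP, FP, FN) tuple for every class (column)."""
    counts = []
    for c in range(len(y_true[0])):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


def precision_recall_f(tp, fp, fn):
    """Equations 3.1 to 3.3; a zero denominator yields 0 by assumption."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * recall * precision / total if total else 0.0
    return precision, recall, f_score


def micro_average(y_true, y_pred):
    """Pool the counts of all classes, then apply the formulas once."""
    counts = per_class_counts(y_true, y_pred)
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return precision_recall_f(tp, fp, fn)


def macro_average(y_true, y_pred):
    """Apply the formulas per class, then average with equal class weight."""
    scores = [precision_recall_f(*c) for c in per_class_counts(y_true, y_pred)]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Class 0 is large and classified perfectly, class 1 is small and missed.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 1], [0, 0]]

micro = micro_average(y_true, y_pred)
macro = macro_average(y_true, y_pred)
```

In this toy case the micro-average scores stay at 0.75 because the large class dominates the pooled counts, while the macro-average scores drop to 0.5 because the small class is weighted equally.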

Chapter 4 Results

4.1 TF-IDF

Table 4.1, which shows the classification scores when using the TF-IDF extraction method with varying N-gram lengths, indicates that the SVM classifier performs best with this extraction method. Five out of the six top scores belong to the SVM classifier when using N-grams of length 1. Only for the micro-precision metric does the ANN classifier outperform the SVM classifier, and it does so only by a very slight margin (0.952 with 3-grams against 0.949 for the SVM with 2-grams). Overall, the scores decrease as the length of the N-grams increases, which is further illustrated in figures 4.1 and 4.2. Only the micro-precision scores for the ANN classifier show a notable improvement.
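As a consistency check, the micro-average F-scores reported in table 4.1 can be recomputed from the reported precision and recall using equation 3.3. The sketch below uses the values from the table; the dictionary layout and the tolerance of 0.001, chosen to allow for rounding to three decimals, are assumptions. The same check is not expected to hold for the macro rows if the macro F-score is an average of per-class F-scores.

```python
def f_score(precision, recall):
    """Equation 3.3."""
    return 2 * recall * precision / (recall + precision)


# (classifier, N-gram): (precision, recall, reported F-score), micro-metrics.
table_4_1_micro = {
    ("SVM", 1): (0.947, 0.803, 0.869),
    ("SVM", 2): (0.949, 0.661, 0.779),
    ("SVM", 3): (0.947, 0.535, 0.684),
    ("ANN", 1): (0.838, 0.680, 0.751),
    ("ANN", 2): (0.847, 0.488, 0.619),
    ("ANN", 3): (0.952, 0.310, 0.468),
}

# Largest gap between a recomputed and a reported F-score.
max_deviation = max(
    abs(f_score(p, r) - reported) for p, r, reported in table_4_1_micro.values()
)
```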

4.2 Count Vector

When using the Count Vector extraction method, the results in table 4.2 suggest that the ANN classifier outperforms the SVM classifier overall. The ANN classifier used in conjunction with N-grams of length 1 is responsible for four out of six top scores. Here, too, increasing the length of the N-grams has an overall negative impact on performance. Only the micro-precision scores increase as the N-gram length increases, as illustrated in figures 4.3 and 4.4.


TF-IDF Extraction

Micro-metrics
Classifier  N-gram  Precision  Recall  F-score
SVM         1       0.947      0.803   0.869
SVM         2       0.949      0.661   0.779
SVM         3       0.947      0.535   0.684
ANN         1       0.838      0.680   0.751
ANN         2       0.847      0.488   0.619
ANN         3       0.952      0.310   0.468

Macro-metrics
Classifier  N-gram  Precision  Recall  F-score
SVM         1       0.640      0.396   0.467
SVM         2       0.543      0.222   0.291
SVM         3       0.472      0.128   0.187
ANN         1       0.297      0.167   0.196
ANN         2       0.104      0.047   0.058
ANN         3       0.039      0.016   0.019

Table 4.1: Results from using TF-IDF extraction with different N-grams

Figure 4.1: Illustration of micro metrics when using the TF-IDF extraction method.


Figure 4.2: Illustration of macro metrics when using the TF-IDF extraction method

Count Vector Extraction

Micro-metrics
Classifier  N-gram  Precision  Recall  F-score
SVM         1       0.790      0.604   0.685
SVM         2       0.882      0.444   0.591
SVM         3       0.957      0.311   0.469
ANN         1       0.843      0.663   0.742
ANN         2       0.831      0.498   0.623
ANN         3       0.954      0.311   0.469

Macro-metrics
Classifier  N-gram  Precision  Recall  F-score
SVM         1       0.228      0.178   0.185
SVM         2       0.130      0.038   0.051
SVM         3       0.073      0.016   0.021
ANN         1       0.347      0.175   0.215
ANN         2       0.158      0.057   0.073
ANN         3       0.061      0.016   0.020

Table 4.2: Results from using a Count Vector extraction


Figure 4.3: Illustration of micro metrics when using the Count Vector extraction method.

Figure 4.4: Illustration of macro metrics when using the Count Vector extraction method


GloVe Extraction

Micro-metrics
Classifier  Precision  Recall  F-score
SVM         0.713      0.725   0.719
ANN         0.764      0.861   0.810

Macro-metrics
Classifier  Precision  Recall  F-score
SVM         0.339      0.327   0.318
ANN         0.326      0.548   0.382

Table 4.3: Results from using GloVe extraction method

4.3 GloVe

When applying the GloVe extraction technique to the data, the ANN classifier performs better than the SVM classifier by a significant amount, as illustrated in table 4.3 and in figures 4.5 and 4.6. Only for the macro-precision score does the SVM classifier yield a (slightly) better score than the ANN classifier.

4.4 Effect of N-grams

Overall, the usage of N-grams of length 2 or 3 does not increase performance. In fact, it decreases the average performance of the classifiers significantly, as illustrated by tables 4.1 and 4.2. More specifically, it is the recall scores that deteriorate greatly, for both the micro and macro metrics. However, the micro-precision score does increase with the length of the N-grams when using the Count Vector extraction method.

4.5 Stop-Words

When stop-words are not removed from the data, the ANN classifier in conjunction with the GloVe extraction method performs the best, as shown in table 4.4. Five out of the six highest scores belong to the ANN classifier used with GloVe. Only for the micro-precision score does another approach (the SVM classifier with the TF-IDF extraction method) perform better.


Figure 4.5: Illustration of micro metrics when using the GloVe extraction method.

Figure 4.6: Illustration of macro metrics when using the GloVe extraction method


No stop-word elimination

Micro-metrics
Classifier  Extraction    Precision  Recall  F-score
SVM         TF-IDF        0.891      0.574   0.698
SVM         Count Vector  0.745      0.585   0.656
SVM         GloVe         0.696      0.744   0.719
ANN         TF-IDF        0.830      0.641   0.723
ANN         Count Vector  0.826      0.642   0.723
ANN         GloVe         0.760      0.842   0.799

Macro-metrics
Classifier  Extraction    Precision  Recall  F-score
SVM         TF-IDF        0.227      0.089   0.112
SVM         GloVe         0.319      0.348   0.314
ANN         TF-IDF        0.257      0.135   0.164
ANN         GloVe         0.340      0.523   0.391

Table 4.4: Results when not applying stop-word elimination. N-grams of size 1 are used for all results in the table.


Figure 4.7: Illustration of performance difference for micro metrics when not using stop-word removal compared to using stop-word removal

Figures 4.7 and 4.8 show an overall decrease in performance when stop-words are not removed. The decrease is especially large for the SVM classifier used in combination with the TF-IDF extraction method, for both micro and macro metrics. Only for the SVM combined with the GloVe extraction method is there an increase in score, and only for the recall metrics.
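The size of these differences can be recomputed from the tables. The sketch below subtracts the micro-average scores obtained with stop-word removal (1-gram rows of tables 4.1 and 4.2, and table 4.3) from those obtained without it (table 4.4). That the figures plot exactly this difference is an assumption, as are the variable names.

```python
# (classifier, extraction): (precision, recall, F-score), micro-metrics.
with_removal = {
    ("SVM", "TF-IDF"): (0.947, 0.803, 0.869),
    ("SVM", "Count Vector"): (0.790, 0.604, 0.685),
    ("SVM", "GloVe"): (0.713, 0.725, 0.719),
    ("ANN", "TF-IDF"): (0.838, 0.680, 0.751),
    ("ANN", "Count Vector"): (0.843, 0.663, 0.742),
    ("ANN", "GloVe"): (0.764, 0.861, 0.810),
}
without_removal = {
    ("SVM", "TF-IDF"): (0.891, 0.574, 0.698),
    ("SVM", "Count Vector"): (0.745, 0.585, 0.656),
    ("SVM", "GloVe"): (0.696, 0.744, 0.719),
    ("ANN", "TF-IDF"): (0.830, 0.641, 0.723),
    ("ANN", "Count Vector"): (0.826, 0.642, 0.723),
    ("ANN", "GloVe"): (0.760, 0.842, 0.799),
}

# Negative values mean that omitting stop-word removal lowered the score.
deltas = {
    key: tuple(
        round(after - before, 3)
        for before, after in zip(with_removal[key], without_removal[key])
    )
    for key in with_removal
}
```

For the SVM with TF-IDF the micro-recall drops by 0.229, whereas for the SVM with GloVe it rises by 0.019.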

4.6 Stemming

The results in table 4.5 show that using the ANN classifier with the GloVe extraction method yields the best result in five out of six categories, just as it does when no stop-word removal is used. Again, it is only for the micro-precision score that this approach does not achieve the best score.

Figures 4.9 and 4.10, which show the performance difference of not using stemming compared to using stemming, reveal mixed results. The results for the SVM classifier used with the TF-IDF extraction method suffer when stemming is not applied. However, for the other classifier and extraction combinations, there are both metrics which increase and metrics which decrease when stemming is not used.

Figure 4.8: Illustration of performance difference for macro metrics when not using stop-word removal compared to using stop-word removal

No stemming

Micro-metrics
Classifier  Extraction  Precision  Recall  F-score
SVM         TF-IDF      0.905      0.627   0.741
SVM         GloVe       0.753      0.747   0.750
ANN         TF-IDF      0.856      0.682   0.759
ANN         GloVe       0.791      0.835   0.813

Macro-metrics
Classifier  Extraction  Precision  Recall  F-score
SVM         TF-IDF      0.214      0.112   0.132
SVM         GloVe       0.368      0.362   0.352
ANN         TF-IDF      0.280      0.159   0.185
ANN         GloVe       0.397      0.524   0.429

Table 4.5: Results when not applying stemming. N-grams of size 1 are used for all results in the table.

Figure 4.9: Illustration of performance difference for micro metrics when not using stemming compared to using stemming

4.7 Best Scores

Table 4.6 shows the combination of extraction method and classifier that provided the highest score for each available metric.


Figure 4.10: Illustration of performance difference for macro metrics when not using stemming compared to using stemming

Micro-metrics
Metric     Description                    Score
Precision  SVM with Count Vector, 3-gram  0.958
Recall     ANN with GloVe                 0.861
F-score    SVM with TF-IDF, 1-gram        0.869

Macro-metrics
Metric     Description                    Score
Precision  SVM with TF-IDF, 1-gram        0.640
Recall     ANN with GloVe                 0.548
F-score    SVM with TF-IDF, 1-gram        0.467

Table 4.6: Best score for each metric.


Chapter 5 Discussion

5.1 Effects of Preprocessing

5.1.1 N-grams

It seems as though the use of N-grams makes the classifier more restrictive. This probably introduces more false negative predictions, as is evident from the deteriorating recall scores. However, since the classifier becomes more restrictive, the chance of false positives decreases, which is shown in the increasing micro-precision scores. One possible reason for this could be that certain keywords that would be recognized as useful features on their own are not recognized as such when combined with one or more other keywords.

An example of this could be two phrases from the same class: "Microsoft’s revenue decreased" and "Apple’s revenue remained constant". Here the word "revenue" would be recognized as a common distinguishable keyword for both examples when using 1-grams, but when using 2-grams the keywords would be "Microsoft’s revenue" and "revenue decreased" for the first example. These keywords would not match any of the second example’s keywords: "Apple’s revenue", "revenue remained" and "remained constant".

The reason why the macro-precision scores do not increase is probably that, as mentioned in section 3.1, the dataset is not balanced, meaning that the classifier does not have enough training data for some classes.
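The "revenue" example from this section can be reproduced with a few lines of code. The function name `ngram_set` and the use of lower-casing and whitespace tokenization are assumptions made for this sketch.

```python
def ngram_set(phrase, n):
    """Return the set of N-grams of a phrase, using whitespace tokenization."""
    tokens = phrase.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


first = "Microsoft's revenue decreased"
second = "Apple's revenue remained constant"

# Features that the two phrases have in common.
shared_1grams = ngram_set(first, 1) & ngram_set(second, 1)
shared_2grams = ngram_set(first, 2) & ngram_set(second, 2)
```

With 1-grams the two phrases share the feature "revenue", while with 2-grams they share no feature at all.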


5.1.2 Stop-Word Elimination

By omitting the use of stop-word elimination, the performance of all the classifier and extraction combinations generally decreases, as indicated by table 4.4. However, only a minimal performance decrease occurs when using the GloVe extraction method when compared to the results in table 4.3. This is further illustrated in figures 4.7 and 4.8.

This could be because the semantics of the words are still intact when using a word embedding model, so that the stop-words are recognized as not being very useful as distinguishing features in this case.

5.1.3 Stemming

When stemming is not applied to the data, the performance differences compared to when it is applied vary considerably. When examining figures 4.9 and 4.10 we see that almost all of the results when using the GloVe extraction method show a small improvement. This suggests that since the semantic meaning of different forms of the same word is preserved when using GloVe, stemming might not be necessary when using it.

When examining the results for the TF-IDF method in figures 4.9 and 4.10, we see a moderate decrease in performance for the SVM classifier in the micro metrics and a large decrease in the macro metrics.

For the TF-IDF method used with the ANN classifier we see a slight increase in performance for the micro metrics and a slight decrease for the macro metrics. For the Count Vector method we see a sharp decline in micro-precision. It seems as though the TF-IDF method and Count Vector method are more sensitive to different forms of the same words than the GloVe extraction method, which is logical since these methods do not preserve the semantic meanings of the terms examined.

5.2 Extraction Methods

5.2.1 TF-IDF and Count Vector

For the SVM classifier, the TF-IDF method performed better on average than the Count Vector method, as is evident when examining tables 4.1 and 4.2. This difference in performance is most likely due to the fact that TF-IDF both normalizes with regard to the length of the document and takes the rarity of the terms across all documents into consideration. This is in contrast to the Count Vector, which gives a bias to longer documents, since terms are likely to occur a greater number of times in them. Nor does the Count Vector take into account the frequency of terms across all documents in the corpus. However, when looking at the ANN classifier, the results are similar between the TF-IDF method and the Count Vector method. The micro-average F-score is marginally better when using the TF-IDF method (0.751 against 0.742 with 1-grams), while the macro-average F-score is slightly better when using the Count Vector (0.215 against 0.196).

5.2.2 Bag of Words vs. Word Embeddings

The Bag of Words models look to be more sensitive to the different pre-processing methods that are used, while the Word Embedding model yields similar results regardless. Which approach is better is debatable, and depends largely on which metric is most important for the problem at hand. When examining table 4.6, which displays the approach that yielded the best score for each metric, we see that all extraction methods are present.

The SVM classifier used with the TF-IDF extraction method yielded three out of six top scores. However, if recall is the most important metric, then the ANN classifier used with the GloVe extraction method is the best choice. An example of this could be a recommender system where leaving out correct recommendations is very costly, for instance when recommending items available for purchase online, since omissions might result in lost sales. For some other kind of system where multi-label classification is used, precision might be much more important, namely where a false positive prediction is very costly but a false negative prediction does not have a great impact.

Another important factor to keep in mind is how balanced the data is. If it is known that the dataset is heavily imbalanced, and it is likely that future data will be as well, then it might be better to focus on the micro-average metrics in this report. If, on the other hand, it is more important to classify correctly across all different categories then one should focus on the macro-average metrics instead.


5.3 Concerns

More accurate multi-label classifications would hopefully lead to a more efficient use of both energy and time. When applied to a recommender system, it would hopefully mean that people could get relevant items or articles recommended to them both faster and more accurately. When recommending items available for purchase, this could mean that people start purchasing items more frequently. This would lead to increased revenue for the company selling the products, but it would also most likely increase the number of deliveries to be made. This would in turn lead to an increase in emissions of greenhouse gases.

When it concerns articles regarding pensions, which was the original starting point of this work, it would mean that people would hopefully be more informed regarding their financial decisions and both their future and current well-being. Moreover, it would ideally reduce the need for direct communication between the Swedish Pensions Agency and people with questions regarding their pension. This could lead to reduced spending of tax money, but could also mean that some people would lose their jobs.

5.4 Future Work

Future work could include examining how different feature extraction methods affect the results when using different strategies for multi-label classification, rather than only Binary Relevance as has been done here. One could also examine additional types of classifiers, and how different extraction methods affect the results when using them. One could also see how other extraction techniques perform, such as Word2vec or fastText for word embeddings, and a binary vector or BM25 for Bag of Words.

Finally, it would be interesting to see whether the classifications provided could mitigate the problem of Exploration and Exploitation present in many recommender systems. This could hopefully be achieved in part by expanding the users’ profiles using multi-labeled data, thereby allowing users to explore new areas while still being recommended relevant items.


Chapter 6 Conclusion

There is no one extraction method that outperforms the rest across all metrics. Nor is there an extraction method that performs the best across both classifiers. However, the extraction method can have a significant impact on the results of multi-label classification. The best choice of extraction method depends on what the multi-label classifications are to be used for. If the priority lies with not producing false negatives, then the results of this work indicate that the GloVe extraction method is the best choice. If, however, not producing false positives is the highest priority, then a Bag of Words extraction method is the superior choice.

The pre-processing methods investigated did not always yield better results when applied to the data. The use of N-grams generally provided worse classification outcomes. However, it did prove to notably increase the micro-average precision of the ANN classifier.

Removal of stop-words generally had a positive effect on the results. Only when using the GloVe extraction method did it prove to be detrimental for some metrics. Not removing stop-words had a great negative effect on the SVM classifier when the TF-IDF extraction method was used.

Applying stemming to the data gave mixed results. Again, the results when using the SVM classifier with the TF-IDF extraction method suffered greatly when this pre-processing step was not applied. The rest of the classifier and extraction combinations showed mostly increased scores when looking at the micro metrics, and slight decreases for some of the macro metrics when stemming was not applied. The GloVe extraction method seemed to actually perform better when stemming was not applied, suggesting that stemming should not always be applied.

Furthermore, the structure of the data that is to be used is also an important factor to take into consideration when choosing an extraction method. The dataset used for this work was heavily imbalanced real-world data. For this dataset, the best overall result was achieved by combining the TF-IDF method with stop-word elimination, stemming, 1-grams only, and an SVM classifier.


Bibliography

[1] Gediminas Adomavicius and Alexander Tuzhilin. “Context-aware recommender systems”. In: Recommender systems handbook. Springer, 2015, pp. 191–226.

[2] Mehdi Allahyari et al. “A brief survey of text mining: Classification, clustering and extraction techniques”. In: arXiv preprint arXiv:1707.02919 (2017).

[3] Xavier Amatriain and Josep M Pujol. “Data mining methods for recommender systems”. In: Recommender systems handbook. Springer, 2015, pp. 227–262.

[4] Artificial neural network. https://en.wikipedia.org/wiki/Artificial_neural_network. Accessed: 2018-05-28.

[5] Steven Bird, Ewan Klein, and Edward Loper. “Natural Language Processing with Python”. 2009.

[6] Zach Chase, Nicolas Genain, and Orren Karniol-Tambour. “Learning Multi-Label Topic Classification of News Articles”. 2014.

[7] Franca Debole and Fabrizio Sebastiani. “An analysis of the relative hardness of Reuters-21578 subsets”. In: Journal of the Association for Information Science and Technology 56.6 (2005), pp. 584–596.

[8] George Forman. “A pitfall and solution in multi-class feature selection for text classification”. In: Proceedings of the twenty-first international conference on Machine learning. ACM. 2004, p. 38.

[9] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/. Accessed: 2018-05-19.

[10] Introduction to Support Vector Machines. https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html. Accessed: 2018-03-29.

[11] Thorsten Joachims. “Text categorization with support vector machines: Learning with many relevant features”. In: European conference on machine learning. Springer. 1998, pp. 137–142.

[12] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. “Multilabel text classification for automated tag suggestion”. In: Proceedings of the ECML/PKDD. Vol. 18. 2008.

[13] Sungjick Lee and Han-joon Kim. “News keyword extraction for topic tracking”. In: Networked Computing and Advanced Information Management, 2008. NCM’08. Fourth International Conference on. Vol. 2. IEEE. 2008, pp. 554–559.

[14] David D. Lewis. Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt. Accessed: 2018-03-29.

[15] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. “Support vector machines and word2vec for text classification with semantic features”. In: Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on. IEEE. 2015, pp. 136–140.

[16] Michael J Pazzani and Daniel Billsus. “Content-based recommendation systems”. In: The adaptive web. Springer, 2007, pp. 325–341.

[17] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543.

[19] Martin F Porter. “An algorithm for suffix stripping”. In: Program 14.3 (1980), pp. 130–137.

[20] Dyah Rahmawati and Masayu Leylia Khodra. “Automatic multilabel classification for Indonesian news articles”. In: Advanced Informatics: Concepts, Theory and Applications (ICAICTA), 2015 2nd International Conference on. IEEE. 2015, pp. 1–6.

[21] Neil Rubens et al. “Active learning in recommender systems”. In: Recommender systems handbook. Springer, 2015, pp. 809–846.

[22] Tobias Schnabel et al. “Evaluation methods for unsupervised word embeddings”. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 298–307.

[23] Abinash Tripathy, Ankit Agrawal, and Santanu Kumar Rath. “Classification of sentiment reviews using n-gram machine learning approach”. In: Expert Systems with Applications 57 (2016), pp. 117–126.

[24] Min-Ling Zhang and Zhi-Hua Zhou. “ML-KNN: A lazy learning approach to multi-label learning”. In: Pattern recognition 40.7 (2007), pp. 2038–2048.


Appended Material

Reuters-21578, Distribution 1.0 dataset available at: http://www.daviddlewis.com/resources/testcollections/reuters21578
