Comparing Feature Extraction Methods and Effects of Pre-Processing Methods for Multi-Label Classification of Textual Data
MARTIN EKLUND
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Computer Science Date: June 27, 2018
Supervisor: Håkan Lane
Examiner: Olof Bälter
Swedish title: Utvärdering av Metoder för Extraktion av Särdrag och Förbehandling av Data för Multi-Taggning av Textdata
Abstract
This thesis aims to investigate how different feature extraction methods applied to textual data affect the results of multi-label classification. Two different Bag of Words extraction methods are used, specifically the Count Vector and the TF-IDF approaches. A word embedding method is also investigated, called the GloVe extraction method.
Multi-label classification can be useful for categorizing items, such as pieces of music or news articles, that may belong to multiple classes or topics. The effect of using different pre-processing methods is also investigated, such as the use of N-grams, stop-word elimination, and stemming. Two different classifiers, an SVM and an ANN, are used for multi-label classification using a Binary Relevance approach. The results indicate that the choice of extraction method has a meaningful impact on the resulting classifications, but that no one method consistently outperforms the others. Instead the results show that the GloVe extraction method performs the best for the recall metrics, while the Bag of Words methods perform the best for the precision metrics.
Sammanfattning (Summary)
This work aims to investigate what effect different methods for extracting features from textual data have when they are used for multi-label classification of that data. Two methods based on Bag of Words are investigated, specifically the Count Vector method and the TF-IDF method. A method that uses word embeddings, called the GloVe method, is also investigated. Multi-label classification of data can be useful when the data, for example pieces of music or news articles, can belong to several classes or topics. The use of several different methods for pre-processing the data is also investigated, such as the use of N-grams, elimination of uninteresting words, and transformation of words with different inflected forms into a common stem. Two different classifiers, an SVM and an ANN, are used for the multi-label classification through a method called Binary Relevance. The results show that the choice of feature extraction method plays a meaningful role for the resulting multi-label classification, but that no single method gives the best results across all tests. Instead, the results indicate that the GloVe-based extraction method performs best on the recall metrics, while the Bag of Words methods perform best on the precision metrics.
Contents
1 Introduction
  1.1 Problem Statement
  1.2 Scope
  1.3 Objective
2 Background
  2.1 Multi-Label Classification
    2.1.1 Methods
  2.2 Content-Based Recommendation
    2.2.1 Exploitation and Exploration
  2.3 Pre-Processing
    2.3.1 Tokenization and N-grams
    2.3.2 Stemming
    2.3.3 Stop-Word Elimination
  2.4 Feature Extraction
    2.4.1 Bag of Words
    2.4.2 Term Frequency-Inverse Document Frequency
    2.4.3 Word Embeddings and GloVe
  2.5 Classifiers
    2.5.1 Support Vector Machine
    2.5.2 Artificial Neural Networks
  2.6 Related Work
    2.6.1 Word Embeddings for Single-Label Classification
    2.6.2 Bag of Words for Multi-Label Classification
    2.6.3 Research Gap
3 Method
  3.1 Dataset
  3.2 Pre-Processing
  3.3 Extraction Methods
    3.3.1 TF-IDF
    3.3.2 Bag of Words/Count Vector
    3.3.3 GloVe
  3.4 Classifiers
  3.5 Evaluation
    3.5.1 Precision
    3.5.2 Recall
    3.5.3 F-Score
4 Results
  4.1 TF-IDF
  4.2 Count Vector
  4.3 GloVe
  4.4 Effect of N-grams
  4.5 Stop-Words
  4.6 Stemming
  4.7 Best Scores
5 Discussion
  5.1 Effects of Preprocessing
    5.1.1 N-grams
    5.1.2 Stop-Word Elimination
    5.1.3 Stemming
  5.2 Extraction Methods
    5.2.1 TF-IDF and Count Vector
    5.2.2 Bag of Words vs. Word Embeddings
  5.3 Concerns
  5.4 Future Work
6 Conclusion
Bibliography
A Appended Material
Chapter 1 Introduction
The amount of digital content available on the Internet is steadily growing, and it can prove challenging to make the best use of the vast amount of data that is available. One way to handle this problem is to classify data into different categories, thus giving a better overview of what kinds of data are available. This can in turn, for instance, enable users of a news site to better filter out the articles that they are interested in, or enable users on social media to locate photos of themselves.
In order not to have to do this procedure of categorization manually, one can instead employ machine learning techniques to automate the process. Such techniques usually require distinguishing features of an object in order to be able to classify it.
There are several different ways to extract features from different types of data, and different methods may prove suitable in different situations. For instance, when classifying images of apples and oranges, a crude extraction method could be to take the average pixel values of the images. If the average pixel value is close to orange, then the image would be classified as an orange, and if it is closer to green then it would be classified as an apple. This method would however most likely also classify a tiger as an orange, since it does not take any feature other than color into consideration. By choosing a more appropriate extraction method this could hopefully be avoided.
Instead of only taking the average pixel values into account one could also consider features such as the shape, size and texture of the object.
Taking these features into account, an image of a tiger would most likely not be classified as an orange.
Categorized data items can, among other things, be used to provide recommendations of similar items to a user depending on which categories the user has previously shown an interest in. By using a multi-label approach, the problem of Exploitation and Exploration described in section 2.2.1 could hopefully be somewhat mitigated. This could be done by recommending items that might not be the ones most similar to the user's viewing history, but that still lie relatively close to other categories that the user has previously shown an interest in.
1.1 Problem Statement
The aim of this report is to evaluate how different methods for feature extraction affect a multi-label classification problem with textual data.
The effect of different pre-processing methods is also investigated. If the choice of extraction method has a significant impact on the final classification results, then it might be better to choose the right extraction method rather than spend too much time optimizing the classifier itself.
The questions that this report will attempt to answer are:
• Which one of the three feature extraction methods (Count Vector, TF-IDF and GloVe) performs the best?
• Does one of these feature extraction methods perform the best even when used with different classifier models?
• Are commonly used pre-processing methods always useful when applied to different feature extraction methods?
1.2 Scope
The main goal of this thesis is to examine the effect of feature extraction methods for multi-label classification of textual data. Three feature extraction methods are evaluated in conjunction with two classifier models. It would be possible to include more classifier models, but two are hopefully enough to examine whether or not extraction methods yield similar results when used with different classifiers.
The resulting classifications may in turn aid in building a simple recommendation system, but it is outside the scope of this thesis to evaluate such a system. Instead this work is to be viewed as examining the classification foundation for such a system. This work is being done in association with the Swedish Pensions Agency (Pensionsmyndigheten), who are interested in developing a prototype for a recommender system.
1.3 Objective
The objective of this report is to investigate how different approaches for feature extraction affect the results of multi-label classification.
Naturally, one wants results that are as good as possible when doing classification. If the choice of feature extraction method has a great impact, it might be worthwhile to spend more time on selecting an appropriate feature extraction method rather than on optimizing a certain classifier. This report also investigates whether it is always useful to apply commonly used pre-processing methods to text data when different extraction methods are used. Hopefully the results of this work could prove useful for trying to better classify multi-label data.
Chapter 2 Background
2.1 Multi-Label Classification
In single-label classification a data sample is assigned only one label (or category) from a set of disjoint labels. If the size of this set is equal to two then the task is known as binary classification. When the size of the set of labels is greater than two the task is called multi-class classification [12].
When it comes to multi-label classification, a data-point can be assigned multiple labels. This approach to classification is useful when the data to be classified can belong to multiple separate classes [12]. One example is music categorization, where a song can often relate to multiple different genres.
Giving news articles multiple different labels can enable users to sort out only the articles relating to topics that they are interested in [6].
In this case, using single-label classification would most likely have omitted a certain amount of articles for a topic that might have been more strongly related to other topics. For instance an article about North Korea and the Olympics might be strongly related to sports, but would also be of interest for someone wanting to read articles about global politics. Having multiple labels for news articles can also provide a foundation for a recommender system [6]. Multi-label classification has also proven to be useful for protein function classification and semantic classification of images [12].
2.1.1 Methods
There are two main approaches for multi-label classification, known as algorithm adaptation methods and problem transformation methods.
Algorithm adaptation methods transform a specific algorithm to be able to handle multi-label data. This approach is algorithm dependent, and it can prove challenging to adapt an algorithm to the multi-label classification task. An example of this method is the ML-kNN algorithm [24]. It extends the well-known kNN (k-Nearest Neighbors) algorithm, which classifies a sample based on the classes of its k nearest neighbors, to be able to handle multi-label data.
Problem transformation methods, on the other hand, are algorithm independent. Instead this approach uses several single-label classification tasks together to solve a multi-label classification problem, meaning that it is possible to take any single-label classification algorithm and use it for multi-label classification. One of the most popular problem transformation methods is called Binary Relevance. Using this method, one binary classifier is trained for each individual label. Each of these binary classifiers is then applied to the dataset, assigning either true or false to each data sample in the dataset depending on whether the classifier predicted that sample to belong to the label or not [12]. This is illustrated in figure 2.1. These results are then aggregated, meaning that a single data sample may be given multiple labels.
One possible drawback of this approach is that it does not take into account that some labels may be correlated with others. For instance a news article related to finance might be more likely to also be related to politics than to sports.
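The Binary Relevance procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this thesis: the class names `BinaryRelevance` and `NearestCentroid` are hypothetical, the toy nearest-centroid learner merely stands in for an SVM or ANN, and the two-dimensional feature vectors are made up.

```python
class NearestCentroid:
    """Toy binary classifier (a stand-in for an SVM or ANN)."""

    def fit(self, X, y):
        # One mean vector per class (True = has the label, False = does not).
        self.centroids = {}
        for target in (True, False):
            rows = [x for x, label in zip(X, y) if label == target]
            if rows:
                self.centroids[target] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def squared_distance(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        # Each sample gets the class of the closest centroid.
        return [
            min(self.centroids, key=lambda t: squared_distance(x, self.centroids[t]))
            for x in X
        ]


class BinaryRelevance:
    """Trains one independent binary classifier per label."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, X, label_sets):
        self.labels = sorted(set().union(*label_sets))
        # For each label, the target is True where the sample carries that label.
        self.models = {
            label: self.make_classifier().fit(X, [label in s for s in label_sets])
            for label in self.labels
        }
        return self

    def predict(self, X):
        per_label = {label: model.predict(X) for label, model in self.models.items()}
        # Aggregate the per-label decisions into one label set per sample.
        return [
            {label for label in self.labels if per_label[label][i]}
            for i in range(len(X))
        ]


# Made-up features: [finance-ness, sports-ness].
X_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.8]]
y_train = [{"finance"}, {"finance"}, {"sports"}, {"sports"}, {"finance", "sports"}]

model = BinaryRelevance(NearestCentroid).fit(X_train, y_train)
predictions = model.predict([[1.0, 0.0], [0.8, 0.9], [0.0, 1.0]])
```

Because every label is handled by its own classifier, the sketch also makes the drawback visible: nothing in `fit` lets the "finance" model learn from the "sports" targets, so correlations between labels are ignored.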
2.2 Content-Based Recommendation
One approach for how to recommend an item to a user is what is called content-based recommendation. What this entails is that a user is recommended an item based on a profile of that user, and the properties of the items that can be recommended [16]. For instance a system recommending a movie to a user might take attributes such as genre, director, writer and actors into account. If a user has previously shown an interest in movies directed by Martin Scorsese, the recommender system may try to find other Martin Scorsese movies to recommend to the user. This is an example using structured data, where items are described by the same set of attributes and it is known which values these attributes can have [16].
Figure 2.1: Illustration of the Binary Relevance approach.
In contrast to structured data there is also unstructured data, such as news articles, where there are no predetermined attributes which can take on well-defined values. Instead we only have the free-flowing text of the articles, and the text of two separate articles can differ greatly from each other both in length and content. It can therefore be harder to deal with unstructured data compared to structured data due to the complexities of natural language. Some of the problems that arise are due to factors such as polysemy (words can have several different meanings) and synonymy (different words can have the same meaning) [16].
2.2.1 Exploitation and Exploration
When recommending an item to a user, there is a trade-off between what is called Exploitation and Exploration. Exploitation means taking advantage of (or exploiting) the information available in the user's profile. Exploration, on the other hand, means recommending items that the user may not have shown as great an interest in compared to some other items [21].
By relying too much on the Exploitation approach, a user might only receive recommendations that are very similar to previously visited items. This may result in a sort of echo chamber, where the user is recommended items so similar to each other over and over again that the recommendations become close to useless. An example of this is a user on a website selling books who has shown interest in a couple of books by Stephen King and is then recommended nothing but more books written by Stephen King. The user could most likely have found these books without the help of the recommendation system.
In contrast, when relying too much on Exploration a user might receive recommendations that have little, or nothing, to do with the items that the user has previously shown an interest in. An extreme example is a user who is recommended nothing but Tolstoy's "War and Peace" and similar literature after having shown interest in nothing but cookbooks.
Finding a sufficient balance between Exploitation and Exploration would ensure that the user is recommended relevant items that the user will actually find useful, while at the same time also recommending items that the user otherwise probably would not have found on their own [21].
2.3 Pre-Processing
Previous research has shown that applying pre-processing methods to unstructured textual data may have a noticeable effect on resulting text mining [2]. The purpose of these pre-processing methods is to transform the data into a more manageable format for the feature extractor, and also to remove superfluous information.
2.3.1 Tokenization and N-grams
The process of tokenization divides the text data into pieces (or tokens), and often also removes certain special characters such as apostrophes, commas and periods [2]. The text is also often normalized to be lower-case only. The tokens usually consist of either a single word or what is called an N-gram, meaning that N consecutive words are combined into a single token. The idea is to preserve some of the information that is stored in the order of the words. For example when using 2-grams, the semantic difference between "cat calling" and "cat food" would not get lost, which it would when using 1-grams. For instance the sentence "These are my tokens" would be split into the following tokens when using 2-grams: "These are", "are my", "my tokens". This approach has proven to yield better classification results in some cases [23].
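The tokenization and N-gram steps above can be illustrated with a short Python sketch. The function names `tokenize` and `ngrams` are hypothetical, and the regular expression is only one possible tokenization rule (it is an assumption here that only runs of the letters a to z count as tokens, so a word such as "don't" would be split in two).

```python
import re


def tokenize(text):
    """Lower-case the text and keep only runs of the letters a-z as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("These are my tokens.")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Note that a sentence with four words yields three 2-grams but only two 3-grams, so larger values of N produce fewer and rarer tokens per document.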
2.3.2 Stemming
Applying stemming to a word means breaking it down to its root, or stem. By using this technique words that have the same meaning, but different forms, are transformed into their common stem [2].
The idea behind employing this technique before feature extraction is to capture distinguishing features that otherwise might have been lost due to infrequency. For instance a short article containing the words "educator", "educational" and "educated" exactly once probably relates to education. However, since these three words are different from each other they might not be recognized as distinguishing features for the article. Instead they would be recognized as three separate features. By utilizing stemming these three words might be transformed into a common stem, thus enabling the possibility to recognize the three separate words as a single distinguishing feature.
There are several different algorithms that perform stemming, one of the more popular ones being the Porter stemming algorithm [19].
The Porter stemming algorithm works in part by rule-based suffix stripping of words. For instance the words "operate", "operation" and "operating" will all be reduced to their common stem "oper". However, it is possible that words that do not share the same meaning are reduced to the same stem.
2.3.3 Stop-Word Elimination
Words that are common across all documents in a corpus are not very useful for distinguishing documents from each other. Some of these can therefore be removed through a process called stop-word elimination. Conjunctions (for, and, but) and pre/post-positions (in, of, with) are examples of stop-words [2].
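As a minimal sketch of stop-word elimination, the following Python snippet filters a token list against a stop-word set. The set `STOP_WORDS` contains only the six example words mentioned above and is purely illustrative; a real stop-word list would be considerably longer, and the function name `remove_stop_words` is hypothetical.

```python
# Illustrative stop-word list only: the six examples given in the text.
STOP_WORDS = {"for", "and", "but", "in", "of", "with"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set, in original order."""
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words(["price", "of", "oil", "and", "gas", "in", "europe"])
```

The comparison is case-sensitive, so the sketch assumes that the tokens have already been converted to lower case.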
2.4 Feature Extraction
In most articles the actual text is both high-dimensional and unstructured, meaning that every unique word can be seen as a separate dimension and that different articles are structured in different ways. This can make it difficult to apply many classifiers to the raw text. Therefore one can first extract the most distinguishing features of a text, thus reducing dimensionality. This process is called feature extraction, or text mining.
2.4.1 Bag of Words
One of the simplest types of feature extraction models is called Bag of Words. The name Bag of Words refers to the fact that this model does not take the order of the words into account. Instead one can imagine that every word is put into a bag, where the ordering of the words gets lost. Although there exist a few different variations of this model, the most common one is to simply count the number of occurrences of each word within a document and keep the result in a vector (hereafter referred to as a count vector) [2]. This way the frequencies of the terms remain intact, although grammar and order are lost [15]. Another approach is to instead have a binary vector keeping track of whether or not a word exists within a document. However, this approach also loses the multiplicity of the words in addition to the order and grammar.
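The difference between the count vector and the binary vector can be shown with a small Python sketch. The helper names are hypothetical and the two toy documents are made up; the vocabulary is simply the sorted set of all tokens in the corpus, which is an assumption of this illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of every distinct token in the corpus (one dimension each)."""
    return sorted({token for doc in documents for token in doc})


def count_vector(doc, vocabulary):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


def binary_vector(doc, vocabulary):
    """1 if the vocabulary term occurs in the document at all, otherwise 0."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]


docs = [["oil", "price", "oil"], ["price", "rise"]]
vocabulary = build_vocabulary(docs)
counts_first = count_vector(docs[0], vocabulary)
binary_first = binary_vector(docs[0], vocabulary)
```

For the first document the count vector records that "oil" occurs twice, while the binary vector only records that it occurs, which is the loss of multiplicity mentioned above.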
2.4.2 Term Frequency-Inverse Document Frequency
One method that has proven itself to be both simple and effective for feature extraction is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is an information retrieval technique that can be used to determine the relevance of terms in documents in relation to a query [6]. In this case it can be used for feature extraction by determining which terms in a document are most distinguishing for that document.
This method can also be viewed as a form of Bag of Words model, since it does not take grammar or order into consideration.
TF-IDF consists of two steps, first calculating the term frequency (TF), and then calculating the inverse document frequency (IDF). There are several variants of both of these parts.
One variant of TF (formally defined in equation 2.1) works by first counting how many times a term occurs in a document, just as is done for the count vector. The reasoning here is that words that occur frequently in a document are probably more important than words that do not. The result is then normalized by dividing it by the number of words in the whole document. This normalization is done in order to prevent a bias towards longer documents, so that we get the frequency with which the term occurs and not just the raw count of the term [13].
$$\mathrm{tf}_{t,d} = \frac{n_{t,d}}{\sum_{k} n_{k,d}} \tag{2.1}$$
Where $n_{t,d}$ is the number of times that term $t$ occurs in document $d$, and $\sum_{k} n_{k,d}$ is the total number of occurrences of all terms in document $d$.
To calculate the IDF part of the formula, one variant (formally defined in equation 2.2) is to take the total number of documents in the corpus and divide it by the number of documents where the term appears. The logarithm of the result is then taken. The IDF part of the formula acts as a form of weight-assigner, giving more weight to terms that appear in few documents and less weight to terms that appear in most documents.
$$\mathrm{idf}_{t} = \log \frac{|D|}{|D_t|} \tag{2.2}$$
Where $|D|$ is the total number of documents, and $|D_t|$ is the number of documents where the term $t$ appears.
By multiplying the TF part and the IDF part for a certain term, we get a measure of how distinguishing that term is. In an instance where a corpus of news articles is used, the word "company" would probably have a fairly high TF-IDF score since it would occur often in articles related to business. It would, however, not be common across the rest of the corpus, since news articles relating to sports, culture, and other topics would probably not contain that particular term very often. In contrast a common word like "today" will probably get a fairly low TF-IDF score, since it is likely to appear in news articles relating to every news topic, and is therefore not a very distinguishing term [13].
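The two formulas can be combined into a short worked example in Python. This is a sketch of the variant in equations 2.1 and 2.2 only, with the natural logarithm assumed as the base (the base is not specified in the text); the function names and the three-document toy corpus are made up for illustration.

```python
import math
from collections import Counter


def term_frequencies(doc):
    """Equation 2.1: count of each term divided by the document length."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}


def inverse_document_frequencies(docs):
    """Equation 2.2: log of (number of documents / documents containing the term)."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / df) for term, df in doc_freq.items()}


def tf_idf(docs):
    """One dictionary per document, mapping each term to its TF-IDF score."""
    idf = inverse_document_frequencies(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
        for doc in docs
    ]


corpus = [
    ["company", "profit", "today"],
    ["match", "goal", "today"],
    ["company", "shares", "today", "company"],
]
scores = tf_idf(corpus)
```

In this toy corpus "today" appears in every document, so its IDF is log(3/3) = 0 and its score is zero everywhere, whereas "company" appears in two of the three documents and receives a positive score that is highest in the document where it makes up half of the words.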
One of the weaknesses of this method, and similar Bag of Words methods in general, is that it does not take context and synonyms into consideration. It would for instance not take into account that the terms "Priest" and "Reverend" refer to a similar subject. Nor would it take into account that the term "Apple" could refer to either the company Apple Inc. or the fruit depending on the context.
2.4.3 Word Embeddings and GloVe
When using word embedding models, words are represented in a real-valued vector space [22]. A good word embedding would ideally represent words in such a way that two different words with similar semantic meanings would have similar vector space representations.
Other linguistic relationships between different words can also be preserved. For instance when using these vector-space representations for words, the operation "King - Man + Woman" yields a result that is very similar to the vector-space representation for the word "Queen" [18]. These sorts of relationships are further illustrated in figure 2.2, where it is possible to see that the gender-based differences between several pairs of terms that otherwise share the same meaning are very similar to each other.
Words closest to "frog"
Rank  Word
0     frog
1     frogs
2     toad
3     litoria
4     leptodactylidae
5     rana
6     lizard
7     eleutherodactylus
Table 2.1: The nearest neighbors of the word "frog" in a GloVe model (Litoria, leptodactylidae, rana and eleutherodactylus are all different genuses/families of frogs) [9].
One commonly used word embedding model is GloVe (Global Vectors). GloVe is a statistical unsupervised learning model which uses a co-occurrence matrix to generate vector-space representations of words [18]. This is done by calculating how frequently different words co-occur within a specified context-window in a corpus. Dimensionality reduction is then performed on this co-occurrence matrix. Depending on the size of the corpus used for training, this method can require a large amount of memory and be computationally expensive.
Another example of a popular word embedding is the Word2vec model, which is a neural network-based tool developed by Google [15].
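The co-occurrence counting that GloVe starts from can be sketched as follows. This is a simplified illustration under stated assumptions, not the GloVe implementation itself: it only collects raw symmetric counts within a fixed window and omits refinements such as distance weighting as well as the subsequent fitting of the word vectors. The function name and the toy sentence are made up.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count how often two words occur within `window` positions of each other.

    Pairs are stored in sorted order, so the counts are symmetric.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look back at most `window` positions so every pair is counted once.
        for j in range(max(0, i - window), i):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts


tokens = ["the", "frog", "and", "the", "toad"]
counts = cooccurrence_counts(tokens, window=2)
```

With a window of two, "frog" and "the" co-occur twice in this sentence, while "frog" and "toad" are three positions apart and therefore never co-occur. On a large corpus the number of distinct pairs grows quickly, which is one reason for the memory cost mentioned above.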
Figure 2.2: Illustration of linear substructures when using GloVe [9].
2.5 Classifiers
There are several different types of classifiers, with different benefits and drawbacks often depending on the problem to be solved.
2.5.1 Support Vector Machine
Since textual data is often high-dimensional, there is a risk of overfitting. Support Vector Machines (SVMs) have built-in overfitting protection, which is one of the reasons why they have been proven to work well for text categorization [11].
The idea behind SVMs is to find a linear hyperplane (or decision boundary) which separates the data-points of one class from the rest, and to do this in a way that maximizes the margin between them, as illustrated in figure 2.3. With a maximized margin, the likelihood of a future misclassification is lower than if the margin had been smaller [1]. If the data is not linearly separable it is possible to transform the data into a higher dimension by utilizing what is called the kernel trick. In this higher dimensional space we can then find an optimal separating hyperplane [3].
Figure 2.3: SVM with maximized margin [10].
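For reference, the margin maximization can be written as an optimization problem. The formulation below is the standard hard-margin form for linearly separable data; the notation is an assumption introduced here (training samples $\mathbf{x}_i$ with labels $y_i \in \{-1, +1\}$, weight vector $\mathbf{w}$ and bias $b$) and is not taken from the cited sources.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```

The separating hyperplane is $\mathbf{w} \cdot \mathbf{x} + b = 0$, and the width of the margin is $2 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ is equivalent to maximizing the margin.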
2.5.2 Artificial Neural Networks
The structure of an Artificial Neural Network (ANN) is inspired by the structure of biological brains. Several nodes, or neurons, compose a network capable of solving classification problems after sufficient training [3]. The neurons have what are called activation functions, which determine whether or not a neuron should be activated based on the input to the neuron.
The network structure is often divided into several interconnected layers. There are three different types of layers: input layers, hidden layers and output layers. The input data for the network is fed into the input layer. The hidden layer or layers take weighted input from the previous layer in the topology of the network and feed it forward, either to another hidden layer or to the output layer. The output layer takes the data that it receives and transforms it to become the final output of the whole network [3]. An example of such an artificial neural network is illustrated in figure 2.4.
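The feed-forward computation described above can be sketched in plain Python. This is an illustration only: the sigmoid activation, the layer sizes and all weights are assumptions chosen for the example, and the training of the weights is left out entirely.

```python
import math


def sigmoid(z):
    """Activation function that squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases, activation):
    """Each neuron takes a weighted sum of its inputs plus a bias, then activates."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the input through every layer in order and return the final output."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases, sigmoid)
    return activations


# Made-up weights: two inputs, one hidden layer of three neurons, one output.
hidden = ([[0.5, -0.5], [1.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
output = ([[1.0, -1.0, 0.5]], [0.0])
prediction = forward([1.0, 1.0], [hidden, output])
```

In a Binary Relevance setting, a single output value of this kind could be thresholded to decide whether a document is assigned one particular label.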
Figure 2.4: Example of an ANN with one hidden layer [4].
2.6 Related Work
2.6.1 Word Embeddings for Single-Label Classification
In [15], the authors apply a Word2vec-based word embedding technique for feature extraction in order to solve a single-label classification problem in the text domain. They compare their results against a TF-IDF extraction technique, and also investigate the use of stop-words. Their results indicate that their Word2vec extraction technique did not outperform TF-IDF. However a combination of both Word2vec and TF-IDF is able to yield the highest accuracy by a small margin in only some cases. Their results also show that the use of stop-words does not always increase performance. In order to outperform a TF-IDF extraction method with stop-words, the authors had to combine Word2vec and TF-IDF without using stop-words.
2.6.2 Bag of Words for Multi-Label Classification
In [20] a number of different Bag of Words methods are investigated for multi-label classification of news articles. Several multi-label classification methods are investigated in conjunction with different classifiers. The results show that using Binary Relevance as a multi-label classification method, combined with an SVM-classifier, performs well when used on their dataset. However after a feature selection method is applied, another method (Calibrated Label Ranking) performs marginally better.
2.6.3 Research Gap
None of the previous works examined evaluates the GloVe extraction method, nor do any of them use word embedding methods for multi-label classification. This report will attempt to cover this gap by evaluating the GloVe extraction method against Bag of Words methods when used for multi-label classification. The use of pre-processing methods has been examined in previous works, but not as extensively as in this report.
Chapter 3 Method
3.1 Dataset
Since the dataset provided by the Swedish Pensions Agency ultimately proved to be too small, the publicly available multi-label dataset "Reuters-21578, Distribution 1.0" was used [14]. This dataset is based on data recorded from the Reuters newswire in 1987. It has been used as a benchmark in several research papers on information retrieval, text classification and natural language processing [7] [11]. Therefore the results of this work are hopefully comparable to other similar works within the area, without having to account for unique datasets.
The dataset contains 21,578 documents which have been categorized manually into 120 different topics. For this thesis a commonly used subset of the dataset has been employed. This subset is referred to as the ModApte split, and contains 9,603 training documents and 3,299 test documents [14]. However, after only taking into consideration categories that have at least one training document and one test document, there are only 90 categories. This subset of the ModApte split is sometimes referred to as the R(90) subset. After this pruning there are 7,769 training documents and 3,019 test documents, with an average label cardinality (number of labels per document) of 1.2. The division into training set and test set was based on the dates on which the data was recorded.
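The label cardinality mentioned above is the mean number of labels per document, which can be computed as in the following sketch. The function name is hypothetical and the five label sets are made up for illustration; they are not actual documents from the dataset.

```python
def label_cardinality(label_sets):
    """Average number of labels assigned to each document."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)


# Made-up example: four documents with one label and one document with two.
example_labels = [{"earn"}, {"grain", "wheat"}, {"acq"}, {"crude"}, {"earn"}]
cardinality = label_cardinality(example_labels)
```

A cardinality close to 1 means that most documents carry a single label, so only a minority of the documents are multi-label in the strict sense.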
The dataset is not uniform with regard to the number of topics.
The five least populated topics only have one training sample and one test sample each. The most common topic has roughly a third of the documents belonging to it.
One possible problem with the dataset is that even though some data samples were supposed to have been assigned topics manually, they had not been [14]. This may prove to be a source of error if the multi-label classifier assigns a label to a data sample that is actually correct, but that had not been assigned during the manual classification of the dataset.
3.2 Pre-Processing
The first pre-processing step taken is tokenization of the text. Next all tokens are converted to be lower-case, and then all tokens are stemmed.
The Porter stemmer [19] implemented in the Natural Language Toolkit (NLTK) package [5] has been used for this project. Finally, characters that are not letters of the English alphabet are filtered out so that no periods, commas or other special characters are present in the tokens. Without doing this, the tokens "Apple" and "Apple!" would be regarded as separate tokens.
Additionally, for the usage of N-grams another step is required.
The use of both 2-grams and 3-grams is investigated, meaning that either two or three consecutive tokens are aggregated into a single token. N-grams are investigated since, as mentioned in section 2.3.1, they are able to preserve some of the semantic information that exists in the ordering of the words in a sentence. The use of N-grams up to size 3 was deemed to be sufficient in order to examine the effects of using increasing sizes of N-grams.
The effects of not using stop-word elimination and of not using stemming are also investigated. For these experiments only 1-grams are used since they proved to yield the best results during the previous experiments.
3.3 Extraction Methods
3.3.1 TF-IDF
The TF-IDF extractor is implemented using the scikit-learn (Sklearn) toolkit [17]. The extractor is applied to the pre-processed data, giving each document a vectorized representation based on the TF-IDF scores of the terms within each document. The length of these vectors is set to at most 100, meaning that at most 100 of the terms that have the highest TF-IDF scores are used. These vectorized representations are collected into a matrix which is then used as input for the classifiers.
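To make the construction of the TF-IDF matrix concrete, the following is a from-scratch sketch of the textbook formulation (relative term frequency multiplied by the logarithm of the inverse document frequency). It is not scikit-learn's implementation, which by default also smooths the inverse document frequency and L2-normalizes each vector. The function name `tfidf_matrix` is hypothetical, and ranking the terms by their summed score over the corpus is an assumption about how the 100 terms could be selected.

```python
import math
from collections import Counter


def tfidf_matrix(docs, max_features=100):
    """Return (vocabulary, matrix) of textbook TF-IDF scores for tokenized docs."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    # Per-document scores: relative term frequency times inverse document frequency.
    scored = []
    for doc in docs:
        counts = Counter(doc)
        scored.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})

    # Assumption: keep the terms with the highest summed score over the corpus,
    # breaking ties alphabetically.
    totals = Counter()
    for scores in scored:
        totals.update(scores)
    vocabulary = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]

    matrix = [[scores.get(t, 0.0) for t in vocabulary] for scores in scored]
    return vocabulary, matrix


docs = [["oil", "price", "oil"], ["grain", "price"], ["oil", "grain", "wheat"]]
vocabulary, matrix = tfidf_matrix(docs, max_features=2)
print(vocabulary)
print(matrix)
```

Each row of the matrix corresponds to one document and each column to one of the selected terms, which is the shape expected by the classifiers.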
3.3.2 Bag of Words/Count Vector
The Count Vector is similar to the TF-IDF vector, but it does not take the length of the current document, nor other documents within a class, into consideration. It simply counts the number of times that each word occurs within a document and stores the results in a vector. A maximum feature length of 100 is also chosen for this extraction method.
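The counting can be sketched in a few lines. The function name `count_vectors` is hypothetical, and selecting the most frequent terms in the corpus as the vocabulary is an assumption made for this illustration.

```python
from collections import Counter


def count_vectors(docs, max_features=100):
    """Return (vocabulary, matrix) of raw term counts for tokenized docs."""
    corpus_counts = Counter(term for doc in docs for term in doc)
    # Assumption: keep the most frequent terms, ties broken alphabetically.
    ranked = sorted(corpus_counts, key=lambda t: (-corpus_counts[t], t))
    vocabulary = ranked[:max_features]
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[t] for t in vocabulary])
    return vocabulary, matrix


docs = [["oil", "price", "oil"], ["grain", "price"]]
vocabulary, matrix = count_vectors(docs)
print(vocabulary)  # ['oil', 'price', 'grain']
print(matrix)      # [[2, 1, 0], [0, 1, 1]]
```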
3.3.3 GloVe
The GloVe feature extractor is implemented by employing a pre-trained GloVe model [9]. The model has been trained on data from Wikipedia and Gigaword 5 [9]. This data contains 6 billion tokens and a vocabulary of 400,000 words. The length of the vectorized word representations is chosen to be 100 in order for this approach to be fairly compared with the TF-IDF and Count Vector methods. However, pre-trained models using lengths of 50, 200, or 300 are also available [9].
As a feature extraction method, the resulting vectorized representation of the whole document is the average of the vectorized word representations of all words in the document. Since this approach takes both semantics and context into consideration, N-grams are not applied to the data for this method.
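The averaging step can be sketched as follows, with a toy dictionary standing in for the pre-trained model. The function name `document_vector` is hypothetical. How words missing from the vocabulary are treated is not stated above; skipping them, and returning an all-zero vector when no word is known, are assumptions of this sketch.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embedding vectors of all tokens found in `embeddings`."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # assumption: out-of-vocabulary words are skipped
        for i, value in enumerate(vector):
            total[i] += value
        known += 1
    if known == 0:
        return total  # assumption: all-zero vector when no word is known
    return [value / known for value in total]


# Toy 2-dimensional embeddings; the vectors used in this work have length 100.
toy = {"oil": [1.0, 0.0], "price": [0.0, 1.0]}
print(document_vector(["oil", "price", "unknownword"], toy, dim=2))  # [0.5, 0.5]
```

Because every document is mapped to a vector of the same fixed length, the result can be fed to the classifiers in the same way as the Bag of Words matrices.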
3.4 Classifiers
All classifiers are implemented using the problem transformation approach described in section 2.1.1, more specifically the Binary Relevance method. Binary Relevance was used since, as mentioned in section 2.6.2, it has proven to yield good results. It is also easier to implement than an Algorithm Adaptation approach.
Two different classifiers, an SVM classifier and an ANN classifier, were chosen in order to examine whether different extraction techniques might have different effects depending on the classifier used. The SVM classifier has, as mentioned in section 2.5.1, proven to perform well when used for text categorization. The ANN was chosen mostly to have something to compare the SVM against. The SVM classifier is implemented with a linear kernel. The ANN classifier is implemented using one hidden layer with a size of 100 nodes.
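The Binary Relevance transformation itself is independent of the classifier. The sketch below shows only the transformation: the multi-label targets are split into one binary target list per label, and per-label predictions are merged back into label sets. The function names are hypothetical, and the training of one binary classifier per label is deliberately left out.

```python
def binary_relevance_targets(label_sets):
    """Turn multi-label targets into one 0/1 target list per label."""
    labels = sorted(set().union(*label_sets)) if label_sets else []
    return {label: [1 if label in s else 0 for s in label_sets] for label in labels}


def combine_predictions(per_label_predictions):
    """Merge per-label 0/1 predictions back into one label set per document."""
    n_docs = len(next(iter(per_label_predictions.values()), []))
    return [
        {label for label, preds in per_label_predictions.items() if preds[i] == 1}
        for i in range(n_docs)
    ]


y = [{"earn"}, {"earn", "acq"}, {"grain"}]
targets = binary_relevance_targets(y)
print(targets["earn"])                    # [1, 1, 0]
print(combine_predictions(targets) == y)  # True
```

In a full pipeline, each list in `targets` would serve as the training target of a separate binary classifier, such as a linear SVM, and the predictions of these classifiers would be passed to `combine_predictions`.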
3.5 Evaluation
Two different types of metrics that are commonly used when evaluating supervised classifiers are macro-average metrics and micro-average metrics. Both of these usually consist of scores of precision, recall and a combination of the two called an F-score. The macro-average metrics are calculated on a class basis, and therefore do not take the size of the class into consideration. The micro-average metrics, on the other hand, are calculated on a document basis, meaning that larger classes will have a greater impact on the resulting metrics than smaller classes will [8].
Employing both of these types of metrics means that we can evaluate the performance of the classifier on both smaller and larger classes, which is useful since data is not always uniform in practice.
3.5.1 Precision
Precision measures how large a share of the positive predictions is correct. A data sample determined to be of a certain class can either be a true positive (TP), meaning that it was correctly classified, or a false positive (FP), meaning that it was incorrectly classified. Precision is formally defined as [3]:
$$P = \frac{TP}{TP + FP} \tag{3.1}$$
3.5.2 Recall
Recall is a metric that takes into account data points which should have been assigned to a certain class, but were not, meaning that they are false negatives (FN). The formal definition of recall is [3]:
$$R = \frac{TP}{TP + FN} \tag{3.2}$$
3.5.3 F-Score
Both precision and recall are important metrics to take into consideration. They are therefore often combined into a single metric called an F-score, which can be defined as [3]:
$$F = \frac{2RP}{R + P} \tag{3.3}$$
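The three definitions, together with the two averaging schemes from section 3.5, can be sketched as follows. The function names `prf` and `micro_macro` are hypothetical. Two choices are assumptions of the sketch rather than statements about the evaluation performed in this work: a score is set to 0 when its denominator is 0, and the macro F-score is taken as the mean of the per-class F-scores.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * r * p / (r + p) if r + p else 0.0
    return p, r, f


def micro_macro(true_sets, pred_sets):
    """Micro- and macro-averaged (P, R, F) for multi-label predictions."""
    labels = sorted(set().union(*true_sets, *pred_sets))
    if not labels:
        raise ValueError("no labels found")
    pairs = list(zip(true_sets, pred_sets))
    counts = []
    for label in labels:
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        counts.append((tp, fp, fn))
    # Micro: pool the counts of all classes, then compute the scores once.
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute the scores per class, then take the unweighted mean.
    per_class = [prf(*c) for c in counts]
    macro = tuple(sum(s[i] for s in per_class) / len(per_class) for i in range(3))
    return micro, macro


# Toy example: class "a" is large and always predicted, class "b" is never predicted.
true = [{"a"}, {"a"}, {"a"}, {"b"}]
pred = [{"a"}, {"a"}, {"a"}, {"a"}]
micro, macro = micro_macro(true, pred)
print(micro)  # (0.75, 0.75, 0.75)
print(macro)
```

In the toy example the macro scores are lower than the micro scores because the small class "b", which is never predicted, counts as much as the large class "a" in the macro average.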
Chapter 4 Results
4.1 TF-IDF
Table 4.1, which lists the classification scores when using the TF-IDF extraction method with varying N-gram lengths, clearly shows that the SVM classifier performs best with this extraction method. Five out of the six top scores belong to the SVM classifier when using N-grams of length 1. Only for the micro-precision metric does the ANN classifier outperform the SVM classifier, and it does so only by a very slight margin. Overall, the scores seem to decrease as the length of the N-grams increases, which is further illustrated in figures 4.1 and 4.2. Only the micro-precision scores for the ANN classifier show a notable improvement.
4.2 Count Vector
When using the Count Vector extraction method, the results in table 4.2 suggest that the ANN classifier outperforms the SVM classifier overall. The ANN classifier used in conjunction with N-grams of length 1 is responsible for four out of six top scores. Here we also see that increasing the length of the N-grams seems to have an overall negative impact on performance. Only the micro-precision scores increase as the N-gram length increases, as illustrated in figures 4.3 and 4.4.
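TF-IDF Extraction

| Metrics | Classifier | N-gram | Precision | Recall | F-score |
| --- | --- | --- | --- | --- | --- |
| Micro | SVM | 1 | 0.947 | 0.803 | 0.869 |
| Micro | SVM | 2 | 0.949 | 0.661 | 0.779 |
| Micro | SVM | 3 | 0.947 | 0.535 | 0.684 |
| Micro | ANN | 1 | 0.838 | 0.680 | 0.751 |
| Micro | ANN | 2 | 0.847 | 0.488 | 0.619 |
| Micro | ANN | 3 | 0.952 | 0.310 | 0.468 |
| Macro | SVM | 1 | 0.640 | 0.396 | 0.467 |
| Macro | SVM | 2 | 0.543 | 0.222 | 0.291 |
| Macro | SVM | 3 | 0.472 | 0.128 | 0.187 |
| Macro | ANN | 1 | 0.297 | 0.167 | 0.196 |
| Macro | ANN | 2 | 0.104 | 0.047 | 0.058 |
| Macro | ANN | 3 | 0.039 | 0.016 | 0.019 |

Table 4.1: Results from using TF-IDF extraction with different N-grams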
Figure 4.1: Illustration of micro metrics when using the TF-IDF extraction method.
Figure 4.2: Illustration of macro metrics when using the TF-IDF extraction method.
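Count Vector Extraction

| Metrics | Classifier | N-gram | Precision | Recall | F-score |
| --- | --- | --- | --- | --- | --- |
| Micro | SVM | 1 | 0.790 | 0.604 | 0.685 |
| Micro | SVM | 2 | 0.882 | 0.444 | 0.591 |
| Micro | SVM | 3 | 0.957 | 0.311 | 0.469 |
| Micro | ANN | 1 | 0.843 | 0.663 | 0.742 |
| Micro | ANN | 2 | 0.831 | 0.498 | 0.623 |
| Micro | ANN | 3 | 0.954 | 0.311 | 0.469 |
| Macro | SVM | 1 | 0.228 | 0.178 | 0.185 |
| Macro | SVM | 2 | 0.130 | 0.038 | 0.051 |
| Macro | SVM | 3 | 0.073 | 0.016 | 0.021 |
| Macro | ANN | 1 | 0.347 | 0.175 | 0.215 |
| Macro | ANN | 2 | 0.158 | 0.057 | 0.073 |
| Macro | ANN | 3 | 0.061 | 0.016 | 0.020 |

Table 4.2: Results from using Count Vector extraction with different N-grams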
Figure 4.3: Illustration of micro metrics when using the Count Vector extraction method.
Figure 4.4: Illustration of macro metrics when using the Count Vector extraction method
GloVe Extraction

| Metrics | Classifier | Precision | Recall | F-score |
| --- | --- | --- | --- | --- |
| Micro | SVM | 0.713 | 0.725 | 0.719 |
| Micro | ANN | 0.764 | 0.861 | 0.810 |
| Macro | SVM | 0.339 | 0.327 | 0.318 |
| Macro | ANN | 0.326 | 0.548 | 0.382 |

Table 4.3: Results from using the GloVe extraction method
4.3 GloVe
When applying the GloVe extraction technique to the data, the ANN classifier performs better than the SVM classifier by a significant amount, as illustrated in table 4.3 and in figures 4.5 and 4.6. Only for the macro-precision score does the SVM classifier yield a (slightly) better score than the ANN classifier.
4.4 Effect of N-grams
Overall, the usage of N-grams does not increase performance. In fact, it decreases the average performance of the classifiers significantly, as illustrated by tables 4.1 and 4.2. More specifically, it is the recall metrics that deteriorate greatly, for both the micro and the macro averages. However, the micro-precision score does increase with the length of the N-grams when using the Count Vector extraction method.
4.5 Stop-Words
When the removal of stop-words is not applied to the data, the ANN classifier in conjunction with the GloVe extraction method performs the best, as shown in table 4.4. Five out of the six highest scores belong to the ANN classifier used with GloVe. Only when looking at the micro-precision score does another approach (the SVM classifier with the TF-IDF extraction method) perform better.
Figure 4.5: Illustration of micro metrics when using the GloVe extraction method.
Figure 4.6: Illustration of macro metrics when using the GloVe extraction method.
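No stop-word elimination

| Metrics | Classifier | Extraction | Precision | Recall | F-score |
| --- | --- | --- | --- | --- | --- |
| Micro | SVM | TF-IDF | 0.891 | 0.574 | 0.698 |
| Micro | SVM | Count Vector | 0.745 | 0.585 | 0.656 |
| Micro | SVM | GloVe | 0.696 | 0.744 | 0.719 |
| Micro | ANN | TF-IDF | 0.830 | 0.641 | 0.723 |
| Micro | ANN | Count Vector | 0.826 | 0.642 | 0.723 |
| Micro | ANN | GloVe | 0.760 | 0.842 | 0.799 |
| Macro | SVM | TF-IDF | 0.227 | 0.089 | 0.112 |
| Macro | SVM | Count Vector | 0.237 | 0.167 | 0.165 |
| Macro | SVM | GloVe | 0.319 | 0.348 | 0.314 |
| Macro | ANN | TF-IDF | 0.257 | 0.135 | 0.164 |
| Macro | ANN | Count Vector | 0.320 | 0.152 | 0.191 |
| Macro | ANN | GloVe | 0.340 | 0.523 | 0.391 |

Table 4.4: Results when not applying stop-word elimination. N-grams of size 1 are used for all results in the table.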
Figure 4.7: Illustration of performance difference for micro metrics when not using stop-word removal compared to using stop-word removal
Figures 4.7 and 4.8 show an overall decrease in performance when stop-words are not removed. A major decrease in performance can be seen especially for the SVM classifier used in combination with the TF-IDF extraction method, both for micro and macro metrics. Only for the SVM combined with the GloVe extraction method is it possible to see an increase in score, and this is only true for the recall metrics.
4.6 Stemming
When observing the results in table 4.5 it is possible to see that using the ANN classifier with the GloVe extraction method yields the best result in five out of six categories, just as it does when no stop-word removal is used. Again, it is only for the micro-precision score that this approach does not achieve the best score.
When examining figures 4.9 and 4.10, which show the performance difference of not using stemming compared to using stemming, it is possible to see mixed results. The results for the SVM classifier used with the TF-IDF extraction method suffer when stemming is not applied. However, for the other classifier and extraction combinations, there are both metrics which increase and metrics which decrease when stemming is not used.
Figure 4.8: Illustration of performance difference for macro metrics when not using stop-word removal compared to using stop-word removal

No stemming

| Metrics | Classifier | Extraction | Precision | Recall | F-score |
| --- | --- | --- | --- | --- | --- |
| Micro | SVM | TF-IDF | 0.905 | 0.627 | 0.741 |
| Micro | SVM | Count Vector | 0.805 | 0.631 | 0.707 |
| Micro | SVM | GloVe | 0.753 | 0.747 | 0.750 |
| Micro | ANN | TF-IDF | 0.856 | 0.682 | 0.759 |
| Micro | ANN | Count Vector | 0.753 | 0.675 | 0.751 |
| Micro | ANN | GloVe | 0.791 | 0.835 | 0.813 |
| Macro | SVM | TF-IDF | 0.214 | 0.112 | 0.132 |
| Macro | SVM | Count Vector | 0.232 | 0.167 | 0.179 |
| Macro | SVM | GloVe | 0.368 | 0.362 | 0.352 |
| Macro | ANN | TF-IDF | 0.280 | 0.159 | 0.185 |
| Macro | ANN | Count Vector | 0.317 | 0.175 | 0.209 |
| Macro | ANN | GloVe | 0.397 | 0.524 | 0.429 |

Table 4.5: Results when not applying stemming. N-grams of size 1 are used for all results in the table.
Figure 4.9: Illustration of performance difference for micro metrics when not using stemming compared to using stemming
4.7 Best Scores
Table 4.6 shows the combination of extraction method and classifier that provided the highest score for each available metric.
Figure 4.10: Illustration of performance difference for macro metrics when not using stemming compared to using stemming
| Metrics | Metric | Description | Score |
| --- | --- | --- | --- |
| Micro | Precision | SVM with Count Vector, 3-gram | 0.958 |
| Micro | Recall | ANN with GloVe | 0.861 |
| Micro | F-score | SVM with TF-IDF, 1-gram | 0.869 |
| Macro | Precision | SVM with TF-IDF, 1-gram | 0.640 |
| Macro | Recall | ANN with GloVe | 0.548 |
| Macro | F-score | SVM with TF-IDF, 1-gram | 0.467 |

Table 4.6: Best score for each metric.
Chapter 5 Discussion
5.1 Effects of Preprocessing
5.1.1 N-grams
It seems as though the use of N-grams makes the classifier more restrictive. This probably introduces more false negative predictions, as is evident from the deteriorating recall scores. However, since the classifier becomes more restrictive, the chance of false positives decreases, which is shown by the increasing micro-precision scores. One possible reason for this could be that certain keywords that would be recognized as useful features on their own are not recognized as such when combined with one or more other keywords. An example of this could be two phrases from the same class: "Microsoft's revenue decreased" and "Apple's revenue remained constant". Here the word "revenue" would be recognized as a common distinguishing keyword for both examples when using 1-grams, but when using 2-grams the keywords would be "Microsoft's revenue" and "revenue decreased" for the first example. These keywords would not match any of the second example's keywords: "Apple's revenue", "revenue remained" and "remained constant".
The reason why the macro-precision scores do not increase is probably that, as mentioned in section 3.1, the dataset is not balanced, meaning that the classifier does not have enough training data for the less populated classes.
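The revenue example from this section can be checked with a few lines of code. The sketch below lower-cases the two phrases and splits them on whitespace, without stemming, which is a simplification of the pre-processing in section 3.2; the helper name `ngrams` is hypothetical.

```python
def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single token."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


first = "microsoft's revenue decreased".split()
second = "apple's revenue remained constant".split()

for n in (1, 2):
    shared = set(ngrams(first, n)) & set(ngrams(second, n))
    print(n, sorted(shared))
# 1 ['revenue']
# 2 []
```

With 1-grams the two phrases share the feature "revenue", while with 2-grams they share no feature at all, which is consistent with the drop in recall.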
5.1.2 Stop-Word Elimination
When stop-word elimination is omitted, the performance of all the classifier and extraction combinations generally decreases, as indicated by table 4.4. However, only a minimal performance decrease occurs when using the GloVe extraction method, compared to the results in table 4.3. This is further illustrated in figures 4.7 and 4.8. This could be because the semantics of the words are still intact when using a word embedding model, so that stop-words are recognized as not being very useful as distinguishing features in this case.
5.1.3 Stemming
When stemming is not applied to the data, the performance differences compared to when it is applied vary considerably. When examining figures 4.9 and 4.10 we see that almost all of the results when using the GloVe extraction method show a small improvement. This suggests that since the semantic meaning of different forms of the same word is preserved when using GloVe, stemming might not be necessary when using it.
When examining the results for the TF-IDF method in figures 4.9 and 4.10, we see a moderate decrease in performance for the SVM classifier when looking at the micro metrics and a great decrease when regarding the macro metrics.
For the TF-IDF method used with the ANN classifier we see a slight increase in performance for the micro metrics and a slight decrease for the macro metrics. For the Count Vector method we see a sharp decline in micro-precision. It seems as though the TF-IDF method and the Count Vector method are more sensitive to different forms of the same word than the GloVe extraction method, which is logical since these methods do not preserve the semantic meanings of the terms examined.
5.2 Extraction Methods
5.2.1 TF-IDF and Count Vector
For the SVM classifier, the TF-IDF method performed better on average than the Count Vector method, as is evident when examining tables 4.1 and 4.2. This difference in performance is most likely due to the fact that TF-IDF both normalizes with regard to the length of the document and takes the rarity of the terms across all documents into consideration. This is in contrast to the Count Vector, which gives a bias to longer documents since terms are likely to occur a greater number of times in them. Nor does the Count Vector take into account the frequency of terms across all documents in the corpus. However, when looking at the ANN classifier, the results are quite similar for the TF-IDF method and the Count Vector method. The micro-average F-score is marginally better when using the TF-IDF method, while the macro-average F-score is slightly better when using the Count Vector.
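The bias toward longer documents can be illustrated with a toy document and a copy of it that is twice as long. The sketch compares raw counts with relative term frequencies; dividing by the document length is one common form of length normalization and is used here as an assumption, since scikit-learn by default normalizes the vectors by their L2 norm instead. The function names are hypothetical.

```python
from collections import Counter


def raw_counts(tokens):
    """Raw number of occurrences of each term, as in a Count Vector."""
    return Counter(tokens)


def relative_frequencies(tokens):
    """Term counts divided by the document length."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


short = ["oil", "price", "oil"]
long = short * 2  # same content, twice the length

print(raw_counts(short)["oil"], raw_counts(long)["oil"])  # 2 4
print(relative_frequencies(short)["oil"] == relative_frequencies(long)["oil"])  # True
```

The raw count doubles although the content is the same, whereas the length-normalized value is unchanged.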
5.2.2 Bag of Words vs. Word Embeddings
The Bag of Words models look to be more sensitive to the different pre-processing methods that are used, while the Word Embedding model yields similar results regardless. Which approach is better is debatable, and depends largely on which metric is most important for the problem at hand. When examining table 4.6, which displays the approach that yielded the best score for each metric, we see that all extraction methods are present.
The SVM classifier used with the TF-IDF extraction method yielded three out of six top scores. However, if recall is the most important metric, then the ANN classifier used with the GloVe extraction method is the best choice. An example of when this is the case could be a recommender system where leaving out correct recommendations can be very costly, for instance when recommending items available for purchase online, since omissions might result in lost sales. For some other sort of system where multi-label classification is used, precision might be much more important, namely where a false positive prediction is very costly but a false negative prediction does not have a great impact.
Another important factor to keep in mind is how balanced the data is. If it is known that the dataset is heavily imbalanced, and it is likely that future data will be as well, then it might be better to focus on the micro-average metrics in this report. If, on the other hand, it is more important to classify correctly across all different categories then one should focus on the macro-average metrics instead.
5.3 Concerns
More accurate multi-label classifications would hopefully lead to a more efficient use of both energy and time. When applied to a recommender system, it would hopefully mean that people could get relevant items or articles recommended to them both faster and more accurately. When recommending items available for purchase, this could mean that people start purchasing items more frequently. This would lead to increased revenue for the company selling the products, but it would also most likely increase the number of deliveries to be made. This would in turn lead to an increase in emissions of greenhouse gases.
When it concerns articles regarding pensions, which was the original starting point of this work, it would mean that people would hopefully be more informed regarding their financial decisions and both their future and current well-being. Moreover, it would ideally reduce the need for direct communication between the Swedish Pensions Agency and people with questions regarding their pension. This could lead to reduced spending of tax money, but could also mean that some people would lose their jobs.
5.4 Future Work
Future work could include examining how different feature extraction methods affect the results when using different strategies for multi-label classification, not only Binary Relevance as has been done here. One could also examine more types of classifiers, and how different extraction methods affect the results when using them. Other extraction techniques could also be evaluated, such as Word2vec or fastText for word embeddings, and a binary vector or BM25 for Bag of Words.
Finally, it would be interesting to see if the classifications provided could mitigate the problem of Exploration and Exploitation present in many recommender systems. Hopefully this could be achieved in part by expanding the users' profiles by utilizing multi-labeled data, thereby allowing them to explore new areas while at the same time being recommended relevant items.
Chapter 6 Conclusion
There is no one extraction method that outperforms the rest across all metrics. Nor is there an extraction method that performs the best across both classifiers. However, the extraction method can have a significant impact on the results of multi-label classification. The best choice of extraction method depends on what the multi-label classifications are to be used for. If the priority lies with not producing false negatives, then the results of this work indicate that the GloVe extraction method is the best choice. If, however, not producing false positives is the highest priority, then a Bag of Words extraction method is the superior choice.
The pre-processing methods investigated did not always yield better results when applied to the data. The use of N-grams generally provided worse classification outcomes. However, it did prove to notably increase the micro-average precision of the ANN classifier.
Removal of stop-words generally had a positive effect on the results. Only when using the GloVe extraction method did it prove to be detrimental for some metrics. Not removing stop-words had a great negative effect on the SVM classifier when the TF-IDF extraction method was used.
Applying stemming to the data gave mixed results. Again, the results when using the SVM classifier with the TF-IDF extraction method suffered greatly when this pre-processing step was not applied. The rest of the classifier and extraction combinations showed mostly increased scores when looking at the micro metrics, and slight decreases for some of the macro metrics when stemming was not applied. The GloVe extraction method seemed to actually perform better when stemming was not applied, suggesting that stemming should not always be applied.
Furthermore, the structure of the data that is to be used is also an important factor to take into consideration when choosing an extraction method. The dataset used for this work was heavily imbalanced real-world data. For this dataset, the best overall result was achieved by the TF-IDF method used in conjunction with stop-word elimination, stemming, no N-grams, and an SVM classifier.
Bibliography
[1] Gediminas Adomavicius and Alexander Tuzhilin. "Context-aware recommender systems". In: Recommender systems handbook. Springer, 2015, pp. 191–226.
[2] Mehdi Allahyari et al. "A brief survey of text mining: Classification, clustering and extraction techniques". In: arXiv preprint arXiv:1707.02919 (2017).
[3] Xavier Amatriain and Josep M Pujol. "Data mining methods for recommender systems". In: Recommender systems handbook. Springer, 2015, pp. 227–262.
[4] Artificial neural network. https://en.wikipedia.org/wiki/Artificial_neural_network. Accessed: 2018-05-28.
[5] Steven Bird, Edward Loper, and Ewan Klein. "Natural Language Processing with Python". 2009.
[6] Zach Chase, Nicolas Genain, and Orren Karniol-Tambour. "Learning Multi-Label Topic Classification of News Articles". 2014.
[7] Franca Debole and Fabrizio Sebastiani. "An analysis of the relative hardness of Reuters-21578 subsets". In: Journal of the Association for Information Science and Technology 56.6 (2005), pp. 584–596.
[8] George Forman. "A pitfall and solution in multi-class feature selection for text classification". In: Proceedings of the twenty-first international conference on Machine learning. ACM. 2004, p. 38.
[9] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/. Accessed: 2018-05-19.
[10] Introduction to Support Vector Machines. https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html. Accessed: 2018-03-29.
[11] Thorsten Joachims. "Text categorization with support vector machines: Learning with many relevant features". In: European conference on machine learning. Springer. 1998, pp. 137–142.
[12] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. "Multilabel text classification for automated tag suggestion". In: Proceedings of the ECML/PKDD. Vol. 18. 2008.
[13] Sungjick Lee and Han-joon Kim. "News keyword extraction for topic tracking". In: Networked Computing and Advanced Information Management, 2008. NCM'08. Fourth International Conference on. Vol. 2. IEEE. 2008, pp. 554–559.
[14] David D. Lewis. Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt. Accessed: 2018-03-29.
[15] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. "Support vector machines and word2vec for text classification with semantic features". In: Cognitive Informatics & Cognitive Computing (ICCI*CC), 2015 IEEE 14th International Conference on. IEEE. 2015, pp. 136–140.
[16] Michael J Pazzani and Daniel Billsus. "Content-based recommendation systems". In: The adaptive web. Springer, 2007, pp. 325–341.
[17] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation". In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014, pp. 1532–1543.
[19] Martin F Porter. "An algorithm for suffix stripping". In: Program 14.3 (1980), pp. 130–137.
[20] Dyah Rahmawati and Masayu Leylia Khodra. "Automatic multilabel classification for Indonesian news articles". In: Advanced Informatics: Concepts, Theory and Applications (ICAICTA), 2015 2nd International Conference on. IEEE. 2015, pp. 1–6.
[21] Neil Rubens et al. "Active learning in recommender systems". In: Recommender systems handbook. Springer, 2015, pp. 809–846.
[22] Tobias Schnabel et al. "Evaluation methods for unsupervised word embeddings". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 298–307.
[23] Abinash Tripathy, Ankit Agrawal, and Santanu Kumar Rath. "Classification of sentiment reviews using n-gram machine learning approach". In: Expert Systems with Applications 57 (2016), pp. 117–126.
[24] Min-Ling Zhang and Zhi-Hua Zhou. "ML-KNN: A lazy learning approach to multi-label learning". In: Pattern recognition 40.7 (2007), pp. 2038–2048.
Appended Material
Reuters-21578, Distribution 1.0 dataset available at: http://www.daviddlewis.com/resources/testcollections/reuters21578