Technology and Society, Systems Developer
Bachelor's thesis, 15 credits, first cycle
How to explain graph-based
semi-supervised learning for
non-mathematicians?
Lucas Borg
Mattias Jönsson
Degree: Bachelor of Science, 180 credits
Main field: Computer Science
Programme: Systems Developer
Date of final seminar: 2019-06-03
Supervisor: Gion Koch Svedberg
Examiner: Mia Persson
Summary
The large amount of data available on the internet can be used to improve predictions made through machine learning. The problem is that such data is often in a raw format and requires someone to manually assign labels to the collected data before it can be used by the algorithm. Semi-supervised learning (SSL) is a technique where the algorithm uses a small number of prepared examples and then automatically assigns labels to the remaining data. One approach within SSL is to represent the data in a graph, known as graph-based semi-supervised learning (GSSL), and then find similarities between the nodes in the graph to automatically assign labels.
Our goal in this thesis is to simplify the advanced processes and steps required to implement a GSSL-algorithm. We cover basic steps such as how to set up the development environment, as well as more advanced steps such as data preprocessing and feature extraction. The feature extraction methods used in the thesis are bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF). Finally, we present document classification with Label Propagation (LP) and Multinomial Naive Bayes (MNB) together with a detailed description of how GSSL works.
We also present the performance of the classification algorithms by classifying the 20 Newsgroups dataset with LP and MNB. The results are documented with two different evaluation scores, F1-score and accuracy. We also compare MNB and LP with two different types of kernels, KNN and RBF, on different amounts of labeled training documents. The results from the classification algorithms show that MNB is better at classifying the dataset than LP.
Abstract
The large amount of available data on the web can be used to improve the predictions made by machine learning algorithms. The problem is that such data is often in a raw format and needs to be manually labeled by a human before it can be used by a machine learning algorithm. Semi-supervised learning (SSL) is a technique where the algorithm uses a few prepared samples to automatically prepare the rest of the data. One approach to SSL is to represent the data in a graph, also called graph-based semi-supervised learning (GSSL), and find similarities between the nodes for automatic labeling.
Our goal in this thesis is to simplify the advanced processes and steps to implement a GSSL-algorithm. We will cover basic tasks such as setup of the developing environment and more advanced steps such as data preprocessing and feature extraction. The feature extraction techniques covered are bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF). Lastly, we present how to classify documents using Label Propagation (LP) and Multinomial Naive Bayes (MNB) with a detailed explanation of the inner workings of GSSL.
We showcase the classification performance by classifying documents from the 20 Newsgroup dataset using LP and MNB. The results are documented using two different evaluation scores called F1-score and accuracy. A comparison between MNB and the LP-algorithm using two different types of kernels, KNN and RBF, was made on different amounts of labeled documents. The results from the classification algorithms show that MNB is better at classifying the data than LP.
Keywords: graph-based SSL, Label Propagation, Naive Bayes, KNN, RBF, feature extraction, 20 Newsgroups, preprocessing, graph construction
Acknowledgement
We would like to thank Gion Koch Svedberg for his extraordinary support and encouragement throughout this thesis project.
Table of Contents
1 Introduction
2 Method
2.1 Design and Creation
3 Implementation
3.1 Installing the tools
3.1.1 Verifying the installation
3.2 Dataset
3.3 Data preprocessing
3.3.1 Data cleaning
3.3.2 Data reduction
3.3.3 Feature extraction
3.4 Classification algorithms
3.4.1 Label Propagation
3.4.2 Multinomial naive Bayes
4 Result
4.1 Label Propagation
4.1.1 Graph construction
4.1.2 Propagate labels
4.2 Classification comparison
5 Discussion and conclusion
References
Appendix A
Appendix B
1 Introduction
The amount of available data on the web is constantly increasing with technologies such as social media and the rise of IoT. Machine learning is a field where such data can be used to train a model which in turn makes predictions on future data. This is useful, for example, for predicting future house prices or the species of different flowers. Before the model can make such predictions the algorithm must be provided with training data. The training data often contains features with labels. This means that each data row has input values (features) tagged with a category or number telling the algorithm what the output for the input should be (label). For example, a model that predicts house prices could be trained with square meters as a feature and the selling price as a label. By tagging, or labeling, the data, the model can find relations between different houses' square meters and house prices. Predictions, based on the training data, can now be made by providing the model with the square meters, and it will tell us the house price.
The problem is that the data found on the web is not labeled by default and the model requires a large amount of labeled data to make accurate predictions. A solution is to manually label the collected data, but this is a time-consuming and expensive task [15][20]. The manual approach, where a human labels all the data, is called supervised learning (SL). One approach to solve the labeling issue is to use unsupervised learning (UL). A UL-algorithm is the opposite of an SL-algorithm and requires no manual labeling. Instead, it searches for commonalities in the data and automatically assigns labels.
The third approach, which has been shown to be more accurate than unsupervised learning, is semi-supervised learning (SSL). SSL-algorithms combine the advantages of both supervised and unsupervised learning. This enables an SSL-algorithm to use a small amount of labeled data to label a large amount of unlabeled data [16].
Since every machine learning algorithm needs data to learn from, the first step is to construct a dataset. The data can be in the form of a ready-to-use dataset or collected from different sources. A ready-to-use dataset contains data rows where all the features are labeled. This approach can save a lot of time, but the number of ready-to-use datasets is limited and there may not be a dataset that fits the purpose of the algorithm. If this is the case, the other approach is to collect the data from sources such as the internet.
Data preprocessing is the process of improving the quality of raw data through different methods [2]. Such methods include data cleaning, normalization, transformation, feature extraction and selection [2][3]. It can also be described as data cleaning, integration, transformation and reduction [9].
The preprocessing of data is a time-consuming [4] but important part of machine learning, and it can significantly improve the performance of a classification algorithm [2][3]. Real-world data can have too many features, noisy instances or redundancy [3], which can affect the classification. The same goes for short text documents, which to some degree contain words that the classification-algorithm would perform better without [5]. Such words, for example 'so' and 'because', are not unique to a specific category and would therefore not improve the classification.
After the data has been collected and preprocessed, the classification-algorithm can start to learn and recognize patterns in the data. There has been extensive research in the field of SL-algorithms and their different use cases. Examples of such methods are Naive Bayes, Neural Networks, Support Vector Machines [5][6], Decision Trees and k-Nearest Neighbors [5]. The problem with some of the traditional methods is that they cannot be trained with unlabeled data [20]. The purpose of SSL-algorithms is to solve this issue by enabling the use of both labeled and unlabeled data.
Like SL-algorithms, there are many different approaches to SSL-algorithms [13]. One approach is called self-training and involves training a classifier with the labeled data and then making predictions on the unlabeled data. The newly predicted data points are then merged into the training dataset [20].
Another approach is graph-based SSL (GSSL), which builds a complete graph based on similarities between the labeled and unlabeled nodes [19][20]. The assumption is that nodes with high similarity tend to have the same label [20].
Graph-based techniques have been shown to be an effective solution to different kinds of problems [19] and representing text in graphs has become more popular because of recent advances in the field [7]. According to Widmann and Verberne [7], the method has shown to be especially effective for short text classification.
The two most common types of GSSL-algorithms either use the graph to spread labels from labeled to unlabeled nodes or optimize a loss function [16]. The LP-algorithm is a well-explored method in SSL [18] and belongs to the first category. It was first introduced by Zhu and Ghahramani [13] and has since 2002 been modified many times [7]. The algorithm propagates labels in a graph from the labeled nodes to the unlabeled ones [18][19] based on proximity [13][20].
The advantages of the algorithm are that it converges quickly and can be scaled [19]. Another advantage is its flexibility to adapt, but on the other hand it has been shown to be computationally expensive [7].
The problem for newcomers to the machine learning field and for non-mathematicians is that current papers use advanced mathematical explanations of the algorithms and their inner workings. This can easily get confusing for those who lack advanced mathematical knowledge or have never worked with SSL. A possible outcome in such a scenario is a classification-algorithm that is trained with wrongly labeled data, which would result in wrong predictions. It is therefore essential to understand the different processes behind SSL to achieve reliable results.
In addition to giving a detailed explanation of GSSL, our implementation reproduces and simplifies the work done by Widmann and Verberne [7], in which they proposed a graph connected with both documents and features. We have simplified our work by reproducing the same steps as Widmann and Verberne [7] but without the feature nodes. This has enabled us to determine what effect the feature nodes have on the classification result by comparing our results against those of Widmann and Verberne [7].
This study could be used as course material for teachers explaining GSSL to students enrolled in non-technical programs. A target group with minimal technical knowledge would, in addition to the explanation of the algorithm, also need a thorough explanation of the steps involved to implement the program. Such steps include how to set up the required tools and how to preprocess the data.
This thesis is organized into five sections. Section two explains our research method and how we used the different steps in the design and creation research method. This is followed by a detailed explanation of our implementation of the artefact. Section four guides the reader through GSSL and the results from classifying texts with our artefact. We end the thesis with a discussion of the results in the last section.
Research questions
The purpose of this paper is to explain the different processes involved in SSL and to give a more detailed understanding of GSSL. We aim to answer the following research questions:
1. How to explain graph-based semi-supervised learning for non-mathematicians?
2. What kind of preprocessing is most effective regarding the quality of results of semi-supervised learning?
We limited our work to doing literature research, analyzing the code behind the scikit-learn library's implementation of the LP-algorithm and reproducing the SSL-processes done by Widmann and Verberne [7]. Our results are based on data from the 20 Newsgroups dataset, where each run was repeated 10 times to get the average classification percentage. Our contributions were to reproduce and simplify GSSL for non-mathematicians as well as to compare different kinds of preprocessing techniques.
2 Method
This section describes the research method and why we chose to use design and creation to answer our research questions. We also present in detail our approach to create the artefact.
2.1 Design and Creation
SSL is a comprehensive area where small decisions, like which dataset or data cleaning method one should use, could have a big impact on the final result. Current research uses mathematical formulas to explain the processes and algorithms in SSL. Therefore, to fully understand the different stages in SSL and be able to explain the processes in full detail, the best approach was to reproduce the steps made by Widmann and Verberne [7] with the design and creation research method.
The design and creation method is a problem-solving approach and focuses on developing artefacts. Such artefacts can include constructs, models, methods or instantiations [23]. Our research contributes with methods, which are guidance and process stages for a model [23], to explain GSSL and the preprocessing stages. The design and creation approach involves five steps called awareness, suggestion, development, evaluation and conclusion, which are performed in an iterative cycle [23]. These steps and how we used them are presented below.
Process-step 1: Awareness
Awareness is the first step in the iterative cycle, and it was here that we recognized a problem in current GSSL-research. We had more questions than answers after our extensive literature search in GSSL, which had the goal of providing a basic understanding of the inner workings of GSSL. This was mainly because of our lack of mathematical understanding, since most of the processes in GSSL are explained using mathematical formulas. The literature research left us with questions like "how could the current research be understood by us, who have little to no mathematical experience?" and "how could the mathematical algorithms be implemented in code?".
There were also no explanations of how to get started with the setup, and the literature research made it clear that the preprocessing of data is an important step but was not explained in detail in the papers.
Process-step 2: Suggestion
The second step was to figure out how to solve the problem, also called the suggestion step in the design and creation process. We started by researching which frameworks were used in current research papers. This led us to the scikit-learn library, which was used in multiple papers. We also wanted to use a well-known dataset so we could evaluate our solution against previous research, and found the 20 Newsgroups dataset. Both scikit-learn and the 20 Newsgroups dataset were used in [7].
We therefore decided to use Widmann and Verberne's work [7], which is based on Widmann's master thesis [24], as a base for implementing GSSL ourselves. This way we were able to extract the different steps in their process, from dataset to classifiers, and research each step more thoroughly. This gave us a better understanding of how the data changes through the different steps.
Process-step 3: Development
After extracting and researching the processes explained above, we moved on to the third step called development. This resulted in a step-by-step plan to implement our artefact which is presented below.
1. Collect data. Every machine learning algorithm needs data to learn from, and it was therefore essential to first address what kind of data we wanted to use in our artefact. We decided on the 20 Newsgroups dataset, which is a commonly used dataset for document classification [12] and has previously been used for SSL text classification in [7][12][17].
2. Preprocessing data. The next step involved removing uninformative data from the dataset as well as transforming the 20 Newsgroups data through feature extraction to make it understandable for the machine. We followed the same preprocessing process as Widmann and Verberne [7].
3. Classifier. The preprocessed data was at this step ready to train our classifier model. Scikit-learn’s ready-to-use implementation of the algorithm made it possible to evaluate the code. Evaluating the code and printing the data flow to the console resulted in a deeper knowledge of the algorithm without the mathematical formulas.
Process-step 4: Evaluation
Process-steps 1-3 resulted in the first implementation of our artefact. We could now move on to the fourth step in the design and creation process, called evaluation. This step involved evaluating our classifier by comparing our first results with the classification results of Widmann and Verberne [7]. This step confirmed that our implementation was successful.
Process-step 5: Conclusion
The last step, called conclusion, involved iteratively modifying and testing different approaches using process-steps 1-4. We could at this point test approaches such as the amount of labeled data in relation to unlabeled data and different preprocessing techniques. This iterative process resulted in a deeper understanding of the dataset, the preprocessing and the GSSL-algorithm.
3 Implementation
This section explains in detail the process from installing the necessary tools to the classification-algorithm in our artefact. Figure 1 presents an overview of the process where we apply preprocessing to a dataset and transform the result with feature extraction which enables us to classify the documents by representing texts as nodes in a graph.
Figure 1: An overview of the steps in the GSSL-implementation.
3.1 Installing the tools
The scikit-learn library requires the programming language Python and two packages called SciPy and NumPy to function properly. We used Python 3.7.2, scikit-learn 0.20.2, SciPy 1.2.1 and NumPy 1.16.1 on a Windows 10 machine. A detailed explanation of the required steps to install scikit-learn and its third-party packages is presented below.
1. Download the Windows executable from Python’s official website¹. 2. Enable “Add Python 3.7 to PATH” and install Python.
3. Python versions above 3.4 will automatically install “pip” which will enable us to download other Python packages. Open command prompt and type “python -m pip install --user numpy scipy”. This command will download the latest version of the packages NumPy and SciPy.
4. Install the scikit-learn library by typing “pip install -U scikit-learn” in command prompt.
¹https://www.python.org
5. We used a Python IDE called PyCharm from JetBrains for faster development. The community version is free and can be downloaded².
6. Packages that we used for visualization of data are Matplotlib and NetworkX. These packages can be installed with the command “python -m pip install -U matplotlib networkx ”.
7. NLTK's WordNetLemmatizer is used for the preprocessing. NLTK is installed with "pip install nltk". The WordNet data that WordNetLemmatizer relies on can be downloaded from PyCharm by first importing NLTK with "import nltk" and then downloading WordNet with "nltk.download('wordnet')".
3.1.1 Verifying the installation
We created a simple script to ensure that the installation of scikit-learn and its third-party libraries was successful.
1. Create a new project in PyCharm and include the scikit-learn library. This can be done by navigating the path "File - Settings - Project Interpreter - +" in PyCharm. Search for "scikit-learn" in the search box that appears after clicking the "+" sign. The search results will display multiple packages with similar names. Make sure to pick the package called "scikit-learn" with the same version that was previously installed from the command prompt.
2. Choose the Python file that was created on project setup and insert the code from Figure B1 in Appendix B. This code will import the scikit-learn library and print a dataset to the console in PyCharm. The installation was successful if the console output is free from errors.
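The exact verification code is given in Figure B1 in Appendix B. As a minimal sketch of what such a script can look like, assuming nothing more than a standard scikit-learn installation, the following imports the library and prints one of its bundled datasets:

```python
# Minimal installation check (a sketch; the thesis' own script is in Figure B1, Appendix B)
import sklearn
from sklearn.datasets import load_iris

print(sklearn.__version__)   # prints the installed scikit-learn version
dataset = load_iris()        # a small example dataset bundled with scikit-learn
print(dataset.data[:5])      # the first five feature rows
print(dataset.target_names)  # the label names of the dataset
```

If the version number and the dataset rows are printed without errors, scikit-learn, NumPy and SciPy are installed correctly.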
3.2 Dataset
The Scikit-learn library has multiple ready-to-use datasets for different use-cases. These datasets can be downloaded to the computer by importing the chosen dataset from the sklearn.datasets package. One of the datasets from sklearn.datasets is the 20 Newsgroups dataset which we will be using in this thesis.
The 20 Newsgroups dataset consists of 20 different topic labels and two subsets: one for training with 11,314 documents and one for testing with 7,532 documents. The documents are fairly evenly spread over the different topics and the majority of the topics have between 550 and 600 documents in total [7].
²https://www.jetbrains.com/pycharm/download/#section=windows
The dataset is ideal for supervised learning since all the data is labeled. However, since this thesis uses semi-supervised learning, we need to mask some of the labels and set them to unlabeled. This process is done with a manual script that splits the dataset into a new training dataset.
The documents are chosen at random, as in Widmann and Verberne research [7], from a copy of the original dataset where all documents have been preprocessed. All documents are shuffled in a random order to ensure that the documents are not ordered by labels before returning the new dataset.
Splitting the dataset enables us to decide how many labeled versus unlabeled documents the dataset should have. This feature is useful for testing the classification score depending on the amount of labeled documents.
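The exact splitting script is not reproduced here, but the masking itself can be sketched as follows. The variable names are our own; the one fixed convention is that scikit-learn's semi-supervised estimators treat the label value -1 as "unlabeled":

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups

# Load the training subset (preprocessing omitted here for brevity)
train = fetch_20newsgroups(subset='train')
labels = np.asarray(train.target)

# Shuffle the documents so they are not ordered by label
rng = np.random.RandomState(0)
order = rng.permutation(len(labels))
documents = [train.data[i] for i in order]

# Keep the labels of the first n_labeled documents and mask the rest.
# scikit-learn's semi-supervised algorithms treat -1 as "unlabeled".
n_labeled = 100
masked_labels = labels[order].copy()
masked_labels[n_labeled:] = -1
```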
Figure 2 illustrates the process to store and retrieve our custom dataset. The original 20 Newsgroup dataset is first preprocessed and grouped together by label where each labels documents are stored in .txt files. This enables us to quickly fetch preprocessed data based on specific labels.
The 20 Newsgroups dataset is by default ordered into a training and testing dataset. We used the default split and ordered the data into two different folders based on training or testing. The training dataset is then used to train the classifier and the testing dataset is used to evaluate the classification-algorithms.
Figure 2: The process to store and retrieve our custom dataset.
3.3 Data preprocessing
The following subsections are dedicated to explaining the necessary preprocessing steps to transform the original data.
3.3.1 Data cleaning
The purpose of data cleaning is to remove corrupt, incorrect and irrelevant data. It would be time consuming to manually inspect all instances in a large dataset with the goal to remove such data. Another approach, which was recommended in [4], is to sample a few documents and analyze their content.
Table 1: Data cleaning

Before:

From: genetic+@pitt.edu (David M. Tate)
Subject: Re: MARLINS WIN! MARLINS WIN!
Article-I.D.: blue.7961
Organization: Department of Industrial Engineering
Lines: 13

dwarner@journalism.indiana.edu said:
>I only caught the tail end of this one on ESPN. Does anyone have a report?
>(Look at all that Teal!!!! BLEAH!!!!!!!!!)

Maybe it's just me, but the combination of those *young* faces peeking out from under oversized aqua helmets screams "Little League" in every fibre of my being...

--
David M. Tate    | (i do not know what it is about you that closes
posing as:       | and opens; only something in me understands
e e (can         | the pocket of your glove is deeper than Pete Rose's)
dy) cummings     | nobody, not even Tim Raines, has such soft hands

After:

Maybe it's just me, but the combination of those *young* faces peeking out from under oversized aqua helmets screams "Little League" in every fibre of my being...
Table 1 displays a document from the 20 Newsgroup dataset before and after it has been through the process of data cleaning. We can see that the document has a header, footer and quote before the data cleaning. Such information would not be helpful in classifying the correct label and by removing this from all documents, which is also done by Widmann and Verberne [7], we filter out data that wouldn’t improve the classification.
By simply sampling a few documents we could find irrelevant data patterns for all documents. Removing the header, footer and quote can be done using the code in Figure B2 in Appendix B.
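Figure B2 is not reproduced here, but scikit-learn's loader for the 20 Newsgroups dataset can strip these parts directly through its remove parameter; a minimal sketch:

```python
from sklearn.datasets import fetch_20newsgroups

# Strip the header, footer (signature block) and quoted replies from every document
train = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes'))
print(train.data[0])  # the first document after data cleaning
```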
3.3.2 Data reduction
Data reduction is applied to reduce the size of the data representation [9]. This results in faster processing and can also improve accuracy, because the algorithm does not need to handle as much irrelevant data.
Lemmatization is used to restore words to their base form [14]. We used the WordNetLemmatizer from the NLTK library for lemmatization, which determines whether each word is an adjective, noun, verb or adverb and then transforms the word into its base form. Without lemmatization, the algorithm would have interpreted the words "dogs" and "dog" as two different words with no context. Lemmatization of the following features would result in "were" → "be", "are" → "be", "is" → "be" and "dogs" → "dog". These features, in their original form, would have provided unnecessary extra features since they have the same meaning as their base form. Lemmatization is implemented using the code from Figure B3 in Appendix B.
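The code in Figure B3 is not reproduced here. A sketch of lemmatization with NLTK's WordNetLemmatizer could look as follows, where we use NLTK's part-of-speech tagger to decide whether a word is an adjective, noun, verb or adverb (the thesis code may determine the part of speech differently):

```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')

def wordnet_pos(treebank_tag):
    """Map Penn Treebank tags from nltk.pos_tag to the tags WordNetLemmatizer expects."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
words = nltk.word_tokenize("the dogs are running")
lemmas = [lemmatizer.lemmatize(word, wordnet_pos(tag))
          for word, tag in nltk.pos_tag(words)]
print(lemmas)  # ['the', 'dog', 'be', 'run']
```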
Stop-words are words that appear frequently in our everyday language. Such words are not useful for the classification algorithm since they occur frequently in all texts, and they can be removed without affecting the classification accuracy negatively [6]. Examples of words that could be removed are "we", "the" and "and". Stop-words can also be more specific to the texts at hand. For example, if all the texts are about computers, then the word "computer" would be an appropriate stop-word. The list of stop-words³ removed in our preprocessing is the same that was used by Widmann and Verberne [7], applied with the code from Figure B4 in Appendix B.
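The stop-word list used in the thesis is the external list referenced above. As an illustration of the filtering step itself, the sketch below uses NLTK's built-in English list instead:

```python
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))  # illustrative list, not the one used in [7]
tokens = ['maybe', 'it', 'is', 'just', 'me', 'but', 'the', 'combination']
filtered = [token for token in tokens if token not in stop_words]
print(filtered)  # words such as 'it', 'is' and 'the' are removed
```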
Reducing the feature count is used to remove less desirable features and is determined by the features' frequency. Features that appear in too many or too few documents do not help to determine the document labels. To determine which features should be removed we use the same strategy as [7] and remove all features that appear in more than 50% of the documents or in fewer than 10 documents. Removing these features is done using the code from Figure B5 in Appendix B.
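Figure B5 is not shown here, but scikit-learn's vectorizers expose this filtering directly through the max_df and min_df parameters; a sketch, where preprocessed_documents is a hypothetical list of cleaned texts:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Keep only features that occur in at most 50% of the documents (max_df=0.5)
# and in at least 10 documents (min_df=10), as in [7]
vectorizer = CountVectorizer(max_df=0.5, min_df=10)
feature_matrix = vectorizer.fit_transform(preprocessed_documents)
```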
Vocabulary could be imagined as the opposite of stop-words. A vocabulary consists of a number of distinguishable keywords from every class. If the topic of a class is “computer graphics”, keywords such as “image”, “jpeg” and “graphic” might be effective in the vocabulary. The selection of keywords can be done by a human selecting appropriate words from texts. However, manual selection can be tedious and time consuming. Another option for constructing a vocabulary, which we used in this thesis, is to use an algorithm to generate keywords based on all of the documents.
Our algorithm selects the ten most commonly occurring words from each category and then removes the duplicate words. This means that if a word is one of the ten most frequent in two different categories, it will still only occur once in the vocabulary. We also use two different kinds of vocabularies. The first one is constructed right after the preprocessing and is built using all the training documents. The other vocabulary is built during runtime using only the randomly selected labeled training documents.
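Our exact selection code is not shown here, but one way to build such a vocabulary is sketched below; documents and labels are hypothetical lists of preprocessed texts and their categories:

```python
from collections import Counter

def build_vocabulary(documents, labels, words_per_category=10):
    """Pick the ten most frequent words per category; the set removes duplicates."""
    counters = {}
    for text, label in zip(documents, labels):
        counters.setdefault(label, Counter()).update(text.split())
    vocabulary = set()
    for counter in counters.values():
        vocabulary.update(word for word, _ in counter.most_common(words_per_category))
    return sorted(vocabulary)
```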
Table 2: Preprocessing one document

Before preprocessing:

Maybe it's just me, but the combination of those *young* faces peeking out from under oversized aqua helmets screams "Little League" in every fibre of my being...

After preprocessing:

maybe just combination young face helmet little league
Table 2 displays a document in the dataset before and after it has been lemmatized and had stop-words as well as rare and frequent features removed. The document now lacks words such as "me", "the" and "in", which are not helpful in determining the category.
3.3.3 Feature extraction
Feature extraction is the process of taking the preprocessed data and deriving numerical features [8]. Two different methods to extract features are used in this work for comparison: the first is "Bag of Words" and the second is "Term Frequency-Inverse Document Frequency". Both methods were used by Widmann and Verberne [7].
Bag of Words (BOW) constructs feature vectors from a vocabulary of unique words called tokens [8]. The collection of vectors is also called a feature matrix, which contains one feature vector for each document. The vectors contain how many times each word occurs in the corresponding document. Tokens can be divided into different n-gram models depending on the purpose. A 1-gram model contains one word per token and a 2-gram model contains two words per token. A 2-gram model would result in the tokens 'the sky', 'sky is' and 'is blue' for the phrase "the sky is blue". We are using the 1-gram model in this thesis, which results in the tokens 'the', 'sky', 'is' and 'blue' for the same phrase.
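The difference between the n-gram models can be illustrated with scikit-learn's CountVectorizer and its ngram_range parameter:

```python
from sklearn.feature_extraction.text import CountVectorizer

phrase = ["the sky is blue"]
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(phrase)   # 1-gram model
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(phrase)    # 2-gram model
print(sorted(unigrams.vocabulary_))  # ['blue', 'is', 'sky', 'the']
print(sorted(bigrams.vocabulary_))   # ['is blue', 'sky is', 'the sky']
```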
Term Frequency-Inverse Document Frequency (TF-IDF) is an extra layer on top of BOW which reduces the importance of words that appear more frequently. Frequently appearing words should be considered less informative than words that appear only in small fractions of the text corpus [8]. For example, the words "is" and "the" are common in the English language and would therefore be viewed as less important for the classification than more specific words like "computer" or "car". The TF-IDF value of each word in the documents is normalized with L2-normalization, which results in a value between 0 and 1 based on each word's overall impact on the classification [26]. L2-normalization is also known as Euclidean normalization. TF-IDF does not improve upon BOW's accuracy for all classifiers [10], but it does improve the training speed as it allows features in a vector to be pruned away if their value is zero. These features are replaced with null and can be skipped while training.
Below, feature extraction is demonstrated for the phrases 'The sky is blue', 'So is the sea' and 'The sky is blue and so is the sea' using BOW, TF-IDF and L2-normalized TF-IDF.
Table 3: Feature vectors, BOW

Phrase                               and  blue  is  sea  sky  so  the
The sky is blue                        0     1   1    0    1   0    1
So is the sea                          0     0   1    1    0   1    1
The sky is blue and so is the sea      1     1   2    1    1   1    2
Table 3 displays how BOW summarizes the count of each token in the sentences.
Table 4: Feature vectors, TF-IDF without normalization

Phrase                               and   blue  is  sea   sky   so    the
The sky is blue                      0     1.29  1   0     1.29  0     1
So is the sea                        0     0     1   1.29  0     1.29  1
The sky is blue and so is the sea    1.69  1.29  2   1.29  1.29  1.29  2
Table 4 shows that every word present in a sentence has a larger impact than when using BOW (shown in Table 3). The value of a feature is its term frequency (the count from Table 3) multiplied by its inverse document frequency, where the inverse document frequency is calculated as ln((number of phrases + 1) / (number of phrases the feature occurs in + 1)) + 1 and ln is the natural logarithm [26]. The value of the feature "blue" in the phrase "The sky is blue" is therefore calculated as ln(4/3) + 1 ≈ 1.29.
Table 5: Feature vectors, TF-IDF with L2-normalization

Phrase                               and   blue  is    sea   sky   so    the
The sky is blue                      0     0.56  0.43  0     0.56  0     0.43
So is the sea                        0     0     0.43  0.56  0     0.56  0.43
The sky is blue and so is the sea    0.40  0.31  0.48  0.31  0.31  0.31  0.48
Table 5 displays the feature extraction with TF-IDF L2-normalization, which is calculated from the non-normalized TF-IDF values (displayed in Table 4). The L2-normalized value is calculated by taking the value of a feature and dividing it by the square root of the sum of the squares of every feature in the phrase [26].
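Tables 3-5 can be reproduced with scikit-learn's CountVectorizer and TfidfVectorizer. The sketch below assumes the default smooth_idf=True setting, which corresponds to the "+1" terms in the idf formula above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

phrases = ['The sky is blue', 'So is the sea', 'The sky is blue and so is the sea']

# Table 3: BOW counts
bow = CountVectorizer()
counts = bow.fit_transform(phrases)
print(sorted(bow.vocabulary_))   # ['and', 'blue', 'is', 'sea', 'sky', 'so', 'the']
print(counts.toarray())

# Table 4: TF-IDF without normalization
tfidf_raw = TfidfVectorizer(norm=None, smooth_idf=True)
print(tfidf_raw.fit_transform(phrases).toarray().round(2))

# Table 5: TF-IDF with L2-normalization (scikit-learn's default)
tfidf_l2 = TfidfVectorizer(norm='l2', smooth_idf=True)
print(tfidf_l2.fit_transform(phrases).toarray().round(2))
```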
3.4 Classification algorithms
We use two different classification algorithms to compare and analyze the results. One is a GSSL-algorithm called Label Propagation (LP) and the other one is Multinomial Naive Bayes which is a supervised algorithm.
3.4.1 Label Propagation
Label Propagation is implemented with the ready-to-use algorithm provided in the scikit-learn library through the sklearn.semi_supervised import. The same package also has an algorithm called Label Spreading (LS), which is based on the LP-algorithm. The difference between the two algorithms is the similarity matrix and the clamping effect [11]. We use these algorithms to classify the documents in the 20 Newsgroups dataset because they are used by Widmann and Verberne [7]. The ready-to-use algorithm also allows us to analyze the code, which results in a deeper knowledge of GSSL-algorithms.
The preprocessed data and their corresponding labels are passed to the fit method, which builds the graph. Two different kernels are used with the algorithms, KNN and RBF, and these are also built into the package. Our algorithm uses 10 neighbors for KNN and a gamma value of 5 for RBF. The number of iterations in the LP-algorithm is set to a maximum of 1000.
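A sketch of this setup is shown below, assuming X is the dense feature matrix from the feature extraction step, y contains the labels with -1 for the unlabeled documents, and X_test holds the test documents:

```python
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# The kernels and parameters described above
lp_knn = LabelPropagation(kernel='knn', n_neighbors=10, max_iter=1000)
lp_rbf = LabelPropagation(kernel='rbf', gamma=5, max_iter=1000)
ls_knn = LabelSpreading(kernel='knn', n_neighbors=10, max_iter=1000)

lp_rbf.fit(X, y)                      # builds the graph and propagates the labels
print(lp_rbf.transduction_)           # labels assigned to every training document
predictions = lp_rbf.predict(X_test)  # labels for unseen test documents
```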
The LP-algorithm and its different kernels are explained in detail in section 4.1 of the results.
3.4.2 Multinomial naive Bayes
The Naive Bayes algorithm is based on Bayes' Theorem [14]. Bayes' Theorem calculates the probability of an event based on previously established knowledge with a possible relation to the event. Naive Bayes classifiers assume that the values of the features within the same class are independent of each other. Bayes' Theorem states that the probability of event X occurring, given that event Y is occurring, is equivalent to the probability of both events occurring divided by the probability of event Y occurring. Naive Bayes classifiers apply Bayes' Theorem to document classification by letting X be the class and Y the document. Bayes' Theorem then calculates the probability of the document Y being of class X. Naive Bayes uses Laplace smoothing to counteract the fact that all the documents in a class might not contain a certain feature [14]. For example, the class "alt.atheism" does not contain the feature "graphic". The occurrence counts of the features are therefore increased with Laplace smoothing so that the value zero does not appear when multiplying the probabilities.
Multinomial naive Bayes (MNB) is a classification algorithm based on Bayes' Theorem [15][21]. The difference between the standard naive Bayes classifier and MNB is that MNB specifies that the data shall have a multinomial distribution [29]. In our case of text classification, this means that all the features are represented together with the categories in the form of a multinomial distribution. The multinomial distribution tells us the probability of each individual feature occurring in the different categories. MNB differs from the other naive Bayes classifiers in that it factors in multiple occurrences of the same feature in one document, which the other naive Bayes classifiers do not. This makes MNB the most suitable for text classification. Our implementation of MNB uses an additive Laplace smoothing parameter of 1. Classifying documents using MNB is implemented using the code in Figure B6 in Appendix B.
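A sketch of the MNB setup, with hypothetical variable names for the labeled training data and the test data (the code used in the thesis is in Figure B6):

```python
from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 is the additive (Laplace) smoothing parameter mentioned above
mnb = MultinomialNB(alpha=1.0)
mnb.fit(X_labeled, y_labeled)      # MNB is trained only on the labeled documents
predictions = mnb.predict(X_test)
```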
4 Result
This section explains GSSL through the LP-algorithm and displays the classification results from the implementation of our artefact.
4.1 Label Propagation
The LP-algorithm can be explained in two steps:
1. Construct a graph with weighted edges based on the difference between the connected nodes’ feature vectors.
2. Iteratively determine, with the help of the weights of the edges connected to a node, which label the node most likely belongs to, until the algorithm converges.
4.1.1 Graph construction
The graph is constructed from the feature extraction matrix, i.e. the feature vectors explained in section 3.3.3, which are based on the preprocessed data. Each document in the feature extraction matrix is represented as a node and has weighted edges to all other nodes in the graph. The weights are based on the similarity between the documents' features.
There are multiple ways to calculate the similarity between documents and two common methods are K-Nearest Neighbors (K-NN) and Radial Basis Function Kernel (RBF kernel).
K-NN calculates the similarity by choosing a (typically small) positive integer value of K and a distance metric. The algorithm then finds the K nearest neighbors of the document to classify and labels the sample with the most frequent class among the neighbors [8]. It is important to choose an appropriate value of K, since the value can make the algorithm underfit or overfit. Classification using the K-NN kernel is implemented using the code in Figure B7 in Appendix B.
The RBF kernel is defined as K(x, x′) = exp(−γ ‖x − x′‖²), where ‖x − x′‖² is the squared Euclidean distance between the feature vectors of two nodes (documents) x and x′, and γ is a float value that is greater than zero [11]. The RBF kernel value increases as the distance between vectors decreases, which will produce a fully connected graph with a dense matrix. This could lead to a slow computing time. However, the high accuracy of RBF in GSSL more than makes up for its computing time [7]. Classification using the RBF kernel is implemented using the code in Figure B8 in Appendix B.
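Both similarity measures are available as stand-alone functions in scikit-learn, which makes it possible to inspect the graph weights directly. A small sketch with a hypothetical feature matrix:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

# Hypothetical feature matrix: one row per document
X = np.array([[0.0, 1.0],
              [0.1, 0.9],
              [1.0, 0.0]])

# RBF weights: exp(-gamma * squared Euclidean distance), a dense fully connected graph
rbf_weights = rbf_kernel(X, gamma=5)

# KNN graph: each document is only connected to its nearest neighbors (a sparse graph)
knn_graph = kneighbors_graph(X, n_neighbors=2, mode='connectivity')
print(rbf_weights.round(3))
print(knn_graph.toarray())
```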
Figure 3. Example of 5 nodes (documents) represented in a graph with weighted edges based on similarity. Each node’s number represents the index in the feature extraction matrix. The node colors represent the labels for the document where black nodes represent unlabeled documents.
The result of the graph construction step is a complete graph representing all documents and their similarity to the other documents (Figure 3). The next step of GSSL is to iteratively propagate the labels.
4.1.2 Propagate labels
The labels propagate by iteratively normalizing the probability interpretation and clamping the labeled data [13]. Clamping the result ensures that the original labels do not change after normalization [22]. Before we can iterate, the first step is to set up the label distribution. The distribution is a matrix with the label probability for each document. The columns in the matrix represent the label index, with one row per document.
Each iteration will calculate the label probability for each document and update the distribution. This is done by calculating the dot product of the similarity graph matrix and current distribution and is followed by normalization of the result. Lastly, since we already know the labels for some of the documents in the label distribution, these vectors are set to their initial state by clamping. The process for one iteration is shown in Figure 4 below.
1. Setup          2. After normalizing     3. Clamping
[1     0   ]      [0.75   0.25]            [1      0   ]
[0     0   ]      [0.44   0.56]            [0.44   0.56]
[0     1   ]      [0.11   0.89]            [0      1   ]

Figure 4. The label distribution on initial setup to the left, with two labeled documents and an unlabeled document (row 2), after normalizing in the middle, and after clamping to the right.
Normalizing and clamping the label distribution is done until the algorithm converges. Convergence is reached when all nodes have the same label as the majority of their neighbor-nodes. If the algorithm had converged in Figure 4, the predicted label for the unlabeled document would have been the second one, since this label has the highest probability of 0.56.
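The iteration can be sketched in a few lines of NumPy. The similarity matrix below is hypothetical and the real scikit-learn implementation is more involved, but the three steps (propagate, normalize, clamp) are the same as in Figure 4:

```python
import numpy as np

# Hypothetical similarity matrix for three documents
similarity = np.array([[1.0, 0.6, 0.2],
                       [0.6, 1.0, 0.8],
                       [0.2, 0.8, 1.0]])

# Label distribution on setup: document 0 has the first label, document 2 the second,
# and document 1 (row 2) is unlabeled
distribution = np.array([[1.0, 0.0],
                         [0.0, 0.0],
                         [0.0, 1.0]])
labeled = np.array([True, False, True])
original = distribution.copy()

for _ in range(100):
    distribution = similarity.dot(distribution)               # propagate along the edges
    distribution /= distribution.sum(axis=1, keepdims=True)   # normalize every row
    distribution[labeled] = original[labeled]                 # clamp the known labels

print(distribution.round(2))  # label probabilities for every document
```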
Figure 5 below displays how the label distribution changes after different numbers of iterations.
Figure 5. The label distribution for the first 30 documents in the graph. To the left is the graph on initialization where black nodes represent unlabeled documents. The graph in the middle displays how the labels have propagated after 10 iterations and to the right is the label distribution after 100 iterations.
4.2 Classification comparison
Running the LP-algorithm without a vocabulary turned out to be very computationally demanding and time consuming. Our solution was to use a 4-class problem instead, where we reduce the categories in the dataset from twenty to four. Widmann used a similar approach in [24], where she performed both 4-class and 20-class text categorization on the same dataset as we are using. She determined that the performance of the 20-class problem is consistent with the 4-class problem. We therefore made the assumption that our 4-class text categorization should also be consistent with the 20-class problem. The four categories we used in this thesis are "rec.autos", "rec.motorcycles", "rec.sport.baseball" and "rec.sport.hockey", which are the same categories as Widmann used in her 4-class classification in [24].
Closely related categories should in theory be more difficult to classify since they share common words. Two of our four categories, hockey and baseball, are sports texts and therefore share common words which could confuse the classification algorithm. The same goes for the two other categories, autos and motorcycles, which both belong to the vehicle category.
The classification accuracy of our 4-class problem is derived using F1-score and accuracy. Both methods use categories from the confusion matrix to calculate their scores. A confusion matrix has four categories that represent the different decisions made by the classifier [25]. Two of the categories represent correctly classified examples: "true positives", which are positive examples labeled as positive, and "true negatives", which are negative examples labeled as negative. There are also two categories for incorrectly labeled examples: "false positives", which are negative examples labeled as positive, and "false negatives", which are positive examples labeled as negative [25]. The F1-score is calculated as 2 * (precision * recall) / (precision + recall), where precision is true positives / (true positives + false positives) and recall is true positives / (true positives + false negatives). The accuracy score is calculated as (true positives + true negatives) / number of documents.
Both scores fall within a range of 0.00 to 1.00, where a score of 1.00 can be viewed as 100% correctly labeled test documents. The code for calculating the scores is found in Figure B9 and Figure B10 in Appendix B.
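Figures B9 and B10 are not reproduced here; a sketch using scikit-learn's metrics module, where y_true are the original (masked) labels and y_pred the predicted ones. We assume macro-averaging for the F1-score, since every category should have an equal effect on it:

```python
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='macro')  # every category weighs equally
```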
The scores are evaluated by comparing the original labels, which are masked in the classification process, with the predicted labels. The predicted labels are validated by comparing the results against the classification results in Widmann and Verberne's research [7]. Widmann and Verberne [7] classify documents with labeled training documents in the range of 1-350. We used 1-100 labeled training documents and could verify our results for this range against the same range in Widmann and Verberne's research [7]. Since we use nearly the same preprocessing technique and dataset, a comparison of our results against Widmann and Verberne [7] should validate that our classification results are reasonable. We also based our results on running each classifier and its specific preprocessing process 10 times.
Figure 6-17 below displays the results from our classification algorithms by using different preprocessing techniques and the same results are presented in tables in Appendix A.
Each graph shows a different combination of preprocessing techniques, tested on four classifiers. These classifiers, which are displayed as lines in the graphs, are Multinomial naive Bayes (MNB), Label Spreading with KNN-kernel (LS KNN), Label Spreading with RBF-kernel (LS RBF) and Label Propagation with RBF-kernel (LP RBF). The title of each graph explains what kind of score it displays, where Figures 6-11 display the accuracy score and Figures 12-17 display the F1-score. The title also indicates the preprocessing process. The different preprocessing techniques are "processed", where the data has been cleaned and preprocessed, as well as BOW or TF-IDF, which are feature extraction methods. Some results are also based on a vocabulary, of which there are two different kinds. The first is a vocabulary built from all training documents and the second, called "runtime vocabulary", is a vocabulary built from the randomly selected labeled training documents.
The number of labeled training documents is displayed on the X-axis and the accuracy-score is displayed on the Y-axis. The number of labeled training documents varies from 10 to 100, and the rest of the preprocessed documents from each category are used as test documents. We chose to limit the number of labeled documents to 100, in contrast to Widmann and Verberne [7] who use up to 350 labels, because our goal is to use fewer labels than Widmann and Verberne [7] and measure the impact of the feature-feature graph at lower numbers of labeled documents. The general rule is that the more labeled documents used for training, the better the test results.
The LP RBF result is not visible in figures using BOW as the feature extraction method. This occurs because the results are nearly identical with LS RBF and are therefore covered by this line in the graphs.
The results for the accuracy-score in Figure 6-11 show that the MNB baseline is overall better for classification with up to 100 labeled training documents. The only results where MNB was outperformed are in Figure 7, which shows that LS KNN using TF-IDF without a vocabulary performed better when classifying based on up to 25 labeled training documents. However, when the number of training documents was more than 25, the MNB-classifier performed better.
The above results also show that using a vocabulary in combination with TF-IDF improves the overall results for algorithms using the RBF-kernel at lower numbers of labeled training documents, but worsens the results for the other classifiers. The MNB-classifier using BOW displays better results without a vocabulary when the number of labeled training documents increases.
Algorithms using the KNN-kernel show an overall improvement in classification in comparison to the RBF-kernel. A problem when using the RBF-kernel with BOW, which is displayed in Figures 6, 8 and 10, is that the algorithm reaches its iteration limit before it has finished classifying all documents. This results in a non-improving accuracy even when the number of labeled training documents increases.
The F1-score tests in Figure 12-17 display similar results as the accuracy tests in Figure 6-11. The MNB-classifier shows overall better performance when measured with the F1-score in comparison to the other classification algorithms, which also was the case for the accuracy-score.
A comparison between Figure 12 and 13 shows that LS using either KNN or RBF with TF-IDF without a vocabulary returns significantly better results in comparison to the algorithms using BOW.
The result of LP RBF shows that the more labeled training documents the better the result. As in the previous tests (Figure 6-11) the RBF-kernel reaches its iteration limit using BOW which results in poorly classified documents and a low F1-score in Figure 12, 14 and 16. However, using BOW without a vocabulary displays a better result than with a vocabulary for the MNB-classifier.
Figures 7, 9, 11, 13, 15 and 17 display the results of the classifiers using TF-IDF. The F1-score starts out lower than the accuracy for the RBF-kernels but ends at a similar value. The F1-score and the accuracy for the KNN-kernel and MNB are similar for all the different amounts of labeled training documents.
Figures 18 and 19 display the highest classification results achieved by the classification-algorithms for both the F1-score and the accuracy-score. Figure 18 shows the best results from the accuracy-scores in Figure 6-11, and Figure 19 contains the best measurements from Figure 12-17.
We can see that the two measurements are very similar in their scoring. The most noticeable difference is that LP RBF has a much lower F1-score, in contrast to its accuracy-score, for 10 labeled training documents per category. This is mainly because the accuracy-score does not take into account an uneven class distribution and can therefore show a high accuracy if most of the documents are of the same class. The F1-score is the harmonic mean between precision and recall. This means that it takes into account how many of the classified documents that actually belong in that class as well as how many of all the documents that should be in the class are actually there.
5 Discussion and conclusion
We are not experts in the field of machine learning and can therefore not provide an advanced analysis of why the classifiers work the way they do. This also makes us more susceptible to possible mistakes, and we may not have fine-tuned the algorithms and classifiers ideally. The time constraint for this degree thesis has not allowed us to fully explore the options available during the GSSL-process. Such options could for example have been to test other GSSL-classifiers such as MAD [27] or to improve the selection method for our vocabularies.
The time constraint also affected the maximum number of iterations the LP-algorithm was allowed to run. A lower number enables more trial and error, while a higher number would have allowed us to achieve a better classification result for BOW, since some classifiers did not have enough iterations to finish.
Accuracy and F1-score
The classification measurement calculates the percentage of correctly classified documents. A successful score should be higher than randomly selecting the labels for the documents. Since we have four categories, there is a 25% chance of guessing the correct label by simply assigning labels to documents at random. It is therefore essential that the classification-algorithm has a higher score than 25% for it to be useful.
Figure 7 shows that our classification-algorithms have a better accuracy than 25%, which makes the predictions better than just randomly assigning labels. There are, however, results from our classification where we could just as well have assigned the labels randomly. Figure 6 shows that the RBF-kernel has a score of around 25% due to reaching the iteration limit when using BOW. In these cases it is better to look at the error rate, where we want as low an error rate as possible. A classification score of 100% might look good on paper but is not desirable. Such a score could tell us that something is not working properly or that the classifier has been overfitted to the training data.
In addition to achieving a higher classification score than randomly selecting labels, it is important that the classifier is precise. We also want the classification-algorithms to be sensitive, which is measured by recall. The F1-score is the harmonic mean of precision and recall, which shows whether the classifier is both precise and sensitive. Each category has an equal effect on the F1-score, and a significantly lower F1-score than accuracy-score shows that the classifier is biased towards labeling documents with a certain category. Also, the F1-score cannot be higher than the accuracy-score.
Instability
Empty training documents have a major effect on the stability of the result when using GSSL without a vocabulary. The varying results are due to the fact that randomly selected labeled training documents can be empty after the preprocessing, and such documents are not good to train the classifier with. The result of MNB does not vary as much as the GSSL-algorithms for the same number of empty training documents. This is most likely because MNB only uses the labeled training documents when training and excludes many of the empty documents. This is not the case for the GSSL-algorithms, which use all the training documents.
Vocabulary
Vocabulary features: The vocabulary constructed after the preprocessing reduced the number of features, or unique words, across all texts from 2793 to 27. This number varies when using the MNB-classifier, which is an SL-classifier that only uses the labeled training documents when training, whereas SSL-classifiers use both labeled and unlabeled training documents. Without a vocabulary, MNB usually had around 2200 features. The number of features when using the runtime vocabulary depends on the randomly selected training documents and varies between 10 and 40 features. However, it is unrealistic that a runtime vocabulary would consist of only ten features, since that would mean that the four categories share the same ten most frequently used words. In practice, the number of features for a runtime vocabulary was about the same as for the standard vocabulary.
The classification tests show that the classifier makes more stable predictions with a vocabulary because we remove noisy words and get a lower feature count per document. Since the feature count is lower and unnecessary words which could "confuse" the classifier are removed, the result differs less in the classification measurements between lower and higher numbers of labeled training documents per category. This makes vocabularies ideal for algorithms using lower numbers of labels, since without a vocabulary there may not be enough features to differentiate the categories.
A vocabulary makes it easier to predict categories, but the downside is that it will produce a lower maximum score for the classifier. This is because using fewer features causes some features which would have differentiated the categories to no longer be included in the classification. For example, the categories car and motorcycle would to some degree share the same features. Such features could be driver, vehicle and rearview mirror. The problem is that the classification result would be random if a motorcycle document only had these features and no additional features such as helmet, two wheels and handles which would have differentiated the motorcycle from the car.
The combination of a vocabulary with TF-IDF generates better results for the RBF-kernel at lower numbers of labeled training documents. However, the MNB and LS KNN results show a steeper incline at 10 labels per category when not using a vocabulary. This leads us to believe that if the number of labels were further decreased, using a vocabulary would become better at lower numbers of labels for those classifiers as well.
Vocabulary vs runtime vocabulary: Our assumption is that a runtime vocabulary should provide better results than the standard vocabulary. We make this assumption because a standard vocabulary is created with features from all the training documents, which can result in vocabulary features that are not present in any of the labeled training documents. Labeled documents that contain none of the vocabulary features then have no features and are not useful in the classification. A runtime vocabulary provides more accurate results since the above issue is eliminated. We saw a 2-5 percentage point increase in classification score using a runtime vocabulary compared to the standard vocabulary. The result of using a runtime vocabulary can also become worse than the standard vocabulary. Such a result occurs when the randomly selected labeled training documents are a misleading representation of the other training documents. This means that the words in the vocabulary rarely occur in the unlabeled training documents and will therefore not provide any help in the classification.
MNB using vocabulary: The results from the MNB-classifier become less desirable when using a vocabulary. This happens because MNB assumes that the values of the features within the same class are independent of each other [14]. Since a vocabulary reduces the number of features, and MNB performs better with more features, using MNB with a vocabulary performs worse.
Any reduction in the number of unique features does, however, reduce the result of MNB. MNB has an accuracy of 0.81 at 100 labeled training documents per category using TF-IDF without a vocabulary. When using data which has not gone through data cleaning or preprocessing in the same scenario, the accuracy improves to 0.90. This shows that the result of MNB becomes better even when using features which many other classifiers would consider not useful for classification.
GSSL using vocabulary: The classification result for LS using the KNN- and RBF-kernels decreases when using a vocabulary because of the lower number of distinguishing features. Another reason is that the vocabulary is not optimal for distinguishing between the categories, since it is constructed from the most frequent words in each category and nothing stops these words from occurring in another category. A vocabulary with words that only occur in one category would therefore improve the result. Such a vocabulary could be constructed manually by a human selecting the features. However, this would only be possible for the standard vocabulary and not the runtime vocabulary, since the latter is built from the randomly selected labeled training documents. The feature selection for the runtime vocabulary could instead be done by a custom algorithm, which would be very time-consuming.
The classification result increases for LP, in contrast to LS, when using a vocabulary. An example is LP RBF, which shows a significant difference of 15 percentage points at 100 labeled training documents depending on whether it uses a vocabulary or not. LP without a vocabulary has a steeper classification line, compared to using a vocabulary, towards 100 labeled training documents. This leads us to believe that if the number of labels per category increased further, LP without a vocabulary should give a better result than with a vocabulary.
GSSL
Classification iterations: The number of iterations required to classify all nodes in a GSSL-algorithm depends on the number of features and the parameters passed to the function. TF-IDF requires fewer iterations than BOW because it can prune away features that are zero after the feature extraction process. A vocabulary reduces the number of features even further, which leads to fewer iterations than without a vocabulary. This is shown in Figures 6-7, where the RBF-kernel produces a linear classification result when using BOW as it reaches the maximum number of allowed iterations, whereas TF-IDF produces a result without reaching the iteration limit.

The RBF-kernel using BOW is not even close to producing a result at 1,000 iterations. Increasing the allowed number of iterations to 100,000 still leaves BOW unable to produce a result without reaching the iteration limit. The iteration limit was therefore not increased to allow classification using BOW to finish. The RBF-kernel in Figure 7 has enough iterations available to finish the classification using TF-IDF, which supports the fact that BOW requires more iterations.

The KNN-kernel using BOW without a vocabulary also starts off with a linear result at 10 and 25 labels per category but then improves from there. This tells us that the KNN-kernel requires fewer iterations than the RBF-kernel and that higher numbers of labeled training documents require fewer iterations. The speed with which KNN iterates is, however, slower compared to the RBF-kernel. The number of iterations KNN requires also depends on the number of neighbors that it takes into consideration: a higher number of neighbors increases the number of iterations required. Higher gamma values when using the RBF-kernel also increase the number of iterations required.

The reason we do not present the result of LP KNN is also the number of iterations it requires. LP KNN was not able to train using TF-IDF with 100 labels per category at 1,000,000 iterations, which is more than 1,000 times greater than what the RBF-kernel and LS KNN required. Including LP KNN would therefore only show a linear result, as we did not have the time to run it with the number of iterations required to train the classifier properly.
MNB baseline: It is difficult to achieve a better result than the result produced by the MNB baseline. This is because SL-classifiers, such as the MNB-classifier, do not have the same potential for training errors as GSSL-classifiers. When using SL-classifiers, the labels of the training data are unchangeable. Classification using LP, however, allows the already labeled training data to change labels. This can potentially lead to documents which are known to belong to a certain label ending up with another label. GSSL can achieve better results than SL, which is shown in Figure 7 and Figure 13, where LS KNN at 10 and 25 labels per category produces a better result than MNB.