Technology and Society — Systems Developer Programme

Bachelor's thesis, 15 credits, undergraduate level
How to explain graph-based

semi-supervised learning for

non-mathematicians?

 

Lucas Borg 

Mattias Jönsson 

Degree: Bachelor of Science, 180 credits
Main field of study: Computer Science
Programme: Systems Developer
Date of final seminar: 2019-06-03
Supervisor: Gion Koch Svedberg
Examiner: Mia Persson


Summary

The large amount of data available on the internet can be used to improve predictions made through machine learning. The problem is that such data is often in a raw format and requires someone to manually assign labels to the collected data before it can be used by the algorithm. Semi-supervised learning (SSL) is a technique where the algorithm uses a small number of pre-labeled examples and then automatically assigns labels to the remaining data. One approach within SSL is to represent the data in a graph, which is called graph-based semi-supervised learning (GSSL), and then find similarities between the nodes in the graph in order to assign labels automatically.

 

Our goal in this thesis is to simplify the advanced processes and steps required to implement a GSSL algorithm. We cover basic steps such as how to set up the development environment, as well as more advanced steps such as data preprocessing and feature extraction. The feature extraction methods used in this thesis are bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF). Finally, we present document classification with Label Propagation (LP) and Multinomial Naive Bayes (MNB), together with a detailed description of how GSSL works.

 

We also present the performance of the classification algorithms by classifying the 20 Newsgroups dataset with LP and MNB. The results are documented using two evaluation scores, F1-score and accuracy. We also compare MNB and LP, with two different types of kernels, KNN and RBF, on varying amounts of labeled training documents. The results from the classification algorithms show that MNB is better at classifying the dataset than LP.

 


Abstract 

The large amount of available data on the web can be used to improve the predictions made by machine learning algorithms. The problem is that such data is often in a raw format and needs to be manually labeled by a human before it can be used by a machine learning algorithm. Semi-supervised learning (SSL) is a technique where the algorithm uses a small number of labeled samples to automatically label the rest of the data. One approach to SSL is to represent the data in a graph, called graph-based semi-supervised learning (GSSL), and find similarities between the nodes for automatic labeling.

 

Our goal in this thesis is to simplify the advanced processes and steps required to implement a GSSL-algorithm. We cover basic tasks such as setting up the development environment and more advanced steps such as data preprocessing and feature extraction. The feature extraction techniques covered are bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF). Lastly, we present how to classify documents using Label Propagation (LP) and Multinomial Naive Bayes (MNB), together with a detailed explanation of the inner workings of GSSL.

 

We showcase the classification performance by classifying documents from the 20 Newsgroup dataset using LP and MNB. The results are documented using two different evaluation scores, F1-score and accuracy. A comparison between MNB and the LP-algorithm using two different types of kernels, KNN and RBF, was made on different amounts of labeled documents. The results from the classification algorithms show that MNB is better at classifying the data than LP.

Keywords: graph-based SSL, Label Propagation, Naive Bayes, KNN, RBF, feature extraction, 20 Newsgroups, preprocessing, graph construction


Acknowledgement 

We would like to thank Gion Koch Svedberg for his extraordinary support and        encouragement throughout this thesis project. 


Table of Contents

1 Introduction
2 Method
2.1 Design and Creation
3 Implementation
3.1 Installing the tools
3.1.1 Verifying the installation
3.2 Dataset
3.3 Data preprocessing
3.3.1 Data cleaning
3.3.2 Data reduction
3.3.3 Feature extraction
3.4 Classification algorithms
3.4.1 Label Propagation
3.4.2 Multinomial naive Bayes
4 Result
4.1 Label Propagation
4.1.1 Graph construction
4.1.2 Propagate labels
4.2 Classification comparison
5 Discussion and conclusion
References
Appendix A
Appendix B


1 Introduction 

The amount of available data on the web is constantly increasing with technologies such as social media and the rise of IoT. Machine learning is a field where such data can be used to train a model which in turn makes predictions on future data. This is useful for, for example, predicting future house prices or the species of different flowers. Before the model can make such predictions the algorithm must be provided with training data. The training data often contains features with labels. This means that each data row has input values (features) that are tagged with a category or number telling the algorithm what the output for the input should be (label). For example, a model that predicts house prices could be trained with square meters as a feature and the selling price as the label. By tagging, or labeling, data, the model can find relations between different houses' square meters and house prices. Predictions, based on the training data, can now be made by providing the model with the square meters, and it will tell us the house price.

 

The problem is that the data found on the web is not labeled by default, and the model requires a large amount of labeled data to make accurate predictions. A solution is to manually label the collected data, but this is a time-consuming and expensive task [15][20]. The manual approach, where a human labels all the data, is called supervised learning (SL). One approach to solve the labeling issue is to use unsupervised learning (UL). A UL-algorithm is the opposite of an SL-algorithm and requires no manual labeling. Instead, it searches for commonalities in the data and automatically assigns labels.

The third approach, which has been shown to be more accurate than unsupervised learning, is semi-supervised learning (SSL). SSL-algorithms combine the advantages of both supervised and unsupervised learning. This enables the SSL-algorithm to use a small amount of labeled data to label a large amount of unlabeled data [16].

 

Since every machine-learning algorithm needs data to learn from, the first step is to construct a dataset. The data could be in the form of a ready-to-use dataset or collected from different sources. A ready-to-use dataset contains data rows where all the features are labeled. This approach could save a lot of time, but the number of ready-to-use datasets is limited and there may not be a dataset that fits the purpose of the algorithm. If this is the case, the other approach is to collect the data from sources such as the internet.

 

Data preprocessing is the process of improving the quality of raw data through different methods [2]. Such methods include data cleaning, normalization, transformation, feature extraction and selection [2][3]. It can also be described as data cleaning, integration, transformation and reduction [9].

The preprocessing of data is a time-consuming [4] but important part of machine learning, and it can significantly improve the performance of a classification algorithm [2][3]. Real-world data could have too many features, noisy instances or redundancy [3], which could affect the classification. The same goes for short text documents, which to some degree contain words that the classification-algorithm would perform better without [5]. Such words, for example 'so' and 'because', are not unique to a specific category and would therefore not improve the classification.

 

The classification-algorithm can, after the data has been collected and        preprocessed, start to learn and recognize patterns in the data. There has been        extensive research done in the field of SL-algorithms and their different use cases.        Examples of such methods are Naive Bayes, Neural Networks, Support Vector        Machine [5][6], Decision Tree and k-Nearest Neighbor [5]. The problem with some        of the traditional methods is that they cannot be trained with unlabeled data [20].        The purpose of the SSL-algorithms is to solve this issue by enabling the use of both        labeled and unlabeled data. 

 

Like SL-algorithms, there are many different approaches to SSL-algorithms [13]. One approach is called self-training and involves training a classifier with the labeled data and then making predictions on the unlabeled data. The newly predicted data points are then merged into the training dataset [20].

Another approach is graph-based SSL (GSSL), which builds a complete graph based on similarities between the labeled and unlabeled nodes [19][20]. The assumption is that nodes with high similarity tend to have the same label [20].

Graph-based techniques have been shown to be an effective solution to different kinds of problems [19], and representing text in graphs has become more popular because of recent advances in the field [7]. According to Widmann and Verberne [7], the method has been shown to be especially effective for short text classification.

 

The two most common types of GSSL-algorithms either use the graph to spread labels from labeled to unlabeled nodes or optimize a loss function [16]. The LP-algorithm is a well-explored method in SSL [18] and belongs to the first category. It was first introduced by Zhu and Ghahramani [13] and has since 2002 been modified many times [7]. The algorithm propagates labels in a graph to the unlabeled nodes [18][19] based on proximity [13][20].

The advantages of the algorithm are that it converges quickly and scales well [19]. Another advantage is its flexibility to adapt, but on the other hand it has been shown to be computationally expensive [7].

 

The problem for newcomers to the machine learning field and non-mathematicians is that current papers use advanced mathematical explanations of the algorithms and their inner workings. This can easily get confusing for those who lack advanced mathematical knowledge or have never worked with SSL. A possible outcome in such a scenario is a classification-algorithm that is trained with incorrectly labeled data, which would result in wrong predictions. It is therefore essential to understand the different processes behind SSL to achieve reliable results.


In addition to a detailed explanation of GSSL, our implementation reproduces and simplifies the work done by Widmann and Verberne [7], in which they proposed a graph connecting both documents and features. We have simplified our work by reproducing the same steps as Widmann and Verberne [7] but without the feature nodes. This has enabled us to determine what effect the feature nodes have on the classification result by comparing our results against Widmann and Verberne [7].

This study could be used as course material for teachers explaining GSSL to students enrolled in non-technical programs. A target group with minimal technical knowledge would, in addition to the explanation of the algorithm, also need a thorough explanation of the steps involved in implementing the program. Such steps include how to set up the required tools and how to preprocess the data.

 

This thesis is organized into five sections. Section two explains our research method and how we used the different steps in the design and creation research method. This is followed by a detailed explanation of our implementation of the artefact. Section four guides the reader through GSSL and the results from classifying texts with our artefact. We end the thesis with a discussion of the results in the last section.

 

Research questions 

The purpose of this paper is to explain the different processes involved in SSL and        to give a more detailed understanding of GSSL. We aim to answer the following        research questions:  

1. How to explain graph-based semi-supervised learning for non-mathematicians?
2. What kind of preprocessing is most effective regarding the quality of results of semi-supervised learning?

We limited our work to doing literature research, analyzing the code behind the scikit-learn library's implementation of the LP-algorithm, and reproducing the SSL processes done by Widmann and Verberne [7]. Our results are based on data from the 20 Newsgroup dataset, where each run was repeated 10 times to get the average classification score. Our contribution is to reproduce and simplify GSSL for non-mathematicians as well as to compare different kinds of preprocessing techniques.


2 Method 

This section describes the research method and why we chose design and creation to answer our research questions. We also present our approach to creating the artefact in detail.

2.1 Design and Creation 

SSL is a comprehensive area where small decisions, like which dataset or data        cleaning method one should use, could have a big impact on the final result.        Current research uses mathematical formulas to explain the processes and        algorithms in SSL. Therefore, to fully understand the different stages in SSL and        be able to explain the processes in full detail, the best approach was to reproduce        the steps made by Widmann and Verberne [7] with the design and creation        research method.  

The design and creation method is a problem-solving approach that focuses on developing artefacts. Such artefacts could include constructs, models, methods or instantiations [23]. Our research contributes methods, that is, guidance and process stages for a model [23], to explain GSSL and the preprocessing stages. The design and creation approach involves five steps called awareness, suggestion, development, evaluation and conclusion, which are performed in an iterative cycle [23]. These steps, and how we used them, are presented below.

 

Process-step 1: Awareness 

Awareness is the first step in the iterative cycle, and it was here we recognized a problem in current GSSL-research. We had more questions than answers after our extensive literature search in GSSL, whose goal was to get a basic understanding of the inner workings of GSSL. This was mainly because of our lack of mathematical understanding, since most of the processes in GSSL are explained using mathematical formulas. The literature research left us with questions like "how could the current research be understood by us, who had little to no mathematical experience?" and "how could the mathematical algorithms be implemented in code?".

There were also no explanations of how to get started with the setup, and the literature research made it clear that the preprocessing of data was an important step but was not explained in detail in the papers.

 

Process-step 2: Suggestion 

The second step was to figure out how to solve the problem, also called the suggestion step in the design and creation process. We started by researching which frameworks were used in current research papers. This led us to the scikit-learn library, which was used in multiple papers. We also wanted to use a well-known dataset so we could evaluate our solution against previous research, and found the 20 Newsgroup dataset. Both scikit-learn and the 20 Newsgroup dataset were used in [7].


We therefore decided to use Widmann and Verberne's work [7], which is based on Widmann's master thesis [24], as a base for implementing GSSL ourselves. In this way we were able to extract the different steps in their process, from dataset to classifiers, and research each step more thoroughly. This gave us a better understanding of how the data changes through the different steps.

 

Process-step 3: Development 

After extracting and researching the processes explained above, we moved on to the        third step called development. This resulted in a step-by-step plan to implement        our artefact which is presented below.  

1. Collect data. Every machine learning algorithm needs to learn from data, and it was therefore essential to first address what kind of data we wanted to use in our artefact. We decided on the 20 Newsgroup dataset, which is a commonly used dataset for document classification [12] and has previously been used for SSL text classification in [7][12][17].

2. Preprocess data. The next step involved removing uninformative data from the dataset as well as transforming the 20 Newsgroup data through feature extraction to make it understandable for the machine. We followed the same preprocessing process as Widmann and Verberne [7].

3. Classifier. The preprocessed data was at this step ready to train our classifier        model. Scikit-learn’s ready-to-use implementation of the algorithm made it        possible to evaluate the code. Evaluating the code and printing the data flow to        the console resulted in a deeper knowledge of the algorithm without the        mathematical formulas. 

 

Process-step 4: Evaluation 

Process-steps 1-3 resulted in the first implementation of our artefact. We could now move on to the fourth step in the design and creation process, called evaluation. This step involved evaluating our classifier by comparing the first results with the classification results of Widmann and Verberne [7]. This confirmed that our implementation was successful.

 

Process-step 5: Conclusion 

The last step, called conclusion, involved iteratively modifying and testing different approaches using process-steps 1-4. We could at this point test approaches such as varying the amount of labeled data in relation to unlabeled data and different preprocessing techniques. This iterative process resulted in a deeper understanding of the dataset, the preprocessing and the GSSL-algorithm.


3 Implementation

This section explains in detail the process from installing the necessary tools to the        classification-algorithm in our artefact. Figure 1 presents an overview of the        process where we apply preprocessing to a dataset and transform the result with        feature extraction which enables us to classify the documents by representing texts        as nodes in a graph.  

  Figure 1:​ An overview of the steps in the GSSL-implementation. 

3.1 Installing the tools 

The scikit-learn library requires the programming language Python and two packages called SciPy and NumPy to function properly. We used Python 3.7.2, scikit-learn 0.20.2, SciPy 1.2.1 and NumPy 1.16.1 on a Windows 10 machine. A detailed explanation of the required steps to install scikit-learn and its third-party packages is presented below.

 

1. Download the Windows executable from Python's official website¹.
2. Enable "Add Python 3.7 to PATH" and install Python.

3. Python versions above 3.4 will automatically install “pip” which will enable us        to download other Python packages. Open command prompt and type “python        -m pip install --user numpy scipy”. This command will download the latest        version of the packages NumPy and SciPy. 

4. Install the scikit-learn library by typing “pip install -U scikit-learn” in        command prompt. 

    

  ¹​https://www.python.org 


5. We used a Python IDE called PyCharm from JetBrains for faster development. The community version is free and can be downloaded².

6. Packages that we used for visualization of data are Matplotlib and NetworkX. These packages can be installed with the command "python -m pip install -U matplotlib networkx".

7. NLTK's WordNetLemmatizer is used for the preprocessing. NLTK is installed with "pip install nltk". WordNet can then be downloaded from PyCharm by first importing NLTK with "import nltk" and then downloading WordNet with "nltk.download('wordnet')".

3.1.1 Verifying the installation

We created a simple script to ensure that the installation of scikit-learn and its third-party libraries was successful.

 

1. Create a new project in PyCharm and include the scikit-learn library. This can be done by navigating the path "File - Settings - Project Interpreter - +" in PyCharm. Search for "scikit-learn" in the search box that appears after clicking the "+" sign. The search results will display multiple packages with similar names. Make sure to pick the package called "scikit-learn" with the same version that was previously installed from the command prompt.

2. Choose the Python file that was created on project setup and insert the code        from Figure B1 in Appendix B. This code will import the scikit-learn library        and print a dataset to the console in PyCharm. The installation was successful        if the console output is free from errors. 
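The exact script is given in Figure B1 in Appendix B; a minimal sketch of such a check, assuming only that scikit-learn and its dependencies are importable, could look like this:

# A sketch of an installation check (the actual script is Figure B1 in Appendix B).
import numpy
import scipy
import sklearn
from sklearn.datasets import load_iris

# Print the installed versions and a small dataset bundled with scikit-learn.
print(numpy.__version__, scipy.__version__, sklearn.__version__)
iris = load_iris()
print(iris.data[:5])      # the first five feature rows
print(iris.target_names)  # the class names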

3.2 Dataset 

The Scikit-learn library has multiple ready-to-use datasets for different use-cases.        These datasets can be downloaded to the computer by importing the chosen dataset        from the sklearn.datasets package. One of the datasets from sklearn.datasets is the        20 Newsgroups dataset which we will be using in this thesis. 
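As an illustration of how the dataset is obtained, the following sketch uses scikit-learn's fetch_20newsgroups function; the subset parameter selects the training or test split:

from sklearn.datasets import fetch_20newsgroups

# Download the training and test subsets of the 20 Newsgroups dataset.
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

print(len(train.data), len(test.data))  # 11314 and 7532 documents
print(train.target_names)               # the 20 topic labels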

 

The 20 Newsgroups dataset consists of 20 different topic labels and two subsets: one for training with 11,314 documents and one for testing with 7,532 documents. The documents are fairly evenly spread over the different topics, and the majority of the topics have between 550 and 600 documents in total [7].

²https://www.jetbrains.com/pycharm/download/#section=windows


The dataset is ideal for supervised learning since all the data is labeled. However,        since this thesis uses semi-supervised learning, we need to mask some of the labels        and set them to unlabeled. This process is done with a manual script that splits the        dataset into a new training dataset.  

The documents are chosen at random, as in Widmann and Verberne's research [7], from a copy of the original dataset where all documents have been preprocessed. All documents are shuffled into a random order, to ensure that the documents are not ordered by label, before the new dataset is returned.

Splitting the dataset enables us to decide how many labeled versus unlabeled documents the dataset should have. This is useful for testing the classification score depending on the number of labeled documents.
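Our splitting script is listed in the appendix; the core idea can be sketched as follows. scikit-learn's semi-supervised estimators treat the label value -1 as "unlabeled", so masking labels amounts to replacing most of them with -1 (the function name and parameters below are illustrative):

import numpy as np

def mask_labels(y, n_labeled_per_class, seed=0):
    # y: array with the true label of every training document.
    rng = np.random.RandomState(seed)
    y_masked = np.full_like(y, -1)          # start with everything unlabeled
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        keep = rng.choice(idx, size=n_labeled_per_class, replace=False)
        y_masked[keep] = label              # keep a few labels per category
    return y_masked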

 

Figure 2 illustrates the process of storing and retrieving our custom dataset. The original 20 Newsgroup dataset is first preprocessed and grouped by label, where each label's documents are stored in .txt files. This enables us to quickly fetch preprocessed data based on specific labels.

The 20 Newsgroups dataset is by default ordered into a training and testing        dataset. We used the default split and ordered the data into two different folders        based on training or testing. The training dataset is then used to train the classifier        and the testing dataset is used to evaluate the classification-algorithms. 

 

  Figure 2:​ The process to store and retrieve our custom dataset. 

 

 


3.3 Data preprocessing 

The following subsections are dedicated to explaining the necessary preprocessing        steps to transform the original data.  

3.3.1 Data cleaning

The purpose of data cleaning is to remove corrupt, incorrect and irrelevant data. It would be time-consuming to manually inspect all instances in a large dataset with the goal of removing such data. Another approach, which was recommended in [4], is to sample a few documents and analyze their content.

 

Table 1: Data cleaning

Before:

From: genetic+@pitt.edu (David M. Tate)
Subject: Re: MARLINS WIN! MARLINS WIN!
Article-I.D.: blue.7961
Organization: Department of Industrial Engineering
Lines: 13

dwarner@journalism.indiana.edu said:
>I only caught the tail end of this one on ESPN. Does anyone have a report?
>(Look at all that Teal!!!! BLEAH!!!!!!!!!)

Maybe it's just me, but the combination of those *young* faces peeking out
from under oversized aqua helmets screams "Little League" in every fibre of
my being...

--
David M. Tate | (i do not know what it is about you that closes
posing as:    | and opens; only something in me understands
e e (can      | the pocket of your glove is deeper than Pete Rose's)
dy) cummings  | nobody, not even Tim Raines, has such soft hands

After:

Maybe it's just me, but the combination of those *young* faces peeking out
from under oversized aqua helmets screams "Little League" in every fibre of
my being...

Table 1 displays a document from the 20 Newsgroup dataset before and after it has        been through the process of data cleaning. We can see that the document has a        header, footer and quote before the data cleaning. Such information would not be        helpful in classifying the correct label and by removing this from all documents,        which is also done by Widmann and Verberne [7], we filter out data that wouldn’t        improve the classification.  

By simply sampling a few documents we could find irrelevant data patterns for all        documents. Removing the header, footer and quote can be done using the code in        Figure B2 in Appendix B. 
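One way to remove this information, sketched here rather than reproducing Figure B2, is to pass the remove argument when fetching the dataset:

from sklearn.datasets import fetch_20newsgroups

# Strip the header, footer (signature) and quoted replies from every document.
train = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes'))
print(train.data[0])  # only the message body remains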

3.3.2 Data reduction

Data reduction is applied to reduce the size of the data representation [9]. This results in faster processing for the algorithm and also improves the accuracy, because the algorithm does not need to handle as much irrelevant data.

 

Lemmatization is used to restore words to their base form [14]. We used the WordNetLemmatizer from NLTK for lemmatization, which determines whether each word is an adjective, noun, verb or adverb and then transforms the word into its base form. Without lemmatization, the algorithm would have interpreted the words "dogs" and "dog" as two different, unrelated words. Lemmatization of the following features would result in "were" → "be", "are" → "be", "is" → "be" and "dogs" → "dog". These features, in their original form, would have provided unnecessary extra features since they have the same meaning as their base form. Lemmatization is implemented using the code from Figure B3 in Appendix B.
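A simplified sketch of this step, assuming NLTK's WordNet data has been downloaded as described in section 3.1 (this version only tries the noun and verb forms, whereas the thesis implementation also handles adjectives and adverbs):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Try the noun form first and then the verb form; WordNet returns the
    # word unchanged if no base form is found.
    words = text.lower().split()
    return ' '.join(lemmatizer.lemmatize(lemmatizer.lemmatize(w, pos='n'), pos='v')
                    for w in words)

print(lemmatize('The dogs are running'))  # -> 'the dog be run'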

 

Stop-words are words that appear frequently in our everyday language. Such words are not useful for the classification algorithm, since they occur frequently in all texts and can be removed without affecting the classification accuracy negatively [6]. Examples of words that could be removed are "we", "the" and "and". Stop-words can also be more specific to the texts at hand. For example, if all the texts are about computers then the word "computer" would be an appropriate stop-word. The list of stop-words³ removed in our preprocessing is the same that was used to remove stop-words by Widmann and Verberne [7], with the code from Figure B4 in Appendix B.
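Removal itself is a simple filtering step; the sketch below uses a small placeholder list, whereas the thesis uses the list referenced in footnote 3:

# STOP_WORDS is a placeholder; the actual list is the one referenced in footnote 3.
STOP_WORDS = {'we', 'the', 'and', 'is', 'in', 'of', 'to', 'a'}

def remove_stop_words(text):
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

print(remove_stop_words('we removed the stop words and kept the rest'))
# -> 'removed stop words kept rest'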

 

 


Reducing the feature count is used to remove less desirable features and is determined by the features' frequency. Features that appear in too many or too few documents do not help to determine the document labels. To determine which features should be removed we use the same strategy as [7] and remove all features that appear in more than 50% of the documents or in fewer than 10 documents. Removing these features is done using the code from Figure B5 in Appendix B.
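With scikit-learn these two thresholds can be expressed through the max_df and min_df parameters of the vectorizer; a sketch (the code actually used is Figure B5 in Appendix B):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset='train',
                          remove=('headers', 'footers', 'quotes')).data

# Ignore features that occur in more than 50% of the documents (max_df=0.5)
# or in fewer than 10 documents (min_df=10).
vectorizer = CountVectorizer(max_df=0.5, min_df=10)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of documents, number of remaining features)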

 

Vocabulary could be imagined as the opposite of stop-words. A vocabulary consists        of a number of distinguishable keywords from every class. If the topic of a class is        “computer graphics”, keywords such as “image”, “jpeg” and “graphic” might be        effective in the vocabulary. The selection of keywords can be done by a human        selecting appropriate words from texts. However, manual selection can be tedious        and time consuming. Another option for constructing a vocabulary, which we used        in this thesis, is to use an algorithm to generate keywords based on all of the        documents. 

Our algorithm selects the ten most commonly occurring words from each category and then removes the duplicate words. This means that if a word is one of the ten most frequent in two different categories, the word will still only occur once in the vocabulary. We also use two different kinds of vocabularies. The first one is constructed right after the preprocessing and is built using all the training documents. The other vocabulary is built during runtime using only the randomly selected training documents.
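A sketch of such a vocabulary builder, where docs_by_category is assumed to map each category name to its list of preprocessed documents:

from collections import Counter

def build_vocabulary(docs_by_category, words_per_category=10):
    vocabulary = set()
    for docs in docs_by_category.values():
        # Count word frequencies within one category.
        counts = Counter(word for doc in docs for word in doc.split())
        # Keep the ten most frequent words; duplicates across categories
        # collapse automatically because a set is used.
        vocabulary.update(word for word, _ in counts.most_common(words_per_category))
    return sorted(vocabulary)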

 

Table 2: Preprocessing one document

Before preprocessing:

Maybe it's just me, but the combination of those *young* faces peeking out
from under oversized aqua helmets screams "Little League" in every fibre of
my being...

After preprocessing:

maybe just combination young face helmet little league

Table 2 displays a document in the dataset before and after it has been lemmatized, had its stop-words removed, and had rare and frequent features removed. The document now lacks words such as "me", "the" and "in", which are not helpful in determining the category.

3.3.3 Feature extraction 

Feature extraction is the process of taking the preprocessed data and deriving numerical features from it [8]. Two different methods to extract features are used in this work for comparison. The first one is "Bag of Words" and the second one is "Term Frequency-Inverse Document Frequency". Both methods were used by Widmann and Verberne [7].

 

Bag of Words (BOW) constructs feature vectors from a vocabulary of unique words called tokens [8]. This is also called a feature matrix, which contains a feature vector for each document. The vectors contain how many times each word occurs in the corresponding document. Tokens can be divided into different n-gram models depending on the purpose. A 1-gram model contains one word per token and a 2-gram model contains two words per token. This would result in 'the sky', 'sky is' and 'is blue' for the phrase "the sky is blue" using a 2-gram model. We are using the 1-gram model in this thesis, which results in the tokens 'the', 'sky', 'is' and 'blue' for the same phrase.

 

Term Frequency-Inverse Document Frequency (TF-IDF) is an extra layer on top of BOW which reduces the importance of words that appear more frequently. More frequently appearing words should be considered less informative than words that appear in small fractions of the text corpus [8]. For example, the words "is" and "the" are common in the English language and would therefore be viewed as less important for the classification in relation to words like "computer" or "car" that are more specific. The TF-IDF value of each word in the documents is normalized with L2-normalization, which results in a value between 0 and 1 based on each word's overall impact on the classification [26]. L2-normalization is also known as Euclidean normalization. TF-IDF does not improve upon BOW's accuracy for all classifiers [10], but it does improve the training speed as it allows features in a vector to be pruned away if their value is zero. These features are replaced with null and can be skipped while training.

 

Below, feature extraction is demonstrated for the phrases "The sky is blue", "So is the sea" and "The sky is blue and so is the sea" using BOW, TF-IDF and L2-normalized TF-IDF.

 


Table 3: Feature vectors BOW

Phrase                              and  blue  is  sea  sky  so  the
The sky is blue                       0     1   1    0    1   0    1
So is the sea                         0     0   1    1    0   1    1
The sky is blue and so is the sea     1     1   2    1    1   1    2

Table 3 displays how BOW summarizes the count of each token in the sentences.
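The values in Table 3 can be reproduced with scikit-learn's CountVectorizer; a sketch, where ngram_range=(1, 1) gives the 1-gram model used in this thesis:

from sklearn.feature_extraction.text import CountVectorizer

phrases = ['The sky is blue', 'So is the sea', 'The sky is blue and so is the sea']

# Build the 1-gram bag-of-words feature matrix (one row per phrase).
bow = CountVectorizer(ngram_range=(1, 1))
X = bow.fit_transform(phrases)
print(bow.get_feature_names_out())  # ['and' 'blue' 'is' 'sea' 'sky' 'so' 'the']
                                    # (get_feature_names() in older scikit-learn versions)
print(X.toarray())                  # the counts shown in Table 3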

Table 4: Feature vectors TF-IDF without normalization

Phrase                              and   blue   is   sea   sky   so    the
The sky is blue                     0     1.29   1    0     1.29  0     1
So is the sea                       0     0      1    1.29  0     1.29  1
The sky is blue and so is the sea   1.69  1.29   2    1.29  1.29  1.29  2

Table 4 shows that every word present in a sentence has a larger impact than when using BOW (shown in Table 3). The value of a feature is its count in the phrase multiplied by its inverse document frequency, which is calculated as log((number of phrases + 1)/(number of phrases the feature occurs in + 1)) + 1 [26]. The value of the feature blue in the phrase "The sky is blue", where it occurs once, is therefore calculated as log(4/3) + 1 ≈ 1.29.

 

Table 5: Feature vectors TF-IDF with L2-normalization

Phrase                              and   blue   is    sea   sky   so    the
The sky is blue                     0     0.59   0.43  0     0.56  0     0.44
So is the sea                       0     0      0.43  0.56  0     0.56  0.43
The sky is blue and so is the sea   0.4   0.31   0.48  0.31  0.31  0.31  0.48

Table 5 displays the feature extraction with L2-normalized TF-IDF, which is calculated from the non-normalized TF-IDF values (displayed in Table 4). The L2-normalized value is obtained by taking the value of a feature and dividing it by the square root of the sum of the squares of every feature value in the phrase [26].
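The values in Tables 4 and 5 can be reproduced with scikit-learn's TfidfVectorizer, which uses the smoothed formula above by default; norm=None gives the raw values and norm='l2' the normalized ones (a sketch):

from sklearn.feature_extraction.text import TfidfVectorizer

phrases = ['The sky is blue', 'So is the sea', 'The sky is blue and so is the sea']

# Raw TF-IDF values (Table 4): term count times smoothed inverse document frequency.
raw = TfidfVectorizer(norm=None)
print(raw.fit_transform(phrases).toarray())

# L2-normalized TF-IDF values (Table 5): each row divided by its Euclidean length.
normalized = TfidfVectorizer(norm='l2')
print(normalized.fit_transform(phrases).toarray())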

3.4 Classification algorithms 

We use two different classification algorithms to compare and analyze the results.        One is a GSSL-algorithm called Label Propagation (LP) and the other one is        Multinomial Naive Bayes which is a supervised algorithm. 

3.4.1 Label Propagation 

Label Propagation is implemented with the ready-to-use algorithm provided in the scikit-learn library through the sklearn.semi_supervised import. The same module also has an algorithm called Label Spreading (LS), which is based on the LP-algorithm. The difference between the two algorithms is the similarity matrix and the clamping effect [11]. We use these algorithms to classify the documents in the 20 Newsgroup dataset because they are used by Widmann and Verberne [7]. The ready-to-use algorithm also allows us to analyze the code, which results in deeper knowledge of GSSL-algorithms.

The preprocessed data and the corresponding labels are passed to the fit method, which builds the graph. Two different kernels are used with the algorithms, KNN and RBF, and these are built into the package. Our algorithm uses 10 neighbors for KNN and a gamma value of 5 for RBF. The number of iterations in the LP-algorithm is set to a maximum of 1000.

The LP-algorithm and its different kernels are explained in detail in section 4.1.
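A sketch of how these classifiers are instantiated with the parameters stated above; X is the feature matrix from section 3.3.3 and y contains one label per document, with -1 marking unlabeled documents (the fit/predict calls are commented out because X, y and X_test are assumed to be defined elsewhere):

from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Label Propagation with the two kernels used in this thesis.
lp_knn = LabelPropagation(kernel='knn', n_neighbors=10, max_iter=1000)
lp_rbf = LabelPropagation(kernel='rbf', gamma=5, max_iter=1000)

# Label Spreading is based on the LP-algorithm but uses a different
# similarity matrix and a clamping factor.
ls_rbf = LabelSpreading(kernel='rbf', gamma=5)

# lp_knn.fit(X, y)                  # y uses -1 for unlabeled documents
# predictions = lp_knn.predict(X_test)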

3.4.2 Multinomial naive Bayes 

The Naive Bayes algorithm is based on Bayes' Theorem [14]. Bayes' Theorem calculates the probability of an event based on previously established knowledge with a possible relation to the event. Naive Bayes classifiers assume that the values of the features are independent of each other given the class. Bayes' Theorem states that the probability of event X occurring, given that event Y is occurring, is equivalent to the probability of both events occurring divided by the probability of event Y occurring. Naive Bayes classifiers apply Bayes' Theorem for document classification by letting X be the class and Y the document. Bayes' Theorem then gives the probability of the document Y being of class X. Naive Bayes uses Laplace smoothing to counteract the fact that all the documents in a class might not have a certain feature [14]. For example, the class "alt.atheism" does not have the feature "graphic". The occurrence counts of the features are therefore increased with Laplace smoothing so that the value zero does not occur when multiplying the probabilities.


Multinomial naive Bayes (MNB) is a classification algorithm based on Bayes' Theorem [15][21]. The difference between the standard naive Bayes classifier and MNB is that MNB assumes that the data has a multinomial distribution [29]. In our case of text classification, this means that all the features are represented together with the categories in the form of a multinomial distribution. The multinomial distribution tells us the probability of each individual feature occurring in the different categories. MNB also differs from the other naive Bayes classifiers in that it factors in multiple occurrences of the same feature in one document, while the others do not. This makes MNB the most suitable for text classification. Our implementation of MNB uses an additive Laplace smoothing parameter of 1. Classifying documents using MNB is implemented using the code in Figure B6 in Appendix B.
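A sketch of the MNB setup; alpha=1 is the additive (Laplace) smoothing parameter mentioned above, and only the labeled training documents are used for fitting (X_labeled, y_labeled and X_test are assumed to be defined):

from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 applies additive (Laplace) smoothing so that features unseen in a
# class do not lead to zero probabilities.
mnb = MultinomialNB(alpha=1.0)

# mnb.fit(X_labeled, y_labeled)   # only the labeled training documents
# predicted = mnb.predict(X_test)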


4 Result 

This section explains GSSL through the LP-algorithm and displays the classification results from the implementation of our artefact.

4.1 Label Propagation 

The LP-algorithm can be explained in two steps: 

1. Construct a graph with weighted edges based on the difference between the        connected nodes’ feature vectors. 

2. Iteratively determine, with the help of the weights of the edges connected to a node, which label the node most likely belongs to, until the algorithm converges.

4.1.1 Graph construction

The graph is constructed from the feature extraction matrix, i.e. the feature vectors explained in section 3.3.3, which are based on the preprocessed data. Each document in the feature extraction matrix is represented as a node and has weighted edges to all other nodes in the graph. The weights are based on the similarity between the documents' features.

There are multiple ways to calculate the similarity between documents and two        common methods are K-Nearest Neighbors (K-NN) and Radial Basis Function        Kernel (RBF kernel).  

 

K-NN works by choosing a positive integer value of K and a distance metric, where the value of K is small. The algorithm then finds the K nearest neighbors of the document to classify and labels the sample with the most frequent class among the neighbors [8]. It is important to choose an appropriate value of K since the value can make the algorithm underfit or overfit. Classification using the K-NN kernel is implemented using the code in Figure B7 in Appendix B.

 

The RBF kernel is defined as K(x, x′) = exp(−γ‖x − x′‖²), where ‖x − x′‖² is the squared Euclidean distance between the feature vectors of two nodes (documents) x and x′, and γ is a float value greater than zero [11]. The RBF kernel value increases as the distance between the vectors decreases, which produces a fully connected graph with a dense matrix. This can lead to slow computing times. However, the high accuracy of RBF in GSSL more than makes up for its computing time [7]. Classification using the RBF kernel is implemented using the code in Figure B8 in Appendix B.
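Both similarity measures are available directly in scikit-learn; the sketch below computes a sparse 10-nearest-neighbor graph and a dense RBF similarity matrix over a small sample of TF-IDF vectors (the sample size and min_df value here are only illustrative):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

docs = fetch_20newsgroups(subset='train',
                          remove=('headers', 'footers', 'quotes')).data[:100]
X = TfidfVectorizer(max_df=0.5, min_df=2).fit_transform(docs)

# Sparse graph connecting every document to its 10 nearest neighbors.
knn_graph = kneighbors_graph(X, n_neighbors=10, mode='connectivity')

# Dense, fully connected similarity matrix K(x, x') = exp(-gamma * ||x - x'||^2).
rbf_graph = rbf_kernel(X, gamma=5)
print(knn_graph.shape, rbf_graph.shape)  # both (100, 100)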

 


Figure 3. Example of 5 nodes (documents) represented in a graph with weighted edges based on similarity. Each node's number represents the index in the feature extraction matrix. The node colors represent the labels of the documents, where black nodes represent unlabeled documents.

 

The result of the graph construction step is a complete graph representing all documents and their similarity to the other documents (Figure 3). The next step of GSSL is to iteratively propagate the labels.

4.1.2 Propagate labels

The labels propagate by iteratively normalizing the probability distribution and clamping the labeled data [13]. Clamping ensures that the original labels do not change after normalization [22]. Before we can iterate, the first step is to set up the label distribution. The distribution is a matrix with the label probabilities for each document. The columns in the matrix represent the label index, with one row per document.

 

Each iteration calculates the label probability for each document and updates the distribution. This is done by calculating the dot product of the similarity graph matrix and the current distribution, followed by normalization of the result. Lastly, since we already know the labels for some of the documents in the label distribution, these vectors are set back to their initial state by clamping. The process for one iteration is shown in Figure 4 below.

           

1. Setup        2. After normalizing    3. Clamping
1     0         0.75  0.25              1     0
0     0         0.44  0.56              0.44  0.56
0     1         0.11  0.89              0     1

Figure 4. The label distribution on initial setup (left), with two labeled documents and one unlabeled document (row 2), after normalizing (middle), and after clamping (right).

 

Normalizing and clamping the label distribution is done until the algorithm converges. Convergence is reached when all nodes have the same label as the majority of their neighbor-nodes. If the algorithm had converged in Figure 4, the predicted label for the unlabeled document would have been the second one, since this label has the highest probability of 0.56.
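The core of this procedure can be sketched as follows, assuming W is the similarity matrix of the graph and Y is the label distribution matrix with one row per document and one column per label (the names and the convergence tolerance are illustrative):

import numpy as np

def propagate(W, Y, labeled_mask, max_iter=1000, tol=1e-3):
    # W: document-to-document similarity matrix, Y: initial label distribution,
    # labeled_mask: boolean array marking the documents with known labels.
    Y = Y.copy()
    Y_clamped = Y[labeled_mask].copy()
    for _ in range(max_iter):
        Y_old = Y
        Y = W.dot(Y)                          # spread labels along the edges
        Y = Y / Y.sum(axis=1, keepdims=True)  # normalize each row
        Y[labeled_mask] = Y_clamped           # clamp the labeled documents
        if np.abs(Y - Y_old).sum() < tol:     # stop when the distribution settles
            break
    return Y

For the three-document example in Figure 4, one such iteration takes the setup matrix on the left to the normalized matrix in the middle and then clamps the labeled rows back, giving the matrix on the right.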

Figure 5 below displays how the label distribution changes after different numbers of iterations.

Figure 5. The label distribution for the first 30 documents in the graph. To the left is the graph on initialization, where black nodes represent unlabeled documents. The graph in the middle displays how the labels have propagated after 10 iterations, and to the right is the label distribution after 100 iterations.

4.2 Classification comparison 

Running the LP-algorithm without a vocabulary turned out to be very computationally demanding and time consuming. Our solution was to use a 4-class setup instead, where we reduce the categories in the dataset from twenty to four. Widmann used a similar approach in [24], where she performed both 4-class and 20-class text categorization on the same dataset as we are using. She determined that the performance of the 20-class setup is consistent with the 4-class setup. We therefore assume that our 4-class text categorization should also be consistent with the 20-class setup. The four categories we used in this thesis are "rec.autos", "rec.motorcycles", "rec.sport.baseball" and "rec.sport.hockey", which are the same categories as Widmann used in her 4-class classification in [24].

Closely related categories should in theory be more difficult to classify since they share common words. Two of our four categories, hockey and baseball, are sports texts and would therefore share common words, which could confuse the classification algorithm. The same goes for the two other categories, autos and motorcycles, which both belong to the vehicle category.

 

The classification accuracy of our 4-class setup is derived using the F1-score and accuracy. Both methods use categories from the confusion matrix to calculate their scores. A confusion matrix has four categories that represent the different decisions made by the classifier [25]. Two of the categories represent correctly classified examples: "true positives", which are positive examples labeled as positive, and "true negatives", which are negative examples labeled as negative. There are also two categories for incorrectly labeled examples: "false positives", which are negative examples incorrectly labeled as positive, and "false negatives", which are positive examples incorrectly labeled as negative [25]. The F1-score is calculated as 2 * (precision * recall) / (precision + recall), where precision is true positives / (true positives + false positives) and recall is true positives / (true positives + false negatives). The accuracy score is calculated as (true positives + true negatives) / number of documents.

Both scores fall within a range of 0.00 to 1.00, where a score of 1.00 can be viewed as 100% correctly labeled test documents. The code for calculating the scores is found in Figure B9 and Figure B10 in Appendix B.
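Both scores are available in scikit-learn's metrics module; a sketch with illustrative label arrays (whether the thesis uses macro averaging for the F1-score is not stated, so average='macro' is an assumption here):

from sklearn.metrics import accuracy_score, f1_score

# y_true: the original labels of the test documents,
# y_pred: the labels predicted by the classifier (illustrative values).
y_true = [0, 1, 2, 3, 1, 0]
y_pred = [0, 1, 2, 2, 1, 0]

print(accuracy_score(y_true, y_pred))             # fraction of correctly labeled documents
print(f1_score(y_true, y_pred, average='macro'))  # F1-score averaged over the categories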

 

The scores are evaluated by comparing the original labels, which are masked in the classification process, with the predicted labels. The predicted labels are validated by comparing the result against the classification results in Widmann and Verberne's research [7]. Widmann and Verberne [7] classify documents with labeled training documents in the range of 1-350. We used 1-100 labeled training documents and could verify our results over this range against the same range in Widmann and Verberne's research [7]. Since we use nearly the same preprocessing technique and dataset, comparing our results against Widmann and Verberne [7] should validate that our classification results are reasonable. We also based our results on running each classifier and its specific preprocessing process 10 times.

Figures 6-17 below display the results from our classification algorithms using different preprocessing techniques, and the same results are presented in tables in Appendix A.

Each graph has a different combination of preprocessing techniques, tested on four classifiers. These classifiers, which are displayed as lines in the graphs, are Multinomial naive Bayes (MNB), Label Spreading with KNN-kernel (LS KNN), Label Spreading with RBF-kernel (LS RBF) and Label Propagation with RBF-kernel (LP RBF). Each graph title explains what kind of score it displays, where Figures 6-11 display the accuracy score and Figures 12-17 display the F1-score. The title also indicates the preprocessing process. The different preprocessing techniques are "processed", where the data has been cleaned and preprocessed, as well as BOW or TF-IDF, which are feature extraction methods. Some results are also based on a vocabulary, where there are two different kinds of techniques. The first one is a vocabulary built from all training documents and the second one, called "runtime vocabulary", is a vocabulary built from the randomly selected labeled training documents.

The number of labeled training documents is displayed on the X-axis and the score is displayed on the Y-axis. The number of labeled training documents varies from 10 to 100, and the rest of the preprocessed documents from each category are used as test documents. We chose to limit the number of labeled documents to 100, in contrast to Widmann and Verberne [7] who use up to 350 labels, because our goal is to use fewer labels than Widmann and Verberne [7] and measure the impact of the feature-feature graph at lower numbers of labeled documents. The general rule is that the more labeled documents used for training, the better the test results.

The LP RBF result is not visible in figures using BOW as the feature extraction        method. This occurs because the results are nearly identical with LS RBF and are        therefore covered by this line in the graphs. 

   

   


 

The results for the accuracy-score in Figures 6-11 show that the MNB baseline is overall better for classification with up to 100 labeled training documents. The only results where MNB was outperformed were in Figure 7, which shows that LS KNN using TF-IDF without a vocabulary performed better when classifying based on up to 25 labeled training documents. However, when the number of training documents was more than 25, the MNB-classifier performed better.

 

The above results also show that using a vocabulary in combination with TF-IDF improves the overall results for algorithms using the RBF-kernel at lower numbers of labeled training documents, but worsens the results for the other classifiers. The MNB-classifier using BOW displays better results without a vocabulary when the number of labeled training documents increases.

 

Algorithms using the KNN-kernel show an overall improvement in classification in comparison to the RBF-kernel. A problem with using the RBF-kernel with BOW, displayed in Figures 6, 8 and 10, is that the algorithm reaches its iteration limit before it has finished classifying all documents. This results in a non-improving accuracy even if the number of labeled training documents increases.

   


The F1-score tests in Figures 12-17 display similar results to the accuracy tests in Figures 6-11. The MNB-classifier shows overall better performance when measured with the F1-score in comparison to the other classification algorithms, which was also the case for the accuracy-score.

 

A comparison between Figures 12 and 13 shows that LS using either KNN or RBF with TF-IDF, without a vocabulary, returns significantly better results in comparison to the algorithms using BOW.

The result of LP RBF shows that the more labeled training documents, the better the result. As in the previous tests (Figures 6-11), the RBF-kernel reaches its iteration limit using BOW, which results in poorly classified documents and a low F1-score in Figures 12, 14 and 16. However, using BOW without a vocabulary displays a better result than with a vocabulary for the MNB-classifier.

 

Figures 7, 9, 11, 13, 15 and 17 display the results of classifiers using TF-IDF. The F1-score starts out lower than the accuracy for the RBF-kernels but ends at a similar value. The F1-score and the accuracy for the KNN-kernel and MNB are similar for all the different amounts of labeled training documents.


Figures 18 and 19 display the highest classification results achieved by the classification-algorithms for both the F1-score and the accuracy-score. Figure 18 shows the best results from the accuracy-scores in Figures 6-11, and Figure 19 contains the best measurements from Figures 12-17.

We can see that the two measurements are very similar in their scoring. The most noticeable difference is that LP RBF has a much lower F1-score, in contrast to its accuracy-score, for 10 labeled training documents per category. This is mainly because the accuracy-score does not take into account an uneven class distribution and can therefore show a high accuracy if most of the documents are of the same class. The F1-score is the harmonic mean of precision and recall. This means that it takes into account how many of the documents assigned to a class actually belong to that class, as well as how many of the documents that should be in the class were actually assigned to it.


5 Discussion and conclusion

 

We are not experts in the field of machine learning and can therefore not provide an advanced analysis of why the classifiers work the way they do. This also makes us more susceptible to possible mistakes, and we may not have fine-tuned the algorithms and classifiers ideally. The time constraint for this degree thesis has not allowed us to fully explore the options available during the GSSL-process. Such options could for example have been to test other GSSL-classifiers, such as MAD [27], or to improve the selection method for our vocabularies.

The time constraint also affected the maximum number of iterations the LP-algorithm was allowed to run. A lower number enables more trial and error, while a higher number would allow us to achieve better classification results for BOW, since some classifiers did not have enough iterations to finish.

 

Accuracy and F1-score 

The classification measurement calculates the percentage of correctly classified documents. A successful score should be higher than what random label selection would give. Since we have four categories, there is a 25% chance of guessing the correct label by simply assigning labels to documents at random. It is therefore essential that the classification-algorithm scores higher than 25% for it to be useful.

Figure 7 shows that our classification-algorithms have a better accuracy than 25%, which makes the predictions better than just randomly assigning labels. There are, however, results from our classification where we might as well have assigned the labels randomly. Figure 6 shows that the RBF-kernel has a score of around 25% due to reaching the iteration limit when using BOW. In these cases it is better to look at the error rate, where we want as low an error rate as possible. A classification score of 100% might look good on paper but is not desirable. Such a score could tell us that something is not working properly or that the classifier has been overfitted to the training data.

In addition to achieving a higher classification score than random label selection, it is important that the classifier is precise. We also want the classification-algorithms to be sensitive, which is measured by recall. The F1-score is the harmonic mean of precision and recall, which shows whether the classifier is both precise and sensitive. Each category has an equal effect on the F1-score, and a significantly lower F1-score than accuracy-score shows that the classifier is biased towards labeling documents with a certain category. Also, the F1-score cannot be higher than the accuracy-score.


Instability 

Empty training documents have a major effect on the stability of the results when using GSSL without a vocabulary. The varying results are due to the fact that randomly selected labeled training documents can be empty after preprocessing, and such documents are not good to train the classifier with. The results of MNB do not vary as much as those of the GSSL-algorithms for the same number of empty training documents. This is most likely because MNB only uses the labeled training documents when training and therefore excludes many of the empty documents. This is not the case for the GSSL-algorithms, which use all the training documents.

 

Vocabulary 

Vocabulary features: The vocabulary constructed during preprocessing reduced the number of features, or unique words, across all texts from 2793 to 27. This number varies when using the MNB-classifier, which is an SL-classifier and therefore only uses the labeled training documents during training, whereas SSL-classifiers use both labeled and unlabeled training documents. Without a vocabulary, MNB usually had around 2200 features. The number of features in the runtime vocabulary depends on the randomly selected training documents and varies between 10 and 40. However, it is unrealistic that a runtime vocabulary would consist of only ten features, since that would mean the four categories share the same ten most frequently used words. In practice, the number of features in a runtime vocabulary was about the same as in the standard vocabulary.

The classification tests show that the classifier makes more stable predictions with a vocabulary, because noisy words are removed and the feature count per document is lower. Since the feature count is lower and unnecessary words that could "confuse" the classifier are removed, the classification measurements differ less between lower and higher numbers of labeled training documents per category. This makes vocabularies ideal for algorithms using a low number of labels, since without a vocabulary there may not be enough features to differentiate the categories.
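A minimal sketch of how a fixed vocabulary restricts the feature space, assuming scikit-learn's vectorizers are used; the word list and example documents below are invented and not our actual 27-word vocabulary:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    documents = [
        "car engine driver",
        "motorcycle helmet driver",
        "graphics image rendering",
    ]

    # Invented example vocabulary; in our case the vocabulary held the most
    # frequent words per category after preprocessing.
    vocabulary = ["car", "engine", "motorcycle", "helmet", "graphics", "image"]

    # BOW restricted to the vocabulary: every document gets exactly
    # len(vocabulary) features, and all other words are ignored.
    bow = CountVectorizer(vocabulary=vocabulary)
    X_bow = bow.fit_transform(documents)

    # The same restriction works for TF-IDF.
    tfidf = TfidfVectorizer(vocabulary=vocabulary)
    X_tfidf = tfidf.fit_transform(documents)

    print(X_bow.shape)    # (3, 6)
    print(X_tfidf.shape)  # (3, 6)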

A vocabulary makes it easier to predict categories, but the downside is that it lowers the maximum score the classifier can reach. This is because using fewer features means that some features which would have differentiated the categories are no longer included in the classification. For example, the categories car and motorcycle share some features to a degree, such as driver, vehicle and rearview mirror. If a motorcycle document only contains these shared features, and not additional features such as helmet, two wheels and handlebars that would have differentiated the motorcycle from the car, its classification result would essentially be random.

The combination of a vocabulary with TF-IDF generates better results for the RBF-kernel at lower numbers of labeled training documents. However, the MNB and LS KNN results show a steeper incline at 10 labels per category when not using a vocabulary. This leads us to believe that if the number of labels were decreased further, using a vocabulary would have become the better option at lower numbers of labels for those classifiers as well.

 

Vocabulary vs runtime vocabulary: Our assumption is that a runtime vocabulary should provide better results than the standard vocabulary. We make this assumption because a standard vocabulary is created from features in all the training documents, which can result in features that are not present in any of the labeled training documents. Labeled documents that contain none of the vocabulary words end up without any features and are therefore not useful in the classification. A runtime vocabulary provides more accurate results since this issue is eliminated; we saw a 2-5 percentage point increase in classification score using a runtime vocabulary compared to the standard vocabulary. The runtime vocabulary can, however, also give worse results than the standard vocabulary. This happens when the randomly selected labeled training documents are a misleading representation of the remaining training documents, so that the words in the vocabulary rarely occur in the unlabeled training documents and therefore do not help the classification.
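As a sketch of how such a runtime vocabulary could be built from only the labeled training documents (the helper function below is ours and not part of any library, and the number of words per category is a placeholder):

    from collections import Counter

    def build_runtime_vocabulary(labeled_texts, labeled_targets, words_per_category=10):
        """Collect the most frequent words per category from the labeled
        training documents only, and merge them into one vocabulary."""
        counts = {}
        for text, label in zip(labeled_texts, labeled_targets):
            counts.setdefault(label, Counter()).update(text.split())
        vocabulary = set()
        for counter in counts.values():
            vocabulary.update(word for word, _ in counter.most_common(words_per_category))
        return sorted(vocabulary)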

 

MNB using vocabulary: The results of the MNB-classifier become less desirable when using a vocabulary. MNB assumes that the values of the features within a class are independent of each other [14]. Since a vocabulary reduces the number of features, and MNB performs better with more features, MNB with a vocabulary performs worse.

Any reduction in the number of unique features does, however, reduce the result of MNB. MNB has an accuracy of 0.81 at 100 labeled training documents per category using TF-IDF without a vocabulary. When using data that has not gone through data cleaning or preprocessing in the same scenario, the accuracy improves to 0.90. This shows that the result of MNB improves even when using features that many other classifiers would perceive as unhelpful for classification.

 

GSSL using vocabulary: The classification results for LS with the KNN- and RBF-kernels decrease when using a vocabulary because of the lower number of distinguishing features. Another reason is that the vocabulary is not optimal for distinguishing between the categories, since it is constructed from the most frequent words in each category and nothing stops these words from also occurring in another category. A vocabulary with words that only occur in one category would therefore improve the result. Such a vocabulary could be constructed manually by a human selecting the features, but this would only be possible for the standard vocabulary and not the runtime vocabulary, since the latter is built from the randomly selected labeled documents. The feature selection for the runtime vocabulary could instead be handled by a custom algorithm, which would be very time expensive.

The classification result increases for LP, as opposed to LS, when using a vocabulary. An example is LP RBF, which differs by a significant 15 percentage points at 100 labeled training documents depending on whether a vocabulary is used or not. LP without a vocabulary has a steeper classification curve towards 100 labeled training documents than LP with a vocabulary. This leads us to believe that if the number of labels per category were increased further, LP without a vocabulary should give a better result than with a vocabulary.

GSSL

Classification iterations: The number of iterations required to classify all nodes in a GSSL-algorithm depends on the number of features and the parameters passed to the function. TF-IDF requires fewer iterations than BOW because features that are zero after the feature extraction process can be pruned away. A vocabulary reduces the number of features even further, which leads to fewer iterations than without a vocabulary. This is shown in Figures 6-7, where the RBF-kernel produces a linear classification result when using BOW because it reaches the maximum number of allowed iterations, whereas TF-IDF produces a result without reaching the iteration limit. The RBF-kernel using BOW is not even close to producing a result at 1,000 iterations, and increasing the allowed number of iterations to 100,000 still does not let BOW finish without reaching the iteration limit. The iteration limit was therefore not increased further to allow the classification using BOW to finish. The RBF-kernel in Figure 7 has enough iterations available to finish the classification using TF-IDF, which supports the observation that BOW requires more iterations. The KNN-kernel using BOW without a vocabulary also starts off with a linear result at 10 and 25 labels per category but then increases from there. This tells us that the KNN-kernel requires fewer iterations than the RBF-kernel and that higher numbers of labeled training documents require fewer iterations. On the other hand, each KNN iteration is slower than an RBF iteration. The number of iterations KNN requires also depends on the number of neighbors it takes into consideration: a higher number of neighbors increases the number of iterations required. Likewise, higher gamma values for the RBF-kernel increase the number of iterations required. The reason we do not report results for LP KNN is also the number of iterations it requires: LP KNN was not able to finish training using TF-IDF with 100 labels per category at 1,000,000 iterations, which is more than 1000 times greater than what the RBF-kernel and LS KNN required. Including LP KNN would therefore only show a linear result, as we did not have the time to run it with the number of iterations required to train the classifier properly.
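For reference, the kernel- and iteration-related parameters discussed above are exposed directly by scikit-learn's semi-supervised classifiers. The values below are placeholders for illustration, not the exact settings used in our experiments:

    from sklearn.semi_supervised import LabelPropagation, LabelSpreading

    # RBF kernel: gamma controls how quickly the similarity between two nodes
    # decays with distance; higher gamma tends to require more iterations.
    lp_rbf = LabelPropagation(kernel="rbf", gamma=20, max_iter=1000)

    # KNN kernel: n_neighbors controls how many neighbours each node connects
    # to in the graph; more neighbours means more work per iteration.
    ls_knn = LabelSpreading(kernel="knn", n_neighbors=7, max_iter=1000)

    # Unlabeled training documents are marked with the label -1 before fitting:
    # lp_rbf.fit(X_train, y_train)   # y_train contains -1 for unlabeled documents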

 

MNB baseline: It is difficult to achieve a better result than the one produced by the MNB baseline. This is because SL-classifiers, such as MNB, do not have the same potential for training errors as GSSL-classifiers. When using SL-classifiers the labels of the training data are unchangeable, whereas classification using LP allows the already labeled training data to change labels. This can lead to documents with a known label ending up with another label. GSSL can nevertheless achieve better results than SL, as shown in Figure 7 and Figure 13, where LS KNN at 10 and 25 labels per category produces a better result than MNB.
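For comparison, a minimal sketch of how such an MNB baseline can be set up with scikit-learn, trained only on the labeled documents so that their labels can never change; the variable names are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # The supervised baseline only ever sees the labeled training documents.
    mnb_baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())

    # mnb_baseline.fit(labeled_texts, labeled_targets)
    # predictions = mnb_baseline.predict(test_texts)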


References
