Analyzing the ability of Naive-Bayes and Label Spreading to predict labels with varying quantities of training data

Classifier Evaluation

ROBIN KAMMERLANDER & TEDY WARSITHA

Bachelor's Thesis at the School of Computer Science and Communication (CSC)
Supervisor: Jeanette Hällgren Kotaleski
Examiner: Pawel Herman
Abstract

Performance analysis of two methods in semi-supervised and supervised machine learning
Contents

1 Introduction
  1.1 Related Works
  1.2 Problem Statement
2 Background
  2.1 Machine Learning
    2.1.1 Vectorization
    2.1.2 Supervised Learning
    2.1.3 Un-Supervised Learning
    2.1.4 Semi-Supervised Learning
  2.2 Methods Of Classification
    2.2.1 Naive-Bayes
    2.2.2 Label Spreading
  2.3 Classes
  2.4 Statistical Evaluation
    2.4.1 Accuracy and Precision
    2.4.2 Recall and F1-score
    2.4.3 McNemar
3 Equipment & Methodology
  3.1 Data-sets
  3.2 Storage
  3.3 Vectorization
  3.4 Classification
  3.5 Testing & Interpreting Results
4 Results
  4.1 Confusion Matrix
  4.2 Classifier Scores
  4.3 McNemar Test Results
  4.4 McNemar Test Results
5 Discussion
  5.1 Method
  5.2 Result Analysis
  5.3 Restrictions & Recommendations
6 Conclusion
Appendices
A Utility
  A.1 List of classifiers
  A.2 List of data-sets
  A.3 List of storage engines
  A.4 List of toolkits and third-party Python modules
Chapter 1
Introduction
Handling large amounts of data manually, e.g. by human labor, can be overwhelming and slow. Computers have unlocked the possibility to process and manage massive amounts of data quickly and efficiently. As might be expected, effective classification methods have found many areas of use. For example, given many different types of articles, it might be desirable to sort them by genre (business, sports, tabloid articles, and so forth). Another use is finding correlations that may not be obvious: a large and seemingly unrelated set of data can be divided based on similarity. This process is known as "clustering". Tuning the restrictions on similarity between data may result in different groupings, allowing the researcher to manipulate the results and retrieve the desired number of clusters. This provides researchers with a useful method for finding undetected groupings. These two uses showcase different properties of the general concepts within machine learning (Scikit-learn, 2014).
1.1 Related Works
Earlier reports have approached a similar problem statement in various ways, granting this report the opportunity to expand the subject's knowledge base, and possibly allowing others to improve the work undertaken in this research. A couple of relevant studies are listed below:
Ramdass and Seshasai (2009) investigated and implemented techniques which could be used to perform automatic article classification. In their report Document Classification for Newspaper Articles they wanted to find an effective approach to sort and classify various newspaper articles. Using an existing database of labeled articles they applied a suitable method in their implementation. A supervised classification method was selected, where the researchers experimented with different statistical techniques. The results were then compared, leading to the conclusion that with a proper and adequate set of extracted features used for training, their classifier could reach an accuracy of 77%. Their best performing classifier was a Naive Bayesian classifier in a multi-variate Bernoulli setting.
Mehrotra and Watave (2009) investigated the complications that surface when developing a spam filter. They developed an identification engine with the aim of identifying spam, which employed a Naive-Bayesian classifier trained on TREC 2006, a corpus consisting of spam and non-spam emails. They presented data from their experiments with different conditions applied and discussed which conditions could have the greatest impact on the efficiency of their chosen method. They concluded that a Naive-Bayesian spam filter would be fit for real-world application.
1.2 Problem Statement
Chapter 2
Background
The background chapter consists of four main parts:
• A brief description of the general concepts of machine learning.
• A detailed description of the various aspects of the learning methods used as classifiers.
• A description of the labels chosen for the classifiers and the implementation of a spam filter.
• A final part dedicated to statistical evaluation; the procedure of measurement and comparison of results.
2.1 Machine Learning
In machine learning, categorization is the procedure in which unclassified data is classified, e.g. grouping subjects by genre, locality, or other traits shared in content. Accomplishing this on a large scale with traditional methods is mostly regarded as a tedious, repetitive and time-consuming process, and it is therefore highly desirable to find a faster method, e.g. using computer processing power. This is commonly known as a document classification problem, where a solution involves finding the fastest and most efficient way to classify data (Scikit-learn, 2014).
2.1.1 Vectorization
In order to utilize meaningful data from samples, these must first be "vectorized", i.e. words need to be extracted from the samples and accounted for. Another way to describe this is to convert the samples into a "Bag of Words" representation, where the words are the selected features of the sample, and the sample is the bag. In practice, a common way of vectorizing a sample is to count the occurrence of each word. This is known as "Count vectorization", which provides a meaningful way to compare vectors in order to discover similarity, e.g. by probability (Scikit-learn, 2014).
2.1.2 Supervised Learning
Supervised learning relies solely on pre-labeled training data, i.e. a collection of pre-defined examples, setting the algorithms to predict results according to these existing labels (Mehrotra and Watave, 2009).
2.1.3 Un-Supervised Learning
Unsupervised learning is distinguished from supervised learning in that the main objective is not to pair the desired output with any specific data, but instead to process unlabeled data to find hidden groups, more commonly known as clusters. This "clustering" technique can generally be described as grouping input data by similar traits (Scikit-learn, 2014).
2.1.4 Semi-Supervised Learning
The semi-supervised approach is similar to the supervised approach with a single differentiating feature: the clustering process described earlier is also applied. The main idea of the semi-supervised approach is to apply it in scenarios where only a small amount of pre-labeled training data is available, which is compensated for by finding similar features in unlabeled data that is also used for training (Chapelle, Schölkopf, and Zien, 2006).
2.2 Methods Of Classification
The two chosen subjects for experimentation belong to different parts of the general concepts in machine learning. A more detailed description of each is given below.
2.2.1 Naive-Bayes
Naive-Bayes classifiers are a family of probabilistic classifiers based on the common principle that every particular feature is independent of the value of any other feature, given the class variable. This technique is known for its efficiency in a supervised learning setting, requiring only a small amount of training data, which can be regarded as its main advantage over other classification techniques. The algorithm has linear time complexity (Scikit-learn, 2014).
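A minimal sketch of a Naive-Bayes text classifier, assuming scikit-learn's multinomial variant (the variant listed in appendix A.1); the training samples are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical pre-labeled training samples.
train_docs = ["win cash prize now", "cheap prize offer claim now",
              "project meeting at noon", "lunch with the team"]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Predict a label for an unseen document.
print(clf.predict(vec.transform(["claim a cash prize"])))
```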
2.2.2 Label Spreading
Label Spreading is a classification technique which belongs to the family of semi-supervised classification methods. Its most prominent feature is its ability to deliver good performance with even smaller amounts of labeled training samples than Naive-Bayes, mixed with a larger amount of unlabeled training samples. This specific mixture of samples, which can be regarded as two subsets of the training set, is processed by a pattern recognition algorithm that classifies the unlabeled data during training. By this procedure the classifier effectively builds a large labeled training set. However, while this removes an otherwise costly requirement in classification, the existence of adequate amounts of pre-labeled training samples, it introduces the need for greater computational power. A traditional algorithm, using a so-called "K-Nearest-Neighbor" algorithm to predict each unlabeled training sample from a selected range of nearest neighbors, has polynomial time complexity O(n²) (Scikit-learn, 2014).
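The mechanism can be sketched with scikit-learn's LabelSpreading and the K-Nearest-Neighbor kernel; the 2-D feature vectors below are invented, and the label -1 marks unlabeled training samples:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Hypothetical 2-D feature vectors forming two well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9],
              [0.05, 0.05], [4.9, 5.0]])
y = np.array([0, 0, 1, 1, -1, -1])  # the last two samples are unlabeled

model = LabelSpreading(kernel="knn", n_neighbors=2)
model.fit(X, y)

print(model.transduction_)  # labels inferred for every sample, incl. unlabeled
```

The `transduction_` attribute shows the effectively enlarged labeled training set built during fitting.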
2.3 Classes
The following describes "spam" and "ham", the two classes used for training the classifiers in the experiments.
Email spam belongs to the family of electronic spam. The general concept of spam involves sending huge amounts of unsolicited messages to various recipients, with a high probability that the contents will be very similar. This may not apply, however, if a spam email is custom-tailored for one person or a smaller group. In an investigation by the Messaging Anti-Abuse Working Group, around 90% of all emails sent between the first quarter of 2012 and the second quarter of 2014 were identified as abusive. Spam emails fall within this group.
Spam emails may consist of links leading to malicious web sites. In many cases the email is cleverly disguised by mimicking a familiar accepted source of email, e.g. a well established bank or company. Even opening spam emails may cause harm to the recipient. Scripts or executable files with exploitative intent may lurk within the email. As mentioned before, most spam emails are sent in bulk, following a common template, and may be easy to recognize (M3AAWG, 2014).
2.4 Statistical Evaluation
Measuring a classifier's performance can be done in various ways. Common methods make use of the accuracy, precision, recall, and F1 scores of the classifier. Before one can fully understand these scoring measures, one has to understand the meaning of the four types of predictions: true negative, true positive, false negative and false positive. A prediction can only be of one of these types. A true negative or a true positive is a correct prediction by the classifier, while the false counterparts are incorrect predictions (Scikit-learn, 2014).
In this study, the terms are abbreviated as follows:
True Positives = TP    (2.1)
False Positives = FP    (2.2)
True Negatives = TN    (2.3)
False Negatives = FN    (2.4)
In the table below, a contingency table is shown describing the relationship between a classifier's predictions and the actual values of the samples.

                            Actual
                      True              False
Predicted   True      True Positive     False Positive
            False     False Negative    True Negative
2.4.1 Accuracy and Precision
Accuracy is the ratio of the number of correct predictions to the total number of predictions. It can be calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (2.5)
Precision is a measure of the classifier's capability to predict labels. The two values are the PPV (positive predictive value), the capability to predict positive values, and the NPV (negative predictive value), the capability to predict negative values. These can be calculated as follows (Bethan Davies, 2013):
PPV = TP / (TP + FP)    (2.6)

NPV = TN / (TN + FN)    (2.7)
In the figure below a visual representation of accuracy, precision and their relation is shown. A depiction of target practice can conveniently be applied to describe classification. Precise predictions are made by a classifier with high precision but not necessarily high accuracy: the shot grouping is tight, but the shots may not be placed on the desired area of the target. Accurate predictions are made by a classifier with high accuracy but not necessarily high precision: the shot grouping is broad, but the shots are placed on the desired area of the target. Valid predictions, both accurate and precise, are made by a classifier with high accuracy and high precision: the shot grouping is spot on, hitting the exact desired area of the target with tight closeness. Invalid predictions, in contrast, can be described as almost random, hitting anywhere without consistency (Bethan Davies, 2013).
2.4.2 Recall and F1-score
The recall of a classifier is the ratio defined as:

Recall = TP / (TP + FN)    (2.8)
In words it can be defined as the capability to detect all the positive data.
The F1 score is the weighted average of the classifier's recall and precision, valued between 0 and 1. The closer the score is to 1, the better. The F1 score can be calculated as follows (Scikit-learn, 2014):
F1 = 2 · (Precision · Recall) / (Precision + Recall)    (2.9)
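Equations 2.5-2.9 can be collected in a short helper. Checking it against the first confusion-matrix row of the results chapter (Naive-Bayes at 10 samples: TP=3, TN=3, FP=2, FN=2) reproduces the corresponding score row:

```python
# Computes the scores of equations 2.5-2.9 from the four prediction counts.
def scores(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    ppv = tp / (tp + fp)              # positive predictive value (precision)
    npv = tn / (tn + fn)              # negative predictive value
    recall = tp / (tp + fn)
    f1 = 2 * ppv * recall / (ppv + recall)
    return accuracy, ppv, npv, recall, f1

# Naive-Bayes at 10 samples: TP=3, TN=3, FP=2, FN=2.
print(scores(3, 3, 2, 2))  # every score works out to 0.6 for these counts
```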
2.4.3 McNemar
Two things need to be taken into consideration when conducting a statistical test on paired nominal data: the hypotheses and the choice of statistical testing method. For a McNemar test the null and alternative hypotheses can be defined as H0 = "The two results are equal" and H1 = "The two results differ significantly". It is important to choose a suitable method for comparing the results of the two classifiers. In this case, the study aims to test for marginal homogeneity, and thus the most suitable method for comparing the two results is McNemar's test on paired nominal data. This particular test uses a chi-squared test statistic to determine whether to reject the null hypothesis, if the result is statistically significant, or to retain it otherwise. The test is applied to a 2x2 contingency table where, in this case, the tabulated data is divided as follows (Statistics Solutions, 2016):
                                    Label Spreading
                          Correct                    Incorrect
Naive-Bayes  Correct      Both Correct               NB Correct, LS Incorrect
             Incorrect    NB Incorrect, LS Correct   Both Incorrect
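A sketch of the test using SciPy's chi-squared distribution (the thesis does not state which implementation was used; this version omits the continuity correction). Only the discordant cells b and c, where exactly one classifier is correct, enter the statistic:

```python
from scipy.stats import chi2

def mcnemar_retain_h0(b, c, alpha=0.05):
    """True if H0 ("the two results are equal") is retained at level alpha.

    b, c: discordant counts (one classifier correct, the other incorrect).
    """
    stat = (b - c) ** 2 / (b + c)   # chi-squared statistic, 1 degree of freedom
    p_value = chi2.sf(stat, df=1)
    return bool(p_value >= alpha)

# Row "300" of the results: NB correct/LS incorrect = 13, NB incorrect/LS correct = 4.
print(mcnemar_retain_h0(13, 4))  # False: H0 rejected, matching table 4.5
```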
Chapter 3
Equipment & Methodology
The equipment and methodology chapter consists of four main parts:
• A description of the equipment used in the experiments.
• A part describing the procedure of pre-processing the samples.
• A description of the classification procedure.
• A final part describing testing and interpretation of the results.
3.1 Data-sets
Two data-sets were used in the experiments: the Enron email corpus, and the email corpus from the CSmining group. The data from the CSmining group, which was used for testing, provided 4327 documents labeled as spam or ham. The data from Enron, which was used for training, provided 43030 spam documents, 65519 ham documents and 517401 unlabeled documents. In order to minimize the risk of having polluted data in the evaluation, i.e. an evaluation suffering from systematic errors caused by bad data, the documents were stripped of formatting before storage. The Beautiful Soup Python module was utilized for this task.
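The stripping step can be sketched as follows (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

# A hypothetical HTML-formatted email body.
raw = "<html><body><p>Dear customer, <b>claim</b> your prize!</p></body></html>"

# Strip all markup, keeping only the text content, before storage.
text = BeautifulSoup(raw, "html.parser").get_text()
print(text)
```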
3.2 Storage

Since the data needed to be organized in categories, with values stored under keys, Redis proved to be adequate for the task while allowing fast access during the testing procedure. After successfully stripping the documents of formatting, they were stored as values in a Redis database under the keys training, testing, and unlabeled training, and finally grouped as spam, ham or unlabeled files.
3.3 Vectorization

Words extracted from samples can be organized into an array of unique keys with occurrence counts as values. In the process of training a classifier, words are put into the vector as unique keys and for every occurrence the associated count value is incremented. This is the approach known as "Count vectorization" described earlier. The Scikit-learn Python library offers this particular kind of vectorization method.
3.4 Classification
To keep the sample data free from bias, a selected number of randomly shuffled samples from the Redis database were vectorized utilizing the parsing tools from the Scikit-learn library. After the samples were successfully vectorized, they were used for training the classifier. By not using the whole data set at once, more control over input quantity was gained: equal amounts of samples from each group could be retained while allowing the whole procedure, both training and testing, to be repeated with the input sample quantity increased for each iteration. Dividing the procedure into several iterations also granted the opportunity to observe the performance of each classifier from very small sample quantities to larger ones. While several parsing methods are offered by the Scikit-learn library, feature extraction by count vectorization using English stop words was the selected approach. Furthermore, the library also provides a vast array of machine learning methods, including Naive-Bayes and Label Spreading, as well as scientific methods to measure the performance of classifiers, which were used in the experiments.
In the process of semi-supervised classification, labeled training samples were mixed with unlabeled training samples, as specified in the guidelines for semi-supervised classification. In contrast, when using a fully supervised classifier such as the Naive-Bayes classifier, the unlabeled training samples are never used and were therefore not merged with the labeled training data, as specified in the guidelines for supervised classification.
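The two training-set layouts described above can be sketched as follows (array shapes and random data are hypothetical; scikit-learn's semi-supervised estimators expect unlabeled samples to carry the label -1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vectorized samples: 100 labeled, 300 unlabeled, 20 features each.
X_labeled = rng.random((100, 20))
y_labeled = rng.integers(0, 2, 100)      # 0 = ham, 1 = spam
X_unlabeled = rng.random((300, 20))

# Supervised (Naive-Bayes): labeled samples only; unlabeled data never merged.
X_nb, y_nb = X_labeled, y_labeled

# Semi-supervised (Label Spreading): merge both, marking unlabeled rows as -1.
X_ls = np.vstack([X_labeled, X_unlabeled])
y_ls = np.concatenate([y_labeled, np.full(300, -1)])

print(X_ls.shape, y_ls.shape)
```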
3.5 Testing & Interpreting Results

After training, test samples were fed to the classifier in order to obtain different measurements of performance, defined as accuracy, precision, recall and F1 scores.
For every prediction made by the classifier, the correct pre-defined label of the processed test sample was compared with the classifier's prediction, resulting in either a match or a mismatch. A match is either a true positive or a true negative; a mismatch is either a false positive or a false negative. The collected results were compiled into sums for each type in a confusion matrix.
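The compilation step can be sketched with scikit-learn's confusion-matrix helper (the toy labels are invented; 1 = spam, 0 = ham):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and classifier predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# ravel() unpacks the 2x2 matrix as (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)
```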
Chapter 4
Results
The results chapter consists of two main parts:

• A part presenting the scores and prediction results acquired from the experiments.
• A part presenting the McNemar test results.
4.1 Confusion Matrix
Listed in the table below are the different values for the confusion matrix; true positives, true negatives, false positives, and false negatives.
#        Naive-Bayes                    Label Spreading
         nTP    nTN    nFP    nFN      nTP    nTN    nFP    nFN
10         3      3      2      2        5      0      0      5
20         3      9      7      1        6      7      4      3
30         5     13     10      2        0     14     15      1
40         5     18     15      2        0     17     20      3
50         6     19     19      6        0     22     25      3
60         8     25     22      5        1     27     29      3
70        12     32     23      3        1     32     34      3
80        17     36     23      4        1     37     39      3
90        18     39     27      6        1     40     44      5
100       19     45     31      5        1     45     49      5
150       13     68     62      7        3     68     72      7
200       22     91     78      9        4     92     96      8
250       25    120    100      5        4    114    121     11
300       41    145    109      5        6    136    144     14
350       56    167    119      8        7    158    168     17
400       64    189    136     11        7    182    193     18
450       59    217    166      8        9    206    216     19
500       88    240    162     10        9    230    241     20
600      130    286    170     14       12    271    288     29
700      155    331    195     19       12    318    338     32
800      168    379    232     21       14    365    386     35
900      157    430    293     20       18    412    432     38
1,000    141    485    359     15       19    461    481     39
1,250    197    604    428     21       24    586    601     39
1,500    197    730    553     20       33    705    717     45
1,750    154    858    721     17       41    820    834     55
2,000    103    989    897     11       43    938    957     62
2,500     95  1,239  1,155     11       55  1,174  1,195     76
Listed in the table below are the sums of true and false predictions for each classifier, i.e. the true values consist of true positives and true negatives, while the false values consist of false positives and false negatives.
#        Naive-Bayes          Label Spreading
         nTrue    nFalse      nTrue    nFalse
10           6         4          5         0
20          12         8         13         4
30          18        12         14        15
40          23        17         17        20
50          25        25         22        25
60          33        27         28        29
70          44        26         33        34
80          53        27         38        39
90          57        33         41        44
100         64        36         46        49
150         81        69         71        72
200        113        87         96        96
250        145       105        118       121
300        186       114        142       144
350        223       127        165       168
400        253       147        189       193
450        276       174        215       216
500        328       172        239       241
600        416       184        283       288
700        486       214        330       338
800        547       253        379       386
900        587       313        430       432
1,000      626       374        480       481
1,250      801       449        610       601
1,500      927       573        738       717
1,750    1,012       738        861       834
2,000    1,092       908        981       957
2,500    1,334     1,166      1,229     1,195
Table 4.2. True and false values of each classifier by sample size.
4.2 Classifier Scores
#        Naive-Bayes                                   Label Spreading
         Acc.    PPV     NPV     Recall  F1            Acc.    PPV     NPV     Recall  F1
10       0.6000  0.6000  0.6000  0.6000  0.6000        0.5000  1.0000  0.0000  1.0000  0.6667
20       0.6000  0.3000  0.9000  0.3000  0.4286        0.6500  0.6000  0.7000  0.6000  0.6316
30       0.6000  0.3333  0.8667  0.3333  0.4545        0.4667  0.0000  0.9333  0.0000  0.0000
40       0.5750  0.2500  0.9000  0.2500  0.3704        0.4250  0.0000  0.8500  0.0000  0.0000
50       0.5000  0.2400  0.7600  0.2400  0.3243        0.4400  0.0000  0.8800  0.0000  0.0000
60       0.5500  0.2667  0.8333  0.2667  0.3721        0.4667  0.0333  0.9000  0.0333  0.0588
70       0.6286  0.3429  0.9143  0.3429  0.4800        0.4714  0.0286  0.9143  0.0286  0.0513
80       0.6625  0.4250  0.9000  0.4250  0.5574        0.4750  0.0250  0.9250  0.0250  0.0455
90       0.6333  0.4000  0.8667  0.4000  0.5217        0.4556  0.0222  0.8889  0.0222  0.0392
100      0.6400  0.3800  0.9000  0.3800  0.5135        0.4600  0.0200  0.9000  0.0200  0.0357
150      0.5400  0.1733  0.9067  0.1733  0.2737        0.4733  0.0400  0.9067  0.0400  0.0706
200      0.5650  0.2200  0.9100  0.2200  0.3359        0.4800  0.0400  0.9200  0.0400  0.0714
250      0.5800  0.2000  0.9600  0.2000  0.3226        0.4720  0.0320  0.9120  0.0320  0.0571
300      0.6200  0.2733  0.9667  0.2733  0.4184        0.4733  0.0400  0.9067  0.0400  0.0706
350      0.6371  0.3200  0.9543  0.3200  0.4686        0.4714  0.0400  0.9029  0.0400  0.0704
400      0.6325  0.3200  0.9450  0.3200  0.4655        0.4725  0.0350  0.9100  0.0350  0.0622
450      0.6133  0.2622  0.9644  0.2622  0.4041        0.4778  0.0400  0.9156  0.0400  0.0711
500      0.6560  0.3520  0.9600  0.3520  0.5057        0.4780  0.0360  0.9200  0.0360  0.0645
600      0.6933  0.4333  0.9533  0.4333  0.5856        0.4717  0.0400  0.9033  0.0400  0.0704
700      0.6943  0.4429  0.9457  0.4429  0.5916        0.4714  0.0343  0.9086  0.0343  0.0609
800      0.6838  0.4200  0.9475  0.4200  0.5705        0.4738  0.0350  0.9125  0.0350  0.0624
900      0.6522  0.3489  0.9556  0.3489  0.5008        0.4778  0.0400  0.9156  0.0400  0.0711
1,000    0.6260  0.2820  0.9700  0.2820  0.4299        0.4800  0.0380  0.9220  0.0380  0.0681
1,250    0.6408  0.3152  0.9664  0.3152  0.4674        0.4880  0.0384  0.9376  0.0384  0.0698
1,500    0.6180  0.2627  0.9733  0.2627  0.4074        0.4920  0.0440  0.9400  0.0440  0.0797
1,750    0.5783  0.1760  0.9806  0.1760  0.2945        0.4920  0.0469  0.9371  0.0469  0.0844
2,000    0.5460  0.1030  0.9890  0.1030  0.1849        0.4905  0.0430  0.9380  0.0430  0.0778
2,500    0.5336  0.0760  0.9912  0.0760  0.1401        0.4916  0.0440  0.9392  0.0440  0.0797
4.3 McNemar Test Results
Shown in the figures below are the accuracy and precision scores of the two classifiers, plotted against sample quantity.

[Figure: accuracy and precision (%) vs. sample quantity, 0-2,500]

Figure 4.1. Accuracy and Precision for the Naive-Bayes Method.

[Figure: accuracy and precision (%) vs. sample quantity, 0-2,500]

Figure 4.2. Accuracy and Precision for the Label Spreading Method.
4.4 McNemar Test Results
Listed in the table below are the McNemar contingency table values: the true-true, true-false, false-true, and false-false occurrences for the two classifiers.
#        nTT     nTF     nFT     nFF
10         0       3       0       2
20         6       3       1       0
30        12       1       2       0
40        15       3       2       0
50        18       1       4       2
60        24       1       3       2
70        31       1       1       2
80        35       1       2       2
90        37       2       3       3
100       43       2       2       3
150       65       3       3       4
200       87       4       5       4
250      111       9       3       2
300      132      13       4       1
350      152      15       6       2
400      176      13       6       5
450      200      17       6       2
500      223      17       7       3
600      260      26      11       3
700      302      29      16       3
800      348      31      17       4
900      396      34      16       4
1,000    450      35      11       4
1,250    568      36      18       3
1,500    688      42      17       3
1,750    806      52      14       3
2,000    930      59       8       3
2,500  1,166      73       8       3
Listed in the table below is the McNemar decision for each sample quantity, determining whether the null hypothesis should be retained at a significance level of 5%, i.e. whether the classifier results are similar.
#        McNemar Decision
10       TRUE
20       TRUE
30       TRUE
40       TRUE
50       TRUE
60       TRUE
70       TRUE
80       TRUE
90       TRUE
100      TRUE
150      TRUE
200      TRUE
250      TRUE
300      FALSE
350      FALSE
400      TRUE
450      FALSE
500      FALSE
600      FALSE
700      TRUE
800      FALSE
900      FALSE
1,000    FALSE
1,250    FALSE
1,500    FALSE
1,750    FALSE
2,000    FALSE
2,500    FALSE
Table 4.5. McNemar decision by number of samples. True means the null hypothesis is retained (no significant difference); False means it is rejected.
Chapter 5
Discussion
The discussion chapter consists of two main parts:
• A part where the method is discussed and results are analysed.
• A part where the restrictions of the experiment and further improvements are discussed.
5.1 Method
The decision to use two different, unrelated data-sets was made to increase the variation of the samples. While the larger Enron corpus consists of data from the company's employees, the smaller corpus from the CSmining group consists of data from random subjects. Shuffling these two data-sets before the main procedure of training and testing ensured the variety of the samples.
5.2 Result Analysis
The constraints of computational resources limited the opportunity to examine larger quantities of data, which could have allowed more prominent results to emerge. However, a hint of a pattern can be observed in the upper scale of sample quantities, where the McNemar decisions for the two classifiers started to diverge. Examining table 4.1 at larger data sets (around 2000 and up), it can be observed that the accuracy and precision start to converge around the same value. This is depicted in the accompanying graphs, figure 4.1 and figure 4.2.
As described in the background, the Naive-Bayes method needs a proper amount of training data to perform properly, and therefore better results are achieved with greater quantities of training data, i.e. the classifier is provided a more precise definition of the labels spam and ham.
In comparison to the results of Naive-Bayes, Label Spreading provided more consistent results throughout the experiments. This can be observed as a very consistent line through all sample sizes in the plot in figure 4.2, which can be interpreted as contradicting the theory described in the beginning of this report. However, closer examination of table 4.1 shows that both methods had low scores in terms of accuracy and precision, and it can therefore be argued that these results may lack validity.
Furthermore, the results of the McNemar test may also be regarded as inconclusive because of the poor scores, as the theory suggests that there should be a significant difference already at the beginning. This argument is also supported by the turbulent results of the Naive-Bayes method. However, the results point to correct McNemar decisions based on the data available in the confusion matrix, regardless of the poor scores. The shift from retaining the null hypothesis to rejecting it can be seen starting at sample quantities around 300 in table 4.5, as mentioned earlier. Further evidence supporting this argument is found when examining the results in table 4.2, where an increasing gap between the two classifiers' results can be observed.
5.3 Restrictions & Recommendations
Because of limited hardware, only about 2500 samples could be reached before the execution time of the program became too long for results to appear. This restriction halted further studies and has therefore led to fewer results than desired. Given the greatest bottleneck in the experiments, the Label Spreading method, which is more costly in terms of processing power than the Naive-Bayes method, it can be speculated that a better approach to performing the experiments exists, e.g. improved algorithms with partial processing of the samples yielding partial, incremental results to spare computational resources, instead of a full cycle yielding a stack of results at once. An improved algorithm that reduces both the memory footprint and the use of extensive processing power may also improve the execution time of the program. In the end, however, more resources are needed because of the polynomial time complexity of the bottleneck. This obstacle may be bypassed if a cloud computing network is available.
Chapter 6
Conclusion
After a series of experiments, it is concluded that at a lower scale of sample quantities there is no significant difference between the two classifiers' performance. At larger scales, however (around 2500 samples), significant differences start to appear between the two classifiers' performance.
Appendix A
Utility
A.1 List of classifiers
• Multinomial Naive Bayes
• Label Spreading with K-Nearest-Neighbor algorithm
A.2 List of data-sets

• CSDMC2010 SPAM corpus
• Enron email
A.3 List of storage engines
• Redis
A.4 List of toolkits and third-party Python modules

• Scikit-learn
• Beautiful Soup
Appendix B
Bibliography
Bethan Davies. (2013), Precision and accuracy in glacial geology, (http://www.antarcticglaciers.org/glacial-geology/dating-glacial-sediments-2/precision-and-accuracy-glacial-geology/). Accessed: 2 April 2016

Dennis Ramdass & Shreyes Seshasai. (2009), Document Classification for Newspaper Articles. Massachusetts Institute of Technology

Kunal Mehrotra and Shailendra Watave. (2009), Spam Detection: A Bayesian approach to filtering spam.

M3AAWG, (Messaging, Malware and Mobile Anti-Abuse Working Group). (2014), M3AAWG Email Metrics Program: The Network Operators' Perspective, Report #16 1st Quarter 2012 through 2nd Quarter 2014.

M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. (1998), A Bayesian approach to filtering junk email. AAAI Technical Report WS-98-05

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. (2006), Semi-Supervised Learning. The MIT Press, Cambridge, Massachusetts; London, England

Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. (2011), Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research.