
Contents

1 Introduction
1.1 Delimitations
1.2 Related Work
1.3 Important Concepts
2 Method
2.1 Experimental study
2.1.1 Dataset for the Experiments
2.1.2 Quality measures
2.1.3 Classification
2.1.4 Machine Architecture
3 Results
3.1 Results from the Experimental Study
3.1.1 Comparing Weighting Methods of Terms in BOW features
3.1.2 The Effect of Case conversion and Removal of Stop Words
3.1.3 Number of Words to use in Terms for BOW features
3.1.4 Features for Counting Characters and Quotation Marks
4 Discussion
4.1 Conclusions
4.2 Future Work
5 References
Appendices
A Classes for the Classifier Designs
B Python Build Dependencies
C Confusion Matrix


1 Introduction

Skynet in "Terminator", the machines of "The Matrix" and HAL 9000 in "2001: A Space Odyssey" are all scary depictions of Artificial Intelligence (AI) from the movies. They are probably what comes to mind when most people think of AI.

However, this thesis is not about superintelligence; rather, it covers AI for understanding human-written text. This kind of AI belongs to the research field of Natural Language Processing (NLP), where "natural language" refers to any language used by humans to communicate [BKL09e]. The language processed in this project was English.

Creating intelligent systems that can understand text and natural language has been a goal since the beginning of computers. Alan Turing wrote the article "Computing Machinery and Intelligence" in 1950, and it has been a researched and experimented-on scientific topic ever since [Tur50]. Probably one of the most well-known implementations that uses NLP is IBM's Watson, which is described as a "question answering" system. A computer using Watson participated in the American quiz show "Jeopardy!" and managed to beat two previous winners of the show [Mar11].

The background for this project was connected to the NLP task of matching natural language queries with user profiles. The project was divided into two studies: an experimental study and a case study together with Thingmap, where the results from the experimental study were applied to Thingmap's solution that maps queries to users, to see if any improvement was gained.

The approach was to increase the context of natural language queries through text classification, attempting to categorize short texts into multiple classes. Yet, complete solutions for text classification did not seem suitable for training short text classifiers, since less information is available than for large texts and documents. A lack of theoretical support was discovered on how to design a classification system for short texts and how the text should be represented to achieve optimal results. The experimental study's purpose was to bridge this perceived research gap: to find out how to represent text and how to design a short text classification system for multiple classes.

This thesis presents experiments on how to represent text as features for short text classification, and comparisons of how a flat classification design stood against a hierarchically designed classification system.


1.1 Delimitations

The setup for the experiments was a dataset with short texts and a large number of samples. Since the number of samples is large, adding detailed features is costly in training time. A consequence of this was a limited level of detail for the text features in the experiments.

Another consequence was that the set of suitable algorithms for statistical learning was reduced by using a larger dataset. The approximation method Stochastic Gradient Descent (SGD) was chosen and no other method was compared. SGD is considered an efficient method for large datasets [Bot10]. SGD together with logistic regression is explained in the method, section 2.1.3.

The motivation for the choice of text features included in the study was a combination of advice from supervisors and concepts from the most successful solutions in a competition on short text classification hosted by Kaggle in 2013 [KAG13].

1.2 Related Work

In an article published in 2016, the authors describe a Java library they have developed, called "Edison", which gives the user the flexibility to specify and implement different feature extractors for text classification or clustering purposes [SCK+16]. Edison supports a variety of NLP tools such as Named Entity Recognition, part-of-speech tagging and detection of numerical expressions in text. Edison is a ready-to-use solution with some powerful options for configuring feature extractors. The paper about Edison focuses on how to simplify feature extraction, while this thesis is more about comparing different text features for short text classification.

Another project implemented a technique to perform classification on hierarchically structured data, where the data were organized in a hierarchy of increasing specificity. The authors tested the hierarchical classification system and tried to implement a feature for class similarity. They claimed that the accuracy of their classifier was better than that of a traditional classification design [WZH].

1.3 Important Concepts

The classification in this thesis consisted of supervised statistical learning. In supervised learning, a model receives an input X and an output Y, and the model attempts to map from input to output based on previous attempts [Alp14]. By comparing the guessed output with the real output for each sample of X and Y, the model measures the loss of each prediction. The loss regulates how the model parameters are adjusted for future predictions [Sl16a]. The learning phase is referred to as training in this thesis. A model that has finished training is called a classifier, and its task is, for a given input X, to predict a class among the known classes from the training phase.

For it to be possible to classify an object, the object has to be represented in a way that the model can understand. Whether it is a classifier that should recognize a face in an image or detect whether an e-mail is spam or not, the object is represented with so-called features. Before training, the features to extract from the object are specified [BKL09f]. In other words, feature extraction is how a classifier understands the object, and thus an important part of classifying it.

The most common type of feature for text purposes is called Bag of Words (BOW). The concept is for the features to represent the terms occurring in a text: by going through the text and creating a vocabulary of the terms encountered, then noting, for each text, which terms were found [BMG10]. BOW features can be varied in different ways: with weighting methods, by extending the number of words a term is made up of, and by excluding some words from the vocabulary. These are some of the variations examined in the experimental study, as illustrated below.
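As a minimal illustration of the idea (a toy sketch of my own, not code from the thesis), the vocabulary and count vectors for two short texts can be built directly in Python:

```python
# Build a vocabulary over two toy texts and note which terms occur in each.
texts = ["my dog scared them", "my cat scared my dog"]
vocabulary = sorted({term for text in texts for term in text.split()})

# One count vector per text: entry i counts occurrences of vocabulary[i].
bow = [[text.split().count(term) for term in vocabulary] for text in texts]
print(vocabulary)  # ['cat', 'dog', 'my', 'scared', 'them']
print(bow)         # [[0, 1, 1, 1, 1], [1, 1, 2, 1, 0]]
```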

A contemporary survey of methods in machine learning is given in [HTF09]. The book [MS99] reviews methods for dealing with natural language, and [MRS08] discusses methods of information retrieval.


2 Method

The method consisted of an experimental study on two different designs of text classifiers and on the choice of text features for short text classification. The measures used in the experimental study were: precision, recall, Receiver Operating Characteristics (ROC) and the Area Under the Curve (AUC) of the ROC.

In the thesis, all equations and formulas are indexed to the right for referencing purposes.

2.1 Experimental study

For the experiments, a dataset was used with English titles from Stack Exchange (SE) posts, labeled among 17 classes. SE is an online community that hosts ca. 150 question-and-answer sites [SE16a]. The dataset was separated into one part for supervised training and one part for testing and evaluating the classifier's performance.

Two different designs of classification were compared in the experimental study: a flat design which predicted among 17 subclasses (see Appendix A for all classes), in contrast to a 2-level hierarchical design. The hierarchical design first predicted among 4 main classes, where the prediction led to the second-level classifier, which classified the subclass. Each main class had a varying number of subclasses, and the subclasses together made up the same set of 17 classes that the flat classifier predicted. Both classifier designs were trained and tested with the same datasets, which made direct comparison of the results possible.

The experimental study consisted of four different experiments. The order of the experiments mattered, because each result determined the settings for the next experiment. For example: in experiment 1, two different weighting methods, A and B, were compared. If weighting method A gave better results, method A was then used as the weighting method for experiment 2, where something else was examined. This was an attempt to arrive at a classifier performing as well as possible in the last experiment.

Besides comparing the classifier designs in each experiment, the study consisted of four parts: 1) comparing weighting methods of the terms in BOW features; 2) the effect of letter case conversion and removal of stop words; 3) the number of words to use in each term for BOW features; 4) the effects of adding features for the number of characters in a text and how many quotation marks occurred in the text.



2.1.1 Dataset for the Experiments

The dataset was based on the openly distributed "Stack Exchange Data Dump", which holds all the user-contributed content from the sites on the SE network [SE16b]. The data dump used for this project was uploaded September 12th, 2016, and came in the form of XML files which had to be processed before they could be used as training and testing data for the system. For the experimental study, 3.2 million samples of SE titles were used.

Supervised training means that the training samples are labeled with a class. For the dataset used in this project, the samples were labeled among 4 main classes and 17 subclasses; the flat classifier dealt with the subclasses only. The dataset was divided with 80% of the total as a training set, meaning that all the classifiers of the experiments were trained with the same 2 563 571 samples. The remaining 20% was set aside for testing the prediction performance of the classifiers. During the training phase, the classifiers were never exposed to the 641 206 test set samples.

The samples consisted of short English texts of between 30 and 150 characters. An example of a sample: "Statistical Dimension of a Cone", labeled with the class Mathematics and Statistics.

2.1.2 Quality measures

When choosing measures for performance evaluation of classifiers, precision and recall are the most common. However, these have been criticized for not accounting for error costs and over-represented classes [Faw06]. Therefore, the Receiver Operating Characteristics (ROC) curve from signal detection was added to balance the quality measures. The measures are explained thoroughly in the following subsections.

Confusion Matrix

Most quality measures for predictive modeling and classification are derived from the confusion matrix (also called coincidence matrix or classification matrix). The matrix is n × n, where n is the number of classes. The rows hold the true classes while the columns hold the predicted classes. The optimal result of a predictive model would show as a diagonal confusion matrix, that is: all zeros except for the diagonal between (1, 1) and (n, n) [Tin10].

From the confusion matrix, the predictions of a classifier can be analyzed. It can show which classes the prediction model had problems identifying, which classes were mistaken for one another, and which classes were easy to predict.
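As a minimal illustration (a toy sketch assuming scikit-learn, which the thesis uses for its implementation; not code from the thesis), a confusion matrix with this row/column convention can be computed from paired labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels; rows of the result are true classes,
# columns are predicted classes, as in the convention described above.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C"]))
```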

Precision, Recall, and F1-score

Having the confusion matrix as a foundation, additional measures can be calculated. Precision and recall are frequently used in information retrieval and pattern recognition [BKL09a]. The measures apply to binary classes where the outcome is either positive or negative. The formula for precision:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

where TP (true positives) is the number of positive predictions that were correct and FP (false positives) is the number of predicted positives that were incorrect. Recall is calculated as:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where FN (false negatives) is the number of predicted negatives that were incorrect. The F1-score, a harmonic mean of precision (P) and recall (R), was also used:

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$

In words, precision is the rate at which a positive prediction of a certain class is correct, and recall is the ratio of correctly predicted samples to the total number of samples belonging to the class. Table 1 shows an example of how to derive precision and recall from a confusion matrix.

Table 1: Example of a confusion matrix for a classifier with classes A, B and C. The classifier predicted 26 samples; the columns represent the predicted classes and the rows show the true classes.

n = 26 Predicted A Predicted B Predicted C

True A 4 1 0

True B 6 6 1

True C 2 3 3

In table 1, column 2 shows the samples that were predicted to belong to class B. The true positives for class B are TP_B = 6; the other predicted B's are false positives, hence FP_B = 1 + 3 = 4. Row 2 shows the samples that truly belong to class B; the true B's that were not predicted positive are false negatives, FN_B = 6 + 1 = 7. With these values, P, R and the F1-score for class B can be calculated:

$$P_B = \frac{6}{6 + 4} = 0.6000 \tag{4}$$

$$R_B = \frac{6}{6 + 7} = 0.4615 \tag{5}$$

$$F_{1B} = \frac{2 \cdot 0.6 \cdot 0.4615}{0.6 + 0.4615} = 0.5217 \tag{6}$$

These values are calculated for all classes and shown in table 2.

Table 2: Precision, recall and F1-score calculated for each class A, B and C from the values in table 1. The number of samples n for each class is also shown.

n = 26 Precision Recall F1-score n

A 0.3333 0.8000 0.4706 5

B 0.6000 0.4615 0.5217 13

C 0.7500 0.3750 0.5000 8

The resulting precision, recall and F1-score for the classifiers in the experiments are averages over all classes, where the precision, recall and F1-score of class k are weighted by the number of samples n_k for that class, summed, and finally divided by the total number of test samples N. The formula for the weighted average of precision:

$$\bar{P} = \frac{1}{N} \sum_{k=1}^{17} P_k \cdot n_k \tag{7}$$

For the example results in table 2, the weighted average precision according to equation 7 is:

$$\bar{P} = \frac{0.3333 \cdot 5 + 0.6 \cdot 13 + 0.75 \cdot 8}{26} = 0.5949 \tag{8}$$
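As a cross-check of this worked example (a sketch of mine, not thesis code), scikit-learn's weighted averaging reproduces the value of equation 8 from the raw labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# Rebuild the 26 test samples behind table 1: rows are true classes,
# columns are predicted classes.
rows = {"A": [4, 1, 0], "B": [6, 6, 1], "C": [2, 3, 3]}
labels = ["A", "B", "C"]
y_true, y_pred = [], []
for true_label, counts in rows.items():
    for pred_label, count in zip(labels, counts):
        y_true += [true_label] * count
        y_pred += [pred_label] * count

# Sample-weighted averages as in equation 7; precision matches 0.5949.
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(round(p, 4), round(r, 4), round(f1, 4))
```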


Receiver Operating Characteristics

With origins in signal detection, the receiver operating characteristics (ROC) curve is a plot for binary prediction that also reflects the costs of false positive predictions. It plots recall (or true positive rate, TPR) against the false positive rate (FPR). Unlike precision and recall, the ROC curve gives a fair measure for test sets with unbalanced classes [Faw06]. TPR is the same as recall, and FPR is calculated as:

$$FPR = \frac{FP}{FP + TN} \tag{9}$$

where FP is the number of false positives and TN is the number of predicted negatives that were correct. The ROC curve is intended for a binary classifier that gives probabilities for its predictions. When plotting an ROC curve for a test, the resulting predictions should come as a vector of probabilities, where a value close to 1 means that a positive prediction is likely and a value close to 0 means that the classifier is confident that the sample is negative. The second parameter for an ROC curve is a binary vector of the true classes [Sl16c]. The threshold value decides at which probability a positive prediction is given; if the threshold is 0.4, the classifier will predict positives for all probabilities above 0.4. The threshold value is swept from 0 to 1, and for each step the TPR and FPR are plotted.
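As a sketch of this procedure (assuming scikit-learn, which the thesis uses; not code from the thesis), the threshold sweep described above is exactly what sklearn.metrics.roc_curve performs:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical prediction probabilities and binary ground truth.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.45])

# roc_curve sweeps the decision threshold and returns FPR/TPR pairs.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # area under the ROC curve
```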

While the ROC curve visually displays the performance of a classifier, a numerical value can be easier to compare. That is why the Area Under the Curve (AUC) often is calculated for the ROC curve to indicate the performance of a classifier for a certain test run. The AUC is a value between 0 and 1. A random classifier would get an ROC curve as a line from (0, 0) to (1, 1) with AUC = 0.5, which means that an AUC above 0.5 is needed to perform better than a random classifier [Fla10].

However, ROC is made for binary classification, and since this project deals with multiclass classification, a modification of the ROC had to be made. The modification was to let the classifier predict classes for the test dataset. The true classes and predicted classes are then compared, and a binary vector is created holding 0 or 1 depending on whether each prediction was correct. The probability vector contains the probability that the classifier gave its most likely label.

The random baseline follows the same procedure for plotting an ROC curve, but on a random classifier. The random classifier distributes 17 random probabilities where ∑_{i=1}^{17} p(i) = 1; the class with the highest probability is the one predicted. The idea is to get a random classifier that gives an ROC curve as the line between (0, 0) and (1, 1).
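A minimal sketch of the modification described above (my reading of the procedure, not code from the thesis): the multiclass predictions are reduced to a correctness indicator paired with the top-class confidence:

```python
import numpy as np

def multiclass_to_binary_roc_inputs(proba, y_true, classes):
    """Reduce multiclass predictions to the binary ROC inputs described above:
    a 0/1 vector marking whether the top prediction was correct, and the
    probability the classifier assigned to that top prediction."""
    top_idx = np.argmax(proba, axis=1)          # most likely class per sample
    y_pred = np.asarray(classes)[top_idx]
    correct = (y_pred == np.asarray(y_true)).astype(int)
    confidence = proba[np.arange(len(proba)), top_idx]
    return correct, confidence

# Hypothetical 3-class example.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.5, 0.2],
                  [0.2, 0.3, 0.5]])
correct, confidence = multiclass_to_binary_roc_inputs(proba, ["A", "C", "C"], ["A", "B", "C"])
print(correct, confidence)  # [1 0 1] [0.7 0.5 0.5]
```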


AUC values for a large number of samples (as in the case of the test dataset) can be considered normally distributed [HM82]. For calculating the standard error (SE) of an AUC value, Hanley and McNeil's formula was applied:

$$SE(AUC) = \sqrt{\frac{AUC(1 - AUC) + (n_{true} - 1)(Q_1 - AUC^2) + (n_{false} - 1)(Q_2 - AUC^2)}{n_{true} \cdot n_{false}}} \tag{10}$$

where n_true is the number of correct predictions and n_false, respectively, is the number of false predictions. Q_1 and Q_2 are calculated as in equations 11 and 12:

$$Q_1 = \frac{AUC}{2 - AUC} \tag{11}$$

$$Q_2 = \frac{2 \cdot AUC^2}{1 + AUC} \tag{12}$$

Hypothesis testing was conducted for the ROC analysis in all experiments, by performing z-tests on the differences of two AUC values to see if the difference was distinctly separated from zero. The hypothesis test used the fact that the difference Z of two random variables X, Y from independent normal distributions, X ∼ N(µ_X, σ²_X) and Y ∼ N(µ_Y, σ²_Y), is itself normally distributed [ES08]:

$$Z \sim N(\mu_X - \mu_Y,\ \sigma_X^2 + \sigma_Y^2) \tag{13}$$

This rule made it possible to determine whether two classifiers' AUC values differed with statistical significance. The null hypothesis was that the difference of two AUC values equals 0. A right-tailed z-test with the normal distribution X ∼ N(0, σ²_AUC1 + σ²_AUC2) was used to get the p-value for the difference of two AUC values. A p-value smaller than 0.05 resulted in rejection of the null hypothesis, establishing that the difference of AUC₁ and AUC₂ was statistically significant; a p-value larger than 0.05 means that the difference was not statistically significant.
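The following sketch (my own, assuming the formulas above; not thesis code) computes the Hanley-McNeil standard error and the right-tailed z-test p-value for the difference of two AUC values; the prediction counts shown are hypothetical:

```python
import math
from scipy.stats import norm

def auc_standard_error(auc, n_true, n_false):
    """Hanley-McNeil standard error of an AUC value (equations 10-12)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_true - 1) * (q1 - auc**2)
           + (n_false - 1) * (q2 - auc**2)) / (n_true * n_false)
    return math.sqrt(var)

def auc_difference_p_value(auc1, se1, auc2, se2):
    """Right-tailed z-test on AUC1 - AUC2 under H0: difference = 0."""
    z = (auc1 - auc2) / math.sqrt(se1**2 + se2**2)
    return 1 - norm.cdf(z)

# Hypothetical correct/incorrect prediction counts for two classifiers.
se1 = auc_standard_error(0.8803, n_true=500_000, n_false=141_206)
se2 = auc_standard_error(0.8797, n_true=500_000, n_false=141_206)
print(auc_difference_p_value(0.8803, se1, 0.8797, se2))
```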

2.1.3 Classification

For this project to be reproducible, some points of the learning procedure will be specified, with a more detailed explanation of the algorithms used to train the prediction models. The focus of the experimental study was both on the classifier design and on the part called the feature extractor, where it is decided which features of the text represent it for the classifier. Apart from studying the feature extractor, a flat classification design (figure 1) has been compared to a hierarchically designed classification system (figure 2).

Figure 1: General layout of the classification process with labeled data (supervised classification) [BKL09b].

The classifiers were trained using a model called One Versus Rest (OVR), also known as one-vs-all, which means that for each class, a binary classifier is trained to decide how likely it is that a given input belongs to that class. The probabilities are compared and the highest-scoring binary classifier gives its class as the result of the whole classifier. OVR can be thought of as a simple design, yet the complexity grows with the number of classes, since every class needs its own predictive model [Sl16d].

Each predictive model within the OVR performs its prediction using a linear prediction function:

$$f(x) = w^T x + b \tag{14}$$

The prediction function is learned during the training phase. In the "learning algorithm" (figure 1), the model parameters w, with dimension equal to the length of the feature vector used, and the constant b are sought. In this project, this is done using an algorithm called Stochastic Gradient Descent (SGD).

Stochastic Gradient Descent

The type of problem that SGD solves belongs to the stochastic approximation algorithms in statistical learning. The goal is to make the expected loss E(f(x)) as small as possible. Consider the expected loss to be given by:

$$E(f(x)) = L(f(X), Y) \tag{15}$$

where X = {x₁, x₂, ..., xₙ}, Y = {y₁, y₂, ..., yₙ} and L(·) measures the size of the error according to the logistic loss. SGD's way of minimizing E(f(x)) is to visit all examples and update the model parameters w according to that step's logistic loss [Zha04].

For the specific SGD method used in this project, the update function for w looks like:

$$w_{i+1} = w_i - \eta \left( \alpha \frac{\partial R(w_i)}{\partial w_i} + \frac{\partial L(f(x_i), y_i)}{\partial w_i} \right) \tag{16}$$

where η defines the rate at which the w values are updated and α is a scaling factor for the regularization term, which is:

$$R(w_i) = \frac{w_i^2}{2} \tag{17}$$

As mentioned, this is one implementation of SGD, with logistic regression as the loss function and the regularization term R(w_i) (eq. 17) [Sl16a].
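As a toy sketch of the update rule in equation 16 (my own illustration; Sklearn's SGDClassifier implements this internally, so this is not the thesis code):

```python
import numpy as np

def sgd_step(w, x, y, eta=0.01, alpha=0.0001):
    """One SGD step with logistic loss and L2 regularization R(w) = ||w||^2 / 2,
    for a label y in {-1, +1} (a toy version of equation 16)."""
    margin = y * np.dot(w, x)
    # Gradient of the logistic loss log(1 + exp(-margin)) w.r.t. w:
    grad_loss = -y * x / (1 + np.exp(margin))
    grad_reg = w  # derivative of ||w||^2 / 2
    return w - eta * (alpha * grad_reg + grad_loss)

w = np.zeros(3)
w = sgd_step(w, x=np.array([1.0, 2.0, 0.5]), y=1)
print(w)
```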

Hierarchical model

The hierarchical classifier in the experimental study (see figure 2) consisted of five classifiers: one that classified the main class, and four that predicted the subclass (see all classes in Appendix A). Each of the five classifiers followed the flat classifier design in figure 1. For testing and comparing the 2-level hierarchical design with the flat design, the same input was given and the same output was expected. A structural sketch follows below.
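This sketch of the wiring in figure 2 is my own reading of the design, not thesis code; the dispatch from main-class prediction to subclass model is the essential idea:

```python
import numpy as np

class HierarchicalClassifier:
    """Hypothetical wiring of figure 2: a main-class model routes each
    sample to one of four subclass models (all five are flat classifiers)."""

    def __init__(self, main_clf, subclass_clfs):
        self.main_clf = main_clf            # predicts one of the 4 main classes
        self.subclass_clfs = subclass_clfs  # dict: main class -> subclass model

    def predict(self, X):
        main_pred = self.main_clf.predict(X)
        out = np.empty(len(main_pred), dtype=object)
        for main_class, sub_clf in self.subclass_clfs.items():
            mask = main_pred == main_class
            if mask.any():
                out[mask] = sub_clf.predict(X[mask])  # 2nd-level prediction
        return out
```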


Figure 2: Design of the 2-level hierarchical classifying system. The main class classifier's prediction determines which subclass classifier performs the second-level classification.

Software Implementation of the Classifiers

The implementation of the classifiers in the experimental study was done in the programming language Python (all dependencies are presented in Appendix B) with the machine learning toolkit Scikit-Learn (Sklearn). The OVR model used was Sklearn's OneVsRestClassifier [Sl16d], with Sklearn's SGDClassifier [Sl16a] as the estimator for each class in the OVR. For the SGD, the logistic regression loss function was chosen via the parameter SGDClassifier(loss="log"); other parameters (η, α) for the OVR and SGD modules were left at their defaults. The feature vectorizers used were a CountVectorizer and a TfidfVectorizer. For the last experiment, a DictVectorizer was used to transform a list of Python dictionaries into a feature vector for classification.
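A minimal reconstruction of this setup (based on the modules named above; the training data and any parameter not stated in the text are my assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical training data; the thesis used ~2.5 million SE titles.
titles = ["Statistical Dimension of a Cone", "How to parse XML in Python"]
labels = ["Mathematics and Statistics", "Computer Science"]

vectorizer = CountVectorizer()  # term-count BOW features
X = vectorizer.fit_transform(titles)

# One binary SGD model with logistic loss per class, as described above.
# loss="log" matches scikit-learn 0.18 (Appendix B); newer versions use "log_loss".
clf = OneVsRestClassifier(SGDClassifier(loss="log"))
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["Eigenvalues of a random matrix"])))
```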

2.1.4 Machine Architecture

The machine used for the project had 8 gigabytes (GB) of RAM and an Intel Core i7-3537U processor with 2 cores and 2 hyperthreads per core at a 2.00 GHz clock frequency. All the classifiers in the experiments were trained and tested on this computer.


3 Results

In this section the results from the experimental study are presented. The results are divided into 4 experiments, and each experiment is explained and visualized in its own subsection.

3.1 Results from the Experimental Study

The experimental study compared different feature settings for short text classification. Every experiment included a comparison of the flat against the 2-level hierarchical classifier design. The metrics examined in the comparisons were precision, recall, the ROC curve and its AUC.

The text features included in the study were divided into four experiments: Term Frequency times Inverse Document Frequency (TFIDF) normalized BOW features versus integer-represented terms in BOW features; removing versus keeping stop words, together with lowercase conversion versus no case conversion; the order n of n-grams in BOW features; and adding features for the number of characters and the number of quotation marks in the texts. The results are displayed in the following subsections with graphs of ROC curves accompanied by AUC values with hypothesis testing, and tables with the measures precision, recall and F1-score.

3.1.1 Comparing Weighting Methods of Terms in BOW features

TFIDF is a weighting method for terms in BOW features; TFIDF stands for Term Frequency times Inverse Document Frequency. It is calculated for each term in a document and for all documents during the feature extraction phase. For a term t in document d of dataset D:

$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D) \tag{18}$$

The term frequency TF(t, d) is the number of times the term t occurs in document d. In Sklearn's TFIDF method, which was used in this experiment, the IDF part is calculated as:

$$IDF(t, D) = \log\left(\frac{1 + N_D}{1 + DF(t, D)}\right) + 1 \tag{19}$$

where N_D is the number of documents in dataset D, and DF(t, D) is the number of documents in the dataset that contain the term t. After the TFIDF value is calculated for each term, the vectors containing the TFIDF values are normalized with the Euclidean norm [Sl16b]:

$$v_{normalized} = \frac{v}{\sqrt{v_1^2 + v_2^2 + ... + v_n^2}} \tag{20}$$

Since this experiment was the first, no other settings for the BOW features were added. The experiment was done using the default parameters of the Sklearn modules TfidfVectorizer and CountVectorizer. The TfidfVectorizer performed the weighting method explained in formulas 18, 19 and 20; the CountVectorizer used a term count representation of BOW features.
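A small sketch of the two feature extraction variants compared here (default parameters, as in the experiment; the toy documents are my own):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["my dog scared them away", "my dog likes my dog food"]

count_vec = CountVectorizer()  # integer term counts
tfidf_vec = TfidfVectorizer()  # TFIDF weighting with L2 normalization

print(count_vec.fit_transform(docs).toarray())
print(tfidf_vec.fit_transform(docs).toarray().round(3))
```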

Figure 3: ROC curves of the experiment on TFIDF weighted BOW features versus integer represented term occurrence in BOW features (term count). The comparison is made for both flat and 2-level hierarchical designs.

In figure 3 it is shown that the flat designed classifier with TFIDF weighting got the highest score of AUC₁ = 0.8803. Second was the 2-level hierarchical design with integer-represented term occurrence at AUC₂ = 0.8797. The third best result came from the flat design with integer-represented term occurrence, with a score of AUC₃ = 0.8777. The z-test of the difference AUC₁ − AUC₂ gave a p-value of p = 0.2212 (see figure 4), and the difference AUC₂ − AUC₃ gave a p-value of p = 0.0092 (see figure 5).

Figure 4: AUC of flat design with TFIDF weighting minus AUC of 2-level hierarchical design using integer-represented term occurrence. Looking at the right-side tail of the normal distribution, the difference AUC₁ − AUC₂ = 0.0006 gives a p-value of 0.2212.


Figure 5: AUC of 2-level hierarchical designed minus flat designed classifier, both using integer-represented term occurrence. Looking at the right-side tail of the normal distribution, the difference AUC₂ − AUC₃ = 0.0020 gives a p-value of 0.0092.

The z-test displayed in figure 4 shows that the difference between the flat design with TFIDF and the 2-level hierarchical design with integer-represented term occurrence has a p-value of p = 0.2212, too high to reject the null hypothesis. The other z-test, seen in figure 5, shows that the difference between the 2-level hierarchical and flat designed classifiers, both using integer-represented term occurrence, gave a p-value of p = 0.0092, low enough to reject the null hypothesis.

Table 3: Results in precision, recall and F1-score for integer-represented term occurrence (term count) versus TFIDF weighted BOW features. Both feature types were tested with flat and 2-level hierarchical designs.

Classifier Design BOW-feature representation Precision Recall F1-score

Flat Term Count 0.7899 0.7908 0.7576

Flat TFIDF weighting 0.7214 0.7067 0.6240

2-level hierarchical Term Count 0.7912 0.7915 0.7598

2-level hierarchical TFIDF weighting 0.7239 0.7073 0.6305

Unlike the ROC analysis in figure 3, the measures in table 3 showed higher values for integer-represented term occurrence. For the flat designed classifier, the F1-score was 0.7576 for the integer-represented term occurrence and 0.6240 for the TFIDF weighting method, an increase of 21.4% in F1-score. The 2-level hierarchical designed classifier's F1-scores were 0.7598 for the integer-represented term occurrence and 0.6305 for the TFIDF weighting method, an increase of 20.5% in F1-score.

3.1.2 The Effect of Case conversion and Removal of Stop Words

This subsection contains the results of the experiment on converting the texts to lowercase and removing predefined stop words. Stop words are common words that rarely contribute to the meaning of texts; by removing them, the dimension of the BOW feature vector is lowered [BKL09c]. In this experiment, NLTK's list of English stop words was used. Prior to creating the feature vector, words in the text that exist in the list of stop words are removed. Case conversion is also performed before the feature vector is produced, by going through the input and converting all letters to lowercase. The idea of lowercase conversion is to minimize the vocabulary: without case conversion, the same word can exist multiple times in the vocabulary, spelled with differently cased letters. This experiment built on the result from the previous experiment and therefore used integer-represented term occurrence, with the CountVectorizer as feature vectorizer.
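A sketch of this preprocessing (assuming NLTK's stop word list, as stated above; CountVectorizer accepts the list directly, and the toy input is my own):

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords") once
from sklearn.feature_extraction.text import CountVectorizer

english_stop_words = stopwords.words("english")

# Lowercase conversion and stop word removal happen before the
# vocabulary is built.
vectorizer = CountVectorizer(lowercase=True, stop_words=english_stop_words)
vectorizer.fit(["The Dog scared them away"])
print(sorted(vectorizer.vocabulary_))  # stop words like "the"/"them" are gone
```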


Figure 6: ROC curves of the experiment on case conversion and stop word removal. The different approaches were tested on both flat and 2-level hierarchical designs.

The results of the ROC analysis can be seen in figure 6. Three classifiers with the 2-level hierarchical design got the highest AUC scores: lowercase conversion at AUC₁ = 0.8826, stop word removal at AUC₂ = 0.8829, and both lowercase conversion and stop word removal at AUC₃ = 0.8825. The highest difference between the three was AUC₂ − AUC₃ = 0.0004, which was less than both results' standard error of 0.0006. A statistically significant difference between the three can therefore be rejected.

The best AUC score for the flat designed classifier used lowercase conversion, with a score of AUC₄ = 0.8797. The z-test performed on the difference AUC₁ − AUC₄ resulted in a p-value of p = 0.0003, a statistically significant difference.


Figure 7: AUC of 2-level hierarchical design minus AUC of flat design, both with lowercase conversion. Looking at the right-side tail of the normal distribution, the difference AUC₁ − AUC₄ = 0.0029 gives a p-value of 0.0003.

For the flat designed classifier, figure 8 shows that conversion to lowercase letters was better than no case conversion with statistical significance; the p-value was p = 0.0169.

Figure 8: AUC of lowercase conversion minus no case conversion, both with flat classifier designs. Looking at the right-side tail of the normal distribution, the difference of 0.0018 gives a p-value of 0.0169.


In the case of the 2-level hierarchical design, comparing lowercase conversion against no case conversion, the standard errors were the same as for the flat design (σ = 0.0006), but the difference between the AUC values was 0.0023, larger than the flat design's 0.0018. The hypothesis test therefore gives a p-value smaller than in the flat case, p < 0.0169, which establishes statistical significance.

Figure 9 shows that removing stop words was better than keeping them for the flat designed classifier, with statistical significance at a p-value of p = 0.0226.

Figure 9: AUC of removing stop words minus keeping stop words, both with flat classifier designs. Looking at the right-side tail of the normal distribution, the difference of 0.0017 gives a p-value of 0.0226.

The difference in AUC values between removing and keeping stop words was larger for the 2-level hierarchical classifier than for the flat design, which was statistically significant. Since they share the same standard error (σ = 0.0006), the p-value of the difference for the 2-level hierarchical design was p < 0.0226, and thus also statistically significant.

Table 4 presents the values of precision, recall and F1-score from the experiment on case conversion and stop word removal.


Table 4: Results in precision, recall and F1-score for case conversion and stop word removal. All feature settings were tested on both flat and 2-level hierarchical designs.

Classifier design Lowercase Stop words removed Precision Recall F1-score

Flat No No 0.7900 0.7902 0.7568

Flat Yes No 0.8013 0.8045 0.7765

Flat Yes Yes 0.8013 0.8044 0.7762

Flat No Yes 0.8013 0.8044 0.7762

2-level hierarchical No No 0.7920 0.7920 0.7605

2-level hierarchical Yes No 0.8035 0.8057 0.7790

2-level hierarchical Yes Yes 0.8039 0.8060 0.7795

2-level hierarchical No Yes 0.8032 0.8052 0.7783

Table 4 shows that the classifiers that used neither case conversion nor stop word removal got worse results than the others. For the flat design, the best score was obtained with lowercase conversion (F1-score = 0.7765), which was 2.6% higher than the F1-score of the classifier using no case conversion. The 2-level hierarchical designed classifier with lowercase conversion had a 2.4% higher F1-score than with no case conversion. When comparing the results of stop word removal, the flat designed classifier with stop words removed had a 2.6% higher F1-score than the one keeping stop words; the 2-level hierarchical case showed a similar increase of 2.3%. The best performing classifier of this experiment in terms of F1-score was the 2-level hierarchical design with lowercase conversion and stop word removal (F1-score = 0.7795).

3.1.3 Number of Words to use in Terms for BOW features

This experiment examined a BOW feature setting called n-grams. The number of words in the terms of the BOW features can be adjusted; the maximum number n of words that a term can hold is referred to as the n-gram order.

For the example sentence "My dog scared them away.", the unigram (n = 1) BOW features would contain the terms: ["my", "dog", "scared", "them", "away"]. For bigram (n = 2), the BOW features would contain: ["my", "dog", "scared", "them", "away", "my dog", "dog scared", "scared them", "them away"] [BKL09d]. By extending the n-gram order, more information from the text is collected. More information leads to a larger-dimensioned feature vector, which means a heavier weight on memory and time consumed for prediction. For the training set, the vocabulary size of the flat designed classifier increased from 246 169 for unigram, to 4 300 121 for bigram and 14 402 201 for trigram.

For this experiment, the CountVectorizer (introduced in section 3.1.1) was used with lowercase conversion and stop words kept (see section 3.1.2). The experiment examined the effects of increasing the n-gram order from unigram, to bigram and trigram. Due to RAM limitations (see the machine architecture in section 2.1.4), the size of the feature vector was capped at 5 000 000; the trigram feature vector hit this cap in the experiment. A sketch of these settings follows below.
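This sketch (my own) shows the CountVectorizer parameters that correspond to the bigram setting and the feature cap described above; on a single toy sentence the cap has no effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["My dog scared them away"]

# Bigram setting: terms of one and two words, capped at 5,000,000 features
# as in the experiment (the cap only matters on the full dataset).
bigram_vec = CountVectorizer(lowercase=True, ngram_range=(1, 2), max_features=5_000_000)
bigram_vec.fit(sentence)
print(sorted(bigram_vec.vocabulary_))
# ['away', 'dog', 'dog scared', 'my', 'my dog', 'scared', 'scared them', 'them', 'them away']
```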

Figure 10: ROC curves of the experiment on length of n-grams in BOW-features, uni-, bi- and trigrams. The three n-gram lengths were tested on both flat and 2-level hierarchical designs.

Figure 10 shows the results of the ROC analysis of this experiment. The 2-level hierarchical design with bigram and trigram got the highest AUC value, AUC₁ = 0.8870; it follows that there is no reliable difference between those two results. The flat design with trigram was second at AUC₂ = 0.8837, and the 2-level hierarchical design with unigram received a score of AUC₃ = 0.8829. For each n-gram setting, the 2-level hierarchical design got the higher AUC value. The first z-test, displayed in figure 11, examined the difference between the flat and 2-level hierarchical designs for trigram BOW features, giving p = 0.0004, which is statistically significant.

Figure 11: AUC of 2-level hierarchical design minus AUC of flat design, both with trigram BOW features. Looking at the right-side tail of the normal distribution, the difference AUC₁ − AUC₂ = 0.0033 gives a p-value of 0.0002.

The AUC for the flat designed classifier with bigrams differed by 0.0035 from the unigram. The difference proved statistically significant, with a p-value of p = 0.0001, as seen in figure 12. The corresponding difference for the 2-level hierarchical designed classifier was 0.0041, with lower standard errors, which means that difference was also statistically significant, with a p-value below 0.0001.


Figure 12: AUC of bigram minus AUC of unigram, both with flat classifier design. Looking at the right-side tail of the normal distribution, the difference of 0.0035 gives a p-value of p = 0.0001.

The measures of precision, recall and F1-score, displayed in table 5, show that the classifiers scored higher for each increase in n-gram order. For the flat design, extending from unigram to bigram gave a 1.6% higher F1-score, and extending from bigram to trigram gave a 0.07% higher F1-score. The same comparison for the 2-level hierarchical design showed a 1.3% increase from unigram to bigram, and a 0.02% increase from bigram to trigram.

Table 5: Results in precision, recall and F1-score of the experiment on length of n-grams in BOW-features. The three n-gram lengths were tested on both flat and 2-level hierarchical designs.

Classifier Design n-gram Precision Recall F1-score

Flat Unigram 0.8013 0.8041 0.7758

Flat Bigram 0.8113 0.8138 0.7883

Flat Trigram 0.8122 0.8143 0.7889

2-level hierarchical Unigram 0.8032 0.8054 0.7784

2-level hierarchical Bigram 0.8130 0.8135 0.7887

2-level hierarchical Trigram 0.8146 0.8148 0.7904


3.1.4 Features for Counting Characters and Quotation Marks

This last experiment of the study tested the effect of adding features for the length of the texts (i.e. how many characters a text consisted of) and for how many quotation marks occurred in the texts, represented as integers in the feature vector. The experiment was done using lowercase conversion and integer-represented bigram occurrences as BOW features: a CountVectorizer with parameters lowercase = True, ngram_range = (1,2), and the rest left at their defaults.
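A sketch of the added text stat features (the thesis names DictVectorizer for this step; the exact feature keys and example texts are my assumptions):

```python
from sklearn.feature_extraction import DictVectorizer

def text_stats(text):
    # Hypothetical feature keys; the thesis counts characters and quotation marks.
    return {"n_chars": len(text), "n_quotes": text.count('"')}

texts = ['What does "final" mean in Java?', 'Statistical Dimension of a Cone']
vec = DictVectorizer()
X_stats = vec.fit_transform([text_stats(t) for t in texts])
print(vec.feature_names_, X_stats.toarray())
```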

Figure 13: ROC curves of the experiment on adding features for the length of the texts and the number of quotation marks in them. The different text stat features were tested on both flat and 2-level hierarchical designs.

In the ROC analysis in figure 13, the highest AUC value, AUC₁ = 0.8872, came from the 2-level hierarchical design without the text stat features examined in this experiment. At AUC₂ = 0.8867, the 2-level hierarchical classifier with the quotation mark counting feature got the second highest AUC value. The third highest scores came from the flat design, where the classifier without the text stat features and the classifier using the quotation mark counting feature got the same AUC value of AUC₃ = 0.8837.

The z-test that examined whether the difference AUC₁ − AUC₂ was statistically significant (see figure 14) showed a p-value of 0.2778, which indicates that the difference was not statistically significant.

Figure 14: AUC of no text stats minus AUC of counting quotation marks, both with 2-level hierarchical design. Looking at the right-side tail of the normal distribution, the difference AUC₁ − AUC₂ = 0.0005 gave a p-value of 0.2778.

The last z-test examined the difference between the flat and 2-level hierarchical classifier designs, both without the text stat features introduced in this experiment. As seen in figure 15, a p-value of 0.0001 was obtained, far below the threshold value, so the difference was statistically significant.


Figure 15: AUC of 2-level hierarchical design minus AUC of flat design, both without any text stat features for counting characters or quotation marks. Looking at the right-side tail of the normal distribution, the difference AUC₁ − AUC₃ = 0.0035 gave a p-value of p = 0.0001.

For the second set of measures in this experiment, a flat classifier design received the highest precision, recall and F1-score, with an F1-score of 0.8013. The 2-level hierarchical design with quotation marks received an F1-score of 0.7905, as seen in table 6. For the flat designed classifier, counting quotation marks increased the F1-score by 0.03%, and by 0.02% for the 2-level hierarchical design.

Table 6: Results in precision, recall and F1-score of the experiment on adding text stat features for text length and quotation marks.

Classifier design Text stats Precision Recall F1-score

Flat - 0.8109 0.8126 0.7866

Flat Text length 0.8026 0.8016 0.7762

Flat Text length and quotation marks 0.8168 0.8199 0.8013

Flat Quotation marks 0.8128 0.8147 0.7893

2-level hierarchical - 0.8136 0.8136 0.7888

2-level hierarchical Text length 0.8024 0.7984 0.7698

2-level hierarchical Text length and quotation marks 0.8051 0.8039 0.7779

2-level hierarchical Quotation marks 0.8152 0.8150 0.7905

Appendix C shows the confusion matrix of the result from the flat designed classifier that got the best score in precision and recall. From the confusion matrix, one can get an overview of how the data was distributed over the classes and how it was predicted.


4 Discussion

This thesis examined how a classifier should be designed, and which decisions to make when choosing text features, for multiclass classification of short texts. The experiments showed that a 2-level hierarchical designed classifier scored higher than the flat design in 11 out of 13 cases, for both F1-score and AUC of the ROC curve.

The highest AUC value was received by the 2-level hierarchical designed classifier with the feature settings: term count representation of bigram terms in BOW features; conversion of the input to lowercase letters; keeping the stop words of the text; and no features added for counting characters or quotation marks in the texts. This classifier received an AUC value of 0.8872. The best F1-score, F1 = 0.8013, was achieved by a flat designed classifier with the feature settings: term count representation of bigram terms in BOW features; conversion of the input to lowercase letters; keeping the stop words of the texts; and adding features for the number of characters and quotation marks in the texts.

The findings of this thesis act as an opening on the subject of how to implement short text classifiers. Since previous research on the subject was sparse, this thesis contributes scientific support for what were previously opinions and ideas.

One of the inspirations for the examined text features was a Kaggle competition [KAG13] where most solutions used TFIDF weighted terms in BOW feature vectors; in this study, unexpectedly, a term count representation of the BOW features was more successful.

To increase the reliability of the results, the ROC curve was added to the quality measures rather than only using the standard classification measures of precision, recall and F1-score; ROC analysis was brought in to balance the results. However, the ROC curve is designed for analyzing binary predictions, so a modification for multiclass ROC analysis was used in this project. Since it is an untried method, it should be used along with other measures, to see whether it is a credible measurement method for multiclass classification.

4.1 Conclusions

The 2-level hierarchical designed classifiers gave significantly better ROC curves than the flat designed classifiers. With a p-value of p = 0.0006, the 2-level hierarchical designed classifier with the highest AUC value, 0.8872, proved significantly better than the flat designed classifier with the highest AUC value, 0.8837. The 2-level hierarchical design gave better results in 11 out of 13 implemented classifiers. For One Versus Rest classifiers, each class gets its own binary classifier; with a hierarchically designed classifier, the complexity is therefore reduced in terms of training time and memory load.

On the comparison of weighting methods for terms in BOW features, the ROC analysis gave the flat designed classifier with TFIDF weighting the best AUC value of 0.8803, while the 2-level hierarchical designed classifier with term count got an AUC value of 0.8797. With a p-value of p = 0.2212, the two classifiers' AUC results were too close to claim a statistically significant difference. Examining the F1-scores, term count received a 21.4% (flat design) and 20.5% (2-level hierarchical design) better F1-score. The TFIDF weighting method has been shown successful for other text classification purposes, but the findings of this thesis indicate that term count was more suitable for the classification of short texts.

Conversion to lowercase letters gave better results than no case conversion. The ROC analysis showed that the flat designed classifier with lowercase conversion received a higher AUC value than with no case conversion; with a p-value of 0.0169, the difference was statistically significant. Also for the 2-level hierarchical design, lowercase conversion gave a statistically significantly higher AUC value than no case conversion. Moving on to the F1-scores, lowercase conversion got a 2.6% and 2.4% better F1-score than no case conversion, for the flat and 2-level hierarchical classifier designs respectively. Lowercase conversion is a simple adjustment of the input which, according to this study, improves classification performance for short text classification.

The results indicate that removing stop words was more successful than keeping them. The AUC values from the ROC curves showed, for both classifier designs, that removing stop words is statistically significantly better than keeping them, with a p-value of 0.0226 for the flat designed classifier and less than 0.0226 for the 2-level hierarchical design. The F1-scores increased when removing stop words, by 2.6% and 2.3% for the flat and 2-level hierarchical designs respectively. This result suggests that even for short texts, stop words do not carry information that contributes to the meaning of the texts, since removing them gives better classifier results.

The bigram and trigram n-gram orders gave better results than unigrams for the length of terms in the BOW features. The AUC values from the ROC analysis showed that bigram BOW features were statistically significantly better than unigram: the flat designed classifier with bigram BOW features differed significantly from unigram BOW features, with a p-value of p = 0.0001, and the corresponding 2-level hierarchical difference gave a p-value below 0.0001. The difference in AUC values between bigram and trigram was not statistically significant for either classifier design. Looking at the F1-scores, the flat designed classifier with bigram BOW features increased the F1-score by 1.6% over unigram, while extending from bigram to trigram only increased the F1-score by 0.07%; the 2-level hierarchical F1-score increased by 1.3% and 0.02% respectively. As mentioned in the results (section 3.1.3), the trigram hit the limit of 5 000 000 terms in the BOW vocabulary, which should be kept in mind when examining the results of this experiment. Increasing the n-gram order of BOW features was a successful way to add information from the short texts used in this thesis.

Adding a feature for the number of characters in the texts did not prove successful: both the AUC values and the F1-scores were better without the text length feature.

The feature for counting quotation marks in the texts indicated a minimal improvement in F1-score compared with not counting them. For the AUC values, though, no statistically significant difference was identified: the flat designed classifier got the same AUC value with and without the quotation mark count. For the 2-level hierarchical classifier design, the AUC value without counting quotation marks was higher than with counting them, but with a p-value of 0.2778 the difference was not statistically significant. The F1-score increased by 0.03% for the flat designed classifier and 0.02% for the 2-level hierarchical design.

The results from the experimental study were used in a case study with Thingmap, mapping natural language queries to users. This resulted in an improvement over earlier versions of their system.

4.2 Future Work

Since the 2-level hierarchical design proved successful, it would be interesting to see what a higher order of hierarchy can do, not only for short text classification but also for other classification tasks.

The dataset used was imbalanced and heavily weighted toward computer science. It would be interesting to see similar experiments with a balanced dataset; however, large datasets for supervised training are not commonly available. Another interesting aspect would be to test methods that deal with imbalanced datasets, such as: undersampling, which removes samples from the majority classes; oversampling, where samples are generated for minority classes; or cost-sensitive learning, which evaluates the cost associated with misclassifying observations.


This thesis contributed a method for performing ROC analysis on multiclass classifiers, for future research on such classifiers. The ROC modification is recommended as an addition to other measures: partly to establish the credibility of the modification itself, but also to add a balanced measure to a study, for example when dealing with imbalanced classes.


5 References

[Alp14] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 3rd edition, 2014.

[BKL09a] Steven Bird, Ewan Klein, and Edward Loper. Evaluation. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.3, pages 237–241. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09b] Steven Bird, Ewan Klein, and Edward Loper. Figure 6.1. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.1, page 222. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09c] Steven Bird, Ewan Klein, and Edward Loper. Lexical resources. In Julie Steele, editor, Natural Language Processing with Python, chapter 2.4, pages 59–66. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09d] Steven Bird, Ewan Klein, and Edward Loper. N-gram tagging. In Julie Steele, editor, Natural Language Processing with Python, chapter 5.5, pages 202–208. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09e] Steven Bird, Ewan Klein, and Edward Loper. Preface. In Julie Steele, editor, Natural Language Processing with Python, page ix. O'Reilly Media, Inc, Sebastopol, 2009.

[BKL09f] Steven Bird, Ewan Klein, and Edward Loper. Supervised classification. In Julie Steele, editor, Natural Language Processing with Python, chapter 6.1, pages 221–233. O'Reilly Media, Inc, Sebastopol, 2009.

[BMG10] Janez Brank, Dunja Mladenić, and Marko Grobelnik. Feature Construction in Text Mining, pages 397–401. Springer US, Boston, MA, 2010.

[Bot10] Léon Bottou. Large-scale machine learning with stochastic gradient descent. International Conference on Computational Statistics, pages 177–187, 2010.

[ES08] Bennett Eisenberg and Rosemary Sullivan. Why is the sum of independent normal random variables normal? Mathematics Magazine, 81(5):362–366, 2008.

[Faw06] Tom Fawcett. An introduction to ROC analysis. In Pattern Recognition Letters, volume 27, pages 861–874. Elsevier, Palo Alto, 2006.

[Fla10] Peter A. Flach. ROC analysis. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning, pages 869–875. Springer US, Boston, MA, 2010.

[HM82] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982. PMID: 7063747.

[HTF09] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York, 2009.

[KAG13] Facebook Recruiting III - Keyword Extraction. https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction, 2013. Visited 2017-01-09.

[Mar11] John Markoff. Computer wins on 'Jeopardy!': Trivial, it's not. The New York Times, 2011.

[MRS08] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Boston, MA, 2008.

[MS99] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[SCK+16] Mark Sammons, Christos Christodoulopoulos, Parisa Kordjamshidi, Daniel Khashabi, Vivek Srikumar, Paul Vijayakumar, Mazin Bokhari, Xinbo Wu, and Dan Roth. Edison: Feature extraction for NLP, simplified. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), 2016.

[SE16a] Stack Exchange, Inc. About - Stack Exchange. http://stackexchange.com/about, 2016. Visited 2016-09-12.

[SE16b] Stack Exchange, Inc. Stack Exchange Data Dump. https://archive.org/details/stackexchange, 2016. Visited 2016-09-12.

[Sl16a] Scikit-learn. 1.5 Stochastic gradient descent. http://scikit-learn.org/stable/modules/sgd.html, 2016. Visited 2017-01-17.

[Sl16b] Scikit-learn. Feature extraction. http://scikit-learn.org/stable/modules/feature_extraction.html, 2016. Visited 2017-01-05.

[Sl16c] Scikit-learn. sklearn.metrics.roc_curve. http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html, 2016. Visited 2017-01-10.

[Sl16d] Scikit-learn. sklearn.multiclass.OneVsRestClassifier. http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html, 2016. Visited 2017-01-29.

[Tin10] Kai Ming Ting. Confusion matrix. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning, page 209. Springer US, Boston, MA, 2010.

[Tur50] Alan M. Turing. Computing machinery and intelligence. Mind, 49:433–460, 1950.

[WZH] Ke Wang, Senqiang Zhou, and Yu He. Hierarchical classification of real life documents. In Proceedings of the 2001 SIAM International Conference on Data Mining, pages 1–16.

[Zha04] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004), pages 919–926. Omnipress, 2004.


Appendices

A Classes for the Classifier Designs

Flat classifier's classes:

Arts
Business and Administration
Computer Science
Education
Engineering
Environment
Health
Humanities
Journalism and Information
Law
Life Sciences
Mathematics and Statistics
Personal Services and Hobbies
Physical Sciences
Social Science
Tech
Transport Services

2-level hierarchical classifier's main classes:

Business and Logistics
Health, Life and Earth
Human Sciences
Tech and Physical Sciences

Business and Logistics's subclasses:

Business and Administration
Transport Services

Health, Life and Earth's subclasses:

Environment
Health
Life Sciences
Social Science

Human Sciences' subclasses:

Arts
Education
Humanities
Journalism and Information
Law
Personal Services and Hobbies

Tech and Physical Sciences' subclasses:

Computer Science
Engineering
Mathematics and Statistics
Physical Sciences
Tech


B Python Build Dependencies

Python version 3.5.2 with the following dependencies:

Cython 0.24.1
matplotlib 1.5.3
nltk 3.2.1
numpy 1.11.2
scikit-learn 0.18
scipy 0.18.1
stop-words 2015.2.23.1


C Confusion Matrix

Confusion matrix of the result from the best performing flat designed classifier, using BOW features with term-counted bigrams, lowercase conversion and features for text length and number of quotation marks. The diagonal shows the correct predictions of the test. [17 × 17 matrix over the classes listed in Appendix A; the table itself is not reproduced here.]

Methods Through close collaboration with existing community surveillance operations in a range of settings, this work uses existing data from demographic surveillance sites