Comparative Study of the Combined Performance of Learning Algorithms and Preprocessing Techniques for Text Classification

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

Comparative Study of the Combined Performance of Learning Algorithms and

Preprocessing Techniques for Text Classification

MILA GRANCHAROVA

MICHAELA JANGEFALK


Abstract

With the development in the area of machine learning, society has become more dependent on applications that build on machine learning techniques. Despite this, there are extensive classification tasks which are still performed by humans. This is time costly and often results in errors. One application of machine learning is text classification, which has been researched extensively over the past twenty years. Text classification tasks can be automated through supervised learning, which can lead to increased performance compared to manual classification. When handling text data, the data often has to be preprocessed in different ways to ensure a good classification. Preprocessing techniques have been shown to increase the performance of text classification through supervised learning. Different preprocessing techniques affect the performance differently depending on the choice of learning algorithm and the characteristics of the data set.

This thesis investigates how classification accuracy is affected by different learning algorithms and different preprocessing techniques for a specific customer feedback data set. The investigated algorithms are Naïve Bayes, Support Vector Machine and Decision Tree. The research is done through experiments in which the algorithm and the combination of preprocessing techniques are varied. The results show that spelling correction and removing stop words increase the accuracy for all classifiers, while stemming lowers the accuracy for all classifiers. Furthermore, Decision Tree was most positively affected by preprocessing while Support Vector Machine was most negatively affected. A deeper study on why the preprocessing techniques affected the algorithms in this way is recommended for future work.

Keywords

text classification; supervised learning; preprocessing


Sammanfattning

With the development in the area of machine learning, society has become more dependent on applications built on machine learning techniques. Despite this, there are extensive classification tasks that are still performed by humans. This is time consuming and often results in various types of errors.

One task within machine learning is text classification, which has been researched extensively over the past twenty years. Text classification can be automated through supervised machine learning, which can lead to efficiency gains compared to manual classification. Text data often has to be preprocessed in different ways to ensure a good classification. Preprocessing techniques have been shown to increase the performance of text classification through supervised learning. Different preprocessing techniques affect the performance differently depending on the choice of learning algorithm and the characteristics of the data set.

This thesis investigates how classification accuracy is affected by different learning algorithms and different preprocessing techniques for a specific data set consisting of customer feedback. The investigated algorithms are Naïve Bayes, Support Vector Machine and Decision Tree. The investigation is done through experiments in which the algorithm and the combination of preprocessing techniques are varied. The results show that spelling correction and removal of stop words increase the accuracy for all classifiers, while stemming lowers the accuracy for all. Decision Tree was the most positively affected by the preprocessing methods, while Support Vector Machine was the most negatively affected. A deeper study on why the preprocessing techniques affected the algorithms in this way is recommended for future work.

Nyckelord

text classification; supervised learning; preprocessing techniques


Contents

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Objectives
1.5 Sustainability and Ethics
1.6 Methodology
1.7 Delimitations
1.8 Thesis Outline
2 Technical Background
2.1 Classification Algorithms
2.1.1 Naïve Bayes
2.1.2 Support Vector Machine
2.1.3 Decision Tree
2.2 Performance Measurements
2.3 Natural Language Processing
3 Method
3.1 Research Method
3.2 Data Limitations
3.3 Preparatory Work
3.4 Experiments
3.4.1 Classification Algorithms to be Evaluated
3.4.2 Preprocessing Techniques to be Evaluated
3.4.3 Experiments to be Conducted
3.4.4 Evaluation Method
3.5 Setup
4 Results
4.1 Preparatory Work
4.2 Experiments
5 Discussion and Conclusions
5.1 Discussion
5.2 Conclusions
5.3 Future work
Appendix A Stop words


List of abbreviations

DT   Decision Tree
FP   False Positive
FN   False Negative
NB   Naïve Bayes
NLP  Natural Language Processing
NLTK Natural Language Toolkit
POS  Part of Speech
SVM  Support Vector Machine
TN   True Negative
TP   True Positive


List of Tables

1  Format of the data set
2  Example of stemmed comment
3  Example of comment with removed stop words
4  Example of spelling corrected comment
5  Setup for test 1 of Naïve Bayes
6  Setup for test 2 of Naïve Bayes
7  Setup for test 3 of Naïve Bayes
8  Setup for test 4 of Naïve Bayes
9  Benchmark results of the classification algorithms
10 Accuracy results of level 1 experiments
11 Accuracy results of level 2 experiments


1 Introduction

Over the past three decades, society has become more and more dependent on applications and services based on machine learning techniques. These applications include spam filtering [1], recommendation systems [2] and anti-bullying systems for online gaming [3]. The particular implementation largely depends on the type of data that is to be analyzed. This data can consist of video, audio, images, numbers and text.

Since the success of the World Wide Web in the 1990s and the advent of smartphones, mobile data traffic has increased. Not only have smartphone subscriptions increased, but so has data consumption relative to the mobile traffic used for voice calls [4]. It is therefore no surprise that a large amount of the available data in the world is in the form of natural language text, meaning text written in any language spoken by humans [5].

When designing applications which analyze communication between humans, large sets of natural language text data become relevant. Therefore, it is of interest to use machine learning to classify this type of data. Classifying data means deciding to which topic, out of a set of known topics, a sample of data belongs. Text classification is explained in more detail in section 1.1.

With the use of efficient text classification, a workload traditionally performed by humans can be taken over by systems built on machine learning techniques. An example of this is classification of customer feedback data. Almost every business gets some sort of feedback in written form. This feedback needs to be classified to know where to send it and how to compile it. Kotsiantis et al. identify several issues with the traditional approach of human labor [6]. One is that information gets lost in the process. Another is that misclassification is a risk due to the human factor. Furthermore, manual classification is very time consuming.

Text classification is a subfield of machine learning which has been well researched over the past twenty years. This research has provided important information about challenges and commonly used approaches. Among other findings, it has become clear that text data often must be processed in several ways to ensure a good classification [7]. This goes under the name preprocessing, which is discussed further in section 1.1.

Different data sets and classification algorithms react differently to different preprocessing techniques [8]. There exist many different preprocessing techniques and many different data sets with various combinations of characteristics. Therefore, an extensive overview of which preprocessing techniques are appropriate for which data set and algorithm does not exist. This issue sets the basis for the problem described in section 1.2, and in turn the purpose and objectives presented in sections 1.3 and 1.4. Further, section 1.5 discusses the social perspectives of contributing to the field of text classification.

To put this thesis in a practical setting, it is performed in collaboration with the Swedish telecommunications operator Tele2. Tele2 has shared its customer feedback data to see if some of its classification needs can be met with the help of machine learning.

Section 1.6 gives an overview of the methodology typically used for similar projects and in particular for this thesis. Then, section 1.7 presents some delimitations made for the project. Lastly, section 1.8 presents the outline of this report.


1.1 Background

In this section, a general background is presented covering the concepts of machine learning, text classification and preprocessing.

Machine Learning

Machine learning is a subarea of artificial intelligence, which in turn is part of computer science. The aim of machine learning is to give computers the ability to learn on their own and solve different tasks without being explicitly programmed to do so. In most cases, humans are very good at finding patterns in different kinds of data, such as written language [9]. Despite this, humans can miss patterns that are harder to see, and therefore tools based on machine learning techniques can be very useful in, for example, hard classification tasks. With the help of computers which use machine learning techniques, classification can be automated on very large data sets and in some cases achieve better results than humans.

In the area of machine learning, there exist different approaches to how computers can learn. Two frequently used methods are supervised and unsupervised learning. Supervised learning is a machine learning technique where the algorithm attempts to label data samples based on examples of labeled data [6]. This means that the algorithm gets a tuple as input: the object it should classify and the label of the data. This makes it easy to measure how well the algorithm performs. When labeled data is not available, unsupervised learning needs to be used [6]. Unsupervised learning algorithms try to find a function that describes the unlabeled data. In both approaches, many different algorithms can be used.

Since this thesis focuses on text classification, described in the next section, the supervised learning approach is used. This is because supervised learning is a common approach for text classification [7]. In the Technical Background, Chapter 2, supervised learning algorithms are presented.

Text Classification

As briefly mentioned earlier, a large amount of the data in the world is in the form of text [4]. This leads to huge data sets of text, originating from different sources and with different purposes, that need to be interpreted and classified. Due to the human factor, the risk of wrongly classifying data is high, and the task is time consuming [5]. Therefore, text classification and Natural Language Processing (NLP) are of great importance for creating standards and efficient classifications.

One common approach for classifying text data efficiently is supervised learning in combination with NLP.

The purpose of NLP is to preprocess the text and extract valuable information upon which the classification can be based. An extracted piece of information is called a feature. The classification algorithm takes sets of features with corresponding labels to train on. After training is complete, the algorithm takes unlabeled sets of features and attempts to classify them. Trivially, if no preprocessing is done on the data, the features are simply the raw data itself [10].


1.2 Problem

This thesis is done in collaboration with the large Swedish telecommunications operator Tele2. Like many other companies, Tele2 analyzes its customer feedback by hand. This analysis is not standardized in terms of any particular software. As mentioned earlier, this can be problematic as it can result in errors and is time consuming.

This project could benefit Tele2 if ways to improve their current method of customer feedback analysis can be identified. Simultaneously, this project benefits from using real life data. This way, conclusions can be drawn about this type of data set and its challenges, which can be helpful to other organizations that also classify customer feedback manually.

One of the challenges is finding information about which machine learning algorithms to use in combination with which preprocessing techniques for different data sets. Without this information, companies might have a harder time efficiently implementing machine learning techniques in their business to handle their customer feedback.

1.3 Purpose

The purpose of this thesis is to answer the question:

How will the performance of classification algorithms be affected by using different techniques to preprocess customer feedback data?

It is intended to contribute to the general overview of which classification algorithms in combination with which preprocessing techniques increase classification performance for different types of text data.

1.4 Objectives

To be able to fulfill the purpose of this thesis, three objectives need to be reached.

These are implemented techniques, benchmark results and experiment results.

The implementation part aims at producing working implementations of the preprocessing techniques and the classification algorithms which are to be investigated through experiments.

Once everything is implemented, a benchmark result needs to be generated for each algorithm. These values are the performance of each algorithm when no preprocessing techniques are applied. The purpose of benchmarks is to have values to compare the results of the experiments to.

Lastly, some experiments must be conducted on the different combinations of preprocessing techniques and classification algorithms. These experiments generate the results upon which an attempt to answer the problem of this thesis, stated in 1.3, will be based.

1.5 Sustainability and Ethics

This thesis focuses on the area of text classification and aims to be a part of a positive development in the field. Therefore, it is important to understand how development in this area could affect society with regard to both ethical and sustainability issues.

By doing the mentioned research, this project can contribute to automation in parts of businesses which rely on text data. This can be generalized to a positive sustainable development where resources in a company are better managed. More effective classification can also contribute to social sustainability if efficient text analysis can be used to understand which problems exist and how important they are.

Another important aspect is the ethical issues this research can raise. Research like that presented in this thesis could contribute to automating tasks which today are performed manually. This could jeopardize jobs and leave people unemployed, which could have negative effects on social and economic sustainability.

With these aspects of ethical issues and social impact presented, it is important to understand that this project on its own is at far too basic a level to have any significant impact on society.

1.6 Methodology

To answer the research question defined in section 1.3, this thesis uses a combination of different research techniques and approaches. More specifically, an inductive research approach is used in combination with quantitative research methods. In particular, a case study including experiments is used as the core method for this thesis. The experiments are preceded by exploratory data analysis. The motivations behind the method choices for data extraction and analysis are discussed in the two following sections.

Data Extraction

The motivation for using a case study in combination with experiments to extract the data stems from the purpose of this project. This thesis aims to contribute to a general overview in the field of text classification and setups for different data sets. Given the time limit and the extensive scope of this area, the thesis needs to be delimited. Therefore, a case study is an appropriate way to go.

Data Analysis

This thesis uses exploratory data analysis, which aims to find relationships in the data in order to set up hypotheses to be tested through experiments. The experiments are thus executed in an iterative way, alternating with exploratory data analysis to dynamically generate hypotheses. The reason exploratory data analysis is chosen in favor of confirmatory data analysis is that it is unclear which hypotheses could be formed before some initial hypotheses are tested. Testing all possible hypotheses which could arise is out of scope for this project.

1.7 Delimitations

In this thesis, only three classification algorithms are investigated through experiments due to time limitations. These are chosen based on a literature study regarding which methods are suitable for text data. The chosen algorithms are Naïve Bayes, Support Vector Machine and Decision Tree, which are all commonly applied for text classification through supervised learning [5].

To stay within time limits, it is decided that only positive hypotheses are to be tested regarding the improvement of classification performance. This means that if it is suspected that a certain combination of preprocessing techniques might increase performance for a particular algorithm, it is tested. If it is suspected that the performance might decrease, it is not tested.

1.8 Thesis Outline

This section provides information about the outline of the report. In the following chapter, a technical background is given, covering the algorithms and preprocessing techniques used in this thesis. Performance measurements for classification are also presented. In Chapter 3, the method by which the project is conducted is introduced. In Chapter 4, the results are presented. In the final chapter, a discussion of the results, conclusions and suggestions for future work are presented.

2 Technical Background

The aim of this background is to provide necessary information and context within the area of text classification. In section 2.1, the chosen classification algorithms are presented. Section 2.2 presents common performance measurements for classification algorithms. In section 2.3, natural language processing techniques are described.

2.1 Classification Algorithms

There are many different machine learning algorithms used for text classification through supervised learning. In this thesis, the scope is limited to three commonly used algorithms, namely Naïve Bayes, Support Vector Machine and Decision Tree. In the following sections, introductions to these algorithms are given.

2.1.1 Naïve Bayes

The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem, named after Thomas Bayes, which is defined as follows:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A) \cdot P(B \mid A)}{P(B)} \tag{1}$$

where P(A|B) is the probability of event A happening given event B, and P(A) and P(B) are the probabilities of events A and B, respectively, happening independently [11]. The theorem is used to find the class A with the highest probability given the attribute B, where B can be a vector of multiple features extracted from the data sample.

In an NB classifier, the assumption of strong independence between the features is made. This means that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is made to simplify the computation, and this is why the classifier is called naïve. Due to this, equation (1) can be simplified to the following formula:

$$P(C_k \mid x_1, \ldots, x_n) \propto P(C_k) \cdot \prod_{i=1}^{n} P(x_i \mid C_k) \tag{2}$$

where C_k is a given class out of k classes and x_1, ..., x_n are the features.

The NB algorithm is well researched in regards to its ability to classify data. Given the independence assumption, it performs surprisingly well even when there exist dependencies between features [12]. The NB algorithm has also been researched in regards to how well it can handle text classification problems. This research has found that, compared to other algorithms such as SVM, the NB algorithm performs worse on text classification tasks [13] [14]. It has also been shown that the algorithm is robust to isolated noise points and irrelevant features [15].
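To make this concrete, the following is a minimal sketch of a multinomial NB text classifier using scikit-learn, the implementation used later in this thesis (section 3.5). The bag-of-words CountVectorizer features are an illustrative assumption; the thesis uses its own frequency-based feature extraction, described in section 4.1.

```python
# Minimal sketch: multinomial Naive Bayes for text classification.
# CountVectorizer is an illustrative choice, not the thesis's own
# feature extraction (see section 4.1).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["det fungerar inte att logga in", "riktigt bra service"]
train_labels = [1, 2]  # class labels, as in Table 1

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # word-count features

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["riktigt bra"])
print(clf.predict(X_test))  # -> [2]
```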

2.1.2 Support Vector Machine

Support Vector Machine is a supervised learning model which can be used for both classification and regression. The model represents data samples as points in space, mapped so that samples of different categories are separated by a gap. The model then predicts new samples depending on which side of the gap the samples fall on. The separating boundary is called a hyperplane; SVM creates one or several hyperplanes to classify data samples.

In Figure 1, classification by an SVM is illustrated. In this scenario, one hyperplane is used to separate the data into two categories, and it is done with a linear classification. In practice, a linear classification is often hard to achieve due to noisy data which makes it difficult to find a linear relation [16]. To handle this, different approaches are used, for instance soft margins, slack variables and kernel functions. Soft margins and slack variables allow the model to relax the margins, which reduces the complexity of the model.

When the data is affected by errors and variation, soft margins and slack variables return poor results. In such cases, kernel functions are used [16].

Kernel functions make it possible to map a finite-dimensional space into a much higher dimensional space where a separation of the data samples should be easier. Kernel functions enable the model to operate in this higher dimensional space without actually computing the coordinates of the data samples in it. Instead, the inner products between the images of all pairs of data points in the feature space are computed, which is computationally cheaper.

Support Vector Machines are suitable for text classification for several reasons, one of which is that most text classification problems are linearly separable [17].
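As an illustration, here is a minimal sketch of a linear-kernel SVM classifier with scikit-learn's SVC, the implementation used in section 3.5. The feature extraction and the soft-margin parameter C=1.0 (scikit-learn's default) are illustrative assumptions.

```python
# Minimal sketch: linear-kernel SVM for text classification. A smaller C
# gives a softer margin (more slack allowed); a non-linear kernel such as
# "rbf" could be substituted for data that is not linearly separable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ["det fungerar inte att logga in", "riktigt bra service"]
labels = [1, 2]

X = CountVectorizer().fit_transform(texts)
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, labels)
```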

2.1.3 Decision Tree

In Decision Tree learning, a Decision Tree is used as a predictive model. A DT is a tree-shaped model where each branch represents an observation about an item and each leaf node represents a target value for that item. DTs can be used for many tasks, among others data mining, classification and regression, but also reinforcement learning and clustering.


Figure 1: Linearly separable data classified by SVM

When using DTs for classification, each leaf represents a class label and the branches represent conjunctions of feature values that lead to specific class labels. The goal of using a DT for classification is to create a model which predicts the value of a target variable based on other input variables.

The challenges in DT learning are how to choose the features, what conditions to use for splitting the tree and when to stop splitting. When growing a tree, the aim is to find as small a tree as possible to avoid overfitting. Overfitting means building a too complex model which is good at predicting data from the training set but poor at generalization, which results in a higher error rate on the test set [18]. To avoid overfitting, the tree can be limited in height during construction. The pruning technique can also be used. Pruning aims to remove branches from the tree which use less informative features and therefore create a more complex tree than needed [18].

In relation to text classification, the DT algorithm has been shown to generate good results [14]. Text classification tasks can involve huge feature spaces [17]. As mentioned earlier in this section, some actions to avoid overfitting of the model might be needed in order to handle this.
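As a sketch of the overfitting countermeasure mentioned above, scikit-learn's DecisionTreeClassifier (the implementation used in section 3.5) can cap the tree height during construction; the max_depth value here is an illustrative assumption, not a setting from the thesis.

```python
# Minimal sketch: a Decision Tree classifier with its height capped to
# counter overfitting on a large feature space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

texts = ["det fungerar inte att logga in", "riktigt bra service"]
labels = [1, 2]

X = CountVectorizer().fit_transform(texts)
clf = DecisionTreeClassifier(max_depth=10)  # limit height during construction
clf.fit(X, labels)
```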

2.2 Performance Measurements

When comparing the performance of classification algorithms, a couple of different measurements are used. Depending on the goal of the classification and the data set, some measurements can be more appropriate than others.

The most straightforward approach to analyzing the performance of classifiers is a confusion matrix [19]. For binary classification, this is a two-by-two matrix consisting of four numbers representing data samples categorized correctly or incorrectly by the classifier. These four categories are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). Using these numbers in different combinations creates different performance measurements. Some of them are defined in the following equations:

$$\text{Sensitivity/Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

When using classification algorithms, the aim is to minimize wrongly classified samples [19]. In the presented notation, this means minimizing FN and FP. However, this is not always possible, and it can therefore be of interest to prioritize minimizing one of them. Which one to minimize depends on the problem at hand.

The accuracy measurement, described in equation (4), measures the total ratio of correctly classified data samples. This means that it is of interest how well the classifier predicts both positive and negative samples; in our notation, both TP and TN matter. Accuracy is a poor measurement to use when the data set is imbalanced [19] [20]. The reason is that if one class holds a majority of the data samples, the TN count will be large and inflate the accuracy, making it a misleading measure of the classifier's performance. However, an imbalanced data set can be modified so that the consequences of imbalance are mitigated. One approach is over-sampling, which means that copies of samples from the minority class or classes are added to even out the ratio between the classes [21].

A side effect of over-sampling is a higher risk of overfitting the model [21].
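As an illustration of over-sampling by duplication, a minimal sketch follows; the function name and the random duplication strategy are assumptions, not necessarily the exact procedure used in this thesis (section 4.1).

```python
# Minimal sketch: duplicate minority-class samples until every class
# matches the size of the largest class.
import random
from collections import defaultdict

def oversample(samples, labels):
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out = []
    for label, items in by_class.items():
        extra = [random.choice(items) for _ in range(target - len(items))]
        out.extend((sample, label) for sample in items + extra)
    random.shuffle(out)
    return [s for s, _ in out], [l for _, l in out]
```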

The two measurements recall and precision, defined in equations (3) and (5) respectively, are proposed as good measurements to use when the data set is imbalanced [20]. In these measurements, the TN count is not used in the calculations, and the focus lies on the TP count.

Another performance measurement is time. Time could be of interest when trying to optimize the algorithms themselves, rather than just the classification.
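For reference, the measurements in equations (3)-(5) computed directly from the four confusion-matrix counts; the example numbers are made up.

```python
# Minimal sketch: recall, accuracy and precision from TP, TN, FP, FN.
def recall(tp, fn):
    return tp / float(tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / float(tp + tn + fp + fn)

def precision(tp, fp):
    return tp / float(tp + fp)

# Example with hypothetical counts: 70 TP, 20 TN, 5 FP, 5 FN.
print(accuracy(70, 20, 5, 5))  # -> 0.9
print(precision(70, 5))        # -> ~0.933
print(recall(70, 5))           # -> ~0.933
```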

2.3 Natural Language Processing

All languages spoken by humans are natural languages. NLP is the area within artificial intelligence concerned with allowing machines to successfully process large sets of natural language [22]. Natural language comes in two forms: speech and written text. For the purpose of this report, only the latter is discussed.

There are many NLP techniques used for preprocessing data for text classification and other machine learning tasks. Which technique or combination of techniques to use may depend on the data set and the task. Typically, several techniques are applied to a data set in a pipeline fashion [22]. Some NLP techniques are presented in this section.

Stemming and Lemmatization

Stemming is the process of stripping an inflected word to its root by removing affixes. Lemmatization is similar to stemming, but only removes affixes if the resulting word is in the dictionary [23]. Thus, lemmatization guarantees that the resulting word is a legitimate word while stemming does not.


Lemmatization may fail if removing affixes does not generate a legitimate word. For example, given the word “lying”, a lemmatization function would find the word “ly” which, on checking the dictionary, would be discarded. A stemmer, on the other hand, would be satisfied with “ly”. Furthermore, lemmatization is generally less time efficient than stemming since it requires checking the dictionary.
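A minimal sketch of the contrast, using NLTK: the Swedish Snowball stemmer is the one used later in this thesis (section 3.5), while the WordNet lemmatizer shown for comparison works for English only and requires the 'wordnet' corpus. Unlike the simple affix removal described above, WordNet's lemmatizer handles “lying” through its exception lists and returns the dictionary word “lie”.

```python
# Minimal sketch: stemming (Swedish) vs. lemmatization (English).
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

stemmer = SnowballStemmer("swedish")
print(stemmer.stem("hjälpa"))  # -> "hjälp", as in Table 2

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("lying", pos="v"))  # -> "lie", a dictionary word
```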

Sentence Breaking and Word Segmentation

Sentence breaking deals with finding the sentence boundaries in a text. These boundaries are typically marked by different types of punctuation. However, punctuation does not necessarily mark a boundary. A period can, for example, be part of an abbreviation such as “U.S.”.

Closely related to sentence breaking is word segmentation, which deals with separating text into words. Techniques for word segmentation are not often discussed for Western languages, since words there are typically separated by spaces, making word segmentation a straightforward task [23].

Sentence breaking and word segmentation are often called sentence tokenization and word tokenization.
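A minimal sketch of both, using NLTK's tokenizers (section 3.5); this requires the 'punkt' sentence model, and the example text reuses the comments from Table 1.

```python
# Minimal sketch: sentence tokenization (sentence breaking) and word
# tokenization (word segmentation) with NLTK.
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Det fungerar inte att logga in. Riktigt bra service!"
print(sent_tokenize(text, language="swedish"))  # two sentences
print(word_tokenize(text, language="swedish"))  # list of word tokens
```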

Tagging

Tagging is the process of categorizing words by attaching a tag to them. A popular type of tagging is part-of-speech tagging, or POS-tagging. POS-tagging provides useful information for data sets with grammatically correct sentences but is less useful for texts with many grammatical errors [24].

Representing tagged tokens can be done using tuples which include a token (word) and a tag. Some words may belong to several parts of speech, such as the word “book”, which can be both a noun and a verb depending on the context.
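A minimal sketch with NLTK's English POS-tagger (requires the 'averaged_perceptron_tagger' model); the tags shown in the comment are indicative, not guaranteed output.

```python
# Minimal sketch: POS-tagging returns (token, tag) tuples, so "book" can
# receive different tags depending on context.
from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("Book the book")))
# e.g. [('Book', 'VB'), ('the', 'DT'), ('book', 'NN')]
```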

Spelling Correction

Correcting misspelled words might not be useful when dealing with text which has been proofread, considering that it is time costly to spell check every word. When working with informal text, however, spelling correction could prove profitable.

A popular approach for spelling correction is edit distance [25]. Edit distance is a way of comparing two words to each other by counting the minimum number of operations it would take to transform one to the other. These operations are insertions, deletions and substitutions of letters. Using edit distance to correct a misspelled word involves finding the edit distance of the misspelled word to words in a dictionary and returning the word with minimum edit distance.

Another approach is similarity key [25]. The idea is to map every word to a key such that similarly spelled words have identical keys. For each spelling error, a key will be computed. Thus, the key of a misspelled word will point to all similarly spelled words in a dictionary.

Both spelling correction approaches will replace a misspelled word with a similar word from the dictionary. They thus only attempt to replace words which are not found in the dictionary, without taking the context into account. If a word has been misspelled in such a way that it has become another legitimate word (e.g. “in” and “on”), the algorithms will not catch this mistake.
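A minimal sketch of edit-distance spelling correction as described above, using NLTK's edit_distance. The tiny dictionary is an illustrative stand-in; the thesis uses “Lars Aronsson svenska ordlista” (section 3.5).

```python
# Minimal sketch: replace an unknown word with the dictionary word at
# minimum edit distance. Context is not taken into account.
from nltk import edit_distance

dictionary = {"fungerar", "inte", "logga", "service", "hjälpa"}

def correct(word):
    if word in dictionary:
        return word  # only words missing from the dictionary are touched
    return min(dictionary, key=lambda entry: edit_distance(word, entry))

print(correct("funkerar"))  # -> "fungerar" (one substitution away)
```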

Removing Stop Words and other Normalization

Stop words are commonly used words which are considered useless to the classifier. In English, such words include “the”, “a” and “an”. Removing them can be done by comparing each word in the data set to a list of stop words.

Sometimes accents, punctuation and other symbols are also removed.

For some data sets, removing common stop words may result in loss of informative features. In such cases, dynamically finding which words are stop words for the particular data set can be better than using common stop words [26].

There are other common techniques used to normalize text in addition to those mentioned above. For instance, a common practice is to set all letters to lower case so that the words “Banana” and “banana” are treated the same way by the classifier. Furthermore, abbreviations are sometimes expanded so that “U.S.” is treated the same as “United States” [25].
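A minimal sketch of these two normalization steps with NLTK's Swedish stop word list (requires the 'stopwords' corpus); the thesis's own list, with punctuation added, is given in Appendix A.

```python
# Minimal sketch: lower-casing followed by stop word removal.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = set(stopwords.words("swedish"))

tokens = word_tokenize("Får ju inte mitt topup att funka", language="swedish")
print([t.lower() for t in tokens if t.lower() not in stop])
# -> ['får', 'topup', 'funka']
```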

3 Method

This chapter presents the steps taken to produce the results presented in Chapter 4. Section 3.1 motivates the research method chosen for this thesis. Next, the data set used throughout the project is presented in section 3.2. Section 3.3 describes the preparatory work which was done in order to generate benchmark results for the experiments. In section 3.4, the particular experimental method used in this thesis is presented and discussed. The setup for the experiments is presented in section 3.5.

3.1 Research Method

One can go about answering the research question, stated in section 1.3, in several ways. Here, a literature study and a case study are discussed.

A literature study could have been an option for such a project. However, it was observed at the beginning of the project, when a general literature review was done, that this approach would probably not be fruitful due to the huge number of text data sets, preprocessing techniques and classification algorithms. A literature study was therefore discarded as an option.

A case study was considered a more suitable research method for this project since it only requires a limited set of combinations of techniques to be tested. The method also allows the project to fulfill its purpose of contributing to the general overview of which classification algorithms in combination with which preprocessing techniques increase classification performance for different types of text data. Therefore, a case study was chosen. The subject of the study is the customer feedback data set from Tele2, presented in the following section.


3.2 Data Limitations

The data set used in this thesis is real life customer feedback from Tele2 customers. The data set consists of approximately 9000 data samples in the form shown in Table 1. The label represents the correct class that the comment should be classified as. All data was classified into nine different classes with assistance from Tele2.

The comments are written in Swedish and have no restrictions on symbols or length. The data had not been processed in any way before this project and therefore contains misspelled words as well as irrelevant words and symbols. The majority of the data samples have fewer than 10 words and fewer than two sentences.

Table 1: Format of the data set

Label   Comment
1       Det fungerar inte att logga in
2       Riktigt bra service!

3.3 Preparatory Work

Before experiments could be conducted, it was necessary to do some preparatory work on the data set. The goal of this work was to achieve acceptable benchmark results for all three algorithms when no preprocessing techniques were applied.

The purpose of these benchmark results was to have some values to compare the results of the experiments to in order to conclude whether the preprocessing techniques improved classification performance.

It is non-trivial to define what an acceptable benchmark result is. For the purposes of this thesis, a benchmark result did not need to be considered a good classification in itself, as long as it clearly verified that the classification is better than random. For example, if two classes are being classified, a random classification would classify around 50% of the instances correctly; 50% would therefore not be an acceptable benchmark result. On the other hand, if ten classes are being classified, a random classification would classify around 10% of the instances correctly. In the latter case, 50% would be an acceptable result since the classification is beyond doubt better than random.

3.4 Experiments

A number of experiments were conducted to test which preprocessing techniques improve performance for each of the three classification algorithms. In section 3.4.1, the motivation behind the choice of which classification algorithms to investigate is presented. In section 3.4.2, the preprocessing techniques which were tested, as well as the motivation behind that choice, are given. The method by which the experiments were conducted is presented in section 3.4.3. The performance measurement by which the experiments were evaluated is presented in section 3.4.4.


3.4.1 Classification Algorithms to be Evaluated

As mentioned in section 1.7, some delimitations had to be made. It was decided that three algorithms for text classification would be investigated. The choice of algorithms was based on a literature study regarding methods commonly used for text classification tasks. The chosen algorithms were Naïve Bayes, Support Vector Machine and Decision Tree. These were chosen because they are commonly applied for text classification through supervised learning [5].

3.4.2 Preprocessing Techniques to be Evaluated

Several NLP techniques for preprocessing text were presented in section 2.3. Out of these, some were found appropriate for the data set described in section 3.2. The fact that most data samples are short makes sentence breaking unnecessary, since no more than one sentence would be found for the typical item. The lack of grammatically correct sentences in most data samples means POS-tagging would not give much information, as mentioned in section 2.3. Furthermore, the many misspelled words interfere with the algorithms' ability to draw conclusions about the occurrence of a certain word, suggesting that spelling correction could be beneficial.

Based on these observations, it was found that there is reason to suspect that stemming, removing stop words and spelling correction could improve the classification performance. These were therefore the three preprocessing techniques tested in different combinations in the experiments.

The preprocessing also involved word tokenization and making all letters lower case. These techniques were held constant throughout all experiments because it was not suspected that they could have a negative effect on the classification performance.

The reason stemming was chosen in favor of lemmatization is that lemmatization requires a dictionary for the given language, while stemming only requires a list of affixes. Lemmatization implementations are readily available for English text, but not for Swedish. Implementing lemmatization for Swedish text from scratch is out of scope for this thesis due to the existing workload and time limitations.

3.4.3 Experiments to be Conducted

For each of the three classification algorithms, the steps depicted in Figure 2 were followed. The first step involved applying only one of the chosen preprocessing techniques.

Once all techniques had been tested individually, combinations of two techniques were tested. Permutations needed to be taken into account because the order in which preprocessing techniques are applied can alter the resulting data. This difference could lead to different classification performance, which is precisely what is being researched.

After the combinations of two techniques had been tested, the results were ranked. A result was ranked high if it was greater than or equal to the benchmark result for the classification algorithm in question, and low otherwise. If a possible combination of three could be formed from the highly ranked combinations of two, it was tested.

Figure 2: Method visualized. (Flowchart: Start → get benchmark values (preparatory work) → test techniques separately (Level 1 experiments) → test combinations of 2 and rank them (Level 2 experiments) → if combination(s) of 3 found, test combinations of 3 (Level 3 experiments) → End.)

A possible combination of three means that two pairs of techniques can form a pipeline of three techniques. For example, the pairs {spelling correction, stemming} and {stemming, removing stop words} can form the combination {spelling correction, stemming, removing stop words}.

The reason an iterative approach was chosen in favor of conducting experiments for all permutations of three for each algorithm is that, as mentioned in section 3.1, it was decided due to time limitations that only positive hypotheses would be tested. It is unmotivated to form the hypothesis that a pipeline built from two badly performing combinations would result in a good combination.
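A minimal sketch of this enumeration; the technique names are placeholders for the implementations described in section 3.5.

```python
# Minimal sketch: Level 1 tests each technique alone; Level 2 tests all
# ordered pairs, since the order of application can alter the data.
from itertools import permutations

techniques = ["spelling_correction", "remove_stop_words", "stemming"]

level_1 = [(t,) for t in techniques]          # 3 single-technique runs
level_2 = list(permutations(techniques, 2))   # 6 ordered pairs

for pipeline in level_1 + level_2:
    print(pipeline)  # apply in this order, then train and measure accuracy
```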

3.4.4 Evaluation Method

Given the problem of this thesis, accuracy was chosen as the performance measurement to be used. The motivation behind this choice is that both TP and TN are of interest. It is not of interest for this thesis how well the classifiers predict TP or TN individually. Furthermore, accuracy is often used in other studies of machine learning algorithms [6] [27].

The reason time measurement was discarded also follows from the problem statement. It is not relevant for this research to compare how fast the algorithms run, since implementing fast algorithms was not considered a goal.

3.5 Setup

In order to conduct the experiments, implementations of the preprocessing techniques and the three classification algorithms were required. All programs for the experiments were written in Python 2.7.14. The Natural Language Toolkit (NLTK) was used to implement all preprocessing techniques. The implementations of the classification algorithms were taken from scikit-learn. In the following two sections, the libraries which were used from these platforms are presented, as well as other implementation details.

Preprocessing and NLTK

NLTK is a platform for building Python programs that work with natural language data. It provides a variety of libraries for preprocessing. For this project, the Snowball stemmer from the .stem library was used, as well as functions from the .tokenize library for tokenization. Table 2 shows an example customer feedback comment before and after stemming.

Table 2: Example of stemmed comment

Before   Får ju inte mitt topup att funka... Så ni kan ju hjälpa till med det.
After    får ju int mitt topup att funk... så ni kan ju hjälp till med det.

The stop word removal function was implemented using a list of Swedish stop words from the NLTK corpus, with punctuation added to it. The exact list of stop words used can be found in Appendix A. Table 3 shows an example of removing stop words from a comment.

Table 3: Example of comment with removed stop words

Before   Får ju inte mitt topup att funka... Så ni kan ju hjälpa till med det.
After    får topup funka kan hjälpa

NLTK does not include any spelling correction library. A Swedish spelling correction function was therefore implemented according to the edit distance technique described in section 2.3. The dictionary used for the spelling correction was “Lars Aronsson svenska ordlista” [28]. An example of the function at work is presented in Table 4.

Table 4: Example of spelling corrected comment

Before   Får ju inte mitt topup att funka... Så ni kan ju hjälpa till med det.
After    får ju inte mitt popup att funka... så ni kan ju hjälpa till med det.

As can be seen in Tables 2-4, all letters are turned to lower case. This is done with the help of the .tokenize library from NLTK as well.
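Putting the pieces together, here is a minimal sketch of a preprocessing pipeline in the spirit of this setup: tokenization and lower-casing (held constant in all experiments), followed here by stop word removal and then spelling correction. The correct function is the hypothetical edit-distance corrector sketched in section 2.3; the exact composition varied per experiment.

```python
# Minimal sketch: one possible preprocessing pipeline for a comment.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = set(stopwords.words("swedish"))

def preprocess(comment, correct):
    tokens = [t.lower() for t in word_tokenize(comment, language="swedish")]
    tokens = [t for t in tokens if t not in stop]  # remove stop words
    return [correct(t) for t in tokens]            # then spelling correction
```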

Classification Algorithms and Scikit-learn

Scikit-learn is an open source platform for machine learning in Python. It includes libraries for several common machine learning algorithms, many of which can be used for tasks other than classification. In this project, only the classification implementations were used.

For the Naïve Bayes algorithm, MultinomialNB from the .naive_bayes library was used. The motivation behind this choice is that the multinomial version of Naïve Bayes has been shown to be the version of the algorithm which performs best on text classification tasks [13].

For the Support Vector Machine algorithm, scikit-learn's SVC implementation from the .svm library was used. It was used with a linear kernel due to the observation presented in section 2.1.2 that text categorization problems often are linearly separable.

For the decision tree algorithm, the DecisionTreeClassifier from the .tree library was used.

For calculating the accuracy of the predictions on the test set, the .score method from scikit-learn was used. The training set was 80% of the data set and the test set was 20%, which is a common ratio between training and test sets.
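A minimal sketch of the split and scoring; the module path sklearn.model_selection is the modern one, and the toy texts and labels are illustrative.

```python
# Minimal sketch: 80/20 train/test split and accuracy via .score().
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["det fungerar inte", "bra service", "kan inte logga in", "riktigt bra"]
labels = [1, 2, 1, 2]

X = CountVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)  # 20% held out for testing

clf = MultinomialNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # fraction of test samples classified correctly
```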

In supervised learning tasks, cross-validation is often used to indicate how well the model generalizes to an unseen data set [29]. If the model's performance drops when predicting the unseen validation set, this indicates overfitting, which means that the analysis corresponds too closely to the particular training set [30]. Cross-validation requires a validation set which can be tested while tuning the classifier, before the test set is predicted. In this project, the data set was only split into a training set and a test set. Since the research question of this thesis is not related to optimization of any predictor, and due to limited resources, this is not considered a central issue.
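Although cross-validation was not used in this project, a minimal sketch of what it would look like with scikit-learn is shown for reference; two folds are used only because the toy data set is tiny.

```python
# Minimal sketch: k-fold cross-validation returns one accuracy per fold.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

texts = ["det fungerar inte", "bra service", "kan inte logga in",
         "riktigt bra", "inget fungerar"]
labels = [1, 2, 1, 2, 1]

X = CountVectorizer().fit_transform(texts)
print(cross_val_score(MultinomialNB(), X, labels, cv=2))
```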


4 Results

In this chapter, all results generated from the work described in Chapter 3 are presented. In section 4.1, results from the preparatory work on the data set can be found. In section 4.2, the results of the experiments are presented. The results of the data extraction and data analysis are woven together due to the iterative nature of the method used in this thesis, presented in Chapter 3.

4.1 Preparatory Work

In this section, the results of the preparatory work which needed to be done in order to get benchmark results for the classifiers are presented. The work is divided into five tests, each presented in one of the following five sections.

Test 1: First Feature Extraction Method

In Table 5, the setup and result of Test 1 are presented. In the test, features are extracted by choosing the 40 most common words in each class and discarding the words which are common in multiple classes. The goal of this is to keep the words that are unique to each of the classes. The value 40 is chosen on an experimental basis.

The result of this setup is an accuracy of 38.2%. The question of whether feature extraction could be done in a better way arises and sets the base for Test 2.

Table 5: Setup for test 1 of Naïve Bayes

Classifier    Training instances   Test instances   Number of classes   Accuracy
Naïve Bayes   3918                 760              9                   38.2%
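A minimal sketch of this first feature extraction method, read back from the description above; the function name and data layout are hypothetical.

```python
# Minimal sketch: keep the n most common words per class, then discard
# words that are common to more than one class.
from collections import Counter

def extract_features(comments_by_class, n=40):
    # comments_by_class: {class_label: [list of tokenized comments]}
    top = {label: {word for word, _ in Counter(
               w for comment in comments for w in comment).most_common(n)}
           for label, comments in comments_by_class.items()}
    features = {}
    for label, words in top.items():
        others = set().union(*(v for k, v in top.items() if k != label))
        features[label] = words - others  # words unique to this class
    return features
```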

Test 2: Second Feature Extraction Method

In Test 2, the same setup as in Test 1 is used. In this test, however, a word is extracted as a feature if it makes up more than 0.13% of the words in its class, and if more than 40% of its appearances are in that class relative to the other classes.

The value 0.13% is chosen on an experimental basis. It was found that a higher value gives very few features, which is thought to affect the classifiers negatively. This check is implemented to ensure that different amounts of words can be selected from each class. The second check is implemented so that words which are evenly distributed between classes are discarded.

In Table 6, the setup and accuracy result for Test 2 are presented. It shows that the classifier performs better with the new method for extracting features. The expectation was that the accuracy would increase more; it is speculated that an overlap between the classes exists. This speculation sets the base for Test 3.


Table 6: Setup for test 2 of Naïve Bayes

Classifier    Training instances   Test instances   Number of classes   Accuracy
Naïve Bayes   3918                 760              9                   41.1%

Test 3: Limiting the Number of Classes

In the previous test, the accuracy increased, but not as much as expected. An overlap between the classes is suspected, and this test is limited to four classes. The classes are chosen in such a way that an overlap is unlikely. New training and test sets are assembled. The new setup is specified in Table 7, along with the resulting accuracy.

The test shows an increase in accuracy of over 20 percentage points compared to the previous test. It is therefore reasoned that there is a high probability that an overlap was the reason for the low accuracy in the previous test. With an overlap, the classifiers would not be able to extract features which are unique or highly informative for each class.

Since a training set of 2000 instances is considered small, the number of samples is increased, leading to Test 4.

Table 7: Setup for test 3 of Naïve Bayes

Classifier    Training instances   Test instances   Number of classes   Accuracy
Naïve Bayes   2000                 400              4                   64.5%

Test 4: Increasing the Number of Data Samples

In the following test, the goal is to see if the classifier previously received too few samples to train on. New training and test sets are formed. Since the original distribution of classes in the data set is uneven, over-sampling of two of the classes is done: samples from two of the smaller classes are copied so that the distribution in the new training set becomes balanced. An important thing to note is that the samples in the test set are not included in the training set. With the new data sets the classifier performs better, as shown in Table 8. The level of accuracy is found to be acceptable for starting to test the other classifiers, which leads to Test 5.

Table 8: Setup for test 4 of Naïve Bayes

Classifier    Training instances   Test instances   Number of classes   Accuracy
Naïve Bayes   4030                 655              4                   74.6%

Test 5: Benchmark Results

In this test, all three classification algorithms are tested with the same setup as in the previous test. The goal is to check that all algorithms generate an acceptable result which can be used as a benchmark for the investigation of the effects of preprocessing.


In Table 9, the accuracy of the classifiers is presented. These are reasoned to be acceptable results to base the rest of the research on.

Table 9: Benchmark results of the classification algorithms

                   Naïve Bayes   Support Vector Machine   Decision Tree
No preprocessing   74.6%         74.0%                    68.2%

4.2 Experiments

In this section, the results of the experiments on different preprocessing techniques and classification algorithms are presented. The section is divided into three subsections which individually present the results of each experimental level depicted in Figure 2.

Level 1 Experiments

In this section, the results of the Level 1 experiments are presented. For each of the preprocessing techniques, it is tested how the accuracy of the classification algorithms is affected. The results are presented in Table 10.

When removing stop words and spelling correction are applied separately, all classification algorithms perform better than the benchmark results shown in Table 9. Stemming decreases the accuracy of all algorithms.

Table 10: Accuracy results of level 1 experiments

Preprocessing method   Naïve Bayes   Support Vector Machine   Decision Tree
Remove stop words      77.4%         75.1%                    72.5%
Spelling correction    76.5%         74.9%                    69.0%
Stemming               67.6%         68.2%                    64.7%

Level 2 Experiments

In this section, the results of the Level 2 experiments are presented. The preprocessing techniques presented in Table 10 are combined and all permutations are tested. In Table 11, the results of these combinations are presented for each algorithm.

The ranking shows that there are two combinations which increase the result of all classifiers in relation to the benchmark results and the results from the Level 1 experiments: the two permutations of removing stop words and spelling correction, shown in the first and fourth rows of Table 11. One exception is the combination {spelling correction, remove stop words} for Support Vector Machine. This combination is still better than the benchmark result, but not better than applying the two techniques separately.

Level 3 Experiments

It is found in the Level 2 experiments that there are only two combinations which increase the accuracy of the classifiers beyond the benchmark results.


Table 11: Accuracy results of level 2 experiments

Preprocessing methods                       Naïve Bayes   Support Vector Machine   Decision Tree
Remove stop words and Spelling correction   78.4%         75.4%                    73.9%
Remove stop words and Stemming              66.5%         63.6%                    60.4%
Spelling correction and Stemming            67.7%         68.3%                    66.7%
Spelling correction and Remove stop words   77.8%         74.6%                    73.4%
Stemming and Spelling correction            67.4%         67.9%                    66.8%
Stemming and Remove stop words              66.5%         63.5%                    62.1%

These combinations are the permutations of spelling correction and removing stop words.

Since all possible combinations of three preprocessing techniques include stemming, which has been shown to decrease the classification accuracy for all classifiers, it is not suspected that any such combination will increase the accuracy. In accordance with the method presented in section 3.4.3, Experiments to be Conducted, it is therefore reasoned that no further experiments need to be conducted.

5 Discussion and Conclusions

In this chapter, the final part of the thesis is presented. First, section 5.1 discusses the results and the choices made in this thesis. In section 5.2, the conclusions of the research are presented. Finally, in section 5.3, recommendations for future work are suggested.

5.1 Discussion

In this section, the outcomes of the project are discussed. First, the results of the experiments conducted are discussed. Then, the choices of methodology for this thesis are reflected upon.

Results

The results show that the Naïve Bayes algorithm outperforms the other classifiers in a majority of the experiments. The difference in accuracy is not extensive, but as stated in section 2.1.1, Naïve Bayes has been shown to perform worse than Support Vector Machine on text classification tasks. This might be related to the fact that no optimization of the algorithms was done in this project. Since it is out of scope for this thesis to optimize the algorithms, this is left for future work.

Table 11 reveals that the two permutations of removing stop words and spelling correction are the combinations which generate the highest accuracy for the algorithms Naïve Bayes and Decision Tree. The most successful combination is {Removing stop words, Spelling correction}; this is also true for Support Vector Machine. However, SVM performs worse with the combination {Spelling correction, Remove stop words} than when the techniques are applied separately.

The difference in classification accuracy between applying {Removing stop words, Spelling correction} and {Spelling correction, Removing stop words} is small, but it is clear that removing stop words before doing spelling correction benefits the classification more than the other way around. This is true for all classifiers. Since this thesis does not investigate why this is the case, only speculations about the reason can be offered. Stop words are often short words. It might be that short misspelled words or abbreviations are corrected to common stop words and therefore removed, even though they are actually informative features. This could also have the effect that the possible feature space gets smaller, and thus the possibility for other features to be extracted could increase.

As can be seen in Table 10, using stemming to preprocess the data results in the worst classification accuracy of the three techniques for all three classification algorithms. Moreover, it lowers the accuracy for all algorithms, while the other techniques raise the accuracy for all algorithms. Unsurprisingly, Table 11 shows that the only combinations of preprocessing techniques which increase the accuracy compared to the benchmarks are those not including stemming. This is true for all algorithms.

As explained in section 2.3, stemming strips inflected words of their affixes and returns the roots of the words. A possible explanation for why stemming is counterproductive in this case could be that customers express different types of issues using different inflections of a word. It could be that two different inflections of the same word are informative features for two separate classes. By applying stemming to the data set, such features are lost.

Another interesting aspect of the results is the effect that the preprocessing techniques have on the Support Vector Machine classifier. SVM is less affected by the techniques spelling correction and removing stop words than the other classifiers, regardless of the permutation. These techniques increase the accuracy for all classifiers, but the increase is at most 1.4 percentage points for SVM, compared to 3.8 percentage points for NB and 5.7 percentage points for DT. On the contrary, SVM is the most affected by the stemming technique, which lowers the accuracy for all algorithms. The decrease in accuracy is at most 8.1 percentage points for NB, 11.5 for SVM and 7.8 for DT.

The small increase in accuracy for SVM when removing stop words and correcting spelling suggests that, unlike the other classifiers, the model is not strongly affected by irrelevant features such as common stop words or misspelled words. It can also be suspected that the preprocessing techniques help the Decision Tree algorithm build a less complex model, resulting in a larger increase, or smaller decrease, in accuracy than for the other classifiers.

Methodology

The choice of an iterative experimental method for this project has had an upside and a downside. The upside is that it allowed the experiments to be stopped after Level 2, once it became apparent that no Level 3 combination could be expected to raise the accuracy further. A linear method rather than an iterative one would have required formulating hypotheses before the experiments began in order to predict such outcomes, which in turn would have required a more extensive literature study; that would be out of scope for this thesis. The downside of the iterative approach is that the research is not entirely exhaustive: further conclusions could possibly be drawn if more experiments had been conducted.


The usage of this specific data set might limit the possibility to generalize any conclusions about classifying customer feedback. Different customer feedback data sets are likely structured in ways that would affect the results of both the classifiers and the preprocessing techniques, so different case studies could generate different results.

In this thesis, accuracy was chosen as the performance measure for evaluating the classifications. To obtain results unaffected by the imbalance of the data set, over-sampling was applied. It might be better to choose a performance measure that is not affected by imbalance in the first place; precision is such a measure. When classifying several classes, as in this project, the ratio of true negatives (TN) is high. Since precision only measures the ratio of true positives (TP) and gives information about false positives (FP), the TN ratio has no effect on the result.
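A small sketch, assuming scikit-learn and invented labels, illustrates the difference: a classifier that always predicts the majority class scores well on accuracy but poorly on macro-averaged precision, which only involves true and false positives per class.

```python
# Invented labels: one dominant class and two minority classes.
from sklearn.metrics import accuracy_score, precision_score

y_true = ["billing"] * 8 + ["delivery", "other"]
y_pred = ["billing"] * 10          # the classifier always predicts "billing"

# Accuracy rewards the majority class: 8 of 10 predictions are correct.
print(accuracy_score(y_true, y_pred))                    # 0.8

# Macro precision averages per-class precision (TP / (TP + FP)), so the
# two minority classes with no correct predictions pull it down; true
# negatives never enter the calculation.
print(precision_score(y_true, y_pred, average="macro",
                      zero_division=0))                  # ~0.27
```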

5.2 Conclusions

This thesis presents results to answer the research question stated in section 1.3. With the chosen methodology, some conclusions can be drawn from the work with regard to the presented setup of algorithms and preprocessing techniques.

The three preprocessing techniques investigated, namely removing stop words, spelling correction and stemming, affect the accuracy of the three classification algorithms in different ways. Stemming lowers the accuracy of all classifiers, while spelling correction and removing stop words improve the classification accuracy of all classifiers.

For the techniques which increased accuracy, Decision Tree was most affected and Support Vector Machine least affected. For the technique which decreased accuracy, Support Vector Machine was most affected.

5.3 Future work

Tables 7 and 8 show that the accuracy increased significantly when the size of the data set was doubled, suggesting that the accuracy could be increased further with an even larger data set. Since this study focused only on the effect of preprocessing techniques on classification accuracy, it would be of interest to explore how the accuracy changes with respect to both preprocessing techniques and data set size; one way such a study could be set up is sketched below.
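The following is a minimal sketch of such an experiment, assuming scikit-learn; the corpus is an invented stand-in for the customer feedback data, so the numbers carry no meaning in themselves.

```python
# Learning-curve sketch: validation accuracy at increasing training sizes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

billing = ["fakturan är fel", "dubbel faktura", "fel belopp på fakturan"]
technical = ["appen kraschar", "appen startar inte", "uppdateringen kraschar"]
texts = billing * 10 + technical * 10
labels = ["billing"] * 30 + ["technical"] * 30

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Accuracy measured at 20 %, 50 % and 100 % of each training fold.
sizes, train_scores, val_scores = learning_curve(
    model, texts, labels, train_sizes=[0.2, 0.5, 1.0],
    cv=5, scoring="accuracy", shuffle=True, random_state=0)

# A curve still rising at the largest size suggests more data would help.
print(sizes, val_scores.mean(axis=1))
```

Repeating this for each preprocessing configuration would separate the effect of data set size from the effect of the techniques themselves.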

The method by which features were extracted, described in section 4.1, could be suboptimal. Again, this thesis focuses only on the effects preprocessing techniques have on classification performance. It should be tested how different feature selection methods and different representations of features affect the performance of the same algorithms on the same data set. With more time and a larger data set, cross-validation could also be used to avoid overfitting; a sketch follows below.
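As a complement to the learning-curve sketch above, the snippet below shows 5-fold cross-validation with scikit-learn on the same kind of invented toy corpus; averaging accuracy over held-out folds gives a more stable estimate than a single train/test split and guards against overfitting to one particular split.

```python
# Cross-validation sketch; again, the corpus is an invented stand-in.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["fakturan är fel", "appen kraschar", "fel på fakturan",
         "appen startar inte", "dubbel faktura", "uppdateringen kraschar"] * 5
labels = ["billing", "technical", "billing",
          "technical", "billing", "technical"] * 5

model = make_pipeline(CountVectorizer(), MultinomialNB())

# Each fold is held out once; the mean and spread summarise performance.
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```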

Some unexpected results were presented in this thesis, and explaining them was out of scope. This refers to why the Naïve Bayes algorithm performed better than SVM, and why stemming performed so poorly on the chosen data set. A deeper study of why the preprocessing techniques generated these results is recommended.

In a broader perspective, similar studies should be conducted with different data sets and different classification algorithms in order to contribute to the overall overview of text classification methods.



Appendices

A Stop words

. , ! ? ; : ' ...

alla allt att av blev bli blir blivit de dem den denna deras dess dessa det detta dig din dina ditt du då där ej efter eller en er ert era ett från för ha hade han hans har henne hennes hon honom hur här i icke ingen inom inte jag ju kan kunde man med mellan men mig min mina mitt mot mycket ni nu någon något några när och om oss på samma sedan sig sin sina sitta själv skulle som så sådan sådana sådant till under upp ut utan vad vara varit vars vart vem vi vid vilka vilkas vilken vilket var varför varje vår vårt åt än är över

TRITA EECS-EX-2018:455
