Learning Methods for Improving News Retrieval Systems

Textklassificeringsmetoder för förbättrad hämtning av nyhetsdata

ELI PLEANER, FELIX ENGSTRÖM

Degree Project in Computer Science, DD143X Supervisor: Atsuto Maki

Examiner: Örjan Ekeberg

CSC, KTH, 2016-05-11


Abstract

Content providers require an efficient and accurate way of retrieving relevant content with minimal human aid. News retrieval, for instance, often requires human intervention to recognize which text documents are news articles and which are not. The differences between a factual news article and an opinionated blog piece may be subtle, yet are critical for providing informative and relevant content to users. This thesis explores the problem of format classification: the task of classifying text documents based on the format in which they are written, such as a news article, blog entry or forum text. More explicitly, the goal of the thesis is to examine how well state-of-the-art supervised text classification techniques work for format classification. We select a number of classifiers that have been shown to perform well in other text classification tasks and evaluate their performance in this unexplored task. Experimental evaluation, performed on a novel dataset created from multiple existing datasets, explores both binary and multi-class classification in a bag-of-words feature space. Based on our experimental results, we have found that state-of-the-art supervised text classification techniques perform acceptably well at format classification. Furthermore, we propose a Gradient Boost model as a candidate classifier for the task of format classification, and provide a discussion of future work.


Sammanfattning

Companies that provide content management services need efficient and precise methods to extract relevant content from large amounts of data with as little human labour as possible. One example is news collection services, where news must be extracted from a variety of sources. As part of that process, such services must be able to determine whether a text is a news article or some other kind of text. The difference between a news article and a text written for a blog can be subtle, yet is decisive for these companies. This report explores format classification: the task of classifying texts based on the format for which they were written. Examples of formats are news article, blog text and forum text. More specifically, the report examines how well the methods used today in the well-explored task of classifying texts by topic perform when applied to format classification. This is investigated through experimental evaluation on a new dataset constructed by combining several existing datasets, both as a binary and as a multi-class classification task in a bag-of-words vector space. A number of topic classification methods are selected based on results from earlier research, and their performance on format classification is examined. We conclude that the text classification methods we tested perform acceptably well at format classification. We further propose Gradient Boost or Multinomial Naive Bayes for solving the task, depending on whether the focus is on classification quality or on speed. Finally, the results are discussed, placed in relation to the limitations of the study, and suggestions for future research are given.


Contents

1 Introduction
1.1 Problem statement
1.2 Scope
2 Background
2.1 Current Applications
2.2 Text Classification Process
2.2.1 Pre-processing
2.2.2 Text Representation
2.2.3 Feature Selection
2.2.4 Learning
2.2.5 Classifiers
2.2.6 Ensemble Classifiers
2.3 Hyperparameter Tuning
2.4 Evaluation Criteria
2.4.1 Accuracy
2.4.2 Precision
2.4.3 Recall
2.4.4 F-Score
2.4.5 Precision-Recall Curve
2.4.6 Receiver Operating Characteristic Curve
2.4.7 Multi-Class Metric Averaging
3 Methods
3.1 Dataset
3.1.1 Data Sources
3.1.2 Subsets
3.2 Preprocessing
3.3 Feature Selection/Dimensionality Reduction
3.4 Classifiers
3.4.1 Hyperparameter Tuning
3.4.2 Classifiers Compared
3.5 Evaluating Performance
4 Results
4.1 Experimental Results and Discussion
4.2 Limitations and Considerations
5 Conclusion
5.1 Future Work
Bibliography


1. Introduction

More content is being produced today than ever before, and with it comes a tendency to be overwhelmed by a constant influx of information. It requires a certain cognitive effort to sift through all the low-quality content to find that which is relevant, informative, and worthwhile.

This ever-increasing amount of media has created the need for a new kind of filtering service. Many businesses today work with the task of sifting through huge data flows, sorting out the data relevant to their customers, but processing this manually at such a scale is laborious and inefficient.

To avoid this, one solution is to use an automated information retrieval (IR) system that constantly collects and processes information. The term "text mining" is often used to describe the tasks within IR relating to extracting useful information from large quantities of text.

One sub-task within text mining is text classification (TC) [17]. TC is the process of assigning texts to one or more categories within a set of possible categories.

In the past, TC was mainly done using knowledge engineering, in which a set of rules is defined by which documents are classified. However, over the last 30 years it has been steadily moving towards solutions using machine learning (ML) [17], which involves training classifiers to recognize patterns in sets of data.

Much progress has been made in the field of TC using ML algorithms. Intelligent learning systems can be trained on collections of text documents to automatically classify documents without human intervention. These systems can be highly scalable and more efficient than human-dependent alternatives.

Much of the research being done in TC learning systems is focused on categorization related to the topic or theme of the text. An unexplored problem is classification based on the format for which a text was created. In other words, identifying whether a text is a news article, an opinion piece, or some other format.

This problem has particular usefulness for automated news retrieval, a task that requires sorting through huge amounts of documents and may need to distinguish texts of different formats. For example, it may be important to be able to distinguish a news article from a discussion post. Many companies depend on news retrieval systems to provide relevant news to their users, handling large flows of documents to be processed and evaluated for publishing. Unless the flow of documents is guaranteed to contain only news articles, the first task for these companies is to distinguish texts that are actually news articles from those that are not. This is a laborious task if done manually, and current automatic methods are prone to error.

Therefore, we propose using state-of-the-art TC learning techniques to accurately identify relevant documents, thus improving news retrieval systems.

1.1 Problem statement

This thesis aims to explore the efficiency of current state-of-the-art TC methods at solving the task of format classification, thereby filling a gap that currently exists in the field. Furthermore, the intent is to suggest a suitable classification strategy based on experimental results. Specifically, we aim to investigate the question:

Can state-of-the-art TC techniques classify documents based on format, with performance comparable to the results from recent work in the field of topic classification?

1.2 Scope

This thesis will lead to a suggestion of a method for solving the task of format classification, as well as suggestions for further research to improve on our solution.

This thesis is intended as a first step in exploring the task of format classification. It will therefore be limited to evaluating how well popular supervised learning methods perform at format classification, using a bag-of-words feature space. Given the time frame of our research and limited computational resources, certain learning methods such as neural networks and K-Nearest Neighbors will not be covered.


2. Background

2.1 Current Applications

Current applications of text classification (TC) are widespread, with research being done in many domains. In the medical sector, examples of current TC applications include detecting adverse drug reactions via social media [24] and predicting intensive care unit (ICU) mortality risk using nurses' notes [17]. TC has been applied to financial analysis, such as predicting the movements of financial assets using news [15]. Other applications include, but are not limited to, web-content blocking [13], spam detection [19, 25], and API information retrieval for software development [21].

2.2 Text Classification Process

The conventional text classification process involves pre-processing, text representation, feature selection, and classification learning [1]. Each of these stages will be described in the following subsections.

2.2.1 Pre-processing

Common pre-processing tasks involve the removal of words that do not contribute to the distinctiveness of documents, such as stop-words. These are words that are common across documents and do not help discriminate one text from another (e.g. "and", "the", "but"). Other possible tasks are the removal of words in other languages (if these words are not significant to the objective) [5], punctuation and digit removal, and case conversion [26].

Lemmatizing, another popular text pre-processing task, involves converting or reducing different forms of words to their shared root. For example, both "eating" and "ate" would be converted to "eat" [1].

2.2.2 Text Representation

After texts have undergone pre-processing, they must be represented in a way that is usable in the learning stage. This is commonly accomplished by representing the texts as a vector space model, translating the words of a document into an algebraic form [22].

Bag-of-Words

A common vector representation scheme is bag-of-words (BOW). BOW representation stores the frequency of word usages, ignoring document structure and word ordering [14]. Thus, a bag-of-words representation of a text is a vector where each component is the number of occurrences of a word in the text. A similar, alternative bag-of-words representation stores a binary value for each word, signifying whether the word appears in the document or not [10].
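
As an illustration (ours, not part of the original thesis), both bag-of-words variants can be produced with scikit-learn's CountVectorizer in a recent release; the two toy documents are invented for the example:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the markets fell today", "markets rally as markets recover"]

    counts = CountVectorizer()             # term-frequency bag-of-words
    binary = CountVectorizer(binary=True)  # presence/absence variant

    print(counts.fit_transform(docs).toarray())  # word counts per document
    print(binary.fit_transform(docs).toarray())  # 0/1 indicators per document
    print(counts.get_feature_names_out())        # the vocabulary (one feature per word)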

2.2.3 Feature Selection

Term Frequency–Inverse Document Frequency

Term frequency–inverse document frequency (tf-idf) is a popular technique used to determine the weight associated with a word in a document within a corpus. If a word appears frequently in a document, that word can be considered more relevant. However, if that word also appears frequently throughout the document corpus, then it does little to distinguish documents. Thus, words that appear significant in an individual document but are in fact common within the document corpus can be identified and removed.

The equation for calculating the associated weight v of a term i in document j using tf-idf is as follows:

v(i, j) = tf(i, j) ∗ idf(i)    (2.1)

where tf(i, j) is the frequency of term i in document j, and idf(i) is the inverse document frequency of term i. idf(i) can be calculated by dividing the total number of documents by the number of documents in which term i appears [5].

2.2.4 Learning

Supervised vs. Unsupervised vs. Semi-Supervised Learning

In general, learning methods fit into one of three categories (although others exist as well): supervised learning, unsupervised learning, and semi-supervised learning.

In supervised learning, inputs and their corresponding outputs are used to train a model in order to find a generalized approximation function that fits the behavior of the data. In unsupervised learning, only inputs are known, so a model attempts to cluster data based on underlying patterns. Semi-supervised learning uses both labeled and unlabeled data in combination to train a model.

Typically, labeled data is costly to obtain: in TC, it is time-consuming and tedious to manually label substantial amounts of documents. On the other hand, unlabeled data is abundant and inexpensive to obtain. Therefore, the ability to effectively utilize unlabeled data in learning is highly desirable.

The learning classifiers used for this research are outlined in section 2.2.5.

2.2.5 Classifiers

Support Vector Machine

Support vector machine classification (SVC) is a simple and powerful technique. SVC attempts to find a hyperplane that separates the data with the widest possible margin. SVC classifiers yield impressive results, and the technique is comparable to modern, state-of-the-art techniques in terms of classification success despite being a relatively older method [6]. SVC classifiers are particularly powerful because they work independently of the dimensionality of the feature space, which tends to be quite high for text documents [11]. Along with Naive Bayes and k-Nearest Neighbor classifiers, SVC classifiers are often used to benchmark novel TC techniques [6].

SVC classifiers support several mechanisms, such as kernels, that make them capable of non-linear separation [11]. Kernels extend an SVC's flexibility in learning non-linearly separable data: they map the input data to a higher-dimensional space in order to find a linearly separating hyperplane in this new space. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid [9].

During non-linear separation, there is a trade-off between finding the widest margin and minimizing errors. A technique known as soft margins allows a small number of points to be misclassified in order to obtain a wider margin. A regularization parameter known as C can be used to control this trade-off. The higher the value of C, the more emphasis there is on minimizing errors during separation. Thus, low C values can result in a wide margin that fails to fully separate all data, whereas high C values can result in a narrow margin that separates the data with a lower error rate. The optimal choice of C is critical, and can be determined via cross-validation and an understanding of the data distribution.

Naive Bayes

Naive Bayes (NB) classification is a technique based on Bayes' theorem. This technique assumes independence of features. Thus, in the context of TC, it assumes there is no correlation between the appearances of different words. NB has been used extensively for TC [7, 14], and is often used to evaluate the results of novel techniques [6]. The equation for NB is as follows [12]:

P(c|d) = P(d|c) ∗ P(c) / P(d)    (2.2)

where:

• P(c|d) is the probability of class c given document d,

• P(d|c) is the probability of document d given class c,

• P(c) and P(d) are the probabilities of class c and document d, respectively.

Two NB variants are commonly used for TC: Gaussian NB and Multinomial NB.

Gaussian NB assumes the likelihood of the features to be Gaussian. Multinomial NB assumes multinomially distributed data [3].
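
A small sketch (ours, with invented toy documents) of the two variants in scikit-learn; note that GaussianNB expects a dense feature matrix, while MultinomialNB works directly on sparse word counts:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    docs = ["stocks fell sharply today", "try this pasta recipe", "markets rally after report"]
    labels = ["news", "blog", "news"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)               # sparse bag-of-words counts

    mnb = MultinomialNB(alpha=1.0).fit(X, labels)    # multinomial NB on word counts
    gnb = GaussianNB().fit(X.toarray(), labels)      # Gaussian NB needs a dense array

    print(mnb.predict(vectorizer.transform(["markets fell after the report"])))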

Nearest Neighbor

Nearest Neighbor methods are conceptually straightforward: a document d is assigned the class c if the training patterns closest to d are from class c [14]. Multiple methods exist for measuring the distance between documents, such as Euclidean distance and cosine similarity [23]. Two common nearest neighbor algorithms are K-Nearest Neighbors and Nearest Centroid.

In K-Nearest Neighbors (KNN), the k closest training patterns to an input document are used to classify the document. In Nearest Centroid, each class is represented by its centroid, and the smallest distance between a document and a centroid is used to classify the document [3]. When applied to TC with tf-idf vectorization, Nearest Centroid classification is also referred to as Rocchio classification [16].
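
For reference, both nearest-neighbour approaches are available in scikit-learn; the sketch below is ours and uses the default Euclidean distance, whereas the thesis experiments use cosine distance for the Rocchio/Nearest Centroid classifier:

    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

    knn = KNeighborsClassifier(n_neighbors=5)  # vote among the 5 closest training documents
    rocchio = NearestCentroid()                # assign the class of the nearest class centroid

    # Both are fit on a feature matrix and labels, e.g. knn.fit(X_train, y_train),
    # and then classify new documents with knn.predict(X_test); X_train/y_train are assumed to exist.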

Perceptron

Perceptron classification is a supervised, linear classification method. Every input node has weighted connections to the output nodes, one for each of the two classes, and the perceptron assigns a binary label to each input based on these weighted connections. When an input is mislabeled, the weights are updated, moving the output closer to the correct label.

Decision Tree

Decision Tree classification learns decision rules inferred from data features in order to make predictions on unseen data. Decision trees consist of leaves and branches: leaves represent class labels and branches represent feature combinations that lead to those labels. Decision trees are useful as a "white box" classifier, since the decisions that lead to a prediction can be visualized in a human-interpretable way [3].

2.2.6 Ensemble Classifiers

Ensemble classifiers combine multiple classifiers in order to improve performance over a single classifier. Ensemble classifiers typically fall into two categories: bagging and boosting [3].


Bagging methods combine the results of multiple individually trained classifiers.

Boosting methods sequentially train each classifier in an ensemble. The training of each classifier emphasizes the data mislabeled by the previous classifiers.

A popular bagging ensemble classifier is Random Forest. Examples of the boosting ensemble strategy include Gradient Boosting and Adaptive Boosting.

Random Forest

Random Forest uses the predictions of multiple random decision trees to formulate a final prediction, with the prediction of each decision tree weighted equally. The trees are trained using bagging, and at each split a random subset of the features is considered.

Gradient Boosting

Gradient Boosting sequentially builds small decision trees, one at each boosting stage. Each tree is restrained, often by depth, making it a weak classifier [8]. Following the principles of gradient descent, each tree is fit to the gradient of a loss function computed with respect to the predictions of the previous trees.

Adaptive Boosting

Adaptive (Ada) Boosting assigns weights to the input data and sequentially trains classifiers on these weighted inputs. After each boosting stage, the weights of mislabeled inputs are increased and the weights of correctly labeled inputs are decreased. The next classifier is then trained on this reweighted data [3].
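
The three ensemble strategies described above map directly onto scikit-learn estimators; the sketch below is ours, with illustrative parameter values, and only shows how the models are constructed:

    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)

    # Bagging: many independently trained trees, each split drawn from a random feature subset.
    forest = RandomForestClassifier(n_estimators=100)

    # Boosting: small trees added sequentially, each fit to the gradient of the loss
    # with respect to the current ensemble's predictions.
    gboost = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)

    # Boosting: sample weights of misclassified inputs are increased between stages.
    adaboost = AdaBoostClassifier(n_estimators=100, learning_rate=0.1)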

2.3 Hyperparameter Tuning

Many classifiers accept parameters when being constructed, which are known as hyperparameters. Finding the best hyperparameter values is critical for a classifier's performance. Which hyperparameter values are optimal depends on the task being solved. Thus, the optimal combination of values often cannot be known without some experimentation and evaluation.

A common hyperparameter tuning strategy is an exhaustive grid search. This involves exhaustively exploring all combinations of values within a defined set of parameters. A model is constructed with each combination, trained on a small sample, and scored on its performance. By comparing the performances of the models, an optimal hyperparameter combination can be found [3].
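
A minimal grid-search sketch (ours; X_train and y_train are assumed to be an existing feature matrix and label vector) using scikit-learn's GridSearchCV with 3-fold cross-validation, the setting used later in the thesis:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "kernel": ["rbf"],
        "C": [1, 10, 100],
        "gamma": [1e-3, 1e-4],
    }

    search = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1_weighted")
    # search.fit(X_train, y_train)   # trains one model per parameter combination
    # print(search.best_params_)     # the combination with the best cross-validated score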

2.4 Evaluation Criteria

When evaluating the performance of a classifier, there are different choices of what to measure. The choices reflect which evaluation criteria are relevant to the problem.


2.4.1 Accuracy

Accuracy measures the fraction of correct predictions made. It is the ratio of the number of correct predictions over the total number of predictions [5]:

A = (Number of correct predictions) / (Total number of predictions)    (2.3)

In the context of news retrieval, high accuracy means that nearly all documents being retrieved are correctly labeled according to their format.

2.4.2 Precision

Precision measures a classifier's ability to not falsely label a negative document as positive. It is the ratio of true positives over all positive predictions made [3]:

P = (Number of true positives) / (Number of true and false positives)    (2.4)

Choosing precision as the evaluation criterion implies that the priority is avoiding mislabeling data, with less emphasis on how many relevant documents are missed. In the context of format classification, high precision means that very few documents are being mistakenly labeled as another class, even if the number of documents retrieved is low. This is important for news retrieval: retrieving a smaller number of highly relevant news articles along with minimal unrelated content enhances the user experience.

2.4.3 Recall

Recall is the measure of the success in retrieving all positive samples. It is the ratio of the number of true positives over the total number of positively labeled documents [5]:

R = (Number of true positives) / (Total number of positively labeled documents)    (2.5)

Choosing recall as the evaluation criterion implies that the priority is not missing any relevant documents, with less emphasis on mislabeling documents. This is less important for news retrieval, but not insignificant: retrieving a high number of news articles while also including spam or other non-news documents results in a negative user experience. On the other hand, low recall could result in some highly relevant documents being left unseen.

2.4.4 F-Score

F-Score, or F1, is a popular way of combining precision and recall into one value. It is the harmonic mean of precision and recall [27]:

F1 = (2 ∗ P ∗ R) / (P + R)    (2.6)


F-Score is used when it is desirable to have both high precision of the retrieved data and a low ratio of missed data. F-Score is an intuitive way of compromising between precision and recall. In the context of news retrieval, a high F-score means that almost all relevant documents are being retrieved, along with minimal irrelevant documents.

2.4.5 Precision-Recall Curve

A precision-recall curve plots the precision versus recall of a classifier. The area under the curve (AUC) is useful for representing the curve as a single value. A large area under the curve shows that a classifier has both high precision and high recall, which is preferred [3].

2.4.6 Receiver Operating Characteristic Curve

A receiver operating characteristic (ROC) curve plots the true positive rate (the fraction of true positives over all positives) versus the false positive rate (the fraction of false positives over all negatives). The false positive rate equals one minus the true negative rate. Thus, a large area under this curve shows that a classifier has a high true positive rate and a low false positive rate, which is preferred [3].
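
Both curves, and the areas under them, can be computed from a classifier's scores for the positive class; the sketch below is ours, with an invented toy example in which 1 denotes "news":

    from sklearn.metrics import (average_precision_score, precision_recall_curve,
                                 roc_auc_score, roc_curve)

    y_true = [1, 1, 0, 1, 0, 0, 1, 0]                   # 1 = news, 0 = non-news
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # predicted probability of "news"

    fpr, tpr, _ = roc_curve(y_true, scores)              # points on the ROC curve
    prec, rec, _ = precision_recall_curve(y_true, scores)  # points on the precision-recall curve

    print("Area under ROC curve:", roc_auc_score(y_true, scores))
    print("Area under PR curve (average precision):", average_precision_score(y_true, scores))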

2.4.7 Multi-Class Metric Averaging

When evaluating multi-class performance, there are multiple averaging techniques that can be applied to the evaluation metrics discussed in the previous subsections.

These include micro-averaging, macro-averaging, and weighted averaging.

Macro-averaging involves computing the evaluation metrics individually for each class and then taking the unweighted average over all classes [27]. Micro-averaging involves pooling the values used in the metric formulas (such as true and false positive counts) across all classes and then applying the metric functions to the pooled values [27]. Weighted averaging weights the metrics for each class by the number of true instances of that class before averaging; it may therefore result in an F1-score that is not between precision and recall [3].

Micro-averaging tends to be more effective when the class sizes are similar. Macro-averaging is more effective when there are large discrepancies in class sizes, as it gives equal weight to each category regardless of size [16]. Weighted averaging is likewise suited to imbalanced class sizes, since it explicitly accounts for the number of instances in each class [3].


3. Methods

3.1 Dataset

The dataset used is an amalgamation of multiple existing datasets as well as novel data scraped specifically for our research. The existing datasets consist of the 20Newsgroups dataset, the Reuters 21578 dataset, and the Signal Media One Million News Articles dataset. The novel dataset consists of the extracted text from web pages of news providers. The motivation for including the novel dataset is to have data that news retrieval systems typically see and process.

For multi-class classification, the data is labeled as "News", "Newsgroup", "Blog" or "Other". For binary classification, documents are either positively labeled as "News" or negatively as "Non-news". Tables 3.1 and 3.2 show the number of documents per class used in training and testing for the multi-class and binary classification tasks.

Documents/Class News Newsgroup Blog Other Total

Training 44,716 13,402 9,436 107 67,661

Testing 44,885 13,287 9,385 104 67,661

Total 89,601 26,869 18,821 211 135,322

Table 3.1: Documents Per Class in Multi-class Dataset

Documents/Class News Non-News Total

Training 44,716 22,945 67,661

Testing 44,885 22,776 67,661

Total 89,601 45,721 135,322

Table 3.2: Documents Per Class in Binary Dataset

3.1.1 Data Sources

The 20Newsgroups and the Reuters 21578 datasets were provided via Ana Cardoso-Cachopo's homepage [4], which hosts a number of datasets made available for research purposes. The 20Newsgroups dataset consists of approximately 20,000 discussion posts, taken from a range of 20 different Usenet newsgroups. All 20Newsgroups documents are labeled as "Newsgroup" for multi-class classification, and are considered non-news for binary classification.

The Reuters 21578 dataset consists of 15,500 news articles taken from the Reuters newswire in 1987. These articles cover a wide range of topics. All Reuters documents are labeled as "News" for both multi-class and binary classification.

The Signal Media dataset consists of documents collected from a variety of news sources over a period of one month (September 2015) [18]. Sources of these documents include major news providers, local news sources, and blogs. This dataset consists of around 750,000 news articles and 250,000 blog articles, all labeled. Blog articles are labeled as "Blog" and news articles are labeled as "News". Blog articles are considered non-news for binary classification.

The novel dataset was scraped and provided by Tellus, a company that focuses on news collection and provision. The documents in this dataset were collected in the past year, and come from a wide variety of news sources. The scraping was primarily done unfiltered, and the dataset includes news articles, advertisements, spam, as well as the text from assorted media web galleries (video and photography).

The dataset consists of 790 labeled news articles and 210 labeled non-news articles.

The news articles are labeled "News" and all non-news are labeled "Other".

3.1.2 Subsets

Given the limited processing capacity of the available machines, a smaller subset of the main dataset has been used. This subset consists of a number of samples from the Signal Media dataset, along with all data from each other dataset. The sampling of the Signal Media dataset was done in a way that guaranteed a ratio of news to non-news consistent with the original dataset (4:1 positive to negative). In total, 100,000 samples were taken from the Signal Media dataset. The trade-offs of using smaller subsets are discussed in section 5.

3.2 Preprocessing

In order to have a uniform dataset that performs well for TC, a number of pre-processing steps were applied. All characters in every document were converted to lowercase and stripped of accents. Punctuation, digits, and words shorter than four characters in length were removed. Stop words were removed, using scikit-learn's built-in English stop word list [3]. All words in every document were lemmatized using NLTK's WordNetLemmatizer [2].
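
A sketch of these steps (our own approximation of the pipeline described, not the authors' exact code) using scikit-learn's built-in English stop word list and NLTK's WordNetLemmatizer; the WordNet corpus must be downloaded once via nltk.download("wordnet"):

    import re
    import unicodedata

    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase and strip accents.
        text = unicodedata.normalize("NFKD", text.lower())
        text = text.encode("ascii", "ignore").decode("ascii")
        # Remove punctuation and digits.
        text = re.sub(r"[^a-z\s]", " ", text)
        # Drop stop words and words shorter than four characters, lemmatize the rest.
        words = [lemmatizer.lemmatize(word) for word in text.split()
                 if len(word) >= 4 and word not in ENGLISH_STOP_WORDS]
        return " ".join(words)

    print(preprocess("The Markets were falling sharply in 2016!"))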


3.3 Feature Selection/Dimensionality Reduction

The preprocessed dataset was then transformed using tf-idf into a BOW feature space. The top 10,000 features were considered, ordered by term frequency across the corpus.
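
In scikit-learn this step corresponds to a TfidfVectorizer whose vocabulary is capped at the 10,000 most frequent terms; the sketch is ours, and cleaned_documents is an assumed list of preprocessed texts:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Keep only the 10,000 terms with the highest frequency across the corpus.
    vectorizer = TfidfVectorizer(max_features=10000)

    # X = vectorizer.fit_transform(cleaned_documents)   # sparse (documents x 10,000) tf-idf matrix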

3.4 Classifiers

A number of classifiers were trained and their performances compared in order to determine the best suited candidates for the classification task. Classifiers were selected based on their prior success in related text classification tasks [4]. The classifiers compared and their hyperparameters are outlined in section 3.4.2.

3.4.1 Hyperparameter Tuning

To optimize each classifier's performance, hyperparameter values were selected via an exhaustive grid search over a defined parameter space for each classifier. For every combination of parameters, a model was trained and its performance evaluated using 3-fold cross-validation. The models' performances were then compared, and the parameter set that resulted in the best-performing model was chosen.

This exhaustive grid search was done separately with both binary data and multi-class data. Some hyperparameter values were optimal for binary data, while other hyperparameter values were optimal for multi-class data.

3.4.2 Classifiers Compared

The following classifiers were selected to be compared, with the corresponding hyperparameters (a construction sketch in scikit-learn follows the list):

• Support Vector Machine with an RBF kernel, an error term penalty of 100 and a gamma value of 0.001. These parameters were used for both binary and multi-class data.

• Perceptron Classifier, with 500 passes over training data, using L2 penalty and an alpha value of 0.00001. These parameters were used for both binary and multi-class data.

• Rocchio Classifier, using cosine distance for both binary and multi-class data.

• Decision Tree Classifier, considering √(#features) features at each split.

• Gaussian Naive Bayes Classifier (which takes no hyperparameters).

• Multinomial Naive Bayes Classifier, with alpha=0 for binary data and alpha=0.00001 for multi-class data.


• Random Forest Classifier, with 100 trees for both binary and multi-class data.

• Gradient Boost Classifier with 200 boosting stages, a learning rate of 0.1, ten maximum nodes per tree, and considering √(#features) features at each split.

• Ada Boost Classifier with 100 classifiers, and a learning rate of 0.1 for both binary and multi-class data.
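
For concreteness, the list above can be instantiated in scikit-learn roughly as follows. This is our own sketch: parameter names are current scikit-learn equivalents of the settings described (for example, "ten maximum nodes per tree" is read as max_leaf_nodes=10 and the perceptron's 500 passes as max_iter=500), the multi-class values are shown where the binary and multi-class settings differ, and the cosine metric for NearestCentroid may be restricted in newer scikit-learn releases:

    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)
    from sklearn.linear_model import Perceptron
    from sklearn.naive_bayes import GaussianNB, MultinomialNB
    from sklearn.neighbors import NearestCentroid
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "SVC": SVC(kernel="rbf", C=100, gamma=0.001),
        "Perceptron": Perceptron(max_iter=500, penalty="l2", alpha=1e-5),
        "Rocchio": NearestCentroid(metric="cosine"),
        "DecisionTree": DecisionTreeClassifier(max_features="sqrt"),
        "GaussianNB": GaussianNB(),
        "MultinomialNB": MultinomialNB(alpha=1e-5),   # alpha=0 for the binary task
        "RandomForest": RandomForestClassifier(n_estimators=100),
        "GradientBoost": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                    max_leaf_nodes=10, max_features="sqrt"),
        "AdaBoost": AdaBoostClassifier(n_estimators=100, learning_rate=0.1),
    }

    # Each model is then trained and evaluated the same way, e.g.:
    # for name, clf in classifiers.items():
    #     clf.fit(X_train, y_train)            # X_train/y_train assumed to exist
    #     print(name, clf.score(X_test, y_test))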

3.5 Evaluating Performance

The evaluation criteria introduced in section 2.4 were used as a measurement of each model’s performance. Half of the dataset was saved and used for evaluation of the models.

To further evaluate each model’s performance, the results were then compared to the results from previous research in TC [4] to determine if the models’ performances are acceptable.


4. Results

4.1 Experimental Results and Discussion

As previously stated, every estimator was trained and tested on both the multi-class and the binary classification task. The averaged metrics for the multi-class classification task can be seen in table 4.1, and those for the binary classification task in table 4.2.

Name               Avg. Accuracy  Avg. Precision  Avg. Recall  F1 Score
1. AdaBoost        0.748555       0.770759        0.748555     0.66854
2. DecisionTree    0.741638       0.736171        0.741638     0.738309
3. GaussianNB      0.724967       0.801119        0.724967     0.743294
4. GradientBoost   0.851495       0.846423        0.851495     0.841009
5. MultinomialNB   0.817265       0.813173        0.817265     0.81408
6. NearestCentroid 0.73818        0.816175        0.73818      0.757763
7. Perceptron      0.801067       0.800384        0.801067     0.800456
8. RandomForest    0.844829       0.841733        0.844829     0.823409
9. SVC             0.85148        0.844559        0.85148      0.834714

Table 4.1: Multi-class classification metrics

Name               Avg. Accuracy  Avg. Precision  Avg. Recall  F1 Score
1. AdaBoost        0.772129       0.755706        0.970101     0.849586
2. DecisionTree    0.777553       0.828858        0.837629     0.833221
3. GaussianNB      0.794608       0.881212        0.79795      0.837517
4. GradientBoost   0.864264       0.870596        0.934254     0.901302
5. MultinomialNB   0.831853       0.859636        0.892213     0.875622
6. NearestCentroid 0.792702       0.897207        0.776473     0.832485
7. Perceptron      0.797239       0.854578        0.836738     0.845564
8. RandomForest    0.860111       0.852494        0.954239     0.900501
9. SVC             0.864309       0.867038        0.939534     0.901832

Table 4.2: Binary classification metrics

The following figures provide a visual comparison of each estimator's performance with regard to each evaluation criterion. Figure 4.1 provides a comparison of multi-class estimators, and figure 4.2 provides a comparison of binary estimators.

To highlight the differences in performance between estimators, the y-axis of each graph is limited to the minimum and maximum metric scores.

Figure 4.1: Comparison of multi-class estimators

Figure 4.2: Comparison of binary estimators

Figure 4.1 clearly illustrates Gradient Boost, SVC, and Random Forest as the best-performing models for multi-class classification, scoring highest across all metrics.


Figure 4.2 shows that for binary classification, the results were a bit more spread out. Gradient Boost, SVC, and Random Forest still scored highest in accuracy and F1-score, while other estimators scored higher in recall and precision.

When interpreting the results of binary classification, it is important to consider the trade-off between recall and precision. The two tend to move in opposite directions, and the estimators that scored highest in recall or precision scored lowest in the other. Thus, estimators such as Ada Boost may seem to perform very well with 97% average recall, but in fact perform poorly when the other metrics are considered as well.

The F1-score provides a combined evaluation of precision and recall, making it a suitable measurement of the performance of the binary classifiers. Based on F1-score, the top three classifiers are the same as for multi-class classification, in a slightly altered order. This can be seen in figure 4.4.

The top three estimators for each evaluation criterion are compared in table 4.3 for multi-class classification and table 4.4 for binary classification. It should be noted that the difference in F1-score between the highest- and third-highest-performing classifier is less than 0.02 for multi-class and 0.007 for binary.

Avg. Accuracy
1. GradientBoost 0.85149
2. SVC 0.85148
3. RandomForest 0.84483

Avg. Precision
1. GradientBoost 0.84642
2. SVC 0.84455
3. RandomForest 0.84173

Avg. Recall
1. GradientBoost 0.85149
2. SVC 0.85148
3. RandomForest 0.84483

F1 Score
1. GradientBoost 0.84100
2. SVC 0.83471
3. RandomForest 0.82340

Table 4.3: Comparison of best-performing multi-class estimators

Avg. Accuracy
1. SVC 0.86430
2. GradientBoost 0.86426
3. RandomForest 0.86011

Avg. Precision
1. NearestCentroid 0.89721
2. GaussianNB 0.88121
3. GradientBoost 0.87059

Avg. Recall
1. AdaBoost 0.97010
2. RandomForest 0.95423
3. SVC 0.93953

F1 Score
1. SVC 0.90183
2. GradientBoost 0.90130
3. RandomForest 0.90050

Table 4.4: Comparison of best-performing binary estimators

The ROC and Precision-Recall curves for binary classification are presented in figures 4.3 and 4.4. The top three performers in F1-score from tables 4.3 and 4.4 also have the largest area under the curve in both the ROC and Precision-Recall curves.

Figure 4.3: ROC curves for binary classification

Figure 4.4: Precision-Recall curves for binary classification


Tables 4.5, 4.6, and 4.7 show how Gradient Boost, Random Forest, and SVC (the three best-performing multi-class estimators) performed on predicting each class. The "support" column gives the true number of occurrences of each class.

Class / Criteria precision recall f1-score support

news 0.84803 0.95359 0.89772 44885

blog 0.74381 0.48815 0.58945 13287

newsgroup 0.98318 0.88460 0.93129 9385

other 0.92000 0.22115 0.35659 104

avg / total 0.84642 0.85149 0.84101 67661

Table 4.5: Classification specifics for GradientBoost

Class / Criteria precision recall f1-score support

news 0.83343 0.97048 0.89675 44885

blog 0.80047 0.36020 0.49683 13287

newsgroup 0.94533 0.93404 0.93965 9385

other 0.34965 0.48077 0.40486 104

avg / total 0.84173 0.84483 0.82341 67661

Table 4.6: Classification specifics for RandomForest

Class / Criteria precision recall f1-score support

news 0.84031 0.96451 0.89814 44885

blog 0.76863 0.41454 0.53860 13287

newsgroup 0.98173 0.93895 0.95986 9385

other 0.00000 0.00000 0.00000 104

avg / total 0.84456 0.85148 0.83471 67661

Table 4.7: Classification specifics for SVC

The "other" class gave the estimators the most trouble, with SVC scoring zeros across all metrics. This is unsurprising: the "other" class was the smallest and contained the most variation in features among members. Thus, estimators tended to classify data from the "other" class into one of the other classes. To avoid this, more "other" samples could be used when training the models. This highlights the importance of using a balanced dataset for training.

Tables 4.8 and 4.9 show the time taken to train and test each estimator. The estimators are listed in ascending order by training time, with the Multinomial NB estimator being the quickest to train. It is notable that SVC, one of the highest-performing estimators, was the slowest estimator in both training and testing. SVC was more than ten times slower to train and almost three times slower to test than the next slowest estimator.

Furthermore, while Gradient Boost outperforms Random Forest in terms of training time, Random Forest is almost eight times faster at testing. Multinomial NB trains 1000 times faster than Gradient Boost and tests 20 times faster than Random Forest, while performing with only slightly worse accuracy and F1-score.

Name               Train (s)   Test (s)
1. MultinomialNB   0.092654    0.104559
2. NearestCentroid 0.180683    0.139841
3. DecisionTree    5.3353      0.0722189
4. Perceptron      63.3027     0.048183
5. GradientBoost   102.103     16.9473
6. AdaBoost        113.071     3.33636
7. GaussianNB      145.149     521.194
8. RandomForest    151.63      2.48072
9. SVC             2075.81     1363.25

Table 4.8: Train and test duration for multi-class estimators

Name               Train (s)   Test (s)
1. MultinomialNB   0.0565448   0.1038
2. NearestCentroid 0.170777    0.146689
3. DecisionTree    5.1547      0.0833509
4. Perceptron      15.1282     0.0126741
5. GradientBoost   29.3447     13.6074
6. AdaBoost        121.85      3.11134
7. GaussianNB      130.068     383.221
8. RandomForest    142.639     3.09278
9. SVC             2021.07     1281.76

Table 4.9: Train and test duration for binary estimators


Table 4.10 shows the results produced by Ana Cardoso-Cachopo's research evaluating text topic classification in 2007 [4]. The table shows the best-performing classifier on three datasets that are also used in this thesis.

Dataset                    Classifier  Accuracy
Reuter 8 [4]               SVC         0.9698
Reuter 52 [4]              SVC         0.9377
Ng20 [4]                   SVC         0.8284
Our dataset (multi-class)  SVC         0.8643

Table 4.10: Comparison of results to previous research

Comparing our results with Cardoso-Cachopo's, it is clear that SVC performs in the same range on format classification as it does on topic classification. This indicates that our classifier is performing acceptably well. However, as the datasets used and the tasks performed are different, the comparison cannot be seen as definitive. Instead, it is useful as further validation of our results.

The difference in performance between the two tasks is small enough that it is hard to conclude whether it is the result of implementation differences, insufficient data, or format classification being a harder task to solve than topic classification.

4.2 Limitations and Considerations

Several limitations need to be taken into account when considering the results. This section discusses the nature of these limitations and how they impact the results.

One such limitation was computational power. The available machines were limited in their processing abilities and memory capacities, which restricted the size of the dataset that could be used during training. With over a million documents available, we had to reduce our dataset to a smaller sample in order to avoid extremely long training and testing times. Using cross-validation during hyperparameter tuning minimized the effects of this restriction.

Computational limitations also resulted in some estimators being unusable due to their high resource requirements. For example, when testing a KNN model with five nearest neighbors, the process exceeded the memory capacity of every available machine. At the same time, resource requirements and scalability are important factors to consider when evaluating a model: if a model does not scale well or is extremely resource intensive, it is likely not viable for large-scale classification tasks.

Another limitation stemmed from using a small number of sources for the dataset. Due to this, the dataset was less balanced and diverse. The imbalance among class sizes caused classifiers to have difficulty predicting classes with fewer samples. Furthermore, a more diverse dataset would better represent a real-world format classification task. We take this into consideration when presenting our conclusions, being clear that this is only a first step in exploring the format classification problem.


5. Conclusion

Through experimental evaluation of multiple state-of-the-art TC techniques, we have shown that learning methods can succeed at the task of format classification.

We have identified multiple classifiers that perform well at the task of format classification. Based on the evaluation scores presented in tables 4.3 and 4.4, Gradient Boost and SVC are the two best-performing classifiers, with Random Forest a close third.

Taking into consideration the training and testing times from tables 4.8 and 4.9, Gradient Boost and Random Forest are preferable over SVC. If a small loss of performance is tolerable, Multinomial Naive Bayes could be considered as it offers extremely short train and test times at only a small cost to performance.

These classifiers scored very similarly to one another, and their performance was comparable to that of the same classifiers in similar text classification tasks. This demonstrates that format classification can be done with existing state-of-the-art TC learning methods.

These findings can help improve news retrieval services, making the tasks involved more efficient, more accurate, and less laborious for humans. This thesis also serves as a first step for further research on the task of format classification, with the hope that our results can be improved upon in the future.

5.1 Future Work

This thesis has evaluated the performance of many state-of-the-art supervised learning techniques in a novel TC task. However, this task is also well suited to semi-supervised and unsupervised learning. Techniques such as clustering and recurrent neural networks [6, 20] could take advantage of the multitudes of unlabeled data available, potentially yielding more accurate results.

Further research should also consider creating a dataset specific to this task.

Creating a problem-specific and balanced dataset with more classes would lend more weight and validity to future research. If the dataset were to be made public, it would allow for easier comparison between different works performed using it.


Bibliography

[1] Basant Agarwal and Namita Mittal. Proceedings of the Second International Conference on Soft Computing for Problem Solving (SocProS 2012), chapter VI: Text Classification Using Machine Learning Methods - A Survey. December 2012.

[2] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, 2009.

[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[4] Ana Cardoso-Cachopo. Improving methods for single-label text categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa, 2007.

[5] Imane Chatri, Nizar Bouguila, and Djemel Ziou. Classification of text documents and extraction of semantically related words using hierarchical latent dirichlet allocation. Master's thesis, Concordia University, March 2015.

[6] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. CoRR, abs/1511.01432, 2015.

[7] Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Transferring naive bayes classifiers for text classification. In Proceedings of the national conference on artificial intelligence, volume 22, page 540. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press, 2007.

[8] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.

[9] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Department of Computer Science, National Taiwan University, April 2010.

[10] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 143–151, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[11] Thorsten Joachims. Machine Learning: ECML-98: 10th European Conference on Machine Learning, Chemnitz, Germany, April 21–23, 1998, Proceedings, chapter Text categorization with Support Vector Machines: Learning with many relevant features, pages 137–142. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.

[12] Dan Jurafsky. Text Classification and Naive Bayes. Lecture slides, Stanford University, December 2015.

[13] Igor Kotenko, Andrey Chechulin, and Dmitry Komashinsky. Evaluation of text classification techniques for inappropriate web content blocking. In 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS). IEEE, September 2015.

[14] Yonghong Li and Anil Jain. Classification of text documents. The Computer Journal, 41, 1998.

[15] Ronny Luss and Alexandre d'Aspremont. Predicting abnormal returns from news using text classification. Quantitative Finance, 15(6):999–1012, 2015.

[16] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[17] Ben J. Marafino, W. John Boscardin, and R. Adams Dudley. Efficient and sparse feature selection for biomedical text classification via the elastic net: Application to ICU risk stratification from nursing notes. Journal of Biomedical Informatics, 54:114–120, 2015.

[18] Signal Media. The signal media one-million news articles dataset. URL: http://research.signalmedia.co/newsir16/signal-dataset.html, 2016.

[19] Naresh Kumar Nagwani and Aakanksha Sharaff. SMS spam filtering and thread identification using bi-level text classification and clustering techniques. Journal of Information Science, 2015.

[20] Kamal Nigam, Andrew McCallum, and Tom Mitchell. Semi-supervised text classification using EM. Semi-Supervised Learning, pages 33–56, 2006.

[21] Gayane Petrosyan, Martin P Robillard, and Renato De Mori. Discovering information explaining API types using text classification. In Proceedings of the 37th International Conference on Software Engineering-Volume 1, pages 869–879. IEEE Press, 2015.

[22] G. Salton, A. Wong, and C.S. Yang. Vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[23] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., New York, NY, USA, 1986.

[24] Abeed Sarker and Graciela Gonzalez. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics, 53:196–207, 2015.

[25] Mugdha Sharma and Jasmeen Kaur. A novel data mining approach for detecting spam emails using robust chi-square features. In Proceedings of the Third International Symposium on Women in Computing and Informatics, WCI '15, pages 49–53, New York, NY, USA, 2015. ACM.

[26] Alper Kursat Uysal and Serkan Gunal. The impact of preprocessing on text classification. Information Processing and Management, 50(1):104–112, 2014.

[27] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 42–49, New York, NY, USA, 1999. ACM.
