
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/099--SE

Predicting Swedish News Article Popularity

Prediktion av Svenska Nyhetsartiklars Populäritet

Ludvig Noring

Supervisor: Marco Kuhlmann
Examiner: Arne Jönsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Ludvig Noring


Abstract

In this work, 132,229 articles from a Swedish news publisher are used to explore news article popularity prediction. Linear, k-Nearest Neighbor and Support Vector Regression are evaluated using the two different metrics root mean squared error and R2. The problem is then relaxed into only attempting to rank the articles relative to each other. The prediction problem is also explored as a classification problem using the classes Low, Mid and High popularity. The classifiers evaluated are Naive Bayes and SVM, using pre-defined features and using a Bag-of-words feature set. The results were analyzed to understand what information they can bring to the editors at the publisher and news agencies in general. The results clearly showed that the manually set metadata newsvalue had a large impact on article performance. A survey was done with editors to compare human prediction performance with the classifier performance. Although the SVM classifier performs with higher accuracy than the editors (59% vs 55%), the models are considered weak in their current state.


Acknowledgments

I want to thank my supervisor at the publisher Gabrielle Lindesvärd for all your help and effort put into this thesis project. I also want to thank Knattarna Dan Gustafsson, Johanna Haglund and Magnus Furugård for your assistance and making me feel welcome at the company.

I would like to thank my supervisor Marco Kuhlmann and examiner Arne Jönsson at IDA, Linköping University, for all your time, help and suggestions. I would also like to thank my opponent Viktor Wällstedt for your comments and feedback on the thesis.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

1 Introduction 2
1.1 Motivation . . . 2
1.2 Aim . . . 2
1.3 Research questions . . . 3
1.4 Related work . . . 3
2 Theory 4
2.1 Predictive Modeling . . . 4
2.2 Classification . . . 6
2.3 Pre-processing . . . 7
2.4 Text Representation . . . 8

2.5 Latent Dirichlet Allocation (LDA) . . . 10

2.6 Spearman’s rank correlation coefficient . . . 10

3 Method 11 3.1 Environment . . . 11

3.2 Data Exploration . . . 12

3.3 Labeling the Data . . . 13

3.4 Pre-processing the Data . . . 13

3.5 Creating LDA Topics . . . 14

3.6 Train and Test sets . . . 14

3.7 Features . . . 14
3.8 Regression . . . 15
3.9 Classification . . . 15
3.10 Editorial Study . . . 16
4 Data Exploration 17
5 Results 20
5.1 How viable are ML regression and classification techniques to predict article performance? . . . 20

5.2 How can NLP techniques be used to improve prediction accuracy? . . . 21


5.4 Editorial Study . . . 28

6 Discussion 29

6.1 Results . . . 29
6.2 Method . . . 32
6.3 The work in a wider context . . . 34

7 Conclusion 35


List of Figures

3.1 Histogram of visits vs. Laplace smoothed log transformed visits . . . 15

4.1 Label distributions . . . 17

4.2 Labels per day . . . 17

4.3 Mean visit and publication count for the authors in the corpora . . . 18

4.4 Header Token Count Distribution for the Labels . . . 19

4.5 Text Token Count Distribution for the Labels . . . 19

4.6 Newsvalue distribution for the labels . . . 19

5.1 Regression performance using all features . . . 22

5.2 Regression performance using all features but newsvalue . . . 23

5.3 Prediction of article rank . . . 24


List of Tables

2.1 Bag-of-Words representation of two sentences . . . 9

3.1 Different corpora . . . 13

3.2 All features collected . . . 14

5.1 Regression without NLP features . . . 21

5.2 Regression with added NLP features . . . 21

5.3 Most-frequent-class classifier baseline . . . 23

5.4 Classification with non-NLP features . . . 25

5.5 Classification with NLP features . . . 25

5.6 Using only unigram and Term frequency - inverse document frequency (tf-idf) . . 26

5.7 N-grams and features combined . . . 26

5.8 Accuracy for the different classifiers on the various article-sets . . . 27

5.9 Most informative features for the combined system using article set NF . . . 27

5.10 Most informative features for the combined system using article set AF60 . . . 28

5.11 Most informative tokens for SVM tf-idf classifier . . . 28


Glossary

n-gram Sequence of n Tokens. 8

BOW Bag-of-words. 9

corpus Large Set of Documents. 8

document One Text or Article. 8

kNNR K-Nearest Neighbor Regression. 15

LDA Latent Dirichlet Allocation. 10

LR Linear Regression. 15

MSE Mean Squared Error. 5

NB Naive Bayes. 6

NLP Natural Language Processing. 4

RMSE Root Mean Squared Error. 5

SNP Swedish News Publisher. 2

SVM Support Vector Machine. 6

SVR Support Vector Regression. 15


1

Introduction

This master's thesis is carried out as a project for a Swedish News Publisher (SNP). The questions they hope to answer with this project will be described in this chapter.

1.1

Motivation

SNP and news agencies in general have two methods of generating income from articles. Subscribers pay for access to articles behind paywalls and publicly available articles are used to generate ad revenues. Since SNP has a diverse selection of articles not all articles might be suitable for the same income-generating method. Deciding whether or not to lock an article behind a paywall can be a hard decision to make for editors. Having too many open articles might result in fewer monthly subscribers. Having too few open articles can result in less viral spread and loss of potential ad revenue. A better understanding of how well an article might perform can help in making that decision. For both editors and authors it is also beneficial to identify what drives article popularity, acting as feedback on how the article’s textual data and metadata affect page visits.

1.2

Aim

The aim of this master's thesis is first to examine how different machine learning techniques perform on predicting news article popularity at SNP using only data available before publication. Second, to identify which features hold most information regarding article popularity, and last to examine how the machine learning models stand in comparison to human editorial prediction performance.


1.3

Research questions

This work will answer the following research questions:

1. How viable are machine learning regression and classification techniques to predict article popularity?

2. How can natural language processing techniques be used to improve prediction performance?

3. Which features drive article popularity?

4. How well do machine learning techniques stand in comparison to human editorial prediction performance?

1.4

Related work

Predicting online popularity has been investigated by several others prior to this master's thesis. Online popularity is a broader term that includes not only news agencies but also social media. The work done by Bandari et al. [2] attempts to predict Twitter retweets. The work by Fernandes et al. [5], Shreyas et al. [18] and Uddin et al. [23] uses the publicly available Mashable article set as its source for predicting article shares. The work by Arapakis et al. [1] attempts to predict the spread (Tweets) and popularity (pageviews) of articles published on Yahoo [25]. Yahoo is a very popular news outlet which provides its own original articles, much like SNP. Twitter and Mashable usually provide more easily digestible, share-friendly content. All sources mentioned in the related work have been on textual content in English. The works done by Tatar et al. [21], Keneshloo et al. [8] and VanCanneyt et al. [24] also attempt to predict online news popularity. They, however, use some data available only after publication, such as visits after 30 minutes. While this gives improved results in terms of accuracy and predictive capability, it gives less value for editors and is not of interest in the scope of this work.

– Bandari et al. [2] used approximately 40,000 tweets from one week in August 2011.

– Arapakis et al. [1] used approximately 13,000 Yahoo news articles collected over two weeks in 2014.

– Uddin et al. [23], Fernandes et al. [5] and Shreyas et al. [18] used approximately 40,000 news articles from a publicly available dataset provided by Mashable, which spans two years from 2013 to 2015.


2

Theory

Machine learning, Natural Language Processing (NLP), regression and classification are all broad terms that are applied in many different fields in many different ways. In this work on predicting article popularity, the theory is divided into two main categories:

1. Predictive modeling in general
2. Methods used when working with text

The two categories are described in order below.

2.1

Predictive Modeling

A real-life phenomenon such as the weather or a student's exam results can be seen as the mathematical representation f(X) = y: a function f() which takes an input vector X of features and generates an answer y. As one can imagine, X can be very complex and all data might not be obtainable depending on the problem one is attempting to explain. In machine learning, the historical data available is used to make an approximation of the function f() given the input vector X.

The machine learning is done once X has been decided upon. In supervised learning, a training set of multiple entries where both the features X and the target value y are known is used to train the model, the function f(). For example, consider a dataset containing the diaries of a set of students together with their exam results. Features such as hours studied and hours slept can be extracted and the function f(slept_h, studied_h) can be created.

In classification modeling, the output y generated will be from a discrete set of labels, e.g. an exam classifier that tries to correctly label students as passing or failing. A regression model will instead generate a continuous value y. An exam regression model will attempt to score each student e.g. on a range from 0 to 100%.
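To make the distinction concrete, the minimal sketch below fits a regression model and a classifier on the student example above. The toy data, variable names and library choices (scikit-learn) are invented for illustration and are not taken from the thesis.

```python
# Regression vs. classification on the student-exam example (toy data).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Features X: hours slept and hours studied per student.
X = np.array([[6, 1], [7, 4], [5, 0], [8, 6], [6, 3], [7, 5]])
score = np.array([42, 71, 30, 95, 58, 80])   # continuous target (0-100%)
passed = (score >= 50).astype(int)           # discrete target: fail (0) / pass (1)

regressor = LinearRegression().fit(X, score)      # regression: predicts a score
classifier = LogisticRegression().fit(X, passed)  # classification: predicts a label

new_student = np.array([[7, 2]])
print(regressor.predict(new_student))   # a continuous exam score
print(classifier.predict(new_student))  # 0 = fail, 1 = pass
```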


2.1.1

Regression

As described in 2.1, regression is about finding the function f() that takes some input X to predict an output y. Described below are three different types of regression: Linear Regression, SVM Regression, and kNN Regression.

2.1.1.1 Linear Regression

Formally linear regression is about finding the linear function:

y = w_0 + x^T w \quad (2.1)

where w_0 is a bias term and w is the weight vector for the features in x. Each feature in x can be either continuous or binary. Linear regression treats each feature as an independent indicator for y. It then finds the weights that create a line which fits the training data with the least possible error. Looking at a two-dimensional scatter plot it is intuitive to see how the line which minimizes the error is drawn. Scaling up the dimensions increases the complexity but the same logic applies.

2.1.1.2 Support Vector Regression

Support Vector Machines are most commonly used as classifiers and will be explained further in section 2.2. They can however also be used as a regression method called Support Vector Regression (SVR). In SVR the hyperplane described in 2.2, used to separate the data, is instead used to predict the target value. This is done by extending the model with a loss function.

2.1.1.3 kNN Regression

kNN is one of the simplest machine learning algorithms. Using the training set it will populate the feature space. Then, when predicting on new data, it will place that data point in the feature space, select the k closest training data points and return their average target value. One aspect of this algorithm is that a predicted value can never be higher than the average of the k highest points and never lower than the average of the k lowest points in the training set.
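A minimal sketch of the bounded-prediction behaviour described above, using scikit-learn's KNeighborsRegressor on invented one-dimensional data:

```python
# kNN regression with k = 3 on toy data (values invented for illustration).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1], [2], [3], [10], [11], [12]])
y_train = np.array([5.0, 6.0, 7.0, 50.0, 55.0, 60.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# The prediction is the mean target of the 3 nearest training points, so it can
# never exceed the mean of the 3 largest targets in the training set (55.0 here).
print(knn.predict([[2.5]]))   # mean of 5, 6, 7   -> 6.0
print(knn.predict([[100]]))   # mean of 50, 55, 60 -> 55.0
```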

2.1.1.4 Regression Evaluation

Evaluating the performance of regression models can be done in various ways. One popular metric is Mean Squared Error (MSE). As the name suggests it calculates the mean squared error over all predictions. It does this by computing the distance from each prediction to the true value and then squaring that error. To finally get MSE the mean of all squared errors is calculated. By taking the squared error, underestimation and overestimation are handled equally and errors quickly increase as they grow larger. The metric Root Mean Squared Error (RMSE) is the square root of MSE. This is to get a score that is on the same scale as the target value and therefore more relatable. The formula used to calculate RMSE can be seen in formula 2.2, where \hat{\theta} is a set of N estimations of the true values \theta.

RMSE(\hat{\theta}) = \sqrt{MSE(\hat{\theta})} = \sqrt{\frac{\sum_{i=1}^{N} (\hat{\theta}_i - \theta_i)^2}{N}} \quad (2.2)

R2, also known as the coefficient of determination, is also a popular evaluation measure for regression models. It is a measure of how well a model fits the data in comparison to the mean. A model always predicting the mean value would get an R2 value of 0. A perfect model would get a score of 1. A model making worse predictions than the mean would get a negative score. R2 is calculated using formula 2.5 from SS_tot and SS_res. SS_tot is the sum of squared distances each true value \theta has to the mean \mu. SS_res is the same squared error seen earlier in formula 2.2.

SS_{tot} = \sum_{i}^{n} (\theta_i - \mu)^2 \quad (2.3)

SS_{res} = \sum_{i}^{n} (\theta_i - \hat{\theta}_i)^2 \quad (2.4)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \quad (2.5)
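As an illustration, the sketch below computes RMSE and R2 both directly from formulas 2.2-2.5 and with scikit-learn's metrics; the numbers are toy values, not thesis data.

```python
# RMSE and R2 computed manually and with scikit-learn (toy values).
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))       # formula 2.2
ss_res = np.sum((y_true - y_pred) ** 2)                # formula 2.4
ss_tot = np.sum((y_true - y_true.mean()) ** 2)         # formula 2.3
r2 = 1 - ss_res / ss_tot                               # formula 2.5

# The library versions agree with the manual formulas.
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(rmse, r2)
```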

2.2

Classification

Classifiers have a discrete output instead of continuous output like the regression algorithms. A famous example of a text classifier is the spam filter which has two outputs, spam or ham. In this section, Naive Bayes (NB) and Support Vector Machines (SVMs) will be described.

2.2.1

Naive Bayes

The NB classifier is a probabilistic classifier which, using formula 2.6, assigns the most probable class ŷ given a feature vector x. It is based on Bayes' theorem and is naive in the way that it assumes all features are independent of each other. This naive approach simplifies learning while still maintaining performance. [16] When training the classifier a set of n features x and a set of K classes C need to be decided on. Then, using this set of features, a feature vector for each entry in the training set is computed and fed to the classifier. Training the model falls under supervised learning, so the correct class is needed together with the feature vector. With this information, the classifier can calculate the probability of each class p(C_k) and the probability of each feature given a class p(x_i|C_k). When the two necessary probabilities have been computed for the whole training set, new unseen feature vectors from the test set can be classified.

\hat{y} = \underset{k \in \{1, 2, ..., K\}}{\arg\max} \; p(C_k) \prod_{i=1}^{n} p(x_i | C_k) \quad (2.6)

2.2.2

SVM

SVM works by finding the hyperplane that best separates the training data. What this means is that even though the original feature space used might not allow for a linear separation of the classes, a feature space which does can always be found. SVM finds the hyperplane that separates the training data with the most margin i.e. the plane where the distance between the two closest points of the respective classes is greatest.

2.2.3

Classifier Evaluation

Classifiers are commonly evaluated using accuracy as well as precision, recall and F1-score. These measures are based on the confusion matrix of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Accuracy is computed over all predictions while the remaining three are computed for each class separately. The different measures are computed as follows:

(15)

2.3. Pre-processing

• Precision - The rate of correct predictions compared to the total number of predictions for a given class.

Precision = \frac{TP}{TP + FP}

• Recall - The coverage of a given class.

Recall = \frac{TP}{TP + FN}

• F1-Score - The harmonic mean between precision and recall.

F1\text{-}Score = 2 \cdot \frac{precision \cdot recall}{precision + recall}

• Accuracy - The total rate of correct classifications.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
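A small sketch of these measures computed with scikit-learn on invented predictions for the Low/Mid/High labels used later in this work:

```python
# Per-class precision, recall, F1 and overall accuracy on toy predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["low", "low", "mid", "high", "mid", "low", "high"]
y_pred = ["low", "mid", "mid", "high", "low", "low", "low"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["low", "mid", "high"]
)
print("accuracy:", accuracy_score(y_true, y_pred))
for label, p, r, f in zip(["low", "mid", "high"], precision, recall, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```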

2.3

Pre-processing

Natural Language Processing brings a set of additional problems to the machine learning world. Since language is complex and unstructured, simply feeding raw text into an algorithm is not possible. Text and language need to be represented in a numeric way compatible with the algorithms. Pre-processing can be divided into two steps, segmentation and normalization, which are described below.

2.3.1

Text Segmentation

Text is usually stored as long strings of characters. It is one single chunk of data that the computer reads byte by byte or character by character. This type of format works well when processing numbers or perhaps images. For the machine to have any chance of interpreting the long strings of characters, one first needs to split up the text into segments.

2.3.1.1 Tokenization

Most NLP techniques depend on documents or sentences being broken down into tokens. The segmentation is mostly performed on white space or some form of punctuation. However, this is not always enough. The purpose of the segmentation is to divide the characters of a sentence into its building blocks. For example, "I live in Mexico" and "I live in New Mexico" only differ in what location I live in and should reasonably be segmented into [I, live, in, Mexico] and [I, live, in, New Mexico] respectively. Each segment is referred to as a token.

United States is made up of two words but would arguably be considered one single token. Used in a sentence it is one single block of information to the reader, and splitting it up into two individual tokens only confuses the information it is conveying.

The contraction Don't is made up of two words and could be split up into the two tokens Do and n't. A good tokenizer cannot create tokens solely from whitespace. It requires more information about the language and the world to create meaningful tokens.
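A minimal tokenization sketch with NLTK, the library used later in this work; the example sentences are the ones from the text, and the punkt tokenizer models must be downloaded first.

```python
# Tokenization with NLTK (newer NLTK versions may also need the 'punkt_tab' resource).
import nltk
nltk.download("punkt", quiet=True)

from nltk.tokenize import word_tokenize

print(word_tokenize("I live in New Mexico."))
# ['I', 'live', 'in', 'New', 'Mexico', '.']
# A plain tokenizer still splits 'New Mexico' into two tokens; treating it as one
# unit requires extra knowledge, as discussed above.

print(word_tokenize("Don't worry."))
# ['Do', "n't", 'worry', '.']  -- the contraction is split into two tokens
```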


2.3.1.2 N-grams

Multiple tokens can be joined together to create n-grams. An n-gram is a sequence of n consecutive tokens. They are used for various applications such as word suggestions on mobile keyboards. N-grams of length one are referred to as unigrams, of length two as bigrams and of length three as trigrams.

Splitting up the sentence above into bigrams would create the following segments: [(I, live), (live, in), (in, Mexico)]. Special start and end tokens can be added to the document, which would create the following segments: [(START, I), ..., (Mexico, END)].

Why bigrams or trigrams can be useful is explained further in section 2.4.
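A short sketch of bigram creation with NLTK's ngrams helper, using the example sentence above:

```python
# Building bigrams, with and without explicit start/end tokens.
from nltk.util import ngrams

tokens = ["I", "live", "in", "Mexico"]

print(list(ngrams(tokens, 2)))
# [('I', 'live'), ('live', 'in'), ('in', 'Mexico')]

padded = ["START"] + tokens + ["END"]
print(list(ngrams(padded, 2)))
# [('START', 'I'), ('I', 'live'), ('live', 'in'), ('in', 'Mexico'), ('Mexico', 'END')]
```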

2.3.2

Text Normalization

After the text has been segmented, the different tokens can be normalized to convey much of the same information in a simpler format. Three common ways to normalize the text are lowercasing, stop word removal and stemming, which are described below.

2.3.2.1 Lowercasing

Lowercasing creates a case-insensitive interpretation of the corpus by simply making all the characters in the corpus lowercase. This is to reduce the size of the problem search space. The strings Floor and floor would otherwise be interpreted as two separate words; all the information the model has about floor would not be transferred to the token Floor. Making all characters lowercase solves this problem and helps improve the model performance.

2.3.2.2 Stopping

Stop words are words, or tokens, which convey little to no information to the application such as the, is or a. Stop words usually occur uniformly across the corpus and removing them helps reduce the search space and improve the model performance. Stop words are handled in different ways in NLP applications. One way is to simply remove them. Removing stop words before segmenting the corpus into n-grams will create token groups not actually seen in the documents. The resulting n-grams will still convey information but might lose some of their contextual meaning.

2.3.2.3 Stemming

The texts can be normalized further using a technique called stemming. This means that all words are reduced to their morphological stem. [19] Fishing, fished, fisher would all be reduced to the stem fish. Fish is the stem to which you add some suffix in order to create multiple different words. Most tokens can easily be truncated to reach their stem, but for example the tokens dry and dries require some additional logic. Stemming tokens in a corpus by truncation can create words not actually present in the language. That is of no issue, however, since the point of the stemming is to group multiple tokens together.

This normalization step is more complex than the previous two, and a stemmer trained on the language in question needs to be used to properly stem most tokens. There exist pre-trained stemmers ready to use, such as the Snowball Stemmer [19].
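A small stemming sketch with NLTK's Snowball Stemmer; the English words mirror the examples above, the Swedish words are invented examples, and the exact outputs depend on the stemmer version.

```python
# Stemming with NLTK's Snowball stemmer (English and Swedish variants).
from nltk.stem.snowball import SnowballStemmer

english = SnowballStemmer("english")
print([english.stem(w) for w in ["fishing", "fished", "fisher"]])
# ['fish', 'fish', 'fisher']  -- not every form collapses to the same stem

swedish = SnowballStemmer("swedish")   # the variant used in this work
print([swedish.stem(w) for w in ["nyheterna", "artiklarna"]])
# e.g. ['nyhet', 'artikl']
```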

2.4

Text Representation

There are multiple methods to convert our human-readable texts to numbers or vectors which can be used in machine learning algorithms.


2.4.1

One-hot & Bag-of-words

One-hot encoding is one of the simplest methods of representing a word. It works by first creating a vocabulary of all unique words in the corpus. Afterward, a vector can be created where each index maps to a unique word in the vocabulary. To represent a word using this vector one would set its corresponding index to 1 and leave the rest of the values at 0. The dimensionality of the vector would be the same as the number of unique words in the vocabulary. Usually, the vocabulary size can be tens of thousands of tokens, resulting in very sparse vectors. One-hot encoding is simple and does not take any contextual information into consideration. There is no information about the relations or similarity of different words. Both of these issues are somewhat mitigated in the pre-processing steps presented in 2.3.2. Using this encoding, a sentence would require stacking one of these vectors for each word into a matrix, resulting in high memory usage.

The Bag-of-words (BOW) model is the sum of all one-hot encoded word vectors of a document. The resulting vector contains information about which words occurred but not in what order. The frequency of the words can be discarded in a boolean BOW model. Having only a single vector for each document is computationally advantageous. The sparsity is still very large, however, since a document will only contain a fraction of the words in the vocabulary. A small example of the BOW model is shown in Table 2.1. In the table the two sentences "Ludvig likes to write a report" and "I would like to report a crime to the police" are represented as BOW model vectors.

Table 2.1: Bag-of-Words representation of two sentences

Index  Word     Sentence 1  Sentence 2
0      likes    1           0
1      report   1           1
2      would    0           1
3      police   0           1
4      a        1           1
5      write    1           0
6      crime    0           1
7      the      0           1
8      Ludvig   1           0
9      to       1           2
10     I        0           1
11     like     0           1

As mentioned, the BOW model will strip all words of their context. This is not always desired, so to keep some level of context n-grams are used. This will use longer sequences of tokens, keeping their neighboring words. Increasing the number of words in an n-gram increases the contextual information for each word. It also creates more unique n-grams in the vocabulary, resulting in a larger search space for the model.
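A sketch of a BOW representation similar to Table 2.1, built with scikit-learn's CountVectorizer; this is one possible implementation, not necessarily the one used in the thesis.

```python
# Bag-of-words counts for the two example sentences in Table 2.1.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Ludvig likes to write a report",
        "I would like to report a crime to the police"]

# The token pattern is relaxed so one-letter tokens such as 'a' and 'I' are kept,
# matching the table above; ngram_range=(1, 2) would add bigrams as well.
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"\b\w+\b")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the vocabulary
print(bow.toarray())                         # counts per document; 'to' is 2 in sentence 2
```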

2.4.2

Term Frequency–Inverse Document Frequency (tf-idf)

The problem with the bag-of-words model described above is that the importance of each word is not captured. General words that are commonly used in all texts give very little actual information. This problem is somewhat mitigated when removing stop words from the corpus in the pre-processing step. However, words that are very rare but only tend to appear in one particular class hold much information for the classifier. These words should arguably be emphasized further. Conversely, words that occur across many classes should be weighted less. Tf-idf does this by calculating weights for every word in a document. A high weight indicates that the word is a strong indicator for that particular document's class. [15] Tf-idf is calculated using formula 2.7. Tf-idf for a given word w occurring in document d is the product of the term frequency of w in d and the log quotient of the total count of documents N divided by the count of documents containing w, n_w.

\text{tf-idf}(w, d) = tf(w, d) \cdot \log \frac{N}{n_w} \quad (2.7)
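The sketch below computes formula 2.7 directly on a toy corpus and contrasts it with scikit-learn's TfidfVectorizer, which by default uses a smoothed idf and L2-normalizes each document vector, so its numbers differ slightly from the plain formula.

```python
# Plain tf-idf per formula 2.7 next to scikit-learn's TfidfVectorizer (toy corpus).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the match ended in a draw",
        "the puck hit the ice",
        "the stock market fell"]

def tf_idf(word, doc, corpus):
    tf = doc.split().count(word)                     # term frequency tf(w, d)
    n_w = sum(word in d.split() for d in corpus)     # documents containing w
    return tf * np.log(len(corpus) / n_w)            # formula 2.7

print(tf_idf("puck", docs[1], docs))   # rare word -> higher weight
print(tf_idf("the", docs[1], docs))    # word in every document -> weight 0

tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray().round(2))
```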

2.5

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is an unsupervised topic model technique. [3] LDA will make a statistical analysis of the corpus to gather the word distribution across the different documents and explain this distribution with a given number of latent variables, or topics. Each topic will have a probability distribution of words, and by examining a topic's most probable words an attempt can be made to label the topics. One topic might have high probabilities for words such as puck, ice and stick and could reasonably be labeled as a hockey-related topic. Each document will not be associated with a single topic but instead with many topics to different degrees. The number of topics LDA will use is manually set and should be set according to the data. With too few topics, the topics will be less distinct and contain words from what could be considered different topics. With too many topics, they might become too similar and words will be unnecessarily split amongst many different topics.

2.6

Spearman’s rank correlation coefficient

The spearman’s rank correlation coefficient, or spearman’s rho, is a measure of how well two variables can be described using a monotonic function. Spearman’s rho does not consider how linear the relationship between the two variables is, a perfect spearman’s rho of 1 or ´1 occurs when one variable is a perfect monotonic function of the other. Spearman’s rho of the two variables Xiand Yiis calculated using formula 2.8 where n is the number of observations given that all ranks X and Y are distinct integers.

ρ=1 ´6 ř

(Xi´Yi)2
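A short sketch with SciPy's spearmanr, showing that a monotonic but non-linear relation still yields a perfect coefficient:

```python
# Spearman's rho of a non-linear but perfectly monotonic relation.
from scipy.stats import spearmanr

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # y = x**2: non-linear, yet monotonic in x

rho, p_value = spearmanr(x, y)
print(rho)                      # 1.0
```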


3

Method

This section will go through the method used to answer the research questions stated in the introduction (1.3). The development of the prediction models can be divided into a number of tasks that are listed below. For each of these tasks, a section will explain it more thoroughly in order to provide the relevant information.

– Collecting and filtering of the articles
– Labeling of articles
– Pre-processing articles
– Dividing into train and test sets
– Deciding on a feature set
– Regression
– Classification
– Editorial Study

3.1

Environment

All work was done in Python 3.7 [13], mostly using Jupyter Notebooks [7]. The machine learning algorithms used were imported from sklearn [17] and NLTK [10]. The datasets were processed using the pandas library [12]. All plots were made using the library matplotlib [9].

In this work, the evaluation considers neither running time nor memory usage. Memory usage has been kept at levels that allow development on an average personal computer. To reproduce this work no particular computer specs are required. The dataset, however, is not publicly available but will be described to some extent, so as not to give away any private company information.


3.2

Data Exploration

All raw article data was provided by SNP. The dataset used contains 132,229 articles published by SNP, spanning 670 days from 2016-01-02 to 2017-11-01. Compared to the datasets in related work shown in section 1.4, the dataset provided by SNP is larger in size and should suffice to create comparable results.

The following data and metadata could be retrieved from the dataset:

• Article ID - Unique ID for each article
• Publish Date - Day and time on which the article was published
• Author ID - Unique ID for the article's author
• Article Text - Raw text content of the article
• Visits - Number of visits 24 hours after publication
• Newsvalue - Number set by the editor. In range [10, 20, ..., 90, 100]
• News lifetime - Number set by the editor. In range [20, 40, 60]

The various headlines for each article were stored in the following fashion:

• Article ID - Maps to the corresponding article
• Headline - The headline as it appeared on SNP
• Date - Date and time when the headline was updated

Most of these columns are static once an article is published but newsvalue and headline change during the editorial adjustments afterward. Newsvalue is usually decreased after hours or days depending on the article. A celebrity death announcement will get published with a very high newsvalue but will very quickly lose its newsvalue. The newsvalue stored will be the last set value, therefore almost no articles will have a maximum newsvalue of 100. The newsvalue will affect the article placement on the site giving more or less exposure to certain articles. This will in turn likely affect visits as well.

The article headline is also often changed after an article is published. It will be published with a certain headline and then be adjusted or changed completely. To pick the headline that has been exposed to most users one would need to have information about when each article visit occurred. Since this data was not available during this work the headline will be picked based on exposure time instead. All of the article headline updates are stored with the time and date of when the change was made together with the new headline. This allows one to compute the exposure time for each headline. Since most article visits happen early after publication the article headline with the most exposure time during the first six hours is picked.

Since SNP is primarily interested in predicting engagement for their quality articles, the articles in the dataset are filtered. One filter applied is based on authorID. Out of SNP's articles many are imported from news distributors such as TT [22] and Omni [11]. Since they are external news sources, one would expect the headline and body to differ from SNP's internal articles. Therefore these imported articles are filtered out by excluding articles with authorID ∈ {tt, direkt, omni, omninext, omni-ekonomi}.
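A sketch of this author filter, assuming the articles are loaded into a pandas DataFrame with an authorID column; the column and variable names are assumptions and are not taken from the thesis code.

```python
# Filtering out articles from external news distributors (assumed schema).
import pandas as pd

external_authors = {"tt", "direkt", "omni", "omninext", "omni-ekonomi"}

def author_filter(articles: pd.DataFrame) -> pd.DataFrame:
    # Keep only articles whose authorID is not one of the external sources.
    return articles[~articles["authorID"].isin(external_authors)]
```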


In Table 3.1 a statistical overview of the corpora can be seen. After the filtering mentioned above has been applied, the corpus is trimmed to a much smaller size. This corpus size is very similar to the one used in the related work on the Mashable dataset by Fernandes et al. [5], Shreyas et al. [18] and Uddin et al. [23].

One additional filtering is done based on newsvalue. Articles with a higher newsvalue will be featured higher on the site, gaining more exposure and therefore a higher likelihood of more visits. By extracting a subset of articles where newsvalue = 60, some exposure differences between the articles are mitigated. The resulting article set NF60 has a similar visit distribution as the original unfiltered article set while still containing the largest number of articles. Most articles have all metadata available, some however do not. When creating the corpora, only articles with all metadata available are used, i.e. no fields will be blank.

In this work, all article-sets will be evaluated and results will be compared.

Table 3.1: Different corpora

                                                          Articles per Day
Alias  Filtered On          Number of Articles  Total Days  Mean  Std  Min  Max
NF     No Filter            123372              656         188   43   84   288
AF     Author               37043               656         56    13   23   102
NF60   Newsvalue            18325               656         28    8    9    54
AF60   Author & Newsvalue   12228               656         18    6    4    36

3.3

Labeling the Data

The articles are stored together with their visits 24 hours after publication. The same duration was used in the work by Keneshloo et al. [8]. Visits are the measure of how many viewing sessions an article has had. Unlike raw page views, visits do not increase if the page is refreshed or closed and reopened. However, closing and reopening it after a period of time will result in an additional visit. Both the regression and the classification models in this thesis will use visits as the target value. The regressors will fit using the exact number of visits while the classifiers will use three discrete labels: low, mid and high. The thresholds are based on the editorial opinion on what qualifies as a low, mid and high performing article.

3.4

Pre-processing the Data

The pre-processing of the data can be split up into two categories, text segmentation and text normalization.

3.4.1

Text Segmentation

The first step in processing the corpus is dividing the raw text into tokens as described in 2.3.1.1. In this work this is done using the Python library NLTK [10]. It offers functionality to segment text into sentences and also into individual tokens. It does this by using a Swedish punctuation list combined with a series of regular expressions.

3.4.2

Text Cleaning & Normalization

The corpus is normalized by lowercasing all documents. The NLTK library provides a list of Swedish stop words, which is used to remove stop words from the documents. Furthermore, the corpus is normalized by stemming all words using the Snowball Stemmer [19], also available in the NLTK library. The Snowball Stemmer uses lists of suffixes and a set of rules specifically made for Swedish to stem the words.

3.5

Creating LDA Topics

After the pre-processing of the corpus is done, LDA topics are computed for each document. The number of topics in the LDA algorithm was set to 100. This value was chosen based on word clouds generated from iterations of LDA with different numbers of topics, where each word cloud shows the most relevant words of one topic. Using 100 topics gave the best results in creating topics with distinct themes. When performing LDA, a 100-dimensional vector with one value between 0 and 1 per topic is created for each document. A document is assigned the topic with its highest LDA topic value. The average visit count for each topic is calculated and the top-visited topics determined. A document then has a closeness to each top topic based on its LDA vector value for that particular topic.

A group of people working at SNP manually set human-readable names for most LDA topics.
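A sketch of the topic-modelling step using scikit-learn's LatentDirichletAllocation with 100 topics; the thesis does not state which LDA implementation was used, and the placeholder input strings merely stand in for the pre-processed article bodies.

```python
# LDA with 100 topics and per-document topic vectors (one possible realisation).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder strings; the real input is the pre-processed article bodies.
texts = ["pucken gick i mål i sista perioden", "börsen föll kraftigt under dagen"]

counts = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=100, random_state=0)   # 100 topics, as above
doc_topics = lda.fit_transform(counts)    # one 100-dimensional topic vector per document

main_topic = doc_topics.argmax(axis=1)    # each document's single strongest topic
```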

3.6

Train and Test sets

In the work of Arapakis et al. [1] they compare two different ways of training and evaluating their models. First, they use ten-fold cross-validation. This creates train and test sets without regard to article publication date. Training on articles from the future could be considered cheating and give a positive bias. They compare the ten-fold cross-validation results to a train and test split where the date was taken into consideration, i.e. where the models can only train on articles published before their test set. They report that there was no considerable bias in letting the models train on future articles. In this work, articles were split randomly into train and test sets of equal size.

3.7

Features

A total of 38 features were collected from the dataset. The features are divided into three categories, time, metadata and NLP, as seen in Table 3.2. Features such as token count and polarity are extracted from the header and body separately. The models will first be evaluated using only the non-NLP features. They will then be compared to models trained on all features to observe any change in performance.

Table 3.2: All features collected

Time features
  Day of Year - Numeric (1)
  Day of Week - Numeric (1)
  Hour of Day - Numeric (1)

Metadata features
  Author - Categorical (1)
  Newsvalue - Numeric (1)
  News lifetime - Numeric (1)

NLP features
  Sentences - Count (1)
  Polarity - -1, 0, 1 (2)
  Nouns, Verbs, Adjectives, Adverbs - Count (8)
  Nouns, Verbs, Adjectives, Adverbs - Ratio (8)
  Named Entities (PER, ORG, LOC) - Count (6)
  Total Named Entities - Count (2)
  LDA Topic - Categorical (1)


Figure 3.1: Histogram of visits vs. Laplace smoothed log transformed visits

3.8

Regression

After the data has been pre-processed and split up as described in the previous sections, a set of regressors will be evaluated. Similar to [1] Linear Regression (LR), K-Nearest Neighbor Regression (kNNR) and Support Vector Regression (SVR) will be evaluated.

Using the features presented in Table 3.2 the different kinds of regression models are fitted. The visits are log-transformed to more closely resemble a normal distribution and Laplace smoothed to allow the log transform of articles with 0 visits. In Figure 3.1 the resulting visit distribution can be seen.
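A minimal sketch of this target transform, assuming the Laplace smoothing is the usual add-one and that the natural logarithm is used (the base is not stated in the thesis):

```python
# Add-one smoothing followed by a log transform of the visit counts.
import numpy as np

visits = np.array([0, 3, 120, 4500, 98000])
log_visits = np.log(visits + 1)     # equivalently np.log1p(visits); 0 visits maps to 0
print(log_visits.round(2))          # [ 0.    1.39  4.8   8.41 11.49]
```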

First, a model using only the non-NLP features will be evaluated. Afterward, the NLP features are added and the change in performance can be observed.

The fitted models will be evaluated by their R2 value and RMSE and compared to a baseline model which only predicts the corpus' mean visit value.

Also, as Arapakis et al. [1] mention in their work, a news agency is mostly interested in how well articles will perform relative to each other. Therefore a model only needs to correctly rank the articles relative to each other, i.e. predicting the exact number of visits is not as important. The models will be evaluated on how well they perform in ranking the articles and scored using Spearman's rank correlation coefficient.

3.9

Classification

The news popularity problem is also approached as a classification problem. Using the labels described in 3.3, different classifiers are trained. Similar to the ranking evaluation described in 3.8, the exact value of visits is not as important for the classifier. It only needs to correctly assign the articles to the three different labels, which relaxes the problem while still giving valuable results.

The classifiers evaluated in this work are Naive Bayes and SVM. The features used to train the classifiers can be split up into two categories: predefined features (the same features as for the regressors) and n-grams as features. When the classifiers are trained using n-grams as features, they train on a BOW representation of each document as described in 2.4. The classifiers will also be evaluated with the extension of a tf-idf table as described in 2.4. To reduce the search space only the 2000 most frequent n-grams are kept in the vocabulary.
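A sketch of how such n-gram classifiers could be assembled as scikit-learn pipelines; the thesis builds its combined classifier on NLTK's Naive Bayes base, so the estimators and settings below are assumptions for illustration, not the thesis's exact implementation.

```python
# BOW and tf-idf classifier pipelines limited to the 2000 most frequent n-grams.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

bow_nb = Pipeline([
    ("bow", CountVectorizer(max_features=2000, ngram_range=(1, 1))),  # unigram BOW
    ("clf", MultinomialNB()),
])

tfidf_svm = Pipeline([
    ("bow", CountVectorizer(max_features=2000, ngram_range=(1, 1))),
    ("tfidf", TfidfTransformer()),   # the tf-idf weighting extension
    ("clf", LinearSVC()),
])
# Usage, assuming pre-processed texts and Low/Mid/High labels:
# bow_nb.fit(train_texts, train_labels); bow_nb.predict(test_texts)
```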


All classifiers will use the same pre-processing as described in 3.4. They will all be measured in precision, recall, F1-score and accuracy as described in 3.9.1.

Finally, a classifier using a subset of the features in Table 3.2 together with BOW will be evaluated. The classifier is built using the Naive Bayes base from the NLTK library.

3.9.1

Classifier Evaluation

The predictions of the classifier are matched against the correct labels to compute the measures precision, recall and F1-score for each class. The accuracy of each model will also be presented. The different measures are explained in section 2.2.3.

3.10

Editorial Study

To see how the machine learning models perform in comparison to humans an editorial study will be performed. In the study, two editors from SNP will attempt to classify 56 articles using the same Low, Mid, High labels, counting 28, 19 and 9 of each label respectively. The editors will be presented with the article headline and preamble. They also know what year the article was released. The articles picked for the study are from the earlier portions of the dataset. This is to ensure that both editors have not worked with any of the articles in the study. The editors will be evaluated in accuracy, precision, recall and F1-score. How they rank all articles will not be evaluated in this study.

The editors will not be evaluated on regression or ranking of the articles because of time constraints.


4

Data Exploration

In this chapter, the dataset is explored and presented in more detail. This is to help understand the results in the following chapter and to allow for discussion afterward.

4.0.1

Labels

Depending on how the articles are filtered, the label distribution varies noticeably. Figure 4.1 shows that the unfiltered dataset follows a power-law distribution. This is in line with other related work and internet popularity in general [4]. As the dataset is filtered on either author or newsvalue, the label distribution becomes more linear. When using both filters the label distribution turns completely linear.

As seen in Figure 4.2 the label count for all articles keeps a stable variance and mean throughout the two years. Some exceptions can be seen, such as a small gap in the middle of 2016, a decrease in articles during the summer and spikes in high performing articles on breaking news events such as the terror event in Stockholm in April 2017.

Figure 4.1: Label distributions (Low/Mid/High counts for the article sets NF, AF, NF60 and AF60)

Figure 4.2: Labels per day (log article count per day for each label, 2016–2017)


Figure 4.3: Mean visit and publication count for the authors in the corpora

4.0.2

Features

In this section, the label distribution (low, mid, high) for some features is visualized to give a better understanding of how they might help give information to the models.

Most authors have very few publications while a select few have many. There is no significant difference between the different subsets of articles. As seen in Figure 4.3, most authors are clustered around the bottom left, with few publications and low mean visits for their articles. Some outliers stand out, however, with many publications and some with exceptionally popular articles.

The remaining figures show how the label distribution changes for each feature. Most features differ very little between the labels, giving little information for predicting the article label. One feature that stands out is newsvalue, showing a very drastic inverse correlation between the low label and the mid and high labels.

In Figure 4.4 the header token count distribution for the labels can be seen. For the mid and high performing articles the token count follows a normal distribution. For the less popular articles, the distribution follows more of a beta distribution, tending to contain fewer header tokens.

Figure 4.5 shows the token count for the whole article body. Similar to the header token count, articles labeled as low differ from the other two labels by containing fewer tokens. All three labels peak in article count at around 500 tokens.

Figure 4.4: Header Token Count Distribution for the Labels

Figure 4.5: Text Token Count Distribution for the Labels

In Figure 4.6, where the feature newsvalue's distribution is described, one can clearly see how low newsvalue correlates with low labeled articles. As the newsvalue increases to 60, the mid and high articles steadily rise in frequency. After 60 the low labeled articles drop drastically below the other two labels. The same trend can be seen for both article-sets NF and AF. The article-sets NF60 and AF60 are not displayed since they only contain articles with newsvalue = 60.

Figure 4.6: Newsvalue distribution for the labels


5

Results

This chapter will present the results gathered. Each research question's corresponding experiments are presented in order. The results will later be discussed and analyzed in the following discussion chapter.

5.1

How viable are ML regression and classification techniques to predict

article performance?

This question will be answered with the help of the results from all research questions. To begin, regression and classification models are evaluated using only a subset of the complete feature set.

5.1.1

Regression

Fitting and testing the regression models as described in section 3.8 yields the results seen in Table 5.1. Linear Regression performs best for all article-sets, both regarding RMSE and R2. Compared to the mean-predicting baseline, the regressors perform best on the largest article set NF. They also show improvement over the baseline on article set AF. On article sets NF60 and AF60, however, they show very little improvement or even perform worse than the baseline.

5.1.2

Classification

Training classifiers as described in 3.9, using the same non-NLP features as the regressors above, yields the results in Table 5.4. Precision, recall and F1-Score are computed for all labels, for all article subsets. The best F1-Score between the two classifiers is marked in bold. The results can be compared to the baseline results seen in Table 5.3.

The accuracies for all classifiers are summarized in Table 5.8. Marked in gray are the results which are equal to the baseline and results marked in green surpass the baseline.


Both classifiers perform best at predicting the low labeled articles, while struggling with the mid and high labels in comparison. NB has its highest accuracy on article set NF with 71%. On the other article-sets NB has accuracies between 44% and 52%. SVM varies less in its accuracy, with accuracies between 54% and 59%. SVM scores one percentage point higher accuracy than the baseline on article-sets AF and AF60.

5.2

How can NLP techniques be used to improve prediction accuracy?

To answer this research question three experiments are carried out:

– NLP features are added to the feature set used by the regressors and classifiers in the previous section

– Different classification approach using a BOW classifier

– Combining the two classifiers above to use both BOW and the predefined feature set

5.2.1

Added NLP Features

Both the regressors and the classifier are extended with the NLP features seen in Table 3.2.

5.2.1.1 Regression

Expanding the feature set by adding the NLP features yields the results seen in Table 5.2. The best scores for each article set are marked in bold. Adding the NLP features gives a very minor performance gain for LR while decreasing performance for KNN and SVR. LR's RMSE decreases by approximately 0.02 across all article-sets while SVR increases its error by 0.35 on article set NF.

Table 5.1: Regression without NLP features

       Baseline      KNN            LR            SVR
Set    RMSE  R2      RMSE  R2       RMSE  R2      RMSE  R2
NF     1.59  0       1.07  0.54     1.02  0.59    1.10  0.52
NF60   1.02  0       1.08  -0.12    0.99  0.05    1.05  -0.06
AF     1.47  0       1.22  0.31     1.14  0.39    1.34  0.17
AF60   1.02  0       1.08  -0.12    0.99  0.05    1.07  -0.09

Table 5.2: Regression with added NLP features

       Baseline      KNN            LR            SVR
Set    RMSE  R2      RMSE  R2       RMSE  R2      RMSE  R2
NF     1.59  0       1.10  0.52     0.99  0.61    1.45  0.17
NF60   1.02  0       1.05  -0.07    0.97  0.09    1.01  0.02
AF     1.47  0       1.23  0.29     1.12  0.42    1.45  0.03
AF60   1.02  0       1.05  -0.07    0.97  0.09    1.02  0

The plots seen in Figures 5.1 and 5.2 show the true visits compared to the predicted visits for the LR regressor. By ordering the articles in the test sets by their true visits and then plotting true versus predicted visits, the performance of the regressors can be visualized. To explain further, each point is placed based on an article's true visits and what the regressor predicted. The color is set based on the article's newsvalue feature. The dashed lines mark the label thresholds for the labels mid and high. The data shown is a random sample of 5000 articles out of the test set.

Figure 5.1: Regression performance using all features

In the first figure (5.1) the model has been trained on all features. All articles with a low newsvalue are predicted to roughly the same low amount of visits. Many of these articles have a low amount of visits in reality but a large part of the articles perform better than what the model predicts. The model then predicts higher visits as the newsvalue increases which clearly is in line with the true visits.

The right subplots in Figure 5.1 show the regressor training and predicting on articles with newsvalue 60 only. Both true visits and predicted visits show less variance. A very slight tilt to the predicted visits following the curve of true visits can be noticed.

In the second figure (5.2) the model has been trained on all features except newsvalue. As expected, the right subplots look the same as in 5.1. However, the left subplots differ quite a bit. On set NF the low to mid-range newsvalue articles are predicted to roughly the same low visit count. As the true visits, and newsvalue, increase for the articles, the regressor follows with its predictions, albeit very slightly. On article set AF the same tilt can be seen as on NF60 and AF60 but with greater variance.

Figure 5.2: Regression performance using all features but newsvalue

5.2.1.2 Rank Prediction

In Figures 5.3 and 5.4 the true ranking versus the predicted ranking of the articles is shown. The rank is in decreasing popularity, with rank 0 being the most visited article. A perfect model would create a diagonal line. Going from top to bottom the predicted article popularity increases. Going from right to left the true article popularity increases. Every article placed under the line the regressor overestimates, and every article placed over the line the regressor underestimates. The rankings performed by the regressor are scored using Spearman's rank correlation coefficient, which can be seen above each subplot.

In Figure 5.3 the same layering by newsvalue as in Figure 5.1 can be seen in the regressor's predictions. The low ranked articles (high popularity) are centered more tightly around the diagonal, while as article popularity decreases the regressor struggles more. On article set AF a lot of articles are spread along the y-axis at the left border. These are all the articles the regressor greatly underestimated. The right subplots show weaker ranking performance, but still better than random. A completely random ranker would get a Spearman's coefficient close to 0.

5.2.1.3 Classification

The results for the classifiers with the added NLP features can be seen in Table 5.5. With the added features the NB classifier shows equal or slightly reduced performance on the article-sets. The same effect can be seen when the features are added to the SVM classifier, except on article set NF where accuracy increases from 59% to 83%.

Table 5.3: Most-frequent-class classifier baseline

Set    Class  Precision  Recall  F1-Score
NF     Low    0.84       1.00    0.91
NF60   Low    0.59       1.00    0.74
AF     Low    0.58       1.00    0.73


Figure 5.3: Prediction of article rank


Table 5.4: Classification with non-NLP features

              Naive Bayes                   SVM
Set    Class  Precision  Recall  F1-Score   Precision  Recall  F1-Score
NF     Low    0.92       0.79    0.85       0.97       0.64    0.78
       Mid    0.27       0.24    0.25       0.55       0.00    0.00
       High   0.13       0.40    0.19       0.12       0.94    0.22
NF60   Low    0.63       0.66    0.64       0.64       0.82    0.72
       Mid    0.31       0.15    0.21       0.38       0.30    0.34
       High   0.16       0.34    0.21       0.00       0.00    0.00
AF     Low    0.70       0.63    0.66       0.62       0.95    0.75
       Mid    0.28       0.32    0.30       0.36       0.02    0.04
       High   0.37       0.42    0.40       0.40       0.25    0.30
AF60   Low    0.57       0.69    0.62       0.53       1.00    0.69
       Mid    0.34       0.19    0.25       0.77       0.01    0.02
       High   0.20       0.22    0.21       0.00       0.00    0.00

Table 5.5: Classification with NLP features

              Naive Bayes                   SVM
Set    Class  Precision  Recall  F1-Score   Precision  Recall  F1-Score
NF     Low    0.92       0.77    0.84       0.89       0.95    0.92
       Mid    0.21       0.23    0.22       0.39       0.01    0.01
       High   0.15       0.45    0.22       0.27       0.51    0.35
NF60   Low    0.64       0.65    0.64       0.64       0.87    0.73
       Mid    0.32       0.13    0.18       0.63       0.00    0.01
       High   0.15       0.39    0.22       0.21       0.37    0.27
AF     Low    0.71       0.67    0.69       0.60       0.96    0.74
       Mid    0.33       0.27    0.30       0.37       0.10    0.16
       High   0.28       0.41    0.34       0.38       0.00    0.00
AF60   Low    0.56       0.63    0.59       0.53       0.99    0.69
       Mid    0.34       0.16    0.21       0.46       0.02    0.04
       High   0.20       0.34    0.25       0.00       0.00    0.00

5.2.2

N-gram Classifiers

Instead of using the pre-defined feature set, a BOW approach as described in section 3.9 is evaluated. Each classifier is also extended with a tf-idf table. To reduce table size and the amount of presented data, only the F1-Score is reported in Table 5.6; the accuracies can be seen together with the other classifiers in Table 5.8.

The NB BOW classifier shows increased accuracy on all article-sets compared to the NB classifier with the pre-defined features. The largest accuracy increase compared to NB NLP can be seen on article set NF60, where the accuracy rises from 47% to 58%. On article set AF60 the accuracy is equal to the baseline, and on the other sets the classifier still has an accuracy lower than the baseline. The extension with a tf-idf table causes the model to only predict the most frequent class, and it therefore ties with the baseline on all article-sets.

The SVM BOW classifier does not show increased performance compared to fitting on the pre-defined features, except on article set NF where the classifier's accuracy is equal to the baseline. With the extension of a tf-idf table the classifier increases in accuracy on all article-sets except NF.


5.2.3

Combined N-Gram with Predefined Features

Appending some features to the n-gram classifiers yields the results presented in Table 5.7. The features added are listed below.

– Newsvalue
– LDA Topic
– Author
– Text entity count
– Text polarity
– Headline polarity

The combined classifier shows low accuracy but a relatively decent all-round performance on all article-sets and labels.

Table 5.6: Using only unigrams and tf-idf (F1-Score)

Set    Class  NB    NB tf-idf  SVM   SVM tf-idf
NF     Low    0.89  0.91       0.92  0.91
       Mid    0.33  0.00       0.25  0.01
       High   0.25  0.00       0.24  0.06
NF60   Low    0.72  0.74       0.67  0.75
       Mid    0.36  0.00       0.37  0.26
       High   0.14  0.00       0.18  0.17
AF     Low    0.70  0.73       0.72  0.77
       Mid    0.39  0.00       0.34  0.18
       High   0.40  0.00       0.31  0.41
AF60   Low    0.67  0.70       0.65  0.69
       Mid    0.39  0.00       0.36  0.32
       High   0.17  0.00       0.23  0.24

Table 5.7: N-grams and features combined

Set    Class  Precision  Recall  F1-Score
NF     Low    0.94       0.79    0.86
       Mid    0.20       0.29    0.24
       High   0.23       0.55    0.33
NF60   Low    0.66       0.64    0.65
       Mid    0.32       0.17    0.23
       High   0.18       0.43    0.25
AF     Low    0.74       0.66    0.70
       Mid    0.34       0.30    0.31
       High   0.34       0.53    0.42
AF60   Low    0.61       0.64    0.62
       Mid    0.37       0.29    0.33
       High   0.27       0.36    0.31

5.3

Can key features be identified which drive article performance?

After a classifier has been trained, the most informative features can be extracted. The combined classifier trained on article set NF favored the features shown in Table 5.9 the most.
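A sketch of how this extraction can be done for a fitted linear model, assuming the step names from the combined pipeline sketched earlier; using coefficient weights as importance scores is itself an assumption about the model type.

```python
# Read the most informative features off a fitted multiclass linear
# classifier by sorting its per-class coefficient weights. Assumes the
# "features"/"clf" step names from the sketch above and a fitted pipeline.
import numpy as np


def top_features(pipeline, k=10):
    names = pipeline.named_steps["features"].get_feature_names_out()
    clf = pipeline.named_steps["clf"]
    for label, row in zip(clf.classes_, clf.coef_):
        top = np.argsort(row)[::-1][:k]   # largest positive weights per class
        print(label, [names[i] for i in top])

# top_features(combined)   # e.g. newsvalue = 90, [author 1], topic 76 ...
```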


Table 5.8: Accuracy for the different classifiers on the various article-sets

Classifier     NF    NF60   AF    AF60
Baseline       84%   59%    58%   53%
NB NLP         71%   47%    52%   46%
NB NLP         70%   47%    52%   44%
NB n-grams     78%   58%    55%   53%
NB tf-idf      84%   59%    58%   53%
SVM NLP        59%   57%    59%   54%
SVM NLP        83%   55%    58%   53%
SVM n-grams    84%   53%    56%   51%
SVM tf-idf     84%   60%    62%   55%
Combined       74%   48%    55%   48%

The same classifier trained on article set AF60 favored most of the features shown in Table 5.10. The background color of each cell corresponds to the label it connects to: red for features describing low visit articles, yellow for mid (absent) and blue for high. The authors have been anonymized.

In Table 5.9 the top three ranked features are newsvalue 90, 100 and 80 respectively, all indicators of high article popularity. The remaining features for high labeled articles are mostly authors, with the exception of two bigrams. Rank five is the feature newsvalue = 10, the strongest indicator of low labeled articles. The next indicator of low labeled articles is ranked 13, LDA topic 76, which according to the LDA topic labeling is hockey related.

In Table 5.10 author 1 is the most informative feature and is still an indicator of high labeled articles. Three new authors follow on ranks 3, 5 and 6 and are also indicators of high labeled articles. Four unigrams can be seen as indicators of high labeled articles: iphon is the most informative, followed by pension, borät and fullständ. The unigrams have been truncated during pre-processing, but the original words can still be recognized. The only features indicating low labeled articles are LDA topics 55 and 47, corresponding to Film & film festival and Human rights respectively.

No features indicating mid labeled articles can be seen in either of the tables mentioned above.

Table 5.9: Most informative features for the combined system using article set NF

Rank  Feature            Rank  Feature
1     newsvalue = 90     8     [author 4]
2     newsvalue = 100    9     bigram: [statsvet , :]
3     newsvalue = 80     10    [author 5]
4     [author 1]         11    [author 6]
5     newsvalue = 10     12    bigram: [(^) , statsvet]
6     [author 2]         13    topic = 76 (Hockey)
7     [author 3]         14    [author 7]

As seen in the accuracy table (Table 5.8), the SVM tf-idf classifier scores the highest accuracy on all article-sets. The five most informative features for each label are shown in Table 5.11. Rank 5 for high labeled articles, bostadsrät, is closely related to the previously seen unigram borät. No other features in the table resemble features seen in the previous feature tables.


Table 5.10: Most informative features for the combined system using article set AF60

Rank  Feature                           Rank  Feature
1     [author 1]                        6     [author 10]
2     topic 55 (Film & filmfestival)    7     unigram: pension
3     [author 8]                        8     unigram: borät
4     unigram: iphon                    9     unigram: fullständ
5     [author 9]                        10    topic 47 (Mänskliga rättigheter)

Table 5.11: Most informative tokens for the SVM tf-idf classifier

Rank  Low           Mid          High
1     fredrikson    försäkring   ensamkomm
2     valberedning  alfredson    insättningsgarantin
3     centrum       internet     show
4     resurs        någonting    listan
5     utställning   dem          bostadsrät

5.4 Editorial Study

The prediction results for the two editors at SNP can be seen in Table 5.12, together with the SVM tf-idf classifier's predictions on the same 56 articles. The classifier predicts with an accuracy of 59%, compared to the editors' average accuracy of 55%. The classifier scores higher on low and high labeled articles, while the editors perform better on mid labeled articles.
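For reference, this kind of per-label comparison can be produced with scikit-learn's classification_report; the label lists below are small placeholders, not the 56 survey articles.

```python
# Per-label precision/recall/F1 and overall accuracy for editors vs the
# classifier. The label lists are placeholder data for illustration only.
from sklearn.metrics import accuracy_score, classification_report

true_labels = ["Low", "Low", "Mid", "High", "Low", "High"]
editor_preds = ["Low", "Mid", "Mid", "High", "High", "Mid"]
clf_preds = ["Low", "Low", "Low", "High", "Low", "High"]

label_order = ["Low", "Mid", "High"]
print(classification_report(true_labels, editor_preds, labels=label_order))
print(classification_report(true_labels, clf_preds, labels=label_order))
print("Editor accuracy:", accuracy_score(true_labels, editor_preds))
print("Classifier accuracy:", accuracy_score(true_labels, clf_preds))
```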

Table 5.12: How the classifier's performance compares to the editorial predictions

              Editors                       Classifier
Label   Precision  Recall  F1-Score   Precision  Recall  F1-Score
Low     0.81       0.46    0.58       0.66       0.82    0.73
Mid     0.40       0.41    0.41       0.38       0.26    0.31
High    0.38       0.78    0.51       0.62       0.56    0.59


6 Discussion

In this chapter, the results are analyzed and discussed to answer the stated research questions. The method is also discussed and criticized, accompanied by suggestions on how it could be improved. Finally, it is discussed how this work can affect SNP and news agencies in general.

6.1 Results

The research questions are discussed in the same order as they were stated and presented in the results chapter.

6.1.1 How viable are machine learning techniques to predict article performance?

To answer this question, the experiments are discussed in chronological order: first regression performance and then classification performance.

6.1.1.1 Regression

Looking at the results in Table 5.1, LR performs best on all article-sets with respect to both RMSE and R2. This is in line with the results from Arapakis et al. [1] and Bandari et al. [2].

The models perform very differently on the different subsets of articles. Compared to the baseline, the regressors perform better on subsets NF and AF, and almost equally on subsets NF60 and AF60. Since NF60 and AF60 only contain articles with the same newsvalue (60), the models cannot use newsvalue for fitting on the data. Seeing that performance drops so drastically on NF60 and AF60, one can conclude that newsvalue is a significant driver in the feature set for generating visits. Compared with the results from Bandari et al. [2], the R2 value on article set NF is much stronger (0.61 vs their highest of 0.43), very similar on article set AF and much weaker on NF60 and AF60. By creating the subsets NF60 and AF60, where all articles have the same newsvalue, the regressors have no strong features left to model with.
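For reference, a minimal sketch of the two evaluation metrics and the mean-predicting baseline, assuming scikit-learn; the arrays are placeholders standing in for true and predicted visit counts.

```python
# RMSE and R2 as used to compare the regressors against a baseline that
# always predicts the mean. y_true and y_pred are placeholder arrays.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([120.0, 4500.0, 310.0, 87.0, 950.0])
y_pred = np.array([200.0, 3900.0, 280.0, 150.0, 700.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# Predicting the mean of the targets yields R2 = 0, which is roughly what
# the regressors approach on NF60 and AF60.
baseline = np.full_like(y_true, y_true.mean())
print(rmse, r2, r2_score(y_true, baseline))
```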

(38)

6.1. Results

Looking at Figure 4.6, the newsvalue clearly is a good feature for predicting visits on the dataset, and the results are in line with what the data exploration shows. From a machine learning point of view, performance caused by the newsvalue is not interesting. Newsvalue is set manually by the editors and affects the placement on the site, resulting in exposure differences. By keeping the newsvalue in the feature set, the models simply mimic the decisions of the editors. Because of this, the two article-sets NF60 and AF60 were created to mitigate the effects of the manually set newsvalue. A problem with these article-sets is that since the newsvalue has been set to the same value 60, the variance in type and quality is probably lower than in the whole dataset, which makes the data harder for the regressors to model.
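As an illustration, the fixed-newsvalue subsets can be derived with a simple filter; the DataFrame and the column name "newsvalue" are assumptions about how the article metadata is stored.

```python
# Derive the fixed-newsvalue subsets (NF60, AF60) from a full article
# DataFrame. Column and variable names are illustrative assumptions.
import pandas as pd


def fixed_newsvalue_subset(articles: pd.DataFrame, value: int = 60) -> pd.DataFrame:
    """Keep only articles whose manually set newsvalue equals `value`,
    so the models cannot lean on the editors' placement decisions."""
    return articles[articles["newsvalue"] == value].copy()

# nf60 = fixed_newsvalue_subset(nf_articles)
# af60 = fixed_newsvalue_subset(af_articles)
```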

The regressors perform very similarly to the baseline on article-sets NF60 and AF60. Looking at both Figure 5.1 and Figure 5.2, the poor performance is clearly visible in the subplots for article-sets NF60 and AF60: the regressors predict around the mean value with some seemingly random spread. When removing the newsvalue for prediction in Figure 5.2, the performance clearly decreases. Looking at the newsvalue colors, the newsvalue goes from low to high from left to right, which means that as the newsvalue increases, the true visits generally increase as well. When removing the newsvalue in training, the regressors still manage to roughly follow the true visit curve, which can be seen in that the rightmost articles are generally predicted higher than the rest.

As mentioned earlier, from the editors' point of view, correctly ranking the articles should be enough information for making data-driven editorial decisions. The ranking plots 5.3 and 5.4 show decreased performance when the regressor is not allowed to train on newsvalue. Article-sets NF60 and AF60 show an even larger decrease in performance. When the model trains and predicts on NF using newsvalue, we see a very strong Spearman's rho of 0.78. When the model trains and predicts on the same article set NF but without using newsvalue, we see a weaker but still good Spearman's rho of 0.60. When evaluating NF60 and AF60, the models score a Spearman's rho around 0.30, a significant decrease compared to the other article-sets but still better than random. Looking at the ranking plots, the ranks seem to be more accurate for the high performing articles: the points are more tightly centered along the diagonal in the lower left and then widen as the true article rank decreases. This is promising if a news agency such as SNP is more interested in making data-driven decisions for higher-quality articles. In the ranking plot on page views from Arapakis et al. [1], the ranks are more accurately predicted near the higher ranks (lower page views) and a line of underestimated articles can be seen. A similar string of underestimated articles can be seen for article set AF in Figure 5.3. In our case, the underestimation comes from lower newsvalue articles with high visit numbers. In Arapakis et al.'s case, they most likely also have one feature which usually is a good indicator of low visit articles, but with exceptions where the articles receive a great number of views.
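A minimal sketch of the rank-based evaluation, assuming SciPy; the arrays are placeholders standing in for true and predicted visit counts on one of the article-sets.

```python
# Spearman's rho measures how well the regressor orders articles by visits,
# which is the relaxed ranking goal discussed above. Placeholder data only.
import numpy as np
from scipy.stats import spearmanr

true_visits = np.array([120, 4500, 310, 87, 950])
predicted_visits = np.array([200, 3900, 150, 280, 700])

rho, p_value = spearmanr(true_visits, predicted_visits)
print(f"Spearman's rho: {rho:.2f}")
```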

6.1.1.2 Classification

Looking at the classifier results using no NLP features in Table 5.4, we see that the measures vary greatly between the different article-sets. As we saw earlier in the regressor performance, newsvalue was the strongest driver of the predictions, and by looking at Figure 5.1 or Figure 5.2 and subplot NF we can examine how the newsvalue aligns with the three labels. Looking at all articles to the left of where the true-visit line crosses the low threshold, we see almost only articles with newsvalue in the range 10-60. Looking at the articles between where the true-visit line crosses the mid threshold and the low threshold, we see articles spanning from newsvalue 30 to 100. The rest of the articles mostly land in the span 60-100. This means that there is a lot of overlap between newsvalue and the different labels, and newsvalue alone should not be able to accurately predict all labels. However, a very low newsvalue of 20 does seem to almost guarantee a low labeled article and thus increases performance for that class.


Looking again at the results in Table 5.4, we can see that the classifiers generally perform well on predicting low labeled articles.
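To make the label boundaries concrete, the sketch below shows how visit counts can be bucketed into the three classes. The numeric thresholds are illustrative assumptions, not the cut-offs actually used in the thesis.

```python
# Illustrative bucketing of visit counts into the three popularity labels.
# The thresholds are assumptions for the example only.
def label_article(visits: int, low_threshold: int = 500, mid_threshold: int = 2000) -> str:
    if visits < low_threshold:
        return "Low"
    if visits < mid_threshold:
        return "Mid"
    return "High"


print(label_article(120), label_article(800), label_article(5000))  # Low Mid High
```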

Comparing the accuracies to the baseline in Table 5.8, SVM manages to outperform it by one percentage point on AF and AF60. The classification results from Arapakis et al. [1] also show that only SVM manages to outperform the baseline.

6.1.2 How can NLP techniques be used to improve prediction performance?

How NLP techniques affected prediction performance is divided into results from the extended feature set and from the BOW classifiers.

6.1.2.1 Extended Feature Set

First, the models were extended with additional NLP features in the hope of improving prediction performance. As seen in Tables 5.1 and 5.2, the best performing regressor, LR, improved with the extended feature set. Even though the non-NLP feature set is small, it still performs well compared to the full feature set. The extended feature set contains some features that have proven useful in similar work, such as token count and top LDA topic similarity [23, 5]. The reason they improve the regressor so little is, again, that the smaller feature set already performs well on newsvalue alone, without the extended features.
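As a hedged illustration of two of these extended features, the sketch below computes a token count and a top-LDA-topic feature with scikit-learn; the placeholder corpus, the topic count and the use of LatentDirichletAllocation are assumptions, not the thesis' exact pipeline.

```python
# Token count and most probable LDA topic per article, two of the extended
# NLP features mentioned above. Corpus, topic count and library choice are
# illustrative assumptions.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "statsvetaren kommenterar valet",        # placeholder article bodies
    "hockeylaget vann matchen i natt",
    "nya regler för bostadsrätter i år",
]

token_counts = [len(t.split()) for t in texts]   # simple token count feature

dtm = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_dist = lda.fit_transform(dtm)              # per-article topic mixture
top_topic = np.argmax(topic_dist, axis=1)        # id of the dominant topic
top_topic_weight = topic_dist.max(axis=1)        # similarity to the top topic
```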

As seen in Table 5.8, the accuracy decreases with the extended feature set, both for NB and SVM. This means that most of the added features do not correlate with visits and are just noise for the classifier, causing it to overfit on the training set.

6.1.2.2 Classification

The BOW models proved to have similar performance to the feature-set classifiers. As shown in Table 5.8, the NB BOW classifier had improved accuracy on all article-sets while SVM had a decrease in accuracy on all except NF. When using a tf-idf table, the NB classifier resorts to predicting only the most frequent class. Adding a tf-idf table to the SVM classifier increases its accuracy; the SVM tf-idf classifier scores the highest accuracy of all classifiers on all article-sets. However, looking at the F1-scores in Table 5.6, SVM without a tf-idf table is better at capturing the mid and high labeled articles.

One benefit of using a BOW model is that no predefined feature set has to be used. Should SNP decide to remove the newsvalue feature from the articles, a BOW model would not be negatively affected.

6.1.3 Can key features be identified which drive article performance?

Looking at the most informative features for the combined classifier presented in Table 5.9, the top three are, unsurprisingly, newsvalue features. What this can tell SNP is how strong the correlation is between newsvalue and article performance. Rank 4 is [author 1], a writer of the editorial page. That this particular editorial author is such a strong feature indicates that the author's content is popular; it consists mostly of articles about politics. Ranks 9 and 12 can be seen as the same feature: they come from the common headline start "Statsvetare:" (political scientist). Many political articles start in that manner, followed by what the political scientist expressed. On rank 13 the first LDA topic appears. It is a sports topic whose articles can differ greatly from SNP's regular selection.

References
