
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-A--20/028--SE

News Value Modeling and Prediction using Textual Features and Machine Learning

Modellering och prediktion av nyhetsvärde med textattribut och maskininlärning

Rebecca Lindblom

Supervisor: Marco Kuhlmann
Examiner: Arne Jönsson



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

News value assessment has been done forever in the news media industry and is today often done in real-time without any documentation. Editors take a lot of different qualitative aspects into consideration when deciding what news stories will make it to the first page. This thesis explores how the complex news value assessment process can be translated into a quantitative model, and also how those news values can be predicted in an effective way using machine learning and NLP.

Two models for news value were constructed, for which the correlation between modeled and manual news values was measured, and the results show that the more complex model gives a higher correlation. For prediction, different types of features are extracted, Random Forest and SVM are used, and the predictions are evaluated with accuracy, F1-score, RMSE, and MAE. Random Forest shows the best results for all metrics on all datasets, the best result being on the largest dataset, probably due to the smaller datasets having a less even distribution between classes.


Acknowledgments

I would like to thank my supervisor Marco Kuhlmann, my examiner Arne Jönsson and the entire thesis seminar group at NLPLAB, especially my opponent Emma Nilsson Tengstrand, for great feedback and discussions during the work. Also, a great thank you to everyone at iMatrics for making me feel like a true part of the team.

The largest of thanks to Jon Vik and my parents Göran and Nina. I would not have been able to complete my thesis or my education without your love and support.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
1.5 Outline

2 Background
2.1 iMatrics
2.2 Media houses
2.3 News value assessment today

3 Theory
3.1 News popularity and impact
3.2 Text pre-processing
3.3 Text representation
3.4 Non-text features
3.5 Features in literature
3.6 Machine learning prediction
3.7 Feature selection
3.8 News value prediction in the literature
3.9 Evaluation

4 Method
4.1 Environment
4.2 Data set
4.3 Popularity model
4.4 Feature collection
4.5 Text pre-processing
4.6 Vectorizing
4.7 Feature selection
4.8 Hyperparameter Tuning
4.9 Training and prediction
4.10 Evaluation

5 Results
5.1 Popularity Model
5.2 Feature selection
5.3 Training and prediction
5.4 Comparison with manual news values

6 Discussion
6.1 Results
6.2 Method
6.3 The work in a wider context

7 Conclusion
7.1 Research questions
7.2 Fulfilling the purpose
7.3 Future work

Appendices
A Swedish Stop Words
B Top 500 features


List of Figures

3.1 Illustration of how an SVM splits 2-dimensional data into two classes with a 1-dimensional line.
3.2 Illustration of 5-fold cross-validation. The dataset is split into five blocks which alternate to act as the evaluation set during training and testing five models.
3.3 What true and false positives and negatives are in an evaluation of predicting one class.
4.1 The distribution of manual news values in the main dataset.
5.1 The distribution of popularity values across all articles in the bt-rt dataset.
5.2 The distribution of manual news values in the bt-rt dataset for the binning strategy 'bot'.
5.3 The distribution of popularity values (without reading time) in the main dataset.
5.4 The distribution of manual news values in the main dataset for the binning strategy 'guess'.
5.5 The density of how the Random Forest predicted the news value of the articles in the main dataset. Each subplot shows the correct articles in each class, and what they were predicted as.
5.6 The density of how the SVM predicted the news value of the articles in the main dataset. Each subplot shows the correct articles in each class, and what they were predicted as.
6.1 The feature importances for the top 500 features for each of the three datasets.


List of Tables

2.1 Possible values for news value and lifetime on news articles at Gota Media and NTM.
3.1 An example of the Bag-of-words model with the sentences "We really like Robin's new haircut" and "Do they really look like that?"
3.2 An example of how a one-hot encoding will convert key-value pairs into boolean feature fields.
4.1 All newspapers, their number of articles in the dataset and their number of subscribers for both paper and digital issue [26].
4.2 The datasets for training and prediction.
4.3 The binning strategies, what newspapers and how many articles they were based on, and the rates of each news value bin.
4.4 The collected features for each article.
4.5 The number of features after vectorizing each part of the datasets.
4.6 The parameter spaces for hyperparameter tuning Random Forest with Random Search and Grid Search.
4.7 The parameter spaces for hyperparameter tuning SVM with Random Search and Grid Search.
4.8 The parameter settings for the models.
5.1 The results for Spearman's rank correlation for the different binning strategies, on the bt-rt dataset.
5.2 The results for Spearman's rank correlation for the different binning strategies, on the main dataset.
5.3 The top 15 features after feature selection for the datasets main, bt and bt-rt. The value in parentheses is the importance value generated by Random Forest that will sum up to 1 for all features in the dataset.
5.4 The evaluation measure results for the Random Forest predictions for the popularity model with reading data on the bt-rt dataset.
5.5 The evaluation measure results for the SVM predictions for the popularity model with reading data on the bt-rt dataset.
5.6 The evaluation measure results for the Random Forest predictions on the main dataset.
5.7 The evaluation measure results for the SVM predictions on the main dataset.
5.8 The evaluation measure results for the Random Forest predictions on the bt dataset.
5.9 The evaluation measure results for the SVM predictions on the bt dataset.
5.10 The evaluation measure results for the manual news value assessment.
5.11 The evaluation measure results for the manual news value assessment for the
6.1 RMSE and MAE for all classifications compared to only misclassifications, for datasets main and bt-rt.
A.1 The list of stopwords used in the preprocessing.


List of acronyms

BOW Bag-of-words

MAE Mean Absolute Error

MFC Most Frequent Class

ML Machine Learning

NLP Natural Language Processing

NLTK Natural Language Toolkit

PCA Principal Component Analysis

POS Part-of-Speech

RMSE Root Mean Squared Error

SVM Support Vector Machine


1 Introduction

News has always been important to humans. Media and news sources are a fundamental part of a democratic society and appear in multiple forms. Everything from word-of-mouth, newspapers, radio, and television to web pages and social media has been, and still is, a messenger of current events and announcements.

The media industry, especially newspapers, is facing times of reconstruction. The digital transformation of society is changing our behavior and approach to the consumption of news. The major issue right now is for the newspapers to adapt their businesses to these new behaviors [1]. The number of subscriptions to paper issues is declining, as are the revenues from advertising. People are more prone to read their news online, on social media, or in apps on their smartphones. Suddenly the newspapers are no longer only competing with each other, but with the major IT companies for the user's time online. This competition takes place on a completely different field with different prerequisites and resources [2].

Another new behavior of news consumers is the free mentality [3]. Because newspapers have had a stable business model built on printed issues for several hundred years, web-based news has been free since newspapers first established a presence on the internet. Therefore, people have become used to reading news for free, and that attitude can be hard to change. Today, some articles on the newspapers' websites are openly distributed, while some are behind a paywall where the reader needs a subscription to read the article.

The digital transformation has not only created trouble for newspapers. The digital environment has given opportunities to collect information about readers and customers, and also to develop digital tools and processes for insights and higher efficiency [4]. There are examples of other industries that have faced the same situation of digitalization and have managed to transition in a sustainable way. One of them is the music industry with Spotify in the lead, which has changed both the behavior and mentality of music consumers.

One way a digital tool could assist the newspaper business in times of digitalization is to predict news values. To get as many reads as possible, the news stories must be arranged in an attractive way on the website. Today, editors have to decide in real-time which articles are of most interest and choose how to place them. The placement of the news on the website aims to reflect the readers' interest, where the most important news should get the most exposure. Many aspects of the news need to be considered in the placement process. These aspects could be, for example, the geographical distance, relevance for the target audience, and timely relevance of the news [5]. All these aspects are weighted in order to determine the news value of each news story. Instead of letting editors or reporters assess the news value, which could lead to inconsistent or subjective values, this thesis will explore how the news value can be modeled and predicted with the help of Machine Learning (ML).

1.1 Motivation

By having a working model for predicting the news value, reporters and editors can be assisted in the assessment and prioritization of news stories. At least two media houses that are clients of iMatrics are using or intend to use the digital news value as a support for automating the placement of stories on their websites and in the printed issues. By also automating the news value assessment, the news values will be consistent throughout all articles, and the comparison between articles will be more accurate than if every reporter assessed their own articles. This could also free up time for both reporters and editors, and provide helpful insights into the production.

1.2 Aim

This thesis work aims to construct a quantitative model for news value as well as a prototype for the prediction of news values. The results of the prediction should be expandable to support the development of other applications that can make use of the news value. The thesis will also contribute to the knowledge of what features could be interesting in Natural Language Processing (NLP) regarding the assessment of news articles. To reach these goals, different quantitative models for news value will be constructed and evaluated by correlation to current manual news values. The models will then be used to generate news values using data from historical articles. These articles and modeled news values will be used in different ML models for predicting news values for future articles.

1.3 Research questions

Here follow the research questions that this thesis aims to answer.

1. How can the news value of an article be modeled with quantitative data?

The automated process for news value assessment – which in the long term is the goal of this thesis – will not be able to take qualitative values into consideration when modeling a news value for an article. What will the process of translating the qualitative nature and complexity of manual news value assessment into a quantitative environment look like? What metrics can be used to measure the popularity of an article, and how can that popularity value be translated into a discrete news value?

2. What ML algorithm would be a suitable option for news value prediction?

There are a lot of different types of machine learning algorithms. Some are more suitable for working with text, and some are better for other types of features. The work will look into finding an algorithm that performs well with a mix of different types of features.

3. Which features can effectively be used in a machine learning model to predict the news value of an article?

Feature engineering and selection have to be performed in order to discover which features give good results in the prediction. To answer this question, the performance of individual features will be examined, and those that are redundant or do not contribute to improved results will be filtered out.


4. How does the proposed method compare to a manual news value prediction?

Editors currently set the news value manually, and this data could be used to compare the human assessment to the ML predictions. By evaluating the manual news values the same way that the ML models are evaluated, a comparison of those metric values can be made.

1.4 Delimitations

This study will only look at news articles in Swedish. The features and metrics of the articles that can be used in the prediction will depend on the data that can be provided by the customers of iMatrics.

Also, the study is limited to a quantitative analysis of the available data. Even though news impact and popularity can be abstract and qualitative measures, as mentioned in Section 3.1, this study is limited to interpreting potential qualitative measures as quantitative.

The prediction of news value will only be based on data that are available before publication. This means that the prediction will not consider other articles that have not yet been published, but might be present in the future on the newspaper’s website or the printed newspaper at the same time as the predicted news story. The process of positioning news stories in the newspaper or on the website is considered to be another part of the journalistic work of news evaluation, and the prediction could be used for that in the future.

1.5 Outline

This report will be divided into chapters that contain background, theory, method, results, discussion, and conclusion. The background chapter will give the reader some information about the news industry and how the process of news value assessment is carried out at two of the biggest media houses in Sweden. The theory chapter will introduce all relevant concepts used in the project, both regarding news value and popularity, along with text processing and machine learning. The process of this project will be presented in the method chapter, where all the steps taken will be described. The results will be presented in the results chapter, and those together with theory and method will be discussed in the discussion chapter. Finally, conclusions regarding the research questions will be stated in the conclusion chapter.


2 Background

The background chapter presents the news industry, with information provided by the customers of iMatrics through interviews.

2.1 iMatrics

iMatrics is an NLP company that assists its customers with, among other things, automatic tagging of news articles. By tagging the news articles with the topics they contain, readers can easily navigate to the topics that interest them and read all relevant articles for that topic. One advantage of using an automatic tagging service, besides the time saved, is that the tagging will be consistent throughout all articles.

The customers are mostly media houses that own local newspapers in Sweden but also include magazines and larger news media in Sweden, Denmark, and the Netherlands. iMatrics aims at being the digital strategic partner to its customers and enables them to adapt to a sustainable business with the help of technology.

2.1.1 iMatrics tagging service

The main service that iMatrics delivers to its customers is the tagging service. The reporter writes their article, and the tagging service will in real-time add tags that are relevant to the content. The reporter can also remove and add tags manually. The service finds mentioned entities, which can be persons, places, and organizations, and tags the article with them. It also adds tags for the relevant category and topic.

2.2 Media houses

NTM (https://ntm.eu/) and Gota Media are two major Swedish media houses that mainly publish local newspapers in the southern part of Sweden. Gota Media owns 30 newspapers and NTM around 35 (March 2020). NTM also owns companies that offer printing, distribution, and property management, among other things. Both houses each have over 200 000 subscribers in total. For Gota Media, about 60% of the content is locked and only available to subscribers. For NTM, it differs widely between newspapers, but some newspapers lock almost all articles except information important to the community.

2.3 News value assessment today

News value assessment has been done forever in the news media industry. Editors take a lot of different aspects into consideration when deciding what news stories will make it to the first page. Three main aspects of news value are the geographical distance, distance in time, and cultural distance [5]. News stories that happen close to us are more interesting than news far away. The word news itself reveals the meaning of distance in time, since new events are more newsworthy than old events. While both geographical and timely distance are measurable in a simple way, cultural distance is more complex and hard to measure quantitatively. It could, for example, be whether the reader feels that they can relate to the people in the news story. All of these aspects are important parts in deciding the news value, which will be relative to the other news stories published at the same time.

Table 2.1: Possible values for news value and lifetime on news articles at Gota Media and NTM.

Property | Possible values
News value | 1, 2, 3, 4, 5, 6
Lifetime (NTM) | Short, Medium, Long
Lifetime (Gota Media) | 6H, 1D, 7D, 8

Historically, the news value assessment has been made in real-time by the editors, and no measurable values of the assessed news value have been recorded. Today, however, media houses like Gota Media and NTM are keeping digital records of their news values, as a starting point for automating the placement process. The assessment is made before publication by the reporter who has written the article; it can be decided independently for the lower news values, while an editor should be consulted for higher news values. This way of determining a news value has the risk of being inconsistent and subjective. The reporter also estimates a lifetime for the article, after which the article is considered dead and will not receive any attention. The possible values for news value and lifetime can be seen in Table 2.1. Gota Media gives the news value 2 to all external articles, while NTM uses the attached news values from external news agencies.


3 Theory

This chapter will introduce the theoretical concepts needed for this thesis work. It covers how news popularity can be measured, theory about machine learning, which is the technology used in the thesis, NLP techniques for handling text within the machine learning area, and how the results can be evaluated. Most sections will mention some related work.

3.1 News popularity and impact

Impact, as measured by a journalist, could have a larger and broader perspective than this thesis will be able to take. This is made clear in a quote from Schiffrin and Zuckerman [6]:

There’s no guarantee that a story read by millions of people will have more impact than one that reaches only a few hundred readers.

In this context, impact is more about how readers are affected by what they have read, and how they will change their thoughts and behavior. This has been researched in several studies [7, 8, 9]. Real-world impact is hard to measure, and is therefore hard to use in a quantitative context, which places it partly out of scope for this thesis. Instead, the popularity and impact of an article have to be measured in other ways. Even though a quantitative measure of impact through other channels might not be the optimal choice, the continuation of the quote from Schiffrin and Zuckerman gives a comforting testimonial that it still could give a promising result.

... But it’s easier to posit impact when a story finds a substantial audience.

This section will present some different metrics that could be used for the purpose of measuring the popularity and impact of news articles.

3.1.1 Clicks and visits

Clicks are a simple measurement of how many times the article has been clicked on and viewed. Visits are a different measure, where the unique sessions for users are counted. There could be several clicks from the same user during one session, but only one visit is noted. A click happens before the user has read the actual article. This means that no conclusions regarding the connection between the content of the article and its popularity can be drawn; it is more an assessment of the headline and image placed on the first page.

Measuring readers' interest in an article with the number of clicks can be misleading, according to Kormelink and Meijer [10]. Readers of a web page will have different approaches to different news stories; some stories will be clicked and some will not, but the heading and images might be looked at anyway. The small amount of information about the article on the landing page might not be enough to attract a click. It could also be the opposite: the amount of information is enough for the reader to feel well informed, without feeling the need to click. This does not mean that the story was not interesting. These findings mean that simply using clicks as a measure for interest, popularity, or impact might not be enough for picking up all dimensions of interest.

Zheng et al. [11] also emphasize that measuring relevance, which in this context is how relevant a certain item is to the user, with clicks is not always recommended. They examined a recommendation system and found that the algorithms with higher relevance did not correspond to a higher click rate. They state that the evaluation of recommendation systems based on clicks should be used with caution.

3.1.2 Time spent reading

Measuring the time a reader has spent on the page for one news story could indicate whether the reader thought the article was interesting enough to continue reading after clicking on it. The time has to be scaled according to the length of the article to get comparable numbers. The length of the article can be a strong factor in whether the reader will read the entire article [12]. A click will be registered even if the reader chooses not to read the article due to its length, since the length cannot be seen before the click is performed.

As stated by Schaudt and Carpenter [13], some online metrics companies changed their factors for measuring attention, partly due to the then-new web 2.0 techniques that allowed the user to accomplish more tasks on the same page without generating another click in the statistics. Another reason was that counting page views would not indicate what readers find of general interest in the long term, but rather what is most popular right now.

3.1.3 Shares in social media

Sharing an article on social media could indicate that the reader found it interesting and wants others to read it as well. The sharing can be done in several ways. As mentioned by Berger and Milkman [14], narrowcasting is sharing content with a few specific persons; it could be niche content sent in an e-mail. In contrast, broadcasting requires content with a wider interest, since it is shared with a large group of people, chosen not for their interests but rather for being reachable. Examples of such sharing would be sharing the article on Twitter or Facebook or writing a blog post about it. In whatever way the sharing is done, it is usually with the intent that the content of the shared piece is of interest in some way, and the user sharing it thinks it is worth reading or looking at.

Berger and Milkman [14] also discuss what makes online content viral. In their study, they found that content that evoked high-arousal emotions was more prone to go viral, regardless of whether the emotions were positive (awe) or negative (anger or anxiety). Content less likely to go viral evoked deactivating emotions such as sadness. They also highlighted the fact that content that is featured prominently has a higher chance to be shared and go viral. Further, surprising and interesting content, as well as practically useful and positive content, is more viral, both following the concepts that people share content to entertain, inform, and boost others' moods.


3.1.4 Article comments

Some news sites have comment sections belonging to each news article. This enables the readers to start discussions directly on the site with other readers of the same article. On some news sites, the comment section can be powered by an external comment service, for example Ifrågasätt. Then the users need to be registered users of the external site to be able to comment, even if the user already is a subscriber of the newspaper. Since this is an extra step, it could potentially be an obstacle that hinders some users from actually posting their comments [15]. Having an anonymous comment section might, on the contrary, encourage more readers to comment.

A comment section will contain user-generated content and could therefore risk containing uncivil comments. According to Canter [16], journalists think it can be damaging for the brand of the newspaper not to moderate the comment section. It is also part of Swedish law to remove hateful comments and to be responsible for the forum that you provide [17]. Therefore, the number of comments is prone to change if inappropriate comments are published and later removed.

Tenenboim and Cohen [18] mention that comments can be seen as a measure of success by people working in journalism. Comments are sometimes also seen as a rating system to see how the readers respond to the content. The authors do, however, not mention that the content of a comment must be taken into consideration to be able to count it as agreement with the news story or not. They further investigated the connection between clicks and comments and found that 40-59% of the stories that were highly clicked during any month were different from the stories that received many comments. The articles with high numbers of clicks were more often about sensational and curiosity-arousing topics, while for comments, political and social topics and elements of controversy generated higher engagement.

3.2

Text pre-processing

Human language is complex, and while it might come naturally to us humans, computers need some simplification for parsing the text correctly. This section will go through some NLP techniques for preparing the text for the machine learning analysis.

3.2.1 Tokenization

Tokenization is the process of splitting up a sentence into separate words, or tokens. Usually, text is stored as a long string of words concatenated into sentences, which in turn are concatenated into documents. When analyzing text computationally, the boundaries between words must be specified, since a computer won't be able to determine them without instructions.

One easy way of tokenizing is to use white spaces as separators, but a more sophisticated tokenizer should also be able to split off punctuation marks such as commas and quotes. Depending on the language, contractions like "isn't" should be split into two words, "is" and "n't". An example could be the sentences:

Hans went to school, even though it was closed. 'Why isn't it open?' he thought.

Tokenizing these sentences should result in the following tokens:

["Hans", "went", "to", "school", ",", "even", "though", "it", "was", "closed", ".", "'", "Why", "is", "n't", "it", "open", "?", "'", "he", "thought", "."]

(19)

3.3. Text representation

3.2.2 Lowercasing

The data set will contain words that have the same meaning but are written with differently cased letters, e.g. "House" and "house". One word could, for example, appear both in the beginning of sentences and in the middle of them. These words would be considered two different words by the machine learning model. To be able to compare words even though they have different casing, all words are lowercased.

3.2.3 Removal of stop words

Stop words are small words that are necessary for humans speaking and writing the language, but they are not useful when analyzing text computationally, since they contribute little meaning to a sentence. Examples in Swedish are att, men, och, eller, en, and ett. These types of words are removed during pre-processing.
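A minimal sketch of how such a filter can look, using a few of the Swedish examples above (the full stop word list used in the thesis is given in Appendix A):

    # Stop word removal sketch; this short list is illustrative only.
    stop_words = {"att", "men", "och", "eller", "en", "ett"}

    tokens = ["en", "bil", "och", "en", "cykel"]
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['bil', 'cykel']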

3.2.4 Part-of-Speech tagging

Part-of-Speech (POS) tagging is a process where each word in a sentence is annotated with its word type in the current sentence. Examples of word types are noun, verb, adjective, preposition, pronoun, and adverb. Since words can have different meanings and be of different word types depending on the context, this is not always a trivial task and can be solved with different techniques. A POS tagger can be rule-based, deciding the word type based on look-ups in a lexicon and a set of rules. A tagger can also be stochastic, where probabilities are used for deciding the word type.

3.3 Text representation

A machine learning model is a mathematical concept and therefore needs the input features in a numerical format. Text is by nature not numerical and needs to be translated for the machine learning model to understand it.

Vectorizing is the process of taking all features and converting them to a vector format that will be readable by the machine learning model. For text, a count vectorizer or a Term Frequency-Inverse Document Frequency (Tf–Idf) vectorizer can be used. A count vectorizer uses the concept of Bag-of-words, which is explained below, and the Tf–Idf vectorizer uses the concept of Tf–Idf, which is explained further down in Section 3.3.2.
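As a minimal sketch, both vectorizers are available in scikit-learn; the exact settings used in this work are described in the Method chapter.

    # Vectorizing sketch with scikit-learn (illustrative corpus from Table 3.1).
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    corpus = [
        "We really like Robin's new haircut",
        "Do they really look like that?",
    ]

    bow = CountVectorizer().fit_transform(corpus)    # Bag-of-words counts
    tfidf = TfidfVectorizer().fit_transform(corpus)  # Tf-Idf weights

    # Both give sparse matrices: one row per document, one column per word.
    print(bow.toarray())
    print(tfidf.toarray())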

3.3.1 Bag-of-words

One way of representing text in a numerical way is the Bag-of-words (BOW) model. By taking all words in a document collection, or a corpus, a vocabulary is created. A word or a group of words will then be represented by a vector with the same length as the vocabulary. Each slot in the vector maps to one word in the vocabulary. One word will be represented as a vector with 1 in the slot of that word, and 0 in the rest of the slots. For a sentence, there will be 1’s in the slots of all the words in the sentence. If one word appears more than once in a sentence, it can be represented with that number instead of 1. An example of BOW can be seen in Table 3.1.

This vector representation over the entire vocabulary gives large and sparse vectors, which can be fast to use for computations but also have drawbacks. One drawback is that all information about context is lost, since the BOW only represents which words are used, but not in which order.


Table 3.1: An example of the Bag-of-words model with the sentences "We really like Robin’s new haircut" and "Do they really look like that?"

Word | Sentence 1 | Sentence 2
do | 0 | 1
we | 1 | 0
they | 0 | 1
really | 1 | 1
like | 1 | 1
Robin's | 1 | 0
haircut | 1 | 0
new | 1 | 0
look | 0 | 1
that | 0 | 1

3.3.2 Term Frequency-Inverse Document Frequency

Tf–Idf is also a model for representing words and sentences in numerical form. It works the same way as BOW with the sparse vector representation, but instead of only a simple count of how many times each word occurs in the document, Tf–Idf assigns weights to the words according to their relevance and recurrence in the documents. The concept can be explained as two separate parts. Term frequency, denoted tf(w, d), is how often a certain word appears in one document, in this case one news article. Document frequency, denoted df(w), is how often a certain word appears in the whole corpus of documents. A common word such as any stop word (see Section 3.2.3) would have a high document frequency, indicating that it appears in many documents. It would also have a high term frequency, since it most probably would appear often in a certain document as well. If a word has a high term frequency but a low document frequency, it indicates that this word contains specific information about the content of that specific document. This is the measure of Tf–Idf, which in scikit-learn is implemented as the following equation, where w is a word, d a document, and N the number of documents in the corpus.

$$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \left( \log \frac{1 + N}{1 + \mathrm{df}(w)} + 1 \right) \qquad (3.1)$$

3.4 Non-text features

An article can also be modeled with different features that will represent the article in the news value prediction, not only the plain text. This could be for example metadata about the article. For these sorts of features, a one-hot encoding can be used for representation in the ML model.

3.4.1 One-hot encoding

The one-hot encoding could be described as a binary version of Bag-of-words. It is used as a representation of categorical values that do not have an ordinal relationship. The method takes all the different categorical values for each feature and constructs individual feature fields with boolean values for each alternative. An example can be seen in Table 3.2.


Table 3.2: An example of how a one-hot encoding will convert key-value pairs into boolean feature fields.

Key-value pairs | Generated feature fields

"authors": ["Denise Jacobsson", "Maria Ahlquist"], "year": "2010" | AuthorDeniseJacobsson: 1, AuthorErikFogelström: 0, AuthorMariaAhlquist: 1, year2010: 1, year2011: 0

"authors": ["Erik Fogelström"], "year": "2011" | AuthorDeniseJacobsson: 0, AuthorErikFogelström: 1, AuthorMariaAhlquist: 0, year2010: 0, year2011: 1

3.5 Features in literature

Features used in previous studies have been of different types. Tsagkias et al. [19] had five groups of features named surface, cumulative, textual, semantic, and real-world: surface features are metadata about the article; cumulative features capture whether similar news stories have been published before; textual features are the most interesting terms in the news story; semantic features are named entities and their types; and, finally, real-world features are the temperature at publication time.

The sentiment of an article can also be interesting to look at and has been studied in several works, for example by Arapakis et al. [20]. They studied how readers' engagement is affected by the sentiment and polarity of articles. They found that user engagement can be predicted to a certain degree if sentiment and polarity that arouse human curiosity and attention are taken into account.

Fernandes et al. [21] compiled, during 2015, a dataset of 39,000 articles from Mashable (http://mashable.com), with 62 defined features and the number of article shares as the gold standard target value. The dataset is available from the UCI Machine Learning Repository. It contains features regarding words (the number of words in title and body, the rate of stop words and unique words), links (links to and from other articles), media (whether the article contains images and video), time (date and time of publication), keywords (the keywords of the article and their rating), and NLP (a number of different features about topics, subjectivity, sentiment, and polarity).

3.6 Machine learning prediction

A prediction in the area of machine learning could be seen as extrapolating historical information to get insights about a potential future. By making a machine learning model look at historical data, it will create an approximated function that fits these data points as well as possible. This is called training the machine learning model. The prediction is done by handing over a future data point with the same type of properties and characteristics, called features, and the model will return a prediction of what it believes will be the outcome of the future data point. In supervised learning, the historical data points all have a target value so that the machine learning model knows the correct answer. One example could be a prediction to see if an e-mail is spam or not. The features could be whether the subject line contains one or more of the words "loan", "urgent", "opportunity", "50% off", and "click here". The model would train on historical e-mails that are labeled with "spam" or "not spam". The prediction will result in a label for a future e-mail.

There are two main types of prediction: classification and regression. In classification, the input data point will be assigned to one of two or more discrete classes. It is called binary classification when there are only two classes, as in the example with spam e-mails. With three or more classes, it is called multinomial classification. In regression there is no discretization; instead, the result is a value from a continuous range of values.

3.6.1 Bagging

The word bagging comes from shortening the full description bootstrap aggregation. It is a method that averages the ML model used with it. It improves the stability and accuracy of the model, and can also prevent overfitting.

Bagging works by generating a number N of bootstrapped training sets $D_i$, $i \in \{1, 2, \ldots, N\}$, out of one training set D, and then running multiple runs of a prediction algorithm on each of the bootstrapped training sets. The final result will be the average of the results.

The bootstrapped training sets $D_i$ are created by randomly choosing entries in D, sampling with replacement until the set $D_i$ is of the same size as D. This means that $D_i$ can contain the same element from D several times. If N is large, the bootstrapped sets have around 63.2% ($\approx 1 - 1/e$) unique samples from D, and the rest are duplicate entries [22]. The set of elements that were not chosen is called the Out-of-Bag dataset.
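The 63.2% figure can be verified with a small simulation; this sketch is only an illustration of the bootstrap property, not code from the thesis:

    # Sampling n indices with replacement from a dataset of size n leaves
    # roughly 1 - 1/e of the original elements in the bootstrapped set.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    bootstrap = rng.integers(0, n, size=n)  # indices drawn with replacement
    print(len(np.unique(bootstrap)) / n)    # close to 1 - 1/e ~ 0.632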

3.6.2 Random Forest

Random Forest is an algorithm based on bagging which can be used for both regression and classification, where the latter is used in this work. It creates one bootstrapped training set $D_1$ and trains a decision tree on that set. For each decision node in the tree, instead of taking all features, a random subset of features is considered. When the tree is finished, the algorithm starts all over, creates another bootstrapped training set $D_2$, and builds another decision tree in the same manner. When predicting a value for a future data point, its features are run through all created trees, and the class that most of the trees have chosen is the final result. The depth of the trees, the number of trees, and the number of random features to take into consideration are hyperparameters that can be tuned in order to get a better result.

Random Forests are used in several previous studies. Both Fernandes et al. [21] and Ren and Yang [23] tested different machine learning algorithms to predict the popularity of articles, both using the same dataset from Mashable that Fernandes et al. collected and published [21]. The two studies both found that Random Forest had the best performance on that classification problem compared to the other algorithms tested.
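A minimal Random Forest classification sketch in scikit-learn, with the hyperparameters mentioned above spelled out (synthetic data; the values are placeholders, not the tuned settings from this thesis):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                               n_informative=5, random_state=0)

    clf = RandomForestClassifier(
        n_estimators=100,     # number of trees
        max_depth=None,       # grow each tree until its leaves are pure
        max_features="sqrt",  # random feature subset considered per split
        random_state=0,
    )
    clf.fit(X, y)
    print(clf.predict(X[:5]))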

3.6.3 Support Vector Machines

A Support Vector Machine (SVM) is an ML algorithm that can be used for both regression and classification. For this work, the classification model will be described. During training, the SVM finds a hyperplane that separates the dataset into two classes. The hyperplane can be constructed with different methods that can be linear or non-linear. A 2-dimensional linear example can be seen in Figure 3.1. For multiclass classification, a one-vs-rest approach can be used, which means that one classifier is trained for each class. The base classifiers produce a confidence score for each element instead of a class label, to decide which class the element belongs to. The SVM is not scale invariant, which means that feature values need to be normalized before training to obtain relevant results.


Figure 3.1: Illustration of how an SVM splits 2-dimensional data into two classes with a 1-dimensional line.
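A minimal SVM classification sketch; the scaler reflects the note above that SVMs are not scale invariant, and LinearSVC is one concrete choice that uses a one-vs-rest scheme for multiclass problems (synthetic data, illustrative configuration):

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                               n_informative=5, random_state=0)

    # Normalize features, then fit a linear SVM (one-vs-rest by default).
    clf = make_pipeline(StandardScaler(), LinearSVC())
    clf.fit(X, y)
    print(clf.predict(X[:5]))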

3.6.4 Cross-validation

Cross-validation is a method for getting a general evaluation by only using one dataset. In k-fold cross-validation, the dataset is randomly split into k partitions. By running k models, the partitions take turns being used as the evaluation set, while the remaining k − 1 partitions act as the training set. The evaluation result will be the mean of the results from the k models. An illustration of 5-fold cross-validation can be seen in Figure 3.2.

Figure 3.2: Illustration of 5-fold cross-validation. The dataset is split into five blocks which alternate to act as the evaluation set during training and testing five models.


By using cross-validation, a small dataset can be fully utilized without a fixed split into one training set and one evaluation set. Also, by running the evaluation k times, the results will be more general compared to one run with a fixed split of the dataset.
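A minimal 5-fold cross-validation sketch with scikit-learn (synthetic data for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(scores)         # one score per fold
    print(scores.mean())  # the reported result is the mean over the folds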

3.7 Feature selection

Even though a lot of features have been gathered, it is not certain that they all will boost the result. Some features could instead lead to overfitting, where the machine learning model creates a function that maps extremely well to the training data, but not to any testing data. Removing redundant features could also speed up the execution time for preprocessing, hyperparameter tuning, training, and prediction.

There are different methods for feature selection. Redundant features can be found through correlation analysis, where features that have a high correlation with each other indicate that they represent similar characteristics, so both do not have to be present. Another approach is to find the most important features and remove the rest, which can be done with the help of a Random Forest.

3.7.1 Random Forest for feature selection

When training a Random Forest, information about the importance of the features can be obtained. For each node in a decision tree, a set of features is used for constructing questions that make a split in the tree into two child nodes. The importance of each feature can be calculated by looking at a specific metric, in this case, the Gini Impurity. Other applicable metrics are information gain and entropy.

The Gini Impurity is a metric for measuring the quality of a tree split, where good quality means that the split perfectly separates the elements into two classes, without any overlap. By comparing the Gini Impurity before and after the split, the importance of the introduced features in the split can be determined.
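A minimal sketch of reading Gini-based importances from a trained Random Forest in scikit-learn (synthetic data; in this thesis the same attribute is applied to the article features):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    forest = RandomForestClassifier(random_state=0).fit(X, y)
    importances = forest.feature_importances_  # sums to 1 over all features
    top = np.argsort(importances)[::-1][:3]    # indices of the top 3 features
    print(top, importances[top])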

3.7.2 Feature selection in literature

Fernandes et al. [21] used Random Forest to detect the top 15 features for their model out of the original 62 features. Most of them concerned keywords and closeness to certain topics, but some also concerned the rating of linked articles.

Ren and Yang [23] used the Mashable dataset from [21] and used a few methods for feature selection in their study. They found that Principal Component Analysis (PCA) is only a good option if there is a correlation between the features in the data set. Otherwise PCA will destroy information and give a worse result.

In a paper from 2018, Kamel and Nour [24] looked at different ways in which the 62 features from the Mashable dataset [21] could be optimized. They used correlation, information gain, and relief for feature selection from the dataset. They aimed to remove features with a high correlation with other features (feature redundancy) and features with a low correlation with the class label (feature relevancy). They found that by combining the three sets of remaining features, they could reach 98.1% accuracy with logistic regression.

3.8 News value prediction in the literature

The work by Bandari et al. [25] is one of the first papers in the area of news value prediction before publication. They looked at how news stories were shared on Twitter. Four different features were used: the category of the story; whether the text was written subjectively or objectively; what named entities were mentioned in the text; and what source the story came from. The ML methods used were SVM classification, decision tree, bagging, and Naive Bayes.


3.9 Evaluation

To be able to see how the machine learning models perform and compare different models, evaluation is needed. This section will describe some methods used.

3.9.1 Classifier evaluation

Classifiers are commonly evaluated using accuracy, precision, recall and F1-score. In all of those metrics, the concepts of true and false positives and negatives are used. An illustration of those can be seen in Figure 3.3. The true positives and negatives are called the gold standard.

Figure 3.3: What true and false positives and negatives are in an evaluation of predicting one class.

• Accuracy is the measurement of how many items were correctly classified.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3.2)$$

• Precision measures how many of the selected items were correctly classified for a given class.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (3.3)$$

• Recall measures how many of the relevant items were correctly classified for a given class.

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3.4)$$

• F1 score is the harmonic mean between precision and recall. It reaches 1 when both precision and recall are perfect, and 0 at worst.

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3.5)$$

3.9.2 Root Mean Squared Error

The measure Root Mean Squared Error (RMSE) is most commonly used for regression when there are no categorized true and false positives and negatives. Instead, this is a distance between the predicted values and observed values. Due to the square of the error, the RMSE will reflect big errors more than smaller errors.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2} \qquad (3.6)$$

3.9.3 Mean Absolute Error

The measure Mean Absolute Error (MAE) is, like RMSE, a distance between the predicted values and observed values. The calculation has no square and will therefore show the average error of all predictions.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i| \qquad (3.7)$$

3.9.4 Spearman's rank correlation coefficient

Spearman's rank correlation coefficient, also called Spearman's ρ, is a measurement of rank correlation between two variables, which can be both continuous and ordinal. Spearman's correlation measures the relationship between the two variables and how well it can be described by a monotonic function. A value of −1 indicates a strong negative correlation, 0 indicates no correlation, and 1 indicates a strong positive correlation. The following equation describes Spearman's correlation given that both variables are distinct integers.

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)} \qquad (3.8)$$
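All of the above metrics are available in scikit-learn and SciPy; a minimal sketch with toy values (not results from this thesis):

    import numpy as np
    from sklearn.metrics import (accuracy_score, f1_score,
                                 mean_absolute_error, mean_squared_error)
    from scipy.stats import spearmanr

    y_true = [1, 2, 3, 4, 5, 6, 3, 2]
    y_pred = [1, 2, 4, 4, 5, 6, 3, 1]

    print(accuracy_score(y_true, y_pred))
    print(f1_score(y_true, y_pred, average="macro"))    # macro-averaged F1
    print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE, Equation 3.6
    print(mean_absolute_error(y_true, y_pred))          # MAE, Equation 3.7
    print(spearmanr(y_true, y_pred).correlation)        # Spearman's rho, Equation 3.8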


4 Method

This chapter will describe all steps taken in order to answer the research questions in Section 1.3.

4.1 Environment

The code is written in Python 3.6, and for the machine learning components, scikit-learn is used.

4.2 Data set

The data consists of news articles and events between 2019-12-01 and 2020-04-02. In total there are 45,072 articles. Since events like clicks and read times on the articles were going to be collected, some time was needed for those events to accumulate. To manage that, articles created after 2020-03-22 were removed from the dataset, and only events on the remaining articles were collected after that date.

The articles come from Gota Media and their 20 different newspapers and news sites. Some of the newspapers and news sites are free and therefore do not have any data regarding the number of subscribers, which was going to be used in the popularity model (see Section 4.3). Articles that were only published in one of those newspapers were removed from the set.

The data was also filtered to remove data that was not supposed to be considered in the model. All user events from unregistered users were removed, since a paywall on some articles could affect the number of users actually being able to read the article. Also, news articles with an invalid manual news value were removed, since a correct value has to exist in order for a comparison to be made. The tags were also going to be used in the prediction, so articles without tags were removed from the dataset.

The datasets were divided into training, validation, and test sets with a 60-20-20 split, giving 60% of the data to the training set and 20% each to the validation and test sets. The split was done in a stratified fashion, which means that the distribution of the modeled news values was taken into consideration, and the resulting datasets would keep the same distribution. All datasets and their number of articles can be seen in Table 4.2.
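A minimal sketch of how such a stratified 60-20-20 split can be done with scikit-learn (placeholder data; the real labels are the modeled news values):

    import numpy as np
    from sklearn.model_selection import train_test_split

    articles = np.arange(1000).reshape(-1, 1)      # placeholder feature matrix
    news_values = np.repeat([1, 2, 3, 4, 5], 200)  # placeholder labels

    # 60% training, then the remaining 40% is halved into validation and test.
    X_train, X_rest, y_train, y_rest = train_test_split(
        articles, news_values, test_size=0.4, stratify=news_values, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

    print(len(X_train), len(X_val), len(X_test))   # 600 200 200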

The final number of articles for each newspaper after all filtering can be seen in Table 4.1. Note that since some articles are published in more than one newspaper, the sum is larger than the total number of articles.

Table 4.1: All newspapers, their number of articles in the dataset and their number of subscribers for both paper and digital issue [26].

Newspaper | No of articles | Subscribers
BLT Blekinge Läns Tidning | 3,784 | 33,100
Barometern Oskarshamnstidningen | 6,453 | 38,700
Borås Tidning | 3,855 | 40,100
Kristianstadsbladet | 3,316 | 26,600
Ölandsbladet | 1,153 | 7,900
Smålandsposten | 4,345 | 34,500
Sydöstran Sydöstra Sveriges Dagblad | 3,452 | 10,500
Trelleborgs Allehanda | 1,144 | 8,000
Ulricehamns Tidning | 631 | 8,100
Ystads Allehanda | 3,787 | 22,000

Table 4.2: The datasets for training and prediction.

Dataset name | Has reading time | Number of articles | Train-validation-test
main | No | 29,223 | 17,533 - 5,845 - 5,845
bt | No | 3,855 | 2,313 - 771 - 771
bt-rt | Yes | 3,132 | 1,878 - 627 - 627

4.2.1 Reading time dataset

The articles in the main dataset did not contain data about reading times. One separate dataset for this data was acquired. This data was aggregated in an inconsistent environment and was therefore not exact. Each entry in the dataset included:

• Date: The date the data concerned.

• Article ID: The unique identifier for the article.

• Logged in Status: Indicated if the user was logged in (subscriber) or not.

• Unique Page Views: The number of page views for the article during the day for that type of user.

• Avg. Time on Page: The average time spent on the page for the number of unique page views.

This dataset was merged with the main dataset using an inner join on Article ID. The resulting dataset, named bt-rt, consisted of 3,132 articles, all from Borås Tidning (bt). This dataset was a subset of all articles from Borås Tidning in the main dataset. The dataset has an entry in Table 4.2.
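A minimal sketch of the inner join with pandas (the column names here are assumptions for illustration, not taken from the thesis code):

    import pandas as pd

    articles = pd.DataFrame({
        "article_id": [1, 2, 3],
        "headline": ["A", "B", "C"],
    })
    reading_times = pd.DataFrame({
        "article_id": [2, 3, 4],
        "avg_time_on_page": [35.0, 80.5, 12.0],
    })

    # The inner join keeps only articles present in both datasets,
    # which is how the bt-rt subset arises.
    bt_rt = articles.merge(reading_times, on="article_id", how="inner")
    print(bt_rt)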


4.2.2 Data fields

For each article, metadata is stored in a JSON format. Here follows a list of all data fields that belong to each article.

Text features

• Headline: The headline of the article.

• Preamble: The short introduction text that briefly describes the article.
• Body: The text in the article.

Other features

• Article ID: Unique identification number for each article.
• Created timestamp: Timestamp when the article was created.

• Dateline: An editorial category for the main channel that the article belongs to.
• Language: What language the article is written in.

• Expiration timestamp: The timestamp, based on the lifetime value, when the article is considered dead.

• Newspapers: What newspapers the article has been published in.
• Published: Boolean value indicating if the article is published or not.

• Lifetime: The lifetime value set by the reporter mentioned in Section 2.3.

• Minute read: An approximated value for how long it would take to read the article based on average reading speed.

• News value: The news value estimated by the reporter, mentioned in Section 2.3.
• Channels: What sections of the newspaper the article belongs to, e.g. sport or national news.

• Tags: The tags made by the iMatrics tag engine, mentioned in Section 2.1.1. Includes topic, category and named entities.

– Tag ID: All tags have a unique identification number.
– Title: The name of the tag.
– Type: The type of the tag, e.g. person or category.
– Articles: A list of all article IDs that are tagged with the tag.

• Latest version timestamp: Timestamp when the latest change was made to the article.
• Source data: The HTML source code for the website.

• Authors: The reporters who wrote the article and the photographers who took the photos.

• Publication timestamp: Time when the article was published.
• Events: A list of all occurrences of when a user clicks on the article.

– Newspaper: What newspaper the event happened on.
– Timestamp: The timestamp of the event.
– URL: The URL to the article.
– User ID: The identification number of the user. If the user is not a registered subscriber, the user ID will contain periods, or be set to 0.


4.3 Popularity model

To be able to measure the popularity of an article in a reasonable way, a model for popularity had to be created. Since several previous studies have shown that a single metric, like the number of clicks or shares on social media, is too shallow and only captures a single dimension of user behavior, the model had to consist of a number of different parameters. On the other hand, the available data also limited what could be used. The newspapers that supplied the data did not have comments on their articles. They are local newspapers that by nature deliver news stories of local interest, which means that the base of potential readers is rather limited; the number of shares on social media might therefore be too low to give any insights. However, both clicks and time spent reading articles were available.

Two models for news values were constructed. The first model is based on the fact that even though a reader clicks on an article, they might not be satisfied with or interested in it, and therefore navigate to another page rather quickly. This increases the number of visits, but not the time spent reading. By multiplying the number of visits with the total reading time, the dissatisfied readers have an impact on the popularity measurement. Since the dataset contains articles from different newspapers with different numbers of articles and subscribers, the numbers had to be scaled accordingly. Therefore, the multiplied values were divided by the number of subscribers for each newspaper in 2019 [26].

\text{Popularity} = \frac{\text{Number of clicks from subscribers} \cdot \text{Total reading time}}{\text{Number of paying subscribers}} \qquad (4.1)

The second model was based on the first one, but without the time spent reading. This model was constructed to make it possible to compare how the results would change if the reading time was not available, which is the case for some of iMatrics' customers.

\text{Popularity} = \frac{\text{Number of clicks from subscribers}}{\text{Number of paying subscribers}} \qquad (4.2)

Since some articles were published in more than one newspaper, the resulting popularity scores for each newspaper were averaged into a single score, as in the sketch below.
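The two models can be expressed as a minimal Python sketch. The click counts here are made-up example values; the subscriber counts are the 2019 figures used for scaling:

import numpy as np

def popularity_with_reading_time(clicks, total_reading_time, subscribers):
    # Model 1 (Equation 4.1): dissatisfied readers add clicks but
    # little reading time, so the product penalizes quick bounces.
    return clicks * total_reading_time / subscribers

def popularity_clicks_only(clicks, subscribers):
    # Model 2 (Equation 4.2): for data without reading times.
    return clicks / subscribers

# An article published in two newspapers; click counts are fabricated
# examples, subscriber counts come from the 2019 figures.
scores = [popularity_clicks_only(812, 40_100),   # Borås Tidning
          popularity_clicks_only(301, 33_100)]   # Blekinge Läns Tidning
article_popularity = np.mean(scores)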


Figure 4.1: The distribution of manual news values in the main dataset.

The calculated popularity values had to be translated to news values in the range between 1 and 6. This was done by binning the data according to four different strategies. The distribution of manual news values in different subsets of the main dataset, which can be seen in Figure 4.1, became the base for how the different binning strategies were constructed. The size of the bins for a strategy was decided based on the percentage rate of each news value in a dataset. For three of the strategies, the main dataset, a subset of the main dataset with articles from Barometern, and a subset of the main dataset with articles from Ölandsbladet were used as the foundation for the bin sizes. Barometern was chosen for being a medium-sized newspaper with large interest and commitment in the automated news value assessment process. Ölandsbladet was chosen for being one of the smallest newspapers, which might have an interesting distribution of news values. The fourth strategy was a guessed strategy, constructed from simplifying assumptions about how the news values were described by the representatives of the newspapers.

The binning was done by dividing the popularity values according to the strategy's bin sizes. The popularity values were sorted, and the value in the dividing position between two bins became the upper limit for the lower bin. However, some problems arose that had to be handled. First, when two limit popularity values were the same, the bin limits would not be unique. This happened with the first bin, for news value 1, where many of the lower popularity values were equal to zero. For those cases, the limit for the first bin had to be the first value that was not equal to zero. Second, for the bot strategy on the bt-rt dataset, the bin rate of 0.01% for class 6 resulted in that class containing less than one element, since a rate of 0.03% would equal one element. Therefore, the largest limits were adjusted so that class 6 contained one element; the rest of the limits were kept as the distribution showed. This was done for the different binning strategies, and the resulting bin rates can be seen in Table 4.3. With the constructed bins, all popularity values that belonged to the same bin got the same modeled news value. To evaluate the first popularity model, with reading time, binning with all strategies was done on the bt-rt dataset, and the correlation between the manual news values and the modeled news values was calculated using Spearman's rank correlation. The same procedure was done for the second popularity model, but the binning was performed on the main dataset. A sketch of the binning and correlation measurement follows after Table 4.3.

Table 4.3: The binning strategies, what newspapers and how many articles they were based on, and the rates of each news value bin.

Strategy                     No. of articles   1        2        3        4        5       6
all (All)                    29,223            20.56%   20.78%   36.72%   18.50%   3.36%   0.09%
guess                        -                 23.35%   30.0%    30.65%   10.0%    5.0%    1.0%
bot for main (Barometern)    6,453             10.80%   30.54%   35.83%   19.65%   3.18%   0.01%
bot for bt-rt (Barometern)   6,453             10.80%   30.53%   35.82%   19.64%   3.18%   0.03%
ob (Ölandsbladet)            1,153             21.94%   46.14%   18.99%   7.98%    4.34%   0.61%

4.4 Feature collection

First, a large set of features was extracted from the dataset. The features had to be available before publication time, and could therefore not contain elements like timestamps of changes in the article or user events that had happened since publication. The collected features can be seen in Table 4.4. Some features were based on historical news values for tags and authors. By accumulating news values of articles tagged with a certain tag, features could be constructed, for example by taking the average news value for each tag and then taking the maximum of those values for a future article. That would give a feature called "Maximum of average news values for all tags". This process was done for different combinations of maximum, minimum, and average news value, for both tags and authors, and can be seen as features 28-31 and 35-38 in Table 4.4. A sketch of the tag variant follows below.
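A sketch of how the historical tag features could be computed; the layout of the history data and the feature names are assumptions for illustration:

# Hypothetical history: news values of earlier articles per tag.
tag_history = {
    "Brott och straff": [3, 4, 2, 5],
    "Fotboll": [2, 2, 3],
}

def historical_tag_features(article_tags, history):
    """Feature 30 in Table 4.4: max/min/avg of tags' average news value."""
    averages = [sum(history[t]) / len(history[t])
                for t in article_tags if t in history]
    if not averages:
        return {}
    return {
        "max_of_avg_tag_news_value": max(averages),
        "min_of_avg_tag_news_value": min(averages),
        "avg_of_avg_tag_news_value": sum(averages) / len(averages),
    }

features = historical_tag_features(["Brott och straff", "Fotboll"], tag_history)
# -> max 3.5, min ~2.33, average ~2.92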

4.4.1 Sentiment analysis

The sentiment analysis was made using the VADER sentiment tool for Swedish [27]. The sentiment was calculated for the headline, the preamble, and the body. It is calculated by looking up each word in a sentiment lexicon, where the words are annotated with a valence score. The values are summed and normalized for each sentence, resulting in a compound score between -1 and 1. A positive value indicates a positive tone or meaning, and a negative value indicates the opposite. A value of 0 indicates a neutral tone.
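The computation follows the pattern below. The sketch uses the standard English vaderSentiment package as a stand-in, since the interface of the Swedish adaptation may differ; the texts are placeholder strings:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

headline = "An example headline"          # stand-ins for real article text
preamble = "A short example preamble."
body = "The full example body text."

# The compound score is the normalized sum of the valence scores of
# the words, in the range [-1, 1].
sentiments = {name: analyzer.polarity_scores(text)["compound"]
              for name, text in [("headline", headline),
                                 ("preamble", preamble),
                                 ("body", body)]}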


Table 4.4: The collected features for each article.

No   Feature                                                    Type
1    Headline                                                   Text
2    Body                                                       Text
3    Number of paragraphs                                       Int
4    Rate of POS tags for each tag                              Float
5    Average length of paragraphs                               Float
6    Maximum length of paragraphs                               Int
7    Minimum length of paragraphs                               Int
8    Sentiment of body                                          Float
9    Sentiment of headline                                      Float
10   Sentiment of preamble                                      Float
11   Creation year                                              Int
12   Creation month                                             Int
13   Creation day                                               Int
14   Creation weekday                                           Int
15   Creation hour                                              Int
16   Dateline                                                   String
17   Lifetime                                                   String
18   Estimated read time                                        Float
19   Has image                                                  Boolean
20   Has video                                                  Boolean
21   Language                                                   String
22   Newspapers                                                 List of strings
23   Number of newspapers                                       Int
24   Channels                                                   List of strings
25   Number of channels                                         Int
26   All tags' titles                                           List of strings
27   Number of tags in the article                              Int
28   Max and average of all tags' maximum news value            Int, float
29   Min and average of all tags' minimum news value            Int, float
30   Max, min and average of all tags' average news value       Int, int, float
31   Max, min and average number of articles for each tag       Int, int, float
32   Number of tags for each tag type                           Int
33   All authors                                                List of strings
34   Number of authors of the article                           Int
35   Max and average of all authors' maximum news value         Int, float
36   Min and average of all authors' minimum news value         Int, float
37   Max, min and average of all authors' average news value    Int, int, float
38   Max, min and average number of articles for each author    Int, int, float


4.5 Text pre-processing

Five elements of text were pre-processed: the headline, the preamble, the body, the tags, and the authors. The first three were tokenized using the built-in tokenizer in the TfidfVectorizer from scikit-learn. The tokenizer uses a regular expression to select tokens, described in the documentation as follows:

"The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)."

The vectorizer also lower-cased all words and removed stop words, using a list of Swedish stop words from the Natural Language Toolkit (NLTK) [28]. The complete list can be seen in Appendix A.

The tags and the authors were first formatted to be one string each. For example, the tag "Brott och straff" and the author "Andreas Jenefeldt" would be formatted to "Brott_och_straff" and "Andreas_Jenefeldt". This was done so that the words would not be separated into different tags or authors. All tags and all authors were then concatenated into one string each, to fit the input requirements of the vectorizers. After this, they were lower-cased and tokenized by the vectorizers. Since there were no stop words in the tags or authors, no stop word removal was performed for those. The steps are sketched below.
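A minimal sketch of the pre-processing, assuming the stop word list in Appendix A corresponds to NLTK's Swedish list:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Swedish stop words from NLTK (requires nltk.download("stopwords")).
swedish_stop_words = stopwords.words("swedish")

# Lower-casing and the 2+-alphanumeric-character token pattern are
# scikit-learn defaults; only the stop word list is added here.
text_vectorizer = TfidfVectorizer(stop_words=swedish_stop_words)

# Tags and authors become single tokens by replacing spaces with
# underscores, then all tags of an article are joined into one string.
tags = ["Brott och straff", "Fotboll"]
tag_string = " ".join(tag.replace(" ", "_") for tag in tags)
# -> "Brott_och_straff Fotboll"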

4.6 Vectorizing

The set of extracted features consisted of five different parts, which were vectorized with three different types of vectorizers: CountVectorizer, TfidfVectorizer, and DictVectorizer from scikit-learn, which use Bag-of-words or Tf-Idf approaches to vectorize. The headline and body consisted of written text and were vectorized with one Tf-Idf vectorizer each. The tags and the authors were vectorized with one CountVectorizer each. This was done because one article can have several tags and authors, and by using a CountVectorizer the tags and authors were one-hot encoded. The collected features that consisted of other metadata or NLP features were vectorized with a DictVectorizer, which created a one-hot encoding for all features. The five vectors were concatenated and used in the ML models, as in the sketch below.
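A condensed sketch of the vectorization pipeline with illustrative toy data for two articles; in practice the vectorizers would be fitted on the training set only:

from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

headlines = ["Stor brand i centrala staden", "Hemmalaget vann derbyt"]
bodies = ["Räddningstjänsten larmades under natten.", "Matchen slutade 2-1."]
tag_strings = ["Brott_och_straff", "Fotboll"]
author_strings = ["Jane_Doe", "John_Doe"]
other_features = [{"has_image": True, "creation_hour": 8},
                  {"has_image": False, "creation_hour": 17}]

# One vectorizer per part; the resulting sparse matrices are
# concatenated column-wise into a single feature matrix.
X = hstack([
    TfidfVectorizer().fit_transform(headlines),
    TfidfVectorizer().fit_transform(bodies),
    CountVectorizer().fit_transform(tag_strings),
    CountVectorizer().fit_transform(author_strings),
    DictVectorizer().fit_transform(other_features),
])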

4.7 Feature selection

The feature selection was done using a Random Forest classifier for each subset of the data. The classifier was trained on the training set and tested on the validation set. The total number of features was 260,071 for the main dataset, and the top 500 were chosen to be used in the model. The number of features for each part can be seen in Table 4.5. By looking at the feature importances in the trained Random Forest, the most important features could be selected, as sketched below.
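A sketch of the selection step. The feature matrix and labels here are random dummy data standing in for the concatenated vectors from the previous section:

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

# Dummy stand-ins for the concatenated feature matrix and the labels.
X_train = sparse_random(1000, 2000, density=0.01, format="csr", random_state=0)
y_train = np.random.default_rng(0).integers(1, 7, size=1000)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Rank all features by impurity-based importance and keep the top 500.
top_500 = np.argsort(forest.feature_importances_)[::-1][:500]
X_train_selected = X_train[:, top_500]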


Table 4.5: The number of features after vectorizing each part of the datasets.

Part             main      bt       bt-rt
Headline         77,137    14,413   11,928
Body             167,235   58,654   50,295
Tags             10,631    3,058    2,687
Authors          3,031     828      725
Other features   2,037     521      473
Total            260,071   77,474   66,108
Chosen           500       500      500

4.8 Hyperparameter Tuning

The tuning of hyperparameters was performed for Random Forest and for SVM on the main dataset, using a combination of Random Search and Grid Search. Random Search was used for finding a space of parameters that Grid Search would then explore in more detail. For Random Forest, four rounds of tuning were executed, of which the first three used Random Search and the last used Grid Search. For SVM, two rounds of Random Search were performed, followed by one round of Grid Search. The different parameter spaces and the best parameters (in bold) can be seen in Table 4.6 for Random Forest and Table 4.7 for SVM. The search was carried out using 5-fold cross-validation, with the training part of the dataset as input data. This means that each combination of parameters was run five times and the resulting values were averaged to get the performance of that parameter combination.

Table 4.6: The parameter spaces for hyperparameter tuning Random Forest with Random Search and Grid Search.

Round   Search method   n_estimators                     max_features
1       Random Search   [100, 150, 200, 250, 300]        [2, 20, 40, 60, 80, 100]
2       Random Search   [200, 250, 300, 350, 400]        [100, 120, 140]
3       Random Search   [250, 300]                       [120, 140, 160]
4       Grid Search     [250, 260, 270, 280, 290, 300]   [140, 150, 160, 170, 180]

Table 4.7: The parameter spaces for hyperparameter tuning SVM with Random Search and Grid Search.

Round   Search method   C                                     loss                         tol
1       Random Search   [1.0, 2.0, 3.0, 4.0, 5.0]             ['hinge', 'squared_hinge']   [1e-05, 0.0001, 0.001]
2       Random Search   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   ['hinge', 'squared_hinge']   [1e-06, 1e-05, 0.0001, 0.001]
3       Grid Search     [1.0, 3.0, 5.0, 6.0, 7.0]             ['squared_hinge']            [1e-07, 1e-06, 5e-06, 1e-05]

In Random Forest, the parameter n_estimators is the number of decision trees that the model constructs, and max_features is the maximum number of features that are considered when a split is made during tree construction. For SVM, C is the regularization parameter, loss is the loss function used, and tol is the tolerance for the stopping criterion.
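The two-stage search can be sketched with scikit-learn's built-in search utilities. The parameter spaces below are taken from the first and last Random Forest rounds in Table 4.6, and X_train_selected and y_train are reused from the feature selection sketch above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Round 1: Random Search over the coarse space from Table 4.6,
# evaluated with 5-fold cross-validation on the training data.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [100, 150, 200, 250, 300],
                         "max_features": [2, 20, 40, 60, 80, 100]},
    n_iter=10, cv=5, random_state=0)
coarse.fit(X_train_selected, y_train)

# Final round: Grid Search over the narrowed-down space.
fine = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [250, 260, 270, 280, 290, 300],
                "max_features": [140, 150, 160, 170, 180]},
    cv=5)
fine.fit(X_train_selected, y_train)
best_params = fine.best_params_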
