Linköpings universitet/Linköping University | IDA Bachelor, 16hp | Innovativ programmering (IP) Spring term 2021 | LIU-IDA/LITH-EX-G--21/060—SE
Evaluation Of Methods For Automatically
Deciding Article Type For Newspapers
Adam Eriksson
Rita Kovordanyi Jalal Maleki
Copyright
The publishers will keep this document online on the Internet – or its possible replacement – for a period
of 25 years starting from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for anyone to read, to
download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial
research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All
other uses of the document are conditional upon the consent of the copyright owner. The publisher has
taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is
accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for
publication and for assurance of document integrity, please refer to its www home page:
https://ep.liu.se/.
Evaluation Of Methods For Automatically Deciding Article
Type For Newspapers
Adam Eriksson
Linköping, Sweden
adaer173@student.liu.se
ABSTRACT
For each article written at most modern newspapers, the journalist or another staff member also has to enter metadata about the article into the paper's Content Management System. This thesis investigates the possibility of automatically determining the type of an article, using two algorithms commonly applied within NLP. The goal of the thesis is to evaluate what performance can be achieved with different numbers of training articles. The best results are achieved when the dataset is as large as possible while still maintaining a balance between article types: both accuracy and F1-score reach 0.78-0.79 for both algorithms with up to 10 000 articles per type used for training.
INTRODUCTION
Metadata is becoming increasingly important in the media industry [1], both for deciding where to show content and for statistical analysis of live and historical articles. The burden of creating this additional metadata often falls on the journalist writing the article, which takes up more of the journalist's time and tends to produce low-quality metadata, since the people producing it are not metadata experts.
To solve this problem, an increasing number of media houses are looking at automating this process, either by developing solutions in-house or by hiring external help to analyze their content.
In this thesis, I evaluate different methods for automatically deciding which type of commonly occurring newspaper text (e.g., news reports, feature articles, editorials, and opinion pieces) an article is. This can save a small amount of time for each newly written article, or a large amount of time when a newspaper wants to add metadata to all of its old articles in order to index them in modern metadata-driven systems. Without good automation, reading through and adding metadata to millions of articles by hand could take years and cost a great deal of money.
QUESTION
What accuracy and F1-score can be achieved for classifying news articles into the most used article types using a naive Bayes classifier compared to a random forest classifier for different numbers of articles in the training data?
DATASET
The dataset consists of around half a million articles written at the Swedish media house Gota media between January 2018 and December 2020. Each article has an associated type that was set by the journalist who wrote it. For the number of articles of each type in the dataset, see Table 1.
Category    Type              Articles
Reportage   Reportage         23 597
            Feature           893
Nyhet       Notis             15 535
            Nyhet             382 487
            Bakgrund          87
            Liverapportering  1 250
            Intervju          430
            Förhands          3 140
            Arkivmaterial     172
            Minnesord         315
            Sammanfattning    227
            Granskning        125
            Lokalkorre        88
            Biografi          84
            Exklusivt         62
            Råd               18
Åsikt       Ledare            9 032
            Insändare         15 092
            Krönika           12 061
            Debatt            14 049
            Kommentar         1 142
            Ledarkrönika      2 600
            Reflektion        404
            Essä              169
Recension   Recension         6 567
Table 1 Number of articles per type in the dataset. The left column shows article categories, the middle column shows article types, and the right column shows the number of articles.
THEORY
In this chapter, I will introduce and explain the terms that are needed to follow the rest of the paper.
Scikit-learn
Scikit-learn (https://scikit-learn.org/) is an open-source machine learning library written in Python. It has many uses; in this thesis, I use it to apply and evaluate different classification algorithms more easily.
TF-IDF Vectorizer
To classify text with these algorithms, the text first needs to be vectorized: each word must be given a score. TF-IDF stands for "term frequency-inverse document frequency" and assigns weights to words that signify how important they are. This weighting down-weights terms that are common across all source documents while highlighting rare terms that appear frequently in a small subset of the documents. Term frequency (TF) is a value that increases with the number of occurrences of the term in the document, usually computed as the logarithmically scaled count. Inverse document frequency (IDF) is the natural logarithm of the total number of documents divided by the number of documents that contain the term. The TF-IDF weight is the product of the two.
For example, the term “the” will appear frequently in all documents, and thus be assigned a high value through TF, but will receive a very low value through IDF which will make its final score very low.
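This effect can be seen directly in scikit-learn's TF-IDF implementation. The following is a minimal sketch using made-up toy documents (not from the thesis dataset); a stop word such as "the" appears in every document and therefore receives the lowest possible IDF, while a rarer word like "crime" scores higher.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, illustrative only.
docs = [
    "the police reported a serious crime",
    "the council debated the new budget",
    "the team won the championship game",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

vocab = vectorizer.vocabulary_   # term -> column index
idf = vectorizer.idf_            # IDF weight per column

# "the" occurs in every document, so its IDF is minimal;
# "crime" occurs in only one document, so its IDF is higher.
print(idf[vocab["the"]] < idf[vocab["crime"]])  # True
```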
Text Classification
There are many different algorithms for text classification, each suited to different scenarios. Review articles provide a useful basis for choosing between them. Hartmann and colleagues [2] compared ten algorithms on 41 datasets and found that random forest is one of the most consistent performers across many different classification problems.
Naive Bayes Classifier
A naive Bayes classifier is a simple probabilistic classifier: it looks at the probability of each word in a text belonging to a certain class and multiplies these probabilities together to get a score for the whole text. Say, for example, that the word "crime" has been used 100 times in total, 70 of them in news articles, and that the word "serious" occurred 50 times out of 100 in news articles. Then the text "serious crime" would get a score of 0.7 * 0.5 for being a news article. This score is calculated for every possible type, and the type with the highest score is selected as the answer. In practice, the products would become too small for longer texts, so the probabilities are converted to log(p) and summed to get the final scores.
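The log-probability trick can be checked with a few lines of arithmetic, using the word probabilities from the example above:

```python
import math

# Word-given-class probabilities from the example in the text:
# P("crime" | news) = 70/100, P("serious" | news) = 50/100.
p_crime_news = 0.7
p_serious_news = 0.5

# Multiplying probabilities directly underflows for long texts ...
score_product = p_serious_news * p_crime_news  # 0.35

# ... so in practice the log-probabilities are summed instead.
score_log = math.log(p_serious_news) + math.log(p_crime_news)

# The log of the product equals the sum of the logs, so the
# ranking of candidate types is unchanged.
print(math.isclose(math.log(score_product), score_log))  # True
```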
Random Forest
The random forest classifier works by training multiple decision tree classifiers. Each decision tree is trained on a random subset of the training data, and when a prediction is made each tree computes an answer; the most common answer is selected as the result.
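The majority vote at prediction time can be illustrated with a toy example (the "tree" predictions below are made up for illustration):

```python
from collections import Counter

# Suppose five trees in the forest have each produced a prediction
# for one article; the most common label wins the vote.
tree_predictions = ["Nyhet", "Notis", "Nyhet", "Debatt", "Nyhet"]
winner, votes = Counter(tree_predictions).most_common(1)[0]
print(winner, votes)  # Nyhet 3
```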
Measurements
Accuracy
Accuracy is a standard measurement: the fraction of classifications that the model predicts correctly. Accuracy is written here as a decimal fraction, so if we have 100 articles and the model predicts 75 of them correctly, the accuracy is 0.75.

Accuracy = Correct guesses / Total guesses
F1-score
F1-score [3] is the harmonic mean of precision and recall. Precision is the fraction of guesses of a specific article type that are correct: if the model guessed "insändare" for 10 articles and 7 of them were correct, the precision is 0.7. Recall is the fraction of the articles that actually have a given type that were correctly guessed.

Precision = True positives / (True positives + False positives)
Recall = True positives / (True positives + False negatives)
F1 = 2 * (precision * recall) / (precision + recall)

Article Types
Previous research has shown that a more detailed categorization of old articles is important for optimal use of digitized newspapers [4]. When this data is available, it is easier to search old articles in a historical archive to find what you are looking for. The metadata is also used in modern content creation and publishing tools to guide decisions about recommendations, page structure, and design. Modern newspapers have many article types; some of the most common include:
• News article
• Opinion column
• Feature story
• Profile
• Letter to the editor
• Review
Each type has different content and a different way of presenting it. For example, a news article should be a collection of facts where the author's opinion is not allowed to show, while a review is the exact opposite: the author's opinions are the focus.
Related Work
While machine learning techniques have been used to generate metadata for news articles before, for example through named entity recognition [5] or to predict an article's popularity [6], it is difficult to find published work on predicting article types.
METHOD
In this chapter, I will describe what I did and how I made the measurements that are later used to discuss the results.
Data Processing
I started by removing all article types with fewer than 5 000 occurrences, to make sure I had enough data to train on for every type.
That left me with the following article types:
• Reportage
• Notis
• Nyhet
• Ledare
• Insändare
• Krönika
• Debatt
• Recension
After that, I split the articles by type and set aside 10% of each type for testing.
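A per-type 10% hold-out like this can be sketched with scikit-learn's stratified split, which keeps every type's proportion equal in the training and test sets. The `texts` and `types` lists below are placeholders standing in for the real article bodies and labels:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 articles, 50 of each of two types.
texts = ["article text a", "article text b"] * 50
types = ["Nyhet", "Debatt"] * 50

# stratify=types keeps the 50/50 type balance in both splits.
train_texts, test_texts, train_types, test_types = train_test_split(
    texts, types, test_size=0.10, stratify=types, random_state=0
)

print(len(test_texts))  # 10
```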
Algorithms And Training
Both the naive Bayes and random forest algorithms are available in scikit-learn, which I used to train my models. I started by converting the texts to vectors using the built-in TF-IDF vectorizer; these vectors were then piped into one of the algorithms for training.
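A TF-IDF-plus-classifier setup of this kind can be sketched as a scikit-learn pipeline. The tiny training set below is made up for illustration; swapping `MultinomialNB` for `RandomForestClassifier` gives the other model:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy training data, not the real article texts.
train_texts = [
    "police report on last night events",
    "i strongly disagree with the council decision",
] * 5
train_types = ["Nyhet", "Insändare"] * 5

# Vectorize with TF-IDF, then classify with naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_types)

print(model.score(train_texts, train_types))  # 1.0 on this toy data
```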
I trained and saved models using 10, 100, 1 000, and 10 000 articles of each type, as well as all available articles. Where 10 000 articles were not available, I used the maximum available for that type and 10 000 for the types that had enough. For example, only 8 128 "Ledare" articles were available in the training set when training with 10 000 or unlimited articles per type. The final training set was therefore quite unbalanced: "Reportage" had 23 597 articles to train on while "Ledare" was still at 8 128.
Evaluation
To evaluate, I fed the previously set-aside test articles to scikit-learn's built-in metric methods, which measured the accuracy, precision, recall, and F1-score of each trained model, and saved the results.
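This evaluation step can be sketched with scikit-learn's metric functions; the true and predicted labels below are made up in place of the real test set:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels: 5 test articles, 4 predicted correctly.
y_true = ["Nyhet", "Nyhet", "Debatt", "Insändare", "Debatt"]
y_pred = ["Nyhet", "Debatt", "Debatt", "Insändare", "Debatt"]

accuracy = accuracy_score(y_true, y_pred)               # fraction correct
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(accuracy)  # 0.8
```

The `average="weighted"` option matches the weighted average F1-scores reported in the Results chapter: each type's F1 is weighted by its number of test articles.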
RESULTS
This chapter will present the results that were acquired when evaluating the trained models.
Accuracy
The accuracy when training the models on different numbers of articles can be seen in Table 2 and Figure 1. As shown, the accuracy increases as more articles are added, until the unbalanced "unlimited" dataset, where the accuracy of the random forest classifier goes down.
Articles/type   Naive Bayes   Random forest
10              0.44          0.54
100             0.64          0.70
1 000           0.74          0.74
10 000          0.79          0.79
Unlimited       0.79          0.73
Table 2 Accuracy measurements
Figure 1 Accuracy measurements
F1-score
The same pattern can be seen in the weighted average F1-scores.

Articles/type   Naive Bayes   Random forest
10              0.41          0.54
100             0.62          0.69
1 000           0.73          0.73
10 000          0.79          0.78
Unlimited       0.79          0.71

Table 3 F1-scores
Figure 3 Confusion matrix on the test set from the naive Bayes model trained on 10 000 articles. The rows are the correct article types and the columns are the guesses that the model made. A perfect result would be zeros everywhere except on the diagonal from the top left to the bottom right. The colors highlight higher numbers to make the results easier to read at a glance.
Figure 2 F1-scores

Confusion Matrices
The biggest flaws can be seen in the confusion matrices: the largest confusion happens among the article types that all belong to the category "Åsikt". While a "Reportage" is occasionally confused with an "Insändare", this happens much less frequently than "Insändare" and "Debatt" articles being confused with each other.
The difference a bigger dataset makes is easy to see when comparing Figure 3 and Figure 4: when training on only 100 articles it is, for example, more common for an "Insändare" article to be labeled with the "Debatt" article type.
DISCUSSION
Accuracy Over Different Training Sets
As can be seen in Table 2 and Figure 1, the accuracy starts quite low for both algorithms (44%, 54%) but still significantly above what random guessing would achieve (12%). The random forest algorithm gets significantly better results with smaller training sets, but with 1 000 articles or more the naive Bayes algorithm catches up and delivers similar results. Another noteworthy observation is that the random forest algorithm decreases in accuracy on the largest dataset. This is likely because it handles an uneven number of articles per type poorly and starts picking the most common article type in the training set in close cases.
F1-scores
As can be seen in Table 3 and Figure 2, the F1-scores follow a curve very similar to the accuracy: the naive Bayes algorithm starts off worse and then performs similarly at 1 000-10 000 articles per type, before the random forest algorithm starts making more mistakes.
Figure 4 Confusion matrix on test-set from Naive Bayes model trained on 100 articles.
Performance
With training sets of 100 or fewer articles per type, the training times are very similar: both algorithms complete their training in a couple of seconds. However, as the training set grows to 1 000 and 10 000 or more articles per type, the random forest algorithm takes from around a minute up to about 15 minutes for the largest dataset, while the naive Bayes algorithm still only takes a couple of seconds to run.
Dataset And Confusion Matrices
In this thesis, I focus on predicting the specific article types of the articles. The types can, however, be divided into four main categories: "Reportage", "Nyhet", "Åsikt", and "Recension". Of these, only "Åsikt" has more than one article type used in the training. Confusion is very noticeable for all "Åsikt" articles, especially between the "Debatt" and "Insändare" types. The main difference between those two types is who writes them: "Debatt" articles are often written by leaders in a field, and "Insändare" by interested readers. Since I did not use the authors' names as a feature, the model had no good way to differentiate between the two. Still, when trained with 10 000 articles per type, both algorithms predicted a significantly larger share (59-67%) of those types correctly than random selection would.
CONCLUSION
The methods are very similar in prediction performance, and both can likely be useful for suggesting article types to the author of an article. At the very least, they are good at predicting the main article category, but they need more work to distinguish correctly between similar article types. Beyond 1 000 articles per type the training gives diminishing returns, but if the data exists it is relatively easy to train on more articles for slightly better results. It is, however, important to use the same number of articles of each type to get the best results. Finally, since the algorithms are so close in accuracy and F1-score, the naive Bayes algorithm is the better choice for the larger datasets for time-saving and environmental reasons, since it requires significantly less computing power to train.
REFERENCES