Extractive Text Summarization of Norwegian News Articles Using BERT

LiU-ITN-TEK-A--21/016-SE

Extractive Text Summarization of Norwegian News Articles Using BERT

Thomas Indrias Biniam
Adam Morén

2021-06-04

Department of Science and Technology, Linköping University, SE-601 74 Norrköping, Sweden
Institutionen för teknik och naturvetenskap, Linköpings universitet, 601 74 Norrköping

The thesis work was carried out in Datateknik (Computer Engineering) at Tekniska högskolan, Linköpings universitet. Norrköping, 2021-06-04.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/.

© Thomas Indrias Biniam, Adam Morén

Abstract

Extractive text summarization has over the years been an important research area in Natural Language Processing, and numerous methods have been proposed for extracting information from text documents. Recent work has shown great success on English summarization tasks by fine-tuning the language model BERT on large summarization datasets. However, less research has been devoted to low-resource languages. This work contributes by investigating how BERT can be used for Norwegian text summarization. Two models are developed by applying a modified BERT architecture, called BERTSum, to pre-trained Norwegian and Multilingual BERT. The resulting models are able to predict key sentences from articles and generate bullet-point summaries. The models are evaluated with the automatic metric ROUGE, and in this evaluation the Multilingual BERT model outperforms the Norwegian model. The multilingual model is further evaluated in a human evaluation by journalists, revealing that the generated summaries are not entirely satisfactory in some aspects. With some improvements, the model shows promise as a valuable tool for journalists, who can edit and rewrite generated summaries, saving time and reducing workload.

Acknowledgments

We want to start by expressing our gratitude to our supervisor, Elmira Zohrevandi, and our examiner, Pierangelo Dellacqua, at Linköping University for their commitment and valuable advice. We also want to express our gratitude to our supervisor at Schibsted, Björn Schiffler, for actively committing to and guiding us with both technical and academic advice throughout our work. We thank the contextual team at Schibsted for welcoming us into the team and providing us with advice and the data necessary to perform our research.

Norrköping, June 2021
Adam Morén and Thomas Indrias

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Aim
  1.4 Research question
  1.5 Delimitations

2 Theory
  2.1 Natural Language Processing
    2.1.1 Text Processing
    2.1.2 Statistical Methods
    2.1.3 Artificial Neural Networks
  2.2 Sequential models
    2.2.1 RNN
    2.2.2 Encoder-Decoder
    2.2.3 Attention
    2.2.4 Transformers
  2.3 BERT
    2.3.1 Input and Output Embeddings
    2.3.2 Pre-training
    2.3.3 Fine-tuning
    2.3.4 Pretrained BERT models
  2.4 Extractive Text Summarization Methods
    2.4.1 TF-IDF
    2.4.2 TextRank

    2.4.3 BERTSum
  2.5 Evaluation metrics for summarization
    2.5.1 Precision, Recall and F-Score
    2.5.2 ROUGE
    2.5.3 Qualitative Evaluation

3 Method
  3.1 Datasets
    3.1.1 CNN/DailyMail
    3.1.2 Aftenposten/Oppsummert
  3.2 Implementation
    3.2.1 Restructure of the AP/Oppsummert dataset
    3.2.2 Truncation of articles
    3.2.3 Oracle Summary Generation
    3.2.4 Models
    3.2.5 Hyper-Parameters
    3.2.6 Fine-tuning
    3.2.7 Prediction
    3.2.8 Hardware
  3.3 Evaluation
    3.3.1 ROUGE Evaluation
    3.3.2 Human Evaluation with Journalists

4 Results
  4.1 Implementation
  4.2 Evaluation
    4.2.1 ROUGE Evaluation
    4.2.2 Human Evaluation with Journalists

5 Discussion
  5.1 Results
    5.1.1 ROUGE Scores
    5.1.2 Sentence Selection
    5.1.3 Human Evaluation with Journalists
  5.2 Method
    5.2.1 Datasets
    5.2.2 Implementation
    5.2.3 Metrics
  5.3 The work in a wider context
    5.3.1 Ethical Aspects

6 Conclusion
  6.1 Conclusion
  6.2 Future Work

Bibliography

A Appendix
  A.1 All responses from Human Evaluation
    A.1.1 Article 1
    A.1.2 Article 2
    A.1.3 Article 3
    A.1.4 Article 4

List of Figures

2.1 Perceptron model
2.2 Illustration of a multilayer neural network
2.3 RNN illustrated by C. Olah [30]
2.4 RNN unpacked illustrated by C. Olah [30]
2.5 RNN Encoder-Decoder sequence-to-sequence model illustrated by Kostadinov [19]
2.6 Attention example illustrated by Bahdanau et al. [2]
2.7 Model architecture of a transformer illustrated by Vaswani et al. [43]
2.8 The input embeddings and embedding layers for BERT illustrated by Devlin et al. [7]
2.9 Two words broken down into sub-words using WordPiece tokenization
2.10 Binary labels generated by two paired inputs
2.11 Position embeddings layer
2.12 Architecture of BERTSum proposed by Liu [23]
3.1 Summaries associated with x articles in the AP/Oppsummert dataset
3.2 Number of sentences in the AP/Oppsummert summaries dataset
3.3 ROUGE-2 and ROUGE-L recall scores for summaries with one article in (a) and (b), summaries with more articles and the top-scoring articles in (c) and (d), and summaries with more articles and the second-best articles in (e) and (f)
3.4 Proportion of sentences with the highest ROUGE score according to their position in the original article
4.1 Sentence selection for Norwegian BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking and on (d) Oracle-3, (e) Oracle-7, (f) Oracle-10 without trigram blocking
4.2 Sentence selection for Multilingual BERT fine-tuned on (a) Oracle-3, (b) Oracle-7, (c) Oracle-10 with trigram blocking and on (d) Oracle-3, (e) Oracle-7, (f) Oracle-10 without trigram blocking
4.3 Average human evaluation scores for each category, where the highest score for each example is 20

List of Tables

3.1 Average token and sentence count for news articles and summaries in the CNN/DailyMail dataset
3.2 Dataset split
3.3 Article data type in the AP/Oppsummert dataset
3.4 Summary data type in the AP/Oppsummert dataset
3.5 Average token and sentence count for news articles and summaries for AP/Oppsummert
4.1 Time taken to fine-tune the Norwegian and Multilingual BERT models
4.2 ROUGE scores on AP/Oppsummert test data (116 articles). *With trigram blocking
4.3 Journalists' opinions reflecting their satisfaction with generated summaries
4.4 Journalists' opinions on generated summaries mentioning features where they found the algorithm performed weakly
4.5 Journalists' opinions on generated summaries reflecting potential for improvement
4.6 Overall comments from journalists
A.1 Responses from the journalists on article 1
A.2 Responses from the journalists on article 2
A.3 Responses from the journalists on article 3
A.4 Responses from the journalists on article 4

1 Introduction

Over recent years, the amount of data available to both users and companies has kept increasing massively. In response, summarizing data has become a popular topic in data science. Text summarization is part of this trend, focusing on representing content in a shorter format. Considering the amount of text data in news and media, this is one field where automatic text summarization could be beneficial.

1.1 Background

Aftenposten is Norway's largest daily newspaper, based in Oslo. It is a private company owned by Schibsted and has an estimated 1.2 million readers. To save readers time, Aftenposten developed a daily brief called Oppsummert, which features the most important stories of the day in a summarized format. The idea is to help readers stay updated on the most important daily news in a time-efficient way and at the same time offer a consistent and standardized reading experience. Summarizing articles manually leads to an increased workload for journalists. Additionally, many journalists want to focus on great journalism and innovation, not on re-writing shorter versions of already written articles. The challenge is to provide daily briefs for readers while, at the same time, using fewer resources from the newsroom and its journalists. For this challenge, we see potential in automatic text summarization, teaching machines to understand and process human language. There are two main text summarization strategies: extractive and abstractive.

Extractive techniques identify the most important sentences of a text and extract them. In contrast, abstractive techniques produce new, shorter text that captures the content of the original, longer text. In this thesis, the approach to automatic text summarization is extractive. The motivation is that we want the summaries to use sentences written by the original journalist. Abstractive summarization can sometimes lead to misinformation or biased output, which we want to prevent. Traditional approaches to extractive text summarization are based on statistical and graph-based methods such as TF-IDF and TextRank. Recently, however, these methods have started to be replaced by methods based on neural networks, such as BERT. BERT is a state-of-the-art language model that can learn to perform specific tasks using labeled data. In our case, the number of summarized articles from Aftenposten is limited, since Oppsummert is a newly released feature. Therefore, we hypothesize that it will be challenging to train a Norwegian BERT model and obtain good results. Our approach investigates this by trying different BERT models and methods.

1.2 Motivation

In the massive flood of news and media seen today, it can be challenging for readers to filter out the most important daily news. Furthermore, due to the pace of our daily lives, users often want to be as time-efficient as possible. The motivation for news summaries is therefore to help readers stay updated on the most important daily news in a time-efficient way. However, writing these summaries manually increases the workload for journalists. This is where we see potential for machine learning to generate these summaries automatically. By implementing a model that can extract key sentences from an article, we can reduce the workload for journalists and at the same time deliver summaries with the most important content to newsreaders.

1.3 Aim

The thesis project aims to develop a model that can extract the most relevant sentences from an article written in Norwegian, on which journalists can base their summaries. This will be done by investigating possible approaches for extractive text summarization using BERT with a limited labeled Norwegian dataset and evaluating the results.

1.4 Research question

The current work aims to answer the following research question:

• How can a high-performance BERT-based extractive summarization model be developed based on a limited amount of news summaries in Norwegian?

To this end, we aim to investigate:

• How news summaries can be used to generate the labeled data required for a supervised learning model.
• How the model's performance should be evaluated and assessed.
• How BERT can be used for extractive text summarization in a low-resource language.
• Limitations of BERT and how they should be dealt with.

1.5 Delimitations

The study focuses on BERT-based models for extractive text summarization. However, it also explores traditional methods for comparison purposes. The articles and summaries from Aftenposten are in Bokmål, one of Norway's official written standards. Therefore, we narrow the scope of the language to Bokmål only.

2 Theory

Automatic text summarization is the process by which a machine condenses a longer text into a shorter, comprehensive synopsis. The technique can be either abstractive or extractive. The abstractive approach aims to present a text with newly generated sentences, while the extractive approach aims to find and re-use key sentences from the original text. The output format can be bullet points, quotes, questions, or speakable summaries. These outputs are usually analyzed and rated by how well they capture the main points, grammar, text quality, etc. An automatic summarization architecture must be able to capture the essence of longer articles; summary evaluation therefore becomes a crucial part of automatic text summarization. Extractive text summarization can be treated either as a sentence scoring and selection task or as a sentence classification task. Sentence scoring and selection is the traditional approach, where each sentence is scored based on its importance to the text, and the sentences with the highest scores are selected for the final summary. With sentence classification, each sentence is instead classified into one of two classes: extracted or not extracted. The former approach belongs to the statistical methods, and the latter utilizes neural networks for task learning. In the following, we cover common methods and tasks with a focus on text and text summarization. In section 2.1, we introduce the research area behind automatic text summarization known as natural language processing. Secondly, in section 2.2, previous methods for textual tasks using neural networks are introduced. Thirdly, in section 2.3, the current state-of-the-art language model BERT is introduced. Finally, in sections 2.4 and 2.5, we present different methods for extractive text summarization and how these models can be evaluated.

2.1 Natural Language Processing

Natural language processing (NLP) is the field of computer science and artificial intelligence that deals with enabling computers to understand and process human language. This includes understanding both written and spoken language, which comes with many complex challenges. Today, computer applications are expected to translate, answer voice commands, give directions, and even produce human-like text. These challenges are hard for computers to manage because of how abstract and inconsistent human language is in its nature. Humor, sarcasm, irony, intent, and sentiment are a few examples that vary not only between languages but also between people. NLP aims to solve these challenges by converting language into numerical inputs that a computer can understand and process. By combining computer algorithms with statistics, it is possible to extract, classify, and label language.

2.1.1 Text Processing

For a computer to be able to work with text and solve larger NLP tasks, the text input must first be processed. Text processing contains several subtasks, such as:

Tokenization: Tokenization is usually the first subtask of text processing. It is used to separate a chunk of continuous text into tokens that help the computer better understand the text. For example, the sentence "The firefighter saved the cat. Awesome!" would with word tokenization be converted into ["The", "firefighter", "saved", "the", "cat", ".", "Awesome", "!"], and with sentence tokenization into ["The firefighter saved the cat.", "Awesome!"]. After a text input has been tokenized, the computer can use the processed input for other important processes, such as stemming and lemmatization.

Stemming and Lemmatization: Stemming and lemmatization are methods for trimming words down to their root form. For example, the word "saved" has the root "save". The difference between the two methods is that stemming focuses solely on changing the form of a word, whereas lemmatization finds a dictionary form of the word, meaning that after applying lemmatization we always obtain a valid word.
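To make the tokenization and lemmatization steps above concrete, the short Python sketch below applies them to the example sentence. The NLTK library is chosen purely for illustration (the thesis does not prescribe a specific toolkit), and the resource names passed to nltk.download may vary between NLTK versions.

```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time download of the tokenizer and lemmatizer resources (names may vary by version).
nltk.download("punkt")
nltk.download("wordnet")

text = "The firefighter saved the cat. Awesome!"

# Tokenization: split the text into word tokens and sentence tokens.
print(word_tokenize(text))   # ['The', 'firefighter', 'saved', 'the', 'cat', '.', 'Awesome', '!']
print(sent_tokenize(text))   # ['The firefighter saved the cat.', 'Awesome!']

# Stemming trims the surface form; lemmatization returns a dictionary form.
print(PorterStemmer().stem("saved"))                    # 'save'
print(WordNetLemmatizer().lemmatize("saved", pos="v"))  # 'save'
```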

Stop Words: Stop words are usually words with no semantic content and are therefore considered not to provide any relevant information for the task. The English language contains several hundred stop words, such as "the" or "and", which carry no meaning by themselves and are therefore often removed from documents [39].

POS tagging: Part-of-Speech tagging (POS tagging) is the method of identifying the part of speech of words (noun, adverb, adjective, etc.). Some words in a language can be used as more than one part of speech. For example, consider the word "book", which can be used both as a noun, in the sentence "The firefighter read a book", and as a verb, in the sentence "Did the firefighter book a table?". This example shows why POS tagging becomes important for processing the meaning of a text.

Sentence boundary identification: Sentence boundary identification is important for the system to recognize the end of sentences in the document. Establishing where a sentence ends and the next one begins is important for a clear sentence structure in many NLP tasks.

Word Embeddings: Word embeddings are a method for representing words that can capture syntactic and semantic information. Usually, words are mapped to a vector space of numbers with a fixed dimension, where words with similar meaning are closer together in the vector space. This makes it possible, for example, to detect synonymous words or to suggest additional words for sentences. Word embeddings are obtained from, and used by, language models that use neural networks to train and learn word associations from a large corpus of text.

2.1.2 Statistical Methods

For many years the most common NLP methods were based solely on statistics. For texts, this includes algorithms and rules based on the statistics of the words and sentences in a text document. An example is TF-IDF, a numerical statistic that can reflect a word's importance in a collection of text documents. Machine learning algorithms should also be mentioned here, as they revolutionized natural language processing when they were introduced in the 1980s. Popular machine learning classifiers and algorithms are Naive Bayes, Support Vector Machines, decision trees, and graph structures. Today, statistical methods in NLP have been largely replaced by neural networks. However, statistical methods continue to be relevant in some contexts, for example when the amount of training data is insufficient. In Section 2.4 we investigate two statistical methods for extractive text summarization: TF-IDF and a graph-based method called TextRank.

2.1.3 Artificial Neural Networks

Most methods that currently achieve state-of-the-art results for NLP tasks employ neural networks. A neural network is an artificial intelligence system that mimics how a biological brain works through artificial neurons. It enables models to learn tasks iteratively, and one of the reasons for its success in recent years is the massive increase in data on which the models can train. In this section, we review the main concepts of neural networks to better understand what happens when a model learns to perform a specific task.

Perceptron: Single Layer Neural Net

The simplest form of a neural network is a single-layer perceptron, capable of classifying linearly separable data. A perceptron is an algorithm that can be explained as a simplified biological neuron. It takes in a signal and outputs a binary signal. A vector of numbers represents the input, and the classification of this input represents the output. The framework for a perceptron is the following:

• Input: x = (x_1, x_2, ..., x_d)
• Output: y
• Model: weight vector w = (w_1, w_2, ..., w_d) and bias b

The perceptron makes its predictions based on the prediction function presented in Eq 2.1. An illustration of this equation is shown in Fig 2.1, where f is an activation function that can differ between types of neurons, w is the weight vector that represents the strength of the nodes, T is the transpose, and b is the bias.

y = f(w^T x + b)    (2.1)

Figure 2.1: Perceptron model

Training Perceptrons

The training of perceptrons is a process known as gradient descent. The goal of gradient descent is to find optimal parameters w that minimize the loss function. A training set is used during training, which contains input values X and their corresponding correct values Y. The model predicts a value ŷ from the input x, and this prediction is then compared with the actual value y. The difference between predicted values and actual values is the error E of the model. E can be calculated in different ways depending on the output type of the model; a model with a binary output typically uses the binary cross-entropy loss. Since the goal is to minimize the error of the loss function, we are looking for where the gradient of the loss function is zero.

This is why it is called gradient descent: the goal is to go down the gradient until it no longer has a slope, i.e., until the error becomes as small as it can possibly get. The model parameters are updated via the equation presented in Eq 2.2. Here, we calculate the new values of the parameters as the old parameters minus a step in the direction of the derivative. ε is called the learning rate, and it is a value that determines how big the step should be. The size of the step is important because if it is too large, we risk stepping over the optimal point, and if it is too small, the descent takes too much computational time.

w_i(t + 1) = w_i(t) − ε ∂E/∂w_i    (2.2)

Training of perceptrons happens in epochs. An epoch is defined as a full cycle through the training data. In the standard gradient descent method, we accumulate the loss for every example in the training set before updating the weights. The problem with the standard gradient descent method is that if the number of training samples is large, it may be time-consuming, because we have to run through the whole training set for every parameter update. Instead, Stochastic Gradient Descent (SGD) is often applied for faster convergence. There are two main methods for SGD:

• Update the weights for each sample
• Minibatch SGD: update the weights for a small set of samples

Updating the weights for each sample is fast, but it makes the model very sensitive to noise in the training set. Using minibatch SGD is both fast and robust to noise, which is why it is often preferred in training.
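As an illustration of the perceptron and the per-sample SGD update in Eq 2.2, the following sketch trains a single neuron with a sigmoid activation and binary cross-entropy loss on a toy, linearly separable dataset. The data, learning rate, and number of epochs are arbitrary choices made only for the example.

```python
import numpy as np

# Toy linearly separable data: the label is 1 when x1 + x2 > 1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.9, 0.8], [0.1, 0.2]])
y = np.array([0, 0, 0, 1, 1, 0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)   # weight vector
b = 0.0                  # bias
lr = 0.5                 # learning rate (epsilon in Eq 2.2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stochastic gradient descent: update the parameters one sample at a time.
for epoch in range(100):
    for xi, yi in zip(X, y):
        y_hat = sigmoid(w @ xi + b)   # prediction, Eq 2.1 with a sigmoid activation
        grad = y_hat - yi             # derivative of the binary cross-entropy w.r.t. w @ xi + b
        w -= lr * grad * xi           # step down the gradient
        b -= lr * grad

print([round(float(sigmoid(w @ xi + b)), 2) for xi in X])  # probabilities close to the labels
```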

Multi-Layer NN as a non-linear classifier

The problem with a single-layer neural net is that it can only be used as a linear classifier and cannot be used for feature learning. To solve this, multiple perceptrons can be combined to form a neural network with more layers, called hidden layers. An illustration of such a neural network, consisting of an input layer, two hidden layers, and an output layer, is shown in Fig 2.2. The advantage of a multi-layer neural network is that it is able to model non-linear functions, unlike single-layer neural networks, which can only model linear functions.

Figure 2.2: Illustration of a multilayer neural network

2.2 Sequential models

When working with textual data in NLP, the data is generally transformed into a sequence. It is essential to keep the order of the sequence, since if the order of words is changed, the sentence's context could also change. For example, ["the", "firefighter", "saved", "a", "cat"] has a different meaning than ["a", "cat", "saved", "the", "firefighter"]. Data where the order is important is called sequential data, and models working with this type of data are called sequential models. Another requirement for sequential models is that, while processing a sequence, they should remember previous important parts of it. For example, if the data is a sequence of sentences, such as ["X was walking home", ..., "he forgot to buy milk on the way"], it is essential to remember specific keywords such as "X" and "home".

2.2.1 RNN

A Recurrent Neural Network (RNN) is a family of neural networks for processing sequential data. RNNs can process long sequences and sequences with variable length, meaning that the input sequence does not have to be the same length as the output sequence.

Figure 2.3: RNN illustrated by C. Olah [30]

Figure 2.3 above shows an RNN with loops. The model A takes in an input sequence x_t and outputs a value h_t. The model also passes its past state on to the next step. The same RNN can be visualized as an unpacked network instead, shown in figure 2.4.

Figure 2.4: RNN unpacked illustrated by C. Olah [30]

RNNs can have different structures and combinations. For example, an RNN can have multiple layers, so that the output from one layer is used as input to another layer. Such layering is often called a deep RNN. Goldberg [13] observed empirically that deep RNNs work better than shallower RNNs on some tasks; however, it is not theoretically clear why they perform better. Another extension of the RNN is the bidirectional RNN (BI-RNN). A conventional RNN only uses the past state, as seen in figure 2.4. However, the future state might also hold useful information about the following words in a sequence. BI-RNNs attempt to deal with this by maintaining two separate states, each with its own layers. A BI-RNN runs over the input in two ways: one from front to back and one from back to front.
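A minimal sketch of a deep, bidirectional RNN is shown below using PyTorch; the library and all dimensions are illustrative choices, picked only to show how layering and the two directions affect the output shapes.

```python
import torch
import torch.nn as nn

# A two-layer ("deep") bidirectional RNN over sequences of 5-dimensional inputs.
rnn = nn.RNN(input_size=5, hidden_size=8, num_layers=2,
             bidirectional=True, batch_first=True)

x = torch.randn(3, 10, 5)   # batch of 3 sequences, 10 time steps, 5 features each
output, h_n = rnn(x)

# The front-to-back and back-to-front states are concatenated, hence 2 * hidden_size.
print(output.shape)         # torch.Size([3, 10, 16])
# One final hidden state per layer and per direction: 2 layers * 2 directions.
print(h_n.shape)            # torch.Size([4, 3, 8])
```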

Simple RNN

The most conventional RNN is called the simple RNN (S-RNN), and it was proposed by Elman [10]. Mikolov [27] later explored the S-RNN for use in NLP [13]. It builds a strong foundation for tasks such as sequence tagging and language modelling. However, the S-RNN suffers from a problem that causes the gradients carrying the information used in a parameter update to increase or decrease rapidly over time. This problem is known as the exploding or vanishing gradient problem, and it results in the gradients becoming so big or so small that the parameter updates carry no significant changes. In other words, this problem causes the model not to learn. In later work, Hochreiter [15] proposed an architecture known as Long Short-Term Memory, which managed to overcome the exploding and vanishing gradient problem.

LSTM

Long Short-Term Memory networks (LSTMs) are a special kind of RNN capable of learning long-term dependencies [30]. The main difference between an RNN and an LSTM is that an LSTM is made up of a memory cell, input and output gates, and a forget gate [24]. The memory cell is responsible for remembering dependencies in the input sequence, while the gates control how much of the previous states should be memorized.

2.2.2 Encoder-Decoder

The encoder-decoder architecture is a standard method used in sequence-to-sequence (seq2seq) NLP tasks such as translation. For RNNs (section 2.2.1), an encoder-decoder structure was first proposed by Cho et al. (2014) [4]. The encoder takes a sequence as input and produces an encoder vector used as input to the decoder. The decoder then predicts an output at each step with respect to the previous states (auto-regression) until some END token is generated. Figure 2.5 shows an RNN encoder-decoder architecture for seq2seq tasks, where h_i is the hidden state, x_i is the input sequence, and y_j is the output sequence.

Figure 2.5: RNN Encoder-Decoder sequence-to-sequence model illustrated by Kostadinov [19]

2.2.3 Attention

An apparent disadvantage of the conventional encoder-decoder models described in Section 2.2.2 is that the input sequence is encoded into a fixed-length vector. This limits the model's ability to learn later parts of a sequence, since the sequence is truncated. Additionally, early parts of long sequences within the fixed length are often forgotten once the entire sequence has been processed [4]. Bahdanau et al. (2016) [2] proposed an approach to solve these limitations by extending the encoder-decoder into a technique called Attention. Unlike the conventional encoder-decoder, Attention allows the model to focus on relevant parts of an input sequence.

The process is done in two steps. First, instead of only passing the last encoder hidden state (the context vector) to the decoder, the encoder passes all of its hidden states to the decoder. Second, the decoder gives each encoder hidden state a score, where each of these states is associated with a certain word in the input sequence. This way, the model does not train on using one context vector but rather learns which parts of a sequence to pay attention to. Bahdanau et al. provide an example, shown in figure 2.6. It illustrates Attention when translating the English input sequence [", This, will, change, my, future, with, my, family, ., ", the, man, said] into the French target sequence [", Cela, va, changer, mon, avenir, avec, ma, famille, ", a, dit, l', homme, .]. The figure shows that the alignment of the words is largely monotonic, hence the high attention scores along the diagonal. However, some words are non-monotonic. For example, the English word "man" is "l'homme" in French, and in the example we can find high attention scores both for "l'" and "homme".

Figure 2.6: Attention example illustrated by Bahdanau et al. [2]

2.2.4 Transformers

Transformers are an attention-based architecture consisting of two main components, an encoder and a decoder. The model was introduced by Vaswani et al. [43] to solve existing problems with recurrent models, presented in section 2.2.1, which preclude parallelization, resulting in longer training times and a drop in performance for longer dependencies. Due to the attention-based, non-sequential nature of transformers, they can be highly parallelized and reach a constant sequential and path complexity, O(1). Transformers were introduced to solve translation problems. The aim is to find a relationship between words in an input sentence and combine it with an existing translation of that sentence [3].

Encoder: The encoder consists of multiple encoder layers, where each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism. Looking at the example in section 2.2.3, self-attention means that the target sequence is the same as the input sequence; in other words, self-attention is a form of attention mechanism that relates different positions of the same input sequence. The term "multi-head" means that instead of computing the attention once, the layer utilizes scaled dot-product attention, allowing multiple attention computations in parallel [43]. The second sub-layer is a simple position-wise, fully connected feed-forward network. All sub-layers have a residual connection and layer normalization, the purpose of which is to add the output of each sub-layer to its previous input. The left block of figure 2.7 shows the encoder of a transformer.

Decoder: The decoder is structured similarly to the encoder, but it has an additional sub-layer called masked multi-head attention, a modified multi-head attention mechanism that, unlike self-attention, prevents it from attending to subsequent positions. The goal of masking positions is to ensure that the predictions made do not look into the future of the target sequence [43]. The right block of figure 2.7 shows the decoder of a transformer.

Figure 2.7: Model architecture of a transformer illustrated by Vaswani et al. [43]

For instance, in translation tasks, the encoder is fed with the words of a specific language, processing each word simultaneously. It then generates embeddings for each word, which are vectors that describe the meaning of the words in the form of numbers.

Similar words have closer numbers in their respective vectors. The decoder can then be used to predict the translation of a word in a sentence by combining the embeddings from the encoder with the previously generated part of the translated sentence. The decoder predicts one word at a time until the end of the sentence is reached. Transformers have a token limit of 512. The reason is that the memory and computation requirements of a transformer grow quadratically with sequence length, making it impractical to process long sequences [43]. This means that transformers can only process inputs below 512 tokens. Later, new solutions were introduced, such as Transformer-XL, which uses a recurrence mechanism to handle text sequences longer than 512 tokens [6]. However, in most cases it is sufficient to truncate sequences longer than 512 tokens to make them fit. In general, the encoder learns what a word is in relation to the source language, its grammar and, more importantly, its context. In contrast, the decoder learns how the source word relates to the target word in terms of language.
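The core operation inside the multi-head attention blocks described above is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. The NumPy sketch below computes a single (non-multi-head) attention step; the matrices are random placeholders, whereas in a real transformer Q, K, and V come from learned linear projections and several heads run in parallel.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(seq_len, d_k))   # in self-attention, Q, K and V are all derived
K = rng.normal(size=(seq_len, d_k))   # from the same input sequence through three
V = rng.normal(size=(seq_len, d_v))   # learned linear projections

context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)      # (4, 8) (4, 4): one weight per query/key pair
```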

2.3 BERT

BERT is a transformer-based model introduced by Devlin et al. [7]. The authors argue that previous language representation models, such as RNNs, were limited in how they encode tokens by only considering the tokens in one direction. Unlike RNNs, the authors utilize transformers, described in section 2.2.4, to design Bidirectional Encoder Representations from Transformers (BERT), which is able to encode a token using tokens from both directions. BERT can solve various types of problems, such as question answering, sentiment analysis, and text summarization. These problems require an understanding of language, which is addressed by a pre-training and a fine-tuning phase. The first phase consists of pre-training BERT to understand language; fine-tuning is then done so that BERT can learn to solve a specific task.

2.3.1 Input and Output Embeddings

Similar to other language models, BERT processes each input token through a token embedding layer to create a vector representation. Additionally, BERT has two more embedding layers, called segment and position embeddings. An illustration of the three embedding layers can be seen in figure 2.8.

Figure 2.8: The input embeddings and embedding layers for BERT illustrated by Devlin et al. [7]

To create an input embedding, the token, segment, and position embeddings are summed into a single input embedding for a given token. Before the embedding layers process the input, the input text is tokenized using WordPiece.

WordPiece

BERT adopts the WordPiece tokenization proposed by Wu et al. [44]. The aim of WordPiece tokenization is to improve the handling of rare words. The solution is to divide words into sub-words (WordPieces) using a fixed vocabulary set.

BERT has a vocabulary of 30,000 WordPieces. In figure 2.9, an example of two words broken down into sub-words is shown. As a word becomes rarer, it can be broken down all the way into single characters. Additionally, every sub-word except the first sub-word of a word is marked with a "##" prefix. The first sub-word is kept unmarked because it often carries the core meaning of the whole word. For example, the word "bedding" can be split into the sub-words "bed" and "##ding"; the sub-word "bed" conveys meaning for "bedding" because it is closely related to the word "bed".

Figure 2.9: Two words broken down into sub-words using WordPiece tokenization

Finally, a "[CLS]" token is added to the start and a "[SEP]" token to the end of a tokenized sentence. The objective of adding these extra tokens is to distinguish pairs of sentences, which helps create the segment embeddings described below. Since BERT uses the default transformer encoders (section 2.2.4), BERT is limited to processing input sequences of up to 512 tokens.
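The WordPiece behaviour and the special [CLS]/[SEP] tokens can be inspected with the Hugging Face transformers library, as in the sketch below. This is only an illustration: the multilingual checkpoint name is used as an example, and which words get split into "##"-prefixed sub-words depends entirely on the vocabulary of the chosen checkpoint, so the printed tokens may differ.

```python
from transformers import AutoTokenizer

# Any BERT checkpoint works here; multilingual BERT is used purely as an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "The firefighter saved the cat."

# WordPiece sub-word tokens; rare words are split into pieces prefixed with "##".
print(tokenizer.tokenize(sentence))

# encode() adds the special tokens: [CLS] at the start and [SEP] at the end.
ids = tokenizer.encode(sentence)
print(tokenizer.convert_ids_to_tokens(ids))

# Inputs longer than the 512-token limit must be truncated before being fed to BERT.
ids = tokenizer.encode(sentence, truncation=True, max_length=512)
```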

Token Embeddings

The first step is to create vector representations of the tokenized input in the token embeddings layer. Each token is represented by a hidden vector of size 1×768. For N input tokens, the token embeddings thus form a matrix of shape N×768, or 1×N×768 as a tensor.

Segment Embeddings

BERT can handle a pair of input sentences, as shown in figure 2.10. The inputs are tokenized and concatenated to create a pair of tokenized sentences. Thanks to the [SEP] token, BERT can distinguish the two sentences and label the sequence in binary.

Figure 2.10: Binary labels generated by two paired inputs

The label sequence is then expanded into the same matrix shape as the token embeddings, N×768, where N is the number of tokens. For example, for the paired input in figure 2.10, the segment embeddings would result in a matrix of shape 8×768.

Position Embeddings

BERT is a transformer-based model and therefore does not process tokens sequentially. Thus, to prevent BERT from losing the order of the tokens, position embeddings are required. The position embeddings layer can be seen as a look-up table, as illustrated in figure 2.11, where the index of a row represents a token position. For example, the two sentences "Cat is stuck" and "Call the firefighter" have identical position embeddings for the word pairs "Cat" - "Call", "is" - "the" and "stuck" - "firefighter".

Figure 2.11: Position embeddings layer

2.3.2 Pre-training

The pre-training phase is done by training on two unsupervised tasks simultaneously: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) [16].

Masked Language Model

Masked Language Modeling (MLM) is an unsupervised task performed during the pre-training of BERT. The goal of MLM is to help BERT learn deep bidirectional representations. MLM is done by randomly masking 15% of all WordPiece tokens in the input sequence. The tokens are masked by replacing them with a [MASK] token, which BERT identifies and predicts.

Next Sentence Prediction

Next Sentence Prediction (NSP) is another unsupervised task performed during the pre-training of BERT. The objective of this task is to capture the relationship between two sentences. To capture sentence relationships, BERT is pre-trained for a binarized next-sentence prediction task that can be generated from any monolingual corpus [7]. This is done by letting 50% of the inputs be sentence pairs where the second sentence is the actual subsequent sentence in the corpus. The other 50% contain sentence pairs where the second sentence is instead a random sentence selected from the corpus. For example, if A is a sentence from the corpus, then 50% of the time B is the subsequent sentence of A, and the other 50% of the time B is a random sentence from the corpus.

2.3.3 Fine-tuning

Fine-tuning allows the pre-trained BERT model to be used for specific NLP tasks through supervised learning. It works by replacing the fully connected output layers of the pre-trained BERT model with a new set of output layers that can output an answer to the NLP problem at hand. The new model performs supervised learning with labeled data to update the weights of the output layers. Since only the output-layer weights are updated during fine-tuning, the learning during fine-tuning is relatively fast [7].

2.3.4 Pretrained BERT models

As described in the previous section (2.3), a BERT model has to be pre-trained before it is fine-tuned on different tasks, because the model needs to be taught to encode language. This process is both time- and resource-consuming. For example, Devlin et al. [7] pre-trained the BERT model for four days using four cloud TPUs (16 TPU chips in total). Therefore, many BERT models are released as pre-trained models with initialized parameters, ready for specific tasks; fine-tuning can then be performed on the pre-trained model for the particular task at hand.

Norwegian BERT: At the current state, the state-of-the-art monolingual BERT model supporting the Norwegian language (both Bokmål and Nynorsk) is made by the National Library of Norway (https://github.com/NBAiLab/notram). It is based on the same structure as multilingual BERT (section 2.3.4) and is trained on a wide variety of Norwegian text in both Bokmål and Nynorsk from the last 200 years.

Multilingual BERT: Multilingual BERT (M-BERT) is a BERT-based model pre-trained on concatenated monolingual Wikipedia corpora from 104 languages (https://github.com/google-research/bert/blob/master/multilingual.md). In a study by Pires et al. [38], it is shown that M-BERT performs exceptionally well on zero-shot cross-lingual model transfer, meaning that M-BERT can be fine-tuned using task-specific supervised data from one language and evaluated in a different language. This was done in a paper by Elmadani et al. [9], who applied M-BERT to Arabic text summarization and showed how effective it can be in low-resource situations for both extractive and abstractive approaches.
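Purely as an illustration of how such pre-trained checkpoints are typically obtained before fine-tuning, the sketch below loads both models through the Hugging Face transformers library. The hub identifiers are assumptions (the National Library's Norwegian BERT is assumed to be published as NbAiLab/nb-bert-base) and are not taken from the thesis itself.

```python
from transformers import AutoModel, AutoTokenizer

# Assumed model-hub identifiers; adjust if the checkpoints are published under other names.
NORWEGIAN_BERT = "NbAiLab/nb-bert-base"
MULTILINGUAL_BERT = "bert-base-multilingual-cased"

for name in (NORWEGIAN_BERT, MULTILINGUAL_BERT):
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)    # pre-trained encoder, ready for fine-tuning
    print(name, model.config.hidden_size)      # both are base-sized models with 768-dimensional hidden states
```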

2.4 Extractive Text Summarization Methods

Today there exist different extractive methods for text summarization. In this chapter, two well-known unsupervised methods, TF-IDF and TextRank, are presented in sections 2.4.1 and 2.4.2. Furthermore, section 2.4.3 presents a supervised method called BERTSum, which utilizes the language model BERT for text summarization.

2.4.1 TF-IDF

TF-IDF is short for term frequency-inverse document frequency. It is a numerical statistic that reflects how important a word is to a document within a corpus [39]. Term weighting based on term frequency was first introduced by Luhn [25], who stated that the importance of a term is proportional to its frequency. In mathematical terms, this can be described as:

tf(t, d) = f_{t,d}    (2.3)

As seen in eq. 2.3, the term frequency tf is equal to the frequency f of a term t in a document d. For example, in the following sentences: "The firefighter rescued a cat. The cat is safe now.", the term "cat" would have high importance because it is mentioned multiple times. However, common terms such as "the" would also be weighted as important. To solve the issue of common terms appearing as important words, Jones [17] proposed a metric called inverse document frequency (IDF). The idea is to reduce the weight of common terms and increase the weight of terms that occur infrequently, see Equation 2.4.

idf(t, d) = log(n / n_t)    (2.4)

Here, terms are weighted based on the inverse fraction of the documents containing a term. The fraction is calculated by dividing the total number of documents n by the number of documents n_t containing the term t. The combination of TF and IDF favors more unique terms and dampens common terms that occur in several documents. The combined equation is presented as:

tf-idf(t, d) = f_{t,d} × log(n / n_t)    (2.5)

The same principle can be used for sentence weighting. The document d in eq. 2.5 is reformulated as a sentence s, and the term t is represented by a word w. In this case, n is the total number of sentences, and n_w is the number of sentences containing the word w. The final equation for sentence weighting is:

tf-idf(w, s) = f_{w,s} × log(n / n_w)    (2.6)
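A direct implementation of the sentence-level TF-IDF weighting in Eq 2.6 is sketched below: each sentence is scored by summing the tf-idf weights of its words, with sentences playing the role of documents. The function name and example sentences are made up for the illustration.

```python
import math
import re

def sentence_tfidf_scores(sentences):
    """Score each sentence by the summed tf-idf weight of its words (Eq 2.6)."""
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Number of sentences containing each word (n_w in Eq 2.6).
    n_w = {}
    for tokens in tokenized:
        for word in set(tokens):
            n_w[word] = n_w.get(word, 0) + 1
    scores = []
    for tokens in tokenized:
        # f_{w,s} * log(n / n_w), summed over the distinct words of the sentence.
        score = sum(tokens.count(w) * math.log(n / n_w[w]) for w in set(tokens))
        scores.append(score)
    return scores

sentences = [
    "The firefighter rescued a cat.",
    "The cat is safe now.",
    "A new fire station opened downtown.",
]
print(sentence_tfidf_scores(sentences))
```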

2.4.2 TextRank

TextRank is a graph-based ranking algorithm proposed by Mihalcea and Tarau [26]. The ranking is done by deciding the importance of a vertex in a graph based on global information drawn recursively from the entire graph. This is done by linking one vertex to another. The importance of a vertex is measured by the number of links to other vertices as well as the scores of the vertices casting the votes. A directed graph can be defined as G = (V, E), where V is a set of vertices and E is a set of edges; E is, in turn, a subset of V × V. For a vertex V_i, let In(V_i) be the set of vertices pointing to it and let Out(V_i) be the set of vertices that V_i points to [26]. The score of a vertex V_i, indicating its importance, is based on Brin et al. [35]:

S(V_i) = (1 − d) + d × Σ_{j ∈ In(V_i)} S(V_j) / |Out(V_j)|,  where 0 < d < 1    (2.7)

In equation 2.7, d is a damping factor that sets the probability of jumping from a given vertex to a random vertex. TextRank can be applied to sentence extraction, as proposed by Mihalcea and Tarau. This is done by letting the vertices of the TextRank graph correspond to the sentences, so that each vertex represents one sentence, and weighting the edges by the similarity between two sentences. The similarity function proposed by Mihalcea and Tarau also uses normalization factors, dividing the content overlap by the sentence lengths to avoid promoting long sentences. Given two sentences S_i and S_j, where a sentence is represented by a set of N_i words (S_i = w_1^i, w_2^i, ..., w_{N_i}^i), the similarity between S_i and S_j is defined as (Mihalcea and Tarau):

Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i & w_k ∈ S_j}| / (log|S_i| + log|S_j|)    (2.8)
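The sentence-similarity measure in Eq 2.8 and the iterative scoring in Eq 2.7 can be sketched as follows. For simplicity the graph is fully connected and the similarity acts as the edge weight; this is an illustrative simplification under those assumptions, not the exact implementation used later in the thesis.

```python
import math
import re

def similarity(s_i, s_j):
    """Word-overlap similarity between two sentences (Eq 2.8)."""
    t_i, t_j = re.findall(r"\w+", s_i.lower()), re.findall(r"\w+", s_j.lower())
    overlap = len(set(t_i) & set(t_j))
    norm = math.log(len(t_i)) + math.log(len(t_j))
    return overlap / norm if norm > 0 else 0.0

def textrank(sentences, d=0.85, iterations=30):
    """Iteratively score sentences on a similarity-weighted graph (cf. Eq 2.7)."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_sum = sum(weights[j])          # total edge weight leaving vertex j
                if weights[j][i] > 0 and out_sum > 0:
                    rank += weights[j][i] / out_sum * scores[j]
            new_scores.append((1 - d) + d * rank)
        scores = new_scores
    return scores

sentences = [
    "The firefighter rescued a cat from the tree.",
    "The cat is safe and back home now.",
    "A new fire station opened downtown.",
]
print(textrank(sentences))   # higher scores indicate more central sentences
```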

2.4.3 BERTSum

BERT cannot be used directly for extractive summarization. Liu (2019) [23] points out two problems. Firstly, BERT is trained using a masked language model (section 2.3.2); therefore, the output vectors correspond to tokens rather than sentences. Secondly, although BERT has segment embeddings for indicating different sentences, it can only differentiate a pair of sentences, because BERT is also trained on next sentence prediction (section 2.3.2). Liu [23] proposes a method for handling multiple sentences with BERT by inserting a [CLS] token before each sentence and a [SEP] token after each sentence. To distinguish multiple sentences rather than two, interval segment embeddings are used: each token in a sentence is assigned E_A or E_B depending on whether the position of the sentence is odd or even. As seen in figure 2.12, the outputs of the BERT layer, shown as T_i, are the vectors at the positions of the corresponding [CLS] tokens from the top BERT layer. Each T_i is treated as the sentence representation of sentence i.

Figure 2.12: Architecture of BERTSum proposed by Liu [23]

After obtaining sentence representations for multiple sentences, Liu suggests several methods for capturing document-level features for extracting summaries:

1. Using a simple classifier on top of the BERT outputs and a sigmoid function to get a predicted score.
2. An inter-sentence transformer, adding more transformer layers on top of the BERT outputs, followed by a simple classifier with a sigmoid function.
3. Applying an LSTM layer on top of the BERT outputs, together with a simple classifier and a sigmoid function.

In Liu's experiments, the second option, with two transformer layers, showed the best performance. The loss of the model is the binary cross-entropy of a prediction against its gold label [23]. The sentences predicted by BERTSum are ranked by their importance, which is represented by a score. Before ranking each sentence by its score, Liu implemented trigram blocking, introduced by Paulus et al. (2018) [37], to reduce redundancy by minimizing the similarity between the selected sentences in the predicted summary. Like BERT, the sequence input for BERTSum has a limit of 512 tokens.
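The trigram-blocking rule itself is straightforward to sketch: when building the final summary from the ranked sentences, a candidate is skipped if it shares any trigram with the sentences already selected. The sketch below only illustrates this rule under the assumption that the sentences arrive sorted by the model's predicted score; the actual BERTSum implementation may differ in its details.

```python
def trigrams(sentence):
    """Set of word trigrams in a sentence."""
    words = sentence.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def select_with_trigram_blocking(ranked_sentences, max_sentences=3):
    """Greedily pick top-ranked sentences, skipping those that repeat a trigram."""
    selected, seen = [], set()
    for sentence in ranked_sentences:   # assumed sorted by predicted score, best first
        cand = trigrams(sentence)
        if cand & seen:                 # overlaps an already selected sentence
            continue
        selected.append(sentence)
        seen |= cand
        if len(selected) == max_sentences:
            break
    return selected
```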

2.5 Evaluation metrics for summarization

Metrics that would traditionally be used to evaluate text summaries are coherence, conciseness, grammaticality, readability, and content [21]. These are metrics that experts consider when writing summaries, and since experts are going to use the developed tool, they become important. Evaluating summaries manually does not scale well, since it would require huge amounts of time and effort to evaluate the hundreds, or even thousands, of summaries that exist. Therefore, it is crucial to complement human evaluation with methods and metrics that evaluate summaries automatically.

2.5.1 Precision, Recall and F-Score

Extractive text summarization can be seen as a binary classification problem, where 1 indicates that a sentence from the document is extracted and 0 indicates that it is not. In the statistical analysis of binary classification, precision, recall, and F-score measure the test's accuracy. The precision score is the number of true positive results divided by all selected positive results. The recall score is the number of true positives divided by all positive values. Another way to interpret these values is to think of the precision score as how many of the selected items are relevant, and the recall score as how many of the relevant items are selected. It then becomes clear that these values alone are not always informative. For example, if we were to pick out three red apples from a bowl of ten apples, we could achieve a high precision score by simply picking one red apple. Similarly, we would achieve a high recall score by simply picking all ten apples in the bowl. In these cases, the F1 score, known as the harmonic mean of precision and recall, becomes useful. The F1 score is calculated as in Equation 2.9.

F1 = 2 × (precision × recall) / (precision + recall)    (2.9)

2.5.2 ROUGE

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics presented by Lin [21] in 2004 for the automatic evaluation of summaries. The metrics compare machine-generated summaries against one or multiple reference summaries created by humans. The ROUGE-1, ROUGE-2, and ROUGE-L metrics are commonly used for benchmarking document summarization models, for example on the leaderboard for document summarization on the CNN/Daily Mail dataset [5]. For each metric, the recall, precision, and F1 scores are generally computed. With ROUGE, the true positives are the words in the sentences of the reference summaries.
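To make the precision, recall, and F1 computations from section 2.5.1 concrete when summarization is cast as binary sentence classification, the small sketch below treats the extracted and reference sentences as sets of indices; the index values are invented for the example.

```python
def precision_recall_f1(predicted, reference):
    """predicted and reference are sets of selected sentence indices."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # Eq 2.9
    return precision, recall, f1

# The model extracted sentences 0, 2 and 5; the reference summary uses sentences 0, 1 and 2.
print(precision_recall_f1({0, 2, 5}, {0, 1, 2}))   # roughly (0.67, 0.67, 0.67)
```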

ROUGE-N: N-gram Co-Occurrence Statistics

ROUGE-N is defined as the overlap of n-grams between the candidate summary and the reference summary. As mentioned, the most common metrics are ROUGE-1 and ROUGE-2, where ROUGE-1 refers to the overlap of unigrams (single words) and ROUGE-2 refers to the overlap of bigrams (two adjacent words).

ROUGE-L: Longest Common Subsequence

ROUGE-L refers to the Longest Common Subsequence (LCS) of words between a candidate summary and a reference summary. It reflects sentence-level similarity based on the longest in-sequence matches of words.

ROUGE-W: Weighted Longest Common Subsequence

One disadvantage of the longest common subsequence in ROUGE-L is that it does not favor consecutive matches. As a result, a candidate whose matching words are adjacent receives the same score as a candidate whose matching words are spread far apart. ROUGE-W deals with this problem by rewarding the length of encountered consecutive matches, giving a weighted longest common subsequence [21].

ROUGE-S: Skip-Bigram Co-Occurrence Statistics

One can think of ROUGE-S as a generalization of ROUGE-2. Instead of measuring the overlap of consecutive bigrams, ROUGE-S measures the overlap of skip-bigrams between the candidate and the set of reference summaries. A skip-bigram is any pair of words in their sentence order, allowing arbitrary gaps. A sentence with x words therefore has x!/(2!(x - 2)!), i.e. "x choose 2", skip-bigrams.

ROUGE-SU: Extension of ROUGE-S

ROUGE-SU is an extension of ROUGE-S that additionally considers unigrams as a counting unit. The extension can be necessary if, for example, a candidate sentence is the exact reverse of the reference summary. In that case, using only ROUGE-S would result in a score of zero even though the sentences share single-word co-occurrences. With ROUGE-SU, a reversed candidate sentence gets a higher score than sentences that do not have a single word co-occurrence with the reference sentence.

ROUGE Example

For clarification of ROUGE scores, let us investigate the following example: sentence one is a reference sentence, and sentences two and three are candidates.
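As a small illustration of the skip-bigram count, the following sketch enumerates the skip-bigrams of a four-word sentence, which yields 4!/(2!(4 - 2)!) = 6 of them. It is a minimal sketch; itertools.combinations preserves sentence order, which matches the skip-bigram definition.

```python
from itertools import combinations


def skip_bigrams(sentence):
    """All word pairs in sentence order, with arbitrary gaps allowed."""
    tokens = sentence.lower().split()
    return list(combinations(tokens, 2))


pairs = skip_bigrams("police killed the gunman")
print(len(pairs))  # 6, i.e. 4! / (2! * (4 - 2)!)
print(pairs)       # [('police', 'killed'), ('police', 'the'), ('police', 'gunman'), ...]
```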

1. The firefighter saved the cat.
2. The firefighter rescued cat.
3. Cat saved the firefighter.

Considering ROUGE-1, we can see that sentence three gives the best match, with a recall score of 4/5 = 0.8 and a precision score of 4/4 = 1. In the case of ROUGE-L, sentence two is preferred, with a recall score of 3/5 = 0.6 and a precision score of 3/4 = 0.75. For ROUGE-2, sentence three again scores highest, matching two bigrams ("saved the" and "the firefighter") for a recall of 2/4 = 0.5 and a precision of 2/3, while sentence two matches only one bigram ("the firefighter") for a recall of 1/4 = 0.25 and a precision of 1/3. The importance of this example is to understand that focusing on only one type of ROUGE score does not always provide good insight. In our example, it can intuitively be agreed that sentence two is the one that best fits the reference sentence, since sentence three completely changes the meaning. But according to ROUGE-1 and ROUGE-2, sentence three is preferred. This example shows why combining the three is often a good idea, and why it is important to complement the results with a qualitative evaluation.

ROUGE Limitations

Regardless of ROUGE's popularity among papers on automatic text summarization, there are some limitations that must be addressed:

1. ROUGE only considers content selection and not other aspects such as fluency and coherence.
2. ROUGE relies only on exact overlaps, but a summary can express the same content as an article without exact overlaps, using other words and synonyms.
3. ROUGE was originally designed to be used with multiple reference summaries, acknowledging that summaries are subjective. However, most datasets today only provide a single reference summary for each input.

2.5.3. Qualitative Evaluation

A qualitative evaluation method is often used together with quantitative data to deepen the understanding of the statistical numbers. Patton [36] suggests three kinds of data collection methods for qualitative evaluation:

• Open-ended interviews
• Direct observations
• Written documents
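To make the example above reproducible, here is a minimal sketch that computes ROUGE-1 and ROUGE-2 recall and precision by clipped n-gram counting. It is a simplified illustration written for this example, not the official ROUGE implementation.

```python
from collections import Counter


def ngrams(text, n):
    """Multiset of word n-grams after lowercasing and stripping the final period."""
    tokens = text.lower().replace(".", "").split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """Clipped n-gram overlap, returned as (recall, precision)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips the counts
    return overlap / sum(ref.values()), overlap / sum(cand.values())


reference = "The firefighter saved the cat."
for candidate in ["The firefighter rescued cat.", "Cat saved the firefighter."]:
    print(candidate, rouge_n(candidate, reference, 1), rouge_n(candidate, reference, 2))
```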

The purpose of these methods is to gather information and insights that are useful for decision-making. Qualitative methods should therefore be appropriate and suitable, which means that it is essential to determine qualitative strategies, data collection options, and analysis approaches based on the evaluation's purpose. An example of a method that combines quantitative measurements and qualitative data is a questionnaire, or interview, that asks both fixed-choice questions and open-ended questions.

3 Method

In this chapter, the methodology for creating a text summarization model is described. Firstly, in section 3.1, the datasets that were used and their properties and features are introduced. Secondly, in section 3.2, the implementation techniques are described in three parts: pre-processing, binary label generation, and fine-tuning. Finally, in section 3.3, we cover the methods used for evaluating our different models.

3.1. Datasets

The following section presents the features and properties of the two datasets used in this work.

3.1.1. CNN/DailyMail

The CNN/DailyMail dataset1 was initially developed for machine reading comprehension and abstractive question answering by Hermann et al. [14]. The dataset contains over 300k unique news articles written in English by journalists at CNN and the Daily Mail. Their script to obtain the data was later modified by Nallapati et al. [29] to support training models for abstractive and extractive text summarization using multi-sentence summaries. Both of these datasets are anonymized versions, where the data has been pre-processed to replace named entities with unique identifier labels. A third version of the dataset also exists, which operates directly on the original text (non-anonymized), see [41].

1 https://cs.nyu.edu/~kcho/DMQA/
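As a convenience, the non-anonymized version of the dataset is also distributed through the Hugging Face datasets hub. The sketch below shows one way to load it and inspect its fields; this is an assumption about tooling made for illustration, not the acquisition pipeline used in this work.

```python
from datasets import load_dataset  # pip install datasets

# "3.0.0" is the non-anonymized configuration with "article" and "highlights" fields.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0")

sample = cnn_dm["train"][0]
print(sample["article"][:200])   # body of the news article
print(sample["highlights"])      # the multi-sentence bullet-point summary
print({split: len(cnn_dm[split]) for split in cnn_dm})  # train/validation/test sizes
```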

The CNN/DailyMail dataset consists of two main features: articles, which are strings containing the body of the news article, and multi-sentence summaries, which are strings containing the highlights of the article as written by the article author. Table 3.1 displays the average token count and number of sentences in the dataset.

Table 3.1: Average token and sentence count for news articles and summaries in the CNN/DailyMail dataset

Type                        Average token count    Average nr of sentences
News Articles               781                    29.74
Multi-Sentence Summaries    56                     3.75

Furthermore, the dataset is split into a train, validation and test set according to Table 3.2.

Table 3.2: Dataset split

Dataset Split    Number of Instances
Train            287,113
Validation       13,368
Test             11,490

Model performance on the CNN/DailyMail dataset is measured by the ROUGE scores of the model's predicted summaries compared to the gold summaries. The highest-achieving models can be found on the Papers With Code leaderboard2.

3.1.2. Aftenposten/Oppsummert

The Norwegian articles and summaries provided by Aftenposten (AP) and Oppsummert make up two datasets: one with 162k articles and one with 979 summaries. The columns of the article dataset are presented in Table 3.3. The summary dataset additionally contains an array of article IDs, which are the articles that the summary is based on. Table 3.4 presents each column in the summary dataset. To get an idea of how many articles from the article dataset were used to create the summaries, we plot this relation in Figure 3.1. As for the CNN/DailyMail dataset, we were interested in examining the average number of sentences in the AP/Oppsummert summaries. This plot is presented in Figure 3.2. Table 3.5 also displays the average token count and the average number of sentences in the articles and summaries datasets.

2 https://paperswithcode.com/sota/document-summarization-on-cnn-daily-mail

Table 3.3: Article data type in the AP/Oppsummert dataset

Field                  Description
ARTICLE_ID             The article's ID
ARTICLE_TITLE          The title of the article
ARTICLE_TEXT           Raw article text data
ARTICLE_NEWSROOM       The newsroom is Aftenposten
LAST_UPDATE            The date when the article was last updated

Table 3.4: Summary data type in the AP/Oppsummert dataset

Field                  Description
ARTICLE_ID             The summary's ID
ARTICLE_TITLE          The title of the summary
ARTICLE_TEXT           Raw summary text data
ARTICLE_NEWSROOM       The newsroom is always Aftenposten
LAST_UPDATE            The date when the summary was last updated
SUMMARIZED_ARTICLES    An array of connected article IDs

Table 3.5: Average token and sentence count for news articles and summaries for AP/Oppsummert

Feature                     Mean token count    Average nr of sentences
News Articles               703                 40.3
Multi-Sentence Summaries    154                 9.5

Compared to the CNN/DailyMail dataset, the AP/Oppsummert dataset is more varied, both in the number of sentences per summary and in the number of articles associated with each summary.
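The averages in Table 3.5 can be reproduced with a short script once the data is available in tabular form. The sketch below assumes a pandas DataFrame with the ARTICLE_TEXT column described above and uses NLTK's Norwegian sentence and word tokenizers; these are implementation choices made here for illustration.

```python
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize  # nltk.download("punkt") is required once


def length_stats(df: pd.DataFrame) -> pd.Series:
    """Mean token count and mean number of sentences over the ARTICLE_TEXT column."""
    tokens = df["ARTICLE_TEXT"].apply(lambda t: len(word_tokenize(t, language="norwegian")))
    sentences = df["ARTICLE_TEXT"].apply(lambda t: len(sent_tokenize(t, language="norwegian")))
    return pd.Series({"mean_tokens": tokens.mean(), "mean_sentences": sentences.mean()})


# articles_df and summaries_df are assumed to be loaded from the AP/Oppsummert exports.
# print(length_stats(articles_df))
# print(length_stats(summaries_df))
```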

Figure 3.1: Summaries associated with x articles in the AP/Oppsummert dataset

Figure 3.2: Number of sentences in the AP/Oppsummert summaries dataset

3.2. Implementation

In this section, the implementation of an automatic extractive text summarization model is described, together with the different problems we had to overcome. Firstly, in sections 3.2.1, 3.2.2 and 3.2.3, the dataset is restructured, truncated and labeled. Secondly, in section 3.2.4, the different model implementations are described. Lastly, in sections 3.2.6, 3.2.7 and 3.2.8, the fine-tuning, prediction and hardware used for implementing a BERT-based model are described.

3.2.1. Restructure of the AP/Oppsummert dataset

When training a model, one of the most important aspects is to have good training data. The CNN/DailyMail dataset is relatively straightforward to use, having one summary per article. However, this was not the case for the AP/Oppsummert dataset, since some of the summaries have multiple related articles. We therefore analyzed these summaries, together with their related articles, to identify where the summaries' content comes from. Our method for this was to plot the ROUGE scores for the articles that maximize the ROUGE-2 and ROUGE-L recall scores against their gold summaries, an approach similar to the sentence selection algorithm presented in the BERTSum paper [23]. The objective was to visualize and compare the scores of the top-scoring articles and the second-best articles. For the summaries with only one article, we plot their ROUGE-2 and ROUGE-L recall scores against the summary directly to understand how extractive they are, i.e., a high ROUGE score means that the summary uses similar words and sentences as the connected article. These plots are presented in Figure 3.3. The reason for using the recall score is that we were not interested in the length variation of the articles, only in to what extent the summaries use content from the different articles. From the graphs presented in Figure 3.3 we can draw two important conclusions about the AP/Oppsummert dataset:

1. The summaries with only one article are predominantly extractively written (since they have high ROUGE-2 and ROUGE-L scores).
2. The summaries with more articles regularly use sentences from only one main article (since the scores of the second-best article are far lower than the scores of the top-scoring article).

With these two conclusions, the dataset was restructured so that each summary with multiple related article IDs was connected only to the highest-scoring article in that set. The field SUMMARIZED_ARTICLES was therefore updated from an array of IDs to a single article ID.
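To illustrate the restructuring step, the sketch below selects, for each summary, the related article that maximizes the sum of ROUGE-2 and ROUGE-L recall. It uses the rouge-score package as one possible ROUGE implementation; the data structures and field names are simplified assumptions rather than the exact code used in this work.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge2", "rougeL"], use_stemmer=False)


def best_source_article(summary_text, related_articles):
    """Return the ID of the related article whose text maximizes
    ROUGE-2 + ROUGE-L recall against the gold summary."""
    def recall_sum(article_text):
        scores = scorer.score(summary_text, article_text)  # (target, prediction)
        return scores["rouge2"].recall + scores["rougeL"].recall

    return max(related_articles, key=lambda a: recall_sum(a["ARTICLE_TEXT"]))["ARTICLE_ID"]


# related_articles is assumed to be a list of dicts with ARTICLE_ID and ARTICLE_TEXT,
# looked up from the SUMMARIZED_ARTICLES array of a summary.
```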

Figure 3.3: ROUGE-2 and ROUGE-L recall scores for summaries with one article in (a) and (b), summaries with more articles and the top-scoring articles in (c) and (d), and summaries with more articles and the second-best articles in (e) and (f).

3.2.2. Truncation of articles

Before text articles can be used in a model like BERTSum, the token limit must be addressed. As mentioned in section 2.4.3, both BERT and BERTSum have an input limit of 512 tokens. The news articles in the CNN/DailyMail dataset have a mean token count of 781, and the news articles in the AP/Oppsummert dataset have a mean token count of 703. This means that the token limit of BERTSum is indeed a problem when using these datasets. Different approaches to handling the token limitation have been suggested in previous works [23][42]. A standard method is to truncate longer texts to fit the model's token limit. The problem with this method is the loss of data that it introduces: if important information is discarded, the result is a poorly trained model. For news articles, important information is primarily presented in the first third of the article [20]. This is also the case for the CNN/DailyMail dataset, as demonstrated by Liu [23]. Using ROUGE, we examined whether this also holds for the AP/Oppsummert dataset. We did this by plotting the positions of every document's oracle sentences, i.e., the sentences with the highest ROUGE score against the document's gold summary, see Figure 3.4. In this particular plot, we chose Oracle-3, i.e., the three top-scoring sentences. Figure 3.4 shows that the top-scoring sentences in the AP/Oppsummert dataset mainly occur at the beginning of the articles. Therefore, it was decided that articles that do not fit the token limit of 512 should be truncated to keep only the first sentences sequentially, since this is where the most important information in each document is known to be. The same approach was chosen for the truncation of the CNN/DailyMail dataset, given the results from Liu [23].

3.2.3. Oracle Summary Generation

Similar to the approach followed in [22], we indirectly use the abstractive summaries (gold summaries) to create oracle summaries for supervised learning. Since the gold summaries from AP/Oppsummert are abstractive, they cannot directly be used for supervised learning. Therefore, a greedy algorithm that incrementally maximizes the ROUGE-2 score between the sentences in the gold summary and sentences from the article is used to generate an oracle summary. An oracle summary assigns label 1 to the selected sentences and 0 to the rest. Liu [22] also suggests a second algorithm for oracle summary generation, which considers all sentence combinations when maximizing the ROUGE-2 score; however, for many combinations, this algorithm can be time-consuming.
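The greedy oracle selection can be sketched as follows: sentences are added one at a time, keeping the sentence that gives the largest gain in ROUGE-2 recall against the gold summary, and stopping when no sentence improves the score. This is a simplified illustration of the idea, reusing the rouge-score package from the earlier sketch; it is not the exact implementation used by Liu.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=False)


def greedy_oracle_labels(article_sentences, gold_summary, max_sentences=3):
    """Binary labels: 1 for sentences greedily chosen to maximize ROUGE-2 recall."""
    selected = []
    best_score = 0.0
    while len(selected) < max_sentences:
        best_candidate = None
        for i, sentence in enumerate(article_sentences):
            if i in selected:
                continue
            candidate = " ".join(article_sentences[j] for j in sorted(selected + [i]))
            score = scorer.score(gold_summary, candidate)["rouge2"].recall
            if score > best_score:
                best_score, best_candidate = score, i
        if best_candidate is None:  # no remaining sentence improves the score, stop
            break
        selected.append(best_candidate)
    return [1 if i in selected else 0 for i in range(len(article_sentences))]
```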

Figure 3.4: Proportion of sentences with the highest ROUGE score according to their position in the original article

3.2.4. Models

Six types of models were implemented for the task of extractive text summarization. With each model, we predicted three, seven, and ten sentences for the summaries. Out of the six models, TextRank and TF-IDF are only used as a comparison to the BERTSum models.

Oracle

Oracle was not only used as the label generation method described in section 3.2.3, but also as an upper limit for our BERT models. Since the oracle summaries are used as labels for our BERT models, the models cannot score higher than the oracle summaries. Therefore, the oracle summaries can be seen as a ceiling for our BERT models. We also experimented with both the greedy and the combination algorithm mentioned in section 3.2.3. However, the combination algorithm resulted in slow performance, as Liu [22] mentioned, especially when selecting more than three sentences. Thus, the greedy algorithm was used throughout the experiments.

LEAD

We used LEAD as a baseline, which selects the first sentences of an article (a minimal sketch of this baseline is given at the end of this section). From the analysis presented in section 3.2.2, with the positions of the highest ROUGE-scoring sentences shown in Figure 3.4, we considered LEAD to be a good baseline. With Oracle and LEAD, we now had a good range of ROUGE scores within which we wanted our models to perform. For clarification: our models should perform above the LEAD score and at or below the Oracle score. Next, we implemented two models based on statistical methods and two models based on BERT and BERTSum.

TextRank

We adopted a Python implementation3 of TextRank based on the approach followed in [26]. This implementation performs both keyword extraction and text summarization. We used the Natural Language Toolkit (NLTK)4 to download the necessary files used by the stopword list, tokenizer, and stemmer.

TF-IDF

An implementation5 of TF-IDF was adopted for extractive text summarization. The source code was updated to support Norwegian using spaCy6, an NLP toolkit similar to NLTK. Key sentences could then be extracted by ranking the scores for each sentence in descending order.

BERTSum

We used BERTSum, described in section 2.4.3, to fine-tune two pre-trained BERT models for the task of extractive text summarization. The original BERTSum code uses an older version of PyTorch and is therefore not directly suitable for integrating new models. Thus, an updated version of BERTSum by Microsoft7 was used. PyTorch, introduced by Klein et al. [18], is a toolkit for deep learning in Python. There are other currently popular deep learning libraries, such as TensorFlow (Abadi et al. [1]). However, PyTorch was chosen for our task since both the original and the updated BERTSum code are built on PyTorch. PyTorch is also more suitable for development in

3 https://github.com/acatovic/textrank
4 https://www.nltk.org/
5 https://github.com/luisfredgs/extractive-text-summarization
6 https://spacy.io/
7 https://github.com/microsoft/nlp-recipes
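As an illustration of how simple the LEAD baseline described above is, the following sketch returns the first k sentences of an article. The use of NLTK's Norwegian sentence tokenizer is an assumed implementation choice for this illustration.

```python
from nltk.tokenize import sent_tokenize  # nltk.download("punkt") is required once


def lead_summary(article_text, k=3, language="norwegian"):
    """LEAD baseline: return the first k sentences of the article."""
    sentences = sent_tokenize(article_text, language=language)
    return sentences[:k]
```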

