Extractive Multi-document Summarization of News Articles


Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Computer Science

2019 | LIU-IDA/LITH-EX-A--19/038--SE

Extractive Multi-document Summarization of News Articles

Harald Grant

Supervisor: Arne Jönsson | Examiner: Marco Kuhlmann



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Publicly available data grows exponentially through web services and technological advancements. To comprehend large data streams, multi-document summarization (MDS) can be used. In this thesis, the area of multi-document summarization is investigated. Multiple systems for extractive multi-document summarization are implemented using modern techniques, in the form of the pre-trained BERT language model for word embeddings and sentence classification. This is combined with well-proven techniques, in the form of the TextRank ranking algorithm, the Waterfall architecture and anti-redundancy filtering. The systems are evaluated on the DUC-2002, 2006 and 2007 datasets using the ROUGE metric. The results show that the BM25 sentence representation implemented in the TextRank model, using the Waterfall architecture and an anti-redundancy technique, outperforms the other implementations and provides results competitive with other state-of-the-art systems. A cohesive model is derived from the leading system and tried in a user study using a real-world application. The user study is conducted using a real-time news detection application with users from the news domain. The study shows a clear preference for cohesive summaries in the case of extractive multi-document summarization, with the cohesive summary preferred in the majority of cases.


Acknowledgements

First and foremost, I would like to thank my supervisor Arne Jönsson for great support and assistance throughout the thesis. Furthermore, I would like to thank Marco Kuhlmann for introducing me to the area of NLP and contributing with inspiration during my studies. Additionally, I would like to thank my external supervisor Gabrielle Lindesvärd for all the encouragement and assistance during my thesis project. I would also like to thank Magnus and Johanna from the data science department, as well as all of Svenska Dagbladet, for providing resources and purpose for my thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Aim
1.2 Research questions
1.3 Contributions
1.4 Outline & Organization

2 Theory and Related Work
2.1 Summarization
2.2 Multi-document summarization
2.3 Vector representations
2.4 Evaluating summarization systems
2.5 Neural Networks in Natural Language Processing

3 Method
3.1 Summarization models
3.2 Dataset preparations
3.3 Extrinsic evaluation

4 Results
4.1 Compared systems
4.2 Internal intrinsic evaluation
4.3 Extrinsic evaluation study and analysis

5 Discussion
5.1 How do state-of-the-art neural embeddings compare to tf-idf-based sentence vectors in a rank-based summarization system?
5.2 How does a classification based summarization model work in comparison to different TextRank implementations?
5.3 How does cohesion/fluency impact the end user experience of multi-document summarizations?
5.4 Comparisons
5.5 Real-world system compatibility


5.7 Extrinsic evaluation

6 Conclusion


List of Figures

2.1 Figure illustrating the difference between extractive and abstractive summaries.
2.2 Figure from Liu, Yu, and Deng (2015) representing the steps of their summarization model.
2.3 Single-layer hierarchical document flow.
2.4 Waterfall document flow.
2.5 Figures depicting the merging of the single-layer and waterfall model from Marujo et al. (2015).
2.6 Figure representing the process of contextual word embeddings using attention weights.
2.7 Example of ROUGE-2 scoring of a candidate with regards to a reference text.
3.1 BERT network architecture with inputs and outputs.
3.2 Example of a sentence accompanied by its SCU from the DUC 2006 dataset.
4.1 Example of coreferences, where the coreferences are underlined. In this example there are 4 coreferences, from 3 following sentences. This would count as 3 sentence-wise coreferences.
5.1 Example of a candidate and reference texts from Lin (2004), where the candidate


List of Tables

3.1 Showing the attributes for each dataset, where size is the number of entries in the dataset and document-count is the maximum and minimum number of documents per document set. The s-length is the maximum word length of the reference summaries.
4.1 Evaluations on the DUC-2002 dataset.
4.2 Evaluations on the DUC-2006 dataset.
4.3 Evaluations on the DUC-2007 dataset.
4.4 Showing the average length and number of sentence coreferences of the summaries from each system.


1 Introduction

The amount of data is growing exponentially; according to an IBM Marketing study [1], by 2016, 90% of the data on the web had been created within the previous 12 months. To cope with the massive flow of data, news organizations including BBC [2], Forbes [3] and Reuters [4] have turned to intelligent systems. Traditionally in information retrieval, it is common to interact with these quantities of data using web services like search engines. These types of services, e.g. Google, Bing and Yahoo, return a set of documents relevant to a search query. The documents are then manually scanned until the wanted information is extracted by the user. This type of data interaction can be shortened with the help of automatic multi-document summarization systems, which provide a summary of the most relevant documents for a quicker overview of the information.

Automatic summarization is a way to compress textual information into a dense linguistic form. There are different types of automatic summarization tasks: abstractive and extractive. Abstractive summarization is the task of summarizing a document using language independent of the document, while preserving and presenting the same main concept. Extractive summarization is the task of selecting the most important sentences from a text, best representing the text in full. Extractive summarization thus does not rely on paraphrasing or text generation. However, non-fluent extractive summaries may, through fragmentation, misrepresent source information.

Existing summarization systems often summarize a single document. For summarizing content from multiple sources, the technique of multi-document summarization is instead adopted. For these systems, evaluation is done using intrinsic metrics, which try to capture the n-gram overlap between the summary and a set of gold-standard summaries. The metrics are supposed to reward relevance and coverage while penalizing redundancy. When handling multiple documents about the same topic, the sources often overlap, sharing some information while still making relevant individual contributions; not repeating information thus becomes a more prevalent issue in multi-document summarization systems, where covering the subject with regard to all documents is important. Many different models are possible for this, and they can rely on extending a single-document summarization method to work with a larger set of documents, as shown in Baumel, Eyal, and Elhadad (2018) and Marujo et al. (2015). Automatic summarization systems are often built by combining multiple different techniques from the field of natural language processing (NLP). In recent NLP research, new language modeling techniques (Devlin et al., 2018) have been developed. These techniques have shown great promise, setting a new state of the art for many language understanding tasks including sentence-level classification, which is the foundation of text summarization (Liu et al., 2018).

[1] https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement-watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf
[2] http://bbcnewslabs.co.uk/projects/juicer/
[3] https://www.forbes.com/sites/forbesproductgroup/2018/07/11/entering-the-next-century-with-a-new-forbes-experience/#239e48133bf4
[4] https://agency.reuters.com/en/insights/articles/articles-archive/reuters-news-tracer-filtering-through-the-noise-of-social-media.html

1.1 Aim

The aim of the study is to explore extractive multi-document summarization techniques, test automatic summarization in a real-world application, and compare generated summaries with different traits in a user context.

For this purpose, different multi-document summarization systems will be implemented and compared. The study will be conducted in two steps: first, intrinsically comparing the different summarization systems using the traditional ROUGE metric on the well-used DUC datasets; then comparing a system for informative summarization with a more cohesive summarization system. The latter evaluation will be done using user preference scores on summaries created from articles in a real-time system. The implementations for the intrinsic evaluation are as follows:

1. An implementation using the Waterfall method (Marujo et al., 2015) and a two-step approach (Tohalino and Amancio, 2018), combining redundancy filtering (Ribaldo et al., 2012) and initial sentence extraction with the TextRank model (Mihalcea and Tarau, 2004), using TextRank with the tf-idf based Best Match/Okapi 25 (BM25) algorithm for sentence vector representation and closeness (Barrios et al., 2016).

2. A similar approach to the previously mentioned system, but using different sentence similarity and vector representations in the TextRank extraction: representing sentences using BERT word embeddings averaged per word (Devlin et al., 2018) (Kenter, Borisov, and Rijke, 2016) and evaluating closeness with cosine similarity.

3. A baseline neural embeddings system, using the same approach as the BERT word embedding system but with averaged word2vec (Mikolov et al., 2018) sentence embeddings.

4. A pure machine learning based approach, using transfer learning from the BERT language model and fine-tuning a sentence classifier on the task of extractive summarization (Cheng and Lapata, 2016), (Al-Sabahi, Zuping, and Nadher, 2018), (Nallapati, Zhai, and Zhou, 2017).

In the second step, the best performing original system will be compared to the derived cohesive system, which uses different sentence filtering: swapping the anti-redundancy filtering for an entailment filter, creating a more cohesive summarization system (Hovy and Lin, 1998). The goal is to investigate user preference regarding fluency and information density, using one system to create informative summaries and another prioritising fluency.

1.2 Research questions

1. How do state-of-the-art neural embeddings compare to tf-idf-based sentence vectors in a rank-based summarization system?

Measured by comparing different sentence vectorization techniques in the rank-based summarization algorithm TextRank. BM25, BERT embeddings and word2vec embeddings are used for text representation in different TextRank implementations. The systems are evaluated on the DUC 2002, 2006 and 2007 datasets using ROUGE.

2. How does a classification based summarization model work in comparison to different TextRank implementations?

Measured by evaluating the BERT implementation with a classification layer for extractive summarization on the DUC 2002, 2006 and 2007 datasets, in relation to the TextRank solutions.

3. How does cohesion/fluency impact the end user experience of multi-document sum-marizations?

Measured by comparing a system against a modified version that uses a fluency technique, through user preference evaluations. Fluency in this case refers to a more coherent summary at the sentence level.

1.3 Contributions

The study contributes a clear baseline of extractive multi-document summarization systems, where multiple systems using traditional and modern techniques have been implemented and evaluated on popular datasets and metrics. This shows the difference between modern neural embeddings and traditional unsupervised approaches in the sentence representation domain. The study further compares the performance of a modern classification based system with the other baseline systems. A cohesive and an informative system are compared using user evaluations in a real-world context, which quantifies the usability of cohesion for multi-document summarization systems.

1.4 Outline & Organization

The study will consist of implementing the different summarization systems, initially evaluating the systems intrinsically and then implementing a cohesive derivative of one of the systems. To compare the systems extrinsically, a news summarization application will be implemented using both summarization systems. Lastly, a user evaluation study will be conducted with directorial personnel through the summarization application. The thesis follows the chapter structure below.

Chapter 2 presents the different forms of summarization and distinguishes the problems faced by the different types of systems. It shows how other research has dealt with the problems of cohesion and redundancy. Other multi-document summarization approaches are described to give a background to the subject and show which techniques are at the forefront of summarization systems. Further, the evaluation process, summarization algorithms and vector representations are detailed to give a background in the area.

Chapter 3 introduces the multi-document summarization architecture and the different single-document summarization implementations used in the study. Further, the formatting of the datasets used for training and evaluation is explained in detail. Both the intrinsic and extrinsic evaluations are presented in the context of the study, and the user-interaction system for the extrinsic evaluation is depicted.

Chapter 4 shows the results from the evaluations on the different datasets in comparison to other prevalent research in the field. These systems have all been evaluated using the same metrics and datasets on the same or a similar task. The evaluated systems from external research are briefly explained together with their differences in tasks and evaluation.


Chapter 5 presents issues with the validity of the results and brings forth the differences between the compared systems from other research. The chapter explains how the different results should be interpreted with regard to the different implementations.

Chapter 6 shows what can be done in future work in this research area and concludes the results of this study.


2 Theory and Related Work

In this chapter, the main concepts and ideas relevant to the study and implementations are presented. First, a background of automated summarization is given, where the extractive and multi-document tasks are defined. Common problems, and approaches to solving them, are presented through related research, as are different approaches to implementing extractive multi-document summarization systems. Further, the techniques used in this study are introduced, including the different types of algorithms and textual vector representations used in natural language processing. For a complete background, the metrics commonly used in evaluating summarization systems are presented. The chapter ends by showing the theory and intuition behind the neural models used in this study and how they have been used in similar research.

2.1 Summarization

Automated text summarization is the task of creating a dense text description of a larger text; a way to downscale text information. Regular abstractive summarization can be viewed as Natural Language Generation where text is used as the information source (text-to-text generation). For other types of summarization, like query-focused summarization, an additional query input is used to specify the subject of the summary. Traditional Natural Language Generation can be broken down into three steps: content determination (choosing what information to present), sentence planning (determining the information in each sentence and the sentence ordering) and surface realization (generating the actual grammatically correct text) (Reiter, 1996). For extractive systems the surface realization step is excluded and the information is presented as is, without any paraphrasing, as compared to abstractive systems. Ignoring this extra step makes the task inherently less difficult, which is why focus has been on extractive systems.

2.1.1 Abstractive and Extractive summarization

In automatic summarization there are different types of approaches generating different types of summaries: abstractive and extractive. Abstractive summarization can be described as when the summary text is created from an internal representation of the information presented in the text (Hovy and Lin, 1998). The text is thus created independently of the input document vocabulary. Extractive summarization, on the other hand, can be described as creating a summary by picking a subset of the sentences from the input document. An extractive summary is thus always sentence-wise linguistically sound with respect to the input document. This, however, does not have to be the case for an abstractive summary, because the language is chosen internally and not from the document. Figure 2.1 illustrates the difference between an extractive and an abstractive summary with respect to the source text. When creating a summary from predefined sentences, as in the case of extractive summarization, the language choice is more static, which can limit sentence transitions and expressivity. This is one downside of extractive summarization with respect to abstractive. In addition to abstractive and extractive approaches, there are different types of summarization systems: those for single-document summarization (SDS) and those for multi-document summarization (MDS).

Figure 2.1: Figure illustrating the difference between extractive and abstractive summaries.

2.1.2 TextRank

TextRank is a graph-based ranking algorithm that can be used for extractive SDS and keyword extraction. It ranks each vertex in a graph by importance with regard to the connecting vertices. The ranks are updated by recursively computing the vertex ranks globally until convergence, as described in Mihalcea and Tarau (2004). The intuition is the same as for the PageRank algorithm introduced in Brin and Page (1998): the more incoming links a page has, the more important it is, and the more important the page, the more important the page's outgoing links are.

The difference when working with text instead of web pages is how the links are formed. In TextRank the pages are represented as sentences, and the links are determined by sentence-level similarity under a chosen metric (Mihalcea and Tarau, 2004). For example, the bag-of-words vector representation can be used to represent sentences, and cosine similarity can be used to calculate sentence closeness. Recently, research from Barrios et al. (2016) has improved the TextRank algorithm: they use new comparison techniques from the BM25 and BM25+ ranking functions, replacing the sentence representation and closeness measurement. The formula is depicted in Equation 2.2.
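To make the ranking step concrete, the following minimal sketch (not the thesis implementation) runs a PageRank-style power iteration over a precomputed sentence-similarity matrix; the damping factor, tolerance and iteration cap are illustrative choices, and the similarity values could come from cosine similarity or the BM25 score in Equation 2.2.

import numpy as np

def textrank(sim_matrix, d=0.85, tol=1e-6, max_iter=100):
    """Rank sentences by a PageRank-style power iteration over a
    sentence-similarity matrix (rows/columns correspond to sentences)."""
    n = sim_matrix.shape[0]
    W = sim_matrix.astype(float).copy()
    np.fill_diagonal(W, 0.0)                      # no self-links
    row_sums = W.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    W = W / row_sums                              # outgoing weights sum to 1
    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - d) / n + d * W.T.dot(scores)
        if np.abs(new_scores - scores).sum() < tol:
            break
        scores = new_scores
    return scores

# Usage: the top-k scoring sentences form the extractive summary.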

2.1.3 Neural models

Neural models for extractive text summarization have shown great promise, as shown in research from Al-Sabahi, Zuping, and Nadher (2018). For the single-document task, summarization can be done by treating sentence extraction as a classification problem (Al-Sabahi, Zuping, and Nadher, 2018) (Nallapati, Zhai, and Zhou, 2017) (Cheng and Lapata, 2016): for each sentence, classify whether it should or should not be part of the summary. This type of machine learning based approach can be implemented using different algorithms and feature representations (Leite and Rino, 2008).


Combining a state-of-the-art pre-trained language model with this classification technique has only recently been applied to summarization (Liu, 2019). In that research, an extractive SDS system based on the BERT model is implemented; the network is fine-tuned on sentence extraction for extractive summarization on different news-article datasets. To avoid redundancy, a tri-gram blocking technique is used, ignoring sentences whose tri-grams overlap with the already extracted sentences. The results show that the model outperforms multiple other state-of-the-art extractive SDS systems on the CNN/Daily Mail (Hermann et al., 2015) dataset.

2.2 Multi-document summarization

Multi-document summarization is when the input is made up of multiple, mostly related, documents (Hovy and Lin, 1998). The systems can be as simple as picking the most important document and using a single-document summarizer to summarize it (Zhang, Tan, and Wan, 2018), or summarizing all documents individually, merging the summaries, and then summarizing the merged result (Zhang, Tan, and Wan, 2018).

2.2.1 Multi-document approaches

In some research (Liu, Yu, and Deng, 2015) (Baumel, Eyal, and Elhadad, 2018), MDS is treated as a two-step process: first extract relevant information, then create a summary from a subset of the initially extracted candidate sentences. In Liu, Yu, and Deng (2015) the authors construct a model for extractive MDS based on a two-step architecture. First, a summary set of sentences is created based on the original content from the document set, which is a sparse representation of the input text. Then they try to reconstruct the original content from the sparse summary set; the system is supposed to scale the text back up into its original form. The steps are depicted in Figure 2.2. To create a summary from the candidate set, the candidate sentences are exchanged following an optimization problem, optimizing over desired traits such as coverage, sparsity and diversity, where the optimization problem is solved using the simulated annealing algorithm. Their results show that their model is fast and has competitive ROUGE results on the DUC 2006 and 2007 datasets.

Figure 2.2: Figure from Liu, Yu, and Deng (2015) representing the steps of their summarization model.

In Wan et al. (2015) the authors base their research on the idea that MDS systems perform inconsistently over document sets. They demonstrate this with a quantitative analysis on the DUC 2004 dataset, where they show that the best overall performing system underachieves on a majority of document sets. Their approach is to create a set of candidate summaries from different MDS systems using Integer Linear Programming (ILP) and then pick the best summary from the candidates. The intuition is to improve performance by creating consistently good summaries, which raises the overall summarization performance by bettering the average. The system picks the best summary based on the rank from a Support Vector Machine (SVM). Their results show that, using their compound framework, they achieve a higher ROUGE-2 score for multi-document summarization on the DUC 2002 and 2004 datasets than the individually best performing system.

In other research by Wan, Yang, and Xiao (2007), a model for MDS is constructed based on the manifold-ranking algorithm. They try to maximise information richness and novelty, creating a summary of the sentences with the highest rank regarding these traits. To extract sentences, inter- and intra-document links are created within the algorithm, and redundancy control is implemented by further re-ranking the sentences on repeat coverage. They compare their system with the competing systems from DUC 2003 and 2005 using the ROUGE-1/2/W metrics, showing competitive results.

In Naserasadi, Khosravi, and Sadeghi (2019) extractive MDS is done by approaching the task as an optimisation problem, trying to create a summary with maximal coverage and sentence entailment. They formulate the summary creation as a knapsack problem. Initially, sentences are extracted based on the rank of their tf-idf vector representations. Sentences then receive scores based on their potential for entailing the content. The problem is finally solved by picking the set of sentences that maximizes the score under a length restriction, similar to the subset sum problem of picking a subset of a set of integers that sums to zero. To avoid redundancy on a sentence level, they use fixed rules for compressing sentences, which creates less diluted and more information-dense sentences. They evaluate their system on the DUC 2007 dataset using human evaluations and ROUGE metrics and show an improvement compared to the DUC 2007 competing systems.

Modeling extractive summarization as an optimisation problem is also done in Sanchez-Gomez, Vega-Rodríguez, and Pérez (2018), however optimizing with an artificial multi-objective bee colony algorithm. Their system focuses on the objectives of maximising context coverage while minimising redundancy. The coverage and redundancy traits are measured using cosine similarity on tf-sdf sentence vector representations, where the tf-sdf measurement is a sentence-wise interpretation of the tf-idf measurement. They compare their system with other optimisation based summarization systems on a subset of the DUC dataset. Their results show that their system outperforms the other systems on the ROUGE-2 metric.

In Marujo et al. (2015), the authors present two methods for extending an SDS to an MDS system: a single-layer hierarchical model and a model following a waterfall architecture, both depicted in Figure 2.5. The hierarchical model works by individually summarizing each document and then summarizing the compound of the results. The waterfall architecture also starts with individual summarization, but then uses the chronological order, creating a summary by aggregating the least recent document with the summary and continuing until the most recent document is merged with the result. For the SDS task they use the KP-Centrality summarization method, which clusters sentences based on a closeness measure. In their evaluations they construct a set of systems using different MDS approaches combined with the KP-Centrality system, where the KP-Centrality summarizer is used with different closeness measurements that are compared in their results. They show that their system using the waterfall approach combined with cosine similarity outperforms other current state-of-the-art methods on ROUGE-1/2 for the DUC 2007 and TAC 2009 datasets.

Other SDS-to-MDS approaches, like the one presented by Baumel, Eyal, and Elhadad (2018), initially create individual summaries for each document and create the full summary by extracting the top sentences from the most recent summary in descending order. In their research they create a query-based abstractive system for SDS to MDS, where the SDS model is based on the seq2seq model (Bahdanau, Cho, and Bengio, 2014). This model in turn is based on the encoder-decoder architecture and uses an attention mechanism to represent the current context: the encoder creates embeddings for the current context, and the decoder generates the next word based on the input context and the previously generated words. Because training data for end-to-end MDS is uncommon, they use a relevance score for the sentences in the documents. In this way they shorten the documents, ignoring non-relevant information while still creating document representations. The relevance score is determined by the closeness of each sentence to the input query. They evaluate their model on the DUC 2005, 2006 and 2007 datasets, and show that their model outperforms other extractive models for query-focused MDS on the ROUGE-1/2/SU4 metrics.

Figure 2.5: Figures depicting the merging of the single-layer and waterfall model from Marujo et al. (2015). Panels: single-layer hierarchical document flow; waterfall document flow.

2.2.2 Problems faced in multi-document summarization

Automatically creating a summary is not trivial; the task has fundamental problems, some of which are side effects of having to use multiple documents. One example, repeated information in the summary, cannot happen in SDS unless the problem is also prevalent in the input document. The same goes for temporal order, which is when information is depicted in the wrong order, possibly disturbing the order of sequential events; this can be handled in an SDS system by ordering the extracted sentences by their order of appearance in the document. Another problem that is prevalent in both SDS and MDS is sentence fragmentation: non-semantically ordered sentences make the summary incoherent. This can lead to unexplained pronouns, with a sentence referring to an unmentioned individual (Gupta and Lehal, 2010), or possibly to linking a pronoun to the wrong individual, creating a false interpretation of the source-document information.

The waterfall method implicitly tries to avoid problems with temporal order by working hierarchically with the documents, sorted by creation. This removes temporal problems as long as the documents depict the most recent information, and no temporal information precedes any earlier document.

2.2.3 Fluency in extractive summaries

For extractive summarization there is no need to create fluent summaries (Smith, Danielsson, and Jönsson, 2012); the purpose is to focus on information density rather than readability, which is more important in abstractive summaries. Even so, fluency creates more easily readable text and possibly a better understanding of the content. Readability and fluency have been used as evaluation metrics for extractive summarization systems (Liu et al., 2018), and research has been done stating the case for fluent summaries (Gupta and Lehal, 2010), with studies like Otterbacher, Radev, and Luo (2002) and Smith, Danielsson, and Jönsson (2012) working on improving cohesion/fluency for extractive summarization systems.

2.2.4 Redundancy in MDS systems

Redundancy is implicitly prevalent in MDS systems, because multiple similar documents tend to overlap in information. An MDS is supposed to contain the shared overall information of the different documents. This leads to rewarding similar content from multiple sources, implicitly rewarding redundancy. Different research has dealt with this problem in different ways. Wan et al. (2015) built a system which creates a summary with a maximum number of significant words, using the intuition that the more unique words, the less redundant information. Another common technique for redundancy control is measuring the sentence similarity between the already added summary sentences and each candidate sentence, deeming a candidate with high similarity redundant and not adding it to the summary.

Tohalino and Amancio (2018) used a rank-based model, weighing connections between sentences. The connections are classified as inter- or intra-connected depending on whether the sentences belong to the same document. In the study they tried two methods for anti-redundancy filtering, called AR1 and AR2. The AR1 filter uses tf-idf and cosine similarity for identifying sentence similarity. The AR2 filter uses a formula combined with the weighted n-gram overlap. This is similar to a common technique that has been discussed in much research (Baumel, Eyal, and Elhadad, 2018) (Liu et al., 2018) (Gupta and Lehal, 2010), where sentence similarity is measured using only n-gram overlap.

Another commonly mentioned technique is Maximal Marginal Relevance (MMR) (Zhang, Tan, and Wan, 2018) (El-Ghannam and El-Shishtawy, 2014) (Gupta and Lehal, 2010). Even though it is popular, other techniques have shown better results on the MDS task (Marujo et al., 2015).

2.3 Vector representations

For natural language processing it is pertinent to use vector representations, representing text information by a quantifiable measure. This allows different algorithms and computations to be applied to the text. A common naive technique is creating a corpus-length vector with word occurrence information, as in the bag-of-words model. However, this creates large and often sparse vectors that do not represent real linguistic relations well. Much research has been done on representing text with vectors justifiably and efficiently (Kenter, Borisov, and Rijke, 2016) (Pennington, Socher, and Manning, 2014) (Levy and Goldberg, 2014) (Mikolov et al., 2013) (Mikolov et al., 2018).

2.3.1 TF-IDF

A vectorization commonly used in information retrieval is Term Frequency Inverse Document Frequency (tf-idf). The algorithm measures how often a term occurs in a specific document (tf) and multiplies this by a value accounting for how common the word is in the complete document collection (idf), as depicted in Equation 2.1. The intuition of the algorithm is to represent word occurrence while removing the weight of unrepresentative or too common words. Both measurements are common in information retrieval and occur in the BM25 ranking algorithm, also called Best Match or Okapi BM25. That algorithm is used for scoring documents against a query, and can be used for scoring sentences in documents with respect to a document set.

$\mathrm{tf}(t, d) = |\{t \in d\}|, \qquad \mathrm{idf}(t, D) = \frac{|D|}{|\{d \in D : t \in d\}|}$ (2.1)

Equation 2.1 shows the formulas for the term frequency tf and inverse document frequency idf functions, where t is a term, d is a document and D is the complete document set.

$\mathrm{BM25}(S, C) = \sum_{i=1}^{|S|} \mathrm{idf}(s_i) \cdot \frac{\mathrm{tf}(s_i, C) \cdot (k_1 + 1)}{\mathrm{tf}(s_i, C) + k_1 \cdot \left(1 - b + b \cdot \frac{|C|}{avglen}\right)}$ (2.2)

Equation 2.2 shows the BM25 algorithm, where tf(s_i, C) is the term frequency of word s_i in sentence C and idf(s_i) is the inverse document frequency of word s_i in the complete document. The average sentence length in the entire document is avglen. The constants k_1 and b are set according to Barrios et al. (2016), to k_1 = 1.2 and b = 0.75.

2.3.2 Word embeddings

When representing textual data in a vector space, as much information as possible should be included. For models like bag-of-words, a single sentence is represented by an entire corpus-length vector. To produce less overhead and better handle a large corpus, word embeddings have been introduced (Levy and Goldberg, 2014) (Pennington, Socher, and Manning, 2014), creating denser vector representations and accounting for the context of each word. In Mikolov et al. (2013) word vectors are created using the continuous bag-of-words (c-bow) technique. Further research from Mikolov et al. (2018) introduces new pre-trained word embeddings based on the same c-bow technique. Recently, even more modern approaches have been proposed for creating word embeddings (Peters et al., 2018), accounting for a larger context and attention weights over the input. A similar approach is used in the BERT language model (Devlin et al., 2018). Its embeddings are built using a very large context and attention weights, and they account for the previous as well as the following context of the target text. An illustration of the process is depicted in Figure 2.6.

Figure 2.6: Figure representing the process of contextual word embeddings using attention weights.

2.4 Evaluating summarization systems

Evaluation of summarization systems follows the common technique used in the machine learning domain: system results are compared with a predefined gold standard, using a dataset of task examples. This allows the same datasets to be used for both training and evaluation. Using a dataset and a quantitative metric makes it easier to fairly compare systems from independent research. However, because summarization is ambiguous, many different summaries can fairly represent a given text. To tackle this, many summarization evaluation metrics incorporate the use of multiple gold-standard reference summaries.

One prevalent method, used in the 2011 TAC summarization track, is the Pyramid method (Nenkova and Passonneau, 2004). The method uses 4 reference summaries and evaluates the score using Summarization Content Units (SCUs). SCUs are parts of text describing some information in the source documents. A summary is scored based on its SCUs overlapping with the reference summaries. Other techniques, like the one presented in Louis and Nenkova (2013), use automatically created summaries from other systems as reference summaries and create a word distribution from the compound reference summaries. The word distribution is then compared with the word distribution of the candidate summary using Jensen-Shannon divergence. Even if the method does not account for length or fluency, it has been shown to have a high correlation with human evaluations.

More traditional precision- and recall-based metrics are the Bilingual Evaluation Understudy (BLEU) (Papineni et al., 2002) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin, 2004) metrics. These metrics also allow multiple gold-standard reference summaries and measure similarity on an n-gram basis. The BLEU metric is precision-based and measures the average n-gram precision of a candidate summary with regard to the reference gold standards. The ROUGE metric is recall-based, measuring n-gram recall with regard to the reference summaries. A difference between the ROUGE metric and plain n-gram recall is that ROUGE incorporates several recall-based measurements, which differ in how the n-grams overlap. Both metrics have been critiqued recently (Schluter, 2017) (Reiter, 2018), because they mostly cover an individual trait favoured in automatic summaries. However, when ROUGE was first introduced at the Document Understanding Conference (DUC) it was combined with a length restriction; using a recall-based measurement combined with a length restriction implicitly rewards precision. An example of ROUGE scoring is depicted in Figure 2.7, where the ROUGE-2 metric is used.

Candidate: "the gunman police killed"

Reference: "police killed the gunman"

Candidate bi-grams: "the gunman", "gunman police", "police killed"

Reference bi-grams: "police killed", "killed the", "the gunman"

ROUGE-2 score: 2/3

Figure 2.7: Example of ROUGE-2 scoring of a candidate with regards to a reference text.

Within the ROUGE evaluation there are different types of metrics depending on how the n-grams are selected. ROUGE-S compares similarity based on skip-gram overlap. Skip-grams are bi-grams that are not necessarily directly connected but follow one another within a given context; the skip-grams are thus order sensitive but not as static as regular n-grams. When computing ROUGE-S similarity the precision is also included, making the measurement more complete. The ROUGE-S calculation is depicted in Equation 2.3. ROUGE-SU is another popular ROUGE metric; it is a variation of ROUGE-S that adds the leading uni-grams of the text to the skip-gram set. Because of the completeness of the ROUGE metric, it is often used for evaluation of summarization systems (Lin, 2004) (Sutskever, Vinyals, and Le, 2014) (Wan, Yang, and Xiao, 2007) (Marujo et al., 2017) (Naserasadi, Khosravi, and Sadeghi, 2019) (Sanchez-Gomez, Vega-Rodríguez, and Pérez, 2018) (Liu, 2019) (Zhang et al., 2019).

$R(c, g) = \frac{|C \cap G|}{|C|}, \quad P(c, g) = \frac{|C \cap G|}{|G|}, \quad \mathrm{ROUGE}\text{-}S(c, g) = \frac{(1 + \beta^2) \cdot R(c, g) \cdot P(c, g)}{R(c, g) + \beta^2 \cdot P(c, g)}$ (2.3)

Equation 2.3 shows the ROUGE-S calculation, where c is the candidate summary, g is the gold summary, C = {skipgram ∈ c} and G = {skipgram ∈ g}. ROUGE-S(c, g) is the skip-gram recall for candidate c with respect to gold-standard reference g.
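For concreteness, the small sketch below reproduces the bi-gram recall of the Figure 2.7 example using simple set overlap against a single reference; the official ROUGE toolkit additionally handles clipped counts, multiple references and the other variants described above.

def ngrams(tokens, n=2):
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def rouge_n_recall(candidate, reference, n=2):
    """Simplified n-gram recall of a candidate against one reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    return len(cand & ref) / len(ref) if ref else 0.0

print(rouge_n_recall("the gunman police killed", "police killed the gunman"))
# -> 0.666..., i.e. 2 of the 3 reference bi-grams are covered (Figure 2.7)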


2.5 Neural Networks in Natural Language Processing

Neural models currently represent the state of the art for many natural language processing tasks. Google's recent pre-trained neural language model, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), has improved the state of the art for eleven natural language processing tasks. The BERT model is based on the Transformer architecture, which in turn is based on the encoder-decoder architecture.

2.5.1 Encoder-Decoder/autoencoder

An autoencoder is a neural network architecture with the goal of recreating its input data. The model consists of an encoder part and a decoder part. The encoder scales down the input data into an internal form held in a hidden layer. The decoder then takes this internal representation and tries to upscale the data into the wanted form. Traditionally this architecture can be used for vectorization of data through feature extraction, dimension reduction or downscaling. Because the model is a sequence-to-sequence model, it suits many natural language processing tasks like machine translation.

This type of architecture, also called encoder-decoder, has been used in natural language processing (Sutskever, Vinyals, and Le, 2014) (Cho et al., 2014) with different types of neural networks acting as the encoder and decoder respectively. Other architectures, like the Transformer model from Vaswani et al. (2017), have been built incorporating this autoencoder model. In their research they create a system for the task of machine translation and propose an attention mechanism, weighting input vectors by the importance of the context words. An illustration of attention weights can be seen in Figure 2.6. Their research showed an improvement of the state of the art for machine translation in BLEU metric evaluation.

2.5.2 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a language model built on the Transformer architecture (Vaswani et al., 2017). The model processes input text including its context and uses an attention mechanism to focus on important words, which helps to better interpret semantic relationships. The BERT model is used for pre-training a general language understanding network suitable for transfer learning. The BERT implementation relies on a set of training tasks (Devlin et al., 2018), creating a model robust against previously unseen input. The training is set up the same way as traditional word embedding training, but with additional sub-tasks. The first sub-task is to predict a hidden token given a text sequence. This is done by masking a random token in the input and having the model predict the masked token given the context; for this task, the masked word is sometimes replaced with a random word, or left entirely unchanged. The next task is entailment prediction: feeding sequences of sentences into the model and predicting whether they are sequential.

The training steps are what create the intelligence of the model: by adding tasks and augmenting the input with errors during training, the result is a model less prone to over-fitting and better at handling errors or previously unseen word combinations.

2.5.3 Fine-Tuning & Transfer-Learning using BERT

In the BERT introduction paper (Devlin et al., 2018), the model is fine-tuned on further tasks in the NLP domain. The fine-tuning process is done by adding a layer on top of the BERT transformer model, using the pre-trained base language model for transfer learning and fine-tuning the weights on a specific task. In their research they show an improvement of the state of the art for eleven natural language processing tasks.

In Zhang et al. (2019) the pre-trained BERT model is used in multiple steps for creating an abstractive summary. In the first step they create embeddings of the text; these embeddings are used as inputs to a Transformer for creating an initial summary. This summary is then corrected by masking each word and letting the pre-trained BERT model predict the masked word. They show that their model produces state-of-the-art results on the CNN/Daily Mail dataset.


3 Method

This chapter presents the method, where the intrinsic evaluation and the system implementations are explained. The extrinsic evaluation and the different datasets used in the study are also presented.

3.1 Summarization models

The first part of this study was to implement summarization systems. Four different extractive MDS systems were implemented. All systems except the classification based one used the AR1 redundancy technique combined with the waterfall architecture. The systems were all based on extending an SDS system to the multi-document task, using the waterfall procedure on a sorted document set. The method is depicted in Equation 3.1.

$|\{d \in D\}| = n, \qquad \mathrm{multisum}(D) = \sum_{i=0}^{n} \mathrm{sum}(sum + d_i)$ (3.1)

Equation 3.1 shows the Waterfall summarization method, where D is the document set, sum is the aggregate summary returned from sum(), which is an arbitrary SDS function, and multisum() is the waterfall method.
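A minimal sketch of the waterfall aggregation in Equation 3.1 is given below, with sds standing in for an arbitrary single-document summarizer and documents represented as lists of sentences; it illustrates the control flow under those assumptions rather than the exact implementation.

def waterfall_summarize(documents, sds, max_len):
    """Waterfall MDS following Equation 3.1: `documents` are assumed to be
    sorted chronologically and `sds` is an arbitrary single-document
    summarizer returning a list of sentences of at most `max_len` words."""
    summary = []
    for doc in documents:
        # Summarize the running summary together with the next document.
        summary = sds(summary + doc, max_len)
    return summary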

Each sentence to be added to the summary is checked with the AR1 redundancy technique: if the cosine similarity of the sentence to any sentence already in the summary is higher than a threshold value, it is interpreted as redundant and not added to the summary. The AR1 redundancy technique uses tf-idf vectors to represent sentences. The threshold is dynamic with regard to the full document set and is calculated as shown in Equation 3.2 below.

$\{i, j \in d \in D\}, \qquad thresh = \frac{\max(\mathrm{cos\_sim}(i, j)) - \min(\mathrm{cos\_sim}(i, j))}{2}$ (3.2)

Equation 3.2 shows the AR1 redundancy technique threshold calculation, where i and j are sentences, d is any document and D is the document set.


The AR1 redundancy technique was incorporated into the multi-document summarization system in the following manner: in each summarization step, a sentence is only added if it is not redundant. For each MDS system a different SDS technique was used. For the classification model, the waterfall method was not used at all; instead, only the redundancy technique was used. This is because the classifier always assigns a sentence the same label, so summarizing already summarized content again would return the same result.
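The sketch below illustrates how the AR1 filter can wrap the sentence selection; scikit-learn is used here only for illustration, the exclusion of self-similarities from the Equation 3.2 threshold and the word-count stopping criterion are assumptions, and the ranked_sentences input is assumed to come from one of the SDS models.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ar1_filter(ranked_sentences, all_sentences, max_words):
    """Greedily add ranked sentences, skipping any sentence whose tf-idf
    cosine similarity to an already added sentence exceeds the dynamic
    threshold of Equation 3.2 (a sketch, not the thesis implementation)."""
    vec = TfidfVectorizer().fit(all_sentences)
    sims = cosine_similarity(vec.transform(all_sentences))
    off_diag = sims[~np.eye(len(all_sentences), dtype=bool)]
    thresh = (off_diag.max() - off_diag.min()) / 2      # Equation 3.2
    index = {s: i for i, s in enumerate(all_sentences)}
    summary, length = [], 0
    for sent in ranked_sentences:
        if any(sims[index[sent], index[kept]] > thresh for kept in summary):
            continue                                    # redundant, skip
        summary.append(sent)
        length += len(sent.split())
        if length >= max_words:
            break
    return summary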

3.1.1 Classification based neural model

For the classification based model the implementation followed the research from Vaswani et al. (2017), using the pre-trained BERT model (Devlin et al., 2018) with a classification layer on top for sentence-level classification. The model was trained on a multi-document news summarization dataset, labeling sentences as part of the summary or not. When using a sentence classifier for summarization, the total length of the extracted sentences should be longer than the wanted summary length; the idea is that filtering for redundancy will shrink the summary to a smaller size. It is therefore important that the reference summaries in the training data are longer than the summary length in the evaluation dataset.

The BERT transformer model was used through the re-implemented PyTorch implementation [1]. The actual model to fine-tune was limited by hardware conditions to the base model 'bert-base-uncased', following the constraints in other research (Zhang et al., 2019). The model was extended with a linear classification layer taking the transformer outputs as inputs. To get a result from the classification layer, a soft-max activation function was applied; this normalizes the result and gives a probability score for each label. The configuration was done according to the documentation for other sequence-level classification tasks from the General Language Understanding Evaluation (GLUE) benchmarks [2], using the model setup with a hidden layer output size of 768, an intermediate size of 3072, 12 Transformer layers and 12 attention heads. Figure 3.1 depicts the general 'bert-base-uncased' model as used in this study. The training was done on a single GPU (GTX 1060) using a vocabulary of 30,000 tokens, with a batch size of 32 and a learning rate of 2e-5, running for a total of 3 epochs. The data used for training was the full BBC extractive multi-document dataset (Greene and Cunningham, 2006), consisting of 8367 entries. The dataset was chosen because of the multi-document domain and because it covers news articles, fitting the domain of the extrinsic user study well. The dataset is further described in Section 3.2.

[1] https://github.com/huggingface/pytorch-pretrained-BERT
[2] https://gluebenchmark.com/
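The classification head can be sketched as follows; this is a minimal PyTorch illustration rather than the exact fine-tuning code, assuming the 768-dimensional pooled sentence representation from the pre-trained encoder is computed elsewhere and that the two labels correspond to "in summary" and "not in summary".

import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Linear classification layer on top of a pooled encoder output."""
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        logits = self.classifier(pooled_output)
        # Soft-max normalizes the scores into per-label probabilities.
        return torch.softmax(logits, dim=-1)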

3.1.2 TextRank BM25

The TextRank implementation with BM25 is taken from the Gensim [3] library (Řehůřek and Sojka, 2010). The library implementation is the same as the one from the research in Barrios et al. (2016), where it was originally introduced, and follows the original PageRank formulation. For the other TextRank based systems a custom version of the algorithm was implemented, in order to test different closeness measures as well as different vector representations.

3.1.3 Custom TextRank with neural embeddings

The custom version of TextRank was implemented in two variants using the same closeness measure, cosine similarity, but with different vector representations. The word2vec based model is implemented using the pre-trained context vectors 'wiki-news-300d-1M-subword' [4] from Mikolov et al. (2018), trained on Wikipedia. The other implementation uses word embeddings based on the BERT model; for this, the pre-trained 'bert-base-uncased' model was used. The BERT model was trained on the concatenation of the BooksCorpus from Zhu et al. (2015) and Wikipedia content. The word embeddings were vectors extracted from the model's internal representation of the input text.

To create sentence vectors from the word embeddings, the word vectors of a sentence are averaged into a single sentence vector. This technique has been used for other neural word embeddings and has been shown to give good results (Kenter, Borisov, and Rijke, 2016). The approach then follows the TextRank algorithm by creating a similarity matrix of all sentence vectors, representing the similarity of each sentence to all other sentences in the document. In the next step the PageRank algorithm is run on the matrix until convergence. The model returns the document sentences ranked by relevance; to produce a summary, the top n sentences are then extracted.
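A sketch of the sentence-vector construction and similarity computation is shown below; word_vectors is assumed to be a token-to-embedding lookup (e.g. the pre-trained word2vec vectors, or BERT token embeddings extracted per sentence) and the embedding dimension is illustrative.

import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word embeddings of a sentence into one vector
    (Kenter, Borisov, and Rijke, 2016)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity between two sentence vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u.dot(v) / denom) if denom else 0.0

# The pairwise cosine similarities form the matrix on which the
# PageRank iteration from Section 2.1.2 is run.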

Figure 3.1: BERT network architecture with inputs and outputs.

3.2 Dataset preparations

This section details how the different datasets used in training and evaluation have been created and used, as well as where the datasets were retrieved from and how they have been used in other research. The datasets and evaluation metrics were chosen because of the popularity of the evaluation technique, which allows the systems in this study to be compared to systems from similar research.

3.2.1 BBC dataset for BERT classification fine-tuning

The dataset consists of 2225 documents from the BBC, spanning the topics business, entertainment, politics, sport and tech. It was originally introduced in Greene and Cunningham (2006) for the task of document clustering and has been annotated with summaries at a later stage. The data is publicly available on Kaggle [5].

3.2.2 Datasets for evaluation

The datasets used in the evaluation are multi-document datasets providing multiple documents and multiple reference summaries for each document set. They come from the 2002, 2006 and 2007 instances of the Document Understanding Conference (DUC) [6]. All datasets were used with the ROUGE evaluation for scoring and have multiple gold-standard summaries.

There are small differences between the datasets because of changes in the conference. For instance, both the 2006 and 2007 instances of the DUC have a more consistent document count, which was an improvement made over time, and the summary size of the 2006 and 2007 datasets was increased to 250 words. The larger size of the 2002 dataset compared to the later ones is because the training, test and trial data were merged, creating a larger dataset for the 2002 instance. Table 3.1 lists the dataset information.

Name      size  document-count  s-length
DUC-2002  59    5-15            200
DUC-2006  23    23-25           250
DUC-2007  22    25              250

Table 3.1: Showing the attributes for each dataset, where size is the number of entries in the dataset and document-count is the maximum and minimum number of documents per document set. The s-length is the maximum word length of the reference summaries.

3.2.3 DUC 2006/2007 datasets

The version of the DUC-2006/2007 datasets used in this study excluded the regular reference summaries; instead, the datasets were annotated by Copeck et al. (2007) with summarization content units (SCUs), the same as used in the Pyramid method (Nenkova and Passonneau, 2004), showing for each sentence what information it describes. An SCU can be interpreted as an answer to a question covering the document content; see Figure 3.2 for an example of a sentence and its covering SCU. To evaluate the ROUGE score of a summary on a dataset there have to be gold-standard references.

To create reference summaries from SCUs, each SCU sentence covering an individual statement is added to a reference summary. If the SCU is already covered in the reference summary, a new reference summary is created. This method was used to create a dynamic set of reference summaries based on coverage. A reference summary has a limit of 250 words, in accordance with the limit on the evaluation summaries from the DUC conference.
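One way to implement the described procedure is sketched below; it is an interpretation of the greedy assignment, assuming the annotated data is available as (sentence, SCU) pairs, and it does not reflect the exact annotation handling.

def build_references(scu_sentences, max_words=250):
    """Distribute (sentence, SCU) pairs over reference summaries: a
    sentence goes into the first reference that does not yet cover its
    SCU and still has room, otherwise a new reference is started."""
    references = []   # each entry: {"scus": set, "sentences": list, "words": int}
    for sentence, scu in scu_sentences:
        n_words = len(sentence.split())
        for ref in references:
            if scu not in ref["scus"] and ref["words"] + n_words <= max_words:
                ref["scus"].add(scu)
                ref["sentences"].append(sentence)
                ref["words"] += n_words
                break
        else:
            references.append({"scus": {scu},
                               "sentences": [sentence],
                               "words": n_words})
    return [" ".join(r["sentences"]) for r in references]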

[5] https://www.kaggle.com/pariza/bbc-news-summary/discussion
[6] https://www-nlpir.nist.gov/projects/duc/index.html


Sentence: "The Supreme Court in 1987 barred states from requiring the teaching of creationism in public schools where evolution is taught, calling such a Louisiana law a thinly veiled attempt to promote religion."

SCU: "The court decided against teaching creationism as science"

Figure 3.2: Example of a sentence accompanied by its SCU from the DUC 2006 dataset.

3.3 Extrinsic evaluation

For the extrinsic evaluation, the system with the overall best performance in the intrinsic measurements will compete with a derivative system. This derivative system uses the same model but with a cohesion filter instead of the redundancy filter.

The cohesion filter uses the previously introduced BERT model for entailment prediction. Instead of checking whether a sentence is redundant, the sentences are sorted according to the probability that the sentence follows the currently summarized content. Because next sentence prediction is a task included in the BERT pre-training, the model did not require any fine-tuning. Intrinsic evaluations with automatic metrics do not account for fluency or cohesion, which is why extrinsic evaluations were done through user evaluations, following Reiter (2018), who notes that little has been done to compare automatic metrics to the actual usability of summarization systems.
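As an illustration, the sketch below shows how such a cohesion filter could be realized with the pre-trained next-sentence-prediction head of BERT from the HuggingFace transformers library (a recent version of the library is assumed). The greedy reordering loop is one possible interpretation of sorting candidates by follow-probability; the function names and the word budget are illustrative assumptions, not the thesis implementation.

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
    model.eval()

    def follow_probability(context: str, candidate: str) -> float:
        """Probability that `candidate` is a natural continuation of `context`."""
        enc = tokenizer(context, candidate, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        # Class 0 of the next-sentence-prediction head means "B follows A".
        return torch.softmax(logits, dim=1)[0, 0].item()

    def cohesive_summary(ranked_sentences, max_words=250):
        """Greedily append the candidate most likely to follow the summary so far."""
        remaining = list(ranked_sentences)
        summary = [remaining.pop(0)]  # seed with the highest-ranked sentence
        while remaining and sum(len(s.split()) for s in summary) < max_words:
            context = " ".join(summary)
            best = max(remaining, key=lambda s: follow_probability(context, s))
            summary.append(best)
            remaining.remove(best)
        return summary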

According to Albert and Tullis (2013), the best way to compare different systems is to compare them directly with a preference evaluation. The user is forced to choose a system based on preference and not on any predefined traits or metrics, which removes biases towards specific traits and focuses on the pure benefit of the summary. This evaluation is summative, i.e. no improvements will be made afterwards, so detailed insights into how the summaries differ are not needed. The idea of the evaluation is to find out which type of summary is "best" in terms of usability for the users. Because no neutral alternative is offered, this forced-choice scoring yields as many comparable judgements as possible.

The live data used in the user-based system comes from a stream of articles from different sources, consisting of English articles scraped from newspaper websites. The system fetches a document set from the article stream capturing a single event. The document set is then summarized and sent via a Slack client to a channel for the users to see. The message format includes summaries from both systems, accompanied by source references and a title from one of the documents. The title is picked from the most general document in the document cluster. Clustering is used to identify similar articles to be added to the document set, i.e. to find a set of articles regarding the same event. For this process, each article is tagged using a Named Entity Recognition (NER) service, and the articles are then clustered based on the Jaccard distance of the tags. The summary title is picked from the medoid article in the cluster. To ease manual reading, the summaries are shortened to between 80 and 120 words by removing sentences from the end of a summary until the length requirement is met. The number of documents for each summarization is capped to a minimum of 5 and a maximum of 25: for an event to be summarized, at least 5 articles must be available, and no more than 25 articles are used in each summary. The minimum was set arbitrarily, to exploit the multi-document setting and lower the impact of poor-quality articles on the summary. The maximum was set to 25 to match the document-set size used in the DUC-2006/2007 datasets.
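The tag-based clustering distance and the medoid selection described above can be sketched as follows, assuming each article has already been tagged by the NER service and is represented as a dict with a set of tags; the names are illustrative rather than taken from the actual system.

    def jaccard_distance(tags_a, tags_b):
        """1 - |intersection| / |union| of two tag sets; 1.0 when both sets are empty."""
        union = tags_a | tags_b
        if not union:
            return 1.0
        return 1.0 - len(tags_a & tags_b) / len(union)

    def medoid_article(cluster):
        """The article whose tag set has the smallest total distance to all others."""
        return min(
            cluster,
            key=lambda a: sum(jaccard_distance(a["tags"], b["tags"]) for b in cluster),
        )

    # title = medoid_article(cluster)["title"]  # assuming articles also carry a "title" field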


4 Results

Systems from other studies presented in these results are annotated with an asterisk (*), indicating that the results were not replicated in this study. Among the compared models, the DUC-top model does not represent a single system, but the leading result for each corresponding ROUGE metric from the conference. Results for a system are only included if they were available for the same evaluation metric and dataset instance.

4.1 Compared systems

The results annotated with an asterisk come from other research using the same datasets and metrics. The systems are all from recent research and show competitive results for MDS on the DUC datasets and ROUGE metrics. These systems are presented further in this section.

Zhang, Tan, and Wan (2018) build a system for abstractive MDS based on previous research on single-document summarization (Tan, Wan, and Xiao, 2017). The system their implementation builds on is an abstractive system called SinABS, a seq2seq model with an attention mechanism and an encoder-decoder architecture. To extend this SDS system to the multi-document task, they add a decoder layer that produces an intermediary form of the multiple documents, which works as the input to the SinABS system. They evaluate their model on the DUC-2002 dataset using the ROUGE metrics. In these results the model is denoted as MultiSinABS.

In other research, Li et al. (2017) develop an MDS system also based on the encoder-decoder architecture, but using variational auto-encoders for latent semantic analysis, combined with salience estimation, to produce an extractive summary. Their model was evaluated on the DUC-2006/2007 datasets using ROUGE. It is called VAE-A in their study and is denoted as such in these results.

Another system, built in Baumel, Eyal, and Elhadad (2018), is based on a seq2seq abstractive SDS model and built for query-focused MDS. In their research they use a pre-trained abstractive summarization model, where the initial content is extracted from the document set using a query-relevance filter. The most relevant sub-content is then chosen based on redundancy and relevance, and this pseudo-document is fed into the neural abstractive SDS model, which produces a summary. In the paper the SDS model is denoted as RSA-QFS. This model is combined with an iterative technique for MDS, as well as an initial query-relevance metric computing the similarity of an input query to the document content. Their MDS model is called the Iterative RSA Word Count, where the word count overlap is used to calculate query relevance. In these results their model is denoted as RSAWC.

Wang et al. (2016) focus on the task of sentence compression within the area of MDS systems. Their implementation relies on a set of steps for creating a query-focused MDS. First, the sentences are ranked using support vector regression (SVR) with the LambdaMART ranking algorithm. The next step is sentence compression, where different implementations are tested; the leading implementation uses a parse-tree representation with a modified beam search for compression. The last step is sentence reordering and co-reference resolution, done by matching sentences on tf-idf similarity and using the source document time-stamps and sentence ordering. The system improved the state of the art on the DUC 2006/2007 datasets. In these results the system is denoted as CQMDS, indicating that it is compression-based and query-focused.

Another system using sentence compression is proposed by Naserasadi, Khosravi, and Sadeghi (2019). In their research they formulate the extractive summarization task as a combinatorial optimization problem: sentences have a value and a weight, and the problem is to pick a subset of sentences that maximises the total value subject to a weight constraint, where the sentences are weighted by length and scored on relevance and entailment. The system improved the state of the art on the DUC 2007 dataset and is denoted as KSOpt in these results.
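To make the combinatorial formulation concrete, a generic 0/1 knapsack-style selection over sentences under a word budget is sketched below; this is not the actual scoring, constraints or solver of Naserasadi, Khosravi, and Sadeghi (2019), and the names and the 250-word budget are illustrative assumptions.

    def select_sentences(sentences, scores, max_words=250):
        """0/1 knapsack over sentences: maximise total score under a word budget."""
        weights = [len(s.split()) for s in sentences]
        # best[w] = (best total score with at most w words, chosen sentence indices)
        best = [(0.0, [])] * (max_words + 1)
        for i, w_i in enumerate(weights):
            for w in range(max_words, w_i - 1, -1):
                candidate = best[w - w_i][0] + scores[i]
                if candidate > best[w][0]:
                    best[w] = (candidate, best[w - w_i][1] + [i])
        return [sentences[i] for i in sorted(best[max_words][1])]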

4.2 Internal intrinsic evaluation

The models evaluated were three different TextRank-based systems as well as a system based on a neural classifier. One of the systems was a TextRank-based system using the Gensim implementation from Barrios et al. (2016) with a BM25 vector representation for sentences, denoted as TRBM25. The other TextRank-based systems were built on a custom implementation, one of them using BERT word embeddings, denoted as TRBERT, and another using word2vec embeddings, denoted as TRW2V. The last system was the classification-based neural model built on pre-trained BERT, fine-tuned on the extractive BBC corpus, denoted as NNBBC.

The results for the DUC-2002 dataset are shown in Table 4.1. In the table the highest score for the respective ROUGE metric is in bold, and the systems measured in this study are separated from results retrieved from other studies. Empty values correspond to results that were unavailable for the given system and metric. The DUC-2006 evaluations are shown in Table 4.2 and the DUC-2007 results in Table 4.3, following the same formatting as the DUC-2002 result table. The ROUGE scores were calculated using a ROUGE wrapper library¹, where the base ROUGE program version was ROUGE 1.5.5 and the settings were equivalent to running the standard ROUGE evaluation² as done in Wang et al. (2016). The results show that the TRBM25 model had the highest score for all ROUGE metrics on the DUC-2002 dataset. The second-best model was NNBBC, which scored marginally worse than the leading system and better than the MultiSinABS model for all ROUGE metrics on the DUC-2002 dataset. On all datasets, both TRBERT and TRW2V perform noticeably worse than all other systems on all ROUGE metrics. The embedding-based systems also perform inconsistently across the datasets: TRBERT performs better than TRW2V on the DUC-2002 dataset, but under-performs on the DUC-2006 and 2007 datasets. For the DUC-2006 and 2007 datasets the TRBM25 model outperforms all other models on all ROUGE metrics except ROUGE-1, where systems from other research perform better. Overall, the TRBM25 system has the best score for all measurements, while the NNBBC system has a competitive score on the higher n-gram metrics and both TRBERT and TRW2V perform poorly in comparison to all other systems.

¹ https://github.com/tagucci/pythonrouge/blob/master/pythonrouge/pythonrouge.py
² ROUGE-1.5.5.pl -n 4 -w 1.2 -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -a -d




Model         ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-SU4
TRBM25        0.35301  0.12549  0.08538  0.07395  0.15805
NNBBC         0.34943  0.09994  0.05889  0.04977  0.14034
TRBERT        0.27607  0.07633  0.04798  0.04137  0.10791
TRW2V         0.2804   0.05623  0.02254  0.01645  0.09434
MultiSinABS*  0.34     0.696    -        -        0.114

Table 4.1: Evaluations on the DUC-2002 dataset.

4.3 Extrinsic evaluation study and analysis

The extrinsic results were gathered through a real-time system providing news personnel with extractive summaries of a current news event. Two summaries were provided for each event, accompanied by a title from the most relevant source document and links to the source articles. The summaries were sent to the concerned personnel through a messaging channel, and for each message the users were instructed to mark the favoured summary.

To evaluate, the users marked the favoured summary in a message using a marker in the chat client interface. Some of the users received access to the channel late; they were asked to vote on messages sent prior to their involvement. The study included 5 users in the news domain and was conducted over 14 days, with a varying number of incoming events and articles each day. In total, 58 summaries were created for each system.

Model     ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-SU4
TRBM25    0.41819  0.14727  0.0943   0.08037  0.19013
NNBBC     0.35179  0.09718  0.05564  0.04732  0.14088
TRBERT    0.29211  0.05696  0.02372  0.0193   0.10085
TRW2V     0.33415  0.05889  0.01865  0.01146  0.11217
RSAWC*    0.4289   0.0873   -        -        0.1775
CQMDS*    -        0.1102   -        -        0.1625
DUC-top*  0.41017  0.09537  0.02935  0.01404  0.15495
VAE-A*    0.396    0.089    -        -        0.143

Table 4.2: Evaluations on the DUC-2006 dataset.

Model     ROUGE-1  ROUGE-2  ROUGE-3  ROUGE-4  ROUGE-SU4
TRBM25    0.45275  0.1827   0.12275  0.10382  0.22439
NNBBC     0.37014  0.09594  0.04832  0.03625  0.14311
TRBERT    0.34064  0.08073  0.03584  0.02724  0.12626
TRW2V     0.35769  0.09136  0.04099  0.03124  0.13615
KSOpt*    0.46     0.13     -        -        0.18
DUC-top*  0.44499  0.12285  0.04757  0.02585  0.17470
RSAWC*    0.4392   0.1013   -        -        0.1854
CQMDS*    -        0.1349   -        -        0.1846
VAE-A*    0.421    0.110    -        -        0.164

Table 4.3: Evaluations on the DUC-2007 dataset.



Sentence: "Barack Obama was born in Hawaii. He is the president. Obama was elected in 2008."

Figure 4.1: Example of coreferences, where the coreferent mentions are underlined. In this example there are 4 coreferent mentions spread over 3 consecutive sentences, which counts as 3 sentence-wise coreferences.

Model        Length  Coreferences  Votes
Informative  90.1    3.5           14.3%
Cohesive     89.9    4.12          85.7%

Table 4.4: Average summary length (words), average number of sentence-wise coreferences, and share of preference votes for each system.

To highlight the differences between the generated summaries in the user evaluation, fluency, length and summary overlap have been measured. The overlap was measured on the overlapping sentences of the two summaries and calculated with the Jaccard distance; on average the summaries overlapped by 43%. The fluency of the summaries was estimated using coreferences, which are expressions in a text referring to the same entity. Coreferences have been used for measuring fluency of summaries in other research (Smith, Danielsson, and Jönsson, 2012; Naserasadi, Khosravi, and Sadeghi, 2019). Figure 4.1 shows an example of a coreference-annotated text.

The coreference measurements were made with the Stanford NLP toolkit (Manning et al., 2014), using the neural coreference annotator developed in Clark and Manning (2016) and Cheng and Lapata (2016). To measure coreferences fairly for the extractive systems, internal coreferences within sentences were removed and only external, sentence-wise coreferences were counted. The results of the study are shown in Table 4.4. The vote percentage is the share of all preference votes received by the given summarization system; because the number of votes differed between users, the votes were merged in the results. The length and coreference columns correspond to the average summary length in words and the average number of sentence-wise coreferences per summary for each system.
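One way to count such sentence-wise coreferences, once the coreference chains have been extracted with the Stanford toolkit, is sketched below. The thesis does not spell out the exact counting convention beyond Figure 4.1, so the rule used here (count every mention that refers back to an earlier sentence) is an interpretation chosen to reproduce that example; representing a chain simply as the list of sentence indices of its mentions is likewise an assumption.

    def sentence_wise_coreferences(chains):
        """Count coreferent mentions that refer back to an earlier sentence.

        Each chain is the list of sentence indices of its mentions, in document
        order. Purely within-sentence links are ignored.
        """
        count = 0
        for chain in chains:
            earliest = min(chain)
            count += sum(1 for sent_idx in chain if sent_idx > earliest)
        return count

    # Figure 4.1: one chain with mentions in sentences [0, 1, 1, 2] -> 3 sentence-wise coreferences
    assert sentence_wise_coreferences([[0, 1, 1, 2]]) == 3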

