
Textual Summarization of Scientific Publications and Usage Patterns

Aybüke Öztürk

October 21, 2012

Master’s Thesis

Under the supervision of:

Dr. Jerry Eriksson, Umeå University, Sweden

Examined by:

Prof. Frank Drewes, Umeå University, Sweden

Umeå University

Department of Computing Science

SE-901 87 UMEÅ


Abstract

In this study, we propose textual summarization for scientific publications and mobile phone usage patterns. Textual summarization is a process that takes a source document or a set of related documents, identifies the most salient information, and conveys it in less space than the original text. The increasing availability of information has motivated extensive research on textual summarization within Information Retrieval and Natural Language Processing (NLP), because textual summaries are easier to read and provide efficient access to large repositories of content. For example, snippets in web search serve users as helpful textual summaries. While tools for textual summarization exist, they are either not adapted to collections of scientific documents or they summarize short texts such as news articles. In the first part of this study, we adapt the MEAD 3.11 summarization tool [19] to propose a method for building summaries of a set of related scientific articles, exploiting the structure of scientific publications in order to focus on the parts that are known to be the most informative in such documents. In the second part, we generate natural language statements that describe, in a more readable form, symbolic patterns extracted from the Nokia Challenge data. The motivation is that the availability of mobile phone usage details enables new opportunities to better understand the interest of user populations in mobile phone applications. To evaluate the first part of the study, we make use of Amazon Mechanical Turk (MTurk) to validate the summarization output.


Acknowledgements

This research project would not have been possible without the support of many people. I would like to express my greatest gratitude to the people who have helped and supported me throughout my project.

I am grateful to Dr. Sihem Amer Yahia and Prof. Marie Christine Rousset, who gave me the chance to work on such an interesting project in their research team. They were abundantly helpful and offered invaluable assistance, support and guidance. I thank my internal supervisor Dr. Jerry Eriksson, who helped me a lot. Special thanks to Prof. Dr. Henning Christiansen and Prof. Frank Drewes, who gave me valuable advice on my project report.

I am grateful to my colleagues, Ms. Ruth Garcia, Mr. Shameem Ahamed Puthiya Parambath, and Mr. Behrooz Omidvar Tehrani for their continuous support for the project, from initial help and through ongoing encouragement to this day.

I wish to thank my parents and friends, who inspired and encouraged me to go my own way, for their undivided support and interest; without them I would have been unable to complete my project. And especially to God, who made all things possible.


Contents

Abstract
Acknowledgements
List of Figures
1 Introduction
  1.1 General Problem Statement
  1.2 Context of the Work
  1.3 Outline of the Thesis
2 Multi-document Summarization
  2.1 Related Work
  2.2 Our Proposed Method For Constructing Summaries
3 Experiments And Improvements Of The Method
  3.1 Experimental Protocol Based On Amazon Mechanical Turk
  3.2 Experimental Setting
  3.3 Experimental Results
  3.4 Proposed Improvements
4 Pattern Summarization
  4.1 Our Proposed Method For Generating Sentences
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Appendices


List of Figures

1.1 The SUNFLOWER Project Architecture
1.2 The Nokia Project Architecture
2.1 Fuzzy Logic Summarization Architecture
2.2 Process of Summarization
2.3 Example of Irrelevant Data
2.4 Example of Centroid Feature
2.5 List of Centroid Words
2.6 Example of Without Keyword Summary
2.7 Example of With Keyword Summary
2.8 Example of MEAD Score
2.9 Example of Sentence Extraction
2.10 Example Text Before Rephrasing
2.11 Example Text After Rephrasing
3.1 Qualification Test Example
3.2 Example of Qualification Test Image
3.3 Independent evaluation question
3.4 Comparative evaluation question
3.5 Independent Evaluation Result
3.6 Comparative Evaluation Result
3.7 Independent Evaluation Result
3.8 Comparative Evaluation Result
4.1 Taxonomy for Application Attributes
4.2 Taxonomy for Demographic Information Attributes
4.3 Step 1 for Sentence Generation
4.4 Step 2 for Sentence Generation
4.5 Step 3 for Sentence Generation
4.6 Step 4 for Sentence Generation


.1 MTurk Qualification Test
.2 Independent Evaluation example 1
.3 Independent Evaluation example 2
.4 Comparative Evaluation example 1
.5 Comparative Evaluation example 2


Chapter 1

Introduction

1.1 General Problem Statement

Summaries should convey the most important points of the original text in as few words as possible. As the amount of online information grows, systems that can automatically summarize one or more documents become more desirable. Summarization provides flexibility and convenience in many settings: headline news for informing, TV guides for decision making, abstracts of papers for saving time, and condensed text for the small visible area of a personal digital assistant (PDA) screen. On the other hand, evaluating the quality of summaries is a very difficult task because summarization has to deal with relevance, which is not a clear-cut notion. People identify the information that they think will be of interest to the readers. A summary conveys a short form of the input document, but it also depends on the reader's state of mind. In other words, who the reader is, what he knows before reading the summary, and why he wants to know about the input texts are significant aspects of summarization. The psycho-linguistic and computational-linguistic communities agree that modelling the reader's state of mind is a complicated task, if not entirely impossible [14].

Many approaches address the problem by building systems that depend on the type of the required summary. Summarization is useful for different purposes and in varied settings:

• Abstractive summaries aim to convey the most important information in the input and may reuse phrases or clauses from the set of related documents, but the summaries are overall expressed in the words of the summary author, which requires a lot of semantic interpretation and sentence synthesis. Some abstractive systems are designed to track updates across different news reports, such as SUMMONS [31]. These systems aim to present similarities, differences, contradictions, and generalizations among sources of information. Replication is a difficult task for such systems because they rely heavily on the adaptation of internal tools.


• Extractive summaries are composed of sentences taken as they appear in the document or in the set of documents. Text extraction means identifying the most relevant passages in one or more documents, often using standard statistically based Information Retrieval techniques augmented with natural language processing and heuristics. Extractive summarization systems usually consist of two parts: the first deals with selecting the important content, and the second deals with presenting the selected content. More information about extractive summarization is given in the next chapter.

• Indicative summaries enable quick scanning of search results; they are two- or three-line summaries indicating the contents of the source documents. Informative summaries, in contrast, are written to provide a brief description of the original document, conveying an idea of what its whole content is about.

• Keyword summaries, the goal of which is to compose a short text from a set of significant words or phrases mentioned in the given documents. The sentences extracted from the text should be the most important and representative ones. Extraction of the most important and representative phrases is called keyphrase extractive summarization, which constrains the output to phrases that appear in the document [30].

• Query focused summaries, the goal of which is to summarize the input documents for a given specific query. Snippets for search engines are a particularly useful query application [2]. Query focused summarization is very similar to question answering: these systems provide a summary of documents based on a query or a question, and the generated summary is shaped by the interest of the user. Update summaries are time-sensitive summaries that express the recent updates regarding documents; they help audiences or followers access new information.

• Single document summaries provide a more compact text that captures the essence of the original document. Single document summarization is a difficult task by itself, but multi-document summarization is even harder. Multi-document summaries compress a set of related documents into a single summary; this simplifies searching the sources and reduces reading time by pointing to the most relevant source documents [44]. More information about multi-document summarization is given in the next chapter.

• Summarizing patterns aims to provide a natural language text translating the formal meaning conveyed by a given symbolic rule or pattern extracted from data by automatic pattern mining techniques.

1.2 Context of the Work

My work has been conducted in two different projects: SUNFLOWER and Nokia Challenge. Short overviews of these projects are given below.


1.2.1 The SUNFLOWER

The SUNFLOWER [11] is a system that employs collaborative editing to summarize large corpora of scientific articles pertaining to a certain topic. As far as we are aware, there is no available system like the SUNFLOWER. We believe that it is interesting and challenging to combine automatic summarization techniques with human intelligence. In this project, related articles are first bundled based on content and article metadata such as authors and citations. In the second step, each bundle is summarized by extracting key sentences from its constituent articles. The last step assigns summaries to subject experts according to their skills and helps them collaboratively edit and improve the automatically generated summaries. The SUNFLOWER is developed in collaboration with Bloomsbury Publishing [22], a publisher of scientific material, to build an in-house collaborative editing platform. The project architecture, drawn during the project, is shown in Figure 1.1. A screenshot of the Summarization module from the web interface of our implementation is given in the Appendix.

As shown in Figure 1.1, the project is organised in the following steps:

• Bundling:

The SUNFLOWER starts by pre-processing articles using Latent Dirichlet Allocation (LDA) [21] to associate with each article a vector of topic weights. In order to identify different sub-sections in the desired related work output, the SUNFLOWER uses a document similarity measure that combines content similarity with extended co-authorship similarity and citation similarity to bundle articles. Then, an agglomerative hierarchical clustering algorithm as in [20] finds the bundles. Finally, the SUNFLOWER associates with each bundle a collection of keywords describing it, using the LDA vectors of its constituent articles (a code sketch of this bundling step is given after this list). More details about the mentioned notions and methods can be found in [45].

• Summarization:

Our contribution: given a set of bundles and their keywords, the SUNFLOWER uses an adaptation of MEAD, an open-source summarization toolkit which implements extractive summarization techniques. The well-known structure of scientific publications allows us to experiment with summarizing whole scientific articles but also parts of articles, such as their abstract, introduction and related work sections, where contributions tend to be formulated. In addition, since each bundle comes with a set of keywords, these can be used to bias the summarization towards the sentences that best represent the keywords. The output of this step is a set of bundles, their summaries and their keywords. We describe our summarization process in more detail in Section 2.2.


• Collaborative Editing:

An environment is created where skilled workers are associated with bundles for editing. Matching between workers and bundles is done using their respective skills and sub-categories. An assignment heuristic is then used to optimize some objective function. Examples of objective functions include minimizing the idle time of workers and balancing the number of edits across bundles. The formalization of the collaborative model along with the objective functions gives rise to a family of efficient assignment heuristics. Please refer to [43] for more information about the methods for task assignment and collaborative editing.
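To make the bundling step concrete, the following is a minimal sketch under the assumption that scikit-learn and SciPy are available; the corpus, the number of topics, and the number of bundles are hypothetical placeholders, and the co-authorship and citation similarity terms of the actual system are only indicated by a comment.

```python
# A minimal sketch of the bundling step, assuming scikit-learn and SciPy.
# The corpus, topic count, and bundle count are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

abstracts = [
    "graph algorithms for shortest paths and network flows",
    "hamiltonian paths in undirected graphs and complexity",
    "statistical language models for information retrieval",
]

# Associate with each article a vector of topic weights via LDA.
counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_vectors = lda.fit_transform(counts)

# Content similarity from the LDA vectors; the real system also combines
# extended co-authorship and citation similarity terms at this point.
similarity = cosine_similarity(topic_vectors)

# Agglomerative hierarchical clustering on the resulting distances.
condensed = squareform(1.0 - similarity, checks=False)
bundles = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(bundles)  # e.g. [1 1 2]: the two graph papers form one bundle
```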

Figure 1.1: The SUNFLOWER Project Architecture

1.2.2 The Nokia Challenge

The availability of mobile phone usage details and user demographics enables new opportunities to better understand the interest of user populations in mobile phone applications. We have participated in a pattern mining project over a data set provided by Nokia about phone usage by users for whom demographic information is available. Frequent patterns encapsulate a lot of information, but at the same time some challenges limit their usability. One challenge is that the number of patterns can reach millions, and patterns with many items are hard to read.


Another challenge is that the analyst may be unable to understand the meaning of patterns with many items. The fundamental goal of the project is to provide the analyst with an interactive exploration framework for frequent patterns and to translate these patterns into a more readable form.

The steps defined in this project are:

• Pattern mining is used to discover hidden dependencies between applications and their usage. Pattern attributes correspond to values of user demographics, such as Young and Female, as well as applications, like Desktop and Calendar.

• Given the large space of possible patterns, an interactive framework is proposed

based on usage-based primitives that help to explore the space of discovered patterns by abstracting and refining them on demand. Abstraction generalizes attributes in a pattern, leading to more readable patterns in which the analyst can find the semantics more easily. Refinement finds attributes that are the most characteristic of sets of users, according to a saliency measure, and presents these attributes through visualization. Please refer to [32] for more information about the methods for pattern mining, abstraction, and refinement.

• Within this report, our contribution is to generate natural language statements that describe patterns in a more readable form. Figure 1.2 shows the architecture of the project, drawn during the project.

Figure 1.2: The Nokia Project Architecture


1.3 Outline of the Thesis

This report is organized as follows:

• Chapter 2 presents various existing techniques of multi-document summarization and focuses on the summarization tool MEAD, which we have used in order to implement a tool for summarizing a set of scientific documents.

• Chapter 3 describes the experimental protocol and settings, and discusses how analysing the experimental results has guided us in trying to improve them and how new experiments have validated the improvements.

• Chapter 4 describes the method of sentence generation for symbolic patterns.

• Finally, Chapter 5 presents future plans and the conclusion of this report.


Chapter 2

Multi-document Summarization

Automatic summarization is the creation of a shortened version of a document or a set of documents by a computer program [46]. There are many techniques to summarize a document, and these techniques can be adapted for summarization of a set of documents (multi-document summarization) [4][7]. We overview some of them in Section 2.1. Most of the existing works have dealt with summarization of non-technical textual documents such as newspaper articles. In our work, we have studied how to use and adapt the existing tool MEAD to build a prototype for summarizing a bundle of scientific articles. We explain our approach in Section 2.2.

2.1 Related Work

Summarization techniques have been studied and discussed as a research subject since the publication of Luhn's paper [5]. Luhn first stemmed words to their root forms and deleted stop words. After that, he compiled a list of content words sorted by decreasing frequency, the index in the list providing a significance measure of the word. A significance factor was then derived for each sentence, reflecting the number of occurrences of significant words within it. All sentences are ranked in order of their significance factor, and the top ranking sentences are finally selected as the summary [6].

Baxendale [33] worked on the position feature, which has since been used in many complex machine-learning based systems. He analysed 200 paragraphs and found that in 85% of the paragraphs the topic sentence was the first sentence, and in 7% of the cases it was the last sentence. Thus, an accurate positional way to select a topic sentence is to choose the first or last sentence of a document. Edmundson [34] describes a system that produces document extracts. The two features of word frequency and positional importance were incorporated from the previous two works. Additionally, one more feature was used, which checks whether the sentence is a heading or title. Weights were attached to each of these features manually to score each sentence.


Unsupervised methods for sentence extraction are an essential subject in extractive summarization because they do not require any external sources, models, or linguistic processing and interpretation. Over the last fifty years, machine learning techniques have also been successfully applied to summarization. The first such method, Naive Bayes [17], is used in query-focused multi-document summarization systems to categorize each sentence as worthy of extraction or not. The Naive Bayes classifier described by Kupiec et al. [36] is based on the system of Edmundson [34], with two new features introduced: sentence length and the presence of uppercase words. The assumption is that the employed features are independent of each other given the class. Each sentence was given a score according to the formula below, and only the n top sentences were extracted.

The classification probabilities are learnt statistically from the training data. Let S be the set of sentences forming the summary, s a sentence from the document collection, and F1, F2, ..., Fk the features used in classification. The formula below gives the probability that sentence s will be chosen for the summary given that it possesses these features:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{i=1}^{k} P(F_i \mid s \in S)\, P(s \in S)}{\prod_{i=1}^{k} P(F_i)}$$

The results from the Naive Bayes sentence selection experiments show that a combination of sentence location, word frequency, the presence of uppercase words, and sentence length gave the best results for single-document summarization. Aone et al. [35] also combined a Naive Bayes classifier with the TF*IDF (term frequency, inverse document frequency) feature. The TF value is the number of times a word appears in the documents divided by the total number of words in the documents. The IDF value is calculated as the logarithm of the number of documents divided by the number of documents where the word appears [13].

$$\mathit{tfidf}(t_j) = \frac{D(t_j)}{|D|} \cdot \log\!\left(\frac{C}{C(t_j)}\right)$$

In this formula, C is the number of documents in the collection, C(t_j) is the number of documents containing the term t_j, |D| is the total number of words in document D, and D(t_j) denotes how many times t_j occurs in document D.
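A direct transcription of this formula into Python might look as follows; the two-document corpus is a hypothetical placeholder.

```python
# Minimal sketch of the TF*IDF formula above: tfidf(t) = (D(t)/|D|) * log(C/C(t)).
# The two-document corpus is a hypothetical placeholder.
import math
from collections import Counter

corpus = [
    "statistical language models for text summarization".split(),
    "comparing two ways of estimating statistical models".split(),
]

def tfidf(term, doc, corpus):
    tf = Counter(doc)[term] / len(doc)            # D(t_j) / |D|
    df = sum(1 for d in corpus if term in d)      # C(t_j): documents containing t_j
    return tf * math.log(len(corpus) / df)        # multiply by log(C / C(t_j))

print(tfidf("summarization", corpus[0], corpus))  # (1/6) * log(2/1) ≈ 0.1155
```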

Another method is based on Hidden Markov Models (HMMs) [16][37], used for single document summarization. The essential reason for using a sequential model is to account for local dependencies between sentences.


In the HMM, the probability that sentence s is in the summary is not assumed to be independent of whether sentence s-1 is in the summary. As mentioned, Naive Bayes methods assume the independence of the features, whereas the HMM uses a joint distribution over the feature set.

Lastly, a well-known technique is the graph-based method [18], which is used for finding similarities and dissimilarities in pairs of documents. The importance of a sentence is determined by computable features using a cosine similarity matrix, where each entry is the similarity between the corresponding sentence pair. After stop word removal and stemming, the sentences in the documents are represented as nodes in an undirected graph, with one node per sentence. Two sentences are connected by an edge if they share some common words, that is, if their TF*IDF cosine similarity is above some threshold. In this way, word frequency plays a direct role in determining the structure of the graph.

LexRank [3] is a system which uses graph-based methods for summarization. The authors calculated a modified IDF cosine similarity to use in the graph-based method. Algorithm 1 summarizes how to compute LexRank scores for a given set of sentences [3]. According to [38], LexRank is not practical for multi-document summarization of scientific papers. In addition, LexRank is a sophisticated and computationally expensive method, and it extracts almost the same sentences as the baseline MEAD Original method. These are the major reasons why we did not adopt LexRank in our system for building summaries of sets of related scientific articles.

In the NLP approach, text summarization has been implemented based on fuzzy logic [29]. Each feature of a text, such as sentence length, location in the document, and similarity to keywords (described in the next section), is used as an input to the fuzzy system. The fuzzy logic system is designed according to the selection of fuzzy rules and membership functions, and its performance is affected by this selection.

The fuzzy logic system consists of a fuzzifier, an inference engine, a defuzzifier, and the fuzzy knowledge base. In the fuzzifier, inputs are translated into linguistic values using a membership function for each input linguistic variable. After fuzzification, the inference engine applies the rule base containing fuzzy IF-THEN rules to derive the output linguistic values. In the last step, the output linguistic variables from the inference are converted to final values by the defuzzifier, using a membership function to represent the final sentence score.

To implement text summarization based on fuzzy logic, the first step is to use the features as inputs to the fuzzifier. Then, the input membership function for each feature is divided into different fuzzy sets, such as the importance values high (H) and very high (VH). In the inference engine, the most important part of the procedure is the definition of the fuzzy IF-THEN rules.


Algorithm 1 Computing LexRank Scores for Sentences
Input: An array L of t sentences, cosine threshold m
Output: An array S of LexRank scores

  Array CosineMatrix[t][t]
  Array Degree[t]
  Array S[t]
  for i = 1 to t do
    for j = 1 to t do
      CosineMatrix[i][j] = idf-cosine(L[i], L[j])
      if CosineMatrix[i][j] > m then
        CosineMatrix[i][j] = 1
        Degree[i]++
      else
        CosineMatrix[i][j] = 0
      end if
    end for
  end for
  for i = 1 to t do
    for j = 1 to t do
      CosineMatrix[i][j] = CosineMatrix[i][j] / Degree[i]
    end for
  end for
  Return S
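A runnable Python sketch of Algorithm 1 is given below. Since the algorithm above returns S without showing how the row-normalized matrix is turned into scores, we add the power iteration step from the original LexRank paper [3]; the TF*IDF sentence vectors are hypothetical placeholders.

```python
# Minimal LexRank sketch: thresholded cosine graph (as in Algorithm 1) plus the
# power iteration from the LexRank paper to obtain stationary scores. Sentence
# vectors are assumed to be precomputed TF*IDF (idf-cosine) vectors.
import numpy as np

def lexrank(vectors, threshold=0.1, iterations=50):
    t = len(vectors)
    cosine = np.zeros((t, t))
    for i in range(t):
        for j in range(t):
            sim = vectors[i] @ vectors[j] / (
                np.linalg.norm(vectors[i]) * np.linalg.norm(vectors[j]))
            cosine[i, j] = 1.0 if sim > threshold else 0.0   # thresholded edge
    degree = cosine.sum(axis=1)                # Degree[i] from Algorithm 1
    matrix = cosine / degree[:, None]          # row-normalize by degree
    # Power iteration: stationary distribution of the random walk on the graph.
    scores = np.full(t, 1.0 / t)
    for _ in range(iterations):
        scores = matrix.T @ scores
    return scores

# Hypothetical 3-sentence example with 4-term TF*IDF vectors.
vecs = np.array([[0.2, 0.7, 0.1, 0.0],
                 [0.3, 0.6, 0.0, 0.1],
                 [0.0, 0.1, 0.9, 0.2]])
print(lexrank(vecs))
```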

The important sentences are extracted from these rules according to the selected features. A sample IF-THEN rule is the following:

IF (SentenceLength is H) and (TermFreq is VH) and (SentencePosition is H) THEN (Sentence is important)

Likewise, the last step in the fuzzy logic system is defuzzification.

The output membership function is divided into three membership functions, Unimportant, Average, and Important, to convert the fuzzy results from the inference engine into an output for the final score of each sentence. After that, a value from zero to one is obtained for each sentence based on the selected sentence features. The obtained output value determines the degree of importance of the sentence in the final summary. The architecture of fuzzy logic summarization, drawn based on [29], is shown in Figure 2.1.
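The following self-contained sketch illustrates this fuzzification, rule firing, and defuzzification mechanism on a single rule; all membership function parameters and output centers are hypothetical illustrations, not the rule base of [29].

```python
# Minimal Mamdani-style fuzzy scoring sketch: one rule, triangular membership
# functions, min for AND, and weighted-average defuzzification over singleton
# output centers. All parameters are hypothetical, not taken from [29].
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def sentence_score(length, term_freq, position):
    # Fuzzify inputs: degree to which each input belongs to the "high" set.
    length_h = tri(length, 5, 15, 30)        # SentenceLength is H
    freq_vh = tri(term_freq, 0.5, 1.0, 1.5)  # TermFreq is VH
    pos_h = tri(position, 0.4, 1.0, 1.6)     # SentencePosition is H

    # Rule: IF length H AND freq VH AND position H THEN important (min = AND).
    important = min(length_h, freq_vh, pos_h)
    unimportant = 1.0 - important

    # Defuzzify over output centers 0.1 (Unimportant) and 0.9 (Important),
    # yielding a final score between zero and one.
    return (unimportant * 0.1 + important * 0.9) / (unimportant + important)

print(sentence_score(length=12, term_freq=0.8, position=0.9))  # 0.58
```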

As mentioned, multi-document summarization differs from single-document summarization in that document selection, compression, and redundancy are critical issues in forming useful summaries [10]. With this in mind, we can discuss how sentences are extracted for multi-document summarization.


Figure 2.1: Fuzzy Logic Summarization Architecture

The main differences between single- and multi-document summarization are the following:

• Finding a group of documents written about the same topic is much harder than working on a single document.

• The size of the summary with respect to the size of the document set is much smaller for a multi-document set than for single-document summaries.

For both single- and multi-document summarization, the co-reference problem is a major issue [15]. The essential problem is to find the sentences that are actually important enough to be included in a general-purpose summary. Many requirements are imposed on multi-document summaries. For instance, the summary should reflect the essential points of the documents but should also minimize redundancy. Summaries should be relevant and readable to the user and should outline related information. Finally, it is important that summaries present the most relevant and diverse information first, so that readers get the maximal information content even if they stop reading the summary.

With today's technology, information is increasingly being produced in digital formats, and as a consequence the need for automatic text summarization has risen in recent years. In particular, the study of multi-document summarization has become popular for creating and sharing knowledge in an appropriate way, such as building a related work section for an article. A number of multi-document summarization systems have been developed to help users get an overview of a set of articles. The most well-known example is the MEAD summarization tool. There are some other free systems available as well [23][24][25], for example SweSum [26], mainly a Swedish-language text summarizer, and EstSum [27], a summarizer for Estonian newspaper texts. These systems are typically evaluated on short documents such as newspaper articles. The main reason behind this is the lack of a publicly available large collection of scientific articles with ideal summaries [28].


MEAD can be used for single-document summarization and for multi-document summarization (clusters of related documents). It takes input documents in textual form only, and all data in MEAD is stored as XML. For each sentence, it computes a score combining different scores depending on features that have to be selected by the users of MEAD, and it outputs as the summary the sentences having the highest scores. MEAD combines many summarization features: for example, SimWithFirst computes the cosine overlap with the first sentence in the document (or with the title, if it exists), and QueryOverlap computes the cosine overlap with a query sentence or phrase. MEAD also includes two baseline summarizers, lead-based and random. Lead-based summaries are produced by selecting the first sentence of each document, then the second sentence of each, until the desired summary size is reached. A random summary consists of enough randomly selected sentences from the cluster to produce a summary of the desired size. We use graph-based summaries in our system, which is more appropriate for article summaries. MEAD has been primarily used for summarizing documents in English.

The settings in MEAD that can be set by the user are the following:

• the minimum sentence length (number of words) that will be included in a summary;

• how many sentences the output summaries will be made of (defined as a percentage);

• the processing of provided keywords to choose the sentences to put in the summaries;

• the weights to take into account in the combination of the scores based on the above features.

2.2 Our Proposed Method For Constructing Summaries

This section discusses our current implementation of a multi-document summarization system designed to produce summaries for scientific articles. To examine current multi-document summarization methods on scientific topics, articles were extracted from arXiv, an open digital library [9]. In total we obtained 754,774 articles classified into 7 large groups: Physics, Mathematics, Computer Science, Statistics, Quantitative Biology, Quantitative Finance, and Nonlinear Science. For our experiments, we used only the 19,937 computer science articles, on which we performed experiments using MEAD.


Figure 2.2: Process of Summarization

The process of summarization consists of four parts, shown in Figure 2.2.

• The first part of the process is a preprocessing step that converts and cleans documents to provide them to MEAD. Each PDF article is converted into text format, to which some cleaning rules are applied to remove pieces of text that are irrelevant for MEAD.

• The second part of the process is MEAD specific. It assigns scores corresponding to the selected MEAD features by calculating a score for each sentence.

• The next part of the process extracts sentences according to their scores and the information about each sentence's origin. This step is carried out in MEAD.

• Rephrasing the summaries is the postprocessing step of our system, making the summaries more readable.

Each of these steps is described in the following sections.

2.2.1 Document Conversion And Cleaning

Before starting the cleaning, we convert each PDF into text format using ps2ascii, because MEAD does not support the PDF format.

As a result, we encounter text such as that shown in Figure 2.3. At this point, we provide an overview of the problems to be addressed by document cleaning and their solutions.


A document cleaning approach should satisfy several requirements in our experiments.

The requirements defined in these experiments are:

• First of all, it should detect and remove pieces of text that are not grammatically meaningful. For instance, text cannot consist of a letter repeated more than twice, as in the examples given in Figure 2.3.

• Second, it should detect and remove all mathematical symbols when they appear together as formulas in text, in tables, or in figures.

Figure 2.3: Example of Irrelevant Data

We have used samples from different texts, such as the one illustrated in Figure 2.3, to build the document cleaning rules. Building the rules involves the following steps:

• Conversion of articles into individual sentences.

• Replacement of special characters with spaces.

• Removal of sentences containing any of the following: references, figures, section titles, tables, acknowledgements.


As a result, the rules are applied to each sentence, and the output of this step is a text-only document. In addition, some parts of the text are not removed explicitly; instead we make use of a dedicated MEAD feature to get rid of irrelevant parts of articles: if a sentence consists of fewer than 9 words, MEAD does not extract it.
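A minimal sketch of these cleaning rules in Python might look as follows; the sentence splitter, the trigger-word list, and the repeated-letter rule are simplified illustrations of the steps above, not the exact rules of our implementation.

```python
# Minimal sketch of the document cleaning rules described above; the trigger
# words and the repeated-letter rule are simplified illustrations.
import re

TRIGGERS = ("references", "figure", "section", "table", "acknowledgement")

def clean(text):
    # Naive split of the article into individual sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = []
    for s in sentences:
        s = re.sub(r"[^A-Za-z0-9\s.,;:!?-]", " ", s)   # special chars -> space
        if re.search(r"([A-Za-z])\1{2,}", s):          # a letter repeated > 2 times
            continue
        if any(w in s.lower() for w in TRIGGERS):      # drop boilerplate sentences
            continue
        kept.append(s.strip())
    return " ".join(kept)

print(clean("See Table 2 for details. The method improves recall. Aaaaargh!!!"))
# -> "The method improves recall."
```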

2.2.2 Computation Of Scores For Sentences

For each sentence, several scores are computed by MEAD depending on features chosen by the user. These scores are then combined into the final score of each sentence. As mentioned earlier, we have chosen to use four of MEAD's features that are judged important in [8]: the position of the sentence in the article, the centrality of the sentence with respect to the document set, the keywords the sentence contains, and the number of words in the sentence. We now explain how the scores corresponding to each feature are computed by MEAD.

Position Feature

The first feature, called the position feature, is based on the relative position of a sentence in its document, such that the first sentence gets the highest score. It is applied separately to each document. Algorithm 2 shows the calculation of the position score. For instance, in Figure 2.8 the first article consists of 5 sentences, and the position score of its sentence 5 is 0.447214, as calculated in Equation 2.1:

$$\sqrt{1/5} = 0.447214 \qquad (2.1)$$

Algorithm 2 Position Feature

  for each document do
    for each sentence do
      if the sentence is scored then
        Position ← sqrt(1 / position of sentence)
      else
        Position ← 0
      end if
    end for
  end for


Centroid Feature

The second feature, called the centroid feature, measures the centrality of a sentence with respect to the overall topic of a set of documents. A centroid is a group of words that statistically represents a set of documents. As such, the centroid can be used both to classify relevant documents and to identify salient sentences in a set of documents. A centroid value can be calculated for words or for sentences in a set of articles. The centroid value of a word is its TF*IDF value.

Figure 2.4: Example of Centroid Feature

Centroid words are those whose centroid values are above some threshold. In our system, we have set the threshold to 3 (the default value proposed by MEAD). If there are not enough words whose TF*IDF values are over the threshold, MEAD takes the first 8 * (number of documents) words as centroid words (8 is the MEAD default).

After the centroid value of each word is calculated, the centroid value of each sentence is computed as the sum of the centroid values of the centroid words it contains. MEAD finds the sentence with the highest centroid value among all sentences; that sentence is the centroid sentence for the whole set of documents. The centroid score of each sentence is then normalized by dividing its centroid value by that of the centroid sentence. Figure 2.4 displays centroid scores for sentences coming from different texts (identified by a number in column 1). The sentences are numbered by their order in the text they come from (identified by a number in column 2), and the different feature scores computed for each sentence in the set of documents are given in columns 3, 4, and 5.

The centroid algorithm is given in Algorithm 3. The following example clarifies the centroid feature calculation whose result is shown in Figure 2.4.

• In the first step, MEAD computes TF*IDF values for each word in the set of documents: "1.txt", "2.txt", and "3.txt".


• In the second step, MEAD counts the number of documents and sets 8 * 3 = 24 as the required number of centroid words. It then constructs the centroid words of the document set by taking the words above the threshold, and filling up to the desired number of centroid words if needed. The centroid words are given in Figure 2.5. In our example, some words scoring less than the threshold of 3 are returned in order to complete the required number of words, such as "Sample" (2.90242081166701) and "statistically" (2.73214560374501).

• In the third step, MEAD computes the centroid value of each sentence. We illustrate this with the second sentence of document "3.txt": "This paper compares two different ways of estimating statistical language models." In this sentence, the centroid words models, estimating, statistical, and ways contribute to the centroid value of the sentence:

models: 8.42047422093794
estimating: 5.34274350296071
statistical: 3.28711826697865
ways: 2.4411260596956

8.42047422093794 + 5.34274350296071 + 3.28711826697865 + 2.4411260596956 = 19.491462050573

• In the next step, MEAD finds the centroid sentence of the whole set of documents. In our example, the second sentence of document "1.txt" is the centroid sentence (it has the highest sentence centroid value among all sentences). Its centroid value is 37.1032329841665.

• In the last step, the final score of each sentence is normalised. For our example sentence, the normalized centroid score is 19.491462050573 / 37.1032329841665 = 0.525330557013477.

• As shown in Figure 2.4, the centroid sentence of the whole set of documents is the second sentence of "1.txt", whose normalized centroid score is 1.0000. Our example sentence is the second sentence of "3.txt", with a normalized centroid score of 0.525331.
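Condensed into code, the centroid computation described above looks roughly as follows; the TF*IDF values and tokenized sentences are placeholders for what MEAD actually computes, and the 8 * (number of documents) fallback is omitted for brevity.

```python
# Minimal sketch of the centroid feature: centroid words are terms whose TF*IDF
# is above the threshold; a sentence's score is the sum of its centroid words'
# values, normalized by the best sentence. TF*IDF values here are placeholders.
tfidf = {"models": 8.42, "estimating": 5.34, "statistical": 3.29,
         "ways": 2.44, "sample": 2.90, "the": 0.10}
THRESHOLD = 3.0  # MEAD's default centroid threshold

centroid_words = {w: v for w, v in tfidf.items() if v > THRESHOLD}

def sentence_centroid(sentence):
    return sum(centroid_words.get(w, 0.0) for w in sentence)

sentences = [["the", "models", "estimating", "statistical", "ways"],
             ["the", "sample", "ways"]]
raw = [sentence_centroid(s) for s in sentences]
best = max(raw)
scores = [r / best for r in raw]   # normalize by the centroid sentence
print(scores)                      # [1.0, 0.0]: "sample" and "ways" are below 3
```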


Figure 2.5: List of Centroid Words

Keywords Feature

The next feature is the keyword-based feature, which boosts the scores of sentences containing keywords of interest. The summarized text should preserve the key ideas described in the articles of the bundle. This is achieved using the keywords associated with the bundle, which represent the most important words within the bundle's articles. The keywords of a bundle are extracted from the topic distributions associated with its constituent articles and the corresponding word distributions associated with the topics. We first explain how the keywords are obtained from the topic and word distributions produced by LDA.

LDA is a generative probabilistic model for documents. The basic idea behind LDA is that documents are composed of random mixtures of latent topics, where each topic is characterized by a distribution over words. LDA is based on the following assumptions: the word distribution is multinomial, the topic distribution is multinomial, the topic weight distribution is a Dirichlet distribution, and the word distribution per topic has a Dirichlet prior.


The mathematical model behind LDA is as follows. Let P(d) be the probability of choosing a document d, P(t|d) the conditional probability of choosing a topic t given document d, and P(w|t) the probability of choosing word w given topic t. The joint probability distribution of the observed variables (d, w) is P(d, w) = P(d)P(w|d). By Bayes' rule,

$$P(d, w) = \sum_{t} P(t)\, P(w \mid t)\, P(d \mid t)$$

In our work, for each bundle, the topic distribution vectors of the member articles are summed, and each field of the resulting vector is multiplied by the corresponding topic's word distribution. Let us illustrate this with a very simple example.

Consider a bundle:

bundle = <article 1, article 2>

topic distribution of article 1 = <topic 1: 0.6, topic 2: 0.3, topic 3: 0.1>
topic distribution of article 2 = <topic 1: 0.3, topic 2: 0.5, topic 3: 0.2>

The resulting topic distribution vector for the bundle = <topic 1: 0.9, topic 2: 0.8, topic 3: 0.3>

The word distribution of topic 1 = <word11: 0.7, word12: 0.3>
The word distribution of topic 2 = <word21: 0.5, word22: 0.5>
The word distribution of topic 3 = <word31: 0.2, word32: 0.8>

topic dist * word dist = <word11: 0.63, word12: 0.27, word21: 0.4, word22: 0.4, word31: 0.06, word32: 0.24>

If we select the top 4 words as keywords, we obtain: <word11: 0.63, word21: 0.4, word12: 0.27, word32: 0.24>
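A direct Python transcription of this keyword computation, using the example's numbers, is sketched below. Note that word22 also scores 0.8 * 0.5 = 0.4, so a strict top-4 selection would include it ahead of word32.

```python
# Sketch of the bundle keyword computation above, using the example's numbers:
# sum the member articles' topic vectors, weight each topic's word distribution
# by the summed topic weight, and keep the highest-scoring words.
article_topics = [
    {"topic1": 0.6, "topic2": 0.3, "topic3": 0.1},   # article 1
    {"topic1": 0.3, "topic2": 0.5, "topic3": 0.2},   # article 2
]
topic_words = {
    "topic1": {"word11": 0.7, "word12": 0.3},
    "topic2": {"word21": 0.5, "word22": 0.5},
    "topic3": {"word31": 0.2, "word32": 0.8},
}

bundle_topics = {}                       # summed topic distribution of the bundle
for dist in article_topics:
    for topic, weight in dist.items():
        bundle_topics[topic] = bundle_topics.get(topic, 0.0) + weight

word_scores = {}                         # topic weight * word probability
for topic, weight in bundle_topics.items():
    for word, p in topic_words[topic].items():
        word_scores[word] = word_scores.get(word, 0.0) + weight * p

top4 = sorted(word_scores.items(), key=lambda kv: -kv[1])[:4]
print(top4)  # word11 (0.63), then word21 and word22 (0.40 each), then word12
```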

As an example, the keywords obtained for a bundle of articles about statistics and machine learning are: "training", "classification", "prediction", "learning". The keyword feature score is computed as follows: for each sentence, MEAD assigns a score of 1 if one of the sentence's words matches a keyword, and 0 otherwise.

Figures 2.6 and 2.7 illustrate the impact of the keyword feature on the resulting summaries. Figure 2.6 shows the summary provided by MEAD for a given bundle without using the keyword feature; Figure 2.7 shows the summary when keywords are used. Note that the last sentence of the Figure 2.6 summary, which contains the keyword "classification", is located at a better place in the Figure 2.7 summary (in fourth position).


Figure 2.6: Example of Without Keyword Summary

Length Feature

The last feature, called the length feature, is the number of words in a sentence. It is a cut-off feature, with a threshold length of 9.

$$\mathit{FinalScore} = \begin{cases} \mathit{Position} + \mathit{Centroid} + \mathit{Keyword(QueryPhrase)} & \text{if sentence length} > 9 \\ 0 & \text{otherwise} \end{cases}$$

For instance, in Figure 2.8, the length of the first sentence of document "cs0008028.txt" is 6, so its final score is 0. The fifth sentence of document "cs0008024.txt" has length 26, and its final score is calculated as


Figure 2.7: Example of With Keyword Summary

Position score: 0.447214
Centroid score: 0.172739
Keyword score: 1.0000

0.447214 + 0.172739 + 1.0000 = 1.619953

The result is a table in which the different scores are grouped by article; Figure 2.8 is an example of such a table. Note that the keyword feature has a strong impact on the final score: all sentences in the documents have Position and Centroid scores even if their keyword feature is 0, so a keyword score of 1 produces a rapid rise in a sentence's final score.
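Putting the features together, a minimal sketch of the combined per-sentence score is given below; the sentence and feature values are chosen to mirror the worked example of Figure 2.8.

```python
# Minimal sketch of MEAD's combined per-sentence score as described above:
# position + centroid + keyword, cut off to 0 for sentences of 9 words or fewer.
# Feature values mirror the worked example of Figure 2.8.
import math

def final_score(sentence_words, position, centroid, keywords):
    if len(sentence_words) <= 9:               # length feature: cut-off at 9
        return 0.0
    pos = math.sqrt(1.0 / position)            # position feature
    kw = 1.0 if any(w in keywords for w in sentence_words) else 0.0
    return pos + centroid + kw

sentence = ("we propose a classification method for scientific articles "
            "based on extracted features and training data").split()
print(final_score(sentence, position=5, centroid=0.172739,
                  keywords={"training", "classification"}))
# 0.447214 + 0.172739 + 1.0 = 1.619953
```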

Cosine similarity was used in the Novelty Track of TREC 2002 [39] to compute sentence similarity as a novelty re-ranker. The author noticed that human judges often pick clusters of sentences, whereas MEAD normally does not care about the spatial relationships between sentences within a document.


Figure 2.8: Example of MEAD Score

She added a new characteristic that slightly boosts a sentence's score if the previous sentence had a relatively high score; this calculation continues until every sentence in the set has been seen. The default re-ranker in MEAD uses the cosine similarity between the sentences already selected for the summary and the new sentence under consideration. The similarity between two sentences is measured with the formula given below. We use the novelty re-ranker after the feature calculation step.

$$\mathit{CoSim}(s_1, s_2) = \cos(\theta) = \frac{s_1 \cdot s_2}{|s_1|\,|s_2|}$$

2.2.3 Sentence Extraction

The percentage of extracted sentences is a parameter to be set when using MEAD. We produce summaries containing 20 percent of the number of sentences in a bundle's set of documents. For example, if a bundle consists of 100 sentences, 20 sentences will be extracted as the summary.

While extracting sentences, MEAD preserves:

• the order of documents in the bundle;
• the order of sentences in each document.

Figure 2.9 displays sentences coming from different texts (identified by a number in column 1). The sentences are numbered by their order in the text they come from (column 2), and the final scores of the sentences in the set of texts are given in column 3. Even if the highest-scoring sentence comes from article 1, the first extracted sentence comes from article 3 because of the order of the articles.


Figure 2.9: Example of Sentence Extraction

2.2.4 Summary Rephrasing

Sentences often contain adverbial clauses, which lose their references when extracted out of context. This is the case both in single-document and in multi-document summaries. We have designed a postprocessing step in order to mitigate this problem. Another aim of this step is to give the author names and the publication title of every document in a bundle. To do this, MEAD keeps track of where sentences come from, and we use this information to rephrase sentences and enhance the readability of the summary.

We start from the already extracted sentences. We place the author names and the title before the first sentence of each part of the summary. In addition, we perform some replacements. If a word such as "we" refers to the authors and the paper has one author, we do not know whether the author is male or female, so we replace it with "the author"; if the paper has two or more authors, we replace it with "the authors". In the following step, adverbial clauses are replaced with a proper word that conveys the sentence's importance; this replacement does not make the sentence ungrammatical. The rephrasing rule checks the capitalization of adverbial clauses: if an adverbial clause starts with an upper-case letter and is followed by a comma, we remove it from the sentence or replace it with a proper word that signals the importance of the sentence.

As can be seen in Figures 2.10 and 2.11, we add the title and author information of each article before its extracted sentences. The rephrasing method is also used to clarify the connections between the sentences that appear in a summary. Our experiments show that the proposed approach improves both summary quality and fluency. The list of replaced words is given in the Appendix.


Figure 2.10: Example Text Before Rephrasing


Algorithm 3 Centroid Feature for Sentences
Input: An array S of n sentences, centroid threshold t
Output: An array C of centroid scores for the sentences

  Count = 0
  MaximumScore = 0
  Compute the tf*idf score of each word:
  for i = 1 to n do
    for each word w of S[i] do
      tfidf(w) = tf(w) * idf(w)
    end for
  end for
  Construct the centroid words of the set of documents by taking the words
  that are above the threshold (iterating in decreasing order of tfidf):
  for each word w do
    if tfidf(w) > t or Count < 8 * (number of documents) then
      tfidfCentroid(w) = tfidf(w)
      Count++
    else
      tfidfCentroid(w) = 0
    end if
  end for
  Compute the centroid score of each sentence:
  for i = 1 to n do
    C[i] = 0
    for each word w of S[i] do
      C[i] = C[i] + tfidfCentroid(w)
    end for
  end for
  Find the centroid sentence of the documents:
  for i = 1 to n do
    if C[i] > MaximumScore then
      MaximumScore = C[i]
    end if
  end for
  Normalize the final scores:
  for i = 1 to n do
    C[i] = C[i] / MaximumScore
  end for


Chapter 3

Experiments And Improvements Of The Method

Our goal is to analyse the summarization of scientific publications using MEAD summarization methods and to try to improve the summarization results. To do this, we evaluate automatically generated summaries using MTurk. In order to evaluate the impact of using the keyword feature, we have conducted two categories of experiments: an independent evaluation, in which we check the quality of summaries, and a comparative evaluation, which compares summaries obtained using the keyword feature with those obtained without keywords.

3.1 Experimental Protocol Based On Amazon Mechanical Turk

Even though technology changes every day, human beings can still do some tasks much better than computers, for example identifying objects in a photo or video, or researching data details. MTurk is a crowdsourced marketplace for tasks that require human intelligence. Basically, a person needing work done (the requester) can set up a HIT (Human Intelligence Task), which is a small task. A worker completes this simple task in exchange for a small payment. Each worker can see thousands of independent tasks each day on the MTurk web page [12], which shows how much a task pays and how long each task takes. Once a worker is done, the requester can review the results and accept or reject them; requesters only pay for accepted work. If special skills are required to complete a task, the requester can require that workers pass a qualification test before they are allowed to work on the given HITs. There are different kinds of HITs, such as comparing given documents, translating from one language to another, identifying duplicate entries and verifying item details, or finding specific fields or data in large documents. We have chosen to use MTurk for our experiments because the evaluation of summary quality is much simpler for humans than for machines.


3.2 Experimental Setting

3.2.1 Qualification Test

The biggest challenge in using MTurk is deciding whether a particular worker's answers can be trusted, in other words, checking that the worker's background knowledge is sufficient for a given task. We have designed a qualification test to measure the knowledge of workers before they work on our experiments. As shown in Figure 3.1, our qualification test is knowledge identification using an image.

The reason for choosing an image-based question is that workers cannot find the answer by simply searching the question text on the web; background knowledge is needed to answer the given question. Moreover, there are some restrictions: each user is given one attempt to solve the qualification test, and the test must be completed within five minutes.

Our qualification test for the graph theory topic is given in Figures 3.1 and 3.2: we ask users to identify a well-known Hamiltonian path in an undirected graph from a picture. We prepared a different qualification test for each topic selected for the experiments; the other qualification tests are illustrated in the Appendix.


Figure 3.2: Example of Qualification Test Image

3.2.2 Amazon Mechanical Turk Setting

We use an MTurk survey questionnaire for workers to answer the given questions. For each topic we use 15 workers and pay $0.30 per task. We used only the computer science related articles, from which we selected three different topics: Graph Theory, Statistics and Machine Learning, and Information Retrieval.

To check the quality of summaries, we use the independent evaluation. We give the workers summaries and links to the complete articles and ask "Does the given summary make sense with respect to the articles?". The question is shown in Figure 3.3. We expect "yes" if the summary reflects the content well, or "no" if it does not reflect the content of the articles. Moreover, we expect some feedback in order to accept the given answers.

Figure 3.3: Independent evaluation question

In the comparative evaluation, we give the workers two summaries and links to the full articles. One summary is obtained from our tool when MEAD is used without the keyword feature; the other is obtained using the keyword feature. We ask the workers "Which summary reflects better the content of the given articles?".


We use Summary 1 and Summary 2 buttons instead of Yes and No buttons for these questions. Workers select one summary depending on their preference and give some feedback. The interface for the comparative evaluation is illustrated in Figure 3.4.

Figure 3.4: Comparative evaluation question

3.3 Experimental Results

We obtain our experimental results as the average of the results for the three different topics. They are summarized in Figures 3.5 and 3.6.

The independent evaluation results are:

Statistics: 53.3%
Machine Learning: 60%
Information Retrieval: 80%
Overall ratio: 64.4%

The comparative evaluation results are:

Statistics: 53.3%
Machine Learning: 60%
Information Retrieval: 53.3%
Overall ratio: 55.5%


The independent evaluation shows that the summaries are not good. We get similar results in the comparative evaluation: with keywords and without keywords, the quality of summaries is not significantly different.

Workers gave some feedback while answering the questions; we use this information to improve the quality of the summaries. Some comments from workers are:

• The summary has a lot more information as compared to the respective articles. The summary has many differences with the articles.

• The summary is not accurate and is not good.

• It was a good decent job. Summary is not good enough.

• as both of them are somehow same so it is not possible to say which one is better so answer of this is no.

Figure 3.5: Independent Evaluation Result

3.4 Proposed Improvements

Taking advantage of the comments, we believe that we can get better results. In particular, we see that the articles are long and the summaries do not reflect their content. For the independent evaluation, we therefore summarize only the title and abstract, which improves the independent evaluation result.


Figure 3.6: Comparative Evaluation Result

The question becomes: "We gave a set of articles and a piece of text. Evaluate whether the given text summarizes well the scientific content of the given set of articles: does the summary reflect the scientific content of the articles well?". We expect a yes or no answer with comments.

For the comparative evaluation improvement, we use 15 keywords instead of 10, and again only the title and abstract parts are given as text. We also get better results for the comparative evaluation. The results are shown in Figures 3.7 and 3.8; other examples of both independent and comparative evaluations are given in the Appendix.

The improved independent evaluation results are:

Statistics: 80%
Machine Learning: 80%
Information Retrieval: 86.6%
Overall ratio: 82.2%

The improved comparative evaluation results are:

Statistics: 66.6%
Machine Learning: 86.6%
Information Retrieval: 80%
Overall ratio: 77.7%

The improved results show that using only the title and abstract gives much better results than the first experiments. As we mentioned, the compression ratio, i.e. the size of the summary with respect to the size of the document set, is an important point for multi-document summarization.


According to [40], 20% or 30% of the source provides a reasonable input set for a 10 to 20 sentence summary of news: concepts in the sentences are not taken completely out of context, and the extracted sentences are still thematically connected. In scientific articles, on the other hand, the compression rates have to be much higher; for instance, abbreviating a 10-page scientific article to a half-page summary requires compression to 5% of the original. At this level, sentence selection causes a qualitative difference because it is context insensitive: if only one sentence per page is selected, most of the context around the extracted sentences is lost. Furthermore, the abstract is highly beneficial in several information acquisition tasks. As mentioned in [10][41], abstracts have several advantages: they promote current awareness, save reading time, facilitate selection, and improve indexing efficiency. In addition, increasing the given set of keywords also improves the result of the comparative evaluation. As mentioned in the related work section, Luhn derived a significance factor reflecting the number of occurrences of significant words within a sentence.


Chapter 4

Pattern Summarization

The Nokia dataset consists of data from the smartphones of 38 participants over the course of more than one year. For each user, the records of different sensors, such as application usage and GPS, are available. The dataset also includes answers to a questionnaire with 17 questions answered by some users in the experiment; demographic attributes like gender, age group, and profession come from this questionnaire. Application usage records consist of an application id and a time-stamp of when it was used. After removing some system applications, 170 applications remained.

Figure 4.1: Taxonomy for Application Attributes

There are different kinds of attributes in the dataset, half of which are dependent and the other half independent. A dependent attribute does not convey any meaning alone and should be attached to an independent attribute.


Figure 4.2: Taxonomy for Demographic Information Attributes

The time attribute is a dependent usage attribute. For instance, Morning alone does not have any meaning; on the other hand, Web-Morning means that the user has used the application Web in the morning. The two independent attribute classes in Nokia are application usage and demographic information. For each, we created a taxonomy: Figures 4.1 and 4.2 show the taxonomies for application usage and demographic information. For instance, in Figure 4.1, common applications like Calendar are children of desktop applications; in Figure 4.2, working full-time is a child of working, and working itself is a child of social activity.

4.1 Our Proposed Method For Generating Sentences

The inputs are patterns that have been automatically discovered by a pattern mining algorithm. A pattern is a set of attributes that are frequently encountered together in the dataset. These attributes can be categorized into different classes, as mentioned above.

• The first class groups attribute that are related to user information, such as "Studying full time" or "Is Male".

• The second class groups attribute that are related to the application used such as "Messengers", "Web" or "Contacts".

• The third class groups attribute that are related to period of time the application used such as "Morning" , "Afternoon" or "Night".


Before starting sentence generation, let us look at the following example patterns and the sentences generated for them. The first example pattern consists of user information, the applications used, and the periods of time in which they were used. In contrast, the second example pattern does not contain user information.

Example Pattern 1: Studying full time, Is Male, Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

Example Sentence 1: Males who study full time use Contacts and Bluetooth at any time, Carousel in the morning and afternoon, Web at noon and night, and never use Messengers in the morning.

Example Pattern 2: ActiveSearch_Weekday, ActiveSearch_Weekend, Bluetooth_Weekend, Bluetooth_Weekday, Contacts_Morning, Web_Weekday, Web_Noon, Contacts_Night, Web_Morning, Web_Weekend, Web_Afternoon

Example Sentence 2: People use ActiveSearch and Bluetooth at any time, Contacts in the morning and at night, and never use Web at night.

Our sentence generation process can be summarised as follows:

• The first step of the process splits the user information and the application-with-time information into two separate lists.

• The second step combines the time information for each application and replaces it with more meaningful time information, which shortens the sentence and makes it more expressive.

• The third step combines applications that share the same time information and puts the times of each application in chronological order.

• The fourth step adds prepositions.

• The last step adds punctuation.

We explain sentence generation for Example Pattern 1 step by step as follows:

Input: pattern

Example pattern: Studying full time, Is Male, Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

Output: meaningful sentence

Example sentence: Males who study full time use Contacts and Bluetooth at any time, Carousel in the morning and afternoon, Web at noon and night, and never use Messengers in the morning.

Figure 4.3: Step 1 for Sentence Generation

Step 1: Split the user information and the application-with-time information into separate lists. In our example,

User info list = "Studying full time", "Is Male"

Time and Application list = Carousel_Morning, Contacts_Night, Carousel_Afternoon, Web_Night, Bluetooth_Weekend, Messengers_Weekend, Web_Noon, Contacts_Morning, Messengers_Noon, Messengers_Night, Messengers_Weekday, Contacts_Afternoon, Contacts_Noon, Contacts_Weekend, Contacts_Weekday, Bluetooth_Weekday, Messengers_Afternoon

Figure 4.3 shows the first step of sentence generation. According to the rule, we obtain "Studying full time" and "Is Male" as user information; therefore, the sentence starts with "Males who study full time".
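The following minimal Python sketch illustrates this step, assuming that attributes are plain strings and that application-time attributes are encoded with an underscore, as in the examples above; the variable names are ours and do not come from any thesis code.

# Step 1 (sketch): attributes containing "_" carry an application and a time;
# all remaining attributes are user information.
pattern = [
    "Studying full time", "Is Male",
    "Carousel_Morning", "Contacts_Night", "Carousel_Afternoon",
    "Web_Night", "Bluetooth_Weekend", "Messengers_Weekend",
    "Web_Noon", "Contacts_Morning", "Messengers_Noon", "Messengers_Night",
    "Messengers_Weekday", "Contacts_Afternoon", "Contacts_Noon",
    "Contacts_Weekend", "Contacts_Weekday", "Bluetooth_Weekday",
    "Messengers_Afternoon",
]

user_info = [a for a in pattern if "_" not in a]
time_and_app = [a for a in pattern if "_" in a]

print(user_info)     # ['Studying full time', 'Is Male']
print(time_and_app)  # ['Carousel_Morning', 'Contacts_Night', ...]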


Step 2: Combine the time information for each application and replace it with more meaningful time information. We create a dictionary with an entry for each application. According to the example,

Application list = Carousel, Messengers, Bluetooth, Web, Contacts

For Carousel, time list = Morning, Afternoon

For Messengers, time list = Afternoon, Noon, Night, Weekday, Weekend

For Bluetooth, time list = Weekday, Weekend

For Web, time list = Noon, Night

For Contacts, time list = Morning, Afternoon, Noon, Night, Weekday, Weekend (all 6 different application times)

In dictionary = (Carousel: [Morning, Afternoon], Messengers: [Afternoon, Noon, Night, Weekday, Weekend], Bluetooth: [Weekday, Weekend], Web: [Noon, Night], Contacts: [Morning, Afternoon, Noon, Night, Weekday, Weekend])

The application Contacts carries all six kinds of time information; therefore, instead of writing them all out, we replace them with "at any time". The application Messengers carries five of the six kinds, so exactly one is missing, namely "Morning"; instead of writing the five out, we replace them with "not Morning" (!Morning). For the Bluetooth application, there are two kinds of time information, Weekday and Weekend. This means the user does not use Bluetooth at any particular time, because weekday and weekend together cover the whole week; therefore, we replace this time information with "at any time" as well. For the Web application, there is no special case that changes the time information.

As you can see in Figure 4.4, we put all time information in its new form into the meaningful time list.

meaningful time list = (Carousel: [Morning, Afternoon], Messengers: [!Morning], Bluetooth: [at any time], Web: [Noon, Night], Contacts: [at any time])
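The replacement rules of this step can be sketched in Python as follows, assuming the time_and_app list from the Step 1 sketch and the six possible time values; this encoding is our reading of the description above, not code from the system.

from collections import defaultdict

ALL_TIMES = ["Morning", "Noon", "Afternoon", "Night", "Weekday", "Weekend"]

def simplify_times(time_and_app):
    # Collect the set of time values per application.
    times = defaultdict(set)
    for attr in time_and_app:
        app, time = attr.split("_")
        times[app].add(time)
    # Rewrite each time set into a shorter, more meaningful form.
    meaningful = {}
    for app, ts in times.items():
        if ts == set(ALL_TIMES) or ts == {"Weekday", "Weekend"}:
            meaningful[app] = ["at any time"]  # covers the whole week
        elif len(ts) == len(ALL_TIMES) - 1:
            missing = (set(ALL_TIMES) - ts).pop()
            meaningful[app] = ["!" + missing]  # all times except one
        else:
            meaningful[app] = sorted(ts, key=ALL_TIMES.index)
    return meaningful

# simplify_times(time_and_app) ->
# {'Carousel': ['Morning', 'Afternoon'], 'Contacts': ['at any time'],
#  'Messengers': ['!Morning'], 'Bluetooth': ['at any time'], 'Web': ['Noon', 'Night']}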

Step 3: Combine applications that share the same time information and order the times for each application.

As you can see in Figure 4.5, we combine applications with the same time information and list them. In our example,

new meaningful list = (at any time: [Contacts, Bluetooth], (Morning, Afternoon): Carousel, (Noon, Night): Web, !Morning: Messengers)
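Grouping applications with identical time descriptions amounts to inverting the dictionary built in Step 2, using the time description itself as the key; the sketch below shows one possible way to do this.

from collections import defaultdict

def group_by_time(meaningful):
    # Applications that share exactly the same (ordered) time description
    # end up under one key; tuples are used because dictionary keys must be hashable.
    grouped = defaultdict(list)
    for app, times in meaningful.items():
        grouped[tuple(times)].append(app)
    return dict(grouped)

# group_by_time(meaningful) ->
# {('Morning', 'Afternoon'): ['Carousel'], ('at any time',): ['Contacts', 'Bluetooth'],
#  ('!Morning',): ['Messengers'], ('Noon', 'Night'): ['Web']}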


Step 4: Add prepositions

As you can see in Figure 4.6, we add prepositions.

new meaningful list = (at any time: [Contacts, Bluetooth], (in the morning, in the afternoon): Carousel, (at noon, at night): Web, !Morning: Messengers)

There are two time expressions each for the Carousel and Web applications, and within each pair both expressions start with the same kind of preposition; therefore, we add "and" between them during sentence generation. For the "at any time" information, there are two different applications, so we also add "and" between them.

sentence = Males who study full time use Contacts and Bluetooth at any time Carousel in the morning and in the afternoon Web at noon and at night and never use Messengers in the morning
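One possible realization of the preposition step is sketched below, assuming the grouped dictionary from Step 3 and a fixed preposition lookup; the lookup table only paraphrases the examples in the text and is not exhaustive.

PREP = {
    "Morning": "in the morning", "Afternoon": "in the afternoon",
    "Noon": "at noon", "Night": "at night",
    "at any time": "at any time",
}

def clause(times, apps):
    # Render one (time description, applications) group as a clause.
    apps_text = " and ".join(apps)
    if times[0].startswith("!"):  # negated form, e.g. "!Morning"
        return "never use %s %s" % (apps_text, PREP[times[0][1:]])
    when = " and ".join(PREP[t] for t in times)  # "and" between the time expressions
    return "%s %s" % (apps_text, when)

# clause(("Morning", "Afternoon"), ["Carousel"]) -> 'Carousel in the morning and in the afternoon'
# clause(("!Morning",), ["Messengers"])          -> 'never use Messengers in the morning'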

Step 5: Add punctuation

As you can see in Figure 4.7, we add punctuation after the time information, before each new application is added to the sentence.

sentence = Males who study full time use Contacts and Bluetooth at any time, Carousel in the morning and in the afternoon, Web at noon and at night, and never use Messengers in the morning.
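The final assembly can be sketched as follows, assuming the clauses produced in Step 4: the positive clauses share a single leading "use", the negated clause keeps its own verb, and commas separate the groups, with "and" before the last one.

def assemble(subject, positive_clauses, negative_clauses):
    # Join all clauses with commas and put "and" before the last one.
    clauses = positive_clauses + negative_clauses
    body = ", ".join(clauses[:-1]) + ", and " + clauses[-1]
    return "%s use %s." % (subject, body)

print(assemble(
    "Males who study full time",
    ["Contacts and Bluetooth at any time",
     "Carousel in the morning and in the afternoon",
     "Web at noon and at night"],
    ["never use Messengers in the morning"],
))
# -> Males who study full time use Contacts and Bluetooth at any time, Carousel in the
#    morning and in the afternoon, Web at noon and at night, and never use Messengers
#    in the morning.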

The pattern summarization part helps to get an idea of an abstractive summarization method that may reuse phrases or clauses from a set of related documents in a meaningful way. We have tried to make sure that the generated sentences meet grammatical expectations. Unfortunately, we could not find a way to evaluate the claim that our generation is proper; this is left as future work.

Figure 4.4: Step 2 for Sentence Generation

Figure 4.5: Step 3 for Sentence Generation

Figure 4.6: Step 4 for Sentence Generation

Figure 4.7: Step 5 for Sentence Generation

Chapter 5

Conclusion and Future Work

In this chapter, we conclude and present future work.

5.1 Conclusion

The majority of summarization systems have relied on sentence extraction since the 1960s. Multi-document summarization was introduced as a problem in the 1990s. Nowadays, multi-document summarization is a landmark in the progression of summarization research and is taking the place of single-document summarization. There is still a long way to go in this field.

Over time, both abstractive and extractive approaches have been attempted. Abstractive summarization relies heavily on the adaptation of internal tools and machinery for language generation; such summaries are difficult to replicate and to extend to new domains. On the other hand, simple extraction of sentences has produced satisfactory results in multi-document summarization. The recent popularity of effective multi-document summarization systems confirms this claim.

In this report, we have presented our multi-document summarization system, which is designed to produce summaries for bundles of scientific articles. The well-known summarization tool MEAD is integrated into our system. The report emphasizes extractive approaches to summarization using statistical methods. Since a lot of interesting research is being done in this field, we have chosen to include a brief summary of some methods that we found relevant to future research, even if they focus only on small details of a general summarization process.

Our experiments based on Mechanical Turk give promising results. The results show that the abstract and title reflect the general content of a text and emphasize that a short document gives a better result than a long text. Keyword-based summarization has a strong impact on obtaining good-quality summaries.


In the second part of the report, we have explained sentence generation for patterns extracted from data by automatic pattern mining techniques. We exploit the categories of attributes to guide our generation process.

5.2 Future Work

The future aims of this study are the following:

• We have to work on how we can improve the quality of summaries for full articles.

• We have to study the effect of other summarization techniques that we could integrate into our system to improve summaries, such as sentence planning. We may determine which sentences reflect the authors' own work, the aim of the document, or related work. The global context of the paper determines the rhetorical status of a sentence [40][42]. We may combine this approach with the selected features and give some weight to the sentence status.


Nouns or Adverbs or Prepositions → Replace word

First, Firstly, Foremost, First of all, First off
Second, Secondly
Third, Thirdly
We → Authors
Also, Furthermore, Besides, Likewise
As a result of, Thanks to, For the reason that
Case history → one important point
Case in point → one important point
For instance → one important point
Kind of thing → one important point
After all, For all that
All the same, Anyhow, But, Despite, Howbeit
In spite of, Nonetheless, Notwithstanding, On the other hand
Per contra
Though, Without regard to
Both

Table .1: Summary Rephrased Word List


Figure .2: Independent Evaluation example 1


Figure .4: Comparative Evaluation example 1


Bibliography

[1] Noemie Elhadad. User-sensitive text summarization. In AAAI, pages 987–988, 2004.

[2] Mehran Sahami and Timothy D. Heilman. A web-based kernel function for measuring the similarity of short text snippets. In WWW, pages 377–386, 2006.

[3] Günes Erkan and Dragomir R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. CoRR, abs/1109.2128, 2011.

[4] David M. Zajic, Bonnie J. Dorr, and Jimmy J. Lin. Single-document and multi-document summarization techniques for email threads using sentence compression. Inf. Process. Manage., 44(4):1600–1610, 2008.

[5] H.P. Luhn. The automatic creation of literature abstracts. IBM Journal, 2:159–165, 1958.

[6] Dipanjan Das and André F. T. Martins. A survey on automatic text summarization, 2007.

[7] Jade Goldstein, Vibhu O. Mittal, Jaime G. Carbonell, and James P. Callan. Creating and evaluating multi-document sentence extract summaries. In CIKM, pages 165–172, 2000.

[8] Breck Baldwin and Thomas S. Morton. Dynamic coreference-based summarization. In Proceedings of the Third Conference on Empirical Methods in Natural Language Processing (EMNLP-3), 1998.

[9] arXiv. http://www.arxiv.org, 2012.

[10] Ani Nenkova, Sameer Maskey, and Yang Liu. Automatic summarization. In ACL (Tutorial Abstracts), page 3, 2011.

[11] Sihem Amer-Yahia, Ruth Garcia, Aybuke Ozturk, and Shameem Ahamed Puthiya Parambath. Crowd sourcing literature review in sunflower. Technical report, 2012.
