DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

A difference analysis method for detecting differences between similar documents

ANDREAS SERRA


Master’s Thesis at CSC
Supervisor: Olov Engwall

Examiner: Viggo Kann


Abstract

Similarity analysis of documents is a well studied field. With a focus instead on the opposite concept, how can we define and distinguish the differences within documents? This project tries to determine whether differences within documents can be detected as well as quantified based on their semantic qualities. We propose a method for quantifying differences by applying tf-idf based models with analysis methods for lemmatization and synonym extraction, together with utility ranking algorithms. The method is implemented and tested. The results show that the method has potential, but that further studies are required in order to fully evaluate to what extent it could be of practical use.

Such a method could, however, bring significant benefits within several different fields, in which automatic difference detection could replace error-prone manual labor in document management, as well as serve other beneficial purposes such as providing automatically generated difference summaries.


Referat

A difference analysis method for detecting differences between similar documents

Similarity analysis between documents is a well-explored area. With the focus instead on the opposite, how can we try to define and distinguish the differences between documents? This project attempts to investigate whether differences between documents can be detected and quantified based on their semantic qualities. We propose a method for quantifying differences by applying tf-idf based models together with analysis methods for lemmatization and synonym extraction, in combination with utility ranking algorithms. The method is implemented and tested. The results show that the method has potential, but that further studies are required in order to fully determine to what degree it could be of practical use. Such a method could, however, offer significant benefits to several different disciplines, where automatic difference detection could replace error-prone manual labor in document management, and also serve other beneficial purposes such as providing automatically generated difference summaries.


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Delimitations
  1.3 Thesis Outline
  1.4 Contributions

2 Background
  2.1 Document Similarity
    2.1.1 Cosine-Similarity
    2.1.2 Centroid-Based Clustering
    2.1.3 Naive Bayes
    2.1.4 Support Vector Machine
  2.2 Weighting Schemes
    2.2.1 TF-IDF
    2.2.2 Centroid
    2.2.3 Latent Semantic Analysis
  2.3 Semantic Analysis
    2.3.1 Stemming and Lemmatization
    2.3.2 Part-of-Speech Tagging
    2.3.3 Synonym Extraction
  2.4 Equivalence Definition
    2.4.1 Cross-Sentence Informational Subsumption
    2.4.2 Equivalence Classes of Sentences
  2.5 Evaluation Metrics
    2.5.1 Precision
    2.5.2 Recall
    2.5.3 F-Score
    2.5.4 Relative Utility
  2.6 Conclusions on Choice of Methods

3 Methodology
  3.1 Architecture
  3.2 Data Set Extraction
  3.3 Preprocessing
  3.4 Semantic Analysis
    3.4.1 Part-of-Speech Tagging
    3.4.2 Lemmatization
    3.4.3 Synonym Extraction
  3.5 Models
    3.5.1 Document
    3.5.2 Sentence
    3.5.3 Tuple
  3.6 Difference Analysis
    3.6.1 Difference Detection
    3.6.2 Utility Ranking
  3.7 Evaluation Strategy
  3.8 Scoring System
    3.8.1 Scoring Example

4 Experiments

5 Results

6 Discussion and Conclusions
  6.1 Discussion
    6.1.1 F-Score
    6.1.2 Average F-Score
    6.1.3 Part-of-speech Tagging
    6.1.4 Lemmatization
    6.1.5 Synonym Extraction
    6.1.6 Utility Ranking
    6.1.7 Threshold
    6.1.8 Research Answer
  6.2 Conclusion
  6.3 Ethics
  6.4 Future Work
    6.4.1 Weighting Schemes
    6.4.2 Document Similarity
    6.4.3 Synonym Extraction
    6.4.4 Models
    6.4.5 Evaluation
    6.4.6 Utility Ranking

Bibliography

Appendices


Chapter 1

Introduction

Similarity analysis between documents is a popular research topic within the information retrieval field. Finding out whether two documents are related or not is a crucial component of e.g. recommender systems, and much research has therefore gone into developing said technology.

A less well researched topic within the field is that of finding what among documents is not similar, what the differences are. The reason for the lack of research could be that there have been fewer common use-cases where that information has been of interest. The topic can, however, be of high value, for example within text summarization of documents, in which the differences between documents can sometimes be even more important than the similarities. Finding differences also has the potential to gain larger popularity with the progress within the artificial intelligence field, since fully autonomous robots will have to be able to draw conclusions based on the difference between their perceived reality and that which someone or something else describes it to be.

Possible applications for difference analysis besides the aforementioned robotics and artificial intelligence field could for example be found within news aggregation systems, in which the key differences between the various sources are highlighted, exposing possibly crucial differences in points of view between authors. Text summarization systems could also be improved by collecting salient information bits based on difference analysis, which could benefit a wide variety of document topics, from jurisdictional documents to political, scientific and others.

Within businesses, specification and requirements documents could automatically be scanned for semantic differences between revisions for a more streamlined error checking process, saving time and man-hours otherwise spent on manual checking, which undeniably carries a heightened risk of human error.

1.1 Problem Statement

The objective of this degree project is to implement and evaluate an algorithm for finding differences between similar documents and ranking said differences based on their value to the topic at hand. The detected differences within a document will be the sentences that are significantly different, in terms of contained semantic value, from those found within another document of related contents.

The primary purpose of this degree project is to seek out the answer to the following question:

Can an algorithm for finding differences between similar documents provide results comparable to human results, and can the results also be quantified in order to determine the most important differences?

1.2 Delimitations

The thesis will not cover specifics about how text summarization works or how similarity detecting algorithms work when that is not relevant to the thesis topic.

News articles are gathered from a limited amount of sources in English exclusively.

A news article is defined to be a document of news contents containing a title and one or more paragraphs of plain text; all other data, such as images or author information, is to be ignored. Sub headers, if detected, are treated as plain text.

1.3 Thesis Outline

Chapter 2 presents relevant background theory and related work. Chapter 3 covers the methods used in the thesis. The experiments are formulated in Chapter 4. The results are presented in Chapter 5 and finally Chapter 6 concludes the thesis with discussions, conclusions and ideas for future work.

1.4 Contributions

This report establishes a scoring system for measuring semantic differences between documents. It evaluates methods that detect as well as score differences at a sentence level and ultimately provides a pioneering step in researching an information retrieval topic that few have explored before.


Chapter 2

Background

This chapter presents relevant theory regarding similarity and difference analysis, general methods used within the field and definitions to clarify the subject matter.

2.1 Document Similarity

Document similarity methods are used for either getting a quantified score of the closeness between documents or classifying documents into clusters of similar contents.

2.1.1 Cosine-Similarity

Cosine-similarity is a measure of similarity that measures the cosine of the angle between two vectors. For vectors A and B, with A_i and B_i being the components of A and B respectively, it is calculated as

similarity = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}    (2.1)

Within information retrieval, cosine-similarity can be used to measure the similarity between vector space models of e.g. documents [1].
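As an illustration, equation 2.1 can be computed directly from two term-weight vectors. A minimal sketch in Python (the function name and the plain-list representation are our own):

import math

def cosine_similarity(a, b):
    # Dot product between the two term-weight vectors.
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms ||A|| and ||B||.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0, 2], [1, 1, 2]))  # ~0.913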

2.1.2 Centroid-Based Clustering

Centroid-based clustering is a method for classifying data into clusters represented by central vectors. The basic idea is that data is put into the category (cluster) whose vector resembles the data the most, e.g. by having the smallest distance to it. The cluster vector is then updated to reflect the differences introduced by the new data, and so the process continues.

With D being the TF-IDF vector of the document to classify, C a potential cluster, and D_k and C_k the term frequencies of term k in their respective vectors, the similarity is calculated as

sim(D, C) = \frac{\sum_k D_k \cdot C_k \cdot idf(k)}{\sqrt{\sum_k D_k^2} \, \sqrt{\sum_k C_k^2}}    (2.2)

If sim(D, C) is above a certain threshold, then document D is added to C in accordance with the previously described procedure.

Radev et al. [2] have used this approach in combination with TF-IDF, a weighting scheme, to classify documents into clusters in order to determine document topics as well as to find salient sentences within them. With this approach, they claim that the produced text summarizations were of similar quality to ones produced by humans.
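A sketch of the assignment step described above, assuming documents and clusters are represented as dicts of term frequencies (the helper names and the update rule are illustrative, not Radev et al.'s implementation):

def assign_to_cluster(doc, clusters, idf, threshold):
    # doc: {term: tf}; clusters: list of {term: tf}; idf: {term: idf value}
    def sim(d, c):
        num = sum(d[k] * c[k] * idf.get(k, 0.0) for k in set(d) & set(c))
        den = (sum(v * v for v in d.values()) ** 0.5) * \
              (sum(v * v for v in c.values()) ** 0.5)
        return num / den if den else 0.0
    scores = [sim(doc, c) for c in clusters]
    if scores and max(scores) >= threshold:
        best = clusters[scores.index(max(scores))]
        for term, tf in doc.items():      # fold the document into the cluster vector
            best[term] = best.get(term, 0) + tf
    else:
        clusters.append(dict(doc))        # no cluster is close enough: start a new one
    return clusters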

2.1.3 Naive Bayes

Naive Bayes is a set of supervised machine learning classifiers that uses a naive approach, hence its name, in order to classify documents. The basic idea is that every variable/feature of interest occurs independently of each other. As such the goal of the classifier, to find the most probable class, can be formulated as finding the most likely class c ∈ C given a set of independent features x ∈ X by means of conditional probability [3].

\arg\max_c \; p(c \mid x_1, \ldots, x_n)    (2.3)

Kupiec et al. have previously investigated how one can use naive Bayes classifiers to select the most salient sentences from within a text for text summarization purposes. Evaluation of their solution suggests that the algorithm has an accuracy of up to 84% when compared to summaries manually created by humans [4].

Kraaij et al. have also previously used naive Bayes classifiers to select salient sentences from texts, with overall promising results. The system was, however, not deemed to work optimally for shorter documents, because full sentences did not summarize the contents in an adequate manner [5].
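For reference, a hedged sketch of the classification in equation 2.3 using scikit-learn's multinomial naive Bayes (a stand-in, not the classifiers used in [4] or [5]); the toy documents and labels are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the match was won late", "stocks fell sharply today", "the team scored twice"]
labels = ["sports", "finance", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # bag-of-words features x_1, ..., x_n
clf = MultinomialNB().fit(X, labels)  # estimates p(x_i | c) under the independence assumption
print(clf.predict(vec.transform(["the team won the match"])))  # ['sports']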

2.1.4 Support Vector Machine

A support vector machine is a binary classifier that, with the help of so-called support vectors (the data points closest to the decision boundary), maximizes the margin around the separating hyperplane in multi-dimensional space [3].


Figure 2.1. Example of support vector machine solution. The points on the dashed lines are the support vectors which are used to calculate the maximum margin for the decision boundary, represented by the solid line in the middle [3]. (Original figure from https://en.wikipedia.org/wiki/Support_vector_machine)

The support vector machine problem can be solved by taking the equation for hyperplanes and optimizing it based on the normal vector.

W \cdot X - b = 0    (2.4)

The hyperplane equation, where W is the normal vector to the hyperplane, X is a set of points and b is a constant.

The data points can be written as (X_i, y_i), where X_i is the position vector of the point and y_i is whether the point belongs to the class or not: 1 if true, -1 if false.

If the training data is linearly separable, then the two parallel hyperplanes delimiting the margin can be described by the equations

W \cdot X - b = 1 and W \cdot X - b = -1.

Since all data points must be classified correctly, the following two constraints are enforced:

W \cdot X_i - b \geq 1, if y_i = 1 and W \cdot X_i - b \leq -1, if y_i = -1.

These can be rewritten as y_i (W \cdot X_i - b) \geq 1, \forall i \in [1..n].

The goal of the SVM thus becomes the optimization problem of finding the smallest \|W\| that fulfills this constraint.

In cases of linearly inseparable problems one can use a so-called kernel trick in order to solve the problem, which means that each dot product of the original algorithm is replaced by a nonlinear kernel function instead [6].
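As a brief illustration, scikit-learn's SVC solves this optimization problem and exposes the kernel trick through its kernel parameter (a generic sketch with invented toy data, not the setup of Hirao et al. [7]):

from sklearn.svm import SVC

# XOR-like data: not linearly separable in the input space.
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [-1, -1, 1, 1]

clf = SVC(kernel="rbf")           # each dot product is replaced by an RBF kernel
clf.fit(X, y)
print(clf.support_vectors_)       # the points that define the margin
print(clf.predict([[0.9, 0.1]]))  # expected: [1]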

Hirao et al. have previously implemented an SVM-based method for salient sentence extraction that, according to their research, was able to outperform other similar methods for this specific task [7].

2.2 Weighting Schemes

Seldom is all data within a set of equal value. To reflect this, a weighting scheme is a system for distributing importance or weight among data such that it is possible to approximate the value of each data component within the set.

2.2.1 TF-IDF

In determining the importance of a word within a certain document, one heuristic is to assume that the frequency of a word correlates with its importance to the document topic. This is called term frequency or TF for short.

tf(t, d) = f_{t,d}    (2.5)

One problem with term frequency, though, is that it leads to universally frequent words such as “it”, “of” and “is” gaining an unwarranted degree of weight, since little to no semantic meaning is contained within these words on their own. IDF, Inverse Document Frequency, is a common approach to combat this problem. It is calculated by counting how many documents within a set of documents contain the word, thus producing a measurement of how common or rare the word is.

idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}    (2.6)

TF-IDF is the combination of these two measurements.

tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)    (2.7)

TF-IDF vectors are vectors of words containing TF-IDF measurements for each word [8].
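Equations 2.5-2.7 translate directly into code. A minimal sketch where a document is a list of tokens and the corpus a list of documents (the names are our own):

import math

def tf(term, doc):
    return doc.count(term)                       # raw frequency f_{t,d}

def idf(term, corpus):
    df = sum(1 for doc in corpus if term in doc)  # documents containing the term
    return math.log(len(corpus) / df) if df else 0.0

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)      # equation 2.7

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
print(tfidf("dog", corpus[0], corpus))  # 1 * log(3/2) ~ 0.405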

2.2.2 Centroid

A centroid weighting scheme is a scheme based on TF-IDF that ignores certain words, stop words, which are filtered out from the processing of the document. Stop words usually refer to the most common words of a language, but in a centroid, whether a word is part of the vector space model or not is instead based on its raw frequency multiplied by its idf, i.e. the TF-IDF score [2]. The word is included in the model if the score is above a certain threshold value. Basically, C = [t \mid score(t) > \delta], where C is the centroid, t \in T, the TF-IDF vector, and \delta is the threshold value.


2.2.3 Latent Semantic Analysis

Latent semantic analysis, LSA, based weighting schemes can be used in a similar way to other vector space models, e.g. with cosine similarity measurements, but through their innate characteristics they also hold some benefits over other, simpler models.

An LSA algorithm usually starts out with creating a TF-IDF or similar base model. Through the use of singular value decomposition, SVD, it then finds a low-rank approximation of the model, in which words of related meaning get classified as being similar [9, 10]. Formally this becomes the task of finding the values that satisfy the equation X = U \Sigma V^T, where X is the base model, U and V are orthogonal matrices and \Sigma is a diagonal matrix.

The advantage of LSA, besides being much more adept at handling larger documents, is that it is also able to merge words of similar meaning without the use of external resources such as synonym dictionaries.
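A hedged sketch of this pipeline using scikit-learn, where TruncatedSVD computes the low-rank SVD approximation (the toy corpus is invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats chase mice", "kittens chase rodents",
        "stocks fell sharply", "markets fell today"]
X = TfidfVectorizer().fit_transform(docs)  # the TF-IDF base model X
lsa = TruncatedSVD(n_components=2)         # low-rank approximation via SVD
Z = lsa.fit_transform(X)                   # documents in the 2-dimensional latent space
print(Z)  # documents of related meaning end up close to each other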

Zhang et al. have done a comparative study in which they evaluated three different methods for text classification. The tested methods were a simple TF-IDF model, an LSA model and finally a multi-word system based on word class analysis. Their results suggest that the LSA approach outperformed the others concerning semantic understanding as well as classification proficiency [11].

2.3 Semantic Analysis

The purpose of semantic analysis is to treat a sentence not only as a container of words, but rather as a system from which one can gather information concerning the concepts, ideas and meanings involved within it.

2.3.1 Stemming and Lemmatization

Stemming is the process of reducing inflected words to their word stem, base or root form. This is useful for when the word form is unnecessary or of less value than the base form of the word.

Lemmatization has a similar purpose to stemming but also takes the word classes of words into consideration and, instead of providing the stem of a word, tries to provide a so-called lemma, which is the dictionary form of a set of words. This ensures that the resulting word is valid and thus can e.g. be looked up in a dictionary, which in some cases is not possible for stemming.

Word      Stem     Lemma
walking   walk     walk
made      made     make
gone      gone     go
went      went     go
mice      mice     mouse
ponies    poni     pony
fairies   fairi    fairy

WordNet [12] is a lexical dictionary that, among other purposes, can be used for the lemmatization of English words. It is highly popular and can be considered a standard resource within the field.
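The table above can be reproduced with NLTK's Porter stemmer and WordNet lemmatizer (a sketch; it assumes the NLTK WordNet data has been downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("ponies"))                   # 'poni'  -- not a valid word
print(lemmatizer.lemmatize("ponies", pos="n"))  # 'pony'  -- the dictionary form
print(lemmatizer.lemmatize("went", pos="v"))    # 'go'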

2.3.2 Part-of-Speech Tagging

To gather information on what word classes words belong to, one can use part-of-speech tagging. A POS tagger is a program that can automatically identify the word classes of words, which can then be used to enhance the performance of a system by providing context-dependent information.

Word      POS
The       determiner
dog       noun
likes     verb
pink      adjective
flowers   noun

The Stanford Tagger [13] is an example of an English POS tagger that has a documented high accuracy when determining word classes.
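As an illustration only, NLTK's default tagger produces the same kind of information as the Stanford Tagger used here (tags follow the Penn Treebank convention):

import nltk  # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The dog likes pink flowers")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('likes', 'VBZ'), ('pink', 'JJ'), ('flowers', 'NNS')]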

2.3.3 Synonym Extraction

In order to improve method results, one can, instead of only looking at the actual words of the documents, also collect synonyms of these words using synonym dictionaries and involve them in the process. Through this, methods can become more semantically aware, and previous research has shown that it leads to better results [14, 15]. WordNet is able to provide synonym lookups for English words.
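A sketch of a WordNet synonym lookup with NLTK (the example word is arbitrary); note how one word can belong to several synsets, an issue revisited in section 3.4.3:

from nltk.corpus import wordnet  # requires nltk.download('wordnet')

synonyms = set()
for synset in wordnet.synsets("car", pos=wordnet.NOUN):  # every noun synset of "car"
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
print(synonyms)  # e.g. {'car', 'auto', 'automobile', 'machine', ...}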


2.4 Equivalence Definition

To detect the differences between documents, one must first define what is required for two sentences to be considered equal to each other, as well as whether there is any common information among the sentences.

2.4.1 Cross-Sentence Informational Subsumption

When comparing sentences between documents, one cannot assume that the information within the sentences will match flawlessly; a sentence in one document could very well correspond to several sentences in another document in terms of contained information. This phenomenon is called cross-sentence informational subsumption.

A sentence b is said to subsume sentence a if the information content of a, i(a), is contained within b, i(b); this can be written as i(a) \subset i(b) [2]. This in turn does not mean that a necessarily subsumes b, since there could exist a sentence c such that i(c) \subset i(b) and i(c) \not\subset i(a).

An example of this case is the following:

Sentence a: The number of employees will be about 9,000 after the job cuts.

Sentence b: The job cuts will reduce the number of its employees to about 9,000 by the end of 2016.

Sentence c: The job cuts will be done by the end of 2016.

Here sentence b subsumes both a and c, since it contains the basic information content of both sentences, whilst neither a nor c contains the information of the other sentences.

2.4.2 Equivalence Classes of Sentences

One way to determine whether two sentences contain equivalent information is through equivalence classes. Radev et al. suggest that two or more sentences belong to the same equivalence class, i.e. that they are semantically equal, if they can be substituted for each other without crucial loss of information [2]. Using that definition, two sentences being different means that they do not belong to the same equivalence class, hence that they contain unique and valuable information that cannot be gained from the other source.

Two sentences sharing the same equivalence class can thus be defined as those fulfilling both i(a) \subset i(b) and i(b) \subset i(a). In the example case of the previous section, sentence b contained the semantic value of both of the other sentences and vice versa; thus (i(a) \cup i(c)) = i(b), and b and the union of a and c share the same equivalence class.

2.5 Evaluation Metrics

Analyzing whether sentences are similar or different is an example of binary classification. Binary classification is the task of categorizing data into one of two categories, positive or negative, where positive can be that a sentence is unique and negative that it is similar to at least one other sentence.

There are four possible outcomes from binary classification: true positives, T_p, and true negatives, T_n, being the results that are correctly classified into their respective categories, and false positives, F_p, and false negatives, F_n, being the results that are classified into the wrong categories.

2.5.1 Precision

Precision measures the rate of actually positive data within the data classified as positive.

precision = \frac{T_p}{T_p + F_p}    (2.8)

Practically, this means the rate of actually different sentences among all sentences detected as different.

2.5.2 Recall

Recall measures the rate of positive data that is classified correctly.

recall = \frac{T_p}{T_p + F_n}    (2.9)

Practically this means the rate of different sentences between documents that are detected.

2.5.3 F-Score

F-score combines precision and recall and can be seen as a weighted average of the two measures. The F-score used within this thesis is the so-called balanced F-score, also known as F1.

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (2.10)
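The three metrics computed from the four outcome counts; a minimal sketch:

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0  # equation 2.8
    recall = tp / (tp + fn) if tp + fn else 0.0     # equation 2.9
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # equation 2.10

print(f1_score(tp=8, fp=2, fn=4))  # precision 0.8, recall ~0.67, F1 ~0.73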

2.5.4 Relative Utility

To rank the importance of observed differences, one can use relative utility, which measures how well a system is able to retrieve the most important sentences within a document. Relative utility is designed to be used for ranking all sentences of a document in order to select the most relevant ones and use those to summarize the contents [16]. The same idea works for the scenario outlined within this thesis as well, where the ranking is instead based on difference importance, and one creates a difference summary of sorts by selecting these differences.


The equation for calculating system performance using relative utility is as follows:

S = \frac{\sum_{j=1}^{n} \delta_{s,j} \cdot \sum_{i=1}^{N} u_{i,j}}{U_0}    (2.11)

where n is the number of sentences within a document, \delta_{s,j} is 1 if sentence j is selected by the system to be in the summary and 0 if not, N is the number of judges available, u_{i,j} is the utility score judge i gave to sentence j, and finally U_0 is calculated as:

U_0 = \sum_{j=1}^{n} \varepsilon_j \cdot \sum_{i=1}^{N} u_{i,j}    (2.12)

where \varepsilon_j is 1 if the average judge result for sentence j determined that the sentence should be included in the summary and 0 if not.

In the case of having only one judge, the equation can be reduced to:

S = \frac{\sum_{j=1}^{n} \delta_{s,j} \cdot u_{1,j}}{\sum_{j=1}^{n} \delta_{1,j} \cdot u_{1,j}}    (2.13)

Normalizing the results is highly recommended according to Radev et al., but since doing so requires multiple judges, it is not deemed applicable within this context.
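A sketch of the single-judge case in equation 2.13 (the argument names are our own; each list is indexed by sentence):

def relative_utility(system_selected, judge_selected, judge_scores):
    # system_selected / judge_selected: 0/1 flags per sentence (delta_{s,j} and delta_{1,j});
    # judge_scores: the judge's utility score u_{1,j} per sentence.
    achieved = sum(d * u for d, u in zip(system_selected, judge_scores))
    best = sum(d * u for d, u in zip(judge_selected, judge_scores))
    return achieved / best if best else 0.0

print(relative_utility([1, 0, 1], [1, 1, 0], [5, 4, 2]))  # 7/9 ~ 0.78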

2.6 Conclusions on Choice of Methods

This chapter has presented some different methods for determining the similarity of documents, with the simplest being the cosine-similarity measurement and arguably the most complex being the support vector machine.

With all of these methods considered, the best choice of method for this thesis has been determined to be cosine-similarity, because it gives us a simple yet highly expandable solution. Since no prior research concerning this topic has been found, this thesis needs to present comparable results based solely on its own research, and thus methods such as cosine-similarity, which supports all types of generic vector-based solutions, make for a natural choice.

The same reasoning can also be applied for choosing a weighting scheme, with no consideration given to having no weighting scheme at all, since that would with high certainty leave the system unusable. With the presented options being either TF-IDF, a centroid or LSA, and with a centroid basically being a more specialized version of TF-IDF, the final choice for weighting scheme was determined to be TF-IDF.

When combined with external means for semantic analysis, TF-IDF counteracts some of the advantages of LSA and should therefore be able to provide results of similar or maybe even superior quality to those of LSA. It is noteworthy, though, that TF-IDF is inferior from a scaling perspective.


Chapter 3

Methodology

This chapter presents the architecture of the system, the data set and how it was generated, how the difference analysis method based on three different models works both on its own and combined with semantic analysis methods, and finally how the utility ranking algorithm for ranking the importance of sentences works.

3.1 Architecture

Figure 3.1. Architecture of the implemented system.


3.2 Data Set Extraction

The first component of the architecture is the news article extractor, which is responsible for gathering and formatting news articles such that they can be used as the algorithm’s data set.

The reason for not using a previously developed corpus of news articles is primarily the scarcity and difficulty of finding one, because of copyright reasons. It is also uncommon for news article corpora to contain paired reports of the same events from multiple different sources, and as such it was determined more feasible to simply generate such a corpus from scratch.

The news articles have been collected from the news sources ABC, CBS, CNN, FOX, The Huffington Post, NBC, New York Times, Reuters, The Guardian and Washington Post and have mainly focused on national, worldwide as well as political events.

The total test sample size is 20 news article pairs, with each article containing an average of 10-12 sentences.

The criteria for two articles to be considered a pair have been defined as:

1. Both being event reports, documenting some kind of event or happening, and not e.g. interviews or debate articles.

2. Both covering the same event.

The extractor itself basically takes a URL to an article from one of the above-mentioned sources, downloads and formats the article and finally stores the result in a text file.

The text files, regardless of source, are formatted by the extractor into the newline-separated format:

[URL of article]

[EMPTY LINE]

[Header of article]

[Contents of article]

The extractor has been designed to try to remove tags, images, author information and other non-essential components from the articles, but in some rare instances, minor manual removals have been necessary in order for the formatting not to influence the performance of the algorithm.

3.3 Preprocessing

In order for the news articles to be divided into valid sentences, they first go through the preprocessing stage, in which paragraphs are split based on sentence dividers such as periods, question marks and exclamation marks. The method has been developed to differentiate between periods used within abbreviations and those signifying sentence stops, through the use of language-specific information.

The preprocessing also removes characters such as quotation marks and parentheses, such that the final result, to as large a degree as possible, ends up containing exclusively words and numbers.
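A rough sketch of such a splitter (the abbreviation list and the regular expression are illustrative assumptions, not the thesis implementation):

import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}  # illustrative, not the full list

def split_sentences(paragraph):
    # Split on ., ? or ! followed by whitespace ...
    parts = re.split(r"(?<=[.?!])\s+", paragraph)
    sentences = []
    for part in parts:
        # ... then re-join splits caused by a known abbreviation.
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    # Strip quotation marks and parentheses from the final result.
    return [re.sub(r"[\"()\u201c\u201d]", "", s) for s in sentences]

print(split_sentences('Dr. Smith arrived. "Was it late?" Yes!'))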

3.4 Semantic Analysis

The words of the articles can be refined further through semantic analysis, potentially improving the accuracy of the difference analysis algorithm.

3.4.1 Part-of-Speech Tagging

The POS tagger in and of itself provides no benefit to the algorithm, since it only tags words with information concerning their word classes. What it does provide, though, is information that contributes to correct classification in the other semantic analysis methods. Without this information, the other methods must take heuristic approaches when determining whether words are nouns, adjectives or belong to other word classes. This potentially leads to reduced accuracy and thus less reliable results.

Using the Stanford tagger also makes preprocessing redundant, since it naturally goes through a similar process when generating the tag data; as such, when part-of-speech tagging is enabled in the system, no preprocessing is done before this step.

3.4.2 Lemmatization

Lemmatization can be done through WordNet [12]. WordNet on its own has no functionality concerning word class analysis, and as such the words must either be tagged by the POS tagger in a previous step or be heuristically classified.

WordNet is, however, able to verify whether a lemma belongs to a certain word class or not, meaning that the problem of determining word classes is essentially limited to those cases in which a word belongs to multiple word classes.

The heuristic utilized within this project is to first check if the word is a verb; if not, then check if it is a noun, then an adjective, and finally an adverb. All other word classes, such as conjunctions and interjections, are simply ignored.

The motivation behind the heuristic prioritization order is that, first and foremost, non-verbs seldom get classified as verbs, according to anecdotal tests done on words belonging to other classes. Concerning the other classes, it has been presumed that it is more important for nouns to be correctly classified than it is for the other classes.
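A sketch of this heuristic on top of WordNet via NLTK (our own rendering of the described priority order, not the thesis code):

from nltk.corpus import wordnet  # requires nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def heuristic_lemma(word):
    # Priority order described above: verb, then noun, adjective, adverb.
    for pos in (wordnet.VERB, wordnet.NOUN, wordnet.ADJ, wordnet.ADV):
        if wordnet.synsets(word, pos=pos):  # WordNet verifies the word class
            return lemmatizer.lemmatize(word, pos=pos)
    return None  # conjunctions, interjections etc. are ignored

print(heuristic_lemma("went"))     # 'go' (classified as a verb)
print(heuristic_lemma("flowers"))  # 'flower'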

3.4.3 Synonym Extraction

The synonym extraction is also done using WordNet as the back end and with the same heuristic approach to determining word classes as the lemmatization process.

(22)

3.5. MODELS

One problem with synonym extraction is that a word, even when only considering one specific word class, can potentially belong to multiple synsets, groups of words having the same semantic purpose. This thesis will mostly ignore this issue and instead treat these multiple synsets as belonging to one larger synset. This method is implemented through the algorithm described below.

The purpose of synonym extraction within this thesis is to replace normal words with synset representative words, which would make the news articles basically become lists of concepts instead of lists of words.

The algorithm for retrieving as well as setting synset representatives is as follows:

def findSynsetRepresentative(word):
    # Helper functions (synsetExists, getSynonyms, get/setSynsetRepresentative)
    # are defined elsewhere in the system.
    if synsetExists(word):
        return getSynsetRepresentative(word)
    else:
        synsetCount = {}
        rogueSynonyms = []
        for synonym in getSynonyms(word):
            if synsetExists(synonym):
                r = getSynsetRepresentative(synonym)
                synsetCount[r] = synsetCount.get(r, 0) + 1
            else:
                rogueSynonyms.append(synonym)
        # Pick the representative of the most common synset among the synonyms.
        r = max(synsetCount, key=synsetCount.get) if synsetCount else None
        if r is None:  # none of the synonyms belonged to a synset
            r = word
        setSynsetRepresentative(word, r)
        for rogue in rogueSynonyms:
            setSynsetRepresentative(rogue, r)
        return r

Basically, the idea is that the most popular synset among the word’s synonyms determines the synset of the word, as well as of the word’s synonyms that do not yet have a representative.

3.5 Models

A model in this context is a container of TF-IDF vectors that determines what each sentence of an article is compared against. The comparisons are performed by taking the cosine measurement between each evaluated sentence of the first article and all of the vectors of the second article. The various models come with their own weaknesses and strengths, and it is therefore of interest to investigate how these affect the system performance.


3.5.1 Document

The document based model uses a whole article as the reference model, meaning that the whole document is treated as the TF-IDF vector which the evaluated sentence is compared against.

One possible problem with this model is that it might match all words within the evaluated sentence without there being any actual semantic equivalence or informational subsumption from the referenced article.

3.5.2 Sentence

The sentence based model uses each sentence of the article individually as a TF-IDF vector reference model.

This approach should solve the problem of words matching outside of their intended context; the similarity between the reference sentence and the evaluated sentence should correlate well with their semantic distance.

The disadvantage of using individual sentences as a model is that they by design cannot deal with cross-sentence informational subsumption, where the combination of multiple reference sentences accumulates to the same information as that contained within the evaluated sentence.

3.5.3 Tuple

The tuple based model attempts to overcome the weaknesses of both the document based model and the sentence based model. When two or more sentences are combined together, they become a tuple. The tuple based model is in essence a model in which all fixed-size combinations of sentences have been created, each being its own TF-IDF vector.

Given a list of sentences S containing five sentences s_1 - s_5, S = {s_1, s_2, s_3, s_4, s_5}, and the size variable n set to 2, meaning that the tuples should consist of two sentences, the tuple model would be T = {(s_1, s_2), (s_1, s_3), (s_1, s_4), (s_1, s_5), (s_2, s_3), (s_2, s_4), (s_2, s_5), (s_3, s_4), (s_3, s_5), (s_4, s_5)}. The evaluated sentence is then compared against each of the tuples individually.

The biggest weakness of this model lies in the fact that for c sentences it takes c(c - 1)/2 tuples (for n = 2) to create the model, meaning a quadratic complexity.
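A sketch of the tuple generation using itertools (the concatenation into a single text per tuple is our own simplification):

from itertools import combinations

def build_tuple_model(sentences, n=2):
    # Every unordered n-combination of sentences becomes one reference tuple,
    # i.e. c*(c-1)/2 tuples for c sentences when n = 2.
    return [" ".join(combo) for combo in combinations(sentences, n)]

print(build_tuple_model(["s1", "s2", "s3"]))  # ['s1 s2', 's1 s3', 's2 s3']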

3.6 Difference Analysis

The main component of the system is the difference analysis method. It both finds the differences between the articles and, through utility ranking, calculates scores for how important the differences are.


3.6.1 Difference Detection

This is the method that analyzes which sentences from the articles are considered different and hence should be part of the results.

def findDifferences(article1, article2, model, threshold):
    foundDifferences = []
    # First pass: compare the sentences of article1 against a model of article2.
    referenceModel = createModel(article2, model)
    for sentence in article1:
        if referenceModel.getDifferenceWith(sentence) >= threshold:
            foundDifferences.append((sentence, 0))
    # Second pass: reverse the roles of the two articles.
    referenceModel = createModel(article1, model)
    for sentence in article2:
        if referenceModel.getDifferenceWith(sentence) >= threshold:
            foundDifferences.append((sentence, 1))
    return foundDifferences

The method takes as input the two articles, which model to use for the difference analysis and finally a threshold value that the analyzed difference must be higher than or equal to for the sentence to be considered different enough to be included in the results.

First it creates a reference model of the second article, going through each sentence of the first article and storing those that are different enough from the model in a list. The process is then reversed, with the first article being the reference model. The combined results from the two runs are returned as the output of the method.

The algorithm in and of itself can only detect syntactic differences between the articles; hence it is up to the semantic analysis stages (part-of-speech tagging, lemmatization and synonym extraction) to process the articles in such a way that the results of the syntactic difference detection more or less become the same as one would expect from a semantic detection algorithm.

3.6.2 Utility Ranking

The difference analysis algorithm is supposed to not only detect differences but also heuristically determine the most significant ones within the articles by means of generating scores of their importance.

The simplistic nature of the methods used limits the efficiency potential, since deeper semantic analysis is unavailable. Following are two algorithms for calculating a crude estimate of this score.

from math import ceil  # used to round the final score up

def utilityRanking(sentence, article1, article2):
    # tf and IDF are the system's term frequency and inverse document
    # frequency helpers, defined elsewhere.
    score = 0.0
    wordCount = 0
    for word in sentence:
        s = (tf(word, article1) - tf(word, article2)) * IDF(word)
        if s > 0:
            score += s
            wordCount += 1
    if wordCount == 0:
        return 0  # no word was more frequent in article1 than in article2
    score = score / wordCount
    maxVal = 0
    for word in article1:
        s = (tf(word, article1) - tf(word, article2)) * IDF(word)
        maxVal = max(maxVal, s)
    return ceil(5 * score / maxVal)

The score is calculated as the average of the positive TF-IDF differences of the words in the sentence, given their frequencies within the first and the second article. The theoretical maximal value is then calculated, with the resulting utility ranking score being a normalized integer value between 0 and 5, rounded up. The first article is the one in which the sentence originated, and the second one is the one used for comparison, the reference model.

The motivation behind this design is the assumption that new information correlates well with the introduction of new words/concepts, meaning that they appear more frequently in the former rather than the latter article. Another assumption of the design is that information-dense words in general are fairly rare and as such have higher IDF scores.

def utilityRankingBasic(sentence, article2):
    # ceil and tf as above.
    score = 0.0
    for word in sentence:
        score += 1 / (1 + tf(word, article2))
    return ceil(5 * score / len(sentence))

A much simpler approach is also presented here, which only factors in the frequency of the words in the second article, the reference model. The score is calculated as \lceil 5 \cdot \sum_{w \in S} \frac{1}{1 + tf(w, A)} / \|S\| \rceil, where w is a word, S is the sentence and A is the reference article.

If the sentence contains new information, the assumption is that it also contains words or concepts uncommon to the reference article at large.

3.7 Evaluation Strategy

The analysis algorithm is evaluated by first running the algorithm on test data article pairs, generating algorithm output consisting of the sentences containing differences as well as a score for each of them showing the estimated semantic value gain of the contained information.

Annotated versions of the article pairs are created by the thesis author in accordance with the following format:

[Header of article1]

[Score of Sentence #1]\t[Sentence #1 from article1]

[Score of Sentence #2]\t[Sentence #2 from article1]

[Score of Sentence #3]\t[Sentence #3 from article1]

...

[Header of article2]

[Score of Sentence #1]\t[Sentence #1 from article2]

[Score of Sentence #2]\t[Sentence #2 from article2]

[Score of Sentence #3]\t[Sentence #3 from article2]

...

where the score is a natural number between 0 and 5; 0 basically means that the sentence provides no new information, and 5 that it provides crucial information concerning the topic, when compared to the information provided by the other article of the pair.

The output of the algorithm is then compared against these manually annotated articles, providing both comparisons concerning which sentences are detected as different as well as how important the detected differences are. Specifically this means precision, recall, F-score and relative utility.

The F-score, as previously explained, uses four separate variables, true positive TP, true negative TN, false positive FP and false negative FN, for calculating the score. For these variables, a sentence is defined to be different if it has a score higher than 0, meaning there is some difference in the semantic value of the sentence with regard to the reference article. The complete definition of these variables for a sentence is as follows.

Table 3.1. Variable Definitions

Name   Annotated Score > 0   Algorithm Score > 0
TP     True                  True
TN     False                 False
FP     False                 True
FN     True                  False

3.8 Scoring System

The scoring system within this thesis is used both for the results of the ranking algorithms and for manually grading the significance of sentences within pairs of articles. The criteria for each score value, though, only apply to the manual scores given by humans. The implemented algorithms, due to their simplicity, are able neither to interpret nor to understand semantic concepts and are therefore incapable of providing such a degree of reasoning.

Each score value is applied to a single sentence when comparing the information contained in the sentence against all information contained within another document.

Score 0

The sentence provides no new information and as such can be considered subsumed by the other document.

Score 1

The sentence provides new information, though none of importance to the topic of the article. The information is basically irrelevant.

Score 2

The sentence provides new information of indirect value, e.g. mentioning similar or previous events or possibly containing new information of low importance to the topic whilst still being relevant.

Score 3

The sentence provides new information of direct value; it is not essential for understanding the event but can give some insight into intentions, motives, or details of the general situation.

Score 4

The sentence provides new information of high value. It is relevant to the main topic of the event and probably should have been covered by the other news source as well.

Score 5

The sentence provides new information of crucial importance. Without this information, the news source should likely be deemed unfit to cover the event.

3.8.1 Scoring Example

Following are two documents that demonstrate a possible usage of the scoring system, presenting first the reference article and after that the annotated article, with the score of each sentence followed by the actual sentence.


Reference Article

The prime minister was on a visit to Sweden today.

The purpose of the visit was to negotiate a new trade deal between the countries.

The discussions are still in progress and it is therefore unclear what the outcomes will be.

Annotated Article

5 The Prime Minister of the United Kingdom was on a visit to Sweden today.

4 The purpose of the visit was to negotiate a new trade deal between the countries, a high priority for both countries because of the upcoming Brexit from the European Union.

3 Theresa May has previously repeatedly stated that Sweden is an important trading partner to the country.

2 This is but the second visit the Prime Minister has made, having spent the last week negotiating a similar deal with the government in Denmark.

1 As a promising sign for the deal, the weather has been very kind during these last few days.

0 The results are currently not in yet, with the discussions still being in progress.

Explanation

The first sentence contains a crucial piece of information: what country the Prime Minister is coming from. Without this information, it is impossible to determine the importance of the visit.

The second sentence provides a reason for why the trade deal was needed, which the other article did not mention. Whilst not necessary for understanding the basic purpose of the event, it still provides high value in fully comprehending the event.

The third sentence provides a motive for why the event was a high priority for both countries, though it does not contain any information regarding the actual event.

The fourth sentence mentions a previous event of similar character, which may or may not be of relevance to this event as well.

The fifth sentence is completely irrelevant to the actual event but still contains information that is not provided by the reference article.

The last sentence basically contains the same information as the last sentence of the reference article, meaning it can be considered subsumed.


Chapter 4

Experiments

Each experiment is a test containing a configuration of variables for the semantic analysis options as well as the choice of article model.

The variables tested are:

• POS: Enable part-of-speech tagging

• LEMMA: Enable lemmatization

• SYN: Enable synonym extraction

• DOCMOD: Use the document model

• SENMOD: Use the sentence model

• TUPMOD(S): Use the tuple model with tuples of size S

If a variable is included in an experiment’s list of variables, it can be considered true; otherwise it is false.


Table 4.1. Experiments

Experiment   Variables                 Description
1            DOCMOD                    Evaluate using only the document model.
2            SENMOD                    Evaluate using only the sentence model.
3            TUPMOD(2)                 Evaluate using only the tuple model.
4            POS TUPMOD(2)             Evaluate using part-of-speech tagging and the tuple model.
5            LEMMA TUPMOD(2)           Evaluate using lemmatization and the tuple model.
6            SYN TUPMOD(2)             Evaluate using synonym extraction and the tuple model.
7            POS LEMMA TUPMOD(2)       Evaluate using part-of-speech tagging, lemmatization and the tuple model.
8            POS SYN TUPMOD(2)         Evaluate using part-of-speech tagging, synonym extraction and the tuple model.
9            LEMMA SYN TUPMOD(2)       Evaluate using lemmatization, synonym extraction and the tuple model.
10           POS LEMMA SYN DOCMOD      Evaluate using part-of-speech tagging, lemmatization, synonym extraction and the document model.
11           POS LEMMA SYN SENMOD      Evaluate using part-of-speech tagging, lemmatization, synonym extraction and the sentence model.
12           POS LEMMA SYN TUPMOD(2)   Evaluate using part-of-speech tagging, lemmatization, synonym extraction and the tuple model of size 2.
13           POS LEMMA SYN TUPMOD(3)   Evaluate using part-of-speech tagging, lemmatization, synonym extraction and the tuple model of size 3.
14           POS LEMMA SYN TUPMOD(4)   Evaluate using part-of-speech tagging, lemmatization, synonym extraction and the tuple model of size 4.


Chapter 5

Results

The results are presented here with the scores of each experiment configuration plotted against the threshold value, which is the lowest allowed difference that a sentence must have against the reference model in order to be considered different enough, and thus able to provide new information about the topic. The difference is measured by cosine similarity, with the threshold value functioning as the required minimum percentage of difference. Threshold 0 means that all sentences are considered different, and threshold 100 that only completely non-matching sentences are considered different.

Results of experiments having similar configurations have been grouped together to more clearly show their respective differences.

Figure 5.1. Precision of basic models (E1-E3). [Plot: Precision vs. Threshold value; series: Document Model Precision (E1), Sentence Model Precision (E2), Tuple Model Precision (E3).]

The basic models’ performance measured in precision scores is shown in Figure 5.1. The sentence and tuple models clearly outperform the document model regardless of threshold value. Another observation that can be made is that the tuple model has consistently higher scores than the sentence model for all threshold values between 25 and about 70. The results after that point are mostly sporadic.

Figure 5.2. Recall of basic models (E1-E3). [Plot: Recall vs. Threshold value; series: Document Model Recall (E1), Sentence Model Recall (E2), Tuple Model Recall (E3).]

The basic models’ performance measured in recall scores is shown in Figure 5.2. The recall scores are almost a reversal of the precision scores of Figure 5.1, with the document model outperforming the other models for all threshold values. The sentence model also consistently provides a higher score than the tuple model.

Figure 5.3. F-score of basic models (E1-E3). [Plot: F-score vs. Threshold value; series: Document model (E1), Sentence model (E2), Tuple model (E3).]

The precision and recall scores combined into the f-score measurement show the importance of measuring both of these aspects together, as Figure 5.3 clearly shows that for thresholds less than 63, the document model, which had an excellent recall score, still provides comparatively worse results than both of the other models. The sentence and tuple models both show virtually the same results for all threshold values lower than 57, with the sentence model generally performing better at all later points.

Figure 5.4. Precision of tuple model and one analysis method (E3-E6). [Plot: Precision vs. Threshold value; series: None (E3), POS (E4), Lemmatization (E5), Synonym (E6).]

Figure 5.4 shows the precision when using a single semantic analysis method in addition to the tuple model. Before threshold 50, all methods seem to perform better than not using any method. For thresholds between 20 and 50, synonym extraction shows potential, having a better score than the other methods. Lemmatization also seems effective for almost all thresholds, where it performs better than experiment 3. Part-of-speech tagging shares a similar movement to that of experiment 3.

Figure 5.5. Recall of tuple model and one analysis method (E3-E6). [Plot: Recall vs. Threshold value; series: None (E3), POS (E4), Lemmatization (E5), Synonym (E6).]

Figure 5.5 shows the recall when using a single semantic analysis method in addition to the tuple model. Synonym extraction clearly worsens the recall score, and the same happens with lemmatization, except at a few lower threshold values. As with the precision in Figure 5.4, part-of-speech tagging on its own does not affect the results in any significant way.

Figure 5.6. F-score of tuple model and one analysis method (E3-E6). [Plot: F-score vs. Threshold value; series: None (E3), POS (E4), Lemmatization (E5), Synonym (E6).]

Figure 5.6 shows the f-score when using a single semantic analysis method in addition to the tuple model. Lemmatization seems to have an edge over the other methods, including using no method at all, for threshold values between 25 and 45. Part-of-speech tagging produces basically the same results as not using any method. Synonym extraction, however, has severely decreased results as the threshold value moves towards the middle from both sides.

Figure 5.7. Precision of tuple model and two analysis methods (E3, E7-E9). [Plot: Precision vs. Threshold value; series: None (E3), POS + Lemmatization (E7), POS + Synonym (E8), Lemmatization + Synonym (E9).]

The precision scores in Figure 5.7 show how all the method combinations provide positive results for thresholds 20 to 50 compared to experiment 3. Experiment 8, using part-of-speech tagging and synonym extraction, has the best results for threshold values below 50, with worse results for higher values. Experiment 9, lemmatization and synonym extraction, has the highest scores for the threshold range 70 through 90. Experiment 7, part-of-speech tagging and lemmatization, has average results for all threshold values.

Figure 5.8. Recall of tuple model and two analysis methods (E3, E7-E9). [Plot: Recall vs. Threshold value; series: None (E3), POS + Lemmatization (E7), POS + Synonym (E8), Lemmatization + Synonym (E9).]

The recall scores in Figure 5.8 show more mixed results than Figure 5.7, where for lower thresholds either experiment 3 or 7 provides the best results. At higher threshold values, experiment 8 has an edge, but at other ranges it lags behind somewhat. Experiment 9 shows the worst results for thresholds between 20 and 65, having an average score for the highest threshold values.

Figure 5.9. F-score of tuple model and two analysis methods (E3, E7-E9). [Plot: F-score vs. Threshold value; series: None (E3), POS + Lemmatization (E7), POS + Synonym (E8), Lemmatization + Synonym (E9).]

The f-score results in Figure 5.9 look quite similar to the recall scores in Figure 5.8. Experiment 9 clearly produces the worst f-score for most threshold values; experiment 7 has a small range around threshold 40 in which it beats the scores of the other methods but is otherwise overshadowed by either experiment 8 or 3. Experiment 8 shows a somewhat average result, except for thresholds 70 and up, where it has a somewhat higher score.

Figure 5.10. Precision of models using all analysis methods (E3, E10-E12). [Plot: Precision vs. Threshold value; series: Tuple model without analysis methods (E3), Document model (E10), Sentence model (E11), Tuple model (E12).]

In terms of precision, Figure 5.10 shows that the document model still clearly performs worse than the other models, regardless of threshold value. From threshold 25 to 50, the tuple model has the highest precision, with the tuple model without analysis methods leading for some threshold values after that. The sentence model provides an average of the other models for most threshold values, without standing out much.

Figure 5.11. Recall of models using all analysis methods (E3, E10-E12). [Plot: Recall vs. Threshold value; series: Tuple model without analysis methods (E3), Document model (E10), Sentence model (E11), Tuple model (E12).]

Except for thresholds beyond 60, Figure 5.11 shows that the tuple model has the worst recall score of all methods, with the document model clearly having the best. As with the precision in Figure 5.10, the sentence model more or less performs as an average of the models.

Figure 5.12. F-score of models using all analysis methods (E3, E10-E12). [Plot: F-score vs. Threshold value; series: Tuple model without analysis methods (E3), Document model (E10), Sentence model (E11), Tuple model (E12).]

The f-score results in Figure 5.12 show how being average with regard to both precision and recall can be better than excelling at either of them. For threshold values below 50, the sentence model has a small but still existing edge over the other models; the tuple model is not far behind, though. For threshold values above 60, the document model performs the best of the models, but is mediocre at all lower threshold points.

Figure 5.13. F-score of tuple models using different tuple sizes (E12-E14). [Plot: F-score vs. Threshold value; series: Size 2 (E12), Size 3 (E13), Size 4 (E14).]

Figure 5.13 presents the f-score when using different tuple sizes. It shows the limited benefits of using higher dimensions within the system. Execution time as well as memory usage goes up exponentially with the tuple size, but the results are more or less equal, or at the very least indistinguishable, for most threshold values.

Table 5.1. F-score results

Name                   Avg f-score   Max f-score   Precision   Recall   Best threshold
E1 - Document          0.67          0.73          0.59        0.95     59
E2 - Sentence          0.65          0.77          0.77        0.76     59
E3 - Tuple             0.64          0.77          0.71        0.83     47
E4 - Tuple+P           0.65          0.77          0.73        0.80     51
E5 - Tuple+L           0.63          0.78          0.65        0.99     29
E6 - Tuple+S           0.63          0.77          0.65        0.94     26
E7 - Tuple+PL          0.64          0.78          0.70        0.87     40
E8 - Tuple+PS          0.66          0.78          0.72        0.84     39
E9 - Tuple+LS          0.63          0.76          0.66        0.89     30
E10 - Document+PLS     0.67          0.72          0.60        0.88     59
E11 - Sentence+PLS     0.67          0.78          0.72        0.85     42
E12 - Tuple+PLS        0.66          0.78          0.72        0.85     39
E13 - Tuple(3)+PLS     0.66          0.78          0.73        0.84     39
E14 - Tuple(4)+PLS     0.66          0.78          0.73        0.84     39

The results of the experiments are presented in Table 5.1. The average f-score shows how well the system performs on average over all threshold values. Experiments 1, 10 and 11 have the highest average score, but the other experiments are not far behind.

The maximum f-score, on the other hand, shows the optimal score that the experiment was able to achieve, with the best threshold being the threshold that provided that score. Several of the experiments managed to get a maximum f-score of 0.78, even though their respective precision and recall scores differed to varying degrees. Standing out is how experiments 1 and 10, both using the document model, have significantly lower maximum f-scores than the other experiments.

Another aspect that differs between the experiments is at what range they have their peak f-score performances. The document model experiments consistently have their optimal thresholds at around 59, regardless of whether analysis methods are used or not.

Experiments 2 and 3, using only the sentence model and the tuple model, show how these models on their own also have high optimal threshold levels. The difference with these in comparison to the document model is that, when combined with analysis methods, they receive a significant reduction in their optimal threshold ranges.

Part-of-speech tagging, as expected, does not make a big difference on its own when it comes to optimal threshold values.

Using only one of the analysis methods lemmatization and synonym extraction seems to produce the lowest optimal threshold values.

Table 5.2. Utility ranking results

Name                   Avg utility-basic   Avg utility   Max utility-basic   Max utility
E1 - Document          0.27                0.19          0.29                0.25
E2 - Sentence          0.24                0.18          0.35                0.31
E3 - Tuple             0.24                0.19          0.35                0.31
E4 - Tuple+P           0.13                0.17          0.21                0.25
E5 - Tuple+L           0.24                0.21          0.35                0.41
E6 - Tuple+S           0.34                0.26          0.47                0.41
E7 - Tuple+PL          0.21                0.19          0.31                0.31
E8 - Tuple+PS          0.28                0.23          0.38                0.31
E9 - Tuple+LS          0.30                0.27          0.47                0.41
E10 - Document+PLS     0.33                0.20          0.41                0.29
E11 - Sentence+PLS     0.24                0.20          0.31                0.31
E12 - Tuple+PLS        0.25                0.20          0.38                0.31
E13 - Tuple(3)+PLS     0.25                0.20          0.38                0.31
E14 - Tuple(4)+PLS     0.25                0.20          0.38                0.31

Table 5.2 shows the utility ranking scores obtained by the two different utility algorithms when applied to the results of each experiment.

The basic utility ranking algorithm has an average score that is almost consistently higher than that of the more sophisticated utility ranking algorithm. This is also reflected in the maximum score, which shows the same trend.

The basic ranking algorithm seems to function especially well with those experiments that used synonym extraction, even though some of those experiments did not have convincing f-score results. In a similar sense, experiment 4, which used the tuple model in combination with part-of-speech tagging, had a fairly good f-score but still provided the worst utility scores for both algorithms.


Chapter 6

Discussion and Conclusions

Within this chapter, the results are discussed based on the research question given in the introduction, a final conclusion is provided for the work, and finally some suggestions are presented for how to expand and hopefully improve the performance of the systems and algorithms found within the report.

6.1 Discussion

One aspect that must be stated first and foremost is that the evaluation sample data has been quite limited due to time constraints. The time required to annotate a single news article pair has also been fairly long, leading to a total sample size of 20 news article pairs, covering a total of 11 different subjects. The results thus cannot be seen as final but rather as indicative of the performance; the results from a larger sample, if such a study were to be conducted in the future, could therefore differ to a greater or lesser degree.

6.1.1 F-Score

The hypothesis was that a rise in f-score would follow with the increasing sophistication of the systems, meaning that the sentence model would outperform the document model and that the tuple model would outperform the sentence model. Adding part-of-speech tagging would not make any significant difference on its own but would, in combination with either lemmatization or synonym extraction, enhance their performance. Combining all three semantic analysis methods was thought to lead to the highest efficiency and therefore provide the best results. The optimal solution was therefore believed to be the one represented by experiment 12 or, if not, one of the larger tuple sizes, experiment 13 or 14.

In reality the differences were not so substantial. Clearly the document model does underperform the other models, as experiments 1 and 11 show, but the perceived advantage of the tuple model cannot be found within the results. The difference between experiments 2 and 3 in precision, recall and threshold is quite
