Bachelor Degree Project
Using WordNet Synonyms and Hypernyms in Automatic Topic Detection

Author: Nicko Wargärde
Supervisor: Tobias Ohlsson
Semester: VT 2020
Subject: Computer Science

Abstract

Detecting topics by extracting keywords from written text using TF-IDF has been studied and successfully used in many applications. Adding a semantic layer to TF-IDF-based topic detection using WordNet synonyms and hypernyms has been explored in document clustering, either by assigning concepts that describe texts or by adding all synonyms and hypernyms of occurring words to a list of keywords. This paper explores a new method in which TF-IDF scores are calculated and the TF-IDF scores of WordNet synset members are added to all occurring synonyms and/or hypernyms. This approach is evaluated by comparing keywords extracted using TF-IDF and the new proposed method, SynPlusTF-IDF, against manually assigned keywords in a database of scientific abstracts. As topic detection is widely used in many contexts and applications, improving current methods is of great value, as the methods can become more accurate at extracting correct and relevant keywords from written text. An experiment was conducted comparing the two methods, with their accuracy measured using precision and recall and by calculating F1-scores. The F1-scores ranged from 0.11131 to 0.14264 for different variables, and the results show that SynPlusTF-IDF is not better at topic detection than TF-IDF; both methods performed poorly at topic detection with the chosen dataset.


Preface


Contents

1 Introduction
1.1 Background
1.1.1 Terms and Concepts
1.1.2 Topic Detection using Synonyms and Hypernyms
1.1.3 The Inspec Dataset

1 Introduction

Detecting a written text’s topic can be a trivial task for a skilled reader. As long as the reader understands the language and does not have any issues with comprehension, inference, or the structure used in written language, it happens naturally when reading. It is a process that happens all the time whenever we read, and it can happen subconsciously without any direct effort. This type of process is known as a ballistic process, i.e. it can automatically take place and the reader might not be able to stop it [1–4].

However, automatically detecting a topic in written text is not the same for a machine as for a human reader. The machine lacks any context within or from outside the text, does not know what is being inferred, and does not even know any natural language. The machine needs an algorithm or some method in order to detect a text’s topic, and there are many such methods in use today. The ubiquity of topic detection creates a constant need for better methods, algorithms, and models. This report proposes a new automatic topic detection method, based on the word frequency-based model TF-IDF, that uses keywords extracted from texts together with their synonyms and hypernyms, fetched from WordNet, to determine what topics a written text in English is discussing.

1.1 Background

Automatic topic detection in text is something that is widely used today. Being able to cluster similar news stories or other texts, deciding the contents of a tweet, or fetching relevant documents based on search criteria are only a few examples where topic detection is used. Several methods for detecting a text’s topic exist and are used to varying degrees, ranging from word frequency-based models to looking at co-occurrences of certain words to keyword clustering to machine-learning models [5–8]. A common key concept in most topic detection models is to extract keywords from text. These keywords are scored and ranked and together can describe what topics a text is dealing with.

Determining texts’ topics can be done using various approaches. One of these approaches is topic modeling, which, in short, is a machine learning technique that uses statistics and probabilities in text to determine what topics the documents in a corpus discuss. Topic models allow corpora to be structured and organized in different ways based on the contained documents’ topics. Topic modeling is an unsupervised machine learning technique, meaning that it does not use any pre-trained models or require the analyzed texts to have a certain markup [9]. One commonly used topic model is Latent Semantic Indexing (LSI), a model for structuring corpora based on topics [10, 11]. A probabilistic approach in the topic model field is Latent Dirichlet Allocation (LDA). Using LDA, documents in corpora can be classified based on their topics, and latent topics can be inferred using different kinds of algorithms [12].


As machine-learning models keep being trained, they will keep improving in accuracy, whereas the other algorithms can hit a topic classification accuracy limit [9].

Topic Detection and Tracking (TDT) is a long-running research project with the goal of finding methods for not only detecting topics, but also tracking news stories’ topics across other news stories, among other tasks. In TDT, tracking is the process of actively finding new news stories that all discuss a certain topic. Similarly, another TDT task is cluster detection, clustering news stories that belong together based on shared topics they discuss. This is oftentimes known as document clustering when talking about any kind of documents that discuss the same topics, not just news stories. Briefly, some of the methods for the various TDT tasks that have been found to be useful include statistical unigram models, probabilistic statistical models, and data segmentation with decision trees, to name a few [13].

In the field of Distributional Semantics (DS), words’ semantic information is used to categorize the words along with the contexts in which they appear [14]. This is based on the distributional hypothesis, which, in short, states that words appearing in the same context usually have similar semantic meaning, and on the idea that "a word is characterized by the company it keeps" [14–16]. There are several different ways to categorize words in DS, one of which is using vector spaces to model which words occur in which contexts and which other words co-occur in the same contexts. Depending on the application and method, a two-dimensional matrix may be used to represent this distribution of words in contexts, or a more advanced three-dimensional space can be used to represent words’ distributional vectors. Using these types of models can be very effective not only for categorizing words within a document, but also for categorizing words and documents in large corpora, making it a useful tool in document clustering and topic classification in general. The field of DS is quite large, and there are several more approaches and methods available for tasks related to document clustering, topic classification, and topic detection, as well as various semantic distance measures that are and can be used in different kinds of related applications [14].

In pure topic detection, one of the most commonly used word frequency-based methods is Term Frequency times Inverse Document Frequency (TF-IDF); the "times" is sometimes left out of the full name. TF-IDF combines a measure of how often a word occurs in a document, the Term Frequency (TF), with the logarithm of the corpus’ size divided by the number of documents in the corpus in which the word occurs, the Inverse Document Frequency (IDF) [5, 17–19].

$TF_{t,d} = \frac{f_{t,d}}{total_d}$  (1)

Equation 1 above shows how TF can be calculated, where $f_{t,d}$ is the frequency of term $t$ in document $d$. That is, the frequency of term $t$ in document $d$ is divided by the total number of terms $total_d$ in document $d$.

$IDF_t = \log\left(\frac{total_d}{d_t}\right)$  (2)

Equation 2 shows how IDF can be calculated by taking the logarithm of the total number of documents $total_d$ divided by the number of documents $d_t$ in which term $t$ occurs.


There are some variations in how IDF is calculated: one can look at the total number of documents in which a certain word appears, or at the total number of occurrences of the word across all documents. TF-IDF is useful for extracting keywords that can describe a text’s topic quite accurately. Each word in a text is given a score, and the words are then sorted by score in descending order. The higher the score, the better the word describes the text.
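As a worked example with hypothetical numbers (not taken from the Inspec data): suppose a term occurs 3 times in a 100-token abstract and appears in 100 of the 2,000 abstracts in a corpus. Using the natural logarithm, as in this project’s implementation:

$TF_{t,d} = \frac{3}{100} = 0.03$, $\quad IDF_t = \ln\left(\frac{2000}{100}\right) \approx 3.00$, $\quad TF\text{-}IDF_{t,d} = 0.03 \times 3.00 \approx 0.09$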

Another statistics-based method uses keyword clustering for topic detection, which showed promising results when comparing extracted keywords against Wikipedia articles’ manually assigned keywords [7]. Going further beyond the purely statistical approach in topic detection means adding more of the semantic aspect of language to the analysis. One way of doing this is to use co-occurrences of words in texts; compared to TF-IDF, the keywords extracted this way described the texts’ topics more accurately. These two methods are discussed in detail in Section 1.2 Related Work.

Finally, another method with a focus on the semantic aspect in language is Shehata’s [20] method. It is a method that uses synonyms and hypernyms from the lexical database WordNet for text clustering [21, 22]. Text clustering differs from pure topic detection in that its main goal is to decide whether two or more documents belong together by some measure. No comparisons are made to other documents’ topics when only pure topic detection is done. Shehata extracted keywords from texts, and then those keywords’ synonyms and hypernyms were fetched from WordNet. The synonyms and hypernyms were matched against the initial keyword and grouped as concepts. Documents that had similar or the same concepts were clustered together. That is, if the document had many occurrences of the word car, and another document had many occurrences of the word automobile, they would be deemed to be dealing with the same topic and be clustered together. Using WordNet for this task proved to improve the quality of text clustering significantly [20].

The extraction of keywords is a central concept in most topic detection models and similar tasks. Whether the actual topics are extracted from the source text directly, or a machine-learning algorithm has been trained to detect topics based on some other, external knowledge set, the focus lies on keywords to describe a text and its topics. This project proposes a conceptually similar approach to Shehata’s [20], but for pure topic detection, without text clustering, based on TF-IDF and using synonyms and hypernyms fetched from WordNet. By adding these synonyms and hypernyms to the calculation of TF-IDF scores, the semantic aspect of text becomes a more important part of topic detection, which should improve the accuracy of the extracted keywords that describe a text compared to pure TF-IDF.

1.1.1 Terms and Concepts

Some of the terms and concepts that will be used in this paper are listed and briefly explained below. Certain terms mentioned will be further explained in detail in later sections, but this list provides a quick overview of the most commonly used terms and concepts.

• Synonyms: Words that have the same or very close to the same meaning as each other. In WordNet, synonyms are defined as two or more words that can be exchanged while still keeping a sentence semantically the same and true [21, 22].

• Hypernyms: Words that can categorize other words, known as hyponyms, as being members of a more general category; e.g. animal is a hypernym of dog and cat.

• Hyponyms: Semantically related words grouped by a common hypernym, e.g. dog and cat are both hyponyms of the hypernym animal. Hyponyms and hypernyms have an is a association: a bungalow is a house, a cat is an animal.

• Content words: Words that convey semantic meaning in sentences. They are usually nouns, verbs, adjectives, and sometimes adverbs. As the name suggests, they hold information about content in language.

• Function words: Words that convey grammatical function in sentences. They are usually everything except the above-listed word classes. For example, in the sentence A cat kicked the ball of yarn very far., the content words are cat, kicked, ball, yarn, and far, and the rest are function words. As the name suggests, these words hold information about function in language.

• Corpus: A collection of documents or texts that are usually related in some way but do not necessarily need to be. In this paper, the corpus consists of 2000 abstracts from the Institution of Engineering and Technology’s (IET) scientific papers database Inspec [23–25].

• Inspec: An engineering and physics literature database [25].

• Stemming: The process of taking a word and getting its stem which is a basic form without any inflections, conjugations, declensions, or any other prefixes or suffixes. For example, singing has the stem sing, and cars has the stem car.

• Tokenization: The process of taking a sentence and splitting it into tokens, i.e. words, numbers, punctuation, or other characters. There are different rules and methods for what a token is defined as. For example, the sentence It is sunny and 32° in New York today! can be tokenized into:

It – is – sunny – and – 32 – ° – in – New York* – today – !

* Depending on the tokenization method, New York could be tokenized into two separate New and York tokens.
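As a concrete illustration, the following minimal sketch tokenizes the example sentence with OpenNLP, the NLP library used in this project. SimpleTokenizer is chosen here only because it needs no trained model file; the project’s actual implementation uses OpenNLP’s trained tokenizer (see Section 3.2):

    import opennlp.tools.tokenize.SimpleTokenizer;

    public class TokenizeExample {
        public static void main(String[] args) {
            // Rule-based tokenizer that needs no pre-trained model file.
            SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
            String[] tokens = tokenizer.tokenize("It is sunny and 32\u00b0 in New York today!");
            for (String token : tokens) {
                System.out.println(token); // one token per line, e.g. "It", "is", "sunny", ...
            }
        }
    }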


• Inverse Document Frequency (IDF): The corpus’ size divided by the number of documents in which a certain term occurs. It can also be the corpus’ size divided by the total number of occurrences of a word in all documents. In both scenarios, the logarithm is taken of the calculated result.

• SynPlusTF-IDF: The name of the new method for topic detection presented in this project.

• WordNet: A lexical database in English that contains, among other items, sets, synsets, of words’ synonyms and hypernyms [21, 22].

• Synset: Short for synonym set, the set of all of a word’s synonyms and hypernyms. For example, the synset in WordNet of animal is animal, animate being, beast, brute, creature, fauna.

• Natural Language Processing (NLP): A scientific field that uses mainly linguistics, artificial intelligence, and computer science to study natural language.

• Text/document clustering: The process of deciding whether two or more texts/documents belong together. For example, if several texts deal with the same topic, they should be clustered together.

• Word sense: The sense or intended meaning of a word where several are possible.

• Part Of Speech (POS): Part of speech, sometimes word class, is a categorization of words based on their grammatical functions. For example, noun, verb, and adjective are all parts of speech. POS tags are oftentimes added in NLP analyses using a POS tagger.

1.1.2 Topic Detection using Synonyms and Hypernyms

Topic detection is used in many different ways and many different applications. These applications can range from suggesting news stories similar to the one currently being read by a reader, to clustering other types of documents together, to making searches in unlabeled documents or to determining what a tweet is talking about. End-users browsing the web today encounter automatic topic detection, or at least see the results of it, all the time, both passively and by actively using it. The examples mentioned above are just a few instances where topic detection is used.

This project proposes a new method which uses synonyms and hypernyms of high-frequency words in a text for topic detection. Similar approaches do exist, as mentioned earlier, see [20], but mostly for document clustering rather than pure topic detection, and they look at slightly different factors [20, 26]. Synonyms and hypernyms from WordNet have also been successfully used in topic models, see e.g. [27]. In written text, authors use hypernyms, hyponyms, and synonyms to make the text more interesting, more varied, or to describe something in a different way. The use of synonyms can, from a memory standpoint, even help with remembering terms and ideas [28]. The relationship between hypernyms, hyponyms, and synonyms can be viewed as a tree structure, with the hypernym being a parent node and the hyponyms being child nodes.

Figure 1.1: Tree diagram of hypernym, hyponym, and synonym relationships, showing vehicle as the hypernym of the hyponyms boat (schooner, yacht) and aircraft (airplane/plane, helicopter/chopper).

For example, airplane and plane are synonyms and are a type of aircraft, which is a type of vehicle. The focus of this project is to look at synonyms and hypernyms of highly frequent words in a text to determine the text’s topic. Synonym is defined here as two or more words that can be interchanged while still keeping the semantic content the same and true. Only what a word denotes, what it describes, will be considered here, not its connotations, i.e. what feelings the word may raise in the reader.

Since synonyms and hypernyms are an integral part of language, this paper explores the hypothesis that they can be useful in automatic topic detection based on TF-IDF and make it more accurate than pure TF-IDF. While synonyms and hypernyms might not be used much in certain genres of text, such as law documents or news articles, they have a place in other genres. By using synonyms, a writer can enrich and enhance their language in order to make it more interesting or more memorable. For example, synonymy plays a large part in political speeches and rhetoric, but synonyms are not as widely used in scientific literature [29, 30]. However, a good quality dataset is needed for testing and verifying the results from experiments, and therefore a corpus of scientific abstracts with manually assigned keywords will be used in this project, described in Section 1.1.3 The Inspec Dataset. By looking at high-frequency words in a text along with their synonyms and hypernyms fetched from the lexical database WordNet, and then counting those synonyms’ and hypernyms’ TF-IDF scores together in the original text, a more accurate method for topic detection compared to the traditional frequency-based method, TF-IDF, is proposed [21, 22].

The new proposed method, SynPlusTF-IDF, will be based on TF-IDF and compared against TF-IDF. By running the two methods on the same dataset, they can be compared and their accuracy in topic detection measured and evaluated using precision and recall and by calculating F1-scores. More on this in Section 2 Method.

1.1.3 The Inspec Dataset


The Inspec dataset used in this project was originally created to evaluate a new keyword extraction approach when compared to existing similar methods [23]. The dataset has since been used in several other experiments, e.g. [31, 32], and can further be examined in [23, 24].

The genre of scientific literature is perhaps not the best suited for examining how TF-IDF and SynPlusTF-IDF perform in terms of topic detection. A more synonym-rich genre such as fiction could show that both methods perform better than they do with scientific abstracts. However, due to the scope of this thesis project and the availability of good datasets, the Inspec dataset is deemed to be good for studying differences between the two methods. If this project were larger, a new dataset would be created that could perhaps enhance and highlight how the methods perform to a greater degree. Creating such a dataset is, however, out of this project’s scope.

The abstracts in the Inspec dataset have a mean of 136 tokens, rounded down, according to this project’s tokenization method described in Section 3.2 Pre-processing. The median is 131 tokens, with lengths ranging from 15 to 555 tokens. The abstracts discuss various topics in the genres of technology, physics, and chemistry, to name a few. Each abstract has on average 11 manually assigned keywords, when rounded down, where each keyword can consist of more than one token [23, 24]. The longest keyword is eight tokens long, and 9.1% of all keywords consist of four or more tokens [23].

Documents   Median tokens/doc   Mean tokens/doc   Keys/doc
2,000       131                 136               10*

Table 1.1: Various statistics for the Inspec database [23, 24].

Table 1.1, based on [23, 24], summarizes some statistics of the Inspec database. Keywords will be extracted using both TF-IDF and SynPlusTF-IDF for all 2000 abstracts in the dataset, and the extracted keywords will be compared and matched against the manually assigned keywords. The results from this task for TF-IDF and SynPlusTF-IDF will be measured against each other to compare accuracy at topic detection. According to [24], the Keys/doc column states that there are 10 keywords per document. However, calculating keys/doc from the 22147 keywords in total and 2000 documents gives 11 keys/doc, rounded down:

$\frac{22147}{2000} = 11.0735$  (3)

In this project, an average of 11 keys/doc will be considered the actual value.

1.2 Related Work


Topic detection has been applied in many fields, such as Information Retrieval (IR), which, briefly, deals with retrieving information from documents, text, audio, or video. In the IR field, concepts such as models based on boolean searches, hierarchic clustering, and single-link algorithms for document clustering and retrieval have been explored [38–40]. These all relate to topic detection in that there is a need to classify the document in some way, usually by detecting a document’s topic using keywords that describe the document.

In general, much research has been done on applications related to topic detection, such as text summarization, document clustering, and general information retrieval. A Google Scholar search for any of those four terms yields thousands of relevant results and shows a long history of research in various fields on the topic. Briefly, some of the more important and influential papers in these fields include, for example, [10], which shows the effectiveness of Latent Semantic Indexing (LSI) at finding further semantic information in information retrieval, in addition to the information being explicitly searched for. In short, LSI works by placing documents in a vector space with related documents close to each other. This makes it possible to distinguish or cluster documents based on the documents’ topics. Furthermore, various directions within text mining in general have been explored, such as Latent Dirichlet Allocation (LDA), a generative statistical model that was found to be effective in finding topics in texts [12].

Mimno et al. [41] argue that latent variable models, such as LDA, do not produce good enough results due to them including too many topics that are of poor quality. That is, latent variable models can yield lists of topics that a document discusses where around 10% of the topics are useless at describing the document. By including the topics that do not describe a document in the results from e.g. LDA, users’ confidence and trust in the models will be lowered. While the latent variable models can include the poor quality topics in their results, a human domain expert will instantly recognize the poor quality topics. To alleviate this problem, Mimno et al. [41] proposed a new automated evaluation metric that could filter out the bad topics. The evaluation metric worked by using words’ co-occurrences in a new way in texts in order to identify bad topics and improve the results of LDA. The new metric showed promising results which were an improvement compared to the results that LDA produced. Furthermore, a new statistical topic model was suggested that used the new evaluation metric and showed great improvements in the quality of results compared to LDA when examining documents from the National Institutes of Health [41].

Another topic model improvement, incorporating words’ WordNet senses in the analysis, was proposed by Guo and Diab [42]. Their topic model works by looking at how senses are distributed in text, compared to looking at how word-topics are distributed in text. By taking a word’s senses and building a definition for that word based on neighboring senses, a more robust and richer definition is achieved. The new topic model was more accurate at finding the topics of words when compared to traditional LDA and showed great promise for current use and future improvements [42].

topic modeling and, out of the three categories, show the best results for short texts. The second most promising are the global word co-occurrence models, and the worst were the self-aggregation models, which performed worse than LDA [43].

A keyword clustering algorithm by Wartena et al. [7] extracted keywords that matched fairly well with the ones that had been manually added by Wikipedia articles’ authors. In short, keywords were extracted and clustered together based on a statistical similarity or distance measure. This measure was calculated based on how keywords occurred together in a corpus. The clustered keywords would describe the articles’ topics quite well and this method did not require any prior knowledge or any pre-trained models [7].

Another method by Wartena et al. [6] uses the semantic relationships of words based on their occurrences in texts. If words are semantically similar, they will appear in similar contexts [44]. This is a co-occurrence model that works by looking at all terms and which co-occurrences are present in the text in order to extract keywords. Compared to TF-IDF, Wartena et al.’s [6] co-occurrence model was found to perform better at extracting keywords that describe text from two datasets that the authors used for their model evaluations. These datasets consisted of computer science abstracts from the Association for Computing Machinery (ACM) and synopses from the British Broadcasting Corporation (BBC). The results were promising for both datasets, even though the ACM dataset was much larger than the BBC dataset, indicating that co-occurrence could perform better at topic detection compared to TF-IDF even with smaller corpora [6].

More specifically, and most related to this project’s topic, are Shehata’s [20] and Sedding and Kazakov’s [26] papers. Both deal with document clustering using TF-IDF and WordNet synonyms and hypernyms in some way. Sedding and Kazakov used TF-IDF to score keywords and then added the keywords’ synonyms and hypernyms from WordNet to the list of keywords. When fetching synsets from WordNet, a list of all the senses for the search word is returned, sorted in descending order. The authors included all senses of the keywords and found that including all senses was not beneficial for document clustering effectiveness. Section 3.1 WordNet provides an example of the list of synsets and their senses found in WordNet. Shehata [20] used only the first sense of each keyword, and as Sedding and Kazakov [26] noted, it might be more effective to only use the first sense of each keyword. This approach, where TF-IDF scores were calculated and synonyms from WordNet were extracted, was also tested with good results. Only the first sense of each word was used and the rest ignored, as they most likely introduce too much noise or unusable information [26, 45]. Finally, Hotho et al. [46] found that WordNet could be used to give more background information about texts by incorporating WordNet synsets and the information found within these synsets.


1.3 Problem Formulation

This project aims to investigate a proposal for a new method, SynPlusTF-IDF, for automatic topic detection in text using synsets from the lexical database WordNet [21, 22]. In short, SynPlusTF-IDF first calculates TF-IDF scores and then fetches synonyms and hypernyms from WordNet for all extracted keywords. Each keyword then has its synonyms’ and hypernyms’ TF-IDF scores calculated and added to it. The list of extracted keywords is sorted in descending order based on the keywords’ scores before the top n highest-scoring keywords are selected. More on the n cut-off and the actual method in Section 2 Method. The expected result is a ranked list of keywords that describes the original text more accurately than only using TF-IDF for keyword extraction, when matching against the manually assigned keywords in the database of scientific abstracts.

Looking at pure frequencies of words is one measure that can be used to determine a text’s topic. This is usually done while also removing function words, so that prepositions, articles, and determiners do not skew the results. That is, if these function words were included in such a method for topic detection, most texts’ topics would be shown to be e.g. the, a, an, in, and so on. A more sophisticated method adds weighting by multiplying TF with the IDF, giving TF-IDF. This project’s new method is yet another improvement on TF and TF-IDF, adding a semantic aspect to topic detection. If a text contains the word car twelve times, house eight times, and boat five times, TF would show that the text’s topics are, in order, car, house, and boat. TF-IDF could perhaps rank the topics in another order, house, car, boat, if the other documents in the corpus where IDF was calculated contained house, car, and boat a certain number of times. However, the text’s actual ranking of topics could be boat, car, and house if the word boat only occurs five times but has many synonyms, e.g. ship, watercraft, motorboat, etc., that all together appear a total of 16 times. SynPlusTF-IDF would take this into account and calculate scores for each keyword along with the frequency of each keyword’s synonyms and hypernyms. Using synonyms as a semantic measure should improve the accuracy in topic detection compared to TF-IDF.

TF          TF-IDF           SynPlusTF-IDF
car - 12    house - 0.8437   boat - 0.9265
house - 8   car - 0.5422     car - 0.8422
boat - 5    boat - 0.4654    house - 0.7842

Table 1.2: Example topic rankings of TF, TF-IDF, and SynPlusTF-IDF.

Table 1.2 shows how the example mentioned above could look for the three methods TF, TF-IDF, and SynPlusTF-IDF. The scores listed after each word are only to demonstrate how a certain text’s topics could be scored and ranked. The list of extracted keywords for each method here is only three words long, and in reality relevant keywords could fall outside of this top n list cut-off, making the need for more accurate methods even greater.

Any difference between the two methods in speed will not be measured. The only measure that will be explored is accuracy in topic detection, compared against the Inspec database’s already assigned keywords for the abstracts. Accuracy will be measured using precision and recall, as well as by calculating F1-scores. This is described further in Section 2.2 Evaluation Methods. The actual scores that each method outputs for extracted keywords will not be considered in any other way than to rank the extracted keywords in a top n list. In other words, the scores will only be used to sort keywords in the list in descending order. The top n list of keywords with a set cut-off describing each abstract will be used for both methods. The length, n, of the top n list is further discussed in Section 2.1 The n cut-off. If an extracted keyword is ranked n + 1, it falls outside the top n keywords and is deemed to be useless at describing a text’s topic. Even though the first keyword in the top n list could be the one that best describes an abstract’s topics, followed by the second one which describes them second best, and so on, all extracted keywords in the top n list will be valued the same. As long as a keyword is in the top n list, it is deemed to accurately describe a text. This is done to make it possible to compare the two methods, as they do not necessarily have an equal scale for scoring keywords; however, they both share the same principle that the higher the score, the more likely it is that the keyword describes the text well.

1.4 Motivation

Since topic detection is so widely used in many different contexts, its value, both in online and offline applications, is already substantial and growing rapidly. By exploring different methods for topic detection, the accuracy and effectiveness of topic detection can be improved, yielding overall better results. By more accurately being able to determine a text’s topic, all areas where topic detection is used will improve. More accurate document clustering, better searches in unlabeled documents, determining what various customer feedback is discussing, more intelligent automated support bots, sentiment analysis, and general trend analysis in online discourse, to name a few, are all of great value for many different applications. This includes both professionals and companies that can employ these methods and the end-users who can benefit from all of the above-mentioned possible improvements to the various applications of topic detection.

Furthermore, as conceptually similar methods have successfully been used in document clustering [20] and a topic model method [27], exploring the approach in pure topic detection can also increase the validity and usefulness of incorporating semantic measures in topic detection or other topic detection related tasks.

1.5 Objectives

O1 Implement the traditional frequency-based topic detection method (TF-IDF).
O2 Implement the new topic detection method (SynPlusTF-IDF).
O3 Find a suitable existing dataset for topic detection.
O4 Run both methods on all texts in the dataset.
O5 Calculate and measure the two methods’ accuracy at detecting topics.
O6 Compare and analyze the results from both methods.


Table 1.3 lists all the objectives in chronological order of completion that need to be achieved in this project. In order to achieve O1, TF-IDF, and how it has previously been implemented, needs to be studied in detail. After gaining enough knowledge about the method and how to implement it, a version will be written in Java. Thereafter, once O1 is completed, O2 can then be done as SynPlusTF-IDF further builds upon TF-IDF.

Finding a suitable dataset in O3 will be achieved by looking at what datasets are available and can be freely used. The dataset needs to consist of texts along with keywords that describe the texts, so that TF-IDF and SynPlusTF-IDF can be compared against the dataset. Once a dataset has been found, O4 and O5 will be ready to be completed. Calculating the two methods’ scores at topic detection in O5 will be done by calculating precision, recall, and F1-scores. This is described in detail in Section 2.2 Evaluation Methods.

After objectives 1–5 have been completed, the results are ready to be analyzed. How the two methods have performed at topic detection will be based on all measures calculated in O5, which will then complete O6.

The expected results are that SynPlusTF-IDF will be more accurate at topic detection compared to traditional TF-IDF. No measurements of the methods’ performance in terms of speed will be made; however, SynPlusTF-IDF is necessarily slower than pure TF-IDF, since TF-IDF is part of the new method and more steps and calculations are made in SynPlusTF-IDF. As such, SynPlusTF-IDF is a more expensive method to run than TF-IDF, both in speed and in the actual cost of running the hardware for longer times. However, the benefit of having a more accurate method should outweigh the possible negatives, which should in practice be negligible.

1.6 Scope/Limitation

As mentioned earlier, SynPlusTF-IDF and TF-IDF will be compared on topic detection accuracy. The focus is on pure topic detection and not any functionality that would use topic detection as part of its model, e.g. document clustering or document searches; as such, only the accuracy of each method in topic detection will be measured. The texts used will only be written English of the scientific abstract genre, which might affect the methods’ accuracy since they could be better at topic detection in one genre than in others. The lack of good and usable datasets suitable for this project is a challenge that would ideally be solved by creating a new dataset just for this project. However, such a task would be too big for this project, and only one dataset, the Inspec dataset, will be used.

Furthermore, SynPlusTF-IDF will only work with certain languages. It will be much less effective and less accurate on agglutinative languages such as Tamil, Indonesian, or Korean, whose morphology makes it harder for the method to find a text’s topic; other methods are required to handle such languages, see for example [47]. This all assumes that the prerequisites are met for other languages as well, e.g. that WordNet, or an equivalent database, has been translated to the other language. One such example database, BabelNet, does exist, but the issue of language morphology would still limit the usefulness of SynPlusTF-IDF [48].


None of the limitations mentioned above will be considered here, since they all fall outside the scope of this project.

1.7 Target Group

The target group for this research includes developers who work with topic detection in any capacity. News aggregators, search engines, trend analyzers, blog taggers, data management, etc. might all be interested in this type of research, since improving topic detection accuracy will improve their services. Being able to more accurately determine a text’s topic will, in turn, yield better results for businesses as well as end-consumers.

In addition to those developers, the end-users are also part of the target group even though the end-users might not necessarily have any active interaction with this project’s new method or results. The end-users could still be affected by the results if the new method was implemented somewhere accessible by the public.

1.8 Outline

The rest of this paper contains the following sections:

• Method. This section describes SynPlusTF-IDF in detail along with TF-IDF. How all calculations and how accuracy was measured are described here.

• Implementation. How the new method was implemented is described in detail here along with an overview of how TF-IDF was implemented.

• Results. The results from the experiments are described and presented here.

• Analysis. What the results mean and how they are interpreted are discussed here.

• Discussion. A final discussion on the whole project, where the overall results and what implications they have are presented.

• Conclusion. The paper is concluded with some final thoughts and summarizations along with what future work can be done in order to further validate the results and conclusions in this paper.

2 Method

To compare the two methods, a controlled experiment will be conducted where the results are calculated and measured using precision, recall, and F1-scores. As this project deals with a comparison of two systems, a controlled experiment was the most suitable method to use so that quantitative results could be compared. Calculating TF-IDF and SynPlusTF-IDF scores for the 2000 abstracts will be done using a Java program, described in detail in Section 3 Implementation. Each abstract will have keywords extracted: one list of keywords for each method, consisting of n keywords that describe the topics of the abstract. The cut-off for the number of keywords is explained in Section 2.1 The n Cut-off below. These keywords will be compared to the manually assigned keywords for each abstract in the Inspec database. The manually assigned keywords are found in a separate file for each abstract. These keyword files are called [number].uncontr, where [number] matches the number of the abstract file and .uncontr refers to the fact that these are the uncontrolled assigned keywords. If the stem of a keyword in the top n list matches the stem of a manually assigned keyword, it is considered a true positive. The manually assigned keywords will also be tokenized in one experiment and matched against the extracted keywords to see how many matches occur. This is done to see what possible effect tokenization, or the lack of it, has on the results, and to give a perhaps fairer comparison between keywords. The results for the number of matches for TF-IDF against the Inspec database’s keywords and the number of matches for SynPlusTF-IDF against the Inspec database’s keywords will be compared using precision, recall, and F1-score. This is further described in Section 2.2 Evaluation Methods below.

2.1 The n Cut-off

The n cut-off is a number that limits the top n list of extracted keywords. In this project, both TF-IDF and SynPlusTF-IDF assign every unique token a score. In a sorted top n list, this means that frequently occurring keywords that are not useful at describing a text appear towards the end of the list. These keywords include mostly function words and punctuation. A limit, the cut-off, on the top n list must then be used in order to filter away the unusable keywords.

Two n cut-offs will be tested in this experiment. Since the average number of manually assigned keywords for each abstract is 11, the first cut-off tested is 11. It makes no difference whether an abstract has one or 25 manually assigned keywords, or whether the extracted keywords have scores above a certain threshold: this n cut-off of 11 is static. The second n is dynamic, based on the number of manually assigned keywords, with a minimum of 11 and a maximum of however many manually assigned keywords there are for that abstract. This means that for an abstract with 13 manually assigned keywords, n will be 13, but for an abstract with seven manually assigned keywords, n will be 11. In practice, a fixed n is often used, as using a score threshold to determine the number of keywords to describe a text can yield unreliable results. The two n values used in this project will be called 11n and max n, respectively.
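As a small illustration, the dynamic max n cut-off reduces to a one-line computation; the class, method, and variable names below are hypothetical, not taken from the project’s code:

    import java.util.List;

    final class CutOff {
        // Dynamic "max n" cut-off: at least 11, at most the number of manually
        // assigned keywords for the abstract (names here are hypothetical).
        static <T> List<T> topN(List<T> rankedKeywords, int numAssignedKeywords) {
            int n = Math.max(11, numAssignedKeywords);
            return rankedKeywords.subList(0, Math.min(n, rankedKeywords.size()));
        }
    }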

Minimum   Maximum   Average tokens/key
1         38        2


The table above summarizes token statistics for the manually assigned keywords, including the average number of tokens per keyword, which was:

$\frac{44924}{22147} \approx 2.03$  (4)

Generalizing these values for abstracts with a mean of 136 tokens, where the professional indexer found on average 11 keywords per abstract [23], around 8% of the document’s length could be a set value for the n cut-off, as Equation 5 below shows:

$\frac{11}{136} \approx 0.08$  (5)

How applicable this will be for much longer texts remains to be investigated and is outside the scope of this project. However, since the keywords the professional indexer assigned amount to on average 8% of an abstract’s length, 8% for the n cut-off is in line with their findings.

2.2 Evaluation Methods

In order to evaluate the topic detection accuracy of SynPlusTF-IDF in comparison to TF-IDF, an experiment will be conducted where 2000 abstracts from the Inspec database are run through both methods. Then, an F1-score will be calculated for each abstract and the results compared between the two methods. By using the F1-score as the measure of accuracy, a fair comparison can be made, as the two methods will have equal conditions for scoring.

In short, the F1-score is calculated by taking the harmonic mean of precision (P) and recall (R) [49]:

$F_1 = \frac{2PR}{P + R}$  (6)

Equation 6 from [49] shows how F1-scores are calculated, where P is defined as:

$P = \frac{TP}{TP + FP}$  (7)

and R is defined as:

$R = \frac{TP}{TP + FN}$  (8)

The formulas above for P, Equation 7, and R, Equation 8, are from [50], where TP stands for true positives, FP for false positives, and FN for false negatives. The F1-score is a measure that builds upon the Precision and Recall measures which Kent et al. [51] defined in 1955. By using Precision and Recall for F1-score calculations, one single measure can be used to determine how an algorithm performs at its task, and more realistic results can be achieved.
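The following minimal sketch shows how these three measures can be computed from the match counts described above (true positives being stem matches against the manually assigned keywords); the class and method names are hypothetical:

    final class Evaluation {
        // Precision (Eq. 7): share of extracted keywords that match a manually assigned keyword.
        static double precision(int truePositives, int falsePositives) {
            return (double) truePositives / (truePositives + falsePositives);
        }

        // Recall (Eq. 8): share of manually assigned keywords found by the method.
        static double recall(int truePositives, int falseNegatives) {
            return (double) truePositives / (truePositives + falseNegatives);
        }

        // F1 (Eq. 6): harmonic mean of precision and recall.
        static double f1(double p, double r) {
            return (p + r) == 0 ? 0 : 2 * p * r / (p + r);
        }
    }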


                 Positive Prediction   Negative Prediction
Positive Class   TP                    FN
Negative Class   FP                    TN

Table 2.2: A matrix with the possible outcomes and their classes [52].

In Table 2.2 above, adapted from [52], the possible outcomes are categorized according to their predictions.

2.3 Reliability

The method for collecting data in this project consists of running both methods on all 2000 abstracts. The results from both methods will be measured using precision, recall, and F1-score for each abstract, and then the scores will be analyzed and compared. Similar methods have been used in other papers dealing with similar topics, e.g. [6, 8, 20, 34, 53], to name a few, where either precision and recall or F1-score measures were used. If this project were replicated on the same dataset, the Inspec database from [23], the results should be the same. The only variable that can affect this outcome is the pre-processing implementation. The actual pre-processing is described in Section 3.2 Pre-processing below. It is possible that another stemmer, tokenizer, or POS tagger could affect the results in several ways. If another stemmer than the Porter stemmer used here was used, it could yield different results, as could a different POS tagger. Perhaps the biggest impact would come from the choice of tokenization method. There are many different ways to tokenize text and even different definitions of what a token is. Since English has compound words that can be written as one word or as several words, e.g. sunshine and door knob, or even using a hyphen, e.g. long-term, the choice of tokenization method and tokenizer could affect the results the most. If a more advanced tokenizer was used which could correctly handle compound words, it would most likely enhance the results of the two methods investigated here. However, since the text for both TF-IDF and SynPlusTF-IDF goes through the same pre-processing, the comparison between them is fair.

2.4 Internal Validity

As this project aims to explore and compare the accuracy at topic detection in the Inspec database abstracts between TF-IDF and SynPlusTF-IDF, its validity will cover that scenario and the genre of scientific papers in general. There is no difference in how synonyms and hypernyms are used in the abstracts compared to their full papers, and as such, the results here can most likely be generalized to the genre of scientific papers. As mentioned earlier, accuracy is defined here as how well a method can detect topics in the abstracts compared to the manually assigned keywords in the database. Accuracy is then a measure of how many correct topics the methods have found, as well as how many false positives they have found, i.e. topics that are not found in the keyword files in the database. The measurements of precision, recall, and F1-score make it possible to compare the two methods’ results against each other.

(21)

There are no specific language differences between various sub-genres of scientific papers. Comparing e.g. physics, chemistry, computer science, or linguistics papers in general, they, arguably, do not differ in their use of synonyms or hypernyms. Therefore, as long as the genre is scientific papers, the results found here should apply to all such papers.

The whole implementation, including all pre-processing, of TF-IDF and SynPlusTF-IDF is identical for both methods up to the point where WordNet synset members’ scores are added to extracted keywords. If no synset members occur in a text, both TF-IDF and SynPlusTF-IDF produce the same list of extracted keywords with the same scores. This ensures that the implementation is as fair as possible for the two methods. The only thing to keep in mind is that in this project, a unique token is defined as one where the string is unique as well as its POS tag. This is not necessarily done every time TF-IDF is implemented, but it is a required step for SynPlusTF-IDF, which needs the POS tags in place for WordNet synset searches. Therefore, both TF-IDF and SynPlusTF-IDF have the same requirements and definitions for what a unique token is, keeping the comparison between them with as few influencing variables as possible. Finally, the TF-IDF implementation will be based on other implementations written in Java and articles describing the method [5, 19, 54, 55]. This project’s implementation will also be manually tested using a small dataset, testing two documents in a dataset of 10 documents in total, to ensure that the implementation is correct.

2.5 External Validity

Even though the topics of the abstracts in the database vary, they are still scientific abstracts relating to technology, engineering, physics, etc. This means that they are all of the same genre, which might be a factor that affects the measurements of the method. It is possible that SynPlusTF-IDF would perform differently on other genres, e.g. fiction, political speeches, or news stories. Another factor is text size, which can also affect the method’s accuracy. It could perform worse or differently on texts which are over 200 words in length. It could also perform worse on shorter texts. However, the actual contents and word usage in the texts is what should affect the results the most.

It is possible that the results from this experiment could also be true for other texts written in English, but not for all languages. Furthermore, in scientific language, synonyms and hypernyms are not used that much. Since clarity and conveying information are the main focus in the genre, synonyms could detract from those aspects. In, e.g., fiction, synonyms can be used to enhance the text and make it more lively. In political speeches, where the focus and objective can be to convince another party of something, synonyms and hypernyms can be used to invoke certain feelings in the listener. In the general rhetoric of debate, political or otherwise, synonym usage can act as a tool to help make a point and emphasize certain aspects of discussion [30].

It is also possible that the added synonym and hypernym scores could give certain words a higher score than they should have. The actual impact of this is further discussed in Section 6 Discussion.

2.6 Ethical Considerations

3 Implementation

The methods, TF-IDF and SynPlusTF-IDF, are implemented in Java. The Java program works by entering which folders are to be used as the corpus and root folder for all texts. After entering these paths, the program can be run, and the results in the form of extracted keywords are written to the console while TF-IDF and SynPlusTF-IDF scores are calculated. The results are also written to text files, one result file for each analyzed abstract. Finally, the extracted keywords are matched against the manually assigned keywords, and the result of this matching is written to a single .txt file as comma-separated values (CSV).

The program is written in such a way that as long as the dataset used is made up of text files with the base texts, abstracts in this case, and separate files with keywords and topics to match against, minimal changes are needed to make it work with another dataset for the text pre-processing stage. However, some larger changes might be needed elsewhere in the code depending on the structure and layout of the database used.

Java was chosen due to the extensive library of NLP tools available in the Apache OpenNLP library, which is written in Java [56, 57]. OpenNLP was chosen so that the actual task of implementing TF-IDF and SynPlusTF-IDF and comparing these two methods could be the focus of this project. More on this in Section 3.2 Pre-processing. For the actual design and implementation, getting explicit and clear results as output was the main focus and goal. Optimizations of the TF-IDF implementation, such as replacing the actual words with hash codes, were not prioritized, in favor of code simplicity.

The objectives O1 and O2 are completed with this implementation of TF-IDF and SynPlusTF-IDF.

Figure 3.1: Example output from the Java implementation.

In Figure 3.1, the word there appears twice as a keyword because the POS tagger has set the wrong tag for one occurrence of there. Token uniqueness is defined as not only having the stems of the strings be equal to each other, but also having the same POS tag. Having duplicate keywords in a case like this, where each occurrence of there should count as being equal, can be an issue due to the POS tagger not being precise enough in all cases. However, since the abstracts go through the same pre-processing for both TF-IDF and SynPlusTF-IDF, there is no discrepancy between the POS tagging, or any other pre-processing, for the two methods.

This implementation was manually tested on a small dataset consisting of 10 documents, as mentioned in Section 2.4 Internal Validity, and was shown to be consistent when manually calculating TF-IDF and SynPlusTF-IDF scores. Scores for both methods were calculated for two documents out of the 10, first using the Java implementation and then manually; the manual calculations produced the same results as the Java implementation.

3.1 WordNet

WordNet is a lexical database in English that allows for searches of words’ synonyms and hypernyms, among other things [21, 22]. Synonyms and hypernyms will be used in SynPlusTF-IDF as a way to add a semantic layer to topic detection, improving the ranking of keywords that describe a text’s topics. Searching for synonyms of the word car in WordNet returns the synset {car, auto, automobile, machine, motorcar} and the hypernyms {motor vehicle, automotive vehicle}.

Going further up the hypernym tree, the terms become more abstract. It is possible that a word being searched for has no synonyms and/or only very abstract and general hypernyms, such as object or entity, which would thus not be useful in topic detection. At a certain level, two words might share the same hypernym, such as entity, but otherwise be semantically only weakly related to each other. This would give these words an inaccurate score, since the text might not be referring to both words when the hypernym entity could be used in their places. Therefore, only the immediate synset members will be used in this model, meaning that there is no further traversal up the hypernym tree to synsets outside of the initial one.
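The thesis does not state which Java WordNet interface its SynsetGetter class wraps. As one illustration, here is a sketch of a first-sense lookup for the word car, assuming the MIT JWI library and a locally installed WordNet dict directory (both assumptions):

    import java.io.File;
    import java.util.List;
    import edu.mit.jwi.Dictionary;
    import edu.mit.jwi.IDictionary;
    import edu.mit.jwi.item.IIndexWord;
    import edu.mit.jwi.item.ISynset;
    import edu.mit.jwi.item.ISynsetID;
    import edu.mit.jwi.item.IWord;
    import edu.mit.jwi.item.POS;
    import edu.mit.jwi.item.Pointer;

    public class WordNetLookup {
        public static void main(String[] args) throws Exception {
            // Path to a local WordNet installation is an assumption.
            IDictionary dict = new Dictionary(new File("/usr/local/WordNet-3.0/dict"));
            dict.open();

            // Look up the noun "car" and take only its first (most frequent) sense,
            // as SynPlusTF-IDF does.
            IIndexWord idxWord = dict.getIndexWord("car", POS.NOUN);
            IWord firstSense = dict.getWord(idxWord.getWordIDs().get(0));
            ISynset synset = firstSense.getSynset();

            // Synset members: car, auto, automobile, machine, motorcar.
            for (IWord w : synset.getWords()) System.out.println("synonym: " + w.getLemma());

            // Immediate hypernyms: motor vehicle, automotive vehicle.
            List<ISynsetID> hypernyms = synset.getRelatedSynsets(Pointer.HYPERNYM);
            for (ISynsetID sid : hypernyms)
                for (IWord w : dict.getSynset(sid).getWords())
                    System.out.println("hypernym: " + w.getLemma());
        }
    }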

Figure 3.2 shows an example of how a search for the word car looks in the WordNet command-line interface (CLI) when showing all synsets. It shows that WordNet found five senses of the word car, and each sense’s synset is shown followed by its hypernyms, shown after the => sign. The five senses are ordered by their frequencies in text. That is, sense 1 is estimated to be the most commonly occurring sense, or meaning, of the word car in written language, and sense 5 is the least common sense of the word in written language.

3.2 Pre-processing

Before the texts are run through the methods, they are processed in several different ways. These are all standard NLP processes that make it possible to run the methods on the Inspec database abstracts. They are done with the help of Apache’s OpenNLP library [56, 57]. OpenNLP is written in Java and contains various tools for text and language processing. In this project, only pre-trained models are used when necessary, and no training of our own is ever done for any of the pre-processing steps [56, 57]. The pre-trained models and information about them can be viewed in [61, 62].

There are three steps in total in the pre-processing stage required for the TF-IDF algorithm here. First, a file is loaded. A file here refers to one .abstr file containing one scientific abstract from Inspec. Second, the file is divided into sentences using OpenNLP’s [56, 57] sentence detector. Third, each sentence is tokenized, i.e. divided into tokens, using OpenNLP’s tokenizer [56, 57]. The tokens can then have their TF-IDF scores calculated.

The new method requires one extra step: Part-Of-Speech (POS) tagging. It simply adds a tag to each word depending on what part of speech the word belongs to, e.g. whether it is a verb, preposition, noun, etc. This is also done using OpenNLP’s POS tagger. The POS tags are required by WordNet when looking up the synset of a word, to distinguish between e.g. man the noun and man the verb. The POS tags use the format created for the Penn Treebank Project [59, 60]. As mentioned earlier, even though TF-IDF does not require POS tags, they are still added to keep all pre-processing the same for both methods and to avoid any variables that could affect the results when comparing the two methods.
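A minimal sketch of this pre-processing pipeline using OpenNLP follows; it also includes the stemming step described in the next paragraph. The model file names (en-sent.bin, en-token.bin, en-pos-maxent.bin) are the standard pre-trained OpenNLP models and are assumed to be available locally; the example text is hypothetical:

    import java.io.FileInputStream;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.stemmer.PorterStemmer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class PreProcess {
        public static void main(String[] args) throws Exception {
            // Load the pre-trained models (file locations are an assumption).
            SentenceDetectorME sentenceDetector =
                    new SentenceDetectorME(new SentenceModel(new FileInputStream("en-sent.bin")));
            TokenizerME tokenizer =
                    new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")));
            POSTaggerME posTagger =
                    new POSTaggerME(new POSModel(new FileInputStream("en-pos-maxent.bin")));
            PorterStemmer stemmer = new PorterStemmer();

            String abstractText = "Cars were analyzed. The car performed well.";
            for (String sentence : sentenceDetector.sentDetect(abstractText)) {
                String[] tokens = tokenizer.tokenize(sentence); // split into tokens
                String[] tags = posTagger.tag(tokens);          // Penn Treebank POS tags
                for (int i = 0; i < tokens.length; i++) {
                    // A "unique token" in this project is the pair (stem, POS tag).
                    System.out.println(stemmer.stem(tokens[i].toLowerCase()) + "/" + tags[i]);
                }
            }
        }
    }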

In addition to the text pre-processing, two other processes are needed for the methods to work. The first is loading the corpus, i.e. loading all scientific abstracts from Inspec into memory so that IDF can be calculated. The second is stemming, which is used so that conjugations, pluralization, or other suffixes do not make the same words count as different words. That is, book and books should not be counted as separate words, but treated as the same word when counting the TF of a word. The stemming is done using the Porter Stemming Algorithm as implemented in OpenNLP [56, 57, 63].

3.3 The Methods


3.3.1 TF-IDF

The TF-IDF version used here is described in the list of steps below; the same computation is restated as formulas after the list. When calculate TF-IDF is mentioned in the list of steps of SynPlusTF-IDF, it refers to these steps. In short, TF is calculated as the raw frequency of a word in a document divided by the document's length. IDF is calculated by taking the logarithm of the corpus's size divided by the number of documents in the corpus in which the term occurs.

1. Let a be an abstract in the Inspec database.

2. Let t be a term in an abstract a.

3. Let c be the corpus of all abstracts, i.e. the Inspec database.

The steps for calculating TF-IDF:

1. Calculate TF: for each t in a, count the occurrences of t and divide the result by a's size.

2. Calculate the document frequency for each t from step 1 by counting the number of a in c that contain t.

3. Divide c's size by this count and take the natural logarithm of the result; this is the IDF score.

4. For each TF, multiply the TF with the corresponding IDF score.
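Restating the steps above as formulas, with f_{t,a} denoting the raw count of t in a, |a| the number of tokens in a, and |c| the number of abstracts in the corpus:

```latex
\mathrm{tf}(t,a) = \frac{f_{t,a}}{|a|}, \qquad
\mathrm{idf}(t,c) = \ln\frac{|c|}{\lvert\{\, a \in c : t \in a \,\}\rvert}, \qquad
\text{tf-idf}(t,a,c) = \mathrm{tf}(t,a) \cdot \mathrm{idf}(t,c)
```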

3.3.2 SynPlusTF-IDF

SynPlusTF-IDF builds upon TF-IDF and begins by first calculating TF-IDF scores for each word in a document. The variables used when describing the new method are:

1. Let a be an abstract in the Inspec database.

2. Let t be a term in an abstract a.

3. Let c be the corpus of all abstracts, i.e. the Inspec database.

4. Let w be the lexical database WordNet.

5. Let synset1 be the first-sense synset for a term fetched from WordNet.

6. Let m be a member of synset1.


SynPlusTF-IDF's steps are as follows (a minimal code sketch is given after the list):

1. For each t in a from c, calculate TF-IDF.

2. Extract synset1 for each t with a TF-IDF score from w.

3. For each m in synset1 that occurs in a, calculate TF-IDF scores.

4. Add each such m's TF-IDF score to the initial t's TF-IDF score, producing the SynPlusTF-IDF scores.

5. Sort the list of ts in descending order based on SynPlusTF-IDF scores.

6. Select the n top terms from the list of ts.
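The scoring and selection steps can be summarized in code. The sketch below is not the project's implementation: it assumes the TF-IDF scores for one abstract have already been computed into a map, and it uses a hypothetical firstSenseSynset function standing in for the WordNet lookup of step 2.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class SynPlusTfIdfSketch {

    /** Steps 2-4: add each co-occurring synset member's TF-IDF score to t's score. */
    static Map<String, Double> synPlusTfIdf(Map<String, Double> tfidf,
                                            Function<String, Set<String>> firstSenseSynset) {
        Map<String, Double> scores = new HashMap<>(tfidf);
        for (String t : tfidf.keySet()) {
            for (String m : firstSenseSynset.apply(t)) {
                // Only synset members that occur in the abstract themselves,
                // and therefore have their own TF-IDF score, contribute to t.
                if (!m.equals(t) && tfidf.containsKey(m)) {
                    scores.merge(t, tfidf.get(m), Double::sum);
                }
            }
        }
        return scores;
    }

    /** Steps 5-6: sort descending by score and keep the top n terms. */
    static List<String> topTerms(Map<String, Double> scores, int n) {
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort(Comparator.comparingDouble((String s) -> scores.get(s)).reversed());
        return terms.subList(0, Math.min(n, terms.size()));
    }
}
```

Note that two terms in the same synset that both occur in the abstract will each receive the other's TF-IDF score added to their own.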


Figure 3.3: Method flow illustration.

Figure 3.3 shows an illustration of how the flow of SynPlusTF-IDF can look. The calculations for TF-IDF are not specified in the figure, just as in the list of SynPlusTF-IDF steps above.

3.4 Classes

This project's implementation consists of six classes in total, including Main.java. The five other classes are TextProcessor.java, SynsetGetter.java, Term.java, SynPlusTFIDFCalculator.java, and ResultsCalculator.java.


Once this processing is done and both methods have been run, the final results are returned, written to a results file, and printed to the console, sorted in descending order based on each Term's score.


4 Results

The results from TF-IDF and SynPlusTF-IDF are presented in the sections below, where the two methods have been run on the Inspec database, thus completing objective O3. Precision and recall, as well as F1-scores, have been calculated for both methods. The calculations of precision, recall, and F1-score for the two n cut-offs used, 11n and max n, are each presented in their respective sections.

4.1 11n cut-off

The results for TF-IDF and SynPlusTF-IDF are presented in Table 4.1 below, matched against the original, non-tokenized keywords.

                    TF-IDF    SynPlusTF-IDF   Man. Assigned Keywords
Matches             1413      1419            22147
Precision           0.06423   0.06450
Recall              0.41679   0.58548
F1                  0.11131   0.11620
Keywords Extracted  22000     22000
Missed Keywords     147       147

Table 4.1: TF-IDF and SynPlusTF-IDF against non-tokenized keywords using 11n.

Table 4.1 shows the results as the number of matches against the manually assigned keywords. The total number of manually assigned keywords that were matched against for each result is shown in the Man. Assigned Keywords column in Table 4.1 and each following results table. The n cut-off at 11 limits the two methods' results to 22000 keywords that could match against the 22147 manually assigned keywords. The row Missed Keywords is the total number of possible matches that the 11n cut-off missed. The highest possible share of matches for an n cut-off at 11 is therefore ≈ 0.993 (22000 out of 22147). The precision results for TF-IDF and SynPlusTF-IDF are close to each other, with both methods having a precision of ≈ 0.064 when comparing their extracted keywords against the manually assigned keywords. SynPlusTF-IDF has slightly higher precision than TF-IDF.

SynPlusTF-IDF has slightly better recall compared to TF-IDF, with a recall of 0.58548 against 0.41679 for TF-IDF. The F1-scores, shown in the row labeled F1 in the table, for TF-IDF and SynPlusTF-IDF are similar to each other, 0.11131 and 0.11620 respectively.

                  Actual
               P         N
Predicted  P   1413      20587
           N   1982      134570

Table 4.2: Confusion matrix of TF-IDF results, non-tokenized keywords using 11n.


                  Actual
               P     N
Predicted  P   TP    FP
           N   FN    TN

Table 4.3: Confusion matrix legend.

Table 4.3 contains the legend for all confusion matrices where TP means true positive, FP means false positive, FN is false negative, and TN is true negative.
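In terms of this legend, the metrics reported in this chapter follow the standard definitions, restated here for reference:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```

For example, for TF-IDF in Table 4.2, precision = 1413 / (1413 + 20587) = 1413 / 22000 ≈ 0.06423, which matches Table 4.1.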

                  Actual
               P         N
Predicted  P   1419      20581
           N   1997      134570

Table 4.4: Confusion matrix of SynPlusTF-IDF results, non-tokenized keywords using 11n.

The results for SynPlusTF-IDF matched against the non-tokenized keywords are presented in a confusion matrix shown in Table 4.4.

The results for TF-IDF and SynPlusTF-IDF with extracted keywords matched against the tokenized versions of the original keywords are presented in Table 4.5. Here, the manually assigned keywords have been tokenized using the same tokenizer as mentioned in Section 3.2 Pre-processing.

                    TF-IDF    SynPlusTF-IDF   Man. Assigned Keywords
Matches             3102      2946            44989
Precision           0.06895   0.06548
Recall              1         0.99932
F1                  0.12901   0.12291
Keywords Extracted  22000     22000
Missed Keywords     22989     22989

Table 4.5: TF-IDF and SynPlusTF-IDF against tokenized keywords using 11n.


Here, the 22000 keywords extracted by each method are less than half of the number of maximum possible matches. Only ≈ 0.489 is the highest possible precision here when looking at the total number of tokenized manually assigned keywords that could have been matched.

Recall is almost perfect for both methods; however, since the manually assigned keywords have been tokenized here, the number of missed keywords is over half of all possible matches. This means that recall in these calculations gives a somewhat inaccurate image of the methods' actual recall. The F1-scores, which are a little higher compared to the non-tokenized keywords, 0.12901 and 0.12291, give a better image of the results for the tokenized manually assigned keywords when using the 11n cut-off.

                  Actual
               P         N
Predicted  P   3102      18898
           N   0         134570

Table 4.6: Confusion matrix of TF-IDF results, tokenized keywords using 11n.

Table 4.6 shows a confusion matrix with the results for TF-IDF when matching against the tokenized keywords.

                  Actual
               P         N
Predicted  P   2946      19054
           N   2         134750

Table 4.7: Confusion matrix of SynPlusTF-IDF results, tokenized keywords using 11n.

The results for SynPlusTF-IDF matched against the tokenized keywords are presented in a confusion matrix shown in Table 4.7.

4.2 max n cut-off

The results for TF-IDF and SynPlusTF-IDF when using max n are presented in Table 4.8, matched against the original, non-tokenized keywords.

                    TF-IDF    SynPlusTF-IDF   Man. Assigned Keywords
Matches             1683      1701            22147
Precision           0.07599   0.07680
Recall              1         1
F1                  0.14125   0.14264
Keywords Extracted  22147     22147

Table 4.8: TF-IDF and SynPlusTF-IDF against non-tokenized keywords using max n.


In Table 4.8, the results of the matches when using max n are shown. Here, the maximum possible precision score is 1, since there is effectively no n cut-off. TF-IDF and SynPlusTF-IDF are again close in their precision scores, with precisions of 0.07599 and 0.07680 when calculated against the manually assigned keywords. SynPlusTF-IDF has slightly higher precision than TF-IDF.

As the max n cut-off allows all possible manually assigned keywords to be matched, recall is perfect in this scenario with a score of 1. However, since this cut-off removes the possibility of having any false negatives, i.e. keywords outside of the extracted top n list that occur in the manually assigned keywords file, the calculation will always be

\[
\mathrm{recall} = \frac{TP}{TP + FN} = \frac{TP}{TP} = 1 \tag{9}
\]

Table 4.8 also contains the highest F1-scores for the two methods, each method scoring around 0.14.

                  Actual
               P         N
Predicted  P   1683      20464
           N   0         0

Table 4.9: Confusion matrix of TF-IDF results, non-tokenized keywords using max n.

Table 4.9 shows a confusion matrix with the results for TF-IDF when matching against the non-tokenized keywords.

                  Actual
               P         N
Predicted  P   1701      20446
           N   0         0

Table 4.10: Confusion matrix of SynPlusTF-IDF results, non-tokenized keywords using max n.

The results for SynPlusTF-IDF matched against the non-tokenized keywords are presented in a confusion matrix shown in Table 4.10. The results for TF-IDF and SynPlusTF-IDF with extracted keywords matched against the tokenized keywords are presented in Table 4.11.

The results for matches against the tokenized keywords are presented in Table 4.11. They are similar to those in Table 4.5, but with SynPlusTF-IDF having a slightly lower precision compared to TF-IDF. However, both have precisions close to each other, where the two methods score ≈ 0.07 and the maximum possible precision is 1.

As mentioned earlier, recall with this cut-off will always give a perfect recall score of 1. The F1-scores for the two methods are close to each other, as in the other results tables.


                    TF-IDF    SynPlusTF-IDF   Man. Assigned Keywords
Matches             3102      2948            44989
Precision           0.06895   0.06553
Recall              1         1
F1                  0.12901   0.12300
Keywords Extracted  44989     44989

Table 4.11: TF-IDF and SynPlusTF-IDF matched against tokenized keywords using max n.

                  Actual
               P         N
Predicted  P   3102      41887
           N   0         TN

Table 4.12: Confusion matrix of TF-IDF results, tokenized keywords using max n.

Table 4.12 shows a confusion matrix with the results for TF-IDF when matching against the tokenized keywords.

The results for SynPlusTF-IDF matched against the tokenized keywords are presented in a confusion matrix shown in Table 4.13.

4.3 Summary

Table 4.14 contains a summary of the precision, recall, and F1-score results for each n cut-off and whether the manually assigned keywords were tokenized or not.


                  Actual
               P         N
Predicted  P   2948      42041
           N   0         TN

Table 4.13: Confusion matrix of SynPlusTF-IDF results, tokenized keywords using max n.

                       TF-IDF    SynPlusTF-IDF   TF-IDF    SynPlusTF-IDF
                       11n       11n             max n     max n
Precision              0.06423   0.06450         0.07599   0.07680
Recall                 0.41679   0.58548         1         1
Precision (tokenized)  0.06895   0.06548         0.06895   0.06553
Recall (tokenized)     1         0.99932         1         1
F1                     0.11131   0.11620         0.14125   0.14264
F1 (tokenized)         0.12901   0.12291         0.12901   0.12300

Table 4.14: Summary of all results for TF-IDF and SynPlusTF-IDF.

Figure 4.1 contains the same results as summarized in Table 4.14, presented in a bar graph. The x-axis has both methods with the two cut-offs grouped along with whether the manually assigned keywords were tokenized or not.


5 Analysis

The previous section showed the results for both TF-IDF and SynPlusTF-IDF when run on the Inspec dataset. The summarized metrics, precision, recall, and F1-score, shown in Table 4.14, indicate that both methods performed poorly at the task of topic detection. The F1-scores, ranging from 0.11131 to 0.14264, show that neither method could successfully detect the topics that the professional indexer had manually assigned to each abstract. This could be explained by several different reasons.

First, tokenization and how it determines what a token is could influence the results in a significant way. As the tokenizer used here does not detect compound terms consisting of several words as one token, it cannot match against the compound keywords found in the manually assigned keywords. Even when tokenizing the manually assigned keywords, there was no real improvement in the results. This can be explained by the fact that while the raw number of matched keywords increased, the number of manually assigned keywords was now more than doubled. Furthermore, some of the manually assigned keywords in the Inspec database would be hard for a tokenizer to recognize as one token unless the tokenizer was trained specifically for the task of tokenizing scientific literature. For example, while time series could be a trivial tokenization for a tokenizer that is more advanced than simply using whitespace to determine whether something is a token or not, undiscounted single-controller stochastic games could be much harder to classify as a single token, and more advanced tokenization rules would be needed. A noun phrase extractor should be used instead for such division of written text, since it would be hard to count undiscounted single-controller stochastic games as one token by most rules.

Second, the genre of scientific literature might not be the best suited for detecting topics using synonyms and hypernyms. A genre more oriented towards fiction, poetry, or prose might give results that more clearly show the possible use that synonyms and hypernyms can have in topic detection. However, this does not account for the fact that TF-IDF also performed poorly in this experiment, which leads to the third possible reason for the low scores.

Third, text length and the shortness of the abstracts could play a role. As shown earlier in Table 1.1, the mean length of the abstracts is 136 tokens. It is possible that longer texts could improve the results for both methods, suggesting that for texts with a mean length of 136 tokens, better-suited approaches are necessary for a good and true indication of what topics a certain text is discussing.

Fourth, since word uniqueness is defined as the stem of one string matching the stem of another string, with both words also having the same POS tag, duplicated words can appear in the top n list where they perhaps should not. If the POS tagger has made a mistake about a word's part of speech, the word will appear once for each distinct POS tag it has been given among the extracted keywords. If the word match as a verb and the word match as a noun are used in the same abstract, they should appear as two separate keywords, as they actually are different words. However, knowing when the wrong POS tag has been set in these cases is impossible to verify without actually reading each abstract and inspecting the duplicate keyword entries and which POS they, in fact, are. Out of all extracted keywords, 153792 in total, there were only 718 duplicated keywords, indicating that the impact of this on the results is quite small. Furthermore, not all of those 718 duplicated keywords are incorrect, suggesting that the impact this could have is even smaller.


As noted above, the recall scores should not be given too much weight, and the focus should lie on the F1-scores, which give a more balanced result. TF-IDF and SynPlusTF-IDF both performed poorly at topic detection in the Inspec database. Furthermore, SynPlusTF-IDF did have a slightly higher F1-score than TF-IDF when matching against the non-tokenized manually assigned keywords; however, the opposite was true when tokenizing the keywords. The differences between the F1-scores of TF-IDF and SynPlusTF-IDF are very small, only starting to differ at the third decimal, indicating an overall extremely similar accuracy at topic detection.

References
