
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Information Technology

2020 | LIU-IDA/LITH-EX-A--20/021--SE

Exploring Emerging Entities and Named Entity Disambiguation in News Articles

Utforskande av Framväxande Entiteter och Disambiguering av Entiteter i Nyhetsartiklar

Robin Ellgren

Supervisor: Marco Kuhlmann
Examiner: Arne Jönsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Publicly editable knowledge bases such as Wikipedia and Wikidata have grown tremendously in size over the years. Despite the quick growth, they can never be fully complete due to the continuous stream of events happening in the world. The task of Entity Linking attempts to link mentions of objects in a document to their corresponding entries in a knowledge base. However, due to the incompleteness of knowledge bases, new or emerging entities cannot be linked. Attempts to solve this issue have created the field referred to as Emerging Entities. Recent state-of-the-art work has addressed the issue with promising results in English. In this thesis, the previous work is examined by evaluating its method in the context of a much smaller language: Swedish. The results reveal an expected drop in overall performance, although the method remains relatively competitive. This indicates that the method is a feasible approach to the problem of Emerging Entities even for much less used languages. Due to limitations in the scope of the related work, this thesis also suggests a method for evaluating the accuracy of how the Emerging Entities are modeled in a knowledge base. The study also provides a comprehensive look into the landscape of Emerging Entities and suggests further improvements.


Acknowledgments

Firstly, I would like to thank my supervisor, Marco Kuhlmann, for all the help, guidance, and support I received. Furthermore, Marco also introduced me to the topic of NLP and has been a source of inspiration throughout my studies. Additionally, I would like to thank my examiner Arne Jönsson as well as the whole group of thesis students at NLPLAB for insightful and inspiring seminars. Lastly, I would like to thank the whole of iMatrics AB for commissioning this thesis. This includes providing the dataset used, an inspirational environment, and continuous support throughout the project.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
   1.1 Motivation
   1.2 General application
   1.3 Aim
   1.4 Research questions
   1.5 Delimitations
2 Theory
   2.1 Terminology definition
   2.2 Vector representations
   2.3 Named Entity Recognition (NER)
   2.4 Named Entity Disambiguation (NED)
   2.5 Emerging Entities: Problem definition
   2.6 Gradient Boosting Classification Tree
3 Related Work
   3.1 Summary
   3.2 Mining evidences
   3.3 Named Entity Disambiguation
   3.4 Detecting Emerging Entities
4 Method
   4.1 Build a Knowledge Base
   4.2 Abstract pipeline
   4.3 Named Entity Recognition
   4.4 Candidate generation
   4.5 Named Entity Disambiguation
   4.6 Emerging Entity Detection
   4.7 Generating gold data
   4.8 System evaluation
5 Results
   5.1 Built knowledge base
   5.2 EE Label
   5.3 EE Linking
6 Discussion
   6.1 Results
   6.2 Method
7 Conclusion
   7.1 Research questions
   7.2 Contributions
   7.3 Future work
Bibliography


List of Figures

2.1 Example of NER-tagging
2.2 Example of a simple decision tree
4.1 Wikidata extracted properties per entity type
4.2 Example of parsed Wikipedia entry
4.3 General pipeline
4.4 Example NED document
4.5 Example of weighted graph used for entity disambiguation
4.6 Generating gold data pipeline for a single document


List of Tables

4.1 Grid search variables
4.2 Data used
5.1 Built knowledge base
5.2 Statistics on the dataset used and other comparable datasets
5.3 EE Label Results
5.4 Grid search results


List of Abbreviations

EE Emerging Entity
EED Emerging Entity Discovery
KB Knowledge Base
NED Named Entity Disambiguation
NER Named Entity Recognition
NLP Natural Language Processing


1 Introduction

Over the past years, we have seen the internet grow tremendously. The increase in popularity and size has arguably made the internet a major source of information that people nowadays depend on for finding information across a large set of domains. This is best exemplified by the rise in popularity of publicly editable Knowledge Bases (KBs) such as Wikipedia and Wikidata. These knowledge bases are structured around concepts, and the idea is that one concept (e.g. Artificial Intelligence or Donald Trump) is defined by an article or table containing contextual information and hyperlinks to other concepts that somehow relate to it.

By generalizing the idea of these hyperlinks representing relations between concepts, it is possible to build systems that are queryable on the relations between concepts. Examples of such systems include, but are not limited to, DBPedia1 (Auer et al. 2007) and Google Knowledge Graph2. Systems like these have applications such as question answering, text classification, and many more in the context of automatic text processing. However, the information in knowledge bases that often makes up the definitions of concepts in those systems is limited. Human-annotated knowledge bases can realistically never contain all concepts that currently exist in the world, nor all the world's information about the concepts that do exist in the knowledge base. This holds even though knowledge bases are growing at high speed, with English Wikipedia as the largest one now exceeding 6M articles. As an illustrative example of the limitedness of knowledge bases, the chairman of the municipality of Linköping in Sweden has no Wikipedia article in the Swedish Wikipedia (or any other language)3. The lack of either concepts or concept contexts poses a problem when, for example, newspapers write about such concepts and are then unable to utilize knowledge base information about them.

Revisiting the idea of knowledge bases where each concept is defined by its contextual words and its set of relations to other concepts, one (e.g. a newspaper) could then want to map mentions in unstructured text to their corresponding definitions in a knowledge base. By restricting said mapping to entities (e.g. people, organizations, and places), the task commonly referred to as Named Entity Disambiguation (NED) or Entity Linking is reinvented.

1. DBPedia is a queryable database built on the facts section of Wikipedia: https://dbpedia.org/
2. Google Knowledge Graph is a queryable database built on statements from Wikipedia, Wikidata and the CIA World Factbook


The big challenge of NED is to navigate a set of ambiguous entities (in terms of the entities' names). For example, Michelle Williams represents four people in Wikipedia, shown below:

Michelle Williams, American actress

Michelle Williams, American singer, songwriter, and actress
Michelle Williams, Canadian swimmer

Michelle Ann Williams, American public health scholar

One approach to solving this is to utilize the fact that relations between concepts exist. In the early stages of NED, using the existence of relations was shown to be a reasonably effective method for solving this type of problem (Han and Zhao 2009; Han et al. 2011). However, there are still issues with this approach: it does not perform well enough to reach acceptable results for the industry, and it leaves out a major factor. As we live in a highly dynamic world with a never-ending stream of events (in this thesis represented as news articles), the events will sometimes feature entities never seen before that are thus non-existent in any knowledge base (i.e. not-in-KB). Those entities are of great importance for this thesis and, from here on, like Hoffart et al. (2014), they will be referred to as Emerging Entities (EEs).

In an attempt to make the task of NED more accurate and complete for the media industry, this thesis sets out to extend existing KBs (e.g. Wikipedia and Wikidata) with claims about in-KB entities and to include Emerging Entities in multiple languages. The main hypothesis is that the set of relevant Emerging Entities and their contextual claims to already in-KB entities are present in a collected set of news articles and/or a continuous stream of articles. These claims should be useful for defining the Emerging Entities and for improving the performance of NED. Furthermore, since annotating big sets of articles with gold data is a very expensive process, a solution must be derived from unsupervised machine learning.

1.1 Motivation

This thesis is commissioned by iMatrics AB, a Natural Language Processing-focused company based in Linköping, Sweden, that offers a variety of text analytical services to the media industry in various languages. As such, being able to offer a multilingual NED service with support for EEs is of great importance, as it has implications for topic modeling, automatic tagging of articles, and news recommendation systems. iMatrics will provide streams of news articles written in Swedish.

1.2 General application

The motivation for this thesis is also applicable to general interests, the most exciting application being the automatic creation of new entities in publicly editable knowledge bases such as Wikipedia and Wikidata. As suggested by Graus et al. (2018), monitoring news articles for emerging entities is a viable method for that application.

1.3 Aim

By utilizing a large set of news articles, public knowledge bases, and unsupervised machine learning, this thesis aims to:

• Investigate disambiguation for in-KB entities and how the inclusion of claims from other sources affects said disambiguation


• Investigate the concept of Emerging Entities in news articles. More specifically, how to define an Emerging Entity and how to handle the increased uncertainty in NED models introduced by Emerging Entities

• Investigate how a NED system with EE performs with the goal of running it at large scale with industry demands

• Build a multilingual NED system with EE

1.4 Research questions

The following research questions need to be answered to reach the aim:

1. How do state-of-the-art NED with EE systems perform in other language domains than English (i.e. Swedish)?

• Current state-of-the-art NED with EE systems have only been evaluated on English (Hoffart et al. 2014; Wu et al. 2016; Zhang et al. 2019).

• The evaluation is performed by comparing whether an entity is labeled as EE when using the current knowledge base and as an actual entity when using the future knowledge base. This corresponds to the measurement EE-Label (see Section 4.8).

2. How well do models of the Emerging Entities link to entities in a future Knowledge Base?

• Hoffart et al. (2014), Wu et al. (2016), and Zhang et al. (2019) all create a model for the Emerging Entities labeled "EE". This is used to evaluate the system according to research question one. However, the models are never linked to entities in a "future" version of the KB. To the best of our knowledge, this thesis is the first to attempt this problem.

• The evaluation is performed using the measurement EE-Linking (see Section 4.8).

1.5 Delimitations

Given the high aim and the limited scope of the project, only the Swedish language will be considered. Furthermore, concepts will be restricted to entities of the types persons, organizations, and places.

2 Theory

This chapter starts with an overview of the terminology used in this thesis. It continues by giving the reader insight into the building blocks used in the method, from both a historical and a theoretical perspective.

2.1 Terminology definition

The following terms are of central importance to the thesis. Therefore, a definition of how they should be interpreted by the reader is provided in this section.

• Knowledge Base (KB)

A Knowledge Base in this thesis refers to a database-like structure holding entities, including a unique identifier per entity. A knowledge base is denoted $K$. Examples of KBs include, but are not limited to, Wikipedia, Wikidata, YAGO, and DBPedia.

• Entity

An entity in this thesis refers to a real-world object. The object should be uniquely identifiable in a knowledge base. Let $e$ denote an entity; then $e \in K$. As per Section 1.5, entities are also restricted to the types person (PER), organization (ORG), and location (LOC).

• Surface form

Each entity $e \in K$ can be represented as textual phrases in news articles or in Wikipedia articles. Like Färber et al. (2016), these are referred to as the entity's surface forms. Each entity must hold at least one surface form. There is no restriction that the surface forms must be unique to an entity. A surface form is denoted $s$.

• Mention

When a surface form appears in a textual phrase, an entity has been mentioned. Each textual phrase in a news article may contain no, one, or multiple mentions. The entity corresponding to the mention may, but does not need to, be ambiguous with respect to which $e \in K$ is actually referred to.


• Out-of-KB

As previously stated, any entity must be present in a KB. If an entity is not present in a KB, it can be referred to as Out-of-KB.

• Shadow Entity

A shadow entity is an abstract representation of an entity that is out-of-KB. The representation can be inserted into a knowledge base and will then obtain a uniquely identifiable shadow entity id. The representation in this thesis consists of a collection of context sentences in which the entity has appeared as well as a number of surface forms.

• Emerging Entity (EE)

Defined in this thesis as all entities for which logical expression 2.3 or 2.4 holds.

2.2 Vector representations

For many applications within the field of Natural Language Processing (NLP), representing words (terms) or whole documents as vectors instead of raw text is preferable. This is because a vector representation allows various mathematical operations to be executed, while raw text does not. The idea has been thoroughly researched, and there exists a wide range of techniques that can be applied to convert a collection of texts to vector representations (Salton et al. 1975; Leite and Rino 2008; Devlin et al. 2019). Note that the terms vector and embedding can be used interchangeably.

While these representations are not the focus of this thesis, a general understanding of the mapping into vector space is important for further reading.

2.2.1 Term Frequency–Inverse Document Frequency (tf-idf)

A naive way of creating a vector would be to let each possible word in the vocabulary have a unique index in the vector. A document vector is then the count of occurrences for each word. However, such a simple model has several issues. For example, some words that are very frequent in natural language, e.g. "a", "and", and "the", carry very little semantic meaning. These would overshadow the other, more interesting words. Another issue is that the significance of the words is not related to the entire document collection.

A more advanced but popular model is term frequency-inverse document frequency (tf-idf). The model assigns the highest scores to terms that occur frequently in a small number of documents, thus indicating that those documents are semantically different from the other documents. The lowest score for a term is assigned when the term occurs frequently in many documents.

Let the term frequency, denoted $tf_{t,d}$, be the number of times term $t$ appears in document $d$. Let the document frequency, denoted $df_t$, be the number of documents that contain the term $t$. Note that terms can be both single terms (unigrams) and n-grams. Let $D$ be the set of all documents in the collection. The inverse document frequency is then calculated by $idf_t = \log(|D| / df_t)$. It is common to add one to the denominator in order to avoid division by zero, referred to as add-one smoothing. By combining the two measurements, the tf-idf value is calculated by $tfidf_{t,d} = tf_{t,d} \cdot idf_t$ (Manning et al. 2008).
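To make the weighting concrete, the following is a minimal sketch of the formula above in Python; the toy document collection is illustrative, and a production system would typically rely on an existing implementation such as scikit-learn's TfidfVectorizer.

```python
# Minimal tf-idf sketch following the definitions above (illustrative only).
import math
from collections import Counter

docs = [
    "linköping is a city in sweden",
    "stockholm is the capital of sweden",
    "the knowledge base contains entities",
]

def tfidf(term, doc, collection):
    tf = Counter(doc.split())[term]                  # tf_{t,d}: occurrences of the term in the document
    df = sum(term in d.split() for d in collection)  # df_t: documents containing the term
    idf = math.log(len(collection) / (df + 1))       # add-one smoothing in the denominator
    return tf * idf

print(tfidf("sweden", docs[0], docs))    # common term -> low weight
print(tfidf("entities", docs[2], docs))  # rare term -> higher weight
```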

2.2.2 Context independent word embeddings

The vectorization method presented in Section 2.2.1 is purely statistical. More recent work attempts to model words by including their semantic meaning. One of the most prominent examples of this new approach is Word2Vec by Mikolov, Chen, et al. (2013). For example, the vectors of Gothenburg and Stockholm should be close neighbors in the vector space. In broad terms, each word in the vocabulary is initialized with random values of a fixed size (in this thesis the very common 300 dimensions is used). Then a large corpus of documents is traversed, and for each word the vectors of the surrounding words (i.e. its context) are used together, as a continuous bag-of-words model, as input to a neural network that predicts the word. If the model fails, the nodes in the network are penalized; likewise, if it is correct, the nodes' behavior is reinforced.

An important feature of the described method is that the only required input data is large amounts of raw text; no "gold-labeled" data is required. The embeddings are referred to as context independent because the embedding is mapped to words in a one-to-one relation. This means that words that are ambiguous (e.g. duck could refer to either a duck (bird) or the verb to duck) have the same embedding regardless of the context. Another shortcoming of context independent word embeddings is that words that are not present in the raw text lack an embedding.

Examples of previous work that have applied the described method and successfully built context independent word embeddings include GloVe (Pennington et al. 2014), Polyglot (Al-Rfou et al. 2015), and FastText (Athiwaratkun et al. 2018).
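As an illustration of the training procedure described above, the sketch below builds CBOW embeddings with the gensim library; the toy corpus and all parameter choices apart from the 300-dimensional vectors are assumptions, not the setup used in the cited works.

```python
# A minimal CBOW training sketch with gensim's Word2Vec (illustrative corpus).
from gensim.models import Word2Vec

sentences = [
    ["stockholm", "is", "the", "capital", "of", "sweden"],
    ["gothenburg", "is", "a", "city", "on", "the", "west", "coast", "of", "sweden"],
]

# sg=0 selects the continuous bag-of-words model; vector_size=300 matches the thesis
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=0)

vector = model.wv["stockholm"]   # one fixed embedding per word, independent of context
print(model.wv.similarity("stockholm", "gothenburg"))
```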

2.2.3 Context dependent word embeddings

Rather than mapping a word to a pre-trained word embedding in a one-to-one relation (as in Section 2.2.2), context dependent word embeddings can be used. In general terms, the embedding is calculated on the fly as a function of the whole input text (e.g. the context sentence). Thus, the shortcoming of a one-to-one relation between words and embeddings is mitigated. Using context dependent word embeddings has been shown to be very successful (sometimes state-of-the-art) in multiple NLP tasks, e.g. question answering, coreference resolution, and NER (Peters et al. 2018).

Context dependent word embeddings such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2019) can easily be used with the Transformers library (Wolf et al. 2019).
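A minimal sketch of obtaining context dependent embeddings with the Transformers library is shown below; the multilingual BERT checkpoint is an assumption, and any BERT-style model (e.g. a Swedish one) could be substituted.

```python
# Context-dependent token embeddings via the Transformers library (sketch).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-cased"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Ducks duck under the bridge.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token, computed from the whole sentence, so the noun "Ducks"
# and the verb "duck" receive different embeddings.
token_embeddings = outputs.last_hidden_state[0]
print(token_embeddings.shape)
```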

2.3 Named Entity Recognition (NER)

An important pre-processing step in many NLP applications is the task of Named Entity Recognition (NER). The task consists of identifying token sequences in unstructured text (e.g. news articles) that are entities and labeling them with their corresponding entity type. The labeled tokens can then be used as mentions, serving as input for further processing with NED. The NER task has lately seen significant improvements in results (Straková et al. 2019; Jiang et al. 2019; Ghaddar and Langlais 2018) but has long been a popular field of research. For example, in 2003 there was a CoNLL shared task focusing on NER (Sang and De Meulder 2003).

Figure 2.1 shows an example of NER-tagging, with pink, green, and purple representing the entity types ORG, LOC, and PER, respectively.

Manchester United is an English association football club which plays in the Premier League. Their manager is Norwegian Ole Gunnar Solskjær.

(a) Example text before NER-tagging

(b) The same text after NER-tagging; in the original figure the mentions are highlighted in pink (ORG), green (LOC), and purple (PER)

Figure 2.1: Example of NER-tagging

Many successful approaches to the task have in the past relied on supervised or semi-supervised machine learning techniques (Jiang et al. 2019; Carreras et al. 2002; Florian et al. 2003). However, such techniques require human-annotated gold data, which can be hard to get hold of. Alternatively, features from publicly editable knowledge bases (e.g. Wikipedia) can be used to harvest information and build reliable NER-taggers. For example, Polyglot-NER (Al-Rfou et al. 2015) uses neural word embeddings in combination with Wikipedia internal links and oversampling to create a reliable NER-tagger for 40+ languages. Completely unsupervised methods that train only on raw text have recently shown great promise within the task of NER, the most prominent example being BERT (Devlin et al. 2019), which is also multilingual but can be fine-tuned for a specific language, e.g. Swedish1.

2.4 Named Entity Disambiguation (NED)

After gathering a number of mentions (for example with NER), the next step is typically to map them onto a knowledge base. Each mention can have multiple options to which it could be mapped, referred to as candidates. Since there are multiple options, i.e. an ambiguity, this task is referred to as NED. By searching the knowledge base for possible surface forms, candidates for the mapping can be obtained. Simple but effective heuristics can use popularity-based metrics such as the longest Wikipedia article, the number of links in Wikipedia, or the most inhabited place (if the entity is of type LOC). More advanced heuristics consider the context (surrounding words) in which the mention appears and compare its similarity to the contexts of the candidate entities. There exist successful attempts with this approach that take advantage of machine-learning techniques (Milne and Witten 2008). However, as pointed out by Hoffart et al. (2011), the context-similarity approach works best for longer texts but fails to achieve human-like results for shorter texts, e.g. news articles.

In order to improve the results, the entities could be disambiguated with respect to each other as a collective instead of individually, referred to as coherence (Kulkarni et al. 2009). This approach attempts to maximize the global probability of the collective entities, rather than the local probabilities for each entity, by considering the similarity between mentions and entities as well as the coherence between entities. The drawback of such an approach is that the chosen candidates are more likely to be similar. This could be a problem with short texts that contain few entity mappings, since the entities that are present are forced into a single topic related to the other entities (Hoffart et al. 2011). For example, in the text in Figure 2.1, the mention "English" might get mapped onto the entity "England_national_football_team" instead of the correct entity "England" because of the other entities' similarity to the topic of soccer.

More recent approaches to NED make use of neural networks to capture the similarity between mention contexts and candidate entities (Sun et al. 2015; Francis-Landau et al. 2016). Transforming the problem into finding dense sub-graphs within a network of nodes (entities and mentions) in a probabilistic manner is also attracting attention (Han et al. 2011; Zeng et al. 2018).

Many NED methods yield a confidence score (Kulkarni et al. 2009; Ratinov et al. 2011; Han et al. 2011). Since there is a possibility that a mention does not have a corresponding entity in the knowledge base, mentions may be mapped onto the special entity NIL when the confidence is not higher than some threshold (Ratinov et al. 2011; Han et al. 2011; Wu et al. 2018).

2.5 Emerging Entities: Problem definition

When addressing the problem of Emerging Entities, four distinct challenges arise (Färber et al. 2016). This section aims at clearly defining these challenges in order for them to be properly addressed. For clarity, the given parameters are named similarly to Färber et al. (2016).


• Let $t_0$ denote the point in time referred to as "current", and let $t_1$ denote a point in time referred to as the "future"; it is then given that $t_0 < t_1$. KBs evolve over time, thus there are different versions of the same KB. Let $K_{t_0}$ denote a certain KB at the current time point, and let $K_{t_1}$ denote the same KB at the "future" time point.

• Let the elapsed time from $t_0$ to $t_1$ be denoted by $\Delta t$, i.e. $\Delta t = t_1 - t_0$. Since this thesis investigates Emerging Entities, entities or surface forms deleted during $\Delta t$ are kept in the KB $K_{\Delta t}$. The entities that emerged during $\Delta t$ are then all $e \in K_{\Delta t} = K_{t_1} \setminus (K_{t_1} \cap K_{t_0})$.

• A set of surface forms for each entity $e \in K_{t_0}$, denoted $S^e_{t_0}$.

• A set of surface forms for each entity $e \in K_{\Delta t}$, denoted $S^e_{\Delta t}$.

• A function $f$ which maps a mention $m$ in a news article to its corresponding entity in $(K_{t_0} \cup K_{\Delta t})$: $f : m \to e \in (K_{t_0} \cup K_{\Delta t})$.

2.5.1 Challenges

Similar to Zhang et al. (2019), this thesis makes use of the Emerging Entities challenge definitions from Färber et al. (2016). Four different challenges are listed, which by definition frame the scope of Emerging Entities. The challenges listed below are conceptually equivalent to the definitions of Färber et al. (2016), but include changes to reflect the given parameters listed in Section 2.5.

Challenges 3 and 4 together make up the task of Emerging Entities, while Challenges 1 and 2 are connected to the strongly related task of NED (see Section 2.4).

Challenge 1: Known surface form, known entity

There exists one and only one entity in the KB $K_{t_0}$ such that the mention $m$ is in $S^e_{t_0}$, and the function $f(m)$ will return that entity. This challenge is the equivalent of NED.

$$\exists! e \in K_{t_0} : (m \in S^e_{t_0} \wedge f(m) = e) \qquad (2.1)$$

Challenge 2: Unknown surface form, known entity

There exists no entity $e$ in the KB $K_{t_0}$ such that the mention $m$ is in $S^e_{t_0}$. However, there exists one and only one entity $e'$ in the same KB, $K_{t_0}$, such that the mention $m$ is in $S^{e'}_{\Delta t}$, and the function $f(m)$ returns that entity.

This challenge is the task of mining evidences and surface forms for already known entities in order to strengthen Challenge 1.

$$\nexists e \in K_{t_0} : (m \in S^e_{t_0}) \wedge \exists! e' \in K_{t_0} : (m \in S^{e'}_{\Delta t} \wedge f(m) = e') \qquad (2.2)$$

Challenge 3: Known surface form, unknown entity

There exist multiple entities in the KB $K_{t_0}$ such that the mention $m$ is in $S^e_{t_0}$. However, the entity the mention actually refers to is not present in $K_{t_0}$. The actually referred entity, $e'$, is present in $K_{\Delta t}$ such that the mention $m$ is in $S^{e'}_{\Delta t}$, and the function $f(m)$ returns that entity $e'$.

This challenge refers to a variant of an Emerging Entity, but holds the increased difficulty that other entities with overlapping surface forms exist.

$$\exists e \in K_{t_0} : (m \in S^e_{t_0}) \wedge \exists! e' \in K_{\Delta t} : (m \in S^{e'}_{\Delta t} \wedge f(m) = e') \qquad (2.3)$$

Challenge 4: Unknown surface form, unknown entity

There exists no entity $e$ in the KB $K_{t_0}$ such that the mention $m$ is in $S^e_{t_0}$. There exists one and only one entity $e'$ in the KB $K_{\Delta t}$ such that the mention $m$ is in $S^{e'}_{\Delta t}$, and the function $f(m)$ returns that entity $e'$.

This challenge refers to a variant of an Emerging Entity, specifically when there is a record of neither the surface form nor the entity itself.

$$\nexists e \in K_{t_0} : (m \in S^e_{t_0}) \wedge \exists! e' \in K_{\Delta t} : (m \in S^{e'}_{\Delta t} \wedge f(m) = e') \qquad (2.4)$$
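To summarize the four cases, the sketch below classifies a gold-annotated mention into one of the challenges; the data structures (a surface-form inventory for $K_{t_0}$ and a set of emerging entity ids) are illustrative assumptions, not part of the thesis.

```python
# Illustrative mapping of a (mention, gold entity) pair onto Challenges 1-4.
def challenge_case(mention, true_entity, surface_forms_t0, emerging_ids):
    """surface_forms_t0: {entity_id: set of surface forms in K_t0};
    emerging_ids: entities that only exist in K_dt (i.e. emerged after t0)."""
    known_surface = any(mention in forms for forms in surface_forms_t0.values())
    emerging = true_entity in emerging_ids
    if known_surface and not emerging:
        return 1  # known surface form, known entity (classic NED)
    if not known_surface and not emerging:
        return 2  # unknown surface form, known entity (a new alias to mine)
    if known_surface and emerging:
        return 3  # known surface form, unknown entity
    return 4      # unknown surface form, unknown entity
```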

2.6 Gradient Boosting Classification Tree

When dealing with binary classification problems, the system consists of input variables $x = \{x_0, x_1, ..., x_n\}$ and an output variable $y_i \in \{0, 1\}$. By using a set of training samples $\{x_i, y_i\}_{i=1}^{M}$, the goal is to create a function $F(x)$ that as closely as possible maps input $x$ to output $y$ (i.e. makes a prediction). To create such a function, another function referred to as the loss function, $L(F(x), y)$, is introduced, with the goal of minimizing its output. When the loss function is minimized, the mapping function is optimal over the joint distribution of the training samples (Friedman 2001).

One way of creating the mapping function is to utilize a set of weak classifiers and combine their votes to receive the final classification, a process known as boosting. A weak classifier is a classifier which is just slightly more accurate than (training-sample-weighted) random guessing. Assuming there are $M$ classifiers, each classifier $G_i(x)$ has a weight $\alpha_i$ for its vote. If the output variable is $y \in \{-1, 1\}$, then the complete algorithm for classification boosting is denoted as (Friedman et al. 2001):

$$G(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(x)\right) \qquad (2.5)$$

The construction of the weights $\alpha_i$ and the weak classifiers $G_i(x)$ varies by algorithm, and there are many options in the general case. If there are $N$ training samples, each sample has an initial weight $w_i = 1/N$. By iterating from $m = 1$ to $m = M$, each classifier $G_m$ is fitted to the training samples so that it minimizes the loss function. The samples that were misclassified at step $m - 1$ are assigned a higher weight. By doing this, the next classifier is forced to correct the mistakes of the previous classifier and therefore produces a more robust total classifier with higher accuracy. The described technique is referred to as AdaBoost and is described in detail in Friedman et al. (2001, p. 339).

Addressing the weak classifiers, one of the most prominent and well-known kinds is the decision tree. It is in some cases even argued to be the ideal weak classifier when combined with boosting (Friedman et al. 2001, pp. 350–352). A decision tree is a hierarchical component that decides the output by a sequence of questions on the input data. The tree is built using (weighted) probabilities of the training data.

X1  X2  X3  Y
1   0   0   1
1   1   1   0
1   0   1   1

Decision stump: if X1 = 1 then Y = 1, else Y = 0.

Figure 2.2: Example of a simple decision tree

Figure 2.2 shows a special case one-level decision tree commonly known as a decision stump. The particular tree in the figure achieves an accuracy of 66% on the training data and would thus qualify as a weak classifier.
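As a concrete, hedged example, the sketch below fits boosted one-level trees (decision stumps) on a toy binary problem with scikit-learn; it illustrates the boosting-with-stumps idea described above, not the exact classifier configuration used later in the thesis.

```python
# Boosted decision stumps on a toy binary classification problem (sketch).
from sklearn.ensemble import GradientBoostingClassifier

X = [[1, 0, 0], [1, 1, 1], [1, 0, 1], [0, 1, 0]]   # toy feature vectors
y = [1, 0, 1, 0]                                    # binary labels

# max_depth=1 makes every weak classifier a decision stump
clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=1)
clf.fit(X, y)
print(clf.predict([[1, 0, 1]]))
```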

3 Related Work

This thesis sets out to build a multilingual Emerging Entities discovery and disambiguation system that also complements in-KB entities with contextual claims. The papers in this chapter are grouped by topic according to how they relate to that objective.

3.1 Summary

Li et al. (2013) presented a generative model to add contextual claims to in-KB entities but discards Emerging Entities. Hoffart et al. (2014) extends the idea of Li et al. (2013) by introducing Emerging Entities while still adding contextual claims to in-KB entities. Wu et al. (2016) builds on the work by Hoffart et al. (2014) but manages to reach better results and thus became state-of-the-art in the English domain. Zhang et al. (2019) also builds on the work by Hoffart et al. (2014), claiming state-of-the-art, but does not produce better results than Wu et al. (2016). While Moussallem et al. (2017) does not consider Emerging Entities, it does consider multiple languages for NED (i.e. for in-KB entities). Ideally, multilinguality is a property we would like our system to have. Graus et al. (2018) have studied the patterns by which entities emerge and found that it takes 245 days on average.

The rest of this chapter is a more in-depth look at the papers presented above and their relevance to this thesis.

3.2 Mining evidences

Li et al. (2013) conducted their work with a similar motivation as this thesis, namely that KBs are not complete and never will be due to the never-ending continuous stream of events that happen. This has a negative effect on NED, which in turn is important in a variety of NLP applications. With this in mind, Li et al. (2013) set out to mine KB-complementary evidences for entities. They show that, especially for rare entities, mining evidences is a viable method that dramatically increases the evaluation scores of NED. One important point that differs from this thesis is that Li et al. (2013) does not consider adding emerging entities but rather discards them as out-of-KB. Their work is however very relevant since the methodology of mining evidences is an elementary building block for both Hoffart et al. (2014) and Wu et al. (2016).

3.3 Named Entity Disambiguation

While the work done by Hoffart et al. (2014) and Wu et al. (2016) did contain elements of classic NED, i.e. linking entity mentions to in-KB entities, their work mainly focused on the detection of Emerging Entities in the English domain. However, an Emerging Entities discovery system should ideally be language independent. On the note of language independence, Moussallem et al. (2017) presented a method that is state-of-the-art in NED for six languages other than English, with state-of-the-art-like results on English. The method used by Moussallem et al. (2017) is built on using the KB as the definition and then performing algorithmic candidate generation and disambiguation, as opposed to using various similarity measures in a vector-space model. KB candidate generation is done by considering all possible permutations of an entity word along with its acronyms. KB disambiguation is done by a breadth-first search (BFS) over a knowledge graph structure derived from the KB. The motivation for the method is that vectors derived from language corpora will not be deterministic, in the sense that the results between languages will be too biased.

The method and results presented by Moussallem et al. (2017) are relevant to this thesis, but the results cannot be directly compared with this thesis since Emerging Entities are discarded. The code used by Moussallem et al. (2017) is publicly available1.

3.4 Detecting Emerging Entities

Where prior work assigned candidate entities with a confidence score below a certain threshold to a NIL-pointer (i.e. as out-of-KB), thus losing the information, Hoffart et al. (2014) suggested an alternative approach. Their motivation was that since no knowledge base will ever be complete in the sense of lacking entities, a need for automatic detection of new entities has arisen.

Their approach builds on a keyword context for all possible entities, induced from a web search on the entity. The idea is that such a keyword context should be an accurate representation of all entities with the same name (including the Emerging Entity). Such a representation can then be reduced (model-difference) with the keywords from already known entities (existing in the KB), thus leaving only the keyword representation of the Emerging Entity. Furthermore, whenever a NED (the AIDA method by Hoffart et al. (2011) was used but treated as a black box) is done with high confidence, an immediate keyword context in the document is used to enrich the identified entity in order to strengthen future NED problems. Similarly, whenever there is low confidence, the immediate keyword context is used to strengthen the argument for an Emerging Entity. To further enhance the Emerging Entities with keywords, news articles in a close window of time (i.e. the same day) with references to the entity are assumed to be referencing the same entity. Thus, keywords extracted from articles nearby in time can be assigned to the Emerging Entity.

The work was evaluated on two different tasks: NED (using mean accuracy and the CoNLL-YAGO 2011 dataset (Hoffart et al. 2011)) and Emerging Entity Discovery (EED). Since there was no available dataset at the time, the authors manually annotated 300 news articles. Although the authors reached 97.97% Emerging Entities precision, their method significantly decreased the accuracy of the NED problem solved by AIDA (by adding more uncertainty to the model). The dataset used is further on denoted as the AIDA-EE dataset. The method and results presented in Hoffart et al. (2014) can be used for comparison with this thesis' results.

Picking up where Hoffart et al. (2014) left off, Wu et al. (2016) presented another approach to the problem of NED with EE in news articles. Their approach builds on representing the entities in five different feature spaces of the classical vector-space model, namely Contextual Space, Neural Embedding Space, Topical Space, Query Space, and Lexical Space. Especially the Contextual Space strongly relates to the work done by Hoffart et al. (2014). The other spaces are merely different ways to represent the similarity between a text mention and its real corresponding entity by using different features. For example, the Query Space utilizes the context words in which entities appear in web searches.

For evaluation, the authors used a dump from Wikipedia as the “future” and an older version of Wikipedia as the present. They then assumed that all the emerged entities in a set of news articles dated between the present and the “future” would be included in the “future” dump of Wikipedia.

The approach taken by Wu et al. (2016) slightly outperforms the approach by Hoffart et al. (2014) with respect to precision, recall, and F1-score. To the best of our knowledge, Wu et al. (2016) is state-of-the-art in discovering Emerging Entities in news articles. Furthermore, the method and results presented by Wu et al. (2016) can be used for comparison with this thesis' results.

Zhang et al. (2019) addressed similar issues as Hoffart et al. (2014) but used a slightly modified approach. Zhang et al. (2019) identified that the main drawback of the method proposed by Hoffart et al. (2014) lay in the noise introduced to the NED solver by adding one EE candidate per mention, yielding many EE options per mention. To address this, a two-step process was introduced. Firstly, a probabilistic NED method was used which yielded a confidence score per candidate entity. Secondly, if the confidence score was low enough, additional context was harvested from the Web and Wikipedia. The harvested context was then intersected with the context from existing entities in the KB to leave only the (possible) context of the emerged entity. The intersection of contexts is closely related to the model-difference approach used by Hoffart et al. (2014). The main difference between their approaches is that the set of Emerging Entity candidates was fed to the NED solver in Hoffart et al. (2014), while Zhang et al. (2019) divides this into two steps. The result is that the NED solver is not as affected by the noise for in-KB entities.

Zhang et al. (2019) evaluated their approach on the same dataset as Hoffart et al. (2014) with respect to Emerging Entities Discovery and is thus fully comparable (i.e. AIDA-EE). Their results indicate outperformance with respect to micro accuracy, macro accuracy, and EE-F1, but not EE-Precision and EE-Recall. Zhang et al. (2019) claims state-of-the-art on the AIDA-EE dataset but does not reference Wu et al. (2016), which presents higher EE-F1 results on the same dataset. Note that Wu et al. (2016) does not disclose results on micro accuracy or macro accuracy.

Rather than discovering Emerging Entities and linking them, Graus et al. (2018) studied the patterns by which entities emerge, from their first mention until inclusion in a KB (i.e. Wikipedia). The pattern was defined as the number of documents containing mentions of the emerging entity per day until inclusion. By clustering these patterns, Graus et al. (2018) found two distinct types of Emerging Entities: one which gradually attracts more attention over time, and another which spikes in interest early. Like Wu et al. (2016), Graus et al. (2018) uses two versions of Wikipedia to identify which entities were emerging during the time period of their dataset. The results indicate that, on average, an entity emerges in 245 days. Out of a total of 74,482 identified emerging entities, 51,095 were mentioned in a news article stream. The implication of their results is that monitoring news articles to automatically detect entities that should be added to a KB is indeed a viable method, as referenced in Section 1.2.

The method of identifying emerged entities with two different versions of Wikipedia is of relevance to this thesis. Furthermore, the estimated average time for entities to emerge is of importance when collecting the dataset to be used.

4 Method

This chapter explains the steps taken to realize the system that is the aim of the thesis. Firstly, the building of the knowledge bases is explained. Then an abstract pipeline of the whole system is shown and motivated in order to give an overview perspective. The chapter then proceeds with motivating and explaining each component in the pipeline. Lastly, the method for evaluating the system is shown as well as the used data.

4.1 Build a Knowledge Base

As stated in Section 2.5, two different versions of a KB are required to investigate Emerging Entities, $K_{t_0}$ and $K_{t_1}$. This section defines how those knowledge bases are built.

4.1.1 Issues with using ready-to-use knowledge bases

There exist ready-to-use knowledge bases such as DBPedia and YAGO3 which come precompiled. The issue with precompiled KBs is that they are static with respect to a certain Wikipedia or Wikidata dump. This has the effect that they quickly become outdated. For example, the latest version of YAGO3 was compiled with the Wikidata dump of 2017-05-22. Although the latest version of DBPedia was released quite recently (2019-08-30), the previous release was made available in October 2016, which created a three-year gap. YAGO3 does offer the possibility to compile a new version yourself using raw data dumps, but the process is extremely resource-demanding, requiring approximately 220GB of main memory1. Having physical access to machines with such large amounts of main memory is very rare. Alternatively, a cloud-computing resource could be used, but that has the downside of being expensive2.

4.1.2 Abstract building pipeline

Due to the issues with using ready-to-use knowledge bases addressed in Section 4.1.1, an alternative approach is used. By using a local database resource and loading the dumps into it, the problem with main memory usage is mitigated by moving the problem to storage memory instead. Having access to large amounts of storage memory is something ordinary personal computers often have. To build the KB, both Wikidata and Wikipedia are used. This is because a significant number of entities that are not present in Wikipedia are present in Wikidata. However, for the entities that do exist in Wikipedia, there is also an entity context that can be harvested. Furthermore, Wikidata provides the entities with properties not in Wikipedia, e.g. aliases. The combination of the two thus yields the most enriched entity representations.

1. See the YAGO3 official GitHub page at https://github.com/yago-naga/yago3

To build a knowledge base for a given language, the required data is a Wikipedia dump3 in that language and a Wikidata dump4. For parsing Wikidata and deciding on the Wikidata entity type, a modified version of the NECKAr tool by Geiß et al. (2017) is used. In terms of the quality of the extracted entities, NECKAr is comparable with the widely used knowledge base YAGO3 (Geiß et al. 2017).

The list below describes, on a high level, the steps taken for building a KB, while Sections 4.1.2.1-4.1.2.4 contain more detailed steps.

1. Parse and load each page of Wikidata into a local instance of MongoDB. Interesting claims are extracted (see Section 4.1.2.2) along with a predicted entity type (see Section 4.1.2.1). To do this, a modified version of NECKAr by Geiß et al. (2017) is used.

2. Parse Wikipedia in any language and map it against the database, adding the article text, Wikipedia surface forms, and links to other Wikipedia pages (see Section 4.1.2.3). If no Wikidata entry can be mapped onto, a new entry is created.

3. Dump the entries into a local ElasticSearch instance, which enables good indexing and can serve as the final KB (see Section 4.1.2.4).

4.1.2.1 Wikidata entity type

Wikidata has the property that it provides structured data in terms of statements. This is beneficial since it can be used to get a prediction of which type an entity belongs to. The structure is centered around items (e.g. Linköping University) and properties (e.g. "instance of"), denoted with a Q-value or P-value, respectively. A pair ⟨P, Q⟩ is then referred to as a statement. Most significant for this thesis are the statements containing either "instance of" (P31) or "subclass of" (P279). If the "instance of" or the "subclass of" property links to any of the root pages Organization (ORG) (Q43229), Geographic location (LOC) (Q2221906), or Human (PER) (Q5), then getting the predicted type is trivial. However, due to the topology of Wikidata being more detailed and thus complex, it is typically not a trivial task. In order to predict a type, it is required to follow the chain of statements eventually leading to one of the root items. As an illustrative example, consider the following chain from Manchester United eventually leading to organization:

Manchester United F.C. --instance of--> association football club --subclass of--> football club --subclass of--> sports club --subclass of--> sports organization --subclass of--> organization

To mitigate the chain problem, the SPARQL-based Wikidata Query Service5 is utilized to retrieve the subclasses or instances belonging to each of the root items. These subclasses are stored in a local MongoDB instance, which can be queried when iterating over the set of pages in Wikidata to decide on the entity type. It should be noted that some subclasses are subclasses of multiple root classes. For example, hospital and library are both subclasses of the organization and geographic location items, causing those entities to have both ORG and LOC as the entity type. Depending on the context, both types could be right and thus this issue is left as is.
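As an illustration, the snippet below retrieves all transitive subclasses of the Organization root item (Q43229) from the Wikidata Query Service with Python; the exact query and client used in the thesis are not disclosed, so this is only a sketch.

```python
# Sketch: fetch all transitive subclasses of Organization (Q43229) via SPARQL.
import requests

query = "SELECT ?item WHERE { ?item wdt:P279* wd:Q43229 . }"
response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kb-builder-sketch/0.1"},  # the service expects a descriptive user agent
)
bindings = response.json()["results"]["bindings"]
subclass_uris = [b["item"]["value"] for b in bindings]
print(len(subclass_uris))
```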

3. Replace [lang] with your desired language, for example "sv": https://dumps.wikimedia.org/[lang]wiki/
4. https://dumps.wikimedia.org/wikidatawiki/entities/


4.1.2.2 Parsing Wikidata

When iterating over Wikidata pages, interesting claims are extracted depending on the predicted entity type (referenced as neClass). Such claims include, but are not limited to, aliases, description, and birth date (if the entity type is a person). The full list of extracted claims is shown in Figure 4.1. For the properties sitelink (which refers to Wikipedia sitelinks) and alias, a trailing [0-N] denotes that there can be multiple claims of that property, where N is an arbitrary integer. The general properties are extracted for all entities, regardless of the predicted entity type. The parsed Wikidata pages are inserted into a local MongoDB instance for later merging with Wikipedia.

(a) General properties: id, norm_name, neClass, description, sitelink[0-N], alias[0-N], date_birth
(b) Person properties: date_birth, date_death, gender, occupation
(c) Organization properties: official_language, inception, hq_location, official_website, founder, ceo, country, instanceof
(d) Location properties: in_country, in_continent, location_type, coordinate, population

Figure 4.1: Wikidata extracted properties per entity type

4.1.2.3 Parsing Wikipedia

Wikipedia articles are written in the MediaWiki Markup Language6, which means that some preprocessing is required to extract the title of the page and the written text inside each article. The tool WikiExtractor7 is used to perform this. WikiExtractor also yields all surface forms inside an article, i.e. texts that link to other Wikipedia pages. The parsed articles are output to file in JSON format. The output JSON-formatted files are then iterated to create a dictionary of the surface forms so they can be included on the object they refer to. The surface forms are simply translated into wikilinks, effectively yielding which other Wikipedia pages an article links to. Figure 4.2 shows an example of the extracted data for the location Copenhagen on the Swedish Wikipedia.

6. https://www.mediawiki.org/wiki/Help:Formatting

{
  "title": "Köpenhamn",
  "surface_forms": {"Köpenhamn": 8411, "Københavs": 1,
                    "Köpenhamnsområdet": 1, "Köpenhamns": 1,
                    "København": 4, "Copenhagen": 1, [...]},
  "text": "Köpenhamn är Danmarks huvudstad [...]",
  "wikilinks": {"Danmark": 1, "Amager": 4, [...]}
}

Figure 4.2: Example of parsed Wikipedia entry
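The sketch below indexes parsed articles of the shape shown in Figure 4.2 into a surface-form dictionary that can later be used for candidate lookups; the file name and the inverted-index design are assumptions for illustration, and the actual conversion of WikiExtractor's output into this shape is omitted.

```python
# Sketch: build an inverted surface-form index from parsed Wikipedia articles,
# stored one JSON object per line with the fields shown in Figure 4.2.
import json
from collections import Counter, defaultdict

surface_form_index = defaultdict(Counter)   # surface form -> Counter of target titles

with open("parsed_wikipedia.jsonl", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        for form, count in article["surface_forms"].items():
            surface_form_index[form][article["title"]] += count

# e.g. surface_form_index["København"] -> Counter({"Köpenhamn": 4, ...})
```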

4.1.2.4 Merging Wikidata and Wikipedia

Using the extracted sitelink in Wikidata, as described in Section 4.1.2.2, which can act as a unique identifier for the corresponding document in Wikipedia, a merge is performed. If a merge cannot be performed, i.e. the Wikidata entry has no Wikipedia page in the current language or the other way around, the document is left as is. The set of documents is then inserted into an ElasticSearch8 instance. The insertion into ElasticSearch enables fast indexing on multiple attributes and thus enables candidate identification for NED based on either entity name, Wikidata alias, or Wikipedia surface forms. The ElasticSearch instance is considered the knowledge base(s) of this project and is capable of linking to either Wikipedia, Wikidata, or both (if merged). From this step, local MongoDB instances are no longer needed.
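A minimal sketch of this final step with the official Python Elasticsearch client (8.x style) is shown below; the index name and document shape are illustrative assumptions, not the thesis' actual mapping.

```python
# Sketch: bulk-index merged KB documents and query multiple fields at once.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

docs = [{
    "id": "Q1748",                       # illustrative document (Copenhagen)
    "norm_name": "Köpenhamn",
    "neClass": "LOC",
    "alias": ["København", "Copenhagen"],
    "surface_forms": ["Köpenhamn", "København", "Copenhagen"],
}]
helpers.bulk(es, ({"_index": "kb_current", "_source": d} for d in docs))

# Candidate lookup on entity name, aliases and surface forms in one query.
hits = es.search(index="kb_current", query={"multi_match": {
    "query": "Copenhagen",
    "fields": ["norm_name", "alias", "surface_forms"],
}})
```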

4.2 Abstract pipeline

The goal of the thesis is to extend a NED system with the ability to link not only to existing entities but also to future entities. The general approach is to decide if the current mention refers to a future entity and, if so, tag it with an emerging entity model which can be enriched with context from the input articles. If an in-KB entity is linked, enriching that entity is also possible but omitted due to scope. After the linking procedure is done, it is attempted to link the emerging entity models to entities in the future KB, yielding an indication of how good a representation the entity model was of the future entity.

It should be noted that improving NED scores on entities existing in the present KB is not the goal of the thesis, although it might be a positive side effect of complementing in-KB entities with mined data, as suggested by Li et al. (2013). On the other hand, introducing emerging entities might also worsen the results of NED by introducing noise, as suggested by Hoffart et al. (2014).

[Figure 4.3 depicts the pipeline: an input document is passed to the NER module (Al-Rfou et al. 2015), which produces mentions {m1, m2, ..., mn}; the candidate generator queries the EntityDB (KB current) and attaches candidate entities to each mention; the NED module (Han et al. 2011) assigns a confidence score to each candidate; and the EED module (Wu et al. 2016) inserts extracted contexts and "EE models" into the EntityDB before producing the tagged output document.]

Figure 4.3: General pipeline

Figure 4.3 shows an overview of the pipeline used for a single document. The figure should be seen in the context of being applied to all articles between time current and time future (disclosed in Section 4.9). A document along with its target language is first passed into a NER module (details in Section 4.3), which labels possible entities with a PER, LOC, or ORG label. The possible entities are referred to as mentions (referenced in the figure by m1, m2, ..., mn). The candidate generator module searches the entity database (KB current) for direct or partial matches in either of the Wikipedia surface forms, Wikipedia title, Wikidata aliases, or the Wikidata title. The results (denoted ex) are mapped onto the triggering mention along with the context of the mention (i.e. the sentence in which it appeared). The map containing the mentions and entity candidates is passed along to the NED module (details in Section 4.5), which outputs a confidence score (denoted cx) for each mention actually referring to each entity candidate. The disambiguated mentions are then passed to the Emerging Entities Discovery (EED) module (details in Section 4.6), which creates a model of the entities if applicable or chooses one of the disambiguated entities for each mention. The output is a tagged document where each entity is labeled with either a link to Wikipedia, Wikidata, or an emerging entity id (denoted EEx, created by inserting into the entity database).

4.3 Named Entity Recognition

Hoffart et al. (2014), Wu et al. (2016), and Zhang et al. (2019) all utilized Stanford NER (Finkel et al. 2005) in their pipeline for generating mentions. Stanford NER models are only available for the English, German, Spanish, and Chinese languages, but since Swedish is the language of interest in this thesis, Stanford NER is not applicable. However, using pre-trained models for the NER task in the pipeline is arguably the most viable option since it is not the focus of the thesis, and therefore Polyglot-NER (Al-Rfou et al. 2015)9 is used for the task. It should be noted that any NER system with models available in the target languages and that outputs PER, ORG, and LOC tags could have been used. Another viable option is BERT (Devlin et al. 2019) with a model fine-tuned for NER on the target language10.
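A minimal usage sketch of Polyglot-NER on a Swedish sentence is shown below; it assumes the polyglot package and its Swedish embedding and NER models have been downloaded separately, and the example sentence is illustrative.

```python
# Sketch: tagging a Swedish sentence with Polyglot-NER.
# Assumes the Swedish models are installed, e.g.:
#   polyglot download embeddings2.sv ner2.sv
from polyglot.text import Text

article = "Manchester United tränas av Ole Gunnar Solskjær."
text = Text(article, hint_language_code="sv")

for entity in text.entities:
    # each entity is a token chunk with a tag such as I-PER, I-ORG or I-LOC
    print(entity.tag, " ".join(entity))
```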

Polyglot-NER uses context independent word embeddings (see Section 2.2.2) from Al-Rfou et al. (2013) which was trained on Wikipedia raw text from the various languages to capture the semantic meaning of words. Concatenated word embeddings of a context sur-rounding an entity link in Wikipedia are then used to train a one-vs-all neural network clas-sifier per language. The entity type used in training as gold data is automatically extracted by matching Wikipedia categories and Freebase (Bollacker et al. 2008) attributes in a similar manner as NECKAr (Geiß et al. 2017) (see Section 4.1.2.1) but for Wikipedia rather than Wiki-data. Because of Wikipedia writing guidelines, entity mentions should only be linked on the first occurrence in an article, training examples are not as frequent as they could have been. To mitigate this, Polyglot-NER uses coreference resolution and oversampling to generate more training examples.

As previously mentioned, the results of Polyglot-NER are not directly comparable to Stanford NER because of the respective target languages. However, it can be said that Polyglot-NER outperforms other competing NER systems such as Nothman et al. (2013) and performs similarly to Stanford NER for German (Faruqui et al. 2010).

4.4 Candidate generation

To get in-KB alternatives to disambiguate between, a search against the KB needs to be performed. Simply using all in-KB entities would overload any NED algorithm. A straightforward approach would be to use exact string matching against a set of fields in the KB, like Wu et al. (2016). However, this approach might miss cases that would be obvious to a human. For example, a mention including a possessive s, such as Henrik Larssons, would not be linked to the correct entity Henrik_Larsson because of the trailing s. For that reason, each mention passed from the NER algorithm is permuted into a set of related terms. The set of permutations is then passed to the KB, which queries selected fields for exact string matches. The returned set of entities can then be passed to the NED algorithm for disambiguation.

The set of permutations used consists of:

• The original mention
• The lower-cased version of the original mention
• The original mention cleared from special characters such as commas, apostrophes, hyphens and colons
• The original mention cleared from trailing possessive letters such as s
• Splitting the term on space, taking the length of the resulting array and then generating all possible combinations of words of size length and length − 1, excluding unigrams
• Acronyms, if the space-split array length is equal to or larger than 3

This results in, for example, the mention Donald John Trump also being queried for the more commonly associated string Donald Trump. A sketch of the permutation generation and the KB lookup is given after the list of queried fields below.

The set of fields in the KB which are queried consists of:

• Wikipedia title
• Wikidata aliases
• Wikidata title
• Wikipedia surface forms
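The following is a minimal sketch of how the permutation generation and the exact-match lookup could be implemented. It assumes a simple in-memory index mapping lower-cased values of the four fields above to entity ids; the index structure and function names are assumptions for illustration, not the actual implementation:

import re
from itertools import combinations

def permutations(mention: str) -> set:
    """Generate the query permutations described above (sketch)."""
    terms = {mention, mention.lower()}
    terms.add(re.sub(r"[,'\-:]", "", mention))             # strip special characters
    terms.add(re.sub(r"s$", "", mention))                  # strip a trailing possessive s
    words = mention.split()
    n = len(words)
    for size in (n, n - 1):
        if size >= 2:                                      # exclude unigrams
            terms.update(" ".join(c) for c in combinations(words, size))
    if n >= 3:
        terms.add("".join(w[0] for w in words).upper())    # acronym
    return terms

def generate_candidates(mention: str, index: dict) -> set:
    """Exact string match of every permutation against the indexed KB fields."""
    candidates = set()
    for term in permutations(mention):
        candidates.update(index.get(term.lower(), []))
    return candidates

# Example: permutations("Donald John Trump") includes "Donald Trump" and the acronym "DJT".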

4.5 Named Entity Disambiguation

Both Hoffart et al. (2014) and Wu et al. (2016) used AIDA (Hoffart et al. 2011) as the NED component in their pipeline. As discussed in Section 4.1.1, there exist issues with precompiled knowledge bases, and thus a custom knowledge base is built as per Section 4.1. AIDA uses DBpedia as its knowledge base but provides the opportunity to use other knowledge bases such as Wikidata. Support for fully custom knowledge bases is not provided and would require considerable effort to add. Because of the large effort required to get the system working, AIDA is not used. It should be noted that any language-independent NED system which outputs a confidence score for each possible entity candidate could have been used. Like Zhang et al. (2019), the NED module used is a re-implementation of the system by Han et al. (2011).

The method proposed by Han et al. (2011) is a probabilistic algorithm. In essence, the method weighs the probability of a mention referring to a specific entity by factoring in the other entity alternatives as well as the other mentions in the context and their respective entity candidates.

DOCUMENT: During his career at Manchester, Larsson was continuously praised by Sir Alex Ferguson.

MENTIONS: {m1 = Manchester, m2 = Larsson, m3 = Sir Alex Ferguson}

Figure 4.4: Example document


[Figure: weighted graph for the example document. The mention nodes Larsson, Manchester and Sir Alex Ferguson are connected by weighted edges to the candidate entity nodes Henrik_Larsson, Henrik_Larsson_(sprinter), Manchester, Manchester_United_F.C., University_of_Manchester and Alex_Ferguson.]

Figure 4.5: Example of weighted graph used for entity disambiguation

The problem is modeled as a weighted graph. Between mentions and entities there is a compatibility weight, calculated using tf-idf values (see Section 2.2.1) of the mention context and the entity description text. Between entities there is a semantic relatedness weight, calculated by leveraging the number of shared Wikipedia links between the entities. Figure 4.4 shows an example document, while Figure 4.5 shows how the document could look in the weighted graph. By using a random walk technique, the graph can be traversed to find the most prominent combination of entities, taking into account both the local mention-to-entity weights and the semantic relatedness between the entities.
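The two edge weight types could be computed roughly as in the sketch below. The compatibility weight is taken as the cosine similarity of tf-idf vectors, and the semantic relatedness is illustrated with a Milne-Witten-style measure over shared Wikipedia links; the exact formulas and normalisations of Han et al. (2011) may differ, so this should be read as an approximation:

import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compatibility(mention_context: str, entity_description: str,
                  vectorizer: TfidfVectorizer) -> float:
    """Mention-to-entity edge weight: cosine similarity of tf-idf vectors.

    The vectorizer is assumed to be fitted on the document collection beforehand."""
    vectors = vectorizer.transform([mention_context, entity_description])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

def relatedness(links_a: set, links_b: set, kb_size: int) -> float:
    """Entity-to-entity edge weight from shared Wikipedia links (Milne-Witten-style illustration)."""
    overlap = len(links_a & links_b)
    if overlap == 0 or not links_a or not links_b:
        return 0.0
    score = 1 - (math.log(max(len(links_a), len(links_b))) - math.log(overlap)) / (
        math.log(kb_size) - math.log(min(len(links_a), len(links_b))))
    return max(0.0, score)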

At the time, the system by Han et al. (2011) was around state-of-the-art for NED, with an F1-score of approximately 0.8 depending on the evaluation dataset (Liu et al. 2017). After 2014, NED has reached F1-scores of approximately 0.9 (Al-Moslmi et al. 2020). The method of Han et al. (2011) is thus no longer state-of-the-art, but it is used because of this thesis' comparability with Zhang et al. (2019) as well as its relative competitiveness. However, as previously stated, improving NED results is not the focus of this thesis, so using the state-of-the-art method is not essential.

4.5.1 Re-implementing the NED module

As mentioned in Section 4.5, the NED module is a re-implementation of the system presented in Han et al. (2011). The reason for this is that neither Han et al. (2011) nor Zhang et al. (2019) (who also re-implemented the system of Han et al. (2011)) have openly released their code. Re-implementing the module presented in Han et al. (2011) is fairly straightforward from their method description. However, not all details are disclosed and thus some assumptions are required. In order to benefit the comparability and reproducibility of this thesis, these assumptions are presented in this section.

First, consider the local mention-to-entity compatibility. The measurement is calculated from the possible entity's Wikipedia text and the mention's context using the tf-idf schema (see Section 2.2.1). It is not disclosed whether the tf-idf schema contains any n-grams; it is therefore assumed that only unigrams are used. The same logic is also applied for the prior importance. Regarding the initial evidence vector s, the initial value for non-mention nodes is not disclosed. This value is assumed to be 1, because if it were set to 0 then all information about non-mention nodes would be lost at later stages.
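For reference, evidence propagation over the referent graph can be sketched as a random walk with restart. The restart probability, convergence criterion and normalisation below are assumptions made for illustration and are not prescribed by Han et al. (2011):

import numpy as np

def random_walk_with_restart(transition: np.ndarray, evidence: np.ndarray,
                             restart_prob: float = 0.1, tol: float = 1e-8,
                             max_iter: int = 1000) -> np.ndarray:
    """Propagate the initial evidence vector s over the graph (sketch).

    transition: column-normalised edge-weight matrix of the mention/entity graph.
    evidence:   initial evidence vector s, with non-mention nodes assumed to start at 1."""
    s = evidence / evidence.sum()                  # normalise the evidence vector
    r = s.copy()
    for _ in range(max_iter):
        r_next = (1 - restart_prob) * transition @ r + restart_prob * s
        if np.abs(r_next - r).sum() < tol:         # L1 convergence check
            return r_next
        r = r_next
    return r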

4.6 Emerging Entity Detection

For detecting Emerging Entities, the method of Wu et al. (2016) is currently state-of-the-art. This is the method used in this thesis, with some modifications due to the language transition and iMatrics' preferences. As described in Section 3.4, their work consists of creating multiple feature spaces and using these feature spaces to determine whether an entity should be considered emerging.


Since the implementation of Wu et al. (2016) is not publicly released, a re-implementation is required. The following subsections explain the interpretation of the feature spaces and how the re-implementation was conducted to fit the aim of this thesis. For clarity, the same labels for the feature spaces are used. Lastly, it is explained how a Gradient Boosted Decision Tree Classifier is used to make the decision regarding the Emerging Entities.

4.6.1 Contextual Space

The Contextual Space represents the contextual similarity between the mention and the in-KB alternative e_chosen selected by NED. The other entities mentioned in the context of the entity are at the center of this space. Let there be n such entities; the set of those entities then becomes E_s = {e_s1, e_s2, ..., e_sn}. For each entity e_si, it is measured how well it supports e_chosen by weighing it both in terms of being a supportive entity and in terms of being a salient entity. The weighting is done using the tf-idf scheme (Section 2.2.1). Supportive entity refers to being mentioned in the Wikipedia article of e_chosen, where tf is the number of times e_si is mentioned and df is the number of other entities linking to e_si in the whole KB. Salient entity refers to e_si being mentioned in the same context as e_chosen in the articles of other entities; tf then refers to the number of co-occurrences with e_chosen, while df remains the number of in-KB entities linking to e_si. How the df values are transformed to idf is not explicitly disclosed, but it is assumed that |D| = |KB|, since the documents referred to are the entities. The weights w_supportive,i and w_salient,i for each e_si are then calculated as p(e_chosen|m) · supportive(e_si, e_chosen) and p(e_chosen|m) · salient(e_si, e_chosen) respectively. p(e_chosen|m) represents the probability of m referring to e_chosen, calculated as the number of times m refers to e_chosen in the KB divided by the total number of times e_chosen is referred to in the KB.

The returned value is then interpreted as the cosine similarity between the vector v = {w_supportive,1, ..., w_supportive,n} + {w_salient,1, ..., w_salient,n} and an n-dimensional vector of ones.
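As an illustration, the final Contextual Space score can be computed as below. This sketch assumes the supportive and salient scores and the prior p(e_chosen|m) have already been computed, and it reads the '+' in the definition of v as element-wise addition of the two weight vectors; these are interpretation choices, not something Wu et al. (2016) specify:

import numpy as np

def contextual_space_score(supportive: list, salient: list, prior: float) -> float:
    """Cosine similarity between the weighted vector v and an n-dimensional all-ones vector."""
    v = prior * (np.asarray(supportive, dtype=float) + np.asarray(salient, dtype=float))
    ones = np.ones_like(v)
    norm = np.linalg.norm(v) * np.linalg.norm(ones)
    return float(v @ ones / norm) if norm > 0 else 0.0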

4.6.2 Neural Embedding Space

For the Neural Embedding Space, Wu et al. (2016) utilized the word2vec model (Mikolov, Sutskever, et al. 2013) referred to in Section 2.2.2. Due to its out-of-the-box support for more languages, FastText (Athiwaratkun et al. 2018) is used instead.

The space is interpreted as splitting the text of the Wikipedia article belonging to an entity e on spaces, thus creating an entity context ec = [w1, w2, ..., wn]. The same procedure is repeated for the sentence in which a mention m occurs, resulting in a mention context mc = [w1, w2, ..., wm]. Three vectors are then calculated: vec(mc) = Σ_{i=1..m} vec(wi), vec(ec) = Σ_{j=1..n} vec(wj) and vec(m). The mean of cossim(vec(m), vec(ec)) and cossim(vec(mc) + vec(m), vec(ec)) is then returned as the similarity score between mention m and entity e in the Neural Embedding Space.

Given the algorithm described in Wu et al. (2016), one assumption that is not disclosed has to be made, namely whether it actually is the mean of the two cosine similarities that should be returned or some other aggregate measurement.
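The score can be sketched as follows with the fasttext library. The model path is an assumption (e.g. the pre-trained Swedish vectors cc.sv.300.bin from fasttext.cc), and the simple whitespace tokenisation mirrors the description above rather than the exact implementation:

import numpy as np
import fasttext

model = fasttext.load_model("cc.sv.300.bin")   # pre-trained Swedish vectors (assumed path)

def neural_embedding_score(mention: str, mention_sentence: str, entity_text: str) -> float:
    """Mean of the two cosine similarities between mention and entity context (sketch)."""
    def vec(tokens):
        return np.sum([model.get_word_vector(w) for w in tokens], axis=0)

    def cossim(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom > 0 else 0.0

    v_m = vec(mention.split())                 # vec(m)
    v_mc = vec(mention_sentence.split())       # vec(mc)
    v_ec = vec(entity_text.split())            # vec(ec)
    return (cossim(v_m, v_ec) + cossim(v_mc + v_m, v_ec)) / 2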

4.6.3 Topical Space

The Topical Space can be described as the similarity between the topic of the article in which an entity is mentioned and the topic of the entity's context (i.e. its Wikipedia article). Wu et al. (2016) utilized the Open Directory Project (ODP)11, which features 62,767 topics. By choosing only topics which contain at least 1,000 documents, the number of topics is reduced to 219. Using a topic classifier, the input article and the candidate entity's Wikipedia text are each classified into a 219-dimensional vector. For noise reduction, the top-5 scoring topics of the input article are extracted. The returned value of the Topical Space is then the cosine similarity of the two top-5 vectors.

The topic classifier used by Wu et al. (2016) is not disclosed. Instead, the topic classifier of iMatrics is used, whose output dimensionality is configurable. The chosen configuration features a 1501-dimensional vector. Due to company privacy, the implementation details of the used topic classifier cannot be disclosed.
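Under one reading of "the two top-5 vectors", both classifier outputs are restricted to the indices of the input article's top-5 topics before the cosine similarity is taken. The sketch below follows that reading; the exact construction used by Wu et al. (2016) is not disclosed:

import numpy as np

def topical_space_score(article_topics: np.ndarray, entity_topics: np.ndarray, k: int = 5) -> float:
    """Cosine similarity between topic vectors restricted to the article's top-k topics (sketch)."""
    top_k = np.argsort(article_topics)[-k:]        # indices of the k highest-scoring topics
    a = article_topics[top_k]
    b = entity_topics[top_k]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0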

4.6.4 Lexical Space

Wu et al. (2016) utilize the normalized Levenshtein distance to measure the Lexical Space (i.e. string similarity). This measurement is straightforward to implement and is given in Equation 4.1.

nld(m, e) = levenshtein(m, e) / max(len(m), len(e))    (4.1)

The length of the entity name can be interpreted in multiple ways, since the knowledge base contains a Wikipedia title, possibly multiple Wikidata aliases as well as multiple Wikipedia surface forms. It is assumed that the Wikipedia title should be used, and for an Emerging Entity it is interpreted as the most frequent surface form.
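A direct implementation of Equation 4.1 could look as follows. The Levenshtein distance is written out with plain dynamic programming here rather than tied to any particular library:

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nld(mention: str, entity_name: str) -> float:
    """Normalized Levenshtein distance as in Equation 4.1."""
    longest = max(len(mention), len(entity_name))
    return levenshtein(mention, entity_name) / longest if longest > 0 else 0.0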

4.6.5 Query Space

The Query Space used by Wu et al. (2016) is an interpretation of user queries, with the assumption that queries for an entity should see a significant increase in activity when the entity is emerging. This query data needs to be gathered from commercial search engines. From a rational perspective, it would make a lot of sense if this space were the most significant factor for determining Emerging Entities. Contrary to that belief, Wu et al. (2016) studied the importance of their feature spaces and showed that the Query Space is in fact the least significant one. However, it should still be noted that the Query Space does have a positive impact on both accuracy and recall.

Keeping in mind that the Query Space depends on a third-party API, which would require continuous payment, and considering its relative insignificance to the results, the Query Space was omitted.

4.6.6 Gradient Boosted Decision Tree Classifier

Putting the feature spaces together, a classifier is built that can decide whether a disambiguated entity is correct or whether the mention should instead refer to an emerging entity. Wu et al. (2016) utilized a Gradient Boosted Decision Tree Classifier (Section 2.6). During assembly of the knowledge bases (Section 4.1), a list of which entities were emerging (i.e. present in the future KB but not in the current KB) arose, labeled EE_list. In order to create training data for the classifier, each disambiguated entity is checked against EE_list along with calculating the scores from the four feature spaces. This process generates a dataset with tuples on the form (x1, x2, x3, x4, Y), where xi is derived from a certain feature space (xi ∈ [−1, 1]) and Y ∈ {0, 1}.

The hyperparameters used to train the classifier are based on the hyperparameters used by Wu et al. (2016). However, since the Query Space was omitted and a few assumptions have been made for the other feature spaces, a grid search over a few variations of the hyperparameters is also performed in order to find the optimal configuration for the classifier.
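Such a grid search can be sketched with scikit-learn as below. The feature matrix and labels are placeholders, and the hyperparameter values are illustrative rather than those reported by Wu et al. (2016):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder training data: rows of (x1, x2, x3, x4) feature-space scores and
# labels y = 1 if the mention refers to an emerging entity according to EE_list, else 0.
X = np.random.uniform(-1, 1, size=(1000, 4))
y = np.random.randint(0, 2, size=1000)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, search.best_score_)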

References
