Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Computer Science and Engineering

2019 | LIU-IDA/LITH-EX-A--2019/039--SE

Domain Adaptation for Hypernym Discovery via Automatic Collection of Domain-Specific Training Data

Domänanpassning för identifiering av hypernymer via automatisk insamling av domänspecifikt träningsdata

Johannes Palm Myllylä

Supervisor: Jody Foo
Examiner: Marco Kuhlmann


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Identifying semantic relations in natural language text is an important component of many knowledge extraction systems. This thesis studies the task of hypernym discovery, i.e. discovering terms that are related by the hypernymy (is-a) relation. Specifically, this thesis explores how state-of-the-art methods for hypernym discovery perform when applied in specific language domains. In recent times, state-of-the-art methods for hypernym discovery are mostly made up by supervised machine learning models that leverage distributional word representations such as word embeddings. These models require labeled training data in the form of term pairs that are known to be related by hypernymy. Such labeled training data is often not available when working with a specific language domain. This thesis presents experiments with an automatic training data collection algorithm. The algorithm leverages a pre-defined domain-specific vocabulary, and the lexical resource WordNet, to extract training pairs automatically. This thesis contributes by presenting experimental results when attempting to leverage such automatically collected domain-specific training data for the purpose of domain adaptation. Experiments are conducted in two different domains: one domain where there is a large amount of text data, and another domain where there is a much smaller amount of text data. Results show that the automatically collected training data has a positive impact on performance in both domains. The performance boost is most significant in the domain with a large amount of text data, with mean average precision increasing by up to 8 points.


Acknowledgments

I would like to thank Vida Johansson, Lars Edholm, Evelina Rennes, Christian Smith, Magnus Merkel and Mikael Lundahl at Fodina Language Technology AB for very useful help and feedback during my thesis work. I would also like to thank them for making me feel included in the company, and making the time spent on my thesis work very enjoyable. I also want to thank my supervisor Jody Foo, and my examiner Marco Kuhlmann at Linköping University for their continuous and constructive input on my work.

Linköping, June 2019 Johannes Palm Myllylä


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Background
  1.3 Aim
  1.4 Research questions
  1.5 Delimitations

2 Theory
  2.1 Hypernymy and hyponymy
  2.2 Word embeddings
  2.3 WordNet
  2.4 Hypernymy related work
  2.5 SemEval-2018 task 9: hypernym discovery
  2.6 Domain adaptation

3 Method
  3.1 Selecting the model
  3.2 Task description
  3.3 Reference system
  3.4 The training process
  3.5 Setting up the experiments
  3.6 Evaluation

4 Results
  4.1 Small domain results
  4.2 Large domain results
  4.3 Experiments with frozen embeddings

5 Discussion
  5.1 Results
  5.2 Method

6 Conclusion
  6.1 Future work


List of Figures

2.1 Illustrative example of WordNet synsets linked by hyponymy (inverse of hypernymy).
2.2 Example of a syntactic dependency tree.
2.3 3D visualization of embeddings.


List of Tables

2.1 Example of a co-occurrence matrix
4.1 Results from manual evaluation of 105 example queries for the small domain.
4.2 Example queries and predicted hypernyms from the general-purpose model in the small domain.
4.3 Example queries and predicted hypernyms from the domain-specific model in the small domain.
4.4 Example queries and predicted hypernyms from the mixed model in the small domain.
4.5 Results from initial experiments in the large domain.
4.6 Results when varying amount of domain training data.
4.7 Gold standard hypernyms for the example queries.
4.8 Example queries and predicted hypernyms from the top performing mixed model for the large domain.
4.9 Example queries and predicted hypernyms from the general model for the large domain.
4.10 Results from manual evaluation of 105 example queries for the small domain with frozen word embeddings.
4.11 Example queries and predicted hypernyms from the general model with frozen word embeddings.
4.12 Example queries and predicted hypernyms from the mixed model with frozen word embeddings.

1 Introduction

Automatic knowledge extraction from natural language data is an important problem to solve in the age of virtually unlimited amounts of available data. The field of study that aims to solve this problem is called Computational Linguistics or Natural Language Processing (NLP). Semantic relations such as synonymy, meronymy (part-of) and hypernymy (is-a) make up an important component of many knowledge extraction systems. This thesis focuses on the hypernymy relation. More specifically, this thesis studies hypernym discovery, i.e. extracting hypernyms of a given term from a large vocabulary of candidate terms. An example of two terms that are related by hypernymy is “dog” and “animal”. In this case, “animal” is a hypernym of “dog”. The inverse relation is called hyponymy, i.e. “dog” is a hyponym of “animal”. Hypernym relations are an important part of numerous downstream applications, such as question answering systems and search engines [38, 50, 22]. Hypernym relations also make up the backbone of ontologies and taxonomies.

Supervised distributional machine learning models currently make up the state-of-the-art when it comes to hypernym discovery [10, 6, 17, 2, 51]. Such models need annotated training data, which can be a rare commodity when working within specific domains. The training data for these supervised systems consists of pairs of terms that are known to be related by hypernymy. The goal of this thesis is to develop, apply and evaluate a method for domain adaptation via automatic collection of training data for the target domain.

1.1 Motivation

In many industrial applications of language tools the main point of interest is a smaller and more specific language domain, rather than diverse general-purpose text. For example, one can imagine applications that are specific to medical journal entries or different types of technical specifications. Domain-specific hypernym relations can, for example, be used to build domain-specific, or perhaps even company-specific, ontologies and taxonomies.

Different domains of application tend to yield very different results with regards to performance [2, 10, 6]. This presents the problem of adapting to different domains. There are multiple ways of approaching the problem of domain adaptation, some of which are explored in the Theory chapter. This thesis approaches domain adaptation via a method for automatic training data collection. Data sparsity is a potential issue when dealing with smaller domains. Even in cases where there is plenty of text data available, compiling gold standard training


data that can be used for state-of-the-art supervised models can be an expensive and time consuming process. For this reason it would be interesting to explore to what degree this process can be automated. Another interesting question is how pre-existing general-purpose training data can be leveraged.

1.2 Background

Fodina Language Technology AB is a company based in Linköping, Sweden. Fodina provide a variety of language support tools for their client companies. They currently have a tool called Termograph, which works on clustering synonymous terms, with the purpose of helping companies maintain a consistent terminology. The tool only handles synonymy at this point and Fodina are interested in broadening this by introducing hypernymy. This would add one more layer to the tool, and it could provide support for clients to build company specific ontologies. A model that can extract hypernym-hyponym pairs from a collection of terms would therefore be of interest to Fodina.

SemEval-2018 task 9 is a workshop on semantic evaluation, where task 9 specifically pertains to a task dubbed hypernym discovery [10]. In short, hypernym discovery entails the following: given a query term q, extract as many true hypernyms of q as possible from a pre-defined vocabulary. In earlier work on hypernymy, the task is often reduced to binary classification [5, 9, 48]. Reformulating the task as hypernym discovery moves it closer to real world applications. It also helps with alleviating some innate evaluation advantages for supervised systems [2, 10]. Framing the problem as hypernym discovery also lies close to applications that are of interest to Fodina. This makes results from the SemEval-2018 Task 9 submissions interesting for this thesis.

The best performing system on SemEval-2018 Task 9 was a system proposed by Bernier-Colborne and Barriere [6]. Their system outperformed all baselines and other submissions by a significant margin [10]. Given the success of this system, this thesis will use it as a basis for experiments.

1.3 Aim

The aim of this thesis is to examine how the performance of a state-of-the-art model for hypernym discovery is affected when introducing automatically collected domain-specific training data, and applying it to a small and narrow domain. The domain-specific training data can either be used on its own, or in conjunction with available general-purpose training data. There are three main parts to exploring this:

• Examining to what extent the large amounts of available general-purpose text- and training data can be leveraged, and what kind of requirements this puts on the target domain.

• Examining to what extent the process of collecting domain-specific training data can be fully automated, and how such training data affects performance.

• Examining if combining automatically collected domain-specific training data with general-purpose training data can boost performance with regards to the target domain.

1.3.1 Target domains

The size and nature of the target domain can vary greatly depending on the context. This thesis presents experimental results from two different domains:

• The small domain: A manual with approximately 305 000 tokens of text, and a vocabulary of 11 213 domain-specific terms extracted from the text. The data stems from a client of Fodina.


• The large domain: Medical data from SemEval-2018 task 9, subtask 2A [10]. The corpus contains approximately 130 million tokens. The domain-specific vocabulary contains 93 888 terms. The large domain can be considered close to an ideal case, i.e. a lot of text data available but still a reasonably narrow scope. The small domain is a more realistic representation of what the data from a real client of Fodina might look like.

1.4 Research questions

Given domain-specific data in the form of a text corpus and a set of extracted terms, and given the system proposed by Bernier-Colborne and Barriere trained on general-purpose data as baseline [6], answer the following:

1. How is model performance affected when introducing automatically collected domain-specific training data, when evaluating the model on hypernym discovery within the small domain?

2. How is model performance affected when introducing automatically collected domain-specific training data, when evaluating the model on hypernym discovery within the large domain?

3. In the large domain, how does the amount of introduced automatically collected train-ing data affect model performance?

1.5 Delimitations

The task of extracting possible hypernym candidate terms, i.e. creating a vocabulary of terms from raw text, is not the focus of this thesis. A simple system based on extracting noun phrases will be used when there is no pre-existing term vocabulary. No other languages outside of English are considered.

2 Theory

This chapter provides theoretical explanations and clarifications that are necessary to understand the rest of the thesis. First there is some general background theory, and explanations of the different resources that this thesis work utilizes. After this, the specifics of hypernym discovery and the two main ways of approaching it, i.e. path-based methods and distributional methods, are explored. Then there is an explanation of the SemEval-2018 task 9 workshop, from which the model that was used for experiments was selected. Domain adaptation, which is closely related to the contribution of this thesis, is explored at the end of the chapter.

2.1 Hypernymy and hyponymy

Hypernym and hyponym are two linguistic terms. A hyponym is a word or phrase whose semantic meaning is encompassed by another word or phrase, i.e. its hypernym. Put more simply, a hyponym is in an is-a or type-of relationship with its hypernym. For example, a chimpanzee is an ape, which makes ape a hypernym of chimpanzee. The relation is transitive, e.g. if animal is a hypernym of ape then animal is also a hypernym of chimpanzee.
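The transitivity noted above can be sketched as a reachability check over hypernym pairs; the edges below are the examples from this section, and the recursive lookup is an illustrative simplification, not a method from the thesis.

```python
# Toy hypernym edges taken from the examples above: (hyponym, hypernym).
edges = {("chimpanzee", "ape"), ("ape", "animal")}

def is_hypernym(hyper: str, hypo: str) -> bool:
    """True if `hyper` is a direct or transitive hypernym of `hypo`."""
    if (hypo, hyper) in edges:
        return True
    # Transitivity: if hypo -> mid holds and mid -> ... -> hyper holds,
    # then hypo -> hyper holds as well.
    return any(is_hypernym(hyper, mid) for (h, mid) in edges if h == hypo)

print(is_hypernym("ape", "chimpanzee"))     # True (direct)
print(is_hypernym("animal", "chimpanzee"))  # True (via transitivity)
```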

2.2 Word embeddings

In order to understand distributional systems for hypernym discovery, we first need to understand distributional word representations. A word embedding is an example of such a representation. In practice, a word embedding is a mapping of a word to a vector of real numbers based on the contexts in which the word occurs. There are a number of different methods for achieving this. Perhaps the most intuitive and explanatory methods involve a co-occurrence matrix. A co-occurrence matrix is a matrix over how many times words occur together with other words in a certain context. It is most easily understood via an example. Consider the following sentences: “I like pancakes. I like hamburgers. I hate pizza.”. A co-occurrence matrix over words that appear in the same sentence would look as the one shown in Table 2.1. Each row in the matrix forms a co-occurrence vector for a certain word. The number of dimensions for the vectors will quickly grow as we introduce more text data. There are a number of methods for mapping words to vectors that involve reducing the dimensionality of the co-occurrence matrix to more manageable levels [31, 29, 28].
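As a minimal sketch, the counts in Table 2.1 can be computed from the three example sentences as follows (a simplification that ignores tokenization details such as punctuation and casing):

```python
from collections import defaultdict
from itertools import combinations

sentences = ["I like pancakes", "I like hamburgers", "I hate pizza"]

# Count how often each pair of words appears in the same sentence.
cooc = defaultdict(int)
for sentence in sentences:
    words = sentence.split()
    for w1, w2 in combinations(words, 2):
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1  # the matrix is symmetric

# The co-occurrence vector for "I" (one row of Table 2.1):
vocab = ["I", "like", "pancakes", "hamburgers", "hate", "pizza"]
row_I = [cooc[("I", w)] for w in vocab]
print(row_I)  # [0, 2, 1, 1, 1, 1]
```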


Table 2.1: Example of a co-occurrence matrix

             I   like  pancakes  hamburgers  hate  pizza
I            0    2       1          1         1     1
like         2    0       1          1         0     0
pancakes     1    1       0          0         0     0
hamburgers   1    1       0          0         0     0
hate         1    0       0          0         0     1
pizza        1    0       0          0         1     0

[Figure 2.1: Illustrative example of WordNet synsets linked by hyponymy (inverse of hypernymy): {entity} → {motor vehicle, automotive vehicle} → {car, automobile}, {truck, motortruck}, ...]

One of the most popular techniques for mapping words to vectors was developed by Tomas Mikolov at Google in 2013 [32]. The technique is called Word2Vec and involves using neural networks. The usage of neural nets led to much lower computational costs compared to previous approaches, while maintaining state-of-the-art performance of the word vectors [32]. This thesis work utilizes the Word2Vec technique when learning word embeddings.

The main advantage in having words represented as vectors is that it is possible to perform mathematical operations on the vectors. The typical example which is used to show that these vectors do in fact encode semantic information is the equation: v(king) − v(man) + v(woman) ≈ v(queen).
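This analogy arithmetic can be sketched with toy vectors; the 3-dimensional values below are made up for illustration (real Word2Vec vectors have hundreds of dimensions), but the operation is the same:

```python
import math

# Toy 3-dimensional "embeddings" with illustrative, made-up values.
emb = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.1],
    "woman": [0.7, 0.1, 0.9],
    "queen": [0.8, 0.9, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# v(king) - v(man) + v(woman) should land closest to v(queen).
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine(emb[word], target))
print(best)  # queen
```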

2.3 WordNet

WordNet is a long running English lexical database project with the goal of providing lexical resources for natural language research [34, 13]. It has become one of the most important resources for research in the NLP field. English verbs, nouns, adjectives and adverbs are grouped into synsets, i.e. unordered sets of synonyms. For example, one synset may contain car and automobile. WordNet also encodes a number of relations that link different synsets together. The most common relation, and the most important relation for this thesis, is hypernymy. The hypernymy relation in WordNet links more general synsets with more specific ones, see Figure 2.1. This feature, combined with WordNet’s size of 117 000 synsets, makes it a useful resource for this thesis work.
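The synset structure can be sketched as follows; the toy graph mirrors Figure 2.1 and is an assumption for illustration only (real work would query WordNet itself, e.g. via NLTK's WordNet interface):

```python
# Synsets are unordered sets of synonyms; hypernymy links a more specific
# synset to a more general one, as in Figure 2.1.
hypernym_of = {
    frozenset({"car", "automobile"}): frozenset({"motor vehicle", "automotive vehicle"}),
    frozenset({"truck", "motortruck"}): frozenset({"motor vehicle", "automotive vehicle"}),
    frozenset({"motor vehicle", "automotive vehicle"}): frozenset({"entity"}),
}

def hypernym_chain(synset):
    """Follow hypernym links upward until the root is reached."""
    chain = []
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        chain.append(sorted(synset))
    return chain

print(hypernym_chain(frozenset({"car", "automobile"})))
# [['automotive vehicle', 'motor vehicle'], ['entity']]
```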

2.4 Hypernymy related work

Hypernym relations have been a topic of study in the NLP field for a long time. Early attempts at automatic identification of hypernymy involved looking at patterns in the text data.


In recent times, state-of-the-art systems are increasingly based on working with distributional word representations, such as word embeddings, to infer hypernymy. Methods are usually grouped into two overarching categories: path-based methods, and distributional methods. This section of the thesis explores methods from both of these categories. The experiments in this thesis utilized a state-of-the-art distributional method, see section 2.4.2 for theory regarding this method category. Section 2.4.1 explores path-based methods, and is included for the purpose of providing background.

2.4.1 Path-based methods

Using textual patterns to find hypernym relations was first introduced by Hearst in 1992 [20]. Hearst identified a number of textual patterns that can be used to identify hypernymy from raw text. One example of such a pattern is “such as X, Y or other Z”, which given the sentence “...such as BMW, Mercedes or other car manufacturers.” would identify car manufacturer as a hypernym of BMW and Mercedes. These patterns have come to be known as Hearst patterns. The Hearst patterns fulfill a few criteria that indicate that they are useful for finding hypernym relations [20]:

1. They occur frequently and across many genres.
2. They almost always indicate the correct relation.

3. Little or no pre-encoded knowledge is required to recognize them.
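The “such as X, Y or other Z” pattern above can be approximated with a regular expression. This is a deliberately simplified sketch: real pattern matchers operate over POS-tagged noun phrases rather than raw tokens, which this version ignores.

```python
import re

# Regex sketch of one Hearst pattern: "such as X, Y or other Z".
pattern = re.compile(r"such as (\w+), (\w+) or other ([\w ]+)")

sentence = "...such as BMW, Mercedes or other car manufacturers."
m = pattern.search(sentence)
hyponyms = [m.group(1), m.group(2)]
hypernym = m.group(3)
print(hypernym, hyponyms)  # car manufacturers ['BMW', 'Mercedes']
```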

The Hearst patterns were discovered using a combination of manual observation of written text, and a procedure that leverages term pairs where it is known that the relation holds [20]. There are many known weaknesses in using lexico-syntactic patterns for extracting semantic relations, chief among them being poor recall [10, 36, 6]. Recall suffers primarily due to the innate requirement that hyponym and hypernym occur together somewhere in the text corpus. Many extensions to the basic lexico-syntactic pattern-based approach have been proposed to alleviate these weaknesses.

In their work on definition and hypernym extraction (essentially synonymous with hypernym discovery), Navigli and Velardi propose Word Class Lattices (WCLs) to model textual definitions and extract hypernym relations [36]. A lattice is a directed graph without cycles. For NLP-related tasks, word lattices have been used to represent sequences of symbols in a compact manner. WCLs are a generalization of word lattices [36]. Navigli and Velardi construct the WCLs based on a clustering of generalized sentences. The approach works on definitional sentences, and is reliant on a formal notion of textual definition where it is assumed that a given sentence contains [44]:

• A DEFINIENDUM field (the word being defined).

• A DEFINITOR field (the verb phrase used to introduce the definition).
• A DEFINIENS field (includes the genus phrase).

• A REST field (clauses that provide further specificity).

When generalizing a sentence, they first replace the DEFINIENDUM field of the sentence with the token <TARGET>. After this, each word is replaced with a class according to the following criteria: if the frequency of a given word in the training set exceeds a certain threshold, the class is the word itself. If not, the class is its part-of-speech (POS) tag. Note that this means that the <TARGET> token will be kept as is. An example of how a simplified sentence may look is: “In NN, a <TARGET> is a JJ NN”, where NN and JJ are POS tags for nouns and adjectives respectively [36]. These generalized sentences are then pre-processed and generalized further into star patterns [36]. The sentences are clustered based on their star pattern, and a WCL is created for each sentence cluster using a greedy alignment algorithm [36]. When evaluating this approach on the task of hypernym extraction using the ukWaC data set [14], it outperformed the basic approach with Hearst patterns by some margin [36].

[Figure 2.2: Example of a syntactic dependency tree.]
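The generalization step can be sketched as follows; the frequency list, POS tags and example sentence are made-up assumptions for illustration, not data from Navigli and Velardi.

```python
# Toy word-frequency list and POS lookup (assumptions for the sketch).
frequent = {"in", "a", "is"}
pos_tags = {"geometry": "NN", "polygon": "JJ", "shape": "NN",
            "closed": "JJ", "plane": "NN"}

def generalise(sentence: str, definiendum: str) -> str:
    """Replace the definiendum with <TARGET>, keep frequent words,
    and replace everything else with its POS tag."""
    out = []
    for word in sentence.lower().split():
        if word == definiendum:
            out.append("<TARGET>")
        elif word in frequent:
            out.append(word)
        else:
            out.append(pos_tags.get(word, "NN"))
    return " ".join(out)

print(generalise("In geometry a polygon is a closed shape", "polygon"))
# in NN a <TARGET> is a JJ NN
```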

Boella and Di Caro propose a method for finding hypernym relations where they leverage syntactic dependency paths in place of lexico-syntactic patterns [8]. Figure 2.2 shows an example of a dependency tree for a POS-tagged sentence. Their approach works on the same formal notion of a definitional sentence as stated in the previous paragraph. Boella and Di Caro innovate by treating the problem of hypernym extraction as two separate classification tasks [8]: given the hypernym(x, y) relation within a sentence, find a possible x and y. If more than one x and/or more than one y is found, pair them up based on the distance between them in the syntactic dependency tree. The main advantages to this approach, as stated by Boella and Di Caro, are that there is no need to create abstract patterns (which can be unstable) between target terms, and that the approach can easily be generalized to non-definitional sentences by treating any sentence where there are no good candidates for x and/or y as non-definitional [8]. Two separate binary classification models are trained using a Support Vector Machine classifier. One classifier has the purpose of finding x, and the other has the purpose of finding y. The input for the model is POS-tagged sentences with corresponding syntactic dependency trees. The model considers all words with a POS corresponding to a noun as candidates for x and y. Each candidate is classified in a binary fashion as positive or negative, signifying whether or not the candidate is a good choice for x or y (depending on which classifier is used). The training data consists of two sets with marked xs or ys respectively [8]. This approach outperformed the state-of-the-art at the time, including the WCL approach by Navigli and Velardi, on the task of hypernym relation extraction [36, 8].

Pavlick and Pasca make the observation that most existing approaches for hypernym identification treat the hypernym class in a relation as something that must be observed word-for-word in the text [37]. They note that this can be very problematic when dealing with fine-grained classes, such as “Flemish still life painter”, as there are many different ways to express the class, most of which will not be present in the text. They propose a compositional approach based on interpreting modifiers in the class name relative to the head [37]. This is based on a notion from formal semantics where modifiers constitute properties that differentiate between subclass and superclass [21]. As an example, the modifier “1950s” differentiates between the class “1950s musicians” and its superclass “musicians” [37, 21]. Their model learns property profiles for class names. An example of such a property profile for the class “Led Zeppelin song” is {“Led Zeppelin write *”, “Led Zeppelin play *”, “Led Zeppelin have *”}, where “*” is a wildcard [37]. These property profiles can then be of aid when finding instances of high granularity class names, even if the class names are not written verbatim in the text.
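The property-profile idea can be sketched as a lookup over extracted facts. The profile, the fact set and the all-patterns-must-match rule below are made-up assumptions for illustration, not the actual model of Pavlick and Pasca.

```python
# A toy property profile for one fine-grained class (patterns from the
# example above) and a toy set of (predicate, instance) facts.
profile = {"Led Zeppelin song": ["Led Zeppelin write *", "Led Zeppelin play *"]}
facts = {("Led Zeppelin write", "Kashmir"), ("Led Zeppelin play", "Kashmir")}

def matches(instance: str, class_name: str) -> bool:
    """Accept an instance if every profile pattern is supported by a fact,
    even though the class name itself never occurs verbatim."""
    for pattern in profile[class_name]:
        predicate = pattern.replace(" *", "")
        if (predicate, instance) not in facts:
            return False
    return True

print(matches("Kashmir", "Led Zeppelin song"))             # True
print(matches("Stairway to Heaven", "Led Zeppelin song"))  # False
```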


Shwartz et al. present an approach for detecting hypernymy based on applying an LSTM network across syntactic dependency paths [42]. They use the LSTM network to encode the sequence of edges in the dependency tree, rather than using the full dependency path as a single feature. The expected result is that the LSTM encoder will focus on more informative parts of the path and ignore other parts. The edges in a dependency path are given an internal representation by concatenating component vectors:

v_e = [v_l, v_pos, v_dep, v_dir]   (2.1)

where v_l is the word embedding, v_pos is the POS, v_dep is the dependency label and v_dir is the dependency direction. A full dependency path is represented as a sequence of its edge vectors v_e. Note that this method utilizes word embeddings for word representations, meaning that this model incorporates both path-based and distributional information. More about this in section 2.4.3. The problem is treated as a classification problem, where the task is to classify a given pair of terms (x, y) as related or not. For classification, each pair (x, y) is represented as the weighted average of its path vectors, i.e. all paths that connected x and y in the corpus. This is then fed to a softmax activation function for classification. The network is trained by minimizing cross entropy using gradient descent [42].
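Equation 2.1 amounts to concatenating four lookup-table vectors per edge. The sketch below illustrates this; the dimensionalities and random lookup tables are arbitrary assumptions, not the actual setup of Shwartz et al.

```python
import random
random.seed(0)

def embedding(table: dict, key: str, dim: int) -> list:
    """Toy lookup table: create a random vector on first access."""
    if key not in table:
        table[key] = [random.uniform(-1, 1) for _ in range(dim)]
    return table[key]

word_emb, pos_emb, dep_emb, dir_emb = {}, {}, {}, {}

def edge_vector(word, pos, dep, direction):
    # v_e = [v_l, v_pos, v_dep, v_dir]: plain list concatenation.
    return (embedding(word_emb, word, 50) + embedding(pos_emb, pos, 4)
            + embedding(dep_emb, dep, 4) + embedding(dir_emb, direction, 1))

v_e = edge_vector("dog", "NN", "nsubj", "up")
print(len(v_e))  # 59 = 50 + 4 + 4 + 1
```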

Bernier-Colborne and Barriere propose an extension to the basic path-based approach, to counteract the weakness that these methods normally require the hyponym-hypernym pair to explicitly co-occur at some point in the corpus [6]. This is presented as part of a hybrid system that combines a path-based component and a distributional component. This thesis uses the distributional component for experiments [6]. See section 3.3 for explanations regarding the full hybrid system.

2.4.2 Distributional methods

As previously stated, a weakness shared between all path-based approaches is that they are reliant upon co-occurrence within the text corpus. Even the extended approach by Bernier-Colborne and Barriere that is designed to alleviate this weakness still relies on co-occurrence with a certain hyponym, or one of its co-hyponyms [6]. By relying on a distributional representation of words, it is possible to perform lexical and semantic inference between words and phrases that never co-occur in the corpus. Methods that leverage distributional word representations have recently been known to outperform their path-based contemporaries when working with hypernymy.

Kotlerman et al. propose an asymmetric measure of distributional similarity for the purposes of lexical inference [27]. In particular, their results when working with lexical entailment are of interest with regards to hypernym relations. They use the intuition that lexical entailment constitutes the asymmetric ability of one term to substitute another [27]. As an example, animal can be a substitute for dog since all dog contexts are also animal contexts. The reverse is not true, hence the asymmetry. In this case, entailment between two word embeddings would correspond to one of the embeddings including contexts or features of the other, but not vice versa. The aim of the similarity measure is therefore to quantify inclusion of distributional features [27]. Empirical results demonstrate that their asymmetric similarity measure is beneficial on two separate datasets [27]. These results suggest that distributional word representations do encode information that can be leveraged to perform lexical inference. In closely related work, Baroni et al. present experimental results which suggest that distributional representations can be used for lexical entailment above word level [4]. These methods are limited in performance because the central hypothesis regarding substitution of terms does not always hold. They also struggle with distinguishing co-hyponymy and meronymy from hypernymy.
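The feature-inclusion intuition can be sketched with toy weighted context features. The weights below are made up, and this naive ratio is only an illustration; the actual measure of Kotlerman et al. is more involved.

```python
# Toy weighted context features for "dog" and "animal" (made-up values).
dog = {"bark": 0.9, "leash": 0.7, "pet": 0.8}
animal = {"bark": 0.3, "leash": 0.2, "pet": 0.6, "habitat": 0.9, "wild": 0.5}

def inclusion(narrow: dict, broad: dict) -> float:
    """Fraction of the narrow term's feature weight covered by the broad term."""
    covered = sum(w for f, w in narrow.items() if f in broad)
    return covered / sum(narrow.values())

# All "dog" features are also "animal" features, but not vice versa:
print(inclusion(dog, animal))  # 1.0
print(inclusion(animal, dog))  # lower: the reverse does not hold
```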


[Figure 2.3: 3D visualization of embeddings.]

Projection learning

The distributional model used for this thesis is based on projecting the word embedding of a given term close to embeddings of its hypernyms. Figure 2.3 shows a 3D representation of 200-dimensional word embeddings (the TensorFlow embedding projector was used for visualization). In the figure, a hyponym-hypernym pair is highlighted. Projection-based approaches aim to find a function that can project the embedding of a hyponym (dog) close to the embeddings of its hypernyms (e.g. animal). Fu et al. were the first to propose such an approach, commonly referred to as projection learning, when working with hypernym relations [17]. Their study was conducted on the Chinese language. Fu et al. first conducted a preliminary experiment based on the observation by Mikolov et al. that word embeddings preserve features that can capture certain syntactic and semantic information [33]. Taking inspiration from the commonly used example v(king) − v(queen) ≈ v(man) − v(woman), they demonstrate that this offset-based property holds for some hypernym-hyponym word pairs, but not for all. By computing the offset between all pairs in their training data and visualizing the result, they observe that the offsets cluster together in multiple smaller clusters. This indicates that, while a single offset is not enough to characterize hypernymy, it is possible to decompose the relation into multiple more fine-grained relations [17]. Figure 2.4 shows an example of what these clusters might look like. Note that Figure 2.4 is not based on actual data, it is just an illustrative example. Based on these observations, Fu et al. propose using projection learning.

The intuition behind projection learning is that all word vectors can be projected to the word vectors of their hypernyms using a uniform transition matrix. In other words, given a hyponym-hypernym pair (x, y) there exists a matrix Φ so that y = Φx. Finding the exact Φ that can project all pairs correctly is a difficult problem. However, an approximation of Φ can be learned. Fu et al. learn this approximation Φ* by minimizing the mean squared error



Figure 2.4: Example of what clusters of offsets might look like.

across their training data (extracted from a Chinese semantic thesaurus) [17]:

\Phi^* = \arg\min_{\Phi} \frac{1}{N} \sum_{(x,y)} \|\Phi x - y\|^2 \qquad (2.2)

N denotes the number of pairs in the training data. Stochastic gradient descent is used for optimization.

Fu et al. note that a single uniform linear projection may still not be enough to fit all word pairs, given the number of clusters observed in the initial offset experiment [17]. Motivated by this, they cluster the word pairs in the training data based on their respective pairwise offset and then learn separate matrices for each cluster:

\Phi_k^* = \arg\min_{\Phi_k} \frac{1}{N_k} \sum_{(x,y) \in C_k} \|\Phi_k x - y\|^2 \qquad (2.3)

N_k denotes the number of pairs in the k-th cluster C_k.

When clustering and training is complete, Fu et al. identify whether or not two words x and y constitute a hypernym-hyponym pair (x →_H y) via the following procedure: First, find the cluster C_k whose center is closest to the offset x − y and obtain the corresponding matrix Φ_k. Then, for x →_H y to hold, one of the following must be true [17]:

1. The Euclidean distance between Φ_k x and y is less than a certain threshold δ.

2. There exists another word z that satisfies x →_H z and z →_H y. Here, transitivity of the hypernym relation is used.

This projection learning approach by Fu et al. significantly outperformed state-of-the-art methods at the time [17].
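To make the single-cluster case concrete, the sketch below fits Equation 2.2 on synthetic embeddings. It is an illustrative toy: the embeddings, noise level, and threshold δ are invented here, and a closed-form least-squares solve stands in for the stochastic gradient descent actually used by Fu et al.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: hyponym embeddings X and hypernym embeddings Y, one pair per row.
# In the thesis the pairs come from a semantic thesaurus; here they are
# synthetic, generated from a known projection plus a little noise.
d, n = 5, 200
true_Phi = rng.standard_normal((d, d))
X = rng.standard_normal((n, d))
Y = X @ true_Phi.T + 0.01 * rng.standard_normal((n, d))

# Equation 2.2: Phi* = argmin_Phi (1/N) sum ||Phi x - y||^2.
# With a single cluster this is ordinary least squares, so a closed-form
# solve suffices (Fu et al. use stochastic gradient descent instead).
B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X @ B ≈ Y
Phi_star = B.T                             # hence Phi_star @ x ≈ y

# Detection criterion 1: x ->_H y if ||Phi* x - y|| is below a threshold delta.
def is_hypernym(x, y, delta=0.5):
    return bool(np.linalg.norm(Phi_star @ x - y) < delta)
```

On this toy data the recovered Φ* lies close to the generating matrix, and pairs drawn from the same distribution fall under the threshold.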

Yamane et al. note that most studies regarding hypernymy reduce the problem to binary classification, i.e. determining whether or not a given word pair is related by hypernymy [51]. Yamane et al. instead propose expanding the scope of the problem to encompass hypernym generation (essentially synonymous with hypernym discovery), i.e. generate the best hypernym candidates for a given word. This is a more practical problem that lies closer to an actual application. Yamane et al. propose a projection learning approach for hypernym generation [51]. Though there are many similarities with the previous projection learning approach by Fu et al., there are some important differences [51, 17]:



• Yamane et al. jointly learn clusters and projections, in contrast with treating clustering and projections completely separately. Fu et al. base their clusters on the offset between terms. According to Yamane et al. this may not be appropriate for learning hypernymy. Also, since Fu et al. train projection matrices independently, similarity scores across different clusters may not be comparable. By joint learning, clusters and parameters are well tuned for hypernym generation, and similarity measures are consistent [51].

• Yamane et al. determine the number of clusters automatically based on a similarity threshold.

• Yamane et al. use the same similarity measure when training the word embeddings and when training the model.

• Yamane et al. employ negative sampling, i.e. include negative non-hypernym instances in their training data. This helps in keeping projections far away from incorrect hypernyms, and it also helps in dealing with non-hypernym instances when doing prediction [51].

The model generates hypernyms for a given word by calculating a pairwise score for the word with all other words in the vocabulary, and then selecting the words which gave the highest scores. The score is based on an inner product similarity measure:

\mathrm{sim}_c(x_i, y_i) = \sigma(\Phi_c x_i \cdot y_i + b_c) \qquad (2.4)

where x_i and y_i are two word embeddings, Φ_c is the projection matrix for the c-th cluster, b_c is the bias for the c-th cluster, and σ denotes a logistic sigmoid function. To calculate the score for a query word q and a word w from the vocabulary, first identify the cluster c that gives the highest similarity between q and w, as defined by Equation 2.4. The score for the pair (q, w) is then calculated according to:

\mathrm{score}_{gen}(q, w) = \max_c \mathrm{sim}_c(q, w) \qquad (2.5)

When jointly learning clusters and projections, the following objective function is maximized:

J = \sum_{c=1}^{k} \sum_{(x,y) \in P_c} \Big( \log \mathrm{sim}_c(x, y) + \sum_{i=1}^{m} \log\big(1 - \mathrm{sim}_c(x, y'_i)\big) \Big) \qquad (2.6)

where k is the current number of clusters, y'_i is a negative sample, and m is the number of negative samples. The training algorithm can be summarized as follows [51]:

• Input consists of training pairs (x_1, y_1), ..., (x_n, y_n) and a similarity threshold λ.

• Start with a single cluster, initialize its projection matrix Φ_1 as a randomized matrix and its bias b_1 as 0.

• Repeat the following until convergence or until reaching the maximum number of epochs: For each (x_i, y_i), compute score_gen(x_i, y_i). If the score is less than the threshold λ, initialize a new cluster. Update cluster parameters according to Equation 2.6.
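A minimal sketch of this training loop is given below. It is heavily simplified relative to the actual algorithm of Yamane et al.: there is no convergence check, plain gradient ascent on Equation 2.6 replaces their optimization details, and the data is an invented toy. The cluster-spawning logic is the part being illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim(Phi, b, x, y):
    """Equation 2.4: logistic sigmoid of the projected inner product."""
    z = (Phi @ x) @ y + b
    return 1.0 / (1.0 + np.exp(-z))

def new_cluster(d):
    # Identity plus small noise, so projections start near the original vector.
    return {"Phi": np.eye(d) + 0.01 * rng.standard_normal((d, d)), "b": 0.0}

def train(pairs, vocab, lam=0.7, lr=0.05, epochs=3, m=2):
    d = len(pairs[0][0])
    clusters = [new_cluster(d)]                  # start with a single cluster
    for _ in range(epochs):
        for x, y in pairs:
            scores = [sim(c["Phi"], c["b"], x, y) for c in clusters]
            best = int(np.argmax(scores))        # Equation 2.5: best cluster
            if scores[best] < lam:               # below threshold: spawn one
                clusters.append(new_cluster(d))
                best = len(clusters) - 1
            c = clusters[best]
            # Gradient ascent on log sim_c(x, y) (positive part of Eq. 2.6)...
            g = 1.0 - sim(c["Phi"], c["b"], x, y)
            c["Phi"] += lr * g * np.outer(y, x)
            c["b"] += lr * g
            # ...and on log(1 - sim_c(x, y')) for m negative samples.
            for _ in range(m):
                y_neg = vocab[rng.integers(len(vocab))]
                g_neg = -sim(c["Phi"], c["b"], x, y_neg)
                c["Phi"] += lr * g_neg * np.outer(y_neg, x)
                c["b"] += lr * g_neg
    return clusters
```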

When evaluating the model, Yamane et al. found that it significantly outperformed previous models that could be used for hypernym generation [51, 17]. By defining a score threshold, they could also perform a comparison with classification models. The projection learning model by Yamane et al. performed on par with state-of-the-art classification models [51].

In their submission to SemEval-2018 Task 9, Bernier-Colborne and Barriere propose a hybrid approach for hypernym discovery [6]. This approach combines a supervised distributional component and an unsupervised pattern-based component. The distributional model



is based on projection learning and is similar to the clustering model by Yamane et al., but there are several differences [51, 6]. To name a few: the model performs soft clustering rather than hard clustering, a fixed number of clusters is used, and the training algorithm is different in several ways. This thesis uses the distributional component of the system by Bernier-Colborne and Barriere for experiments. Section 3.3 in the Method chapter gives a detailed explanation of the system.

Lexical memorization

Despite the good empirical performance of distributional methods, Levy et al. argue that these methods do not actually learn relations between two words [30]. They present experimental results which suggest that they instead learn a property of one of the words, independent of the other. According to Levy et al., the methods tend to learn whether or not a word is a “prototypical hypernym” [30]. In their results, they observe the phenomenon of Lexical Memorization. This is where a model learns that a specific word in a specific slot strongly indicates the relation. For example, the model may learn that in a (hyponym, hypernym) word pair the word person in the hypernym slot strongly indicates that the relation holds. This would lead to many pairs containing the word person being incorrectly classified as positive instances. Levy et al. perform their experiments using a number of different distributional word representations and five different datasets. They tried representing the word pairs (x, y) by concatenating them, using the differential, and using only x and y respectively [30]. Two classifiers were trained and tested for each representation: a logistic regression classifier with L_1 or L_2 regularization, and an SVM classifier with a linear kernel or a quadratic kernel [30]. According to Levy et al., more sophisticated distributional methods may be able to extract relational information, but it might also be the case that the word representations simply do not contain the necessary information [30].

Lexical memorization could be a significant factor when using training data that has been automatically collected from a lexical resource. Since no training examples are added manually, the diversity of the training data is limited by the lexical resource. If the diversity turns out to be very poor, i.e. few unique hypernym examples, significant lexical memorization can be expected.

2.4.3

Distributional and path-based hybrids

Due to the fact that distributional methods have lately been outperforming their path-based contemporaries, it could be reasonable to question whether or not the latter have become obsolete. However, Shwartz and Dagan present a study which suggests that path-based information can be used in conjunction with distributional information to great effect [41]. There are methods that combine path-based and distributional methods and achieve state-of-the-art performance [42][6].

2.5

SemEval-2018 task 9: hypernym discovery

SemEval is a series of workshops on semantic evaluation. Each workshop has a number of NLP-related tasks associated with it. Each task usually contains a task description, datasets for training and evaluation, and an evaluation script. Anyone who wishes to participate may submit a system. SemEval-2018 is the 12th installment in the series. The experiments for this thesis work were conducted using a system that was submitted to the SemEval-2018 task 9 workshop [10]. The following sections will describe the workshop and the provided resources.



2.5.1

Task description

Previous benchmarks pertaining to hypernym relations have generally characterized the problem as a binary classification task, i.e. a system is given a pair of two terms and has to decide whether or not the two terms constitute a hypernym-hyponym pair [5, 9, 48]. In contrast to a simple binary classification task, hypernym discovery entails the following: given a search term and a vocabulary, discover the best hypernym candidates for the search term [10, 2].

More specifically with regards to SemEval-2018 task 9, Camacho-Collados et al. describe the hypernym discovery task as follows: Given a target term, a large textual corpus, and a vocabulary, retrieve as many suitable hypernyms of the target term as possible [10]. For the purposes of SemEval-2018, the overarching task of hypernym discovery consisted of a number of subtasks [10]. Only the three subtasks dealing with English are of interest to this thesis. These subtasks are:

• General-purpose English (subtask 1A): Discover hypernyms in a large corpus of general English text. Has a gold standard of 3 000 labeled terms.

• Medical English (subtask 2A): Discover hypernyms in a smaller corpus of English medical text. Has a gold standard of 1 000 labeled terms.

• English Music (subtask 2B): Discover hypernyms in a smaller corpus of English text related to the domain of music. Has a gold standard of 1 000 labeled terms.

2.5.2

Data description

The data for each subtask has three main components: a corpus of English text, a corresponding vocabulary, and a set of gold standard marked terms split into a training set and an evaluation set.

The gold standard marked terms have the same format in all three subtask datasets. The gold standard data consists of two files: one file containing example input terms and one file containing a set of manually verified hypernyms for each input term. For instance, the first input term from the general purpose training data is blackfly, and its corresponding gold standard hypernyms are homopterous and insect. There are 3 000 example input terms like this for the general purpose English subtask, and 1 000 for each of the domain specific subtasks. The terms have been split 50/50 for training and evaluation.

The UMBC WebBase corpus, containing 3 billion words, is used for general-purpose English [19]. It is composed of paragraphs from many different domains on the web. For the medical domain the team behind SemEval-2018 task 9 provide a compilation of texts taken from the MEDLINE repository. This repository contains academic documents from the medical field. The extracted corpus contains 130 million words. For the music domain, the team compiled a corpus by concatenating several different previously existing music-related corpora. The end result contains 100 million words.

The vocabularies for each subtask were constructed using their respective corpora. Words that occur at least N times (five for general-purpose and three for domain specific) were included. Bigrams and trigrams that were encountered during the process of creating the gold standard data were also included, if they occurred at least N times in the corpus [10].

The process for creating the gold standard data consists of two steps: first collecting input terms, and then extracting hypernyms for these terms. The collection of input terms was carried out according to the following procedure. Candidate terms were automatically extracted from the corpus. BabelDomains, originally presented by Camacho-Collados and Navigli [11], was leveraged to get candidate terms from a diverse and representative number of domains. These candidate terms were then subject to a number of constraints. They were



required to occur at least five and three times in the general-purpose and domain-specific datasets respectively. Only terms up to and including trigrams were included. Lastly, any terms for which no hypernyms could be extracted during the hypernym extraction step were removed from consideration. Once a list of candidates had been produced according to this automatic procedure, the list was subject to extensive manual refinement. This included changing plurals to singulars, capitalizing named entities and removing terms that were deemed too vague or general.

Once a list of input terms had been collected and refined, the extraction of hypernyms for these terms could begin. First, candidate hypernyms were extracted automatically from the taxonomies WordNet, Wikidata, MultiWiBi and Yago [34, 47, 16, 1]. In addition to this, SnomedCT and MusicBrainz were used for the medical and music domain respectively [43, 45]. The extraction algorithm can be described as follows: For each term in the list, first retrieve all BabelNet [35] synsets which have the given term as a lexicalization. Second, for each synset, visit the parent nodes in all previously mentioned taxonomies, up to five levels. At each node, extract all BabelNet lexicalizations of the synset that also appear in the vocabulary file. This list of candidate hypernyms was then validated and expanded by human annotators using a combination of crowdsourcing and expert verification [10].

2.5.3

Evaluation

Since the task is no longer formulated as a classification problem, there was a decision to move away from the classic precision, recall and F_1 metrics when evaluating participating systems [10]. Instead the task was evaluated as a soft ranking problem, where systems were evaluated on their top 15 discovered hypernyms for each input term. Performance could then be evaluated with the Information Retrieval metrics Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).

MAP is calculated as follows:

\mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q) \qquad (2.7)

Where Q denotes the set of queries from a certain experimental run, and AP(q) denotes average precision, i.e. the average correctness of each discovered hypernym for a particular query. The purpose of this metric is to get a good estimate of a system's capability to retrieve a sizeable number of hypernyms [10]. MAP was the main evaluation metric for submitted systems, and it is also the metric that will be used when evaluating experiments for this thesis.
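To illustrate, here is a straightforward implementation of these metrics for the top-15 setting. The normalization details (in particular the denominator of AP) are an assumption made here; the official SemEval-2018 scorer defines the exact computation.

```python
def average_precision(predicted, gold, k=15):
    """AP(q): precision at each rank where a gold hypernym is found,
    averaged over the gold set (truncated to the top k predictions)."""
    hits, total = 0, 0.0
    for rank, term in enumerate(predicted[:k], start=1):
        if term in gold:
            hits += 1
            total += hits / rank
    return total / min(len(gold), k) if gold else 0.0

def mean_average_precision(runs):
    """Equation 2.7: the mean of AP(q) over all queries Q.
    `runs` is a list of (predicted_list, gold_set) pairs."""
    return sum(average_precision(p, g) for p, g in runs) / len(runs)

# A perfect single-query run scores 1.0; a hit at rank 2 out of one gold
# hypernym scores 0.5.
mean_average_precision([(["animal", "insect"], {"insect"})])
```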

2.6

Domain adaptation

Constructing labeled training data for supervised machine learning models is a time consuming task. For most supervised distributional models to work optimally, the training and test data should be drawn from the same distribution. However, there are many cases when this is not possible. There may be labeled training data available in one domain, but not in the domain that we are interested in. Consider the general-purpose training data provided via SemEval-2018 task 9 [10]. If we were to train a supervised distributional model using this data and apply it in a new domain, it is probably unreasonable to assume that this training data is a perfect representation of the target domain. This presents us with the task of adapting models to different domains, usually referred to as domain adaptation. The contribution of this thesis is to apply and evaluate an approach for domain adaptation based on automatically collecting domain-specific training data. This section will give background on different means of domain adaptation within the NLP field.

There have been many different approaches proposed for domain adaptation in the NLP field. Qi and Li present a 2012 survey on different domain adaptation algorithms [39]. In



their survey they identify three main classes of algorithms: Feature Space Transformation (FST), Prior Based Adaptation (PBA) and Instance Selection/Weighting (ISW).

FST algorithms operate on the feature level and attempt to adapt to new domains by changing the feature space or selecting a subset of features. The goal is to transfer the feature space X from the source domain to a new feature space X' which is predictive of the target domain [39, 23, 7, 24]. The main strengths of FST algorithms in general are that they are easily extendable to multiple domains, and that they can handle large gaps between source and target domain. One weakness is that they are moderately dependent on the underlying model.

PBA algorithms work at the model level by changing the priors estimated on the source domain to new priors that are appropriate for the target domain [39, 12, 15, 40]. In short, a prior p(y) is the a priori probability of a given output y. The following equation over the distribution of output y given input x illustrates the concept:

P(y|x) = \frac{p(x|y) \cdot p(y)}{\sum_{y'} p(x|y') \cdot p(y')} \qquad (2.8)

We can see in the equation above that having priors that are poorly adapted to the target domain will result in a distribution that does not represent the domain well. PBA algorithms are, like FST algorithms, easily extendable to multiple domains, and they can also handle large gaps between domains. Since they work at the model level these algorithms have a drawback in that they are usually strongly dependent on the underlying model.
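The effect can be seen in a small numeric example. The likelihoods and priors below are invented for illustration; the point is that fixed likelihoods p(x|y) combined with different priors p(y) yield different posteriors P(y|x), as in Equation 2.8.

```python
def posterior(likelihoods, priors):
    """Bayes' rule (Equation 2.8): P(y|x) ∝ p(x|y) * p(y), normalized over y."""
    joint = {y: likelihoods[y] * priors[y] for y in likelihoods}
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

likelihood_x = {"pos": 0.6, "neg": 0.2}  # p(x|y), assumed fixed across domains
source_prior = {"pos": 0.5, "neg": 0.5}
target_prior = {"pos": 0.1, "neg": 0.9}  # the positive class is rarer in the target

print(posterior(likelihood_x, source_prior)["pos"])  # 0.75
print(posterior(likelihood_x, target_prior)["pos"])  # 0.25
```

With the source priors the positive class wins comfortably; under the target priors the same evidence no longer supports it, which is exactly the mismatch PBA algorithms correct.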

ISW algorithms work at the level of data instances by selecting and/or weighting instances in the labeled training data [39, 25, 3, 49]. The intuition behind these algorithms is that we can adapt the training data by removing (or lowering the weights of) instances in the training data that have “misleading” properties with regards to the target domain, and keeping (or increasing the weights of) instances that represent the target domain well. As these algorithms work at the data level, they have the advantage of being largely independent from underlying models. The drawbacks are that they usually only work well if the gap between source domain and target domain is reasonably small, and they are not easy to extend to multiple domains.

More specifically regarding domain adaptation within hypernym discovery, Espinosa Anke et al. propose a clustering and projection-based approach where terms are clustered based on domain [2]. This approach is based on the intuition that hypernymy is characterized by similar distributional properties within a domain, but that these may differ between domains. The purpose of domain adaptation in the work by Espinosa Anke et al. is to adapt to many different domains at once, thus ending up with a general-purpose model that should work well on data from many diverse domains. The purpose of this thesis work is different in that we are only interested in performance within a single domain. How well the model performs in other domains is not of interest to this thesis. The experiments in this thesis examine whether or not we can leverage automatically collected training data to improve performance within a specific target domain. This corresponds to an industrial context where the model has to perform well on the data of a certain client, and performance on other data is not of interest.

2.6.1

Automatic training data acquisition

The idea of acquiring training data automatically for new domains that have not been through the process of manual annotation is attractive, as it removes the large time investment that goes into manual annotation. There are results on different NLP tasks which suggest that using lexical resources to automatically acquire training data can be effective [46, 18, 17]. In their pioneering work on hypernymy via projection learning, Fu et al. utilize a Chinese semantic thesaurus to acquire their training data [17]. The team behind SemEval-2018 task 9 leverage lexical resources such as WordNet as a basis when constructing training



data and gold standard evaluation data [10]. However, the SemEval data is also subject to extensive manual refinement.

The intuition behind using automatically collected domain-specific training data for the purpose of domain adaptation is similar to that of ISW algorithms. The work is done at data level, and the goal is to make the data a better fit for the target domain. There are also similar benefits and drawbacks. Since the adaptation is performed at the level of training data instances, it is essentially independent of the underlying model. Any model that utilizes the same type of training data, i.e. hyponym-hypernym pairs, can utilize the method.


3

Method

The method for answering the research questions has four main components:

1. Selecting a state-of-the-art supervised distributional model to be used for hypernym discovery.

2. Defining and implementing an automatic procedure for collecting domain specific training data.

3. Outlining which domains should be considered, and what data setups are to be used for the different experimental runs.

4. Evaluating the results. A procedure for manual evaluation has to be specified for situations where there is no gold standard data available.

This chapter explains all these components. First, everything surrounding the selected model and the task on which it will be evaluated is addressed. After this, the automatic collection of training data and the setup for the experiments are explained. The training data collection and the experiments involving said training data make up the main contribution of this thesis.

3.1

Selecting the model

The process of selecting a model for this thesis work was guided by results from the SemEval-2018 Task 9 submissions [10]. The top performing system used proven techniques, with some new additions, and outperformed the baselines and competing submissions by a large margin [10, 6]. With this in mind, the supervised distributional component of the hybrid system proposed by Bernier-Colborne and Barriere was selected as the model to be used in the experiments for this thesis work [6]. Section 3.3 explains the technical details of the system.

Bernier-Colborne and Barriere performed a cross-evaluation on their system by training on the general-purpose training data and evaluating on the domain specific evaluation data. They achieved promising results [6, 10]. The results from this cross-evaluation further motivate the selection of this model, as they indicate that the model has potential to work well in a specific domain while being trained on the general-purpose data. This kind of cross-evaluation is what forms the baseline for comparisons for this thesis work.



3.2

Task description

Domain-specific hypernym discovery is the task on which the system will be evaluated. Here follows a task definition:

Given a query term q from a vocabulary v of domain-specific terms, extract the 15 best hypernym candidates from within the vocabulary v. The number of hypernyms to extract was set to 15 based on the SemEval-2018 task 9 description [10].

3.3

Reference system

The model used for hypernym discovery in this thesis is based on the system described in the paper CRIM at SemEval-2018 Task 9: A Hybrid Approach to Hypernym Discovery by Bernier-Colborne and Barriere [6]. Their system utilizes a combination of an unsupervised pattern-based component, and a supervised distributional component. For the purposes of this thesis work, only the distributional component of the system was used. The unsupervised pattern-based component is not affected by the training data, hence it is not a point of interest when answering the research questions. A description of the full hybrid system is still included for the sake of completeness. The following sections will explain the system, originally presented by Bernier-Colborne and Barriere [6].

3.3.1

Pattern-based component

The pattern-based approach used by Bernier-Colborne and Barriere is an extension of the basic pattern-based approach for hypernym discovery. A weakness of the basic pattern-based approach is that it requires hypernym and hyponym to occur together in the same sentence. To counteract this limitation, Bernier-Colborne and Barriere extend the basic approach in a few ways. Their extended pattern-based approach can be summarized as follows [6]:

1. Given a query term q as input, create an empty set Q which will hold an extended set of queries.

2. Search for co-hyponym patterns (enumeration patterns, e.g. “X,Y and Z”) and use these to find co-hyponyms of q. Add the co-hyponyms to Q and store their frequency as the number of times a given co-hyponym was found using the patterns.

3. For each co-hyponym q̂ ∈ Q, calculate its score by multiplying its frequency with the cosine similarity between the embeddings of q and q̂. Rank the co-hyponyms based on this score and keep the n (n = 5, selected empirically) with the highest score. Discard the rest.

4. Add the original query q to Q.

5. Create an empty set H_q which will hold the discovered hypernyms.

6. For each q̂ ∈ Q, search for the hypernym patterns in the text to discover hypernyms of q̂ and add them to H_q.

7. Add the head of each term in H_q and the head of the original query q to this set.

8. For each candidate h ∈ H_q, calculate its score by multiplying its normalized frequency by the cosine similarity between the embeddings of h and q, and rank the candidates according to this score.

With these extensions to the pattern-based approach, it is no longer a requirement that hyponym and hypernym co-occur in the same sentence, thus improving the recall.
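The frequency-times-similarity scoring used in steps 3 and 8 above can be sketched as follows. The toy embeddings and pattern frequencies below are invented; only the scoring and ranking logic follows the description.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_cohyponyms(query, candidates, embeddings, n=5):
    """Score each co-hyponym by pattern frequency * cosine(query, candidate)
    and keep the n highest-scoring ones."""
    scored = [(term, freq * cosine(embeddings[query], embeddings[term]))
              for term, freq in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [term for term, _ in scored[:n]]

embeddings = {"dog": np.array([1.0, 0.0]),
              "cat": np.array([0.9, 0.1]),
              "car": np.array([0.0, 1.0])}
# "cat" was found 3 times by enumeration patterns, "car" once.
rank_cohyponyms("dog", {"cat": 3, "car": 1}, embeddings, n=1)  # -> ["cat"]
```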



3.3.2

Distributional component

The distributional part of the hybrid system is based on projecting the embedding of a query q in such a way that the projection is close to the embedding of its hypernym h. The model learns k projection matrices (k=24 for their official runs).

This section describes how the distributional model works [6]. The first step is to learn word embeddings for the terms in the vocabulary. The embeddings are learned on the text corpus corresponding to the relevant subtask, using word2vec with the skip-gram algorithm and negative sampling [32]. The text corpus is preprocessed in the following ways before learning the embeddings: multi-word terms from the vocabulary up to and including trigrams are converted to single tokens, and all characters are lowercased. Having learned word embeddings, the supervised model discovers hypernyms according to the following procedure. Given a query term q and a hypernym candidate term h, look up their embeddings e_q, e_h ∈ ℝ^{d×1}. The matrix P ∈ ℝ^{k×d}, containing k projections of e_q, is then calculated according to:

P_i = (\Phi_i \cdot e_q)^T \qquad (3.1)

Where Φ_i ∈ ℝ^{d×d} for i ∈ {1, ..., k} is one of the k projection matrices. Similarities are then calculated between the projections and the candidate e_h using the inner product:

s = P \cdot e_h \qquad (3.2)

This gives us the column vector s ∈ ℝ^{k×1}, which is then fed to an affine transformation and a sigmoid activation function. This gives us an estimate of the likelihood y that h is a hypernym of q:

y = \sigma(W \cdot s + b) \qquad (3.3)

This likelihood is calculated for all candidate terms in the vocabulary to discover the hypernyms of a given query.
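Equations 3.1–3.3 amount to the following forward pass, shown here with randomly initialized parameters. In the real system Φ_i, W and b are learned and k = 24; everything else below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 50, 24

# Randomly initialized parameters standing in for the learned ones.
Phis = np.stack([np.eye(d) + 0.01 * rng.standard_normal((d, d))
                 for _ in range(k)])  # k projection matrices, shape (k, d, d)
W = rng.standard_normal(k)
b = 0.0

def likelihood(e_q, e_h):
    P = Phis @ e_q                    # Eq. 3.1: k projections of the query, shape (k, d)
    s = P @ e_h                       # Eq. 3.2: inner-product similarities, shape (k,)
    return 1.0 / (1.0 + np.exp(-(W @ s + b)))  # Eq. 3.3: sigmoid of an affine map

e_q = rng.standard_normal(d)
e_h = rng.standard_normal(d)
y = likelihood(e_q, e_h)  # estimated likelihood that h is a hypernym of q
```

Scoring a query against the whole vocabulary is then a matter of calling `likelihood` for every candidate embedding and keeping the top-ranked terms.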

Negative sampling is used when training the model: for each positive hyponym-hypernym pair in the training data, m negative examples are generated by replacing the hypernym with a random word from the vocabulary. m was set to 10 for most of the official runs by Bernier-Colborne and Barriere [6]. After adding negative samples, the model is trained to output likelihoods y (Equation 3.3) that are close to the target t, where t = 1 for positive examples and t = 0 for negative examples. To do this, the binary cross-entropy is minimized. For one training instance (q, h, t), where q is a query term, h is the candidate term and t is the target, the binary cross-entropy is calculated according to:

H(q, h, t) = -\big(t \cdot \log(y) + (1 - t) \cdot \log(1 - y)\big) \qquad (3.4)

The full cost function J is accumulated by summing H over all examples in the training set D:

J = \sum_{(q,h,t) \in D} H(q, h, t) \qquad (3.5)

J is then minimized using gradient descent with the Adam optimizer [26].
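A direct transcription of the loss might look like this (written with the conventional minus sign so that the quantity is minimized):

```python
import math

def bce(y, t):
    """Per-instance binary cross-entropy H(q, h, t) for likelihood y
    and target t in {0, 1}."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def total_cost(batch):
    """Equation 3.5: J, the sum of H over all (likelihood, target) instances."""
    return sum(bce(y, t) for y, t in batch)

# Confident, correct predictions give a low cost; confident, wrong ones a high cost.
total_cost([(0.9, 1), (0.1, 0)])
```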

3.3.3

The hybrid approach

The full hybrid system proposed by Bernier-Colborne and Barriere combines the supervised and unsupervised components [6]. The system takes the top 100 candidates from the output of each respective component and normalizes their score. The scores are then added together, and the candidates are re-ranked in accordance with this new combined score. Candidates that are found by both systems will have an advantage with this approach. When performing the standardized evaluation of the SemEval-2018 hypernym discovery task, this approach greatly outperformed all baselines and competing systems [10, 6].



3.4

The training process

Section 3.3 above explains the theoretical and technical details of how the system performs hypernym discovery. This section will focus on the different components of the workflow pipeline and how we arrive at a final trained model from start to finish. These steps follow the proposed approach in the paper by Bernier-Colborne and Barriere [6].

3.4.1

Preprocessing and word embeddings

The text corpus and vocabulary are subject to preprocessing before training word embeddings. All terms from the vocabulary are converted to lowercase, and multi-word terms from the vocabulary (bigrams and trigrams) are converted to single tokens. The exact nature of the corpus and vocabulary is dependent on which variant of the training data is to be tested, see Section 3.5.4.

With preprocessing completed, word embeddings are trained for all terms in the vocabulary. Word2vec with the skip-gram algorithm and negative sampling is used for training [32].
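A rough sketch of the preprocessing step is shown below. The longest-match-first ordering is an assumption made here so that nested terms (e.g. “hard disk” inside “hard disk drive”) do not clobber each other; the actual implementation may differ.

```python
import re

def preprocess(text, vocabulary):
    """Lowercase the corpus and join multi-word vocabulary terms
    (bigrams and trigrams) into single tokens."""
    text = text.lower()
    # Replace longer terms first so nested terms stay intact.
    multiword = sorted((t for t in vocabulary if " " in t), key=len, reverse=True)
    for term in multiword:
        text = re.sub(re.escape(term), term.replace(" ", "_"), text)
    return text

preprocess("The hard disk drive failed.", {"hard disk drive", "hard disk"})
# -> "the hard_disk_drive failed."
```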

3.4.2

Training setup details

The setup for the training algorithm mimics that of Bernier-Colborne and Barriere [6]. Here are some details worth noting for the purpose of replicability:

• A fixed number of 24 projection matrices was used. These were initialized as identity matrices with random noise.

• Word embeddings were normalized before training.

• Negative sampling was applied to the training data. For each positive hyponym-hypernym pair in the training data, 10 negative pairs were added by replacing the gold hypernym with a randomly drawn term from the vocabulary.

• The model is trained on mini-batches of 32 positive examples and 32 x 10 negative examples.

• Dropout is applied to the embeddings and the projections to prevent overfitting. Gradient clipping and early stopping are used for regularization.

• Word embeddings are fine-tuned during training.
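The negative sampling step above can be sketched as follows (the example pairs and vocabulary are hypothetical; the real system draws from the subtask vocabulary):

```python
import random

def add_negative_samples(pairs, vocabulary, m=10, seed=0):
    """For each positive (hyponym, hypernym) pair, emit one positive instance
    (target 1) and m negative instances (target 0) obtained by replacing the
    gold hypernym with a random vocabulary term."""
    rng = random.Random(seed)
    data = []
    for hypo, hyper in pairs:
        data.append((hypo, hyper, 1))
        for _ in range(m):
            data.append((hypo, rng.choice(vocabulary), 0))
    return data

data = add_negative_samples([("dog", "animal")], ["car", "tree", "metal"], m=10)
len(data)  # 11: one positive plus ten negatives
```

Mini-batches of 32 positives then carry 32 × 10 accompanying negatives, matching the batch setup described above.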

3.5

Setting up the experiments

We will now leave the underlying model and focus on the main contribution of this thesis. This thesis contributes by exploring how introducing automatically collected domain-specific training data affects model performance within a specific domain. This involves: Selecting which domains to consider, specifying and implementing an automatic process for collecting training data, and specifying what different data setups are to be used for the experiments.

3.5.1

Domains

Two domains were considered for this thesis: one domain with a small amount of text data and a small vocabulary, and one with much more text data and a larger vocabulary. For the sake of brevity, the domains will be referred to as the small domain and the large domain going forward.

The large domain was made up of data provided by the SemEval-2018 task 9 team and consisted of a corpus of medical texts from the MEDLINE repository, which contains citations from published biomedical literature [10]. Only the corpus and the vocabulary were used: since the point of the thesis is to use automatically collected training data, the domain-specific training data provided by the SemEval team was not used. The corpus contained a total of 130 million tokens. The small domain was made up of a ~1 200 page manual from a client company of Fodina, yielding a corpus containing a total of ~305 000 tokens.

The scope of the experiments was limited to two domains due to time restrictions. Manual evaluation of each experimental run is required for the small domain, since no real gold standard data is available there, and reaching a reasonable sample size with manual evaluation is time-consuming. The larger medical domain has gold standard data which can be used for evaluation.

With the limit set to two domains, it seemed reasonable to choose domains that contrast strongly with each other. The Fodina client data represents a realistic, perhaps even slightly pessimistic, use case, while the medical domain can be seen as something approaching a best-case scenario.

3.5.2

Vocabulary creation

The data made available by the SemEval-2018 task 9 team contains term vocabularies for the general-purpose and medical (i.e. large domain) data [10]. The general-purpose vocabulary contained 218 753 terms, and the vocabulary for the medical data contained 93 888 terms. These vocabularies were used when applicable. When working with the Fodina client (i.e. small domain) data, the vocabulary had to be constructed. The first attempt at this was to use the term extraction system that Fodina uses. However, that system filters out many terms on the basis that they are overly general for the purpose of maintaining a company terminology, and filtering out the more general terms severely limits the number of hyponym-hypernym training pairs that can be extracted from the term vocabulary. For this reason, the vocabulary had to be extended to include more terms. This was done by using the NLP library spaCy to extract noun phrases from the corpus. Since term extraction is not a central part of this thesis, the noun phrases were refined in a very simple and straightforward way by removing some stop words and other words that were deemed uninteresting, for example removing “the” from “the battery”. Extending the vocabulary in this way significantly increased the amount of training data that could be collected for the domain. The resulting vocabulary contained 11 213 terms.
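The refinement step described above (stripping uninteresting leading words from extracted noun phrases) can be sketched roughly as follows. The stop-word list here is a small illustrative subset, and the spaCy usage in the comment assumes an installed English model:

```python
# Small illustrative subset of stop words; the actual list used was larger.
STOP_WORDS = {"the", "a", "an", "this", "that", "its"}

def refine_phrase(phrase):
    """Strip leading stop words from a noun phrase; drop it if nothing remains."""
    words = phrase.lower().split()
    while words and words[0] in STOP_WORDS:
        words = words[1:]
    return " ".join(words) or None

# With spaCy, this would be applied to the extracted noun chunks, e.g.:
#   nlp = spacy.load("en_core_web_sm")
#   vocab = {refine_phrase(chunk.text) for chunk in nlp(corpus_text).noun_chunks}
print(refine_phrase("the battery"))  # "battery"
```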

The vocabulary creation process employed by the SemEval-2018 task 9 team involved extensive manual refinement. Terms were manually normalized, and the vocabulary was manually pruned by removing vague and overly general terms, as well as terms with misattributed domains [10]. No such manual refinement was applied when creating the vocabulary for the small domain. As a result, the small domain vocabulary is noisier than both the general-purpose vocabulary and the large domain vocabulary. This difference is not an issue with respect to the general-purpose vocabulary, since one part of the research questions is to explore how this kind of refined general-purpose data can be leveraged in smaller domains. However, the differing quality of the vocabularies has to be taken into consideration when comparing results between the large domain and the small domain. This further emphasizes that the large domain should be seen as something approaching a best-case scenario.

3.5.3

Training data collection

At the outset, the provided domain-specific data is assumed to consist of two components: raw text data, and a set of terms that have been extracted from the text data. It is assumed that term extraction has been done beforehand, since extracting terms is not the main focus of this thesis; the exception is the small domain, where the terms had to be extracted as described in section 3.5.2. These terms make up the domain-specific vocabulary. The domain text data is compiled into a tokenized corpus, which is preprocessed together with the domain-specific vocabulary according to the same procedure as the general-purpose corpus and vocabulary.

The domain-specific training data is collected by leveraging the lexical database WordNet [34]. The end result is a set of input terms Q and a set of corresponding “gold” hypernyms H, where Hi denotes the list of extracted gold hypernyms for input term Qi. The collection algorithm was defined as follows:

1. Initialize Q with all terms from the domain-specific vocabulary, and initialize H as an empty set.

2. Remove all terms from Q that do not have a corresponding synset in WordNet.

3. For each term Qi, traverse n hypernym steps in the lexical resource and add all encountered terms to Hi. Some overly general terms, such as “artifact”, are ignored. n was set to 5 for the experiments.

4. Remove out-of-vocabulary terms from H.
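The four steps above can be sketched over a toy taxonomy as follows. In the actual system, WordNet (e.g. via NLTK's wordnet interface) plays the role of the hypernym lookup; the dictionary and term names here are purely illustrative:

```python
# Toy stand-in for WordNet's hypernym relation.
HYPERNYMS_OF = {
    "dog": ["canine"], "canine": ["mammal"], "mammal": ["animal"],
    "animal": ["organism"], "organism": ["top_level_entity"],
}
IGNORED = {"top_level_entity"}  # overly general terms are skipped (step 3)

def collect_gold_hypernyms(vocabulary, n=5):
    Q, H = [], []
    for term in vocabulary:
        if term not in HYPERNYMS_OF:        # step 2: no entry in the resource
            continue
        gold, frontier = [], [term]
        for _ in range(n):                  # step 3: traverse n hypernym steps
            frontier = [h for t in frontier for h in HYPERNYMS_OF.get(t, [])]
            gold.extend(h for h in frontier if h not in IGNORED)
        gold = [h for h in gold if h in vocabulary]  # step 4: keep in-vocabulary
        if gold:
            Q.append(term)
            H.append(gold)
    return Q, H

vocab = {"dog", "canine", "mammal", "animal", "organism"}
Q, H = collect_gold_hypernyms(vocab, n=5)
print(H[Q.index("dog")])  # ['canine', 'mammal', 'animal', 'organism']
```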

The SemEval-2018 task 9 team utilized resources such as WordNet in a very similar fashion when compiling their training data [10]. The biggest difference relative to the method used for this thesis is that the SemEval training data went through extensive manual refinement, while the domain-specific training data collected for this thesis was not manually refined at all. Crowdsourcing and expert verification were leveraged for the SemEval manual refinement. The crowdsourced refinement involved selecting all correct hypernyms among the candidates extracted from the lexical resources, thus filtering out incorrect hypernyms. The expert verification was done in two passes: first, one set of experts was tasked with removing incorrect hypernyms and normalizing terms where necessary; second, a different set of experts was instructed to focus on adding obvious hypernyms that were missing [10]. This type of manual refinement is time-consuming and expensive, and therefore not always feasible in the context of industrial application. In contrast, the method utilized for this thesis is fully automated, and no manual refinement is required.

The automatically collected data will be of lower quality in a number of ways. Primarily, the diversity of the training pairs is limited by the lexical resources that are used, since no hypernyms are added manually. With such deficiencies in mind, it is interesting to explore what level of performance can be expected from a state-of-the-art supervised distributional model when introducing this type of training data.

3.5.4

Data variants

The central contribution of this thesis is the study of the effects of different training data variants. For clarification, when training data is referenced going forward, it refers to the “gold” hyponym-hypernym pairs that are used to train the projection matrices.

Three different variants were evaluated for both of the domains. This section provides motivation as to why the different variants are of interest, and describes their composition and how they were created.

The three variants are general-purpose training data, pure domain-specific training data, and mixed training data (descriptions in the following subsections). A model trained on general-purpose data will serve as a baseline to see whether or not the domain-specific training data actually does anything to boost model performance. The models trained on pure domain-specific data will give an idea as to whether or not the quality of this automatically collected training data is sufficient on its own.
