
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Master thesis, 30 ECTS | Computer Science

2018 | LIU-IDA/LITH-EX-A--18/007--SE

Relation Classification using Semantically-Enhanced Syntactic Dependency Paths

Combining Semantic and Syntactic Dependencies for Relation Classification using Long Short-Term Memory Networks

Riley Capshaw

Supervisor: Marco Kuhlmann
Examiner: Arne Jönsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.



Abstract

Many approaches to solving tasks in the field of Natural Language Processing (NLP) use syntactic dependency trees (SDTs) as a feature to represent the latent nonlinear structure within sentences. Recently, work in parsing sentences to graph-based structures which encode semantic relationships between words—called semantic dependency graphs (SDGs)—has gained interest. This thesis seeks to explore the use of SDGs in place of and alongside SDTs within a relation classification system based on long short-term memory (LSTM) neural networks. Two methods for handling the information in these graphs are presented and compared between two SDG formalisms. Three new relation extraction system architectures have been created based on these methods and are compared to a recent state-of-the-art LSTM-based system, showing comparable results when semantic dependencies are used to enhance syntactic dependencies, but with significantly fewer training parameters.


Acknowledgments

First and foremost I would like to thank my supervisor Marco Kuhlmann, whose excellent teaching renewed my interest in NLP. You provided invaluable guidance both in designing this thesis and understanding the theory. I would not have done so well without our weekly meetings and insightful discussions. I sincerely hope to continue working with you and the rest of NLPLAB.

To all my friends here at the university and around the world, thank you for the occasional get-togethers, the cultural food exchanges, the fika discussions, and the gaming sessions. You all helped me to keep my life balanced as I adjusted to my new home so far from where I grew up.

To my husband Daniel, thank you for supporting me through the years and for giving me this opportunity to live and study in Sweden. Without you, I would have never learned about Linköping nor its university, and I might have never returned to academia. To my parents-in-law Eric and Natasha, and to my siblings-in-law Daryl and Samantha (and their own growing families), thank you so very much for your support.

Finally, I want to thank my parents, Gary and Paige. You two let me fly across an ocean with all my belongings and live thousands of miles away from you so I could pursue my dreams and be with the person I love. I can never express how grateful I am for that, nor how much it has meant to me and how much I love you both. I just wish Mom could have been here when I finished my thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research Questions
1.4 Delimitations

2 Dependency Parsing
2.1 Structuring Unstructured Sentences
2.2 Dependency Parsing
2.3 Syntactic Dependency Parsing
2.4 Semantic Dependency Parsing

3 Neural Networks for Structured Data
3.1 Vector Models for Meaning Representation
3.2 Neural Networks
3.3 Feedforward Neural Networks
3.4 Recurrent Neural Networks
3.5 Structured Data

4 Semantic Dependencies for Relation Classification
4.1 Predicate-Argument Dependency Paths
4.2 Directed Dependency Paths
4.3 Handling Disconnected Graphs

5 Method
5.1 Task Data
5.2 Replication of Related Work
5.3 Preprocessed Data
5.4 Evaluation
5.5 Training

6 Experiments
6.1 Semantic Unit Dependency Graph-Based Systems

7 Results
7.1 Semantic Path Extraction
7.2 Relation Classification

8 Discussion
8.1 Replication of Related Work
8.2 Parsing to Semantic Dependency Graphs
8.3 Results
8.4 Method
8.5 The Work in a Wider Context

9 Conclusion
9.1 Research Questions
9.2 Future Work


List of Figures

1.1 Example of overlap in relation classification labels
2.1 Training sentence #10717
2.2 Example syntactic dependency tree
2.3 Example enhanced syntactic dependency trees with compound
2.4 Semantic dependency graph derived using DM
2.5 Semantic dependency graph for the sentence from example 2.1
3.1 Feedforward Neural Network architecture
3.2 Recurrent Neural Network architecture
3.3 Long Short-Term Memory unit
3.4 Popular activation functions used in neural networks
3.5 Recursive Neural Tensor Network
3.6 Syntactic dependency tree for partial training example #474
3.7 The word channel of the SDP-LSTM
3.8 Full SDP-LSTM system
4.1 Semantic units
4.2 A semantic unit dependency graph for the partial training example #474
4.3 Comparison of shortest dependency paths
4.4 Shortest DDP for partial training example #474
4.5 Disconnected semantic unit dependency graph mended with syntactic units
6.1 Entity-Informed Unit-Path System
6.2 Entity-Separated Unit-Path System
6.3 Directed Dependency Path System
7.1 Convergence times
8.1 Semantic dependency parser instability
8.2 Semantic dependency graph with conjunctive edges
8.3 SDG for training example #391 as parsed by MeurboParser


List of Tables

1.1 Table of examples from the SERC task
5.1 SDP-LSTM dimensions
6.1 EI-UP system dimensions
6.2 ES-UP system dimensions
6.3 DDP system dimensions
7.1 Semantic path statistics
7.2 Table of F1 scores


1 Introduction

This chapter seeks to give a clear understanding of the goals and aims of this thesis. It begins by introducing necessary key concepts within the field of Natural Language Processing, such as information extraction and the task of relation classification. A brief introduction to some systems used within these areas is included, as well as a discussion of how the goal of this thesis relates to those methods.

1.1 Motivation

One major area within Natural Language Processing (NLP) deals with machine understanding of text documents. This area often uses information extraction techniques to retrieve structured information from those documents. However, due to the inherent complexities and ambiguities in natural languages, as well as the necessity of background knowledge to understand anything more than the most trivial of sentences, this process is difficult and largely inaccurate, even for tasks which humans might consider trivial.

In these tasks, the extracted information might be the desired result, such as in the question-answering system by Girju [9]. There, the author focuses on answering questions about causation, which could aid in identifying important cause-and-effect relationships in many domains. In other tasks, the information might be compiled in a database [17] or ontology [40] for further structuring or analysis, then used for another task. For those, it is critical that the initial stage of information extraction is accurate; otherwise, the entire knowledge base becomes unreliable.

Relation Classification

This thesis focuses on relation classification (RC), an information extraction task which generally deals with detecting the type of semantic relation between pairs of labelled entities (words or multi-word phrases). A semantic relation between two words is a directed relation dependent upon human interpretation and background knowledge, rather than just grammatical structure. Explicitly, a semantic relation between two words requires context about the real-world grounded meaning of the words, both when taken individually and as a pair. The relations focused on in this thesis are between nominal entities (single- or multi-word nouns) within a single sentence.


Relationship       | Direction | Example sentence
Cause-Effect       | (e1,e2)   | #7911: The man(e1) radiated jolliness(e2).
Entity-Origin      | (e1,e2)   | #1279: The rest(e1) was funded from her small family savings(e2).
Message-Topic      | (e1,e2)   | #1434: The final programme(e1) detailed the history(e2) of Russborough House.
Product-Producer   | (e2,e1)   | #1470: His wife(e1) has just completed the first paper(e2) of her graduate degree.
Entity-Destination | (e1,e2)   | #1501: She poured flour(e1) into a pretend dragon(e2) with a tube.
Member-Collection  | (e2,e1)   | #4386: We gathered kindling in a grove(e1) of tall pines(e2) near the cabin.
Instrument-Agency  | (e2,e1)   | #4577: World powers(e1) threaten Iran with sanctions(e2).
Component-Whole    | (e1,e2)   | #4602: The ricotta mixture(e1) was the best part of this dish(e2).
Content-Container  | (e2,e1)   | #7759: The kitchen(e1) contains basic items such as salt(e2), pepper and olive oil.
Other              | -         | #961: He sent the revellers(e1) into party mode(e2).

Table 1.1: Examples for each of the possible labels from SemEval 2010 Task 8, including directionality and their number in the data set.

RC tasks often focus only on a specific set of semantic relations. Girju [9] uses WordNet [30] to determine which verbs have a causal implication between their subject and object, such as "develop" meaning "causes to grow." Huang et al. [17] extract protein–protein interactions from published articles in order to compile a comprehensive database that can be used for further research. Serra and Girardi [40], on the other hand, use an RC subsystem to perform expert-assisted ontology learning by examining a small subset of relevant relations.

The specific semantic relations that this thesis focuses on come from SemEval 2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals ([12]; SERC). This is a shared RC task which is used as an evaluation benchmark in many studies and includes nine directional relations plus an undirected OTHER relation, for a total of nineteen possible labels for any given pair of entities. The data set serves as an excellent benchmark not only for the breadth of relations included, but also for the use of strongly related categories where the structure of each sentence alone is not enough to infer the label. The authors manually annotated 10,717 sentences with these relations, as shown in table 1.1.

When I came, the apples(e1) were already put in the basket(e2). Content-Container(e1,e2)

Then, the apples(e1) were quickly put in the basket(e2). Entity-Destination(e1,e2)

Figure 1.1: An example of label overlap provided by the authors of SERC.

Figure 1.1 shows an example provided by the authors of SERC with two extremely similar sentences, where the situation they describe is either static (top) or dynamic (bottom). These illustrate that the relation between apples and basket changes depending on whether the sentence as a whole implies movement of the apples. In other words, the different relations sometimes have very subtle differences, meaning that more than just the two entities are necessary for determining which relation applies. Consider the example in table 1.1 for the OTHER class. If one were to classify based on keywords like "sent" and "into", then that example might be incorrectly classified as Entity-Destination(e1,e2). This time, the classification depends on knowledge of the real-world meaning of the entities.

Relation Classification with Neural Networks

Notably, while machine learning techniques were employed by most of the SERC participants, none of the submissions used neural network architectures. It was not until later that the performance of neural networks could be competitive on this task. One of the first was a recursive neural network designed by Socher et al. [41] named the MV-RNN, which achieves near-state-of-the-art performance using no external data other than a syntactic dependency parser (discussed in chapter 2), while the previous state-of-the-art system was a support vector machine with twelve external feature sets. After adding in only three external feature sets, the MV-RNN beats the previous state-of-the-art system in terms of F1 score. Despite the complexity of the approach, it clearly demonstrates the potential of neural networks as a tool for these types of tasks, reducing the need for hand-engineered features. The MV-RNN will be discussed in further detail in chapter 3.

Recently, Xu et al. [45] presented a long short-term memory-based recurrent neural network architecture named the SDP-LSTM. Their solution also uses syntactic dependency trees, but instead of relying on a predictable binary structure as the MV-RNN does, it uses the shortest (undirected) dependency path in that tree between the two marked tokens. This linearizes the tree, converting the most important parts into a single series of words. Using this shortest dependency path (SDP), the SDP-LSTM is able to implicitly use the structure of the tree as a feature. This system scores even higher than the MV-RNN in terms of F1 score, and will serve as the baseline system for replication and comparison.

Semantic Dependencies

In fact, all of the previously mentioned RC systems use a syntactic dependency tree in some regard, yet also discuss the drawbacks inherent in using syntactic information to derive semantic relations. Girju [9] even mentions that features based on lexical information (like WordNet [30]) and syntactic information do not provide everything necessary to detect causal relations. One could argue based on this that a semantic task like RC would benefit from the use of semantic dependencies, which provide information about how different words interact within a sentence to influence its overall interpretation. Unfortunately, the literature seems to lack such systems which use semantic dependency graphs. Even the task of parsing into these structures is relatively new, with two SemEval shared tasks from 2014 [35] and 2015 [34] attempting to motivate parsing sentences into graphs instead of trees to enable more complex analyses. As such, part of this thesis is the exploration of combining a relation classification system with semantic dependencies.

1.2 Aim

The main focus of this thesis is to determine whether the use of semantic dependency graphs can improve the performance of RC systems. In order to perform this comparison, a baseline system is constructed based on related literature, with a particular focus on those using long short-term memory (LSTM)-based RNNs. After exploring the relationship between RNNs and tree- and graph-structured data, a chosen reference system is extended to handle the information explicitly and implicitly encoded into semantic dependency graphs. This extension is accomplished by first developing two new methods for representing important substructures in those graphs in a linear manner compatible with the baseline architecture. Given these new representations, further modifications to the baseline system are developed and evaluated. Evaluation of all systems is performed over the data set from SERC.

1.3 Research Questions

In order to accomplish the goal of this thesis, the following questions are answered.

1. How can the information in semantic dependency graphs be effectively used for relation classification?

2. Given an LSTM-based system for relation classification which uses syntactic dependency trees, how can it be extended to instead use semantic dependency graphs?


3. How is the performance of a relation classification system affected when using semantic dependency graphs in place of syntactic dependency trees?

Question 1 is addressed in chapter 4, which presents two methods for using semantic dependencies based on the theory presented in chapter 2 regarding semantic dependency graphs. Chapter 6 addresses question 2 by combining the answer to question 1 with the SDP-LSTM, as well as extensions to the ideas and systems from chapter 3. Finally, chapter 7 compares the systems built from the answers to questions 1 and 2 against the SDP-LSTM baseline in order to answer question 3.

1.4 Delimitations

The scope of this thesis is limited in the neural network architectures to be evaluated, the evaluation of the models, and the domain itself. Multiple classes of neural network architectures are presented and discussed, yet only LSTM-based RNNs are evaluated. The theoretical studies explore and compare two different classification tasks, but only one will be used in the final evaluation. Further, the evaluation only seeks to discuss whether semantic dependency graphs can offer an advantage over syntactic dependency trees based on this one classification task. A broader study of possible modifications to the presented systems, such as hyperparameter tuning or additional feature inclusion, is also omitted.


2 Dependency Parsing

This chapter will cover syntactic and semantic dependency parsing and the resulting data structures which encapsulate the parsed information. Further, it will discuss a few key features and drawbacks of automatically-generated dependency structures. These topics are prerequisites to understanding the theory in chapter 4.

2.1 Structuring Unstructured Sentences

Sentences are a form of semi-structured data, where the information is generally read by humans in a linear fashion (left to right in English), but the actual information conveyed in a sentence may not be so linear. Back-references to previous entities or inversion of the usual ordering of subject, verb, and object may add a bit of nonlinear complexity to a sentence. That is, any two words in a sentence may be related in some way, not just those which are adjacent to each other.

A few days before the service, Tom Burris had thrown into Karen’s casket(e1) his wedding ring(e2). Entity-Destination

Figure 2.1: Sentence #10717 from the testing set.

Example 2.1 taken from the testing set shows a sentence that employs literary inversion. Here, the two entities casket and ring are marked with the relationship Entity-Destination(e2, e1). This relationship is, however, rare in the e2-to-e1 direction. Only two examples show up across this project’s training and testing data sets (one in each), which makes it all the more difficult to handle in general.

2.2 Dependency Parsing

A simple method for extracting relevant features to detect a relationship between entities might simply be to take the subsentence bounded by them. For example 2.1, that yields "casket his wedding ring", which has no information on what is happening (the verb is excluded).


In order to know that "thrown" is important, human readers make inferences based on the relationships between words, which in turn might be dependent upon relationships with other words, forming a hierarchy of dependencies within a sentence. Extracting these dependency relations from a sentence is referred to as dependency parsing, producing a dependency graph (see definition 1, adapted from Oepen et al. [35]). For the purposes of this thesis, only syntactic dependency parsing and semantic dependency parsing will be discussed.

Definition 1 (Dependency Graphs). A dependency graph for a sentence S = w_1, ..., w_n is a structure G = (V, E, ℓ_E), where V = {1, ..., n} is the set of nodes (in one-to-one correspondence to w_1, ..., w_n), E ⊆ V × V is the set of edges, and ℓ_E is a mapping that assigns labels (dependencies in this context) to the edges [35].

Dependency parsing focuses on linguistically-motivated dependency relations which describe the role of each word in relation to exactly one other (a head and a dependent). These dependency relations are defined by a dependency grammar, which describes the set of possible directed binary relations which can be derived independent of word order. This role is usually derived as a direct or indirect dependency of the sentence's main verb, which is marked as the root node. This root node is also referred to as the (syntactic or semantic) head of the dependency structure.
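As a concrete illustration of definition 1, the sketch below shows one possible in-memory representation of a labelled dependency graph. The class and method names are illustrative choices rather than any parser's API, and the tiny hand-built example ("I washed my car") only serves to show how edges and labels might be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DependencyGraph:
    """A labelled dependency graph G = (V, E, l_E) as in definition 1."""
    words: List[str]                  # w_1 ... w_n; node i corresponds to words[i - 1], node 0 is the root
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)  # (head, dependent) -> label

    def add_dependency(self, head: int, dependent: int, label: str) -> None:
        self.edges[(head, dependent)] = label

    def dependents_of(self, head: int) -> List[Tuple[int, str]]:
        return [(d, lbl) for (h, d), lbl in self.edges.items() if h == head]

# "I washed my car": washed is the root's dependent, I its subject, car its object.
g = DependencyGraph(words=["I", "washed", "my", "car"])
g.add_dependency(0, 2, "root")
g.add_dependency(2, 1, "nsubj")
g.add_dependency(4, 3, "nmod:poss")
g.add_dependency(2, 4, "dobj")
print(g.dependents_of(2))   # [(1, 'nsubj'), (4, 'dobj')]
```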

2.3 Syntactic Dependency Parsing

The first class of dependency relations to be discussed will center around syntactic relations between words, motivated primarily by grammatical analyses of sentences by linguists. Syntactic dependency parsing therefore seeks to parse a sentence into a structure that describes the grammatical role of every word. Unless otherwise noted, all examples in this section were parsed using the Stanford Parser1 and may contain linguistic inaccuracies.

The Stanford Parser currently uses the dependency relations found in the Universal Dependencies project (UD; [5]), and so will be used in the following examples2. Since these dependencies are syntactic in nature, the dependency trees are referred to as syntactic dependency trees (SDT). The UD relations allow for any word to function as the dependent in at most one relation, but the head in many relations. For example, a verb is often the head of both a subject (nsubj) and a direct object (dobj) relation, while the dependent of either relation cannot be the dependent of another relation with another verb, even in the case of a compound verb. Taken together, these relations form a dependency structure known as a dependency tree, described in definition 2.

Definition 2 (Dependency Trees). A dependency tree is a dependency graph G = (V, E, ℓ_E) with the restriction that the root node must have exactly one path to every other node in V. In general, the root is an extra token w_0 added to the sentence, with a corresponding node 0 added to V.

Figure 2.2 shows the partial SDT for example 2.1. Note that the edges from the main verb point into the three main phrases of the sentence (and in this example, to the head of each phrase): the subject "Tom Burris," the prepositional phrase "into Karen's casket", and the direct object (noun phrase) "his wedding ring." These phrasal subtrees are all color coded in the graph. Now, a relationship between the two marked entities can be found using the tree. That is, casket(e1) is related to the root node via the nmod relation, and the root node is related to ring(e2) via the dobj (direct object) relation. However, this still lacks an explicit direction for the relationship. Though unlikely, it would have been valid for Tom Burris to have thrown his ring from the casket.

1 Available online, but with frequent outages: http://nlp.stanford.edu:8080/parser/index.jsp. Last successfully accessed April 27, 2018.

2 The reference system by Xu et al. [45] actually uses an older version of the Stanford dependency parser, which


Figure 2.2: Syntactic dependency tree for the sentence from example 2.1, excluding the temporal phrase for space.

Further Characteristics

Restricting dependency structures to trees means that some types of relationships cannot be explicitly encoded. One of the clearest examples of this is illustrated with compounds, such as in figure 2.3. On the left is the SDT for a simple sentence where a subject-predicate-object relationship is explicitly encoded (I-washed-car). However, on the right is the SDT of essentially the same sentence, but with a compound verb. Because the subject and object of both verbs can only be the dependent of one word, this second example loses the explicit relationship between “washed” and its object, and “waxed” and its subject. Any inference made with this tree would need to understand the function of the conj to also understand that the nsubj and dobj relations for one half of a conjunction are equally applicable to the second half.

Figure 2.3: Syntactic dependency tree for a simple sentence (left) and a sentence with a compound verb (right). All components marked in red appear only when parsed using enhanced Universal Dependencies.

Extensions to the Universal Dependency relations attempt to mitigate this in part by relaxing assumptions to produce directed acyclic graphs instead of trees, where there may exist multiple paths from the root node to some vertices in V. The effects of this extension are shown in red for the graph on the right, restoring one of the two missing arcs and adding additional information to the conj relation.

2.4 Semantic Dependency Parsing

Like syntactic dependency parsing, the semantic counterpart focuses on extracting directed binary relations between the words of a sentence. The need for parsing to more complex structures than trees in order to capture semantic information has been well-motivated by two shared tasks at the International Workshop on Semantic Evaluation (SemEval), one in 2014 [35] and one in 2015 [34]. These tasks argue that tree structures cannot accurately represent the meaning behind an arbitrary sentence beyond simple grammatical relations. The development of the enhanced version of Universal Dependencies mentioned in the previous section also lends weight to this argument.

Unlike syntactic dependencies, semantic dependencies do not have a single favored representation. Instead, the SemEval tasks considered three formalisms: DELPH-IN MRS-derived bi-lexical dependencies (DM) [8], Enju predicate-argument structures (PAS) [31], and Prague semantic dependencies (PSD) [11]. This thesis will not use PAS and PSD, instead focusing on DM as well as a fourth formalism known as combinatory categorial grammar dependencies (CCD) [16]. CCD as a resource was recently provided by Oepen et al. [33] and further motivated by Kuhlmann and Oepen [19]. All analyses in this section will be based on the DM representation, with all dependency graphs generated using DELPH-IN's online LOGON parser. These dependency graphs will be referred to as semantic dependency graphs (SDGs) since they use semantic dependencies.

Figure 2.4: Semantic dependency graph derived using DM. For illustrative purposes, the relations between both verbs and their subject and object have been color coded.

Figure 2.4 shows the SDG for the same example as in figure 2.3 (right). Note that all of the desired relationships between the verbs and their arguments are present. In this formalism, the graph is composed of many predicate-argument relations, supplemented with binary relations between semantically-important modifiers and the words they modify. When the predicate is a verb, one can often assume that the arguments are its subject (ARG1), direct object (ARG2), indirect object (ARG3), and prepositional object (ARG4). However, ARG3 and especially ARG4 are relatively uncommon labels in the English training data.

Figure 2.5: Semantic dependency graph for the sentence from example 2.1, excluding the temporal phrase for space. The dotted blue arcs show the smallest explicit relationship between the marked entities.

Returning to example 2.1, figure 2.5 shows the partial SDG for that example sentence. Again we have a clear relationship between the two entity words. Here, "into" plays the role of the predicate in a predicate-argument relation with both entities, where "ring" is the first argument (ARG1), and "casket" is the second (ARG2). In this relation, the directionality is explicitly encoded, showing clearly that the "ring" is "into" the "casket" and not the other way around.

MeurboParser

The semantic dependency parser used in this thesis is the MeurboParser by Peng, Thomson, and Smith [37]. In short, this parser uses deep multitask learning to predict the correct dependency graph for each input sentence, a technique which is outside the scope of this thesis. The most important factor to note is that the system can be retrained easily on new data, allowing it to parse both the DM and CCD data used in chapter 6 for comparability.


3 Neural Networks for Structured Data

This chapter covers the basic concepts required to understand recurrent neural network architectures, including those used in the experiments of this thesis, beginning with representation methods for word-based input. This chapter is meant to act both as a brief survey of this topic and as background material for chapters 5 and 6. For brevity, all architectures and examples will be given with supervised classification tasks in mind, where the expected label for every example is known. Schmidhuber [39] provides a far better overview of recent trends in neural networks (with a focus on deep architectures). The main focus of this chapter is, however, the way these architectures have been adapted in the literature to handle structured data, such as the dependency trees and graphs discussed in chapter 2.

3.1 Vector Models for Meaning Representation

One of the major challenges within NLP is the representation of discrete data like words, which only have meaning for humans. Neural networks generally use numerical and statistical methods to extract information from the data, and as such do not work well over the raw data. This section will cover the basics regarding representing words as vectors, a technique used extensively in the neural networks presented in the following sections.

The simplest standard approach is to use one-hot encoding, which only provides a unique vector identifier of each word. In one-hot encoding, every word is given an index, then represented as a vector with every element set to zero except for that index, which is set to one. However, systems using one-hot encoding often suffer from the curse of dimensionality. A vocabulary of tens of thousands of words may mean hundreds of thousands or millions of weights to learn. Further, weights dealing with rare words (which may occur only once in the data) might not be trained sufficiently.

The chosen representation ideally should include information based on the meaning of each word while also minimizing the dimensionality of the vectors—a dense vector representation. One way to make vector representations more informative is to encode some metric into them. For example, a co-occurrence matrix would include at least some information on how often words occur together, and its rows could function as the vectors. However, this will likely skew values based on words which occur very frequently. The word distribution within a document or corpus will follow Zipf's law, which states loosely that the vast majority of observations (words) will be from a very small subset of all possible observations (the vocabulary), so this count will be greatly biased. More than half of the words in the Brown Corpus—a well-known corpus of American English documents with more than one million words and a vocabulary of roughly 50,000—are drawn from a subset of only 135 words [7]. The implication here is that these common words are extremely uninformative in a semantic sense, as they co-occur with almost every other word.

Positive Pointwise Mutual Information

To counteract this skewing, positive pointwise mutual information (PPMI) may be used to populate such a matrix [23]. Equations 3.1 and 3.2 show how to calculate pointwise mutual information (PMI) and its positive counterpart:

\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{\#(w, c)\,|D|}{\#(w)\,\#(c)}    (3.1)

\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c),\, 0)    (3.2)

Here, w is the word in question, c is a context word (a word which appears near w), D is the data, |D| is measured in number of words, and #(·) is a function which counts the number of occurrences of a word or word-context pair within D. These values could also be measured in terms of tokens rather than words, for example in a system which takes into account common multi-word phrases. The resulting matrices are denoted M_PMI and M_PPMI. Note that values in M_PMI may be undefined and are also converted to 0 in M_PPMI.

The resulting matrix is an encoding of the associative strength (the mutual information) between every pair of words in D. This formulation discounts frequent events, yielding larger values for word-context pairs that occur infrequently. This makes intuitive sense, as a human would have to use the few nearby context clues to understand a new word which only appeared once in a document. Therefore, some amount of semantic information about each word seems to be preserved in this encoding using that notion of context clues.
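A minimal sketch of equations 3.1 and 3.2, assuming the corpus has already been reduced to a word-by-context co-occurrence count matrix and treating |D| as the total number of counted pairs; the helper name ppmi_matrix is illustrative.

```python
import numpy as np

def ppmi_matrix(counts: np.ndarray) -> np.ndarray:
    """Compute M_PPMI from a word-by-context co-occurrence count matrix.

    counts[w, c] = #(w, c); |D| is taken here to be the total number of counted pairs.
    """
    total = counts.sum()                                # |D|
    p_wc = counts / total                               # P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total     # P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total     # P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))                # equation 3.1; undefined where counts are 0
    pmi[~np.isfinite(pmi)] = 0.0                        # undefined values become 0, as in M_PPMI
    return np.maximum(pmi, 0.0)                         # equation 3.2

# Toy example: 3 words by 4 context words.
counts = np.array([[10, 0, 3, 1],
                   [ 0, 7, 1, 2],
                   [ 4, 1, 0, 6]], dtype=float)
print(ppmi_matrix(counts).round(2))
```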

Latent Semantic Analysis

However, M_PPMI is still too large to be practical. Larger corpora, especially those which involve informal language such as emoticons, may have a vocabulary size in the millions1, yielding a matrix of trillions of decimal values. As such, one might next apply dimensionality reduction to the data, reducing the length of each row from millions of elements to hundreds. Latent semantic analysis (LSA) could be applied using M_PPMI to automatically reduce the data to only its most semantically informative parts [20]. In very general terms, LSA is the process of applying truncated singular value decomposition (SVD) to a matrix representation of the data to find a lower-rank approximation which preserves as much semantic information as possible [21]. In this case, SVD would be applied to M_PPMI [23] or to another representation of the data (such as a co-occurrence matrix). This yields a matrix with one row for every word in the vocabulary, but only as many columns as needed. In fact, the width of the matrix is often in the low hundreds.
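The reduction step can be sketched with an ordinary truncated SVD; numpy's dense SVD is used here for clarity, the target width of 2 is arbitrary, and the helper name lsa_embeddings is illustrative.

```python
import numpy as np

def lsa_embeddings(m_ppmi: np.ndarray, dim: int) -> np.ndarray:
    """Truncated SVD of M_PPMI: keep the top `dim` singular components.

    Returns one row per word; the columns are the latent dimensions.
    """
    u, s, _vt = np.linalg.svd(m_ppmi, full_matrices=False)
    return u[:, :dim] * s[:dim]        # scale each kept component by its singular value

# Continuing the toy example above (3 words, 4 contexts), reduce each row to 2 values.
m = np.array([[1.2, 0.0, 0.4, 0.0],
              [0.0, 1.5, 0.0, 0.3],
              [0.5, 0.0, 0.0, 1.1]])
emb = lsa_embeddings(m, dim=2)
print(emb.shape)   # (3, 2)
```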

These rows, also called embeddings, can be used to compare different words within vector space. That is, a similarity measure such as cosine similarity can be used to determine how close the meanings of two words are. Further, studies have shown that linear relationships between pairs of word vectors are also often captured by these embeddings. The classic examples use analogies, such as the difference between the vectors for king and man being approximately the same as the difference between the vectors for queen and woman [29].

1 See downloads section on https://nlp.stanford.edu/projects/glove/, which describes the corpus


One would also expect that the differences between man and woman, brother and sister, and king and queen would all be roughly equal in the vector space. To better analyze this, multiple data sets have been released which contain semantic and syntactic analogies to test the quality and breadth of information captured in embeddings2 [27, 28, 29].
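The similarity and analogy tests mentioned above reduce to simple vector arithmetic; the four-dimensional vectors below are invented purely for illustration, not taken from any trained model.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; real ones would come from LSA or word2vec.
vecs = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
    "man":   np.array([0.2, 0.1, 0.1, 0.7]),
    "woman": np.array([0.2, 0.1, 0.9, 0.7]),
}

# The analogy test: king - man + woman should land near queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best, cosine(vecs["queen"], target))
```

In practice the query words themselves are usually excluded from the nearest-neighbour search, a detail omitted here for brevity.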

Embeddings can be extended to handle any abstract discrete concept [25]. For example, the SDP-LSTM discussed later in this chapter applies this technique to concepts such as part-of-speech tags. Chapter 6 takes this further, using an RNN to devise intermediate representations (a sort of embedding-like vector) for variable-length semantic relation tuples.

Neural Embedding Models

The biggest drawback of the previous method is the need to calculate the matrices for PPMI and SVD given corpora with billions of tokens containing millions of unique words. In fact, embeddings trained on corpora of that size were not even feasible until Mikolov et al. introduced the continuous bag-of-words (CBOW) model and the skip-gram (SG) model [27]. These methods of generating word embeddings fall into a category which uses neural networks, called neural embedding models. Recent studies have shown that these are often approximations of more computationally intensive methods, such as the skip-gram model approximating the truncated SVD of the PMI matrix [23].

The CBOW and SG models were subsequently improved by Mikolov et al. in terms of training time and accuracy of the vectors for uncommon words [28]. This improvement was released in word2vec3, which is commonly used to train word embeddings for many NLP tasks. Xu et al. [45] used word2vec [28] to train word embeddings based on the English Wikipedia corpus4, and these embeddings will be used in all of the experiments in this thesis.
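Training such embeddings is typically only a few lines with an off-the-shelf implementation. The sketch below uses gensim's Word2Vec class on a toy corpus (assuming gensim 4.x, where the dimensionality parameter is named vector_size) rather than the English Wikipedia corpus used by Xu et al. [45].

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be a large tokenized corpus such as English Wikipedia.
sentences = [
    ["the", "man", "radiated", "jolliness"],
    ["the", "kitchen", "contains", "salt", "pepper", "and", "olive", "oil"],
    ["she", "poured", "flour", "into", "a", "pretend", "dragon"],
]

# sg=1 selects the skip-gram model; sg=0 would select CBOW.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)

vec = model.wv["flour"]                                 # the learned 50-dimensional embedding
print(vec.shape, model.wv.most_similar("flour", topn=2))
```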

3.2 Neural Networks

Neural networks are machine learning algorithms which historically were designed to mimic biological neural networks. They are composed of large computational graphs, where a vector of input values is repeatedly modified through one or more hidden layers to produce an output which approximates some desired value. These hidden layers may involve multiplication by learned weights; applications of nonlinear activation functions; concatenation, pooling, or other layer-based combination operations; convolution or other localized cross-correlation operations; and even recurrence or recursion. Weights in a neural network are usually updated during learning through backpropagation, or the use of gradient descent to determine the error of the network's output given a desired output and an error or loss function. As such, neural networks are generally composed of only differentiable operations, allowing the error gradient to be easily calculated and applied to every internal parameter.

Notation

The following notation is used in the remaining sections. All vectors are represented as bold lowercase letters (x, h) and all scalar elements within those vectors as lowercase letters with a subscript indicating their index (x_0, h_1). All matrices and lists of vectors (time series) will be represented as bold uppercase letters (W, X) with individual vector rows or elements as ordinary vectors with a subscript indicating their index (x_0). The final output of each neural network will be denoted as ŷ, regardless of the shape of the output (scalar, vector, time series). Also note that in all equations and figures, all bias values are left out5.

Vector concatenation will be represented in one of two ways, either with the ⊕ operator or by stacking in a matrix: c = a ⊕ b = [a; b]. Finally, element-wise multiplication will be represented with the ⊙ operator: [5 2] ⊙ [2 5] = [10 10].

2 One of which is available in the Gensim library as a test file: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt. Last accessed May 1, 2018.

3 https://code.google.com/archive/p/word2vec/. Last accessed May 1, 2018.

3.3 Feedforward Neural Networks

Feedforward neural networks (FFNNs) are neural networks which form a computation graph with no cycles. The simplest variety of FFNNs only use chains of dense or fully connected layers, or layers where every output from the previous layer has a connection to every input for the next layer. For example, a simple neural network with one hidden layer would have two weight matrices to learn:

\hat{y} = g(V f(W x)),    (3.3)

where V and W are the hidden weight matrices and f and g are activation functions. Figure 3.1 illustrates this, with all vectors (top) and weight matrices (inside) labelled, including their calculations (bottom).

Figure 3.1: A simple FFNN with three nodes in each layer and two applied activation functions, fitting equation 3.3.
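A forward pass through equation 3.3 is just two matrix products and two activations. A minimal numpy sketch follows, with f = tanh and g = sigmoid chosen arbitrarily, random weights as stand-ins, and the bias terms from footnote 5 omitted as in the equation.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def ffnn_forward(x: np.ndarray, w: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Equation 3.3: y_hat = g(V f(W x)), with f = tanh and g = sigmoid for illustration."""
    h = np.tanh(w @ x)        # hidden layer activation f(Wx)
    return sigmoid(v @ h)     # output layer activation g(Vh)

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # three input features, as in figure 3.1
w = rng.normal(size=(3, 3))       # input-to-hidden weights W
v = rng.normal(size=(3, 3))       # hidden-to-output weights V
print(ffnn_forward(x, w, v))      # three output activations
```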

3.4 Recurrent Neural Networks

Recurrent neural networks (RNNs) relax FFNNs to include cycles, often to handle data with an explicit ordering (such as temporal data or words in a sentence). Networks in this category need to take into consideration temporal dependencies within the input when updating weights, and so use backpropagation through time for learning weights.

Figure 3.2 illustrates an RNN using the same underlying FFNN from figure 3.1. Note that the vertical axis has changed from representing elements within each layer to representing the layers over time. Also note that the input has changed from a single vector to a list of vectors (or a matrix which is traversed as a list of vectors). Here, the output ŷ is determined by a composition of all outputs over time. Common ways to compose the output are to discard all but the last value, pool all the outputs (mean, max, sum, product, or some other element-wise operation), or return the outputs as a new series. That is, RNNs may perform many-to-one and many-to-many operations, such that the input and output vary in size. Further, other network constructions may be one-to-many, taking in a single value and recurring until some internal trigger stops the output.

Figure 3.2: A simple recurrent neural network with each layer simplified to a single node. In this example, the input X is assumed to be a series of three vectors and the network itself is "unrolled" over time (vertical axis). Note how the output of the hidden layer is propagated through time and how the output for all layers is accrued into a final result. For a given input X, the weight matrices U, V and W remain constant.

5 This bias is used to prevent the learned function from necessarily passing through the origin. For example, equation 3.3 would be rewritten as \hat{y} = g(V f(W x + b_x) + b_h). In general, all non-output layers in all networks

The activation of the above RNN at a given time step t is

y_t = g(V f(r(W x_t, h_{t-1}))),    (3.4)

where r is the recurrence function. This recurrence function defines how the hidden layer's activation from the previous time step affects the current time step's activation. Often, this involves a linear transformation by another weight matrix U before adding it directly to the linear transformation of the input. In this case, the hidden layer's activation is defined as

h_t = f(r(W x_t, h_{t-1})) = f(W x_t + U h_{t-1}).    (3.5)

This means that U purely focuses on determining what information is valuable to maintain in the hidden state vector.
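Equations 3.4 and 3.5 amount to a loop that carries the hidden state forward. A minimal numpy sketch, using tanh for f, the identity for g, random weights as stand-ins, and keeping only the final output (one of the composition options described above):

```python
import numpy as np

def rnn_forward(xs: np.ndarray, w: np.ndarray, u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Run a simple RNN over a sequence xs (one row per time step) and return the last output."""
    h = np.zeros(u.shape[0])                  # h_0, the initial hidden state
    for x_t in xs:
        h = np.tanh(w @ x_t + u @ h)          # equation 3.5: h_t = f(W x_t + U h_{t-1})
    return v @ h                              # equation 3.4 with g = identity, last step only

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 4))        # a sequence of five 4-dimensional input vectors
w = rng.normal(size=(8, 4))         # input-to-hidden weights W
u = rng.normal(size=(8, 8))         # hidden-to-hidden weights U
v = rng.normal(size=(2, 8))         # hidden-to-output weights V
print(rnn_forward(xs, w, u, v))     # two output values after the final time step
```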

One advantage of RNNs over FFNNs is their ability to compactly handle variably-sized input and output. FFNNs could in theory handle the same problems, but require that the input size remain fixed. For sequences with arbitrarily long lags between important patterns, this becomes infeasible partly due to the necessary size of the input and number of weights needed. For example, consider the sentiment classification task from section 3.5. A naive approach with FFNNs could be to concatenate embedding vectors for every word in a sentence as the input, padding with zeros to account for the longest sentence in the data set. However, this longest sentence might be one hundred words long, meaning that all input vectors would be one hundred times larger than the embedding length. Recall from the discussion in section 3.1 the problems associated with inputs of this size. An FFNN constructed in this way would also need to learn weights which handle every possible positioning of key patterns. This quickly becomes infeasible with larger inputs, both from a learning perspective and a data gathering perspective.

Long Short-Term Memory Networks

The ability of RNNs to handle arbitrarily long sequences presents a major complication in the training of networks. The need to backpropagate error through time when learning leads to the possibility that the error gradient will either exponentially increase or decrease, known as the exploding or vanishing gradient problem [14]. The longer the input sequence, the higher the probability becomes that this will happen. Exploding gradients ruin the learning process, making it such that the weights update in ways that prevent the network from ever converging on a suitable optimum. Vanishing gradients prevent useful information from propagating through the network, making updates to the weights negligible. In order to address this issue, Hochreiter and Schmidhuber [15] proposed the long short-term memory (LSTM) unit. The original design has been further improved in recent years, leading to multiple variants in the literature. Of particular interest is the variant used by Xu et al. [45], which was inspired by the tree-based architecture proposed by Zhu et al. [46] and which matches the implementation for LSTM units in Keras [3].

Figure 3.3: A single LSTM unit, borrowing notation from [45]. Here, ⊙ is the element-wise multiplication operator and the remaining nodes represent the various activation functions in use. h is the activation of this hidden unit, c is the unit's memory cell, and i, f, o and g are various gates described in the text.

Figure 3.3 shows the general architecture for a single LSTM unit at time step t. Again the recurrence depends upon the input x_t and the hidden layer's previous output h_{t-1}. However, in this formulation an additional internal value c is kept between time steps, called the LSTM's memory cell. This is multiplied by the output of the forget gate (f_t) to determine element by element which values in memory need to be forgotten (set near 0). In addition to the forget gate, the LSTM unit includes an input gate (i) which prepares the input for use by the memory cell, a candidate gate (g) which applies an element-by-element weighting to i, and an output gate (o) which extracts information from the raw input to be combined with the activated output of the memory cell. Formally, this unit is defined by the following six equations:

i_t = \sigma(W_i x_t + U_i h_{t-1})    (3.6)
f_t = \sigma(W_f x_t + U_f h_{t-1})    (3.7)
o_t = \sigma(W_o x_t + U_o h_{t-1})    (3.8)
g_t = \tanh(W_g x_t + U_g h_{t-1})    (3.9)
c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}    (3.10)
h_t = o_t ⊙ \tanh(c_t)    (3.11)

The multiple uses of element-wise multiplication (⊙) are key to the LSTM's ability to avoid exploding and vanishing gradients. By "closing" (setting to near-zero) individual elements in the gates, the unit prevents the propagation of unnecessary information through the network by leaving those elements unchanged between time steps, in effect dynamically decreasing the depth of the network. Graves [10] presents a much clearer description of this phenomenon, along with explanatory figures. This particular topic lies outside the scope of this thesis.
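Equations 3.6 to 3.11 translate almost line for line into code. A minimal numpy sketch of a single LSTM time step, again with bias terms omitted; the dictionary-of-matrices layout and random weights are illustrative choices, not the Keras implementation mentioned above.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step. W and U are dicts of gate-specific weight matrices (keys i, f, o, g)."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)      # input gate, eq. 3.6
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)      # forget gate, eq. 3.7
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)      # output gate, eq. 3.8
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev)      # candidate gate, eq. 3.9
    c = i * g + f * c_prev                           # memory cell, eq. 3.10 (element-wise products)
    h = o * np.tanh(c)                               # hidden activation, eq. 3.11
    return h, c

rng = np.random.default_rng(2)
d_in, d_hid = 4, 6
W = {k: rng.normal(size=(d_hid, d_in)) for k in "ifog"}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in "ifog"}
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(3, d_in)):               # a sequence of three input vectors
    h, c = lstm_step(x_t, h, c, W, U)
print(h.round(3))
```

Gate outputs near zero correspond to the "closing" described above: the affected memory-cell elements pass through a time step essentially unchanged.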

Common Activation Functions

As mentioned above, most NNs employ activation functions. Activation functions generally take the form of a nonlinear transformation applied to the result of a linear operation, affecting the result vector element by element. These seek to improve the network's ability to learn as well as its convergence speed. The two most important activation functions for this thesis were used in equations (3.6) to (3.11): sigmoid (σ) and hyperbolic tangent (tanh). Also of interest is the rectified linear unit (ReLU), which has recently been the most popular choice in many applications [22]. All three activation functions are shown below in figure 3.4. Note that the range of each function varies. Some situations may arise where a system needs to propagate only positive values (sigmoid in some LSTM gates), or where it needs to capture both highly positive and highly negative values. Further, some activation functions are chosen for the properties of their gradients. ReLU improves learning in networks with many layers due to its gradients always being either 0 or 1, which is useful when a network might suffer from vanishing gradients when using other activation functions [26].

\mathrm{sigmoid}(x) = \sigma(x) = \frac{1}{1 + e^{-x}}

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

\mathrm{ReLU}(x) = \max(0, x)

Figure 3.4: Popular activation functions used in neural networks.

For many classification tasks, the final layer of a neural network is fed through the softmax function, similar to multinomial linear regression. Instead of operating independently over each element of the vector, the softmax function generalizes the sigmoid function to ensure that the sum of all elements in the result is 1:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}},    (3.12)

where K is the dimensionality of the input vector. The output vector from this function approaches a one-hot vector as the difference between the input vector's largest value and the rest of its values increases. Often, the desired output is encoded as a one-hot vector, so it becomes clear that this method encourages the network to learn weights which emphasize a single element to be larger than the others, rather than enforcing a specific output value.
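Equation 3.12 in code, with the common max-subtraction trick added for numerical stability (a practical detail not mentioned above):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Equation 3.12: exponentiate and normalize so the elements sum to 1."""
    shifted = x - x.max()            # subtracting the max does not change the result
    e = np.exp(shifted)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))               # roughly [0.66 0.24 0.10], summing to 1
print(softmax(scores * 10))          # sharper, approaching a one-hot vector
```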

3.5 Structured Data

So far, the input data to these neural networks has been either a single vector or a series of vectors. That is, the structure of the input has been exclusively linear. Recall the syntactic and semantic dependencies discussed in chapter 2, which are usually formatted as tree or graph structures over the words of a sentence. As a linguistic tool, these structures encode more information about the interaction of words within a sentence than the simple ordering of those words encodes, and many machine learning systems use them for computational linguistics tasks. However, the neural network architectures discussed so far do not handle nonlinearly-structured data of any sort. This section discusses architectures which do handle these structures, with a focus on the SERC task as well as sentiment classification.


Sentiment Classification

Sentiment classification is a global classification task where one wants to label an entire sentence with a sentiment value (whether it is positive or negative). One sentiment analysis data set of particular interest is the Stanford Sentiment Treebank [42] (SST). For this task, the authors provide entire sentiment trees where not only are individual words labelled with sentiment values, but all nodes in those trees are labelled as well. Further, rather than simple positive or negative labels, the SST includes five possible fine-grained sentiment labels: negative, somewhat negative, neutral, somewhat positive, and positive. SST is used as an illustrative and motivational tool for many of the following architectures, but is not explored outside of this chapter.

Tree-Structured Data

Many approaches to tasks in NLP use SDTs to add extra features beyond locality and co-occurrences as well as to restructure the data. The literature contains numerous examples of this, and in general one of two approaches is taken to explicitly handle tree structures:

1. Use convolution or recursion when traversing the tree.

2. Use the path between two nodes in the tree to restrict the input to a linear subpart.

Mou et al. [32] extend the idea of convolution to focus on weighting different kinds of dependencies, rather than positions in a sliding window. Their system, the tree-based convolutional neural network (TBCNN), applies the following equation to every single-depth subtree in the SDT:

y_p = f\left(W_p\, p + \sum_{i=1}^{n} W_{r[c_i]}\, c_i\right),    (3.13)

where p is the embedding for the parent word of the subtree, c_i is the embedding for the ith child word in the subtree, r[c_i] is the grammatical relationship between each child word c_i and its parent p, and W_p and W_{r[c_i]} are role-specific weight matrices. Their system was evaluated over the SST, which, as mentioned in section 3.5, is the global task of sentiment classification. Socher et al. [42] use a recursive neural tensor network (RvNTN) in their solution to the SST task. In short, they compose the sentiment for an entire tree by recursively applying the same network to classify the sentiment of every subtree. That is, the network architecture is built up by recursively applying the equation

p_i = f\left(\begin{bmatrix} c_i^{l} \\ c_i^{r} \end{bmatrix}^{T} V^{[1:d]} \begin{bmatrix} c_i^{l} \\ c_i^{r} \end{bmatrix} + W \begin{bmatrix} c_i^{l} \\ c_i^{r} \end{bmatrix}\right)    (3.14)

over a tree, where p_i is the ith parent node in the tree and c_i^{l} and c_i^{r} are its left and right child nodes, respectively. The tensor portion of the network is V^{[1:d]}, which represents d-many 2d × 2d matrices, where d is the embedding dimension. Further details, particularly those regarding the tensor backpropagation through structure used to train this system, are left to the original paper.
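A direct, illustrative transcription of equation 3.14 for a single composition step; the random weights and the helper name rvntn_compose are stand-ins and not Socher et al.'s implementation.

```python
import numpy as np

def rvntn_compose(c_left, c_right, V, W, f=np.tanh):
    """Equation 3.14: compose two d-dimensional children into a d-dimensional parent."""
    c = np.concatenate([c_left, c_right])                              # stacked child vector, length 2d
    tensor_term = np.array([c @ V[k] @ c for k in range(V.shape[0])])  # one scalar per 2d x 2d slice of V
    return f(tensor_term + W @ c)

d = 3
rng = np.random.default_rng(3)
V = rng.normal(size=(d, 2 * d, 2 * d))       # d slices of size 2d x 2d
W = rng.normal(size=(d, 2 * d))
b, c = rng.normal(size=d), rng.normal(size=d)
p1 = rvntn_compose(b, c, V, W)                       # a parent node built from children b and c
p2 = rvntn_compose(rng.normal(size=d), p1, V, W)     # the same weights reused one level up the tree
print(p1.round(3), p2.round(3))
```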

Figure 3.5 shows both a simple tree and the functions used to recursively compose the overall representation of that tree. Notice however that this formulation only allows for two children to every parent node. In the data set for SST, the trees are restricted to binary trees, which allows the authors to make important assumptions about the structure of the data6. All intermediate representations and the final representation for each tree are evaluated using a softmax classifier to predict the individual sentiment values.

Figure 3.5: Example of how the recursive neural tensor network is constructed based on the tree on the left given equation 3.14.

Recursive neural networks (RvNNs) can be viewed as a way to recursively apply an FFNN over all nodes of a tree to generate a distributed representation of that tree. That FFNN can instead be replaced with an RNN which keeps an internal state alongside the recursively-built representation. This idea forms the basis for most techniques that apply RNNs directly to tree structures. For example, Zhu et al. [46] apply a custom LSTM unit recursively to the nodes of binary trees in their own approach to SST.

However, the focus task for this thesis is the SERC task, a local classification task. Most of the systems discussed thus far could theoretically be modified to handle the SERC task. The TBCNN produces a representation (and prediction) for every node in the tree. If the relation classification could be accomplished using a single parent node in the tree common to both entities, then the prediction for that node would suffice. The RvNTN could be adapted in the same way. Prior to the RvNTN, Socher et al. [41] developed a matrix-vector recursive neural network (MV-RvNN) which behaved similarly, but was applied to the SERC task by following the nodes along the dependency path between entities in the SDT to find the representation for their first common parent node.

The drill platform (e1) was a part of the Romulan mining ship (e2) Narada.

Figure 3.6: Syntactic dependency tree for partial training example #474. Its dependency path is highlighted in dotted blue, and the relation between the entities is Component-Whole(e1,e2).

This idea of using the syntactic dependency path (SynDP) is what many RNN-based approaches to the SERC task use. Rather than using the SynDP to formulate a recursive representation of the subtree bounded by that path, most use the path itself as a linearization of the tree bounded by the two entities. With this formulation, the earlier, simpler systems can be used. This is exactly what Xu et al. [45] did in their approach to the SERC, which serves as the reference system for the studies in this thesis.

Syntactic Dependency Path-based LSTM

Xu et al. [45] propose the shortest dependency path LSTM (SDP-LSTM), an LSTM-based RNN which uses the SynDP between two entities to structure the input to a multi-channel LSTM. Figure 3.6 shows the SDT for an example sentence. The SynDP for this example passes through platform, part, Narada, and ship, in that order. The SDP-LSTM system would break this down into two paths, the first from platform to part, and the second from ship to Narada to part. This ensures that the entities remain at the beginning of each path and that the common root of the subtree remains at the end of each path, taking into consideration the directionality of the individual syntactic relation chains. The authors argue that this directionality matters and that this scheme improves performance over simply using a single path.
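To make the path-splitting step concrete, the sketch below (plain Python, with hypothetical head indices chosen to reproduce the paths described above rather than taken from an actual parser output) walks from each entity up to their lowest common ancestor in the SDT and returns the two halves:

```python
def path_to_root(token, heads):
    """Return the token indices from `token` up to the root (inclusive).
    `heads` maps each token index to its syntactic head; the root maps to None."""
    path = [token]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path

def sdp_halves(e1, e2, heads):
    """Split the shortest dependency path between e1 and e2 into two halves,
    each running from an entity up to their lowest common ancestor."""
    up1, up2 = path_to_root(e1, heads), path_to_root(e2, heads)
    ancestors = set(up1)
    lca = next(t for t in up2 if t in ancestors)   # first shared ancestor
    left = up1[:up1.index(lca) + 1]                # e1 ... lca
    right = up2[:up2.index(lca) + 1]               # e2 ... lca
    return left, right

# Toy usage on the sentence from figure 3.6 (head indices are illustrative).
words = ["The", "drill", "platform", "was", "a", "part", "of", "the",
         "Romulan", "mining", "ship", "Narada"]
heads = {0: 2, 1: 2, 2: 5, 3: 5, 4: 5, 5: None, 6: 11, 7: 11,
         8: 11, 9: 11, 10: 11, 11: 5}
left, right = sdp_halves(2, 10, heads)   # platform (e1) and ship (e2)
print([words[i] for i in left])   # ['platform', 'part']
print([words[i] for i in right])  # ['ship', 'Narada', 'part']
```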

The architecture of the SDP-LSTM is broken into distinct channels, each of which has two halves, one for each sub-path through the SynDP. The word channel of the SDP-LSTM is shown in figure 3.7. There, the vertical axis loosely represents time from the bottom up, but one can perform the calculations of the two halves in parallel or in series. The words, which are encoded as embeddings, are fed individually into the two LSTM units (r nodes), and the output of the hidden layer from each step is fed into a pooling layer (p nodes). Every channel has exactly two LSTM units and two pooling layers, each handling only their half of the SynDP (left or right). Finally, the two outputs from the pooling layers are concatenated and fed into a fully-connected hidden layer which is used as the output of the channel. Every channel’s output is then concatenated, fed into another hidden layer, then fed into a softmax layer to perform the final classification decision.
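A hedged PyTorch sketch of one such channel is given below; the class name, dimensions, and the ReLU activation are assumptions for illustration, not details taken from the SDP-LSTM paper:

```python
import torch
import torch.nn as nn

class SDPChannel(nn.Module):
    """One channel of an SDP-LSTM-style model: each half of the dependency path
    is read by its own LSTM, max-pooled over time, and the two pooled vectors
    are concatenated and passed through a dense output layer."""
    def __init__(self, emb_dim=50, hidden_dim=100, out_dim=50):
        super().__init__()
        self.lstm_left = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.lstm_right = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.dense = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, left_emb, right_emb):
        # left_emb, right_emb: (batch, path_length, emb_dim)
        h_left, _ = self.lstm_left(left_emb)
        h_right, _ = self.lstm_right(right_emb)
        pooled_left = h_left.max(dim=1).values     # element-wise max over time
        pooled_right = h_right.max(dim=1).values
        combined = torch.cat([pooled_left, pooled_right], dim=-1)
        return torch.relu(self.dense(combined))    # the channel's output o

# Toy usage: a 2-word left half and a 3-word right half, batch size 1.
channel = SDPChannel()
o_w = channel(torch.randn(1, 2, 50), torch.randn(1, 3, 50))
```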


Figure 3.7: The word channel of the SDP-LSTM, unrolled over the input SynDP from figure 3.6. Every input word is encoded as a vector (embedding) and the two halves of the path (L and R) have no interaction before concatenation. The input is read from the outside in by the LSTM units (r), with every intermediate result being pooled. The subscript number for each LSTM unit simply designates that unit in time; there are only two LSTM nodes. The pool layer (p) performs an element-wise max function to keep track of the highest activations over time, then passes that resulting vector to a dense output layer (o) after both halves are concatenated.

The four channels used by this system are meant to capture various degrees of syntactic and semantic information. For syntactic information, the part-of-speech tag for every word is used, as well as the grammatical relation between every adjacent pair of words in the SynDP. For semantic information, word embeddings pre-trained with word2vec and WordNet [30] hypernyms are used. For the purpose of this discussion, a hypernym is the broadest semantic category that a word falls under (such as noun.person or verb.perception). All input features are encoded as embeddings, but only the word embeddings are pre-trained. The remaining embeddings are trained alongside the complete system. See figure 3.8 for a complete overview of the system's architecture.

One final detail to note is that the system achieves its best performance by using dropout, a method proposed by Hinton et al. [13] to alleviate overfitting. The general idea is that by setting some percentage of the input to zero, the patterns learned over that input become less interdependent and thus less sensitive to small changes. It is important to note that the dropout mask changes with every training epoch, but omits the same elements for every input vector to which it is applied within an epoch. SDP-LSTM has a dropout mask applied to the input embeddings (at a rate of 0.5) as well as the penultimate hidden layer (at a rate of 0.3).
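One way to read this per-epoch masking scheme is sketched below in NumPy; the helper name and dimensions are hypothetical, and this is only an interpretation of the description above, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(42)

def epoch_dropout_mask(dim, rate):
    """Sample one binary dropout mask; positions set to 0 are omitted."""
    return (rng.random(dim) >= rate).astype(np.float32)

emb_dim, n_epochs = 8, 2
for epoch in range(n_epochs):
    # A fresh mask is sampled each epoch (rate 0.5 on the input embeddings)...
    mask = epoch_dropout_mask(emb_dim, rate=0.5)
    for _ in range(3):                  # stand-in for the epoch's input vectors
        x = rng.normal(size=emb_dim)
        x_dropped = x * mask            # ...but the same elements are zeroed
                                        # for every vector within that epoch
```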



Figure 3.8: The full SDP-LSTM system unrolled assuming the same input as in figure 3.7. All four channels are specified along the same SynDP halves, but the grammatical relation (edge labels of the SynDP) channel has one fewer input per half. The output activation function is the softmax function from equation (3.12). The output of each channel (o) maintains the dimensionality of its input (embedding encoded) and the output of the network is a 19-dimensional vector. The chosen label of the relation is the argmax of that output vector.

Graph-Structured Data

Graphs present significantly more problems than trees for machine learning algorithms, due in part to the lack of restrictions on their structure and the lack of a clear order of traversal. This is even true when limiting the input domain to directed acyclic graphs, such as SDGs. The literature study for this thesis uncovered no current systems which employ SDGs for either the SERC or SST tasks.

Graphs can be represented in matrix form using an adjacency matrix, a square matrix where nonzero elements represent (directed) connections in a graph. Traditional convolutional neural networks (CNNs) often take matrices as input (such as pixel values in an image), so there is support for using adjacency matrices with CNNs [18]. Baldi and Pollastri [2] demonstrate how bidirectional LSTMs can be extended in a recursive manner to handle two-dimensional data like adjacency matrices.
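For reference, a minimal sketch of this representation is shown below (plain NumPy; the edge list is hypothetical rather than the output of any particular parser). Labelled dependency graphs such as SDGs would additionally need one such matrix per edge label, or an integer-coded variant:

```python
import numpy as np

def adjacency_matrix(n_tokens, edges):
    """Build a directed adjacency matrix from (head, dependent) index pairs.
    A[i, j] = 1 means there is an arc from token i to token j."""
    A = np.zeros((n_tokens, n_tokens), dtype=np.float32)
    for head, dep in edges:
        A[head, dep] = 1.0
    return A

# Hypothetical arcs over a five-token sentence.
edges = [(2, 0), (2, 1), (4, 3), (4, 2)]
A = adjacency_matrix(5, edges)
```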

Other approaches, such as those by Scarselli et al. [38] and Li et al. [24], devise recursive and recurrent architectures, respectively, which can be applied directly over graphs. One could also extend the dependency-specific weight matrices from the TBCNN to devise an RNN architecture which works well on SDGs, where much of the information is held in edge labels. Unfortunately, these are all approaches which are far better suited to global classification tasks, and there is no clear way to restrict SDGs to only those parts relevant to a local classification task. Even the use of dependency paths in semantic graphs is problematic, as SDGs may have zero or multiple shortest paths between entities. To address these problems and use SDGs in a neural network for relation classification, new methods are adapted from tree-based approaches and presented in the following chapter.


4 Semantic Dependencies for Relation Classification

This chapter presents the use of semantic dependencies as features to a relation classification system, a topic which has so far lacked coverage in NLP literature. First, the concept of semantic units is introduced, then used to form a semantic unit dependency path between entities. Second, the idea of dependency paths is extended to any directed graph with labelled edges, referred to as a directed dependency path. Finally, the previous two path constructions are extended to handle disconnected graphs through the introduction of a method for combining syntactic dependency paths with semantic dependency graphs. These three topics are the theoretical contributions of this thesis.

4.1 Predicate-Argument Dependency Paths

The two semantic dependency formalisms used in this thesis focus strongly on predicate-argument relationships between words. For this chapter, consider a predicate to be the head of any relationship, and an argument to be an indexed dependent to some predicate. Recall from section 2.4 the use of numbered ARG arcs in the DM formalism. CCD takes this a bit further by only having numbered arcs (1 through 6).

The first approach for extracting information from SDGs takes advantage of these ex-plicit predicate-argument substructures by constructing a semantic unit dependency graph (SUDG). This structure assumes that a sentence is composed of several discrete, related

semantic units, each composed of a predicate and its arguments, which capture the

indi-vidual ideas within a sentence.

Semantic Units

A semantic unit is represented as an ordered tuple where the first element is a predicate from the SDG and the remaining elements are that predicate’s arguments. There are at least three types of semantic units, depending on the expressivity of the formalism used to generate the graph. The two most relevant types are semantic modifiers and predicate-argument units:

• Semantic modifiers are semantic units which consist of only two words: the predicate and the word it modifies. The most common examples are BV and COMP (or compound) arcs in the DM formalism. The CCD formalism does not have this kind of unit.



• Predicate-argument units are semantic units which consist of a predicate and at least one numbered argument. The DM formalism labels these arguments as ARGN, where N is a number between 1 and 4. The CCD formalism labels these arguments simply as N between 1 and 5.

The third type of unit is isolated, which are generally words which play a purely syntactic role and therefore have no incoming or outgoing edges in the SDG. To illustrate the various types of units, take the sentence 'The lobster dinner I paid for was tasteless.' Its SDG and all semantic units are shown in figure 4.1.

[Figure 4.1: DM semantic dependency graph over 'The lobster dinner I paid for was tasteless.' with arcs BV, COMP, ARG1, ARG4, and top. Predicate-argument units: (tasteless, dinner) and (paid, I, -, -, dinner). Semantic modifier units: (The, dinner) and (lobster, dinner).]

Figure 4.1: DM-based semantic dependency graph (top) and its corresponding semantic units (bottom) grouped by type. The isolated units (for) and (was) have been omitted.

The SDG has a total of six semantic units, two of which are isolated units and omitted, two of which are predicate-argument units, and two of which are semantic modifier units. 'Tasteless' is arguably more of a semantic modifier (it is an adjective), yet is connected to 'dinner' via an ARG1 arc, so must be classified as a predicate-argument unit. The SDG also lacks ARG2 and ARG3 arcs, thus yielding a predicate-argument unit with two empty slots (marked as -). Since the order of the arguments matters, the lack of an argument should also matter, and thus predicate-argument units may include empty slots if ARG arcs are missing.

It is important to note that nodes in SDGs may have multiple outgoing arcs with the same label, particularly when dealing with compounds. In these cases, all possible interpretations are generated in a combinatorial fashion. That is, if one predicate has three ARG1 arcs, two ARG2 arcs, and one ARG3 arc, then six semantic units will be produced, one for every possible substitution of variables.
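A minimal sketch of this combinatorial expansion is shown below (plain Python with hypothetical word placeholders; the helper name is not from the thesis implementation):

```python
from itertools import product

def predicate_argument_units(predicate, args_by_slot, n_slots):
    """Generate every predicate-argument unit for one predicate.

    args_by_slot maps an argument number (1-based) to the list of words that
    fill that slot; missing slots become the empty marker '-'. One unit is
    produced per combination of slot fillers."""
    slots = [args_by_slot.get(i, ['-']) for i in range(1, n_slots + 1)]
    return [(predicate, *combo) for combo in product(*slots)]

# The case described above: three ARG1 fillers, two ARG2 fillers, one ARG3
# filler, and no ARG4 -> 3 * 2 * 1 * 1 = 6 units.
units = predicate_argument_units('paid',
                                 {1: ['a', 'b', 'c'], 2: ['d', 'e'], 3: ['f']},
                                 n_slots=4)
print(len(units))   # 6
```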

Intuitively, predicate-argument semantic units capture the most important aspects of a sentence, such as ‘I paid [for] dinner’ and ‘dinner [was] tasteless,’ where words in brackets are inferred. In this case, ‘for’ can be inferred due to ‘dinner’ being in the fourth argument slot and ‘was’ can be inferred due to ‘tasteless’ being an adjective and ‘paid’ being past-tense. Further, the semantic modifiers can be used to recursively compose the details: ‘I paid [for] (The (lobster dinner))’ and ‘(The (lobster dinner)) [was] tasteless.’ This decomposition is similar to viewing the semantic units as grounded expressions extracted from the SDG, which is related to the variable-free elementary dependencies presented by Oepen and Lønning [36]. Recall from chapter 2 that the DM formalism is based on these elementary dependencies, which were in turn based on minimal recursion semantics [4], which may be relevant to the recursivity present in the semantic modifiers.

Semantic Unit Dependency Graph

Once the semantic units have been extracted, they are used to form a graph. The process is extremely straightforward: draw undirected edges between units which have at least one word in common.
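Assuming that sharing at least one word is indeed the linking criterion, a minimal sketch of this construction might look as follows (plain Python; the function name and the set-of-index-pairs representation are hypothetical):

```python
from itertools import combinations

def build_sudg(units):
    """Connect semantic units (tuples of words) with an undirected edge whenever
    they share at least one word; returns the edge set of the resulting graph."""
    edges = set()
    for (i, u), (j, v) in combinations(enumerate(units), 2):
        shared = (set(u) & set(v)) - {'-'}   # ignore empty argument slots
        if shared:
            edges.add((i, j))
    return edges

# Units from figure 4.1; every pair below shares the word 'dinner'.
units = [('tasteless', 'dinner'),
         ('paid', 'I', '-', '-', 'dinner'),
         ('The', 'dinner'),
         ('lobster', 'dinner')]
print(build_sudg(units))   # all six pairs are connected through 'dinner'
```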
