
Identifying Base Noun Phrases by Means of Recurrent Neural Networks

Using Morphological and Dependency Features

Tonghe Wang

Uppsala University

Department of Linguistics and Philology
Master Programme in Language Technology

Master's Thesis in Language Technology, 30 ECTS credits


Abstract

Noun phrases convey key information in communication and are of interest in NLP tasks. A base NP is defined as the headword and left-hand side modifiers of a noun phrase. In this thesis, we identify base NPs in Universal Dependencies treebanks in English and French using an RNN architecture.

The data of this thesis consist of three multi-layered treebanks in which each sentence is annotated in both constituency and dependency formalisms. To build our training data, we find base NPs in the constituency layers and project them onto the dependency layer by labeling the corresponding tokens. For input features, we devise 18 configurations of features available in the UD annotation. We train RNN models with LSTM and GRU cells, with different numbers of epochs, on these configurations of features.


Contents

Acknowledgement

1. Introduction
   1.1. Purpose and outline

2. Background
   2.1. Noun phrases and base NPs
   2.2. Syntactic representations
        2.2.1. Constituency
        2.2.2. Dependency
   2.3. Chunking
        2.3.1. Data-driven chunking approaches
   2.4. Recurrent neural networks
        2.4.1. LSTM
        2.4.2. GRU
        2.4.3. Representation of categorical data

3. Methodology
   3.1. Overview
   3.2. Training data
        3.2.1. English
        3.2.2. French
   3.3. Data preparation
        3.3.1. Discovering base NPs
        3.3.2. Representation of base NPs
        3.3.3. Marking up base NPs in UD files
   3.4. Features
        3.4.1. Configurations of features
   3.5. Network structure

4. Experiments
   4.1. Training and validation
        4.1.1. Division of data
        4.1.2. Best models
   4.2. Experiments on the test set

5. Results and discussion
   5.1. Main results
   5.2. Discussion

A. Tabulation of experiment results

B. Brief documentation of code
   B.1. conllu.py
   B.2. eng_bnp.py and fra_bnp.py
   B.3. mappings.py
   B.4. helpers.py
   B.5. network_training.py
   B.6. predict.py


Acknowledgement

This work was carried out under the supervision of Beáta Megyesi. I am grateful for her guidance and insights. I would like to thank Ali Basirat, who provided indispensable expertise and encouragement.

I thank all my teachers for generously sharing their knowledge and driving me forward. My warm gratitude goes to all my classmates in the language technology programme. Our lakeside barbecues, pub talks and delirious nights in Chomsky are beautiful memories.


1. Introduction

Noun phrases convey substantial information in language. In human conversations, interlocutors can roughly figure out the main points merely from the information expressed by noun phrases. In various computational applications, natural language processing (NLP) techniques are used to obtain information from natural language. Here, noun phrases play a similar role in providing key information.

An NLP-powered calendar app is able to recognize key information about a meeting, such as the time, location, and names of participants, from text and automatically create an event with this information. These details are expressed with noun phrases. When users search for information in a search engine, they can find relevant web pages by entering a search query, for which a noun phrase usually suffices. As is seen in information retrieval (IR) tasks, the search engine typically matches the query against keywords that it has indexed, usually noun phrases. In dialogue systems, a similar approach is used to answer factoid questions. The system finds noun phrases in the user's utterance and matches them against the indexed keywords.

NLP systems carry out syntactic analysis, or parsing, to recognize a sentence and try to assign a syntactic structure to it. The result of a successful parsing attempt is a parse tree. This step is useful for downstream operations.

Syntactic parsing of sentences faces the challenge of structural ambiguity. Two kinds of structural ambiguity that are commonly seen in constituency parsing involve the attachment of sub-constituents and the coordination of phrases. In cases where full parsing is too cumbersome or unnecessary, 'shallow parsing' or 'partial parsing' is carried out instead. Shallow parsing is also favorable because it is more efficient than full parsing. For example, information systems only need to parse and process the segments of a text where relevant information is likely to appear.

Chunking is a form of partial parsing that identifies and classifies base phrases in sentences. Base phrases are flat, non-overlapping segments of a sentence that do not recursively contain any constituent of the same type. In most previous studies, as in this thesis, base phrases are defined to include the headword of the phrase and the words to the left of the head within the constituent. By focusing on internally flat phrases, chunking avoids the need to analyze internal structure, and by excluding the words that follow the headword within the constituent, it avoids the structural ambiguity of attaching right-hand side constituents.


The limitation of rule-based chunking approaches is the need for manually devising grammar rules. As for data-driven approaches, the limitation is the amount of available annotated data.

1.1. Purpose and outline

This thesis is an effort to continue and expand the line of research pertaining to chunking, particularly NP-chunking or the chunking of base noun phrases. In this thesis, we try to identify base NPs using a recurrent neural network (RNN) drawing information from morphological and dependency features.

Specifically, we try to extract base NPs from Universal Dependencies (UD) treebanks in English and French. The aim of this thesis is to find out which configurations of features available in the UD annotation are the most informative for discovering base NPs. By experimenting on monolingual and bilingual test sets, we also try to validate the universality of UD in a small way. Since features are annotated consistently across languages in UD, we anticipate finding out how transferable the predictive power of these configurations is across languages.

We approach this problem as a sequence-labelling task at which RNNs have been shown to be effective. We use treebanks for which both constituency and dependency representations are available to prepare our training data. We experiment with configurations of syntactic and morphological features annotated in UD treebanks to find out which features have the most potential to derive base NPs across languages. We use delexicalized features to avoid overfitting the models to one language.

In Chapter 2, we explain the concept of the noun phrase and base NPs in Section 2.1. We introduce notable schools of syntactic theory, constituency and dependency formalisms and the data format used for these formalisms in Section 2.2. We summarize previous work on identifying base phrases, or chunking, in Section 2.3. We illustrate what an RNN is and how different RNN units work in Section 2.4.

In Chapter 3, we give an overview of the methodology used in our experiments. We list the sources of our training data in Section 3.2. We explain how we preprocess the data in Section 3.3. We introduce the features and the configurations of features that we explore in Section 3.4. We then illustrate the structure of our RNN in Section 3.5.

In Chapter 4, we show the settings and steps of our experiments. In Chapter 5, we show the main results, explain how they are calculated and discuss what they mean.


2. Background

2.1. Noun phrases and base NPs

In natural languages, nouns are used to refer to concrete objects and abstract concepts; 'goulash' and 'happiness' are two examples. A noun phrase is a cluster of words in which the headword is a noun. Noun phrases express myriad objects and concepts, either with individual nouns alone or together with determinatives and modifiers. The meaning of a multi-word noun phrase may go beyond the respective meanings of its lexical components. For example, 'post rock' is a music genre that cannot be accurately described by either 'post' or 'rock'. Noun phrases can play various syntactic roles in a clause, which are indicated by their position, prepositions or postpositions, morphological variation or particles.

Apart from the headword, a noun phrase may contain determinatives and modifiers. These can be adjectives, pronouns and determiners, or prepositional or postpositional phrases and relative clauses. With different configurations of these components, noun phrases can exhibit different levels of complexity.

These components can appear on the left or right side of the headword. As shown in Figure 2.3, in the English noun phrase ‘a red car on the busy street’, the prepositional modifier ‘on the busy street’ appears on the right-hand side of the headword. The equivalent modifier in Uyghur ‘awat kochida’ appears on the left-hand side of the headword ‘mashina’.

The focus of this thesis is on 'base NPs'. Ramshaw and Marcus (1995) define a 'base NP' as 'the initial portions of non-recursive noun phrases up to the head, including determiners but not including postmodifying prepositional phrases or clauses'. Cardie and Pierce (1998) further describe base NPs as 'noun phrases that do not contain other noun phrase descendants'. Megyesi (2002) defines a base NP as consisting of the phrasal head and its modifiers on the left-hand side.

Following the definitions above, 'a car' and 'the busy street' in the noun phrase shown in Figure 2.4 are both base NPs. But the whole noun phrase is too complex to be a base NP, either because there is a post-modifying prepositional phrase, according to the definition given in Ramshaw and Marcus (1995), or because it contains two NP descendants, according to the definition given in Cardie and Pierce (1998).

Figure 2.2.: Hypernyms of 'happiness'. All nouns are ultimately hyponyms of 'entity' on WordNet (Miller, 1994).

Figure 2.3.: Determinatives and modifiers can appear on the left or right of the headword.

A base NP is also called an 'NP-chunk', following the earlier concept of a 'chunk' introduced by Abney et al. (1991). By observing prosodic and grammatical patterns, Abney et al. (1991) note that 'a typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template'. For example, Abney et al. (1991) divide the sentence 'when I read a sentence, I read it a chunk at a time' into chunks in the following manner, according to prosodic patterns: [when I read] [a sentence] [I read it] [a chunk] [at a time]. The words that make up a chunk are syntactically related, and chunks are non-overlapping. Thus, the chunks 'a sentence' and 'at a time' from the example above are respectively a noun phrase chunk (NP-chunk) and a prepositional phrase chunk (PP-chunk).

2.2. Syntactic representations

Figure 2.4.: A noun phrase that includes two base NPs.

To represent the syntactic structure of sentences, many theories and formalisms have been devised by researchers. Two of the most widely used representations for syntactic structures are the dependency (Mel'čuk, 1988) and constituency formalisms (Chomsky, 1956; Chomsky, 1979). The constituency formalism divides a sentence into increasingly small constituents. In contrast, the dependency formalism focuses on words and the dependency relations between two words.


For the constituency formalism, the underlying theory in constituency parsing typically refers to context-free grammars (CFG) (Chomsky, 1956).

For the dependency formalism, the underlying theory is a dependency grammar, whose operating logic focuses on relationships between syntactic units: binary relations are established between wordforms (Mel'čuk, 1988).

There are annotated treebanks in both formalisms, notably constituency treebanks annotated following the style of Penn Treebank (Marcus et al., 1993) and dependency treebanks following Universal Dependencies (Nivre et al., 2016). In constituency treebanks, the type and boundary of a constituent are annotated. In dependency treebanks, the annotation expresses part-of-speech tags and morphological features of words, dependency relations between words, and syntactic roles between the head and its dependent. However, the beginning, end and type of phrases are not explicitly annotated.

Other syntactic formalisms exist and some have been applied in NLP tasks.

Figure 2.5.: A syntactic tree in the constituency formalism.

Head-driven Phrase Structure Grammar (HPSG) (Pollard et al., 1994) is a lexicalist phrase structure grammar that evolved from generalized phrase structure grammar, an earlier effort to extend CFG. HPSG organizes linguistic information via types and subtypes that have at minimum two attributes: phonetic form (PHON) and syntactic and semantic information (SYNSEM). The basic type that HPSG deals with is the Saussurean concept of the sign, which has words and phrases as two subtypes (Pollard et al., 1994). NLP resources based on HPSG, including grammars and parsers, have been developed for several European and Asian languages (Sag et al., 2003).


Lexical-Functional Grammar (LFG) represents a sentence at several parallel levels, including a constituent structure and a functional structure, the latter associated with grammatical relations. A variety of NLP resources have been developed based on LFG (Sag et al., 2003, p. 538).

Tree-Adjoining Grammar (TAG) tries to turn syntactic trees into better representations of grammatical dependencies. The basic units in TAG are elementary trees. The representation of a sentence is done by rewriting nodes of trees as other trees through operations such as 'substitution' and 'adjunction' (Sag et al., 2003, p. 540). The XTAG Project is an effort to develop NLP resources in accordance with TAG (Joshi, 2001).

Below we illustrate constituency and dependency formalisms using examples of each.

2.2.1. Constituency

The constituency representation of syntactic structures assumes that sentences are organized hierarchically into phrases and that grammatical relations need to be defined in terms of phrase structure configurations (Sag et al., 2003).

Syntactic parsing in the constituency formalism can be done either in a top-down or in a bottom-up manner. The top-down approach, e.g. the Earley algorithm (Earley, 1970), begins at the top level of a parse tree and recognizes increasingly small constituents in a sentence. The bottom-up approach, e.g. the CKY algorithm (Kasami, 1965; Younger, 1967), starts with tokens and builds increasingly long constituents until one or more plausible complete trees are reached.

Figure 2.6.: A syntactic tree represented in Penn-style bracketing.


Figure 2.7.: Visualization of a dependency tree.

This parse tree is represented in Penn-style bracketing format in Figure 2.6. When analyzing this example sentence following the annotation convention used in Penn Treebank, we first start from the ROOT tag. The whole sentence is classified as a simple declarative clause (S), which consists of a clause introduced by a subordinating conjunction (SBAR), a noun phrase (NP) and a verb phrase (VP).

As we go down the tree to smaller constituents, we notice that the SBAR governs these child nodes: a preposition or subordinating conjunction (IN) and a simple declarative clause (S). On the same level of the parse tree, the NP consists of a possessive pronoun (PRP$) and a noun (NN), and the VP consists of a verb in the third person singular (VBZ) and a simple declarative clause (S). This continues down the branches of the tree until we reach the leaves, i.e. tokens, which have no further children.

2.2.2. Dependency

In contrast to the hierarchical approach found in constituency syntactic analysis, dependency grammar focuses on establishing binary relationships between wordforms, which are known as dependencies. The earliest effort to systematically build a syntactic theory of dependency grammar analysis dates back to Tesnière (1959).

Universal Dependencies (UD) is the de facto standard for dependency annotation. The UD project provides a universal inventory of PoS tags, dependency relations and other features. This allows consistent annotation of similar linguistic phenomena across languages. As of version 2.5, the UD project covers 157 treebanks for 90 languages (Zeman et al., 2019).

Figure 2.7 shows how the dependency relationships between tokens in a sentence are represented according to UD. This dependency tree is represented in the CoNLL-U format, explained in Section 2.2.2, as shown in Figure 2.8.

UD builds on earlier initiatives prior to its inauguration. These include efforts for universal part-of-speech tagsets such as Interset (Zeman, 2008), Google Universal Part-of-Speech (Petrov et al., 2012) and dependency treebanks such as HamleDT (Zeman et al., 2012), Universal Dependency Treebanks (McDonald et al., 2013), and Universal Stanford Dependencies (de Marneffe et al., 2014).


Figure 2.8.: CoNLL-U representation of the dependency tree.

UD draws on Stanford Dependencies (SD), from which it differs in a few respects. First, more grammatical relations are defined for SD than for UD. Such relations in SD are organized in a hierarchy, whose root is the least specific relation, dep. The relation between a head and its dependent is preferably annotated as precisely as possible; if this is not possible, a less specific relation further up the hierarchy is used. Second, the design principles of SD suggest that relations be established between content words. Functional words, such as prepositions and conjunctions, are 'collapsed' into the name of the relation.

CoNLL-U format

Dependency trees in UD treebanks are represented in plain text files encoded in UTF-8 in the CoNLL-U format¹, an extension of the CoNLL-X format (Buchholz and Marsi, 2006).

In these text files, blank lines mark sentence boundaries. Lines starting with a hash are comments, in which metadata about the sentence and the full text of the sentence are noted.

Within the boundaries of each sentence, each line is separated by tabs into 10 columns that respectively indicate the word index (ID), word form or punctuation (FORM), lemma or stem of the word form (LEMMA), universal PoS tag (UPOS), language-specific PoS tag (XPOS), list of morphological features (FEATS), head of the current word (HEAD), dependency relation to the head (DEPREL), enhanced dependency graph (DEPS), and other information (MISC). For multi-word tokens, the ID column may be a range covering the member words of the token. Contractions of French prepositions and articles are indicated in this manner.

A dependency tree in the CoNLL-U format is illustrated in Figure 2.8. Above the sentence, the ID of the sentence in the treebank (sent_id) and the full text of the sentence (text) are noted in comments. In the sentence, the ID column for du shows 5-6, indicating that it is a contraction of de (5) and le (6). On the lines for n' and divulgué, SpaceAfter is set to No in the MISC column, indicating that there is no space between the token and the next one.

¹ The CoNLL-U format is more thoroughly introduced on the official website of the Universal Dependencies project.
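To make the column layout above concrete, the following is a minimal sketch of splitting one token line into the ten named fields. The example line and the helper function are illustrative and not part of the thesis code.

```python
# Minimal sketch: split one CoNLL-U token line into its ten named fields.
# The example line below is hypothetical; real treebank lines follow the same layout.
CONLLU_COLUMNS = [
    "ID", "FORM", "LEMMA", "UPOS", "XPOS",
    "FEATS", "HEAD", "DEPREL", "DEPS", "MISC",
]

def parse_token_line(line):
    """Return a dict mapping CoNLL-U column names to the tab-separated values."""
    return dict(zip(CONLLU_COLUMNS, line.rstrip("\n").split("\t")))

example = "2\tcar\tcar\tNOUN\tNN\tNumber=Sing\t4\tnsubj\t_\tSpaceAfter=No"
token = parse_token_line(example)
print(token["UPOS"], token["HEAD"], token["DEPREL"])  # NOUN 4 nsubj
```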


2.3. Chunking

Chunking is the effort to identify and classify chunks, or base phrases, in a sentence. Since chunking is computationally less complex than full parsing, there has been much interest in this topic among researchers. Data-driven chunking, as opposed to rule-based chunking, learns chunking decisions from labeled data, for example annotated treebanks. It thereby avoids the limitation of rule-based chunking approaches, namely the need for manually devising grammar rules.

2.3.1. Data-driven chunking approaches

One of the earliest efforts of data-driven chunking was Ramshaw and Marcus (1995), who used transformation-based learning.

Transformation-based error-driven learning

Transformation-based error-driven learning was first proposed by Brill (1993) for automatic grammar induction and parsing of text. This technique has also been used for part-of-speech tagging (Brill, 1995) and other tasks.

Take parsing as an example of how transformation-based learning works. The parser starts with a naive state of knowledge about sentence structures and parses text using it. The parsing results are scored by comparison with annotated training data. Then the transformations, i.e. rewrite rules and their triggering environments, that lead to the highest-scoring result are learned. The learned parser consists of the naive state of knowledge along with all the learned transformations.

Ramshaw and Marcus (1995) use transformation-based learning for base NP chunking. They also first framed the problem as a sequence labelling task, adopting the chunk tag set I, O, B to indicate the span of base NPs, in which I indicates that a word is inside a base NP, O indicates that it is outside, and B marks the first word of a base NP.

To learn a chunking model, the chunker starts by making chunking predictions on the training corpus using a baseline heuristic. At each location where the prediction is wrong, candidate rules for correction are derived from selected features of neighboring words. Then those candidate rules are scored by testing against the rest of the training corpus. The best scoring rule is learned. This process is repeated.

Vilain and Day (2000) used rule sequence processors to carry out chunking. The processor first instantiates partial parses around individual words, then extends them and assigns a type following a sequence of rules. This sequence of rules is learnable in a similar fashion to Ramshaw and Marcus (1995), using error-driven transformation learning.

Memory-based learning


In memory-based learning, a prediction is made for a test case by computing its distance to all cases held in memory.

Argamon et al. (1998) used a memory-based sequence learning algorithm to identify NP and VP chunks by recognizing local sequential patterns. The sentences in input data for this algorithm are represented with sequences of PoS tags with spans of target phrases wrapped in brackets. Tiles, or fragments of sentences that include a single bracket, are scored based on their occurrences in the training data. The algorithm thus tries to find phrases by covering possible sequences with the most probable tiles.

Veenstra and Buchholz (1998) used IGTree, a variant of the memory-based learning method that builds decision trees based on information gain, to chunk NPs. Veenstra (1999) then went on to chunk NPs, VPs and full PPs as a tagging task. As an initial step in their work on grammatical relation assignment, Buchholz et al. (1999) chunked several types of phrases using three memory-based learning algorithms, also treating chunking as a tagging task.

Other pre-RNN approaches

Corpus-based PoS sequence matching Cardie and Pierce (1998) presented a corpus-based approach to finding base NPs by matching part-of-speech tag sequences and leaving out lexical information. Rules are the PoS tag sequences that constitute a base NP found in the training corpus. Rules are then scored against a pruning corpus by subtracting the number of false positives from the number of true positives. Unproductive rules are then pruned. At testing time, a sequence of PoS tags is identified as a base NP if it is identical to a rule.

Cascaded Markov models Brants (1999) used cascaded Markov models that are capable of producing structured partial parse trees to perform chunking, rather than framing it as a sequence labeling task. The models are organized in a layer-by-layer fashion. On each layer, the Markov model determines the best set of phrases, which are in turn fed into the next layer as input. The most likely phrases are determined using a modified Viterbi algorithm (Viterbi, 1967) based on probabilities in a stochastic context-free grammar derived from the training corpus.

Maximum entropy-based approaches Koeling (2000) used a maximum entropy model to chunk text. Osborne (2000) used an off-the-shelf maximum entropy-based PoS tagger to carry out chunking, feeding different concatenations of features as words to the tagger. The tagger then predicts the chunk labels.


| Study | Precision | Recall | F1 |
|---|---|---|---|
| Argamon et al. (1998) | – | – | 91.6% |
| Ramshaw and Marcus (1995) | 91.8% | 92.3% | 92.0% |
| Cardie and Pierce (1998) | 93.7% | 94% | 93.8% |
| Veenstra (1999) | 94.0% | 93.7% | 93.8% |
| Buchholz et al. (1999) | 92.5% | 92.2% | 92.3% |
| Osborne (2000) | 91.92% | 92.45% | 92.18% |
| Brants (1999) | 91.4% | 84.8% | 88.0% |
| Vilain and Day (2000) | 87.85% | 87.77% | 87.81% |
| van Halteren (2000) | 93.55% | 94.13% | 93.84% |
| Koeling (2000) | 93.18% | 92.84% | 93.01% |
| Kudo and Matsumoto (2001) | 93.89% | 93.92% | 93.90% |
| Zhang et al. (2001) | 94.29% | 94.01% | 94.15% |
| Lacroix (2018) | – | – | 89.99% |
| Attardi and Dell'Orletta (2008) | 95.07% | 94.62% | 94.84% |

Table 2.1.: The precision, recall or F1 scores of previous efforts at NP-chunking. For Argamon et al. (1998) and Lacroix (2018), only a single score is reported.

WPDV van Halteren (2000) used weighted probability distribution voting (WPDV) models to carry out chunking, using such features as adjacent words and tags in a window and NP-chunk labels for previous words.

In addition, other approaches to chunking have been tried in the research community, notably submissions to the CoNLL-2000 shared task on chunking (Sang and Buchholz, 2000). Kudo and Matsumoto (2001) used support vector machines (SVM) and weighted voting by multiple SVM systems to chunk English phrases.

Zhang et al. (2001) used regularized Winnow (Littlestone, 1988) for text chunking.

RNN-based approach

Lacroix (2018) deduced NP-chunks from English UD treebanks using a deep recurrent neural network with a Bi-LSTM architecture. They also demonstrated the benefit of performing chunking along with PoS-tagging through multitask learning and evaluated the effect of using NP-chunks as features on dependency parsing performance.

Table 2.1 summarizes the precision and recall of previous efforts at NP-chunking. It needs to be pointed out that this table is for information purposes only, as it is difficult to compare these previous studies directly. They use different definitions of NPs and base NPs and they are evaluated on different corpora; some count tokens for precision and recall, others count phrases.

2.4. Recurrent neural networks

A recurrent neural network (Elman, 1990) is a neural network architecture that contains recurrent units on the hidden layer.


A recurrent unit receives not only the current input but also its own hidden state from the previous time step, which carries forward context information. As a result, an RNN is better equipped to process sequences than a feedforward network.

Figure 2.9.: Internal structure of an RNN unit.

The internal structure of a recurrent unit is illustrated in Figure 2.9, where $x_t$ indicates the input vector at time t, $h_{t-1}$ the hidden state from time t−1 and $h_t$ the hidden state at time t.

Inside the unit, $x_t$ and $h_{t-1}$ are each multiplied by a matrix of weights before being added together. The sum is then transformed into $h_t$ through an activation function $f$, as shown in (2.1). $h_t$ is passed on to the next time step. $h_t$ can also be used to calculate $y_t$, the output at time t.

$$h_t = f(U h_{t-1} + W x_t) \qquad (2.1)$$
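For illustration, the sketch below steps a simple recurrent unit over a toy sequence with NumPy, following (2.1). The dimensions, the random weights and the choice of tanh as the activation function f are assumptions made only for the example.

```python
import numpy as np

# Sketch of equation (2.1): h_t = f(U h_{t-1} + W x_t), here with f = tanh.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
U = rng.normal(size=(hidden_dim, hidden_dim))  # weights applied to the previous hidden state
W = rng.normal(size=(hidden_dim, input_dim))   # weights applied to the current input

def rnn_step(h_prev, x_t):
    return np.tanh(U @ h_prev + W @ x_t)

h = np.zeros(hidden_dim)                       # initial hidden state
for x_t in rng.normal(size=(5, input_dim)):    # a toy sequence of five input vectors
    h = rnn_step(h, x_t)                       # h_t is carried on to the next time step
print(h)
```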

However, the attributes of a simple RNN make it difficult to capture relevant information available in long-distance dependencies. This problem is twofold.

On a conceptual level, the information captured by the state of the previous time step is relatively local, and such information is used to predict the time step at hand and carried further to make predictions for future time steps.

Take 'The car they are buying is red.' as an example. The simple RNN will learn the plural nature of 'they are'. Carrying that knowledge forward, it will come into conflict with the singular 'is', and at that time step the relationship between 'the car' and 'is' would have been overridden.

On the calculation level, ‘exploding’ and ‘vanishing gradients’ problems can arise when learning a simple RNN during backpropagation. As the states at the hidden layer are repeatedly multiplied, the gradients tend to either blow up or eventually become zero.

LSTM and GRU are two modified versions of RNN that can better address these problems.

2.4.1. LSTM

Long short-term memory (Hochreiter and Schmidhuber, 1997) units introduce mechanisms to selectively carry forward or forget context information. In an LSTM unit, a context layer is introduced to carry forward information. Three sigmoid functions are introduced to serve as 'forget', 'input' and 'output' gates that control to what extent information is taken in, forgotten and given out.

At each time step, the LSTM unit takes in the context vector from the previous time step $c_{t-1}$, the hidden state from the previous time step $h_{t-1}$ and the input for the current time step $x_t$.


Figure 2.10.: Internal structure of an LSTM unit

The forget gate deletes obsolete information from the context, as shown in (2.2) and (2.3). First, a mask value $f_t$ is obtained by passing the weighted sum of the hidden state at time t−1 and the input for time t through a sigmoid function. Then the mask is multiplied by the context vector to remove the information that is no longer needed.

$$f_t = \sigma(U_f h_{t-1} + W_f x_t) \qquad (2.2)$$

$$k_t = c_{t-1} \odot f_t \qquad (2.3)$$

The information to be learned from the previous hidden state and the current input is obtained using (2.4). Specifically, the weighted sum of the hidden state at time t−1 and the input vector at time t is passed through a tanh function.

$$g_t = \tanh(U_g h_{t-1} + W_g x_t) \qquad (2.4)$$

The input gate determines to what extent the information contained in $g_t$ is incorporated into the current context. Similar to the forget gate, first a mask $i_t$ is calculated by passing a weighted sum through a sigmoid function (2.5). Then $g_t$ is multiplied by the mask (2.6).

$$i_t = \sigma(U_i h_{t-1} + W_i x_t) \qquad (2.5)$$

$$j_t = g_t \odot i_t \qquad (2.6)$$

The result of the calculations inside the input gate, $j_t$, is added to $k_t$ to produce $c_t$, the context vector for time t.

The output gate then determines what information is needed for the current hidden state, as opposed to being carried forward for future time steps. First a mask $o_t$ is calculated by passing a weighted sum through a sigmoid function. Then the mask is multiplied by the context vector passed through tanh to produce the hidden state for time t. The two calculations are shown in (2.7) and (2.8).

$$o_t = \sigma(U_o h_{t-1} + W_o x_t) \qquad (2.7)$$

$$h_t = o_t \odot \tanh(c_t) \qquad (2.8)$$

In the end, $c_t$ and $h_t$ are carried forward to the next time step. In a neural network, $h_t$ can either be fed to the subsequent layers or used as output.

Figure 2.11.: Internal structure of a GRU.
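Putting (2.2)–(2.8) together, a single LSTM step can be sketched in NumPy as below. The weight shapes, the sigmoid helper and the random initialization are illustrative assumptions, and bias terms are omitted just as in the equations above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x_t, p):
    """One LSTM step following (2.2)-(2.8); p holds the weight matrices U_* and W_*."""
    f_t = sigmoid(p["Uf"] @ h_prev + p["Wf"] @ x_t)  # forget gate (2.2)
    k_t = c_prev * f_t                               # prune obsolete context (2.3)
    g_t = np.tanh(p["Ug"] @ h_prev + p["Wg"] @ x_t)  # candidate information (2.4)
    i_t = sigmoid(p["Ui"] @ h_prev + p["Wi"] @ x_t)  # input gate (2.5)
    j_t = g_t * i_t                                  # gated new information (2.6)
    c_t = k_t + j_t                                  # updated context vector
    o_t = sigmoid(p["Uo"] @ h_prev + p["Wo"] @ x_t)  # output gate (2.7)
    h_t = o_t * np.tanh(c_t)                         # new hidden state (2.8)
    return c_t, h_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3
params = {f"U{g}": rng.normal(size=(d_hid, d_hid)) for g in "figo"}
params.update({f"W{g}": rng.normal(size=(d_hid, d_in)) for g in "figo"})
c, h = np.zeros(d_hid), np.zeros(d_hid)
c, h = lstm_step(c, h, rng.normal(size=d_in), params)
```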

2.4.2. GRU

Gated recurrent units (Cho et al., 2014) are a less complex form of recurrent units.

Since four sets of weights are kept in a single LSTM unit, the burden of training is considerable. GRUs employ a reset gate and an update gate, fewer gates than LSTM units use. As a result, a GRU has fewer learnable parameters than an LSTM. The inner workings of a GRU are shown in Figure 2.11.

At the reset gate (2.9) and the update gate (2.10), the weighted sum of the previous hidden state and the input vector is passed through a sigmoid function, resulting in masks $r_t$ and $z_t$ with values between 0 and 1.

$$r_t = \sigma(U_r h_{t-1} + W_r x_t) \qquad (2.9)$$

$$z_t = \sigma(U_z h_{t-1} + W_z x_t) \qquad (2.10)$$

The intermediate value for the hidden state at time t, $\tilde{h}_t$, is calculated with (2.11). In this calculation, the result of the reset gate, $r_t$, is used in a pointwise multiplication to determine which aspects of the previous hidden state are kept in the intermediate hidden state.

$$\tilde{h}_t = \tanh(U(r_t \odot h_{t-1}) + W x_t) \qquad (2.11)$$

In (2.12), the result of the update gate, $z_t$, is used to determine how much of the previous hidden state and how much of the intermediate hidden state are combined. The result of this calculation is the hidden state for time t, $h_t$.

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (2.12)$$
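For comparison with the LSTM sketch above, one GRU step following (2.9)–(2.12) could look as follows; again the weight shapes are illustrative and biases are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, p):
    """One GRU step following (2.9)-(2.12); p holds the weight matrices."""
    r_t = sigmoid(p["Ur"] @ h_prev + p["Wr"] @ x_t)            # reset gate (2.9)
    z_t = sigmoid(p["Uz"] @ h_prev + p["Wz"] @ x_t)            # update gate (2.10)
    h_tilde = np.tanh(p["U"] @ (r_t * h_prev) + p["W"] @ x_t)  # intermediate state (2.11)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # new hidden state (2.12)
```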

2.4.3. Representation of categorical data

The input data used in this thesis are linguistic features, and the possible values for each feature fall into discrete categories. For example, English numerals can be cardinal, ordinal or multiplicative. This distinction is represented in UD with the NumType feature, which has seven possible values, among them Card for cardinal, Ord for ordinal and Mult for multiplicative.
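To illustrate how such categorical values can be prepared for a network, the sketch below one-hot encodes a few NumType values; the value list is only an excerpt used for the example, not the full UD inventory.

```python
# Sketch: one-hot encoding of a categorical UD feature (an excerpt of NumType values).
numtype_values = ["Card", "Ord", "Mult"]  # illustrative subset only
index = {value: i for i, value in enumerate(numtype_values)}

def one_hot(value):
    vec = [0] * len(numtype_values)
    vec[index[value]] = 1
    return vec

print(one_hot("Ord"))  # [0, 1, 0]
```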


3. Methodology

3.1. Overview

This thesis treats the task of identifying base NPs as a sequence labeling task using a deep recurrent neural network (RNN).

We take treebanks with both constituency and dependency annotation layers as our data source. In order to produce a labeled data set, we need to mark the boundaries of base NPs onto dependency trees. The first step is to gather sentences that are annotated in both formalisms.

We mark up base NPs found in the constituency layer onto the CoNLL-U representations. First, we find NP constituents in the constituency layer; at this step, we use a Python script to determine which NP constituents meet the criteria for base NPs. Then, we find each matching sentence in the dependency layer. We extend the CoNLL-U format by adding a column in which we tag each token with 'B' if it is the first word of a base NP, 'I' if it is a non-initial word in a base NP, or 'O' if it is outside of any base NP, following the tags used in Sang and Buchholz (2000).
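As a small illustration of this labelling scheme (the sentence is made up and not taken from the treebanks), the only base NP in 'The busy street was empty .' would be tagged as follows:

```python
# Hypothetical example: 'the busy street' is the only base NP in the sentence.
tokens = ["The", "busy", "street", "was", "empty", "."]
bnp    = ["B",   "I",    "I",      "O",   "O",     "O"]
```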

All English sentences are divided into eleven sections, one of which is randomly chosen to be the test set. The remaining ten sections are reserved for 10-fold cross-validation in the training-validation phase of the experiments. Sentences in the French data set are divided in the same way.

We construct an RNN with two bidirectional recurrent layers, in which two kinds of recurrent units, LSTM and GRU, are tested. We train the network respectively using the English data set, the French data set, and a concatenation of both, while adopting different configurations of syntactic and morphological features available in UD annotations as training input. We use the ‘BIO’ tags in the added column as training labels.

For each configuration of input features, we iteratively train the network with different numbers of epochs and with different recurrent cells. The best performing model according to validation accuracy on a randomly chosen validation set is picked for the test phase.

At the test phase, the best performing model obtained with each configuration of features is tested on the test set. The same test set is used against each configuration. Precision and recall of tokens tagged as members of base NPs are calculated.

3.2. Training data


The treebanks used in this thesis provide syntactic annotation in multiple formats, including constituency annotation in the format of Penn-style bracketing and dependency annotation in the format of CoNLL-U. Such parallel annotations are useful for the markup of base NPs in training data.

3.2.1. English

In this thesis, we use two treebanks in English. One is the Georgetown University Multilayer Corpus (Zeldes, 2017). The other is the English Web Treebank (Bies et al., 2012; Silveira et al., 2014). Both treebanks contain eclectic genres that are different in style from newswire text.

Georgetown University Multilayer Corpus (GUM) is the result of pedagogical efforts in linguistics. The corpus contains 126 text files totaling 109,141 tokens, covering such genres as interviews, news articles, travel guides, how-to guides, academic writing, biographies, fiction, and online forum discussions.

The annotation of GUM is done by graduate and undergraduate students at Georgetown University in a classroom setting, in which graduate students make up the majority of annotators.

The texts are tokenized automatically and later manually corrected by students. Part-of-speech tagging is done completely by hand from scratch. Syntactic annotation is first performed automatically, then corrected collaboratively by students before being evaluated by instructors.

The resulting constituency and dependency trees make up two of the eight annotation layers available in GUM, namely Penn-style brackets and the CoNLL-U format; both layers are used in this thesis. These formats are introduced in Sections 2.2.1 and 2.2.2.

EWT The English Web Treebank contains 254,830 word tokens (16,624 sentences) of English text from the web, including blogs, newsgroup discussions, emails, reviews, and answers from question-answer sites. In EWT, pronouns, vocatives, discourse particles, imperatives, sentence fragments, and errors appear more frequently than in newswire text (Silveira et al., 2014).

The constituency annotation of the EWT corpus was released in 2012. The text is tokenized and tagged manually. Constituency annotations of the text are in the Penn-style bracketing scheme.

The dependency trees in Universal Dependencies format are automatically converted from an earlier version in Stanford Dependencies format and manually corrected. In creating the SD version of this treebank, Silveira et al. (2014) first automatically converted the constituency trees and then manually corrected them.

3.2.2. French

The French Treebank (FTB) (Abeillé et al., 2003) contains roughly 664,500 tokens (21,550 sentences) from articles published between 1990 and 1993 in the newspaper Le Monde, covering a variety of domains. The difference in genre between the English treebanks and FTB is reflected in the different average sentence lengths of the English and French treebanks.

The constituency trees in FTB are automatically annotated before being corrected manually. The constituency trees are then automatically converted into dependency trees by Candito et al. (2010). This part of FTB is available upon applying for a license.

The Universal Dependencies branch of FTB is converted from the earlier dependency trees and contains 18,535 sentences and 556,064 tokens. The UD part of FTB that is available to the public does not include the underlying text; UD trees complete with words are made available by the creators of the treebank once the license has been obtained.

3.3. Data preparation

3.3.1. Discovering base NPs

We concatenated the GUM and EWT treebanks to form our English data set, while the FTB treebank made up our French data set.

We then compared the UD layer and the Penn-style bracketing layer to look for sentences with identical tokens, for which syntactic trees in both CoNLL-U and Penn-style bracketing formats are thus available. We found 11,854 such sentences in English and 18,535 in French.

We first automatically recognized base NPs according to the annotations in Penn-style bracketing. This step is performed with the functions get_eng_bnp() and get_fra_bnp(), defined in eng_bnp.py and fra_bnp.py² respectively, for the English and French data sets. We then found the same base NPs in the UD trees and marked the spans of the discovered base NPs as shown in Figure 3.5. In the process of recognizing base NPs, we looked for the innermost tags that indicate noun phrases. In the English data set, we looked for NP (noun phrases) and WHNP (noun phrases starting with a wh-question word). In the French data set, we looked for the NP tag.

Tokens found within the span of the innermost NP or WHNP tags are considered members of a base NP, on the condition that the tokens in question do not introduce a nested structure, be it a subclause or a posterior modifying phrase.

This is ensured by admitting noun phrases that are 'flat' (determined with the function _isFlat()), namely noun phrases in which all subtags are terminal tags. If an English noun phrase is not flat, we go through the subtags and cut it off when a subtag is 'SBAR', 'PP', 'ADVP', 'CC' or 'WHPP' (see line 36 in eng_bnp.py) or when certain punctuation marks are encountered.

If a French noun phrase is not flat, we go through the subtags and cut it off when a subtag is not ‘N’ (a noun) or ‘D’ (a determiner) and introduces a nested substructure (see line 32 in fra_bnp.py and function _isNested()) or when certain punctuation marks are encountered.

² These and other Python scripts are uploaded to a GitHub repository for this thesis.


Figure 3.1.: English sentence in which automatically discovered base NPs are enclosed in red rectangles.

For example, consider the sample sentence 'Basil comes in many different varieties, each of which have a unique flavor and smell.'

As illustrated in Figure 3.1, the innermost NP and WHNP tags (NP (NN Basil)), (NP (JJ many) (JJ different) (NNS varieties)), (NP (DT each)), (WHNP (WDT which)) as well as (NP (DT a) (JJ unique) (NN flavor) (CC and) (NN smell)) are discovered by the script as base NPs. In contrast, the NP tag enclosed in the grey rectangle is not minimal: there are other NP tags nested under it, hence it is not an innermost NP tag, and nested inside it there is an SBAR tag that introduces a subordinate clause.
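One possible way to express the flatness criterion, using NLTK's bracketed-tree reader, is sketched below. This is an approximation of the test described above (every subtag must be a terminal tag), not the thesis's actual _isFlat() implementation.

```python
from nltk import Tree

def is_flat(np_node):
    """True if every child of the NP is a pre-terminal tag dominating a single word."""
    return all(
        isinstance(child, Tree) and len(child) == 1 and isinstance(child[0], str)
        for child in np_node
    )

flat = Tree.fromstring("(NP (JJ many) (JJ different) (NNS varieties))")
print(is_flat(flat))    # True: all subtags are terminal tags

nested = Tree.fromstring("(NP (NP (DT a) (NN car)) (PP (IN on) (NP (DT the) (NN street))))")
print(is_flat(nested))  # False: the NP contains nested NP and PP subtrees
```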

Following previous studies, the headword of a noun phrase and the words left of the head within the constituent are included in the base NP. This means that right-hand arguments are typically removed. While reducing the inner complexity of phrase structures, this does remove certain information that could be relevant, especially for languages that tend to place attributives on the right-hand side of the head.

Figure 3.2.: Right-hand side adjectival phrase not included in base NP.

In the following examples, the innermost NP contains nested modifiers to the right of the headword, and the structures annotated as nested modifiers are removed. Therefore, in Figure 3.3, 'une mauvaise orientation' and 'les production' are recognized as base NPs, whereas in Figure 3.2, the two tokens in 'l'année' are recognized as part of a base NP, but not the adjective 'précédente', since it is annotated as a right-hand side adjectival phrase.


Figure 3.3.: French sentence in which automatically discovered base NPs are enclosed in red rectangles.

Figure 3.4.: How multiword tokens are represented in the CoNLL-U format.

In the Penn-style bracketing format, contracted articles are subsumed in the wordform of the contraction and a null element *T* is marked in their place; 'les production' is represented as '*T* production'. In UD, however, a contraction is first presented as a standalone wordform, with the span of words that it subsumes marked in the ID column, followed by the syntactic words that it consists of.

In such cases, we first find the concrete word in UD, then trace back through the preceding words and recover the word substituted with *T*. In this case, it is 'les', the wordform found on line 6, as shown in Figure 3.4.

3.3.2. Representation of base NPs

Figure 3.5.: An example of the extended CoNLL-U markup.

To represent NP-chunks, we extend the CoNLL-U format by adding a column called ‘BNP’ to represent the boundaries of discovered base NPs. Specifically, we mark ‘B’, ‘I’ and ‘O’ for each token, as shown in Figure 3.5.


Tokens that begin a base NP are marked with 'B' and non-initial tokens within a base NP with 'I'; other tokens are marked with 'O' because they are outside of any base NP.

3.3.3. Marking up base NPs in UD files

For the purpose of pre-processing data in the CoNLL-U format, we defined a class named conllu.Tree.

The conllu.Tree class represents a UD tree. It can be initialized with a string that contains a UD tree in the standard CoNLL-U format or in the extended CoNLL-U format with the boundaries of base NPs marked. In the former case, base NPs discovered earlier can be loaded with the load_bnp() method. When the UD tree is in place and the base NPs are marked, we can select and output feature configurations to TSV files using the output_nnfeats() method. The output TSV files are later used in training and testing.

Internally, each line in the UD tree with a non-range index is loaded to initialize a token using the subclass Token. Each column in the CoNLL-U format is held in a corresponding property.

Using the discovered base NPs, we load the matching UD tree and initialize a conllu.Tree class with it. By feeding a list of all base NPs to the class using the load_bnp() method, tokens that are found in the base NPs are marked in the UD tree.
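Based on the interface described above, a typical call sequence might look like the sketch below. The file names, the shape of the base NP list and the exact method signatures are assumptions for illustration and may differ from the actual scripts.

```python
# Hedged usage sketch of the conllu.Tree workflow described above.
import conllu  # the thesis's own module defining conllu.Tree, not the PyPI package of that name

with open("sentence.conllu", encoding="utf-8") as f:
    tree = conllu.Tree(f.read())         # initialize from a standard CoNLL-U string

base_nps = [["a", "unique", "flavor"]]   # base NPs found earlier in the constituency layer
tree.load_bnp(base_nps)                  # mark the B/I/O tags on the matching tokens
tree.output_nnfeats("features.tsv")      # write a feature configuration to a TSV file
```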

(a)‘number’ found in English words

(b)‘prontype’ in English words

Figure 3.6.

At this step, not all base NPs discovered earlier are successfully located in the UD trees. We used the number of tokens tagged 'B' and 'I' in a UD tree to tell whether all base NPs were correctly marked: the number of 'B' tags should equal the number of base NPs discovered, and the sum of 'B' and 'I' tags should equal the total number of discovered tokens in base NPs. In cases where these numbers do not match, an error message is printed. There are more than 3,000 sentences in which the markup of base NPs is severely incomplete. The most common cause of this discrepancy is differing representations of numbers, clitics and punctuation marks found in the middle of a base NP.

The outcome of the markup of base NPs is shown in Figure 3.5.


Figure 3.7.: Percentage of PoS found in base NPs.

3.4. Features

We feed the input feature set (X) and gold standard labels (Y) into the neural network, as defined in Section 3.5, so that the neural network can arrive at the best possible weights that can be used to make predictions.

From the information available in the UD annotation, we can adopt various configurations of the following items as the input feature set. ‘B’, ‘I’ and ‘O’ labels on the BNP column serve as our gold standard labels.

PoS tags We adopt all 17 UPOS tags in the UD inventory. We do not consider language-specific PoS tags.

Deprels We adopt all 37 universal syntactic relations defined in UD. We choose to ignore subtypes for several reasons. First, accounting for all possible deprels and subtypes would likely bring about an explosion in the number of combinations, which would lead to a data sparsity problem. Second, by adopting subtypes, we would likely bring in language-specific characteristics that are not transferable to other languages, thus making our system less universal. Third, it is difficult to capture the relationship between a deprel and its subtypes, and it is likely that they would be treated as discrete categories.

Morphological features By looking into the marked base NPs, we discover that the most frequent parts-of-speech found in base NPs include NOUN, DET, PROPN, NUM, ADJ, PUNCT and PRON. The morphological features associated with these parts-of-speech are: typo, definite, gender, case, person, poss, abbr, degree, prontype, number, numtype, and polarity.


Figure 3.6a shows that the number feature is found most commonly in nouns, pronouns, proper nouns and determiners. This is in line with the agreement in number between nouns and demonstrative pronouns.

(a)‘gender’ in French words

(b)‘number’ in French words

(c)‘numtype’ in French words

Figure 3.8.

Figure 3.6b shows that prontype is found in determiners and pronouns. This can be attributed to the fact that certain pronouns can be tagged either as a determiner or as a pronoun depending on the role they play.

Figures 3.8a and 3.8b show the range of parts-of-speech involved in agreement in gender and number.

Figure 3.8c shows that the numtype feature is found in numerals, pronouns and adjectives. This is in line with the fact that numerals are tagged as different parts-of-speech, as they can play different roles.

Morphological features such as abbr and typo are excluded because they rarely appear, are relevant only to individual tokens, and have no syntactic relevance to other tokens. Degree and polarity are also excluded because, although they add further information to modifiers, they are not likely to interact with other words.

All possible values for these morphological features as defined by the UD project⁴ are shown in Table 3.1.

To account for the possibility that new values are defined in UD for these features in the future, an additional value OOV, standing for out-of-vocabulary, is added to each feature. When feeding data into the neural network, should any value not listed in Table 3.1 be found, it is replaced by 'OOV'. In cases where a morphological feature is not used, an underscore is used.

Since the categories for PoS and deprels are predefined, it is safe to assume that no out-of-vocabulary value will appear unexpectedly. Therefore no ‘OOV’ category is added for PoS tags and deprels.
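The replacement rule can be expressed as a small lookup helper like the one below; the value list is an excerpt of Table 3.1 and the function is an illustration of the rule, not the actual code in mappings.py.

```python
# Sketch of the OOV replacement rule for morphological feature values.
GENDER_VALUES = {"Neut", "Unsp", "Com", "Fem", "Masc"}

def normalize_value(value, known=GENDER_VALUES):
    """Return the value if it is known, '_' if the feature is unused, otherwise 'OOV'."""
    if value is None or value == "":
        return "_"
    return value if value in known else "OOV"

print(normalize_value("Fem"))    # Fem
print(normalize_value("Other"))  # OOV  (a value not listed in Table 3.1)
print(normalize_value(""))       # _    (feature not used)
```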

3.4.1. Configurations of features

We arrange the above input features into 18 configurations and these fall under four lines of logic, as shown in Table 3.2.

⁴ A full list of morphological features can be accessed at: https://universaldependencies.org/


| Feat | Values |
|---|---|
| Definite | Ind, 2, Def, Com, Cons |
| Gender | Neut, Unsp, Com, Fem, Masc |
| Number | Coll, Count, Sing, Dual, Pauc, Ptan, Unsp, Plur, Adnum, Assoc |
| PronType | Art, Coll, Ind, Add, Int, Dem, Neg, Ref, Contrastive, Exc, Qnt, Tot, Emp, Refl, Rel, Prs, Ord, Rcp |
| Person | 4, 0, 2, Auto, 3, 1 |
| Poss | Yes |
| NumType | Coll, Sets, MultDist, Dist, Frac, Range, OrdMult, OrdinalSets, Mult, Card, Ord |
| Case | Ade, Add, Gen, Sub, Obl, Comp, Ela, Abl, Ine, Lat, Car, Apr, Tem, All, Com, Loc, Par, Per, Dat, Erg, Mal, Tra, NomAcc, Equ, Cau, Nom, Voc, Con, Abs, Temp, Ess, Prl, Advb, Ter, Ben, Ill, Egr, Dis, Ins, Del, Cns, Acc, Sup, Abe |

Table 3.1.: Morphological features deemed relevant to base NPs.

The baseline configuration is pos, in which only the PoS tags are taken as an input feature. In the delexicalized setting that we have adopted, this is the minimum amount of information that can be used to predict the boundaries of base NPs. The design of an RNN allows memory from preceding and following tokens to be passed on in the RNN unit. By deriving relationships between adjacent tokens using such memory, we suppose the neural network can make predictions from this minimal amount of information.

Deprels with adjacent tokens Extending this baseline configuration, we add two boolean features in pos_isdep to indicate whether a token has any dependency relation with the previous and following tokens. We add two categorical features in pos_deprel to show the deprel between the current and previous tokens and between the current and following tokens. In cases where there is no deprel, an underscore is used.

PoS tags of syntactic heads and dependents We make configurations that account for the PoS tags of the syntactic heads and dependents of the current token. The configurations pos_parent, pos_grand, and pos_parent_child respectively indicate the PoS tags of the current token and its syntactic head, or 'parent'; the PoS tags of the current token, its syntactic head and the head of its head, or 'grandparent', hence 'grand'; and the PoS tags of the current token, its syntactic head and its leftmost and rightmost dependents.

Deprels with syntactic heads and dependents On the basis of the previous three configurations, we add deprel information between the tokens to form pos_dep_parent, pos_dep_grand and pos_dep_parent_child.

Morphological features On top of the nine configurations above, we add the morphological features of the tokens involved, yielding nine further configurations marked with the suffix _morph (e.g. pos_morph, pos_deprel_morph, pos_dep_parent_child_morph), for a total of 18 configurations.

3.5. Network structure

Figure 3.9.: Neural network structure for base NP detection.

The structure of the neural network we use is illustrated in Figure 3.9. The network is implemented using Keras (Chollet et al., 2015).

We transformed the sentences in the training set into the input format according to the configurations of features, in which each input feature is represented by a column of categorical values.

In order to feed input data in different configurations to the network, we use as many input units as there are input features. These input units are coupled with an equal number of categorical embedding units. The input data are first transformed into one-hot representations at the input layer. Then the one-hot representations of categories are transformed into categorical embeddings by the categorical embedding units.

For each embedding unit, we keep the output dimension at roughly half the number of input categories. For example, there are 18 possible input values for the feature UPOS, so we set the output dimension to 8.

At the concat layer, the embeddings for all input features are concatenated and transformed before entering the recurrent layers.

We stack two bidirectional RNN layers and experiment with LSTM and GRU units. We set the number of recurrent units at 80, equal to the maximum length of sequences. Before and after the stacked bidirectional RNN layers, we use a dense layer with a dropout rate of 0.2. The activation functions used on the dense layers are respectively tanh and relu. This choice is made after prior experiments with a subset of the training data.

Figure 3.10.: Distribution of sentence lengths.

On the output layer, we use softmax to make predictions over four possible categories: B, I, O and _, as defined in Section 3.1.

The parameter maxlen determines the length of sequences, which in turn determines the number of recurrent units at recurrent layers. The longest sentence in training data contains 153 tokens, but 99.40% of the sentences in the training set are shorter than or equal to 80 words. Informed by the distribution of sentence lengths in the training set, we set maxlen to 80, expecting to strike a balance between reasonable representation of sentence lengths and the total number of variables to train.
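The architecture just described could be sketched with the Keras functional API as below. The particular feature inputs, vocabulary sizes, dense layer width and optimizer are illustrative assumptions, and the sketch feeds integer category indices to Embedding layers rather than explicit one-hot vectors; it is meant to show the overall layout, not to reproduce the exact configuration in network_training.py.

```python
from tensorflow import keras
from tensorflow.keras import layers

MAXLEN = 80     # maximum sequence length, as discussed above
N_CLASSES = 4   # B, I, O and the padding label "_"

# Illustrative feature inventory: (name, number of categories). The real set
# depends on the configuration of features chosen for the experiment.
features = [("upos", 18), ("deprel", 38)]

inputs, embedded = [], []
for name, n_categories in features:
    inp = keras.Input(shape=(MAXLEN,), dtype="int32", name=name)
    # Categorical embedding with output dimension roughly half the number of categories.
    emb = layers.Embedding(input_dim=n_categories, output_dim=max(2, n_categories // 2))(inp)
    inputs.append(inp)
    embedded.append(emb)

x = layers.Concatenate()(embedded) if len(embedded) > 1 else embedded[0]
x = layers.Dense(64, activation="tanh")(x)
x = layers.Dropout(0.2)(x)
x = layers.Bidirectional(layers.LSTM(80, return_sequences=True))(x)   # or layers.GRU(80, ...)
x = layers.Bidirectional(layers.LSTM(80, return_sequences=True))(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(N_CLASSES, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```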


4. Experiments

4.1. Training and validation

4.1.1. Division of data

Considering the limited amount of language data available, we choose to use a form of 10-fold cross-validation for the training process, and we divide the data accordingly. Specifically, we randomly divide the English data set into eleven sections containing equal numbers of sentences. We do the same for the French data set. We then randomly pick one section in each language as the test set for that language.

The ten remaining sections are used in our training-validation process. During this process, we carry out a form of 10-fold validation to arrive at the best model for each configuration and each data set.

4.1.2. Best models

In order to derive the best model for each configuration of features, we try out two kinds of RNN units and different numbers of training epochs in iteration. At each iteration, we randomly choose one section from the training-validation data as the validation set.

When modeling sentences as sequences, sentences containing fewer tokens than maxlen are padded with zeros, while those containing more tokens are truncated to maxlen. The padding elements consist of zeros as input features and labels, and are therefore more likely to be correctly labelled. To avoid artificially boosting accuracy, we ignore padding elements by defining an ignore_accuracy function and applying it when compiling the model.
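A common way to implement such a masked accuracy in Keras is sketched below. This is a generic reconstruction rather than the thesis's actual ignore_accuracy function, and it assumes that the padding label is encoded as class index 0.

```python
from tensorflow.keras import backend as K

PAD_CLASS = 0  # assumption: the padding label "_" is encoded as class index 0

def ignore_accuracy(y_true, y_pred):
    """Token accuracy computed only over non-padding positions."""
    y_true = K.cast(K.flatten(y_true), "int64")
    y_hat = K.flatten(K.cast(K.argmax(y_pred, axis=-1), "int64"))
    mask = K.cast(K.not_equal(y_true, PAD_CLASS), "float32")
    correct = K.cast(K.equal(y_true, y_hat), "float32") * mask
    return K.sum(correct) / K.maximum(K.sum(mask), 1.0)

# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
#               metrics=[ignore_accuracy])
```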

We run the training and validation process on three language settings: English monolingual, French monolingual and bilingual. For each configuration of input features and each language setting, we derive six models, using GRU or LSTM units and trained for 30, 40 or 50 epochs. From these models, we choose the one that delivers the highest validation accuracy as the best model. The accuracies and validation accuracies in the training and validation process are tabulated in Figures A.1, A.2 and A.3. All numbers in the figures are decimals; decimal points are omitted for convenience.

4.2. Experiments on the test set

We then use the best models derived from each configuration and language setting to predict the boundaries of base NPs in the test set. In total, 54 models are tested. The results are tabulated in Tables A.4, A.5, and A.6.


5. Results and discussion

The experiment results shown in Tables A.4, A.5, and A.6 are given as precision (P), recall (R) and F1 scores. In this thesis, these values are calculated by (5.1) and (5.2). If a token is labelled B in the test set and the model predicts it as I, this prediction is considered incorrect.

$$\text{precision} = \frac{\text{number of tokens correctly tagged B or I}}{\text{number of tokens tagged B or I by the model}} \qquad (5.1)$$

$$\text{recall} = \frac{\text{number of tokens correctly tagged B or I}}{\text{number of all tokens that are B or I}} \qquad (5.2)$$

Subsequently, we calculate F1 using (5.3).

$$F_1 = \frac{2 \times P \times R}{P + R} \qquad (5.3)$$
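As a worked example of (5.1)–(5.3), with made-up counts rather than actual results:

```python
# Illustrative counts only, not results from the experiments.
correct_bi   = 900   # tokens correctly tagged B or I
predicted_bi = 960   # tokens the model tagged B or I
gold_bi      = 950   # tokens that are B or I in the gold standard

precision = correct_bi / predicted_bi                 # (5.1)
recall    = correct_bi / gold_bi                      # (5.2)
f1 = 2 * precision * recall / (precision + recall)    # (5.3)
print(f"P={precision:.2%} R={recall:.2%} F1={f1:.2%}")
```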

5.1. Main results

The best performance on the English (P 93.02%, R 92.37%, F 92.70%) and bilingual (P 94.27%, R 94.30%, F 94.29%) test sets is delivered by models for the configuration pos_dep_parent_child_morph trained on bilingual data. The best performance on the French test set (P 94.76%, R 94.99%, F 94.87%) is delivered by the model for the same configuration, pos_dep_parent_child_morph, trained on French data. These scores, together with the highest precision, recall and F1 scores achieved by other models trained and tested on different data sets, are summarized in Table 5.1. The configurations of features used in the best performing models are noted in the table.

Table 5.1. Best scores for each combination of training and test set, with the feature configuration of the best performing model given in parentheses.

Trained on English:
  Tested on English: P 92.86%, R 92.43%, F 92.65% (pos_dep_grand)
  Tested on French: P 72.42%, R 86.34%, F 78.77% (pos_dep)
  Tested on both: P 76.23%, R 87.14%, F 81.32% (pos_dep)

Trained on French:
  Tested on English: P 84.43%, R 72.51%, F 78.02% (pos_deprel_morph)
  Tested on French: P 94.76%, R 94.99%, F 94.87% (pos_dep_parent_child_morph)
  Tested on both: P 92.21%, R 89.40%, F 90.78% (pos_deprel_morph)

Trained on both:
  Tested on English: P 93.02%, R 92.37%, F 92.70%
  Tested on French: P 94.64%, R 94.87%, F 94.76%
  Tested on both: P 94.27%, R 94.30%, F 94.29%
  (pos_dep_parent_child_morph in all three cases)


The configuration pos_dep_parent_child_morph reflects the dependency relationships between the current token and its syntactic head and between the current token and two of its syntactic children (the leftmost and the rightmost). It also reflects the PoS tags and morphological features of these tokens.
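As an illustration, assembling the features of this configuration for a single token could look roughly like the sketch below. The token object with upos, feats, deprel, head and children attributes is hypothetical and does not reflect the actual conllu.Token interface.

```python
# Illustrative sketch: pos_dep_parent_child_morph features for one token t.
def extract_features(token):
    def attr(node, name):
        # Placeholder value for a missing head or child.
        return getattr(node, name) if node is not None else '_'

    head = token.head                        # syntactic head h
    children = token.children
    lc = children[0] if children else None   # leftmost syntactic child
    rc = children[-1] if children else None  # rightmost syntactic child
    return [
        token.upos, token.feats,             # PoS and morphology of t
        attr(head, 'upos'), attr(head, 'feats'),
        attr(lc, 'upos'), attr(lc, 'feats'),
        attr(rc, 'upos'), attr(rc, 'feats'),
        token.deprel,                        # deprel between t and h
        attr(lc, 'deprel'),                  # deprel between t and lc
        attr(rc, 'deprel'),                  # deprel between t and rc
    ]
```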

Recall from Table 2.1 that the F1 scores reported by previous NP-chunking efforts range between 87.81% and 94.84%. However, our scores are not directly comparable to those of previous studies, due to differences in data and nuances in definitions and metrics.

5.2. Discussion

From the F1 scores for the bilingual models shown in Table A.6, we observe that adding morphological information brings an overall improvement of 0.95 percentage points to the average F1 score.

This average improvement is more pronounced in French (0.99 percentage points) than in English (0.81 percentage points), indicating that morphological features are more informative for identifying base NPs in French than in English.

It is worth noting that among the bilingual models, the best performing ones are those trained with morphological features, showing the significance of adopting morphological features.

It is not surprising that models trained on monolingual data perform worse on the other language, and that monolingual models perform worse on the bilingual test set. That the best monolingual French model performs better (90.78%) on the bilingual test set (shown in column ‘f1_bi’ in Table A.5) than the best monolingual English model (81.33%, shown in column ‘f1_bi’ in Table A.4) might be due to the fact that French is better represented in the training data: there are fewer French sentences than English ones, but more French tokens, since French sentences are longer on average.

When tested on the other language or on bilingual data, monolingual models tend to perform worse with additional input features than with the PoS-only baseline configuration.

The English model for the configuration pos_dep_parent_child_morph scored 73.16% on the French and 77.01% on the bilingual test set. This configuration captures the PoS tags and morphological features of the current token (t), its syntactic head (h) and its leftmost and rightmost syntactic children (lc, rc), as well as the deprels between t and h, between t and lc, and between t and rc. In comparison, the English model with the baseline configuration scored 77.70% and 80.22%, respectively.

Similar phenomena are also observed for the French models, the most noticeable of which concerns pos_morph. This configuration reflects the PoS tag and morphological features of the current token only. It scored a mere 48.68% on the English test set and 82.53% on the bilingual test set, compared to 74.23% and 85.16% for the baseline configuration.


On the other hand, when tested on the bilingual test set, the models trained on bilingual data perform better, with an average F1 score of 92.51% across configurations¹, than the English and French models do, which respectively achieve average F1 scores of 79.06%² and 88.46%³ across configurations.

On a more conceptual level, the conventional practice of admitting only tokens on the left-hand side does not do justice to noun phrases in which adjectives follow the headword, which is more common in French than in English. What English and French do have in common is that clausal modifiers appear on the right-hand side of the headword and are therefore removed in this and previous studies. Dealing with more complex structure on the left could, however, pose a challenge in languages where clausal modifiers tend to precede the headword. Experiments on more dissimilar languages may shed light on these questions.

¹Average of column f1_bi in Table A.6.

²Average of column f1_bi in Table A.4.

³Average of column f1_bi in Table A.5.


6. Conclusion and future work

In this thesis, we tried to identify base NPs in UD treebanks in English and French using bidirectional RNNs. We carried out experiments to find out which features available in UD annotation were the most predictive for identifying base NPs.

Base NPs were defined to include the headword of an NP and its left-hand side modifiers. We treated this task as a sequence-labeling task, following a long tradition in previous NP-chunking studies.

We obtained training and test data by 1) finding base NPs following patterns in the constituency layer and 2) projecting the found base NPs onto an added column in the UD layer. In the resulting training and test data, each token in a sentence was labeled with B if it was at the beginning of a base NP, I if it was inside a base NP, or O if it was outside any base NP. In cases where padding was added, the filler tokens were labeled with _. Our models were evaluated according to the precision, recall and F1 scores of the predictions they made. Specifically, predicting B as B and predicting I as I were considered correct.

We devised 18 configurations of linguistic features, each containing a different set of features that may include the PoS tags and morphological features of the current token, its syntactic head, the head of its head, and its dependents.

For each configuration, we obtained a group of six models trained on English data, a group of six on French data, and a group of six on English and French data. For each configuration and each group, the model with the highest validation accuracy was chosen as the best model for that configuration and training data set. We then experimented with these 54 models on the English, French and bilingual test sets.

We demonstrated that, taking advantage of non-lexical features available in Universal Dependencies annotations, our best performing RNN models deliver strong results on identifying base NPs in English (P 93.02%, R 92.37%, F 92.70%, shown in Table A.6), in French (P 94.76%, R 94.99%, F 94.87%, shown in Table A.5) and across the two languages (P 94.27%, R 94.30%, F 94.29%, shown in Table A.6). Thanks to the universal nature of UD annotations, these non-lexical features allowed the models to generalize across languages.

We shall admit that the scope of this study is narrow, limited to English and French. Successful generalization can be, at least partly, attributed to the structural similarities in noun phrases in these languages.


In future work, approaches that represent tokens and their dependency relations from a graph-oriented vantage point may resolve the shortcomings related to the removal of right-hand side dependencies and the insertion of left-hand side clausal modifiers.

A. Tabulation of experiment results

B. Brief documentation of code

B.1. conllu.py

This file defines the conllu.Tree class. Internally, it maintains a dictionary of tokens, represented by the internal class named Token.

The Token class maintains properties for every column in the CoNLL-U format. In addition, it also maintains a property called bnp to reflect the ‘BIO’ markup. The conllu.Tree class is initialized with a CoNLL-U tree represented in a string. With the load_bnp method, base NPs represented with lists of tokens are loaded. Then Token objects that are at the beginning, inside or outside of base NPs are respectively tagged.

With the output_ext_tree() method, the dependency tree contained in this conllu.Tree object is output in the standard CoNLL-U format or in the extended format with ‘BIO’ labels.

The output_nnfeats() method outputs selected features to TSV files, which are then used by the RNN.

B.2. eng_bnp.py and fra_bnp.py

The functions get_eng_bnp() and get_fra_bnp() are respectively defined in these scripts. They load constituency trees in Penn-style bracketing and find base NPs in them. The output is a list containing lists of tokens, which can be passed to the load_bnp() method introduced in the above section.
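A hypothetical end-to-end usage of these two modules is sketched below; the file names and exact argument lists are assumptions based on the descriptions above, not the project's actual calling conventions.

```python
# Illustrative sketch: find base NPs in a Penn-style constituency tree and
# project them onto the corresponding CoNLL-U dependency tree.
from conllu import Tree
from eng_bnp import get_eng_bnp

with open('sentence.ptb') as f:        # Penn-style bracketed tree (assumed file)
    base_nps = get_eng_bnp(f.read())   # list of base NPs, each a list of tokens

with open('sentence.conllu') as f:     # the same sentence in CoNLL-U format
    tree = Tree(f.read())

tree.load_bnp(base_nps)                # tag tokens as B, I or O
print(tree.output_ext_tree())          # CoNLL-U extended with the 'BIO' column
tree.output_nnfeats()                  # write selected features to TSV for the RNN
```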

B.3. mappings.py

Mappings between keys and values are defined in this script. These mappings are used in pre- and post-processing of data for the RNN.

B.4. helpers.py

This script helps the RNN run by loading data from files, padding and truncating sequences and transforming categorical data into arrays. The ignore_accuracy() function is also defined in this script; it is used by Keras at compile time to compute real accuracies by excluding correctly predicted padding elements.

B.5. network_training.py


For each configuration of input features, various models are defined and trained in the for-loop defined at the bottom.
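The models defined there are bidirectional RNN taggers; a minimal sketch of such a definition in Keras is given below. The layer size, input dimensionality and optimizer are illustrative assumptions, not the values used in network_training.py.

```python
# Illustrative sketch: a bidirectional LSTM tagger over padded sequences.
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, LSTM, TimeDistributed
from helpers import ignore_accuracy

MAXLEN, N_FEATURES, N_LABELS = 80, 11, 4  # labels: B, I, O and padding (assumed)

model = Sequential()
model.add(Bidirectional(LSTM(64, return_sequences=True),
                        input_shape=(MAXLEN, N_FEATURES)))
model.add(TimeDistributed(Dense(N_LABELS, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=[ignore_accuracy])
```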

B.6. predict.py

This script is used at testing time. It loads the test set and iteratively makes predictions using the trained models. Predictions are saved in JSON files for scoring.
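A sketch of this prediction step is given below; the model and output file names are hypothetical, and X_test stands in for the prepared test data.

```python
# Illustrative sketch: load a trained model, predict BIO labels for the test
# set and save the label indices to a JSON file for scoring.
import json
from keras.models import load_model
from helpers import ignore_accuracy

model = load_model('models/pos_dep_parent_child_morph_bi.h5',
                   custom_objects={'ignore_accuracy': ignore_accuracy})
pred = model.predict(X_test)              # shape: (n_sentences, maxlen, n_labels)
labels = pred.argmax(axis=-1).tolist()    # back to integer label indices

with open('predictions/pos_dep_parent_child_morph_bi.json', 'w') as f:
    json.dump(labels, f)
```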

B.7. scoring.py

This script computes the precision, recall and F1 scores of the saved predictions, as defined in Chapter 5.

Bibliography

Abeillé, Anne, Lionel Clément, and François Toussenel (2003). “Building a Treebank for French”. In: Treebanks: Building and Using Parsed Corpora. Ed. by Anne Abeillé. Dordrecht: Springer Netherlands, pp. 165–187.

Abney, Steven P., Gennaro Chierchia, Pauline Jacobson, and Francis J. Pelletier (1991). “Parsing By Chunks”. In: Principle-Based Parsing. Ed. by Robert C. Berwick, Steven P. Abney, and Carol Tenny. Vol. 44. Dordrecht: Springer Netherlands, pp. 257–278.

Argamon, Shlomo, Ido Dagan, and Yuval Krymolowski (1998). “A Memory-Based Approach to Learning Shallow Natural Language Patterns”. In: Proceedings of 36th Annual Meeting of the Association for Computational Linguistics (ACL). Montreal, Canada, pp. 67–73.

Attardi, Giuseppe and Felice Dell’Orletta (2008). “Chunking and Dependency Parsing”. In: Proceedings of LREC 2008 Workshop on Partial Parsing. Marrakech, p. 6.

Bies, Ann, Justin Mott, Colin Warner, and Seth Kulick (2012). “English Web Treebank”. Linguistic Data Consortium, Philadelphia, PA.

Brants, Thorsten (1999). “Cascaded Markov Models”. In: Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics. Bergen, Norway: Association for Computational Linguistics, p. 118.

Bresnan, J., A. Asudeh, Ida Toivonen, and S. Wechsler (2015). Lexical Functional Syntax: Second Edition. June 2015, pp. 1–499.

Brill, Eric (1993). “Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach”. In: 31st Annual Meeting of the Association for Computational Linguistics. Columbus, Ohio, USA: Association for Computational Linguistics, June 1993, pp. 259–265.

Brill, Eric (1995). “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging”. Computational Linguistics 21.4, pp. 543–565.

Buchholz, Sabine and Erwin Marsi (2006). “CoNLL-X Shared Task on Multilingual Dependency Parsing”. In: Proceedings of the Tenth Conference on Computational Natural Language Learning - CoNLL-X ’06. New York City, New York: Association for Computational Linguistics, p. 149.

Buchholz, Sabine, Jorn Veenstra, and Walter Daelemans (1999). “Cascaded Grammatical Relation Assignment”. In: Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, p. 8.


Evaluation (LREC’10). Valletta, Malta: European Language Resources Association (ELRA), May 2010, p. 9.

Cardie, Claire and David Pierce (1998). “Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification”. In: Proceedings of COLING/ACL. Montreal, Canada, pp. 218–224.

Cho, Kyunghyun, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio (2014). “Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation”. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734.

Chollet, François et al. (2015). “Keras”. URL: https://keras.io.

Chomsky, N. (1956). “Three Models for the Description of Language”. IRE Transactions on Information Theory 2.3, pp. 113–124.

Chomsky, Noam (1979). “The Logical Structure of Linguistic Theory”. Synthese 40.2, pp. 317–352.

Dalrymple, M. (2001). Lexical Functional Grammar. Syntax and Semantics. Academic Press.

De Marneffe, Marie-Catherine, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning (2014). “Universal Stanford Dependencies: A Cross-Linguistic Typology”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), May 2014, pp. 4585–4592.

Earley, Jay (1970). “An Efficient Context-Free Parsing Algorithm”. Communications of the ACM 13.2 (Feb. 1970), pp. 94–102.

Elman, Jeffrey L. (1990). “Finding Structure in Time”. Cognitive Science 14.2, pp. 179–211.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”. Neural Computation 9.8, pp. 1735–1780.

Joshi, Aravind K. (2001). “The XTAG Project at Penn”. In: Proceedings of the Seventh International Workshop on Parsing Technologies. Beijing, China, Oct. 2001, pp. 16–27.

Kaplan, Ronald and Joan Bresnan (1982). Lexical-Functional Grammar: A Formal System for Grammatical Representation. Jan. 1982, pp. 173–281.

Kasami, T. (1965). An Efficient Recognition and Syntax Analysis Algorithm for Context-Free Languages. AFCRL-65-758. Bedford, MA: Air Force Cambridge Research Laboratory.

Koeling, Rob (2000). “Chunking with Maximum Entropy Models”. In: Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop.

Kudo, Taku and Yuji Matsumoto (2001). “Chunking with Support Vector Machines”. In: Second Meeting of the North American Chapter of the Association for Computational Linguistics.
