
Answer Triggering Mechanisms in Neural Reading Comprehension-based Question Answering Systems

Max Trembczyk

Uppsala University
Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
August 14, 2019


Abstract

We implement a state-of-the-art question answering system based on Convolutional Neural Networks and Attention Mechanisms and include four different variants of answer triggering that have been discussed in recent literature. The mechanisms are placed at different points in the architecture and operate on different information.

We train, develop and test our models on the popular SQuAD data set for Question Answering based on Reading Comprehension, which in its latest version has been extended with additional non-answerable questions that the systems have to detect. We test the models against baselines and against each other and provide an extensive evaluation, both on the general question answering task and on the explicit performance of the answer triggering mechanisms.

We show that all answer triggering mechanisms clearly improve the model over the baseline without answer triggering, by between 19.6% and 31.3% depending on the model and the metric. The best general question answering performance is shown by a model that we call Candidate:No, which treats the possibility that no answer can be found in the document as just another answer candidate, instead of adding an extra decision step somewhere in the model's architecture as the other three mechanisms do.


Contents

Preface
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Technical Backgrounds
  2.1 Recurrent Neural Networks (RNNs)
    2.1.1 Long Short-Term Memory (LSTM) networks
    2.1.2 Gated Recurrent Unit (GRU)
  2.2 Convolutional Neural Networks (CNNs)
  2.3 Encoder-Decoder Networks
  2.4 Attention
    2.4.1 Transformer Models and Multi-Head Attention
    2.4.2 Pointer Networks
3 Question Answering
  3.1 Earlier Approaches
  3.2 Reading Comprehension
  3.3 Question Answering with Neural Reading Comprehension
    3.3.1 QANet
  3.4 Answer Verification
4 Data
  4.1 Overview
  4.2 SQuAD 2.0
5 Implementation
  5.1 Question Answering System
  5.2 Inclusion of the AV mechanisms
    5.2.1 VerifyFirst
    5.2.2 Candidate:No
    5.2.3 VerifyLater
    5.2.4 VerifyLater (Confidence Based)
  5.3 Hyperparameters
6 Evaluation Procedure
  6.1 Exact Match (EM)
  6.2 F1
  6.3 Answer Verification: Metrics
  6.4 Answer Verification: Hypotheses
  6.5 Baselines
7 Results
  7.1 Question Answering
    7.1.1 Performance on Answerable Questions only
  7.2 Answer Triggering
  7.3 Hypotheses
    7.3.1 Average Overlap
    7.3.2 Average Length of the Questions
    7.3.3 Average Length of the Answer
    7.3.4 Question words
8 Discussion of the Results


Preface

I would like to thank everyone who accompanied and supported me during the work on this thesis and during my master's studies. I thank the teachers and tutors at the institute, especially my supervisor Mats Dahllöf for his suggestions and helpful discussions. I also thank my opponent Xiao Cheng for her constructive criticism and suggestions.


1 Introduction

Systems that answer arbitrary questions posed in natural language are becoming increasingly common in everyday applications, as voice assistants grow in popularity and search engines keep improving.

Question Answering (QA) systems work with questions posed in natural language, in contrast to other information retrieval (IR) systems that use a structured query language to access a knowledge source. This is more convenient for the human user but a much harder task for machines, as the question first needs to be transformed into a machine-readable representation before an answer can be retrieved from the system. Moreover, natural language semantics can be ambiguous and context-dependent, which makes it hard to interpret questions reliably as they become more complex.

Reading comprehension (RC) is a Natural Language Understanding (NLU) task in which information stored in plain text documents is accessed. It requires the ability to understand the ideas and relationships in the text. Reading comprehension is how many recent neural QA systems extract the answer from a document in natural language, in contrast to systems that rely on more easily accessible knowledge bases such as ontologies and knowledge graphs. In recent systems, RC is commonly done with an encoder-decoder neural network architecture that creates embeddings for the question and for spans in the document and ranks the spans by their probability of being the correct answer.

While many systems assume that a correct answer is always present in the data resources, this may not be the case in real-world applications. The task of answer triggering (AT) or answer verification (AV) addresses this problem by determining whether a correct answer exists at all, and lets the system output an answer only if one does.

There are several ways in which AV can be done. The existence of an answer can be determined as a first step, by looking only at the question and the document in which the answer is assumed to be; as a final step, by checking the answer candidates produced by the system for confidence and plausibility; or by adding a no-answer choice that competes directly with the other answer candidates.

1.1 Purpose

In this thesis we will explore different AV architectures. We will implement a state-of-the-art RC-based QA system and include four different varieties of AV mechanisms in order to compare their performance in a controlled setup.

The main objectives of our work are:


• Quantifying the contribution of the four AV mechanisms to the overall performance of a state-of-the-art QA system.

• Assessing the utility of the four AV mechanisms for different needs, by for instance dividing them into more conservative or greedier approaches, or approaches particularly suited for a certain kind of question.

We will also test a number of hypotheses about the performance of our models in order to see where their major difficulties lie:

• A large overlap between the words in the question and in the corresponding context document where the answer is expected increases the risk that an unanswerable question is mistakenly given an answer.

• Longer questions are more complex and both harder to answer and more likely to be recognized as unanswerable.

• Questions with longer answers are more complex and both harder to answer and harder to be recognized as unanswerable.

• There are question words that are harder to answer and harder to be recognized as unanswerable. Why and How could be hard question words while questions with Who, How many, Where or When are relatively easy to answer.

The conclusions from these research questions will be used for discussions and suggestions for further improvements of today’s neural QA systems, particularly for the implementation of AV mechanisms.

1.2 Outline

The structure of this thesis is as follows:

• We will start in chapter 2 with an introduction of the terminology and technical backgrounds of today’s QA systems.

• Chapter 3 will summarize the history and related research in the fields of Question Answering and Reading Comprehension.

• Common RC and QA data sets and particularly the SQuAD 2.0 data set that we will use for our experiments are described in chapter 4.

• In chapter 5 we describe our implementation in detail, including the models, the training and the choice of hyperparameters.

• In chapter 6 we introduce the evaluation procedure, including the metrics used for our evaluations and hypotheses for a qualitative evaluation.

• The results of our evaluations will be given in chapter 7.

• In chapter 8, these results will be discussed with respect to our expectations, to the models’ theoretical properties and to other work.


2 Technical Backgrounds

All modern competitive QA systems rely on Machine Learning, particularly on Neural Networks (NN). In this chapter, we give an overview of the NN approaches and features commonly used for QA and introduce the technical terminology used in the rest of the thesis.

A highly important type of neural network for QA, as for many Natural Language Processing (NLP) problems, is the Recurrent Neural Network (RNN), particularly the Long Short-Term Memory (LSTM) network with its well-known qualities in handling sequences. RNNs are described in section 2.1. Another concept that has spread over almost all NLP areas is attention, which helps the network learn which parts of a sequence are especially relevant for a problem. It is described in section 2.4.

For our experiments, we implement a model that uses Convolutional Neural Networks (CNNs), a concept that is especially popular in image processing but is gaining popularity in NLP, too, as it also has good capacities to handle sequences but is faster than RNNs. CNNs are described in section 2.2.

We will briefly describe further architectures that come up in the related work: the purely attention-based Transformer models (section 2.4.1) and Pointer Networks, which can be deployed when the output of the network is a part of the input (section 2.4.2), to give an idea of how they work and what their applications and advantages are.

2.1 Recurrent Neural Networks (RNNs)

RNNs, as said before, are very common in NLP applications nowadays because, in contrast to a simple feed-forward neural network, they can process sequences of data such as text and speech. They do not only rely on the current input but also on a memory of what they have seen in the past. This information is saved in the hidden state h_{t−1} from the previous time step t − 1. So, on a high level, the RNN is a function that maps the current input x_t and the previous hidden state h_{t−1} to a new hidden state h_t:

h_t = RNN(x_t, h_{t−1})    (2.1)
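As a minimal sketch of the recurrence in equation (2.1), the following PyTorch snippet implements a simple (Elman-style) RNN cell. The class name, layer names and sizes are illustrative and not taken from the thesis implementation.

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.in2hidden = nn.Linear(input_size, hidden_size)
        self.hidden2hidden = nn.Linear(hidden_size, hidden_size)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # h_t = RNN(x_t, h_{t-1}): combine the current input with the previous hidden state.
        return torch.tanh(self.in2hidden(x_t) + self.hidden2hidden(h_prev))

cell = SimpleRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)                 # initial hidden state
for x_t in torch.randn(5, 1, 8):       # a toy sequence of length 5
    h = cell(x_t, h)                   # carry the hidden state forward through time
```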


Figure 2.1: RNN. Graphic by Christopher Olah.

But standard RNNs can only deal with short-distance relations and always assign higher weights to more recent inputs rather than learning to prioritize information. For this reason, the Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks have been developed (Goodfellow et al., 2016).

2.1.1 Long Short-Term Memory (LSTM) networks

LSTMs (Hochreiter and Schmidhuber, 1997) are intended to solve the long-term dependency problem by using so called gate functions for information management within the network. These functions are trained to keep information over long distances if they are considered relevant for the output. Vice versa, needless information will be filtered out. The output of each step is regulated by them as well, so that only the most meaningful information is passed through the gate.

Figure 2.2 shows an LSTM cell with its gate components; the single layer of the plain RNN is replaced by four interacting layers.

Figure 2.2: LSTM. Graphic by Christopher Olah.

The cell state runs along the top of the diagram and contains the information that flows from the previous to the next LSTM cell, either changed or unchanged, depending on the decisions controlled by the gates.

There are different types of gates that interactively control the information flow. The forget gate (the part with σ on the left) controls what information from the cell state is preserved, conditioned on h_{t−1} and x_t. Through the input gate (the σ and tanh in the middle), the cell state is updated: the part with tanh activation creates the new candidate values, while the part with σ activation decides which values are actually updated. The combined output (σ and tanh) is then written to the cell state through addition. The output gate on the right controls the information that is passed on to the next cell. (Olah, 2015)
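A minimal sketch of these gate computations, with a single LSTM step written out by hand; the parameter layout and sizes are assumptions for illustration, not the thesis implementation.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step with forget (f), input (i), candidate (g) and output (o) gates.
    W, U and b hold the stacked gate parameters; all shapes are illustrative."""
    gates = x_t @ W + h_prev @ U + b                   # (batch, 4 * hidden)
    f, i, g, o = gates.chunk(4, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)                                  # new candidate values
    c_t = f * c_prev + i * g                           # forget gate keeps, input gate adds
    h_t = o * torch.tanh(c_t)                          # output gate filters the cell state
    return h_t, c_t

batch, inp, hidden = 2, 8, 16
W, U, b = torch.randn(inp, 4 * hidden), torch.randn(hidden, 4 * hidden), torch.zeros(4 * hidden)
h = c = torch.zeros(batch, hidden)
h, c = lstm_step(torch.randn(batch, inp), h, c, W, U, b)
```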

2.1.2 Gated Recurrent Unit (GRU)

The authors of the Gated Recurrent Unit combine the LSTM's input gate and forget gate into what they call an update gate, and skip the idea of having a cell state in addition to the hidden state. Despite their simpler architecture, GRUs generally reach a performance comparable to LSTMs (Greff et al., 2017), a fact that motivates their increasing popularity. Their architecture is shown in figure 2.3.

Figure 2.3: GRU. Graphic by Christopher Olah.

2.2 Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs, LeCun et al. (1989)) were developed for image recognition, but are getting increasingly popular for NLP tasks as their qualities in recognizing and combining local patterns are also important for many tasks on text data, especially when working with character-level features.

CNNs use convolutions instead of general matrix multiplications. In a convolution operation, small windows, also called filters or kernels, are moved across the input to search for particular local features (Goodfellow et al., 2016). A typical layer consists of one or more convolution operations followed by a pooling function. After the last convolutional layer, the output is flattened into a one-dimensional array and fed into a fully connected layer. A popular pooling function is max-pooling, which only reports the maximum output within a neighborhood, so that only the most meaningful features are preserved. Less meaningful features are discarded, which focuses the model on the most relevant parts and also saves computational resources by reducing the number of dimensions.


Figure 2.4 shows a one-dimensional CNN with two convolutional layers. The inputs x_i are fed through two convolutional layers, A and B, with a max-pooling layer in between, before reaching the fully connected layer F. (Olah, 2014)
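A minimal PyTorch sketch of such a one-dimensional convolutional stack, loosely following the layout of figure 2.4; all channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Two 1-D convolutional layers with max-pooling in between, followed by
# flattening and a fully connected layer, as described above.
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3),   # convolutional layer A
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                                # keep only the strongest local features
    nn.Conv1d(in_channels=4, out_channels=8, kernel_size=3),    # convolutional layer B
    nn.ReLU(),
    nn.Flatten(),                                               # flatten to a 1-D vector
    nn.LazyLinear(out_features=2),                              # fully connected layer F (infers its input size)
)
scores = model(torch.randn(1, 1, 32))   # one toy input sequence of length 32
```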

2.3 Encoder-Decoder Networks

Encoder-decoder networks have spread from neural machine translation (Kalchbrenner and Blunsom, 2013; Cho, Van Merriënboer, Bahdanau, et al., 2014) to many other NLP problems like text summarization (Nallapati et al., 2016), syntactic parsing (Vinyals, Kaiser, et al., 2015) and question answering. Traditionally, they translate one sequence into another by encoding the variable-length input into a fixed-length vector representation and decoding it into the target sequence, which can also be of variable length. Figure 2.5 shows a machine translation example of how the encoder-decoder architecture works.

Figure 2.5: Encoder-Decoder Example by Cho, Van Merriënboer, Bahdanau, et al. (2014)

The English example input sentence is encoded into a fixed-length vector representation z_1, z_2, ..., z_d, and then decoded into its French translation.

While the first encoder-decoder architectures were based on plain RNNs (Kalchbrenner and Blunsom, 2013), other variants soon appeared, like CNN-based architectures (Cho, Van Merriënboer, Bahdanau, et al., 2014) and architectures with added attention mechanisms (Bahdanau et al., 2014), which we describe in the following section.

2.4 Attention

While the fixed-length vector representation of traditional RNN-based encoder-decoder architectures can work well on short sequences, RNNs are not good at remembering relevant information in longer sequences. The attention mechanism (Bahdanau et al., 2014) addresses this handicap of RNNs by giving the model access to all states of the input sequence when deciding on the output sequence. It learns to assign a score to each input state, normalizes them into a probability distribution and uses the distribution to create the so-called context vector.

Let h_1, ..., h_n be the states and s_1, ..., s_n the scores that the model assigns to the states with the same indices. The scores s_1, ..., s_n are then normalized, for example using a softmax function:

p_i = exp(s_i) / Σ_j exp(s_j)    (2.2)

p_1, ..., p_n are also called the attention weights. The context vector is created as follows:

Context Vector = p_1 · h_1 + p_2 · h_2 + ... + p_n · h_n    (2.3)

This vector is then concatenated with the output of the last time step and fed into the decoder. With the learned attention weights, the model is now able to attend to relevant parts anywhere in the input sequence. The contribution of an input state to the current context vector corresponds to its assigned weight in the probability distribution. This is also called soft attention, in contrast to hard attention where a hard decision on the most relevant input state is made (Xu et al., 2015).
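A minimal sketch of this soft attention step, combining equations (2.2) and (2.3): the unnormalized scores are turned into a probability distribution and used to weight the states. Shapes are illustrative.

```python
import torch

def soft_attention(states: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """states: (n, d) encoder states h_1..h_n; scores: (n,) unnormalized scores s_1..s_n.
    Returns the context vector of equation (2.3)."""
    weights = torch.softmax(scores, dim=0)            # p_1..p_n, the attention weights
    return (weights.unsqueeze(1) * states).sum(dim=0) # weighted sum of the states

context = soft_attention(torch.randn(6, 16), torch.randn(6))   # a toy example with n=6, d=16
```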

2.4.1 Transformer Models and Multi-Head Attention

Vaswani et al. (2017) proposed the Transformer model, which performs machine translation without an RNN, relying entirely on attention mechanisms. This is considered desirable not only because of its capacity to handle long-term dependencies but also because the training time of the Transformer is much shorter than that of RNNs, since it allows for heavy parallelization, which an RNN cannot do due to its sequential architecture.


Figure 2.6: Transformer Model by Vaswani et al. (2017)

A novelty in this model is the Multi-Head Attention mechanism in several of the blocks. Instead of one, multiple attention distributions are used for a single input, referred to as multiple heads.

Figure 2.7: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel by Vaswani et al. (2017)

The scaled dot-product attention shown on the left is an attention function without additional learned parameters, where the attention weights are created by only using the (scaled) dot product. The special feature of the multi-head attention mechanism is that it does not apply its attention mechanism over the whole model dimension d. Instead, the model dimension d is split into the number of used heads and attention is applied over each of these heads separately. Afterwards these heads are combined again. The main advantage over previous attention mechanisms is that the resulting context vector can be assembled from the given values at a much higher resolution. Linear layers are used to linearly project the inputs and the output.
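A minimal sketch of the head-splitting idea; the input and output linear projections mentioned above are omitted for brevity, and all sizes are illustrative.

```python
import torch

def multi_head_attention(q, k, v, num_heads):
    """Scaled dot-product attention applied separately over num_heads slices of the
    model dimension d, then recombined. q, k, v have shape (seq_len, d)."""
    seq_len, d = q.shape
    d_head = d // num_heads
    # Split the model dimension into heads: (num_heads, seq_len, d_head).
    qs, ks, vs = (x.view(seq_len, num_heads, d_head).transpose(0, 1) for x in (q, k, v))
    scores = qs @ ks.transpose(1, 2) / d_head ** 0.5     # scaled dot-product per head
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ vs                                 # (num_heads, seq_len, d_head)
    return heads.transpose(0, 1).reshape(seq_len, d)     # concatenate the heads again

out = multi_head_attention(torch.randn(10, 64), torch.randn(10, 64), torch.randn(10, 64), num_heads=8)
```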

2.4.2 Pointer Networks

Pointer Networks (Vinyals, Fortunato, et al., 2015) use attention not to build a context vector but to point directly at elements of the input sequence, so that the output consists of positions in the input. This makes them well suited for tasks where the output is a part of the input, such as selecting the start and end positions of an answer span in a document.

3 Question Answering

In this chapter we give an overview of how the fields of Question Answering (section 3.1) and Reading Comprehension (section 3.2) have developed over time. Section 3.3 surveys approaches that use Reading Comprehension for Question Answering, and section 3.4 introduces approaches to Answer Verification in Question Answering systems.

3.1 Earlier Approaches

Until end-to-end encoder-decoder architectures became common in recent years, explicit feature extraction was necessary, most importantly of the question type and keywords, to formulate a machine-readable query. An information retrieval system is then used over documents in natural language, a database, or structured world knowledge, which can for example be domain-specific knowledge in ontologies. Working with structured knowledge bases (the so-called knowledge-based approach) is easier for machines but much more limited in the available data resources, at least for open-domain QA. When working with documents in natural language (the IR-based approach), the retrieval systems rely on a pipeline of multiple subtasks, like part-of-speech tagging, syntactic parsing, entity and named entity recognition, semantic parsing etc. The relevant documents are retrieved using the query's keywords, and the relevant passages are extracted and ranked (Jurafsky and Martin, 2000).

A well-known natural language question answering system and an early milestone in the field is IBM Watson, which became famous when it won the TV quiz show Jeopardy! in 2011. It extracts features of a question, then looks across its corpus, generates hypotheses and compares the language of the question and of the possible answers using more than 100 reasoning algorithms that analyze it in many different ways. It considers evidence from many language and contextual features that either supports or refutes an answer candidate. Confidence scores are created for each candidate answer, also by many parallel algorithms, and the answer with the highest ranking is selected. Watson is based on a large corpus that consists of unstructured knowledge of many types like text books, guidelines, how-to manuals, FAQs, benefit plans, and news (High, 2012).

3.2 Reading Comprehension


Data sets that are large enough to train machine learning systems on are today often created automatically, with the task being to fill in an automatically created gap in a text with a missing word or entity. While this has the advantage of easy access to large amounts of training data, the task is considered less versatile than standard question answering.

Hermann et al. (2015) introduced the CNN and Daily Mail data sets with one million news documents that are supplemented with bullet points summarizing the information in an article. They replace one entity token in these bullet points at a time with a placeholder, taking this as the query, and use the article's main text as the information source.

Hermann et al. (2015) use different LSTM models with and without incorporated attention mechanisms and show that attention is crucial for good performance and that systems with attention easily outperform both non-neural baselines and purely LSTM-based models. The architecture that has received the most attention from other researchers is the Attentive Reader. Its attention architecture is simple: it computes a document representation vector as a weighted sum of the word representations, with attention weights based on the query, and then combines this document representation with the query representation for the prediction.

Several other works have outperformed the Attentive Reader by refining its attention architecture.

The Attention Sum (AS) Reader (Kadlec et al., 2016) uses attention to pick the answer directly from the document rather than building a document representation first. The word embeddings of the document are passed to an encoding layer that builds contextual embeddings with a Bi-GRU. The query receives a fixed-length representation of the same dimensionality as each contextual word embedding. These embeddings are used to compute a weight for every word in the document, using the dot product of the word and the query embedding, normalized with a softmax function. The score of a word as an answer candidate is then calculated by summing these weights over all positions where it occurs.

Cui et al. (2016) introduced the two-way attention-over-attention (AoA) reader, refining the AS Reader by Kadlec et al. (2016). It calculates a pair-wise matching score between every word in the query and every word in the document. In the resulting matrix, they apply a softmax function to each column, so that each column becomes an individual document-level attention with respect to one query word, and also to each row, in order to get an importance distribution over the query based on each document word; this yields both a query-to-document attention and a document-to-query attention. The document-to-query attentions are then averaged, and the dot product between the query-to-document attention and the averaged document-to-query attention is calculated and used for the final scoring.

In a further refinement, for each token a token-specific representation of the query is formed with soft attention, and the query representation is then multiplied element-wise with the document token representation, so that, as in the AoA Reader, query-to-document and document-to-query attention are learned. The prediction is then made similarly to the AS Reader by Kadlec et al. (2016).

3.3 Question Answering with Neural Reading Comprehension

The common approach for the general architecture is to create embeddings for the question and for each token or each span in the passage and compute a similarity score between them to decide on the right answer span. For these scores, the most widely used components at the time of writing are recurrent neural networks for the sequential inputs, combined with attention mechanisms for the long-term interactions.

Wang and Jiang (2016) were the first to adapt Ptr-Net (Vinyals, Fortunato, et al., 2015) to the QA problem, combining it with an LSTM preprocessing layer and a match-LSTM (Wang and Jiang, 2015) layer that attends over the tokens of the passage sequentially and at each step obtains a weighted vector representation of the question. The question representation is then combined with a vector representation of the current token and fed into the so-called match-LSTM that sequentially aggregates the weighted representations for the prediction. The final layer is the pointer layer that produces the answer.

Seo et al. (2016) introduced the bi-directional attention flow (BiDAF) networks to model question-passage pairs at multiple levels of granularity. In three embedding layers, a character-level embedding (obtained with a CNN) and a word embedding (GloVe by Pennington et al. (2014)) are used to create a context embedding with a BiLSTM. The embedding layers are applied to both the query and the context. The fourth layer is the attention flow layer that links the query and context vectors and produces a set of query-aware feature vectors for each word in the context. As introduced by Cui et al. (2016) for an RC system, it is applied in two directions: query-to-context and context-to-query. At each time step, its output flows into the following modeling layer, along with the embeddings from previous layers. The fifth layer is the modeling layer that scans the context with a BiLSTM. The sixth layer is the output layer that provides an answer to the query.

Xiong et al. (2016) introduced dynamic co-attention networks (DCN) which attend to the question and passage simultaneously and iteratively refine answer predictions. The co-attentive encoder captures similarities between the question and the document. The dynamic Ptr-Net decoder alternates between start and end candidates of answer spans.


Recently, models that drop the recurrent components have gained popularity, especially due to substantial speed improvements. QANet by A. W. Yu et al. (2018), which is based on convolutions and attention, is described in the following subsection 3.3.1 because it is the basis of the models that we implemented for our experiments.

3.3.1 QANet

The QANet model by A. W. Yu et al. (2018) is the model that we use as the base of our own experiments. Therefore we will introduce it here in more detail than other recent models. It is illustrated in figure 3.1.

Figure 3.1: QANet network architecture overview by A. W. Yu et al. (2018)

A particular characteristic of this model is the design of its encoder blocks, coloured aqua blue in figure 3.1. On the left, the encoder blocks can be seen within the model structure, while the right part depicts the interior of one encoder block. In previous works, the main component of these blocks used to be one or several RNNs such as LSTMs or GRUs. This, however, comes with drawbacks like long training times and problems with capturing long-term dependencies, as described in section 2.4.

Instead, each encoder block starts with a positional encoding (Vaswani et al., 2017) that encodes the position of each token within the sequence, followed by a stack of convolutional layers and a self-attention layer that uses multi-head attention as proposed by Vaswani et al. (2017). Finally, a feed-forward layer is applied at the end of each encoder block. Each of the sub-layers in the block is preceded by a normalization layer, and residual connections are placed around each sub-layer.

These encoder blocks are used to encode the embedded inputs of question and context (Q_emb and D_emb). The Context-Query Attention layer aims to produce a query-aware context representation by using attention over both question and context. This layer is a very close implementation of the attention flow layer from the BiDAF model by Seo et al. (2016).

Q_enc = StackedEmbedEncoder(Q_emb)
D_enc = StackedEmbedEncoder(D_emb)
CQ_out = ContextQueryAttention(Q_enc, D_enc)    (3.1)

This is followed by a stack of three encoder blocks which share the same weights and which are used to encode the query-aware context representation.

M1 = StackedEncoderBlocks(CQ_out)
M2 = StackedEncoderBlocks(M1)
M3 = StackedEncoderBlocks(M2)    (3.2)

The output of the first block, M1, is used to compute both the start and the end index of the answer, while the output of the second encoder block, M2, is used only to compute the start index and the output of the last block, M3, only to compute the end index. Forwarding each encoder output in this way ensures that relevant information is passed through.

To compute the final probability distribution over all tokens, the two respective encoder outputs for computing the start and end index are each concatenated and fed into a pointer network to obtain a single scalar score (see 5.1) for each token index in the context.

S_start = Pointer1([M1; M2]),  S_end = Pointer2([M1; M3])    (3.3)

Finally, a softmax activation is used to convert these scores into probability distributions.

p1 = softmax(S_start),  p2 = softmax(S_end)    (3.4)
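A small sketch of equations (3.3) and (3.4), assuming linear pointer layers over the concatenated encoder outputs; the exact parametrization of the pointer layers is not spelled out above, so the layer choice and all sizes are illustrative.

```python
import torch
import torch.nn as nn

d_model, seq_len = 128, 50
pointer_start = nn.Linear(2 * d_model, 1)    # maps [M1; M2] per token to a scalar score
pointer_end = nn.Linear(2 * d_model, 1)      # maps [M1; M3] per token to a scalar score

M1, M2, M3 = (torch.randn(seq_len, d_model) for _ in range(3))
S_start = pointer_start(torch.cat([M1, M2], dim=-1)).squeeze(-1)   # (seq_len,)
S_end = pointer_end(torch.cat([M1, M3], dim=-1)).squeeze(-1)       # (seq_len,)
p1, p2 = torch.softmax(S_start, dim=0), torch.softmax(S_end, dim=0)
```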

3.4 Answer Verification


QA systems including answer triggering can be based on a pipeline approach that either ranks possible answers first and then decides whether the answers are valid (Y. Yang et al., 2015; Jurczyk et al., 2016), or first decides whether there is a correct answer to the question and then creates a ranking for the answer candidates. But as the pipeline approach suffers from error propagation, there are also efforts to build systems that jointly optimize both tasks (Zhao et al., 2017).


4 Data

In section 4.1 we introduce common data sets for Question Answering, motivating our choice of the SQuAD data set, which is described in detail in section 4.2.

4.1 Overview

While many of the large data sets are automatically generated, with a missing word or entity to be filled in, as done by Hermann et al. (2015), we focus on question answering with human-formulated questions, as this task is closer to real-world applications, where a user can ask a question in natural language, for example in a search engine. But it is also the harder task, as not only the document or context that is expected to support the answer but also the question must be understood by the system.

We focus on selection-based question answering data sets, where the answer must be found in a given paragraph. Recent QA data sets designed for end-to-end systems typically answer questions based on a specified text paragraph or document, as the retrieval of documents that are likely to contain the answer is relatively easy and well-studied compared to extracting the exact answer boundaries (Clark and Gardner, 2017).

A very popular resource to create data sets is Wikipedia as it is openly available and contains up-to-date knowledge about a very wide range of topics that humans are interested in (Chen et al., 2017). There are several smaller crowdsourced corpora that are based on Wikipedia articles, like SelQA with 8,000 questions (Jurczyk et al., 2016) and InfoboxQA with 15,000 questions that are based on Infoboxes in Wikipedia articles (Morales et al., 2016). WikiQA (Y. Yang et al., 2015) also includes the answer triggering task, containing 3,047 questions from Bing query logs as the question source and a Wikipedia page text containing the candidate answer sentences, the correct sentences labeled via crowdsourcing. About half (1,473) of the questions in the set have a correct answer sentence in the corresponding article.

4.2 SQuAD 2.0

The questions were posed by crowdworkers who were encouraged to use their own words, so that question words are often synonyms of the words in the context passage.

A system should predict if a question has an answer supported by the paragraph or not. If it does have one, it should predict a correct answer span.

The reported human performance (Rajpurkar et al., 2016) has an F1 score of 89.5 on the SQuAD data set, while the best system so far reaches an F1 score of 77.0 (as of November 7th, 2018) on the leaderboard (https://rajpurkar.github.io/SQuAD-explorer/). A logistic regression baseline developed by Rajpurkar et al. (2016) reached only 51.0%.

We chose SQuAD for our experiments as it is of appropriate size and good quality, the questions are formulated in external authors' own words, and it requires answer verification since a certain percentage of the questions in the data set is unanswerable.

We provide an example context sequence with two example questions and their respective answers in the following quote:

Context:

For the 2012-2013 school year annual tuition was $38,000, with a total cost of attendance of $57,000. Beginning 2007, families with incomes below $60,000 pay nothing for their children to attend, including room and board. Families with incomes between $60,000 to $80,000 pay only a few thousand dollars per year, and families earning between $120,000 and $180,000 pay no more than 10% of their annual incomes. In 2009, Harvard offered grants totaling $414 million across all eleven divisions; $340 million came from institutional funds, $35 million from federal support, and $39 million from other outside support. Grants total 88% of Harvard’s aid for undergraduate students, with aid also provided by loans (8%) and work-study (4%).

Question:

After 2007 how much do student from families earning less than $60,000 pay for school?

Answers:

(1) nothing for their children to attend, including room and board [answer start index: 156]
(2) nothing [answer start index: 156]

is impossible: false

Question:

How many student dorms were there in use at Harvard in 2012?

Answers:

[]

is impossible: true


5 Implementation

We will describe the implementation of the QA system that is the basis of our experiments in section 5.1 and how we included the four AV mechanisms to the system in section 5.2. Our choice of hyperparameters in the models is motivated and described in section 5.3.

All models were implemented in Python using the PyTorch library (Paszke et al., 2017).

All models used for this thesis are publicly available at https://github.com/maxtrem/qanet.

5.1 Question Answering System

The system that we implemented is based on QANet (A. W. Yu et al., 2018). We followed the instructions in their paper closely to implement our base model. Please see section 3.3.1 for further details of the model’s architecture.

In the following, let D = {d_1, d_2, ..., d_n} and Q = {q_1, q_2, ..., q_m} represent the words in the input document and question, respectively. S_start and S_end represent the un-normalized scores for each token being the start or the end of the answer span; their elements s_i and e_i are also referred to as answer candidates in the following.

S_start = {s_1, s_2, ..., s_n}
S_end = {e_1, e_2, ..., e_n}    (5.1)

To get the predicted indices a_start and a_end for the start and end of the answer span, the argmax function is used:

a_start = argmax(S_start)
a_end = argmax(S_end)    (5.2)
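A small illustration of equation (5.2); the scores are random toy values.

```python
import torch

S_start = torch.randn(40)          # unnormalized start scores s_1..s_n
S_end = torch.randn(40)            # unnormalized end scores e_1..e_n
a_start = int(torch.argmax(S_start))
a_end = int(torch.argmax(S_end))
predicted_span = (a_start, a_end)  # token indices of the predicted answer span
```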

5.2 Inclusion of the AV mechanisms

There are four versions of Answer Triggering mechanisms that we included, resulting in four different models for the evaluations. We will introduce them here: The VerifyFirst model in subsection 5.2.1, the Candidate:No model in subsection 5.2.2 and the VerifyLater model in subsection 5.2.3. Subsection 5.2.4 introduces a modified version of VerifyLater.

Each of the above-mentioned mechanisms outputs two confidence scores, n_start and n_end, that act as thresholds for the answer candidates. If the answer candidate with the highest score is below that threshold, the question is considered impossible to answer given the current context. If one or more candidates are above the threshold, the question is considered answerable and the highest-ranking candidate is taken. This means, however, that the decision whether a question is answerable given the current context depends not only on the answer triggering mechanism itself, but also on the distribution within S_start and S_end.

To consider a question as impossible to answer, it suffices that either n_start or n_end is larger than the highest-ranking respective answer candidate:

max(S_start) < n_start ∨ max(S_end) < n_end    (5.3)
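Equation (5.3) as a small helper function; the toy scores in the usage example are invented.

```python
import torch

def is_unanswerable(S_start, S_end, n_start, n_end) -> bool:
    """Equation (5.3): the question is treated as unanswerable if either no-answer
    score exceeds the best respective answer-candidate score."""
    return bool(S_start.max() < n_start) or bool(S_end.max() < n_end)

print(is_unanswerable(torch.tensor([0.2, 1.1]), torch.tensor([0.5, 0.9]),
                      n_start=1.5, n_end=0.1))   # True: n_start beats every start score
```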

Figure 5.1 shows where in QANet's architecture the different mechanisms are placed and what information they use.

Figure 5.1: QANet network architecture overview with answer triggering mechanisms included.

5.2.1 VerifyFirst

To obtain n_start and n_end, the output of the context-query attention layer is fed through two convolutional layers with a ReLU activation in between. Since no separate information about start and end positions is available at this point, both n_start and n_end are set to the same value. The mechanism can be summarized by the following formula, where CQ_out (3.1) represents the output of the context-query attention layer:

n_start, n_end = C2(relu(C1(CQ_out)))    (5.4)

The context-query attention layer aims to produce a query-aware context representation which contains linked information of document and question while losing as little information as possible, as in Seo et al. (2016).
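A minimal sketch of equation (5.4); the kernel sizes, channel counts and the reduction of the token dimension to a single scalar (mean pooling here) are assumptions, as they are not fully specified above.

```python
import torch
import torch.nn as nn

d_model, seq_len = 128, 50
C1 = nn.Conv1d(d_model, d_model, kernel_size=1)   # first convolutional layer
C2 = nn.Conv1d(d_model, 1, kernel_size=1)         # second layer maps to one channel

CQ_out = torch.randn(1, d_model, seq_len)             # (batch, channels, tokens)
score = C2(torch.relu(C1(CQ_out))).mean(dim=-1)       # reduce over tokens to one scalar
n_start = n_end = score.squeeze()                     # both thresholds share the same value
```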

5.2.2 Candidate:No

The second approach, which has been used by Zhao et al. (2017), is Candidate:No. It introduces a special token NO and appends it to the end of the context document, so the input document becomes:

D = {d_1, d_2, ..., d_n, NO}    (5.5)

What is special about this method is that the token is passed through the complete computation of the model from beginning to end. Through context-query attention it is compared and linked to all other possible answer candidates. It is also passed through the model encoders at the end, which possibly helps to encode additional information. This method is not only simple, it also does not require any hard changes to the model; it utilizes all of the model's components and fits natively into the model's structure. Using this method, n_start and n_end are the last elements of S_start and S_end respectively.

S_start = {s_1, s_2, ..., s_n, n_start}
S_end = {e_1, e_2, ..., e_n, n_end}    (5.6)
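A toy illustration, in plain Python, of how the appended NO token supplies n_start and n_end as in equations (5.5) and (5.6); the token strings and scores are invented, and the real model of course operates on embeddings rather than strings.

```python
document = ["harvard", "tuition", "was", "38000"]
document_with_no = document + ["NO"]           # D = {d_1, ..., d_n, NO}

# After the model has scored every token (toy scores here), the NO token's
# scores are simply the last elements of S_start and S_end.
S_start = [0.1, 0.7, 0.2, 0.4, 0.9]
S_end = [0.2, 0.3, 0.1, 0.8, 0.6]
n_start, n_end = S_start[-1], S_end[-1]
unanswerable = max(S_start[:-1]) < n_start or max(S_end[:-1]) < n_end
print(unanswerable)                            # True in this toy example
```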

5.2.3 VerifyLater

The VerifyLater model is inspired by Y. Yang et al. (2015) and Jurczyk et al. (2016) and uses a mechanism similar to the VerifyFirst approach, but, in contrast to the latter, does not make the decision after the context-query attention layer. Instead it uses the final encoder outputs (3.2) for its prediction and computes n_start and n_end from [M1; M2] and [M1; M3] respectively. The encoder outputs are concatenated and mapped to a single scalar using a two-layered convolutional network:

n_start = C2_start(C1_start([M1; M2]))
n_end = C2_end(C1_end([M1; M3]))    (5.7)

C1 and C2 denote the first and the second convolutional layer of the respective projection.


5.2.4 VerifyLater (Confidence Based)

This method also belongs to the VerifyLater category, but is simpler and less computationally expensive than the one above. Instead of the final encoder states M1, M2 and M3 (3.2), the un-normalized scores of the answer candidates S_start and S_end (5.1) are used to decide whether there is a possible answer in the given document.

In this approach, the decision whether a question can be answered is based solely on the distributions in S_start and S_end.

n_start = C_start(S_start)
n_end = C_end(S_end)    (5.8)

C_start and C_end are convolutional projection layers as in (5.7).
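A minimal sketch of equation (5.8), assuming a single convolution whose kernel spans the whole score sequence so that each projection yields one scalar; the exact layer configuration is an assumption, as the thesis only states that convolutional projection layers are used.

```python
import torch
import torch.nn as nn

seq_len = 50
C_start = nn.Conv1d(1, 1, kernel_size=seq_len)   # collapses the start-score sequence to one value
C_end = nn.Conv1d(1, 1, kernel_size=seq_len)     # collapses the end-score sequence to one value

S_start = torch.randn(1, 1, seq_len)             # (batch, channels, tokens)
S_end = torch.randn(1, 1, seq_len)
n_start = C_start(S_start).squeeze()
n_end = C_end(S_end).squeeze()
```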

5.3 Hyperparameters


6 Evaluation Procedure

We evaluate the answer verification mechanisms both implicitly, through their contribution to the overall question answering performance, and explicitly.

The most common evaluation metrics in general question answering are Exact Match (EM), described in section 6.1, and the F1 metric, described in section 6.2. We ignore punctuation tokens and articles, as is common for QA evaluation metrics (Jurafsky and Martin, 2000).

For the explicit AV evaluation, we additionally measure the precision, recall and F1 score for the non-answerable questions, as described in section 6.3. For a more qualitative evaluation of the AV mechanisms, we formulate hypotheses about which kinds of question-context-answer triples may be especially hard to predict for the AT mechanisms, and describe our methods of testing them in section 6.4.

6.1 Exact Match (EM)

The Exact Match (EM) metric is the percentage of predictions that match one of the gold answers exactly. An answer only counts as an exact match if all words in the predicted answer are exactly the same as in the gold answer. It is calculated as the fraction of correct predictions over all predictions.

EM = Number of correct predictions / Number of all predictions    (6.1)

For unanswerable questions, a prediction counts as correct if the question is correctly recognized as unanswerable, and as incorrect otherwise.

The EM is the most stringent metric. It makes sure that all relevant parts of the answer are covered. Answers that contain only a part of the original answer, or only one more word than the original answer, are counted as completely wrong.
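A small sketch of how EM can be computed, including the normalization (lowercasing, dropping punctuation and the articles a/an/the) described in this chapter; the exact normalization used in the thesis implementation may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and the articles a/an/the (common SQuAD practice)."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> int:
    # 1 if the normalized prediction equals any normalized gold answer, else 0.
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("Nothing.", ["nothing", "nothing for their children to attend"]))  # 1
```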

6.2 F1

The F1 metric measures the average overlap between the gold answers and the predictions of the system. For every prediction, Precision and Recall of the words are computed and used for the calculation of the F1 score:

Precision = Number of shared words / Number of words in the prediction    (6.2)

Recall = Number of shared words / Number of words in the gold answer    (6.3)

F1 = (2 · Precision · Recall) / (Precision + Recall)    (6.4)

The F1 metric is, in contrast to EM, more tolerant towards answers where only a part of the words of the correct answer is found.

For unanswerable questions, we assign a right classification a score of 1 and a wrong classification a score of 0. The average of the per-prediction scores is the average F1 score that we will report.
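A sketch of the per-prediction token F1 just described; tokenization is simplified to lowercased whitespace splitting for the example.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("room and board", "including room and board"))   # ~0.857
```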

6.3 Answer Verification: Metrics

We will evaluate the performance of the Answer Verification mechanisms explicitly by calculating the precision, recall and F1 score for the unanswerable questions for each AV mechanism.

Precision = Number of answers correctly tagged as unanswerable / Number of all answers tagged as unanswerable    (6.5)

Recall = Number of answers correctly tagged as unanswerable / Number of unanswerable questions in the gold data    (6.6)

F1 = (2 · Precision · Recall) / (Precision + Recall)    (6.7)

These metrics show the performance of the AV mechanisms and help to give an idea of the advantages and disadvantages of each mechanism, e.g. whether it is conservative (higher precision, lower recall) or greedy (lower precision, higher recall).
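A small sketch of equations (6.5) to (6.7), computed over boolean unanswerability labels; the example labels are invented.

```python
def answer_verification_scores(predicted_unanswerable: list[bool], gold_unanswerable: list[bool]):
    """Precision, recall and F1 for the unanswerable class (equations 6.5-6.7)."""
    tp = sum(p and g for p, g in zip(predicted_unanswerable, gold_unanswerable))
    predicted_positive = sum(predicted_unanswerable)
    gold_positive = sum(gold_unanswerable)
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / gold_positive if gold_positive else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(answer_verification_scores([True, False, True, True], [True, True, False, True]))
```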

6.4 Answer Verification: Hypotheses

We will test a number of hypotheses in order to see where the major difficulties of our models are, to be aware of them in possible applications and to suggest measures that would possibly deal with the problems in future work. They include:

• A large overlap between the words in the question and in the context document increases the risk that an unanswerable question is mistakenly given an answer.

• Longer questions are more complex and both harder to answer and more likely to be recognized as unanswerable.

• Questions with longer answers are more complex and both harder to answer and harder to be recognized as unanswerable.

• There are question words that are harder to answer and harder to be recognized as unanswerable. Why and How could be hard question words, while questions with Who, How many, Where or When are relatively easy to answer.

We will test the hypotheses by dividing the models' outputs on the development set into groups and looking at the performance or length statistics of all models on each of the subgroups. The four groups that we will look at are the true positives, the true negatives, the false positives and the false negatives with respect to the unanswerable questions. We will look at different length statistics or at precision and recall, depending on the hypothesis. The respective metric is always stated during the evaluation in section 7.3.

Please note that the reasoning about whether a hypothesis applies or not is rather informal, and we do not perform statistical significance tests in this section.

6.5 Baselines

As a baseline, we use the QA system described in section 5.1 without any answer triggering mechanism.

7 Results

We report the scores for the general question answering metrics F1 and Exact Match for the four models with answer triggering as well as for the baseline without answer triggering in section 7.1. In the following section 7.2, we report the scores for detecting the unanswerable questions in the metrics precision, recall and F1. In the last section of this chapter, section 7.3, we finally test the hypotheses that we introduced in section 6.4.

7.1 Question Answering

We report the F1 and Exact Match results for the general Question Answering task on the SQuAD 2.0 development set in table 7.1.

Model          F1    EM
Baseline       33.6  25.1
Candidate:No   62.4  56.4
VerifyFirst    61.6  54.8
VerifyLater    53.2  49.0
Confidence     62.4  56.4

Table 7.1: Question Answering Results: F1 and EM on SQuAD 2.0

The beneficial effect of the answer triggering mechanism is obvious: All models with answer triggering outperform the baseline by far, with a minimum of 19.6% and up to 28.8% in the F1 score over the baseline without answer triggering. The Exact Match score gives an even clearer impression. Here the answer triggering models lie between 23.9% and 31.3% over the baseline.

The best performance is reached by the Candidate:No and Confidence models in both metrics. The second-best performance in the F1 metric is reached by the VerifyFirst model. While the performance of the three best models is of a similar magnitude, the VerifyLater model scores by far the lowest.

7.1.1 Performance on Answerable Questions only

To evaluate the models' performance on the answerable questions only, we tested them on the SQuAD 1.1 development set, which does not contain the non-answerable questions. We report the F1 and EM results in table 7.2.


Model          F1    EM
Baseline       66.9  50.0
Candidate:No   52.8  40.6
VerifyFirst    54.9  41.3
VerifyLater    55.4  42.3
Confidence     52.5  40.5

Table 7.2: Question Answering Results: F1 and EM on SQuAD 1.1

The results show that training on the 2.0 data set, with its large portion of non-answerable questions, hurts the performance on a data set without non-answerable questions like the 1.1 data set. However, it is worth noting that even though all AV models perform worse on 1.1, they still perform better there than the baseline model does on 2.0.

7.2 Answer Triggering

We report the Precision, Recall and F1 scores on the detection of non-answerable questions in table 7.3.

Model          Precision  Recall  F1
Candidate:No   72.4       71.8    72.1
VerifyFirst    73.9       68.1    70.9
VerifyLater    60.9       69.9    65.1
Confidence     71.9       72.1    72.0

Table 7.3: Answer Triggering Results: Precision, Recall and F1

It is notable that all models that performed well on general question answering as reported in section 7.1 (Candidate:No, VerifyFirst and Confidence) also perform well on the AT task. However, the balance of precision and recall differs between these models.

The Confidence method is the least conservative mechanism, with a high recall of 72.1% and a precision of 71.9%. VerifyFirst is the most conservative, with a recall of 68.1% and the highest precision of all at 73.9%. Candidate:No lies between them and is the most balanced model with respect to these metrics, with a precision of 72.4% and a recall of 71.8%. The VerifyLater model has the lowest scores on the answer triggering task as well, with a very low precision of 60.9%, which also keeps the F1 score low at 65.1%, although the recall of 69.9% is more reasonable.

7.3 Hypotheses

To test the hypotheses, we divide the question-answer pairs of the development set into four groups:

• True positives (TP) refers to the question-answer pairs that the model tagged as unanswerable and that are also unanswerable in the gold standard.

• True negatives (TN) refers to the question-answer pairs that the model did not tag as unanswerable and that are also answerable in the gold standard.

• False positives (FP) refers to the question-answer pairs that the model tagged as unanswerable, but that are in fact answerable in the gold standard.

• False negatives (FN) refers to the question-answer pairs that the model did not tag as unanswerable, but that are unanswerable in the gold standard.

7.3.1 Average Overlap

The first hypothesis that we test is: A large overlap between the words in the question and in the context document increases the risk that an unanswerable question is mistakenly given an answer.

We measure the average overlap of the tokens in the question and the context document. Duplicate tokens are also counted.

Model          All   TP    TN    FP    FN
Candidate:No   21.6  19.5  23.8  21.8  21.0
VerifyFirst    21.6  19.5  23.7  22.1  20.8
VerifyLater    21.6  19.5  23.8  21.6  20.7
Confidence     21.6  19.8  23.5  22.6  20.2
Gold           -     19.9  23.3  -     -

Table 7.4: Average overlap of question and context.

When looking at table 7.4, we can see that questions which are possible to answer have a significantly higher overlap of 23.3 on average, while questions that have no answer only have an overlap of 19.9 tokens on average.

While the true positives and true negatives of the different models are quite similar to the values of the gold data, the values for false positives and false negatives show the opposite pattern: questions that are mistakenly assumed to have an answer have a lower overlap, while questions that are mistakenly predicted to have no answer have a higher overlap. However, in contrast to the true positives and true negatives, which are very similar across the models, the values for false positives and false negatives vary a lot, which makes it difficult to draw a general conclusion.

Regarding our hypothesis, the average overlap in the false positives group is comparable to the average overlap over all question-answer pairs, slightly higher (0.2 words) in one model and notably higher (1 word) in another. So we cannot clearly confirm this hypothesis.

7.3.2 Average Length of the Questions

The second hypothesis is: Longer questions are more complex and both harder to answer and more likely to be recognized as unanswerable.

Therefore we will look at the average question length (in tokens) for our four groups.

Model          All   TP    TN    FP    FN
Candidate:No   11.3  10.9  11.6  11.6  11.2
VerifyFirst    11.3  10.9  11.6  11.6  11.2
VerifyLater    11.3  10.9  11.6  11.6  11.1
Confidence     11.3  11.0  11.5  11.8  11.1
Gold           -     11.0  11.6  -     -

Table 7.5: Average length of question.

In table 7.5 we see that the true positives group has the shortest questions of all groups on average in all models. The true negatives and the false positives have the longest questions, indicating that the answerable questions are in general the longest questions in the gold standard.

Regarding our hypothesis, we note that the false positive questions are clearly longer on average than the false negative questions and also longer than the overall average question length. So our hypothesis seems to apply here insofar as longer questions are indeed more likely to be recognized as unanswerable.

                 Question length            Answer length
Average value:   11.3                       3.2
                 % shorter   % longer       % shorter   % longer
                 58.7        41.3           72.5        27.5

Model            F1 shorter  F1 longer      F1 shorter  F1 longer
Candidate:No     63.8        60.7           56.8        43.6
VerifyFirst      63.0        59.8           58.4        47.2
VerifyLater      61.7        59.6           59.6        45.7
Confidence       64.1        60.3           56.0        44.8

Table 7.6: Comparison of the models' performance for different question and answer lengths. The upper part shows the average values and the share of examples that are shorter or longer than the average value. The lower part shows the F1 score for each model, again split into examples shorter or longer than the average value.

Note that in order to get meaningful data, questions without an answer are excluded from the answer-length statistics.


Looking at the bottom part of the table, we can see that shorter questions lead to a higher F1 score for all models. For example, for the Candidate:No model, questions that are shorter than the average value have an average F1 score of 63.8, while the longer questions only reach an F1 score of 60.7. So, looking at the question part of table 7.6, we can assume that our hypothesis holds insofar as longer questions are indeed harder to answer.

7.3.3 Average Length of the Answer

We will also have a look at the average length of the answers that appear as true negatives and false positives. The corresponding hypothesis is: Questions with longer answers are more complex and both harder to answer and harder to be recognized as unanswerable.

Model          All  TP   TN   FP   FN
Candidate:No   -    -    3.0  3.8  -
VerifyFirst    -    -    3.1  3.8  -
VerifyLater    -    -    3.1  3.8  -
Confidence     -    -    3.0  3.8  -
Gold           -    -    3.2  -    -

Table 7.7: Average length of answer.

We note that the answers of the questions that the models correctly recognize as answerable are shorter on average, with 3.0 or 3.1 tokens, compared to 3.8 tokens for the questions where the models incorrectly predicted that no answer exists. The average answer length in the gold standard is 3.2, slightly higher than in the true negatives group of our four models. So the hypothesis that questions with longer answers are harder to recognize as unanswerable seems to apply, although not with particularly strong evidence.

Looking at the answer-length part of table 7.6, all models also reach clearly higher F1 scores on the questions that have shorter answers. It is worth mentioning that the average value of 3.2 tokens per answer is not very long and that the majority of questions have even shorter answers than that.

7.3.4 Question words

We will now look at the Precision, Recall and F1 scores for seven different question words, with the following hypothesis: There are question words that are harder to answer and harder to be recognized as unanswerable. Why and How could be hard question words while Who, How many or Where are relatively easy.

Question Type   Candidate:No  VerifyFirst  VerifyLater  Confidence
                F1            F1           F1           F1
What            73.5          72.4         71.3         73.2
Where           69.7          68.4         65.3         69.7
When            75.4          70.3         68.2         75.4
Who             70.3          70.3         68.6         70.6
Why             70.5          66.7         67.7         69.6
How             67.3          66.7         64.2         67.7
How many        65.8          66.3         63.9         65.0

Table 7.8: F1 scores for answer triggering task for different question words: What, Where, When, Who, Why, How and How many.

We note that while there do seem to be easier and harder question words, the ones that are hard for the answer triggering task are not necessarily those we expected. The question words with the lowest F1 scores are How and How many; while we expected the former to be hard, we expected the latter to be easy. The third-lowest F1 score belongs to Where, which we also expected to be an easy question word.

The highest F1 score is reached for the question word What, followed by When and Who.

Question Type   Candidate:No   VerifyFirst    VerifyLater    Confidence
                P      R       P      R       P      R       P      R
What            72.7   74.4    73.5   71.3    73.3   69.4    71.9   74.5
Where           73.0   66.7    77.9   60.9    72.7   59.3    73.0   66.7
When            76.8   74.0    75.9   65.5    75.0   62.6    76.1   74.7
Who             75.3   65.9    77.4   64.5    77.3   61.6    75.8   66.1
Why             58.4   88.9    60.4   74.4    65.6   70.0    61.5   80.0
How             73.3   62.2    75.4   59.8    73.3   57.1    70.7   65.0
How many        77.4   57.1    83.1   55.1    80.4   53.1    73.7   58.2

Table 7.9: Precision and Recall scores for answer triggering task for different question words: What, Where, When, Who, Why, How and How many.

While precision and recall for What are fairly balanced across the models (two with a slightly higher precision and two with a slightly higher recall), the dominance of precision for Who, How and How many is rather extreme for all models. For Why, the dominance of recall is even stronger, with a gap of up to 30.5% for the Candidate:No model.

Question Type   Candidate:No   VerifyFirst    VerifyLater    Confidence
                EM     F1      EM     F1      EM     F1      EM     F1
What            56.8   62.8    55.2   62.0    54.7   61.3    56.5   62.6
Where           49.3   56.8    48.5   56.6    47.0   55.5    49.9   58.1
When            65.1   69.8    60.5   65.7    60.4   65.3    64.4   69.2
Who             58.8   63.2    57.6   63.1    57.0   61.8    59.4   63.9
Why             48.1   58.1    44.4   53.5    41.3   54.1    47.6   56.2
How             53.4   60.2    53.8   60.7    52.1   58.8    54.9   60.7
How Many        60.4   64.9    60.9   65.3    59.9   64.1    60.4   64.1

Table 7.10: EM and F1 scores on the question answering task for the different question words.

After these results on the answer triggering mechanisms, we now focus on the overall performance differences (F1 and EM scores on the QA task) between the question words, again with the hypothesis: There are question words that are harder to answer and harder to be recognized as unanswerable. Why and How could be hard question words while Who, How many or Where are relatively easy.

Looking at table 7.10, it is very obvious that questions asking When and How many are the easiest to answer. On the other hand, questions asking Why, How and Where are particularly difficult. Looking back at our hypothesis, the results for the question words Why and How turn out as expected. A surprising outcome, however, is that Where questions seem to be very hard, which opposes our hypothesis. It is also noticeable that questions asking Who are much easier to answer, even though Where and Who can be considered rather similar in structure, one asking for a location and the other for a person.


8 Discussion of the Results

The inclusion of answer verification mechanisms is generally extremely beneficial and leads to large performance gains over a baseline model without answer triggering. This is not surprising, as one third of the questions in the SQuAD 2.0 data set are unanswerable and are guaranteed to be answered incorrectly if no answer triggering is done. It is doubtlessly worthwhile to implement such a mechanism in a question answering model: in real-world applications it is highly probable that users ask questions that cannot be answered based on the available documents, and it is certainly desirable not to give an answer that the model is not at all confident about, as this may leave the user with false beliefs.

The better performance of the baseline on the SQuAD 1.1 data set, which contains no unanswerable questions, is not surprising either: the AV models were trained with one third unanswerable questions, and a certain degree of overfitting to the nature of the training data is to be expected in a relatively homogeneous set like SQuAD.

Regarding the choice of the answer triggering mechanism, the picture is not as clear as the general recommendation to use such a mechanism at all. Candidate:No is a good candidate as it leads to the best overall performance on the SQuAD 2.0 data set, but the Confidence and the VerifyFirst methods may still be appropriate choices if a recall-oriented or, respectively, a precision-oriented method for answer triggering is desired, as the differences in overall performance are not extremely large. The baseline system, or one of the systems that perform worse on the SQuAD 2.0 development set, may however be preferable when one is confident that the number of non-answerable questions will be very low, as these systems tend to perform better on the answerable questions alone.

The systems' performance on answer verification is higher than their overall performance, but this was to be expected: answer verification is a yes/no decision, and even a random guess on an unbalanced data set with one third unanswerable questions would already reach an F1 of 40.0%, which makes the roughly 70% F1 in the prediction of unanswerable questions look rather limited. The number of possible answer spans in a document, in contrast, is quadratic in the document length. It would therefore be worthwhile to investigate more intensively how answer triggering mechanisms can be designed in a more sophisticated way, possibly with other or additional features and more advanced semantic representations.
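For completeness, the 40.0% figure follows directly from the definition of F1: a random guesser that flags every second question as unanswerable on a set with one third unanswerable questions reaches a precision of 1/3 and a recall of 1/2, so that

    F1 = 2 · P · R / (P + R) = 2 · (1/3 · 1/2) / (1/3 + 1/2) = (1/3) / (5/6) = 0.40.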


Our findings are, furthermore, limited to question answering systems that are based on reading comprehension. We also note that many questions in the SQuAD data set are relatively easy to answer, as they do not require reasoning over multiple sentences or even multiple texts, external knowledge, or logical reasoning. The way the data set was created, with crowd workers seeing a text and creating questions whose answers must be text spans in that passage, leads to questions that can be answered in a very straightforward way. Therefore our data set can also be seen as limited.

We do not reach the results of the currently best systems on the SQuAD 2.0 leaderboard. This is partly due to the fact that we did not perform extensive parameter tuning and that we chose the parameters in a way that keeps the training speed reasonable. But as the intended contribution of this thesis was not to further improve the overall performance on the SQuAD 2.0 data set but to evaluate different answer triggering mechanisms, we did not invest further effort into improving the EM and F1 results and focused instead on building a robust system for our experiments. However, we believe that our findings will also be applicable to other, better-tuned systems.

In our hypothesis testing experiments we noted that it is hard to make general assumptions about which questions are especially hard to recognize as unanswerable. Hypotheses that seemed very intuitive to us, such as that questions with a large overlap with the respective context document are more likely to be mistakenly answered, did not apply, or did not apply clearly. It is equally hard to predict for individual question words how difficult they are to tag as unanswerable. Our qualitative analysis therefore remains rather inconclusive, as we are not able to give suggestions on which question types to pay special attention to when developing future answer verification models.


9 Conclusion

In this thesis, we devoted ourselves to the topic of reading comprehension based question answering systems built on deep neural networks. We studied different answer triggering mechanisms that deal with the question whether there is any answer to a given question in a given document.

We started by describing the foundations of the work in this thesis: recent developments in reading comprehension systems in chapter 3, and the technical background of these systems, in particular the neural network architectures underlying them, in chapter 2.

The data set that we used for our experiments, SQuAD 2.0, is a very popular large data set that contains about one third unanswerable questions. Based on this set, we developed an experimental setup to test the different answer triggering mechanisms against each other. We tested their contributions to the performance in a general question answering task as well as in an explicit evaluation of the answer triggering mechanisms with their respective precision and recall. We also tested several hypotheses about which kinds of question-answer-context triplets to pay special attention to.

We successfully equipped a recent question answering system, based on an encoder-decoder architecture with convolutional neural networks and attention mechanisms, with four different answer triggering mechanisms, and clearly improved the model over its baseline performance without answer triggering. This leads us to the conclusion that the inclusion of answer verification is highly beneficial for such a model.

While all four answer triggering mechanisms that we experimented with give a performance boost, one of them, the VerifyLater model, one of the two models that decide at the very end whether the highest-ranking answer should be output or not, shows the lowest gains both in the general question answering metrics and in the explicit answer triggering evaluation. The other three mechanisms perform significantly better, the best of them being Candidate:No, where a no-answer token is added to the context tokens and scored along with them, and Confidence, a simpler and computationally more efficient version of VerifyLater.
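To make the difference between these decision principles concrete, the following PyTorch-style sketch shows how a no-answer score can either be appended to the span scores and normalized together with them, as in Candidate:No, or how the best span probability can be compared against a threshold, as a simplified view of the Confidence mechanism. All names, shapes and the threshold value are illustrative assumptions and do not reproduce our exact implementation.

    import torch
    import torch.nn as nn

    class NoAnswerScorer(nn.Module):
        def __init__(self, hidden_dim):
            super().__init__()
            # projects a pooled context representation onto a single no-answer logit
            self.no_answer = nn.Linear(hidden_dim, 1)

        def forward(self, span_logits, pooled_context, threshold=0.5):
            # span_logits:    (batch, num_spans) scores of all candidate answer spans
            # pooled_context: (batch, hidden_dim) summary vector of the encoded context
            na_logit = self.no_answer(pooled_context)                      # (batch, 1)

            # Candidate:No: the no-answer option competes with the spans in one softmax
            probs = torch.softmax(torch.cat([span_logits, na_logit], dim=-1), dim=-1)
            no_answer_wins = probs.argmax(dim=-1) == span_logits.size(-1)

            # Confidence (simplified): abstain when the best span probability is low
            span_probs = torch.softmax(span_logits, dim=-1)
            low_confidence = span_probs.max(dim=-1).values < threshold

            return no_answer_wins, low_confidence

In our models, this decision is of course part of the trained architecture rather than a post-hoc step; the sketch only illustrates the two decision principles.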

While the Candidate:No model is balanced between precision and recall, the Confidence model, with slightly worse results, tends to be recall-oriented. If a precision-oriented model is desired, VerifyFirst may be a good choice; in this model, the no-answer score is predicted at an earlier step than in the three other systems.


While these overall results are insightful and encouraging, the hypothesis testing remained rather inconclusive and did not give us useful information about what to attend to in future answer triggering systems. However, we showed that for the general question answering performance it is worth paying more attention to question words that have proven to be especially hard, such as Why, How and Where, and to questions with longer answers.

For future work, we suggest that it would be very valuable to explore handling the AT mechanisms with an extra loss, trained independently of the rest of the system. Handling the AT task more explicitly, instead of relying on the information already incorporated in the model, could help to improve the performance on this task, but at the cost of making the architecture more complex and of increasing the training time.
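One simple variant of such a setup, in which the answerability decision receives its own loss term next to the span loss rather than being trained fully independently, could look like the following sketch; the weighting factor alpha and all names are hypothetical and would have to be tuned.

    import torch.nn.functional as F

    def joint_loss(start_logits, end_logits, na_logit,
                   start_gold, end_gold, is_unanswerable, alpha=0.5):
        # span loss only over the answerable examples
        answerable = ~is_unanswerable
        if answerable.any():
            span_loss = (F.cross_entropy(start_logits[answerable], start_gold[answerable])
                         + F.cross_entropy(end_logits[answerable], end_gold[answerable]))
        else:
            span_loss = start_logits.new_zeros(())
        # separate answer triggering loss, computed on every example
        at_loss = F.binary_cross_entropy_with_logits(na_logit, is_unanswerable.float())
        return span_loss + alpha * at_loss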

Other valuable experiments could include the use of transfer learning, as well as training and/or testing the AT mechanisms on other, possibly more challenging and more diverse question answering data sets, or on combinations of those. The systems could also be included in a conversational AI system that handles sequential questions and other dialogue elements, in order to test their performance in a practical setting.


