
Zero-shot, One Kill: BERT for Neural Information Retrieval

Using Wikipedia-based Weak Supervision for Passage-(Re)ranking and Question Answering

Stergios Efes

Uppsala University

Department of Linguistics and Philology Master Programme in Language Technology

Master's Thesis in Language Technology, 30 ECTS credits, June 9, 2021

Supervisor:


Abstract

[Background]: The advent of Bidirectional Encoder Representations from Transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led the field of information retrieval (IR) to experience a revolution (Lin et al., 2020). The BERT-based retrieval model of Nogueira and Cho (2019) became, at the time their paper was published, the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco remains hard, because neural approaches often require a vast amount of training data to perform effectively, which is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, given an unlabelled dataset, labels are generated automatically using an existing model, and a machine learning model is then trained on the artificial "weak" data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, together with BM25 to retrieve the relevant passages for each of the user queries.

A drawback of this approach is that it is hard to obtain query logs for every single domain. [Objective]: This thesis proposes an intuitive approach to addressing the shortage of data in domains with limited or no data at all, through transfer learning in the context of IR. We leverage Wikipedia's structure to create a Wikipedia-based generic IR training dataset for zero-shot neural models. [Method]: We create "pseudo-queries" by concatenating the title of each Wikipedia article with each of its section titles, and we consider the associated section's passage as the relevant passage of the pseudo-query.

All of our experiments are evaluated on a standard collection: MS Marco, a large-scale web collection. For our zero-shot experiments, our proposed model, called "Wiki", is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training.

In our second line of experiments, we explore the benefits gained by pre-fine-tuning on the Wikipedia-based IR dataset and further fine-tuning on in-domain data. Our proposed model, "Wi+Ma", is a BERT model pre-fine-tuned on the Wikipedia-based dataset and further fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco. [Results]: Results regarding our first experiments show that our BERT model trained on the Wikipedia-based IR dataset, called "Wiki", achieves a performance of 0.197 in MRR@10, which is about 10 points higher than a BERT model with default weights; in addition, results on the development set indicate that the "Wiki" model performs better than a BERT model trained on in-domain data when the in-domain data amounts to between 10k and 50k instances. Results regarding our second line of experiments show that pre-fine-tuning on the Wikipedia-based IR dataset benefits later fine-tuning steps on in-domain data in terms of stability. [Conclusion]: Our findings suggest that transfer learning for IR tasks by leveraging the generic knowledge incorporated in Wikipedia is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25.


Contents

1 Introduction
   1.1 Purpose and Research Questions
   1.2 Outline

2 Background
   2.1 Information Retrieval Tasks
      2.1.1 Ad hoc Retrieval
      2.1.2 Question Answering
      2.1.3 Other Information Retrieval Tasks
   2.2 Evaluation in Information Retrieval
      2.2.1 Mean Average Precision
      2.2.2 Mean Reciprocal Rank
   2.3 Traditional Information Retrieval
      2.3.1 Vector Space Model
      2.3.2 BM25
      2.3.3 Other Traditional Information Retrieval Models
   2.4 Learning-To-Rank Information Retrieval
   2.5 Neural Information Retrieval
      2.5.1 A Unified Model Formulation of the Neural Ranking Models
      2.5.2 Model Architectures
   2.6 Bidirectional Encoder Representations from Transformers
      2.6.1 Traditional Language Modelling
      2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers
      2.6.3 BERT Language Modelling
   2.7 Weak Supervision for Ranking
      2.7.1 Wikipedia-Based Weak Supervision Signals
      2.7.2 Previous Work

3 Methodology
   3.1 Task
   3.2 BERT for Passage-Reranking
   3.3 Datasets
      3.3.1 MS Marco Training Dataset
      3.3.2 Wikipedia-based Training Dataset
      3.3.3 Development and Test set
   3.4 Creating the Wikipedia-based dataset
   3.5 Data pre-processing
   3.6 Training on TPUs
   3.7 Experimental Systems
   3.8 Experimental Settings
   3.9 Evaluation Methods

4 Results
   4.1 Accuracy
   4.2 Convergence
   4.3 Stability
   4.4 Analysis

5 Conclusion


1 Introduction

Information retrieval (IR) technologies play an important role in people's daily lives. Millions of users search the web every day, and billions of queries are processed daily by major search engines such as Google and Bing.

The effectiveness of such search engines rests upon two factors: the use of neural networks and the large amount of click-log data they are trained on (Zamani, 2019). Such text retrieval models need to acquire an understanding of raw text documents and to learn a ranking function $f(q, d)$ which, given a query $q$ and a document $d$, outputs the probability of the document being relevant (Guo et al., 2020).

Neural IR started to flourish only after the publication of large datasets for passage-reranking (Nogueira and Cho, 2019), such as MS Marco from Microsoft Bing (Bajaj et al., 2016). It has been noted that the absence of such large datasets made it impossible for neural IR to compete with classical IR (Lin, 2019). Despite the recent advances in neural IR (Nogueira and Cho, 2019; Hofstätter et al., 2020; S. Han et al., 2020), it has to be emphasized that they took place in a scenario where there is an abundance of training signals, namely MS Marco. Without such large datasets (which consist either of click-logs or of manually labelled human judgments), the effectiveness of neural networks is highly questioned (Lin, 2019; W. Yang et al., 2019).

For such a setting, when no labeled data are available for training, recent research has focused on training neural ranking models using weak supervision, in which labels are acquired automatically by other means. For example, Dehghani et al. (2017) used an unsupervised IR model, BM25, while K. Zhang et al. (2020) used anchor text. Addressing the lack of labelled data, Frej et al. (2019) take it one step further by utilizing Wikipedia to build large-scale IR test collections automatically.

This thesis follows the aforementioned line of research: it aims to investigate to what extent Wikipedia-based weak supervision signals can be used to train a generic, efficient neural ranking retrieval model (Radford et al., 2019).

1.1 Purpose and Research Questions

The purpose of this work is to investigate the effectiveness of transfer learning in deep neural networks in the context of IR, in order to address the data bottleneck one faces when training a neural retrieval model for a domain with limited data or no data at all. BERT models have been shown to be an effective approach to transfer learning, being pre-trained on large amounts of raw text data and then fine-tuned for a specific task (Devlin et al., 2018). In our case, we explore the possibility of transfer learning for IR by fine-tuning a BERT model with Wikipedia-based weak supervision for the task of passage-reranking. We assume that, since the Wikipedia corpus is characterised by generic knowledge, training a BERT model on a Wikipedia-based IR dataset would potentially help it incorporate generic information retrieval knowledge that would prove beneficial for later retrieval tasks.

We try to quantify how beneficial such an approach can be by answering the following research questions:


• Does a BERT model, fine-tuned with Wikipedia-based weak supervision, perform better in terms of accuracy when tested on out-of-domain data (zero-shot setting) in comparison to a default BERT model that has not been fine-tuned at all?

• If we further fine-tune this BERT model (that has already been trained with Wikipedia-based weak supervision) on in-domain data, will there be any improvement in terms of accuracy, convergence or stability in comparison to a model that has only been trained with in-domain data?

1.2 Outline

Chapter 2 describes the different tasks in IR, paying particular attention to the question answering task and to one of its components that is central in this thesis, passage-reranking. In addition, it describes the different approaches to IR, such as traditional IR, learning-to-rank and neural IR, and how evaluation is performed in IR. After the necessary IR concepts are laid out, a presentation of language modelling (BERT) and of weak supervision follows. Chapter 3 presents the methodology used to perform passage-reranking with BERT and presents the datasets used: Wikipedia, which is pre-processed and used to create an artificial query-passage dataset for training the BERT model, and MS Marco, which is used both for training and testing. It also describes the experimental settings along with the baseline(s). Chapter 4 reports, analyzes and discusses the results from our experiments. Chapter 5 presents the essential contributions of this work and concludes the thesis.


2 Background

The following sections aim to present a coherent yet short overview of the concepts and theory needed to understand neural information retrieval. The literature does not always agree on terminology and different names for the same concepts are used interchangeably. For this reason, it felt necessary to clarify the terms used in this thesis.

Traditional IR or classical IR refers to the basic retrieval systems used in IR (for instance, the vector space model and the BM25 retrieval algorithm). Such methods focus on word occurrences for measuring relevance. In this thesis, the term traditional IR was chosen to highlight the absence of machine learning in these methods.

Learning to rank (LETOR) or machine-learned IR refers to a paradigm in IR that seeks to employ machine learning approaches to solve IR problems. By LETOR the literature seems to refer to 1) the framework, meaning the formalisation of the IR problem of ranking from the perspective of machine learning, and 2) the use of traditional machine learning approaches to solve it, such as support vector machines, or simple neural networks such as the perceptron. These kinds of methods focus on training ML models on human-labelled datasets using hand-crafted features, as will be explained more thoroughly later. For this work the term learning to rank (LETOR) is adopted because it is broadly used in the literature, as opposed to machine-learned IR.

Neural IR, neural retrieval, Neu-IR, or neural LETOR refers to the use of deep neural network (DNN) architectures to address IR problems from the perspective of the LETOR framework. The DNNs are trained on human-labelled datasets, but the feature learning is done automatically by the network, in contrast to the hand-crafted features mentioned before. The term neural IR is used in this thesis since it is broadly used in the literature.

2.1 Information Retrieval Tasks

In this section, we describe the task of question answering in IR and how it relates to this thesis. We also give a brief description of the ad hoc retrieval task, since it is the most prominent task in the IR field, and we briefly mention the remaining IR tasks.

2.1.1 Ad hoc Retrieval

The most notable of the retrieval tasks is the ad hoc retrieval (Guo et al., 2020), in which a user has an information need for which they issue a query to a retrieval system, which in turn measures the relevance between the query and the documents in the collection and then retrieves the top N scoring documents. A major difficulty in ad hoc retrieval is that the incoming queries usually have an unclear intent and range from a few words to a few sentences (Mitra and Craswell, 2017).

2.1.2 Question Answering

The task of question answering (QA) is to automatically answer a user's questions issued in natural language using some information resources (Guo et al., 2020). The information resources could be either structured data (such as a knowledge base) or unstructured data (for example, web pages or documents), which is what we are concerned with in this thesis. Furthermore, there are several task formats for QA, such as passage-reranking (Nogueira and Cho, 2019), passage-retrieval, answer span locating (Rajpurkar et al., 2016) and answer synthesizing from multiple sources (Mitra et al., 2016).

As far as passage-reranking is concerned, it has to be noted that the literature does not seem to agree on the terminology. Some papers consider it not an independent task per se but rather a post-retrieval step of the passage-retrieval task (Aktolga et al., 2011). Others, such as Nogueira and Cho (2019), treat it as an independent task. In addition, the terms passage-ranking and passage-reranking seem to be used interchangeably in the literature. In any case, passage-reranking has become a crucial component of any QA system (Cui et al., 2005).

Most QA systems employ a pipeline structure that consists of several modules to get answers (Nogueira and Cho, 2019):

• passage-retrieval: in this phase, $n$ relevant passages are retrieved using an inexpensive method, such as BM25 or TF-IDF.

• passage-reranking: the $n$ retrieved passages are reranked using a more computationally expensive method.

• answer span locating: the top 5-10 passages are used as candidates by an answer extraction algorithm for marking up the answer location in the passage(s).

2.1.3 Other Information Retrieval Tasks

There are many other IR tasks as well such as product search (Brenner et al., 2018), sponsored search (Fain and Pedersen, 2006), community question answering (L. Yang et al., 2013), and automatic conversation (Ji et al., 2014), but they are outside the scope of this work.

2.2 Evaluation in Information Retrieval

Information retrieval systems are evaluated using test collections (Schütze et al., 2008) which consist of:

• A document collection which is the set of documents that the IR system indexes.

• A set of queries that express the information needs.

• A set of ground truth labels, which are relevance judgments, a binary assessment of a document being relevant or irrelevant with respect to a query.

IR evaluation revolves around the concept of relevant and non-relevant documents. Given a query, a document is classified as relevant or not; the gold standard or ground truth refers to the relevance judgments made in this binary classification.

The typical metrics used in IR are adjusted versions of precision, recall and the F measure; they are "adjusted" because these are set-based measures computed over unordered sets, while in the IR context the aim is to evaluate ranked results.

Below we present the most common metrics used in IR: the mean average precision (MAP) and, the one used in this thesis, the mean reciprocal rank (MRR).


2.2.1 Mean Average Precision

One of the most commonly used metrics in the IR community is MAP. Given a query, average precision is defined as the average of the precision values obtained for the top $k$ retrieved documents, and this value is then averaged over all queries. Let $\{d_1, \ldots, d_{m_j}\}$ be the list of relevant documents for query $q_j \in Q$ and $R_{j,k}$ the set of retrieved results from the top result down to document $d_k$:

$$\mathit{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathit{Precision}(R_{j,k}) \qquad (2.1)$$

2.2.2 Mean Reciprocal Rank

The reciprocal rank (RR) metric calculates the reciprocal of the rank at which the first relevant document was retrieved. RR is 1 if a relevant document was retrieved at rank 1, 0.5 if a relevant document was retrieved at rank 2 and so on. Averaged across all queries, this metric is called the mean reciprocal rank (MRR) (Craswell, 2009):

$$\mathit{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \qquad (2.2)$$

MRR is associated with a user model where the user only wishes to see one relevant document (Craswell, 2009). The metric is very sensitive at the top of the ranking: moving from rank 1 to rank 2 halves the score (from 1 to 0.5), whereas moving from rank 100 to rank 1,000 changes it by only 0.009.
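For illustration, the MRR@10 computation can be sketched in a few lines of Python (a sketch of ours, not part of the official MS Marco evaluation scripts):

    def mrr_at_10(first_relevant_ranks):
        # MRR@10 given, for each query, the 1-based rank of its first relevant
        # passage, or None if no relevant passage appears in the top 10.
        total = 0.0
        for rank in first_relevant_ranks:
            if rank is not None and rank <= 10:
                total += 1.0 / rank
        return total / len(first_relevant_ranks)

    # First query answered at rank 1, second at rank 4, third outside the top 10:
    print(mrr_at_10([1, 4, None]))  # (1 + 0.25 + 0) / 3 = 0.4167 approximately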

2.3 Traditional Information Retrieval

The main tendency in traditional IR is to build a sparse term-document matrix out of term frequencies. Such retrieval systems are also called bag-of-words models, since they ignore the ordering of the words in the documents.

2.3.1 Vector Space Model

The intuition behind the vector space model (VSM) is that documents and queries (in this setting a query is treated as a document) can be represented as vectors in a multi-dimensional space and compared there using the cosine similarity. More specifically, $\vec{d}$ denotes the vector derived from a document $d$ and $\vec{q}$ the vector derived from a query $q$, where each component of the vector corresponds to a dictionary term and is calculated using the tf-idf scheme described below (Schütze et al., 2008).

Tf-idf weighting

Using the tf-idf weighting scheme a weight is assigned to a term 𝑡 in a document 𝑑 using the following formula:

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \qquad (2.3)$$

where:

$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{j} f_{j,d}}$$

in which the number of times $f_{t,d}$ that term $t$ appears in document $d$ is divided by the total number of terms in the document,

and

$$\mathrm{idf}_t = \log \frac{|D|}{\mathrm{df}_t}$$

where, in idf (inverse document frequency), $|D|$ denotes the number of documents in the collection and $\mathrm{df}_t$ the number of documents in which term $t$ appears.

Cosine similarity

Having derived the document vector $\vec{d}$ and the query vector $\vec{q}$, the cosine similarity between these two vector representations is computed to quantify the similarity between them:

$$\mathrm{cosine}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \qquad (2.4)$$
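To make the above concrete, the following toy Python sketch (ours; the collection and the tokenisation are deliberately simplistic) builds tf-idf vectors for a three-document collection and scores a one-word query against each document with cosine similarity:

    import math
    from collections import Counter

    docs = ["dogs eat bones", "cats drink milk", "dogs love humans"]
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})

    def tf_idf_vector(tokens):
        counts = Counter(tokens)
        vec = []
        for term in vocab:
            tf = counts[term] / len(tokens)                  # term frequency
            df = sum(1 for doc in tokenized if term in doc)  # document frequency
            idf = math.log(len(docs) / df) if df else 0.0    # inverse document frequency
            vec.append(tf * idf)
        return vec

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    query_vec = tf_idf_vector("dogs".split())
    for doc, tokens in zip(docs, tokenized):
        print(doc, cosine(query_vec, tf_idf_vector(tokens)))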

2.3.2 BM25

The BM25 scoring formula stems from the probabilistic relevance framework (PRF), and it is considered the state of the art of traditional IR by some researchers (Robertson and Zaragoza, 2009). Even though several versions of BM25 have been developed over the years (Kamphuis et al., 2020), the basic formula is (Aklouche et al., 2019):

$$\mathit{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \times \frac{\mathrm{tf}_{t,d} \times (k_1 + 1)}{\mathrm{tf}_{t,d} + k_1 \times \left(1 - b + b \times \frac{dl}{avgdl}\right)} \qquad (2.5)$$

where $k_1$ and $b$ are constants tuned on a labelled dataset, $dl$ is the document length, and $avgdl$ is the average document length over all the documents in the collection.
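As an illustration, the basic BM25 score of equation (2.5) can be sketched in Python as follows (a toy implementation of ours; production systems rely on optimised toolkits such as Lucene or Anserini):

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=0.9, b=0.4):
        # Score one document for a query with the basic BM25 formula (eq. 2.5).
        # doc_freq maps a term to the number of documents containing it;
        # num_docs is the collection size; avgdl the average document length;
        # k1 and b are the usual free parameters (the values here are arbitrary).
        dl = len(doc_terms)
        score = 0.0
        for t in query_terms:
            tf = doc_terms.count(t)
            if tf == 0 or t not in doc_freq:
                continue
            idf = math.log(num_docs / doc_freq[t])
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
        return score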

2.3.3 Other Traditional Information Retrieval Models

Since the aim of this thesis is not to give a comprehensive guide to traditional IR, we briefly mention only two other approaches that we think are prominent in the field of IR.

• In the language modelling approach, a document is assumed to have its own language model and the task is to calculate the probability of a document $d$ emitting a query $q$ (Banerjee and H. Han, 2009).

• In Bayesian networks, documents, terms, and queries are represented as nodes and there are arcs that link them together. Using the prior document probabilities and the conditional ones from the interior nodes a posterior probability can be computed (Turtle and Croft, 1989).

2.4 Learning-To-Rank Information Retrieval

Learning-to-rank (LETOR) refers to the application of traditional machine learning in the field of information retrieval for (re)ranking a list of documents given a query. Many ML models have been employed over the years for the LETOR task, for example support vector machines (Yue et al., 2007) and boosted decision trees (Burges et al., 2005).

LETOR makes use of training data annotated with human relevance labels to train for a ranking task. The main thing that distinguishes LETOR models from the neural approaches is that LETOR models employ hand-crafted features for representing the query-document pairs (Mitra, Craswell, et al., 2018) – something we will address in more detail in the next section, where the neural approaches are described. Typically, such hand-crafted features fall under one of the following three categories: query-independent or static features (e.g. document length or web-link length), query-dependent or dynamic features (e.g. BM25), and query-level features (e.g. query length).

The different LETOR approaches can be categorised based on their training objectives (T.-Y. Liu, 2011):

• Point-wise method: It is the earliest method used. In the point-wise approach, the loss function looks at one document at a time and scores the document independently of the other documents. A regression model is typically trained on labeled data to predict a numerical relevance label for a document given a query (Mitra, Craswell, et al., 2018).

• Pair-wise method: In the pair-wise approach, the loss function looks at two documents at a time and tries to derive the optimal ordering for them. The ranking problem is reduced to a binary classification problem to predict the most relevant document (Mitra, Craswell, et al., 2018).

• List-wise method: In list-wise approaches, the entire set of documents is taken as an input and the model predicts the ground truth labels (T.-Y. Liu, 2011).

2.5 Neural Information Retrieval

Neural IR refers to the use of deep neural architectures to address the IR tasks. A key difference between deep architectures and the LETOR approaches is that, in contrast to the manual feature engineering demanded by the LETOR methods, deep neural networks learn the features needed for training in an unsupervised way – at the cost of training more complex models though (Mitra, Craswell, et al., 2018).

2.5.1 A Unified Model Formulation of the Neural Ranking Models

As we mentioned earlier in the thesis, neural IR is studied under the LETOR framework (Guo et al., 2020). For this reason, the literature usually gives a unified formulation of neural ranking models from a generalized view of LETOR problems, which we also describe below.

Following Guo et al. (2020), suppose there is a set of queries $Q$, which could be any type of text query, natural language question, or input utterance, and a set of documents $D$, which could be any type of text document, answer passage, or snippet from a web page or real document. Let also $Y = \{1, 2, \ldots, l\}$ be a set of labels that represent relevance degrees, with a total order between the grades: $l > l-1 > \ldots > 1$.

Now let $q_i$ be the query in the $i$-th position of $Q$, and $D_i$ the set of documents associated with $q_i$, such that $D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}$, and $Y_i$ the set of labels associated with $q_i$, such that $Y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$, with $y_{i,j}$ being the relevance degree of $d_{i,j}$ with respect to $q_i$.

Let $f(q_i, d_{i,j})$ be a ranking function that assigns a relevance score to a pair of a query and a document. Lastly, let $L(f; q_i, d_{i,j}, y_{i,j})$ be the loss function that calculates the loss between the prediction of the function $f$ and the label. The objective is then to find the optimal ranking function $f^*$ by minimizing the loss function over a labelled dataset:

$$f^* = \arg\min_f \sum_{i} \sum_{j} L(f; q_i, d_{i,j}, y_{i,j}) \qquad (2.6)$$

Without loss of generality, we can further abstract the ranking function $f$ to the following unified formulation:

$$f(q, d) = g(\psi(q), \phi(d), \eta(q, d)) \qquad (2.7)$$

where $q$ and $d$ are the inputs, $\psi$ and $\phi$ are representation functions which extract features from $q$ and $d$, $\eta$ is the interaction function which extracts features from the pair $(q, d)$, and $g$ is the evaluation function which computes the relevance score based on the feature representations.

Using this generalised scheme, we can now describe the differences between the LETOR approaches and neural IR. In the LETOR approaches the inputs of the function $f$ are usually raw texts, while in neural IR these inputs could be either raw texts or word embeddings (the mapping function is not included in the unified formulation since it is considered a basic input layer).

Turning to the $\psi$, $\phi$ and $\eta$ functions, in the LETOR approaches they are usually set to be fixed functions, while the function $g$ is a machine learning model (for instance a gradient boosting tree) which is learned from the training data. On the other hand, neural ranking models encode all four functions $\psi$, $\phi$, $\eta$ and $g$ in the network, so that they can be learned automatically from the data (Guo et al., 2020).
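As a toy illustration of equation (2.7) (our own sketch, not taken from Guo et al. (2020)), the following Python snippet fixes simple, hand-set choices for the four functions: bag-of-words representations for $\psi$ and $\phi$, term overlap for $\eta$, and a weighted sum for $g$. In a neural ranking model, all four would instead be parameterised by the network and learned from data:

    from collections import Counter

    def psi(query):                       # representation function for the query
        return Counter(query.split())

    def phi(document):                    # representation function for the document
        return Counter(document.split())

    def eta(query, document):             # interaction function over the pair (q, d)
        return sum((psi(query) & phi(document)).values())   # term overlap count

    def g(q_repr, d_repr, interaction):   # evaluation function -> relevance score
        # A trivial hand-set scoring rule, standing in for a learned model.
        return 0.1 * sum(q_repr.values()) + 0.1 * sum(d_repr.values()) + interaction

    def f(query, document):               # unified ranking function, eq. (2.7)
        return g(psi(query), phi(document), eta(query, document))

    print(f("dogs eat bones", "dogs love bones"))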

2.5.2 Model Architectures

Depending on the nature of the interaction function $\eta$ and on the different assumptions over the features (extracted by the representation functions $\phi$ and $\psi$) described in section 2.5.1, deep ranking models can be divided into the following two architectures (Guo et al., 2020): representation-focused and interaction-focused.

Figure 2.1: Model architectures: a) Representation-focused, b) Interaction-focused (Guo et al., 2016).

Representation-focused architecture

The underlying assumption in representation-focused models is that relevance depends on the compositional meaning of the input texts. Thus, models of this category usually employ complex representation functions $\phi$ and $\psi$ (e.g. deep neural networks) to derive high-level representations over the text inputs $q$ and $d$, but no interaction function $\eta$, and define a simple evaluation function $g$, for instance cosine similarity, for calculating the relevance score (Guo et al., 2020).


Interaction-focused architecture

On the other side of the spectrum, the underlying assumption in interaction-focused models is that relevance depends upon the interaction between the input texts. Thus, models of this category employ an interaction function $\eta$ along with simple representation functions $\phi$ and $\psi$, while they define a complex evaluation function $g$ (e.g. a deep neural network). Depending on the kind of interaction function used, the interaction-focused architecture can be further categorised (Guo et al., 2020) into:

• Non-parametric interaction functions, which measure the closeness between the inputs without learnable parameters; some of them are defined over each pair of input vectors.

• Parametric interaction functions that learn the similarity function from the data.

2.6 Bidirectional Encoder Representations from Transformers

Having described the main IR concepts needed for the thesis, we proceed to the area of language modelling (LM), which constitutes the basis of the Bidirectional Encoder Representations from Transformers (BERT) models used later in our experiments.

2.6.1 Traditional Language Modelling

Language modelling (LM) has been the basis for various NLP tasks. In machine translation, for example, a LM is used to improve the fluency of the output translations by choosing the most probable output (Jing and Xu, 2019). The first language models were rule-based, until the advent of statistical language models (1980s), which assign a probability distribution over sequences of words (Jing and Xu, 2019):

$$P(s) = P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times \ldots \times P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \qquad (2.8)$$

A major drawback of this n-gram LM approach was the curse of dimensionality: for a LM of this kind with a vocabulary of size 10,000, there are potentially $10{,}000^{n-1}$ free parameters.

2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers

To address the aforementioned "curse" of data sparsity, neural networks were introduced for language modelling in continuous space (Bengio et al., 2000). Such LMs create a dense representation of the language, in contrast to the sparse representation of the statistical approach mentioned before. Since then, many techniques have been developed in neural language modelling, but they are beyond the scope of this thesis. We only mention recurrent neural networks (RNNs) (Mikolov et al., 2010) and the one that most concerns this thesis, attention mechanisms.

Attention mechanisms are a set of coefficients used by the neural network to decide which part of the input to focus on. LMs equipped with attention mechanisms use the long history more efficiently (Bahdanau et al., 2014; Mei et al., 2016). Formally, the attention vector $\Delta_t$ is calculated from the representations $\{r_1, r_2, \ldots, r_{t-1}\}$:

$$\Delta_t = \sum_{i=0}^{t-1} \alpha_{ti}\, r_i \qquad (2.9)$$
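A minimal numerical illustration of equation (2.9) in Python (ours; softmax-normalised scores are used here as the coefficients $\alpha_{ti}$, which is one common choice):

    import math

    def attention_vector(representations, scores):
        # Weighted sum of past representations (eq. 2.9).
        # representations: list of vectors r_0 ... r_{t-1}
        # scores: raw relevance scores, turned into coefficients alpha via softmax.
        exps = [math.exp(s) for s in scores]
        alphas = [e / sum(exps) for e in exps]
        dim = len(representations[0])
        return [sum(a * r[k] for a, r in zip(alphas, representations)) for k in range(dim)]

    print(attention_vector([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.5]))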

Due to the success of attention mechanisms, Vaswani et al. (2017) proposed an architecture based solely on attention mechanisms: the Transformer, which consists of an encoder and a decoder. The Transformer encoder is the basis of the BERT language models.

2.6.3 BERT Language Modelling

BERT is a language model based on the aforementioned Transformer architecture. As the authors state, it is designed to "pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" (Devlin et al., 2018).

Input and output representation

BERT can cope with either a single sentence or a pair of sentences and is thus able to handle a variety of downstream tasks. An example is given in Figure 2.2.

The first element of every sequence is the [CLS] token. In classification tasks the final hidden state of that token is used as the aggregate sequence representation. A pair of sentences is concatenated into a single sequence using the [SEP] token to separate them. In addition, a learned embedding is added for every token indicating whether it belongs to Sentence A or B. Thus, for a single token, its input representation is obtained by summing the corresponding token, segment, and position embeddings.

Figure 2.2: BERT input representation (Devlin et al., 2018).

BERT framework

The BERT framework consists of two phases: pre-training and fine-tuning. During the first phase, the BERT model is pre-trained on large amounts of raw text data in an unsupervised manner to acquire deep language representations; during the second phase, it is fine-tuned on a supervised task. Most of the time, the wording "training BERT" refers only to the second phase of fine-tuning, since BERT models pre-trained on general language data are freely available.

More specifically, the pre-training step of BERT involves training it on the following two unsupervised tasks: masked LM and next sentence prediction (NSP). In masked LM, a percentage of the input tokens are masked and the model tries to predict those masked tokens. In that way the BERT model acquires knowledge of how words are related to each other. The aim of NSP, on the other hand, is to acquire knowledge about the relationship between two sentences, since many downstream tasks such as question answering require such knowledge. Thus, the BERT model is trained on a binary classification task of predicting whether the next sentence follows the previous one.


Figure 2.3: Pre-training and fine-tuning (Devlin et al., 2018).

The fine-tuning step of BERT is straightforward, since the Transformer architecture with its self-attention mechanism allows BERT to model other downstream tasks by simply plugging the appropriate inputs and outputs into sentences A and B respectively and fine-tuning all of the parameters end-to-end. For instance, in the question answering task, sentence A is the question and sentence B the relevant answer. A BERT classifier is then trained on such inputs.

2.7 Weak Supervision for Ranking

Weak supervision is a sub-field of machine learning in which the basic assumption is that we can cheaply acquire imperfect, noisy labels in an unsupervised fashion and use them as a weak supervision signal for training a classifier (Hernández-González et al., 2016). Weak supervision has been applied to many NLP tasks such as relation extraction (Bing et al., 2015; X. Han and Sun, 2016), knowledge base completion (Hoffmann et al., 2011), and sentiment analysis (Severyn and Moschitti, 2015).

In the context of neural IR, weak supervision is used to address the absence of the large labeled datasets needed to train neural networks efficiently. Such datasets are expensive to obtain, and thus unsupervised learning is considered a long-standing goal for several applications (Dehghani et al., 2017). More specifically, weak supervision in neural IR means that we take advantage of an existing unsupervised IR model, such as BM25, which we use as a "pseudo-labeler". Given a target collection of documents and a set of training queries, the pseudo-labeler is used to rank the documents for every query in the training set. The objective is to train a classifier using the scores obtained by the "pseudo-labeler" as weak supervision signals. As an example, for a query "dogs", our pseudo-labeler might retrieve the following three documents: "dogs are good", "dogs eat bones", "dogs love humans". The created dataset will then be a set of (query, relevant passage) tuples: (dogs, dogs are good), (dogs, dogs eat bones), (dogs, dogs love humans). These tuples are used to train a binary classifier to distinguish between relevant and irrelevant documents.
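The pseudo-labelling step described above can be sketched as follows (an illustration of ours; the scorer argument stands in for a real unsupervised ranker such as BM25, and the word-overlap scorer used in the example is only a toy):

    def build_weak_dataset(queries, collection, scorer, top_k=3):
        # Create weak (query, passage) training tuples by ranking the collection
        # with an unsupervised scorer (e.g. BM25) and keeping the top-k passages.
        dataset = []
        for q in queries:
            ranked = sorted(collection, key=lambda doc: scorer(q, doc), reverse=True)
            dataset.extend((q, doc) for doc in ranked[:top_k])
        return dataset

    # Toy usage with word overlap standing in for a real BM25 scorer.
    overlap = lambda q, d: len(set(q.split()) & set(d.split()))
    print(build_weak_dataset(["dogs"],
                             ["dogs are good", "cats drink milk", "dogs eat bones"],
                             overlap, top_k=2))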

2.7.1 Wikipedia-Based Weak Supervision Signals

Frej et al. (2019) used Wikipedia to create test collections for IR by utilizing Wikipedia's internal linkage to create query topics. This thesis is inspired by the aforementioned work in using Wikipedia as a source for automatically creating an artificial training dataset for a neural IR classifier, in which the weak labels are constructed not by using an unsupervised algorithm, as Dehghani et al. (2017) did with BM25, but rather by exploiting Wikipedia's internal structure. More details regarding the methodology are presented in chapter 3.

2.7.2 Previous Work

Weak supervision in IR is an active area of research, and several weakly-supervised alternatives have been explored so far.

In Dehghani et al. (2017), the authors utilized BM25 to retrieve documents to construct their weak training dataset. K. Zhang et al. (2020) used anchor texts and their linked web pages to construct their weak supervision signals. Ma et al. (2020) introduced a zero-shot retrieval approach using synthetic query generation, training a generative model on community QA data from a different domain. Frej et al. (2019) exploited Wikipedia's internal linkage to create query topics. Nogueira and Cho (2019) trained a BERT model for passage re-ranking and achieved state-of-the-art results on the MS Marco dataset.

This thesis is mostly inspired by the work of Nogueira and Cho (2019) and Frej et al. (2019), though it differs from both substantially. Nogueira and Cho (2019) train a BERT model with supervision on MS Marco (a human-labelled dataset that we describe in detail in the next chapter) for the passage-reranking task and evaluate on MS Marco, while in our work we first fine-tune the BERT model with Wikipedia-based weak supervision and then evaluate in a zero-shot fashion on the MS Marco dataset; in addition, after training on Wikipedia, we further fine-tune our model on the MS Marco dataset and evaluate on MS Marco again. Our main differences from Frej et al. (2019), on the other hand, lie in the methodology for creating the weak supervision signals from Wikipedia, and in the fact that they also use Wikipedia for evaluating their models. Specifically, Frej et al. (2019) implement a rather involved method for creating the weak supervision signals by utilizing the internal linkage of Wikipedia to build an IR collection, consisting of a training and a test set, and then perform various experiments on it, while in our work we utilise the internal structure of Wikipedia's articles to build the weak supervision signals.

The next chapter will explain our procedure in detail.


3 Methodology

In this chapter, we explain in detail our methodology for creating the Wikipedia-based IR training dataset and training the BERT retrieval models. As the basis for all of our systems, a $\mathrm{BERT}_{SMALL}$ model is used (downloaded from the official GitHub repository of Google AI, https://github.com/google-research/bert), and all the pre-processing and training is done using the TensorFlow library (https://www.tensorflow.org/).

The architectural overview of our pipeline is as follows: (1) Wikipedia-based dataset creation, (2) MS Marco preparation, (3) Data pre-processing, (4) TPU training, (5) Experimental systems.

• Wikipedia-based dataset creation: We download the latest Wikipedia dump which we pre-process following the methodology described in section 3.4 to create the Wikipedia training dataset.

• MS Marco preparation: After downloading the training and development datasets of MS Marco, we divide the development set into two sets: 100 queries for the development set and 6,880 queries for the test set, as explained in section 3.3.

• Data pre-processing: Before training our different BERT models, all the training data is converted first to the input format that BERT requires, and then to the TFRecord format, which speeds up the input pipeline on the TPUs.

• TPU training: Training a BERT model on a large amount of data is a tedious and computationally expensive procedure. For this reason, we train all models on Google Cloud using its TPUs.

• Experimental systems: Having both Wikipedia and MS Marco in TFRecord format, we train three different BERT models: (1) a model trained on Wikipedia data, (2) a model trained on both Wikipedia and MS Marco data, and (3) a model trained only on MS Marco data, as explained in section 3.7.

3.1 Task

As mentioned earlier, the passage-reranking task is the second phase of a question answering pipeline, which consists of three phases (Nogueira and Cho, 2019): (1) passage-retrieval, in which a large number of relevant passages is pooled using a computationally cheap method (e.g. BM25 or TF-IDF); (2) passage-reranking, in which these passages are re-ranked by more sophisticated methods (such as neural networks); and (3) answer extraction, in which the top-n passages are fed to an answer extraction algorithm for marking up the answer.

3.2 BERT for Passage-Reranking

Given a list of retrieved passages, the aim of the passage-reranking phase is to calculate a relevance score $s_i$ for each candidate passage $d_i$ with respect to a query $q$. Using the theory presented in section 2.5, $\mathrm{BERT}_{SMALL}$ is used as a binary classifier in a point-wise fashion: the $[CLS]$ vector is used as the input to a single-layer neural network to obtain the probability of the passage being relevant, and the loss function looks at one document at a time, independently of the other documents.

More specifically, the query is fed to the classifier as sentence A and the passage text as sentence B. The query is truncated to a maximum of 64 tokens, and the passage is truncated such that the concatenation of query and passage amounts to at most 512 tokens. The publicly available pre-trained BERT model is used as the basis, and the re-ranker is fine-tuned using the cross-entropy loss:

$$L = -\sum_{j \in J_{pos}} \log(s_j) - \sum_{j \in J_{neg}} \log(1 - s_j) \qquad (3.1)$$

in which $J_{pos}$ is the set of indexes of the relevant passages and $J_{neg}$ the set of indexes of the non-relevant passages in the top 1,000 documents retrieved with BM25 (Nogueira and Cho, 2019).

3.3 Datasets

In this section, we describe the training datasets used in our experiments: (1) the MS Marco dataset, on which we train our in-domain BERT model, and (2) the Wikipedia-based dataset, on which we train our BERT model for the zero-shot experiments. In addition, we describe the development and the test set.

3.3.1 MS Marco Training Dataset

The MS Marco dataset (Bajaj et al., 2016) is composed of ∼400 million query-passage pairs, in which the passages are marked as relevant or irrelevant.

Nonetheless, in our experiments we use a smaller release of MS Marco, roughly 10% of the full dataset (the original MS Marco dataset is more than 270 GB), to make our experiments easier (the file is called 'triples.train.small.tsv' and was downloaded from the official Microsoft repository, https://microsoft.github.io/MSMARCO-Passage-Ranking/). In addition, from this smaller MS Marco version we use only the first 20,000,000 query-passage pairs, in order to match the size of the Wikipedia-based dataset that we describe next.

3.3.2 Wikipedia-based Training Dataset

The Wikipedia-based training dataset consists of 20,000,000 query-passage pairs, where each query is the concatenation of a Wikipedia article's title and one of its section titles, while the associated 'relevant' passage (in quotation marks since this is our assumption) is the section's passage. We describe our methodology for building this dataset thoroughly in section 3.4.

3.3.3 Development and Test set

All experiments performed in this thesis are evaluated using the development and test sets from the MS Marco dataset. In the official MS Marco dataset, the development set contains 6,980 queries, each associated with the top 1,000 passages retrieved using BM25 from the MS Marco collection. Each query has on average one relevant passage, while some have none, since the corpus was initially constructed by retrieving the top 10 passages from the Bing search engine and then annotating them.



For this reason, some of the relevant passages might not be retrieved by BM25. The official MS Marco also contains a test set that consists of ∼6,800 queries and their top 1,000 retrieved passages, though the relevance judgements for it are not publicly available.

For the above reason, we create a test set out of the official development set by dividing it into two sets of 100 and 6,880 queries, which are used in the experiments as the development and test set respectively. Using a development set of 100 queries instead of the original 6,980 queries does not pose a problem as far as size is concerned, since research has demonstrated that 50 queries is a sufficient minimum (Buckley and Voorhees, 2017).

3.4 Creating the Wikipedia-based dataset

The procedure for creating the artificial Wikipedia-based training corpus for IR used in our experiments is as follows:

Let $W$ be a set of Wikipedia articles, $W = \{w_1, w_2, \ldots, w_n\}$, with each article $w_i$ containing a main title $t_i$ and $j$ section titles $\mathit{sectionTitle}_{i,j}$, where each section title is associated with a section passage $\mathit{sectionPassage}_{i,j}$. Let $Q$ be the set of artificial Wikipedia-based user queries, $Q = \{q_{1,1}, q_{1,2}, \ldots, q_{i,j}\}$, such that $q_{i,j}$ is the concatenation of a Wikipedia title $t_i$ with one of its $\mathit{sectionTitle}_{i,j}$. Let $P$ be the set of associated relevant passages, $P = \{p_{1,1}, p_{1,2}, \ldots, p_{i,j}\}$, with $p_{i,j}$ being the Wikipedia $\mathit{sectionPassage}_{i,j}$ corresponding to $\mathit{sectionTitle}_{i,j}$; we call $p_{i,j}$ the "relevant passage" for the query $q_{i,j}$. Let now $\Psi$ be a sequence of associated irrelevant passages, $\Psi = (\psi_{1,1}, \ldots, \psi_{i',j'})$, with $\psi_{i',j'}$ being a Wikipedia $\mathit{sectionPassage}_{i',j'}$ such that $\mathit{sectionPassage}_{i',j'} \neq p_{i,j}$; we call $\psi_{i',j'}$ the "irrelevant passage" for the query $q_{i,j}$. Finally, let $A$ be our artificial IR training dataset, a set of triplets $(q_{i,j}, p_{i,j}, \psi_{i',j'})$ with $q_{i,j}$ being the artificial user query, $p_{i,j}$ its associated relevant passage, and $\psi_{i',j'}$ the associated irrelevant passage:

$$A = \{(q_{1,1}, p_{1,1}, \psi_{1,1}), \ldots, (q_{i,j}, p_{i,j}, \psi_{i',j'})\} \qquad (3.2)$$

We build the training dataset $A$ in two steps: (1) we parse the Wikipedia dump and create a temporary file $\alpha$ with the $(q_{i,j}, p_{i,j})$ pairs; (2) we create another temporary file $\beta$ with the irrelevant passages $\psi_{i',j'}$ and concatenate the two files $\alpha$ and $\beta$, resulting in the training dataset $A$.

Creating the $(q_{i,j}, p_{i,j})$ pairs

In the first step, we download the latest Wikipedia dump from the official Wikipedia repository (https://dumps.wikimedia.org/enwiki/). Since the Wikipedia dump comes in XML format, it needs to be parsed in order to obtain clean text from it. For this reason, we use an open-source Wikipedia parser (https://github.com/attardi/wikiextractor) to clean the Wikipedia dump and obtain the clean text. We run the parser with the configuration --sections --filter_disambig_pages to preserve the sections of every article and to filter out Wikipedia's disambiguation pages, since they do not have any sections and therefore cannot be used to create queries. The procedure of extracting clean text from Wikipedia takes around 6-8 hours, depending on the hardware.

Having obtained the clean text, we use simple regex rules to parse it and extract the artificial queries (the article title and the section title concatenated, as mentioned before) together with their associated relevant passages (the section passages). For example, to identify a section we match the pattern 'Section::::' that exists in the clean text, and from there we extract the section passage.

Figure 3.1: Extracting the $(q_{i,j}, p_{i,j})$ pairs from a Wikipedia article. The pairs extracted from the above article would be ("MissingNo. History", "Developed [...] games"), ("MissingNo. Characteristics", "A player [...]"), etc.

After we finish the whole procedure for the entire Wikipedia, we end up with ∼12,000,000 query-passage pairs $(q_{i,j}, p_{i,j})$, saved in the file we named $\alpha$.

Creating the irrelevant passages $\psi_{i',j'}$

Since we train a binary classifier (relevant vs. irrelevant), our training corpus $A$ needs, for each artificial query, an associated irrelevant passage. We do this in the following intuitive way: we loop over the file $\alpha$ created in step (1), starting from the 800,000th index (a number we picked at random), and we save the associated passages until we reach 12,000,000 instances (the same number of instances as in file $\alpha$). We save these instances in a new file called $\beta$. In this way we make sure that the passage in the $i$-th position of $\beta$ is irrelevant to the passage $p_{i,j}$ in the $i$-th position of $\alpha$. As a final step, we concatenate the files $\alpha$ and $\beta$ and end up with our artificial training dataset of 12,000,000 triplets $(q_{i,j}, p_{i,j}, \psi_{i',j'})$, resulting in ∼20 gigabytes of data.
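This negative-sampling step can be sketched in Python as follows (an in-memory illustration of ours of the file-based procedure described above, with the 800,000-line shift as the offset):

    OFFSET = 800_000  # arbitrary shift so that line i of beta is (very likely) irrelevant to line i of alpha

    def build_triples(alpha_pairs, offset=OFFSET):
        # Turn a list of (query, relevant_passage) pairs into
        # (query, relevant_passage, irrelevant_passage) triples by pairing each
        # query with the passage found `offset` positions further down (cyclically).
        n = len(alpha_pairs)
        triples = []
        for i, (query, relevant) in enumerate(alpha_pairs):
            _, irrelevant = alpha_pairs[(i + offset) % n]   # passage from another article/section
            triples.append((query, relevant, irrelevant))
        return triples

    pairs = [("MissingNo. History", "Developed alongside ..."),
             ("MissingNo. Characteristics", "A player ...")]
    print(build_triples(pairs, offset=1))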

3.5 Data pre-processing

Before training our models, we first need to convert the data to the appropriate BERT input format, and then to the TFRecord format to be consumed by the TPUs. The BERT format is required by the BERT model, while the TFRecord format is recommended for optimising the input pipeline of TensorFlow models (https://cloud.google.com/architecture/best-practices-for-ml-performance-cost).


Preprocessing happens in one pass: we wrote a Python script, which we called "preprocessor.py", that accepts a list of strings as input and outputs a TFRecord file. Google AI provides the necessary code for these conversions in their official GitHub repository (https://github.com/tensorflow/models/tree/master/official/nlp/data), which one can modify and customize according to their specific needs. There is no straightforward, ready-made implementation for converting a string to the BERT format and then to the TFRecord format; instead, one must first get accustomed to the logic of the Google AI code before being able to re-use and adapt it.
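A sketch of the kind of conversion performed by "preprocessor.py" (simplified, and ours; it assumes the input ids, segment ids and binary relevance label of each instance have already been produced as described in section 3.2):

    import tensorflow as tf

    def to_tf_example(input_ids, segment_ids, label):
        # Serialize one (query, passage) training instance as a tf.train.Example.
        features = {
            "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
            "segment_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=segment_ids)),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }
        return tf.train.Example(features=tf.train.Features(feature=features))

    def write_tfrecord(instances, path):
        # instances: iterable of (input_ids, segment_ids, label) tuples.
        with tf.io.TFRecordWriter(path) as writer:
            for input_ids, segment_ids, label in instances:
                writer.write(to_tf_example(input_ids, segment_ids, label).SerializeToString())

    write_tfrecord([([101, 2054, 102], [0, 0, 0], 1)], "train.tfrecord")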

3.6 Training on TPUs

Despite their impressive performance, BERT models are quite slow at training and inference time (W. Liu et al., 2020). For this reason, we make use of Google's TPUs (https://cloud.google.com/tpu/) in our experiments. As an example, fine-tuning just one of our BERT models on a TPU takes around 20 hours for 20 gigabytes of data, while, based on Google's published metrics (https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip), training on a GPU would have taken 300 hours (∼13 days). As far as inference time is concerned, the development set consisting of 100,000 instances (100 queries × 1,000 passages per query) takes around 5 minutes to be annotated by the BERT relevance classifier we fine-tuned on a TPU, while the test set consisting of 6,880,000 instances takes around 50 minutes. Using GPUs, the equivalent times would be ∼1 and ∼10 hours respectively.

Although Google provides free TPU usage through Google Colab (https://colab.research.google.com/), it was unfeasible to use it because of the limitations it imposes on continuous usage (https://research.google.com/colaboratory/faq.html#resource-limits). For this reason, we used TPUs through Google Cloud. In total, we spent $500 to perform all of our experiments (Google Cloud provides a $300 free trial).

3.7 Experimental Systems

To investigate our first research question regarding the possibility of transfer learning with BERT in the context of IR by using Wikipedia as a pre-fine-tuning step, we use the following two systems:

• Default: a $\mathrm{BERT}_{SMALL}$ model used "as is", with its "out-of-the-box" weights and without any additional training.

• Wiki: a $\mathrm{BERT}_{SMALL}$ model that we train with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based dataset.

For our second research question, regarding to what extent pre-fine-tuning for the passage-reranking task on Wikipedia data can improve a model in terms of (1) accuracy, (2) convergence or (3) stability, we train the following two models:

• Wi+Ma: a $\mathrm{BERT}_{SMALL}$ model that is first pre-fine-tuned with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based dataset and then further fine-tuned on an additional 20,000,000 instances from the MS Marco dataset.



• Marco: a $\mathrm{BERT}_{SMALL}$ model that is fine-tuned only on 20,000,000 instances from the MS Marco dataset.

3.8 Experimental Settings

Following the settings of Nogueira and Cho (2019), a $\mathrm{BERT}_{SMALL}$ model is fine-tuned using a TPU v2-8 on Google Cloud, with a batch size of 128 (128 sequences × 512 tokens = 65,536 tokens/batch) when we fine-tune the model on MS Marco, and a batch size of 32 when we fine-tune on Wikipedia.

ADAM (Kingma and Ba, 2014) is used with an initial learning rate of $3 \times 10^{-6}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. A dropout probability of 0.1 is used on all layers. For the fine-tuning on Wikipedia we change the learning rate to $10^{-6}$.

3.9 Evaluation Methods

In the passage-reranking task, we want to measure at which position the trained retrieval models rank the relevant passage, i.e. the answer to the user's query. As mentioned in section 3.3.3, every query is associated with 1,000 passages containing at most one relevant passage, which is the answer to the user's question. In addition, the relevance judgment associated with the relevant passage is binary (relevant, non-relevant). Since in our experiments we are interested in finding the first relevant passage in the list, we picked MRR as our evaluation metric, as it puts a high focus on the first relevant element of the list. Also, using MRR makes our results comparable with the other entries on the MS Marco leaderboard, as the official MS Marco web page for the passage-reranking task states that the evaluation metric should be MRR (https://microsoft.github.io/MSMARCO-Passage-Ranking/).


4 Results

In this chapter, we report the results of our experiments. Table 4.1 shows the results on the test and development sets, while Figure 4.1 shows the performance of the models on the development set when trained on different amounts of data. Figure 4.2 quantifies the instability of the BERT models by reporting the results of 10 runs for each model. Note that in Figures 4.1 and 4.2 we do not report any results for the Default model since, as mentioned, it is not trained on any data.

4.1 Accuracy

Regarding our first research question, we observe in Table 4.1 that our Wiki model achieves a score of 0.196 on the test set, which is almost 10 points higher than the performance of the Default model, which scores 0.104 on the test set and 0.0 on the development set. Turning to the first part of our second research question, namely whether more accuracy can be gained by pre-fine-tuning on Wikipedia data, we can see that our Wi+Ma and Marco models have essentially the same performance, both achieving a score of ∼0.36 in MRR@10 on the test set. Therefore, we can argue that no accuracy is gained by pre-fine-tuning on Wikipedia.

System     Dev (100 queries)     Test (6,880 queries)
Default    0.0                   0.104
Wiki       0.197                 0.196
Wi+Ma      0.37                  0.3610
Marco      0.35                  0.3598

Table 4.1: Accuracy: MRR@10 results on the development and test sets.

4.2 Convergence

Figure 4.1 shows the results regarding convergence, i.e. whether a model's loss moves towards the minimum with a decreasing trend. This concerns the second part of our second research question: whether any gain in convergence is obtained by pre-fine-tuning on Wikipedia. In Figure 4.1 we observe that our Wi+Ma model does not converge faster: it steadily increases its performance with more data, in the same fashion as the Marco and Wiki models do.

Figure 4.1: Convergence: results on the development set when training on different amounts of data (between 1k and 20 million instances). For the results between 1k and 1 million we report the best score over 10 runs.

4.3 Stability

Figure 4.2 shows the results regarding stability, which concerns the last part of our second research question, namely whether any stability would be gained by pre-fine-tuning on Wikipedia. Stability is defined as whether the model leads to a smaller standard deviation of the fine-tuning accuracy (Devlin et al., 2018; Dodge et al., 2020). The plot shows an impressive gain in stability for our Wi+Ma model, since all the lines of the 10 runs overlap, creating a solid blue line. Turning to the other two models, Marco and Wiki, we see that their performance is unstable, hence their lines are scattered in the plot, but they start to gradually converge as they see more data.

Figure 4.2: BERT instability: development accuracy over 10 runs.

4.4 Analysis

Zero-shot

Our hypothesis was confirmed: Wikipedia, with its generic knowledge, can be utilised to create a generic IR training dataset that, by imitating users' queries, can be leveraged through BERT's transfer-learning capabilities to train a generic retrieval model. Our model achieved an MRR@10 score 10 points higher than the Default model when tested on out-of-domain data, MS Marco, a dataset characterised by real-world queries that are quite diverse. In addition, we can observe from Figure 4.1 that our Wiki model achieves an even better score (+6 points) than the Marco model (which uses in-domain data) when the latter is trained on 10k-50k training instances.

Accuracy and Convergence

Turning to our second hypothesis regarding the potential benefits of pre-fine-tuning a BERT model for the passage-reranking task on our Wikipedia dataset, the results showed that, as far as accuracy and convergence are concerned, both models are equal. That was something we did not necessarily expect, and it leaves room for improvement. One possible explanation for our model not performing better is that the queries constructed from the Wikipedia articles in concatenation with their sections better resemble the task of ad-hoc retrieval, which is characterised by short queries with potentially unclear intent (Mitra and Craswell, 2017), while the MS Marco dataset was created to address the need for evaluating systems against natural language questions (Bajaj et al., 2016).

Figure 4.2: BERT instability. Development accuracy in 10 runs.

Also, there is a lot of hyperparameter optimization that we did not address in this thesis, such as the length of the Wikipedia passage, which was set to 512 tokens, or the batch size, which was set to 32. Last but not least, observing Figure 4.1 we see a strange behaviour of Wiki when trained on 100k samples: its accuracy decreases. This might be an indication that the way the query-passage pairs are constructed introduces some noise into the data that makes the model decrease its performance. Another potential reason might be the nature of the MS Marco dataset itself. MS Marco is a generic web dataset with all sorts of questions and answers, so the knowledge incorporated in Wikipedia's articles might be just a subset of the knowledge incorporated in MS Marco. This would mean that when the BERT model is exposed to MS Marco for training, there are many events the model has never seen before, and thus it cannot converge faster or reach higher accuracy.

Instability

Turning to the instability of the BERT models, it has been noted in the literature that, despite their significant success, the process of fine-tuning them (especially on small datasets) remains unstable (Devlin et al., 2018; Dodge et al., 2020).

Potentially, the reason behind this lies in the nature of the fine-tuning phase: a new task-specific layer replaces the original output layer and the complete model is fine-tuned. In that way, new sources of randomness are introduced: the weight initialization of the new output layer and the data order in the stochastic fine-tuning optimization (T. Zhang et al., 2020). Such factors have been claimed to influence the results significantly (Dodge et al., 2020; Phang et al., 2018), in particular on small datasets (e.g., < 10,000 samples). Hence, for such small datasets, practitioners resort to conducting several random trials of fine-tuning and selecting a model based on validation performance (Devlin et al., 2018). This increases model deployment costs and time, while making scientific comparison challenging (Dodge et al., 2020).
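To make these two sources of randomness explicit, the sketch below shows one way to control them in a single PyTorch fine-tuning trial; it assumes a model that exposes its task-specific head as model.classifier (as in Hugging Face's BertForSequenceClassification) and is only an illustration, not the exact training code used in this thesis.

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seeded_fine_tuning_trial(model, train_dataset, seed, batch_size=32):
    # One fine-tuning trial; the seed governs both sources of randomness.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    # Source 1: the weight initialization of the new output layer.
    torch.nn.init.normal_(model.classifier.weight, std=0.02)
    torch.nn.init.zeros_(model.classifier.bias)

    # Source 2: the data order in the stochastic fine-tuning optimization.
    generator = torch.Generator().manual_seed(seed)
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, generator=generator)

    # ... a standard training loop over `loader` would follow here ...
    return model

# Repeating this for several seeds and reporting the spread of the resulting
# development accuracy is how the instability in Figure 4.2 is quantified.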

The results shown in Figure 4.2 for our Wi+Ma model indicate that a BERT model trained on large amounts of Wikipedia data increases its stability, whereas the behaviour of the other two models is quite unstable, especially when the size of the data is small. The reason behind the stability of the Wi+Ma model probably lies in the following: the other two models, Marco and Wiki, tend to be quite unstable, but as they are fed more data (e.g. 1 million samples) they start to stabilize and their lines start to overlap. By extrapolating the lines we can hypothesize that, with more data, these models reach convergence as far as stability is concerned.

So, by using the Wiki model as the basis for fine-tuning on in-domain data, the model starts from weights that have already been stabilised. Unfortunately, due to the expense of the experiments we could not perform 10 runs for more than 1 million samples. Our results are aligned with and confirm the findings of Phang et al. (2018), who showed that fine-tuning the model on a large intermediate task stabilizes the later fine-tuning on small datasets. In addition, another speculation regarding the stability gained by the Wi+Ma model is that, as we mentioned in "Accuracy and Convergence", the knowledge inside Wikipedia is a subset of the knowledge in the MS Marco dataset. Therefore, when we train on MS Marco, there are enough events that the BERT model has already seen before in Wikipedia, which leads the model to develop denser weights and to have a smaller standard deviation in accuracy.
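As a hedged illustration of the two-stage recipe behind Wi+Ma (pre-fine-tune on the weakly supervised Wikipedia pairs, then continue on in-domain MS Marco data), the sketch below uses the Hugging Face Trainer API; the checkpoint name, function names, output paths, and the assumption that both collections are already tokenized into (query, passage) pairs with binary labels are ours, not the exact configuration used in the experiments.

from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def fine_tune(model, dataset, output_dir):
    # One fine-tuning stage on tokenized (query, passage) pairs with binary labels.
    args = TrainingArguments(output_dir=output_dir,
                             per_device_train_batch_size=32,  # batch size as stated above
                             num_train_epochs=1)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model

def build_wi_ma(wiki_pairs, msmarco_pairs):
    # Stage 1 yields the "Wiki" model; stage 2 continues from its already
    # stabilised weights and yields the "Wi+Ma" model.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    model = fine_tune(model, wiki_pairs, "out/wiki")
    model = fine_tune(model, msmarco_pairs, "out/wi_ma")
    return model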


5 Conclusion

In this thesis, our hypothesis was that transfer learning in the context of IR is possible by leveraging a generic knowledge base such as Wikipedia to create an IR training dataset, so that training a deep neural model on such a dataset would prove beneficial for later retrieval tasks, since the model would incorporate generic retrieval knowledge.

Our aim was to address the shortage of data when training data-hungry neural models for domains where there is limited or no data at all. To that end, we proposed a simple and intuitive technique for transfer learning in the context of IR: we leveraged Wikipedia's structure to create a generic weakly supervised IR training dataset.

Our hypothesis suggested two research questions that examined, from two different angles, the impact of the artificial Wikipedia-based dataset on deep neural retrieval models. The first question concerned the zero-shot scenario, i.e. whether a BERT model trained only on the Wikipedia-based IR dataset would have higher accuracy than a default BERT model. The second concerned the pre-fine-tuning scenario, i.e. whether pre-fine-tuning a BERT model on the Wikipedia-based IR dataset would improve the later fine-tuning stages with in-domain data in terms of accuracy, convergence, or stability.

First of all, our hypothesis was confirmed: transfer learning in the context of IR is possible by utilising Wikipedia to cheaply create an IR dataset capable of training a data-hungry deep neural model. Our zero-shot experiments on MS Marco showed that the BERT model trained only on Wikipedia surpassed the default BERT model by a large margin (+10 points difference in MRR@10). In addition, experiments on the development set indicated that our "Wiki" model performs as well as or better than a BERT model trained only on in-domain data when the size of the training data ranges between 10k and 50k samples.

At the same time, even though we showed that transfer learning in the context of IR using Wikipedia is possible, we could not show that our zero-shot approach is more efficient in a real-world scenario. To put it more simply: "should someone employ a BERT model trained on Wikipedia for their domain-specific application?".

We answer with confidence "not yet, since more experimentation is needed with the traditional approaches". One limitation of our experiments is that we did not compare our "Wiki" BERT model with any of the unsupervised traditional retrieval approaches, such as the BM25 algorithm or TF-IDF. These traditional approaches are not only less computationally expensive than the neural approaches, as they do not require training data (even though BM25 does offer some tuning parameters), but it also remains an open issue whether such traditional approaches are less effective than neural approaches. As a matter of fact, a lot of criticism has emerged in the IR community over whether neural ranking models actually improve retrieval effectiveness in limited-data scenarios (Lin, 2019; W. Yang et al., 2019); the authors argue that the emerging "neural hype" has been compared against weak baselines and that its demonstrated "wins" are open to question. For this reason, in our future work we will focus not only on comparing our neural models against traditional approaches, such as BM25, but also on making sure that the compared baselines are not "weak". Steps towards creating strong baselines for IR are (1) fine-tuning the parameters of the BM25 algorithm using a development set, and (2) employing query expansion using graph traversals (Grainger et al., 2016).
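As a rough sketch of step (1), tuning BM25's k1 and b parameters on a development set could look like the following; it relies on the rank_bm25 package, and the parameter grid, data structures, and scoring by MRR@10 are illustrative assumptions rather than a fixed experimental protocol.

from itertools import product
from rank_bm25 import BM25Okapi

def tune_bm25(tokenized_passages, dev_queries, qrels, k=10):
    # tokenized_passages: list of token lists, one per passage
    # dev_queries:        query_id -> list of query tokens
    # qrels:              query_id -> index of the single relevant passage
    best_params, best_mrr = None, -1.0
    for k1, b in product([0.6, 0.9, 1.2, 1.5, 1.8], [0.3, 0.5, 0.75, 0.9]):
        bm25 = BM25Okapi(tokenized_passages, k1=k1, b=b)
        rr_sum = 0.0
        for qid, tokens in dev_queries.items():
            scores = bm25.get_scores(tokens)
            top = sorted(range(len(scores)), key=scores.__getitem__,
                         reverse=True)[:k]
            if qrels[qid] in top:
                rr_sum += 1.0 / (top.index(qrels[qid]) + 1)
        mrr = rr_sum / len(dev_queries)
        if mrr > best_mrr:
            best_params, best_mrr = (k1, b), mrr
    return best_params, best_mrr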

References
