Zero-shot, One Kill: BERT for Neural Information Retrieval
Using Wikipedia-based Weak Supervision for Passage-(Re)ranking and Question Answering
Stergios Efes
Uppsala University
Department of Linguistics and Philology Master Programme in Language Technology
Master’s Thesis in Language Technology, 30 ects credits June 9, 2021
Supervisor:
Abstract
[Background]: The advent of bidirectional encoder representations from transformers (BERT) language models (Devlin et al., 2018) and of MS Marco, a large-scale human-annotated dataset for machine reading comprehension (Bajaj et al., 2016) that was made publicly available, led to a revolution in the field of information retrieval (IR) (Lin et al., 2020). At the time of its publication, the BERT-based retrieval model of Nogueira and Cho (2019) became the top entry in the MS Marco passage-reranking leaderboard, surpassing the previous state of the art by 27% in MRR@10. However, training such neural IR models for domains other than MS Marco is still hard, because neural approaches often require a vast amount of training data to perform effectively, and such data is not always available. To address the shortage of labelled data, a new line of research emerged: training neural models with weak supervision. In weak supervision, labels for an unlabelled dataset are generated automatically using an existing model, and a machine learning model is then trained on the artificial “weak” data. In the case of weak supervision for IR, the training dataset comes in the form of (query, passage) tuples. Dehghani et al. (2017) used the AOL query logs (Pass et al., 2006), a set of millions of real web queries, together with BM25 to retrieve the relevant passages for each of the user queries.
A drawback of this approach is that it is hard to obtain query logs for every domain. [Objective]: This thesis proposes an intuitive approach for addressing the shortage of data in domains with limited or no data at all through transfer learning in the context of IR. We leverage Wikipedia’s structure to create a generic Wikipedia-based IR training dataset for zero-shot neural models. [Method]: We create “pseudo-queries” by concatenating the title of each Wikipedia article with each of its section titles, and we consider the associated section’s passage as the relevant passage for the pseudo-query.
All of our experiments are evaluated on a standard collection, MS Marco, which is a large-scale web collection. For our zero-shot experiments, our proposed model, called “Wiki”, is a BERT model trained on the artificial Wikipedia-based dataset, and the baseline is a default BERT model without any additional training.
In our second line of experiments, we explore the benefits gained by pre-fine-tuning on the Wikipedia-based IR dataset and further fine-tuning on in-domain data. Our proposed model, “Wiki+Ma”, is a BERT model pre-fine-tuned on the Wikipedia-based dataset and further fine-tuned on MS Marco, while the baseline is a BERT model fine-tuned only on MS Marco. [Results]: The results of our first experiments show that our BERT model trained on the Wikipedia-based IR dataset, “Wiki”, achieves 0.197 in MRR@10, which is about 10 points higher than a BERT model with default weights; in addition, results on the development set indicate that the “Wiki” model performs better than a BERT model trained on in-domain data when the in-domain data amounts to between 10k and 50k instances. The results of our second line of experiments show that pre-fine-tuning on the Wikipedia-based IR dataset benefits later fine-tuning on in-domain data in terms of stability. [Conclusion]: Our findings suggest that transfer learning for IR tasks by leveraging the generic knowledge incorporated in Wikipedia is possible, though more experimentation is needed to understand its limitations in comparison with traditional approaches such as BM25.
Contents
1 Introduction
1.1 Purpose and Research Questions
1.2 Outline
2 Background
2.1 Information Retrieval Tasks
2.1.1 Ad hoc Retrieval
2.1.2 Question Answering
2.1.3 Other Information Retrieval Tasks
2.2 Evaluation in Information Retrieval
2.2.1 Mean Average Precision
2.2.2 Mean Reciprocal Rank
2.3 Traditional Information Retrieval
2.3.1 Vector Space Model
2.3.2 BM25
2.3.3 Other Traditional Information Retrieval Models
2.4 Learning-To-Rank Information Retrieval
2.5 Neural Information Retrieval
2.5.1 A Unified Model formulation of the Neural Ranking Models
2.5.2 Model Architectures
2.6 Bidirectional Encoder Representations from Transformers
2.6.1 Traditional Language Modelling
2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers
2.6.3 BERT Language Modelling
2.7 Weak Supervision for Ranking
2.7.1 Wikipedia-Based Weak Supervision Signals
2.7.2 Previous Work
3 Methodology
3.1 Task
3.2 BERT for Passage-Reranking
3.3 Datasets
3.3.1 MS Marco Training Dataset
3.3.2 Wikipedia-based Training Dataset
3.3.3 Development and Test set
3.4 Creating the Wikipedia-based dataset
3.5 Data pre-preprocessing
3.6 Training on TPUs
3.7 Experimental Systems
3.8 Experimental Settings
3.9 Evaluation Methods
4 Results
4.1 Accuracy
4.2 Convergence
4.3 Stability
4.4 Analysis
5 Conclusion
1 Introduction
Information retrieval (IR) technologies play an important role in people’s daily lives.
Millions of users search the web every day, and billions of queries are processed daily by major search engines such as Google and Bing.
The effectiveness of such search engines rests on two factors: the use of neural networks and the large amount of click-log data on which they are trained (Zamani, 2019). Such text retrieval models need to acquire an understanding of raw text documents and to learn a ranking function 𝑓(𝑞, 𝑑) which, given a query 𝑞 and a document 𝑑, outputs the probability of the document being relevant (Guo et al., 2020).
Neural IR started to flourish only after the publication of large datasets for passage-reranking (Nogueira and Cho, 2019), such as MS Marco from Microsoft Bing (Bajaj et al., 2016). It has been noted that the absence of such big datasets made it impossible for neural IR to compete with classical IR (Lin, 2019). Despite the recent advances in neural IR (Nogueira and Cho, 2019; Hofstätter et al., 2020; S. Han et al., 2020), it has to be emphasized that they took place in a scenario with an abundance of training signals, i.e. MS Marco. Without such large datasets (which are either click-logs or manually labelled with human judgments), the effectiveness of neural networks is highly questionable (Lin, 2019; W. Yang et al., 2019).
For such a setting, where no labeled data are available for training, recent research has focused on training neural ranking models using weak supervision, in which labels are acquired automatically by other means. For example, Dehghani et al. (2017) used an unsupervised IR model, BM25, while K. Zhang et al. (2020) used anchor text. Addressing the lack of labelled data, Frej et al. (2019) take it one step further by utilizing Wikipedia to build large-scale IR test collections automatically.
This thesis follows the aforementioned line of research: it aims to investigate to what extent Wikipedia-based weak supervision signals can be used to train a generic and efficient neural ranking model (Radford et al., 2019).
1.1 Purpose and Research Questions
The purpose of this work is to investigate the effectiveness of transfer learning in deep neural networks in the context of IR, as a way of addressing the data bottleneck faced when training a neural retrieval model for a domain with limited data or no data at all. BERT models have been shown to be an effective approach to transfer learning: they are pre-trained on large amounts of raw text data and then fine-tuned for a specific task (Devlin et al., 2018). In our case, we explore the possibility of transfer learning for IR by fine-tuning a BERT model with Wikipedia-based weak supervision for the task of passage-reranking. We assume that, since the Wikipedia corpus is characterised by generic knowledge, training a BERT model on a Wikipedia-based IR dataset could help it incorporate generic information retrieval knowledge that would prove beneficial for later retrieval tasks.
We try to quantify how beneficial such an approach can be by answering the
following research questions:
• Does a BERT model, fine-tuned with Wikipedia-based weak supervision, perform better in terms of accuracy when tested on out-of-domain data (zero-shot setting) in comparison to a default BERT model that has not been fine-tuned at all?
• If we further fine-tune this BERT model (that has already been trained with Wikipedia-based weak supervision) on in-domain data, will there be any improvement in terms of accuracy, convergence or stability in comparison to a model that has only been trained with in-domain data?
1.2 Outline
Chapter 2 describes the different tasks in IR, paying particular attention to the question answering task and to one of its components that is central to this thesis, passage-reranking. In addition, it describes the different approaches in IR, namely traditional IR, learning-to-rank and neural IR, and how evaluation is performed in IR. After the necessary IR concepts are laid out, a presentation of language modelling (BERT) and of weak supervision follows. Chapter 3 presents the methodology used to perform passage-reranking with BERT and the datasets used: Wikipedia, which is pre-processed and used to create an artificial query-passage dataset for training the BERT model, and MS Marco, which is used both for training and testing. It also describes the experimental settings along with the baseline(s). Chapter 4 analyzes and discusses the results from our experiments.
Chapter 5 presents the essential contributions of this work and concludes the thesis.
2 Background
The following sections aim to present a coherent yet short overview of the concepts and theory needed to understand neural information retrieval. The literature does not always agree on terminology and different names for the same concepts are used interchangeably. For this reason, it felt necessary to clarify the terms used in this thesis.
Traditional IR or classical IR refers to the basic retrieval systems used in IR (such as, for instance, the vector space model, and BM25 retrieval algorithm). Such methods focus on word occurrences for measuring relevance. For this thesis, the term traditional IR was chosen as an attempt to highlight the absence of machine learning in it.
Learning to rank (LETOR) or machine-learned IR refers to a paradigm in IR that seeks to employ machine learning approaches to solve IR problems. By LETOR the literature seems to refer to 1) the framework, meaning the formalisation of the IR ranking problem from the perspective of machine learning, and 2) the use of traditional machine learning approaches to solve it, such as support vector machines, or simple neural networks such as the perceptron. These methods focus on training ML models on human-labelled datasets using hand-crafted features, as will be explained more thoroughly later. For this work the term learning to rank (LETOR) is adopted because it is more broadly used in the literature than machine-learned IR.
Neural IR, also called neural retrieval, Neu-IR or neural LETOR, refers to the use of deep neural network (DNN) architectures to address IR problems from the perspective of the LETOR framework. The DNNs are trained on human-labelled datasets, but the feature learning is done automatically by the network, in contrast to the hand-crafted features mentioned before. The term neural IR is used in this thesis since it is broadly used in the literature.
2.1 Information Retrieval Tasks
In this section, we describe the task of question answering in IR and how it relates to this thesis. We also briefly describe the ad hoc retrieval task, since it is the most prominent task in the IR field, and briefly mention the remaining IR tasks.
2.1.1 Ad hoc Retrieval
The most notable of the retrieval tasks is the ad hoc retrieval (Guo et al., 2020), in which a user has an information need for which they issue a query to a retrieval system, which in turn measures the relevance between the query and the documents in the collection and then retrieves the top N scoring documents. A major difficulty in ad hoc retrieval is that the incoming queries usually have an unclear intent and range from a few words to a few sentences (Mitra and Craswell, 2017).
2.1.2 Question Answering
The task of question answering (QA) is to automatically answer a user’s questions
issued in natural language using some information resources (Guo et al., 2020). The
information resources could either be structured data (such as a knowledge base)
or unstructured data (for example web pages, or documents) which is what we are concerned with in this thesis. Furthermore, there are several task formats for QA, such as passage-reranking (Nogueira and Cho, 2019), passage-retrieval, answer span locating (Rajpurkar et al., 2016) and answer synthesizing from multiple sources (Mitra et al., 2016).
As far as passage-reranking is concerned, it has to be noted that the literature does not seem to agree on the terminology. Some papers consider it not an independent task per se but rather a post-retrieval step of the passage-retrieval task (Aktolga et al., 2011). Others, such as Nogueira and Cho (2019), treat it as an independent task. In addition, the terms passage-ranking and passage-reranking seem to be used interchangeably in the literature. In any case, passage-reranking has become a crucial component of any QA system.
Most QA systems employ a pipeline structure that consists of several modules to get answers (Cui et al., 2005; Nogueira and Cho, 2019):
• passage-retrieval: in this phase, 𝑛 relevant passages are retrieved using an inexpensive method, such as BM25 or TF-IDF.
• passage-reranking: the 𝑛 retrieved passages are reranked using a more computationally expensive method.
• answer span locating: the top 5-10 passages will be used as candidates by an answer extraction algorithm for marking up the answer location in the passage(s).
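The first two stages of this pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system used in this thesis: `cheap_score` is a hypothetical term-overlap scorer standing in for BM25 or TF-IDF, `toy_reranker` is a hypothetical placeholder for an expensive model such as BERT, and the default values of `n` and `top_k` are illustrative.

```python
from typing import Callable, List


def cheap_score(query: str, passage: str) -> float:
    """Hypothetical first-stage scorer: number of query terms found in the passage."""
    passage_terms = set(passage.lower().split())
    return float(sum(term in passage_terms for term in query.lower().split()))


def toy_reranker(query: str, passage: str) -> float:
    """Placeholder for an expensive model: term overlap normalised by passage length."""
    return cheap_score(query, passage) / len(passage.split())


def retrieve_then_rerank(
    query: str,
    collection: List[str],
    reranker: Callable[[str, str], float],
    n: int = 1000,
    top_k: int = 10,
) -> List[str]:
    # Stage 1 (passage-retrieval): keep the n best passages under the cheap scorer.
    candidates = sorted(collection, key=lambda p: cheap_score(query, p), reverse=True)[:n]
    # Stage 2 (passage-reranking): reorder only those n candidates with the costly scorer.
    reranked = sorted(candidates, key=lambda p: reranker(query, p), reverse=True)
    # The top_k passages would then be handed to an answer-extraction step.
    return reranked[:top_k]


collection = [
    "dogs eat bones",
    "cats sleep all day",
    "dogs are animals that eat many things including bones",
]
top = retrieve_then_rerank("dogs bones", collection, toy_reranker, n=2, top_k=1)
```

The point of the design is that the expensive scorer is only ever called on the `n` candidates that survive the first stage, never on the whole collection.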
2.1.3 Other Information Retrieval Tasks
There are many other IR tasks as well such as product search (Brenner et al., 2018), sponsored search (Fain and Pedersen, 2006), community question answering (L. Yang et al., 2013), and automatic conversation (Ji et al., 2014), but they are outside the scope of this work.
2.2 Evaluation in Information Retrieval
Information retrieval systems are evaluated using test collections (Schütze et al., 2008) which consist of:
• A document collection which is the set of documents that the IR system indexes.
• A set of queries that express the information needs.
• A set of ground truth labels, which are relevance judgments, a binary assessment of a document being relevant or irrelevant with respect to a query.
IR evaluation revolves around the notion of relevant and non-relevant documents.
Given a query, a document is classified as being either relevant or not relevant. The outcome of this binary classification is referred to as the gold standard or ground truth judgment of relevance.
The typical metrics used in IR are adjusted versions of precision, recall and the F measure. They need to be adjusted because the original measures are set-based and computed over unordered sets, while in the IR context the aim is to evaluate ranked results.
Below we present the most common metric used in IR, mean average precision (MAP), and the one used in this thesis, mean reciprocal rank (MRR).
2.2.1 Mean Average Precision
One of the most commonly used metrics in the IR community is MAP. Given a query, average precision is defined as the average of the precision values obtained for the set of top-ranked documents after each relevant document is retrieved; this value is then averaged over all queries. Let $\{d_1, \ldots, d_{m_j}\}$ be the list of relevant documents for query $q_j \in Q$ and $R_{j,k}$ be the set of ranked retrieval results from the top result down to document $d_k$:

$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{j,k}) \tag{2.1}$$
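As a worked illustration of MAP, the sketch below computes average precision for each ranked list and then averages over queries. The function names and the toy document identifiers are illustrative assumptions; relevant documents that are never retrieved are assumed to contribute a precision of 0.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average of the precision values at the rank of each relevant document."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision of the top-`rank` results
    # Dividing by the number of relevant documents gives unretrieved ones a precision of 0.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(rankings: List[List[str]], relevants: List[Set[str]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)


# Query 1: relevant documents at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
# Query 2: the only relevant document at rank 2 -> AP = 1/2
rankings = [["d1", "d2", "d3", "d4"], ["d5", "d6"]]
relevants = [{"d1", "d3"}, {"d6"}]
map_score = mean_average_precision(rankings, relevants)
```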
2.2.2 Mean Reciprocal Rank
The reciprocal rank (RR) metric calculates the reciprocal of the rank at which the first relevant document was retrieved. RR is 1 if a relevant document was retrieved at rank 1, 0.5 if a relevant document was retrieved at rank 2 and so on. Averaged across all queries, this metric is called the mean reciprocal rank (MRR) (Craswell, 2009):
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \tag{2.2}$$
MRR is associated with a user model in which the user only wishes to see one relevant document (Craswell, 2009). The metric is very sensitive at the top of the ranking: moving the first relevant document from rank 1 to rank 2 lowers RR by 0.5, whereas moving it from rank 100 to rank 1,000 lowers it by only 0.009.
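The following sketch shows one way to compute MRR with a rank cutoff, as in the MRR@10 figures reported in this thesis. The function names are illustrative, and assigning a reciprocal rank of 0 to a query with no relevant document inside the cutoff is an assumption of this sketch.

```python
from typing import List, Set


def reciprocal_rank(ranking: List[str], relevant: Set[str], cutoff: int = 10) -> float:
    """1 / rank of the first relevant document within the cutoff, else 0 (assumed convention)."""
    for rank, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: List[List[str]], relevants: List[Set[str]], cutoff: int = 10) -> float:
    """Reciprocal rank averaged over all queries."""
    total = sum(reciprocal_rank(r, rel, cutoff) for r, rel in zip(rankings, relevants))
    return total / len(rankings)


# First relevant document at rank 1, at rank 2, and not retrieved at all.
rankings = [["a", "b"], ["x", "y", "z"], ["m"]]
relevants = [{"a"}, {"y"}, {"q"}]
mrr = mean_reciprocal_rank(rankings, relevants)
```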
2.3 Traditional Information Retrieval
The main tendency in traditional IR is to build a sparse term-document matrix out of term frequencies. Such retrieval systems are also called bag-of-words models, since they ignore the ordering of the words in the documents.
2.3.1 Vector Space Model
The intuition behind the vector space model (VSM) is that documents and queries (in this setting a query is treated as a document) can be represented as vectors in a multi-dimensional space and compared there using cosine similarity. More specifically, $\vec{d}$ denotes the vector derived from a document $d$ and $\vec{q}$ denotes the vector derived from a query $q$, where each component of a vector corresponds to a dictionary term and is calculated using the tf-idf scheme described below (Schütze et al., 2008).
Tf-idf weighting
Using the tf-idf weighting scheme, a weight is assigned to a term $t$ in a document $d$ using the following formula:

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t \tag{2.3}$$

where

$$\text{tf}_{t,d} = \frac{f_{t,d}}{\sum_{j} f_{j,d}}$$

in which $f_{t,d}$ is the number of times a term $t$ appears in a document $d$ and the sum runs over all terms $j$ in $d$, so that $\text{tf}_{t,d}$ is this count divided by the total number of terms in the document, and

$$\text{idf}_t = \log \frac{|D|}{df_t}$$

in which idf stands for inverse document frequency, $|D|$ denotes the number of documents in the collection and $df_t$ the number of documents in the collection that contain the term $t$.
Cosine similarity
Having derived the document vector $\vec{d}$ and the query vector $\vec{q}$, the cosine similarity between these two vector representations is computed to quantify their similarity:

$$\mathrm{cosine}(q, d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \times |\vec{d}|} \tag{2.4}$$
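To make the vector space model concrete, the sketch below builds sparse tf-idf vectors and ranks three toy documents against a query by cosine similarity. The helper names and the toy collection are illustrative assumptions; the idf values are estimated from the documents only, and query terms unseen in the collection are skipped.

```python
import math
from collections import Counter
from typing import Dict, List


def build_idf(docs: List[List[str]]) -> Dict[str, float]:
    """idf_t = log(|D| / df_t), where df_t counts the documents containing t."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def tf_idf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse tf-idf vector; terms unseen in the collection are skipped."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["dogs", "eat", "bones"], ["cats", "eat", "fish"], ["dogs", "love", "humans"]]
idf = build_idf(docs)
query_vec = tf_idf_vector(["dogs", "bones"], idf)
scores = [cosine(query_vec, tf_idf_vector(doc, idf)) for doc in docs]
```

In this toy collection the first document shares both query terms and scores highest, the third shares only "dogs", and the second shares none and scores zero.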
2.3.2 BM25
The BM25 scoring formula stems from the probabilistic relevance framework (PRF) and it is considered the state-of-the-art of traditional IR by some researchers (Robertson and Zaragoza, 2009). Even though through the years several BM25 versions have been developed (Kamphuis et al., 2020), the basic formula is (Aklouche et al., 2019):
$$\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \times \frac{tf_{t,d} \times (k_1 + 1)}{tf_{t,d} + k_1 \times \left(1 - b + b \times \frac{dl}{avgdl}\right)} \tag{2.5}$$

where $k_1$ and $b$ are constants tuned on a labelled dataset, $dl$ is the document length, and $avgdl$ is the average document length over all the documents in the collection.
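The basic BM25 formula can be sketched as follows. Several choices here are assumptions made for illustration: the values `k1=1.2` and `b=0.75` are commonly used defaults rather than values tuned in this thesis, the idf term uses the plain log(|D| / df_t) form from the tf-idf section (BM25 variants differ in how they smooth it), and raw term counts are used for the term frequency.

```python
import math
from collections import Counter
from typing import List


def bm25_score(query: List[str], doc: List[str], docs: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Basic BM25 score of `doc` for `query`, with statistics taken from `docs`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0 or tf[term] == 0:
            continue  # the term contributes nothing to this document
        idf = math.log(n_docs / df)  # assumed unsmoothed idf
        length_norm = 1 - b + b * len(doc) / avgdl
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score


docs = [["dogs", "eat", "bones"], ["cats", "eat", "fish"], ["dogs", "love", "humans"]]
query = ["dogs", "bones"]
bm25_scores = [bm25_score(query, d, docs) for d in docs]
```

Because all three toy documents have the same length, the length normalisation factor is 1 here; with documents of different lengths, `b` controls how strongly long documents are penalised.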
2.3.3 Other Traditional Information Retrieval Models
Since the aim of this thesis is not to give a comprehensive guide to traditional IR, we briefly mention only two other approaches that we think are prominent in the field of IR.
• In the language modelling approach, a document is assumed to have its own language model and the task is to calculate the probability of a document 𝑑 emitting a query 𝑞 (Banerjee and H. Han, 2009).
• In Bayesian networks, documents, terms, and queries are represented as nodes and there are arcs that link them together. Using the prior document probabilities and the conditional ones from the interior nodes a posterior probability can be computed (Turtle and Croft, 1989).
2.4 Learning-To-Rank Information Retrieval
Learning-to-rank (LETOR) refers to the application of traditional machine learning in the field of information retrieval for (re)ranking a list of documents given a query.
Many ML models have been employed over the years in the LETOR task, for example support vector machines (Yue et al., 2007) and boosted decision trees (Burges et al., 2005).
LETOR makes use of training data annotated with human relevance labels to train for a ranking task. The main thing that distinguishes LETOR models from the neural approaches is that LETOR models employ hand-crafted features for representing the query-document pairs (Mitra, Craswell, et al., 2018), something we address in more detail in the next section, where the neural approaches are described.
Typically such hand-crafted features fall under one of the following three categories: query-independent or static features (e.g. document length or web-link length), query-dependent or dynamic features (e.g. BM25), and query-level features (e.g. query length).
The different LETOR approaches can be categorised based on their training objectives (T.-Y. Liu, 2011):
• Point-wise method: It is the earliest method used. In the point-wise approach, the loss function looks at one document at a time and scores the document independently of the other documents. A regression model is typically trained on labeled data to predict a numerical relevance label for a document given a query (Mitra, Craswell, et al., 2018).
• Pair-wise method: In the pair-wise approach, the loss function looks at two documents at a time and tries to derive the optimal ordering for them. The ranking problem is reduced to a binary classification problem to predict the most relevant document (Mitra, Craswell, et al., 2018).
• List-wise method: In list-wise approaches, the entire set of documents is taken as an input and the model predicts the ground truth labels (T.-Y. Liu, 2011).
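The three training objectives can be contrasted with one small loss function each. These are illustrative choices and not the losses used in this thesis: squared error for the point-wise case, a logistic loss on the score difference for the pair-wise case, and a softmax cross-entropy over the whole list for the list-wise case (assuming at least one document in the list has a non-zero label).

```python
import math
from typing import List


def pointwise_loss(score: float, label: float) -> float:
    """Squared error between the predicted score and the label of a single document."""
    return (score - label) ** 2


def pairwise_loss(score_relevant: float, score_irrelevant: float) -> float:
    """Logistic loss: small when the relevant document is scored above the irrelevant one."""
    return math.log(1.0 + math.exp(-(score_relevant - score_irrelevant)))


def listwise_loss(scores: List[float], labels: List[float]) -> float:
    """Cross-entropy between the normalised labels and a softmax over all scores in the list."""
    max_score = max(scores)
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    total = sum(labels)  # assumed to be non-zero
    return -sum((y / total) * (s - log_z) for s, y in zip(scores, labels))


good_order = listwise_loss([2.0, 1.0, 0.0], [1.0, 0.0, 0.0])
bad_order = listwise_loss([0.0, 1.0, 2.0], [1.0, 0.0, 0.0])
```

The point-wise loss never sees another document, the pair-wise loss only depends on the difference between two scores, and the list-wise loss changes whenever any score in the list changes.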
2.5 Neural Information Retrieval
Neural IR refers to the use of deep neural architectures to address IR tasks. A key difference between deep architectures and the LETOR approaches is that, in contrast to the manual feature engineering demanded by the LETOR methods, deep neural networks learn the features they need automatically from the data, at the cost of training more complex models (Mitra, Craswell, et al., 2018).
2.5.1 A Unified Model formulation of the Neural Ranking Models
As mentioned earlier in the thesis, neural IR is studied under the LETOR framework (Guo et al., 2020). For this reason, the literature usually gives a unified formulation of neural ranking models from a generalized view of LETOR problems, which we also describe below.
Following Guo et al. (2020), suppose there is a set of queries $Q$, which could be any type of text queries, natural language questions, or input utterances, and a set of documents $D$, which could be any type of text documents, answer passages, or snippets from web pages or real documents. Let also $Y = \{1, 2, \ldots, l\}$ be a set of labels that represent relevance degrees, with an order between these grades: $l > l - 1 > \ldots > 1$.
Now let $q_i$ be the query in the $i$-th position of $Q$, $D_i = \{d_{i,1}, d_{i,2}, \ldots, d_{i,n_i}\}$ be the set of documents associated with $q_i$, and $Y_i = \{y_{i,1}, y_{i,2}, \ldots, y_{i,n_i}\}$ be the set of labels associated with $q_i$, with $y_{i,j}$ being the relevance degree of $d_{i,j}$ with respect to $q_i$.
Let $f(q_i, d_{i,j})$ be a ranking function that assigns a relevance score to a pair of a query and a document. Lastly, let $L(f; q_i, d_{i,j}, y_{i,j})$ be the loss function that calculates the loss between the prediction of the function $f$ and the label. The objective is then to find the optimal ranking function $f^*$ by minimizing the loss function over a labelled dataset:

$$f^* = \arg\min_f \sum_{i} \sum_{j} L(f; q_i, d_{i,j}, y_{i,j}) \tag{2.6}$$

Without loss of generality, we can further abstract the ranking function $f$ to the following unified formulation:

$$f(q, d) = g(\psi(q), \phi(d), \eta(q, d)) \tag{2.7}$$

where $q$ and $d$ are inputs, $\psi$ and $\phi$ are representation functions which extract features from $q$ and $d$, $\eta$ is the interaction function which extracts features from the pair $(q, d)$, and $g$ is the evaluation function which computes the relevance score based on the feature representations.
Using this generalised scheme, we can now describe the differences between the LETOR approaches and neural IR. In LETOR approaches the inputs of the function 𝑓 are usually raw texts, while in neural IR these inputs could be either raw texts or word embeddings (the mapping function is not included in the unified formula since it is considered a basic input layer).
Turning to the 𝜓, 𝜙 and 𝜂 functions, in the LETOR approaches they are usually set to be fixed functions, while the function 𝑔 is a machine learning model (for instance a gradient boosting tree) which can be learned from the training data. Neural ranking models, on the other hand, encode all four functions 𝜓, 𝜙, 𝜂 and 𝑔 in the network, so that they can all be learned from the data (Guo et al., 2020).
2.5.2 Model Architectures
Depending on the nature of the interaction function 𝜂 or the different assumptions over the features (extracted by the representation functions 𝜙 and 𝜓) described in Section 2.5.1, deep learning models can be divided into the following two architectures (Guo et al., 2020): representation-focused and interaction-focused.
Figure 2.1: Model architectures: a) Representation-focused, b) Interaction-focused (Guo et al., 2016).
Representation-focused architecture
The underlying assumption in representation-focused models is that relevance depends
on compositional meaning of the input texts. Thus, models of this category usually
employ complex representation functions 𝜙 and 𝜓 (e.g. deep neural networks) to
derive high-level representations over the text inputs 𝑞 and 𝑑, but not the interaction
function 𝜂, and define a simple evaluation function 𝑔, for instance cosine similarity,
for calculating the relevance score (Guo et al., 2020).
Interaction-focused architecture
On the other side of the spectrum, the underlying assumption in the interaction-focused models is that relevance depends upon the interaction between the input texts.
Thus, models of this category employ an interaction function 𝜂 along with simple representation functions 𝜙 and 𝜓, while they define a complex evaluation function 𝑔 (e.g. deep neural networks). Depending on the kind of interaction function used, the interaction-focused architecture can be further categorised (Guo et al., 2020) into:
• Non-parametric interaction functions, which measure the closeness between inputs without learnable parameters; some of them are defined over each pair of input vectors.
• Parametric interaction functions, which learn the similarity function from the data.
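The difference between the two architectures can be illustrated with the unified formulation f(q, d) = g(ψ(q), φ(d), η(q, d)). The sketch below is a toy under stated assumptions: the two-dimensional embeddings are hand-picked, averaging stands in for a deep representation function, and max-then-mean pooling stands in for the deep evaluation function that an interaction-focused model would normally learn. Both functions assume each text contains at least one known word.

```python
import math
from typing import Dict, List

Vector = List[float]

# Hypothetical hand-picked word embeddings, for illustration only.
EMBEDDINGS: Dict[str, Vector] = {
    "dog": [1.0, 0.0], "puppy": [0.9, 0.1], "car": [0.0, 1.0], "engine": [0.1, 0.9],
}


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def embed(text: str) -> List[Vector]:
    """Basic input layer: look up an embedding for every known word."""
    return [EMBEDDINGS[w] for w in text.split() if w in EMBEDDINGS]


def mean(vectors: List[Vector]) -> Vector:
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def representation_focused(q: str, d: str) -> float:
    # psi and phi build one vector per text; g is a simple cosine; eta is not used.
    return cosine(mean(embed(q)), mean(embed(d)))


def interaction_focused(q: str, d: str) -> float:
    # eta is a non-parametric interaction: cosine between every query/document word pair.
    matrix = [[cosine(qv, dv) for dv in embed(d)] for qv in embed(q)]
    # g: placeholder pooling where a real model would apply a learned deep network.
    return sum(max(row) for row in matrix) / len(matrix)
```

In the first function the two texts never meet until the final cosine, whereas in the second the word-level interaction matrix is the only thing the evaluation function sees.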
2.6 Bidirectional Encoder Representations from Transformers
Having described the main IR concepts needed for the thesis, we proceed to the area of language modelling (LM), which constitutes the basis of the Bidirectional Encoder Representations from Transformers (BERT) models used later in our experiments.
2.6.1 Traditional Language Modelling
Language modelling (LM) has been the basis for various NLP tasks. In machine translation, for example, an LM is used to improve the fluency of the output translations by choosing the most probable output (Jing and Xu, 2019). The first language models were rule-based, until the advent in the 1980s of statistical language models, which assign a probability distribution over sequences of words (Jing and Xu, 2019):

$$P(s) = P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2 \mid w_1) \times \cdots \times P(w_n \mid w_1, w_2, \ldots, w_{n-1}) \tag{2.8}$$

A major drawback of this n-gram LM approach was the curse of dimensionality. In particular, modelling an LM of this kind with a vocabulary of size 10,000 involves potentially $10000^{n-1}$ free parameters.
2.6.2 Neural Language Modelling: Attention Mechanisms and Transformers
To address the aforementioned “curse”, neural networks were introduced for language modelling in continuous space (Bengio et al., 2000). Such LMs create a dense representation of the language, in contrast to the sparse representation of the statistical approach mentioned before. Since then many techniques have been developed in neural language modelling, but presenting them is beyond the scope of this thesis. We only mention recurrent neural networks (RNNs) (Mikolov et al., 2010) and the technique most relevant to this thesis, attention mechanisms.
Attention mechanisms are a set of coefficients that the neural network uses to determine which part of the input to focus on. LMs equipped with attention mechanisms use long histories more efficiently (Bahdanau et al., 2014; Mei et al., 2016). Formally, the attention vector $\Delta_t$ is calculated from the representations $\{r_0, r_1, \ldots, r_{t-1}\}$:

$$\Delta_t = \sum_{i=0}^{t-1} \alpha_{ti} r_i \tag{2.9}$$

where $\alpha_{ti}$ are the attention coefficients.
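The weighted sum that produces the attention vector can be sketched as below. How the coefficients are obtained is not specified above; this sketch assumes, for illustration only, that they come from a softmax over dot products between a hypothetical `query` vector and each past representation.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax, so the coefficients are positive and sum to 1."""
    max_value = max(values)
    exps = [math.exp(v - max_value) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_vector(history: List[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Return the attention vector (weighted sum of `history`) and its coefficients."""
    # Assumed scoring function: dot product between the query and each representation.
    scores = [sum(a * b for a, b in zip(r, query)) for r in history]
    alphas = softmax(scores)
    dim = len(history[0])
    delta = [sum(alpha * r[k] for alpha, r in zip(alphas, history)) for k in range(dim)]
    return delta, alphas


history = [[1.0, 0.0], [0.0, 1.0]]
uniform_delta, uniform_alphas = attention_vector(history, [0.0, 0.0])
focused_delta, focused_alphas = attention_vector(history, [10.0, 0.0])
```

With a zero query every past representation receives the same coefficient, while a query aligned with the first representation concentrates almost all of the weight on it.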
Due to the success of attention mechanisms, Vaswani et al. (2017) proposed an architecture based solely on attention mechanisms: the Transformer, which consists of an encoder and a decoder. The Transformer encoder is the basis of the BERT language models.
2.6.3 BERT Language Modelling
BERT is a language model based on the aforementioned Transformer architecture. As the authors state in their paper, it is designed to “pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers” (Devlin et al., 2018).
Input and output representation
BERT is capable of coping either with a single sentence or a pair of sentences and thus it is able to handle a variety of downstream tasks. An example is given in Figure 2.2.
The first element of every sequence is the [CLS] token. In classification tasks the final hidden state of that token is used as the aggregate sequence representation. A pair of sentences is concatenated into a single sequence using the [SEP] token to separate them. In addition, a learned embedding is added for every token indicating whether it belongs to Sentence A or B. Thus, for a single token, its input representation is obtained by summing the corresponding token, segment, and position embeddings.
Figure 2.2: BERT input representation (Devlin et al., 2018).
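The input layout described above can be sketched as follows. Whitespace tokenisation is an assumption made to keep the example self-contained (BERT itself uses WordPiece sub-word tokens), and the function name is illustrative. The three returned lists are the indices that would be used to look up the token, segment and position embeddings that are summed.

```python
from typing import List, Tuple


def build_bert_input(sentence_a: str, sentence_b: str) -> Tuple[List[str], List[int], List[int]]:
    """Build the [CLS] A [SEP] B [SEP] sequence with its segment and position ids."""
    tokens_a = sentence_a.lower().split()  # stand-in for WordPiece tokenisation
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_input("what do dogs eat", "dogs eat bones")
```

For passage-reranking, sentence A would hold the query and sentence B the candidate passage.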
BERT framework
The BERT framework consists of two phases: pre-training and fine-tuning. During the first phase, the BERT model is pre-trained on large amounts of raw text data in an unsupervised manner to acquire deep language representations; during the second phase, it is fine-tuned on a supervised task. Most of the time, the phrase “training BERT” refers only to the second phase, fine-tuning, since BERT models pre-trained on general language data are freely available.
More specifically the pre-training step of BERT involves training it in the following
two unsupervised tasks: masked LM and next sentence prediction (NSP). In Masked
LM a percentage of the input tokens are masked and then the model tries to predict
those masked tokens. In that way the BERT model acquires knowledge on how words
are related to each other. On the other hand, the aim of NSP is to acquire knowledge
regarding the relationship of two sentences, since many downstream tasks such as
question answering require such knowledge. Thus, the BERT model is trained on a
binary classification task of predicting whether the next sentence follows the previous
one.
Figure 2.3: Pre-training and fine-tuning (Devlin et al., 2018).
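The masked LM objective can be illustrated with a small masking routine. This is a simplified sketch: the default `mask_prob` of 0.15 is the rate reported by Devlin et al. (2018), but the routine always substitutes `[MASK]` and omits the other replacement strategies of the original procedure; the function name and the `seed` parameter are illustrative.

```python
import random
from typing import Dict, List, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens: List[str], mask_prob: float = 0.15, seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of ordinary tokens with [MASK] and record the originals."""
    rng = random.Random(seed)
    masked: List[str] = []
    targets: Dict[int, str] = {}  # position -> original token the model must predict
    for position, token in enumerate(tokens):
        if token not in SPECIAL_TOKENS and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


tokens = ["[CLS]", "dogs", "eat", "bones", "[SEP]"]
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

During pre-training, the loss is computed only at the positions stored in `targets`, where the model has to recover the original token from its left and right context.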
The fine-tuning step of BERT is straightforward, since the Transformer architecture with its self-attention mechanism allows BERT to model many downstream tasks by simply plugging the task-specific inputs and outputs into the model, with the inputs taking the place of Sentences A and B, and fine-tuning all of the parameters end-to-end. For instance, in the question answering task, sentence A is the question and sentence B the relevant answer.
Thus a BERT classifier is trained on such inputs.
2.7 Weak Supervision for Ranking
Weak supervision is a sub-field of machine learning in which the basic assumption is that we can cheaply acquire imperfect, noisy labels in an unsupervised fashion and use them as a weak supervision signal for training a classifier (Hernández-González et al., 2016). Weak supervision has been applied in many NLP tasks, such as relation extraction (Bing et al., 2015; X. Han and Sun, 2016), knowledge base completion (Hoffmann et al., 2011), and sentiment analysis (Severyn and Moschitti, 2015).
In the context of neural IR, weak supervision is used to address the absence of the large labeled datasets needed to train neural networks effectively. Such large datasets are expensive to obtain, and thus unsupervised learning is considered a long-standing goal for several applications (Dehghani et al., 2017). More specifically, weak supervision in neural IR means that we take advantage of an existing unsupervised IR model, such as BM25, which we use as a “pseudo-labeler”. Given a target collection of documents and a set of training queries, the pseudo-labeler is used to rank the documents for every query in the training set. The objective is to train a classifier using the scores obtained from the pseudo-labeler as weak supervision signals. As an example, for a query “dogs”, our pseudo-labeler retrieves the following three documents: “dogs are good”, “dogs eat bones”, “dogs love humans”. The created dataset will then be a set of tuples (q, relevant passage): (dogs, dogs are good), (dogs, dogs eat bones), (dogs, dogs love humans). These tuples will be used to train a binary classifier for distinguishing between relevant and irrelevant documents.
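The pseudo-labeling procedure can be illustrated with a short sketch. Note that the scorer used here is a simple term-overlap count that merely stands in for BM25, and that the names `overlap_score` and `weak_label` as well as the extra document about cats are hypothetical and introduced only for this example.

```python
def overlap_score(query, doc):
    """Toy relevance score (a stand-in for BM25, not BM25 itself):
    the number of query terms that occur in the document."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)


def weak_label(queries, collection, top_k=3, scorer=overlap_score):
    """Rank the collection for every query with the pseudo-labeler and
    keep the top-k scoring documents as weakly labelled relevant passages."""
    pairs = []
    for query in queries:
        ranked = sorted(collection, key=lambda d: scorer(query, d), reverse=True)
        for doc in ranked[:top_k]:
            if scorer(query, doc) > 0:  # skip documents with no match at all
                pairs.append((query, doc))
    return pairs


collection = ["dogs are good", "dogs eat bones", "dogs love humans", "cats sleep all day"]
pairs = weak_label(["dogs"], collection)
```

With this toy collection, `pairs` contains exactly the three (dogs, ...) tuples from the example in the text, while the unrelated document is left out.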
2.7.1 Wikipedia-Based Weak Supervision Signals
Frej et al. (2019) used Wikipedia to create test collections for IR by utilizing Wikipedia’s internal linkage to create query topics. This thesis is inspired by that work in using Wikipedia as a source for automatically creating an artificial dataset for training a neural classifier for IR. Here, the weak labels are constructed not with an unsupervised algorithm, as Dehghani et al. (2017) did with BM25, but by exploiting Wikipedia’s internal structure. More details regarding the methodology are presented in chapter 3.
2.7.2 Previous Work
Weak supervision in IR is an active area of research, and several weakly-supervised alternatives have been explored so far.
Dehghani et al. (2017) utilized BM25 to retrieve documents to construct their weak training dataset. K. Zhang et al. (2020) used anchor texts and their linked web pages to construct their weak supervision signals. Ma et al. (2020) introduced a zero-shot retrieval approach based on synthetic query generation, training a generative model on separate community QA data. Frej et al. (2019) exploited Wikipedia’s internal linkage to create query topics. Nogueira and Cho (2019) trained a BERT model for passage re-ranking and achieved state-of-the-art results on the MS Marco dataset.
This thesis is inspired mostly by the work of Nogueira and Cho (2019) and Frej et al. (2019), though it differs from them substantially. Nogueira and Cho (2019) train a BERT model with supervision on MS Marco (a human-labelled dataset that we describe in detail in the next chapter) for the passage-reranking task and evaluate on MS Marco. In our work, we first fine-tune the BERT model with Wikipedia-based weak supervision and then evaluate it in a zero-shot fashion on the MS Marco dataset; in addition, after training on Wikipedia we further fine-tune our model on the MS Marco dataset and evaluate on MS Marco again. Our main differences from Frej et al. (2019), on the other hand, lie in the methodology for creating the weak supervision signals from Wikipedia and in the fact that they also use Wikipedia for evaluating their models. Specifically, Frej et al. (2019) implement a complicated method that utilizes the internal linkage of Wikipedia to build an IR collection, consisting of a training and a test set, and then perform various experiments on it, while in our work we utilise the internal structure of Wikipedia’s articles to build the weak supervision signals.
The next chapter will explain our procedure in detail.
3 Methodology
In this chapter, we explain in detail our methodology for creating the Wikipedia-based IR training dataset and training the BERT retrieval models. As the basis for all of our systems, a $BERT_{SMALL}$ model is used (downloaded from the official GitHub repository of Google AI¹), and all the pre-processing and training is done using the TensorFlow library².
The architectural overview of our pipeline is as follows: (1) Wikipedia-based dataset creation, (2) MS Marco preparation, (3) Data pre-processing, (4) TPU training, (5) Experimental systems.
• Wikipedia-based dataset creation: We download the latest Wikipedia dump which we pre-process following the methodology described in section 3.4 to create the Wikipedia training dataset.
• MS Marco preparation: After downloading the training and development datasets of MS Marco, we divide the development set into two sets: 100 queries for the development set and 6,880 queries for the test set, as we explain in section 3.3.
• Data pre-processing: Before training our different BERT models, all the training data is converted first to the necessary format that BERT needs, and then to the TFRecord format for boosting the speed of the TPUs.
• TPU training: Training a BERT model on a large amount of data is a slow and computationally expensive procedure. For this reason, we train all models in Google Cloud using its TPUs.
• Experimental systems: Having both Wikipedia and MS Marco in TFRecord format, we train three different BERT models: (1) a model trained on Wikipedia data, (2) a model trained on both Wikipedia and MS Marco data, and (3) a model trained only on MS Marco data, as explained in section 3.7.
3.1 Task
As mentioned earlier, the passage-reranking task is the second phase of a question answering pipeline. Such a pipeline consists of three phases (Nogueira and Cho, 2019): (1) passage-retrieval, in which a large number of candidate passages are pooled using a computationally cheap method (e.g. BM25 or TF-IDF), (2) passage-reranking, in which these passages are re-ranked by a more sophisticated method (such as a neural network), and (3) answer extraction, in which the top-n passages are fed to an answer extraction algorithm that marks up the answer.
3.2 BERT for Passage-Reranking
Given a list of retrieved passages, the aim in the passage-reranking phase is to calculate a relevance score $s_i$ of a candidate passage $d_i$ for a query $q$. Using the theory presented in section 2.5, $BERT_{SMALL}$ is used as a binary classifier in a point-wise fashion, that is, the [CLS] vector is used as input to a single-layer neural network to obtain the probability of the passage being relevant, and the loss function looks at one document at a time, independently of the other documents.
1
https://github.com/google-research/bert
2
https://www.tensorflow.org/
More specifically, the query is fed to the classifier as sentence A and the passage text as sentence B. The query is truncated to at most 64 tokens, and the passage is truncated such that the concatenation of the query and the passage amounts to at most 512 tokens. The publicly available pre-trained BERT model is used as the basis, and a re-ranker is fine-tuned using the cross-entropy loss:
$$L = -\sum_{j \in J_{pos}} \log(s_j) - \sum_{j \in J_{neg}} \log(1 - s_j) \qquad (3.1)$$
in which $J_{pos}$ represents the set of indexes of the relevant passages and $J_{neg}$ represents the set of indexes of the non-relevant passages in the top-1,000 documents retrieved with BM25 (Nogueira and Cho, 2019).
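As a worked illustration of Equation 3.1, the following sketch computes the point-wise cross-entropy loss for the candidate passages of one query. It is written in plain Python for readability rather than in TensorFlow; the function name and the example scores are assumptions chosen only for the example.

```python
import math


def pointwise_cross_entropy(scores, positive_indexes):
    """Loss of Eq. 3.1: -sum_{j in J_pos} log(s_j) - sum_{j in J_neg} log(1 - s_j).

    `scores` holds the predicted relevance probability s_j of every candidate;
    every index that is not in `positive_indexes` is treated as non-relevant.
    """
    positive = set(positive_indexes)
    loss = 0.0
    for j, s in enumerate(scores):
        loss -= math.log(s) if j in positive else math.log(1.0 - s)
    return loss


# Hypothetical scores for three candidate passages; only passage 0 is relevant.
example_loss = pointwise_cross_entropy([0.9, 0.2, 0.1], positive_indexes=[0])
```

Each passage contributes its own term to the sum, which is what makes the objective point-wise: no term compares two passages with each other.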
3.3 Datasets
In this section, we describe the training datasets used in our experiments: (1) the MS Marco dataset, where we train our in-domain BERT model, and (2) the Wikipedia-based dataset where we train our BERT model for the zero-shot experiments. In addition, we describe both the development and the test set.
3.3.1 MS Marco Training Dataset
The MS Marco dataset (Bajaj et al., 2016) is composed of ∼400 million query-passage pairs, in which the passages are marked for being relevant or irrelevant.
Nonetheless, to make our experiments easier, we use a smaller release of MS Marco, which is ∼10% of the full dataset (the original MS Marco dataset is more than 270 GB). This release is called ‘triples.train.small.tsv’ and was downloaded from the official GitHub repository of Microsoft³. In addition, from this smaller MS Marco version we use only the first 20,000,000 query-passage pairs, so that it is equal in size to our Wikipedia-based dataset, which we describe next.
3.3.2 Wikipedia-based Training Dataset
The Wikipedia-based training dataset consists of 20,000,000 query-passage pairs, where each query is the concatenation of a Wikipedia article’s title and the title of one of its sections, while the associated ‘relevant’ passage (in quotation marks since this is our assumption) is the section’s passage. We describe our methodology for building this dataset thoroughly in section 3.4.
3.3.3 Development and Test set
All experiments performed in this thesis are evaluated using the development and the test set from the MS Marco dataset. In the official MS Marco dataset, the development set contains 6,980 queries, each associated with the top 1,000 passages retrieved with BM25 from the MS Marco collection. Each query has on average one relevant passage, while some have none: the corpus was initially constructed by retrieving the top-10 passages from the Bing search engine and then annotating them, so some of the relevant passages might not be retrieved by BM25. The official MS Marco release also contains a test set that consists of ∼6,800 queries and their top 1,000 retrieved passages, though its relevance judgements are not publicly available.
3
https://microsoft.github.io/MSMARCO-Passage-Ranking/
For the above reason, we create a test set out of the official development set by dividing it into two sets of 100 and 6,880 queries, which are used in the experiments as development and test set respectively. Using a development set of 100 queries instead of the original 6,980 does not pose a problem as far as size is concerned, since prior work indicates that 50 queries is a sufficient minimum (Buckley and Voorhees, 2017).
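A split of this kind can be sketched as follows. The text does not state how the 100 development queries were selected, so the seeded random shuffle used here is an assumption, as are the function name and the use of consecutive integers in place of real MS Marco query ids.

```python
import random


def split_queries(query_ids, dev_size=100, seed=0):
    """Split query ids into a small development set and a test set.

    Assumption: a reproducible random shuffle; the selection procedure
    actually used in the experiments is not specified.
    """
    ids = sorted(query_ids)            # fix the order before shuffling
    random.Random(seed).shuffle(ids)   # local generator, reproducible
    return ids[:dev_size], ids[dev_size:]


# Placeholder ids standing in for the 6,980 official development queries.
dev_ids, test_ids = split_queries(range(6980))
```

Splitting at the level of query ids keeps all 1,000 candidate passages of a query in the same set.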
3.4 Creating the Wikipedia-based dataset
The procedure for creating the artificial Wikipedia-based training corpus for IR used in our experiments is as follows:
Let $W = \{w_1, w_2, \ldots, w_n\}$ be a set of Wikipedia articles, with each article $w_i$ containing a main title $t_i$ and $j$ section titles, i.e. $\mathit{sectionTitle}_{i,j}$, and each of the section titles being associated with a section passage $\mathit{sectionPassage}_{i,j}$. Let $Q = \{q_{1,1}, q_{1,2}, \ldots, q_{i,j}\}$ be the set of the artificial Wikipedia-based user queries, such that $q_{i,j}$ is the concatenation of a Wikipedia title $t_i$ with one of its $\mathit{sectionTitle}_{i,j}$. Let now $P = \{p_{1,1}, p_{1,2}, \ldots, p_{i,j}\}$ be the set of the associated relevant passages, with $p_{i,j}$ being the Wikipedia $\mathit{sectionPassage}_{i,j}$ corresponding to the $\mathit{sectionTitle}_{i,j}$; we call $p_{i,j}$ the “relevant passage” to the query $q_{i,j}$. Let now $\Psi = (\psi_{1,1}, \ldots, \psi_{i',j'})$ be a sequence of associated irrelevant passages, with $\psi_{i',j'}$ being a Wikipedia $\mathit{sectionPassage}_{i',j'}$ such that $\mathit{sectionPassage}_{i',j'} \neq p_{i,j}$; we call $\psi_{i',j'}$ the “irrelevant passage” to the query $q_{i,j}$. Now, let $A$ be our artificial IR training dataset, that is, a set of triplets $(q_{i,j}, p_{i,j}, \psi_{i',j'})$ with $q_{i,j}$ being the artificial user query, $p_{i,j}$ being its associated relevant passage, and $\psi_{i',j'}$ being the associated irrelevant passage:
$$A = \{(q_{1,1}, p_{1,1}, \psi_{1,1}), \ldots, (q_{i,j}, p_{i,j}, \psi_{i',j'})\} \qquad (3.2)$$
We build the training dataset $A$ in two steps: (1) we parse the Wikipedia dump and create a temporary file $\alpha$ with the $(q_{i,j}, p_{i,j})$ pairs, and (2) we create another temporary file $\beta$ with the irrelevant passages $\psi_{i',j'}$ and concatenate the two files $\alpha$ and $\beta$, resulting in the training dataset $A$.
Creating the $(q_{i,j}, p_{i,j})$ pairs
In the first step (1), we start by downloading the latest Wikipedia dump from the official Wikipedia repository⁴. Since the Wikipedia dump comes in XML format, it needs to be parsed in order to obtain clean text. For this reason, we use an open-source Wikipedia parser⁵. We run the parser with the configuration --sections --filter_disambig_pages to preserve the sections of every article and to filter out the disambiguation pages of Wikipedia, since they do not have any sections and therefore cannot be used to create queries. Extracting the clean text from Wikipedia takes around 6-8 hours, depending on the hardware.
Having obtained the clean text, we use simple regex rules to parse it and extract the artificial queries (the article title and the section title concatenated, as mentioned before) together with their relevant passages (the section passages). For example, to identify a section we match the pattern ‘Section::::’ that exists in the clean text, and use it to extract the section passage.
Figure 3.1: Extracting the $(q_{i,j}, p_{i,j})$ pairs from a Wikipedia article. The pairs extracted from the above article would be (“MissingNo. History”, “Developed [...] games”), (“MissingNo. Characteristics”, “A player [...] ”) etc.
4
https://dumps.wikimedia.org/enwiki/
5
https://github.com/attardi/wikiextractor
After running the whole procedure over the entire Wikipedia, we end up with ∼12,000,000 query-passage pairs $(q_{i,j}, p_{i,j})$ saved in the file we named $\alpha$.
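The extraction of the query-passage pairs can be sketched as below. This is an illustrative reconstruction and not the script used in the experiments: apart from the ‘Section::::’ marker mentioned above, the layout of the clean text (one `<doc ... title="...">` line opening each article and a `</doc>` line closing it) is an assumption about the parser output, and the sample text is a placeholder modelled on Figure 3.1.

```python
import re

# Assumed layout of the clean text; only the 'Section::::' marker is
# taken from the description above.
DOC_START = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>')
SECTION = re.compile(r'^Section::::(.+?)\.?$')


def extract_pairs(clean_text):
    """Return (query, passage) pairs, where query = article title + section title."""
    pairs = []
    title = section = None
    buffer = []

    def flush():
        # Emit the section collected so far, if it has any text.
        if title and section and buffer:
            pairs.append((f"{title} {section}", " ".join(buffer)))

    for line in clean_text.splitlines():
        line = line.strip()
        start = DOC_START.match(line)
        heading = SECTION.match(line)
        if start:
            flush()
            title, section, buffer = start.group(1), None, []
        elif line == "</doc>":
            flush()
            title, section, buffer = None, None, []
        elif heading:
            flush()
            section, buffer = heading.group(1), []
        elif line and section:
            buffer.append(line)  # text before the first section is skipped
    flush()
    return pairs


sample = """<doc id="0" url="" title="MissingNo.">
MissingNo.
Section::::History.
Developed [...] games
Section::::Characteristics.
A player [...]
</doc>"""
wiki_pairs = extract_pairs(sample)
```

On the placeholder sample, `wiki_pairs` holds the two pairs given in the caption of Figure 3.1.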
Creating the irrelevant passages $\psi_{i',j'}$
Since we train a binary classifier (relevant-irrelevant), our training corpus $A$ needs to have an associated irrelevant passage for each artificial query. We obtain these in the following simple way: we loop over the file $\alpha$ created in step (1), starting from the 800,000th index (a number we picked at random), and save the associated passages until we reach 12,000,000 instances (the same number of instances as in file $\alpha$). We save these instances in a new file called $\beta$. Because of this offset, the passage in the $i$-th position of $\beta$ comes from a different position of $\alpha$ than the passage $p_{i,j}$ in the $i$-th position of $\alpha$, and we treat it as irrelevant to the corresponding query. As a final step, we concatenate the files $\alpha$ and $\beta$ and end up with our artificial training dataset of 12,000,000 lines of triplets $(q_{i,j}, p_{i,j}, \psi_{i',j'})$, amounting to ∼20 gigabytes of data.
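The offset-based pairing can be sketched in a few lines. The sketch works on an in-memory list instead of the files $\alpha$ and $\beta$, and it assumes that the loop wraps around to the beginning of $\alpha$ once the end is reached, which is one plausible reading of the description above; the function name and toy data are hypothetical.

```python
def add_negatives(pairs, offset):
    """Turn (query, relevant) pairs into (query, relevant, irrelevant) triplets
    by borrowing the passage that sits `offset` positions further down,
    wrapping around at the end of the list (an assumption)."""
    n = len(pairs)
    if n == 0 or offset % n == 0:
        raise ValueError("offset must not map a passage onto itself")
    triplets = []
    for i, (query, relevant) in enumerate(pairs):
        irrelevant = pairs[(i + offset) % n][1]
        triplets.append((query, relevant, irrelevant))
    return triplets


toy_pairs = [(f"q{i}", f"p{i}") for i in range(5)]
toy_triplets = add_negatives(toy_pairs, offset=2)
```

For the full dataset, `offset` would correspond to the 800,000 positions mentioned above. The method only guarantees that the borrowed passage comes from a different position; it does not check whether that passage happens to be topically related to the query.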
3.5 Data pre-processing
Before training our models, we first need to convert the training data to the appropriate BERT format, and then to the TFRecord format to be consumed by the TPUs. The BERT format is required by the BERT model, while the TFRecord format is recommended for optimising the training of TensorFlow models⁶.
6
https://cloud.google.com/architecture/best-practices-for-ml-performance-cost
Pre-processing happens in one pass: we wrote a Python script, called "preprocessor.py", which accepts a list of strings as input and outputs a TFRecord file. Google AI provides the necessary code for these conversions in their official GitHub repository⁷, which one can modify and customise according to one's specific needs. Note that there is no straightforward ready-made implementation for converting a string to the BERT format and then to the TFRecord format; rather, one must first become familiar with the logic of the code that Google AI provides before being able to re-use it and adjust it.
3.6 Training on TPUs
Despite their impressive performance, BERT models are quite slow at training and inference time (W. Liu et al., 2020). To address this in our experiments, we make use of Google’s TPUs⁸. As an example, fine-tuning just one of our BERT models on a TPU takes around 20 hours for 20 gigabytes of data, while, based on the metrics of Google⁹, training on a GPU would have taken 300 hours (∼13 days). As far as inference time is concerned, annotating the development set of 100,000 instances (100 queries × 1,000 passages per query) with the BERT relevance classifier fine-tuned on TPU takes around 5 minutes, while the test set of 6,880,000 instances takes around 50 minutes. Using GPUs, the equivalent times would be ∼1 and ∼10 hours respectively.
Although Google provides free TPU usage through Google Colab¹⁰, it was not feasible to use it because of the limits it places on continuous usage¹¹. For this reason we used TPUs through Google Cloud. In total we spent $500 to perform all of our experiments (Google Cloud provides a $300 free trial).
3.7 Experimental Systems
To investigate our first research question regarding the possibility of transfer learning with BERT in the context of IR by using Wikipedia as a pre-fine-tuning step, we use the following two systems:
• Default: This model is used “as is” with its “out-of-the-box” weights and without any additional training.
• Wiki: A model that we train with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based dataset.
For our second research question, regarding the extent to which pre-fine-tuning on Wikipedia data for the passage-reranking task could improve a model in terms of (1) accuracy and (2) convergence or stability, we train the following two models:
• Wi+Ma: A $BERT_{SMALL}$ model that is first pre-fine-tuned with weak supervision on 20,000,000 query-passage pairs from our Wikipedia-based dataset, and then further fine-tuned on an additional 20,000,000 query-passage pairs from the MS Marco dataset.
• Marco: A $BERT_{SMALL}$ model that is fine-tuned only on 20,000,000 query-passage pairs from the MS Marco dataset.
7
https://github.com/tensorflow/models/tree/master/official/nlp/data
8
https://cloud.google.com/tpu/
9
https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip
10
https://colab.research.google.com/
11
https://research.google.com/colaboratory/faq.html#resource-limits
3.8 Experimental Settings
Following the settings of Nogueira and Cho (2019), a $BERT_{SMALL}$ model is fine-tuned using a TPU v2.8 in Google Cloud, with a batch size of 128 (128 sequences × 512 tokens = 65,536 tokens/batch) when we fine-tune the model on MS Marco, and a batch size of 32 when we fine-tune on Wikipedia.
ADAM (Da, 2014) is used with the initial learning rate set to $3 \times 10^{-6}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. A dropout probability of 0.1 is used on all layers. For the fine-tuning on Wikipedia we change the learning rate to $10^{-6}$.
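The learning-rate schedule described above (linear warmup followed by linear decay) can be written out as a small function. This is only a sketch of the general shape: the total number of training steps is not stated in this section, so `total_steps=100_000` is a hypothetical value, and decaying all the way to zero is likewise an assumption.

```python
def learning_rate(step, peak=3e-6, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup to `peak` over `warmup_steps`, then linear decay.

    `total_steps` and the decay-to-zero endpoint are assumptions made
    for illustration only.
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak * remaining / (total_steps - warmup_steps)


# Learning rate at a few points of the schedule.
schedule = [learning_rate(s) for s in (0, 5_000, 10_000, 55_000, 100_000)]
```

For the Wikipedia fine-tuning, the same function would be called with `peak=1e-6`.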
3.9 Evaluation Methods
In the passage-reranking task we want to measure at which position the trained retrieval models rank the relevant passage, that is, the answer to the user’s query. As mentioned in section 3.3.3, every query is associated with 1,000 passages, which contain on average one relevant passage. In addition, the relevance judgment associated with a passage is binary (relevant, non-relevant). Since in our experiments we are interested in finding the first relevant passage in the list, we picked the mean reciprocal rank (MRR) as our evaluation method, which puts a strong focus on the first relevant element of the list. Using MRR also makes our results comparable with those of the other papers on the MS Marco leaderboard, as the official MS Marco web page for the passage-reranking task states that the evaluation method should be the MRR metric¹².
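For reference, MRR can be computed with a few lines of code. The sketch below is not the official MS Marco evaluation script; the function name, the data layout (dictionaries keyed by query id), and the toy ids are assumptions, and the cut-off `k=10` mirrors the MRR@10 variant mentioned in the abstract.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k.

    rankings: query id -> passage ids, ordered from best to worst
    relevant: query id -> set of relevant passage ids
    Queries without a relevant passage in the top k contribute 0.
    """
    total = 0.0
    for query_id, ranked_passages in rankings.items():
        for rank, passage_id in enumerate(ranked_passages[:k], start=1):
            if passage_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break  # only the first relevant passage counts
    return total / len(rankings)


# Toy example with hypothetical ids: reciprocal ranks are 1/2, 1 and 0.
rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p7"], "q3": ["p0"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p5"}}
score = mrr_at_k(rankings, relevant)
```

In the toy example the three reciprocal ranks average to 0.5.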
12