
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Informationsteknologi

2021 | LIU-IDA/LITH-EX-A--21/051--SE

Evaluation of BERT-like models for small scale ad-hoc information retrieval

Utvärdering av BERT-liknande modeller för småskalig ad-hoc informationshämtning

Daniel Roos

Supervisor: George Osipov
Examiner: Cyrille Berger


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Measuring semantic similarity between two sentences is an active research field in which significant advances are made every year. This thesis looks at using modern methods of semantic similarity measurement for an ad-hoc information retrieval (IR) system. The main challenge tackled is answering the question "What happens when you don't have situation-specific data?". Using encoder-based transformer architectures pioneered by Devlin et al., which excel at fine-tuning to situationally specific domains, this thesis shows how well the presented methodology can work and makes recommendations for future attempts at similar domain-specific tasks. It also shows an example of how a web application can be created to make use of these fast-learning architectures.

Acknowledgments

I want to thank my supervisors from both SICK Linköping and Linköping University for their continuous help and good suggestions. Thank you also to my opponents for the much-appreciated feedback. Finally, I want to direct a special thank you to my friends and family for their support and encouraging words during the course of the thesis work.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
   1.1 Aim
   1.2 Motivation
   1.3 Research questions
   1.4 Delimitations
2 Background
   2.1 Language comparison technologies
   2.2 Evaluation techniques
   2.3 Related work
3 Method
   3.1 Incremental training
   3.2 Fine tuning transfer learning
   3.3 Applying the transformer to a web application
4 Results
   4.1 Incremental training
   4.2 Fine tuning transfer learning
   4.3 Applying the transformer to a web application
5 Discussion
   5.1 Results
   5.2 Method
   5.3 The work in a wider context
   5.4 Summarizing the discussion into answers to the research questions
6 Conclusion
   6.1 Future work

List of Figures

4.1 Showing Spearman correlation scores as a function of additional training data added for the STS dataset using batch size 16.
4.2 Showing Spearman correlation scores as a function of additional training data added for the STS dataset using batch size 4.
4.3 Showing how the training time on the STS dataset increases with additional training data.
4.4 Showing test time for the STS dataset.
4.5 Showing Spearman correlation scores comparing training on only STS, and ...

List of Tables

2.1 The cosine similarity measured between the bolded words after encoding via BERT. Bolded numbers are the ones referred to in the text.
2.2 An example series to be evaluated using Spearman correlation.
3.1 An example of data used to demonstrate the methodology presented in Section 3.3. Multiple queries for the same item are separated by semicolons.
4.1 Similarity measurements of each item against the query "Paint a fence" before fine-tuning. Higher number → more similar.
4.2 Similarity measurements of each item against the query "Paint a fence" after fine-tuning.

1 Introduction

SICK is a company that, among other things, delivers toolkits for vision-related diagnostics of production lines. For example, a potential customer managing a production line producing sticks might want to discard units whose width exceeds a certain threshold by more than 10 %. This is where the products that SICK produces can assist via image processing algorithms. A camera is mounted, an image processing algorithm is constructed and installed on the hardware, and the customer is now able to filter out the unwanted sticks.

Observing and solving many different types of problems of the above character has resulted in SICK producing a generalized toolkit that can be used without specific knowledge of what is happening "under the hood" in the image processing algorithms. While it simplifies the process immensely, this toolkit can also be very unintuitive for the uninitiated, and delivering the solution to a customer without assisting in its configuration can at times be difficult. Without an explicit knowledge background in image processing, it can be hard to know which tool goes where and with what configuration.

1.1 Aim

The topic of this thesis is to abstract the choice of which tools to use away from the user and instead allow them to focus on formulating the problems that they want to solve. Aiming to reduce the amount of time the user has to spend reading documentation and manuals, this thesis tries to find a solution that lets the user easily find what they want. The user interaction is meant to be as simple as possible while still surfacing the relevant information.

This is done via an input field where the user1 is requested to describe their situation in as much detail as possible (the user's query). The algorithm then provides suggestions for approaches to the problem based on the tools available in SICK's vision toolkits. For example, the query "Filter out sticks that have a width of over 12 cm." might result in a suggestion of a specific tool that enables length measurements.

1 "The user" might be anything from a knowledgeable machine-learning expert unfamiliar with the toolset, to ...


A tool in SICK's toolkit can seldom be used in isolation. Rather, it is most often combined with other tools to form more complex systems. These tool combinations are poorly documented in terms of practical use cases, and trying to predict all the tasks they can be applied to is complex. To tackle these issues, this thesis takes a more general view of the problem: a system is designed to allow the people tasked with constructing tool combinations to iteratively add more content to these combinations as more use cases are found. This could, for example, be in the form of documented use cases or common queries.

A question that then quite naturally follows is "How do we know when we have enough data?". It is a good question to ask, since data gathering for a new task can be expensive and time-consuming. The thesis investigates, through the use of external datasets, how different methods for similarity measurement behave when subjected to iteratively more data. The end goal is to see what kind of benefit SICK could expect to observe if they, down the line, fine-tune a model to the semantic similarity task. To do this, three different transformer models from the BERT family (BERT, DistilBERT, and RoBERTa) are tested using an iterative approach where they are fine-tuned and evaluated at pre-determined checkpoints. Two different usages of the models, inspired by work by Reimers and Gurevych [35], are considered: bi-encoders and cross-encoders.

Previously inserted content is compared at run-time with a user’s query through different methods. The problem of matching a query to some database is often called information retrieval (IR) [4, p. 6] or, more specifically, ad-hoc information retrieval. The focus of this thesis is to apply different semantic similarity measurement tools to this task and to evaluate how well they benefit from fine-tuning compared to each other.

1.2 Motivation

There are two main motivations for writing this thesis. Firstly, and perhaps mainly, it will hopefully act as a useful tool for the uninitiated to get an easier introduction to SICK's toolkit. Secondly, the comparative study might act as additional motivation for SICK to extend the available documentation, should the results show a benefit of doing so.

1.3 Research questions

This section outlines the research questions that will be answered during this thesis.

1. How well do BERT, DistilBERT and RoBERTa improve with fine-tuning? Could there be any significant benefit to fine-tuning given a realistically achievable data size? This question aims to emulate the situation at SICK, where a non-fine-tuned model is used.

2. Does fine-tuning on one dataset benefit testing on another, as long as the task is similar? This question aims to emulate using an external dataset to "prime" the model for the given task.

3. How could a system be designed to help users find a solution that can help their use case? Can their experience with the system be recorded or otherwise documented to assist in future use?

1.4 Delimitations

Here are some of the delimitations of this thesis. Most of them are addressed again in Section 6.1.

Only English language. Since the support for language processing is much greater for English than for other languages (e.g. Swedish), this thesis solely focuses on queries in English in order to maintain high quality.

Semantic similarity task. Asking the question "When do we have enough data?" is relevant to more tasks than semantic similarity. For the purpose of this thesis, this is the task that is evaluated.

Dataset used. Ideally, more datasets should be evaluated to see how general the conclusions made in this thesis are. Unfortunately, finding matching datasets proved difficult.

Available hardware. All tests were done on a GTX 1660 SUPER, which is sufficient for many models but not all. Some models could therefore not be included in the tests.

2 Background

This chapter will begin by describing ad-hoc information retrieval (introduced in the introduction) on a more formal level. After that, Section 2.1 will introduce some traditional approaches to the problem. Section 2.1.3 will give an overview of approaches involving deep neural networks, which eventually form the baseline algorithm for the tests run later in the thesis. The more modern approaches based on transformer neural networks (which are the approaches that this thesis evaluates) are presented in Section 2.1.4.

Based on the example presented in the introduction, the main problem that this thesis tries to solve can be formulated more precisely. Given a user query $q$ and a database $D = \{d_1, d_2, \dots, d_i\}$, where $d_x = \{d_{x1}, d_{x2}, \dots, d_{xj}\}$, $x \in \{1, 2, \dots, i\}$: how should $d_x$ be ranked in terms of relevance to $q$? A single entity $d_{xy} \in d_x$ is a string that in some manner describes $d_x$, $y \in \{1, 2, \dots, j\}$. Given the above representation, the problem can be divided into two parts: (1) calculate a representation of both $q$ and all $d_{xy} \in D$ (e.g. using word embeddings for semantic representations); (2) rank each $d_x$ based on its relevance to $q$ using its $d_{xy}$ and return the best-scoring results to the user. This chapter presents different common solutions to this problem.

The general task of making a computer understand written text is known as natural language processing (NLP) and is a widely studied field. Its applications are manifold; popular examples include:

• Machine translation - Taking a sentence as input and computing its translation in another language.

• Question answering - Taking a question as input and computing its answer from a set of facts.

• Paraphrase identification - Taking two sentences as input and computing if they are paraphrases of each other (i.e. if they contain the same semantic content).

• Information retrieval - Taking a query as input and computing a ranked list of documents likely to contain information about said query.

The difference between information retrieval and paraphrase identification can sometimes be quite hard to identify. Perhaps the easiest way to understand it is to think of information retrieval as the task, which uses paraphrase identification to achieve its goal. It is this combination that is studied in this thesis, and it is referred to simply as information retrieval henceforth.

2.1 Language comparison technologies

Section 1.1 introduced information retrieval (IR) briefly through a problem formulation specific to this thesis, and the introduction of this chapter gave a theoretical overview of the problem. This section describes IR in a more general way but also dives a bit deeper into some of the specifics related to this thesis.

In essence, the main goal of an information retrieval system is to retrieve a document or actual information that fully or partially matches the user's query [4, p. 2]. This is done through two main functions: (1) analyzing both source items and the user query, and (2) matching these and retrieving relevant items [4, p. 6], both of which can be implemented through many different methods. The field became popular in the research community as early as 1961 [19, p. 2], and a survey by Faloutsos and Oard in 1998 divided IR into two distinct groups: the first being the traditional approaches (full-text scanning, signature files, inversion and vector modeling using the cluster hypothesis1), and the second being more recent efforts trying to combine IR and NLP through the use of semantic information (specifically using latent semantic indexing (LSI) or neural networks) [11].

The rest of this section describes some common implementations, starting with older methods implementing direct word comparisons, then shortly describing latent semantic analysis, before introducing distributional semantic models and some of the ways the technology has been used in terms of IR. After that, modern technologies using recurrent neural networks (RNNs) and transformer-based models such as BERT are introduced on a conceptual level.

2.1.1 Direct word comparisons

When thinking about information retrieval, the most obvious solution would be to directly compare the words in $q$ to the words in $d_{xy}$ and score based on how many words match.

One popular and strong choice for direct word comparison is BM25 [44], explained in detail by Manning et al. [26]. One of the equations for finding the Retrieval Status Value (RSV) of a document $d$ for query $q$ presented there is

$$\mathrm{RSV}_d = \sum_{t \in q} \log\frac{N}{\mathrm{df}_t} \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{td}}{k_1\big((1-b) + b\,(L_d/L_{ave})\big) + \mathrm{tf}_{td}} \tag{2.1}$$

where

• $N$ is the number of documents,

• $\mathrm{df}_t$ is the number of documents where $t$ appears at least once,

• $\mathrm{tf}_{td}$ is the frequency of term $t$ in document $d$,

• $L_d$ is the length of document $d$,

• $L_{ave}$ is the average document length,

• $k_1$ is a positive tuning parameter deciding how much $t$'s frequency of appearance in $d$ should influence $\mathrm{RSV}_d$,

• $b$ is another tuning parameter which decides how much $L_d$ should influence $\mathrm{RSV}_d$.

1 The cluster hypothesis says that clusters of documents can be formed where documents close to each other are ...

Manning et al. offer a more detailed explanation [26, p. 232-234]. There are more advanced equations presented by Manning et al., but for the purpose of this thesis Equation 2.1 will suffice. The equation can be interpreted as scoring each term $t$ present in $q$ based on its "uniqueness" in combination with a document $d$. The more unique a relationship the two have, the higher the value.
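As a concrete illustration of Equation 2.1, the sketch below scores documents against a query with BM25. It is a minimal, self-contained interpretation of the equation; the helper names, the toy corpus and the default values of k1 and b are illustrative assumptions, not choices made in the thesis.

```python
import math
from collections import Counter

def bm25_rsv(query_terms, doc_terms, doc_freqs, n_docs, avg_len, k1=1.5, b=0.75):
    """Retrieval Status Value of one document for a query (Equation 2.1)."""
    tf = Counter(doc_terms)            # term frequencies tf_td
    L_d = len(doc_terms)               # document length L_d
    score = 0.0
    for t in query_terms:
        df_t = doc_freqs.get(t, 0)     # number of documents containing t
        if df_t == 0 or tf[t] == 0:
            continue                   # term contributes nothing here
        idf = math.log(n_docs / df_t)
        score += idf * ((k1 + 1) * tf[t]) / (k1 * ((1 - b) + b * L_d / avg_len) + tf[t])
    return score

# Toy corpus: rank two "documents" against the query "paint fence".
docs = [["paint", "a", "fence", "with", "a", "brush"],
        ["measure", "the", "width", "of", "a", "stick"]]
doc_freqs = Counter(t for d in docs for t in set(d))
avg_len = sum(len(d) for d in docs) / len(docs)
for d in docs:
    print(bm25_rsv(["paint", "fence"], d, doc_freqs, len(docs), avg_len))
```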

2.1.2 Latent semantic analysis

Latent semantic analysis (LSA) [5] is a method for capturing the semantic meaning of a document by truncating a matrix factorization obtained through singular value decomposition (SVD). In short, the SVD factorization can be used to dimensionally reduce an input matrix (often a term-per-document count) by truncating it and keeping only the largest singular values. This results in a dimensionally reduced version of the original matrix which describes how important and unique certain tokens are to a given document. More detailed introductions have been given by Landauer et al. [21] and Dumais and Nielsen [10].

In practice, SVD is often applied to a (sparse) term frequency (TF) matrix weighted by some global metric (e.g. inverse document frequency (IDF)). Including a global metric in the SVD truncation has been shown to increase performance significantly [9].
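The following sketch shows the LSA pipeline described above, using scikit-learn (an assumed dependency) and a toy corpus: a TF-IDF weighted term-document matrix is truncated with SVD, and documents are then compared in the reduced space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in the thesis setting these would be tool descriptions.
docs = [
    "measure the width of a stick",
    "measure the length of an object",
    "paint a fence with a brush",
]

# (1) Global weighting: a TF-IDF term-document matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# (2) Truncated SVD keeps only the largest singular values (here 2),
#     giving a dense, low-dimensional "latent semantic" representation.
doc_vectors = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Documents (or a projected query) can now be compared with cosine similarity.
print(cosine_similarity(doc_vectors))
```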

2.1.3 Distributional semantic models

Semantic similarity between words can also be utilized through word vectors: the encoding of words into a high-dimensional vector space based on their semantic context. The two (LSA and DSMs) can often be separated into approaches learnt on term-document matrices (e.g. LSA) and approaches learnt on term co-occurrence data (e.g. DSMs) [7].

The vector representations have their roots in what is known as the distributional hypothesis, often formulated as "words which are similar in meaning occur in similar contexts" [15, 38], and have shown great success [37]. The research area is often called distributional semantic models (DSMs) and its implementations are sometimes referred to as word embeddings if advanced techniques (e.g. neural networks) are used for generating the vector representations instead of traditional vector computation [51]. Sahlgren [38] uses the following sentences as an example of paradigmatically related words; words that appear in similar contexts:

to have a splendid time in Rome
to have a wonderful time in Rome

Here, one can observe that the words wonderful and splendid appear in the same context. One can then use this information to make statements such as "the words splendid and wonderful have the same semantic meaning", using the distributional hypothesis. (Of course, in real-world applications much more data has to be processed before such a statement can be made with confidence.) Through word vectors, this concept of similarity is transformed into computable data points. In a word vector representation, each word in a vocabulary is represented by a vector whose 'direction' determines its semantic meaning. Semantic similarity2 between two words can then be measured through their cosine similarity [29, 46, 32].

This thesis does not dive further into the generative details of word-vector representations and instead uses pre-trained vectors, meaning that the transformations are previously trained and ready to use. One very popular example of an unsupervised algorithm used for training such vectors (or encodings) is GloVe, which outperforms other models on, for example, word similarity tasks [32]. The creators of GloVe supply many different pre-trained word vectors on their website3; this thesis uses the set included in Python's SpaCy module, which uses 300-dimensional vectors for 685 000 English terms4.
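A minimal sketch of this kind of word-vector similarity via spaCy is shown below. It assumes the en_core_web_lg model (which ships with pre-trained vectors) has been downloaded; the exact vector set used in the thesis is the one bundled with SpaCy as described above.

```python
# Install the model first with: python -m spacy download en_core_web_lg
import spacy

nlp = spacy.load("en_core_web_lg")
splendid, wonderful, stick = nlp("splendid wonderful stick")

# Cosine similarity between static word vectors: near-synonyms score high,
# unrelated words score low.
print(splendid.similarity(wonderful))  # high
print(splendid.similarity(stick))      # low
```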

Neural networks of different fashions are popularly used to generate word embeddings [50]. Some of the most well-known examples include the CBOW and skip-gram models, introduced and further improved by Mikolov et al. [29, 28], commonly referred to as word2vec. An extension of word2vec is doc2vec, presented by Le and Mikolov [22] as a paragraph-level vector representation, said to represent a paragraph better than the sum of its word vectors.

There are a few different ways to apply word embeddings to information retrieval tasks, some more fruitful and popular than others. The rest of this section covers a few of them and ends with a verdict regarding their relevance to this thesis. Note that some less successful methods are also cited here, for the sake of completeness.

One approach to applying word embeddings to information retrieval is through the comparison of documents and queries via their respective (summed) vector representations. That is: sum the word vectors for documents and query and rank via a similarity metric. Implementations of such concepts are done by Galke et al. [12], where documents are represented as their cluster centroids after being weighted by TF-IDF metrics5. The model is shown to be competitive with the TF-IDF baseline and outperforms both doc2vec [22] and WMD [20] on the IR task.

Roy et al. [36] discuss the downsides of such implementations and conclude that the representation of a large document as the sum of its word vectors "is not likely to be useful" because of the broad context presented in large documents. They instead propose a solution that represents documents as K clusters containing individual word vectors separated by topics6 and applies standard cluster distancing methods7 to calculate similarity to a query cluster. They then combine this similarity with a standard text-based similarity measure $P_{LM}$ (details can be found in Section 2.4 of [36]). Their results show marginal or no improvements when separating a document into 100 clusters/topics (K = 100) compared to just a single cluster (K = 1, representing a document as the average vector of its word vectors, or cluster centroid). The authors believe this is due to the need for a significantly higher number of clusters to accurately represent the number of topics present in each document. This makes sense, and it would have been interesting to see the authors spend more time developing this idea in order to validate the hypothesis. Overall, the proposed method shows promise compared to the LM similarity measure.

Another popular use of word embeddings is through query expansion [7, 53]. The idea of query expansion is to not modify the target documents in any way, and instead manipulate the query to feature terms of semantic similarity to the original query terms using the word embeddings, normally through cosine similarity. Thus, the queries are expanded to cover a wider group of terms without altering the semantic meaning of the query. The expanded query representations can then be compared to the corpus by using traditional IR methods such as the ones discussed in Section 2.1.1.
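The sketch below illustrates embedding-based query expansion of this kind. It assumes gensim and a pre-trained vector file in word2vec text format (the file path is purely illustrative).

```python
from gensim.models import KeyedVectors

# Pre-trained vectors converted to word2vec text format (path is illustrative).
vectors = KeyedVectors.load_word2vec_format("glove.6B.300d.word2vec.txt")

def expand_query(query_terms, topn=2):
    """Add the topn nearest neighbours (by cosine similarity) of each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        if term in vectors:
            expanded += [word for word, _ in vectors.most_similar(term, topn=topn)]
    return expanded

# The expanded query can then be scored with a traditional method such as BM25.
print(expand_query(["paint", "fence"]))
```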

Xiong et al. [48] explain a common problem with using word embeddings for semantic similarity that often comes up in the context of IR: the variety of ways in which two words can be semantically similar to one another greatly impacts performance. They mention an example using cities: the words 'pittsburgh' and 'boston' could be argued to be semantically similar since they are both cities. However, if a user searches for 'pittsburgh hotel', the IR system should be aware of the semantic difference between 'pittsburgh' and 'boston', since the user will likely reject a document mentioning 'boston hotel'. They call this a problem with soft-match signals, and their implementation is an attempt to address it. They propose a kernel-based neural ranking model called K-NRM that effectively counts these similarities at different levels and uses these signals for ranking in a supervised training environment.

3 https://nlp.stanford.edu/projects/glove/
4 https://spacy.io/usage/linguistic-features
5 Here, the model being referred to is the one named IWCS by the authors, since this is their most successful implementation.
6 Clustering is done via K-means.

2.1.4 Beyond DSMs

A heavily criticized aspect of word embeddings as discussed thus far is the problem of context; or rather, the lack thereof. Generalized (or global, as Diaz et al. [7] refer to them) word embeddings risk misrepresenting words simply because of too wide a context. Diaz et al. instead propose training local embeddings for more contextualized representations. The last sentence of their paper sums up the opinion quite well:

Instead of using a “Sriracha sauce of deep learning,” as embedding techniques like word2vec have been called, we contend that the situation sometimes requires, say, that we make a béchamel or a mole verde or a sambal — or otherwise learn to cook.

This opinion, and others like it, gave rise to a new goal: what if the interpretation of a word could change depending on its context? Or rather, in terms more relevant to the field: what if the vector representation of a word embedding changed based on its neighboring words in a given sentence? The rest of this section explains the path toward modern-day methods, and how they can be applied to IR.

Before introducing how the above problem has been successfully addressed, I shift focus slightly. The problem of context, as described above, has been observed from many different perspectives in the NLP community. For example, the NLP task of machine translation suffers greatly from an inability to form grammatically correct sentences without context; the same sentence in two languages often does not map its translation word-for-word. The use of recurrent neural networks (RNNs) became the state-of-the-art method for solving these contextual problems, since they introduced the possibility of sequentially analyzing text, thus - in some manner - introducing context. Enhanced with signals from long short-term memory (LSTM) or gated recurrent unit (GRU) cells, the RNNs could "remember" context even better. By splitting the RNNs into an "encoder" encoding the original language and a "decoder" decoding the data into the target language, the idea of attention was born8. Attention, in this context, refers to a decoding RNN's ability to observe earlier "hidden states" of the encoding process, thus making it easier to respect the non-linearity of word relationships. The idea was presented by Bahdanau et al. [1] and refined by Luong et al. [25]. A powerful example is presented in the first of the two, where the method is shown to correctly translate the English sentence

The agreement on the European Economic Area was signed in August 1992.

into the French sentence

L'accord sur la zone économique européenne a été signé en août 1992.

where the order of the words European Economic Area has been correctly swapped in the French translation, clearly exemplifying the attention mechanism. The method was further refined and generalized into language model (LM) usage through models such as ELMo [33] and ULMFiT [18].

One huge detail with RNN-based contextualization, which eventually led to the development of its replacement, is the fact that it is a sequential architecture; the result of one iteration is in part a function of the output of the previous one (commonly on a letter or word level). This has three major effects:

1. poor parallelization opportunities, since one iteration cannot be performed before the previous one finishes,

2. gradient vanishing/explosion, since the final output has a huge sequential calculation history to apply gradient descent to, and

3. even though the attention mechanism through LSTM and other gating mechanisms attempts to fix it, the amount of attention one iteration can apply to a word in the context of another is inherently dependent on their distance in a given sentence.

8 To be precise, the invention of the encoder-decoder architecture is more of a prerequisite of attention than a ...

The solution to the above-mentioned problems was, as it turns out, to entirely drop the RNN-based approach and instead give full focus to the principle of attention - or, more specifically, self-attention9. The paper "Attention is all you need" by Vaswani et al. [43] introduced a new approach to NLP with the transformer-based architecture. The transformer, originally created for machine translation (and tested on such datasets by Vaswani et al.), consists of two generalized building blocks: the encoder stack and the decoder stack; in the original paper, each stack consists of 6 identical layers. Similar to the RNN-based approaches, the encoder's task is to encode the original language, whereas the decoder's task is to decode the encoding into the target language.

2.1.4.1 The encoder

The encoder and decoder stacks are quite similar and so, for the purpose of this thesis, are explained as a single concept from the perspective of the encoder. Before the input is fed to the encoder stack, it is embedded using a learned embedding technique and positionally encoded in order to retain information about the word order. After embedding, the self-attention technique is applied directly. In this step, three different learned matrices10 generate representations of each word embedding. These representations are called query (Q), keys (K), and values (V) and compute attention as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

explained in further detail by Vaswani et al. [43] (their Equation 1). $d_k$ simply represents the dimensionality of Q and K and is applied to reduce the magnitude of the dot product. The concept of multi-head attention was applied in order to allow the model to attend to different parts of a given sentence, resulting in 8 different attention heads that are combined through concatenation and a single-layer neural network. Finally, a feed-forward layer is applied before the encoding process restarts in the next layer. Residual connections are applied through both the attention sub-layer and the final feed-forward sub-layer, possibly in order to minimize gradient vanishing/explosion. The result is an attention mechanism that doesn't depend on word distance and can be efficiently parallelized.
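The following numpy sketch makes the scaled dot-product attention above concrete. The shapes and random values are illustrative; in a real transformer, Q, K and V come from learned projections of the positionally encoded embeddings, and several such heads run in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every token with every other token
    weights = softmax(scores, axis=-1)  # attention distribution per token
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8                    # e.g. a 4-token sentence with 8-dimensional heads
Q, K, V = (rng.normal(size=(n_tokens, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)         # (4, 8): one contextualized vector per token
```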

9 It is important to note that the concept of self-attention was not created by Vaswani et al., merely used.
10 For easier understanding, they can essentially be viewed as single-layer neural networks without biases and activation functions.
11 There are many more changes made from the original transformer to BERT, such as a deeper stack (12 encoders instead of 6), longer embedding lengths (768 instead of 512), and more attention heads (12 instead of 8). (This is for BERT-Base; BERT-Large is even bigger.)

BERT One of the most widely used applications of the transformer is known as Bidirectional Encoder Representations from Transformers (BERT) [6]. The idea of BERT is to focus solely on the encoding part of the transformer, and it implements a variety of new concepts to make this feasible. Perhaps the largest contribution11 of BERT is the method used for training; it introduced the masked language model (MLM) pre-training objective, in which tokens were randomly masked (replaced with a [MASK] token) and the model was tasked with predicting which tokens were masked. It was also trained on next sentence prediction, a binary classification problem where the model was asked to predict if one sentence directly followed another. In short, BERT supplied the NLP community with a solid pre-training base for further fine-tuning; some examples include question answering, information retrieval, and sentiment analysis. Its encodings can also be used as a word embedder without necessarily defining a downstream task to train on. As an example of the power of BERT's word embeddings out-of-the-box12, consider the following five sentences:

1. Stick to the plan.
2. I wish we could just stick to what we said.
3. I will measure the length of this stick.
4. That's my stick.
5. That's my bat.

Since BERT's word embeddings are different depending on the context, the emboldened words will each have different encodings; each a 768-dimensional vector representation. Table 2.1¹³ shows the cosine similarities (higher number → more similar) between the different words, where BERT has correctly identified the two different semantic contexts of the word stick. The similarities tied to sentence 5 show that there is similarity between a bat and a stick in the given context, but the similarity between sentence 3 and sentence 4 is higher, meaning that the actual type of object was weighed higher. Sentences 1 and 2 show mutually exclusive highest similarities for each other, meaning that no other interpretations of the words were deemed as similar as those two to each other. The above observations are very important as they, in combination, demonstrate the power of contextualized embeddings.

Table 2.1: The cosine similarity measured between the bolded words after encoding via BERT. Bolded numbers are the ones referred to in the text.

S. #     1      2      3      4      5
1        1
2        0.57   1
3        0.27   0.31   1
4        0.29   0.30   0.68   1
5        0.15   0.11   0.46   0.60   1

12 The numbers shown in Table 2.1 were generated using the bert-base-uncased model from Huggingface, meaning no fine-tuning was done.
13 Note: Table 2.1 should not be interpreted as proof of the capabilities of BERT, but rather as an example of the concept being presented.
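Numbers like those in Table 2.1 can be reproduced, in principle, with the sketch below: contextual token embeddings are taken from the pre-trained bert-base-uncased model (no fine-tuning) and compared with cosine similarity. The helper assumes the word of interest corresponds to a single WordPiece token; everything else is illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence, word):
    """Return the 768-dimensional contextual embedding of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]               # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

a = word_embedding("Stick to the plan.", "stick")
b = word_embedding("I will measure the length of this stick.", "stick")
print(torch.cosine_similarity(a, b, dim=0).item())                # context-dependent
```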

2.1.4.2 BERT for sentence similarity

BERT generates token-level embeddings for each token in its vocabulary which are all subject to their context through the attention mechanism. However, since the task approached in this thesis is comparing sentence-level data (not token-level), the output of BERT has to be further processed in order to be able to compare sentences.

Reimers and Gurevych [35] expanded on the idea of representation through embeddings using Sentence-BERT (SBERT) for sentence-level embeddings, which reduced the time for pairwise comparison tasks dramatically while maintaining accuracy. They published their tools on GitHub14 and include examples for use cases such as text summarization, semantic search, clustering, and much more. Perhaps their main contribution is the objective functions used for training a BERT-like model to compute meaningful sentence representations. They use three different objective functions for fine-tuning BERT: classification, regression, and triplet; inspired by Schroff et al. [40].

Triplet and regression are the interesting functions for this thesis, since they tackle the semantic similarity task. Starting with triplet, the idea is to choose a triplet of sentences $(s_a, s_p, s_n)$ where one can make the educated guess that $s_a$ and $s_p$ are more similar to each other than $s_a$ and $s_n$. Similarity is measured through point-wise difference, and a margin $\epsilon$ is introduced to enlarge the loss slightly. The final loss function is formulated as

$$\max(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\ 0) \tag{2.2}$$

with the only addition being that the max function is used to remove negative loss. The usefulness of this loss function appears when combined with semi-supervised data. In the paper, Reimers and Gurevych use the function on data gathered by Dor et al. [8], where a triplet of sentences is generated from the same Wikipedia article: two from the same section ($s_a$ and $s_p$) and the last from another section of the article ($s_n$).

Regression loss is a simplified version of the above, where two sentences are embedded and compared with cosine similarity. This loss is useful for supervised training, where the target labels (sentence similarity levels) are known. This loss is used for the STSb [3] and AFS [30] datasets, which both supply supervised labels. The authors mention a problem with SBERT on the AFS dataset, where the argumentative structure of the sentences is a better fit for BERT than SBERT, since BERT can make better use of its attention mechanism.

The main philosophy of the research team is speed over accuracy, which is understandable given the enormous time difference and relatively low performance loss. The two approaches are separated into cross-encoders (making sentence comparisons inside the encoder, allowing for full use of the attention mechanism) and bi-encoders (fine-tuning using the above methods and embedding sentences individually). In a follow-up paper, they introduce AugSBERT [42], where the accuracy of a cross-encoder is used to generate "silver datasets" that can be used to fine-tune the SBERT model. This allows SBERT to better mimic the behavior of a cross-encoder while maintaining its speed, since the number of encodings required is still linear (instead of quadratic). This is similar to how Hinton et al. [17] introduced knowledge distillation, but varying the task instead of the model.
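The bi-encoder/cross-encoder distinction can be illustrated with the sentence-transformers library, as sketched below. The model names are public pre-trained checkpoints chosen for illustration, not the models fine-tuned in this thesis.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "Filter out sticks that have a width of over 12 cm."
docs = ["Tool for measuring object length.", "Tool for reading bar codes."]

# Bi-encoder: encode query and documents independently, then compare with cosine similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
scores_bi = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(docs))

# Cross-encoder: feed each (query, document) pair through the model jointly, letting
# attention operate across the pair (slower, usually more accurate).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
scores_cross = cross_encoder.predict([(query, d) for d in docs])

print(scores_bi, scores_cross)
```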

BERTSCORE [54] was designed to evaluate machine translators on their completeness during translation, in order to replace lexical approaches (e.g. BLEU). Assuming the first sentence is $x = (x_1, x_2, \dots, x_k)$ and the second sentence is $\hat{x} = (\hat{x}_1, \hat{x}_2, \dots, \hat{x}_m)$, the authors' approach to similarity measurement is similar to that of Reimers and Gurevych in that its first step is to compute vector embeddings for both sentences. After vectorization, a matrix of cosine similarities between each token pair in $x$ and $\hat{x}$ is constructed, giving a matrix $M_{x\hat{x}} \in \mathbb{R}^{k \times m}$. This matrix is then used to compute recall, precision and (as an effect) F1-score using greedy matching (i.e. choosing the largest possible value for each row/column, depending on whether recall or precision is calculated). In short, they introduce the following measurements:

$$R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} M_{x\hat{x}}, \qquad P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} M_{x\hat{x}} \tag{2.3}$$

The equations look slightly different in the paper since they compute cosine similarity using pre-normalized vectors, resulting in only the inner product being calculated. The method is shown to have great robustness when compared15 against other unsupervised methods, but supervised alternatives outperform BERTSCORE when trained on relevant data. The authors show that the best BERTSCORE method uses F1-score combined with an inverse document frequency (IDF) weighting, as shown in the paper.
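The greedy matching in Equation 2.3 reduces to row-wise and column-wise maxima of the similarity matrix, as the numpy sketch below shows (the matrix here is random; in BERTSCORE it holds cosine similarities between contextual token embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 5, 6                      # number of tokens in x and x-hat
M = rng.uniform(size=(k, m))     # M[i, j] = cos(emb(x_i), emb(xhat_j))

R = M.max(axis=1).mean()         # recall: best match for every token in x
P = M.max(axis=0).mean()         # precision: best match for every token in x-hat
F1 = 2 * P * R / (P + R)
print(R, P, F1)
```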


2.1.4.3 BERT variations

Taking a step back from sentence comparison, there has been tremendous work put into varying BERT in different ways. Below, a few of these variants are presented and compared to BERT.

DistilBERT To improve on training times, DistilBERT [39] was created, with the motivation of being usable on smaller devices such as smartphones. The authors show that, with a 40 % smaller model, they can retain 97 % of the performance while being 60 % faster. They obtain this primarily by halving the number of encoding layers, optimizing the matrix math, and using knowledge distillation [17], distilling from the original BERT model.

RoBERTa "RoBERTa: A Robustly Optimized BERT Pretraining Approach" [24] is a BERT variant whose authors questioned the pre-training objectives of BERT, specifically the next sentence prediction (NSP) objective. They found that removing this objective, and instead sampling contiguously until the maximum input size is hit (normally 512 word-piece tokens), improved results on several benchmarks. They also vastly expanded the volume of data used to pre-train the model, stating that BERT is significantly undertrained. Their model achieved (at the time) state-of-the-art results on several benchmarks, outperforming BERT by several percentage points, as can be found in the paper.

DeBERTa Yet another BERT variant, published only a few months before the writing of this thesis, improved on RoBERTa and set the record for state-of-the-art performance. Its name is "DeBERTa: Decoding-enhanced BERT with Disentangled Attention" [16] and, as of the writing of this thesis, it shares the top score on both the GLUE and SuperGLUE benchmarks; benchmarks measuring natural language understanding through, among other tasks, semantic similarity.16 It achieves this through a couple of methods. First, it separates the previously entangled word and position-encoding vector into two separate vectors, both with their own query and key vectors. Second, it uses this separation to calculate three17 different attention scores: content-content attention, i.e. traditional attention; content-position attention, where for example a word might request information about its surroundings; and position-content attention, where a position in a sentence might request information regarding surrounding words. These three are then added together to form the final attention matrix for the current layer.

2.2 Evaluation techniques

When talking about the evaluation of machine learning models, the most common metrics used are accuracy and F1-scores. However, since the models evaluated in this thesis are regressional18 and used for ranking purposes, it makes much more sense to use measurements that take this into account. To clarify: given a test scenario where each case is formed by a pair of sentences and a similarity score, having the model be evaluated on its distance from this score (using, for example, MSE) would be unwise, since the actual score is of no use to the end-user. It does not matter if a sentence pair is tagged with .9 or .1, as long as its relative position among the other pairs is the same.19 This is especially important when using BERT-like models, since these have been noted as having difficulty reflecting the large differences in similarity between different sentence pairs.20

16 For GLUE: DeBERTa holds second place after ERNIE by 0.1 points. For SuperGLUE: DeBERTa holds second place after a T5 [34] variant by 0.1 points. Details of how scores are calculated can be found in Section 3.4 of the GLUE paper [45].
17 Position-position attention is left out since the position vectors used are relative, meaning the encoding would be the same for all positions, which would result in just a static addition to the final attention matrix.
18 Regression-type models output numbers on a continuous spectrum instead of classifying input examples.
19 For a more hands-on example, consider a version of Table 2.2 where all predicted similarities are divided by 10.

Spearman correlation One such measurement is the Spearman correlation coefficient, named after Charles Spearman. It is especially useful in this case since it can be applied to any monotonic relationship, meaning that the two series being compared do not have to be linearly related to score a high Spearman score. This is because the Spearman correlation is calculated using the rank of each entry in the two series, instead of their absolute assigned values (which is how Pearson correlation works). The coefficient lies in the interval [-1, 1], where a high positive coefficient suggests a strong positive relationship between the two series (as one rises, so does the other); a high negative coefficient instead suggests a strong negative relationship (as one rises, the other falls). A coefficient close to 0 suggests there is no correlation between the two series (as one rises, the other rises or falls unpredictably throughout the series).

Table 2.2 presents a mock use case of Spearman correlation. "Predicted similarity" is the model's predicted relationship between the two sentences present in each pair A, B, C, D, and "Gold similarity" gives the gold standard values (i.e. how similar the sentence pairs actually are). Using these values, we can compute the Spearman correlation.

Table 2.2: An example series to be evaluated using Spearman correlation.

Pair Predicted similarity Predicted rank Gold similarity Gold rank

A 0.7 2 0.65 3

B 0.9 1 0.9 1

C 0.55 3 0.7 2

D 0.5 4 0.6 4

First, the two series (the predictions and the gold standard) are internally ranked by their values. This ranking can be seen in the columns "Predicted rank" and "Gold rank" in Table 2.2, where higher similarity → smaller rank. After this, the correlation $r_s$ is calculated the same way a Pearson correlation coefficient is calculated, only the rank values are used for calculating the coefficient instead of the similarities:

$$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference in rank for each entry and $n$ is the number of entries. For the example in Table 2.2, the Spearman correlation coefficient is:

$$r_s = 1 - \frac{6\,(1+0+1+0)}{4\,(16-1)} = 1 - \frac{12}{60} = 1 - 0.2 = 0.8$$

where 1 + 0 + 1 + 0 represents the element-wise squared rank differences between the two series. The resulting coefficient, 0.8, can be interpreted as the two series having a strong positive relationship. If, instead, the computed coefficient had been a large negative number, the two series would have been negatively related. As an exercise for better understanding of how Spearman correlation works, try computing the coefficient for the same data but with the order of the gold labels inverted. Also, try computing the coefficient for the two rank series 1 2 3 4 and 4 3 2 1.
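The worked example above can be checked quickly with scipy, as in the sketch below; it also illustrates why rank-based evaluation is insensitive to a rescaling of the predicted scores.

```python
from scipy.stats import spearmanr

predicted = [0.7, 0.9, 0.55, 0.5]   # pairs A, B, C, D
gold      = [0.65, 0.9, 0.7, 0.6]

rho, _ = spearmanr(predicted, gold)
print(rho)                           # 0.8, matching the hand computation above

# Only the rank order matters: rescaling the predictions leaves rho unchanged.
rho_scaled, _ = spearmanr([p / 10 for p in predicted], gold)
print(rho_scaled)                    # still 0.8
```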

20 The following link leads to a discussion where the creator of BERT comments on the issue: https://github.com/google-research/bert/issues/164#issuecomment-441324222. As examples, this issue can be observed in practice in two separate instances; here: https://blog.floydhub.com/when-the-best-nlp-model-is-not-the-best-choice/ and here: https://github.com/hanxiao/bert-as-service#q-the-cosine-similarity-of-two-sentence-vectors-is-unreasonably-high-eg-always--08-whats-wrong.

2.3 Related work

There has been much work done in the field of practical usage of information retrieval, some of it more relevant to this thesis than others (see, for example, the earlier mentions in Section 2.1.3). This section focuses more on applications.

Yang et al. [50] propose a model for suggesting similar bug reports using a combination of three different scoring methods. In their case, a bug report consists of a title and description (the "document"), as well as metadata containing extra information about the report. The total score is an aggregation of (1) the cosine similarity between two documents' TF-IDF vectors, (2) the cosine similarity between the two documents' semantic content (defined as the average vector of a document's word embeddings generated using the skip-gram model), and (3) similarity based on the metadata (either 0, 0.5 or 1). The model is interesting since it tackles a problem similar to this thesis: enhancing software engineering tasks through NLP techniques. There is some critique to be directed at their choice of score aggregation; they themselves mention that it is "unclear which of the scores are more important" and their solution is that "we simply add them up so that they have equal weight".

Another way to approach the problem would have been to use code generation. Say a user expresses a natural language query such as the example in Section 1.1; a model could be constructed to automatically generate an answer based on previously seen examples. Such approaches exist but are reserved for domains where there is extensive data to train on. For example, Yin and Neubig [52] use datasets such as DJANGO and WIKISQL, containing 18 805 and 80 654 annotated examples respectively. Using more recent work such as GPT-3 [2], there is also interesting work on utilizing the few-shot learning capabilities of these pre-trained models for downstream tasks [23], which would be interesting to apply in the context of code generation. Looking at the problem from the inverse direction, there has been interesting recent progress on using transformers for pseudo-code generation. A recent paper by Yang et al. [49] might be a good entry point for further research. Gemmell et al. [13] offer a concrete method for code generation using the transformer architecture and thus introduce what they call the Relevance Transformer.

A big part of the contributions of this thesis is fine-tuning encoder-focused transformers. Mosbach et al. [31] offer practical advice for how the BERT family of transformers behaves under different fine-tuning conditions and crush some commonly believed theories about fine-tuning stability. Based on their research, they advocate the use of (1) small learning rates (they mention the number 2e-5) using ADAM and (2) increasing the number of iterations and training to almost no training loss. For a further deep-dive into the inner workings of BERT when subject to fine-tuning, see Hao et al. [14]. For more details on fine-tuning BERT specifically for text classification (i.e., not the same task as this thesis covers), Sun et al. [41] offer a great starting point.

3 Method

This chapter is divided into three sections, one for each research question. All fine-tuning and other measurements were done on a GTX 1660 SUPER. Huggingface's Transformers [47] was used extensively throughout the process for its simplicity.

3.1 Incremental training

A method for incremental training was developed to answer the first research question. Firstly, the checkpoints at which the model was to be evaluated had to be identified. For this thesis, these values were arbitrarily chosen as (0%, 1%, 5%, 10%, 20%, 30%, 40%, 60%, 80%, 100%) in order to give more insight into the lower end of the spectrum. After that, the training data was split into buckets sized by how big of a leap the current step was from the last. So, for example, the first round of training contained only 1% - 0% = 1% of the training data; the second 5% - 1% = 4%; the third 10% - 5% = 5%; the fourth 20% - 10% = 10%, etc. Next, the models were trained incrementally, meaning that after training on one bucket, the model was evaluated using the test data before starting the training process again with the next bucket. The training data splits were chosen at random, and all tests were done twice with two different randomizations to support the solidity of the results. The reported results are the average of these runs. In short, the models were trained as follows:

1. Choose at which checkpoints the model is to be evaluated.

2. Generate training data buckets based on the checkpoints sized by their incrementation from the previous checkpoint.

3. For each bucket:

a) Re-use the model trained previously.
b) Train the model on the current bucket.
c) Evaluate the model on the test data.

The results were evaluated using Spearman correlation, explained in Section 2.2. For this task, the STS benchmark dataset [3] was used. The dataset contains 5749 training samples, 1500 dev samples, and 1389 test samples. Each sample contains two sentences and a similarity score ranging from 0 to 5 for how similar the two sentences are. In order to be usable as a loss signal, these scores were normalized by dividing by 5, since both cross-encoders and bi-encoders output scores from 0 to 1. The model was trained for 20 epochs per bucket of data, using the AdamW optimizer with a learning rate of $2 \cdot 10^{-5}$ (as suggested by Mosbach et al.) and $\epsilon = 10^{-6}$. No warmup was used. For bi-encoders, the regression loss mentioned in Section 2.1.4.2 is used. For cross-encoders, the problem is treated as a binary classification problem and trained using binary cross-entropy loss. Further following the advice of Mosbach et al., the methodology described above was applied for two different batch sizes, 16 and 4¹.
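A sketch of what this incremental bi-encoder fine-tuning loop can look like with the sentence-transformers library is given below. The data-loading helper load_stsb_split is hypothetical, and the checkpoint handling and evaluator are simplified; the real experiments also cover cross-encoders and several base models.

```python
import random
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

def to_examples(split):
    # load_stsb_split is an assumed helper returning (sentence1, sentence2, score 0-5) tuples.
    return [InputExample(texts=[s1, s2], label=score / 5.0)
            for s1, s2, score in load_stsb_split(split)]

train = to_examples("train")
random.shuffle(train)
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(to_examples("test"))

model = SentenceTransformer("bert-base-uncased")        # bi-encoder with mean pooling
loss = losses.CosineSimilarityLoss(model)                # regression objective

checkpoints = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.80, 1.00]
prev = 0.0
for frac in checkpoints:
    bucket = train[int(prev * len(train)):int(frac * len(train))]
    prev = frac
    loader = DataLoader(bucket, shuffle=True, batch_size=16)
    # The same model object is reused, so training is incremental across buckets.
    model.fit(train_objectives=[(loader, loss)], epochs=20,
              optimizer_params={"lr": 2e-5}, warmup_steps=0)
    print(frac, evaluator(model))                         # Spearman correlation on test data
```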

In order to get a measurement of usability, the different models were also timed during training and testing. Since testing was done on the same dataset throughout the training, the mean testing time will be reported, while training times will be reported for each increment size of training data.

Solutions to compare The bi-encoder and cross-encoder architectures from the Sentence-BERT paper [35] were both used for fine-tuning on the dataset. Three different BERT models were tested: BERT [6], DistilBERT [39] and RoBERTa [24], resulting in a total of six transformer-based solutions being compared. These were chosen based on the constant relevance of BERT-like transformer methods; both BERT and RoBERTa were considered state-of-the-art when they were published. DistilBERT is an exciting addition because of its creators' claims of reduced size and increased speed without performance loss. It would have been interesting to also consider DeBERTa, but due to hardware limitations this had to be dropped; the model's separation of content and position encodings proved to be too much for the GPU used during fine-tuning.

It would also have been interesting to evaluate BERTSCORE, but since its score calculation cannot be used to compute back-propagation gradients, the method cannot be fine-tuned2.

1 A lower batch size implies a higher number of iterations, which is what Mosbach et al. suggest.
2 The authors do leave instructions on how to fine-tune the method, but their proposed method of fine-tunement ...

The baseline that the models compare against uses the GloVe vectors [32] mentioned in Section 2.1.3 combined with the BertScore [54] methodology for calculating F1-scores.

3.2 Fine tuning transfer learning

Since there is no in-domain training data supplied by the company, it would be interesting to investigate whether unsupervised (but roughly similar) data can be used to improve the model for the downstream task. For this purpose, a second dataset has to be introduced - hence the name of this section. In other words, an attempt was made to find an answer to the second research question: can one dataset be used to benefit the result on another? For the results to be useful (i.e. offer an adequate answer to the second research question and be of use to the company), the dataset used to fine-tune the network should be both easily accessible and relevant to the task.

An interesting approach to this question is using the semi-supervised triplet loss (Equation 2.2) introduced earlier. The principle of incremental training introduced in the previous section is still applied, but the model was first fine-tuned on easily accessible data. The dataset chosen for this fine-tuning consists of Wikipedia articles, described by Dor et al. [8]. Since the dataset is much larger than those previously used (the training data consists of 1 779 418 samples), only the fastest and smallest model was used for training: DistilBERT. Even though cross-encoders are shown to train faster, DistilBERT was trained as a bi-encoder here, for the faster inference times3. After fine-tuning on the wiki-dataset, the model was fine-tuned on the STS dataset, the same as in the previous section. It was then compared to its equivalent from the previous tests, the hypothesis being that language comparison knowledge should be generalizable given a somewhat similar task structure.
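A sketch of this triplet-loss "priming" step with sentence-transformers is shown below. The triplet examples are invented placeholders standing in for the Wikipedia-sections triplets of Dor et al., and the single epoch and batch size are illustrative.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# (anchor, positive, negative): two sentences from the same Wikipedia section,
# one from a different section of the same article.
triplets = [
    InputExample(texts=["The camera is mounted above the line.",
                        "An overhead camera observes the conveyor.",
                        "The company was founded in 1946."]),
    # ... roughly 1.8 million such triplets in the real dataset
]

model = SentenceTransformer("distilbert-base-uncased")   # bi-encoder
loader = DataLoader(triplets, shuffle=True, batch_size=16)
loss = losses.TripletLoss(model)                         # Equation 2.2, with a margin

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)
# The resulting model is then fine-tuned incrementally on STSb as in Section 3.1.
```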

3.3

Applying the transformer to a web application

Combining the knowledge acquired from the previous two questions, the third research ques-tion tackles methodology for actually implementing such a system for use by an end-user. For this purpose, a popular Python library called Django was used to construct a web application. The application allows administrators to create Tool-items with a name and description, as well as Combinations of Tools with their own names and descriptions. Both Tools and Com-binations can have one or more Queries attached to them. The Query-text and all names and descriptions are also saved as vector representations of themselves in the database, mean-ing a Bi-encoder is used for efficiency purposes (again, discussed by Reimers and Gurevych [35]). When a user inputs a search query, the Tools and Combinations previously saved in the database are ranked (in two separate lists) based on their similarity to the query. Thus, an item I’s score is calculated as

\[
I^{q}_{\text{score}} = \text{sim}\big(\text{emb}(q), \text{emb}(I_{\text{description}})\big) + \text{sim}\big(\text{emb}(q), \text{emb}(I_{\text{name}})\big) \tag{3.1}
\]

where emb() is the vector representation (embedding) function and sim() is the cosine similarity function. Once a user clicks on a result and marks it as "correct" for the given query, the query is added to that item's list of Queries. When an administrator judges that enough data has been entered into the database, they can start the fine-tuning process through a script.
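As a hedged sketch of Equation 3.1, the score of a stored item for an incoming query could be computed as below; the item attributes and the stored-embedding handling are simplified stand-ins for the actual Django models.

# Sketch of Equation 3.1: sum of cosine similarities between the query embedding
# and the stored name/description embeddings. Item attributes are illustrative
# stand-ins for the Django model fields.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("distilbert-base-uncased")

def score_item(query, item):
    # item.name_embedding and item.description_embedding are assumed to have been
    # computed with the same bi-encoder when the item was saved to the database.
    query_embedding = model.encode(query, convert_to_tensor=True)
    return (util.cos_sim(query_embedding, item.name_embedding).item()
            + util.cos_sim(query_embedding, item.description_embedding).item())

# Tools and Combinations are then ranked separately, e.g.:
# ranked_tools = sorted(tools, key=lambda item: score_item(query, item), reverse=True)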

Since there are no similarity measurements to be used as labels, the item that each query belongs to instead acts as the label. Based on this idea, the dataset used for fine-tuning is constructed from two components: (1) all combinations of queries and their respective items' names and descriptions (as two separate entries per query) with label 1 (= "these are very similar"), and (2) for each query, a random (different) item is selected from the database to act as a negative sample; the same process as in (1) is repeated but with label 0 instead. The rest of the training process is identical to that of Section 3.1.
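A minimal sketch of this dataset construction is given below; the item attributes and the random negative sampling are illustrative simplifications of the actual implementation.

# Sketch of the fine-tuning data construction: each query is paired with its own
# item's name and description (label 1.0) and with the name and description of a
# randomly chosen other item (label 0.0). Item attributes are illustrative.
import random
from sentence_transformers import InputExample

def build_training_examples(items):
    examples = []
    for item in items:
        for query in item.queries:  # queries previously marked "correct" for this item
            negative = random.choice([other for other in items if other is not item])
            examples.append(InputExample(texts=[query, item.name], label=1.0))
            examples.append(InputExample(texts=[query, item.description], label=1.0))
            examples.append(InputExample(texts=[query, negative.name], label=0.0))
            examples.append(InputExample(texts=[query, negative.description], label=0.0))
    return examples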

Using this methodology, a small test case with three items and five queries was created to demonstrate the idea. The items, presented in Table 3.1, were all Tools, but the exact same principle would apply if Combinations were used instead. After the data was inserted into the database, the query "Paint a fence" was tried before and after fine-tuning, the hypothesis being that the model would separate the different items more clearly after fine-tuning than before.

Table 3.1: An example of data used to demonstrate the methodology presented in Section 3.3. Multiple queries for the same item are separated by semicolons.

Name        | Description                                                                                                | Queries
Shovel      | Used to dig holes.                                                                                         | Remove dirt; Dig a hole
Hammer      | This can be used to hammer in nails. Works best with wooden materials. Useful for constructing buildings. | Build house; Hammer nails
Paint brush | Used to paint with. With this you can paint either houses, paintings or anything you like.                | Draw a picture

³ This is because, in practice, each sentence only has to be encoded once. This is discussed extensively by Reimers and Gurevych [35].


4 Results

This chapter will present the results of the tests that were introduced in the previous chapter.

4.1 Incremental training

The following figures are color-coded by which transformer was used and line-styled by which inference method was used: dotted lines represent bi-encoders and solid lines represent cross-encoders.

Figures 4.1 and 4.2 show how adding more training data affects the Spearman correlation on the STS dataset for the different models. Note that 1 % of the training data translates to only 57 data points. As can be expected, all three cross-encoders start off extremely low (the RoBERTa and BERT versions even start with a negative Spearman correlation, meaning the two compared series (predictions and gold) were negatively correlated) but overtake the bi-encoders after around the 20 % mark, or about 1140 data points. Note that there is a missing value at the 100 % mark for the RoBERTa cross-encoder in Figure 4.1 due to an out-of-memory error. An interesting observation from this graph is the percentage-point increase between using 1 % and 100 % of the available data. For the bi-encoders, the average increase was roughly 9.5 percentage points. For the cross-encoders, this increase was larger, at roughly 19.1 percentage points. The baseline using GloVe vectors resulted in a Spearman correlation of about 0.49, shown together with the models in Figure 4.1.
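For reference, the reported correlations correspond to a standard Spearman computation over predicted similarities and gold scores, as in the sketch below (assuming both are available as equal-length sequences).

# Sketch: Spearman correlation between predicted similarities and gold STS scores.
from scipy.stats import spearmanr

def evaluate(predicted_similarities, gold_scores):
    correlation, _ = spearmanr(predicted_similarities, gold_scores)
    return correlation  # negative values mean the two rankings are inversely related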

The method of splitting the data into "buckets" and training iteratively was validated against a more naive approach where a new model instance was trained for each data increment. The latter showed similar results while taking much longer to train, since previous training checkpoints could not be reused.


Figure 4.1: Showing Spearman correlation scores as a function of additional training data added, for the STS dataset, using batch size 16.

Figure 4.2: Showing Spearman correlation scores as a function of additional training data added, for the STS dataset, using batch size 4.

Another metric recorded during training was the training time which, as mentioned in the Method chapter, might indicate the models' usability in a real-world use case.


Figure 4.3 shows how RoBERTa and BERT behave somewhat similarly for both inference methods, while DistilBERT is faster. Note that the x-axis of this graph has changed from the cumulative sum of training data used to the increments used. When two or more points share an increment, their times are averaged¹. The graphs can be interpreted as slopes indicating time taken per data point added (more specifically, seconds per percentage point of data), with the intercept indicating a static overhead independent of the number of data points, such as data and model load times. The top legend shows regression lines for RoBERTa+BERT and DistilBERT, from which one can draw conclusions such as "DistilBERT has almost half the overhead and twice the speed of the other models". All bi-encoders show slower training times than their respective cross-encoders.

Figure 4.3: Showing how the training time on the STS dataset increases with additional training data.

¹ For example, 10 % to 20 % and 20 % to 30 % are both increments of 10 %, so their timings would be parts of the same average.
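The slope-and-intercept reading above amounts to fitting a straight line to training time as a function of increment size; a minimal sketch with hypothetical measurements is given below.

# Sketch of the regression-line reading: slope = seconds per percentage point of
# data, intercept = static overhead (data and model loading). Numbers are hypothetical.
import numpy as np

increment_sizes = np.array([1, 4, 5, 10, 20, 30])        # percentage points added per step
training_times = np.array([35, 60, 68, 105, 190, 270])   # seconds, illustrative only

slope, intercept = np.polyfit(increment_sizes, training_times, deg=1)
print(f"~{slope:.1f} s per percentage point of data, ~{intercept:.0f} s static overhead")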


Finally, inference (testing) was timed in order to see whether there was any difference between the models at inference time, presented in Figure 4.4. Even though the main advantage of using a bi-encoder is not represented in this graph (since encoding can be done separately from the run-time comparison), the bi-encoders still outperform their cross-encoder counterparts. As can be expected, DistilBERT is significantly faster than the others.


4.2 Fine-tuning transfer learning

Figure 4.5 shows the result of first training on the wiki dataset and then incrementally training and testing on STS, to see if there is a benefit to using existing, similar data.

Figure 4.5: Showing Spearman correlation scores comparing training on only STS, and training on wiki first followed by training on STS.


4.3 Applying the transformer to a web application

Tables 4.1 and 4.2 show similarity measurements between the items in the database and the query "Paint a fence", following the methodology presented in Section 3.3. A higher score corresponds to a smaller distance in vector space between the embeddings of the two pieces of text encoded with the transformer, meaning they are judged to be more similar to each other.

Table 4.1: Similarity measurements of each item against the query "Paint a fence" before fine-tuning. Higher number → more similar.

Tool name   | Name score | Description score | Summed score
Shovel      | .18        | .19               | .37
Hammer      | .18        | .14               | .32
Paint brush | .20        | .17               | .37

Table 4.2: Similarity measurements of each item against the query "Paint a fence" after fine-tuning. Higher number → more similar.

Tool name   | Name score | Description score | Summed score
Shovel      | .11        | .11               | .22
Hammer      | .1         | .09               | .19


5 Discussion

In this chapter, the work of this thesis will be discussed from several different angles. First, the acquired results will be analyzed and criticized, and the work's usefulness to the company will be addressed. After that, the method will be deconstructed and evaluated based on the experience it generated. The work in a wider context will conclude the chapter.

5.1 Results

The results shown in Figure 4.1 were beyond expectations, and are quite hard to explain, since much in the area of transformers remains guesswork. It is clear that a drastic benefit can be seen after exposing the model to only a handful of data points, which is certainly a useful lesson. As it turns out, a transformer can even perform worse than the baseline when no fine-tuning is done. Evidently, however, one should be careful as to how, and with what data, the transformer is fine-tuned. As Figure 4.5 shows, there is a clear loss when knowledge does not transfer between tasks, even though the two tasks are intuitively similar.

Figure 4.2 shows that, in contrast to the work by Mosbach et al. [31], a higher number of iterations is not always beneficial. However, to give the benefit of the doubt, they do not give any concrete numbers to work from, which could mean that their advice is meant to stave off attempts at fine-tuning with even larger batch sizes than 16. Unfortunately, this is not addressed in this thesis. The clear outliers in this graph are the two RoBERTa models, whose performance initially rises but then falls drastically (especially noticeable for the bi-encoder). This behavior should raise eyebrows, as both models fall to correlation coefficients close to .05, which in the given context could be likened to chaotic guessing. It is hard to formulate any hypotheses as to what causes this, since RoBERTa is very similar to BERT in terms of architecture. It is important to note that the behavior occurs in both permutations of the data, meaning that it cannot be attributed to a fluke. Mosbach et al. also recommend fine-tuning to almost zero training loss, arguing that this does not affect the generalizability of the model. The paper's discussion tied to this argument could serve as a good starting point for further analysis.

Figures 4.3 and 4.4 show that the claims made by Sanh et al. seem to hold. The DistilBERT versions train and test in almost half the time, while the performance loss on testing is negligible. This is the reason why only the DistilBERT model was used to test research question 2. As has been briefly mentioned in previous chapters, the speed benefits that bi-encoders give cannot be observed in these figures, since all sentence pairs are inferenced exactly once, which is not the case for IR-related tasks, as discussed by Reimers and Gurevych. For example, in the web application constructed for the purpose of this thesis (Section 3.3), inference ("usage of the transformer") can be done on data entry when using a bi-encoder. If a cross-encoder is used, inference has to be done for each pair to be compared when a query's results are evaluated.
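This difference can be made concrete with a small sketch: with a bi-encoder, the stored texts can be embedded once when they are saved and only the query is encoded at search time, whereas a cross-encoder must run one forward pass per (query, text) pair. The checkpoints below are illustrative.

# Sketch of the inference difference. Checkpoints are illustrative.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("distilbert-base-uncased")
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")

stored_texts = ["Used to dig holes.", "This can be used to hammer in nails."]

# Bi-encoder: the stored texts can be embedded once, at data entry.
stored_embeddings = bi_encoder.encode(stored_texts, convert_to_tensor=True)

def search_with_bi_encoder(query):
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    return util.cos_sim(query_embedding, stored_embeddings)  # only the query is encoded here

def search_with_cross_encoder(query):
    # One full forward pass per (query, text) pair at search time.
    return cross_encoder.predict([(query, text) for text in stored_texts])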

Something interesting to note, however, is that there seems to be a very clear difference between testing and training times for all models' bi-encoder and cross-encoder versions; the cross-encoders perform better during training and the bi-encoders better during testing. Static overhead from, for example, loading the models into memory can be inferred from the training measurements (Figure 4.3), where the models show no significant time-related preference between bi-encoders and cross-encoders, so there is no reason to believe that this is the cause of the difference. This finding will have to be left for future work.

The second research question of this thesis is addressed in Figure 4.5. This figure shows how a model fine-tuned on only STS compares to one fine-tuned first on Wikipedia data and then on STS. Since both tasks are related to estimating sentence similarity, the hypothesis was that the two tasks would be mutually beneficial. This way, one would be able to fine-tune on generally accessible data with the goal of increasing performance on unseen data in the future. Unfortunately, this hypothesis is contradicted by Figure 4.5; it is better not to fine-tune on the Wikipedia data before applying the model to the task at hand. The reason for this is hard to determine without more testing. Some hypotheses: (1) the losses (and, as a result, the tasks themselves) might be too different from each other, meaning improvement in one directly leads to worse performance in the other; (2) the $\epsilon$ used in the triplet loss function might be too large, leading to the model overcompensating; (3) the dataset used might be too difficult; since all three sentences are extracted from the same article, the difference $\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert$ might be too small to produce an effective result. These hypotheses could be verified with additional testing. For example, (1) could be tested by applying an incremental strategy similar to the one proposed in this thesis to the wiki dataset and applying the full STS fine-tuning process in between increments; if the Spearman correlation falls steadily, the hypothesis could be considered supported. (2) could be tested by modifying $\epsilon$ and observing the result. (3) could be tested by creating a more diverse dataset.
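For reference in relation to hypotheses (2) and (3), a standard margin-based triplet loss of the kind referred to here (assumed to match Equation 2.2) can be written as

\[
\mathcal{L}_{\text{triplet}} = \max\left( \lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\; 0 \right)
\]

where $s_a$, $s_p$ and $s_n$ are the embeddings of the anchor, positive and negative sentences, and $\epsilon$ is the margin.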

Section 3.3 presented a web application whose concept was brainstormed together with the company supervisor. It will be SICK's portal to using the work done in this thesis, by manually adding the content they wish and fine-tuning the model to perform better on the given data. The basic demonstration of the web application's functionality shows the phenomenon mentioned in Section 2.2 in practice; after fine-tuning, the scores of all Tools are reduced, but the model can still be seen as more confident in its choice, since the score for the "correct" item is proportionally larger relative to the other items than it was before fine-tuning. This shows that the hypothesis from Section 3.3 holds for the given test, but larger tests (more closely resembling a real-world example) would have to be conducted before a solid conclusion can be drawn.

5.2 Method

Whether or not the applied methodology produces valid answers to the research questions asked is always hard to judge. This section will apply some critical thinking to the work done and, at the same time, point out how the claims made could be better supported by future work.

The fine-tuning process is described in as much detail as possible, with motivated choices of hyperparameters. The applied repetition of fine-tuning on two different permutations of the training data shows that the results from Section 3.1 are reliable and produce somewhat
