
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 18 ECTS | Cognitive Science

2020 | LIU-IDA/KOGVET-G–20/006–SE

Context matters

Classifying Swedish texts using BERT's deep bidirectional word embeddings

Daniel Holmer

Supervisor: Arne Jönsson
Examiner: Mattias Arvola


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

When classifying texts using a linear classifier, the texts are commonly represented as feature vectors. Previous methods to represent features as vectors have been unable to capture the context of individual words in the texts, in theory leading to a poor representation of natural language. Bidirectional Encoder Representations from Transformers (BERT) uses a multi-headed self-attention mechanism to create deep bidirectional feature representations, able to model the whole context of all words in a sequence. A BERT model uses a transfer learning approach, where it is pre-trained on a large amount of data and can be further fine-tuned for several downstream tasks. This thesis uses one multilingual, and two dedicated Swedish, BERT models for the task of classifying Swedish texts as of either easy-to-read or standard complexity in their respective domains. The performance on the text classification task using the different models is then compared both with feature representation methods used in earlier studies and between the BERT models themselves. The results show that all models performed better on the classification task than the previous methods of feature representation. Furthermore, the dedicated Swedish models show better performance than the multilingual model, with the Swedish model pre-trained on more diverse data outperforming the other.

Keywords: NLP, text classification, BERT, feature representation, pre-trained language models, transformer networks, fine-tuning


Acknowledgments

I would like to thank everyone who has made the completion of this thesis possible. A special thanks to my supervisor Arne Jönsson for his guidance and feedback, and for always being a source of inspiration. I would also like to thank my examiner Mattias Arvola and everyone in the seminar group, who have provided valuable input during the writing process.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
   1.1 Previous work
   1.2 Aim
   1.3 Research questions

2 Background
   2.1 Text classification
   2.2 Feature representation
      2.2.1 Bag-of-words
      2.2.2 Word embeddings
      2.2.3 Context-dependent embeddings
   2.3 Transformer networks
      2.3.1 The encoder-decoder framework
         2.3.1.1 Encoder
         2.3.1.2 Decoder
      2.3.2 Attention
      2.3.3 The Transformer
         2.3.3.1 Self-attention
         2.3.3.2 Multi-head attention
         2.3.3.3 Positional encoding
         2.3.3.4 Residual connections and feed-forward network
   2.4 Language models for pre-trained language representations
      2.4.1 ELMo
      2.4.2 OpenAI GPT
   2.5 BERT
      2.5.1 Architecture details
      2.5.2 Input representation and Wordpiece Model
      2.5.3 Pre-training
         2.5.3.1 Masked language model
         2.5.3.2 Next sentence prediction
         2.5.3.3 Pre-training data
      2.5.4 Fine-tuning

3 Method
   3.1 Datasets
   3.2 BERT Models
      3.2.1 The BERT Base Multilingual pre-trained model
      3.2.2 The KB Swedish pre-trained BERT model
      3.2.3 The AF AI Swedish pre-trained BERT model
   3.3 Implementation
      3.3.1 Preparation of the datasets
      3.3.2 Fine-tuning
   3.4 Evaluation

4 Results
   4.1 Baselines
   4.2 eCare subset
   4.3 DigInclude subsets

5 Discussion
   5.1 Results
      5.1.1 BERT model differences
      5.1.2 Dataset differences
   5.2 Method
      5.2.1 Limitations of sequence lengths
      5.2.2 Optimizing fine-tuning hyperparameters
      5.2.3 On the pre-training of language models

6 Conclusion

Bibliography


List of Figures

2.1 Overview of the Transformer-encoder. Adapted from Figure 1 in Vaswani et al. (2017).
2.2 Example of BERT input representations. Adapted from Figure 2 in Devlin et al. (2019).
2.3 Illustration of the trivial word predictions in a deep bidirectional model. Adapted from Rajasekharan (2019).
3.1 Example of the area under a ROC curve (AUC).


List of Tables

3.1 Summary of datasets
3.2 BERT models overview
4.1 ZeroR baselines, breakdown
4.2 SMO + BOW, breakdown, Santini et al. (2019a)
4.3 BERT, eCare breakdown
4.4 BERT, DigInclude breakdown


Chapter 1

Introduction

The rapid expansion of the internet has established it as one of the main ways humans exchange information and ideas. Even though millions of hours of video and audio clips are uploaded and streamed online each day, written text is still the dominant medium of this information exchange. Vital societal functions, such as bank offices and health care centers, are increasingly restrictive with telephone and office hours, and often refer to different sources of written information in their place. ”For more information, visit our website” is not an unusual phrase in the digital age. Being able to read and understand the contents of a text is therefore often a necessity for full participation in society.

However, studies indicate that many people do not have proficient reading skills for many of the important texts found online. The latest PISA reading assessment (OECD, 2019) was taken by 15-year-old students, with texts aimed to assess the students' reading proficiency in a digital environment. The study showed that roughly 18% of the Swedish students could not identify the main idea in a text of moderate length. Finding and understanding lengthy informational texts online is therefore not a trivial task for a large part of the population.

One way to help these people is to make sure the texts published online are less complex to read. What distinguishes a complex text from a less complex one can be defined in many different ways, with the assistance of a wide array of measures and analysis techniques. Three of the most common readability measures for Swedish texts are LIX, OVIX, and Nominal Ratio. These measures can reveal important characteristics of a text, like sentence length and the ratio of different word classes, and therefore also say something about its complexity. The essential point is that there is something inherently different in the structure of a complex text compared to a less complex one. This distinction between the two types of text allows them to be classified with many different Natural Language Processing (NLP) methods. Such classification can be very helpful for assessing whether the textual content at hand is too complex and therefore needs to be modified to allow less proficient readers to take part in its content and ideas.
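To make the readability measures above concrete, here is a minimal sketch of the LIX measure, which combines average sentence length with the proportion of long words (more than six letters); the naive tokenization and the example sentence are illustrative assumptions, not the tooling used in this thesis.

```python
import re

def lix(text: str) -> float:
    """Compute the LIX readability score:
    (words / sentences) + 100 * (long words / words),
    where a long word has more than six letters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    if not words or not sentences:
        return 0.0
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

print(lix("Detta är en enkel mening. Här kommer ytterligare en mening med betydligt längre ord."))
```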

When using a text classifier, one of the simplest ways to represent text is to use bag-of-words (BOW), where each word (feature) in the text is stored together with its relative frequency, ignoring word position. A more advanced way to represent features is to use word embeddings, where each feature is mapped to a vector of numbers (often called an embedding). One popular way to create these word embeddings is to use word2vec (Mikolov et al., 2013). Word2vec can capture the semantic meaning of words, providing more competent feature representations than the previously mentioned BOW approach. The embeddings from word2vec work relatively well, but are still missing an essential aspect of natural language: context. This significantly limits the quality of the feature representation since, in human natural language, the context of a word is of great importance to its meaning. Accounting for context is computationally expensive, but with the evolution of more and more intricate deep-learning network architectures, new methods capable of creating context-dependent word embeddings have emerged.

One of these newer, and potentially more powerful, methods is BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). BERT builds upon the multi-layered bidirectional Transformer encoder described in Vaswani et al. (2017), which allows the model to learn bidirectional representations of unlabeled texts. That is, the model learns contextual relations between all the words in a sequence, providing a rich feature representation for every word.

To learn these representations, a BERT model is pre-trained on a large number of text documents. The pre-trained model can subsequently be fine-tuned to perform many specific tasks, just by changing the final output layer. This form of transfer learning has been shown to be an efficient way to apply general language representations to a specific domain, without having to learn the entire model from scratch.

1.1 Previous work

It is tempting to believe that every new, theoretically more advanced feature representation method will always outperform the theoretically simpler ones. In reality, this is not always the case. For example, Santini et al. (2019a) found that the previously mentioned BOW produced better results (in the form of higher weighted average f-scores) on the task of classifying Swedish texts as of low or standard complexity than the on-paper superior word2vec method. Santini et al. (2019a) also note that this unstable behavior of word2vec has been observed in other domains (Wendlandt et al., 2018). The fact that bigger is not always better makes it important to study BERT in different domains and determine if similar problems exist.

Internationally, there has been no lack of studies on BERT since the release of the original BERT paper (Devlin et al., 2019). Due to the impact BERT has had on the field of language technology, a new subfield has emerged, often referred to as BERTology, as seen in, for example, Rogers et al. (2020). The large interest in BERTology has also led to numerous improvements and extensions of the original BERT model, both with regard to speed and efficiency, like DistilBERT (Sanh et al., 2019) and ALBERT (Lan et al., 2019), and models specialized for languages other than English, like CamemBERT (Martin et al., 2019) and FlauBERT (Le et al., 2019).

At the same time, research on BERT in a domain with Swedish texts has been somewhat scarce. Even though the multilingual model released with the original paper can be used with more than 100 languages, including Swedish, very little work has been done in this research area. In early 2020, however, BERT models specialized for the Swedish language1,2 were released for public use, which hopefully will help to fill this gap by making Swedish BERT models more accessible. This thesis is rooted in these advancements, and will further investigate the effectiveness of both the mentioned multilingual and specialized models in a domain with texts written in Swedish.

1.2 Aim

This thesis aims to compare the performance of BERT with the best methods used by Santini et al. (2019a), on a task of classifying Swedish texts as either of low or standard complexity. This will provide guidance on the most efficient methods when constructing classifiers for Swedish texts in similar domains.

1 https://github.com/Kungbib/swedish-bert-models
2 https://github.com/af-ai-center/SweBERT


1.3 Research questions

• How will the weighted average f-score of a binary text classifier, based on BERT's feature representations, compare to the weighted average f-score of a classifier using BOW as feature representations, on the task of classifying Swedish texts as of either low or standard text complexity?


Chapter 2

Background

This chapter serves to provide a theoretical background to the BERT architecture, and to provide context to how it was used to investigate the research questions of this thesis. The chapter begins with the description and definition of the text classification task. This is followed by an overview of previous methods of representing features, leading up to predecessors of the type of context-dependent embeddings used in BERT. The two following sections give an overview of the two most important theories BERT builds upon: transformer networks and how language models are used in pre-trained language representations. The chapter concludes with a close inspection of the BERT framework, firmly grounded in the theories addressed in the earlier sections.

2.1 Text classification

A text classification task involves assigning a predefined category to a text (Heimann Mühlenbock, 2013). To assign the category, some features of the text at hand are extracted, and depending on the extracted features, a category that aligns with them can be assigned. The performance of a text classifier is therefore directly correlated with the quality of its extracted features. In the past, the selection of features was mostly made by hand-written rules, a task that requires high domain knowledge and can be very resource-intensive. The hand-written rules have, therefore, mostly been replaced with supervised machine learning techniques. Instead of manually writing the rules, a supervised machine learning algorithm takes a set of input texts and a set of corresponding classification categories. The algorithm then learns which features correspond to a certain label, and subsequently uses that information to map unseen texts to new categories (Jurafsky & Martin, 2019).

2.2 Feature representation

There are many possible ways to represent texts as features. This section will describe some of the most common ones, and how they differ from the features used in the BERT model.

2.2.1 Bag-of-words

One of the most basic approaches is to represent the text as a bag-of-words (BOW), where each word in the text gets transformed into a vector and mapped to another vector consisting of the corresponding word counts. This seemingly simple procedure will, in many cases, be quite effective, but it comes with some issues. One of the main ones is that a BOW representation disregards the position of individual words in a text. Natural language is heavily context dependent, and removing a word from its context can completely change its semantic meaning. Another problem is the sparse encoding of the vectors, which makes them sub-optimal for machine learning algorithms (Jurafsky & Martin, 2019).
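As a small illustration of the bag-of-words representation described above, the sketch below builds count vectors for two toy sentences with scikit-learn's CountVectorizer (assuming a recent scikit-learn version); note how word order is lost entirely.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the bank was robbed yesterday",
    "the sand on the river bank",
]

vectorizer = CountVectorizer()          # word counts; positions are discarded
bow = vectorizer.fit_transform(docs)    # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # vocabulary (columns)
print(bow.toarray())                       # one count vector per document
```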

2.2.2 Word embeddings

An alternative approach to BOW is to use a vector semantics method, where each word is represented in a multidimensional semantic space. This method assumes that words which frequently co-occur also share some semantic meaning (Jurafsky & Martin, 2019). These co-occurring words are modeled as vectors (also referred to as embeddings) near each other in the multidimensional space. By calculating the dot product of two embeddings, a semantic similarity measure can be obtained, indicating how close they are to each other in the semantic space, and therefore how similar the words represented by the embeddings are. The embeddings are based on a term-term matrix (Jurafsky & Martin, 2019). This means that each word has its own row in the matrix, with columns corresponding to all the words in the vocabulary. Each cell indicates how often the word in the row co-occurs (in a small window) with the word in the column. These rows are what constitute the embeddings. However, due to how natural language is structured, each word in the vocabulary will only co-occur with a small subset of the total number of words. As a consequence, most of the cells in the matrix are going to be empty (and padded with zeroes), which means that a large portion of each embedding will be of no particular interest. This is often called a sparse vector representation (Jurafsky & Martin, 2019).
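The sketch below illustrates the term-term matrix idea on a toy corpus: co-occurrence counts within a small window form sparse row vectors, and the dot product between two rows serves as a crude similarity measure. The corpus and window size are made-up assumptions, and real systems would use much larger corpora and usually weighted counts.

```python
import numpy as np

corpus = [
    "two ravens in the old oak tree".split(),
    "the old oak tree by the river".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}
window = 2

# Count how often each word co-occurs with every other word in a small window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[w], index[sent[j]]] += 1

# Each row is a (sparse) embedding; the dot product indicates similarity.
print(np.dot(counts[index["oak"]], counts[index["tree"]]))
print(np.dot(counts[index["oak"]], counts[index["ravens"]]))
```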

The sparse vectors are today often replaced by dense vectors. The dense vectors offer some advantages over the sparse vectors, the most prominent being that they, due to their fewer dimensions, are a better fit for feature representation in machine learning systems. A system that uses shorter, more dense vectors has to learn substantially fewer weights than a system using sparse vectors. This has a significant impact on the training times of the system. Also, due to the reduced number of weights, a system with dense vectors is less susceptible to overfitting problems (Jurafsky & Martin, 2019). Today, dense vectors are most often implemented in the form of the very influential word2vec (Mikolov et al., 2013), or GloVe (Pennington et al., 2014).

Word2vec uses the weights of a binary classifier when creating embeddings, instead of the actual frequencies of word co-occurrence as seen earlier (Jurafsky & Martin, 2019). The classifier iterates over randomly chosen words in the dataset and predicts whether they are likely to appear close to the target word to be embedded. Since the whole dataset is known, and therefore also whether a word is close to the target or not, the classifier can use the dataset itself as the gold standard for the classification task. The results of the classification are of less importance; the information that constitutes the embeddings is the collection of weights for each word.

Even though methods like word2vec have been used as feature representations with great success on many different tasks, the technique has its limitations, mainly with capturing the context of each word embedding. The problem is described in Neelakantan et al. (2014), and is often illustrated with the meaning of the word bank. The semantic meaning of the word bank in the phrase the sand on the river bank is entirely different than in the phrase the bank was robbed. Word2vec will, however, only have one embedding of the word bank, losing the nuances of its semantic ambiguity.
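For a concrete sense of how word2vec embeddings are obtained and why the bank example is a problem, the sketch below trains a tiny model with gensim (assuming gensim 4.x, where the dimensionality argument is called vector_size); whichever sense of bank appears in the training sentences, model.wv["bank"] returns one and the same vector.

```python
from gensim.models import Word2Vec

sentences = [
    "the sand on the river bank".split(),
    "the bank was robbed yesterday".split(),
]

# One static vector per word, regardless of the context it appears in.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

vector = model.wv["bank"]            # the single embedding for "bank"
print(vector.shape)                  # (50,)
print(model.wv.most_similar("bank", topn=3))
```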

2.2.3 Context-dependent embeddings

Where word2vec and GloVe use the co-occurrences between a context word and a target word, context-dependent embeddings use whole sequences to capture the context of a target word and create embeddings. Selecting sequences instead of individual words to infer the context provides information about the sequential relationships of the words in a sequence, instead of just observing whether they occur in the proximity of a target word. One early example of these richer embeddings is context2vec (Melamud et al., 2016). Context2vec builds upon the word2vec architecture, and introduces bi-directional Long Short-Term Memory (LSTM) networks (Graves, 2013; Hochreiter & Schmidhuber, 1997) to replace the context modeling of averaged word embeddings in a fixed window, as seen in word2vec (Melamud et al., 2016). To capture the context of a target word, the two LSTMs read the current sentence, left-to-right and right-to-left respectively, around the target word and create two context vectors. To illustrate, the context vectors for the word ravens in the sequence ”two ravens in the old oak tree” would be representations of the sequences ”two” and ”in the old oak tree”. The two context vectors are then concatenated and passed into a multi-layer perceptron, and the resulting output is the joint context around the target word (Melamud et al., 2016).

Section 2.4 describes how the concept of context-dependent embeddings can be further developed with the help of language models and transformer networks. Prior to that, Section 2.3 will provide a description of the functionality of a transformer network.

2.3 Transformer networks

To perform sequence-to-sequence tasks, the dominant method has long been some form of recurrent model, in the likes of recurrent neural networks (RNNs). These methods have been relatively well-performing, but suffer from some important drawbacks, of which the most glaring is the lack of parallelization of the computational steps (Vaswani et al., 2017). This flaw is due to the sequential nature of a recurrent network, where the hidden state is not only dependent on the current input, but also on all of the hidden states in the previous time steps. To be able to calculate a new hidden state, all of the previous hidden states have to be calculated in a sequential manner, making parallelization virtually impossible. While this shortcoming exists for every type of input to the network, it is most prevalent for long sequence lengths, when memory limitations often reduce the batching possibilities. Vaswani et al. (2017) propose what they call the Transformer as a way around these limitations. The Transformer builds upon the concept of attention, which provides mechanisms to abandon recurrence altogether, in combination with a network structure of encoders and decoders. This combination allows for better task performance and training times compared to recurrent neural networks. The following sections will provide a background to these concepts, concluding with a closer inspection of the Transformer itself.

2.3.1 The encoder-decoder framework

Deep neural networks (DNNs) have proven to be successful in many NLP tasks, for instance, word embedding extraction with word2vec (Mikolov et al., 2013) and language modeling (Bengio et al., 2003). One major limitation of DNNs is, however, that they require both the input and target vectors to be of fixed dimensionality (Sutskever et al., 2014).

This limitation leads to problems on tasks where the length of the vectors is unknown, which often is the case due to the variability of natural language. One solution to this problem is to use encoders and decoders (Cho et al., 2014). The encoder-decoder architecture introduces recurrent neural network (RNN) structures: the encoder layer encodes the input sentence into a vector of fixed length, which is passed through the DNN and subsequently decoded in the decoder layer. This allows the DNN always to be passed a vector of fixed length, regardless of the actual length of the input sequence.


2.3.1.1 Encoder

The first building block of the encoder-decoder framework is the encoder layer. Multiple recurrent network structures, such as RNNs, gated recurrent units (GRU), or convolutional networks, may be implemented for this purpose (Jurafsky & Martin, 2019). However, Sutskever et al. (2014) showed that multilayered LSTMs are especially suited as encoders. This is a result of their ability to handle long sequences, unlike for example RNNs that have problems with exploding or vanishing gradients (Bengio et al., 1994).

During run time, the encoder gets its input, a text represented as a sequence of vectors x = (x_1, \ldots, x_{T_x}). From this sequence, a context vector c is generated. This vector serves as a contextualized representation (Jurafsky & Martin, 2019) of the whole input sequence. The context vector is described by Bahdanau et al. (2015) as

c = q(\{h_1, \ldots, h_{T_x}\}),    (2.1)

where the hidden state h at time t is

h_t = f(x_t, h_{t-1}),    (2.2)

and q and f are recurrent² functions. h_t is dependent both on the current input x_t and the former hidden state h_{t-1}, while the context vector c contains all of the previous hidden states.

After being processed by the encoder, the context vector will be passed to the decoder, where it will be used to decode an output sequence.

2.3.1.2 Decoder

Like the encoder, the decoder can be implemented with many network types, but LSTMs are the most commonly used. The purpose of the decoder is to take the contextualized representation of an input sequence and generate some output sequence from it. This generation is done sequentially for each word in the output sequence, until an EOS (end-of-sequence) token is produced. The decoding of a sequence is based both on the sequence context vector c passed from the encoder, and on all of the earlier decoder states for the current sequence (Jurafsky & Martin, 2019).

By adapting the decoder definition by Bahdanau et al. (2015), the hidden state h of the decoder at time t can be formally expressed as

h_t = f(h_{t-1}, y_{t-1}, c),    (2.3)

where y_{t-1} is the previous output token and c the context vector. The token y to output is inferred from a conditional distribution:

P(y_t | y_{t-1}, y_{t-2}, \ldots, y_1, c) = g(h_t, y_{t-1}, c),    (2.4)

where g is a softmax function providing probable output tokens given the current time step. In other words, at every time step the decoder provides a conditional distribution over possible tokens, computed as in Equation 2.3 and Equation 2.4 from all the previous hidden states and the context vector. The softmax function g then delivers all the possible tokens as a probability distribution. One intuitive way to choose which token to output from this probability distribution is to use an argmax function and pick the token with the highest probability. This might, however, lead to problems (Jurafsky & Martin, 2019). The tokens chosen are always the ones that the model deems most probable at the current position, but this does not always give the best results when considering the context of the whole sequence. Natural language has complex word relations and there can be long dependencies between word positions, which get lost when only considering the current position of the word.

² As stated before, the most common practice is to use LSTMs, but other recurrent network structures may be used as well.


An alternative to selecting the most probable word is to select the whole output sequence with the highest combined probability. However, large vocabulary sizes make it infeasible to search through all of the possible sequence combinations, due to the exponential nature of the search problem. Instead, exploring only a part of the space of possible outputs using a state-space search algorithm, such as beam search (Jurafsky & Martin, 2019), and then selecting the output sequence is a far better approach, and is often used in similar problems (Boulanger-Lewandowski et al., 2013). Beam search combines breadth-first searching with a filtering heuristic, where the w most promising paths (also called the beam) at time step (depth of the search tree) t are kept for further evaluation, while the rest of the paths are discarded. This pruning of paths is done at each t, which results in a search space that is significantly narrower than an exhaustive breadth-first search (Boulanger-Lewandowski et al., 2013), with considerably reduced search times. When the entire context vector for the input sequence has been passed through the decoder, the remaining path through the sequence with the highest cumulative log-likelihood is chosen as the output.
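A minimal sketch of the beam search idea described above: next_token_probs is a hypothetical stand-in for the decoder's softmax output, log-probabilities are accumulated along each path, and only the w most promising paths survive at each depth.

```python
import math

def next_token_probs(prefix):
    """Hypothetical decoder stand-in: returns P(token | prefix)."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"old": 0.5, "oak": 0.3, "<eos>": 0.2},
        ("a",): {"tree": 0.7, "<eos>": 0.3},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(width=2, max_len=4):
    beam = [([], 0.0)]                       # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))   # finished path is kept as-is
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the `width` most promising paths at each depth.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beam[0]

print(beam_search())   # path with the highest cumulative log-likelihood
```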

Unfortunately, the beam search approach also has its flaws. The context vector c that guides the filtering at each time step t is the final hidden state of the encoder. Naturally, the final hidden state is always more focused on the later parts of the input sequence (Jurafsky & Martin, 2019). As a consequence, the process of selecting the output sequence is somewhat unfair, where the last tokens often have more influence over the selection process than they reasonably should.

2.3.2 Attention

One often used way to counteract the inherent problems of the context vector in the decoder is to implement an attention mechanism (Jurafsky & Martin, 2019). As seen in Section 2.3.1, an encoder-decoder network without an attention mechanism gathers all information about the input sequence in a single context vector. In an encoder-decoder network with an attention mechanism, each of the token positions in the input sequence has its own context vector. The attention mechanism allows the decoder to search for the token positions in the input sequence that have the most relevance for the word that is to be generated, and to access their context vectors. The decoder then bases the choice of the word to be generated on the relevant words' context vectors, together with all of the previously generated words (Bahdanau et al., 2015). This allows the decoder to make a choice based only on the words that are most relevant to the context at each time step, instead of having to attend to all of the words in the input sequence equally, even if they are not helpful for the current word. Some similarities with how a human would approach a similar translation task can be seen here. Instead of first reading whole paragraphs and producing the complete translation at once, most people will probably start by reading a chunk of text, translate that chunk, and move on to the next chunk. This procedure allows the translator to put the most attention on the parts of the text that are relevant at the time, and disregard the others. Similarly, the decoder only focuses on the applicable encoder states.

The formal definition of the conditional probabilities of the decoder is expressed by Bahdanau et al. (2015) as:

P(y_i | y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i),    (2.5)

where s_i is the hidden state of an RNN at time i, derived from:

s_i = f(s_{i-1}, y_{i-1}, c_i).    (2.6)

A difference from the decoder in Equation 2.3 is therefore, as Bahdanau et al. (2015) point out, that each target word y_i is associated with a distinct context vector c_i.

The context vector c_i is essentially a dynamic vector derived from the hidden states h of the encoder at each time step i of the decoder (Jurafsky & Martin, 2019). This means that at each time step i of the decoder, the decoder state s_{i-1} is compared to all of the encoder states j. Each of these comparisons is scored by calculating the dot product between the vectors, as described in Jurafsky and Martin (2019):

score(s_{i-1}, h_j) = s_{i-1} \cdot W_s \cdot h_j,    (2.7)

where W_s is the set of weights that allows the network to learn which aspects of similarity between the states are important for the task at hand. The score represents the similarity between the different states, and a higher similarity is assumed to mean a higher degree of relevance between them.

When the scores have been calculated from all of the encoder states for a given decoder time step, a softmax function maps all the relative similarity scores to a vector of weights. The result is, for the current decoder state i, a probability distribution over the most relevant encoder states j. By averaging these probabilities, a context vector c_i associated with the decoder time step i is created (Jurafsky & Martin, 2019).

This approach enables the context vector c_i to continuously reflect the most important tokens in the input sequence for every given token in the output sequence.
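A small numerical sketch of this attention mechanism, with random vectors standing in for learned parameters and hidden states: encoder states are scored against the current decoder state as in Equation 2.7, a softmax turns the scores into weights, and the context vector for the step is formed as the weighted combination of the encoder states.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                     # toy hidden-state dimension
encoder_states = rng.normal(size=(5, d))  # h_1 ... h_5
decoder_state = rng.normal(size=d)        # s_{i-1}
W_s = rng.normal(size=(d, d))             # learned similarity weights (Eq. 2.7)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# score(s_{i-1}, h_j) = s_{i-1} . W_s . h_j for every encoder state j
scores = encoder_states @ W_s.T @ decoder_state
weights = softmax(scores)                 # relative relevance of each h_j

# Context vector for this decoder step: weighted combination of encoder states.
context = weights @ encoder_states
print(weights.round(3), context.round(3))
```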

2.3.3 The Transformer

The Transformer builds upon and expands the encoder-decoder architecture, where an input sequence gets encoded into a sequence of continuous representations, and subsequently decoded to an output in a sequential manner (Vaswani et al., 2017). The Transformer does, however, offer some notable improvements over previous encoder-decoders. Firstly, it relies entirely on attention to model all of its word dependencies, without any recurrent or convolutional network structures. This self-attention mechanism solves some of the limitations of the mentioned network structures, increasing the performance of the transformer architecture. Secondly, it introduces the concept of multi-head attention. Multi-head attention allows the model to attend to different positions of the input sequence simultaneously, and therefore provides a richer word context. As a consequence, the Transformer can capture polysemy better than previous encoder-decoder based architectures. Finally, since there are no recurrent or convolutional structures that indicate individual word ordering in input sequences, the Transformer uses a positional encoding to distinguish which position in a sequence each word belongs to. This is all tied together by a standard feed-forward network, used for the learning of the output embeddings, alongside residual connections to ease the flow of essential vectors through the blocks.

Due to these improvements, the Transformer achieved state of the art results in neural machine translation tasks (Vaswani et al., 2017), and can be considered a major milestone in the field. Many later architectures, such as BERT, are based directly on the Transformer.

In the Transformer, the encoder and decoder parts are each implemented as stacks of N = 6 identical blocks. In BERT, however, only the encoder stack is implemented. Therefore, the Transformer concepts explored in this report will be presented in an encoder context. In the following sections, the most important concepts of the Transformer-encoder (see Figure 2.1 for an overview) will be investigated, as well as how they work in conjunction to produce the impressive results.

2.3.3.1 Self-attention

One of the main drawbacks of a recurrent network structure is that the current state of the network relies on all of the previous hidden states. In other words, it becomes impossible to parallelize the computations of such networks, since, to calculate the hidden state h_t, all of the previous hidden states h_1, \ldots, h_{t-1} must already be known. Even though the attention mechanism used by Bahdanau et al. (2015) and explained in Jurafsky and Martin (2019) addresses the problems of the context vector c in the encoder-decoder architecture, there still exists a recurrent component in both the encoder and decoder in these systems. More specifically, the improved context vector c_i is calculated by stepping through the hidden states of the encoder in a recurrent fashion, and is then decoded in the decoder using a similar recurrence. The type of attention described in Section 2.3.2 counteracts some of the weak points of the encoder-decoder, but still suffers from the issues that the lack of parallel computation entails. These issues are avoided in the Transformer (Vaswani et al., 2017) by abandoning the recurrent components of attention altogether, resulting in what the authors call self-attention.

Figure 2.1: Overview of the Transformer-encoder. Adapted from Figure 1 in Vaswani et al. (2017).

The authors describe self-attention with the concepts of query (Q), key (K), and value (V) vectors, all derived from the current input vector X. Q, K, and V are therefore projections of the vector X, used in different parts of the self-attention mechanism. For each X in the input sequence, an individual set of Q, K, and V gets generated. This is done by multiplying the word embeddings of the input sequence with their corresponding weight matrices, W^Q, W^K, and W^V. During this process, the X embedding gets scaled down from a dimension of 512 to Q, K, and V vectors of size 64 (Alammar, 2018). This is of importance to the computation of multi-headed attention described in Section 2.3.3.2.

The three weight matrices are created as a part of the pre-training process of the Transformer. An important point, as Alammar (2018) emphasizes, is that the vectors Q, K, and V are in practice also packed together into whole matrices. This allows several vectors to be computed simultaneously, and provides a more efficient transformer. To think of them as vectors is a way to aid the understanding of the computational steps, although the main principles remain the same and can be applied to the computation of matrices as well. When used to calculate attention, the vectors involved in the self-attention mechanism can be described as follows:

• Query (Q): the embedding of the word currently being processed, which is scored against the other words in the sequence.

• Key (K): the embedding of another word that Q is being scored in relation to.

• Value (V): the embedding of a word that corresponds to a specific Key. In the context of this report, this will always be identical to K, but in theory, each K can have multiple V.

To compute the attention from these vectors, Vaswani et al. (2017) use what they call Scaled Dot-Product Attention. This is conceptually similar to the dot product attention described in Section 2.3.2, but with an added scaling factor of 1/\sqrt{d_k}, where d_k is the dimension of the query and key vectors. The scaling factor is used as a way to counteract too large dot products (Vaswani et al., 2017).

Alammar (2018) describes the computational steps in calculating the scaled dot-product attention as follows:

1. Create the three vectors Q, K, and V, by multiplying the input embedding X with each of the pre-trained weight matrices, W^Q, W^K, and W^V.

2. Calculate the dot product between the current query vector, Q_t, and all of the key vectors, (K_1, \ldots, K_n).

3. Multiply all of the calculated dot products with the scaling factor 1/\sqrt{d_k}, and run a softmax function on the results. The softmax normalizes all the results so that they add up to 1. The results indicate the importance of each K to the current Q.

4. Multiply the softmax scores for each K_n with the corresponding value vector V_n. This will add weights to the value vectors and filter out the irrelevant V_n for the current Q_t, since they will be multiplied by low softmax scores.

5. Sum all the weighted value vectors into the output vector Z_t.

This process is repeated for all the queries, and as a result, one output vector Z per query will be produced. These will indicate how much each word in the sequence should attend to all of the other words in the sequence, respectively.

A formal definition of this process of scaled dot-product attention is given by Vaswani et al. (2017) as:

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V    (2.8)
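The five steps and Equation 2.8 translate almost directly into a few lines of NumPy; in the sketch below, random matrices stand in for the projected Q, K, and V of a short sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 2.8)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # how much each query attends to each key
    weights = softmax(scores, axis=-1)    # rows sum to 1
    return weights @ V                    # one output vector Z per query

rng = np.random.default_rng(0)
seq_len, d_k = 6, 64
Q = rng.normal(size=(seq_len, d_k))       # stand-ins for X W^Q, X W^K, X W^V
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)   # (6, 64)
```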

The whole process of calculating self-attention is computationally expensive. For the attention using RNNs described in Section 2.3.2, the number of operations is decided largely by the dimension of the network (Kaiser, 2017). That is a consequence of how, in every time step, a word attends to all the dimensions of the RNN. Therefore, with regard to the sequence length, the complexity is O(n), and thus sequence length is, in this aspect, not a relevant issue. For self-attention, on the other hand, every query vector Q in the input sequence has to attend to all the key vectors K. The complexity is, therefore, directly dependent on the sequence length, O(n²). This quadratic scaling with regard to the sequence length has led Vaswani et al. (2017) to set an arbitrary limit of a maximum of 512 tokens per sequence, striking a balance between performance and memory usage.

2.3.3.2 Multi-head attention

The second important novel concept in the Transformer is multi-head attention. Multi-head attention uses the same mechanisms as described in the previous section, but performs multiple such computations simultaneously, each calculation in its own attention head. This improves upon the attention function in two ways (Alammar, 2018). First, the model expands the ability to focus on multiple word positions with its attention mechanism. In single-head attention, the dominating focus for each word is often on the word itself, and other words in the sequence get their importance toned down. In multi-headed attention, additional attention heads are introduced, allowing the attention heads to focus on different word positions. Second, it adds what Alammar (2018) describes as multiple ”representation subspaces” to the attention layer. That is, with single-headed attention, one area of the input sequence is going to be more in focus than the others. With multi-headed attention, on the other hand, the possibility arises to focus on several more areas.

The Transformer implements eight different attention heads. During the pre-training, eight unique sets of weight matrices, W^Q, W^K, and W^V, are generated, with randomly initialized weights. From these sets of matrices, it is also possible to generate eight unique sets of Q, K, and V vectors, one set for each attention head. As a result, each attention head will tend to focus on different word positions of the input embedding.

Since the computation is performed eight times, precisely as described in Section 2.3.3.1, but with eight different sets of weight matrices, the result will be eight different Z_n matrices³. These are concatenated into a single Z matrix and subsequently multiplied with the weight matrix W^O, which, like the rest of the weight matrices, originates from the pre-training of the model. The resulting Z matrix can be passed on to the following feed-forward layer (see Section 2.3.3.4). The feed-forward layer expects a single matrix of dimension 512. This is exactly the result of concatenating eight matrices of dimension 64, which shows why the downscaling of the input embedding X mentioned in Section 2.3.3.1 was necessary (Alammar, 2018). Another consequence of the mentioned downscaling of the matrix dimensions is that the computational cost of the matrix operations is comparable to that of single-headed attention with full dimensionality (Vaswani et al., 2017).

The formal definition of the multi-head attention mechanism is found in Vaswani et al. (2017):

MultiHead(Q, K, V) = Concat(head_1, \ldots, head_h) W^O,    (2.9)

where

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).    (2.10)
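Continuing the sketch above, the snippet below runs eight heads with separate (randomly initialized) projection matrices, concatenates the per-head outputs, and applies the output projection W^O, mirroring Equations 2.9 and 2.10; the dimensions follow the description above (a 512-dimensional model split into 8 heads of size 64).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_k = 512, 8, 64        # 8 * 64 = 512
seq_len = 6
X = rng.normal(size=(seq_len, d_model))   # input embeddings (one row per token)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# One set of projection matrices per head (randomly initialized, as in pre-training).
W_Q = rng.normal(size=(n_heads, d_model, d_k))
W_K = rng.normal(size=(n_heads, d_model, d_k))
W_V = rng.normal(size=(n_heads, d_model, d_k))
W_O = rng.normal(size=(d_model, d_model))

# head_i = Attention(X W_i^Q, X W_i^K, X W_i^V)   (Eq. 2.10)
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(n_heads)]

# MultiHead = Concat(head_1, ..., head_h) W^O     (Eq. 2.9)
Z = np.concatenate(heads, axis=-1) @ W_O
print(Z.shape)                             # (6, 512), ready for the feed-forward layer
```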

2.3.3.3 Positional encoding

One consequence of not employing any recurrence or convolutional networks is that the inherent positional information of the sequences fed into them gets lost. For example, when running an embedding of an input sentence through an RNN, it ”unrolls” itself into a new time step and generates a new hidden state for every word in the input sentence. It is then only a matter of investigating which time step each word corresponds to, in order to get information about its position. In the Transformer, Vaswani et al. (2017) implement positional encodings to track the relative position of the words in the input embedding.

The positional encoding is performed by adding an encoding vector to each of the input embeddings X. This encoding vector has the same dimension as the input embedding (512), filled with values between -1 and 1, representing sinusoid waves of different frequencies (Vaswani et al., 2017). The waves originate from the following functions:

PE_{(pos, 2i)} = sin(pos / 10000^{2i/d_{model}})    (2.11)

PE_{(pos, 2i+1)} = cos(pos / 10000^{2i/d_{model}}),    (2.12)

where pos is the position and i is the dimension. In other words, a sine function is used for the even dimensions, while a cosine function is used for the odd dimensions. For any fixed offset k, PE_{(pos+k)} can be represented as a linear function of PE_{(pos)}, which allows the model to attend to relative positions (Vaswani et al., 2017).
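Equations 2.11 and 2.12 can be computed for a whole sequence at once, as in the sketch below; adding the resulting matrix to the (here random) input embeddings is all that is required to inject position information.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

X = np.random.default_rng(0).normal(size=(6, 512))   # toy input embeddings
X = X + positional_encoding(6)                        # encodings lie in [-1, 1]
print(positional_encoding(6)[:, :4].round(3))
```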

³ Or Z_n vectors, using the earlier vector analogy. From here on, the vector analogy will be abandoned, and the computations will be described in terms of matrices.


2.3.3.4 Residual connections and feed-forward network

After the vector from the attention head has been produced, it is added to the vector that was used as its input, creating a new connection between the input and output of the attention head. This link is called a residual connection (He et al., 2016) and was first used in the field of computer vision to support the training of deep neural networks. In the Transformer, its purpose is to ease the training and allow the original positional encoding (see Section 2.3.3.3) to be passed through the encoder stacks.

The result of this addition is subsequently passed through a layer normalization function (Ba et al., 2016), whose main benefit is reduced training times due to an efficient approach to estimating the normalization statistics in each network layer. The normalized vectors are then passed into a feed-forward network. As described in Section 2.3.3.1, the attention heads only produce a vector representing how much each word should attend to other words; they have not learned any feature representations yet. This learning is instead the role of the feed-forward network (Sileo, 2019). The vectors from the layer normalization function after the attention head, consisting of the word itself, the calculated attention weights, and the positional encoding, are passed to this regular feed-forward network. The resulting outputs are the context-dependent representations of the input texts.
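To tie the sub-layers of Sections 2.3.3.1-2.3.3.4 together, the sketch below chains a (simplified, single-head) self-attention step, a residual connection with layer normalization, and a position-wise feed-forward network into one encoder block; the random weight matrices and the hand-written layer normalization are illustrative simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 6

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x):
    # Simplified single-head stand-in for the multi-head block described above.
    scores = x @ x.T / np.sqrt(d_model)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ x

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def feed_forward(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2       # ReLU position-wise FFN

def encoder_block(x):
    x = layer_norm(x + self_attention(x))             # residual connection + norm
    x = layer_norm(x + feed_forward(x))               # residual connection + norm
    return x

X = rng.normal(size=(seq_len, d_model))               # embeddings + positional encoding
print(encoder_block(X).shape)                          # (6, 512)
```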

2.4 Language models for pre-trained language representations

Pre-trained language representations, for example the popular word2vec (see Section 2.2.2), have been shown to be very useful for a wide array of NLP tasks. As seen in Section 2.2.3, context2vec manages to capture the context of words, improving on the word2vec model. Peters et al. (2018) refined this approach even further; to capture the context of the words in a sequence, they used an architecture based on language models, named ELMo. ELMo showed much improved results compared to previous methods and is an important predecessor to BERT.

Another critical step towards BERT is the OpenAI GPT architecture (Radford et al., 2018). OpenAI GPT (Generative Pre-training Transformer) shares many similarities with ELMo (Embeddings from Language Models), and also incorporates the Transformer (see Section 2.3.3) instead of recurrent structures.

The main concepts behind ELMo and OpenAI GPT, and how their approaches to language representation differ, will be explored in the following sections.

2.4.1 ELMo

The system that popularized context-dependent embeddings (see Section 2.2.3) on a wider scale is ELMo (Embeddings from Language Models), developed by Peters et al. (2018). ELMo builds upon the TagLM architecture (Peters et al., 2017). TagLM implements a pre-trained recurrent language model (LM) in conjunction with a word embedding model (like word2vec, Section 2.2.2) to compute context embeddings of each word in an input sentence. The purpose of a language model is to predict the next word in a sequence, something that requires information about both the syntactic and semantic roles of words in context (Peters et al., 2017). TagLM takes this inherent information about the context of words in the language model to create an LM embedding for each word in the sequence, which is concatenated with the respective word embedding to create the context-dependent embeddings. This, in turn, is provided as input to the sequence tagging model in the TagLM architecture.

The overall structure of ELMo is mostly similar to TagLM, with some small improvements to the finer details of the architecture. However, there is a large disparity in the type of information that the language models output in the respective architectures. In the case of TagLM, only the topmost layer of the language model is used for the LM embeddings, while in ELMo, all of the internal layers are used to represent the LM embedding. This leads to embeddings that the authors call deeply contextualized (Peters et al., 2018), which, according to the improved task performance, capture more aspects of context than previous embeddings. Furthermore, the authors also noticed that the different layers of the LM were not equally suited for all tasks. For example, the first layer provided better representations during a POS-tagging task, while the top layer was better for sentiment analysis. The ELMo architecture provides a γ (gamma) parameter that allows for weighting the impact of the different layers in the final context embedding. This allows the architecture to perform well at various NLP tasks, and at the time of publication, applying ELMo as word representations produced state-of-the-art results on six different tasks.

ELMo implements what Devlin et al. (2019) refer to as a ”feature-based approach” to pre-trained language representation. This means that the embeddings from the pre-trained ELMo model are later used as features in another model, specialized for a specific task (Weng, 2019). The parameters of the ELMo model are fixed, and it is only the model for the specific task that is modified during training.

2.4.2 OpenAI GPT

GPT (Generative Pre-training Transformer), developed at OpenAI by Radford et al. (2018), refines the usage of language models for language representations, which proved to be effective in both TagLM and ELMo. However, instead of the recurrent methods used to capture the context of words in these architectures, GPT builds on the attention mechanisms found in the Transformer (Section 2.3.3).

Additionally, GPT implements what Devlin et al. (2019) refer to as a ”fine-tuning approach” to pre-trained language representation. The fine-tuning approach builds upon the Universal Language Model Fine-tuning (ULMFiT) method, proposed by Howard and Ruder (2018). ULMFiT successfully applied transfer learning to a pre-trained language model, allowing a general language model to be fine-tuned with domain-specific data, for a specific task. In other words, the pre-trained language model parameters are updated and changed to fit a specific downstream task. According to Devlin et al. (2019), this approach is advantageous since fewer parameters have to be learned from scratch, compared to training a completely new model (as is the case in feature-based approaches, such as ELMo (Section 2.4.1)).

GPT combines the advantages of attention found in the Transformer, with the fine-tuning found in ULMFiT, creating an architecture that at the time of its publication performed with state-of-the-art results (Radford et al., 2018). However, one limitation of GPT is that the model is only able to capture the context in one direction, namely the left-to-right context (Weng, 2019).

2.5 BERT

The field of language technology advances at a rapid pace, and it is not uncommon that the state-of-the-art method for a specific task gets replaced every year. Such advancements have historically been seen on a task-to-task basis, where a new method for task A is not applicable to task B or task C.

This changed with the advancement of context-dependent embeddings, employed in systems like ELMo (Peters et al., 2018), which at the time of publication clearly outperformed the state-of-the-art methods on a wide array of tasks. ELMo had context embeddings trained with LSTMs in a bi-directional manner, meaning they were trained on sequences left-to-right and right-to-left, and subsequently concatenated into embeddings representing each word. This allowed the ELMo language model to gain contextual information, but it was still missing what a transformer with attention can bring to the table (see Section 2.3). Furthermore, OpenAI GPT (Radford et al., 2018) used transformers to combat the issues inherent in recurrent structures. However, GPT only used a left-to-right transformer, able to capture the context between layers in one direction, still missing the whole context of the input sequence. GPT also successfully applied transfer learning to a pre-trained language model, enabling it to be fine-tuned to specific downstream tasks.

The previously mentioned architectures led to the best-known implementation of transformer networks, namely BERT (Bidirectional Encoder Representations from Transformers), which further improved upon the results of the earlier mentioned architectures. It was subsequently implemented as a part of the Google search engine, with an estimated improvement of around 10% of the search queries (Rogers et al., 2020).

BERT pre-trains a language model with a technique using bi-directional transformers, which means that the BERT model can capture the context of words in all directions, in all layers, resulting in what Devlin et al. (2019) call deep bidirectional representations. The BERT model can also be fine-tuned for several downstream tasks, in the same manner as GPT.

The following sections will describe BERT in greater detail, beginning with an overview of the architecture. Subsequently, how BERT represents its input sequences, and how they get broken down into WordPieces, will be described. The last two sections will explore the main parts of the BERT architecture: pre-training and fine-tuning.

2.5.1 Architecture details

The base implementation of a BERT model consists of a stack of 12 Transformer-encoder blocks (see Section 2.3.3 and Figure 2.1), consisting of 12 multi-headed self-attention layers (see Sections 2.3.3.1 & 2.3.3.2). The size of the feed-forward network hidden layer (see Section 2.3.3.4) is also increased compared to the Transformer, from 512 to 768, to account for the increased number of attention heads.

BERT uses the same overall architecture for both pre-training and fine-tuning, with the only difference being the output layer (Devlin et al., 2019). For pre-training, this output layer performs two tasks: masked LM training (see Section 2.5.3.1) and next sentence prediction (see Section 2.5.3.2). During fine-tuning, an output layer for the chosen task is applied instead.
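As a concrete illustration of how only the output layer differs, the sketch below loads a pre-trained BERT encoder through the Hugging Face transformers library and attaches a fresh two-class classification head; the Swedish model identifier is given only as a plausible example and is an assumption rather than a statement about the exact setup used in this thesis.

```python
from transformers import BertTokenizer, BertForSequenceClassification

# The pre-trained encoder stack is reused as-is; only the classification head
# on top of the [CLS] representation is new and trained during fine-tuning.
model_name = "KB/bert-base-swedish-cased"   # example identifier (see Section 3.2)
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Det här är en ganska enkel mening.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)    # unnormalized scores for the two complexity classes
```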

2.5.2 Input representation and Wordpiece Model

To be able to pass text inputs to the encoder stack employed in BERT, three different embeddings per token in the input sequence must be generated.

First, the input sequence is passed through BERT's tokenizer. The tokenizer normalizes the sequence, converts all whitespace to spaces, and splits off all punctuation by adding spaces on both sides of it. Finally, it splits the sequence into wordpiece tokens. Originally, the Wordpiece model was implemented in Google's speech recognition system to solve a Japanese/Korean segmentation problem (Wu et al., 2016). The purpose of the Wordpiece model is to break down words that are unknown and not in the vocabulary into smaller wordpieces that do exist in the vocabulary. For example, the word playing might not exist in the vocabulary. It therefore gets broken down into the wordpieces [play] and [##ing]⁴. The wordpiece model in the standard BERT-base model consists of a 30,000-token vocabulary⁵, including wordpieces of only one or two characters. These smaller wordpieces are useful when, for example, the out-of-vocabulary nonsense word xzat somehow finds its way into the input sequence. It might then be split into the wordpieces [x], [##z], and [##at]. This functionality ensures that every possible token gets some kind of wordpiece representation, instead of putting every non-vocabulary token under a general unknown token, as seen in some other model architectures (McCormick, 2019).

4The ## indicates that the wordpiece is a suffix to another wordpiece.

5All BERT models do, however, come with a specialized wordpiece model of their own. The size and

Furthermore, the tokenizer introduces several special tokens (Devlin et al., 2019):

• [CLS]: A special classification token, used as an aggregated sequence representation during classification tasks. The embedding associated with this token can single-handedly be used when classifying an entire sequence.

• [SEP]: This token is used to mark the end of the input sequence. Furthermore, for BERT to handle different down-stream tasks, the input representation must be able to handle both single sentences and a pair of sentences as input. During a sentence pair task, the [SEP] token is also used as a delimiter between the two sentences. On the other hand, if only the single [SEP] token at the end of the sequence is present, the input will be handled as a single-sentence task. This functionality enables the processing of both types of tasks without any changes to the overall architecture.

• [MASK]: Used to mask tokens when training the masked language model. This is further explained in Section 2.5.3.1.

• [PAD]: All sequences passed into the BERT model must be of the same sequence length. The length is specified beforehand, and this token is used as padding for shorter sequences to reach the specified length.

The wordpiece tokens, together with these special tokens, constitute the token embeddings used to represent the actual text input passed into the BERT model.

The second type of embedding is the segment embedding. These embeddings indicate which sentence in a sentence pair task the current token belongs to. Instead of having to search for where the special [SEP] token is located in relation to the current token and infer which sentence it belongs to, a separate segment embedding stores that information for every token. For single sentence tasks, all of the segment embeddings will be identical.

Finally, the third type of embedding is the position embedding. These embeddings are used to encode the relative position in the sequence of each token. The functionality is identical to the positional encodings in the Transformer, described in Section 2.3.3.3.
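To make the combination of the three embedding types concrete, the sketch below sums token, segment, and position embeddings for a short sequence. It is a simplified, hypothetical PyTorch illustration of the principle; BERT's actual implementation additionally applies layer normalization and dropout to the sum.

# Simplified, hypothetical sketch of how token, segment and position
# embeddings are summed into one input vector per token.
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30000, 512, 768

token_emb    = nn.Embedding(vocab_size, hidden)   # wordpiece tokens
segment_emb  = nn.Embedding(2, hidden)            # sentence A or sentence B
position_emb = nn.Embedding(max_len, hidden)      # positions 0..511

input_ids      = torch.tensor([[7, 42, 13, 99]])              # example token ids
token_type_ids = torch.zeros_like(input_ids)                  # single-sentence task
positions      = torch.arange(input_ids.size(1)).unsqueeze(0)

embeddings = (token_emb(input_ids)
              + segment_emb(token_type_ids)
              + position_emb(positions))
print(embeddings.shape)  # torch.Size([1, 4, 768])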

For an overview of how the two sequences ”my dog is cute” and ”he likes playing” get processed into the input representations used in BERT, using the concepts described in this section, see Figure 2.2.

Figure 2.2: Example of BERT input representations. Adapted from Figure 2 in Devlin et al. (2019).
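As a complement to Figure 2.2, the sketch below shows roughly how the same sentence pair can be converted into token ids, segment ids, and padding information with the HuggingFace BertTokenizer; the library and the English vocabulary are assumptions of this example, and the exact ids depend on the model used.

# Sketch of the full input representation for a sentence pair, assuming the
# HuggingFace transformers library; the exact ids depend on the vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("my dog is cute", "he likes playing",
                    padding="max_length", max_length=16)

print(encoded["input_ids"])       # [CLS] sentence A [SEP] sentence B [SEP] [PAD] ...
print(encoded["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
print(encoded["attention_mask"])  # 1 for real tokens, 0 for [PAD] tokens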



2.5.3 Pre-training

One of the greatest benefits of BERT is that it can be pre-trained on large amounts of text, and then further specialized for the task at hand. The pre-training is performed simultaneously on two different tasks, which capture different features of the input texts. This section first explores these tasks, followed by a description of the amount of data needed to pre-train an effective BERT model.

2.5.3.1 Masked language model

Language model pre-training (Dai & Le, 2015) has proven to be useful in many NLP systems, including the previously mentioned ELMo and Open AI GPT architectures. When training a language model, a sequence of words is used as input, and the task is to predict the word following the input sequence. In a deep architecture, this process requires that the training is done in a sequential manner, left-to-right or right-to-left, through the sequence. Both the ELMo and Open AI GPT architectures operate in this deep monodirectional fashion.

Conversely, the BERT architecture is what Devlin et al. (2019) call deep bidirectional, meaning that every token is connected to the whole context in all layers. This deep bidirectionality leads to an issue which renders the language model prediction trivial (Rajasekharan, 2019). The reason is that the bidirectional connections indirectly allow the word to be predicted to see itself, which means that the model does not need to make any prediction - it could simply access the correct word, without learning anything in the process. This problem is illustrated in Figure 2.3, showing how the word ”from” in the sequence ”Soft sounds from another planet” can be accessed by itself from encoder block two and onward.

Figure 2.3: Illustration of the trivial word predictions in a deep bidirectional model. Adapted from Rajasekharan (2019).

As a way around this problem, Devlin et al. (2019) use a masked language model, where 15% of the tokens in the input sequence are masked, and the prediction task is subsequently performed on these masked tokens. Since the word to be predicted is hidden from the language model, it is forced to rely on the context of the hidden word to make the prediction, avoiding the problem illustrated in Figure 2.3. The same technique of masking words has been used historically to assess human reading proficiency, under the name of the Cloze task (Taylor, 1953).

A consequence of masking words in this manner is, however, that a mismatch between pre-training and fine-tuning emerges (Devlin et al., 2019). That is, the model is trained to perform a prediction only when it sees a masked token, but during fine-tuning, no such tokens are present. The authors solve this by using the following masking procedure for the 15% of the total tokens initially chosen to be masked:

• Alternative 1 (80% of the time): Replace the word with the [MASK] token

• Alternative 2 (10% of the time): Replace the word with another randomly chosen token

• Alternative 3 (10% of the time): Do nothing, and keep the original word. This allows the model to learn to attend to the actual token itself, not only the context tokens

When following this procedure, the model is forced to keep a distributional contextual representation of every input token, not just the [MASK] token, since it will sometimes be asked to perform the prediction on tokens ”masked” with alternatives two and three (Devlin et al., 2019).
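The sketch below gives a simplified, hypothetical illustration of this 80/10/10 scheme for a single selected token; it is not the original implementation used by Devlin et al. (2019).

# Simplified sketch of the 80/10/10 masking scheme for one token that has
# been selected among the 15% to be predicted; hypothetical helper code.
import random

def mask_token(token, vocabulary):
    r = random.random()
    if r < 0.8:                           # alternative 1 (80% of the time)
        return "[MASK]"
    elif r < 0.9:                         # alternative 2 (10% of the time)
        return random.choice(vocabulary)  # a randomly chosen token
    else:                                 # alternative 3 (10% of the time)
        return token                      # keep the original token

tokens = ["soft", "sounds", "from", "another", "planet"]
vocab = ["soft", "sounds", "from", "another", "planet", "music", "quiet"]
# pretend that "from" was among the 15% selected for prediction
tokens[2] = mask_token(tokens[2], vocab)
print(tokens)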

2.5.3.2 Next sentence prediction

While the masked language model training provides an understanding of relationships between words in a sentence, the next sentence prediction task provides an understanding of relationships between sentences (Devlin et al., 2019). A binary classifier is tasked with predicting whether two sentences directly follow each other or not. For each pre-training example, a sentence pair A and B is selected. In 50% of the pairs, sentence B directly follows sentence A in the corpus, while in the other 50%, sentence B is randomly selected from the rest of the corpus. This is a seemingly simple procedure, but it has proven to improve the results on several down-stream tasks (Devlin et al., 2019).
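A hypothetical sketch of how such sentence pairs could be constructed from an ordered corpus is shown below; it only illustrates the 50/50 sampling idea, not the original implementation.

# Hypothetical sketch of next sentence prediction pair construction: 50% of
# the pairs are consecutive sentences (IsNext), 50% use a randomly drawn
# sentence B (NotNext). A real implementation would draw the random
# sentence from a different document.
import random

def make_nsp_pair(sentences, i):
    sentence_a = sentences[i]
    if random.random() < 0.5:
        sentence_b, label = sentences[i + 1], "IsNext"
    else:
        sentence_b, label = random.choice(sentences), "NotNext"
    return sentence_a, sentence_b, label

corpus = ["Soft sounds from another planet.",
          "They echo through the room.",
          "The cat sleeps on the window sill.",
          "Outside, the rain keeps falling."]
print(make_nsp_pair(corpus, 0))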

2.5.3.3 Pre-training data

As mentioned before, BERT models are pre-trained on large amounts of data. For example, the English BERT-base model released alongside the original paper by Devlin et al. (2019) was trained on approximately 16GB of text. The pre-training was performed on four cloud TPUs in Pod configuration, resulting in 16 TPU chips in total. Even with this massive computational power, the whole pre-training process took four days to run (Devlin et al., 2019).

Later studies, like the one performed by Liu et al. (2019), have however indicated that the original BERT models were, to a large extent, undertrained. The authors extended the pre-training dataset to approximately 160GB of uncompressed text in the model they called A Robustly Optimized BERT Pretraining Approach (RoBERTa). That, together with some tweaks to the pre-training tasks, resulted in state-of-the-art performance on several downstream tasks. This gives some indication that different variations of BERT benefit greatly from larger datasets.

The pre-training of these huge models does, however, come with high environmental costs. Strubell et al. (2019) report that pre-training the English BERT-base model on GPU emits CO2 at levels comparable to a trans-American flight. Pre-training also comes at a considerable financial cost, with the price of training the BERT model in the cloud ranging from $3,751 to $12,571, depending on the service (Strubell et al., 2019). A further discussion on the impact of pre-training is provided in Section 5.2.3.

2.5.4 Fine-tuning

Out of the box, the pre-trained BERT model provides general language representations. These can be further refined, and used in several downstream tasks, by fine-tuning the BERT model. The fine-tuning is done by adding an additional output layer to the model and training it to perform the selected task. During this process, all of the model parameters are jointly fine-tuned (Devlin et al., 2019), and specialized to the specific data and downstream task at hand.

One common downstream task BERT is used for is text classification. During classification tasks, the [CLS] token mentioned in Section 2.5.2 is of great use. The [CLS] token is always the first token in a sequence, regardless of whether it is a single-sentence or sentence-pair task, and its final hidden representation h is used as a representation of the whole sequence. This single token can then be passed to the output layer to perform the probability prediction of label c (Sun et al., 2019):

p(c∣h) = softmax(W h), (2.13)

where W is the task-specific parameter matrix. All of the parameters are then fine-tuned by maximizing the log-probability of the correct label (Sun et al., 2019).
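The sketch below is a hypothetical PyTorch illustration of Equation 2.13: the final hidden representation of the [CLS] token is passed through a task-specific linear layer and a softmax to obtain label probabilities.

# Hypothetical PyTorch sketch of Equation 2.13: the final hidden state h of
# the [CLS] token is mapped through a task-specific matrix W and a softmax.
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2                      # e.g. easy-to-read vs. standard

W = nn.Linear(hidden_size, num_labels, bias=False)    # task-specific parameter matrix
h_cls = torch.randn(1, hidden_size)                   # stand-in for the [CLS] hidden state

probabilities = torch.softmax(W(h_cls), dim=-1)       # p(c | h) = softmax(Wh)
print(probabilities)                                  # e.g. tensor([[0.48, 0.52]])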

For the fine-tuning hyperparameters, Devlin et al. (2019) found the following ranges of values to perform best:

• Batch size: 16, 32

• Learning rate: 5e-5, 3e-5, 2e-5

• Number of epochs: 2, 3, 4

The authors also found that, for large datasets (with more than 100k training examples), the choice of hyperparameters did not impact the results to a large degree. On smaller datasets, the authors recommend trying several combinations of these hyperparameters, to find the best combination for the dataset at hand.
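As a hedged illustration of such a search, the sketch below enumerates the recommended combinations with the HuggingFace TrainingArguments class; the library is an assumption of this example, and each configuration would still need to be passed to a trainer together with the model and the tokenized dataset.

# Sketch of a small hyperparameter grid over the ranges recommended by
# Devlin et al. (2019), assuming the HuggingFace transformers library.
from itertools import product
from transformers import TrainingArguments

for batch_size, learning_rate, epochs in product((16, 32),
                                                 (5e-5, 3e-5, 2e-5),
                                                 (2, 3, 4)):
    args = TrainingArguments(
        output_dir=f"runs/bs{batch_size}_lr{learning_rate}_ep{epochs}",
        per_device_train_batch_size=batch_size,
        learning_rate=learning_rate,
        num_train_epochs=epochs,
    )
    # each TrainingArguments object would be combined with a model and a
    # tokenized dataset in a Trainer, and the best-performing run kept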

An alternative to the fine-tuning procedure is to use a feature-based approach to BERT (Devlin et al., 2019). That is, instead of continually updating the parameters of the BERT model during the fine-tuning process, fixed features are simply extracted from the model. These can then be used as features in a system separate from the BERT framework. Devlin et al. (2019) found this approach to perform at levels comparable to fine-tuning, but the fine-tuning approach is the one more frequently used in the literature.
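A minimal sketch of such feature extraction is given below, assuming the HuggingFace transformers library and an English pre-trained model; the extracted vector could then be fed to any classifier outside of the BERT framework.

# Minimal sketch of the feature-based approach: extract fixed features from
# a frozen pre-trained BERT model, assuming the HuggingFace transformers
# library, and use them in a separate system.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Soft sounds from another planet.", return_tensors="pt")

with torch.no_grad():                         # no parameters are updated
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]  # fixed [CLS] feature vector
print(cls_vector.shape)                       # torch.Size([1, 768])
# cls_vector can now be used as input to e.g. a separate linear classifier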


Chapter 3

Method

In this chapter, the methods used will be further explained. The first section explores the different datasets used in the classification task. The second section focuses on the implementation of the BERT architecture with a pre-trained model, and on fine-tuning it to perform said classification. This is followed by a section describing the methods used to evaluate the performance of the implemented model.

3.1 Datasets

Three Swedish datasets were used in this thesis; the eCare and DigInclude subsets in Santini et al. (2019a), as well as an extended version of the DigInclude subset (for an overview of all three datasets, see Table 3.1).

The first dataset used is a subset of the eCare corpus (Santini et al., 2019b), which consists of 3,423 web pages1 with texts regarding chronic diseases, labeled as either lay or specialized by a native lay speaker. The specialized pages were judged to contain medical jargon. In contrast, the lay pages were judged to contain language understandable to a person with no medical training, and are seen as an easy-to-read version of the standard language (medical jargon) used in the domain. The distribution between the two categories is 2,560 specialized pages (75%) and 863 lay pages (25%). The total number of words in the whole eCare dataset is 424,278.

The second dataset used is a subset of the DigInclude corpus (Rennes & Jönsson, 2016). The dataset consists of a total of 17,502 sentences crawled from Swedish authorities’ web sites. 3,827 (22%) of the sentences were categorized as easy-to-read (simple) and 13,675 (78%) were categorized as of standard complexity. There is a total of 233,094 words in the subset. Both of these datasets from Santini et al. (2019a) are unbalanced, with one class clearly outweighing the other.

The third dataset used is another subset of the DigInclude corpus (Rennes & Jönsson, 2016). However, instead of single sentences, each entry consists of several sentences. This is a balanced dataset of 6,164 entries, with 3,082 (50%) entries classified as easy-to-read (simple) and 3,082 (50%) classified as of standard complexity. The total number of words in this dataset is 530,089.

The first two datasets, in their original forms, can be found in their entirety on the companion web site2 to Santini et al. (2019a).

1For reasons explained in Section 3.3.1, the class balance and the number of entries have been altered from the subset used in Santini et al. (2019a).

