
Linköping University | Department of Computer and Information Science

Master's thesis, 30 ECTS | Computer Science

2019 | LIU-IDA/LITH-EX-A--19/033--SE

Duplicate Detection and Text Classification on Simplified Technical English

Dublettdetektion och textklassificering på Förenklad Teknisk Engelska

Max Lund

Supervisor: Arne Jönsson
Examiner: Marco Kuhlmann

Linköpings universitet, SE–581 83 Linköping


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

This thesis investigates the most effective way of performing classification of text labels and clustering of duplicate texts in technical documentation written in Simplified Technical English. Pre-trained language models from transformers (BERT) were tested against traditional methods such as tf-idf with cosine similarity (kNN) and SVMs on the classification task. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.


Acknowledgments

First off I'd like to thank Etteplan and Lars Karlsson for giving me the opportunity to do my thesis project within the field of NLP. I got the chance to work with things that really interest me, which made everything more fun and less stressful!

I would also like to thank my external supervisor at Etteplan, Magnus Lundqvist, and my supervisor at Linköping University, Arne Jönsson, for their guidance and helpful input. And thank you to my examiner Marco Kuhlmann for being an excellent teacher, sparking an interest in NLP and Text Mining in me and many others.

Thank you to Svea Jörgensen and my sister Johanna Lund for proof-reading and providing good suggestions for improvements to the language.

And finally a special thank you to my girlfriend Tiene Mendes for making my life so much easier through all the stress of finishing this thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations

1 Introduction
1.1 Motivation
1.2 Research questions
1.3 Delimitations

2 Theory
2.1 Natural Language Processing (NLP)
2.2 Text pre-processing
2.2.1 Tokenization
2.2.2 Homogenization of texts
2.3 Vector space models
2.3.1 Term-frequency Inverse document-frequency
2.3.2 Class-based weight schemes
2.3.3 Context independent neural word and document embeddings
2.3.4 Context dependent neural word embeddings
2.4 Sequence to sequence models
2.5 The Transformer
2.6 BERT
2.6.1 Pre-training procedure
2.6.1.1 Masked language model
2.6.1.2 Next sentence prediction
2.6.2 Fine-tuning procedure
2.6.3 WordPiece tokenization
2.7 Document embeddings from transformers and LSTM
2.8 Cosine similarity classifier
2.9 Support vector machines
2.10 Density Based Spatial Clustering of Applications with Noise (DBSCAN)
2.10.1 Hierarchical DBSCAN (HDBSCAN)
2.11 Related work on STE module classification and clustering
2.12 Validation metrics
2.12.1 Classification
2.12.2 Clustering

3 Data
3.1 Dataset properties
3.2 STE modules
3.3 "Problematic" modules

4 Method
4.1 Classification
4.1.1 Classification of an unlabeled manual
4.2 Clustering
4.2.1 Exploring document embedding viability
4.2.2 Comparing semantic vs. term-frequency embeddings
4.2.3 Clustering with HDBSCAN
4.2.4 Clustering based on custom distance threshold in DBSCAN
4.2.4.1 Cleaning up duplicates

5 Results
5.1 Classification task
5.2 Clustering task
5.2.1 Similarity tests using 51 known module duplicates
5.2.2 Examples of high difference between tf-idf and semantic document embeddings
5.2.3 Clustering with HDBSCAN
5.2.3.1 Inspecting clustered texts
5.2.3.2 Clustering with an explicit distance threshold in DBSCAN

6 Discussion
6.1 Results discussion
6.1.1 Classification of module labels
6.1.2 Clustering for duplicate detection
6.1.2.1 Similarity tests using 51 known module duplicates
6.1.3 Examples of high difference between tf-idf and semantic document embeddings
6.1.3.1 Clustering with HDBSCAN
6.2 Method discussion
6.2.1 Classification of module labels
6.2.2 Clustering for duplicate detection
6.2.2.1 Similarity tests using 51 known module duplicates
6.2.3 Examples of high difference between tf-idf and semantic document embeddings
6.2.4 Clustering with HDBSCAN
6.2.5 Clustering with an explicit distance threshold in DBSCAN
6.3 Source criticism
6.4 The work in a wider context

7 Conclusion
7.1 Future work

Bibliography

8 Appendix
8.1 Module-pair with high distance difference between tf-idf and BERT representations


List of Figures

2.1 Simplified illustration of the two bidirectional LSTM layers in ELMo with an input sequence that starts with "This is ...". The hidden word representations in layer 1 and layer 2 are concatenated from the respective layer's forward and backward LSTM. The final ELMo embeddings for each word are a weighted sum of the hidden representations and the input representations.

2.2 Illustration of an encoder using the last hidden state as an encoded representation of the sentence (i.e. e = e_2).

2.3 Illustration of a decoder. The <sos> (start-of-sentence) representation together with the encoder output is input to the LSTM at the first time step. The decoding of the sentence stops when the <eos> (end-of-sentence) word is predicted.

2.4 One layer of the encoder in The Transformer. The left part shows the computations of a single attention-head, which are performed h = 8 times for a single encoder layer.

2.5 Visualized attention scores [32] in BERT for all tokens. The left side shows scores from attention head 8 in encoder layer 7, which seems to focus on tokens separating clauses in the text. The right side shows attention head 1 from layer 2, where the attention for every word is focused on the next word.

2.6 Visualized attention scores [32] in BERT for the token ##ge. The left side shows scores from attention head 1 in encoder layer 2. The right side shows attention head 5 from layer 6.

2.7 The points of two classes.

2.8 Example of a confusion matrix.

2.9 Example of the "elbow method" identifying k = 3 as the best candidate.

3.1 Counts of modules in different classes.

3.2 Length of modules by number of words. The left side plot shows the modules as points, separated by their class labels. The points above the horizontal red line represent the number of modules that will have to be truncated in BERT since they contain more than 512 words. The right side plot shows the length of each module (sorted) as a vertical line. The area above the horizontal red line represents the proportion of the total amount of text that will have to be removed in BERT.

4.1 Model convergence on the test set in different CV folds.

4.2 Example of a command-line interface for cleaning up duplicate modules. All the sentences of both module texts are printed. When there is a difference in the sentences, both sentences are printed with the preceding symbols - and + to mark which module it belongs to. A dashed line is inserted under the word(s) to show where the difference was found. The user is prompted to discard either module +, -, both (b) or neither (n).

5.2 Confusion matrix for uncased BERT using WordPiece tokenization. The top-left box shows that 90.5% of all "Accessories" modules were correctly classified. The number of correct predictions (19) and the true total number of such modules (21) are also seen in the same box. To the right of the top-left box we can see that one "Accessories" module was mistakenly classified as an "Assembly" module. The top number in the box shows the percentage of "Accessories" modules misclassified as "Assembly" (4.8%), and the bottom number shows the number of misclassifications (1).

5.3 Confusion matrix for cosine classifier using tf-icf vectorization. See Figure 5.2 for a description of the plot.

5.4 Confusion matrix for SVM classifier using tf-idf vectorization. See Figure 5.2 for a description of the plot.

5.5 Plots showing the predicted and true labels for an unlabeled manual at different character positions. The upper plot shows the predicted labels and the softmax probabilities for the predicted label above, to serve as a measure of confidence in the predictions at different character positions. For instance, when the probability score sometimes drops for a short segment in the middle of the long "Safety" section, it means that there is a module boundary there where the prediction is less certain of the label.

5.6 Same plot as in Figure 5.5, showing the softmax probability for all labels (not just the predicted label). We can see that when the probability drops for a module in the middle of the long "Safety" section, the probability simultaneously increases for the incorrectly classified "Introduction" and "Maintenence" labels.

5.7 Failed similarity tests using different document vectors. The y-axis are different document embeddings. The x-axis are module ids of module-pairs that are duplicates. A colored point means that an embedding failed the corresponding similarity test on the module-pair. The BERT encoder layers are counted from the first layer (1) to the last layer (12). CLS denotes that the [CLS] token was used as a document vector. AVG denotes that the sentence vectors of a module's text were averaged to form a document vector.

7.1 Text difference visualization in Meld.


List of Tables

5.1 Stratified cross-validation results using BERT learning rate 5⁻⁵. Bold numbers indicate highest average accuracy across CV folds for each model. If a column value is marked with "-", it means that the value is not applicable for the particular model or weight-scheme.

5.2 Results from HDBSCAN clustering using cosine distance. AVG denotes averaged sentence vectors. Concat denotes the concatenation of the last four encoder layers in BERT.

5.3 Results from DBSCAN clustering using an explicit cosine distance threshold.


List of Abbreviations

BERT: Bidirectional Encoder Representations from Transformers (Section 2.6)
CMS: Content Management System (Chapter 1)
D-Max: Document Max (Section 2.3.2)
D-TMax: Documents Two Max (Section 2.3.2)
DBCV: Density-Based Clustering Validation (Section 2.12.2)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise (Section 2.10)
DITA: Darwin Information Typing Architecture (Chapter 1)
Doc2Vec (Section 2.3.3)
ELMo: Embeddings from Language Models (Section 2.3.4)
GloVe: Global Vectors for Word Representation (Section 2.3.4)
HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise (Section 2.10.1)
LSTM: Long Short-Term Memory (Section 2.3.4)
NER: Named-Entity Recognition
OOV: Out-Of-Vocabulary
RNN: Recurrent Neural Network (Section 2.3.4)
STE: Simplified Technical English (Section 3.2)
SVM: Support Vector Machine (Section 2.9)
USE: Universal Sentence Encoder (Section 2.7)
W-Max: Word Max (Section 2.3.2)
Word2Vec (Section 2.3.3)
fastText (Section 2.3.4)
kNN: k-Nearest Neighbors (Section 2.8)
tf-icf: Term-frequency Inverse class-frequency (Section 2.3.2)
tf-icf-cf: Term-frequency Inverse class-frequency In-class frequency (Section 2.3.2)
tf-idf: Term-frequency Inverse document-frequency (Section 2.3.1)
tf-idf-cf: Term-frequency Inverse document-frequency In-class frequency (Section 2.3.2)


1 Introduction

Technical documentation is often created by the use of content management systems (CMS), which allow a technical author to compose a large, complex document by using smaller building blocks. These building blocks, called modules¹, can often be re-used across different documents. By using standardized XML-based information models, such as The Darwin Information Typing Architecture (DITA)² or S1000D³, it is easy to query a CMS for modules that can be re-used when writing technical documentation. This allows for more effective creation of documentation, easier collaboration by multiple authors, as well as reduced translation costs. However, modules that convey the same meaning are sometimes authored by different people, leading to multiple copies existing in the databases used within a CMS. This leads to missed opportunities of re-use, and since a module's unique ID then differs from a duplicate, it could end up being re-translated, which leads to unnecessary costs. Another obvious concern is the extra work performed by technical writers when duplicate modules are authored. Because of these concerns, an automated way of identifying such instances of unnecessary duplicates would be of interest.

Furthermore, manufacturers in the EU are legally obligated to keep any technical documentation for 10 years [2]. Some legacy documentation does not follow structured information models, but contains useful content that is viable for re-use and migration into newer systems. An automated way of making or aiding such migrations would therefore also be of interest.

A company where these concerns are relevant is Etteplan, a global company operating within three main areas: engineering services, software and embedded solutions, and technical documentation. Many technical documentations are created and maintained through their proprietary CMS called HyperDoc, using standardized specifications of XML-content such as DITA or S1000D. The top-level XML-tags of this structure correspond to a label of the module’s content. When migrating or acquiring a database of legacy technical documentation, automatic classifications of these labels could then be useful. The classifications of these labels are manually added to modules at the time of their creation by expert technical authors. These classifications should therefore generally be of high quality and a good source for ground truth labels in a classification problem.

¹ Other names for modules include: content components, topics, or content modules [1]
² http://docs.oasis-open.org/dita/v1.2/spec/DITA1.2-spec.html


An earlier master's thesis [3] was conducted at Etteplan with the aim of identifying duplicates in a dataset of documentation modules for one of Etteplan's customers. This was done by creating Doc2Vec embeddings (i.e. vectors) [4] of module texts and using the HDBSCAN algorithm [5] to identify clusters of modules that are duplicates. In order to validate the quality of clusters produced by the algorithm, the density-based clustering validation index (DBCV) [6] was used. As a continuation of this previous thesis, my work will investigate both the ability to automatically classify the module labels, and if the document embeddings used for this classification problem can be used to further improve the ability to cluster and detect duplicate modules. Furthermore, it will be an investigation of the benefits of using word representations that capture semantic meaning when aiming to identify duplicate modules. It should also be noted that there is no exact definition to use for determining if two modules are duplicates or not, so any qualitative analysis will have to be based on common-sense reasoning.

Another common feature of technical documentation is something called Simplified Technical English (STE)⁴, which defines a set of rules and restrictions to the English language that the modules of a technical documentation must adhere to. Among other things, STE defines that the meaning of a word should not be dependent on the context it appears in. From an academic standpoint, this property will be a focus of this thesis. Since some of the techniques and models used in this work are said to be effective in large part because of their ability to capture contextualized meaning [7][8][9][10], it is interesting to use an STE dataset for testing this claim.

1.1 Motivation

Recent advancements with pre-trained language models have been called the ImageNet⁵ moment for Natural Language Processing (NLP)⁶, where models such as BERT (Bidirectional Encoder Representations from Transformers) [9] and ULMFiT (Universal Language Model Fine-tuning for Text Classification) [11] have achieved new state of the art results on many NLP tasks⁷.

Earlier approaches to pre-training have been to use models such as Word2Vec [12] to learn word embeddings from large corpora in a language modeling task, which are later used as input to other systems. The new approach trains a language model on a large corpus, where the model itself can be fine-tuned for a downstream task with relatively minimal re-configuration. This approach is called transfer learning and is well-known within computer vision, where deep learning models are pre-trained on ImageNet and later fine-tuned on a downstream task.

The BERT model in particular has been shown to be very versatile by performing well in different tasks such as classification, question answering, and named entity recognition [9]. One of the benefits of this model is the ability to capture contextual word embeddings, which are lacking in other pre-trained embeddings, such as Word2Vec. To illustrate contextual word embeddings, consider the following sentences:

• "The bank on Main Street was robbed yesterday."
• "There was a small boat on the bank of the river."

The word “bank” obviously has different meanings in the two sentences. If we used only a single vector representation for this word, we can imagine that the semantic content in some contexts would be represented incorrectly.

The example above illustrates something that should never occur in technical documentation, since the STE used within modules explicitly requires unambiguous usage of words. The benefits of using BERT on STE should therefore be diminished compared to using it on "regular" English. However, there are other parts of BERT, such as the WordPiece tokenization method [13], and the self-attention mechanisms that attend to syntactic structure, that could be contributing factors for success even without the benefit of capturing contextualized meanings of words. Furthermore, even if STE defines only one allowed context for each unique word, a model which correctly represents this context should be beneficial compared to a model that uses only one representation across all possible contexts.

⁴ http://www.asd-ste100.org/
⁵ http://www.image-net.org/
⁶ http://ruder.io/nlp-imagenet/
⁷ https://nlpprogress.com/

Earlier works on clustering and classification of modules in technical documentation have been conducted in studies using traditional document vectors as bag-of-words with term-frequency metrics. Using class-based weight-schemes, these studies showed successful results in the two tasks (classification and clustering) by simply measuring the cosine similarity between document vectors [14][15][16][1]. The class-based weighting schemes used in these studies were also further improved upon using non-STE datasets [17].

With these results in mind, this thesis proposes that it would be of interest to compare the performance of models like BERT to more traditional methods. This would be one way to investigate the impact of unambiguous meanings of words on the performance of these otherwise hugely successful new models and techniques. In more general terms, this thesis aims to investigate how the unique properties of STE affect performance and requirements on models for common NLP tasks.

1.2 Research questions

The purpose of this thesis could be summarized as answering the question “how are common NLP tasks best performed on STE?”. Many new techniques that achieve state of the art results on NLP tasks also require specialized hardware to use, and are very computationally demanding. This makes them problematic to put in production outside an academic context. It is therefore of interest to investigate what benefits these techniques offer when working with STE, which should be considered a subset of English due to its unique properties. The explicit research questions for this thesis are the following:

1. For the task of classifying STE module labels, how does BERT compare to previously successful term-based vectorization methods and traditional classifiers, with regards to accuracy?

The aim is to investigate whether BERT still outperforms traditional methods for classification, such as cosine similarity and support vector machines, when the dataset explicitly forbids contextually dependent meanings of words. Earlier works show that class-based weighting schemes can greatly improve results on classification tasks for both STE and non-STE datasets.

2. When classifying module labels, is there a benefit to using WordPiece tokenization in BERT in order to use embeddings for sub-words when an out-of-vocabulary (OOV) word is encountered?

The aim is to investigate if performance is degraded from not using this special tokenization method. The main benefit of self-attention mechanisms is said to be the capturing of contextualized meanings of words. If texts then have unambiguous meanings of words, there is an opportunity to investigate what factors impact BERT performance.

3. When classifying module labels, is there a benefit to using document vectors that are more sensitive to named entities, with regards to accuracy?

Previous work on STE classification saw improved results using class-based weight schemes for terms instead of weighing by the inverse document-frequency. This should give proportionally less weight to very rare terms, such as unique model names of products. Likewise, the BERT paper [9] recommends only using a cased vocabulary (i.e. not lowercasing all words) for named entity recognition tasks, and using a lowercase vocabulary and tokenization for all other tasks. But since technical documentation might have many uppercased entities that are of special importance, there could be a benefit to using a cased vocabulary together with WordPiece tokenization (since rare entities will often be represented even if they are OOV).

4. For the task of clustering duplicate STE module data with HDBSCAN, how do document vectors extracted from a fine-tuned BERT model compare to the term-frequency based document vectors, with regards to DBCV and tests on manually annotated module duplicates?

The aim is to investigate whether document representations that are more successful in the classification task are also well suited for clustering to identify duplicate modules. The HDBSCAN algorithm with DBCV score as an evaluation metric is chosen since an earlier master's thesis at Etteplan [3] identified them as good choices for this task. This also allows comparison with results from that earlier work.

5. For the task of clustering duplicate STE module data, is there any benefit to using document representations from pre-trained models, such as those obtained by a transformer architecture (BERT) or bi-directional LSTMs, over simpler methods based on term-frequency?

The fine-tuned representation used in classification from BERT might be a good candidate, but there are other ways of using pooling and concatenation of hidden representations from a transformer or LSTM network. One of the main benefits of pre-trained models is that semantic information can be captured in a way that is not possible with term-frequency vectors. This could then enable identification of duplicate modules that have the same semantic content, but express that content differently. This research question should be evaluated by manual inspection of differences in duplicate identification between models (i.e. term-frequency vs. pre-trained).

1.3 Delimitations

There are many other ways of doing document classification and embedding of documents into vector spaces. However, earlier works in the specific domain of classification and clustering on STE module data are limited. Considering the recent success of transformer models [8], this thesis is focused on a comparison of these and the earlier works on STE. Furthermore, all experiments are performed on the same dataset consisting of module data from one of Etteplan's customers.


2 Theory

This chapter introduces the general concepts and techniques within the field of NLP, including algorithms and methods used in previous works for classification and clustering on technical documentation. Furthermore, this chapter will cover language modeling, LSTM networks, sequence to sequence models, transformers, clustering algorithms, and common validation metrics.

The theory section assumes the reader has some knowledge of how neural networks are trained with gradient descent using backpropagation of errors from some defined loss function, such as the cross-entropy loss for classification tasks.

2.1 Natural Language Processing (NLP)

NLP is a field within computer science and artificial intelligence largely concerned with com-putationally processing natural language for some explicit task. Some examples of common tasks within NLP are text classification, machine translation, and text summarization. Early approaches to NLP, especially within the syntactic parsing of text, relied heavily on rule-based systems derived from linguistic theory, whereas the current state of the art relies much more on statistical approaches using machine learning [18].

For the tasks relevant to this thesis (classification and clustering) the long-standing approach has been to create vector space models where documents are represented as vectors. The traditional way of doing this has been to create vectors with the same dimension as the number of unique words (known in NLP as word types) in the corpus. A document is then viewed as a so-called bag-of-words, where the ordering of words relative to each other is disregarded. In order to better facilitate this technique, the texts undergo common pre-processing steps.

2.2 Text pre-processing

In many applications of NLP, texts undergo pre-processing steps before being used in some specific task, such as classification. In this section, common techniques for pre-processing texts are described.


2.2.1 Tokenization

Tokenization is the splitting of texts into smaller linguistic units, such as words or sentences. Below is an example of a sentence that is tokenized using two different methods¹:

Pre-tokenization: Tokenization isn’t always easy.

WhitespaceTokenizer: {“Tokenization”, “isn’t”, “always”, “easy.”}
TreebankWordTokenizer: {“Tokenization”, “is”, “n’t”, “always”, “easy”, “.”}

In both tokenized sentences the tokens are separated by commas. For example, it can be observed that the word isn't has been transformed into the two tokens is and n't by the TreebankWordTokenizer, while the WhitespaceTokenizer keeps the word as a single token. Since isn't is actually the contraction of the words is and not, it is probably better to use two different tokens. However, this creates a word type of the token n't which is now different from the word type not, when they in fact refer to the same word. We can also note that easy. and easy (observe the punctuation) become different tokens in the examples above. This illustrates one of the concerns relating to creating more homogeneous tokens between texts in terms of sharing common word types.
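To make the comparison concrete, the following minimal sketch reproduces the two tokenizations above with NLTK (assuming the library is installed); the printed tokens are indicative and may differ slightly between NLTK versions.

from nltk.tokenize import WhitespaceTokenizer, TreebankWordTokenizer

text = "Tokenization isn't always easy."

# Splitting on whitespace only: punctuation stays attached to the words.
print(WhitespaceTokenizer().tokenize(text))
# e.g. ['Tokenization', "isn't", 'always', 'easy.']

# Treebank rules split contractions and trailing punctuation into separate tokens.
print(TreebankWordTokenizer().tokenize(text))
# e.g. ['Tokenization', 'is', "n't", 'always', 'easy', '.']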

2.2.2 Homogenization of texts

To create more homogeneous tokens between texts, common steps to take are lowercasing of words, removing punctuation and non-alphanumeric characters, and replacing contractions with their full words (e.g. isn't becomes is not). When dealing with bag-of-words and term-frequency based models, another common step is to remove stopwords. These are words such as i, you, at, it, that, has etc. that are common in practically any kind of text, regardless of the text's subject or style. The idea is to remove these words since they will not have any discriminating quality in something like a classification or information retrieval task.

Two other steps that are sometimes used are stemming and lemmatization. Stemming is a heuristic process to chop off word endings or beginnings of words in the hope of producing a base or root (the "stem") of the word. For example, using the PorterStemmer [19] algorithm on the words {process, processed, processing} would produce the word process in all cases. Lemmatization refers to a similar practice, but uses a vocabulary and morphological analysis in order to produce the root dictionary form of a word (the "lemma"). Stemming and lemmatization have both advantages and disadvantages, and should not be regarded as an obviously beneficial pre-processing step. For instance, words that have different meanings can mistakenly be reduced to the same base.
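As a concrete illustration of the difference, the sketch below runs NLTK's PorterStemmer and WordNetLemmatizer on the example words (assuming NLTK and its WordNet data are available); the outputs shown in the comments are indicative only.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Heuristic suffix stripping: all three forms reduce to the stem "process".
for word in ["process", "processed", "processing"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization uses a vocabulary and the part of speech; with pos="v" (verb),
# "processed" and "processing" both map to the dictionary form "process".
for word in ["processed", "processing"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))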

2.3 Vector space models

The vector space model is the general idea of representing text (i.e. words or documents) as vectors, which are sometimes called word embeddings or document embeddings. Vector space models are a common feature within the fields of information retrieval and NLP, where they can be used for classifying or querying a corpus of documents.

2.3.1 Term-frequency Inverse document-frequency

For creating document vectors one of the most common techniques is called term frequency–inverse document frequency (tf-idf) [20]. If a corpus of documents D has vocabulary V, each document can be represented by a vector v ∈ R^|V|, where each index of the vector identifies a unique term. For each term t ∈ V that is present in a document d ∈ D, the raw term-frequency f_{t,d} of the term within the document is stored in the document's vector representation v_d. The raw term-frequencies in v_d are then weighed according to their inverse document-frequency, so that the weight of common words is scaled down and the weight of uncommon words is scaled up, which becomes the tf-idf weights for the document. It is common to use the logarithmically scaled and smoothed term-frequency and inverse document-frequency, so that the calculation becomes:

\mathrm{tf}(t, d) = \log(1 + f_{t,d})    (2.1)

\mathrm{idf}(t, D) = \log \frac{|D|}{1 + |\{d \in D : t \in d\}|}    (2.2)

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)    (2.3)

The intuition behind tf-idf is the following: terms that are common to a certain document but uncommon in the corpus should be good at discriminating the document from others. Another important detail to note is that the terms can be both single words and n-grams. An n-gram is simply n consecutive words appearing in the text. For example, if we have the three consecutive tokens {"My", "three", "tokens"} and extract both the unigrams and bigrams, we would get the terms {"My", "three", "tokens", "My three", "three tokens"}.
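As a minimal sketch of equations 2.1-2.3, written directly from the formulas above rather than from any particular library, and using a made-up toy corpus:

import math
from collections import Counter

# Toy corpus: each document is a list of tokens (unigrams only, for brevity).
corpus = [
    ["remove", "the", "oil", "filter"],
    ["install", "the", "new", "oil", "filter"],
    ["check", "the", "safety", "valve"],
]

vocab = sorted({t for doc in corpus for t in doc})
N = len(corpus)

def idf(term):
    # Smoothed inverse document frequency, equation 2.2.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / (1 + df))

def tfidf_vector(doc):
    # One weight per vocabulary term, equations 2.1 and 2.3.
    counts = Counter(doc)
    return [math.log(1 + counts[t]) * idf(t) for t in vocab]

for doc in corpus:
    print(tfidf_vector(doc))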

2.3.2 Class-based weight schemes

When the task is to classify documents, rather than to retrieve documents matching some query, it could be a better idea to weigh terms in document vectors not by their inverse document-frequency but by their inverse class-frequency (icf). This could then lead to higher weights for terms that discriminate better on the class-level instead of on the document-level. To get tf-icf weights, the term-frequency is calculated as in equation 2.1, so that the tf-icf calculation for classes C in the corpus becomes:

\mathrm{icf}(t, c, C) = \log \frac{|C|}{1 + |\{c \in C : t \in c\}|}    (2.4)

\mathrm{tficf}(t, d, c, C) = \mathrm{tf}(t, d) \cdot \mathrm{icf}(t, c, C)    (2.5)

In previous work on classification of module XML-tags [1][16], variations of class-based weight schemes were used to significantly improve the accuracy of a cosine similarity classifier. The first example uses tf-idf-cf, where the cf term stands for in-class frequency, which is the frequency of a term within a class. Using the equation definitions in 2.1 and 2.2 for tf and idf, and f_{t,c} the frequency of term t in class c, the tf-idf-cf calculation becomes:

\mathrm{cf}(t, c, C) = \log \frac{f_{t,c}}{1 + |c \in C|}    (2.6)

\mathrm{tfidfcf}(t, d, D, c, C) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \cdot \mathrm{cf}(t, c, C)    (2.7)

Another variation is tf-icf-cf, which is calculated using equations 2.1, 2.4, and 2.6 as:

\mathrm{tficfcf}(t, d, c, C) = \mathrm{tf}(t, d) \cdot \mathrm{icf}(t, C) \cdot \mathrm{cf}(t, c, C)    (2.8)

Finally, another approach using probability estimation for terms appearing in a class as a weight-scheme saw improved results using support vector machines and the k-nearest neighbor algorithm in classification tasks on "regular" English datasets [17]. This method is called term-frequency term-relevance-ratio (tf-trr). By using equation 2.1, for vocabulary V and a term t ∈ V in a document d ∈ D that has the class c ∈ C, where c̄ are all the other classes not c, and α is the base of the logarithmic operation, the calculation becomes:

\mathrm{tftrr}(t, d, c, C) = \mathrm{tf}(t, d) \cdot \log_{\alpha} \frac{P(t|c)}{P(t|\bar{c})}    (2.9)

If T_c is the set of documents in class c, then P(t|c) and P(t|c̄) are calculated as:

P(t|c) = \frac{\sum_{k=1}^{|T_c|} f_{t,k}}{\sum_{l=1}^{|V|} \sum_{k=1}^{|T_c|} f_{l,k}}, \quad P(t|\bar{c}) = \frac{\sum_{k=1}^{|T_{\bar{c}}|} f_{t,k}}{\sum_{l=1}^{|V|} \sum_{k=1}^{|T_{\bar{c}}|} f_{l,k}}    (2.10)

The weight schemes tf-idf-cf, tf-icf-cf, and tf-trr all require some way of creating document vectors at inference time when the class of a document is unknown. The authors of the tf-trr weight scheme [17] propose creating |C| document vectors as a matrix W_d, where each row is calculated as if a document d belonged to one of the possible classes in C. The final document vector for d is then resolved in one of three ways:

• Word Max (W-Max): The maximum term weight of each vector in W_d is taken (i.e. take the max over the column-axis of the matrix). Note that the terms don't have to be words, they could also be larger n-grams.

• Document Max (D-Max): The term weights for each vector in W_d are summed, and the vector with the maximum value is selected (i.e. take the max of the sum over the row-axis of the matrix).

• Documents Two Max (D-TMax): The two vectors in W_d that have the largest sum of term weights are selected. Then, for each term, the highest weight between the two selected vectors is used to create the final document vector.
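The sketch below illustrates the class-based idea for the simplest case, tf-icf (equations 2.4 and 2.5), on a toy labelled corpus; the data and class names are made up, and the class-conditioned schemes (tf-idf-cf, tf-icf-cf, tf-trr) would additionally build one vector per candidate class and resolve it with W-Max, D-Max, or D-TMax as described above.

import math
from collections import Counter

# Toy labelled corpus: (tokens, class label). Labels are illustrative only.
corpus = [
    (["remove", "the", "oil", "filter"], "Maintenance"),
    (["install", "the", "new", "filter"], "Maintenance"),
    (["wear", "protective", "gloves"], "Safety"),
]

classes = sorted({label for _, label in corpus})
vocab = sorted({t for doc, _ in corpus for t in doc})
# All terms that occur in each class, used for the class-level statistics.
class_terms = {c: {t for doc, label in corpus if label == c for t in doc} for c in classes}

def icf(term):
    # Inverse class frequency, equation 2.4.
    cf = sum(1 for c in classes if term in class_terms[c])
    return math.log(len(classes) / (1 + cf))

def tficf_vector(doc):
    # Equation 2.5: log-scaled term frequency times inverse class frequency.
    counts = Counter(doc)
    return [math.log(1 + counts[t]) * icf(t) for t in vocab]

print(tficf_vector(["check", "the", "filter"]))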

2.3.3 Context independent neural word and document embeddings

The previously covered vectorizations of documents are all derived from the traditional tf-idf method, first created in 1975 [20]. Newer approaches take a language modeling task as an objective to learn word or document representations. A language model can generally be described as a model of what words or sequences are more likely to be generated. An example could be to predict the word w_i given the previous words (w_{i-1}, w_{i-2}, ..., w_{i-n}).

With the development of Word2Vec by Mikolov et al. in 2013 [12], a new approach using neural networks was shown to enable capturing of semantic meaning of words. In general terms, this approach randomly initializes vectors of some specified size for each unique word in a large corpus, and then uses one of two architectures to learn word embeddings. The first approach is called Continuous Bag-Of-Words (CBOW), where the vector for the target word is predicted from the input vectors of the surrounding words from a chosen context size of c words. The second approach is called Skip-gram, where instead the surrounding word vectors in c are predicted from an input word vector. The intuition for why this captures semantic meaning of words comes from the now famous quotation by John Rupert Firth: "You shall know a word by the company it keeps". The details of the network architecture are not the focus of this thesis, but it is interesting to note that it is a "simple" shallow feed-forward network, in contrast to more recent models that will be covered in later sections.

The authors of Word2Vec also developed the Doc2Vec model [4]. Doc2Vec is a continuation of Word2Vec that enables embedding of complete paragraphs or documents. In general terms, this model adds a paragraph vector to be trained along with the word vectors. Intuitively, this vector captures what is shared among the word vectors for the document, which hopefully corresponds to its semantic meaning. The previous master's thesis at Etteplan [3] used the Paragraph Vector with Distributed Bag of Words (PV-DBOW) version of the model. PV-DBOW uses the paragraph vector alone to predict words sampled from a context window in the document, rather than combining it with word vectors as in the Distributed Memory (PV-DM) version [4].

It is the case for both the term-frequency based weight schemes and the neural embeddings used in the previous master's thesis that the document vectors don't consider the ordering of words in documents. Furthermore, the learned word embeddings in both Word2Vec and Doc2Vec are context independent, meaning that they use a single representation for each word (see the "bank" example in Section 1.1).
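For reference, a PV-DBOW model of the kind used in the earlier thesis can be trained with the Gensim library roughly as follows; the corpus and hyperparameters are placeholders and are not those used in [3].

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each module text as a list of tokens, tagged with a unique id.
module_texts = [["remove", "the", "oil", "filter"],
                ["install", "the", "new", "oil", "filter"]]
documents = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(module_texts)]

# dm=0 selects the PV-DBOW training algorithm; vector_size, epochs etc. are illustrative.
model = Doc2Vec(documents, dm=0, vector_size=100, window=5, min_count=1, epochs=40)

# Infer a document embedding for an unseen module text.
vector = model.infer_vector(["check", "the", "oil", "filter"])
print(vector.shape)  # (100,)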

2.3.4 Context dependent neural word embeddings

The work on word embeddings has seen further development with models such as GloVe [21], Probabilistic FastText [22], and Paragram embeddings [23], where pre-trained models have been made publicly available. These models have developed new ways to capture statistical relationships between words [21], efficient representations of sub-words for better handling of rare words [22], and paraphrased meanings from a large paraphrase corpus [23]. However, they are all context free in the sense that every unique word only has a single learned word embedding.

Recent years have seen the emergence of models that can produce context-aware embeddings, such as context2vec [24], CoVe [25], TagLM [26], and ELMo [7]. ELMo (Embeddings from Language Models) in particular proved very successful and displayed state of the art results on many tasks by using the embeddings it produces [7].

Even though ELMo is not the focus of this thesis, it is a good starting point for introducing the theory behind the transformer architecture and mechanisms used in BERT and others. ELMo uses a deep recurrent neural network (RNN) with Long Short-Term Memory (LSTM) [27] cells which, in general terms, allows saving a hidden state from previous inputs to the LSTM cell that can act upon subsequent inputs. This allows keeping track of long-term dependencies. The output hidden state from each time step to an LSTM cell is fed back into the cell together with the input at the next time step. For each computational time step t through the LSTM network, the hidden state h_t then depends on the hidden state h_{t-1} and the input at step t.

The language model objective of ELMo is bidirectional, where, given N tokens (t_1, t_2, ..., t_N), a forward language model predicts the token t_k given the history (t_1, ..., t_{k-1}):

P(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} P(t_k | t_1, t_2, ..., t_{k-1})    (2.11)

Similarly, a backward language model predicts the previous token given the future context:

P(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} P(t_k | t_{k+1}, t_{k+2}, ..., t_N)    (2.12)

The model takes word embeddings as input to the first, bottom layer LSTM (these are actually character-based word representations from a convolutional neural network, see [7] for the details). The forward and backward language models predict a representation in the sentence given the forward or backward contexts. The hidden representations of the LSTM cells in the bottom layer are also fed into a higher level layer of the same size, where the same bidirectional language modeling task is performed using the hidden representations from the previous layer as inputs. As a final step, the forward and backward representations from each layer are concatenated. A weighted sum of the different layers' representations is performed to create the final contextualized word embeddings. This weighted sum can be learned when using ELMo together with some other model, to favor the representations from some layer over others. The unrolled model architecture is illustrated in Figure 2.1.

Experiments using ELMo showed that the model could be "plugged in" before another downstream task, where the weighted sum of the ELMo embeddings could be learned alongside the other model. The authors found that the lower layer LSTM captures properties more related to syntax, while the higher level layer relates more to context-dependent aspects of word meanings [7]. This means that using the model outputs as input for e.g. part-of-speech tagging could benefit from being able to favor the lower-level LSTM representations in the weighted sum of the layers.
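The weighted sum over layers can be written as a small function. The NumPy sketch below follows the ELMo scalar-mix formulation (softmax-normalized layer weights and a scaling factor gamma); the layer activations here are random stand-ins, not real ELMo outputs.

import numpy as np

def elmo_embedding(layer_reps, scalar_weights, gamma=1.0):
    # Weighted sum of per-layer representations for one token.
    # layer_reps: array of shape (num_layers, dim), e.g. the input representation
    # plus the two biLSTM layers' concatenated hidden states.
    # scalar_weights: unnormalized weights, learned jointly with the downstream task.
    s = np.exp(scalar_weights) / np.sum(np.exp(scalar_weights))  # softmax over layers
    return gamma * np.sum(s[:, None] * layer_reps, axis=0)

# Three "layers" of dimension 1024, random stand-ins for real activations.
layers = np.random.randn(3, 1024)
weights = np.zeros(3)  # equal weighting before any task-specific training
print(elmo_embedding(layers, weights).shape)  # (1024,)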


Figure 2.1: Simplified illustration of the two bidirectional LSTM layers in ELMo with an input sequence that starts with "This is ...". The hidden word representations in layer 1 and layer 2 are concatenated from the respective layer's forward and backward LSTM. The final ELMo embeddings for each word are a weighted sum of the hidden representations and the input representations.

Now one might ask "why can't we just add two bidirectional LSTM layers before our model, and input some GloVe or Word2Vec vectors?", because this is indeed the basic idea here. However, this would mean having to learn the LSTM weights from the potentially small dataset that is used in the specific task (such as the case in this thesis). The real power of ELMo comes from having pre-trained on a massive dataset², to enable learned "contextualization" and syntactic structures of texts to be extracted and utilized in some other task like text classification.

Results using ELMo have been very successful, but one downside of the model is that the two representations from the forward and backward language model tasks each see only one of the directions. Although these two representations are combined through concatenation, we can imagine that it would be better if we could jointly condition on both the forward and backward contexts simultaneously while training. However, if we would try to look both behind and ahead at the same time, the target word would be able to indirectly "see itself", which makes training impossible. This is one of the issues that BERT is able to overcome by the use of a masked language model and self-attention mechanisms.

2.4 Sequence to sequence models

The foundation of BERT comes from the developments in the sequence to sequence model architecture [29], which can be used any time you have a task that takes a sequence as input and returns a sequence as output. One example of such a task is neural machine translation, where the input would be a sequence of tokens in a source language, and the output a sequence of tokens in a target language. In this scenario, an encoder-decoder architecture [30] could be used, where an encoder would encode the words of the source language, and the decoder would decode the encoding to a representation in the target language. Using LSTM, an example of an unrolled encoder-decoder structure can be seen in Figures 2.2 and 2.3.


Figure 2.2: Illustration of an encoder using the last hidden state as an encoded representation of the sentence (i.e. e = e_2).


Figure 2.3: Illustration of a decoder. The <sos> (start-of-sentence) representation together with the encoder output is input to the LSTM at the first time step. The decoding of the sentence stops when the <eos> (end-of-sentence) word is predicted.

In Figure 2.2 the encoder representation that is fed to the decoder is a fixed-size vector, which is actually the last hidden state output by the LSTM (i.e. e = e_2). In the decoder seen in Figure 2.3, the function g : R^{hidden} → R^{vocab} could be implemented as a feed-forward neural network that takes a vector of the hidden size and outputs a vector of the vocabulary size. This vector is normalized by a softmax function where the highest probability output in the vector is the index of the selected translated word from the target language vocabulary, i.e. using a vocabulary V and a vector z the softmax probability at index i becomes:

\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{|V|} \exp(z_j)}    (2.13)
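In code, equation 2.13 amounts to a few lines; the NumPy sketch below also subtracts the maximum score first, which does not change the result but avoids numerical overflow.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # e.g. unnormalized scores over a tiny vocabulary
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to 1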

The vector representation of the translated word, together with the hidden state from the LSTM at the first time step, is input at the next time step and the process is repeated.

We can imagine that using a fixed-size vector representation for the complete encoded sequence could mean that some information is lost, especially when dealing with long sequences [31]. The use of an attention mechanism handles this problem by letting the decoder "pay attention" to the hidden representations from the encoder when determining how to translate the next word [31]. Intuitively, this can be seen as asking "which words should I pay attention to in the original language when I'm trying to translate the next word?". The idea is that allowing this look-behind into all the hidden encoder representations is better than just trying to decode the whole sequence from the last encoder output.

In the example from the encoder and decoder in Figures 2.2 and 2.3, the decoder would calculate attention scores for each hidden representation in the encoder. The function for doing this can be implemented in different ways, such as a collection of dot products or feed-forward single-layer neural networks. All the attention scores then get normalized by the softmax function, and the decoder input is computed as a weighted sum of the attention scores with the hidden encoder representations. The input to the decoder LSTM then becomes the weighted sum of the encoder representations, and the output from the previous time step. In many cases the attention scores are close to 1 for a single word in the encoder, and 0 for all others, which intuitively would mean that the context used for translating that particular word is mostly the word itself [31].

2.5 The Transformer

The 2017 paper "Attention is all you need" by Vaswani et al. [8] showed that it is actually possible to create an encoder-decoder architecture using solely attention mechanisms, completely dispensing with any recurrence using LSTM. This model is called The Transformer and its encoder part is the foundation for the design of BERT.

The attention mechanism can be described as a function mapping a query, key, and value to an output [8], where the query, key, value, and output are all vectors. In our previous examples using LSTM, the query would be the hidden state of the decoder and the keys would be all the hidden states of the encoder. The attention scores between the query and all the keys would be calculated and normalized, and this would be used to create a linear combination of the encoder hidden states, which would become the values. Intuitively, the attention score can be viewed as calculating the similarity between the query and the keys, to find out what you should "pay attention to" when decoding a particular item in a sequence.

In The Transformer, a self-attention mechanism is used to calculate the scores from the queries and keys to generate a weighted sum of the values. This means that the attention mechanism works on the input itself in the encoder, before any decoding operations take place. Furthermore, the self-attention mechanism is performed multiple times in each layer of the encoder, and this is called multi-head attention.

As an example we can take the bottom layer of the encoder, where the embeddings for each token in a sequence are input as rows of vectors in a matrix M. For each attention head in a layer, a different linear transformation is performed on the vectors of this matrix to map the input to the "query", "key", and "value" spaces. The queries and keys will both be mapped to dimension d_k, and the values to dimension d_v. The queries, keys, and values that are used at attention head i are obtained as:

Q_i = W_i^Q M    (2.14)

K_i = W_i^K M    (2.15)

V_i = W_i^V M    (2.16)

The similarity of the queries and the keys is then calculated as a dot product which is scaled by the square root of the query/key dimension d_k. The resulting similarity scores are then input to a softmax function. The result of that function is used to make a weighted sum of the values. For attention head i the function then becomes:

\mathrm{attention}_i(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i    (2.17)

The output from each of the h attention heads is concatenated and a final linear transformation by W^O is performed:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{attention}_1(Q_1, K_1, V_1), ..., \mathrm{attention}_h(Q_h, K_h, V_h)) W^O    (2.18)

The resulting matrix from the multi-head attention is then input to a fully connected single layer feed-forward network using ReLU activation. For a vector x in the rows of the matrix output from MultiHead(Q, K, V), the network outputs:

\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2    (2.19)

The vectors output from this network form the matrix that is input to the next layer in the encoder, and the whole process is repeated. In The Transformer there are a total of N = 6 identical encoder layers, where each has h = 8 attention heads. There are also some additional details of the model not covered here, such as an added positional encoding to the input and residual layer connections (see [8]). An illustration of a single encoder layer is shown in Figure 2.4.

To summarize, the parameters that are learned (per layer) are the transformation matrix W^O, the three transformation matrices W_i^Q, W_i^K, W_i^V for each attention head i, and the weights for the feed-forward networks between each layer. The weights for these networks are different between different layers, but the same for each input vector in the same layer [8].
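To tie equations 2.14-2.19 together, the following NumPy sketch computes one simplified encoder layer (multi-head self-attention followed by the position-wise feed-forward network) on a toy input. Layer normalization, residual connections, and positional encodings are left out, and all weight matrices are random stand-ins rather than trained parameters.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(M, h=8, d_model=64, d_ff=256):
    # One simplified encoder layer on an input M of shape (seq_len, d_model).
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Equations 2.14-2.16: per-head projections to query/key/value spaces.
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
        Q, K, V = M @ W_q, M @ W_k, M @ W_v
        # Equation 2.17: scaled dot-product attention.
        scores = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(scores @ V)
    # Equation 2.18: concatenate the heads and apply the output projection W_O.
    W_o = np.random.randn(h * d_k, d_model)
    attended = np.concatenate(heads, axis=-1) @ W_o
    # Equation 2.19: position-wise feed-forward network with ReLU activation.
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
    return np.maximum(0, attended @ W1 + b1) @ W2 + b2

tokens = np.random.randn(5, 64)     # five token embeddings of dimension 64
print(encoder_layer(tokens).shape)  # (5, 64)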

The decoder part of the model is also made up of self-attention steps, but as this is not used in this work we can stick to focusing on the encoder. The intuition behind self-attention is that, much like in ELMo, a hidden representation at each layer learns to perform different tasks [8]. The authors note additional benefits of dispensing with recurrent layers: the total computational complexity per layer is lower, and the amount of computation that can be parallelized is higher. Perhaps most important is the improved ability to capture long-distance dependencies in the sequences, due to the fact that the hidden states don't have to be "remembered" through the sequence in the same way as in a recurrent network. Additionally, it is possible to interpret where the attention score distributions are focused, which makes it easier to reason about what is actually going on in the different layers. Some examples of this can be seen in Figures 2.5 and 2.6, where the self-attention scores from BERT on STE module text are shown using a visualization tool³ for transformer models [32]. The authors of The Transformer found that different attention heads clearly learn to perform different tasks, where some are more related to syntactic structure, and others to semantic meaning [8].

For training The Transformer a dataset of 4.5M English-German sentence pairs was used. The English sentence is encoded, and the model tries to decode the sequence to a representation in German. The model was trained for 12 hours on 8 NVIDIA P100 GPUs.


Figure 2.4: One layer of the encoder in The Transformer. The left part shows the computations of a single attention-head, which are performed h= 8 times for a single encoder layer.


Figure 2.5: Visualized attention scores [32] in BERT for all tokens. The left side shows scores from attention head 8 in encoder layer 7, which seems to focus on tokens separating clauses in the text. The right side shows attention head 1 from layer 2, where the attention for every word is focused on the next word.


2.6 BERT

The architecture of BERT is almost identical to the encoder part of The Transformer, with the only difference being an increased number of layers and attention-heads. The BERT model comes in two sizes, BERT_base and BERT_large, where the base model was used in this thesis. This model uses 12 encoder layers, with 12 attention heads per layer, and hidden representations (and input token embeddings) of size 768 [9]. In the original transformer architecture [8] the model is trained on a sequence transduction task, such as translating a sequence of words from one language to another. In BERT, the objective is instead a language modeling task, like previously seen in ELMo, Word2Vec, and others. But instead of predicting the next word given a sequence of previous words in either the forward or backward direction, BERT uses an approach called a masked language model, which is also known as a Cloze task [33]. This allows for conditioning on both the forward and backward context simultaneously, instead of doing the tasks separately and concatenating the results.

2.6.1 Pre-training procedure

The pre-training procedure consists of two tasks: the masked language model and next sentence prediction.

2.6.1.1 Masked language model

The language model objective used in BERT is simply to mask 15% of the tokens in the input with a [MASK] token, where the vectors corresponding to this token are fed to a softmax layer the size of the vocabulary [9]. This in itself allows for training the model on a bidirectional context. However, the authors note that there are two downsides to this approach. The first is that when the model is later fine-tuned, the [MASK] token is never present, since we input complete sequences of tokens without masking. The second is that only 15% of the tokens are predicted in each batch, which means that more pre-training steps are needed for the model to converge [9]. To mitigate the first downside, the masked token is not always replaced with [MASK], but is instead chosen with the following procedure:

• 80% of the time, the masked token is replaced with [MASK]

• 10% of the time, the masked token is replaced with a random token (i.e. some other word, different from the actual word being masked)

• 10% of the time, the token is kept unchanged

The encoder does not know which words it will have to predict a representation for, and which have been replaced by random words. By this procedure the encoder is forced to keep a contextualized representation for every input token, regardless of whether it is marked as [MASK] or not. This allows us to obtain a contextualized representation of every word when fine-tuning, even when none of the words are marked as [MASK]. The authors note that since the random replacement only occurs for 1.5% of all tokens (10% of 15%), it does not seem to harm the model’s language understanding capability [9].
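A minimal Python sketch of the 80/10/10 masking procedure described above is given below; the example tokens and the small vocabulary are illustrative assumptions.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return the masked token list and the (position, original token) targets."""
    masked, targets = list(tokens), []
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:        # select roughly 15% of the tokens
            targets.append((i, token))
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                      # 10%: replace with a random token
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, targets

tokens = "the man went to the store".split()
vocab = ["penguin", "store", "milk", "drive", "product"]
print(mask_tokens(tokens, vocab))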

2.6.1.2 Next sentence prediction

The second pre-training task is to predict whether a second sentence actually follows a first sentence. Downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI) require an understanding of relationships between two sentences, which is not captured by the masked language model alone. The pre-training of BERT therefore uses pairs of sentences A and B, where sentence B is the actual next sentence 50% of the time, and a random sentence from the training corpus 50% of the time. An example from the BERT paper [9] is as follows:



Input = [CLS] the man went to [MASK] store [SEP]

he bought a gallon [MASK] milk [SEP]

Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP]

penguins are flight ##less birds [SEP]

Label = NotNext

The [SEP] token is used to indicate the boundary between two sentences, and the [CLS] token is included at the start of every sequence. When pre-training, the [CLS] token is used for “next sentence prediction”, but it can be used for any text classification task when later fine-tuning the model.
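How a sentence pair is packed with the [CLS] and [SEP] tokens can be illustrated with a short sketch; this assumes the Hugging Face transformers library and its BertTokenizer, which is not something the thesis prescribes.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer.encode_plus("the man went to the store",
                                "he bought a gallon of milk")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected output along the lines of:
# ['[CLS]', 'the', 'man', 'went', 'to', 'the', 'store', '[SEP]',
#  'he', 'bought', 'a', 'gallon', 'of', 'milk', '[SEP]']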

The pre-training of BERT was done on a concatenation of BookCorpus (800M words) [34] and text passages from English Wikipedia (2,500M words). The model was trained for 4 days using 16 TPU chips (Tensor Processing Units).

2.6.2 Fine-tuning procedure

The fine-tuning of BERT is different depending on the type of task that is performed. For this thesis, a classification task is performed by using the [CLS] token that is included in the start of every sequence. The vector for the [CLS] token is used as an input to a single layer feed forward network followed by a softmax layer over the number of classes. When pre-training, this vector is learned on the “next sentence prediction” task by using attention over both the first and second input sentences. When the model is fine-tuned the representation for this vector is instead learned according to the new objective function, which is the new classification task. Since there are no second sentences in the classification task, only a single sequence is input, tokenized, and ended by the [SEP] token.
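As a concrete illustration of the classification head described above, the following is a minimal PyTorch sketch where the 768-dimensional [CLS] vector is fed through a single feed-forward layer followed by a softmax over the labels; the number of labels and the batch of random vectors are illustrative assumptions.

import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Single feed-forward layer plus softmax on top of the [CLS] vector."""
    def __init__(self, hidden_size=768, num_labels=5):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_vector):
        logits = self.classifier(cls_vector)      # (batch, num_labels)
        return torch.softmax(logits, dim=-1)      # class probabilities

head = ClsHead()
cls_vectors = torch.randn(2, 768)                 # [CLS] vectors for a batch of two texts
print(head(cls_vectors).shape)                    # torch.Size([2, 5])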

Another important thing to note is that there is a hard limit of 512 tokens in sequences that can be input to BERT. Any text longer than this limit will be truncated.

2.6.3 WordPiece tokenization

The tokenization process in BERT first applies a basic tokenization that splits words and special characters into tokens. This is followed by a technique for handling rare words called WordPiece tokenization [13]. WordPiece has a learned vocabulary which contains not only complete words but also single characters and sub-words, such as word endings. These are learned based on frequency in English Wikipedia, and the embedding of each token is randomly initialized.

When a complete word is encountered during tokenization that is not part of the learned WordPiece vocabulary, a greedy longest-match-first algorithm is used to find partial matches (sub-words) in the vocabulary. The resulting tokens from WordPiece tokenization in BERT can be seen in an excerpt from an STE module text:

before you pull the product rear ##ward , di ##sen ##ga ##ge the drive and push the product forward

The words rearward and disengage did not exist in the vocabulary, so each word was split into the largest sub-word tokens that were found in the vocabulary. The ## characters are inserted to indicate where the original word was split.
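This splitting behaviour can be reproduced with a short sketch; it assumes the Hugging Face transformers library and its BertTokenizer, which is not necessarily the tooling used in the thesis.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "before you pull the product rearward, disengage the drive"
print(tokenizer.tokenize(text))
# Expected output along the lines of:
# ['before', 'you', 'pull', 'the', 'product', 'rear', '##ward', ',',
#  'di', '##sen', '##ga', '##ge', 'the', 'drive']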

If WordPiece tokenization was disabled, the same module text excerpt would instead become:



before you pull the product [UNK] , [UNK] the drive and push the product forward

Here the words not found in the vocabulary are instead replaced with [UNK]. Notice also that the punctuation marks are kept as tokens instead of being removed in a pre-processing step.

The attention scores for the sub-word ##ge from a WordPiece tokenization of disengage can be seen in Figure 2.6. When looking at the figure, it is tempting to “anthropomorphize” the model and say that it is resolving the meaning of the complete word by paying attention to the other sub-words. Even if this is perhaps misinterpreting what the model is doing in the particular attention heads visualized, it is the idea behind splitting rare words into sub-words.

Figure 2.6: Visualized attention scores [32] in BERT for the token ##ge. The left side shows scores from attention head 1 in encoder layer 2. The right side shows attention head 5 from layer 6.

2.7 Document embeddings from transformers and LSTM

To get document-level embeddings from transformers or recurrent neural networks using LSTM, there must be some way to either pool token embeddings together, or let the model learn such a representation. In the case of The Transformer, each sequence is represented as a matrix of token embeddings. The encoder in BERT uses the [CLS] token for classification, either to classify on the next sentence prediction task when pre-training, or to predict module labels when fine-tuning the model to the task of this thesis. The [CLS] token then becomes a representation for the document fit to that specific task. But this does not necessarily mean that the same token can be used for any arbitrary task — such as clustering of duplicate modules of STE text.
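One common way to pool token embeddings into a single document vector is to average them; the following is a minimal NumPy sketch of such mean-pooling, given as an illustration under stated assumptions rather than the method used in this thesis (the embedding sizes, mask, and random values are made up).

import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, ignoring padding positions."""
    mask = attention_mask[:, None].astype(float)       # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)     # sum over the real tokens
    return summed / mask.sum()                         # divide by number of real tokens

tokens = np.random.rand(6, 768)                        # 6 token vectors of size 768
mask = np.array([1, 1, 1, 1, 0, 0])                    # the last two positions are padding
print(mean_pool(tokens, mask).shape)                   # (768,)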

The Universal Sentence Encoder [10] trains the encoder part of a transformer model for learning sentence embeddings by conditioning on downstream tasks. Likewise, the InferSent model [35] uses bi-directional LSTMs that take as input either GloVe [21] or fastText [22] embeddings to learn sentence representations from the Stanford Natural Language Inference datasets. The advantages of transfer learning can be used since both these pre-trained models are publicly available4,5.

2.8 Cosine similarity classifier

For two vectors v and z, both of size N, the cosine similarity between them is:

\[
\text{cosine\_similarity}(v, z) = \frac{\sum_{i=1}^{N} v_i z_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} z_i^2}} \tag{2.20}
\]

The cosine similarity of two identical vectors is 1, of two orthogonal vectors 0, and of two opposite vectors -1. To implement a cosine similarity classifier, we simply compare the similarity of a target vector to every other vector in the training set. The class of the most similar vector can be used as the predicted class. Another strategy is to use the most common class among the k most similar vectors in the training set. In other words, the cosine similarity classifier is the well-known k-nearest neighbor algorithm using cosine similarity as the distance measure.

The earlier work on classification and clustering of STE modules makes extensive use of cosine similarity for classification and duplicate detection [1][16][15][14].
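A minimal scikit-learn sketch of such a classifier, combining tf-idf vectors with a cosine-based k-nearest neighbor search, is shown below; the training texts and labels are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["remove the cover", "install the cover", "drain the oil"]
train_labels = ["removal", "installation", "draining"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# metric="cosine" makes the neighbor search equivalent to the classifier above
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["remove the oil cover"])
print(knn.predict(X_test))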

2.9 Support vector machines

A support vector machine (SVM) finds the optimal separating hyperplane between two classes. The linear SVM measures distances between observations directly in the input space. A non-linear SVM instead uses a kernel function to map from the input space to a higher-dimensional feature space, where it is sometimes easier to separate the classes. The increased dimensionality is however more computationally expensive, and not necessarily more effective. For the already high-dimensional inputs of document vectors created using term-frequency weight schemes, a linear kernel is preferred [36].

The hyperplane in the linear SVM is selected so as to maximize the minimum margin between two classes, meaning that the margin between the two observations that are closest to the hyperplane from either class is maximized. However, this margin is “soft” in the sense that we can modify a regularization parameter C, which allows misclassifying certain points close to the margin in order to get a better separation between the classes.

In binary classification with N samples of training data (y_i, x_i), i = 1, 2, ..., N, y_i is a label equal to ±1 and x_i is a vector of features. Using the L2 hinge loss function, the linear SVM solves the following optimization problem for the model defined by w [36]:

\[
\min_{w} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \max\!\left(0,\, 1 - y_i w^{T} x_i\right)^{2} \tag{2.21}
\]

Smaller values for the parameter C will result in a larger margin at the expense of misclassifying observations close to the margin. Conversely, a large value for C will result in a smaller margin. In the multi-class case, a “one-vs-rest” strategy is applied, where for n classes, n individual classifiers are fit to the data.
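A minimal scikit-learn sketch of a linear SVM text classifier on tf-idf features is shown below; the texts, labels, and the value of C are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["remove the cover", "install the cover",
               "drain the oil", "fill the oil"]
train_labels = ["removal", "installation", "draining", "filling"]

# LinearSVC uses the squared hinge loss and a one-vs-rest strategy by default;
# C controls how soft the margin is, as described above.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
clf.fit(train_texts, train_labels)
print(clf.predict(["drain and fill the oil"]))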

SVM has been one of the more popular methods used for text classification [37], and was successfully used in combination with the tf-trr weight scheme described in Section 2.3.2 [17].

4 https://tfhub.dev/google/universal-sentence-encoder/2
5 https://github.com/facebookresearch/InferSent



2.10 Density Based Spatial Clustering of Applications with Noise (DBSCAN)

Clustering algorithms typically require an input parameter k that determines the number of clusters to partition the data into. Well-known algorithms such as k-means create centroids which are the “centers of gravity” of the k clusters. Other algorithms take a hierarchical approach where a dendrogram (a tree structure) splits the data into hierarchical levels. These algorithms don’t require a k parameter, but instead require a stopping criterion that controls when to stop splitting the tree into smaller subtrees.

The DBSCAN algorithm [38] instead uses the notion of density: for each “core point” in a cluster, the neighborhood defined by a distance threshold ϵ has to contain at least MinPts points. Additionally, a “border point” can be included in the cluster if it is within distance ϵ of a core point.

Figure 2.7: The points of two classes.

Consider the two classes in Figure 2.7.

Using a k-means algorithm on this data with Euclidean distance to determine cluster membership would not work at all, since the clusters would be defined by circular regions around the centroids. With DBSCAN, we would start at a random point and check whether its ϵ-neighborhood contains at least MinPts points. If it doesn’t, we visit another random point; if it does, we have found a core point of a cluster and can “keep walking” using the same rules. Border points are also allowed into the cluster according to the previously specified rule. Points that don’t get included in any cluster by these rules are left as unclassified noise. We can then identify these qualities of DBSCAN:

• It doesn’t require us to specify the number of clusters.
• It doesn’t have to cluster every point in the data.
• It can handle clusters of arbitrary shapes.

These three qualities make the algorithm a good candidate for the task of clustering duplicate modules.
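A minimal scikit-learn sketch of DBSCAN over tf-idf vectors with a cosine distance threshold is shown below; the example texts and the value of ϵ (eps) are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

texts = [
    "remove the four screws and lift off the cover",
    "remove the 4 screws and lift off the cover",
    "drain the oil from the gearbox",
]
X = TfidfVectorizer().fit_transform(texts)

# eps is the distance threshold; min_samples corresponds to MinPts
clustering = DBSCAN(eps=0.2, min_samples=2, metric="cosine").fit(X)
print(clustering.labels_)   # e.g. [0, 0, -1]: the near-duplicates form a cluster, noise is -1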

2.10.1 Hierarchical DBSCAN (HDBSCAN)

The HDBSCAN algorithm [5] is an extension of DBSCAN where the requirement of setting an ϵ parameter is removed, requiring only MinPts as a parameter. The algorithm effectively performs several iterations of clustering with DBSCAN using different values of ϵ. This turns DBSCAN into a hierarchical clustering algorithm, where flat clusters [39] can be extracted based on their stability across the different iterations.
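A minimal sketch using the hdbscan Python library is shown below; the library choice, the random document vectors, and the parameter values are assumptions for illustration only.

import numpy as np
import hdbscan

X = np.random.rand(100, 50)              # illustrative dense document vectors

# min_cluster_size sets the smallest allowed cluster;
# min_samples plays the role of MinPts
clusterer = hdbscan.HDBSCAN(min_cluster_size=2, min_samples=2)
labels = clusterer.fit_predict(X)        # -1 marks points left as noise
print(set(labels))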

The HDBSCAN algorithm was used in the previous thesis performed at Etteplan [3]. For more details on this algorithm, interested readers are referred to that thesis [3] and the HDBSCAN paper [5].

References
