
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Computer Science

2018 | LIU-IDA/LITH-EX-A--18/046--SE

Domain similarity metrics for predicting transfer learning performance

Jesper Bäck

Supervisor: Jody Foo
Examiner: Marco Kuhlmann


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Jesper Bäck

Abstract

The lack of training data is a common problem in machine learning. One solution to this problem is to use transfer learning to remove or reduce the requirement of training data. Selecting datasets for transfer learning can however be difficult. As a possible solution, this study proposes the domain similarity metrics document vector distance (DVD) and term frequency-inverse document frequency (TF-IDF) distance. DVD and TF-IDF could aid in selecting datasets for good transfer learning when there is no data from the target domain. The simple metric shared vocabulary is used as a baseline to check whether DVD or TF-IDF can indicate a better choice of fine-tuning dataset. SQuAD is a popular question answering dataset which has been proven useful for pre-training models for transfer learning. The results were therefore measured by pre-training a model on the SQuAD dataset and fine-tuning on a selection of different datasets. The proposed metrics were used to measure the similarity between the datasets to see whether there was a correlation between transfer learning effect and similarity. The results show a clear relation between a small distance according to the DVD metric and good transfer learning. This could prove useful for a target domain without training data: a model could be trained on a big dataset and fine-tuned on a small dataset that is very similar to the target domain. It was also found that even a small amount of training data from the target domain can be used to fine-tune a model pre-trained on another domain, achieving better performance compared to only training on data from the target domain.


Acknowledgments

I would like to thank Consid for giving me this opportunity. They have been incredibly flexible and cooperative throughout the entire project. I would also like to thank my mentor Jody Foo who has dedicated a lot of time towards helping me finish this thesis. I would also like to thank my friends Fredrik and Martin who have been very supportive and helpful throughout this entire project.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Metrics
  1.5 Background
  1.6 Delimitations

2 Theory
  2.1 Natural Language Processing
  2.2 Question Answering
  2.3 Machine Learning
  2.4 Preprocessing
  2.5 Evaluation
  2.6 Related Research

3 Method
  3.1 Development and replicability
  3.2 Candidate datasets
  3.3 Similarity metrics
  3.4 Classifier
  3.5 Dataset normalization
  3.6 Training the model
  3.7 Evaluation

4 Results
  4.1 Dataset similarity
  4.2 Evaluation

5 Discussion
  5.1 Results
  5.2 Method

6 Conclusion


List of Figures

1.1 Example of overfitting. The red and blue dots are datapoints of two different classes. The black line is the desired separation between the classes but instead the model has overfitted and drawn the green line that takes the outliers of each class into consideration. Overfitting by Chabacano [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
1.2 A graph describing the different cases of learning. Transfer learning and domain adaptation by Emilie Morvant [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
1.3 A graph describing how the model will be trained on the different combinations of datasets.
2.1 A simple neural network with 1 input layer of 3 input features, 1 hidden layer of 4 neurons and an output layer of 2 neurons. Colored neural network by Glosser.ca [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons; black text added to original.
2.2 Common definition of the squared error function. E = error, t = correct output value, y = output value of the neural network.
2.3 Definition of the derivative of E w.r.t. the weights of the previous layer. o_j means the output of the previous layer j, net_j is the weighted sum of the j:th (previous) layer in the neural network before it is passed through the activation function p_j.
2.4 X is the input features to the neuron. h is the internal state of the neuron, meaning its weights and bias. O is the output of the neuron. The unfold signifies how the previous states of the neuron are accessible at each step; during backpropagation each state is corrected (BPTT, backpropagation through time). Recurrent neural network unfold by François Deloche [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
2.5 F_t is the forget gate, responsible for filtering out data that does not need to be remembered any more. I_t is the input gate, responsible for saving relevant new input data to the existing memory. O_t is the output gate, responsible for creating the proper output from the cell; the same operation creates the hidden state that will be passed together with the memory vector C to the next iteration of the sequence. X is the new input features for this iteration. C and h represent the data from previous iterations of the same sequence, C being the memory and h being the hidden state. Long Short-Term Memory by François Deloche [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
2.6 A confusion matrix with binary data. Binary confusion matrix by Oritnk [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.
3.1 Example of a question and answer from each of the datasets. The upper row is the answer and the bottom row is a sentence which is labelled as containing the answer to the question. The SQuAD column represents the modified SQuAD dataset for sentence extraction instead of the normal span selection.
3.2 Figure describing the structure of the unmodified bi-att-flow model; this model is described in the Classifier section, numbered from bottom to top. Figure is used with permission. Seo, Minjoon, et al. "Bidirectional attention flow for machine comprehension." arXiv preprint arXiv:1611.01603 (2016).
3.3 The two different ways the model will be trained. 2-step experiments (a) to the left and 3-step experiments (b) to the right.
3.4 Formula for calculating Average Precision. k iterates from 1 to n, where n is the number of retrieved documents. P(k) marks the precision at cut-off point k in the list. rel(k) returns 1 if the k:th document is relevant and 0 otherwise.
3.5 Formula for calculating MAP. Q is the total number of queries.
3.6 Formula for calculating MRR. |Q| is the number of queries. Rank_i means the position of the highest ranked relevant document in the returned set.
4.1 These graphs show how the model's performance on the evaluation data developed over the training steps, with the training steps on the x-axis and the MAP on the y-axis. These are the same training steps that are shown in Tables 4.2, 4.3 and 4.4; the peak of each graph is the performance shown in those tables.
4.2 These graphs show how the model's performance on the evaluation data developed over the training steps, with the training steps on the x-axis and the MAP on the y-axis. These are the same training steps that are shown in Tables 4.2, 4.3 and 4.4; the peak of each graph is the performance shown in those tables.
4.3 These graphs show how the model's performance on the evaluation data developed over the training steps, with the training steps on the x-axis and the MAP on the y-axis. These are the same training steps that are shown in Tables 4.2, 4.3 and 4.4; the peak of each graph is the performance shown in those tables.


List of Tables

3.1 Example of a question, answer and context found in the Daily Mail dataset.
3.2 Example of a SemEval Track 3A question and answers.
3.3 The total number of documents, the ratio between correct and incorrect answers and how many words per document there are, on average, for each dataset. One document is one candidate answer sentence. The difference in data available is large, but since the number of training steps will be fixed in the model this will not affect the end results.
3.4 MRR = (1/2 + 1 + 1/3)/3 = 11/18 = 0.61.
4.1 This table shows the measured distance between the different combinations of datasets included in the study. The shortest distance has been marked in bold and the longest has been underlined for each metric.
4.2 This table shows all combinations of training that end with evaluating on the WikiQA data, i.e. how well the model is able to perform on the WikiQA data when trained on different combinations of datasets. The table also shows the distances between the pairs of datasets; pair one is the first two datasets. Pair 2 is the distance between the second pair of datasets (if one exists, otherwise marked as "-").
4.3 This table shows all combinations of training that end with evaluating on the NewsQA data, i.e. how well the model is able to perform on the NewsQA data when trained on different combinations of datasets. The table also shows the distances between the pairs of datasets; pair one is the first two datasets. Pair 2 is the distance between the second pair of datasets (if one exists, otherwise marked as "-").
4.4 This table shows all combinations of training that end with evaluating on the WikiQA data, i.e. how well the model is able to perform on the WikiQA data when trained on different combinations of datasets. The table also shows the distances between the pairs of datasets; pair one is the first two datasets. Pair 2 is the distance between the second pair of datasets (if one exists, otherwise marked as "-").


1 Introduction

1.1 Motivation

Question Answering (QA) has received increasing amounts of attention in recent years, especially since the 2011 appearance of Watson in the quiz show Jeopardy. The Watson question answering (QA) system participated and defeated two human competitors. Such a QA system used many different technologies including ontologies and often structured databases such as Resource Description Framework (RDF) databases. These technologies require large collections of structured data which can take many months or years to collect. To circumvent this issue there has been an increase in machine learning-based QA models using Natural Language Processing (NLP). A machine learning-based model can in theory operate on any domain of unstructured data, such as a common website or documents written in natural language, to fulfil its task. Machine learning models are more scalable and easier to set up, but as a drawback their performance is dependent on training data. For a model to perform well the training data has to be "good", meaning it is representative of the domain as a whole. There also has to be enough training data for the model to learn from. The data also has to be general enough that the model can generalize its knowledge to data it has not encountered before. Too much training, however, will make the model overfit. When overfitting, the model learns too much from the training data, to the extreme that what it learns is no longer representative of the general domain, see Figure 1.1.

Another common problem with machine learning approaches in most fields is the availability of training data. Unless training data can be retrieved directly from the target domain it is difficult to train a model for a specific task. In some cases very similar data can be found in an existing dataset or domain, which might give sufficient performance. When either the data or the task differs between training and application it is commonly known as transfer learning, see Figure 1.2. This study concerns itself with transductive transfer learning and domain adaptation. The domain will differ between the training and application of the model but the task will stay the same. Domain adaptation will be done by trying to adapt the knowledge from a source domain to a target domain.

Figure 1.1: Example of overfitting. The red and blue dots are datapoints of two different classes. The black line is the desired separation between the classes but instead the model has overfitted and drawn the green line that takes the outliers of each class into consideration. Overfitting by Chabacano [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.

Figure 1.2: A graph describing the different cases of learning. Transfer learning and domain adaptation by Emilie Morvant [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.

In recent years there has been an increase in available datasets for machine learning, but training data does not necessarily translate well from one domain to another. This usually means that one has to find training data specifically applicable to the desired task, or to create one's own training data, which might be difficult, expensive or even impossible. It has however been shown in previous studies that training a model for one domain (the source domain) and then transferring that knowledge to another domain (the target domain) can in fact be very beneficial. In such a case the model would first be trained on data from the source domain and then be trained on the set of data from the target domain to make the model fit the target domain [14][11]. The hope is that this technique allows the model to learn some general knowledge from the first dataset which it can use when it is being trained for the second domain. For example, it is very common to use some sort of transfer learning for image recognition models in domains where there is a lack of labelled images to train on. However, this technique still requires that some training data is available from the target domain. It could potentially save a lot of time if there was a technique or method that did not require any labelled data from the target domain at all.


Figure 1.3: A graph describing how the model will be trained on the different combinations of datasets.

The use of transfer learning is not always beneficial and might sometimes result in worse performance compared to only training on data from the target domain [25], known as negative transfer. Intuitively it may be assumed that two datasets that are very similar should be very compatible for transfer of knowledge. To the extent of the author's knowledge there has so far been little research on how to measure this similarity between domains, domain similarity, at least with regard to how compatible they are for transfer learning. A set of metrics that compare the compatibility between two datasets could be a useful tool for selecting which source domain to use and for determining whether there are compatible datasets available. If there exists a dataset similar enough to the target domain, a company or research team could save significant amounts of time by using the existing dataset instead of developing their own. Therefore this study will explore whether some metrics of similarity can be used to predict the transfer learning effect between a selection of datasets.

1.2 Aim

The problem regarding lack of training data also applies to machine learning-based QA systems. There has only recently been an increase in available datasets such as SQuAD^2 [24], WikiQA^3 [27], NewsQA^4, etc. The aim of this study is therefore to study whether different metrics of similarity, used to compare two datasets or domains, can be used to predict good or bad transfer of knowledge. What aspects of two datasets make them similar and suitable for transfer learning? It will be studied whether knowledge learned from one domain can be transferred to another domain where some aspects of the content might differ. For example, the language used in a Wikipedia article, a business email and a Twitter post can differ significantly in aspects such as length (where a tweet is character-limited), formality and wording. This will be studied by training the open-source bi-att-flow model^5 (Bi-directional Attention Flow) on the SQuAD, WikiQA, NewsQA and MSMarco datasets. Pre-training (first instance of training) will be done on the very large SQuAD dataset. WikiQA, NewsQA and MSMarco will be used for fine-tuning (second instance of training) and evaluation, shown in Figure 1.3. The evaluation is supposed to represent the domain that contains no training data at all, the target domain. The result from evaluation and the measurements collected from the similarity metrics will be compared to see if a correlation between positive transfer learning and textual similarity can be found.

2 https://rajpurkar.github.io/SQuAD-explorer/
3 https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/
4 https://datasets.maluuba.com/NewsQA
5 https://github.com/allenai/bi-att-flow


1.3 Research questions

Transfer learning and training on out-of-domain data has previously proven to achieve results close to, and sometimes even outperforming, training on data from the target domain [11]. Therefore the performance measured from evaluating the models in this study will be compared to see how much performance is gained or lost during out-of-domain training and transfer learning. It will also be tested whether the performance gained or lost correlates with the measured textual similarity between each domain. If so, textual similarity metrics could be useful when deciding which domain or dataset to use for transfer learning.

• To what extent can we predict the potential for transfer learning between two domains based on textual similarity measures?

• To what extent can out-of-domain data be used to gain performance for a target domain in the case that little or no data exists from the target domain?

1.4 Metrics

Intuitively, the wording and how many words two domains share should reflect how similar they are. If many of the same words occur in two domains, the same type of language might be used. The overlap of unique words between datasets should represent a basic metric of similarity and should also be easy to measure, making it a good baseline for domain similarity. Alternatively, document vectors based on embeddings from the co-occurrence of words in a domain have been used to great effect in many areas of NLP [30]. Each domain can be given a position in the embedding vector space by averaging the positions of every word in that domain. The distance between these domain positions should be measurable and should represent a similarity or dissimilarity between the respective domains. Another very common technique for measuring similarity between texts, especially common in search engines [1], is term frequency-inverse document frequency (TF-IDF). TF-IDF can be used to create vectors where weight is put on the words that are the most unique for that domain. Two such vectors would be similar if they contain many words mostly unique to those vectors. The distance between the TF-IDF vectors for two datasets should indicate a similarity or dissimilarity between them.

As a baseline metric, the shared vocabulary will be used. The candidate metrics to compare to the shared vocabulary will be the distance between document vectors and TF-IDF vectors for each dataset respectively. The document vector distance and TF-IDF distance should prove viable metrics for selecting good datasets for fine-tuning in real-world machine learning applications. If proven viable, an organization could collect a set of viable datasets and select the most promising fine-tuning dataset from the collection by using a distance metric. The shortest distance between the fine-tuning set and the target domain should indicate the best potential for transfer learning.

Shared vocabulary

Shared vocabulary means the overlap of unique words between two datasets. For every dataset a vocabulary of every token (word) that occurs will be created. These vocabularies will then be compared to each other to see how many of the tokens occur in both datasets.
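A minimal sketch of how such an overlap could be computed is shown below. The exact computation used in the study is described in the Method chapter, so the Jaccard-style normalization (overlap relative to the union of the two vocabularies) is an assumption here.

```python
def shared_vocabulary(docs_a, docs_b):
    """Fraction of unique tokens that occur in both datasets.

    docs_a, docs_b: iterables of already-tokenized documents (lists of
    strings). Normalizing by the union (Jaccard) is an assumption; the
    overlap could also be normalized by either vocabulary on its own.
    """
    vocab_a = {token for doc in docs_a for token in doc}
    vocab_b = {token for doc in docs_b for token in doc}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

# Toy example
a = [["what", "is", "the", "capital", "of", "france"]]
b = [["what", "is", "the", "longest", "river"]]
print(shared_vocabulary(a, b))  # 3 shared tokens out of 8 unique -> 0.375
```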

TF-IDF

TF-IDF is a metric that emphasises words that occur often in one document but only rarely in any other document. Such words are strong identifiers of that document; if two domains or datasets share many words that only occur in those domains or datasets, they should be similar. A TF-IDF vector for each dataset will be computed, and the distance between each dataset vector should indicate their similarity.
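A minimal sketch of a TF-IDF distance using scikit-learn is given below. Treating each dataset as one concatenated "document" and using cosine distance are assumptions for illustration; the actual computation used in the study is described in the Method chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def tfidf_distance(dataset_a, dataset_b):
    """Cosine distance between the TF-IDF vectors of two datasets.

    Each dataset (a list of sentences) is concatenated into a single
    document, so the IDF weighting is computed over just two documents.
    """
    corpus = [" ".join(dataset_a), " ".join(dataset_b)]
    vectors = TfidfVectorizer().fit_transform(corpus)
    return cosine_distances(vectors[0], vectors[1])[0, 0]

print(tfidf_distance(["what is the capital of france"],
                     ["who wrote the longest novel"]))
```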

Document vector distance

As an alternative metric, document embedding vectors will be used. For this the gensim doc2vec implementation will be used^6, trained on data collected from the datasets used in the study. An average vector over all documents (sentences) in the dataset will be computed, and the distance between the average vectors for each of the datasets will be used as a metric. The closer the vectors are, the more similar the datasets should be and the better suited they should be for transfer learning.
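A minimal sketch with gensim's doc2vec implementation follows. The toy corpora, the training hyperparameters and the use of cosine distance are illustrative assumptions; parameter names follow gensim 4.x, while gensim 3.x (current at the time of the thesis) calls vector_size simply size.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

# Toy corpora; in the study the training data comes from the candidate datasets.
corpus_a = [["what", "is", "the", "capital", "of", "france"]]
corpus_b = [["the", "share", "price", "fell", "sharply", "today"]]

tagged = [TaggedDocument(words, [i])
          for i, words in enumerate(corpus_a + corpus_b)]

# Train a small doc2vec model on all documents from both datasets.
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

def dataset_vector(corpus):
    """Average of the inferred document vectors for one dataset."""
    return np.mean([model.infer_vector(words) for words in corpus], axis=0)

# Cosine distance between the two dataset-level average vectors.
print(cosine(dataset_vector(corpus_a), dataset_vector(corpus_b)))
```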

1.5 Background

In [11] significant improvements were found when using transfer learning. The source domain of the study was the SQuAD [16] dataset and the target domains were the WikiQA [27] dataset and the SemEval 2016 QA task [3]. Pre-training the bi-att-flow model on the SQuAD dataset for span selection and then fine-tuning the model on the WikiQA dataset gave state-of-the-art results on the test data for the WikiQA dataset. This shows great potential in transferring knowledge from one domain to another [11]. In another study transfer learning is used to improve the performance of a neural network used for sequence tagging tasks (POS-tagging). The network is trained on the Penn Treebank dataset and improvements over the state-of-the-art were achieved on several tasks [28].

1.6 Delimitations

Python and Tensorflow are used for development in this study, which is based on pre-existing code from earlier studies. The original study that introduced the bi-att-flow model was published in November of 2016. The machine learning framework Tensorflow^7 was used for this project, and the version of the framework that was current at that time was 0.12. Therefore, even though it is outdated at the time of this study, the same version will be used, to make measurements more comparable between the studies and to save time.

6 https://radimrehurek.com/gensim/models/doc2vec.html
7 https://www.tensorflow.org/

2 Theory

2.1 Natural Language Processing

Natural Language Processing is the research field of the interaction between humans and machines. More specifically, it is a field where computers try to learn and understand the way humans communicate, in all of its complexities. Examples of this can be seen in voice assistants such as Apple's Siri or Google Assistant, which are probably among the most human-like interfaces people interact with today. This field is vast and covers many aspects of human-machine interaction such as voice communication, machine translation, part-of-speech tagging, named entity recognition tagging, syntactic parsing and much more.

2.2 Question Answering

Question answering has been around since the 1960s, but in a different form compared to the machine learning-based models of today. One of the first QA-systems to be developed was the BASEBALL [6] system, developed to be able to answer questions about the current baseball league. This system utilized a database/knowledge base to look up the answer to a question. This type of system can commonly be defined as a closed-domain question answering system, meaning the domain about which questions can be answered is both known and limited. This allows for the use of domain-specific databases, knowledge bases and ontologies. The problem to solve often becomes how to convert an input question into a valid query that will return the correct data from the knowledge base [23].

The alternative to closed-domain is open-domain question answering, where the system should be able to answer a question about almost anything. This means it cannot rely on the same techniques as a closed-domain system. Instead much work has to be put into the information retrieval. Search engines are often used, combined with some question-document ranking algorithm such as TF-IDF and either machine learning or hand-crafted rules to extract answer sentences [2].

In recent years, with the increasing availability of training data, more machine learning-based models for question answering are emerging. Simpler classifiers were often used in prior models for tasks such as question classification, where a question is classified according to what type of answer is expected to the question. Classes often include date, number, name etc. [29]. But there have also been models that try to use machine learning and reading comprehension as the answer extracting module of a QA-system. In the DrQA model [2], a document retriever that fetched Wikipedia articles, coupled with a reading comprehension module that tried to extract the answer to a question from the retrieved document, was used as a complete QA-system.

Embeddings

One common and easy way to represent words in a way that is easy for a computer to understand is the one-hot encoding. This technique builds a vocabulary of all unique words in the dataset and creates a vector, filled with zeros, with the same size as the number of unique words. This will be a very large vector, often thousands of indexes long. Each index represents one of the unique words. Every word can be represented by one vector by flipping that word's corresponding index to the value one instead of zero. This technique works but has flaws. Firstly, these vectors are high dimensional and there has to be a matrix of one vector for every word in the input, causing both significant computational cost and requiring a lot of memory. The second flaw is that all contextual information about these words is lost; there is no information about whether two words are similar, often occur together or mean the same thing. This information about words can be very valuable as it makes differences in wording less significant. A solution to both of these problems can be embeddings, which are denser vectors derived from a co-occurrence matrix [5].
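A small illustration of one-hot encoding as described above; the toy sentence is made up and the vocabulary is built from it alone.

```python
import numpy as np

def one_hot_encode(tokens):
    """Map each token to a vector that is all zeros except for a single
    one at that token's index in the vocabulary."""
    vocab = sorted(set(tokens))
    index = {word: i for i, word in enumerate(vocab)}
    vectors = np.zeros((len(tokens), len(vocab)))
    for row, word in enumerate(tokens):
        vectors[row, index[word]] = 1.0
    return vocab, vectors

vocab, vectors = one_hot_encode(["the", "cat", "sat", "on", "the", "mat"])
print(vocab)          # ['cat', 'mat', 'on', 'sat', 'the']
print(vectors.shape)  # (6, 5): one 5-dimensional vector per token
```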

Word embeddings

Word embeddings aim to keep the contextual information between words in numerical form. This is done by training a model on existing text to let it learn what words often occur in the same context. The context of a specific word, the target word, consists of the words in close proximity to that target word. These words are called context words and can be collected in different patterns, but commonly the n words directly preceding and following the target word are used as context words. Word embeddings can be created in a number of ways; two common ways are either via singular value decomposition (SVD) of the co-occurrence matrix or via a neural language model. To use SVD the co-occurrence matrix first has to be created. This is a matrix with as many rows and as many columns as there are unique words. The rows represent each unique word in the data and the columns are used to count how many times other words occur in the context of the target word corresponding to that row. The problem with the co-occurrence matrix is that it grows very big and very sparse. Since every context word is represented as a column for every target word and most words do not occur in the context of other words, the matrix will be mostly zeros, making it a very inefficient solution. To solve this problem SVD is used. The SVD algorithm splits one big matrix into several smaller matrices which are easier to compute. A modification of the algorithm, truncated SVD, can be used to optimize further. Truncated SVD only keeps the k most significant components, shrinking the matrices even further.
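A minimal sketch of this pipeline is shown below: a co-occurrence matrix is built with a symmetric context window and then reduced with scikit-learn's truncated SVD. The toy sentences, the window size and the number of components are illustrative choices, not values from the thesis.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of 2 words on each side.
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sentence in sentences:
    for pos, target in enumerate(sentence):
        lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                cooc[idx[target], idx[sentence[ctx_pos]]] += 1

# Truncated SVD keeps only the k strongest components of the sparse
# co-occurrence counts, giving dense low-dimensional word vectors.
embeddings = TruncatedSVD(n_components=3).fit_transform(cooc)
print(embeddings.shape)  # (vocabulary size, 3)
```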

The second way to generate word embeddings mentioned is via the use of a neural language model. A neural language model is often a neural network which is trained to classify whether two words form a genuine (word, context) pair or not. This dataset of words and contexts is retrieved in a similar way to how the co-occurrence matrix is built. Positive data samples are generated from the actual words and contexts in the data and negative samples are forged by swapping legitimate context words for non-legitimate ones in the pairs. The classification task itself is not very relevant; the objective of this task is the learned weights of the network. These weights will be similar for words that often occur together and can be used as word-embedding vectors. One of these implementations, the skip-gram model, has been proven to produce the same output as a PMI (pointwise mutual information) matrix [10]. The PMI matrix is a slight modification of the normal co-occurrence matrix where more emphasis is put on contexts where the target word occurs more often. Once extracted, these word embeddings can be read as numerical vectors. Every word in the model maps to the same vector space and similar words will be mapped close to each other. These embeddings have been shown to improve the performance of many tasks in NLP such as sentiment analysis and syntactic parsing [20].
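A minimal sketch of training skip-gram embeddings with negative sampling using gensim; the corpus and hyperparameters are illustrative, and parameter names follow gensim 4.x (older versions use size instead of vector_size).

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram architecture, negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=1, negative=5, epochs=50)

# Words that occur in similar contexts end up close in the vector space.
print(model.wv.most_similar("cat", topn=3))
```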

Character embeddings

Character embeddings are similar to word embeddings in many ways, mapping characters to a vector space representing in what context each character occurs. However, character embeddings are not used for the same purpose as word embeddings. Word embeddings allow subtle meanings of words to be contained within the embedding. For example, for sentiment analysis different words might carry differing amounts of good or bad sentiment. When using word embeddings these sentiments can be learned and conveyed through the combinations of different words. This cannot be done as well with character embeddings, since sentiment is not attached to individual characters. Character embeddings, however, handle misspellings much better, and unknown words/tokens are not as problematic since there is a much smaller set of characters that will occur in a dataset compared to the number of possible words in a language. This also makes character embeddings require much fewer dimensions to represent data, making computations significantly faster and requiring fewer parameters [8].

2.3 Machine Learning

Training a network

All neural networks require training in some form. Often the training data is collected directly from the target domain, but in the case of transfer learning it can also be collected from another, often similar, domain. All of this data is then split into what is commonly known as training, development and test data. The training data is used to train the model. The development data is used during or after training to approximate how well the model is performing on the target task. The development data can be used for tuning different hyperparameters (settings or variables the developer has to set before training the model) between training sessions of the model. Tuning of hyperparameters is done to find the optimal configuration based on the performance on the development data. The test set should only be used when development and training are done, to estimate how well the model will perform in the real world. The model's performance on the test data should be similar to its performance on real-world data, since neither has been seen by the model before.

A technique commonly used to guarantee that the performance of a model does not depend too much on one specific configuration of training, development and test data is cross-validation. When using cross-validation the data is, in multiple iterations, shuffled and split into training, development and test data, and then the normal training and testing is done. This will give as many final results as there were iterations in the cross-validation. The performance of the model in each iteration can be averaged to get a performance that does not only represent one specific set of data, making the results more reliable. However, not all data can be shuffled and randomised, since the order of the data might matter.
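A minimal sketch of k-fold cross-validation using scikit-learn is shown below. The random data and the logistic regression classifier are only stand-ins so the loop is runnable; they are not part of the thesis setup.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# Toy data: 100 samples with 5 features and a binary label.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

scores = []
# shuffle=True randomises the split; as noted above, this is only
# appropriate when the order of the data does not matter.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Averaging over the folds gives a result that does not depend on one
# particular train/test split.
print(np.mean(scores))
```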

Neural Networks

Figure 2.1: A simple neural network with 1 input layer of 3 input features, 1 hidden layer of 4 neurons and an output layer of 2 neurons. Colored neural network by Glosser.ca [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons; black text added to original.

A neural network can intuitively be thought of as a human brain (even though they are scientifically vastly different). When a human perceives a certain pattern some neurons trigger more strongly than others, allowing the brain to interpret what the human senses are perceiving. This allows humans to distinguish a cat from a car. Similarly, when a neural network is given the same image the desire is that a specific set of neurons trigger in a pattern that would indicate it has been given an image of a cat. Providing the neural network with an image actually means converting it into individual numerical values; for an image this could be the grayscale brightness value of each pixel (0-255). If so, there would be one input neuron for each pixel in the image. One value can be imagined as one of the input neurons (circles) in Figure 2.1.

In Figure 2.1 the arrows pointing from the input neurons to each neuron in the hidden layer indicate that those values will be multiplied by the weight vector of that hidden neuron. Any layer between the input layer (input data) and the output layer (output values of the neural network) is called a hidden layer. These weight vectors are the same size as the number of input features (in most cases). The activation of a neuron is calculated by multiplying the feature vector and the current neuron's weight vector and adding the bias value to the resulting dot product. Both the bias and the weight vector are shifted and modified during training. The value received after adding the bias is then sent as input to an activation function. There are several different activation functions that can be used: sigmoid, ReLU and softmax among others. An activation function is often a function that translates its input to an output that is more suitable as input to the next layer. This operation is done for each neuron in the layer.

Once all operations are done, the exact same thing will be done between the hidden layer and the output layer. Note that there can be one or more hidden layers, but the operations will follow the exact same steps. Once at the output layer all of the activations are collected into one output vector; this vector can be interpreted in different ways depending on the desired result. Two output neurons might be suitable for true/false classification, where the strongest (highest) neuron indicates true or false: [0.78, 0.22] = True and [0.12, 0.88] = False. This depends on how the task is formulated.
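A minimal numpy sketch of the forward pass just described, with the layer sizes from Figure 2.1; the random weights, the input values and the choice of sigmoid everywhere are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Shapes follow Figure 2.1: 3 input features, 4 hidden neurons, 2 outputs.
W_hidden, b_hidden = rng.normal(size=(4, 3)), rng.normal(size=4)
W_out, b_out = rng.normal(size=(2, 4)), rng.normal(size=2)

def forward(x):
    # Each neuron: dot product of its weights and the inputs, plus a bias,
    # passed through the activation function.
    hidden = sigmoid(W_hidden @ x + b_hidden)
    return sigmoid(W_out @ hidden + b_out)

output = forward(np.array([0.2, 0.7, 0.1]))
print(output)  # e.g. a vector like [0.78, 0.22] would be read as "True"
```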

Figure 2.2: Common definition of the squared error function. E = error, t = correct output value, y = output value of the neural network.

Figure 2.3: Definition of the derivative of E w.r.t. the weights of the previous layer. o_j means the output of the previous layer j, net_j is the weighted sum of the j:th (previous) layer in the neural network before it is passed through the activation function p_j.

If the neural network was already trained and in actual use, the process would be done at this point. But if the neural network is training, the most vital part of the process happens now. During training the neural network will go through its normal process of multiplying weights with inputs, also called forward propagation, and at the end compare its output to the known answer provided together with the input data. To measure how "wrong" the neural network was for one particular training sample a loss function is used. A loss function can be thought of as a distance measure for how far from the correct answer the output was. This is used to determine how severely the weights and biases in the neural network need to be altered in order to give a more correct output the next time. One common loss function is the squared error function. The common definition of the squared error function can be seen in Figure 2.2.
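Written out, a common single-output form consistent with the caption of Figure 2.2 is given below; the factor 1/2 is a convention that simplifies the derivative, and for several output neurons the terms are summed over the outputs.

```latex
E = \frac{1}{2}(t - y)^2
```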

To know how to alter the weights of each neuron, the gradient descent algorithm is often used. This algorithm tries to alter the weights based on the partial derivative of the squared error, in order to reach a local minimum. This is known as backpropagation. The goal of backpropagation is to determine how sensitive the loss function is to changes in the weights of the previous layer and to correct them in a way that would minimize the loss. Minimizing the loss is done by calculating the derivative of the squared error with respect to the weights of the previous layer. The derivative yields what change in weights and biases would have resulted in the smallest error. The definition can be seen in Figure 2.3.

This means that in order for gradient descent to calculate how the weights of the previous layer affect the squared error, it must first be known how those weights affect the weighted sum of that layer j. Then it must be calculated how the weighted sum affects the output of that layer, and finally how a change in the output of that layer affects the squared error. This would be enough to calculate how a change in the weights of the previous layer affects the loss function, but since the derivative of the weighted sum of the previous layer depends on the output of the layer before that, each of the mentioned derivatives has to be calculated for each layer backwards until the input layer is reached. This is known as the chain rule.
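In the notation of Figure 2.3, one standard way to write this decomposition for a weight w_ij feeding neuron j is shown below; it is reconstructed here from the caption, so the exact indexing is an assumption.

```latex
\frac{\partial E}{\partial w_{ij}}
  = \frac{\partial E}{\partial o_j}
    \cdot \frac{\partial o_j}{\partial net_j}
    \cdot \frac{\partial net_j}{\partial w_{ij}},
\qquad net_j = \sum_i w_{ij}\, o_i,
\qquad o_j = p_j(net_j)
```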

Recurrent Neural Networks

Recurrent neural networks (RNNs) are a modification or improvement of regular neural networks, if applied to the correct type of problem. Generally, RNNs are designed to be used for sequential data, such as music, text or speech. RNNs build further on the regular neural network in that they can unroll past classifications from previous inputs, meaning they can recall what data was encountered previously in a sequence. This is very useful for the types of data just mentioned. Language, for example, is highly dependent on what words occur in what order.

Figure 2.4: X is the input features to the neuron. h is the internal state of the neuron, meaning its weights and bias. O is the output of the neuron. The unfold signifies how the previous states of the neuron are accessible at each step; during backpropagation each state is corrected (BPTT, backpropagation through time). Recurrent neural network unfold by François Deloche [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.

The RNN receives input as features, similarly to a normal neural network, with the only difference being that it also takes past outputs of the sequence (if there are any) as additional input. For example, the last word in a sentence will have the outputs from all previous words in that sentence as input, providing a lot of additional information. A significant problem with the RNN arises during learning, during backpropagation or backpropagation through time (BPTT) as it is called for RNNs: the problem of the vanishing gradient [7]. The vanishing gradient is a problem where the partial derivative shrinks, "vanishes", in value for each additional layer in the network. The problem occurs when multiple small numbers are multiplied, such as the output of some activation functions. A vanishing gradient results in very small, insignificant changes to the weights of the network when multiplied with the partial derivative, hindering the learning of the network.

The vanishing gradient problem occurs in normal feed-forward neural networks as well, but is easier to avoid by not having too many layers. This problem is difficult to avoid for RNNs, where each time step basically adds another layer; this means long sequences of data create very deep network architectures. One can imagine the blue h_t boxes in Figure 2.4 growing once for each word in a sentence. There are different solutions to this problem, but the most prominent one is a modification of the RNN architecture, named LSTM or long short-term memory network [19].

Long Short-term Memory - LSTM

The LSTM is an advancement on the RNN. It has been modified to be more selective with what data it decides to keep for later use. This is done via the use of a couple of separate weights, or gates, all working together. They are called gates due to their function of "gating" what data they let through. Each hidden node in the LSTM contains this set of gates and they are designed to allow each cell (node) to remember its previous states. This enables behaviour such as remembering important information from many iterations (states) ago while forgetting very recent information that might be worthless. Each of these gates has its own set of weights, which work the same as a cell in a normal neural network. These gates are called the forget gate, the input gate and the output gate. The basic goal of each gate is to filter out irrelevant data and to increase emphasis on very relevant data. This is done by outputting a vector, which can be multiplied or added, element-wise, with the input vector. The gates can be identified as F, I and O in Figure 2.5, and following each gate there is either a circled plus or an x, signifying element-wise addition or multiplication respectively. This means the passing vector will be filtered by that gate's output according to the flow of the chart.

Figure 2.5: F_t is the forget gate, responsible for filtering out data that does not need to be remembered any more. I_t is the input gate, responsible for saving relevant new input data to the existing memory. O_t is the output gate, responsible for creating the proper output from the cell; the same operation creates the hidden state that will be passed together with the memory vector C to the next iteration of the sequence. X is the new input features for this iteration. C and h represent the data from previous iterations of the same sequence, C being the memory and h being the hidden state. Long Short-Term Memory by François Deloche [CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.

Forget gate: The forget gate takes as input the hidden state of the previous iteration or time step and the current input data. Based on the input this gate will learn to decide what data is relevant to keep in memory. The vector C in Figure 2.5 can be thought of as the memory. The forget gate outputs a vector containing values between 0 and 1. These values, when multiplied with the memory vector, will emphasize some values in the memory and, most importantly, reduce those that are not required any more.

Input gate: The input gate works much like the forget gate, but instead of deciding what in memory is not required any more, the input gate is designed to emphasize what in the current input is the most important. The output from the input gate will be added element-wise to the memory vector C.

Output gate: The output gate is similar to the neuron of a normal neural network. It takes the input features, multiplies them by its internal weights and runs the result through an activation function, a sigmoid function in most cases. The output gate is concerned with what the sought, correct, output is for this specific input at this point in time. The output of the output gate is a vector that will be element-wise multiplied with the memory vector to create the output of the whole cell and the state (h), which will be available for the next iteration of the sequence.
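A minimal numpy sketch of a single LSTM time step following the standard gate equations that Figure 2.5 depicts; the weight initialisation and dimensions are illustrative, and note that in the standard formulation the output gate is applied to tanh of the memory vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # illustrative sizes

# One weight matrix and bias per gate, plus one for the candidate memory.
W = {g: rng.normal(size=(n_hidden, n_in + n_hidden)) for g in "fioc"}
b = {g: np.zeros(n_hidden) for g in "fioc"}

def lstm_step(x, h_prev, c_prev):
    """One LSTM time step, following the standard gate equations."""
    z = np.concatenate([x, h_prev])         # current input + previous hidden state
    f = sigmoid(W["f"] @ z + b["f"])        # forget gate: what to drop from memory
    i = sigmoid(W["i"] @ z + b["i"])        # input gate: what new data to store
    c_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate new memory content
    c = f * c_prev + i * c_tilde            # updated memory vector C
    o = sigmoid(W["o"] @ z + b["o"])        # output gate
    h = o * np.tanh(c)                      # new hidden state / cell output
    return h, c

h, c = lstm_step(np.array([0.1, 0.5, -0.3]), np.zeros(n_hidden), np.zeros(n_hidden))
print(h.shape, c.shape)  # (4,) (4,)
```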

Transfer Learning

Transfer learning means to use a model trained for one domain in a different domain; see Figure 1.2 for the different types of transfer learning. Transfer learning has been shown to be successful for several tasks such as object and speech recognition [28]. During normal training there is usually a large set of data that has been split into a training, development and test set, making sure that even though the data comes from the same domain the sets are different from each other. In the usual case the source and target domain are the same, but when utilizing transfer learning they are different. The objective when utilizing transfer learning is often to reduce or remove the amount of training data required from the target domain. It is very common that it is not possible to find or create more data. Data from an outside domain will not be as similar to the target domain as actual data from the target domain, so it is fair to assume that performance when training on data from an outside source domain would be worse. But if there is not enough data from the target domain, there are potential performance gains to be had by having additional training data from another domain. Usually pre-training is done first, followed by fine-tuning. Pre-training means training the model on the set of data from the source domain first, with the intent of then fine-tuning the weights on data from the target domain. The intended outcome is that the model will already have learned most of the abstract concepts of its task from the pre-training and will adapt quickly and efficiently to the target domain.

Figure 2.6: A confusion matrix with binary data. Binary confusion matrix by Oritnk [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/4.0)], from Wikimedia Commons.

2.4 Preprocessing

This section describes some methods of preprocessing data. The aim of these methods is either to improve the accuracy of the models or to improve other factors such as training time.

Tokenization

In text mining and lexical analysis in general, tokenization means to split a sentence on certain separators. A separator might be a simple space, colon, comma or any character or sequence of characters, depending on the context. The document will then be split on these separators so that the strings between the separators become tokens. This step is required in order to do further processing on the text. An example of tokenization could be the document 'This is a string, which is about to be tokenized' which would then become ['This', 'is', 'a', 'string', 'which', 'is', 'about', 'to', 'be', 'tokenized'].
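A minimal sketch reproducing the example above; the regular expression used as separator (anything that is not a word character) is only one possible choice.

```python
import re

def tokenize(text):
    """Split a document into tokens on whitespace and punctuation."""
    return [tok for tok in re.split(r"\W+", text) if tok]

print(tokenize("This is a string, which is about to be tokenized"))
# ['This', 'is', 'a', 'string', 'which', 'is', 'about', 'to', 'be', 'tokenized']
```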

2.5 Evaluation

Accuracy

One common metric for measuring the performance of machine learning models is accuracy. Accuracy is the number of correct classifications over all of the classifications made, as can be seen in equation 2.1. TP and TN stand for true positive and true negative, i.e. correct positive and correct negative classifications made. FN and FP stand for false negative and false positive, i.e. incorrect classifications made.

\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \quad (2.1)

For example, a false positive means the data was predicted as positive when the correct answer was negative. Accuracy is a simple but effective metric for performance. There are other metrics, such as recall, that can measure how often a model is correct for a specific class. Such a metric can be useful for example if the data is significantly unbalanced and the model is biased towards classifying data as the majority class. For example, if a dataset consists of 90% false data points and 10% true, the model will achieve an accuracy of 0.9 by always classifying data as false, but that does not necessarily make the model useful in an application.

Recall

\text{Recall} = \frac{TP}{TP + FN} \quad (2.2)

Recall is defined as in equation 2.2. Recall puts a lot of emphasis on not getting any false negatives while also being correct. Recall does not take false positives into account, so falsely classifying something as positive is not considered as problematic. An example of when this is desired is when classifying for a rare disease: falsely classifying someone as not sick, when they are sick, could have significant consequences [4].

Precision

\text{Precision} = \frac{TP}{TP + FP} \quad (2.3)

Precision is instead defined as in equation 2.3, which puts more importance on not getting false positives. This can be desired when false positives are much worse than false negatives. An example of this could be when classifying email as spam: falsely classifying a legitimate mail as spam could lead to a user not getting some important information. Accidentally letting a spam mail through does not have as significant consequences and is therefore less important [4].

F1-score

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (2.4)

If the effect of both recall and precision is desired and the data is unevenly distributed over the classes, then F1-score is the best option. If the distribution over classes is more even, then accuracy and F1-score are very similar [4]. F1-score is described as the harmonic mean of precision and recall. The definition can be seen in equation 2.4.
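The four metrics written out as a small helper computed directly from confusion-matrix counts (compare Figure 2.6); the counts in the example call are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# An illustrative imbalanced case: mostly negative data, a classifier that
# gets most negatives right but misses most positives.
print(classification_metrics(tp=2, fp=5, tn=85, fn=8))
```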

2.6 Related Research

Multiple previous studies show positive effects from using transfer learning. Pre-training on SQuAD and transferring that knowledge to other target domains has on multiple occasions led to state-of-the-art performance [9] [11]. [22] used six different similarity metrics and found a linear relation between the similarity of the source and target domain and the transfer effect between the domains. The transfer effect was measured for the NLP task part-of-speech tagging. Rényi, Variational (L1), Euclidean, Cosine, Kullback-Leibler and Bhattacharyya coefficient similarity metrics were tested. [18] proposes a task-independent Bayesian optimization model for selecting similar data suitable for transfer learning. This model achieves results competitive with the state-of-the-art domain adaptation technique SDAMS [26], not outperforming it but offering a more general model that can be used for more tasks. This Bayesian optimization model also outperforms the existing state-of-the-art domain similarity metric, Jensen-Shannon divergence, for three tasks: sentiment analysis, POS-tagging and parsing. The metrics included in [18] were Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, Cosine similarity, Euclidean distance and Variational distance. These metrics were calculated using three different representations of the data: term distributions, topic distributions and word embeddings. QA was not one of the tasks tested for.

3 Method

3.1 Development and replicability

All code used and developed in this study was written in Python 3.5.2. An older version of Tensorflow (0.12) was used, since it would be too time consuming to update the code to a more current version. The model used in this study is taken directly from another study on transfer learning for question-answering systems [11]. All development was done in Windows 10. Nvidia CUDA 8.0 was used to be compatible with the older Tensorflow 0.12; newer versions of CUDA might also work but that was not tested in this study. cuDNN version 7.1.4 for CUDA 8.0 was also used. Development was done on a laptop with an NVIDIA Quadro M2000M graphics card and an Intel Core i7-6700HQ 2.60GHz processor.

3.2 Candidate datasets

A pre-study was done to decide what datasets to include in this study. Each of the datasets considered for inclusion will be described in the following section, with a general description of the dataset, its content and its structure.

MCTest

MCTest is a multiple choice reading comprehension dataset. It originally consisted of 2000 questions collected from 500 children's stories [17]. There are 4 multiple-choice questions per story. An example of an MCTest question can be seen below.

1. Where did James go after he went to the grocery store?
a) his deck
b) his freezer
c) a fast food restaurant
d) his room


Question: Producer X will not press charges against Jeremy Clarkson, his lawyer says.

Answer: Oisin Tymon

Context:

The BBC producer allegedly struck by Jeremy Clarkson will not press charges against the "Top Gear" host, his lawyer said Friday. Clarkson, who hosted one of the most-watched television shows in the world, was dropped by the BBC Wednesday after an internal investigation by the British broadcaster found he had subjected producer Oisin Tymon "to an unprovoked physical and verbal attack."

Table 3.1: Example of a question, answer and context found in the Daily Mail dataset.

CNN / Daily Mail

This dataset consists of cloze-style questions. Cloze-style questions are segments of text where entities such as the name of a character, the name of a currency or any named object have been removed from the text. The assignment is to fill the blank spaces with the correct words. A larger segment of text, the context, is also given. By reading the context the model should be able to figure out what entity answers the question. An example can be seen in Table 3.1. This dataset is large, since it can be created synthetically by removing entities from sentences. The data is collected from CNN and Daily Mail news articles into two separate datasets. The CNN dataset contains 90,000 training documents, with 4 missing entities per document on average. The Daily Mail dataset contains 196,000 training documents, also with approximately 4 missing entities per document [12].

TREC LiveQA

The Text REtrieval Conference (TREC) is a set of text retrieval assignments following different categories, referred to as tracks. Many of these tracks run for several years in succession, where researchers publish and/or present what their research has been able to accomplish regarding a specific track. For example, at the time of this study there is a Real-Time Summarization track, where the goal of each participating team is to develop a system that can automatically, in real time, construct summaries from social media streams. Another of these tracks is the LiveQA track, where the goal is to find answers to user-created questions in real time. A dataset containing 634 questions (coupled with the questions from the 2015 and 2016 LiveQA tracks) is given to the participants to train their respective models. This dataset, while relevant for QA, is more suitable for a complete QA system and not for span- or sentence-level QA. The reason is that the systems are expected to retrieve the information by themselves, for example by utilizing the web. For this study it is more relevant to use datasets containing a query and a body of text, or a set of sentences, from which to determine which sentence contains an answer to the question.

SQuAD

SQuAD is originally a dataset used for reading comprehension and span selection. The data is collected from Wikipedia articles, and parts of the creation were crowd sourced. Version 1.1 of SQuAD includes 100,000 questions. Questions are coupled with a large snippet of text and, for each question, an interval of indexes which marks one or several locations where the answer to the question can be found in the text. Version 2.0 contains these 100,000 questions along with 50,000 questions to which the answer cannot be found in the text; models that perform well on this data have the potential to translate well into a machine learning question answering system in the future. The SQuAD dataset has become popular partly due to the very large amount of training data available compared to smaller datasets such as WikiQA and SemEval [11]. SQuAD has been included in previous studies and has shown great potential for transfer learning.
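To make this structure concrete, the snippet below sketches the SQuAD v1.1 JSON layout. The field names follow the publicly released format (each answer is stored as its text together with the character offset where it starts in the context), but the article title, context, question and answer shown here are invented for illustration.

import json

# Simplified sketch of the SQuAD v1.1 JSON layout (example content is invented).
# Each paragraph holds a context snippet and a list of questions, and every
# answer is given as the answer text plus the character index where it starts.
squad_example = {
    "version": "1.1",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The quick brown fox jumps over the lazy dog.",
                    "qas": [
                        {
                            "id": "example-question-1",
                            "question": "What does the fox jump over?",
                            "answers": [
                                {"text": "the lazy dog", "answer_start": 31}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

print(json.dumps(squad_example, indent=2))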


WikiQA

WikiQA was created by automatically extracting questions from Bing query logs [27]. Questions were extracted by looking for keywords such as "what" or "how" in sentences, as well as sentences ending with a "?". These questions were combined with the sentences in the summary section of the Wikipedia page corresponding to the Bing query that was made. These sentences are the answer candidates for the query. The labelling of the data was crowd sourced through a web interface. In the end a total of 3,000 questions, 30,000 sentences and 1,400 answers were collected and labelled. This means that not all questions have an answer among the candidate sentences, which can be an important attribute for a model to learn. WikiQA is similar to this study both in structure and in task and has been studied in previous transfer learning studies [11].

SemEval

SemEval is short for the International Workshop on Semantic Evaluation1, which includes several different tasks. When SemEval is mentioned in this study, SemEval 2017 Task 3A - Community Question Answering [3] is being referenced. Task 3 consists of 4 different sub-tasks: A, B, C and D. Task 3A is a question-comment similarity task where the objective is to classify forum comments as relevant or not with regards to a posted question. An example can be seen in table 3.2. The SemEval Task 3A data is created from StackExchange forum questions and the first 10 comments found in each of those forum threads. In total there are 6,400 questions and 40,000 answers. The model has to rank the comments as good, has potential or bad, which means there are 3 classes instead of 2 as in WikiQA, making the classification problem slightly more complicated. The language in this dataset is more colloquial than what can be expected in, for example, SQuAD or NewsQA, which makes it less likely that the transfer effect will be good. The task is similar to the sentence extraction in WikiQA and this dataset was included in other transfer learning studies [11].

NewsQA

NewsQA, much like SQuAD, is a reading comprehension dataset which contains a context (a segment of text) and a set of questions which may or may not be answerable by deriving information from the context [21]. The answer(s) to the questions are marked as integer ranges (start:stop), representing the character or word indexes in the text which exactly match the answer to the question. NewsQA is based on CNN news articles but the questions are crowd sourced and written by human users. The dataset consists of 120,000 question and answer pairs. The dataset was created in three steps:

1. First, one subset of people gets to create a question for an article by only seeing the article's headline and highlight.

2. A second group then gets the questions and the whole article and has to mark which passages within the article answer the question.

3. A third validation group gets the questions and answers to make sure they are valid.

As mentioned, NewsQA shares its task with SQuAD, but the language can be expected to differ, since NewsQA consists of news articles and SQuAD consists of Wikipedia snippets. Due to the similarity in task, NewsQA makes an interesting addition to the study.


Q: Damn...The beach besides the Intercontinental has been closed. The area has been set aside for a housing project. Why can’t Doha; a seaside capital city of an aspiring tourism hub not have at least ONE beach? It was so popular amongst the residents. The same had happened to the one near the old Doha Club. Does anyone know any good beaches near the city limits?

1. Thats totally weak; I would imagine they passed a law to make having fun illegal. (Bad)
2. i saw yesterday...:( i was sooo pissed (Bad)
3. =( (Bad)
4. because you are in qatar; no urban planning... (PotentiallyUseful)
5. We used to play beach volley there. Sad to hear that. (Bad)
6. -sealine -al wakra - al gharriya -inland sea -al khor Just don’t rely that there would be a public beach around doha... everywhere there’s development :) (Good)
7. ...I CANT FORGET THAT PLACE! me and ma auntie used to go there for swimming and hang around..sad to know this story ..huhuhuhuh (Bad)
8. Is that closed too??? (Bad)
9. the only really good beach for swimming in doha was replaced by the Al Sharq spa. The nearest beaches now are outside the reach of anyone without a car. Sad really! (PotentiallyUseful)
10. I went to Al Khor today and ALL the beaches were for families only and they were ALL empty!! Whats the point of that????? (Bad)

Table 3.2: Example of a SemEval Track 3A question and answers. The label for each comment is shown in parentheses.

MS Marco

MS Marco is another reading comprehension and question answering dataset [13], consisting of about 1,000,000 Bing queries from real users and around 180,000 natural language answers as of version 2.1. All questions are collected from anonymized user Bing queries and the passages are collected from real web documents retrieved via Bing. The answers are derived from the documents by humans. This dataset is highly relevant to the study; its structure and task fit the other datasets and the aim of this study.

Summary

After evaluating the candidate datasets it was decided to include the datasets SQuAD, WikiQA, NewsQA and MS Marco. These datasets were either constructed to be used for, or are convertible to, sentence level QA. They also represent different domains, such as news articles and Wikipedia articles, which should help bring some insight into how the language of different domains affects the similarity between datasets as well as their ability to transfer knowledge between each other. Examples of questions and answers can be seen in figure 3.1. Table 3.3 shows the ratio between correct and incorrect answers for the included datasets.

3.3 Similarity metrics

The similarity metrics used to compare datasets are listed below. They are intended to capture high-level similarities between the textual content of each dataset; more complicated measurements, and measurements requiring the data to be labelled, have been avoided. The intention is to make these measurements as easy and accessible to use as possible: no labelled data or modifications should be required.


Figure 3.1: Example of a question and answer from each of the datasets. The upper row is the question and the bottom row is a sentence which is labelled as containing the answer to the question. The SQuAD column represents the modified SQuAD dataset for sentence extraction instead of the normal span selection.

                           squad-class    wikiqa     newsqa       msmarco
Documents                  499,331        130,000    3,970,434    9,078,734
Correct answers            97,900         8,407      748,108      592,034
Incorrect answers          401,431        121,593    3,222,326    8,486,700
Answer ratio               0.2438         0.0691     0.2321       0.0697
Average document length    32             65         23           65

Table 3.3: The total number of documents, the ratio between correct and incorrect answers and how many words per document there are, on average, for each dataset. One document is one candidate answer sentence. The difference in the amount of available data is large, but since the number of training steps in the model will be fixed, this will not affect the end results.

Shared vocabulary

To measure the shared vocabulary, a universal vocabulary of every unique token from every dataset is built. No stemming is done and no stop words are removed. In this universal vocabulary, every word represents one dimension of the shared vocabulary vector. The shared vocabulary vector for each individual dataset is created by, for each word in the universal vocabulary, marking the corresponding dimension in the vector as 1 if the word occurs in that dataset and 0 otherwise. The angular distance between two of these individual dataset vectors should signify their similarity. This should give an indication of whether similar language is used in the datasets.
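As an illustration of this metric, the sketch below builds binary vocabulary vectors for two toy datasets and compares them with the cosine distance described later in this section. The whitespace tokenizer and the example sentences are simplifying assumptions, not the exact preprocessing used in this study.

import numpy as np

def build_vocab(datasets):
    # Universal vocabulary: every unique token from every dataset,
    # with no stemming and no stop word removal.
    vocab = sorted({token for docs in datasets for doc in docs
                    for token in doc.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def shared_vocab_vector(docs, vocab):
    # Binary vector: 1 if the word occurs anywhere in the dataset, otherwise 0.
    vec = np.zeros(len(vocab))
    for doc in docs:
        for token in doc.lower().split():
            vec[vocab[token]] = 1.0
    return vec

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Two toy "datasets" of candidate answer sentences.
dataset_a = ["the cat sat on the mat", "a dog barked at the cat"]
dataset_b = ["the stock market fell sharply", "a dog was seen near the market"]

vocab = build_vocab([dataset_a, dataset_b])
vec_a = shared_vocab_vector(dataset_a, vocab)
vec_b = shared_vocab_vector(dataset_b, vocab)
print(cosine_distance(vec_a, vec_b))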

TF-IDF

The process for measuring similarity with TF-IDF is very similar to shared vocabulary. A universal vocabulary of every word that occurs in any of the datasets is created. A vector filled with zeros is created for each dataset, with the same number of dimensions as the size of the vocabulary. The occurrence of every word in each dataset is counted; for each occurrence the corresponding dimension in the dataset vector is incremented by one. This is similar to the shared vocabulary, but counting occurrences instead of setting only ones and zeros. When this process is done, the inverse document frequency has to be accounted for as well. The inverse document frequency is calculated by taking the logarithm of the total number of datasets divided by the number of datasets in which the current word occurs. The term frequency and inverse document frequency are multiplied for every word to get the TF-IDF value for each word in every dataset. The term frequency increases the weight of words common in a dataset and the inverse document frequency removes weight if that word also occurs in other datasets. Datasets will thus be represented mostly by the words that are common in that dataset and rare or non-existent in other datasets. The definition of TF-IDF can be seen in


equation 3.1, where $\text{tf}_{t,d}$ is the term frequency for word $t$ in document $d$, $N$ is the total number of documents and $\text{df}_t$ is the total number of documents in which word $t$ occurs.

$$\text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \log \frac{N}{\text{df}_t} \qquad (3.1)$$
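As a concrete sketch of equation 3.1 applied at the dataset level (each dataset is treated as one "document" when computing the inverse document frequency), the code below computes TF-IDF vectors for three toy datasets. The whitespace tokenization and the example sentences are again simplifying assumptions.

import math
from collections import Counter

def tfidf_vectors(datasets):
    # Term frequency: raw counts of each token per dataset.
    counts = [Counter(token for doc in docs for token in doc.lower().split())
              for docs in datasets]
    vocab = sorted(set().union(*counts))
    n = len(datasets)
    # Inverse document frequency: df_t is the number of datasets containing word t,
    # so words shared by all datasets receive a weight of zero.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}
    vectors = [[c[t] * math.log(n / df[t]) for t in vocab] for c in counts]
    return vocab, vectors

datasets = [
    ["the cat sat on the mat", "a dog barked at the cat"],
    ["the stock market fell sharply", "a dog was seen near the market"],
    ["what is the capital of france", "paris is the capital of france"],
]
vocab, vecs = tfidf_vectors(datasets)
print(dict(zip(vocab, vecs[0])))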

Averaged document vector

The third domain similarity metric tested in this study is the averaged document vector distance between two datasets. This is measured by training a gensim Doc2Vec model2, collecting the vector for each document in a dataset, averaging the vectors by summing all of the vectors and dividing by the total number of documents, and measuring the distance to the average vector of every other dataset. The parameters used were a window size of 15 and a minimum word count of 10. The rest of the parameters were left at their default values.

The Doc2Vec model is a modification of the word2vec word embedding model. Each document is represented as an array of individual strings (words); a document in this case is one candidate answer sentence, that is, a sentence that does or does not answer the question. The Doc2Vec model is trained on every document in every dataset included in the study. The vector for each document is then retrieved from the model, all of the vectors are summed together and divided by the total number of documents in the dataset. Once the average vector has been calculated, the cosine distance between each pair of datasets is measured and saved. Embeddings are capable of capturing subtle differences in meaning between common words, so this should be a more accurate domain similarity metric. If so, this metric should be able to identify a more suitable source domain than the shared vocabulary metric would.
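A minimal sketch of this procedure with gensim is shown below. The window size of 15 matches the setting stated above, but the vector size, number of epochs and the lowered minimum word count (the study used 10; 1 is used here so that the toy corpus is not filtered out entirely) are assumptions made for the sake of a runnable example, as is the attribute used for retrieving document vectors (model.dv in gensim 4.x, model.docvecs in older versions).

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy datasets: each document is one candidate answer sentence.
datasets = {
    "dataset_a": ["the cat sat on the mat", "a dog barked at the cat"],
    "dataset_b": ["the stock market fell sharply", "a dog was seen near the market"],
}

# Train a single Doc2Vec model on every document from every dataset.
tagged = [TaggedDocument(words=doc.lower().split(), tags=[name + "_" + str(i)])
          for name, docs in datasets.items() for i, doc in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=100, window=15, min_count=1, epochs=20)

def average_vector(name, docs):
    # Sum the vector of every document in the dataset and divide by the count.
    vecs = [model.dv[name + "_" + str(i)] for i in range(len(docs))]  # model.docvecs in gensim 3.x
    return np.mean(vecs, axis=0)

averages = {name: average_vector(name, docs) for name, docs in datasets.items()}

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_distance(averages["dataset_a"], averages["dataset_b"]))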

Cosine similarity and cosine distance

$$\cos(\boldsymbol{x}, \boldsymbol{y}) = \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\|\boldsymbol{x}\| \, \|\boldsymbol{y}\|} \qquad (3.2)$$

$$1 - \cos(\boldsymbol{x}, \boldsymbol{y}) \qquad (3.3)$$

Each of the metrics listed above creates a vector representation of each dataset. To measure the distance between these vectors, the cosine distance is used. Cosine similarity measures the cosine of the angle between two vectors, equation 3.2, meaning more emphasis is put on the orientation of the vectors than on their magnitude. This removes significance from the difference in word count between the datasets, which is important since there is a significant difference in size between the datasets. Instead significance is put on which words occur at all. The cosine similarity returns a value between -1 and 1 when vector values can be positive or negative, and a value between 0 and 1 when vector values are only positive. Since negative vector values can only occur in the document embedding vectors, the document vector similarity ranges from -1 to 1 while the shared vocabulary and TF-IDF similarities range from 0 to 1. The cosine distance is then calculated as 1 minus the cosine similarity, equation 3.3, yielding a distance between 0 and 2 for the document vector distance and between 0 and 1 for the shared vocabulary and TF-IDF distances, where a lower value means the vectors are more similar.
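For completeness, equations 3.2 and 3.3 correspond directly to the cosine distance implemented in scipy, as the small check below illustrates (the example vectors are arbitrary).

import numpy as np
from scipy.spatial.distance import cosine

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 1.0])

# Equation 3.2: cosine similarity.
similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
# Equation 3.3: cosine distance is 1 minus the cosine similarity.
distance = 1.0 - similarity

# scipy.spatial.distance.cosine already returns 1 - cosine similarity.
assert np.isclose(distance, cosine(x, y))
print(similarity, distance)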

3.4 Classifier

This study uses a modified version of the BI-DIRECTIONAL ATTENTION FLOW (bi-att-flow) model3. The modified version4 has changed the output layer of the original model in order

2 https://radimrehurek.com/gensim/models/doc2vec.html
3 https://github.com/allenai/bi-att-flow

