
Linköpings universitet

Master’s thesis, 30 ECTS | Datateknik

2019 | LIU-IDA/LITH-EX-A--19/080--SE

Sentiment analysis and transfer learning using recurrent neural networks

An investigation of the power of transfer learning

Sentimentanalys och överföringslärande med neuronnät (Sentiment analysis and transfer learning with neural networks)

Harald Pettersson

Supervisor: Marco Kuhlmann
Examiner: Arne Jönsson


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

In the field of data mining, transfer learning is the method of transferring knowledge from one domain into another. Using reviews from prisjakt.se, a Swedish price comparison site, and hotels.com, this work investigates how the similarities between domains affect the results of transfer learning when using recurrent neural networks. We test several different domains with different characteristics, e.g. size and lexical similarity. In this work only relatively similar domains were used: the same target function was sought and all reviews were in Swedish. Regardless, the results are conclusive; transfer learning is often beneficial, but it is highly dependent on the features of the domains and how they compare with each other.


Acknowledgments

I would like to thank Findwise for the support and the opportunity to do my work at their office. My family deserves a great deal of gratitude for their patience and help with read-throughs. A very special thanks to my supervisor Marco Kuhlmann for the guidance and patience when the work dragged out. Last but not least, a big thank you to my examiner, Arne Jönsson.

Stockholm, 2019
Harald Pettersson


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Theory
  2.1 Machine learning
  2.2 Artificial Neural networks
  2.3 Representing textual data
  2.4 Evaluation
  2.5 Related articles
3 Method
  3.1 Data sets
  3.2 Frameworks and hardware
  3.3 Evaluation
  3.4 Data preparation
  3.5 Network architecture
  3.6 Domain similarity
  3.7 Transfer learning
4 Results
  4.1 Architecture performances
  4.2 Domain similarity
  4.3 Transfer learning
5 Discussion
  5.1 Results
  5.2 Method
  5.3 The work in a wider context


List of Figures

2.1 A multi-layer perceptron.
2.2 Overfitting.
2.3 Simple recurrent neural network node
2.4 A long-short term memory node.
2.5 Recurrent neural network
2.6 Description of CBOW and skip-gram for Word2vec.
2.7 Description of Doc2vec using DM, distributed memory.
2.8 An example of a ROC curve.
3.1 The distribution and hierarchy of review domains.
3.2 Average length of reviews
3.3 An example review.
3.4 Distribution of rating value of reviews.
4.1 The increase of AucRoc value when pre-training.


List of Tables

2.1 Confusion matrix.
3.1 The ratio of positive and negative reviews per domain.
3.2 The default values of Keras LSTM for the parameters not changed in this thesis.
3.3 The default values for the Gensim Doc2Vec parameters not changed in this thesis.
4.1 Performances of different architectural setups.
4.2 Explanation of architecture headers.
4.3 The pairwise similarity of domains.

1 Introduction

The world today is vastly different from how it looked just a few decades ago. According to the International Telecommunication Union, the number of Internet users increased from 400 million in the year 2000 to 3.2 billion in 2015 [5]. This digitalization has led to a great leap in globalization, since more and more people can communicate within seconds online. It has also made it easier to gather large quantities of information, and using this information is crucial for staying at the competitive top of any market.

The field of neural networks has received more attention in recent years after a long slumber with sparse activity. Letting machines figure out the patterns and structures of data is appealing compared to rule-based solutions for a couple of reasons. First, it is time-consuming to figure out the rules, and it may even be practically impossible without the massive calculations of computers; having personnel working on rule extraction instead of training machines is ineffective, inefficient and expensive. Secondly, the notion of artificial intelligence builds on computers being able to learn with as little intervention from humans as possible. The recent upsurge in neural networks owes much to the increased computational power of computers; especially the focus on computational GPUs from many companies has led to much faster training times.

A common limitation is that many areas and domains lack the data needed to train networks to their full potential. For natural language processing (NLP) tasks, one big cause of this data scarcity is the very uneven distribution of languages in the data. According to the website w3techs.com [24], English was used by 51.5 % of the 10 million top websites as of June 2017. This can be compared to the runner-up on that list, Russian with 6.6 %, or to 0.5 % for Swedish. This is not the language ratio of all data, only of website content, but it is still an indication of how dominant the English language is. So, if an NLP application is in any language other than English, one is most likely going to have to deal with an insufficient amount of data; even in English, many domains do not have enough labeled data. To solve this problem one can use transfer learning, the process of using knowledge from another domain to boost the results in the target domain.

The use cases of transfer learning are many: not only to reach the wanted performance in a specific domain, but also to reduce training time overall. A goal could be to train a baseline network that works for several problems and just fine-tune it for each specific topic. This would reduce the time and money spent, as well as mitigate the need for large quantities of training data. This is somewhat of a utopian fantasy, as in reality the performance of transfer learning is very dependent on the similarity between the source and target domains [22], i.e. between the data the network is pretrained on and the data of the current problem. But how general can a network be? Is there a golden line of similarity at which pretraining is still very effective? This thesis will delve into these questions and try to answer them.

To test transfer learning, the concrete task of sentiment analysis is performed. Sentiment analysis is the task of deciding a text's polarity: whether it is positive, negative or neutral. If a company receives a lot of feedback from customers, the sentiment of the texts can be used as an indication of how important the feedback is; if it is a complaint, maybe something ought to be done urgently. Sentiment analysis could also indicate how well a service is performing over time: if more neutral messages are sent, maybe they are questions, and the information about how to use the service needs to be improved.

1.1 Motivation

Using the large amounts of data available to improve services is not straightforward. Several problems may arise: often the data is not in the correct format; for privacy reasons the data is not available in all contexts; or the computational power is not enough to fully take advantage of the data. To resolve these problems, knowledge from another domain can be transferred into the target domain using transfer learning.

1.1.1 Findwise

Findwise is a consulting company dedicated to improving search, information management and data analysis. This means that they help other companies sort their data into suitable formats, make the information easily attainable through search, and analyse it in order to improve current processes, profit from new possible areas and generally improve their business intelligence. For Findwise, with many different customers and lots of text analysis work, it would be interesting to see whether it is possible to train a general-purpose neural network which can cheaply, in terms of time, computing power and data size, be adapted to different domains.

1.2 Aim

The goal of this thesis is to evaluate recurrent neural networks in regard to their transferability between domains. The issue this could resolve is when a specific solution to a problem, e.g. a classification problem, relies on a data set that is too small or has other issues. The aim is to be able to use a separate data set from another domain to improve the performance.

1.3 Research questions

The thesis will revolve around two main questions, stated below. Answering these is a subgoal that contributes to the overall aim of the thesis.

1. How is a neural network's transferability affected by the similarity of the domains?

2. How is a neural network's transferability affected by the relative and absolute sizes of the domains?

1.4 Delimitations

The main focus of the thesis is transfer learning. This means that finding the absolute best neural network for the task is not included. A sufficiently good network was chosen, and the research was conducted using that one, under the assumption that most networks with a similar architecture would behave about the same under transfer learning.


The work revolves around transfer learning of neural networks in text analysis, but for the purpose of simplicity a more specific subarea has been chosen: sentiment analysis.

The data scope is limited to Swedish, which is a huge limitation since most available data is in English. But this makes solving the transfer learning problem even more desirable, since the data is, by definition, less abundant in the smaller domains. Solving the problem in bigger domains ought to be easier if it is already solved in smaller ones.

2 Theory

2.1 Machine learning

Machine learning (ML) is a subfield of artificial intelligence where computers are taught to identify patterns in raw data. In contrast to using rules to draw conclusions from the data, ML techniques use the underlying structure to learn how to represent the data for a given task.

Supervised learning is a subset of ML where the desired output of the system for the given training data is known. This is used by the machine to reduce its error and, hopefully, to generalize to new, unseen data. As an example, consider pictures of cats and dogs. A computer is given the task of classifying each picture as containing either a cat or a dog, and is fed the raw pixel data of the images. Using supervised techniques, this means that after the machine has seen a picture (or a batch of pictures), it adjusts its parameters so that it becomes better at telling whether those particular pictures show cats or dogs. The idea is that given a sufficiently large number of training samples, and a sufficiently large number of iterations over the samples, the computer can learn to distinguish between cats and dogs.

Unsupervised learning is another method for training machines. It is used for scenarios where there are no known labels for the training samples. In the cats and dogs example, this would mean we don’t know the actual animal in each image, we only have the data. Some algorithms can still distinguish between the animals and cluster the cat pictures separately from the dog pictures.

2.1.1 Transfer learning

The concept of transfer learning (TF), sometimes called domain adaptation, is about transferring knowledge between domains. This is widely used in areas such as image processing, where the early layers of convolutional neural networks (CNN) often end up extracting low-level patterns such as lines and circles regardless of the image input and problem formulation. This makes it possible to pretrain on data from a different domain than the target domain and increase performance.


2.1.2 Definitions

To be consistent with existing work, the same notation as in A survey of transfer learning [33] will be used. A domain $\mathcal{D}$ is defined by a feature vector space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, x_2, \ldots, x_n\} \subseteq \mathcal{X}$. $\mathcal{X}$ is the space of all possible input feature vectors for a specific domain, $X$ is a particular training sample, and $x_i$ is the feature vector of a specific training instance. A task $\mathcal{T}$ is defined by a label space $\mathcal{Y}$ and a trained predictive function $f : \mathcal{X} \mapsto \mathcal{Y}$. $f$ is given vector pairs $(x_i, y_i)$ to train on, i.e. it learns to predict $y_i \in \mathcal{Y}$ from $x_i \in \mathcal{X}$. $\mathcal{Y}$ is the space of all possible output label vectors. A target domain $\mathcal{D}_T$ has a corresponding target data set of labeled instances $D_T = \{(x_{T1}, y_{T1}), (x_{T2}, y_{T2}), \ldots, (x_{Tn}, y_{Tn})\}$, where $x_{Ti} \in \mathcal{X}_T$ and $y_{Ti} \in \mathcal{Y}_T$. Similarly, a source domain $\mathcal{D}_S$ has a data set $D_S$ of pairs of data points and labels.

2.2 Artificial Neural networks

’Neural network’ as a term stems from the 1940s, when McCulloch and Pitts attempted to find mathematical expressions describing the processes in biological neurons [20]. Even though the first attempts revolved around representing biological events, many of the restrictions that follow from this are irrelevant when trying to recognise patterns [3]. Therefore, artificial neural networks have drifted away from the biological neurons. The idea, however, is that a neuron is active when the input value reaches a threshold. An active neuron outputs an active signal, i.e. 1; an inactive neuron outputs 0. This gives a step function:

$$h(x) = \begin{cases} 1, & \text{if } w \cdot x - \beta \ge 0 \\ 0, & \text{otherwise} \end{cases} \quad (2.1)$$

where w is a weight factor, x is the input value and β is the threshold, or bias, of the neuron. These neurons can then be combined into networks where each neuron’s output can be used as input to other neurons, forming different layers.

Artificial neural networks in their simplest form consist of an input layer, hidden layers and an output layer. The number of hidden layers varies.

2.2.1 Feed-forward network

In a feed-forward network, the output from each unit is always passed forward; no directed loops exist. This is the original neural network. The simplest form of neural network, the single-layer perceptron, and its somewhat more advanced form, the multi-layer perceptron (MLP), are both feed-forward networks. An example of an MLP can be seen in figure 2.1. The MLP has an input layer and an output layer; between them are a number of hidden layers. When talking about deep networks, the number of hidden layers is what is referred to as depth. Each of the network's layers works as a feature extractor at a different abstraction level. [18]

Each node introduces a non-linearity into the system, making it possible to solve tasks that linear algorithms cannot. The features extracted by the layers are a new way of representing the data. Often the output layer is smaller, i.e. has fewer nodes, so that the representation is as compact as possible and conclusions can be drawn quickly, for example by a linear classifier.

2.2.2 Activation function

The function which activates the neuron is called the activation function. Each neuron in a network consists of a summation over the weighted input from the previous layer, which is passed through an activation function before being output to the next layer. The activation function introduces a non-linearity and is often chosen to be the rectified linear function, the logistic sigmoid, the hyperbolic tangent or any other function which approximates the behaviour of the step function in equation 2.1; see the example functions in equation 2.4. The activation function is applied elementwise to the vector resulting from the summation, see equation 2.2.


Figure 2.1: A multi-layer perceptron.

$h_l$ is the output vector of the $l$th layer, with the same size as the number of neurons in the layer, and $f$ is the activation function. $W_l$ is the weight matrix of the $l$th layer, with dimension $t \times s$, where $t$ and $s$ are the dimensions of $h_l$ and $h_{l-1}$, respectively. $\beta_l$ is the bias of the layer. The bias can be incorporated into $W$ by extending the input vector to each layer with an extra element set to 1 and concatenating $W_l$ with $\beta_l$. This corresponds to adding a bias neuron with no input and constant output 1 to each hidden layer, and results in equation 2.3.

$$h_l = f(W_l h_{l-1} + \beta_l) \quad (2.2)$$

$$h_l = f(W_l h_{l-1}) \quad (2.3)$$

Without the nonlinearity, the output would always be just a linear transformation of the input, and hence the network would not be the universal approximator it is sought to be. Also, more than one layer would not make sense, since a linear combination of linear functions is itself a linear function and all the layers could be reduced to a single one.

$$\text{Rectified linear function: } f(x) = \max(0, x)$$
$$\text{Logistic sigmoid: } f(x) = \frac{1}{1 + e^{-x}}$$
$$\text{Hyperbolic tangent: } f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
$$\text{Hard sigmoid: } f(x) = \begin{cases} 0, & \text{if } x \le -2.5 \\ 1, & \text{if } x \ge 2.5 \\ 0.2x + 0.5, & \text{otherwise} \end{cases} \quad (2.4)$$

(The hard sigmoid is a simplified sigmoid which is easier to compute. It is a linear function between the points (-2.5, 0) and (2.5, 1), and is 0 when $x < -2.5$ and 1 when $x > 2.5$.)
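As a concrete reference, the four example functions in equation 2.4 can be written in a few lines of NumPy; this is a minimal sketch with illustrative function names:

    import numpy as np

    def relu(x):
        # Rectified linear: max(0, x), applied elementwise.
        return np.maximum(0.0, x)

    def logistic_sigmoid(x):
        # 1 / (1 + e^{-x})
        return 1.0 / (1.0 + np.exp(-x))

    def hyperbolic_tangent(x):
        # (e^{2x} - 1) / (e^{2x} + 1), identical to np.tanh.
        return np.tanh(x)

    def hard_sigmoid(x):
        # Piecewise-linear sigmoid approximation: 0 below -2.5,
        # 1 above 2.5, and 0.2x + 0.5 in between.
        return np.clip(0.2 * x + 0.5, 0.0, 1.0)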

2.2.3 Loss function

The loss function, also called the cost function, is the function that is optimized through training. A frequently used loss function for linear regression is the mean squared error (MSE), see equation 2.5 and its derivative in equation 2.6. For binary classification a typical loss function is


the logarithmic loss, also called binary cross entropy (BCE), see equation 2.7 and its derivative in equation 2.8.

$$C_{MSE}(W) = \frac{1}{2N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2 \quad (2.5)$$

$$\frac{\partial C_{MSE}(W)}{\partial W} = y_j - \hat{y}_j \quad (2.6)$$

$$C_{BCE}(W) = -\frac{1}{N} \sum_{j=1}^{N} \left( y_j \log(\hat{y}_j) + (1 - y_j) \log(1 - \hat{y}_j) \right) \quad (2.7)$$

$$\frac{\partial C_{BCE}(W)}{\partial W} = -\frac{1}{N} \sum_{j=1}^{N} \left( \frac{y_j}{\hat{y}_j} - \frac{1 - y_j}{1 - \hat{y}_j} \right) \frac{\partial \hat{y}_j}{\partial W} \quad (2.8)$$

In these equations, $j$ is the index of a single training sample, $y_j$ is the ground truth value for that sample, and $\hat{y}_j$ is the output, i.e. the predicted value, of the network for training sample $j$ given weights $W$ with incorporated bias.
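A minimal NumPy sketch of the two loss functions in equations 2.5 and 2.7 (the clipping constant eps is an implementation detail, added here only to guard against log(0)):

    import numpy as np

    def mse(y_true, y_pred):
        # Mean squared error, equation 2.5.
        return np.mean((y_true - y_pred) ** 2) / 2.0

    def bce(y_true, y_pred, eps=1e-12):
        # Binary cross entropy, equation 2.7; eps avoids log(0)
        # when a prediction saturates at exactly 0 or 1.
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(y_pred)
                        + (1.0 - y_true) * np.log(1.0 - y_pred))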

2.2.4 Optimization

The core of any successful training of a neural network is the optimizer. This is the algorithm that decides how to update the parameters, i.e. the weights and biases, of the model to minimize the loss function. Best known are the gradient-based algorithms, which calculate the gradient of the loss function with respect to the weights over the training set and take a step in the negative direction. Repeating this results in a lower and lower loss, finally reaching a minimum, local or global. Ideally the gradient is calculated using all training data, taking an infinitesimal step in the negative direction of the gradient and repeating until convergence; methods using all data in that manner are called batch deterministic gradient methods. Infinitesimal steps are of course infeasible, so in practice a hyperparameter, the learning rate $\eta$, is used as the step size. To decrease convergence time even more, the assumption that the data contains some redundancy can be used: instead of calculating the gradient over the whole training set for each weight update, it is calculated over a randomly sampled subset of the training data, a mini-batch, at a time. Methods using mini-batches are called minibatch stochastic gradient descent, or just stochastic gradient descent (SGD). [11]
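A minimal sketch of one epoch of minibatch SGD; grad_fn, a function returning the loss gradient on a given minibatch, and the other names are hypothetical:

    import numpy as np

    def sgd_epoch(weights, grad_fn, data, batch_size=32, lr=0.01):
        # Shuffle the data, then for each minibatch take a step of
        # size lr (the learning rate) against the loss gradient.
        order = np.random.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            weights = weights - lr * grad_fn(weights, batch)
        return weights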

One of the many problems with optimizing neural networks is that the solution space is non-convex. This means that even though a gradient descent method with appropriate parameters is often able to find a minimum, it might be a local minimum. For any neural network with more than one non-input node, there exist several identical local minima. These arise from the fact that any neural network is equivalent to a neural network where two nodes in a non-input layer have swapped weights, meaning there is a minimum at the location in the solution space which corresponds to the swapped weights.

Backpropagation

The means for calculating the gradient when optimizing a neural network is called backpropagation, BP. It is the process of calculating the gradient of the loss function with respect to the weights of the neurons [27]. This requires that both the loss function and the activation function are differentiable, which further restricts the choice of activation function.

2.2.5 Regularization

Most supervised learning algorithms suffer from the problem of overfitting. When a model is learning from training data, it first reduces both the training error and the generalization error, but only to some extent. After some time, the model will start to overfit on the training set and the generalization error starts to increase; the model has become overly complex in fitting the training samples, see figure 2.2. There are several ways to reduce overfitting by regularization; some of these are explained in this section.

Figure 2.2: Illustration of overfitting. The blue line shows a perfect fit for these samples, while the black, simpler line is the one that will generalize best.

Dropout

Dropout is a regularization technique where some nodes in the network are discarded during a training session (usually during a mini-batch). To choose which nodes to drop and which to keep, each node is given a probability $p$ of being kept. This can be seen as training $2^n$ different networks which extensively share weights, thus forcing the nodes to be less co-adapted to each other. During testing, however, averaging over all these $2^n$ networks is not feasible; instead this average is approximated by using the full network but weighting each node's output by $p$, i.e. by its expected value over all these networks. This has shown great promise in many neural networks [29].

Early stopping

A simple technique to reduce overfitting is early stopping. After each epoch during training, the model is evaluated on a part of the data set disjoint from the training and test sets, called the validation set. If the error on the validation set starts to increase, this is a sign that the generalization error is increasing and that the model is starting to overfit to the training set. Different methods of applying early stopping exist [12], but a common way is to allow some patience, i.e. letting some epochs show a worsening validation error before stopping. This is done to account for the validation error not being a perfectly convex function of the epoch.
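In Keras, which is used for the experiments in this thesis, early stopping with patience is available as a callback. A sketch, assuming model and the data arrays already exist (restore_best_weights requires a reasonably recent Keras version):

    from keras.callbacks import EarlyStopping

    # Stop when the validation loss has not improved for 5 epochs,
    # and roll back to the weights from the best epoch.
    early_stop = EarlyStopping(monitor='val_loss', patience=5,
                               restore_best_weights=True)
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              epochs=100, callbacks=[early_stop])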

2.2.6 Recurrent neural network

In contrast to a feed-forward network, a recurrent neural network (RNN) contains loops. These networks are good at processing sequential data, such as text. In theory, they are good at remembering past inputs and associating inputs that are separated in the sequence, but in practice they have had two related problems: vanishing and exploding gradients, as explained by Hochreiter et al. [14]. These problems stem from unstable gradients, due to each layer's gradient being a product of all later layers' gradients; if these do not balance out, the gradients tend to become very small or very large. Well-known RNN unit architectures include the long-short term memory (LSTM) and the gated recurrent unit (GRU). LSTMs deal with the unstable gradients by expanding the previously very simple RNN cell, with only a single non-linearity, into a cell with several gates and a cell state that is passed on over each item in the sequence. The gates decide how much of the cell state is to be forgotten and replaced with new information, thus making it possible to model dependencies between well-separated inputs. The newer GRU cell combines the forget and update gates of the LSTM, leading to a somewhat simpler and more easily trained network.

For a simple RNN, the output and state are calculated as in equation 2.9, which is also shown in figure 2.3.

$$o_t = h_t = \sigma(W \cdot [h_{t-1}, x_t] + \beta) \quad (2.9)$$

Figure 2.3: A simple recurrent neural network node.

For LSTMs the gates, output and cell state are calculated as:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t])$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t])$$
$$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t])$$
$$u_t = i_t \odot \tilde{C}_t$$
$$C_t = f_t \odot C_{t-1} + u_t$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t])$$
$$h_t = o_t \odot \tanh(C_t)$$

where $[h_{t-1}, x_t]$ is a concatenation of two column vectors into a single column vector, $\odot$ is the elementwise product, the $W_x$ are weight matrices including the biases, and $t$ is the index into the input sequence. The forget gate $f_t$ is a combination of the previous output $h_{t-1}$ and the current input $x_t$ passed through a sigmoid function; it chooses how much of the previous cell state $C_{t-1}$ is to be forgotten. $i_t$ is the update gate, which chooses how much of the new information $\tilde{C}_t$ is to be kept and added to the new cell state $C_t$. The output gate $o_t$ decides what parts of the new cell state to send to the output $h_t$; these parts are also limited to $[-1, 1]$ by a hyperbolic tangent function. During training, it is the weight matrices and bias vectors that are learnt. Jozefowicz et al. [15] recommend initializing the forget bias $\beta_f$ to 1 instead of the usual 0.5, since otherwise the first instances in training will cause extensive forgetting.

Recurrent neurons may be combined in a feed-forward manner, where the units send their activations to the next layer while still keeping and updating an internal state. Graves et al. [13] show that such stacked LSTMs can increase the performance of the network. They used them for speech recognition and observed the overall error rate drop from 23.9 % to 18.4 % when going from one to five LSTM layers.


Figure 2.4: A long-short term memory node.

Figure 2.5: A stacked recurrent neural network. Each square is a node, and can, for example, look like the node in figure 2.3 or figure 2.4. The figure also shows the unfolding of an RNN in k time steps.

Backpropagation through time

Backpropagation through time (BPTT) is BP adapted to RNNs. To perform BP on an RNN, the network is unfolded by a specified number of time steps, $k$: for each time step the recurrent unit is duplicated, giving $k$ instances of the same unit. This unfolding is shown in figure 2.5; the output of the last time step, $y_k$, is used as the whole sequence's output. Then normal backpropagation is performed. For recurrent units, the weight update is a summation or average of the gradients calculated for the individual instances.

Recurrent dropout

Regular dropout has not been as successful in RNNs as in other types of neural networks. Regular dropout only drops connections between the layers. Recurrent dropout, however, refers to dropping connections on the states between time steps. In figure 2.5, regular dropout would be applied to the vertical lines, while recurrent dropout would be applied to the horizontal lines. Gal and Ghahramani show that using dropout masks on the recurrent connections leads to performance gains [30].

2.2.7 Transfer learning for neural networks

There are basically two ways of performing transfer learning on neural networks: parameter initialization and multi-task learning. Parameter initialization means that a network is trained on a source domain, $\mathcal{D}_S$, and then fine-tuned on the target domain. During the fine-tuning, the pretrained weights serve as the starting point and are further adjusted using the target domain data.


In contrast, during multi-task learning, training is done by alternating training on source data and training on target data. If the target domain data is scarce, this method can extend the training data set, although at the cost of decreasing the quality for the specific task. One of the downsides of this method is that the network will never be pretrained and hence will always need to be extensively trained for every new task.

One can also look at the different ways to partition and use the available data. Daumé [8] describes three basic ways: Source Only, using only the source data to train the model; Target Only, using only target data; and All, using the union of source and target data. Watanabe et al. [32] extend this list with FineTune, as explained above, and Dual, which shares parameters for all layers except the output layer, where source and target have their own sets of weights. The Dual method is frequently used in multi-task learning.

2.3 Representing textual data

When working with data, the data rarely comes in the form necessary to process it. To be able to pass the reviews to the network, their sentences need to be encoded into some vector embedding. This can be done in several different ways, where some techniques try to capture the meaning of the words and sentences. If a more sophisticated embedding method is used, this can in itself be considered transfer learning (see section 2.1.1).

A deep neural network can be seen as multiple feature extractors combined, each layer outputting a new representation of the data. The idea is that each layer is a level of abstraction, with the end result being a more task-specific representation of the input. Consider, for example, the difference in abstraction between describing a picture by its pixel values or by the objects it contains: both are correct, but they are at opposite ends of the abstraction spectrum. For text this means that the first layer or layers extract more basic information, such as semantics and word similarity, while subsequent layers combine the words into sentences and paragraphs with meaning, and the final layers are more related to the specific task of the whole network, i.e. classification.

2.3.1 One-hot vector

A one-hot vector embedding is a simple technique which assigns each word a $k$-dimensional vector with all elements zero except one, which is set to one; $k$ is the number of unique words in the corpus, plus a special vector to represent all words not in the corpus. Which vector is assigned to which word is arbitrary, so this representation disregards any context of words: it does not capture the semantic meaning of the words, and the L2 distance between any two words is the same regardless of how related they are [11]. If this embedding is used, the network's earlier layers probably need to learn some of the semantic meanings of words.
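A minimal sketch of such an embedding, with the last index reserved for words not in the corpus; the names are illustrative:

    def one_hot(word, vocab):
        # vocab maps each known word to a unique index; index
        # len(vocab) represents all words not in the corpus.
        vec = [0.0] * (len(vocab) + 1)
        vec[vocab.get(word, len(vocab))] = 1.0
        return vec

    vocab = {'bra': 0, 'dålig': 1}   # toy two-word vocabulary
    one_hot('bra', vocab)            # [1.0, 0.0, 0.0]
    one_hot('okänd', vocab)          # [0.0, 0.0, 1.0], unknown word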

2.3.2 Word2vec

Word2vec is a probability-based word representation approach proposed by Mikolov et al. [21] in 2013. They used two different architectures, continuous bag-of-words (CBOW) and continuous skip-gram. One result of these methods was that simple algebraic operations on the resulting word vectors worked: for example, the calculation vector(”king”) - vector(”man”) + vector(”woman”) resulted in a vector closest to the vector of ”queen”. This means that syntactic and semantic information is stored in the vectors. Compared to one-hot vectors, Word2vec vectors could therefore be better input for e.g. a neural network. The vectors are also more compressed, with fewer values, meaning a smaller input layer, which reduces the complexity of the network.

Both CBOW and skip-gram are neural networks predicting words from context over a text corpus. Each consists of a single hidden layer with linear activation and an output layer using softmax, a non-linear activation. The output of the network is a multinomial distribution of word predictions. The actual word vectors are extracted from the hidden layer: the weight matrix $W_h$ of the hidden layer corresponds to a lookup table of word vectors, and a row of it equals the hidden layer output when the corresponding word is given as input.

Continuous Bag-of-Words

In CBOW, a neural network is trained to predict a word from the surrounding context. Given a context window size, the model is trained on context windows where the middle word is the prediction target, see the left side of figure 2.6. According to Mikolov et al. [21], this architecture performs very well on syntactic problems, such as ”great is to greater as big is to bigger”, where one of the words great, greater, big or bigger is missing. As input, training samples of $(C, w)$ pairs are created, where $C$ is the list of context words for the word $w$, with each word represented by a one-hot vector. In practice, each pair is reduced to a $(\tilde{c}, w)$ pair, where $\tilde{c}$ is the average of the vectors in $C$; $\tilde{c}$ is fed to the neural network and $w$ is used as the ground truth.

Skip-gram

Skip-gram is similar to CBOW but instead tries to predict a context from a word, i.e. the opposite of CBOW, see the right side of figure 2.6. Skip-gram was the most accurate architecture overall, and especially so for semantic problems, such as ”Athens is to Greece as Oslo is to Norway”, where one of the capitals or countries is to be predicted [21]. As with CBOW, $(C, w)$ pairs are used as training samples, but each is split into several $(c_i, w)$ pairs, where $c_i$ is a vector in $C$ used as the ground truth and $w$ is used as the input.

2.3.3 Frequency based

There also exist several frequency-based embeddings, which calculate word frequencies and use them to determine the word vectors. These techniques include the count vector, where each word is described by a vector with one element per document in the corpus, holding the number of occurrences of that word in that document, and TF-IDF, a similar approach which also down-weighs unimportant words that occur in many documents.

2.3.4 Doc2vec - paragraph vectors

Mikolov et al. [17] later presented an extension to the word representation where they added an extra type of vector alongside the word vectors. These vectors each represent the paragraph or document from which the training samples were extracted. Like Word2vec, Doc2vec has two distinct implementations, distributed memory (DM) and distributed bag of words (DBOW), corresponding to Word2vec's CBOW and skip-gram architectures respectively. The DM implementation adds a weight matrix alongside $W_h$, called $D$, see figure 2.7. $D$ projects a one-hot encoded paragraph input vector into a vector, the same way that $W_h$ does for words. This vector is averaged with the context word vectors to produce the hidden layer output passed to the output layer, which predicts a single word, similar to the CBOW model. DM trains word and paragraph vectors simultaneously. The DBOW model predicts a context, similar to the skip-gram Word2vec model; however, instead of predicting context words from a single word, the model predicts them from the paragraph vector, and the context is a window sampled from that paragraph. Only paragraph vectors are trained using the DBOW model.

2.3.5 Vector comparison - cosine similarity

Figure 2.6: Description of CBOW and skip-gram for Word2vec. [21]

Figure 2.7: Description of Doc2vec using DM, distributed memory. Compared to Word2vec, an extra matrix for paragraphs or documents is added alongside the word matrix. [17]

A common method for comparing words and paragraphs after vectorization is cosine similarity. The cosine similarity of two vectors is the cosine of the angle between them, and hence ranges from -1 to 1. It is -1 if the vectors point in opposite directions, 0 if they are perpendicular and 1 if they are parallel; a value closer to 1 means the vectors are more similar. The method takes no account of the magnitudes of the vectors.
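A minimal NumPy implementation of the measure:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) = a . b / (|a| |b|); in [-1, 1] and
        # independent of the magnitudes of the vectors.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))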

2.4 Evaluation

There are different ways of evaluating the performance of a specific network. All evaluation is done on a test set, which has not been used during training. The metrics all compare how well the models perform against a gold standard. The gold standard is to predict the correct $y_i \in \mathcal{Y}$ for every input $x_i \in \mathcal{X}$, i.e. the predictive function $f$ always outputs $y_i$ given input $x_i$.

2.4.1 Accuracy, precision, recall and F1-score

Accuracy is the simplest metric: the number of correct predictions divided by the total number of predictions. If a model randomly guesses the outcome, the accuracy will be around 50 % for a binary classification task, regardless of the distribution of the data. Precision is calculated per prediction class. It measures how many of the instances predicted for a class actually belong to that class according to the gold standard; these are called true positives, tp, while the others are called false positives, fp, see equation 2.10. Missed instances of the class are called false negatives, fn, while instances neither predicted nor in the class are called true negatives, tn. Tp, fn, fp and tn are often summarized in a confusion matrix, see table 2.1. Recall is how many of the actual instances of the class were predicted, see equation 2.11 [26]; it is a measure of completeness [23]. The Fβ-score is a weighted average of precision and recall, see equation 2.12. The F1-score, the harmonic mean of precision and recall, is the most common. All metrics lie between 0 and 1, and higher values are better.


Table 2.1: Confusion matrix.

                   Predicted positives   Predicted negatives
Actual positives   True positives        False negatives
Actual negatives   False positives       True negatives

$$\text{precision} = \frac{tp}{tp + fp} \quad (2.10)$$

$$\text{recall} = \frac{tp}{tp + fn} \quad (2.11)$$

$$f_\beta = (1 + \beta^2) \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \quad (2.12)$$

If the data is very skewed, it is important to look at these metrics for each class. Consider, for example, a binary classification problem where most instances are of the same class. For a predictor that just outputs the most common class every time, the recall for that class is 1 and the precision will be very close to 1; for the other class, on the other hand, the recall will be 0 and the precision undefined.

One way of counteracting the impact of class choice on the metrics is to consider both classes separately and calculate their average. This can be done with or without weighting by support, which means multiplying each class's contribution to the average by the number of examples of that class, giving each instance the same effect on the average.
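Given the four cells of the confusion matrix, the metrics in equations 2.10-2.12 are straightforward to compute; a minimal sketch:

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    def f_beta(p, r, beta=1.0):
        # Weighted average of precision and recall, equation 2.12;
        # beta = 1 gives the F1-score.
        return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)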

2.4.2 Receiver operator characteristics

In probabilistic binary classification, such as with neural networks, the models often output a continuous value between 0 and 1. This, coupled with a threshold value $T$, gives the classification into two classes. If $f(x)$ is the output of the classifier given input $x$, then $x$ is assigned either class $C_n$ or $C_p$ by

$$x \in \begin{cases} C_n & \text{if } f(x) \le T \\ C_p & \text{otherwise.} \end{cases}$$

Depending on the choice of $T$, the results can vary greatly. The choice can also depend on what is most crucial in the context, i.e. how tp, fp, tn and fn are weighted against each other. In a medical setting, for example, false negatives (missing a diagnosis) could be much worse than false positives (accidentally predicting that a patient has an illness he or she actually does not have).

The receiver operator characteristics, roc, is often depicted using a curve where the true positive rate, tpr, is plotted against the false positive rate, fpr. Tpr is the same as recall, shown in equation 2.11; fpr is shown in equation 2.13. For an ideal classifier tpr is 1 and fpr is 0, which means no misclassifications. In practice, however, there is a trade-off between the two values. Altering the parameters of a model gives different results: in the case of a binary classifier with a continuous predictive output, lowering $T$ leads to more samples being classified as positive, and if some of these are actual positives this increases the tpr, but if some are actual negatives it also increases the fpr.

$$fpr = \frac{fp}{tn + fp} \quad (2.13)$$

Figure 2.8 shows an example of a ROC curve. The dashed diagonal line represents a guessing classifier, whether it guesses randomly or always guesses the class with the largest number of samples, i.e. majority guessing. Random guessing gives $tpr = fpr = 0.5$ for any value of $T$, and majority guessing gives $tpr = fpr = 1$ or $0$ for any value of $T$, depending on which class is bigger; this line is used as a baseline. The curve is constructed by calculating the tpr and fpr at different values of $T$, generating multiple points that combine into a line. A line closer to the upper left corner of the graph means a better classifier. This can be quantified by calculating the area under the ROC curve, AucRoc. An AucRoc of 1 is a perfect score, while 0.5 is equal to guessing. [31]
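The thesis does not state which implementation was used to compute these values; as one assumed option, scikit-learn provides both the curve and the area under it (y_true holds gold labels, y_score the continuous network outputs):

    from sklearn.metrics import roc_auc_score, roc_curve

    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
    auc = roc_auc_score(y_true, y_score)               # AucRoc, in [0, 1]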

Figure 2.8: An example of a ROC curve.

2.5 Related articles

In A survey of transfer learning [33], Weiss et al. review several TF techniques. Efficient Transfer Learning Schemes for Personalized Language Modeling using Recurrent Neural Network by Yoon et al. [34] deals with a similar problem in a similar way. They have a small domain (a personal device) in which they want a personalized language model able to, e.g., predict message replies in a manner characteristic of the user, and for this they use an RNN architecture. The RNN needs a large quantity of data, but the amount of data generated by a single person is very small. Also, access to the data is often restricted to the device, so the computation must be possible on the personal device with its very limited computational power. They suggest a new transfer learning method for efficient training using an LSTM architecture.

In How Transferable are Neural Networks in NLP Applications?, Mou et al. discuss how transfer learning performs under different settings [22]. They conclude that the performance of transfer learning in NLP is heavily dependent on the semantic similarity between the source and target domains.

Luong et al. [19] use multi-task learning of recurrent neural networks on various NLP tasks, such as machine translation. Benton et al. [1] similarly use multi-task learning, but with feed-forward networks, to predict mental health conditions. Both teams improve the performance of their networks on the task at hand by using knowledge from other domains, while also acknowledging that the results depend on choosing the appropriate source domain for the target.

Bingel and Søgaard [2] predict the gains of multi-task learning between different NLP tasks. They use data set features, along with characteristics of single-task learning, to see which features are the greatest predictors. Their findings show that especially the number of labels and the label entropy of the tasks are good predictors, while size is not.

Glorot et al. tested transfer learning on the sentiment analysis task [10], using Amazon reviews split up into categories. They found positive transfer for their SVM model, but did not further investigate which category pairs worked best, or why. They did find that a model pre-trained on multiple domains generalized better than one pre-trained on a single domain.

3 Method

The setup and methodology of the experiments will be presented.

3.1 Data sets

Different varieties of data were collected, from different sources.

3.1.1 Reviews

Data was collected from prisjakt¹ and hotels.com². The reviews from these sites, paired with their respective ratings, made up the training, validation and test samples. The reviews were split up according to their categories on each site in a hierarchical way, since most items had several subcategories. For example, an SSD disk on prisjakt would have the subcategories ”Datorer & Tillbehör” (Computers & Accessories), ”Datorkomponenter” (Computer Components), ”Hårddiskar” (Hard Drives) and ”Solid State-diskar (SSD)” (Solid State Disks). These categories laid the foundation for defining the domains. All subcategories were recorded so that the granularity of the domains could be chosen during training. The review domain distribution can be seen in figure 3.1.

The reviews were not split into sentences or any other grouping; each training sample consisted of a review and its corresponding rating. Each review was limited to 256 words, well above the average for reviews, as can be seen in figure 3.2, to include as much data as possible while still keeping training times down.

3.1.2 Word2Vec

To better represent the words as vectors, a Word2vec embedding was trained using Gensim's implementation [25]. This was done on the Swedish Wikipedia dump³, using the skip-gram model and 300 hidden nodes. The embedding was not trained on the actual RNN training data, since a more general structure is sought which does not depend on the target domain. The model was also restricted to the 50,000 most common words.
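A sketch of this embedding training with Gensim; sentences is assumed to be an iterable of tokenized Wikipedia sentences, and the parameter names follow recent Gensim versions (older versions use size instead of vector_size):

    from gensim.models import Word2Vec

    model = Word2Vec(sentences,
                     sg=1,                   # skip-gram architecture
                     vector_size=300,        # 300 hidden nodes
                     max_final_vocab=50000)  # the 50,000 most common words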

¹ https://www.prisjakt.nu, collected on 2017-05-20
² https://www.hotels.com, collected on 2017-06-13

• Prisjakt 566,458
  – Produkter (Products) 286,203
    * Ljud, Bild & Musik [LBM] (Sound, Image & Music) 98,774
    * Foto & Video [FV] (Photo & Video) 7,134
    * Mobil & GPS [MG] (Mobile & GPS) 21,556
    * Spel & Film [SF] (Games & Movies) 72,601
    * Datorer & Tillbehör [DT] (Computers & Accessories) 65,553
    * Hem & Trädgård [HT] (Home & Garden) 12,635
    * Sport & Friluftsliv [Sports] (Sports & Outdoor Activities) 1,343
    * Skönhet & Hälsa [SH] (Beauty & Health) 4,907
    * Skor, Kläder & Accessoarer [SKA] (Shoes, Clothes & Accessories) 1,700
  – Butiker (Stores) 275,435
  – Miscellaneous 4,820
• Hotels.com 504,302

Figure 3.1: The distribution and hierarchy of review domains, showing subcategories and number of reviews. Abbreviations for some domains are shown in square brackets.

3.2 Frameworks and hardware

For development, the framework Keras [6] was used. This is a high-level Python framework that abstracts parts of an underlying system; it runs on either Theano⁴ or Tensorflow⁵. In this project Tensorflow was chosen, since it supports Nvidia GPUs through CUDA and the project was mainly done on computers with Nvidia GPUs. The most used GPU during the project was an Nvidia GeForce GTX 1080 Ti.

Pandas⁶ and NumPy⁷ were used for inspecting, processing and preparing the data. NLTK⁸ was used to tokenize the sentences into lists of words and characters. Gensim [25] was used for the Word2vec and Doc2vec implementations.

3.3 Evaluation

Evaluation was done on a test set, separate from the training set. When a model had finished training, the test set samples were passed through the network. A threshold value $T$ of 0.5 was used to distinguish positive from negative results when calculating accuracy. The area under the receiver operator characteristics curve, AucRoc, was considered the most fitting metric and was chosen as the decisive metric in the evaluation of the models.

3.4 Data preparation

A preprocessing step based on inspection was used to ensure consistent and high-quality data.

3.4.1 Data inspection

The data collected was stored in a JSON format. An example of a review is given in figure 3.3.

⁴ http://deeplearning.net/software/theano/
⁵ https://www.tensorflow.org/
⁶ https://pandas.pydata.org/
⁷ http://www.numpy.org/
⁸ https://www.nltk.org/

Figure 3.2: Average length of reviews per rating. Five-point system doubled into ten-point system.

The review has a hierarchical category structure, with the source website first, the product name last, and subcategories in between. This way the reviews can be grouped at different levels.

After collection a manual inspection of the data was done to assess its quality and general characteristics. This included:

• looking for reviews to remove,

• looking through some randomly selected examples and deciding where to set the breakpoint between positive and negative ratings, and

• inspecting some attributes of all the data: average review length, rating distribution and domain distribution.

Inspection of the data sets showed that there are far more positive than negative reviews in all domains, as can be seen in figure 3.4. It also showed that the middle reviews, with ratings five to eight in the ten-point rating system and three or four in the five-point system, had very mixed sentiments between reviews with the same rating. Below are a few examples where the reviewers all rated a seven out of ten.

”Bra stabil produkt som jag inte haft några problem alls med. Bra bild, riktigt bra bild då tidigare dvd-spelare ersattes mot denna enbart pga bild kvaliten.”

”Den klart svagaste länken i systemet. Vet ej vad jag ska ha, men den kommer att bli utbytt så småningom.”

”Knappt godkänd. Idag köper man självklart ingen sådan här stand aloneburk längre utan en nätt dator att använda som HTPC.”

Rough translations:

”Good stable product that I have had no problems at all with. Good picture, really good picture, since the previous DVD player was replaced with this one solely because of the picture quality.”

”The clearly weakest link in the system. Don't know what I should get instead, but it will be replaced eventually.”

”Barely a pass. Today you obviously no longer buy a standalone box like this; you get a neat computer to use as an HTPC instead.”

{
    "country": "se",
    "rating": "7",
    "date": "12 år sedan",
    "text": "Den är bra men den börjar bli lite gammal.",
    "categories": [
        "prisjakt",
        "Ljud, Bild & Musik",
        "DVD-spelare & Blu-ray-spelare",
        "DVD-spelare",
        "Samsung DVD 811 Regionsfri"
    ],
    "url": "https://www.prisjakt.nu/produkt.php?o=7&s=0"
}

Figure 3.3: An example review.

Figure 3.4: Distribution of rating values of reviews. Ten-point systems projected down to a five-point system to be able to merge the different sources.

The ratio of positive to negative reviews per domain, after classifying the reviews as positive or negative and removing the middle layer of medium-rated reviews, can be seen in table 3.1.

Table 3.1: The ratio of positive and negative reviews per domain. Abbreviations explained in figure 3.1.

Domain                   Positive            Negative
All                      515356 (83.68 %)    100494 (16.32 %)
Prisjakt                 354774 (85.95 %)    57979 (14.05 %)
LBM                      56651 (89.45 %)     6682 (10.55 %)
Butiker                  199696 (86.46 %)    31270 (13.54 %)
DT                       37973 (83.78 %)     7353 (16.22 %)
HT                       6045 (69.11 %)      2702 (30.89 %)
Produkter                155078 (85.31 %)    26709 (14.69 %)
Prisjakt excluding HT    348729 (86.32 %)    55277 (13.68 %)
MG                       11614 (77.83 %)     3309 (22.17 %)
Prisjakt excluding LBM   354774 (85.95 %)    57979 (14.05 %)
Hotels                   160582 (79.07 %)    42515 (20.93 %)

3.4.2 Data preprocessing

The data inspection led to some decisions, namely:

• empty reviews were removed,
• all reviews were lowercased to get a more unified corpus,
• a hard upper limit on review length was set to 256 words, and
• the rating ranges for positive and negative reviews were set to nine and above, and four and below, respectively; all reviews with ratings five to eight were discarded.

Also, the reviews from hotels.com were rated from one to five instead of one to ten. To unify them with the rest of the reviews the ratings were doubled so that they also ranged from one to ten.
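A minimal sketch of these labeling decisions; to_label and reviews are hypothetical names, with reviews holding parsed JSON records like the one in figure 3.3:

    def to_label(rating, ten_point=True):
        # Nine and above is positive, four and below negative; the
        # mixed middle range five to eight is discarded (None).
        if not ten_point:
            rating *= 2     # unify hotels.com's 1-5 scale to 1-10
        if rating >= 9:
            return 1
        if rating <= 4:
            return 0
        return None

    # Lowercase the texts, drop empty reviews, then drop middle ratings.
    labeled = [(r['text'].lower(), to_label(int(r['rating'])))
               for r in reviews if r['text']]
    labeled = [(text, y) for text, y in labeled if y is not None]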

As the reviews and accompanying ratings were to be used as ground truth for the training, the quality of the data was important. Looking at the reviews, many of the ones with ratings in the middle segment are quite neutral reviews with both positive and negative aspects. It was hard to classify the whole review as either positive or negative, even subjectively. This was the main reason for having a big rating gap between the reviews considered positive and the reviews considered negative.

The texts could be very long, and to process them easily, texts belonging to the same batch during training needed to be the same size. This was achieved by truncating long texts, removing words from the end, and padding short texts by adding a special padding character at the end.
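Keras provides this truncation and padding directly; a sketch assuming sequences already holds the reviews as lists of word indices:

    from keras.preprocessing.sequence import pad_sequences

    # Truncate everything past 256 words and pad shorter reviews
    # at the end with the special padding index 0.
    batch = pad_sequences(sequences, maxlen=256,
                          padding='post', truncating='post', value=0)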

When running the training, a dictionary was used to translate the texts into a numeric representation, either by replacing each word with a one-hot vector or by using a pre-trained Word2vec model's representation of the words. It was important to use the same dictionary when training on both the source domain and the target domain, since they need the same representations of the words. If a word occurred that had no entry in the dictionary, a special unknown vector was used.

3.5 Network architecture

A number of different networks were tested to find one that was sufficiently good. The differences between the networks were:

• the initial value of the forget bias, either 0.5 or 1,
• which embedding was used, simple one-hot encoding or Word2vec,
• dropout,
• recurrent dropout, and
• whether the samples were weighted or not.

LSTM was used as the cell architecture for all tests. Activations and recurrent activations were kept at the default Keras functions, tanh and hard sigmoid. The rest of the parameters were kept at their default values, which can be seen in table 3.2. RMSprop⁹ was used as optimizer, as suggested for recurrent neural networks by Keras¹⁰.

Table 3.2: The default values of Keras LSTM for the parameters not changed in this thesis.

Keras LSTM parameter    Default value
use_bias                True
kernel_initializer      glorot_uniform
recurrent_initializer   orthogonal
bias_initializer        zeros
kernel_regularizer      None
recurrent_regularizer   None
bias_regularizer        None
activity_regularizer    None
kernel_constraint       None
recurrent_constraint    None
bias_constraint         None
stateful                False
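As an assumed illustration of one such setup in Keras (the layer size and dropout rates below are placeholders, not values reported in this thesis):

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(128, input_shape=(256, 300),  # 256 steps of 300-dim Word2vec vectors
                   dropout=0.2, recurrent_dropout=0.2,
                   unit_forget_bias=True))       # forget bias initialized to 1
    model.add(Dense(1, activation='sigmoid'))    # binary sentiment output
    model.compile(optimizer='rmsprop', loss='binary_crossentropy',
                  metrics=['accuracy'])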

3.5.1 Network training

The networks were trained using Python and several libraries, most notably Tensorflow and Keras. Several runs with different setups were done, each setup having training, validation and test sets and also being cross-validated. The data set used was all the data collected from prisjakt.se. Early stopping was applied with a patience of 5, returning the model weights from the epoch yielding the lowest validation loss. The model was then evaluated against the test set. The area under the receiver operator characteristics curve, AucRoc, was used as the decisive score when comparing the networks. First, a few tests were conducted where each of the different parameters was changed from a baseline; the parameter changes that showed the best results were then combined in an attempt to improve the networks even more.

3.6 Domain similarity

Using Gensim's [25] Doc2vec implementation¹¹, paragraph vectors for each domain were calculated. The vectors were taken to reflect the word usage and general language of each domain. These were then used to decide the similarity between the domains using cosine similarity; it was assumed that each domain could be treated as a single document. The model was trained on each specified domain and used mostly default parameters, the most noteworthy being a window size of 5 and the DM implementation, see section 2.3.4. The default values can be seen in table 3.3. The vector size was increased to 300 and the number of epochs to 10.

⁹ Introduced by Geoffrey Hinton in http://www.cs.toronto.edu/tijmen/csc321/slides/lecture_slides_lec6.pdf
¹⁰ Found under the RMSprop section in https://keras.io/optimizers/
¹¹ https://radimrehurek.com/gensim/models/doc2vec.html


Table 3.3: The default values for the Gensim Doc2Vec parameters not changed in this thesis.

Gensim Doc2Vec parameter    Default value
dm                          1
window                      5
alpha                       0.025
min_alpha                   0.0001
min_count                   5
max_vocab_size              None
sample                      1e-3
hs                          0
negative                    5
ns_exponent                 0.75
dm_mean                     None
dm_concat                   0
dm_tag_count                1
dbow_words                  0
trim_rule                   None

The vector size was increased to 300 and the number of epochs to 10.

Cosine similarity results in a value in the range [−1, 1], where values closer to 1 in this case mean more similar domains. The results were then categorized into three classes: class 1 represents very similar domains (cosine similarity above 0.95), class 2 less similar domains (cosine similarity 0.80 to 0.95), and class 3 domains that are not at all similar (cosine similarity less than 0.80). This made it convenient to compare different training pairs whose similarity fell in the same class, seen as having the same relative similarity, while some other parameter was changed and tested instead.
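A minimal sketch of this similarity calculation, assuming each domain's reviews have been concatenated into one token list per domain; the toy domain data, and min_count=1 (needed only because the toy corpus is tiny), are assumptions for the example:

# A minimal sketch of the domain-similarity calculation; the toy domain
# data and min_count=1 are illustrative assumptions.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

domains = {
    "Butiker": "snabb leverans och bra bemötande i butiken".split(),
    "Hotels":  "trevligt hotell med ren och fin frukostsal".split(),
}
documents = [TaggedDocument(words, [tag]) for tag, words in domains.items()]

# dm=1 (the DM implementation) and window=5 are the Gensim defaults;
# vector size 300 and 10 epochs follow the values stated above.
model = Doc2Vec(documents, dm=1, window=5, vector_size=300,
                epochs=10, min_count=1)

a, b = model.dv["Butiker"], model.dv["Hotels"]
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # in [-1, 1]; closer to 1 means more similar domains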

3.7 Transfer learning

To test the capabilities of the neural network to transfer between domains, a set of tests was run with different combinations of source and target domains, as well as different training-set sizes for each domain. For each domain pair, the model was first trained on the source domain data. The transfer was performed by using the weights from the network trained on the source data to initialize the weights for further training on the target data, i.e. parameter initialization as described in section 2.2.7. The data set sizes tested were in steps of 20 %: 0, 20, 40, 60, 80 and 100 % of the target training data, and 0 and 100 % of the source training data. Thus, for each domain pair a total of 12 test runs were performed, each one cross-validated. Each time a training set was reduced, the ratio of positive to negative reviews was kept approximately constant. Note that only the training data was reduced; the test set sizes were kept constant. This was done so that the comparison would be as fair as possible, since a larger test set is a better representation of the general case.
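A minimal sketch of the parameter-initialization transfer, continuing the sketches above; the file name and the data variables are assumptions made for the example:

# A minimal sketch of the parameter-initialization transfer; file name
# and data variables are illustrative assumptions.

# 1. Train the network on the full source-domain training data.
model.fit(x_source, y_source,
          validation_data=(x_source_val, y_source_val),
          epochs=50, batch_size=64, callbacks=[early_stop])
model.save_weights("source_weights.h5")

# 2. Initialize an identically shaped network with the stored source
#    weights and continue training on a fraction of the target data.
#    When subsampling the target set, the positive/negative ratio is
#    kept approximately constant (e.g. via a stratified split).
model.load_weights("source_weights.h5")
model.fit(x_target_subset, y_target_subset,
          validation_data=(x_target_val, y_target_val),
          epochs=50, batch_size=64, callbacks=[early_stop])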

To compare the transfer learning results, the same metric as when comparing the network architectures was used: AucRoc.

The data sets used were a combination of large, medium and small data sets. All domains with over 100,000 reviews were considered large, all below 25,000 reviews small, and the rest medium, see table 3.1. They were also selected depending on their relative domain similarity, so that transfer learning was tested on both very similar and less similar domains.


4 Results

The experiments described in chapter 3 produced multiple quantifiable results.

4.1 Architecture performances

Table 4.1 shows the performance of training and evaluating different models and architectures on the Prisjakt data set. Accuracy is shown in the table, but AucRoc was used as the decisive metric for judging model performance. Model 1 is a baseline model: a non-bidirectional LSTM with one layer. Models 2-8 are each a model where one parameter has been changed. The differences in results between the models are very small, but some parameters seem to improve the results. Model 3 shows a relatively large improvement over the baseline by using a bidirectional RNN, as does the increased dropout of model 6. The Word2vec embedding shows promise for model 5. Other parameters seem to worsen the performance, e.g. weighting the infrequent classes and increasing the number of layers. Models 9-13 combine the three best parameter changes from the previous models in an attempt to improve the performance further. Models 9-11 experiment with changed dropout while using a Word2vec embedding, while models 12 and 13 show how bidirectionality, a Word2vec embedding and increased dropout work together. Bear in mind that some models take far longer to train; if a faster model is only slightly less performant, it could be worth choosing it for the speed increase. Both increasing the number of layers to two and choosing a bidirectional model increase the training time by a factor of about two. In the end, model 11 gave the best result while consisting of only one layer and not being bidirectional; thus, this model was chosen for further studies.

4.2 Domain similarity

Using the paragraph vectors calculated with Doc2vec, the similarity between domains could in turn be calculated using pairwise cosine similarity. The result is shown in table 4.3. It shows that the reviews taken from hotels.com are far less similar to the other categories, which were taken from prisjakt.se. Most prisjakt reviews were relatively similar to each other, but still ranged from the low 0.80s to the high 0.90s.


Table 4.1: Performances of different architectural setups. Headers are explained in table 4.2.

Model  L  Bidi  FB   E         D    RD   W    Accuracy  AucRoc
1      1  -     -    Simple    0.2  0.2  -    0.9375    0.9538
2      2  -     -    Simple    0.2  0.2  -    0.9366    0.9521
3      1  YES   -    Simple    0.2  0.2  -    0.9425    0.9609
4      1  -     YES  Simple    0.2  0.2  -    0.9418    0.9573
5      1  -     -    Word2Vec  0.2  0.2  -    0.9415    0.9599
6      1  -     -    Simple    0.5  0.2  -    0.9391    0.9609
7      1  -     -    Simple    0.2  0.5  -    0.9401    0.9547
8      1  -     -    Simple    0.2  0.2  YES  0.9338    0.9521
9      1  -     -    Word2Vec  0    0.2  -    0.9386    0.9524
10     1  -     -    Word2Vec  0.2  0    -    0.9354    0.9492
11     1  -     -    Word2Vec  0.5  0.2  -    0.9428    0.9621
12     1  YES   -    Word2Vec  0.5  0.2  -    0.9445    0.9612
13     1  YES   -    Word2Vec  0.5  0.5  -    0.9425    0.9605

Table 4.2: Explanation of architecture headers.

Model   The model number
L       Number of layers
Bidi    If the RNN is bidirectional or not
FB      If the forget bias is initialized to 1 instead of 0.5
E       Which embedding is used, either a simple one-hot encoding or a trained Word2Vec embedding
D       Dropout value
RD      Recurrent dropout value
W       If the classes are weighted inversely by their frequencies, i.e. samples from infrequent classes are weighted more
AucRoc  The result in area under the ROC curve

The domain similarities resulted in three similarity categories, which can be seen in table 4.4. Since the hotel domain was the one that differed the most, the class with the most dissimilar domain pairs consisted entirely of pairs involving it.

4.3 Transfer learning

Given the results of section 4.2 and the sizes of the domains, eleven combinations were chosen, see table 4.4. The combinations were chosen so that a large mix of factors could be tested, e.g. whether similar domains gave different results than less similar ones.

The results of the transfer learning between domains can be seen in figure 4.2. Each graph shows the AucRoc value when testing on one domain after pre-training on another. The x-axis shows the percentage of target domain data that was used, from 0 % to 100 % in 20 % increments. The separate lines show how much source data was used, either 0 or 100 %. The graphs are truncated to show AucRoc values from 0.8 to 1, which leads to some values being excluded; most of these are the runs with 0 % target and source data, i.e. untrained and therefore uninteresting. The size of each training set is shown in parentheses, with the source domain size before the colon and the target domain size after. For figures 4.2e and 4.2g, the target domain, HT and LBM respectively, has been excluded from the source domain.

Some characteristics are shared between most graphs. Firstly, the AucRoc when using only source data is already a huge improvement over random guessing, with all results except MG on HT (figure 4.2k) being over 0.8. This shows, unsurprisingly, that using source data is at least better than not using any data. The overall trend is also that the more target data is used, the better the result.



Table 4.3: The pairwise similarity of domains calculated using cosine similarity on paragraph vectors. A value closer to 1 means more similar. Abbreviations are explained in figure 3.1. Prisjakt excluding HT and Prisjakt excluding LBM are domains where a specific subdomain has been excluded when calculating the paragraph vectors; they are abbreviated Pj excl. HT and Pj excl. LBM in the column headers.

Domain        Prisjakt  LBM    Butiker  DT     HT     Produkter  Pj excl. HT  MG     Pj excl. LBM  Hotels
Prisjakt      1.00      0.999  0.933    0.977  0.912  0.997      0.983        0.990  0.972         0.606
LBM           0.999     1.00   0.920    0.969  0.897  0.992      0.990        0.992  0.977         0.565
Butiker       0.933     0.920  1.00     0.964  0.982  0.954      0.866        0.925  0.887         0.785
DT            0.977     0.969  0.964    1.00   0.968  0.985      0.926        0.959  0.952         0.715
HT            0.912     0.897  0.982    0.968  1.00   0.933      0.829        0.906  0.893         0.778
Produkter     0.997     0.992  0.954    0.985  0.933  1.00       0.970        0.984  0.960         0.658
Pj excl. HT   0.983     0.990  0.866    0.926  0.829  0.970      1.00         0.981  0.962         0.478
MG            0.990     0.992  0.925    0.959  0.906  0.984      0.981        1.00   0.983         0.538
Pj excl. LBM  0.972     0.977  0.887    0.952  0.893  0.960      0.962        0.983  1.00          0.471
Hotels        0.606     0.565  0.785    0.715  0.778  0.658      0.478        0.538  0.471         1.00

Table 4.4: The combinations of domains to be tested, their respective sizes and the domain similarity categorized into three categories.

Source     Target     Sizes             Similarity
Prisjakt   Hotels     Large -> Large    Class 3  0.606
Hotels     Prisjakt   Large -> Large    Class 3  0.606
Produkter  Butiker    Large -> Large    Class 1  0.954
Butiker    Produkter  Large -> Large    Class 1  0.954
Prisjakt   LBM        Large -> Medium   Class 1  0.977
Prisjakt   HT         Large -> Small    Class 2  0.829
Hotels     HT         Large -> Small    Class 3  0.778
LBM        DT         Medium -> Medium  Class 1  0.969
DT         LBM        Medium -> Medium  Class 1  0.969
HT         MG         Small -> Small    Class 2  0.906
MG         HT         Small -> Small    Class 2  0.906


The benefit is, however, diminishing: the more target data is available, the smaller the gain from using source data.

All graphs where both domains are large, figures 4.2a-4.2d, show very similar results. All already have a good base performance using only source data, and when all target data is used the gain from using source data is almost negligible. We can see that the gain of using just 20 % of the target data is much larger if the domains are less similar, as in figures 4.2a and 4.2b, compared to the more similar domains in figures 4.2c and 4.2d. This trend is also noticeable when using large source domains on smaller target domains; compare the more similar domains in figure 4.2e with the less similar ones in figure 4.2f. In this case the trend also holds for the entirety of the graph and not just the initial 20 % of target data.

Figure 4.1 shows the difference between using pre-training and not, when the full target data set is used and the full source data set is used for pre-training. This can be seen as the overall success of the transfer learning, but it mainly shows that small domains benefit greatly when the model is pre-trained on a large source domain.



[Figure 4.2]
(a) Results of hotels.com reviews pre-trained with prisjakt reviews. Class 3 similarity.
(b) Results of prisjakt reviews pre-trained with hotels.com reviews. Class 3 similarity.
(c) Results of prisjakt store reviews pre-trained with prisjakt product reviews. Class 1 similarity.
(d) Results of prisjakt product reviews pre-trained with prisjakt store reviews. Class 1 similarity.
(e) Results of home and garden reviews pre-trained with prisjakt excluding home and garden reviews. Class 2 similarity.
(f) Results of home and garden reviews pre-trained with hotels.com reviews. Class 3 similarity.
(g) Results of sound, image and music reviews pre-trained with prisjakt excluding sound, image and music reviews. Class 1 similarity.
(h) Results of computers and accessories reviews pre-trained with sound, image and music reviews. Class 1 similarity.
(i) Results of sound, image and music reviews pre-trained with computers and accessories reviews. Class 1 similarity.
(j) Results of mobile and gps reviews pre-trained with home and garden reviews. Class 2 similarity.
(k) Results of home and garden reviews pre-trained with mobile and gps reviews. Class 2 similarity.
