Transfer learning in Swedish - Twitter sentiment classification

LUCAS GRÖNLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Machine Learning
Date: May 21, 2019
Supervisor: Giampiero Salvi
Examiner: Viggo Kann
School of Electrical Engineering and Computer Science
Host company: Cybercom Group
Swedish title: Transfer learning på svenska - attitydanalys av Twitter

Abstract

Language models can be applied to a diverse set of tasks with great results, but training a language model can unfortunately be a costly task, both in time and money. By transferring knowledge from one domain to another, the costly training only has to be performed once, thus opening the door for more applications. Most current research is carried out with English as the language of choice, which limits the number of already trained language models available in other languages.

This thesis explores how the amount of data available for training a language model affects the performance on a Twitter sentiment classification task, and was carried out with Swedish as the language of choice. The Swedish Wikipedia was used as a source for pre-training the language models, which were then transferred to a domain consisting of Swedish tweets. Several models were trained using different amounts of data from these two domains in order to compare their performance.

The results of the model evaluation show that transferring knowledge from the Swedish Wikipedia to tweets yields little to no improvement, while unsupervised fine-tuning on tweets gives rise to large improvements in performance.


Sammanfattning

Språkmodeller kan appliceras på en mängd olika uppgifter med bra resultat, men att träna en språkmodell kan dessvärre vara kostsamt både tids- och pengamässigt. Genom att överföra information från en domän till en annan behöver denna kostsamma träningsprocess bara genomföras en gång, och ger således lättare tillgång till dessa modeller. Dagens forskning genomförs främst med engelska som språk vilket således begränsar mängden av färdigtränade modeller på andra språk.

Denna rapport utforskar hur mängden data tillgänglig för träning av språkmodeller påverkar resultatet i ett problem gällande attitydanalys av tweets, och utfördes med svenska som språk. Svenska Wikipedia användes för att först träna språkmodellerna som sedan överfördes till en domän bestående av tweets på svenska. Ett flertal språkmodeller tränades med olika mängd data från dessa två domäner för att sedan kunna jämföra deras prestanda.

Resultaten visar att överföring av kunskap från Wikipedia till tweets knappt gav upphov till någon förbättring, medan oövervakad träning på tweets förbättrade resultaten markant.


Contents

1 Introduction
1.1 Background
1.2 Thesis objective
2 Theory
2.1 Machine Learning
2.1.1 Supervised and Unsupervised Learning
2.2 Neural Networks
2.2.1 Neurons
2.2.2 Activation functions
2.2.3 Backpropagation
2.2.4 Regularization
2.2.5 Recurrent Neural Networks
2.2.6 Long Short-Term Memory
2.2.7 AWD-LSTM
2.2.8 Quasi-Recurrent Neural Networks
2.3 Natural Language Processing
2.3.1 Word Embeddings
2.3.2 Transfer Learning
3 Method
3.1 Datasets
3.1.1 Wikipedia
3.1.2 Twitter
3.2 Preprocessing
3.3 ULMFiT
3.3.1 General Domain Language Model Pre-Training
3.3.2 Target task Language Model Fine-Tuning
3.3.3 Target Task Classifier Fine-Tuning
3.4 Evaluation metrics
3.4.1 Accuracy
3.4.2 Precision
3.4.3 Recall
3.4.4 F1-score
4 Experimental Settings
4.1 Model and Implementation
4.1.1 Pre-training on general domain corpus
4.1.2 Fine-tuning on target domain corpus
4.1.3 Fine-tuning Classifier on target domain
4.2 Evaluation
5 Results
5.1 Pre-training on Wikipedia
5.2 Fine-tuning on Twitter
5.3 Training classifier on labelled tweets
6 Discussion
6.1 Pre-training
6.2 Fine-tuning
6.3 Sentiment Classification
6.4 Benefits of pre-training
6.5 Ethical Considerations
6.6 Conclusions
6.7 Future Work
Bibliography


1 Introduction

The use of transfer learning within the domain of computer vision has greatly improved results for about a decade [9], but only recently has this approach been successfully applied to natural language processing. The idea that some basic knowledge of language can be transferred from one domain to another feels intuitive: we would, after all, be able to read an article from a field we have very limited knowledge of, even though we might not fully understand its contents. Not having to learn a whole language from scratch each time we want to create a language model for a specific task could reduce both the amount of data needed and the time taken to train the model, two very important advantages in machine learning.

1.1 Background

Natural language processing (NLP) is a way to automatically analyse and process human language in order to make computers able to perform a large variety of tasks such as document classification, sentiment analysis and machine translation. A common situation is that there are large amounts of data available in one domain, but we want a language model to perform well in another domain with very limited data. This might even be the case in situations where a large amount of unlabelled data is available, but the process of annotating this data is too costly or time consuming.

Deep learning models have been shown to perform well on many NLP tasks [51], but they are trained from scratch, require large amounts of data and long training times, and are prone to overfitting [28]. While pre-trained word representations in the form of simple embeddings are commonly employed in most NLP models, only a small part of the model is transferred while the rest is trained from randomly initialized parameters [33][26].

Until very recently, attempts at using fully pre-trained language models have either been largely unsuccessful [28] or required large amounts of data from a similar domain in order to achieve competitive results [7]. New models and methods utilizing transfer learning have emerged and obtained promising results on multiple NLP tasks, thus opening the doors wide for more research and development [8][34][16].

1.2 Thesis objective

Most of the current research is carried out with English as the language of choice, thus leaving other languages without pre-trained models. For languages such as English, large quantities of data are available that can be used for pre-training a language model, but other languages might be more limited in the amount of data available. Training methods such as ULMFiT [16] clearly show the benefits of starting from a pre-trained model, but it is currently not well established how much data is needed for pre-training.

This degree project ultimately aims to investigate how the amount of data available for pre-training correlates with the results on a downstream task in a target domain. The objective is to train Swedish language models using current state-of-the-art algorithms with different amounts of data available for pre-training and in-domain fine-tuning, and to evaluate the models on a sentiment classification task using Swedish tweets. The exact methods are described in more detail in the following sections.

The research question that this thesis will try to answer can be summarized as:

How does the amount of data available for pre-training on the Swedish Wikipedia and performing in-domain fine-tuning on Swedish tweets affect the results on a sentiment analysis task?


2 Theory

This chapter provides the basic theory that builds the foundation for this thesis.

2.1 Machine Learning

In light of the large increase in available data, more effective methods are needed than having humans manually analyse the data and create complex models [1]. Machine learning is often used in these kinds of environments, where recognizing intrinsic patterns within the data is required, such as image classification, recommendation engines and spam detection.

2.1.1 Supervised and Unsupervised Learning

Machine learning methods need data to learn from in order to find patterns. This data can be labelled with known answers, such as whether an email is spam or not; if these labels are used in the training process it is called supervised learning. The goal in supervised learning is to find a function f that maps data from the input domain to the output domain based on labelled examples; this function can then be used to make predictions based on the distribution of the training data. Unsupervised learning takes place when there is no labelled data; instead, patterns that exist within the data are identified. A third approach, semi-supervised learning, combines the two and is often deployed when a large amount of unlabelled data but only a small amount of labelled data is available, a common scenario since labelling data is often a costly task [2].


2.2 Neural Networks

Artificial neural networks are machine learning models initially inspired by the neural networks found in biological brains. They are composed of several parts which are briefly explained in the following sections.

2.2.1 Neurons

A network contains several neurons which are connected in various ways depending on the network architecture. Typically a neuron receives several input signals that are either amplified or dampened depending on the weights of the corresponding connections. The neuron's response to the input signal is determined by its activation function. An example of a single neuron can be seen in Figure 2.1.

A common way to connect neurons is by grouping several of them together into layers. These layers can then be connected into more complex configurations by letting the output of one layer be the input of another, so-called hidden, layer. The connections between neurons and layers can be set up in many different ways depending on the task at hand, allowing for more complex representations of the data.

Figure 2.1: A single neuron. The inputs x1, ..., xn are weighted by w1, ..., wn, summed, and passed through the activation function σ to produce the output y.


2.2.2 Activation functions

There are many different activation functions, but they all have in common that they are non-linear. Since a neural network is a composition of chained functions, if these were linear the whole network could be rewritten as a single linear function:

$$y = \sigma_2(W_2\, \sigma_1(W_1 x)), \qquad \sigma_i \text{ linear} \;\Rightarrow\; y = W_{\sigma_2} W_2 W_{\sigma_1} W_1 x = W x$$

In this thesis the Rectified Linear Unit (ReLU) [30] and softmax [2], two widely used activation functions, will be employed.

ReLU is a simple function used between the layers of a neural network,

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (2.1)$$

and is thus very computationally efficient. Apart from introducing sparsity into the model, in the case of deep learning it also tackles part of the problem of vanishing gradients by mapping all negative values to zero and letting all other values pass through unmodified [12].

Softmax is commonly used in the last layer of a neural network whose purpose is categorical classification, since its output can be used to represent a categorical distribution [2]:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}} \quad \text{for } i = 1, \ldots, K \qquad (2.2)$$

where $K$ in the case of categorical classification is the number of classes.
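As a concrete illustration of equations (2.1) and (2.2), a minimal NumPy sketch of the two activation functions might look as follows (not taken from the thesis code):

```python
import numpy as np

def relu(x):
    # Eq. (2.1): negative values are mapped to zero, the rest pass through.
    return np.maximum(0.0, x)

def softmax(x):
    # Eq. (2.2): exponentiate and normalize so the outputs sum to one.
    # Subtracting the maximum is a standard numerical-stability trick
    # that does not change the result.
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([2.0, -1.0, 0.5])
print(relu(logits))     # [2.  0.  0.5]
print(softmax(logits))  # probabilities over three classes, summing to 1
```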

2.2.3 Backpropagation

In order to train a network, the difference between the network output and the desired output is to be minimized. This is done by backpropagation. Training data is propagated through the network and the error is computed using a loss function such as Mean Square Error (MSE) or categorical cross entropy [13]. In order to minimize this loss, the gradients with respect to each trainable parameter of the network are computed with repeated application of the chain rule, and then commonly some variant of Stochastic Gradient Descent (SGD), such as Adam [19], is used to update the parameters [32][5]. This process is repeated until the network reaches a sufficient degree of performance. Training recurrent neural networks, which use sequential data, is done by backpropagation through time (BPTT). In this case the recurrent model is represented as a deep multi-layered model, and normal backpropagation is performed on this unrolled deep model [32].
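A minimal PyTorch sketch of a single training iteration (forward pass, loss, backpropagation and an Adam update) is given below; the model, sizes and data are hypothetical, and the thesis itself works through the fastai library on top of PyTorch:

```python
import torch
import torch.nn as nn

# A toy classifier: 10 input features, 3 output classes (hypothetical sizes).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
loss_fn = nn.CrossEntropyLoss()               # categorical cross entropy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(16, 10)                       # a minibatch of 16 examples
y = torch.randint(0, 3, (16,))                # their class labels

logits = model(x)                             # forward pass
loss = loss_fn(logits, y)                     # compute the loss
optimizer.zero_grad()
loss.backward()                               # backpropagation via the chain rule
optimizer.step()                              # parameter update (Adam)
```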


2.2.4 Regularization

Overfitting is a concept in machine learning that occurs when a network learns the training data so well that it starts to generalize worse on unseen data [2]. When training neural networks this is something to keep in mind, and there are several techniques one can use to minimize this effect.

Dropout is a technique where units, along with their connections, are randomly "dropped" from the neural network during training, and is commonly employed in computer vision [45]. Dropout prevents neurons from co-adapting, which is when the network learns feature detectors that are only useful in the context of several other feature detectors. Dropout instead promotes neurons learning to detect a feature independently of the other neurons and thus helps combat overfitting [15].

L2- and L1-norm regularization (also called Ridge Regression and Lasso) limit the size of the network's weights by adding an extra term $\lVert W \rVert_2$ or $\lVert W \rVert_1$ to the cost function [2]. This prevents some weights from becoming much larger than the others and thus limits the network's ability to overfit.

Early stopping [38] is another technique that can be used to prevent overfitting. When splitting the available data into training and test sets, a third subset called the validation set is created. After each epoch of training, the current model is evaluated on the validation set and the loss is calculated. When the validation loss stops decreasing, training is terminated early, since the network is starting to overfit on the training data.
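A schematic sketch of early stopping with a patience counter is shown below; train_one_epoch and validate are placeholder functions, and the exact stopping criterion used in the thesis may differ:

```python
def early_stopping_training(model, train_one_epoch, validate,
                            max_epochs=10, patience=2):
    """Stop training when the validation loss has not improved
    for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped decreasing
    return model
```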

2.2.5 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are networks that allow information to be passed between iterations. This is of great benefit when the data is not i.i.d. but instead has some relation in time or space; for example, a sentence is a sequence of words that depend on each other [21].

Given activation functions $\sigma_x$ and $\sigma_y$, an input $x_t$ and the previous hidden state values $h_{t-1}$, the following two equations are used at each time step of a single-layer recurrent neural network:

$$h_t = \sigma_x(W x_t + U h_{t-1} + b_x)$$
$$y_t = \sigma_y(V h_t + b_y) \qquad (2.3)$$

where $W$, $U$ and $V$ are weight matrices and $b_x$, $b_y$ are biases. An illustration of these networks can be seen in Figure 2.2.
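A minimal NumPy sketch of one time step of equation (2.3), with assumed toy dimensions and tanh chosen as the hidden activation, could look like this:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b_x, b_y):
    # Eq. (2.3): one time step of a single-layer vanilla RNN.
    h_t = np.tanh(W @ x_t + U @ h_prev + b_x)   # sigma_x chosen as tanh here
    y_t = V @ h_t + b_y                         # sigma_y left as identity here
    return h_t, y_t

# Toy dimensions: 4-dimensional input, 8-dimensional hidden state, 3 outputs.
rng = np.random.default_rng(0)
W, U = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
V = rng.normal(size=(3, 8))
b_x, b_y = np.zeros(8), np.zeros(3)

h = np.zeros(8)
for x_t in rng.normal(size=(5, 4)):     # a sequence of five inputs
    h, y = rnn_step(x_t, h, W, U, V, b_x, b_y)
```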


Figure 2.2: A vanilla RNN cell.

2.2.6 Long Short-Term Memory

A vanilla RNN has trouble remembering long sequences due to the problem of vanishing and exploding gradients, which occurs when the gradients either go towards values close to zero or become very large. An extension of the model called Long Short-Term Memory (LSTM) was invented with the goal of addressing this issue [21][42]. By introducing an intermediate type of storage via a memory cell, the network retains information over longer periods of time.

The information stored in the memory cell is updated via two gates: the input gate, which controls the flow of new information, and the forget gate, which regulates how much information is removed from the cell memory. A third gate, the output gate, regulates how much of the cell memory is used to compute the output of the unit. The following equations are used at each time step (with biases omitted):

$$i_t = \sigma_i(W_i x_t + U_i h_{t-1})$$
$$f_t = \sigma_f(W_f x_t + U_f h_{t-1})$$
$$o_t = \sigma_o(W_o x_t + U_o h_{t-1})$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t) \qquad (2.4)$$

where $i_t$, $f_t$ and $o_t$ are the input, forget and output gate values. $\tilde{c}_t$ is the new temporary memory, which is combined with the previous memory state $c_{t-1}$, $i_t$ and $f_t$ in order to create the new memory cell state $c_t$. Lastly, the updated memory cell state is combined with the output gate values to create a new hidden state $h_t$, which is passed forward to the next time step. An illustration of this can be seen in Figure 2.3.

Figure 2.3: A regular LSTM cell.
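A NumPy sketch of one LSTM step following equation (2.4), with biases omitted as in the text and toy dimensions assumed, is shown below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    # Eq. (2.4), biases omitted as in the text.
    W_i, U_i, W_f, U_f, W_o, U_o, W_c, U_c = params
    i_t = sigmoid(W_i @ x_t + U_i @ h_prev)        # input gate
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev)        # forget gate
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev)        # output gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev)    # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde             # new cell state
    h_t = o_t * np.tanh(c_t)                       # new hidden state
    return h_t, c_t

# Toy dimensions: 4-dimensional input, 8-dimensional hidden/cell state.
rng = np.random.default_rng(0)
params = [rng.normal(size=(8, 4)) if k % 2 == 0 else rng.normal(size=(8, 8))
          for k in range(8)]
h, c = np.zeros(8), np.zeros(8)
for x_t in rng.normal(size=(5, 4)):
    h, c = lstm_step(x_t, h, c, params)
```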

As mentioned before, this structure makes LSTMs better at storing information over longer sequences in comparison to vanilla RNNs and thus addresses the problem of vanishing gradients. The problem of exploding gradients is handled by gradient clipping, which means rescaling the gradients once they exceed a certain threshold [32].

In the context of language models, LSTMs make use of the previous words in a sentence to provide context for future predictions. Given a sequence of tokens $(x_1, \ldots, x_n)$, a forward language model makes a prediction for token $x_k$ based on the previous tokens $(x_{k-1}, \ldots, x_1)$; the probability of the sequence then becomes:

$$p(x_1, \ldots, x_k) = \prod_{i=2}^{k} p(x_i \mid x_{i-1}, \ldots, x_1)$$

However, in order to put a word into context, more information than the previously encountered words may be useful; it is then reasonable to believe that a bidirectional model, which looks at both previous and future tokens, would perform better than a unidirectional model. It is therefore no surprise that the current state of the art makes use of bidirectional models [8][16][34]. An example of how this architecture might roughly look can be seen in Figure 2.4.

Figure 2.4: A 2-layer bidirectional LSTM.

2.2.7 AWD-LSTM

Average-SGD Weight-Dropped LSTM (AWD-LSTM) is a recent extension of the vanilla LSTM that has shown great results within natural language processing [22]. Since many newer models incorporate AWD-LSTMs, this section provides a more in-depth look at the new strategies it brings.

Weight-Dropped LSTM

As mentioned before, dropout is a commonly used regularization technique. In an RNN, however, applying dropout to the hidden states of the network harms its ability to retain long-term memory. There have been many proposed solutions to this problem, e.g. dropping the same network units at each time step [11] or applying dropout to the cell update [40]. Most of these solutions require some modification of the LSTM implementation (either acting on the hidden states or on the updates to the memory state) and are thus not compatible with black-box libraries such as NVIDIA cuDNN, which can be several times faster than naive LSTM implementations [22]. To address this, the weight-dropped LSTM uses DropConnect [47] on the hidden-to-hidden weights ($U_i$, $U_f$, $U_o$ and $U_c$ in equation 2.4) instead of on the activations.

As the dropout is applied once to the weight matrices before the forward and backward pass, it does not modify the LSTM implementation and black-box libraries can be used.
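A conceptual sketch of DropConnect applied to a recurrent weight matrix is given below; sampling one mask per forward/backward pass and rescaling the surviving weights by 1/(1 - p) is one common convention and is assumed here:

```python
import numpy as np

def weight_drop(U, p, rng):
    """DropConnect on a recurrent weight matrix: zero out individual
    weights with probability p, once per forward/backward pass, and
    rescale the survivors so the expected value is unchanged."""
    mask = rng.random(U.shape) >= p
    return U * mask / (1.0 - p)

rng = np.random.default_rng(0)
U_h = rng.normal(size=(8, 8))            # hidden-to-hidden weights (e.g. U_i)
U_h_dropped = weight_drop(U_h, p=0.5, rng=rng)
# U_h_dropped is then reused for every time step of this pass, so the
# underlying LSTM implementation itself does not need to be modified.
```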


Average-SGD

Stochastic gradient descent (SGD) is the most common training method for neural networks. The algorithm computes the gradient of the loss function $C(\theta)$ with respect to the model's trainable parameters $\theta$. The parameters are then updated according to:

$$\theta_{t+1} = \theta_t - \eta\, \hat{\Delta} C(\theta) \qquad (2.5)$$

where $\eta$ is the learning rate and $\hat{\Delta}$ denotes the stochastic gradient, which computes the gradient on a minibatch of data points. This method iteratively reduces the loss function until a sufficient degree of performance is obtained.

Averaged SGD (ASGD) performs the same step as eq. (2.5), but instead of returning the most recently computed parameters directly, it returns an average of the parameter iterates from step $T$ onwards:

$$\frac{1}{K - T + 1} \sum_{i=T}^{K} \theta_i \qquad (2.6)$$

where $K$ is the total number of iterations performed and $T$ is a user-defined trigger deciding how many steps back to include in the averaging. An extension of this algorithm is to base the trigger decision on performance on the validation set instead of a user-defined trigger, so that the averaging is triggered after multiple iterations on the validation set without improvement [22].
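A schematic sketch of this averaging is given below; grad_fn is a placeholder for the (stochastic) gradient computation and the trigger T is fixed rather than validation-based:

```python
import numpy as np

def averaged_sgd(theta0, grad_fn, lr=0.1, K=100, T=50):
    """Run K SGD steps and return the average of the parameter
    iterates from step T onwards, as in eq. (2.6)."""
    theta = np.asarray(theta0, dtype=float)
    tail = []                                  # iterates theta_T, ..., theta_K
    for i in range(1, K + 1):
        theta = theta - lr * grad_fn(theta)    # eq. (2.5)
        if i >= T:
            tail.append(theta.copy())
    return np.mean(tail, axis=0)               # eq. (2.6)

# Toy example: minimize f(theta) = ||theta||^2, whose gradient is 2*theta.
theta_avg = averaged_sgd(np.array([3.0, -2.0]), grad_fn=lambda t: 2 * t)
```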

Regularization Techniques

Several regularization techniques are utilized in the AWD-LSTM. They are briefly described below.

• In order to prevent the inefficient data usage that can stem from performing backpropagation through time [42] with fixed minibatch sizes, a random sequence length is selected for the forward and backward passes. The base sequence length is first selected to be $seq$ with probability $p$ and $seq/2$ with probability $(1 - p)$. The final sequence length is then selected by sampling from the distribution $N(seq, \sigma)$, where $\sigma$ is the standard deviation. In order not to get too long or negative sequence lengths, maximum and minimum values are set. During training, the learning rate is scaled linearly in proportion to the resulting minibatch size so as not to favour short sentences [22][14].


• The same dropout mask is used at each time step on the inputs and outputs of the LSTM, called Variational dropout [11].

• Dropout is performed on the embedding matrix at the word level, which in effect is identical to dropping all occurrences of specific words at random within each pass [11]. The remaining word embeddings are then scaled by $1/(1 - p_d)$, where $p_d$ is the dropout probability.

• The weights are shared between the embedding and decoding layer, thus reducing the total number of parameters in the network with minimal to no loss in performance; this is called weight tying [35].

• A form of L2 regularization is applied on the hidden state output of the final layer of the network by adding an extra term $\alpha \lVert m \odot h_t \rVert_2$ to the loss function, where $\alpha$ is a scaling constant, $m$ is the dropout mask, $h_t$ is the output at time $t$ and $\odot$ is element-wise multiplication. This gives a larger penalty to activations with larger values and thus encourages the network's activations to stay small [24].

• Another L2 regularization term, which penalizes rapid changes between time steps, is applied by adding the extra term $\beta \lVert h_t - h_{t+1} \rVert_2$ to the loss function, where $h_t$ is the hidden state output of the final layer [24]. A sketch of these last two penalty terms is given below.
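A small PyTorch-style sketch of the last two penalty terms follows; the tensor shapes, the use of mean squared values and the values of alpha and beta are assumptions made for illustration only:

```python
import torch

def activation_penalties(dropped_h, h, alpha, beta):
    """Extra loss terms on the final-layer hidden states.

    dropped_h: hidden states after dropout (m (.) h_t), shape (seq, batch, dim)
    h:         hidden states before dropout,            shape (seq, batch, dim)
    """
    # alpha * ||m (.) h_t||_2 : keeps activations small.
    ar = alpha * dropped_h.pow(2).mean()
    # beta * ||h_t - h_{t+1}||_2 : penalizes rapid changes between time steps.
    tar = beta * (h[1:] - h[:-1]).pow(2).mean()
    return ar + tar

seq, batch, dim = 10, 4, 16                         # toy sizes
h = torch.randn(seq, batch, dim)
mask = (torch.rand_like(h) > 0.5).float()           # a dropout mask m
extra_loss = activation_penalties(mask * h, h, alpha=2.0, beta=1.0)
```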

2.2.8 Quasi-Recurrent Neural Networks

One of the problems with LSTMs and RNNs is that the computations at each time step depend on the previous time step's output. This severely limits parallelism and leads to very long computation times for longer sequences of data. Quasi-Recurrent Neural Networks (QRNNs) [3] were introduced to address part of this issue by alternating convolutional and pooling layers.

Given a sequence of $T$ $n$-dimensional inputs $x_1, x_2, \ldots, x_T$, the convolutions are performed in the time dimension using $m$ filters, producing a sequence of $T$ $m$-dimensional vectors $z_1, z_2, \ldots, z_T$. In tasks such as next-token prediction the filters are constructed so that no future tokens can influence previous ones, a concept called masked convolution [17]. Following the structure of an LSTM, the forget gate and output gate are computed, and the following equations are used:


$$Z = \tanh(W_z X), \quad F = \sigma(W_f X), \quad O = \sigma(W_o X)$$

Although these equations look similar to the LSTM equations in (2.4), the important thing to note is the absence of sequential dependence; they can therefore be parallelized across both the batch and spatial dimensions.

The second part of a QRNN layer is the pooling. One way to do this that includes an output gate (called fo-pooling [3]) is to let the forget gate control the ratio between memory and input:

$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t$$
$$h_t = o_t \odot c_t$$

The recurrent part of these functions has to be calculated for each time step, but it can be parallelized along the feature dimension. Using these convolution and pooling methods gives a significant speed-up over regular RNNs and LSTMs, up to 4× faster per epoch, with fewer epochs required for convergence [23][3].
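Assuming the gate sequences Z, F and O have already been produced by the (parallelizable) masked convolutions, the sequential fo-pooling part can be sketched as follows:

```python
import numpy as np

def fo_pooling(Z, F, O, c0=None):
    """fo-pooling over a sequence.

    Z, F, O: arrays of shape (T, m) from the convolutional part.
    Returns the hidden states h_1, ..., h_T with shape (T, m).
    """
    T, m = Z.shape
    c = np.zeros(m) if c0 is None else c0
    H = np.empty((T, m))
    for t in range(T):                        # only this loop is sequential
        c = F[t] * c + (1.0 - F[t]) * Z[t]    # c_t = f_t*c_{t-1} + (1-f_t)*z_t
        H[t] = O[t] * c                       # h_t = o_t * c_t
    return H

T, m = 6, 8                                   # toy sequence length and filters
rng = np.random.default_rng(0)
Z = np.tanh(rng.normal(size=(T, m)))
F = 1 / (1 + np.exp(-rng.normal(size=(T, m))))    # sigmoid-activated gates
O = 1 / (1 + np.exp(-rng.normal(size=(T, m))))
H = fo_pooling(Z, F, O)
```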

2.3 Natural Language Processing

Natural language processing (NLP) is the study of human language in the context of computer science. The main goal is to enable computers to analyse large amounts of language data in different ways. Common tasks in this area are machine translation (translating text between languages), question answering, named entity recognition and sentiment analysis [49][8][4]. This section describes a couple of common concepts within NLP.

Tokenization

In order to process language, it is first segmented into smaller pieces such as words, phrases, symbols or other meaningful sub-elements called tokens. The resulting list of tokens is then used by further analysis applications.

"Tokenization example text" → ["Tokenization", "example", "text"]

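For example, a blank Swedish spaCy pipeline provides such a tokenizer (a minimal illustration; the thesis uses spaCy through the fastai library together with additional preprocessing):

```python
import spacy

nlp = spacy.blank("sv")                  # blank Swedish pipeline: tokenizer only
doc = nlp("Tokenization example text")
print([token.text for token in doc])     # ['Tokenization', 'example', 'text']
```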

2.3.1 Word Embeddings

In order for computers to make sense of natural language in text format, a mapping into vectors of real numbers is commonly used, in other words a vector representation of each token. There are many different ways to represent words as vectors, but a sought-after property is that words which are semantically related in a language sense should also be close to each other in the word-vector space. There is much research into making these embeddings more interpretable for humans [20][41]. This property works well in conjunction with machine learning, where a typical problem is to learn how to classify or cluster items viewed as sets of feature vectors.

2.3.2 Transfer Learning

Pre-trained word embeddings such as Skip-gram [27], word2vec [26] and GloVe [33] have been used within NLP for a while. These word embeddings are pre-trained on large amounts of unlabelled data and are then used to initialize the first layer of a neural network, while the rest of the layers are trained on task-specific data. Although these word embeddings have generally given rise to a boost in performance [18][36], they only provide prior knowledge in the first layer of the network; all other layers need to be trained from scratch. A single layer can only encode so much information, so in order to get a deeper pre-trained representation containing e.g. characteristics of words (syntax and semantics) and their different uses in linguistic contexts, deeper pre-trained network structures are needed.

Recent advances within NLP have at their core multilayer pre-trained models such as ULMFiT [16], BERT [8] and ELMo [34]. Similar to how important transfer learning was for computer vision about a decade ago [39], these new language models are outperforming older ones on most NLP tasks. This can be seen in current NLP benchmarks such as GLUE, the General Language Understanding Evaluation benchmark, a collection of tools for evaluating and analysing the performance of models across a diverse set of NLU tasks [48] (a leaderboard can be found at https://gluebenchmark.com/leaderboard). In general, a language model is typically much more shallow than its computer vision counterpart and thus requires different fine-tuning methods.

One of the big benefits of pre-training language models is that training data is very easy to come by. The previously mentioned ULMFiT and BERT models are both pre-trained on, among other corpora, the English Wikipedia, which is freely available for anyone to use. The pre-trained models are then carried over to a domain in which data might be scarce, and it is then of great benefit to already have semantic structure encoded within the initial model. In the same way as a model can transfer context from one domain to another, it can also transfer context from one language to another. This can be beneficial in situations where even unlabelled data is hard to come by, as a multilingual language model can be trained on related languages [46].


3 Method

The goal of this project is to investigate the impact of pre-training on a large general-domain corpus by transferring models pre-trained on Wikipedia to a domain consisting of tweets and performing a sentiment classification task. This task consists of classifying tweets as having a 'Positive', 'Negative' or 'Neutral' sentiment, and is thus a 3-class classification problem. Since the target language of this thesis is Swedish, there are not many resources available for model comparisons. But as mentioned in the introduction, the goal of this project is not to set a new state of the art but to get a deeper understanding of how beneficial it is to use a pre-trained language model in cases where data might be scarce.

This chapter contains information about the datasets and the method which will be used during the experiments.

3.1 Datasets

The two datasets used in this project were the Swedish Wikipedia and tweets from Twitter; they are described in more detail in the two following sections.

3.1.1 Wikipedia

The main idea of transfer learning is to first train a model on a large general domain, and Wikipedia provides just that, with easy access to the latest dumps (https://dumps.wikimedia.org/svwiki/latest/) in all languages available on Wikipedia. The Wikipedia database dump was extracted using the tool Wikiextractor (https://github.com/attardi/wikiextractor).


Despite Swedish having relatively few speakers, the Swedish Wikipedia is ranked third in number of articles (https://meta.wikimedia.org/wiki/List_of_Wikipedias, as of February 25th 2019). However, out of the 3.75 million articles, three million were created by the bot Lsjbot (https://en.wikipedia.org/wiki/Lsjbot), which primarily creates articles about living organisms and geographical entities. It is also clear from the total number of edits and active users (https://stats.wikimedia.org/EN/BotActivityMatrixCreates.htm) that the Swedish Wikipedia is dominated by bots. This poses a problem, since the dataset used for pre-training should be general; if most articles are about specific species of animals or rivers, the model could become too specialized in those areas. Since the articles created by Lsjbot tend to be short, a minimum of one thousand characters was set for an article to be included in the final dataset. This reduced the number of available articles to about 358 thousand.

The thesis objective is to compare models trained on varying amounts of data, so nine overlapping training datasets ranging from 10 to 90 million words were created using the fastai library (https://docs.fast.ai/), which utilizes the spaCy tokenizer (https://spacy.io/). A validation dataset consisting of 10 million tokens was created in order to evaluate the models. If the one-thousand-character limit had not been used when creating these datasets, the words "Catalogue", "of" and "life" would have been among the twenty most commonly used words; Catalogue of Life is a source commonly used by Lsjbot.

As can be seen in Table 3.1, the mean length of an article in the dataset was about 300 words, with each word consisting of an average of 5.6 characters. This is rather short compared to the mean article length of WikiText-103, which is around 3600 words [25].

        Mean word len   Mean article len   # of tokens    # of articles
Train   5.64            307.34             100,000,132    325,377
Valid   5.63            301.01             9,693,092      32,202

Table 3.1: Statistics of the Swedish Wikipedia dataset.



3.1.2 Twitter

The Twitter dataset consisted of two parts: unlabelled and labelled tweets. These were used for the language model fine-tuning and for training the classifier, respectively.

Figure 3.1: Left: The bounding box in which unlabelled tweets were collected. Right: Labelled tweets (Stockholm).

Unlabelled tweets were collected during the period December 2018 - January 2019 using the Twitter Streaming API (https://developer.twitter.com/en/docs) together with the Tweepy library (http://www.tweepy.org/). A location bounding box was selected to encompass Sweden, and only tweets labelled by Twitter as being in Swedish were collected. In this way, a total of 57,000 unlabelled tweets were gathered. The bounding box as well as the tweet distribution can be seen in Figure 3.1, which shows a clear majority of tweets coming from large cities.

The labelled tweets were annotated by a single annotator in a separate study as having a 'Positive', 'Neutral' or 'Negative' sentiment, with a self-agreement F1-score (Eq. 3.4) of 0.762 [29]. These tweets were gathered from Stockholm during the period September - October 2014, and those with geolocation turned on can be seen in Figure 3.1. The study labelled 58,547 tweets, but only 38,709 remained available for use due to tweets and users being deleted or made private.


As can be seen in Table 3.2, the dataset was fairly balanced, with no class being severely under-represented. A clear difference in the number of emoticons, user mentions, urls and hashtags between the two parts can be seen, which could possibly be explained by local differences between Stockholm and other regions of Sweden. In the study that gathered the tweets, a support vector machine was trained on the data in order to classify the three sentiment classes, achieving an F1-score of 0.657 [29].

                Unlabelled   Labelled
Total tweets    57,000       38,709
Emoticons       7,228        1,400
User Mentions   53,962       21,821
Urls            14,821       14,345
Hashtags        16,138       13,788
Positive        -            10,378
Neutral         -            16,017
Negative        -            12,314

Table 3.2: Statistics of the preprocessed tweets.

3.2 Preprocessing

spaCy was used to tokenize both datasets, and some additional preprocessing was applied with the help of the fastai library and custom regular expressions:

• Remove all html-code.

• Replace unwanted characters with sensible ones (e.g. &nbsp; with a space).

• Replace repeated characters (e.g. "\n\n\n" with "\n").

• Add special tokens for uppercase words, the beginning of a sentence, all-caps words and repeated words.

Since the Twitter dataset has some unique tokens, some additional preprocessing was applied (a sketch of these replacements follows the list below):

• Replace user mentions, urls and hashtags with unique special tokens (e.g. @username with x_usermention_x).


• Replace the most common emoticons with special tokens _emo_pos and _emo_neg.
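A rough sketch of the Twitter-specific replacements could look as follows; the exact regular expressions and the emoticon list used in the thesis are not reproduced here, so the patterns below are only illustrative:

```python
import re

# Hypothetical regex rules mirroring the Twitter-specific preprocessing.
RULES = [
    (re.compile(r"@\w+"), "x_usermention_x"),       # user mentions
    (re.compile(r"https?://\S+"), "x_url_x"),       # urls
    (re.compile(r"#\w+"), "x_hashtag_x"),           # hashtags
    (re.compile(r"[:;]-?\)|:D"), "_emo_pos"),       # a few positive emoticons
    (re.compile(r"[:;]-?\("), "_emo_neg"),          # a few negative emoticons
]

def preprocess_tweet(text: str) -> str:
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

print(preprocess_tweet("Kul dag! :) tack @username http://example.com #fredag"))
# Kul dag! _emo_pos tack x_usermention_x x_url_x x_hashtag_x
```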

Table 3.2 shows some statistics about the preprocessed tweets. Figure 3.2 shows an example text from Wikipedia before and after preprocessing:

Mesa La Higuera är ett platåberg i Mexiko.

Det ligger i delstaten Baja California Sur.

xxbos xxmaj mesa xxmaj la xxmaj higuera är ett platåberg i xxmaj mexiko . xxmaj det ligger i delstaten xxmaj baja xxmaj california xxmaj sur .

Figure 3.2: Wikipedia text before and after pre-processing. The token xxbos refers to the beginning of a section and xxmaj refers to the following word beginning with a capital letter.

3.3 ULMFiT

Universal Language Model Fine-tuning (ULMFiT) is a transfer learning method that has been shown to be effective on multiple NLP tasks [16]. ULMFiT uses a 3-layer bidirectional LSTM architecture (see Figure 3.3) in conjunction with some novel techniques. The method consists of three stages, which are discussed in more detail below.

Figure 3.3: ULMFiT's architecture and three stages. a) General domain pre-training. b) Target task LM fine-tuning. c) Target task classifier fine-tuning. Figure taken from [16].


3.3.1 General Domain Language Model Pre-Training

As discussed in section 2.3.2, pre-training on a large corpus can capture general properties of a language that can then be transferred over to the target domain for further fine-tuning. This is done by performing next token prediction on a general domain corpus, which often is a time-consuming task since the datasets are large, and requires access to good GPUs or TPUs. By making the model learn how to predict the next token given a previous sequence of tokens, it learns how to put words in context and thus creates a basic understanding of the domain language. This property is then transferred over to another target domain in order to help the model adapt quicker using its already attained knowledge.

3.3.2 Target task Language Model Fine-Tuning

Since the dataset used for pre-training most likely does not belong to the same distribution as the target task dataset, an intermediate step is performed before training the classifier, which helps the language model adapt to the local language properties. For tokens that appear in both domains, the embeddings found during the pre-training step are simply carried over. Tokens that are new to the target task dataset, and have thus not been seen during pre-training, are initialized as the mean of all the pre-trained embeddings. The same training procedure using next token prediction is then continued until the model has adapted to its new domain. ULMFiT uses two specific methods during this step:

Discriminative Fine-Tuning

The idea here is that since different layers in a model capture different types of information, they should be tuned differently and thus require different learning rates. The learned weights of deeper layers are more specific to the dataset they were trained on, while the initial layers capture more general properties of the data [50]. ULMFiT uses discriminative fine-tuning, where the parameters $\theta$ are split per layer $\{\theta^1, \ldots, \theta^L\}$ and the learning rate $\eta$ used in the ASGD update is unique for each layer $\{\eta^1, \ldots, \eta^L\}$. The learning rate for the last layer is user-specified and the rest are chosen as $\eta^{l-1} = \eta^l / 2.6$ in the case of ULMFiT [16].
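A small sketch of how such per-layer learning rates could be constructed (the base rate, the number of layers and the ordering are assumptions for illustration):

```python
def discriminative_learning_rates(base_lr, num_layers, factor=2.6):
    """Learning rate per layer, built from the last layer backwards:
    eta_{l-1} = eta_l / factor."""
    rates = [base_lr / factor ** i for i in range(num_layers)]
    return list(reversed(rates))   # ordered from first layer to last layer

print(discriminative_learning_rates(1e-2, num_layers=4))
# [~0.00057, ~0.0015, ~0.0038, 0.01]: later layers get larger learning rates
```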


Slanted Triangular Learning-Rates

Since the target task dataset has a different distribution than the one used for pre-training, the model should intuitively first converge to a region of the parameter space that is suitable for the target task, and then start fine-tuning the parameters. As an extension of regular triangular learning rates [43], ULMFiT uses a learning rate schedule with a short initial increase followed by a longer period of decay, see Figure 3.4.


Figure 3.4: The slanted triangular learning-rate used in ULMFiT.
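A sketch of such a schedule, implemented here as a generic linear warm-up followed by a linear decay (the exact parameterization used in ULMFiT differs slightly and is not reproduced), is given below:

```python
def slanted_triangular_lr(t, total_steps, lr_max,
                          warmup_frac=0.1, lr_min_ratio=32):
    """Learning rate at step t: short linear increase, long linear decay."""
    cut = int(total_steps * warmup_frac)
    lr_min = lr_max / lr_min_ratio
    if t < cut:
        p = t / max(cut, 1)                               # rising phase
    else:
        p = 1.0 - (t - cut) / max(total_steps - cut, 1)   # decaying phase
    return lr_min + p * (lr_max - lr_min)

schedule = [slanted_triangular_lr(t, 1500, lr_max=0.01) for t in range(1500)]
```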

Cyclical learning rate and momentum

Allowing the learning rate and momentum to rise and fall has been shown to be beneficial for the network's performance while also eliminating the need to search for these hyperparameters, as long as reasonable minimum and maximum boundary values have been selected. Instead of using a linear increase and decay, it has been shown that a cyclical schedule can give rise to even better results [44]. Figure 3.5 shows how the learning rate and momentum vary between the specified boundary values during training.

Figure 3.5: Left: Learning rate variation during training. Right: Momentum variation during training.

Learning rate range test

In order to find a suitable learning rate for the model, a learning rate range test [44] can be performed. By validating a model for some iterations while linearly increasing the learning rate from a minimum to a maximum value, a region within which to initialize the learning rate can be found.

This was performed for every model before training started, in order to find reasonable learning rates. Figure 3.6 shows a graph of the learning rate range test for a language model before training the classifier, where the maximum learning rate would be chosen as 1e-02.


Figure 3.6: Learning rate range test for a model.


3.3.3 Target Task Classifier Fine-Tuning

Training the classifier is the final stage of ULMFiT. In this stage, the only parameters learned from scratch are two fully connected layers that use ReLU and softmax activations respectively to output a probability distribution over the target classes. Having been both pre-trained on a general domain and fine-tuned on the target task domain should help the model converge faster and boost performance. Furthermore, since a basic understanding of the target language is already encoded, the model should be able to generalize better using fewer samples [16]. The bidirectional language models are fine-tuned independently in each direction and the class predictions are averaged. Three notable methods are used:

Concat pooling

If only the last hidden state were taken into account when making a prediction, information could be lost, since the relevant part could be anywhere, and at multiple places, in a sentence. The last hidden state is therefore concatenated with max- and mean-pooled representations of as many previous hidden states as fit in the GPU memory.
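A PyTorch sketch of concat pooling over the final-layer hidden states (toy sizes assumed) could look like this:

```python
import torch

def concat_pooling(hidden_states):
    """hidden_states: (seq_len, batch, dim) outputs of the final RNN layer.
    Returns (batch, 3 * dim): last state, max-pool and mean-pool over time."""
    last = hidden_states[-1]
    max_pool = hidden_states.max(dim=0).values
    mean_pool = hidden_states.mean(dim=0)
    return torch.cat([last, max_pool, mean_pool], dim=1)

h = torch.randn(70, 32, 400)          # toy sizes: 70 steps, batch 32, dim 400
features = concat_pooling(h)          # shape: (32, 1200)
```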

Gradual unfreezing

With the same reasoning as for the slanted learning rate, gradual unfreezing is applied in order to keep the pre-trained features intact. It is a method similar to sequential fine-tuning [9], in which one layer is trained at a time while the other layers are kept frozen. Gradual unfreezing instead gradually unfreezes one layer at a time, starting from the last layer, and fine-tunes all currently unfrozen layers.
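A schematic sketch of gradual unfreezing is shown below; layers is assumed to be an ordered list of torch.nn modules and train_one_epoch is a placeholder training routine:

```python
import torch.nn as nn

def gradual_unfreezing(layers, train_one_epoch):
    """layers: modules ordered from first to last; unfreeze one more layer
    (starting from the last) before each training epoch."""
    for layer in layers:                      # start with everything frozen
        for p in layer.parameters():
            p.requires_grad = False
    for k in range(1, len(layers) + 1):
        for layer in layers[-k:]:             # the last k layers are trainable
            for p in layer.parameters():
                p.requires_grad = True
        train_one_epoch()                     # placeholder training routine

layers = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 3)]
gradual_unfreezing(layers, train_one_epoch=lambda: None)
```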

BPTT for Text Classification

In order for the fine-tuning to fit in the GPU memory, the training sequence is split into smaller batches, whose lengths are determined in the same way as for the AWD-LSTM in section 2.2.7. The model is initialized with the final hidden state values of the previous batch; this way it does not "forget" previous information.


3.4 Evaluation metrics

In order to measure how well a machine learning model performs there are several metrics that can be looked at.

3.4.1 Accuracy

The number of correctly classified labels divided by the total number of labels is called accuracy:

$$\mathrm{Accuracy} = \frac{\#\,\text{correctly predicted labels}}{\text{total number of labels}} \qquad (3.1)$$

This metric can be misleading when there is class imbalance in the data. Say, for example, that 90% of the available data belongs to one class while the rest is split over 9 other classes; if every sample is predicted to belong to the majority class, an accuracy of 90% is achieved even though all other classes are labelled wrongly.

3.4.2 Precision

Precision measures the proportion of correctly predicted samples of a class among all samples predicted as that class. It is, in a sense, a measure of the confidence of the predictions.

$$\mathrm{Precision} = \frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Positives}} \qquad (3.2)$$

3.4.3 Recall

Recall is also referred to as the true positive rate or sensitivity, because it measures the proportion of correctly predicted samples of a class among all samples that actually belong to that class.

$$\mathrm{Recall} = \frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Negatives}} \qquad (3.3)$$

3.4.4 F1-score

The F1-score is a metric that incorporates both recall and precision in order to give a useful evaluation on imbalanced datasets. The F1-score can be calculated for each class by viewing it as a binary classification problem and calculating the harmonic mean between precision and recall:

$$F_1\text{-score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (3.4)$$

In the case of multiple classes, the average of the individual per-class F1-scores can be used as the model's average F1-score, also called the macro F1-score.
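A small NumPy sketch computing the per-class F1-scores (eqs. 3.2-3.4) and their macro average for the three sentiment classes could look like this:

```python
import numpy as np

def macro_f1(y_true, y_pred, classes=("Positive", "Neutral", "Negative")):
    """Per-class precision, recall and F1 (eqs. 3.2-3.4), averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        f1_scores.append(f1)
    return float(np.mean(f1_scores))

print(macro_f1(["Positive", "Neutral", "Negative", "Neutral"],
               ["Positive", "Negative", "Negative", "Neutral"]))  # ~0.78
```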


4 Experimental Settings

The method used in this project is based on word-level recurrent neural networks following the ULMFiT method described in section 3.3.

4.1 Model and Implementation

Nine three-layer recurrent neural networks were initially trained on the different Wikipedia datasets following the ULMFiT method, except that the models were uni-directional and used QRNNs instead of AWD-LSTMs. QRNNs were chosen because of the long training times associated with LSTMs combined with the limited time scope of this project, which is also the reason a uni-directional model was employed.

The vocabulary of the language models was set to 30,000 tokens, the embedding size was set to 400, the number of hidden nodes in each layer was set to 1150, and a filter of width 2 was used. During training of the classifier the decoder layer was dropped, and two linear layers with 50 and 3 hidden nodes, using ReLU and softmax activation functions respectively, were added. In all steps Adam was used as the optimizer together with categorical cross entropy as the loss function. These parameters were chosen to be in line with what the authors of ULMFiT used in their experiments [16].

All models were trained on a virtual machine on Google Cloud using an NVIDIA Tesla K80 GPU.

4.1.1 Pre-training on general domain corpus

The pre-training step is when the models are trained on a large general-domain corpus by performing next token prediction; in this case the Swedish Wikipedia was used in order to have the models learn general domain characteristics and the underlying structure of the target language. This was done by having the models use a sequence of tokens in order to predict the next token on a training set, and was continued for a maximum of ten epochs or until performance on the validation set stopped improving. The ten-epoch limit was set due to the long training times, up to three hours per epoch.

Returning to the research question, namely how the amount of data available for the pre-training step correlates with model performance, nine models were trained on different-sized partitions of the Wikipedia dataset ranging from 10 to 90 million tokens. As mentioned, due to the long training times it was not possible to train more than one model for each dataset size in this project. Hyperparameters were set to be as equal as possible without overfitting, the exception being dropout, which was changed for the models trained on the smaller partitions to counter their stronger tendency to overfit.

The base values for the dropout on the input layer, embedding layer, between QRNNs, internally in the QRNNs, and on the output layer were (0.2, 0.02, 0.1, 0.15, 0.25) respectively, as recommended by Howard et al. [16]. A dropout multiplier was applied to all dropouts in order to scale them for the different models. After a series of tests on smaller subsets, the multiplier was set to 0 for models using 60 million or more tokens, and to 0.1 for models using 50 million or fewer tokens. All models used a base sequence length of 70 and a batch size of 128; the learning rate was varied from 2e-4 to 2e-3 and the momentum from 0.85 to 0.95.

4.1.2 Fine-tuning on target domain corpus

In order to make the language models adapt to the target domain and learn its more local properties, they were fine-tuned for a number of epochs. This was done in the same way as the pre-training step: using a sequence of tokens to predict the next token on a training set, while monitoring the performance on a validation set and employing early stopping if required.

Each model from the pre-training step, as well as one model not pre-trained at all, was fine-tuned on two Twitter datasets: one containing all available tweets and one containing half of the available tweets. In total, 20 models were trained in this step.

After a series of hyperparameter searches, the models trained on the full Twitter dataset used a dropout multiplier of 0.4, while the models trained on half of the tweets used a multiplier of 0.5. A batch size of 32 was used during training, and the learning rate was varied between 6e-4 and 3e-3 while the momentum varied between 0.85 and 0.95.

4.1.3 Fine-tuning Classifier on target domain

During this step the decoder was dropped and replaced with two fully connected linear layers with 50 and 3 hidden nodes, using ReLU and softmax respectively. Now, instead of performing next token prediction, the classifiers were trained to predict which of the three target classes "Positive", "Negative" and "Neutral" a tweet belonged to. From the dataset containing all labelled tweets, four subsets were created using 1%, 10%, 50% and 100% of the available tweets. This was done in order to evaluate how the models behave when exposed to different amounts of labelled data.

All 20 models from the in-domain fine-tuning step, as well as the models from the pre-training step and a model randomly initialized from scratch, were trained on these four datasets. In total, 120 different models were trained.

Some hyperparameters were set differently depending on the amount of available data, as determined by a grid search on a validation set. The hyperparameters used can be seen in Table 4.1; the momentum was varied between 0.85 and 0.99 for all models.

                     dp mult.   min / max learning rate
1% of labelled       0.6        0.00003 / 0.0003
10% of labelled      0.6        0.005 / 0.0005
50% of labelled      0.5        0.001 / 0.01
100% of labelled     0.3        0.001 / 0.01

Table 4.1: Hyperparameters used for the different datasets while fine-tuning the classifier.

4.2 Evaluation

In order to evaluate the different models’ performance ten train, validation and test sets were created using the available labelled tweets. The models were then trained on these different splits and the test recall, precision and F1-scores were recorded.


5 Results

This chapter shows the results obtained from various experiments mentioned in Chapter 4.

5.1 Pre-training on Wikipedia

The lowest validation losses for each of the nine models performing next token prediction on the different-sized datasets, ranging from 10 to 90 million Wikipedia tokens, can be seen in Figure 5.1. As mentioned in Chapter 4, each model was only trained once due to the very long training times.

Figure 5.1: Lowest validation losses for nine models trained using different sized Wikipedia datasets ranging from 10 to 90 million tokens.


A qualitative example showing what the models have learned can be seen in Figure 5.2. This was generated by seeding a model with the word Wikipedia and letting it output the most probable next sequence of words.

Wikipedia effektiv herrgård proffs kasta håkansson bergshamra rytt gravid isolerade frigörs forsar

Wikipedia är en mindre Wikipedia. Det är genom perioden som Youtube började som Wikipedia

Figure 5.2: Comparison between an untrained and trained language model.

Top: before training. Bottom: after training.

5.2 Fine-tuning on Twitter

Each of the nine models pre-trained on different sized Wikipedia datasets, as well as a randomly initialized model, was moved to the Twitter domain and trained using next token prediction. The final validation losses when utilizing all available tweets can be seen in Figure 5.3.

Figure 5.3: Final validation losses for ten models on the full Twitter dataset, pre-trained on different sized Wikipedia datasets ranging from zero to 90 million tokens.


A qualitative example showing what the models have learned can be seen in Figure 5.4. This was generated by seeding a model with the words Twitter är and letting it output the most probable next words.

Twitter är också en av de tjugo som de flesta i

Sverige fortfarande kräver.

Twitter är ett skämt och jag vill skilja mig ifrån min vän x_usermention_x.

Figure 5.4: Comparison between models before and after fine-tuning on the target domain. Top: before tuning. Bottom: after tuning.

5.3 Training classifier on labelled tweets

Figure 5.5 shows how the F1-score on the sentiment classification task varies when different amounts of data are used during pre-training and during training of the classifier, averaged over ten runs. The F1-scores and standard deviations for the models utilizing 100% of the Twitter data for training the classifier can be seen in Table 5.2.

Figure 5.5: Mean F1-scores and one standard deviation for models fine-tuned on all of the available Twitter data but using different amounts of training examples (1%, 10%, 50% and 100% of the Twitter data) for the classifier. Averaged over ten runs.


Figure 5.6 and Table 5.1 show how pre-training on Wikipedia and fine-tuning on Twitter before training the classifier affect the performance of the models when varying the amount of data available during training of the classifier.

Figure 5.6: F1-scores for models either fully pre-trained and fine-tuned, only pre-trained, only fine-tuned, or trained from scratch, all using different amounts of classifier training data ranging from 1% to 100%.

Table 5.3 shows the resulting F1-scores for a model utilizing all available Wikipedia and Twitter data for pre-training and training of the sentiment classifier, but different subsets of the Twitter data for fine-tuning.

A couple of example tweets and their corresponding predictions and labels can be seen in Table 5.4.


Classifier data   Pre-train tokens   Mean F1-score   F1 std.
100%              90M                0.660           0.0170
100%              0M                 0.649           0.0143
50%               90M                0.642           0.0140
50%               0M                 0.641           0.0131
10%               90M                0.567           0.0190
10%               0M                 0.592           0.0139
1%                90M                0.512           0.0423
1%                0M                 0.510           0.0365

Table 5.1: Performance using different amounts of data during pre-training on Wikipedia and sentiment classifier training. All available data was used during the fine-tuning on Twitter step. The F1-score and standard deviation were calculated over ten runs.

Pre-train data   F1-score   F1 std.
0M               0.649      0.0143
10M              0.657      0.0166
20M              0.654      0.0131
30M              0.653      0.0164
40M              0.654      0.0155
50M              0.653      0.0134
60M              0.656      0.0180
70M              0.657      0.0158
80M              0.667      0.00934
90M              0.660      0.0170

Table 5.2: Performance using all available data for training the sentiment classifier and fine-tuning on Twitter, while varying the amount of data used during pre-training on Wikipedia. The F1-score and standard deviation were calculated over ten runs.


Fine-tuning data   F1-score   F1 std.
0%                 0.611      0.0189
50%                0.656      0.0146
100%               0.660      0.0170

Table 5.3: Performance using all available data for training the sentiment classifier and pre-training on Wikipedia, while varying the amount of data used during fine-tuning on Twitter. The F1-score and standard deviation were calculated over ten runs.

Tweet 1: Nu vet jag vad det innebär med tentaplugg. Jag får inte ens spela tv-spel när tjejen sitter med böckerna. Skolan är inte för alla.
Prediction: Negative. Label: Negative.

Tweet 2: Nu har vi ändrat priset igen: E85: 9,44 B95: 13,89 Diesel: 13,73 x_hashtag_x
Prediction: Neutral. Label: Neutral.

Tweet 3: Men; kan ju åtminstone bada här, vilket är guld värt. Och jäklar vad skönt det är att inte jobba. Och attans vad underbart att ha läsro!
Prediction: Positive. Label: Positive.

Tweet 4: Åter igen till mitt kära gym. Har haft ett break på ett år pga skada. Känner mig som ett barn på julafton. Nu kör. x_url_x
Prediction: Negative. Label: Positive.

Tweet 5: Det negativa är jag måste snåla o räkna ören. Det underbara är jag har lång sovmorgon en onsdag. Så länge det går runt är jag glad vinnare.
Prediction: Positive. Label: Negative.

Tweet 6: Nån skrev "störst av allt är kärleken". Alltså har du ens sett blåvalar? Dom är skitstora, du kanske borde läsa på lite, idiot.
Prediction: Negative. Label: Neutral.

Tweet 7: Första september. Ny månad. Ny årstid. Ny vecka. Ny chans till stordåd. Godmorgon därute, gör skillnad för våra barn!
Prediction: Positive. Label: Neutral.

Table 5.4: Examples of tweets with their corresponding predictions and labels.


6 Discussion

This chapter contains discussion and analysis of the results obtained in Chap- ter 5, as well as conclusions attempting to answer the research question stated in section 1.2 and thoughts about possible future work.

6.1 Pre-training

The results in Figure 5.1 show that more available data yields better generalization for the model, which is to be expected. Looking at the figure, there seems to be some further gain to be had if even more data were used, but the decrease in validation loss does not scale linearly with the time spent training these models. The model using 90 million tokens took slightly over 30 hours to train; a model using double the amount of tokens could be expected to take twice as long but would probably not achieve a much lower validation loss.

Figure 5.2 shows the model’s ability to form sentences before and after pre-training on Wikipedia. As can be seen, before training the model was outputting unrelated words, unable to form coherent sentences. After training, the model had adapted to the domain and learned how to use capital letters, punctuation and basic grammar. This result is in line with what was expected and shows that the model has capability to learn the structure of a language.

Furthermore, the figure shows the model’s ability to recognize the context it is in, relating Wikipedia to another popular website, YouTube. This behaviour could be seen in many other instances when the model was asked to form sentences regarding history, movies, sports e.t.c.


6.2 Fine-tuning

Figure 5.3 shows the effect pre-training on Wikipedia had on the models' ability to perform next token prediction on the target domain. Looking at the figure, the difference between a model not pre-trained at all and a fully pre-trained model is notable. Although the gain in performance is quite small, effectively a decrease in validation loss of 0.158, the results are mostly consistent in that more data used during pre-training yields better generalization.

An example that shows the model's ability to adapt to its environment is shown in Figure 5.4. A model that has not yet seen Twitter data but has been pre-trained on Wikipedia is able to form coherent sentences using the more formal and non-emotional language found in Wikipedia articles. After the model has been fine-tuned on tweets, by comparison, it clearly forms sentences containing more emotionally weighted words, together with Twitter-specific tokens such as user mentions and hashtags.

6.3 Sentiment Classification

A summary of the classification performance when using all available Twitter data during fine-tuning and classifier training can be seen in Table 5.2. The best performing model was trained using 80 million tokens from Wikipedia and achieved an F1-score of 0.667 with a standard deviation of 0.00934.

As mentioned in section 3.1.2, the study that gathered the tweets was able to achieve an F1-score of 0.657 using a two-plane SVM [29]. However, as also touched upon in the same section, many of the originally labelled tweets were no longer available due to users making their profiles private or deleting tweets. The Twitter dataset used in this thesis is thus approximately two thirds the size of the original dataset. As can be seen in Figures 5.5 and 5.6, the amount of data available during training of the classifier heavily affects the performance of the models, so one could expect the resulting F1-score to be slightly higher if the original complete dataset were used. Nonetheless, the current models generally perform about equal to or slightly better than the two-plane SVM model used in the cited study.

Another thing to note is that, as mentioned in section 3.1.2, the self-agreement F1-score of the annotator was 0.762. This means that many of the tweets allow multiple interpretations of what their sentiment actually is, which makes the task harder. Looking at the tweets in Table 5.4 and their corresponding labels and predictions, the sentiment of the tweets in
