Evaluation of Approaches for Representation and Sentiment of Customer Reviews

(1)

STOCKHOLM, SWEDEN 2021

Evaluation of Approaches for

Representation and Sentiment of

Customer Reviews

Stavros Giorgis

KTH ROYAL INSTITUTE OF TECHNOLOGY

(2)

Supervisor: Jussi Karlgren Examiner: Viggo Kann

School of Electrical Engineering and Computer Science Host company: Gavagai AB

Swedish title: Utvärdering av tillvägagångssätt för representation och uppfattning om kundrecensioner

(3)

Classification of sentiment on customer reviews is a real-world application for many companies that offer text analytics and opinion extraction on customer reviews on different domains such as consumer electronics, hotels, restaurants, and car rental agencies. Natural Language Processing’s latest progress has seen the development of many new state-of-the-art approaches for representing the meaning of sentences, phrases, and words in the text using vector space models, so-called embeddings.

In this thesis, we evaluated the most current and most popular text representation techniques against traditional methods as a baseline. The evaluation dataset consists of customer reviews from different domains with different lengths used by a text analysis company. Through a train dataset exploration, we evaluated which datasets were the most suitable for this specific task. Furthermore, we explored different techniques that could be used to alter a language model’s decisions without retraining it. Finally, all the methods were evaluated against their time performance and the resource requirements to present an overall experimental assessment that could potentially help the company decide which is the most appropriate technique to replace its system in a production environment.

Keywords: machine learning, nlp, text analytics, sentiment analysis, transformers, tfidf, bow, fasttext, word2vec, bert, xlnet, roberta

(4)

Klassificeringen av attityd och känsloläge i kundrecensioner är en tillämpning med praktiskt värde för flera företag i marknadsanalysbranschen. Aktuell forskning i språkteknologi har etablerat vektorrum som standardrepresentation för ord, fraser och yttranden, så kallade embeddings. Denna uppsats utvärderar den senaste tidens mest framgångsrika textrepresentationsmodeller jämfört med mer traditionella vektorrum. Utvärdering görs genom att jämföra automatiska analyser med mänskliga bedömningar för kundrecensioner av varierande längd från olika domäner tillhandahållna av ett textanalysföretag. Inom ramen för studien har olika testmängder jämförts och olika sätt att modifera en språkmodells klassficering utan om träning. Alla modeller har också jämförts med avseende på resurs- och tidsåtgång för träning för att hjälpa uppdragsgivaren fatta beslut om vilken teknik som utgör den mest ändamålsenliga utvecklingsvägen för dess driftsatta system.

Nyckelord: maskininlärning, nlp, textanalys, sentimentanalys, transformatorer, tfidf, bow, fasttext, word2vec, bert, xlnet, roberta

(5)

I would like to express my deep gratitude and appreciation to Professor Jussi Karlgren, the supervisor of this thesis, for all the insights, weekly discussions, and great feedback on this work. I would like to thank Gavagai for allowing me to write my thesis at their company. Moreover, I would like to acknowledge the help provided by Nils Balzar Ekenbäck, who also wrote his thesis with Gavagai, for the valuable weekly discussions on the background theory. Finally, I would like to thank my family members for the continuous support and love they have provided me throughout my time in Sweden and KTH.

(6)

1 Introduction

1

1.1 Thesis structure . . . 2

1.2 Research Questions . . . 3

1.3 Benefits, Ethics and Sustainability . . . 4

2 Theory

6 2.1 Bag of Words (BoW) . . . 6

2.2 Term Frequency-Inverse document frequency (tf-idf). . . 7

2.3 Word2Vec. . . 7

2.3.1 Continuous Bag of Words (CBoW) . . . 8

2.3.2 Skip-Gram model . . . 8

2.4 FastText . . . 8

2.5 Transformers . . . 8

2.5.1 Bidirectional Encoder Representations from Transformers (BERT) 9 2.5.2 DistilBERT . . . 12

2.5.3 RoBERTa . . . 12

2.5.4 DistilRoBERTa . . . 13

2.5.5 XLNet . . . 14

2.6 Sentence representations . . . 15

3 Method

16 3.1 Datasets . . . 16

3.1.1 Train Datasets . . . 16

3.1.2 Evaluation Datasets . . . 21

3.1.3 Pre-processing . . . 24

3.2 Language Models . . . 26

3.3 Network architectures . . . 26

(7)

3.3.1 Multi-Layer Perceptron (MLP) . . . 27

3.3.2 Bidirectional LSTM for Sentence Classification . . . 27

3.3.3 Convolutional Neural Network for Sentence Classification (Kim- CNN) . . . 28

3.4 Evaluated models . . . 29

3.5 Implementation . . . 31

3.6 Metrics . . . 31

3.7 Experimental pipeline . . . 32

3.7.1 KimCNN fine-tune investigation . . . 33

3.7.2 Proposed architectures evaluation . . . 33

3.7.3 Best proposed architectures evaluation . . . 33

3.7.4 Distilled architectures and resource evaluation . . . 33

3.7.5 Lexicon based enhanced architectures evaluation . . . 34

4 Results

35 4.1 KimCNN fine-tune investigation. . . 35

4.2 Proposed architectures evaluation . . . 36

4.2.1 oscar-ryan . . . 37

4.2.2 blitzer . . . 40

4.3 Best proposed architectures evaluation . . . 43

4.3.1 oscar-ryan . . . 43

4.3.2 blitzer . . . 44

4.4 Distilled architectures and resource evaluation . . . 44

4.4.1 oscar-ryan . . . 44

4.4.2 blitzer . . . 45

4.5 Lexicon enhanced architectures evaluation . . . 46

4.5.1 oscar-ryan . . . 46

4.5.2 blitzer . . . 46

5 Discussion

48 5.1 KimCNN fine-tune investigation. . . 48

5.2 Proposed architectures evaluation . . . 48

5.3 Best proposed architectures evaluation . . . 49

5.4 Distilled architectures and resource evaluation . . . 50

5.5 Lexicon enhanced architectures evaluation . . . 50

(8)

5.6 Research Questions . . . 51

6 Conclusion

54

6.1 Study Limitations . . . 55 6.2 Future Work . . . 55

References

56

(9)

Introduction

Natural language processing (NLP) lies at the intersection between linguistics, computer science, and artificial intelligence. It is focused on the analysis and interpretation of words, sentences, or documents to extract insights and classify the documents into different categories accurately. One of Natural Language Processing’s significant applications is Sentiment Analysis, where a model predicts a given sentence’s sentiment. The primary struggle of NLP methods has always been the feature engineering task. There has been a shift from traditional statistical techniques to infer features from text to machine learning approaches in the last decade.

Especially in recent years, with the rise of Natural Language Processing, many state- of-the-art learning context-aware text representations have been introduced, namely, Embeddings achieving outstanding results in many NLP benchmarks. However, there is a gap between NLP benchmarks and real-world datasets and tasks. In most cases, it is difficult, if not impossible, to implement and use such methods in production systems solving real-world problems both because of lack of training data and time performance issues. It is essential to accurately evaluate such methods’ overall performance for the specific task that each individual is interested in. Despite these difficulties, in recent years, more than ever, companies have widely started to use such techniques to solve Natural Language Processing tasks like Sentiment Analysis.

Gavagai is one example of such companies. Gavagai is a Swedish language technology company founded by Dr. Jussi Karlgren and Dr. Magnus Sahlgren in 2010. The Gavagai team has extensive experience in various Natural Language Processing tasks.

Mainly, they work on topic modeling and sentiment analysis. The main product of the

(10)

company is Explorer ¹. Explorer is a text analytics tool that is used by companies to review and analyze their customer reviews. Explorer supports 46 different languages and generates actionable insights from hundreds of thousands of customer reviews in minutes presented to interactive Dashboards. The Explorer performs topic modeling on a sentence level and sentiment analysis both on a topic and a sentence level in more detail. That being said, Gavagai is interested in evaluating how their approach compares with the more recent state of the art approaches for customer reviews representation and sentiment analysis. This thesis examines how the current state of the art learned text representations and the sentiment analysis models compare against traditional Natural Language Processing methods and the Gavagai’s Explorer.

1.1 Thesis structure

In section 1.2, we formulate the Research Questions of this project, and in section 1.3, we address the Ethics and Sustainability of the work. In Chapter 2, we present the background theory. Chapter 3 presents the datasets, the metrics, the methods, and the experimental pipeline. In Chapter 4, we present the results of all the experiments that are shown in the experimental pipeline. In Chapter 5, we discuss the results we obtained, and we attempt to answer the research questions. Finally, in Chapter 6, we discuss the results and future work.

1https://www.gavagai.io/products/explorer/

(11)

1.2 Research Questions

• How should we choose suitable Datasets for training transformer-based methods such as BERT, RoBERTa, XLnet for learning context-aware customer reviews representations?

• How do transformer-based methods for learning context-aware representations such as BERT, RoBERTa, XLnet, compare with methods for learning context- free representations such as FastText, and traditional baseline methods such as Bag of Words and tf-idf?

• How should we finetune architectures that combine a neural network on top of pre-trained context-aware embeddings?

• How do different neural networks for the sentiment classification such as Bidirectional LSTM, CNN for sentence classification, Multilayer Perceptron utilize and exploit the customer review representations and how do they compare with each other in an extrinsic evaluation setting?

• How do distilled variations of transformer-based learned context-aware customer review representations compare with the original ones?

• How could we utilize in-house lexicon-based knowledge to enhance and alter transformer-based models’ decisions?

(12)

1.3 Benefits, Ethics and Sustainability

Ethics and safety should always be prioritized in designing and implementing AI algorithms such as the ones described in this thesis. It is of paramount importance to fully evaluate the ethics behind our choices to avoid any negative consequences. The potential ethical harms that we consider to be the most important and attempt to break down are the following.

• Data privacy All the datasets used in this work have been publicly available. In particular, all training datasets evaluated have been widely used as benchmarks in many research papers and do not contain any personal information.

Furthermore, evaluation datasets used by Gavagai have also served as publicly available benchmarks and do not as well contain personal information of the users that did the reviews. Finally, in this work, we do not expose any examples from the training or evaluation datasets, but instead, we refer the reader to the official websites to find the datasets.

• Bias and discrimination In this thesis work, we have attempted to use customer reviews from a wide range of domains both for training and evaluation datasets. Furthermore, all the datasets used for training our approaches are either collected as a whole or as a random sample of the initial population, not to replicate any designer preconceptions and biases that might exist.

• Non-transparent, and unexplainable outcomes In this thesis work, we have used Deep Neural Networks both for learning context-aware customer reviews representations and for predicting the sentiment. As a result, our approach falls under the drawback shared among all such methods: the models’

outcomes might be unexplainable or non-transparent. In this work, we presented our effort to solve this problem even though we can not fully understand and solve it. However, we argue that the lack of complete explainability in our case does not indicate to affect the ethics or promote any trace of discrimination, bias, or inequity.

(13)

As far as sustainability is concerned, a recent well-known study [18] showed that one deep neural network model’s training could be enough to produce up to 626,155 pounds of carbon dioxide emissions. This is the amount of carbon dioxide emissions that five cars produce in five years to put things in perspective. However, an average size deep learning model has much less carbon dioxide emissions, although it is still considerable. Our effort to eliminate these emissions has been based on the fact that we used transfer learning in almost all our experiments, which means that the large transformer-based models have already been pre-trained, which minimizes the training time and, therefore, the carbon dioxide emissions. Furthermore, we used the small versions of all the pretrained models for the same purpose. Finally, we experimented with the distilled versions of these models, which could help the training and inference time and the carbon dioxide emissions.

(14)

Theory

This chapter aims to introduce the background required to understand the approach we followed in this study. First, we introduce the concept of word vectors, and we present Bag of Words and Term Frequency-Inverse document frequency, two methods used to convert text into numeric vectors. Then we introduce Word2vec and FastText, two frameworks that use a neural network model to learn static numeric representations of words from a large corpus of text. Then, we introduce the Transformer and three different language representation neural network models based on the transformer architecture. Finally, we present common approaches to evaluate sentence representations and for the sentiment analysis task.

2.1 Bag of Words (BoW)

The Bag of Words (BoW) is a simple model used to produce numerical vectors from the text. A text, defined as a sentence or a document, is represented as a vector with a size equal to the vocabulary of the whole corpus size used. Therefore, each sentence or document is a bag of its words, which means that a numerical vector represents it with 1 in the corresponding vector position and 0 to the rest. This approach disregards the order of the words and, as a result, the contextual information that the sentence or the document contains.

(15)

2.2 Term Frequency-Inverse document frequency (tf- idf)

The term frequency-inverse document frequency is a simple model used to produce numerical vectors from the text in a similar fashion to the Bag of Words approach.

Term Frequency-Inverse document frequency reflects how important a word is to a document or into a collection of sentences. It is used as a term-weighting scheme, which increases proportionally to the times each word appears in a document and is normalized by the number of documents or sentences in the corpus.

2.3 Word2Vec

Word2Vec was first introduced in [13]. They proposed two novel architectures constructed to compute continuous vector representations of words. The two architectures are Skip-Gram and Continuous Bag of Words (CBoW). Both architectures rely on a FeedForward Neural Network, which was first introduced in [3] and contains a linear projection layer, a hidden layer, and output layers, which was used to learn vector representations and a statistical language model.

Figure 2.3.1: The Skip-Gram and CBOW architectures. Taken from [13]

(16)

2.3.1 Continuous Bag of Words (CBoW)

The Continuous Bag of Words (CBoW) architecture is similar to the feedforward neural network. Still, the projection layer is shared for all words, unlike the previously mentioned [3] that only uses the words that are part of the projection matrix.

This architecture predicts the current word with a window of surrounding words as context.

2.3.2 Skip-Gram model

The Skip-Gram architecture is similar to the Continuous Bag of Words (CBoW) architecture. This architecture predicts the context within a specific range in a window before and after the word. The authors of [13], have shown that increasing this range improves the quality of the word representations since a larger window is considered when predicting a particular word.

2.4 FastText

FastText is a model that was first introduced in [5], which improves Mikolov [13].

They introduced a model that takes into account the word morphology by introducing subword units. The sum of the n-grams of the subwords is then summed to represent words. That being said, FastText builds richer vector representations for each word.

The main and most interesting benefit of this approach is that out of vocabulary words can still be evaluated in the evaluation process, which means that FastText might be generalized in unseen settings. Finally, FastText is trained in the same fashion as Word2vec and can be found with both the Skip-Gram and the CBoW objective.

2.5 Transformers

The Transformer is a Neural Network that was first introduced in [20], which depends entirely on the attention mechanism mentioned in [20] and allows for model train parallelization and achieves state-of-the-art results in many downstream NLP tasks. Attention is all you need [20] was the first to introduce a new era in the Natural Language Processing task called Language Modelling. It enhanced the standard encoder-decoder architectures that heavily relied on Recurrent Neural

(17)

Networks(RNNs) by replacing these networks with attention units. The new transformers architecture’s encoder component consists of a multi-head self-attention unit followed by a fully connected feed-forward network for several identical layers.

Furthermore, each layer deploys a residual connection around each subunit, followed by a layer normalization unit. The decoder component connects to the encoder by introducing a multi-head attention unit over the encoder’s outputs. The decoder component consists of a multi-head self-attention unit followed by a fully connected feed-forward network for the same number of layers as the encoder. On the other hand, the decoder component consists of a multi-head self-attention unit that can only operate on previous tokens, followed by a multi-head attention unit that performs its calculation on the output of the encoder component. Moreover, each layer deploys a residual connection and a layer normalization unit for every unit in the same manner as the encoder component. Figure 2.5.1 shows the transformer architecture as it was presented in [20].

2.5.1 Bidirectional Encoder Representations from Transformers (BERT)

Bidirectional Encoder Representations from Transformers (BERT) is a language representation model introduced in [6]. As stated by its name, BERT only implements the original Transformer architecture’s encoder since it is trained as a language representation model. The number of self-attention units, the number of layers, and the hidden size vary, resulting in different BERT releases corresponding to different network sizes. Furthermore, BERT is bidirectional in the sense that the whole input is processed at once, and the self-attention mechanism can operate on the whole input.

That is why position and segment embeddings are added to the token embeddings as shown in figure 2.5.2 before feeding the input to the model to keeps track of the order of the tokens. During training time, BERT dynamically constructs a representation for each token in the input by mapping the token to a fixed vocabulary created during training and then encoding each token’s position in the input using the position segment embeddings. Moreover, they introduce special tokens.

• A classification token (CLS) is used to represent the whole sentence in classification tasks.

• A separator token (SEP) is used for the next sentence prediction pre-training

(18)

Figure 2.5.1: The Transformer - model architecture. Taken from [20]

task.

• A mask token (MASK) is used for the masked language modeling pre-training task.

The so-called BERT framework is split into two stages. [6] shows that BERT can be effectively pre-trained on a large corpus and then fine-tuned at different tasks using

(19)

Figure 2.5.2: The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings. Taken from [6]

transfer learning. The pre-training stage consists of two tasks.

• Next Sentence Prediction(NSP) task: This task aims to represent relationships between sentences. The model predicts in a binary classification task whether sentence B is the next sentence of sentence A.

• Masked Language Model (MLM) task: This task aims to represent the relationships between tokens in one sentence in a bidirectional fashion. The training generator replaces 15% of the original tokens with the special token (MASK) I described above. The model predicts the original token with cross- entropy loss.

The fine-tuning stage fine-tunes all the parameters of the model end-to-end. For every downstream task, the relevant inputs and outputs are created. For downstream tasks, where sentence embeddings are required, the special token (CLS) is fed into a new layer for classification, which we leverage in this study. Figure 2.5.3 shows the pre-training and fine-tuning stages as they were presented in [20].

(20)

Figure 2.5.3: Pre-training and fine-tuning. Taken from [6]

2.5.2 DistilBERT

DistilBERT is a distilled version of Bidirectional Encoder Representations from Transformers (BERT) introduced in [16]. DistilBERT is a method to pre-train a smaller, faster, and lighter language representation model by leveraging the knowledge distillation introduced in [7]. Knowledge distillation is a method used to train a model called the student, to imitate another model’s behavior, which is called the teacher, and reproduce the performance with much fewer training parameters. In DistilBERT, this student model has the same architecture as BERT while reducing the number of layers. Furthermore, all the operations in these layers are highly optimized. As a result, DistilBERT manages to retain 60% of the original parameters and achieves an overall 97% of the initial performance.

2.5.3 RoBERTa

Robustly

Optimized BERT Pretraining Approach (RoBERTa) is a language representation model introduced in [11]. RoBERTa is an optimized BERT architecture with different design choices for the training strategy and the datasets used for pre-training, which improves the performance on the downstream tasks. Furthermore, as stated in [11], this strategy confirms that the amount of data used for pre-training is crucial for the performance of the downstream NLP tasks. More specifically, the optimizations that are introduced in [11] are:

(21)

• The Next Sentence Prediction (NSP) task is completely ignored because, as mentioned in [11] when sentences come from a single document, the model performs better.

• The Masked Language Model (MLM) task used in the pre-training stage which represents the relationships between tokens in one sentence in a bidirectional fashion, is replaced by a training generator which dynamically creates a masking pattern. The training data are replicated ten times, and they are masked dynamically in different ways during the training.

• The model is trained with bigger batches of data with more training data available.

2.5.4 DistilRoBERTa

DistilRoBERTa is a distilled version of the Robustly Optimized BERT Pretraining Approach (RoBERTa) language model introduced in [11]. Like DistilBERT, DistilRoBERTa is a smaller, lighter, and faster language model that leverages the knowledge distillation technique introduced in [7]. Knowledge distillation is a method used to train a model called the student, to imitate another model’s behavior, which is called the teacher, and reproduce the performance with much fewer training parameters.

(22)

2.5.5 XLNet

XLNet is a language model that is an extension of the Transformer-XL and was introduced in [22]. XLNet modifies the training objective with an autoregressive method to model the bidirectional surrounding context of a token. XLNet is designed to deal with some major drawbacks of BERT, which are:

• Independence Assumption: BERT reconstructs all the MASK tokens described above simultaneously, which means that all potential dependencies among them can not be modeled properly.

• Input noise: BERT includes the MASK tokens in the pre-training stage, which leads to a discrepancy between the pre-training stage and the fine-tuning stage since the downstream tasks do not include these special tokens.

XLNet specifically implements the following optimizations:

• The Permutation Language Modelling objective means that the model samples random permutations of all the tokens in the sequence and predicts the tokens in a random order using the context. In other words, the tokens are not predicted sequentially, but instead, they are predicted randomly using all permutations of the previous tokens as context.

• The ’Two-Stream Self-Attention’ which means that the input representations are split into two streams differentiating the positional and the token embeddings to use the positional embeddings, the so-called query stream to predict each token that is a valuable optimization, as the positional information is essential when trying to predict the masked tokens that are in influential positions of the sequence, such as the start of the sentence.

(23)

2.6 Sentence representations

All the methods mentioned in this section refer to approaches used to learn text representation of words. However, it is often crucial to construct text representations of sentences in real-world problems. There are multiple ways to create sentence embeddings, such as Embedding centroids, which means that the embeddings of all words are averaged with equal weights to construct one embedding. This embedding represents the whole sentence further enhancing this approach. Smooth Inverse Frequency that was introduced in [2] which introduces a weighting scheme when generating the sentence embeddings. The primary assumption is that more critical words appear less frequently in a similar fashion to TF-IDF. Deep Averaging Network (DAN) introduced in [9] computes the average of the embeddings and then passes it through multiple Feed Forward layers followed by a linear classification layer.

However, these approaches do not consider the position of the words and the surrounding context when computing the sentence embeddings. Long Short-Term Memory Networks (LSTM) have been widely used with word embeddings being fed to LSTM cells, and the mean of the outputs of those cells is calculated. Convolutional Neural Networks for Sentence Classification (KimCNN), first introduced in [10] is a convolutional layer followed by a pooling layer. Furthermore, there are methods that are used for transformer-based neural networks to fine-tune sentence embeddings.

For instance, BERT for Document Classification (DocBERT) was first introduced in [1] which fine-tunes BERT for sentence classification by introducing a fully-connected feed-forward layer to the final hidden state of the CLS token, and all the model’s weights are fine-tuned. On the other hand, some approaches build an end-to- end model that utilize pre-trained transformer-based embeddings to enhance the embeddings by processing them by other networks. For instance, the RCNN-RoBERTa introduced in [14] feeds an RCNN network with RoBERTa embeddings to enhance further and capture more contextual information.

(24)

Method

This chapter aims to give the reader a description of the datasets and network architectures we used in this study. We begin by introducing the datasets used for both training and evaluation. We then describe the network architectures used to evaluate the review representations.

3.1 Datasets

3.1.1 Train Datasets

Part of this study was to identify annotated datasets with excellent quality for the evaluation of final classifiers. Our choices’ motivation was to cover a wide range of review domains to evaluate what kind of reviews we need to generalize and achieve good performance on our evaluation datasets successfully. Furthermore, our motivation was also to have shorter and longer reviews as well as bigger and smaller datasets to evaluate whether the size of the reviews or the size of the dataset is essential for the generalization of our proposed models because our evaluation datasets cover both longer and shorter reviews from a wide range of domains. More specifically, oscar-ryan with a mean value of 19 words and two sentences per review, and blitzer with a mean value of 106 words and seven sentences per review.

In more detail, the Yelp dataset contains 598,000 total reviews with a mean value of 133 words and 11 sentences per review. The IMDB dataset contains 50,000 reviews with a mean value of 233 words and 14 sentences per review. The 515 Hotel Reviews dataset contains 682,578 reviews with a mean value of 24 words and one sentence per

(25)

review. That is to say, we have covered different review domains, and we have a big dataset with long reviews, a big dataset with short reviews, and a short dataset with long reviews.

Yelp Reviews Dataset

The Yelp reviews dataset was collected from the Yelp Dataset Challenge in 2015, which consists of reviews from Yelp, and it was first introduced as a classification benchmark for text in [24]. The original dataset contains 1,569,264 samples that have raw review texts. For this study, we used a subset of this dataset, which was constructed to predict the reviews’ polarity label. Positive reviews consist of 1 and 2 stars. Negative reviews, on the other hand, comprised of 3 and 4 stars. The dataset consists of 280,000 training reviews and 19,000 validation reviews for each label.

Table 3.1.1: Number of positive and negative reviews for yelp reviews dataset.

Label Train Validation Total

Positive 280,000 19,000 299,000

Negative 280,000 19,000 299,000

Total 560,000 38,000 598,000

The number of words per review ranges from 1 to 1052 words per review with a mean value of 133 and a standard deviation of 122 words per review. The distribution of the words per review for both positive and negative labeled reviews is presented below.

Figure 3.1.1: Distribution of word counts for positive(left) and negative(right) for the Yelp Reviews Dataset

The number of sentences per review ranges from 1 to 509 sentences per review with a mean value of 11 and a standard deviation of 9 sentences per review. The distribution

(26)

of the sentences per review for both positive and negative labeled reviews is presented below.

Figure 3.1.2: Distribution of sentence counts for positive(left) and negative(right) for the Yelp Reviews Dataset

IMDB Dataset of 50K Movie Reviews

The IMDB Dataset of 50k Movie Reviews was collected from [12]. The dataset contains 25,000 equally distributed among positive and negative sentiments for both training and validation.

Table 3.1.2: Number of positive and negative reviews for movie reviews dataset.

Positive 12,500 12,500 25,000

Negative 12,500 12,500 25,000

Total 25,000 25,000 50,000

The number of words per review ranges from 10 to 1470 words per review, with a mean value of 233 and a standard deviation of 173 words per review. The distribution of the words per review for both positive and negative labeled reviews is presented below.

The number of sentences per review ranges from 1 to 150 sentences per review with a mean value of 14 and a standard deviation of 10 sentences per review. The distribution of the sentences per review for both positive and negative labeled reviews is presented below.

(27)

Figure 3.1.3: Distribution of word counts for positive(left) and negative(right) for the Movie Reviews Dataset

Figure 3.1.4: Distribution of sentence counts for positive(left) and negative(right) for the Movie Reviews Dataset

515K Hotel Reviews

The 515 Hotel Reviews dataset was collected from Kaggle ¹. The dataset contains 682,578 customer reviews and scoring for 1493 hotels across Europe. The dataset comprises, among others, the Negative Review and the Positive Review columns.

These reviews are merged to gather all positive and negative reviews.

Table 3.1.3: Number of positive and negative reviews for booking reviews dataset.

Positive 263,064 112,487 375,551

Negative 214,741 92,286 307,027

Total 477,805 204,773 682,578

All the reviews of the dataset contain one sentence. The number of words per review ranges from 5 to 406 words per review with a mean value of 24 and a standard deviation

1https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe/data

(28)

of 28 words per review. The distribution of the words per review for both positive and negative labeled reviews is presented below.

Figure 3.1.5: Distribution of word counts for positive(left) and negative(right) for the Booking Reviews Dataset

(29)

3.1.2 Evaluation Datasets

oscar-ryan

The oscar-ryan dataset which is also called fine-grained contains 294 product reviews from various online sources that are manually annotated by Oscar Täckström and Ryan McDonald on a sentence-level [19]. The dataset is balanced for different domains (books, electronics, music, videogames, DVDs) and the sentiment.

The number of words per review ranges from 1 to 155 words per review with a mean value of 19 and a standard deviation of 15 words per review. The distribution of the words per review for both positive and negative labeled reviews is presented below.

The number of sentences per review ranges from 1 to 27 sentences per review with a mean value of 2 and a standard deviation of 1 sentence per review. The distribution of the sentences per review for both positive and negative labeled reviews is presented below.

(30)

The dataset contains reviews from several domains such as Books, DVDs, Electronics, Music, and Videogames. The distribution of the negative and positive reviews per domain is presented below.

Table 3.1.4: Number of positive and negative reviews per domain for oscar-ryan dataset.

Label Positive Negative Total

Books 160 195 355

Dvds 164 264 428

Electronics 161 240 301

Music 183 179 362

Videogames 255 442 697

Total 923 1320 2243

blitzer

The Multi-Domain Sentiment dataset [4] (blitzer) contains product reviews from Amazon. The reviews expand to various domains such as books, DVDs, electronics, musical instruments, and kitchen appliances that can be hundreds of thousands (books, DVDs) or only a few hundred (music instruments). The original dataset contains star reviews (1 to 5 stars) converted to binary labels, Positive when rating

> 3 and Negative when rating < 3.

The number of words per review ranges from 1 to 5129 words per review, with a mean value of 106 and a standard deviation of 126 words per review. The distribution of the words per review for both positive and negative labeled reviews is presented below.

The number of sentences per review ranges from 1 to 233 sentences per review with a

(31)

mean value of 7 and a standard deviation of 8 sentences per review. The distribution of the sentences per review for both positive and negative labeled reviews is presented below.

The dataset contains reviews from a wide variety of domains. The distribution of the negative and positive reviews per domain is presented below.

(32)

Table 3.1.5: Number of positive and negative reviews per domain for blitzer dataset.

Label Positive Negative Total

Apparel 1000 1000 2000

Automotive 152 584 736

Baby 900 1000 1900

Books 1000 1000 2000

Camera 999 1000 1999

Cell phones 384 639 10239

Video games 458 1000 1458

Dvds 1000 1000 2000

Electronics 1000 1000 2000

Gourmet food 1000 208 1208

Grocery 1000 352 1352

Personal care 1000 1000 2000

Jewelry and watches 1000 292 1292

Kitchen 1000 1000 2000

Magazines 1000 970 2000

Music 1000 1000 2000

Musical instruments 284 48 332

Office products 367 64 431

Outdoor living 1000 327 1327

Software 1000 915 1915

Sports and outdoor 1000 1000 2000

Tools and hardware 98 14 112

Toys 1000 1000 2000

Video 1000 1000 2000

Total 20565 17413 37978

3.1.3 Pre-processing

Text pre-processing is key for language modeling tasks in Natural Language Processing (NLP). Several pre-processing techniques, such as Lemmatization, Lowercasing, Stemming, Stopword removal, normalization, and noise removal, have been applied to this thesis work.

(33)

BOW models

Text pre-processing is crucial for optimizing Bag of Words models and n-gram models, and Term Frequency-Inverse document frequency models. In this thesis work, as a first step, we have tokenized the sentences using the Tokenizer from the transformers package to have a fair comparison among all models. The next step was to lowercase, remove the noise, and the stop words. Noise removal means that all the tokens that might constitute noise, such as tags in tweets, are removed. Finally, stemming was applied, which is reducing the inflection in words.

FastText

Text pre-processing is a crucial step for optimizing Skip-Gram and Continuous Bag of Words (CBoW). Like the BOW models, we tokenized the sentences using the Tokenizer we used with the transformers to compare all models. The next step was to lowercase, remove the noise, and the stop words. Stemming was not applied to these models since FastText is supposed to generate rich representations that consider the word morphology.

Transformer based models

Traditional text pre-processing seems not to affect Transformer based models. It produces slightly worse results because these models can benefit from stop words and noisy words. The sentences were tokenized using the same Tokenizer, and the built-in pre-processing, that is, special tokens, attention masks, and positional embeddings, were applied.

(34)

3.2 Language Models

This section aims to present the different word representation methods that have been used to represent customer reviews. Traditional methods such as BoW and TF- IDF and ngrams with TF-IDF have been used as baselines. Furthermore, static word embeddings such as FastText have been used. Finally, three different contextualized transformer-based approaches have been used. Their methods are bert-base, roberta- base, and xlnet-base. The motivation behind this was to evaluate both static strategies and other pre-training techniques of the transformer-based approaches and how they compare with the traditional baseline methods and the company’s solution.

3.3 Network architectures

This section aims to describe and present the different network architectures used for the sentiment analysis task.

The Multi-Layer Perceptron (MLP) was used as a linear model where all the contextualized word embeddings were fine-tuned with the classification token. The static embeddings are averaged to compute the sentence embedding. Finally, the baseline methods such as BoW and Tf-idf are just fed to the linear layer. The Bidirectional LSTM for Sentence Classification and the Convolutional Neural Network for Sentence Classification (Kim-CNN) are used with the contextualized and the static word embeddings and they are fine-tuned in two fashions. All the embeddings are frozen, which is named static, and all the embeddings trainable, which is called non-static. The motivation behind these design choices was to evaluate both the embeddings fine-tuning capabilities and the quality of the pre-trained embeddings.

Furthermore, the different network architectures’ reason was to assess whether a more complex model as the classifier could benefit more from using all the sentence embeddings instead of just using the classification token.

(35)

3.3.1 Multi-Layer Perceptron (MLP)

Multi-Layer Perceptron (MLP) [21] is a type of Neural Networks that consists of multiple layers of perceptrons. The vanilla version of MLPs consists of the input layer, the hidden layer, and the output layer. In this study, we utilize such networks to evaluate the fine-tuning process of the contextual embeddings. In more detail, the input layer corresponds to the output of one of the embeddings model. We have the hidden layer, and the output layer corresponds to the 2 class classification problem.

3.3.2 Bidirectional LSTM for Sentence Classification

Recurrent Neural Networks (RNNs) are Neural Networks that process a sequential input in time steps and have a hidden state called an internal memory and holds information about each time step’s previous outputs. That is to say, that RNNS are suitable for tasks such as Speech Recognition or Time Series predictions. The intuition behind our choice to utilize such a network in our study lies in the fact that in this way, we can transfer contextual information from previous to following steps. Furthermore, throughout the research that was conducted in [23], RNNs are found to be more effective in sentiment analysis than other Network architectures. The architecture of an RNN unit is presented in figure below.

Figure 3.3.1: RNN model architecture. Taken from²

The vanilla RNN architecture has certain drawbacks. It is not suitable for long-term dependencies, which we have in this study since we consider long customer reviews

(36)

in products. Long Short Term Memory units (LSTM) proposed in [8] deal with the vanishing gradients problem that comes as an effect of long-term dependencies in traditional RNN architectures. In this study, we implemented a bidirectional LSTM architecture on top of the word embeddings we evaluated. Bidirectional RNNs make a connection between hidden layers of opposite directions to the same output. In this way, a hidden layer for a specific time step can leverage information from both directions, resulting in a better contextual representation of the sentence. Finally, we evaluated the mean of all outputs as the resulting output of the LSTM layer as performed in [24].

3.3.3 Convolutional Neural Network for Sentence Classification (Kim-CNN)

Convolutional neural networks (CNN) were initially introduced for computer vision, but they have been beneficial in recent years and have achieved notable results in NLP tasks. They utilize convolutional layers applied to local features. Convolution is an operation between the input and the so-called kernel, which is of a certain dimensionality and is multiplied with subsets of the same dimensionality of the input in an element-wise manner. Finally, the sum of these multiplications constitutes the output of the Convolution. In that sense, this operation is considered a feature extraction operation, which is our motivation to utilize this study’s architecture. That is to say that, Convolutional Neural Networks may be able to extract information about the semantic and syntactic features of the reviews as shown in [15]. Finally, as shown in [17], CNNS can leverage the richness of pre-trained word embeddings.

In this study, we implemented a wide-spread CNN architecture for sentence classification introduced by Yoon Kim in [10]. The model is presented in figure 3.3.2. The architecture is rather simple as it only utilizes one convolutional layer and one pooling layer. In [10], the Convolutional Neural Network was trained on top of pre-trained word2vec [13] embeddings. In this study, we trained and evaluated this architecture on top of all the different static and contextual word embeddings we have presented.

(37)

Figure 3.3.2: Kim CNN model architecture

3.4 Evaluated models

This section aims to explain the evaluated models and their names used to report them in the experiments and results.

• MLP-TF-IDF The sentence embeddings produced by Term Frequency-Inverse document frequency which are fed to a Linear classifier.

• MLP-TF-IDF-N-GRAMS The sentence embeddings produced by Term Frequency-Inverse document frequency with n-grams and a Linear classifier on top.

• MLP-FT-avg The sentence

embeddings produced by averaging word embeddings (centroids) produced by the pre-trained FastText model with a Linear classifier on top.

• MLP-BERT The fine-tuned word embeddings of the classification token produced by bert-base-uncased with a Linear classifier on top.

• MLP-RoBERTa The fine-tuned word embedding of the classification token produced by roberta-base with a Linear classifier on top.

• MLP-XLnet The fine-tuned word embedding of the classification token produced by xlnet-base-cased with a Linear classifier on top.

• LSTM-FT-static The pre-trained word embeddings produced by the pre- trained FastText model with LSTM classifier on top.

(38)

• LSTM-BERT-static The pre-trained word embeddings produced by bert-base- uncased with LSTM classifier on top.

• LSTM-RoBERTa-static The pre-trained word embeddings produced by roberta-base with LSTM classifier on top.

• LSTM-XLnet-static The pre-trained word embeddings produced by xlnet- base-cased with LSTM classifier on top.

• CNN-FT-static The pre-trained word embeddings produced by the pre-trained FastText model with CNN classifier on top.

• CNN-BERT-static The pre-trained word embeddings produced by bert-base- uncased with CNN classifier on top.

• CNN-RoBERTa-

static The pre-trained word embeddings produced by roberta-base with CNN classifier on top.

• CNN-XLnet-static The pre-trained word embeddings produced by xlnet-base- cased with CNN classifier on top.

• CNN-BERT-non-static The fine-tuned word embeddings produced by bert- base-uncased with CNN classifier on top.

• CNN-RoBERTa-non-static The fine-tuned word embeddings produced by roberta-base with CNN classifier on top.

• CNN-XLnet-non-static The fine-tuned word embeddings produced by xlnet- base-cased with CNN classifier on top.

• MLP-DistilRoBERTa The fine-tuned word embedding of the classification token produced by distilroberta-base with Linear classifier on top.

• CNN-DistilRoBERTa-non-static The fine-tuned word embeddings of all tokens produced by distilroberta-base with CNN classifier on top.

(39)

3.5 Implementation

The proposed architectures, the baseline language models, and the experimentation pipelines were developed in Python 3.6³ and Pytorch⁴. The transformer-based language models were developed using the Huggingface library⁵

The TfidfVectorizer used is from the scikit-learn⁶, and the model contains vectors of 500 dimensionality. The TfidfVectorizer for n-grams used is from the scikit-learn, and the model contains vectors of 500 dimensionality, and the n-gram length is 5.

The pre-trained FastText model that was used is the model trained with the CBoW objective.⁷. The pre-trained bert-base-uncased, roberta-base, xlnet-base-uncased that were used are from the Huggingface library.

3.6 Metrics

The most popular metric in text classification is accuracy. Accuracy is the fraction of the correct predictions of the model. However, the distribution of the datasets should be taken into account when choosing the evaluation metric. In this thesis work, all train and evaluation datasets are slightly imbalanced. Furthermore, in real-world cases, the labels’ distribution on the reviews depends on the customer’s dataset. Finally, the nature of the problem we have worked on in this thesis demands that our models have a high recall. In hotel reviews, for instance, we are interested in detecting all the negative reviews that might affect our customers for the insights to be valuable. The metrics we have used in this work are Precision, Recall, and F1 score. Recall is the fraction of the number of true positives to the number of true positives and false negatives.

In other words, recall shows how many of the positive cases the model identified. For customer reviews, specifically, this shows the fraction of the positive reviews the model identified. Precision is the fraction of the number of true positives to true positives and false positives. In other words, precision shows how many of the identified positive reviews were positive. Even though we need our model to have a high recall, we also need to have precise results. That is why the metric we used to evaluate and compare

3https://www.python.org/downloads/release/python-360/

4https://pytorch.org/

5https://huggingface.co/

6https://scikit-learn.org/stable/

7https://fasttext.cc/docs/en/english-vectors.html

(40)

our models was the F1 score, the harmonic mean of both precision and recall. For reference, both recall and precision have been reported in the Results section, but the comparison was based on the F1 scores.

3.7 Experimental pipeline

This section thoroughly explains the experimental pipeline that we followed to answer all the research questions.

• The initial step of this thesis work was to investigate the most suitable datasets for training the proposed methods.

• The next step of this thesis work was to investigate different representation methods for customer reviews.

• The final step of this thesis work was to investigate the different network architectures for the sentiment analysis task.

Each dataset is used to train each of the models and then evaluate the evaluation datasets. That means that we end up with every different combination of a model and a training dataset as checkpoints, which we evaluate against the evaluation datasets.

Each of these train datasets has a train and a validation dataset. The validation dataset is used to determine some of the design choices and hyperparameters. When all options were decided, the datasets were used to train the models and evaluate the unseen evaluation data. We have built this pipeline to access these models’

generalization capabilities on unseen data evaluated on completely different data from different domains. Furthermore, as described in the Datasets section, each of the train datasets has different review sizes in terms of tokens and sentences and different amount of data to access if the number, the distribution, and the size of the dataset plays a crucial role when it comes to generalization on unseen data. Next, the best two models were trained on all the train datasets combined to assess whether the models can benefit from more data from different domains with different distributions and lengths. The model that continuously performed better independent of the training dataset in terms of f1 score was chosen as the best one from this pipeline. Finally, a distilled version of the best models was trained with all the training datasets combined to assess whether a smaller distilled model could be easier deployed in a production setting and could achieve similar performance on the evaluation datasets.

(41)

3.7.1 KimCNN fine-tune investigation

This section evaluated how we should fine-tune contextualized transformer-based embeddings, followed by the Convolutional Neural Network for Sentence Classification (Kim-CNN). In [10], the authors introduce static and non-static representations, and they evaluate whether they should fine-tune the embeddings or not. Static means that the embeddings were frozen, and only the model on top was fine-tuned. On the other hand, non-static implies that both the embeddings and the model on top were fine- tuned. We performed the same evaluation for our implementation to assess if more recent transformers still need to be fine-tuned when used with such networks. We used the Yelp Dataset to train and evaluate the two different settings. Finally, there are more fine-tuning techniques, such as only froze specific layers and fine-tuned others. Still, due to a lack of resources, we only investigated those described in this section.

3.7.2 Proposed architectures evaluation

This section trained all the proposed models described in section 3.4 with three different training datasets. The training datasets of Yelp Dataset, Movie Dataset, and Booking Dataset were used to train the models. Finally, we evaluated all the models on both the oscar-ryan and blitzer evaluation datasets. We performed this evaluation to assess whether a model that constantly peformed better across different datasets or the different training datasets with different characterists make a difference.

3.7.3 Best proposed architectures evaluation

This section evaluated the best performing models in a broader training setting using all the train datasets combined and assessed on both the oscar-ryan and blitzer evaluation datasets. We performed this evaluation to assess whether more and diverse data would make any difference in the performance.

3.7.4 Distilled architectures and resource evaluation

This section evaluated how distilled versions of the best performing models from previous stages compare to each other. Furthermore, we assessed how the best- evaluated models and their distillations compare to each other in terms of model parameters and memory consumption, and inference performance to potentially

(42)

access the resources that each model demands to be deployed in a production environment.

3.7.5 Lexicon based enhanced architectures evaluation

In this final section, as an extra evaluation, we thought it would be interesting to evaluate how the best performing models compare in a training setting with all the sentences being enhanced with lexicon-based knowledge. In more detail, the company has built some lexicons with expertise acquired over the years that define the polarity of all the tokens in English. In this way, we believe that the model could benefit from this knowledge, and it is what we evaluated here. Furthermore, in this way, one might be able to deal with the ’black box’ and, by providing different lexicons with different polarities for specific tokens, alter the model’s outcome. To achieve that, we take advantage of the fact that the transformers are usually pre-trained, including some special tags. In other words, we introduce four new tags that are <pos>, </pos>,

<neg>, </neg>, defining the start and the end of the token that has either positive or negative polarity according to our lexicon. We fine-tune the model with the new tags, and we claim that it will benefit from this additive knowledge. Unfortunately, we did not have different lexicons to evaluate whether we could alter the model’s outcomes.

(43)

Results

This chapter aims to present the results of all the experiments described in the previous chapter and interpret the experimental results.

4.1 KimCNN fine-tune investigation

In Section 3.7.1, we investigated how we should fine-tune the contextualized transformer-based embeddings, followed by the Convolutional Neural Network for Sentence Classification (Kim-CNN). We used the Yelp Dataset to train and evaluate the two different settings. Static means that the embeddings were frozen, and only the model on top was fine-tuned. On the other hand, non-static implies that both the embeddings and the model on top were fine-tuned. There are more fine-tuning techniques, such as only freeze specific layers and fine-tune others, but we only investigated those described in this section due to lack of resources.

Model Precision Recall F1

CNN-BERT-static 0.9588 0.9660 0.9624

CNN-BERT-non-static 0.9658 0.9704 0.9681 CNN-RoBERTa-static 0.9504 0.9615 0.9559 CNN-RoBERTa-non-static 0.9732 0.9730 0.9731

CNN-XLnet-static 0.9441 0.9773 0.9604

CNN-XLnet-non-static 0.9697 0.9732 0.9714

Table 4.1.1: Comparison of static and non-static CNN architectures trained on the Yelp train dataset; bold figures indicate superior performance on Yelp validation dataset

(44)

We observe that the non-static version of the models always performs better for all different contextualized word embeddings. Furthermore, we observe that the RoBERTa embeddings model has the best performance on the Yelp validation dataset.

However, both settings perform exceptionally well, and it would be interesting to evaluate them against different datasets to access their generalization capabilities.

Both settings were used in the next experiments, even though the non-static versions continuously performed better.

4.2 Proposed architectures evaluation

In Section 3.7.2, we evaluated all proposed models with three different training datasets. In the first subsection, we present all the results from all the models that have been trained with the Yelp dataset, the Movie dataset, and the Booking dataset and have been evaluated on the oscar-ryan dataset. In the next subsection, we present all the results from all the models that have been trained with the same datasets and have been evaluated on the blitzer dataset.

(45)

4.2.1 oscar-ryan

In this subsection, we present the results of the evaluation of all train datasets on the oscar-ryan evaluation dataset.

Gavagai 0.72 0.58 0.64

Tf-Idf 0.59 0.84 0.69

Tf-Idf-N-Gram 0.61 0.82 0.70

MLP-BERT 0.79 0.82 0.80

MLP-XLnet 0.81 0.82 0.82

MLP-RoBERTa 0.90 0.77 0.83

LSTM-FT-static 0.76 0.78 0.77

CNN-FT-static 0.73 0.77 0.75

LSTM-BERT-static 0.77 0.82 0.80

CNN-BERT-static 0.77 0.84 0.80

CNN-BERT-non-static 0.80 0.81 0.80

LSTM-RoBERTa-static 0.79 0.85 0.82

CNN-RoBERTa-static 0.80 0.85 0.83

CNN-RoBERTa-non-static 0.84 0.81 0.83

LSTM-XLnet-static 0.68 0.76 0.72

CNN-XLnet-static 0.72 0.88 0.79

CNN-XLnet-non-static 0.80 0.85 0.82

Table 4.2.1: Comparison of Gavagai with all architectures trained on the Yelp reviews dataset; bold figures indicate superior performance on oscar-ryan dataset

We observe from the models trained with the Yelp reviews dataset that all proposed methods outperform the baselines, with RoBERTa achieving the best performance.

Furthermore, we observe that the static and non-static settings of the Convolutional Neural Network for Sentence Classification are very similar for all contextualized word embeddings, which is an indication that it was the right choice to include both of them in the evaluation.

(46)

Gavagai 0.72 0.58 0.64

Tf-Idf 0.58 0.80 0.67

Tf-Idf-N-Gram 0.59 0.81 0.68

FastText 0.54 0.62 0.58

MLP-BERT 0.64 0.93 0.76

MLP-XLnet 0.74 0.89 0.81

MLP-RoBERTa 0.81 0.85 0.83

CNN-FT-static 0.63 0.84 0.72

LSTM-RoBERTa-static 0.72 0.93 0.81

CNN-RoBERTa-static 0.67 0.94 0.78

Table 4.2.2: Comparison of Gavagai with all architectures trained on the Movies reviews dataset; bold figures indicate superior performance on oscar-ryan dataset

We observe from the models trained with the Movie reviews dataset that all proposed methods outperform the baselines except for XLnet and RoBERTa achieving the best performance. Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification has much more outstanding performance, probably because this dataset is the smallest, which might indicate that the static models do not have enough data to generalize appropriately. The fine-tuned classification of RoBERTa achieves the best results.

(47)

Gavagai 0.72 0.58 0.64

Tf-Idf 0.88 0.14 0.24

Tf-Idf-N-Gram 0.85 0.36 0.51

FastText 0.52 0.50 0.51

MLP-BERT 0.82 0.61 0.70

MLP-XLnet 0.83 0.63 0.72

MLP-RoBERTa 0.84 0.67 0.75

CNN-FT-static 0.82 0.44 0.57

CNN-BERT-static 0.87 0.58 0.69

CNN-RoBERTa-non-static 0.91 0.56 0.69

CNN-XLnet-non-static 0.72 0.90 0.80

Table 4.2.3: Comparison of Gavagai with all architectures trained on the Hotel reviews dataset; bold figures indicate superior performance on oscar-ryan dataset

We observe from the models trained with the Booking reviews dataset that Gavagai outperforms all the baseline methods and outperforms all implementations of FastText and some of the architectures of the contextualized embeddings. All the architectures’

overall performance is much lower with this training dataset, which means that the models could not generalize due to the concise reviews they were trained on.

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification has much more outstanding performance, probably because this dataset has the shortest reviews. The fine-tuned Convolutional Neural Network for Sentence Classification with XLNet embeddings achieves the best results.

(48)

4.2.2 blitzer

In this subsection, we present the results of the evaluation of all train datasets on blitzer evaluation dataset.

Gavagai 0.77 0.84 0.81

Tf-Idf 0.85 0.79 0.82

Tf-Idf-N-Gram 0.85 0.81 0.83

FastText 0.87 0.80 0.83

MLP-BERT 0.87 0.92 0.89

MLP-XLnet 0.90 0.95 0.92

MLP-RoBERTa 0.96 0.91 0.94

CNN-FT-static 0.90 0.86 0.88

CNN-XLnet-static 0.88 0.96 0.92

CNN-XLnet-non-static 0.94 0.94 0.94

Table 4.2.4: Comparison of Gavagai with all architectures trained on the Yelp reviews dataset; bold figures indicate superior performance on blitzer dataset

We observe from models trained with the Yelp reviews dataset that all proposed methods outperform the baselines, and RoBERTa achieves the best performance.

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification performs similarly. The fine-tuned classification of RoBERTa achieves the best results along with the fine-tuned Convolutional Neural Network for Sentence Classification with RoBERTa.

(49)

Gavagai 0.77 0.84 0.81

Tf-Idf 0.98 0.06 0.11

Tf-Idf-N-Gram 0.87 0.24 0.38

FastText 0.86 0.25 0.39

MLP-BERT 0.92 0.59 0.72

MLP-XLnet 0.95 0.61 0.74

MLP-RoBERTa 0.97 0.62 0.75

CNN-FT-static 0.95 0.34 0.50

CNN-RoBERTa-non-static 0.96 0.61 0.75

Table 4.2.5: Comparison of Gavagai with all architectures trained on the Hotel reviews dataset; bold figures indicate superior performance on blitzer dataset

We observe from the models trained with the Hotel reviews dataset that all proposed methods performed poorly and could not generalize due to the concise reviews they were trained on. In this case, this impact is more remarkable since we evaluated on a corpus with much longer reviews. Furthermore, we observed again that in this case, the non-static setting of the Convolutional Neural Network for Sentence Classification has much more outstanding performance, probably because this dataset has the shortest reviews. The Gavagai achieves the best results.

(50)

Gavagai 0.77 0.84 0.81

Tf-Idf 0.79 0.78 0.78

Tf-Idf-N-Gram 0.80 0.81 0.80

FastText 0.79 0.80 0.79

MLP-BERT 0.88 0.91 0.89

MLP-XLnet 0.91 0.95 0.93

MLP-RoBERTa 0.93 0.93 0.93

CNN-FT-static 0.84 0.82 0.83

Table 4.2.6: Comparison of Gavagai with all architectures trained on the Movie reviews dataset; bold figures indicate superior performance on blitzer dataset

We observe from the models trained with the Movie reviews dataset that all proposed methods outperform the baselines, and RoBERTa achieves the best performance.

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification performs similarly. The fine-tuned classification of RoBERTa achieves the best results along with the fine-tuned Convolutional Neural Network for Sentence Classification with RoBERTa and the fine- tuned classification of XLNet.

(51)

4.3 Best proposed architectures evaluation

In Section 3.7.3, we evaluated the best performing models in a broader training setting using all the train datasets combined and assessed on both the evaluation datasets.

Below we present the comparison of Gavagai with the best performing models trained on all train datasets combined.

4.3.1 oscar-ryan

Gavagai 0.72 0.58 0.64

MLP-RoBERTa 0.79 0.90 0.85

CNN-RoBERTa-non-static 0.87 0.80 0.85

Table 4.3.1: Comparison of Gavagai with all architectures trained on all train reviews datasets; bold figures indicate superior performance on oscar-ryan dataset

We observe from the models trained with the full train dataset that all proposed methods outperform the baselines, and RoBERTa achieves the best performance.

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification performs better. The fine-tuned classification of RoBERTa achieves the best results along with the fine-tuned Convolutional Neural Network for Sentence Classification with RoBERTa. Finally, we observed that the models benefited from the larger and combined datasets achieving the highest f1 score for the oscar-ryan dataset.

(52)

4.3.2 blitzer

Gavagai 0.77 0.84 0.81

MLP-RoBERTa 0.96 0.94 0.95

CNN-RoBERTa-non-static 0.95 0.96 0.95

Table 4.3.2: Comparison of Gavagai with all architectures trained on all train reviews datasets; bold figures indicate superior performance on blitzer dataset

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification performs better. The fine-tuned classification of RoBERTa achieves the best results along with the fine-tuned Convolutional Neural Network for Sentence Classification with RoBERTa. Finally, we observed that the models benefited from the larger and combined datasets achieving the highest f1 score for the blitzer dataset.

4.4 Distilled architectures and resource evaluation

In Section 3.7.4, we evaluated how distilled versions of the best performing models from previous stages compare with each other. We assessed the distilled models in a broader training setting again, using all the train datasets combined and evaluated on the oscar-ryan and blitzer datasets. Below we present the comparison of Gavagai with the best performing models and their distilled versions trained on all train datasets combined.

4.4.1 oscar-ryan

Furthermore, in this case, we observe that the non-static setting of the Convolutional Neural Network for Sentence Classification performs better. The fine-tuned classification of RoBERTa achieves the best results along with the fine-tuned Convolutional Neural Network for Sentence Classification with RoBERTa. The