(1)

Fine-grained sentiment analysis of product reviews in Swedish

Emil Westin

Uppsala University

Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)

Bachelor's Thesis in Language Technology
November 3, 2020


Abstract

In this study we gather customer reviews from "Prisjakt", a Swedish price comparison site, with the goal of studying the relationship between review text and rating, a task known as sentiment analysis. The purpose of the study is to evaluate three supervised machine learning models on a fine-grained dependent variable representing the review rating. For classification, a binary and a multinomial model are built on the Support Vector Machine with a linear kernel, using the one-versus-one strategy, and evaluated with F1, accuracy, precision and recall scores. We also use Support Vector Regression by approximating the fine-grained variable as continuous, evaluated using MSE. Furthermore, the three models are evaluated on a balanced and an unbalanced dataset in order to investigate the effects of class imbalance.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Outline
2 Background
  2.1 Text representation
  2.2 Machine Learning
  2.3 SVM
  2.4 SVR
  2.5 Weighting
  2.6 Basic Concepts
3 Data
  3.1 Web scraper
  3.2 Measurement Scale
  3.3 Token statistics
  3.4 Regex Conversions
  3.5 Text preprocessing
4 Method
  4.1 Training and Test Sets
  4.2 Document Term Matrices
  4.3 Model training
  4.4 Feature Engineering
  4.5 Evaluation Metrics
5 Results
  5.1 Experimental results
  5.2 Fine-tuned models
6 Discussion
A Regex
B SVR features by balance
C Product Review Ratings by Category


1 Introduction

Sentiment analysis is the task of determining the opinion or sentiment polarity of a document. In its simplest form, the opinion is determined through binary classification: for example, predicting the sentiment of a randomly selected text document from some specified domain as one of two discrete classes, positive or negative. An alternative is to consider a "fine-grained" scale, meaning that there are several classes to predict, e.g. a rating from 1-5 or 1-10, where the difference between adjacent classes is less clear-cut than the difference between positive and negative sentiments. As such, a central problem in fine-grained analysis is that the classification requires a high level of precision to discriminate between the classes. Furthermore, the fine-grained distribution of, for example, review ratings tends to be skewed, causing class imbalance.

In this thesis, we investigate the shortcomings and advantages of the fine-grained sentiment analysis approach. For this purpose, we gather data with a custom web scraper from the price comparison site Prisjakt, where users write reviews in Swedish about products they have purchased and assign ratings on a 10-point scale.

The goal is to evaluate the performance of two supervised machine learning algorithms. We evaluate the classification approach SVM, and introduce the regression alternative SVR, on three different tasks: binary classification, multinomial classification and regression. For simplicity, we will refer to these as three different models: binary SVM, multinomial SVM and SVR. We construct these models with the goal of assigning a rating or sentiment to an unlabeled review document in Swedish.

1.1 Motivation

When the labeled data is fine-grained enough, typically seven or more categories, it is possible to approximate the scale as continuous at the cost of very little bias (Hox 2010:141). With the dependent variable on a continuous scale, we can use regression, i.e. predict a real-valued number instead of a discrete class. The regression approach has been shown to be effective when fine-grained data is available: in Kapukaranov and Nakov (2015), it outperformed binary classification and ordinal logistic regression.


in the binary classification task.

1.2 Research Questions

As mentioned, a central problem in fine-grained sentiment analysis is the skewness of the rating distribution. An additional issue is the sample size: in general, a larger sample is preferable in order for the model to better discriminate between the classes. To study these issues, we split the dataset into balanced and unbalanced sets. By balanced, we refer to an equal number of labels in each class. The balanced dataset is reduced in size, since the majority classes lose many observations in order to be balanced with the minority classes. As such, it becomes of interest not only to study the balanced and unbalanced datasets, but also the difference in sample sizes. The main research questions of this study are therefore:

• How do binary and multinomial classification perform on a balanced dataset with a smaller sample size, compared to an unbalanced dataset with a larger sample size?

• For comparison, how does the regression approach for the fine-grained data perform on the same balanced and unbalanced sets?

• The review documents are represented as bag-of-words, which leads to high-dimensional training data. In order to train the models, dimensionality reduction is needed to filter out irrelevant features. How do feature selection and feature engineering, for example removing stopwords, stemming, tf-idf weighting and applying regex conversions, affect the results on the balanced and unbalanced datasets?

1.3 Outline


2 Background

In this section, we will give background to the concepts used in this thesis, such as text encoding, machine learning, the models, weighting and some basic concepts.

2.1 Text representation

An important research area in Natural Language Processing deals with different ways of representing or encoding text data. A widely used representation is the Vector Space Model (VSM) (Shawe-Taylor and Cristianini 2004). In the VSM, the original word order of the documents is not taken into account and words are assumed to be independent. A text document can be represented as an N-dimensional row vector of term frequencies using the bag-of-words (BoW) function (ibid.):

$$\phi : d \mapsto \phi(d) = \big(tf(t_1, d),\; tf(t_2, d),\; \ldots,\; tf(t_N, d)\big) \in F = \mathbb{R}^N \tag{1}$$

where $tf(t_j, d)$ refers to the frequency of the $j$-th term $t_j$ appearing in document $d$, and $F$ represents the feature space, in which each feature is a term and each term corresponds to a dimension. For a collection of $\ell$ text documents, the document-term matrix $D$ is used:

$$D = \begin{pmatrix} \phi(d_1) \\ \vdots \\ \phi(d_\ell) \end{pmatrix} = \begin{pmatrix} tf(t_1, d_1) & tf(t_2, d_1) & \cdots & tf(t_N, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & \cdots & tf(t_N, d_2) \\ \vdots & \vdots & \ddots & \vdots \\ tf(t_1, d_\ell) & tf(t_2, d_\ell) & \cdots & tf(t_N, d_\ell) \end{pmatrix}_{(\ell \times N)} \tag{2}$$

For example, consider a corpus of two ($\ell = 2$) short text documents, where $d_1$ = "A happy cat" and $d_2$ = "A sad cat". After tokenization and lower-casing, the dictionary of unique terms is sorted in alphanumerical order, resulting in $t_1$ = a, $t_2$ = cat, $t_3$ = happy, $t_4$ = sad, with vocabulary size $N = 4$. The document-term matrix is:

$$D = \begin{pmatrix} \phi(d_1) \\ \phi(d_2) \end{pmatrix} = \begin{pmatrix} tf(t_1, d_1) & tf(t_2, d_1) & tf(t_3, d_1) & tf(t_4, d_1) \\ tf(t_1, d_2) & tf(t_2, d_2) & tf(t_3, d_2) & tf(t_4, d_2) \end{pmatrix} \tag{3}$$

$$= \begin{pmatrix} tf(\text{a}, d_1) & tf(\text{cat}, d_1) & tf(\text{happy}, d_1) & tf(\text{sad}, d_1) \\ tf(\text{a}, d_2) & tf(\text{cat}, d_2) & tf(\text{happy}, d_2) & tf(\text{sad}, d_2) \end{pmatrix} \tag{4}$$

$$= \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}_{(2 \times 4)} \tag{5}$$
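The toy example above can be reproduced in a few lines. The sketch below uses scikit-learn's CountVectorizer; the thesis builds its matrices with text2vec in R, so this Python analogue is for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["A happy cat", "A sad cat"]

# A permissive token pattern so the one-letter term "a" is kept,
# matching the example dictionary {a, cat, happy, sad}.
vectorizer = CountVectorizer(lowercase=True, token_pattern=r"(?u)\b\w+\b")
D = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['a' 'cat' 'happy' 'sad']
print(D.toarray())                         # [[1 1 1 0]
                                           #  [1 1 0 1]]
```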


Stopwords are often removed to reduce the dimensionality of the feature space; however, for certain datasets it may be desirable to keep stopwords and certain punctuation symbols. For example, removing the stopword "a" results in the removal of the first column in Eq. (5); we can visualize the resulting $D_{(2 \times 3)}$ in Fig. 2.1.

[Figure 2.1: Graphing the document-term matrix in the feature space (left) and document space (right). Left panel: the two document vectors in the 3-dimensional feature space spanned by cat, happy and sad. Right panel: the three term points in the 2-dimensional document space spanned by d1 and d2.]

We can observe in Fig. 2.1 that both documents contain one occurrence of the word "cat", with the difference that $d_1$ contains the word "happy" while $d_2$ contains "sad".

This visualization is important because most machine learning methods and statistical multivariate techniques are based on the concept of distance. For example, a measure of similarity between $d_1$ and $d_2$ is the cosine of the angle between them, which is equivalent to the sample correlation coefficient (Johnson and Wichern 2007). In other words, if the two documents have similar orientation, the sample correlation will be close to 1; if oriented in opposite directions, it will be -1; and if perpendicular, it will be zero.
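As a minimal illustration of this notion of distance, the cosine similarity between the two document vectors of Eq. (5), after removing the stopword "a", can be computed as follows:

```python
import numpy as np

d1 = np.array([1.0, 1.0, 0.0])  # counts for (cat, happy, sad) in d1
d2 = np.array([1.0, 0.0, 1.0])  # counts for (cat, happy, sad) in d2

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)  # 0.5: the documents share "cat" but differ in the sentiment word
```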

The main drawback of the VSM is the assumption of independent terms, resulting in a loss of grammatical information since the word order is lost. Furthermore, synonyms are represented differently in the VSM despite having similar meaning, while words with many meanings are represented in only one way. The advantage of the VSM is that it is a relatively simple model, popular for information retrieval tasks (Manning et al. 2008). Furthermore, the model can be generalized and extended: the independence assumption can be removed by introducing term-to-term correlations (GVSM), allowing additional information such as semantic relatedness to be incorporated (Tsatsaronis and Panagiotopoulou 2009).


2.2 Machine Learning

The word "learning" in machine learning (and deep learning) stands for meaningfully learning representations of the training data such that the predictions are close to the expected output (Chollet and Allaire 2018). A representation is a different way to view or encode the data, for example rotating the coordinate system, mapping the data points to a feature space where linear separation is possible in classification tasks, or weighting points according to their variance (statistical distance). A machine learning model automatically finds a suitable representation by searching through a predefined set of operations, called a hypothesis space (ibid.). The model is guided by a feedback signal or loss function, which tracks, for example, the percentage of correctly classified points.

2.3 SVM

The Support Vector Machine (SVM) is a supervised binary classification technique. The goal of SVM is to find a hyperplane, which can be visualized as a straight line in two dimensions, such that "the set of vectors in the training set is separated without error and the distance between the closest vector to the hyperplane is maximal" (Vapnik 2000b). The special property of SVM is the use of kernels, which map the original training data to a high-dimensional feature space. This is a way for the model to find a representation of the data that facilitates linear separation by the hyperplane with minimal error. For example, consider the linear kernel (Gram) matrix:

$$K = DD^T = \begin{pmatrix} \phi(d_1) \\ \vdots \\ \phi(d_\ell) \end{pmatrix} \begin{pmatrix} \phi(d_1)^T & \cdots & \phi(d_\ell)^T \end{pmatrix} \tag{6}$$

$$= \begin{pmatrix} \kappa(d_1, d_1) & \kappa(d_1, d_2) & \ldots & \kappa(d_1, d_\ell) \\ \kappa(d_2, d_1) & \kappa(d_2, d_2) & \ldots & \kappa(d_2, d_\ell) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(d_\ell, d_1) & \kappa(d_\ell, d_2) & \ldots & \kappa(d_\ell, d_\ell) \end{pmatrix}_{(\ell \times \ell)} \tag{7}$$

where $\kappa(d_1, d_2) = \phi(d_1)\phi(d_2)^T = \sum_{j=1}^{N} tf(t_j, d_1)\, tf(t_j, d_2)$ is a kernel function (Shawe-Taylor and Cristianini 2004). We can note the following:

• Each element in the kernel matrix is a dot product between two bag-of-words documents, which is a similarity measure. We can interpret elements with high values as an indication that two given documents have many terms in common.

• No matter the number of terms (columns) in the document-term matrix, the kernel matrix will always reduce to dimension ℓ × ℓ when N > ℓ. This means that the kernel can act as a dimensionality reduction technique. On the other hand, this also implies some information loss compared to the original document-term matrix.
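For the toy matrix of Eq. (5), the Gram matrix of Eq. (6)-(7) can be computed directly; this is also the matrix later passed to scikit-learn's kernel='precomputed' option:

```python
import numpy as np

D = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1]])  # the toy DTM from Eq. (5)

K = D @ D.T                   # (l x l) regardless of the vocabulary size N
print(K)
# [[3 2]
#  [2 3]]  diagonal: self-similarity; off-diagonal: the two shared terms
```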


A larger value of C allows for a smaller margin if more observations are classified correctly (overfitting), and conversely a smaller value of C allows for a larger margin (underfitting).

In multi-class classification, we will mainly use the one-versus-one classification strategy, which trains k(k − 1)/2 classifiers. A test observation is classified by applying all k(k − 1)/2 classifiers and counting how many times the observation is assigned to each class; it is then assigned to the class with the most frequent number of assignments (James et al. 2013). The other method, one-versus-rest, has the advantage of faster training time; however, it is not preferred for unbalanced data, since a minority class gets compared against all the other classes with more observations, resulting in a heavy imbalance.

The advantage of the SVM is flexibility: the kernel function makes it possible to distinguish classes even when there appears to be an overlap. The drawback is that these models are not easy to interpret and are not suitable for inference, for example understanding the relationship between the review rating y and the word features. However, the model is suitable for our purpose of achieving higher prediction accuracy.

2.4 SVR

Support Vector Regression (SVR) is a method for predicting a continuous dependent variable y given a set of features or independent variables (Smola and Schölkopf 2004). It is an extension of the SVM algorithm. In SVR, the goal is to seek coefficients that minimize a loss function which ignores errors within a certain distance ε from the fitted regression, called the ε-insensitive loss function (Smola and Schölkopf 2004; Vapnik 2000a).

2.5 Weighting

The document-term matrix of term frequencies assumes all terms to be equally important. This assumption creates a bias towards words that occur in many documents despite having little discriminating power (Manning et al. 2008). Therefore, the term frequencies can be weighted by the so-called inverse document frequency, idf:

$$\text{tf-idf}(t, d) = tf(t, d) \times idf(t) = tf(t, d) \times \ln\left(\frac{\ell}{df(t)}\right) \tag{8}$$

where $df(t)$ is the number of documents containing the term $t$ and $\ell$ is the number of documents in the training set. Before weighting the term frequencies, the document-term matrix is typically normalized to unit L1-norm, such that the row sums equal one, so that longer documents are not given more weight than shorter ones. The weighting is done by constructing a diagonal matrix $R_{(N \times N)}$ containing the idf weights for each term in the dictionary:


The test set is weighted with the same diagonal matrix, since the idf weights must be estimated from the training data alone: $D_{\text{tf-idf,test}} = D_{\text{test}} R_{\text{train}}$.
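A minimal numpy sketch of this weighting scheme, with the idf matrix R fitted on the training set and reused for the test set (the function and variable names are ours, not from the thesis code):

```python
import numpy as np

def fit_idf(D_train):
    """Diagonal idf matrix R of Eq. (8), fitted on the training DTM."""
    l = D_train.shape[0]             # number of training documents
    df = (D_train > 0).sum(axis=0)   # document frequency per term
    return np.diag(np.log(l / df))

def tfidf(D, R):
    D_l1 = D / D.sum(axis=1, keepdims=True)  # L1-normalise: row sums equal one
    return D_l1 @ R

D_train = np.array([[1, 1, 1, 0],
                    [1, 1, 0, 1]], dtype=float)
R_train = fit_idf(D_train)
X_train = tfidf(D_train, R_train)    # D_train R_train
# X_test = tfidf(D_test, R_train)    # the test set reuses the training idf
```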

2.6 Basic Concepts

A feature is a property that we hypothesize to be relevant to the decision or prediction we want to make. For example, the words "good" and "splendid" might be features found in reviews with high ratings. Cristianini and Shawe-Taylor (2000) make a distinction, pointing out that features are quantities within a feature space, while the original quantities are sometimes called attributes. Features are also called independent variables or predictors.

If a feature is relevant to the classification, removing it would degrade the performance (Agarwal and Mittal 2016). However, since we start with complete reviews, summing up to thousands of features, many of them are redundant. Feature selection is the task of removing these irrelevant and noisy features in order to improve the performance (classification accuracy) of the model.

The dimensionality of the data is determined by the size of the feature vector. As the number of features increases, performance can degrade, which is known as the "curse of dimensionality" (Cristianini and Shawe-Taylor 2000; Agarwal and Mittal 2016). Feature selection can therefore also be seen as choosing the most suitable representation of the data through dimensionality reduction.


3 Data

In this chapter we go through the data gathered in this study and provide descriptive statistics of interest.

We have defined a target population consisting of reviews and corresponding ratings by users on Prisjakt for the top 50 ranked consumer electronics products in different categories. The focus is on categories with more than a hundred reviews in their top 50 listings, in order to achieve reasonable sample sizes. From this population, we have drawn 12 samples, one from each category; see Appendix C for a complete overview of the distribution of these categories. The total sample size is n = 4664 observations after removing non-response. The non-response refers to reviews "under investigation" by Prisjakt, whose review text and rating were hidden during sampling.

Prisjakt is a price comparison site founded in 2002 in Sweden, known as PriceSpy in the United Kingdom and New Zealand, Prisjagt in Denmark and leDénicheur in France, among others. Apart from comparing prices across a wide variety of stores, Prisjakt allows users to write reviews and rate products they have purchased. The advantage of this setup is that the reviews are published on a single site, allowing for an easy overview, independently of where the purchases were made.

3.1 Web scraper

The data was collected from Prisjakt in February and March 2018 with a web scraper programmed in Python, using the BeautifulSoup package for parsing HTML and urllib for establishing a connection to the site. We extracted HTML attributes and created a data frame with columns for these attributes and one row per user review. The attributes of importance were the review rating, the anonymous user id and the review text. We also collected some contextual information, such as the date published and the user's previous number of reviews. We cannot share the original data collected, for privacy reasons.
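A rough sketch of this kind of scraper is shown below. The URL argument, HTML structure and class names are hypothetical placeholders; Prisjakt's markup has changed since 2018 and the original scraper is not shared.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def scrape_reviews(url):
    """Collect (rating, text) pairs from one product page (illustrative only)."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for review in soup.find_all("div", class_="review"):  # hypothetical selector
        rows.append({
            "rating": review.find("span", class_="rating").get_text(strip=True),
            "text": review.find("p", class_="review-text").get_text(strip=True),
        })
    return rows
```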

3.2 Measurement Scale

It is important to distinguish the scale used for the ratings. The products are given a rating or score by the user who submitted the review. It follows an ordinal scale, meaning that the values can be ordered or ranked, from 1 to 10, where 1 = worthless, 2 = bad, 3 = pretty bad, 4 = barely acceptable, 5 = acceptable, 6 = pretty good, 7 = good, 8 = very good, 9 = excellent and 10 = perfect (translated from Swedish).


The score can be interpreted as measuring the individual consumer's satisfaction with the product. In a recent version update of Prisjakt, the scale was reduced to a 5-point scale with 0.5 steps; however, each step carries the same information as before.

3.3 Token statistics

Table 3.1 presents summary statistics on the number of words per review, where the words are tokenized by spaces, including stopwords and numbers, i.e. the original reviews without any pre-processing.

The average number of words per review is approximately 104 in the entire dataset (n = 4664); however, when aggregating by category, it is notable that categories such as Processors, Consoles and Home cinema have both a lower average number of words per review and low median values. The median is the middle value in the ordered list of words per review. The range is defined as the difference in number of words between the longest and the shortest review.

Note that all categories except Routers have higher estimated standard deviations than their respective estimated means. This indicates that the distribution of the number of words per review has a high spread (variance), and that review lengths are, on average, far from their mean. For this reason, we have ordered Table 3.1 by the median, since it is more robust to outliers.

A small percentage of reviewers (0.2%) are quite enthusiastic, writing reviews of over a thousand words, as indicated by the range statistic in for example the Headphones, Smartphones and Monitors categories. The longest review, in the Smartphones category, is 1785 words long. On the other hand, approximately every fifth review (18.1%) is less than 20 words long, about two thirds (66.1%) of reviews contain less than 100 words, and 86.3% (4024) contain less than 200 words.

In Fig. 3.1 we can observe the top ten most frequent n-gram terms in the unbalanced and balanced datasets, after removing stopwords and punctuation symbols, converting +/- to plus/minus, and removing urls. Recall that the balanced dataset contains fewer observations and is balanced in terms of labels for categories 1-5 and 6-10 respectively, while the unbalanced dataset has a larger sample size but with class imbalance. The underscores (_) replace whitespace, which is how the n-gram features are represented in the document-term matrix.

There is little difference between the two sets apart from a difference in frequencies. The top ten unigrams are the same for the two sets, with only a difference in order. The main difference in the bigrams is that går_NOT_att (is not possible to) is present in the balanced set in Fig. 3.1b, while the unbalanced set contains helt_klart in Fig. 3.1a. The trigrams are somewhat more differently distributed, with the unbalanced set containing å_andra_sidan and bra_ljud_bra in Fig. 3.1a, while in Fig. 3.1b we see minus_går_NOT_att and plus_snygg_design.

3.4 Regex Conversions


Category         µ̂     σ̂     Median  Range   n
Smartwatches     144    154    104     887    160
TV:s             130    136     84     812    232
Activity bands   117    119     79     751    205
Pulse watches    121    124     78     659    154
Headphones       119    139     73    1476   1108
Monitors         106    116     73    1209    335
Routers           95     82     71     454    208
Smartphones      125    177     69    1783    831
Tablets          113    133     65     831    143
Home cinema       88     91     56     660    257
Processors        52     58     35     508    319
Consoles          63     98     25     954    712
All              104    133     62    1783   4664

Table 3.1: Summary statistics on the number of words per review, aggregated at the category level and sorted by median in descending order. µ̂ is the estimated average number of words per review, σ̂ the estimated standard deviation and n the number of instances in each category.

Many reviewers structure their reviews by listing pros and cons in a bullet-point manner, as follows:

+ Great price
+ Good design
- Bad quality
- Rather slow

Additionally, the list of pros and cons is often accompanied by a short summary where the reviewer gives an overall impression of the product. Intuitively, we hypothesise that including the + and - symbols can be beneficial for predicting the review rating; for example, more pros than cons could in general indicate a higher rating. For this reason, we convert + and - at the beginning of lines into text form, so that these symbols are not deleted in the automatic punctuation removal.

In addition, keeping emojis might be beneficial for classification. Since emojis often consist of punctuation symbols like :) or UTF-8 codes like <U+2764> (the heart symbol), it can be of interest to keep these symbols by converting them to, for example, "emoji_pos" or "emoji_neg". Surprisingly, only two such codes in the training set were found relevant; as such, we have decided to keep the list of codes very short for our purpose. For a full list of emojis used, see Appendix A.

Finally, certain stopwords used for negation, like "not" and "never", are often included in stopword lists by default. In order to prevent these words from being deleted when applying the stopword list, we apply a regex that removes them and prepends the prefix "NOT_" to the word that follows. The advantage of this approach is that the stopwords are removed while the important negation information is kept, which is especially important when using unigram features.
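The original conversions are implemented as R regexes (see Appendix A). For illustration, a Python equivalent of the url, plus/minus and negation conversions might look as follows (a sketch, not the thesis code):

```python
import re

def convert(text):
    text = re.sub(r"http\S+", "", text)            # remove urls
    text = re.sub(r"(\r\n)\+", r"\1 plus ", text)  # + at the beginning of a line
    text = re.sub(r"(\r\n)-", r"\1 minus ", text)  # - at the beginning of a line
    # remove the Swedish negation word and prepend NOT_ to the following word
    text = re.sub(r"\b(inte|ej|aldrig)\b\s+([a-öA-Ö]+)", r"NOT_\2", text)
    return text

print(convert("Det går inte att ladda"))  # "Det går NOT_att ladda"
```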


[Figure 3.1: The top 10 most frequent n-gram terms in each document-term matrix dictionary.
(a) Unbalanced dataset. Unigrams: bara, även, ljud, väldigt, helt, bättre, lite, minus, plus, bra. Bigrams: lite_mer, helt_enkelt, hela_tiden, helt_klart, bra_batteritid, helt_ok, väldigt_bra, plus_bra, bra_ljud, riktigt_bra. Trigrams: drar_ner_betyget, plus_snygga_plus, always_on_display, bra_ljud_bra, å_andra_sidan, bra_ljud_plus, plus_bra_batteritid, samsung_galaxy_s, plus_bra_ljud, riktigt_bra_ljud.
(b) Balanced dataset. Unigrams: bara, väldigt, bättre, ljud, även, helt, lite, minus, plus, bra. Bigrams: lite_mer, bra_batteritid, helt_enkelt, går_NOT_att, helt_ok, hela_tiden, väldigt_bra, plus_bra, bra_ljud, riktigt_bra. Trigrams: bra_batteritid_plus, minus_går_NOT_att, plus_bra_batteritid, plus_snygg_design, bra_ljud_plus, plus_snygga_plus, samsung_galaxy_s, drar_ner_betyget, helt_ok_ljud, helt_okej_ljud, plus_bra_ljud, riktigt_bra_ljud.]

Apart from these regexes, other preprocessing steps are performed, such as lowercasing and removing numbers; at the final step, tokenization is performed, where the tokenizer has the option to apply stemming.

3.5 Text preprocessing

In the dataset retrieved by the web scraper, we select the columns corresponding to the review rating and the product review text. We also create a binary dependent variable to be used for the SVM algorithm. The original rating categories are used as the fine-grained dependent variable.


4 Method

In this section we discuss the methods used in this study, including text preprocessing, creation of document term matrices, model training, feature selection and evaluation metrics.

4.1 Training and Test Sets

[Figure 4.1: Rating distributions of the training and test splits. Top row: fine-grained ratings (1-10); bottom row: binary labels (0/1). Unbalanced: N = 4664, n_train = 3264, n_test = 1400. Balanced: N = 1520, n_train = 1064, n_test = 456.]

First, we divide the dataset so that categories 1-5 (the negative class) and categories 6-10 (the positive class) are equally represented. By doing this, the baseline misclassification rate is balanced to 50%. However, we lose observations in this split, reducing the sample size from 4664 to 1520. In the unbalanced set, we keep all 4664 observations. Fig. 4.1 shows the fine-grained distributions at the top and the binary distributions at the bottom.


This makes it possible to study how the difference in sample size affects model performance. If the original dataset had been larger, a possibility would have been to uniformly balance all 10 categories to an equal number of observations, but even then many observations would be lost. Furthermore, we want the sample distribution to be representative of the distribution of the target population. We therefore accept the skewed distribution as is.

We divide the sample such that 70% of the reviews are in the training set and the remaining 30% in the test set. To ensure that the same datasets are generated on every run, we set a specific seed for the random number generator.
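In scikit-learn terms, the split amounts to the following sketch (the data variables and seed value are placeholders):

```python
from sklearn.model_selection import train_test_split

docs = ["Riktigt bra ljud", "Dålig batteritid"]  # placeholder review texts
ratings = [9, 3]                                 # placeholder fine-grained labels

train_docs, test_docs, y_train, y_test = train_test_split(
    docs, ratings, test_size=0.30, random_state=42)  # fixed seed, value assumed
```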

The rating distribution is heavily skewed: we observe a median rating of 9 and a mean of 7.88, visually confirmed in Fig. 4.1. We argue that this sample is representative of our target population, since product reviews tend to be skewed. In Appendix C, we can observe that most subcategories are distributed similarly in a unimodal manner (one peak), except for Activity bands, which shows a clear bimodal distribution and has the lowest mean rating (5.8).

4.2 Document Term Matrices

After text preprocessing, we convert the preprocessed text corpora into bag-of-words corpora in the form of a document-term matrix (DTM), as shown in Eq. (2). We create one DTM per training and test set using the text2vec package. The dictionary of features is created by selecting what type of n-grams to use and whether to apply a stopword list, which we provide with the tm package. Furthermore, the dictionary can be "pruned" by selecting the minimum proportion of documents that must contain a term. The dictionary is converted to a "vectorizer", which is used to create the document-term matrices.

After the DTMs have been created, we optionally weight them with the tf-idf scheme specified in Eq. (8). Instead of creating a diagonal matrix of idf weights, it is computationally faster to use fit_transform() in text2vec for weighting the training set, corresponding to $D_{\text{train}} R_{\text{train}}$, and transform() for weighting the test set, corresponding to $D_{\text{test}} R_{\text{train}}$.
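A scikit-learn analogue of this pipeline is sketched below: n-gram selection, stopword removal and pruning via min_df, with the tf-idf weights fitted on the training set only. Note that TfidfTransformer uses a smoothed idf and L2 normalisation by default, which differs slightly from Eq. (8); swedish_stopwords, train_docs and test_docs are assumed to be given.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer(ngram_range=(1, 2),           # 1+2-grams
                             stop_words=swedish_stopwords,  # assumed word list
                             min_df=0.001)                  # pruning threshold
D_train = vectorizer.fit_transform(train_docs)
D_test = vectorizer.transform(test_docs)

tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(D_train)  # analogous to fit_transform() in text2vec
X_test = tfidf.transform(D_test)        # analogous to transform()
```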

4.3 Model training

The main part of the model training is done in Python using the scikit-learn library. For the SVM, we use the "C-SVC" model for classification.

In order to speed up training, a precomputed kernel matrix is calculated for the SVM using Eq. (6). To select the multinomial classification strategy, we set the decision function shape to either "ovo" (one-versus-one) or "ovr" (one-versus-rest). In order to deal with class imbalance, we can also apply class weights, using different cost values C for minority and majority classes:

$$w_{\text{balanced},j} = n \cdot \left(k \cdot \sum_{i=1}^{n} y_{i,j}\right)^{-1} \tag{11}$$

where $w_{\text{balanced},j}$ is the weight for class $j = 1, \ldots, k$, $k$ is the number of unique classes, $n$ is the number of observations in the training set and $\sum_{i=1}^{n} y_{i,j}$ is the number (sum) of true labels in the $j$-th class in the training set. Eq. (11) assigns larger weights to minority classes and smaller weights to classes with more observations. The idea is that a larger cost C will overfit by decreasing the margin of the hyperplane for the minority classes, while a lower value of C relaxes the cost of misclassification, resulting in a larger margin, with the goal of underfitting for the majority classes.
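In scikit-learn, class_weight='balanced' computes exactly the weights of Eq. (11). A sketch of training C-SVC on the precomputed linear kernel, continuing the variables from the previous sketch:

```python
from sklearn.svm import SVC

K_train = (X_train @ X_train.T).toarray()  # linear Gram matrix, Eq. (6)
clf = SVC(C=1.0, kernel="precomputed",
          class_weight="balanced",         # the weights of Eq. (11)
          decision_function_shape="ovo")   # one-versus-one
clf.fit(K_train, y_train)

K_test = (X_test @ X_train.T).toarray()    # test rows against training documents
pred = clf.predict(K_test)
```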

For SVR, we use the liblinear implementation "LinearSVR", which uses a linear kernel by default. For this implementation, no precomputed kernel matrix is needed, since the liblinear algorithm is fast as it is. We supply the document-term matrix $D_{\text{train}}$ for training and $D_{\text{test}}$ for testing with the decision function. The hyperparameters ε (epsilon) and the cost C are used; other parameters are set to their defaults.
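The corresponding regression sketch, with the fine-grained ratings as real-valued targets:

```python
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error

svr = LinearSVR(C=1.0, epsilon=0.0)  # the defaults used in the experiments
svr.fit(X_train, y_train)            # fine-grained ratings 1-10 as targets
mse = mean_squared_error(y_test, svr.predict(X_test))
```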

4.4 Feature Engineering

In order to understand the effects of different feature engineering configurations, we will run a pipeline of models with the following different configurations:

• Balanced / unbalanced dataset

• Fine-grained / binary dependent variable

• Four different n-gram features: 1-, 2-, 3- and 1+2-grams

• Pruning (minimum proportion of documents containing a term): none (NA), 0.01, 0.001

• tf-idf / tf weighting

• Stopword removal (with/without)

• Stemming (with/without)

The total number of unique combinations of the above features results in $2^5 \cdot 4 \cdot 3 = 384$ different models to train. In the SVM model, we use the default parameters with C = 1, the one-versus-one classification strategy and the balanced class weights of Eq. (11).

For the SVR model, we train $2^4 \cdot 4 \cdot 3 = 192$ different models, since only the fine-grained dependent variable is valid for regression. Similarly, default hyperparameters are used, with C = 1 and ε = 0.

All results and feature engineering settings are stored in a local SQL database in order to access the results independently of R or Python.

After good feature engineering combinations have been selected as candidate models, we fine-tune the hyperparameters using a grid search with 10-fold cross-validation. Hsu et al. (2016) propose supplying sequences of exponentially growing values of C and then restricting the search to smaller intervals once reasonable values have been found.
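A sketch of such a search over exponentially spaced C values (the grid boundaries are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [2 ** k for k in range(-5, 6, 2)]}  # coarse exponential grid
search = GridSearchCV(SVC(kernel="precomputed", class_weight="balanced"),
                      param_grid, cv=10, scoring="f1_weighted")
search.fit(K_train, y_train)  # precomputed kernels are split pairwise by sklearn
print(search.best_params_)    # then refine the grid around this value
```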

4.5 Evaluation Metrics


                        Predicted
                        Positive   Negative
Actual  Positive        TP         FN
        Negative        FP         TN

Table 4.1: Confusion matrix.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \tag{12}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{14}$$

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{15}$$

Accuracy represents the number of correctly classified observations divided by the total number of observations.

Precision represents the number of true positives divided by the column sum of the predicted class: out of all documents predicted as positive, how many are actually positive (relevant)?

Recall represents the number of true positives divided by the row sum of the actual class: out of all documents that are actually positive, how many did the model find? Lastly, the F1 score is the harmonic mean of precision and recall, summarizing both into a single measure. It works well for unbalanced data, since it takes both precision and recall into account, and it ranges between 0 and 1, where 1 is the best possible value.

In classification, precision, recall and F1 scores are returned for each class. The unweighted average of these scores is called the "macro" score, e.g. $F1_{\text{macro}} = (1/k) \sum_{j=1}^{k} F1_j$, where $k$ is the number of classes. Since some of our datasets are heavily imbalanced, we instead use a weighted average based on the number of true instances in each class:

$$F1_{\text{weighted macro}} = \frac{1}{n} \sum_{j=1}^{k} n_j \, F1_j \tag{16}$$

where $n_j$ is the number of true instances in the $j$-th class and $n$ is the total number of instances. As a result, the weighted F1 score may not lie between precision and recall. We will refer to $F1_{\text{weighted macro}}$ as F1 macro in the results section for simplicity, keeping in mind that we use the weighted version.
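All of the scores above are available in scikit-learn; a sketch, assuming y_test and pred from the classification sketch earlier:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred, average="weighted")
rec = recall_score(y_test, pred, average="weighted")
f1 = f1_score(y_test, pred, average="weighted")  # Eq. (16), "F1 macro" in the tables
```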

For the regression approach, we want to measure how well the model fits the data. A common measure for this purpose is the Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed rating, $\hat{y}_i$ the predicted rating and $n$ the number of test observations. A lower MSE indicates a better fit.

5 Results

In this section we present the results of this study, beginning with the experimental results from feature engineering, followed by the fine-tuned models.

5.1 Experimental results

[Figure 5.1: Scatterplot of the effect of different n-gram features (1-, 2-, 3- and 1+2-grams) on the weighted F1 macro score, by number of features in the document-term matrix (in thousands). Panels: Balanced Binary, Balanced Fine-grained, Unbalanced Binary, Unbalanced Fine-grained. SVM with default hyperparameters, C = 1.]


but still perform reasonably well, with a maximum weighted F1 macro score of 0.811. The fine-grained unbalanced results in Fig. 5.1 only reach a weighted F1 of 0.297 with accuracy 0.346 for a bigram combination, even when using the balanced class weights described in Eq. (11). The fine-grained balanced results show even worse performance, reaching an F1 of 0.285 with accuracy 0.296 for a unigram combination.

[Figure 5.2: Scatterplot of the effect of different n-gram features (1-, 2-, 3- and 1+2-grams) on model MSE, by number of features in the document-term matrix (in thousands). Panels: Unbalanced Fine-grained and Balanced Fine-grained. SVR with default hyperparameters.]

In Fig. 5.2, we can observe the fine-grained results in terms of MSE. The SVR performs better (lower MSE) on the unbalanced data, reaching the lowest MSE of 4.37 with unigrams, tf-idf weighting, stopword and number removal, no stemming and a pruning threshold of 0.001. The balanced version reaches its lowest MSE of 6.53 for a 1+2-gram combination with tf-idf weighting, stemming, stopword removal and pruning set to 0.01.

In Fig. 5.3, boxplots for all feature combinations are shown. The middle line of each box marks the median, while the box itself covers the interquartile range (IQR) between the 25th (Q1) and 75th (Q3) percentiles. The whiskers extend to the most extreme observations within 1.5 times the IQR from Q1 and Q3; any observations beyond that are marked as outlier dots.


range. This could be because stopword removal has a negative effect when the number of features is small, especially for shorter reviews. As discussed in the data section, almost every fifth review is less than 20 words long, so for these observations it may be important to keep as many features as possible. Pruning the dictionary seems to have no better effect than no pruning at all. Finally, tf-idf weighting tends to increase the F1 score.

For the features of the multiclass SVM in Fig. 5.3b, the boxes are much wider compared to Fig. 5.3a, indicating a higher variance in the results. As in the binary classification, 1- and 1+2-gram combinations perform better than bigrams and trigrams in terms of F1. Pruning has a more positive effect here, with the setting 0.001 performing best. Most notably, tf-idf weighting has a negative effect on the F1 score. This is due to the heavy class imbalance: the classes with many observations contain high frequencies of certain words with high discriminative power, and the idf function assigns these words a lower weight despite their relevance for classification. For example, in categories 9 and 10, words such as great and perfect are good features for teaching the model to discriminate between the sentiments; however, the idf function weights their term frequencies down, since the document frequencies in the majority classes are much larger than in the minority classes.


[Figure 5.3: Boxplots of results for all feature engineering configurations (stemming, n-grams, stopword removal, tf-idf weighting, pruning). (a) Binary data SVM (F1 macro). (b) Fine-grained data SVM (F1 macro). (c) Fine-grained data SVR (MSE; lower indicates better performance).]


Model            F1      Accuracy  Precision  Recall  Tf-idf  N-gram  Dim     C
SVM Binary
(A) Balanced     0.8002  0.8004    0.8033     0.8033  X       1+2     8695    0.800
(B) Unbalanced   0.8637  0.8636    0.8638     0.8638  ×       1+2     167058  0.005
SVM Multiclass
(C) Balanced     0.2448  0.2807    0.2489     0.2489  ×       1       10858   0.050
(D) Unbalanced   0.3207  0.3679    0.3144     0.3144  ×       1+2     167058  0.050

Table 5.1: Classification results for the fine-tuned models. X stands for true (tf-idf) and × for false (tf).

5.2 Fine-tuned models

From the experimental results, we selected the models with the best F1, accuracy and MSE scores and performed a grid search with 10-fold cross-validation to find optimal hyperparameter values. Table 5.1 presents the classification results for the two classification methods, binary (k = 2) and fine-grained (k = 10), each evaluated on the balanced and unbalanced datasets.

Starting with model (A) in Table 5.1, the performance metrics are around 0.8. The model is configured with pruning set to 0.001, stemming, removal of stopwords and numbers, and the regexes for emojis and plus/minus symbols, as well as prepending NOT_ to words preceded by negation words.

Model (B) performs better on the unbalanced data, where more observations are used, reaching metrics around 0.86. Both models (A) and (B) use a combination of 1+2-grams, but the dimensionality (Dim) is much higher in model (B).

For the multiclass classification, models (C) and (D) both perform worse than the binary classification as a result of the heavy class imbalance. The use of class weights had a marginally positive effect of a few percentage points. Model (D) performs slightly better than model (C), with a weighted F1 macro of 0.3207 and accuracy 0.3679, compared to an F1 of 0.2448 and accuracy of 0.2807 for model (C). The optimal hyperparameter C = 0.05 was found for both models, indicating that a larger-margin hyperplane was preferred, as opposed to, for example, model (A) with C = 0.8.

Table 5.2 shows the MSE of the two fine-grained regression models. Model (F) has an MSE of 4.1151 using unigram features, with optimal hyperparameters C = 1 and ε = 1.5. This model did not apply stemming; however, stopwords and numbers were removed, the same regexes as in model (A) were applied, and pruning was set to 0.001. Model (E), on the balanced dataset, has a higher MSE of 6.8428, indicating worse predictive performance; however, the number of features in this model is only 978 despite using a 1+2-gram combination, due to the dimensionality reduction from setting the pruning parameter to 0.01. For both models, tf-idf weighting was found to increase performance, which is in accordance with the experimental results.


Model            MSE     Tf-idf  N-gram  Dim   C  ε
SVR
(E) Balanced     6.8428  X       1+2      978  1  1.0
(F) Unbalanced   4.1151  X       1       5170  1  1.5

Table 5.2: Regression results for the fine-tuned models.

settings and hyperparameters as previously described in 5.1. Here, a positive difference in F1 scores for models A-D indicates better performance compared to the baseline, while for models E-F a negative MSE difference indicates better performance compared to the baseline.

The SVR models E-F show a decrease in MSE for almost all combinations of regex conversions, except for (n), the NOT_ prepending conversion, which is neutral on its own. The largest decrease is found for bl+e+p+n (i.e. using all regex conversion features), for both the balanced model E (-0.1170) and the unbalanced model F (-0.0254).

The SVM models, on the other hand, do not show the same level of improvement from the regex conversions. For models B-D, a majority of the regex features decrease the F1 score compared to the baseline, except for model C, which sees an increase in F1 for bl+p (converting +/-) and bl+p+n (converting +/- and prepending NOT_). Model A shows an increase for slightly more regex conversion features, with bl+p and bl+p+n being the most pronounced, followed by bl+e+p and bl+e+p+n. As such, it is the only SVM model that benefits slightly from incorporating the emoji conversions.

                           SVM (∆ F1)                           SVR (∆ MSE)
Regex features             A        B        C        D         E        F
baseline (none of e,p,n)   0.0000   0.0000   0.0000   0.0000    0.0000   0.0000
bl+e                      -0.0136  -0.0012  -0.0036  -0.0011   -0.0273  -0.0050
bl+e+n                    -0.0136  -0.0012  -0.0036  -0.0011   -0.0276  -0.0051
bl+e+p                     0.0135  -0.0023  -0.0029  -0.0043   -0.1162  -0.0254
bl+e+p+n                   0.0135  -0.0023  -0.0029  -0.0043   -0.1170  -0.0254
bl+n                       0.0000   0.0000   0.0000   0.0000    0.0000   0.0000
bl+p                       0.0179  -0.0026   0.0024  -0.0054   -0.0954  -0.0203
bl+p+n                     0.0179  -0.0026   0.0024  -0.0054   -0.0955  -0.0202


6 Discussion

In this thesis we have studied the effects of applying three different models to the task of sentiment analysis of product reviews in Swedish, in the domain of consumer electronics. The results showed that the SVR model performs rather well on fine-grained data, even when the distribution of the review ratings is heavily skewed. The fine-grained SVM model used for classification suffered from the class imbalance, as shown by the low accuracy and F1 scores, even when using the one-versus-one classification strategy. In classification, we found that by dividing the dataset at a threshold, with categories 1-5 as negative and 6-10 as positive, the classification reached high accuracies and F1 scores of around 86% for the unbalanced fine-tuned model.

The models trained on the larger datasets performed better on average than those trained on the smaller ones. This is most likely explained by the fact that the unbalanced, larger dataset contained more observations in the positive categories 6-10, causing a bias towards these categories. However, the weighted F1 macro scores indicated that, on average, a good balance between precision and recall was achieved in each class.

We also showed that by exploring various feature engineering combinations we were able to understand the effects of the features in terms of the weighted F1 score and MSE. For example, a combination of 1+2-gram features was found to increase performance in general. We also explored the effects of our implementation of three different regex conversions, showing that the regression model improved the most from these conversions, compared to the classification models. However, the differences were marginal and no general conclusions can be drawn.

Furthermore, it is interesting to discuss the effect of tf-idf in relation to how the average review length differs per category, as we saw in the data section. Removing stopwords may degrade performance for shorter reviews, where every feature can matter. Some reviewers may also have a unique writing style: they may, for instance, value a short, concise style in which stopwords and even punctuation symbols are important stylometric features. For this reason, it may be interesting to investigate string kernels, where the features are character sequences instead of word sequences, to test this hypothesis further.


or the reverse?


A Regex

Convert + at the beginning of lines:
    Find:    "(\\r\\n)(\\+)"
    Replace: "\\1 plus "

Convert - at the beginning of lines:
    Find:    "(\\r\\n)(\\-)"
    Replace: "\\1 minus "

Remove anchors and urls:
    Find: "<.*?>"       Replace: ""
    Find: "http\\S+"    Replace: ""

Prepend NOT_:
    Find:    "\\b(inte|ej|aldrig)\\b\\s+([a-öA-Ö]+)"
    Replace: paste0("NOT_", "\\2")

Replace emojis by sentiment (code in R):

emojis_positive <- list(":-D", ":)", ":')", ":-)", ":)", ":-]", ":]",
                        ":-3", ":3", ":->", ":>", "8-)", "8)", ":-}", ":}",
                        ":o)", ":c)", ":^)", "=]", "=)", ":-))", ":')",
                        ";)", ";-)", ";D", ";-D", ":P",
                        "<U\\+2764>",  # heart
                        "<U\\+2714>")  # check mark
emojis_negative <- list(":\\(", ":\\'-\\(", ":-\\(", ":\\'\\(", ";\\(",
                        ":c", ":\\|", ":-\\|")
# "|" is the OR operator
find.emojis_positive <- paste(unlist(emojis_positive), collapse = "|")
find.emojis_negative <- paste(unlist(emojis_negative), collapse = "|")
text <- gsub(find.emojis_positive, "emoji_pos", text)
text <- gsub(find.emojis_negative, "emoji_neg", text)  # negative counterpart


B SVR features by balance

[Figure B.1: Boxplots of MSE for 192 different feature engineering configurations (stemming, n-grams, stopword removal, tf-idf weighting, pruning). Balanced fine-grained data.]

[Figure B.2: Boxplots of MSE for the same 192 feature engineering configurations. Unbalanced fine-grained data.]


C Product Review Ratings by Category

[Figure C.1: Product review rating distributions (counts per rating, 1-10) for each of the twelve categories: Smart watches, Tablets, TV:s, Pulse watches, Routers, Smartphones, Home cinema, Monitors, Processors, Activity bands, Consoles and Headphones.]


Bibliography

Basant Agarwal and Namita Mittal. Prominent Feature Extraction for Sentiment Analysis. Socio-Affective Computing. Springer International Publishing, 1st edition, 2016. ISBN 978-3-319-25341-1.

Ali Basirat, Christian Hardmeier, and Joakim Nivre. Principal word vectors. arXiv preprint arXiv:2007.04629, 2020.

F. Chollet and J. J. Allaire. Deep Learning with R. Manning Publications, 2018. ISBN 9781617295546. URL https://books.google.se/books?id=xnIRtAEACAAJ.

Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000. ISBN 978-0-521-78019-3.

Steve R. Gunn et al. Support vector machines for classification and regression. ISIS technical report, 14(1):5-16, 1998.

Joop Hox. Multilevel Analysis: Techniques and Applications. Quantitative Methodology Series. Routledge Academic, 2nd edition, 2010. ISBN 9781848728462.

Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification. 2016.

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning, volume 112. Springer, 2013.

R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Pearson Prentice Hall, 2007. ISBN 9780131877153.

Borislav Kapukaranov and Preslav Nakov. Fine-grained sentiment analysis for movie reviews in Bulgarian. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 266-274, 2015.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013b.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002. ISBN 9780262194754.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, USA, 2004. ISBN 0521813972.

Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199-222, August 2004. ISSN 0960-3174. URL https://doi.org/10.1023/B:STCO.0000035301.49549.88.

Robert A. Stine. Sentiment analysis. Annual Review of Statistics and Its Application, 6:287-308, 2019.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

George Tsatsaronis and Vicky Panagiotopoulou. A generalized vector space model for text retrieval based on semantic relatedness. In Proceedings of the Student Research Workshop at EACL 2009, pages 70-78, 2009.

Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer, 2nd edition, 2000a. ISBN 9781441931603.

Vladimir Naumovich Vapnik. The Nature of Statistical Learning Theory, Second Edition. Statistics for Engineering and Information Science. Springer, 2000b. ISBN 978-0-387-98780-4.
