
Student Thesis

Master’s level (second cycle)

Preprocessing method comparison and model tuning for natural language data

Author: Peter Tempfli

Supervisor: William Wei Song and Serena Barakat
Examiner: Moudud Alam

Subject/main field of study: Microdata Analysis
Course code: MI4002

Higher education credits: 15 ECTS-credits
Date of examination: 02/06/2020

At Dalarna University it is possible to publish the student thesis in full text in DiVA. The publishing is open access, which means the work will be freely accessible to read and download on the internet. This will significantly increase the dissemination and visibility of the student thesis.

Open access is becoming the standard route for spreading scientific and academic information on the internet. Dalarna University recommends that both researchers and students publish their work open access.

I give my/we give our consent for full text publishing (freely accessible on the internet, open access):


Abstract

Twitter and other microblogging services are a valuable source for almost real-time mining of marketing, public opinion and brand-related consumer information. As such, the collection and analysis of user-generated natural language content is a focus of research on automated sentiment analysis. The most successful approach in the field is supervised machine learning, where the three key problems are data cleaning and transformation, feature generation, and model choice and training parameter selection. Papers in recent years have thoroughly examined the field, and there is agreement that relatively simple techniques such as bag-of-words transformation of text and naive Bayes models can generate acceptable results (F1-scores between 75% and 85% for an average dataset), while fine-tuning can be difficult and yields relatively small gains. However, a few percent in performance, even on a middle-sized dataset, can mean thousands of better classified documents, which can translate into thousands of missed sales or angry customers in any business domain. Thus this work presents and demonstrates a framework for better tailored, fine-tuned models for analysing Twitter data. The experiments show that Naive Bayes classifiers with domain-specific stopword selection work best (up to an 88% F1-score); however, performance decreases dramatically if the data is unbalanced or the classes are not binary. Filtering stopwords is crucial to increase prediction performance, and the experiment shows that a stopword set should be domain-specific. The conclusion is that there is no single best way to train models and select stopwords in sentiment analysis. Thus the work suggests that there is space for a comparison framework to fine-tune prediction models for a given problem: such a framework should compare different training settings on the same dataset, so the best trained models can be found for a given real-life problem.

Keywords


Table of contents

1. Introduction
1.1 The aim of this work
2. Previous work in the field
3. Machine learning approach
3.1. Supervised Machine learning
3.2. Bag of words methods
3.3. TF-IDF weighting
4. The dataset
4.1. Word frequency in the dataset
4.2. POS distribution in classes
4.3. Comparison of the predefined classes
5. Building the right training dataset
5.1. Comparing different dataset sizes with model performance metrics
6. Pre-processing
6.1. String Normalization
6.2. Tokenization
6.3. Stopwords
6.4. Stemming / Lemmatisation
6.5. N-gram converting
Infrequent word filtering
6.6. Synonyms
6.7. Part of Speech tagging
7. The experiment
7.1. Classifiers
7.2. Pre-processing datasets
7.3. Comparison matrices
8. Discussion
9. Conclusions and future work


1. Introduction

Sentiment analysis is a document classification problem in the domain of natural language processing. In simple terms, sentiment analysis aims to detect the sentiment of a 'subject' of the communication towards an 'object'. As an example, product reviews can be analysed so that an automated system can classify whether a review is positive, negative or neutral. In more advanced classification systems the sentiment itself is not a list of classes (such as positive, negative or neutral) or a scale, but rather a multi-dimensional system (such as anger, joy, interest...) in which every dimension can have a value (Snyder and Barzilay, 2007). Also, sentiment analysis is not strictly a classification problem: advanced sentiment analysis problems are often about detecting subjectivity, polarity and subject/object relations in a natural language document. The last problem (the subject/object relationship) also belongs to the domain of entity detection. The most challenging problems in automated sentiment analysis are mostly connected with linguistic features of the text that are above the vocabulary level, for example negation, specific word orders which change meaning, modal verbs and sarcasm.

This work focuses on the classification problem in the domain of sentiment analysis. Classification of natural language documents is a problem that appears not only in sentiment analysis; it is a broader area. In simple terms it can be described as automatically adding one or more labels to a document by analysing its content. For example, a classification engine can add 'economy', 'politics' or 'culture' tags to newspaper articles, or an email filtering engine can classify emails as 'spam' or 'important'. This problem is very similar to classifying a product review as 'positive' or 'negative'.

It is important to mention that there are many ways to classify document sentiment: on a simple binary (positive/negative) system, on a 3-class system (positive-neutral-negative), on a scale (one to ten) or on a many-dimensional system. As many classifier algorithms have limitations, not all of them can be used for every classification system. Thus, before selecting a system of classes, it is important to take into account that this choice can limit which classifiers can be used.

The application of sentiment analysis techniques is very wide, and new areas might evolve in the future. Some of the current domains:

● Marketing and monitoring brand reputation (the dataset of the current work is a typical example of this). The typical process is to set up automated keyword monitoring on the critical channels and then apply pre-trained sentiment analysis classifiers to the collected data. The process can help to point out the critical areas where brand reputation can be improved.


● Automated political surveys. Using sentiment analysis tools, high volumes of tweets can be analysed automatically to show the effects of individual public messages.

● Sentiment-based stock price predictions. There are some attempts to build stock (or other goods) trading systems based on media and social media message analysis. In this area speed is critical and frequent model re-training can be crucial.

● Customer support -- an integrated sentiment analysis engine can help to prioritize messages from 'angry' customers, so help-desk agents can solve their cases first. Advanced customer-support and CRM software already has integrated automated text analysis tools.

This work focuses on a specific kind of natural language document: short 'tweets'. Twitter is a microblogging service started in 2006, currently the most popular of this type. Users share 140-character messages, so from the sentiment analysis perspective these observations are rather short and very subjective. This makes it an excellent platform for sentiment analysis, and a large amount of research works with data gathered from Twitter. According to Wikipedia, 37% of the content is conversational and 40% is 'pointless babble', which also falls under the subjective communication category (Wikipedia, 'Twitter', 2016.05.25).

1.1 The aim of this work

Analysis of Twitter-gathered data is a relatively well-known area in natural language processing and sentiment analysis, and it has many commercial implementations as well, as demonstrated in the previous section. For data gathered from Twitter, there seems to be a consensus that machine learning algorithms using even relatively simple feature-generating methods can create results which are usable not only for research problems but also in production environments with real-life use cases. Such applications create very valuable information for organizations, so a correct and well-performing implementation is critical.


dataset (which is described in The Dataset section). Another dataset is used in order to make sure the findings are generic enough and not specific to only one dataset. As the work's contribution, a comparison framework is introduced which demonstrates how to implement and fine-tune models. That said, the research question of this thesis work can be formulated as:


2. Previous work in the field

Twitter data is well researched, as Twitter is a very approachable source of valuable information. The first papers about sentiment analysis using Twitter data as a source started to appear around 2009. Twitter offers public APIs with various capabilities to collect enough data (however, as the company also sees the value of the data, the public API is becoming more and more restrictive). Nevertheless, there are many publicly available datasets for sentiment analysis. The following pattern is common in papers about sentiment analysis, especially those about Twitter data.

Authors describe the problem domain and the aim of the research. The most common theme is product review analysis, and the aim of the study is to find features which can improve the classification. Many authors focus on 'generating a list of product attributes', as 'microblogging web-sites are rich sources of data for opinion mining and sentiment analysis.' Authors look for 'methods for automatically distinguishing between positive and negative reviews' (Dave, Lawrence and Pennock, 2003), (Pak and Paroubek, 2010).

Authors describe the dataset which they use. Some use datasets which are collected automatically and try different techniques to create training datasets. In the dataset section many authors point out the problematic nature of collecting data from Twitter; some key issues are 'Labeler quality', 'Number of labels provided by the labelers', 'Labeler bias' and 'Different labeler bias' (Barbosa and Feng, 2010). This problem is addressed in the 'Comparison of the predefined classes' section of this work. Some use unsupervised learning methods to create a training dataset, or use Twitter-specific features, 'using distant supervision, in which [our] training data consists of tweets with emoticons' (Go et al., 2009). Others use already available datasets collected by others or crowd-sourced solutions, like the IMDB (Pang, Lee and Vaithyanathan, 2002). There are also 'standard' datasets available which are used to control results, such as the Stanford Twitter Sentiment Gold and the Sentiment Strength Twitter Dataset (Saif et al., 2013), among others.


○ Linguistic techniques and substitutions (Dave, Lawrence and Pennock, 2003)
○ N-gram conversion. 'Word n-grams features are the simplest feature for Twitter sentiment analysis.' (Jianqiang et al., 2017). There are many opinions about the optimal n-gram size; this work will argue that this setting probably depends on the dataset and the domain.
○ POS (Part of Speech) conversion. '[These] approaches have shown that the effectiveness of using POS tags. The intuition is certain POS tags are good indicators for sentiment tagging' (Wiebe and Riloff, 2005), (Barbosa and Feng, 2010).
○ Text pattern recognition (POS groups, syntactic trees...)
○ Semantics and syntax based techniques. These techniques are usually based on linguistic studies such as 'Contextual semantic approaches', 'Conceptual semantic approaches' and 'Entity-Level Sentiment Analysis Approaches' (Saif, Fernandez and Alani, 2016). These approaches on the


3. Machine learning approach

3.1. Supervised Machine learning

In most cases, supervised machine learning is the right approach to sentiment analysis. The main question is how to define the features which describe a document: as a natural language document is, from a computational perspective, a list of characters (a string), most machine learning algorithms cannot take it as input directly (some neural network models can). So before training a model, the document should be transformed into a processable format, which in most cases is a 'feature set' that can be represented as a list of numbers. Every number in the list represents a value for a given feature (a dimension). It is important to understand that before transforming the document into a list of features, the features themselves should be defined. When implementing the transformation, this is a very precise step (however, most machine learning libraries have already implemented methods for this). The 'Bag of words' section describes how documents and document features are represented in a form which is consumable for classifier algorithms; more practical details are discussed in the 'Pre-processing' section.

3.2. Bag of words methods

A document is a collection of tokens. A token is a word, a punctuation mark or another type of string (a number, another symbol, a URL, an emoticon, etc.). A document can therefore be represented as a vector, where every element is the count of a given token in the document. It is very important to process the whole dataset before transforming individual documents into vectors -- otherwise the order of features is different in every vector, so they cannot be compared. As an example, two short documents can be represented as vectors as shown in Table 1.

Table 1: Bag of words demonstration


The method is called 'bag of words' because the vector does not tell the position of a given word in the document -- it tells only the count; so when transforming a document into a 'bag of words' vector, we throw every word into a metaphorical 'bag' in which the position is lost. A collection of documents is thus a matrix, where rows represent the documents and columns the words (tokens). As this is a numeric format (a DataFrame in some frameworks), most machine learning algorithms can use it as an input format. It is important to mention that the matrix representation of a document collection results in extremely sparse matrices with a very large number of features, which can be problematic for some algorithms. This can create memory problems in some implementations, but many frameworks have methods to handle the issue. Another way to fix it is to limit the number of features to the most significant ones: as a simple method, tokens can be filtered by minimum occurrence or the overall token number can be limited, or more sophisticated statistical methods can be used.
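As an illustration of the above, the following sketch builds such a sparse document-term matrix with scikit-learn's CountVectorizer; the library choice and the filtering parameters are assumptions for demonstration, not the exact setup used in this work.

```python
# Bag-of-words sketch: documents become rows, tokens become columns.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "@united would love help getting there today",
    "@united gate agent at EWR was not helpful today",
]

# min_df filters infrequent tokens and max_features caps the vocabulary size --
# two of the simple feature-limiting methods mentioned above.
vectorizer = CountVectorizer(lowercase=True, min_df=1, max_features=5000)
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned token columns
print(X.toarray())                          # token counts per document
```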

3.3. TF-IDF weighting

Given that the distribution of word frequency in English (and in any other natural language) is highly long-tailed, some words may appear in almost every document while other words are very rare. This creates a very high imbalance in the numbers in the dataset representation matrix, and in some cases it can even hit software limitations (dividing small numbers by very large numbers can result in values which are impossible to represent in a given software architecture, so some calculations simply cannot be made). Some classifiers manage this problem well (for example, Naive Bayes), but not all of them. Because of this, a normalization method is very often applied to the bag-of-words matrix: the so-called 'term frequency-inverse document frequency' method. To every entry of the bag-of-words matrix the following function is applied:

TFIDF(t, d, D) = tf(t, d) * idf(t, D)
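A minimal sketch of how this weighting can be applied in practice, assuming scikit-learn's TfidfVectorizer (which combines the counting step and the reweighting; not a tool prescribed by this work):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "late flight and rude service",
    "great service and friendly crew",
]

# Each cell of the resulting matrix holds tf(t, d) * idf(t, D),
# with smoothing and row normalisation applied by default.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(X.toarray())
```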


4. The dataset

The source of the data (appen.com, Open Source Datasets, 2016.05.25.) describes the dataset as follows:

A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

The same dataset is also available at Kaggle with some user comments and analytics (www.kaggle.com, 2016.05.25).

The dataset contains 14640 rows, and this work focuses on two columns: text and airline_sentiment. The text field contains the actual content of the Twitter message, for example:

Table 2: Dataset demonstration

row number   text                                                 airline_sentiment
1160         @united is unfriendly screw family, that hates...    negative
1161         @united gate agent at EWR " if you are disabl...     negative
1162         @united it won't help...been there done that.        negative
1163         @united forces us to check our baby bag on ove...    negative
1164         @united would love help getting there today. I...


In this work the other properties (id, dates, timezone, sentiment confidence) are not used, for two reasons. Firstly, there is not enough documentation about how these properties are created. Secondly, using an overly complex dataset as input would widen the scope of this research too much.

The airline_sentiment field contains the sentiment of the tweet, which takes one of three values: positive, neutral or negative.
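A small sketch of loading the two columns used in this work and inspecting the class balance; the file name Tweets.csv is an assumption based on the Kaggle version of the dataset, while the column names are the ones described above.

```python
import pandas as pd

# Load the airline tweets and keep only the two columns used in this work.
df = pd.read_csv("Tweets.csv")
data = df[["text", "airline_sentiment"]]

# Distribution of the three sentiment classes (cf. Figure 1).
print(data["airline_sentiment"].value_counts())
```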

(Figure 1: Distribution of sentiments)


(Figure 2: Distribution of airlines with sentiments. Shows the overall airline distribution across the dataset.)


It is worth noting that the dataset is highly biased in two dimensions. First, the majority of tweets have a negative sentiment classification. This is a highly problematic issue for training, as the dataset is biased in the target dimension. We will come back to this problem in the section about balancing the dataset. Secondly, the amount of data about specific airlines is not equal. This can be explained by business and marketing factors (larger airlines tend to have more mentions). The more problematic issue, though, is that the sentiment distribution inside the airline subsets is not equal: for example, the plot demonstrates well that customers proportionally write more positive mentions about Delta than about United, even though the absolute numbers of positive mentions are similar. This is a problem because a classification algorithm can conclude from it that documents which contain the token 'Delta' tend to be positive. Even though this is true for the training dataset, the classification itself should be unbiased.

4.1. Word frequency in the dataset

As the bag-of-words representation of documents is based on word presence, it is worth mentioning the word frequencies in the description of the dataset. I will argue that word frequency can even give some insight into the dataset.


Table 3: Word frequency with stopwords in the dataset

n   word   count
0   @      16583
1   .      13603
2   to     8644
3   i      6629
4   the    6054
5   !      5312
6   ?      4678
7   a      4473
8   you    4375
9   ,      4156


After applying the stopword filtering transformation, the dataset shows a different face. It is rather obvious that the remaining frequent words are more typical of the airline industry.

Table 4: Word frequency without stopwords in the dataset
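Frequency tables like Table 3 and Table 4 can be produced with a simple token count before and after stopword filtering; the sketch below assumes NLTK's tweet tokenizer and English stopword list, which are not necessarily the tools used for the tables above.

```python
from collections import Counter

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer

texts = [
    "@united is unfriendly, screw family",
    "@united would love help getting there today!",
]
tokenizer = TweetTokenizer(preserve_case=False)
stop = set(stopwords.words("english"))

tokens = [t for text in texts for t in tokenizer.tokenize(text)]
print(Counter(tokens).most_common(10))                               # with stopwords
print(Counter(t for t in tokens if t not in stop).most_common(10))   # stopwords filtered
```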


4.2. POS distribution in classes

The part-of-speech distribution in a text is relatively stable, given that the text is large enough (usually dominated by nouns, adjectives, personal pronouns, prepositions and determiners); however, the exact shape of the distribution varies between different styles, genres and even authors. Given this, it is useful to examine the part-of-speech distribution between the different pre-labelled sentiment classes.

(Figure 4: Part of Speech Distribution)

It is easy to see that there are significant differences in the part-of-speech distribution between the categories. For example, among positive tweets 11% of tags are NNP (proper nouns), but among negative tweets only 6%. The JJ (adjective) percentage also drops significantly from positive to negative (8.3% and 5.6%). On the other hand, negative tweets tend to have more prepositions (IN). In my opinion this indicates that POS tagging conveys some information about the classes.
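A sketch of how such a per-class POS distribution can be computed, assuming NLTK's tagger, which produces Penn Treebank tags such as NNP, JJ and IN (the tooling behind Figure 4 is not specified here):

```python
from collections import Counter

import nltk   # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

positive_tweets = ["Thank you for the great flight today!"]   # illustrative input

# Tag every token in the class and count the relative tag frequencies.
tags = [tag for tweet in positive_tweets
        for _, tag in nltk.pos_tag(nltk.word_tokenize(tweet))]
counts = Counter(tags)
total = sum(counts.values())

for tag, count in counts.most_common():
    print(tag, round(count / total, 3))   # compare e.g. NNP and JJ shares across classes
```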

4.3. Comparison of the predefined classes


In order to check the quality of the predefined classes, the dataset is classified by a sentiment analysis engine. Then the results are compared with the predefined classes.

For this, Amazon Comprehend is used (Amazon Comprehend, n.d.). As the documentation says, 'Amazon Comprehend uses a pre-trained model to examine and analyze a document or set of documents to gather insights about it. This model is continuously trained on a large body of text'.

(Figure 5: Predefined vs Engine classes)

(Figure 6: Engine classes vs Predefined)

Overall, 67% of the classifications match. What is more interesting, the distribution of the non-matching classes is not equal. The sentiment engine's classes mostly match the predefined neutral and positive classes; however, a large percentage of the predefined negative documents are classified as neutral by the engine. This tendency can be seen on both plots (Predefined


the dataset as Negative (so the non-matching classes) have a 0.75 score. The documents which are classified as Neutral and predefined as Neutral have a 0.84 score.

Table 5: Predefined and Engine classes proportion

Predefined class      Amazon class   Amazon score (Neutral)
Neutral               Neutral        0.84
Negative or Neutral   Neutral        0.79
Negative              Neutral        0.75

From this the following conclusion can be drawn: the predefined classes are not necessarily faulty (nor is the engine necessarily wrong); rather, the boundaries between the classes are in different positions. In a classification problem like sentiment analysis this is absolutely acceptable, as there are no strict borders between sentiments. Pang, Lee and Vaithyanathan (2002) conclude that machines outperform humans at classification; so even though different models can give different results, it is still more feasible to use automated sentiment analysis than manual classification.

5. Building the right training dataset

When using supervised machine learning methods, an already existing dataset with target values is needed in order to train (or fit) the classifier model. Another dataset is needed which is not shown to the classifier during training -- this way the performance of the trained classifier can be evaluated on unseen data before using it in production. Evaluating, in this sense, means comparing the existing target data with the predictions given by the trained classifier. Evaluation can be as simple as counting the right predictions (so we can get an overall percentage), or it can use more sophisticated metrics such as sensitivity, selectivity, F1-score, etc.


● On the other hand, it is also beneficial to test the model performance on a training dataset which has relatively small differences in targets. The rationale behind this is that this way it is possible to test model performance for rare cases, not only for the common ones.

● Is large enough, so the randomness of outcomes is relatively small

● On the other hand, it is not too large, because there is a trade-off between the training set and the test set. The more data used for testing, the less remains for training, and the model performance can suffer.

There are no clear definitions of what a good size for a test set is because, as we have seen, there are many factors. The decision heavily depends on the size of the available data, on the characteristics of this data and on the domain.

5.1. Comparing different dataset sizes with model performance metrics

In order to compare different test sizes, the following algorithm is used:

● Split the dataset into two parts, a train set and a test set, with proportion X for the test part
● Train a classifier model on the train set
● Predict the classes of the test set using the trained Naive Bayes classifier; the proportion of correct outcomes is S (the score)
● Map S to X
● Repeat the preceding steps for every X between 0 and 1

(There is a certain randomness in this function, because the split into test and train sets is random. In order to reduce this, S is calculated 10 times for every X and the average is taken.)
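A minimal sketch of this procedure, assuming scikit-learn for the bag-of-words transformation, the random splits and the Naive Bayes classifier:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def score_for_test_size(texts, labels, x, repeats=10):
    """Average accuracy S of a Naive Bayes classifier for a given test proportion X."""
    X = CountVectorizer().fit_transform(texts)
    scores = []
    for _ in range(repeats):
        X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=x)
        model = MultinomialNB().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))   # proportion of correct outcomes
    return np.mean(scores)

# Map S to X for test proportions between 0 and 1, as plotted in Figures 7-9:
# curve = {x: score_for_test_size(texts, labels, x) for x in np.arange(0.1, 1.0, 0.1)}
```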


(Figure 7: Accuracy on unbalanced dataset)


(Figure 9: Accuracy on balanced dataset with 3 classes)

These are the key findings:

● The function is plotted for three different datasets. Unbalanced with 3 classes is the whole dataset; Balanced with 2 classes is a set with an equal proportion of positive and negative classes; Balanced with 3 classes has an equal proportion of the three classes (positive, neutral, negative).
● It is obvious that with a larger training set the test results (the score of the classifier) are better.
● At the beginning of the plot there is a steep decline. Here the classifier simply 'learns' the whole dataset, so this can be called the 'overfitting area'.
● The dataset with 2 classes achieves the best scores. After the overfitting area it yields a stable 85% score; at a certain point (around a 20% train set) the scores start to decline steeply. The conclusion is that it is relatively easy to train a classifier with 2 classes, even with a small train size.
● The balanced dataset with 3 classes has overall lower scores, but the shape of the function is similar. However, the decline is steeper, so we can conclude that with a more complicated dataset (more classes) a larger train set is beneficial.
● The unbalanced dataset shows different tendencies: the score declines proportionally with the size of the train set. The conclusion is that for a 'realistic', unbalanced dataset more test data generally means a better fit.


and unbalanced dataset almost do not suffer in scores compared to smaller test sets.

6. Pre-processing

Pre-processing the text (documents) is an operation which converts an input document into a different output format. The main aim of such operations is usually to simplify and normalize the text so that ambiguities (from the point of view of the model) are removed. These kinds of operations are usually simple, rule-based operations which remove features. However, there is another type of pre-processing operation which adds features, so the documents become richer. These operations usually use some linguistic algorithm, and very often an external model or dataset is involved. As a pre-processing operation is a simple input-output function, the pre-processing steps can be chained together. This is beneficial from the implementation point of view, as it is easy to insert or remove steps. However, it is important that steps which remove features (simplifying, normalizing) are placed before steps which add features. The next part discusses the pre-processing steps used in the experiment.

6.1. String Normalization

This step includes all the common regular expression based operations, such as:

● lowercase conversion
● number removal / converting numbers to words
● punctuation mark removal
● white space removal

This is an important step, as it makes the documents more approachable for tokenization by removing a lot of 'noise'.
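A sketch of such a normalization step using regular expressions; the exact rules below (for example keeping @ and #) are illustrative assumptions rather than the normalization used in the experiment.

```python
import re

def normalize(text: str) -> str:
    text = text.lower()                        # lowercase conversion
    text = re.sub(r"\d+", " ", text)           # number removal
    text = re.sub(r"[^\w\s@#]", " ", text)     # punctuation removal (keeps @ and #)
    text = re.sub(r"\s+", " ", text).strip()   # whitespace normalization
    return text

print(normalize("@united   Flight 1234 was LATE!!!"))   # -> "@united flight was late"
```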

6.2. Tokenization


● There are entities which should be one token, but simple rules might split them up: IP numbers, car model names, phone numbers... Entity recognition is always a domain-related problem.

● This can also be a language-specific problem. For example, German uses a lot of compound nouns, such as Rechtsschutzversicherungsgesellschaften. Here stemming can be a solution.

The common practice is usually 'removing URLs, removing stop words, removing numbers, reverting words that contain repeated letters to their original form, replacing negative mentions, and expanding acronyms to original word' (Jianqiang and Xiaolin, 2017); but changes in the details can have a dramatic effect on the prediction score. Advanced tokenization algorithms are usually configurable, so when implementing the tokenization step the needs of the use case can be taken into account.
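As an illustration, the sketch below uses NLTK's TweetTokenizer, a configurable tokenizer that keeps Twitter-specific entities such as @mentions, #hashtags and emoticons as single tokens; the option values are assumptions for demonstration.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False,   # lowercase while tokenizing
                           reduce_len=True,       # limit repeated letters ("sooooo" -> "sooo")
                           strip_handles=False)   # keep @united as a token

print(tokenizer.tokenize("@united sooooo disappointed :( #neveragain"))
# ['@united', 'sooo', 'disappointed', ':(', '#neveragain']
```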

6.3. Stopwords

A stopword is a very frequent term which conveys very little meaning. Filtering out these words reduces noise and helps to make the documents more specific. Stopwords are language-specific and domain-specific; the list of the latter is usually hand-tailored to the problem.

When working with classification problems and using statistical methods, filtering out stopwords is usually a good strategy. However, there can be situations when dropping common stopwords is problematic. For example, named entity recognition can struggle: the term President of the United States would be reduced to the form President United States, which is arguably not very useful as a named entity.
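A sketch of combining a generic English stopword list with a hand-tailored, domain-specific one; the domain words below are hypothetical examples for the airline domain, not the list used in the experiment.

```python
from nltk.corpus import stopwords   # requires nltk.download('stopwords')

generic_stopwords = set(stopwords.words("english"))
domain_stopwords = {"flight", "airline", "plane"}    # hypothetical domain-specific list
all_stopwords = generic_stopwords | domain_stopwords

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in all_stopwords]

print(remove_stopwords(["the", "flight", "was", "terribly", "late"]))
# ['terribly', 'late']
```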

6.4. Stemming / Lemmatisation

In this step every word affix is removed and just the root of the word is kept. This is the root morpheme, so it is often not human-readable. Lemmatisation is a similar process in which words are reduced to the lemma, which is the dictionary form of the word.


English and non-English documents, for example Indonesian (Hidayatullah, 2015) or Arabic (Wahbeh et al., 2011).

Secondly, lemmatising two different words can reduce them to the same base form, and so 'reduce the high dimensionality of the feature space in text classification' (Wahbeh et al., 2011). This might be beneficial, but it can also be problematic when the different forms convey important features for the document classification.
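The difference between the two operations can be illustrated with NLTK's Porter stemmer and WordNet lemmatizer (assumed tooling):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer   # WordNet requires nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["flights", "studies", "delayed"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# flights -> flight / flight
# studies -> studi / study        (the stem is not a dictionary word)
# delayed -> delay / delayed      (the default lemma assumes a noun)
```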

6.5. N-gram converting

Making a list of tokens is only one way to represent a document. Another option is to represent documents as lists of n-grams, 'since they obtained good results in previous works' (Barbosa and Feng, 2010). An n-gram is a sequence of n tokens. For example, (brown, fox) is a 2-gram (bigram). As a demonstration, the statement The quick brown fox jumps over the lazy dog can be represented in 2-grams as follows:

(The, quick), (quick, brown), (brown, fox), (fox, jumps)...

In the same way, the statement can be represented in 3-grams as follows:

(The, quick, brown), (quick, brown, fox), (brown, fox, jumps)...

This representation of documents can be beneficial because larger chunks of information can be used as document features. For example, when using 2-grams, brown fox is a term which can be found across the corpus. The size of the n-grams depends on the problem, but in natural language problems n-grams larger than 3 are rarely used. The downside of using n-grams is that they create more features, which might lead to performance issues.
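A sketch of the n-gram conversion; nltk.ngrams is used here as an assumed helper, and the same representation can also be produced with CountVectorizer(ngram_range=(2, 2)) in scikit-learn.

```python
from nltk import ngrams

tokens = "The quick brown fox jumps over the lazy dog".split()

print(list(ngrams(tokens, 2))[:4])
# [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
print(list(ngrams(tokens, 3))[:3])
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```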

Infrequent word filtering


6.6. Synonyms

Converting synonyms to a common term might be beneficial, as by doing so a list of infrequent terms can be converted into a more frequent one. This step always needs an external dictionary. Some argue that sentiment analysis scores are 'significantly different for [synonym] word pairs that mainly differ stylistically' (Shen, Fratamico, Rahwan and Rush, 2018).

6.7. Part of Speech tagging

Some authors argue that adding part-of-speech information about the words as additional features can help improve classifier performance: 'Previous approaches have shown that the effectiveness of using POS tags [for this task]. The intuition is certain POS tags are good indicators for sentiment tagging.' (Wiebe and Riloff, 2005), (Barbosa and Feng, 2010).
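One way to add this information is to pair each token with its tag, as sketched below with nltk.pos_tag; the pairing scheme is an illustrative assumption and not necessarily the exact 'Word to POS' transformation used later in the experiment.

```python
import nltk   # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def to_pos_features(text: str) -> list[str]:
    """Replace each token with a token_TAG feature, e.g. 'flight_NN'."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [f"{word}_{tag}" for word, tag in tagged]

print(to_pos_features("the flight was late"))
```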


7. The experiment

The aim of the experiment is to compare the performance of different classifiers under different text pre-processing options. The goal is to find the optimal combination of classifier algorithm and text pre-processing method for two datasets: a balanced, binary airline sentiment dataset (positive-negative) and the same dataset in its unbalanced, 3-class form (positive-negative-neutral). As shown in the section about the test set size, a 30% training set looks like a good choice: randomness is relatively small when testing, but there is still enough data for training.

7.1. Classifiers

The following classifiers are used in the experiment:

● Naive Bayes classifier
● Support Vector Machines

7.2. Pre-processing datasets

By default, all text is normalized (whitespace removed, converted to lowercase). After this, the following pre-processing methods are implemented as functions which take and return the same text format, so they can be chained; for example, in an experiment Stemming and N-gram conversion can be used together (a minimal sketch of such chaining is shown after the list below). The same experiment is run on the output of each method, so the results are comparable.

● Stop-word filtering (common stopwords and domain-specific)
● Keeping only the top 500 tokens
● 2-gram conversion
● 3-gram conversion
● Stop-word removal and stemming
● POS tagging
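A minimal sketch of this chaining and comparison idea, assuming scikit-learn for vectorization, splitting and the two classifiers; the pre-processing helper names in the commented usage line are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def chain(*steps):
    """Compose text-to-text pre-processing functions, applied left to right."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

def evaluate(texts, labels, preprocess):
    """Train SVC and Naive Bayes on the same split; return (F1, precision, recall) per model."""
    X = CountVectorizer().fit_transform([preprocess(t) for t in texts])
    # Hold out part of the data for evaluation (the split proportion here is an assumption).
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3)
    results = {}
    for name, model in [("SVC", SVC()), ("Naive Bayes", MultinomialNB())]:
        pred = model.fit(X_train, y_train).predict(X_test)
        results[name] = (f1_score(y_test, pred, average="macro"),
                         precision_score(y_test, pred, average="macro"),
                         recall_score(y_test, pred, average="macro"))
    return results

# Hypothetical usage, with pre-processing functions like those in the previous sections:
# print(evaluate(texts, labels, chain(normalize, remove_domain_stopwords, stem_text)))
```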


Table 6: F1-score, precision-score, recall-score on 2-class balanced Airline dataset

Pre-processing transformation SVC Naive Bayes


Table 7: F1-score, precision-score, recall-score on 3-class unbalanced Airline dataset

Pre-processing transformation   SVC                   Naive Bayes
Word-filtering                  (0.77, 0.81, 0.76)    (0.78, 0.82, 0.76)
FreqFilter 500                  (0.75, 0.86, 0.69)    (0.74, 0.81, 0.69)
Ngrams 2                        (0.77, 0.81, 0.75)    (0.76, 0.77, 0.75)
Ngrams 3                        (0.75, 0.79, 0.73)    (0.73, 0.74, 0.73)
Stemmer                         (0.78, 0.8, 0.76)     (0.78, 0.83, 0.76)
Word to POS                     (0.73, 0.84, 0.67)    (0.69, 0.82, 0.62)

As shown in Tables 6 and 7, experiments with different setups give different results. The following section focuses on the general trends which can be seen in these results. As a control for the experiment, the same experiment was run on two different datasets (Go, Bhayani and Huang, 2009), and very similar patterns can be seen.


(Part Of Speech tags only or filtering by word frequency) Support Vector Machines do a better job. The tradeoff is a worse prediction score.

Running the same experiment on a larger, unbalanced (i.e. the classes contain different numbers of documents), non-binary dataset yields lower prediction scores in general. What is more interesting, Support Vector Machines almost always beat Naive Bayes classifiers on these datasets. We can also see that while on a balanced, binary dataset changing the pre-processing can create a large increase in prediction scores, doing the same on an unbalanced dataset does not give the same gains. While the difference between the best and worst prediction scores on the balanced dataset is 21%, the same value on the unbalanced dataset is only 5 percent.


Table 8: Most important features for the airlines dataset (words, stopwords filtered)

Balanced dataset (airlines) Unbalanced dataset (airlines)


Filtering stopwords, as well as domain-specific stopwords, is important, as it improves prediction scores. Analyzing the top features (Table 8) shows that different datasets have very different top features, and for the airlines dataset they are arguably very domain-specific.

Table 9: Most important features for comparison dataset (words, stopwords filtered)


8. Discussion

The experiment shows that in the sentiment analysis domain different datasets behave very differently. It is important to understand the domain of the collected data, the use cases and the shape of the data, because these can have large effects on the results.

● It is easy to see why well-balanced data is easy to classify. However, real-life data is often unbalanced, and on unbalanced data different classifiers might work better. If possible, re-balancing the data (i.e. sampling an equal number of cases from every class) can be a feasible strategy, provided the amount of data is large enough. The section about sample size shows that a relatively small amount of training data can be enough to represent the whole dataset well.

● As a general rule, for sentiment analysis problems the best approach, as much research shows, is a simple Naive Bayes classifier. In order to increase performance, parameter tuning techniques can be used: a TF-IDF mapping can be applied to the bag-of-words features and a grid search can be run on the classifier parameters (a minimal sketch of such a setup follows this list). It is important to mention that grid search dramatically increases training time, as it effectively runs p^N trainings instead of one (where each of N parameters has p candidate values).

● For text pre-processing, filtering stopwords increases prediction scores. Domain-specific stop-word filtering also increases prediction scores, but it always needs manual work on the dataset -- so it cannot be automated, and in some cases it may even require domain-specific knowledge.

● Looking into the top features (the data points which have the most predictive power) shows that most of these features are domain-specific. From this the conclusion can be drawn that it is very hard to create a 'generic' sentiment prediction model; rather, a separate classifier should be trained for every specific problem.

● Reducing the number of features decreases prediction scores, but with currently available computers a large feature set is probably not a real problem. However, there can be use cases where reducing it is necessary. For example, model complexity (for example, a parameter-space grid search or neural network models) can slow down training dramatically, so reducing the number of parameters can be the only option.
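A minimal sketch of the tuning setup mentioned in the second bullet above, assuming a scikit-learn pipeline; the parameter grid is an illustrative assumption, not the grid used in this work.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "tfidf__use_idf": [True, False],
    "clf__alpha": [0.1, 0.5, 1.0],
}

# Every parameter combination is trained and cross-validated, so the number of fits
# grows multiplicatively with the grid size -- the p^N cost mentioned above.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# search.fit(texts, labels); print(search.best_params_, search.best_score_)
```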


9. Conclusions and future work

It can be said that sentiment analysis, in general, is a very domain-specific field, so there is no single best setup for every problem. That said, an approach similar to the one presented in this work (comparing different classifiers with different feature selection methods) can be very useful for real-life sentiment analysis problems in order to create better performing classifiers. Many commercial implementations of sentiment analysis tools do not have the fine-tuning features which are used and demonstrated in this work. As future work, the author sees the following possible directions:

● Implement an experiment framework which can cooperate with the commercial tools, so the users and administrators can tailor the sentiment analysis software to their own needs.

● As the experiment shows, a biased training dataset and the wrong stopword settings have dramatic effects on classifier performance. Most commercial applications do not have tools for fixing these problems, so implementing a solution might create much value.


References

Agarwal, A., Xie, B., Vovsha, I., Rambow, O. and Passonneau, R.J., 2011. Sentiment analysis of twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011) (pp. 30-38).

Appen.com, Open Source Datasets, https://appen.com/resources/datasets/, Last accessed: 2020.05.25.

Amazon AWS, https://docs.aws.amazon.com/comprehend/latest/dg/how-it-works.html, Last accessed: 2020.05.25.

Barbosa, L. and Feng, J., 2010. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd international conference on computational linguistics: posters (pp. 36-44). Association for Computational Linguistics.

Bifet, A. and Frank, E., 2010. Sentiment knowledge discovery in twitter streaming data. In International conference on discovery science (pp. 1-15). Springer, Berlin, Heidelberg.

Dave, K., Lawrence, S. and Pennock, D.M., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th international conference on World Wide Web (pp. 519-528).

Davidov, D., Tsur, O. and Rappoport, A., 2010. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd international conference on computational linguistics: posters (pp. 241-249). Association for Computational Linguistics.

Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12), p.2009.

Hidayatullah, A.F., 2015. The Influence of Stemming on Indonesian Tweet Sentiment Analysis. Proceeding of the Electrical Engineering Computer Science and Informatics, 2(1), pp.127-132.


Kaggle.com, Twitter US Airline Sentiment, https://www.kaggle.com/crowdflower/twitter-airline-sentiment, Last accessed: 2020.05.25.

Mäntylä, M.V., Graziotin, D. and Kuutila, M., 2018. The evolution of sentiment analysis—A review of research topics, venues, and top cited papers. Computer Science Review, 27, pp.16-32.

Pak, A. and Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREc (Vol. 10, No. 2010, pp. 1320-1326).

Pang, B., Lee, L. and Vaithyanathan, S., 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 79-86). Association for Computational Linguistics.

Parikh, R. and Movassate, M., 2009. Sentiment analysis of user-generated twitter updates using various classification techniques. CS224N Final Report, 118.

Saif, H., Fernandez, M., He, Y. and Alani, H., 2013. Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

Saif, H., He, Y., Fernandez, M. and Alani, H., 2016. Contextual semantics for sentiment analysis of Twitter. Information Processing & Management, 52(1), pp.5-19.

Shen, J.H., Fratamico, L., Rahwan, I. and Rush, A.M., 2018. Darling or babygirl? investigating stylistic bias in sentiment analysis. In 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FATML).

Snyder, B. and Barzilay, R., 2007. Multiple aspect ranking using the good grief algorithm. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference (pp. 300-307).

Wahbeh, A., Al-Kabi, M., Al-Radaideh, Q., Al-Shawakfa, E. and Alsmadi, I., 2011. The effect of stemming on Arabic text classification: an empirical study. International Journal of Information Retrieval Research (IJIRR), 1(3), pp.54-70.
