
Sentiment Analysis of Microblog Posts from a Crisis Event using Machine Learning

Anders Westling


DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering, 300 credits
Master Programme in Computer Science, 120 credits
Royal Institute of Technology, year 2013

Supervisor at CSC was Joel Brynielsson
Examiner was Jens Lagergren
TRITA-CSC-E 2013:042
ISRN-KTH/CSC/E--13/042--SE
ISSN-1653-5715

Royal Institute of Technology

School of Computer Science and Communication

KTH CSC


Abstract

With social media services becoming more and more popular, there now exists a constant stream of opinions publicly available on the Internet. These opinions can be analyzed to find the users' sentiments towards different subjects. One example of interest is to see how people are feeling during a crisis situation, to get a better understanding of what kind of help would be the most useful at the moment.

The goal of this degree project has been to see if it is possible to create an automatic classifier, based on machine learning techniques, that can accurately determine whether a microblog post written during a political event in Russia is for, against, or neutral towards the group of people at the center of the event.

Because of the shortness of microblog texts and the informal language often used in them, the problem is expected to be more difficult than sentiment analysis of normal-length texts.

A number of different machine learning algorithms were studied along with different ways to convert the microblog texts into a representation that can be used by the classifier algorithms. The most promising of these algorithms and representations were implemented and tested to see if an accurate classifier could be obtained.


Sentiment Analysis of Microblog Posts from a Crisis Event using Machine Learning

As social media services become more and more popular, there now exists a constant stream of opinions freely available on the Internet. These opinions can be analyzed to find the users' feelings about different subjects. One example of interest is to see how people are feeling during a crisis situation in order to get a better idea of what kind of help would be of most use at the moment.

The goal of this degree project has been to see whether it is possible to create an automatic classifier, based on machine learning methods, that can accurately determine whether a microblog post written during a political event in Russia is for, against, or neutral towards the group of people that the event revolves around.

The problem is expected to be harder than sentiment analysis of normal-length texts, since microblog posts are much shorter and often use informal language.

A number of different machine learning algorithms were studied together with different methods for representing the microblog posts in a format that the algorithms can work with. The most promising of these algorithms and representations were implemented and tested to see whether an effective classifier could be achieved.


Acknowledgements

This Master’s Thesis project was carried out at the Swedish Defence Research Agency (FOI) in 2012. I had a great time working on the project, and there are some people who I would like to thank for helping me:

First of all I would like to give special thanks to my supervisors, Dr. Tove Gustavi and Dr. Joel Brynielsson, for their positive attitude, feedback and support. They did a great job proofreading this report and suggesting improvements on the content and layout.

I would also like to thank Dr. Ulrik Franke and Dr. Carolina Vendil Pallin for classifying the microblog posts I used for my experiments. I’m glad I didn’t have to do it myself! Furthermore, many thanks to Dr. Fredrik Johansson for taking an interest in my project and commenting on my work and the first draft of this report.


Contents

1 Introduction
   1.1 Background
   1.2 Purpose
   1.3 Machine learning
      1.3.1 Generalization
      1.3.2 Feature selection
   1.4 Twitter
2 Theory
   2.1 Machine learning algorithms
      2.1.1 Classification
      2.1.2 Clustering
   2.2 Sentiment analysis
      2.2.1 Approaches
      2.2.2 Sentiment analysis using machine learning
3 Methodology
   3.1 Data
   3.2 Implementation
      3.2.1 Text representation & pre-processing
      3.2.2 Evaluation
   3.3 The tests
   3.4 Verification of implementation
4 Results
   4.1 Test results
      4.1.1 Word presence and word frequency
      4.1.2 Minimum word frequency
      4.1.3 N-gram size
      4.1.4 Stop words
      4.1.5 Stemming
      4.1.6 Information gain
   4.2 Highest accuracies
5 Discussion
   5.1 Data set and classification
   5.2 The tests
   5.3 The best classifier
   5.4 Future work
      5.4.1 Improving the classifier
      5.4.2 Other categories
      5.4.3 Different ways of acquiring training data
6 Conclusions


Chapter 1

Introduction

This chapter serves as an introduction to the degree project. Section 1.1 provides some background to the problem and why it is of interest, Section 1.2 explains the purpose of the project, Section 1.3 provides an introduction to machine learning, and Section 1.4 briefly describes Twitter.

1.1 Background

Thanks to the Internet, people from all around the world are now able to share their opinions with each other wherever they are, whenever they want. This means that the public opinion on many subjects is openly available to those who want it. Buyers review products they purchase online so that a potential buyer of a product can see what possibly hundreds of other buyers thought about the product before making the decision herself.

Opinions are also available in a less structured form thanks to social media. With services like Facebook, YouTube and Twitter, users can share their opinions with their friends and the public when they feel like it, and they can say almost whatever they want. By finding and analyzing such messages about a subject, the general opinion about something can be inferred. Companies do this to analyze company and brand popularity, but there are other possible uses.

One such use is to analyze posts written during a crisis event to understand how people are affected and what kind of help would be the most useful. The volume of posts written during such an event is far too large to analyze manually, which makes an automated process necessary.

Sentiment analysis is the process of determining the sentiment of a given text. This can be performed by computers and requires natural language processing to interpret the text in such a way that a sentiment can be extracted. This can be done in a number of ways; one of them is to use machine learning techniques on a number of given texts, each labeled with its sentiment, to train a model that can find the sentiment of new, unseen texts.

1.2 Purpose

The purpose of this thesis is to evaluate a number of machine learning algorithms and techniques for their use in sentiment analysis of Twitter posts written during a crisis event. Different machine learning algorithms will be analyzed, as well as different methods of representing the texts in a format that the algorithms can work with. Some of these algorithms and representations will then be tested on Twitter posts from an event to see if the correct sentiment of these posts can be accurately determined.

1.3 Machine learning

Machine learning is the automated process of extracting patterns from large amounts of data and using these patterns to make predictions on new data [15]. What these patterns are depends on the task at hand and the algorithm used. When working with machine learning, work is performed on sets of data. A data set is a collection of instances, where each instance can be seen as a collection of features. An instance can be described as a vector of these features, $x = (x_1, x_2, x_3, \ldots, x_k)$, where $k$ is the number of features. Features can be numerical, categorical (where the feature takes one of a predefined set of values, such as month of the year), or of a less restricted form such as free text (names, for example). Some algorithms only work with certain types of features.

Machine learning is a large field with many different uses [14], though only a few will be mentioned in this report. Those are:

• Classification: Given a collection of classified instances $(x_i, y_i)$, where $x_i$ is the feature vector and $y_i$ is the corresponding class, assign a class to an instance $x_j$ whose class $y_j$ is unknown. The class assignment should be made in such a way that the instance is more similar to instances belonging to the assigned class than to instances belonging to the other classes. This could for example be used to decide whether a potential loan-taker is likely to default or not by seeing if it is similar to others who have defaulted.

• Clustering: Given a collection of instances without class labels, group the instances into clusters so that instances in the same cluster are similar to each other. For example, when looking at species, one could find that there are five major groups of species, where each group is one cluster. Clustering will not tell what the classes are, just how many there are.

In a perfect world one would just give the data to the algorithm and it would do its thing. Sadly, this is not the case. There are quite a few common issues that arise when learning from data. If these are not handled properly the results can become poor and to reach optimal performance these issues must often be taken into consideration.

1.3.1 Generalization

Often when working with machine learning a training set is used. This is the data that the algorithm uses to learn about the structure of the data and find an underlying model that is later used when working with unseen data. The ability to correctly handle unseen data based on the training set is called generalization. Overfitting, on the other hand, is when the training has resulted in a model that is too well-adjusted to the specifics of the training data, making it unable to generalize. This can be avoided by not training the model too much and by evaluating the model with some testing data. A subset of the available data can be chosen to be used as a testing set. The rest of the data is used for training, and the testing set is then used to see how well the model performs on unseen data. When choosing between multiple classifiers, the one that performs best on the testing set is most likely the classifier that has generalized best.

1.3.2 Feature selection

Generally, a lot of different information is collected to learn from, on the assumption that more is better. This means that not all of the features in the data are necessarily useful, which can have several downsides. One issue is speed, since more data must be processed. When deciding whether a person should be granted a loan or not, the person's hair color or name is hardly a deciding factor. Having these features present will make the process slower and could also degrade the results by chance, simply because of some unfortunate distribution of the feature values (every loan-taker named Steve defaulted). Another issue is correlation. Highly correlated features (such as age and height of children) could reinforce each other, skewing the results. What should be one feature is instead essentially two, meaning that the features can have more influence over the result than they should when using some algorithms.

One approach is to rank the features by how influential they are and keep only the k most influential ones. A complication is that the k most influential features found are not necessarily the best combination of k features, because of correlations, making a greedy approach non-optimal [33].

1.4 Twitter

Twitter is a microblogging service where users can send and read text-based posts called “tweets.” Tweets are short messages consisting of up to 140 characters. Users can follow other users to receive their tweets as they are posted. These tweets can then be replied to, forming a conversation, or “retweeted,” where a user simply posts the same post again quoting the original author to share with the user’s own followers. Twitter has become widely popular and has at this moment over 500 million user accounts [37].

Twitter allows searching for tweets. To make it easier for users to emphasize the topic of the tweets, they can prefix words with the symbol #. These words, called hashtags, become links leading to a list of other tweets containing the same hashtag. Twitter also keeps track of trending words and hashtags, making it easier for users to find out what is being discussed at the moment.


Chapter 2

Theory

In this chapter, machine learning and sentiment analysis will be described further. A number of machine learning algorithms will be presented first, where the classification algorithms are the most relevant for this project. The following section describes how sentiment analysis can be done, with the focus being on how machine learning can be used to do it. This includes how text can be preprocessed to work with the classification algorithms presented earlier.

2.1 Machine learning algorithms

Machine learning can be used for many things, such as training models to classify instances and clustering existing data sets based on similarities between the instances. In the following sections we describe four algorithms for classification and one algorithm for clustering.

2.1.1 Classification

In this section four popular, but quite different, classification algorithms will be presented. Naïve Bayes and Support Vector Machine are two of the most popular algorithms for sentiment analysis, while the other two, C4.5 and k-nearest neighbors, are mainly included for the sake of comparison.

C4.5

Figure 2.1. A decision tree with three test nodes (white) and five terminal nodes (grey), deciding on one of two classes, Yes and No.

C4.5 is an algorithm for building decision trees [35][40]. The algorithm works with all types of data, both categorical and numerical. It is used in a wide variety of areas where the decision boundaries can be described by a tree-like decomposition or rules.

To build the tree one feature is tested at each node. This node will then have a number of children based on the possible values that the feature can take. If the feature is binary, there will be two children. If the feature represents for example Country, there will be one child for each country present in the training data. When dealing with numerical data with an infinite amount of possible values a subtree for each value is not feasible or effective, so a threshold value θ is used, making it a binary split with the nodes representing “< θ” and “≥ θ.”

When choosing what feature to test at each node, a variety of measures are available. One popular option is “information gain” that maximizes the difference between the entropy of the set of instances at the current node and the sum of the entropies of all of its children nodes when splitting on a specific feature [23]. In other words, the purpose is to find a split that creates subsets where the classes are less evenly distributed, lowering the entropy. An alternative is “gain ratio” which is similar to information gain but adds a penalty value when the split results in many children, which avoids a tendency of information gain to split on categorical features.

To decide what value to give the threshold when splitting a numeric feature, sort the values in the training data and try splitting at each possible value (or the average of two neighboring values). The split with the highest information gain is then chosen.
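As a concrete illustration of the information gain criterion, the sketch below computes the gain of a candidate binary split from class counts. It is a minimal, self-contained example with made-up counts, not the C4.5 implementation itself.

```java
public class InfoGain {
    // Entropy (in bits) of a node given the number of instances of each class.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    static int sum(int[] counts) {
        int s = 0;
        for (int c : counts) s += c;
        return s;
    }

    // Information gain of a binary split: parent entropy minus the
    // size-weighted entropy of the two children.
    static double infoGain(int[] parent, int[] left, int[] right) {
        double weighted = (double) sum(left) / sum(parent) * entropy(left)
                        + (double) sum(right) / sum(parent) * entropy(right);
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        // Hypothetical node with 10 Yes / 10 No instances, split on "income < 1000".
        int[] parent = {10, 10};
        int[] left = {8, 2};   // class counts for income < 1000
        int[] right = {2, 8};  // class counts for income >= 1000
        System.out.printf("information gain = %.3f bits%n", infoGain(parent, left, right));
    }
}
```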

When certain stopping criteria are met, a node is not split further and instead becomes a terminal node. Such a node represents what value to give an unclassified instance. These criteria can be specified by the user, for example that all training instances at this node are of the same class or that the number of instances left at this point is below a specified threshold. The most frequent class is chosen as the class value, as can be seen in Figure 2.1.

When the tree is built, pruning is performed to reduce overfitting. The pruning performed by C4.5 is called pessimistic pruning [19] and works on the same data set that it is trained on. The pruning is performed in a bottom-up fashion, working from the leaves up to the root.

At each node, compare the number of misclassified instances (the error rate) of the whole subtree with two alternatives. The first alternative is to replace the subtree with a terminal node assigned the most frequent class in the training data. The other alternative is to replace the node with its most frequent child, that is, the subtree where most of the training instances go. If either of these alternatives has a lower error rate than the subtree, the node is replaced with the alternative that has the lowest error rate.

When dealing with missing feature values in the data there are a few alternatives for handling them. During training, one easy way is to simply ignore those instances. Another option is to assign them the most frequent value in that subset of the data. A third alternative is to create another child node for missing values. When classifying, either assign the instance a value or, if the missing-value node exists, follow that one.

One nice feature of C4.5 is that the tree can be represented as a set of rules, one for each path down the tree, resulting in the assignment value. This can be presented to a user to give a certain understanding of the decision boundaries.

The algorithm has a few weaknesses though. When dealing with XOR (an odd number of the features must be 1) or “m-of-n” functions (m of the n features are needed to decide the class) the resulting tree ends up being really large and cannot be effectively pruned. Building and storing it can therefore be troublesome. This behavior is quite rare in real life situations though and should not be an issue.

K nearest neighbors

K nearest neighbors (or kNN) is quite different from the other classification algorithms described here. Instead of training a model, it uses lazy learning, meaning that it does not try to generalize before a query is made. The idea is very simple. When classifying an unseen instance, look in the training set at the k nearest neighbors, and choose the most frequent class in this set of instances [13][40].

If k = 1, the closest neighbor decides the class. While no learning is required, comparisons will have to be made with all of the training data each time a query is made. A consequence of this is that classification can be slow if the training data set is large and many queries are being made.

If k is too small, a few noisy or atypical training instances close to the query can lead to a poor result. On the other hand, if k is too large, a very small cluster of a certain class could be overpowered by a larger cluster further away. Care has to be taken when choosing k. One thing that can be done is to add a weight to each neighbor based on its distance. This would give closer neighbors more influence than neighbors that are further away. This makes a larger k value more forgiving, but will not necessarily give improved results.
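As an illustration of the distance-weighted voting just described, the following sketch classifies a query point from a small labelled training set using Euclidean distance and weights of 1/distance. The data and labels are made up, and this is a simplified stand-in rather than the kNN implementation used later in the thesis.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedKnn {
    static class Instance {
        final double[] features;
        final String label;
        Instance(double[] features, String label) { this.features = features; this.label = label; }
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Let the k nearest training instances vote, each with weight 1/distance,
    // so that closer neighbours have more influence on the chosen class.
    static String classify(List<Instance> training, double[] query, int k) {
        List<Instance> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(t -> euclidean(t.features, query)));
        Map<String, Double> votes = new HashMap<>();
        for (Instance neighbour : sorted.subList(0, Math.min(k, sorted.size()))) {
            double weight = 1.0 / (euclidean(neighbour.features, query) + 1e-9); // avoid division by zero
            votes.merge(neighbour.label, weight, Double::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Instance> training = List.of(
                new Instance(new double[]{1.0, 1.0}, "positive"),
                new Instance(new double[]{1.2, 0.8}, "positive"),
                new Instance(new double[]{5.0, 5.0}, "negative"));
        System.out.println(classify(training, new double[]{1.1, 1.0}, 3)); // prints "positive"
    }
}
```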

An issue here is that kNN only works for data where a distance between two instances can be measured. For example, how would one measure the distance between two types of weather? Is sunny closer to snowy than to rainy? In some cases numbers can be given to categorical features. When filling out evaluations, “very satisfied” can be considered closer to “satisfied” than “not satisfied” and can be given integer values to represent this, but this is not always possible.

Another issue is how to measure the distance. Euclidean and Manhattan distance are the most common ones, but in some cases, like document classification, measures such as the cosine measure can be more appropriate. Scaling might be required to avoid some features dominating the distance. If one feature is a person's height (1.0–2.0 meters) and another is her income ($10000–$100000), the income will dominate the Euclidean vector distance and the neighbors will probably be the k people with the most similar income. To adjust for this bias, scaling will have to be done to give all features a more similar value range.

Missing feature values are handled in a simple way: ignore them when calculating the distance. It is crude, but performs reasonably well. An alternative is to assign it a value, for example the mean from the training set.

Support Vector Machine

A Support Vector Machine (or SVM) is a powerful binary classifier that builds a decision boundary in the multi-dimensional feature space to separate two classes [20][38][40]. Like kNN, SVM works only with numeric data. A problem with many classifiers that build decision boundaries is that they are not very clever about it: the classifier is trained to place the boundary in such a way that as many training instances as possible are correctly classified, and that is the only criterion. SVM tries to improve the boundary further by maximizing the distance between the boundary and the classes it separates. This is illustrated in Figure 2.2. Doing this does not necessarily mean that the result is a better classifier, but in practice it has been shown that it is. Mathematically, the boundary is found by solving

$$\min_{w,b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{such that} \quad y_i(w^T x_i + b) \ge 1, \quad i = 1, \ldots, n$$

where $w$ is the normal of the separating hyperplane, $b$ is an offset from the origin, and $y_i \in \{-1, 1\}$ is the class of training instance $x_i$.


Figure 2.2. An example of a decision boundary produced by a support vector machine. Black and white points are separated by the line with normal vector w, at a distance b from the origin, with the areas between the boundary and the dashed lines representing margins of width d.

A new instance $x_k$ is classified by calculating $y_k = \operatorname{sgn}(w^T x_k + b)$.

The original version of SVM draws a linear boundary. This is sufficient in some cases, but not always. To deal with this, SVM can use the so-called kernel trick. The kernel trick makes use of the fact that data in higher-dimensional space is easier to linearly separate than data of lower dimensions. The idea is to transform the feature vectors into a higher dimension while calculating the vector product, replacing two transformations and a vector product with a single function called a kernel function. The created boundary is linear in the higher-dimensional space but would appear curved when projected to the original dimensionality.

When performing the kernel trick an appropriate kernel function has to be chosen. Depending on the function, the decision boundary will take different shapes, making it advantageous to select a proper kernel. Popular kernels are the polynomial kernel $(x_i \cdot x_j)^p$ and the radial basis function $\exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$.

Noisy data can also be problematic. One single outlier could potentially push the boundary quite a bit, making the classifier less effective. By introducing slack variables, the vectors closest to the boundary can be allowed to violate the margin: a few data points may fall inside the margin, hopefully leading to a more robust classifier.

Since SVM is a binary classifier, multiclass classification is a bit of an issue. There are ways to handle it using multiple classifiers. One could use the one-vs-all principle and have one SVM for each class, deciding whether an instance belongs to class A or not, another deciding B or not, and so on. Another variant is to have a classifier for each pair of classes, and have the classifiers vote on which class an instance belongs to.
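As a sketch of how an SVM can be used in practice, the snippet below configures SMO, the SVM implementation in Weka (the Java machine learning library used for the implementation in Chapter 3), which handles more than two classes with pairwise classifiers. The file name is a placeholder and the RBF kernel is only an example choice, not the configuration used in this project.

```java
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder file: numeric feature vectors with a nominal class as the last attribute.
        Instances data = DataSource.read("vectors.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();             // Weka's SVM (sequential minimal optimization)
        svm.setKernel(new RBFKernel());  // radial basis function kernel; PolyKernel gives a linear/polynomial boundary
        svm.buildClassifier(data);

        double predicted = svm.classifyInstance(data.instance(0));
        System.out.println("Predicted class: " + data.classAttribute().value((int) predicted));
    }
}
```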

Naïve Bayes

Naïve Bayes works by using a different principle, namely calculating the relative probability of an instance belonging to each class. The idea is to use Bayes' theorem to calculate the probability $P(\text{class} \mid \text{instance})$ for each possible class. The class assigned is the class that is the most probable. Bayes' theorem says that

$$P(\text{class} \mid \text{instance}) = \frac{P(\text{class}) \, P(\text{instance} \mid \text{class})}{P(\text{instance})}$$

where $P(\text{instance})$ can be omitted since its value is the same for all classes and the actual probabilities do not matter, i.e., we are solely interested in the most probable class.

What makes Naïve Bayes “simple” is that it assumes that the features of an instance $x$ are independent of each other. By making this assumption, the probability $P(x = x_1, \ldots, x_n \mid \text{class})$ becomes $\prod_{i=1}^{n} P(x_i \mid \text{class})$, making the calculations much easier. This is of course a very restrictive assumption that holds in few real-world cases. The good news is that it works fairly well even if the features are not completely independent, but feature selection can be especially important when working with Naïve Bayes. To calculate $P(x_i \mid \text{class})$ one must decide what probability distribution to use. If knowledge about the feature exists, one should pick an appropriate function. Otherwise, binning is a good alternative, i.e., splitting the value range into several intervals, or “bins,” where the count of a bin is incremented for each training instance having a value in that range.


Naïve Bayes is a very useful algorithm because of its simplicity. It is very easy to implement and use. The math is simple and works with both categorical and continuous features. Training and classification are fast. It is also easy to add training data at a later time by just updating the bins and probabilities.
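A minimal sketch of the calculation for the multinomial variant commonly used for text (and used later in this thesis), where the features are word counts. The counts below are made up, the class priors are assumed equal, and Laplace smoothing is added (something the text above does not discuss) so that unseen words do not zero out the product; the comparison is done in log space for numerical stability.

```java
import java.util.Map;

public class NaiveBayesSketch {
    // log P(class) + sum over words in the document of count * log P(word | class),
    // with add-one (Laplace) smoothing of the word probabilities.
    static double logScore(Map<String, Integer> docCounts,
                           Map<String, Integer> classWordCounts,
                           int totalWordsInClass, int vocabularySize, double classPrior) {
        double score = Math.log(classPrior);
        for (Map.Entry<String, Integer> e : docCounts.entrySet()) {
            int countInClass = classWordCounts.getOrDefault(e.getKey(), 0);
            double pWordGivenClass = (countInClass + 1.0) / (totalWordsInClass + vocabularySize);
            score += e.getValue() * Math.log(pWordGivenClass);
        }
        return score;
    }

    public static void main(String[] args) {
        // Hypothetical word counts collected from labelled training texts.
        Map<String, Integer> positiveCounts = Map.of("free", 3, "great", 5, "verdict", 1);
        Map<String, Integer> negativeCounts = Map.of("prison", 4, "deserved", 3, "verdict", 2);
        int vocabulary = 5; // free, great, verdict, prison, deserved

        Map<String, Integer> doc = Map.of("great", 1, "verdict", 1); // the text to classify
        double pos = logScore(doc, positiveCounts, 9, vocabulary, 0.5);
        double neg = logScore(doc, negativeCounts, 9, vocabulary, 0.5);
        System.out.println(pos > neg ? "positive" : "negative"); // prints "positive"
    }
}
```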

2.1.2 Clustering

k-means is a clustering algorithm that shares ideas with kNN [27][40]. The idea is simple: place k “cluster centers” randomly around the feature space. Each instance is then assigned to the closest cluster center. When this is done, each cluster center is moved to the mean of all the instances belonging to that cluster. The whole process is then repeated. This will eventually converge so that all the cluster centers are at fixed locations. This is the final clustering, with each instance assigned to its closest cluster center.

The algorithm is simple enough, but there are two major issues that must be dealt with. The first one is k. Knowing beforehand how many clusters there are is not always possible. One can run k-means multiple times with an increasing value of k and then choose the best k. How the best value of k is chosen is the next issue. One way to judge a solution is to give it a cost, where the most obvious one is the distance from each node to its closest cluster center. This does not work, though, as this value will decrease as k increases. To compensate for this, a penalty value that grows with k must be included. The simplest penalty would be to divide the total cost by k. Another idea is to cluster clusters. Simply put, after convergence, take the cluster with the largest average distance and perform 2-means on that cluster, and repeat as necessary.

The other issue is that while the algorithm always converges, it can converge into a local minimum. The initial placement can lead to a situation where one cluster center ends up between two clusters, taking them all, while another center is too far away to get anything. This is illustrated in Figure 2.3. Two solutions to this problem are to either run the algorithm multiple times with random initial cluster placements, taking the result with lowest final cost, or help centers that are not getting nodes by moving them closer to the closest instances even if they belong to another center.

The result is that each instance is assigned to a cluster center. In general, a cluster assignment is not a classification as we do not know what class a cluster represents; it simply means that those instances that belong to the same cluster should be more similar than those belonging to other clusters.
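A compact sketch of the k-means loop described above. Here the centers are initialized at randomly chosen data points, a common variant of the random placement mentioned earlier, and the assignment and mean-update steps alternate until no assignment changes.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // One run of k-means: returns the cluster index assigned to each point after convergence.
    static int[] kMeans(double[][] points, int k, long seed) {
        Random rnd = new Random(seed);
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: each point belongs to its closest center.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist2(points[p], centers[c]) < dist2(points[p], centers[best])) best = c;
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            // Update step: move each center to the mean of the points assigned to it.
            for (int c = 0; c < k; c++) {
                double[] mean = new double[dim];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) continue;
                    count++;
                    for (int i = 0; i < dim; i++) mean[i] += points[p][i];
                }
                if (count > 0) {
                    for (int i = 0; i < dim; i++) mean[i] /= count;
                    centers[c] = mean;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.2, 0.9}, {0.8, 1.1}, {5, 5}, {5.1, 4.9}, {4.9, 5.2}};
        System.out.println(Arrays.toString(kMeans(points, 2, 42))); // e.g. [0, 0, 0, 1, 1, 1]
    }
}
```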

2.2 Sentiment analysis

Figure 2.3. Two clusterings after convergence. The left image shows cluster center c1 taking all the nodes as one large cluster while the right image shows the nodes being separated into two smaller clusters.

Sentiment analysis is quite similar to the problem of topic classification, where the topic or topics of a text are determined. A common method of topic classification is to extract keywords that are more commonly used when writing about a certain topic, and then classify new texts based on the presence of such words. An example could be the sentence “While the acting was great, it was way too long, three hours is just too much.” The topic of this sentence would most likely be movies or theater, and an automatic classifier could base this on the presence of the words “acting” and “long,” since they are commonly used in that context.

This approach does not work quite as well when performing sentiment analysis. Words that are positive or negative can be extracted, but going just by these will not always tell whether a text is positive or negative. Consider the sentence “Why someone would pay for this movie is beyond me.” The writer is negative towards the movie, but phrases it without using any clearly polarized words. Sentences like this make sentiment analysis more complicated than topic classification and therefore more advanced methods are needed.

2.2.1 Approaches

Objects and object features can be extracted by information extraction methods such as Conditional Random Fields and Hidden Markov Models [26]. Assumptions are sometimes made to simplify this process, such as assuming that there is at most one comparative relation in a sentence and that all objects and features are nouns. These assumptions are most often true, but not always.

When dealing with texts consisting of multiple sentences, analysis can be made on each sentence separately or on the whole text. In [31] each sentence is analyzed, the neutral sentences are thrown away and the classification is done on the remaining sentences. However, all sentences are not always semantically independent of each other, so some knowledge is lost.

When using the discriminant word approach, a word list of polarized words must be created. This can of course be done by hand, but there are many words that add sentiment to text, so creating a complete list by hand is not reasonable. A popular lexical database used for sentiment analysis when working with English texts is WordNet [8]. WordNet groups nouns, verbs, adjectives and adverbs into sets of synonyms called “Synsets.” For each word in the database, synonyms and antonyms can be extracted. From these words, new synonyms and antonyms can be found. A popular approach is to select a few sentiment words such as “good” and “bad” and find words that should be similarly valued by looking at the synonyms, the synonyms of the synonyms, and so on [21]. By making such a search, terminating once a certain distance from the original word has been reached, discriminant word lists for positive and negative words can be created. The distance from the original word should be taken into consideration so that words further away have a weaker score, since they are likely to be less synonymous.

Another lexical database based on WordNet is SentiWordNet [3]. In SentiWordNet all words are given three sentiment values: positivity, negativity and objectivity. These values sum to 1, and can be used to give a sentiment value to the words of a text. A similar resource is the MPQA subjectivity lexicon, which is a lexicon of 8221 subjective lemmas that are tagged with a polarity (positive or negative) and a strength (weak or strong) [5].

When performing sentiment analysis on foreign texts, automatic translation combined with SentiWordNet and machine learning has given decent results [11].

2.2.2 Sentiment analysis using machine learning

Another approach to sentiment analysis is to use machine learning techniques. By analyzing a large amount of texts, a model can be trained to classify new texts based on similarities to the texts the model was trained on. Classification requires that the training texts have been given a sentiment already so that the algorithm can try to find what separates the sentiments and use this on the new texts.


Table 2.1. An example of a text representation using the bag-of-words model. The numbers in each row make up the feature vector of the text.

Text                      | good | night | the | movie | is | not | fugitive | best
Good night!               |  1   |   1   |  0  |   0   |  0 |  0  |    0     |  0
The movie is not good     |  1   |   0   |  1  |   1   |  1 |  1  |    0     |  0
The Fugitive is the best  |  0   |   0   |  2  |   0   |  1 |  0  |    1     |  1

An advantage of the machine learning approach is that it does not rely on hand-built linguistic resources such as lists of sentiment words, which is useful if sentiment analysis is made on texts written in a language that does not have good tools for natural language processing available.

On the other hand, it is not certain that the algorithms are good enough to learn to separate the classes. To decide the sentiment based on a simplified representation of the texts might not be enough considering the complexity of human languages. Another issue is the training data. Even if it is possible for a trained model to accurately classify texts, a large number of classified texts are likely needed to train the model. If a person can spend an hour classifying enough texts to get a good classifier then it is not a problem. If tens of hours must be spent obtaining a training set it becomes a lot more problematic, and a text analysis approach could be more useful in that case.

If there are no classified texts available, clustering can be done to divide the texts into groups such that the texts in the same group are more similar to each other than to texts in other groups. This does not tell if the groups represent different sentiments, though, but when combined with other methods clustering can be useful when no classified data is available [24].

Text representation

A common way of representing a text is the bag-of-words model, where each feature corresponds to a word (or term) and its value indicates whether, or how many times, the word occurs in the text, as illustrated in Table 2.1.

The bag-of-words model can be extended in many ways, for example by including word n-grams. Word n-grams are sequences of n words that appear in the text. For example, the 2-grams of “This is not good” are “This is,” “is not” and “not good.” The reasoning here is that some sequences of words have their own meaning and can therefore improve classification if they are included as features. Popular values for n are 1 (unigram), 2 (bigram) and 3 (trigram). N-grams can also refer to sequences of characters constructed in a similar way, but in this report n-gram will always refer to a sequence of words. Other ways to extend the feature vector can be to include parts of speech, word positions in the text (text at the end of the text could be of more or less importance) and degree modifiers (“very,” “much”) [25].
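A small sketch of extracting the word n-grams of a text is shown below (plain Java; the implementation in this thesis relies on Weka's tokenizers for this step, as described in Chapter 3).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {
    // All word n-grams of sizes 1..maxN, in order of appearance.
    static List<String> wordNGrams(String text, int maxN) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= maxN; n++)
            for (int i = 0; i + n <= words.length; i++)
                grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        return grams;
    }

    public static void main(String[] args) {
        // Prints the unigrams and 2-grams of "This is not good":
        // [this, is, not, good, this is, is not, not good]
        System.out.println(wordNGrams("This is not good", 2));
    }
}
```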

One issue when performing sentiment analysis is negation. By including a negation, the sentiment of a whole sentence can be reversed. There are some ideas regarding how to handle this. One popular approach is to simply create new words of all words that follow a negating word up to the next period. For example, the text “I am not happy today. How are you?” would become “I am not not_happy not_today. How are you?” This makes sure that negated words are treated differently. One paper experiments with negations and finds that the two words following a negation should be treated differently [17]. This was not done with machine learning techniques, though. Each word was given a sentiment value in the range [−1, 1] based on SentiWordNet. After considering negations the values were summed up and a final sentiment score was achieved. The authors proposed that the value of a word close to a negation should be multiplied by a factor of −1.27.
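A sketch of the first negation-handling idea above, prefixing every word that follows a negating word with “not_” until the end of the sentence; the list of negating words is a made-up English example.

```java
import java.util.Set;

public class NegationMarker {
    private static final Set<String> NEGATIONS = Set.of("not", "no", "never");

    static String markNegations(String text) {
        StringBuilder out = new StringBuilder();
        boolean negating = false;
        for (String token : text.split("\\s+")) {
            if (negating) out.append("not_");
            out.append(token).append(' ');
            if (NEGATIONS.contains(token.toLowerCase())) negating = true;
            if (token.endsWith(".")) negating = false; // negation scope ends at the period
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Prints "I am not not_happy not_today. How are you?"
        System.out.println(markNegations("I am not happy today. How are you?"));
    }
}
```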

Pre-processing of data

Using the methods presented for building feature vectors leads to very large vectors. The number of words used can easily be over 10,000. While it is important to have good features that separate the classes well, having too many features can both slow down training and classification and worsen the results, making feature selection important. Calculating the “value” of the features is one approach, using a metric like information gain (which is also used in C4.5) to decide which features are worth keeping. When working with text there are also more domain-specific options available.

A common and simple technique is to remove stop words, i.e., the most frequent words such as “the” and “is.” These words occur in texts of all classes and should carry little sentiment, but if such a word happens to occur more often in, for example, the positive training texts, texts containing it could appear more positive in the eyes of the classifier.

Stemming is a linguistic technique that can be used to reduce the number of words. The process reduces a word into a base form, or stem. The purpose of this is to treat all inflections of a word as the same word. For example, “stemming,” “stemmer” and “stems” could all be reduced into “stem.” Hopefully the inflection does not matter in terms of sentiment, and the stemming should therefore improve classification by combining words that are essentially the same.

Stemming can be done in many different ways, especially depending on the language. Lookup tables can be used, but those tend to be large and cannot handle unseen words. A popular type of algorithm is suffix-stripping [28]. By keeping track of the common suffixes of inflected words, these can be removed from words to get the stem. Examples of such suffixes in the English language are “-ed,” “-ing” and “-ly.” This approach cannot handle all words, though, such as irregular verbs. Another issue is when a word contains such a suffix but is not inflected, like “sting,” which could be stemmed into “st.” This can often be avoided by adding additional constraints, such as requiring the stem to contain at least one vowel.
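A toy illustration of suffix stripping with a vowel constraint is given below. The suffix list and the single rule are simplified examples for this discussion, not the Porter or Snowball algorithms.

```java
import java.util.List;

public class SuffixStripper {
    private static final List<String> SUFFIXES = List.of("ing", "ed", "er", "ly", "s");

    static boolean containsVowel(String s) {
        return s.matches(".*[aeiouy].*");
    }

    // Strip the first matching suffix, but only if the remaining stem still contains a vowel.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (containsVowel(candidate)) return candidate;
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("stemming")); // "stemm"
        System.out.println(stem("stems"));    // "stem"
        System.out.println(stem("sting"));    // "sting" ("st" has no vowel, so nothing is stripped)
    }
}
```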

A closely related technique is lemmatisation. Lemmatisation is the process of finding the lemma of an inflected word. This is similar to stemming, and the stem and the lemma of a word are often the same. The main differences are that the lemma is always a real word, and it is based on the context in which it is used. For example, the word “meeting” can refer to a noun (“a meeting took place”) or a verb (“he is meeting me in five minutes”). Stemming would treat “meeting” the same in both cases, while lemmatisation would consider the part of speech and lemmatise them differently. Lemmatisation is a more advanced technique compared to stemming, but can be more effective.

Adjectives and verbs are the most subjective parts of speech, so if these are extracted from the training data they can make up the feature vectors. Classical feature reduction techniques used for machine learning can also be used: information gain, chi-square and mutual information are all viable options [21].

Other more unique representations have been proposed, such as keeping track of the distances between positive and negative words in a text. These distances are then placed into bins and summed, with each feature representing a bin and the value being the sum [16].


Sentiment analysis of tweets

Since tweets are only up to 140 characters long, the writers will have to be very concise, making the language quite different compared to that of normal length texts. In terms of machine learning, this will most likely make it more difficult to learn from since the messages do not contain as much information. Since the texts are short, related texts are less likely to have similarities between them. With Twitter being a fairly informal forum, the texts are likely to be poorly written with misspelled words and improper grammar being common, making them more difficult to understand.

On the other hand, abbreviations as well as emoticons are popular ways to express a sentiment in just a few characters, which can be very useful for sentiment analysis. The texts should also be straight to the point with few conflicting sentiments. Some aspects can be favorable for sentiment analysis while others are not, making this quite different from sentiment analysis of normal length texts. Research on the area is therefore necessary to see how much it differs compared to normal length texts.

Twitter is still relatively new, so there has not been much research done on sentiment analysis of tweets, but some exists. In [9] the authors compared classification of short and long texts. Tweets and microreviews (reviews that are at most 140 characters long) were compared to longer blog posts and reviews. When using only positive and negative texts, they found that unigrams and Naïve Bayes performed well on the shorter texts, reaching 74.85% accuracy on tweets, while SVM with additional features such as bi- and trigrams with part-of-speech tagging was most effective when classifying longer texts. They also tried classification including neutral tweets, getting 61.3% accuracy using unigrams and Naïve Bayes.


Chapter 3

Methodology

In this chapter the method used will be described. The goal of this project is to see if it is possible to create a classifier that accurately finds the sentiment of tweets written during a crisis event. This will be done by training different classifiers with data from such an event and evaluating how accurate the classifiers are. Since there are many different algorithms, text representations and feature selection techniques available, different combinations of these will be tested to see if some of them work well together, resulting in better classifiers.

Section 3.1 will describe the data used, where it came from and how it was categorized. Section 3.2 describes the implementation that was used to create the classifiers and how the classifiers were evaluated. Section 3.3 describes the tests that were performed, i.e., the different combinations of parameter values that were used to create classifiers.

3.1 Data

The data used for this project were tweets posted during a crisis event. The API that Twitter provides [6] makes it easy to search for tweets or gather tweets by keywords in real-time. A problem with the API is that Twitter does not allow gathering of tweets older than about a week, or even less if a query returns many results. This poses a problem for this project as tweets from a relevant event like an earthquake cannot be retrieved a long time after it happened. To get the necessary data set, tweets have to be collected when an event is happening. This is what was done for this project. Several data sets were collected during the period of June to August of 2012 and based on their relevance and possible categorizations one of them was chosen to be used for training classifiers.

The data set that was chosen concerns the Russian punk band Pussy Riot. In 2012, three members of the band were arrested after a protest performance in a Moscow cathedral. The arrest received a lot of attention and there was much activity on Twitter regarding the band and the arrest, not just in Russia but in the rest of the world as well.

On August 17, the three members were convicted of hooliganism motivated by religious hatred and sentenced to two years in prison. The conviction resulted in much activity on Twitter that day, with people from all over the world reacting, both positively and negatively to the sentence. The data were collected when this happened. A number of relevant hashtags were followed with a Python program using the Twitter Streaming API to collect the tweets. Roughly 130,000 tweets were collected during a period of 24 hours, from a bit past midnight Moscow time to the same time the next day. It is a bit unclear exactly how the streaming API [6] works in terms of how much is actually retrieved. There are limitations on how much can be gathered, which should mostly depend on the posting frequency of relevant tweets. Every single relevant tweet might therefore not have been collected, but this should not matter as we will only look at a subset of them.

Out of this collection non-Russian tweets (roughly 5/6th of them) and retweets (more than half) were removed, and of the remaining tweets 390 of them were randomly selected for manual classification. By choosing the tweets at random, the training set should be somewhat representative of the full set of tweets.

The reason for using Russian tweets is that it is a Russian event, and the amount of tweets that are negative towards Pussy Riot is much greater in Russian than for example in English tweets, making it more probable to get a decently sized set of negative tweets. Retweets can be excluded since they essentially contain the same information as the original tweet, and this avoids repetition in the training data.

The tweets were then classified into three classes: positive to Pussy Riot (negative to the sentence), negative to Pussy Riot, and other. Other includes neutral tweets, tweets where a sentiment could not be found or decided upon, or simply unrelated content. Tweets where there is a sentiment, but it is not related to Pussy Riot or the verdict, are classified as other.

The classifications were made by two researchers at the Swedish Defence Research Agency who are fluent in Russian and knowledgeable about Pussy Riot and Russian politics. The data set was split into two halves and they got one half of the tweets each. First they classified the half that they got. Then they classified the other person's half, independently of the first classification. If they disagreed about a tweet, the person who classified it first made the final decision on the tweet's class. The manual classifiers agreed with each other on 306 classifications, or 78%. Regrettably, after classifying one half each, they had to slightly alter the definition of the categories to make some clarifications, so it is not perfectly fair to compare the two classifications; the actual agreement is probably slightly higher. The result of the final classifications was:

• Positive to Pussy Riot: 159 tweets
• Negative to Pussy Riot: 59 tweets
• Other: 172 tweets

This gives us some reference points for the accuracy of a classifier. It should not be reasonable for an automatic classifier to perform better than the manual classifiers, so an accuracy of 78% is the goal. However, a lower limit for what is a reasonable performance should also be considered, since if the accuracy is too low, it would be better to just guess the class. The baseline value considered will be the accuracy achieved when guessing the class using the distribution of the available data. The probability of guessing the correct class is the probability of an instance belonging to a class (the fraction of training instances belonging to that class), multiplied by the probability of the guesser guessing that same class (the same fraction), summed over all class values.

With the data used here, this gives us a baseline of $(159/390)^2 + (59/390)^2 + (172/390)^2 \approx 0.38$, or 38% accuracy. Other baseline values could be used as well. For example, classifying everything as other would be more accurate, 44%, but only guessing one class is not very useful, especially since the other class is the least interesting one.

The distribution itself should be acceptable even if negative is much smaller than the other categories. However, since there are quite few tweets to begin with, the negative set might be too small, so it is expected to be more difficult to classify than the others. One thing worth mentioning here is that both of the manual classifiers found the classification to be more difficult than they had first expected, indicating that this is a difficult problem even for human experts. It was not always clear whom the author of the tweet was in favor of, especially when sarcasm or links to external websites were used. Since the classifier cannot follow links, classification was made without considering the content of external websites.

3.2 Implementation

The implementation was made in Java using Weka [7]. Weka is an open source library of machine learning algorithms and tools for preprocessing data and evaluating models. The reason for using Weka is that it is a popular tool for machine learning with many different algorithms for data mining, feature selection and filtering of the data. Weka provides the tools for doing what needs to be done in this project and has extensive documentation.

The program written in this project begins by reading the tweets from a file. The program then extracts all the n-grams present in the texts and selects those that are believed useful for finding the correct class of a tweet. A classifier can then be trained with the classified texts and be used to classify new texts. The classifiers tested in this project were the classifiers described in Section 2.1.1:

• Support Vector Machine: Normalization of the feature values was used, since initial tests showed that training became faster and results were slightly better compared to not using normalization.

• Naïve Bayes: The version used was the Multinomial Naïve Bayes (NBM) classifier that is available in Weka.

• C4.5 (also known as J48 in Weka): J48 was used with its default configuration.

• K-nearest neighbors: kNN was configured to look at the 10 closest neighbors and have them vote on the class, with the vote weight being the inverse of the distance. Little time was put into choosing these parameters for reasons explained below.

The SVM and Naïve Bayes are two of the most popular algorithms for text classification and were expected to perform well. Tree-based classifiers such as C4.5 are not as suitable for text classification as there are a large number of features present, resulting in large trees where many features will not be used. K-nearest neighbors is not suitable for practical reasons as it will iterate over the whole training set each time an instance is classified. If a classifier is used during a real event with hundreds of messages being posted every minute, the classifier must be reasonably fast at classifying, making k-nearest neighbors less suitable (hence not spending much time choosing the optimal parameters). Therefore, C4.5 and kNN were mostly included for the sake of comparison.
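A sketch of how the four classifiers can be instantiated in Weka with roughly the configurations listed above; the thesis does not spell out every option value, so the settings here should be read as illustrative rather than exact.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.SelectedTag;

public class Classifiers {
    static Classifier[] buildClassifiers() {
        SMO svm = new SMO();
        // Normalize the training data (Weka's default filter type for SMO).
        svm.setFilterType(new SelectedTag(SMO.FILTER_NORMALIZE, SMO.TAGS_FILTER));

        NaiveBayesMultinomial nbm = new NaiveBayesMultinomial(); // multinomial Naive Bayes

        J48 c45 = new J48(); // Weka's C4.5 implementation, default configuration

        IBk knn = new IBk(10); // 10 nearest neighbours
        // Weight each neighbour's vote by the inverse of its distance.
        knn.setDistanceWeighting(new SelectedTag(IBk.WEIGHT_INVERSE, IBk.TAGS_WEIGHTING));

        return new Classifier[] {svm, nbm, c45, knn};
    }
}
```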

3.2.1 Text representation & pre-processing

A number of different text representations were tested. All letters were converted into lowercase. The bag-of-words principle with n-grams was used for all representations. The n-grams were of size up to 3, with each vector feature representing one n-gram. Both term frequency and term presence were tested. Techniques for removing some of the words were also used. N-grams only occurring f or fewer times in the training data were ignored, for different values of f. Table 3.1 shows how many n-grams there are for different frequencies. On the other end, the most common n-grams should contain little sentiment value, so the g most frequent n-grams became stop terms, i.e., removed.

Information gain (see description of C4.5 in Section 2.1.1) was also used for two purposes. Most importantly, it was used for feature selection when training a classifier. The information gain was calculated for each n-gram. The features were then sorted by gain, and the least informative were removed. The purpose of this was that words with little gain should be removed so that they would not interfere with the classifier by being noisy.

The most informative words will also be presented in the results chapter. The words are presented to see what kinds of words are the most discriminative and whether these could help us understand more about the classification.


Table 3.1. A table showing how the n-grams are distributed in terms of frequency. The numbers represent the number of n-grams that occur the corresponding number of times in the training data.

Frequency  | Unigrams | Bigrams | Trigrams
1          |   1986   |  3945   |   4113
2          |    180   |   102   |     34
3          |     46   |    22   |     11
4          |     29   |     9   |      1
5 or more  |     73   |    19   |      5

The Snowball stemmer was used to stem the words. Weka includes a wrapper class for using Snowball when creating the n-grams.

3.2.2 Evaluation

A classifier that has been trained should be evaluated to estimate how well it performs. This is done by classifying a number of instances where the real class is known and seeing how many instances the classifier classifies correctly. Evaluating a model on the data that it was trained on is a bad idea, however, and will not give a proper estimation of its performance. Since the model was created based on the training data, it is very likely that it will perform quite well when classifying the same data even if it tries to generalize. A model should instead be tested on new data that the classifier has never seen before. This tells us more about how well the model can generalize and handle unseen data, which is the whole point of the classifier.

When large amounts of training data are available, the data can be split into two sets: a training set used to create the model and a testing set used for evaluating it. However, when small amounts of data are available, such as the 390 instances used in this project, the split would result in two small sets, possibly leading to a poor classifier because of the small training set, a poor evaluation because of the small testing set, or both.

Cross-validation is a model evaluation technique that is especially useful when small amounts of data are available. The idea is to partition the data into several subsets and then create multiple classifiers, each using different subsets for training and for testing. A classifier is trained on all but one of the subsets, which is used for testing the classifier. By creating multiple classifiers this way with different subsets being used for testing, each classifier will be built on larger amounts of the data, making it more representative of the classifier built on all of the data, while having multiple classifiers makes it possible to also test with more of the data. The result of each classifier evaluation is then combined into a final evaluation.

Figure 3.1. A chart showing the process of converting the tweets into data sets to train and test a classifier.

Partitioning the data into k subsets and creating k different classifiers in this way is called k-fold cross-validation. All instances will be used for both training and testing (though never both at the same time). The larger k is, the more representative the results will be, but training a classifier takes time, so more folds will take more time. Leave-one-out cross-validation will create as many classifiers as there are instances, evaluating on a single instance each time. This will result in the best evaluation, but is very computationally expensive. A common value for k is 10, resulting in 10-fold cross-validation, which is what was used in this project.

3.3 The tests

This section will describe how the tests were done. Since 10-fold cross-validation was used, the main process was repeated 10 times but with different training and testing sets each time. This process is illustrated in Figure 3.1. First, the data were split into two sets: 90% of the tweets constituted the training set, while the remaining 10% constituted the testing set. From the training data, all the n-grams were extracted. If stemming was used, it was performed when creating the n-grams. The infrequent n-grams (n-grams that occurred less than f times in the training data) were removed, and the g most frequent n-grams became stop terms, i.e., were removed as well. More than g n-grams could become stop terms if multiple n-grams tied for the gth highest frequency. Of the remaining features, the i most informative features found using information gain became the final set of features used.

The selected features were then used to convert both the training tweets and the testing tweets into feature vectors. A classifier was trained on the training vectors and then used to classify the testing data set for evaluation. The class assigned to an instance was compared to the actual class to see if they agree. These statistics were saved and aggregated over all the folds into a final evaluation. The statistics of interest are the accuracy value and the confusion matrix.
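The per-fold process described above maps fairly directly onto Weka components. The sketch below is an approximation under stated assumptions rather than the exact program used in this project: the file name and parameter values are placeholders, the removal of the g most frequent stop terms is omitted, and the Russian Snowball stemmer is assumed. A StringToWordVector filter builds the n-gram features, an information gain ranker keeps the most informative ones, and wrapping both in a FilteredClassifier ensures that the features are re-extracted from the training folds only during cross-validation.

```java
import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.SnowballStemmer;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TweetClassificationPipeline {
    public static void main(String[] args) throws Exception {
        // Placeholder file: one string attribute with the tweet text, one nominal class attribute.
        Instances tweets = DataSource.read("tweets.arff");
        tweets.setClassIndex(tweets.numAttributes() - 1);

        // Bag-of-words/n-gram extraction.
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(3);            // unigrams, bigrams and trigrams

        StringToWordVector words = new StringToWordVector();
        words.setTokenizer(tokenizer);
        words.setLowerCaseTokens(true);
        words.setOutputWordCounts(false);        // term presence rather than term frequency
        words.setMinTermFreq(2);                 // drop n-grams occurring only once
        words.setStemmer(new SnowballStemmer("russian")); // assumed stemmer for the Russian tweets

        // Keep the most informative n-grams according to information gain.
        AttributeSelection infoGain = new AttributeSelection();
        infoGain.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(200);              // placeholder value
        infoGain.setSearch(ranker);

        MultiFilter filters = new MultiFilter();
        filters.setFilters(new Filter[] {words, infoGain});

        FilteredClassifier classifier = new FilteredClassifier();
        classifier.setFilter(filters);
        classifier.setClassifier(new NaiveBayesMultinomial());

        // 10-fold cross-validation; the filters are rebuilt from the training folds only.
        Evaluation eval = new Evaluation(tweets);
        eval.crossValidateModel(classifier, tweets, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString("Confusion matrix:"));
    }
}
```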

The tests were made by performing 10-fold cross-validation with different combi-nations of classifiers and values for the parameters. The results were then examined to see what parameter values performed well and what did not. The parameters were:

• Classifier, one of the four classifiers mentioned earlier.

• Word presence, whether to use word frequency or just presence in the feature vectors.

• Maximum n-gram size, up to how large the n-grams could be. Values tried were 1, 2 and 3.

• Minimum term frequency, the minimum number of times an n-gram had to occur in the training tweets to be considered. The values tried were the integers in the range of 1 to 5.

• Stemming, whether the stemmer was used or not.

• Stop terms, the number of stop terms removed. The values tested ranged from 0 to 30, increasing in steps of 5.

• Information gain features, how many of the most informative n-grams, as calculated by information gain, were used. This value began at 25 and was increased in steps of 25 until all features were used.

The values were chosen so that all combinations of parameter values could be tested in a reasonable amount of time, which in this case ended up being a few hours.

3.4 Verification of implementation


Chapter 4

Results

In this chapter, the results from the tests will be presented. Since the number of variables and combinations was quite large, not all of the test results will be shown. Instead, a selection of the results is presented in graphs, together with some comments on what could be found in them. The graphs have been chosen to illustrate how different parameters change the results and to lead up to the set of parameter values that produced the best classifier (in terms of accuracy). This classifier will be presented further with a confusion matrix.

4.1 Test results

4.1.1 Word presence and word frequency

To begin with, the most basic settings were used when training the classifiers: unigrams were used, nothing was removed, and no stemming was performed. The first graph, which can be seen in Figure 4.1, shows the difference between using word presence and word frequency in the feature vectors. The difference is small, but Naïve Bayes and C4.5 perform better with word presence while k-nearest neighbors performs better with word frequency. Throughout the tests, word presence generally produced better results, even if SVM favored word frequency. We can also see here that Naïve Bayes, at 48.7% accuracy, is a couple of percentage points more accurate than the other algorithms, and that k-nearest neighbors performs the worst at 43.6%. All of the classifiers perform better than the baseline at 38% accuracy.
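The difference between the two feature value schemes can be made concrete with a small example. The feature list and the helper function below are hypothetical and only a sketch; the helper maps a tokenized tweet onto a fixed feature list either as raw counts (frequency) or as 0/1 indicators (presence).

    from collections import Counter

    def to_feature_vector(tokens, feature_list, use_frequency=False):
        """Build a feature vector over a fixed feature list from one tokenized tweet."""
        counts = Counter(tokens)
        if use_frequency:
            return [counts[f] for f in feature_list]
        return [1 if counts[f] > 0 else 0 for f in feature_list]

    # Hypothetical unigram features and a tokenized tweet:
    features = ["суд", "свободу", "против"]
    tweet = ["свободу", "свободу", "суд"]
    print(to_feature_vector(tweet, features, use_frequency=True))   # [1, 2, 0]
    print(to_feature_vector(tweet, features, use_frequency=False))  # [1, 1, 0]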

4.1.2 Minimum word frequency


Figure 4.1. A comparison of the difference in accuracy between using term frequency and term presence. The features used were unigrams with no feature selections made.

Figure 4.2. A comparison of the difference in accuracy when changing how many times a term must appear in the texts to become a feature. Unigrams were used with word presence.

4.1.3 N-gram size


Figure 4.3. A comparison of the difference in accuracy with different maximum sizes of n-grams. Word presence is used and each term must occur at least two times in the training data.

Figure 4.4. A comparison of the difference in accuracy between different numbers of stop words. Unigrams were used with word presence and a minimum frequency of two.

4.1.4 Stop words


Figure 4.5. The same test as in Figure 4.4, but this time with stemming used.

4.1.5 Stemming

Figure 4.5 shows again how the number of stop words affects the accuracy, but this time with the Snowball stemmer included. Results are better in most cases, especially for Naïve Bayes and SVM. C4.5 shows a surprising improvement at 20 stop words, beating SVM. kNN did not improve at all compared to not using the stemmer.

4.1.6 Information gain

Finally, to see if the accuracy could be improved further, feature selection using information gain was used. Figure 4.6 shows how the number of selected terms affects the accuracy. The results are a bit jumpy, but the best result improved by more than one percentage point.

4.2 Highest accuracies

Since the algorithms reacted differently to the parameter values, the most successful setup for each algorithm is presented in Table 4.1 along with the corresponding parameter values. All classifiers reached an accuracy of over 50% using their best settings, with the highest accuracy, 55.38%, achieved by the Naïve Bayes classifier. The two parameter choices shared by all classifiers are the n-gram size (unigrams) and the use of stemming.


Figure 4.6. A comparison of the difference in accuracy with different numbers of features used. The features used were those found having the largest information gain value. Unigrams were used with word presence and each term had to occur at least two times in the training data. The 20 most frequent n-grams were removed.

Table 4.1. The parameter setup for the most accurate version of each algorithm. Count is Yes if frequency was used for feature values. Size is the maximum size of the n-grams. Frequency is how many times an n-gram had to occur in the training data. Stop is how many stop terms were removed. Stem is whether stemming was used, and IG is how many of the most informative features found using information gain were used.

Algorithm   Accuracy   Count   Size   Frequency   Stop   Stem   IG
NB          55.38%     No      1      2           20     Yes    150
SVM         53.33%     Yes     1      2           15     Yes    125
C4.5        51.03%     No      1      2           20     Yes    All
kNN         50.51%     Yes     1      4           15     Yes    25

Table 4.2 shows the confusion matrix for this classifier. The recall of the other class is quite high, with 76% of the other instances being classified correctly. The positive class has a recall of 48%, while the recall of the negative class is only 14%. Further discussion follows in Section 5.3. The accuracy of the classifier is 55.4%, which is well above the baseline value (38%) but still much closer to it than to the goal accuracy (78%).

4.3 Informative words


Table 4.2. The confusion matrix for the best version of the Naïve Bayes classifier. The matrix shows how the classifier predictions were distributed and how well they coincide with the actual class value.

                      Predicted class
Actual class    Positive   Negative   Other
Positive           76          1        82
Negative           20          8        31
Other              32          8       132
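The accuracy and recall figures quoted in Section 4.2 can be recomputed directly from this matrix. The short check below copies the values from Table 4.2 (rows are actual classes, columns are predicted classes, in the order positive, negative, other) and matches the quoted figures up to rounding.

    # Confusion matrix from Table 4.2: rows = actual class, columns = predicted class.
    matrix = [
        [76, 1, 82],     # actual positive
        [20, 8, 31],     # actual negative
        [32, 8, 132],    # actual other
    ]

    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(3)) / total
    recalls = [matrix[i][i] / sum(matrix[i]) for i in range(3)]
    print(round(accuracy, 3))                # 0.554
    print([round(r, 2) for r in recalls])    # [0.48, 0.14, 0.77]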

The most informative words found using information gain are presented here. Some of the words are difficult to translate into English, so the given translation might not be the most appropriate considering the subject. The words are listed below in decreasing order of informativeness along with a translation or description:

1. пиздец – vulgar dislike of the state of things,
2. #хамсуд – hashtag referring to the court,
3. концом – “end,”
4. защищать – “defend,”
5. должны – “should,”
6. поклонный – “worship,”
7. народу – “people,”
8. памятника – “monument,”
9. такой – “such,”
10. вынес – “made.”


Chapter 5

Discussion

In this chapter the results presented in Chapter 4 will be discussed, in terms of what they mean and why the results are what they are.

5.1 Data set and classification

There are most likely a number of issues with the data set and the classification problem that made it difficult to build an effective classifier. The major one is the number of tweets. Learning from 390 tweets is difficult, especially when only 59 of the tweets were negative, which would explain the poor recall of negative tweets (see Section 4.2). In other studies on tweet classification, the number of tweets is at least a thousand [9][22], and that is most likely necessary. Human language is complex, and an opinion can be expressed in many different ways, making it necessary to use a lot of training data. A related question is how many tweets can be classified manually in a reasonable amount of time. Manual classification is not worth the time it takes if experts must spend tens of hours classifying tweets for every situation.

The list of informative words presented in Section 4.3 hints at the difficulty of the problem as well. The most informative word makes perfect sense. The second word is a hashtag, which is a bit surprising. It is probably not a polarized word by itself; it just happens that some groups of users tend to use it more. Manual counting showed that among the tweets containing the hashtag, 22 were positive, 17 belonged to other, and no instance was negative, so those tweets were slightly more in favor of Pussy Riot compared to the rest. The remaining words do not look very informative. A guess is that they are not actually informative; it is just that the other words are even less so. This could be a consequence of most words only appearing once (see Table 3.1). This list would probably look quite different with more training data.


The shortness of the tweets is another issue, with each tweet only having room for one or possibly two emotional statements. A possible issue is also the informal language. Using slang and misspelling makes classification even more difficult, and since tweets are informal and short, people are more likely to be creative with how they phrase themselves.

The classification problem itself is a likely issue. Compared to the more “classical” sentiment analysis problem, this classification should be more difficult. Here, it is not enough to just decide that a tweet is positive; it is essential to also know what the tweet is positive towards. How much more difficult this actually makes the learning is hard to say, but it is safe to say that it can only make it more difficult. This means that more advanced natural language processing might be helpful, so that the subject of a statement can be determined.

5.2 The tests

Most importantly, the four algorithms need to be considered. Looking at the graphs in Chapter 4, the algorithms perform fairly similarly. They all perform significantly better than guessing (at least with appropriate settings), but none of them give great results. C4.5 and k-nearest neighbors perform surprisingly well, frequently performing at the same level as the SVM. Naïve Bayes was not the winner in all tests, but generally performed very well and produced the highest accuracy. This is in line with [9], in which it was found that Naïve Bayes was the best classifier when working with short texts. While SVM was not a definite runner-up, it produced better results than C4.5 and kNN when using the optimal parameters and generally performed somewhat better overall.

The other parameters are of course of interest as well. In terms of word presence versus word frequency, Figure 4.1 indicates that presence is better, but it depended on the algorithm and the other parameters. This parameter most likely makes little difference since tweets are so short; in most cases the words only occur once anyway. The maximum size of the n-grams did not make any clear difference. Looking at Table 4.1, though, all algorithms performed at their best with only unigrams. Adding larger n-grams probably both helps and hurts: some features improve classification, while most of them just add noise. In the end, using just unigrams with feature selection seems to be enough, which agrees with [32]. Table 3.1 shows that very few bi- and trigrams occur more than once, and many of those that do are likely part of common expressions, like “of the” in English. With more training data, larger n-grams could potentially be helpful.

Stemming turned out to be quite helpful. With so few words being used, the process of stemming should be especially useful, turning inflected words into a single term, making sure they are not removed for being infrequent. The question is how much it helps when using larger data sets.
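The thesis only states that the Snowball stemmer was used; as an illustration of the effect, the snippet below applies NLTK's Russian Snowball stemmer (the choice of library is an assumption) to a few inflected forms of the same verb, which should collapse to a single stem and therefore survive the minimum-frequency filter together.

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("russian")
    # Inflected forms of "защищать" ("to defend"); after stemming they count as one term.
    forms = ["защищать", "защищали", "защищал"]
    print({word: stemmer.stem(word) for word in forms})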



The tests of the frequency-based feature removal gave some support for it being helpful. Previous research suggests that 4 is a good minimum frequency value [32]. Here that number was found to be 2. Looking at Table 3.1, this is hardly surprising, considering that only 102 words occur more than four times in the texts. The table also tells us something about how small the training data set is. Of the 2314 words that occur in the texts, only 328 occur more than once. Classifying a text based on 328 different words (most of which will not occur in any given tweet) cannot be enough for a proper sentiment analysis.

Removing the most frequent words also seemed to be helpful to some extent. Looking at the list of common words, most of them are completely irrelevant, with words (translated) such as “in,” “year,” “for,” and similar words that should occur frequently in this context. The only one that is of interest is the hashtag “#хамсуд” referring to the court. As discussed above, this was one of the most informative features. A guess is that it is not informative enough to be worth keeping along with the rest of the stop words. In future tests one could try to avoid using hashtags as stop words or even use a predefined list of stop words from an external source.

Information gain worked similarly to the other feature removal techniques; no obvious general improvement can be seen from including it, but the best result did become better. Having very few features performed poorly, which should be expected. Aside from that, it looked like including most features but not all was the most effective, which makes sense considering the nature of the problem. Many words that are neither common nor uncommon can still be useless for finding the sentiment, and information gain can help find these. One thing worth noting in Table 4.1 is that C4.5 did not have any use for the feature removal. This is most likely because C4.5 uses information gain internally to build the tree.
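For reference, the information gain of a single binary feature is the class entropy minus the class entropy conditioned on whether the feature is present; a minimal sketch (not the project's implementation) is shown below. Features are then ranked by this score and the top i are kept.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (in bits) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(feature_present, labels):
        """IG of one binary feature: H(class) - H(class | feature present or absent)."""
        n = len(labels)
        with_feature = [y for present, y in zip(feature_present, labels) if present]
        without_feature = [y for present, y in zip(feature_present, labels) if not present]
        conditional = sum(len(part) / n * entropy(part)
                          for part in (with_feature, without_feature) if part)
        return entropy(labels) - conditional

    # Toy example: a feature that appears only in "positive" tweets.
    present = [1, 1, 1, 0, 0, 0]
    classes = ["positive", "positive", "positive", "other", "other", "negative"]
    print(round(information_gain(present, classes), 3))   # 1.0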

One thing that should be mentioned here is of course the nature of the tests performed. With so many parameters being adjusted, one will eventually find a combination that improves the result a bit further. The “optimal” configurations found here are very unlikely to be the best combination for a different data set; they are the best configurations for these tweets, evaluated with these folds. The parameter values should rather be seen as an indication of what kind of settings could be effective. A summary of lessons learned:

• Unigrams are likely enough; adding larger n-grams did not improve results noticeably.

• Stemming improves the classifier, at least with smaller data sets.

• Stop words, frequency checks and information gain can all be helpful with the right settings.

References
