Classification of Hate Tweets and Their Reasons using SVM


Academic year: 2021

Degree project, 30 credits (Examensarbete 30 hp)

18 January 2016



Natalya Tarasova

This study focused on finding the hate tweets posted by customers of three mobile operators, Verizon, AT&T and Sprint, and on identifying the reasons for their dissatisfaction. The timelines containing a hate tweet were collected and studied for the presence of an explanation. A machine learning approach was employed using four categories: Hate, Reason, Explanatory and Other. The classification was conducted with a one-versus-all approach using the Support Vector Machines algorithm implemented in the LIBSVM tool.

The study resulted in two methodologies, the Naive method (NM) and the Partial Timeline Method (PTM). The Naive Method relied only on the feature space consisting of the most representative words chosen with Akaike Information Criterion. PTM utilized the fact that the majority of the explanations were posted within a one-hour time window of the posting of a hate tweet.

We found that the accuracy of PTM is higher than for NM. In addition, PTM saves time and memory by analysing fewer tweets. At the same time this implies a trade-off between relevance and completeness.


“Natalya-san, don’t be afraid of Big Data.”



Over the past fifteen years, the Internet has become an arena where our daily activities increasingly take place: we retrieve and share information in the form of text, sound and images, shop, book trips and experiences, take online courses, etc. Via the Internet we collaborate with others and create meeting places. Today we can reach out and be reachable through different platforms. Depending on which groups one wants to reach, there are different tools. For example, Facebook is mostly used to keep in touch with family, friends and acquaintances, whereas LinkedIn is used as a platform for professional contacts. The popularity of social media has resulted in a rich source of searchable data for analysing what people feel, think and do [9, 10]. There is therefore great interest among researchers in using these data to study everything from trends and opinion polls to the spread of influenza [3, 4]. Companies have also realized the value of using information from social media in order to understand what their customers think about the services and products that the companies provide.

In this work we focused on the social medium Twitter. The basic idea of Twitter is that anyone, at any time, should be able to reach out to others by publishing a message of at most 140 characters. Such a message is called a tweet. The work was carried out at Social Media Labs, KDDI R&D. At Social Media Labs, research is conducted on, among other things, what users in different parts of the world think about a certain product or service. The result is visualized on a map and gives a quick overview of geographical trends.

We investigated whether it is possible to identify the reason why users express hate in tweets directed at the mobile operators Verizon, AT&T and Sprint. After reading hate tweets from about 500 users, we could conclude that it was possible to find explanations for why Twitter users express hate towards their mobile operators. The research question for this study was then made concrete: the goal became to develop a method that enables the identification of hate tweets and of the reasons that prompted them. The study resulted in two methods: a ”naive” method (the Naive Method, NM) and a more ”advanced” method (the Partial Timeline Method, PTM).



I wish to thank my subject reader, Sofia Cassel, for support, productive discussions and guidance. I also want to thank Social Media Lab at KDDI R&D for their warmth and hospitality.

Finally, I wish to thank the Sweden Japan Foundation for travel funding.



Summary (Sammanfattning)

Acknowledgements

List of Figures

List of Tables

Chapter 1

Introduction


The proliferation of micro-blogging platforms and networking services has led to new and previously unexplored opportunities to disseminate information. Many users share their opinions, thoughts and feelings and express their needs and desires using social media on a daily basis. At first sight, the disseminated information seems to be of a diverse and chaotic character. Nevertheless, if it is not viewed in isolation but rather as a part of a larger context, it can form a coherent structure that reveals hidden relationships. This fact, together with the relative simplicity of information retrieval, has led to the appearance of new domains of research within data mining, text mining, machine learning and natural language processing. For example, some studies have focused on predicting trends [3, 4] and revenues [5], on recommending products and services [6–8], and on sentiment analysis of topics such as brands and celebrities [9, 10].

Business and industry followed the example of the academic world and started to explore marketing opportunities in social media. For customers, social media has become an important arena for sharing experiences and finding information about companies, their products, and services. Online networking offers a useful source of information for both parties. Consumers can find up-to-date information, and companies can mine the opinions of users on their products and services. Ideas for the improvement and development of new products can be found, and new opportunities for user-user, company-user and user-company interaction are created. Due to the improved knowledge about the customer and the ease of reaching out to her or him, advertising is becoming more personalized. The behavioural patterns of users are collected and analyzed to create an individual-oriented relationship to the company. The customers, for their part, can interact with each other, collecting more knowledge on the diversity of products and services on the market. Due to the size of the available information, both customers and companies need reliable tools for getting a quick and simple overview of it. Computational research based on the analysis of tweets is therefore a part of producing such tools [11].

Due to the size and the varying quality of the information available on the Internet, the process of finding and extracting relevant information has become more difficult and time consuming. In order to overcome this problem, a wide range of clustering and summarization methodologies have been developed. Due to the dynamic nature of Web content, there is a constant need to improve and adjust already existing methods.

This study focused on the classification of information retrieved from Twitter, a micro-blogging website. In June 2015, there were 315 million active users on Twitter around the globe [12]. With such a high number of users, almost any action in the real world receives feedback on the networking platform. Politicians, celebrities and companies choose to connect to their fans, followers and customers using this medium. Twitter has played an important role in socio-political events such as the Arab Spring and the Occupy Wall Street Movement. It attracts a diverse community of users and therefore makes it possible to study different social phenomena. For instance, tweets have been used to create a variety of applications: from the recommendation of the best navigational routes based on tweets [13] to mapping the scale of the destruction caused by a tsunami [14]. Furthermore, Twitter is simple to leverage due to the existence of APIs that enable a fast and simple connection to the website.

This study focused particularly on the commercial use of Twitter. The goal was to find the reasons for hate towards mobile operators. The analysis was conducted on tweets addressed to the three largest mobile operators in the US: AT&T, Verizon and Sprint. Machine learning techniques were used to identify hate tweets and the related reasons. The motivation and purpose of the study are described in section 1.3.




by adding a hashtag, represented by the ”#” sign. In order to address another user, an ”at-sign” (@) has to be added in front of the user’s name.

Tweets are very noisy data due to their short length, their informal language and the lack of context or background knowledge. This makes the analysis of tweets a challenging task.



According to previous studies, in order to create an efficient marketing strategy, the online content of social media should be used to bridge the discrepancy in the physical-virtual relationship between consumers and companies [15]. For this reason, management systems for the analysis of web content are becoming an inevitable part of strategy analytics for small and mid-size businesses, large corporations and organizations alike [11]. These systems come in different forms depending on the needs of the company. One of the methodologies used to map the relationship between the customer and the company is a Customer Journey Map (CJM). A CJM is an oriented graph that visualizes how the relationship between the customer and the company develops. There is no official definition of CJM and it exists in different variations depending on the brand and purposes. Typical stages of a consumer-company relationship are the following: Discovery, Investigation, Order, Usage, and Termination. Each category describes the current state of the relationship between the user and the company or its products. If the information posted on the Internet by customers can be automatically classified according to CJM, it can help create personalized offers and better foresee the needs of the customer.

Prior to the project described in this report, another study on the classification of tweets using Support Vector Machine (SVM) was conducted. The goal of that study was to explore how well the model could be used for tweets, and the classification categories were therefore taken from CJM. During the labeling of the tweets, it was noticed that the number of hate tweets about the mobile carriers and their services was surprisingly high. Hate tweets are tweets that explicitly express hate towards the company (in this case only mobile carriers, but the concept can be extended to other industries) or its products and/or services. For example, the tweet ”I hate Verizon” conveys a strong feeling but does not explain what the specific problem is. The lack of an explanation makes it impossible to classify a hate tweet according to CJM.


making. By analysing tweets, it is possible to minimize the guesswork of companies regarding the issues that create dissatisfaction.




Chapter 2

Theory


Prior to the classification of tweets with a machine learning algorithm, it is necessary to collect data, remove superfluous words, and define the classes and how they can be represented. This chapter explains these major steps in text mining along with a geometric interpretation of the principles behind SVM. Finally, it describes the format of the files used for training and testing SVM with the LIBSVM tool.


Data Retrieval

Tweets and the information related to them, such as the name of the user, the publication date, the location (if enabled by the user) and profile information, can be downloaded from Twitter using one of the numerous Twitter Application Programming Interfaces (APIs). Twitter APIs have to be accessed using an authentication request. The request is conducted through the Open Authentication protocol (OAuth). The protocol defines how an application requests access to one of the APIs: it has to submit credentials issued by Twitter. To receive those, the application has to be registered on Twitter [16].

Twitter provides the developer with several options when it comes to the choice of an API. The most used ones are the Streaming API and the REST API. There are three major differences between Streaming and REST APIs:

• data access

Streaming APIs work in online mode, providing continuous access to newly posted tweets. Once a connection between the application and Twitter is established, the tweets start flowing into the system. REST APIs collect only data that has already been posted on Twitter, pulling a certain number of tweets per request. Only tweets published within the last week can be collected using REST APIs.

• rate limits

Streaming APIs are allowed to stream 5,000 user ids concurrently [17]. The REST APIs’ rate limit window is 15 minutes long [18]. Users represented by access tokens can make 180 requests per time window. Using application-only authentication, an application can make 450 requests per 15-minute window on its own behalf.

• search function

Search queries can be built from key words, phrases, names of users, dates etc. It is also possible to include additional parameters in the search query. For example, the search can be restricted to a certain language or geolocation. The format of the search query is the same for both REST and Streaming, but the search function is implemented differently. The REST search returns relevant tweets but does not guarantee completeness. For this reason it might be more appropriate to use Streaming APIs in some cases.

There are three types of streams: public streams, user streams and site streams. Public streams contain public tweets published in chronological order; user streams contain the tweets from the timeline of a single user; site streams access the timelines of multiple users [17]. In this study, a user stream was used to collect data from the timelines of the users who posted hate tweets.
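The request budgets quoted above (180 or 450 requests per 15-minute window) can be tracked on the client side before a request is sent. The following is a minimal sketch, assuming a fixed window that starts with the first request; the class `WindowRateLimiter` and its methods are hypothetical names and are not part of any Twitter library.

```python
class WindowRateLimiter:
    """Tracks how many requests remain in a fixed rate-limit window."""

    def __init__(self, max_requests=180, window_seconds=15 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.window_start = None  # time of the first request in the window
        self.used = 0

    def try_acquire(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        expired = (self.window_start is None
                   or now - self.window_start >= self.window_seconds)
        if expired:
            # A new window begins and the budget is reset.
            self.window_start = now
            self.used = 0
        if self.used < self.max_requests:
            self.used += 1
            return True
        return False

    def seconds_until_reset(self, now):
        """Seconds left until the current window ends."""
        if self.window_start is None:
            return 0
        return max(0, self.window_seconds - (now - self.window_start))


# Budgets quoted in the text: per-user token and application-only access.
user_limiter = WindowRateLimiter(max_requests=180)
app_limiter = WindowRateLimiter(max_requests=450)
```

A collector would call `try_acquire` before every request and sleep for `seconds_until_reset` when it returns `False`.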


Text Mining

Text mining refers to the process of finding relationships and patterns in collections of unstructured information. Text analytics can be broken down into three steps [19]:

1. Pre-processing: removal of stop-words, stemming, and tokenization.

2. Text representation: determination of the most significant features of a text.

3. Knowledge extraction: use of machine learning tools to find relationships and hidden dimensions that are difficult to determine manually.


Stemming gives a denser feature representation by reducing inflected words to their root. Tokenization removes punctuation and breaks phrases into single words, called unigrams. When the necessary pre-processing is done, the content of a text is analyzed using statistical methods to determine its characteristic features.

Text data can be treated in different ways and on various levels. It can be treated as a collection of independent unigrams. This approach, known as bag-of-words (BOW), omits the linguistic specifics of the text and neglects the semantics. BOW is the most common approach due to its simplicity. However, words can be used in different contexts, and important nuances might therefore be lost, which could lead to inaccurate results in the knowledge extraction step. In some cases, the most common pairs of words, called bigrams, are added to the feature space. The feature space is the collection of all the representative characteristics of the studied classes. Each class is represented by a feature vector. There is another, more sophisticated, approach that treats data on the semantic level. Compared to BOW, it is more challenging from the computational point of view since it takes advantage of the existing relations between the words. It is still more common to use BOW for the creation of the feature space.
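The pre-processing and bag-of-words steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer; neither is taken from the study.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the study's list).
STOP_WORDS = {"i", "a", "an", "the", "is", "my", "so", "to", "and", "it"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, tokenize (dropping punctuation), remove stop words, stem."""
    tokens = re.findall(r"[a-z0-9&@#']+", tweet.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(tokens, use_bigrams=False):
    """Count unigrams and, optionally, adjacent word pairs (bigrams)."""
    features = Counter(tokens)
    if use_bigrams:
        features.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return features


tokens = preprocess("I hate Verizon, the calls keep dropping!")
features = bag_of_words(tokens, use_bigrams=True)
```

Because word order is discarded, the unigram counts alone cannot distinguish "hate Verizon" from "Verizon hate"; the optional bigrams recover a little of that context at the cost of a larger feature space.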

The representation of classes does not have to be constrained to the most significant words. On the contrary, research shows that adding extra features improves the ensuing analysis [9, 20]. In chapter 3, various feature extension techniques are described. In order to extract useful knowledge from the created representation, the features are stored as numbers and analyzed with machine learning techniques. The conversion of words to numbers can be performed in various ways. The technique for feature selection used in this work is the Akaike Information Criterion (AIC). AIC is based on the estimation of the information loss when modeling the underlying processes that create the observations. Given empirical data, for example a word in a tweet, AIC estimates how likely it is that the word appears in the target class compared to other classes using the maximum log-likelihood. Furthermore, AIC incurs a penalty on the number of free parameters: the higher the number of parameters, the higher the penalty. This approach discourages overfitting. Overfitting is a common problem in machine learning and refers to a model that fits the training data so closely that it misclassifies unseen test data. It typically occurs when the number of features is high relative to the number of data points. AIC is calculated as:

$$\mathrm{AIC} = -2 \ln L + 2K, \tag{2.1}$$

where $L$ is the maximized likelihood, so that $\ln L$ is the maximized log-likelihood, and $K$ is the number of free parameters.
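Equation 2.1 can be applied to a single word as sketched below. The text does not spell out which models are compared, so this sketch assumes a 2x2 contingency table (word present or absent against target class or other classes) and compares an independence model with K = 2 against a dependence model with K = 3. This is one common way of using AIC for keyword selection and is not necessarily the procedure used in this study; all function names are hypothetical.

```python
import math


def aic(log_likelihood, k):
    """Equation 2.1: AIC = -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * k


def _n_log_p(count, total):
    # Contribution count * ln(count / total); taken as 0 for an empty cell.
    return count * math.log(count / total) if count > 0 else 0.0


def aic_independent(n11, n12, n21, n22):
    """Word occurrence and class are modelled as independent (K = 2)."""
    total = n11 + n12 + n21 + n22
    margins = (n11 + n12, n21 + n22, n11 + n21, n12 + n22)
    return aic(sum(_n_log_p(m, total) for m in margins), 2)


def aic_dependent(n11, n12, n21, n22):
    """Every cell has its own probability (K = 3)."""
    total = n11 + n12 + n21 + n22
    cells = (n11, n12, n21, n22)
    return aic(sum(_n_log_p(c, total) for c in cells), 3)


def is_representative(n11, n12, n21, n22):
    """n11/n12: target-class tweets with/without the word;
    n21/n22: other tweets with/without the word."""
    more_frequent_in_target = n11 * (n21 + n22) > n21 * (n11 + n12)
    dependence_preferred = (aic_dependent(n11, n12, n21, n22)
                            < aic_independent(n11, n12, n21, n22))
    return more_frequent_in_target and dependence_preferred
```

The penalty term is what keeps a word that is spread evenly across classes out of the feature space: both models then reach the same likelihood, and the dependence model loses because it spends one more parameter.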


After defining the feature space, knowledge discovery methods can be applied to the numerical representation of the words. A common and robust approach is Support Vector Machines (SVM), described in section 2.3.1.


Supervised Machine Learning

The main idea of machine learning is to teach a computer to associate new data with data that it has been exposed to earlier. The dynamic character of the data available on the Internet is hard to deal with automatically because pre-programmed algorithms might not describe tomorrow’s reality. For instance, without machine learning algorithms the detection of spam e-mails and fraud becomes difficult. The goal is therefore to recognize old patterns in new data. For instance, Gmail has been taught to distinguish spam from non-spam e-mails rather than being explicitly programmed for it.

There are different types of machine learning: supervised, semi-supervised, unsupervised, active and deep learning. This study focused only on supervised machine learning. The name comes from the fact that ”correct answers”, i.e. training data, must be given. The algorithm creates a prediction model based on the observed samples. Supervised learning targets two types of problems: regression and classification. Regression addresses the prediction of continuous-valued output; classification deals with discrete values. This study considered discrete-valued output and therefore focused on the classification problem. In classification, the features of classes are represented as vectors consisting of the most significant attributes. For example, tweets are represented by their most significant words, which are organized in feature vectors. Supervised learning relies on labelled training data. Usually, the retrieval of the needed data is relatively simple. Manual labeling, on the other hand, has proven to be more time consuming. For this reason, the approach of active learning was evaluated in this project. Active learning algorithms require smaller training data sets than supervised algorithms. However, active learning is combined with some interactive learning, i.e. the algorithm asks the user to label some instances of unclassified data to enhance the ”understanding” of significant features.

2.3.1 Support Vector Machines (Geometrical interpretation)



Figure 2.1: Data from class A and class B separated by two different planes: one represented as a dashed line and another one represented as a solid line [1].

Figure 2.2: The maximal margin is represented by the line that goes through the points d and c and is orthogonal to the hyperplane [1].

In figure 2.1, we see that the solid line is a better choice of demarcation than the dashed one because the margin from the nearest point of each data set to the line is larger. The classes A and B are represented by the matrices $A_{m \times n}$ and $B_{m \times n}$ respectively. Every row



Figure 2.3: Two distinct classes, represented by blue and red dots, cannot be separated by a maximal margin hyperplane [2].

$Bw - Ie \le 0$. The final separating plane lies between the supporting hyperplanes, and therefore the objective of the minimization is $\frac{1}{2}\|w\|^2$. The problem of finding the two closest points can be stated in the following way:

$$\text{minimize} \quad \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad Aw - Ie \ge 0, \quad Bw - Ie \le 0. \tag{2.2}$$

The line between the two closest points must be orthogonal to the supporting hyperplanes. In order to construct such a plane, the algorithm finds two convex hulls and constructs a line between the two nearest points in these sets. This approach is illustrated in figure 2.2, where the line that goes through the points d and c, denoted by w, is the maximal margin, and the solid line orthogonal to w is the maximal margin hyperplane. The point d and the two circle points lying on the same dashed line as the point c are called support vectors. They support the maximal margin in the sense that if these points moved, the maximal margin hyperplane would shift too. Notice that a change in the position of any other point does not affect the maximal margin hyperplane unless the point crosses the boundary. It is important that an orthogonal line can be drawn between the two supporting planes; otherwise the two points might not be the closest ones, or the supporting hyperplanes might not be as far apart as possible.


This also means that the maximal margin classifier cannot be used. A commonly used approach to deal with this problem is to allow a certain degree of misclassification in the interest of a better classification of most of the data. This approach is called the support vector classifier. To allow a certain degree of misclassification, a non-negative tuning parameter C is introduced. The parameter C determines how much freedom there is to violate the margin. In equation 2.3, C weights the penalty on margin violations: if C is low, the system is tolerant of misclassifications, the margin is wide and several violations are allowed. When C is high, the margin narrows, implying a classifier that fits the training data closely.

There are some classes that cannot be separated linearly because the relationship between the outcome and the predictors is non-linear. In this case, the number of predictors is expanded by using a non-linear function. More specifically, this is done by applying so-called kernels.

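The final formulation of the optimization problem becomes:

$$
\begin{aligned}
\underset{w,\; b,\; \xi \in \mathbb{R}^n}{\text{minimize}} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & y_i\left(\langle w, \Phi(x_i)\rangle + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n, && (2.3) \\
& b \in \mathbb{R}, && (2.4) \\
& \xi_i \ge 0, \quad i = 1, \dots, n, && (2.5)
\end{aligned}
$$

where $\xi_i$ is a slack variable that allows individual observations to be on the wrong side of the margin, and $x_i$ is mapped into the feature space by the non-linear map $\Phi$. If $\xi_i = 0$ then the $i$-th observation was classified correctly; if $\xi_i > 0$ then it is on the wrong side of the margin; if $\xi_i > 1$ then it is on the wrong side of the hyperplane.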
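The role of the slack variables can be made concrete with a small numerical sketch. It assumes a linear classifier (Φ is the identity) with a given w and b, and uses the smallest slack that satisfies constraints 2.3 and 2.5, namely max(0, 1 − y(⟨w, x⟩ + b)). The function names are hypothetical and the values of w, b and the points are made up for illustration.

```python
def decision_value(w, b, x):
    """<w, x> + b for a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def slack(w, b, x, y):
    """Smallest slack satisfying the constraints, for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def describe(xi):
    """Interpret a slack value as in the text following equation 2.5."""
    if xi == 0:
        return "classified correctly"
    if xi <= 1:
        return "wrong side of the margin"
    return "wrong side of the hyperplane"


def objective(w, b, points, labels, C):
    """0.5 * ||w||^2 + C * sum of slacks, the quantity being minimized."""
    norm_sq = sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * norm_sq + C * total_slack


w, b = [1.0, 0.0], 0.0
points = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
labels = [1, 1, 1]
slacks = [slack(w, b, x, y) for x, y in zip(points, labels)]
```

With these values the three points illustrate the three cases in turn, and raising C makes the second and third points more expensive for the optimizer to tolerate.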

In equation 2.3, the function $\Phi(x_i)$ does not need to be calculated explicitly. Instead, another function that defines the inner product in the new space is used: $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. $K(x_i, x_j)$ is called a kernel function. The kernel is evaluated on the input data in $\mathbb{R}^d$ and returns the inner product in the feature space, so the mapping itself never has to be carried out. Some of the most common kernels are the following:

• Linear: $K(x_i, x_j) = x_i^T x_j$.

• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$.

• Radial basis function (RBF): $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$, $\gamma > 0$.
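The three kernels listed above can be written out directly. This is a minimal sketch on plain Python lists; the default parameter values are arbitrary choices for illustration and are not the settings used in the study.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=3):
    # (gamma * <u, v> + r) ** d, with gamma > 0
    return (gamma * dot(u, v) + r) ** d


def rbf_kernel(u, v, gamma=1.0):
    # exp(-gamma * ||u - v||^2), with gamma > 0
    squared_distance = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Each function takes two input vectors and returns a single number, which is all the optimizer needs; the feature-space image of a vector is never constructed.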


common classification approaches: one-versus-one and one-versus-all. A one-versus-one approach compares each class with every other class, one pair at a time. A one-versus-all approach treats the target class as one class and all the other classes together as the other.
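The difference between the two approaches shows up in how the training labels are prepared. The sketch below uses the four categories named in the abstract; the helper names are hypothetical, and the +1/−1 encoding is the usual convention for binary SVM training rather than something stated in the text.

```python
from itertools import combinations

CATEGORIES = ["Hate", "Reason", "Explanatory", "Other"]


def one_versus_all_labels(labels, target):
    """Relabel the data for one binary problem: target class against the rest."""
    return [1 if label == target else -1 for label in labels]


def one_versus_one_pairs(categories):
    """List every pair of classes that needs its own binary classifier."""
    return list(combinations(categories, 2))


labels = ["Hate", "Other", "Reason", "Hate"]
hate_vs_rest = one_versus_all_labels(labels, "Hate")
pairs = one_versus_one_pairs(CATEGORIES)
```

With four categories, one-versus-all needs four binary classifiers, whereas one-versus-one needs one per pair.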


SVM Data Format
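LIBSVM reads training and test data from plain text files with one instance per line, written as `<label> <index>:<value> <index>:<value> ...`, where feature indices start at 1, are listed in ascending order, and zero-valued features may be left out. The following minimal sketch produces such a line from a sparse feature mapping; the function name and the example values are hypothetical.

```python
def to_libsvm_line(label, features):
    """Format one instance as '<label> <index>:<value> ...'.

    `features` maps 1-based feature indices to numeric values. Zero values
    are omitted and indices are written in ascending order.
    """
    parts = [str(label)]
    for index in sorted(features):
        value = features[index]
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# A tweet labelled +1 in which features 1 and 3 are active.
line = to_libsvm_line(1, {3: 1.0, 1: 0.5, 7: 0})
```

Leaving out the zeros matters for tweets, since each tweet activates only a handful of the features in the vocabulary.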


Chapter 3

Related Work

This study proposes a framework for the identification of hate tweets related to mobile operators and their triggers using classification techniques. The review of relevant work therefore focuses mainly on the classification of tweets and on a few studies related to hate tweets. To broaden the understanding of Twitter’s importance for companies, a brief review of marketing studies related to Twitter is also included in this chapter.

Short text classification is the field of text classification into pre-defined categories where the analysed data is sparse. The main goal of short text classification is to assign a certain category to a piece of information, for example a tweet. Accurate automatic classification of tweets is a challenging task due to their short length, variations in the vocabulary, unclear context and the sparseness of the text. To overcome these problems, the classification is preceded by pre-processing and sometimes by expansion of the vocabulary.

According to previous research, there are other decisions that might impact the outcome of the classification. One of them is the choice of class labels. A suitable choice of labels facilitates the division of data into well defined clusters, which benefits the representation of classes. The labeling of data depends on the application. The choice of labels is often based on observations of the data or on the adoption of pre-existing classification schemes (e.g. the advertising model AIDA, where the acronym stands for Attention, Interest, Desire and Action). ”Mining Consumer Attitude and Behavior” by Hwon et al. [21] shows that an appropriate clustering, in this case AIDA, can support the methodology and reveal hidden relationships. However, in some cases class labels cannot be determined beforehand due to their dynamic nature. News and trending topics are two examples of constantly evolving and changing subjects [22], [3]. Nevertheless, even this type of problem utilizes a framework of pre-determined generic categories, such as technology and art, that supports the classification task.


As stated earlier, tweets are informal, short and sparse (i.e. a certain word might occur seldom in the tweets). The removal of noise and the creation of a denser feature space are therefore important for the subsequent construction of feature vectors. The pre-processing procedure is described in the section Text Mining. Pre-processing steps were highlighted in the studies by Lee et al., Wang et al. and Perez et al. [3], [9], [11]. In addition to pre-processing, some studies [23], [22], [4] employed filtering techniques in order to improve the relevance of the collected tweets. The keywords for filtering are often derived from the observed data and/or defined from dictionaries or other external resources. For example, Yang et al. [23] suggested, in a study from 2013, a method for filtering out ambiguous uses of a company name. The classification was binary: either the tweet was related to the company or not. Each category was represented by keywords that were defined as the words most frequently searched for by Internet users. For example, the top three keywords for the company Apple were: apple, apple store, apple iphone 4. After the keywords were determined, two different filters were created in order to ensure high accuracy in the recognition of the keywords. Filter 1 checked whether the keyword occurred as a whole keyword or not; filter 2 determined the relevant single tokens. For instance, if the keyword is ”iphone 4” and the target tweet is ”I love iphone”, Filter 1 would reject the tweet, whereas Filter 2 would recognize it as being relevant.
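The behaviour of the two filters in the ”iphone 4” example can be sketched as follows. This is an illustration of the description above, not the implementation of Yang et al. [23]: the function names are hypothetical, and the set of relevant single tokens is assumed to be supplied by hand.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_whole_keyword(tweet, keyword):
    """Filter 1: accept only if the complete keyword occurs in the tweet."""
    tweet_tokens, key_tokens = tokenize(tweet), tokenize(keyword)
    n = len(key_tokens)
    return any(tweet_tokens[i:i + n] == key_tokens
               for i in range(len(tweet_tokens) - n + 1))


def filter_single_tokens(tweet, relevant_tokens):
    """Filter 2: accept if any token judged relevant occurs in the tweet."""
    return any(token in relevant_tokens for token in tokenize(tweet))


tweet = "I love iphone"
passes_filter_1 = filter_whole_keyword(tweet, "iphone 4")
passes_filter_2 = filter_single_tokens(tweet, {"iphone"})
```

Restricting Filter 2 to hand-picked tokens is a design choice of this sketch: accepting every token of the keyword would let the bare token "4" match unrelated tweets.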




Internal Enhancement of Feature Space

Aside from tweets, Twitter provides the reader with additional information such as topic-indicative hashtags, context-enhancing links, user profile information, lists of followers etc. Previous studies have shown that a higher classification accuracy can be achieved by expanding the feature space using this internal information [22], [24], [9]. For example, feature-based enhancement was explored by Sriram et al. [20]. They improved the classification of tweets by adding a nominal feature, i.e. the author’s name, and seven binary features. Examples of these were the absence of shortenings, emotions, and slang words. Benevenuto et al. [24] proposed a classification method for distinguishing spammers, i.e. users who post spam, from non-spammers by integrating 23 different metrics discovered on Twitter. Interestingly, the evaluation of the features showed that even the least significant ones improved the classification compared to the baseline method. It is therefore reasonable to assume that even low-ranked features have discriminatory power. Furthermore, regarding users not as single elements but as parts of a larger network has proven beneficial [3].
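Feature-based enhancement of this kind amounts to appending extra entries to the bag-of-words representation. The sketch below is a minimal illustration; the function name, the key prefixes and the two example flags are hypothetical and do not reproduce the feature set of any of the cited studies.

```python
def enhance_features(word_counts, author, binary_flags):
    """Merge BOW counts with a nominal author feature and binary flags."""
    features = {f"word={word}": count for word, count in word_counts.items()}
    # Nominal feature: one indicator entry per distinct author name.
    features[f"author={author}"] = 1
    # Binary features: 1 if the property holds for the tweet, else 0.
    for name, value in binary_flags.items():
        features[f"flag={name}"] = int(bool(value))
    return features


enhanced = enhance_features(
    {"hate": 1, "verizon": 1},
    author="user42",
    binary_flags={"has_hashtag": True, "has_link": False},
)
```

Prefixing the keys keeps a word such as "user42" in a tweet from colliding with the author feature of the same name.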


External Enhancement of Feature Space

One of the big challenges of tweet classification is sparseness, i.e. many words appear only a few times in a corpus. For this reason, some tweets might be represented by an empty vector. To alleviate this problem, attempts have been made to create a denser feature space. A common technique is to incorporate external resources such as search engines, open source knowledge bases (e.g. Wikipedia) and online dictionaries. Perez et al. [11] proposed three ways of enriching the original feature vectors: (1) incorporating general information about the company provided by Wikipedia, (2) enriching the tweets that really refer to companies, and (3) expanding only the ambiguous words with external information. The results showed that the third approach performed better than the other methods on specific company names (Armani, Warner, Cadillac etc), whereas the first approach performed better than the other methods on generic company names (Parl, Sprint, Southwest etc). In a paper from 2008, Phan et al. [25] explored the enrichment of the tweets with extracted topics and found that it improved the classification and outperformed the baseline.


Method                                                    Accuracy
BOW approach for classification with C5.0 [3]             65 %
Network-based classification with C5.0 [3]                70 %
DFICF with Naive Bayes [9]                                71 %
Adding user attributes; training with SVM [24]            87 %
Tf and clarity with Maximum Entropy classifier [26]       70 %
BOW and eight additional features [20]                    95 %
Self-Term expansion with K-means classifier [11]          60-95 %
TEM-Wiki with K-means classifier [11]                     60-97 %
TEM-Full with K-means classifier [11]                     54-73 %

Table 3.1: Accuracy of the proposed methods for some of the studies


Marketing Potential of Twitter

The spreading of information over the Internet requires companies to revise their marketing strategies [27]. According to an interview study, a presence on the Web allows companies to understand consumers’ consumption habits and to detect and anticipate negative reactions [28]. A better understanding of users’ preferences, along with advances in Twitter mining and classification techniques, has created opportunities for the development of user models. Based on knowledge about users’ interests and tweeting patterns, scientists have tried to understand how to better target users with product recommender systems [8]. A paper from 2011 [29] investigated how tweeting activities could support the modeling and personalization of user profiles. Based on the hashtags and the topics of the posted tweets, the study compared three profile models for their applicability to a news recommendation system.


Antagonism in Tweets

The topic of hate and radicalization on Twitter has been addressed in a handful of studies. Burnap and Williams [30] addressed the identification of hateful and/or antagonistic statements against certain races and religions. In a paper from 2013, Kawase et al. investigated what consequences hateful tweets about jobs might have [31]. As far as we are aware, no study has addressed hate tweets related to mobile operators.


Chapter 4

Method


This chapter gives an overview of the methodology employed in this work. In particular, we provide a definition of the classification categories and describe the process of collection and pre-processing of the tweets. Finally, the proposed methods, the Naive Method (NM) and the Partial Timeline Method (PTM), are introduced in sections 4.4 and 4.5.


Defining hate tweets and reasons

In this study, a hate tweet was defined by the presence of two components: the verb “hate” and the object at which the verb was directed. The object was the name of the company, or the pronoun “it” if it referred to the company. Words and phrases describing the company’s services and/or products were also accepted as the object of hate, for example “I hate Verizon” or “I absolutely hate Sprint’s service”. Hate tweets without any stated explanation for the hate were labeled as Hate.

When reading the timelines of the users who posted a hate tweet, we noticed that the reasons, if stated, could appear before and/or after the hate tweet. The differences between these ways of stating the reason were studied more closely in order to determine if they ought to be treated as separate categories.

First, we looked at the cases where the reasons were stated before the hate tweet, serving as the premise. These types of reasons were called triggering reasons. Second, we looked at the cases where the reasons were stated after the hate tweet, serving as an explanation. These reasons were called justificatory reasons. Third, we looked at the cases where the reasons were stated both before and after the hate tweet and therefore were called combined sequences. The observed types of reasons are summarized in table 4.1.

Category | Description | Example
Justificatory reason | The reason is explained after the hate statement and it justifies the previously posted hate tweet. | First the user posted: “I hate sprint so much.” Then the explanation was posted: “So upset because our wifi is out and all I want to do was watch Supernatural.”
Triggering reason | The user first described an issue related to the mobile service and then made a hate statement. | The user explained the problem: “Pissed not understanding why my phone isn’t ready to be picked up when it was suppose to be ready yesterday.” Consecutive post is a hate tweet: “I hate AT&T.”
Combined sequence | A combination of justificatory and triggering reasons. |

Table 4.1: A description of different types of reasons for hate tweets.

These three categories were compared to each other with respect to word frequency, whether they were addressed to someone (i.e. the presence of a username) and the time of posting. The results are presented in section Results and Discussion in table 5.4. However, the differences between these cases were insignificant and they were therefore treated collectively as a single category, Reason. A tweet was labeled as Reason if the following criteria were fulfilled:

• it addressed an issue related to the mobile operator or its service

• it was stated during the same day as the hate tweet

• it contained the conjunction because (also spelled cuz, cus, coz, bs, b/c, cause) or the words reason or why (optional alternative).
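The two mechanical criteria in the list above (same day, presence of a cue word) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tokenisation regex and the idea of combining the checks with a logical AND are not taken from the study, and the first criterion (topical relevance to the operator) is left to a human or a classifier.

```python
import re
from datetime import datetime

# Cue words listed in the third criterion (spellings copied from the text).
REASON_CUES = {"because", "cuz", "cus", "coz", "bs", "b/c", "cause", "reason", "why"}


def has_reason_cue(text):
    """Return True if the tweet contains one of the cue words as a whole token."""
    # Tokens are runs of letters, optionally joined by "/" so that "b/c" survives.
    tokens = re.findall(r"[a-z]+(?:/[a-z]+)*", text.lower())
    return any(token in REASON_CUES for token in tokens)


def posted_same_day(tweet_time, hate_time):
    """Second criterion: same calendar day as the hate tweet."""
    return tweet_time.date() == hate_time.date()


def is_reason_candidate(text, tweet_time, hate_time):
    """Combine the two mechanical criteria (an assumption of this sketch)."""
    return posted_same_day(tweet_time, hate_time) and has_reason_cue(text)


if __name__ == "__main__":
    hate_time = datetime(2015, 1, 1, 3, 58, 25)  # hypothetical date
    print(is_reason_candidate("So upset because our wifi is out",
                              datetime(2015, 1, 1, 3, 52, 3), hate_time))
```

Matching whole tokens rather than substrings avoids false hits such as “cause” inside an unrelated longer word.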


Category | Structure | Example
Hate | The word hate and the name of the company appear in the same tweet. It is clear that the author hates the company or its services and/or products, but no reason is given. | “Dear AT&T I hate u.” “I absolutely hate Sprint.”
Explicit | Hate and the reason are explicitly stated in the same tweet. | “Starting to hate at&t slow service #at&t.” “Starbucks free wifi is better than my AT&T wifi. I hate it so much.”
Reason | A tweet that clarifies the reason behind hate. | The trigger or reason was posted at 03:52:03: “Welp only got 40% battery and when my phone died it’s over cause it wont charge for some reason smh.” Hate-tweet posted at 03:58:25: “And I hate going to the Verizon store smh.”
Other | Any other tweets on the user’s timeline. | “My grandparents are so country.” “Back to work in the morning after 9 days.”

Table 4.2: Criteria for hate tweet labelling.

before or after the tweet classified as Explicit. The latter category was not further broken down based on the order in which the hate and the reason appeared, because this distinction did not yield significant differences for justificatory and triggering reasons. The final categorization of the tweets is presented in table 4.2.




Collection and Pre-processing

In order to collect training data, a search query was created to identify hate tweets and thereby the relevant timelines. The query had the same format as a hate tweet, i.e. it combined two words: “hate” and the name of the operator, e.g. Verizon. If these two words were present in the same tweet, the entire timeline from that day was pulled and stored in an Excel table. The underlying assumption was that the reason for the hate tweet would also be posted during the same day.
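As a rough illustration of this collection rule, the sketch below filters already-downloaded tweets rather than calling the Twitter API. The tuple layout `(user, datetime, text)`, the function names and the tokenisation are assumptions made for the example, not details from the study.

```python
import re
from datetime import datetime

OPERATORS = {"verizon", "at&t", "sprint"}


def tokens(text):
    """Lower-case word tokens; '&' is kept so that 'AT&T' stays one token."""
    return re.findall(r"[a-z&]+", text.lower())


def matches_hate_query(text):
    """True if 'hate' and an operator name occur as tokens of the same tweet."""
    words = set(tokens(text))
    return "hate" in words and not words.isdisjoint(OPERATORS)


def same_day_timelines(tweets):
    """Group tweets by (user, date) and keep only days that contain a query hit.

    tweets: list of (user, datetime, text) tuples.
    """
    hit_days = {(user, when.date()) for user, when, text in tweets
                if matches_hate_query(text)}
    pulled = {key: [] for key in hit_days}
    for user, when, text in tweets:
        key = (user, when.date())
        if key in pulled:
            pulled[key].append(text)
    return pulled
```

Whole-token matching matters here: a plain substring test would also fire on words such as “whatever”, which contains the letters “hate”.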

The tweets were collected with the Streaming and REST APIs from Twitter, stored in Excel and manually searched for the reasons.

The labelled tweets were then pre-processed in order to remove noise and create a denser feature space. In this study, the pre-processing followed a classical scheme described in Text mining. The first step was to transform the tweets into separate tokens and to remove punctuation and stop-words. The pre-processing did not make use of the relationships between the words, i.e. it followed the BOW approach. The next step was to stem all the words and to replace URLs by the word “url” and usernames by the word “username”. Different spellings of a word were replaced by one alternative.
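A minimal sketch of such a pipeline is given below. It is not the program used in the study: the stop-word list and spelling map are tiny illustrative samples, `crude_stem` is a stand-in for a real stemmer (e.g. Porter), and URLs and usernames are replaced before tokenisation because tokenising first would break them apart.

```python
import re

# Tiny illustrative lists (assumptions, not the lists used in the study).
STOP_WORDS = {"i", "the", "a", "is", "to", "and", "my", "so"}
SPELLING = {"cuz": "because", "cus": "because", "coz": "because", "b/c": "because"}


def crude_stem(token):
    """Very rough suffix stripping; a placeholder for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(tweet):
    """Return the bag-of-words tokens of a tweet."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " url ", text)   # URLs -> "url"
    text = re.sub(r"@\w+", " username ", text)      # @mentions -> "username"
    # Keep letters and '&'; this drops punctuation and digits.
    found = re.findall(r"[a-z&]+(?:/[a-z&]+)*", text)
    found = [SPELLING.get(tok, tok) for tok in found]  # unify spellings
    return [crude_stem(tok) for tok in found if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("@bob I hate Verizon cuz calls keep dropping http://t.co/x"))
```

Because word order is discarded, the resulting token list can be turned directly into a count or presence vector over the selected feature words.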

Pre-processing resulted in four word lists, one for each category. These lists, i.e. feature sets, were compared to each other using AIC in order to determine the most representative words for each category. The threshold for the model selection was set to 0.1 to avoid overlap between the words from different classes. The application of AIC is explained in section A.2.


Training and testing

Training and testing were performed using LIBSVM. The calculation of AIC was performed in a self-developed program written in Java.

The training and testing procedure was as follows:

1. The data were converted to SVM format, see section 2.4.

2. 20 % of the collected data were set aside as test data (proportional to the number of tweets in each category) and the remaining 80 % formed the training data.

3. Training and test data were scaled to the range [-1, +1] using the command svm-scale.

4. An RBF kernel was chosen. The relationship between the feature words and the categories is non-linear and therefore it is reasonable to assume that an RBF kernel is the optimal choice. The expression for an RBF kernel was presented in section

5. The parameters C and γ were selected by grid search with the script grid.py.

6. The SVM algorithm was trained on the training data set. For the classification task, the one-versus-all approach was used.

7. The predictive power of the model was estimated on the test set.

Figure 4.1: The Naive Method.
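Steps 1 to 3 of the procedure above can be illustrated without LIBSVM itself. The sketch below is a plain-Python approximation, not the tooling used in the study: the function names are hypothetical, features are assumed to be dense lists of numbers, and the scaling ranges are fitted on the training part only and then reused for the test part, which is the practice recommended in the LIBSVM guide.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (label, features) pairs so each label keeps the same test share."""
    by_label = defaultdict(list)
    for label, features in samples:
        by_label[label].append((label, features))
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def fit_ranges(train):
    """Per-feature (min, max), computed on the training data only."""
    columns = zip(*(features for _, features in train))
    return [(min(col), max(col)) for col in columns]


def scale(features, ranges, lower=-1.0, upper=1.0):
    """Map each feature linearly from [min, max] to [lower, upper]."""
    scaled = []
    for value, (lo, hi) in zip(features, ranges):
        if hi == lo:
            scaled.append(0.0)  # constant feature carries no information
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


def one_vs_all_label(category, target):
    """+1 for the target category, -1 for every other category."""
    return 1 if category == target else -1


def to_libsvm_line(label, features):
    """Sparse LIBSVM text format: '<label> <index>:<value> ...', 1-based indices."""
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(label)] + pairs)
```

For one-versus-all training, one such file would be written per target category, with `one_vs_all_label` deciding the sign of each line.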


The Naive Method (NM)

The Naive Method is depicted in figure 4.1. In the first stage, a tweet was categorized as either Other or not. If a tweet was not classified as Other, it was automatically checked to see if it belonged to Hate. If it did not belong to Hate, the method checked if it belonged to Explicit. If the tweet could not be classified as Explicit, it was automatically labeled as Reason. The tweets were classified using a binary method, i.e. each tweet was tested to see if it belonged to: 1. Explicit or not, 2. Hate or not, 3. Reason or not. The same test set was used to estimate the accuracy of each model. The classification with NM revealed three major sources of misclassification:

• tweets of an antagonistic character that are not related to the mobile operators,

• tweets related to mobile issues or the carriers but not related to hate and its reasons,

• overlapping feature vectors for the categories Hate, Reason and Explicit, leading to misclassification.
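The cascade of binary decisions in NM can be written down compactly. In the sketch below the three classifiers are passed in as plain callables; in the study they would be trained one-versus-all SVM models, whereas the toy rules in the demo are hypothetical stand-ins chosen only to make the example run. The order of the checks follows the description of figure 4.1 in the text.

```python
def naive_method(features, is_other, is_hate, is_explicit):
    """Cascade of binary decisions: Other, then Hate, then Explicit, else Reason."""
    if is_other(features):
        return "Other"
    if is_hate(features):
        return "Hate"
    if is_explicit(features):
        return "Explicit"
    return "Reason"  # fallback label, no separate test at this stage


if __name__ == "__main__":
    # Hypothetical toy rules on token sets, standing in for trained SVM models.
    is_other = lambda toks: "hate" not in toks and "because" not in toks
    is_hate = lambda toks: "hate" in toks and "because" not in toks
    is_explicit = lambda toks: "hate" in toks and "because" in toks

    for toks in ({"hate", "verizon"}, {"hate", "because"}, {"because", "wifi"}, {"work"}):
        print(sorted(toks), "->", naive_method(toks, is_other, is_hate, is_explicit))
```

Since Reason is assigned by elimination, any error made in an earlier stage propagates to it, which is consistent with the overlap problem listed above.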



Figure 4.2: The Partial Timeline Method.


The Partial Timeline Method (PTM)

In a study from 2012, Sun et al. [26] mimicked human labelling to create a filter for classification. This idea was adopted in our study and adapted to the method described in this section. This method is not an extension of NM.

In this study we focused on improving the classification using only the information in the collected tweets. In order to improve the classification and solve the issues stated earlier, several of the internal features mentioned in section 3.1 were investigated: the use of usernames, URLs and time of tweeting.

The schematic representation of PTM can be seen in figure 4.2. The first step of the method identifies a hate tweet or a self-explanatory tweet. It then searches the timeline for proximate tweets that were posted within a one-hour time window and classifies them as Reason or Other. PTM relies on the observation that the majority of the explanations are posted within 30 minutes before or 30 minutes after the hate tweet, see table 5.3.
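The window selection at the core of PTM is simple to express. The sketch below assumes a timeline stored as a list of `(datetime, text)` pairs and a known index of the hate tweet; both the data layout and the function name are assumptions of this example. The selected tweets would then be passed to the Reason-versus-Other classifier, which is not shown.

```python
from datetime import datetime, timedelta


def proximate_tweets(timeline, hate_index, half_window=timedelta(minutes=30)):
    """Return the tweets posted within half_window before or after the hate tweet.

    timeline: list of (datetime, text) pairs; hate_index: position of the hate tweet.
    """
    hate_time = timeline[hate_index][0]
    return [tweet for i, tweet in enumerate(timeline)
            if i != hate_index and abs(tweet[0] - hate_time) <= half_window]


if __name__ == "__main__":
    # Times of day follow the example in table 4.2; the date is hypothetical.
    timeline = [
        (datetime(2015, 1, 1, 3, 52, 3), "Welp only got 40% battery ..."),
        (datetime(2015, 1, 1, 3, 58, 25), "And I hate going to the Verizon store smh."),
        (datetime(2015, 1, 1, 8, 0, 0), "Back to work in the morning after 9 days."),
    ]
    for when, text in proximate_tweets(timeline, hate_index=1):
        print(when.time(), text)
```

Only the tweets inside the window reach the classifier, which is where the savings in time and memory come from, and also why explanations posted later in the day are missed.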


Chapter 5

Results and Discussion



This study showed that the classification of hate tweets and their reasons is feasible. The two proposed methods (see section 4) were evaluated in terms of accuracy and F-score. The accuracy of NM and PTM for each category is presented in table 5.1. As can be seen from the table, PTM generally performs better than NM. For a sense of context, the accuracy of the proposed methods can be compared to some previous studies, see table 3.1. A direct comparison is difficult, however, because the data sets differ in size and content.

Accuracy may not be a true representation of how reliable the methods are, because the number of tweets is unbalanced across the categories. The precision, recall and F-score can be seen in table 5.2. One value stands out clearly: the precision for NM, category Reason, which is the highest possible value, i.e. 100%. Looking at the precision in the context of recall and F-score, however, the performance of the method is not as good as the precision value alone would suggest. It is important to emphasize that the amount of training and test data used in the experiments was comparatively small. The main reason was the time-consuming process of manually labeling the data. Instead of collecting more data, this project focused on developing the classification scheme.

Table 5.1 shows that PTM is more accurate than NM. In addition, it saves time and memory by analysing fewer tweets. However, PTM neglects the tweets posted outside the one-hour time window. This means that choosing between the proposed methods implies a trade-off between relevance and completeness.



Table 5.1: The results of the experiments.

Method, category | Accuracy
NM, Other | 0.77
NM, Explicit | 0.88
NM, Hate | 0.66
NM, Reason | 0.78
PTM, Hate+Explicit | 0.98
PTM, Reason | 0.87

Method, category | Recall | Precision | F-score
NM, Other | 0.78 | 0.95 | 0.86
NM, Explicit | 0.79 | 0.54 | 0.86
NM, Hate | 0.93 | 0.48 | 0.63
NM, Reason | 0.53 | 1.00 | 0.69
PTM, Hate+Explicit | 0.96 | 0.66 | 0.79
PTM, Reason | 0.84 | 0.76 | 0.80

Table 5.2: Recall, precision and F-score
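To make the relation between the three columns of table 5.2 concrete, the sketch below recomputes the F-score as the harmonic mean of precision and recall, assuming the balanced F1 definition (the study does not state which variant was used). Since the inputs are the rounded table values, small deviations are expected. With these inputs, four of the six rows reproduce the reported F-score; PTM, Hate+Explicit comes out 0.01 lower, and NM, Explicit comes out at 0.64 rather than the reported 0.86. The source does not explain the difference.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score), copied from table 5.2.
TABLE_5_2 = {
    "NM, Other": (0.78, 0.95, 0.86),
    "NM, Explicit": (0.79, 0.54, 0.86),
    "NM, Hate": (0.93, 0.48, 0.63),
    "NM, Reason": (0.53, 1.00, 0.69),
    "PTM, Hate+Explicit": (0.96, 0.66, 0.79),
    "PTM, Reason": (0.84, 0.76, 0.80),
}

if __name__ == "__main__":
    for name, (recall, precision, reported) in TABLE_5_2.items():
        print(f"{name}: recomputed {f_score(precision, recall):.2f}, reported {reported:.2f}")
```

The NM, Reason row shows why precision alone is misleading here: perfect precision combined with a recall of 0.53 still gives an F-score below 0.7.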

Reason | within 30 min | within 5 min | within 1 min
Stated before | 77.5% | 62.5% | 45%
Stated after | 70.3% | 51.6% | 27.5%

Table 5.3: The reasons posted before and after a hate tweet were investigated from a temporal perspective, i.e. we calculated the percentage of the tweets belonging to Reason posted 30 minutes, 5 minutes and 1 minute before or after the posting of a hate tweet.


PTM is based on the observation that the majority of the reasons were posted within five minutes of the hate tweet: 62.5% of the reasons stated before the hate tweet and 51.6% of the reasons stated after it fell within this window. Within a time window of half an hour, the corresponding shares were 77.5% and 70.3%. These results are summarized in table 5.3. Notice that the windows are cumulative: the tweets posted within one minute are included in the tweets posted within five minutes, and all of these are included in the tweets posted within 30 minutes.


Category | Time interval | Percentage of tweets (%)
Triggering | 06:01 am - 12:00 pm | 30
Triggering | 12:01 pm - 06:00 pm | 25
Triggering | 06:01 pm - 12:00 am | 35
Triggering | 12:01 am - 06:00 am | 10
Justification | 06:01 am - 12:00 pm | 40.1
Justification | 12:01 pm - 06:00 pm | 25.3
Justification | 06:01 pm - 12:00 am | 29.1
Justification | 12:01 am - 06:00 am | 5.5

Table 5.4: Percentage of tweets per time interval

Bigram | Percentage (%)
“my phone” | 20
“customer service” | 5
“unlimited data” | 5
“service sucks” | 5
“phone bill” | 5
“my data” | 5

Table 5.5: Most frequent bigrams present in tweets classified as Reason. The bigrams that are not presented in this table had a prevalence lower than 5%.



In order to see whether the tendency to post triggering and justificatory tweets varied over the course of the day, we analysed the tweets by dividing the day into four six-hour periods. According to table 5.4, the tweets were distributed relatively evenly over the day. However, we could observe a clear difference in the tendency to post justificatory versus triggering reasons: justificatory reasons were posted in 68 cases out of 100. One possible explanation is that the users posting hate tweets were contacted by other users and/or the company in question regarding the reason for the hate tweet. For instance, we observed that the mobile operators AT&T, Verizon and Sprint interact with the customers who have expressed hate against them.

One of the biggest challenges in this work was the time-consuming manual labeling of the training and test data. However, even in the absence of statistically significant results, it is possible to explore aspects of the methodology such as the choice of class labels, the vector features and the sequence of steps by which the classification is carried out. It is important to emphasize that since the data set analysed in this study is limited, see Appendix A.1, one should be cautious when drawing conclusions.


Chapter 6

Future Work

For future work there are a number of technical improvements that can be made:

• expand the data set.

A larger data set would reduce the risk of overfitting and improve the statistical significance of the results.

• utilize the information provided by Twitter.

As discussed in section 3, the classification can be improved by expanding the feature space with profile information, list of followers, hashtags, geolocation etc.

• create denser feature vectors.

Methods for creating a denser feature space (see [19], Chapter 4) might also be useful for reducing the risk of overfitting and improving the classification.

• conduct sentiment analysis.

Sentiment analysis as well as the syntactic and semantic specifics of the tweets could help to identify hate tweets and associated reasons.

• crawl external sources.

One possible improvement of the study is to enrich the feature space by adding synonyms from other websites, such as Facebook, forums and mobile operators’ homepages.

The study could also be improved by solving the problem of the overlapping feature spaces for the categories Hate, Explicit and Reason, see section 4. One suggestion is to redefine the categories, for instance by creating one class whose feature space consists of the feature spaces of these three classes. Then, based on semantic analysis, hate tweets could be identified, as they probably have the highest rate of negative emotions per tweet length. Another suggestion is to regard Explicit and Reason as one class because both explain the underlying cause of hate.

It would be interesting to study whether we could find the reason for the dissatisfaction of a customer even if he or she has not provided any reason on the timeline. One possible approach is to use the User Similarity Model [34]. The model says that the users A and B are more closely related than the users B and C if the overlap in the topics posted on their timelines is greater for A and B than for B and C. For related users it might be possible to predict the reason for a hate tweet posted by the user B by looking at the reason posted by A.
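The comparison of topic overlap described above could be prototyped as follows. Note that the measure used here, the Jaccard index on sets of topics, is an assumption of this sketch and not necessarily the measure defined in [34]; the user names and topics are invented for illustration.

```python
def topic_overlap(topics_a, topics_b):
    """Jaccard index of two topic collections: shared topics / all topics."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar_user(target_topics, other_users):
    """Return the name of the user whose topics overlap most with the target.

    other_users: dict mapping user name -> iterable of topics.
    """
    return max(other_users, key=lambda name: topic_overlap(target_topics, other_users[name]))


if __name__ == "__main__":
    user_b = {"phones", "music", "travel"}
    others = {"A": {"phones", "football", "music"}, "C": {"cooking"}}
    print(most_similar_user(user_b, others))
```

Under this measure, B would borrow a candidate reason from A, the user with whom it shares the largest fraction of topics.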


Appendix A


Collected data set

The number of collected tweets is presented in table A.1. The number of tweets in the category Other is fairly high compared to the remaining categories. The content of Other varies a lot, and therefore more tweets were collected in order to make this category more predictable.



It is difficult to judge a word’s significance for a class just by looking at the number of its occurrences. Some words, e.g. articles and prepositions, appear frequently in all the classes without adding any value to the analysis. Therefore, AIC was used to quantify the importance of each word. Every word w was tested for each class A against two models: an independent and a dependent model. In the following, these will be denoted as IM and DM respectively. If w is representative for class A, its AIC value for the DM is less than the AIC value for the IM, and vice versa. The model with the lower AIC is the one whose relative distance to the unknown truth is shortest.

The first step in the modeling of the IM and DM was to count the occurrences of each word w in the classified tweets. Many words did not appear more than once across all the tweets, so the data were sparse. This fact will be used later in the SVM analysis. The information about each word is summarized in table A.2, where the notation class ¬A stands for not class A, meaning all the remaining classes, and ¬w means not w. Here $n_{11}$ is the number of tweets that belong to the target class A and contain the word w; $n_{12}$ is the number of tweets outside the target class that contain w; $n_{21}$ is the number of tweets in the target class that do not contain w; and $n_{22}$ is the number of tweets in the other classes that do not contain w.

Table A.1: The number of collected tweets per category.

Category | Number of tweets
Other | 1000
Hate | 132
Reason | 132
Explicit | 220

Table A.2: Occurrence of a word w

Token | class A | class ¬A
w | $n_{11}$ | $n_{12}$
¬w | $n_{21}$ | $n_{22}$

Table A.3: Independent model

Token | class A | class ¬A
w | pq | (1 − p)q
¬w | p(1 − q) | (1 − p)(1 − q)

The number of free parameters is two: $n_{11}$ and $n_{12}$.

For readability, the following notation is introduced: $N = n_{11} + n_{12} + n_{21} + n_{22}$, $h = n_{11} + n_{12}$ and $k = n_{11} + n_{21}$.

Based on the word and class occurrences, the probability of each word for a specific class was calculated. The probability of class A is p and the probability of the word to appear somewhere in training data is q. The probability of each class is known and equals 1/4, therefore the probability of class A is always 1/4 and the probability of not class A is 3/4. Nevertheless, in order to preserve the generality, the notation p will be used.

$$P(A) = p = \frac{k}{N}, \qquad P(w) = q = \frac{h}{N} \tag{A.1}$$

The assumption that p and q are independent leads to the derivation of the IM with two free parameters. The joint probabilities of the IM are presented in table A.3.

The events presented in table A.3 are considered to be independent and therefore their joint probability P is

$$P = p^{k} q^{h} (1 - p)^{N-k} (1 - q)^{N-h} \tag{A.2}$$

Log-likelihood, L, for the IM is:

$$L = k \ln p + h \ln q + (N - k) \ln (1 - p) + (N - h) \ln (1 - q) \tag{A.3}$$

To find the maximized log-likelihood (MLL) for the IM with respect to p and q, the following conditions have to be satisfied:

$$\frac{\partial L}{\partial p} = \frac{k}{p} - \frac{N - k}{1 - p} = 0 \tag{A.4}$$

$$\frac{\partial L}{\partial q} = \frac{h}{q} - \frac{N - h}{1 - q} = 0 \tag{A.5}$$

Eq. A.4 and Eq. A.5 lead to Eq. A.1. Insertion of Eq. A.1 into Eq. A.3 gives the MLL:

$$L_{max} = h \ln h + k \ln k + (N - h) \ln (N - h) + (N - k) \ln (N - k) - 2N \ln N. \tag{A.6}$$

Finally, from Eq. 2.1 and Eq. A.6 the AIC for the IM is

$$AIC_{IM} = -2 L_{max} + 2 \cdot 2 \tag{A.7}$$

Table A.4: Dependent model

Token | class A | class ¬A
w | $p_{11}$ | $p_{12}$
¬w | $p_{21}$ | $p_{22}$

A similar derivation of the AIC applies to the DM. The outline of the model is presented in table A.4. The notation is the following: $p_{11}$ is the probability of w appearing in A, $p_{12}$ is the probability of w appearing in other classes, $p_{21}$ is the probability of not observing w in A and, lastly, $p_{22}$ is the probability of not observing w in other classes. Notice that $p_{22}$ can be expressed as $p_{22} = 1 - p_{11} - p_{12} - p_{21}$. This means that the number of free parameters is 3. The joint probability for the DM is

$$P = p_{11}^{n_{11}} \, p_{12}^{n_{12}} \, p_{21}^{n_{21}} \, p_{22}^{n_{22}}. \tag{A.8}$$

Similarly to Eq. A.3, the log-likelihood of the events in table A.4 is

$$L = n_{11} \ln p_{11} + n_{12} \ln p_{12} + n_{21} \ln p_{21} + n_{22} \ln p_{22}. \tag{A.9}$$

The log-likelihood for the case when w appears in the target class A is maximized when

$$\frac{\partial L}{\partial p_{11}} = \frac{n_{11}}{p_{11}} - \frac{n_{22}}{p_{22}} = 0 \quad \text{or} \quad \frac{n_{11}}{p_{11}} = \frac{n_{22}}{p_{22}} \tag{A.10}$$

By the same argument for $p_{12}$ and $p_{21}$,

$$\frac{n_{12}}{p_{12}} = \frac{n_{21}}{p_{21}} = \frac{n_{22}}{p_{22}} \tag{A.11}$$

so all four ratios are equal and can be set to a common constant c. Now the counts can be expressed as

$$n_{11} = c\,p_{11}, \quad n_{12} = c\,p_{12}, \quad n_{21} = c\,p_{21}, \quad n_{22} = c\,p_{22}. \tag{A.12}$$

Summing up all the counts gives

$$n_{11} + n_{12} + n_{21} + n_{22} = c\,(p_{11} + p_{12} + p_{21} + p_{22}) = c. \tag{A.13}$$

This also means that c = N and

$$p_{11} = \frac{n_{11}}{N}, \quad p_{12} = \frac{n_{12}}{N}, \quad p_{21} = \frac{n_{21}}{N}, \quad p_{22} = \frac{n_{22}}{N}. \tag{A.14}$$

The insertion of Eq. A.14 into Eq. A.9 gives the MLL of the DM:

$$L_{max} = n_{11} \ln n_{11} + n_{12} \ln n_{12} + n_{21} \ln n_{21} + n_{22} \ln n_{22} - N \ln N \tag{A.15}$$

Finally, from Eq. 2.1 and Eq. A.15 the AIC of the DM was derived:

$$AIC_{DM} = -2 L_{max} + 2 \cdot 3 \tag{A.16}$$


