
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Engineering

2020 | LIU-IDA/LITH-EX-A--20/053--SE

Multilingual identification of offensive content in social media

Marc Pàmies Massip

Supervisors: Emily Öhman and Jörg Tiedemann
Examiner: Marco Kuhlmann



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

In today’s society there are a large number of social media users who are free to express their opinion on shared platforms. The socio-cultural differences between the people behind those accounts (in terms of ethnicity, gender, sexual orientation, religion, politics, . . . ) give rise to a significant share of online discussions that make use of offensive language, which often negatively affects the psychological well-being of the victims. In order to address the problem, the endless stream of user-generated content creates a need for an accurate and scalable solution that detects offensive language using automated methods. This thesis explores different approaches to the offensiveness detection task, focusing on five different languages: Arabic, Danish, English, Greek and Turkish. The results obtained using Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and the Bidirectional Encoder Representations from Transformers (BERT) are compared, achieving state-of-the-art results with some of the methods tested. The effect of the embeddings used, the dataset size, the class imbalance percentage and the addition of sentiment features are studied and analysed, as well as the cross-lingual capabilities of pre-trained multilingual models.


Acknowledgements

First of all, I want to express my sincere gratitude to Jörg Tiedemann for giving me the opportunity to do this master’s thesis as a visiting student at the University of Helsinki. Moreover, the weekly feedback and advice received in the seminars of the Language Technology Group were very helpful.

I am also especially grateful to my supervisor Emily Öhman for her unconditional support throughout the entire project. I really appreciate all the counsel and kindness I received during my stay in Helsinki, as well as her readiness to help with any aspect of the thesis. I would also like to thank my examiner from Linköping University, associate professor Marco Kuhlmann, for all the guidance regarding the formalities of the project as well as the constructive feedback on the thesis drafts.

Lastly, I would like to dedicate a few words to Timo Honkela, who sadly passed away during the development of this work. Timo is the reason why I came to Helsinki in the first place, since he happily agreed to collaborate with me despite his illness. The unfortunate circumstances only allowed me to meet him briefly, but it was enough to realize that Timo was an extremely charming and truly inspiring person. May his soul rest in peace.


Contents

1 Introduction
   1.1 Problem definition and motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations
   1.5 Structure of the work

2 Theory
   2.1 Natural Language Processing
      2.1.1 Text classification
      2.1.2 Sentiment analysis
   2.2 Text representation
      2.2.1 Word count vectors
      2.2.2 TF-IDF vectors
      2.2.3 Neural embeddings
   2.3 Models for text classification
      2.3.1 Support Vector Machine
      2.3.2 Convolutional Neural Network
      2.3.3 BERT
   2.4 Evaluation Measures
      2.4.1 Accuracy, precision and recall
      2.4.2 F1 score
      2.4.3 Area Under the Curve

3 Related work

4 Method
   4.1 Hardware and software
   4.2 Data
      4.2.1 OLID
      4.2.2 Class imbalance
      4.2.3 Pre-processing steps
   4.3 Traditional machine learning approach
   4.4 Deep learning approach
   4.5 Transfer learning approach
   4.6 Evaluation

5 Results
   5.1 Traditional machine learning approach
   5.2 Deep learning approach
   5.3 Transfer learning approach

6 Analysis and discussion
   6.1 Results
      6.1.1 Traditional machine learning approach
      6.1.2 Deep learning approach
      6.1.3 Transfer learning approach
      6.1.4 Comparison to state-of-the-art
   6.2 Method
      6.2.1 Self-critical stance
      6.2.2 Replicability, reliability and validity
      6.2.3 Source criticism
   6.3 The work in a wider context

7 Conclusion
   7.1 Summary and critical reflection
   7.2 Future work


1 Introduction

1.1 Problem definition and motivation

This thesis work addresses the problem of offensive language identification (e.g. detrimental, obscene and demeaning messages) in the microblogging sphere, with special focus on five different languages: Arabic, Danish, English, Greek and Turkish.

The number of social media users reached 3.5 billion in 2020, and growth is not expected to diminish in the following years1. In such an interconnected world, where an average of 6,000 tweets are generated every second2, it seems inevitable that some users promote offensive language, taking advantage of the anonymity provided by social media sites. According to a study from 2014, 67% of social media users had been exposed to online hate and 21% acknowledged having been its target [48]. Consequences of repeated exposure to this material include desensitization to verbal violence and an increase in outgroup prejudice [69]. The proliferation of hateful speech on the internet has not gone unnoticed by those offering social networking services (SNS) [60]. Nowadays, any company hosting user-generated content has the arduous task of penalizing the use of offensive language without compromising the users’ right to freedom of speech [70]. The usual approach is to forbid any form of hate speech in the terms of service and censor inappropriate posts that have been reported by users, but companies like Facebook or Twitter have still been criticized for not doing enough3. The criticism received in recent years has forced SNS providers to take a more active role as moderators, but the vast amount of data generated by online communities forces them to automate the task. The employment of human moderators is no longer an option since it is costly and highly time consuming. Moreover, the final outcome is inevitably subject to the moderator’s notion of offensiveness, even if they have received proper guidelines and training beforehand [78]. In addition, the fact that hate speech spreads faster than regular speech in online channels [41] implies that solutions that respond in a timely fashion are required.

1 https://www.statista.com/topics/1164/social-networks/

2 https://www.internetlivestats.com/twitter-statistics/


All this generates a need to find an accurate and scalable solution that solves the problem of offensive language detection using automated methods. As a result, in recent years the Natural Language Processing community has become increasingly interested in this field.

1.2 Aim

The purpose of this thesis project is to investigate and evaluate different solutions to the problem of offensive language detection and categorization on the microblogging site Twitter. The task will be framed as a supervised learning problem, and several configurations of traditional machine learning (Support Vector Machine), deep learning (Convolutional Neural Networks) and transfer learning (BERT) techniques will be tested. The goal is to evaluate the importance of certain features, explore different ways of vectorizing text and experiment with some of the most promising models from the literature.

The final output should be a stand-alone system able to identify offensive tweets with both high precision and recall. The final implementation serves as a proof-of-concept.

1.3 Research questions

The experiments developed throughout this project aim to answer the following questions:

1. Can sentiment analysis boost the performance of an offensive language classifier? It is safe to assume that there is a relation between emotion and offensive language in social media, since offensive posts tend to present a negative polarity. A way to study this relation would be to analyse the impact of features that carry some type of sentiment information, or alternatively to incorporate an additional step for polarity classification.

2. Are subword-level approaches better than word-level approaches?

Modern NLP approaches use pre-trained word embeddings to capture the semantics of text in a machine-friendly representation, but the choice of embeddings differs from problem to problem. The unorthodox writing style in online communities gives rise to many out-of-vocabulary words when the noisy text is tokenized into word units. This suggests that subword-level embeddings might perform better than word-level embeddings since they are capable of handling spelling variations of words. A deep learning model will be fed with different types of embeddings to get some insight about this topic.

3. How good are multilingual neural language models at cross-lingual model transfer? Nowadays there are several publicly available pre-trained models that claim to obtain good results in a long list of languages. The multilingual corpus used for training gives rise to a single shared vocabulary that makes these models well suited for zero-shot learning. By fine-tuning a multilingual model in a language other than the one used for testing, it should be possible to gain some understanding about its capacity to generalize information across languages.

4. Is the task of offensive language detection equally challenging in all languages? The different structural properties of languages at a phonological, grammatical and lexical level may make offensive language easier to detect in some of them. Besides, the wide variety of tools at the disposal of high-resource languages can play against less popular ones. This work will compare the results obtained in five different languages: Arabic, Danish, English, Greek and Turkish.


1.4 Delimitations

A significant delimitation is the fact that this work is exclusively focused on textual data. Social media posts are quite often accompanied by multimodal information (e.g. images, GIFs, videos, URL links. . . ) which can be crucial to fully understand the underlying message, and therefore ignoring it might deteriorate the final performance of the classifier. Unfortunately, the processing of this type of data is beyond the scope of this project. Information about users (e.g. age, gender, demographics. . . ) will not be considered as it is often unreliable, even though some meta-information has been proven to be predictive in previous work [13].

Moreover, the subjective biases of human annotators might introduce some noisy labels into the training data, since the same post might be considered offensive by some and non-offensive by others. The lack of a standard and universal definition of the term ’offense’ adds ambiguity to the labelling task, and even with a clear definition some background information is sometimes required to correctly interpret the message (e.g. the usage of a word can be offensive or not depending on the interlocutors’ relationship [34]). A study from 2016 highlighted the differences between amateur and expert annotators when labelling a hate speech dataset, and found that the different labelling criteria are reflected in the final classification results [78]. However, we have no choice but to regard the labels from publicly available datasets as absolute truth, since it is beyond the scope of this thesis to annotate datasets. During the evaluation process it might be interesting to pay special attention to those tweets that are often misclassified in order to better understand the limitations of the system, which might be influenced by social biases in the form of noisy labels.

1.5 Structure of the work

The remainder of this work is organized as follows:

Chapter 2 introduces the reader to the topic of offensive language detection, providing all the theoretical background required to fully understand the explanations from future chapters. Chapter 3 presents a literature review of related work in order to show what has been attempted so far by others and the state of maturity of the field at the time of writing.

Then, the method is described in Chapter 4. All the data, pre-processing steps, feature extraction techniques and classifiers that have been used along the process are explained. This detailed description of the work should allow the reader to replicate the experiments, obtaining similar results.

The obtained results are reported in Chapter 5 and later analysed in Chapter 6. The latter also includes a critical discussion of the methodology used and a final section discussing the ethical and societal aspects related to the work.


2 Theory

This chapter presents the theory relevant to the study, introducing the reader to the text classification task and covering the theory behind its main steps.

2.1 Natural Language Processing

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that deals with the interaction between computers and humans using natural language. Its ultimate goal is to make computers understand human language, which is not an easy task due to the imprecise nature of our languages. It requires mastering not only the syntax of text but also its semantics. Since linguistic rules are hard to hand-code, older rule-based approaches have been replaced by machine learning (ML) algorithms that are able to extract such rules from large amounts of data and derive meaning from them.

Among the applications of NLP are machine translators, spell checkers, virtual assistants, chatbots and interactive voice response applications. This project will focus on the sub-field of text classification, which is defined below.

2.1.1 Text classification

Text classification is a supervised machine learning task that amounts to automatically assigning text documents to pre-defined categories based on their content [2]. Classification can be done into two categories (e.g. spam detection) or more (e.g. categorization of customer queries by type) depending on the application. A text classifier requires a labelled dataset for training, so that the underlying algorithm can learn from the labelled examples before making accurate predictions. However, before feeding examples to the system, it is necessary to transform the raw text into a form that can be interpreted by the machine. After this preliminary step, called feature extraction, the model can start the learning process, aiming to recognize patterns that will later be used to make the classification decisions.

The problem of offensive language detection is nothing more than a specific application of text classification, with some peculiarities that make it especially challenging [64].


2.1.2 Sentiment analysis

Sentiment analysis, also known as opinion mining in academia, is an application of text classification and one of the most active research areas in the field of computational linguistics and NLP [39]. It is basically the automated process of analysing the sentiment that lies underneath a series of words in order to classify a text as either positive, negative or neutral, although a more fine-grained classification is also possible [57]. It is important to note the difference between sentiments, which are limited to a single dimension (polarity), and emotions, which also capture intensity and offer a more detailed level of analysis [19].

This field of study has become very active in the last 20 years, mainly because the emotional content of text is an important part of language. In the case of social media, emojis are widely used to express feelings, which is why they are considered a meaningful feature for sentiment analysis tasks [31]. When detecting offensive language on Twitter, taking the sentiment of tweets into consideration is expected to improve the classification results because of the relation between offensive language and sentiment. Common sense suggests that offensive posts should carry a more negative sentiment as they tend to contain strong emotions like anger or frustration.
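As an illustration of how such sentiment information could be extracted, the short sketch below uses NLTK's VADER analyzer to compute a polarity score for a tweet. This is only a generic example: VADER is English-only, it is not necessarily the sentiment component used in this thesis, and the example tweet is invented.

```python
# Hedged sketch: deriving a sentiment polarity feature for a tweet with NLTK's
# VADER analyzer. The tweet text is made up for illustration purposes.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
tweet = "@USER this is absolutely disgusting, you should be ashamed"

scores = sia.polarity_scores(tweet)      # keys: 'neg', 'neu', 'pos', 'compound'
sentiment_feature = scores["compound"]   # single value in [-1, 1], usable as an extra feature

print(scores)
```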

2.2 Text representation

Since machines cannot understand words as humans do, it is necessary to convert our natural language to a machine-friendly representation. This section describes some of the most common methods that can be used to convert plain text to machine-readable information. They all map symbolic representations to a vectorized form that can be fed to a neural network.

2.2.1 Word count vectors

Bag-of-words (BoW) is arguably the most basic method when it comes to converting textual data to numeric representations, but it can still lead to competitive results. It consists of generating a vector of word counts for each document (i.e. tweet) and storing them in a matrix where each column is a word and each row represents a document. The word counts stored in its cells are normalized before being fed to a neural network as features.

Among BoW’s limitations are the data sparsity problems in short texts, the high dimensionality of the encoded vectors and the fact that similar entities are not placed closer to each other in the embedding space. Moreover, the ordering of words is ignored. Nonetheless, BoW is still a good option to build an inexpensive baseline model. It is also suited for very specific cases, such as small datasets of highly domain-specific data where most of the words would not appear in the dictionary of a pre-trained word embedding model.

2.2.2 TF-IDF vectors

TF-IDF, which stands for Term Frequency – Inverse Document Frequency, is another way of converting text documents to matrix representations [63]. The TF-IDF model relies on a sparse vector representation which is quite powerful despite its simplicity.

Term frequency (TF) represents the number of occurrences of a word in a document, and is simply computed as the number of times a term t appears in a document d ($n_{t,d}$) divided by the total number of terms in the document (N). So, the term frequency of a term t in a document d ($tf_{t,d}$) is calculated as:

$$tf_{t,d} = \frac{n_{t,d}}{N}$$

On the other hand, Inverse document frequency (IDF) represents the importance of each term and is computed as follows:

$$idf_t = \log \frac{N}{n_t}$$

In this case, the numerator consists of the total number of documents (N) while the denominator is the number of documents that contain the term t ($n_t$). The final TF-IDF value of each term and document is the result of combining the aforementioned formulas as follows:

$$tfidf_{t,d} = tf_{t,d} \cdot idf_t$$

So, instead of simply measuring the frequency of words as the BoW model does, in TF-IDF each word is given a weight that represents its importance to the document it belongs to. This is done by counting the number of occurrences of the word not only in a single document but also in the entire corpus. In the formula above, the term $tf_{t,d}$ assigns high weights to terms that appear repeatedly in a document because they are supposed to be good representatives of it. Similarly, $idf_t$ assigns low weights to words that are present in many documents and high weights to words that appear in fewer documents. This is based on the intuition that distinctive words that appear repeatedly in a limited number of documents are the ones that best represent those documents, and therefore they should be given more importance.

2.2.3 Neural embeddings

Neural embeddings are one of the most popular NLP techniques nowadays. They provide a way to capture the semantics of text in low-dimensional distributed representations, not only considering the words themselves but also their context and the relationships between them. This makes it possible to reflect a word’s meaning in its embedding.

By mapping these numerical representations of words into a vector space it is possible to visualize how words with similar meaning (and thus used in similar contexts) occupy close spatial positions, and vice versa. One of the main benefits of this is that a model can react naturally to a previously unseen word if it has seen semantically similar words during training. Despite being more memory-intensive, on many occasions it is worth using word embeddings as they are more informative than a simple BoW or TF-IDF matrix. Moreover, the fact that they have already been pre-trained on a large corpus allows practitioners to directly use the publicly available dictionaries, saving the time and resources that it would take to train such a model from scratch.

Word2Vec [43] is a predictive embedding model that was published by Google researchers in 2013, but is still in wide use to this day. The algorithm outputs a vector space given a corpus of textual data, capturing the semantic relationships in n-dimensional vectors. It is available in two different architectures: Continuous Bag-of-words (CBOW) or Continuous Skip-gram. None of them takes into account the order of context words, which is one of Word2Vec’s greatest weaknesses in comparison to other models that were later published.

GloVe [53] takes a similar approach to Word2Vec to generate dense vector representations, but instead of extracting meaning using skip-gram or CBOW it is trained on global co-occurrence counts of words. This is based on the idea that some words are more likely to occur alongside certain words, and thus it makes sense to consider the entire corpus when generating the embedding of a word. It allows GloVe to take global context into account, unlike Word2Vec which is exclusively focused on local context.


Both word embeddings (Word2Vec and GloVe) achieve similar empirical results, and they suffer from the same problems: the inability to deal with unknown words and the fact that multi-sense words are always encoded in the same way regardless of their context. The first of those problems can be solved by splitting words into a bag of character n-grams to be encoded, as subword-level embeddings like fastText do.

fastText [28] is a model developed by the Facebook AI Research group that goes one step further by taking into account the morphology of words, which enables the embeddings to encode sub-word information. Unlike word-level embeddings, in this case a word’s vector is constructed based on its character n-grams, allowing fastText to handle previously unseen words reasonably well. It is important to note that, even if it considers the internal structure of words by splitting them, at the end it still generates one single vector per word. It was used to train 300-dimensional word vectors in 157 languages, which are publicly available at the official website1.
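A small sketch of this subword behaviour is shown below, using gensim's FastText implementation (API names follow recent gensim versions) rather than the official fastText binaries: even a misspelled word that never appears in the toy training data still receives a vector built from its character n-grams. The corpus and parameters are invented for illustration.

```python
# Minimal sketch: character n-grams let fastText produce vectors for
# out-of-vocabulary spelling variants. Toy corpus, illustrative parameters.
from gensim.models import FastText

sentences = [
    ["you", "are", "stupid"],
    ["have", "a", "nice", "day"],
    ["that", "was", "a", "stupid", "idea"],
]

model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=10)

print(model.wv["stupid"][:5])    # vector of a word seen during training
print(model.wv["stupiiid"][:5])  # OOV variant, composed from shared char n-grams
```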

The common problem with all the aforementioned embeddings is that they are context-insensitive, which is a clear deficiency since words can have different connotations based on their surrounding words. In 2018, AllenAI introduced contextual word embeddings to the world with ELMo [54]. The context-sensitive representations of these embeddings, learned from language models, overcame the main limitations of the static embeddings presented before. With ELMo, every vector assigned to a token is sentence-dependent, meaning that the local context is reflected in the instance embeddings by taking the entire input sentence into consideration for every word in it. As a result, even polysemous words (which represent over 40% of the English dictionary [18] [36]) are represented by different vectors according to their context. This translates into a performance boost for downstream tasks, which is why contextual embeddings became so popular in recent years. As a matter of fact, popular models like Google’s BERT [15], OpenAI’s GPT-2 [59] or FacebookAI’s XLM [35] use contextualized word embeddings to obtain more accurate representations.

It is important to note that all the embeddings discussed in this section have different characteristics that make them more or less suited to a specific problem. Choosing the best embedding requires a trial-and-error approach, as no single model always works best.

2.3 Models for text classification

This subsection presents the theory behind the different models that will be used in the practical part of this project, which includes Support Vector Machines (SVM), Convolutional Neural Networks (CNN) and the Bidirectional Encoder Representations from Transformers (BERT).

2.3.1 Support Vector Machine

Support Vector Machine (SVM) is a supervised machine learning algorithm that was first introduced in 1992 [7] and is still widely used today. It is mostly used for classification tasks, although it can also be applied to regression problems. In general, it offers higher accuracy and robustness than other classifiers like Naïve Bayes, logistic regression or decision trees. The main idea behind SVM is to segregate a given dataset by finding a hyperplane that separates all classes in the best possible way. To achieve that, the decision boundary must be as far as possible from any point in the labelled training set. Then, new data points are attributed to one class or another depending on which side of the hyperplane they fall on. This is illustrated in Figure 2.1, where the hyperplane is actually a line since the feature space represented in the image has only two dimensions.


Figure 2.1: Example of decision line in a two-dimensional space. Source: [4]

In order to find the optimal hyperplane, its perpendicular distance to the support vectors (those data points of each class that are closest to the frontier) must be maximized. This distance is known as the margin, and it is maximized by minimizing a hinge loss function in an iterative manner. For this, only the support vectors are taken into consideration, meaning that the removal of other data points does not affect the outcome of the algorithm.

The maximum marginal hyperplane separates classes in a high-dimensional space with as many dimensions as input features. In the case of nonlinear input spaces, where classes cannot be separated by a linear decision boundary, SVM uses the so-called kernel trick to convert the input space to a higher-dimensional space where it is possible to accurately segregate the points using a simple line. Different types of kernels can be used to transform the feature space. For textual data linear kernels are supposed to work best, but polynomial kernels and radial basis function (RBF) kernels are also available.

Some of the main advantages of SVM classifiers are their effectiveness in high-dimensional spaces and their good performance when there is a clear margin of separation between classes. Furthermore, they are memory-efficient as only a subset of the training points (the support vectors) is used in the decision phase. However, they require a long training time for large datasets and perform poorly when the target classes overlap.

A popular paper published in 1998 by T. Joachims studies the application of SVM to text categorization [27]. The theoretical and empirical evidence provided by the author proves that SVMs are a robust method for learning text classifiers from labelled examples.
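As a concrete illustration (not the exact configuration evaluated later in this thesis), a linear SVM text classifier can be assembled in a few lines with scikit-learn by chaining a TF-IDF vectorizer and LinearSVC; the toy tweets and hyperparameters below are placeholders.

```python
# Illustrative sketch: TF-IDF features feeding a linear-kernel SVM.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["you are an idiot", "have a nice day", "what a moron", "lovely weather today"]
train_labels = ["OFF", "NOT", "OFF", "NOT"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram and bigram features
    ("svm", LinearSVC(C=1.0)),                       # linear kernel, default regularization
])

clf.fit(train_texts, train_labels)
print(clf.predict(["such an idiot", "good morning"]))
```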

2.3.2 Convolutional Neural Network

A Convolutional Neural Network (CNN or ConvNet) is a type of deep neural network that was originally built for image analysis but was later found to be quite effective for NLP problems as well. As can be deduced from its name, the main difference from regular multilayer perceptron networks (MLP) is that some of the CNN hidden layers (known as convolutional layers) perform convolution operations instead of matrix multiplication. This is precisely what makes them so good at pattern recognition in images, since the convolutional filters convolve across the input pixels and are able to detect edges, shapes, textures and even specific objects in deeper layers of the network. In the case of text, the artificial neural network deals with word embeddings instead of pixel matrices, which increases the size of the feature space from three channels (the case of RGB images) to as many as the length of the word embeddings.


Figure 2.2 illustrates a particular example of how a CNN processes textual data. It takes as input a matrix of embeddings and performs element-wise products with the elements of different filters that slide over the sentence matrix. The filters act as n-gram feature extractors for embeddings, obtaining 2-grams, 3-grams and 4-grams in the toy example below. Then the feature maps are pooled (i.e. 1-max pooling) and concatenated to form a feature vector that is fed to a fully connected dense layer that performs classification. Typically sigmoid is used for binary classification and softmax for multi-class classification. It is also common to apply non-linear activation functions like ReLU or tanh before the pooling.

Figure 2.2: Toy example with a 7-token sentence and 5-dimensional embeddings. Source: [91]

One of the reasons why CNNs perform well at text classification is that their convolutional and pooling layers allow them to detect salient features regardless of their position in the input text [24]. Moreover, the non-linearity of the network and its ability to model local ordering is supposed to lead to superior results. In the context of hate speech, Gambäck and Sikdar trained four CNN models to classify tweets as sexist, racist, both or neither [22]. Their best model (78.3% F-score) employed CNNs with Word2Vec embeddings, but they also experimented with models trained on character 4-grams, randomly generated word vectors and a combination of word vectors and character n-grams. Other CNN-related works in the field include [5] and [50].
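A minimal Keras sketch of this kind of architecture is shown below: an embedding layer, one convolutional layer acting as a 3-gram feature extractor, 1-max pooling and a sigmoid output for a binary OFF/NOT decision. All sizes are illustrative and do not correspond to the configuration reported later in this thesis.

```python
# Minimal sketch of a CNN text classifier in Keras; all sizes are placeholders.
from tensorflow.keras import layers, models

vocab_size, embedding_dim, max_len = 20000, 300, 60

model = models.Sequential([
    layers.Input(shape=(max_len,)),                 # sequence of token ids
    layers.Embedding(vocab_size, embedding_dim),    # could be initialized with pre-trained vectors
    layers.Conv1D(filters=100, kernel_size=3, activation="relu"),  # 3-gram filters
    layers.GlobalMaxPooling1D(),                    # 1-max pooling over each feature map
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),          # binary OFF/NOT output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```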

2.3.3 BERT

The Bidirectional Encoder Representations from Transformers (BERT) is a deep pre-trained language model that was open-sourced by Google in late 2018 [15]. The bidirectional model soon became very popular due to its state-of-the-art results in several NLP downstream tasks, such as text classification, language inference, entity recognition, paraphrase detection, semantic similarity or question answering. These outstanding results were possible because, unlike prior models, BERT takes full advantage of the bidirectional information of text sequences. This is illustrated by Figure 2.3, which shows how GPT connections only go from left to right while ELMo generates features by concatenating left-to-right and right-to-left LSTMs. In addition, word segmentation is performed using the WordPiece algorithm [65], which initialises a vocabulary with single characters and adds the most frequent combinations of symbols in an iterative manner.

Figure 2.3: BERT, GPT and ELMo pre-training model architectures. Source: [15]

The other reason for BERT’s success has to do with its novel training method. BERT was pre-trained using a vast amount of text data (Wikipedia and the BookCorpus dataset [93]) on two language-based tasks:

• Masked Language Modelling: Fifteen percent of the words in the training corpora are randomly masked and the network is trained to predict them.

• Next Sentence Prediction: The network is trained to predict if two given sentences are coherent together. Consecutive sentences from the training corpora are used as positive examples and randomly selected sentences as negative examples.

After training, BERT can either be used as a high-quality feature extractor, keeping the learned weights fixed, or alternatively be fine-tuned with a relatively small amount of task-specific data. This is possible because the pre-trained weights already contain a lot of information, which greatly reduces the amount of data and training time required to obtain good results. Depending on the training corpora, there are two different models that are worth mentioning:

• English BERT: Pre-trained exclusively on monolingual English data, giving rise to an English-derived vocabulary. Google released another language-specific model for Chinese, and similar models in other languages have been produced by third parties.

• Multilingual BERT: Pre-trained on monolingual corpora in 104 languages, giving rise to a single WordPiece vocabulary that allows the model to share embeddings across languages.

In terms of architecture, BERT is composed of a stack of transformer blocks that act as encoders. It is based on the transformer network [75], where the so-called transformers use attention mechanisms to assign different weights to parts of the input based on their significance. Attention makes the network focus on specific data points, which is especially useful to learn about the context of words. This context is encoded in the WordPiece embeddings, which are passed from one layer to the next, capturing meaning at each stage. The number of self-attention layers is 12 for the Base model and 24 for the Large model, and the length of the embeddings is 768 and 1024 for the Base and Large versions, respectively.
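For orientation only, the sketch below shows one fine-tuning step of multilingual BERT for binary offensive-language classification with the Hugging Face transformers library. The API names follow recent library versions, which may differ from the version used in this project, and the two example tweets are invented.

```python
# Rough sketch of one fine-tuning step with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["@USER have a nice day", "@USER you are an idiot"]
labels = torch.tensor([0, 1])  # 0 = NOT, 1 = OFF

batch = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
outputs = model(**batch, labels=labels)  # forward pass returns loss and logits

outputs.loss.backward()   # one illustrative gradient step
optimizer.step()
optimizer.zero_grad()
print(outputs.logits)
```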


2.4 Evaluation Measures

The aim of this section is to introduce some of the metrics that can be used to evaluate the effectiveness of supervised classifiers. In general, all these metrics provide a performance score that simplifies model comparison and selection during training. Basically, the model that obtains a higher score on previously unseen data is considered to be the best at generalizing, and is thus selected as the most reliable.

Since each metric evaluates different characteristics of the classifier, it is important to select an appropriate one for a proper comparison of machine learning algorithms. Otherwise a suboptimal solution might be selected. In general, the values used to compute these measures are obtained from the confusion matrix (see Table 2.1 below), which displays the correctness of the model in a very intuitive way.

                           Actual Positive Class   Actual Negative Class
Predicted Positive Class   True Positive (TP)      False Positive (FP)
Predicted Negative Class   False Negative (FN)     True Negative (TN)

Table 2.1: Confusion Matrix

2.4.1 Accuracy, precision and recall

A very commonly used measure is accuracy, which provides the ratio of correct predictions over the total number of examined cases:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$$

However, its bias towards the majority class makes it only suitable for classification problems where the target classes are roughly balanced. This is not the case for offensive language detection since, despite the prevalence of offensive comments on social media, these are far from being the majority.

Another option is to use precision, which measures the fraction of predicted positive samples that are actually positive. It is a good choice for problems where it is important to keep the number of false positives down (e.g. spam filtering), and is computed as follows:

$$Precision = \frac{TP}{TP + FP}$$

On the other hand, recall measures the fraction of positive samples that are correctly classified. This makes it appropriate for problems where it is important to capture as many positives as possible (e.g. cancer detection). The formula for recall is displayed below:

$$Recall = \frac{TP}{TP + FN}$$

The problem is that in many applications, such as the one under study, the model is expected to be both precise (high precision) and robust (high recall). In these cases it is necessary to take both measures into account, or a combination of them for a more intuitive interpretation.


2.4.2 F1 score

The F1 score is a value between 0 and 1 computed as the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

Nonetheless, in some situations it might be interesting to assign different weights to precision and recall based on domain knowledge. The reason is that in many problems precision and recall are not equally important, and thus they should be weighted differently. The implications of each prediction error must be reflected in the cost of false negatives and false positives.

It is also important to notice that the formulas presented above are only applicable to binary classification problems. When there are more than two categories, it is possible to define per-class values for precision, recall and F1-score. Then, they can be combined in different ways to obtain the overall precision, recall and F1 scores:

• Macro-averaged score: Arithmetic mean of the per-class scores, giving equal weights to each class. In other words, it does not take class imbalance into account.

• Weighted-averaged score: Weighted average of the per-class scores, giving more importance to the over-represented classes. In this case the class imbalance is considered.

• Micro-averaged score: For precision and recall, it is computed using the regular formulas but considering the total counts of true positives, false positives and false negatives. The micro-averaged F1-score is simply the harmonic mean of the micro-averaged precision and the micro-averaged recall. The resulting values are biased by class frequency.
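The difference between these averaging strategies is easy to see with scikit-learn, as in the toy sketch below; the labels are invented and serve only to show the three options.

```python
# Toy illustration of macro, weighted and micro averaging in scikit-learn.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["NOT", "NOT", "NOT", "OFF", "OFF", "NOT"]
y_pred = ["NOT", "NOT", "OFF", "OFF", "NOT", "NOT"]

for avg in ("macro", "weighted", "micro"):
    print(
        avg,
        round(precision_score(y_true, y_pred, average=avg), 3),
        round(recall_score(y_true, y_pred, average=avg), 3),
        round(f1_score(y_true, y_pred, average=avg), 3),
    )
```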

2.4.3 Area Under the Curve

In the literature there is also an assortment of publications that evaluate their systems with the area under the receiver operating characteristic (ROC) curve, better known as AUROC or simply AUC. The ROC curve displays the performance of a model at every classification threshold. This is achieved by plotting sensitivity (the true positive rate) against 1 − specificity (the false positive rate):

$$Sensitivity = \frac{TP}{TP + FN} \qquad Specificity = \frac{TN}{TN + FP}$$

The AUC algorithm measures the area beneath the ROC curve, providing a score between 0 and 1 that represents the probability of the model ranking a randomly selected positive example higher than a randomly selected negative example. This means that the higher the better, similarly to what happens with the other measures described in this section. The AUC is an effective measure for binary classification problems where both classes are equally important. However, for problems with high class imbalance, it is better to compute the area under the precision-recall curve (PRC) to give special focus to the minority class. Both measures (ROC-AUC and PRC-AUC) are not as well suited for comparing classifiers as they are for evaluating a single classifier.
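Both quantities can be computed from the classifier's scores, for instance as in the short sketch below (toy scores, scikit-learn API).

```python
# Toy sketch: ROC-AUC and the area under the precision-recall curve.
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

y_true = [0, 0, 1, 1, 0, 1]                 # 1 = positive (offensive) class
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]   # model confidence for the positive class

print("ROC-AUC:", roc_auc_score(y_true, y_score))

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("PRC-AUC:", auc(recall, precision))
```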


3 Related work

One of the first works related to the subject under study was published in 2009, in which a supervised classification technique for harassment detection in chat rooms and discussion forums was proposed [85]. Their early experiments showed that the performance of a harassment classifier can be improved by adding sentiment and contextual features to the model, something that was utilised in some of the works that followed.

A few years later, Warner and Hirschberg published a paper where they provided their own definition of hate speech, with a special focus on anti-Semitism [77]. They realized that, depending on the target group, hate speech is characterized by different high-frequency stereotypical words that can be used either in a positive or negative sense. This makes the hate-speech problem very similar to the one of Word Sense Disambiguation (WSD) [84], which is why the authors used WSD techniques to generate features such as the polarity of words. After collecting and labelling their own corpus, they trained a Support Vector Machine (SVM) classifier that achieved an accuracy of 94% and an F1 score of 63.75%. The most indicative features of their model were single words.

The aforementioned work motivated the writing of another paper in 2013, which highlighted the importance of context to identify anti-black racism [34]. According to the publication, 86% of racist tweets were categorized as such simply because they contained some kind of offensive words. This is why Kwok and Wang decided to use a Bag-of-words (BoW) model, which proved to be insufficient because unigram features alone are not able to capture the relationship between words. This leads to a high number of misclassified tweets that contain terms that are likely to appear in racist posts, even if these words are not racist at all in many contexts (e.g. the words ’black’ or ’white’). Their Naïve Bayes classifier achieved an accuracy of 76% on a balanced dataset, a percentage that the authors believe could be improved by adding bi-grams, WSD and sentiment analysis to the algorithm.

At this early stage of the offensive language detection field, the use of predefined blacklists was a common approach [23]. The main problem of this old-fashioned approach is that tweets with offensive words are easily misclassified as hateful, which leads to a high false positive rate caused by the prevalence of curse words on social media [76]. Moreover, list-based methods struggle to detect offensive posts that have no blacklisted terms. This proves that this classification task requires some deeper understanding, because even blacklisted terms might not be offensive in the right context, as was already noted by Warner and Hirschberg [77]. For instance, the word ’nigga’ can be used in a friendly way among African Americans and at the same time be considered offensive in other situations. Furthermore, lexical-based methods struggle to accurately detect obfuscated profanity (e.g. ’a$$hole’) as it is unattainable to add all possible variants of slurs to the dictionary, which is why it is a good idea to incorporate edit distance metrics into the system [68].

In order to overcome the limitations of list-based methods, Chen et al. [10] used character n-grams in combination with other lexical and dependency parse features, along with automatically derived blacklists. Their main contribution was the so-called Lexical Syntactic Feature-based (LSF) language model, implemented as a client-side application that filters out inappropriate material to protect adolescent social media users. The proposed method, trained on YouTube comments from over 2M users, achieved a 96.25% F1 score in sentence-level offensiveness detection, outperforming all the n-gram models that were used as baselines. Unlike most contemporary works, their tool takes into account the author of the content (i.e. writing style, posting patterns. . . ) to identify not only offensive content but also potentially offensive users. In the task of user-level detection, the LSF framework achieved a 77.85% F1 score.

Other researchers also realized the limitations of BoW-based representations, such as the fact that they do not take into account syntax and semantics, or the high dimensionality and large sparsity problems caused by obfuscations. In order to address these issues, Djuric et al. proposed a paragraph2vec [37] approach that learns distributed low-dimensional representations of user comments using neural language models [6]. Feeding this to a logistic regression classifier and evaluating it on the largest hate speech dataset available at the time, the results showed that the proposed method was not only more efficient but also better than BoW models in terms of AUC scores [16].

One year later, researchers from Yahoo Labs implemented a supervised classification method that obtained better results when evaluated on the exact same dataset [46]. Nobata et al. claim to have used a more sophisticated technique to learn the low-dimensional representation of comments as well as some additional features. They experimented with a wide range of NLP features (n-grams, linguistic features, syntactic features and distributional semantics features) and after evaluating the impact of each individual feature they obtained promising results by combining all of them. However, token and character n-grams alone produced similar results, even outperforming the ones from Djuric et al. [16]. Apart from that, the authors made available a dataset formed by thousands of comments from Yahoo! users that were labelled as hate speech, derogatory language, profanity or none of them.

Other researchers also made the effort of labelling entire datasets and made them public afterwards so that the community could have shared data to objectively compare their results. A good example is the work of Waseem and Hovy [80], who annotated an unbalanced corpus of over 16,000 tweets as either racist, sexist or normal. In addition to that, they provide a dictionary with the most indicative words of the dataset and a bullet list for hate speech identification that can be used by others to gather more data. They also studied the impact of combining character n-grams with extra-linguistic features and found out that gender is the only demographic information that significantly improves performance. Other researchers also took advantage of gender information to improve classification [13], being aware that this type of user-related information is often unreliable or even unavailable on social media. Most of the features used throughout the years for offensive language detection are included in the survey carried out by Schmidt and Wiegand in 2017, where several state-of-the-art models were analysed with special focus on feature extraction [64].


In 2017 more modern approaches were introduced, with Badjatiya et al. being the first to use deep neural network architectures for hate speech detection [5]. Their proposed solution outperformed existing methods by 18 F1 points when evaluated on the dataset provided by [80]. The authors trained several classifiers using task-specific embeddings learned with CNNs, LSTMs and fastText [28], and obtained the best results when combining these embeddings with Gradient Boosted Decision Trees. Interestingly enough, their best system randomly initialized the embeddings instead of using GloVe pre-trained word embeddings [53].

Another line of research put special effort into differentiating between hate speech and other instances of offensive language, which are very often conflated in the literature. It is important to distinguish them since the former is considered a much more serious infraction that can even have legal implications, and thus should not be confused with ordinary offensive posts. In 2017, Davidson et al. retrieved 24,802 tweets containing words compiled by Hatebase.org1 and labelled them as hate speech, offensive language or neither [14]. They soon realized that the Hatebase lexicon is not accurate enough, since only 5% of the tweets were labelled as hate speech by their annotators, which is why the authors provide a reduced version that is supposed to have higher precision. Their multi-class classifier obtained an F1 score of 90%, proving that fine-grained labels are better for hate speech detection. However, the confusion matrix showed that nearly 40% of hate speech was misclassified and that the model was biased towards the ’neither’ class. Their conclusions go further by stating that racist and homophobic tweets are more likely to be correctly classified as hate speech, while sexist tweets are often classified as offensive.

With regard to classifiers, some of the algorithms that can be found in the literature are Random Forest [8], Logistic Regression [14] and Support Vector Machine [40], as well as deep learning approaches like Convolutional Neural Networks [22] or Convolutional-GRU [92]. However, in 2018 the introduction of deep pre-trained language models like ELMo [54], ULMFiT [25], Open-GPT [59] and BERT [15] triggered a shift in the approaches taken in the field. These novel models obtained state-of-the-art results in several NLP downstream tasks, text classification being one of them. In particular, BERT [15] stood above the rest for being deeply bidirectional and using the novel self-attention layers from the transformer model [75], which allows it to better interpret a word’s context. Moreover, it uses WordPiece embeddings [65] instead of the common character- or word-based approaches, and it is trained with a self-supervised objective. The bidirectional model can be conveniently fine-tuned with a small amount of task-specific data and offer excellent performance. The results published in the shared task OffensEval 2019 [88] proved that BERT is well suited for the offensive language detection task, since it yielded successful results for the teams that used it. In fact, six of the top-10 ranked teams in the offensive identification task used Google’s model for their submissions.

Apart from OffensEval, which has already been held twice, other shared tasks like GermEval [81] and TRAC-1 [32] are worth mentioning. Also, workshops dealing with offensive language, such as TRAC [33], TA-COS [38] or ALW1 [79], have become more prevalent in recent years.

As for languages, as is usually the case, most of the work that can be found in the literature is focused on English. However, some researchers have also investigated less popular languages such as Greek [51], Arabic [44], Slovene [21] and Chinese [72]. One of the reasons for this shortage of non-English works might be that most of the publicly available datasets are currently in English [80] [46] [14], but this might change in the following years thanks to the emergence of multilingual shared tasks like OffensEval 2020 [89].


4 Method

4.1 Hardware and software

All the implementations were performed on a 64-bit Windows 10 machine with 16GB of RAM and 2 CPU cores. However, the experiments were run on GPUs from either Google Colaboratory1 or Puhti2, a supercomputer from the Finnish IT Center for Science (CSC). The use of this high-performance hardware greatly reduced the training time.

As for the software, all code was written in Python 3.6. This programming language offers a wide range of open-source libraries for scientific computing (e.g. numpy), data manipulation (e.g. pandas), data visualization (e.g. matplotlib, seaborn) and natural language processing (e.g. nltk, gensim), among others.

Apart from the aforementioned libraries, the following were used in different situations:

• The implementations of Naïve Bayes, Support Vector Machine and Random Forest were done with the popular ML library scikit-learn [52].

• The Convolutional Neural Network was implemented in Keras, a high-level library that runs on top of TensorFlow [1].

• To fine-tune different BERT models, the transformers package from the huggingface PyTorch library was used [82].

• BERT models further pre-trained on monolingual data in Arabic3, Danish4, Greek5 and Turkish6 were obtained from publicly available repositories.

1 https://colab.research.google.com
2 https://docs.csc.fi/computing/system/
3 https://github.com/alisafaya/Arabic-BERT
4 https://github.com/botxo/nordic_bert
5 https://github.com/nlpaueb/greek-bert
6 https://github.com/stefan-it/turkish-bert


4.2 Data

4.2.1 OLID

The main data used for this project is the so called OLID dataset, which stands for Offensive Language Identification Dataset [87]. It was originally provided by the organizers of the OffensEval shared task [88] [89], which consists of the following sub-tasks:

A) Offensive Language Identification: whether a tweet is offensive or not.

B) Categorization of Offense Types: whether an offensive tweet is targeted or untargeted.

C) Offense Target Identification: whether a targeted offensive tweet is directed towards an individual, a group or otherwise.

The different sub-tasks all shared the same dataset, which was annotated according to a three-level hierarchical model so that each sub-task could use a subset of the previous sub-task’s data. First, all tweets were labelled as either offensive (OFF) or not offensive (NOT). Then, for sub-task B, all the offensive tweets were labelled as targeted (TIN) or untargeted insults (UNT). And finally, for the last sub-task, the third level of the hierarchy labelled targeted insults based on who was the recipient of the offense: an individual (IND), a group (GRP) or a different kind of entity (OTH). To illustrate this, Table 4.1 shows the label distribution of the English dataset from the 2019 edition.

A    B    C      Training   Test    Total
OFF  TIN  IND       2,407    100    2,507
OFF  TIN  GRP       1,074     78    1,152
OFF  TIN  OTH         395     35      430
OFF  UNT  -           524     27      551
NOT  -    -         8,840    620    9,460
All                13,240    860   14,100

Table 4.1: Distribution of label combinations in OLID.
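As a hedged illustration of how this hierarchy translates into the data files, the sketch below derives the sub-task B and C subsets from sub-task A with pandas. The file name and the column names (id, tweet, subtask_a, subtask_b, subtask_c) are assumed from the OLID distribution format and may differ from the exact files used in this project.

```python
# Hedged sketch: deriving the hierarchical sub-task subsets from the OLID
# training file. File and column names are assumptions about the OLID format.
import pandas as pd

olid = pd.read_csv("olid-training-v1.0.tsv", sep="\t")

task_a = olid[["tweet", "subtask_a"]]                               # OFF vs NOT
task_b = olid[olid["subtask_a"] == "OFF"][["tweet", "subtask_b"]]   # TIN vs UNT
task_c = olid[olid["subtask_b"] == "TIN"][["tweet", "subtask_c"]]   # IND / GRP / OTH

print(task_a["subtask_a"].value_counts())
```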

All tweets were retrieved via the Twitter Search API and labelled through a crowdsourcing campaign that followed the steps described in the dataset description paper [87]. As explained in the paper, each tweet was manually labelled by at least two human annotators, with a third added in those cases where a majority vote was necessary to resolve a disagreement. The criterion used by annotators was to label tweets according to the following definition of the OFF label:

posts containing any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This includes insults, threats, and posts containing profane language or swear words.

The corpus from OffensEval 2019 contains 14,100 English tweets (32.90% belonging to the OFF class), of which 13,240 originally belonged to the training set and the remaining 860 to the test set. In 2020, unlike the previous year, the labels of the English tweets were generated by unsupervised learning methods instead of human annotators [61]. Thanks to this it was possible to collect over 9 million tweets, each of them associated with two values: the confidence that a specific instance belongs to a specific class and its standard deviation. However, this dataset will not be used for the main experiments, since the unsupervised learning models used to generate it were trained in part on tweets from the OLID dataset, which means that using both sets may lead to overfitted results.

In addition to the English tweets, OffensEval 2020 provided a multilingual dataset for sub-task A composed of Arabic [45], Danish [67], Greek [56] and Turkish [11] tweets. Table 4.2 shows the sizes and offensiveness percentages of the different datasets. Since the number of instances varies considerably (e.g. the Danish dataset is roughly 10 times smaller than the Turkish dataset), some experiments will only use subsets of certain datasets so that results can be compared objectively. Otherwise the outcome may be conditioned by the amount of training data, making comparisons difficult when drawing conclusions.

Language    NOT      OFF     Total    %OFF
Danish      2,864      425    3,289   12.92%
Arabic      8,009    1,991   10,000   19.91%
Greek       7,559    2,728   10,287   26.52%
Turkish    28,437    6,847   35,284   19.41%

Table 4.2: Multilingual datasets from OffensEval 2020.

Even though the annotation guidelines from the task organizers were clear, it is assumed that there might be some noise in the gold standard. Table 4.3 shows a few examples extracted from the OLID dataset whose labels could be questioned. Still, no corrections have been performed on the annotations, as this falls outside the scope of this project and such a task would be highly subjective.

Tweet                                                              A     B     C
@USER Great news! Old moonbeam Just went into a coma!             NOT   NULL  NULL
@USER Yep Antifa are literally Hitler.                             NOT   NULL  NULL
@USER Ouch!                                                        OFF   UNT   NULL
@USER She is drinking so much koolaid she's bloated.              OFF   TIN   IND
@USER @USER gun control! That is all these kids are asking for!    OFF   TIN   OTH

Table 4.3: Controversial examples from the OLID dataset.

The literature overview from Chapter 2 showed that there are other publicly available datasets. However, since they are all in the same language, they do not allow performance comparisons for multilingual models, which is one of the interests of this work. It would be interesting to enrich the training data by combining several of those datasets, but one limitation might be their classification criteria. For instance, some studies focus on abusive language [16] instead of offensive language, others include specific types of hate speech like racism or sexism [80], and others differentiate between hate speech, derogatory language and profanity [46].

4.2.2 Class imbalance

The datasets used for offensive language detection tasks often suffer from class imbalance, meaning that their classes are not equally represented. This imbalance is intended to realistically reflect the real-world content available on social networks, but at the same time it adds a level of difficulty to the classification task, because machine learning algorithms are much more likely to assign new observations to the majority class.

As an example, the OLID dataset is slightly imbalanced at the first level, more imbalanced at the second and highly imbalanced at the third. The reason is that most offenses are targeted and, when targeted, they are almost always directed at a group or individual. This means that classifiers will be reluctant to assign new examples to the OTH class, thus negatively affecting the final F1 score.

For a classifier to perform well, it is necessary to address the class imbalance problem, or else fewer instances from the minority class will be correctly classified. There are several techniques that can be used:

• Obtain more instances from the poorly represented classes. This is not so simple since the reason why a class is poorly represented is precisely that it is not so common in the real world.

• Delete instances from the majority class (under-sampling). This approach only makes sense if a large dataset is available, otherwise the remaining data might not be enough for proper training. On the other hand, if there is too much data this approach can solve memory problems and reduce the total runtime. However, there is always the risk that useful information is discarded (e.g. for rule-based classifiers).

• Add copies of instances from the minority class (over-sampling). Unlike under-sampling, this is usually done when there is a shortage of data and it is not affordable to lose training instances. The drawback in this case is that the algorithm is then more likely to overfit.

• Generate synthetic samples using systematic algorithms like the Synthetic Minority Over-sampling Technique (SMOTE) [9]. As its name indicates, SMOTE is an over-sampling technique that generates new synthetic data for the minority class. This is done by considering the K nearest neighbors of the minority instances and constructing feature-space vectors between them. The main advantage with respect to over-sampling by repetition is that SMOTE is less prone to overfitting. On the other hand, it is not practical for high-dimensional data and can introduce some noise caused by the overlapping of classes when performing kNN.

• Resample with ratios other than 1:1, since a certain degree of imbalance is acceptable.

• Combine several unbalanced datasets to obtain a balanced one.

• Assign higher weights to samples from under-represented classes and vice versa.

• Modify the classification thresholds. This was used by the winners of sub-task C in OffensEval 2019's edition, as described in their system description paper [74].

• Use penalized models such as penalized-SVM or penalized-LDA, which penalize classification mistakes on the minority class more heavily during training.

To keep things simple, this work will only use the over-sampling technique, as there is not enough data to perform under-sampling. However, it will not be applied by default but only in those cases where it proves beneficial. The participation in this year's OffensEval sub-task C, where the second position was achieved by over-sampling the dataset, confirms that this is a reliable technique for the problem at hand [49]. Other approaches, like resampling at different ratios or modifying the classification thresholds, were tested as well, leading to similar results.

It is important to notice that cross-validation should not be applied after over-sampling, because it would produce overly optimistic results that do not generalize to new data.
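As an illustration, a minimal over-sampling sketch is given below, assuming the data is held in a pandas DataFrame with a binary label column (the column name is illustrative); in line with the remark above, it is applied to the training split only, after any train/validation separation:

import pandas as pd
from sklearn.utils import resample

def oversample(train_df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    # Duplicate minority-class instances (with replacement) until both classes
    # have the same number of examples, then shuffle the result.
    counts = train_df[label_col].value_counts()
    majority = train_df[train_df[label_col] == counts.idxmax()]
    minority = train_df[train_df[label_col] == counts.idxmin()]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=42)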



4.2.3 Pre-processing steps

In any machine learning project, the first step after collecting the dataset is to explore it in order to detect what needs to be corrected (e.g. inconsistencies, missing information, out-of-range values...). Then, a series of transformations is applied to the data to produce a more reliable training set. This initial step is crucial, since bad quality data will surely lead to bad quality results, and it is especially important when dealing with microblog content, which tends to be unstructured and noisy. This project experimented with the following noise removal and normalization steps:

Desensitization

In the original datasets, every web address is replaced by the generic token 'URL'. Similarly, all user mentions, which always start with the at symbol (@) on Twitter, are replaced by the generic token '@USER' to preserve users' anonymity. However, in some cases this token had to be modified to simply 'user' so that it can be recognized as a single token. For instance, the tokenizer of the BERT-Base Multilingual Cased model (which does not lowercase the input text) would split the word '@USER' into three separate tokens: '@', 'US' and '##ER'.
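A minimal sketch of this normalization with regular expressions (the replacement tokens are those described above):

import re

def desensitize(tweet: str) -> str:
    # Web addresses and user mentions are mapped to generic placeholder tokens.
    tweet = re.sub(r"https?://\S+", "url", tweet)  # raw web addresses, if any remain
    tweet = re.sub(r"@USER", "user", tweet)        # pre-anonymized mentions in the datasets
    tweet = re.sub(r"@\w+", "user", tweet)         # any other user mention
    return tweet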

Hashtag segmentation

Hashtags are known to be an integral part of Twitter. They are used as labels to group tweets about the same topic, so that users can easily find content about a subject of interest. The problem, from an NLP point of view, is that they are usually a sentence with no spaces between words, which makes them hard to tokenize correctly. This is why, as an initial pre-processing step, hashtags will be divided into recognizable words.

The fact that they always start with the hash symbol (#) makes hashtags easy to detect with regular expressions. Then, in order to know where to split, it is assumed that every new word in a hashtag starts with a capital letter, since this is how they are commonly written (e.g. #thisIsAnExample). Regular expressions are used again to insert a blank space every time a non-capital letter is followed by a capital letter. By doing this and removing the hash symbol, most hashtags are correctly converted to actual sentences. However, since there is no standardized way of spelling hashtags, it is conceivable that some of them might be wrongly split by this method (e.g. #nocapitallettersatall or #ONLYCAPITALS). There are open-sourced segmentation modules7 that would fix this problem, but they are not available in all the languages on which this work focuses.
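A sketch of this camel-case based splitting (it deliberately leaves all-lowercase or all-uppercase hashtags untouched, which is the limitation mentioned above):

import re

def segment_hashtags(tweet: str) -> str:
    def split(match):
        body = match.group(1)
        # Insert a space wherever a non-capital letter is followed by a capital letter.
        return re.sub(r"([a-z])([A-Z])", r"\1 \2", body)
    # '#thisIsAnExample' -> 'this Is An Example'
    return re.sub(r"#(\w+)", split, tweet)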

Tokenization

Tokenization is the task of splitting a sequence of words into individual units, or splitting tweets into words in this particular case. These words are referred to as 'tokens', and are defined by StanfordNLP8 as instances of a sequence of characters that are grouped together as a useful semantic unit for processing. Since our dataset is made up exclusively of tweets, it seemed appropriate to use the TweetTokenizer module from the NLTK library9. Unlike regular tokenizers, this particular one is able to detect special tokens such as ':)' or '->' as well as separating consecutive emojis into separate tokens.
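A minimal usage sketch of the tokenizer:

from nltk.tokenize import TweetTokenizer

# Twitter-aware tokenizer that keeps emoticons such as ':)' as single tokens.
tweet_tokenizer = TweetTokenizer()
tokens = tweet_tokenizer.tokenize("user that was great :) -> see you soon")

TweetTokenizer also exposes a reduce_len option that shortens exaggerated character repetitions, which overlaps with the dedicated step described further below.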

In the case of BERT, tokenization was done with its own built-in WordPiece tokenizer, which splits words into smaller sub-word units. This is why, when using BERT, it is preferable to skip the previous pre-processing step (hashtag segmentation), as BERT's tokenizer breaks out-of-vocabulary words into the largest possible sub-words contained in the vocabulary. In the worst case a word will be split into individual characters, which is quite unlikely, however, considering the size of BERT's training corpus.

7 https://github.com/grantjenks/python-wordsegment
8 https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
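As an illustration of this behaviour (using the multilingual cased checkpoint mentioned above):

from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
# Out-of-vocabulary words are broken into the largest sub-words in the vocabulary,
# e.g. '@USER' becomes the pieces '@', 'US' and '##ER' as noted earlier.
print(bert_tokenizer.tokenize("@USER"))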

Lowercasing

All characters are lowercased to ensure that variations of the same word are not associated with different embeddings. It is also necessary for methods that require feature extraction, so that tokens can match blacklisted terms regardless of how they have been written.

Reducing lengthened words

Repetition of characters is a common way for social media users to express emotions such as excitement. In order to reduce the amount of out-of-vocabulary words it is important to correct these intentional spelling mistakes, which can again be done with regular expressions. The approach is to find all groups of three or more repeated characters (e.g. '!!!!!!!') and reduce their length to just two characters. The outcome may still not be grammatically correct (e.g. 'amaaaaaazing' would be converted to 'amaazing'), but this could easily be fixed by a spelling correction algorithm. The problem is that such a tool is easy to find for English but not for all the languages of interest.
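A sketch of this step with a single regular expression:

import re

def reduce_lengthening(text: str) -> str:
    # Any run of three or more identical characters is shortened to two,
    # e.g. '!!!!!!!' -> '!!' and 'amaaaaaazing' -> 'amaazing'.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)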

Stopword removal

Stopwords are the most commonly used words in a language (e.g. 'the' or 'is' in English). It is common practice to filter them out, since the text's semantics should remain intact after removing such words. It is important to note that in tasks such as offense target identification it might not be a good idea to remove stopwords, since they can carry valuable information. This project used the stopword lists from the NLTK library, which are available for several languages.
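A minimal sketch using the Danish stoplist (the stopword corpus has to be downloaded once with nltk.download('stopwords'); the tokens are illustrative):

from nltk.corpus import stopwords

danish_stops = set(stopwords.words("danish"))
tokens = ["det", "var", "en", "rigtig", "god", "dag"]
# Keep only the tokens that are not in the stoplist.
content_tokens = [t for t in tokens if t not in danish_stops]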

Stemming or Lemmatization

These are both ways of identifying a canonical representative for a set of related word forms, so either of them can be used for that purpose. Stemming consists of stripping a word of its prefixes and suffixes, aiming to reduce inflectional forms of a word to a common base form, so that related words are mapped to the same stem. Lemmatization does the same but through a morphological analysis of words, which makes it more accurate but slower to run. There is no stemming or lemmatization tool that supports all languages, so a different one had to be found for each language. For example, in Danish stemming was done with the SnowballStemmer [58] from NLTK (which also supports English, among others) and lemmatization with the Lemmy lemmatizer, used together with spaCy.
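As a small example, stemming Danish tokens with the Snowball stemmer could look as follows (the tokens are illustrative):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("danish")
# Map inflected forms of a word to a common stem.
stems = [stemmer.stem(t) for t in ["huset", "husene", "husets"]]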

Emoji removal

This is not mandatory but makes sense in some cases. For example, when using BERT there is no point in feeding it emojis, since they are considered out-of-vocabulary words and are therefore mapped to the 'UNK' token. In any case, as emojis obviously play an important role in a tweet's semantics, their underlying emotion should be captured in some way, ideally as a meaningful feature.

It is important to notice that the order in which these steps are applied does matter. For example, if lowercasing is done before hashtag segmentation, the latter will have no effect. It is believed that spelling correction algorithms would be useful in this phase, but they are not so easy to find in languages other than English. There are even resources for contraction expansion (e.g. 'don't' → 'do not') or irregular word correction (e.g. 'bro' → 'brother'). All these tools can help normalize noisy tweets, although a certain amount of noise might be helpful for abuse detection if captured in the feature extraction phase, which will be explained in the following section.
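To make the ordering constraint concrete, a hypothetical pipeline chaining the helper functions sketched in the previous subsections could look as follows (hashtag segmentation runs before lowercasing, and tokenization before stopword removal and stemming):

def preprocess(tweet: str, stops: set, stemmer) -> list:
    tweet = desensitize(tweet)        # URLs and mentions -> generic tokens
    tweet = segment_hashtags(tweet)   # while capitalization is still available
    tweet = tweet.lower()
    tweet = reduce_lengthening(tweet)
    tokens = tweet_tokenizer.tokenize(tweet)
    tokens = [t for t in tokens if t not in stops]
    return [stemmer.stem(t) for t in tokens]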



4.3 Traditional machine learning approach

Despite not being the most popular approach nowadays, several machine learning algorithms have been used for similar tasks in the literature. This part of the thesis focuses on the Support Vector Machine (SVM), but Naïve Bayes and Random Forest will also be tested as baselines.

For the multinomial Naïve Bayes, an additive smoothing parameter of 0.01 is used and the class prior probabilities are learned from the data. The SVM classifier has a linear kernel, a regularization parameter of 2.25 and a squared l2 penalty. Word unigrams, bigrams and trigrams were used as features. In the case of Random Forest only unigrams are considered, as this achieved slightly better cross-validated results during grid search. The number of trees in the forest is set to 200 and no maximum depth is specified. The Gini index is used to measure the quality of each split.
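Assuming scikit-learn implementations, the baselines with the hyper-parameters reported above could be instantiated roughly as follows (a sketch, interpreting the SVM as a linear classifier with l2 regularization and squared hinge loss):

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

nb = MultinomialNB(alpha=0.01, fit_prior=True)                # additive smoothing of 0.01
svm = LinearSVC(C=2.25, penalty="l2", loss="squared_hinge")   # regularization parameter of 2.25
rf = RandomForestClassifier(n_estimators=200, criterion="gini",
                            max_depth=None)                   # 200 trees, Gini splits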

Different combinations of pre-processing steps and vectorization were tested. In the end, tweets are vectorized with TF-IDF and all pre-processing steps explained in Section 4.2.3 are applied to the data:

• User mentions and web addresses are replaced by generic tokens.

• Hashtags are converted to sentences.

• All characters are lowercased.

• Text is tokenized with NLTK's TweetTokenizer.

• Repeated characters are reduced.

• NLTK stopword lists are used to remove the most common words.

• A different stemmer was used for each language, as stemming produced very similar results to lemmatization.

The main limitation of these methods is that they inevitably require some domain expertise to be applied in the so-called feature engineering process. Feature engineering implies using domain knowledge to extract those attributes from the data that can help a machine learning algorithm make better predictions. The feature selection and extraction steps come right after cleaning and pre-processing the data, and before feeding it to the model.

Surface-level features like character or token n-grams are highly predictive on their own, but tend to be used in combination with other features for improved performance. In general, it can be said that bigrams and trigrams are better than unigrams because they take into account the context of nearby words. All three (1-grams, 2-grams and 3-grams) will be used when converting tweets into a numerical format with the TF-IDF model. Then, in order to answer research question 1, sentiment analysis information should be embedded in another kind of feature. For instance, a sentiment lexicon could be used to count the number of positive, negative and neutral words in a post and use such counts as features. It can also help to numerically represent the sentiment hidden in emojis and use that knowledge as a feature, since they are capable of completely changing the essence of a tweet. This will be attempted using the so-called Emoji Sentiment Ranking [47], a language-independent lexicon that provides sentiment scores for the most popular emojis on the internet. A few examples can be seen in Figure 4.1.
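A sketch of this feature extraction, combining TF-IDF over word n-grams with a simple emoji-based sentiment score (the emoji scores below are placeholders, not values taken from the actual Emoji Sentiment Ranking lexicon):

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

EMOJI_SENTIMENT = {"❤": 0.75, "😂": 0.2, "😡": -0.6}  # illustrative scores only

def emoji_score(tweet: str) -> float:
    # Average sentiment of the emojis found in the tweet, 0.0 if there are none.
    scores = [EMOJI_SENTIMENT[ch] for ch in tweet if ch in EMOJI_SENTIMENT]
    return float(np.mean(scores)) if scores else 0.0

tweets = ["you are amazing ❤", "I hate all of you 😡"]
vectorizer = TfidfVectorizer(ngram_range=(1, 3))      # word uni-, bi- and trigrams
ngram_features = vectorizer.fit_transform(tweets)

sentiment_features = csr_matrix([[emoji_score(t)] for t in tweets])
features = hstack([ngram_features, sentiment_features])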

