
Transfer Learning for Multilingual Offensive Language Detection with BERT

Camilla Casula

Uppsala University

Department of Linguistics and Philology Master Programme in Language Technology

Master's Thesis in Language Technology, 30 ECTS credits
June 9, 2020

Supervisors:

Christian Hardmeier, Uppsala University

Sara Tonelli, Fondazione Bruno Kessler


Abstract

The popularity of social media platforms has led to an increase in user-generated content being posted on the Internet. Users, masked behind what they perceive as anonymity, can express offensive and hateful thoughts on these platforms, creating a need to detect and filter abusive content. Since the amount of data available on the Internet is impossible to analyze manually, automatic tools are the most effective choice for detecting offensive and abusive messages.

Academic research on the detection of offensive language on social media has been on the rise in recent years, with more and more shared tasks being organized on the topic. State-of-the-art deep-learning models such as BERT have achieved promising results on offensive language detection in English.

However, multilingual offensive language detection systems, which focus on several languages at once, have remained underexplored until recently.

In this thesis, we investigate whether transfer learning can be useful for improving the performance of a classifier for detecting offensive speech in Danish, Greek, Arabic, Turkish, German, and Italian. More specifically, we first experiment with using machine-translated data as input to a classifier. This allows us to evaluate whether machine translated data can help classification.

We then experiment with fine-tuning multiple pre-trained BERT models at once. This parallel fine-tuning process, named multi-channel BERT (Sohn and Lee, 2019), allows us to exploit cross-lingual information with the goal of understanding its impact on the detection of offensive language. Both the use of machine translated data and the exploitation of cross-lingual information could help the task of detecting offensive language in cases in which there is little or no annotated data available, for example for low-resource languages.

We find that using machine translated data, either exclusively or mixed with gold data, to train a classifier on the task can often improve its performance.

Furthermore, we find that fine-tuning multiple BERT models in parallel can positively impact classification, although it can lead to robustness issues for some languages.


Contents

Acknowledgements

1 Introduction
  1.1 Purpose
  1.2 Outline
  1.3 Collaboration

2 Background
  2.1 Terminology
  2.2 Offensive Language Detection: Previous Work
    2.2.1 Data and Annotation
    2.2.2 Features
    2.2.3 Classification Methods
  2.3 Transformer Models and BERT
    2.3.1 Multi-Head Self-Attention
    2.3.2 BERT
  2.4 State of the Art and Challenges

3 Data
  3.1 English
  3.2 Danish
  3.3 Greek
  3.4 Arabic
  3.5 Turkish
  3.6 German
  3.7 Italian

4 Methodology and Experimental Setup
  4.1 Translation Process
  4.2 Data Configurations
  4.3 Preliminary Operations
  4.4 Baseline Model
  4.5 Multi-Channel BERT

5 Results and Discussion
  5.1 Evaluation Metrics
  5.2 Results
    5.2.1 Danish
    5.2.2 Greek
    5.2.3 Arabic
    5.2.4 Turkish
    5.2.5 German
    5.2.6 Italian

6 Conclusion


Acknowledgements

I would like to thank my supervisor Christian Hardmeier for his support, his valuable advice and his availability at every stage of this project.

I would also like to thank my co-supervisor Sara Tonelli for her constant support, suggestions, and advice. I am grateful for having had the chance to work with Stefano Menini, Alessio Palmero Aprosio, and Elisa Leonardelli at the Digital Humanities research unit at Fondazione Bruno Kessler. They all patiently guided me and were always ready to help, even in the midst of a pandemic.

I am grateful for my classmates in the Language Technology Master Program, who shared this experience (and many days in the windowless Chomsky lab) with me.

Thank you to my classmate Shorouq Zahra for helping me with the interpretation of Arabic results for this project.

Finally, I am thankful for the unfaltering support of my family and friends. Their encouraging words are the most powerful fuel I could ask for.


1 Introduction

The rise in popularity of social media platforms in recent years has caused a surge in user-generated web content, with large amounts of messages being posted on the web each second. Masked behind what they perceive as anonymity, users often express offensive, hateful, and abusive thoughts on the Internet. The detection and the removal of offensive content therefore play a crucial role in keeping the Internet open and safe (Vidgen et al., 2019). Such messages can raise ethical issues, in addition to being against regulations or platform guidelines. With content being generated at a much faster rate than it could possibly be consumed, however, manual analysis of social media data is impractical, if not impossible. The use of automatic methods to analyze the data and detect offensive content has consequently become increasingly necessary.

Work on automatic detection of offensive language has recently been on the rise, with more and more shared tasks being organized on the topic (Basile et al., 2019; Bosco et al., 2018; Wiegand et al., 2018; Zampieri et al., 2019b, 2020). Most efforts have been centered on the detection of abusive language in English, but offensive language detection systems have been developed for other languages as well, such as Italian (Bosco et al., 2018) or German (Wiegand et al., 2018). While in general systems are centered on the detection of offensive speech in a specific language, the detection of offensive messages from a multilingual perspective, performing offensive language detection in multiple languages with similar architectures, remained largely unexplored until the SemEval 2019 HatEval shared task, which included English and Spanish data (Basile et al., 2019). In the last two years, more work has been carried out on the topic (Corazza et al., 2020; Ousidhoum et al., 2019; Sohn and Lee, 2019), including the SemEval 2020 OffensEval shared task, which included offensive language detection data in five languages: English, Danish, Greek, Arabic and Turkish (Zampieri et al., 2020). Since the annotation of offensive language corpora often requires extensive resources, both in terms of time and annotators, the development of multilingual models for the task can limit the need for large amounts of task-specific and language-specific resources.

In the SemEval 2019 OffensEval shared task (Zampieri et al., 2019b), focused on the detection of offensive language in English, most of the best performing systems used BERT, a state-of-the-art deep learning model based on the Transformer architecture (Devlin et al., 2019; Vaswani et al., 2017; Zampieri et al., 2019b). Sohn and Lee (2019) explore a multilingual hate speech detection system based on BERT, in which three different BERT models are trained in parallel on machine translated data, and all of them are simultaneously used for classification. The authors name this type of model multi-channel BERT, based on the idea that each pre-trained BERT model constitutes one parallel channel in the model. In this work, we propose a system inspired by that of Sohn and Lee (2019), with the goal of exploring the potential of transfer learning for multilingual offensive language detection.

Our system aims at exploring two ways of applying transfer learning to this task. First, we explore whether a system for detecting offensive language online can be fully or partially trained on machine translated data. As the annotation of offensive language datasets is extremely time consuming, this could offer a valuable alternative if shown effective. The second aspect we explore is the use of a multi-channel BERT model inspired by the one in Sohn and Lee (2019), in which the model exploits cross-lingual information for classification.

1.1 Purpose

This thesis project has the main aim of investigating the impact of transfer learning in the classification of offensive language in a multilingual setting. While systems are generally developed to deal with offensive messages in a specific language, an effective multilingual system could reduce the need for language-specific and task-specific annotated data. We further investigate the possibility of training multilingual systems with no gold data for the language of interest, in order to assess whether classification is possible in cases in which we lack annotated data. The use of transfer learning in this setting could be of particular interest for the annotation of offensive messages in low-resource languages.

Our focus is the assessment of the impact of cross-lingual information on offensive language classification. In order to investigate this, we experiment with two methods:

1. The use of machine translated data to train offensive language detection models,

2. The use of a multi-channel BERT setup, in which we train both a multilingual BERT model and an English BERT model and use both for classification, regardless of the language we experiment on.

1.2 Outline

Different studies on the detection of offensive language generally do not share a consistent definition for the task. For instance, what one study sees as offensive speech, another study might find non-offensive. In addition to this, some studies aim at detecting specific instances of offensive language, such as cyberbullying or hate speech. In order to clarify some aspects related to the definition of the task, we discuss some terminology issues in Section 2.1. We then summarize the most relevant previous work done on the detection of offensive language online, focusing on the different annotation guidelines, features, and classification methods typically used for the task.

Among the classification methods, we single out the method we follow, based on BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019). We then proceed to give an overview of Transformer models and the BERT architecture.

Given that there is great variety among offensive language detection datasets and the data we use comes from different sources, we introduce each of the datasets we use for our experiments in Chapter 3, providing information on their source, size, and annotation process.

After introducing the data we use, we illustrate our methodology and experimental setup in Chapter 4. Since the first aspect explored in our approach is the use of machine translated data for classification, we start Chapter 4 by illustrating our setup for translating gold datasets, as well as the different data configurations used for training our systems on varying percentages of machine translated data. After illustrating our baseline model, we describe our main classifier.

Finally, in Chapter 5, we present the evaluation metrics and the results we obtain in our experiments, and discuss them in more detail for each language we experiment on: Danish, Greek, Arabic, Turkish, German, and Italian. We then discuss our overall results and findings.


1.3 Collaboration

This thesis project was carried out during an internship period at the Digital Humanities (DH) research unit at Fondazione Bruno Kessler in Trento, Italy. During the internship, we participated with the DH research unit in the OffensEval task at SemEval 2020, focused on multilingual detection of offensive language in five languages. Our submission to the shared task, co-authored with Stefano Menini, Alessio Palmero Aprosio, and Sara Tonelli, had the goal of assessing whether the performance of multilingual offensive language detection systems could be improved by fine-tuning multiple pre-trained BERT models simultaneously.

In this work, we expand on the SemEval 2020 submission and perform additional experiments on a similar setup, with some differences. First of all, in addition to experimenting with fine-tuning multiple BERT models simultaneously, we try to assess the impact of machine translated data on classification by training our models on three data setups (gold data only, gold and translated data, and machine translated data only), while in the shared task submission we only experimented with two setups (gold only, and both gold and machine translated data). Second, the classifier in this thesis includes an additional dense layer compared to the classifier in our SemEval 2020 submission. Third, the results reported in this thesis are on different test sets (for Danish, Greek, Arabic and Turkish, these test sets were randomly extracted from the gold training data). Finally, in this thesis we experiment on two additional languages that were not part of the shared task: German and Italian.

The DH unit contributed to this thesis by performing some of the automatic translations with the Google Translate API and by adapting a pre-processing tool to be used on Italian and German (details are given in the dedicated sections). They also provided guidance and suggestions regarding the system architecture. The remaining parts of the work were carried out by the author.


2 Background

In recent years, automatic identification of abusive messages has attracted increasing attention within the natural language processing community, partly because social media platforms have been increasingly pressured to deal with these issues (Waseem et al., 2017). Offensive language is a complex phenomenon which includes many different types of abusive language, such as cyberbullying or hate speech. Research on the topic of offensive language detection is highly heterogeneous, because many works focus on one specific sub-task, e.g. hate speech detection (Basile et al., 2019; Corazza et al., 2020; Waseem and Hovy, 2016), while other works concentrate on the task from a broader perspective (Zampieri et al., 2019b, 2020). Furthermore, since abusive content is a varied and complex phenomenon, there is no shared definition among researchers of what constitutes offensive language, leading to contrasting definitions of the task and, subsequently, differences in annotation guidelines for the creation of datasets.

In order to clarify some issues around the definition of the task and its specificity, in the first section of this chapter we address some terminology aspects. We then give an overview of the previous work that has been carried out on the topic of offensive language detection and move on to provide an overview of the BERT architecture.

2.1 Terminology

The umbrella terms offensive language and abusive language are used to refer to a wide variety of phenomena, such as hate speech, cyberbullying, aggression, flaming, and trolling. In the literature on the topic, there is no commonly understood definition of what constitutes offensive or abusive language, and often academic works use these words without providing the reader with clear definitions.

The confusion around the definition of the task has led to a variety of works being carried out with different annotation guidelines for similar phenomena, as well as overlaps in studies meant to be centered on different instances of offensive language.

These factors have affected the reusability of datasets across sub-tasks (Kumar et al., 2018). For example, a large part of the work on offensive language detection focuses on hate speech, defined in Davidson et al. (2017) as follows:

‘Language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group’.

However, according to Davidson et al. (2017), many studies tend to conflate offensive language and hate speech, resulting in works that officially aim at detecting hate speech, but in reality perform a task closer to offensive language detection, and vice versa.

Therefore, some of the research on specific sub-tasks such as hate speech detection can be valuable to study offensive language in a broader sense.

Phenomena such as hate speech and cyberbullying share common elements, which make it reasonable to group them under the broader label of offensive language. A definition for the term offensive language could then be drawn from the common points between all phenomena that constitute it. In the words of Kumar et al. (2018):

‘All of these behaviours are considered undesirable, aggressive and detrimental for those on the receiving end’.


Waseem et al. (2017) discuss the lack of consensus regarding the definition of offensive language and the fact that differences between the definitions of the tasks lead to contradictory annotation guidelines. They propose a typology to distinguish between the sub-tasks. According to the authors, instances of offensive language can be categorized according to two main distinctions. The first distinction is based on whether the offense is targeted towards an individual or entity, or generalized to an entire group. The second distinction that should be made is between explicitly abusive messages (for example, containing insults) and implicit offenses, typically employing devices such as sarcasm or irony.

In this work, we use the term offensive language as a hypernym for different forms of language, including cyberbullying, hate speech, abusive and aggressive language, regardless of whether the offense is explicit, implicit, directed, or undirected. Our goal is to classify messages as offensive or inoffensive in a coarse, binary fashion. In order to do this, we chose datasets annotated for offensive language in general, including profane language with no explicit offensive intent. However, we also experiment with a dataset entirely annotated for hate speech only (see Chapter 3).

2.2 Offensive Language Detection: Previous Work

One of the first academic works on the automatic detection of offensive language is Spertus (1997). Spertus builds a decision tree classifier for identifying abusive messages, referred to as flames in her work, defined as ‘messages that not only contain insulting words but use them in an insulting manner’. This system is based on linguistic features in the form of syntactical, lexical and graphological rules.

For years, most commercial methods used blacklists and regular expressions before more sophisticated methods started to emerge. Nowadays, the detection of offensive language generally takes the form of a supervised learning task, with support vector machines and neural networks typically being used for classification.

Due to the heterogeneity of offensive language, there is great variety among offensive language detection systems. Systems vary with regard to three main aspects:

• Data and annotation process, given the many differences in the definition of the task and, consequently, the annotation guidelines followed;

• Features used for classification, which can range from n-gram based approaches to vector representations of words, sentences or paragraphs;

• Classification methods, which can range from traditional machine learning approaches to deep learning methods.

In this section, we summarize academic work done in the field of offensive language detection by focusing on each of these aspects.

2.2.1 Data and Annotation

In order for offensive language detection systems to be robust, high-quality labeled data is essential. Typically, data is gathered from one specific social media site, the most common one being Twitter (Bosco et al., 2018; Waseem and Hovy, 2016; Zampieri et al., 2019b, 2020). Other sites commonly used as sources for offensive messages include Facebook (Bosco et al., 2018), Instagram (Hosseinmardi et al., 2016), and Reddit (Sigurbergsson and Derczynski, 2020), among others.


Annotation Process and Guidelines

Because of the complex nature of offensive language, annotation guidelines differ across corpora. While some datasets are built for detecting specific subcategories of abusive language, such as cyberbullying (Hosseinmardi et al., 2016), hate speech (Warner and Hirschberg, 2012; Waseem and Hovy, 2016), and trolling (Golbeck et al., 2017), other corpora are created with the goal of dealing with abusive language in broader terms (Zampieri et al., 2019b, 2020). Furthermore, the interpretation of the different phenomena is subjective, and therefore inconsistent across papers. As a consequence, when working with abusive or hateful messages we typically observe low inter-annotator agreement (Vidgen et al., 2019; Waseem et al., 2017). Inter-annotator agreement tends to be especially low when dealing with what Waseem et al. (2017) define as implicit forms of abuse. While explicit abusive language is described as ‘unambiguous in its potential to be abusive’, implicit abusive language is stated to be ‘that which does not immediately imply or denote abuse’: the offensive content, in these cases, is typically concealed through ambiguous terms, sarcasm, and other means, making these instances more difficult to detect (Waseem et al., 2017).

Another difference between datasets is the level of annotation. Some datasets, such as OLID (Zampieri et al., 2019a), offer multiple levels of annotation in addition to the binary coarse-grained (offensive/non-offensive) labels. OLID, for instance, is annotated also according to whether the offense is targeted (containing an insult or threat directed towards an individual, a group, or an entity) or untargeted (posts which contain profanity but which are not directed towards a person, group or entity).

OLID is also annotated according to the specific target if a certain insult is targeted (individual, group, or other). For instance, the sentence ‘what is wrong with these idiots?’ is annotated as offensive and targeted, and the target is identified as a group, while the sentence ‘worst experience of my fucking life’ is annotated as offensive and untargeted (Zampieri et al., 2019a).

The annotation process is typically performed by the researchers themselves, by experts (e.g. feminist and anti-racism activists (Waseem, 2016)) or, more often, via crowdsourcing. Although crowdsourcing can speed up an otherwise very lengthy process, the quality of annotations is typically lower in datasets featuring this type of annotation, since the annotators lack domain-specific knowledge with relation to abuse and hate speech (Schmidt and Wiegand, 2017; Waseem, 2016; Waseem et al., 2017).

Potential Issues

Given the relatively low amount of offensive messages online compared to non-offensive messages, gathering data for offensive language datasets, as well as attempting to make corpora balanced, can be problematic (Schmidt and Wiegand, 2017).

Generally, social media websites are queried for terms known to be associated with offensive language, which can range from swear words to ethnic slurs. Waseem and Hovy (2016), for example, try to circumvent the need for extremely high amounts of annotation in order to obtain a balanced set by querying social media sites for terms which are more likely to be related to offensive content, such as feminazi or arab terror.

While this approach would make it easier to find hateful content, defining a priori certain topics to query for finding offensive language can result in biased datasets, where only some kinds of abusive language are represented (Schmidt and Wiegand, 2017).

Additional methods used to find offensive language on social media include using hate speech lexica (Davidson et al., 2017) and identifying users who generate a large quantity of abusive content. In Wiegand et al. (2018), who build a Twitter dataset for hate speech detection in German, users who frequently post offensive content are first identified heuristically, and posts are then sampled from their timelines and annotated.

According to the authors, this method of data gathering is less biased than querying for specific terms, since it allows for more variety across instances of offensive language.

The size and class balance of datasets vary considerably depending on the data collection methods used, although generally the number of offensive messages ranges between 15% and 30% of the total size of the corpus. While this percentage is higher than the proportion of offensive tweets found when sampling posts at random, the number of offensive posts is typically kept lower than 50% to reflect the imbalance between the classes found in random sampling.

Another potential issue is dataset degradation. Many datasets are shared as lists of post IDs, so the data has to be retrieved through APIs. This is, for example, the case with many Twitter datasets. Since content can be removed, either by moderators because it violates platform guidelines, or by users themselves for various reasons, this can reduce the size of datasets over time (Vidgen et al., 2019).

Finally, the annotation process can have a negative mental impact on the annotators, since they are exposed to large amounts of offensive and abusive content (Chen et al., 2012; Vidgen et al., 2019; Waseem, 2016). In many cases, annotation via crowdsourcing is preferred for this reason, since it distributes the psychological toll the task entails across a crowd, rather than having a small group of individuals annotate large quantities of data (Nobata et al., 2016; Warner and Hirschberg, 2012). A solution proposed by Vidgen et al. (2019) to handle the impact of abusive language on annotators is providing them with social and mental health support to ensure that research on this topic remains sustainable.

2.2.2 Features

The phenomenon of offensive language online involves social, cultural, linguistic, and individual factors. Offensive content is therefore not homogeneous, as there is great variety between single instances of such language. The term features refers to measurable pieces of information on a post which can be fed into a classifier to help classification. For example, when dealing with offensive speech, the presence of swear words or insults, the use of the imperative mode, or the use of capitalization can be used as features for an offensive language detection model, since all these factors can be correlated with the presence of offensive language.

However, given the complexity of the phenomenon, it would be extremely difficult to tell an offensive message apart from a non-offensive one on the basis of one single influencing factor. For instance, if only the presence of insulting words is considered, the sentence ‘I don’t think you’re stupid’ can be marked as offensive because it contains the word stupid, while the statement could be considered non-offensive in its entirety.

Since the use of only one feature can be ineffective, multiple features are often used simultaneously. The set of features used in offensive language detection systems can vary greatly from simple surface information, such as character n-grams, to more complex representations, such as word or paragraph embeddings (Schmidt and Wiegand, 2017).

Simple Surface Features

The most basic and straightforward approach is that of looking at surface-level information, both in the form of token n-grams and character n-grams. These features are observed to be highly predictive (Schmidt and Wiegand, 2017) and are employed in a variety of works on the topic (Chen et al., 2012; Nobata et al., 2016; Waseem and Hovy, 2016). Token n-grams can, however, lead to data sparsity problems, given the large variations in spelling when dealing with user-generated content. For instance, elongated words such as hellooooo would lead to large amounts of rare or unknown tokens when using word n-grams. In addition to this, users are often aware of the word blacklist systems typically used on social media platforms for detecting abusive language, and they consciously try to evade such systems using uncommon spelling or characters, such as in the sentence ‘ki11 yrslef a$$hole’ (Mehdad and Tetreault, 2016).

Due to this data sparsity issue, character n-grams are typically preferred for the task. Mehdad and Tetreault (2016) compare character-based and word-based approaches on the task, and find that character n-grams can predict abusive language more accurately.
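As an illustration, a character n-gram classifier of the kind discussed above can be built with standard libraries. The snippet below is a minimal sketch, not code from this thesis; the toy texts and labels are illustrative placeholders.

# Minimal sketch of a character n-gram baseline; toy data, not thesis data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["have a great day", "ki11 yrslef a$$hole", "thanks for sharing", "you are an idiot"]
labels = [0, 1, 0, 1]  # 0 = not offensive, 1 = offensive

# 'char_wb' extracts character n-grams inside word boundaries, which is more
# robust to creative spellings than token n-grams.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["u r an idi0t"]))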

Additional surface-level features typically used for detecting offensive language in combination with other features include the frequency of URLs and punctuation, capitalization, number of non-alphanumeric characters, and text and token length (Schmidt and Wiegand, 2017).

Lexical Features

The use of lexical resources is typically linked to the belief that offensive messages are likely to contain specific negative words, such as insults (e.g. idiot) or slurs (e.g. faggot). The presence of these words can then be used, in combination with other features, to predict whether a message is potentially offensive or not.

There is a large quantity of word lists and lexicons available online. Some of these contain generic hate-related or offensive terms, such as the resource used by Nobata et al. (2016) (https://hatebase.org/), although lexica specific to certain subcategories of offensive language (e.g. ethnic slurs) exist. Other approaches involve lexica specifically created for the task at hand. For example, Spertus (1997) compiled a list of good adjectives and good verbs (e.g. great or super) under the assumption that posts containing these words tend to contain praise rather than offenses.

In general, lexical resources typically need to be used in conjunction with other features in order to allow effective prediction of offensive language, given the importance of contextual information for the interpretation of certain words (Hosseinmardi et al., 2016; Schmidt and Wiegand, 2017).

Linguistic Features

There are different types of linguistic features which can, in principle, be predictive when dealing with offensive language. Spertus (1997) formulates a set of linguistic rules tailored specifically to the detection of hate speech. These include the detection of the imperative mode, which is stated to be typical of offensive and condescending language, and the co-occurrence of the second person pronoun you followed by a noun apposition (such as in the phrase ‘you bozos’), which tend to be insulting. In order to prevent false positives, Spertus also includes some additional information, such as polite and praise rules, which find instances of language containing words such as ‘please’ or ‘kudos’, which decrease the probability that a certain text contains offensive language (Spertus, 1997).

In Xu et al. (2012), the combined use of n-grams and part-of-speech (POS) information is explored in the context of cyberbullying. However, POS tags are not reported to improve the performance of the classifier significantly.

Another type of linguistic feature found in related works is syntactic information. Chen et al. (2012) use dependency relation information in addition to n-gram features in order to capture dependency relations between offensive words and user identifiers (pronouns, names, etc.), even when they are not contiguous. For example, such information could help a classifier correctly label as offensive the sentence ‘you are, by any means, an idiot’ (Chen et al., 2012).

Generalization

Many features, in order to allow for effective prediction, need to appear in both training and test data to avoid data sparsity issues. Since the amount of annotated data available for this task is typically limited, word generalization methods can help circumvent the problem (Schmidt and Wiegand, 2017). For example, Xiang et al. (2012) use statistical topic distribution as a feature.

The most widely used generalization approach consists of vector word representations, also known as word embeddings. Word embeddings are typically induced from a large unlabelled reference corpus, and are widely used for the classification of offensive language (Bosco et al., 2018; Zampieri et al., 2019b). In addition to word embeddings, the use of sentence and paragraph embeddings has also been found effective for the task (Schmidt and Wiegand, 2017).

Additional Features

Additional features often used in combination with other types of information include meta-information, such as the gender of the author of a certain message (Waseem and Hovy, 2016) or the quantity of replies to a certain post, and multi-modal information, such as images, audio and video content. For example, Hosseinmardi et al. (2016) experiment with a multi-modal classifier in order to detect instances of cyberbullying on Instagram, a social network based on the sharing of images.

2.2.3 Classification Methods

The most widely used classification methods in the literature follow supervised learning approaches. Both deep learning approaches and more traditional approaches are used, sometimes in combination.

Traditional machine learning approaches include logistic regression, naive Bayes, decision trees, random forests, and linear support vector machine (SVM) classifiers (Davidson et al., 2017; Xu et al., 2012). Out of these approaches, linear support vector machines are the most widely used and yield the best results (Schmidt and Wiegand, 2017; Xu et al., 2012).

More recent approaches are based on deep learning methods. In the 2018 TRAC shared task on aggression identification in social media (Kumar et al., 2018), the best performing system used long short-term memory (LSTM) and convolutional neural networks (CNN) (Aroyehun and Gelbukh, 2018). Furthermore, the system proposed by Aroyehun and Gelbukh (2018) exploits data augmentation through automatic translation and pseudo-labeling. The data augmentation process is inspired by techniques typically used in computer vision, in which datasets are enlarged using label-preserving transformations. In their case, Aroyehun and Gelbukh (2018) translate each example into an intermediate language and then back to English, the language they work on.

The pseudo-labeling process, on the other hand, consists of automatically annotating additional data using a model trained for hate speech detection.

In the OffensEval task at SemEval 2019 (Zampieri et al., 2019b), focused on the identification and categorization of offensive language in English, most teams opted for deep learning approaches, with most of the best performing teams in terms of macro F1 scores using deep learning models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), which have achieved state-of-the-art results in a number of NLP tasks. In sub-task A, centered on the binary classification of offensive/non-offensive language, seven out of the 10 top-ranking teams used BERT (Zampieri et al., 2019b).

BERT is designed to be pre-trained on a large unlabeled corpus through two unsupervised tasks: masked language modeling, in which a random selection of input tokens is masked and then predicted, and next sentence prediction. The pre-trained language representation can then be fine-tuned on a specific downstream task using labelled data. While each task has a separate fine-tuned model, they are all initialized with the same pre-trained model parameters (Devlin et al., 2019). BERT has obtained state-of-the-art performance in a variety of NLP tasks, including offensive language detection. Devlin et al. (2019) have made several pre-trained BERT models available for English, as well as a pre-trained Chinese model and a multilingual model pre-trained on the 104 largest Wikipedias (https://github.com/google-research/bert).

Since most research on offensive language detection has been revolving around English, BERT is typically used to perform abusive language detection on this language.

However, some researchers have experimented with BERT in multilingual settings.

For instance, Sohn and Lee (2019) build a model which exploits three pre-trained BERT models for performing hate speech detection on three languages: Italian, German, and Spanish. In this architecture, named multi-channel BERT, the three pre-trained BERT models are fine-tuned in parallel and all are used for classification. In their setup, a corpus (in Italian, German or Spanish) is automatically translated into English and Chinese, and the tri-parallel data is then fed to a model which fine-tunes three different pre-trained BERT models: multilingual BERT, English BERT, and Chinese BERT. The hidden layers from each of these models are then added through weighted sum before classification. The model therefore exploits cross-lingual information from all three models in order to classify hate speech. In our setup, we will use a system inspired by that of Sohn and Lee (2019), in which we fine-tune multiple BERT systems in a parallel (multi-channel) setting.

2.3 Transformer Models and BERT

BERT stands for Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). It is a deep bidirectional language representation model based on the Transformer architecture, introduced by Vaswani et al. (2017). Recent years have seen a surge in deep pre-trained language representation models, such as BERT itself and ELMo (Peters et al., 2018), which now define the state of the art in a variety of NLP tasks.

The Transformer architecture, which was initially developed for neural machine translation (Vaswani et al., 2017), relies on attention mechanisms rather than recurrent or convolutional layers to compute representations of its input and output, thus reducing training times and cost compared to previous state-of-the-art models. At its core, the Transformer architecture has an encoder-decoder structure, in which both the encoder and the decoder are composed of a stack of 6 identical layers containing both feed-forward layers and Multi-Head Self-Attention mechanisms.

2.3.1 Multi-Head Self-Attention

The use of a Multi-Head Self-Attention mechanism is the main novel component of Transformer models. Self-attention is defined in Vaswani et al. (2017) as ‘an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence’. In other words, this attention mechanism encodes the relationship between each word and every other word in a sequence, focusing on the most relevant relationships, effectively allowing a sequence to ‘pay attention’ to itself. For example, in the sentence ‘I bought chocolate cookies’, the Multi-Head Self-Attention layer’s output for the word ‘chocolate’ might pay more attention to the word ‘cookies’ than to the word ‘I’, depending on the task. This form of attention has the benefit of reducing computational complexity per layer and of effectively handling long-range dependencies.

Scaled Dot-Product Attention

Multi-head attention uses scaled dot-product attention which, in turn, is based on dot-product attention, one of the most commonly used attention functions (Vaswani et al., 2017). Attention functions can generally be described as a mapping between a query, a set of key-value pairs, and an output, in which all these components are vectors. The output is computed as a weighted sum of the values, and the weights for each value are calculated based on the similarity between the query and the key.

In Géron (2019), dot-product attention is explained with the following example: we can suppose that the encoder has, at a given time step, analyzed the sentence ‘They played chess’, and it has understood that ‘They’ is the subject and ‘played’ is the verb, creating the equivalent of a dictionary representation ({‘subject’:‘They’, ‘verb’:‘played’, ...}). We can now suppose that the decoder has already translated the subject, and it should translate the verb next, so it needs to look up in the dictionary the value corresponding to the key ‘verb’ to access the verb in the original sentence. However, the keys in the encoder dictionary are not stored as discrete tokens (i.e. ‘verb’ or ‘subject’), but rather in vectorized form, learned during training. This means that what the decoder will look up (the query) will not exactly match any key in the encoder dictionary. This problem is solved by computing a similarity measure between the query and the key. This measure is then converted into a weight for each value through a softmax function, so that the weights for all values add up to 1.

If we assume $d_{\text{keys}}$ to be the dimension of the keys in the input, for large values of $d_{\text{keys}}$ the dot products can, in the words of Vaswani et al. (2017), ‘grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients’. In order to avoid this effect, Vaswani et al. (2017) scale the dot products by a factor of $\frac{1}{\sqrt{d_{\text{keys}}}}$. The resulting equation, used for calculating the matrix of outputs of scaled dot-product attention, is Equation 2.1, where $Q$, $K$, and $V$ are matrices containing one row per query, key, and value respectively:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{\text{keys}}}}\right)V \tag{2.1}$$
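To make Equation 2.1 concrete, the snippet below computes scaled dot-product attention for a toy sequence. It is an illustrative sketch, not code from this thesis, and the tensor sizes are arbitrary.

# Minimal sketch of scaled dot-product attention (Equation 2.1); toy dimensions.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_keys = Q.size(-1)
    # Similarity between each query and each key, scaled by 1/sqrt(d_keys).
    scores = Q @ K.transpose(-2, -1) / d_keys ** 0.5
    weights = F.softmax(scores, dim=-1)  # one weight distribution per query
    return weights @ V                   # weighted sum of the values

# Toy example: a sequence of 4 tokens with 8-dimensional queries/keys/values.
Q = torch.randn(4, 8)
K = torch.randn(4, 8)
V = torch.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([4, 8])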

Multi-Head Self-Attention

In Multi-Head Attention, the queries, keys and values are linearly projected h times with different linear projections. The Scaled Dot-Product Attention function is then performed in parallel on each of these projected versions of queries, keys and values. The output values are then concatenated and projected again. The process is illustrated in Figure 2.1.

Figure 2.1: Multi-Head Self-Attention. Adapted from Figure 2 in Vaswani et al. (2017).

Masked Multi-Head Self-Attention

In Vaswani et al. (2017), the decoder uses both non-masked Multi-Head Attention and Masked Multi-Head Attention. While non-masked Multi-Head Attention mechanisms pay attention to the relationship between each word and every other word in the sequence, in Masked Multi-Head Attention mechanisms each word is considered in relation to the words preceding it only. This ensures that predictions made at a given position in the sequence can only depend on previous positions.

2.3.2 BERT

Unsupervised pre-training has been demonstrated effective for a variety of NLP tasks (Howard and Ruder, 2018; Radford et al., 2018). In particular, Radford et al. (2018) use a Transformer-like architecture to perform unsupervised pre-training. Their architecture is composed of a stack of 12 Transformer modules which only use Masked Multi-Head Attention layers, and it is trained using self-supervised learning on a large dataset. This model is then fine-tuned on various language tasks, with small changes for each task.

For BERT, Devlin et al. (2019) use a similar architecture to that of Radford et al. (2018). The BERT architecture is composed of a stack of 12 Transformer modules with 12 self-attention heads, but it only uses non-masked Multi-Head Attention layers. This allows BERT to be naturally bidirectional, since attention is calculated by taking into consideration the relationship between each word and all other words in the sequence, not only the words to its left.

Devlin et al. (2019) propose two self-supervised pre-training tasks for their model:

• Masked language modeling (MLM), in which each word in any sentence has a 15% probability of being masked. When a certain token is selected to be masked, it has an 80% probability of being substituted with the [MASK] token, a 10% probability of being replaced by a random word (since the model will not see [MASK] tokens in real examples), and a 10% probability of being left unchanged (this is done to bias the model toward the right answer). The model then tries to predict the masked words using a cross-entropy loss (a minimal sketch of this masking scheme is given after this list).

• Next sentence prediction (NSP), in which pairs of sentences are fed into the model (with half the pairs being consecutive sentences and the other half being random pairs), and the model has to predict whether the sentences are consecutive or not. This pre-training task is especially useful for tasks such as question answering and natural language inference.
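The following sketch illustrates the 15% / 80-10-10 masking scheme described in the MLM item above. It operates on a single list of token IDs; the special token ID and vocabulary size are toy values, and the real implementation works on whole batches.

# Illustrative sketch of BERT-style masked language model input corruption.
# MASK_ID and VOCAB_SIZE are toy values, not necessarily the real BERT vocabulary.
import random

MASK_ID = 103
VOCAB_SIZE = 30522

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < mask_prob:          # token selected for prediction
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append(MASK_ID)           # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                inputs.append(tok)               # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(-100)                  # position ignored by the loss
    return inputs, labels

print(mask_tokens([2023, 2003, 1037, 7953]))  # toy token IDs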

After being pre-trained, BERT models can then be fine-tuned on specific downstream tasks using task-specific labelled data.

2.4 State of the Art and Challenges

The current state of the art for hate speech or offensive language detection varies across languages, with some systems achieving state-of-the-art results with traditional machine learning approaches such as support vector machines (Wiegand et al., 2018), and others achieving it through deep learning approaches, such as BiLSTMs, RNNs, CNNs, and ELMo or BERT-based classifiers (Bosco et al., 2018; Zampieri et al., 2019b). Word and sentence embeddings tend to be the most widely used features (Basile et al., 2019; Bosco et al., 2018; Zampieri et al., 2019b).

The main challenges in the field of offensive language identification at the moment are related to the quality of datasets. Harmonization across sub-tasks is needed, along with solutions to annotation and classification biases. Classification biases, discussed by Vidgen et al. (2019), derive from the fact that computational methods can encode and reinforce social biases. This can, in turn, cause classifiers to be more effective at detecting some kinds of hate speech or offensive language (e.g. sexism) than other kinds of offensive speech. Further challenges are posed by multimedia content (such as images or videos), which is under-researched within the field (Vidgen et al., 2019), as well as by cross-domain applications (i.e. training a system on one platform and testing it on another).


3 Data

The datasets used for this thesis are all from shared tasks on offensive language detection or hate speech detection. More specifically, we used datasets from GermEval 2018 (Wiegand et al., 2018), OffensEval 2019 (Zampieri et al., 2019b), and OffensEval 2020 (Zampieri et al., 2020), which were focused on offensive language detection, and from the Evalita shared task on hate speech detection on Italian (Bosco et al., 2018).

The Italian and German datasets we use are the same datasets used by Sohn and Lee (2019) for their experiments on their multi-channel BERT setup. We run experiments on these corpora to compare the performances of the two systems.

A summary of the size of each dataset we used and the relative class balance and sources can be consulted in Table 3.1.

Language   Messages   % Off.   Source(s)         Platform(s)
Danish     2,879      13%      OffensEval 2020   Reddit, Facebook
Greek      8,743      29%      OffensEval 2020   Twitter
Arabic     7,999      20%      OffensEval 2020   Twitter
Turkish    31,277     20%      OffensEval 2020   Twitter
German     8,541      34%      GermEval 2018     Twitter
Italian    4,000      32%      Evalita 2018      Twitter
English    14,100     33%      OffensEval 2019   Twitter

Table 3.1: Number of messages, class balance (percentage of offensive messages), source and social media platform used for retrieving data for each language we experiment with.

3.1 English

Although we did not run any experiments on English, we used an English dataset annotated for offensive language as the source data for our automatic translations.

The dataset chosen for this is the Offensive Language Identification Dataset (OLID), described in Zampieri et al. (2019a). OLID was created for the OffensEval 2019 shared task (Zampieri et al., 2019b), centered on the detection of offensive language in English.

The examples in the dataset were retrieved from Twitter by searching for keywords known to be associated with offensive messages, ranging from more neutral ones such as you are to more task-related terms such as liberals or gun control. The corpus consists of 14,100 tweets annotated via crowdsourcing on three hierarchical levels, based on:

1. Whether a post contains offense or profanity or not (OFF/NOT), including non-offensive profanity,

2. If OFF, whether the insult is targeted or untargeted (TIN/UNT),

3. If TIN, what category the target belongs to: individual, group, or other (IND/GRP/OTH).

Since this work revolves around binary classification of offensive/non-offensive instances, we considered just the first level of annotation (OFF/NOT).


3.2 Danish

The Danish dataset we used, described in Sigurbergsson and Derczynski (2020) and provided by the OffensEval 2020 organizers, is constructed from user-generated comments from Reddit and Facebook. It contains 3,600 posts in total, annotated by the authors using the same three-level scheme as OLID (Zampieri et al., 2019a). In our experiments, we only used the training set, consisting of 2,879 examples (at the time the experiments were performed, the labels for the test sets were not available for all datasets provided by the OffensEval 2020 organizers). Again, we only use the first level of annotation for our experiments, i.e. the binary offensive/non-offensive distinction.

3.3 Greek

The Greek offensive language detection corpus we use is Offensive Greek Tweet Dataset (OGTD) 2.0 (Pitenis et al., 2020). This dataset was also provided by the OffensEval 2020 organizers for the shared task. It contains 10,287 tweets which are manually annotated for offensive language according to the same guidelines as OLID. This dataset is, however, only annotated on the first level ( OFF/NOT ). As with the other OffensEval 2020 datasets, we only perform our experiments using the training set (8,743 tweets).

3.4 Arabic

The Arabic dataset we used is described in Mubarak et al. (2020). The corpus is composed of 10,000 tweets manually annotated for offensive language. While the dataset is annotated on two levels, including information about whether the offensive tweet is vulgar (i.e. it contains profanity) or it is an instance of hate speech, the second level is not used for our task, so we only use the coarse-grained labels. As with the other OffensEval 2020 datasets, we only perform our experiments on the training set (7,999 tweets).

3.5 Turkish

The Turkish corpus we used is described in Çöltekin (2020). It is composed of 34,792 tweets annotated for offensive language according to the guidelines of OffensEval 2020 (Zampieri et al., 2020). The training set available to us was composed of 31,277 tweets. However, for our experiments we only used a random sample of 20% of this dataset, consisting of 6,255 tweets, to keep the size in line with the other datasets we had.

3.6 German

The German corpus we chose is the one that was used in the GermEval 2018 shared task on offensive language identification, described in Wiegand et al. (2018). The dataset contains 8,541 tweets manually annotated by the authors, with 5,009 tweets constituting the training set and 3,532 the test set.

This dataset was gathered by sampling tweets from 100 profiles of users who regularly posted offensive messages. The corpus includes both coarse-grained binary labels for offensive language and fine-grained annotation for specific instances of offensive language, i.e. abuse, insult, and profanity. For our experiments, we only use the coarse-grained labels.

3.7 Italian

For Italian, we use the Twitter portion of the Evalita 2018 HaSpeeDe corpus (Bosco et al., 2018). The task was focused on hate speech detection on Twitter and Facebook with a dataset for each. The corpus we used is a subsection of the corpus described in Sanguinetti et al. (2018), and it is, unlike the rest of the datasets we used, specific to a certain type of offensive language, namely hate speech against immigrants. The annotation of this corpus was carried out both by experts and through crowdsourcing.

The subset we used contains 4,000 tweets, with 3,000 tweets comprising the training set and 1,000 the test set.


4 Methodology and Experimental Setup

The detection of offensive language from a multilingual perspective and the use of machine translated data can have the benefit of reducing the need for language-specific data, which is especially time consuming and difficult to obtain for this task. Our aim is therefore that of evaluating whether cross-lingual information can have a positive impact on offensive language classification in a multilingual setting.

We perform offensive language detection including instances of more specific sub-tasks, such as hate speech detection. The setup is the same one that was used for the FBK-DH submission to SemEval 2020 shared task 12, OffensEval, focused on the detection of offensive language in social media in Danish, Greek, Arabic and Turkish (Zampieri et al., 2020). In this work, we illustrate experiments performed in addition to the submission to explore the potential of this system, as well as experiments performed on two additional languages, Italian and German.

Our approach exploits transfer learning, both regarding the data and the classifier we use. While BERT models could be considered transfer learning per se, given that they are helpful in applications which differ from the tasks they were pre-trained on (masked language modeling and next sentence prediction, see Chapter 2), our approach investigates the application of transfer learning to offensive language detection in a two-step setup, where information is transferred not between tasks, but between languages and models.

The two steps that characterize our approach are:

1. Automatic translation of offensive language detection datasets,

2. Fine-tuning of BERT in a multi-channel setting.

First, we increase the amount of data available for fine-tuning BERT through automatic translation, i.e. we translate an existing offensive language detection dataset from English into another language and vice versa, to assess whether automatically translated data can be helpful for the classification of offensive language and whether it is possible to perform offensive language detection using, partially or exclusively, automatically translated data. This latter point is important to understand whether the detection of offensive language could be feasible in cases where there is no annotated data available, for example to perform the task on low-resource languages.

Second, we experiment with fine-tuning multiple pre-trained BERT models in parallel. Sohn and Lee (2019) explore the possibility of fine-tuning multiple BERT models in parallel and refer to this type of model as multi-channel BERT. This model requires parallel data to be fine-tuned (i.e. the same tweets in two different languages), and is thus complementary to the first step in our setup. While the system of Sohn and Lee (2019) uses a three-channel BERT model (English, Chinese, and multilingual BERT), our setup is based on a two-channel BERT setup, using both the English and the multilingual pre-trained BERT models. After these two models are fine-tuned together on parallel data, the hidden representations for each sequence from each fine-tuned BERT model are summed together and used for classification.
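As an illustration of this two-channel setup, the sketch below fine-tunes an English and a multilingual BERT side by side and sums their pooled sequence representations before classification. It is a simplified reconstruction, not the exact classifier used in this thesis; the model names, layer sizes, and the extra dense layer are illustrative assumptions.

# Simplified sketch of a two-channel BERT classifier (inspired by Sohn and Lee, 2019).
# Not the exact architecture used in this thesis; layer choices are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class TwoChannelBert(nn.Module):
    def __init__(self, hidden_size=768, num_labels=2):
        super().__init__()
        self.bert_en = BertModel.from_pretrained("bert-base-uncased")
        self.bert_multi = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, en_ids, en_mask, multi_ids, multi_mask):
        # Each channel encodes one side of the parallel (gold / machine translated) pair.
        h_en = self.bert_en(input_ids=en_ids, attention_mask=en_mask).pooler_output
        h_multi = self.bert_multi(input_ids=multi_ids, attention_mask=multi_mask).pooler_output
        h = torch.tanh(self.dense(h_en + h_multi))  # sum the channel representations
        return self.classifier(h)                   # logits for offensive / not offensive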

Our experiments on both our baseline and main system are performed on three different data configurations including a varying amount of automatically translated data, to help us assess the impact automatically translated data can have on classification performance.

4.1 Translation Process

The first step in our setup is the translation of gold datasets. This step has a dual purpose: first, it creates artificial data which can be fed into our system to evaluate whether automatically translated data can be helpful for classification; second, it provides us with parallel data which can be fed into our multi-channel BERT architecture.

We start with a gold dataset for each of the 6 non-English languages and a gold dataset for English. We translate the 6 non-English datasets into English, and our English gold dataset (OLID) into the 6 remaining languages. At the end of the process, we have two datasets for each non-English language, one gold and one automatically translated, and 7 English datasets, of which one gold and 6 automatically translated.

All the translations are performed using the Google Translate API, the same tool used by Sohn and Lee (2019) in their work on multilingual BERT.¹

¹ The translation process for Danish, Greek, Arabic and Turkish was performed by the DH research group at FBK.

Figure 4.1 offers an overview of the translation process.

Figure 4.1: Our translation process for creating parallel data. Machine translated data is colored in grey.
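One possible way to script this step is through the official Google Cloud Translation Python client, which exposes the same Google Translate API; the sketch below is illustrative only, and the file name, column names, and language codes are placeholder assumptions rather than the actual scripts used in this work.

```python
from google.cloud import translate_v2 as translate
import pandas as pd

client = translate.Client()  # requires Google Cloud credentials to be configured

# Hypothetical gold dataset for Danish with 'tweet' and 'label' columns.
gold_da = pd.read_csv("danish_gold.tsv", sep="\t")

# Translate every tweet into English; the gold label is carried over unchanged.
translated = [
    client.translate(t, source_language="da", target_language="en")["translatedText"]
    for t in gold_da["tweet"]
]

# The resulting machine translated English half of the parallel dataset.
mt_en = pd.DataFrame({"tweet": translated, "label": gold_da["label"]})
mt_en.to_csv("danish_gold_mt_en.tsv", sep="\t", index=False)
```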

4.2 Data Configurations

After the translations are performed, we obtain two different parallel datasets for each non-English language (LANG):

• Gold data in LANG, machine translated data in EN,

• Machine translated data in LANG, gold data in EN.

Since our goal is to assess whether automatically translated data can be useful for the classification of offensive language, we run our experiments on different data configurations, containing a varying proportion of automatically translated data in the language of interest. In the first configuration, which serves as a benchmark, we only have gold data for LANG. In the second configuration, we train our models only on automatically translated data for LANG. In the third configuration, we train our models on both of the previous datasets merged together, so we have both gold and automatically translated data for LANG.

For each language, we therefore have three parallel data configurations, with a different ratio of gold to automatically translated data in the language of interest. The data configurations are illustrated in Figure 4.2.

Figure 4.2: The three parallel data configurations we obtain after the translation process.
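A rough sketch of how these three configurations could be assembled is shown below; the file names are hypothetical placeholders, and only the structure (gold, machine translated, and merged data) reflects the setup described above.

```python
import pandas as pd

# Hypothetical parallel files for one language (LANG):
# gold LANG tweets paired with machine translated English, and
# machine translated LANG tweets (translated OLID) paired with gold English.
gold_lang = pd.read_csv("lang_gold_parallel.tsv", sep="\t")
mt_lang = pd.read_csv("olid_mt_lang_parallel.tsv", sep="\t")

# The three training configurations with different ratios of gold to MT data in LANG.
configurations = {
    "gold_only": gold_lang,
    "mt_only": mt_lang,
    "gold_plus_mt": pd.concat([gold_lang, mt_lang], ignore_index=True),
}
```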

4.3 Preliminary Operations

Before training our models, we perform some preliminary operations on the data.

First, we make the datasets as homogeneous as possible. OLID and all the OffensEval datasets (the gold data for Danish, Greek, Turkish and Arabic) are annotated with the same guidelines and labels, as well as pre-processed in the same way: user mentions and URLs are removed and substituted by '@USER' and 'URL'. The datasets for Italian and German have a different origin. Using the Ekphrasis tool², we normalize the data in Italian and German, removing user mentions (which are replaced by '@user') and URLs, which are replaced by 'URL'. For Italian, hashtags are split using the same tool³.

We then normalize the labels across datasets. Since the model requires the labels 0 and 1, we map the different labels used in the datasets to binary values. For OLID and the OffensEval datasets, 'NOT' (non-offensive) is mapped to 0, and 'OFF' (offensive) to 1. For the GermEval 2018 dataset, 'OTHER' is mapped to 0 and 'OFFENSE' to 1. The HaSpeeDe corpus for Italian, labeled for hate speech, already uses 0 and 1 as labels, where 1 indicates the presence of hate speech.
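A minimal sketch of this label normalization (the column name and DataFrame layout are assumptions of the illustration):

```python
import pandas as pd

# Map the dataset-specific labels to the binary values expected by the model.
LABEL_MAP = {
    "NOT": 0, "OFF": 1,        # OLID and the OffensEval datasets
    "OTHER": 0, "OFFENSE": 1,  # GermEval 2018
    0: 0, 1: 1,                # HaSpeeDe is already binary
}

df = pd.DataFrame({"tweet": ["example tweet"], "label": ["OFF"]})
df["label"] = df["label"].map(LABEL_MAP)  # 'OFF' -> 1
```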

After normalizing the tweets and the labels, we set aside 20% of the gold data for Danish, Greek, Turkish, and Arabic to be used as a test set. For Italian and German, we use the test sets provided by the task organizers of Evalita 2018 and GermEval 2018.

² https://github.com/cbaziotis/ekphrasis
³ The Ekphrasis tool was adapted to Italian and German by the DH research team at FBK.
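For the four languages without an official test set, the 20% hold-out described above could be obtained roughly as follows; the file name and the random seed are placeholders, not values reported in this work.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical gold dataset for one of the languages without an official test set.
gold_lang = pd.read_csv("lang_gold.tsv", sep="\t")

# Set aside 20% of the gold data as the test set.
train_df, test_df = train_test_split(gold_lang, test_size=0.2, random_state=42)
```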


Then, we prepare the data for the BERT model. We tokenize all English data with the English base BERT tokenizer⁴ and all non-English data with the multilingual BERT tokenizer⁵; both tokenizers split strings into sub-word tokens. We then add the two special tokens [CLS] and [SEP]. The first one, [CLS], is a special classification token. Its final hidden state is used in BERT models as the aggregate sequence representation for classification tasks (Devlin et al., 2019). In other words, the final hidden state of this token can capture the hidden representation of the whole sequence. The second token, [SEP], is a separation token. Since BERT can accept sentence pairs in a single sequence (to allow tasks such as question answering), this token is used to separate input sentences. As we only use one tweet at a time as input, in our case this token is added at the end of the sequence.

After adding these two special tokens, the entire sequence is converted into BERT IDs, according to the vocabulary present in the pre-trained multilingual BERT tokenizer. Finally, the sequences are padded so that they are all the same length (in our case, the sequence length is 128).

An example of the conversion process from normalized text into BERT inputs is illustrated in Table 4.1.

Normalization: @USER @USER I loathe @USER
Tokenization: ['@', 'US', '##ER', '@', 'US', '##ER', 'I', 'lo', '##ath', '##e', '@', 'US', '##ER']
Addition of BERT tokens: ['[CLS]', '@', 'US', '##ER', '@', 'US', '##ER', 'I', 'lo', '##ath', '##e', '@', 'US', '##ER', '[SEP]']
Conversion into IDs: [101, 137, 1646, 9637, 137, 1646, 9637, 146, 25338, 9779, 1162, 137, 1646, 9637, 102]
Padding: [101, 137, 1646, 9637, 137, 1646, 9637, 146, 25338, 9779, 1162, 137, 1646, 9637, 102, 0, 0, ..., 0]

Table 4.1: Example of BERT input preparation.
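These steps can be reproduced, for instance, with the Hugging Face transformers tokenizers; the sketch below mirrors the rows of Table 4.1 using the multilingual cased tokenizer, while the library choice is an assumption of this illustration.

```python
from transformers import BertTokenizer

MAX_LEN = 128  # fixed sequence length used in this work
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "@USER @USER I loathe @USER"

# Tokenization into sub-word units.
tokens = tokenizer.tokenize(text)

# Addition of the special classification and separator tokens.
tokens = ["[CLS]"] + tokens + ["[SEP]"]

# Conversion into BERT vocabulary IDs.
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Padding with zeros up to the fixed sequence length.
input_ids = input_ids + [0] * (MAX_LEN - len(input_ids))
```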

4.4 Baseline Model

We fine-tune the cased multilingual BERT model as our baseline. This model was pre-trained on the 104 languages with the largest Wikipedias and it has 110M trainable parameters⁶. For each language, we use the exact same model and hyperparameters.

First, the model weights are initialized using the weights of the pre-trained multilingual BERT model. After feeding the data into the model, we extract the hidden representation of the first token of each tweet, [CLS], which captures the representation of the entire sequence. This representation is then fed into a dropout layer, after which we add two dense layers, before finally passing the output of the second dense layer through a softmax layer to enable classification. We use Adam as the optimizer and L2 regularization, with a regularization parameter of 0.01. Our dropout rate is 0.1 and the batch size is 32 for all experiments. The loss function we use is sparse categorical cross-entropy. Each system is trained for 2 epochs and, during training, a random set of 20% of the training data is set aside as development data. A summary of the steps is shown in Figure 4.3.

Given that our baseline model is single-channel, we only need data in one language, so we do not use parallel data for these experiments. The experiments on our three data configurations therefore only involve the LANG half of our three parallel datasets, which corresponds to the first column in Figure 4.2.

⁴ https://github.com/google-research/bert
⁵ https://github.com/google-research/bert/blob/master/multilingual.md
⁶ https://github.com/google-research/bert/blob/master/multilingual.md


Figure 4.3: Our baseline model setup: fine-tuning of multilingual BERT on three different data configurations.
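A minimal sketch of this baseline using TensorFlow/Keras and the Hugging Face transformers library is given below. The dropout rate, L2 regularization, loss, batch size, number of epochs and development split follow the values stated above, whereas the library choice, the size of the first dense layer and the learning rate are assumptions of this illustration.

```python
import tensorflow as tf
from transformers import TFBertModel

def build_baseline(seq_len=128):
    bert = TFBertModel.from_pretrained("bert-base-multilingual-cased")

    input_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="input_ids")
    hidden_states = bert(input_ids)[0]   # (batch, seq_len, hidden_size)
    cls_repr = hidden_states[:, 0, :]    # hidden representation of the [CLS] token

    x = tf.keras.layers.Dropout(0.1)(cls_repr)
    x = tf.keras.layers.Dense(768, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = tf.keras.layers.Dense(2,
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    output = tf.keras.layers.Softmax()(x)

    model = tf.keras.Model(inputs=input_ids, outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed LR
                  loss="sparse_categorical_crossentropy")
    return model

# Training setup: 2 epochs, batch size 32, 20% of the training data as development set.
# model = build_baseline()
# model.fit(train_ids, train_labels, epochs=2, batch_size=32, validation_split=0.2)
```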

4.5 Multi-Channel BERT

Our main system consists of a multi-channel BERT setup, inspired by the three-channel model used in Sohn and Lee (2019). In our two-channel version of their model, pre-trained multilingual BERT and English base BERT are fine-tuned in parallel. This type of setup requires a double input, consisting of parallel data, which we create using machine translation (see Section 4.1).

In our setup, the parallel LANG-EN data is fed into the system. While the multilingual pre-trained BERT is fine-tuned on the LANG half of the data, as in the baseline, the English pre-trained BERT is fine-tuned on the English half of the data. We feed the data into each BERT model and, as we do in our baseline, we extract the hidden representation of the [CLS] token, and then feed it into a dropout layer. After the dropout layer, we add a dense layer. After each BERT model is fine-tuned on the corresponding data, we sum the hidden states for each sequence together. In other words, the system holds a hidden representation for each tweet in both LANG and EN. These two representations are then added together, before finally passing through a dense layer and a softmax layer to enable classification. The hyperparameters we use are the same as the baseline, and are equal across all six languages. We train each system for 2 epochs. The model architecture is illustrated in Figure 4.4.
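Under the same assumptions as the baseline sketch (library choice, dense layer size, and learning rate), the two-channel architecture could be sketched as follows: each channel extracts the [CLS] representation and passes it through a dropout and a dense layer, the two resulting representations are summed, and a final dense plus softmax layer produces the prediction.

```python
import tensorflow as tf
from transformers import TFBertModel

def build_multichannel(seq_len=128):
    bert_multi = TFBertModel.from_pretrained("bert-base-multilingual-cased")  # LANG channel
    bert_en = TFBertModel.from_pretrained("bert-base-cased")                  # English channel

    ids_lang = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="ids_lang")
    ids_en = tf.keras.Input(shape=(seq_len,), dtype=tf.int32, name="ids_en")

    def channel(bert, ids):
        cls_repr = bert(ids)[0][:, 0, :]   # [CLS] representation of the sequence
        x = tf.keras.layers.Dropout(0.1)(cls_repr)
        return tf.keras.layers.Dense(768, activation="relu",
                                     kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)

    # Sum the hidden representations produced by the two channels for each tweet.
    summed = tf.keras.layers.Add()([channel(bert_multi, ids_lang),
                                    channel(bert_en, ids_en)])

    x = tf.keras.layers.Dense(2,
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))(summed)
    output = tf.keras.layers.Softmax()(x)

    model = tf.keras.Model(inputs=[ids_lang, ids_en], outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumed LR
                  loss="sparse_categorical_crossentropy")
    return model
```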

We run our experiments on three different data configurations, corresponding to those used for our baselines. However, since this model also features English BERT, we use the full parallel datasets. During training, a random set of 20% of the training data is set aside as development data.


Figure 4.4: Our multi-channel BERT model architecture.


5 Results and Discussion

5.1 Evaluation Metrics

When dealing with offensive language detection on social media, the number of offensive messages is generally lower than that of non-offensive messages (Davidson et al., 2017). As we have seen in Chapter 3, this imbalance is reflected in offensive language identification datasets, which typically contain more non-offensive messages than offensive ones. The imbalance causes metrics such as accuracy to be unreliable when assessing classifier performance. In fact, for some of our datasets, a system could achieve 80% accuracy just by assigning each instance the most frequent label (non-offensive).

To properly capture the performance of our systems, we rely on macro-averaged F1 scores. This measure takes the average of the F1 scores obtained for each class, taking into consideration precision and recall for both the non-offensive and offensive class.

Macro F1 is calculated as follows:

\[
F_1 = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}
\qquad
\mathit{Macro}\ F_1 = \frac{F_{1(\text{non-off})} + F_{1(\text{off})}}{2}
\]
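For reference, the macro-averaged F1 score can be computed with scikit-learn; the toy labels below are purely illustrative, with 1 marking the offensive class.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 0, 1]  # gold labels
y_pred = [0, 1, 1, 0, 0, 1]  # system predictions

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 3))  # 0.667
```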

5.2 Results

We test our setups on 20% of the gold data or, when available, on the separate test data for a particular language. The systems are therefore always tested on gold data in the language of interest, even if they were trained on automatically translated data. For the multi-channel BERT model, the data still needs to be parallel at test time, so the system is fed gold data in LANG and machine translated data in English for testing.

We run each experiment 3 times to compensate for random initialization. The average macro F1 scores we obtain with 3 runs on each model and configuration are reported in Table 5.1.

Multi-channel BERT performs better than our baseline for four languages: Greek, Arabic, Turkish, and German. Additionally, the partial or exclusive use of machine translated data improves the performance of the classifiers in all cases. Notably, for both Danish and Arabic the best-performing system is trained on machine translated data only (on the Arabic side only, in the case of Arabic). The use of machine translated data could therefore be useful for the classification of offensive language in low-resource languages, provided that machine translation is a feasible choice. However, it seems that cross-lingual information can only help improve models to a certain extent. Our Greek and Italian models already performed well on gold data only, and they did not see much of an improvement when trained on machine translated data. This could mean that, while machine translated data can help classification in certain cases, it can hurt some systems by adding noise if they already perform well on gold data only.

Although the general tendency is for multi-channel BERT to perform best, there are some differences across languages in terms of performance. Additionally, some systems seem to be sensitive to randomness in the training procedure, with sudden jumps in
