
LICENTIATE THESIS

ISSN 1402-1757
ISBN 978-91-7790-689-6 (print)
ISBN 978-91-7790-690-2 (pdf)

Luleå University of Technology 2020


Department of Computer Science, Electrical and Space Engineering
Division of Embedded Intelligent Systems Laboratory

Faster and More Resource-Efficient Intent Classification

Pedro Alonso

Machine Learning



Faster and More Resource-Efficient Intent Classification

Pedro Alonso

Department of Computer Science, Electrical and Space Engineering

Luleå University of Technology
Luleå, Sweden

Supervisors: Professor Marcus Liwicki, Dr. Foteini Liwicki, and Dr. Denis Kleyko


To my family...(friends are also family)


Abstract

Intent classification is known to be a complex problem in Natural Language Processing (NLP) research. It represents one of the stepping stones toward machines that can understand our language. Several different models have recently appeared to tackle the problem, and a solution has come within reach thanks to deep learning models; however, they have not achieved the goal yet. Moreover, the energy and computational demands of these modern models (especially the deep learning ones) are very high. The utilization of energy and computational resources should be kept at a minimum to deploy such models efficiently on resource-constrained devices; furthermore, these resource savings will help to minimize the environmental impact of NLP. Therefore, this thesis presents results for making machine learning models for intent classification more resource-efficient.

This thesis considers two main questions. First, which deep learning model is optimal for intent classification? Second, starting from an optimal deep learning model, the concern shifts to resource-efficiency: how can we minimize the resources used in intent classification?

Concerning the first question, the work here shows that intent classification in written language is still a complex problem for modern models, even though deep learning has shown successful results in every area of machine learning to which it has been applied. Concerning the second question, the work shows that we can achieve results similar to those of the deep learning models by combining conventional solutions: classical machine learning models, pre-processing techniques, and a hyperdimensional computing approach. The work identifies both the optimal model for short texts and the most resource-efficient model; in this case, they were a deep learning model and the hyperdimensional-computing-inspired model, respectively.

First, a baseline is established using tweets containing hate-speech and one of the best deep learning models currently available. Next, the steps taken to arrive at the final model based on hyperdimensional computing, which minimizes the required resources, are presented. This model can help make intent classification faster and more resource-efficient by trading a few performance points for such savings. Therefore, a hyperdimensional computing model is proposed.

This model, inspired by hyperdimensional computing and called "hyperembed," shows the capabilities of this paradigm. When considering resource-efficiency, the proposed models were tested on intent classification on short texts, on tweets (for hate-speech, where the intent is to offend or not), and on questions posed to chatbots.


In summary, the work presented here suggests two conclusions. First, deep learning models have a performance advantage when there is sufficient data, but they tend to fail when the amount of available data is insufficient; in contrast, the proposed models based on classical machine learning work well even on small datasets. Second, deep learning models require substantial resources to train and run; the models proposed here aim to trade off the computational resources against the classification performance of the model.


Contents

Part I

Chapter 1 – Thesis Introduction
1.1 Aim and Research Questions
1.2 Thesis Outline
1.3 Summary of Papers

Chapter 2 – Literature Review
2.1 Overview & Criteria
2.2 Language Representations
2.3 Modeling Language in Deep Learning
2.4 Use of n-grams for Intent Classification
2.5 Hyperdimensional Computing

Chapter 3 – Applied ML to Text Classification
3.1 Datasets
3.2 Data Pre-processing
3.3 Performance Metrics
3.4 Models
3.5 Applied Models

Chapter 4 – Results
4.1 Paper A: TheNorth at SemEval-2020 Task 12: Hate Speech Detection using RoBERTa
4.2 Paper B: Subword Semantic Hashing for Intent Classification on Small Datasets
4.3 Paper C: HyperEmbed: Tradeoffs Between Resources and Performance
4.4 Unpublished Results

Chapter 5 – Conclusions and Future Work
5.1 Findings in Relation to Research Questions
5.2 Conclusions
5.3 Future Work

References

Part II

Paper A
1 Introduction
2 Experimental Data
3 System Overview
4 Experimental Setup
5 Results
6 Quantitative Analysis
7 Conclusion

Paper B
1 Introduction
2 Related Work
3 Datasets
4 Methodology
5 Results and Analysis
6 Discussion and Future Work

Paper C
1 Introduction
2 Related Work
3 Evaluation Outline
4 Methods
5 Empirical Evaluation
6 Discussion and Conclusions


Acknowledgments

The work presented here has been realized at Luleå University of Technology, at the Department of Computer Science, Electrical and Space Engineering, in the Machine Learning Group of EISLAB.

First, I would like to acknowledge the persons who supported me throughout this endeavor, starting with my supervisors, Professor Marcus Liwicki, Dr. Foteini Liwicki, and Dr. Denis Kleyko. They gave me the opportunity to create my own path and guided me each step of the way. To Marcus and Foteini: thanks for your ideas and support; I would be lost without them. I would also like to express my gratitude to Denis: without his constant supervision, this work would be only a quarter of the way done.

I would also like to thank my friends in EISLAB. Without their presence and support, this work would not even have been started; their friendship is what keeps me always cheerful. In this regard, special thanks go to America: you know this path would have ended a long time ago without your help. Thanks also to Sergio, who made all things in Sweden feel easier and less daunting.

I would like to thank my colleagues; however, I do not really have any, because I consider them all my friends. Thus, my gratitude to them is already expressed above.

Finally, I would like to thank my parents and brothers; without their constant support I would not have started. To my wife Lu: thank you, love, you make every moment more enjoyable and fun.

Luleå, November 2020. Pedro Alonso


Part I


Chapter 1

Thesis Introduction

"In life, unlike chess, the game continues after checkmate." (Isaac Asimov)

Computers have become ubiquitous in our lifetimes: from the smallest nanobots [1] used to combat cancer, to the mobile devices that make us think more globally, to the server farms that hold our information, run governmental tasks, and power social media.

This omnipresence is what we need to harness to make a "dream" possible: offloading all the tasks that are menial, trivial, or tedious from a human perspective onto computers. At the same time, we would shift our focus to activities more rewarding for our brains, like music, reading, or exploring the world around us.

For this "dream" to happen, we first need to make computers understand our language, our words, and our sense of meaning, as straightforwardly as two persons having a normal conversation in their native language [2].

With this inspiration in mind, this work sought to better understand the representation of our language for computers, including the intricacies that machines will need to grasp. However, since Natural Language Understanding (NLU) is such a vast field, we have focused on a specific sub-task.

In particular, the main focus of this thesis is classifying intent in written text using the newest form of machine learning (ML) in conjunction with some of the older ones. The newest form of ML, called Deep Learning (DL) [3], has shown encouraging results in every area [4] to which it has been applied. This covers various problems such as classification, regression, reinforcement learning, and recommendation systems, to name a few.

NLU has been an unsolved problem thus far. We can see why in regular conversations, where individuals have to keep track of several things: slang, the emotional state of mind of the speaker, the clarity of the sounds made, and the level of sarcasm of the person. Online, the physical clues from which we infer meaning diminish, and computers do not possess the same advantages as us. Due to this, NLU had stagnated in a sense (although progress was there) until DL came into the picture; an example of this can be seen in the advancements of word embeddings [5].


Therefore, there has been constant striving to make computers understand our language better, and model upon model has been developed. For a while, la crème de la crème was the Bidirectional Encoder Representations from Transformers (BERT) [6], which has spawned many variations; while there are already better models such as GPT-3 [7], BERT [6] was in its time the best one in NLU. Not only models for processing have appeared, but also different ways of representing the input text, some of which are: one-hot encoding, n-grams, and word embeddings.

In this work, word embeddings are considered first, due to their novel impact on representing words as vectors and the advantages this brings. These word vectors are, as their name indicates, written words mapped to numerical vectors.

Another approach to text representation, also addressed in this work, is known as n-grams. n-grams can be obtained from words or characters in the text; the "n" stands for the number of items (characters or words) used.

Finally, since nothing comes without a cost, in this case the costs are the resources utilized by the models created for the type of intent to classify (hate, as covered here). These costs were deemed too troublesome, requiring a re-assessment of the problem: the need is to find the best way to curtail the excessive use of resources. Another algorithm is therefore proposed for a more resource-efficient classification and, as a byproduct, to combat the damage this research may be doing unintentionally to the environment [8]. This thesis features papers that connect all of these ideas. We will go from DL, the latest form of ML, for intent classification (using hate-speech) to the nitty-gritty details of the proposed resource-efficient models for classification.

1.1

Aim and Research Questions

The issue that this thesis aims to tackle is the classification of intent in short texts. For this, the research questions are laid out below. As context on the use of hate-speech, this thesis adheres to the idea that hate-speech is a type of intent, where the final goal is to harm the intended target.

This form of intent can be seen with the increased online presence of human communication, which has also given rise to online harassment or "trolling": statements intended to cause emotional harm, on the premise that just because the targets do not conform to the worldview of the attacker, they deserve the harassment.

Unfortunately, the increase of human presence online carries with it more hateful comments, some of which may slip past human moderators, who therefore require a system that can automate this task. Consequently, in this work, the following questions are researched:

RQ1: What is the optimal way (speed and accuracy) for inferring intent on small text datasets?

RQ2: Is hyperdimensional computing suitable for saving resources for classifying small text datasets?


RQ3: Do the findings on small text datasets also apply to hate-speech detection?

RQ4: How efficient is DL compared to hyperdimensional computing for the detection of hate-speech in tweets?

1.2

Thesis Outline

This thesis comprises the current advances in research in NLU, specifically in the area of intent classification. It is divided into two parts. Part I (kappa) introduces the key concepts and methods, recapitulates the main results, states the reasons and motivations for this work, and summarizes the papers used in compiling it.

Chapter 2 lays out the foundations needed to answer the question: what is text classification? This chapter also introduces the concepts needed to follow along with the current work. Chapter 3 presents the different models used in this work and the relevance that the methods carry. Then, Chapter 4 presents and extends the discussion of the results in the papers, which served as the basis for this compilation. Finally, Chapter 5 presents the conclusions of this work and mentions what is envisioned for future research.

Part II corresponds to the included papers in full. In particular, this part of the thesis presents three papers, published or submitted to conferences/journals, that support the current writing: two conference papers and one journal manuscript. All papers have been reformatted to match the layout of this thesis.

1.3

Summary of papers

The papers that made this work possible, two conference papers and one manuscript submitted to a journal, are briefly summarized below, together with their authors and the author contributions. The papers are presented in full in Part II. Papers not included in this thesis are also mentioned at the end; these papers did not contribute to the research questions laid out previously and were therefore not utilized in this work.

1.3.1

Paper A

Title: TheNorth at SemEval-2020 Task 12: Hate Speech Detection using RoBERTa [9].

Authors: Pedro Alonso, Rajkumar Saini, György Kovács.

Published in: International Workshop on Semantic Evaluation (SemEval) 2020.

Summary: Hate-speech detection on social media platforms is crucial, as it helps to avoid severe harm to marginalized people and groups. The application of NLP and DL has garnered encouraging results in the task of hate-speech detection. The expression of hate, however, is varied and ever-evolving, so better detection systems need to adapt to this variance. Because of this, researchers keep collecting data and regularly organize hate-speech detection competitions. In this paper, we discuss our entry to one such competition, namely the English version of sub-task A of the OffensEval competition. Our contribution can be perceived through our results: first an F1-score of 0.9087, which with the further refinements described here climbs to 0.9166. It lends more support to our hypothesis that one of BERT's variants, namely RoBERTa, can successfully differentiate between offensive and non-offensive tweets, given the proper pre-processing steps.

Author contribution: P.A. was in charge of implementing the various models on data provided by R.S.; once the best model had been decided upon, P.A. was also in charge of obtaining the results for the competition. G.K. was responsible for putting our findings together in a coherent text.

1.3.2

Paper B

Title: Subword Semantic Hashing for Intent Classification on Small Datasets [10].

Authors: Kumar Shridhar, Ayushman Dash, Amit Sahu, Gustav Grund Pihlgren, Pedro Alonso, Vinaychandran Pondenkandath, György Kovács, Foteini Simistira, Marcus Liwicki.

Published in: International Joint Conference on Neural Networks (IJCNN) 2019.

Summary: In this paper, we introduce the use of Semantic Hashing as an embedding for the task of Intent Classification and achieve state-of-the-art performance on three frequently used benchmarks. Intent Classification on a small dataset is a challenging task for data-hungry state-of-the-art DL-based systems. Semantic Hashing is an attempt to overcome such a challenge and learn robust text classification. Current word-embedding-based methods are dependent on vocabularies. One of the major drawbacks of such methods is out-of-vocabulary terms, especially when having small training datasets and using a wider vocabulary. This drawback is a problem for the case of Intent Classification for chatbots, where typically small datasets are extracted from internet communication. Two problems arise with the use of internet communication: first, such datasets miss many of the terms in the vocabulary needed to use word embeddings efficiently; second, users frequently make spelling errors. Typically, the models for intent classification are not trained with spelling errors, and it is difficult to anticipate the ways in which users will make mistakes. Models depending on a word vocabulary will always face such issues; an ideal classifier should handle spelling errors inherently. With Semantic Hashing, we overcome these challenges and achieve state-of-the-art results on three datasets: Chatbot, Ask Ubuntu, and Web Applications. Our benchmarks are available online.

Author contribution: P.A., along with A.D. and V.P., worked on data augmentation for the algorithm, which ultimately had no impact on the final score. K.S., A.S., and G.G.P. were in charge of the main algorithm; G.K. was in charge of writing, while M.L. and F.S. coordinated the group efforts and contributed to writing the paper.


1.3.3

Paper C

Title: HyperEmbed: Tradeoffs Between Resources and Performance in NLP Tasks with Hyperdimensional Computing enabled Embedding of n-gram Statistics [11].

Authors: Pedro Alonso, Kumar Shridhar, Denis Kleyko, Evgeny Osipov, Marcus Liwicki.

Submitted to: Transactions of the Association for Computational Linguistics.

Summary: Recent advances in DL have led to a significant performance increase on several NLP tasks; however, the models become more and more computationally demanding. Therefore, this paper tackles the domain of computationally efficient algorithms for NLP tasks. In particular, it investigates distributed representations of n-gram statistics of texts. The representations are formed using hyperdimensional-computing-enabled embedding. These representations then serve as features, which are used as input to standard classifiers. We investigate the applicability of the embedding on one large and three small standard datasets for classification tasks using nine classifiers. The embedding achieved on-par F1 scores while decreasing the time and memory requirements several-fold compared to conventional n-gram statistics; for example, for one of the classifiers on a small dataset, the memory reduction was 6.18 times, while train and test speed-ups were 4.62 and 3.84 times, respectively. For many classifiers on the large dataset, the memory reduction was about 100 times, and train and test speed-ups were over 100 times. More importantly, the usage of distributed representations formed via hyperdimensional computing allows dissecting the strict dependency between the dimensionality of the representation and the parameters of n-gram statistics, thus opening room for tradeoffs.

Author contribution: P.A. was mainly in charge of obtaining the results and describing the datasets and methods; K.S. did the Byte Pair Encoding experiments and wrote the related parts; D.K. was in charge of the compilation of results and writing the first draft; E.O. and M.L. were project coordinators and contributed to writing the paper.

1.3.4

List of Publications Not Included in the Thesis

1. Pedro Alonso, Rajkumar Saini, György Kovács, "TheNorth at HASOC 2019: Hate Speech Detection in Social Media Data", in Proceedings of the 11th Annual Meeting of the Forum for Information Retrieval Evaluation, December 2019. [12]

2. Pedro Alonso, György Kovács, Rajkumar Saini, "Hate Speech Detection using Transformer Ensembles on the HASOC dataset", in 22nd International Conference, SPECOM, Russia, October 7–9, 2020, Proceedings. Springer, September 2020. [13]

3. Purvanshi Mehta, Kumar Shridhar, Pedro Alonso, Marcus Liwicki, György Kovács, Vanda Balogh, "Author Profiling Using Semantic and Syntactic Features: Notebook for PAN at CLEF 2019", in Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling in Twitter, December 2019. [14]


4. György Kovács, Rajkumar Saini, Pedro Alonso, "Challenges of Hate Speech


Chapter 2

Literature Review

"Sometimes it is the people no one imagines anything of who do the things that no one can imagine." (Alan Turing)

This chapter presents the literature that served as the building blocks for the three papers on which this thesis is built. The ideas discussed in these papers, such as word embeddings, n-grams, DL, and hyperdimensional computing, are the ones that allowed the research of the current thesis to be addressed.

2.1

Overview & Criteria

The first thing to note here is that, as mentioned previously, this work equates intent classification with hate-speech detection; this is due to the view that conversations have an intent to accomplish some goal, and in the case of hate-speech, the goal is to cause emotional distress to the reader. With this in mind, this work starts by reviewing hate-speech.

With the advent of the Internet, online communities have become more important, due to the revenue they can generate and the people they connect. The fact that the Internet is ubiquitous and, more importantly, anonymous has given rise to faster communication of ideas among the world's inhabitants (also called netizens). However, not everything has been good: faster connectivity among communities has also brought about the wide dispersion of hate within those same communities.

From the earliest examples, like MySpace, to the latest, like Facebook or TikTok, online communities have all dealt with the same problem: how to keep the interaction between their users as decent as possible? This problem is especially difficult for Twitter [16], given the extensive adoption it has had over the years, which has not come without a cost. Almost any world leader has a Twitter account and is thereby open to receiving messages, good or bad, from any individual. This has led Twitter to try to manually spot hateful or offensive tweets that do not adhere to its rules. Herein lies the problem: manually marking tweets as offensive is a tedious task, and one can see why some not-so-suitable tweets will make it through that front line after a while.

The European Union (EU) has deemed the reduction of hate-speech important enough that initiatives are funded to combat it [17]. Not only funds but also laws have been put in place to tackle the problem, with the European Commission pressuring Facebook, YouTube, Twitter, and Microsoft into signing an EU hate-speech code [18], under which they will try to review online hate-speech accusations within twenty-four hours. This attempt to tackle hate-speech has had mixed results, with some companies performing better (e.g., Facebook with 43%) than others (Twitter with 39%) [19]. This is where research into detecting hate-speech shows its importance.

A computer cannot understand words written by people; it only understands numbers (in reality, binary digits in modern computers). Thus, researchers need a way to represent words with numbers understandable by a computer or algorithm. There are many ways to represent words; in this work, only the two in use are considered: n-grams and word embeddings.

Given the research questions posed in the previous chapter, it was essential to select papers that provide simple yet clear answers to them. For this reason, the papers stated in the previous chapter (Papers A, B, and C) were selected, as they tackle the desired answers head-on. The selected papers deal with the following topics: word embeddings, n-grams, DL, short-text classification, and hyperdimensional computing. These topics are covered in detail in the following sections.

2.2

Language Representations

This section presents the two methods used in this work to represent words so that they can be used in ML models.

2.2.1

n-grams

n-grams appear in the fields of computational linguistics and probability theory, where they are described as a contiguous sequence of n items (hence the name) sampled from source data. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application; here they are characters or words.

They can range from 1-grams (also called unigrams), representing a single sampled element, through 2-grams (bigrams/digrams) with two elements and 3-grams (trigrams), and so on [20].
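To make the notion concrete, here is a minimal Python sketch (our own illustration, not code from the papers) that extracts word and character n-grams:

```python
def ngrams(items, n):
    """Return all sequences of n contiguous items sampled from `items`."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "I have a flying disk".split()
print(ngrams(words, 1))         # word unigrams: [('I',), ('have',), ...]
print(ngrams(words, 2))         # word bigrams: [('I', 'have'), ('have', 'a'), ...]
print(ngrams(list("have"), 3))  # character trigrams: [('h', 'a', 'v'), ('a', 'v', 'e')]
```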

These n-grams serve as the most obvious feature set with which to start classifying text [21]. When using them for this task, most authors use a combination of unigrams and larger n-grams together. This has proven successful up to a certain point [21–23], especially when dealing with low-training-data settings. Such a data setting is investigated in Paper B, which expands on the results obtained by [24], where state-of-the-art results were achieved using a DL model together with the hashing method described in [25]. The main difference with Paper B lies in the amount of data we had at our disposal: while they could apply a DL model, our data setting did not allow for such luxuries. Instead, traditional ML models were applied in a low-data setting to try to overcome the difficulties DL faces with scarce data.

2.2.2

Word embeddings

The technique of representing words as vectors has roots in the 1960s [26] with the development of the vector space model for information retrieval. Reducing the number of dimensions using singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s [26].

Word embeddings are, in short, vectors of numbers [27] that represent the meaning of words. They derive from distributional semantic models (DSM) [27] and are one of the most commonly used techniques in NLP; for example, they are currently used in DL models with good performance [28]. Their appearance was motivated by the fact that traditional forms of representing words, such as one-hot encoding or bag-of-words, do not allow syntactic (structure) and semantic (meaning) relations across words to be included in the representation. Therefore, traditional representations might lack the information an algorithm needs to correctly classify the text. Word embeddings came to be as a less naive way of representing words.

These word embeddings allow mathematical functions to be applied to them, with interesting results such as king − man + woman ≈ queen. More concretely, one can subtract the encoded meaning of the word man (maleness) from the vector for the word king and add the vector for woman (representing femaleness), ending up with a vector that is closest to the vector representing queen [29] (as shown in fig. 2.1).
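As a toy illustration of this arithmetic (the 3-dimensional vectors below are invented for the example; real embeddings have hundreds of dimensions), one can search for the vector nearest to king − man + woman under cosine similarity:

```python
import numpy as np

# invented 3-d toy embeddings, for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
# the nearest remaining word should be "queen"
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # queen
```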


These properties are attributed to the geometric metaphor of space, which takes the word-space model as a spatial representation of meaning. The main idea is that the similarity of words, as perceived by humans, can be represented in an n-dimensional space, with n in principle ranging from 1 upward and in practice commonly ranging from 100 to 300 dimensions.

The numbers distributed across the dimensions of the vector represent the word's meaning. In simple terms, each dimension captures part of the vector's association with the meanings of the other words in the corpus, but it is important to point out that the individual dimensions do not have interpretable meaning in themselves, so the representation is distributed. In that sense, the semantics are embedded in the vector across its dimensions. Such representations allow one to claim that vectors that lie close together in the high-dimensional space correspond to words that are similar [30].

Furthermore, several algorithms are commonly used to construct these word embeddings; two of the most used are Word2Vec [29] and GloVe [31]. While a full explanation of these two models is outside the scope of this thesis, a short description is due. Word2Vec implicitly factorizes a pointwise mutual information matrix of word-context co-occurrences using a shallow (three-layer, input-hidden-output) neural network, while GloVe does this explicitly by minimizing the least-squares cost function:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $J$ is the standard notation for a cost function, $V$ is the size of the vocabulary, $X_{ij}$ is the "strength" of the connection between words $i$ and $j$ in the co-occurrence matrix, $f$ is a weighting function, $w_i$ and $\tilde{w}_j$ are the word and context-word embeddings, and $b_i$ and $\tilde{b}_j$ are the biases associated with these words.
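A direct numpy transcription of this cost can make the pieces concrete (a sketch under the usual GloVe conventions; the weighting function with x_max = 100 and α = 0.75 follows the original paper [31]):

```python
import numpy as np

def glove_cost(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """J for co-occurrence matrix X, word vectors W, context vectors W_tilde."""
    # weighting f(X_ij) caps the influence of very frequent co-occurrences
    f = np.minimum((X / x_max) ** alpha, 1.0)
    # log X_ij is only defined (and weighted) where X_ij > 0
    logX = np.log(np.where(X > 0, X, 1.0))
    inner = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - logX
    return np.sum((X > 0) * f * inner ** 2)

V, d = 5, 3
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(V, V)).astype(float)
print(glove_cost(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V), X))
```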

For more details on these most-used approaches, Word2Vec and GloVe, readers are kindly directed to the original papers, [29] and [31], respectively. These two word-representation algorithms are the most suitable for use in a DL model; as such, most of these models use some form of embedding layer, treated as a trainable layer if it is not using pre-trained weights. This pre-training is an advantage, as shall be noted later, since using pre-trained vectors allowed us to succeed at the OffensEval 2020 competition.

2.3

Modeling Language in Deep Learning

ML algorithms have had a difficult time understanding human language: while we can store and present text, having an algorithm that understands the text it reads continues to be challenging. Algorithms developed before the Bidirectional Encoder Representations from Transformers (BERT) [6] architecture were, in essence, single-application algorithms: the problem they were designed for was the only one they could solve. Then BERT [6] came along; it can be considered the Swiss Army knife of NLP, with the ability to be fine-tuned for any particular problem while retaining the information learned during pre-training.

BERT [6] is an open-sourced, neural-network-based pre-training technique used to solve NLP problems (see Figure 2.3). It uses the transformer architecture, whose attention mechanism learns the contextual relations between words. Since the objective of BERT [6] is to generate a language model, only the encoder is necessary. BERT [6] is trained on a Wikipedia corpus: it takes the context of the text and tries to predict a [MASK]ed word used as a query. Being bidirectional means it looks both forward and backward to get the context in which it is working, in order to find the meaning of the missing word.

BERT [6] works by taking as input a text of any size and mapping it into a fixed-length vector (768 dimensions for BERT-base). This mapping puts similar contextual language into similar areas. Like Word2Vec, BERT uses masking to hide the word to be predicted, but, being bidirectional, it looks both ways to infer the context. It does this again and again until the embeddings obtained are powerful enough to predict the masked words; the result can then be further fine-tuned to any NLP task desired.
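As a quick illustration of masked-word prediction (a sketch assuming the Hugging Face transformers library, which the thesis does not mention; the model name and sentence are ours):

```python
from transformers import pipeline

# BERT predicts the [MASK]ed word from both left and right context
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("I arrived at the [MASK] after crossing the river."):
    print(pred["token_str"], round(pred["score"], 3))
```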

The development of BERT has its roots in the architecture known as Transformers [32]. Figure 2.2 presents the architecture; in this figure one can see the encoder (which reads the input text) and the decoder (which produces the output), as well as the attention mechanism used.

Since this is out of the scope of the current thesis, only a summary of the details necessary to follow this work is presented here. Using the example given in [33], for the sentence "I arrived at the bank after crossing the river", we want to determine which kind of "bank" the word refers to. In a translation setting, an attention mechanism in the transformer computes the next representation for the given word by comparing it against every other word in the sentence. The result is an attention score that determines how much each of the other words should contribute to the next representation of the word. In this example, the word "river" should obtain a high score; a weighted average of all the word representations is then fed into a neural network that generates a new representation for the word "bank", understanding that the sentence refers to a riverbank. For a more detailed explanation of transformers, the reader is invited to look into [32] and [33]. In this way, BERT [6] considers the full context of a word, looking at everything that comes before and after it.
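The attention computation itself can be sketched in a few lines of numpy (scaled dot-product attention as defined in [32]; the matrices Q, K, and V hold one row per word):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # attention scores: each word (query) compared against every other word (key)
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax so that each word's weights over the others sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # new representations: attention-weighted averages of the value vectors
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))  # 5 words, dimension 8 (self-attention)
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```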

2.4

Use of n-grams for Intent Classification

This work is concerned with text classification, beginning with tweets. In tweet classification there are two classes, hate and non-hate. This can also be seen as intent classification, as mentioned before: to spew hate or not to spew hate. This framing lets us focus on the next problem, the one where we have a small data pool (as stated previously, DL and word embeddings, pre-trained or not, are harder to use in this situation, for example when the device used to classify is constrained in storage or processing power). Also, since several kinds of n-grams exist, the one tackled here is the character n-gram, drawing inspiration from [25].

Figure 2.2: Transformer architecture [32].


Figure 2.3: BERT architecture [6].

seems to be more of the former repeated than of the latter in most situations, as stated in [21, 34]. The performance of text classification improves when character n-grams are considered together with additional features, for example hashing; this seems to confirm the findings of [35], where the study focused on the hypothesis that characters have a higher predictive power in text classification. This follows [36], where the authors found that one of the most important properties of character n-grams is that they avoid the need for tokenizers, lemmatizers, and similar language-dependent tools.

The authors of [35] also provide results for neural networks, using recurrent neural networks in their comparison, which in the end are outperformed by a combination of naïve Bayes and a support vector machine. These findings allow one to extrapolate the potential of character n-grams for intent classification.

2.5

Hyperdimensional computing

As discussed previously, DL can outperform almost any technique it goes against (including humans in some areas [37, 38]), but there are caveats: DL will mostly get better results provided that the model has a lot of clean data (for example, clear non-distorted images [38] to train on) and that one has considerable computational power. These caveats present a problem in low-data and/or low-resource settings.

As mentioned previously, DL is not without drawbacks; despite excelling in classification performance, the energy utilized when training its models (especially NLP models) is very high. As stated in [8], large DL models consume the CO2 equivalent of the lifetime emissions of five vehicles [8], and one of the three top tech companies uses the same amount of energy as Germany [8], with BERT [6] being one of the top consumers. One of the solutions the authors propose is that "We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy." [8], which is where hyperdimensional computing could help to mitigate this issue.

Next in focus is the notion of hyperdimensional computing (HDC) [39–44], introduced as a way to diminish the resources required. Hyperdimensional computing can be used in both the n-gram and the word-vector approaches to classification; here, it is used to save resources when doing intent classification while preserving performance. The name hyperdimensional computing [40, 45, 46] refers in practice to the size of the vectors being used. Traditionally, computer memory is represented by 32 or 64 bits of address, where each position has a meaning; for example, the first positions can encode the operation, followed by the value on which it is applied. This is binary, or conventional, computation (with vector values in {0, 1}); the result relies on correctly encoded positions and on memorizing what each one represents.

Operating with high-dimensional vectors allows the use of distance in the high-dimensional space as a measure of similarity, which translates into a measure of meaning [45]. Well-defined vector operations that add or bind these vectors together preserve most of the information encoded in them, as discussed in [47, 48]. In the next chapter, a condensed description is given; alternatively, the reader is recommended to look into [40] for a more detailed explanation.

The energy saved with hyperdimensional computing is well documented in [48–55], where these papers cover a variety of uses, from standard text classification to DNA sequencing, as well as resource-efficient modifications of existing algorithms, all utilizing high-dimensional vectors.


Chapter 3

Methodology

"I do not fear computers. I fear the lack of them." (Isaac Asimov)

This chapter describes all the datasets, models, metrics, and pre-processing used in this thesis. It starts by describing the datasets used in the presented papers and how the data was pre-processed before being fed into an ML algorithm. It also describes the performance metrics used in the papers to help support our claims. Finally, it describes the models and the way they were applied in the papers that comprise the thesis.

3.1

Datasets

Here we describe the datasets used in this work. The first dataset was used in Paper A as a pre-training step and then as the working dataset in new, not yet published experiments. The others were used in Papers B and C.

3.1.1

Datasets Used in Paper A and the Unpublished Results

The Offensive Language Identification Dataset (OLID) [56], used in Paper A and for the unpublished results, is labeled in the following way: if a post does not contain offensive or profane language, it is labeled as not-offensive; if it does, in violation of the rules stated in OLID ("Posts containing any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct"), it is labeled as offensive. Such posts include insults, threats, and posts containing profane language or swear words [56]. The OLID data contains 13,241 items or tweets (text snippets), of which 4,400 (about one-third) are offensive, while the rest are cataloged as not-offensive.

In Paper A, the OLID [56] dataset was used in two ways: first, as a pre-training step, where we trained our classifier on it to detect offensive and non-offensive tweets, and second, as a validation set for the results, helping to adjust the threshold for the best performance.


3.1.2

Datasets used in Papers B and C

Three datasets were employed to gather the results of Papers B and C: Chatbot, AskUbuntu, and WebApplications. In addition, Paper C used the much larger 20NewsGroups dataset.

The Chatbot dataset, comprising questions about the public transport of Munich posed to a Telegram chatbot, has 206 items covering two intents: Departure Time, with 43 train and 35 test items, and Find Connection, with 57 and 71, respectively.

The AskUbuntu dataset contains 190 items and five intents, with the following train and test distributions: Make Update 10 & 37, Setup Printer 10 & 13, Shutdown Computer 13 & 14, Software Recommendation 17 & 40, and None 3 & 5.

The WebApplications dataset comprises 100 items divided into eight intents, Change Password, Delete Account, Download Video, Export Data, Filter Spam, Find Alternative, Sync Accounts, and None, distributed as (train, test) pairs in the following manner: (2, 6), (7, 10), (1, 0), (2, 3), (6, 14), (7, 16), (3, 6), and (2, 4), respectively.

The 20NewsGroups dataset was originally collected by Ken Lang. It comprises 20 categories, such as comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, and comp.sys.mac.hardware, to name a few. In total it contains 18,846 samples, split into a train set of 11,314 and a test set of 7,532.

3.2

Data Pre-processing

In this section, we describe the pre-processing steps taken for the datasets used. First, we describe the steps taken to pre-process the OLID [56] dataset. Next, we present the steps for all the other datasets.

3.2.1

OLID pre-processing

Text pre-processing was done on the tweets provided by the OffensEval 2020 competition organizers [56]. They are divided into train and test sets with 13,241 and 861 tweets, respectively; the test set tweets are the ones used to compile the results. Using regular-expression search, @-words (for example, @StephenKing) were replaced with @USER, and URLs (if any) were replaced with "URL". We noticed that the emoji count (number of emojis in a tweet) and the tweet scores were not significantly correlated; the facial-emotion emojis also had almost no correlation. Therefore, emojis were removed from the tweets. Hashtags (#) and emoticons were also removed.

In the end, the processed tweets were generated with only one whitespace between the words/tokens in each tweet (see Table 3.1 for examples of text pre-processing). All 13,241 training tweets and the 861 test tweets were pre-processed. This removal was done because it was deemed unnecessary to give the model tokens that may appear only once or twice in the whole dataset. All test set tweets were used in the final classification.
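A minimal sketch of this kind of cleaning (the thesis does not list the exact regular expressions, so the patterns below are illustrative assumptions):

```python
import re

def preprocess_tweet(text: str) -> str:
    text = re.sub(r"@\w+", "@USER", text)           # @-words -> @USER
    text = re.sub(r"https?://\S+", "URL", text)     # links -> URL
    text = text.replace("#", "")                    # drop hashtag symbols
    text = text.encode("ascii", "ignore").decode()  # drop emojis/non-ASCII
    return re.sub(r"\s+", " ", text).strip()        # single whitespace only

print(preprocess_tweet("@StephenKing loved it!! 😍 #horror https://example.com"))
# -> "@USER loved it!! horror URL"
```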


Table 3.1: Original tweets and their pre-processed counterparts

Original: Somebody was abit excited by the first of his birthday celebrations
Pre-processed: Somebody was abit excited by the first of his birthday celebrations

Original: @USER Lowkey my jam but I'm on his head for this XXL shirt he wearing
Pre-processed: @USER Lowkey my jam but I'm on his head for this XXL shirt he wearing

Original: @USER His ass need to stay up
Pre-processed: @USER His ass need to stay up

Original: @USER his bouffant tail is amazing
Pre-processed: @USER his bouffant tail is amazing

3.2.2

Pre-processing done for Papers B and C

Before running the data through our algorithm in Papers B and C, some changes were applied to it: we lowercased the samples, replaced pronouns with '-PRON-', and removed all special characters, while keeping stop words.

The dataset distribution between classes was analyzed, and less-sampled classes were oversampled by adding augmented sentences to them. Each class had an equal number of training samples in the final training set for all three datasets.

The extra samples were generated with dictionary-based synonym replacement of randomly chosen nouns and verbs. This augmentation helped in getting new variations into the training dataset; however, it did not take spelling errors into account. Dictionary replacement was done using WordNet [57]. In this augmentation step, the smallest class in the dataset was augmented to be equal to the largest class.
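A sketch of such dictionary-based replacement using NLTK's WordNet interface (the thesis does not name the toolkit, so NLTK and the helper function below are illustrative assumptions):

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def synonym_replace(words, pos=wordnet.NOUN):
    """Replace one randomly chosen word with a WordNet synonym, if one exists."""
    out = list(words)
    idx = random.randrange(len(out))
    synsets = wordnet.synsets(out[idx], pos=pos)
    lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
    lemmas.discard(out[idx])
    if lemmas:
        out[idx] = random.choice(sorted(lemmas))
    return " ".join(out)

print(synonym_replace("when does the next train depart".split()))
```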

3.3

Performance Metrics

This section describes the metrics used to assess the performance of the researched mod-els.

3.3.1

F1 Score

This work used the F1-score as its principal measure, since it allows false negatives, which were deemed too costly, to be taken into account; this measure summarizes the performance of the models. It is the harmonic mean of precision and recall, which are defined as:

$$\mathrm{precision} = \frac{tp}{tp + fp}, \qquad \mathrm{recall} = \frac{tp}{tp + fn}$$

where tp, fp, and fn denote the true positives, the false positives, and the false negatives, respectively. The F1-score therefore takes all these quantities into account in its calculation.


It may not be as intuitive as accuracy, but it is more useful [58], especially in an uneven-class setting. F1 is calculated as:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

This metric is the one used in all the results discussed in this thesis.
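For instance, computed directly from predictions (a small self-contained sketch):

```python
def f1_score(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # tp=2, fp=1, fn=1 -> ~0.667
```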

3.3.2

Time and Memory

Time and memory were also considered. Time was measured both as the time it takes to train a model and as the time it takes to test the trained model; memory was measured as the sum of the sizes of the input feature vectors for the train and test splits, plus the size of the trained model. To obtain a fair measure that avoids detailed specifications (of the system, processors, or operating system used), all reported metrics are relative values, with the values obtained with conventional n-gram statistics as the reference.

3.4

Models

Here the focus is on describing the different models used in this work: the setup of the RoBERTa [59] model used for hate-speech detection in Paper A, the subword semantic hashing model used in Papers B and C, and the hyperdimensional computing model used in Paper C and in the unpublished results.

3.4.1

RoBERTa

RoBERTa [59] is a variant of the original BERT [6] model introduced by Facebook. It extends BERT [6] with a series of enhancements:

• Better training methodology;

• More data used for training (160 GB instead of BERT's 16 GB [6]);

• Removal of BERT's next-sentence-prediction objective;

• Dynamic masking;

• Increased batch size.

With these enhancements, RoBERTa has outperformed BERT and XLNet [59], to name a couple. For these reasons, RoBERTa [59] was chosen to conduct the experiments in the OffensEval 2020 competition [60]. Another reason was that at first it was perceived that the hardware was not capable of running it, so the initial plan was to use DistilBERT [61]; in the end, hardware was not an issue, and the DistilBERT [61] option was discarded, since the server was capable of running RoBERTa [59] without issues. For a more detailed description of RoBERTa, see [59].

For training the model, the data used was the OffensEval 2020 competition data [60]; the raw data was fed to the pre-processing stage described in Section 3.2.1. After the pre-processing step, we obtained clean data, took the first one million samples from the training data, and partitioned them into a development set of 3,000 tweets (the last 3,000 of the million) and a training set of the rest. The resulting training set was fed as input to the RoBERTa [59] model. The hyper-parameters were mostly left at their defaults (as time was tight), though the following were modified: the early-stopping patience was set to three, the learning rate was set to 1e−5, and the number of training steps between evaluations was set to 2,000 (as previous experiments had given some intuition).

With these hyper-parameters, we trained two versions of DistilBERT (one for 3 epochs and another for 20 epochs), as well as RoBERTa for three epochs. Here, we used only the mean score (and not the standard deviation, since the work was time-constrained) as a regression target.
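A sketch of such a fine-tuning setup (assuming, purely for illustration, the simpletransformers wrapper; the thesis does not name the training framework used):

```python
from simpletransformers.classification import ClassificationModel

# hyper-parameters mirrored from the text; everything else left at defaults
args = {
    "learning_rate": 1e-5,
    "use_early_stopping": True,
    "early_stopping_patience": 3,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 2000,
    "regression": True,  # the mean score is used as a regression target
}

model = ClassificationModel("roberta", "roberta-base", num_labels=1, args=args)
# train_df / dev_df would be pandas DataFrames with "text" and "labels" columns:
# model.train_model(train_df, eval_df=dev_df)
```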

3.4.2

SemHash

Semantic hashing (SemHash) is based on the Deep Semantic Similarity Model [25], in which the authors propose a hashing mechanism for tokens so that the input to the model depends on hash values rather than on the tokens themselves. The method works by extracting sub-word tokens (parts of words) from sentences as features. These features are then passed through a vectorization process before finally being sent to a classifier for training or prediction. This way, the method acts as both a featurizer (extracting features) and a vectorizer (making these features suitable for classification).

To visualize the method: given an input text T, for example "I have a flying disk", split it into a list of words t_i. The output of the split should look like ["I", "have", "a", "flying", "disk"]. Each word is passed into a pre-hashing function H(t_i) to generate sub-tokens t_i^j, where j indexes the sub-tokens. For example, H(have) = [#ha, hav, ave, ve#]: H(t_i) first adds a # at the beginning and the end of each word and then extracts trigrams from it. These trigrams are the sub-tokens t_i^j. This procedure is described in Algorithm 1.

H(t_i) is then applied to the whole dataset to generate sub-tokens. These sub-tokens are what constitutes a Vector Space Model (VSM), which can be used to extract features for a given input text. More concretely, this VSM is a hashing function for an input text sequence.

This algorithm is applied after the data pre-processing has been carried out. Data augmentation also played a vital role in the performance of SemHash, where it was observed that the number of samples in each class should be the same to get the best performance.


Algorithm 1 Subword Semantic Hashing

Texts ← collection of texts
Create set sub-tokens
Create list examples
for text T in Texts do
    Create list example
    tokens ← split T into words
    for token t in tokens do
        t ← "#" + t + "#"
        for j in 1 .. length(t) − 2 do
            Add t[j : j + 2] to set sub-tokens
            Append t[j : j + 2] to list example
        end for
    end for
    Append example to list examples
end for
return (sub-tokens, examples)
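A runnable sketch of the featurizer (our own minimal re-implementation of Algorithm 1, not the authors' code):

```python
def semhash_tokens(text):
    """Extract #-padded character trigrams (sub-tokens) from each word of `text`."""
    sub_tokens = []
    for word in text.split():
        padded = f"#{word}#"
        sub_tokens.extend(padded[j:j + 3] for j in range(len(padded) - 2))
    return sub_tokens

print(semhash_tokens("I have a flying disk"))
# ['#I#', '#ha', 'hav', 'ave', 've#', '#a#', '#fl', 'fly', 'lyi', ...]
```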

The sub-tokens produced are vectorized using the algorithm of [62] and then fed to standard ML algorithms such as Logistic Regression, Ridge Classifier, Multi-Layer Perceptron, and Random Forest, to name a few.

3.4.3

Embedding n-gram Statistics Using Hyperdimensional Computing

The use of hyperdimensional computing for classification tasks can work in several ways: one is encoding words as high-dimensional vectors, another is encoding n-grams as high-dimensional vectors, and another is encoding characters as high-dimensional vectors [63]. Depending on the type of high-dimensional vectors used, the operations used to encode the data [64] will differ.

For the manipulation of these high-dimensional vectors, three operations are defined [40]: bundling (adding vectors together), binding (associating a variable with a value vector), and permutation (which binds a high-dimensional vector to its position in a sequence). All the operations are shown by example next.

As [64] states, the encoding of a sequence can be thought of as "hyper-vectorizing" a set in which the order of the elements is important and needs to be retained. Encoding a set is similar to encoding a sequence, but there is a trick involved: the permutation operation. In turn, the binding operation is designed to combine n vectors into one and is implemented via position-wise multiplication [65]. The bundling operation allows information to be stored in high-dimensional vectors [66]; it is commonly implemented via position-wise addition.

A single high-dimensional vector for the trigram cba is formed by binding together the permuted high-dimensional vectors of its characters, where the character in each position $j$ is mapped to its high-dimensional vector permuted $j$ times:

$$H_{cba} = \rho^{1}(H_c) \odot \rho^{2}(H_b) \odot \rho^{3}(H_a)$$

Permutation is what allows the bulk of the work to be possible. For sequences it works as follows: to encode the whole sequence $a, b, c, d, e$ so that it differs from $b, c, d, a, e$, the first element is taken as the element in position 0, and the permutation is applied $i$ times to the element in position $i$. The sequence ends up as:

$$s = H_a + \rho(H_b) + \rho^{2}(H_c) + \rho^{3}(H_d) + \rho^{4}(H_e)$$

In this way, a unique high-dimensional vector is created for that specific sequence, which allows hyperdimensional computing to be used for text classification, as also stated in [63].

The three operations above allow n-gram statistics to be embedded into a distributed representation (a high-dimensional vector) [67]. With this description stated, the model was built following all the previous steps: the bundling, binding, and permutation of vectors were enough to embed the n-gram statistics into a high-dimensional space such that two similar n-gram statistics remain close together. The range of n-grams used in these experiments for the small datasets was 2–4, and the experiments were run 50 times to obtain a fair score.
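A minimal numpy sketch of these operations applied to character trigram statistics (bipolar ±1 vectors and np.roll as the permutation ρ are our illustrative choices; the exact implementation in the papers may differ):

```python
import numpy as np

D = 10_000  # dimensionality of the high-dimensional (HD) vectors
rng = np.random.default_rng(0)
# a random bipolar atomic HD vector for every character
item = {c: rng.choice([-1, 1], size=D) for c in "abcdefghijklmnopqrstuvwxyz #"}

def embed_ngram_stats(text, n=3):
    """Bundle the bound, permuted character vectors of all n-grams in `text`."""
    s = np.zeros(D)
    for i in range(len(text) - n + 1):
        g = np.ones(D)
        for j, c in enumerate(text[i:i + n], start=1):
            # permutation (np.roll) encodes position j; binding multiplies
            g *= np.roll(item[c], j)
        s += g  # bundling: position-wise addition accumulates the statistics
    return s

a = embed_ngram_stats("find connection")
b = embed_ngram_stats("find connections")
print(round(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), 2))  # close to 1
```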

3.5

Applied Models

Here we describe how each model was applied to a specific dataset and with which pre-processing step.

3.5.1

Paper A

For this paper, the dataset described in Section 3.1.1 was pre-processed as described in Section 3.2.1 and fed into the model described in Section 3.4.1.

3.5.2

Paper B

For this paper, the datasets described in Section 3.1.2 were pre-processed with the steps of Section 3.2.2 and fed into Algorithm 1, whose final output was fed into several standard ML classifiers.

3.5.3

Paper C

For this paper, the datasets described in Section 3.1.2 were pre-processed in the same way as in Section 3.2.2 and fed into the model described in Section 3.4.3.


3.5.4

Unpublished Results Using the Method of Paper C on OLID Data

For these experiments, the dataset of Section 3.1.1 was pre-processed in the manner described in Section 3.2.1 and run with the steps described in Section 3.4.3. These experiments were done to help solidify the hypothesis that, in a constrained setting, sacrificing a few F1-score percentage points allows for a more resource-efficient solution.


Chapter 4

Results

"My little computer said such a funny thing this morning." (Alan Turing)

The current chapter gives a brief overview of the results of the algorithms detailed in the previous chapter. The full results are presented in the papers in Part II.

4.1

Paper A: TheNorth at SemEval-2020 Task 12: Hate Speech Detection using RoBERTa

Paper A served to show the capability of RoBERTa [59] when dealing with hate-speech in tweets. SemEval-2020 Task 12 was a competition in which we took part and obtained an F1-score of 0.9087, which put us in the top 50%.

The solution submitted to the competition was a RoBERTa [59] model fine-tuned for hate-speech using the OLID dataset [56]. RoBERTa was chosen after the other tested model, DistilBERT [61], was discarded when it was noticed that RoBERTa has better performance, as shown in Table 4.1.

Table 4.1: Model decisions analysis

Model        Training epochs   Threshold   OLID train   OLID test
                                           (macro-F1)   (macro-F1)
DistilBERT   3                 0.46        0.7917       0.6002
DistilBERT   20                0.41        0.7720       0.5958
RoBERTa      3                 0.42        0.8043       0.6085

Our results felt encouraging, being less than 0.02 F1 points away from the top contenders. After a little more fine-tuning of the model, the score climbed to 0.9116, as shown in Table 4.2.

Table 4.2: Latest model results

Model     Threshold   OLID train   OLID test    OffensEval test
                      (macro-F1)   (macro-F1)   (macro-F1)
RoBERTa   0.45        0.7960       0.8012       0.9116
RoBERTa   0.44        0.8124       0.8026       0.9085

These results showed that a score of 0.92 was within reach with adjustments to the hyper-parameters and by increasing the data available to the model in the training set.

4.2

Paper B: Subword Semantic Hashing for Intent Classification on Small Datasets

Paper B presents the connection between Papers A and C and helps to understand how we were able to get smaller models without sacrificing performance while requiring far less data.

This paper presents a novel algorithm (namely SemHash) that deals with intent classification in a setting where data is scarce and a DL model is unfeasible to apply. Paper B uses three of the standard datasets for intent classification: Chatbot, WebApplications, and AskUbuntu. These datasets consist of questions posed by users seeking information, reflecting the different intents users have when using the sites/application.

The performance was compared against various natural language understanding services (Botfuel, Dialogflow, Luis, Watson, Rasa, Recast, and Snips) and a then-recent classifier (TildeCNN [68]). Our algorithm came out on top of all others, as shown in Table 4.3.

4.3 Paper C: HyperEmbed: Tradeoffs Between Resources and Performance

The results from paper C allowed us to confidently approach the issue of resource management in short-text classification. This was done by applying the proposed HyperEmbed algorithm to several short corpora, the AskUbuntu (Table 4.4), ChatBot (Table 4.5), and WebApplication (Table 4.6) datasets, and to one corpus with a much larger test set, 20NewsGroups (Table 4.7).

These results showed that the cost of using hyperdimensional computing for classification ranges from zero to thirty-two F1 percentage points, depending on the classifier and the dimensionality of the vectors used, while the resource savings were always high.
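The resource columns in Tables 4.4-4.7 report relative figures. Assuming each entry is the ratio of the SemHash baseline's measurement to that of the compared pipeline (an assumption consistent with the "speed-up" and "reduction" labels in Table 4.7), the convention can be sketched as:

```python
def resource_ratios(baseline, other):
    """Relative cost of `baseline` over `other`; both are dicts with assumed
    keys train_time, test_time and memory. Values > 1 favour `other`."""
    return {
        "train_speed_up": baseline["train_time"] / other["train_time"],
        "test_speed_up": baseline["test_time"] / other["test_time"],
        "memory_reduction": baseline["memory"] / other["memory"],
    }

# e.g. resource_ratios(semhash_stats, hyperembed_stats)
```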



Table 4.3: Micro-F1 score comparison of different NLU services with SemHash

Platform     Chatbot   AskUbuntu   WebApp   Overall   Avg.
Botfuel      0.98      0.90        0.80     0.91      0.89
Luis         0.98      0.90        0.81     0.91      0.90
Dialogflow   0.93      0.85        0.80     0.87      0.86
Watson       0.97      0.92        0.83     0.91      0.91
Rasa         0.98      0.86        0.74     0.88      0.86
Snips        0.96      0.83        0.78     0.89      0.86
Recast       0.99      0.86        0.75     0.89      0.87
TildeCNN     0.99      0.92        0.81     0.92      0.91
Our Avg.     0.99      0.94        0.83     0.93      0.92
Our Best     0.996     0.94        0.85     0.94      0.93

Table 4.4: Performance of all classifiers for the AskUbuntu dataset. Resource columns are relative figures for SemHash (SH) vs. HyperEmbed (HD) and SH vs. BPE: training-time speed-up (Tr.), testing-time speed-up (Ts.), and memory reduction (Mem.).

                   F1 score             SH vs. HD             SH vs. BPE
Classifier         SH    BPE   HD       Tr.   Ts.   Mem.      Tr.   Ts.   Mem.
MLP                0.92  0.91  0.91     4.62  3.84  6.18      1.67  1.61  1.72
Passive Aggr.      0.92  0.93  0.90     4.86  3.07  6.31      2.19  2.14  1.76
SGD Classifier     0.89  0.89  0.88     4.66  3.50  6.31      1.94  2.16  1.76
Ridge Classifier   0.90  0.91  0.90     3.91  4.74  6.31      1.63  1.62  1.76
KNN Classifier     0.79  0.72  0.82     2.11  4.53  8.48      1.56  1.79  1.76
Nearest Centroid   0.90  0.89  0.90     1.66  3.41  6.32      1.35  1.87  1.76
Linear SVC         0.90  0.92  0.90     1.18  2.39  6.29      0.91  1.91  1.76
Random Forest      0.88  0.90  0.86     0.91  1.09  6.11      1.15  0.96  1.75
Bernoulli NB       0.91  0.92  0.85     2.30  3.72  6.34      1.96  2.42  1.76

4.4 Unpublished Results

The results in Tables 4.8-4.12 support the claim made in this thesis that, by sacrificing some F1-score percentage points, we gain considerable speed in training and testing the models, which translates into fewer resources being utilized and a more resource-efficient process.



Table 4.5: Performance of all classifiers for the Chatbot dataset.

                   F1 score             SH vs. HD             SH vs. BPE
Classifier         SH    BPE   HD       Tr.   Ts.   Mem.      Tr.   Ts.   Mem.
MLP                0.96  0.94  0.96     3.42  2.62  4.58      1.86  1.52  1.86
Passive Aggr.      0.95  0.91  0.94     4.40  2.38  4.72      2.29  2.22  1.92
SGD Classifier     0.93  0.93  0.92     3.16  2.06  4.72      1.88  1.84  1.92
Ridge Classifier   0.94  0.94  0.92     2.88  2.22  4.72      1.67  1.38  1.92
KNN Classifier     0.75  0.71  0.83     1.66  3.59  6.51      1.43  1.79  1.92
Nearest Centroid   0.89  0.94  0.84     1.41  2.13  4.73      1.17  1.61  1.92
Linear SVC         0.94  0.93  0.94     0.52  1.57  4.72      1.28  1.66  1.92
Random Forest      0.95  0.95  0.91     0.95  1.10  4.61      1.16  0.98  1.91
Bernoulli NB       0.93  0.93  0.82     1.92  2.60  4.73      1.53  1.72  1.92

Table 4.6: Performance of all classifiers for the WebApplication dataset.

                   F1 score             SH vs. HD             SH vs. BPE
Classifier         SH    BPE   HD       Tr.   Ts.   Mem.      Tr.   Ts.   Mem.
MLP                0.77  0.77  0.79     3.10  2.00  4.43      1.74  1.44  1.73
Passive Aggr.      0.82  0.80  0.80     3.73  1.45  4.33      1.86  1.32  1.75
SGD Classifier     0.75  0.74  0.73     3.01  1.87  4.33      1.62  1.32  1.75
Ridge Classifier   0.79  0.80  0.80     1.66  2.40  4.34      0.71  1.09  1.75
KNN Classifier     0.72  0.75  0.76     1.16  2.76  5.96      1.14  1.51  1.76
Nearest Centroid   0.74  0.73  0.77     1.42  1.79  4.34      1.13  1.21  1.75
Linear SVC         0.82  0.80  0.80     1.04  1.48  4.29      0.47  1.18  1.75
Random Forest      0.87  0.85  0.72     0.95  1.26  4.11      1.05  1.12  1.73
Bernoulli NB       0.74  0.75  0.64     1.51  2.08  4.38      1.19  1.49  1.75



Table 4.7: Performance of all classifiers for the 20NewsGroups dataset.

                   F1 score      Resources: SH vs. HD
Classifier         SH    HD      Train speed-up   Test speed-up   Memory reduction
MLP                0.72  0.64    53.23            79.50           93.19
Passive Aggr.      0.74  0.69    103.64           202.95          93.42
SGD Classifier     0.70  0.66    105.43           186.31          93.42
Ridge Classifier   0.16  0.71    45.46            338.01          93.42
KNN Classifier     0.31  0.31    184.47           65.87           127.54
Nearest Centroid   0.08  0.15    212.75           254.74          93.42
Linear SVC         0.75  0.69    5.11             176.62          93.42
Random Forest      0.58  0.26    4.27             21.43           93.41
Bernoulli NB       0.60  0.15    57.72            56.54           93.42

Table 4.8: Performance of the MLP Classifier on the OLID dataset.

             F1 score                Resources: SH vs. HD
Dimensions   SH: 32603 dim   HD      Train speed-up   Test speed-up   Memory reduction
32           0.74            0.63    5.71             5.66            723.76
64           0.75            0.65    2.62             5.94            423.16
128          0.74            0.67    2.27             2.66            231.15
256          0.73            0.69    1.66             2.77            121.18
512          0.75            0.71    1.19             2.95            62.10
1024         0.76            0.72    1.60             3.74            31.44
2048         0.74            0.71    1.50             3.92            15.82
4096         0.74            0.72    0.95             3.29            7.93
8192         0.73            0.73    0.68             1.88            3.97
16384        0.73            0.70    0.49             1.86            1.99



Table 4.9: Performance of the RandomForest Classifier on the OLID dataset.

             F1 score                Resources: SH vs. HD
Dimensions   SH: 32603 dim   HD      Train speed-up   Test speed-up   Memory reduction
32           0.77            0.64    27.52            60.62           723.75
64           0.75            0.64    49.80            135.17          423.16
128          0.76            0.65    115.55           151.52          231.15
256          0.76            0.65    63.01            68.79           121.18
512          0.74            0.65    55.24            36.27           62.10
1024         0.75            0.66    23.79            18.87           31.44
2048         0.75            0.66    12.26            14.66           15.82
4096         0.77            0.66    3.68             9.74            7.93
8192         0.77            0.67    0.81             4.26            3.97
16384        0.77            0.67    1.20             3.13            1.99

Table 4.10: Performance of the SGD Classifier on the OLID dataset.

             F1 score                Resources: SH vs. HD
Dimensions   SH: 32603 dim   HD      Train speed-up   Test speed-up   Memory reduction
32           0.34            0.61    140.81           24.94           723.76
64           0.74            0.65    144.00           13.30           423.16
128          0.74            0.68    78.65            37.96           231.15
256          0.73            0.69    49.73            63.22           121.18
512          0.74            0.71    26.88            13.13           62.10
1024         0.75            0.72    15.70            33.97           31.44
2048         0.72            0.72    8.51             9.29            15.82
4096         0.65            0.71    3.32             2.46            7.93
8192         0.73            0.72    1.85             5.55            3.97
16384        0.74            0.72    1.98             2.02            1.99



Table 4.11: Performance of the Ridge Classifier on the OLID dataset.

             F1 score                Resources: SH vs. HD
Dimensions   SH: 32603 dim   HD      Train speed-up   Test speed-up   Memory reduction
32           0.72            0.65    24.00            24.94           723.76
64           0.72            0.68    20.56            13.30           423.16
128          0.72            0.69    45.50            37.96           231.15
256          0.72            0.71    63.19            63.22           121.18
512          0.72            0.74    17.46            13.13           62.10
1024         0.72            0.75    11.93            33.97           31.44
2048         0.72            0.75    7.48             9.29            15.82
4096         0.72            0.77    4.26             2.46            7.93
8192         0.72            0.77    3.03             5.55            3.97
16384        0.72            0.76    2.10             2.02            1.99

Table 4.12: Performance of LinearSVC on the OLID dataset.

             F1 score                Resources: SH vs. HD
Dimensions   SH: 32603 dim   HD      Train speed-up   Test speed-up   Memory reduction
32           0.68            0.65    4969.18          811.60          723.76
64           0.68            0.68    246.88           80.82           423.16
128          0.68            0.69    185.23           23.31           231.15
256          0.68            0.73    150.23           5.95            121.18
512          0.68            0.73    72.58            16.33           62.10
1024         0.68            0.74    23.55            28.01           31.44
2048         0.68            0.73    1.30             18.72           15.82
4096         0.68            0.70    0.09             9.21            7.93
8192         0.68            0.68    0.03             16.72           3.97
16384        0.68            0.68    0.03             9.30            1.99


Chapter 5

Conclusions and Future Work

“Any technological advance can be dangerous. Fire was dangerous from the start, and so (even more so) was speech - and both are still dangerous to this day - but human beings would not be human without them.”
Isaac Asimov

This work explores intent classification (anchored to hate-speech detection) for short texts. Paper A presented the results of intent classification (hate-speech detection) utilizing DL. The study in paper B then addressed intent classification in short texts, showing good results in situations where data scarcity is a problem and DL cannot be applied. This scarcity did not hinder the performance of the algorithms in paper B; the remaining challenge was how to make a more resource-efficient solution. A possible solution was presented in paper C.

5.1 Findings in Relation to Research Questions

RQ1: What is the optimal way (speed and accuracy) for inferring intent on small text datasets?

Paper B provided the answer to this question: intent classification needs a combination of pre-processing and post-processing steps, followed by a careful selection of the classifier.

RQ2: Is hyperdimensional computing suitable for saving resources for classifying small text datasets?

The question posed here asks the reader to imagine a situation where data and computing power are scarce but the need for classifying intent/text persists. Paper C offers one solution to this problem by delving into hyperdimensional computing. The results demonstrated that hyperdimensional computing provides such a solution and can be seen as a resource-efficient paradigm.

RQ3: Do the findings on small text datasets also apply to hate-speech detection?

This question is answered by the unpublished results. They showed that hate-speech can indeed be successfully tackled by the algorithms proposed for small datasets, while incurring no or only a small penalty in accuracy.

RQ4: How efficient is DL compared to hyperdimensional computing for the detection of hate-speech in tweets?

Two results answer this question: paper A and the unpublished results. Paper A showed that, with no constraints, DL is the better alternative for inferring intent, but DL is not without concerns. Addressing these concerns, the unpublished results showed that hyperdimensional computing can provide a good alternative at a small cost. Therefore, DL remains the best option when resource limitations permit it, and hyperdimensional computing when they do not.

5.2 Conclusions

The results put forward in this thesis can be seen as a foundation for research in either DL or hyperdimensional computing related to short-text classification. Paper A presented what may become the primary go-to ML algorithms (DL). The DL models used here were shown to achieve good performance when there is enough data to train them and no resource limitations. In the competition for which paper A was produced, the results showed the capabilities of the RoBERTa [59] model when applied to hate-speech detection. These results suggest a research direction: improving the RoBERTa [59] model or training another variation from the BERT [6] family of models.

Next, the attention shifted to small training data, for which paper B presented promising results (state-of-the-art at the time). This paper showed that high performance could be achieved even under the limitation imposed above, that is, insufficient training data for the model, using n-grams and traditional ML algorithms. The results of paper B showed that n-grams, along with the SemHash algorithm, can correctly classify intent in a small-data setting. This points to another area that could be researched further: data preparation. While the performance shown was fairly high in the end, there is still room for improvement.

The premise of paper C was that, given more constraints (on data and computational resources), classification of short text did not need to suffer as much as previously thought. The results put forward show that similar performance can be obtained using a technique borrowed from hyperdimensional computing and applied to the task at hand.

Finally, the unpublished results were aimed at closing the current cycle and presenting all the work done for this thesis as a complete circle. These last results demonstrated that, for the task of intent classification, the self-imposed constraints placed at the start could be circumvented, using the framework of hyperdimensional computing, without sacrificing too much of the performance that the DL model provided.

5.3 Future Work

The problem of intent classification is still not fully solved, and this thesis presents different mechanisms that can help alleviate some of the issues that arise in text classification with DL. It has been shown that, in a small-data and low-power regime, it is still possible to classify intents from text without sacrificing too much accuracy. However, the work presented here does not take into account all the new developments in word embeddings. While this thesis did pay some attention to word embeddings, the way they were constructed was not the most resource-efficient [8]. The next stage of research is envisioned as building more efficient constructions of word embeddings. As we have seen, they allow for better performance, but one could argue that the cost of that gain is too high. The plan is to shift the research to word-embedding construction using the hyperdimensional computing approach used in this thesis and one of its variants known as Random Indexing, which are expected to be more resource-efficient.


References
