
SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

'Sorry, I didn't understand that'

A comparison of methods for intent classification for social robotics applications

MIKAELA ÅSTRAND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


A comparison of methods for intent classification for social robotics applications

MIKAELA ÅSTRAND

Master in Machine Learning
Date: February 28, 2020
Supervisor: Gabriel Skantze
Examiner: Olov Engwall

School of Electrical Engineering and Computer Science
Host company: Furhat Robotics

Swedish title: "Förlåt, men jag förstod inte riktigt": En jämförelse av metoder för avsiktsklassificering tillämpad på social robotik


Abstract

An important feature in a social robot is the ability to understand natural language. One of the core components in a typical system for natural language understanding (NLU) is so-called intent classification: classifying user utterances based on the underlying intents of the user. Previous research on intent classification has mainly been performed on dialogues very different from what can be expected in social robotics, where dialogues are of a more social nature, with utterances often being very short or highly context dependent.

It has also been performed under the assumption that all test utterances do indeed belong to one of the predefined intent classes. This is often not the case in an actual application where the user cannot be expected to know the limitations of the system.

In this thesis, a number of intent classification methods are evaluated based on how they perform on two tasks: classifying utterances belonging to one of the predefined classes and identifying utterances that are out of scope. For this, three different datasets are used: two existing intent classification datasets and one that was collected as part of this project and that is more typical for dialogues in social robotics.

The methods being evaluated are support vector machine (SVM), logistic regression, the intent classifier in the NLU platform Snips, and the neural language model BERT. For SVM and logistic regression, two different feature representation techniques are used: bag-of-words (BoW) with and without tf-idf weighting, and pre-trained GloVe embeddings.

Based on the results of these evaluations, three main conclusions are drawn: that simple methods are usually to be preferred over more complicated ones, that out-of-scope detection needs further investigation, and that more datasets typical for different kinds of applications are needed. BERT generally performs the best on both tasks, but SVM and logistic regression are not far behind, with pre-trained word embeddings performing no better than BoW and Snips no better than the simple classifiers. Previous research on out-of-scope detection is very limited and the results obtained here give no clear indication of what is the overall best approach or what performance is to be expected in different settings.

Finally, the intent classification and out-of-scope detection performances differ a lot between different datasets, making representative datasets a necessity for drawing conclusions about expected performance in specific applications.


Sammanfattning

An important property of a social robot is the ability to understand natural language. In a typical system for natural language understanding, one of the main components is so-called intent classification, which classifies user utterances according to the intent behind them. Previous research on the topic has rarely dealt with dialogues of a social nature, where dialogues can be long but consist of many short utterances whose context is essential for understanding the intent. Different methods have also primarily been evaluated under the assumption that all test utterances belong to one of the given classes, which is not realistic for an actual application where users cannot be expected to know the limitations of the system.

In this thesis, a number of intent classification methods are evaluated on two aspects: how well they perform at classifying utterances that belong to one of the given classes, and how well they recognise utterances that do not. This is done on three different datasets: two existing intent classification datasets and one that was collected as part of this project and that is more typical of social robotics.

The specific methods evaluated are support vector machine (SVM), logistic regression, the intent classifier in the dialogue platform Snips, and the neural language model BERT. To represent utterances as vectors for SVM and logistic regression, both a count-based representation technique (so-called bag-of-words, BoW) with and without tf-idf weighting and pre-trained GloVe word embeddings are used.

Based on the results obtained, three main conclusions are drawn: that simpler methods are usually preferable to more advanced ones, that more research is needed on intent classification where some utterances are expected not to belong to any of the classes, and that more datasets representative of different types of applications are needed. BERT performs best overall on both aspects, but its advantage over SVM and logistic regression is small.

In addition, BoW usually performs as well as pre-trained word embeddings, and simple classifiers perform better than Snips. Previous research on the second aspect is very limited, and the results presented here give no clear answers as to which methods are preferable or what performance can be expected. There are also large differences in how well the different methods perform on different datasets, which makes it clear that more and more varied datasets are needed to draw conclusions about how well they can be expected to perform in a specific application.


Contents

1 Introduction
    1.1 Research questions
    1.2 Delimitations
2 Background
    2.1 Social robotics and NLU
    2.2 Intent classification
        2.2.1 Existing datasets
        2.2.2 Intent classification alone
        2.2.3 Joint and end-to-end modelling
        2.2.4 NLU platforms
    2.3 Out-of-scope detection
        2.3.1 Out-of-domain detection in NLU
        2.3.2 Open and one-class classification
    2.4 Text classification methods
        2.4.1 SVM
        2.4.2 Logistic regression
        2.4.3 Bag-of-words
        2.4.4 Word embeddings
        2.4.5 Language models
3 Method
    3.1 Datasets
        3.1.1 CoffeeShop
        3.1.2 SnipsData
        3.1.3 OosEval
        3.1.4 OpenSubtitles
    3.2 Compared classifiers
        3.2.1 Motivations
        3.2.2 Implementation details
    3.3 Evaluation
4 Results
    4.1 In-scope classification
        4.1.1 All classifiers and datasets
        4.1.2 CoffeeShop performance
        4.1.3 Stemming and lemmatisation
        4.1.4 Varying training set sizes
        4.1.5 Random versus designed examples
    4.2 Out-of-scope detection
        4.2.1 Using thresholds
        4.2.2 OOS as a class
        4.2.3 OOS class augmented with OpenSubtitles
5 Discussion
    5.1 In-scope classification
    5.2 Out-of-scope detection
    5.3 The CoffeeShop dataset
    5.4 In social robotics applications
    5.5 Future work
    5.6 Societal impact and ethical aspects
6 Conclusions
Bibliography


Introduction

Social robots are robots that are intended to interact socially with humans, with applications in fields such as customer service, entertainment, and elderly care. One example of a social robot is Furhat, produced by the company Furhat Robotics, at which this thesis was carried out.

Figure 1.1: Furhat in a café. Credit: Furhat Robotics.

Verbal communication is a key part of human social interaction and therefore also a main feature in many social robots. This includes both understanding and generation of spoken natural language. In most dialogue systems, including in Furhat, one of the core parts of the natural language understanding is so-called intent classification. This is about classifying user utterances into predefined categories corresponding to the underlying intent of the user. In the situation depicted in the picture above, we could for instance imagine the following conversation taking place:


The two utterances to the left could then for instance be classified into three intents: "express that I don’t know" for the first utterance, and "order lemon pie" and "ask for price" for the second.

There is a wide range of methods for doing intent classification but they have not been thoroughly evaluated on data typical for social robotics applications.

For instance, if taken out of its context, it is far from obvious that "OK, I'll take that" means that the user wants to order a lemon pie. This is a typical feature of dialogues in social robotics that has not often been considered in previous intent classification research. Such research has also mainly been performed on simple questions or dialogues with one specific user goal, usually written rather than spoken, and almost exclusively in English. In a social conversation, there is often no predefined goal for the interaction apart from the interaction itself.

Utterances could be very short, even just single words, or many sentences long.

Although intent classification is usually performed on the transcriptions of what is being said, using written utterances is not the same thing as we do not express ourselves the same way in speech and writing. As seen above, utterances are often also highly dependent on the context, referring to what has previously been said in the conversation, but perhaps also to earlier interactions or just common knowledge about the world. In order to choose the methods most suitable for intent classification in social robotics, it is necessary to compare them on data more typical for the field.

As for many classification tasks, intent classification methods have generally been evaluated assuming that all test utterances belong to one of the given classes. In most dialogue systems, there is only a very limited set of predefined, possible intents. This set is usually not known by the user, who must therefore be expected to sometimes say things that are not covered by any of the intents. For instance, following the dialogue above, the user might be thinking about ordering a coffee along with the pie but is not sure about what to take. If the robot has not been programmed to be able to answer questions about different kinds of coffee, we might end up with the following exchange:

For a conversation to run smoothly and not take unexpected turns, it is important to recognise out-of-scope utterances and not classify them as random intents. This is particularly important in social robotics applications, where the interaction must feel natural in order to be successful. It is therefore of interest to compare different intent classification methods not only based on their in-class classification performance, but also on how they perform at identifying when utterances are out of scope.

1.1 Research questions

In this thesis, a number of methods for intent classification are compared in order to answer the following two questions:

1. How do established intent classification methods perform on utterances typical for social robotics applications?

2. How do these methods perform in identifying out-of-scope utterances?

1.2 Delimitations

In order to limit the scope of this project, the classifiers are evaluated solely based on their intent classification and out-of-scope detection performance on single-intent utterances in English. The comparisons do not consider multi-intent utterances, other languages, entities and entity recognition, training and prediction time, hardware requirements, or the feasibility of using the different classifiers in production. All classifiers are implemented using existing code and Python libraries with very limited hyper-parameter tuning. No evaluation is performed with real users of an actual robot application.


Background

The first section of this chapter introduces natural language understanding (NLU) and its role in social robotics. Then, relevant background and related work in intent classification and out-of-scope detection are described: Section 2.2 introduces intent classification and presents related work within the field, and Section 2.3 focuses on out-of-scope detection. Finally, Section 2.4 gives some theoretical background to the classification methods used in this project.

2.1 Social robotics and NLU

Social robots are robots intended for social interaction with people. They can be either physically or virtually embodied and be made to look as similar to humans as possible (androids), have some resemblance to humans (humanoid) or animals (zoomorphic), or be neither but still have some anthropomorphic features such as a face and facial expressions. Some possible applications are within education, health, elderly care, customer service, and entertainment [1].

See Figure 1.1 in the previous chapter for an example of a social robot, Furhat.

A standard pipeline for a spoken dialogue system, for instance in a social robot, can be seen in Figure 2.1. It consists of five main components: automatic speech recognition (ASR), natural language understanding, dialogue management, natural language generation, and speech synthesis. In such a setup, NLU is a separate module that transforms the words identified in the ASR module into a semantic representation on which the dialogue manager can then decide how to act [2]. There is also the notion of spoken language understanding (SLU), which sometimes refers to what is here called NLU and sometimes to ASR and NLU together.

Figure 2.1: A standard dialogue system pipeline.

One of the main differences between social robotics and most other types of NLU applications is that communication with a social robot can be multimodal, taking for instance gestures and gaze as input. Another difference is that social robots should be able to engage in multiparty dialogues [1]. These things are out of the scope of this project, but even when considering only the words being spoken, there are several features that are typical for NLU in social robotics in contrast to many other dialogue systems.

One difference between speech- and text-based NLU is that speech generally does not follow the same grammatical structure as text, with long sequences of words without clear sentence division, often including irregularities such as self-corrections and hesitations. Also, even when an utterance consists of grammatically correct sentences, the ASR usually outputs a stream of text that does not include punctuation.

A social robot should be able to handle utterances with features typical for human-human dialogue, such as being context-related, for grounding, or of very varying length. The meaning of an individual utterance often depends on the context in which it is said, referring to what has been said earlier in the conversation or to common knowledge. A typical utterance would often be no longer than one or a few words, but it could also last for many sentences. Other important features in a social robot are to support multiple languages and to be interesting to interact with for an extended period of time, requiring, among other things, a high level of understanding [3].

Depending on the application, NLU can be split into sub-tasks in different ways, where classification is generally performed on both the utterance level and the word level. Several common divisions include the notion of intent classification that is the focus of this project.

2.2 Intent classification

One division of NLU is into intent classification (also "intent detection" or "intent determination") and entity recognition (also "named entity recognition", "entity extraction" or "entity identification"). Intent classification means mapping user utterances to predefined meanings called intents, such as "Hi there" → greeting or "Will it be sunny tomorrow?" → weatherQuery. In entity recognition, words or sequences of words representing entities belonging to predefined categories, such as places, person names or time expressions, are identified and labelled.

Another division in task-oriented dialogue systems is into domain detection, intent classification and slot filling [4]. This kind of classification of the example sentence ”Show me an easy recipe for chocolate muffins” can be seen in Figure 2.2.

Figure 2.2: Domain (D), intent (I) and slots in IOB-format (S) for an example sentence (W).

In this thesis, the focus is solely on intent classification. The rest of this section gives an overview of the work that has previously been done within that area.

2.2.1 Existing datasets

Previous work on intent classification has used a number of different datasets that all have in common that they are either task-oriented or about question answering.

One of the first applications of intent classification was for automatic call routing. Data from systems such as AT&T’s How may I help you? (HMIHY) with 14 call-types (some utterances being of multiple types) [5] and a system implemented by IBM with 35 call-types [6] have been used mainly by researchers from these companies.


A dataset often used for benchmarking is the Air Travel Information Service corpus (ATIS). It was collected in the beginning of the 1990s using a Wizard-of-Oz method, i.e. where users are communicating with what they think is a computer but that is actually a human operator, and contains approximately 6’000 utterances representing 17 different intents [7].

Two text-based NLU corpora were introduced in [8]: the Chatbot and StackExchange corpora. The Chatbot corpus was collected from a chatbot in the Telegram app answering questions about public transport, in total containing 206 questions. The StackExchange corpus contains 251 questions and answers from two StackExchange platforms: Ask Ubuntu and Web Applications. These corpora include two, five and eight intents respectively (including the intent "None").

In [9], they collected a dataset they called Chinese Question Understanding Dataset (CQUD) based on questions from the Chinese question and answer community Baidu Knows.

The company behind the NLU platform Snips1 made an in-house dataset with over 16’000 utterances covering seven intents publicly available in 2017 [10]. This dataset has since been used by for instance [11] and [12] and is described in more detail in Section 3.1.2.

In March 2019, a dataset of 25’716 utterances representing a total of 64 intents in 21 domains was released [13]. It was collected through crowdsourcing via Amazon Mechanical Turk where the Turkers were asked to give examples of how they would interact with a home assistant robot in scenarios such as setting an alarm or checking tomorrow’s weather.

Finally, in research from Facebook [14], Microsoft [15], [16] and [17], and Sony [18], in-house datasets collected through crowdsourcing or from real users of their systems (e.g. Microsoft Cortana) were used, but these datasets are not publicly available.

2.2.2 Intent classification alone

Most articles mentioned below focus on either utterance representation and feature extraction or on specific classifiers.

Early work on intent classification represented the utterances as vectors of n-gram counts (with 1 ≤ n ≤ 3) and used support vector machines (SVM) [19], adaptive boosting (AdaBoost) [20], naïve Bayes or maximum entropy (MaxEnt) [21] classifiers.

1https://snips.ai/


In [22], a parsing-based sentence simplification method extracted keywords, and classification performance was shown to improve when a model trained on the simplified sentences was used together with a model trained on the original sentences.

More recently, focus has been on different types of neural networks (NN), both for feature extraction and for classification.

In 2011, a deep belief network (DBN) in combination with a feed-forward NN, trained with labelled word count vectors as input, was shown to perform similarly to SVM and to outperform the AdaBoost and MaxEnt classifiers on the IBM call-routing data [23]. Building upon this work, in 2014, DBNs were trained on unlabelled data to generate features for an SVM classifier, further improving classification results [24]. Other features that have been considered incorporate context, for example by including the previously predicted label as a feature [25].

A yet more recent feature representation technique is to use word embeddings, something that has proven successful for many text classification tasks [26]. However, it has only been sparsely applied to intent classification. Examples include using enriched GloVe word embeddings as input to a bidirectional long short-term memory network (BiLSTM) [27] and FastText word embeddings as input to a convolutional neural network (CNN) [28]. Both word2vec and GloVe word embeddings were used together with three different CNN architectures in [18] for intent classification and other text classification tasks, both embedding types yielding similar results. Word embeddings are further described in Section 2.4.4, in particular GloVe embeddings that were the ones used in this project.

In [29], multiple types of NNs were compared on two different binary intent classification tasks: the intent flight vs. all other intents for the ATIS dataset, and web search vs. all other intents for a much larger in-house Microsoft Cortana dataset. They tried both word embeddings and word hashing and obtained the best overall results with LSTM networks and networks with gated recurrent units (GRU).

A kind of hashing was also the focus of [30], where a subword hashing scheme for feature extraction was introduced. To evaluate its performance, the hashing was used along with a number of different classifiers (SVMs, logistic regression, kNN, etc.) for intent classification of the AskUbuntu and StackExchange datasets. The results were then compared with those of seven different NLU platforms evaluated on the same datasets in [8] and [31].


2.2.3 Joint and end-to-end modelling

In several papers, intent classification is performed jointly with slot filling and sometimes also domain detection. Although less flexible and not allowing for domain- and intent-specific features, this has the advantage of only needing one model for all domains and being able to benefit from common features of different domains and intents [17]. Joint modelling has mainly been done using different kinds of neural networks: CNN [32], recursive neural networks (RecNN) [16], bidirectional recurrent neural networks (RNN) with LSTM [17] [27], RNN with gated recurrent units (GRU) [9], attention-based RNN [33], and slot-gated attention-based RNN [11]. In [12], the authors used the pre-trained language representation model BERT. Their model was shown to outperform the models in [17], [33] and [11] on both slot filling and intent classification on the Snips and ATIS datasets.

There has also been some work on joint optimisation of the ASR and NLU components [6] and on methods for doing end-to-end SLU from audio to semantic representations, for example in [14] and [34].

2.2.4 NLU platforms

There are a number of platforms offering NLU as a service that all include intent classification, mainly (in alphabetical order): Botfuel2, DialogFlow3 (Google, previously API.ai), Lex4 (Amazon), LUIS5 (Microsoft), Rasa6 (open source), SAP Conversational AI7 (previously Recast.ai), Snips8 (open source), Watson9 (IBM) and Wit.ai10 (Facebook). However, how the intent classification is performed by these NLU platforms is in most cases not public. In Rasa, a sentence is represented by the average of its word vectors and is then classified by an SVM where a grid search has been performed to find the best parameters [35]. In Snips, there are two subsequent intent classifiers. First, a deterministic parser based on regular expressions checks if the utterance matches any of the example patterns. If it does not, a probabilistic classifier based on logistic regression is used instead. The classifier also has an inbuilt None intent trained on samples of noise text with the same word distribution as the language as a whole [10].

2https://www.botfuel.io/

3https://dialogflow.com/

4https://aws.amazon.com/lex/

5https://luis.ai/home

6https://rasa.com/

7https://cai.tools.sap/

8https://snips.ai/

9https://www.ibm.com/watson

10https://wit.ai/


2.3 Out-of-scope detection

In all the articles about intent classification hitherto cited, the test sets only include in-scope (INS) utterances, i.e. utterances belonging to one of the predefined intent classes. However, this is not realistic for real-life applications where out-of-scope (OOS) utterances are likely to occur.

There are three overall approaches to OOS detection:

1. Use an intent classifier that returns a confidence or probability score and apply a threshold under which an utterance is classified as OOS.

2. Define a separate OOS class and train the intent classifier with both INS and OOS data.

3. Use two separate classifiers: one binary INS vs. OOS and one to distinguish between the intent classes for utterances classified as INS.

The first approach has the advantage of not requiring any OOS data in the training set, but optimal thresholds might vary strongly with the model used, the number of classes, and the amount of training data. Depending on how the second and third approaches are implemented, it can be more or less important that the OOS training data is well moderated without any overlap with the intent classes. Methods where the OOS training data can include some INS utterances might be preferable as they demand less manual work in removing INS utterances from the OOS dataset and allow for reusing OOS data between applications.
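As an illustration of the first approach, the sketch below applies a confidence threshold to a scikit-learn-style probabilistic classifier. The example utterances, the threshold value of 0.5 and the "oos" label are illustrative choices, not settings taken from this project.

```python
# Sketch of approach 1: reject an utterance as out of scope when the
# classifier's highest class probability falls below a threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = ["i would like a latte", "how much is the lemon pie", "see you later"]
train_intents = ["orderCoffee", "getPrice", "goodbye"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_utterances, train_intents)

def predict_with_threshold(utterance, threshold=0.5):
    """Return the predicted intent, or 'oos' if the classifier is not confident enough."""
    probs = clf.predict_proba([utterance])[0]
    best = int(np.argmax(probs))
    return clf.classes_[best] if probs[best] >= threshold else "oos"

print(predict_with_threshold("do you sell train tickets"))
```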

This categorisation into three approaches to OOS detection is the one used in an article that served as one of the inspirations for this project: An evaluation dataset for intent classification and out-of-scope detection [36]. Their main contribution was the release of a dataset for intent classification that includes relevant out-of-scope data. On this dataset, they evaluated both the INS classification and the OOS detection performance of a couple of different classifiers (SVM, multilayer perceptron (MLP), FastText, CNN and BERT) and NLU platforms (DialogFlow and Rasa) using all three approaches.

The OOS detection performance of three NLU platforms (LUIS, Watson and DialogFlow) was also evaluated in [37]. Apart from that, there does not seem to have been much work done on OOS detection in intent classification.


However, there are related problems in NLU and text classification that have been researched more thoroughly, one of them being out-of-domain detection in NLU. Inspiration could also be taken from previous work in open and one-class classification.

2.3.1 Out-of-domain detection in NLU

In [38], the authors investigate unsupervised training for intent classification in an SLU system. They distinguish between two types of OOS utterances: out-of-domain (OOD) utterances and in-domain (IND) utterances that cannot be handled by the system (there called "in-domain, unknown" (IDU)). Utterances of the second type share many features with INS utterances, and it is shown that they are indeed harder to detect than OOD utterances. To perform OOD and IDU detection, they use AdaBoost and a likelihood ratio score calculated using a background model trained on OOS data.

Techniques that have been used for OOD detection in other, related tasks include SVMs with classification confidence scores [39], and different kinds of neural networks: neural sentence embeddings and an autoencoder [40], generative adversarial networks [41], and BiLSTM [42].

In [39], OOD detection performance is evaluated on topic classification in a speech-to-speech translation task. An SVM model is trained for each topic class, and confidence scores are calculated based on the distance between the utterance vector and each SVM hyper-plane. Applying a threshold for OOD detection, IND classification and OOD detection are performed simultaneously, needing no OOD data for training.

Another OOD detection method not requiring any OOD training data is suggested in [40], where the authors perform binary IND vs. OOD classification of sentences in Korean covering 13 domains, of which eight are considered IND and five OOD. Word representations are pre-trained on unlabelled Wikipedia text using a skip-gram network, and fine-tuned with an LSTM network using domain classification of the IND sentences as an auxiliary task. This gives neural sentence embeddings that are supposed to emphasise sentence aspects distinguishing different domains. An autoencoder is then used to perform the IND/OOD classification, the best model achieving an equal error rate (EER, the rate at which the false acceptance and false rejection rates are equal) of 7.02% compared to an EER of 13.69% with the method of [39]. Some of the same authors also investigated using a generative adversarial network (GAN) trained on only IND sentences for the same task on a similar Korean dataset in [41].


Finally, a domain classifier and an OOD detector are trained jointly in [42], optimising for domain classification accuracy with the constraint that the false acceptance rate (FAR) cannot exceed a given threshold. Utterances are represented using a BiLSTM, on top of which two separate feed-forward NNs are used: one for domain classification and one for OOD detection. The joint loss function involves dynamic class weighting to ensure that the FAR constraint is fulfilled, and the method is evaluated on two datasets collected from real users of Amazon Alexa.

2.3.2 Open and one-class classification

OOS detection can be viewed as either an open or a one-class text classification problem.

In open text classification, documents are classified into their respective classes while rejecting those not belonging to any of them. Two examples of methods for open text classification are: using center-based similarity space learning and SVMs [43], and using a CNN with an extra 1-vs-rest layer where class-specific thresholds are applied to the network sigmoid outputs to reject OOS documents [44].

One-class classification is a variant of binary classification where only the positive class is well represented in the training data and the negative class is poorly represented or not present at all. In [45], it is split into three categories: with only positive examples, with positive examples and badly distributed negative examples, and with positive and unlabelled data, where the last case is the most researched for text classification. One-class text classification has mostly been performed using different versions of SVMs, but also with for example naïve Bayes, expectation maximisation, decision trees, and genetic algorithms. In the case of intent classification, the OOS utterance space is very large and diverse and therefore hard to represent properly in a limited number of examples. Binary INS vs. OOS classification could therefore be seen as a one-class classification problem.

2.4 Text classification methods

Performing intent classification using ASR transcriptions can be viewed as a multi-class short text classification task. There are a lot of different feature representation techniques and classifiers that have been successfully used for text classification and that could be of interest for intent classification. This project focuses on comparing a number of such methods, and in this section, some theory behind the chosen methods is described. The choice of methods to include is motivated in Section 3.2.1.

2.4.1 SVM

The support vector machine (SVM) is a binary maximum margin classifier [46]. It finds the hyperplane separating the two classes that maximises the "margin" between the two classes, defined as two times the perpendicular distance between the hyperplane and the closest datapoints. Figure 2.3 shows a simple two-dimensional example where the separating hyperplane is the solid line and the margin the distance between the two dashed lines. The three highlighted points are the so-called support vectors that are the datapoints closest to the separating line.

Figure 2.3: The maximum margin hyperplane (the solid line) and support vectors (the three highlighted points) for a 2-dimensional classification task.

We label the two classes +1 and −1 and want to find the weight vector w and intercept b that maximise the margin. w is perpendicular to the plane, on which w · x − b = 0. As pointed out in Figure 2.3, w · x − b = ±1 for the support vectors. New datapoints are classified according to on which side of the hyperplane they are positioned.

The problem can be reformulated as a quadratic programming problem where we need to solve

\[
\arg\max_{\alpha} \; \sum_{j} \alpha_j - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k) \tag{2.1}
\]

under the constraints \(\alpha_j \geq 0\) and \(\sum_j \alpha_j y_j = 0\). This problem has a single global solution, and when we have found the corresponding \(\alpha\), new datapoints are classified by:

\[
h(x) = \operatorname{sign}\!\left(\sum_{j} \alpha_j y_j (x \cdot x_j) - b\right) \tag{2.2}
\]

\(\alpha_j\) will be non-zero only for the support vectors, so in order to classify a new datapoint \(x\), we only need to compute the dot product between \(x\) and the support vectors. \(w\) can be recovered from \(\alpha\) as \(w = \sum_j \alpha_j y_j x_j\).

SVMs can often be used even when the dataset is not originally linearly separable. By applying the kernel trick, where the dot product is replaced by a kernel function, the data is transformed to a higher-dimensional space where it might be possible to find a linear separator. There is also the soft margin version of SVM where we allow datapoints to be misclassified, but add a penalisation proportional to their distance to the hyperplane [46].

The SVM is originally a binary classifier, but several approaches for extending it to multi-class classification exist. Two common approaches are one-vs-one and one-vs-rest. Denoting with K the number of classes, K(K − 1)/2 SVMs are trained in one-vs-one, one for each pair of classes. A datapoint is then classified as belonging to the class getting the most votes from these SVMs. In one-vs-rest, we train one SVM for each class with the negative class containing all the datapoints belonging to the other K − 1 classes. A datapoint is then classified as belonging to the class whose SVM gives the highest output value.

With a large number of classes, the one-vs-rest approach might be preferable as it requires training K rather than K(K − 1)/2 SVMs. However, this approach has the problems that the training dataset for each SVM is unbalanced, and that the output values might not be on the same scale and are therefore not completely comparable [47].
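Both multi-class strategies are available as generic wrappers in scikit-learn, as sketched below; the toy data and estimator settings are illustrative only and do not reflect the exact configuration used in this project.

```python
# Sketch of one-vs-rest (K classifiers) versus one-vs-one (K(K-1)/2
# classifiers) multi-class SVMs on bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X = ["hi there", "what are your opening hours", "goodbye then", "how much is a latte"]
y = ["hello", "openingHours", "goodbye", "getPrice"]

ovr = make_pipeline(CountVectorizer(), OneVsRestClassifier(LinearSVC()))
ovo = make_pipeline(CountVectorizer(), OneVsOneClassifier(LinearSVC()))

for model in (ovr, ovo):
    model.fit(X, y)
    print(model.predict(["what are the opening hours"]))
```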

2.4.2 Logistic regression

Logistic regression (LR) is, like the SVM, originally a binary classifier. Given two classes y = 0 and y = 1, we assume a linear relationship between the observations and the logit function for y = 1. For an observation x with feature vector f, this can be expressed:

\[
\operatorname{logit}(p(y=1 \mid x)) = \log\!\left(\frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)}\right) = w \cdot f \tag{2.3}
\]


This gives the following probability of x belonging to class 1:

\[
p(y=1 \mid x) = \frac{1}{1 + e^{-w \cdot f}} \tag{2.4}
\]

and, as the probabilities should sum to 1, for class 0:

\[
p(y=0 \mid x) = \frac{e^{-w \cdot f}}{1 + e^{-w \cdot f}} \tag{2.5}
\]

The first equation is called the logistic function and lends its name to the classifier.

A datapoint x is classified as belonging to the class getting the highest conditional probability. This can be simplified to only consider the sign of the dot product so that x belongs to class 1 if w · f > 0 and to class 0 if w · f < 0, w · f thereby defining a hyperplane separating the two classes. The weight vector w is chosen to maximise the conditional likelihood for the training data:

\[
\hat{w} = \arg\max_{w} \prod_{i} P(y^{(i)} \mid x^{(i)}) \tag{2.6}
\]

LR can be generalised to multi-class classification in several ways. As for the SVM, it can be done by combining multiple one-vs-one or one-vs-rest classifiers. Another way is using multinomial logistic regression, also called maximum entropy modelling (MaxEnt) [2].

2.4.3 Bag-of-words

All information in this section comes from [2].

The simplest way of representing text as vectors is to use a so-called bag-of-words (BoW) model. Documents are represented as vectors whose dimension equals the size of the corpus vocabulary, each element representing the number of times that word occurs in the document. For a big dataset, this leads to very sparse, high-dimensional vectors, as most words in the vocabulary will not be present in a given document. Techniques to decrease the vector dimensions include using only a fixed subset of the vocabulary, for example by ignoring stopwords, and performing lemmatisation or stemming of the texts.

In a standard BoW, all words count equally. To highlight the fact that some words contain more information useful for the classification than others, tf-idf scores can be used instead of raw word counts.
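As a minimal illustration (not the exact setup used in this project), BoW vectors with and without tf-idf weighting can be produced with scikit-learn:

```python
# Raw-count bag-of-words versus tf-idf-weighted vectors for three
# illustrative documents.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["i would like a coffee", "how much is the coffee", "see you later"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)            # sparse matrix of raw word counts
print(bow.get_feature_names_out())
print(counts.toarray())

tfidf = TfidfVectorizer()                   # same idea, but tf-idf weighted
print(tfidf.fit_transform(docs).toarray().round(2))
```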


Lemmatisation and stemming

Both lemmatisation and stemming are used to translate morphologically different forms of a word to the same, normal form. Words are built up of so-called morphemes, which are smaller meaning-bearing units. These can be split into two categories: stems, which contain the main meaning of a word, and affixes, which add some additional meaning. In lemmatisation, words are translated to their dictionary forms (lemmas), for example "was"→"be" and "cats"→"cat". This is a rather complex process, usually including morphological parsing, part-of-speech tagging and dictionary lookups. Stemming is a simpler approach including only morphological analysis, where all the affixes are cut off and only the stem is kept. For some words like "cats", this gives the same results as with lemmatisation. Other examples would however give very different results, like "was"→"wa" and "organisation"→"organ" using the Porter stemmer.
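The difference can be illustrated with NLTK; the thesis does not state which library was used for this step, so the snippet is only a sketch:

```python
# Stemming versus lemmatisation of a few example words using NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)        # resource needed by the lemmatiser

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

for word, pos in [("was", "v"), ("cats", "n"), ("organisation", "n")]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatiser.lemmatize(word, pos=pos))
```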

Tf-idf

The term frequency (tf) is used to give more importance to words appearing many times in a given document d. It is the frequency with which the word t appears in d and could be defined as the raw count:

\[
\mathrm{tf}_{t,d} = \mathrm{count}(t, d) \tag{2.7}
\]

or the logarithm thereof, to avoid taking the logarithm of 0 either adding 1:

\[
\mathrm{tf}_{t,d} = \log(\mathrm{count}(t, d) + 1) \tag{2.8}
\]

or treating 0 as a special case:

\[
\mathrm{tf}_{t,d} = \begin{cases} 1 + \log(\mathrm{count}(t, d)) & \text{if } \mathrm{count}(t, d) > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{2.9}
\]

The second term, "idf", is the inverse document frequency and is used to emphasise words that appear in few documents and that are hence discriminative for d. Denoting the total number of documents N and the number of documents in which t appears \(\mathrm{df}_t\), the idf is usually defined as

\[
\mathrm{idf}_{t} = \log\!\left(\frac{N}{\mathrm{df}_t}\right) \tag{2.10}
\]

Finally, the tf-idf value is given by multiplying the term frequency and the inverse document frequency:

\[
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t} \tag{2.11}
\]
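The following small sketch applies Equations 2.8, 2.10 and 2.11 to a toy corpus of three illustrative documents:

```python
# Worked tf-idf example: log-scaled term frequency (Eq. 2.8), inverse
# document frequency (Eq. 2.10) and their product (Eq. 2.11).
import math

docs = [
    "i want a coffee",
    "a coffee and a lemon pie",
    "what are the opening hours",
]

def tf(term, doc):
    return math.log(doc.split().count(term) + 1)           # Eq. 2.8

def idf(term, docs):
    df = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / df)                        # Eq. 2.10

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)                 # Eq. 2.11

print(tf_idf("coffee", docs[0], docs))   # common word: low idf, low tf-idf
print(tf_idf("lemon", docs[1], docs))    # rarer word: higher idf, higher tf-idf
```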


2.4.4 Word embeddings

An alternative to BoW in representing text as vectors is to use pre-trained word embeddings. These are vector representations of words that are trained on large corpora in order to capture semantic and contextual information about the words. Many state-of-the-art results in text classification over the last few years were obtained with different kinds of neural networks initialised with word embeddings. Examples of word embeddings include word2vec11, GloVe12 and FastText13, which have all previously been used for intent classification (see Section 2.2.2). In this project, GloVe embeddings were used, and they will therefore be further described here.
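As a sketch of how pre-trained GloVe vectors can be turned into a fixed-size utterance representation, the snippet below averages the word vectors, the scheme used by e.g. Rasa (Section 2.2.4). The file name refers to the publicly released glove.6B vectors; whether simple averaging matches the exact setup of this project is not stated in this section.

```python
# Represent an utterance as the average of its pre-trained GloVe vectors.
import numpy as np

def load_glove(path):
    """Read GloVe vectors from the released text format: word v1 v2 ... vd."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def utterance_vector(utterance, vectors, dim=100):
    """Average the vectors of known words; zero vector if no word is known."""
    known = [vectors[w] for w in utterance.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)

glove = load_glove("glove.6B.100d.txt")     # 100-dimensional pre-trained vectors
print(utterance_vector("i would like a latte", glove).shape)   # (100,)
```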

GloVe

”GloVe” stands for global vectors and they were released in 2014 by the authors of [48], from which all the information below is taken.

The name reflects the fact that the embeddings are trained on global word-word co-occurrences. The training corpus is passed through once to populate a co-occurrence matrix X, where each number represents how frequently two words co-occur in the corpus, element \(X_{ij}\) being the number of times word j appears in the context of word i. Denoting by \(X_i = \sum_k X_{ik}\) the number of times any word is found in the context of word i, the probability of j appearing in the context of i is given by

\[
P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} \tag{2.12}
\]

To distinguish relevant words and cancel out those that are non-discriminative, probability ratios instead of pure probabilities are used. An example to make this more intuitive is given in Table 2.1 where probabilities and probability ratios for the two focus words ice and steam and four context words are listed.

We see that context words that do not correspond to properties distinguishing the focus words (water and fashion) give probability ratios close to 1 while context words that are strongly connected to properties distinguishing them give high or low ratios.

The authors therefore formulate the function on which to base the cost function as:

\[
F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \tag{2.13}
\]

11https://code.google.com/archive/p/word2vec/

12https://nlp.stanford.edu/projects/glove/

13https://fasttext.cc/


Table 2.1: Example of co-occurrence probabilities and probability ratios for the words ice and steam and four different context words. Table values from [48].

Probability (ratio)      k = solid     k = gas       k = water     k = fashion
P(k|ice)                 1.9 × 10−4    6.6 × 10−5    3.0 × 10−3    1.7 × 10−5
P(k|steam)               2.2 × 10−5    7.8 × 10−4    2.2 × 10−3    1.8 × 10−5
P(k|ice)/P(k|steam)      8.9           8.5 × 10−2    1.36          0.96

Since the probability ratio is a scalar and to make the function represent the difference between two target words in the vector space, F is then chosen to take as input the dot product between the difference between the word vectors wi and wj and the third word vector ˜wk:

\[
F(w_i, w_j, \tilde{w}_k) = F\!\left((w_i - w_j)^{T}\tilde{w}_k\right) = \frac{P_{ik}}{P_{jk}} \tag{2.14}
\]

F is also required to fulfill

\[
F\!\left((w_i - w_j)^{T}\tilde{w}_k\right) = \frac{F(w_i^{T}\tilde{w}_k)}{F(w_j^{T}\tilde{w}_k)} \tag{2.15}
\]

as well as to be invariant under w ↔ ˜w and X ↔ XT (because when considering two words, the choice of focus versus context word is arbitrary), which is shown to be fulfilled by the exponential function. Combining Equations 2.14 and 2.15 and the definition of Pij, and then taking the logarithm and grouping all bias terms, we arrive at:

\[
w_i^{T}\tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik}) \tag{2.16}
\]

Finally, the cost function is formulated as a least square model with this term and a weighting factor f (Xij):

\[
J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{T}\tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij})\right)^{2} \tag{2.17}
\]

where f is defined to give more weight to frequent co-occurrences:

\[
f(x) = \begin{cases} (x / x_{\max})^{3/4} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise.} \end{cases} \tag{2.18}
\]


Along with the paper, pre-trained GloVe embeddings trained on four different datasets were released, including Wikipedia, newswire and Twitter data. The different training datasets range from 6 to 840 billion tokens with vocabulary sizes from 400’000 to 2.2 million words, and the dimensions of the generated word vectors are between 25 and 300.

2.4.5 Language models

The language models referred to here are so called neural language models, not the statistical language models that model word sequence probabilities.

Transfer learning is about using knowledge learned from one task to perform better on another, related task. Pre-trained word embeddings can be seen as a kind of transfer learning where the feature representation is based on knowledge gained from the tasks used to train the embeddings. However, word embeddings only give a good representation of the individual words, not of the context or relationships within the text. The rest of the network needs to be trained from scratch, requiring large sets of labelled training data. When labelled data is scarce, it would therefore be useful to have models where more knowledge is transferred to the new task.

The main current transfer learning approach in natural language processing (NLP) uses so called language models that are pre-trained without supervision on a large text corpus. These are complete networks that capture both low- and high-level structure, from words to semantic meaning. They are usually imported as they are and then fine-tuned to the task at hand but can also be used for producing word embeddings that take context into account. A number of such language models were released during 2018 and 2019 and have already been successfully used for a range of different NLP tasks, including text classification.

Contextual word embeddings

Contextual word embeddings are functions of the entire input sequence and not only the individual words, thereby giving different embeddings for a specific word depending on the context in which it appears. CoVe (context vectors) were trained on a supervised machine translation task [49]. These embeddings were then outperformed by ELMo (embeddings from language models) that were trained in an unsupervised fashion and computed from the internal states of a pre-trained deep bi-directional language model based on LSTM layers [50].


ULMFiT and GPT

The language model ULMFiT (universal language model fine-tuning) introduced in [51] is based on a 3-layer LSTM architecture. Evaluating on six different text classification datasets, the authors showed good performance with only a small number of labelled examples for each new task. On two different datasets, fine-tuning with only 100 labelled training datapoints gave the same performance as training from scratch with 1’000-2’000 labelled training datapoints. If also including a large number of unlabelled examples (50’000 and 100’000 respectively), i.e. semi-supervised fine-tuning, the performance was as if doing supervised learning with 5’000-10’000 datapoints.

OpenAI’s generative pre-training models (GPT and GPT-2) are based on the transformer network architecture and showed state-of-the-art results on a number of NLU tasks upon release, both with fine-tuning and in a zero-shot learning setting [52, 53].

BERT

The language model used in this project is called BERT, which stands for bidirectional encoder representations from transformers. Upon release in 2018, it achieved state-of-the-art results on eleven different NLP tasks [54] and has since been successfully used for many more, including intent classification and slot filling trained jointly in [12]. It is a multi-layer bidirectional transformer encoder that can be fine-tuned to perform both sentence- and token-level tasks. The model architecture is based on the original transformer architecture proposed in [55].

BERT takes both single sentences and sentence pairs as input, where a "sentence" is a text of arbitrary length, and creates one single token sequence that becomes the input of the model. Each sequence starts with a classification token called [CLS]. Then follows a pair of sentences, called A and B, with a separator token called [SEP] in between, ending with another [SEP] token. The two consecutive phrases "I really like robots. Do you?" are thus transformed to the token sequence:

[[CLS], i, really, like, robots, [SEP], do, you, [SEP]]    (2.19)

Each token i in the sequence is represented with an embedding Ei that is the sum of three embeddings: the WordPiece embedding for the specific token, an embedding indicating whether it belongs to sentence A or B, and an embedding corresponding to its position in the full sequence.
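This input format can be inspected with, for example, the HuggingFace transformers tokenizer; the snippet below is for illustration only and does not imply that this particular library was used in the project.

```python
# Tokenise a sentence pair into the [CLS] ... [SEP] ... [SEP] format.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("I really like robots.", "Do you?")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'i', 'really', 'like', 'robots', '.', '[SEP]', 'do', 'you', '?', '[SEP]']

# token_type_ids mark which tokens belong to sentence A (0) and B (1).
print(encoded["token_type_ids"])
```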

The models are pre-trained on data from the BooksCorpus and English Wikipedia performing two different tasks: masked language model (MLM) and next sentence prediction (NSP). In MLM, 15 % of the input tokens are chosen at random to be "masked". Each token being masked is in 80 % of the cases replaced with the [MASK] token, in 10 % of the cases with a random token and in 10 % of the cases with the original token. Using cross-entropy loss, the model is then trained to predict the masked tokens. The NSP task is included to train a model that can understand sentence relationships, which is useful for tasks such as question answering. With two sentences A and B, sentence B is in 50 % of the cases the sentence actually following A in the corpus and in 50 % of the cases a random sentence from the same corpus. The model is then trained to predict whether B follows A or not, using the embedding for [CLS] for prediction.

Fine-tuning is then performed on the specific task where all model parameters are initialised with the pre-trained parameters and then fine-tuned with labelled, task-specific data. When there are natural input pairs, these are used as A and B. When there is only one input text, such as in text classification, it is used as A and an empty text string as B.
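A minimal sketch of such fine-tuning for single-sentence intent classification, using the HuggingFace transformers library and PyTorch, is shown below; the data, optimiser settings and number of steps are illustrative only and do not reproduce the implementation described in Section 3.2.2.

```python
# Fine-tune a pre-trained BERT encoder with a classification head on a
# (tiny, illustrative) intent classification task.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

intents = ["orderCoffee", "getPrice"]
texts = ["i will take a latte", "how much is the lemon pie"]
labels = torch.tensor([0, 1])

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(intents))

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):            # a few steps only; real fine-tuning uses full epochs
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["what does a cappuccino cost"], return_tensors="pt")).logits
print(intents[int(logits.argmax())])
```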

Two main models were released along with the paper, BERTBASE and BERTLARGE, with 110 million and 340 million parameters respectively. A multilingual BERT model supporting 104 languages, trained on Wikipedia data, was later released by the same authors14. There is ongoing work on developing monolingual BERT models for more languages. For instance, BERT models for Swedish were recently released by The National Library of Sweden15 and The Swedish Public Employment Service16.

14https://github.com/google-research/bert/blob/master/multilingual.md

15https://github.com/Kungbib/swedish-bert-models

16https://github.com/af-ai-center/SweBERT


Method

The first section of this chapter describes the datasets that were used in this project. The following two sections describe the classifiers chosen for compari- son and how they were implemented. In the last section, the evaluation metrics used to compare them are introduced.

3.1 Datasets

Four different datasets were used to train and evaluate the classifiers, here called: CoffeeShop, SnipsData, OosEval and OpenSubtitles.

3.1.1 CoffeeShop

An in-house dataset was collected through crowdsourcing among the employees at Furhat Robotics. They were given five dialogues taking place in a coffee shop setting where each dialogue included four gaps with a short description of what was to be said there, in total covering 20 different intents. They were then instructed to give five examples of how to say each intent. Finally, they were asked to give five examples of out-of-scope utterances that did not match any of the intents. An example dialogue can be seen in Table 3.1.

One person was asked to provide utterances typical of the examples a developer would write when building a skill for a social robot, rather than of how an end user would phrase them. This is the respondent labelled ’r19’ in the dataset, who provided five to nine examples per intent, of which five were chosen at random for the experiment described in Section 4.1.5. These utterances were generally shorter and more focused on covering many important keywords.


Table 3.1: One of the dialogues from the collection of the CoffeeShop dataset, collecting data for four intents: notReady, askPreference, contentInfo and dontCare.

Who      Utterance/instruction
Furhat   Hello! What can I get you?
You      (Say that you’re not ready to order yet)
Furhat   Of course, take your time.
You      (Ask what Furhat would take)
Furhat   I would go for today’s special, it’s a really good hazelnut vanilla latte.
You      (Ask what it contains)
Furhat   One espresso shot, your choice of milk, hazelnut syrup, and some vanilla extract.
You      Ok, then I’ll take a vegan today’s special.
Furhat   Do you want it with soy or oat milk?
You      (Say that you don’t care)
Furhat   Ok, then I’ll make it with oat milk. Here you go!

Table 3.2 shows ten example utterances, five of these ”designed” examples and five from another respondent. These examples were all given for the instruction (Say that you’re not ready to order yet) in the dialogue in Table 3.1, representing the intent ’notReady’.

Table 3.2: Five examples each from two respondents for the intent notReady.

r19                   r11
I don’t know yet      Sorry, I am not ready yet!
I’m still looking     Give me a moment, please.
I need a minute       Could you wait for a minute? I am not ready.
give me some time     Just a second! Still looking at the menu.
I’m not ready yet     Sorry, I am not ready to order yet.

Cleaning consisted of making all characters lowercase, removing punctuation and replacing common contractions such as "don’t"→"do not" and "I’m"→"I am".
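A sketch of these cleaning steps is shown below; the contraction list is illustrative and not the complete mapping used for the dataset.

```python
# Lowercase, expand common contractions and strip punctuation.
import re
import string

CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is", "i'll": "i will"}

def clean(utterance):
    text = utterance.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(clean("I'm not ready yet, don't rush me!"))   # -> "i am not ready yet do not rush me"
```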

There were 23 respondents, and after manually going through all the given utterances to check that they did indeed match the intended intents, 2037 utterances remained, of which 1391 were unique. This corresponded to 33-100 unique utterances per intent. There were also 100 unique out-of-scope utterances.

The full dataset was split into a training, validation and test set based on the respondents, so that every respondent was only present in one of the sets. Four respondents were chosen for the test set and three for the validation set, so as to get an approximate 70/10/20 train/val/test ratio. It was then ensured that all utterances were unique; if an utterance was present in several sets, it was kept only in one set picked at random.
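The sketch below illustrates such a respondent-disjoint split; the column names, respondent ids and the choice of held-out respondents are hypothetical and only mirror the procedure described above.

```python
# Split utterances so that no respondent appears in more than one subset.
import pandas as pd

df = pd.DataFrame({
    "respondent": ["r11", "r11", "r19", "r19", "r03", "r07"],
    "utterance": ["i am not ready yet", "give me a moment please",
                  "i need a minute", "i am still looking",
                  "not yet thanks", "one more minute please"],
    "intent": ["notReady"] * 6,
})

test_respondents = {"r03"}              # in the thesis: four respondents
val_respondents = {"r07"}               # in the thesis: three respondents

test = df[df.respondent.isin(test_respondents)]
val = df[df.respondent.isin(val_respondents)]
train = df[~df.respondent.isin(test_respondents | val_respondents)]

print(len(train), len(val), len(test))
```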

The final train/val/test ratio is approximately 67/11/22, with the validation percentage ranging from 6 % for isImpressed to 19 % for dontKnow and the test percentage ranging from 15 % for dontCare to 29 % for yes. The number of utterances for each intent and subset can be seen in Table 3.3. A validation set was only needed for BERT, so for all the other classifiers it was considered part of the training set.

Table 3.3: The number of utterances for each intent and subset in the CoffeeShop dataset.

Intent Train Val Test Total

askPreference 54 8 17 79

askTime 40 7 17 64

contentInfo 42 10 15 67

dontCare 61 6 12 79

dontKnow 29 10 14 53

droppedCoffee 60 10 18 88

getOptions 49 7 19 75

getPrice 41 5 17 63

goodbye 36 7 12 55

hello 24 3 6 33

isImpressed 52 5 18 75

no 31 9 13 53

notReady 62 8 20 90

openingHours 44 9 12 65

orderCoffee 69 11 20 100

requestRepeat 52 8 16 76

tellPreference 59 7 18 84

waitingTime 62 9 18 89

whyAsking 43 6 16 65

yes 22 5 11 38

Total 932 150 309 1391


The 100 OOS utterances were split at random into 35 for training, 15 for validation, and 50 for testing. As for the in-scope data, the validation set was considered part of the training set for all classifiers but BERT.

3.1.2 SnipsData

This dataset is usually called Snips only but will here be called SnipsData to separate it from the Snips classifier.

As described in Section 2.2.1, it is an intent classification dataset covering seven intents that was released in 2017 by the company behind the NLU platform Snips [10] and has since been used by for instance [11] and [12]. The version of the dataset used in this project comes from the authors of [12] and can be found at the corresponding GitHub repository1. The slot-filling part of the dataset was ignored, only including the intent labels and the utterances. All utterances are lowercased with punctuation removed so no further pre-processing was needed. The dataset contains a total of 14’484 utterances: 13’084 for training and 700 for validation and test respectively. See Table 3.4 for the exact numbers for each intent and subset.

Table 3.4: The number of utterances for each intent and subset in the Snips dataset.

Intent Train Val Test Total

AddToPlaylist 1818 100 124 2042

BookRestaurant 1881 100 92 2073

GetWeather 1896 100 104 2100

PlayMusic 1914 100 86 2100

RateBook 1876 100 80 2056

SearchCreativeWork 1847 100 107 2054
SearchScreeningEvent 1852 100 107 2059

Total 13084 700 700 14484

The dataset was collected through crowdsourcing using text query generation tasks with fixed intents and entities. This was followed by two disambiguation steps to ensure the quality of the annotations, for example to identify incorrect intent labels and spelling errors. The first step was to add an additional crowdsourcing task that consisted of checking utterances provided by other crowdworkers. The second step was to repeatedly run a 3-fold cross-validation with their NLU engine. In both steps, majority voting was applied to find potential errors.

1https://github.com/90217/joint-intent-classification-and-slot-filling-based-on-BERT/tree/master/data/snips

3.1.3 OosEval

OosEval is a dataset released in 2019 and described in [36]. The full dataset covers 10 domains and 150 intents, with 150 queries for each intent: 100 for training, 20 for validation and 30 for testing. They also compiled variants of the datasets with fewer training queries per intent (”Small”) or unbalanced distribution of training queries (”Imbalanced”) that are not considered here.

What is special about the OosEval dataset is its focus on out-of-scope detection. The original dataset includes 1200 OOS utterances, and the variant called "OOS+" includes 1350: 250 for training, 100 for validation and 1000 for testing. OOS+ is the variant used in this project.

The INS utterances were collected using crowdsourcing where the respondents were told to pretend they were communicating with an AI assistant and to perform tasks of three different types: asking questions within a given topic domain ("scoping"), rephrasing given questions or statements, and formulating questions or statements given a scenario. The OOS utterances are a mix of two types of utterances from the INS collection: utterances that did not match any of the intents, and scoping and scenario utterances with topics from other domains than those included in the INS collection.

Like SnipsData, OosEval comes with all characters in lowercase and punctuation removed, requiring no further pre-processing.

3.1.4 OpenSubtitles

In order to obtain a large number of dialogue utterances to use as examples of OOS utterances, a dataset was compiled based on the Swedish–English part of the parallel corpora described in [56]. It was created from the OpenSubtitles website, an online repository of subtitles for movies and TV series in more than 60 languages2.

Only the English phrases were used, and only every 1000th utterance was included in the dataset in order to get a varied set of utterances. That way, only a few phrases would be included from each movie. Cleaning consisted of converting to lower case and removing all punctuation.
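
A minimal sketch of this sampling and cleaning step, assuming the English side of the corpus has been extracted to a plain-text file with one utterance per line; the file name is hypothetical:

```python
import string

def clean(utterance: str) -> str:
    # Lowercase and strip all punctuation, mirroring the pre-processing
    # used for the other datasets.
    return utterance.lower().translate(
        str.maketrans("", "", string.punctuation)).strip()

# Hypothetical file: one English subtitle line per row.
with open("opensubtitles_en.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f]

# Keep only every 1000th utterance so that only a few phrases
# come from any single movie.
oos_examples = [clean(line) for line in lines[::1000] if line]
```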

2 www.opensubtitles.org

(35)

3.2 Compared classifiers

Four classifiers were included in the comparison: support vector machines (SVM) and logistic regression (LR) with different feature representation techniques, the intent classification part of the NLU platform Snips, and the language model BERT.

3.2.1 Motivations

SVM and LR have both been used extensively for intent classification as well as for other text classification tasks, often yielding similar results. SVM is one of the main methods that have been used for OOD detection in NLU, and LR has the advantage of naturally providing probability scores for its predictions.

The feature representation techniques used along with SVM and LR were BoW (with and without tf-idf) and pre-trained GloVe word embeddings. BoW was included as it is an established text feature representation technique that is very simple to implement and understand, requires far fewer resources than pre-trained word embeddings, and usually provides a strong baseline. For the pre-trained word embeddings, there are many alternatives for English. GloVe was chosen in this project as it was used in two of the intent classification articles mentioned in Section 2.2.2 and in the OOS dataset article described in Section 2.3. It also exists in several versions with word embeddings of different dimensionality and does not require as much memory as some of the alternatives. One such alternative is FastText, which would, however, be particularly interesting in a multilingual setting as it is available in many different languages.

A platform offering intent classification as part of NLU as a service was included in order to see how a complete intent classification system combining multiple classifiers and hand-crafted tweaks performs relative to simple classifiers. As described in Section 2.2.4, there are a number of such platforms.

Snips was chosen in this project for two main reasons:

1. Snips NLU is open source with all source code available online which allows for looking into how the classification is performed.

2. Snips NLU is specifically developed for running locally on the device with low hardware requirements, which is of interest in social robotics applications.

As described in Section 2.2.4, the Snips intent classifier consists of two main parts, where the probabilistic part is based on logistic regression. This is one more reason to include LR in the comparison, as the creators of Snips probably compared different methods and found LR to be the best for their purposes.

By including both LR and Snips, we can also compare the performance of the platform classifier to that of the simple classifier it builds on, and thereby see whether adding a deterministic classifier and other extra features on top of it provides any value.

BERT was included in the comparison as it has previously achieved state-of-the-art results on many different tasks in natural language processing, including intent classification. It does, however, require a lot of memory and computational resources. It is therefore interesting to compare it to simpler classifiers like SVM and LR to see if it improves performance enough to justify this.

All the evaluated classifiers can return probabilities along with the predicted intents. These probabilities can be used for OOS detection using thresholds, where all utterances for which the top probability is under a given threshold are classified as OOS. For Snips, the probabilities are better described as confidence scores, as they do not generally sum to 1.
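
As an illustration, a minimal sketch of such threshold-based OOS detection, assuming the classifier returns one probability (or confidence score) per intent; the threshold value, labels, and probabilities below are only illustrative:

```python
import numpy as np

def predict_with_oos(probabilities, intent_labels, threshold=0.5):
    """Label an utterance as OOS when the top class probability
    (or confidence score) falls below the threshold."""
    top = int(np.argmax(probabilities))
    if probabilities[top] < threshold:
        return "OOS"
    return intent_labels[top]

# Example: per-intent probabilities for a single utterance.
probs = [0.30, 0.25, 0.45]
labels = ["orderCoffee", "getPrice", "goodbye"]
print(predict_with_oos(probs, labels, threshold=0.5))  # -> "OOS"
```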

3.2.2 Implementation details

All methods were implemented in Python and evaluated as described in Section 3.3. The Python module sklearn (version 0.20.4) was used extensively in the experiments.

BoW and GloVe embeddings

For the SVM and LR classifiers, four different vector representations were used.

BoW was used both with and without tf-idf scores, using the scikit-learn feature extraction classes TfidfVectorizer and CountVectorizer with default settings. With TfidfVectorizer, the term frequency is the raw count and, with the same notation as in Section 2.4.3, the inverse document frequency is computed as:

\mathrm{idf}_t = \log \frac{1 + N}{1 + \mathrm{df}_t} + 1 \qquad (3.1)

Then the vectors are normalised using the Euclidean norm.
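
A minimal sketch of how these two representations can be produced with scikit-learn; the example utterances are only illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

train_utterances = ["i would like a latte", "what time do you open"]

# Plain BoW counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(train_utterances)

# BoW with smoothed tf-idf weighting and Euclidean (l2) normalisation,
# which are the scikit-learn defaults.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(train_utterances)

# Test utterances are transformed with the vocabulary learned on training data.
X_test = tfidf_vec.transform(["when do you close"])
```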

With GloVe embeddings, utterances were represented as the average of the embeddings for the individual words. The embeddings used were the 50- and 300-dimensional embeddings trained on 6 billion tokens from a 2014 English Wikipedia dump and the fifth edition of the GigaWord newswire dataset3.
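
A minimal sketch of this averaging, assuming the GloVe vectors have been downloaded as a plain-text file; skipping out-of-vocabulary words is an assumption, as the thesis does not specify how they were handled:

```python
import numpy as np

def load_glove(path):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def embed_utterance(utterance, embeddings, dim=300):
    """Represent an utterance as the average of its word embeddings;
    words missing from GloVe are simply skipped (assumption)."""
    vectors = [embeddings[w] for w in utterance.split() if w in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# glove.6B.300d.txt is part of the "glove.6B.zip" archive referenced above.
glove = load_glove("glove.6B.300d.txt")
x = embed_utterance("i would like a large coffee", glove)
```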

No stemming or lemmatisation was performed apart from in the experiment specifically evaluating the effect of these steps, described in Section 4.1.3. There, stemming was performed with the Snowball stemmer and lemmatisation with the WordNet lemmatiser, both implemented using the stem package in the Python module nltk.
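
For reference, a minimal sketch of the two normalisation variants with nltk; the example utterance is illustrative, and the WordNet data must be available (e.g. via nltk.download("wordnet")):

```python
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatiser = WordNetLemmatizer()

utterance = "i was wondering about the opening times"

# Apply the chosen normalisation token by token and re-join the utterance.
stemmed = " ".join(stemmer.stem(w) for w in utterance.split())
lemmatised = " ".join(lemmatiser.lemmatize(w) for w in utterance.split())
```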

SVM

SVM was implemented using the SVC classifier in the SVM library from scikit-learn with a linear kernel, probability=True and all other parameters as given by default. This implementation was chosen as it allows for probability estimates, which is needed for the experiments with thresholds described in Section 4.2.1. In SVC, multi-class classification is performed according to the one-vs-one scheme and probability scores are calculated using five-fold cross-validation.

Classification was performed using BoW and GloVe embeddings as described above.
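
A minimal, self-contained sketch of this setup; the toy utterances and labels are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data (utterance, intent); purely illustrative.
train_texts = [
    "one coffee please", "a latte to go", "can i get a cappuccino",
    "bye", "see you later", "goodbye then",
]
train_labels = ["orderCoffee", "orderCoffee", "orderCoffee",
                "goodbye", "goodbye", "goodbye"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Linear-kernel SVM with probability estimates enabled; all other
# parameters are left at their scikit-learn defaults.
svm = SVC(kernel="linear", probability=True)
svm.fit(X_train, train_labels)

X_test = vectorizer.transform(["thanks, bye"])
print(svm.predict(X_test))        # predicted intent
print(svm.predict_proba(X_test))  # per-intent probability estimates
```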

Logistic regression

Logistic regression was implemented using the LogisticRegression class from scikit-learn with the lbfgs solver and one-vs-rest multi-class classification.

All other parameters were as given by default. Vector representation of the utterances was done in the same way as for the SVM, using BoW and GloVe embeddings.
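
A corresponding minimal sketch for the logistic regression classifier, again with purely illustrative toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data; purely illustrative.
train_texts = ["one coffee please", "a latte to go", "bye for now", "see you later"]
train_labels = ["orderCoffee", "orderCoffee", "goodbye", "goodbye"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Logistic regression with the lbfgs solver and one-vs-rest
# multi-class classification; other parameters are defaults.
logreg = LogisticRegression(solver="lbfgs", multi_class="ovr")
logreg.fit(X_train, train_labels)

X_test = vectorizer.transform(["goodbye then"])
print(logreg.predict(X_test))
print(logreg.predict_proba(X_test))  # probabilities usable for OOS thresholding
```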

Snips

The Snips classifier was implemented using the Python package snips_nlu. In the OOS detection experiments, all utterances classified as the inbuilt None intent were interpreted as OOS.
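
A minimal sketch of how this can look with snips_nlu, assuming the training data has already been converted to the Snips dataset JSON format; the file name is hypothetical, and the fields read from the parse result follow the documented output format:

```python
import io
import json

from snips_nlu import SnipsNLUEngine

# Hypothetical file: the training utterances converted to the Snips
# dataset JSON format (one entry per intent with example utterances).
with io.open("coffeeshop_snips_dataset.json", encoding="utf-8") as f:
    dataset = json.load(f)

engine = SnipsNLUEngine()
engine.fit(dataset)

parsing = engine.parse("could i get a cappuccino")
intent_name = parsing["intent"]["intentName"]  # None for the inbuilt None intent
confidence = parsing["intent"]["probability"]  # confidence score

# Utterances resolved to the None intent are treated as out of scope.
prediction = "OOS" if intent_name is None else intent_name
```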

BERT

The BERT implementation was taken from the code4 accompanying the article "BERT for joint intent classification and slot filling" [12], only keeping the intent classification parts of the model.

3 "glove.6B.zip", available at https://nlp.stanford.edu/projects/glove/

4 Available at: https://github.com/90217/joint-intent-classification-and-slot-filling-based-on-BERT

The BERT model used was BERT_BASE (see Section 2.4.5) and the implementation was written in TensorFlow. Fine-tuning was done with the objective of minimising cross-entropy loss, using a maximum input length of 50, a learning rate of $5 \times 10^{-5}$, and a dropout probability of 0.1. Training was in all cases run for a maximum of 30 epochs, saving the model after every epoch that improved the validation accuracy.

The only hyperparameter not taken from the original code is the batch size, for which the performance of four different values was analysed.

Table 3.5 shows validation accuracies for the different datasets and four different batch sizes. For SnipsData, batch size 128 was chosen since all batch sizes yielded the same validation accuracy and 128 is what was used in [12]. For CoffeeShop and OosEval, the top performance was obtained with batch size 16, which was therefore used for these datasets in all experiments.

Table 3.5: BERT validation accuracies for the different datasets using different batch sizes. Accuracies for the chosen batch sizes are marked with *.

Batch size   SnipsData   CoffeeShop   OosEval
16           0.986       0.880*       0.960*
32           0.986       0.853        0.958
64           0.986       0.860        0.957
128          0.986*      0.840        0.953

For the experiment evaluating how the classification performance depends on the number of training examples (see Section 4.1.4), early stopping with patience 3 was used. This means that training was stopped if the validation accuracy had not improved in three consecutive epochs.
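
A minimal sketch of that early stopping logic; train_one_epoch, evaluate_on_validation and save_model are hypothetical placeholders for the corresponding steps in the fine-tuning code:

```python
def fine_tune_with_early_stopping(train_one_epoch, evaluate_on_validation,
                                  save_model, max_epochs=30, patience=3):
    """Stop fine-tuning once validation accuracy has not improved for
    `patience` consecutive epochs; keep the best checkpoint."""
    best_accuracy = 0.0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        accuracy = evaluate_on_validation()
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_improvement = 0
            save_model()  # keep the best checkpoint so far
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_accuracy
```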

3.3 Evaluation

For all datasets, intent classification was evaluated using accuracy, and out-of-scope detection using out-of-scope recall, as those metrics were used in the related work that this project builds upon. When an utterance is classified as OOS, it usually triggers a request for repetition or clarification. If an OOS utterance is misclassified as one of the INS intents, it might trigger any kind of action. It is therefore more important that OOS utterances are not classified as random intents than that INS utterances are not classified as OOS, hence the focus on recall for OOS.
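
As an illustration, a minimal sketch of the two metrics, assuming gold labels and predictions use a dedicated "OOS" label; the label string and example data are illustrative:

```python
def accuracy(y_true, y_pred):
    """Overall classification accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def oos_recall(y_true, y_pred):
    """Fraction of true OOS utterances that were classified as OOS."""
    oos_indices = [i for i, t in enumerate(y_true) if t == "OOS"]
    if not oos_indices:
        return 0.0
    return sum(y_pred[i] == "OOS" for i in oos_indices) / len(oos_indices)

# Example: two OOS utterances, one caught and one misclassified.
y_true = ["orderCoffee", "OOS", "goodbye", "OOS"]
y_pred = ["orderCoffee", "OOS", "goodbye", "getPrice"]
print(accuracy(y_true, y_pred))    # 0.75
print(oos_recall(y_true, y_pred))  # 0.5
```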
