
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Quality Assessment of Conversational Agents

Assessing the Robustness of Conversational Agents to Errors and Lexical Variability

JONATHAN GUICHARD


School of Computer Science and Communication
Master in Computer Science

Date: March 2018

Author’s Email: jgui@kth.se
Supervisor: Gabriel Skantze
Examiner: Olov Engwall

Principal: Anthony Ventresque, University College Dublin

Translated title: Kvalitetsutvärdering av konversationsagenter - att bedöma robustheten hos konversationsagenter mot fel och lexikal variabilitet


Abstract

Assessing a conversational agent’s understanding capabilities is critical, as poor user interactions could seal the agent’s fate at the very beginning of its lifecycle with users abandoning the system. In this thesis we explore the use of paraphrases as a testing tool for conversational agents. Paraphrases, which are different ways of expressing the same intent, are generated based on known working input by performing lexical substitutions and by introducing multiple spelling divergences. As the expected outcome for this newly generated data is known, we can use it to assess the agent’s robustness to language variation and detect potential understanding weaknesses. As demonstrated by a case study, we obtain encouraging results as it appears that this approach can help anticipate potential understanding shortcomings, and that these shortcomings can be addressed by the generated paraphrases.


Sammanfattning (Swedish Abstract)

Assessing a conversational agent’s language understanding is critical, since poor user interactions can determine whether the agent becomes a success or a failure at the very beginning of its lifecycle. In this report we investigate the use of paraphrases as a testing tool for these conversational agents. Paraphrases, which are different ways of expressing the same intent, are created from known input data by performing lexical substitutions and by introducing several spelling divergences. Since the expected outcome for this input data is known, we can use the results to assess the agent’s robustness to language variation and detect potential weaknesses in understanding. As shown by a case study, we obtain encouraging results, since this approach appears able to help anticipate potential shortcomings in understanding, and these shortcomings can be handled by the generated paraphrases.


Contents

1 Introduction
1.1 Problem Definition
1.2 Objective
1.3 Boundaries
1.4 Target Audience
1.5 Acknowledgments
1.6 Glossary

2 Theoretical Background
2.1 What is a paraphrase?
2.2 An Overview of Machine Translation
2.2.1 Rule-Based Machine Translation
2.2.2 Data-Driven Approaches
2.3 Assessing Machine-Produced Sentences
2.3.1 Human Evaluation
2.3.2 Lexical Similarity
2.4 Natural Language Analysis Tools
2.4.1 Tokenization
2.4.2 Parts-of-Speech Tagging
2.4.3 Dependency Parsing
2.5 Language Models
2.6 Word Vectors
2.7 Word-Sense Disambiguation
2.8 Spelling Corrector

3 Related Work
3.1 Automatic Paraphrasing
3.1.1 Manual Collection
3.1.2 Utilizing Lexical Resources
3.1.3 Utilizing Text Corpora
3.1.4 Evaluating the Quality of Paraphrases
3.2 Sentiment Shifting

4 Our Contribution
4.1 Presentation of the Testing Framework
4.2 Overview of our Approach
4.3 Lexical Substitutions
4.3.1 Generic Lexical Substitutions
4.3.2 Targeted Lexical Substitutions
4.4 Spelling Shifting
4.5 Noise Generation
4.5.1 Confused Words
4.5.2 Misspelled Words
4.5.3 Character Replacement
4.5.4 Character Swapping

5 Results and Evaluation
5.1 Qualitative Evaluation of our Adversarial Examples Generation Techniques
5.1.1 Lexical Substitutions
5.1.2 Spelling Shifting
5.1.3 Noise Generation
5.2 Case Study of our Proposed Test Framework
5.3 Enriching the Training Data with Adversarial Examples

6 Conclusion

7 Improvements and Future Work

Bibliography

A Authorized Edits for the Spelling Shifting Algorithm
B Parameters Used for the Generation of Adversarial Examples

Chapter 1

Introduction

A conversational agent is a piece of software intended to converse with a human user, whether to provide specific services, serve as a virtual assistant, or take part in social conversations. In recent years, these agents, also known as chatbots, have become more and more popular.

Many companies see them as a cost-effective and viable solution for handling great volumes of simple customer requests, while users enjoy a convenient, fast, and always accessible service. [14] Big “tech” firms have recently developed their own personal assistants, for example Apple with Siri or Microsoft with Cortana, and both Google and Amazon have released physical devices based on their assistants.

Despite the recent progress made in the field, conversational agents still seem to have a hard time winning over people’s hearts: according to some analysts, their current retention rate remains low, at an average of just 4% over a 7-day time frame. [6] As with any service or product, the final quality of these agents is assessed by the users.

Characteristics such as efficiency, reliability and effectiveness will indeed impact users’ willingness to use the service again and again. But when it comes to conversational agents, properly understanding what is being said appears to be a central element that can have a serious impact on the final user experience. An agent unable to understand the natural language of its users, forcing them to constantly rephrase their requests, seems unlikely to have a high retention rate.

It therefore appears that being able to assess the understanding of such conversational agents during their development could provide some useful insight, allowing developers and researchers to focus on the identified weak points of their agents, which could ultimately lead to an improvement of the user experience.


1.1 Problem Definition

In this thesis, we investigate how to automatically test the robustness of text-based conversational agents to language variation, limiting our study to the intent recognition aspect of those agents.

In the context of linguistics, language variation refers to all the different ways of expressing the same semantic content. These variations can take the form of different word choices and different grammar, and are often related to the regional and socio-economic background of the speaker.

The conversational agents under test are strictly text-based, meaning that users must type their queries in natural language to interact with them (e.g. “Add a meeting with John Doe tomorrow at 2pm”).

Moreover, we only consider agents that can be represented as an automaton, with the transitions being queries formulated by the user and each state triggering an action and a written response from the agent.

This structure makes it easy to verify that an agent correctly understands a query by checking the resulting state. This therefore excludes agents with a much more complex structure, for example agents that try to have social conversations with their users, such as Cleverbot (http://www.cleverbot.com/).

Briefly put, the problem can be summarized as automatically performing adversarial example testing for conversational agents. An adversarial example is defined by Kurakin, Goodfellow, and Bengio [17] as “a sample of input data which has been modified [...] in a way that is intended to cause a machine learning classifier to misclassify it”.

In our case, we wish to generate adversarial examples that retain the same meaning while expressing it in a range of different ways: we call this operation paraphrasing. We do not, however, try to trick the conversational agent per se; we rather try to mimic and emulate the way users naturally express themselves. This entails phenomena such as the use of slang words and colloquial expressions, as well as grammar, spelling, and homophone mistakes.

1.2 Objective

The goal of this thesis is to investigate how the quality of conversational agents can be assessed. More precisely, we focus on how a conversational agent can be automatically tested in terms of robustness to errors and lexical variability.

We therefore explore how paraphrasing can be automated so that generated sentences carry the same semantic meaning while having a different syntax, using different words and expressions.

The research question we are examining in this thesis is: “Can para- phrasing techniques help improve the understanding capabilities of Conversational Agents?”

1.3 Boundaries

We would first like to point out that this project continues and extends previous work that has been carried out at University College Dublin.

The testing framework introduced in this thesis is therefore not an original contribution by the author, but rather serves as a starting point for our work.

Moreover, we would also like to point out that the goal of this thesis is not to propose ways of improving the underlying algorithms of conversational agents. We indeed focus on the testing aspect, and consider these agents as black boxes: we are neither interested in nor concerned about the multiple techniques and approaches to building a conversational agent.

Finally, we solely focus on English as the language being used by the conversational agents and the users.

1.4 Target Audience

We believe that this report will be of great interest to conversational agent developers, as we aim to provide them with a novel testing tool. We moreover think that this report will attract the interest of computer scientists involved in Natural Language Understanding, as our goal is to assess the quality of such tools in the context of conversational agents. Finally, this report should also appeal to anyone interested in automated lexical paraphrasing.


1.5 Acknowledgments

This work has been supported by the funding of Microsoft Skype to Lero - the Irish Software Research Centre (www.lero.ie). The author would also like to thank Ross Smith and Dan Bean from the Skype Division at Microsoft for their support during this project.

1.6 Glossary

Lemma The canonical form of a word, i.e. the base form that can be found in a dictionary. For example, run is the lemma of run, runs, ran and running. Not to be confused with Stem.

Lexical Category The category of a word, such as noun or verb.

Morphology The study and description of word formation, such as inflection, derivation, and compounding.

N-gram A contiguous sequence of N items. In this document, unless otherwise specified, the items are words.

Precision In binary classification, the number of relevant results obtained (true positives) over the total number of results obtained (true positives and false positives).

Recall In binary classification, the number of relevant results obtained (true positives) over the total number of relevant results (true positives and false negatives).

Stem The truncated portion of a word that is common to all its inflected variants. For example, wait is the stem of wait, waits, waiting and waited. Not to be confused with Lemma.

Surface Form The inflected form of a word, the form of a word as it is encountered in texts. For example, waiting is a surface form of wait.


Chapter 2

Theoretical Background

2.1 What is a paraphrase?

The concept of paraphrasing is generally defined as semantic equivalence: a paraphrase is an alternative way of expressing the same content as the original form. Paraphrases can occur at different levels [20]:

Lexical paraphrases: Individual terms that have a similar meaning are usually referred to as lexical paraphrases, for example ⟨nice, good⟩ or ⟨drink, gulp⟩. Lexical paraphrases are moreover not limited to synonyms and can also include other forms of semantic relations such as hyper- and hyponymy, for example ⟨labrador, dog⟩.

Phrasal Paraphrases: Fragments of text sharing the same semantic content are referred to as phrasal paraphrases. These fragments can be either fixed, such as ⟨kick the bucket, pass away⟩, or contain linked variables, such as ⟨the X1 of X2, X2’s X1⟩.

Sentential Paraphrases: Two sentences that have the same meaning are referred to as sentential paraphrases, for example ⟨He was found guilty by the jury, The jury convicted him⟩.

2.2 An Overview of Machine Translation

One way of approaching our paraphrasing problem is to consider it as a simplified translation problem, with the source and target languages being the same. Techniques and tools used in Machine Translation could therefore prove very useful.


2.2.1 Rule-Based Machine Translation

Rule-Based Machine Translation (RBMT), also known as Classical Machine Translation, is not any particular method but rather a paradigm that groups four main approaches. These approaches all have in common that they mostly rely on bilingual dictionaries and crafted rules, and they all have an analysis stage, a transfer stage and a generation stage. The Vauquois Triangle, shown in figure 2.1, represents each of the aforementioned stages with its sides, and helps us visualize the level at which each approach operates. The closer to the top of the triangle, the more complex the approach.

Figure 2.1 – The Vauquois Triangle, illustrating the different levels at which rule-based machine translation can occur. SL stands for Source Language, TL for Target Language.

Unlike other RBMT approaches, direct translation does not try to build any abstract representation of the input, but rather directly applies the specified rules and dictionaries to the source sentence. The most simplistic way to do this consists of naively translating sentences word-by-word, but more complex steps can be added before and after this operation. This includes, but is not limited to, morphological analysis and synthesis, word re-ordering, rules for prepositions, and manually handling idioms, compounds, and other special cases. [38]

The syntactic and semantic transfer approaches are very similar in the sense that they first analyze the input sentence in order to build an abstract representation of it. The transfer from the source to the target language is then performed with rules operating on this representation. Words are translated in the same fashion as for direct translation, through look-up in a dictionary. Once those two steps have been completed, a proper sentence can be generated from the transformed representation. [39] The main difference between syntactic and semantic transfer is the level of abstraction at which they operate. While syntactic transfer only tries to model the syntax, semantic transfer also tries to analyze and represent the semantic structure of the source sentence.

Finally, the last main approach in RBMT is interlingua-based. The idea with this approach is to analyze the input and build a language-neutral representation from which a sentence in the target language can be generated. In other words, rather than directly translating from one natural language to another, the interlingua is used as an intermediary. This interlingua does not need to be a natural language and can for example be a formal language, making it easier for a computer to work with. [39] This approach does not have a transfer stage; the translation instead happens during the analysis and generation stages. While adding support for new languages should be easier with this approach, the interlingua needs to be able to capture the semantic meaning and the phenomena of all supported languages, which are both hard problems.

While RBMT methods have been successfully used for commercial applications, such as SYSTRAN [36], their main drawback is the knowledge acquisition bottleneck. These methods indeed require extensive linguistic expertise in each supported language in order to properly craft the rules, which are numerous and language-pair dependent. However, despite this low-coverage issue, high quality and consistency can be achieved for restricted domains where the language is controlled (e.g. weather reports). [38] Moreover, using a rule-based approach makes it possible to tailor the translations to specific needs by simply altering the dictionaries and rules in use.

2.2.2 Data-Driven Approaches

Contrary to rule-based paradigms, data-driven approaches aim to infer translation mechanisms by relying on large amounts of training data, namely bilingual human-translated texts. Data-driven paradigms can be further broken down into the following approaches:

• Statistical Machine Translation (SMT) treats translating as a probabilistic task, the goal being to find the most likely output sequence given an input. This is done by relying on statistical models representing the provided training data.

• Example-Based Machine Translation (EBMT) performs translation by analogy, using previously translated data and adapting it to the sentence to translate.

• Neural Machine Translation (NMT) is an approach based on deep learning techniques.

While EBMT has no known commercial application, SMT was the approach used by Google Translate from 2006 until it was replaced by an NMT engine in 2016.

2.3 Assessing Machine-Produced Sentences

Assessing the quality of a paraphrase and of a translation can be seen as two very similar tasks. In both cases we want to verify that the semantic content remains the same and that the generated sentence appears fluent.

2.3.1 Human Evaluation

The most straightforward way to evaluate the quality of translations is through the use of human judges. We briefly present some of the methods used by the EuroMatrix project. [40]

One way a translation can be evaluated is by rating its fluency and its adequacy. Fluency refers to the degree to which the translation is properly formed and understandable by the judge. A five-point scale can for example be used for that purpose, grading a translation as Incomprehensible, Disfluent English, Non-Native English, Good English or Flawless English. Adequacy refers to the quantity of information from the original sentence that is still present in the translation. It can be evaluated using the five-point scale None, Little, Much, Most and All.

While such an evaluation method at first seems reasonable, the concepts of fluency and adequacy can be hard to distinguish and grade accordingly. Moreover, there are no clear guidelines on how to quantify meaning or how many grammatical errors separate the different levels of fluency.

Other methods to manually evaluate the quality of a translation include the time needed to read a translated paragraph, or the post-editing time required to form grammatically correct and content-adequate translations. Similarly to fluency and adequacy, concepts such as clarity or informativeness can be graded on a discrete scale. [40]

2.3.2 Lexical Similarity

Metrics based on lexical similarity score a translation hypothesis by the number of N-grams it has in common with a set of human-produced references. The idea behind this family of metrics is that a good translation engine will tend to use the same words and groups of words as a human expert. This is for example the principle on which BLEU [27] is based, a metric which has become the standard in the machine translation field.

While those metrics can easily be automated and are very helpful for quickly comparing translation techniques, it is important to keep in mind that they should be used and interpreted as a relative scale, not an absolute one.
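To illustrate how such an N-gram overlap metric is computed in practice, here is a small sketch using NLTK’s BLEU implementation (NLTK is also one of the libraries used in chapter 4); the sentences and the smoothing choice are illustrative assumptions.

```python
# Minimal sketch: N-gram overlap scoring with BLEU, using NLTK's
# implementation. The sentences here are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["add", "a", "meeting", "with", "john", "tomorrow"]]
candidate = ["schedule", "a", "meeting", "with", "john", "tomorrow"]

# Smoothing avoids zero scores when some higher-order N-grams never match.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")  # higher means more N-gram overlap
```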

2.4 Natural Language Analysis Tools

2.4.1 Tokenization

In Natural Language Processing, tokenization is the process of demarcating and splitting sections of a sequence of characters. Tokenization is used to identify and retrieve words from an input sentence, but it does not always split on white spaces as a native speaker would intuitively do. For example, “doesn’t” is usually tokenized as “does” and “n’t”.

2.4.2 Parts-of-Speech Tagging

Parts-of-Speech (PoS) Tagging is the process of classifying and marking words appearing in a sentence with their grammatical category, for example Noun Plural or Verb 3rd Person Singular. This usually requires the sentence to first be tokenized, which can be done separately or implicitly by the algorithm itself as a preprocessing step.
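As a concrete illustration of these two steps, the following sketch uses NLTK (one of the libraries used in chapter 4); the sample sentence is illustrative, and the exact tags depend on the tagger model in use.

```python
# Small sketch of tokenization and PoS tagging with NLTK; requires the
# 'punkt' and 'averaged_perceptron_tagger' resources (nltk.download(...)).
import nltk

tokens = nltk.word_tokenize("He doesn't like meetings")
print(tokens)   # ['He', 'does', "n't", 'like', 'meetings']

tagged = nltk.pos_tag(tokens)
print(tagged)   # e.g. [('He', 'PRP'), ('does', 'VBZ'), ...]
```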


2.4.3 Dependency Parsing

Dependency Parsing seeks to describe the structure of a sentence in terms of words and an associated set of directed grammatical relations that exist between these words. Figure 2.2 showcases an example of dependency parsing.

Figure 2.2 – Example of a dependency analysis as a tree. Figure based on [22].

2.5 Language Models

In linguistics, a language model is a probabilistic tool that helps characterize sequences of words. More specifically, given a sequence of n words w_1, ..., w_n, the model computes the probability P(w_1, ..., w_n) of encountering this sequence in a given language. In other words, such models are useful for assessing fluency. For example, the sentence “I like this new book” has a higher fluency, and thus probability, than “I this book new like”.

For a given sequence of words w_1, ..., w_n, which we now write as w_1^n, we can decompose P(w_1^n) using the chain rule, as shown in equations 2.1 and 2.2. [16]

P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2) ... P(w_n|w_1^{n-1})   (2.1)
         = ∏_{k=1}^{n} P(w_k|w_1^{k-1})   (2.2)

However, rather than computing the full conditional probabilities as in equation 2.2, which is a computationally hard task, these probabilities can be approximated with a language model. For an N-gram language model, the approximation is as shown in equation 2.3. [16]

P(w_n|w_1^{n-1}) ≈ P(w_n|w_{n-N+1}^{n-1})   (2.3)


For example, the sequence “I like pie” will be assessed as follows (the symbol <s> marks the beginning of the sequence):

• With the unigram model: P(I, like, pie) = P(I)P(like)P(pie)

• With the bigram model: P(I, like, pie) = P(I|<s>)P(like|I)P(pie|like)

• With the trigram model: P(I, like, pie) = P(I|<s>, <s>)P(like|<s>, I)P(pie|I, like)

The probabilities are obtained by performing word counts on a training corpus. The formula used to compute these probabilities is shown in equation 2.4. [16]

P(w_n|w_{n-N+1}^{n-1}) = Count(w_{n-N+1}, ..., w_{n-1}, w_n) / Count(w_{n-N+1}, ..., w_{n-1})   (2.4)

The main problem with N-gram language models is that they are trained on a finite corpus. Therefore, some perfectly valid sequences might get a null probability simply because they have never been encountered before. In order to correct this problem, a step known as smoothing can be introduced. One simple example of smoothing is Add-One Smoothing, which consists in adding one to all the N-gram counts before computing the probabilities.

2.6 Word Vectors

Word vectors are a representation of text where words with a similar meaning have a similar representation, namely high-dimension real-valued vectors. This representation is learned, typically through the processing of a corpus of text and the use of machine learning tools.

Word vectors are based on the distributional hypothesis, which states that words occurring in the same context tend to have a similar meaning, and that a word can be judged by the company it keeps. [33] These vector representations therefore characterize a word based on the context in which it frequently occurs. It is important to note that word vectors cannot be interpreted on their own, as the dimensions of the vectors do not correspond to any concrete concept.

Word vectors are very useful since they are able to capture semantic relationships to some extent, and enable vector arithmetic on words, as illustrated by equation 2.5. [23]


King − Man + Woman ≈ Queen   (2.5)

The vector representation behind equation 2.5 could perhaps look loosely like equation 2.6, with the vectors appearing in the same order:

(0.99, 0.97, 0.03, 0.30, ...) − (0.02, 0.99, 0.03, 0.26, ...) + (0.01, 0.07, 0.96, 0.60, ...) ≈ (0.96, 0.05, 0.97, 0.57, ...)   (2.6)

Based on equation 2.6, the first dimension of the vectors could be interpreted as representing the concept of “royalty”, the second as “masculinity”, and the third as “femininity”. It is however important to note that this is only an example for the sake of understanding and does not correspond to the reality of the algorithm in any way. The dimensions of word vectors are indeed impossible to interpret and should simply be considered as an abstract representation allowing us to perform mathematical operations on words.

Finally, word vectors can also be used to compute the semantic similarity of words simply by computing the similarity of their vector representations.
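The following short sketch shows both uses with Gensim, the word-vector interface used in chapter 4; it assumes the pre-trained Google News vectors have been downloaded locally, and the example words are illustrative.

```python
# Sketch of vector arithmetic and similarity with Gensim; assumes the
# pre-trained Google News vectors are available as a local file.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~= queen (equation 2.5)
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=1))

# Cosine similarity between two words (used for scoring in chapter 4).
print(vectors.similarity("meeting", "appointment"))
```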

2.7 Word-Sense Disambiguation

Word-Sense Disambiguation (WSD) is an open problem in Natural Language Processing, whose goal is to identify the correct sense of a polysemous word used in a sentence. One famous example of a WSD task is to determine the sense of the word pen in the following passage [2]:

“Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.”

Here, the word pen seems more likely to refer to an enclosure in which children play rather than a writing utensil filled with ink, as suggested by the relative size of both objects.

There are four conventional paradigms for WSD: dictionary-based, supervised, semi-supervised, and unsupervised approaches. The last three mainly involve machine learning algorithms which use fully annotated, sparsely annotated, or unannotated corpora of text as their training data. The dictionary-based methods, on the other hand, make the hypothesis that words used together in a text are related, and that this relation can be observed in the definitions of the words and their senses. This hypothesis is the basis of the Lesk algorithm and its variants. For example, the Simplified Lesk algorithm determines the most probable sense of a word by computing the overlap between the definition of each sense and the neighbors of the ambiguous word: the biggest overlap corresponds to the most probable sense.

2.8 Spelling Corrector

For a word w, the goal of a spelling corrector is to find the most likely correction ĉ out of all the candidate corrections, as expressed by equation 2.7. Equations 2.8 and 2.9 are derived using Bayes’ Theorem and the fact that P(w) is the same for all corrections.

ĉ = arg max_{c ∈ candidates} P(c|w)   (2.7)
  = arg max_{c ∈ candidates} P(c)P(w|c)/P(w)   (2.8)
  = arg max_{c ∈ candidates} P(c)P(w|c)   (2.9)

The three parts of equation 2.9 are: [26]

1. A Candidate Model, expressing which candidate corrections to consider.

2. A Language Model P(c), expressing which corrected words c are the most likely to appear in an English text.

3. An Error Model P(w|c), expressing the probability that w would be typed when the author meant c.

A simple approach to the candidate model is to consider simple edits to a word. These edits include deletion of a character, insertion of a character, replacement of a character with another, and transposition of two adjacent characters. Candidate corrections of a word w are then obtained by applying all possible simple edits to w recursively, up to a certain depth called the editing distance. This set of candidate corrections is then filtered with a dictionary in order to only keep existing, correctly spelled words as candidates.

The language model gives the probability of encountering a word in the English language. This can for example be obtained by computing the frequency of the said word in a text corpus, and using this frequency as a probability.

Finally, a basic approach to the error model is to consider that all corrections of editing distance 0 are infinitely better than corrections of editing distance 1, and so on and so forth. In practice, this simple model implies that the search for candidate corrections is done in a breadth-first fashion.

Moreover, while more complex models can be used to improve performance, these basic strategies are still able to produce satisfying results.
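The following compact sketch follows equations 2.7 to 2.9 in this spirit; the tiny word-frequency table is an illustrative assumption (a real corrector of this kind, such as the one described in [26], counts words in a large corpus).

```python
# Compact spelling-corrector sketch following equations 2.7-2.9:
# candidates from simple edits, a unigram language model P(c) from
# word counts, and an error model that prefers smaller edit distances.
from collections import Counter

# Illustrative corpus; a real corrector would count words in a large text.
WORDS = Counter("add a meeting with john add a meeting tomorrow".split())

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word: str) -> set:
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(words) -> set:
    return {w for w in words if w in WORDS}

def correction(word: str) -> str:
    # Breadth-first: distance 0 beats distance 1 beats distance 2.
    candidates = (known([word]) or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or [word])
    return max(candidates, key=lambda w: WORDS[w])  # arg max P(c)

print(correction("meetng"))  # -> 'meeting'
```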


Chapter 3

Related Work

3.1 Automatic Paraphrasing

The process of automatically paraphrasing a given input sentence can be broken down into two main steps, namely the acquisition of paraphrasing data and rules, and the generation of new sentences using this knowledge.

According to Barzilay and McKeown [4], there are three main types of approaches to the problem: manual collection of paraphrases, utilization of lexical resources, or corpora-based acquisition.

3.1.1 Manual Collection

The manual collection of paraphrases is usually performed through the use of crowd-sourcing platforms, where users are paid a small amount in exchange for performing quick and simple tasks. In this section, we briefly describe some of the most notable pieces of work relying on this approach.

In “Collecting paraphrase corpora from volunteer contributors” [8], the corpus of paraphrases is built through a game called “1001 paraphrases”. Users are given a sentence that they must reformulate, and the goal is for them to correctly guess a hidden reference paraphrase. The guesses entered by the users are thus paraphrases of the reference. This method was able to produce 14,850 distinct contributions for 400 distinct initial sentences. Moreover, the contributions made by the users seem to be of good quality according to the author.

An alternative approach introduced by Chen and Dolan [7] builds a much more extensive corpus of approximately 85,000 paraphrases. These paraphrases were collected by asking users of Amazon’s Mechanical Turk to describe the clear and unambiguous action performed in a very short video clip. The collected paraphrases are then used to train a statistical machine translation engine, by considering the paraphrasing task as an English-to-English translation problem. A second method for collecting paraphrases was also tested on Amazon’s Mechanical Turk: instead of describing short clips, users were asked to directly paraphrase a given sentence. This task was deemed more difficult and less enjoyable, and produced paraphrases that were much more similar to the original sentences.

Finally, Xu [42] proposes a solution to paraphrase acquisition based on data from Twitter, which significantly decreases the cost of the acquisition process by eliminating the need for complex resources such as videos and by reducing the amount of human work required. Tweets are first collected based on trending topics, and are then split into sentences. Users of Amazon’s Mechanical Turk are then given one original sentence and ten candidate paraphrases, and are asked to select the candidates that have the same meaning as the original sentence. Such candidates are thus considered as possible paraphrases.

3.1.2 Utilizing Lexical Resources

Paraphrases can also be generated by relying on existing lexical resources, such as thesauruses or dictionaries. This approach is however limited to generating lexical paraphrases.

In “Generation that exploits corpus-based statistical knowledge” by Langkilde and Knight [18], the authors build a natural language generator able to produce multiple plain-English sentences from an abstract representation. The generated sentences are obtained by using all the possible fitting synonyms known to the system, and are thus paraphrases. This contribution is however not able to generate paraphrases from a sentence in plain English, but rather from an abstract representation of this sentence.

A different task relying on the same lexical resources has also been investigated by Hassan et al. [13]. In this work, the authors aim to identify the most probable synonyms of a word appearing in a sentence. Multiple resources, such as WordNet [24] and Microsoft’s Encarta encyclopedia, are used to extract synonym sets for a given word. The candidate replacements of a target word are then ranked by combining several sub-rankings:


• Lexical Baseline: candidates suggested by multiple thesauruses are ranked higher.

• Machine Translation: the input sentence is translated back-and-forth and the candidates appearing in the resulting sentence are ranked higher.

• Most Common Sense: the first candidate appearing in the synonym set is ranked higher.

• Language Model: the candidates that appear the most frequently in the same N-grams as the target word are ranked higher.

• Latent Semantic Analysis: the candidates that are the most semantically related to the original sentence are ranked higher.

• Information Retrieval: the rank of a candidate is determined by the number of pages returned by a search engine when querying the original sentence with the target word replaced by the candidate.

• Word Sense Disambiguation: the target word is disambiguated and its synonyms are proposed as candidates.

According to the authors, this approach outperformed its competitors at the time in the SemEval 2007 Lexical Substitution task, consistently ranking first or second.

3.1.3 Utilizing Text Corpora

The text corpora approach can be further broken down into four main categories depending on the nature of the primary data being used: monolingual data (see [28]), monolingual parallel data (e.g. different translations of the same book), monolingual comparable data (e.g. news articles mentioning the same events), and multilingual parallel data (e.g. sentences that are translations of each other).

“Expansion of multi-word terms for indexing and retrieval using morphology and syntax” by Jacquemin, Klavans, and Tzoukermann [15] is the first attempt to automatically identify paraphrasing rules from corpora, the goal being to improve coverage for Information Retrieval systems. The authors based their approach on a set of manually crafted grammatical and morphological rules. These rules are applied to a set of precompiled words of interest, which are dynamically expanded and matched with the text, thus identifying paraphrases. While it is able to expand term variation by at least 30% for French, this approach requires extensive tuning to be applied successfully to other languages. It is moreover limited to the rephrasing of specialized, domain-specific terms. [28]

Shinyama, Sekine, and Sudo [34] propose a much simpler and less labor-intensive approach to paraphrase extraction, based on different translated versions of the same news article. The extraction is performed by identifying common named entities (such as proper nouns), assuming that sentences sharing many of those will be paraphrases of each other. The paraphrasing rules, expressed in the form of a synchronous context-free grammar, are then extracted with a dependency tree. This method achieves interesting results but offers low coverage.

Barzilay and Lee [3] expand on the idea previously developed by Shinyama, Sekine, and Sudo [34], that is, acquiring paraphrases from text corpora. In this work, however, the paraphrases are acquired from a monolingual corpus of news articles relating similar but not necessarily identical events. Moreover, the detection of potential paraphrases is performed with a different method, rather than relying on named entities. Sentences are first clustered to form groups of potential paraphrases. The clustered sentences are then processed through a multiple sequence alignment algorithm, which builds a graph representation of the cluster, helping identify common and divergent word paths. Paraphrasing patterns are then extracted from this graph in the form of synchronous context-free grammar rules. This approach achieves striking results and better coverage than previous work, but has been demonstrated to be of limited generality. [31]

Quirk, Brockett, and Dolan [31] try to improve the acquisition of paraphrases from parallel source data, as done previously, with the use of statistical machine translation tools. The goal is thus to find the best paraphrase T of a sentence S, as shown in equations 3.1 and 3.2.

T = arg max_T P(T|S)   (3.1)
  = arg max_T P(S|T)P(T)   (3.2)

Similarly to previous work, this approach is also based on news stories reporting on the same events, but the number of sources is greater than in previous contributions. The collected articles are first clustered by story, then the sentences within a cluster are aligned based on the editing distance. For each pair of sentences, words are aligned using Giza++. Blocks of aligned words that are contiguous in their respective sentences are considered to be paraphrases and added to a database with a corresponding replacement probability. The construction of the database concludes the training step. For a given input sentence, the paraphrase generation process begins by building a lattice of all possible replacements (see an example in figure 3.1) by browsing the database. The vertices represent positions in the original sentence and the edges are possible replacements. The optimal path through this lattice is scored as the product of all the replacement probabilities (P(S|T)) with the probabilities from the trigram language model (P(T)). This optimal path is computed with the Viterbi algorithm, and corresponds to the suggested paraphrase for the given input. This approach seems to perform better than previous attempts, with a human acceptance rate of up to 91.5% for 59 sentences, compared to 78.0% for Barzilay and Lee [3] on the same data. Data sparseness is however a key problem in this approach.

Figure 3.1 – A truncated example of a generation lattice as used by Quirk, Brockett, and Dolan [31]. Vertices are positions in the original sentence, and edges are possible replacements. Figure based on [31], simplified.
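To illustrate the scoring idea only (this is not the authors’ actual system), here is a toy best-path search over such a replacement lattice; the lattice contents, probabilities, and the stand-in language model are all illustrative assumptions, and exhaustive enumeration replaces the Viterbi algorithm for brevity.

```python
# Toy best-path search over a replacement lattice. Each edge spans
# positions of the original sentence and carries a replacement
# probability P(S|T); lm_score() stands in for the trigram model P(T).
lattice = {
    0: [(1, ("he",), 1.0)],
    1: [(2, ("said",), 0.6), (2, ("stated",), 0.4)],
    2: [(3, ("hello",), 0.7), (3, ("hi",), 0.3)],
}

def lm_score(words) -> float:
    # Illustrative stand-in for a trigram language model.
    return 0.9 if "said" in words else 0.8

def best_paraphrase(end: int = 3):
    best, best_score = None, 0.0

    def walk(pos, words, prob):
        nonlocal best, best_score
        if pos == end:
            score = prob * lm_score(words)  # P(S|T) * P(T)
            if score > best_score:
                best, best_score = words, score
            return
        for nxt, replacement, p in lattice.get(pos, []):
            walk(nxt, words + list(replacement), prob * p)

    walk(0, [], 1.0)
    return best, best_score

print(best_paraphrase())  # -> (['he', 'said', 'hello'], 0.378)
```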

The approach proposed by Zhao et al. [43] builds upon and tries to improve the work done by Quirk, Brockett, and Dolan [31] by combining multiple resources to perform the paraphrasing. These resources are an automatically constructed thesaurus, monolingual parallel data from novels, monolingual comparable data from news articles, a bilingual phrase table, word definitions from the Encarta dictionary, and a corpus of similar user queries. In order to combine the different approaches, equation 3.2 is rewritten as equation 3.3, where N corresponds to the number of resources, h_TM_i is a replacement scoring function, h_LM is a fluency scoring function based on a 5-gram language model, and λ_TM_i and λ_LM are weights.

T = arg max_T ( Σ_{i=1}^{N} λ_TM_i h_TM_i(T, S) + λ_LM h_LM(T, S) )   (3.3)

In addition to the six resources, a “self-paraphrase” resource is introduced, which is made up of pairs of identical words. This allows some words to remain the same during paraphrase generation.

For a given input sentence, up to 20 paraphrases are generated by each of the seven resources and stored in memory alongside their associated replacement scores. A statistical machine translation decoder is then used to find the best paraphrase T based on equation 3.3.

It appears that all resources contribute to the paraphrasing through suggestions otherwise not found by the other methods. However, the similar user query approach is deemed the least effective. Furthermore, while news article sentences have a paraphrase acceptance rate of up to 70%, informal sentences from internet forums cause many more problems, with a top acceptance rate of 40%.

Xu [42] uses the approach proposed by Quirk, Brockett, and Dolan [31], but the novelty of this work is the goal of generating paraphrases with a specific output style. In other words, the objective is to “translate” an input sentence into a specific style of English, for example Shakespearean or “Internet English”. The thesis first addresses paraphrasing with Shakespearean English as the reference style, with the parallel monolingual data being translations of Shakespeare plays into modern English. The author also explores a dictionary-based approach where substitutions occur on a word-by-word basis. According to human evaluation, the statistical approach is rated higher on semantic adequacy, style, and overall quality. Furthermore, the author also explores paraphrasing with “Internet English” as the reference style. This is done by using a similar statistical approach with Twitter as the data source. In order to do so, a corpus of comparable data is built based on tweets discussing similar events. This approach seems able to learn slang terms, acronyms and misspellings, which are otherwise hard to learn with statistical approaches.

Pasca and Dienes [28] describe a paraphrase acquisition method that no longer requires parallel data, but rather relies on the high quantity of monolingual documents available online. Instead of aligning entire documents or sentences originating from comparable documents, the authors align small sentence fragments with each other. Sentence fragments are aligned if there are similar words at the end of each fragment; such aligned fragments are thus potential paraphrases. While this approach does not require comparable data and can therefore make use of online resources on a bigger scale, its major drawback is the lack of sense distinction when creating the proposed paraphrases, which are also often limited to simple synonym replacements.

Pavlick et al. [29] explore a new approach to paraphrase acquisition, and propose to perform this step through a “bilingual pivoting method”. This paper is the continuation of “PPDB: The Paraphrase Database” [10], and offers multiple improvements compared to the initial release. The core idea of this method is that if two English phrases e1 and e2 translate to the same foreign phrase f, then e1 and e2 can be assumed to have the same meaning, and are thus possible paraphrases. This process is illustrated in figure 3.2. The paraphrasing database contains more than 100 million entries for English.

Each entry is expressed in the form of synchronous context-free grammar rules, which offers modular paraphrases with “variables”, such as ⟨the X1 of X2, X2’s X1⟩. These entries are furthermore annotated with replacement scores, entailment relationships, word-category disambiguation and style categories, to name only a few. This database has been released to the public under the Creative Commons License.

Figure 3.2 – An example of paraphrase acquisition through bilingual pivoting as performed by Pavlick et al. [29]. In this example, thrown into jail is acquired as a paraphrase of imprisoned. Figure based on [10].

Based on the work done by Pavlick et al. [29], Napoles, Callison-Burch, and Post [25] present a ready-to-use solution for generating paraphrases of English sentences. This solution relies on a database of extracted paraphrases, which is used as a grammar for the syntax-based statistical machine translation engine Joshua [30]. The only data needed to generate paraphrases with this tool is the input phrase itself. The authors moreover claim that their solution is the first to provide researchers with a ready-to-use paraphrasing engine.

As in many other fields of Computer Science, deep learning is making its way into paraphrasing, as proposed by Gupta et al. [12]. This approach combines generative models (based on Variational Auto-Encoders) with sequence-to-sequence models (based on Long Short-Term Memory networks), and is able to generate a paraphrase for a given input sentence. The training data is made up of original sentences and their reference paraphrases. According to the authors, their technique outperforms the state of the art by a significant margin and sets a new baseline for future research. Moreover, “unlike most existing models, [this] model is simple, modular and can generate multiple paraphrases, for a given sentence”.

Due to the lack of standardized metrics and test sets, comparing the performance of paraphrase acquisition and paraphrase generation systems is not a straightforward task. It seems however that deep learning methods are the new state of the art, while resource-based systems can still be of use for targeted lexical substitutions.

3.1.4 Evaluating the Quality of Paraphrases

According to Chen and Dolan [7], “[the] lack of standard datasets and automatic evaluation metrics has impeded progress in the [automatic paraphrasing] field”. In the following we mention some testing frameworks that have been proposed in an attempt to address this issue.

Liu, Dahlmeier, and Ng [19] are the first to propose a standardized paraphrase evaluation framework. PEM scores a candidate paraphrase based only on the original sentence, and outputs a numeric score estimating the quality of the paraphrase. This framework furthermore does not rely on lexical similarity, but rather tries to assess semantic closeness by using a bag of pivot-language N-grams. While this metric appears to correlate well with human judgment, it requires a significant amount of in-domain bilingual data to train the evaluator. According to Chen and Dolan [7], “training a successful PEM becomes almost as challenging as the original paraphrasing problem, since paraphrases need to be learned from bilingual data”.


Chen and Dolan [7] thus propose an alternative framework which requires no training and is more straightforward to use. For this purpose, a new metric called PINC is introduced. This metric aims to measure lexical dissimilarity: the fewer words a candidate has in common with the source sentence, the higher the score. The formula for PINC is detailed in equation 3.4, where s and c are the source and candidate sentences and N is the maximum N-gram size considered.

PINC(s, c) = (1/N) Σ_{n=1}^{N} (1 − |n-gram_s ∩ n-gram_c| / |n-gram_c|)   (3.4)

The quality of the generated paraphrases is thus measured by using PINC and BLEU [27] together as a 2-dimensional metric. The idea behind this approach is that a good paraphrase has little lexical similarity with the source sentence (a high PINC score), but high lexical similarity with reference paraphrases, indicating that the semantic content is preserved (a high BLEU score).

While the framework proposed by Chen and Dolan [7] seems to have gained traction among researchers, it has not been adopted on a scale similar to BLEU in the machine translation field. In fact, many research papers focusing on paraphrasing still only report their results using machine translation metrics combined with a qualitative human assessment.

There seems however to be a consensus on what constitutes a good paraphrase. As in machine translation, fluency and adequacy are key aspects, but a good paraphrase must also have a high lexical dissimilarity compared to the original sentence. [19]

3.2 Sentiment Shifting

Guerini, Strapparava, and Stock [11] present a tool for modifying an input text towards a more positive or negative version. The valence shifting is mainly achieved by substituting a word with a more suitable equivalent (for example a connoted synonym or a superlative), and by inserting and deleting words that play the role of downtoners or intensifiers (such as “very”). These substitutions are performed until the target valence is reached. The authors claim to achieve satisfactory results but have not conducted a user study to support this claim.

Whitehead and Cavedon [41] build upon the work performed by Guerini, Strapparava, and Stock [11] by using an approximation of the original method, while also expanding it by filtering out unacceptable candidate sentences. However, after performing a human evaluation of the generated results, the authors are unable to see any correlation between their algorithm and human judgment. They attribute this to the noisy data present in SentiWordNet, as many entries are incorrectly scored. Moreover, SentiWordNet is deemed unfit for this approach, as the authors claim it is unrealistic to expect every word to always carry the same sentiment. They thus suggest performing sentiment generation in a contextual way, rather than using out-of-context resources.


Chapter 4

Our Contribution

4.1 Presentation of the Testing Framework

As presented previously, this work takes place in the context of testing conversational agents, based on the principle of adversarial example testing: a base case is altered while retaining the same expected outcome, in order to assess the robustness of the subject under test. [17] This project expands and builds on work previously carried out at University College Dublin. More precisely, the test framework we introduce next is not an original contribution by the author.

We refer to the sentences produced by a user as utterances. They have an intent, which is the nature of the query made by the user, and entities, which are the parameters of this query. For example, for the utterance “Add a 1 hour meeting for tomorrow at 2pm in room A1”, the intent would be adding an event to the calendar, and the entities would be the time, date, duration and location of the event to be added.

A schematic representation of the framework is given in figure 4.1, which works as follows. Original utterances are first processed by the agent in order to retrieve the original intent. These original utterances are then used to generate adversarial utterances, which are in turn processed by the agent under test in order to retrieve the adversarial intents. Based on those adversarial intents, we can compute the robustness of an agent to a given intent. We define the robustness R_i to an intent i as per equation 4.1, where C_i designates the adversarial utterances correctly classified as being of intent i, and T_i designates all the adversarial utterances of intent i, with the adversarial utterances in both sets being generated from correctly classified original utterances only.

R_i = |C_i| / |T_i|   (4.1)

Figure 4.1 – Schematic representation of the test framework in use. Original utterances are processed to generate adversarial examples which help assess the robustness of the agent under test.
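A minimal sketch of this robustness computation is shown below; the classify() stub and the adversarial utterances are illustrative assumptions standing in for the agent under test and the generated test data.

```python
# Sketch of the robustness metric R_i = |C_i| / |T_i| (equation 4.1).
# classify() is an illustrative stub for the agent under test.
def classify(utterance: str) -> str:
    # A real agent (e.g. a LUIS app queried over HTTP) would go here.
    return "add_meeting" if "meeting" in utterance else "unknown"

def robustness(adversarial: dict) -> dict:
    """Maps each intent i to R_i, given adversarial utterances T_i that
    were generated from correctly classified original utterances only."""
    scores = {}
    for intent, utterances in adversarial.items():
        correct = [u for u in utterances if classify(u) == intent]  # C_i
        scores[intent] = len(correct) / len(utterances)
    return scores

adversarial = {"add_meeting": ["add a meting with john", "shedule a meeting"]}
print(robustness(adversarial))  # {'add_meeting': 0.5}
```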

4.2 Overview of our Approach

Since we do paraphrasing in the context of conversational agent testing, two additional constraints arise compared to the “regular” paraphrasing problem.

Firstly, we focus on accuracy rather than coverage. We indeed wish to minimize the number of false negative test cases, i.e. cases for which the conversational agent fails to correctly handle an utterance that a human agent would not understand either. Furthermore, we anticipate that every negative outcome requires manual investigation in order to determine its real nature (true or false negative). Coverage is therefore not the key property in our case.


Secondly, the paraphrasing should be able to model different types of speech, for example by making use of informal vocabulary. The goal is indeed to generate data that is as close as possible to utterances produced by human users, but also as diverse as possible, in order to anticipate real cases of misunderstanding.

These additional constraints lead us to favor a rule-based approach over data-driven paradigms. We expect this type of approach to give us more flexibility and control with regard to the style of the output produced, while not having to deal with the data-sparseness issues that we would otherwise likely encounter. Moreover, it also seems important to have a fine-grained understanding of how paraphrases were generated, in order to trace back misunderstandings and make this acquired knowledge leverageable.

As illustrated in figure 4.2, we propose to generate adversarial utterances by processing input data through the following pipeline, whose stages can be skipped individually in order to focus on one specific aspect:

1. Structural Modifications, which primarily aims to change the grammatical structure of the utterance.

2. Lexical Substitutions, which aims to change the vocabulary used in the utterance by using either:

(a) generic and neutral synonyms (Generic Lexical Substitutions), or

(b) synonyms specific to a register of speech or a geographical region (Targeted Lexical Substitutions).

3. Spelling Shifting, which aims to adapt the spelling of the utterance to a national spelling convention.

4. Noise Generation, which aims to model the noise encountered in real data (such as spelling errors or misplaced spaces).

In the following we only propose solutions to steps 2 to 4, leaving out the structural modifications aspect due to time and complexity constraints.¹ Moreover, we believe that lexical paraphrases and other alterations to the surface forms of words may be more insightful when testing conversational agents. Indeed, agents are not capable of linking different words or written forms to the same concepts unless these are encountered in the training data or specifically handled, for example with a spelling corrector. While interesting, word ordering or the use of different non-key words might not have as much impact on the understanding capabilities. We therefore choose to focus on lexical substitutions as our main paraphrasing technique.

¹ We considered performing these modifications with the engine developed by Napoles, Callison-Burch, and Post [25], but were unable to due to issues present in the package released by the authors. These issues have unfortunately not been resolved at the time of writing, despite our contacting the authors.

Figure 4.2 – The proposed adversarial examples generation pipeline. Dashed blocks are not discussed in this work.

In this project, we assume that all the input utterances are grammatically and syntactically correct, with no spelling mistakes. We furthermore transform all the input utterances to lower case as a preprocessing step, in order to avoid case sensitivity issues.

The implementation was done in Python 3, as many Natural Language Processing (NLP) tasks have been performed with this language. The project was furthermore carried out using the following libraries: Natural Language Toolkit [5] as a general NLP tool and for access to WordNet [24], Pattern [35] for word inflection, Gensim [32] as a word vector interface, Pywsd [37] as a Word-Sense Disambiguation tool, the Stanford Core NLP tools [21] for dependency parsing, and Deap [9] for the genetic algorithms. Moreover, we also relied on the services provided by the Microsoft Azure portal, which supplied the tokenization, tagging, language model, web search, and translation tools. Finally, the conversational agent under test was created with the Microsoft LUIS platform.

4.3 Lexical Substitutions

In the following we propose an approach to paraphrasing through lex- ical substitutions, that is we replace words occurring in a given input utterance with their synonyms. The goal of this paraphrasing is to in- troduce variations in the vocabulary, either by remaining generic or by targeting a specific type of English. We chose this approach because of



its relative simplicity and low overhead compared to more complex solutions, such as adapting a Machine Translation engine.

4.3.1 Generic Lexical Substitutions

We follow the lead of Hassan et al. [13] and build our approach on these two successive steps:

1. Retrieval of possible synonyms for each word appearing in the input utterance, and generation of candidate paraphrases by substituting that word with its inflected synonyms.

2. Scoring of the candidate paraphrases yielded by the previous step, in order to weed out bad candidates and rank the remaining ones.

Generation of Candidate Paraphrases

The processing of each input utterance starts with tokenization and Part-of-Speech tagging, using the Microsoft Cognitive Services API2. Dependency parsing with the Stanford Core NLP tools [21] is also performed in order to identify and retrieve dependencies of interest.

For each token obtained in the previous step, we skip punctuation signs and stopwords. The remaining tokens are then lemmatized using the Pattern [35] library. In the case of phrasal verbs, the corresponding particle is retrieved from the dependency parse and included in the lemma form3.

These lemma forms are then used to retrieve possible synonyms from lexical databases, namely WordNet [24]. These synonyms are in turn inflected and inserted into the input utterance, yielding candidate paraphrases. The supported inflections include pluralization, singularization, and conjugation. Indefinite article epenthesis4 is also supported. These operations are handled by the Pattern [35] library.

Finally, we only consider paraphrases that are one edit away from the input utterance; that is, we do not consider the simultaneous substitution of multiple lemma forms in the same utterance.

2 https://azure.microsoft.com/en-us/services/cognitive-services/

3 For example, the verb in the sentence “He made it up” lemmatizes to “make_up” rather than “make”.

4 I.e., the indefinite article a becoming an before a vowel.



This is done in order to limit the number of generated candidate paraphrases and to maintain higher quality.
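As an illustration of the synonym retrieval and substitution just described, the sketch below substitutes WordNet synonyms one at a time, keeping each paraphrase a single substitution away from the input. It deliberately omits the inflection and phrasal-verb handling discussed above, and the function name and token-list interface are our own illustrative choices.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def substitution_candidates(tokens, index, pos=wn.VERB):
    """Yield paraphrases of `tokens` in which the (lemmatized) word at
    `index` is replaced by one WordNet synonym at a time, so every
    candidate is exactly one substitution away from the input."""
    lemma = tokens[index]
    synonyms = {
        l.name().replace("_", " ")
        for synset in wn.synsets(lemma, pos=pos)
        for l in synset.lemmas()
        if l.name().lower() != lemma
    }
    for synonym in sorted(synonyms):
        yield " ".join(tokens[:index] + [synonym] + tokens[index + 1:])

# Example: substitution_candidates("book a flight to london".split(), 0)
# may yield "reserve a flight to london", "hold a flight to london", ...
```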

Scoring of Candidate Paraphrases

Many words have different meanings depending on the context in which they occur, so substituting them with synonyms can lead to nonsensical results. Filtering out poor paraphrases is therefore necessary, and we perform this task by adapting and combining some of the strategies presented by Hassan et al. [13]. The following strategies have been implemented:

Language Model (LM) The fluency of candidate paraphrases is scored with a 5-gram language model. The probability of each candidate paraphrase is computed by the Language Model API provided by Microsoft’s Cognitive Services platform.

Translation Pivoting (TP) The input utterance is translated into a foreign language and then back into English, producing a pivoted sentence. Candidates whose inflected synonym appears anywhere in one of the pivoted sentences are given a score of 1, and 0 otherwise. In our implementation, we rely on the Microsoft Translator Text API, using French as the pivot language and requesting multiple translations to French and back to English in order to increase the diversity of the pivoted sentences.

Word Vectors (WV) This strategy yields a higher score for candidate paraphrases whose inflected synonym has a strong lexical similarity with the input utterance. To assess this lexical similarity we rely on Word2Vec [23], using the Google News pre-trained data5 and Gensim [32]. The lexical score ls of a synonym s is computed as the average of the cosine similarities between the inflected form of s and each word of the input utterance that is not a stopword (a set we call U), as shown in equation 4.2, where the vectors are the corresponding word vectors. A code sketch of this score is given after the strategy overview below.

$$ls(s, U) = \frac{1}{|U|} \sum_{w \in U} \frac{\vec{s} \cdot \vec{w}}{\lVert \vec{s} \rVert \, \lVert \vec{w} \rVert} \qquad (4.2)$$

5 https://code.google.com/archive/p/word2vec/



Web Search (WS) Each candidate paraphrase is queried in a search engine and scored based on how many hits were obtained. The search is performed using the Bing Web Search API.

Word-Sense Disambiguation (WSD) For each word being replaced in the input utterance, we retrieve its most probable sense using the Simplified Lesk algorithm provided by Pywsd [37]. Candidates whose synonym corresponds to the most probable sense of the replaced word are given a score of 1, and 0 otherwise.

Lexical Frequency (LF) This strategy scores a candidate by how frequently its synonym has been encountered as a synonym for the different senses of the original word, across different lexical resources. The yielded score is the number of times the synonym has been encountered.

The scoring methods described above target different aspects of what makes a good paraphrase. We measure the semantic relatedness of a synonym to the input utterance with the Word-Sense Disambiguation and Word Vectors strategies, and account for the well-formedness of candidates with the Web Search and Language Model methods. Finally, Translation Pivoting is a high-precision but low-coverage strategy that identifies the most plausible candidates, while Lexical Frequency encourages the use of synonyms with a broader sense. Since our scoring strategies target orthogonal aspects of the problem, combining them is expected to give better overall results.
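As announced above, the Word Vectors score of equation 4.2 can be sketched with Gensim as follows. The loading of the pre-trained vectors and the treatment of out-of-vocabulary words are simplified assumptions on our part, not necessarily the thesis implementation.

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (a large download).
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def lexical_score(synonym, utterance_words):
    """Equation 4.2: average cosine similarity between the inflected
    synonym and every non-stopword word of the input utterance (the
    set U). Out-of-vocabulary words are simply skipped in this sketch."""
    if synonym not in kv:
        return 0.0
    sims = [kv.similarity(synonym, w) for w in utterance_words if w in kv]
    return sum(sims) / len(sims) if sims else 0.0
```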

The aforementioned strategies are combined by first rescaling each metric to the range [0; 1]; the rescaled metrics are then combined linearly, as shown in equations 4.3 and 4.4. In these equations, $s_m$ and $s'_m$ represent the original and rescaled scores yielded by strategy m, $\lambda_m$ is the weight of strategy m, c is the candidate paraphrase to score, C is the set of all paraphrases to score, and s(c) represents the global score of candidate paraphrase c.

$$s'_m(c) = \frac{s_m(c) - \min_{x \in C} s_m(x)}{\max_{x \in C} s_m(x) - \min_{x \in C} s_m(x)} \qquad (4.3)$$

$$s(c) = \frac{\sum_m \lambda_m \, s'_m(c)}{\sum_m \lambda_m} \qquad (4.4)$$



Candidates that obtain a global score lower than a pre-determined threshold t are considered bad and are pruned from the final set of paraphrases.
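The rescaling, combination, and pruning can be sketched in a few lines of plain Python. The dictionary layout (`raw_scores`, `weights`) is our own illustrative choice, not the thesis implementation.

```python
def rescale(scores):
    """Equation 4.3: min-max rescale one strategy's scores into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # degenerate case not covered by the equation
        return {c: 0.0 for c in scores}
    return {c: (s - lo) / (hi - lo) for c, s in scores.items()}

def combine_and_prune(raw_scores, weights, threshold):
    """Equation 4.4 plus pruning: weighted linear combination of the
    rescaled strategy scores, dropping candidates that score below t.
    `raw_scores[m]` maps each candidate to its raw score under strategy m;
    `weights[m]` holds the corresponding lambda_m."""
    rescaled = {m: rescale(s) for m, s in raw_scores.items()}
    total = sum(weights.values())
    candidates = next(iter(rescaled.values()))  # all strategies score the same candidates
    global_scores = {
        c: sum(weights[m] * rescaled[m][c] for m in rescaled) / total
        for c in candidates
    }
    return {c: s for c, s in global_scores.items() if s >= threshold}
```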

The best weights $\lambda_m$ and the best pruning threshold $t$ are determined by a genetic algorithm that tries to maximize the number of acceptable paraphrases returned by our lexical substitutions.

More specifically, the fitness function the algorithm maximizes is the F-score averaged over all training input utterances. As shown in equation 4.5, the F-score is based on precision and recall, with the positive parameter β weighting the importance of recall. For a given input utterance, we define precision as the number of acceptable paraphrases returned by our system divided by the total number of paraphrases returned, and recall as the number of acceptable paraphrases returned by our system divided by the total number of acceptable paraphrases that were generated. Our problem can also be viewed from an information retrieval perspective: generated paraphrases are search results that are relevant if the paraphrase is acceptable and irrelevant otherwise, and these “search results” are ranked by our scoring strategies. The F-score is a standard metric for measuring the performance of search engines, which we adapt to this context.

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}} \qquad (4.5)$$

The parameters of the genetic search are stored in an array of integers ranging from 0 to 100; more precisely, the array we operate on has the form $[\lambda_1; \lambda_2; \ldots; \lambda_6; t \cdot 100]$. The genetic operators in use are two-point crossover for the mating, uniform random integer replacement for the mutation, and tournament selection for the survivor selection. We use a very simple genetic algorithm, as presented in chapter 7 of Bäck, Fogel, and Michalewicz [1], which performs a simple mating - mutation - fitness evaluation - survivor selection cycle for each new generation of individuals. These operations were performed using the Deap library [9].
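As an illustration of this setup, here is a minimal sketch using Deap's built-in operators. The population size, probabilities, and generation count are arbitrary choices of ours, and `mean_f_score` is a hypothetical stand-in for the actual fitness evaluation, stubbed so the sketch runs standalone.

```python
import random
from deap import algorithms, base, creator, tools

def f_beta(precision, recall, beta=1.0):
    """F-score of equation 4.5, guarded against the all-zero case."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def mean_f_score(weights, threshold):
    """Hypothetical stand-in: run the scoring pipeline with the given
    weights and threshold on every training utterance and return the
    average F-score. Stubbed with a constant so the sketch runs."""
    return 0.0

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("gene", random.randint, 0, 100)
# Seven integers: six strategy weights followed by t * 100.
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.gene, n=7)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (mean_f_score(ind[:6], ind[6] / 100),))
toolbox.register("mate", tools.cxTwoPoint)                    # two-point crossover
toolbox.register("mutate", tools.mutUniformInt,               # uniform integer
                 low=0, up=100, indpb=0.1)                    # replacement
toolbox.register("select", tools.selTournament, tournsize=3)  # survivor selection

population = toolbox.population(n=50)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=40,
                    verbose=False)
best = tools.selBest(population, k=1)[0]
```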

In an attempt to limit overfitting, we perform the genetic search multiple independent times and combine the fittest individuals of each run as shown in equation 4.6, in which b indexes the individual runs and B denotes the total number of independent runs. Each genetic search is conducted on a random subset of the training data, obtained by sampling with replacement.
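Since equation 4.6 itself is not reproduced in this excerpt, the following sketch should be read with care: the bootstrap sampling follows the text, but the element-wise averaging of the fittest individuals is only an assumption on our part, and `genetic_search` is a hypothetical handle on the search of the previous sketch.

```python
import random

def genetic_search(subset):
    """Hypothetical stand-in: run the genetic algorithm of the previous
    sketch on `subset` and return the fittest individual. Stubbed with a
    random individual so this sketch runs standalone."""
    return [random.randint(0, 100) for _ in range(7)]

def combined_parameters(training_data, runs=10):
    """Run the search on `runs` bootstrap samples (drawn with replacement)
    and combine the fittest individuals. The element-wise average below is
    an ASSUMED combination rule, since equation 4.6 is not shown here."""
    fittest = []
    for _ in range(runs):
        sample = random.choices(training_data, k=len(training_data))
        fittest.append(genetic_search(sample))
    return [sum(gene) / runs for gene in zip(*fittest)]
```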
