DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2016

A semi-supervised approach to dialogue act classification using K-Means+HMM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master's Thesis at NADA
Supervisor: Gabriel Skantze
Examiner: Danica Kragic
Dialogue act (DA) classification is an important step in the process of developing dialogue systems. DA classification is usually solved by supervised machine learning (ML) approaches, all of which require hand-labeled data. Since hand labeling data is a resource-intensive task, many have proposed focusing on unsupervised or semi-supervised ML approaches to the problem of DA classification.
This master's thesis explores a novel semi-supervised approach to DA classification: K-Means+HMM. The method combines K-Means clustering and Hidden Markov Model (HMM) modeling; in addition, prior to HMM training, the words in the utterances are abstracted away to their part-of-speech (POS) tags, and the utterances themselves to the cluster labels produced by K-Means.
The focus is on the following hypotheses: H1) incorporating the context of the utterances leads to better results (HMM is a method specifically designed for sequential data and thus incorporates context, while K-Means does not); H2) increasing the number of clusters in K-Means+HMM leads to better results; H3) increasing the number of example pairs of cluster labels and hand-labeled DAs in K-Means+HMM leads to better results (the example pairs are used to create the emission probabilities that define the HMM).
One conclusion is that K-Means performs better than K-Means+HMM (the result for K-Means measured with one-to-one accuracy is 35.0%, while the result for K-Means+HMM is 31.6%) given 14 clusters and one example pair. However, when the number of example pairs is increased to 15, the result is 40.5% for K-Means+HMM; the biggest improvement occurs when the number of example pairs is increased to 20, resulting in 44% one-to-one accuracy. That is, K-Means+HMM outperforms K-Means provided that a certain number of example pairs is given.
Another conclusion is that the number of example pairs has a much larger impact on the results than the number of clusters, which perhaps supports the statement that "there is no data like labeled data".
A semi-supervised method for the classification of dialogue acts: K-Means+HMM

Classification of dialogue acts is an important step in the process of developing dialogue systems. Classification of dialogue acts is a problem usually solved with supervised machine learning methods, all of which require labeled data. Since labeling data is a resource-intensive task, many have proposed focusing on unsupervised or semi-supervised machine learning methods to solve the problem of classifying dialogue acts.
This master's thesis explores a new semi-supervised machine learning method for the classification of dialogue acts: K-Means+HMM. In addition to combining K-Means and Hidden Markov Model (HMM) modeling, the method abstracts the words in the utterances to their part-of-speech tags, and the utterances to the cluster labels produced by K-Means, prior to HMM training.
The project focuses on the following three hypotheses: H1) integrating the context of the utterances leads to better results (HMM is a method used specifically for sequential data and thus integrates context, while K-Means does not); H2) increasing the number of clusters in K-Means+HMM leads to better results; H3) increasing the number of example pairs of cluster labels and hand-labeled dialogue acts in K-Means+HMM leads to better results (the example pairs are used to create the emission probabilities that define the HMM).
One of the conclusions is that K-Means performs better than K-Means+HMM (the result for K-Means measured with one-to-one accuracy is 35.0%, while the result for K-Means+HMM is 31.6%) given 14 clusters and one example pair. However, when the number of example pairs increases to 15, the result for K-Means+HMM rises to 40.5%; the largest increase occurs with 20 example pairs, which yields 44% one-to-one accuracy. In other words, K-Means+HMM performs better than K-Means once a certain number of example pairs is available.
Another conclusion is that the number of example pairs has a much larger effect on the results than the number of clusters, which possibly leads to the conclusion that "there is no data like labeled data".
1 Introduction 1
1.1 Project Introduction . . . 1
1.2 The master’s thesis project aim . . . 2
1.3 Assumptions . . . 2
1.4 Hypotheses . . . 3
1.5 Outline of the report . . . 3
2 Background 5
2.1 Dialogue Acts . . . 5
2.2 Corpora . . . 6
2.3 Related Work . . . 7
2.3.1 Semi-supervised approaches . . . 8
2.3.2 Unsupervised approaches . . . 8
2.3.3 Related evaluation methods . . . 12
3 Corpus 15
3.1 The Wizard-User Corpus . . . 15
4 Method 21
4.1 Preprocessing of the corpus . . . 21
4.1.1 Removing non-content utterances . . . 21
4.1.2 Remapping of the DAs . . . 21
4.1.3 Speaker separation . . . 22
4.1.4 POS-tagging . . . 22
4.1.5 Further preprocessing steps . . . 22
4.2 K-Means+HMM . . . 24
An overview of K-Means+HMM . . . 24
4.2.1 Details on K-Means+HMM . . . 24
M.1 Extraction of feature vectors . . . 24
M.2 K-Means clustering . . . 25
M.3 Training the HMM . . . 26
M.4 Applying Viterbi . . . 28
4.3 Evaluation methods . . . 28
4.4 Implementation . . . 33
5 Results 35
5.1 K-Means vs. K-Means+HMM . . . 35
5.2 Varying the number of clusters . . . 40
5.3 Varying the number of samples . . . 42
6 Discussion, conclusions and future work 43
6.1 Method and application . . . 43
6.2 Hypotheses H1-H3 discussed . . . 44
6.2.1 H.1: K-Means+HMM yielding better results than K-Means . . . 44
6.2.2 H.2: The larger the number of clusters in K-Means+HMM the better the results . . . 45
6.2.3 H.3: The larger the number of example pairs in K-Means+HMM the better the results . . . 46
6.3 Discussion regarding the evaluation methods . . . 47
6.4 Corpus . . . 48
6.5 Future work . . . 48
6.6 Society, sustainability and ethics . . . 49
Appendices 49
A 51
A.1 Dialogue example . . . 51
A.2 Remapping of Dialogue Acts . . . 52
A.3 POS-tagged example . . . 53
1 Introduction
This chapter aims at introducing the reader to the master's thesis project, its aim, and the underlying assumptions and hypotheses, and at giving an overview of the report itself.
1.1 Project Introduction
Dialogue act (DA) tagging intends to assign one or more tags to each and every utterance in a dialogue. Each tag reflects the semantic intention of the utterance, thus providing information about the underlying meaning of each utterance.
For instance, the utterance "Where is the subway station?" can be tagged with the DA Question.
That is, DAs go beyond the meaning of the words in the utterances and try to mirror the intent or goal of each utterance; a more detailed definition of DAs and examples are given in chapter 2 Background (p. 5).
DA tagging is one of the first important steps in understanding speech or text when building dialogue systems, such as Apple's Siri or Amazon's Alexa.
Currently the most common methods used for DA tagging rely on supervised machine learning (ML) approaches, for instance Hidden Markov Model (HMM) modeling. In other words, the traditional approach to DA tagging is supervised ML that follows the pattern "(hand) label-train-test". Supervised ML approaches in turn rely on large amounts of hand-labeled data for training.
The problem with the current supervised approaches to DA tagging is the need for large hand-labeled corpora for training; producing these is an expensive and time-consuming task [1–4].
Consequently, as a step forward, many have proposed focusing on unsupervised or semi-supervised ML to address the time-consuming task of hand labeling data, as the amount of corpora constantly increases due to interactions on the internet via, for instance, social media [1,3].
Attempts have been made to categorize utterances into DAs with unsupervised clustering methods. Ohtake (2008), for instance, used K-Means as an unsupervised clustering method, as it does not require any hand-labeled data and is fairly simple.
The simplicity of K-Means is a great advantage, since unsupervised ML is a novel and still largely unexplored approach to DA classification.
In addition to the unsupervised ML methods, semi-supervised or minimally supervised ML methods have been explored by Venkataraman et al. (2002, 2003) [1, 3]. The semi-supervised approaches explored by Venkataraman et al. have focused on HMM modeling and self-training; that is, starting out with a small amount of labeled data and using it to DA tag the rest of the data by iterating over it. The reason for the semi-supervised approach, especially when the goal is to label utterances with a set of predefined DA classes, is the belief that
“[t]here is no data like labeled data” [3, p.5].
1.2 The master’s thesis project aim
The aim of this master's thesis project is to explore a novel semi-supervised DA tagging method with two abstraction levels: abstracting the words in the dialogues to their part-of-speech (POS) tags, and abstracting each utterance in the dialogue to a cluster class; the context of the utterances is then taken into consideration by means of HMM modeling (read more about the method in chapter 4 Method (p. 21)). The context refers to the fact that the HMM takes the previous utterances in a sequence of utterances into consideration. The novel method is named K-Means+HMM.
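The two abstraction levels can be sketched as a pair of transformations. The toy POS lexicon and the first-tag "clustering" below are purely illustrative assumptions, not the thesis implementation (which uses a real POS tagger and K-Means):

```python
# Level 1: abstract words to POS tags. A real system would use a trained
# POS tagger; this toy lexicon exists only for illustration.
TOY_POS = {"where": "WRB", "is": "VBZ", "the": "DT", "subway": "NN",
           "station": "NN", "turn": "VB", "left": "RB", "yes": "UH"}

def to_pos(utterance):
    return [TOY_POS.get(word, "X") for word in utterance.lower().split()]

# Level 2: abstract each POS sequence to a cluster label. A fixed lookup
# on the first tag stands in for K-Means clustering here.
FIRST_TAG_CLUSTER = {"WRB": 0, "VB": 1}

def cluster_label(pos_seq):
    return FIRST_TAG_CLUSTER.get(pos_seq[0], 2)

dialogue = ["Where is the subway station", "Turn left", "Yes"]
pos_sequences = [to_pos(u) for u in dialogue]       # words -> POS tags
labels = [cluster_label(p) for p in pos_sequences]  # utterances -> clusters
print(labels)  # [0, 1, 2]
```

The HMM is then trained on the resulting sequences of cluster labels, so that the label of the previous utterance can inform the label of the next one.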
1.3 Assumptions
The underlying assumptions for this master's thesis project are the following:
A.1 Since DAs convey the underlying meaning of an utterance, the assumption is that words can be abstracted away to POS-tags. The abstraction to POS-tags is an approximation that helps to simplify the problem of learning DAs with semi-supervised ML.
A.2 If the speakers in a dialogue are separated, the clusters of the utterances in the dialogues will be more discriminated, as some of the DAs are only associated with one of the two speakers.
A.3 Non-content utterances or words (non-content utterances are silence or nonsense, and non-content words are, for instance, "eeh") do not impact POS-tagging, and thus in turn clustering, as they carry no underlying meaning.
In short, non-content utterances and words are not meaningful enough to be taken into consideration in this particular study.
1.4 Hypotheses
The hypotheses posed within the scope of this master's thesis project are the following:
H.1 K-Means+HMM should yield better results, in terms of the evaluation methods introduced in Section 4.3 Evaluation methods (p. 28), compared to K-Means.
H.2 The larger the number of clusters in K-Means+HMM, the better the results.
H.3 The larger the number of examples of cluster labels with their associated hand-labeled DAs, the better the results obtained from K-Means+HMM.
A comment regarding H.1: it is important to note that the mere unsupervised clustering of utterances can per se be seen as clustering the utterances into DA categories. That is, for example, all utterances in cluster one can be seen as utterances of DA Question, whereas all utterances in cluster two are utterances of DA Answer_Yes, etc. It is this fact that is the basis for the comparison between K-Means and K-Means+HMM.
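This cluster-to-DA view can be sketched as a majority mapping from cluster labels to hand-labeled DAs; the example pairs below are hypothetical:

```python
from collections import Counter

# Hypothetical example pairs of (cluster label, hand-labeled DA).
pairs = [(0, "Question"), (0, "Question"), (0, "Answer_Yes"),
         (1, "Answer_Yes"), (1, "Answer_Yes"), (2, "Statement")]

def majority_mapping(pairs):
    """Map each cluster to the most frequent DA among its example pairs."""
    by_cluster = {}
    for cluster, da in pairs:
        by_cluster.setdefault(cluster, Counter())[da] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}

mapping = majority_mapping(pairs)
print(mapping)  # {0: 'Question', 1: 'Answer_Yes', 2: 'Statement'}
```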
1.5 Outline of the report
Chapter 2 Background (p. 5) provides the reader with the necessary background on DAs and the current unsupervised and semi-supervised ML methods for DA classification. Chapter 3 Corpus (p. 15) presents the corpus used in this study.
Chapter 4 Method (p. 21) presents an overview and the details of the scientific method used in the experiments of the study. Chapter 5 Results (p. 35) presents the results obtained from the experiments. Chapter 6 Discussion, conclusions and future work (p. 43) provides the reader with a discussion of the obtained results and concludes the study with some pointers for future work.
2 Background
This chapter defines and exemplifies a dialogue act (DA) and provides the reader with a background on semi-supervised and unsupervised DA classification.
2.1 Dialogue Acts
Dialogue act (DA) modeling captures "the communicative goal or action underlying each utterance" [6, p.1]. DA annotation of utterances in dialogues provides a level of structure in the dialogues, which is useful when designing dialogue systems. That is, a dialogue system can either try to interpret human-to-human dialogues, or it may participate in a dialogue with a human.
In both cases, the system must keep track of how each utterance changes the commonly agreed upon knowledge [...] including the conversational agent's obligations and plans.
[7, p. 1]
In other words, each utterance in a dialogue can be tagged by a DA-tag in order to show the underlying structure of the dialogue.
For instance, Core and Allen (1997) define DAMSL (Dialogue Act Markup in Several Layers), a set of "primitive communication actions" that can help when analyzing the structure of a dialogue [7, p. 1]. DAMSL is a general hierarchical annotation scheme; that is, it can be used to analyze most types of dialogues.
It is common, however, that DA schemes are specialized for a specific context or a specific type of dialogue. For example, in a dialogue about trains and timetables, the question "What time does the train leave?" would belong to the DA Question_train_time rather than the more general DA Question. See an example of specialized DAs in Table 2.1 and an excerpt from a dialogue where each utterance is tagged with a DA-tag in Table 2.2.
Act                   Description
S  Statement          A statement of fact
LF Assessment         Assessment having both positive and negative assessments
G  Grounding          Acknowledgment of previous utterance
EX Extra-Domain       Any utterance that is not related to the task
EQ Question           A question about the task
NF Negative Feedback  Negative assessment of knowledge or task

Table 2.1: An example of DAs with a description [6, p.2]. The DAs were produced for a tutoring corpus, a corpus where students interact remotely with tutors.
In short, the purpose of DA-tagging is to find the meaning of each utterance beyond the explicit words used in the utterance. There are general DA schemes and specialized DA schemes. Sometimes the schemes are hierarchical, like DAMSL, and other times they are not, like the scheme in Table 2.1.
2.2 Corpora
In general there are many different types of corpora available, for instance spoken versus written language, human-to-human versus human-computer, etc.
Each type of corpus has its own pros and cons. For instance, spoken language tends to be less formal and contains more pronouns compared to written language. It can additionally be more challenging to automatically POS-tag spoken corpora, since the majority of available POS-taggers are trained on written corpora, such as newspapers, books, etc.
Human-human dialogues, for instance, produce many more possible responses compared to more contrived human-computer dialogues. Additionally, turn taking in human-human dialogues is much richer than in human-machine dialogues, which can pose problems when using such dialogues to create dialogue systems.
Speaker  Utterance                                                     DA
Student  so obviously here im going to read into the array list and
         pull what we have in the list so i can do my calculations    S
Tutor    something like that, yes                                     LF
Tutor    by the way, an array list (or ArrayList) is something
         different in Java. this is just an array                     S
Student  ok                                                           G
Student  im sorry i just refer to it as a list because thats what it
         reminds me it does                                           S
Student  stores values inside a listbox(invisible)                    S
Tutor    that's fine                                                  EX
Tutor    ok, so what are we doing here?                               EQ
Student  im not sure how to read into the array                       NF

Table 2.2: An excerpt from a corpus where students interact remotely with tutors [6, p.2]. Each utterance is tagged with a DA-tag from Table 2.1.
Furthermore, human-human corpora (and in some sense human-computer corpora) can differ in their contents. A corpus can for example consist of dialogues which are task oriented, such as the well-known Map Task corpus [11–14], or domain oriented, such as the student-tutor interaction corpus, or completely open domain, such as the Twitter corpus. Needless to say, both human-human and human-computer corpora can be either written or spoken.
It is important to be aware of the differences between the different types of corpora and what these differences may entail for the task of DA classification.
2.3 Related Work
As mentioned in Section 1.1 Project Introduction (p. 1), DA classification is an important step towards understanding speech and designing dialogue systems.
Many DA classification approaches rely on HMM modeling [12,15], which in turn relies on large amounts of hand-labeled data. Joty et al. (2011) and Ritter et al. (2010) point out that, because many conversations today happen on the internet, our access to both spoken and written language is practically limitless, and that our methods of DA classification should evolve and try to adapt to the newly available stream of data. The suggested adaptation is unsupervised machine learning [2,4].
Overall there seems to be a need to explore new semi-supervised or unsupervised ML methods for DA classification. This section describes previous work carried out within the area of semi-supervised and unsupervised approaches to DA classification, along with examples of the evaluation methods used within the field. Unsupervised approaches and the evaluation methods are in focus, while the area of supervised DA classification is omitted here.
The goal of this section is not to provide the reader with all of the details of the different approaches to semi-supervised and unsupervised DA classification, but rather to give an overview of the work that has been carried out within the area and to illuminate some of the common problems.
2.3.1 Semi-supervised approaches
The semi-supervised ML approaches to DA classification often focus on having a small hand-labeled subset of the data, training the models on that subset, and later classifying the rest of the data set, which is unlabeled. The ML approaches that have been used include, among others, HMMs.
Venkataraman et al. (2002) propose to train DA taggers with minimal supervision. The proposed method is to bootstrap from a small training set of hand-labeled data and to iterate over the unlabeled data while relabeling the data set. That is, a DA tagger is initially trained on a small amount of labeled data; the tagger is then used to tag a portion of the unlabeled data. After that the tagger is retrained using all of the previously labeled data.
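The bootstrap-and-relabel scheme can be sketched as follows. The trainer interface and the toy majority classifier below are assumptions made for illustration; the actual taggers in the cited work are HMM-based:

```python
from collections import Counter

def self_train(train_fn, seed, unlabeled, rounds=3):
    """train_fn(pairs) returns a predict(x) function. Each round,
    relabel the unlabeled data with the current tagger and retrain
    on the hand-labeled seed plus the fresh pseudo-labels."""
    labeled = list(seed)
    for _ in range(rounds):
        predict = train_fn(labeled)
        labeled = list(seed) + [(x, predict(x)) for x in unlabeled]
    return train_fn(labeled)

# Toy trainer: predicts the majority label seen during training.
def majority_trainer(pairs):
    majority = Counter(label for _, label in pairs).most_common(1)[0][0]
    return lambda x: majority

tagger = self_train(majority_trainer, [("hi there", "Greeting")], ["ok", "yes"])
print(tagger("hello"))  # Greeting
```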
One of the conclusions drawn from the results of the proposed method is that in some cases "relatively small amounts of labeled data can be more effective than much larger amounts of unlabeled data" [3, p.6]. "There is no data like labeled data" [1, p.275] is another conclusion in Venkataraman et al. (2003), where the same approach of bootstrapping and relabeling is used, along with taking the prosody of utterances into consideration.
2.3.2 Unsupervised approaches
One of the differences between semi-supervised and unsupervised approaches is that in unsupervised approaches none of the data used for model training is labeled. Thus in many of the approaches described below a dichotomy arises between the DAs found by the unsupervised methods and the DAs found by humans; this is especially true for the approaches which try to classify utterances into already pre-defined DAs.
Some specific approaches that will be summarized below include graph-based, K-Means and non-parametric Bayesian approaches.
Andernach et al. (1997) proposed Kohonen Self-Organizing Maps (SOMs) as a method for unsupervised DA class finding. Andernach et al. used superficial features to tag the utterances and then used these tags as features for clustering. That is, each utterance was abstracted to a number of feature tags, which included Speaker, Wh-word, First Verb Type, etc. The feature vectors used in the SOMs were constructed from these feature tags rather than from the concrete words in the utterances. In short, the words were abstracted away in favor of the superficial tags.
The corpus used in this study was human-computer, and Andernach et al. concluded that the SOM was automatically able to separate computer utterances from human utterances.
Joty et al. (2011) compared three different models for unsupervised DA tagging: one graph-theoretic clustering model and two probabilistic models. The models explored did not tag each utterance with a DA, but rather clustered the utterances into clusters that should contain sentences of the same DA. The drawback of the graph-theoretic model was that it did not take the sequential structure of a dialogue into consideration, thus rendering the model unsuitable for the task of unsupervised DA clustering. Additionally, graph-theoretic models are computationally expensive to formulate and to calculate.
In order to avoid clustering utterances into conversational topics rather than DA clusters, it is possible to remove certain words or certain POS-tags from the utterances. Noun masking in particular was used by Joty et al. in order to abstract away topics.
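A minimal sketch of noun masking on a POS-tagged utterance (the tags here are supplied by hand for illustration; a real pipeline would obtain them from a POS tagger):

```python
# Replace nouns with a placeholder so that clustering groups utterances
# by dialogue-act form rather than by topic words.
def mask_nouns(tagged_utterance, placeholder="NOUN"):
    return [placeholder if tag.startswith("NN") else word
            for word, tag in tagged_utterance]

utt = [("where", "WRB"), ("is", "VBZ"), ("the", "DT"), ("station", "NN")]
print(mask_nouns(utt))  # ['where', 'is', 'the', 'NOUN']
```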
While Joty et al. (2011) and Elsner and Charniak (2010) focused on matching pre-defined DA annotations to the clustering results, Ritter et al. (2010) propose a different approach: to first cluster utterances and then to discover DAs based on these clusters. In other words, Ritter et al. propose to use the clusters to discover DAs in a corpus, rather than trying to match clusters to pre-defined DAs.
Ritter et al. (2010) conducted a study with a Twitter-based corpus and thus carried out clustering in an open-topic domain (meaning that dialogues are not limited to one topic in particular). In order to avoid having topic-based clusters, which is a problem in an open-topic domain, they attempted to filter out topic information from the corpus using different filters, such as Latent Dirichlet Allocation (LDA) (more information below). Their topic filtering can be compared to the noun masking carried out by Joty et al. (2011) (mentioned above), as both try to achieve the same result of not clustering utterances based on topics. Ritter et al. use HMM and LDA to cluster the utterances.
In contrast to the graph-based clustering models used by Joty et al. (2011) and Elsner and Charniak (2010), Ohtake (2008) uses K-Means for unsupervised DA classification. Ohtake's method follows these three steps:
1. “Construct a feature vector from an utterance.
2. Reduce the dimensions of the feature space using a latent variable model.
3. Classify the vector whose dimension was reduced using an unsu- pervised classification method.”
In the second step of the method, Ohtake (2008) uses LDA to reduce the dimensions of the feature space. LDA is a method where latent variables model the topic of a text segment. Ohtake reduces the feature space by using the topic information indicated by the latent variables in LDA. Ritter et al. (2010) also use LDA to filter out topic information from their data. One of the bullet points for future work that Ohtake presents is, however, whether compression of the feature space "is really effective for DA classification or not" [5, p.451]. Ohtake also raises an issue with the number of latent variables in the LDA model.
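Ohtake's three steps can be sketched as below, with two stated simplifications: truncated SVD stands in for LDA as the dimension-reduction step, and a minimal K-Means loop stands in for the unsupervised classifier; the toy data is random:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: feature vectors (toy bag-of-words counts, 6 utterances x 5 terms).
X = rng.integers(0, 3, size=(6, 5)).astype(float)

# Step 2: reduce to 2 dimensions (truncated SVD as a stand-in for LDA).
U, s, _ = np.linalg.svd(X, full_matrices=False)
Z = U[:, :2] * s[:2]

# Step 3: cluster the reduced vectors with a few K-Means iterations.
def kmeans(Z, k=2, iters=10):
    centers = Z[:k].copy()
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(0)
    return labels

labels = kmeans(Z)
```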
Moreover, Ohtake paraphrases utterances in a rule-based manner in order to simplify the problem of DA classification with the unsupervised method. The reason for paraphrasing some of the utterances is that there is a variety of expressions in Japanese that mean the same thing.
While paraphrasing worked well for some of the hand-labeled DAs, it showed poor results when a small experiment was carried out using a semi-supervised approach. Based on these contradictory results, Ohtake's overall conclusion was that a paraphraser should not be general, but should rather focus on specific problem areas within the context of the corpus.
Ezen-Can and Boyer (2014) first pre-process the dialogues by replacing some of the function words with their POS-tags and retaining the stemmed versions of the content words, and then use K-Means clustering to group similar utterances into one cluster, which can then be matched with a pre-defined DA.
One of the challenges which Ezen-Can and Boyer discuss is the fact that in some corpora pre-defined DAs separate topic-based utterances from other types of utterances that may not be relevant to the dialogues per se. In other words, if a dialogue is at its core about a teacher helping a student to understand computer science, utterances such as "Should I close the door?" are not relevant to the dialogue and could be manually tagged as Extra Domain [6, p.7] in order to mark their irrelevance. Ezen-Can and Boyer conclude that "[f]uture work will explore combining unsupervised dialogue act modeling with unsupervised topic modeling in order to address this type of modeling challenge" [6, p.7].
Ezen-Can and Boyer (2014) additionally look into incorporating non-cognitive factors, such as gender and domain-specific self-efficacy, into the unsupervised DA classification (the corpus utilized in this particular study consists of tutor-student interactions). The results of the study showed that DA classifiers incorporating non-cognitive factors outperformed those that did not.
Crook et al. (2009) used a Dirichlet Process Mixture Model (DPMM), a non-parametric Bayesian approach, to address the problem of unsupervised DA classification. A specific formulation of the DPMM called the Chinese Restaurant Process (CRP) was used. One of many differences between, for instance, K-Means and this particular method is that CRP produces an arbitrary number of clusters; that is, the number of clusters is not pre-determined as in K-Means. Additionally, CRP is a more sophisticated and complex approach to the problem compared to K-Means. The result was that CRP clustered the utterances into 53 clusters. These 53 clusters were compared to 6 hand-labeled DA-tags in a quantitative manner by trying to match the clusters with the hand-labeled DAs.
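The CRP prior behind this model can be sketched in a few lines: each new utterance joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter alpha, so the number of clusters is not fixed in advance. This is a sketch of the prior only, not of the full inference used by Crook et al.:

```python
import random

def crp_assignments(n, alpha=1.0, seed=42):
    """Draw cluster assignments for n items from a Chinese Restaurant
    Process prior with concentration alpha."""
    random.seed(seed)
    counts = []                      # counts[j] = current size of cluster j
    assignments = []
    for _ in range(n):
        weights = counts + [alpha]   # existing clusters, then a new one
        j = random.choices(range(len(weights)), weights=weights)[0]
        if j == len(counts):         # the "new table" was chosen
            counts.append(0)
        counts[j] += 1
        assignments.append(j)
    return assignments

assignments = crp_assignments(100)
```

Note that larger alpha tends to produce more clusters, which connects to the observation below that such models may generate many more clusters than there are hand-labeled DAs.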
The corpus employed by Crook et al. was a human-computer corpus. CRP was able to cluster the computer utterances with ease and to differentiate them from the user utterances. This means, however, that it is not clear whether this particular method would perform well on a human-human corpus.
Higashinaka et al. (2014) compare K-Means, CRP and the infinite HMM. The infinite HMM is also a non-parametric Bayesian method, and it is similar to CRP in the sense that the number of clusters is not pre-determined beforehand; one crucial difference between the two models is that the infinite HMM takes the context of an utterance into account when clustering.
In addition to the comparison between the three clustering methods, Higashinaka et al. also compare different levels of abstraction of words: no abstraction, that is, a Bag-of-Words representation; one level of abstraction where content words such as nouns, verbs, etc. are abstracted to their POS-tags; and another level where all words are abstracted to their POS-tags.
Higashinaka et al. additionally compare two different types of corpora: one human-computer corpus and one human-human corpus. Their conclusion is, unsurprisingly, that it is harder to achieve good results with human-human corpora.
Another observation they make is that the two levels of abstraction consistently perform worse than no abstraction for all of the methods and corpora. Their conclusion is that words are significant for distinguishing DAs.
Overall, Higashinaka et al. report that the infinite HMM outperforms both K-Means and CRP. On the downside, however, the infinite HMM generates considerably more clusters than the number of pre-labeled DAs. While CRP generates a number of clusters that is in general close to the number of pre-labeled DAs, the infinite HMM in one case generates five times more clusters than there are pre-labeled DAs. One of the final conclusions is that context should be incorporated when solving the problem of unsupervised DA clustering.
In summary, clustering methods such as K-Means and non-parametric Bayesian methods have previously been used for unsupervised DA classification. It is common to abstract away words or use superficial features in order to facilitate DA classification with unsupervised methods. Issues of topic clustering, rather than DA clustering, have also been raised by many. All in all, unsupervised DA classification is a non-trivial problem.
2.3.3 Related evaluation methods
Evaluation methods for unsupervised ML approaches differ from those for supervised ML approaches, because unsupervised approaches do not always classify the utterances the way a human would have classified them. Below the reader will find some of the evaluation methods used in the studies described above.
Joty et al. consider the problem of evaluating unsupervised DA clustering approaches. Since DAs are clustered into clusters which are not labeled, it is not possible to use evaluation methods such as the κ statistic and the F1 score, which are otherwise widely used for DA classification when supervised ML approaches are utilized.
Further, Joty et al. (2011) propose to use one-to-one accuracy, introduced by Elsner and Charniak (2010) (who also used a graph-based clustering model), in order to evaluate unsupervised DA-clustering approaches. One-to-one accuracy is more thoroughly described in Section 4.3 Evaluation methods (p. 28).
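The idea behind one-to-one accuracy can be sketched as follows: find the one-to-one mapping from cluster labels to gold DA labels that maximizes agreement. The brute-force search below is an illustrative simplification that assumes few labels and no more clusters than gold labels; the Hungarian algorithm solves the matching efficiently in general:

```python
from collections import Counter
from itertools import permutations

def one_to_one_accuracy(clusters, gold):
    """Fraction of items explained by the best one-to-one mapping from
    cluster labels to gold labels (brute force over label permutations)."""
    overlap = Counter(zip(clusters, gold))
    c_labels = sorted(set(clusters))
    g_labels = sorted(set(gold))
    best = 0
    for perm in permutations(g_labels, len(c_labels)):
        best = max(best, sum(overlap[(c, g)] for c, g in zip(c_labels, perm)))
    return best / len(clusters)

acc = one_to_one_accuracy([0, 0, 1, 1], ["Q", "Q", "A", "Q"])
print(acc)  # 0.75: cluster 0 -> Q (2 hits), cluster 1 -> A (1 hit)
```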
For evaluation purposes, Ritter et al. (2010) propose two methods: one qualitative and one quantitative. The qualitative method visualizes the emergent model of DAs, while the quantitative method "measures the intrinsic quality of a conversation model by its ability to predict the ordering of posts of conversations" [4, p. 178]. That is, since Ritter et al. use generative machine learning methods, they exploit this by measuring the predictive power of a model.
Joty and Carenini (2011), Elsner and Charniak (2010), and Ritter et al. (2010) agree, however, that evaluation of unsupervised clustering "is a non-trivial task" [23, p.394], irrespective of whether one tries to evaluate the results by comparing to pre-defined DAs or by automatically discovering DAs [2,4,23].
Ohtake's evaluation approach was to compare the clusters to the hand-labeled DAs; that is, to see how many utterances labeled with the DA Acknowledge were clustered into each and every cluster. Some of the clusters had a strong correlation with the hand-labeled DAs, while other DAs were "scattered" over a handful of different clusters.
Ezen-Can and Boyer used two evaluation methods: one quantitative and one qualitative. The quantitative method consisted of matching the clusters to the manual labels (pre-defined DAs). The qualitative method, on the other hand, consisted of inspecting the clusters and determining the type of grouping criteria of the utterances in a specific cluster.
In summary, the most widely used quantitative method in connection with unsupervised classification of DAs is matching the pre-defined DAs to the DAs inferred by the applied classification method. The qualitative methods include looking into the way the utterances were clustered and trying to find the reason for a particular clustering.
Lastly, there is consensus that evaluating unsupervised DA classification is a non-trivial problem and that there are different evaluation methods, both quantitative and qualitative, none of which has yet become a standard.
3.1 The Wizard-User Corpus
As previously mentioned in Section 2.2 Corpora (p. 6), it is important to choose the right corpus for the specific problem at hand. Given the assumptions A.1 (words can be abstracted to POS-tags), A.2 (separating the speakers makes the clusters of utterances more specialized) and A.3 (non-content utterances do not impact POS-tagging), and given the aim of the project (Section 1.2 The master’s thesis project aim (p. 2)), the most fitting corpus for this master’s thesis project should consist of task-oriented spoken human-human or human-computer dialogues where the speakers have different roles.
With the above criteria in mind, the corpus chosen for this master’s thesis project consists of 40 spoken, transcribed dialogues comprising in total 1360 utterances. The dialogues come from a study where one subject (the Wizard) was giving another subject (the User) directions in a simulated environment. The User’s task was to get from point A to point B and the Wizard’s task was to provide the User with directions. The User knew where he or she was and what his or her goal was, but did not have a map. The Wizard, on the other hand, had a map but no way of knowing where the User was, except by relying on the User’s description of his or her position. The Wizard and the User thus had asymmetrical roles. The Wizard and the User used a push-to-talk mechanism to talk. All of the dialogues were conducted in Swedish.
CHAPTER 3. CORPUS

All of the utterances in the corpus had already been DA tagged by hand into 42 hierarchical DAs; every utterance was tagged with at least one DA. The 42 DAs were later re-mapped into 14 DAs. The re-mapping was done in order to eliminate the hierarchy and to make each DA more general, thereby simplifying the problem (see Appendix A.2 Remapping of Dialogue Acts (p. 52) for the remapping schema). All of the 14 DAs can be seen in Table 3.1. Table 3.2 shows an excerpt of a dialogue showing the speaker, the utterance and the associated DA.
The corpus is divided into 40 files, each file containing one dialogue. The files are in XML format and thus each utterance has meta information associated with it. The meta information consists of, among other things, the transcript of the utterance, the hand labeled DA and the speaker. Appendix A.1 Dialogue example (p. 51) shows one of the dialogues in the corpus, complete with the speaker of each utterance and the one of the 14 DAs with which each utterance is associated.
As mentioned above, the Wizard and the User have asymmetrical roles in the dialogue. This is mirrored in the fact that some DAs are present only in the utterances of one of the speakers; see Table 3.3 for further information.
In a sense the Wizard-User corpus is similar to the famous Map Task corpus, which has been used in some DA classification studies . Like the Wizard-User corpus, the Map Task corpus consists of “task-oriented cooperative problem solving” [11, p. 26], where one participant possesses a map with a route from point A to point B and is responsible for providing the other participant with instructions on how to draw the route on a map where no route is marked.
One difference between the corpora is that in the Map Task corpus the participants were situated in the same room, while in the Wizard-User corpus the participants could not see each other. The Map Task corpus is also in English and is larger than the Wizard-User corpus, as it consists of 128 dialogues.
In summary, the Wizard-User corpus consists of transcribed task-oriented human-human dialogues and is in some aspects similar to the famous Map Task corpus.
DA | Utterance (In English) | # in corpus
AssertPosition | jag passerar trähuset nu (I’m now passing by the tree house) | 313
RequestActWait | vänta lite (wait a moment) |
RequestGoal | okej vart ska du (okay where are you going) | 26
AssertProblem | jag tappar bort mig (I keep losing track of where I am) | 20
Acknowledge | okej (okay) |
AssertRoute | nu tar du höger tills du kommer till nummer elva på den gatan (now walk to your right until you reach number eleven on that street) |
AssertComplete | bra det var dit du skulle farkostteknik (good that’s where you were headed vehicle engineering) |
Other | gå fram till (go to) | 35
Social | man tackar (thank you) | 22
RequestRoute | jag kan inte komma till tunnelbanan jag är i återvändsgränd (I cannot go to the subway I’m in a blind alley) |
SignalNonUnderstanding | jag hör inte vad du säger kan du ta det en gång till lite tydligare (I can’t hear what you are saying can you say it again a bit clearer) |
RequestPosition | står du mellan nummer tretton och tolv (are you standing between numbers thirteen and twelve) |
AssertGoal | institutionen för farkostteknik (department of vehicle engineering) | 45
Table 3.1: Table shows all 14 pre-defined hand labeled DAs in the corpus, along with an example utterance and the number of utterances for each DA.
Speaker | Utterance (In English) | DA
User | jag står vid institutionen för industriell ekonomi och organisation (I’m standing at the department of industrial economics and management) |
Wizard | okej och vart ska du gå (okay and where are you going) | RequestGoal
User | jag vill till institutionen för maskinkonstruktion (I want to go to the department of machine construction) |
Wizard | okej nu vet jag var det ligger om du går på den gata där du är så ser du att det står nummer tolv på den gatan (okay I know now where it is if you walk on the street where you are then you will see that it says number twelve on that street) |
Wizard | okej gå förbi den tills du kommer fram till ett trähus rakt framför dig (okay walk past it until you reach a tree house in front of you) |
User | åt vilket håll (in which direction) | RequestRoute
User | jag står rakt framför tolv (I’m standing right in front of twelve) | AssertRoute
Table 3.2: Table contains an excerpt from the dialog exp1_1a from the corpus.
Each utterance is labeled with a DA and the speaker of the utterance.
DA | Speaker
RequestGoal | Wizard
AssertRoute | Wizard
RequestPosition | Wizard
RequestRoute | User
Table 3.3: Table shows DAs in the corpus that are only present in the utterances of a particular speaker.
This chapter contains the corpus preprocessing steps, an overview and the details of K-Means+HMM and three evaluation methods.
4.1 Preprocessing of the corpus
4.1.1 Removing non content utterances
The assumption A.3 states that non-content words can be omitted from the corpus in a pre-processing step without impacting the quality of the result.
Nivre et al. (1996) concluded, for instance, that pauses do not affect probabilistic POS tagging of transcribed spoken language . As one of the first steps in the process is POS-tagging, it is sound to assume that omitting the utterances transcribed as TYST (silence) and NONSENSE is sensible.
In short, all of the utterances transcribed as TYST and NONSENSE are removed from the corpus.
4.1.2 Remapping of the DAs
The corpus was hand labeled with 42 DAs; these 42 DAs were remapped into 14 DAs similar to the ones described by Skantze (2004) . Since AssertActWait, Greeting and Thanks in Skantze’s work are assigned only to a handful of utterances, they are remapped into the DAs Other and Social. A complete list of the remapping of the DAs can be found in Appendix A.2 Remapping of Dialogue Acts (p. 52).
The remapping is conducted in order to have enough utterances labeled with each particular DA; that is, to prevent a DA type from being too specific and having too few utterances associated with it. In other words, the reason for the DA remapping is data sparsity: the idea is to reduce the sparsity of the data without compromising the information which the data provides.

CHAPTER 4. METHOD
4.1.3 Speaker separation
The utterances of the speakers in each dialogue are separated according to assumption A.2. Since the speakers in this particular corpus have asymmetrical roles, such that 5 out of 14 DAs belong to only one of the speakers, it is natural to assume that separating the utterances by speaker facilitates clustering the utterances into clusters corresponding to DAs.
In short, all of the utterances belonging to the speaker Wizard are separated from the utterances belonging to the speaker User.
4.1.4 POS-tagging

Once the utterances are separated by speaker, each word of each utterance is POS-tagged by JSON Tagger, which is built on Stagger . An example of POS-tagged utterances by the Wizard from dialogue exp1_a1 can be found in Appendix A.3 POS-tagged example (p. 53).
4.1.5 Further preprocessing steps
P.1 Adding “start” and “end” to each utterance
P.2 Retagging some words that were incorrectly tagged.
P.3 Retagging street names and names of institutions as PM (proper name).
P.4 Tagging some words that are significant for the dialogues with distinct superficial feature tags.
P.5 Removing all of the words and leaving POS tags only.
The list above presents a number of small preprocessing steps in order to prepare the corpus for K-Means+HMM. Each small preprocessing step is described in detail below.
Step P.1 refers to adding the words “start” and “end” to each utterance. This addition is made to clearly separate each utterance and to also make sure that each feature vector has at least two features (read further below, M.1).
Step P.2 refers to the process of rule-based re-tagging of some of the words that were consistently incorrectly tagged by JSON Tagger. See Table 4.1 for a summary of the incorrectly tagged words and their new POS-tags. The new POS-tags were chosen through a study of the context in which the incorrectly tagged words occurred.
Words | Tagged as | Re-tagged as
här, där, då | HA (relative adverb) | AB (adverb)
bra, toppen | JJ (adjective) | IN (interjection)
Table 4.1: Table summarizes the words which were incorrectly tagged by JSON tagger and their correct POS tags.
Step P.3 refers to the retagging of the street names and the names of the institutions to the POS-tag PM (proper name). The street names for example, if they consisted of one word, were originally tagged as NN (noun) and were re-tagged as PM.
Step P.4 refers to the process of rule based re-tagging of some of the key words, which by themselves hold special meaning and/or correlate specifically to a predefined DA, see Table 4.2. These special tags were produced in order to facilitate clustering of the utterances.
Additionally, at this stage all of the non-content words, along with their POS-tags, were removed. The non-content words include: eeh, eh, eehm, NONSENS and TYST. As mentioned in assumption A.3, it is assumed that the non-content words are not interesting in the context of this master’s thesis project.
Words | Re-tagged as
ja, japp, jajamensan | Y
uppfattat, okej, precis | UNDRSTND
framme, klart, hurra | DONE
Table 4.2: Table summarizes the words that were re-tagged according to superficial features, sometimes correlating to the pre-defined DAs.
At step P.5 all of the words are removed from the corpus and only the POS-tags are left.
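As an illustration, steps P.1, P.2, P.4 and P.5 can be sketched in Python. The function name, the (word, POS-tag) input format and the START/END tags are assumptions made for illustration, and step P.3 (proper-name retagging) is left out since it requires a name lexicon; the rule tables mirror Tables 4.1 and 4.2:

```python
# Re-tag rules from Table 4.1 (mistagged words) and Table 4.2 (superficial tags).
RETAG = {
    "här": "AB", "där": "AB", "då": "AB",             # HA -> AB
    "bra": "IN", "toppen": "IN",                       # JJ -> IN
    "ja": "Y", "japp": "Y", "jajamensan": "Y",
    "uppfattat": "UNDRSTND", "okej": "UNDRSTND", "precis": "UNDRSTND",
    "framme": "DONE", "klart": "DONE", "hurra": "DONE",
}
NON_CONTENT = {"eeh", "eh", "eehm", "NONSENS", "TYST"}

def preprocess(tagged_utterance):
    """P.1, P.2, P.4, P.5: drop non-content words, apply the re-tag rules,
    frame the utterance with START/END, and keep only the POS-tags."""
    tags = [RETAG.get(word, tag)
            for word, tag in tagged_utterance
            if word not in NON_CONTENT]
    return ["START"] + tags + ["END"]  # P.5: the words themselves are gone

print(preprocess([("okej", "IN"), ("eh", "IN"), ("gå", "VB"), ("här", "HA")]))
# -> ['START', 'UNDRSTND', 'VB', 'AB', 'END']
```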
4.2 An overview of K-Means+HMM
The following list of steps provides the reader with an overview of the necessary steps in the K-Means+HMM method. Each step is more thoroughly described in the following subsection.
M.1 Extraction of feature vectors
Extracting features from the dialogues and making feature vectors. Reducing the dimensionality of the feature space by removing features that occur only once in the feature space.
M.2 K-Means clustering
Clustering the feature vectors into X clusters with K-Means. The utterances from each speaker are clustered separately.
M.3 Training HMM
Combining the clusters and including the cluster labels in the dialogues and training HMM with the cluster labels instead of the words in the utterances or their feature vectors:
• The HMM is provided with X random examples of cluster labels and their associated DAs from the corpus, making K-Means+HMM a semi-supervised method
M.4 Applying Viterbi
Each cluster label in the dialogues is associated (via Viterbi) with a DA inferred from the HMM.
4.2.1 Details on K-Means+HMM
M.1 Extraction of feature vectors
All of the utterances need to be represented by feature vectors. The features are, more precisely, bi-grams of POS-tags:
• Each distinct bi-gram of POS-tags over all of the utterances of one speaker in the whole corpus is assigned an index, where 0 ≤ index < n_features and n_features is the number of distinct bi-grams in the corpus. That is, each distinct bi-gram of POS-tags is a feature.
• For each of the utterances, count the number of occurrences of each feature and put the sum of the occurrences of the feature in the feature vector at the index corresponding to that particular feature. (For more information see .)

Speaker | Dimensionality
Wizard | 624x228
Reduced Wizard | 624x187
Reduced User | 736x154

Table 4.3: Table shows the dimensionality of the data before and after reduction.
Additionally, in order to reduce the dimensionality of the feature space and to weed out less important features, features that occur fewer than two times are removed from the feature vectors before clustering. Table 4.3 shows the dimensionality of the feature space before and after reduction.
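The feature extraction in M.1 can be sketched as follows. The function `bigram_vectors` and its interface are assumptions made for illustration; in practice the text feature extraction library listed later in the chapter was used:

```python
def bigram_vectors(utterances, min_count=2):
    """M.1: count vectors of POS-tag bi-grams. Features occurring fewer
    than min_count times in the whole corpus are dropped, which reduces
    the dimensionality of the feature space (cf. Table 4.3)."""
    def bigrams(tags):
        return list(zip(tags, tags[1:]))
    # Count every bi-gram over the whole corpus of one speaker
    counts = {}
    for u in utterances:
        for bg in bigrams(u):
            counts[bg] = counts.get(bg, 0) + 1
    # Each distinct, frequent enough bi-gram gets an index 0 <= i < n_features
    features = sorted(bg for bg, c in counts.items() if c >= min_count)
    index = {bg: i for i, bg in enumerate(features)}
    vectors = []
    for u in utterances:
        vec = [0] * len(features)
        for bg in bigrams(u):
            if bg in index:
                vec[index[bg]] += 1
        vectors.append(vec)
    return vectors, features

vecs, feats = bigram_vectors(
    [["START", "IN", "END"], ["START", "IN", "VB", "END"], ["START", "VB", "END"]])
print(feats)  # -> [('START', 'IN'), ('VB', 'END')]
print(vecs)   # -> [[1, 0], [1, 1], [0, 1]]
```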
M.2 K-Means clustering
All of the feature vectors are clustered by K-Means.
K-Means is a simple clustering method in which n feature vectors are partitioned into k clusters such that the within-cluster sum of squared distances from the feature vectors to their cluster centroid is minimized.
The K-Means algorithm consists of the following steps:
1. The first step is to choose initial centroids by randomly picking k feature vectors from the feature space. After the initialization step K-Means loops between steps 2 and 3.
2. Each feature vector is assigned to its nearest centroid.
3. New centroids are computed as the mean value of all feature vectors assigned to each old centroid.
The algorithm repeats the last two steps until the difference between the new and the old centroids is smaller than a threshold. In this case the threshold is 1e-4 and the objective is the inertia, that is, the within-cluster sum of squares. For more information on K-Means see .
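The steps above can be sketched as a minimal pure-Python K-Means; this is an illustrative sketch, not the library implementation actually used:

```python
import math
import random

def kmeans(vectors, k, tol=1e-4, seed=0):
    """Minimal K-Means (steps 1-3): random initialisation, assign each
    vector to its nearest centroid, recompute centroids as cluster means,
    and stop when the centroids move less than tol."""
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]  # step 1

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    while True:
        # step 2: assign each feature vector to its nearest centroid
        labels = [min(range(k), key=lambda j: dist(v, centroids[j]))
                  for v in vectors]
        # step 3: new centroids are the means of the assigned vectors
        new = []
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                new.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new.append(centroids[j])  # keep an empty cluster's centroid
        if max(dist(c, n) for c, n in zip(centroids, new)) < tol:
            return labels, new
        centroids = new

labels, _ = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)
# The two well-separated pairs of points end up in different clusters.
```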
As previously mentioned, the utterances of each speaker are separated and are also clustered separately. That is, utterances uttered by the Wizard are clustered into, for instance, 7 clusters (cluster labels 0-6) and utterances from the User are clustered into another 7 clusters (cluster labels 7-13). The cluster labels are crucial for the next step in the method.
M.3 Training the HMM
The Hidden Markov Model (HMM) is used in order to take into consideration the context of a dialog. HMM is a method that specifically deals with sequential data , which of course a dialogue is.
In general, an HMM can solve the task of determining to which class in a set of classes an observation belongs. An HMM is a Markov process with hidden states and observations; a state is in this case the class of the observation . In the task of, for instance, determining the POS-tag of a word, the hidden state in the HMM is the POS-tag and the observation is the word itself. An HMM has the following components:
S = s1 s2 ... sN | The N states
A = a11 a12 ... an1 ... ann | Transition probability matrix: the probabilities of moving from state i to state j
O = o1 o2 ... oT | A sequence of T observations
B = bi(ot) | Emission probabilities: the probability that observation ot is emitted by state i
Table 4.4: Table exhibits the components that define an HMM. For further information on HMM see [8, 28].
In this particular case the DAs are the hidden states; the observations, however, are not the words but the cluster labels (CL) obtained from K-Means clustering. That is, each utterance in all of the dialogues belongs to a certain cluster. In other words, the dialogues in the corpus have been approximated to sequences of cluster labels, where each utterance is marked with a cluster label. Figures 4.1 and 4.2 illustrate the model further.
Since the essence of this method is to infer DAs in a semi-supervised manner, the HMM is defined with 14 hidden states (the number of pre-defined DAs shown in Table 3.1) and uniform transition and initial probabilities. That is, the answers (the pre-defined hand labeled DAs associated with each utterance) from the corpus are used to generate neither the transition nor the initial probabilities.
As for the emission probabilities, they too are first defined as uniform, but are then updated with examples of CL-DA pairs. That is, for each DA in the set of DAs defined in Table 3.1, an example of a CL tagged by that DA forms an example CL-DA pair. These CL-DA pairs form the basis for the update of the emission probabilities.
Figure 4.1: Figure depicts an HMM with 4 hidden states, in this case the DAs DA1, DA2, DA3 and DA4, which can emit 4 cluster labels (CL) CL1, CL2, CL3 and CL4. aij is the probability of transitioning from state DAi to state DAj. bj(CLk) is the probability of emitting CLk in state DAj. In this HMM, DAs can only reach themselves or the adjacent DA.
Figure 4.2: Figure shows a lattice diagram of the observation sequence CL1, CL2, CL3 for the HMM in Figure 4.1. The thick arrows indicate the most probable transitions. As an example, the transition between state DA1 at time t=2 and state DA4 at time t=3 has probability α2(1) a14 b4(CL3), where αt(i) is the probability of being in state DAi at time t.
The update consists of computing the probability that dialogue act DA is expressed by cluster label CL, P(CL|DA), given by equation 4.1, where c(CL) is the number of occurrences of the cluster label CL, c(CL, DA) is the number of examples given where CL is an example of dialogue act DA, c′(CL) is the number of examples given where CL is an example of any DA, and N is the number of cluster labels.
P(CL|DA) = ( c(CL, DA) + (c(CL) - c′(CL)) / N ) / c(CL)    (4.1)
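Equation 4.1 can be sketched directly in Python; the function name and argument layout are assumptions made for illustration:

```python
def emission_prob(cl, da, pairs, cl_counts, n_labels):
    """Sketch of equation 4.1. `pairs` is the list of example (CL, DA)
    pairs, `cl_counts` maps each CL to its number of occurrences c(CL),
    and n_labels is N, the number of cluster labels."""
    c_cl_da = sum(1 for c, d in pairs if c == cl and d == da)  # c(CL, DA)
    c_cl_any = sum(1 for c, _ in pairs if c == cl)             # c'(CL)
    c_cl = cl_counts[cl]                                       # c(CL)
    return (c_cl_da + (c_cl - c_cl_any) / n_labels) / c_cl

# One example pair: cluster label 0 is an example of Acknowledge; CL 0
# occurs 10 times in the corpus and there are 5 cluster labels in total.
print(round(emission_prob(0, "Acknowledge", [(0, "Acknowledge")], {0: 10}, 5), 2))
# -> 0.28
```

Note that with no example pairs at all the expression reduces to 1/N for every DA, which is consistent with the emission probabilities initially being uniform.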
When the HMM is initialised, the Baum-Welch algorithm is used in order to estimate its parameters given a set of observations. That is, the HMM is trained using Baum-Welch; see more on Baum-Welch . The results of Baum-Welch are thus the transition and emission probability matrices, which contain the correlation between the CLs and the inferred DAs.
M.4 Applying Viterbi
Once the HMM is trained and the transition and emission probabilities are estimated, Viterbi is applied to all of the dialogues. Viterbi associates each CL in the sequence of CLs that forms a dialogue with a DA label inferred from the HMM. For more information on Viterbi see .
It is important to note that since the goal of this procedure is to find out the relationship between CLs and the inferred DAs there has been no division of the data for training and for testing. That is, the goal was not to test how well the trained HMM can tag new unseen CLs with a DA, but to find out which CLs were associated with which DAs.
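The decoding in M.4 can be illustrated with a minimal Viterbi implementation. This is a generic sketch, not the thesis code; the nested-list layout of pi, A and B is an assumption:

```python
def viterbi(obs, pi, A, B):
    """Viterbi decoding: the most probable sequence of hidden states (DAs)
    for an observed sequence of cluster labels. pi[i] is the initial
    probability of state i, A[i][j] the transition probability from state
    i to state j, and B[i][o] the probability that state i emits o."""
    n_states = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    back = []
    for o in obs[1:]:
        step, new_delta = [], []
        for j in range(n_states):
            # best predecessor state for reaching state j at this time step
            best = max(range(n_states), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][o])
        back.append(step)
        delta = new_delta
    # backtrack the most probable path
    path = [max(range(n_states), key=lambda j: delta[j])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]

# A toy 2-state example where each state strongly prefers its own symbol:
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi([0, 0, 1, 1], pi, A, B))  # -> [0, 0, 1, 1]
```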
4.3 Evaluation methods
The following list contains all the evaluation methods used to evaluate K-Means+HMM and to test the hypotheses H.1-H.3.
E.1 Measuring conditional probabilities for sequences of hand labeled DAs given the DAs inferred from the HMM, P(DA_HL | DA_HMM), compared to conditional probabilities for sequences of hand labeled DAs given cluster labels, P(DA_HL | CL); see equation 4.2 below.
P(DA_HL1...i | CL1...i) = ∏j=1..i P(DA_HLj | CLj)    (4.2)
Cluster label | Hand labeled DA | P(DA_HL | CL)
3 | RequestGoal (RG) | P(RG|3) = 0.21
5 | Acknowledge (Ack) | P(Ack|5) = 0.34
7 | RequestRoute (RR) | P(RR|7) = 0.02
2 | SignalNonUnderstanding (SNU) | P(SNU|2) = 0.51

Table 4.5: Table shows an example of a dialogue, where the pre-defined DAs and the cluster labels are associated with each utterance.
For instance, imagine the dialogue shown in Table 4.5 (after step M.1): the P(DA_HL | CL) for the entire dialogue would be:
P(DA_HL | CL) = P(RG|3) · P(Ack|5) · P(RR|7) · P(SNU|2) = 0.21 · 0.34 · 0.02 · 0.51 ≈ 7.3e-4
Since the conditional probabilities are so small, it is sometimes easier to use the logarithms of the probabilities: in this case ln(7.3e-4) ≈ -7.2. Unsurprisingly, P(DA_HL | CL) tends to grow smaller as the number of utterances in a dialogue grows larger. Thus this evaluation method is perhaps best used for comparing the effect of some parameter on the dialogues, although in general, the larger the P(DA_HL | CL) for a dialogue, the better the DA classification.
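The computation in this worked example can be sketched as follows; the dictionary-based interface is an assumption made for illustration:

```python
import math

def dialogue_log_prob(dialogue, cond_probs):
    """E.1: ln P(DA_HL | CL) for a whole dialogue, computed as the sum of
    the logs of the per-utterance conditional probabilities (the product
    in equation 4.2). `dialogue` is a list of (CL, DA_HL) per utterance."""
    return sum(math.log(cond_probs[(cl, da)]) for cl, da in dialogue)

# The worked example from Table 4.5:
probs = {(3, "RG"): 0.21, (5, "Ack"): 0.34, (7, "RR"): 0.02, (2, "SNU"): 0.51}
dialogue = [(3, "RG"), (5, "Ack"), (7, "RR"), (2, "SNU")]
print(round(dialogue_log_prob(dialogue, probs), 1))  # -> -7.2
```

Summing logs rather than multiplying raw probabilities also avoids numerical underflow for long dialogues.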
This particular evaluation method is a novel proposition: compared to, for instance, E.2, it does not match a pre-defined DA with each cluster label, but rather measures the probability that the utterances belong to a specific DA given a specific cluster.
E.2 One-to-one accuracy, as described by Elsner and Charniak (2010) and mentioned in Section 2.3.3 Related evaluation methods (p. 12) . In order to compute the matching between the hand labeled DAs and the DAs inferred by both K-Means and K-Means+HMM, the Hungarian algorithm was utilized. The Hungarian algorithm solves the assignment problem in polynomial time; that is, it matches, in our case, CLs with DAs in order to globally maximize the one-to-one accuracy (read further ).
A general example of the algorithm is assigning tasks to workers, who all demand different amounts for the different tasks, in order to minimize the overall cost. The algorithm consists of the four steps below. The input to the algorithm is an n by n matrix.
Step 1 Row wise: find the lowest value in each row and subtract that value from each element in the row.
Step 2 Column wise: as in Step 1, find the lowest value in each column and subtract that value from each element in the column.
Step 3 Cover all zeros in the matrix with a minimum number of vertical and horizontal lines. If the number of lines is equal to n then the algorithm stops as an optimal solution is found among the zeroes. If the number of lines < n continue to Step 4.
Step 4 Find the smallest value that is not covered by a line in Step 3 and subtract that value from all the values not covered by the lines. Then add that value to all of the values that are covered twice by the lines.
In order to use the algorithm for maximization, the problem can be seen as minimizing the lost profit instead of maximizing the profit. The only alteration needed to the algorithm described by Steps 1-4 in order to find the assignment which maximizes the profit is that the largest value in the input matrix must be found and all other values in the matrix subtracted from it. Let me demonstrate with an example: say that there are four different types of stores and four different locations, and each type of store makes a different amount of profit in each location; the problem is then to determine the minimum of lost profit by assigning each store a location. Table 4.6 shows the types of stores and the amount of profit each type could make in locations 1-4; the table is analogous to the (4x4) input matrix for the algorithm.
The largest value in Table 4.6 is 20 and so all of the other values need to be subtracted from 20; Table 4.7 shows the result of the subtraction.
After this alteration to the table the steps Step 1 - Step 4 can be made.
Table 4.8 illustrates Step 1, Table 4.9 illustrates Step 2, Table 4.10 illustrates Step 3 and finally Table 4.11 shows the result.
After the assignment problem is solved, the one-to-one accuracy can be measured as follows:
“Given two annotations (model’s output and human annotation), it pairs up the clusters from the two annotations in a way that max- imizes (globally) the total overlap and then reports the percentage of overlap.”
Store | 1 | 2 | 3 | 4
Books | 4 | 12 | 10 | 19
Shoes | 5 | 6 | 16 | 3
Groceries | 20 | 13 | 17 | 5
Electronics | 11 | 10 | 4 | 15
Table 4.6: Table shows an example of an assignment problem where minimization of lost profit is required.
Store | 1 | 2 | 3 | 4
Books | 16 | 8 | 10 | 1
Shoes | 15 | 14 | 4 | 17
Groceries | 0 | 7 | 3 | 15
Electronics | 9 | 10 | 16 | 5
Table 4.7: Table shows the result of subtracting each value in Table 4.6 from 20.
Store | 1 | 2 | 3 | 4 | Subtract
Books | 16 | 8 | 10 | 1 | -1
Shoes | 15 | 14 | 4 | 17 | -4
Groceries | 0 | 7 | 3 | 15 | -0
Electronics | 9 | 10 | 16 | 5 | -5
Table 4.8: Table shows the lowest values in rows which need to be subtracted from each row.
Store | 1 | 2 | 3 | 4
Books | 15 | 7 | 9 | 0
Shoes | 11 | 10 | 0 | 13
Groceries | 0 | 7 | 3 | 15
Electronics | 4 | 5 | 11 | 0
Subtract | -0 | -5 | -0 | -0
Table 4.9: Table shows the result of row wise subtraction in Table 4.8 and the lowest values which need to be subtracted column wise.
Store | 1 | 2 | 3 | 4
Books | 15 | 2 | 9 | 0
Shoes | 11 | 5 | 0 | 13
Groceries | 0 | 2 | 3 | 15
Electronics | 4 | 0 | 11 | 0
Table 4.10: Table shows the column wise subtraction from Table 4.9 and the minimum number of lines covering all zeroes. The number of lines is equal to 4 and thus the optimal assignment is found; it is marked by the green cells.
Store | 1 | 2 | 3 | 4
Books | 4 | 12 | 10 | 19
Shoes | 5 | 6 | 16 | 3
Groceries | 20 | 13 | 17 | 5
Electronics | 11 | 10 | 4 | 15
Table 4.11: Table shows the result of the Hungarian algorithm. The values in Table 4.10 are replaced with the original values and the resulting assignment is marked by the green cells. The algorithm has found the assignment which maximizes the profit.
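The worked example can be checked programmatically. The brute-force search below is only for this 4x4 illustration; for real problem sizes a polynomial-time implementation such as a Munkres/Hungarian library is the appropriate choice:

```python
from itertools import permutations

# Profit matrix from Table 4.6 (rows: Books, Shoes, Groceries, Electronics;
# columns: locations 1-4).
profit = [
    [ 4, 12, 10, 19],
    [ 5,  6, 16,  3],
    [20, 13, 17,  5],
    [11, 10,  4, 15],
]

def best_assignment(matrix):
    """Enumerate all n! row-to-column assignments and return the one with
    the maximum total profit (feasible only for tiny matrices)."""
    n = len(matrix)
    return max(
        (sum(matrix[i][p[i]] for i in range(n)), p)
        for p in permutations(range(n))
    )

total, assignment = best_assignment(profit)
print(total, assignment)  # -> 65 (3, 2, 0, 1)
```

That is, the maximum total profit of 65 is reached by Books at location 4, Shoes at location 3, Groceries at location 1 and Electronics at location 2.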
E.3 A qualitative evaluation method of matching clusters with hand labeled DA tags. That is, looking at matrices of DAs and CLs and making observations.
The same qualitative method was used by Ohtake (2008)  and Ezen-Can and Boyer (2014)  and it is mentioned in Section 2.3 Related Work (p.
These three methods should be sufficient to evaluate the hypotheses H.1-H.3.
The evaluation methods E.1 and E.2 are generally useful for evaluation of the hypotheses. E.3, on the other hand, is a method which reveals information about the structures of the clusters in K-Means and the DAs inferred by K-Means+HMM, compared to the pre-defined hand labeled DAs.
It is additionally noteworthy that although K-Means is a step within the proposed novel method K-Means+HMM, it is possible to evaluate the results of K-Means separately. In fact, this comparison between K-Means and K-Means+HMM is crucial in order to evaluate hypothesis H.1.
Below the reader will find a list of non-trivial Python libraries and APIs used in the preprocessing of the corpus, the K-Means+HMM method and the evaluation methods.
Name | Used for
JSON-tagger | POS-tagging
Text feature extraction | Extracting features from the POS-tagged dialogues
K-Means | K-Means clustering
Hidden Markov Model | HMM
Munkres (aka the Hungarian algorithm) | Matching in one-to-one accuracy
This section contains the results of K-Means and K-Means+HMM as described in M.1-M.3. The methods are evaluated according to E.1-E.3.
5.1 K-Means vs. K-Means+HMM
Figure 5.1 shows the logarithms of the average conditional probabilities P(DA_HL | CL) and P(DA_HL | DA_HMM) (E.1) for each of the 40 dialogues, along with a baseline of guessing a DA for an utterance in a dialogue. The probabilities shown are an average of 100 runs of K-Means and K-Means+HMM; the number of clusters is 14 in total (7 clusters for the Wizard and 7 clusters for the User), and the number of examples used for the HMM is one per hand labeled DA. Figure 5.1 additionally shows that the length of a dialogue affects the conditional probabilities: the shorter dialogues have higher conditional probabilities than the longer dialogues.
Table 5.1 shows a weighted average of P (DA_HL | CL) and P(DA_HL | DA_HMM) (E.1) in negative log likelihood (NLL) over all of the 40 dialogues and one-to-one accuracy (E.2). The number of clusters and examples is the same as for Figure 5.1, see details above.
The weighted average, x̄, is shown in equation 5.1, where wi is the number of utterances in dialogue i, xi is the conditional probability for that same dialogue, and n is the total number of dialogues.
x̄ = (∑i=1..n wi xi) / (∑i=1..n wi)    (5.1)
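The weighted average in equation 5.1 can be sketched as a trivial helper; the function name is assumed for illustration:

```python
def weighted_average(values, weights):
    """Equation 5.1: average of the per-dialogue values x_i weighted by
    the number of utterances w_i in each dialogue."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Two dialogues with NLLs 9.0 and 12.0 containing 10 resp. 30 utterances:
print(weighted_average([9.0, 12.0], [10, 30]))  # -> 11.25
```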
Tables 5.2 and 5.4 show results measured by the qualitative method E.3.
CHAPTER 5. RESULTS

Figure 5.1: Figure shows the logarithm of an average of conditional probabilities for each one of the 40 dialogues for 100 runs of K-Means and K-Means+HMM. It is clear that K-Means outperforms K-Means+HMM.

Table 5.2 shows the results of K-Means clustering of the utterances divided by speaker. The table clearly reflects the fact that the speakers had different roles in the dialogue (see the colored cells). Some of the clusters show a strong correlation to the hand labeled DAs. For instance, DA RequestActWait and cluster 2, and DA No and cluster 7, both exhibit strong correlations. Other DAs, in contrast, are more evenly spread out between a number of clusters, for instance RequestRoute.
Model | E.1 in NLL | E.2
Baseline | 9 | —
K-Means | 9 | 35.02%
K-Means+HMM | 10 | 31.57%

Table 5.1: Table shows a weighted average of conditional probabilities in NLL over the 40 dialogues (E.1), and the one-to-one accuracy (E.2).

RequestGoal and RequestPosition are both clustered into cluster 0. Table 5.3 shows example utterances of RequestGoal and RequestPosition. These examples provide a hint to the reason why these DAs were clustered into the same cluster.
It is possible that they were both clustered into the same cluster because both of the DAs have the structure of a question.
In contrast to Table 5.2, Table 5.4 shows that AssertPosition and RequestRoute are clustered into DA-label 3 inferred from K-Means+HMM. AssertRoute and RequestRoute are clustered into DA-label 1. Acknowledge shows a strong correlation with one cluster in Table 5.2; Acknowledge is, however, more spread out between different DA-labels inferred by K-Means+HMM. All in all, the hand labeled DAs appear to be more spread out between the DA-labels inferred by K-Means+HMM, compared to the spread in Table 5.2.