
International Master’s Thesis

Clustering Semantically Related Questions

Nikolaos Karagkiozis

Computer Science

Studies from the Department of Technology at Örebro University, Örebro 2019


© Nikolaos Karagkiozis, 2019

Title: Clustering Semantically Related Questions


Abstract

There has been a vast increase in the number of users who use the internet to communicate and interact, and as a result, the amount of data created follows the same upward trend, making data handling overwhelming. Users are often asked to submit their questions on various topics of their interest, and usually, that itself creates an information overload that is difficult to organize and process. This research addresses the problem of extracting the information contained in a large set of questions by selecting the most representative ones from the total number of questions asked. The proposed framework attempts to find semantic similarities between questions and group them into clusters. It then selects the most relevant question from each cluster. In this way, the questions selected will be the most representative questions among all the submitted ones. To obtain the semantic similarities between the questions, two sentence embedding approaches, Universal Sentence Encoder (USE) and InferSent, are applied. Moreover, to obtain the clusters, the k-means algorithm is used. The framework is evaluated on two large labelled data sets, called SQuAD and House of Commons Written Questions. These data sets include ground truth information that is used to distinctly evaluate the effectiveness of the proposed approach. The results on both data sets show that Universal Sentence Encoder (USE) achieves better outcomes than InferSent: the produced clusters match the class labels of the data sets more closely.


Acknowledgements

I would first like to thank my thesis supervisor Hadi Banaee for his valuable support and guidance in the execution of this thesis. He was always there to answer my questions and to steer me in the right direction. His guidance was very valuable and important for the outcomes of this work.

I would also like to thank all the tutors of the Department of Technology for all the knowledge they have provided me with during my journey through the Robotics and Intelligent Systems Master’s programme, and I wish them success in all their future endeavours.

Finally, I would like to thank my parents and my wife for the support that they have been offering me over the years. They are always there for me.


Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Research Questions
  1.4 Contributions
  1.5 Thesis Outline

2 Related Work
  2.1 Textual Similarity and Clustering
    2.1.1 Text Vectorization
    2.1.2 Textual Similarity Measures
  2.2 Sentence/Question similarity
    2.2.1 Term Frequency-Inverse Document Frequency (Tf-idf)
    2.2.2 Word embeddings
    2.2.3 Unsupervised Sentence Embeddings
    2.2.4 Supervised Sentence Embeddings
    2.2.5 Universal Sentence Encoder (USE)
    2.2.6 Text/sentence/question Clustering

3 Method
  3.1 Framework Overview
  3.2 Preprocessing
    3.2.1 Data cleaning
    3.2.2 Sampling
  3.3 Question semantic similarity
    3.3.1 Text Vectorization
    3.3.2 Cosine Similarity
  3.4 Question Clustering
    3.4.1 k-means Algorithm
    3.4.2 Clustering Performance Measures
  3.5 Representative Question extraction

4 Results
  4.1 SQuAD: Stanford Question Answering
    4.1.1 Preprocessing
    4.1.2 Question semantic similarity
    4.1.3 Question Clustering
    4.1.4 Representative Question extraction
  4.2 House of Commons Written Questions
    4.2.1 Preprocessing
    4.2.2 Question semantic similarity
    4.2.3 Question Clustering
    4.2.4 Representative Question extraction
  4.3 Summary of Results

5 Conclusions
  5.1 Summary of Contributions
  5.2 Limitations
  5.3 Future Research Directions

List of Figures

1.1 Obama AMA session
2.1 Bag-of-Words Representation
2.2 Skip-thought model
2.3 Example sentences of the SNLI Corpus
2.4 USE example Heatmap
3.1 Global Approach Diagram
3.2 Framework diagram
3.3 Cosine Similarity Figure
3.4 Similarity matrix USE
3.5 Similarity matrix InferSent
3.6 Heatmap - Toy Example Questions USE
3.7 Heatmap - Toy Example Questions InferSent
3.8 Entire Toy Data Set Heatmap - USE
3.9 Entire Toy Data Set Heatmap - InferSent
3.10 k-means procedure
3.11 Toy data set UMAP plot - USE
3.12 Toy data set UMAP plot - InferSent
4.1 Heatmap of SQuAD - USE
4.2 Heatmap of SQuAD - InferSent
4.3 UMAP plot of SQuAD - USE
4.4 UMAP plot of SQuAD - InferSent
4.5 Heatmap of House of Commons Written Questions - USE
4.6 Heatmap of House of Commons Written Questions - InferSent
4.7 UMAP plot of House of Commons Written Questions - USE
4.8 UMAP plot of House of Commons Written Questions - InferSent

List of Tables

3.1 Labels per cluster index - USE
3.2 Labels per cluster index - InferSent
3.3 Most Representative Questions Toy data set - USE
3.4 Most Representative Questions Toy data set - InferSent
4.1 Number of Questions before and after Sampling - SQuAD
4.2 Labels per cluster index SQuAD - USE
4.3 Labels per cluster index in SQuAD using InferSent
4.4 Most Representative Questions SQuAD data set - USE
4.5 Most Representative Questions SQuAD data set - InferSent
4.6 Number of Questions before and after Sampling - House Of Commons Written Questions
4.7 Labels per cluster index House of Commons Written Questions - USE
4.8 Labels per cluster index in House of Commons Written Questions using InferSent
4.9 Most Representative Questions House of Commons Written Questions data set - USE
4.10 Most Representative Questions House of Commons Written Questions data set - InferSent
4.11 Summary of the Similarity Matrix Results
4.12 Summary of the Clustering Performance Results

List of Algorithms


Chapter 1

Introduction

1.1 Motivation

With the evolution of technology during the last decades, user interaction has become more and more prevalent. In the past, the first stage of the World Wide Web allowed only static content sharing, and most of the published content was not user-generated. Nowadays, however, there exist means that make user interaction fast and easy. It is very common for large groups of people to interact and submit their questions or comments on specific topics of their interest, and there are various use cases where this interaction is observed. Examples of such cases include, but are not limited to, occasions where the audience is asked to submit their questions to a speaker during a conference, or cases where a TV show presenter hosts a guest and the audience is asked to submit their questions to the guest.

One popular example is Reddit's “Ask Me Anything (AMA)” sessions [26], where the host of a session is either a famous person or an ordinary user. In this kind of session, the interview occurs between the host and all the other everyday users of Reddit who want to ask the person any questions. For example, the former president of the United States, Barack Obama, hosted a half-hour AMA session in 2012, where more than 200,000 concurrent users visited the session page and around 22,000 questions were submitted, according to Reddit (see Figure 1.1). The topics of the questions were very general, ranging from finance-related topics to questions like the recipe for the White House's beer [31]. Barack Obama answered just 10 of the total 22,000 questions, most probably the first ones or the ones he found interesting, and not a representative subset of questions that reflects most of them.


Figure 1.1: A screenshot of the Reddit's AMA session, with Barack Obama as the host [31].

Another related example is the summary of the frequently asked questions on a website (i.e., FAQs), in which the total set of questions submitted by the users of the website needs to be summarized into a much smaller subset that includes the most frequently asked questions. These examples show the need for an automated solution to retrieve such representative questions from a large set of information, given that this task is mostly performed manually by website operators.

1.2 Problem Statement

The main challenge that arises is how to automatically and efficiently deal with a large number of submitted written questions in a limited period of time. Ideally, the goal is to retrieve a set of questions that are the most representative of the total submitted questions. However, because the number of questions is usually overwhelming and there is not enough time to read all of them to select the most representative ones, the most common approach is to select some random questions. The random selection approach is naive and does not guarantee that the chosen set of questions properly represents the most prevalent information that the audience would like to ask. Furthermore, as a result of the random approach, smaller groups of similar questions may end up not having a representative question in the final selected set.


The term "representative questions" is used to underline the concept of finding the most relevant questions from the total submitted ones, which can be thought of as the most representative questions. However, finding the actual most representative questions is challenging, since the term is relative and depends on the cognitive perception of each human individually, the generality of the questions, and their complexity. For example, in Obama's AMA session, if the largest portion of the questions asked specifically about the type of wine that is served in the White House, then the representative question should be about the wine of the White House specifically, and not be generalised to a question asking, for example, what the lifestyle in the White House is like. As a result of these challenges, this thesis relies on the input questions themselves to find the most relevant questions as the most representative ones in a data-driven way.

1.3 Research Questions

The overall objective of this thesis is to propose an automated solution that addresses the problem of selecting a set of questions that are the most representative of the total number of questions asked. Therefore, the goal is to find a subset of the large set of questions that best represents it. A step towards accomplishing this objective is to first find the semantically similar questions, which will later allow finding and including the question that best represents them. Thus, one of the research questions focuses on obtaining the semantic similarities between the questions. In general, the first research question is:

i. How to compare the question sentences in order to find semantically similar questions?

After finding the semantic similarities between the questions, a further step is needed to group the related questions into relevant groups to be able to choose the most representative question from each. Thus, another research question focuses on grouping semantically similar questions. In general, the question can be formulated as:

ii. How to group the questions in a way to separate various semantic topics or subjects?

As noted in the problem statement, the number of submitted questions may be overwhelming and there is always the risk of including questions that are not the representative questions of the entire data set. Therefore, another research question focuses on choosing the most representative question from each group and consequently, from the entire set of questions. In general, the final research question is:

iii. How to choose the most representative questions from the grouped information?

1.4 Contributions

The contributions of this thesis are organised as one overall contribution that can be divided into three detailed contributions. Given the objective of this thesis, the main contribution is the development of a general framework that addresses the problem of handling the information of a large set of submitted questions by finding the most representative ones. Formally, the overall contribution is:

Development of a general framework that starts with a large set of asked questions and returns a subset of the most representative ones

In general, the proposed solution works by finding semantic similarities between questions and clustering them together. Then, it selects the most relevant question from each cluster and provides it to the user. This means that the questions selected will be the most relevant from each cluster and thus, in the end, will be the most representative of all the submitted ones.

The framework consists of utilising sentence embeddings for the task of semantic similarity and then applying clustering approaches for the task of grouping relevant questions. In detail, the first two contributions involve assessing two state-of-the-art sentence embedding approaches that are used to find the semantic similarities between the questions and evaluating the clustering performances of each embedding on chosen large labelled data sets. The final contribution involves the process of selecting the most representative question from each cluster and thus, in the end, from the entire set of questions.

The first contribution is:

i. Utilise and analyse the goodness of sentence embedding approaches used to compare the semantic similarity between questions

Two state-of-the-art sentence embedding approaches (i.e., InferSent and USE) are utilised, and their goodness for finding semantic similarities between the questions is evaluated by assessing their similarity matrices.

The second contribution is:

ii. Apply a clustering approach on large labelled data sets, and evaluate the goodness of the outputs according to the ground truth information provided by the data sets

A clustering approach (i.e., k-means) is applied on two large labelled data sets, one consisting of questions posed on a set of Wikipedia articles, and the other of questions from members of the UK parliament to government ministers. Then, for both sentence embedding approaches, the clustering result is evaluated using external clustering performance measures.

The third contribution is:

iii. Choose the representative questions from every cluster

Among all the question members of each cluster, the ones that best represent that cluster are selected to be included in the final subset of the most representative questions of the entire set of asked questions, and are suggested to the end user of the framework.

1.5 Thesis Outline

The outline of this thesis is as follows:

• Chapter 2 presents and discusses the related work on sentence similarity and sentence clustering approaches.

• Chapter 3 presents the proposed framework, including the developed approaches. Each step of the framework is then elaborated using a small labelled data set as a toy example.

• Chapter 4 depicts the applicability of the proposed framework on two real-world data sets. Moreover, this chapter shows the evaluation of the results on such data sets.

• Chapter 5 discusses the conclusions of the work including the summary of the conclusions, the limitations of the work, and the possible directions for the future steps of the research.


Chapter 2

Related Work

2.1 Textual Similarity and Clustering

Textual similarity and text clustering are important subfields of Natural Language Processing (NLP), and are used in a wide range of tasks such as text summarization [2], document clustering [4, 34], and topic extraction [6], among others. The main goal of textual similarity and clustering is to find semantically related texts and group them into clusters [19].

2.1.1 Text Vectorization

In order to perform text clustering, the original text has to be transformed into a numerical representation that can be used as input to clustering algorithms, as these algorithms work on numeric vectors [17]. Although there are many effective clustering algorithms, the quality of the results in grouping related text together is highly dependent on the accuracy of the transformation of the text into the numeric vector representation. Poor text vectorization would make it impossible to accurately find semantically related text, and no clustering algorithm would be able to compensate for that. So, one of the most important but also challenging steps in text clustering is to accurately transform the text into a vector representation.

Feature selection is a major step for any text vectorization; it extracts the proper set of information from the text in order to transform it into numerical vectors [18]. Depending on the application of text similarity, various feature selection approaches might be used. In some applications, it might be important to consider the frequency of appearance of the words as a feature, and in others, the contextual weight of a word might be a feature.

One simple approach to transform a text into a numeric vector is the Bag-of-Words representation, where each term/word is counted according to its frequency of appearance in the text. An example is shown in Figure 2.1.


Figure 2.1: Bag-of-Words Representation - [5]

For the sentence "This day was a good day", the vector that represents it contains zeros in the elements corresponding to words that do not appear in the sentence, and a value of one or more in the elements corresponding to words that appear in the sentence once or more times.
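A minimal sketch of this counting scheme in Python; the fixed vocabulary and the function name are illustrative only (a real vocabulary would be built from the whole corpus):

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Return the count of each vocabulary word in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Illustrative vocabulary covering the example sentence plus one unseen word.
vocab = ["this", "day", "was", "a", "good", "night"]
print(bag_of_words("This day was a good day", vocab))  # [1, 2, 1, 1, 1, 0]
```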

2.1.2 Textual Similarity Measures

There exist two main types of measures to define textual similarity: lexical similarity and semantic similarity. Lexical similarity captures the overlap between the words of two texts, even when the meanings of the texts are not similar, whereas semantic similarity captures relatedness in meaning, even when the words of the texts are completely different [12, 27].

The importance of the feature extraction step in the similarity measures can be intuitively expressed by a simple example. Let's consider the following two sentences “I love reading books in my free time” and “One of my hobbies that I enjoy doing is the activity of reading novels”. It can be easily perceived that although there is almost no word overlap in the two example sentences, the meanings are still very similar. Furthermore, the example above explains why lexical similarity is not enough to define how the texts are related to one another. Moreover, to calculate such similarity between the sentences, the role of feature selection to perform the text vectorization step is thus crucial, in order to correctly align the semantically related parts of the sentences.

2.2 Sentence/Question similarity

Textual similarity, in general, is defined as a similarity between varied length texts, and it may be considered as a superset of short sentence or question similarity [22]. The following sections present the popular sentence similarity approaches that have been used in the area of natural language processing.

2.2.1 Term Frequency-Inverse Document Frequency (Tf-idf)

Tf-idf was introduced by Salton and Buckley [33] as a simple and effective approach to finding similarities between both short and long texts [30]. Tf-idf is a method of creating feature vectors from text, and has been extensively used in various tasks such as document clustering [4].

It works by weighing the words of a sentence by their importance, so that, in the end, the sentence is represented by a vector in which each of its words has an importance value. The importance of each word is calculated as the product of the term frequency (tf) and the inverse document frequency (idf). The tf term is defined as the ratio of the number of times a word appears in a sentence to the number of words in the sentence, whereas the idf term is the log ratio of the total number of sentences to the number of sentences in which the specific word appears [33, 30]. Formally, the tf-idf value of a word is given by the following formula:

\[ \text{tf-idf}_{t,d} = \text{tf}_{t,d} \cdot \log \frac{N}{\text{df}_t}, \]

where tf_{t,d} is the frequency of the word t in sentence d, N is the total number of sentences, and df_t is the number of sentences containing the word. In this way, the tf-idf approach ensures that the unimportant words of a sentence receive lower weights, whereas the important words of the sentence receive higher weights.
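A minimal sketch of this weighting in Python, assuming pre-tokenized sentences and using the ratio-based term frequency described above (the function name and the toy corpus are illustrative):

```python
import math

def tf_idf(term, sentence_tokens, corpus_tokens):
    """tf-idf weight of `term` in one tokenized sentence, given the whole corpus."""
    tf = sentence_tokens.count(term) / len(sentence_tokens)    # term frequency
    df = sum(1 for tokens in corpus_tokens if term in tokens)  # document frequency
    return tf * math.log(len(corpus_tokens) / df)

corpus = [["how", "do", "i", "use", "an", "api"],
          ["can", "i", "use", "an", "api"],
          ["which", "is", "the", "best", "video", "editor"]]
print(tf_idf("api", corpus[0], corpus))  # "api" appears in 2 of the 3 sentences
```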

Although the tf-idf approach is considered to be effective and also easy to implement, it has certain limitations. It only captures lexical similarity and not semantic similarity [30], which would mean that the words “football” and “soccer”, for example, would be two completely unrelated words for it.

2.2.2 Word embeddings

The main idea of using word embeddings is that any given word can be associated with a numeric vector in such a way that semantically related words have similar vector representations, i.e., their cosine similarity (a measure of similarity between two vectors) is high.

One of the most popular word embedding approaches is the Word2vec embedding presented by Mikolov et al. [24]. Word2vec uses Neural Networks to learn word embeddings. The model is usually trained on large text corpora with a huge vocabulary, and it learns the context of the words that appear in it. When words appear in the same context, the similarity of those words is increased. For example, the words “Puerto” and “Rico” appear close to each other more frequently than the words “Puerto” and “piano”. That means the words “Puerto” and “Rico” are semantically closer to each other than the words “Puerto” and “piano”. Furthermore, the authors have demonstrated that operations like addition and subtraction between word vectors produce meaningful results. For example, the vector operation “king” - “man” + “woman” would yield a vector close to the one that represents the word “queen”.

Word embedding approaches are developed on individual words in order to transform them into vectors, not on whole sentences. However, several approaches have been presented that use word embeddings to transform whole sentences into vectors. One way to represent a whole sentence as a vector has been presented by Chinea-Rios et al. [8]. The authors have used the vector representation of each word in a given sentence and then summed the word vectors to get the vector representation of the whole sentence. This method has used the sentence vector representations for the task of sentence clustering.
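A minimal sketch of this summation, assuming a `word_vectors` lookup mapping each word to a numpy array (for example a dict or a gensim KeyedVectors object); the names and the fallback dimension are illustrative:

```python
import numpy as np

def sentence_vector(sentence, word_vectors, dim=300):
    """Sum the word embeddings of the sentence's words (Chinea-Rios et al. style)."""
    vectors = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    if not vectors:
        return np.zeros(dim)  # fallback when no word is in the vocabulary
    return np.sum(vectors, axis=0)
```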

The limitation of this approach, though, is that the order of words in a sentence is not taken into consideration and is lost by simply summing up the word vectors. This limitation can decrease the accuracy of the generated sentence-vector representations, which may lead to inaccurate semantic similarity of sentences.

2.2.3 Unsupervised Sentence Embeddings

Unsupervised sentence embeddings are an extension of word embeddings to short texts (i.e., sentences) instead of individual words. Similar to word embeddings, in sentence embeddings any given sentence can be associated with a vector representation. However, instead of predicting the surrounding words of a given word, unsupervised sentence embeddings predict the surrounding sentences of a given sentence [16]. Sentences that appear in the same context usually have similar meanings. Kiros et al. [16] have presented a method of learning sentence embeddings called Skip-thought vectors, which follows this idea. In this method, a model is trained to reconstruct the surrounding sentences of a given target sentence, as shown in Figure 2.2.

Given the target sentence “I could see the cat on the steps”, the model predicts its surrounding sentences. One sentence that precedes the target sentence can be “I got back home”, whereas another sentence that follows the target sentence can be “This was strange”.

Figure 2.2: Skip-thought model - Kiros et al. [16]

There has been an advancement of the Skip-thought vectors called Quick-thought vectors, presented by Logeswaran and Lee [20]. In Quick-thought, instead of trying to reconstruct the preceding and succeeding sentences, the model tries to predict them by choosing among possible candidate sentences. In that way, the decoder of Skip-thought is replaced by a classifier.

2.2.4 Supervised Sentence Embeddings

As opposed to the unsupervised sentence embedding approaches, supervised sentence embeddings require labelled data to be trained on in order to learn sentence embeddings. The training data used are annotated for specific tasks like the Natural Language Inference task [9]. Ideally, the annotated data that the model is trained on should be suitable enough for generating accurate sentence vectors.

A famous supervised sentence embedding approach is InferSent, presented by Conneau et al. [9] in 2017. InferSent is trained on the Stanford Natural Language Inference (SNLI) Corpus, which consists of pairs of short text sentences manually labelled as “neutral”, “contradiction”, and “entailment”. In Figure 2.3, some example sentences that appear in the SNLI Corpus are shown along with their labels.

Figure 2.3: Example sentences of the SNLI Corpus

For every pair of sentences, InferSent uses a sentence encoder that generates sentence embeddings from word vectors and then trains a classifier using the generated sentence embeddings. The authors have demonstrated that the model trained on the Natural Language Inference task generates sentence embeddings that generalize to a wide range of NLP tasks and outperform unsupervised methods such as Skip-thoughts.

2.2.5 Universal Sentence Encoder (USE)

To find sentence embeddings that could perform well on general NLP tasks, one idea is to combine several training objectives together to train the sentence embedding and produce sentence vectors.

Universal Sentence Encoder (USE), presented by Cer et al. [7], follows the approach of combining several training tasks in one training scheme. Thus, USE is trained on a variety of NLP tasks such as Natural Language Inference, but at the same time it applies the unsupervised approach used in Skip-thoughts. In this way, USE manages to generate sentence embeddings that perform well on general NLP tasks such as sentiment classification and textual similarity. The authors have shown that USE can perform well in finding similarities between text sentences, which means that it can be successfully used for sentence clustering tasks.

Figure 2.4: USE example Heatmap

Figure 2.4 shows a heatmap of the sentence similarities inferred from the USE sentence embeddings. In this heatmap, sentences 1 to 6 are shown on the x and y axes. The darker areas in the heatmap show higher textual similarity between the paired sentences, while the lighter areas show lower similarity. For instance, based on the similarity values from the heatmap, the sentences “How old are you?” and “what is your age?” are more similar or related than the sentences “How old are you?” and “Will it snow tomorrow?”.

Furthermore, USE offers two alternative encoding models, namely the Transformer encoding model and the Deep Averaging Network (DAN) encoding model. The Transformer encoder is highly accurate but requires more computational and memory resources, whereas the DAN encoder is computationally less expensive but comes with less accuracy compared to the Transformer model.

2.2.6 Text/sentence/question Clustering

The main goal of text clustering is to group similar textual objects into clusters [1]. The textual objects may be words, sentences or even longer texts.

Sentence Clustering using Word2vec

A sentence clustering approach has been presented by Chinea-Rios et al. [8]. The authors have used word embeddings, and more specifically Word2vec, to construct sentence vector representations. This approach first splits the sentence into words and then generates vector representations of the words using Word2vec. Then, the word vectors of the sentence are summed up in order to construct a sentence vector representation. Finally, the approach uses the k-means clustering algorithm to cluster semantically related sentences.

The authors have used labelled data that consist of sentences grouped by a common characteristic which was chosen to be the domain of the sentence. Thus, they have used four different corpora belonging to different domains. The sampling of sentences involved 2500 sentences randomly selected from each corpus.

After performing the k-means clustering algorithm, the method uses the labelled data to evaluate how accurately the original labels are recovered.

Q&A Question Clustering

A question clustering approach targeted at Question-and-answer (Q&A) tasks has been presented by Paranjpe [27]. The main idea of the approach is to find similar questions by topic in the corpus and then to create internal clusters based on the lexical and semantic relatedness of those questions with similar topics.

For that purpose, the author first used part-of-speech (POS) tagging to extract the main topic of a question. Then, the other questions in the corpus that had the same overlapping topic were retrieved to form a base set of questions similar to that question. Later, the remaining words of the base questions that were neither stop words nor main topic words were extracted and checked for lexical and semantic relatedness using the WordNet synonym detector. Furthermore, a classifier was used to classify the questions by question type (such as abbreviation, entity, description, location, number, human) to examine whether the questions were paraphrases of each other. Finally, a lexical similarity metric was used to measure the similarity between the questions.

According to the author, the clustering technique was shown to be effective in 90% of the cases.

Semantic clustering of questions

Another question clustering approach, presented by Mocanu [25], has used hierarchical clustering to cluster semantically related questions.

For this purpose, a text preprocessing step was conducted first, where Stanford CoreNLP was used to extract the relevant features of the questions. The questions were first compared on how lexically close they are. If the result was below a predefined threshold, the questions were also examined for semantic relatedness using the WordNet synonym detector. This approach has been tested on a small corpus and the results were adequately positive according to the author.

A limitation of the approach, though, is the speed of execution, as finding semantic relatedness between words using the WordNet synonym detector is an expensive operation.


Chapter 3

Method

The proposed approach aims to find semantically related questions and group them into clusters. As indicated by the objective of this research, the target data set is a large set of unlabelled questions that needs to be summarised into a much smaller number of representative questions. The proposed approach, in general, involves the steps of obtaining the semantic similarities between the questions and grouping them into clusters, from which the representative questions are then chosen. The steps are depicted in Figure 3.1.

Figure 3.1: Global Approach Diagram

3.1 Framework Overview

To better assess the results of the suggested approach and retrieve meaningful outcomes, the data sets that were chosen for the implementation are labelled, and therefore provide us with ground truth information that may be used to distinctly evaluate the effectiveness of the approach. A brief introduction of the overall approach is presented below and it is depicted in Figure 3.2:

1. A labelled data set is imported and the label-question pairs are extracted from it.

2. A preprocessing step is performed, where a data cleaning step is conducted in which possibly noisy question text is transformed. Furthermore, sampling from the data set takes place, where a subset of the label-question pairs is selected.

3. For the questions remaining after the preprocessing step, the semantic similarity between the questions is obtained. This step involves question vectorization using sentence embeddings, and similarity assessment using metrics such as the cosine similarity metric.

4. A clustering algorithm is used to group the resulting vectorized sentences into clusters.

5. The most representative question is extracted from every cluster and it is provided to the user.

Figure 3.2: Framework diagram

In order to observe the results of the method steps closely, a toy data set will be used. The toy data set contains 50 questions in total and it is separated into 10 classes of 5 questions each. It consists of customer support related questions and it is a subset of the Quora Question Pair data set [29]. The toy data set is publicly available on [14] and it is separated by the classes of 1) Missing Password, 2) Profile Image, 3) Credit card setup, 4) Video steps, 5) Python upload steps, 6) Search for issues, 7) Query metadata, 8) Update account, 9) Account delete, 10) API info. For example, the questions “How do you disable an account?”, “Can I recover a deleted account?” and “Do I lose all my data when I delete my account?” are included in the data set and are members of the “Account delete” class.

3.2 Preprocessing

Preprocessing is an important step that precedes question vectorization. In the preprocessing step the noisy questions are transformed and a subset of the data set is chosen.

3.2.1 Data cleaning

Due to the fact that the written questions are generally submitted by people, there is always a risk of including noisy text such as special characters or multiple white-space characters, due to human error. Thus, a data cleaning step is conducted so as to remove noisy text from the questions. The data cleaning steps are presented below:

1. At first, possible HTML tags present in the text are removed. For example, if a question contains an image HTML tag “<img>”, it will be removed.

2. All the characters of a question are converted to lowercase.

3. Extra newline and white-space characters are removed.

4. Accented characters, special characters, and digits are removed.

5. Contractions are expanded. For example, “I'm” will be replaced by “I am”.

As a general example, the question “Should I be at the !# OFFICE before <strong> noon </strong> or after?” will be transformed to “should i be at the office before noon or after”.
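A minimal sketch of these cleaning steps in Python; the `contractions` and `unidecode` packages are assumed to be available, and the regular expressions are illustrative rather than the exact thesis implementation:

```python
import re
import contractions              # assumed package for expanding contractions
from unidecode import unidecode  # assumed package for stripping accented characters

def clean_question(text):
    text = re.sub(r"<[^>]+>", " ", text)      # 1. remove HTML tags
    text = text.lower()                       # 2. convert to lowercase
    text = contractions.fix(text)             # 5. expand contractions ("i'm" -> "i am")
    text = unidecode(text)                    # 4. strip accented characters
    text = re.sub(r"[^a-z\s]", " ", text)     # 4. remove special characters and digits
    return re.sub(r"\s+", " ", text).strip()  # 3. collapse newlines and extra spaces

print(clean_question("Should I be at the !# OFFICE before <strong> noon </strong> or after?"))
# -> should i be at the office before noon or after
```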

3.2.2 Sampling

After data cleaning, a further preprocessing step is performed and sampling from the data set is conducted. The sampling steps are presented below:

1. At first, the top N classes that contain the highest number of questions are selected.

2. For every class, a scaling-down step takes place, where the number of questions to be selected from it is a function of the percentage of the questions it contains relative to the total number of questions in the data set, and of a maximum number of questions set according to the model's characteristics and hardware limitations. For example, a data set that contains 1000 questions, of which 70% belong to class 1 and 30% to class 2, would initially contain 700 questions of class 1 and 300 questions of class 2. After the scaling-down step, if the maximum number of questions is set to 500, the resulting number of questions would be 350 for class 1 and 150 for class 2 (a minimal sketch of this step is given after the list).

3. Finally, after the calculation of the number of questions to be selected from each class, the actual questions are chosen according to their relevance to the other questions of the same class. Therefore, the N most relevant questions of each class are selected, and thus the outlier questions are removed. To achieve that, the Universal Sentence Encoder (USE) approach is used to vectorize the questions of each class and a similarity matrix is created. The similarity matrix is constructed using the cosine similarity metric, which measures how close every pair of question vectors is. Using the similarity matrix, the top N questions that are the most similar to all the other questions of the same class are selected. The similarity matrix and the cosine similarity metric will be discussed in further detail in section 3.3.
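A minimal sketch of the scaling-down step described in item 2; the function name and the cap value are illustrative, and the outlier-removal part of item 3 would follow separately using the USE similarity matrix:

```python
def scaled_class_sizes(class_counts, max_total):
    """Scale the number of questions kept per class proportionally to the
    class sizes, so that the overall total does not exceed max_total."""
    total = sum(class_counts.values())
    if total <= max_total:
        return dict(class_counts)
    return {label: round(count / total * max_total)
            for label, count in class_counts.items()}

# Example from item 2: 1000 questions, split 70% / 30%, capped at 500 questions.
print(scaled_class_sizes({"class 1": 700, "class 2": 300}, 500))
# -> {'class 1': 350, 'class 2': 150}
```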

It should be noted that the sampling step is not necessary for all cases and scenarios. In real-world applications of the proposed framework, there are no class labels for the questions and thus the sampling step is not applicable. Moreover, if the data set's size and complexity allow for fast and feasible computations in a reasonable amount of time, then there is no need to perform this step. Finally, the sampling step needs further investigation on how to detect the outlier questions and remove them carefully in order to avoid any bias that could influence the results. However, as mentioned before, the sampling step is not applicable on unlabelled data sets and thus the proposed approach will work on the entire set of questions without any prior knowledge of possible outlier questions.

3.3 Question semantic similarity

As discussed in Chapter 2, the two main measures to define textual similarity are the lexical similarity and the semantic similarity. The vectorization step where the questions are transformed into numeric vectors should be accurate in order to effectively capture the semantic similarities between the questions.

3.3.1 Text Vectorization

For the task of question vectorization, two sentence embedding approaches were applied: Universal Sentence Encoder (USE) and InferSent. USE creates a vector representation of 512 elements, while InferSent creates a vector of 4096 elements. The resulting vectorized representations of the following three example questions from the toy data set would be:

USE

1. How do I use API from your website? - [-0.00992286, -0.0392494, ...]
2. Can I use an API? - [0.0204898, -0.0575881, ...]
3. Which is best free video editing software? - [0.0612447, -0.00477671, ...]

InferSent

1. How do I use API from your website? - [0.00746889, -0.0284283, ...]
2. Can I use an API? - [0.00746889, -0.0304202, ...]
3. Which is best free video editing software? - [0.00746889, -0.0598585, ...]
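A minimal sketch of the USE vectorization via TensorFlow Hub (the model URL is the publicly hosted encoder; InferSent is omitted here because it requires downloading its pretrained model and word vectors separately):

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder; it maps each sentence to a 512-d vector.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

questions = [
    "How do I use API from your website?",
    "Can I use an API?",
    "Which is best free video editing software?",
]
vectors = np.array(embed(questions))
print(vectors.shape)  # (3, 512)
```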

3.3.2 Cosine Similarity

Cosine similarity is a measure of similarity between non-zero numeric vectors. The more similar a pair of vectors is, the higher will be the value of the cosine similarity metric, which is bounded between [0,1] [15]. Formally, the cosine similarity is given by the following formula:

\[ \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \cdot \lVert y \rVert}, \]

where x, y is the pair of vectors measured for similarity. Cosine similarity can be used to measure the semantic similarity between words, short sentences, and varied-length texts, or items in general, as shown in Figure 3.3.
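A minimal sketch of this metric on the question vectors produced above; normalising all rows at once also yields the full pairwise matrix used in the next part (variable and function names are illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)"""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def cosine_similarity_matrix(vectors):
    """n x n matrix whose (i, j) entry is the cosine similarity of questions i and j."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

sim = cosine_similarity_matrix(vectors)          # `vectors` from the embedding step
print(cosine_similarity(vectors[0], vectors[1]))
```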


Thus, once the questions were vectorized, cosine similarity was used to define how closely related the questions are to each other.

For the example questions of the toy data set, the cosine similarity value between the questions “How do I use API from your website?” and “Can I use an API?” would be 0.786158 for USE and 0.768903 for InferSent, while for the questions “How do I use API from your website?” and “Which is best free video editing software?” it would be 0.419831 for USE and 0.425305 for InferSent.

Similarity Matrix

Using the cosine similarity metric, a similarity matrix was constructed. The similarity matrix is an n x n matrix, where n is the total number of available questions. Each element of the similarity matrix corresponds to the cosine similarity value of a specific pair of questions. For the example questions of the toy data set, the similarity matrices for USE and InferSent are presented in Figure 3.4 and Figure 3.5 respectively.

Figure 3.4: Similarity matrix USE


Figure 3.5: Similarity matrix InferSent

In order to visualize the similarity matrices, the heatmap representation was chosen, where the darker areas of the heatmap depict high semantic similarity values between pairs of questions, whereas the lighter areas of the heatmap represent low question semantic similarity. In Figures 3.6 and 3.7, the heatmaps of USE and InferSent for the example questions of the toy data set are presented.

Figure 3.6: Heatmap - Toy Example Questions USE - 0: How do I use API from your website? 1: Can I use an API? 2: Which is best free video editing software?


Figure 3.7: Heatmap - Toy Example Questions InferSent - 0: How do I use API from your website? 1: Can I use an API? 2: Which is best free video editing software?

The heatmap representations of the entire toy example data set for USE and InferSent are shown in Figure 3.8 and Figure 3.9 respectively.

Figure 3.8: Entire Toy Data Set Heatmap - USE

Figure 3.9: Entire Toy Data Set Heatmap - InferSent

The similarity matrices of both USE and InferSent were assessed for their goodness in order to evaluate their accuracy in capturing the semantic similarities of the questions. For that purpose, min-max normalisation was applied to each row of the similarity matrix. Then, the average sum of the similarity values between the members of each class was obtained, and similarly, the average sum of the similarity values between the members of each class and the non-class members was calculated. The ratio of the total average sum of the similarity values of the class members to the total average sum of the similarity values of the non-class members was used to measure the goodness of the similarity matrices, and thus the accuracy of the sentence embedding approaches. The method of measuring the goodness of the similarity matrices is presented in the following equation:

\[ \text{measure} = \frac{\sum_{c} \frac{\sum M_c}{n_c}}{\sum_{c} \frac{\sum N_c}{k_c}}, \]

where M_c are the similarity values of the class members of class c, n_c is the number of members of class c, N_c are the similarity values of the non-class members of class c, and k_c is the number of non-class members of class c. The results of the similarity matrix measures of the toy data set for USE and InferSent are 2.599 and 2.122, respectively.

This shows that USE achieves better results in the task of obtaining the semantic similarities between the questions that are members of the same classes, compared to InferSent.
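A minimal sketch of this measure under our reading of the description above (row-wise min-max normalisation, then the ratio of the summed per-class average within-class similarity to the summed per-class average similarity with non-members); the exact aggregation is an assumption:

```python
import numpy as np

def similarity_matrix_goodness(sim, labels):
    sim = np.asarray(sim, dtype=float).copy()
    # Min-max normalise each row of the similarity matrix.
    row_min = sim.min(axis=1, keepdims=True)
    row_max = sim.max(axis=1, keepdims=True)
    sim = (sim - row_min) / (row_max - row_min)

    labels = np.asarray(labels)
    within, between = 0.0, 0.0
    for c in np.unique(labels):
        members = labels == c
        within += sim[np.ix_(members, members)].mean()    # class members
        between += sim[np.ix_(members, ~members)].mean()  # non-class members
    return within / between
```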

3.4 Question Clustering

The main idea behind clustering is to separate the samples of a given data set into groups called clusters. For question clustering, the questions that are semantically similar to each other should belong to the same cluster. There are various clustering algorithms to choose from, and their performance on a clustering task depends on the data. Some examples are hierarchical clustering, Gaussian Mixture Models (GMM), DBSCAN, and k-means, among others. Based on initial results with different clustering algorithms, k-means was selected for the proposed framework, since applying the DBSCAN algorithm, which is based on the idea of connecting neighbouring samples, led to an output that hardly determined a reasonable clustering result on the data sets.

3.4.1 k-means Algorithm

The k-means algorithm, presented by MacQueen [21], is a widely used approach in clustering tasks such as text summarization [35], document clustering [4] and topic extraction [36], among others.

In the k-means algorithm, the k parameter has to be decided beforehand, and it refers to the number of clusters the user wants to observe in the data. The algorithm's steps are shown in Figure 3.10 and are briefly described below.

The algorithm works by placing k centroids at random locations between the samples (b). For each sample, a distance metric is calculated between the sample and every centroid, and the sample is assigned to the cluster of its closest centroid (c). Then, for every cluster, the mean of all the samples belonging to it is calculated and the centroid is placed on the calculated mean location (d). Then, the assigning procedure repeats and all the samples are re-assigned to clusters according to their distance to each cluster centroid, and a new mean location for every centroid is calculated where the cluster centroids are then placed (e). The above procedure repeats until no data point gets reassigned to a different cluster (f).
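A minimal sketch of this clustering step with scikit-learn, applied to the question vectors; k is set to the number of classes of the toy data set and the parameter values are illustrative:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(vectors)  # `vectors` from the embedding step
centroids = kmeans.cluster_centers_        # one centroid per cluster
```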


Figure 3.10: k-means procedure [28].

As the proposed approach aims to work on unlabelled data sets, the k parameter should be decided by the user, or it may be determined by approaches that find the optimal value of k, such as the finding knee [13] and TURN [11] elbow methods. However, as the data sets used for the implementation were labelled, the k parameter was chosen to be equal to the number of classes of each data set. Furthermore, as the USE and InferSent approaches transform the textual questions into high-dimensional numeric vectors, the dimensionality reduction approach called UMAP, presented by McInnes et al. [23], was used, allowing for visualisation of the result of the k-means clustering algorithm in a two-dimensional plot. In Figure 3.11 and Figure 3.12, the two-dimensional plots that illustrate the clustering result on the toy data set are presented for USE and InferSent respectively. Furthermore, Table 3.1 and Table 3.2 depict the number of samples of every class label contained in each cluster for USE and InferSent respectively.
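A minimal sketch of this visualisation with the umap-learn package and matplotlib; the parameter values are illustrative:

```python
import matplotlib.pyplot as plt
import umap

# Project the high-dimensional question vectors down to two dimensions.
embedding_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(vectors)

plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], c=cluster_ids, cmap="tab10", s=15)
plt.title("k-means clusters of question embeddings (UMAP projection)")
plt.show()
```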


Figure 3.11: Toy data set UMAP plot - USE

Figure 3.12: Toy data set UMAP plot - InferSent

Class Label | Questions per cluster (clusters indexed 0-9)
Video Steps | 5
Search for Issues | 4, 1
Query Metadata | 5
Update Account | 5
Python Upload Steps | 5
Profile Image | 1, 4
Missing Password | 5
Account Delete | 1, 4
Credit Card Setup | 2, 3
API Info | 5

Table 3.1: Labels per cluster index - USE

Class Label | Questions per cluster (clusters indexed 0-9)
Video Steps | 5
Search for Issues | 5
Query Metadata | 1, 4
Update Account | 5
Python Upload Steps | 5
Profile Image | 2, 2, 1
Missing Password | 1, 1, 2, 1
Account Delete | 4, 1
Credit Card Setup | 4, 1
API Info | 5

Table 3.2: Labels per cluster index - InferSent

From the presented plots it can be perceived that the clustering using the USE embeddings led to denser clusters compared to the clustering using the InferSent embeddings. Furthermore, by inspecting the clustering plot of USE, it can be observed that clusters 7 and 5 are close to each other, and thus it can be deduced that these clusters are more semantically similar to each other than, for example, clusters 7 and 8.

3.4.2 Clustering Performance Measures

There are various metrics that can be used to measure the clustering performance, and they are split into two categories: internal measures and external measures. The external measures are used when there is ground-truth information that would impose how the clustering structure should look. Some examples of external measures are F-measure, Normalized Mutual Information (NMI), Entropy, and Rand index (RI) [3]. On the other hand, the internal measures are used when there is no ground truth information about the clusters, and the clustering performance is measured according to the resulting cluster structures. Common examples of internal measures for clustering approaches are the Silhouette index, Dunn index, and Davies-Bouldin index [32]. The internal measures are usually used to compare the goodness of different clustering approaches, regardless of the similarity measure that is used.

In this work, since the evaluation of the k-means clustering approach depends on the goodness of the embeddings, the internal measures will not assess the result. However, using the labelled data sets, the external measures can be applied to evaluate the different semantic sentence embeddings through clustering. For the performance evaluation of the clustering using USE and InferSent embeddings, two external measures are considered in this framework: Normalized Mutual Information (NMI) and Rand index (RI). RI is based on counting pairs, wherein it is calculated according to the number of pairs of items belonging to the same or different cluster and class [3]. Formally, the RI is given by the following formula:

\[ RI = \frac{SS + DD}{SS + SD + DS + DD}, \]

where SS is the number of pairs of members that belong to the same cluster and have the same label, DD the number of pairs that belong to different clusters and have different labels, DS the number of pairs that belong to different clusters and have the same label, and SD the number of pairs that belong to the same cluster and have different labels [3].

NMI takes into account the mutual information between the clustering result and the actual class labels, and the entropy of each clustering results set and the class labels set. Formally, the NMI is given by the following formula:

\[ NMI = \frac{I(X, Y)}{\sqrt{H(X) \cdot H(Y)}}, \]

where I(X, Y) is the mutual information between the clustering result and the actual class labels, and H(X) and H(Y) are the entropies of the clustering result set and the class label set, respectively.
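A minimal sketch of computing both external measures with scikit-learn; the geometric averaging option matches the NMI formula above, and `true_labels` stands for the class labels of the data set:

```python
from sklearn.metrics import normalized_mutual_info_score, rand_score

nmi = normalized_mutual_info_score(true_labels, cluster_ids, average_method="geometric")
ri = rand_score(true_labels, cluster_ids)
print(f"NMI = {nmi:.3f}, RI = {ri:.3f}")
```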

For the toy data set, the NMI values for the clustering results using the USE and InferSent embeddings were 0.838 and 0.673, respectively. Likewise, the RI values for the clustering results using the USE and InferSent embeddings were 0.700 and 0.383, respectively. As seen by these values, the clustering results using USE embeddings match the ground truth information better than those using InferSent. The reason for the difference could be the increased accuracy of USE in obtaining the semantic similarities between questions that are members of the same class, compared to InferSent, which is also supported by the similarity matrix measure results.

3.5 Representative Question extraction

The end goal of the proposed approach is to find the most representative questions of the data set and provide them to the user. The question closest to the centroid of each cluster is the most relevant question of that cluster in a data-driven way, and can thus be thought of as an approximation of the most representative question of each cluster. Therefore, the question closest to the centroid of each cluster is retrieved and provided to the user. The most representative questions of the toy data set for USE and InferSent are shown in Table 3.3 and Table 3.4, respectively.
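A minimal sketch of this extraction step, picking the question whose vector lies closest (in Euclidean distance, the metric k-means optimises) to each cluster centroid; the function and variable names are illustrative:

```python
import numpy as np

def representative_questions(questions, vectors, cluster_ids, centroids):
    """Return, for each cluster, the question closest to the cluster centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        member_idx = np.where(cluster_ids == c)[0]
        distances = np.linalg.norm(vectors[member_idx] - centroid, axis=1)
        reps[c] = questions[member_idx[np.argmin(distances)]]
    return reps
```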

Cluster Index | Rep. Question | Class Label of Rep. Question
0 | is there an api available? | API Info
1 | how can i seach for open issues? | Search for Issues
2 | how do i update my account? | Update Account
3 | can i search the metadata for a query? | Query Metadata
4 | can i upload data with a python script? | Python Upload Steps
5 | how can i edits videos? | Video Steps
6 | can i recover a deleted account? | Account Delete
7 | how do i add an image? | Profile Image
8 | how can i reset my password? | Missing Password
9 | can i signup without a credit card? | Credit Card Setup

Table 3.3: Most Representative Questions Toy data set - USE

Cluster Index | Rep. Question | Class Label of Rep. Question
0 | can i change my profile image? | Profile Image
1 | can i use an api? | API Info
2 | how do you delete an account? | Account Delete
3 | can i search the metadata for a query? | Query Metadata
4 | how do i recover my password when im not receiving any recovery email? | Missing Password
5 | how can i find open issues? | Search for Issues
6 | can i signup without a credit card? | Credit Card Setup
7 | how do i update my account? | Update Account
8 | can i update my videos before uploading? | Video Steps
9 | can i upload data with a python script? | Python Upload Steps

Table 3.4: Most Representative Questions Toy data set - InferSent

From the results presented in Tables 3.3 and 3.4 it can be observed that both USE and InferSent produced a subset of most representative questions that includes a representative question from every class of the entire data set. Furthermore, it is important to note that, by observing the representative question of the "Credit Card Setup" class, one might have expected it to ask about how to add a credit card to an account; however, considering all five members of that class, they are more related to the question about the possibility of signing up without a credit card.


Chapter 4

Results

Following the methods of Chapter 3, the results of the same approaches were observed on two large data sets. Both data sets are publicly available and include a large number of label-question pairs.

4.1 SQuAD: Stanford Question Answering

SQuAD is a reading comprehension data set that consists of questions posed by crowd workers on a set of Wikipedia articles. The SQuAD data set contains around 150,000 questions separated into 35 classes. The number of questions per class is not the same for each class. For example, the questions “What percentage of global assets does the richest 1% of people have?”, “What do the three richest people in the world posses more of than the lowest 48 nations together?”, and “Why are there more poor people in the United States and Europe than China?” are members of the “Economic_inequality” class.

4.1.1 Preprocessing

As discussed in Chapter 3, preprocessing is an important step that involves transforming noisy questions and selecting a subset of the data set by sampling from each class, thus removing the outlier questions of each class.

Data cleaning

The data cleaning step discussed in Chapter 3 was performed on the SQuAD data set and all possibly noisy questions were transformed. For example, the question “Why are there more poor people in the United States and Europe than China?” was transformed to “why are there more poor people in the united states and europe than china”.


Sampling

The sampling step, where the N most relevant questions of every class are selected in order to remove outliers, was performed and led to the selection of a subset of the data set. The maximum number of questions to be selected from the whole data set was set to 4000. Table 4.1 presents the number of questions per class initially contained in the data set and the resulting number of questions after the sampling step.

Class | Before Sampling | After Sampling
Queen_Victoria | 883 | 519
New York City | 817 | 480
American Idol | 790 | 464
Beyonce | 753 | 442
Frederic_Chopin | 697 | 410
Buddhism | 610 | 358
Pharmaceutical_industry | 586 | 344
New_Haven,_Connecticut | 582 | 342
Premier_League | 551 | 324
Hunting | 531 | 312

Table 4.1: Number of Questions before and after Sampling - SQuAD

4.1.2 Question semantic similarity

The semantic similarity of the remaining questions was then obtained by the approaches discussed in Chapter 3. Universal Sentence Encoder (USE) and InferSent were used for text vectorization, and the cosine similarity metric was used to construct the similarity matrices and then the heatmap visualizations. In Figure 4.1 and Figure 4.2, the USE and InferSent heatmaps are presented respectively. The results of the similarity matrix measures for USE and InferSent are 2.068 and 1.405, respectively.

This shows that USE achieves better results in the task of obtaining the se-mantic similarities between the questions that are members of the same classes compared to InferSent.
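A minimal sketch of how the USE embeddings and the pairwise cosine similarity matrix behind such a heatmap can be produced is shown below; the TensorFlow Hub URL and the plotting call are standard choices and not necessarily those used in chapter 3 (InferSent would be applied analogously through its published encoder).

```python
import numpy as np
import tensorflow_hub as hub
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# Universal Sentence Encoder from TensorFlow Hub (512-dimensional sentence vectors)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

questions = [
    "who was princess victoria widowed from",
    "when did victoria become queen upon william ivs death",
    "what drug effectively treated tuberculosis",
]
vectors = np.array(embed(questions))     # shape: (n_questions, 512)
sim_matrix = cosine_similarity(vectors)  # pairwise cosine similarities between questions

plt.imshow(sim_matrix, cmap="viridis")   # heatmap of the similarity matrix, as in Figure 4.1
plt.colorbar()
plt.show()
```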


Figure 4.1: Heatmap of SQuAD - USE


4.1.3 Question Clustering

As discussed in chapter 3, the k-means clustering algorithm was used to cluster the questions according to their semantic similarity. Using the UMAP dimensionality reduction method, the results of the k-means clustering are visualized in a two-dimensional plot for both USE and InferSent in Figure 4.3 and Figure 4.4. Furthermore, Table 4.2 and Table 4.3 show the number of samples of every class label contained in each cluster for USE and InferSent respectively.

Figure 4.3: UMAP plot of SQuAD - USE
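A minimal sketch of this step, assuming k is set to the number of class labels (10) and using default UMAP settings purely for visualisation, could look as follows; placeholder data stands in for the real embeddings so the sketch runs on its own.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# In practice `vectors` is the (n_questions, dim) USE embedding matrix from section 4.1.2;
# random data is used here only so that the sketch is self-contained.
vectors = np.random.rand(200, 512)

kmeans = KMeans(n_clusters=10, random_state=0)   # one cluster per class label
cluster_ids = kmeans.fit_predict(vectors)

# Two-dimensional projection used only for visualisation, as in Figure 4.3
points_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(vectors)
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=cluster_ids, s=5, cmap="tab10")
plt.show()
```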


Cluster Index 0 1 2 3 4 5 6 7 8 9
Class Label
Queen_Victoria 519
New_York_City 1 206 1 3 269
American_Idol 20 444
Beyonce 442
New_Haven,_Connecticut 410
Frederic_Chopin 358
Buddhism 344
Pharmaceutical_industry 31 1 1 309
Premier_League 324
Hunting 1 1 309 1

Table 4.2: Labels per cluster index SQuAD - USE

Cluster Index 0 1 2 3 4 5 6 7 8 9
Class Label
Queen_Victoria 1 43 10 2 1 384 78
New_York_City 6 20 5 422 13 14
American_Idol 124 286 2 5 1 6 3 37
Beyonce 45 1 344 5 1 4 42
New_Haven,_Connecticut 191 1 2 15 1 2 30 168
Frederic_Chopin 23 21 6 2 224 82
Buddhism 53 2 196 1 2 90
Pharmaceutical_industry 262 7 4 61 4 4
Premier_League 17 3 292 12
Hunting 4 21 6 252 6 4 19

Table 4.3: Labels per cluster index in SQuAD using InferSent

Clustering Performance Measures

The NMI performance measure was found to be 0.929 for USE and 0.619 for InferSent, and the RI performance measure was found to be 0.879 for USE and 0.515 for InferSent. These values show that, for the SQuAD data set, the USE embeddings lead to better clustering with respect to the class labels on both performance measures. A likely reason is that USE captures the semantic similarities between questions of the same class more accurately than InferSent; this interpretation is also supported by the similarity matrix measure results.
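Both measures can be computed directly from the class labels and the cluster assignments, for example with scikit-learn; whether the reported RI is the plain or the adjusted Rand Index is defined in chapter 3, so the adjusted variant below is only an assumption.

```python
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# class_labels: ground-truth class of each question, cluster_ids: k-means assignment
# (toy values shown here in place of the real SQuAD labels and clusters)
class_labels = ["Beyonce", "Beyonce", "Buddhism", "Buddhism", "Hunting", "Hunting"]
cluster_ids = [0, 0, 1, 1, 2, 1]

nmi = normalized_mutual_info_score(class_labels, cluster_ids)
ri = adjusted_rand_score(class_labels, cluster_ids)  # assumed RI variant; see chapter 3
print(f"NMI = {nmi:.3f}, RI = {ri:.3f}")
```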


4.1.4 Representative Question extraction

The closest question to the centroid of each cluster was chosen as the most representative question of that cluster and was provided to the user. The most representative questions of the SQuAD data set for USE and InferSent are shown in Table 4.4 and Table 4.5 below.
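Before the tables, a minimal sketch of this selection step is given; it assumes Euclidean distance to the k-means centroids, with placeholder data standing in for the real embeddings and questions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Placeholder embeddings and questions; in practice these come from section 4.1.2.
vectors = np.random.rand(100, 512)
questions = [f"question {i}" for i in range(100)]

kmeans = KMeans(n_clusters=10, random_state=0).fit(vectors)

# For each cluster centre, the index of the closest embedded question (Euclidean distance assumed)
closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)
for cluster_id, question_idx in enumerate(closest_idx):
    print(cluster_id, questions[question_idx])  # one representative question per cluster
```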

Cluster Index | Rep. Question | Class Label of Rep. Question
0 | who was princess victoria widowed from? | Queen_Victoria
1 | what percentage of people that live in the state of new york live in new york city? | New_York_City
2 | in what sutras are the buddha dharma and sangha viewed as one? | Buddhism
3 | which artist was associated with beyonces premiere solo recording? | Beyonce
4 | who did chopin have at his last parisian concert in 1848? | Frederic_Chopin
5 | what drug effectively treated tuberculosis? | Pharmaceutical_industry
6 | what have poachers contributed to hunting? | Hunting
7 | who was the winner of american idols sixth season? | American_Idol
8 | during the 2020 season how many premier league representatives were there in the association? | Premier_League
9 | what public area in new haven was named a national historic landmark in 1970? | New_Haven,_Connecticut

Table 4.4: Most Representative Questions SQuAD data set - USE


Cluster Index | Rep. Question | Class Label of Rep. Question
0 | what is the name of the collaborated project between yale university connecticut and new haven city? | New_Haven,_Connecticut
1 | what event did victoria attend in 1866 for the first time following alberts death? | Queen_Victoria
2 | what contestant came in fourth on season two of american idol? | American_Idol
3 | what song did beyonce record for the film epic? | Beyonce
4 | what does evidence suggest hunting may have been a factor in the extinction of? | Hunting
5 | what percentage of people that live in the state of new york live in new york city? | New_York_City
6 | at the beginning of the premier league how many foreign players were there for the first round of games? | Premier_League
7 | when did victoria become queen upon william ivs death? | Queen_Victoria
8 | what is considered central to the teachings of buddhism? | Buddhism
9 | what did chopin compalin about? | Frederic_Chopin

Table 4.5: Most Representative Questions SQuAD data set - InferSent

From the results presented in Tables 4.4 and 4.5 it can be seen that with USE every class has a representative question in the final subset, whereas with InferSent some classes (here, Queen_Victoria) contribute more than one question, which means that other classes (here, Pharmaceutical_industry) have no representative question in the final subset at all. This shows that USE achieves better results in covering the classes of the labelled data set in the clustering outputs.

Table 4.5: Most Representative Questions SQuAD data set - InferSent From the results presented in Tables 4.4 and 4.5 it can be seen that by ap-plying USE, all classes have representative questions in the final subset of the most representative questions, whereas in the case of using InferSent it can be observed that some of the classes have more than one questions. This means that there are some classes that do not have any representative questions in the final subset of the most representative questions. This shows that USE achieves better results in covering the classes of the labelled data set in the clustering outputs.


4.2 House of Commons Written Questions

The “House of Commons Written Questions” data set consists of questions submitted by members of the UK parliament to government ministers about policies and statistics on the activities of government departments from 2013 until 2019, and it contains around 236,648 questions. A subset of 10,000 questions was randomly selected from it. As the questions are not labelled by subject, the answering body field was used as the label of each question; the answering body is the department or ministry to which the question was submitted and from which the answer is expected. The number of questions per class varies, with some classes containing more questions than others. For example, the questions “To ask the Secretary of State for Transport, whether a trial of the new managed motorway all lanes running design standard will be conducted before implementation of these schemes”, “To ask the Secretary of State for Transport, what plans for the electrification of the Wigan to Bolton line are currently planned by his Department”, and “To ask the Secretary of State for Transport, how many people in Kilmarnock and Loudoun constituency hold a current UK driving licence” are all members of the “Department for Transport” class.
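A minimal sketch of this subset selection is given below; the file name and the column names ("Question", "AnsweringBody") are hypothetical, since the schema of the published CSV may differ.

```python
import pandas as pd

# Hypothetical file and column names; the real data set may use a different schema.
df = pd.read_csv("house_of_commons_written_questions.csv")
subset = df.sample(n=10_000, random_state=0)   # random subset of 10,000 questions

questions = subset["Question"].tolist()
labels = subset["AnsweringBody"].tolist()      # the answering body is used as the class label
```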

4.2.1 Preprocessing

The preprocessing steps of data cleaning and sampling were performed similarly to the SQuAD data set, with the data cleaning step including an additional operation due to the structure of the questions in this data set.

Data Cleaning

Similarly to the SQuAD data set, the data cleaning steps were also performed for the “House of Commons Written Questions” data set. However, the example questions quoted in section 4.2 show that most questions start with “To ask the...”, a phrase that appears regardless of the question topic; leaving it in place may lead to inaccurate results. Therefore, an additional step was performed for this data set, and the part of the question that starts with “To ask the...” was removed.

For example, the question “To ask the Secretary of State for Transport, what plans for the electrification of the Wigan to Bolton line are currently planned by his Department” was transformed to “what plans for the electrification of the wigan to bolton line are currently planned by his department”.
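A minimal sketch of this additional cleaning step, assuming the addressee phrase always ends at the first comma (as in the quoted examples), could be:

```python
import re

def strip_prefix(question: str) -> str:
    """Remove the leading 'to ask the <addressee>,' phrase before the usual cleaning.

    A sketch that assumes the addressee never contains a comma, as in the quoted examples.
    """
    return re.sub(r"^to ask the [^,]+,\s*", "", question.lower())

print(strip_prefix("To ask the Secretary of State for Transport, what plans for the "
                   "electrification of the Wigan to Bolton line are currently planned "
                   "by his Department"))
# -> what plans for the electrification of the wigan to bolton line are currently planned by his department
```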

Sampling

For the sampling step of the “House of Commons Written Questions” data set, the maximum number of questions to be selected was again set to 4000.


Table 4.6 presents the number of questions per class before sampling and the number remaining after the sampling step.

Class | Before Sampling | After Sampling
Department of Health | 996 | 596
Home Office | 925 | 553
Department for Education | 683 | 408
Foreign and Commonwealth Office | 672 | 402
Department for Transport | 661 | 395
Department for Work and Pensions | 635 | 380
Ministry of Justice | 586 | 350
Department for Environment, Food and Rural Affairs | 541 | 323
Ministry of Defence | 520 | 311
HM Treasury | 462 | 276

Table 4.6: Number of Questions before and after Sampling - House of Commons Written Questions

4.2.2 Question semantic similarity

Similarly to the SQuAD data set, the semantic similarity of the questions of the “House of Commons Written Questions” data set was obtained. Figure 4.5 and Figure 4.6 present the USE and InferSent heatmaps respectively. The similarity matrix measure is 1.433 for USE and 1.098 for InferSent, which shows that USE again obtains the semantic similarities between questions of the same class better than InferSent.

4.2.3 Question Clustering

Similarly to the SQuAD data set, the k-means clustering algorithm was used to cluster the questions according to their semantic similarity, and UMAP was used for the two-dimensional visualization. The clustering results for USE and InferSent are presented in Figure 4.7 and Figure 4.8, and Table 4.7 and Table 4.8 show the number of samples of every class contained in each cluster for USE and InferSent respectively.


Figure 4.5: Heatmap of House of Commons Written Questions - USE


Figure 4.7: UMAP plot of House of Commons Written Questions - USE


Cluster Index 0 1 2 3 4 5 6 7 8 9
Class Label
Department of Health 53 382 138 21 2
Home Office 41 15 2 226 2 20 7 200 38 2
Department for Education 92 150 7 113 46
Foreign and Commonwealth Office 4 383 15
Department for Transport 18 3 9 51 314
Department for Work and Pensions 1 172 1 40 13 119 1 32 1
Ministry of Justice 229 51 3 11 2 19 13 21 1
Department for Environment, Food and Rural Affairs 28 94 11 19 1 155 15
Ministry of Defence 5 19 215 1 4 30 1 35 1
HM Treasury 2 125 1 60 0 4 78 6

Table 4.7: Labels per cluster index House of Commons Written Questions - USE

Cluster Index 0 1 2 3 4 5 6 7 8 9
Class Label
Department of Health 58 89 32 97 51 81 16 41 46 85
Home Office 56 130 31 77 32 75 33 78 16 25
Department for Education 71 60 16 49 42 39 23 36 23 49
Foreign and Commonwealth Office 17 25 165 46 45 2 7 4 40 51
Department for Transport 51 94 21 62 15 16 26 11 53 46
Department for Work and Pensions 52 56 4 36 17 50 40 33 44 48
Ministry of Justice 32 40 6 32 12 89 17 83 12 27
Department for Environment, Food and Rural Affairs 40 58 27 35 44 16 24 10 30 39
Ministry of Defence 46 39 14 40 15 30 11 35 56 25
HM Treasury 27 37 24 38 7 34 19 24 24 42

Table 4.8: Labels per cluster index in House of Commons Written Questions using InferSent


Clustering Performance Measures

The NMI performance measure was found to be 0.513 for USE and 0.059 for InferSent, and the RI performance measure was found to be 0.380 for USE and 0.029 for InferSent. These values show that, for the House of Commons Written Questions data set, the USE embeddings lead to better clustering with respect to the class labels on both performance measures. As for SQuAD, this may also be supported by the similarity matrix measure results, which indicate that USE captures the semantic similarities between questions of the same class more accurately than InferSent.

4.2.4 Representative Question extraction

The most representative questions of the “House of Commons Written Questions” data set, chosen according to their distance to each cluster centroid, are presented for both USE and InferSent in Table 4.9 and Table 4.10 below.

Cluster Index | Rep. Question | Class Label of Rep. Question
0 | how many prisoners were serving an indeterminate public protection sentence in each year since 2012? | Ministry of Justice
1 | what assessment he has made of the effect on reductions of the a lifetime allowance and b annual allowance for pension contributions on the overall level of pension saving? | HM Treasury
2 | how many compensation payments of 500 or less were made to armed forces service complainants in the a army b royal air force and c royal navy in i 201516 and ii 201617 and if he will make a statement? | Ministry of Defence
3 | what steps she is taking to ensure that her department consults with civil society organisations on the future of eu citizens in the uk? | Home Office
4 | what discussions his department has had with representatives from a nhs england b biogen pharmaceutical company and c patient groups on the expanded access programme for nusinersen for the treatment of infants with spinal muscular atrophy type 1? | Department of Health
