Using Machine Learning to Learn from Bug Reports : Towards Improved Testing Efficiency

(1)

(2)

(3)

(4)

(5)

3.5.1 Clustering . . . 22 3.5.2 Topic Labeling . . . 23 3.5.3 Connection of Data . . . 23 4 Result 25 4.1 Clustering . . . 25 4.1.1 Preprocessing . . . 26 4.1.2 Vectorization . . . 28 4.1.3 Clustering . . . 30 4.1.4 Distribution clustering . . . 31 4.2 Topic Labelling . . . 34 4.2.1 Words . . . 34 4.2.2 Files . . . 40 4.3 Connection of data . . . 42

(6)

4.3.1 Bug to Bug/PC . . . 42 4.3.2 Cluster to Cluster . . . 42 5 Discussion 43 5.1 Vectorization . . . 43 5.2 Clustering . . . 43 5.3 Topic Labelling . . . 46 5.4 Connection of Data . . . 47

5.5 Improvement and future work . . . 48

6 Conclusions 50 6.1 Data Connection . . . 50

6.2 Data Retrieval . . . 50

(7)

1 Introduction

This thesis develops and evaluates a method for more efficient management of bug reports and possible ways to use this method as decision support within a medical IT company. The purpose of this project will be to evaluate possibilities to use data produced by the company today, in order to further understand connections in the developed product and how changes in certain part of the code can be connected to anomalies in the updated product. The goal is to investigate suit-able clustering methods for free text documents, apply the same to bug reports and commits and further on perform topic labelling to be used as guidelines for labelling of bug reports in the future. A possible and desirable output of the final product would be to find and visualize correlation between development in field A, which generate a bug in field B.

1.1 Motivation

Sectra Imaging IT Solutions AB is a company with multiple products that con-tinuously keeps growing. The baseline in the company’s products is a system that handles large amounts of data from the healthcare sector. These systems act as a central hub where all this data is stored, data for multiple purposes and applications such as visualization, measures (distance, area, volume), exchange of information between systems, handling of electrocardiography (ECG) etc. If changes are made to an already existing product, long and extensive tests must be performed upon release in order to ensure that no major defects have been implemented inattentively. If any anomalies are found in these release tests, they are reported as a free text comment in a bug report.

The test phase before a new software release takes up to six weeks, where a profuse number of tests are run to ensure that no damage was done to the pre-vious version. Currently customized tests are written upon every new release, a time-consuming task that aims to find the most vulnerable areas in the product after changes have been made. Even small changes in a product can cause bugs in other parts of the system. An understanding for how changes in certain parts of the software can affect the final product and lead to undesirable anomalies could possibly help to streamline the decision making of tests to execute, and thereby shorten the test phase upon new releases.

Bug reports are an important source of knowledge and contain crucial information about possible streamlining of processes that often are overseen. Even though the data exists, the reports are not always as explicit as desired. To use such reports and be able to create the desired deeper understanding of their content without too much manual work, it is important that they are structured in a way that it

(8)

is possible for machines to understand their content. Lately, it has become more and more common to use supervised learning for classification and allocation of the existing data resources within a company and in that way make sense of all data that already exists [12] [17]. A challenge with classification is that it requires a gold standard data set to train the unknown data on. Creating a data set like this is something that takes up an immense amount of resources to establish, re-sources that not always is possible to allocate within an organization without any further proof of progress. This project will therefore act as a proof of concept and aims to be investigative for the company in order to further handle bug reports in their test environment.

1.2 Problem Formulation

To guide the research in this project, the following research question was formu-lated:

In what ways is it possible to define links between bugs and de-velopment that can be helpful for testers to use as a decision basis for further testing?

As an aid to answer the primary research question, the existing data will have to be preprocessed and examined. Two sub-questions were formulated as guidance to do so. The first one aims to establish the basis for what kind of data that exists today, and what would be desired to gather in order to proceed with machine learning approaches. The second questions aim to investigate the possibilities to generate desired sub classes of the existing data, by using unsupervised methods. To which extent can useful data be retrieved from the company’s database as for today? What kind of data can be beneficial for the company to generate in the future in order to simplify decisions of what kind of tests to perform? In what ways can existing free text data be clustered and labelled such that they are useful for testers when evaluating which tests are to be performed?

1.3 Delimitations

The existing data as for today are descriptive free text fields written by multiple people with varying level of details. The final product depends on how well cluster algorithms can perform on the given data. This thesis will therefore mainly be an investigative project that intends to take on a proof of concept assessment.

(9)

2 Theory

The scientific theory behind the used methods in this project are described in this section. A background to why efficient bug handling is urgent in modern software companies is given initially, followed by a description of vector representation and finally and introduction to clustering.

2.1 Background

Evolution of a software system originates from its changes, whether it comes from changed user needs or adaption to its current environment. These changes are as encouraged as they are inevitable, although every change to a software system comes with a risk of introducing an error or a bug. Even the slightest changes in one part of the product can cause problems in other, distant parts of the system [3]. When a change is made to a software system, the usual aim is to add func-tionality upon requests from customers or fix a previous bug. In this example, this change assumes to be the bug introducing change (BIC) [3]. These modifications eventually turn out to have introduced an external undesired behaviour which is caught in a bug report during a test phase. This error will consequently be fixed by modification in the project’s source code as a bug-fix change [13]. To gain understanding of changes and bug reports, as well as keeping track of how bugs are handled, a specific Bug Tracking System (BTS) is used within a company. Multiple such systems are widely used, from open source systems as Bugzilla to more commercial ones such as Jira and IBM Notes. Both product changes (PC) and bug reports (BR) are collected in the BTS and can be tracked by an indi-vidual ID field. Upon release of a new product, or new version of a product, it goes through a testing phase where bugs are detected. When a bug is identified, a bug report is established and a description is being made by an individual tester or developer [12]. The author of a bug report is responsible for its quality and that the problem is described accurately. Huge gains have been seen by ensuring high quality bug reports, that developers can get a quick overview of and easily identify the problem [5]. Implementation of machine learning in large software systems have been proved to be efficient for quick location of bugs and possible fixes [12].

It is common that labels in a BTS are assigned to certain bugs. Such label fields can help to simplify tasks like accurate assessment of priority and severity, identify appropriate developers to resolve a bug and simply understand the prob-lem better. Due to common non-existing or incorrectly assigned bug reports, a need for automatic bug labelling has been seen [8]. Automatic bug labelling or classification of bug reports have previously been addressed with various machine learning techniques of clustering and classification [2][21]. By using the semantic

(10)

Bug ID Description Classification XYZ-123 This is a description of the first bug.

It describes what is done in a certain file 1 YXZ-213 This is a description of the second bug.

It has a similar description as the first 1 ZYX-312 This field describes a system failure 2

Table 1: Examples of how bugs are structured with a unique ID and a description together with a possible classification

information that describes the context of a problem, it is seen to be possible to label bug reports depending on the frequency of words in a certain class [8]. Clus-tering techniques where term frequency and semantic information is used have previously been investigated for multiple applications, for example for clustering of Web search results [22]. Table 1 pictures an example of three bugs and how they could possibly be classified based on the semantics in their description. The first and second bug have a similar structure of text with two sentences of similar length. The used words have a similar meaning or are even the same, and these words even occur at the same positions in the documents. On this basis it is reasonable to conclude that the two first bugs should belong to the same class while the third one should be on its own.

A concrete example of automatic bug labelling in this thesis is performed is given below to further clarify the workflow:

A new bug report appears in the bug tracking system. Vectorization of the text is performed which place it in a vector space as is described in Figure 1. Dependent of the position of the document compared to others, it is assigned to a certain cluster. Other documents such as commit messages and files that have been involved with the same bug ID is assigned the same cluster label. Topic labelling is eventually performed to find the most descriptive word(s) for a certain cluster rather than simply number it.

(11)

Figure 1: Simplified example of how the documents in Table 1 could be placed in a three dimensional vectorspace

Recently the company started to perform so called Mini RCA - a Root Cause Anal-ysis investigation where the developer backtrack the actual change that caused a bug in the first place. This is a newly implemented concept only available for approximately 500 specific bug reports. These known connections will be used as evaluation for clustering and connection of data as described in section 3.5.

2.2 Automatic bug labelling

An immense amount of data is collected in our society every day, which rise a de-sire of being able to analyze and predict patterns and behaviour in such. Machine learning, which aims to develop computer algorithms that improve with experi-ence, holds promise to enable computers to assist humans in the analysis of large, complex data sets. An increased popularity for text analytics has been seen in recent years due to the omnipresence of text data on the Web, social networks and in emails [1].

Machine learning techniques are divided into three families when it comes to clas-sification: Supervised Learning, Unsupervised Learning and Reinforcement Learn-ing. Supervised learning assumes that a model with labelled data is available. This data set is used as a reference for a classifier that learn its behaviour by maximizing gain or minimize a cost function. By introducing the same algorithm to new data samples, it is possible to learn and predict patterns that were too complex to see initially. Classification and regression are examples of supervised learning [7] [2]. Clustering is an example of unsupervised learning, which aims to find patterns in data sets without the use of labels, solely by finding similar patterns in for example images or text [16]. The labelling procedure occurs based on a cost function, usually based on similarity or distance. The data is translated into a numerical vector space which makes it possible to find out how close

(12)

cer-tain datapoints are to each other and by that predict a possible connection. This concept is further described in chapter 2.3. Clustering is an appealing technique since it does not require any pre-labelled data sets, but also that it may be hard to interpret the final classification as well as derive the guidelines that link cer-tain clusters with characteristics of the data [2]. Reinforcement learning applies a technique where a so-called agent learns the correct behaviour through a reward system by trying out various behaviours. It is a type of trial and error approach where the reward may be delayed. The algorithm must pay careful attention to the trade of between exploration of the unknown environment and exploitation of the already covered area [29].

2.3 Vector Representation and Normalization

The basic idea behind vector representation of a written document is to map it to a vector space, which allows to facilitate a numerical comparison between the documents and hence clustering. A dataset in Natural Language Processing is referred to as a corpus and built up of several documents. These documents can be of varying length and are the result of multiple terms which primary are words or tokens. When a desired dataset is located and extracted it must be processed in a way so it can be used for various machine learning techniques.

This section first introduces the basic way of representing a document in nat-ural language processing by using Bag of Words. TF-IDF is then presented as a method to achieve a mapping to a vector space by giving the terms in a document a weight. Finally, word and document embedding is introduced which uses a sim-ple neural network to generate understanding for the surrounding a word appears in, rather the solely its frequency.

2.3.1 Bag of Words

A common way to represent the documents when it comes to Natural Language Processing is with a so called Bag Of Words model. It is a simplified representation where a document is represented as a bag which holds a set of words. The linear order and syntactic structure of these words are lost and solely its multiplicity is kept [1]. To make it possible for machines to understand and interpret the human language, it further must be translated into a numerical representation, a so-called vector representation. To do so, it is crucial to capture this relation which can be represented by meaning, morphology and/or context. All independent terms in a corpus builds up a lexicon with as many dimensions as unique terms. A document can hence be described as a vector where the words are represented as a frequency rather than alphabetic value. The following example describes this dependency. Dependent on the frequency of words in a certain document, it may be possible

(13)

to categorize it with similar documents.

This is the first sentence Here comes the second

And here is the third sentence

These three sentences would in this example act as three separate documents and make up for example the following lexicon after tokenization

"and", "this", "sentence", "is", "comes", "the", "first", "here", "second", "third"

The three sentences would hence be converted into vectors based on the words they are made of. The length of the vector depends on the lexicon in which each token describes a new dimension. If the word in a sentence exist in the lexicon, this dimension/position is assigned number 1 while the dimensions/positions without a corresponding word in the sentence is assigned value 0. A clarifying example is given below:

This is the first sentence = [0, 1, 1, 1, 0, 1, 1, 0, 0, 0] Here comes the second = [0, 0, 0, 0, 1, 1, 0, 1, 1, 0]

And here is the third sentence = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1] 2.3.2 TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency and is a nor-malization process that intends to reflect the importance of a certain term in a document. The first part, Term Frequency, defines the frequency of a certain term in a document. Inverse Document Frequency on the other hand diminishes the weights given to words that occurs frequent throughout the complete collection of documents, such as the or and. Both quantities are multiplied, where a high end-result implies a strong relation to the document the term appears in. This value can further be used as weighting in classification [24]. In a vector space model, a document is seen as a vector in space. Dependent on the weight it is given through the TF-IDF it will have a position. Vectors that are close to each other are considered to be similar and can hence be clustered together. There are

(14)

various ways to describe the TF-IDF, although one of the most common ones are described as the following [12]

T F − IDF = fw,d∗ log

N nw

(1) fw,d is the frequency of word w in document d, N is the number of documents in

the entire corpus, while nw is the number of documents in the corpus where word

w occurs.

2.3.3 Word Embedding

Both previously mentioned models meet the criteria to represent a document as a fixed-length feature vector, although two problems have been noted when these models are used: they lose the order of words and the semantics in the sentences are ignored. For example would words as strong, Paris and powerful have an equal distance between each other, even though in a semantic way powerful and strong should be mapped closer together. It also means that a sentence like Dog bites cat is considered to be the same as Cat bites dog in the vector space [15]. Mikolov et al, came in 2013 up with the method word2vec. Unlike the previous mentioned methods, word2vec is prediction based in a way such that word analogies and similarities could be taken into consideration in the word embedding instead of solely rely on frequency of words, which previously been the case [19].

Figure 2: Overview of a Feed Forward Neural Network with one hidden layer word2vec combines two models - Continuous Bag of Words (CBOW) and the Skip Gram Model shown in Figure 3 and 4.

CBOW predicts the probability that a certain word would occur given a con-text. Given a document wi−twi−t+1...wi+t−1wi+t the model hence aims to predict

(15)

its architecture with an input, projection, hidden and output layer, see Figure 2. This is a shallow neural network where the output is a softmax layer that creates real probabilistic output values in the range [0, 1]. Equation 2 and 3 describes how this softmax probability is performed

p(wt|wt−k, ..., wt+k) = eyw t P ieyi (2) y = b + U h(wt−k, ..., wt+k; W ) (3)

yi is an unnormalized probability for each output word, U, b are the softmax

parameters and h comes from the averaging or concatenation of word vectors [15]. The single input layer takes N 1-of-V encoded vectors as input, where V is the size of the lexicon. This layer will then be projected in the projection layer, generating a matrix of dimension N × D. The order of words in the projected lexicon does not matter in this architecture, hence the bag-of-words model with a continuous distributed representation of the context [1] [18].

The Skip Gram Model can be seen as a reversed version of CBOW, as shown in Figure 3, where the aim is to predict m context words given one input word. The current word is used as an input to a log-linear classifier with a continuous pro-jection layer [18]. The main objective with this model is to maximize the average log probability in equation 4

1 T T X t=1 X −c≤j≤c,j6=0 logp(wt+j|wt) (4)

c is here the size of the context to train on, w1, w2, ..., wt is a given sequence

of T words and j is the current word [19]. A single word is taken as the input to the classifier in 4 which will predict related words that are close to the cur-rent word in the training set. The wider the range around the word is, the more accurate are the results. Although an increased range also leads to an increased computational cost [18]. The Skip-gram model is in the word2vec method paired with so called Negative Sampling instead of the softmax technique as for CBOW. A concrete example of the Skip-gram model are shown in Figure 5 where it is easy to see how the context relates to the final predicted word.

Worth to notice is that an issue with the CBOW model is that it tends to per-form worse on large amounts of data, and that the Skip-gram should be the model

(16)

of choice if the data amount is large. This is because of an averaging effect in the CBOW model that occurs when words are loaded to the hidden layer, which smooths out the data [1].

Figure 3: The Skip-gram architec-ture where context or surroundings words are predicted based on the cur-rent word given as an input

Figure 4: The Continuous Bag of Word architecture where the current word is predicted by the context it belongs to

2.3.4 Document Embedding

The creators of word2vec came in 2014 up with an extension called doc2vec where document-leveled embedding was proposed instead of solely words [14]. The al-gorithm proposes an unsupervised framework of paragraph vectors which learns vector representation for documents of varying length. The doc2vec embedding is performed in two steps, where it initially inherits the important semantics of a word in a context and thereafter includes an additional paragraph vector that is unique for a certain document. This paragraph vector can be seen as an extra memory that remembers the missing pieces from the current context and is there-fore called the Distributed Memory Model. A second way to predict the Paragraph Vector is through a Distributed Bag Of Words Model. The ordering of words taken as an input are then ignored and instead force an output of randomly sampled words from the paragraph. This model requires storage of less data than what the Distributed Memory does. Together these are used to interpret semantics and put the document in a context [15].

doc2vec builds upon similar ideas as its ancestor word2vec where word-to-word relationships are mapped [19]. Similarities between the two models can easily

(17)

be detected by comparing Figure 5 with Figure 6 and 7 which also pictures the difference between Distributes Memory and Distributed Bag of Words. As seen in Figure 6, the distributed memory version of doc2vec has solely an additional input for paragraph vectors in the CBOW model of word2vec. Similarly in Fig-ure 7, the word identifier is replaced by a paragraph identifier to the Skip-gram model of word2vec. Another difference worth to note is that doc2vec uses histor-ical context to predict a word rather than a similar word centered in the current context. Furthermore doc2vec concatenates the vectors instead of averaging the embedding in the hidden layer which somewhat increases its dimensionality [1].

Figure 5: Overview of the word2vec model where three words are used to pre-dict the fourth by mapping the input words to unique vectors as columns in a matrix that gets indexed by position of the word in a corpus. Concatenation or summation of input vectors acts as features to predict the final word

(18)

Figure 6: Overview of the doc2vec model, using Distributed Memory. A paragraph ID is added to the frame-work in figure 1 where also each para-graph is mapped to a unique vector that builds up a new matrix and acts as a memory for the current context. It is averaged with the rest of the word vectors to predict the missing word.

Figure 7: Overview of the doc2vec model, using Distributed Bag of Words. A text window is sampled for each iteration, whereupon a ran-dom word is sampled from this text window. A classification task is then formed based on the context

2.4 Clustering Algorithms

Text clustering is a form of unsupervised machine learning which aims to group unlabelled data. A text corpus consists of several text documents, objects, which are partitioned into related groups of similar objects. These clusters relate to cer-tain topics which is initially unknown [1]. By representing each object in a vector space model, it is possible to measure its similarities to each other by a similar-ity function, like cosine- or Euclidean distance. The clustering of objects can be performed in multiple ways, although we are primarily interested in the content they contain and how it relates to each other. An uncertainty with clustering is that a cluster algorithm always produces partitioning, even though it is not clearly justified. A corpus can typically be divided in multiple ways dependent on the vectorization processes and the algorithm used. [26]. Cluster algorithms are divided into several groups based on their performance. Hard clustering only allows a certain object to be assigned to one specific cluster, while soft clustering allows an object to belong to several clusters. Clustering algorithms are com-monly divided into two sub groups, namely hierarchical and partitioning [11]. Hierarchical clustering has historically been portrayed as the better choice when it comes to clustering quality, although its time complexity is quadratic and hence somewhat limited in performance. Partitioning algorithms are linear in their time

(19)

complexity and therefore faster in its performance, although with somewhat sub-servient clusters [28]. Experiments with both methods have though shown that variations of K-means have performed with similar quality as agglomerative hier-archical clustering techniques [9]. A combination of the two have been done to investigate performance and run-time efficiency for large documents [9].

Partitioning Algorithms returns flat clustering of a data set, where a single par-tition is performed of the data compared to hierarchical technique. Each cluster is in that sense independent of each other. The specific number of clusters must be specified initially [11]. The K-means algorithm is an example of partitioning clustering, where k is taken as an input parameter and represents the number of clusters to divide the data set in. The problem this algorithm wants to solve is to find k representatives, such that the squared distance of each document in the corpus to its closest centroid is as low as possible. Both Euclidean distance or a cosine similarity can be used to measure this distance [1]. The K-means algorithm with Euclidean distance is defined as

J =

n

X

i=1

mind_j=1|| ¯Xi− ¯Yj||2 (5)

where Xi defines all documents and Yj is the after sought representatives with

the goal to minimize the squared distance from each document to its closest centroid [1]. Hierarchical algorithms on the other hand can be described with a tree structure where clusters are composed of clusters in a nested loop. The final number of clusters do not necessarily have to be specified, but the stop criterion is usually the desired number of clusters [26]. The tree-like structure is called a taxonomy and can be created either in a top-down or a bottom-up fashion. An agglomerative algorithm is a bottom-up method where each document initially make up an own cluster. These successively agglomerate and merge into larger clusters dependent on semantic similarities. Late stated clusters are in this way supersets of the early state clusters [1]. There are many ways to perform agglomerative clustering, where a common approach is to use the Intra-Cluster similarity technique. By measuring the similarities of all documents in a cluster to its centroid point it is possible to select the pair of clusters to merge by determining the smallest possible decrease in similarity. Similarity is defined by

sim(X) =X

d∈X

cosine(d, c) (6)

where d represents a document in cluster X and c is its centroid of the cluster [28]. This distance can also be measured by e.g. Euclidean distance. Hierarchical clustering is commonly described in a dendrogram as showcased in Figure 8.

(20)

Figure 8: Dendrogram where each document is a separate cluster and further on are merged together through a similarity function iteration by iteration

Scikit Learn provides a library that supports implementation of the two previ-ously mentioned algorithms. For agglomerative clustering, three different linkage criterion can be selected which determines the distance to be used between sets of observations, namely ward, complete or average. Average uses an average distance of each distance between two sets, complete uses the maximum distance between two sets and ward minimizes the variance of clusters that are being merged. Av-erage is the default setting and will be used if nothing else is mentioned. The goal of the algorithm is to merge the pair of clusters that minimizes the distance. Number of iterations is the number of iterations of the K-means algorithm for a single run [23].

(21)

3 Method

This project was brought out through three phases which can be seen in Figure 9. The aim was to address the overall question of whether it was possible to define links between bugs and development in certain parts of the code base. The first two phases aimed to answer the two sub questions, while the third phase eventually connected the dots. The initial phase, Data Collection and Preprocessing describes the step of how the used datasets were collected and extracted as well as how they were preprocessed to fit the next phase. Clustering and Labelling describes how clustering and topic modelling on free text data were performed. The third and final step Connection of Data presents an approach of how the previously processed data could be used in order to find the sought connections between how changes in code possibly caused bugs further down the line. The related work in this field are mainly approaching a way to identify Bug-Introducing changes, to localize a bug or assign it to a certain developer or a team. This thesis aims to facilitate testers in their decision making of selections of tests to perform in a new release, and the methods used inspired from [12] and [26] among others. Talks with in-house experts in various fields within the company were performed to reach the most suitable conclusions and methods to process the data.

Figure 9: System overview of the three main phases that each aims to answer one of the three questions in 2.2

(22)

3.1 Tools and frameworks

This section introduces the tools and frameworks used in this thesis. 3.1.1 Python Libraries

The tool has been written in Python version 3.6.6. Several libraries have been used, although Scikit Learn and NLTK are the two mainly worth mentioning. Scikit learn is a free machine learning library with multiple helpful features for classification, regression and clustering [23]. NLTK (Natural Language Toolkit) is a suite of libraries that performs text processing on natural language in an easy to use way [6]. WordNet is a large lexical database of English nouns, verbs, adjectives and adverbs which are grouped together into sets of cognitive synonyms. It was created by Princeton and is part of the NLTK corpus which can be used to find meanings of words, synonyms and grammatical forms of words [20]. Gensim is short for "generate similar" and is an open source vector-space and topic modelling library implemented in Python. In this project it is used for topic labelling and vectorization [25].

3.2 Data Collection and Preprocessing

This section aims to describe the data used in the project and how it was prepro-cessed to fit further use.

3.2.1 Datasets

Data used in this project comes primary from two sources, namely Bug Reports and Commit History. Further on are so called Change Proposals used, which can be a possible cause of a bug. They are built up and used in the same way as bug reports in this project. A bug report consists of various fields, although the ones that will be studied more closely are the bug ID and description seen in Figure 10. The description is a free text field that describes the problem that triggered the bug, while the ID is a unique key related to a certain issue. A commit describes what have been done to solve a bug and are tagged with the related bug ID, a description that describes what is done to solve the problem and several fileID which is a path to the files that have been changed.

(23)

It has been shown that it is important to use documents of similar length in a corpus [1]. This is mainly due to the distance computation in multidimensional data that tends to lose accuracy between documents of varying length. Distance between short description will typically be very small, while distance between long documents will be much larger. Based on this arguing, solely the summarizing headings were used since they tend to have approximately the same length of one to two sentences. Data are initially exported from two issue tracking systems, Jira and IBM notes, and cleaned in such way that it can be efficiently used with Python. Bug reports and product changes from two projects have been harvested for this project, which generated 8958 reports, where 4917 of these were reported bugs. All possible commit history from the two projects generated 37205 records. The distribution of documents and words from the different data sources are shown in Table 2. A transformation of data was done to make sure that only bugID and description appears in the data sets. Files that were changed in the corresponding commits had to be exported separately from the version handling software and thereafter merged with the description of the same ID. The cleaned dataset was thenceforth loaded into a Python script to be used in the next step of the process.

Type No. of documents No. of words

Bugs 4917 29922

PC/CP 4068 24708

Commits 37205 238142 Table 2:

Quantity of used data from bug reports, product changes/change proposals and commits

3.2.2 Preprocessing of data

The documents used in this project are composed by humans and consist of free text fields of varying length and quality that describes the problem or solution. Language used are mainly English, although Swedish has been seen in some rare occasions. Each document consists of one to two sentences. The preprocessing of data is performed in three steps described below, namely tokenization, stem-ming and lemmatization and stop-word removal. Tokenization is the task when a sequence of letters is split in such way that each term represents a token, and punc-tuation is commonly removed. It would be easy to simply perform tokenization by splitting the content of a document on all non-alphanumeric characters. Although tokenization tends to be more complex than that. The word aren’t would in this case be divided into aren and t which makes little to no sense. A way to ensure better tokenization is therefore to initially specify the used language in a corpus to ensure more specific partitioning [27]. Information retrieval usually means that all kinds of data, both relevant and irrelevant, will be gathered. To improve the

(24)

performance in later steps, lemmatization or stemming can be applied to docu-ments to initially reduce the amount of irrelevant data. Stemming is the process when all words with the same stem are reduced to a common form by removing its suffix. Lemmatization on the other hand is based on a morphological analysis of a word and returns its dictionary form [4]. Some common words such as the, and and a/an, usually pronouns, articles and prepositions have little distinguishing power in a mining process are called stop words. A final step in the pre-processing of the text documents is to remove these stop words and further perform case-folding [1]. These steps normalize the data and therefore reduce the noise in the same. The data is hence compressed in such way that the computational cost can be strongly reduced later in the process.

• Tokenization - Each document is taken as a separate string and is tok-enized by using the NLTK tokenize package with the regexp module [6]. This tokenizer splits a string into substrings by using regular expressions, here by alphabetic sequences. Eventually each word is converted to solely lower-case letters.

• Stemming and Lemmatization - The tokenized words are further on stemmed or lemmatized by using NLTK stem package [6]. Lemmatized data have been proven to generate more accurate results in language modelling techniques, although all three techniques will be tested in this project [4]. Lemmatization is here performed by using WordNet’s built in morphy func-tion and returns an unchanged input word if it does not exist in WordNet. • Stop-word removal - This step is performed by using the NLTK corpus

package with its stop word module. This corpus contains English words used with a high-frequency, such as the, a, an and to, words that usually provides little lexical content [6]. Further on are all versions of remove and add removed from documents gathered from all commits, since those words made up a big part of the descriptions.

Even though feature extraction, feature selection and method used for clustering or classification have substantial impact of the performance, the preprocessing steps is not to undermine. The normal procedure is to perform alphabetic tok-enization, stop-word removal, lowercase conversion and eventually stemming or tokenization. Investigations have shown that even though the influence of these preprocessing steps may be small, it is still seen as good practice to apply in terms of dimensionality reduction in the feature space as well as promote efficiency when it comes to classification [30].

3.3 Clustering and Labelling

This section describes the procedure to perform clustering on the retrieved data. By using a numeric representation of the descriptive fields from previous section,

(25)

it is possible to capture relations between documents and in that way form clusters of documents closest to each other in the vector space. Each cluster are further on labelled dependent on word frequency in each cluster.

3.3.1 Vectorization of Documents

Most modern machine learning algorithms typically requires the input to be of a fixed length feature vector. Here, doc2vec is used to create this document embedding [15]. The model is created by using the Gensim model library which requires an input of all documents along with a tag. The algorithm then runs through the sentence iterator twice: once to build the vocabulary, and once to train the model on the input data, learning a vector representation for each word and for each label in the dataset. Two parameters can be tweaked in this model to customize it according to own data, namely epochs and learning rate. The output from this algorithm is a paragraph vector that besides a word vectorization also results in a paragraph embedding as seen in Section 2.3.3. This gives the possibility to capture the difference between the same word while used in different contexts [25]. Multiple parameters can be changed while using doc2vec. Both Distributed Memory and Distributed Bag Of Words have been used and evaluated, together with varying learning rate in order to find the most suitable setting for this dataset.

3.3.2 Clustering

K-means and agglomerative clustering have been investigated in this thesis, where the dataset eventually was divided into 50, 75 or 100 clusters. Those granularity levels are agreed upon together with the company, with the aim to give an as realistic division of the data as possible. As for today, it is not sure from the companies view how the product should be divided or classified, something this method aims to aid with. To divide the product in 50 to 100 subcategories is something that is told to be realistic. Other parameters such as linkage for ag-glomerative clustering, and number of iterations for K-means have been changed during the procedure to find the most suiting settings by using a grid search. Skickit Learn provides the two clustering algorithms in their library which are used for this project [23].

The aim of this thesis is to find connections between bug reports and commit messages. It is therefore important that the different data sets are clustered on the same features. Apart from the two previously mentioned clustering algorithms, two different clustering approaches are therefore be performed and evaluated.

1. Interviews with testers at the company outlined that the more accurate de-scription of a bug mainly could be seen in the bug reports rather than the

(26)

commit messages. The first approach is therefore to solely cluster descrip-tions of bug reports. Their corresponding commits are then assigned to the same cluster.

2. Initially, each description of a commit message or a bug report is handled as a document. By combining all descriptions from both commits and bug reports with the same bug-ID, a larger and intentionally more descriptive document is formed, and clustering are performed of the same. Both com-mits and bug reports with the same bug-ID will in this way be assigned to the same cluster.

3.3.3 Topic Labelling

Each cluster were assigned to a unique index whom in the next step were used to find correlations between changes in the development phase that might lead to bugs in certain clusters. A desired output in this part was to label each clus-ter with a descriptive word, something that can be done through various topic modelling methods. The previously described method TF-IDF will be used in this case where the three words that scores highest in each cluster will generate suggested topics for the cluster under consideration. The same method was used to find the most critical files that takes part in each cluster.

TF-IDF aims to reflect how important a word is to a document in a corpus as described in Section 2.3.2. The more times a word appears in the corpus the more the score increases, while at the same time the score decreases the more common the word is in the entire corpus, which makes it a good measurement of importance to both words that can act as topics for the clusters as well as importance of a file to a certain document (here cluster). Apart from the TF-IDF score, frequency of occurring files in a cluster as well as frequency of descriptive words will also be investigated to make it possible to discuss its success rate. As for the suggested topics, looking at the most frequently occurring words and compare them with suggested TF-IDF titles gives the user a freedom to understand the TF-IDF score in a more intuitive way and get a more solid ground to stand on when deciding if the naming is good enough.

To measure frequency of file occurrence in clusters will also help the user to get a deeper insight in how files occur throughout the different clusters. Investigate the importance of files to certain clusters will mainly act as a help for users to find out which files that can be seen as a game changer, therefore the results here will be represented in four files. These files aim to act as a basis for the developers when tests are about to be developed, and to find the file that are more likely to cause a bug

(27)

1. Ten most important files for each cluster according to TF-IDF (highest to lowest)

2. Frequency of single files in each cluster

3. Frequency of clusters a single file appears in at the same time

3.4 Connection of Data

This Section aims to describe the third and final phase of this project, namely to find correlations between developed software of product and a bug. Note that the reason to why clustering was done in the previous step is that the data that will be used in this project lack classification related to the area that is affected, which is an important factor when it comes to find correlations. The method presented here will act as a proof of concept of what kind of connections that can be found if correct labelled data exists. Inspiration for the methods are taken from [12] but are further tweaked to fit the available data and expectations from the company through discussions with experts at Sectra. After that commits and bug reports are classified accordingly, the final question is about to be answered; whether it is possible to reach any conclusions about how bugs and changes in the development phase are connected. To do so, a reversed loop will be used as described below and showcased in Figure 3.4.

For each bugID

1. find the related commits 2. find the contributing files

3. find all other commits connected to 2 4. find all bugs connected to 3

5. Reverse procedure to find possible bugIDs that triggered certain bug This will be possible because every document has been assigned to a cluster in the clustering step. In this way it may be possible to find out how commits in certain clusters may have given rise to bugs in other clusters. Thanks to the previous clustering it is by now possible to find connections on two granularity levels, namely bug to bug or cluster to cluster

(28)

Figure 11: Connections between Bug Reports and Commits

3.5 Validation and Evaluation

This section presents the methods used for validation and evaluation of the two main parts of this project, namely clustering and connection of data.

3.5.1 Clustering

Clustering is an unsupervised method and can be challenging to perform proper evaluation of. The underlying idea of clustering is after all to find ulterior struc-tures in datasets without a known ground truth. As previously mentioned, the company have started to perform mini RCAs which in this thesis is used for evalua-tion. Developers have in some cases, 500 for the investigated projects, backtracked the actual cause of a bug to an implemented product change with an ID that is to find in the bug tracking system.

Index Bug Cluster Bug Cause Cluster Cause

1 XYZ-123 1 UVW-321 1

2 YXZ-213 1 WUV-231 4

3 ZYX-312 2 VWU-132 3

Table 3: A schematic view of how a mini RCA can look after clustering is per-formed and a cluster number is attached to an ID. It is from here known that bug and cause is supposed to be mapped to the same cluster to be awarded with a point.

These connections are gathered from a bug tracking system. Experts from the company have evaluated these causes and found that 255 of them can be used as a ground truth for the clustering where, if cause and bug coexist in the same cluster, a point will be awarded. An accuracy can then be calculated from this subset as

(29)

seen in Equation 7. An assumption is hence made that the known ground truth is made up of a subset that is representative for the entire dataset used.

Accuracy = P correctly classified bugs from subset

P total number of bugs in subset (7) Table 3 provides an example of how accuracy is measured. It is known that bug and cause on the same row is supposed to be mapped to the same cluster to be awarded with a point. This only holds true for row one in this example, leading to the accuracy of

Accuracy = 1

3 = 33% 3.5.2 Topic Labeling

In this project, topic labeling is performed to facilitate the user to find descriptive names for the clusters. Since this is a more conceptual part of the project, it will not be evaluated in a quantifiable manner. Instead can the suggested topics be used to point the user in the right direction and in the future also act as a suggestive template to find classes on a desired granularity level for the product under investigation. The results and findings were discussed with experts at the company and are further elaborated around in the discussion.

3.5.3 Connection of Data

To evaluate whether it is possible to backtrack the cause of a bug through commits and files, the previously mentioned Mini RCA will be used once again. In this case it is interesting to evaluate the correctness in the method in two granularities, namely

1. Bug to Bug/PC - Each of the collected bugs are mapped towards all commits, files and product changes/bugs according to Figure 10 in order to find possible connections. A collection of possible causes of commits and/or product changes/bugs will in this way be available to investigate for each unique bug. This gives the user a possibility to find out exactly which implementations that could have caused a certain bug.

2. Cluster to Cluster - All bugIDs are mapped to its corresponding cluster numbers. Possible cluster connections will hence be drawn accordingly As for the first granularity, it is investigated whether the cause from the miniRCA exists among the suggested causes or not. If that is the case, this row is awarded with one point. The second granularity requires some more constraints. In the same way as before, it is investigated if the cause from the miniRCA exists among

(30)

the suggested causes. Each suggested cause is then attached to its cluster number and a distribution of possible causes throughout the clusters are established. If a cluster contains => 10% of the total numbers of causes it is given as a suggested cause. A row in the dataset is then awarded with a point if the cluster it belongs to manage to make it over the 10% threshold. An accuracy of the results will be given in a similar manner as in Equation 7, which indicates how trustworthy the suggested causes may be.

(31)

4 Result

This section aims to reproduce the results from the method presented in Section 3. The outcome from clustering will be represented initially, with a grid search in order to find the optimal settings. Finally, the results from data connection on two granularity levels are presented.

4.1 Clustering

This section presents the results from the two previously described clustering algorithms. Evaluation techniques and datasets are mentioned and described in the Method section. The different parameters that have been tested in order to gain the most promising results according to the evaluation process can be seen in Figure 12. Parameters have been tested together in various constellations as are showcased in the following tables in order to find the most suiting settings. The used clustering algorithms have randomized initialization which means that different executions will start at different points, leading to a slight difference in the results for each run of the code. Results are therefore collected three times with the same parameters, and the mean value is then given as the final result. The accuracy mentioned refers to equation 7 in Section 3.5.1. The results are presented in tables below, with some rows highlighted in green. These rows are not always the ones with the highest accuracy, but rather the combination of parameters that have been continuously experimented with. The reasoning behind these selections are further described in Section 5.2.

(32)

Figure 12: Parameters that are tested together in various constellations from each clustering algorithm.

4.1.1 Preprocessing

The first tests performed aim to evaluate which preprocessing methods that seemed most promising for the two clustering algorithms. The descriptive sen-tences have been either solely tokenized or processed to be reduced in dimensions by stemming or lemmatization. The tables below showcase the results from both algorithms, three different sizes of clusters and default settings for vectorization. The following results are gathered from the clustering batch method done solely on bug reports. Preprocessing Accuracy Tokenization 87.45 % Lemmatization 91.37 % Stemming 90.20 % Table 4:

Clustering Method: Agglomerative No of clusters: 50

Clustering Batch Method: Solely bug reports Preprocessing Accuracy Tokenization 9.41 % Lemmatization 10.98 % Stemming 14.51 % Table 5:

Clustering Method: K-means No of clusters: 50

Clustering Batch Method: Solely bug reports

(33)

Preprocessing Accuracy Tokenization 87.06 % Lemmatization 81.18 % Stemming 89.02 % Table 6:

The tables below showcases the results from both algorithms, three different sizes of clusters and default settings for vectorization. The following results are gath-ered from the clustering batch method where the descriptions of both bug reports and commit messages from the same bug IDs are combined.

Clustering Batch Method: Com-bined bug report and commit

(34)

Clustering Batch Method: Bug Re-ports

Our experiments show that the highest overall result is obtained with stemming and 50 clusters for both clustering algorithms and batch methods which will there-fore be used in the following steps. In contrast, K-means clustering produces con-sistently lower accuracies compared to the agglomerative method. The reasoning behind this is further mentioned in 4.1.4, for now it is acknowledged, but the performance of both algorithms were still continuously investigated.

4.1.2 Vectorization

The second test aims to evaluate how the different clustering algorithms perform with different vectorization methods, where Distributed Bag of Words (DBoW) and Distributed Memory (DM) are analysed. Based on previous results in Sec-tion 4.1.1, stemming is used as preprocessing method in this part together with a default setting of 50 clusters. Both clustering algorithms are investigated in combination with the two different clustering batch methods.

(35)

Training Algorithm Accuracy

DBoW 72.55 %

DM 52.55 %

Table 16:

Clustering Method: Agglomerative Clustering Batch Method: Com-bined bug report and commit

DBoW 30.98 %

DM 6.86 %

Table 17:

Clustering Method: K-means

DBoW 30.59 %

DM 90.00 %

Table 18:

Clustering Method: Agglomerative Clustering Batch Method: Solely bug reports

DBoW 12.94 %

DM 17.06 %

Table 19:

Apart from training algorithms, the vectorization part also allows for change of the default settings where epochs = 10 and learningrate = 0.025 that previous tests have been performed with. The same settings as before are used, where the descriptions are preprocessed through stemming, together with a default setting of 50 clusters. Both clustering algorithms are investigated but the only clustering batch method used is a combination of bug reports and commit messages based on previous results. The training algorithm used in this case is Distributed Bag of Words for both Agglomerative clustering and K-means. This is despite the fact that the settings in 18.2 generates a somewhat better accuracy than 16.1, a decision based on a final parameter for distribution which is further discussed in Section 4.1.4.

Learning Rate Accuracy 0.25 91.47 % 0.0025 57.16 % Table 20:

Clustering Method: Agglomerative

Learning Rate Accuracy 0.25 13.92 % 0.0025 2.45 % Table 21:

(36)

Epochs Accuracy 50 88.72 % 200 73.23 % Table 22:

Clustering Method: Agglomerative

Epochs Accuracy 50 22.55 % 200 20.69 % Table 23:

Our experiments show that the highest result for this section is obtained with experiment 17.1 for K-means and 20.1 for agglomerative clustering.

4.1.3 Clustering

The last tests performed are linked to clustering and settings that can be changed in the two algorithms. The default settings for the second vectorization test (learn-ing rate = 0.025 and epochs = 10) will be used here, further on the same sett(learn-ings as before are used, where the descriptions are preprocessed through stemming, together with a default setting of 50 clusters. Both clustering algorithms are in-vestigated but the only clustering batch method used is a combination of bug reports and commit messages based on previous results. The training algorithm used in this case is Distributed Bag of Words. Linkage are changed and investi-gated for agglomerative clustering, while number of iterations are the parameter of target for K-means clustering. More information about these parameters are to find in Section 3.3.2 No of iterations Accuracy 70 34.12 % 50 30.78 % 30 32.16 % 10 29.02 % Table 24:

No of iterations Accuracy 70 16.86 % 50 16.47 % 30 16.86 % 10 16.47 % Table 25:

(37)

Linkage Accuracy Ward 30.39 % Complete 25.49 % Average 74.12 % Table 26:

Clustering Method: Agglomerative Clustering Batch Method: Com-bined bug report and commit

Linkage Accuracy Ward 11.48 % Complete 10.49 % Average 24.71 % Table 27:

Clustering Method: Agglomerative Clustering Batch Method: Solely bug reports

Our experiments show that number of iterations does not affect the results enough to let the algorithm run more iterations because of the longer execution time. Changing the linkage from average to ward gives a lower accuracy, although it leads to a better distribution which is further discussed in Section 4.1.4.

4.1.4 Distribution clustering

Distribution tests were performed to get an understanding for how the clustering algorithms perform. Table 28 describes the distribution for a selection of tests performed in Section 4.1.1 - 4.1.3, where maximal and minimal size of clusters are represented, together with median and mean value. Test number is based on table number and the parameters presented in the same from the top.

(38)

Test Nr Total Max Min Mean Median Accuracy 16.1 10631 9440 1 212.6 5 72.55 % 16.2 10631 10401 1 212.6 3 52.55 % 17.1 10631 312 128 212.6 206 30.98 % 17.2 10631 1248 11 212.6 91 6.86 % 18.1 8958 4449 1 179 20.5 30.59 % 18.2 8958 8511 1 179 5.5 90.00 % 19.1 8958 333 129 179 172.5 12.94 % 19.2 8958 483 97 179 168 17.06 % 20.1 10631 9324 1 212.6 10.5 91.47 % 20.2 10631 10517 1 212.6 1 57.16 % 21.1 8958 327 89 212.6 221 13.92 % 21.2 8958 1726 1 212.6 12.5 2.45 % 22.1 10631 9031 1 212.6 6 88.72 % 22.2 10631 5476 1 212.6 2 73.23 % 23.1 8958 366 154 212.6 210 22.55 % 23.2 8958 592 80 212.6 186 20.69 % 26.1 10631 415 80 212.6 198 30.39 % 26.2 10631 1227 1 212.6 146 25.49 % 27.1 8958 401 80 179 167 11.08 % 27.2 8958 422 41 179 171 10.49 % Table 28:

Description of the distribution from the performed tests in order to find the most suitable settings. Test nr is a reference to table nr and the subtest performed, Total is the total number of datapoints in a dataset, Max and Min describes the size of the biggest vs. smallest cluster, just as Median and Mean show how the distribution is over the clusters. As in the previous tables, Value show the percentage of correctly clustered data couples.

Experiments from previous sections shows that the accuracy in combination with distribution shown in Table 28 from test 17.1, 20.1 and 26.1 gives the overall best results. Figure 13 - 15 visualizes the distribution of documents from those tests. Each bar represents one of fifty clusters, where its heights tells the number of documents that are mapped to this certain cluster. The selection of these three tests are made based on a cross-evaluation between the score each test is given versus how the documents are distributed throughout the clusters. A more thorough explanation and justification of this choice is to be found in the discussion.

(39)

Figure 13: Distribution of test 17.1 where each cluster is showed as a bar, with the number of documents presented on the Y-axis.

(40)

4.2 Topic Labelling

This section presents the results from the topic labelling performed both to find the word that best represents a certain cluster, but also to find the most in-teresting and contributing files to that cluster. TF-IDF have been used to find these parameters. As a reference, it is also investigated how solely term frequency performs in comparison to TF-IDF.

4.2.1 Words

The company requested suggestions of possible ways to divide the current product into smaller subclasses. The first step to do so in this thesis was to use clustering. To be able to use this kind of division in the future, it is desired to label the data with a descriptive name which in this case is done by tobic labelling. Figure 16 - 18 showcase suggested topics for the three most optimal settings described in previous section. Figure 19 shows the results from an execution with the same parameters as test 17.1 but with the difference that only tokenization is performed as preprocessing. Figure 20 intends to show the difference between the most frequently occurring words in a cluster compared to using the highest TF-IDF score.

(41)

(42)

(43)

(44)

Figure 19: Suggested topics based on TF-IDF score from test 15.1, using only tokenization as preprocessing technique

(45)

Figure 20: A comparission between choosing the three most frequently occurring words in a cluster (left) vs. using its TF-IDF score (right)

(46)

4.2.2 Files

To make it possible for the user to conclude which files that are the most important to a cluster, TF-IDF and frequency have been used to produce three investigative datasets. The outcome of these tables first of all takes up a lot of space and are second of all classified as confidential from the company. An imaginary example of the three tables are hence given below.

File Cluster Frequency file_x1 0 65 file_x2 0 43 file_y3 0 20 file_y1 1 13 file_z1 1 11 file_y2 1 10 file_x1 1 7 file_x2 1 7 file_y3 1 2 file_z1 2 38 file_x1 2 15 file_x2 2 2

Table 29: An example of the first of the three tables produced where the frequency of a certain file in each cluster is presented

(47)

File Current Cluster Cluster Frequency file_x1 0 1 7 file_x1 0 2 15 file_x2 0 1 7 file_x2 0 2 2 file_y3 0 1 2 file_z1 1 2 38 file_x1 1 0 65 file_x1 1 2 15 file_x2 1 0 43 file_x2 1 2 2 file_y3 1 0 20 file_z1 2 1 11 file_x1 2 0 65 file_x1 2 1 7 file_x2 2 0 43 file_x2 2 1 7

Table 30: An example of the second of the three files where the occurrence of each file in all clusters are presented, together with all other clusters the same file exist in and its frequency

Clr no File1 TF-IDF Frq Order In # clr File2 TF-IDF Frq Order In # clr 0 file_1 0,311 24 1 3 file_2 0,208 17 2 4

1 file_3 0,318 4 1 11 file_4 0,308 4 2 12 2 file_5 0,489 23 3 3 file_6 0,319 13 5 1 3 file_7 0,287 31 3 3 file_8 0,286 36 2 6 4 file_9 0,086 27 2 6 file_10 0,060 14 12 1 Table 31: A comparison of the correlation between TF-IDF value and frequency

for the two files with highest TF-IDF value in the first five clusters

When analysing the suggested most important files according to their TF-IDF values, a correlation could be seen between a high TF-IDF value and a high frequency where the files with a high accuracy usually appeared in the top of the list over most frequently occurred files. As visualized in table 31, the first three columns with suggested most important files were never duplicated throughout the clusters, which could be the case in later columns. When duplicates were investigated for only frequency, the same files could occur as having the highest frequency in more than one cluster at the same time.

(48)

4.3 Connection of data

Connection of data is done on two different granularity levels, namely bug to bug or cluster to cluster as described in Section 3.5.3. The used dataset contains a total of 4917 bugs, where it was possible to link 1422 of those to possible causes of either product changes or bugs by using the workflow described in Section 3.4. 4.3.1 Bug to Bug/PC

There are 382 unique bugs in the mini RCA which have a known connection. By deriving the cause of those, as described in Section 3.4.2, it is possible to find the correct connection of 189 of these subsets. Assuming that the subset is representative for the entire dataset, there is then a 49.5% chance that the correct cause exists within the group of possible causes on this granularity level.

4.3.2 Cluster to Cluster

The same number of connected bugs in the mini RCA persists as for the evaluation on this granularity level, when a bug is assigned to a cluster and seen to be connected to bugs or product changes in other clusters as seen in Figure 21. It is possible to find the correct connections of 220 unique bugs from this subset. The number of correctly connected clusters is not as exact as for the bug to bug connections, since the clusters fluctuate somewhat for each run. With the same assumptions as before, this leaves a chance of around 57.6% that the bug is labelled with the correct cause. Furthermore, it is desired to find which clusters that mostly contributes to a certain bug. To get a fair representation of possible causes, a distribution of possible causes throughout the cluster is established. If a cluster holds => 10% of the total number of causes, it will be given as a suggested cluster. Finally, an example of possible causes for bugs belonging in cluster 40.

(49)

5 Discussion

This section aims to evaluate the reached results, part by part. Finally are some closing words about suggested improvements that could be worth to look into given.

5.1 Vectorization

Vectorization has in this project been a necessary sub step in order to make it possible to perform clustering as desired, and hence not evaluated per se. doc2vec was used as word embedding as described in Section 2.3.3, and its performance evaluated in relation to the two clustering algorithms. In Section 4.1.2 it is shown that parameters in this simple neural network can be tweaked to increase the ac-curacy of the clustering, as well as improving the distribution. An example of this is seen in Table 17 where the accuracy went up from 7% to 31% by changing the training algorithm from Distributed Memory to Distributed Bag of Words. The same example shows a more even ditribution as well when comparing test nr 17.1 with 17.2 in Table 28.

A simple neural network lays the basis for doc2vec, and as most other deep learning approaches it requires as much qualitative data as possible to be able to perform in an ultimate way. There are ways to simulate data to serve this purpose, by using for example data annealing. Another way to use and increase the amount of accessible data would have been to use the longer description of a bug or product change as described in Section 3.1.1. This description might have been more exhaustive, although way more work would have been required to retrieve the data, and the accuracy would probably have been less. The paragraph embedding method has been both easy to use and implement thanks to the pre-defined library in Gensim. Other document modelling algorithms such a Latent Diriclet Allocation and bag of n-gram representation have been used for the same purpose for many years, to capture the essential meaning of a document in a form that can be understood by a machine learning algorithm [10]. It may be interesting to review how such techniques perform, compared to the Paragraph Vector method doc2vec offers, in this context for the future.

5.2 Clustering

A grid search was performed to find the most suitable settings to use for cluster-ing of the given data. Various parameters were tested in combination with each other, measured with an accuracy to evaluate how well the clustering performed. Distribution of documents between the clusters turned out to be another impor-tant parameter to take into consideration apart from accuracy. Three experiments

Using Machine Learning to Learn from Bug Reports : Towards Improved Testing Efficiency

Contents

1

Introduction

1.1

Motivation

1.2

Problem Formulation

1.3

Delimitations

2

Theory

2.1

Background

2.2

Automatic bug labelling

2.3

Vector Representation and Normalization

2.4

Clustering Algorithms

3

Method

3.1

Tools and frameworks

3.2

Data Collection and Preprocessing

3.3

Clustering and Labelling

3.4

Connection of Data

3.5

Validation and Evaluation

4

Result

4.1

Clustering

4.2

Topic Labelling

4.3

Connection of data

5

Discussion

5.1

Vectorization

5.2

Clustering