
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Introducing automatic software fault localization in a continuous integration environment

JOHANNES WIRKKALA WESTLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY


Introducing automatic software fault localization in a continuous integration environment

JOHANNES WIRKKALA WESTLUND

MASTER’S IN COMPUTER SCIENCE

DATE: JANUARY 28, 2020

SUPERVISOR: MATHIAS EKSTEDT

EXAMINER: ROBERT LAGERSTRÖM

SWEDISH TITLE: INTRODUKTION AV AUTOMATISK MJUKVARUFELSÖKNING I EN MILJÖ MED KONTINUERLIG INTEGRATION


Abstract

In this thesis we investigate the usefulness of neural networks for inferring the relationship between co-changes of software modules and test verdicts, with the goal of localizing software faults. Data for this purpose was collected from a continuous integration (CI) environment at the telecommunications company Ericsson. The data consisted of test verdicts together with information about which software modules had been changed in a software product since the last successful test execution. Using this data, different types of neural network models were trained. Following training, the neural networks' fault localization and defect prediction capabilities were evaluated. For comparison, we also included two statistical approaches for fault localization, known as Tarantula and Ochiai, in the evaluation.

There have been studies similar to this thesis in the domain of software fault localization. However, the previous studies all work on the code level. The contribution of this thesis is to examine whether the models used in the previous studies perform well when given a different kind of input, namely co-change information about the software modules making up the software product.

One major obstacle in the thesis was that we only had data for the problem of software defect prediction. Because of this, we had to evaluate the performance of our software fault localization models on the problem of predicting defects.

The result was that all networks performed poorly when predicting defects. The models achieved an accuracy of around 50% and an AUC score of around 0.5. Interestingly, F-score values reached as high as 0.632 for some models. However, this is most likely a result of properties of the data set used rather than of the models learning a relationship between input and output.


Sammanfattning

In this work we investigate whether neural networks can learn the relationship between code module changes and test verdicts in order to support fault localization. Data for the study was collected from a continuous integration (CI) environment at the telecommunications company Ericsson. The data consisted of test execution results together with information about which code modules had been changed since the last successful test run of a software product. The data was used to train different types of neural networks. After training, the neural networks' ability to localize faults and predict defects was evaluated. For comparison, two statistical fault localization methods were included: Tarantula and Ochiai. There is similar research in the area of software fault localization; the difference is that previous work studies this problem at the code level. The contribution of this work is to examine whether similar results can be obtained when the models from previous studies are given different input, namely information about the code modules that the software consists of.

One obstacle in the work was that we only had data for the research problem of predicting software defects. We therefore had to evaluate our fault localization models on the problem of predicting software defects.

The end result was that the machine learning models generally performed poorly when trying to predict defects. All models achieve an accuracy of 50% or lower and an AUC score of around 0.5. Interestingly, some models can reach F-scores as high as 0.632, but this is most likely due to properties of the data set we used rather than the models having learned a relationship between input and output.

When we compare the models with Tarantula and Ochiai, we note that neural networks can be considered too complex for our situation. Overall, the results indicate that neural


Acknowledgement

I would like to thank my supervisor at KTH Royal Institute of Technology, Mathias Ekstedt, for his feedback and insightful comments during the development of the thesis. I would also like to thank Ericsson AB for their generosity in allowing me to conduct my thesis work at their office in Kista, Stockholm, Sweden.

Special thanks to Tomas Borg, Ziver Koc and Conny Wickström for their supervision of my thesis work at Ericsson. You provided me with helpful feedback and insight during the execution of the thesis work.


Table of Contents

1 Introduction ... 1

1.1 Thesis problem and objectives ... 1

1.2 Data restrictions in the environment ... 1

1.3 Contributions to the research field ... 3

1.4 Choice of training data ... 3

1.5 Thesis’s approach compared to previous studies ... 4

1.6 Research questions ... 4

1.7 Thesis’s restrictions ... 5

1.8 Thesis structure ... 5

2 Background... 7

2.1 Continuous integration ... 7

2.2 Continuous integration at large companies ... 7

2.3 Machine learning ... 10

2.3.1 Deep neural networks ... 10

2.3.2 Activation functions ... 11

2.3.3 Theory on training and testing neural networks ... 11

2.3.4 Convolutional neural networks ... 13

2.3.5 Recurrent neural networks ... 14

2.4 Software defect prediction ... 15

2.4.1 Accuracy ... 15

2.4.2 F-score ... 16

2.4.3 Area under the receiver operator characteristic curve ... 16

2.5 Software fault localization ... 16

2.5.1 Spectrum-based fault localization ... 16

2.5.2 Fault localization based on neural networks ... 17

3 Related work ... 18

3.1 Software defect prediction ... 18

3.2 Software fault localization ... 18

3.3 Previous work on co-change history ... 19

4 Method ... 20

4.1 Data collection... 20

4.1.1 Issues with the data gathering ... 21

4.2 Data preprocessing... 21

4.3 Training the machine learning models ... 23


4.3.2 The deep neural network ... 24

4.3.3 The convolutional neural network ... 25

4.3.4 The recurrent neural network ... 26

4.3.5 Parameters used during training ... 27

4.4 Evaluating the models on the software defect prediction problem ... 27

4.5 Comparing neural network models against spectrum-based approaches ... 27

4.6 Hardware and software configurations ... 28

5 Results ... 29

5.1 Resulting data ... 29

5.2 Resulting performance ... 29

5.2.1 Deep neural network performance ... 29

5.2.2 Convolutional network performance ... 31

5.2.3 LSTM network performance... 32

5.3 Comparison with Spectrum-based approaches ... 32

6 Discussion ... 36

6.1 Discussion of software defect prediction results ... 36

6.2 Co-change as an input metric... 37

6.3 Discussion of software fault localization results ... 38

6.4 Fault localization of neural networks ... 38

6.5 Social, ethical and economic impacts of the research ... 38

6.6 Relevance of results ... 39

6.7 Validity ... 39

6.7.1 Internal validity ... 39

6.7.2 External validity ... 40

6.8 Conclusion ... 40

6.9 Future work ... 40

7 Bibliography ... 41

8 Appendix ... 44

A Deep neural network results ... 44

B Convolutional network results ... 45


List of figures

FIGURE 1 THE STRUCTURE OF A SOFTWARE PRODUCT. ... 2

FIGURE 2 FLOW INSIDE THE CI ENGINE AT ERICSSON FOR HOW NEW SOFTWARE MODULES ARE INTEGRATED AND TESTED IN THE SOFTWARE PRODUCT. ... 2

FIGURE 3 ILLUSTRATION OF HOW WE GO FROM VERSION HISTORY IN THE CI ENVIRONMENT TO THE DATA FED TO THE MACHINE LEARNING MODELS. ... 3

FIGURE 4 A GENERAL DEPICTION OF A CI PIPELINE CONSISTING OF ONE OR MORE DEVELOPERS, A CODE REPOSITORY, CI SERVER AND A PUBLIC REPOSITORY. ... 8

FIGURE 5 A NETWORK OF CI PIPELINES, ILLUSTRATING THE SITUATION AT ERICSSON AND HOW AN ERROR CAN PROPAGATE THROUGH THE SYSTEM. RED BOXES INDICATE WHAT IS DEFECTIVE IN THE NETWORK. ... 9

FIGURE 6 A TRADITIONAL ARTIFICIAL/DEEP NEURAL NETWORK. ... 11

FIGURE 7 A 1-D CONVOLUTIONAL FILTER THAT AGGREGATES TWO INPUT FEATURES INTO A SINGLE FEATURE. ... 14

FIGURE 8 AN OVERVIEW OF AN LSTM MEMORY CELL WITH A RECURRENT NEURON (BOX LABELED INTERNAL) AND ITS RESPECTIVE GATES. THE MEMORY CELL IS USED AS A LAYER IN A RECURRENT NEURAL NETWORK. ... 15

FIGURE 9 AN OVERVIEW OF THE DATA GATHERING PROCESS. ... 20

FIGURE 10 AN EXAMPLE OF HOW THE CHANGED MODULE LISTS WERE PREPROCESSED. NOTE THAT ALL SOFTWARE VERSIONS #1-3 ARE GIVEN TOGETHER AS INPUT TO THE PYTHON SCRIPT. ... 22

FIGURE 11 AN ILLUSTRATION OF THE FLOW BETWEEN THE TEST EXECUTION API AND THE PREPROCESSING SCRIPT. NOTE THAT FOR A SINGLE SOFTWARE PRODUCT VERSION THERE MIGHT BE MULTIPLE ENTRIES OF TEST VERDICTS IN THE API WHICH NEED TO BE MERGED. ... 23

FIGURE 12 AN ILLUSTRATION OF THE STRUCTURE OF OUR DEEP NEURAL NETWORKS. NOTE THAT THE NUMBER OF NEURONS IN THE HIDDEN LAYERS VARIES AS WELL AS THE NUMBER OF HIDDEN LAYERS. ... 25

FIGURE 13 AN ILLUSTRATION OF THE CONVOLUTIONAL NEURAL NETWORK THAT WE USED. ... 26

FIGURE 14 A GENERAL ILLUSTRATION OF THE RECURRENT NEURAL NETWORK WE BUILT FOR THIS STUDY. ... 27

FIGURE 15 PERFORMANCE OF DEEP NEURAL NETWORKS WHERE EACH HIDDEN LAYER CONSISTS OF 4 NEURONS. ... 30

FIGURE 16 PERFORMANCE OF DEEP NEURAL NETWORKS WHERE EACH HIDDEN LAYER CONSISTS OF 20 NEURONS. ... 30

FIGURE 17 PERFORMANCE OF DEEP NEURAL NETWORKS WHERE EACH HIDDEN LAYER CONSISTS OF 1024 NEURONS. ... 31

FIGURE 18 PERFORMANCE OF CNN FOR 10 TRAINING SESSIONS. ... 31

FIGURE 19 PERFORMANCE OF LSTM NETWORK FOR 10 TRAINING SESSIONS. ... 32

FIGURE 20 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY TARANTULA. ... 33

FIGURE 21 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY OCHIAI. ... 33

FIGURE 22 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY OUR CNN. ... 34

FIGURE 23 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY OUR LSTM MODEL. ... 34

FIGURE 24 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY OUR DEEP NEURAL NETWORK WITH 1 HIDDEN LAYER AND 4 NEURONS IN EACH HIDDEN LAYER. ... 35

FIGURE 25 SUSPICIOUSNESS DISTRIBUTION OF THE 69 SOFTWARE MODULES AS CALCULATED BY OUR DEEP NEURAL NETWORK WITH 5 HIDDEN LAYERS AND 20 NEURONS IN EACH HIDDEN LAYER. ... 35


1 Introduction

Continuous integration (CI) of software changes is a widely used development practice in today's software industry. As agile software development has become the norm in industry, continuous integration has been adopted to meet the demand for faster turnaround of software releases [1] [2].

Generally, continuous integration is defined as the development practice of automatically building and verifying software whenever a code change (also known as a code commit) to the software is made. Although there is debate about continuous integration's exact meaning [3], we have chosen this definition based on previous research [4] [5] [6] [7].

The goal of CI is to allow developers to integrate changes to the software often and thus improve the development pace of the software project. However, various issues can arise when adopting CI, such as long response times before developers get feedback on integrated code, difficulty keeping track of the integrations and insufficient tools to cope with the complexity of CI [8]. Because of the severity of these issues, a lot of research [4] [5] [6] [7] [9] [10] has gone into solving these problems. However, we would like to argue that there is one problem that has not been explored enough and that is potentially a big bottleneck when developing software using CI: once the CI machinery has found a test that fails, how do you determine what caused that failure?

The aim of this thesis is to investigate how supervised machine learning can be applied to solve this traceability issue in a continuous integration environment. Stated differently, can machine learning be useful to automate the localization of faults based on information in the CI environment?

1.1 Thesis problem and objectives

The problem we are trying to solve in this thesis is the following.

Can we automate the localization of software faults in the telecommunications company Ericsson's CI environment?

The proposed approach to solving this problem is to train various types of neural networks on data gathered from Ericsson's continuous integration environment. We use various deep neural networks, a convolutional neural network and a Long Short-Term Memory network to search for the best type of neural network for the situation. The end goal is to find a model that can output the location of faults in a software product that Ericsson is developing.

1.2 Data restrictions in the environment


this version. If a test fails, we mark the new software modules that we swapped in as being defective and revert to using the older version of the software product for further development, see Figure 2. Because of this setup in the CI environment, the data that we can extract is a timeline of the software product's development from older to newer versions and what software modules have been changed between these versions. We can also extract the test verdicts of the tests that have been executed on each version of the software product. However, we cannot extract information such as what code instructions have been inserted or deleted with each new version of the software modules.

Figure 1 The structure of a software product.


1.3 Contributions to the research field

There have been previous studies in which neural networks were applied to localize software faults [11] [12] [13]. This supports our idea that machine learning is applicable to our situation and specific problem. What differentiates this study from the previous ones is the type of data that is fed to the machine learning models. Thus, our main contribution to the research area of fault localization is to investigate the usefulness of previously proposed neural network models on a new type of data. The goal of this study is the same as in the previous studies. However, the data used to get there is vastly different. Because of this, our neural networks will not be the same as in the previous studies. Nevertheless, we try to make our networks as similar to those of the previous studies as possible. The theories and ideas of this study remain largely the same as in the previous works, and we introduce them in more depth in chapter 2.

1.4 Choice of training data

The type of data we decided to train our neural networks on stems from the field of software defect prediction and is referred to as co-change history, change coupling history or evolutionary coupling in the literature [14] [15] [16]. Formally stated, co-change is the relationship between a software product's components that are frequently changed at the same time during the development of the product [14].

In our scenario the co-change history should be defined as follows. For a version of the software product, we can obtain from the CI environment which software modules have been swapped since the previous version. This is the co-change for that version. The co-change history is then a list of the co-changes for each software version in the development history of the software product. This together with the test verdicts for the different versions of the software product is used to train the neural networks, see Figure 3.

This was the most similar measurement to the test coverage information (explained in section 2.5.2) used in the previous studies on software fault localization that we could extract given the restrictions in the CI environment.


1.5 Thesis’s approach compared to previous studies

A major obstacle for this thesis was how we would evaluate the neural network models after training. In previous studies in software fault localization it is standard to evaluate the performance of the model by giving it data where the exact locations of the faults are known. Thus, you can evaluate the performance of your fault localization by checking that the model outputs the correct locations for the faults. We do not have this information in the CI environment. We can train the neural networks on the data described in the previous section. However, once trained we cannot evaluate the models’ performance on localizing faults because we do not know where the actual faults are located.

This forced us to consider another problem, very similar to software fault localization, known as software defect prediction¹. Intuitively, if you can solve one of them you can also solve the other. In other words, if you can predict whether software is defective with absolute certainty, you can localize the faults responsible for the defect by simply testing each change introduced to the software since the last time it was stable. Similarly, if you can localize all faults with absolute certainty, then the software is considered non-defective if you cannot localize any faults. Otherwise, it is defective.

Based on this, we decided to evaluate our fault localization models on the software defect prediction problem when comparing the models against each other. For this we only need data on which software product versions are faulty, data which we already use during the training of the models. Again intuitively, the model that is best at predicting defectiveness should be the best model at localizing faults in the software.

This means that the thesis is also investigating the suitability of co-change as an input feature for software defect prediction, something that has been studied in [14] which reported a correlation between change coupling and software defects.

Importantly, previous studies in software fault localization that utilize neural networks make a point of comparing their neural network approaches to other, non-machine learning methods for fault localization. These are called spectrum-based methods, and because of their popularity in the literature we decided to make a similar comparison between these methods and our neural networks, as in the previous studies.

1.6 Research questions

Based on the previous sections, we can summarize and itemize the questions we are trying to answer into two research questions.

RQ1: Is co-change history of software modules on its own useful as input for predicting defects?

RQ2: Can neural networks utilizing co-change history be used for fault localization?

¹ To the best of our knowledge, the terms "fault" and "defect" refer to the same thing. However, depending on


Note that the first research question stems from the fact that we must evaluate our models on the software defect prediction problem, as discussed in the previous section. Thus, it is not really related to the problem presented in section 1.1 but is a byproduct of our forced approach. Instead, the second research question is the primary focus of this thesis.

1.7 Thesis’s restrictions

Due to the restrictions of the CI environment used at Ericsson, as well as other issues such as time, the thesis has some constraints. Some of them have already been discussed in the previous sections of this chapter. However, we summarize them all below, followed by a more thorough motivation of each one.

• Localization of defects is on a software module-level granularity.

• Only use data regarding the failure or passing of tests for different versions of the software product together with what software modules have been updated.

• Test three different types of neural network: a recurrent network, a convolutional network and deep networks.

• Test two different spectrum-based methods: Tarantula and Ochiai.

Due to the nature of how software is built at Ericsson, and the constraints of their CI environment, there is no interest at Ericsson for finer granularity than localizing which software modules contain the faults in a software product.

A software product can fail in the CI environment for many different reasons, for example a bad environment configuration. Such failures do not always have anything to do with the software product itself and make predicting which software module is causing the failure meaningless. This is the reason for constraining the data to whether tests fail or pass: if the verification of the software product fails for other reasons, it has nothing to do with the product itself. The type of tests we are gathering are integration tests that check that the new software modules work together as intended in the software product.

Due to time constraints, we could only test three different types of neural networks. It is not clear whether a different type of neural network would perform better than the ones we decided to try. Our choice of networks was based on what types of networks had been used in previous studies on software fault localization and software defect prediction.

In order to make a qualitative analysis of our machine learning models' performance on localizing faults, it seemed important to compare them to other, non-machine learning approaches within the research field of software fault localization. This has been done in previous work where machine learning models have been applied, such as [12]. There are, of course, many different non-machine learning approaches that one could try. However, due to time constraints we limit ourselves to two spectrum-based approaches for comparison.

1.8 Thesis structure


already done in the domains of software defect prediction and software fault localization. Observe that it is from the previous research that we have obtained much of our theoretical knowledge, and thus there might seem to be some overlap between chapters 2 and 3. A rule of thumb is that in chapter 2 we present the theoretical knowledge acquired and used by the previous studies, whereas in chapter 3 we focus on the experiments conducted in the studies and their results.

Following Related work we have chapter 4, Method, where we present the machine learning models that we build and give details about the experiments conducted on these models. We also include details on the data gathering and data preprocessing. Our intention was to be as detailed as possible to allow for reproducibility of our experiments as well as to motivate our choice of approach.

Chapter 5 Results then presents our findings from the experiments. Finally, we have chapter 6 Discussion, where we discuss the results from chapter 5 in a more general context and compare them


2 Background

This chapter gives an overview of the theoretical knowledge within the field of continuous integration as well as machine learning within the domains of software defect prediction and software fault localization respectively. First, the general concept of continuous integration is presented. Second, theory on the three different types of neural networks we have decided to use is described. Following that we present the research field known as software defect prediction and more specifically describe the evaluation metrics used within the research field to evaluate neural networks. Last, we present the research field known as software fault localization.

2.1 Continuous integration

Continuous integration (CI) is the development practice of integrating software changes continuously, thus producing rapidly evolving software that can be released at any given moment [3] [4] [5] [6] [7].

CI is usually realized by having a dedicated server (known as a CI server) fetch the code when a developer has submitted their changes to the code base. It then compiles the code into a software product and runs a set of tests on the product to ensure that it behaves as intended. This stepwise process (get code, compile, run tests etc.) is usually referred to as a CI pipeline. Should the building of the software product or one of the tests fail, the server can send that information back to the developer and revert the changes. In the case that all tests pass, the newly compiled software is considered stable, new code changes can be added, and the process is repeated. This allows the CI system to always have a version of the software product that works, as seen in Figure 4.

The reason continuous integration is so popular in the software industry is that it allows for fast development of working software. This is in comparison with the traditional approach to software development, where each developer makes a lot of changes locally and the team then spends a few months combining everyone's changes and ensuring that it all works together. With CI you have one version of the software product that everyone is working on, thus minimizing the overhead of merging people's code together.

2.2 Continuous integration at large companies

The setup depicted in Figure 4 works well in small projects, but problems begin to appear as the project scales up. To illustrate one of these issues we present an imaginary scenario from Ståhl and Mårtensson's book Continuous Practices [3], page 130:

"Assume a large software system which takes ten minutes to fully compile, package and test any source code change to the utmost confidence (needless to say ten minutes is rather an optimistic figure, but stay with us for the sake of argument). This means that during a typical working day of about ten hours, 60 new release candidates can be sequentially created and evaluated."


One way to solve this issue is to introduce batching, where developers are allowed to commit their code without integrating it into the product for each commit. Then, at specific times, the CI machinery gets the latest commit and starts the integration, effectively testing many commits (i.e. a batch) at the same time [3]. This partially solves the congestion problem of the CI machinery but introduces a traceability issue: when an integration now fails, it is not clear which commit in the batch introduced the fault.

This might still not seem like too big of an issue, depending on the size of a batch. However, at a large company like Ericsson the pipeline of a software product will inherently have a large batch. Say, for example, that the software product produced by a CI pipeline is the input to another pipeline, merging software products together to build a bigger software product, which again is fed into yet another pipeline, and so on. It becomes apparent that the batch size grows for each CI pipeline and that at an early stage we are looking at batches consisting of millions of lines of code changes. This illustrates the importance of being able to localize a fault; the situation can be seen in Figure 5.


2.3 Machine learning

As the problem is to find the relationship between the co-change history of modules and test verdicts, a natural conclusion was that machine learning could potentially solve it. Machine learning is a field within computer science interested in the problem of making computers learn from experience [17]. Of interest to our study are the artificial neural network models known as deep neural networks, convolutional neural networks and recurrent neural networks. All of these are widely popular machine learning models used for solving many different types of research problems [11] [12] [13] [18] [19] [20].

2.3.1 Deep neural networks

Artificial neural networks are simple imitations of the processes in biological brains [21]. The networks consist of a set of artificial neurons, used to simulate single biological neurons, connected in such a way as to form layers of neurons, as seen in Figure 6. Each neuron in the network works like a computational unit. First it calculates a weighted sum of the input values (e.g. numbers) from the previous layer's neurons in the network. The neuron then adds a bias (a fixed value) to the sum and optionally applies a non-linear function to the result to determine what value it should output to the neurons in the next layer of the network [21]. There are various non-linear functions that you can use, but for this thesis it is enough to know about the ReLU (rectified linear unit) activation function and the sigmoid activation function, which are explained in section 2.3.2.


Figure 6 A traditional artificial/deep neural network.
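To make the computation concrete, the following minimal NumPy sketch (our illustration, not code from the thesis) shows one densely connected layer computing a weighted sum plus bias and passing the result through an activation function:

```python
# Illustrative sketch: one dense layer computed with NumPy.
import numpy as np

def dense_layer(x, W, b, activation=None):
    """Weighted sum of the inputs plus a bias, optionally followed by a non-linearity."""
    z = W @ x + b                        # one weighted sum per neuron in the layer
    return activation(z) if activation else z

rng = np.random.default_rng(0)
x = rng.random(3)                        # outputs of the previous layer (3 neurons)
W = rng.random((4, 3))                   # weights of a layer with 4 neurons
b = rng.random(4)                        # one bias per neuron
relu = lambda z: np.maximum(0.0, z)
print(dense_layer(x, W, b, relu))        # values passed on to the next layer
```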

2.3.2 Activation functions

In this section we present the rectified linear unit (ReLU) activation function and the sigmoid activation function which are used after certain layers in our neural networks.

The ReLU activation function is a non-linear function mapping all values bigger than or equal to 0 to themselves and all negative values to 0. This ensures that the output is a non-negative number. It can be expressed as shown in Equation 1, which was obtained from [12].

ReLU(x) = \max(0, x)    (1)

The sigmoid activation function transforms the input value into a value in the interval [0, 1]. The closer the input is to positive infinity, the closer the output is to 1, and the closer the input is to negative infinity, the closer the output is to 0. The formula used for the transformation can be seen in Equation 2, which was obtained from [12].

sigmoid(x) = \frac{1}{1 + e^{-x}}    (2)
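As a small illustration, the two activation functions translate directly into code; the NumPy sketch below is ours, not the thesis's implementation:

```python
# Direct NumPy translations of Equations 1 and 2.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # ReLU(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # sigmoid(x) = 1 / (1 + e^(-x))

print(relu(np.array([-2.0, 0.5])))       # [0.  0.5]
print(sigmoid(np.array([-2.0, 0.5])))    # values in (0, 1)
```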

2.3.3 Theory on training and testing neural networks


algorithm [6]. Overall, the process of learning the behavior of f using samples of (X, y) pairs is referred to as training the network. Understand that the fixed values in the network that are updated during training are randomly initialized at the beginning. This can potentially lead to variations in the final network's performance between training sessions.

After a network has been trained it is important to test whether it behaves as expected, and thus it is common to have a testing phase after training. During the testing phase you supply the network with a new set of (X, y) pairs that have not been used during training (referred to as the test set) and compare the output f(X) of the network with the desired output y in order to determine whether the relationship was successfully learnt or not. More details on how to make this comparison can be found in section 2.4.

2.3.3.1 Batching and epochs

It is usually too computationally difficult to train the network on all (X, y) pairs at the same time. Because of this you have to split the samples into smaller groups (called batches) and train your network on one group at a time. The size of a group is referred to as the batch size. Once you have gone through all the batches, in other words used all samples of (X, y), you have completed what is known as one epoch of training.
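The following toy loop (our illustration, with random placeholder data) shows the relationship between batches and epochs:

```python
# Illustrative sketch: iterating over a data set in batches for several epochs.
import numpy as np

X = np.random.rand(100, 8)               # 100 samples, 8 features each
y = np.random.randint(0, 2, size=100)
batch_size = 32
epochs = 3

for epoch in range(epochs):                      # one epoch = one pass over all samples
    for start in range(0, len(X), batch_size):   # one parameter update per batch
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        # model.train_on_batch(X_batch, y_batch)  # placeholder for the actual update
```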

2.3.3.2 Optimization algorithm

During the updating phase, where parameters of the network are changed using the cost function and the backpropagation algorithm, it is important to determine how much to change the network. This is determined by using what is called an optimization algorithm. There are many kinds of optimization algorithms, but for this thesis it is enough to know that the Adam optimizer [22] is frequently used within the field of software defect prediction [6] [16] [20].

2.3.3.3 The issue of overfitting a network

One major issue that can happen during the training of a neural network is called overfitting. What this means is that the network finds relations between input and output in your training data that do not actually exist in the real relationship you are trying to learn [17]. As an example, if you were to train a network to recognize cars in images and only used pictures of yellow cars in your training data the network could potentially learn that all cars must be yellow. Something that we know is not generally true for cars.

The way to mitigate this issue is to use a method called early stopping, in which we take a small part of the training set and remove it from the training. This set is usually called the validation set. The idea with early stopping is that during training of the network we check how it performs on the validation set at certain time steps and record this performance. If we see an increase in
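A framework-agnostic sketch of the early stopping idea, with a hypothetical `patience` parameter controlling how many checks without improvement are tolerated, could look as follows:

```python
# Illustrative sketch of early stopping: stop when the validation loss has not
# improved for `patience` consecutive checks. `train_step` and `validate` are
# placeholder callables supplied by the caller.
def train_with_early_stopping(train_step, validate, max_epochs=100, patience=3):
    best_loss, checks_without_improvement = float("inf"), 0
    for _ in range(max_epochs):
        train_step()                      # one epoch of training on the training set
        val_loss = validate()             # loss on the held-out validation set
        if val_loss < best_loss:
            best_loss, checks_without_improvement = val_loss, 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break                     # validation loss keeps increasing: stop
    return best_loss
```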


2.3.4 Convolutional neural networks

The network depicted in Figure 6 is that of a traditional deep neural network. There are many different variants to this traditional model where one of the more popular versions is the convolutional neural network (CNN). The difference between convolutional networks and deep neural networks is that convolutional ones allow for two special types of neural layers to be used in the network together with the ordinary neural layers we find in a deep neural network. These special kinds of layers are called convolutional layers and pooling layers [17].

Convolutional layers are used to perform the mathematical convolution operation on the input. In a sense, a convolutional layer aggregates the input by merging input features together using what are called filters. A filter is simply a set of weights used in the weighted sum performed when aggregating the input, and the number of weights in a filter is referred to as the filter's size. You can imagine sliding over the input data, calculating a weighted sum over the data points as you slide over them [6], see Figure 7.

Note that Figure 7 displays the use of one filter. Usually you apply many filters in a convolutional layer, and the outputs of the applied filters are stacked together into the final output of the layer. It is also common to apply an activation function to the output of the convolutional layer. When applying a filter, you must also specify the stride, which is a value defining how the "window" should slide over the data, as depicted in Figure 7. To illustrate, with a stride of 1 in Figure 7 we apply the filter to input features 1 and 2, then move down one entry and apply the filter to input features 2 and 3, and so on. With a stride of 2, we instead apply the filter to input features 1 and 2, then move down two entries and apply the filter to input features 3 and 4, thus skipping certain combinations of input features that are included with a stride of 1.
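A minimal NumPy sketch of a single 1-D filter slid over the input with a configurable stride, in the spirit of Figure 7, is shown below (our illustration, not code from the thesis):

```python
# Illustrative sketch: a 1-D convolutional filter of size 2 applied with a stride.
import numpy as np

def conv1d(inputs, filter_weights, stride=1):
    size = len(filter_weights)
    return np.array([
        np.dot(inputs[i:i + size], filter_weights)      # weighted sum under the window
        for i in range(0, len(inputs) - size + 1, stride)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, -1.0])        # a filter of size 2
print(conv1d(x, w, stride=1))    # 4 output features
print(conv1d(x, w, stride=2))    # 2 output features: some windows are skipped
```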

In addition to using convolutional layers it is common to use pooling layers directly after a


Figure 7 A 1-D convolutional filter that aggregate two input features into a single feature.

2.3.5 Recurrent neural networks

Another popular neural network model is the recurrent neural network (RNN). The difference between this kind of network and the deep neural network is that recurrent networks are allowed to have an internal state. Essentially, a recurrent neural network has at least one layer that keeps an internal state; see Figure 8 for an example of such a layer. This makes the network well suited to sequence classification problems, where it can use its state, which depends on what has previously been fed to the network, for the current classification [17].
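As a small illustration, and assuming a TensorFlow/Keras-style API (the thesis does not state which library was used), an LSTM layer processes a whole sequence while carrying its internal state from step to step:

```python
# Minimal sketch of a recurrent (LSTM) layer classifying a sequence.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),                 # sequences of 8-feature steps, any length
    tf.keras.layers.LSTM(16),                        # internal state carried across the sequence
    tf.keras.layers.Dense(1, activation="sigmoid"),  # one verdict for the whole sequence
])

sequence = np.random.rand(1, 5, 8).astype("float32") # a batch of one 5-step sequence
print(model(sequence).numpy())                       # a single value in [0, 1]
```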


Figure 8 An overview of a LSTM memory cell with a recurrent neuron (box labeled internal) and its respective gates. The memory cell is used as a layer in a recurrent neural network.

2.4 Software defect prediction

Software defect prediction is a research discipline where machine learning or statistical approaches are used to predict whether a piece of software is defective or not. As input the predictor takes a set of software features, for example the number of lines of code or which developers have changed the software [6] [16], and based on the input it predicts whether the software is defective or not. The research area has been quite active during recent years, indicating that the problem of defect prediction has not yet been solved in a generally satisfactory way. This idea is further supported by meta-analyses of the field [24], which have concluded that the reliability of defect prediction approaches is questionable, because few replication studies are performed in the field [24]. Some approaches within the field that appear to yield interesting predictions use either a convolutional neural network [6] [20] or a recurrent neural network [16] to predict whether software is defective, using historical data about the software development and test verdict results as training data for the networks. The general trend within the research on these approaches seems to be focused on finding more input features (or information about the software) to feed into the neural network models, as this tends to increase performance given enough data.

The field has numerous measurements to evaluate neural network models, some of which we list in the following subsections as we will use them to evaluate our own neural networks.

2.4.1 Accuracy


accuracy = \frac{\text{number of instances classified correctly}}{\text{total number of instances}}    (3)

The range of accuracy is [0, 1], where an accuracy closer to 1 indicates a better neural network model [6].

2.4.2 F-score

A popular metric to use when doing binary predictions such as in software defect prediction is the F-score. It is the harmonic mean of precision and recall where precision in software defect prediction is the number of instances correctly classified as defective out of all instances classified as defective and recall is the number of instances correctly classified as defective out of all instances that are truly defective. The range of F-score is [0, 1], where a value closer to 1 is more desirable as it indicates a better performing neural network [6] [25].

Let us define the notation N_dd for the number of defective instances classified as defective, N_nd for the number of non-defective instances classified as defective and N_dn for the number of defective instances classified as non-defective. Then we can express precision, recall and F-score using Equation 4, Equation 5 and Equation 6.

precision = \frac{N_{dd}}{N_{dd} + N_{nd}}    (4)

recall = \frac{N_{dd}}{N_{dd} + N_{dn}}    (5)

F_{score} = \frac{2 \times precision \times recall}{precision + recall}    (6)
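These definitions translate directly into code. The sketch below is our illustration of Equations 3-6, where N_nn (non-defective instances classified as non-defective) is added to complete the accuracy computation:

```python
# Direct translations of Equations 3-6 using the N_dd / N_nd / N_dn notation above,
# plus n_nn for non-defective instances classified as non-defective.
def accuracy(n_dd, n_nd, n_dn, n_nn):
    return (n_dd + n_nn) / (n_dd + n_nd + n_dn + n_nn)

def f_score(n_dd, n_nd, n_dn):
    precision = n_dd / (n_dd + n_nd)
    recall = n_dd / (n_dd + n_dn)
    return 2 * precision * recall / (precision + recall)

print(accuracy(n_dd=40, n_nd=10, n_dn=15, n_nn=35))  # 0.75
print(f_score(n_dd=40, n_nd=10, n_dn=15))            # ≈ 0.762
```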

2.4.3 Area under the receiver operator characteristic curve

The receiver operator curve has been used to investigate the trade-off between the hit rate and false alarm rate in the signal processing domain [26]. These days it is also used to evaluate machine learning algorithms [6] and can be interpreted in our setting as representing the probability that a randomly chosen defective instance of the software is more likely to be considered defective than a randomly chosen non-defective instance of the software [26].

The area under the receiver operator curve (AUC) is in the range [0, 1] where a value of 0.5 means that the neural network performs as if randomly guessing whether the software is defective or not. A value higher than 0.5 means the classifier performs better than random and a value lower than 0.5 means that the classifier performs worse than random [6].
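As an illustration, and assuming scikit-learn is available (the thesis does not name the library it used for this), the AUC score can be computed from the true labels and the network outputs:

```python
# Computing the AUC score from labels and predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                 # 1 = defective, 0 = non-defective
y_pred = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]    # network outputs in [0, 1]
print(roc_auc_score(y_true, y_pred))        # 0.5 would mean random guessing
```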

2.5 Software fault localization

Software fault localization is the research area of determining which program elements need to be fixed for a failing test case to stop failing [27]. The definition of what a program element is depends on the scope and can range from single instructions (fine granularity) to packages (coarse granularity). It is widely recognized that determining the location of a fault in the code is one of the more time consuming and demanding tasks of software development [27].

2.5.1 Spectrum-based fault localization


defect based on test case coverage information [27]. To summarize the approach, each program element has a tuple of four values associated with it: (e_f, e_p, n_f, n_p). e_f is the number of times the program element is executed and a test case fails. e_p is the number of times the program element is executed and a test case passes. Similarly, n_f is the number of times the program element is not executed and a test case fails, and n_p is the number of times the element is not executed and a test case passes [28].

Based on these values you can calculate a suspiciousness value for each program element. In this thesis work we have decided to use two different suspiciousness calculations: Tarantula and Ochiai [29] [28]. The formulas for Tarantula and Ochiai can be seen in Equation 7 and Equation 8 and were obtained from [28].

Tarantula = \frac{e_f / (e_f + n_f)}{e_f / (e_f + n_f) + e_p / (e_p + n_p)}    (7)

Ochiai = \frac{e_f}{\sqrt{(e_f + n_f)(e_f + e_p)}}    (8)
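The two suspiciousness formulas translate directly into code; the sketch below is our illustration of Equations 7 and 8 and does not handle the zero-division corner cases:

```python
# Direct translations of Equations 7 and 8 for one program element.
import math

def tarantula(ef, ep, nf, np_):
    fail_ratio = ef / (ef + nf)
    pass_ratio = ep / (ep + np_)
    return fail_ratio / (fail_ratio + pass_ratio)

def ochiai(ef, ep, nf):
    return ef / math.sqrt((ef + nf) * (ef + ep))

# An element involved in 3 of 4 failing executions and 2 of 10 passing executions:
print(tarantula(ef=3, ep=2, nf=1, np_=8))   # ≈ 0.789
print(ochiai(ef=3, ep=2, nf=1))             # ≈ 0.671
```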

2.5.2 Fault localization based on neural networks

In software fault localization, the goal of the machine learning algorithms is to learn to deduce the location of a fault based on input data about the software and test verdicts. To our understanding this approach is not as popular as the existing spectrum-based methods [28], although new results within the field suggest that machine learning approaches are generally better than the spectrum-based techniques at finding faults [12].

In general, the structure of the machine learning approach is as follows. The input to the neural network is test coverage information for code instructions. The idea is that you map the code to an input vector containing 0s and 1s, where each element in the vector represents whether a code instruction was executed during a test or not. From this you get a code coverage vector that you feed into your neural network, which returns as output a decimal value in the range [0, 1], where 0 means that the test passes and 1 means that the test fails. Given pairs of input vectors and output values representing executions of different tests on the program, the neural network can then be trained [11] [12] [13].


3 Related work

In this chapter we present previous research within the fields of software defect prediction and software fault localization. It is from these studies that we have obtained most of the theoretical knowledge presented in chapter 2. In addition, we review the work done on co-change history as a feature for describing defects in software.

3.1 Software defect prediction

One of the more recent studies in software defect prediction is the work of Wen et al. [16]. They use the recurrent neural network model Long Short-Term Memory (LSTM) and feed in sequences of changes to a software product for the model to learn to predict defects. Their result is that the approach achieves an average F-score of 0.657 and an average area under the receiver operator curve (AUC) score of 0.892 over 10 different software projects. This is much better than the more traditional approaches to defect prediction. Furthermore, they conclude that the approach is better than the state-of-the-art technique of using deep learning to learn features from static code [16].

The work is similar to ours because they view their input data as a time sequence, which we will also do in this thesis. This seems to be a rather new way to represent the data for defect prediction, as the work of Wen et al. [16] is the only work that we have found that considers the data in this way. Nevertheless, as the research area of software defect prediction is popular, there are many studies of interest to our thesis that should also be mentioned. There is a previous Master's thesis from 2018 by Sundström [6], who tried to apply the approach of extracting semantic code features via a convolutional neural network for defect prediction in an industry setting. Most notable from that study was the poor performance observed when compared to the work of Jian et al. [20], on which the approach was based. The best results observed were a maximum average F-score of 0.315 and an AUC score of 0.650 [6]. Furthermore, Sundström concluded that the trained model did not seem to generalize well and that this, together with the poor performance overall, might be because of insufficient data.

In comparison, the original study by Jian et al. [20] that introduced the approach used by Sundström [6] achieved an average F-score of 0.608 over 6 different projects. Most notable about their approach is that they use a convolutional neural network (CNN) to extract semantic features from static code. These features are then combined with more traditional code metrics and fed to a logistic regression classifier which does the defect prediction [20]. This is different from the work of Wen et al. [16] and from our own work, since we strive to use neural network models throughout the entire process and not only for feature extraction.

3.2 Software fault localization


These studies have the same goal as this thesis, with the major difference being that the coverage metric in this thesis is less fine-grained, as we consider software modules rather than single code instructions.

Of importance in the research field of fault localization is that applying machine learning does not seem to be as popular as we initially thought, considering the review studies of Wong et al. [28] and Zakari et al. [30]. In essence, the works of [11], [12] and [13] appear to be the major published papers on the topic of utilizing machine learning for fault localization, although there are other works as well, such as Briand et al. [31], who used the C4.5 decision tree algorithm for suspiciousness ranking.

Instead, the most popular approach to fault localization is spectrum-based techniques, which also use coverage metrics to rank code instructions, similarly to the machine learning approach. The key difference is that spectrum-based techniques use statistical and probabilistic mathematical models to describe the relationship between the coverage statistics and fault proneness [28]. Some of the more popular mathematical models are the Tarantula model [28] and the Ochiai model [32].

3.3 Previous work on co-change history

Previous work by Kirbas et al. [15] studied how evolutionary coupling (similar to co-change) relates to software defects. They mention that the literature is rather divided on whether evolutionary coupling is a useful metric for predicting defects. On one side we have the work of D'Ambros et al. [14], which concluded that the correlation between change coupling and defects is stronger than for more traditional code complexity metrics like the size of the code.

The work of Kouroshfar [33] builds on this and concludes that change coupling taking into


4 Method

In this chapter we describe how the experimental setup is constructed. The first section presents how the data was gathered, preprocessed and structured. Following this is a description of the machine learning models that were tested. Finally, an overview of the hardware and software used in this thesis is presented.

4.1 Data collection

The first step towards answering our research questions was to investigate what data could be gathered at Ericsson to train our machine learning algorithms. But first, we decided on the software product to focus on for our data gathering.

Ericsson has implemented a very large CI machinery for the development of its products. This means that there was a lot of meta information that could potentially be gathered. There were also various APIs developed to expose the information for querying. It was decided that we would use an in-house developed REST API that exposes the test execution results of new software versions of the chosen product. The API did, however, only contain information regarding the testing process of new product versions and nothing about which software modules had been changed between versions. Because of this we had to use software versioning data gathered from the test execution API to query another API exposing the artifact repositories for the software product. From the artifact API we could then obtain which software modules (the building blocks of a software product in the CI system) had been changed when compiling new products. The result of using these two APIs is that we can get a list of what software modules have been changed when a new software product version is created and what the test execution verdicts were when testing the new version, see Figure 9.


The collection of data was automated by writing a short Python script that used an initial software product ID supplied to the script. Using the product ID as a starting point, the script queried the two in-house APIs for data regarding the test verdicts of different versions of the software product together with, for each version, a list of software modules that had been changed since the previous version.

To be more specific about the type of data that was saved, for each version of the software product that we could find in the testing API we saved the software product’s name, its version, its previous version, what test suites it passed, what test suites it failed, the timestamp for when the testing began, what testing environment was used and what software modules had been changed since the previous version.
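A rough sketch of such a collection script is shown below. All endpoint URLs, parameters and field names are invented for illustration, since the actual Ericsson APIs are internal and not documented in the thesis:

```python
# Hypothetical sketch of the collection script; every endpoint and field name is invented.
import requests

TEST_API = "https://ci.example.com/test-executions"   # hypothetical test execution API
ARTIFACT_API = "https://ci.example.com/artifacts"      # hypothetical artifact repository API

def collect(product_id):
    records = []
    version = requests.get(f"{TEST_API}/{product_id}/latest").json()
    while version is not None:
        # Ask the artifact API which modules changed when this version was compiled.
        changed = requests.get(
            f"{ARTIFACT_API}/{version['product']}/{version['version']}/changed-modules"
        ).json()
        records.append({
            "product": version["product"],
            "version": version["version"],
            "previous_version": version.get("previous_version"),
            "passed_suites": version["passed_suites"],
            "failed_suites": version["failed_suites"],
            "timestamp": version["timestamp"],
            "environment": version["environment"],
            "changed_modules": changed,
        })
        # Walk backwards through the version history via the previous-version field.
        prev = version.get("previous_version")
        version = requests.get(f"{TEST_API}/{product_id}/{prev}").json() if prev else None
    return records
```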

4.1.1 Issues with the data gathering

While doing the data gathering, we found that the CI system at Ericsson deletes data after some time. Because of this there was a server-side limit on the amount of data that could be collected by the script at any point in time. To mitigate this problem, we decided to run the Python script once a week, creating batches of gathered data files for each week. We then created a Python script to merge these data files together, also filtering the data so that no duplicate entries were present. By doing this we could amass data over a longer time period than was possible by only relying on the APIs at Ericsson. In total we amassed 1.5 months' worth of data.
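A minimal sketch of the merging step, assuming each weekly run wrote its records to a JSON file (the actual file layout is not described in the thesis), could look like this:

```python
# Merge the weekly data files and drop duplicate entries (one entry per product version).
import glob
import json

merged, seen = [], set()
for path in sorted(glob.glob("week_*.json")):        # hypothetical weekly file names
    with open(path) as f:
        for record in json.load(f):
            key = (record["product"], record["version"])
            if key not in seen:
                seen.add(key)
                merged.append(record)

with open("merged.json", "w") as f:
    json.dump(merged, f)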

4.2 Data preprocessing

Having obtained the raw data that we wanted, the first step of the preprocessing consisted of mapping the list of changed software modules for each data point into a fixed-size integer vector consisting only of 0s and 1s. We wrote a Python script that takes the entire raw data set and


Figure 10 An example on how the change module lists were preprocessed. Note that all software versions #1-3 are together given as input to the Python script.
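A minimal sketch of the vectorization step illustrated in Figure 10, with hypothetical module names, is shown below; it is our illustration rather than the actual preprocessing script:

```python
# Map every distinct module name to a fixed position and encode each version as a
# 0/1 vector marking which modules changed since the previous version.
def vectorize(change_lists):
    modules = sorted({m for changes in change_lists for m in changes})
    index = {name: i for i, name in enumerate(modules)}
    vectors = []
    for changes in change_lists:
        vec = [0] * len(modules)
        for name in changes:
            vec[index[name]] = 1
        vectors.append(vec)
    return modules, vectors

modules, vectors = vectorize([["mod_a", "mod_c"], ["mod_b"], ["mod_a"]])
print(modules)   # ['mod_a', 'mod_b', 'mod_c']
print(vectors)   # [[1, 0, 1], [0, 1, 0], [1, 0, 0]]
```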

After the integer vectorization phase, we aggregated data points together if they conveyed information about the same version of the software product. The testing of software at Ericsson is based on allocating test jobs for the software. When we looked through the data received from the testing API we noticed that a single test job is not necessarily responsible for testing an entire version of a software product, see Figure 11.

As an example, assume we have a version of a software product. When testing this software product, we divide the test cases over 3 different test jobs in order to verify the product faster. However, due to concurrency issues one of the test jobs fails. A maintainer might notice this and decide to rerun the failed tests, spawning a new test job which now succeeds. This means that we have 4 test jobs, which when combined inform us that the new version of the software product works.

Because of this setup it was important to aggregate the data points (representing each testing job) together to get a single data point for each version of the software product. Otherwise, we would get conflicting data points that might hinder the learning of the machine learning models. This is because data points referring to the same version of the software product will have the same input (the swapped software modules for that version) but different test jobs might have either failed or succeeded. In our previous example, only looking at the test job that failed would give us the impression that the new software product is faulty. However, we have another test job that reran the same tests and succeeded, meaning that there was actually nothing wrong with the software product. Thus, we need to aggregate the data points dealing with the same version of the software product to infer whether that version is faulty or not.


Figure 11 An illustration of the flow between the test execution API and the preprocessing script. Note that for a single software product version there might be multiple entries of test verdicts in the API which need to be merged.
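The sketch below illustrates one possible merge rule: a test suite counts as passed if any job for that version passed it, and a version is labelled defective only if some suite failed in every job. This rule is our reading of the rerun example above, not something stated explicitly in the thesis:

```python
# Merge several test jobs into one data point per software product version.
from collections import defaultdict

def aggregate(jobs):
    by_version = defaultdict(list)
    for job in jobs:
        by_version[(job["product"], job["version"])].append(job)

    data_points = []
    for (product, version), group in by_version.items():
        passed = set().union(*(set(j["passed_suites"]) for j in group))
        failed = set().union(*(set(j["failed_suites"]) for j in group)) - passed
        data_points.append({
            "product": product,
            "version": version,
            "changed_modules": group[0]["changed_modules"],  # same input for every job
            "defective": len(failed) > 0,                    # some suite never passed
        })
    return data_points
```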

4.3 Training the machine learning models

After we had gathered and preprocessed the data from the CI machinery at Ericsson the next step was to train the machine learning models. When attempting to learn the relationship between swapped software modules and test verdicts we utilized three different types of neural networks to determine what type of machine learning model was most effective for our situation.

For each neural network model built we ran the training and testing phase 10 times. After that we calculated the average performance over these 10 executions when performing our comparison between the models. This was to make the comparison a bit more robust as the random


4.3.1 Splitting the data set

We started by dividing the data set we had gathered into a training set consisting of 70% of the data, followed by splitting the remaining 30% of the data in half to create the validation and test sets. This split was not random but was performed by first ordering the data points with respect to time. Then we picked, starting with the oldest data point and moving forward in time, the training set followed by the validation set and finally the test set. The reason for doing this is that our interest in this study is whether historic co-change data can tell us anything about future versions of the software product. Therefore, randomly dividing the data into the different sets does not make sense in our situation.
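A minimal sketch of this chronological 70/15/15 split (our illustration, assuming each data point carries a timestamp field):

```python
# Chronological split: oldest data for training, then validation, then the most recent data for testing.
def chronological_split(data_points):
    ordered = sorted(data_points, key=lambda p: p["timestamp"])
    n = len(ordered)
    train_end = int(0.70 * n)
    val_end = train_end + int(0.15 * n)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]
```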

4.3.2 The deep neural network

The first neural network that we tested is the standard densely connected deep neural network (Figure 6). The motivation for including the deep neural network is that it is easy to increase the complexity of the model and that the network can be trained rather quickly. This type of network has also been used in the previous work on fault localization [11] [13], which further supported the choice of the deep neural network.

Since it is easy to increase the complexity of the network, we assume that it should be easy to overfit the model to the training data. If this is not the case, it gives an indication that the input data might not carry enough information about the relationship between input and output, either because an insufficient correlation exists or simply because of too much noise in the gathered data. Either way, we wanted to use the deep neural network as a baseline, as it does not contain the more modern features found in convolutional and LSTM networks.

The network parameters we had to decide on when constructing the deep neural network are how many hidden layers the network should have and how many neurons should be in each hidden layer. The number of neurons in the input layer is dictated by the number of software modules of the software product, and hence no decision needs to be made for the input layer. The same goes for the output layer, as we want the network to output a single decimal number representing the probability of the software product being defective.

To find a suitable number of hidden layers and number of neurons for each hidden layer, we decided to build and evaluate many different deep neural networks, enumerating suitable values for these two network parameters. Based on the work of [11], [12] and [13], we assume that a good number of hidden layers ought to be in the range [0, 5], and so we tried networks having between 0 and 5 hidden layers. However, for the number of neurons in each hidden layer the previous studies differ from each other quite severely. One study [11] had 4 neurons in each hidden layer. Another study [13] calculates the number of neurons using Equation 9.

\text{number of neurons} = \text{round}\left(\frac{\text{dimension of input layer}}{30}\right) \times 10    (9)


layer to ensure that the output is in the interval [0, 1]. A depiction of the structure can be seen in Figure 12.

Figure 12 An illustration of the structure of our deep neural networks. Note that the number of neurons in the hidden layers vary as well as the number of hidden layers.
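The sketch below builds this family of networks, assuming a TensorFlow/Keras API (not named in the thesis); the ReLU activation in the hidden layers and the exact set of neuron counts are our assumptions for illustration:

```python
# Sketch of enumerating deep neural network configurations: 0-5 hidden layers and
# a few neuron counts per hidden layer (4, Equation 9, 1024). Hidden-layer activation
# (ReLU) is an assumption; the output neuron uses the sigmoid as described.
import tensorflow as tf

def neurons_from_eq9(input_dim):
    return round(input_dim / 30) * 10                          # Equation 9

def build_dnn(input_dim, hidden_layers, neurons_per_layer):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))              # one input per software module
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(neurons_per_layer, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # probability of being defective
    return model

input_dim = 69                                                 # software modules in the data set
models = [build_dnn(input_dim, h, n)
          for h in range(0, 6)                                 # 0-5 hidden layers
          for n in (4, neurons_from_eq9(input_dim), 1024)]     # example neuron counts
```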

4.3.3 The convolutional neural network

Having found the best deep neural network configuration, we then wanted to examine whether a convolutional neural network could perform better. Convolutional neural networks are perhaps the most popular alternative when it comes to defect prediction [6] [20] and have more recently shown promising results within the field of fault localization [12]. Hence, it seemed like a good idea to also evaluate a convolutional network for our problem.

For our convolutional neural network model, we used the model presented in [12] as reference. The architecture in [12] is a convolutional neural network with an input layer of dimensionality equal to the number of code statements in the program under consideration. This is followed by a convolutional layer with 32 filters, each with a filter size of 10 and an unknown stride configuration, after which the ReLU activation function is applied to the output of the convolutional layer. Then comes a max pooling layer whose exact configuration is not described, followed by another convolutional layer with 64 filters, each with a size of 10 and an unknown stride configuration. The ReLU activation function is again applied to the output of this convolutional layer and the result is fed to another max pooling layer with unknown configuration. Finally, a variable number of fully connected layers are used before the output layer (consisting of a single neuron) applies the sigmoid function and outputs a value in the range [0, 1].

In our network, we start with an input layer of dimensionality equal to the number of distinct software modules in our data set. Following that is a convolutional layer with 32 filters, each with a size of 10 and a stride of 1. We then apply the ReLU activation function to the output of the convolutional layer. Unlike the reference architecture, we did not place max pooling layers before the fully connected neural layers. This is because the dimensionality of the output from a max pooling layer would not conform to the allowed input dimension for fully connected layers. It is unclear how the previous study [12] solved this issue, as their configurations for the max pooling layers are not described.

Finally, 3 fully connected neural layers with 1024 neurons in each layer are used before the output layer, which consists of a single neuron. The output layer uses the sigmoid activation function to ensure that the output value is in the range [0, 1]. These final fully connected layers are like the ordinary layers we use for the deep neural networks. For an overview of the network, see Figure 13.

Figure 13 An illustration of the convolutional neural network that we used.
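A minimal Keras sketch in the same spirit is given below, assuming a single Conv1D layer with 32 filters of size 10 and stride 1 followed by three fully connected layers of 1024 neurons each; the exact layer sequence of the network we used is the one in Figure 13, and the dense-layer activation is an assumption on our part.

```python
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

NUM_MODULES = 69  # number of distinct software modules

def build_convolutional_network(input_dim=NUM_MODULES):
    """Sketch of a 1-D convolutional network over the co-change vector."""
    model = Sequential()
    # The binary co-change vector is treated as a 1-D signal with one channel.
    model.add(Conv1D(filters=32, kernel_size=10, strides=1,
                     activation='relu', input_shape=(input_dim, 1)))
    model.add(Flatten())
    # Three fully connected layers with 1024 neurons each (activation assumed).
    for _ in range(3):
        model.add(Dense(1024, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    return model
```

Note that the co-change vectors would then need an explicit channel dimension, for example a shape of (number of samples, 69, 1), before being fed to such a network.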

4.3.4 The recurrent neural network

The third neural network we wanted to try is a simple recurrent neural network based on the Long Short-Term Memory (LSTM) model. The motivation for trying this model stems from the work in [16], which reported an increase in performance in software defect prediction when treating the data as a sequence labeling problem. It should be noted that [16] used many more software metrics in their change sequences for predicting software defects, and therefore we do not expect our simple LSTM model to reach their level of performance. Nevertheless, for our research question it is interesting to evaluate this model as well.


Figure 14 A general illustration of the recurrent neural network we built for this study.
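As an illustration only, a minimal Keras sketch of such a recurrent model is given below. It assumes that consecutive co-change vectors are fed as one long sequence and that every time step is classified as defective or non-defective; the number of LSTM units is a placeholder, and the actual configuration is the one depicted in Figure 14.

```python
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

NUM_MODULES = 69  # number of distinct software modules

def build_lstm_network(num_units=32, input_dim=NUM_MODULES):
    """Sketch of an LSTM that labels every co-change vector in a sequence."""
    model = Sequential()
    # return_sequences=True so that each time step produces an output,
    # matching the sequence labeling view of defect prediction.
    model.add(LSTM(num_units, return_sequences=True,
                   input_shape=(None, input_dim)))
    model.add(TimeDistributed(Dense(1, activation='sigmoid')))
    return model
```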

4.3.5 Parameters used during training

As mentioned, all networks use the sigmoid function as the activation function for the output value. Besides this, the ReLU activation function is used for all convolutional layers. The cost function used during training was the mean squared error loss. The batch size during training was 32, except for the LSTM model, which considered the entire training data as one long sequence. Each model was trained for 15 epochs in accordance with [6] and [20]. As optimizer we used Adam, as it has been used in other studies [6]. Early stopping was also applied during training, as it is a well-known technique to combat overfitting [17].
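The training setup can be sketched as follows with Keras; the early stopping configuration (monitored quantity and patience) is not stated in the text, so those values are placeholders.

```python
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam

def train_model(model, x_train, y_train, x_val, y_val):
    """Compile and train a model with the parameters described above."""
    model.compile(optimizer=Adam(), loss='mean_squared_error')
    # Early stopping on the validation loss; the patience value is an assumption.
    early_stopping = EarlyStopping(monitor='val_loss', patience=3)
    model.fit(x_train, y_train,
              validation_data=(x_val, y_val),
              batch_size=32,  # the LSTM model instead used one long sequence
              epochs=15,
              callbacks=[early_stopping])
    return model
```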

4.4 Evaluating the models on the software defect prediction problem

After training each model we then wanted to see how well they performed on the software defect prediction problem. In other words, we wanted to analyze how well the models could predict whether a version of the software product was defective or not based on what software modules had been changed.

For each data point in the test set, we fed the models information about which software modules had been swapped. We then rounded the output of each network (which we also call the suspiciousness score) to 1 if it was larger than 0.5, indicating that the model considers the software version defective. Otherwise, we rounded the output to 0, indicating that the software version is believed to be non-defective. From this we could calculate the accuracy, F-score and AUC score for each model on the test data.
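This evaluation can be sketched with sklearn as below; we assume here that the AUC score is computed from the raw suspiciousness scores rather than the rounded predictions, and the function name is our own.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate_defect_prediction(model, x_test, y_test):
    """Round model outputs at 0.5 and compute accuracy, F-score and AUC."""
    suspiciousness = model.predict(x_test).ravel()
    predictions = (suspiciousness > 0.5).astype(int)

    accuracy = accuracy_score(y_test, predictions)
    f_score = f1_score(y_test, predictions)
    # AUC computed from the raw scores (assumed), not the rounded predictions.
    auc = roc_auc_score(y_test, suspiciousness)
    return accuracy, f_score, auc
```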

4.5 Comparing neural network models against spectrum-based approaches

After having evaluated the different types of neural networks on the defect prediction problem we wanted to compare their suspiciousness scoring of the different software modules, as described in section 2.5.2.

The suspiciousness score of a software module is obtained by feeding the models an artificial input in which only that software module has been swapped. The output produced by the network is then considered the software module's suspiciousness score. Doing this for every software module allowed us to plot the suspiciousness score distribution over the different modules for each neural network. For reference, we also plotted the suspiciousness scores for the different modules calculated by the spectrum-based approaches Tarantula and Ochiai as described in chapter 2.
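A sketch of this scoring procedure, assuming that each artificial input is a one-hot co-change vector in which exactly one module is marked as swapped:

```python
import numpy as np

def module_suspiciousness(model, num_modules=69):
    """Score each software module by feeding a one-hot co-change vector."""
    # Row i marks only software module i as changed since the last successful run.
    one_hot_inputs = np.eye(num_modules)
    return model.predict(one_hot_inputs).ravel()
```

For the convolutional and LSTM models the one-hot inputs would additionally have to be reshaped to match their expected input dimensions.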


In our setting, software fault localization and software defect prediction can be seen as two views of the same underlying issue. Thus, it is interesting to see whether we can draw similar conclusions about the neural networks by viewing them through these two problems.

The motivation for including the spectrum-based approaches Tarantula and Ochiai is that they have been heavily studied in software fault localization. Furthermore, to be able to compare the neural networks against the spectrum-based approaches on both software fault localization and software defect prediction, we created a simple method for classifying a version of the software product as defective or non-defective using Tarantula and Ochiai. The method works as follows. We used the data in the training set to calculate a suspiciousness score for each software module using Tarantula and Ochiai respectively. Then, for each data point in the test set, we checked whether any software module with a suspiciousness score of 0.5 or higher had been changed for that data point. If so, we classified the software product version as defective; otherwise it was considered non-defective.

Using this simple method allowed us to get an indication of whether Tarantula or Ochiai could be used for defect prediction in our situation.
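To make the method concrete, the sketch below computes Tarantula and Ochiai suspiciousness scores from the training data and applies the 0.5 threshold to a single test point. The suspiciousness formulas are the standard ones from the fault localization literature (described in chapter 2); the function names and data layout are our own illustration.

```python
import numpy as np

def spectrum_suspiciousness(changes, verdicts):
    """Tarantula and Ochiai suspiciousness per software module.

    changes:  (n_samples, n_modules) binary matrix, 1 = module changed
    verdicts: (n_samples,) binary vector, 1 = defective version
    """
    changes = np.asarray(changes, dtype=float)
    verdicts = np.asarray(verdicts)
    total_failed = (verdicts == 1).sum()
    total_passed = (verdicts == 0).sum()

    # Per module: how often it was changed in defective / non-defective versions.
    failed = changes[verdicts == 1].sum(axis=0)
    passed = changes[verdicts == 0].sum(axis=0)

    fail_ratio = failed / max(total_failed, 1)
    pass_ratio = passed / max(total_passed, 1)

    tarantula_denom = fail_ratio + pass_ratio
    tarantula = np.divide(fail_ratio, tarantula_denom,
                          out=np.zeros_like(fail_ratio), where=tarantula_denom > 0)

    ochiai_denom = np.sqrt(total_failed * (failed + passed))
    ochiai = np.divide(failed, ochiai_denom,
                       out=np.zeros_like(failed), where=ochiai_denom > 0)
    return tarantula, ochiai

def classify_version(changed_modules, suspiciousness, threshold=0.5):
    """Defective if any changed module has suspiciousness >= threshold."""
    return int(any(suspiciousness[m] >= threshold for m in changed_modules))
```

In our setting, changed_modules would be the indices of the software modules swapped since the last successful test execution for the given test data point.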

4.6 Hardware and software configurations

All experiments were executed on an HP EliteBook 840 G5 with an Intel Core i7-8650U CPU at 1.90 GHz and 32 GB RAM. All scripts for data gathering, processing, and creating and training the models were written in Python 3.6.7, using the popular Python packages requests, pickle, numpy, sklearn, keras and tensorflow. Requests was used to query the internal REST APIs and parse the JSON data obtained from them. Pickle was used to save the data to file so that the data was easily accessible after it had been gathered. Numpy, sklearn, keras and tensorflow were used to build the neural networks, train them and finally evaluate them.


5 Results

This chapter presents the experimental results. The following section contains results about the data collection process. Thereafter follows a presentation of the performance of the different neural network models on the data.

5.1 Resulting data

In the end we extracted data about one software product from the CI system at Ericsson. As the CI system deletes data after roughly a month has passed, we gathered the data on multiple occasions and then merged it all into a single data set. Information about the final data set can be found in Table 1. In total, the data set represents the development of the software product over roughly 1.5 months. It should be noted that the number of distinct software modules found in the data set was 69.

5.2 Resulting performance

In this section we present how the deep neural networks, the convolutional neural network and the LSTM network performed on the test set.

5.2.1 Deep neural network performance

The average performance of the different configurations of the deep neural networks over 10 training and testing sessions can be seen in Figure 15, Figure 16 and Figure 17. Note that since the number of unique software modules in the data set was 69, Equation 9 gives that 20 neurons in each hidden layer could be a good configuration. The configurations for the number of neurons in each hidden layer that we tested were therefore 4, 20 and 1024 neurons respectively.

Among the configurations there were two different networks that achieved the best performance with respect to the three measurements accuracy, F-score and AUC. The configuration with the highest accuracy and F-score was the network with 5 hidden layers and 1024 neurons in each hidden layer, seen in Figure 17. It achieved an accuracy of 55.5% on the test data set, an F-score of 0.632 and an AUC score of 0.543. The configuration with the highest AUC score was the network with 2 hidden layers and 1024 neurons per hidden layer, which can also be seen in Figure 17. It achieved an accuracy of 52.9%, an F-score of 0.5 and an AUC score of 0.605.

The general trend, based on a comparison of Figure 15, Figure 16 and Figure 17, is that no configuration outperforms another by much. On average, all configurations have around 50% accuracy, an F-score of 0.5 and an AUC score of 0.5.

Before or after preprocessing | Number of entries | Percentage of entries defective
Before                        | 5485              | 35.6%
After                         | 289               | 52.2%

Table 1 The gathered data set before and after preprocessing.


Figure 15 Performance of deep neural networks where each hidden layer consists of 4 neurons.

Figure 16 Performance of deep neural networks where each hidden layer consists of 20 neurons.


Figure 17 Performance of deep neural networks where each hidden layer consists of 1024 neurons.

5.2.2 Convolutional network performance

The performance of the convolutional neural network (CNN) for 10 runs can be seen in Figure 18. On average, the CNN had an accuracy of 49.0% on the test data set, an F-score of 0.502 and an AUC score of 0.484.

Figure 18 Performance of CNN for 10 training sessions.


5.2.3 LSTM network performance

The performance of the recurrent neural network using the LSTM structure for 10 runs can be seen in Figure 19. On average the LSTM network had an accuracy of 46.4%, an F-score of 0.223 and an AUC score of 0.539.

Figure 19 Performance of LSTM network for 10 training sessions.

5.3 Comparison with Spectrum-based approaches

When applying the simple method described in section 4.5 to convert the suspiciousness scores of Tarantula and Ochiai into software defect predictions, we got the results shown in Table 2.

Approach  | Accuracy | F-score | AUC
Tarantula | 56.8%    | 0.698   | 0.487
Ochiai    | 36.4%    | 0.0     | 0.5

Table 2 Performance of Tarantula and Ochiai on the test data set, having used the training data set to calculate suspiciousness.

The suspiciousness ranking of the 69 modules making up the software product under consideration is depicted in Figure 20 for Tarantula and Figure 21 for Ochiai. We also include the suspiciousness distribution of the convolutional network (Figure 22), LSTM network (Figure 23) and for the best configuration in each of Figure 15, Figure 16 and Figure 17 (Figure 24, Figure 25 and Figure 26).


Figure 20 Suspiciousness distribution of the 69 software modules as calculated by Tarantula.

Figure 21 Suspiciousness distribution of the 69 software modules as calculated by Ochiai.


Figure 22 Suspiciousness distribution of the 69 software modules as calculated by our CNN.

Figure 23 Suspiciousness distribution of the 69 software modules as calculated by our LSTM model.

