
Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

The correlation between code coverage, cyclomatic complexity and fault frequency

by

Simon Persson

LIU-IDA/LITH-EX-G--15/012--SE

2015-06-03


Abstract

The quality of software is becoming increasingly important as software is introduced into systems that are vital to the infrastructure of modern society. This thesis studies one such code base, developed at Ericsson AB, that is a vital piece of software for our infrastructure. With an increased need for quality in software, it is important that we have quantifiable metrics that can be used to steer the development of software in a direction that leads to fewer faults. We look at the software metric cyclomatic complexity and variations of code coverage and analyse how these metrics correlate with faults in the code. We find that code coverage has a weak negative correlation at best, but can have a weak positive correlation at worst (such that faults increase as coverage increases). The cyclomatic complexity metric has not been found to have any correlation at all to software faults.


Contents

1 Introduction
1.1 Motivation
1.2 Purpose
1.3 Research questions
1.4 Limitations
2 Background
3 Theory
3.1 Software quality
3.2 Software testing
3.3 Code coverage
3.3.1 Code coverage and testing effectiveness
3.4 Cyclomatic complexity
4 Method
4.1 Retrieving data
4.2 Parsing the data
4.3 Grouping the data
4.4 Presenting the data
5 Results
5.1 Code coverage
5.1.1 Mean code coverage per module
5.1.2 Mean code coverage per unit
5.1.3 Code coverage per file
5.2 Cyclomatic complexity
6 Discussion
6.1 Results
6.2 Method
6.3 The work in a larger context
7 Conclusions


List of Figures

3.1 Shepperd's less complex example
3.2 Shepperd's more complex example
4.1 Example of RPTERROR log
4.2 Example of sending an RPTERROR
5.1 Block coverage per module
5.2 Line coverage per module
5.3 Function coverage per module
5.4 Block coverage per unit
5.5 Line coverage per unit
5.6 Function coverage per unit
5.7 Block coverage per file
5.8 Line coverage per file
5.9 Function coverage per file
5.10 Cyclomatic complexity
5.11 Cyclomatic complexity per unit
5.12 Cyclomatic complexity per module


List of Tables

6.1 Summary of results

Chapter 1

Introduction

This bachelor's thesis investigates the correlation between code coverage and software faults, and between cyclomatic complexity and software faults. The study took place at Ericsson AB in Linköping.

1.1 Motivation

Testing is an essential part of developing and maintaining high-quality software. There exist various metrics to measure the effectiveness of a test suite, with the code coverage family of metrics being one choice [1]. Previous studies conducted by Hutchins et al. [2] and Inozemtseva et al. [3] have not reached a consensus on whether there is a correlation between coverage and software faults.

The cyclomatic complexity metric, used to measure the complexity of code, is in a similar situation. It is in widespread use, but previous studies conducted by Basili et al. [4], Shepperd [5], and Zhang et al. [6] have not reached a consensus on whether a high cyclomatic complexity correlates with a high fault frequency.

The fact that previous studies do not reach a consensus could mean that the correlation is domain dependent. A new study is then necessary; otherwise, all efforts to improve these measurements at Ericsson could be wasted.

1.2 Purpose

The purpose of this thesis is to investigate whether a relation between cyclomatic complexity or code coverage and software faults exists in a code base developed by Ericsson AB. Nevertheless, the results should still be applicable to other software projects.


1.3 Research questions

This thesis attempts to answer the following questions:

• How strong, if at all existent, is the correlation between cyclomatic complexity and software faults?

• How strong, if at all existent, is the correlation between code coverage and software faults?

1.4 Limitations

The only software faults studied in this thesis are those that are automatically discovered when running the software in an environment simulating real-world conditions. Faults discovered in any other way are not taken into consideration. This limitation exists because it allows for an easy mapping of a found bug to a specific version of the software.

Furthermore, this thesis will not attempt to draw any conclusions regarding how likely the software is to contain undiscovered bugs. It will merely analyse the likelihood of discovering faults based on code coverage or cyclomatic complexity.


Chapter 2

Background

Ericsson AB develops the code base studied in this thesis. The code base has a long history, with roots in the early days of GSM, and is written in the C programming language.

The test system is based on standard industry practices. There are test cases for all methods, checks for memory errors with Rational Purify [7], code coverage reports, and cyclomatic complexity metrics. Structurally, the software is divided into modules that comprise sets of units, which in turn comprise sets of files that contain the functions of the system.

What is particular about the software, and what makes this study possible, is that it is full of sanity checks executed at run time. These checks trigger on “impossible” states and states that mean something is not working correctly. The developers use these similarly to the assert macro in the C programming language, only instead of aborting the program when something is broken, the software generates an RPTERROR. RPTERRORs are error messages giving information on what went wrong and where, as in what line in what file. The software can store RPTERRORs in persistent storage and can hence retrieve them even after a hard crash of the system. Ericsson constantly runs the software in a simulated environment that corresponds to real-world usage and stores all RPTERRORs on disk. This means that there is plenty of data available to study and analyse in this thesis.


Chapter 3

Theory

This chapter attempts to give a brief summary of the software engineering fields related to this thesis with an overview of previous studies similar to this one.

3.1 Software quality

One obvious aspect of software quality is a program's ability to work as intended, following its specification. However, a program working correctly might not be enough to claim that its quality is high, as it could have non-functional problems, such as being hard to maintain and understand or being platform dependent, which could be considered signs of low software quality [8]. This thesis only concerns the functional quality of software, its ability to work as intended, although the metrics studied might be of interest for evaluating non-functional qualities as well.

One obvious way to detect low software quality is to test the software. Testing, to some extent, tells us if the software works as intended and follows its specification. However, as Edsger Dijkstra put it, “Testing shows the presence, not the absence of bugs” [9], which means that even though we test software, it might still contain errors that cause it to malfunction in some cases. This can be because the code that contains the error is in a branch that is not taken during testing, or because the test fails to spot the error when it occurs.

Another approach to determining the quality of software is to calculate some metric based on the source code, with the idea that the metric would indicate which pieces of the code are most likely to be faulty. Historically, this approach has been popular. Gaur et al. [10] recently published an overview of software engineering metrics for procedural programming in which they find that a total of ten different metrics have been proposed in previous literature, one of them being the cyclomatic complexity metric studied in this thesis. Metrics for procedural programming are relevant as this study concerns code in the C programming language, which is considered a procedural language. The authors conclude that there has been a decline in new metrics since the year 1990 and attribute that to the fact that modern programming languages have left the procedural paradigm of programming.

3.2 Software testing

Software testing is a broad topic, but Myers et al. [1] manage to give a summary in three points. They say that “Testing is the process of executing a program with the intent of finding errors”, “A good test case is one that has a high probability of detecting an as yet undiscovered error”, and “A successful test case is one that detects an as yet undiscovered error”. With such a vague definition of software testing, it is obvious that there must be more than one way to do software testing in practice.

One popular method of testing is unit (or module) testing. Unit testing means that the tester tests individual components of a program instead of the program as a whole. Testing individual components brings benefits such as making the testing process less overwhelming, as the tester can limit their focus to one piece at a time, and easier debugging, as unit testing narrows down where the programmer must look for a detected error to a single module [1]. A unit test case can be as simple as calling the code one wants to test and comparing the return value to a value that the tester has determined is the correct one. A unit test case can also call code simply to see that it does not crash. When testing a method, the parameters supplied are of great importance. Usually, one must use more than one set of parameters. Otherwise, there is a risk that the code has some good branches that give correct behaviour but also bad branches that go undetected because the combination of input parameters did not cause those branches to get taken.

Finding a good selection of parameters to test code with is one of the challenges in testing. Remember that Myers et al. said that “A good test case is one that has a high probability of detecting an as yet undiscovered error” [1], so the tester must choose the parameters carefully, in a way that makes it likely that the test case will discover an undiscovered bug.

Antonia Bertolino has written a paper [11] on the current state of software testing and goals for future research. She considers test effectiveness one of the main challenges for future advances in software testing. She writes that “To establish a useful theory for testing, we need to assess the effectiveness of testing and novel test criteria”. There are ways to measure the effectiveness of a set of test cases, some popular ones being the code coverage family of metrics.


3.3 Code coverage

Code coverage refers to a metric that describes how thoroughly tested a piece of code is. The idea of measuring how thorough the testing of the code is has been around for a long time, with references dating back as far as 1963 [12]. The basic idea is that testing cannot be effective if not all of the code is exercised. Naturally, if code is well tested, that must mean that the greater part of the code was actually executed during testing. If not, we cannot make any claims regarding the effectiveness of the testing.

There are plenty of criteria for coverage, some of the main ones being [1]:

• Statement coverage - how many of the statements got executed?
• Decision coverage - how many of the branches got taken?
• Condition coverage - how many of the boolean conditions got evaluated?

The recommendation is to use a combination of multiple, if not all, of these criteria to get maximal efficiency out of the test suite [1]. The ideal test suite has a code coverage of 100%. That is, all of the statements, decisions or conditions in the software tested got evaluated. Unfortunately, that is rarely possible. In most nontrivial pieces of software the coverage will never reach 100% [1].
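To make the difference between the criteria concrete, consider the following minimal Python sketch. The function and the test inputs are hypothetical, invented purely for illustration; they are not from the code base studied in this thesis.

import itertools

def grant_access(is_admin, has_token):
    if is_admin or has_token:
        return True
    return False

# grant_access(True, False) and grant_access(False, False) together
# execute every statement and take the if both ways, so statement and
# decision coverage are complete. Yet has_token never evaluates to
# True (the first call even short-circuits past it), so condition
# coverage is still incomplete; grant_access(False, True) is needed.
for is_admin, has_token in itertools.product([True, False], repeat=2):
    print(is_admin, has_token, grant_access(is_admin, has_token))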

One technique for determining the adequacy of a test suite for a program is mutation testing. It works by having the computer make a slight change to the program and then running the test suite. One out of two things can then happen [13]:

1. The tests catch the change to the program
2. The change goes unnoticed

In the first case, the test suite was sufficient to catch the change to the program. In the second case, one out of two things have happened [13]:

1. The mutated program is logically equivalent to the original program
2. The mutated program is different, but the test cases are not sufficient to find a faulty case

Instances of the second case can then be used to evaluate the adequacy of a test suite. This is one possible approach to evaluating how well code coverage performs, and it is used in a study by Inozemtseva et al. [3].
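The following is a minimal Python sketch of the idea; it is illustrative only, since real studies such as Inozemtseva et al. [3] use dedicated mutation tools that generate many mutants automatically.

def original(a, b):
    return a + b

def mutant(a, b):
    return a - b  # the slight change: "+" mutated into "-"

def test_suite(f):
    # A single weak test case that cannot tell "+" from "-" at zero.
    return f(0, 0) == 0

assert test_suite(original)  # the suite passes on the original program
assert test_suite(mutant)    # the mutant survives: outcome 2 above,
                             # the mutant differs but the test cases
                             # are insufficient to find a faulty case

A suite containing, say, f(2, 3) == 5 would instead kill this mutant, which is outcome 1.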

3.3.1 Code coverage and testing effectiveness

There are already studies on the correlation between code coverage and fault frequency in the literature, such as the ones by Hutchins et al. [2] and Inozemtseva et al. [3]. A consensus among these studies does not appear to exist. Rather, results vary from no correlation whatsoever to a strong correlation. This was somewhat surprising to us, as many of the previous works share a similar methodology and one would therefore expect similar results, but that is not the case.

In a 1994 study, Hutchins et al. found that “Within the limited domain of our experiments, test sets achieving coverage levels over 90% usually showed significantly better fault detection than randomly chosen test sets of the same size. In addition, significant improvements in the effectiveness of coverage-based tests usually occurred as coverage increased from 90% to 100%” [2]. This paper shares the typical methodology of many of the other previous works. The first step was to get a faulty program on which to determine the testing effectiveness. The authors achieved this by manually introducing artificial faults into an otherwise healthy piece of software. The authors then developed a test pool. The development of the test pool was as thorough as possible and based on modern best practices of testing. Subsets of the test pool were then selected and run against the faulty software. The subsets were randomly generated so that their coverage levels would differ from each other. The effectiveness of the test pools with different code coverage was then determined based on how well they detected faults in the software.

Inozemtseva et al. carried out a more recent study. They found that “In general, there is a low to moderate correlation between the coverage of a test suite and its effectiveness when its size is controlled for” [3]. This paper studied bigger software, measured in lines of code, than previous studies had done. This study also took a different approach to introducing faults in the programs tested. Rather than manually introducing artefacts, they applied mutation testing.

The authors used test cases that were already supplied with the software projects they studied. The authors picked out random test cases to generate a total of 31000 different test suites of varying code coverage. The authors measured the effectiveness of the test suites in two ways, raw kill score, meaning the number of mutations discovered by a test suite, and normalised effectiveness, meaning the number of discovered mutations divided by the number of mutations not discovered.

As there does not appear to be a consensus on the effect of coverage on test suite effectiveness, there is still room for further research.

3.4 Cyclomatic complexity

Another software metric is McCabe's cyclomatic complexity, an old metric dating back to 1976 [14]. The idea of the metric is to measure the number of paths through a program or method. To compute the metric, one must construct the control flow graph of the program, in which every vertex is a linear flow of instructions and every arc is a branch. The cyclomatic complexity is then v(G) = e − n + 2p, where e is the number of edges in the graph, n is the number of vertices and p is the number of connected components [14]. A connected component is a subgraph that has no edges connecting it to the rest of the graph. In a program, connected components are often methods. Calls to other methods within a method then increase p.
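As a worked example, the Python snippet below computes v(G) for a hypothetical control flow graph of a single function (so p = 1) containing one if/else. The vertex names are invented for illustration.

# One function: entry block, a condition, two branch blocks, and a
# common exit vertex. Each edge is a possible transfer of control.
edges = [("entry", "cond"),
         ("cond", "then"), ("cond", "else"),
         ("then", "exit"), ("else", "exit")]
e = len(edges)  # 5 edges
n = 5           # 5 vertices
p = 1           # 1 connected component
v = e - n + 2 * p
print(v)        # 2, matching the two paths through an if/else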

Basili et al. [4] carried out one study on the correlation between cyclomatic complexity and fault frequency in software. Their study was empirical. They collected data on faults in a software project during a period of 33 months, after which they analysed the data. The authors divided the source code into modules and put each module into a category based on its size, so as not to draw the conclusion that fault frequency increases with cyclomatic complexity if it was actually the size of the module that correlated with the fault frequency. Modules, in this context, are not used in the same way as in this thesis. In this thesis, a module is a large collection of code, consisting of many files and functions, whereas in the study by Basili et al. [4] a module can be a single function or subroutine. Their findings are that larger modules are less prone to errors, even though the complexity of the larger modules is higher. Further, they find that among modules of the same size, the error-prone modules do not have a higher complexity than error-free modules.

Shepperd [5] did another study on the correlation between cyclomatic complexity and fault frequency. This study is a review, analysing both the theoretical weaknesses in McCabe's work [14] and empirical results from other studies. One critique Shepperd has of the cyclomatic complexity metric is that it does not consider else clauses to add complexity. Shepperd argues that, of the examples in Figure 3.1 and Figure 3.2, the latter is more complex from a psychological standpoint.

if(x<1) {
    /* ... */
}

Figure 3.1: Shepperd's less complex example

if(x<1) {
    /* ... */
} else {
    /* ... */
}

Figure 3.2: Shepperd's more complex example

Shepperd further argues that it is a weakness of the cyclomatic complexity metric that unstructured techniques such as GOTO and jumps out of loops are not considered more complex than any other type of branch, such as loops or if statements, even though the author considers them a bad practice. Obviously, Shepperd does not approve of the cyclomatic complexity metric from a theoretical point of view. He goes on to analyse empirical studies on the relation between cyclomatic complexity and fault frequency in software, and finds that “Cyclomatic complexity fails to convince as a general software complexity metric. This impression is strengthened by the close association between v(G) and LOC and the fact that for a significant number of studies LOC outperforms v(G)”. In this context, LOC refers to lines of code.

A more recent study by Zhang et al. [6] concludes that “[...] software complexity, especially the static code complexity measures, can be useful indicators of software quality” and “[...] we show that using classification techniques, we are able to predict defect-prone modules at component level based on their complexity with good accuracy.”, which contradicts the previous studies published. The use of modules in the study by Zhang et al. [6] is similar to the use in this thesis in the sense that a module is a larger collection of software. In the paper, the authors describe the modules as “component-level”. The data studied is from NASA; it consists of metrics on the source code and detected faults in the software. What makes the study stand out is that the authors divide the source code into modules and analyse the correlation between modules and fault frequency, instead of between methods and fault frequency as previous studies did. With this method they find that they can predict where faults are likely to be with high accuracy.


Chapter 4

Method

This chapter outlines how the problems faced during this thesis were solved and why those solutions were chosen.

4.1 Retrieving data

All data needed was fortunately available in log files on a network-shared file system at Ericsson. Fortunately, because that meant that retrieving the data of interest was as simple as executing some trivial commands in the Bash shell to find all logs for a specific build or version of the software.

It was possible to generate the logs with an automatic tool. This meant that if data on a particular version of the software was missing, it could be generated easily.

As finding errors in software takes a lot of time, the approach to retrieving data was to find versions that had already received thorough testing in the simulated environment and hence already had many RPTERRORs to study. I then generated the static measures (cyclomatic complexity and code coverage) for these versions, as those measures are less time-consuming to generate than the collection of RPTERRORs.

Three different projects were chosen, with one version of each. The projects are all related and have their origins in the same code base, but are developed in parallel today. The purpose of using multiple projects was to get more data points than if only one project was used. For this reason, the results for the individual projects are not presented in this thesis, as the low number of data points makes those results insignificant.

4.2 Parsing the data

With the data retrieved, it was time to write a tool to make sense of it. Unfortunately, the data was stored in formats designed for human readability rather than ease of parsing (see Figure 4.1 for an example log of RPTERRORs and Figure 4.2 for code that can cause such errors).


Running trace...

Current time: 2014-05-22 14:29:40

'trace' Statistics since: 2014-05-22 10:22:18
'trace' Current tick = 30921865
'trace' Version = 1
'trace' Used index in trace data = 1
'trace' Summary of trace and error messages since last restart -
'trace' RPTDOTRACE_LEV1 index = 0
'trace' Trace number = 12683270
'trace' Severity = 1
'trace' Error description = -
'trace' Timestamp = 49791328
'trace' File name = hello_world.c
'trace' Code line = 13210
'trace' User parameter = 254
'trace' Number of trace messages = 1

Done.

OSmon>

Figure 4.1: Example of RPTERROR log

Hence, a parser had to be written. Haskell was chosen as the programming language for this tool. Haskell may seem like an odd choice of language, but it appeared to be a good one given the circumstances. First, Haskell is a high-level language. This allowed for quicker development than in a language such as C, as time could be spent on the problem at hand instead of on managing memory manually and similar time-consuming tasks related to developing in C. Second, compared to other high-level languages, such as Python, Haskell has a mature set of libraries for developing parsers, with Parsec [15] and Attoparsec [16] being the two used for the tool in this thesis. This allowed for little time spent implementing a parser while still having good performance and robustness. Performance was crucial, as the verbose logs could sometimes reach the order of hundreds of megabytes, and if the parser did not run in the order of tens of seconds, it would get tiresome in the long run.
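The actual tool is written in Haskell and is not public; the Python sketch below merely illustrates the kind of extraction performed, assuming the log layout of Figure 4.1, where a “User parameter” line closes each report. The type and field names are invented for illustration.

import re
from collections import namedtuple

# One sanity-check report: where it happened and how severe it was.
RptError = namedtuple("RptError", "file line severity")

# Log lines of interest look like: 'trace' File name = hello_world.c
FIELD_RE = re.compile(r"'trace'\s+([\w ]+?)\s*=\s*(\S+)")

def parse_log(text):
    """Collect one RptError per report in an RPTERROR log dump."""
    errors, fields = [], {}
    for key, value in FIELD_RE.findall(text):
        fields[key] = value
        if key == "User parameter":  # last field of a report entry
            errors.append(RptError(fields["File name"],
                                   int(fields["Code line"]),
                                   int(fields["Severity"])))
            fields = {}
    return errors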


4.3 Grouping the data

With the data parsed and stored in memory, it had to be grouped in order to make sense. That is, for each RPTERROR, there should be a corresponding metric such as code coverage or cyclomatic complexity. This was achieved by storing cyclomatic complexity and coverage measures in associative containers, with the name of the source file the metric concerns as the key. The parser put the RPTERRORs in a list. The tool could then generate lists of pairs of RPTERRORs and static metrics by looking up the corresponding metric based on the RPTERROR's source file for every element in the list of RPTERRORs.

With the lists of pairs, RPTERRORs were grouped by the file, method or module they appeared in, by sorting the elements on those keys and then grouping consecutive elements. These groups were used to extract metrics such as the number of RPTERRORs in a file or the number of RPTERRORs per average coverage in a module.
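A Python sketch of this sort-then-group step is shown below (again, the real tool is Haskell; this reuses the illustrative RptError shape from the previous sketch, and works for any objects with a file attribute).

from itertools import groupby
from operator import attrgetter

def count_rpterrors_per_file(errors):
    """Sort by file, group consecutive equal keys, and count."""
    ordered = sorted(errors, key=attrgetter("file"))
    return {name: len(list(group))
            for name, group in groupby(ordered, key=attrgetter("file"))}

def pair_with_metric(counts, metric_by_file):
    """Join fault counts with e.g. coverage, keyed by source file."""
    return [(metric_by_file[name], n)
            for name, n in counts.items() if name in metric_by_file]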

4.4 Presenting the data

The first tool exported the grouped data to CSV [17] files. Two other tools then read this data in order to present it in two ways.

First, we presented the data graphically as scatter plots. This was useful to get an intuitive sense of what correlation could exist. It was also useful as a sanity check, to see that the data could possibly be correct. The scatter plots in this thesis were generated with the matplotlib [18] Python library.
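A minimal matplotlib sketch of this kind of plot follows; the data values here are placeholders, not measurements from the study.

import matplotlib.pyplot as plt

coverage = [34, 52, 61, 70, 78, 85, 91, 97]  # e.g. block coverage (%)
rpterrors = [9, 7, 8, 5, 6, 3, 4, 2]         # faults per file

plt.scatter(coverage, rpterrors)
plt.xlabel("Block coverage")
plt.ylabel("RPTERRORs")
plt.show()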

Secondly, we measured the correlation with the Kendall rank correlation coefficient. Kendall's τ gives us a metric that says how well the number of RPTERRORs follows a software metric. If RPTERRORs increase when cyclomatic complexity increases, that results in a positive τ. If instead the number of RPTERRORs decreases as coverage increases, that gives a negative τ. There are multiple ways to measure correlation, but Kendall's τ is less likely than other correlation metrics to produce type I errors (false positives) [19]. Similar studies by Evanco [20] as well as Bachmann et al. [21] both use Kendall's τ to measure correlation. The SciPy [22] Python library provided the implementation for calculating Kendall's τ.
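Computing the coefficient with SciPy takes a single call; reusing the placeholder data from the plotting sketch above:

from scipy.stats import kendalltau

coverage = [34, 52, 61, 70, 78, 85, 91, 97]
rpterrors = [9, 7, 8, 5, 6, 3, 4, 2]

# kendalltau returns tau and the P-value of a test whose null
# hypothesis is that no correlation exists (see Chapter 5).
tau, p = kendalltau(coverage, rpterrors)
print(f"tau = {tau:.3f}, P = {p:.4f}")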

switch (refAddrFlag) {
case SEND_TLLI:
    source[0] = ( (TLLI_INDIC << 15) |
                  ((tlli >> 17) & 0x7fff) );
    source[1] = (tlli >> 1) & 0xffff;
    source[2] = (tlli << 15) & 0x8000;
    bitLength = 33;
    break;
case SEND_TFI_UL:
    source[0] = ( (TFI_INDIC << 14) |
                  (GLOBAL_TFI_UPLINK_INDIC << 13) |
                  ((tfi & 0x1f) << 8) );
    bitLength = 8;
    /* Set TFI and D bit values for RLC header. Used in case of
       segmentation */
    tfi_D = (((tfi & 0x1f) << 1) |
             (GLOBAL_TFI_UPLINK_INDIC) );
    break;
case SEND_TFI_DL:
    source[0] = ( (TFI_INDIC << 14) |
                  (GLOBAL_TFI_DOWNLINK_INDIC << 13) |
                  ((tfi & 0x1f) << 8) );
    bitLength = 8;
    /* Set TFI and D bit values for RLC header. Used in case of
       segmentation */
    tfi_D = (((tfi & 0x1f) << 1) |
             (GLOBAL_TFI_DOWNLINK_INDIC) );
    break;
default:
    /* Invalid function argument */
    APT_RP_ERROR(ERROR_ID_R12_693, refAddrFlag);
    return; /* Return to main */
}

Figure 4.2: Example of sending an RPTERROR. The switch statement should handle all possible cases; otherwise an error is reported.


Chapter 5

Results

The results gathered are presented both with scatter plots of the raw data and with the Kendall rank correlation coefficient, τ. The correlation coefficient is also accompanied by a P-value for a hypothesis test, where the null hypothesis is that there is no correlation.

Kendall’s τ varies between −1 and 1, where 1 is perfect correlation and −1 is a perfect negative correlation. In a paper by Bachmann et al. [21], 0.1 ≤ |τ | < 0.3 corresponds to a weak correlation, 0.3 ≤ |τ | < 0.5 a moderate correlation and 0.5 ≤ |τ | ≤ 1.0 a strong correlation.

In order for τ to be of any significance, P must be sufficiently low. In this thesis, results with P ≤ 0.05 will be considered significant.
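Taken together, the bands of Bachmann et al. [21] and this P cut-off give a simple reading rule, sketched below in Python. The “negligible” label for |τ| < 0.1 is our own addition, not from [21].

def interpret(tau, p, alpha=0.05):
    """Classify a Kendall tau result per Bachmann et al. [21]."""
    if p > alpha:
        return "not significant"
    strength = abs(tau)
    if strength >= 0.5:
        return "strong"
    if strength >= 0.3:
        return "moderate"
    if strength >= 0.1:
        return "weak"
    return "negligible"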

5.1 Code coverage

The three kinds of code coverage are grouped by structural level (module, unit or file).


5.1.1 Mean code coverage per module

Each point in these charts represents the number of RPTERRORs found in a module and the mean coverage of that module.

[Scatter plot] Figure 5.1: Mean block coverage per module (x-axis) vs. RPTERRORs (y-axis). τ = −0.0233954135448, P = 0.9072159348991

[Scatter plot] Figure 5.2: Mean line coverage per module (x-axis) vs. RPTERRORs (y-axis). τ = 0.0814173684099, P = 0.685033725917

[Scatter plot] Figure 5.3: Mean function coverage per module (x-axis) vs. RPTERRORs (y-axis). τ = 0.0701862406344, P = 0.72659931661


5.1.2 Mean code coverage per unit

Each point in these charts represents the number of RPTERRORs found in a unit and the mean coverage of that unit.

[Scatter plot] Figure 5.4: Mean block coverage per unit (x-axis) vs. RPTERRORs (y-axis). τ = 0.0522722597533, P = 0.616972430142


[Scatter plot] Figure 5.5: Mean line coverage per unit (x-axis) vs. RPTERRORs (y-axis). τ = 0.0792218311775, P = 0.448449929336

[Scatter plot] Figure 5.6: Mean function coverage per unit (x-axis) vs. RPTERRORs (y-axis). τ = 0.16299556919, P = 0.118864703568


5.1.3 Code coverage per file

Each point in these charts represents the number of RPTERRORs found in a file and the coverage of that file.

[Scatter plot] Figure 5.7: Block coverage per file (x-axis) vs. RPTERRORs (y-axis). τ = −0.145847264474, P = 0.0176856370598

[Scatter plot] Figure 5.8: Line coverage per file (x-axis) vs. RPTERRORs (y-axis). τ = −0.111390954576, P = 0.0700295207606

[Scatter plot] Figure 5.9: Function coverage per file (x-axis) vs. RPTERRORs (y-axis). τ = 0.134488563188, P = 0.0287139890203


5.2 Cyclomatic complexity

The cyclomatic complexity is presented in the same way as the code coverage. However, the complexity is measured at the function level rather than the file level, and hence the RPTERRORs are also mapped to a specific function rather than a file for that chart.

[Scatter plot] Figure 5.10: Cyclomatic complexity (x-axis) vs. RPTERRORs (y-axis). τ = 0.0775294206981, P = 0.0502063264391


[Scatter plot] Figure 5.11: Mean cyclomatic complexity per unit (x-axis) vs. RPTERRORs (y-axis). τ = −0.00741249316661, P = 0.94075934283

[Scatter plot] Figure 5.12: Mean cyclomatic complexity per module (x-axis) vs. RPTERRORs (y-axis). τ = −0.110049533438, P = 0.567434349581


Chapter 6

Discussion

6.1 Results

Data series                        τ       P
Block coverage per module         −0.02   0.91
Line coverage per module           0.08   0.69
Function coverage per module       0.07   0.73
Block coverage per unit            0.05   0.62
Line coverage per unit             0.08   0.45
Function coverage per unit         0.16   0.12
Block coverage per file           −0.15   0.02
Line coverage per file            −0.11   0.07
Function coverage per file         0.13   0.03
Cyclomatic complexity              0.08   0.05
Cyclomatic complexity per unit    −0.00   0.94
Cyclomatic complexity per module  −0.11   0.57

Table 6.1: Summary of results

As we can see from the results, only two data series have a P-value below 0.05, the common threshold for determining the significance of a statistic. We must discard any correlation observed for the other series, as we cannot be certain that a correlation really exists.

The two statistics that we consider significant are function coverage per file and block coverage per file. For these statistics, τ = 0.134488563188 and τ = −0.145847264474 respectively, which Bachmann et al. [21] consider to be weak correlations. At first glance, that result seems useful: both function coverage and block coverage have a weak correlation to found errors. However, the observant reader will notice that the correlation for function coverage is the opposite of what would be useful for predicting faults in software: the number of found errors increases with increasing coverage. This result contradicts the findings of Hutchins et al. [2] as well as Inozemtseva et al. [3], who both found a negative correlation. The result for block coverage, on the other hand, is more in line with the findings of these two previous studies.

The results for block coverage mean that the metric can possibly be useful to look at when developing software. Hypothetically, if a team of software developers were to always prioritise a high block coverage for all code they developed, then they would introduce marginally fewer bugs, as can be seen in the results of this thesis. It shall be noted, though, that the correlation is weak, so this would be a very inefficient way of reducing the number of bugs produced. It must also be noted that although there were only a few files with a block coverage lower than 60%, these files do not at all appear to be more faulty. That means that even files that are very poorly tested according to the block coverage metric manage to keep a low fault rate.

In the case of cyclomatic complexity, none of the statistics are significant. This result is in line with what Shepperd [5] found: that the cyclomatic complexity metric fails to impress. The results further support the findings of Basili et al. [4], who claimed that complex modules were no more error prone than other modules, which agrees with our inability to find any correlation at all.

Overall, the results are not at all surprising given what previous studies have found.

6.2 Method

This study uses real software faults to study the correlation between software metrics and faults. Some previous studies, such as the ones by Hutchins et al. [2] and Inozemtseva et al. [3], use artificial faults instead. Both approaches have advantages and disadvantages. One big disadvantage of using real software faults is that it reduces the reliability of the experiment. As the data used in this study is not made public, any attempt to recreate the experiment would have to use other data than this thesis does, and more likely than not the results would differ from the results in this thesis. Nonetheless, the use of real faults increases the validity of the study. If there is a difference between the kind of faults that humans introduce by mistake and the kind of faults that are introduced artificially in studies, we are much more interested in the correlation between software metrics and the faults introduced by humans by mistake. Using real faults ensures that we find the correlation that interests us, without even having to worry about whether such a difference exists.

Apart from the data used in the experiment, no part of the method in this thesis is in any way secret, and the method requires no equipment other than an ordinary computer to repeat.


One thing to consider when looking for correlations between software metrics and faults is that the root of a software fault might not always be the place where it manifests itself. In an attempt to remedy this problem, this study also measures the mean of a metric in a module against the number of RPTERRORs for that module. The reasoning behind this is that one method with good metrics can fail because of faulty input from other methods with worse metrics. For future work, it might be of interest to consider the method or file with the worst metrics in a module. The reasoning behind that would be that it is enough to have one faulty method in a module for all of the other methods to receive faulty data.

RPTERRORs are not real errors in the sense that bug reports from humans are. This is especially true as developers have to add RPTERROR checks manually to the source code, and it is possible that they miss several important checks that could and should have been made. RPTERRORs are, however, a very practical source of faults, as they are generated automatically and tell you exactly where the error occurred. A bug report rarely contains that information. Furthermore, I argue that the absence of a correlation between coverage and RPTERRORs makes it highly unlikely that one would exist between coverage and faults in a bug database. The reasoning behind this statement is that the developers must think of the cases where the code can go wrong in order to write the RPTERROR, the same kind of thinking that goes into forming the test cases and writing the code in the first place. If a correlation is not to be found between coverage and RPTERRORs, which are errors that the developers anticipate, then it would seem very unlikely that there would be a correlation between coverage and faults discovered by clients, as the faults from clients can be of an entirely different nature, such as large architectural problems, whereas RPTERRORs must be errors of the kind that the developers anticipate.

The sources used in this thesis were selected with the utmost care. All sources used to support a claim are published by respectable organisations such as ACM and IEEE, where the papers have been peer reviewed. When choosing between two papers to use as a source, the one with the most citations was always picked. Where empirical studies are referenced, there is always at least one other empirical study with a different result, to show that there is a lack of consensus in that particular field. The sources used are of varying nature, such as books, conference papers, and web pages.

6.3 The work in a larger context

Software plays a bigger role in society than it ever has before. We trust software for our banking needs, to apply the brakes in our cars in the most efficient way and to purchase items over the internet. The software studied in this thesis is absolutely vital to today’s infrastructure.

It is therefore important that the software engineering discipline has some quantifiable metrics to apply to software, so that the quality of code can be objectively measured across different code bases.

This study contributes to the understanding of how software quality can be quantified, how it cannot be quantified, and what software quality means in the first place.


Chapter 7

Conclusions

We have seen some varying results for the correlation between code coverage, cyclomatic complexity and software faults. The purpose of this thesis was to find out if such correlations exist and, if so, how strong they are. As such, this thesis has accomplished its purpose. We have found that there is little to no correlation between the code coverage metrics and software faults. We have also found that there is no correlation whatsoever between cyclomatic complexity and software faults. This means that both metrics are essentially useless as tools to predict faults in software.

If it is the case that software developers indeed use cyclomatic complexity and code coverage to identify which parts of a piece of software are likely to be faulty, the results of this thesis are reason to be worried. It means that software developers lack tools to reason about the robustness and quality of software.

For future work, it would be interesting to see how other software metrics perform. Further, it would be interesting to see a study where the software faults were manually sourced from a bug database rather than automatically extracted in some way. If neither of those approaches yields results that indicate better performance than what was found in this thesis, it would appear that software quality is a more complex issue than we previously believed, and that further research must take place if we want to find a way to quantify it reliably.


Bibliography

[1] Myers GJ, Badgett T, Sandler C. The Art of Software Testing. John Wiley & Sons; 2004.

[2] Hutchins M, Foster H, Goradia T, Ostrand T. Experiments of the effectiveness of dataflow- and controlflow-based test adequacy criteria. In: Proceedings of the 16th International Conference on Software Engineering. IEEE Computer Society Press; 1994. p. 191–200.

[3] Inozemtseva L, Holmes R. Coverage is not strongly correlated with test suite effectiveness. In: Proceedings of the 36th International Conference on Software Engineering. ACM; 2014. p. 435–445.

[4] Basili VR, Perricone BT. Software errors and complexity: an empirical investigation. Communications of the ACM. 1984;27(1):42–52.

[5] Shepperd M. A critique of cyclomatic complexity as a software metric. Software Engineering Journal. 1988;3(2):30–36.

[6] Zhang H, Zhang X, Gu M. Predicting defective software components from code complexity measures. In: 13th Pacific Rim International Symposium on Dependable Computing (PRDC 2007). IEEE; 2007. p. 93–96.

[7] Rational Purify. Accessed: 2015-03-16. http://unicomsi.com/products/purifyplus/.

[8] Boehm BW, Brown JR, Kaspar H. Characteristics of software quality. 1978.

[9] Buxton JN, Randell B. Software Engineering Techniques: Report on a Conference Sponsored by the NATO Science Committee. NATO Science Committee; available from Scientific Affairs Division, NATO; 1970.

[10] Gaur G, Suri B, Singhal S. Overview of software engineering metrics for procedural paradigm. In: IT in Business, Industry and Government (CSIBIG), 2014 Conference on. IEEE; 2014. p. 1–5.


[11] Bertolino A. Software testing research: Achievements, challenges, dreams. In: 2007 Future of Software Engineering. IEEE Computer Society; 2007. p. 85–103.

[12] Miller JC, Maloney CJ. Systematic Mistake Analysis of Digital Computer Programs. Commun ACM. 1963 Feb;6(2):58–63.

[13] DeMillo RA, Lipton RJ, Sayward FG. Hints on test data selection: Help for the practicing programmer. Computer. 1978;11(4):34–41.

[14] McCabe TJ. A complexity measure. IEEE Transactions on Software Engineering. 1976;(4):308–320.

[15] Parsec. Accessed: 2015-04-02. https://github.com/aslatter/parsec.

[16] Attoparsec. Accessed: 2015-04-02. https://github.com/bos/attoparsec.

[17] Shafranovich Y. Common Format and MIME Type for Comma-Separated Values (CSV) Files; 2005. Internet RFC 4180.

[18] Hunter JD. Matplotlib: A 2D graphics environment. Computing In Science & Engineering. 2007;9(3):90–95.

[19] Arndt S, Turvey C, Andreasen NC. Correlating and predicting psychiatric symptom ratings: Spearman's r versus Kendall's tau correlation. Journal of Psychiatric Research. 1999;33(2):97–104. Available from: http://www.sciencedirect.com/science/article/pii/S0022395698900462.

[20] Evanco WM. Prediction models for software fault correction effort. In: Software Maintenance and Reengineering, 2001. Fifth European Conference on. IEEE; 2001. p. 114–120.

[21] Bachmann A, Bernstein A. When process data quality affects the number of bugs: Correlations in software engineering datasets. In: Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on. IEEE; 2010. p. 62–71.

[22] Jones E, Oliphant T, Peterson P, et al.. SciPy: Open source scientific tools for Python; 2001–. [Online; accessed 2015-04-23]. Available from: http://www.scipy.org/.


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

