
Effectiveness of Inadequate Test Suites

A Case Study of Mutation Analysis

HIKARI WATANABE

Degree project in Computer Science and Engineering, second cycle, 30 credits
Stockholm, Sweden 2017

DD221X, Degree Project in Computer Science (30 ECTS credits)
Master's Programme in Computer Science (120 ECTS credits)
KTH Royal Institute of Technology, Year 2017

Supervisor at CSC: Karl Meinke
Examiner at CSC: Cristian M Bogdan


Abstract

How can you tell whether your test suites are reliable? This is often done through the use of a coverage criterion, which defines a set of requirements that the test suites need to fulfill in order to be considered reliable. The most widely used criteria are those referred to as code coverage, where the degree to which the code base is covered is used as a measure of how good the test suites are. Achieving high coverage would indicate an adequate test suite, i.e. reliable according to the standards of code coverage. However, covering a line of code does not necessarily mean that it has been tested. Thus, code coverage can only tell you what parts of the code base have not been tested, as opposed to what has been tested.

Mutation testing, on the other hand, is an approach to evaluating the adequacy of test suites through their fault detection ability, rather than how much of the code base they cover.

This thesis performs mutation analysis on a project with inadequate code coverage. The present testing effort at the unit level is evaluated, and the costs and benefits of adopting mutation testing as a testing method are explored.

Sammanfattning

How do you know when tests are reliable? A coverage criterion is often used, defining a set of requirements that the tests must fulfill in order to be considered reliable. The most widely used criteria are those referred to as code coverage, where the degree to which the codebase is covered is used as a measure of test reliability. High coverage indicates adequate tests, i.e. reliable according to code coverage. However, covering a line of code does not necessarily mean that it has been tested. Code coverage can thus only show which parts of the codebase have not been tested, rather than what has been tested. Mutation testing, on the other hand, is a way to evaluate the effectiveness of tests through their fault detection ability, rather than how much of the codebase they cover.

This thesis performs mutation analysis on a project with inadequate code coverage. The quality of the current tests at the unit level is evaluated, and the costs and benefits of adopting mutation testing as a testing method are explored.

Keywords


Preface

I would like to thank my team at NASDAQ, who helped and supported me throughout the project. Special thanks to:

Kjell Paulson at NASDAQ, for having me in his team


Contents

1 Introduction
  1.1 Objective
  1.2 Delimitations
  1.3 Related Work
2 Background
  2.1 Mutation Testing
    2.1.1 RIP Model
    2.1.2 Mutation Score
    2.1.4 Mutation Operators
    2.1.5 Equivalent Mutants
    2.1.6 Cost reduction
  2.2 Theory behind Mutation Testing
  2.3 Mutation System
3 Methodology
  3.1 Codebase
  3.2 Sample Space
  3.3 Generating Unit Tests
  3.4 Mutation System
  3.5 Mutation Analysis
  3.6 Coverage Metrics and Mutation Coverage
4 Results
  4.1 The Sample Space
  4.2 Mutation Analysis
    4.2.1 Original Test Suites
    4.2.2 Generated Test Suites
    4.2.3 Performance
  4.3 Linear Regression Analysis
5 Discussion
  5.1 Quality of Unit Tests
  5.2 Cost and Benefits of Mutation Testing
  5.4 Reflection
  5.5 Sustainability and Societal Aspects
  5.6 Conclusions
  5.7 Future Work
References


Chapter 1

Introduction

This chapter introduces the topic and objective of this thesis project along with the research questions and relevance of this study.

Software testing remains one of the most important, and most expensive, aspects of ensuring high software quality. According to the Capgemini World Quality Report, in 2015 budgets for quality assurance and testing rose to an average of 35% of total IT spending, a significant 9% increase from 2014, with a prediction that the average will reach 40% by the year 2018 [WQR].

At its core, software testing is an endeavor for higher quality, typically through the detection of dormant faults. However, the growing size and complexity of software entail a practically infinite input space, making it infeasible to completely test entire systems. Testing is thus always a trade-off between the cost of testing and the potential cost of undiscovered faults. To overcome this fundamental limitation of testing, developers need a structured way to assess the effectiveness, or quality, of test suites in terms of detecting faults.

Intuitively, the most logical measure of a test suite's fault detection ability is simply the number of real faults it detects. Faults discovered during a product's lifetime can be used in retrospect to assess the adequacy of its test suites. However, this approach does not lend itself well to a development process. Thus, a method is required that predicts the quality of test suites based solely on the suites themselves and the current build of the system under test (SUT). The most common such approach is the use of coverage criteria [AO17]. Coverage criteria define the properties that a test suite needs to fulfill; for example, statement coverage requires that every statement be executed, and branch coverage that every branch be traversed. The coverage measurement then serves as an indicator of adequacy, e.g. a test suite with 80% statement coverage is considered higher quality than a test suite with 70% statement coverage.


Mutation analysis is the process of injecting small faults into the SUT through syntactic changes, creating copies, or mutants, each containing one fault. Test suites that are able to detect the injected faults are then considered adequate, i.e. reliable.

Mutation testing can be used for testing at both the unit level and the integration level [DM96]. It has been applied to many programming languages, e.g. Fortran, Ada, AspectJ, Java and C. Besides its use at the software implementation level, it has been applied at the software design level to test the specifications of the SUT [MR01].

The concept of mutation was first introduced in 1971 by Richard Lipton in a class term paper. It was later developed and published by DeMillo et al. in 1978. Over four decades of history and a wide range of studies have resulted in a large body of literature [JH11, OU00].

Mutation coverage subsumes many other coverage criteria [OV96], where subsumption is defined as: coverage criterion Ca subsumes Cb if and only if every test suite that satisfies Ca also satisfies Cb. Mutation coverage has also been shown to predict actual fault detection ability better than other criteria in some settings, and never shown to be worse [GJ14]. However, mutation testing is computationally expensive and difficult to apply, and although there has been much research [JH11], it is still regarded as academic and is not widely adopted within industry.

1.1 Objective

NASDAQ Technology AB is an American fin-tech company. NASDAQ is a leading provider of trading, clearing, exchange technology, listing, information and public company services across six continents [NH17]. The business-critical nature of the financial domain necessitates a solid testing effort with reliable test suites.

In this thesis, the test suites of one of NASDAQ's software projects are evaluated at the unit level. Historically, the project has lacked a set structure for testing, which has resulted in low coverage. The team maintaining the project is taking steps to supplement the present testing efforts; however, the abundance of legacy code with high interdependency has made it difficult to create unit tests. Unit tests are widely recognized as an integral part of a development process. Among other benefits, they serve as a safety net during the inevitable refactoring of old code, detecting undesired behaviors and helping to locate the fault.

On investigating previous system failures, the project team discovered that only a handful of the critical failures could have been prevented with unit tests. Thus, the team is doubtful of the gain from further unit tests and reluctant to invest further resources. However, assessing the quality of the present unit tests would determine the effectiveness of past approaches and could prove useful in convincing the team otherwise.


The research questions can be defined as: What is the quality of present unit tests, and what are the cost and benefits of adopting mutation testing?

1.2 Delimitations

The measurements used within the thesis are directly dependent on the metrics reported by the tools. Although simple coverage measurements such as line coverage can easily be cross-validated because of their prevalence, path coverage is far more difficult to validate. The ability to validate measurements is therefore somewhat limited.

The performance comparison of mutation testing is limited by the lack of possibility to augment or modify the present test suites. The SUT is extremely large and complex; therefore, creating meaningful test suites without the assistance of a developer from the project would be far too time-consuming. Without a way to augment the test suites, it is practically impossible to measure the performance of mutation testing at different degrees of testing effort.

1.3 Related Work

A study conducted by Nica et al. [NW11] attempted to assess the cost of applying mutation testing on a real-world software system. The study applies three widely recognized mutation testing tools, namely MuJava, Jumble and Javalanche, on the open source project Eclipse. The study concluded that although configuring and applying the tools is simple enough, special attention should be paid to the high execution time.

A recent study conducted by Gopinath et al. [GJ14] investigated the correlation between mutation kill ratio and widely used coverage criteria (statement, block, branch and path coverage). The study considered hundreds of open source Java projects amassed from GitHub repositories. They measured the coverage and performed mutation analysis on the test suites. The data was then analyzed through regression analysis, measuring both τβ (the Kendall rank correlation coefficient) and R² (the coefficient of determination). The same experiment was conducted on both the projects' original test suites and suites automatically generated through the Randoop testing tool. The study found correlation between the widely used coverage criteria and the mutation kill ratio, with statement coverage being the best at R² = 0.94 for original tests and 0.72 for generated tests. The aim of Gopinath et al. was to measure the ability of coverage criteria as a predictor of suite quality from the perspective of non-researchers, and to present a possible alternative to the computationally expensive mutation testing.


Chapter 2

Background

This chapter presents the background material to understand mutation testing and the tools used throughout this thesis.

2.1 Mutation Testing

The 70s saw the rise of Van Halen. Like any other band, when Van Halen was hired to play at a venue they provided the promoter with a contract rider. The rider included everything from sound and lighting requirements to food and drinks. Listed among these was a big bowl of M&M's, but absolutely no brown ones. This was not just a superstition or some rock star ridiculousness; it served a very specific purpose. They buried the odd request in the rider to make sure that the contract was thoroughly read. Finding a brown M&M meant there might be other things that the promoter had missed [VA01].

Van Halen made sure the rider was thoroughly read by hiding an odd item for the promoter to find. In a similar way, mutation testing makes sure the SUT is thoroughly tested by introducing artificial faults for the test suites to find. The process creates several copies of the code, each containing one fault. Existing test cases are executed against the copies, with the objective of distinguishing the original program from the faulty ones, thereby determining the adequacy of the existing test suites.

Let P be a program that functions correctly on some test set T. The program is subjected to mutation operators that introduce small artificial faults, thereby creating mutants (refer to figure 2.1) that differ from the original program in very small ways. Note that each mutant contains only one fault. Let these mutants be called P1, P2, …, Pn. Running each mutant against T, there are two possible outcomes:

1. Some Pi gives a different result than P

2. Some Pi gives the same result as P

In case (1), Pi is said to be killed; in case (2), Pi is said to be alive. If a mutant is killed, the tests were sensitive enough to detect the introduced fault. If a mutant is left alive, either the tests were not sensitive enough to detect the introduced fault and must be augmented, or Pi and P turn out to be functionally equivalent (henceforth noted as Pi ≡ P) [DL78, AB79].

Program P:            Mutant Pi:
… if (a ≤ b) …        … if (a ≥ b) …

Figure 2.1: Example of a mutant
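To make the example concrete, the sketch below (hypothetical code, not taken from the SUT) embeds the program and mutant of figure 2.1 in two Java methods; the mutant differs from the original by a single relational operator.

    public final class Figure21 {
        static int smaller(int a, int b) {          // program P
            if (a <= b) { return a; }
            return b;
        }

        static int smallerMutant(int a, int b) {    // mutant Pi: <= replaced by >=
            if (a >= b) { return a; }
            return b;
        }

        public static void main(String[] args) {
            // Inputs with a == b cannot distinguish P from Pi, since both
            // conditions agree there; the call below can distinguish them:
            System.out.println(smaller(1, 2) + " vs " + smallerMutant(1, 2)); // 1 vs 2
        }
    }

A test asserting that smaller(1, 2) returns 1 therefore kills this mutant, because the mutated method returns 2.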

2.1.1 RIP Model

The conditions for a mutant to be considered killed can be expressed more formally as three conditions, together referred to as the RIP model [YH14, VM97, AO17] and illustrated in the sketch after the following list.

- Reachability: the location of the mutation must be reached by the test.

- Infection: after the location is executed, the state of the program must be infected, i.e. differ from the corresponding state of the original program.

- Propagation: the infection must propagate through execution and result in an erroneous output or final state.
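The following sketch shows how a test can satisfy some RIP conditions but not others. The code is hypothetical, and the assumed mutation replaces signum(x) with signum(x + 1).

    public final class RipExample {
        // Mutated line: signum(x) becomes signum(x + 1) in the mutant.
        static boolean isPositive(int x) {
            int sign = Integer.signum(x);
            return sign > 0;
        }

        public static void main(String[] args) {
            // isPositive(5):  mutation reached, but sign is 1 either way
            //                 -> no infection.
            // isPositive(-1): reached and infected (sign is -1 vs 0), yet both
            //                 versions return false -> no propagation; survives.
            // isPositive(0):  reached, infected (0 vs 1) and propagated
            //                 (false vs true) -> this test kills the mutant.
            System.out.println(isPositive(0));
        }
    }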

2.1.2 Mutation Score

As defined by DeMillo et al. [DL78], a test set that manages to kill all mutants, except for those equivalent to P, is adequate. In other words, a test set is adequate if it distinguishes the program from the mutant programs.

The extent to which a coverage criterion is satisfied is measured as a coverage score, calculated in terms of the imposed requirements. In the case of mutation testing it is referred to as the mutation score [AO17, OU00]. Let M be the total number of mutants, D the number of killed mutants and E the number of equivalent mutants [JH11, AB79, GO92]. The mutation score can be defined as:

    MS(T) = D / (M − E)    (2.1)
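As a worked example with hypothetical numbers: an analysis that creates M = 100 mutants, kills D = 60 of them and identifies E = 20 as equivalent yields MS(T) = 60 / (100 − 20) = 0.75, i.e. the test set kills 75% of the killable mutants.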

2.1.4 Mutation Operators

A mutation operator is a syntactic or semantic transformation rule applied to a SUT to create mutants. Operators are created with one of two goals: to inject faults representative of common mistakes programmers tend to make, or to enforce testing heuristics, e.g. executing every branch.

Key to successful mutation testing is well-designed mutation operators. Syntactically illegal mutants would be caught by the compiler and be of no value; these are called stillborn mutants and should be discarded or not generated at all. A trivial mutant, in contrast, can be killed by any test.

A well-known set of operators was designed based on studies of common programmer errors and implemented in the Mothra mutation system [KO91, DG88]. The full list and a detailed description of each operator can be found elsewhere [KO91]. The operators were adapted to Java by Ammann et al. [AO17]; one of them is:

Relational Operator Replacement - ROR

Replace each occurrence of one of the relational operators (<, ≤, >, ≥, ==, !=) by each of the other operators and by falseOp and trueOp, where falseOp always results in false and trueOp always results in true. Applying the ROR operator to, for example, the program P shown in figure 2.1 would generate seven possible mutants:

if (a < b), if (a > b), if (a ≥ b), if (a == b), if (a != b), if (false), if (true)

2.1.5 Equivalent Mutants

One of the biggest hurdles of mutation testing is the equivalent mutant problem. Some mutants can turn out to be semantically equal to the original program although they are syntactically different. Without detecting all the equivalent mutants, the tester cannot have complete confidence in the test data: there would simply be no way to be sure whether the tests are inadequate or the live mutants are equivalent.

An equivalent mutant will always produce the same output as the original program and is thus impossible to kill. Refer to figure 2.2 for an example: although they have two different conditions, both program P and mutant Pi behave in exactly the same way, hence they are equivalent.

Detecting equivalence between two programs is an undecidable problem [BA82], i.e. there is no general algorithmic solution. The situation, however, is somewhat different for the equivalent mutant problem: we do not need to determine the equivalence of an arbitrary pair of programs, but rather of two syntactically very similar programs. Although this was also proven undecidable, it has been suggested that it is possible in many specific cases [OP97, OC94].

Program P:                 Equivalent Mutant Pi:
int a = 0;                 int a = 0;
while (a < 5) {            while (a != 5) {
    a++;                       a++;
}                          }
…                          …

Figure 2.2: Example of equivalent mutant

2.1.6 Cost reduction


There are several approaches proposed to reduce the computational cost of mutation testing. These methods can be categorized as mutant reduction and execution cost reduction. This section presents the most studied methods in each category, according to the survey by Jia et al. [JH11].

2.1.6.1 Mutant Reduction Techniques

Mutant reduction techniques aim to reduce the number of generated mutants without a significant loss of effectiveness. Let MS_T(M) denote the mutation score for a test set T applied on the mutants M. The mutant reduction problem can then be defined as the problem of finding a subset M′ of M such that MS_T(M) ≈ MS_T(M′) [JH11].

Mathur et al. proposed the idea of constrained mutation: applying mutation testing with only the crucial mutation operators. The concept was later developed by Offutt et al. [OR93] as selective mutation, an approximation technique that reduces the number of created mutants by reducing the number of mutation operators used. Mutation operators generate varying numbers of mutants; some operators have higher applicability and will generate far more mutants than others, many of which may turn out to be redundant [JH11, OU00, OL96, MA91].

A study on selective mutation conducted by Offutt et al. [OL96] on 10 Fortran programs concluded that 5 of the Mothra mutation operators are sufficient to effectively conduct mutation testing.
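As a rough sketch of the mutant reduction idea, the hypothetical Java program below compares the mutation score over a full mutant set with the score over an operator-selected subset; the operator names and kill outcomes are invented for illustration.

    import java.util.*;

    public final class SelectiveMutation {
        record Mutant(String operator, boolean killed) {}

        // Mutation score over a mutant set, assuming no equivalents (E = 0).
        static double score(List<Mutant> mutants) {
            return (double) mutants.stream().filter(Mutant::killed).count()
                    / mutants.size();
        }

        public static void main(String[] args) {
            List<Mutant> all = List.of(
                    new Mutant("ROR", true), new Mutant("ROR", false),
                    new Mutant("AOR", true), new Mutant("UOI", true),
                    new Mutant("SVR", false), new Mutant("ABS", true));

            Set<String> selectedOps = Set.of("ROR", "AOR", "UOI");
            List<Mutant> subset = all.stream()
                    .filter(m -> selectedOps.contains(m.operator())).toList();

            // Selective mutation succeeds when MS over the subset stays
            // close to MS over the full set.
            System.out.printf("MS(all) = %.2f, MS(subset) = %.2f%n",
                    score(all), score(subset));
        }
    }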

2.1.6.2 Execution Cost Reduction Technique

Another way to reduce the computational cost, other than reducing the number of mutants generated, is to optimize the mutant execution process.

Traditional mutation testing is often referred to as strong mutation. In strong mutation, for a given program P, a mutant Pi is said to be killed only if the original program P and the mutant Pi produce different outputs.

Proposed by Howden [HO82], weak mutation is an approximation technique that optimizes the execution of strong mutation by relaxing the definition of "killing a mutant". Weak mutation only requires the first two conditions of the RIP (Reachability, Infection and Propagation) model to be satisfied. A program P is assumed to be constructed from components {c1, c2, …, cn}. Let Pi be a mutant created by changing the component ci; the mutant Pi is said to be killed if the internal state of Pi is incorrect after the execution of the mutated component. As such, weak mutation trades test effectiveness for reduced computational cost [JH11, AO17].

2.2 Theory behind Mutation Testing

Section 2.1 gave an overview of mutation testing. This section presents the theory that makes mutation testing possible.

Mutation testing is grounded on two fundamental hypotheses, first introduced by DeMillo et al. in 1978 [DL78], stated as:


- Competent programmer hypothesis: programmers are competent, i.e. they create programs close to being correct.

- Coupling effect: tests that detect small errors are so sensitive that they implicitly detect more complex errors.

Suppose we have a program P, which is meant to compute a function F with an input domain D. The traditional approach to determining the correctness of P would be to find a subset T of D such that

    if for all x in T, P′(x) = F(x)        (2.2)
    then for all x in D, P′(x) = F(x)

where P′ is the function actually computed by P. The subset T is then referred to as a reliable test set, i.e. the set of input data needed to determine the correctness of P. However, finding T requires an exhaustive testing effort and is undecidable [HO76] for any non-trivial program.

Mutation testing, on the other hand, is a technique that attempts to draw a weaker conclusion: find a subset T of D such that

    if P is not pathological
    and for all x in T, P′(x) = F(x)       (2.3)
    then for all x in D, P′(x) = F(x)

A program P is not pathological if it was written by a competent programmer, i.e. it follows the competent programmer hypothesis. Mutation testing assumes that P is close to the correct program Pc; hence, either P = Pc or some other program Q close to P is correct.

Figure 2.3: Neighborhood of P within the domain of all possible programs (the set Φ and its subset μ)


Let Φ be the set of programs close to P, with the assumption that P or some other program Q within Φ is correct. The approach of mutation testing to finding the subset T is to eliminate the alternatives. We formulate the method as: find a subset T of D such that

    for all x in T, P(x) = F(x)
    and for all Q in Φ                     (2.4)
    either Q ≡ P
    or for some x in T, Q(x) ≠ P(x)

If we can find a subset T that satisfies formula 2.4, then we say that P passes the Φ mutant test, or that T differentiates P from all other programs in Φ. This can be explained as: given that P performs correctly on test set T, each program Q in Φ is either equivalent to P or produces a different output than P on some input in T. Instead of having to exhaustively test P with a practically infinite number of test inputs, we can focus on differentiating P from Φ. However, the problem remains too large.

The coupling effect hypothesis says that there is often a strong coupling between the members of Φ and a small subset μ (refer to figure 2.3). The subset μ can be thought of as a set of programs very close to P, such that if P passes the μ mutant test with test data T, then P will also pass the Φ mutant test with test data T. The subset μ is referred to as the mutants of P, and the task of differentiating P from Φ is reduced to finding μ and differentiating P from μ [BD80].

2.3 Mutation System

Mutation testing is performed using a so-called mutation system. A mutation system implements the mutation analysis process, i.e. generating the mutants and handling them.

Figure 2.4 shows a generic process for mutation analysis. Let P be a program and T a set of tests to be evaluated. When P is submitted to a mutation system, the system first creates the mutants P1, P2, …, Pn. Next, T is executed against P. If a test fails, we have discovered a bug within P and it needs to be corrected; otherwise, T is executed against the mutants P1, P2, …, Pn. If the output of a mutant Pi differs from the output of P, we mark Pi as killed. Once all the tests in T have been executed, the mutation score is calculated. If there are still live mutants, the tester can augment T to target the live mutants and the process is repeated. Equivalent mutants are marked, either manually or through some automated technique, and are not considered in the next iteration.


Figure 2.4: Generic mutation testing process [JH11, OU00]

The above described process of mutation analysis is based on the theory from section 2.2. Creating the mutants P1, P2, …, Pn using mutation operators is an attempt to find μ. The repeated process after creating the mutants implements the method of formula 2.4, i.e. differentiating P from μ.
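The loop of figure 2.4 can be summarized in code. The sketch below is a toy abstraction for illustration only: programs are modeled as int-to-int functions and tests as input/expected-output pairs, whereas a real system such as PIT operates on byte code and JUnit suites.

    import java.util.*;
    import java.util.function.IntUnaryOperator;

    public final class MutationLoop {
        record TestCase(int input, int expected) {}

        static boolean passes(IntUnaryOperator prog, List<TestCase> tests) {
            return tests.stream()
                    .allMatch(t -> prog.applyAsInt(t.input()) == t.expected());
        }

        public static void main(String[] args) {
            IntUnaryOperator p = Math::abs;                 // program under test
            List<IntUnaryOperator> mutants = List.of(
                    x -> x,                                 // mutant: abs removed
                    x -> -x);                               // mutant: sign flipped
            List<TestCase> t = List.of(new TestCase(-2, 2), new TestCase(3, 3));

            // Step one of figure 2.4: P itself must pass T before analysis.
            if (!passes(p, t)) throw new AssertionError("correct P first");

            // A mutant failing any test in T counts as killed.
            long killed = mutants.stream().filter(m -> !passes(m, t)).count();

            // Mutation score per equation 2.1, assuming E = 0.
            System.out.printf("MS = %d/%d%n", killed, mutants.size());
        }
    }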


Chapter 3

Methodology

This chapter presents the methodology used throughout this thesis to answer the research questions, including the process outline and a description of each step.

An empirical approach was adopted, since the problem statement of this thesis is directly reliant on measured data. The experimental model consists of the following steps:

1. The codebase of the SUT is statistically analyzed.

2. An appropriate sample space is chosen from the codebase.

3. A second set of unit tests is generated through an automatic test suite generation tool, to perform mutation analysis on and to compare performance against.

4. A mutation testing tool is chosen as the mutation system.

5. Mutation analysis is performed on both the original and the generated suites.

6. Common coverage criteria are compared to mutation coverage.

7. The results are evaluated and the performance of mutation analysis is compared between the two data sets.

3.1 Codebase


Each module was measured in terms of lines of code (LOC), cyclomatic complexity (CC), line coverage (LC) and number of unit tests (#UT). LOC is the number of executable lines of code and CC is the number of independent paths within the code.

3.2 Sample Space

The measurements presented here and the final sample space discussed in section 4.1 can to some extent also be found in the work of Mishra [SM17], since the same codebase was evaluated; that work can be referred to for further information.

Table 3.1 contains data from the statistical measurement of the SUT. Immediately apparent when observing the measurements is that the first three modules are larger in terms of LOC. Unit tests are few in number and concentrated in two modules.

Name       LOC    CC     LC     #UT
Kenny      81448  16132  5.6%   233
Mark       61248  11520  1.2%   7
Perry      37728  7005   28.5%  172
Sally      16757  2260   0.9%   33
Martin     7269   1197   0.0%   0
Conan      6074   1384   13.1%  74
Coral      5278   965    8.4%   34
Patrick    3285   565    14.4%  10
Derek      3076   598    2.4%   1
Tommy      1745   290    0.0%   0
Brad       1137   209    0.0%   0
Daniel     1132   243    29.8%  5
Emil       917    183    19.7%  2
Uther      831    178    27.8%  17
Danny      819    154    0.0%   0
Francine   585    126    0.0%   0
Sebastian  369    48     0.0%   0
Waldo      164    41     0.0%   0

Table 3.1: Measurements of modules in the SUT

Each module is built separately and tested with its own test suite, and can as such be looked at individually. To quantify the overall test suite quality of the clearing engine, mutation analysis would have to cover every module. Mutation analysis generates mutants for every mutable line of code, regardless of the absence of tests. Analyzing the quality of the entire project would inevitably result in a very low mutation score, and the data would not fairly represent the quality of the currently in-place unit tests.


The test suites within the modules Kenny and Perry best represent the most recent testing efforts. Further, they are two of the largest and most complex modules, constituting a good portion of the project. Hence, they were chosen as the sample space to be analyzed.

3.3 Generating Unit Tests

Performing mutation analysis on the current test suites is sufficient for evaluating the quality of the present unit tests. However, to assess the cost of mutation analysis, it is necessary to obtain a second set of measurements. Mutation analysis of test suites with a higher number of tests allows a performance comparison. Differences in execution time can be observed and explained through factors affecting the process, e.g. the number of test cases and the code coverage. The numbers of mutants created, killed, and never covered by tests can be compared to further understand the differences in results.

Although theoretically possible, it was deemed impractical to create a second set of test suites by hand. Instead, an automatic test suite generation tool called EvoSuite [EVO1] was used to create a second set of test suites, which was analyzed separately from the original suites.

While test cases can be generated automatically, the task of verifying correctness remains a problem. Faults that cause exceptions and program crashes can easily be detected, but only testing for such obvious faults leads to tests of negligible value.

EvoSuite automates the creation of test suites, adopting a search-based approach and state-of-the-art techniques to create tests with small assertions, i.e. tests for small faults that do not cause an exception. EvoSuite first generates test suites and later optimizes them to achieve a high coverage criteria score, e.g. line, branch and weak mutation coverage, thus generating test suites with high coverage [EVO1, FA11].

For further information on the inner workings of EvoSuite, the study by Fraser et al. [FA11] can be consulted.
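For a feel of the output, the sketch below imitates the kind of test EvoSuite generates: small, self-contained, and asserting on concrete return values, so that non-crashing faults are still detected. The class under test is a hypothetical stand-in, not code from the SUT or actual EvoSuite output.

    import static org.junit.Assert.assertEquals;
    import org.junit.Test;

    class Calculator {                          // hypothetical class under test
        int add(int a, int b) { return a + b; }
    }

    public class Calculator_ESTest {
        @Test
        public void testAddReturnsSum() {
            Calculator calc = new Calculator();
            // A small assertion: a mutant such as a + b -> a - b fails here
            // without needing to crash the program.
            assertEquals(7, calc.add(3, 4));
        }
    }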

3.4 Mutation System

Mutation analysis can be defined as a two-step process: generate mutants, then check whether the mutants are detected by the tests. Generating mutants is essentially done by creating copies of the source or byte code with small changes. This process is very rarely done by hand and generally uses a mutation system. Although there are several mutation systems available for Java, most are old and come with certain usability issues, such as the lack of support for popular build tools like Maven or mocking frameworks like Mockito.


PIT applies a set of mutation operators to the byte code, generating a large number of mutant classes. Before exercising the tests against the newly created mutants, PIT first measures the line coverage (LC) of the code base. Employing the coverage information, PIT will, for each mutant, only execute the tests that cover the line containing the mutation. This optimization is significant for inadequate test suites over a large codebase, such as the one examined in this thesis.
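A rough sketch of this optimization, with hypothetical types: line coverage recorded in an initial run is used to pick, for each mutant, only the tests that cover the mutated line, and mutants on uncovered lines are reported without executing anything.

    import java.util.*;

    public final class CoverageTargeting {
        record Mutant(String className, int line) {}

        public static void main(String[] args) {
            // line number -> names of tests covering it (from a coverage run)
            Map<Integer, List<String>> coveringTests = Map.of(
                    10, List.of("testAdd", "testAddNegative"),
                    42, List.of("testFormat"));

            List<Mutant> mutants = List.of(
                    new Mutant("Calculator", 10),
                    new Mutant("Calculator", 99));   // line never covered

            for (Mutant m : mutants) {
                List<String> tests =
                        coveringTests.getOrDefault(m.line(), List.of());
                if (tests.isEmpty()) {
                    // counted as "no coverage", as in tables 4.2-4.6
                    System.out.println(m + ": no covering tests, not executed");
                } else {
                    System.out.println(m + ": execute only " + tests);
                }
            }
        }
    }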

3.5 Mutation Analysis

The mutation analysis is performed using the most stable default mutation operators in PIT, defined in the documentation as follows (a sketch after the list illustrates several of them on example code):

1. Conditionals Boundary Mutator (CBM): mutates the relational operators <, <=, > and >= to their boundary counterparts.

2. Increments Mutator (IM): mutates increments, decrements, assignment increments and assignment decrements of local variables. For example, i++ would be mutated to i--.

3. Invert Negatives Mutator (INM): inverts the negation of integers and floating-point numbers, e.g. -i would be mutated to i.

4. Math Mutator (MM): replaces a binary arithmetic operation, for either integer or floating-point arithmetic, with another operation. For example, a + b would be mutated to a - b.

5. Negate Conditionals Mutator (NCM): mutates conditionals, i.e. ==, !=, <=, >=, < and >. This operator overlaps to some extent with the conditionals boundary mutator, but is easier to kill.

6. Return Values Mutator (RVM): mutates the return values of method calls. For example, in the case of a Boolean return value, false would be mutated to true and vice versa.

7. Void Method Calls Mutator (VMCM): removes calls to methods with return type void.
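The sketch below applies several of these operators to a hypothetical method; the comments show the mutants that would be produced, following the rules listed above.

    public final class MutatorExamples {
        static int totalAbove(int[] values, int threshold) {
            int sum = 0;
            for (int v : values) {
                if (v > threshold) {        // CBM: v >= threshold
                                            // NCM: v <= threshold
                    sum = sum + v;          // MM:  sum = sum - v
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            // A test asserting that this call returns 8 kills the CBM mutant
            // precisely because one value equals the threshold: with
            // v >= threshold the value 5 is also summed, giving 13.
            System.out.println(totalAbove(new int[]{3, 5, 8}, 5)); // prints 8
        }
    }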

3.6 Coverage Metrics and Mutation Coverage

An experiment similar to that of Gopinath et al. [GJ14] and Gligoric et al. [GG13] is used in this thesis. The ability of line coverage (LC), branch coverage (BC) and path coverage (PC) to predict the mutation score (MS) is evaluated through linear regression analysis.


The data set used for the regression analysis is on a per-source-code-class basis. Each class was measured using the mentioned coverage criteria; each measurement was then combined with the MS and shown in a scatter graph. The aim was to determine how well LC, BC and PC could serve as predictors of MS. For that purpose, the coefficient of determination (R²) was calculated. R² is a measure of how well the regression line approximates the real data, i.e. a high R² would indicate that the independent variables are good predictors.
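As a sketch of the statistic involved, the following self-contained program fits a least-squares line to hypothetical (coverage, mutation score) pairs and computes R² as 1 − SS_res / SS_tot; it is an illustration, not the tooling used in the thesis.

    public final class RSquared {
        public static void main(String[] args) {
            double[] x = {0.05, 0.20, 0.35, 0.60, 0.80};  // e.g. line coverage
            double[] y = {0.03, 0.18, 0.30, 0.55, 0.70};  // e.g. mutation score
            int n = x.length;

            double mx = 0, my = 0;
            for (int i = 0; i < n; i++) { mx += x[i] / n; my += y[i] / n; }

            double sxy = 0, sxx = 0;
            for (int i = 0; i < n; i++) {
                sxy += (x[i] - mx) * (y[i] - my);
                sxx += (x[i] - mx) * (x[i] - mx);
            }
            double b = sxy / sxx, a = my - b * mx;        // fitted y = a + b*x

            double ssRes = 0, ssTot = 0;
            for (int i = 0; i < n; i++) {
                double fit = a + b * x[i];
                ssRes += (y[i] - fit) * (y[i] - fit);
                ssTot += (y[i] - my) * (y[i] - my);
            }
            System.out.printf("R^2 = %.3f%n", 1 - ssRes / ssTot);
        }
    }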


Chapter 4

Results

This chapter presents the empirical data from the experiments described in the method chapter. The sample space is motivated, after which the results from performing mutation analysis on both the original and the generated test suites are presented. Finally, the results of the regression analysis between common coverage criteria and mutation score are presented.

4.1 The Sample Space

Table 4.1 gives an overview of the modules constituting the sample space. The selection process drastically restricted the number of modules. Although the two modules combined make up half of the SUT, it is not certain that they have a similar distribution of the factors that can affect mutation analysis. The concern at this point is whether this has resulted in a skewed sample space that could jeopardize the integrity of the analysis results.

Name   LOC    CC     LC     #UT
Kenny  81448  16132  5.6%   233
Perry  37728  7005   28.5%  172

Table 4.1: Measurements of modules in the sample space

Factors to consider are the LOC and CC of classes. High LOC indicates a large number of lines to cover, implicitly reducing coverage. High CC indicates a complicated class with a large number of paths, making it difficult to achieve high quality.


Figure 4.1: Distribution of LOC and complexity per class, after the initial selection (Covered) and last selection (Core) process.

4.2 Mutation Analysis

The analysis was performed on both the original test suites and the test suites generated through EvoSuite. For each case, the test suites for Kenny and Perry were considered separately. The results are presented in tables 4.2, 4.3, 4.5 and 4.6, where each row corresponds to one of the mutation operators. The columns display, for each operator, the number of created mutants, how many of those were killed, how many were left alive, and how many were never reached due to the lack of coverage.

4.2.1 Original Test Suites

Results from the mutation analysis of the original test suites can be found in tables 4.2 and 4.3. The mutation score (MS) for both modules is very low, which was to be expected considering the low LC.

It is immediately apparent that some operators, specifically NCM, RVM and VMCM, create most of the mutants. Although this may be affected by the type of code being mutated, it is most likely due to their more widely applicable nature. For example, NCM overlaps to some degree with CBM but applies to far more situations.

The uneven numbers of mutants created for the two modules can be explained by size: Kenny has more than twice the LOC of Perry, hence far more mutants are created for it.


Operator  Created  Killed      Live  No coverage
CBM       2666     103 (4%)    84    2479
IM        1589     57 (4%)     36    1496
INM       5        1 (20%)     0     4
MM        330      31 (9%)     16    283
NCM       10091    564 (9%)    153   9374
RVM       5143     230 (4%)    48    4865
VMCM      6727     183 (3%)    184   6360
Total     26551    1169 (4%)   521   24861

Table 4.2: Analysis result of Kenny's test suites

Operator  Created  Killed      Live  No coverage
CBM       452      29 (6%)     24    399
IM        184      16 (9%)     2     166
INM       0        0 (0%)      0     0
MM        69       15 (22%)    3     51
NCM       2110     638 (20%)   122   1350
RVM       3017     595 (20%)   198   2224
VMCM      1607     41 (3%)     22    1544
Total     7439     1334 (18%)  371   5734

Table 4.3: Analysis result of Perry's test suites

An observation is that, although the MS is very low for both analyses, the ratio of killed to live mutants leans toward the killed. Table 4.4 contains the MS recalculated to consider only mutants with coverage. The test suites are effective at the parts of the code base they cover; the MS is low due to the low coverage and would most likely increase accordingly with higher coverage.

Operator  Kenny: Killed  Kenny: Live  Perry: Killed  Perry: Live
CBM       103 (55%)      84           29 (55%)       24
IM        57 (61%)       36           16 (89%)       2
INM       1 (100%)       0            0 (0%)         0
MM        31 (66%)       16           15 (83%)       3
NCM       564 (79%)      153          638 (84%)      122
RVM       230 (83%)      48           595 (75%)      198
VMCM      183 (49%)      184          41 (65%)       22
Total     1169 (69%)     521          1334 (78%)     371

Table 4.4: Ratio between killed and live mutants for analysis of Kenny and Perry

4.2.2 Generated Test Suites

The automatic generation of unit tests yielded new test suites with significantly more unit tests. The suites generated for Kenny contained 4923 unit tests with 27% LC, compared to the previous 5.6%. The suites generated for Perry contained 2037 unit tests with 40% LC, compared to the previous 28.5%. Although this is a significant increase in coverage, it is still low considering that the suites were generated with the goal of achieving a high coverage score. This can be attributed to the complex codebase and is most likely difficult to remedy.

Results from the mutation analysis of the generated test suites can be found in tables 4.5 and 4.6. The generated suites were analyzed in the same manner as the original suites. The increase in coverage was reflected by a similar increase in MS, strengthening the explanation given in section 4.2.1.


Operator  Created  Killed      Live  No coverage
CBM       2666     430 (16%)   241   1995
IM        1589     208 (13%)   188   1193
INM       5        0 (0%)      0     5
MM        330      27 (8%)     50    253
NCM       10091    1654 (16%)  780   7657
RVM       5143     1357 (26%)  397   3389
VMCM      6727     1078 (16%)  648   5001
Total     26551    4754 (19%)  2304  19493

Table 4.5: Analysis result of Kenny's generated test suites

Operator  Created  Killed      Live  No coverage
CBM       452      167 (37%)   17    268
IM        184      62 (34%)    6     116
INM       0        0 (0%)      0     0
MM        69       40 (58%)    2     27
NCM       2110     942 (45%)   105   1063
RVM       3017     768 (25%)   86    2163
VMCM      1607     500 (31%)   91    1016
Total     7439     2479 (33%)  307   4653

Table 4.6: Analysis result of Perry's generated test suites

Again, it can be observed that the ratio of killed to live mutants leans toward the killed. Table 4.7 contains the MS recalculated to consider only mutants with coverage.

Operator  Kenny: Killed  Kenny: Live  Perry: Killed  Perry: Live
CBM       430 (64%)      241          167 (91%)      17
IM        208 (52%)      188          62 (91%)       6
INM       0 (0%)         0            0 (0%)         0
MM        27 (35%)       50           40 (95%)       2
NCM       1654 (68%)     780          942 (96%)      105
RVM       1357 (77%)     397          768 (90%)      86
VMCM      1078 (62%)     648          500 (85%)      91
Total     4754 (67%)     2304         2479 (90%)     307

Table 4.7: Ratio between killed and live mutants for analysis of Kenny and Perry

4.2.3 Performance

During the two mutation analyses, performance data was gathered for both the original test suites (OTS) and the generated test suites (GTS). Table 4.8 contains an overview with line coverage (LC), number of unit tests (#UT), number of covered mutants (#CM), number of executed tests (#ET) and the execution time.

          LC     #UT   #CM   #ET    Exec Time
OTSKenny  5.6%   233   1690  4532   4 min 36 sec
GTSKenny  27%    4923  7058  71638  3 h 29 min 30 sec
OTSPerry  28.5%  172   1705  24854  1 h 21 min 21 sec
GTSPerry  40%    2037  2786  24510  50 min 56 sec

Table 4.8: Summary of mutation analysis performance


Most visible is the enormous increase in execution time between analyzing OTSKenny and GTSKenny. Although the LC increased moderately, that alone cannot explain the spike. The increased LC presumably led to more covered mutations; this, combined with the drastic increase in the number of unit tests, increased the number of test executions and hence the execution time.

The difference in execution time between analyzing OTSKenny and OTSPerry is somewhat difficult to explain. Although one test suite has higher coverage than the other, they both cover almost the same number of mutations, which, combined with a similar number of unit tests, should result in similar execution times. The most likely explanation is that only a handful of tests cover any mutation in OTSKenny, resulting in a lower number of executed tests than for OTSPerry. This indicates that the number of tests and the number of covered mutations are sufficient predictors of execution time.

The reduction in execution time between analyzing OTSPerry and GTSPerry is unexpected; judging from the case of OTSKenny and GTSKenny, the increase in LC should increase the execution time. GTSPerry has higher LC, more unit tests and more covered mutations, yet fewer executed tests, resulting in a shorter execution time. The only explanation for this phenomenon is that, even with over ten times more unit tests, fewer of the tests in GTSPerry cover any given mutation compared to OTSPerry.

4.3 Linear Regression Analysis

The results presented in this section are shared with the thesis work of Mishra [SM17]. Although the data are the same, they are incorporated differently into the two works.

Table 4.9 displays the measured Line Coverage (LC), Branch Coverage (BC) and Path Coverage (PC) for both modules.

       LC      BC      PC
Kenny  5.6 %   5 %     2 %
Perry  28.5 %  27.7 %  13 %

Table 4.9: Line, branch and path coverage summary


               Estimate     Std Error  tStat    pValue
Lines of code  5.0995e-05   4.011e-05  1.2714   0.20383
Complexity     -0.00026416  0.0002234  -1.1825  0.23725
Line coverage  0.81205      0.0086981  93.36    0

Table 4.10: Estimated coefficients for the saturated regression model

Figure 4.2 shows the scatter plot between MS and LC. Each data point corresponds to a class in either module, with the size of the circle representing the class's LOC.

The coefficient of determination, R², is displayed above the regression line and is perhaps the most relevant information here. It indicates how well the regression line fits the data set, i.e. how well the independent variable can predict the dependent variable; in this case, how well LC predicts MS.

Figure 4.2: Scatter plot between MS and LC


Figure 4.3: Scatter plot between MS and BC

Figure 4.4: Scatter plot between MS and PC


Chapter 5

Discussion

This chapter discusses the observed results with regard to the research questions, reflects on the presented findings and concludes the work.

5.1 Quality of Unit Tests

Mutation analysis applies a set of mutation operators to create a set of mutants. The test suites are then executed against these mutants to measure how many of them can be detected; the purpose is to measure the quality of the test suites. Mutations in uncovered parts of the codebase are never detected, directly lowering the mutation score (MS). Performing mutation analysis on a whole SUT without moderate to high coverage will therefore always result in a low total MS.

It was assumed very early in the thesis that the mutation score for the original suites would be low, due to the low coverage and limited number of unit tests. This was shown to be true immediately after the first mutation analysis.

As mentioned in sections 4.2.1 and 4.2.2, the results also support a different observation. When measuring the numbers of killed, live and uncovered mutants, it was noted that the ratio between killed and live mutants leaned toward the killed. This became clearer when recalculating the mutation score with only the covered mutants: when considering only the part of the code base with coverage, the unit tests are surprisingly effective, with around 70% in mutation score.

The above observation can be explained as follows: the test suites are effective at the parts of the code base they cover; the mutation score is low due to the low coverage and would increase accordingly with higher coverage. This, of course, only holds if any newly added test suites maintain the same level of quality.

The generated test suites display the exact same behavior, with the MS being drastically higher when only considering covered mutants, adding to the plausibility of the above explanation.


The mutation score of the generated test suites increased in line with the corresponding LC. This observation indicates that automatically generated tests are of sufficient quality that developers should consider them as a replacement for unit tests if the current coverage is low, or use them to augment the current test suites. The moderate quality of the automatically generated test suites should also remove any concern about the validity of comparisons between the original test suites and the generated ones.

5.2 Cost and Benefits of Mutation Testing

Mutation testing subsumes many other coverage criteria [OV96] and has been shown to predict actual fault detection ability better than other criteria in some settings, and never shown to be worse [GJ14]. Thus, it is difficult to deny the effectiveness of mutation testing. The practicality of mutation testing, however, is very much up for debate.

Pitest (PIT) was chosen as the mutation system. As described in section 3.4, PIT first measures line coverage and then, for each mutant, executes only the tests that cover the mutated line. This optimization is significant in reducing the execution time; for reference, the longest execution time during this thesis was 3 hours 30 minutes.

The results of the mutation analysis displayed a drastic increase in execution time when the number of unit tests covering any mutation and the number of covered mutations increased. Let us refer to the situation where two unit tests cover the same mutation as overlapping. Overlapping (as discussed in the results chapter) increases the number of test executions without an increase in killed mutants, thus directly increasing execution time without increasing the mutation score (MS).

In a perfect world, there would be a handful of unit tests covering all mutations with no overlapping. However, it is reasonable to assume that as coverage increases, so does the overlapping. Past a certain threshold, the increase in execution time when supplementing the test suite will not be worth the increase in mutation score.

The purpose of conducting regression analysis was to assess the ability of common coverage metrics to predict mutation score, to determine if mutation analysis is truly worth the cost, and whether other cheaper coverage metrics could be used instead.

The result indicates that LC is an effective predictor of mutation score. Although LC is by no means a replacement for mutation analysis it can serve as an indicator in practice.


5.4 Reflection

The decision to perform mutation analysis on the two modules was due to technical limitations. This approach can be criticized due to the risk of test suites for one module covering parts of the other module. This could result in some lost coverage that could have increased the mutation score, although most likely not in a meaningful way.

Generating a second data set for comparison did enable some comparison of performance. However, the legacy code and high interdependency may have contributed to meaningless tests, e.g. test suites that simply call class constructors to add to the LC. The measurements obtained from these test suites might therefore not be genuine.

Performing mutation analysis on the original tests and the generated tests resulted in some interesting data. However, manually creating test suites to measure the performance at different levels of LC and MS would have been more fruitful.

5.5 Sustainability and Societal Aspects

This thesis is a case study of a fault-based testing technique on a software project used within the financial industry; as such, there are very few ethical concerns. From the societal aspect, this thesis is relevant not only for the project team providing the SUT, but also for other development organizations with similar projects and a need for higher-quality testing efforts.

From the economic sustainability perspective, studies in this field contribute to preventing software failures with significant economic consequences [FT01]. This thesis can also serve as an introduction for anyone delving into the subject of mutation testing and higher-quality testing.

5.6 Conclusions

Performing mutation analysis, what is the quality of the present test suites? The mutation score for the SUT is low, indicating that very few of the created mutants are discovered. However, when considering only the part of the code base with coverage, the unit tests are surprisingly effective, with around 70% in mutation score. Hence, it is reasonable to assume that the current unit tests are of high quality, albeit covering only a small portion of the system.


The overlap in coverage between tests was shown to be a major contributor to the high execution time of mutation analysis. Minimizing the number of tests, maximizing the number of covered mutations and minimizing the overlap in coverage between tests should result in the best possible execution time.

Performing regression analysis on the original test suites showed LC to be the best predictor. Developers can thus use LC as the day-to-day measurement of test suite effectiveness in practice, complemented by scheduled mutation analyses of the SUT.

Another, unexpected conclusion of this thesis project concerns the automatically generated unit tests. The generated test suites had significantly higher coverage than the original test suites, and mutation analysis revealed that their mutation score was also higher than that of the original suites. This entails that the generated test suites not only cover more of the code base but are also effective in doing so. It can be concluded that automatically generated suites can replace hand-written test suites if the current coverage is low, or be used to augment the hand-written suites.

5.7 Future Work


References

[YH14] X. Yao, M. Harman, and Y. Jia, “A study of equivalent and stubborn mutation operators using human analysis of equivalence,” International Conference on Software Engineering, pp. 919–930, 2014.

[VM97] J. Voas and G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. John Wiley & Sons, 1997.

[OR93] A. J. Offutt, G. Rothermel, and C. Zapf, “An experimental evaluation of selective mutation," in Proceedings of the Fifteenth International Conference on Software Engineering, (Baltimore, MD), pp. 100-107, IEEE Computer Society Press, May 1993.

[WD94] W. E. Wong, M. E. Delamaro, J. C. Maldonado, and A. P. Mathur, "Constrained mutation in C programs," in Proceedings of the 8th Brazilian Symposium on Software Engineering, (Curitiba, Brazil), pp. 439–452, October 1994.

[HO82] W. E. Howden, “Weak Mutation Testing and Completeness of Test Sets,” IEEE Transactions on Software Engineering, vol. 8, no. 4, pp. 371–379, July 1982.

[DG88] R. A. DeMillo, D. S. Guindi, K. N. King, W. M. McCracken, and A. J. Offutt, “An Extended Overview of the Mothra Software Testing Environment,” in Proceedings of the 2nd Workshop on Software Testing, Verification, and Analysis (TVA’88). Banff Alberta,Canada: IEEE Computer society, July 1988, pp. 142–151.

[MA91] A. P. Mathur, “Performance, Effectiveness, and Reliability Issues in Software Testing,” in Proceedings of the 5th International Computer Software and Applications Conference (COMPSAC’79), Tokyo, Japan, 11-13 September 1991, pp. 604–605.

[OL96] A. Jefferson Offutt, Ammei Lee, Gregg Rothermel, Roland Untch and Christian Zapf: An Experimental Determination of Sufficient Mutation Operators, ACM Trans. on Software Engineering & Methodology, Vol. 5, pp. 99–118, April 1996.

[KO91] K. N. King and A. J. Offutt. A Fortran language system for mutation-based software testing. Software: Practice and Experience, 21(7):685–718, July 1991.


[OP97] A. Jefferson Offutt and Jie Pan: Automatically Detecting Equivalent Mutants and Infeasible Paths, The Journal of Software Testing, Verification, and Reliability, Vol 7, No. 3, pp. 165–192, September 1997.

[BA82] T. A. Budd and D. Angluin. Two Notions of Correctness and Their Relation to Testing. Acta Informatica, 18(1):31–45, March 1982.

[GO92] Robert Geist, A. Jefferson Offutt, and Frederick C. Harris. Estimation and enhancement of real-time software reliability through mutation analysis. IEEE Transactions on Computers, 41(5), May 1992.

[AO17] Paul Ammann , Jeff Offutt, Introduction to Software Testing Second Edition, Cambridge University Press, New York, NY, 2017

[HO76] William E. Howden, “Reliability of the path analysis testing strategy.” IEEE Transactions on Software Engineering SE-2(3):208-214, September 1976.

[JH11] Yue Jia , Mark Harman, An Analysis and Survey of the Development of Mutation Testing, IEEE Transactions on Software Engineering, v.37 n.5, p.649-678, September 2011

[BD80] Timothy A. Budd, Richard A. DeMillo, Richard J. Lipton and Frederick G. Sayward: Theoretical and Empirical Studies on Using Program Mutation To Test The Functional Correctness of Programs, Proceedings of the 7th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, p.220–233, January 28–30, 1980, Las Vegas, Nevada.

[DL78] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on Test Data Selection: Help for the Practicing Programmer. Computer, 11(4):34–41, April 1978

[AB79] A. T. Acree, T. A. Budd, R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Mutation Analysis. Technical Report GIT-ICS-79/08, Georgia Institute of Technology, Atlanta, Georgia, 1979.

[OU00] A. Jefferson Offutt and Roland H. Untch: Mutation 2000: Uniting the Orthogonal, Mutation 2000: Mutation Testing in the Twentieth and the Twenty First Centuries, pp. 45–55, San Jose, CA, October 2000.

[OV96] A. J. Offutt and J. M. Voas. Subsumption of condition coverage techniques by mutation testing. Technical report, 1996.

[GJ14] Rahul Gopinath, Carlos Jensen, and Groce Alex. Code coverage for suite evaluation by developers. In ICSE, pages 72–82, 2014.


[GG13] M. Gligoric, A. Groce, C. Zhang, R. Sharma, M. A. Alipour, and D. Marinov. Comparing non-adequate test suites using coverage criteria. In ACM International Symposium on Software Testing and Analysis. ACM, 2013.

[NW11] Nica, S. A., Ramler, R., & Wotawa, F. (2011). Is Mutation Testing Scalable for Real-World Software Projects? In The Third International Conference on Advances in System Testing and Validation Lifecycle.

[PIT1] PIT Mutation Testing. (2017). Retrieved 12 May, 2017, from http://pitest.org/

[EVO1] EvoSuite. (2017). Retrieved 16 May, 2017, from http://www.evosuite.org/evosuite/

[SQ01] Documentation - SonarQube Documentation. (n.d.). Retrieved May 16, 2017, from https://docs.sonarqube.org/display/SONAR/Documentation

[WQR] Capgemini Releases World Quality Report 2016. (2016, September 21). Entertainment Close-up.

[MR01] T. Murnane, K. Reed: On the Effectiveness of Mutation Analysis as a Black Box Testing Technique, 13th Australian Software Engineering Conference (ASWEC’01) August 27–28, 2001, Canberra, Australia p. 0012, 2001.

[DM96] M. E. Delamaro, J. C. Maldonado, A. P. Mathur: Integration Testing Using Interface Mutation, Proceedings of the Seventh International Symposium of Software Reliability Engineering (ISSRE’96), White Plains, NY, pp. 112–121, 1996.

[FT01] Financial Times. (n.d.). Retrieved May 24, 2017, from https://www.ft.com/content/9657d306-4d7c-11e5-b558-8a9722977189

[JC01] JaCoCo Java Code Coverage Library. (2017, March 21). Retrieved June 04, 2017, from http://www.eclemma.org/jacoco/

[JM01] JMockit An automated testing toolkit for Java. (n.d.). Retrieved June 04, 2017, from http://jmockit.org/

[SM17] Mishra, S. "Analysis of test coverage metrics in a business critical setup". MSc thesis, KTH Royal Institute of Technology, 2017.
