Generate Test Selection Statistics With Automated Selective Mutation

Master of Science in Computer Science February 2020

Course Code: DV2572 Master Thesis in Computer Science Gamini Devi Charan

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full time studies.

Contact Information:

Author(s):

Gamini Devi Charan

E-mail: dega16@student.bth.se

External advisor:

Edward Ekelund

edward.ekelund@axis.com Phone : +46 46 272 2746

University advisor:

Omid Gholami

Omid.gholami@bth.se

Department of Computer Science and Engineering. Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden Internet: www.bth.se

Phone: +46 455 38 50 00 Fax: +46 455 38 50 57

Abstract

Context. Software systems are constantly updated to fix faults and to improve or introduce features. Software testing is the most commonly used method for validating the quality of software systems. Agile processes help to automate the testing process, and regression testing is the main strategy used in testing.

Regression testing is time consuming, and growing codebases make it even more so. Making regression testing time efficient for continuous integration is therefore the new goal.

Objectives. This thesis focuses on correlating code packages to test packages by automating mutation to inject errors into C code. Running regression tests against the mutated code establishes the correlations. The correlation data for a particular modified code package can then be used for test selection. This method is more effective than the traditional test selection method. To reduce the mutation costs, the selective mutation method is chosen for this thesis. Demonstrating a proof of concept helps to test the proposed hypothesis.

Methods. An experiment answers the research questions. Testing the hypothesis on open source C programs evaluates its efficiency. Using this correlation method, testers can reduce testing cycles regardless of the test environment.

Results. After experimenting with sample programs using automated selective mutation, the efficiency of correlating tests to code packages was 93.4%.

Conclusions. This research concludes that automated mutation to obtain test selection statistics can be adopted. Although it is difficult for mutants to fail every test case, if this method works with 93.4% test failure efficiency on average, then it can reduce the test suite size to 5% for a particular modified code package.

Keywords: Regression testing, test case selection, mutation testing, selective mutation.

Acknowledgments

I would like to thank Edward Ekelund and my supervisors from Axis Communications for supporting me with this thesis and for their valuable advice and ideas. I extend my sincere gratitude to everyone from BTH who helped me with this thesis.

List of figures

Figure 1: Relation of research questions

Figure 2: Flow of experiment

List of tables

Table 1: Example of mutation [4]

Table 2: Search results

Table 3: Programs and test suite sizes

Table 4: Regression test methods and their features

Table 5: Mutation testing methods and their findings

Table 6: Test failure achieved from mutations of sample programs

Table 7: Failed test cases percentage

List of Equations

Equation 1: Test case fail percentage

Equation 2: Test values comparison equation

Equation 3: Standard deviation

1 Introduction

In the software industry, evolving means updating a system to meet user needs or changes in its environment. Agile development practices such as continuous integration and continuous development are very important for keeping development moving forward. Even well-conceived and well-tested software requires changes in the long run; making those changes reliably is the challenge. When introducing new functionality, it is necessary to check that the system has not regressed because of bugs that the new functionality might introduce [1]. New features can introduce faults into other functionality, so testing of the whole system is required for quality assurance.

Regression testing is used for this purpose. Testing is the most expensive and time-consuming stage of software development. Regression testing must be performed to check that recent changes did not affect existing features of the software and to test the new features too, providing confidence in the implementation. With the use of agile methodologies, test cases are reused and mostly automated [2].

Approximate estimates show that regression testing accounts for about one-half of the total cost of software maintenance. There is not enough time to run whole regression test suites because the code bases are too large. As the software is updated, the regression test suite grows because old tests are not eliminated, and the expense of regression testing grows too [1].

There is a need to reduce the number of test cases being executed. The usual way of regression testing is retest-all: run all the tests on the modified program under development. Regression testing techniques can make regression testing time efficient without losing its bug-detecting capability. They aim to reduce the size of the test suite to execute without excluding the fault-exposing test cases. The use of component-based software development results in black-box components, which leads to black-box testing [3].

These black-box components are hard for the traditional techniques to deal with.

Different approaches are discussed in detail later. Regression testing techniques use specific methods to reduce the number of test cases to be executed, saving time and cost. Proper research has yet to be done on these techniques.

Mutation testing is a fault-based testing technique in which faults (mutants) are introduced into the source code to check whether the test cases can find them. These faults (mutants) represent the mistakes that programmers often make [4]. Mutation testing measures the quality of the test suite: if test cases fail and thereby detect the introduced faults, the test suite is reliable. If the faults are not detected and the output is the same with no failed test cases, the test suite is not reliable. The mutation score is calculated from the detection rate. Mutation testing is used at the unit, specification and integration levels. Mutation testing strategies are discussed in later sections.

This thesis was carried out at Axis Communications, Lund. Axis is a Swedish manufacturer of physical security and video surveillance products, founded in 1984. With headquarters in Lund, it operates worldwide, producing network cameras. Axis had problems with its regression test suites, which were enormous because the codebases run to millions of lines. It takes 8 hours for any modification to be tested. This made it an interesting case for the thesis.

1.1 Thesis structure

This thesis is structured as follows:

• Chapter 2 has the background and related work of this thesis, presenting regression and mutation testing with their techniques. Research gap and research questions are formulated in this section.

• Chapter 3 describes the research methodology used to answer the research questions.

• Chapter 4 provides the results from the research done and evaluation of experiment results.

• Chapter 5 is about discussion and validity threats of results.

• Chapter 6 ends the thesis with conclusions and future scope of this research.

2 Background & Context

This section provides a background for this thesis. Description of regression testing strategies and mutation methods is given in this section.

2.1 Agile software development

Companies have been adopting agile processes; before that, the waterfall model was used for several years despite its disadvantages [2]. Agile development addresses the shortcomings of the waterfall model: an agile process does not fix requirement assumptions up front, so changes in requirements are possible. Agile processes mainly focus on quick delivery through continuous development and continuous integration. Continuous testing is the most important aspect of continuous development, as testing needs to be done before the code is integrated into the system.

Testing is done to find faults, to show that the software is working, and to check that it produces the desired output under different circumstances. Test cases are sets of inputs and expected outputs which are checked against the program to determine whether it behaves as intended. There are different types of testing, such as unit, integration, system and performance testing. Each of these is used only at its own level. This thesis focuses on regression testing, which tests the whole software. Spending less time on running tests can speed up the development process significantly.

2.2 Regression testing

Regression testing makes sure that newly introduced functionality does not make the system faulty. It confirms that a previously working system still works after the changes. The same regression test suite is reused each time the system is modified.

The agile process aims at continuous development, so the software is built frequently, often several times a day, and the regression tests are run for every build as changes are made. As these builds go on, the code bases and test suites get bigger. This makes regression testing more time consuming for every build.

2.3 Regression testing strategies

To reduce the execution time, regression testing techniques are used. The main approaches used to address regression testing challenges are [3]:

2.3.1 Test suite minimisation (reduction)

This is sometimes called a reduction method because test cases are permanently removed. Test suite reduction aims to find redundant test cases and take them out of the test suite to reduce its size [3]. Reduced test suites are quicker to run, which can save time and cost for future releases. Redundant test cases are identified by code coverage: if the same aspect is covered by the remaining test suite, the test case is redundant. The changes made to the test suite are permanent. This method can also reduce the fault detection ability of the test suite, which is a disadvantage.
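The coverage-based notion of redundancy can be illustrated with a small greedy sketch. It is only an illustration under assumptions: the function name `minimise_suite`, the test names and the coverage sets are hypothetical, and the greedy heuristic is one common way to approximate a minimal suite, not the method of any cited work.

```python
def minimise_suite(coverage):
    """Greedy reduction: repeatedly keep the test that covers the most
    still-uncovered items; tests that are never picked are redundant."""
    remaining = set().union(*coverage.values()) if coverage else set()
    kept = []
    while remaining:
        # Iterating over sorted names makes ties deterministic (first name wins).
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & remaining))
        kept.append(best)
        remaining -= coverage[best]
    return kept


# Hypothetical coverage data: test name -> set of covered code items.
coverage = {
    "t1": {"a", "b"},
    "t2": {"b"},
    "t3": {"c"},
    "t4": {"a", "b", "c"},
}
reduced = minimise_suite(coverage)
print(reduced)
```

In this example t1, t2 and t3 are dropped because t4 alone covers everything they cover. This also shows the disadvantage noted above: a dropped test might still have exposed a fault that t4 does not.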

Schroder and Korel [5] proposed an approach in which each output variable has a set of inputs that affect its outcome; these sets are combined to form the inputs of a test suite. Minimisation works on the assumption that each test requirement can be satisfied by one test case. If the test requirement is functional rather than structural, reduction does not work, so it cannot be applied to functional test suites [6].

Orso, Shi and Harrold introduced an algorithm, DejaVOO, for object-oriented languages [7]. The method partitions the program and then selects the test cases. A detailed relation graph is constructed for both the original and the modified system. The graphs are analysed to derive the differences in the modified system, and the test cases are selected with the help of the graph to cover the changes [7].

2.3.2 Test case prioritization

Prioritization orders the test cases according to their relevance. Each test case is given a priority and the highest-priority test cases are run first. This benefits the tester: even if testing stops at some point, the most relevant tests have already been run [3]. Test cases are thus executed in a specific order for early bug detection. The objective of these strategies is mainly to reduce the fault detection time, i.e. to find faults earlier and increase reliability at a faster rate [8].

Sherrif proposed a technique that combines clusters of software artefacts to obtain the order [6]. Rothermel used the mutation score for prioritization [8]: test cases are prioritized based on their mutation score (the mutants exposed by each test case).

Malhotra [9] proposed a modification-based approach that prioritizes test cases as high or low priority until a level of confidence is achieved. It has two algorithms: a modification algorithm that spots the modified lines and prioritizes the test cases related to them, and a deletion algorithm that removes redundant test cases which are covered by other test cases [9].

Rothermel [10] carried out an empirical study of prioritization techniques for regression tests. The techniques were:

• No prioritization: used as a baseline to compare against the prioritizing methods.

• Random prioritization: the test cases are ordered randomly.

• Optimal prioritization: A program containing known faults shows which test cases reveal which faults, so testers can calculate the optimal order of test cases to maximize efficiency. In practice, this requires testers to know in advance which test cases expose which faults.

• Total branch coverage prioritization: Instrumenting a program gives the number of branches of that program that are exercised by each test case. Test cases are prioritized by how many branches they cover. O(n log n) time is required for programs with n branches.

• Additional branch coverage prioritization: an alternative to total branch coverage. It first selects the test case with the best branch coverage and then repeatedly selects the test case that covers the most branches not yet covered, until each branch is covered by at least one test case. It requires recalculating branch coverage at each step, costing O(n²).

• Total statement coverage prioritization: the same as total branch coverage prioritization, but coverage is measured over statements instead of branches and decisions.

• Additional statement coverage prioritization: similar to additional branch coverage prioritization, but coverage is measured over statements. Once complete coverage is achieved, there is no need to prioritize any further.

• Total fault-exposing-potential prioritization: The methods above prioritize according to the coverage of branches or statements. Coverage only shows that a test case can reach a fault; reaching it does not guarantee that the test case fails. This method therefore estimates the fault-exposing potential of each test case, and test cases are prioritized according to the fault-exposing values of the branches or statements they cover.

• Additional fault-exposing-potential prioritization: the same as total fault-exposing-potential prioritization, but the coverage achieved by previously selected test cases is taken into account for the statements.
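The difference between the "total" and "additional" coverage strategies in the list above can be sketched as follows. This is a minimal illustration, not the implementation from the cited study: the function names and the coverage data are hypothetical, and once full coverage is reached the remaining tests are simply appended in total-coverage order.

```python
def total_coverage_order(coverage):
    """Order tests by how many branches each covers, largest first."""
    return sorted(coverage, key=lambda t: (-len(coverage[t]), t))


def additional_coverage_order(coverage):
    """Repeatedly pick the test covering the most not-yet-covered branches."""
    remaining_tests = set(coverage)
    uncovered = set().union(*coverage.values()) if coverage else set()
    order = []
    while remaining_tests:
        if not uncovered:
            # Simplification: after full coverage, fall back to total ordering.
            order.extend(
                sorted(remaining_tests, key=lambda t: (-len(coverage[t]), t))
            )
            break
        best = min(
            remaining_tests, key=lambda t: (-len(coverage[t] & uncovered), t)
        )
        order.append(best)
        remaining_tests.remove(best)
        uncovered -= coverage[best]
    return order


# Hypothetical data: test name -> set of branch ids it exercises.
coverage = {"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {4}, "t4": {3, 4}}
total_order = total_coverage_order(coverage)
additional_order = additional_coverage_order(coverage)
print(total_order)
print(additional_order)
```

Here the total strategy runs t2 second because it covers two branches, whereas the additional strategy runs t3 second because it is the first test that adds a branch not already covered by t1.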

Zhang [11], in his Ph.D. thesis, combined regression and mutation testing into an efficient way to prioritize tests. Changes made by programmers during development and deliberately introduced changes (mutants) were used to evaluate the test suites. He introduced the ReMT (regression mutation testing) algorithm to apply mutation testing to regression test suites. ReMT has three parts: a preprocessing component, which maps the mutants, and the two core ReMT components, mutant-coverage checking and dangerous-edge reachability checking.

The mutant-coverage check uses coverage information from the test history for test selection. If a test has no history, ReMT executes it to gather results. The dangerous-edge reachability check determines whether earlier mutation results can be reused:

• Results can only be reused if no dangerous edge is executed from the beginning of the program to the mutation; this is checked by dynamic coverage.

• No dangerous edge may be executed from the mutation to the end state; this is checked by dangerous-edge reachability analysis.

When evaluated on different programs, the results showed that ReMT can reduce costs [11].

2.3.3 Test selection

Test selection chooses a subset of tests from the test suite, as the minimisation method does. The difference is that minimisation measures coverage on a single version of the program being tested, whereas in test selection only the relevant test cases are selected depending on the changes to the software under test. Selection is based on metrics that calculate how the tests cover the new code.

Fischer [12] had one of the earliest approaches, the integer programming approach, which he used to test FORTRAN programs. A program segment is defined as a single-entry, single-exit block of statements. The selection algorithm uses two matrices that relate segments and test cases. This model cannot deal with control flow changes in the modified program: the test case dependency matrix depends on the control flow structure of the program under test, so when the control flow structure changes, the test case dependencies can only be updated by executing all test cases. This does not reduce any cost or time.

To make regression testing more economical, Chen proposed a framework called TestTube [13], a modification-based technique for test case selection. TestTube divides the software under test into program entities and monitors the execution of test cases to make connections between test cases and the program entities they cover. One drawback arises from pointer handling, as it includes data types and variables as program entities. It is only valid for languages without pointer arithmetic and type coercion, so TestTube makes assumptions. For example, it treats all pointer arithmetic as well bounded; if the assumptions are wrong, the results will be invalid [13].

Harrold and Soffa proposed a data flow analysis technique for unit testing in the maintenance phase [14]. It is an incremental approach with both intraprocedural and interprocedural selection methods. Slicing techniques are used to cut down the cost of data flow analysis. Data flow analysis identifies definition-use pairs. Selection techniques based on data flow analysis cannot detect modifications that are not related to the data flow.

Khan and Nadeem [15] introduced a tool (ETODF) for the automatic generation of test paths with data flow coverage. ETODF takes the data flow graph as input, analyses it and applies semantic nodes. The analyser reads the data flow information; after the analysis, each node is stored with its definitions and variables. A test path generator generates the test paths, and valid paths are passed to a fitness calculator, which calculates the fitness of each path using coverage. Although the experiments show that this method is better than random testing, it cannot be used for functional testing, as data flow graphs are not available for this type of testing.

Rothermel and Harrold [16] proposed a selection technique based on graph walking of control dependence graphs (CDG), program dependence graphs (PDG), system dependence graphs (SDG) and control flow graphs. They use the PDG for intraprocedural and the SDG for interprocedural selection. The CDG lacks data dependencies, so a CDG-based technique selects test cases that merely execute modified definitions. If the modified definition is never used, its inclusion is not required. The CDG-based method could therefore select test cases that are not capable of producing different outputs.

Bates and Horwitz proposed a test selection method based on program slices from program dependency graphs [17]. It has two stages: identifying all test cases that can be reused, and checking the following conditions for an equivalent execution pattern between a statement S of the original program P and the corresponding statement S` of the modified program P`:

• If a test case executes both P and P`, then S and S` have the same number of executions.

• If P executes for a test case and P` does not, then S` is not exercised more than S.

• If P` executes and P does not, then S is not exercised more than S`.

Using program slices, statements were divided into execution classes.

Taha, Thebaut, and Liu built a test selection framework on an incremental data flow algorithm. One flaw of these data flow approaches is that they do not detect modifications that are unrelated to data flow changes [18].

Gupta applied program slicing techniques to recognize the definition-use pairs that are changed by a code modification [19]. This method finds the definition-use pairs that need to be tested without a complete data-flow analysis, which is the most expensive part. A drawback is that selection techniques based on data flow analysis are unable to detect changes outside the data flow.

Agarwal [20] proposed a technique based on program slicing. The execution slice of a program is its execution trace, i.e. the set of statements executed by a given test. The dynamic slice of a program is the set of statements in the execution slice which have an impact on an output statement. The dynamic slice is therefore a subset of the execution slice.

Agarwal also argued that slicing eliminates the need to formulate a linear programming problem, reducing the effort for testers. A limitation concerns added statements. For example, a new statement which produces a simple output and neither defines nor uses any variable can be modification revealing, but because it has no variable, its addition does not affect existing slices, which results in an empty selection [20].

Ekelund [21], in his thesis, proposed a tool that analyses historical test failure data and uses it to select test cases when a particular package is changed. The difference engine he developed collects previous and current regression test data and uses it to correlate code packages to tests. It is very accurate and reliable even for complex software under test. The problem is that most industrial test suites are very stable, and tests fail too rarely to build a correlation from that data.

In test case selection it is hard to analyse the impact of changes on existing code, so test selection is somewhat difficult. Prioritization methods based on risk and fault detection are out of scope here, as no important test case should be lost.

2.4 Mutation

Mutation testing was first introduced in 1971 by Lipton [4]. Coverage metrics show how the inputs exercise the code, but they do not check the quality of the test cases.

Mutation testing can effectively assess the reliability of a test suite. Mutation analysis of a program p generates a set of faulty programs p' called mutants, which are syntactically correct but produce different output than the original program. The injected faults are called mutations, and each faulty program is called a mutant. The mutations model the faults that programmers tend to make. The mutants are run against the test set, and if the outcome of a mutant differs from that of the original program, the mutant is detected by the test cases.

Mutation testing is evaluated with the mutation score, which is the ratio of detected faults to the total number of introduced faults [4]. Mutation operators are the functions that are applied to the target to produce mutations.
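As a small worked example of the mutation score defined above, the sketch below computes the ratio from a record of which tests failed on each mutant. The function name `mutation_score`, the mutant ids and the test names are hypothetical.

```python
def mutation_score(kill_matrix):
    """Ratio of detected mutants to all introduced mutants.

    kill_matrix maps a mutant id to the set of tests that failed on it;
    a mutant counts as detected when at least one test failed.
    """
    if not kill_matrix:
        return 0.0
    detected = sum(1 for failed in kill_matrix.values() if failed)
    return detected / len(kill_matrix)


# Hypothetical results: m2 survives because no test fails on it.
results = {
    "m1": {"t1"},
    "m2": set(),
    "m3": {"t2", "t3"},
    "m4": {"t1"},
}
score = mutation_score(results)
print(score)
```

Three of the four mutants are detected here, so the score is 3/4.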

Program p                         Mutant P´

…                                 …

if (some parameter >= 0) {        if (some parameter < 0) {
    return 0;                         return -1;

…                                 …

Table 1: Example of mutation [4].

Although mutation testing is very reliable for assessing the quality of test suites, it is very expensive in practice because of the cost of producing large sets of mutants and the time needed to execute them [4]. The number of mutants increases as the number of mutation operators increases. Another problem is the amount of human involvement required, namely comparing the original program’s output with that of the mutated ones. These issues make mutation analysis very hard to practise at an industrial level. Several techniques have been introduced to address the problems of mutation testing. Most of them follow one of three strategies: do fewer, do smarter and do faster [22].

2.4.1 Mutation clustering:

Mutant clustering chooses a subset of mutants with the help of clustering algorithms. Mutants that are killed by similar test cases are gathered into a cluster.

Hussain, in his thesis, proposed a mutant clustering method which chooses mutants using clustering algorithms. It generates first-order mutants; then, using the clustering algorithms, a few mutants are selected from every cluster for use in mutation testing [4]. The rest of the mutants are discarded. Hussain's studies showed that clustering can select fewer mutants while maintaining the mutation score.
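The idea of clustering can be illustrated with a deliberately simplified sketch. It assumes that two mutants belong to the same cluster only when exactly the same tests kill them, which is far stricter than the similarity-based clustering algorithms referred to above; all names and data are hypothetical.

```python
from collections import defaultdict


def cluster_by_killing_tests(kill_matrix):
    """Group mutants that are killed by exactly the same set of tests."""
    clusters = defaultdict(list)
    for mutant in sorted(kill_matrix):
        clusters[frozenset(kill_matrix[mutant])].append(mutant)
    return list(clusters.values())


def pick_representatives(clusters):
    """Keep one mutant per cluster; the others are discarded."""
    return [cluster[0] for cluster in clusters]


# Hypothetical data: mutant id -> set of tests that kill it.
kill_matrix = {
    "m1": {"t1", "t2"},
    "m2": {"t1", "t2"},
    "m3": {"t3"},
    "m4": set(),
    "m5": {"t3"},
}
clusters = cluster_by_killing_tests(kill_matrix)
representatives = pick_representatives(clusters)
print(clusters)
print(representatives)
```

Five mutants are reduced to three representatives, and every test that killed some mutant still kills one of the representatives.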

2.4.2 Mutation selection:

The number of mutants to be executed can be reduced by limiting the mutation operators used. Only a small set of mutation operators is used to generate the mutants.

Zhang has a random selection technique [23]. It randomly selects a mutation operator and then a mutant to operate on, so it can choose an operator that produces fewer mutants than other operators. Compared to normal random selection, however, this technique is not advisable.

Selective mutation is a technique that reduces the number of mutation operators being used, which in turn leads to fewer mutants. Offutt carried out research based on Mothra's [24] selective mutation: of Mothra's 22 mutation operators, five were found to be the key operators, namely ABS, UOI, AOR, LCR and ROR. These operators achieve a 99.5% mutation score, while the number of mutants produced is 60% less than the full set of mutants [24].

These operators were refined for 10 years through several mutation systems.
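To make one of these operators concrete, the following sketch applies a relational operator replacement (ROR) to C source text held in a string. It is an assumption-laden illustration rather than the tool built in this thesis: it works on raw text with a regular expression, uses a single replacement per operator (a full ROR operator would try every other relational operator), and would need a real C parser to skip strings, comments and preprocessor lines such as `#include <stdio.h>`. The names `ROR_SWAPS`, `ROR_PATTERN` and `ror_mutants` are hypothetical.

```python
import re

# One replacement per relational operator (simplified ROR).
ROR_SWAPS = {"<=": ">", ">=": "<", "==": "!=", "!=": "==", "<": ">=", ">": "<="}

# Two-character operators are listed first so "<=" is not read as "<".
# The look-arounds skip "->", "<<", ">>" and compound assignments.
ROR_PATTERN = re.compile(r"(?<![<>=!-])(<=|>=|==|!=|<|>)(?![<>=])")


def ror_mutants(source):
    """Yield one first-order mutant per relational operator occurrence."""
    for match in ROR_PATTERN.finditer(source):
        replacement = ROR_SWAPS[match.group()]
        yield source[:match.start()] + replacement + source[match.end():]


# Hypothetical C function, mirroring the example in Table 1.
original = "int f(int x) { if (x >= 0) { return 0; } return -1; }"
mutants = list(ror_mutants(original))
print(mutants)
```

Each mutant differs from the original in exactly one operator, so a test that fails on a mutant can be attributed to the code package in which that operator was changed.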

DeMillo and Offutt [25] presented a new technique for automating test data generation. This technique uses a set of tools named Godzilla and is a fault-based technique for finding faults. It creates test cases for unit and module testing only and cannot be used for other types of testing. Predefined constraints, which are taken directly from the mutation operators, are used to change the output of the statements. The constraints, along with a path analyser, are used to mutate certain paths that change the output and to gather the test case data and failure data. This technique is fully automated.

2.4.3 Mutant sampling:

Mutant sampling is an approach that randomly chooses a small subset of mutants from the entire set of mutants. All the possible mutants are generated, then a small percentage of them is selected randomly and the remaining mutants are discarded.
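A minimal sketch of mutant sampling is given below. The function name `sample_mutants`, the fixed seed and the 10% fraction are assumptions chosen for illustration; the seed only makes the example repeatable.

```python
import random


def sample_mutants(mutants, fraction, seed=0):
    """Randomly keep a fraction of the generated mutants (at least one)."""
    if not mutants:
        return []
    rng = random.Random(seed)
    count = max(1, round(len(mutants) * fraction))
    return rng.sample(mutants, count)


# Hypothetical pool of 100 generated mutants, of which 10% are kept.
all_mutants = [f"m{i}" for i in range(100)]
chosen = sample_mutants(all_mutants, 0.1)
print(len(chosen))
```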

2.4.4 Weak and strong mutation:

Howden [26] proposed weak mutation testing, which examines the effect of a mutation right after the mutated code is executed. Weak mutation checks whether the mutant has been detected at that point: if the program state changes, the mutant is detected; if the state does not change, the mutant is not detected. Strong mutation, in contrast, requires the final output of the mutant to differ from that of the original program.

With weak mutation, once the mutant has been detected the rest of the test need not be run, which saves cost. Weak mutation can give different results from strong mutation. It analyses the quality of test cases by whether they detect the mutations. It has not been proved that this method reduces execution costs.

2.4.5 Higher order mutation:

This is a newer type of mutation testing, used to find strong higher-order mutants. Mutation testing distinguishes two types of mutants: first-order mutants (FOM) and higher-order mutants (HOM). HOMs are generated by mutating more than once, i.e. they contain more than one mutation.

2.5 Motivation

The most popular regression testing techniques are mostly white-box testing methods. None of them can be used for black-box testing, where static analysis or code coverage cannot be used. This thesis continues Ekelund's approach [21] by obtaining correlations (mappings) through making tests fail, introducing errors into the software under test with mutations. In this way correlations (mappings) are made without waiting for tests to fail, which happens only rarely in industrial use. The correlations can be used for test selection even in black-box testing.

2.5.1 Research gap

Axis Communications has a very large regression test suite which is run every day, for any small change in the code base, to make sure that no bugs are being introduced. Because the code base is large, the suite requires several hours to run. A programmer has to wait for the results before committing changes, which costs a lot of time and money.

Research is available on regression testing techniques, but in practice these are hard to implement at an industrial level. In principle, testers do not need to wait for all the test results, since many tests are unrelated to the changes; yet they currently need results from all the tests to be safe from faults. If testers could execute only the tests related to the changes made in that cycle, they could save a lot of time.

Using test selection, testers can select only the test cases related to a particular code package, based on previously failing tests. Most industrial test suites are black-box test suites, and test selection techniques cannot be used for them, as it is impossible to generate control flow graphs and test coverage. A new way is therefore required to correlate test cases to code. However, the Axis framework was quite stable, with no proper historical failure data. This raised two questions:

• How can a test selection technique be used for the Axis black-box test suite?

Testing a mutated program can answer this question, provided that the mutations in turn make test cases fail. The result of regression testing shows whether test cases passed or failed. The process can be quickened if only the test cases related to a particular change are executed.

This can give immediate feedback, but it raises the second question:

• Can mutation effectively fail the tests, so that code packages can be mapped to failing test cases?

Axis Communications has the code for its security cameras written in several code packages. Correlating (mapping) these code packages to test cases can reduce the time required, by selecting only the test cases related to the modified code package.

Correlating code to test cases requires historical data on failed tests, which is not available in this case. The traditional regression test methods involve white-box analysis, must be performed for every change in the code base, and cannot be used for functional test suites. No research has been done on regression techniques for functional test suites. The idea is therefore to correlate packages and test cases by introducing mutants with an automated tool using selective mutation, and then running the regression test suite against them so that the failure data correlate code packages to test cases. These correlation (mapping/linking) data can be saved and used for statistical test selection without the need to run an analysis every time, as in traditional methods [21].
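The bookkeeping behind this idea can be sketched as follows. This is a hedged illustration, not the tool developed in the thesis: the function names `build_correlations` and `select_tests`, the package names and the test names are all hypothetical, and each input record is assumed to be the set of tests that failed when one mutant was injected into one package.

```python
from collections import defaultdict


def build_correlations(mutation_runs):
    """Count, per code package, how often each test failed on its mutants.

    mutation_runs is an iterable of (package, failed_tests) pairs,
    one pair per mutant that was run against the regression suite.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for package, failed_tests in mutation_runs:
        for test in failed_tests:
            counts[package][test] += 1
    return {package: dict(tests) for package, tests in counts.items()}


def select_tests(correlations, changed_packages):
    """Return the tests correlated with any of the changed packages."""
    selected = set()
    for package in changed_packages:
        selected.update(correlations.get(package, {}))
    return sorted(selected)


# Hypothetical failure data gathered from four mutant runs.
runs = [
    ("net", {"test_socket", "test_http"}),
    ("net", {"test_http"}),
    ("video", {"test_encode"}),
    ("video", set()),  # this mutant survived: no test failed
]
correlations = build_correlations(runs)
selected = select_tests(correlations, ["net"])
print(correlations)
print(selected)
```

Once the correlations are stored, selecting tests for a change is a plain lookup with no code analysis. A package that never made any test fail yields an empty selection in this sketch, so a practical tool would presumably fall back to the full suite in that case.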

2.6 Aim & Objectives

The aim of this research is to correlate code packages to test cases and to use these statistical data for test selection. The correlation of code packages to test cases is done by automated selective mutation. These correlations (mapping/linking) enable faster test selection.

Regression test suites require an extreme amount of time to run in full, and there is no selection technique for functional tests. The proposed approach is quicker, as it does not require an analysis of the code every time test selection is done from historical data [21], and it can also be used for functional test suites.

This research is a proof-of-concept implementation. An evaluation was performed on a few sample C programs. Statistical analysis is done to ensure the method is reliable.

The method is intended for use at an industrial level.


2.7 Research Questions

In the context of this thesis, the following questions are framed:

RQ 1) How can a regression test selection technique be used for black box testing?

Motivation: traditional test selection techniques cannot be used for black box testing. New ways are needed to make these techniques usable for black box (functional) testing.

RQ 2) How efficient is the proposed method in correlating (mapping) code packages to test cases?

Motivation: one way to correlate is to take the failed-test data from builds done by developers and correlate the code to the tests from it, but in many cases this is not possible. Instead, the failed-test data is obtained by introducing errors, and this data is used for the correlations (mappings/links) [21]. The tests have to kill the mutants, which is the opposite approach to conventional mutation testing.

2.8 Scope & Limitations

Given the importance of regression testing at an industrial level, this thesis aims to improve regression testing so as to save time and reduce high costs.

The experiment is based on the idea of correlating code packages to tests. A literature review was carried out first to make sure that the experiment could succeed and that correlations (mappings/links) could be made; the experiment was then conducted on a few sample programs to test the hypothesis. It is expected that the method will work in most testing scenarios and that mutation testing makes a better test case selection possible.

This thesis focuses on regression test selection with mutation testing on C programs; the results cannot be assumed to hold for other languages or dependencies. The work was done with automated testing, so manual testing is out of scope.


3 Methodology

This chapter explains the methods used to test the proposed hypothesis of correlating code packages to test cases. After the hypothesis and the research questions were formulated, a research method had to be selected.

When the theory comes first in research, the approach is deductive; when the study comes first, it is inductive. In this thesis the theory has been stated first, and the proposed method of correlating code to tests is then validated through empirical evidence. Collected data can be of two types, qualitative or quantitative. Qualitative methods let the researcher study a subject in depth rather than make assumptions. In this case both qualitative and quantitative data are collected for the experiment.

3.1 Literature Review

3.1.1 Motivation:

For any thesis, the relevant literature has to be studied to formulate the research idea. A literature review helps to identify the data relevant to the selected subject. The research gap was established by studying the background, and it is addressed by a literature review followed by an experiment to test the hypothesis. The literature review was conducted to study regression testing with functional test suites. A large body of literature was reviewed to understand how regression testing is used with different test suites, and the reviewed concepts helped in conducting the experiment.

A full-scale systematic literature review was not conducted, because of time constraints and because the main aim is to test a hypothesis, which requires an experiment rather than a full literature review. For this reason a literature review was chosen over a systematic literature review.

3.1.2 Objective:

The primary objective of the literature review is to explore and collect relevant articles to study, and to prepare a method for evaluating the proposed hypothesis. The literature review helped to answer the research questions. The questions that guided the literature review are:

1. Can regression testing techniques be used on a functional test suite?

2. Can mutation effectively fail the tests, so that code packages can be mapped to failing test cases?

3.1.3 Literature review design:

Answering these literature questions is important in order to know whether existing regression testing techniques can work with functional test suites. If they cannot, and mutation testing is used as intended, an effective mutation method is needed to make the tests fail.

Multiple references are needed to support the study, so a search for relevant literature was carried out. The digital search was done with the help of the BTH library. Keywords were formulated and the literature review was conducted following “Guidelines for performing systematic literature reviews” by B. A. Kitchenham [27]. This is a qualitative study that helped to answer the research questions.

• Keywords were formulated from the research topic.

• The keywords were ‘Functional Regression testing’, ‘Regression test selection’, ‘Mutation testing techniques’, ‘selective mutation’ and ‘correlation of tests and code’.

• The keywords were used to search for conference papers, journal articles and web pages in the following online sources:

1. IEEE Xplore

2. Google Scholar

3. ACM Digital Library

4. Summon BTH

The database search resulted in the following:

Keyword                                                            Results
Functional Regression testing                                      326
Regression test selection functional                               43
Mutation testing techniques                                        988
selective mutation                                                 121
correlating tests and source code (index term: program testing)    14

Table 2: Search results.

The literature review was done to learn two things: how test selection can be used in black box testing, and how to know whether that method is reliable. The following inclusion/exclusion criteria were used to select the literature:

3.2 Inclusion/ Exclusion criteria:

Inclusion:

• Is the article in English?

• No priority is given to the year of publication, as articles on regression test selection are rare.

• Is the full text available to read?

• Only peer-reviewed papers published in journals and conferences are considered.

• Does the article discuss reliable ways to reduce time consumption in regression testing?

• Does the article discuss reducing the costs of mutation testing?

Exclusion:

• Papers that lack proper empirical evidence are excluded.

• Irrelevant papers that discuss domains other than testing are excluded.

Along with the literature review, snowballing was done to find more relevant literature.

A total of 26 articles were found, of which 14 concern regression testing and its techniques, 6 concern mutation testing and its methods and 5 concern software testing. The literature review was carried out on these articles. From the literature review and the research gap, RQ 1 is answered.

The literature review was done to find a new method to use in conjunction with regression test selection, namely mutation testing, and to find ways to reduce the computational and operational costs of mutation. The available literature on mutation testing was reviewed, and mutation methods that can cut cost and improve efficiency were studied. From the analysis, the following methods were identified:

• For regression testing

1. Regression test reduction

2. Regression test prioritization

3. Regression test selection

• For mutation testing

1. Mutant clustering

2. Mutant sampling

3. Selective mutation

4. Weak and strong mutation

5. Higher order mutation

Summary of evidence:

After studying the related work, regression test selection was chosen among the regression test techniques. The articles [12], [13], [14], [15], [16], [17], [18], [19], [20] show that regression test selection is the suitable technique for minimising test execution time without losing test cases. For mutation testing, selective mutation was chosen after reviewing [23], [24] and [25], as it uses only selected mutation operators. This helps to minimise the number of mutants produced and to strengthen the mutations by using only strong mutation operators.

The literature review has to support the study with good data. Only peer-reviewed articles with an abstract relevant to the study were selected. The abstract was studied first; if it was relevant, the introduction and conclusion were studied as well.

The relation between the research questions is presented below:


Figure 1: Relation of research questions. The literature review answers RQ 1; the experiment on sample programs answers RQ 2.

3.3 Experiment

An experiment is the most common research method. Its goal is to evaluate software methodologies, and it helps to test a new or existing hypothesis.

This study requires a controlled environment, with a program under test and a test suite for that program, to assess the reliability of the proposed model and how it behaves with the chosen variables. The metric used for evaluation is the failed test case percentage. The experiment also checks the consistency of the method, as this is its most valuable quality.

Rejected alternative research methods:

A case study explores existing principles, and a survey collects views from different users. Neither method helps in testing a new hypothesis that has never been used before. Experimentation combined with a literature review was therefore selected as the research method, since a new hypothesis for cost-effective regression test selection is being tested. It had to be observed how the strategy works and whether the method actually fails the tests so that the necessary correlations for test selection can be made. The thesis sets out to show that mutating the software under test yields test failure statistics that can be used for correlations, and an experiment is the most suitable method for this. RQ 1 is answered by the literature review and RQ 2 by the experiment.

3.3.1 Experimental setup:

Software platform:

In the experimental setup, automated selective mutation testing was performed on different sample programs and evaluated in a Linux environment. An automated mutation tool [28] was selected, which produces mutants of an input C file. Five C programs were used, with their tests available in a test script language from the software-artifact infrastructure repository [29].

Figure 2: Flow of experiment. C programs (code packages) -> automated mutation tool -> mutated C programs -> testing against test suite -> failed test cases -> correlate code packages to tests -> use correlations for test selection. If no test fails, the step is repeated with different mutations.
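The flow in Figure 2 can be sketched as a small loop. This is only an outline under stated assumptions: `generate_mutants` and `run_test_suite` are hypothetical placeholders for the mutation tool and the test harness, and the toy stand-ins below exist only so that the sketch can be run.

```python
def run_flow(packages, generate_mutants, run_test_suite):
    """Follow the flow of Figure 2 for every code package.

    generate_mutants(package) returns mutated versions of the package.
    run_test_suite(mutant) returns the names of the tests that failed.
    Both callables are placeholders for the real tool and test harness.
    """
    correlations = {}
    undetected = {}
    for package in packages:
        failed_tests = set()
        survivors = 0
        for mutant in generate_mutants(package):
            failed = set(run_test_suite(mutant))
            if failed:
                failed_tests |= failed  # failed test cases feed the correlation
            else:
                survivors += 1  # not failed: continue with a different mutation
        correlations[package] = failed_tests
        undetected[package] = survivors
    return correlations, undetected


# Toy stand-ins, purely illustrative.
def generate_mutants(package):
    return [f"{package}#m{i}" for i in range(3)]


def run_test_suite(mutant):
    # Pretend the first mutant of every package is not detected by any test.
    if mutant.endswith("#m0"):
        return []
    return ["t_" + mutant.split("#")[0]]


correlations, undetected = run_flow(["pretty", "array"], generate_mutants, run_test_suite)
```

Counting the undetected mutants separately keeps the "if not failed" branch of the figure visible in the output instead of silently discarding it.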


3.3.2 Sample motivation:

In this experiment, the population of concern is every possible C program and its respective test suite. A sample of five C programs with many lines of code and the most commonly used functionality was selected to draw conclusions about the whole population. The software-artifact infrastructure repository [29] has large C utility programs with their own test suites, which makes them easy to test because no new test suite has to be built. Five sample C programs that had functional test suites were selected randomly from the available programs. Had other C programs been selected, the outcome is not expected to be very different, as the selected programs are large enough to cover most of the functionality of the C language. The programs were tested to see whether the mutations make tests fail, so that the failures can be correlated to obtain the historical data. The selected samples are listed below:

• Jo: an open source utility for creating JSON objects. It has three main functions to style the output: pretty, Boolean and array. These functions are mutated and tested against their test suite.

• Grep, Flex, Gzip, Sed: all four programs are Unix utilities. Grep searches input files for a given pattern list. Flex is a fast lexical analyser generator used to create scanners. Gzip is a file compression tool. Sed is a stream editor for text transformations. These programs have a large number of functions, so as much of the code as possible is mutated. These four are from the software-artifact infrastructure repository [29].

These five sample programs had their own test suites provided from the repository. The functions in the programs can be regarded as the code packages of a software under test.


Program    Lines of code    Test cases
Jo         1186             24
grep       10068            809
flex       10459            567
gzip       5680             214
sed        14427            360

Table 3: Programs and test suite sizes

3.3.3 Experiment design:

The theory is that once the correlations are made, they can be used repeatedly for test selection. To show that medium-level mutations can make test cases fail, sample utility programs with functions and tests similar to black box testing were selected. The experiment is done by mutating the program under test and then running the test suite against the mutated program. The automated selective mutation testing tool is an open source tool from GitHub [28]; using it was preferred over developing a new tool, which would have taken a lot of time. The tool takes a C file as input and generates all possible mutations for the mutation operators specified to it. Only five mutation operators, AOR, ROR, UOI, ABS and LCR, are used in the experiment, as mentioned earlier. They are taken from the 22 operators specified by Mothra and cover most of the mutations [24]. The mutation tool writes an output file with all the mutations possible from the operators. Once the mutations are performed, each program is tested against its test suite. If tests fail for the mutations, the experiment is successful. The metric used for evaluation is the ‘failed test case percentage’.
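To make the operator-based mutation concrete, here is a minimal sketch of how single-operator mutants could be generated for three of the operator classes (ROR, AOR and LCR). It is not the tool from [28]: it works on one isolated C expression rather than a whole C file, it omits UOI and ABS, and the replacement sets are simplified assumptions.

```python
import re

# Simplified replacement sets (assumptions for this sketch only).
GROUPS = {
    "ROR": ["<=", ">=", "==", "!=", "<", ">"],  # relational operator replacement
    "AOR": ["+", "-", "*", "/", "%"],           # arithmetic operator replacement
    "LCR": ["&&", "||"],                        # logical connector replacement
}

# Two-character operators are listed first so that "<=" is not read as "<".
TOKEN = re.compile(r"<=|>=|==|!=|&&|\|\||[<>+\-*/%]")


def mutate_expression(expr):
    """Return (operator_class, mutated_expression) for every single replacement."""
    mutants = []
    for match in TOKEN.finditer(expr):
        original = match.group()
        for name, group in GROUPS.items():
            if original not in group:
                continue
            for replacement in group:
                if replacement != original:
                    mutated = expr[:match.start()] + replacement + expr[match.end():]
                    mutants.append((name, mutated))
    return mutants


mutants = mutate_expression("a < b && c + 1 == d")
```

Each mutant differs from the original expression in exactly one operator, which is the first-order mutation that selective mutation relies on. A real tool needs a proper C parser to avoid touching pointer dereferences, `->`, `++` or text inside comments and strings.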

3.3.4 Independent variables

As the method used is a controlled experiment, the researcher has control over the independent variables. The independent variables are manipulated to obtain quantitative or qualitative results from the experiment. The independent variables are the inputs; in this case the C sample programs, the test cases and the mutation testing are the inputs that are controlled and manipulated to obtain the results.

3.3.5 Dependent variables

The dependent variables are the ones affected by changes in the independent variables. In this case, the dependent variables are the test cases that fail because of the mutations and the percentage of failed test cases. With these outputs, RQ 2 can be answered.


4 Results

This chapter reports the results of the literature review and the experiment for each research question.

4.1 Using regression test selection for black box testing:

The investigation into using a regression test selection technique led to the idea of using mutation testing, which helps to correlate code to tests and eliminates the need for code coverage graphs or test coverage. Mutation testing can make tests fail for a specific mutation in a part of the code, which shows which test cases are related to that code. Mutation testing is, however, an expensive process. The basic idea of mutation is to introduce faults into programs to test the effectiveness of test suites. The methods proposed for reducing the computational cost include weak mutation, selective mutation and mutant clustering. After the literature study, selective mutation was selected, as it reduces the number of mutants better than the others without losing mutation score [24]. The findings from the literature review are as follows:

Regression testing method: Test case reduction
Limitations: Permanently deletes the redundant test cases. Might sometimes lose relevant test cases.
Advantage: Permanently reduces the size of the test suite.
Cost reduction: Low
Time consumption: Low

Regression testing method: Test case prioritization
Limitations: Sets a priority on test cases, but the whole suite is run eventually.
Advantage: High priority tests are run first, so the result a tester requires is given quickly.
Cost reduction: Low
Time consumption: High

Regression testing method: Test case selection
Limitations: Needs test coverage to be calculated to select test cases (not possible in black box testing).
Advantage: A subset of the test suite is selected by algorithms, but test cases are never deleted.
Cost reduction: High
Time consumption: Medium-High

Table 4: Regression test methods and their features.

Mutation testing method: Mutant clustering
Time consumption: Low
Limitations: The clustering algorithm might miss relevant test cases.
Advantage: Takes less time to execute the small clusters of mutants.

Mutation testing method: Mutant sampling
Time consumption: Low
Limitations: Sampling selection might miss relevant test cases.
Advantage: Takes less time to execute, as a small sample is selected from the test suite.

Mutation testing method: Selective mutation
Time consumption: Medium
Limitations: Might still lead to a large number of mutants.
Advantage: Will cover all the possible mutants and has the best mutation coverage.

Mutation testing method: Weak and strong mutation
Time consumption: Medium
Limitations: Weak mutation is stronger; test cases mostly pass the mutation. Cannot be used in this research to make tests fail.
Advantage: Can be used to test the quality of the test suite to an extreme scope.

Mutation testing method: Higher order mutation
Time consumption: Low
Limitations: Mutations must be done several times to obtain higher order mutations.
Advantage: Very hard for any test suite to detect them. Will prove the potential of a test suite if tested.

Table 5: Mutation testing methods and their findings.

Mothra proposed 22 mutation operators, which were refined through several mutation systems. Offutt [24] experimented with different numbers of mutation operators and found that 6-selective mutation, with the operators AOR, ROR, UOI, ABS, DER and LCR, gave a 60% reduction in the number of mutants generated with an average mutation score of 99.71%. This is the most effective way to cut the computational cost by more than half without losing mutation score [24]. Selective mutation is therefore used in this method to cut the cost of performing mutations.

4.2 Efficiency of proposed method to correlate/map code packages with test cases

The experiment was carried out on five sample programs. Following the literature review, the selective mutation method was applied. Selective mutation of the sample programs gave the results shown in Table 6. The test failure percentage is used as the metric for the correlations, and the average test failure percentage is calculated.


C program    Lines of code    Mutants generated    Test cases    Tests failed
Jo           1186             1587                 24            21
grep         10068            12472                809           793
flex         10459            13248                567           528
gzip         5680             6975                 214           200
sed          14427            17858                360           342

Table 6: Test failure achieved from mutations of sample programs

The metric used to calculate the correlations is the failed test case percentage, which represents the percentage of correlations obtained from the mutations.

C program    Failed test cases (%)
Jo           87.5
grep         98.02
flex         93.1
gzip         93.4
sed          95

Table 7: Failed test cases percentage

Using the test case failure data, the sample mean of the failed test case percentage can be calculated as

\[ \bar{x} = \frac{\text{Sum of all failed test case percentages}}{\text{Total number of programs tested}} \]

Equation 1: Test case fail percentage

\[ \bar{x} = \frac{87.5 + 98.02 + 93.1 + 93.4 + 95}{5} = 93.4\% \]

The percentage of failed tests (the correlation percentage is the same) is 93.4%.
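The percentages and their mean can be recomputed directly from the counts in Table 6. The short sketch below does this; the variable and function names are chosen for this example only.

```python
# (test cases, failed tests) per program, copied from Table 6.
results = {
    "Jo": (24, 21),
    "grep": (809, 793),
    "flex": (567, 528),
    "gzip": (214, 200),
    "sed": (360, 342),
}


def failed_percentage(total, failed):
    """Failed test case percentage for one program."""
    return 100.0 * failed / total


percentages = {
    name: failed_percentage(total, failed)
    for name, (total, failed) in results.items()
}

# Sample mean over the five programs (Equation 1).
mean_percentage = sum(percentages.values()) / len(percentages)

for name, value in percentages.items():
    print(f"{name}: {value:.2f}%")
print(f"mean: {mean_percentage:.1f}%")
```

Working from the raw counts rather than the rounded percentages avoids accumulating rounding error in the mean.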

4.3 Evaluation

The proposed method has not been evaluated for every construct in C, but it is robust in the most common cases. The programs used in the experiment are typical C programs with commonly used constructs. Statistical tests are used to verify a hypothesis in research. The t-test is a statistical test on the differences in a sample of a population, and it helps the researcher to decide when to reject the null hypothesis.

A one-sample t-test was selected to test whether the sample comes from a population with a specific mean. The population mean is not always known. For example, to test a new teaching method for students who are falling behind, the sample is the students taught with the new method and the sample mean is their average score. Because a single error has a large effect in a small sample, a small sample of programs was taken. The one-sample t-test was selected because the other correlating methods are for white box testing, while the method introduced in this thesis is for black box testing, so a two-sample t-test was not applicable. The sample mean is compared to the test value and t is calculated [30].

The evaluation is done by showing that the test values did not occur by chance.

The comparison is done by

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

Equation 2: Test values comparison equation.

where \(\bar{x}\) is the sample mean, \(\mu\) is the test value, \(s\) is the sample standard deviation and \(n\) is the sample size. The value of t is compared to the critical value to decide whether to reject the null hypothesis. The assumptions of the one-sample t-test are unbiased sampling, independent observations and normality [30].
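Equation 2 can be evaluated with the Python standard library, as in the sketch below. The function name `one_sample_t` is an assumption for this example; `statistics.stdev` computes the sample standard deviation with n − 1 in the denominator, which matches the definition of s used here.

```python
import math
import statistics


def one_sample_t(sample, mu):
    """Return (mean, s, t) for a one-sample t-test of the sample against mu."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu) / (s / math.sqrt(n))
    return mean, s, t


# Failed test case percentages from Table 7; the test value is 100%.
percentages = [87.5, 98.02, 93.1, 93.4, 95.0]
mean, s, t = one_sample_t(percentages, mu=100.0)
print(f"mean = {mean:.2f}, s = {s:.2f}, t = {t:.2f}")
```

The returned t is then compared with the critical value for n − 1 degrees of freedom at the chosen significance level.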


Defining the hypotheses: two hypotheses are written for statistical testing, the null hypothesis and the alternative hypothesis. In the one-sample t-test, the alternative hypothesis is that there is a difference between the test value and the population mean.

Null hypothesis (H0): assumes that the difference between the sample mean and the test value is 0. “The efficiency to correlate code to test cases with mutations is 100%”.

Alternative hypothesis (H1): assumes that there is a difference between the test value and the sample mean. “The efficiency to correlate code to test cases with mutations is not 100%”.

Rejecting the null hypothesis supports the alternative hypothesis. The sample mean was calculated to be 93.4%. To check whether the deviation from 100% occurred by chance, a one-sample t-test is performed. Calculating t requires the standard deviation s:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

Equation 3: standard deviation

\[ s = \sqrt{\frac{(87.5 - 93.4)^2 + \dots + (95 - 93.4)^2}{4}} = \sqrt{14.7} \approx 3.83 \]

Now calculating t,

\[ t = \frac{93.4 - 100}{3.83 / \sqrt{5}} \approx -3.85 \]

The magnitude of t is greater than the critical value 1.75, \(|t| = 3.85 > 1.75\). The null hypothesis is therefore rejected: the deviation from 100% is not attributable to chance alone. From the results, it is concluded that mutation testing can correlate code packages to tests, although not with 100% efficiency.


5 Discussion and validity threats

This chapter explains the ideas drawn from the results of experiment and literature review.

5.1 Discussion

After the experiment was performed, the results were evaluated, and the defined hypothesis is shown to work. As the experiment was performed on functional test suites, the proposed method cannot be compared to other methods, as they only work for unit testing. The method was developed to address test selection for functional test cases. From the evaluation with the one-sample t-test, the method is shown to be efficient. The only closely related work is Zhang’s PhD thesis [11], which discusses reducing the cost of mutation testing using regression test prioritization.

This research is an extension of Ekelund’s thesis. Ekelund showed that regression test selection from historical data can reduce the size of the test suite to 5%, saving cost and time. His research, however, lacked a way of gathering the historical test data needed for test selection. This research generates the historical test failure data by using selective mutation testing, thus correlating (mapping) code packages to test cases. This thesis, “Generate test selection statistics with automated selective mutation”, describes a process for applying regression test selection to black box (functional) testing by using mutation testing to generate test selection data. Used in conjunction with Ekelund’s research, the method can reduce the regression test suite size to 5% [21] for a particular change by selecting only the relevant test cases. As there is no working model for black box test selection, this method can fill the gap. It can save a lot of time and expense and can be used at an industrial level. The overall gain is that once the correlations (mappings) from code packages to tests are made, the data can be saved and used for future regression tests without running an analysis each time. If required, the whole regression test suite can still be run after the selected test cases have been tested. The answers to the research questions are as follows.


RQ 1) How can a regression test selection technique be used for black box testing?

The solution is to use mutation testing to generate test data and then to correlate code packages to test cases, as this removes the need for test coverage or coverage graphs. Using mutations has a large impact on cost and time, however, because mutation testing results in a large number of mutants. Mutation cost reduction techniques can be used to reduce these costs.

Selective mutation can be automated to inject mutants into any code package. Selective mutation can decrease the number of mutants produced by 60% compared to non-selective mutation, which cuts cost and time considerably. The selected mutation operators were configured in the automation tool, which injects mutants only from these operators.

To make a test case fail, a mutant must change the output of the program. The problem of propagating an infection along a path to the output is hard [31]. Selective mutation is therefore used again to introduce only mutants that propagate the infection to the output and that are very easy for the test cases to identify. The operators selected inject the most common faults into code: AOR, ROR, UOI, ABS and LCR inject the simplest mutations, which mutate arithmetic operators, relational operators, unary operators, absolute values and logical connectors respectively. These mutations are easy for any test suite to identify [24].

RQ 2) How efficient is the proposed method in correlating (mapping) code packages to test cases?

The introduced method had an efficiency of 93.4%, which is fit for use given that there are no other methods for black box regression test selection. As an error in a small sample has a high percentage representation, a small sample of five C programs was selected for the experiment. The failed test case percentage is used as the metric to calculate the correlations.

A one-sample t-test was selected to evaluate the experiment results, as the method cannot be compared to other available methods. The results show that the method can be used to correlate code to tests, and that this data can be used for black box regression test selection. In an industrial application, the whole process can be automated on the Jenkins platform to reduce the effort of mutation. With many builds every day, this method used in conjunction with a test case selection algorithm (difference engine) can help to reduce the size of the test suite for a specific modification to as little as 5% [21], even for functional (black box) test suites. This can save more time than other methods; if the result is not satisfactory, the remaining test suite can also be run once the selected tests have finished.
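The reuse of saved correlations for test selection can be sketched as follows. This is an illustrative outline only: the JSON file layout, the function names and the package and test names are assumptions for this example, not the difference engine of [21] or any Jenkins integration.

```python
import json
import os
import tempfile


def save_correlations(correlations, path):
    """Store the package-to-tests mapping as JSON (sets become sorted lists)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({pkg: sorted(tests) for pkg, tests in correlations.items()},
                  handle, indent=2)


def load_correlations(path):
    """Load the mapping saved by save_correlations."""
    with open(path, encoding="utf-8") as handle:
        return {pkg: set(tests) for pkg, tests in json.load(handle).items()}


def select_tests(correlations, changed_packages):
    """Return the tests correlated with any of the changed packages."""
    selected = set()
    for package in changed_packages:
        selected |= correlations.get(package, set())
    return selected


# Hypothetical correlations, produced once by the mutation run.
correlations = {
    "pretty": {"t1", "t2"},
    "array": {"t2", "t3"},
    "boolean": {"t4"},
}
path = os.path.join(tempfile.mkdtemp(), "correlations.json")
save_correlations(correlations, path)

# Later builds only load the file; no mutation or code analysis is repeated.
selected = select_tests(load_correlations(path), ["array"])
```

A package with no recorded correlations selects no tests here; in practice such a case might be handled by falling back to the full regression suite.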

5.2 Validity threats

5.2.1 Internal threats

Internal threats are caused by the subjects in the research and can occur without the researcher’s knowledge [32]. In this experiment, relevant articles or papers could have been missed, and faults in the test suite being used could have gone unnoticed. To mitigate this, the test suite was tested for errors beforehand and the literature review was structured with every relevant keyword. Failed mutations or weak test cases are also a threat, as improper mutations or test cases that cannot detect faults can make the experiment fail.

5.2.2 External threats

One external threat is that the results cannot be generalised. The method is validated only for C sample programs and is assumed to work for other programs that might be used. It cannot be assumed to work with other programming languages. The test cases left over after the correlations are made can result from improper mutations, or they can be irrelevant to the software under test. Irrelevant test cases in the test suite can reduce the efficiency of the method. The environment was set up to match an industrial one.

5.2.3 Reliability

Research is reliable when biases can be reduced [33]. The research was done on a limited number of programs and test cases, so the theory is analysed on only a few of them. Reliability was addressed by evaluating the results with statistical tests.


6 Conclusion and Future work

This chapter presents the conclusions of this study and the scope for extending the research in future.

6.1 Conclusion:

This thesis is about finding a way to use regression test selection for black box test suites. Mutation testing was found to be useful, so it is used to correlate code to test cases, and this approach is evaluated to be reliable.

The research starts with finding a suitable way to correlate code to test cases without having to compute control flow graphs or test coverage. Mutation testing is then studied further to reduce the cost of the large number of mutants. Selective mutation is used to control the generation of mutants, so that fewer mutants are generated.

The environment for the experiment is configured by choosing a few sample C programs and an automated mutation tool. The mutation tool is altered to mutate only with the selected Mothra mutation operators. The experiment starts by mutating the sample C programs with the mutation tool. Once the mutations are introduced, the mutated code is tested against each program’s own test suite. The experiment is repeated for all five sample programs, to analyse the output from different inputs.

After the experiment, the failed test case percentage is calculated as the metric. The results show that the proposed method can be used for test case selection in industrial applications as well, and that it correlates code packages to test cases with a mean of 93.4%. The method can further be used for test data generation and can reduce the test suite size to 5% [21] for any specific modification to the program, which can save a great deal of time. From the experiment it can be concluded that introducing mutants into the software under test and testing it against the regression test suite can correlate code packages to test cases, and that these correlations can be used for test selection. Test selection had not been used for black box regression testing before; it can now be used there. Although constructs of languages other than C were not tested, the method can be expected to apply to other languages too, as regression testing and mutation testing are performed in most languages.


6.2 Future work:

For future work, the following is proposed.

• This research starts from the lack of historical test data. If a method can be proposed to gather and save all the test data from the start of the development process, that data can be used for test selection with simple available algorithms, without the need for mutation-based test data generation.

• Different methods of mutation testing can be implemented and compared against the selective mutation implemented in this research.

• Further methods that can be combined with regression testing to save costs can be explored.
