
Early and Cost-Effective Software Fault Detection

Measurement and Implementation in an Industrial Setting

Lars-Ola Damm

Blekinge Institute of Technology Doctoral Dissertation Series No 2007:09

ISSN 1653-2090 ISBN 978-91-7295-113-6

Department of Systems and Software Engineering School of Engineering

Blekinge Institute of Technology


Publisher: Blekinge Institute of Technology
Printed by Printfabriken, Karlskrona, Sweden, 2007


Q: What are the most exciting, promising software engineering ideas or techniques on the horizon?

A: I don’t think that the most promising ideas are on the horizon. They are already here and have been for years, but are not being used properly.

–David L. Parnas


Abstract

Avoidable rework consumes a large part of development projects, i.e. 20-80 percent depending on the maturity of the organization and the complexity of the products. High amounts of avoidable rework commonly occur when having many faults left to correct in late stages of a project. In fact, research studies indicate that the cost of rework could be decreased by up to 50 percent by finding more faults earlier. Therefore, the interest from industry to improve this area is large.

It might appear easy to reduce the amount of rework just by putting more focus on early verification activities, e.g. reviews. However, activities such as reviews and testing are good at catching different types of faults at different stages in the development cycle. Further, some system characteristics such as system capacity and backward compatibility might not be feasible to verify early through for example reviews or unit tests. Therefore, the objective should not just be to find and remove all faults as early as possible. Instead, the cost-effectiveness of different techniques in relation to different types of faults should be in focus.

A department at Ericsson AB was interested in approaches for assessing and improving early and cost-effective fault detection. In particular, there was a need to quantify the value of suggested improvements. Based on this objective, research was conducted over a few years in this industrial environment.

The conducted research resulted in this thesis, which determines how to quantify unnecessary rework costs and which phases and activities to focus improvement work on in order to achieve earlier and more cost-effective fault detection. The thesis describes and evaluates measurement methods that make organizations strive towards finding the right faults in the right phase. The developed methods were also used for evaluating the impact a framework for component-level test automation and test-driven development had on development efficiency and quality. Further, the thesis demonstrates how the implementation of such improvements can be continuously monitored to obtain feedback during ongoing projects. Finally, recommendations on how to define and implement measurements, and how to interpret the obtained measurement data, are provided, e.g. presented as considerations, lessons learned, and success factors.

The thesis concluded that existing approaches for assessing and improving the degree of early and cost-effective software fault detection are not satisfactory since they can cause counter-productive behavior. An approach that more adequately considers the cost-efficiency aspects of software fault detection is required. Additionally, experiences from different products and organizations led to the conclusion that a combination of measurements is commonly necessary to accurately identify and prioritize improvements.


Acknowledgements

First and foremost, I would like to thank my supervisors Professor Lars Lundberg and Professor Claes Wohlin for their support, especially for valuable feedback on papers and other research related advice. I would also like to express my gratitude to my manager Bengt Gustavsson for continuous support and ideas, and for making it possible to integrate the research with the industrial environment. He has provided invaluable help to ensure long-term support for the research activities in a constantly changing industrial environment.

I would also like to thank all colleagues at Ericsson AB in Karlskrona who have taken part in or been affected by the research work. These include the members of the development projects that have been studied as a part of the research as well as line managers and the other members of the research project’s steering group. Thanks for providing ideas and feedback, and for letting me interfere with the daily work when conducting the research studies. In particular, I would like to thank Michel Koivisto for providing valuable input to the research, especially in the form of access to people and projects across Ericsson. Additionally, David Olsson provided a lot of valuable feedback, support, and ideas, especially in the earlier part of the research.

My colleagues in the BESQ environment, and the research groups SERL and PAARTS have also been very supportive. In particular, they have provided a scientific mindset and broadened my knowledge of software engineering research and practice.

Especially, I would like to thank Patrik Berander and Piotr Tomaszewski for fruitful cooperation, not only in relation to the work presented in this thesis but also, for example, in university courses. I also appreciate the help external persons have given, in particular Johan Nilsson for continuously giving feedback on papers, especially to make sure that people outside the research environment can also understand them.

Further, I am thankful for Richard Torkar’s help with resolving thesis formatting issues.

Finally, I would like to thank family and friends for putting up with me despite my neglecting them during periods of high workload.

This work was funded jointly by Ericsson AB and the Knowledge Foundation in Sweden under a research grant for the project “Blekinge - Engineering Software Qualities (BESQ)” (http://www.bth.se/besq).


Overview of Papers

Papers included in this thesis.

Chapter 2. ’Faults-Slip-Through - A Concept for Measuring the Efficiency of the Test Process’, Journal of Software Process: Improvement and Practice, Wiley InterScience, 11(1), pp. 47-59, 2006.

Chapter 3. ’Company-wide Implementation of Metrics for Early Software Fault Detection’, To be published in: Proceedings of the 29th International Conference on Software Engineering (ICSE), IEEE Computer Society, Minneapolis, USA, May 2007.

Chapter 4. ’Using Fault Slippage Measurement for Monitoring Software Process Quality during Development’, Proceedings of the 4th International Workshop on Software Quality (WOSQ), ACM Press, Shanghai, China, pp. 15-20, May 2006.

Chapter 5. ’Identification of Test Process Improvements by Combining ODC Triggers and Faults-Slip-Through’, Proceedings of the 4th International Symposium on Empirical Software Engineering (ISESE), IEEE Computer Society, Noosa Heads, Australia, pp. 152-161, November 2005.

Chapter 6. ’A Model for Software Rework Reduction through a Combination of Anomaly Metrics’, To be submitted for publication.

Chapter 7. ’Results from Introducing Component-Level Test Automation and Test-Driven Development’, Journal of Systems and Software, 79(7), pp. 1001-1014, 2006.

Chapter 8. ’Quality Impact of Introducing Component-Level Test Automation and Test-Driven Development’, Submitted to: the 14th European Conference on Systems & Software Process Improvement and Innovation (EuroSPI), Potsdam, Germany, September 2007.

Lars-Ola Damm is the main author of all papers, i.e. based on advisory support from the co-authors, he has outlined and written all included papers. Lars Lundberg is a co-author of Chapters 2-8 and Claes Wohlin is a co-author of Chapters 2 and 6.


The following papers are related to, but not included in, this thesis.

Paper 1. Lars-Ola Damm, Lars Lundberg, and Claes Wohlin.

’Determining the Improvement Potential of a Software Development Organization through Fault Analysis: A Method and a Case Study’, Proceedings of the 11th European Conference on Software Process Improvement (EuroSPI), Springer-Verlag, Trondheim, Norway, pp. 138-149, November 2004. The paper in Chapter 2 is an extended version of this paper.

Paper 2. Lars-Ola Damm, Lars Lundberg, and David Olsson.

’Introducing Test Automation and Test-Driven Development: an Experience Report’, Proceedings of the International Workshop on Test and Analysis of Component-Based Systems (TACoS), Electronic Notes in Theoretical Computer Science, 316, Elsevier Science Inc., pp. 3-15, April 2005.

Paper 3. Piotr Tomaszewski and Lars-Ola Damm.

’Comparing the Fault-Proneness of New and Modified Code - An Industrial Case Study’, Proceedings of the 5th International Symposium on Empirical Software Engineering (ISESE), Rio de Janeiro, Brazil, pp. 2-7, September 2006.

Paper 4. Piotr Tomaszewski, Patrik Berander, and Lars-Ola Damm.

’From Traditional to Streamline Development - Opportunities and Challenges’, Submitted to the Journal of Software Process: Improvement and Practice, Wiley InterScience, January 2007.

Paper 5. Renas Reda, Yusuf Tozmal, Miroslaw Staron, and Lars-Ola Damm.

’An Industrial Case Study on Testing Methods and Tools in Model Driven Development’, To be submitted for publication, 2007.


Table of Contents

1 Introduction 1

1.1 Preamble . . . 1

1.2 Concepts . . . 3

1.2.1 Software Quality, Faults, and Rework . . . 3

1.2.2 Software Testing Concepts . . . 7

1.2.3 Software Process Improvement . . . 14

1.2.4 Improvement Implementation . . . 16

1.2.5 Software Measurement . . . 18

1.3 Related Work and Research Questions . . . 23

1.3.1 Anomaly Measurements . . . 23

1.3.2 Evaluation . . . 28

1.3.3 Research Questions . . . 31

1.4 Research Methodology . . . 32

1.4.1 Research Methods . . . 32

1.4.2 Research Validity . . . 35

1.4.3 Research Approach and Environment . . . 37

1.4.4 Research Process . . . 41

1.5 Chapter Outline and Contributions of the Thesis . . . 41

1.5.1 Overall Contributions . . . 41

1.5.2 Outline and Chapter Contributions . . . 42

2 Phase-Oriented Analysis – Method and Case Study 47
2.1 Introduction . . . 47

2.2 Method . . . 48

2.2.1 FST Measurement . . . 48

2.2.2 Average Fault Cost . . . 49

2.2.3 Improvement Potential . . . 51


2.3 Case Study Results . . . 52

2.3.1 Case Study Setting . . . 52

2.3.2 FST . . . 53

2.3.3 Average Fault Cost . . . 55

2.3.4 Improvement Potential . . . 56

2.4 Discussion . . . 57

2.4.1 Lessons Learned . . . 57

2.4.2 Implications of the Results . . . 58

2.4.3 Validity Threats to the Results . . . 59

2.5 Conclusions . . . 60

3 Phase-Oriented Analysis – Multi-Site Application 61
3.1 Introduction . . . 61

3.2 Method . . . 62

3.2.1 Identification of Challenges . . . 62

3.2.2 Enhanced Measurement Method . . . 63

3.2.3 Study Context . . . 67

3.3 Results . . . 67

3.3.1 Measurement Data . . . 68

3.3.2 Analysis of the Results . . . 72

3.4 Lessons Learned . . . 72

3.4.1 Management . . . 72

3.4.2 Performance Measurements . . . 73

3.4.3 FST Measurement . . . 74

3.5 Validity Threats . . . 74

3.6 Conclusions . . . 75

4 Phase-Oriented Analysis – Monitoring Development 77
4.1 Introduction . . . 77

4.2 Method . . . 78

4.2.1 In-Process FST Measurement . . . 78

4.2.2 Case Study Setting . . . 79

4.3 Case Study Results . . . 80

4.3.1 Application of the Method on the Three Case Study Projects . . . 80
4.3.2 Result Analysis . . . 82

4.4 Discussion . . . 83

4.5 Conclusions . . . 84


5 Activity-Oriented Analysis 87

5.1 Introduction . . . 87

5.2 Method . . . 88

5.2.1 Measurement Approach . . . 88

5.2.2 Case Study Setting . . . 91

5.3 Case Study Results . . . 92

5.3.1 Fault Trigger Classification . . . 92

5.3.2 FST Measurement . . . 92

5.3.3 Combination of FST and Fault Triggers . . . 93

5.3.4 Analysis of the Case Study Results . . . 94

5.4 Discussion . . . 95

5.4.1 Lessons Learned . . . 96

5.5 Validity Threats . . . 96

5.6 Conclusions . . . 97

6 An Analysis Model for Rework Reduction 99
6.1 Introduction . . . 99

6.2 Research Approach . . . 100

6.3 Anomaly Metrics Model . . . 101

6.4 Case Study Results . . . 105

6.4.1 Application of Anomaly Data on the Model . . . 105

6.4.2 Impact of the Analysis Results . . . 106

6.5 Model Evolution to the Current Contents . . . 109

6.5.1 Phase-Detection Effectiveness (Analysis Level 3) . . . 110

6.5.2 Phase-belonging Measurement (Analysis Level 3) . . . 110

6.5.3 Cause-oriented Type Classification (Analysis Level 4) . . . . 110

6.5.4 Test-oriented Type Classification (Analysis Level 4) . . . 111

6.5.5 Module-oriented Type Classification (Analysis Level 4) . . . 111

6.5.6 Combined FST and Trigger Classification (Analysis Levels 3 and 4) . . . 112

6.5.7 ‘Not Fault’ Anomalies (Analysis Levels 1 and 2) . . . 112

6.5.8 Multi-Measurement Analysis (Analysis Level 4) . . . 113

6.5.9 Other Considered but Excluded Metrics . . . 113

6.6 Validity Threats . . . 113

6.6.1 Validity of Model . . . 114

6.6.2 Validity of the Case Study . . . 114

6.6.3 Generalizability . . . 114

6.7 Conclusions . . . 115


7 Implementing Early Fault Detection – Impact on Development Efficiency 117

7.1 Introduction . . . 117

7.2 Related Work to the Implemented Concept . . . 118

7.3 Method . . . 119

7.3.1 Background . . . 119

7.3.2 Result Evaluation Method . . . 121

7.4 Results . . . 126

7.4.1 Comparison Against Baseline Projects . . . 126

7.4.2 Comparison Between Features Within a Project . . . 129

7.4.3 Aggregated Total ROI . . . 131

7.5 Discussion . . . 131

7.5.1 Value and Validity of the Results . . . 131

7.5.2 Applicability of the Measurement Method . . . 134

7.6 Conclusions . . . 135

8 Implementing Early Fault Detection – Impact on Quality 137
8.1 Introduction . . . 137

8.2 Related Work to the Implemented Concept . . . 138

8.3 Method . . . 139

8.3.1 Case Study Setting . . . 139

8.3.2 Evaluation Method . . . 140

8.4 Results . . . 141

8.4.1 FST to after RFA . . . 142

8.4.2 Effort-Adjusted FST to after RFA . . . 143

8.4.3 UT FST in Relation to Total FST . . . 144

8.5 Discussion . . . 146

8.5.1 Interpretation of the Results . . . 146

8.5.2 Validity Threats . . . 148

8.5.3 Estimated Cost Savings . . . 148

8.6 Conclusions . . . 149

9 Discussion 151
9.1 Reflections . . . 151

9.1.1 FST Subjectivity . . . 151

9.1.2 Cost-Effectiveness of Faults versus Phases . . . 152

9.1.3 Improvement Goals . . . 153

9.1.4 Relation to Measurement Frameworks . . . 154

9.1.5 Performance Measurements . . . 155

9.1.6 Impact of using Different Development Processes . . . 157


9.1.7 Tool Support . . . 158

9.2 Lessons Learned . . . 161

9.2.1 Quantification . . . 161

9.2.2 Performance Measurements . . . 161

9.2.3 Fault Classification Schemes . . . 162

9.2.4 Measurement implementation . . . 162

9.3 Success factors . . . 163

9.4 Validity Threats to the Research . . . 165

9.4.1 Construct Validity . . . 165

9.4.2 Conclusion Validity . . . 166

9.4.3 Internal Validity . . . 167

9.4.4 External validity . . . 167

10 Concluding Summary 169
10.1 Summary of Results . . . 169

10.2 Conclusions . . . 171

10.3 Further Work . . . 172

A Appendix A: Result Calculations for Chapter 7 187
B Appendix B: FST Definition at Ericsson 189
B.1 Unit Test . . . 189

B.2 Integration Test . . . 189

B.3 Function Test . . . 190

B.4 System Test . . . 190

List of Figures 191

List of Tables 192


Chapter 1

Introduction

1.1 Preamble

Software is one of the most complex human constructs because no two parts are alike. If they are, they are made into one (Brooks 1974). Research and development of techniques for making software development easier and faster have been going on for as long as software has existed. Still, projects commonly spend at least 50 percent of their development effort on rework that could have been avoided or at least been fixed less expensively (Boehm and Basili 2001). That is, 20-80 percent depending on the maturity of the organization and the types of systems the organization develops (Boehm and Basili 2001), (Shull et al. 2002), (Veenendaal 2002). In a larger study on productivity improvement data, most of the effort savings generated by improving software process maturity, software architectures, and software risk management came from reductions in avoidable rework (Boehm et al. 2000). A major reason for this is that faults are cheaper to find and remove earlier in the development process (Boehm 1983), (Boehm and Basili 2001), (Shull et al. 2002). Omitting early quality assurance results in significantly more faults found in testing, and such an increase commonly overshadows the savings from omitting those early activities. It has also been reported that the impact of defective software is estimated at almost 1 percent of the Gross Domestic Product (GDP) of the U.S.A. (Howles and Daniels 2003). Further, fewer faults in late test phases lead to improved predictability and thereby increased delivery precision since the software processes become more stable when most of the faults are removed in earlier phases (Tanaka et al. 1995), (Rakitin 2001). Therefore, there is a large interest in approaches that can reduce the cost of rework, e.g. by making sure that more faults are found earlier.

Although most software development organizations are aware that faults are cheaper to find earlier, many still struggle with high rework costs (Boehm and Basili 2001), (Shull et al. 2002). In our experience, two primary reasons have contributed to this unrealized rework reduction:

Short-term focus: Under time pressure, early quality assurance activities are omitted in order to deliver to the test department faster (Maximilien and Williams 2003). In our experience, this problem is highly related to challenges of software process improvement in general, e.g. the conflict between short-term and long-term goals. “When the customer is waving his cheque book at you, process issues have to go” (Baddoo and Hall 2003). Further, people easily become resistant because, when under constant time pressure, they do not have time to understand a change and the benefits it will bring later (Baddoo and Hall 2003). Additionally, the short-term/long-term conflict to a large degree explains the high failure rate of implementation of identified improvements, e.g. 70 percent in one survey (Ngwenyama and Nielsen 2003). Turning assessment results into actions is where most organizations fail (Mathiassen et al. 2002).

Difficult improvement selection: There are always far more opportunities for improvement than there are resources available to implement them (Bullock 2000). For example, if improvements in unit testing are desired, large investments in tools and improved product testability are commonly required. Therefore, the challenge is to prioritize the areas in order to know where to focus the improvement work (Wohlwend and Rosenbaum 1993). Without proper decision support for selecting which problem areas to address, it is common that improvements are not implemented because organizations find them difficult to prioritize (Wohlwend and Rosenbaum 1993).

For companies under high market pressure, these two aspects are unavoidable. Therefore, such companies must learn to manage these two realities in continuous improvement work. To achieve this, there is in our experience one factor that is far more important than any other: the underlying reason why improvements are not implemented is that the value of suggested improvements is rarely quantified. Only when organizations can accurately determine what to improve and estimate the return on investment will the business impact of a change become evident (Chillarege 2002). The product owner can then prioritize the suggested improvements against incoming customer requirements instead of having to take risks with an investment that is based on subjective opinions. Therefore, the primary objective of this thesis is to determine how to provide quantified decision support to reduce the effort spent on rework.
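In its simplest form, and stated here as a generic definition rather than one taken from the thesis, the return on investment (ROI) of a proposed improvement can be expressed as:

\[ \mathrm{ROI} = \frac{\text{estimated rework savings} - \text{implementation cost}}{\text{implementation cost}} \]

An improvement is then easy to defend against incoming customer requirements only when the estimated savings clearly exceed the cost of implementing it.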

The remainder of this chapter is outlined as follows. Section 1.2 provides an overview of concepts related to the area of rework reduction. After that, Section 1.3 investigates existing research related to the objectives of the thesis, followed by an identification of the research questions to address in the thesis. Section 1.4 describes the methodology for how the thesis addressed the research questions, including the industrial context of the research (i.e. Ericsson AB, from now on referred to as Ericsson). Finally, Section 1.5 outlines the remainder of the thesis including a summary of major contributions.

1.2 Concepts

This section provides an overview of software engineering concepts relevant to this thesis. It defines general terms and provides an overview of the research area. Thus, it provides a foundation for the related work description provided in the next section.

First, Section 1.2.1 defines the relationship between quality, faults, and rework. Then, since most rework occurs during testing, Section 1.2.2 provides an overview of testing concepts. Finally, Sections 1.2.3, 1.2.4, and 1.2.5 provide an overview of the areas that are important to understand when working with rework reduction, i.e. software process improvement, improvement implementation, and software measurement.

1.2.1 Software Quality, Faults, and Rework

Terminology

Software quality is hard to define and impossible to measure since it is very elusive. That is, a number of factors, of which several are impossible to put a number on, determine the quality of a software product, i.e. portability, reliability, efficiency, usability, testability, understandability, and modifiability (Glass 2003). It is not enough to define quality as for example ‘conformance to requirements’, ‘reliability’ or ‘user satisfaction’ (Glass 2003). Nevertheless, the best available indicator of insufficient quality is the anomalies reported in testing or in live operation.

There is a great diversity in the research literature regarding the terminology used to report software or system related anomalies, e.g. the anomalies may be denoted anomalies, problems, troubles, bugs, defects, errors, faults or failures (Mohagheghi et al. 2006). In this thesis, the following definitions are used. “A fault is a manifestation of an error. A fault, if encountered, may cause a failure” (IEEE 1990). The fault may be in the software or in the surrounding system environment including the documentation. The term anomaly is used for reported issues that might be faults. That is, in accordance with the IEEE standard definition, an anomaly is “any condition that deviates from expectations based on requirement specifications, design documents, user documents, standards, etc. or from someone’s perceptions or experiences” (IEEE 1993). To summarize, the typical chain of events in testing or live usage is as follows.


An error is made that causes a failure. The failure leads to a reported anomaly. When the reported anomaly is analyzed, the fault(s) causing the failure is found and corrected.

The term defect is not used in the definitions above since, in the research literature, it is used interchangeably to mean an anomaly, a fault, a failure, or a combination of these (Mohagheghi et al. 2006). Therefore, it is in this thesis only used as a general term when it is not suitable to refer to any of the terms defined above.

Rework is about revising an existing piece of software or related artifact. Therefore, a typical rework activity is to correct reported anomalies. Rework can be divided into two primary types of corrective work (Fairley and Willshire 2005):

• Avoidable rework is work that would not have been needed if the previous work had been correct, complete, and consistent (Fairley and Willshire 2005). Such rework consists of the effort spent on fixing software difficulties that could have been discovered earlier or avoided altogether (Boehm and Basili 2001).

• Unavoidable rework is work that could not have been avoided because the developers were not aware of or could not foresee the change when developing the software, e.g. changed user requirements or environmental constraints (Fairley and Willshire 2005).

However, striving towards having no avoidable rework is not the optimal solution since some development efforts benefit from including ‘avoidable rework’. For example, having avoidable rework does not necessarily reduce the efficiency in modern development processes where the requirements emerge from prototyping and other evolutionary development activities, i.e. when it is hard to clearly specify the user requirements up-front. Then, it might be more cost-effective to change the system afterwards than to put significant effort into specifying the requirements correctly up-front (Boehm and Basili 2001). In these cases, the rework instead becomes a natural part of the learning process. Another reason for preferring some avoidable rework is that sometimes a fault is introduced in a certain phase but it is not efficient to find in the same phase. That is, it might appear easy to reduce the amount of rework just by putting more focus on early verification activities such as reviews. However, peer reviews, analysis tools, and testing catch different types of faults at different stages in the development cycle (Boehm et al. 2000). Therefore, a more balanced view is required where the objective is not just to find and remove all faults as early as possible. Instead, the cost-effectiveness of different techniques in relation to different types of faults should be in focus. To separate rework that is avoidable but not cost-effective to avoid or remove earlier, avoidable rework should be divided into two subtypes (a small formal sketch follows the definitions below):


• Necessary rework is avoidable rework that would not have been more cost-effective to avoid or find and remove earlier. This means that both conformance costs (see Figure 1.1 below) and the cost of removing a fault are considered.

• Unnecessary rework is avoidable rework that would have been more cost-effective to avoid or find and fix earlier.
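One way to make this distinction concrete, using notation introduced here for illustration and not taken from the thesis: let $C_{fix}(f, p)$ denote the cost of finding and fixing fault $f$ in phase $p$, and $C_{conf}(p)$ the additional conformance cost (prevention and appraisal effort, cf. Figure 1.1) needed to catch the fault in phase $p$. A fault actually found in phase $p_a$ then causes unnecessary rework if some earlier phase $p_e$ would have been cheaper overall:

\[ \mathrm{unnecessary}(f) \iff \exists\, p_e < p_a :\; C_{fix}(f, p_e) + C_{conf}(p_e) < C_{fix}(f, p_a) \]

Otherwise, the avoidable rework caused by $f$ is necessary in the sense defined above.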

Cost of Quality

Quality involves development costs no matter which quality assurance strategy is chosen. That is, either a significant effort is spent on making sure that the developed software has a high quality, or a low-quality system is developed with the cost of failures as the resulting cost of quality instead. Figure 1.1 illustrates this relationship as the cost of conformance versus non-conformance (Slaughter et al. 1998). As can be seen in the figure, the cost of conformance includes prevention costs (i.e. activities to prevent faults from being introduced) and appraisal costs (e.g. inspections and testing for measuring and evaluating software systems) (Slaughter et al. 1998). The cost of non-conformance includes the costs of isolating, fixing, and verifying faults. If the faults were found due to external failures in live usage, the cost of quality also includes for example the cost of field service and lost customer satisfaction (Slaughter et al. 1998).

Prevention costs, appraisal costs, and internal failures affect the cost of development whereas the cost of external failures affects maintenance costs and future revenues. The cost of non-conformance can be reduced by spending more effort on prevention and appraisal activities, i.e. through rework reduction.

Figure 1.1: Cost of Quality – cost of conformance (prevention costs, appraisal costs) versus cost of non-conformance (internal failure, external failure; setup, execution, fault removal, effect)


Cost of Faults

The major reason why it is possible to improve efficiency through rework reduction is the difference in fault costs at different development stages. That is, Figure 1.2 demonstrates how the cost of faults typically rises by development phase. For instance, the cost of finding and fixing faults is often 100 times more expensive after delivery than during the design phase (Boehm and Basili 2001). The implication of such a cost curve is that the identification of software faults earlier in the development cycle is the quickest way to make development more productive (Groth 2004). In fact, the cost of rework could be reduced by up to 30-50 percent by finding more faults earlier (Boehm 1987). Therefore, employees should dare to delay the deliveries to the test department until the code has reached an adequate quality since high performing projects design more and debug less (DeMarco 1997). However, the differences in fault costs depend on the development practice used. That is, agile practitioners claim not to have such steep fault cost curves (Beck 2003). The reason for this is a significantly reduced feedback loop, which is partly achieved through test-driven development (Ambler 2004). Nevertheless, any software system that cannot solely rely on the quality assurance achieved during unit testing, i.e. that requires integration/system testing, cannot avoid having a rather steep fault cost curve. That is, the gentle change curve advocated by Beck is correct only if each piece of client-valued functionality can be delivered in isolation (Anderson 2003).

Figure 1.2: Cost of Rework – the average cost of removing a fault rises over time from Design through Coding, Unit Test, Function Test, and System Test to Operation
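As a rough illustration of why the shape of this curve matters, the sketch below uses purely hypothetical per-phase fault costs and fault counts (not data from the thesis) to estimate how much total rework cost changes when more of the same faults are caught already in unit test:

```python
# Hypothetical average cost (hours) of finding and fixing one fault per phase.
FAULT_COST = {"unit_test": 1, "function_test": 5, "system_test": 15, "operation": 100}

def rework_cost(faults_per_phase):
    """Total rework cost for a given distribution of detected faults."""
    return sum(FAULT_COST[phase] * count for phase, count in faults_per_phase.items())

# Baseline: many of the 100 faults slip through to late phases.
baseline = {"unit_test": 50, "function_test": 30, "system_test": 15, "operation": 5}
# Improved: the same 100 faults, but more of them caught already in unit test.
improved = {"unit_test": 75, "function_test": 15, "system_test": 8, "operation": 2}

saving = rework_cost(baseline) - rework_cost(improved)
print(rework_cost(baseline), rework_cost(improved), saving)  # 925 470 455
```

With these made-up numbers, roughly half of the baseline rework cost disappears even though exactly the same faults are removed – they are only found earlier – which is the mechanism behind the 30-50 percent figure cited above.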


1.2.2 Software Testing Concepts

Software has been tested for as long as software has been written because, without testing, there is no way of knowing whether the system will work or not before live use (Graham 2001). Software testing can be defined as:

“The planning, preparation, and execution of tasks to establish the characteristics of a software product and to determine the difference between the actual and required status, in order to meet the quality requirements of the customers and to mitigate risk” (Veenendaal 2002)

In practice, the primary purposes of software testing are to give confidence that the system works and at the same time try to break it (Graham 2001). Verification and validation are terms that are commonly used in conjunction with testing. The established definitions for these terms are (Pfleeger 2001), (Rakitin 2001), (Sommerville 2004):

Validation - are we building the right product?

Verification - are we building the product right?

There is however no universal agreement on which stages of the development cycle can be considered verification and which validation. Nevertheless, a majority of the research community at least agree that verification and validation tasks should span the entire development cycle (IEEE 2004), (Lewis and Bassetti 2004), (Wallace and Fujii 1989). Both verification and validation are important parts of testing. In testing for validation, the observed behavior from a test execution is checked against user expectations, and in testing for verification, it is checked against a specification (SWEBOK 2004).

Testing software to ensure that it is both validated and verified can be achieved in several different ways. The remainder of this subsection describes common approaches to do this in an efficient way. The description starts with a definition of what efficient testing is and how it is achieved.

Efficient Testing

Traditionally, test efficiency is measured by dividing the number of faults found in a test by the effort needed to perform the test (Pfleeger 2001). This can be compared to test effectiveness, which only focuses on how many faults a technique or process finds without considering the costs of finding them (Pfleeger 2001). In the context of this thesis, an efficient test process verifies that a product has reached the desired quality level at the lowest cost.
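Restated as a simple formula (the standard textbook form, not notation from the thesis):

\[ \text{test efficiency} = \frac{\text{number of faults found}}{\text{effort spent on the test}} \]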

In our experience, achieving efficient testing is dependent on a number of factors. They include using appropriate techniques for design and selection of test cases, having sufficient tool support, and having a test process that tests the different aspects of the product in the right order. An efficient test process verifies each product aspect in the test phase where it is easiest to test and the faults are cheapest to fix. Additionally, it avoids redundant testing. However, achieving an efficient test process is in practice far from trivial.

Test levels

The test phases of a test process are commonly built on the underlying development process, i.e. each test phase verifies a corresponding design phase. This relationship is typically presented as a V-model (Watkins 2001). The implication of this is that for each design phase in a project, the designers/testers make a plan for what should be tested in the corresponding test phase before moving on to the next design phase. However, new development processes and the realization that it is more efficient to start with test activities already during early phases have resulted in enhancements of the V-model, commonly illustrated as a W-model (see Figure 1.3).

Figure 1.3: W-Model for Development and Testing (Spillner 2002) – each specification and design phase (customer requirements, system specification, high-level design, detailed design, coding) is paired with planning of the corresponding tests, and the later execution of acceptance, system, integration, and unit tests is followed by debugging/changing

The contents of each test level differ a lot between different contexts. Different names are used for the same test levels and different contexts have different testing needs. However, in one way or another, most organizations perform the test activities included in the following test levels.

Unit/component testing: Tests the basic functionality of code units/components. The programmer who wrote the code normally performs this test activity (Graham 2001). Since the tests focus on smaller chunks of code, it is easier to isolate the faults because they are normally located in the tested unit (Patton 2001).

Function/integration testing: When two or more tested components are combined into a larger structure, integration testing looks for defects in the interfaces between the components and in the functions (Graham 2001).

System testing: After integration testing is completed, system testing verifies the system as a whole. This phase looks for faults in all functional and non-functional requirements on a system level (Graham 2001).

Acceptance testing: When the system tests are completed and the system is about to be put into operation, the test department commonly conducts an acceptance test together with the customer (Graham 2001).

In addition to these test levels, there is one vital test activity that is not considered as a standalone phase but rather is performed repeatedly within the other phases. That is, regression testing, which is applied after a module is modified or a new module is added to the system. The purpose of regression testing is to re-test the modified program in order to re-establish confidence that the program still performs according to its specification (Graham 2001).

Test Perspectives

Positive versus negative testing: Testing can have different purposes, i.e. to find faults (negative) or demonstrate that the software works (positive) (Watkins 2001). Positive testing only needs to assure that the system minimally works whereas negative testing commonly involves checking special circumstances that are outside the strict scope of the requirements specification (Watkins 2001).

Static versus dynamic testing: Examination of the behavior of a system can be performed with or without executing the actual code, i.e. examining the code either through ocular examination or with a tool without observing the run-time behavior is part of static analysis (Veenendaal 2002).

Functional versus structural testing: Functional and structural testing are commonly also called black-box and white-box testing. That is, in functional testing the software is perceived as a black box which is impossible to look into to see how the software operates (Patton 2001). In structural testing the test cases can be designed according to the physical structure of the software (Watkins 2001).


Functional versus non-functional testing: The functional perspective can be further sub-divided into two sub-areas since every system, besides functional requirements, also has explicit or implicit quality requirements on the functional requirements, i.e. what are commonly named non-functional requirements.

Test Techniques

Selecting an adequate set of test cases is a very important task for the testers. Otherwise, it might result in too much testing, too little testing, or testing the wrong things (Patton 2001). Additionally, reducing the infinite possibilities to a manageable effective set and weighing the risks intelligently can save a lot of testing effort (Patton 2001). Figure 1.4 provides an overview of common test techniques divided according to whether they are functional, non-functional, structural, or analytical. Analytical techniques (not defined under the previous heading) are structural techniques where test cases are not used as verdicts for pass and fail. Additionally, some techniques are also denoted statistical, e.g. where tests are generated based on samples from the intended usage environment (Mills et al. 1987). The figure has been developed based on different listings and classifications of techniques, e.g. (Beizer 1990), (Juristo et al. 2004), (Pfleeger 2001), (SWEBOK 2004).

Comparing the effectiveness and efficiency of different techniques is very difficult. That is, “there is no such thing as a best test technique, a program can be looked at from different points of view that reveals different kinds of faults” (Beizer 1990). Additionally, the programs used for evaluations cannot be sampled randomly, or in any other way that ensures that the used program can be considered typical, since every program written in industry is fundamentally different (Weyuker 1993). Further, as illustrated in Figure 1.5, test techniques tend to become less efficient the more time is spent on them (Wagner and Seifert 2005).

In practice, little is known about the relative efficiency of different techniques. According to one survey, structural and functional testing tend to outperform inspections (Runeson et al. 2006). However, for products with relatively high defect levels, inspections are most likely to be more efficient than testing because inspections can continue although defects are detected, i.e. defects found in testing commonly block further testing until the defect is fixed (Tian 2001).

A commonly recurring pattern when comparing V&V techniques is that they tend to be complementary, i.e. a combination of V&V activities is necessary to efficiently remove defects and ensure a sufficient product quality (Tian 2001). Nevertheless, the people factor is by far the most important factor to achieve effective and efficient V&V (Tian 2001).


Figure 1.4: Overview of Test Techniques


Figure 1.5: Typical Cash Flow for a Test Technique (Wagner and Seifert 2005) – efficiency plotted against effort spent

Test Strategies

The term test strategy is commonly used but rarely defined. Veenendaal defines it as “a high-level document defining the test phases to be performed” (Veenendaal 2002). Below, this thesis defines the term based on its common usage in the context of this research, i.e. at Ericsson.

A test strategy is the foundation for the test process, i.e. it states how a product should reach the desired quality level at the lowest cost. The test strategy states which test levels the test process should have and what each test level is expected to achieve, e.g. which test areas to cover. This implies that the strategy at least implicitly defines which types of faults shall be found at which test level. However, the test strategy does not state what concrete activities are needed to fulfill the requirements since that is described in the test process. Additionally, a test strategy may also state overall goals such as to find the faults as early as possible or to use test automation extensively. Finally, the test strategy has a long-term perspective, e.g. if a requirement in the strategy requires significant investments, it might take a few years before the projects are able to fully comply with the strategy.

Test Automation

Many think that automated testing is just the execution of test cases but in fact it involves three activities: creation, execution, and evaluation (Poston 2005). Additionally, Fewster and Graham (1999) include other pre- and post-processing activities that need to be performed before and after executing the test cases. Such pre-processing activities include generation of customer records and product data. Further, post-processing activities analyze and sort the outputs to minimize the amount of manual work (Fewster and Graham 1999). However, the most important part of post-processing is the result comparison, i.e. the idea is that each test case is specified with an expected result that can be automatically compared with the actual result after execution (Fewster and Graham 1999). The list below provides an overview of possible techniques and tools to consider when implementing test automation.

• Code analysis tools, i.e. static or dynamic code analysis (Pfleeger 2001)

• Test case generators (Beizer 1990)

• Capture-and-replay tools (Beizer 1990)

• Scripting techniques, e.g. linear, structured, data-driven, and framework scripting (Fewster and Graham 1999), (Mosley and Posey 2002)
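As a minimal illustration of the data-driven scripting idea and of the automatic result comparison described above, the sketch below tests a hypothetical rating function; neither the function nor the test data comes from the thesis.

```python
# Hypothetical system under test: a call-rating function.
def call_cost(seconds: int, tariff: str) -> float:
    rates = {"standard": 0.02, "premium": 0.05}
    return round(seconds * rates[tariff], 2)

# Data-driven test cases: each row pairs inputs with an expected result.
TEST_CASES = [
    ("standard_short_call", (30, "standard"), 0.60),
    ("premium_short_call", (30, "premium"), 1.50),
    ("standard_zero_seconds", (0, "standard"), 0.00),
]

def run_suite():
    """Execute all cases and compare actual output with the expected result."""
    failures = []
    for name, args, expected in TEST_CASES:
        actual = call_cost(*args)
        if actual != expected:  # automatic result comparison
            failures.append((name, expected, actual))
    return failures

if __name__ == "__main__":
    for name, expected, actual in run_suite():
        print(f"FAIL {name}: expected {expected}, got {actual}")
```

New regression cases are then added as data rows rather than as new code, which is what keeps an automated regression suite cheap to grow.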

The most important benefit of test automation can be obtained during regression testing. For example, one study reported that practicing automated regression testing at code check-in resulted in a 36 percent reduction in fault rates (MacCormack et al. 2003). Another important aspect when considering test automation is testability. The success of test automation is highly dependent on having robust and common product interfaces that are easy to connect to test tools and that will not cause hundreds of test cases to fail upon an architecture change. The more testable the software is, the less effort developers and testers need to locate the faults (McGregor and Sykes 2001). In fact, testability might even be a better investment than test automation (Kaner et al. 2002).

Test-Driven Development

The concept of Test-Driven Development (TDD) has been sporadically used for a long time, e.g. one usage case from as early as the late 1960’s has been reported (Larman and Basili 2003). However, it became popular with the emergence of the development practice eXtreme Programming (XP) (Beck 2000). Among the practices included in XP, TDD is considered as one of few that has standalone benefits (Fraser et al. 2003).

The main difference between TDD and a typical test process is that in TDD, the developers write the tests before the code. A result of this is that the test cases drive the design of the product since it is the test cases that decide what is required of each unit (Beck 2003). “The test cases can be seen as example-based specifications of the code” (Madsen 2004). In short, a developer who uses traditional TDD works in the following way (Beck 2003):


1. Write the test case
2. Execute the test case and verify that it fails as expected
3. Implement code that makes the test case pass
4. Refactor the code if necessary
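A minimal sketch of one such cycle in Python (a hypothetical is_leap_year unit, not code from the thesis): the test class is written and executed first, fails because the function does not yet exist, and only then is the implementation added.

```python
import unittest

# Step 3: the implementation, added only after the tests below existed and failed.
def is_leap_year(year: int) -> bool:
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Steps 1-2: these test cases are written first and run to verify that they fail.
class LeapYearTest(unittest.TestCase):
    def test_century_years(self):
        self.assertTrue(is_leap_year(2000))   # divisible by 400
        self.assertFalse(is_leap_year(1900))  # divisible by 100 but not 400

    def test_ordinary_years(self):
        self.assertTrue(is_leap_year(1996))
        self.assertFalse(is_leap_year(1999))

if __name__ == "__main__":
    unittest.main()
```

Step 4, refactoring, is then performed with the passing tests as a safety net, which is also where the continuous-quality-assurance benefit discussed below comes from.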

The most obvious advantage of TDD is the same as for test automation in general, i.e. the possibility to do continuous quality assurance of the code. This gives both instant feedback to the developers about the state of their code and, most likely, a significantly lower percentage of faults left to be found in later testing and at customer sites (Maximilien and Williams 2003). Further, with early quality assurance, a common problem with test automation is avoided. That is, when an organization introduces automated testing late in the development cycle, it becomes a catch for all faults just before delivery to the customer. The corrections of found faults lead to a spiral of testing and re-testing which delays the delivery of the product (Kehlenbeck 1997).

The main disadvantage with TDD is that, in the worst case, the test cases duplicate the amount of code to write and maintain. However, this is the same problem as for all kinds of test automation (Hayes 1995). Nevertheless, to what extent the amount of code increases depends on the granularity of the test cases and what module level the test cases encapsulate, e.g. class level or component level. Finally, TDD is quite hard for developers to learn, i.e. it is hard to write efficient unit tests (Crispin 2006). However, according to Crispin, once a team passes the painful learning curve they will never go back to the old way (Crispin 2006).

1.2.3 Software Process Improvement

Working with rework reduction commonly implies that current development processes need to be improved. This is the domain of Software Process Improvement (SPI), which during the 90s started gaining a lot of attention through quality assessment and improvement paradigms such as the ISO standard (ISO 1991) and the Capability Maturity Model (CMM) (Paulk et al. 1995). Since then, the field of SPI has evolved significantly and is today dominated by two contradictory approaches, i.e. top-down SPI and bottom-up SPI:

Top-down SPI is sometimes also called prescriptive since the frameworks using the approach provide a set of best practices or processes that are to be adhered to by all organizations using the framework. The basic rationale behind these frameworks is that consistent usage of well-defined software processes combined with continuous SPI will substantially improve productivity and quality (Krishnan and Kellner 1999).


From feedback of earlier applications of ISO and CMM, enhanced versions of these top-down frameworks have been developed. For example, SPICE (ISO/IEC 15504), which is focused on process assessments (El Emam et al. 1998), and Bootstrap, which is another top-down process maturity framework developed by a set of European companies as an adaptation of CMM (BOOTSTRAP 1993). Further, CMM has been tailored into sub-versions such as SW-CMM, which is adapted for software development (SW-CMM 2005). Recently, an integration of existing CMM variants has also been gathered into a model called CMMI (Capability Maturity Model Integration) (CMMI 2002). Work has also been done to tailor CMM to test process improvement, i.e. the Test Maturity Model (TMM) (Veenendaal 2002). Here CMM has been adapted to what the test process should achieve for each maturity level. Besides TMM, there exist other similar maturity-oriented models such as the Test Process Improvement (TPI) model and the Test Improvement Model (TIM). The TPI model is centered on 20 key areas with different levels of maturity (Koomen and Pol 1999) whereas TIM has four key areas connected to a four-level improvement ladder (Ericson et al. 1997).

Bottom-up SPI is the opposite of the top-down approach: improvements are identified and implemented locally in a problem-based fashion (Jakobsen 1998). The bottom-up approach is sometimes also referred to as an inductive approach since it is based on a thorough understanding of the current situation (El Emam and Madhavji 1999). A typical bottom-up approach is the Quality Improvement Paradigm (QIP), where a six-step improvement cycle guides an organization through continuous improvements (Basili and Green 1994). The measurement part of QIP utilizes the Goal-Question-Metric (GQM) paradigm (Basili 1992), which is centered on goal-oriented metrics. GQM is described in detail in Section 1.3.1. Since problem-based improvements occur spontaneously in the grassroots of several organizations, several other more pragmatic approaches to bottom-up improvements exist (Jakobsen 1998), (Mathiassen et al. 2002). Identifying problems and then improving against them can be achieved without using a formal framework.

Although these major approaches to SPI have been developed independently, there exist some examples of cross-fertilization as well. For example, PSP (Personal Software Process) (Humphrey 1994) and IDEAL (Initiating Diagnosing Establishing Acting Learning) (McFeeley 1996) are both inductive models for continuous identification and implementation of process improvements. However, they are both also explicitly related to the CMM framework (Krishnan and Kellner 1999).

A major driving force of the top-down frameworks has been that if the development process has a high quality, the products that are developed with it also will (Whittaker and Voas 2002). The basic motivation for using the bottom-up approach instead of the top-down approach is that process improvements should focus on problems in the current process instead of trying to follow what some consider to be best practices (Beecham and Hall 2003), (Glass 2004), (Jakobsen 1998), (Mathiassen et al. 2002). Just because a technique works well in one context does not mean that it also will in another (Glass 2004). Additionally, different processes are preferable depending on the product complexity and where in the life-cycle the product is, i.e. entry, growth, stability, or sunset (Chillarege 2002). For example, a mature product requires more rigorous processes than a new product (Chillarege 2002).

Another advantage with problem-based improvements is that the improvement work becomes more focused, i.e. one should identify a few areas of improvement and focus on those (Humphrey 2002). Nevertheless, this does not mean that the top-down approaches are never beneficial. For example, such frameworks could guide immature companies that do not have sufficient knowledge of what aspects of their processes to improve. However, in these cases, they should be considered as recipes instead of blueprints, which historically has not been the case (Aaen 2003).

1.2.4 Improvement Implementation

As stated in the introduction of this chapter, the failure rate of process improvement implementation is reported to be about 70 percent (Ngwenyama and Nielsen 2003). Therefore, it is not surprising that practitioners want more guidance on how, not just what, to improve (Rainer and Hall 2002), (Niazi et al. 2005). Much of the failure is blamed on the above-described top-down frameworks since they do not consider that, because SPI is creative, feedback-driven, and adaptive, the concepts of evolution, feedback, and human control are of particular importance for successful process improvement (Gray and Smith 1998).

Several studies have been conducted to determine characteristics of successful and failed SPI attempts. In a study that gathered results from several previous research studies, and also conducted a survey among practitioners, the following success factors and barriers were identified as most important (Niazi et al. 2005).

Success factors:

• Senior management commitment

• Staff involvement

• Training and mentoring

• Time and Resources


Barriers:

• Lack of resources

• Organizational politics

• Lack of support

• Time pressure

• Inexperienced staff

• Lack of formal methodology

• Lack of awareness

Most of these success factors and barriers are probably well-known to anyone who has conducted improvement work in practice. However, the last two barriers require an explanation. As also acknowledged in the beginning of this subsection, a lack of formal methodology concerns guidance regarding how to implement improvements. Further, awareness of process improvements is important to get long-term support from managers and practitioners to conduct process improvements (Niazi et al. 2005).

Another barrier not mentioned in this study but frequently in others is ‘resistance to change’ (Baddoo and Hall 2003). In the list above, this barrier is however strongly related to staff involvement and time pressure. That is, process initiatives that do not involve practitioners are de-motivating and unlikely to be supported, and if they do not have time to understand the benefit a change will give, they are resistant to the change (Baddoo and Hall 2003). A success factor frequently mentioned in other studies is that measuring the impact of improvements increases the likelihood of success (Dyb˚a 2002), (Rainer and Hall 2003). That is, it is very important to be able to demonstrate that proposed changes can be expected to have the desired effects. When we can pre- dict the benefits of proposed improvements based on sound evidence, the credibility is enhanced significantly and the proposals have a much better chance of getting ac- cepted (Florac et al. 1997). Metrics for measuring the impact of improvements are further discussed in Section 1.2.5.

From the results of an analysis of several process improvement initiatives at a department at Ericsson in Gothenburg, Sweden, another important success factor was identified (Borjesson and Mathiassen 2004). That is, the likelihood of implementation success increases significantly when the improvement is implemented iteratively over a longer period of time. The main reason for this was that the first iteration of an implemented change results in chaos that causes resistance. However, the situation stabilizes within a few iterations, and once the chaos phase has passed, the implementation is more likely to succeed (Borjesson and Mathiassen 2004).


1.2.5 Software Measurement

Terminology and Overview

In this thesis, different terms for software measurements are used in different contexts. The terms and their differences are as follows.

Measurement: The act or process of assigning a value to an attribute. A figure, extent, or amount obtained by measuring (IEEE 1998).

Measure: To apply a metric (IEEE 1998).

Metric: States how we measure something, i.e. the degree to which a system, component or process possesses a given attribute (IEEE 1998).

The primary driver for using measurements is to manage software development better. “You cannot control what you cannot measure” (DeMarco and Lister 1987).

Measurements enable people to detect trends and to anticipate problems. This provides better control of costs, reduces risks, improves quality, and ensures that business objectives are met. Measurement methods that identify important events and trends are invaluable in guiding software organizations to informed decisions (Florac et al. 1997).

In improvement work, measurements are particularly important since they are the only way to know whether you are improving (Rakitin 2001).

In practice, however, software measurement is not easy to manage. There are so many possible things to measure that we are easily overwhelmed by opportunities (Park et al. 1996). Additionally, the research community has suggested a vast number of measurements. However, the advocated metrics are commonly either irrelevant in scope (i.e. not scalable to larger programs) or irrelevant in content (i.e. of little practical interest) (Fenton and Neil 1999).

Typical characteristics of good metrics are that they are informative, cost-effective, simple to understand, and objective (Daskalantonakis 1992). However, not everything can be measured objectively, and this does not mean that practitioners should avoid subjective metrics. In fact, choosing only objective metrics may be worse (El Emam et al. 1993). What should instead be strived for is consistency, repeatability, and minimization of errors and noise (Park et al. 1996). The difference with subjective metrics is that they require processes for continuously improving the consistency (Park et al. 1996). In the end, what matters most is that the value of the information obtained exceeds the cost of collecting the data, calculating the metric, and analyzing its values (Daskalantonakis 1992).


Metrics Categories

A common way to categorize different types of software metrics is by the management function they address, i.e. project, process, or product measurements (Fenton and Pfleeger 1997), (Florac et al. 1997):

Project metrics: Software metrics are most commonly used at the project level, e.g. 75 percent in one study (Fredericks and Basili 1998). Typical project metrics monitor the progress of projects, e.g. effort spent, number of test cases passed, and number of unresolved anomalies.

Process metrics: Process metrics are an important driver for process improvement programs (Gopal et al. 2002), (Offen and Jeffery 1997). Examples of common process metrics are the time required per activity and the number of defects detected in different phases (Florac et al. 1997); a small illustrative sketch of the latter is given after this categorization. Software measurement is primarily used to identify the strengths and weaknesses of processes, and to evaluate processes after they have been implemented or changed (SWEBOK 2004). Since there commonly is a discrepancy between the definition and actual usage of processes, process compliance is also of interest to measure. However, since process compliance is hard to measure, it is commonly estimated through questionnaires (Florac et al. 1997).

Product metrics: Typical product metrics measure product size, product structure, and product quality (SWEBOK 2004). In research studies, the static/structural behavior of a product is commonly measured, e.g. Lines of Code (LoC) and code complexity. However, such measures are overrated since they are poor quality assessors (Voas 1997). For example, LoC is commonly used for measuring productivity, which is misleading because the number of LoC in a software program is negatively correlated with design efficiency. That is, an efficient design results in a lower implementation effort and fewer LoC (Jones 2000), (Kan 1994).
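To make the ‘defects detected in different phases’ process metric concrete, the following minimal sketch tallies a hypothetical fault log per detection phase and computes the share of faults found before system test. The fault records and phase names are invented assumptions for illustration, not data from the studied projects.

```python
from collections import Counter

# Hypothetical fault records: (fault id, phase in which the fault was detected).
# The data and phase names are illustrative assumptions only.
fault_log = [
    (1, "review"), (2, "unit test"), (3, "unit test"),
    (4, "function test"), (5, "system test"),
    (6, "system test"), (7, "operation"),
]

# Basic process metric: number of faults detected per phase.
faults_per_phase = Counter(phase for _, phase in fault_log)

# Derived indicator: share of faults detected before system test.
early_phases = {"review", "unit test", "function test"}
early_share = sum(
    count for phase, count in faults_per_phase.items() if phase in early_phases
) / len(fault_log)

for phase, count in faults_per_phase.items():
    print(f"{phase}: {count}")
print(f"Share of faults detected before system test: {early_share:.0%}")
```

In practice, such data would typically be extracted from a fault-reporting system rather than a hard-coded list.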

Besides categorizing metrics by the aspect they measure, it is in this thesis also relevant to categorize them by which level in the organizational hierarchy they address, since some types of metrics are only applicable at certain levels. Daskalantonakis distinguishes the levels as follows to reflect common measurement areas (Daskalantonakis 1992):

The company level (or business unit level), at which data across several projects may be grouped to provide a view of attributes such as quality and cycle time.

The product level (or development unit level), at which data across several projects in the same product area may be grouped to provide an aggregated view of the same attributes.

The project level, at which data within a project is tracked and analyzed in order to plan and control the project, as well as to improve similar projects.

The component level, at which measurements within a component of a product are tracked and analyzed for managing the development and quality improvement of that component.

The primary difference when applying measurements at the different levels is that higher levels mainly concern organizational performance measurements, whereas measurements at lower levels are primarily informational. Performance metrics are mainly process related, i.e. they either measure attributes of the products that the process produces or attributes of the process itself (Florac et al. 1997). Performance measurements are commonly used for benchmarking organizations against each other and for motivating people to perform better. Therefore, they are sometimes also denoted motivational measurements (Austin 1996). Informational measurements have, as the name implies, the purpose of providing information about something. For example, the purpose could be to assess the current status of a project (i.e. coordination measurement) or to identify areas of improvement in a process or product (i.e. process refinement measurement) (Austin 1996).

The major reason why it is important to distinguish between these types of measurements is that motivational measurements are much harder to work with in practice since they easily become dysfunctional (Austin 1996). That is, when people are rewarded or punished based on the outcome of measurements, they will do everything to find ways to optimize the measure even if it causes counter-productive behavior, i.e. because ‘people work according to how they are measured’ (Austin 1996). Most measurements can be manipulated in this way, and the longer the measurement program lasts, the more people learn how to manipulate the results; it commonly takes a while to learn how to manipulate the measurement system. Further, as the goal levels become increasingly hard to achieve, they drive workers to take counter-productive shortcuts (Austin 1996).

As listed below, Florac et al. (1997) define a set of criteria that a performance measure should comply with (note that the list is valid for informational measurements as well). A performance measure should:

• relate closely to the issue under study, e.g. the degree of avoidable rework

• have high information content. That is, a measure that is sensitive to as many facets of process results as possible.

• pass a reality test, i.e. does the measure really reflect the degree to which the process achieves the desired results?

• permit fast and easy collection of data.

• permit consistently collected and well-defined data.


• show measurable variation, i.e. a number that does not change over time does not provide any useful information.

• have a diagnostic value. The measure should help you identify not only if you have an issue, but also what might be causing it.

Measuring Return on Investment

When estimating the expected gains from a suggested improvement (e.g. rework reduction), a measurement approach for comparing the costs and benefits is needed. Such an analysis is in business economics called Return On Investment (ROI). In traditional business case analysis, ROI is calculated as (Benefit − Investment Cost) / (Investment Cost) (Van Solingen 2004). The result of an ROI analysis is a numerical value X, where every invested hour gave X hours of profit, and all values above zero mean a positive return.

However, according to El Emam (2003), this ROI measurement approach is not appropriate in process improvement evaluations because it does not accurately account for the benefits of investments in software projects (El Emam 2003). Instead, El Emam advocates that ROI should be measured according to the formula below, which is based on previous work in the area (Kusumoto 1993). The reason for this is that the Kusumoto approach is more appropriate for ROI analysis of SPI work; applying traditional ROI analysis to SPI implementations sometimes yields misleading results (El Emam 2003). Further, the approach has gained acceptance in the software engineering research community as a valid way to measure ROI (El Emam 2003). The Kusumoto approach was therefore preferred in this thesis.

ROI = (Benefit¹ − Investment Cost) / Original Cost²

¹ The benefit can be measured both as the total benefit of the measured project and as the isolated benefit of specific improvements (when possible to distinguish).

² The ‘Original cost’ equals the total cost the project would have had if the change had not been implemented, which in practice is measured as the actual total cost of the improved project plus the obtained benefit.
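To make the difference between the two formulas concrete, the sketch below computes both the traditional ROI and the Kusumoto-style ROI for the same hypothetical improvement. All figures are invented for illustration and are not taken from the thesis data.

```python
def traditional_roi(benefit, investment_cost):
    """Traditional business-case ROI: (Benefit - Investment Cost) / Investment Cost."""
    return (benefit - investment_cost) / investment_cost

def kusumoto_roi(benefit, investment_cost, actual_project_cost):
    """Kusumoto-style ROI: (Benefit - Investment Cost) / Original Cost, where
    Original Cost = actual total cost of the improved project + obtained benefit."""
    original_cost = actual_project_cost + benefit
    return (benefit - investment_cost) / original_cost

# Hypothetical figures in person-hours (assumptions for illustration only).
benefit = 400               # rework hours avoided thanks to the improvement
investment_cost = 150       # hours spent implementing the improvement
actual_project_cost = 4000  # actual total cost of the improved project

print(f"Traditional ROI: {traditional_roi(benefit, investment_cost):.2f}")
print(f"Kusumoto ROI:    {kusumoto_roi(benefit, investment_cost, actual_project_cost):.2f}")
```

With these numbers the traditional formula yields about 1.67, whereas the Kusumoto formula yields roughly 0.06, since the gain is related to the cost of the whole project rather than only to the improvement investment.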

Success Factors

Establishing software metrics in an organization is not trivial. In fact, one study reported the mortality rate of metrics programs to be about 80 percent (Rubin 1991).

However, the likelihood of a successful metrics implementation increases significantly
