Improving MCDC adequate test sets for safety critical software to be RORG adequate

Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

by

Christoffer Nylén

LIU-IDA/LITH-EX-A--14/067--SE

Linköping 2015-09-01

Linköpings universitet

SE-581 83 Linköping, Sweden

Supervisor: Håkan Anderwall

Saab Aeronautics, Linköping

Jeff Offutt

George Mason University

Examiner: Ahmed Rezine

Abstract

A number of logical code coverage criteria have been used throughout the years in the testing of safety-critical software. Kaminski et al. [4] proposed Relational Operator Replacement Global (RORG), a method to bring benefits from ROR mutation to Modified Condition / Decision Coverage (MCDC), which is widely used in the avionics industry. However, there is a lack of industrial studies to support this method. In this thesis, we report on the results of applying RORG to avionic code: we augment an MCDC adequate test set to satisfy RORG and evaluate its ability to find real faults in industrial software.

Conclusions drawn from this thesis are: (1) Faults in relational operators in avionic code are rare; no faults were found in this study. (2) 24% of the relational operators in our study would require additional software requirements to be verified for RORG coverage. (3) 37% of the relational operators in our study were infeasible to test due to program semantics. (4) 84% of the tests added covered enumeration comparisons.

Linköping, Sweden, September 2015 Christoffer Nylén

Acknowledgements

This master thesis was done at Saab Aeronautics in Linköping. First of all, I would like to thank my supervisor Håkan Anderwall for all of the support and discussion sessions during this work. I also would like to thank Frans Bergquist for discussions of possible thesis projects and explaining the concept of code mutation.

From the university side I want to thank Assistant Professor Ahmed Rezine for follow-ups and acting as examiner. I also would like to thank my distant supervisor Professor Jeff Offutt, George Mason University, for giving me expert guidance.

Finally, I would like to thank my family and friends for motivating me.

Thank you very much.

Linköping, September 2015
Christoffer Nylén

1 Introduction

In recent years, the public has become increasingly reliant on software to ensure safety in fields such as banking, medical instruments and nuclear power. In avionics¹, software controls hydraulics, air pressure, weaponry, fuel and more. The correctness of such safety-critical software is important and needs to be verified through extensive testing. Large parts of project budgets are allocated for testing, around 50% or even more for safety-critical software. The cost is related to the construction of test cases, execution of tests, verification, determination of achieved coverage, qualification of tools, maintenance and education.

In avionics, there are standards not only for how to test but also for how to measure the tests' ability to exercise the code. Logical test criteria like Modified Condition / Decision Coverage (MCDC) [3] have long been a standard for testing logical expressions in code. Traditionally, logical criteria have only looked at the clause level, where each clause is considered a boolean. In this thesis, clauses containing a relational operator will be further examined. In previous studies, the Relational Operator Replacement (ROR) mutation has been used to test the correctness of relational operators. Kaminski et al. [4] proved that only three mutants per clause are necessary and proposed a method (RORG) to bring benefits from ROR mutation to MCDC by making three coverage measurements for each relational operator.

The goal of this thesis is to take an MCDC adequate test set for avionic code, augment it to satisfy RORG, and then compare the number of additional test cases with the number of exposed errors.

Thesis outline: The remainder of this chapter introduces the context of the performed work and the data used, along with its characteristics. Chapter 2 provides a theoretical background and Chapter 3 gives an overview of our implementation of RORG. In Chapter 4 we present the test results of the software under test. Chapter 5 discusses patterns in the results and possible improvements. Finally, Chapter 6 provides overall conclusions and recommendations.

1.1 Context

This thesis was carried out at Saab Aeronautics in Linköping. The idea was to study how mutation could be used to measure tests’ ability to detect faults. Linköping University had been in contact with Professor Offutt at George Mason University, who is one of the foremost experts in the field. Offutt sent a paper [4] with suggested ways to improve logic-based testing. The paper included a method to extend an MCDC adequate test set using ideas from mutation, which turned into this thesis work. The goal was to try the method on real avionic code to make an industrial evaluation.

Ada was chosen as the primary language rather than C/C++ for several reasons. No C/C++ parts with MCDC adequate test sets were available at the time, so using C/C++ would have required more tests and more requirements knowledge. In addition, Håkan Anderwall had been involved in developing an automated test solution that was written in Ada with an MCDC adequate test set, and he was willing to provide expertise regarding both requirements and testing environment.

¹ Avionics are the electronic systems that are significant to an aircraft’s safe and efficient performance.

1.1.1 Subject Program

The chosen software needed to meet the following requirements: statistics from test results could be published in a report, the corresponding test set was fully MCDC adequate, and many predicates had more than one or two clauses. The chosen software had the following characteristics:

Table 1-1: Subject Program
Lines of code   4754
Predicates      350
Clauses         578

Diagram 1-1: Predicate/Clause Distribution

The Stores Management Unit is responsible for the controlled and safe loading, release and unloading of the major weapons (bombs and missiles) carried by an aircraft [7]. The subject code has been used since 2001 to monitor hardware and signal if there are any errors. The expected result is that faults are rare. However, hopefully the number of additional test cases needed to satisfy RORG will be deemed manageable.

1.2 Abbreviations

ACC    Active Clause Coverage
ASIS   Ada Semantic Interface Specification
BVC    Boundary Value Constraint
CACC   Correlated Active Clause Coverage
CC     Clause Coverage
CCE    Coupled Clause Effect
CoC    Complete Clause Coverage
MCDC   Modified Condition/Decision Coverage
GACC   General Active Clause Coverage
GNAT   GNU Ada tool chain
PC     Predicate Coverage
RACC   Restricted Active Clause Coverage
UCT    Unexpected Counter Transition
UMF    Unexpected Minor Frame
Relop  Relational Operator
ROR    Relational Operator Replacement
RORG   Relational Operator Replacement Global
SUT    Software Under Test
TR     Test Requirement

2 Background

In this chapter, logic-based testing will be discussed: why it is recommended for safety-critical code, and different ways of measuring it using testing criteria and improving it using mutation. Also, the structure of a logic-based testing tool will be discussed, including static analysis of the subject code and instrumentation used to store runtime behavior when tests exercise the code.

2.1 Requirement-Based Testing

Software requirements contain a finite list of behaviors and features, each written to be verifiable. The finite list of requirements and its associated completion criteria should turn requirements coverage into a feasible process of verifying software. Low-level requirements and integration between modules and hardware are tested based on requirements. Achieved requirement coverage is then analyzed. Software requirements might not always be directly mapped to all of the behavior represented in the executable code. Code that is not linked to requirements might not be exercised in the requirements coverage. If such code contains unintended functions, they might not be detected.

2.2 Logic-Based Testing

Logical coverage analysis is performed to reveal to what extent the internal logical expressions have been covered by the requirement-based verification process. The input domain of a logical expression can be rather large and as a consequence exhaustive testing often becomes practically impossible. The problem is to come up with a “good enough” test set.

Logic coverage involves a test data adequacy criterion to fulfill. A logical criterion can be thought of as a set of logical test requirements describing properties that the test set must have.

Definition 2.1 Logical Test Requirement: A logical expression that a test case must reach and evaluate in a specified way.

Definition 2.2 Logical Criterion: A set of logical test requirements that describe properties for the test set that can be reached on a measurable level.

A logical expression (shown in Figure 2-2) is composed of a predicate with one or more clauses and boolean operators between the clauses. A predicate without a boolean operator is a clause. Predicates and clauses are also called decisions and conditions.

Figure 2-2: A Logical Expression

2.2.1 Logical Criteria

A number of commonly used logical criteria will now be explained. Let us consider the logical expression (x>y or B) and C derived from the source code in Figure 2-3.

if (x>y or B) and C then
   a;
else
   b;
end if;

Figure 2-3: if statement in Ada source code

We shall now look at test requirements (TR) for different criteria and provide a set of adequate test cases in terms of truth table rows.

Criterion 2.1 Predicate Coverage (PC): For each predicate p:
TR₁: evaluate to true.
TR₂: evaluate to false.

Predicate coverage is a basic criterion that usually can be satisfied in many ways. An example of an adequate test set is shown in Table 2-1. Note that for the selected test set, the truth value does not change for clause B.

Table 2-1: Test set adequate to PC
     x>y  B  C  (x>y or B) and C
1    F    F  F  F
2    F    F  T  F
3    F    T  F  F
4    F    T  T  T
5    T    F  F  F
6    T    F  T  T
7    T    T  F  F
8    T    T  T  T

Criterion 2.2 Clause Coverage (CC): For each clause c_i in predicate p:
TR₁: evaluate to true.
TR₂: evaluate to false.

Clause coverage makes sure that each clause evaluates to true and false. However, it does not necessarily guarantee predicate coverage (see Table 2-2).

Table 2-2: Test set adequate to CC
     x>y  B  C  (x>y or B) and C
1    F    F  F  F
2    F    F  T  F
3    F    T  F  F
4    F    T  T  T
5    T    F  F  F
6    T    F  T  T
7    T    T  F  F
8    T    T  T  T

Criterion 2.3 Complete Clause Coverage (CoC): For all clauses in predicate p:
TR: Every possible combination occurs.

CoC requires all possible combinations of clauses (see Table 2-3). It is considered a very expensive criterion because of its exponential growth in the number of tests, and is therefore mostly not useful in practice.

Table 2-3: Test set adequate to CoC
     x>y  B  C  (x>y or B) and C
1    F    F  F  F
2    F    F  T  F
3    F    T  F  F
4    F    T  T  T
5    T    F  F  F
6    T    F  T  T
7    T    T  F  F
8    T    T  T  T

In Table 2-4 we can see that changing the truth value of clause C does not affect the outcome of the predicate.

Table 2-4: Test set adequate to clause coverage
     x>y  B  C  (x>y or B) and C
1    F    F  F  F
2    F    F  T  F
3    F    T  F  F
4    F    T  T  T
5    T    F  F  F
6    T    F  T  T
7    T    T  F  F
8    T    T  T  T
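The difference between predicate coverage and clause coverage can be checked mechanically. The following is a minimal Python sketch, not part of the thesis tooling; the function names and the two row selections (`pc_only`, `cc_only`) are illustrative assumptions chosen to show that neither criterion implies the other for the expression (x>y or B) and C.

```python
from itertools import product

def predicate(a, b, c):
    """(x>y or B) and C, where a stands for the clause x>y."""
    return (a or b) and c

# Full truth table in the same row order as the tables above (row 1 = F F F).
ROWS = list(product([False, True], repeat=3))

def satisfies_pc(rows):
    """Predicate coverage: the predicate evaluates to both true and false."""
    return {predicate(*r) for r in rows} == {True, False}

def satisfies_cc(rows):
    """Clause coverage: every clause evaluates to both true and false."""
    return all({r[i] for r in rows} == {True, False} for i in range(3))

# Illustrative selections (assumptions, not taken from the thesis):
pc_only = [ROWS[1], ROWS[5]]  # rows 2 and 6: clause B and clause C never change
cc_only = [ROWS[1], ROWS[6]]  # rows 2 and 7: the predicate is false in both
```

With these selections, rows 2 and 6 satisfy PC but not CC, while rows 2 and 7 satisfy CC but not PC.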

We shall now look at coverage criteria where the idea is to measure that clauses have affected the outcome of the predicate. The term major is used to distinguish a clause c_i in predicate p that we are focusing on from the remaining minor clauses c_j ∈ p, j ≠ i.

Definition 2.3 Determination [4]: Given a major clause c_i in predicate p, we say that c_i determines p if the minor clauses c_j ∈ p, j ≠ i have values so that changing the truth value of c_i changes the truth value of p.

Definition 2.4 Active Clause: A clause c in predicate p is said to be active when it determines the outcome of p.

Criterion 2.4 Active Clause Coverage (ACC): For each major clause c_i in predicate p:
TR₁: evaluate to true while being active.
TR₂: evaluate to false while being active.

ACC achieves benefits close to CoC while still keeping the number of required test cases at a linear growth.

Criterion 2.5 General Active Clause Coverage (GACC): While satisfying ACC, the values chosen for the minor clauses c_j ∈ p, j ≠ i may vary.

Table 2-5: Test set adequate to GACC
     x>y  B  C  (x>y or B) and C
1    F    F  F  F
2    F    F  T  F
3    F    T  F  F
4    F    T  T  T
5    T    F  F  F
6    T    F  T  T
7    T    T  F  F
8    T    T  T  T
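Determination can be computed directly from the definition by flipping one clause and comparing predicate outcomes. The sketch below is illustrative only; `determines` and `active_rows` are hypothetical helper names, and clause index 0 stands for x>y, 1 for B and 2 for C.

```python
from itertools import product

def predicate(a, b, c):
    """(x>y or B) and C, where a stands for the clause x>y."""
    return (a or b) and c

def determines(row, i):
    """True if flipping clause i alone changes the predicate outcome."""
    flipped = list(row)
    flipped[i] = not flipped[i]
    return predicate(*row) != predicate(*flipped)

def active_rows(i):
    """All truth table rows in which clause i is active."""
    return [r for r in product([False, True], repeat=3) if determines(r, i)]
```

For this expression, x>y is active only when B is false and C is true, B is active only when x>y is false and C is true, and C is active whenever x>y or B is true.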

Criterion 2.6 Correlated Active Clause Coverage (CACC): While satisfying GACC, also satisfy PC.

Criterion 2.7 Restricted Active Clause Coverage (RACC): While satisfying ACC, also satisfy PC. The values chosen for the minor clauses c_j ∈ p, j ≠ i must be the same for both values of the major clause.

Even though RACC (sometimes called unique cause MCDC) is stricter than CACC (sometimes called masking MCDC), it has been shown [2] that their performance in detecting incorrect predicates is not that much different, which allows CACC to be applied more cost-effectively. Masking MCDC is recommended for Level-A software in FAA-DO-178C³. It has been shown [1] that the probability of detecting an error in a predicate will increase as the number of clauses grows.

2.2.2 Subsumption

Many criteria can be related to each other by subsumption.

Definition 2.5 Subsumption: A criterion A subsumes criterion B if, and only if, every test set that satisfies A also satisfies B.

For a subsuming hierarchy of the discussed logical criteria see Figure 2-4. Observe that A is not necessarily stronger than B; B can have tests that are not in A, so there is always a possibility that B may expose faults that A misses.

Figure 2-4: Subsumption Among Different Test Criteria [5]

³ FAA-DO-178C is a guideline declared by the Federal Aviation Administration for the production of airborne systems, to ensure that avionic software will perform safely.

2.2.3 Mutation analysis

Mutation analysis [25, 26] can be used as a test set metric by inserting mutants as syntactic changes into the code and checking the test set's ability to discover them. Once a mutant is discovered, it is said to have been killed. The goal is to make the test set adequate relative to the mutants by designing additional tests to kill the remaining live mutants.

Criterion 2.8 Mutation Coverage: Every mutant is killed.

Mutants could be chosen based on errors that are common among programmers. They could also be used to force the creation of valuable tests, such as FailOnZero in MuJava [10], which replaces each numerical expression by zero, or Bomb, which replaces each statement with a function call that raises a runtime exception. A drawback with mutation testing is that it often requires many test cases because of redundant mutants. Different methods to reduce the number of mutants have therefore been developed, and this is still an active research subject. There are a number of mutation analysis tools [8, 9, 10, 23]. Some take advantage of the compiler by using mutation schemata [11], which speed up test execution by compiling all mutants into a single file, eliminating the need to compile each mutant separately.

2.2.3.1 Relational Operator Replacement (ROR)

Mutation testing of relational operators in predicates is performed by replacing relational operators with every other possible relational operator per clause. Unlike logical testing, ROR mutation tests the relational operator inside a clause. Therefore, logical criteria do not subsume ROR mutation.

Kaminski et al. [4] developed a fault hierarchy for the ROR mutants and were able to prove that only three of the seven mutants are necessary to achieve ROR mutation adequacy.
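As a small illustration of ROR mutation, the Python sketch below lists the seven mutants of the clause x > y and checks which of them a set of operand pairs kills. It is a simplified model under stated assumptions: the clause is evaluated in isolation, the mutant table `MUTANTS` and the operand pairs in `TESTS` are made up for the example, and it does not reproduce the fault hierarchy of [4].

```python
import operator

ORIGINAL = operator.gt  # the clause under test: x > y

# The seven ROR mutants of ">": five other operators plus the constants.
MUTANTS = {
    "<": operator.lt,
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "true": lambda x, y: True,
    "false": lambda x, y: False,
}

# One operand pair per ordering: x < y, x == y, x > y (illustrative values).
TESTS = [(1, 2), (2, 2), (3, 2)]

def killed(mutant, tests):
    """A mutant is killed if some test makes it differ from the original."""
    return any(mutant(x, y) != ORIGINAL(x, y) for x, y in tests)
```

With one test for each ordering of the operands, every mutant in the table is killed. Dropping the equality test leaves the `>=` mutant alive, which is the kind of gap an MCDC adequate test set can have.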

2.2.3.2 Relational Operator Replacement Global (RORG)

RORG [4] is a method that can be used to complement MCDC with benefits from ROR mutation. In this approach, the algorithm checks all clauses containing a relational operator against three detection conditions (<, >, =) to see whether the three corresponding ROR mutants would be killed. At its minimum, MCDC requires each clause to evaluate to true and false while determining the outcome of the predicate. Therefore, at most one additional test for each clause is sufficient for RORG adequacy. The algorithm proposed by Kaminski et al. [4] to make an MCDC test set RORG adequate is shown in Algorithm 2-1.

Algorithm 2-1: Algorithm [4] to Make an MCDC Test Set RORG-Adequate

Require: Predicate p and a test set T that satisfies MCDC (ACC) with respect to p
Ensure: A test set that still satisfies MCDC (ACC), but is now also RORG-adequate

 1: // It does not matter which version of ACC is satisfied
 2: // (GACC, CACC, or RACC), or whether masking or
 3: // non-masking MCDC is used.
 4: for each clause c in p do
 5:    if c contains a relational operator relop then
 6:       Identify T_c, the tests for which c determines p
 7:       // Clause c determines predicate p if changing
 8:       // the value of c, while leaving all other
 9:       // clauses unchanged, changes the value of p [1].
10:       // T_c will have at least two tests and possibly more.
11:       // c will have the value true for at least one test
12:       // and false for at least one test.
13:       Assume c = c₁ relop c₂
14:       // We need three tests, c₁ < c₂, c₁ == c₂, and
15:       // c₁ > c₂, and we are assured of having at least two.
16:       for each test t_i in T_c do
17:          isCovered[‘<’] = isCovered[‘==’] = isCovered[‘>’] = false
18:          for each relop in {<, ==, >} do
19:             if c₁ relop c₂ is true for t_i then
20:                isCovered[‘relop’] = true
21:             end if (c₁ relop c₂ is true)
22:          end for (each relop)
23:       end for (each test)
24:       for each relop in {<, ==, >} do
25:          if isCovered[‘relop’] == false then
26:             Construct a new test by modifying an arbitrary test in T_c. Leave all other variables alone, but change the values for the variables in c so that c₁ relop c₂ is true
27:          end if (isCovered[] == false)
28:       end for (each relop)
29:    end if (c contains a relop)
30: end for (each clause)
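The per-clause part of the algorithm above can be sketched in Python as follows. This is a hedged illustration, not the implementation used in this thesis: tests are modelled as dictionaries, the clause is assumed to be x > y over integer variables, and the names `missing_conditions`, `augment` and `T_C` are hypothetical. The coverage flags are initialised once before the loop over T_c, so that results from earlier tests are kept.

```python
def missing_conditions(tests_c, c1, c2):
    """Return which of '<', '==', '>' no test in T_c exercises.

    tests_c: the tests for which the clause determines the predicate (T_c)
    c1, c2:  functions mapping a test to the left and right operand values
    """
    covered = {"<": False, "==": False, ">": False}
    for t in tests_c:
        left, right = c1(t), c2(t)
        if left < right:
            covered["<"] = True
        elif left == right:
            covered["=="] = True
        else:
            covered[">"] = True
    return [relop for relop, hit in covered.items() if not hit]

def augment(tests_c):
    """Build one new test per uncovered condition for the clause x > y.

    An arbitrary test from T_c (here the first) is copied, all other
    variables are left alone, and only x is changed relative to y.
    """
    new_tests = []
    missing = missing_conditions(tests_c, lambda t: t["x"], lambda t: t["y"])
    for relop in missing:
        t = dict(tests_c[0])
        t["x"] = t["y"] + {"<": -1, "==": 0, ">": 1}[relop]
        new_tests.append(t)
    return new_tests

# Hypothetical T_c for x > y in (x>y or B) and C: B is false and C is true.
T_C = [
    {"x": 3, "y": 2, "B": False, "C": True},
    {"x": 1, "y": 2, "B": False, "C": True},
]
```

For this T_c only the equality condition is uncovered, so `augment(T_C)` proposes a single extra test with x equal to y. Whether such a test is feasible and has a specified expected outcome still has to be decided from the requirements.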

2.3 Feasible Test Problem

For logic-based testing, the semantics of a program sometimes makes it impossible to come up with test cases that will meet the test requirements. Offutt and Pan [22] define the feasible test problem as: “Given a requirement for a test case, the feasible test problem is to determine if there is input data that can satisfy the requirement”. In this thesis, we adapt the feasible test problem as follows: Given a relational operator, the feasible test problem is to determine if there is input data such that the three detection conditions can be met. If there is no input data that can satisfy the test requirement, and the testing tool fails to detect this, it will lead to warnings of code fragments that cannot be covered. For this study, such warnings are called false positives. The necessity to review false positives takes time and weakens attention to those relational operators that remain to be covered. We will now discuss program semantics where the problem arises.

2.3.1 Coupled Clause Effect

A predicate with n clauses has 2^n permutations for which clauses can be true and false. Sometimes, not all of these combinations are feasible. Sometimes, clauses can affect one another.

Definition 2.6 Coupled Clause Effect (CCE): If changing clause A causes clause B to change, there is said to be a coupling effect between A and B.

As an example, Figure 2-5 shows variable Cabin_Pressure that can be Low, Normal or High. Let A be the clause Cabin_Pressure = Low and B be the clause Cabin_Pressure = Normal. Suppose we want RORG coverage for clause B. Then Cabin_Pressure must have taken each of the values (Low, Normal, High) while clause B determines the outcome of the predicate. However, changing Cabin_Pressure from Normal to Low will also cause clause A to change from False to True. Thus, for clause B only two of three detection conditions can be met.

type Pressure_Type is (Low, Normal, High);
Cabin_Pressure : Pressure_Type;
...
if Cabin_Pressure = Low or Cabin_Pressure = Normal then
...

Figure 2-5: Coupled Clause Effect

2.3.2 Boundary Value Constraint

Given a clause containing a relational operator to be measured against the three detection conditions: if either of the operands constantly represents an upper or lower boundary of a finite set, only two detection conditions can be met, since a third would require a value that is outside the set's range. We shall refer to this as the boundary value constraint (BVC). For example, Cabin_Pressure in Figure 2-5 can never make Cabin_Pressure < Low evaluate to true.
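Both effects can be reproduced by enumerating the Cabin_Pressure example exhaustively. The Python sketch below is an illustration under stated assumptions: the enumeration is modelled with `IntEnum` so that values are ordered as in the Ada type, and `feasible_conditions` is a hypothetical helper that reports which detection conditions a clause can meet while it determines the predicate.

```python
from enum import IntEnum

class Pressure(IntEnum):
    LOW = 0
    NORMAL = 1
    HIGH = 2

# Right-hand operands of the two clauses in:
#   Cabin_Pressure = Low or Cabin_Pressure = Normal
CLAUSE_OPERANDS = [Pressure.LOW, Pressure.NORMAL]

def clause_values(p):
    """Truth value of each clause for a given Cabin_Pressure."""
    return [p == k for k in CLAUSE_OPERANDS]

def determines(values, i):
    """True if flipping clause i alone changes the outcome of the 'or'."""
    flipped = list(values)
    flipped[i] = not flipped[i]
    return any(values) != any(flipped)

def feasible_conditions(i):
    """Detection conditions clause i can meet while determining the predicate."""
    found = set()
    for p in Pressure:
        if determines(clause_values(p), i):
            k = CLAUSE_OPERANDS[i]
            found.add("<" if p < k else "==" if p == k else ">")
    return found
```

For clause B (index 1) the condition `<` is lost to the coupled clause effect, because the only smaller value, Low, makes clause A true. For clause A (index 0) the condition `<` is lost to the boundary value constraint, because no value lies below Low.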

2.4 Requirement Completeness Problem

For logic-based testing to become a feasible process of verifying software, it is important that the requirements are complete, in the sense that all requirements are included. Requirement specifications are not always written in sufficient detail to be able to determine all expected outcomes. We will now give two examples where some requirement specifications might be incomplete.

2.4.1 Unexpected Minor Frame

In real-time systems, the interfaces between procedures often require that executions take place in a certain minor frame. However, to fulfill a certain test criterion, it might be necessary to execute a procedure in the wrong minor frame. If the software requirements do not specify the expected outcome when the procedure is executed in the wrong minor frame, a test case cannot be derived. We shall refer to this as unexpected minor frame (UMF).

2.4.2 Unexpected Counter Transition

Figure 2-6 shows an example where a procedure (state machine) will take a certain action once a number of events have been counted. It uses a global variable to count the number of events. Once the counter reaches five, the procedure performs an action and sets the counter back to zero (thus, it can never be greater than five). It would be possible to achieve RORG coverage for the clause error_counter >= 5 by globally altering the counter to a value greater than five from a test procedure outside the monitoring procedure. However, there must be a requirement specifying the expected behavior when the counter makes such an unexpected jump (being changed from four to six, for example), otherwise a test case cannot be derived. We shall refer to this as the unexpected counter transition (UCT).

procedure MonitorUnit is
begin
   ...
      when ErrorHasOccured =>
         if error_counter >= 5 then
            SendMessage(Failure);
            error_counter := 0;
         else
            error_counter := error_counter + 1;
         end if;
      when OK =>
         error_counter := 0;
   ...
end MonitorUnit;

Figure 2-6: Resetting Counter Example

2.5 Static Code Analysis

The purpose of a static code analyzer is to find specific constructs in the subject code and to make specific instructions once they are discovered. Static analysis is used in many areas such as compiling, control flow analysis, code coverage, software security, simulation, parallel optimization, virtualization, memory debugging, emulation, performance analysis, software profiling, memory leak detection etc.

A compiler transforms the code into several intermediate forms to make it easier to handle in different stages of a compiling process chain. The first representation is the Abstract Syntax Tree (AST). In this step, the code is represented as a tree that can be semantically analyzed. Figure 2-7 illustrates an example of a compiling process chain (GCC).

Figure 2-7: GCC’s Compiling Process Chain [12]

By utilizing middle-end intermediate formats like GIMPLE [13] or LLVM IR [14], static code analysis can be performed in a partially language-independent fashion. This intermediate form is mostly used to generalize compiler optimizations, making them front-end and back-end independent. However, many language-specific features are not stored in the generalized format. In this thesis, there are a number of language aspects that require front-end processing.

Ada Semantic Interface Specification (ASIS) is an interface between an Ada environment and analysis tools, designed to be compiler independent. It helps with the traversal of code and provides queries to retrieve syntactic and semantic information from the subject code. ASIS [15] is an ISO standard developed between 1989 and 1999 by the ASIS Working Group (ASISWG) in a voluntary collaboration between universities, aircraft manufacturers and organizations.

2.6 Instrumentation

Instrumentation is a technique that aids in the analysis of dynamic program behavior. Without changing the functional behavior, instrumentation adds monitor calls to enable information gathering of certain entities in the subject program during execution. Instrumentation affects execution time and memory arrangements. Therefore, it should not contain expensive calculations. Functional behavior consistency between instrumented and original programs is assured by checking that they have the same output.

In Table 2-6 an instrumented program p’ has been generated from program p where a logical expression, containing a relational operator, is to be measured for RORG coverage. The logical expression in p’ has been replaced with a call to a function that will return the same results as evaluations in p and measure which of the three detection conditions have been met.

Table 2-6: Measurement insertion

Original program p:

procedure Monitor_Pylon is
   type Pressure_Type is (High, Low, Undefined);
   Pylon_Pressure : array (1..6) of Pressure_Type;
begin
   for I in Pylon_Pressure'Range loop
      if Pylon_Pressure(I) = Low then
         Send_Error_Msg(I, "Pressure_Low");
      else
         Send_Error_Msg(I, "Pressure_Undefined");
      end if;
   end loop;
end Monitor_Pylon;

Instrumented program p’:

procedure Monitor_Pylon is
   type Pressure_Type is (High, Low, Undefined);
   Pylon_Pressure : array (1..6) of Pressure_Type;

   function Mark_42(I : Integer) return Boolean is
   …
   end Mark_42;
begin
   for I in Pylon_Pressure'Range loop
      if Mark_42(I) then
         Send_Error_Msg(I, "Pressure_Low");
      else
         Send_Error_Msg(I, "Pressure_Undefined");
      end if;
   end loop;
end Monitor_Pylon;

2.7 Verification

To augment the MCDC (RACC) adequate test set to satisfy RORG, requirements-based testing is performed. Normally, unevaluated logical expressions are dealt with in one of the following ways:
1. Additional requirements-based test cases are added
2. Requirements are changed or added
3. Dead or unwanted code is removed
4. Inspection is used as a complement to justify exceptions for logical expressions that cannot be evaluated, for example due to program semantics
However, due to limited time, we do not alter the legacy code or add or change requirements in this study.

3 Methods

This chapter will explain the inner workings of AdaRORG [24], our implementation of Algorithm 2-1 [4] to make an MCDC adequate test set RORG adequate.

3.1 AdaRORG

Subject code is statically analyzed using Asis. For each discovered logical expression (called conditional expression in Ada), a truth table is generated to determine when clauses are active. Clauses containing a relational operator are then instrumented by replacing each predicate with a corresponding measurement function. Some Ada aspects that affect instrumentation, such as loop variables, will need special treatment. False positives are reduced by ignoring clauses where an operand is a boundary Ada enumeration literal. During runtime, instrumented code will check that each clause containing a relational operator has covered <, > and = while determining the predicate outcome. The testing process is illustrated in Figure 3-1.

Figure 3-1: AdaRORG: From subject code to measured coverage

3.1.1 ASIS

ASIS for GNAT [16] is an open source library for Ada that follows the Asis standard and uses Gnat as precompiler. Figure 3-2 illustrates how an Asis application acquires information through Asis for Gnat without the need to understand Gnat's internal representation.

Figure 3-2: Relationship between source, Asis and application

The construction of an Asis application mainly involves the steps of initialization and pre- and post-processing of tree nodes.

During initialization, a context is set up. It identifies a set of compilation units composing the program, the so-called Ada environment. The Gnat compiler takes one or several compilation units as input, with instructions on how they should be compiled. A compilation unit can be a package declaration, a package body, a subprogram declaration or a subprogram body. The compilation unit is then decomposed into Asis Elements that can be seen as representatives for the nodes in the abstract syntax tree. The programmer provides pre- and post-processing instructions that will be invoked as each element in the tree is traversed. In this report we also refer to pre-processing as “entering” and post-processing as “leaving”. As a result of the depth-first order, the process boils down to collecting information as recognized constructs are entered and carrying out specific instructions, based on the collected information, when leaving a particular construct.

3.1.2 Predicates

The predicate structure is represented using a tree. Figure 3-3 illustrates the expression (x>y or B) and C in its tree form.

Figure 3-3: Predicate tree

For a list of the different types of tree nodes see Table 3-1.

Table 3-1: Predicate tree nodes
Logical Operators    not, and, or, xor
Parenthesis          (, )
Relational Clauses   =, !=, <, ≤, ≥, >
Regular Clauses      VARIABLE, CONSTANT, ARITHMETIC OPERATION, PREFIX CALL

3.1.3 Truth Tables

Each clause in the predicate tree is assigned a truth value with respect to a test case. The predicate is then evaluated using recursive depth-first traversal. A truth table is populated by evaluating every possible test case. Figure 3-4 illustrates the expression (x>y or B) and C where it has been assigned according to test case 1 in Table 3-2.

Figure 3-4: Assigned predicate tree
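The evaluation step can be sketched in Python as a recursive depth-first traversal over a small tree. This is an illustrative model only and does not reflect the Asis data structures used by AdaRORG: the tuple encoding of `TREE`, the clause names A, B and C (A standing for x>y) and the helper names are assumptions made for the example.

```python
from itertools import product

# (x>y or B) and C, with A standing for the relational clause x>y.
TREE = ("and", ("or", "A", "B"), "C")

def evaluate(node, assignment):
    """Evaluate a predicate tree depth-first for one test case."""
    if isinstance(node, str):          # leaf: a clause with an assigned value
        return assignment[node]
    op, *children = node
    values = [evaluate(child, assignment) for child in children]
    if op == "not":
        return not values[0]
    if op == "and":
        return values[0] and values[1]
    if op == "or":
        return values[0] or values[1]
    if op == "xor":
        return values[0] != values[1]
    raise ValueError(f"unknown operator: {op}")

def truth_table(tree, names):
    """Populate a truth table by evaluating every possible test case."""
    table = []
    for bits in product([False, True], repeat=len(names)):
        table.append((bits, evaluate(tree, dict(zip(names, bits)))))
    return table
```

Calling `truth_table(TREE, ["A", "B", "C"])` yields eight rows in the same order as the truth tables in Chapter 2.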

3.1.4 Active Clauses

From the truth table, a GACC adequate test set T is calculated by switching the boolean value for each clause c one at a time. If the outcome of the corresponding predicate p changes, c is known to determine the outcome of p. Once every clause has been tested, the set of possible active clauses containing a relational operator is stored. Table 3-2 shows a partial truth table for the expression (x>y or B) and C where clause x>y determines the predicate outcome.

Table 3-2: Partial truth table for: (x>y or B) and C
Test case  x>y  B  C  (x>y or B) and C
1          F    F  T  F

3.1.5 Instrumentation

Instrumentation is generated from the set of possible active clauses. A conditional expression is replaced with a call to a generated mark function that will make measurements and return the result of the original expression. Table 3-3 shows an example of a program containing a conditional expression that has been replaced with a call to a measurement function.

Table 3-3: Measurement insertion

Original program (example.adb):

procedure Example is
   X : Integer;
   Y : Integer;
   B : Boolean;
   C : Boolean;
begin
   if (X>Y or B) and C then
      null;
   end if;
end Example;

Instrumented program (example.ror):

procedure Example is
   X : Integer;
   Y : Integer;
   B : Boolean;
   C : Boolean;

   -- RORG_Mark_0
   -- Location : test/example.adb:10:8:
   -- Predicate : (A or B) and C
   -- Clause A : X>Y
   -- Clause B : B
   -- Clause C : C
   -- Exceptions :
   -- Clause B : 'B' does not contain a relational operator
   -- Clause C : 'C' does not contain a relational operator
   -- Source Expression : (X>Y or B) and C
   function RORG_Mark_0 return Boolean is
   …
   end RORG_Mark_0;
begin
   if RORG_Mark_0 then
      null;
   end if;
end Example;

3.1.5.1 Mark Functions

The generated measurement functions (marks) start by identifying a permutation from the truth table in which one or several clauses are active. The relational operators inside the active clauses are then measured for coverage. This is shown in Figure 3-5. A possible improvement would be to merge conditional expressions, since reducing branches leads to improved performance on processors with a pipelined instruction stream architecture. The time complexity of the instrumentation is O(2^n), where n is the number of clauses.

    function RORG_Mark_0 return Boolean is
    begin
       if not (X>Y) then
          if not (B) then
             if C then
                -- Test Case : A=False, B=False, C=True => False
                -- Active Clauses : A
                -- Clause A:
                if X=Y then
                   Rorg.Is_Covered(1, '=') := Rorg.Is_Covered(1, '=')+1;
                else
                   Rorg.Is_Covered(1, '<') := Rorg.Is_Covered(1, '<')+1;
                end if;
                return False;
             end if;
          end if;
       else
          if not (B) then
             if C then
                -- Test Case : A=True, B=False, C=True => True
                -- Active Clauses : A
                -- Clause A:
                Rorg.Is_Covered(1, '>') := Rorg.Is_Covered(1, '>')+1;
                return True;
             end if;
          end if;
       end if;
       return (X>Y or B) and C;
    end RORG_Mark_0;

Figure 3-5: Measurement function for expression: (X>Y or B) and C
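The behaviour of the measurement function above can be mimicked in a few lines of Python. This is a sketch under stated assumptions rather than a translation of the generated Ada: the counter `is_covered` stands in for `Rorg.Is_Covered`, and `rorg_covered` assumes that an operator counts as RORG covered once each of the relations <, = and > has been observed while its clause was active.

```python
from collections import Counter

# Hypothetical stand-in for the Rorg.Is_Covered counters,
# keyed by (relational operator id, observed relation).
is_covered = Counter()


def rorg_mark_0(x, y, b, c):
    """Record observations for (x > y or b) and c, then return its value."""
    # Clause x > y determines the predicate only when b is False and c is True.
    if not b and c:
        if x < y:
            is_covered[(1, "<")] += 1
        elif x == y:
            is_covered[(1, "=")] += 1
        else:
            is_covered[(1, ">")] += 1
    # The instrumented call site still receives the original result.
    return (x > y or b) and c


def rorg_covered(op_id):
    """True once <, = and > have each been observed for the operator."""
    return all(is_covered[(op_id, rel)] > 0 for rel in "<=>")
```

Calling `rorg_mark_0` with x below, equal to and above y (with b false and c true) makes `rorg_covered(1)` true; calls in which the clause is not active leave the counters untouched.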

3.2 Execution

AdaTEST95 [17] is a tool qualified for measuring MCDC according to RTCA/DO-178B [3]. It is used to perform unit and integration testing in the Stores Management Unit [7]. For this study, AdaTEST95 was used as a test framework where the test case execution was automated. The RORG adequate test set was executed both on the instrumented and on the original version of the subject code.


4 Results

This chapter summarizes subject code statistics gathered during static analysis, the RORG coverage results after executing the initial MCDC (RACC) adequate test set and test results obtained after augmenting the MCDC adequate test set to satisfy RORG.

4.1 Static Analysis

Analysis of the subject code (see Table 4-1) showed that:

● 40% of the predicates contained a relational operator
● 41% of the clauses contained a relational operator
● 20% of the relational operators were automatically filtered due to boundary value constraints (BVC), see Chapter 2.3.4.

Table 4-1: General statistics

Predicates
  total                               350
  containing relational operators     142
Clauses
  total                               578
Relational operators
  total                               233
  measured for RORG coverage          187
  filtered due to BVC                  46

The distribution of predicates and their size in terms of clauses can be found in Chapter 1.1.1. Diagram 4-1 shows the same distribution, except only with predicates containing a relational operator. For example, the subject code has five predicates where at least one of four clauses contains a relational operator.

Diagram 4-1: Predicate/clause distribution


Analysis of the subject code (see Table 4-2) showed that the relational operators were:

● 43% enumeration comparisons
● 40% integer comparisons
● 9% floating point comparisons
● 8% array comparisons

Table 4-2: Relational operators & operand types

Enumeration comparisons
  =      46
  !=     35
Integer comparisons
  =      36
  <       6
  >      18
  >=     15
Floating point comparisons
  <       8
  >       8
Array comparisons
  =       8
  !=      7

4.2 Testing

Table 4-3 summarizes achieved RORG coverage after letting the initial MCDC adequate test set exercise the instrumented subject code:

Table 4-3: Initial RORG coverage

Total RORG coverage                      54/187

enumeration comparisons covered          30/81
  '=' evaluations                        81/81
  '<' evaluations                        52/81
  '>' evaluations                        59/81
integer comparisons covered              12/75
  '=' evaluations                        75/75
  '<' evaluations                        46/75
  '>' evaluations                        53/75
floating point comparisons covered       12/16
  '=' evaluations                        12/16
  '<' evaluations                        16/16
  '>' evaluations                        16/16
array comparisons covered                 0/15
  '=' evaluations                        15/15
  '<' evaluations                         2/15
  '>' evaluations                        13/15


Table 4-4 presents the results after manually inspecting the achieved RORG coverage, requirement specification and program semantics. The verifiability of the relational operators not covered was manually categorized into:

Verifiable: Additional requirements-based testing is possible.
Incomplete: Requirement specification is not written in sufficient detail to be able to determine the expected outcome.
Infeasible: Program semantics make it impossible to come up with test cases that meet the test requirements.

Table 4-4: Inspection of verifiability

Verifiable
  total                                                        37
  enumeration comparisons                                      31
  integer comparisons                                           2
  floating point comparisons                                    4
Incomplete
  total                                                        55
  unexpected minor frame (UMF) integer comparisons             24
  unexpected counter transition (UCT) integer comparisons      31
Infeasible
  total                                                        41
  coupled clause effect (CCE) enumeration comparisons          20
  boundary value constraint (BVC) integer comparisons           6
  array comparisons                                            15

For the verifiable relational operators, requirements-based testing was performed to satisfy RORG. Test results after the RORG adequate test set exercised the subject code are summarized in Table 4-5.

Table 4-5: Test report

Additional tests
  passed     37
  failed      0
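The percentages quoted in the discussion can be re-derived from the counts in this chapter. The short Python check below does only that arithmetic; the variable names are invented for the example. The last two ratios are one possible reading of how the 24% and 37% figures in the abstract were obtained, which is an assumption and is not stated in the thesis.

```python
# Counts copied from the statistics, coverage and verifiability tables above.
TOTAL_RELOPS = 233   # all relational operators in the subject code
FILTERED_BVC = 46    # filtered automatically (boundary value constraint)
MEASURED = TOTAL_RELOPS - FILTERED_BVC

COVERED = {"enumeration": 30, "integer": 12, "floating point": 12, "array": 0}
VERIFIABLE = {"enumeration": 31, "integer": 2, "floating point": 4}
INCOMPLETE = {"UMF": 24, "UCT": 31}
INFEASIBLE = {"CCE": 20, "BVC": 6, "array": 15}

UNCOVERED = MEASURED - sum(COVERED.values())


def percent(part, whole):
    """Share of whole, rounded to the nearest integer percent."""
    return round(100 * part / whole)


summary = {
    "infeasible, of uncovered": percent(sum(INFEASIBLE.values()), UNCOVERED),
    "incomplete, of uncovered": percent(sum(INCOMPLETE.values()), UNCOVERED),
    "added tests on enumerations": percent(
        VERIFIABLE["enumeration"], sum(VERIFIABLE.values())
    ),
    # Assumed derivations of the abstract's figures (not stated in the thesis):
    "incomplete, of all operators": percent(sum(INCOMPLETE.values()), TOTAL_RELOPS),
    "infeasible or filtered, of all operators": percent(
        sum(INFEASIBLE.values()) + FILTERED_BVC, TOTAL_RELOPS
    ),
}
```

With these counts, 133 operators remain uncovered after the initial test set, and the three categories (37 + 55 + 41) account for all of them. The summary values come out as 31, 41, 84, 24 and 37 percent respectively.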


5 Discussion

In this chapter the test results will be discussed and future improvements will be suggested.

5.1 Fault Probability

The subject code contained few predicates that had both a relational operator and more than three clauses (see Results, Diagram 4-1), which lowered the probability of finding a fault. However, our initial studies of the Stores Management Unit software showed that it is still one of the most complex in terms of the number of clauses.

5.2 Feasible Test Problem

31% of the relational operators not covered by the initial MCDC adequate test set were infeasible to test due to program semantics. This provides an opportunity for future improvement of AdaRORG so that unsatisfiable test case requirements are automatically filtered.

5.3 Requirement Completeness Problem

In our study, the requirement specification was incomplete regarding how to handle unexpected counter transitions and unexpected minor frame executions. 41% of the relational operators not covered by the MCDC adequate test set would need additional software requirements to be verifiable.

5.4 Enumeration literal ordering independence

In an enumeration type declaration, literal values (identifiers) are related to one another through ordering that is determined by the sequence they are declared in:

The relationship between enumeration identifiers can be tested using the equality operators = (equals) and != (not equals) and the ordering operators < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal). When expressing enumeration comparisons, one may decide whether or not to make use of ordering relationships. To compare the two alternatives, consider a safety-critical system with an air pressure monitoring function that shall give a warning if the pressure starts to deviate from the normal level. A software developer now decides between two alternative implementations (see Table 5-1). In both cases, the monitoring function uses Pressure_Type (see above), which has been declared in a separate package specification to be used by several components in the system.


Table 5-1: Air pressure monitor

Variant A: Ordering dependent comparison

    Cabin_Pressure : Pressure_Type := Undefined;
    ...
    if Cabin_Pressure < Normal then
       Send_Warning("Incorrect pressure level");
    end if;

Variant B: Ordering independent comparison

    Cabin_Pressure : Pressure_Type := Undefined;
    ...
    if Cabin_Pressure != Normal then
       Send_Warning("Incorrect pressure level");
    end if;

A software tester has written two tests to verify that the function behaves as expected, one where the pressure is normal and one where it is low. The test set happens to satisfy MCDC. Now suppose that a few years later, another team of software developers shall integrate additional functionality. While doing so, a developer decides it would make more sense to put identifier 'High' after 'Normal' and decides to refactor the Pressure_Type type declaration:

The developer inspects the other code parts utilizing Pressure_Type and re-executes the available test suite. No faults appear to have been introduced. Therefore, the other parts of the code remain unchanged. After this, an independent software tester writes an additional test case to verify that the air pressure monitor signals if the cabin pressure is high. The test set now satisfies RORG. Depending on which of the two variants in Table 5-1 was chosen, there are two possible outcomes:

Variant A: The test case exposes a fault because the ordering dependent operator < (less than) is now incorrect: no warning is signaled even though the pressure is high.

Variant B: The test case demonstrates that the monitoring function still behaves as expected, as the != (not equals) operator is ordering independent and thus not affected by the new sequential ordering.

In Table 4-2, one can notice that enumeration identifiers in the subject code were only compared using equality operators. According to a software quality assurance representative, it had been a design principle not to be dependent on the sequential ordering relationship between enumeration identifiers, which is why none of the ordering operators were necessary when comparing them. For this study, this means that for enumeration comparisons, if the design principle of enumeration literal ordering independence has been correctly followed, MCDC is sufficient to verify that the relational operators are correct, since they can only be one of the equality operators = (equals) or != (not equals). However, 84% of the additional tests added in this study covered enumeration comparisons.
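The scenario can be replayed in a small Python sketch using `IntEnum`, whose members compare by declaration value much like Ada enumeration literals compare by position. The two literal orders below are assumptions chosen to fit the narrative (all non-normal levels before Normal at first, High moved after Normal later); the thesis's actual Pressure_Type declarations are not reproduced here, and the function names are hypothetical.

```python
from enum import IntEnum


class PressureBefore(IntEnum):
    """Assumed original literal order."""
    UNDEFINED = 0
    LOW = 1
    HIGH = 2
    NORMAL = 3


class PressureAfter(IntEnum):
    """Assumed order after the refactoring: HIGH placed after NORMAL."""
    UNDEFINED = 0
    LOW = 1
    NORMAL = 2
    HIGH = 3


def warn_variant_a(pressure, enum_type):
    """Ordering dependent: warn when the level sorts before NORMAL."""
    return pressure < enum_type.NORMAL


def warn_variant_b(pressure, enum_type):
    """Ordering independent: warn when the level is anything but NORMAL."""
    return pressure != enum_type.NORMAL
```

Under both orders, the two original tests (normal and low pressure) give the same verdicts for both variants, so re-running them reveals nothing. Only the added high-pressure test separates the variants: after the reordering, `warn_variant_a(PressureAfter.HIGH, PressureAfter)` no longer warns, while variant B still does.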


5.5 Improvement suggestions

There are possible improvements that can be made to reduce false positives and allow more Ada constructs to be measured using AdaRORG:

Fix 1 False positives: (see Infeasible in Table 4-4) Coupled clauses (CCE) sometimes make it impossible to come up with test cases that satisfy the test criterion. A future improvement could be to use a theorem prover or a model checker to reduce the number of impossible combinations that are generated in the truth table.

False positives due to the boundary value constraint (BVC) can also be reduced by ignoring clauses where an operand is an attribute reference to 'First or 'Last or a boolean constant.
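To illustrate what a coupled clause effect looks like, the following Python sketch enumerates a small domain by brute force instead of using a theorem prover or model checker, which is a deliberate simplification. The enumeration values and the two clauses are invented for the example and do not come from the subject code.

```python
from itertools import product

# Hypothetical enumeration domain and two clauses coupled through one variable.
STATES = ["Idle", "Armed", "Released"]
CLAUSES = {
    "State = Armed": lambda s: s == "Armed",
    "State = Released": lambda s: s == "Released",
}


def feasible_rows():
    """Truth-value combinations that at least one concrete State can produce."""
    return {tuple(check(s) for check in CLAUSES.values()) for s in STATES}


def infeasible_rows():
    """Truth table rows that no value of State can ever realize."""
    all_rows = set(product([False, True], repeat=len(CLAUSES)))
    return all_rows - feasible_rows()
```

Here `infeasible_rows()` contains only the row in which both clauses are true, since State cannot equal two literals at once. Removing such rows before generating test requirements is the kind of filtering the suggestion above aims at.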

Fix 2 Assignments: The subject code contained variables that were assigned a conditional expression. For complete RORG coverage, these expressions shall also be measured. This can be measured similarly to the conditional expressions used by control structures.

Fix 3 Declarative statements: This is similar to “Assignments” above, except if the assignment is part of an initialization (inside a declarative part), the mark function needs to be declared before the initialization takes place.

Fix 4 Function calls: If a called function, inside a clause or operand, has side effects, the instrumented program might execute differently from the original program as the function might get called multiple times due to instrumentation checks. In a future improvement, this could be fixed by using temporary variables. However, there were no function calls of this type in the subject code.
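The problem and the temporary-variable remedy can be shown with a small Python sketch. The `Sensor` class and both mark functions are hypothetical; they only demonstrate why re-evaluating an operand inside the instrumentation can change program behaviour.

```python
class Sensor:
    """Hypothetical operand whose evaluation has a side effect."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        # Each call consumes a value and bumps the read counter.
        self.reads += 1
        return self._values.pop(0)


def mark_naive(sensor, limit, observed):
    """Instrumentation check and original expression each call read()."""
    if sensor.read() == limit:
        observed.add("=")
    return sensor.read() > limit


def mark_with_temporary(sensor, limit, observed):
    """Evaluate the operand once and reuse the temporary everywhere."""
    value = sensor.read()
    if value == limit:
        observed.add("=")
    elif value < limit:
        observed.add("<")
    else:
        observed.add(">")
    return value > limit
```

With a sensor that yields 3 and then 9 and a limit of 3, the uninstrumented comparison would read 3 and return false. `mark_naive` reads twice and returns true, whereas `mark_with_temporary` reads once and returns false, matching the original program.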

Fix 5 Short-circuit evaluation: The subject code contained no short-circuit operators, thus an implementation would not benefit the study. Instrumentation would have to be implemented differently, and values would have to be checked in a different order, to measure RORG coverage without causing any unexpected side effects.

Fix 6 Statement declaration blocks: The subject code contained no statement declaration blocks, thus an implementation would not benefit the study. To be able to generate instrumentation for this, RORG mark declaration information would need to be stacked instead of being written right away.

Fix 7 Row 26 in algorithm 1 [4]: (see Appendix) suggests to “construct a new test by modifying an arbitrary test” for relational operators that have not been covered. Even though ASIS is suitable to


6 Conclusions

From the results of this study, we argue that if an enumeration comparison has been carefully designed not to make use of the sequential ordering relationship between enumeration identifiers, extending an MCDC adequate test set to be RORG adequate is unlikely to expose faults in that comparison.

RORG can be used to further improve requirement specifications, in our case regarding unexpected input data. 41% of the relational operators not covered by the MCDC adequate test set would need additional requirement clarifications specifying how to handle an unexpected counter transition or what to do when a task is being executed inside the wrong minor frame.

Extending an already MCDC adequate test set is relatively straightforward, since it is a matter of modifying an existing test set.

7 Recommendations

We suggest an additional study in which there are more integer and floating point comparisons to be verified. MCDC has been shown [1] to have an increased likelihood of exposing faults as the number of clauses in a predicate grows larger. A future study could develop a metric for a recommended number of clauses and relational operators to increase the likelihood of fault exposure relative to the number of tests.

For this study, Ada was the chosen language. It would also be possible to measure RORG coverage in other languages such as C/C++ or Java. A possible starting point would be to modify one of the many existing open-source static analysis projects [18, 19, 20, 21].


Algorithm 2-1: Algorithm [4] to Make an MCDC Test Set RORG-Adequate

Require: Predicate p and a test set T that satisfies MCDC (ACC) with respect to p

Ensure: A test set that still satisfies MCDC (ACC), but is now also RORG-adequate

1: // It does not matter which version of ACC is satisfied
