Improving the Development of Safety Critical Software : Automated Test Case Generation for MC/DC Coverage using Incremental SAT-Based Model Checking


Linköping University | Department of Computer and Information Science
Master's thesis, 30 ECTS | Computer Science
2019 | LIU-IDA/LITH-EX-A--19/082--SE

Improving the Development of Safety Critical Software

Automated Test Case Generation for MC/DC Coverage using Incremental SAT-Based Model Checking

Oscar Holm

Supervisor: Zebo Peng
Examiner: Soheil Samii

Linköpings universitet, SE–581 83 Linköping


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The importance and requirements of certifying safety critical software are today more apparent than ever. This study focuses on the standards and practices used within the avionics, automotive and medical domains when it comes to safety critical software. We identify critical problems and trends when certifying safety critical software and propose a proof-of-concept using static analysis, model checking and incremental SAT solving as a contribution towards solving the identified problems. We present quantitative execution time and code coverage results for our proposed solution. The proposed solution is developed under the assumptions of safety critical software standards and compared to other studies proposing similar methods. Lastly, we summarize the issues and advantages of our proof-of-concept from the perspective of the software developer community.


Acknowledgments

We thank Soheil Samii and Zebo Peng for supervising and giving valuable input during the progress of the study. We are also grateful to Åsa Detterfelt for her role as external supervisor at MindRoad. We want to thank Verifysoft for allowing us to use their software testing tool Testwell CTC++ and Joakim Brannström for creating Dextool. Lastly, we thank Erik Lidström for providing interesting knowledge and literature recommendations for the study.


Contents

Abstract
Acknowledgments
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
2 Theory
2.1 Safety Critical System Standards
2.2 The V-model
2.3 Software Testing Techniques
2.4 Code Coverage
2.5 Automated Testing Methods
2.6 Compiler Structure
3 Method
3.1 Testbench Setup
3.2 Measuring Results
3.3 Test Object
3.4 Incremental SAT solving with Z3
3.5 Plugin Implementation
3.6 AST Nodes of Interest
3.7 Implementation Details
4 Results
4.1 Code Coverage
4.2 Execution Time
4.3 Test Object Errors
5 Discussion
5.1 Results
5.2 Method
5.3 Future Prospects
6 Conclusion
6.1 Conclusion
6.2 Future Work


List of Figures

2.1 V-model
2.2 Simple if-statement which writes to a file
2.3 Test cases for program in figure 2.2
2.4 First expression allows unique-cause approach, second and third expressions do not
2.5 Test generation framework using model checker
2.6 The 5 compiler phases
2.7 Input to compiler
2.8 Parse tree of input given in figure 2.7
2.9 The abstract syntax tree has the structure of the parse tree but unnecessary nodes are removed
2.10 Flowchart for proof-of-concept for automated test generation
3.1 Example program
3.2 MC/DC test cases for example program
3.3 Test object used for study
3.4 Solutions found for example with algorithm 1
3.5 Classification of AST nodes
3.6 Code example for an AST Decision Block
3.7 Code example for an AST Decision D
3.8 Decision logic for a node being transformed
3.9 Code example of variable versioning
3.10 Input implementation file to plugin
3.11 Input header file to plugin
3.12 Generated implementation file
3.13 Generated header file
4.1 Coverage results from proof-of-concept test generation
4.2 Average amount of statements, functions, conditions and decisions in each file
4.3 Average execution time using a monotonic system clock
4.4 Average execution time using a monotonic system clock

1 Introduction

In this chapter we introduce the scope and aim of the work.

1.1 Motivation

In today's world, software is more interwoven than ever with safety critical systems. The automotive, medical and avionic industries all rely on safety critical systems fulfilling specific requirements. A malfunction in this type of system can cause harm to humans and can be very expensive for the ones responsible for the system. An example of this is from the 80's, when the medical division of Atomic Energy of Canada Limited (AECL) was manufacturing a radiation therapy machine called Therac-25 (https://en.wikipedia.org/wiki/Therac-25). A revolutionary design at the time allowed Therac-25 to have two operation modes - improving logistics and maintenance. It was also improved from its previous iteration in that it used software instead of hardware controls for switching operating modes, which - in theory - reduced complexity and manufacturing cost. Therac-25 was state-of-the-art; it was also involved in at least six accidents causing serious injury or death to patients. A race-condition bug in the codebase of Therac-25 occasionally caused a much higher dose of radiation to be delivered than intended. The race-condition bug had been present in the Therac-20 codebase as well. It was not discovered there because of the hardware safety features - which Therac-25 replaced with software. It was later discovered that Therac-25 had no formal software specification or testing plan. The lesson learned from this is that it is not enough to do functional testing on safety critical software; there must be clear requirements that we can prove are fulfilled.

Safety Critical System

The definition for a safety critical system is that a failure or malfunction may lead to one or more of the following outcomes [1, 2, 3, 4]:

• Death or critical injury to humans

• Severe damage and/or loss to property


• Environmental harm

Safety and Availability

Bowen et al. present ethical considerations that are important when developing safety critical systems, one of the most noteworthy being: "The development of a safety-critical system should aim to avoid the loss of human life or serious injury by reducing the risks involved to an acceptable level." [5]. According to Bowen et al., this is an overriding factor - the system should aim to be safe even if that negatively affects availability. A similar philosophy towards safety critical systems exists in many industries today. For example, the avionics industry has the DO-178C software design guidelines covering risk assessment and availability requirements [6].

Quality, Requirements and Cost

Weinberg said, "Quality is value to some person." [7]. In a safety critical system, the customers using the software put value on safety. The project manager(s) put value on getting the application finished in time and without too high expenses. Where persons put the value is what forms the requirements for safety critical software. These requirements need to be fulfilled when developing the software (to assure quality), and testing is used to do so. Because of the growing complexity of software, it is impractical and error-prone to evaluate the safety and quality of software with manual testing [8]. Hence, there are advantages to gain from using automated testing. According to Anand et al., finding defects early in development lessens the cost to remove them, thus the earlier these can be found the better [9]. The cost of finding and fixing software bugs after delivery can be up to 100 times higher compared to doing so during the development phase [8].

For these ethical and economic reasons, it is of general interest for the software developer community to explore the available methods for testing safety critical software during development.

1.2 Aim

There is a need to improve the development of safety critical software to ensure software quality and to reduce the time and cost of development. One way to improve development is with automated testing. There is a need to ensure that safety critical software does not malfunction more frequently than the requirements tolerate. We aim to contribute to the testing of safety critical software with our proof-of-concept for generating test cases from C++ source code. The generated test cases aim to fulfill the requirements of the code coverage metric modified condition/decision coverage (MC/DC).

1.3 Research questions

• How does the proof-of-concept method compare to other methods using model checking to automatically generate MC/DC test cases:

  - Execution time
  - Scalability
  - Size of generated test suite
  - Coverage


1.4 Delimitations

This paper focuses on how safety critical software is developed within the automotive, medical and avionics fields, and on what can be done to improve the testing and verification of this software. We limit our focus to the requirements, verification and software development life cycle of safety critical software. Our proof-of-concept method shows a way to improve testing and verification with automated test case generation. The method is limited to C++ code and uses Dextool for generating test cases. We also limit the test objects such that they must follow specific coding standards for safety critical systems.

2 Theory

This chapter goes through necessary theory behind safety critical systems, testing and automated testing.

2.1 Safety Critical System Standards

In this section we go through standards for safety critical systems in the aviation, automotive and medical industries that are relevant to our work.

AC 25.1309-1A

The Federal Aviation Administration (FAA) has provided the advisory circular AC 25.1309-1A with guidelines regarding failure rates of safety critical systems in the avionics domain.

Failure Conditions

AC 25.1309-1A states that failure conditions are:

• Probable when a critical function malfunctions more than once per 10^5 hours of operation [10]

• Improbable when a critical function malfunctions once per 10^5 to 10^9 hours of operation [10]

• Extremely improbable when a critical function malfunctions once per 10^9 or more hours of operation [10]

Structural Coverage Requirements

AC 25.1309-1A advises that safety critical software be developed according to the DO-178C standard [6]. One necessary condition for DO-178C certification at the highest criticality level is that the software has full MC/DC coverage [6].


Design Assurance Level (DAL)

DO-178C [6] defines the concept of Design Assurance Level (DAL). There are 5 different levels in DAL (A, B, C, D and E) and each level has different requirements. For example, level A is the most critical level and requires full MC/DC coverage as mentioned above. Safety critical software needs to be certified for different DAL levels depending on how critical the software is considered to be. The DAL level that the software needs to be certified for is determined by a rigorous process in which, for example, probabilistic risk assessment is done [6].

ISO 26262

In the automotive industry, it is a requirement that safety critical systems are developed according to the ISO 26262 standard (a derivative of IEC 61508) [11]. Automotive Open System Architecture (AUTOSAR) provides guidelines on how to satisfy the ISO 26262 requirements when using C++ [12]. A tool confidence level needs to be determined when developing safety critical software with C++; MC/DC coverage is a proposed example of increasing tool error detection (which implicitly means a lower tool confidence level). A lower tool confidence level is better [12].

Automotive Safety Integrity Level (ASIL)

The automotive industry standard according to ISO 26262 for certifying safety critical software is called Automotive Safety Integrity Level (ASIL) [13]. ASIL has 4 different levels: ASIL A, ASIL B, ASIL C and ASIL D. Structural coverage metrics in ASIL are derived from ISO 26262 and among those metrics is, for example, MC/DC coverage [14]. While not being the same standard, there are many similarities between ASIL and DAL. For this paper, the most notable similarity is the structural coverage requirements.

ISO 14971

There is a standard for certifying safety critical medical devices called ISO 14971 [15, 16]. The standard defines that the probability p of the frequency with which persons are harmed can be estimated on 5 scales [16]:

• Frequent when: p >= 10^-3
• Probable when: 10^-4 <= p < 10^-3
• Occasional when: 10^-5 <= p < 10^-4
• Remote when: 10^-6 <= p < 10^-5
• Improbable when: p < 10^-6

Safety Levels in ISO 14971

Safety critical software is classified into levels in ISO 14971 depending on the consequences of a fault. Flood et al. [16] described the three different levels as:

1. Level A: Significant, death or loss of function or structure [16]
2. Level B: Moderate, reversible or minor injury [16]


C++ Coding Standards for Safety Critical Systems

We limit the testing to objects that follow C++ safety critical system standards. Some examples of such standards are:

• Rule M7-3-1: "The global namespace shall only contain main, namespace declarations and extern "C" declarations" [12]. AV Rule 98 in JSF++ also specifies that "Every nonlocal name, except main(), should be placed in some namespace." [17].

• "Namespaces will not be nested more than two levels deep." [17]

• Forbidden to use malloc and free functions [12]

• Forbidden to use try blocks [12]

• No use of multiple inheritance [12]

• Only 4 allowed pre-processor directives [17]:
  1. #ifndef
  2. #define
  3. #endif
  4. #include

• Header files (*.h) shall not contain function definitions. Typically, they should contain interface declarations and not details about implementation. The exception is inline functions (AV Rule 39) [17]

• The statement body of if, else if, else and loops should always be enclosed with braces, even if empty (AV Rule 59) [17]

For the full details of the standards see JSF [17] and AUTOSAR [12].

2.2 The V-model

A traditional way to generalize software development is the V-model. It maps the waterfall development model onto different testing levels. The concept of what later came to be called the waterfall model was proposed in 1970 by Royce and describes an approach to managing the development of software systems [18]. Royce points out that, in his experience, the proposed model needs iterative relationships between the different phases to be successful. Kramer et al. say that the biggest concern with the waterfall model is that it does not handle change well in terms of cost-efficiency [19]. Kramer agrees with Royce that iterative relationships between phases are needed to mitigate the above issue. Weinberg discussed the same issue, 20 years earlier, in "Quality Software Management: Systems Thinking", saying that "Sequential methodologies are essentially linear processes, supplemented by implicit feedback" [7]. What Weinberg meant was that corrections to the mistakes that could happen during the development process were the implicit feedback. According to Weinberg, sequential methods such as the waterfall and modified waterfall models often make projects fail to explore the surrounding territory (or if they do, they fail to finish in time). We note from this that while it is good to stick with what works, it can also be important to question the methods for developing software. Even though the waterfall model is not always preferred over modern approaches such as agile, spiral and iterative models (it still holds up for smaller projects [19]), the same techniques and tools are still used. The model is simple to understand, implement and execute. For that reason, the waterfall model is still used or considered when developing software. This takes us back to

Figure 2.1: The V-model illustrating a software test-driven life cycle. The development phases (requirements, system specification, architectural design, detailed design and implementation) are on the left and the corresponding testing phases (unit, integration, system and acceptance testing) are on the right. (All figures are created by the author of this paper.)

the V-model, which originates from the waterfall model. The goal of the V-model is to make development of software test-driven, at several levels of integration.

Figure 2.1 is a graphical illustration of the V-model. The left side contains the phases of the waterfall development model and the right side shows the corresponding testing phases. The V-model is used for the development of safety critical software in, for example, the medical device industry [20, 21]. According to Hanssen et al., regarding the development of safety critical software: "Normally some variant of the V-model is used to guide design, implementation, testing and validation." [22].

The Three Testing Levels

The V-model showed us three different test levels: unit, integration and system/acceptance. These are typical for test case design [23].

Unit Testing

The definition of a unit is the smallest testable part. This can for example be classes and functions. Full knowledge of the internal code is needed, which means that developers perform this kind of testing. It is relatively simple and cost effective to do unit testing, but it is difficult to catch all bugs in the application and to write good test cases [24].

Integration Testing

The units are merged together into modules. Full knowledge of the internal code is not needed, but knowledge about the modules' functionality is needed. In safety critical software, this knowledge should come from the requirements. Even if the units are functioning properly when unit tested, they might fail when merged and tested as a module.

System Testing

The specified requirements for the system are completed and the system is tested as a whole to see if it fulfills those requirements. At this point, no knowledge about the internal code is


needed. This is done to test the whole application. The tests can be targeted at, for example, the functionality or security of the application.

Acceptance Testing

The threshold for when the software is finished for delivery to the customer. The amount of testing done here is an agreement between the ones creating the application and the customer. System testing often borders on acceptance testing, thus it is common to see acceptance testing as a subset of system testing.

2.3

Software Testing Techniques

The levels of testing belong to a broader set of software testing techniques. The three most important are white, black and grey box testing [25]. The basic idea is to see the software under test (SUT) as a "box". The "box" is tested differently depending on whether we can see inside the "box", have an idea of what is inside it, or do not know at all.

White Box Testing

The testing is based on knowledge about the internal code of software to be tested. We can see inside the "box". This can for example be unit testing or structural testing.

Black Box Testing

The testing is based on requirements and specifications. No knowledge about the internal code is needed. We cannot see inside the "box". System testing implicitly goes together with black box testing and can for example be random testing. Random testing is when input is random and independent and output is checked against requirements and specifications.

Gray Box Testing

This falls somewhere in between black and white-box testing. Partial knowledge about the internal code is needed but we do not need to fully understand it. We have an idea of what is inside the "box". Integration testing implicitly goes together with gray box testing.

2.4 Code Coverage

There are several types of code coverage to measure how much of the application was exercised; we focus on the common ones and those relevant for safety critical software.

Decision and Statement Coverage

A very simple way to measure how well code has been tested is statement coverage. Consider the following program in figure 2.2:

def writeFile(file, x):
    if x <= 3:
        file.write(x)

Figure 2.2: Simple if-statement which writes to a file

We have the following test cases (figure 2.3) which execute the program with different assignments to the variable x:


x    x <= 3    file.write(x)
3    True      Yes
5    False     No

Figure 2.3: Test cases for program in figure 2.2

To achieve full statement coverage each statement in the program must be tested. Statement coverage is 50% if using only the second test case and 100% if using only the first test case - because only the first test case reaches the write statement. For decision coverage to be 100%, all decisions must be made in the program. This means both test cases are needed, because the two possible decisions are that the if-conditional is false or that it is true. Full decision coverage implies full statement coverage, but not the other way around. Decision coverage is also known as branch coverage.

Modified Condition/Decision Coverage

To understand Modified Condition/Decision Coverage (MC/DC), we break it down into:

• Condition: A Boolean expression that cannot be simplified (a leaf node in an expression)

• Decision: Controls program flow, for example a condition evaluating true or false is a decision.

• Condition Coverage (CC): In every decision, each condition takes on all possible outcomes.

• Condition/Decision Coverage (CDC): Every decision takes on all possible outcomes and has full condition coverage.

• Multiple Condition Coverage (MCC): Every decision takes on all possible combinations of input. This equals exhaustive testing.

MCC is the ideal testing coverage, but it is not efficient to test every possible combination of input. For that reason, MC/DC can be used instead since it relaxes the requirements of MCC. MC/DC is thus in the middle between CDC and MCC in terms of requirements. For MC/DC it must be proved that each condition independently affects the decision outcome (MC/DC implies CDC) [26]. As seen in figure 2.3, the first test case makes the if-conditional true, leading to the write decision being executed. To establish the independence of the first test case, the second test case shows that by changing the assignment of x (and the if-conditional) we also change the decision outcome (no write is done). For this example, the two test cases achieve full MC/DC coverage. MCC coverage would require us to create test cases for all possible assignments of x, which means 2^N test cases for N conditions. MC/DC has linear growth in test cases and requires N+1 test cases when it is possible to overlap test pairs; in the worst-case scenario 2N test cases are needed.

Expression Parse Tree

Offutt et al. created a way to generate test data from state-based specifications where one step is to achieve full predicate coverage [27]. They define a predicate as an expression with clauses and zero or more logical operators. The requirements for predicate coverage are much like MC/DC, requiring it to be shown that each clause in the predicate independently affects the predicate value [27]. The predicate is then represented as an expression parse tree (a kind of binary tree). In this expression tree, every node is the same as a test case in MC/DC. We take great interest in this modelling of the problem, as we will be traversing a parse tree which contains a similar representation of expressions in our plugin, in which we


generate MC/DC test cases. According to Bloem et al., who proposed a method to generate test suites for MC/DC, systems modelled as described by Offutt et al. (expression parse trees) could be applied to their method as well [27, 28].

Independence Pairs

When a variable A in a truth vector evaluates an expression to true and the corresponding truth vector where only A is changed makes the expression evaluate to false, the two vectors are called an independence pair [29]. Since only the variable A can change in the independence pair, it can be used to prove that changing A independently affects the decision.
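To make the definition concrete, a minimal Python sketch (illustrative, not part of the plugin) that checks whether two truth vectors form an independence pair for a given condition could look as follows:

# Illustration only: check whether two truth vectors form an independence pair.
def is_independence_pair(expr, vec_a, vec_b, condition):
    # expr is a function taking a dict of condition values and returning a bool.
    differs_only_here = all((vec_a[c] != vec_b[c]) == (c == condition) for c in vec_a)
    return differs_only_here and expr(vec_a) != expr(vec_b)

decision = lambda v: (v["A"] or v["B"]) and v["C"]
pair = ({"A": False, "B": True, "C": True},
        {"A": False, "B": True, "C": False})
print(is_independence_pair(decision, pair[0], pair[1], "C"))   # True: C affects the outcome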

Strongly Coupled Conditions

Conditions where a change in a condition always changes the outcome of other conditions are called strongly coupled [30]. An example of this is the second expression in figure 2.4, where B occurs in both (A ∧ B) and (C ∧ B). Changing the condition B will always change the outcome of the conditions (A ∧ B) and (C ∧ B). If this relation is true for one or more assignments of B but not all assignments, the conditions are weakly coupled.

Unique-Cause MC/DC

Unique-cause is one of the common approaches to prove that each condition independently affects a decision. This is done for an expression by allowing only one condition to change and letting the rest be fixed [26]. If changing the condition now affects the outcome, it has been proven that it independently affected the decision outcome. This proof is stronger than masking - it does not matter whether conditions are masked or not when using this approach. This does not apply to all cases though. When conditions are repeated or strongly coupled the unique cause approach does not work [26]. An example of unique cause MC/DC is the first expression in figure 2.4. We see that if we change the assignment of any of the conditions A, B and C, the decision outcome is different for the expression.

Expression            Unique Cause   Masking
(A ∧ B) ∨ C           Yes            Yes
(A ∧ B) ∨ (C ∧ B)     No             Yes
A ∨ B                 No             No

Figure 2.4: First expression allows unique-cause approach, second and third expressions do not

Masking MC/DC

In expressions with several logical operators, one condition may mask other conditions [26]. For example, the expression A ∧ (B ∨ C) contains two logical operators, and when the value of A is false it will always make the decision outcome false regardless of the values that B and C take. Masking MC/DC allows more than one condition to change when proving that a condition independently affects the decision outcome. If unique cause MC/DC is satisfied it implies that masking MC/DC is also satisfied, since the independence pairs in masking MC/DC are a superset of the unique cause independence pairs [31]. In general, there are more sets of test cases that satisfy masking MC/DC. In figure 2.4, unique-cause can be used on the first expression because it has no repeated conditions. In the second expression unique-cause may not be used, since B occurs in different logical operators which are coupled with another logical operator. Both expressions have masked conditions because of logical operators. The third expression has no masking and unique-cause can be used.


Function Coverage

The measure of how many of the functions in the software have been executed at least once is known as Function Coverage (FC). The FC metric is recommended as a structural coverage metric during integration testing by the ISO 26262 standard [14]. To give a concrete example, say we have the functions f0, f1, f2, f3. If we have a test that calls f0 and f1, FC is 50%. To achieve 100% FC, all the functions (f0, f1, f2, f3) must be called in our test.

2.5 Automated Testing Methods

This section goes through some methods that can be used for automating the testing of safety critical systems on different levels.

Model Checkers

Early research on generating test cases for MC/DC coverage proposed the use of model checkers; Rayadurgam et al. proposed one such test generation framework [32]. Software artefacts are mapped to a finite state system and MC/DC is the test criterion to be satisfied. The finite state system is a formal system model and, for example, Requirements State Machine Language (RSML) can be used for mapping artefacts. Especially important to us is that Rayadurgam et al. say that "one could map implementations in languages like Java to this model using techniques that extract abstract models from program source code" [32]. The Clang compiler can break down program source code (C++) and allow us to view the resulting abstract syntax tree, described under the Dextool subsection in section 2.6. This would correspond to the mapping of artefacts in the proposed testing framework. Figure 2.5 illustrates the test generation framework.

Figure 2.5: Test generation framework using a model checker. RSML specifications or source code are mapped/extracted into a formal system model, the test criteria (MC/DC) are translated into LTL properties, and the model checker produces counter-examples that serve as test cases.

The MC/DC criterion can be expressed as a property that is true during a state transition using linear time temporal logic (LTL). The model checker is challenged to find a path to a state where the test criterion is satisfied by negating the property. Once such a state is found, the model checker then finds counter-examples showing that the property is satisfied. This kind of problem is known as the boolean satisfiability problem (SAT) [33]. The procedure is then repeated for all state transitions; the resulting counter-examples are the generated test cases that give MC/DC coverage. The main challenge using model checkers is the state


space explosion, which makes it hard to compute the counter-examples. Model checkers have previously been successfully used to find errors in software written in C [34]. Kitamura et al. proposed an algorithm for generating an optimal MC/DC test suite (as few test cases as possible) [35]. Their findings were that model checking using incremental SAT solving works fairly well for real-world systems within the avionics domain [35]. Model checking has also been studied for its use in generating test cases within the automotive industry by Enoiu et al. [36]. We note that Enoiu et al. recommend exploring combining model checking with static analysis to improve their method. Our proof-of-concept relates to that recommendation and uses both static analysis and model checking.

SAT Solvers

A SAT solver is a software tool whose purpose is to solve the SAT problem as quickly and efficiently as possible. To alleviate the state space explosion problem that model checkers suffer from, SAT solvers were proposed to be used together with the model checkers [37]. There exist many SAT solvers that can work in parallel, do incremental SAT solving and have interfaces to several programming languages. A few examples of this are CryptoMiniSat, ManySat and Z3 [38, 39, 40] (Z3 is technically an SMT (satisfiability modulo theories) solver which can do SAT solving). An SMT solver is a software tool for constraint programming. There is a standard called SMT-LIB that SMT solvers usually follow [41]. Incremental solving is a desirable functionality when we want to find all the solutions to the SAT problem (which become our test cases) because the solver can reuse previous knowledge about clauses. All of these solvers can perform incremental SAT solving. Noteworthy is that Z3 has a frontend for data types such as integers, booleans and floats. CryptoMiniSat and ManySat operate on CNF given in DIMACS format. The SAT solver in Z3 has been used to generate MC/DC test suites [35, 28]. In the study conducted by Bloem et al., they tested a Java Card Applet Firewall by generating test cases using Z3 [28]. The generated test suite was compared to an existing test suite for the Applet firewall and the results showed that generated test suites can find errors that manually crafted test suites will not [28]. We note that to measure the results, Bloem et al. instrumented the code to allow MC/DC to be measured [28].
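As a minimal illustration of incremental solving with the Z3 Python interface (a sketch of the general technique, not the solver program used in this work), each satisfying assignment of a decision can be blocked with a negated constraint so that the next check() call reuses what the solver has already learned:

from z3 import Bools, Solver, And, Or, Not, is_true, sat

A, B, C = Bools("A B C")
solver = Solver()
solver.add(And(Or(A, B), C))                 # the decision (A or B) and C

solutions = []
while solver.check() == sat:                 # incremental: learned clauses are kept
    model = solver.model()
    values = {v: model.eval(v, model_completion=True) for v in (A, B, C)}
    solutions.append({str(v): is_true(val) for v, val in values.items()})
    # Block this exact assignment so the next check() finds a different one.
    solver.add(Not(And([v == val for v, val in values.items()])))

print(solutions)                             # all assignments satisfying the decision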

Fuzz Testing

Fuzz testing is a way to automatically test software. A typical fuzz testing tool feeds input to the SUT and can be used to find bugs which manual testing would not [8]. Fuzz testing has been used for validation as well, for example to validate TLS libraries [42]. We have previously tested modules for avionics components with the fuzz testing tool American Fuzzy Lop (AFL) [43]. The conclusion we could draw was that AFL could be used for stability testing since it achieves good statement coverage and input could be mutated [43]. Fuzz testing can fit into any of the white, black and grey box testing techniques.

Automatic Unit Test Generation

Automatic unit test generation can be used to generate a suite of unit tests for given code; it has been shown to give good statement coverage compared to test suites created manually [44]. Shamshiri et al. also concluded that it is detrimental for automatic unit test generation if the initial generated test suite is bad [44].

2.6 Compiler Structure

The structure of a modern compiler consists of 5 phases [45, 46]:


1. Lexical analysis: Input is a sequence of characters and output is tokens. Construction of tables such as symbol, constant and string tables is done here.

2. Syntactic analysis: Input is a sequence of tokens and the symbol table; output is a parse tree and error messages.

3. Semantic analysis and intermediate code generation: Input is the parse tree and output is intermediate code and symbol table temporary variables. Examples of intermediate code are three-address code, quads and reverse Polish notation.

4. Code optimization (optional): Input is the internal form and output is the internal form with improvements. An example of such an improvement is constant folding.

5. Code generation: Input is the internal form and output is assembly code. Register allocation is done in this phase.

Figure 2.6: The 5 compiler phases, from source program through lexical analysis, syntactic analysis, semantic analysis and intermediate code generation, code optimization and code generation to an executable program, supported by table management and error management.

Figure 2.6 illustrates the compiler phases. We are particularly interested in phases 2, 3 and 5 when creating our proof-of-concept.

Abstract Syntax Tree

The syntactic analysis compiler phase outputs a parse tree which represents the syntactic structure of the language. The AST is a reduction of the parse tree which contains enough information for the later compiler phases. Consider the following input to the compiler:

if x > 4 then y := x else y := 0

Figure 2.7: Input to compiler

This input would produce a parse tree as in figure 2.8. Usually the parse tree contains redundant nodes which will not be needed for later stages (as in this case).

Figure 2.8: Parse tree of input given in figure 2.7

For this reason an abstract syntax tree (AST) can be generated in place of the parse tree. The produced AST can be seen in figure 2.9.

Figure 2.9: The abstract syntax tree has the structure of the parse tree but unnecessary nodes are removed; its root is an if-stmt with the children (> x 4), (:= y x) and (:= y 0).

Dextool

Dextool is a framework written in Dlang (the D programming language). It uses libclang and LLVM to break down C/C++ code into an AST which a chosen plugin then parses and generates new code from according to the specified rules. You do not need to handle assembly code generation; in the minimal case it is enough, when writing a plugin, to transform the AST to some new intermediate form and generate new C++ code from that. For our proof-of-concept, we created a plugin to Dextool that generates test cases with MC/DC coverage. Figure 2.10 shows a flowchart of the proof-of-concept.

Dextool handles breaking down the AST from the input code; the plugin then performs the following steps:


Figure 2.10: Flowchart for proof-of-concept for automated test generation

1. Filtering of the AST

2. Transformation of the AST to an intermediate form which contains decisions and conditions

3. For every node in the AST with decisions and conditions, do SAT solving for these

4. Gather the results from the SAT solving

5. Generate code from the transformed AST

Examples of things we filter out are calls to compiler functions and standard library nodes. There is a lot of "noise" we are not interested in, so we discard those nodes early during the analysis. Transformation of the AST is mainly ensuring that the structure of the remaining nodes is ready for SAT solving and code generation, for example gathering relevant conditions and decisions related to each AST node. Once that is done, we start solving SAT problems for these decisions and conditions and the result is gathered after the transformation step. Then the code generation step is started, using the remaining nodes and the gathered results to generate test cases.

3 Method

This chapter goes through details about our method and specific design decisions and/or considerations.

3.1 Testbench Setup

The system used to run the experiments and generate the results was running Ubuntu 18.10 as its operating system with the following hardware:

• 8 core Ryzen 7 1700 CPU (16 hardware threads)
• 16 GB DDR4 RAM
• 60 GB SSD

3.2 Measuring Results

To measure structural coverage, it is necessary to instrument the code being tested. The instrumentation makes it possible to verify whether certain parts of the code are reached or not at a structural level. When measuring execution time, it is desirable that the program is measured in such a way that the time represents the time our proof-of-concept is running. This means that certain parts of the plugin may be omitted from the measuring, such as reading input parameters and writing the result to file.

MC/DC

Commercial options for analyzing and measuring MC/DC are for example:

• Testwell CTC++ (https://www.verifysoft.com/en_ctcpp.html)
• Coco (https://www.froglogic.com/coco/)


• BullseyeCoverage (http://www.bullseye.com/)
• VectorCast (https://www.vectorcast.com/software-testing-products/embedded-mcdc-uni)
• RapiCover (https://www.rapitasystems.com/products/rapicover)
• CoverageMaster

We are using Testwell CTC++ for measuring code coverage. The reason for using Testwell CTC++ is that they offered us a licence to use it. Since MC/DC is the coverage metric with the highest requirements, code instrumented for MC/DC analysis can also report coverage with lower requirements such as statement, function, decision and condition coverage. We intend to provide those coverage metrics as well. To give an example of how MC/DC is calculated, we look at the program seen in figure 3.1. The program in figure 3.1 has the conditions A, B, C and a

if ((A || B) && C) {
    /* instructions */
} else {
    /* more instructions.. */
}

Figure 3.1: Example program

decision at the if-statement expression (A ∨ B) ∧ C. To achieve 100% MC/DC coverage we could use the 4 test cases seen in figure 3.2. If we only had the first two test cases, MC/DC

#   A   B   C   (A ∨ B) ∧ C
1   F   F   T   False
2   F   T   T   True
3   F   T   F   False
4   T   F   T   True

Figure 3.2: MC/DC test cases for example program

coverage would be 50%. The MC/DC coverage for a decision is calculated by equation 3.1:

    MC/DC = Conditions_Independent / Conditions    (3.1)

where Conditions_Independent is the number of conditions that were proved to independently affect the decision outcome and Conditions is the number of conditions in the decision.
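As an illustration of equation 3.1, a small Python sketch (ours, considering unique-cause independence pairs only) can count the conditions for which a test suite contains an independence pair and divide by the total number of conditions; with the test cases from figure 3.2 it reports full coverage:

from itertools import combinations

def mcdc_coverage(decision, conditions, test_vectors):
    # decision is a function of a dict of condition values; returns the MC/DC ratio.
    independent = set()
    for v1, v2 in combinations(test_vectors, 2):
        changed = [c for c in conditions if v1[c] != v2[c]]
        if len(changed) == 1 and decision(v1) != decision(v2):
            independent.add(changed[0])          # unique-cause independence pair
    return len(independent) / len(conditions)

decision = lambda v: (v["A"] or v["B"]) and v["C"]
suite = [dict(A=False, B=False, C=True),         # the test cases from figure 3.2
         dict(A=False, B=True,  C=True),
         dict(A=False, B=True,  C=False),
         dict(A=True,  B=False, C=True)]
print(mcdc_coverage(decision, ["A", "B", "C"], suite))   # 1.0, i.e. 100% MC/DC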

Execution Time

For timing the execution time of the plugin, we use the standard library module in D called std.datetime.stopwatch [47]. The stopwatch is implemented using the monotonic system clock. This is desired since the monotonic clock may only advance forwards and cannot go backwards. According to Stewart et al., the precision of the measuring should be at least five to ten times finer than the task execution time [48]. The system clock on our test bench has a precision of 10 milliseconds as seen by using the sysconf() function [49]. The precision of the system clock should be sufficient in our case, as our plugin will need to open, read and analyze information inside files as well as wait for the processes running our solver program to finish executing. Thus, the time consumption for our test object is very likely to exceed this recommendation.
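For comparison, the same measurement idea expressed in Python (a sketch only; the plugin itself uses D's std.datetime.stopwatch) relies on time.monotonic, which likewise never goes backwards:

import time

def generation_step():
    # Placeholder for the work being timed (hypothetical).
    return sum(i * i for i in range(1_000_000))

start = time.monotonic()          # monotonic clock: can only advance forwards
generation_step()
elapsed = time.monotonic() - start
print(f"elapsed: {elapsed:.3f} s")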


3.3 Test Object

We investigate one safety critical system within the medicine domain [50]. It will be referred to as the test object. Figure 3.3 describes the relevant properties of the test object. The goal

Test Case   Domain     Language   Lines of Code   Files   Open-Source
1           Medicine   C++        2213            24      Yes

Figure 3.3: Test object used for study

is to achieve full MC/DC coverage for the test object within a reasonable time. We have chosen this test object due to its closeness to the software we want our work to be relevant for. The test object exercises many features of the C++ language and standard library such as templates, class inheritance, vectors, operators and virtual functions.

3.4 Incremental SAT solving with Z3

Z3 can do incremental SAT solving and has C++ and Python interfaces [40]. It is possible to link C++ code with D by writing wrapper code and linking during compilation; however, this is time consuming, so we have opted for another way to use Z3. We created a program using the Z3 Python interface that finds multiple solutions for an input with n variables and a Boolean expression E. We call the program when needed in our plugin by creating processes using std.process in Dlang. This gives us flexibility, since we can replace, update or use other SAT solvers without breaking our plugin as long as we handle input the same way. To make the solving as fast as possible we want to exploit parallelism, and we have several processes that do SAT solving simultaneously while our plugin traverses the AST. This setup makes it trivial to solve SAT problems in parallel with the plugin execution. This ensures our solution will be scalable - we can create several processes which can in turn create several threads to fully utilize a CPU multicore architecture. We propose algorithm 1 for finding test cases with Z3.
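A rough sketch of this process-based setup, written in Python for illustration (the actual plugin spawns the processes from D with std.process, and solver.py with its command-line interface is a hypothetical stand-in for our solver program):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: Boolean expressions extracted from the AST.
expressions = ["(A and B) and not C", "(A or B) and C"]

def solve(expression):
    # Hypothetical CLI: solver.py reads an expression and prints its solutions.
    result = subprocess.run(["python3", "solver.py", expression],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Several solver processes run in parallel while the AST traversal continues.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(solve, e) for e in expressions]
    solutions = [f.result() for f in futures]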


Algorithm 1: Algorithm for finding MC/DC test cases with Z3
input : Boolean expression E
output: List of solutions satisfying E

solver ← create Z3 solver
Let V be variables v in E
Let C be conditions c in E
Let solverC be conditions in solver
Let solverV be variables in solver
Function F(v, c) solves SAT for variables v, conditions c
solverV ← V
solverC ← C                              // list of conditions
S ← ∅
U ← {x : (True, True) | x ∈ V}           // U is a hashmap, key is x, value is a pair of bools
model ← F(solverV, solverC)
while model ≡ SAT ∧ U ≠ ∅ do
    for x ∈ U do
        solverC ← C
        solution ← F(solverV, solverC)
        if solution ≡ SAT then
            if solution ⊄ S ∧ U[x][0] then
                U[x][0] ← False
                solverC ← solverC ∪ ¬(solution)
                S ← S ∪ solution
            end
        end
        solverC ← ¬(C)
        solution ← F(solverV, solverC)
        if solution ≡ SAT then
            if solution ⊄ S ∧ U[x][1] then
                U[x][1] ← False
                solverC ← solverC ∪ ¬(solution)
                S ← S ∪ solution
            end
        end
    end
    model ← F(solverV, solverC)
end
return S

The algorithm uses a combination of the unique cause and masking MC/DC approaches, a combination which was suggested by Chilenski et al. to solve issues with condition coupling [51]. Z3 is called in an iterative way to find the independence pairs for each variable. Once the independence pairs for a variable are found, the algorithm will stop trying to find pairs for it. Identical vectors in independence pairs are not added if they have already been found and added once. For expressions with masked conditions, the masked conditions are not assigned at all. The algorithm might in these cases generate more test cases (up to 2N test cases). We implemented the proposed algorithm in Python for quicker prototyping. However, it should be straightforward to implement it in other languages that have interfaces for Z3 such as C++, C# and Rust.
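A simplified Python sketch of the idea behind algorithm 1 (our own condensed reconstruction, not the actual solver program): for each condition x it asks Z3 for an assignment where E holds but flipping only x falsifies E, and it skips solutions that were already collected.

from z3 import Bools, Solver, And, Not, substitute, is_true, sat

def independence_pairs(variables, E):
    # For each condition, ask Z3 for a unique-cause independence pair: an
    # assignment where E is true and flipping only that condition makes E false.
    tests = []
    for x in variables:
        s = Solver()
        s.add(E)                                   # E holds here...
        s.add(Not(substitute(E, (x, Not(x)))))     # ...and fails when only x is flipped
        if s.check() == sat:
            m = s.model()
            found = {str(v): is_true(m.eval(v, model_completion=True)) for v in variables}
            flipped = dict(found)
            flipped[str(x)] = not found[str(x)]
            for solution in (found, flipped):
                if solution not in tests:          # do not add duplicate solutions
                    tests.append(solution)
    return tests

A, B, C = Bools("A B C")
print(independence_pairs([A, B, C], And(A, B, Not(C))))   # the example (A and B) and not C

For the example expression this sketch yields the four solutions listed in figure 3.4.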


Example Input to Algorithm 1

To give an example of the algorithm at work, let us assume the input is (A ∧ B) ∧ ¬C. We then know that:

• U keys = [A, B, C]

• SolverC = [((A ∧ B) ∧ ¬C)]

• S is a list of all valid solutions.

The algorithm will iterate over the keys in U, which means that the first iteration is finding independence pairs for A, the second for B and the third for C. For A we find the independence pair {A = True, B = True, C = False} and {A = False, B = True, C = False}. This pair proves that changing A and only A made the Boolean expression E evaluate differently, thus proving that A independently affected the outcome of E. If any of the solutions in the pair is not already in S, that solution is added to S. For example:

1. Independence pair for A: {A = True, B = True, C = False} and {A = False, B = True, C = False}. Both solutions are added to S. Both solutions are added as negated conditions to SolverC.

2. Independence pair for B: {A = True, B = True, C = False} and {A = True, B = False, C = False}. First solution is already in S but second is not. Second solution is added to S. Second solution is added as a negated condition to SolverC.

3. Independence pair for C: {A = True, B = True, C = False} and {A = True, B = True, C = True}. First solution is already in S but second is not. Second solution is added to S. Second solution is added as a negated condition to SolverC.

The final solutions for this example can be seen in figure 3.4.

#   A      B      C      (A ∧ B) ∧ ¬C
1   True   True   False  True
2   False  True   False  False
3   True   False  False  False
4   True   True   True   False

Figure 3.4: Solutions found for example with algorithm 1

In this example, we achieve an optimal test suite of size N + 1. The first solution in the independence pair for A was also the first solution in the independence pairs for B and C. For this reason, there was no point in adding a duplicate of that solution.

3.5 Plugin Implementation

This section goes through the implementation and design choices made for our plugin.

Input Handling

To execute the plugin, the user needs to specify the file(s) and namespace(s) to analyze, as well as a path to our solver program.


File Input

The plugin takes absolute file paths and/or a compilation database as file input. The compilation database is useful, as it can be exported from CMake (https://cmake.org/cmake/help/v3.5/variable/CMAKE_EXPORT_COMPILE_COMMANDS.html). The drawback of doing so is that the compilation database does not include any commands for implementation files - this problem is solved with the pip (https://pypi.org/project/pip/) tool compdb. The initial compilation database generated from CMake is used as input to compdb, which generates a compilation database that also has implementation compilation commands. The purpose of this design is to make it easier to use the plugin with safety critical software projects that use CMake as their build system, such as [52]. If the project is small, it might be less complicated to manually add the input file paths.

Namespace Input

After the source code has been analyzed, the AST is filtered. In this step, it is necessary to filter out namespaces used by the standard library and other compiler specific namespaces. For this reason, the namespace(s) that the SUT resides in must be passed as input to the plugin. This requirement is based on the assumption that the SUT follows AV Rule 98 (JSF++) [17].

Solver Path Input

This is the path to the solver program that we developed for the plugin. It may differ from host to host and could be predefined in the environment if desired. This also makes it possible to completely change the solver program used. If the input and output structure is kept, the plugin logic stays separated from the solver program.

Implementation Assumptions

We make assumptions about the input objects, based on the specifications for safety critical systems such as AUTOSAR and JSF [12, 17].

• It is assumed that no implementation details reside in header files (*.h); these are instead defined in implementation files (*.cpp). This follows the rules of the JSF standard [17].

• It is assumed that all functions, classes and so on are defined within a named namespace.

• It is assumed that all statements which have bodies (if, else, loops etcetera) always have braces enclosing the body. That is, they always have a compound statement following the keyword.

• An implementation file (*.cpp) for a header file (*.h) has the same name except for the file extension. For example, we assume that if there is a file "foo.cpp" the header file will be named "foo.h".

• The implementation does not depend on compiler reserved identifiers [53] beginning with a double underscore or a single underscore followed by an alphanumeric character, such as "__" or "_A". These identifiers will be filtered out early during analysis.

• Most of the standard library implementation is filtered out (namespace std::); common headers such as <vector>, <string> and <cmath> are specifically handled.


3.6 AST Nodes of Interest

When analyzing the AST from Clang, we look for specific nodes to include and exclude the rest. What these nodes have in common is that they either can contain decisions and conditions, or they represent a decision or condition. We classify the nodes into three categories: Decision Block, Decision and Condition. The groundwork for these classifications can be seen in figure 3.5. By inspecting the nodes of interest, we can find the Decision Blocks, Decisions

Decision Block: FunctionDecl, Method, Constructor, ClassDecl, ClassTemplate, StructDecl, CompoundStmt

Decision: IfStmt, ForStmt, WhileStmt, SwitchStmt, ReturnStmt, CaseStmt, VarDecl

Condition: BinaryOperator, CompoundAssignOperator, UnaryOperator, CallExpr, MemberRefExpr, MemberDeclExpr

Figure 3.5: Classification of AST nodes

and Conditions that exist in the code, from which we can derive the information needed to generate the MC/DC test cases.

AST Decision Block

We define a decision block B such that it contains all decisions and conditions of its children. That means that B contains all variables used in decisions and conditions within the block, such as function parameters, global variables, local variables and function calls. The decision blocks of classes/structs consist of the decisions and conditions of the functions, methods and constructors in the class/struct. A CompoundStmt is always a decision block since it may contain decisions within its body. An example of a decision block B is seen in figure 3.6.

/* Decision Block starts here */
if (A || B) {
    if (C) {
        /* instructions */
    }
}
if (C) {
    /* more instructions.. */
}
/* Decision Block ends here */

Figure 3.6: Code example for an AST Decision Block

AST Decision

We define a decision D such that it contains all conditions used in the decision. D may also contain decision blocks; for example an IfStmt node is a decision, where the body is a decision block and reaching the body is dependent on the condition. The else-if statements of an IfStmt node are also IfStmt nodes. Return statements are classified as decisions since they can be conditional, just like the conditions when evaluating an if-statement. An example of a decision D is seen in figure 3.7.


if (A || B) {
    /* instructions */
}

Figure 3.7: Code example for an AST Decision D

AST Condition

We define a condition C such that it may influence a decision depending on what value it takes. For example, a decision such as if ((A >= B/10) && (C)) then... has the conditions {(A >= B/10), ((A >= B/10) && C)}. Conditions may contain several variables.

Representing the If-Statement Decision

In the Clang AST, the IfStmt node (https://clang.llvm.org/doxygen/classclang_1_1IfStmt.html) has an Expression node (https://clang.llvm.org/doxygen/classclang_1_1Expr.html) representing the condition and a Stmt representing the else-statement. The else-statement may also be empty and represented by a null pointer. Since we assumed that the code being analyzed follows the JSF++ [17] rule that if-statement bodies should be enclosed in braces, an IfStmt node can come in three forms:

1. Else-statement is a null pointer
2. Else-statement is a Stmt node
3. Else-statement is an IfStmt node

The first case is an IfStmt node with no else or else if keyword. The second case is an IfStmt node with the else keyword. The third case is an IfStmt node with the else if keyword. For every IfStmt node analyzed we identify which form it is in, then extract the condition, body and else-statement for the node. In our plugin, the condition of the IfStmt node is an AST condition and the body is a decision block. The else-statement is another IfStmt node. If the else-statement node is in the second form, we represent it as any other if-statement but give it an empty AST condition.
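For illustration, the three forms can also be observed with the libclang Python bindings (the plugin itself works through Dextool and D); in the simple case an IF_STMT cursor has the condition as its first child, the body as its second and an optional third child that is either a plain statement (else) or another IF_STMT (else if). The file name and compiler flags below are placeholders:

import clang.cindex
from clang.cindex import CursorKind

def classify_if(cursor):
    # Classify an IF_STMT cursor into the three forms described above.
    children = list(cursor.get_children())
    if len(children) == 2:
        return "no else"                      # form 1: else-statement is a null pointer
    if children[2].kind == CursorKind.IF_STMT:
        return "else if"                      # form 3: else-statement is an IfStmt
    return "else"                             # form 2: else-statement is a Stmt

index = clang.cindex.Index.create()
tu = index.parse("example.cpp", args=["-std=c++14"])     # placeholder input file
for node in tu.cursor.walk_preorder():
    if node.kind == CursorKind.IF_STMT:
        print(node.location.line, classify_if(node))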

Representing the Return-Statement Decision

The ReturnStmt node has an Expression node representing the returned value. We consider return-statements to be decisions if they contain AST conditions and can affect the return value of a function. The children of the expression node are visited, looking for AST condition nodes to determine if the return statement should be treated as a decision. If the return-statement does not have AST conditions, it will not be treated as a decision during the solving step.

AST Traversal Decision Logic

When the plugin traverses the AST during the transform stage, the decision logic for how to handle a node is illustrated in figure 3.8. The translation units are the highest order nodes in the AST. By visiting the children of the translation units, the structure and logic of the code can be derived. When we visit the children, we look for the AST nodes of interest as presented in figure 3.5. When a node of interest is found, we start extracting information from the node depending on its type. The information extraction in many cases includes visiting the children of the node as well. Other needed information such as class inheritance, class names, field member names, function names and so on is also extracted for later use. The

¹⁹ https://clang.llvm.org/doxygen/classclang_1_1IfStmt.html
²⁰ https://clang.llvm.org/doxygen/classclang_1_1Expr.html


Figure 3.8: Decision logic for a node being transformed (flowchart: Translation Unit → Visit children; if the child is a function declaration, Extract Decisions; if the child is a class/struct declaration, Extract Conditions)

The final extracted information for functions and classes is hashed uniquely and can be looked up at any time during transformation and code generation. The lookup can be done with either the hash or the fully qualified name for classes. The fully qualified name for a class that inherits a class Bar, which inherits a class Foo, is for example "Foo::Bar::ClassName". That is, the fully qualified name includes the namespace and the class name. Functions may only be looked up by hash, as several functions can have the same fully qualified name (for example functions with the same name but different parameters).
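The plugin itself is implemented as a Dextool plugin, so the following C++ sketch is only an illustration of the same traversal and lookup idea using Clang's RecursiveASTVisitor directly; the class and member names are ours and not part of the plugin:

// Sketch only: illustrates visiting function and class declarations and
// storing them in lookup tables keyed by hash and fully qualified name.
#include "clang/AST/RecursiveASTVisitor.h"
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

class ExtractVisitor : public clang::RecursiveASTVisitor<ExtractVisitor> {
public:
    // Called for every function declaration: extract its decisions here.
    bool VisitFunctionDecl(clang::FunctionDecl *fd) {
        // A real implementation would also hash the signature so overloads get distinct keys.
        functions[std::hash<std::string>{}(fd->getQualifiedNameAsString())] = fd;
        return true; // continue traversal
    }

    // Called for every class/struct declaration: extract the conditions of its methods here.
    bool VisitCXXRecordDecl(clang::CXXRecordDecl *rd) {
        classes[rd->getQualifiedNameAsString()] = rd; // lookup by fully qualified name
        return true;
    }

private:
    std::unordered_map<std::size_t, clang::FunctionDecl *> functions;  // functions: lookup by hash only
    std::unordered_map<std::string, clang::CXXRecordDecl *> classes;   // classes: lookup by name (or hash)
};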

Variable Versioning

When SAT solving in Z3, a variable is distinct and may only take one value. This is a problem when trying to represent a variable that changes value several times in a decision block. For example, the variable might be initialized with one value, then assigned to another value which depends on another variable. To solve this problem, every time a variable that is already assigned is changed in the decision block, we give it a new version. For example, given the variable x, there will be n versions where {x1, ..., xn} represent the different versions of x. Each version is treated as a different variable. The value of the newly versioned variable is determined by the assignment: if we have x1 = 5 and then assign x1 += 10, a new variable x2 is created with the assignment x2 = x1 + 10. This impacts how conditions are represented; several conditions that depend on x may refer to different versions of x. This method is similar to how many modern compilers handle variable assignments, by storing them in temporary variables at specific addresses during compilation time. An example of how variables in analyzed source code are versioned is shown in figure 3.9.

In the code example, the variables x and y are declared but not assigned. Since they have not yet been assigned, they have not been versioned yet. If we (hypothetically) declared them with values, they would be versioned the first time they were assigned again in the code. When x and y are assigned to 0, they are still not versioned. However, once x is assigned to y + 2 the variable is versioned. In other words, every time a variable that has already been assigned a value is reassigned, it will be versioned. We also see that operators such as AssignAdd (+=) and AssignMultiply (*=) are expanded for variable versioning.
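A minimal sketch of the bookkeeping this requires (our own illustration, not the plugin's actual code): keep a per-variable version counter and bump it whenever an already assigned variable is reassigned.

// Sketch of variable versioning bookkeeping (illustrative only)
#include <map>
#include <string>

class VersionTable {
public:
    // Returns the variable name to use for the next assignment of `var`.
    // The first assignment keeps the original name; later ones become x0, x1, ...
    std::string onAssignment(const std::string &var) {
        auto it = versions.find(var);
        if (it == versions.end()) {
            versions[var] = 0;                           // first assignment: not versioned yet
            return var;
        }
        return var + std::to_string(it->second++);       // reassignment: new version
    }

    // Returns the most recent version of `var`, for use inside conditions.
    std::string current(const std::string &var) const {
        auto it = versions.find(var);
        if (it == versions.end() || it->second == 0)
            return var;
        return var + std::to_string(it->second - 1);
    }

private:
    std::map<std::string, int> versions;
};

Applied to the example in figure 3.9, onAssignment("x") would return "x" for the first assignment, then "x0", then "x1".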

3.7 Implementation Details


int main(int argc, char** argv)
{
    int x, y;
    x = y = 0;
    // x = 0, y = 0
    x = y + 2;
    // x0 = x + y + 2
    x *= 2;
    // x1 = x0 * 2
    y += y * 2 + x;
    // y0 = y * 2 + x1
    return 0;
}

Figure 3.9: Code example of variable versioning

Function Calls in Conditions

To express a function call in a condition, we convert it to a variable. Consecutive calls to a function are versioned just as described in section 3.6. To represent the function call as a variable, we express a variable as the value returned from the decision block of the function. If the function is dependent on variables such as parameters or class members, these are assigned as well.
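As an illustration (the function name is hypothetical, not from the test object), a decision with two calls to the same function would be represented with two versioned variables, each constrained to the value returned from the function's decision block:

// Hypothetical example: function calls inside a condition
int sensorValue() { return 42; }   // assumed analyzable, i.e. it has a decision block

void control() {
    // In the solver representation the two calls become two versioned variables,
    // e.g. sensorValue0 and sensorValue1, each equal to the value returned from
    // the function's decision block; parameters and class members the function
    // depends on are assigned as well.
    if (sensorValue() > 10 && sensorValue() < 100) {
        /* instructions */
    }
}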

Mapping Data Types

C++ has several primitive data types such as floats, doubles, integers, booleans and chars. When representing these in Z3, we have to map each data type to a data type in Z3 that is capable of representing it. Z3 can represent all kinds of reals with the Real data type, which is the type we currently use in our solver program. It is also possible to restrict the precision to IEEE floating-point numbers, which we currently do not do. C++ strings are represented with the String data type in Z3, and integers with the Integer data type. In our test object, most classes tested could be decomposed into primitive types, which were then mapped to the correct Z3 data type. Arrays and vectors are not mapped but could potentially be mapped to corresponding data types in Z3 such as Array. When Z3 finishes solving the problem, the variables can be mapped back to their original data types used in C++. This is done during code generation by mapping the unique id of the variables and finding the same id within the scope of the decision block. The original data type of each variable is stored during analysis so it can be retrieved by the unique id of the variable. If the variable belongs to a class, the class can be looked up by the variable id and then instantiated correctly by matching the variable id with the correct member field in the class.
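A minimal sketch of this mapping using the Z3 C++ API (z3++.h); this is our illustration of the mapping described above, not the plugin's actual solver program, and the variable names are only examples:

// Sketch: mapping C++ primitive types to Z3 sorts (illustrative only)
#include <z3++.h>
#include <iostream>

int main() {
    z3::context c;

    z3::expr i = c.int_const("count");                 // C++ int         -> Z3 Integer
    z3::expr d = c.real_const("ratio");                // float/double    -> Z3 Real (no IEEE 754 restriction)
    z3::expr b = c.bool_const("flag");                 // bool            -> Z3 Bool
    z3::expr s = c.constant("name", c.string_sort());  // std::string     -> Z3 String

    z3::solver solver(c);
    solver.add(i > 0 && d < 1 && b);                   // example constraints over the mapped variables
    if (solver.check() == z3::sat) {
        // The model assignments are what would be mapped back to C++ values
        // during code generation.
        std::cout << solver.get_model() << "\n";
    }
    return 0;
}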

Code Generation Formatting

When generating a test for a class, each method in the class will instantiate its own object of that class. The instantiated object then calls the method with the correct parameters, which are derived from the solver results. If the parameters do not matter for the outcome, default parameters are generated for the function call. For primitives such as integers, strings and floats the default value is hard coded. If a parameter is a class, the first constructor found for that class is called to create a default value.

Code Generation Example

Assume we use the implementation file in figure 3.10 and the header file in figure 3.11 as input to our plugin.


1 # i n c l u d e " foo . hpp "

2

3 namespace Foo {

4 bool Bar : : isFun (i n t t ) {

5 i f ( t == 0 ) { 6 r e t u r n f a l s e; 7 } e l s e i f ( 5 >= t ) { 8 r e t u r n f a l s e; 9 } e l s e { 10 r e t u r n t r u e; 11 } 12 } 13 }

Figure 3.10: Input implementation file to plugin

#ifndef FOO_HPP
#define FOO_HPP

namespace Foo {
    class Bar
    {
    public:
        bool isFun(int t);
    };
}

#endif

Figure 3.11: Input header file to plugin

The plugin first analyzes the input files and finds the Decision Blocks and Decisions. Then it calls our Z3 program for each decision, using the decision as the expression to solve for and the conditions inside the decision as input variables. The Z3 program returns the resulting assignments for each variable that are needed to achieve MC/DC coverage, and these assignments are used when generating the final code. The plugin will generate code that instantiates the relevant test objects such as classes. For the input code in this example, the plugin generates the output files seen in figure 3.12 and figure 3.13.

1 # i n c l u d e " mcdc . hpp "

2

3 i n t main (i n t argc , char* * argv ) {

4 // Foo : : Bar 5 Foo : : Bar B a r T e s t O b j e c t ; 6 B a r T e s t O b j e c t . isFun ( 0 ) ; 7 B a r T e s t O b j e c t . isFun ( 1 ) ; 8 B a r T e s t O b j e c t . isFun ( 2 ) ; 9 B a r T e s t O b j e c t . isFun ( 6 ) ; 10 r e t u r n 0 ; 11 } ;


#ifndef mcdc_hpp
#define mcdc_hpp
#include "/ABSOLUTE_URL/Dextool-MCDC/foo.hpp"
#endif // mcdc_hpp


4 Results

This chapter presents the results observed from our method.

4.1 Code Coverage

We measured the statement, function, condition, decision and MC/DC coverage our generated test cases had against the test object. The results can be seen in figure 4.1.

                 Statement   Function   Condition   Decision   MC/DC
Total to Cover        1088         73         439        427     433
Covered                727         61         245        242     242
Coverage               67%        84%         56%        57%     56%

Figure 4.1: Coverage results from Proof-of-concept test generation

The first row shows the total points to cover for the respective metric (a point being, for example, a statement for statement coverage, etcetera). The second row shows how many points were covered by the generated test cases. The third row shows the achieved coverage in percent. The achieved coverage is given by equation 4.1.

AchievedCoverage = (Covered / Total) × 100    (4.1)

where Total is the number of points to cover, Covered is the number of points covered, and the result is rounded to the closest natural number.
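As a worked instance of equation 4.1, the statement coverage row of figure 4.1 gives:

AchievedCoverage = (727 / 1088) × 100 ≈ 67%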

                   Statement   Function   Condition   Decision
Average Per File          45          3          18         18

Figure 4.2: Average amount of statements, functions, conditions and decisions in each file

In figure 4.2 we see the average amount of statements, functions, conditions and decisions in each file. The average is the quotient of equation 4.2.

Average = Total / 24    (4.2)


Where Total is the points to cover, 24 is the number of files and the quotient is rounded to the closest natural number.
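For example, applying equation 4.2 to the statement count gives 1088 / 24 ≈ 45 statements per file, which matches the first column of figure 4.2.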

4.2 Execution Time

The execution time is measured with std.datetime.stopwatch [47] from the D standard library, which uses a monotonic system clock. The stopwatch was started before the compilation database files were analyzed, and stopped before the results were written to file.

Method                   Precision         Average Execution Time
Monotonic System Clock   10 milliseconds   6.103 seconds

Figure 4.3: Average execution time using a monotonic system clock

The time presented in figure 4.3 is the average value from all the measured execution times. The average value was calculated by taking the sum of all measured execution times and dividing it by the number of test runs where the execution time was measured.

Sample Size   Maximum Time Difference
10            8 milliseconds

Figure 4.4: Sample size and maximum difference between measured execution times

In total, 10 test runs were done where the execution time was measured. The maximum difference is the difference in milliseconds between the shortest and longest execution times.

4.3 Test Object Errors

We observed dozens of errors (segmentation faults), originating from the use of uninitialized values, when executing our test cases. Using one of the class constructors and then calling a method would cause an error in these cases. We also observed one function that was declared but lacked an implementation being used, which the compiler did not catch, causing an error when it was called in the test case. Another function returned no value (void) when it was supposed to return a double.


5 Discussion

This chapter goes through the observed results, analyzing them and comparing their importance to other studies conducted in the area. The method and future work are also discussed here.

5.1 Results

This section discusses the observed results.

Code Coverage

We start by comparing our achieved coverage with results found by other studies done in the area, such as Enoui et al. who used model checking to generate test cases [36]. They analyzed C code and, for a significant code base, achieved 100% MC/DC, 100% CC and 100% Decision Coverage (DC) in 78% of the cases. In 22% of the cases they achieved 65% MC/DC, 88% CC and 82% DC. An important difference between the cases was that the ones that achieved 100% MC/DC all had between 1 and 22 decisions, with the average being 5. The cases that did not achieve 100% MC/DC had between 12 and 196 decisions, with the average being 38. In these cases, the allocated time of 10 minutes for finding a test suite was reached and their program was forced to quit. We observed an average of 18 decisions per case from our own method. If we calculate the average amount of decisions over all the cases from the results presented by Enoui et al., as seen in equation 5.1, there is for all cases an average of approximately 12 decisions.

AverageDecisions = (0.78 × 5) + (0.22 × 38) ≈ 12    (5.1)

The result of equation 5.1 is rounded to the closest natural number. 12 decisions per file is fairly close to the average amount of decisions we observed in our own results (18), and thus it seems like a reasonable comparison between the two. We can calculate the CC, DC and MC/DC achieved for all their cases as shown in equations 5.2, 5.3 and 5.4.

ConditionCoverage = (0.78 × 1.00) + (0.22 × 0.88) = 0.9736 ≈ 97%    (5.2)
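Applying the same weighting to decision coverage and MC/DC (the remaining equations referenced above, computed here from the same figures) gives approximately:

DecisionCoverage = (0.78 × 1.00) + (0.22 × 0.82) = 0.9604 ≈ 96%    (5.3)

MC/DC = (0.78 × 1.00) + (0.22 × 0.65) = 0.9230 ≈ 92%    (5.4)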
