Institutionen för datavetenskap
Department of Computer and Information Science

Final thesis

A Comparative Study of Automated

Test Explorers

by

Johan Gustafsson

LIU-IDA/LITH-EX-A–15/065–SE

November 16, 2015

Linköpings universitet
Institutionen för datavetenskap

Final thesis

A Comparative Study of Automated

Test Explorers

by

Johan Gustafsson

LIU-IDA/LITH-EX-A–15/065–SE

November 16, 2015

Supervisor: Amir Aminifar & Jonatan Isaksson
Examiner: Ahmed Rezine


Abstract

With modern computer systems becoming more and more complicated, the importance of rigorous testing to ensure the quality of the product increases. This, however, means that the cost of performing tests also increases. To address this problem, a lot of research has been conducted in recent years to find more automated ways of testing software systems. In this thesis, different algorithms to automatically explore and test a system have been implemented and evaluated. In addition, a second set of algorithms has been implemented with the objective of isolating which interactions with the system were responsible for a failure. These algorithms were also evaluated and compared against each other. In the first evaluation, two explorers, which I called DeBruijn and LStarExplorer, were considered superior to the others. The first used a De Bruijn sequence to brute-force a solution, while the second used the L* algorithm to build an FSM over the system under test. This FSM could then be used to provide a more accurate description of when the failure occurred. The result of the second evaluation was two reducers, which both tried to recreate a failure by first applying the interactions performed just before the failure occurred. If this was not successful, they tried interactions further and further away until the failure was triggered. In addition, the thesis describes the framework used to run the different strategies.


Sammanfattning

Då våra moderna datasystem blir allt mer komplicerade, ökar detta ständigt behovet av rigorösa tester för att säkerställa kvaliteten på den slutgiltiga produkten. Det här innebär dock att kostnaden för att utföra testerna också ökar. För att försöka hitta en lösning på det här problemet har forskningen under senare tid arbetat med att ta fram automatiserade metoder att testa mjukvarusystem. I den här uppsatsen har olika algoritmer, för att utforska och testa ett system, implementerats och utvärderats. Därutöver har också en grupp algoritmer implementerats som ska kunna isolera vilka interaktioner med ett system som får det att fallera. Även dessa algoritmer har utvärderats och testats mot varandra. Resultatet från det första experimentet var två explorers, här kallade DeBruijn och LStarExplorer, som visade sig vara bättre än de andra. Den första av dessa använde en DeBruijn-sekvens för att hitta felen, medan den andra använde en L*-algoritm för att bygga upp en FSM över systemet. Den här FSM:en kunde sedan användas för att mer precist beskriva när felet uppstod. Resultatet från det andra experimentet var två reducers, vilka båda försökte återskapa fel genom att först applicera interaktioner som ursprungligen utfördes precis innan felet uppstod. Om felet inte kunde återskapas på detta sätt, fortsatte de med att applicera interaktioner längre bort tills felet kunde återskapas. Utöver detta innehåller uppsatsen också beskrivningar av ramverken som används för att köra de olika strategierna.


Acknowledgement

A great thanks to my supervisors, Amir Aminifar and Jonatan Isaksson, who have supplied vital input on how to structure the report and helped in the hunt for linguistic flaws. I also want to thank the entire System Engineering Team at Zenterio, and especially Per Böhlin, who has been greatly supportive and provided a lot of good input on how to design the different software modules constructed as a part of this thesis. Lastly, I want to thank my examiner, Ahmed Rezine, who has been very encouraging and helped steer this thesis to what it is today.


Contents

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Literature
  1.4 Method
  1.5 Restrictions
    1.5.1 Mock-ups
    1.5.2 Actions applicability
  1.6 Structure
2 Theory
  2.1 All pairs testing
  2.2 De Bruijn sequence
  2.3 L* algorithm
  2.4 Model Based Testing
    2.4.1 GraphWalker
  2.5 ddmin algorithm
3 Related Work
4 Implementation
  4.1 Adapter Module
    4.1.1 Mock-ups
  4.2 Exploration Module
    4.2.1 RandomWalk (Rand)
    4.2.2 CostGuidedRandomWalk (CGRand)
    4.2.3 BruteForce (BruteF)
    4.2.4 AllGroupsIncrementally (AGInc)
    4.2.5 DeBruijn (DeBru)
    4.2.6 LStarExplorer (LSExp)
  4.3 Reduction Module
    4.3.1 TraceIncrementationAdd (TIAdd)
    4.3.2 TraceIncrementationMult (TIMult)
    4.3.3 DeltaDebugger (DelDeb)
5 Experiment
  5.1 Experimental Setup
  5.2 Results
6 Discussion
  6.1 Exploration discussion
  6.2 Reduction discussion
7 Future work
8 Conclusions
9 References
A Configuration appendix
B Exploration results
C Reduction results

Dictionary

action: An abstract item representing an interaction with the SUT.

action trace: A sequence of actions.

action stream: A sequence of actions.

adapters: Strategies in the Adapter module which translate actions into something that the SUT understands.

available actions: Those actions that, for a given SUT, can be translated into an interaction.

DFA: Deterministic Finite Automaton, a strict mathematical description of an FSM.

error trace: An action trace that has caused a failure in the SUT.

explorers: Strategies in the Exploration module which apply actions to the SUT and thus explore it.

FSM: Finite State Machine.

MBT: Model Based Testing.

module: A part of the implementation's design, with responsibility for a certain task. Different implementations of a module are called strategies.

reducers: Strategies in the Reduction module which reduce error traces.

strategy: A concrete implementation of one of the modules' interfaces.

SUT: System Under Test; can also refer to whatever the test engine is working against, e.g. a real system or a mock-up.

wall time: Also known as wall-clock time; the amount of real time or clock time elapsed.

1 Introduction

This chapter provides an overview for the rest of the thesis. It establishes the problem and its background, together with useful terminology. The method for how to address the problem is also described.

1.1 Background

Software testing is a very costly process [30]. Nevertheless, it is necessary to perform rigorous testing to ensure the quality of a product. To meet this requirement, the industry is moving more and more towards test automation. However, this is mostly the case for checking, i.e. confirming expected behaviour of the product, and not so much for actual testing, i.e. exploring the product in order to discover unknown defects [2, 4].

Automated checking is great in the sense that it allows a huge number of test cases to be executed in a short period of time. This approach makes sure that each individual feature of the system under test (SUT) works properly. Automated checking is, however, usually incapable of finding faults related to longer sequences of executed features. The reason for this is the concept of test cases. Smaller, more precise tests will not only indicate whether there is a fault, but also where to find it. To provide this functionality, even when the SUT has an internal state, automated checking uses a set-up/tear-down approach. This means the state of the SUT is restored before each test, so that a potential failure can be attributed to a single test case. But as the state is repeatedly restored, the SUT will never be subject to more than a few interactions without resets in between. This makes it impossible to trigger more delicate failures, related to longer execution sequences. For that, a stream of interactions has to be applied to the SUT. But then again, it will be difficult to determine exactly which interaction(s) caused the failure.
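The set-up/tear-down approach described above can be sketched with Python's unittest package. The SUT here, FakePlayer, is a made-up stand-in and not a system from the thesis; the point is that setUp restores the state before every test, so no test ever observes a longer interaction sequence:

```python
import unittest

class FakePlayer:
    """A hypothetical SUT: a music player with a tiny internal state."""
    def __init__(self):
        self.state = "stopped"

    def press(self, button):
        if button == "play":
            self.state = "playing"
        elif button == "stop":
            self.state = "stopped"

class MusicPlayerChecks(unittest.TestCase):
    # setUp restores the SUT before every test, so each check sees
    # at most one interaction; longer sequences are never exercised.
    def setUp(self):
        self.sut = FakePlayer()

    def test_play(self):
        self.sut.press("play")
        self.assertEqual(self.sut.state, "playing")

    def test_stop(self):
        self.sut.press("stop")
        self.assertEqual(self.sut.state, "stopped")
```

Each test passes in isolation, yet a failure triggered only by, say, pressing "play" twice in a row would never be reached with this style of checking.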

There are automated tools which try to address this problem by applying model-based testing (MBT) [18, 32, 31, 25, 10]. However, they have not yet been widely adopted within the industry. The reasons could be many, but one seems to be the amount of effort which needs to be invested in creating and maintaining models needed by the tools [5]. These models describe the SUT and are needed in order to create tests.

This thesis will investigate and evaluate different algorithms and methods to (1) provoke failures in a system and (2) perform post-processing of the results in order to make it easier for developers to locate and fix the problems. In order to avoid committing to a SUT-specific interface, a concept of actions has been adopted to represent interactions with the SUT. An action can thus be seen as a command from another system which tells the SUT to execute one or several specific subroutines.

The concept for the implementations described in this thesis, as well as the problem formulation, originates from the company Zenterio AB. Personnel from Zenterio have also been responsible for formulating additional requirements on the implementation, such as the programming language used and the modular design, which simplifies future extensions and makes it possible to integrate the implementation with existing tools used at the company. In addition, the Zenterio personnel have provided knowledge from their domain, user-interactive real-time systems, which has been used to conduct approximations and plausibility calculations.

1.2 Problem formulation

The problem for this thesis can be divided into two parts where an answer to the second part serves as a refinement to the first in order to increase its usability.

(1) Identify a sequence of actions which causes a failure in the SUT.

input: A connection to the SUT.

input: A strategy to trigger failures in the SUT.

output: A trace of actions which, if applied on the SUT, causes a failure.

(2) Isolate the action(s) responsible for the failure in an output trace from problem (1).

input: A connection to the SUT.

input: A trace of actions which caused a failure in the SUT, later referred to as an error trace.

input: A strategy to identify which actions in the error trace were responsible for the failure.

output: A shortened error trace which still causes a failure in the SUT.

The solution for these two problems is intended to provide a method which automatically finds problems in a system and returns a manageable trace of actions which can be used by the developers to recreate the failure.

1.3 Literature

Databases, such as the IEEE Xplore Digital Library and the ACM Digital Library, were used to identify algorithms suitable to address the problems formulated above. When a wider perspective was needed, Google Scholar was also used. The most promising algorithms were expected to be found in articles related to model-based testing, automated exploratory testing, coverage criteria, online testing and exploration testing, so the research was focused on these areas. The result was several interesting candidates, which are presented in the Implementation chapter.

Regarding source criticism, the main approach has been to use Google Scholar to see how frequently an article is cited. If no references to a reasonably old article could be found, its content should probably be doubted. This method could not be applied in those cases where plain technical information was collected from organisations' web pages, but in those cases the need for serious source criticism was also limited.

1.4 Method

In this thesis, different algorithms were examined, with regard to their ability to find faults related to sequences of interactions, see problem (1). The idea was to identify automated approaches which could be applied without investing a lot of resources in model creation or similar preparations.

Since a set-up/tear-down approach would not be possible, with regard to the kind of faults we were looking for, some other means of identifying which interactions caused which failure had to be applied. One part of this thesis was dedicated to discussing this issue, by evaluating different methods to reduce/shorten error traces, problem (2).

In order to identify suitable methods and algorithms, articles, reports, and proceedings were examined. This search resulted in several interesting candidates which were then adapted to the framework used in this thesis and implemented in the programming language Python.

Instead of trying the different algorithms on a real system, mock-ups were created. The main advantage with this approach was a more controlled comparison. It also allowed for faster development and testing since time could be simulated instead of endured. The greatest problem was to make the mock-ups realistic enough so that the result of the comparisons could be trustworthy. This threat was mitigated by consulting people at the company Zenterio. By using their expertise, multiple realistic failure categories could be produced. These were later used to derive mock-ups from. All failures simulated by the mock-ups were triggered by sequences of interactions.

To assess the different algorithms, they were executed on the mock-up while data was collected for statistical analysis. This process was repeated several times in order to get a more stable and fair comparison since some of the algorithms used stochastic processes.

Finally, the statistics were examined and processed together with other properties of the algorithms such as the amount of simulated time needed or the complexity of the initialization data required to launch the algorithms.

1.5 Restrictions

In this section, the restrictions of the thesis are presented.

1.5.1 Mock-ups

This thesis will not try to apply the implemented algorithms on a real system. Instead, several mock-ups were created to simulate different failure-prone behaviours. The main reason for this was time. If each action took several minutes to execute, one would not be able to perform enough tests for an interesting analysis. With mock-ups, the time could be simulated instead of endured. Another reason to use mock-ups was to make it easier to perform controlled comparisons of the algorithms. A third reason why a real system was not used originated from difficulties with locating a representative system. Different kinds of systems tend to display different kinds of failures. If only one were used, that could favour or disfavour some of the algorithms. If multiple mock-ups were used instead, these could simulate different categories of failures so that a fair comparison of the algorithms could be performed.

1.5.2 Actions applicability

When available actions appear in discussions and reasoning within this thesis, it is presumed that all of them can always be applied. This is different from the usual set-up in MBT, where the state of the SUT decides which actions can be applied next. The reason for this restriction is to reduce the scope of the thesis, but also that it is possible to create actions so that all of them are always applicable. Such a scenario would require high-level actions which, for example, include menu traversal. The level of the actions is also something that is omitted from the discussion in this thesis.

1.6 Structure

The rest of the thesis is structured as follows. Chapter 2 establishes a theoretical background useful for the later chapters. In Chapter 3, projects similar to this will be discussed. The implementation of the algorithms, and how they are structured, is described in Chapter 4. Chapter 5 is dedicated to the gathering of statistics from executions of the algorithms. In Chapter 6 the result is discussed, future work is presented in Chapter 7, and the thesis ends with conclusions in Chapter 8.

2 Theory

This chapter establishes a theoretical background by explaining some of the more complicated algorithms used later in this thesis.

2.1 All pairs testing

The all pairs strategy, or combinatorial testing, is a commonly used technique to reduce the number of argument permutations in unit testing [13]. It has also been applied as a coverage criterion within MBT [32]. Equivalence classes are normally used together with the all pairs approach to deal with an infinite test space. Instead of trying all permutations of arguments, the all pairs strategy constructs test cases so that each argument's equivalence classes have been tried together with all other classes in at least one test case. This heavily reduces the number of test cases needed, while still making sure that all classes have been tested together with all other classes.

In this thesis, we will try to make use of this method in order to identify failures in the SUT. But instead of equivalence classes, we are going to use interactions with the SUT. Also, instead of just considering all pairs, we are going to include all triplets, all quadruples and so on, see Section 4.2.4.
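The combinatorial core of this idea can be sketched in a few lines of Python; the action names below are made up for illustration, and the actual strategy built on this idea is the one described in Section 4.2.4:

```python
from itertools import combinations

# Hypothetical action alphabet for an illustrative SUT.
actions = ["play", "stop", "rewind", "record"]

def all_groups(actions, size):
    """All groups of `size` distinct actions: pairs, triplets, and so on."""
    return list(combinations(actions, size))

pairs = all_groups(actions, 2)     # C(4, 2) = 6 groups to try
triplets = all_groups(actions, 3)  # C(4, 3) = 4 groups to try
```

Testing all pairs first, then all triplets, and so on, grows the number of groups far more slowly than enumerating every permutation of the full action set.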

2.2 De Bruijn sequence

The De Bruijn sequence, named after the Dutch mathematician Nicolaas Govert de Bruijn [8], is defined as a sequence of letters, in which each possible subsequence of a specified length only occurs once.

A De Bruijn sequence B(k, n) is defined as a cyclic sequence over an alphabet A of size k. The sequence must be constructed in such a way that every possible sequence of length n over the alphabet A occurs as a subsequence of it exactly once. This definition gives B(k, n) a length of k^n. The cyclic property of the De Bruijn sequence means that a string based on a B(k, n) can cover all letter permutations of length n with just k^n + (n − 1) letters. This is approximately n times better than the naive brute-force strategy, where n·k^n letters are needed. This is the reason why the De Bruijn sequence has been suggested for automated test generation [22].

In this thesis, we will use De Bruijn sequences to choose which interactions to perform on the SUT in order to cover all combinations with as few interactions as possible, see Section 4.2.5.
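The construction can be sketched in Python using the standard recursive (Lyndon-word-based) algorithm; the action names are illustrative, and this is not claimed to be the thesis implementation:

```python
def de_bruijn(alphabet, n):
    """Construct a De Bruijn sequence B(k, n) over `alphabet` (k = len(alphabet))."""
    k = len(alphabet)
    a = [0] * k * n
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return [alphabet[i] for i in sequence]

# Two actions, subsequences of length 3: the cyclic sequence has k^n = 8 letters.
seq = de_bruijn(["up", "down"], 3)

# Covering every length-3 window in a linear string needs k^n + (n - 1) letters,
# obtained by wrapping the first n - 1 letters around.
linear = seq + seq[:2]
windows = {tuple(linear[i:i + 3]) for i in range(len(linear) - 2)}
```

For k = 2 and n = 3 the linear string has 8 + 2 = 10 letters and covers all 2^3 = 8 possible length-3 windows, versus 3 · 8 = 24 letters for the naive strategy.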

2.3 L* algorithm

This algorithm, described in [1], can be used to construct a DFA (Deterministic Finite Automaton) M over an unknown regular set U. The L* algorithm achieves this by asking only two types of questions to a teacher.

A DFA can be described as a 5-tuple (Q, Σ, δ, q0, F) where:

Q is a finite set of states.

Σ is a finite set of possible input symbols, called the alphabet.

δ is a transition function from Q × Σ to Q, i.e. δ maps each combination of a state q ∈ Q and a symbol σ ∈ Σ to a state q_dst ∈ Q.

q0 is the starting state, q0 ∈ Q.

F is a finite set of accepting states, F ⊂ Q.

The teacher used by L* must be able to answer two types of questions: membership queries and equivalence queries. A membership query answers whether a string s over the alphabet Σ belongs to the regular set U. An equivalence query answers whether a proposed automaton M_p is equivalent to the regular set U. If not, a counterexample is returned. A counterexample is a string over the input alphabet Σ which is either accepted by M_p but not in U, or vice versa, i.e. an example of a string on which M_p and U give different results. By asking membership queries and equivalence queries, L* is able to construct a DFA for any regular set U.

In this thesis, the L* algorithm will be used to learn a DFA describing the SUT, in order to identify short interaction sequences which trigger failures in the SUT, see Section 4.2.6.
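The teacher's contract can be sketched as a small Python interface. Everything here is illustrative (the class and method names are not from the thesis, and the target language U is an arbitrary example); the point is the shape of the two queries L* relies on:

```python
from abc import ABC, abstractmethod

class Teacher(ABC):
    """The two queries L* needs (a sketch; names are assumptions)."""

    @abstractmethod
    def member(self, word):
        """Membership query: is `word` (a sequence over the alphabet) in U?"""

    @abstractmethod
    def equivalent(self, hypothesis):
        """Equivalence query: return None if `hypothesis` (a callable word
        acceptor) matches U, otherwise a counterexample word."""

class SUTTeacher(Teacher):
    # Illustrative target language U: words containing an even number of "a"s.
    def member(self, word):
        return word.count("a") % 2 == 0

    def equivalent(self, hypothesis):
        # In practice an equivalence query can only be approximated, e.g. by
        # sampling words; Section 4.2.6 approximates it by walking a model
        # with an MBT tool. Here we just probe a fixed set of words.
        for word in (["a"], ["a", "a"], ["a", "b", "a"]):
            if hypothesis(word) != self.member(word):
                return word
        return None
```

A hypothesis that accepts everything is immediately refuted with the counterexample ["a"], while a hypothesis matching U passes the (approximate) equivalence check.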

2.4 Model Based Testing

Model-based testing (MBT) is a technique used to automatically test a computer system and search for bugs. It is most commonly used for system testing, but has also been applied in other forms of testing [5]. There exist several variations of MBT, but one thing they all have in common is the model. The model describes how the reality, i.e. the SUT, works. From this description, the MBT tool can deduce test cases and execute them on the SUT. According to [13], these are the five basic steps of all MBTs:

1. Create a model which describes the SUT
2. Identify threats in the model
3. Create test cases which address the identified threats
4. Execute the test cases on the SUT and collect the results
5. Update the model and go to step 2

When we talk about MBT, we have to distinguish between online and offline testing. Offline testing is perhaps the most common scenario. Here, the test cases are generated from the model and stored, to later be executed on the SUT. The benefit of this solution is that the MBT tool can be completely separated from the SUT. With online testing, however, the MBT tool only calculates one step of a test case before automatically applying it to the SUT. The drawback is that an online MBT tool needs to be more aware of the SUT. The advantage is that an online MBT tool can deal with stochastic behaviours in the SUT if it makes use of feedback from earlier executions to decide the next step in the test session. It is also more suited for running long test sessions, since the entire session does not need to be pre-calculated [18, 7].

Since the model is the most important part when an MBT tool creates test cases, how the model is represented is very important. If the representation is not expressive enough, the tester cannot create a model which fully describes the SUT and, thus, no good test cases can be generated. However, it is also important that the model representation is not too complicated, since the modelling often needs to be performed by a human tester. An overly complicated modelling language means that the model itself needs to be tested as much as the SUT in order to make sure that it really describes the SUT [5, 21]. In any case, modelling is a costly process, and this is probably one reason why MBT has not yet been widely adopted by the industry [5].

2.4.1 GraphWalker

In this thesis, we are going to use an MBT tool called GraphWalker (GW) in order to answer equivalence queries from the L* algorithm described above. More details about how these two algorithms are combined will be discussed later, in Section 4.2.6.

GraphWalker [20], later referred to as GW, is an MBT tool implemented in Java. It provides a websocket interface, however, which enables easy interaction from other programming domains as well. It is distributed under the MIT license, is categorized as open source, see [29], and has been used in the industry, e.g. at Spotify. Both online and offline test generation are supported.

GW uses a model represented as a directed graph, containing vertices and edges. Each edge can be associated with actions and guards. The actions are basically simple Java statements which modify global variables when their corresponding edge is traversed. The guards, on the other hand, are boolean expressions which depend on the variables controlled by the actions. If a guard condition for a given edge evaluates to false, that edge will not be considered when GW chooses the next edge to traverse.
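These edge semantics can be sketched in Python. This is an illustration of the guard/action mechanism only, not GraphWalker's actual API; the model (a login graph) and all names are made up:

```python
# Shared model variables, mutated by edge actions.
model_vars = {"logged_in": False}

# Each edge carries a guard (boolean predicate over the variables) and an
# action (a statement run when the edge is traversed).
edges = [
    {"name": "login", "src": "Start", "dst": "Home",
     "guard": lambda v: not v["logged_in"],
     "action": lambda v: v.update(logged_in=True)},
    {"name": "logout", "src": "Home", "dst": "Start",
     "guard": lambda v: v["logged_in"],
     "action": lambda v: v.update(logged_in=False)},
]

def eligible(edges, current_vertex, variables):
    """Edges leaving `current_vertex` whose guard evaluates to true."""
    return [e for e in edges
            if e["src"] == current_vertex and e["guard"](variables)]

def traverse(edge, variables):
    """Run the edge's action and move to its destination vertex."""
    edge["action"](variables)
    return edge["dst"]
```

A path generator would repeatedly pick one of the eligible edges (randomly, or via a shortest-path search) and traverse it, which is the loop the table below parameterizes.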

The way in which GW chooses edges and vertices to traverse is decided by a path generator. The version of GW considered in this thesis, i.e. version 3.2.1, provides three different implementations of path generators, which are described in Table 2.1. In addition to this, six different stop conditions can be chosen from, and combined, in order to determine the length of each test session, see Table 2.2. GW can also be configured to run a sequence of path generators with different stop conditions. When one generator reaches its stop condition, the next generator in the sequence is started from the current state of the model. The test session is finished only when the last generator has reached its stop condition.

Table 2.1: The path generators available in GW

random: Chooses a random edge which leaves the current vertex.
quick random: Uses Dijkstra's algorithm to search for a path to the least visited edge, executes that path, and starts over by finding a path to a new edge.
a star: Uses the A* algorithm to find the shortest path from the current vertex to the edge or vertex specified in the stop condition.

Table 2.2: The stop conditions available in GW

edge coverage: Stops the testing session when a specified percentage of all edges have been visited.
vertex coverage: Stops the testing session when a specified percentage of all vertices have been visited.
reached vertex: Stops the testing session when a specified vertex has been reached.
reached edge: Stops the testing session when a specified edge has been reached.
time duration: Stops the testing session when a specified amount of time has elapsed.

2.5 ddmin algorithm

The ddmin, or Minimizing Delta Debugging, algorithm minimizes sequences by testing subsequences to see if they are enough to fulfil some requirement, see [33]. This is done with an advanced form of binary search. Unlike ordinary binary search, ddmin alters the number of partitions in each iteration in order to reduce a sequence as fast as possible. Another trick used by ddmin is that, in each iteration, it tries every partition ∆_i, but also their complements ∇_i. In this way, ddmin tries both a fast and a slower but more accurate reduction method in each iteration.

As described in Algorithm 1, the core of ddmin is a recursive algorithm which takes a sequence c_x and a value n. It is initially called with the entire sequence to be minimized and n = 2. In each iteration, c_x is partitioned into n subsets, here called ∆_1, ..., ∆_n. E.g. if c_x has a length of 10 and n = 3, this would mean that ∆_1 includes the first four elements of c_x, ∆_2 consists of the next four, and ∆_3 consists of the last two elements. After the partitioning, each subset is tested to see if the requirement is fulfilled. If so, the algorithm calls itself with the subset that caused the failure and n = 2. This procedure is called reduce to subset. If none of the subsets caused a failure, the subsets' complements ∇_1, ..., ∇_n, where ∇_i = c_x − ∆_i, are tried instead. If the requirement is fulfilled now, the algorithm will reduce to complement by calling itself with the complement that caused the failure and n = max(n − 1, 2). In this way, the subsets will be the same in the next iteration, so every subset will eventually be tested. The only exception is when the decrease of n would lead to a single subset in the next iteration. Such an iteration would be unable to reduce c_x, and hence this scenario is avoided with max(n − 1, 2). If neither the subsets nor their complements are able to fulfil the requirement, and n < |c_x|, the number of partitions is increased in order to reduce the sequence further. This is achieved with a recursive call with n = n ∗ 2 and c_x unchanged. If the algorithm did not reduce to subset, reduce to complement, or increase the number of partitions, the current c_x is a minimal sequence with regard to the requirement, and can be returned.

In this thesis, the ddmin algorithm will be used to reduce sequences of interactions with a SUT, in order to determine which ones caused a failure, see Section 4.3.3.

Algorithm 1 DeltaDebugger Algorithm

function delta(c_x)
    return delta2(c_x, 2)
end function

function delta2(c'_x, n)
    for all subsets ∆_i in c'_x do
        if test(∆_i) = failure then
            return delta2(∆_i, 2)                ▷ Reduce to subset
        end if
    end for
    for all complements ∇_i in c'_x do
        if test(∇_i) = failure then
            return delta2(∇_i, max(n − 1, 2))    ▷ Reduce to complement
        end if
    end for
    if n < |c'_x| then
        return delta2(c'_x, n ∗ 2)               ▷ Increase partitions
    end if
    return c'_x                                  ▷ End recursion
end function
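Algorithm 1 can be turned into runnable Python as below. This is a sketch, not the thesis implementation: the failure predicate is made up, and the partitioning uses chunks of roughly equal size (so an iteration may produce slightly fewer than n subsets), while the loop mirrors the reduce-to-subset, reduce-to-complement and increase-partitions cases:

```python
def ddmin(test, cx):
    """Minimize `cx` while `test` still reports the failure.
    `test(seq)` returns True when `seq` triggers the failure."""
    n = 2
    while len(cx) >= 2:
        size = -(-len(cx) // n)  # ceiling division: chunk length
        deltas = [cx[i:i + size] for i in range(0, len(cx), size)]
        reduced = False
        for delta in deltas:                      # reduce to subset
            if test(delta):
                cx, n, reduced = delta, 2, True
                break
        if not reduced:
            for i in range(len(deltas)):          # reduce to complement
                nabla = [x for j, d in enumerate(deltas) if j != i for x in d]
                if test(nabla):
                    cx, n, reduced = nabla, max(n - 1, 2), True
                    break
        if not reduced:
            if n < len(cx):                       # increase partitions
                n = min(len(cx), n * 2)
            else:
                break                             # cx is minimal: stop
    return cx

# Hypothetical requirement: the failure occurs whenever both action 3 and
# action 6 appear somewhere in the trace.
def failing(trace):
    return 3 in trace and 6 in trace

minimal = ddmin(failing, list(range(10)))  # reduces the 10-step trace to [3, 6]
```

The result is 1-minimal in the sense of [33]: removing any single remaining element makes the failure disappear.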

3 Related Work

The field of automated software testing has been of great interest to researchers throughout the last decades, which in turn has produced a lot of different approaches. The most frequently used so far are, however, different types of scripted testing, i.e. manually written test cases which are executed one after another by a test engine. This type of test automation has become so common that support is often provided directly in the programming language used. Examples of this are the unittest package in Python [6] and the JUnit package in Java [3].

During recent years, the research focus has come to concentrate on approaches which automate not only the execution of tests, but also their creation. This has resulted in several scientific branches, of which MBT is perhaps the most flourishing.

Even though there are some major differences between the work presented in this thesis and MBT tools, the most apparent being that models are not explicitly used here, there are also similarities, both in the high-level objective, to identify failures, and in the algorithms used. Many of the algorithms evaluated in this thesis are inspired by, or have been used directly in, model-based testers. One of the tools used for inspiration is SpecExplorer [17, 28, 29], which performs online testing on an FSM-like model and can be configured to use different strategies to locate errors. Another source of inspiration is T-UPAAL [18], which performs online testing with an algorithm based on random choice. Unlike the algorithms in this thesis, T-UPAAL has special support to deal with time-dependent and non-deterministic SUTs. TorX [31] is yet another MBT tool, which is able to focus its search on specific parts of the model and uses heuristics to suppress the length of the action trace during online testing. Finally, PyModel [11, 12] can be configured to use different strategies to identify failures, such as StateCoverage or ActionNameCoverage.

There are also a lot of articles comparing different MBT tools, and taxonomies where different metrics to compare MBT tools are discussed, e.g. [5, 28, 24, 32, 29]. The metrics used in those cases, however, have been hard to apply in this thesis, since they are often customized to comparisons of MBT tools, e.g. comparing different modelling languages with different possibilities to express relations within the model.

MBT has, however, also given rise to several subcategories which address the problem of finding failures in a SUT in ways similar to this thesis. One such subgroup is model-based behavioural fuzzing [27, 26], which tries to identify failures by only knowing which interactions with the SUT are possible. Unlike normal fuzzing, behavioural fuzzing does not alter the data of each interaction, but the order in which they occur [26], which is very similar to what the algorithms in this thesis are doing. A big difference, however, is that behavioural fuzzers do try to break the SUT by intentionally ignoring the rules describing the order in which interactions should be performed. Another difference is that fuzzing, in general, is used to reveal security issues and not ordinary failures, even though it is often hard to separate one from the other.

Yet another subgroup of MBT is presented in [10] and called "Exploration testing". The difference compared to "normal" MBT is, according to the authors, that the model describes the behaviour of the SUT, i.e. how one can interact with it. Some concrete implementations have also been suggested, such as [14], which describes a test engine as well as some heuristics and algorithms for testing, and [19], which compares different algorithms and coverage criteria applied on the test engine TEMA, which can be used to test GUIs in phone applications.

Some attempts to partially or entirely automate the otherwise manual process of Exploratory Testing have also been made [25, 9]. These attempts are, however, harder to connect directly to this thesis.

The idea of reducing long error traces, especially from online model-based testing, is not new either; it is proposed in [26, 7, 33]. This functionality has, however, not been found in any of the model-based testers investigated for this thesis.


4 Implementation

In this chapter, the implementation of the different algorithms is explained in more detail.

The goal of the implementation is to allow different algorithms to be tested on SUTs with different failure behaviours. Even though this thesis restricts itself to testing the algorithms against mock-ups, the implementation itself is not subject to the same restriction. The implementation consists of three modules: the Exploration module, the Reduction module and the Adapter module. Each module provides an interface which is implemented by strategies. For convenience, the strategies are grouped and referred to as explorers, reducers or adapters, depending on which interface they implement. Each strategy uses one or several algorithms which decide the behaviour of the strategy.

To reduce the coupling among the modules, and between the modules and the SUT, a concept of high-level actions is used. Instead of forcing each strategy to keep track of how different interactions are performed with different SUTs, the strategies are initialized with a set of actions, which in turn represent possible interactions. With this approach, no customization is needed when the targeted SUT or a strategy is replaced. Another benefit is that the granularity of the interactions can easily be varied between different SUTs. For example, in one SUT each action might represent a single click in the interface. In another, each action can represent the execution of an entire feature, such as uploading a file or navigating to and starting the music player. The possibility to vary the granularity is useful when the strategies are applied on a real SUT. In such a scenario, it is important to reduce the amount of work needed for adapting the strategies. With the action concept it would be possible to let the actions correspond to, e.g., already implemented feature tests.

Below, the purpose of each module is described in more detail.

Adapter module

This module translates actions into tests which are compatible with the SUT. It is also responsible for executing these tests on the SUT and for making the results of the tests available to other modules. Finally, the adapter module must also be able to check if the SUT is still working as it should.

Exploration module

This module is responsible for generating a stream of actions to be tested. It interacts with the adapter module in order to apply actions and receive feedback.


Reduction module

The strategies in the reduction module take an error trace which they try to reduce as much as possible. The methods used differ between the strategies, but the high-level approach is to construct subtraces which are applied on an adapter to see if the failure remains. Thus, the reduction module interacts with the adapter module.

Figure 4.1: This figure shows how the modules interact with each other.

4.1 Adapter Module

Figure 4.2: The AdapterInterface.

All adapter strategies have to implement the AdapterInterface described in Figure 4.2. To let different explorers and reducers vary how long delays are handled, the interface for applying actions on the SUT is asynchronous. This also makes it better suited for using a remote adapter and communicating with it over a network. In order to receive feedback on executed actions, methods for fetching the next feedback object, or the total count of available feedback objects, are also provided.

The feedback contains three fields, which the adapter has to calculate the data for. The fields are:


• The action that the feedback corresponds to.
• The cost to perform that action on the SUT.
• The result of the action.

The result field can take one of three values, depending on how the execution of the action proceeded:

success: Everything went well.

failure: A failure occurred during the execution of the action.

unknown: The SUT did not provide enough information for the adapter to determine success or failure.

The unknown result is used to delegate the interpretation of ambiguous results to the explorers and the reducers, instead of forcing the adapter to guess. The cost in the feedback corresponds to the amount of resources needed to execute the action; it could for example be measured in time or memory. The cost can be used as a means to optimize the explorers' and the reducers' solutions. The reason why the adapter, rather than the caller, measures time (if that is chosen as cost) is the asynchronous nature of the adapter interface. This also becomes very handy when the mock-ups are implemented, since different actions' execution times can then be simulated instead of endured.

Besides the result in the feedback, a special method for checking the state of the SUT is provided by the AdapterInterface. The purpose of this function is to allow other modules to poll the SUT and see if it is still alive. One application of this could be to get information during execution of slow actions. The different statuses are alive, dead, and unknown. The unknown status serves the same purpose as the unknown result in the feedback objects.

To make it possible for other modules to reset the state of the SUT, this functionality is also provided through the AdapterInterface. An example of such a situation could be when an explorer has found an error trace and wants to restart the exploration in order to find more failures.
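The interface can be sketched in Python as follows. The method and field names are assumptions made for this sketch; only the semantics described above (an asynchronous apply, feedback objects with action, cost and result, a status poll, and a reset) are taken from the text:

```python
from abc import ABC, abstractmethod
from collections import namedtuple

# Field names mirror the three feedback fields described above;
# the method names below are illustrative assumptions.
Feedback = namedtuple("Feedback", ["action", "cost", "result"])

class AdapterInterface(ABC):
    @abstractmethod
    def apply(self, action):
        """Asynchronously schedule the execution of `action` on the SUT."""

    @abstractmethod
    def feedback_count(self):
        """Total number of feedback objects produced so far."""

    @abstractmethod
    def next_feedback(self):
        """Fetch the next Feedback; result is 'success', 'failure' or 'unknown'."""

    @abstractmethod
    def status(self):
        """Poll the SUT: 'alive', 'dead' or 'unknown'."""

    @abstractmethod
    def reset(self):
        """Restore the SUT to its initial state."""
```

Both the strategies and the mock-ups described below program against this interface, which is what makes them interchangeable.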

4.1.1 Mock-ups

In order to test the explorers and the reducers without a real SUT, mock-ups were used to simulate failure-prone behaviour. To make the simulations realistic enough to base conclusions on, each mock-up was designed with regard to a specific category of failures. These categories were selected in cooperation with personnel from the test automation infrastructure group at the company Zenterio, Linköping. Their experience with failures, from development of commercial systems, played a key role in our discussions. The result became three distinct failure categories and six different mock-ups. The mock-ups, which are described in more detail in the following sections, were designed to simulate problems related to one of the three categories. In this way the mock-ups would yield realistic failures and hence provide a good foundation for evaluating how the explorers and the reducers would perform if they were applied on a real SUT. The selected failure categories are described below.

State Failures

The state failures category was created to incorporate failures related to the system's state. In general, this means failures caused by the execution of specific actions in a specific order. An example could be a sequence of actions which brings the system into a dangerous state, where the crash will occur as soon as the wrong action is applied. There might exist traces which make the system leave the dangerous state, but it can also be the case that the failure is triggered many actions after the dangerous state was entered. In this scenario the system's behaviour will appear random and unpredictable, which makes the failure very hard to find. It is, however, worth noticing that the behaviour is not random, but very predictable if one can only identify the key sequences. The state failures category is simulated by the mock-ups (m1) and (m2).

Resource Failures

The resource failures category was created for failures related to resource requirements, e.g. memory or available space in internal buffers. When these resources are not returned properly, a failure will eventually occur. One example of a resource failure scenario is where most actions consume different kinds of resources, which means that only a few actions actually affect each other. Another scenario is where all actions compete for the same resource. A third scenario could be that some actions release resources again, which makes the procedure of isolating the failures even more complicated. These three failure scenarios are simulated by the mock-ups (m3), (m4) and (m5).

Timing Failures

Failures belonging to the third category, timing failures, are those that can be traced to timing issues or other stochastic behaviours. It could for example be failures caused by race conditions: when multiple threads compete for the same resources, failures may occur, but not always. Another scenario is when timer interrupts are fired while a specific action or action trace is executed. Such coincidences can easily cause a failure. These kinds of scenarios are simulated by the (m6) mock-up.

In order to perform multiple tests for each failure category, the mock-ups were made configurable. This allowed them to be instantiated with different configurations and hence yield failures after different action traces. To make it easy to switch between mock-ups, they were all made to implement a common interface. The interface chosen was the AdapterInterface, since this also made the mock-ups compatible with the explorers and the reducers without any additional adaptation. The following sections describe the six different mock-ups implemented in this thesis. They are referred to as (m1) - (m6). It would have been possible to replace the first five mock-ups with a single mock-up configured with regular expressions, but instead (m1) - (m5) were implemented separately in order to make the differences between the cases more obvious.

FailureSequenceMockup (m1)

The FailureSequenceMockup mock-up, later referred to as (m1), starts yielding failures after a certain coherent sequence of actions. It can be configured with one or more of these sequences to provide more flexibility. As soon as a fault sequence has been discovered, the status of the mock-up is switched to dead and all subsequent actions, including the current one, will produce feedback with result failure.

Example: (m1) is configured with AB and fed with the action trace BABC. All actions from the second B and forth will produce feedback with result failure, and the status of the mock-up will be dead.
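A minimal Python sketch of (m1), under the simplifying assumptions that actions are single characters and that apply() returns the feedback result synchronously:

```python
# Sketch of (m1); class shape and synchronous apply() are assumptions.
class FailureSequenceMockup:
    def __init__(self, failure_sequences):
        self.failure_sequences = [tuple(s) for s in failure_sequences]
        self.trace = []
        self.dead = False

    def apply(self, action):
        self.trace.append(action)
        if not self.dead:
            t = tuple(self.trace)
            # Dead as soon as any configured sequence occurs in the trace.
            self.dead = any(
                t[i:i + len(seq)] == seq
                for seq in self.failure_sequences
                for i in range(len(t) - len(seq) + 1)
            )
        return "failure" if self.dead else "success"

    def status(self):
        return "dead" if self.dead else "alive"
```

Replaying the example above, the trace BABC produces success, success, failure, failure, and leaves the mock-up dead.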

SequenceDependenciesMockup (m2)

The SequenceDependenciesMockup is a more advanced version of (m1). It starts yielding failures when a sequence Y_ki appears any time after a sequence X_k. In addition, (m2) can be configured with a third sequence Z_ki. If this sequence appears after X_k, Y_ki will no longer cause the mock-up to start yielding failures, at least not until another X_k appears. Each instance of (m2) can be configured with multiple X_k, together with lists containing the corresponding Y_ki and Z_ki sequences. As soon as a failure has been yielded, all subsequent feedbacks and status requests will return failure and dead, respectively.

Example: (m2) is configured with X_1 = AAB, Y_11 = BBA and Z_11 = C. When the action trace AABBA is fed to the adapter, nothing will happen, because X_1 and Y_11 overlap. Neither will the trace AABCBBA cause the mock-up to start yielding failures, because the Z_11 did a reset. However, the action trace CAABDBBAD will cause (m2) to start yielding failures from the third A and forth.

TraceLengthMockup (m3)

The (m3) will start yielding failures when the length of the applied action trace has reached a certain threshold T. This threshold is the only configurable parameter of (m3). When it has been exceeded, all subsequent actions will produce feedback with result failure and the status of the mock-up will be dead.


Example: (m3) is configured with a threshold T = 2. This means the trace ABCD will cause the (m3) to start yielding failures from the C action and forth.

MaxActionCountMockup (m4)

The MaxActionCountMockup is quite similar to (m3), but instead of the total trace length, (m4) considers the counts of individual actions. This means it has to be configured with a list of max action counts C_a. When C_a occurrences of some action a have been recorded, all subsequent actions will result in feedback with result failure and the mock-up's state will be dead.

Example: (m4) is configured with C_A = 2 and C_B = 3. This means the action trace BACBBCA will cause (m4) to yield failures from the third B action and forth.

AccumulationMockup (m5)

The AccumulationMockup is configured with a cost C_a for each action and a max accumulation value M. The (m5) starts yielding failures when the sum of all action costs in the applied action trace is greater than M. When a failure has been yielded, all subsequent actions will also result in feedback with result failure and the mock-up's status will be dead. The (m5) can also be configured with reset actions R. When any action in the R set appears in the trace, the accumulated sum is set to zero.

Example: (m5) is configured with C_A = 2, C_B = -1, M = 3 and R = {D}. When the action trace ABADAAD is applied, failures will be yielded for all actions from the fourth A action and forth.
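A minimal Python sketch of (m5), again assuming single-character actions and a synchronous result value (whether a reset action also carries a cost is not specified; here it does not):

```python
# Sketch of (m5); class shape and synchronous apply() are assumptions.
class AccumulationMockup:
    def __init__(self, costs, max_value, reset_actions=()):
        self.costs = costs                        # C_a per action
        self.max_value = max_value                # M
        self.reset_actions = set(reset_actions)   # R
        self.accumulated = 0
        self.dead = False

    def apply(self, action):
        if not self.dead:
            if action in self.reset_actions:
                self.accumulated = 0              # reset action clears the sum
            else:
                self.accumulated += self.costs.get(action, 0)
                self.dead = self.accumulated > self.max_value
        return "failure" if self.dead else "success"
```

Replaying the example above, the trace ABADAAD yields success for the first five actions and failure from the fourth A onwards.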

StochasticMockup (m6)

The StochasticMockup is a non-deterministic mock-up. It is configured with different chances C_s for sequences of actions s. When a sequence s occurs in the action trace applied on the mock-up, a failure is yielded with chance C_s. When a failure has been yielded, all subsequent actions will result in feedback with result failure and the status of the mock-up will be dead.

Example: (m6) is configured with C_A = 0.2 and C_AA = 0.5. This means that the trace AAA will cause a failure with probability 0.2 after the first A, 0.5 after the second and 0.5 again after the third. The total chance for the trace to make (m6) start yielding failures is thus 1 − (1 − 0.2)(1 − 0.5)(1 − 0.5) = 0.8.

4.2 Exploration Module

All the strategies in the exploration module have to implement the ExplorationInterface, see Figure 4.3. This makes it possible to use the same driver for all the strategies when statistics on the explorers' performance are collected in chapter 5. The ExplorationInterface consists of one single method, explore(). A call to this method will cause the strategy to explore the provided adapter in search of failures. When a failure is discovered, the trace of actions applied on the adapter is returned as an error trace. Apart from the adapter argument to the explore() method, other initiation data can be fed to the explorer's constructor. The only data that all explorers require is the set of available actions. Beyond this, some explorers can make use of additional arguments. This will later be one aspect from which the explorers are evaluated: the extraction of the extra information needed by a certain explorer could be very expensive, which would reduce the explorer's cost effectiveness. On the other hand, the more information an algorithm has available, the better it generally performs. These two aspects must be weighed against each other in order to determine how applicable the explorer would be on a real SUT.

The exploration algorithms considered in this thesis have been selected and constructed with the goal of keeping the effort needed to apply them as low as possible. As described earlier, there do exist methods which perform exploration of a SUT, e.g. MBT, with good results. But since these methods require a lot of effort to get started, e.g. modelling of the SUT, and also to maintain the test framework, they have not been widely adopted yet. The algorithms in this thesis are intended to try a different path: perhaps sacrifice some performance, but increase the usability and lower the threshold for starting to use automated exploratory testing.

In order to find the right balance between the amount of initialization data, fault-finding performance and usability, different exploration strategies, with different properties, have been implemented. Below follow detailed descriptions of these six explorers.

When reading about them, it should be obvious that several of the strategies could have been optimized and tweaked to a much greater extent than what has been done here. This has, however, been an informed decision, since the comparison of the strategies in chapter 5 would have been misleading if a lot of time had been put into optimizing one of the strategies. The alternative, to spend time optimizing all of them, was considered out of scope for this thesis. Suggestions on optimisations and how they could affect performance have been left to the discussion in Section 6.1.


4.2.1 RandomWalk (Rand)

The RandomWalk explorer is the simplest explorer considered here. It does not require any additional input, apart from the list of available actions. The next action to be explored is chosen randomly from these. The reason why the (Rand) strategy is included in this thesis is to serve as a baseline when comparing the other explorers, similar to what is done in [14, 19, 25]. Randomly chosen tests are also surprisingly common, and multiple commercial MBTs use this strategy, e.g. [20, 17].

4.2.2 CostGuidedRandomWalk (CGRand)

This explorer tries to unveil faults by applying as many actions as possible. In order to prioritize actions correctly, the (CGRand) needs to be initialized with a dictionary mapping actions to their cost. The unit of the cost is not important, but it could be measured in e.g. execution time. Because it is improbable that any faults will be discovered if only the cheapest action is applied over and over again, a probability distribution is used to choose the next action to apply. This distribution is constructed in such a way that cheap actions have a higher priority than expensive ones. Over time this means that the (CGRand) will execute more cheap than expensive actions. Therefore, the total cost of the exploration will be lower compared to an ordinary random walk algorithm. Also, since different actions are applied, enough variation in the trace is hopefully created to trigger sequence-related faults.

Different versions of guided random walk have been tried in MBT before [25, 15, 14]. In those cases, actions with fewer executions have been prioritized over more frequently executed ones. But in our situation, where all actions are always possible to execute, that strategy would only produce the same result as a regular random walk algorithm. For this reason, the priority distribution has been chosen and implemented instead.
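One way to realize such a priority distribution is sketched below, under the assumption that each action's weight is the inverse of its cost; the exact distribution is not fixed in the text, so this particular weighting is illustrative:

```python
import random

def pick_action(costs, rng=random):
    """Sample an action; cheaper actions get proportionally higher weight.
    The inverse-cost weighting is an assumption, the text only states that
    cheap actions are prioritized over expensive ones."""
    actions = list(costs)
    weights = [1.0 / costs[a] for a in actions]
    r = rng.random() * sum(weights)
    for action, weight in zip(actions, weights):
        r -= weight
        if r <= 0:
            return action
    return actions[-1]  # guard against floating-point rounding
```

With costs {A: 1, B: 10}, repeated sampling applies A roughly ten times as often as B, lowering the total exploration cost while still mixing in expensive actions.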

4.2.3 BruteForce (BruteF)

The BruteForce explorer tries to find faults by applying all action traces of length one, then all traces of length two, and so on. Because of the huge number of possible combinations, the length will not become particularly large within reasonable time. However, since all short action combinations are applied after each other, the combined trace will be long enough to trigger complicated fault sequences. Another reason why the (BruteF) has been implemented is to be used for comparison, especially with (DeBru), which tries a more complex brute-force pattern. The (BruteF) explorer does not make use of any additional input apart from the available actions.


A brute-force strategy similar to the one applied by the (BruteF) is also mentioned in [16], where it was used as a coverage criterion called length-n coverage. It is not unusual to base test generation algorithms on coverage criteria.

Example: Consider a scenario where the explorer has the actions A, B, and C to choose from. (BruteF) will first try all traces of length one after each other, which will cause the actions A B C to be applied. In the next step, (BruteF) tries all action traces of length two, i.e. AA AB AC BA BB BC CA CB CC. In the next step all traces of length three will be applied and so on.
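The enumeration order of (BruteF) can be sketched as a small generator; the function name is illustrative, and the stream is infinite by construction:

```python
from itertools import count, islice, product

def brute_force_actions(actions):
    """Yield single actions in (BruteF) order: all traces of length 1,
    then all traces of length 2, and so on, concatenated into one stream."""
    for n in count(1):
        for trace in product(actions, repeat=n):
            yield from trace
```

For the actions A, B, C the first 21 yielded actions are A B C followed by AA AB AC BA BB BC CA CB CC, matching the example above; an explorer would apply each yielded action on the adapter until a failure is observed.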

4.2.4 AllGroupsIncrementally (AGInc)

The AllGroupsIncrementally explorer is based on the same idea as all-pairs or combinatorial testing, described in Section 2.1. Since (AGInc) is an online testing method, however, test cases can not be used in this case. Also, the infinite test space needs to be handled differently, since the problem for the explorer is not which data to send to the SUT, but in which order. To overcome this, the (AGInc) works incrementally. In each round an action trace is constructed and applied on the adapter. In round n, the trace consists of groups, with n actions in each group. These groups are constructed in accordance with the all-pairs method; thus, for each possible action combination of length n, there exists a group containing those n actions, but without regard to the order. This means that the first round will produce a trace composed of all the groups of size one. In the second round, all groups of size two are generated, and so on. As in the all-pairs strategy, this heavily reduces the number of action permutations. However, since the (AGInc) is trying to unveil faults related to sequences of actions, the order in which the actions are applied is also of great importance, and this method can not guarantee that actions are applied in the correct order to trigger faults. The (AGInc) explorer does not need to be initialized with any other information than the available actions.

Example: Consider a scenario where the explorer has the actions A, B, and C to choose from. In the first round, with group length one, the trace ABC will be created and applied on the SUT. In the next round, six action groups will be created and concatenated into the trace AA AB AC BB BC CC, which is then applied. In the third round, all groups of length three will be created, and so on. Notice that BA, CA, and CB are not included in the trace from round two, since these actions have already been tried together in the pairs AB, AC, and BC. This is the idea behind combinatorial testing.
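The round construction maps directly onto itertools.combinations_with_replacement, which enumerates exactly one group per unordered combination of n actions; the function name below is illustrative:

```python
from itertools import combinations_with_replacement, count

def all_groups_incrementally(actions):
    """Yield one (AGInc) round at a time: in round n, one group per
    unordered combination of n actions (an infinite stream of rounds)."""
    for n in count(1):
        trace = []
        for group in combinations_with_replacement(actions, n):
            trace.extend(group)  # order within a group is arbitrary
        yield trace
```

For the actions A, B, C, round one yields the trace ABC and round two yields AA AB AC BB BC CC, reproducing the example above.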

4.2.5 DeBruijn (DeBru)

This explorer uses the De Bruijn sequence, described in Section 2.2, to test all action permutations without applying ordinary brute force. The (DeBru) achieves this by generating De Bruijn sequences over the alphabet of available actions. When the entire sequence has been applied on the adapter, the n parameter of the De Bruijn sequence is incremented and a new sequence is generated. In this way the (DeBru) exhaustively tests all action permutations, but with far fewer applied actions compared to the naive brute-force strategy. There exist many ways to construct a De Bruijn sequence. The method used in the (DeBru) explorer is based on an algorithm from [23]. One property of this particular algorithm is that letters in the beginning of the alphabet occur more frequently in the beginning of the generated sequence. The (DeBru) can exploit this if an optional argument, representing the action costs, is provided to the explorer. In this scenario, the (DeBru) creates a mapping which places cheap actions earlier in the sequence, which in turn causes the explorer to try cheap action combinations before expensive ones.

Example: Consider a scenario where the available actions are A, B, and C and the failure occurs when the sequence CB is applied. The costs of the actions are A = 2, B = 1, C = 3. The (DeBru) starts by calculating B(3, 1) over the alphabet {0, 1, 2}, which is [0, 1, 2]. This is then mapped against the available actions, ordered by their costs. The result becomes the trace BAC, which is applied on the adapter. Since this did not cause a failure, the n value is increased and the procedure repeated. B(3, 2) is calculated and translated into the action trace BBABCAACC. Since the De Bruijn sequence is cyclic and an action trace is not, the algorithm needs to copy the n − 1 first actions to the end of the trace; otherwise the trace would not contain all permutations of length n (in this case CB would be missing). The result after the copying is the action trace BBABCAACCB, which can now be applied on the adapter, one action at a time. When the failure is detected, the combined error trace from all iterations is returned, in this case BACBBABCAACCB.
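The construction can be sketched as follows. The sketch uses the standard FKM (Lyndon word) algorithm for the De Bruijn sequence; it shares the property that letters early in the alphabet occur early in the sequence, but whether it coincides with the algorithm from [23] is an assumption:

```python
def de_bruijn(k, n):
    """De Bruijn sequence B(k, n) over the alphabet 0..k-1 (FKM algorithm).
    Low symbols occur early in the result, which (DeBru) exploits."""
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

def debru_trace(actions_by_cost, n):
    """Map B(k, n) onto the actions, ordered cheapest first, and append
    the first n - 1 actions to close the cycle."""
    trace = [actions_by_cost[i] for i in de_bruijn(len(actions_by_cost), n)]
    return trace + trace[:n - 1]
```

For the example above, debru_trace(["B", "A", "C"], 1) gives the trace BAC and debru_trace(["B", "A", "C"], 2) gives BBABCAACCB.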

4.2.6 LStarExplorer (LSExp)

The (LSExp) uses the L* algorithm from Section 2.3 to create a DFA over the set of valid inputs to an unknown adapter. The input alphabet of this DFA consists of the actions accepted by the adapter, the states of the DFA correspond to the states of the adapter, and the non-accepting states of the DFA represent failures in the adapter. A string which is not accepted by the DFA will thus correspond to an error trace for the adapter.

When the explore() method of the (LSExp) is called, it initiates L* with the available actions and itself as the teacher. If this setup were allowed to run forever, it would result in a DFA describing the entire behaviour of the adapter, including its failures. But since the objective for (LSExp) is to identify failures, the computation will usually be terminated long before this stage, i.e. as soon as a failure is discovered. In the case where the adapter does not contain any failures, L* will try longer and longer strings, which are executed on the adapter and thus explore more and more of the adapter's state space.

The membership queries asked to the (LSExp) are answered by first creating an equivalent action trace from the provided string s, and then applying this trace on the adapter. If a failure is detected, the execution of L* is aborted and the action trace is returned from the explore() method. If the action trace does not cause a failure, the string is considered to be in the valid input set of the adapter, and true is returned. In order to prevent different membership queries from interfering with each other, a reset of the adapter has to be performed before executing the next query.
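The membership-query handling can be sketched as follows. For brevity, the sketch assumes a synchronous apply() that returns the feedback result directly (the real adapter interface is asynchronous), and the function and parameter names are illustrative:

```python
def answer_membership_query(adapter, string, to_action):
    """Answer an L* membership query by replaying `string` on the adapter.
    `to_action` maps an alphabet symbol to an action. Returns (accepted,
    error_trace), where error_trace is None unless a failure was seen."""
    adapter.reset()  # isolate this query from earlier queries
    trace = []
    for symbol in string:
        action = to_action(symbol)
        trace.append(action)
        if adapter.apply(action) == "failure":
            return False, trace  # abort L* and report the error trace
    return True, None  # string is in the adapter's valid input set
```

The caller would terminate L* and hand the returned error trace to the reduction module whenever the first element is False.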

In order to answer the equivalence queries, the MBT tool GraphWalker, see Section 2.4.1, is used to look for differences between M_p and the adapter. This is done by initiating GW with M_p as its model. In order to do this, M_p needs to be translated into something that GW understands. This is however trivial, since GW uses a model represented as a directed graph, which is basically the same as a DFA, the representation used in M_p. This equivalence also means that guard conditions, which are supported in GW models, can be omitted. When the translation is completed, GW starts to traverse the model. For each edge chosen, the corresponding action will be applied on the adapter. Since GW supports online testing, this can be done one step at a time, i.e. choose and traverse an edge, apply the action corresponding to the chosen edge, decide whether to choose a new edge or signal an error. The process proceeds until:

1. A failure in the adapter is detected. This terminates the L* and the trace of all applied actions is returned from the explore() method as an error trace.

2. M_p enters a non-accepting state, but the adapter does not fail. This means that an inconsistency has been detected, and the current trace of applied actions can be converted into a string and returned as a counterexample. As it turned out, this case never occurred, due to one of the restrictions used for this thesis, but more about this in the Discussion, Chapter 6.

3. M_p enters an accepting state and no failure is detected in the adapter. In this case the (LSExp) starts over by asking GW to choose and traverse another edge.

To increase the efficiency, GW is not left to random exploration. Instead it is configured with three stages of path generators. The first stage is a set of a_star generators with stop condition reached_vertex(v_i), where v_i is a vertex corresponding to a non-accepting state in M_p. This means that GW will first try to reach all non-accepting states, which will cause either a failure or an inconsistency. If this is not possible, the second generator stage is used instead. This stage uses the quick_random path generator with stop condition edge_coverage(100). If this also fails to discover an inconsistency or a failure, stage three is used as a last resort. This stage uses random as path generator and never as stop condition, and is thus terminated by the (LSExp) instead. The maximum time allowed for (LSExp) to spend on an equivalence query can be configured with an optional argument to the (LSExp) constructor.

Before executing an equivalence query, (LSExp) must first reset the adapter to make sure earlier membership or equivalence queries have not changed the state of the adapter.

4.3 Reduction Module

Since an error trace needs to be applied, probably multiple times, in order for the developers to fix the problem, short error traces are of course preferable. This means that the traces shall contain as few redundant actions as possible. A redundant action is defined as an action in an error trace which is not needed to produce the failure. The concept of exploring the system in order to find failures will produce error traces containing many redundant actions. This is because the traces needed to trigger the failures are not known, and hence optimal error traces can not be expected from the exploration algorithms. To deal with this problem, another set of strategies can be applied after an error trace has been identified, in order to remove redundant actions. This is the purpose of the reduction strategies.

Similar to the explorers, all the reduction strategies were required to implement an interface. In this case, it is called the ReductionInterface, see Figure 4.4. Like the ExplorationInterface described earlier, the ReductionInterface provides only one method, reduce(). This is called with an error trace and the adapter from which the error trace was deduced. The goal of the reduce() method is to remove as many redundant actions as possible from the error trace. This is done by applying subtraces on the adapter and observing the feedback. In order to propose a solution for problem (2), multiple reduction strategies have been implemented; they will be compared against each other in chapter 5. Their ability to discover and remove redundant actions will be weighed against the resources they consume, e.g. the number of actions applied, how much each action costs to execute, and how many times the reducer needs to reset the adapter. Like the explorers, some reducers may also make use of additional information which is fed to their constructors. There is, however, no obligatory information that all the reducers require. The different strategies implemented for this thesis are described below and are later referred to as (TIAdd) - (DelDeb). With the same reasoning as in the last section, optimization of the strategies has been avoided, in order to reduce the risk that one strategy receives more attention than another and because of this appears more suited than an unoptimized alternative.


4.3. REDUCTION MODULE CHAPTER 4. IMPLEMENTATION

Figure 4.4: The ReductionInterface
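The interface can be illustrated with the following Python-flavoured sketch. The names reduce(), error trace, and adapter come from the text above; the exact signature in the thesis framework may differ, so this is an illustration only, not the framework's actual declaration.

```python
from abc import ABC, abstractmethod

class ReductionInterface(ABC):
    """Sketch of the interface every reduction strategy implements.

    reduce() receives the error trace and the adapter the trace was
    deduced from, and returns a (hopefully shorter) trace that still
    triggers the failure, found by replaying subtraces on the adapter
    and observing the feedback."""

    @abstractmethod
    def reduce(self, error_trace, adapter):
        raise NotImplementedError
```

Any strategy-specific configuration, such as the increment I of (TIAdd) below, would be passed to the strategy's constructor rather than to reduce().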

4.3.1 TraceIncrementationAdd (TIAdd)

The TraceIncrementationAdd uses the fact that the last action in the provided error trace was the action that triggered the failure. The (TIAdd) exploits this by first applying subtraces from the end of the error trace. The number of actions in the subtraces is increased by a configurable value I, which is also the only initialization data required by (TIAdd). As soon as a failure is triggered, the current subtrace is returned. This strategy has the potential to quickly and efficiently reduce the error trace. As long as the actions responsible for the failure are located near the end of the error trace, the result will be quite good. Together with the earlier discussion, from which we know that at least one of the actions responsible for the failure is located at the very end of the error trace, one can hope for this strategy to perform well. A drawback with the (TIAdd) is, however, that it is incapable of removing redundant actions in between non-redundant ones. Also, since the length of the subtraces only grows linearly, non-redundant actions in the beginning of a long error trace will cause the (TIAdd) to consume a lot of resources without reducing the trace very much.

This algorithm has been used before, e.g. in [7], to shorten long error traces in order to assist manual debugging. In [7], the error traces were created by an MBT tool using a random walk algorithm, but there is no reason why this algorithm could not be applied to error traces from other sources as well.

Example: Consider a scenario where the (TIAdd) has been initialized with I = 2 and is called with the error trace ABCDDBECA. The (TIAdd) will first try the subtrace CA. If this does not cause a failure, the subtrace BECA will be tried, then DDBECA, and so on until a failure is triggered or the subtrace contains the entire error trace.
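The idea can be condensed into a minimal sketch. The helper fails(trace), which is assumed to reset the adapter, apply the trace, and report whether the failure was triggered, is an assumption of this sketch and not part of the thesis framework:

```python
def ti_add(error_trace, fails, increment):
    """TraceIncrementationAdd sketch: try ever-longer suffixes of the
    error trace, growing the suffix length linearly by `increment`,
    until one of them reproduces the failure."""
    length = increment
    while length < len(error_trace):
        suffix = error_trace[-length:]
        if fails(suffix):  # fails() is assumed to reset the adapter first
            return suffix
        length += increment
    return error_trace  # no shorter suffix triggers the failure
```

If, for instance, the failure in the example above happened to require both B and D, the call would try CA and BECA before returning DDBECA.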

4.3.2 TraceIncrementationMult (TIMult)

The TraceIncrementationMult strategy is very similar to the (TIAdd). It also tries to reduce the error trace by creating and testing subtraces from the end of the original error trace, and the motivation is the same as for the (TIAdd). The algorithm in (TIMult) is also, like the one in (TIAdd), described in [7]. The difference between the (TIAdd) and the (TIMult) is the way in which they construct the subtraces. The (TIMult) increases the length of its subtraces exponentially, instead of linearly, which allows it to incorporate more actions with fewer iterations. This means that a larger part of the error trace can be tried with the same number of resets.


A drawback, however, compared to (TIAdd), is that the reduced error trace will generally be longer. The (TIMult) is also initialized with a value F, which decides the rate at which the length of the subtraces increases.

Example: Consider a scenario where the (TIMult) is initialized with F = 3 and is called with the error trace BBABCDDBECA. The (TIMult) will first try the subtrace A. If no failure is caused, the length of the next subtrace will be equal to the length of the last, multiplied by F. In this example that would be ECA. In the third iteration, the subtrace will be ABCDDBECA, and so on until a failure is triggered or the subtrace contains the entire original error trace.
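A sketch of the exponential variant, under the same assumption as before that fails(trace) resets the adapter, replays the trace, and reports whether the failure occurred:

```python
def ti_mult(error_trace, fails, factor):
    """TraceIncrementationMult sketch: like TIAdd, but the suffix
    length grows exponentially by `factor` (assumed > 1) instead of
    linearly."""
    length = 1
    while length < len(error_trace):
        suffix = error_trace[-length:]
        if fails(suffix):  # fails() is assumed to reset the adapter first
            return suffix
        length *= factor
    return error_trace
```

With F = 3, the suffix lengths tried are 1, 3, 9, ..., matching the A, ECA, ABCDDBECA progression of the example; a large jump at the end is exactly why the reduced trace is generally longer than with (TIAdd).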

4.3.3 DeltaDebugger (DelDeb)

The DeltaDebugger is based on the ddmin algorithm described in Section 2.5. This algorithm can be set up so that it perfectly fits the problem the reducers are trying to solve. The ddmin is initially called with the error trace that (DelDeb) wants to reduce. This means the result will be a minimal error trace with regard to some requirement. This requirement is decided by letting (DelDeb) perform the testing of the subsets and the complements. Since these sequences only contain actions, they can be treated as action traces. If such an action trace causes a failure in the adapter fed to (DelDeb), then (DelDeb) reports back to ddmin that that sequence fulfils the requirement. When ddmin finally returns, (DelDeb) can pass the minimized action trace along.

One thing that should be noted here is that (DelDeb) needs to perform a reset operation on the adapter before testing each action trace. This means that a lot of resets need to be conducted.

Example: Consider a scenario where the failure is triggered by the application of the actions B and D, regardless of other applied actions. This would result in an execution such as the one described in Table 4.1, given that the error trace provided to (DelDeb) was ABCD.

In the first iteration, the ddmin algorithm splits the provided error trace into two subsets, AB and CD, which are then applied on the adapter with resets in between. The complements do not need to be considered in this iteration since n = 2. Because none of the subsets caused a failure in the adapter, the algorithm increases the number of partitions by setting n = 4. In the next iteration, steps 3 to 7, the input trace is divided into 4 subsets, A, B, C, and D, each of which is applied on the adapter. Since none of them caused a failure, the complements are tried instead. In this case the first complement, BCD, does cause a failure and, thus, the algorithm reduces to this complement and adjusts the value of n to 3. In the next iteration, steps 8 to 12, the input trace, which now is BCD, is divided into 3 subsets, B, C, and D. Since these have been tried before and are known not to trigger any failure, they could be omitted, but this feature has not yet been implemented. Next the complements CD, BD, and BC are supposed to be tested, but


since BD causes a failure, the iteration is interrupted there by a reduce to complement. In the last iteration the subsets and the complements are yet again the same, but neither of them causes a failure, and since n is not less than the length of the input trace, the execution is terminated and BD is returned.

Table 4.1: Execution example of the DeltaDebugger Algorithm

Step  n  Test     A  B  C  D   Comments
  1   2  ∆1 = ∇2  A  B  .  .
  2   2  ∆2 = ∇1  .  .  C  D   Add partitions
  3   4  ∆1       A  .  .  .
  4   4  ∆2       .  B  .  .
  5   4  ∆3       .  .  C  .
  6   4  ∆4       .  .  .  D
  7   4  ∇1       .  B  C  D   Reduce to complement
  8   3  ∆2       .  B  .  .
  9   3  ∆3       .  .  C  .
 10   3  ∆4       .  .  .  D
 11   3  ∇2       .  .  C  D
 12   3  ∇3       .  B  .  D   Reduce to complement
 13   2  ∆2 = ∇4  .  B  .  .
 14   2  ∆4 = ∇2  .  .  .  D   Done
Result            .  B  .  D
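The walkthrough above can be condensed into a simplified sketch of the ddmin loop as (DelDeb) drives it. As in the earlier sketches, fails(trace) is an assumed helper that stands in for (DelDeb) resetting the adapter, replaying an action trace, and reporting whether the failure occurred:

```python
def ddmin(trace, fails):
    """Simplified ddmin sketch: split the failing trace into n subsets,
    keep the first subset or complement that still fails, and increase
    the granularity when neither does."""
    n = 2
    while len(trace) >= 2:
        step = len(trace) / n
        subsets = [trace[int(i * step):int((i + 1) * step)]
                   for i in range(n)]
        complements = [trace[:int(i * step)] + trace[int((i + 1) * step):]
                       for i in range(n)]
        for s in subsets:              # "reduce to subset"
            if s and fails(s):
                trace, n = s, 2
                break
        else:
            for c in complements:      # "reduce to complement"
                if c and fails(c):
                    trace, n = c, max(n - 1, 2)
                    break
            else:
                if n >= len(trace):
                    return trace       # granularity cannot grow further
                n = min(n * 2, len(trace))  # "add partitions"
    return trace
```

Running this sketch on the trace ABCD with a failure condition requiring both B and D reproduces the table: the reduce to complement at step 7 (BCD), the one at step 12 (BD), and the final result BD.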
