
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Test Case Dependency Detection Using Syntactic Analysis of Code for Test Optimization Purposes

ROUWAYD HANNA

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Test Case Dependency Detection Using Syntactic Analysis of Code for Test Optimization Purposes

ROUWAYD HANNA

Master in Computer Science, TCSCM
Date: June 22, 2020

Supervisor: Håkan Lane
Examiner: Olle Bälter

School of Electrical Engineering and Computer Science (EECS)
Host company: Ericsson AB

Supervisors at Ericsson: Auwn Muhammad and Sahar Tahvili


Abstract

It is not possible to develop high quality software for large systems without a rigorous testing process. However, testing tends to be costly and time-consuming, which is why research in test optimization has received a great deal of attention. Test optimization is often seen as a multi-criteria decision making problem, where dependencies between test cases are one of the criteria. Since dependent test cases directly influence the execution results of each other, ignoring these dependencies can cause unnecessary test execution failures. Recognizing dependencies and similarities between test cases is beneficial in many aspects of test optimization, such as test minimization and test prioritization.

The dependency information is typically derived from requirements and design artifacts which are not always present in the testing phase. One artifact that is always available during a testing process is the test code that executes the test cases. In this thesis, an approach for automatically detecting test case dependencies by analyzing test code is proposed, applied, and evaluated in the context of an industrial case study at Ericsson AB in Sweden.

The proposed approach involves syntactic analysis of test code to produce Abstract Syntax Trees, which are converted into feature vectors and fed into machine learning models to classify the dependent test cases into clusters. Two clustering algorithms, HDBSCAN and K-means, were used and their results were compared. The proposed approach was able to detect dependencies using the test code, and the best results were obtained when using the HDBSCAN clustering algorithm, yielding an F1 score of 70.7%. The approach proposed in this degree project can be used in industrial settings to help testers in identifying dependencies between test cases. Making use of the identified dependencies during the testing process can reduce the risk of unnecessary failures, thus saving time and costs.


Sammanfattning (Abstract in Swedish)

It is not possible to develop high-quality software for large systems without a rigorous testing process. However, testing tends to be costly and time-consuming, which is why research in test optimization has received much attention. Test optimization is often seen as a multi-criteria decision problem, where dependencies between test cases are one of the criteria. Not taking dependencies between test cases into account can cause unnecessary test failures, since dependent test cases directly affect each other. Knowing the dependencies and similarities between test cases is beneficial in many aspects of test optimization, such as test minimization and test prioritization.

The dependency information is usually derived from requirements and design artifacts that are not always available in the testing phase. One artifact that is always available during a testing process is the test code that executes the test cases. In this report, a method for automatic detection of test case dependencies through analysis of test code is proposed, applied, and evaluated within the scope of an industrial case study at Ericsson AB in Sweden.

The proposed method comprises syntactic analysis of test code to produce abstract syntax trees, which are converted into feature vectors and fed into machine learning models to classify the dependent test cases into clusters. Two clustering algorithms, HDBSCAN and K-means, were used and their results were compared. The proposed method was able to detect dependencies using the test code, and the best results were obtained with the HDBSCAN clustering algorithm, which yielded an F1 score of 70.7%. The method proposed in this degree project can be used in industrial settings to help testers identify dependencies between test cases. Using the identified dependencies during the testing process can reduce the risk of unnecessary execution failures, saving time and costs.


Acknowledgements

First of all, I would like to express my sincere gratitude to my supervisors at Ericsson, Auwn Muhammad and Sahar Tahvili, for their help and support throughout this degree project. Also, a heartfelt thanks goes to my KTH supervisor, Håkan Lane, and my peer-review group at KTH for their insights and continuous helpful feedback. Finally, my deepest gratitude goes to my family, partner, and closest friends for supporting me and keeping me motivated during difficult times as well as joyful ones.

Thank you!

Rouwayd Hanna


Contents

1 Introduction
  1.1 Purpose
  1.2 Research Question
  1.3 Scope
  1.4 Outline
2 Background
  2.1 Test Optimization
    2.1.1 Test Case Dependencies
    2.1.2 Test Prioritization
    2.1.3 Test Minimization
  2.2 Syntactic Analysis
    2.2.1 .Net Compiler Platform
  2.3 Machine Learning
    2.3.1 Feature Engineering
    2.3.2 Clustering
  2.4 Performance Measurements
  2.5 Related Work
3 Methods
  3.1 The Dataset
  3.2 Dependency Detection Pipeline
  3.3 Syntactic Analysis
  3.4 Data Preprocessing
    3.4.1 Encoding
    3.4.2 Dimensionality Reduction
  3.5 Clustering
  3.6 Evaluation Method
    3.6.1 Ground Truth
    3.6.2 Performance Measurements
  3.7 Hardware and Software
4 Results
  4.1 Data Statistics
  4.2 Clustering Results
    4.2.1 HDBSCAN
    4.2.2 K-Means
5 Discussion
  5.1 Analysis of Results
    5.1.1 Test Optimization using Dependencies
  5.2 Threats to Validity
  5.3 Ethics and Sustainability
  5.4 Future Work
6 Conclusions
Bibliography


Chapter 1 Introduction

The usage of software has become integral in society and in our daily lives. As software grows in popularity, the demand for boosting the quality of software increases. It is not possible to achieve high quality in software without rigorous testing, which is why software testing is one of the most important steps in the software development life cycle (SDLC). Detecting failures, verifying systems, and designing test plans help in ensuring the reliability and quality of a software product, which can lead to end-users' satisfaction. However, such activities tend to be costly and time-consuming, especially in large-scale software with high complexity [1]. Maintaining the quality and safety of a software product as its complexity increases, while minimizing costs, has become ever more relevant, which is why test optimization and test effectiveness have been the topic of several research studies [1, 2, 3].

There are many ways to optimize the testing process, such as test case minimization and test case prioritization. In some cases, for example in the automotive industry, the execution of a test case might take several hours and include laborious and complex manual test setups [2]. Knowing which test cases are redundant prior to execution can therefore save considerable time. Test suite minimization aims at eliminating redundant test cases from test suites to reduce maintenance and testing costs [4]. The purpose of testing is to detect errors and bugs in the system; the earlier an error is detected, the earlier it can be fixed, which can lead to more on-time deliveries. Test case prioritization techniques aim at finding errors early in the execution. These techniques involve ordering an existing set of test cases to increase their effectiveness at finding errors and meeting performance goals [5].


Test optimization is often seen as a multi-criteria decision making problem, where one of the identified criteria is the dependency information between test cases [1]. As a software system grows in complexity, it accumulates interactions between different modules and sub-modules. These interactions are inherited by the test cases that test the functionality of these modules [3]. Therefore, the dependencies between test cases should be considered in the same way as the dependencies in the system. Ignoring dependencies when executing tests can lead to unnecessary failures and time loss [6]. The dependency information can be utilized for different optimization purposes. For instance, knowing the dependencies can help testers and test managers rank test cases for execution, which is useful in both test case prioritization and minimization. Finding dependencies manually is time-consuming and requires domain knowledge. Most of the state-of-the-art solutions for automatically detecting dependencies and optimizing the testing process rely on artifacts that are not always present in the testing phase, e.g. signal information [7] or a structured requirements specification [2]. Industry would benefit from a tool that can detect dependencies automatically by analyzing test code, an artifact that is often present in a test suite.

In this thesis, an approach for automatic detection of dependencies between test cases by analyzing their corresponding code is proposed, applied, and evaluated. The approach uses only the test code, without requiring any other data source. Data analysis and machine learning are used to find similarities between the test cases based on the syntactic features of their code. The approach is evaluated in the context of a testing project at the host company Ericsson AB in Sweden.

1.1 Purpose

The test code of an application is mostly used only for the execution of the test cases. Many test engineers overlook the fact that the code can contain valuable information for optimizing the testing process. The purpose of this study is to investigate the possibilities of automatically extracting such information by analyzing the syntactic features of the test code. Specifically, the information that is in demand by Ericsson, for which this degree project is carried out, is the dependency information between test cases for five of their products. This information can be used for test optimization purposes, such as test case minimization, which can lead to time and cost reductions. Furthermore, the approach proposed in this project can be beneficial in any testing domain with a large amount of test code where dependency information is to be utilized.

1.2 Research Question

The goal of the degree project is to solve a test optimization problem, namely test case dependency detection, by syntactically analyzing test scripts. This leads to the following research question:

To what extent can dependencies between test cases be detected by analyzing test scripts?

1.3 Scope

There are many parameters to consider when optimizing a testing process, and the causes of failed test cases may vary. This project focuses only on the impact of dependencies between test cases when optimizing the effectiveness of testing at Ericsson. There are different types and degrees of dependencies between test cases. In this project, the focus is on finding shared dependencies, where dependent test cases are defined as being syntactically similar in code because they share resources. Moreover, only a binary classification for groups of test cases is made, as either dependent or not dependent. Neither the degree nor the direction of the dependencies is measured.

Furthermore, implementing a syntactic analysis tool is not included in this degree project; instead, an open-source platform, Roslyn, is used, which is further described in Section 2.2.1. However, the part where the relevant features are extracted and machine learning is applied is designed and implemented as part of this project. The approach implemented during this project only works with test suites written in C#. However, it could be extended to any programming language with a suitable code analysis tool. In that case, other syntactic features would probably have to be used, as the data in this project is specific to the product. Also, measurements of the efficiency of the manual dependency detection process at the host company Ericsson were not provided, and therefore a comparison between the manual and the automatic approaches is not performed in this thesis.


1.4 Outline

The thesis is organized as follows: Chapter 1 introduces the subject and the purpose of the study, and includes the research question. Next, Chapter 2 provides the relevant theory about test optimization and dependencies between test cases, including related work in the field of finding test case dependencies for optimization purposes. The methodology of the approach is described in Chapter 3, and the results of the experiments are presented in Chapter 4. In Chapter 5, Discussion, the results are discussed, and sections on future work and sustainability are included. The thesis closes with the conclusions in Chapter 6.


Chapter 2 Background

In this chapter, the relevant theory about test optimization and dependencies between test cases is presented. This chapter includes a section containing an overview of syntactic analysis. The next section explains the theory of machine learning that is relevant for this study. The chapter ends with a related work section describing the work that this approach is built upon.

2.1 Test Optimization

Software testing is one of the most important steps in the software development life cycle (SDLC). Detecting failures, verifying systems, and designing test plans tend to be costly [1], which is why test optimization and test effectiveness have been the topic of several studies [1, 2, 3]. Having an effective testing procedure is beneficial in industry as it allows for cost reduction and more on-time deliveries while maintaining software quality. Some common ways of optimizing the testing process are test minimization [8] and test case prioritization [3]. Multiple algorithms have been applied to tackle the test optimization problem, and surveys have reviewed some common approaches [9, 10], such as genetic algorithms, greedy algorithm-based approaches, and different types of heuristics. Test optimization is often seen as a multi-criteria decision making problem [1], where some of the identified criteria are: requirement coverage, execution time, fault detection probability, and test case dependencies. This thesis focuses on test optimization based on test case dependencies, with the hypothesis that test cases that are syntactically similar in code are more likely to be dependent, since the similarity suggests a sharing of resources between the tests.


2.1.1 Test Case Dependencies

As a system grows in complexity, with increasing numbers of interactions between modules and sub-modules, the matter of finding dependencies becomes ever more important. Since test cases are seen as use case scenarios that mirror the functionality of the system under test, they inherit a similar dependency structure as the functions they test. Previous studies have shown that test failures can be caused by not considering test case dependencies before or during execution [2, 11]. For example, let us consider a simple server program where two of the functionalities are to start the server and to connect a client to the server. There is an obvious dependency here: to be able to connect a client to the server, the server must have been started. This also holds true for the test cases that test these functionalities, as the test cases that assess starting the server must be executed before the test cases that assess connecting a client to the server, thus making them dependent. Dependent test cases have a direct impact on the test execution results and therefore the dependency information is critical when performing test optimization techniques.

Researchers have identified different kinds of dependencies between test cases [3, 7]:

• Functional dependency: dependencies that represent the interactions and relationships among system functionality, determining their run sequence [3]. Two functions are functionally dependent if one function can only be executed if a precondition is met and the other function enables this precondition. Therefore, the test cases should respect the same ordering.

• Temporal dependency: dependencies that are based on time and exact sequence. If a test case TC2 is temporally dependent on test case TC1, then TC2 should be executed immediately after TC1.

• Abstract dependency: dependencies that occur in models that have a hierarchical decomposition of the model elements, such as aggregation and composition [12].

• Causal dependency: dependencies based on the need for data and resources. For example, if test case TC2 requires a data item to be passed and test case TC1 creates that data item, then TC1 should be executed at some point before TC2.

In this thesis, the shared dependencies between test cases are the main focus, which is a type of dependency that is based on the syntactic similarity of their code. Test cases that are syntactically similar in code have a higher chance of sharing resources, for example through their common global variables or methods. This interpretation of dependencies follows to some extent both the definition of causal dependency and that of functional dependency. If two test cases are sharing resources, then there is a high probability that one of them is creating or modifying a data resource that the other requires. Also, the precondition in this context is the need for resources. Detecting dependencies between test cases can result in a more effective use of testing resources [13], for example by executing independent test cases in parallel and avoiding redundant test executions. When a test case fails during execution, all dependent test cases also fail and should therefore be disabled. Thus, redundant executions are avoided.

Figure 2.1: An example of a directed dependency graph between six different test cases TC1-TC6. A directed edge represents a dependency between two test cases, where the test case at the end of the edge is dependent on the test case at the start of the edge.

The dependency information can be presented visually in a graph, with nodes as test cases and edges describing the dependency relationship; see Figure 2.1 for an example. Much useful information can be extracted from such a graph. It can be seen that test case TC6 is an independent test case, and therefore its position in the running sequence need not be considered, as it can be run at any point in time. Figure 2.1 also shows that test case TC1 is dependent on three other test cases: TC3, TC4, and TC5. This is interpreted as: test cases TC3, TC4, and TC5 must be executed before test case TC1. The information extracted from such a dependency graph can be used in many different test optimization techniques, such as test prioritization, test minimization, and test scheduling, making this information critical for test optimization purposes. In addition to containing the information about which test cases are dependent, a dependency graph can also include the direction as well as the degree of dependency between test cases. The dependency degree of a test case TC in a dependency graph denotes the number of test cases that TC depends on, and can be interpreted as the number of interactions of the part of the system tested by TC. Miller et al. [3] use the dependency degree of a test case as a critical criterion when ranking test cases for execution. The authors claim that assigning higher priority to tests with a higher dependency degree increases the likelihood of finding errors early in the test run.
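To make the graph queries concrete, the sketch below builds the dependency graph of Figure 2.1 with the Python library networkx; the TC2 to TC3 edge is hypothetical, added only so the degree computation has more to show.

    import networkx as nx

    g = nx.DiGraph()
    g.add_nodes_from(f"TC{i}" for i in range(1, 7))
    # An edge (u, v) means "v depends on u", matching Figure 2.1's arrow convention.
    g.add_edges_from([("TC3", "TC1"), ("TC4", "TC1"), ("TC5", "TC1"),
                      ("TC2", "TC3")])  # TC2 -> TC3 is a hypothetical extra edge

    # An independent test case has no incident edges at all (like TC6).
    independent = [n for n in g.nodes if g.degree(n) == 0]
    # Dependency degree: how many test cases a node depends on (its in-degree).
    dependency_degree = {n: g.in_degree(n) for n in g.nodes}
    print(independent)        # ['TC6']
    print(dependency_degree)  # {'TC1': 3, 'TC2': 0, 'TC3': 1, ...}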

2.1.2 Test Prioritization

It is often not the case that all test cases in a test suite are considered equally important. Some test cases might have a higher chance of detecting failures in the system than others, or might be better than other test cases based on some quality measure. As such, ranking the test cases for execution based on the quality attribute of choice leads to a more effective testing process [5]. The task of ranking test cases is called test case prioritization, and it can be applied to all testing levels, such as unit testing, regression testing, and system testing. The goal of test case prioritization is the early maximization of some desirable property, such as the fault detection rate. Rothermel et al. [14] define the test case prioritization problem as follows:

Definition 2.1.1 Given: a test suite $T$, the set of its permutations $PT := \{p \mid p \text{ is a permutation of } T\}$, and a function from $PT$ to the real numbers, $f : PT \to \mathbb{R}$.

Problem: find $T' \in PT$ that maximizes $f$.

In this definition, $PT$ contains the permutations of the whole set $T$. Solving the test case prioritization problem does not involve changing the size or content of the set of test cases, but rather finding the optimal permutation of test cases. The function $f$ is the quality measure of a test suite, and it should ideally determine the exact rate of fault detection of a test suite. However, since this is difficult to achieve, as it requires information given by execution results, the function $f$ is often approximated in test case prioritization techniques. Several methods estimate the quality measure, for example, based on the amount of code coverage that the tests achieve [14]. In this study, the potential of using test case dependencies for test prioritization will be discussed.
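As a toy illustration of Definition 2.1.1, the sketch below enumerates every permutation of a four-test suite and keeps the one maximizing a hypothetical quality function f; the suite and the fault probabilities are invented, and real prioritization techniques approximate f rather than search exhaustively.

    from itertools import permutations

    # Toy suite and invented fault-detection probabilities.
    T = ["TC1", "TC2", "TC3", "TC4"]
    fault_prob = {"TC1": 0.9, "TC2": 0.1, "TC3": 0.5, "TC4": 0.3}

    def f(order):
        # Quality measure: reward placing likely-failing tests early
        # (earlier slots get a larger weight).
        n = len(order)
        return sum((n - i) * fault_prob[tc] for i, tc in enumerate(order))

    best = max(permutations(T), key=f)
    print(best)  # ('TC1', 'TC3', 'TC4', 'TC2')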

2.1.3 Test Minimization

As a software system evolves, its test suite can accumulate redundancies over time. In some systems, each test case requires hours of laborious and complex manual test setups, and its execution may occupy the testing environment for several hours. Such time-consuming test cases can be found, for example, in embedded system testing in the automotive industry [2]. Therefore, each opportunity to skip a test case is economically valuable [15]. The objective of test suite minimization is to eliminate redundant test cases, reducing the number of tests to run in an attempt to minimize the costs of software testing. The redundancy in question usually refers to the requirements that are tested, and sometimes it is difficult to know exactly what requirements are being tested by a test case. This is why current techniques [8, 16, 17] use different coverage criteria, such as statement or branch coverage, to identify redundant test code. The problem with these techniques is that they do not consider the dependency information between test cases when deciding which test cases are redundant. This can lead to unwanted behavior such as redundant test failures. For example, if test case TC1 is identified as redundant and test case TC2 is dependent on TC1, then removing TC1 from the test suite might cause a failure in TC2; see Figure 2.1.

2.2 Syntactic Analysis

Syntactic analysis, or parsing, in the context of programming languages, is the process of creating a syntax tree, or parse tree, from a string of words that conforms to the formal grammar of the programming language in which it was written [18]. It is used to better understand the formal syntactic structure of source code. This formal information can be used in various information extraction activities, as the source code describes, precisely and in detail, what a software system does.

The objective of syntactic analysis is to produce a hierarchical structure that represents the derivation process for later interpretation. The derivation process refers to the process that determines which strings are part of the language generated by a grammar [19]. The structure generated by syntactic analysis is a tree structure called a syntax tree [19]. A useful attribute of a syntax tree is that a syntax tree obtained from a parser can reproduce the exact text it was parsed from, and vice versa [20]. The nodes of a syntax tree represent syntactic constructs such as declarations, statements, expressions, and clauses. Usually, syntax trees hold all the source information, including semicolons and commas, but sometimes not all details are present. A tree that omits some syntactic details is called an Abstract Syntax Tree (AST) [21]. Figure 2.2 shows an example of a syntax tree and its corresponding code. This syntax tree is not the only possible tree for the code, since a syntax tree can include varying levels of detail.

Figure 2.2: A syntax tree example and its corresponding code.
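As a small illustration of what a parser produces, the snippet below uses Python's built-in ast module on a two-line program; the thesis itself parses C# with Roslyn, so Python merely stands in here to show the tree structure and a construct count.

    import ast

    source = "if x > 0:\n    y = x + 1\n"
    tree = ast.parse(source)           # the abstract syntax tree of the snippet
    print(ast.dump(tree, indent=2))    # nodes such as If, Compare, Assign, BinOp

    # Counting constructs by walking the tree, as done later for feature vectors:
    n_ifs = sum(isinstance(node, ast.If) for node in ast.walk(tree))
    print(n_ifs)  # 1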

2.2.1 .Net Compiler Platform

Normally, a compiler is seen as a black box, where source code is the input and executable code is the output. However, the deep understanding of the source code that is needed to produce executable code is lost and unavailable for use. This information can be used in code analysis tools to, for instance, find references to a variable or get access to the Abstract Syntax Tree generated from the code.

The .Net Compiler Platform [22], code-named Roslyn and created by Microsoft, is an open-source C# and Visual Basic compiler that exposes its APIs and provides rich code analysis tools. The Roslyn compiler APIs can be used, among other things, to parse code (create syntax trees), perform semantic analysis, and compile code.


Figure 2.3: The compilation pipeline that is exposed by the .Net Compiler Platform. Image is taken from [22].

Roslyn divides the process of compilation into separate steps, as can be seen in Figure 2.3. The first component is the parser, which takes source code and tokenizes it. The tokens are then used to generate a syntax tree that conforms to the grammar of the language. In this thesis, the parser component is of most interest since the syntax tree produced by the parser will be used as input data in the process of detecting dependencies between test cases.

2.3 Machine Learning

In this section, necessary background information related to machine learning topics that are relevant for this thesis will be presented.

2.3.1 Feature Engineering

Feature engineering is an important step in any machine learning pipeline as it is part of the data preparation phase. Ozdemir and Susarla [23] provide a definition of feature engineering as follows:

Definition 2.3.1 Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

The features referred to in Definition 2.3.1 are a numeric representation of an aspect of the raw data, used as the input to a machine learning model [24]. The choice of features can greatly impact the results of the model. There are several feature engineering techniques, such as numeric feature engineering (e.g. scaling and quantization), feature engineering for natural text (e.g. bag-of-words and n-grams), and model-based feature engineering (e.g. Principal Component Analysis (PCA)) [24]. These techniques can be used individually or collectively, depending on the structure of the data being studied.

One common obstacle found in many machine learning applications is having to deal with high-dimensional data. This phenomenon is commonly known as the curse of dimensionality and arises when handling data in high-dimensional spaces [25]. As the number of different features grows, the volume of the feature space increases exponentially, which leads to data sparsity. Data sparsity makes it difficult to obtain statistically significant information on the distribution of the data. To address this, different dimensionality reduction techniques exist. PCA is used for feature dimensionality reduction by examining linear correlation patterns between features [26]. PCA is based on the assumption that the variance of the data represents the information contained in the data, and therefore the goal is to find the linear combinations of features such that the derived features capture maximal variance [26]. Another approach to dimensionality reduction is to use feature selection techniques, where the number of dimensions is reduced by omitting irrelevant information. Feature selection is the process of selecting only a subset of relevant features, i.e. excluding irrelevant features, in order to reduce the complexity of the resulting model [25].

2.3.2 Clustering

Clustering, or cluster analysis, is the task of grouping sets of data into different groups of similar features, called clusters [27]. The goal of clustering is to divide the observations into distinct groups such that similar observations belong to the same group. Finding the conditions that define which observations are similar or different often requires domain-specific considerations based on the data in question [25]. As such, it is difficult to fully automate the task of discovering the underlying patterns of data, and some trial and error is required to find the techniques and model parameters that achieve the desired properties in the results.

Clustering is popular in many fields and there exist several different clustering methods. One of the best known clustering approaches is K-means, which is a simple approach for partitioning a data set into $K$ distinct, non-overlapping clusters [25]. Given a set of $n$ $d$-dimensional real vector observations, $D = \{x_1, x_2, \ldots, x_n\}$, the goal of K-means is to partition the data set $D$ into $K$ clusters containing the indices of the observations, $C_1, C_2, \ldots, C_K$, such that two properties are satisfied [25]:

1. $C_1 \cup C_2 \cup \ldots \cup C_K = \{1, 2, \ldots, n\}$. This means that each observation belongs to at least one of the $K$ clusters.

2. $C_i \cap C_j = \emptyset$ for all $i \neq j$. This condition guarantees non-overlapping clusters, i.e. no observation belongs to more than one cluster.


The number of clusters, $K$, must be specified in order to run the K-means algorithm. Selecting the parameter $K$ can sometimes be difficult, as there may be a lack of prior knowledge about the data [28].

There exist other popular clustering methods that do not require the user to pre-specify the number of clusters, for example DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise. DBSCAN belongs to the popular clustering paradigm of density-based clustering [29]. In density-based clustering, the similarity of points lying in the same cluster is based on the density within that region, and different clusters are separated by regions of lower-density points. Using density as the main measure leads to the ability to find clusters of any shape in the feature space, unlike, for example, K-means, which assumes that clusters are spherical. However, DBSCAN can only provide a non-hierarchical labeling of data objects based on a single global density threshold [29]. Using a global density threshold can lead to a poor characterization of data sets where the clusters have different densities. These issues are not present in HDBSCAN (Hierarchical DBSCAN), a modified version of DBSCAN [30]. HDBSCAN generates a tree-like visual representation of the observations containing clusters of different densities. Although HDBSCAN is a powerful clustering method, one disadvantage is that the computation of the hierarchy runs in quadratic time in both the best and worst case [31].

Clustering techniques will be used in this thesis as a means of grouping representations of test cases such that a cluster contains test cases that are likely to be dependent. Feature engineering will be an important preprocessing step to find how to best represent a test case, i.e. which features should make up a test case.

2.4 Performance Measurements

It is crucial to choose a suitable performance metric, as it influences the quality of the approach. It can be difficult to evaluate clustering algorithms, especially without prior information or assumptions about the data [32]. One powerful analytical tool in machine learning is the confusion matrix [33], which contains information about how the machine learning model has performed with respect to a ground truth. A confusion matrix provides a pairwise comparison between the actual and the predicted classification made by a classification system [32]. The rows and columns of the matrix represent the instances of the actual and the predicted classes [34]. The cells of the matrix represent how many instances of a true class were classified as each of the predicted classes. The advantage of using a confusion matrix is that it provides a whole picture of the performance, and many metrics can be extracted from it, such as accuracy [34].

Table 2.1: A confusion matrix for a multi-class classifier.

              Predicted
              C1    ...   Cj    ...   Cn
Actual  C1    N11   ...   N1j   ...   N1n
        ...
        Ci    Ni1   ...   Nij   ...   Nin
        ...
        Cn    Nn1   ...   Nnj   ...   Nnn

Table 2.1 presents the basic form of a confusion matrix for a multi-class classifier with the classes C1, C2, ..., Cn. Nij represents the number of samples belonging to class Ci and classified as class Cj. If the goal is to make as many correct predictions as possible, then the confusion matrix should have zeros everywhere except on the diagonal.
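For illustration, the snippet below builds such a matrix with scikit-learn's confusion_matrix; the six labeled samples are invented.

    from sklearn.metrics import confusion_matrix

    actual    = ["C1", "C1", "C2", "C2", "C3", "C3"]
    predicted = ["C1", "C2", "C2", "C2", "C3", "C1"]

    # Row i, column j: samples of actual class Ci predicted as class Cj.
    cm = confusion_matrix(actual, predicted, labels=["C1", "C2", "C3"])
    print(cm)
    # [[1 1 0]
    #  [0 2 0]   a perfect classifier would be non-zero only on the diagonal
    #  [1 0 1]]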

2.5 Related Work

The dependency information is useful in many test optimization strategies, and dependencies are thus used as an artifact in several studies on test optimization [1, 2, 3, 12]. However, defining and finding the dependencies between test cases is a challenging task, and it requires capturing and analyzing several test artifacts such as test specifications, requirement specifications, system architecture, and log files. Therefore, the conducted research in the area is limited. Some studies propose techniques for finding dependencies between test cases using Natural Language Processing (NLP) on specification documents [35], signal analysis [7], and questionnaire-based analysis [36]. Other studies use domain knowledge to obtain the dependency information [3]. Table 2.2 provides a summary of the papers and methods identified as relevant to this thesis.

Tahvili [1] states that test optimization is often seen as a multi-criteria decision-making problem, where the identified criteria are: requirement coverage, execution time, fault detection probability, and test case dependencies. The research in [1] is extensive in the area of test optimization at the manual integration testing level, where six studies are included. The main goal of the study is to provide methods for a more efficient manual integration testing process. Dependency detection is an important part of [1], and multiple approaches for finding dependencies are included. These approaches can be categorized into questionnaire-based, deep learning-based using NLP, and signal analysis based studies.

Tahvili et al. [35] present a novel approach for deriving similarities and functional dependencies between test cases from test specification documents written in natural language. The proposed approach adopts a set of new techniques inspired by NLP and deep learning. The hypothesis in [35] is that there exists a correlation between semantically similar test case specifications and their corresponding functionally dependent test cases, and this correlation is used throughout the approach. The first stage of the proposed approach is deriving feature vectors from the test specifications based on their semantic meaning using Doc2Vec. Semantically similar documents will be closer to each other in the vector space created by Doc2Vec according to the cosine similarity measure. The approach proceeds by clustering the vectors, according to the similarity function, to form clusters of feature vectors corresponding to clusters of test cases, using two clustering algorithms: HDBSCAN and FCM (Fuzzy c-means). The approach is evaluated in a case study at Bombardier Transportation AB in Sweden by comparing the dependencies between test cases obtained from the approach to the ones previously derived at the company. The authors find that HDBSCAN is the more accurate clustering method, with an accuracy level of 80% and an F1 score of 75% when comparing with the ground truth.

In [7], an automated approach for the detection of functional dependencies between manual test cases at the integration testing level is presented. Signal communication between software modules is used for identifying the dependencies. The proposed approach is based on natural language processing techniques that are applied to software requirements specifications (SRS) and test specifications for extracting the necessary signal information, which is later used in the matching process to identify dependencies. The dependencies between test cases are defined as follows: two software modules, M1 and M2, are considered dependent on each other if and only if the internal output signal from M1 is required by M2 as an internal input signal. Thus, all test cases for M2 are dependent on the test cases of M1. The approach shows feasibility when evaluated in a case study at Bombardier Transportation AB in Sweden. It is important to note that signal information and requirement specifications are not available artifacts for all systems under test, making this approach restricted.

Arlt et al. [2] utilized a structured requirements specification that implicitly carries some information about logical dependencies between requirements, together with current test execution results, to automatically infer redundant test cases. The authors suggest a format for writing requirements specifications such that logical dependencies can be implicitly extracted. The approach is essentially a test suite minimization technique based on test execution results and logical dependencies. A set of definitions and rules is presented that can be used to find redundant test cases. While having a similar end goal, the work of this degree project is more focused on finding dependencies automatically, without requiring additional requirements or formalisms, by relying on test code alone.

Miller et al. [3] argue in their thesis that many existing test case prioritization techniques do not constrain the order in which tests can be run. However, the functional dependency between test cases is one such constraint that should be considered. Miller et al. [3] present several test case prioritization techniques that use the dependency information from a test suite for prioritization. The authors hypothesize that the dependencies between test cases are representative of the interactions in the system modules that are tested. Therefore, executing test cases with multiple dependencies earlier might lead to a higher rate of fault detection. Dependency structures, graphs that model the dependencies in a test suite, are used in various algorithms to calculate a test case ordering's graph coverage value based on the complexity of the dependencies. The empirical results indicate that the techniques show great promise in increasing the fault detection rate compared to untreated orders, random orders, and test suite orders based on techniques that use functional coverage. Unlike the work in this degree project, Miller et al. [3] do not define dependencies between tests. The dependencies are provided by test engineers, and the information is used to construct dependency structures.


Table 2.2: Summary of related work.

Paper | Dependency Detection Method | Optimization Technique | Benefits | Drawbacks
Tahvili et al. [35] | NLP & deep learning techniques used on test case specifications | - | Test case specifications are common in requirements engineering practice. | Does not find the direction of the dependencies. Not precise; relies on the hypothesis that dependent test cases have semantically similar test specifications.
Tahvili et al. [7] | Signal analysis & NLP on SRS and test specifications | Test case prioritization & scheduling | Extracted dependencies are precise. | Signal information is not available for all systems under test.
Arlt et al. [2] | Dependencies are implicitly defined in a structured requirement specification | Test suite reduction | Automatic online detection of redundant test cases. | A structured requirement specification is required. Relies on test execution results.
Miller et al. [3] | Dependencies are provided by test engineers | Test case prioritization | Previous test execution results are not required for prioritization. | Dependencies between tests are not defined. Domain knowledge is required to find dependencies.


Chapter 3 Methods

This chapter summarizes the methodology of the degree project. First, the data set used for the experiments is described. This is followed by a description of the pipeline of the proposed approach. The next section explains the syntactic analysis stage of the project, including how the data set, as raw text, is converted into feature vectors. This is followed by a section describing the details of the different data preprocessing techniques that are applied to transform the data into an appropriate form. Then, a section describing the clustering algorithms that are used is included. The chapter closes with sections describing the evaluation methods and the hardware and software prerequisites for performing the experiments.

3.1 The Dataset

The data set used for the experiments in this degree project consists of source code for testing five of Ericsson's products. A total of 400 distinct source code files make up the data set. Each code file is written in the programming language C# and is used to execute one test case. The source code files vary in length and functionality, but all of them implement a common method which is the starting point for each test. Resources from Dynamic Link Libraries (DLLs) are heavily used in all code files. A DLL is a library that is shared by several applications running under Windows. Furthermore, the code files are written by different authors, resulting in different coding styles. The code is well commented, containing comments on the steps that the test performs as well as comments explaining what certain statements do. More details and statistics of the data set are presented in Section 4.1.


3.2 Dependency Detection Pipeline

The goal of this thesis is to design, implement, and evaluate an approach for detecting test case dependencies by syntactically analyzing test cases’ source code files. Figure 3.1 shows the pipeline of the proposed approach, providing an overview of the whole approach.

Figure 3.1: The steps of the proposed approach. The approach consists of three main phases: Syntactic Analysis, Feature Engineering, and Clustering.

The approach starts by performing syntactic analysis on each of the C# source code files to produce ASTs (Abstract Syntax Trees). After carefully analyzing some example test scripts and consulting domain experts at Ericsson, relevant features of the ASTs were cherry-picked to produce feature vectors. The feature vectors are preprocessed: the non-numerical features are encoded using a suitable encoder, and unimportant features are removed using feature selection, which is further explained in Section 3.4. The dimensionality of the preprocessed vectors is then reduced by performing PCA. In the final stage, the dimensionality-reduced feature vectors are fed into clustering algorithms to produce clusters containing the source code of dependent test cases.

3.3 Syntactic Analysis

The first stage of the project is syntactic analysis, which is an essential part of the process. The open-source C# compiler platform Roslyn was used to create ASTs from raw code text. A tree in Roslyn is represented as a C# object with rich querying methods. Figure 3.2 shows the structure of an AST produced by Roslyn and what it is composed of.


Figure 3.2: A high level visualization of an Abstract Syntax Tree produced by the open source compiler platform Roslyn.

Each node of the tree contains an attribute called SyntaxKind, which describes the type of the construct. For example, a C# if-statement has SyntaxKind.IfStatement as its SyntaxKind attribute. This attribute, along with other tree node attributes, can be used in the querying methods to extract a variety of different features from each code file. The features that make up a feature vector are handpicked from the ASTs.

The raw data of the code files was transformed into relevant features suitable for modeling. The following features were extracted from each AST to represent a code file:

• Class name: the name of the class containing the main method for the test.

• Data labels: IDs of variables storing the test criteria for the test case. The criteria contain the input and expected output of a test case.

• Instruments: the measuring instruments (devices) that are used in this test.

• Variables that retrieve data: the names of variables that call a method to retrieve/read some data.

• Variables that set data: the names of variables that call a method to set/write some data.


• Type of device: a categorical label that can either be Transmitter (Tx), Receiver (Rx) or other. Other indicates that the label is neither Tx nor Rx.

• Number of if-statements: the number of if-statements found in the test code.

• Number of loops: the number of loop constructs found in the test code.

• Number of lines of code: the number of lines of code in the test case source file, including empty lines.

• Number of statements: the number of statements ending with a semi- colon in the code file.

These features were the only ones selected and fed to the machine learning algorithm in the clustering phase. It was found that test cases that are similar based on the above features are more likely to be dependent. For example, test cases that use the same measuring instruments should belong in the same cluster, since they share the same resources and thus a kind of dependency is involved. Also, when analyzing the test scripts, it was noticed that test cases with similar functionality tend to have about the same number of if-statements, loops, and lines of code. Therefore, it was decided that these numerical code attributes should be included in the feature vectors.

All these features were obtained from the ASTs by using Roslyn's querying methods. When all features had been collected for each code file, a matrix was constructed as a CSV file. The rows of the matrix represent the C# code files and the columns represent their features.
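A minimal sketch of this matrix-construction step is shown below, using pandas; the file names and feature values are hypothetical stand-ins for what the Roslyn queries return.

    import pandas as pd

    # One row per test code file; values are invented placeholders for the
    # features extracted from the ASTs.
    rows = [
        {"file": "TestA.cs", "class_name": "TxPowerTest", "device_type": "Tx",
         "instruments": "instrument_1;instrument_2",
         "n_ifs": 4, "n_loops": 2, "n_lines": 310, "n_statements": 120},
        {"file": "TestB.cs", "class_name": "RxGainTest", "device_type": "Rx",
         "instruments": "instrument_1",
         "n_ifs": 3, "n_loops": 1, "n_lines": 250, "n_statements": 95},
    ]

    # Rows = C# code files, columns = features, stored as a CSV file.
    pd.DataFrame(rows).to_csv("features.csv", index=False)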

3.4 Data Preprocessing

Before feeding the feature vectors into the clustering algorithms, the raw data needed to be preprocessed. The feature vectors contain different types of features: numerical, textual, and categorical, which means that different preprocessing techniques had to be used for each type of feature.

3.4.1 Encoding

Several of the features listed in Section 3.3 are categorical or textual, such as data labels and instruments. Many machine learning algorithms cannot operate on categorical data directly. To be able to perform clustering on the data points, these features must be encoded into a numerical representation. In this degree project, a data point's categorical labels were one-hot encoded, meaning that each categorical label was transformed into a binary label, where the value 1 indicates that the label exists in that data point and the value 0 indicates that it does not. Figure 3.3 shows an example of the transformation of one of the categorical features of the data, Instruments. It can be seen that the instruments that exist in a test case are denoted with a value of 1 after the transformation.

Figure 3.3: Before and after one-hot encoding the Instruments feature of the test cases.
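A sketch of this encoding step is given below using scikit-learn's MultiLabelBinarizer, which handles features that take several values per test case, such as Instruments; the instrument lists are made up.

    from sklearn.preprocessing import MultiLabelBinarizer

    # Each test case may use several instruments; names are placeholders.
    instruments_per_test = [
        ["instrument_1", "instrument_2"],  # TC1
        ["instrument_2"],                  # TC2
        ["instrument_1", "instrument_3"],  # TC3
    ]

    mlb = MultiLabelBinarizer()
    encoded = mlb.fit_transform(instruments_per_test)
    print(mlb.classes_)  # ['instrument_1' 'instrument_2' 'instrument_3']
    print(encoded)       # [[1 1 0]
                         #  [0 1 0]
                         #  [1 0 1]]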

3.4.2 Dimensionality Reduction

To minimize the consequences of high-dimensional data, some common dimensionality reduction techniques were used. One simple technique is feature selection, the process of automatically or manually selecting the features that contribute the most to the output. In this project, features with very low variance were removed. Specifically, the only features that were removed were the categorical features that occurred in only one test case, since they do not contribute to a better clustering of the test cases.

Furthermore, the most important technique used to reduce the dimensionality of the data is PCA. The principal components help in summarizing the original data set into a smaller set containing representative variables that collectively explain most of the variability in the original set. Figure 3.4 shows an example of a PCA transformation applied to a two-dimensional data set. The vectors in the left image of Figure 3.4 represent the principal axes of the data set, and their lengths indicate the importance of each axis in describing the distribution of the data, i.e. the variance of the data.


Figure 3.4: A demonstration of PCA applied to a two-dimensional data set. The left image visualizes the principal axes in the original data. The right image shows the transformation from the data axes to the principal axes. Images are taken from [37].

Using PCA for dimensionality reduction involves excluding one or more of the smallest principal components. This way, a lower-dimensional projection of the data is obtained while preserving the maximal data variance [37]. In this degree project, the first two components of PCA were used to transform the data. These two components captured most of the variance of the data and enabled a visualization of the clusters in a two-dimensional plane. Before PCA was performed, the variables were centered to have mean zero and scaled to have standard deviation one. This is because the original variables are not measured on the same scale, which can lead to the principal components being dominated by a single variable that happens to have the largest variance.
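The sketch below shows this standardize-then-project step with scikit-learn; the random matrix X stands in for the real encoded feature matrix.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 50))  # placeholder for the encoded feature matrix

    X_scaled = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
    pca = PCA(n_components=2)                     # keep the first two components
    X_2d = pca.fit_transform(X_scaled)

    print(X_2d.shape)                     # (400, 2)
    print(pca.explained_variance_ratio_)  # variance captured per component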

3.5 Clustering

After the syntactic analysis phase, a set of high-dimensional vectors (representing the feature vectors of each test case) was generated. These vectors were encoded and dimensionality-reduced during the preprocessing phase of the approach. Next is the clustering phase, where the goal is to group similar feature vectors (test cases) into the same cluster, so that dependent test cases end up in the same cluster and dissimilar test cases belong to different clusters.

In real-life testing projects, not all test cases are dependent on or similar to each other. Therefore, a set of independent test cases needs to be produced. Independent test cases can be executed in no particular order and can be removed from a test suite without affecting the other test cases.

To perform the clustering, two clustering algorithms were applied and evaluated: HDBSCAN and K-means. The HDBSCAN algorithm measures the distance between the vectors using Euclidean distance and automatically provides the number of clusters, the clusters themselves, and a set of non-clusterable data points. The main reasons for choosing HDBSCAN are its capability to produce a set of non-clusterable data points, i.e. independent test cases, and its good ability to deal with high-dimensional data. Although HDBSCAN determines the number of clusters automatically, the minimum size of a cluster can be chosen by the user. This influences the number of clusters in the results, since a higher minimum size results in fewer clusters and vice versa. It was decided that the minimum cluster size should be set to 2, i.e. each cluster contains a minimum of 2 dependent test cases. This is because the testing experts at Ericsson found that the granularity of a group of dependent test cases could be as small as 2. Aside from the minimum cluster size parameter, the default behavior of the algorithm was used for the rest of the parameters. The different parameters and their default settings can be found in the API reference of HDBSCAN's primary class [38].
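A minimal sketch of this HDBSCAN step, using the hdbscan library cited above [38], might look as follows; the random X_2d stands in for the PCA-reduced feature matrix.

    import hdbscan
    import numpy as np

    X_2d = np.random.default_rng(0).normal(size=(400, 2))  # stand-in for PCA output

    # min_cluster_size=2: a cluster holds at least two dependent test cases.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=2)
    labels = clusterer.fit_predict(X_2d)  # Euclidean distance by default

    # Label -1 marks non-clusterable points, read here as independent test cases.
    n_independent = int((labels == -1).sum())
    n_clusters = int(labels.max()) + 1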

In this study, HDBSCAN is compared to the K-means clustering algorithm. K-means is simple to implement but, unlike HDBSCAN, the number of clusters needs to be pre-defined. The choice of this parameter was guided by the number of clusters that HDBSCAN produced and the desired granularity of the cluster sizes. The parameter K was therefore set to 110, the same number of clusters that HDBSCAN generated (excluding the outliers cluster). Also, K-means does not produce a non-clusterable set, which can cause distorted clusters due to the presence of outliers that do not belong to any cluster. However, K-means can produce clusters containing only one data point, which can be interpreted as independent test cases.
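The corresponding K-means step with scikit-learn could look like the sketch below, including the singleton-cluster reading of independent test cases; again the random X_2d is an assumed stand-in for the real input matrix.

    import numpy as np
    from sklearn.cluster import KMeans

    X_2d = np.random.default_rng(0).normal(size=(400, 2))  # stand-in for PCA output

    # K = 110 follows the cluster count HDBSCAN produced (outliers excluded).
    km_labels = KMeans(n_clusters=110, random_state=0).fit_predict(X_2d)

    # Singleton clusters can be read as independent test cases.
    counts = np.bincount(km_labels)
    singleton_clusters = np.where(counts == 1)[0]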

3.6 Evaluation Method

3.6.1 Ground Truth

A ground truth table for the dependencies between the test cases at Ericsson was not available at the beginning of the project and therefore had to be constructed in order to evaluate the proposed approach. The subject matter expert at Ericsson provided information on the test cases that was used to determine which of the 400 test cases were dependent and should belong to the same cluster. However, given this information, only a partial ground truth table with 135 test cases was produced, because the process of manually determining dependent test cases is very time-consuming: each of the 400 test code files had to be examined and analyzed to understand the underlying dependencies between the test cases. Also, although the proposed approach produces a set of independent test cases, an evaluation of the independent test cases was not possible, since the partial ground truth table did not include independent test cases.

3.6.2 Performance Measurements

In order to compare the results achieved by the proposed approach with the ground truth and obtain quantitative measurements, a type of pairwise testing was performed. All possible pairs of test cases in the ground truth were considered and compared with the predictions made by the clustering algorithms. Given the total number of 135 test cases, there are $\binom{135}{2} = 9045$ test case pairs. Each pair of test cases is binary-labeled as either belonging to the same cluster or not. A positive pair of test cases denotes that the two test cases are dependent or similar enough to belong in the same cluster. A negative pair of test cases denotes that the test cases should not be in the same cluster.
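The sketch below illustrates this pairwise labeling on a four-test toy example; the two cluster assignments are hypothetical, and with 135 test cases the same loop runs over all 9045 pairs.

    from itertools import combinations

    # Hypothetical ground-truth and predicted cluster assignments.
    ground_truth = {"TC1": "g1", "TC2": "g1", "TC3": "g2", "TC4": "g2"}
    predicted    = {"TC1": 0,    "TC2": 0,    "TC3": 0,    "TC4": 1}

    pairs = list(combinations(ground_truth, 2))  # C(4, 2) = 6 pairs here
    tp = fp = fn = tn = 0
    for a, b in pairs:
        actual_pos = ground_truth[a] == ground_truth[b]  # same cluster in truth
        pred_pos = predicted[a] == predicted[b]          # same cluster predicted
        tp += actual_pos and pred_pos
        fp += (not actual_pos) and pred_pos
        fn += actual_pos and (not pred_pos)
        tn += (not actual_pos) and (not pred_pos)
    print(tp, fp, fn, tn)  # 1 2 1 2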

A confusion matrix was constructed in the pairwise testing phase, where the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were calculated. Given the confusion matrix, all performance measurements relevant for this project can be obtained.

The pairwise comparison approach increases the imbalance in the data set, since the number of negative pairs becomes much larger than the number of positive pairs. Accuracy, shown in Equation 3.1, can as a metric lead to misleading results on imbalanced data sets, because it tends to reflect the majority class well while behaving poorly on the minority class [39].

\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3.1}
\]
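To illustrate with the numbers from this study: the partial ground truth yields 135 positive pairs out of 9045, so a degenerate model that labels every pair as negative would reach an accuracy of $\frac{9045 - 135}{9045} \approx 0.985$ while detecting no dependencies at all.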

In this project, the positive pairs are considered to be of higher importance. Therefore, it was decided to exclude accuracy and use F1 score as the main performance measure to evaluate the results of the clustering algorithms. F1 score is the harmonic mean of Precision and Recall. In this study, Precision denotes the number of correctly detected dependencies divided by the total number of dependencies detected by the proposed approach. Recall denotes the number of correctly predicted dependencies over the total number of existing dependencies (in the ground truth). Precision and Recall [25] are described in Equations 3.2 and 3.3, respectively. Recall, Precision, and F1 score all take values between 0 and 1, where 1 is the best value and 0 the worst.

\[
\text{Precision} = \frac{TP}{TP + FP} \tag{3.2}
\]

\[
\text{Recall} = \frac{TP}{TP + FN} \tag{3.3}
\]

Using Precision and Recall, F1 score can be calculated according to Equation 3.4:

\[
\text{F1 score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.4}
\]
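A small sketch of how Equations 3.2 to 3.4 translate into code is given below, using the HDBSCAN counts reported later in Table 4.1 as a usage example.

```python
# Sketch: Precision, Recall and F1 score from pairwise confusion counts.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    precision = tp / (tp + fp)                          # Equation 3.2
    recall = tp / (tp + fn)                             # Equation 3.3
    f1 = 2 * precision * recall / (precision + recall)  # Equation 3.4
    return precision, recall, f1

print(precision_recall_f1(tp=87, fp=24, fn=48))  # approx. (0.784, 0.644, 0.707)
```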

3.7 Hardware and Software

The data set in this degree project was not large, thus the approach did not require heavy computations to be performed. The hardware resources used for conducting the experiments were:

• CPU: Intel® Core™ i7-8650U @ 1.90 GHz (running at 2.11 GHz)

• Memory: 32 GB

Open source Python software modules were used to implement the clustering algorithms. For the HDBSCAN algorithm, the primary class of The hdbscan Clustering Library [38] was used. For the implementation of K-means and PCA, different modules of the Python package Scikit-learn [40] were utilized.
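As an illustration of how these modules fit together for the cluster plots shown in Chapter 4, a sketch of the 2-D PCA projection follows; matplotlib is assumed for the plotting, and the file names are hypothetical.

```python
# Sketch: project feature vectors onto the two principal components with
# the highest variance and plot the clusters (cf. Figures 4.5 and 4.6).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X = np.load("feature_vectors.npy")      # hypothetical feature-vector file
labels = np.load("cluster_labels.npy")  # hypothetical cluster assignments

coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10, cmap="tab20")
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()
```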


Chapter 4

Results

In this chapter, the results of the experiments are presented. The first section includes statistics obtained by analyzing the data. This is followed by a section containing the clustering results and performance measurements of the approach for both algorithms.

4.1 Data Statistics

The choice of features is critical when determining the clusters of dependent test cases. Analyzing the raw data, i.e. the 400 code files, informed the choice of features and preprocessing techniques used in this thesis. In this section, some statistics on the data are presented.

Figure 4.1 shows the top 10 most frequent data labels and the proportion of the 400 test scripts they occur in.


Figure 4.1: The relative frequencies of the top ten most occurring data labels.

In Figure 4.2 the relative frequencies of the instruments are shown. It can be seen that instrument_1 and instrument_2 occur most frequently in the test scripts, i.e. these are the most used instruments during the testing process.

Figure 4.2: The relative frequencies of the different instruments picked up during the feature extraction phase.

Figure 4.3 consists of 4 histograms showing the statistics of the numerical features, i.e. number of if-statements, number of loops, number of lines of code, and number of statements.

Figure 4.3: The statistics of the numerical code features. Top left: relative frequencies of the number of if-statements in the test scripts. Top right: relative frequencies of the number of loops in the test scripts. Bottom left: relative frequencies of the number of lines of code in the test scripts. Bottom right: relative frequencies of the number of statements in the test scripts.

Figure 4.4 is a pie chart showing the proportion of the test scripts in which Tx, Rx, or another device type is used for testing. It is clear that most of the test scripts use other device types, about which no further knowledge was available.


Figure 4.4: A pie chart describing the distribution of the device type feature across the 400 test cases.

4.2 Clustering Results

In this section, the clustering results achieved by HDBSCAN and K-means are presented.

4.2.1 HDBSCAN

Table 4.1 shows the confusion matrix obtained from evaluating the results of the approach when using the HDBSCAN algorithm. The confusion matrix contains the number of True Positives, True Negatives, False Positives, and False Negatives obtained from the pairwise comparison with the ground truth.

Table 4.1: Confusion matrix for the HDBSCAN algorithm.

                    Predicted Positive    Predicted Negative
Actual Positive     TP = 87               FN = 48
Actual Negative     FP = 24               TN = 8886


Using the confusion matrix in Table 4.1 and Equations 3.2, 3.3 and 3.4, Precision, Recall and F1 score for HDBSCAN have been calculated. The values for these metrics are presented in Table 4.2. It is important to note that the HDBSCAN algorithm has no random component, meaning that it produces the same values across multiple runs.

Table 4.2: Precision, Recall and F1 Score metrics calculated for the HDBSCAN algorithm.

             Precision   Recall   F1 score
HDBSCAN      0.784       0.644    0.707

The number of clusters obtained from the HDBSCAN algorithm is 110 (excluding the outliers cluster) and the average cluster size is 3.018. Also, 68 data points were clustered as outliers, i.e. independent test cases.

Figure 4.5 shows the clusters produced by the HDBSCAN algorithm, where each color-shape represents a cluster of test cases and the small gray circles represent independent test cases.

Figure 4.5: A visualization of the most dense area of the clustered test cases using the HDBSCAN algorithm for 400 test cases. Each color-shape represents a cluster of test cases and the small gray circles represent outliers. The axes represent the two principal components with the highest variance obtained from PCA.


4.2.2 K-Means

A confusion matrix for the K-means algorithm was also constructed and is shown in Table 4.3. The parameter K was set to 110 as described in Section 3.5. The algorithm converges when the Euclidean norm of the difference between the cluster centers of two consecutive iterations is less than $10^{-4}$, and the number of iterations does not exceed 300.

Table 4.3: Confusion matrix for the K-means algorithm.

                    Predicted Positive    Predicted Negative
Actual Positive     TP = 85               FN = 50
Actual Negative     FP = 33               TN = 8877

The values for F1 score, Recall and Precision for the K-means algorithm are shown in Table 4.4. The values are calculated using Equations 3.2, 3.3 and 3.4, where the numbers of TP, FN and FP elements are obtained from the confusion matrix in Table 4.3.

Table 4.4: Precision, Recall and F1 Score metrics calculated for the K-means algorithm. The values shown are the average across 10 runs of the K-means algorithm.

             Precision   Recall   F1 score
K-means      0.720       0.630    0.672

An average cluster size of 3.6 was obtained for K-means. Figure 4.6 shows the clusters produced by the K-means algorithm, where each color-shape represents a cluster of test cases.


Figure 4.6: A visualization of the most dense area of the clustered test cases using the K-means algorithm for 400 test cases. Each color-shape represents a cluster of test cases. The axes represent the two principal components with the highest variance obtained from PCA.


Chapter 5

Discussion

This chapter discusses the results of the proposed approach and explores ways of using dependencies for test optimization purposes. Also, the chapter includes a section discussing the challenges and threats to validity of this study and a section discussing ethical and sustainability aspects of the study. Finally, a section explaining how this research can be further improved or built upon is included.

5.1 Analysis of Results

In this study, two models have been implemented to label test case code in order to detect dependencies between test cases based on the similarity of their syntactic features. By performing a pairwise comparison with a partial ground truth labeling, it was possible to obtain performance measurements of the two models' predictions in the form of Recall, Precision, and F1 score. With regard to the research question, i.e. to what extent dependency information can be extracted from test scripts, the performance measurements of the machine learning models show promising results in the ability to identify dependent test cases from test code. The best result was obtained when using the HDBSCAN algorithm, with an F1 score of 70.7%. In this section, the results from Chapter 4 are discussed and analyzed in further detail.

A large and vital part of this study was choosing the features to be used during the clustering phase of the pipeline. For this, the data had to be analyzed and the testing experts at the host company Ericsson had to be consulted. To answer the research question, we had to first get a sense of how the depen-

