
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2020

Improving Software Testing in an Agile Environment

JÉRÔME DE CHAUVERON

KTH ROYAL INSTITUTE OF TECHNOLOGY


Abstract

Software development has evolved at an ever-increasing pace over the past years; one of the forces behind this acceleration is the move from on-premise applications to cloud-based software: Software as a Service (SaaS). Cloud computing has changed the way applications are deployed, used, and tested. Time-to-market for software has decreased from the order of months to the order of days. Furthermore, source code management based on Git (introduced in 2005) changed the way software is developed, allowing for collaborative work thanks to automatic merging and versioning. Additionally, continuous integration (CI) tools built on top of Git facilitate regular testing, building, and deployment. Nevertheless, despite being necessary, integration tools require an extensive amount of cloud resources, which may stretch the duration of code integration.

The goal of this thesis is to optimize the speed of the CI pipeline and to improve the software tests while optimizing the load on the cloud resources.

As a result of the work of this thesis, the completion time of the software tests has been decreased by 21%, and the total continuous integration completion time decreased by 18%. Furthermore, new bugs and anomalies were detected thanks to improved software tests and new approaches for the emulation of extreme scenarios. The bugs were corrected, making the system more resilient and improving the user experience.


Sammanfattning

Software development has accelerated at an ever-increasing pace in recent years. One of the forces behind this acceleration is the transition from on-premise applications to cloud-based software: Software as a Service (SaaS). Cloud computing has changed the way applications are deployed, used, and tested. The time for software to reach the market has decreased from months to days.

Furthermore, source code management based on Git (introduced in 2005) has changed how software is developed, enabling collaboration thanks to automatic merging and version control. In addition, integration tools (CI) built on top of Git facilitate regular testing, building, and deployment. Although they are necessary integration tools, they require an extensive amount of cloud resources, which can stretch the duration of integration.

The goal of this thesis is to optimize the speed of the CI pipeline and to improve the software tests while optimizing the load on the cloud resources.

As a result of the work in this thesis, the completion time of the software tests has been reduced by 21%, and the total completion time of continuous integration has been reduced by 18%. In addition, new bugs and anomalies were discovered thanks to improved software tests and new approaches for emulating extreme scenarios. The bugs were corrected, making the system more resilient with an improved user experience.


Acknowledgements

I would like to thank my supervisor Shatha Jaradat for her hard work, time, and guidance throughout this thesis.

I would also like to thank my manager and colleagues at Dassault Systèmes for their cooperation, advice, and friendship. It has been a very challenging experience.

I would like to thank my examiner Mihhail Matskin.

February 9, 2020, Jérôme de Chauveron


Contents

1 Introduction
1.1 Motivation
1.2 Background
1.3 Problem
1.4 Purpose
1.5 Goal
1.6 Research Methodology
1.7 Delimitation
1.8 Ethics and Sustainability
1.9 Outline

2 Background and Related Work
2.1 Background on Software testing
2.2 Pipeline Description
2.3 Code and Testing Metrics
2.4 Integration Testing & Functional testing
2.5 Fuzzy testing Background
2.6 Load testing Background
2.7 Literature review and related work

3 Methods
3.1 Code coverage for End-to-End testing
3.2 Automating Coverage for End-to-End tests
3.3 Background and Set-up description
3.4 Implementation description
3.5 Load Testing Tools Comparison

4 Experiments
4.1 Functional test traces analysis
4.2 Parameters influencing failure rate
4.3 Gitlab Runner usage optimization
4.4 Troubleshooting and Fuzz testing Results and Analysis

5 Conclusion and future work


List of Figures

1 Software testing Pyramid [1]
2 Git practices description [3]
3 CI/CD Pipeline description
4 Representation of process arrival in Gitlab [6]
5 Example of Cyclomatic number computation
6 Code Example for Halstead complexity computation
7 Representation of an End to End Test scenario [5]
8 Code example of an unreachable path via blackbox testing
9 SAGE algorithm description [14]
10 Flow Chart of the AFL algorithm
11 GAN architecture [21]
12 Instrumenting code, an example
13 HTML Code Coverage Report
14 Troubleshooting setup description
15 Evolution of Katalon test duration
16 Pipeline execution time distribution
17 Job triggering pipeline failure distribution
18 Pipeline success as a function of the number of .ts files modified
19 Merge Pipeline total execution time evolution
20 Example of an asynchronous request failure
21 Example of a linear size increase of factor 2 on a JSON of depth 2
22 Example of a recursive size increase of factor 2 on a JSON of depth 2


List of Tables

1 Comparison of different End-to-End testing tools
2 Comparison of the different tools used for troubleshooting
3 Comparison of the different inter-process communication tools
4 Comparison of the different tools for load testing
5 Correlation between code metrics and pipeline success rate


List of acronyms

API Application Programming Interface
AWS Amazon Web Services
CI/CD Continuous Integration / Continuous Deployment
HTTPS Hypertext Transfer Protocol Secure
JSON JavaScript Object Notation
IDE Integrated Development Environment
OS Operating System
QA Quality Assurance
REST REpresentational State Transfer
SSL Secure Socket Layer
TCP Transmission Control Protocol
UI User Interface
VM Virtual Machine


1 Introduction

1.1 Motivation

The DevOps approach and the Agile methodology for software development have had a profound impact on the way software is developed, shipped, and deployed. The era of mainframes (1970-1980) was defined by technologies such as Cobol and Multiple Virtual Storage (MVS), with 1-5 year release cycles and extremely high risks in terms of meeting customer needs. Then came the client/server era (1990s), still with high risk and relatively slow release cycles of 3-12 months; the current cloud era is pushing the release cycle down to the order of days. This new approach pushes for faster release cycles and thus drives quality analysis and software deployment towards more automation, as manual testing is tedious, lengthy, and expensive. This thesis was conducted during the development of the UI of a web application, referred to as the web-app, with the purpose of improving software testing and the CI pipeline efficiency.

1.2 Background

Software quality is a leading concern for companies, as a bug can be extremely expensive in terms of the time spent correcting it and its impact on the client. To maintain a certain level of quality, tests are used to certify that the system can handle a typical user scenario (with different network configurations). Moreover, the system needs to remain functional despite unreachable servers, high network congestion, or unexpected API responses. Those conditions may occur in a real-world scenario despite being very unlikely to arise in a testing environment.

The objective is to have the most relevant tests possible and to have them automated while using the minimum of resources in terms of VMs.

Relevant means that the tests need to go through as much of the system as possible and remain close to real-world usage. Moreover, the tests should assess that the system is functional in various scenarios.

The studied system contains multiple types of tests; the ones detailed in this thesis are the following:

• Functional test

• Load Test

• Unit Test

• Chaos testing

To evaluate and improve the relevancy of those tests we are using:

• Code coverage: a metric indicating how much of the code is executed by a test (it can be applied to any type of test, but in our case it is designed for functional testing). It is the percentage of lines executed during the test, together with the line-per-line execution detail. The objective is to maximize the percentage of executed code and to complement the functional tests with unit tests for the parts of the code not executed during the functional tests.


• Mitmproxy: a proxy to modify on the fly the responses given by the server, in order to test the resilience of the UI

• Netem: a network emulation tool adding delay, loss, and packet re-ordering, and breaking connections1

The code management platform used is GitLab, which enables us to extract traces of the activity and the pipeline history; these traces will be used to optimize the continuous integration pipeline.

1.3 Problem

A continuous integration pipeline enables software to be built, tested, and integrated into the master branch continuously, enabling more collaborative work and a faster release cycle.

However, a CI pipeline is expensive to set up and maintain due to the different jobs required: deployment of the code, End-to-End tests, unit tests, and builds. All those jobs require VMs, and maintaining them is a tedious process. Moreover, the CI pipeline contains E2E tests checking the entire system, but the E2E tests are lengthy in terms of execution time and hard to maintain. On top of this, those tests need to remain relevant and evolve with the system; otherwise a bug won't be detected when code is pushed, making it harder to locate and correct.

Finally, due to their execution time, the jobs contained in the continuous integration pipeline may delay its test feedback loop.

This would significantly decrease developers’ productivity and impact the release cycle.

1.4 Purpose

In this master thesis we evaluate improvements to the CI pipeline efficiency and possible enhancements of the software tests. Furthermore, we intend to keep cloud resource usage low by running as few VMs as possible.

In the first place, we use functional testing combined with metrics on the code, namely cyclomatic complexity and code coverage. It is important to ensure the relevance of the tests in terms of how much of the code is being tested, the variety of the scenarios, and the different paths in the control structure of the code.

The motivation behind this is to improve the code quality and to accelerate the CI pipeline execution time, which is crucial for the developers' productivity and the product development timeline. Furthermore, we use different troubleshooting techniques: adding delay, connection loss, altering JSON, and emulating that a service is down, with the aim of testing all the possible scenarios that would not be explored by a traditional functional test or by most manual tests.

Troubleshooting is performed to meet the user experience requirements of the web-app in extreme conditions (e.g., high latency, down services). Due to the heterogeneity of the cloud environment, the consistency of the different services used by the web-app cannot be guaranteed.

Troubleshooting is performed with the purpose of improving the end-user experience in real-world scenarios that may not be reproducible in a testing environment.

1 Netem is a tool acting directly on the qdisc table of the operating system (it works only on Linux-based operating systems): https://wiki.linuxfoundation.org/networking/netem


Moreover, we are monitoring the VMs used to complete those jobs with the intention of reducing the number of cloud resources required for the continuous integration pipeline.

We defined the goals as explained in the coming section.

1.5 Goal

The goals of this thesis are automating the software tests and improving them: their execution time and their relevance compared to both typical real-world usage and extreme scenarios.

Another goal is testing as much of the system as possible in terms of the quantity of code executed.

Hence the final objective is to increase the quality of the code and the resiliency of the system, and to reduce the CI pipeline completion time. As a matter of fact, fast and continuous test feedback is crucial for the developers' productivity.

In order to reach the defined goals the following work is completed:

• Testing the system in extreme scenarios: high latency and unavailability of certain resources

• Testing the system with inconsistent data structures coming from the REST API

• Introducing code coverage with functional and End-to-End testing

• Introducing Fuzz testing to test the system with unexpected input

• Performing load testing on the system

The ultimate goal is to run those jobs on a minimum number of runners while keeping the completion time low. In fact, we want the test feedback to reach the developers as fast as possible.

1.6 Research Methodology

The methodology used in this thesis is a quantitative research approach based on the following:

• Experiments gathering values on the performance before and after the different improvements, both on the continuous integration pipeline and on the End-to-End tests.

• Comparing the improvements with the original system and with the established state of the art in software testing research.

1.7 Delimitation

It is essential to note the material outside of the scope of this thesis. First, the work of this thesis concentrates on software testing and CI optimization.

However, the software being developed, referred to as the web-app, won't be studied in this thesis, either for the UI or for the back end. Moreover, the optimization of cloud resource allocation outside of the software delivery pipeline is also out of the scope of this thesis.

Furthermore the deployment of the web-app in production (Continuous Delivery) is also outside of the scope of this thesis.


1.8 Ethics and Sustainability

The data analyzed in this thesis is collected from the code management tool (GitLab)2 used for the development of the web-app. The data were anonymized, such that the data set cannot be used to retrieve per-user information such as the number of lines committed or the pipeline failure rate. Nevertheless, as this data set contains confidential information (code, commit names), it will remain within the company for confidentiality and intellectual property reasons. Nonetheless, all the details on how this data is used and analyzed are given in full in this thesis and can be reproduced on any other GitLab project. Finally, regarding the social, economic, and sustainability implications, there are no major negative implications, as the data retrieval is performed only once and the data set is fully anonymous. Moreover, a favorable implication of this thesis is the performance increase of the deployment pipeline and the optimization of cloud resource usage.

1.9 Outline

This thesis is divided into the following chapters:

• Introduction Introduction to the field of research - Chapter 1

• Background and Related Work A comprehensive summary of the research problem - Chapter 2

• Methods The methods and tools used for this thesis - Chapter 3

• Experiments The experimental setup, the results, and their analysis - Chapter 4

• Conclusion Conclusion and future work, wrapping up the thesis - Chapter 5

2GitLab is a web-based platform with CI/CD tools on top of a Git-repository.


2 Background and Related Work

2.1 Background on Software testing

Detailed below are the different types of software tests:

• Unit test is a test for functions or small fractions of the code. It does not take context or any external resources (database dependencies) into account. Unit tests verify small, independent parts of the code based on preconditions.

• Fuzz testing (fuzzy testing) is a test that modifies a program's input, while remaining within the required input structure, and monitors for any unexpected output or error. Fuzz testing, or fuzzing, is mainly used to detect security breaches or vulnerabilities and is one of the fastest-evolving research fields in software testing.

• Smoke test is a test verifying that the basic functionalities are working as planned.

• Integration test is a test similar to a unit test, besides the fact that it tests both the system and its dependencies. For instance, in the case of a function calling a database, a unit test wouldn't call a real database whereas an integration test would.

• Functional test is a test similar to an integration test in the sense that it checks more than one component of the system at a time, for instance the UI and the database. However, a functional test also checks the values returned by those different components against the product definition (e.g., an integration test would query a database while a functional test would check the value displayed in the User Interface). Note that a functional test can be a model-based test: the expected behavior of the system under test is defined in a model (UML or SysML) and an abstract test suite is created based on the model; afterwards, actual tests are generated for the system.

• End to End (E2E) test is a test that verifies the complete set of functionality of a system by emulating the behavior of a user. In the case of the web-app, the test is performed using a browser automation tool (e.g., Katalon).

• Regression test is a test triggered after a change in the system (such as a new or updated library/framework, new code, or refactoring). It checks whether this change generates an error or unexpected behavior.

• Security test is a test that checks any potential security breach or flaw in the system to ensure its security.

• Performance/Load test is a test that puts the system under high load and monitors its performance. It is performed with JMeter in the case of our web-app.

• "Chaos test" is a test that troubleshoots the system and assesses how it responds in the worst conditions: services down, extreme delays, servers not responding, or incoherent data structures in an API response. It is used to increase software resilience by maintaining a decent user experience in extreme scenarios. This concept was introduced by Netflix with the Simian Army and the Chaos Monkey test framework (emulating AWS instance/region failures).


• Manual Test is a test completed by a QA engineer to explore the system manually, without the use of any external automation tool.

Figure 1: Software testing Pyramid [1]

This set of different types of tests is often referred to as the software testing pyramid [2] (as shown in Figure 1), as it is the most basic, simple, and efficient testing framework. It contains at its core the most basic tests, unit tests, which verify only small fractions of the code, and at the highest level the tests that are the most difficult to maintain: integration, functional, and End-to-End tests, which check the entire system. Additionally, note that unit tests are not enough to verify a system: unit tests cannot check multi-threaded code, private methods, or the compatibility of different elements.

High-level tests require the whole app to work, but break as soon as one of the components is down, making them not fully reproducible and hard to maintain. Furthermore, for the scope of this thesis, we focus mainly on End-to-End, chaos, fuzz, and performance testing, even though other types of tests (smoke, regression, unit) are also being used.

Having tests at the integration level is essential to maintain the quality of the system; moreover, automating them decreases the risk of human error and the cost in terms of human resources, and enables faster testing, even though some tests might not be suited to automation tools.

The approach taken for testing here is Behavior-Driven Development (BDD) [8], which suits End-to-End testing best. Moreover, there are also ways to prevent bugs in the first place with code parsing, e.g. tools such as DeepCode, TSLint, or Sonar3. For the development of the web-app we are using TSLint, as it is made explicitly for TypeScript code, whereas DeepCode or Sonar are more generic and harder to integrate into the pipeline. Those tools are based on general best practices, such as keeping functions small and with a small number of arguments and variables, as well as keeping the cyclomatic complexity low (detailed, for instance, in the book Clean Code [7]). Those tools give recommendations on the code without executing it; thus they do not replace software testing. Figure 2 gives a detailed diagram of which code is executed in the different scenarios of the CI pipeline.

3 Static code analysis tools: DeepCode https://www.deepcode.ai/, TSLint https://palantir.github.io/tslint/, Sonar https://www.sonarqube.org/

2.2 Pipeline Description

First of all, the development and Git practices are detailed in Figure 2.

Figure 2: Git practices description [3]

Each developer has one branch, and in most cases each branch is created to resolve a Jira story4. Then, after a story is finished, a merge request is created against the master branch. It is either approved or rejected by a developer other than the one working on the branch, in the interest of having an external view on the code and keeping developers updated with the new code.

The code on the master branch is then used for the deployment in production. However, this is not done through GitLab, and it won't be discussed in this thesis, as it is a lengthy and complicated process on its own.

Each of the actions detailed previously triggers a different pipeline containing different jobs (test, build, deployment). The CI pipeline is composed of the following steps, as detailed in Figure 3; note that this pipeline was specifically designed for the development of the web-app. In red are the improvements to the pipeline completed during this thesis.

Furthermore, the End-to-End test completion time was decreased as a result of the work for this thesis. Each step (represented as a box in Figure 3) contains multiple jobs, such as deployment to a server or tests. This pipeline is triggered by 3 different types of events: push, merge, and production deployment via a cron job. Each event triggers a different type of pipeline. As the number of Git pushes is considerably larger than the number of merges, the pipeline triggered by a push contains far fewer jobs, in the interest of reducing the load on the runners and the feedback loop time for the developers. Note that in the CI pipeline, to move from step n to step n + 1, step n needs to complete successfully (similarly to the End-to-End Katalon test described in Figure 7).

Each step is composed of different jobs that run on GitLab Runners. Each job is labeled with tags, and runners are given a set of tags (such as Linux, Windows, Deployment, UnitTest); a job is then executed on a runner with the corresponding tag.

This can lead to an unfair load balance between the runners.

4Jira is a project management tool developed by Atlassian used for agile software development


Figure 3: CI/CD Pipeline description

The allocation and matching of a job to a runner can be described as the following birth-death process, a specific case of a Continuous-Time Markov Chain, as shown in Figure 4.


Figure 4: Representation of process arrival in Gitlab [6]

Note that this Markov chain only describes a single runner, with the state of the system described by the two values (a, b): a being the number of jobs being processed and b the number of jobs in the queue. λ is the birth rate, i.e. the rate at which jobs arrive at the runner, and µ is the death rate, i.e. the rate at which the runner executes jobs. In this case, the birth and death rates are equal for all states, as all jobs are considered to have equal execution time (in fact each runner mainly runs one type of job). Moreover, the number of runners remains invariant for the scope of this thesis. However, the arrival rate depends on the time. The goal in optimizing runner resources is to have as few runners as possible sitting idle while keeping the waiting time as low as possible.
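To make this trade-off concrete, the sketch below simulates a single runner as a simple M/M/1-style queue and reports its idle fraction and mean job waiting time. It is an illustration only, not the GitLab scheduler itself; the arrival rate, service rate, and job count are made-up parameters.

# Minimal sketch: one runner modeled as an M/M/1 queue, to estimate how
# idle it is and how long jobs wait. All rates below are assumptions used
# for illustration, not values measured in this thesis.
import random

def simulate_runner(lam=0.8, mu=1.0, n_jobs=100_000, seed=42):
    rng = random.Random(seed)
    t_arrival = 0.0      # arrival time of the current job
    t_free = 0.0         # time at which the runner becomes free
    busy_time = 0.0
    total_wait = 0.0
    for _ in range(n_jobs):
        t_arrival += rng.expovariate(lam)   # Poisson arrivals (rate lambda)
        start = max(t_arrival, t_free)      # wait while the runner is busy
        service = rng.expovariate(mu)       # exponential job duration (rate mu)
        total_wait += start - t_arrival
        busy_time += service
        t_free = start + service
    horizon = max(t_free, t_arrival)
    return {"idle_fraction": 1 - busy_time / horizon,
            "mean_wait": total_wait / n_jobs}

print(simulate_runner())

Raising lam towards mu drives the idle fraction towards zero but makes the waiting time, and hence the feedback loop, grow quickly, which is exactly the tension described above.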

GitLab is used for the code management and build part; for End-to-End testing, Katalon is used and integrated directly into GitLab (detailed in 2.2). Note that there are other tools for CI, but we are using GitLab as it is the best suited for this web-app: it is open-source and covers multiple steps of the integration process.


2.3 Code and Testing Metrics

In order to guarantee the efficiency and the relevance of our tests at the functional level, we use the following metrics:

1. Code coverage: instrumenting the code to gain insight into which lines are executed, showing what our tests do not cover.

2. Cyclomatic complexity: it represents the number of independent paths through a given program. It is defined as follows:

(i) M = E − N + 2P
(ii) M = E − N + P

with E the number of edges, N the number of vertices, P the number of connected components, and M the complexity. Formula (i) is used in the general case, while (ii) is used for strongly connected graphs.

Figure 5: Example of Cyclomatic number computation

As shown in Figure 5, the cyclomatic complexity for the left graph is 9 − 8 + 2 × 1 = 3, and for the right graph 10 − 8 + 1 = 3.

3. Halstead complexity:

η1: number of distinct operators, N1: total number of operators
η2: number of distinct operands, N2: total number of operands

The difficulty is defined as D = (η1 / 2) × (N2 / η2). It is a metric indicating how hard it is for a person to understand the code; this metric is critical as it can increase the time needed for code review and development.

The volume is defined as V = N × log2(η), with η = η1 + η2 and N = N1 + N2. The effort is defined as E = D × V and represents the time required for the actual coding. For instance, consider the code in Figure 6:


main() {
    int a = 1;
    int b = 2;
    int sum = a + b;
    printf("%d", sum);
}

Figure 6: Code Example for Halstead complexity computation

The unique operators are: main, ( ), int, =, +, the comma, the semicolon, and printf. The unique operands are: a, b, sum, "%d", 1, and 2.

η1 = 9, η2 = 6, η = 15; N1 = 16, N2 = 9, N = 25. Thus we have a difficulty of D = 6.75, a volume of V ≈ 97, and an effort of E = 6.75 × 97 = 654.75. A short script double-checking these numbers is given below.
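The check below simply plugs the counts reported above into the Halstead formulas; it is a convenience sketch, not part of the thesis tooling.

# Minimal sketch: Halstead difficulty, volume and effort computed from the
# operator/operand counts of the Figure 6 example.
import math

def halstead(eta1, eta2, n1, n2):
    eta = eta1 + eta2                 # total distinct operators + operands
    n = n1 + n2                       # total operators + operands
    difficulty = (eta1 / 2) * (n2 / eta2)
    volume = n * math.log2(eta)
    effort = difficulty * volume
    return difficulty, volume, effort

# approx. (6.75, 97.7, 659.3); rounding V down to 97 gives E = 654.75
print(halstead(9, 6, 16, 9))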

Integrating code coverage is usually done with unit tests; nevertheless, for this thesis it is implemented with End-to-End tests, thus a custom mechanism needs to be set up to retrieve code coverage from the Katalon tests.

2.4 Integration Testing & Functional testing

We represent an End-to-End test scenario as the following state machine (inspired by the model described in Testing software design modeled by finite-state machines [9]). Let:

m be the total number of test cases,
ϕ(i) the number of actions performed in the i-th test case,
P(s) the probability of failure in state s,
S the set of all possible states,
S_ij the state corresponding to test case i and action j (an action being a click or a form submission).


Figure 7: Representation of an End to End Test scenario [5]

Note that a test is considered successful if and only if the agent reaches the Success state. Thus the likelihood of success is defined as

1 − Σ_{s ∈ S} P(s)

Moreover, the different P(s) depend on the type of action taken as well as on the time at which the action was initiated. The total execution time for the test is capped at t_max = 90 minutes; thus, the later an action is performed, the more likely it is to trigger a failure. Our goal in developing this app is to decrease the likelihood of an error and thus to increase the overall resilience and user experience of the web-app. In the case of our web-app we use in total m = 82 test cases, one of them written for code coverage retrieval (executed before the last one) and one to communicate with a proxy for UI troubleshooting.
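As a toy illustration of this model, the snippet below evaluates the success likelihood for a handful of made-up per-state failure probabilities; the values are not measurements from the web-app.

# Toy sketch of the model above: likelihood of reaching the Success state
# given per-state failure probabilities P(s). The probabilities are made up.
def success_likelihood(failure_probs):
    return 1 - sum(failure_probs)

p_failure = [0.001, 0.002, 0.0005, 0.003]   # P(S_11), P(S_12), P(S_21), ...
print(success_likelihood(p_failure))        # 0.9935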

To perform the End-to-End tests, we have various options regarding the tools we can use. Detailed below in Table 1 is a concise comparison of the different End-to-End testing tools, performed for the work of this thesis:

Tool | Description | Usage
Katalon | Testing tool based on Selenium with a GUI; entirely manageable through the GUI | Can be programmed with Selenium-like scripting in Groovy, and extensions
Puppeteer | Headless Chrome for E2E testing; integrates code coverage; open-source | JS scripting
Selenium | Testing tool built with Java; open-source | Scripting with Java
PhantomJS | Testing tool fully written in JS; doesn't use any browser; open-source | JS scripting
Cypress | Testing tool working only with Chrome; open-source | JS scripting
SoapUI | Tests the UI similarly to Selenium; offers web services (includes mocking) and includes load testing | Java programming
TestCraft | Tests the UI; offers a rich graphical user interface, automatic scheduled test playback, database monitoring, and confirmation email testing | Usable through a GUI or Selenium code
Autify | Tests the UI, with a rich graphical user interface; offers integration with Jenkins and TestRail | Usable only through a GUI
Mable | Tests the UI, with a rich GUI; offers integration with Jira, Jenkins, and CircleCI, as well as Email and PDF validation and API testing | Usable only through a GUI

Table 1: Comparison of different End-to-End testing tools

For our specific requirements, we are using Katalon, as it gives both the simplicity of a clickable GUI to manage tests and the possibility to customize tests with scripting. This facilitates the integration of the troubleshooting tests (done with Mitmproxy and Netem) into the Katalon test scenarios.


2.5 Fuzzy testing Background

One other type of test used in this system is fuzz testing (also called fuzzy testing, or fuzzing). It consists in modifying the input of the system in such a way that its value is different enough from the baseline to trigger unexpected behavior. Furthermore, the system is monitored for crashes, memory leaks, or any exception thrown.

Fuzzing is usually performed in order to detect security breaches, and it can also be used to detect errors or unexpected behavior. The most relevant parts of the code to be tested are the parts accessible to non-privileged users, since they are the potential attackers of the system. The attack vector could be a file download, an API call, or an HTML form.

A Fuzzer is characterized by the following parameters:

• Mutation-based or generation-based algorithm: describes whether the fuzzer uses an input seed to generate new input values using a genetic algorithm (mutation-based), as opposed to generating new inputs internally, either from scratch or using a grammar such as a BNF specification (generation-based). Note that generating inputs this way can take much more time: the algorithm can mutate the input by flipping bits one by one.

• Input-structure aware (smart) or unaware: describes whether the fuzzer is aware of the input format expected by the system or generates an input entirely randomly.

• Program-structure aware or unaware: describes whether the fuzzer generates input that maximizes code coverage, testing as much of the code as possible (whitebox, structure aware). Such a fuzzer also tries to exercise as many branches of the control structure and function calls as possible. Alternatively, the fuzzer may use the code as a black box, fully unaware of its structure or complexity. Consequently, fuzzers unaware of the program structure tend to have lower coverage and detect fewer bugs: for instance, in Figure 8, the first if statement would have only 1 chance in 10^32 of being executed. However, a fuzzer unaware of the program structure runs much faster, as it doesn't go through the code or analyze execution traces. Hence the most efficient way to test a program through fuzzing is to start with a blackbox fuzzer, as it is fast and easy to set up, and then, if not enough bugs are discovered, to move to a whitebox fuzzer reaching more code and control paths.

if (a == 27) {
    abort();
    return 1;
}
else {
    continue();
}

Figure 8: Code example of an unreachable path via blackbox testing

Multiple fuzzing algorithms can be used; below is a comparison of the most broadly used fuzzing algorithms considered for the fuzz testing experiment on the web-app:


• AFL (American Fuzzy Lop), developed by Google: it modifies the input of a program based on techniques meant to increase the code coverage and the number of control structures executed. AFL uses a set of rules based on experiments on fuzz testing: it was shown that flipping a single bit triggers the exploration of (on average) 70 new paths in the control structure, while flipping 2 bits yields about 20 and flipping 4 bits about 10 additional paths [15].

• SAGE (Scalable Automated Guided Execution), developed by Microsoft; its high-level architecture is detailed in Figure 9. The given inputs are checked to see whether they trigger a crash; then the code coverage is computed and constraints are automatically established based on the executed code. For instance, for the statement if (x + 3 == 7) the constraint is that x equals 4. This is then given as input to a constraint solver such as the Z3 SMT solver5 [14].

Figure 9: SAGE algorithm description [14]

• DART (Directed Automated Random Testing), one of the first dynamic test generation algorithms: it analyzes traces from the program execution and code coverage to explore all the possible paths in the control structure. One of its extensions is CUTE, which handles multi-threaded programs.

• T-Fuzz (Transform Fuzzing) modifies the program itself so that the different control structures, and thus all the code, are executed [16]. Unlike the previously described algorithms, T-Fuzz doesn't modify the input. Hence it is also coverage based.

One of the most efficient fuzzers, and one of the easiest to use and set up, is American Fuzzy Lop (AFL)6. The algorithm used to fuzz-test the web-app for the work of this thesis is derived from the AFL algorithm; it is mutation based and aware of both the input structure and the program structure. The AFL genetic algorithm is described in Figure 10 (a chart created for the work of this thesis to describe the AFL algorithm).

AFL was chosen among other fuzzing algorithms since it is easy to set up for a web application, runs faster than other structure-aware fuzzers, and has a very active developer community supporting it.

5 A constraint satisfaction solver developed by Microsoft, solving logical problems.

6 First created by Michał Zalewski, AFL is now maintained by Google: https://github.com/google/AFL


[Figure 10 shows the AFL genetic loop: an initial input seed is mutated to generate a new population; a fitness score is computed; if the fitness is above the desirability threshold, the new elements are added to the population, otherwise the original population is restored.]

Figure 10: Flow Chart of the AFL algorithm

Moreover, AFL has been used in the development and testing of large-scale, established projects, revealing errors and bugs in MySQL, Mozilla Firefox, and OpenBSD.

The fuzzer we use for the web-app is derived from AFL: it modifies the top-level keys of a JSON document and checks whether a bug or an incoherence is detected (similar to computing the fitness score). The fuzzer then modifies the JSON deeper, starting from this specific key's value (similar to adding new elements to the population). The reason we use this approach is that we want to explore the JSON tree, since in a real-world scenario any key can be deleted. In addition, we want to keep the fuzzing completion time relatively low; hence we assume that if deleting a top-level key doesn't trigger a failure, we may proceed to a more critical key of the JSON that does cause a failure.
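The sketch below is a minimal, hypothetical rendition of this key-deletion strategy, not the exact fuzzer built for the thesis: it drops one key at a time, asks a user-supplied check_response callback whether the web-app misbehaved, and only recurses into keys whose removal caused a failure.

# Minimal sketch of the AFL-inspired JSON fuzzer described above.
# check_response(mutated_doc) is a placeholder callback that should replay
# the mutated payload against the web-app (e.g. through the proxy set-up of
# Section 3.4) and return True if the UI misbehaves.
import copy

def delete_at(doc, path):
    """Return a deep copy of doc with the key at `path` removed."""
    mutated = copy.deepcopy(doc)
    node = mutated
    for key in path[:-1]:
        node = node[key]
    del node[path[-1]]
    return mutated

def fuzz_json(doc, check_response, prefix=()):
    findings = []
    node = doc
    for key in prefix:
        node = node[key]
    if not isinstance(node, dict):
        return findings
    for key in list(node):
        if check_response(delete_at(doc, prefix + (key,))):
            findings.append(prefix + (key,))
            # only keys whose removal breaks the UI are explored deeper
            findings += fuzz_json(doc, check_response, prefix + (key,))
    return findings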

2.6 Load testing Background

Load testing consists of putting high demand on a system in order to test its limitations.

Those tests increase the load on:

• Network (congestion, reordering, delay)

• Database Server

• Web Server


• Load Balancer

• Client-side application

Thus for the purpose of fitting a real-world scenario, we use different types of load tests:

• Capacity test: Simulates slowly increasing and decreasing load for a reasonably long time

• Stress test: Simulates spikes or short bursts of load, to assess that the system can successfully recover after a failure.

• Scalability test: Simulates increasing loads under varying resources (number of CPUs, memory, storage, network capacity) while measuring system performance.

• Robustness test: Simulates a long period of load to assess the system's sustainability.

• Volume testing: Simulates an increasing volume of data in the system without increasing the load (for instance, increasing the number of keys without a significantly larger request).

For the scope of this thesis, we focus on Load test and Volume testing.
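For intuition, a small concurrent-load burst can be approximated with a few lines of standard-library Python; this is only a rough sketch against a placeholder URL and is not a substitute for the JMeter test plans used in this thesis.

# Rough load-test sketch: fire a short burst of concurrent requests at a
# placeholder endpoint and report the error rate and mean latency.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://localhost:8080/health"   # hypothetical endpoint

def hit(_):
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET, timeout=5) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

def burst(n_requests=200, concurrency=50):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(hit, range(n_requests)))
    errors = sum(1 for ok, _ in results if not ok)
    mean_latency = sum(t for _, t in results) / len(results)
    return {"error_rate": errors / n_requests, "mean_latency_s": mean_latency}

if __name__ == "__main__":
    print(burst())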

2.7 Literature review and related work

As discussed previously, software testing and continuous integration are a critical part of software development and an ever-evolving area of research. We can divide the related work into 3 different areas: chaos and fuzz testing, load testing, and finally pipeline trace analysis.

First, chaos testing: the research conducted in this domain (mainly by Netflix) [4] consisted in emulating cloud malfunctions for the purpose of improving back-end resilience; here our goal is to improve the front-end resilience to similar cloud and network-related issues, and more, such as inconsistent data structures, network delay, or loss. In the same way as chaos testing emulates malfunctioning or unexpected behavior from either the system itself or an outside component, fuzz testing assesses whether the behavior of the system given unexpected input is acceptable.

Fuzz testing has lately been one of the most active research areas in software testing. Note that most of this research is aimed at improving whitebox testing, as it is the most complex type of fuzz testing, and it is the type we'll study in this part. The performance of fuzz testing is described in [18]: the number of vulnerabilities discovered via a fuzzer (the AFL fuzzer) was 11, while only 3 were discovered using unit tests.

However, as mentioned in [18], not all systems satisfy the prerequisites of fuzz testing: fuzzers don't necessarily work in a multi-threaded environment, and the compiler might not support code coverage. Overall, the conclusion of this paper is that fuzz testing is cheaper than unit testing and can achieve better results; however, it might not be applicable to all software. Furthermore, it is shown that resource usage should be monitored, as an uncontrolled memory allocation would crash the system and be hard to detect.

One other active topic of research is improving whitebox fuzz testing performance.


In [17] the authors describe how fuzz testing can be improved (using the SAGE algorithm). The main takeaway of this paper is that the quality and relevance of the original seed file are crucial and prevail over the number of iterations performed: in this paper, 76% of the bugs were found using 2 to 4 iterations. Consequently, input generation is decisive for the fuzzer's performance, and to get higher-quality and more varied input, an approach proposed in An Intelligent Fuzzing Data Generation Method Based on Deep Adversarial Learning [19] is to use a Wasserstein Generative Adversarial Network (WGAN).

A GAN architecture is composed of two neural networks: a generator network (such as an auto-encoder) and a discriminator trained to recognize whether an input is real data or was synthetically generated, as described in Figure 11.

Figure 11: GAN architecture [21]

This network is used to generate network frames that are structurally similar to the training set without having received any information on the type of network protocol. Consequently, it can detect patterns in the input that were not necessarily written in the technical specification. For instance, with the EtherCAT protocol, 137 packet injection attacks, 353 man-in-the-middle attacks, and 31 working counter attacks were discovered. In addition to this, the training time is fairly short: a Test Input Accepted Rate (TIAR) of 85% after 40 epochs. Note that here the WGAN network trains faster and has a higher TIAR than a GAN network with the same meta-parameters.

Another area of research is pipeline trace analysis, which aims at improving the project's pipeline success rate and thus the deployment time. The research already done on this topic mainly concerns the release cycle and the re-usability of the code, with for instance the COCOMO (COnstructive COst MOdel) metric [11] being used as a measure of the difficulty to write and integrate the code. Furthermore, load testing and tool benchmarking is also a very active area of research in software testing. Research has been performed comparing the different types of load tests and the tools being used, and how each controls a different aspect of the platform's resilience to load [12]. Moreover, other studies were conducted to compare the different tools [13]. Both of those studies conclude that JMeter provides more accurate results and has a more accessible user interface that can be customized with plugins.

Note that for the specific needs of this thesis we’ll use JMeter7.

7 A tool used for load testing and monitoring performance, focused on web applications: https://jmeter.apache.org/


Finally, one more area of research is intelligent software testing. It uses both fuzzing and search-based exploration to explore an application. It is primarily developed by Facebook for Android application testing (the Sapienz project [20]) and was able to discover 558 unique crashes in the top 1000 most popular Android applications. As previously described, autonomous UI or functional testing is expensive and tedious to maintain; however, it usually represents the typical real-world usage of a system well.



3 Methods

3.1 Code coverage for End-to-End testing

Code coverage is defined as a measure of the quantity of code executed during a test. It gives a metric on how much of the system is being tested, hence indicating how relevant those tests are. Moreover, by having line-by-line coverage of the functional tests, the code not verified by the E2E tests can be checked with additional unit tests. Some code is "unreachable" with End-to-End testing; for instance, code executed in case of a page-not-found error is never executed during End-to-End testing. Conventionally, code coverage is performed with unit tests; here the goal is to integrate code coverage into our End-to-End tests. To obtain code coverage for the E2E tests, one possible option would be to convert all our scripts to Puppeteer or Cypress scripts and use their integrated code coverage functionality. However, as the work to convert and maintain the current test scenarios as Puppeteer or Cypress scripts is not negligible, Katalon and the current scripts are kept. Moreover, Puppeteer and Cypress do not offer all the functionality that Katalon has, such as video recording.

Thus to integrate coverage, the following 4 steps are performed:

1. Instrument the code: adding counters for every instruction, as well as one for every branch in the control structure (as shown in Figure 12). It is essential to keep the original non-instrumented files stored in a backup directory, as they will be required to display the coverage line-by-line and to recover the original code of the web-app. For this, we use a tool called Istanbul8 that automatically generates instrumented .js files. Other tools exist, such as blanket.js; however, they do not offer the ability to merge multiple coverage objects or to generate a very detailed HTML coverage report. Note that this process can be lengthy for a large project, as each file's instrumentation can take up to 7 seconds (for a thousand lines of code). Moreover, all those extra instructions add to the execution time of the code: for the web-app it is a 30-40% increase in execution time.

Original code:

function add(a, b) {
    return a + b
}
variable = true

Instrumented code:

counter.statement[0]++
function add(a, b) {
    counter.function[0]++
    counter.statement[1]++
    return a + b
}
counter.statement[2]++
variable = true

Figure 12: Instrumenting code, an example

8Istanbul is a JavaScript code coverage tool instrumenting the code (adding counters for each instruction) and converting raw coverage JSON into HTML reports https://istanbul.js.org/


2. Run the E2E tests: the tests run with a "custom capability browser" (as named in Katalon), a browser with an extension, here Custom JavaScript (CJS)9, used to inject JS code into the browser. This part of the process might take longer to run due to the instrumentation of the code; this must be taken into account when writing the tests.

3. Retrieve coverage and convert it to HTML: call a custom JS function, injected using CJS, to display the coverage object in the browser console as a JSON object. The result is an array of coverage objects (one object for each page reload); it is then merged into one using istanbul-merge. Then we use Istanbul to convert this raw JSON object into an HTML page.

4. De-instrument the code: recovering the original state of the system by deleting the instrumented directory and restoring the un-instrumented backup directory.

A sketch of how this workflow can be scripted is given at the end of this section.

After converting the coverage JSON object to HTML we obtain the following result.

Figure 13: HTML Code Coverage Report

The branch execution percentage thus usually decreases with increasing cyclomatic complexity.
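Below is a rough orchestration sketch of the four steps above. The instrumentation, merge, and report commands are assumptions about the Istanbul/nyc tooling and would need to be adapted; run_katalon_suite is a placeholder for triggering the Katalon run.

# Rough sketch of the coverage workflow (steps 1-4). The CLI commands are
# assumptions about the Istanbul/nyc tooling, not the exact invocations
# used for the thesis, and run_katalon_suite is a placeholder callback.
import shutil
import subprocess

SRC = "webapp/src"
BACKUP = "webapp/src_backup"
INSTRUMENT_CMD = ["npx", "nyc", "instrument", SRC, SRC]                      # assumption
MERGE_CMD = ["npx", "nyc", "merge", "coverage/raw", "coverage/merged.json"]  # assumption
REPORT_CMD = ["npx", "nyc", "report", "--reporter=html"]                     # assumption

def run_coverage(run_katalon_suite):
    shutil.copytree(SRC, BACKUP)               # 1. back up the original sources
    try:
        subprocess.run(INSTRUMENT_CMD, check=True)
        run_katalon_suite()                    # 2. E2E run dumps raw coverage JSON
        subprocess.run(MERGE_CMD, check=True)  # 3. merge per-reload coverage objects
        subprocess.run(REPORT_CMD, check=True) #    and render the HTML report
    finally:
        shutil.rmtree(SRC)                     # 4. de-instrument: restore the backup
        shutil.move(BACKUP, SRC)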

3.2 Automating Coverage for End-to-End tests

The objective with the code coverage is to have it included in GitLab as a cron job, running regularly. Code coverage is performed as a means to guarantee the relevance of the tests throughout the development of the web-app: it verifies that all the code is tested.

To obtain code coverage, we created a web-based protocol instrumenting and de-instrumenting the JS code on our web server, in order to be able to generate the instrumented code remotely. This protocol calls a Python script using the previously mentioned tools Istanbul and istanbul-merge.

This test runs weekly, at a time when neither of the teams is pushing code, as the instrumentation slows down the web-app with a 30% increase in JavaScript execution time.

9Chrome extension to inject JS (or CSS) code in any website,

https://chrome.google.com/webstore/detail/custom-javascript-for-web/poakhlngfciodnhlhhgnaaelnpjljija

Picking a time for the coverage test is not trivial, as running it requires a GitLab runner and a web server for 30 minutes. Moreover, some of the teams working on this system have a nine-hour time shift with Europe; thus coverage runs at 1 AM on Sundays. Mainly for this reason, coverage cannot run more than once a week.

3.3 Background and Set-up description

The web-app is deployed in a cloud environment in which we cannot control the availability of all the services or the consistency of the data structures. We need a system that can remain functional in a heterogeneous cloud environment. Thus we want to be able to emulate how the web-app would react to an incoherent data structure (such as a missing key in a data structure), an unexpected 500 response code, or a missing resource.

Tool | Language | Possible action | OSI level
Mitmproxy | Python | Intercept HTTP and HTTPS queries and responses and modify them on the fly | Application
Toxiproxy | Go | Cut TCP connections, decrease bandwidth, add latency | Network
Fiddler | JS | Modify HTTP/HTTPS responses on the fly, similarly to Mitmproxy, with scripts written in JS | Application
Netem | Shell | Update the qdisc table (on Unix systems) to add delay, packet loss, or reordering, emulating WAN network characteristics | Network
Google Chrome | GUI | Integrated with the Chrome developer tools (More tools - Network conditions); doesn't offer any automation | Application

Table 2: Comparison of the different tools used for troubleshooting

For the specific needs of this study, the tool we are using is Mitmproxy, as it gives the possibility to edit any HTTP response. It is open-source, easy to set up, programmable with Python scripts, and works seamlessly on Windows and Linux. Furthermore, Netem is used to emulate different network conditions, as it is highly configurable and can easily be automated. Overall, based on the research on existing troubleshooting tools outlined above, these are the best-suited tools for the work of this thesis.

However, the Google Chrome developer tools are also used instead of Netem in some cases, as they don't require any extra installation and are the easiest solution, but they don't include all the capabilities of Netem and won't work with another browser.

Concerning the tool used to enable communication between the proxy and Katalon, we have the options shown in Table 3. From this comparison of the different mechanisms we concluded that the best suited for our use case is named pipes, as they work with different programming languages (in our case Java, Groovy, and Python) and are compatible with both Linux and Windows, a critical point in our case as both OSes are used on the CI servers.


Mechanism | Description | OS compatibility
Shared memory | Multiple processes are given access to a block of memory; requires a synchronization mechanism | All OS
Named pipe | A pipe implemented through a file; multiple processes can read/write seamlessly | All OS
Socket | Data is sent over the local interface (loopback) | All OS
Signal | A signal is sent to another process; generally used to command a process, not to transfer data | Most OS
Pipe | A one-way buffered communication channel | Most OS

Table 3: Comparison of the different inter-process communication tools

3.4 Implementation description

As discussed previously, to test the web-app in extreme scenarios we use Mitmproxy, a Python proxy, to intercept HTTP traffic and modify it on the fly. This method also works with HTTPS connections, but the SSL certificate is then generated by Mitmproxy; hence this approach doesn't work with a system using certificate pinning (only allowing certificates issued by the Certificate Authority (CA) of the website's domain). This limitation isn't specific to Mitmproxy and also applies to other software.

Our set-up for chaos troubleshooting is detailed in Figure 14. Note that this set-up was specifically created for the work of this thesis.

Figure 14: Troubleshooting setup description
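To illustrate the kind of on-the-fly modification this set-up enables, below is a minimal mitmproxy addon sketch that drops a key from JSON API responses and randomly turns some responses into 500 errors. The target path, the dropped key, and the failure rate are hypothetical values, and the actual scripts written for the thesis may differ.

# Minimal mitmproxy addon sketch: alter JSON responses on the fly to emulate
# an incoherent data structure or a failing service. Run it with:
#   mitmdump -s chaos_addon.py
# TARGET_PATH, DROPPED_KEY and FAILURE_RATE are hypothetical examples.
import json
import random
from mitmproxy import http

TARGET_PATH = "/api/"      # assumption: prefix of the web-app's API routes
DROPPED_KEY = "items"      # assumption: a key the UI relies on
FAILURE_RATE = 0.1         # fraction of responses turned into 500 errors

def response(flow: http.HTTPFlow) -> None:
    if TARGET_PATH not in flow.request.path:
        return
    if random.random() < FAILURE_RATE:
        flow.response.status_code = 500        # emulate a service being down
        return
    try:
        body = json.loads(flow.response.get_text())
    except (ValueError, TypeError):
        return                                 # not a JSON response
    if isinstance(body, dict) and DROPPED_KEY in body:
        del body[DROPPED_KEY]                  # emulate a missing key
        flow.response.set_text(json.dumps(body))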

3.5 Load Testing Tools Comparison

For this specific set of tests, we are using Apache JMeter, even though it isn't the most accessible tool to use or set up. Nevertheless, it is open-source and free, offers numerous testing possibilities, and has an extensive community of users offering support, documentation, and code examples. This choice was made after research, performed for the work of this thesis, on the most efficient load testing tools, presented in Table 4.


Tool | Language | Possible action
Apache JMeter | Java | Increase load and volume, test SSL certificate validity; offers basic performance graphs; open-source
JMeter in Katalon | Groovy/Java | Integrates Katalon API testing in JMeter
WebLoad | Java | Increase load and volume; integrates with most CI/CD tools; offers a UI with performance graphs
Grinder | Jython (a Java implementation of Python) | Increase load and volume; no integrated UI or performance graph display; open-source
Tsung | Erlang | Increase load and volume with a randomized user arrival distribution; HTML performance reports; open-source
Gatling | Scala | Increase load and volume; also supports multiple L7 protocols; yields very detailed HTML reports; open-source

Table 4: Comparison of the different tools for load testing

It should be remembered, however, that there exists software improving the already existing load testing tools by adding a layer on top of them, for instance Taurus: an open-source wrapper for JMeter, Grinder, or Gatling that adds functionality making test writing easier and provides live, detailed JUnit reports.


4 Experiments

4.1 Functional test traces analysis

Using the GitLab API we were able to extract metrics (completion time, pipeline failure rate, Katalon test failure rate, commit size) on the first 3 steps of the CI pipeline (code management, build, testing) and to use them to improve the success rate of the pipeline. We analyze them with a view to decreasing the deployment time.
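The extraction itself can be done with plain HTTP calls to the GitLab REST API; the sketch below is a simplified, hypothetical example (placeholder instance URL, project ID, and token, and no pagination handling), not the exact script used for the thesis.

# Sketch: pull pipeline traces from the GitLab API (v4) to compute simple
# metrics such as the failure rate. Instance URL, project ID and token are
# placeholders, and only the first page of results is fetched.
import requests

GITLAB = "https://gitlab.example.com/api/v4"   # placeholder instance
PROJECT_ID = 1234                              # placeholder project
HEADERS = {"PRIVATE-TOKEN": "<redacted>"}      # read-only API token

def pipeline_stats(per_page=100):
    url = f"{GITLAB}/projects/{PROJECT_ID}/pipelines"
    pipelines = requests.get(url, headers=HEADERS,
                             params={"per_page": per_page}).json()
    finished = [p for p in pipelines if p["status"] in ("success", "failed")]
    failed = sum(p["status"] == "failed" for p in finished)
    return {"pipelines": len(finished),
            "failure_rate": failed / max(len(finished), 1)}

print(pipeline_stats())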

First, to detect the faulty part of the code, we are not able to use fault localization techniques on the Katalon End-to-End tests, as the two main techniques for fault localization (spectrum-based and mutation-based), detailed below, cannot be applied to our End-to-End tests.

• Spectrum-based fault localization: running multiple tests and analyzing where a fault could occur in the code based on which lines are executed in the successful and non-successful tests.

• Mutation-based fault localization: running the same test multiple times while slightly modifying the code (one operator, for instance) and exploring the impact on the test output.

Note that both of these techniques require code coverage. The End-to-End test scenario takes over 20 minutes, and having different instances of it running multiple times would increase the testing time to over an hour, which is not acceptable for a continuous integration pipeline. For this reason, we assume that the committed code triggering a failure in the test is, in its entirety, faulty code. As we ran the tests, we noticed that there is a correlation between the cyclomatic complexity and the parts of the code that triggered tests to fail.

Figure 15 shows how the Katalon test execution time evolved as a function of the advancement of the project (pipeline number):

Figure 15: Evolution of Katalon test duration

As shown in Figure 15, the total completion time of the test doesn't increase linearly and tends to a limit after 40% of the tests launched. In fact, at the beginning numerous pages were being added to the web-app and numerous tests were being written for them, and changing pages is one of the longest parts of the execution of an End-to-End test. Note that the metric here is the pipeline number, not the date, for privacy reasons.

Thus we can conclude that even though our project is getting bigger, the testing time remains acceptable even with End-to-End tests. Furthermore, the Katalon test execution represents 75.4% of the total merge pipeline execution time, as detailed in Figure 16 below; thus optimizing it is crucial to reduce the pipeline execution time.

Figure 16: Pipeline execution time distribution

After implementing code coverage and optimizing the Katalon tests based on the results obtained, the completion time of the Katalon tests was significantly decreased: a 200-second reduction in Katalon execution time.

Furthermore, optimizing the waiting time allocated for page loading, i.e. the time between the execution of S_{iϕ(i)} and S_{(i+1)1} (as described in Figure 7), decreased the total execution time by 43 seconds. The total time spent waiting for page loads during the Katalon test is 400 seconds, representing 36% of the total Katalon test execution time.
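One common way to trim such waiting time is to replace fixed sleeps between test cases with conditional waits that return as soon as the next page is ready. The sketch below shows the idea with Selenium for Python; it is a generic illustration with a hypothetical URL and selector, whereas the actual tests use Katalon's Groovy API.

# Generic sketch: wait for a page-ready marker instead of sleeping a fixed
# amount of time between test cases. URL and selector are hypothetical.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://localhost:8080/next-page")   # hypothetical page under test

# wait at most 15 s, but continue immediately once the element marking the
# loaded page is present (instead of a fixed time.sleep(15))
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "[data-test='page-ready']"))
)
driver.quit()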

The results of the Katalon test completion time optimization are summarized in the table below.


Optimization | Time gained (seconds)
Test refactoring after code coverage | 200 (16%)
Object detection | 27 (2.2%)
Page load waiting | 43 (2.8%)
Total | 270 (21%)

Finally, note that there is another way to decrease the completion time of an End-to-End test: mocking the API responses. By using a proxy to intercept the API requests and respond with prerecorded responses, a significant amount of time spent waiting for a remote server to respond is saved; with this approach the API response time would be divided by roughly 10. However, this approach is not suitable for our case, since both the front end and the back end are tested at the same time, so the API responses cannot be mocked.

4.2 Parameters influencing failure rate

The second part of the data analysis was pursued with a view to discovering a correlation between the number of failing tests and metrics related to the code pushed (type and number of lines modified, code complexity, commit size). First, the overall pipeline success rate is 70.6%.

However, the failure rate of the pipelines containing the Katalon tests (only a Git merge triggers Katalon, a commit does not) is 70.75%. Even though only 21% of the pipelines contain Katalon tests, Katalon is responsible for 46% of all pipeline failures. Finally, among all the pipelines triggered by a merge (and thus executing Katalon), the Katalon tests caused 69.1% of the failures. The distribution of jobs triggering pipeline failures is shown in Figure 17.

In order to identify the relevant factors, we computed the Bravais-Pearson correlation between the failure rate and the different factors (number of modified files, number of lines committed, type of files committed). The correlation is computed using the following formula:

$$r_{xy} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \sum_{i}(y_i - \bar{y})^2}}$$

where r_xy is the correlation between x and y, x and y are the two arrays compared (of equal size), and x̄ and ȳ are their mean values. The correlation is considered low if it lies between -0.5 and 0.5.
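As an illustration, the correlations reported in Table 5 can be computed with a short script of this form; the toy data below is invented, and the project-specific metric extraction is not shown.

```python
# Minimal sketch of the Bravais-Pearson computation behind Table 5;
# the per-pipeline metric extraction is project-specific and not shown.
import math

def pearson(x, y):
    assert len(x) == len(y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# toy usage: pipeline success (1/0) against a per-commit complexity metric
success = [1, 1, 0, 1, 0, 0]
cyclomatic = [3, 4, 11, 5, 9, 13]
print(f"r = {pearson(success, cyclomatic):.2f}")  # strongly negative for this toy data
```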

Parameter                                           Bravais-Pearson correlation
Number of modified files                             0.13
Number of .html lines committed                     -0.21
Number of .ts lines committed                       -0.58
Number of .groovy (Katalon test) lines committed     0.01
Total number of lines added                          0.35
Cyclomatic complexity                               -0.82
Halstead complexity                                 -0.74

Table 5: Correlation between code metrics and pipeline success rate

From Table 5 above we can see a strong correlation between the cyclomatic complexity and the pipeline failure rate.

Figure 17: Job triggering pipeline failure distribution

Thus, as detailed above, a large part of those parameters do not have any impact on the total pipeline failure rate. The parameters having a significant correlation with the failure rate (≥ 0.5 or ≤ -0.5) are the code complexity metrics: the cyclomatic and the Halstead complexity. This means that a higher number of possible paths in the code tends to increase the risk of pipeline failure (note that it is not necessarily due to a Katalon test failure), as it gets more challenging to write scripts exploring all the possible paths, whereas covering more functions is easier when writing tests.

The Halstead complexity also has an impact on the failure rate: variables or operands added to the code can cause errors or incompatibilities with other parts of the code. Moreover, as multiple developers write the code, code with a high Halstead complexity gets harder to understand.

Furthermore, functions with a high Halstead complexity written by multiple developers are more likely to cause a pipeline failure.

To tackle this issue of increasing failure rate, we installed a tool (TSLint) measuring the cyclomatic complexity within the IDE (code editor) in order to keep it within bounds and keep the functions relatively small and straightforward.

Also note that code containing too many variables and operands could cause problems later on, as the functions get harder for other developers to understand and extend. Moreover, one surprising fact in favor of the use of End-to-End tests is the very low correlation between the number of .html lines and the failure rate.


HTML files contain the page structure and the button identifiers that Katalon uses to navigate through the app. The low correlation is due to efficient communication between QA and the developers (tests and code are updated simultaneously) and to the smart-locator techniques used to locate objects (buttons, links, form inputs) on the web page. By using different locator attributes (data-*, class, id, or text), we are able to decrease the number of "object not found" errors causing Katalon test failures.
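The fallback idea behind such smart locators can be sketched as follows; Selenium in Python is used purely for illustration (Katalon implements its own locator mechanism), and the attribute names in the example are assumptions.

```python
# Sketch of a fallback-locator strategy: try several attributes in order of
# robustness before giving up. Attribute choices (data-test, id, class, text)
# mirror the ones mentioned above but are illustrative.
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def find_with_fallback(driver, data_test=None, element_id=None, css_class=None, text=None):
    candidates = []
    if data_test:
        candidates.append((By.CSS_SELECTOR, f'[data-test="{data_test}"]'))
    if element_id:
        candidates.append((By.ID, element_id))
    if css_class:
        candidates.append((By.CLASS_NAME, css_class))
    if text:
        candidates.append((By.XPATH, f'//*[normalize-space(text())="{text}"]'))
    for by, value in candidates:
        try:
            return driver.find_element(by, value)
        except NoSuchElementException:
            continue  # try the next, less specific locator
    raise NoSuchElementException(f"No locator matched: {candidates}")
```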

Figure 18 presents the pipeline success rate as a function of the number of .ts files committed; we can see that the success rate drops as the number of committed files grows.

Figure 18: Pipeline success rate as a function of the number of .ts files modified

Note that the last failure rates, for 6 and 7 files, are based on only a few pipelines (3-4), whereas the others are based on hundreds. Consequently, this confirms that smaller commits tend to be integrated into the project more successfully, as they have a lower pipeline failure rate. Finally, it should be recalled that the CI pipeline is ever-evolving and jobs are constantly added to or withdrawn from the pipeline. Thus it is not relevant to plot the overall pipeline success rate as a function of time, as both the optimizations and the pipeline modifications change the success rate.

4.3 Gitlab Runner usage optimization

As mentioned in Section 2.2, the load on the runners (the VMs used for the CI pipeline with GitLab) can be modeled as a Markov chain, as in Figure 4. More precisely, it is an M/M/1 queue, and this queue becomes unsustainable if λ > µ. Note that only the arrivals between 2pm and 6pm are considered, as it is the only time when the load on the system is significant. The first part of the server optimization concerns the usage of the runner dedicated to End-to-End testing, preventing it from becoming the bottleneck of the CI pipeline, since it runs the longest job and this job can only run on a specific runner (it requires specific software to be installed). From the pipeline trace analysis, the arrival rate of new jobs on the runner associated with the End-to-End (Katalon) tests is λ = 1.2 jobs per hour. Hence, since λ = 1.2/hour and µ = 2.57/hour, the load on the Katalon runner is sustainable and the queue will not keep growing even under high load.
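Under the M/M/1 assumption, the measured rates translate directly into the standard steady-state formulas; the short sketch below is a sanity check of that reasoning rather than project code.

```python
# M/M/1 sanity check using the measured rates from the pipeline traces
# (lambda = 1.2 jobs/hour, mu = 2.57 jobs/hour); the formulas are the
# standard steady-state results for an M/M/1 queue.
arrival_rate = 1.2   # lambda: Katalon jobs arriving per hour (2pm-6pm window)
service_rate = 2.57  # mu: Katalon jobs the runner can complete per hour

rho = arrival_rate / service_rate                          # runner utilization
avg_jobs_in_system = rho / (1 - rho)                       # L  = rho / (1 - rho)
avg_time_in_system_h = 1 / (service_rate - arrival_rate)   # W  = 1 / (mu - lambda)
avg_wait_in_queue_h = rho / (service_rate - arrival_rate)  # Wq = rho / (mu - lambda)

print(f"Utilization: {rho:.2f}")
print(f"Average jobs in system: {avg_jobs_in_system:.2f}")
print(f"Average time in system: {avg_time_in_system_h * 60:.1f} min")
print(f"Average wait in queue:  {avg_wait_in_queue_h * 60:.1f} min")
```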

However, the CI pipeline is composed of other types of jobs: unit tests, build and deployment. Those jobs do not run on the same server as the functional tests, as they require less specific software to be installed, but they still require active runners; the goal is therefore to evaluate the minimum number of runners required to maintain a functioning CI pipeline.

Each of those jobs runs on a different runner for each Git push, meaning that each job is triggered on average every 9.5 minutes. Based on the completion time of those jobs and the frequency of Git pushes and merges, we can establish that the two runners already present in the system are enough. However, the load balance can be optimized, so that the arrival rate λ is more even across all the runners, and optimizing the deployment speed by choosing runners on the same cluster would enable, on average, a 15% decrease in both Git merge and Git push execution time. For this allocation optimization only the job tags need to be updated; for intellectual property reasons the runners' tags cannot be disclosed in this thesis.

On top of this, another runner is dedicated to the Katalon tests; note that adding a second runner dedicated to the Katalon tests would enable faster pipeline completion.

In fact, with up to 3 Katalon jobs in the waiting list at peak load, adding a second runner would enable faster pipeline execution: more precisely, it would decrease the waiting time for Katalon jobs by about 2 minutes on average, since 10% of the Katalon tests are queued and the Katalon test execution time is 22 minutes. However, due to the associated cost and maintenance, and for a relatively small gain, it was decided not to add a second Katalon runner.

Moreover, we can modify GIT_DEPTH, a parameter indicating how much of the Git history is retrieved when the code is cloned onto a GitLab runner. Decreasing GIT_DEPTH to 5 would reduce the repository size by 6.2% and thus speed up the copy to the different runners.

Overall, with all the optimizations performed, we have been able to decrease the total pipeline execution time by 320 seconds, as shown in Figure 19.

Figure 19: Merge Pipeline total execution time evolution


With:

• 1* being the Katalon test optimization
• 2* being the runner usage optimization
• 3* being the runner location and GIT_DEPTH optimization

4.4 Troubleshooting and Fuzz testing Results and Analysis

From the mitmproxy troubleshooting, in which keys were deleted from the JSON API responses, we got the following results:

• One third of the deleted keys triggered no error in the UI and thus had no impact on the user experience.

• One third of the deleted keys compromised some of the information displayed, but the platform was still fully functional and all the buttons/links were working.

• One third of the deleted keys caused the platform to stop working: no display, or links that stopped working.

We can thus conclude that, for the key-deletion troubleshooting, the third of keys that froze the app was useful to detect critical bugs; the third compromising the displayed information was not repairable, as the information was simply missing. Finally, for the third of deleted keys not triggering any error, we can conclude that the REST API contains too much data for some use cases: some keys are not used and could be removed to decrease transfer time. It should be remembered that those results were obtained by first deleting the highest-level keys and then deleting keys deeper into the JSON structure, based on the results of the high-level key experiments (a method derived from the AFL algorithm). Deleting only the top-level keys also triggers failures or unexpected behavior, but it would give a less comprehensive analysis of the system's resilience, as in the real world it is not only the top-level keys that are altered.
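A minimal sketch of how such a key-deletion experiment can be scripted as a mitmproxy addon is shown below; the option name and the content-type filter are illustrative, and the experiment-driven selection of deeper keys described above is not part of this sketch.

```python
# Sketch of a mitmproxy addon that strips a chosen key from JSON API responses.
# Run with: mitmproxy -s delete_key.py --set key_to_delete=someKey
# The option name "key_to_delete" is an illustrative choice.
import json
from mitmproxy import http, ctx

class DeleteJsonKey:
    def load(self, loader):
        loader.add_option(
            name="key_to_delete", typespec=str, default="",
            help="JSON key to strip from API responses",
        )

    def response(self, flow: http.HTTPFlow) -> None:
        key = ctx.options.key_to_delete
        if not key or "application/json" not in flow.response.headers.get("content-type", ""):
            return
        try:
            data = json.loads(flow.response.get_text())
        except (ValueError, TypeError):
            return
        self._strip(data, key)
        flow.response.set_text(json.dumps(data))

    def _strip(self, node, key):
        # walk the JSON tree and remove every occurrence of the key
        if isinstance(node, dict):
            node.pop(key, None)
            for value in node.values():
                self._strip(value, key)
        elif isinstance(node, list):
            for item in node:
                self._strip(item, key)

addons = [DeleteJsonKey()]
```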

Then, after adding delays (1.5 to 30 seconds) with netem, 3 bugs were detected, all of which crashed the app completely. This test was critical since this type of delay occurs in real-world scenarios, not necessarily as a network delay but also as a response delay from the server; it happened on the web app in a testing environment (since testing environments have minimal resources).

Moreover, as some of the resources of the web app are loaded asynchronously, long delays may lead to a dependency error: some resources depend on each other to be executed, as shown in Figure 20. One error of this type was detected and corrected.

Then we modified the HTTP response codes to 404 and 500 (not found and server error) for both CSS and JS resources.

For each JS resource turned into a 404 or 500, the web app completely crashed (also note that Chrome alone is enough to perform those tests, as specific URLs can be blocked through the Chrome developer tools). The 404 and 500 HTTP response codes for the CSS files had a serious impact on the user interface, and the web app became barely usable.

Disabling JavaScript in the browser made the web app unusable while no error message was displayed, an issue corrected later by adding a clear error message.


Figure 20: Example of an asynchronous request failure

Finally, for the APIs returning JSON content, to test the resilience of the web app to extremely large responses, the JSON responses are increased in size through the proxy, both linearly and recursively.

Figures 21 and 22 show the tree representing the JSON data structure (JavaScript Object Notation, a key-value, text-based data format), with the original JSON on the left and the JSON after the size increase on the right.

• Linear size increase:

Figure 21: Example of Linear size increase of factor 2 on a JSON of depth 2

• Recursive size increase:

Figure 22: Example of Recursive size increase of factor 2 on a JSON of depth 2
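A sketch of how the two strategies can be implemented before the proxy forwards a response is given below; the duplication scheme and the sample document are assumptions made for illustration, and the exact transformation used in the experiments may differ.

```python
# Sketch of linear vs. recursive JSON size increase applied to a response body
# before it is forwarded; the factor and the sample document are illustrative.
import copy
import json

def increase_linear(node, factor):
    # duplicate each top-level entry `factor` times (keys get a suffix to stay unique)
    if isinstance(node, dict):
        out = {}
        for k, v in node.items():
            for i in range(factor):
                out[k if i == 0 else f"{k}_{i}"] = copy.deepcopy(v)
        return out
    if isinstance(node, list):
        return [copy.deepcopy(item) for item in node for _ in range(factor)]
    return node

def increase_recursive(node, factor):
    # same duplication, but applied at every level of the JSON tree
    if isinstance(node, dict):
        out = {}
        for k, v in node.items():
            for i in range(factor):
                out[k if i == 0 else f"{k}_{i}"] = increase_recursive(v, factor)
        return out
    if isinstance(node, list):
        return [increase_recursive(item, factor) for item in node for _ in range(factor)]
    return node

doc = {"user": {"name": "a", "roles": ["r1", "r2"]}}
print(len(json.dumps(increase_linear(doc, 2))))     # grows roughly with the factor
print(len(json.dumps(increase_recursive(doc, 2))))  # grows with factor^depth
```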
