
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Datateknik

2021 | LIU-IDA/LITH-EX-G--21/048--SE

Randomness as a Cause of Test

Flakiness

Slumpmässighet som en orsak till skakiga tester

Daniel Mastell & Jesper Mjörnman

Supervisor: Azeem Ahmad
Examiner: Ola Leifler


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

With today’s focus on Continuous Integration, test cases are used to ensure the software’s reliability when integrating and developing code. Test cases that behave in a non-deterministic manner are known as flaky tests, which threaten the software’s reliability. Because of their non-deterministic nature, flaky tests can be troublesome to detect and correct. This causes companies to spend great amounts of resources on flaky tests, since they can reduce the quality of their products and services.

The aim of this thesis was to develop a usable tool that can automatically detect flakiness in the Randomness category. This was done by initially locating and rerunning flaky tests found in public Git repositories and scanning the resulting pytest logs from the tests that manifested flaky behaviour, noting indicators of how flakiness manifests in the Randomness category. From these findings we determined tracing to be a viable option for detecting Randomness as a cause of flakiness. The findings were implemented into our proposed tool FlakyReporter, which reruns flaky tests to determine if they pertain to the Randomness category. Our FlakyReporter tool was found to accurately categorise flaky tests into the Randomness category when tested against 25 different flaky tests. This indicates the viability of utilizing tracing as a method of categorizing flakiness.


Acknowledgments

The authors would like to thank the examiner Ola Leifler and the supervisor Azeem Ahmad for their guidance and help with the direction of our work. Thanks also to the students working in the same field for their help and discussions about the thesis.


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

1 Introduction 1
1.1 Motivation . . . 1
1.2 Aim . . . 1
1.3 Research Questions . . . 2
1.4 Delimitations . . . 2
1.4.1 Flaky Tests . . . 2
1.4.2 Python . . . 2
1.4.2.1 Testing Framework . . . 2
2 Background 3
2.1 Flaky Test . . . 3
2.2 Continuous Integration . . . 3
2.3 Taxonomy of Flakiness . . . 4
2.4 Execution Tracing . . . 5
2.5 GitHub . . . 5
2.6 Python . . . 5
2.7 Unit Testing . . . 5

2.7.1 Unit Testing Framework . . . 6

2.7.2 Pytest . . . 6

3 Related Works 7
3.1 Empirical Studies . . . 7

3.2 Automatic Detection . . . 8

3.3 Automatic Fault Localization . . . 9

3.4 Automatic Flaky Categorization . . . 10

4 Method 11
4.1 Thesis Work Process . . . 11

4.2 Data Collection . . . 12

4.2.1 Log Files Generation . . . 13

4.2.2 Category Identification . . . 13

4.3 Data Analysis . . . 14


4.4 Data Reporting . . . 16

4.4.1 Execution Traces . . . 17

4.5 FlakyReporter . . . 20

4.5.1 Rerun Flaky Test . . . 21

4.5.2 Trace Logs . . . 21

4.5.3 Execution Divergence . . . 22

4.5.4 Compare Return Values . . . 23

4.5.5 Compare Locals . . . 24
4.5.6 Compare Assertions . . . 24
4.5.7 Compare Partials . . . 25
4.5.8 Calculate Result . . . 25
4.5.9 Produce Report . . . 26
4.6 Evaluation . . . 26
5 Results 28
5.1 Rerunning Tests & Recreating Flakiness . . . 28

5.2 Analyzing Log Files . . . 30

5.2.1 Causes of Randomness . . . 30
5.3 Produced Report . . . 30
5.4 Evaluation . . . 31
5.5 Tracing . . . 33
6 Discussion 34
6.1 Results . . . 34
6.1.1 Flaky Tests . . . 34
6.1.1.1 Limitations . . . 35
6.1.1.2 Human Error . . . 35
6.1.2 Tracing . . . 35
6.1.3 FlakyReporter . . . 35
6.2 Method . . . 36
6.2.1 Creating a Report . . . 36
6.3 Source Criticism . . . 36

6.4 Replicability, Reliability and Validity . . . 37

6.4.1 The work in a wider context . . . 37

7 Conclusion 38
7.1 Tracing & Log Files . . . 38

7.2 Consequences . . . 38

7.3 Future Work . . . 39

Bibliography 40


List of Figures

2.1 Test Driven Development. . . 6

4.1 Workflow of thesis. . . 12

4.2 Github commit fixing Too Restrictive Range flakiness. . . 14

4.3 Found categories from public dataset and GitHub. . . 15

4.4 Found testing frameworks from projects in our created dataset. . . 16

4.5 Snapshot of two pytest fail messages from two iterations of the same test. . . 17

4.6 Flow of sys.settrace(). . . 17

4.7 Overview of execution trace depth. . . 19

4.8 Flowchart of the method for analyzing trace logs. . . 20

5.1 Reduction of projects based on events. . . 29

5.2 Test case producing No Indications of Randomness. . . 31

5.3 Test case producing Many Indications of Randomness. . . 31

5.4 Iterations of produced report for test function. . . 31


List of Tables

4.1 Test functions used for evaluating. . . 27

5.1 Categories and numbers of tests used for evaluation. . . 32

5.2 Results from running the tests with FlakyReporter. . . 32


1

Introduction

1.1

Motivation

With today’s focus on Continuous Integration, a number of test cases are used to ensure that new code integrates reliably with the old. For each new edit in the code, all test cases must be rerun to ensure that the newly added code does not introduce any problems. Within these test cases there is a risk that flaky tests are created. Flaky tests are non-deterministic tests, meaning they fail to always produce the expected outcome. In some cases a test may run 200 times correctly before failing once, which creates uncertainty about whether the test case or the code is the problem.

Companies consume both time and resources due to flaky tests, from finding the root cause to handling bugs created by negligence in test cases made for produced software [11, 15, 20]. According to Silva et al. [20] and Lam et al. [11], Google spends between 2-16% of its testing budget on rerunning tests that are suspected of being flaky. This is not unexpected and is further explained by Gruber et al. [8], who report that it takes approximately 170 reruns of a test case to determine with certainty whether a test is flaky or not.

In order to combat this, programs have been developed to automatically evaluate and identify whether a failed test is due to flakiness. For example, there is a program named DeFlaker that correctly identified 95.5% of all flaky tests that were included in a sample of 28,068 tests [2]. Even though this saves a lot of time on rerunning test cases, the problem of finding the root cause still remains.

1.2

Aim

Eck et al. [3] present a taxonomy of eleven categories of flakiness, namely Concurrency, Async Wait, Too Restrictive Range, Test Order Dependency, Test Case Timeout, Resource Leak, Platform Dependency, Float Precision, Test Suite Timeout, Time, and Randomness. From this taxonomy, Randomness is found to be a prevalent category of flakiness in Python test suites [8] and is therefore an important issue that needs fixing. Since the categories differ in how they display and manifest flakiness, we have decided to only focus on the Randomness category. Creating a technique for automatically detecting Randomness may also inspire further developments or demonstrate the prerequisites for the development of such techniques. Based on this, we propose a new technique, FlakyReporter, that classifies a flaky test's likelihood to be flaky due to Randomness. It does this by rerunning a flaky test, tracing its execution and storing the information into trace logs which are then parsed and analyzed.

1.3

Research Questions

This presents the following research question:

• To what extent can tracing or log files be used to locate and identify a test being flaky in the Randomness category?

1.4

Delimitations

1.4.1

Flaky Tests

Due to the scope of our work the developed application will assume that a test is suspected of being flaky and run it based on that assumption. Furthermore, only the category of Randomness, as defined by Eck et al. [3], will be investigated, for the same reasons already stated.

1.4.2

Python

Every language that is used in a continuous integration format is prone to flakiness. However, this paper will only focus on flakiness in the Python language. Python was selected as a language due to the available datasets and Git repositories online.

1.4.2.1 Testing Framework

Due to the flexibility and popularity of pytest, we have selected to focus our efforts on only this module. By creating a plugin for pytest we strive towards producing traceback logs without creating too much overhead, which a more general approach would. The drawback is the inability to use plugins for any other framework but pytest. Even though pytest supports running test suites from other frameworks, it does not support plugins for them. There might exist workarounds but we have failed to locate a suitable one.


2

Background

2.1

Flaky Test

Flaky test is a term coined to describe tests that show non-determinism. This means that test cases that fail to always give the same result are classified as flaky. Debugging a flaky test presents a lot of complications for developers who have to fix the cause of flakiness [8, 12, 15, 27]. One such complication is how a test may behave differently on different hardware, which can make the test appear flaky on one setup but not on another. Since flaky tests are non-deterministic, recreating the failure is difficult; it might require several reruns of the test before the failure manifests. As a result, some companies such as Google rerun a suspected flaky test ten times and mark it as flaky if and only if the ten runs result in at least one pass and one failure [2, 15]. By doing this, the cost of rerunning until a flaky test failure manifests is avoided. This does, however, fail to entirely determine whether a test is flaky or not, but is a compromise, since investigating whether a test is actually flaky consumes both time and resources.
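
As a minimal sketch of such a rerun policy (our own illustration, not a tool from the cited works; the test id and rerun count are placeholders), a test could be rerun in a loop and classified as flaky only if the outcomes disagree:

    import subprocess

    def is_flaky(test_id: str, reruns: int = 10) -> bool:
        """Rerun a single test and mark it as flaky if and only if the
        reruns contain at least one pass and at least one failure."""
        outcomes = []
        for _ in range(reruns):
            # pytest exits with code 0 when the selected test passes.
            result = subprocess.run(["python3", "-m", "pytest", "-q", test_id])
            outcomes.append(result.returncode == 0)
        return any(outcomes) and not all(outcomes)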

Further difficulties in debugging a flaky test are introduced by how the root causes correlate. For example, a test that is flaky due to order dependency may cause other causes to be introduced, hiding the actual root cause. However, order dependency is not inherently flaky, i.e., test suites can exhibit order dependency without exhibiting any flakiness. In fact, it is not uncommon for tests to be order dependent while not exhibiting any flakiness.

Several articles have been written discussing this issue in various ways; some automatically detect flakiness [2, 13, 27], some automatically fix flakiness [19], and some discuss flakiness in relation to languages and developers [3, 8, 15, 23]. These articles provide insight, methods and, most importantly, data sets containing known flaky tests and their respective categories of flakiness, including Randomness. We gather tests from different categories to be used when validating our tool FlakyReporter. This is done to determine how accurately it manages to identify a flaky test as Randomness or not.

2.2

Continuous Integration

Continuous Integration is a practice that is frequently used within software development. Development teams use a system for version handling where contributions from the team's respective members are added together. When contributions are added, the most recent version of the software is built and its reliability is ensured by running tests. This method of developing is reliant on tests passing and is susceptible to flaky tests and their implications. For developers it is therefore very important to avoid creating, as well as to find and fix, flaky tests consistently and efficiently [11, 13, 19, 20].

2.3

Taxonomy of Flakiness

The eleven categories described (Concurrency, Async Wait, Too Restrictive Range, Test Order Dependency, Test Case Timeout, Resource Leak, Platform Dependency, Float Precision, Test Suite Timeout, Time, and Randomness) each introduce their own problems [3]. To be able to discern the different categories and to both determine and categorize found flaky tests it is important to understand the difference between each category. Below, each of the eleven categories is briefly described from the characterizations of causes presented by Eck et al. [3].

1. Concurrency classifies tests that are flaky due to synchronization issues, mostly originating from unsafe threading interactions. For example, it can be caused by race conditions where two threads fight over the same resource.

2. Async Wait is similar to Concurrency but is instead characterized by performing asynchronous calls without waiting for the result.

3. Too Restrictive Range is categorized by valid output values not being within the assertion range considered at test design time, failing these tests when such values show up.

4. Test Order Dependency classifies a test that is reliant on the outcome of previous tests and is the most problematic one of all causes as it is ambiguous and is not inherently flaky. Flakiness due to this cause occurs most often when shared variables are handled badly, e.g., a previous test fails to reset the shared variables.

5. Test Case Timeout is when a test is suffering from non-deterministic timeouts.

6. Resource Leak is characterized by improper management of external resources. Allocating memory and not releasing it, or not dereferencing a pointer, are examples of causes.

7. Platform Dependency is when a test is flaky because of its inability to run on a specific platform. This means that for some flaky tests, different hardware introduces flakiness.

8. Float Precision is when potential precision over- and underflow of floating point operations are not considered. This can be caused by rounding to a certain number of significant digits.

9. Test Suite Timeout is when the whole test suite exceeds its time limit without any single test case timing out on its own.

10. Time is when a test relies on the local system time, for example its precision or the current time zone.

11. Randomness is when a test depends on randomly generated values without accounting for all values that can be generated.


2.4

Execution Tracing

Execution tracing, or just tracing, is a form of logging where each line of execution is logged. This allows for in-depth information about a program's execution and can further allow different methods for debugging. One such technique is divergence, where two tracing logs are compared to find the difference between them [27].

Tracing is a widely used debugging approach, both in a fully automated way and in a manual way where a developer debugs by tracing each line of execution [1]. Tracing adds overhead, since each line has to be analyzed and logged correctly, and different algorithms have been presented to combat this and reduce the overhead [17].

Earlier works further state that fault localization uses tracing in a majority of implementations [24, 25]. Execution tracing has been proven to reliably find locations of interest by using the diff between runs [27]. These two arguments back the selection of tracing as an approach to determine Randomness as the cause of flakiness.

Kraft et al. [10] describe tracing as a technique that detects and stores relevant events during run-time, which are used in a later stage for off-line analysis. By storing the relevant events we gain the exact lines of execution for each iteration, all locals stored at any given line and all function calls made during execution. In the same fashion as described in their study, we create log files containing the relevant execution traces which are then analyzed in a later stage.

2.5

GitHub

GitHub is an online code hosting platform for version control and collaboration. It allows teams to collaborate on and maintain projects from anywhere. Millions of developers and companies use GitHub to develop, maintain and ship their software [9]. This paper utilizes the Explore functionality of GitHub that lets you browse through all available public projects. Publicly available projects allow anyone to download the source code and contribute to the development. Retrieving the source code allows for running their tests, which in turn might yield manifestations of flaky tests essential for conducting this research.

2.6

Python

Python is an interpreted high-level language which supports a variety of easy to use official and unofficial packages [5]. The language is often associated with machine learning, web development, embedded systems, data analysis and scripting. Although it is a simple language that allows for swift production of software code quantity-wise, it lacks in execution speed compared to a pre-compiled language like C++.

2.7

Unit Testing

Unit Testing is a software testing method where individual and isolated units of software code are tested in order to validate their functionality and reliability [18]. A unit, in regards to Unit Testing, could be any given part of the code that requires testing, for example any function or object. This method of testing is often used in continuous integration to ensure the reliability of the code that is under development. Unit Testing is necessary for Test-Driven Development, the process that enables developers to produce code and continuously test its functionality, thus providing code with increased quality. This provides an opportunity for developers to detect non-reliable or non-functional code earlier on, which in turn leads to a possible reduction of resources being spent both identifying and fixing it in the future. Unit Testing is also beneficial when refactoring legacy code, meanwhile ensuring the previous functionality. However, it is important to be aware that Unit Testing is only as good as its practitioner and cannot be expected to catch every flaw of a program. To accomplish this while also keeping the syntax of testing concise, developers make use of what is called a Testing Framework.

Figure 2.1: Test Driven Development.

2.7.1

Unit Testing Framework

Unit testing frameworks allow for concise implementation of testing and often have support for both logging and supplying feedback when tests fail. Python supports several testing frameworks, such as its built-in unittest package, which supplies the developer with simple creation of test cases but lacks in-depth logging compared to other testing frameworks. Another one is pytest, which is a framework for unit testing that supports adding and creating plugins.

2.7.2

Pytest

pytest is a Python testing framework which features detailed failure logging, auto discovery of test modules, modular fixtures, a plugin architecture and the ability to run unittest and nose suites [22]. It is often selected by developers due to its simplicity of creating tests and its support for creating and installing plugins. Due to this, and its ability to run most of the other Python testing frameworks, we will utilize its plugin functionality in this thesis.
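
As an illustration of what a pytest plugin hook can look like (a minimal sketch of our own, not the plugin developed in this thesis), pytest discovers hook implementations in a conftest.py file by their names:

    # conftest.py -- pytest discovers hook implementations by name.

    def pytest_runtest_call(item):
        # Runs just before the test function body is executed; a tracing
        # plugin could install its trace function here.
        print(f"about to run {item.nodeid}")

    def pytest_runtest_teardown(item):
        # Runs after the test; a tracing plugin could remove the trace
        # function and flush its logs here.
        print(f"finished {item.nodeid}")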


3

Related Works

Several articles on automatically detecting flaky tests have been written, presenting both the test datasets used and the software developed. These works do, however, less often concern finding the root cause automatically, leaving a gap in the research.

In this section, relevant and earlier work will be presented. The presented works are split into their respective sections based on their area of contribution.

3.1

Empirical Studies

Eck et al. [3] describe flaky tests at a more thorough level: what exactly flaky tests are and how they, in many ways, negatively affect produced software. To combat the problems of debugging and finding root causes they introduce a taxonomy of eleven potential root causes of flakiness. They also explain the most common ways of solving the given root causes. These root causes are widely accepted by researchers and will, as in other works, be used in this paper to categorize causes of flakiness.

Common causes, manifestations and useful strategies for avoiding, identifying and solving flakiness of tests are discussed by Luo et al. [15]. By examining flaky tests from open source projects, collecting data and analyzing how developers have solved flaky tests from the project history, they present common strategies for solving flakiness. The most prevalent causes examined are of the async wait, concurrency and order dependency categories, which are described in Eck et al. [3]. These are identified in their study as the most prevalent categories of flakiness. It is also explained how order dependency may cause unpredictable behavior and cause different kinds of flakiness to manifest. This can result in more difficulties in examining the root cause, as the order dependency can be hidden behind a different cause of flakiness created from it. Luo et al. [15] manage to present reliable strategies for fixing root causes defined in the taxonomy as well as presenting relevant information for identifying and categorizing common root causes. Their findings provide information that will be used to identify root causes and what information might be needed for any developer to understand and fix them.

Gruber et al. [8] have in their empirical study examined flaky tests from available Github repositories, comparing tests written in Python with tests written in Java to compare differences in causes and quantities of flakiness. From their initial findings they state that the causes are different while the quantity is mostly the same. This implies that different languages suffer from the same amount of flaky tests but from different causes. They highlight that 59% of test cases written in Python are flaky due to order dependency, which is not the case for tests written in Java. Their study has created a public dataset of public git repositories which have some commit with flaky tests. All these detected tests are fully classified into their respective category (defined in [3]). This dataset is used for research, testing and development in this paper.

3.2

Automatic Detection

Bell et al. [2] present relevant research which discusses the possibility of automatically deducing whether a suspected test is flaky or not. They also claim that testing for flakiness by reruns is both ineffective and costly, highlighting their DeFlaker which avoids doing too many reruns. DeFlaker works by conducting an analysis on the difference between the previous and the new release (code wise). DeFlaker detects the relevant changes and how they might affect the suspected flaky test. These identified changes are selected to be tracked by byte-code injection, tracking both statement and class coverage. By recording the outcome of suspected tests and printing a report, it helps debugging and determining if a test failed due to flakiness. DeFlaker is, however, susceptible to false positives as it ignores changes to code that uses reflection and rather overestimates the amount of flakiness. Bell et al. [2] argue that this is preferred to getting false negatives, which would allow flaky tests to go through the detector. DeFlaker does not support any non-Java files, which includes settings and data files that could introduce flakiness. By using DeFlaker in conjunction with rerunning they argue that it will save time and resources and will manage to achieve higher precision for determining flakiness. Due to the existence of applications that automatically detect if a test is flaky, this paper will instead only focus on root causes for tests proven to be flaky. This means our method will assume that a test is defined as flaky before any categorization and analysis is done.

iDFlakies is a framework developed by Lam et al. [13] for detecting flakiness caused by order dependency. The method of the framework consists of first checking if the original order passes, followed by rerunning the test suite a given number of rounds with any of their five presented configurations, where random-class-method is the best performing one. These configurations reorder the test order and/or class methods. The last step in the identification is rerunning the failing order and comparing it to a rerun of the original order, which in turn indicates whether a test is flaky due to order dependency or not. Similarly, iFixFlakies is another framework handling the same issue of detecting order dependency [19]. The method of detecting flakiness is similar, but iFixFlakies supports automatically patching the test with what they call helpers. These helpers are identified functions that reset the data that causes order dependency. The helper functions are then patched into the test code to create a function call to the helper before the flaky test, which ensures that the affected test is isolated. Both frameworks present relevant information about detecting flakiness caused by order dependency. Their techniques are helpful as eliminating order dependency from possible root causes is the first step in detecting other causes.


3.3

Automatic Fault Localization

Ye et al. [26] propose a method of fault localization by intersection of control-flow based execution traces. Their method traces test executions, partitioning the resulting logs into two different sets, TR_p and TR_f, defining traces for passed and failed test iterations respectively. From these logs they compute the intersection of TR_f and report all points of the program that are run in every failing test case. These are then ranked based on their suspiciousness of causing bugs. From this point onward, it is only relevant to look at the suspicious points of the program to identify the root cause. Wang et al. [24] further explain the usage of execution tracing for fault localization. Similarly, they too propose a scheme for defining a test's suspiciousness to sort what points in execution are likely to induce faults. Their approach is however different; they propose an approach tailored to object oriented languages. To manage tracing of an object oriented approach in a simple way, they instead only trace what they refer to as blocks. These blocks represent a code block that was or was not executed in the trace. From this point it is stated how selection of test cases is important, since redundant test cases may negatively impact the effectiveness of fault localization. The selected test cases are then used in conjunction with the block trace to determine the suspiciousness of the block. For each block in the program they compute the relative percentage of passed test cases that execute the block to the failed test cases that execute it, i.e. the higher the percentage of failed cases, the higher the likelihood that the block is faulty. The methods of tracing presented follow a similar pattern of producing trace logs, comparing passing and failing logs, determining suspiciousness and estimating what part of the code is faulty. For finding flakiness, this method is applicable, as proven by Ziftci and Cavalcanti [27]. The main issue with tracing is its overhead, which in large test suites can be impactful depending on the scope and method of testing. Wang et al. [24] further provide indicators that tracing is a viable approach for locating flakiness. This thesis makes use of their method of creating and comparing logs. We aim to determine if a test is flaky due to Randomness or not by utilizing tracing. This is done in similar steps by creating and comparing trace logs.
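
As a rough illustration of such a block-based score (a simplified sketch of our own, not the exact formula of Wang et al. [24]), the share of failing runs among all runs that execute a block could be computed as:

    def block_suspiciousness(block_id, failing_cov, passing_cov):
        # failing_cov / passing_cov: one set of executed block ids per
        # failing / passing test run.
        failed = sum(block_id in cov for cov in failing_cov)
        passed = sum(block_id in cov for cov in passing_cov)
        if failed + passed == 0:
            return 0.0
        # A higher share of failing runs executing the block makes it more suspicious.
        return failed / (failed + passed)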

Ziftci and Cavalcanti [27] base their methods on Wong et al. [25] in order to solve the problem of locating where flaky tests are created. Ziftci and Cavalcanti [27] describe how Google has developed methods to locate where the root causes exist with 82% certainty. The proposed tool compares the execution traces of each failing run to each passing run, finding their divergence. By using the divergence and a flakiness score, it is possible to locate where the root cause is created. This greatly helps developers find out what the cause of the problem is, as the location of said problem is presented. Their study further takes into consideration what information developers want in order to be able to fix the flakiness. Since developers often want to understand and fix the problem themselves, it is important to be able to give them information to achieve that. To manage this, they have developed a report tool that presents the important information. Since their method only detects the location of flakiness, we will expand on their methods to be able to automatically detect the Randomness category. This is done by introducing an identifier for what type of root cause is present and a method of reporting the findings, based on feedback from their testing. We will mainly utilize the concept of finding the divergence in executed code to locate and determine flakiness. We exclude the usage of a longest prefix, and instead only locate the first differing executed line. If any such diverging line is located, the code leading to that point is further analyzed to locate any indications of Randomness. The technique is also extended to differentiate between returns from function calls, differing locals etc. We further use the same method, but modified and tailored to locating differing values, returns and assertions instead of lines of execution. Here we instead locate the divergence in values between iterations.


3.4

Automatic Flaky Categorization

Frameworks developed for automatically detecting root causes are not common, but such a framework, named RootFinder, has been developed by Lam et al. [11]. Their framework is produced for Microsoft projects and is made in and for C# projects. Like all other frameworks created for detecting and categorizing, it works in several steps. Firstly, CloudBuild detects flaky tests by rerunning any failed test to see if it passes. If the test passes it is classified as flaky, as the test fails and passes randomly. When the detection is done, CloudBuild stores information about all tests in a database. Each flaky test in the database has its information read and all dependencies are collected to allow for running on any local machine independent of CloudBuild. Following this, they use a tool called Torch to create an instrumented version of all dependencies. This tool is used to allow for more logging of various runtime properties at test execution. They then rerun all flaky tests 100 times to produce both passing and failing logs for each test. When the logs are collected they are analyzed using their proposed application, RootFinder. Each log file is first processed independently against certain predicates: Relative, Absolute, Exception, Order, Slow and Fast, creating logs of their outcome as well. These predicates determine if the called behaviour at a certain instruction is deemed "interesting". When this first step is done, RootFinder compares the newly created predicate logs with each passing and failing one to identify predicates that are true/false in all passing executions but are the contrary in all failing ones. To then determine the category of flakiness they use, in conjunction with other methods, keyword searches. Some keywords strongly indicate the most probable category of root cause. Similarly, we aim to utilize tracing to create trace logs which will then be analyzed when passing and failing iterations are found, although we will not use any predicate functionality since we focus on only one category. We will also utilize keyword matching to further ascertain the category of the flaky test.


4

Method

In the following sections the method used to create a dataset containing flaky tests and the methodology used to answer the research question will be presented. All testing, creation of log files and development was done on Ubuntu 20.04.2 LTS using Python 3 (3.8.5).

4.1

Thesis Work Process

The work was done in three major steps, represented in the workflow of figure 4.1. Firstly we located projects containing flaky commits and downloaded these to try and manifest their flaky behaviour.

In step 0, we focused on recreating flakiness in found flaky commits by running them with pytest. The flaky commits were of different categories and were used to create a dataset containing the commits that managed to manifest flakiness.

In step 1 we determined how to implement FlakyReporter. From the resulting log files we first categorized each one into their respective categories. Flaky tests categorized as Randomness were analyzed to locate identifiers of the Randomness category. From our findings we proceeded to implement the code of our FlakyReporter.

In step 2 we evaluated our proposed FlakyReporter. From a set of flaky tests used for evaluation, we let FlakyReporter run and collect the produced trace logs. Divergence was run on the collected data and the probability of Randomness was calculated based on the divergence data and the found identifiers. The final step of FlakyReporter is to produce a report, from which we analyzed and summarized the accuracy of our tool.


[Figure 4.1 shows the workflow as three steps: Step 0 (Log Files Generation), Step 1 (Data Analysis) and Step 2 (Evaluation of FlakyReporter).]

Figure 4.1: Workflow of thesis.

4.2

Data Collection

To be able to determine the most efficient approach for locating root causes and to have a dataset for testing and development of potential methods, flaky tests from all categories had to be found. To accommodate these requirements, we created a dataset of found flaky tests.

The created flaky dataset [16] consists of flaky tests from public GitHub repositories, found in the same manner as Luo et al. [15]: searching for keywords in popular repositories. All repositories were found either from public datasets containing research on flakiness, such as [7], or from searching for the keyword Python on GitHub. The repositories found from our search contain all publicly available Python repositories, of which the most popular ones (most starred) were selected. In these repositories, searches for flak, flaky and intermit were made, finding all commits that have mentioned and often fixed flakiness. From these commits, the parent commit is used for the dataset. Both open and closed issues that contain any of the keywords are also examined, as they often provide additional information about the flakiness. If any issue is fixed and closed, its corresponding fix merge is presented and that commit is then added to the dataset.

With the 3rd party flaky plugin for pytest it is possible to mark tests as flaky, by marking a test as @flaky. Marking tests this way will either let them be rerun a set amount of times if they fail, or let their impact on the result of the test suite be ignored. By removing the @flaky marking, we ensure that the test is run and its result is not ignored. This ensures that the test has the capability to display flaky behaviour when run. The drawback of this method of collecting flaky tests is the requirement of manually locating the flakiness. In comparison to found commits, which give some information about the root cause from the commit message, removing the @flaky tag fails to give any such information.
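
For illustration, a test using the flaky plugin's decorator might look like the following hypothetical example (max_runs is one of the plugin's options); removing the decorator exposes the failure to the test run instead of retrying it:

    import random
    from flaky import flaky

    @flaky(max_runs=3)  # retried up to 3 times before being reported as failed
    def test_sampled_mean():
        sample = [random.random() for _ in range(10)]
        # Occasionally fails: the sample mean is not guaranteed to exceed 0.4.
        assert sum(sample) / len(sample) > 0.4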


4.2.1

Log Files Generation

To create a backlog of data from test logs and prove that a test is flaky, all suspected commits were run a number of times.

The number of reruns used for each test suite was decided to be 5 000, or until 2 flaky test executions. This was determined after the first test case we ran, which required more than 1 000 iterations. The large number of reruns was made possible because most of the used projects contained small test suites that executed in a small time frame. From these reruns, tests that failed more than once were deemed flaky. Any test that did not fail during these 5 000 runs was either rerun for a set amount or deemed not flaky, depending on the time it took to execute. We decided that tests that took < 5 minutes to execute were rerun again, while tests that took > 5 minutes to execute were deemed not flaky.

The tests that failed to manifest any flakiness in this amount of runs were discarded as not flaky. Since all used repositories use either unittest, pytest or another framework that pytest supports, a simple script was created (see Listing 4.1). The script runs the test suite a given number of iterations and parses the verbose output of a pytest execution to a log file log_{i}, where i is the iteration. If the test run fails, the log file is renamed to failed_log_{i}.

    #!/bin/bash

    mkdir -p logs

    for i in $(seq 1 $1)
    do
        touch logs/log_${i}
        python3 -m pytest -v &> logs/log_${i}
        if tail -n 1 logs/log_${i} | grep -c "failed"; then
            mv logs/log_${i} logs/failed_log_${i}
        fi
    done

Listing 4.1: Script for creating logs.

4.2.2

Category Identification

The tests and projects that showcased flakiness were categorized into their respective category. The categorization was based on the descriptions of Eck et al. [3]. Our approach consisted of reading commits, code and our created test logs.

Reading the commits made for fixing the test's flakiness, in conjunction with reading the failing logs, was a method used for determining the category of a flaky test. Root causes were often tangible from the solution the developers found. For example, some flakiness was caused by comparing two unsorted lists, which caused flakiness due to their differing ordering. This was fixed by sorting both lists before comparing them, which gave the indication that Randomness or Too Restrictive Range were categories and the root causes were the list ordering and their comparison (see figure 4.2). The "-" sign in red color presents the old line of code and the "+" sign in green color presents the new line of code replacing the old line. The figure presents a change where the previously unordered list instead gets sorted to solve the cause of flakiness.


Figure 4.2: Github commit fixing Too Restrictive Range flakiness.
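
A hypothetical example of the kind of change shown in figure 4.2 (our own illustration, not the actual commit): the first assertion assumes an ordering and is flaky, while sorting both lists before comparing removes the order dependence:

    import random

    def get_user_ids():
        # Illustrative stand-in for code whose result order is not guaranteed.
        ids = [42, 7, 13]
        random.shuffle(ids)
        return ids

    def test_user_ids_flaky():
        # Flaky: the assertion assumes a specific ordering of the returned list.
        assert get_user_ids() == [7, 13, 42]

    def test_user_ids_fixed():
        # The fix pattern from figure 4.2: sort both lists before comparing them.
        assert sorted(get_user_ids()) == sorted([7, 13, 42])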

Tests that suffered from flakiness due to more abstract reasons were harder to determine, e.g. Concurrency or Async Wait. This can be attributed to their reliance on timing, where a timeout may be due to Async Wait, which in turn makes it harder to categorize.

Other categories were often simpler to determine. Randomness, for example, was easy to determine by how most of the failing and passing runs differed in values. In listing 4.2, one passing and one failing iteration of the same test is described. As can be seen at lines 5 & 12, the same assert compares two different sets of numbers (lines 6 & 13). The first iteration is passing since avse30_1 is larger than avse60, while in the second iteration it is failing since avse30_1 is smaller than avse60. Most tests that exhibit flakiness due to Randomness tend to produce a similar behaviour, where failing and passing traces contain differing variables within the same coverage. Furthermore, differing assertion values between iterations with the same result, i.e. passing or failing, is an indicator of flakiness in the Randomness category.

1  # Iteration 1
2  .
3  .
4  .
5  assert avse30_1 > avse60
6  > (0.1557815400014442 > 0.11455372861553298)
7
8  # Iteration 2
9  .
10 .
11 .
12 assert avse30_1 > avse60
13 > (0.11540383618068703 > 0.12648936212227796)

Listing 4.2: Two iterations of a "Randomness" flaky test.

The initially defined categories, which were based on the empirical study by Gruber et al. [8], were also cross-checked with the findings from the created test logs. In cases where the used repositories came from their public dataset [7], the category found by analyzing the test logs was compared to the result in the dataset. This ensured both identifying and classifying the flaky tests correctly into their respective category.


[Figure 4.3: bar chart of category occurrences for the found flaky tests; the categories on the axis are Async Wait, Concurrency, IO, Network, Platform Dependency, Randomness, Resource Leak, Test Case Timeout, Test Order Dependency, Time, Too Restrictive Range, Precision, Unknown Category and Unordered Collection.]

Figure 4.3: Found categories from public dataset and GitHub.

Figure 4.3 is a representation of all of the flaky commits used in this paper, both from our manual search on Github and from the flaky commits used by Gruber et al. [7]. Since we base our categorization on the taxonomy by Eck et al. [3], while Gruber et al. [7] use their own, figure 4.3 contains both naming schemes for the respective categories. This is both for transparency reasons and because of the high level of complexity in categorizing flakiness. For this reason there are categories that are more similar to each other than the rest. It is of great importance to have acquired flaky tests from categories other than only Randomness, since these will also be useful to test our method against. A method to determine flakiness within Randomness must also not wrongfully determine flakiness if supplied with other categories.

Async Wait and Concurrency could in some cases be categorized as Network, which may be the reason why Network as a category is more prevalent in the dataset compared to our own search through Github repositories. Another occurrence of this is that Randomness and Too Restrictive Range are difficult to distinguish from one another, resulting in most such cases being defined as Randomness.


4.3.1

Frameworks

From the projects that were able to run (see appendix A), we found a split between unittest and pytest. As this work only focuses on pytest, we gathered 51 projects using the pytest framework and 41 projects using the unittest framework, as can be seen in figure 4.4. The individual number of flaky tests using either unittest or pytest is not presented, only which framework a specific test suite, or project, is using.


Figure 4.4: Found testing frameworks from projects in our created dataset.

Since the method itself relies on the narrowed scope of pytest, 12 out of the total 51 projects were utilized to develop the method itself, capable of locating the root cause of Randomness. The remaining 39 were instead used to determine the accuracy of the method after its development phase. Keeping the remainder separated from the projects used in development enabled us to retrieve a validation with certain precision. Using only 12 projects as a basis to create the method was deemed not to impact the end product in a negative way, since Randomness tends to manifest itself similarly throughout different projects and test cases.

4.4

Data Reporting

Through analyzing our created log files, we found that tracing seemed to be a viable approach for locating flakiness. Tracing allows us to keep track of variables, return values and function calls. From what we found, we had to closely follow the execution between runs to accurately determine if the flakiness is due to Randomness. The aim of this is to locate returns, assertions and variables that vary in value between each run, both passing and failing. Figure 4.5 shows an example of this: two pytest fail messages from two iterations of the same test, where the reported values differ.


Figure 4.5: Snapshot of two pytest fail messages from two iterations of the same test.

4.4.1

Execution Traces

The function sys.settrace(fn) [6] enables the user to create custom tracing functionality, which in turn enables a more pinpointed approach. We also use the inspect package [4], which handles frames and other inspect objects. These are used to gain the actual information of the trace. The frame retrieved in the trace function represents the current frame of execution. This works by sys.settrace first registering a global trace, which invokes a callback returning the local trace, or frame containing all relevant data. Figure 4.6 explains the process in a more clear manner.


Figure 4.6: Flow of sys.settrace().
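
A minimal sketch of this global/local trace flow (our illustration of the standard sys.settrace mechanism; the target function name is a placeholder):

    import sys

    def local_trace(frame, event, arg):
        # Local trace: invoked for 'line', 'return' and 'exception' events
        # inside the traced frame.
        print(event, frame.f_lineno)
        return local_trace

    def global_trace(frame, event, arg):
        # Global trace: invoked on the 'call' event of every new frame;
        # returning the local trace function enables line-by-line tracing.
        if frame.f_code.co_name == "test_random_test":  # placeholder target
            return local_trace
        return None

    sys.settrace(global_trace)
    # ... run the target test here ...
    sys.settrace(None)  # unregister the trace function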

Using sys.settrace(fn) allows for excluding tracing into irrelevant files by only tracing the correct function name in the correct file. In comparison, Python's trace module has no support for defining what files and what functions to trace, but instead traces every call. Since pytest calls multiple functions and classes when doing a test run, it produces several redundant trace logs during its execution.

As stated earlier, this approach provides the functionality of excluding non-interesting tests, which can be seen below in Listing 4.3, lines 6-7. There it is defined how only selected target functions from a specific file will be traced.


1  def _trace_func(frame, event, arg) -> None:
2      co = frame.f_code
3      func_name = co.co_name
4      filename = co.co_filename
5
6      if func_name in target_func \
7      and filename in trace_list:
8          line_no = frame.f_lineno
9          f_locals = frame.f_locals
10         trace_line = {
11             'event': event,
12             'func_name': func_name,
13             'line_str': linecache.getline(filename, line_no).rstrip(),
14             'line_no': line_no
15         }
16         .
17         .
18         .

Listing 4.3: Simple trace function example

The linecache.getline(...).rstrip() call provides the string representation of the code executed. The string representation is paired with its potential locals by fetching them from the current frame. Line 9 in listing 4.3 shows how the frame's locals are fetched. However, the locals only appear on the following line, since the current line has yet to be executed.

The tracing is done in an event based manner, where each trace call provides an event, a frame and arguments. The frame provides the current top frame on the stack, which in turn provides meta information about the currently executing object. Events like call and return happen when a function gets called or returns, the latter including the return values. Other events include the line event, which represents a line being executed. This event is used for collecting information about executed lines of code. The trace functionality also includes fetching the frame of the parent caller, which enables us to only allow tracing if a certain parent called the function. Doing this can set the depth of tracing allowed, which currently is at a depth of 1. Figure 4.7 illustrates the depth traced, where every arrow represents a call/return. The crossed out arrows are calls that are not traced and are thus ignored in the resulting trace logs. In figure 4.7 it is displayed how the Test Function performs several calls to pytest, since it is the module controlling the testing environment. The Python library contains several different files native to the Python environment. From the analysis done on the found flaky tests we found that these files are often uninteresting or can be ignored. We further noted that tracing into multiple function calls, i.e. calls from a call, is mostly redundant. The first Called Function, at depth 1, represents any user defined function not native to Python or pytest. This function gets fully traced, i.e. lines executed, locals and return value. Any function calls from this depth onward are not traced. By ignoring tracing at greater depth, we effectively reduce the potential overhead while still maintaining the relevant information needed to deduce Randomness. It is however possible to extend the depth traced, which is easy to accomplish in our implementation and can be tailored to any desired depth.


Figure 4.7: Overview of execution trace depth.
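
The parent object used in Listing 4.4 below is presumably the code object of the caller's frame; a rough sketch of how such a depth-1 check can be made (an assumption about the implementation, for illustration only):

    def _called_from_target(frame, target_name="test_random_test"):
        # Trace a called function only if it was called directly from the
        # target test function, i.e. keep the trace depth at 1.
        caller = frame.f_back  # frame of the parent caller
        return caller is not None and caller.f_code.co_name == target_name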

Listing 4.4 describes the two events, call and return, as well as how the parent is used to determine if the function should be traced. Line 23 stores the return value, which resides in the arg parameter of the trace call.

1  .
2  .
3  .
4
5  elif event == 'call':
6      try:
7          if parent.co_name in self.logs:
8              if 'call' not in self.logs[parent.co_name]:
9                  self.logs[parent.co_name]['call'] = dict()
10             .
11             .
12             .
13
14     except Exception as e:
15         print('Trace call exception, {}\nin file {}'.format(e, filename))
16 elif event == 'return':
17     try:
18         if parent.co_name in self.logs:
19             self.logs[parent.co_name]\
20                 ['call']\
21                 [frame.f_back.f_lineno]\
22                 [func_name]\
23                 ['return'] = arg
24     except Exception as e:
25         print('Trace return exception, {}\nin file {}'.format(e, filename))
26
27 .
28 .
29 .


Logging the executed lines allows for using a divergence method similar to Ziftci and Cavalcanti [27], determining that the same part of the code gets executed in both failing and passing runs. If both passing and failing runs execute the same code, the variable that causes the test to fail is tracked through all execution logs. Depending on how the variable changes value between each run, it can be attributed to Randomness. I.e., if both passing and failing logs vary in the value that caused the test to fail, it is most likely due to randomness.

4.5

FlakyReporter

FlakyReporter calculates the probability of a test being flaky due to Randomness and produces an interactive .html report in several steps. It first reruns the target flaky test and traces its execution, creating trace logs. When the log files have been created for the target test function, they are parsed and analyzed, and the test's suspiciousness is calculated in several steps. In the following sections we explain how the category is determined and how the location of the possible root cause is found. This is done in several steps, as can be seen in figure 4.8. Firstly, the executed lines are checked for diverging lines. If none are found, we can compare the common identifiers of Randomness: returns, locals and assertions. These three categories tend to display randomness by differing between runs with the same result. For example, function returns changing values for each passing iteration indicate that the test suffers from randomness. If a divergence is found, we only compare returns and locals. We ignore assertions due to how impactful differences in the executed lines between passing and failing runs are towards the final assertion. Furthermore, if a divergence is found we never reach the final assertion, as we do not continue past the first divergent line. Therefore we classify it as partial data, as it does not contain data from the full trace log. This is further explained in section 4.5.7. In the following subsections we explain the steps taken, as described in figure 4.8, to produce the report of the probability of the test being flaky due to Randomness.


4.5.1

Rerun Flaky Test

FlakyReporter starts with rerunning the test files with pytest, using execution tracing to store the information of the currently executed lines.

The number of reruns is defined by the user before the tool starts executing. The tool traces the execution of the target test function and stores it, in conjunction with the pytest results, locally in text files.
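
As an illustration of this step (a sketch, not FlakyReporter's actual code; the node id and rerun count are placeholders), a single test function can be rerun programmatically with pytest:

    import pytest

    def rerun_target(node_id: str, reruns: int) -> list:
        # Rerun one test function `reruns` times and record pass/fail;
        # pytest.main returns exit code 0 when the selected test passes.
        outcomes = []
        for _ in range(reruns):
            exit_code = pytest.main(["-q", node_id])
            outcomes.append(exit_code == 0)
        return outcomes

    # e.g. rerun_target("tests/test_initial.py::test_random_test", 10)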

4.5.2

Trace Logs

The initial step of FlakyReporter generates trace logs. The trace logs consist of information gathered by tracing together with the pytest results. The logs are created in a format of lineno - string < locals, Call-> func : fname, or C-> lineno - string < locals. The logs may contain any number of iterations, i.e. instead of having ten files for the same function, only one file is used where each iteration is separated by 20 equal signs and a newline (see Listing 4.5 below).

1  ___line 11  def test_random_test():
2  ___line 12      rand = create_random()
3  Call-> create_random : .../test_initial.py
4  C-> ___line 3   def create_random():
5  C-> ___line 4       rand = random.randint(0,10)
6  C-> ___line 5       return rand
7  C-> ret 4
8  < (rand = 4)
9  ___line 13      if rand == 0:
10 ___line 15      rand2 = create_random()
11 Call-> create_random : .../test_initial.py
12 C-> ___line 3   def create_random():
13 C-> ___line 4       rand = random.randint(0,10)
14 C-> ___line 5       return rand
15 C-> ret 6
16 < (rand = 4)
17 < (rand2 = 6)
18 ___line 16      assert rand2 >= rand
19 > (6 >= 4)
20 ====================
21
22 ___line 11  def test_random_test():
23 ___line 12      rand = create_random()
24 Call-> create_random : .../test_initial.py
25 C-> ___line 3   def create_random():
26 C-> ___line 4       rand = random.randint(0,10)
27 C-> ___line 5       return rand
28 C-> ret 3
29 < (rand = 3)
30 ___line 13      if rand == 0:
31 ___line 15      rand2 = create_random()
32 Call-> create_random : .../test_initial.py
33 C-> ___line 3   def create_random():
34 C-> ___line 4       rand = random.randint(0,10)
35 C-> ___line 5       return rand
36 C-> ret 8
37 < (rand = 3)
38 < (rand2 = 8)
39 ___line 16      assert rand2 >= rand
40 > (8 >= 3)
41 ====================
42 .
43 .
44 .


The "<" sign represents the locals. This can be seen in line 8; where "< (rand = 4)" rep-resents the assignment of rand. The ">" sign reprep-resents the assertion value, or comparison. On line 19; "> (6 >= 4)" represents a passing assertion where "assert rand2 >= rand" is the same as "assert 8 >= 7". At line 3 a call is made to the function create_random which resides in the file test_initial.py. The following "C->" lines represents the trace within the called function. The "ret 4" at line 7, represents the return value of the called function.

The trace logs are then parsed back into the tool, which reads and stores them in a format usable for the remaining steps of producing a report.
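To give an idea of how the log format above can be read back, the following sketch classifies a single log line by its prefix. The function name parse_trace_line is illustrative; the actual parsing in FlakyReporter may differ.

def parse_trace_line(line):
    # Classify one trace-log line by the prefixes used in Listing 4.5.
    line = line.strip()
    if line.startswith("===="):
        return ("iteration_end", None)
    if line.startswith("Call->"):
        return ("call", line[len("Call->"):].strip())
    if line.startswith("C->"):
        return ("called_line", line[len("C->"):].strip())
    if line.startswith("<"):
        return ("local", line[1:].strip())
    if line.startswith(">"):
        return ("assertion", line[1:].strip())
    return ("line", line)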

4.5.3 Execution Divergence

As can be seen in figure 4.8, the first step in determining Randomness, after reading the trace logs, is performing a divergence analysis on the executed lines. Our divergence method is implemented in a similar way as Ziftci and Cavalcanti [27], where the first diverging line is found and stops further line analysis. Any new, earlier divergent line found becomes the new stopping point for further comparisons, reducing the actual commonly executed lines of code over all logs. We store the commonly executed code for passing/failing until any diverging code is found, in the same way as their method. We also add more data and store all locals, assertions and function calls as well as their return values. This is then used in the later stages to calculate the result, i.e. the probability of the flakiness pertaining to the Randomness category. In listing 4.6 we have written an example, taken from Ziftci and Cavalcanti [27] and re-written in Python. The gray area represents the commons (commonly executed code), while green and red represent what the passing and failing iteration, respectively, diverge in.

import random

def divergent_function():
    value = random.randint(0, 1)
    if value == 0:
        return False
    else:
        return True

Listing 4.6: Example of divergence. The green and red lines represent the passing and failing iterations and the code that differs between them. Gray represents the commons.

The function will diverge between the return statements, as the False return indicates a failed run. This should split the result of each iteration evenly between passing and failing. Ziftci and Cavalcanti [27] argue that the location of the first divergence detected is the fault location. We conform to their argument and ignore further line analysis when a divergence is found. Instead, the next passing log is compared to the failing log to gather more data about locals, returns and, in some cases, new divergences.


In Algorithm 4.1 each combination of failing and passing iterations gets checked, where each test in Tp gets checked against each test in Tf. I.e. each passing test tp ∈ Tp gets checked against each failing test tf ∈ Tf, where if any executed line linep in tp differs from the corresponding line linef in tf, a divergence is found. The different variables represent their respective type of data, where div contains the diverging lines and all common lines between failing and passing iterations.

div, locals, returns, assertions ← ∅
foreach (tf, tp) in (Tf, Tp) ∈ T do
    commons ← ∅
    foreach line in tf, tp do
        if linep != linef do
            div ← linep + linef + commons
            break
        end
        commons ← commons ∪ line
        returns ← returns ∪ getReturns(linecf, linecp)
        locals ← locals ∪ getLocals(linef, linep)
    end
    assertions ← assertions ∪ getAssertions(tf, tp)
end

Algorithm 4.1: Divergence algorithm.
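Translated into Python, the comparison in algorithm 4.1 could be sketched roughly as follows. The helpers get_returns, get_locals and get_assertions are placeholders for the data extraction described above and are not part of the actual implementation.

def get_returns(line_f, line_p):
    # Placeholder for extracting "C-> ret ..." values attached to these lines.
    return []

def get_locals(line_f, line_p):
    # Placeholder for extracting "< (...)" locals attached to these lines.
    return []

def get_assertions(t_f, t_p):
    # Placeholder for extracting the final "> (...)" assertion values.
    return []

def find_divergence(failing_logs, passing_logs):
    # Compare every failing iteration against every passing iteration and stop
    # the line-by-line comparison at the first diverging executed line.
    div, locals_, returns, assertions = [], [], [], []
    for t_f in failing_logs:
        for t_p in passing_logs:
            commons = []
            for line_f, line_p in zip(t_f, t_p):
                if line_f != line_p:
                    div.append((line_f, line_p, list(commons)))
                    break
                commons.append(line_f)
                returns.extend(get_returns(line_f, line_p))
                locals_.extend(get_locals(line_f, line_p))
            else:
                # No divergence for this pair: the final assertions can be used.
                assertions.extend(get_assertions(t_f, t_p))
    return div, locals_, returns, assertions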

One thing done in the background of the divergence algorithm is locating keywords on each line read. While each line is checked for any difference between the passing and failing execution, it is also checked for keywords. A keyword is any word that may reference random, and to support further developments of random functions and libraries we support a keywords.txt file. This file contains, one per line, every keyword that might possibly reference a random function. Listing 4.8 displays a short example list of keywords that reference random functionality. Each line, until any divergence is found, is scanned for words containing any of the ones in the list. Found keywords are later used to further argue for Randomness being the category in the Calculate Result step.

rand
Rand
randint
RandInt
random
Random
uniform
Uniform

Listing 4.8: Example of a short keyword list.
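A possible sketch of this keyword scan is shown below, assuming keywords.txt contains one keyword per line as in Listing 4.8; the helper names are illustrative.

def load_keywords(path="keywords.txt"):
    # Read one keyword per line, ignoring empty lines.
    with open(path) as f:
        return [word.strip() for word in f if word.strip()]

def scan_line_for_keywords(line, keywords):
    # Return every keyword that occurs as a substring of the executed line.
    return [kw for kw in keywords if kw in line]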

4.5.4 Compare Return Values

All return values reached while running our divergence method are stored and compared. In each iteration where a call happens before any divergent line is found, the function _store_returns(self, ...) is executed, which stores all unique returns. Since all returns are read as strings, we store the string representation of the returned value. This allows us to read and store user defined objects, which would not work otherwise, and it further allows us to uniquely store each returned value.

The return value is stored together with its resulting run and the line number where the call occurred. The function arguments contain the called function name, the line number where the function was called, the passing return value and the failing return value. The number of occurrences of a passing return value is incremented. This provides data on the proportion of all iterations that have the same specific return value. It further provides the number of differing return values between each passing and failing iteration, as the number of keys in the dictionary corresponds to the number of different return values.


The resulting data stored is then compared to determine the number of differing return values relative to the number of iterations. This is done by dividing the number of keys by the number of iterations, independently for passing and failing runs. The aim is to compare whether the amount of differing return values is distributed in a similar scope between both passing and failing. We further compare whether any return value from any failing iteration is present in any passing iteration. If both passing and failing logs contain differing return values it indicates randomness. However, if any return value exists in both passing and failing, it does not indicate that Randomness is the cause of flakiness, i.e. if there exists any failed return fr ∈ Fr and passed return pr ∈ Pr where fr = pr, then Returns ↛ Randomness. Nor do differing returns inherently imply Randomness as the cause of flakiness. From these findings we calculate a numeric representation of how impactful the differing return values might be. This is later used to calculate the final result of how probable it is that the flakiness pertains to Randomness.
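As an illustration of the comparison described above, the following sketch assumes the return values of all passing and failing iterations have been collected into two dictionaries keyed by their string representation. The names are illustrative and the actual implementation may weigh the findings differently.

def compare_returns(passing_returns, failing_returns, n_passing, n_failing):
    # passing_returns / failing_returns: dict mapping the string representation
    # of a return value to its number of occurrences (n_passing, n_failing > 0).
    diversity_passing = len(passing_returns) / n_passing
    diversity_failing = len(failing_returns) / n_failing

    # A return value seen in both passing and failing runs weakens the
    # indication that this return is the source of the flakiness.
    overlap = set(passing_returns) & set(failing_returns)

    differs_in_both = len(passing_returns) > 1 and len(failing_returns) > 1
    return diversity_passing, diversity_failing, overlap, differs_in_both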

4.5.5 Compare Locals

The term locals refers to all local variables in a function. Each line that contains any stored local is stored in the log files, which are then read during our divergence check. Each local is kept until the end of that variable's lifetime and is present on each subsequent line. Listing 4.9 contains two variables, or two locals. At line 2 the execution assigns a value to a local, but this value is not set until the line has executed. Therefore the value of bar is not present as a local until line 3. The same goes for baz, which is initialized at line 3; line 3 needs to be executed before baz is set as a local. Since bar is still in use when baz is initialized, both of them are the locals present at line 5. Until a return is reached the locals will stay the same or increase, unless any of the locals is disposed of.

1 def foo():
2     bar = 5        # No locals at this line
3     baz = 120      # {bar: 5} is a local at this line
4
5     if bar == 5:   # {bar: 5, baz: 120} are locals at this line
6         return True
7     else:
8         return False

Listing 4.9: Example of locals.

The locals are compared in the same manner as the return values: each local is compared across every iteration to locate any locals differing in value between runs. This is done for both passing and failing logs, where the occurrences of each local value are stored. The locals are also compared to the passing and failing assertion statements, such that if any local that differs in value between runs is used in a failing assertion, it further indicates randomness.
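A simplified sketch of this comparison is given below, assuming the locals of every iteration are available as dictionaries from variable name to (string) value; the structure and names are illustrative.

def suspicious_locals(passing_locals, failing_locals, failing_assertion_vars):
    # passing_locals / failing_locals: one {variable: value} dict per iteration.
    suspicious = {}
    all_names = {name for run in passing_locals + failing_locals for name in run}
    for name in all_names:
        passing_values = {run[name] for run in passing_locals if name in run}
        failing_values = {run[name] for run in failing_locals if name in run}
        # A local whose value differs between iterations is suspicious; one that
        # is also used in a failing assertion weighs heavier in the final score.
        if len(passing_values | failing_values) > 1:
            suspicious[name] = name in failing_assertion_vars
    return suspicious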


runs, it does imply Randomness. Although that is the case, most cases of Randomness tend to display differing values in both passing and failing assertions.

The assertions are compared by checking the number of distinct assertion values over the full set of passing or failing test iterations. By doing this we gain a better estimation of the amount of differing assertion values, as the number of failing runs should be fewer than the number of passing runs. With a greater proportion of differing assertion values, the probability of the flakiness being due to randomness is in turn also greater.
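The proportion of distinct assertion values could, for example, be estimated as in the short sketch below; the function name is illustrative.

def assertion_diversity(passing_assertions, failing_assertions):
    # Each argument is a list with one assertion value string per iteration,
    # e.g. "(6 >= 4)". The ratio of distinct values to iterations is used as an
    # indicator of randomness for passing and failing runs separately.
    passing_ratio = len(set(passing_assertions)) / max(len(passing_assertions), 1)
    failing_ratio = len(set(failing_assertions)) / max(len(failing_assertions), 1)
    return passing_ratio, failing_ratio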

4.5.7 Compare Partials

If any divergent line is located, the full data of the comparisons is not available. Instead we run the same method for comparing data, but with only partial information. We collect the locals and returns that are available before the divergent line and compare them, instead of comparing the full test execution. An implication of doing partial comparisons is the lack of assertion statements: because we stop at the first divergent line found, the final assertion is never reached when comparing.

The partial comparisons are performed in an identical manner to the ones already mentioned, i.e. Compare Return Values and Compare Locals. The only difference is the amount of data and the resulting accuracy of determining Randomness as a root cause.

4.5.8 Calculate Result

Calculating the probability of Randomness is done in a simple manner, where each indicator of random behaviour adds points to the variable rnd_probability, which represents the probability of Randomness. All calculations and the resulting score are done in the background and are never presented to the user. This means that we never present any specific score to the user, but instead present the indicators of Randomness found together with a broader categorization:

• No Indications of Randomness, is the result when no indications have been found.
• Few Indications of Randomness, is when the number of indications is very small. This is a result that can be seen as the margin of error between No Indications of Randomness and Some Indications of Randomness.
• Some Indications of Randomness, is when there are enough indicators for it to be possibly flaky due to Randomness.
• Many Indications of Randomness, is when the amount of indications found strongly implies flakiness due to Randomness.

Each part (locals, assertions, returns and keywords) is "measured" to give an estimation of the flaky category. For locals, we calculate the number of differing local values divided by the number of iterations, followed by returning its average. The formula can be seen in equation 4.1, where Nf, Np ∈ N are the number of iterations for failing and passing runs respectively.

The values l_i^f, l_i^p ∈ L reference the set of suspicious values any local gets assigned at a given iteration i. The impact constant is used to define how impactful a certain element is. Random variables used in the failing assertion are more impactful and therefore have a higher impact value. This is used to further balance the probability value of any test function, creating a more reliable and easier to modify probability value.

probability = probability + \frac{\sum_{i=0}^{N_f} l_i^f}{N_f} + \frac{\sum_{i=0}^{N_p} l_i^p}{N_p} + \frac{impact}{2}    (4.1)


The returns are calculated in the same way as the locals. They follow the exact same principle as in equation 4.1, but instead of a set of locals, a set of returns is compared and calculated. The assertions also follow the same formula but differ in that it is not a set of different assertions that is used for calculating, but one set of different values. Assertions also differ in that they are only calculated when no divergence is located, since no assertion can truly be examined when a divergence is located. The final result is calculated as described in equation 4.2, where probability is the resulting probability from equation 4.1.

result = 1 - \frac{1}{probability}    (4.2)

The final result, described in equation 4.2, is therefore a number between 0 and 1. We define the range 0 ≤ result ≤ 0.1 to be No Indications of Randomness, 0.1 < result ≤ 0.4 to be Few Indications of Randomness, 0.4 < result ≤ 0.7 to be Some Indications of Randomness and 0.7 < result ≤ 1.0 to be Many Indications of Randomness.
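A sketch of how equations 4.1 and 4.2 and the ranges above could be combined is given below. The exact weighting of the impact constant and the bookkeeping of rnd_probability in FlakyReporter are not reproduced here; the sketch only illustrates the principle.

def add_indicator(probability, failing_counts, passing_counts,
                  n_failing, n_passing, impact):
    # Equation 4.1: add the contribution of one indicator group (locals,
    # returns or assertions) to the accumulated probability. failing_counts /
    # passing_counts hold the number of suspicious values per iteration.
    probability += sum(failing_counts) / n_failing
    probability += sum(passing_counts) / n_passing
    probability += impact / 2
    return probability

def categorize(probability):
    # Equation 4.2 followed by the threshold ranges defined above.
    result = 1 - 1 / probability if probability else 0.0
    if result <= 0.1:
        return "No Indications of Randomness"
    if result <= 0.4:
        return "Few Indications of Randomness"
    if result <= 0.7:
        return "Some Indications of Randomness"
    return "Many Indications of Randomness"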

4.5.9 Produce Report

The result is stored and presented to the reader through a report.html file. Metadata from the local environment and the total number of iterations run are also added to the report. The report is formatted to enable interactive viewing by adding expandable sections. This allows for better readability and flexibility and lets the user select what information to view. Ziftci and Cavalcanti [27] state the importance of the readability of any information relevant for the developer to fix the issue. The choice of .html is also inspired by their report creation.
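The expandable sections could, for example, be realized with plain HTML details/summary elements, as in the sketch below; the layout of the actual report.html is not reproduced here and the names are illustrative.

def write_report(path, verdict, meta, sections):
    # sections: list of (title, body) tuples, each rendered as an expandable
    # block so the user can select what information to view.
    blocks = "\n".join(
        f"<details><summary>{title}</summary><pre>{body}</pre></details>"
        for title, body in sections
    )
    html = (
        "<html><body>"
        f"<h1>FlakyReporter</h1><p>{verdict}</p><p>{meta}</p>"
        f"{blocks}"
        "</body></html>"
    )
    with open(path, "w") as f:
        f.write(html)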

4.6 Evaluation

To evaluate our proposed method we tested it against our created dataset, where the selected tests are ones not used directly during development. Each test selected for evaluation is run until four iterations fail. The logs are created with our FlakyReporter, and the accuracy check is done using the resulting probability number and the resulting report.

Since our method produces four different results, we define a false-positive as Many Indications of Randomness and Some Indications of Randomness. That is, when any flaky test not in the Randomness category produces the Many Indications of Randomness result, it is considered a false-positive. For the Some Indications of Randomness result, we define it as a false-positive depending on the data collected and how it corresponds to Randomness. For each collected indicator, we manually inspect it to see how well it was determined and whether it can be considered a false-positive. Similarly, a false-negative is defined as Few Indications of Randomness and No Indications of Randomness.

Each test is run 200 times, creating a total of 200 trace logs which are then parsed and analyzed. The resulting outcome of each test is checked and recorded to verify the number of correct and false results generated by our FlakyReporter. Of our entire dataset, some tests were unable to produce flakiness when run with our FlakyReporter. Due to this,
