
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Datateknik

2021 | LIU-IDA/LITH-EX-G--2021/052--SE

Possibilities of automatic detection of "Async Wait" Flaky tests in Python applications

Möjligheter till automatisk detektering av icke-deterministiska tester inom "Async Wait"-kategorin i Pythonapplikationer

Joel Nilsson

Supervisor: Azeem Ahmad
Examiner: Ola Leifler



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Flaky tests are defined as tests that show non-deterministic outcomes, meaning they can show both passing and failing results without changes to the code. These tests cause a major problem in the software development process since it can be difficult to know if the cause of a failure originates from the production code or the test code. Developers may choose to ignore failing tests known to be flaky, when these might actually hide real bugs in the production code.

This thesis investigates a specific category of flaky tests known as "Async Wait": tests that make asynchronous calls to servers and other remote resources and fail to properly wait for the results to be returned. There are tools available for detecting flaky tests, but most of these need the test to be executed and operate on runtime information. In order to detect potential flakiness at an even earlier stage, this thesis looks into whether it is possible to predict flaky outcomes by analyzing only the test code itself, without running it. The scope is limited to the Async Wait category in order to determine in which cases and under what circumstances developing an algorithm to automatically detect these flaky tests would be possible.

Commits from open source projects on GitHub were scanned for Async Wait flaky tests with the intention of finding the characteristics of the asynchronous calls, how the waiting for them is handled, and how the flakiness is resolved by developers in practice, in order to see if the information in only the test code is enough to predict flaky behavior.


Acknowledgments

I would like to thank my supervisor Azeem Ahmad for the help during the thesis and for the feedback on the final report. I would also like to thank my examiner Ola Leifler for the support and for discussing problems that came up along the way. Lastly I want to thank my opponents Daniel Mastell and Jesper Mjörnman for the comments and feedback on the report and presentation.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Background
  2.1 Software testing
  2.2 Continuous Integration
  2.3 Flaky tests
  2.4 Flaky test categories
  2.5 Async Wait
  2.6 Python
  2.7 Version control and Github
  2.8 Program analysis
3 Related work
  3.1 Categorization of flaky tests
  3.2 Resolving flaky tests
  3.3 Flakiness in Python applications
  3.4 Static analysis of the test code
4 Method
  4.1 Creating a set of flaky tests
  4.2 Analysis
5 Results
  5.1 Waiting operations
  5.2 Flaky test fixes
  5.3 Asynchronous calls
  5.4 Project specific flaky causes
  5.5 Async Wait flaky test characteristics
  5.7 Automatic detection
6 Discussion
  6.1 Results
  6.2 Method
7 Conclusion
  7.1 RQ1: Is the information available in only the test code enough to automatically detect flaky tests in the Async Wait category?
  7.2 RQ2: What traits of a test can indicate that it may be susceptible to flakiness from asynchronous waiting?
  7.3 RQ3: How are the Async Wait flaky tests resolved by developers in practice?
  7.4 Future work
Bibliography


List of Figures

4.1 Example of a commit from the integrations-core project that claims to fix flakiness by increasing the time sleeping from 1 to 1.1 seconds.
5.1 Example of a commit from the CPython repository that reduces flakiness by introducing a time.sleep call.
5.2 Example of a commit from the Salt project that reduces flakiness by increasing a sleep call.
5.3 Example of a commit from the Raiden project that fixes flakiness by changing a sleep into a waitFor condition.
5.4 A commit from the YouCompleteMe project that introduces a waitFor in the form of a while loop.
5.5 A commit from the Sentry project that modifies the waitFor to fix flakiness.
5.6 A commit from the Raiden project that increases the timeout to fix flakiness.
5.7 How the common case flaky test looks in Django and how it is fixed.


List of Tables

3.1 Async Wait flaky tests found in studies categorizing flaky tests.
4.1 Projects containing flaky commits.
4.2 The number of commits with keywords indicating flakiness and how many were Async Wait.
4.3 The number of Async Wait flaky tests in each analyzed project.
5.1 Tests categorized on the type of waiting before fixes.


1

Introduction

Testing is an important part of the software development process. Each time code is changed, tests are run to make sure there are no new bugs or flaws introduced to the software. In order to guarantee that the released version of an application works as intended, you would need to assume that running the same test several times over the same unchanged code would produce the same result every time. This however is not always the case. Flaky tests are defined as tests that can change their outcome without changes in the production code. This non-deterministic behavior of a test can have its root cause in either issues in the infrastructure of the production code or in the test code itself [6].

Test flakiness can originate from several different causes. This report will analyze a specific category of flaky tests referred to as "Async Wait" by Luo et al. [14]. These authors conducted an investigation on root causes of flaky tests and did so by organizing them in different categories. One of the more common categories was found to be where test flakiness could be derived from whether or not the test code succeeded in properly waiting for the response of an asynchronous call.

With the problems that come with flaky tests, being able to detect them early on can be very beneficial and save much work down the line. This is done by manually or automatically analyzing the code with the intention of predicting if a test has the potential to flake. Reviewing the code manually is a time consuming task but can be used as a simple method of finding and predicting where flakiness could occur. Automatic detection run on the test code is more efficient for large projects, but also comes with certain constraints as to what is possible for an algorithm to detect. The purpose of this report is to investigate the possibilities and problems associated with automatically predicting if a test is prone to show flaky outcomes by only looking at the test code, in the specific category of Async Wait.

Many of the tools [11, 2, 12] available for automatic detection and analysis require the tests to be run one or several times to reach conclusions on flakiness, while this report focuses on the possibilities of detecting them even before this stage, using only information from the actual code.

1.1

Motivation

Flaky tests cause a major problem in the software development process since it can be difficult to know if the cause of a bug originates from the production code or the test code. If the bug is in fact only in the test code, unnecessary resources will be put into fixing the problem when the actual production code may be working fine. If flakiness is hard to reproduce and resolve, developers may choose to ignore failing runs, which could lead to bugs in the production code not being detected. For this reason it is important to find these tests as early as possible to prevent flakiness from occurring.

Flaky tests are commonly occurring in many software applications. According to Luo et al. [14], Google's TAP system experienced test failures caused by flaky tests in 4.56% of the 1.6 million tests run each day. Similarly, Lam et al. [11] tracked five different projects for a month and found that 4.6% of the tests were showing flaky behavior. This shows that flaky tests are a major problem in modern software development which should not be ignored. Finding these tests and correcting them can take a large amount of work, as it might not be clear to the developer if the cause of a failure is in the production or the test code.

Choice of category and language

Most research has been done analyzing all or several categories of flaky tests [14, 11, 10, 6, 17], while this report will focus on the Async Wait category only. Luo et al. [14] found, when collecting commits to analyze, that 45% of them belonged to the Async Wait category. This indicates that this is an important category to focus on. By limiting the scope of this report to just one category, the aim is to give a deeper analysis of the causes of flakiness and to focus on providing developers with useful information to aid them in resolving the error. Also, most works analyzing flaky test data have focused on projects mainly written in Java [10]. While there are some empirical studies [10, 17] looking at flakiness in Python specifically, it is still a recent area of research with many parts still unexplored.

According to the TIOBE index [19] Python is on the rise and ranked as the second most popular programming language as of June 2021 meaning this is becoming an increasingly important language to focus on.

Applicability of automatic detection

Since flaky tests clearly take up a lot of development resources, there have been tools developed that help detect flakiness and understand its root causes [11, 2, 12]. These tools commonly rely on runtime information to detect, categorize or analyze flakiness. Lam et al. [13] found that 75% of flaky tests are flaky already when first added, which shows that detecting potential flakiness from the code when the developer is actually writing the test would mean avoiding future debugging efforts. Finding the potential flakiness at this early stage would allow the developer to prevent it before it is merged with the production code.

The possibility of detecting flakiness in the Async Wait category from only the information available in the test code is investigated in this work.

1.2

Aim

This work explores the characteristics of flaky tests in the Async Wait category, how they are fixed in practice and techniques on how they can be properly classified. The aim is to give more insight into what causes the flakiness in these specific types of tests and how it is resolved with the intention of investigating whether the flakiness could be automatically found before executing the test.

There are existing studies that partly look into Async Wait flaky tests [14, 10, 17, 11, 6], but these only briefly explain the characteristics of this category. The intention of this work is to provide more detail on what the flakiness in these tests originates from and how it can be effectively resolved. Using this information, this report aims to explain in what different ways Async Wait flaky tests can present themselves, in order to give an understanding of in which cases and under what circumstances automatic detection from the code would be possible.

1.3

Research questions

1. Is the information available in only the test code enough to automatically detect flaky tests in the Async Wait category?

2. What traits of a test can indicate that it may be susceptible to flakiness from asyn-chronous waiting?

3. How are the Async Wait flaky tests resolved by developers in practice?

1.4

Delimitations

The projects collected are limited to being written mainly in Python. Some larger projects have parts implemented in other languages, but the test cases are always in Python code. The report only focuses on analysis of flaky tests in the Async Wait category.


2

Background

This chapter will explain concepts related to testing and the development cycle as well as the nature of the different kinds of flaky tests that are present in today’s software applications.

2.1

Software testing

With frequent changes to code there is always the risk of new additions interfering with existing code and causing issues and bugs. To mitigate this problem it is common practice to test an application to assure all features are working correctly.

Unit testing

Unit testing is a common approach to software testing where separate units of the program are tested individually. A unit is a part of code that can be executed and tested in isolation, usually built up by a class or collection of functions. A unit test would call several functions of the class and check if the results match the expected outcomes [16].
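As an illustration, a minimal unit test written with Python's built-in unittest module could look as follows. The Calculator class and its add function are hypothetical examples, not taken from any of the studied projects.

import unittest

class Calculator:
    def add(self, a, b):
        return a + b

class TestCalculator(unittest.TestCase):
    def test_add(self):
        # The unit under test is exercised in isolation and the
        # result is compared against the expected outcome.
        self.assertEqual(Calculator().add(2, 3), 5)

if __name__ == "__main__":
    unittest.main()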

2.2

Continuous Integration

In modern workplaces, software development teams often work on different features in parallel, which are then merged into the production version in use. If large changes and features are developed separately and merged to the main branch in only one step, there is a significant risk that the new code will cause conflicts and errors when interacting with other new parts of the code base. The concept of continuous integration (CI) relates to practices for minimizing this risk by integrating new features in smaller steps on a regular basis [7]. With continuous integration, additions to features are frequently put through the CI pipeline, which goes through the steps of building, testing and integrating the changes. Only builds that pass the tests are deployed to the production version, assuring that they will not negatively interfere with the active code base.

With frequent submissions into the CI pipeline it is important that the tests produce consistent results, meaning a pass will always result in a pass on a distinct version of the production code.


2.3

Flaky tests

Flaky tests are defined as tests whose outcomes are non-deterministic on the same code [14]. Being run several times a flaky test may result in either a pass or a fail without any changes made to the test or production code.

There are numerous reasons for why a test might be showing flaky behavior. To give an example, it may stem from certain values in random variables causing unexpected failures. It may also be because of external reasons such as network issues leading to intermittent results [14].

These unreliable characteristics of flaky tests can lead to problems for developers as it can be difficult to know if the issue causing intermittent failures is in the production code under testing or if it is in the test code itself. A flaky test might be ignored assuming there is a bug in the test, when in reality the bug might be hiding in the production code.

2.4

Flaky test categories

Flaky tests are non-deterministic by nature, meaning they behave differently without changes to the code they operate on. This can occur for numerous reasons, such as random variables or the current system time affecting whether a test results in a pass or fail.

Luo et al. [14] proposed the following categorization for flaky tests:

• Async Wait: These are tests where an asynchronous call is made and the flakiness is caused by the test not properly waiting for the result.

• Concurrency: This category describes tests that show flaky behavior because of issues in using multiple threads, for example race conditions and deadlocks that can cause different outcomes.

• Test order dependency: In these test cases, the order in which the tests are run is the cause for the non-determinism in result outcomes.

• Resource leak: This type of flaky test occurs in relation to how resources such as memory allocation and databases are handled.

• Network: Here the outcome of the test is dependent on the network connection. Contrary to Async Wait, this category deals with connection failures rather than delays. It is split up into two categories, differentiating between remote and local connections.

• Time: Tests belonging to this category have their flakiness originate from the use of the system time. Factors such as differences in time zones and time precision can be overlooked by developers, causing unexpected failures or passes.

• IO: Flakiness is in this case caused by how IO operations such as reading or writing to files are managed.

• Floating point operations: When performing floating point operations inconsistent outcomes can arise from under- and overflows occurring.

• Randomness: These tests show flakiness based on random numbers generated and used. Certain numbers could cause the code to behave in ways the developers had not intended.

• Unordered collections: Here the flaky tests are caused by the order of elements in unordered collections such as sets.


2.5

Async Wait

As this is the category of interest in this analysis it is described in more detail here.

Asynchronous calls

An asynchronous call does not wait for the result of the call to be returned. Instead the program execution continues unless the developer explicitly implements a rule to wait for the completion of the call. The asynchronous call can be internal, where a function or service in the application is run concurrently in a separate thread, or it can be external, where the call is made to external resources, for example a remote server [14].

Flakiness caused by asynchronous calls

The fact that the program execution continues while the call is being processed can bring a variety of problems. It means that when an asynchronous call is made to a resource, for example to a server, the result is not guaranteed to be available when it is to be used in the program. As described by Luo et al. [14] , a common way of dealing with these calls is to wait for a set amount of time to allow for it to complete and for the result to be available. It can of course not be assured that the result is returned in this predefined amount of time, and this is where the asynchronous waiting flaky behavior originates from. When the asynchronous call plays a part in a test case evaluation, the non-deterministic outcome seen in flaky tests can be observed where a passing or failing result depends on if the call result was returned and available before the test assessment.
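As a minimal sketch of this pattern, a test of this kind could look as follows. The fetch_async function, the result_holder dictionary, the URL and the time values are all hypothetical and not taken from any of the studied projects.

import threading
import time

result_holder = {}

def fetch_async(url):
    # Hypothetical asynchronous call: the work is done in a background
    # thread and the result appears in result_holder when it is finished.
    def worker():
        time.sleep(0.5)  # stands in for network latency, which varies in practice
        result_holder["status"] = 200
    threading.Thread(target=worker).start()

def test_fetch_status():
    fetch_async("http://example.com/api")
    # The test waits a fixed amount of time and hopes the call has completed.
    time.sleep(1)
    # If the call happens to take longer than one second, the key is missing
    # and the assertion fails, even though nothing in the code has changed.
    assert result_holder.get("status") == 200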

The flakiness of these types of tests can sometimes be hard to reproduce as the developer might have introduced a generous time delay so that even running the test thousands of times does not result in a failure. It may be that it only fails under special conditions, for example when there are big network delays or a server takes an unusually long time to respond [2]. However, by analyzing the source code there are ways to detect the possibility that this might occur under certain circumstances.

Async Wait can be somewhat similar to the Concurrency category as both are related to synchronization issues, but it is important to establish the difference between them. Flaky tests related to concurrency deal only with local synchronization problems, as opposed to Async Wait which is related to remote calls to external resources, such as servers [6].

2.6

Python

Python is a programming language that was first released in 1991. It is currently one of the most popular languages, largely because of its simplicity and readability [19].

Testing frameworks

There are several different frameworks available for testing in Python. PyUnit is the default unit testing framework that is included in Python which is fairly simple to start using. Pytest is another popular framework which is known for being highly flexible by having access to a large number of plugins.

Selenium

Selenium is a framework made for testing web applications specifically that is available for a number of languages, Python being one of them. The core component is the Selenium Webdriver which sends commands to the web browser and retrieves the results [5]. This allows for using Python programs to make requests for web pages via the browser which can be used to test that components and elements are loaded correctly.
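A minimal sketch of how a Selenium-based test might request a page and wait for an element before asserting could look as follows. The URL and element id are hypothetical, and the example assumes a WebDriver such as Chrome is installed.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com/login")  # hypothetical page

# Explicit wait: block for at most 10 seconds until the element is present,
# instead of sleeping for a fixed amount of time.
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "login-button"))
)
assert element.is_displayed()
driver.quit()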


Asynchronous program concepts

The simplest form of waiting in the Python language is done using the sleep() function from the time module. A call to this function suspends the current thread for a number of seconds, after which the execution resumes.

The "asyncio" API was added to Python in version 3.4, which introduced new features and syntax for handling asynchronous code. Asyncio is based on an event loop which exe-cutes and manages tasks running concurrently in the application. A "coroutine" is a Python function that is meant to be run asynchronously. To define a function as a coroutine the "async" keyword, as provided by the asyncio library, can be used. Calling a coroutine us-ing the "await" keyword, ensures that the function can finish before continuus-ing executus-ing the next line, while the event loop can still run other tasks concurrently. Inside coroutines the asyncio.sleep() function is often preferred over time.sleep() as the latter will block the entire execution of the current thread [4] while the asyncio.sleep() will allow other tasks to run while waiting [3].

2.7

Version control and Github

Github [8] is a hosting service for version control based on Git, used to manage and keep track of changes in the source code of projects. The main uses of version control software are that teams of developers can collaborate and coordinate their work on a joint repository, and that it is possible to revert to older versions of the software since all changes are saved. When a change has been made to the code it can be committed with a message describing the change, and then pushed to the common repository. Many Github projects are open source, meaning their code can be accessed and viewed by anyone.

2.8

Program analysis

When analyzing software there are different approaches, usually divided into static or dynamic analysis. In relation to flaky tests this describes how a tool would find them and what information it would have access to.

Static

Static program analysis is the concept of analyzing a program without executing it; the analysis is instead based only on the source code. This technique is commonly used by automated tools and can lead to software flaws and vulnerabilities being detected before the program is run. Using this approach compared to analyzing the program dynamically during runtime has both advantages and drawbacks. One benefit is of course that there is no need to compile and build the code, and the analysis also does not depend on the paths the program execution takes, meaning it should lead to a consistent result. The main disadvantage is that it limits the amount of information compared to what could be collected if the program were run. However, using the static approach means that the possibility of an error, even one with very small probability, can be recognized without it needing to occur.

Dynamic

Using dynamic program analysis means the program is actually executed and the analysis is done during runtime, but as each execution may differ it can be more complex [9].

Code review

While the concept of static program analysis is usually related to the use of automated software, this can also be done manually by humans, often referred to as code review. As this is a very time consuming activity it is often not the preferred approach, although manual inspection can have benefits such as being able to see the code under review in a wider context [1].


3

Related work

Although flaky tests are a relatively recent topic of research, there have been several studies regarding flaky tests and their causes. This chapter describes some of these works and how they have inspired the methods used in this report.

3.1

Categorization of flaky tests

Luo et al. [14] were among the first to make a large-scale study on flaky tests. They collected a large set of tests from projects, most of which were written in the Java language, that were likely to be flaky and presented their findings about their prevalence and causes. They also presented the first classification of flaky tests into the ten categories listed in section 2.4. To find flaky tests to investigate they searched through commit messages in a version control repository of the Apache Software Foundation, which contains a large number of known projects. To assess which commits were fixes for flakiness they searched for the two words "intermit" and "flak" (to include all inflections of the words). The commits found using this method were filtered by manually analyzing them and deciding if they were in fact fixes for flaky tests. A sample of 201 of these were then further analyzed to find the root cause of the flakiness, where each was put into one of the categories mentioned. To increase the credibility of the manual categorization it was done by two of the authors, comparing their results and reevaluating the differing commits. The method for finding and categorizing flaky tests presented by Luo et al. is the main source of inspiration for the method of finding the test data for this report.

Prevalence of Async Wait

The results given by Luo et al. showed that the most common source of flakiness was Async Wait, with 45% of all tests belonging to this category. Further, they found that 34% of the Async Wait flaky tests use a time delay of some sort to have events be carried out in a certain order. In these cases, reducing the time delay would increase the number of failures as the external call has less time to complete. In the remainder there was no attempt at using delays to enforce an order of the events.

Eck et al. [6], Gruber et al. [10] and Sjöbom [17] each presented statistics on the prevalence of Async Wait as part of their results, as shown in table 3.1.


Authors               Async Wait    Total
Luo et al. [14]       74 (45%)      161
Eck et al. [6]        52 (22%)      234
Gruber et al. [10]    3 (3%)        100
Sjöbom [17]           35 (31%)      113

Table 3.1: Async Wait flaky tests found in studies categorizing flaky tests.

3.2

Resolving flaky tests

Luo et al. [14] presented their findings on how tests in each respective category are commonly fixed by developers. For Async Wait they found that the most common fix was to introduce what they refer to as a "waitFor" call, which blocks the current thread indefinitely or until a maximum time is reached, while waiting for an event to complete. The second most common method used to fix the flakiness was to add a "sleep" call that suspends the thread for a predetermined amount of time. This does not fully fix the flakiness, but it does reduce the frequency of failed runs by an amount depending on the length of the added delay. A third, much less common, way of resolving the flakiness was when a developer had changed the order of the code such that the execution time of a rearranged part of the code works as a delay instead of suspending the thread.

Eck et al. [6] made a study on developers' views on problems related to flaky tests. They had the developers rate the effort that was needed to fix previously flaky tests on a Likert scale from 1 to 5. Async Wait was given a fixing effort of 3.0 on average, which while not the highest rated category still shows that debugging these types of flaky tests can require considerable work. When investigating the most common fix for Async Wait, their results were similar to those of Luo et al. [14] in the sense that adding a "waitFor" condition was the most common method used, with 86% of all cases in this category being fixed with this addition. The second most used method was found to be to reorder the execution of threads, and the rarest was to simply disable the test.

Since neither of the studies above dealt with the Python language specifically, this is the intention of this report. The aim is to see if the various ways of handling waiting differ given the characteristics of the Python language, such as the concepts described in section 2.6, as well as to provide a more in-depth view and statistics regarding the different methods that are used by developers to fix flaky tests in this category.

3.3

Flakiness in Python applications

While most studies on flakiness so far have covered Java applications, Gruber et al. [10] did an investigation of flaky tests in the Python language. In order to find a set of flaky tests, they collected a large number of test cases from projects found in the Python Package Index, which is a software repository for Python applications. They then ran them multiple times to observe any inconsistent outcomes. By doing so they found that 0.86% of all the tests they ran showed flaky behavior.
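As a simplified illustration of the general rerun approach (this is a hedged sketch and not the tooling used by Gruber et al.; the test identifier is hypothetical), a single test could be rerun and its outcomes counted as follows:

import subprocess

def rerun_test(test_id, runs=10):
    """Run a single pytest test several times and count pass/fail outcomes."""
    outcomes = {"passed": 0, "failed": 0}
    for _ in range(runs):
        result = subprocess.run(["pytest", "-q", test_id])
        if result.returncode == 0:
            outcomes["passed"] += 1
        else:
            outcomes["failed"] += 1
    return outcomes

# A test is a candidate for flakiness if both counters end up non-zero.
# print(rerun_test("tests/test_example.py::test_fetch"))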

For each project they ran the tests in both the same and random order to conclude if the flakiness was due to test order dependency. They found that 59% of all flaky tests could be fit into the Test Order Dependency category, which is a very large share compared to the results of similar studies in other languages. The authors also included one more category specific to their work, which they named "infrastructure flakiness". This category contains tests that are flaky due to external issues in the machine executing them. In order to detect these, the test runs were distributed over different machines at different points in time to be able to find patterns of tests only showing flakiness on certain systems. This was found to be the second largest category after Order Dependency with 28% of all tests.

Out of the remaining 13% that were neither flaky due to test order dependency nor infrastructure, the authors manually classified a sample of 100 tests into the categories described in section 2.4, as well as 4 additional less common categories that Eck et al. [6] used in their analysis of flaky tests. Surprisingly, their results differ considerably from other similar studies, as they classified 79% of the tests as belonging to either the Network or Randomness category while only 3% were labeled as Async Wait.

Sjöbom [17] also studied flakiness in Python projects, but did so by collecting test cases through commit parsing using the same method as Luo et al. [14]. His findings were that 31% of the flaky tests were found in the Async Wait category. While most works done in this area seem to indicate that Async Wait is one of the most prevalent causes of flakiness [14, 6, 17], the fact that Gruber et al. only found 3% to be in this category is an interesting point of information that will be discussed further in this report.

While Sjöbom presents general information about all the categories and how frequently occurring they are, he does not go into detail on the different ways in which Async Wait flakiness specifically can manifest itself in the Python language. He also does not include information on the different methods used for fixing these tests, which is covered in this report.

3.4

Static analysis of the test code

Lam et al. [13] conducted a study on at what point in the development cycle a test becomes flaky and the underlying reasons why. They ran their experiment on 245 flaky tests from 55 Java projects. For each test they iterated through the commit history, analyzing the earliest point at which it became flaky. They found that 184 of the 245 tests (75%) were flaky when they were first written. This shows the importance of being able to detect flaky tests early, possibly before even running them.

While many tools such as [11, 2, 12] focus on analyzing test flakiness by running the tests, there have been a few studies on techniques for statically analyzing source code to detect flaky tests. One example is the work by Pinto et al. [15] who explore the possibility of detecting flaky tests based on their vocabulary, i.e. frequently occurring words in the test code. They collected test cases from Java projects and reran them to find a set of flaky tests. These were parsed for commonly occurring words in method and variable names to find a vocabulary that indicates flakiness. They used 80% of the tests for training and the remaining 20% for validation, and trying different machine learning techniques they tested if it is possible to tell whether a test is flaky based only on similarities in the words used.
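As a rough, hedged sketch of how such a vocabulary-based approach could be set up with scikit-learn (this is only an illustration; the toy data and model choice are made up and are not the setup used by Pinto et al.):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy data: words extracted from test code, with a label saying
# whether the test was observed to be flaky (1) or not (0).
documents = [
    "sleep wait server response assert timeout",
    "assert equals sum list append",
    "thread sleep connection retry poll",
    "parse string format assert upper",
    "wait until element loaded click assert",
    "sort list compare assert reverse",
]
labels = [1, 0, 1, 0, 1, 0]

# Turn each test's vocabulary into word-count features.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(documents)

# 80% of the tests for training, 20% for validation, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_test, y_test))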


4

Method

This chapter covers the methods used throughout this work: how the data was collected and how it was analyzed to obtain the results.

4.1

Creating a set of flaky tests

As the publicly available datasets containing flaky tests in the Async Wait category were lacking, the data needed to be expanded by manually finding flaky tests in this category. The following section describes the method used for acquiring the set of tests on which the analysis is based.

Searching for Async Wait flaky tests

In order to find a set of flaky tests to analyze in the Async Wait category, the method used was inspired by that of Luo et al. [14], where commits in open source repositories are searched for certain keywords. Unlike Luo et al., projects and commits were searched and accessed through Github as it is the largest host of open source projects. The Github search function was used to look for keywords that indicate flakiness; specifically, the words "flaky", "flakiness" and "intermittent" were searched for in some of the biggest Python projects, listed in table 4.1. The full statistics on the number of flaky commits found in each project and how many were manually classified as Async Wait are seen in table 4.2.

The projects were chosen partly from Github's listings of trending projects and partly from Sjöbom's [17] list of Python projects already known to contain flaky commits in the category of interest for this report. A total of 583 commits were searched for Async Wait flaky tests. This resulted in 58 such cases, distributed over 11 projects, as seen in table 4.2.
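As a hedged sketch of how a similar keyword search over commit messages could be scripted, the example below runs git log against a locally cloned repository; this is an alternative to the Github web search that was actually used here, and the repository path is a placeholder.

import subprocess

KEYWORDS = ["flaky", "flakiness", "intermittent"]

def find_flaky_commits(repo_path):
    """Return commit hashes whose messages contain any of the keywords."""
    hashes = set()
    for word in KEYWORDS:
        # -i: case insensitive, --grep: match against the commit message.
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "-i", "--grep", word, "--format=%H"],
            capture_output=True, text=True, check=True,
        ).stdout
        hashes.update(line for line in out.splitlines() if line)
    return sorted(hashes)

if __name__ == "__main__":
    for commit in find_flaky_commits("."):
        print(commit)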

The Github search feature

The Github search feature does not currently support searching among all commits across all projects of a specific language, but it is possible to search for open or closed issues of a single language. This was utilized in order to find a broader variety of projects of different sizes by searching them for issues containing the same keywords as mentioned above. Github issues commonly do not reference a commit, which made it harder to find the reason for the flakiness, but using this did expand the set of flaky tests by also including smaller and less known projects.

Project name          Description
Ansible               Configuration management and application deployment
Django                Framework for developing web applications
CPython               The Python programming language
Home-assistant        Home automation software
Integrations-core     Core of Datadog, a cloud monitoring service
Pandas                Data analysis tool
Raiden                Blockchain service
Salt                  Automation and configuration management software
Sentry                Application monitoring
Tornado               Web server and web application framework
YouCompleteMe         Code completion engine
Compose               Tool for running multi-container applications
Matplotlib            Plotting library
HexmeetAVC_windows    Video conference systems
aiohttp               Asynchronous HTTP client/server framework
Inspirehep            Information platform for High Energy Physics
Piripherals           Tool for interacting with the Raspberry Pi
Vault-dev             Tool for running the Vault server in development mode
Discovery-client      Python client for the Discovery service

Table 4.1: Projects containing flaky commits.

Some projects utilized the plugin pytest-rerunfailures, where known flaky tests can be marked as flaky and rerun a set number of times, as in the following example.

@pytest.mark.flaky(rerun=1, rerun_delay=2)
def test_click_reserve_meeting():
    sleep(3)
    reserve_meeting.clear_reserved_meeting()
    sleep(4)
    reserve_meeting.reserve_meeting_from_panel()

Listing 4.1: Example of a test from the HexmeetAVC_windows project marked as flaky.

This marked test was found by searching the source code of Github repositories for "@pytest.mark.flaky". As it was not fixed and the reason was not given, the cause of flakiness had to be investigated manually. In the given example the cause can be tracked down to Async Wait by noting the sleep calls and then looking further into the functions called before them. The five projects Compose, Matplotlib, HexmeetAVC_windows, aiohttp and Inspirehep, containing a total of 6 Async Wait tests, were found and added to the set using these techniques.

Additional data from test set

The data set made available by Gruber et al. [10] contained only 3 test cases in the Async Wait category. These are also included in the analysis and come from the 3 smaller projects Piripherals, Vault-dev and Discovery-client listed with a description in table 4.1.


Categorization

After a test had been identified as flaky by the information in the commit, issue or code, it had to be manually categorized. As Async Wait is the category of interest in this report, tests that did not seem to be related to it were disregarded when using this method. A total of 583 commits were found using the keywords "flaky", "flakiness" and "intermittent" as described earlier. These were then searched for test cases that could be classified into the Async Wait category. However, it should be noted that the goal was not to categorize every test found, but rather to find a large variety of tests spread over many different projects, meaning cases that did not show a clear indication of belonging to Async Wait were not analyzed more thoroughly.

Table 4.2 shows how many commits were found when using the keywords stated above in each project. It also shows how many of the tests associated with these commits were manually classified as belonging to Async Wait.

Project name        Commits with flaky keywords    Async Wait flaky commits
Ansible             13                             2
Django              12                             10
CPython             75                             7
Home-assistant      76                             8
Integrations-core   37                             3
Pandas              18                             1
Raiden              78                             7
Salt                209                            8
Sentry              47                             4
Tornado             16                             7
YouCompleteMe       2                              1

Table 4.2: The number of commits with keywords indicating flakiness and how many were Async Wait.

To determine if the flakiness of a test case seemed likely to have its origin in the Async Wait category, the test was observed and searched for a number of properties. Mainly, the presence of any form of waiting or thread suspension is a good sign to go on [10]. The simple example is calls to time.sleep(), but many other forms of waiting functions from third party modules exist. These include the keywords "wait", "time" and "sleep", and this also goes for user defined functions with waiting functionality. Looking for these words in the code is a good starting point for identifying Async Wait flakiness. Another trait of this category is the asynchronous call to external resources. This can be harder to identify as it is seldom called directly in the test function, or it might not be obvious that the call is asynchronous. Finding these includes looking for URLs in the test code or keywords such as "connection" or "server". The commits that claim to fix the flakiness generally contain useful information for categorization. If the fix involves increasing the time of a sleep call, such as in the example in figure 4.1, it is highly likely that it is related to Async Wait.

The tests that did not have a commit with a fix associated with them were generally harder to classify and required a deeper investigation of the code, by looking into how the function calls work internally and where response delays could cause flakiness.

The complete list of Async Wait flaky tests

Using the stated methods, the following 19 projects with a total of 64 flaky tests in the Async Wait category were found. These are listed in table 4.3 below.


Project name          Async Wait flaky tests
Ansible               2
Django                8
CPython               7
Home-assistant        8
Integrations-core     3
Pandas                1
Raiden                7
Salt                  8
Sentry                4
Tornado               7
YouCompleteMe         1
Compose               1
Matplotlib            1
HexmeetAVC_windows    1
aiohttp               1
Inspirehep            2
Piripherals           1
Vault-dev             1
Discovery-client      1
Total                 64

Table 4.3: The number of Async Wait flaky tests in each analyzed project.

4.2

Analysis

This part describes how the analysis of tests and projects was carried out.

Analysis of test characteristics

The two main sources of information for analyzing each case were the test code itself and the changes in the commit from which it was discovered. For each test in the set, the type of asynchronous call made was investigated, for example if it was a request for information from a server or something more complex. Secondly, if present, the type of waiting implemented was analyzed, whether a simple sleep call or other thread-suspending operations.

Some commits contained additional information in the commit message explaining reasons for the flaky behavior. This was taken into account when investigating the causes and what information the developer used to identify the calls leading to flakiness.

Waiting operations

As suggested by the name, a key component of Async Wait flakiness is how waiting operations are utilized to try to enforce the order of events. To get an overview of which methods of waiting are present in these flaky tests in Python, they were each manually analyzed and classified into the following categories:

• sleep: The common example being time.sleep(), which completely suspends the current thread, but also other sleep calls such as the sleep call from the asyncio package which waits a set number of seconds but still allows other tasks to be run concurrently.

• waitFor: This covers calls that actively wait until the asynchronous call is complete or a timeout is reached.

• No waiting: No time managing operations are seen in the test code.

Finding the sleep, wait and other thread-suspending calls could often be done by looking at the names of the API calls, involving the words "time", "sleep", "wait" or similar. In some cases the API call was not made directly in the test function but instead in a user defined function, but the names of these often gave a clue that they contained calls related to waiting.
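A rough sketch of how such a name-based search could be automated with Python's ast module is shown below; the keyword list and the notion of what counts as a waiting call are simplifications of the manual analysis described above.

import ast

WAIT_WORDS = ("sleep", "wait", "time", "timeout")

def find_waiting_calls(source):
    """Return the names of called functions whose name hints at waiting."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Cover both plain calls like sleep(...) and attribute calls
            # like time.sleep(...) or self.browser.wait_until(...).
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if any(word in name.lower() for word in WAIT_WORDS):
                hits.append(name)
    return hits

example = "import time\ndef test_x():\n    time.sleep(1)\n    assert True\n"
print(find_waiting_calls(example))  # ['sleep']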

Flaky test fixes

To find the most common ways of resolving the flakiness, each commit was also manually analyzed for fixes. The goal was to categorize the fixes based on how they modify the waiting operations described above, for example if they increase time values or introduce new calls.

In some cases the commit only contained a simple change directly related to fixing the flakiness, like the modification of a value. In other cases they contained a larger number of changes, sometimes fixing several unrelated bugs in the same commit which meant it was sometimes more difficult to find the change which actually resolved the flakiness.

A simple example of an Async Wait flaky test commit involving a fix is seen in figure 4.1. The code line in red represents the removed code and the green line shows the new addition.

Figure 4.1: Example of a commit from the integrations-core project that claims to fix flakiness by increasing the time sleeping from 1 to 1.1 seconds.

Asynchronous operations

To get an understanding of the types of asynchronous activity that are present in this category, each test case was inspected for asynchronous calls that are likely to be the cause of the flakiness. In some cases the commit message explicitly pointed out the call that is the cause of flakiness. If that was not the case, then the code was manually analyzed to find the call in question. Given that Async Wait flakiness comes from the failure to wait properly for a call, the asynchronous call will be placed before any waiting operation in the code. It can be difficult knowing only from the syntax which line starts such a call, but the presence of a URL can give good indications. It will then have to be more thoroughly examined to find if it is in fact able to run concurrently and if a delay of it could alter the resulting outcome of the test.

Project specific flakiness causes

With the intention of finding out if flakiness can be detected from test code before being fixed, the reasons for flakiness specific to certain projects needed to be investigated. The goal was to see if flakiness in the Async Wait category is easier to detect in some projects than in others. This information is considered useful as it may potentially be used to detect flakiness in other tests in the same project.


The projects containing 7 or more Async Wait flaky tests were analyzed by comparing the causes and potential fixes seen in the code and commits. The aim was to find similarities in the modules and functions used between tests in the same project. For each of these projects the type of waiting in the flaky tests as well as the fixes were compared, to see if conclusions could be drawn on whether similarities existed because of the nature of the project or because developers handle the problems differently.


5

Results

5.1

Waiting operations

All test cases were categorized based on the type of waiting operation used in the flaky state, before any fixes were applied, as described in section 4.2. Only waiting that actually contributed to the flaky behavior is counted here.

Waiting category Number of tests

Sleep 15

waitFor 12

Timeout 11

No waiting 26

Table 5.1: Tests categorized on the type of waiting before fixes.

It was found that 41% of the Async Wait tests did not utilize any form of waiting or timeout while in the flaky state. One example of such a test can be seen in listing 5.1, where a request for a URL is made on line 3 utilizing the Selenium framework. The code then instructs the driver to click on two links; judging only by their names it is likely that the first links to a new URL and the second goes back to "full_url". On line 8 there is an assert testing if the current URL matches the one requested, which evaluates whether the actions were executed correctly and the browser returned to the original URL successfully. Since these actions are requests to a remote server, the time it takes to complete a request will vary from time to time depending on external factors. A delay in one of the actions could result in the assert statement returning a failure, which is the cause of the flaky behavior reported in the commit associated with this test case.

1 full_url = '%s%s' % (self.live_server_url, url)
2 self.admin_login(username='super', password='secret', login_url='/test_admin/admin/')
3 self.selenium.get(full_url)
4
5 self.selenium.find_element_by_class_name('deletelink').click()
6 self.selenium.find_element_by_class_name('cancel-link').click()
7
8 self.assertEqual(self.selenium.current_url, full_url)

Listing 5.1: Parts of a flaky test from the Django project where no waiting occurs initially.

sleep

The most common type of waiting was using a sleep call. time.sleep() was the most common API call, used in 11 out of the 15 test cases seen in table 5.1. The remaining 4 used sleep functions from other libraries, specifically asyncio.sleep(), gevent.sleep() and self.io_loop.call_later(). The following for loop comes from the test case "test_mpr121_irq" in the Piripherals repository. The function call irq(), seen on lines 3 and 7 in listing 5.2, is asynchronous and the developer implementing the test has added a time.sleep() call of 10 ms to suspend the thread and allow the irq() call to finish. As the test was reported as flaky, this time limit was not always enough, and the asserts on lines 5 and 9 will fail if this is the case. This however was found through execution of the test and is not apparent from looking only at the code.

1 for i in range(13):
2     dev.write_word(0, 1 << i)
3     irq()
4     sleep(0.01)
5     handlers[i].assert_called_once_with(True, i)
6     dev.write_word(0, 0)
7     irq()
8     sleep(0.01)
9     handlers[i].assert_called_with(False, i)

Listing 5.2: Part of a flaky test from the Piripherals project which uses time.sleep().

waitFor

12 of the 64 tests used a waitFor function of some sort in their flaky state. Luo et al. [14] found that implementing a waitFor condition in a test is a common way to fix flakiness, but this shows that waitFor is also commonly occurring before the fix is applied. In many of these cases the waitFor was used incorrectly, in the sense that the event it waited for was not enough to guarantee a consistent outcome. One such example is seen in listing 5.3 below, where the click() function, which runs asynchronously, is called on line 1. In an attempt to wait for this to finish the wait_until() function is run right after, but as the test is reported as flaky this does not fully prevent a non-deterministic outcome of the assert. This was later fixed in a commit by adjusting the waitFor condition to fully load the page before asserting. The commit for this fix can be seen in figure 5.8.

1 self.browser.click('.new-project-submit')
2 self.browser.wait_until(title='Java')
3
4 project = Project.objects.get(organization=self.org)
5 assert project.name == 'Java'


timeout

The least common category of waiting was found to be timeouts, which appeared in 11 of the 64 flaky tests. The following example testVRFY is taken from a commit from the CPython repository which describes that the flakiness in this case is due to the timeout of 3 seconds as specified on line 2. If the smtp.vrfy(email) call on line 6 takes more than 3 seconds it will time out and the assert will fail.

1 def testVRFY(self):
2     smtp = smtplib.SMTP(HOST, PORT, local_hostname='localhost', timeout=3)
3
4     for email, name in sim_users.items():
5         expected_known = (250, '%s %s' % (name, smtplib.quoteaddr(email)))
6         self.assertEqual(smtp.vrfy(email), expected_known)

Listing 5.4: Part of a flaky test from the CPython repository which uses a timeout.

5.2

Flaky test fixes

While section 5.1 categorized the tests based on the type of waiting in the flaky state before any fixes, this section focuses on categorizing the fixes themselves and how the waiting is modified. 58 of the 64 Async Wait flaky tests had a commit where the flakiness had been resolved or reduced. The remaining 6 were simply marked as flaky, disabled or fully removed. The commits were categorized on how they changed the waiting operations and the asynchronous calls in the flaky state of the test. The results are seen in table 5.2.

Fix category Number of tests

Introduce sleep 5

Increase sleep time 5

Introduce waitFor 19

Modify waitFor 10

Increase timeout 15

Other 4

Not fixed 6

Table 5.2: Tests categorized on the type of fix.

Introduce sleep

Adding a completely new sleep call was a relatively rare fix as it only occurred in 5 of the cases. Sleeping for a set amount of time does not completely eliminate any risk of flakiness but it does make it less likely to occur [14]. If the time slept is long enough it could be considered a consistent fix, but suspending the thread does increase the test case execution time.

Figure 5.1 shows a commit from the CPython project which claims to reduce flakiness in a test that verifies that sending emails works as intended. The flakiness here is caused by the smtp.sendmail() function, which runs asynchronously, and the risk of failures is reduced by making the current thread sleep for 10 ms in the hope that the call in question has time to complete.


Figure 5.1: Example of a commit from the CPython repository that reduces flakiness by introducing a time.sleep call.

Increase sleep time

Increasing the time value of an already existing sleep call was the fix used to resolve the flakiness in 5 of the test cases. An example of this is seen in figure 5.2, where flakiness is fixed by doubling the sleep time when waiting for a DNS cache table entry to expire.

Figure 5.2: Example of a commit from the Salt project that reduces flakiness by increasing a sleep call.

Introduce waitFor

Introducing a waitFor condition was found to be the most common fix for flakiness in this category, as this was the case in 19 of the tests. In contrast to adding a static sleep call, this method completely resolves the flakiness since the call is given enough time to complete. The exception is if the call takes an excessively long time to complete or does not respond at all, where a maximum time limit may wake the thread from its suspension. In that case a timeout could be reached, depending on the specific waitFor condition and how it is implemented.
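A minimal sketch of what a generic waitFor helper with a timeout could look like is shown below; this is only an illustration of the concept and not code from any of the studied projects, and the transfer object in the usage comment is hypothetical.

import time

def wait_for(condition, timeout=10, interval=0.1):
    """Poll condition() until it returns a truthy value or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %s seconds" % timeout)

# Usage: block until the (hypothetical) transfer is reported as complete,
# instead of sleeping for a fixed amount of time.
# wait_for(lambda: transfer.is_complete(), timeout=30)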

The commit from figure 5.3 below is taken from the Raiden repository. The test initiates a message transfer and waits for it to complete. In the original flaky implementation this is done by making a sleep call for 1 second. To resolve the flakiness this was exchanged with a waitFor function which suspends the thread and wakes it when the transfer is completed.

Figure 5.3: Example of a commit from the Raiden project that fixes flakiness by changing a sleep into a waitFor condition.

Another, simpler type of waitFor condition present in some of the cases was one the developer had implemented in the form of a while loop that breaks when the result is returned. Figure 5.4 shows a commit from a test in the YouCompleteMe project where flaky behavior is caused by a server not starting in time. A sleep of 2 seconds had been used to reduce the flakiness, but this was changed into a waitFor in the form of a while loop running indefinitely, checking every 0.2 seconds whether the server is running. When the server is confirmed to be running, it breaks out of the loop and the program execution can continue.

This approach is less efficient compared to the example in figure 5.3 in the sense that the thread is not fully suspended for the entire duration of the wait; instead it needs to poll for the result frequently. It also lacks timeout functionality, which means that if the server never starts the test would get stuck and never complete unless a maximum test time is enforced when running the test suite. So while this approach can be considered a fix for the cases where the server startup time is longer than 2 seconds, it does not address the case where the server does not start at all. It does, however, solve the issue of always having to wait a minimum of 2 seconds even when the startup time is much shorter.

Figure 5.4: A commit from the YouCompleteMe project that introduces a waitFor in the form of a while loop.
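A minimal sketch of such a polling waitFor, with hypothetical server and client fixtures; it mirrors the structure described above rather than the actual YouCompleteMe code.

import time


def wait_until_server_ready(server, interval=0.2):
    """Hypothetical polling waitFor in the style of figure 5.4: loop until the
    server reports that it is running, checking every 0.2 seconds. Note the
    absence of a timeout; a server that never starts would hang the test."""
    while not server.is_running():
        time.sleep(interval)


def test_completion_after_server_start(server, client):  # hypothetical fixtures
    server.start()
    wait_until_server_ready(server)   # previously: time.sleep(2)
    assert client.complete("foo")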

Modify waitFor

As described in section 5.1 there were cases where a waitFor condition was already implemented in the before-fix state of the test but did not negate all flaky behavior. In these cases some fixes involved modifying the condition in some way, either by exchanging it for a different one or by adjusting the usage.

The commit in figure 5.5 is taken from the Sentry project and uses Selenium, which sends an asynchronous request for a page. In the flaky state of the test it calls a function that waits until elements of the class "entries" are loaded. As this was reported as flaky, the fix involved adding a second wait_until() function that waits for the specific element with the id "linked-issues". The problem here was likely that "entries" was discovered before the specific element was loaded.
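The following sketch approximates this fix using Selenium's standard explicit-wait API instead of Sentry's own wait_until() helper; the driver and issue_url fixtures are assumed.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def test_linked_issues_are_shown(driver, issue_url):  # hypothetical fixtures
    driver.get(issue_url)
    wait = WebDriverWait(driver, 10)
    # Original wait: the container class can appear before its children load.
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "entries")))
    # Fix: also wait for the specific element the test asserts on.
    wait.until(EC.presence_of_element_located((By.ID, "linked-issues")))
    assert driver.find_element(By.ID, "linked-issues").is_displayed()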

Figure 5.5: A commit from the Sentry project that modifies the waitFor to fix flakiness.

Increase timeout

In a number of the cases the fix was done by increasing a timeout value. The timeout is usually applied to some form of waitFor call and could therefore be classified as "Modify waitFor", but since this was a commonly occurring fix it was categorized on its own. In figure 5.6 the flakiness is reduced by changing the timeout from 20 to 40 seconds when waiting for a state change.

Figure 5.6: A commit from the Raiden project that increases the timeout to fix flakiness.
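As an illustration, the following is a hypothetical waitFor helper with a configurable timeout, where the fix corresponds to raising the timeout value; none of the names are taken from the Raiden code.

import time


def wait_for(condition, timeout=40.0, interval=0.5):
    """Hypothetical waitFor helper with a timeout. The fix in figure 5.6
    corresponds to raising the timeout argument (here from 20 to 40 seconds)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError("condition was not met within the timeout")


# Usage in a test, with a hypothetical 'node' fixture:
# wait_for(lambda: node.state == "settled", timeout=40)   # previously timeout=20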

Other fixes and marked tests

Four of the fixes were very specific to the project and could not be categorized with the rest. Six flaky tests were not fixed and were only marked as flaky or had by other means been reported as showing flaky behavior.

5.3 Asynchronous calls

A wide assortment of different types of asynchronous calls was found in the projects analyzed. Finding which call was responsible for the flakiness was in some cases rather difficult, especially in the cases where no fix was supplied. If a call was clearly requesting a specific URL or content from a server, it is likely that it will run asynchronously. This was often not the case, however, and looking at the syntax of the call itself would not be enough to deduce that it is the one responsible. In these cases it was more important to look at the test in a wider context, seeing what is asserted and where a delay could result in problems. Looking back at the example from listing 5.2, there is no call that inherently can be proven to be asynchronous without further information, but knowing that this is a test for communication with the Raspberry Pi (a single-board computer) and looking into the inner workings of this module, one can see that the dev.write_word() function calls one of the handlers and the irq() function generates an interrupt, and this can be deduced to be the asynchronous part responsible for the flakiness.

In the tests utilizing the Selenium web driver the code was usually easily readable and finding the responsible asynchronous call was clearer.


5.4 Project specific flaky causes

When looking at an individual project there were a lot of similarities, both in the causes of flakiness and in the methods used to fix the flaky tests. In the Django project, for example, 7 out of 8 flaky tests were related to Selenium, and in most of these cases there was no waiting present initially and the click() function was responsible for the asynchronous call taking too much time. Most tests in this project were very similar to the example commit seen in figure 5.7. The click() call instructs the web browser to click on a link in a page, causing a new one to start loading. Asserts are then made to make sure the elements of interest are found on the page, and flakiness shows when an assert is performed before the page has loaded the elements. All of the 7 tests were fixed by introducing a waitFor call that makes sure the page is loaded, such as in the example below. No sleep calls were used, likely because Selenium has a wide variety of easy-to-use waitFor conditions available.
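A hedged sketch of this recurring pattern, written with Selenium's standard explicit-wait API rather than Django's own test helpers; the link text and element id are illustrative.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def test_follow_link(driver):  # 'driver' is an assumed Selenium WebDriver fixture
    # click() triggers an asynchronous page load.
    driver.find_element(By.LINK_TEXT, "History").click()
    # Fix: wait for an element on the new page instead of asserting immediately.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "change-history"))
    )
    assert "History" in driver.title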

Figure 5.7: How the common case flaky test looks in Django and how it is fixed.

The Sentry project also uses Selenium for testing, but unlike Django all the flaky tests from this project already had a waitFor implemented and the fix was to modify this instead.

The Home-assistant code contains a number of functions to deal with timings and concurrency in the project. One example that was found in most of the Async Wait flaky tests was the block_till_done() function, which uses the asyncio module to block the thread until all other concurrent tasks have been completed, meaning it would be classified as a waitFor function. Introducing or modifying one of these functions was the solution to fix flakiness in all but one of the tests in this specific project.
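The following is a simplified, hypothetical illustration of what such a waitFor helper can look like when built on asyncio; it is not Home-assistant's actual implementation, and the bus fixture in the usage example is assumed.

import asyncio


async def block_till_done() -> None:
    """Hypothetical waitFor-style helper: wait until all other tasks on the
    running event loop have finished."""
    current = asyncio.current_task()
    while True:
        pending = [t for t in asyncio.all_tasks() if t is not current and not t.done()]
        if not pending:
            return
        await asyncio.gather(*pending, return_exceptions=True)


async def test_event_is_handled(bus):        # 'bus' is a hypothetical fixture
    bus.async_fire("my_event")               # schedules an asynchronous handler task
    await block_till_done()                  # waitFor instead of asyncio.sleep(...)
    assert "my_event" in bus.handled_events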

In the Raiden project most flaky tests in the category were fixed by increasing a timeout. This was the case in 4 out of the 6 tests, and they all dealt with relatively long timeouts of around 10 to 40 seconds.

Some projects had a much wider variety of flaky test characteristics. The tests found in the CPython project, for example, had little in common both in terms of similarities in function calls and in the fixes applied. The same could be seen in the Salt project, which had very different-looking flaky tests in the Async Wait category.

5.5 Async Wait flaky test characteristics

Knowing whether a flaky test might belong to Async Wait from only looking at the code is in most cases very difficult, but there are certain traits common to tests in this category that can give indications. As seen in table 5.1, 60% of all the tests had some form of waiting implemented already in the flaky state, meaning this is one common characteristic of these tests.

By definition they all contain an asynchronous call of some sort. There was a very large variety in the implementation of these calls, but the most common case found was to request the contents of a web page and then assert whether or not certain elements are present, which was commonly done using Selenium. Among the remaining cases there were few similarities to be found, and oftentimes it required an extensive search of the source code to find which calls were asynchronous and which ones were actually responsible for the flakiness. Listing 5.5 shows parts of a test case reported as flaky where the asynchronous call may not be obvious at first sight, but it turns out that the Person.objects.update_or_create() function in turn updates records in a database, which is the cause of the synchronization issues.

def test_updates_in_transaction(self):

    ...

    def birthday_sleep():
        time.sleep(0.3)
        return date(1940, 10, 10)

    def update_birthday_slowly():
        Person.objects.update_or_create(
            first_name='John', defaults={'birthday': birthday_sleep}
        )

    Person.objects.create(first_name='John', last_name='Lennon',
                          birthday=date(1940, 10, 9))

    ...

Listing 5.5: Part of a flaky test from the Django project where the async part is not clearly visible.

The presence of an asynchronous call in a test is of course no guarantee that it will be able to show flaky outcomes, but it does raise the possibility if its result is used in determining the asserts and it is not waited for properly.

5.6 Deciding on suspension time

When flaky tests in this category are fixed by introducing or increasing a sleep time that suspends the thread, such as the common time.sleep() call, there is always a trade-off between the flakiness reduction and the run time of the test. The title of these commits was sometimes "reducing flakiness" rather than "fixing flakiness", for the simple reason that this method does not completely mitigate the problem. Increasing the sleep by a very large amount would likely remove any flakiness altogether, but at the cost of a slower test suite execution time.

The average increase in waiting time was found to be 103%, indicating that developers often choose to double the time. None of the commit messages explain the reason behind the chosen sleep length. It is reasonable to think that little investigation is done on the developer side when deciding on the time increase, other than trial and error. A more effective approach would be to measure the time of the asynchronous call and calculate the average time taken for it to return a result. This value could work as a basis for how long the thread suspension should be, to reduce flakiness as much as possible while still maintaining an effective run time.
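A minimal sketch of how such a measurement could be collected, where trigger_call and result_ready are hypothetical callables supplied by the test author:

import statistics
import time


def measure_async_call(trigger_call, result_ready, runs=20):
    """Measure how long an asynchronous call typically takes to produce its
    result. 'trigger_call' starts the call and 'result_ready' polls whether
    the result has arrived; both are hypothetical callables."""
    durations = []
    for _ in range(runs):
        start = time.monotonic()
        trigger_call()
        while not result_ready():
            time.sleep(0.01)  # coarse polling used only for the measurement
        durations.append(time.monotonic() - start)
    # A margin above the slowest observed run could serve as the sleep time.
    return statistics.mean(durations), max(durations)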

5.7 Automatic detection

To develop an algorithm that can automatically detect flakiness from only looking at the test code, it would need to be able to recognize two things:


1. Whether the code contains one or more asynchronous calls that will be executed, and whether the results from these are in some way used in an assert.

2. Whether the waiting for the results of these calls is handled correctly, or whether a delay in the response time from a server could change the outcome.

Judging by the test cases analyzed in this work, algorithmically determining whether calls are asynchronous or not can be tricky, as there are few similarities to be found between them. There are certain words such as "request" that occur more frequently in these function names, and the "await" keyword used with the asyncio library was also seen in a number of these cases. If a function accepts a callback function as a parameter, that is also a good indication. Going by these attributes can give a hint but is by no means proof of a function being asynchronous. Instead, finding these calls required looking at the code from a broader view, using common sense and then confirming whether they are in fact run asynchronously by reading the documentation.

When looking at a single project, however, more similarities could be found in the asynchronous calls, as described in section 5.3. By knowing which calls are common in the specific project, these can be searched for in new test code under investigation.

Another approach is to instead look for the waiting calls, as these are more similar and should be easier to find automatically. Of course, these are also used for all kinds of purposes unrelated to synchronization, but they can work as an indication of where in the code the asynchronous calls might be located, as these were in many cases positioned right before the waiting.
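A minimal sketch of how such keyword-based heuristics could be implemented with Python's ast module; the keyword lists are illustrative assumptions rather than a validated rule set.

import ast

# Illustrative keyword lists based on the observations above; these are
# assumptions, not proof of a call being asynchronous or a waiting call.
ASYNC_HINTS = ("request", "fetch", "send", "click", "submit")
WAIT_HINTS = ("sleep", "wait", "until", "block_till_done", "join")


def find_candidate_calls(source: str):
    """Return (possibly_async, waiting) call names found in a test's source."""
    possibly_async, waiting = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Await):
            possibly_async.append("await")  # asyncio usage is a strong hint
        if isinstance(node, ast.Call):
            name = ""
            if isinstance(node.func, ast.Attribute):
                name = node.func.attr
            elif isinstance(node.func, ast.Name):
                name = node.func.id
            lowered = name.lower()
            if any(hint in lowered for hint in ASYNC_HINTS):
                possibly_async.append(name)
            if any(hint in lowered for hint in WAIT_HINTS):
                waiting.append(name)
    return possibly_async, waiting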

For the second part, in order for the algorithm to detect flakiness it would need to know whether these asynchronous calls can potentially cause flakiness, which depends on how the waiting is handled. How effectively this could be detected was found to be very much dependent on the type of waiting implemented.

Sleep

Sleep calls are, as mentioned, a relatively poor method of waiting that can in certain cases lead to flaky outcomes. They would be easy to find automatically, as time.sleep() was by far the most common call; all other alternatives except one also included the word "sleep" in the call name, meaning they can easily be found going only by syntax.

The main problem here is the inability to know if the sleep time is long enough. Even if one has access to very detailed information on how the call operates, the time taken depends on many external factors such as network delays.

While detecting flakiness in these cases would be difficult, judging by the information available it would be possible to detect the potential of a test being flaky if certain conditions are met, namely if the delay times are close to the thread suspension time.

waitFor

If used right, a waitFor condition is, as Luo et al. [14] also describe, one of the best methods of dealing with flakiness in this category, but as mentioned in section 5.1 these conditions can also be present in the flaky state.

Based on the tests collected where a waitFor exists but is used incorrectly in a way that causes flaky behavior, it would be difficult to implement an algorithm able to detect these cases. To do so one would need access to very detailed information on exactly how individual functions work. The commit in figure 5.8 below shows the fix to the flaky test code presented in listing 5.1, illustrating the difficulties in these cases. On the first line of code an asynchronous call is made that submits a new project by clicking on a button, and in the flaky state the browser is instructed by the Selenium web driver to wait for the title to be "Java" as an indication that the submission is complete. This appeared not to be the case as it showed flakiness, and it was fixed by waiting until the new page was fully loaded as seen on
