System Agnostic GUI Testing
Analysis of Augmented Image Recognition Testing
Martin Moberg Joel Amundberg
June 11, 2021
Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Software Engineering. The thesis is equivalent to 10 weeks of full time studies.
Contact Information:
Author(s):
Martin Moberg
E-mail: martin3127@gmail.com
Joel Amundberg
E-mail: joel.amundberg@gmail.com
University advisor:
Dr. Emil Alégroth
Department of Software Engineering
Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57
Abstract
Automated Graphical User Interface (GUI) tests have historically been known to be fragile, thus requiring a lot of maintenance. Document Object Model (DOM) tools and other component based tools solve fragility to an extent, but also have drawbacks.
Accessing components requires knowledge of widget metadata, and in many cases this is not possible due to programmatic limitations. A somewhat novel approach to this problem is Visual GUI Testing (VGT). Instead of components, VGT uses image recognition as its driver. VGT also suffers from fragility issues, but its main benefit is being system agnostic. The VGT tools currently available use scripts; these scripts have been shown to be more fragile than some DOM-based models and to require more maintenance, but they have the potential to offset this downside by improving time and cost efficiency when creating the base test cases.
Therefore, in this paper we evaluate a proof-of-concept plugin called JMScout that we developed for a tool called Scout. Scout uses a technique called Augmented Testing: a visual layer put between the SUT and the tester, through which the tester interacts. The primary goal is to integrate JMScout into Scout and attempt to improve time efficiency in test development and test execution, as well as effectiveness in finding faults. Augmented Testing in combination with VGT has never been studied before. Another major goal was to evaluate the augmented paradigm's potential to improve usability and time efficiency in test development compared to other tools.
To evaluate this plugin, a quasi-experiment was conducted in which effectiveness, time efficiency in test development, and time efficiency in test execution were analyzed. The experiment was conducted against an application called Rachota, into which the authors introduced faults, also known as mutants. Tool performance when attempting to find these faults was documented as the results. These results were analyzed with descriptive statistics using graphs, box plots and tables, in conjunction with more formal statistics using the Kruskal-Wallis H-test and the Wilcoxon Signed Rank Test.
The results indicate that manual testing was more effective at finding faults than JMScout and EyeAutomate. They also suggest that JMScout was the most time efficient in test execution compared to the other tools; however, time efficiency in test development was inconclusive.
In conclusion, this study suggests that JMScout does not solve the fragility issues that EyeAutomate, and possibly other VGT tools, possess, but it provides a way to mitigate the costs of VGT in general as much as possible. Thanks to the Augmented Testing paradigm that Scout provides, JMScout may improve efficiency in test development, and even more probably so as JMScout matures further.
Keywords: Testing, Augmented Testing, Image Recognition, VGT, Visual GUI
Testing
Acknowledgments
First and foremost we would like to express our gratitude to our supervisor Dr. Emil Alégroth for his continued support throughout the thesis, in particular his guidance in the empirical study, his intermediation of the state and value of VGT, and his continuous feedback on our work. We would also like to thank the Rachota project for providing us with an application in which we had the freedom to implement faults. Additionally, we would like to thank Michel Nass and Andreas Bauer for developing Scout.
Contents
Abstract i
Acknowledgments ii
1 Introduction 1
1.1 Background . . . . 1
1.1.1 Motivation . . . . 2
1.1.2 Contribution . . . . 3
1.1.3 Scope . . . . 3
2 Literature Review 5
2.1 Method . . . . 5
2.2 Study . . . . 5
2.3 Summary . . . . 9
3 Methodology 10
3.1 Research questions . . . . 11
3.2 Phase 1 - Literature Analysis . . . . 12
3.2.1 Literature search . . . . 12
3.3 Phase 2 - Design of Study . . . . 12
3.3.1 Empirical strategy . . . . 12
3.3.2 Variables . . . . 12
3.3.3 Scope . . . . 14
3.3.4 Design of JMScout . . . . 14
3.3.5 Tools used . . . . 18
3.4 Phase 3 - Execution of Study . . . . 19
3.4.1 User Stories . . . . 19
3.4.2 Mutants . . . . 19
3.4.3 Experiment method . . . . 21
3.4.4 Results . . . . 23
3.5 Phase 4 - Analysis of Results . . . . 23
3.5.1 Descriptive Statistics . . . . 23
3.5.2 Formal Statistics . . . . 23
3.5.3 Limitations . . . . 24
3.5.4 Outliers . . . . 24
3.6 Alternative approaches . . . . 24
3.7 Replication Pack . . . . 24
4 Results and Analysis 25
4.1 Results . . . . 25
4.1.1 Fault prediction vs Fault found . . . . 25
4.1.2 Development Times . . . . 26
4.1.3 Execution Times . . . . 27
4.2 Analysis . . . . 29
5 Discussion 33
5.1 Observations . . . . 33
5.2 Validity threats . . . . 35
5.2.1 Internal . . . . 35
5.2.2 External . . . . 35
5.2.3 Construct . . . . 35
5.2.4 Reliability . . . . 35
6 Conclusions and Future Work 36
6.1 Conclusion . . . . 36
6.2 Future work . . . . 36
6.2.1 Future development work . . . . 36
6.2.2 Future scientific examination . . . . 37
A Annexes 40
A.0.1 User story template example . . . . 40
List of Figures
1.1 Scout tool using the Selenium plugin . . . . 2
3.1 Methodology steps . . . . 10
3.2 Replacing Selenium Plugin with JMScout . . . . 14
3.3 JMScout testing Rachota . . . . 15
3.4 Plugin Architecture . . . . 17
3.5 User Story 12 - Original to the left vs mutant implemented on the right 20
3.6 Experiment Steps . . . . 21
4.1 Mutant predictions and actual outcome per tool . . . . 25
4.2 Test development Box plot . . . . 26
4.3 Box plot JMScout Manual vs Manual testing with Faults . . . . 27
4.4 Box plot EyeAutomate vs JMScout No faults . . . . 27
4.5 Box plot EyeAutomate vs JMScout Faults . . . . 28
4.6 Test development times . . . . 29
4.7 Test execution times Manual vs JMScout manual . . . . 31
4.8 Test execution times EyeAutomate vs JMScout . . . . 32
5.1 User Story 21 - Rachota exit button . . . . 34
A.1 Box plot components . . . . 42
A.2 Times for test development . . . . 42
A.3 Execution times of tests with Manual vs JMScout against Rachota with faults implemented by the authors. . . . 43
A.4 Execution times of tests with JMScout vs EyeAutomate with no faults implemented by the authors. . . . 43
A.5 Execution times of tests with JMScout vs EyeAutomate against Rachota with faults implemented by the authors. . . . 44
List of Tables
3.1 Confounding variables . . . . 13
3.2 Rachota metadata . . . . 18
3.3 Mutant operators . . . . 20
A.1 User Stories . . . . 41
Abbreviations
AT - Augmented Testing
DOM - Document Object Model
GUI - Graphical User Interface
HTML - Hyper-Text Markup Language
OSS - Open Source Software
R&R - Record & Replay
SUT - System Under Test
VGT - Visual GUI Testing
Chapter 1
Introduction
1.1 Background
Sommerville states that in the software development industry, one of the keystones of product development is software testing [25]. This practice gives the product owner and the development team confidence that the application conforms to the users' needs. In today's industry, high-level tests such as Graphical User Interface (GUI) based tests are mostly performed with manual techniques such as Exploratory Testing [3]. Exploratory methods are often tedious and costly, and if tests are repeated manually, possibly error prone as well [3].
A proposed solution to these stated issues of cost, tedium and error-proneness is Visual GUI Testing (VGT) [2]. VGT is an emerging technique in industry due to its perceived higher flexibility and robustness to certain GUI changes that other GUI automation techniques either do not apply to or are too sensitive to [3].
There are currently multiple tools that attempt to solve the posed issues. One technique among them is Record and Replay (R&R), which relies on replaying a previous user session by recording the user's clicks. This is generally called the first generation [2]. This technique has some limiting factors in regards to maintenance: the scripts are fragile to changes in layout, API and similar factors, which in the worst case can break an entire test suite for first generation testing tools [6]. The second generation of testing tools also utilizes R&R, in a model called structure-aware capture/replay, which instead of recording where a user clicked records what was clicked on [2, 24]. The downside is that it cannot assert how a GUI looks, or whether it is interactable by the user at all, in addition to being limited to a subset of programming languages it can hook into [2].
Keeping all of this in mind, the authors chose to attempt to build a more user friendly system with the aid of image recognition and Augmented Testing. Testing based on image recognition is termed the third generation, or VGT [2]. Augmented Testing is a technique that superimposes relevant information and location data between the tester and the SUT [18]. This helps keep costs down when a test suite needs maintenance, and lowers the barrier of entry to the testing field by reducing the need for a programming background. The image recognition is completely system agnostic, not interacting with any of the System Under Test's (SUT) intrinsic functionality, in contrast to the second generation of testing.
Therefore the authors chose to develop an extension for a prototype tool named Scout (https://github.com/augmented-testing/scout) [19]. Scout is an academic testing tool built in Java, currently having Selenium as its sole core driver. This driver is written as a plugin to Scout, which makes swapping out the core a simple task.
A common test case is written using actions and assertions, which this tool assists with by creating visual markers for where the test path goes next and markers for what is to be examined, noting test steps with missing or altered content. The tool is also able to autonomously run the entire test suite.
Figure 1.1: Scout tool using the Selenium plugin
1.1.1 Motivation
As mentioned in the article "Visual GUI Testing: Automating High-Level Software Testing in Industrial Practice" [2], test automation will be essential in meeting time to market while still providing quality software, by providing a test suite that can verify application functionality continuously without the added costs of manual testing [2, 6]. In the current industry, manual techniques such as Exploratory Testing are among the standards for testing a GUI [13]. This limits the ability to apply testing continuously due to both budgetary and time-based constraints. The other standard for testing a GUI is the second generation of testing, which is well suited for automation but cannot assert how a GUI looks, or even whether a component is interactable [2].
While VGT tools have been created before, they have not been made in the context of Augmented Testing (AT), which is why the tool Scout was chosen [18, 19]. Scout is arguably unique in its testing approach in that it provides an augmented GUI, visualizing test information directly on top of the application undergoing the tests.
The reason that combining the two techniques of AT and VGT is interesting is that it increases the ease of use of automated testing, lessening the need for a programming background to write scripts, since all information and utility is provided through a point-and-click GUI. VGT is also a system agnostic technique, which will greatly improve the applicability of the AT technique beyond its current DOM-based implementation in Scout.
It is unknown and untested whether this technique will have any positive effects on time efficiency or effectiveness in finding faults, which is what this work aims to examine: specifically, effectiveness in finding faults and efficiency in test development and execution.
1.1.2 Contribution
The main contribution of this work is developing a proof-of-concept Augmented Testing tool in combination with the VGT technique. This also means that the tool Scout develops its functionality towards being system agnostic. The evaluation part of the thesis analyzed the developed plugin in terms of effectiveness in finding faults and efficiency in test development and execution.
This thesis also further examines the maturity and potential of VGT and the potential of the combination of VGT and Augmented Testing.
As this is a proof-of-concept study, the results should be taken under consideration as an indication of the potential for further study and development within the area of VGT combined with AT.
1.1.3 Scope
Developing Scout with the image recognition plugin, henceforth called JMScout in this thesis, towards being a truly system agnostic tool brings both advantages and disadvantages. Because JMScout is implemented in Java version 1.8, it is portable to most modern GUI-based operating systems and can be used against emulators of various other GUI-based platforms such as Android.
In the same vein, JMScout does not interact with the SUT in any way other than a user would, and therefore has no access to intrinsic data such as component identifiers, contrary to the more structure-aware second generation tooling. JMScout can only access what is clearly displayed on the GUI. This places a clear limitation on what can be expected to be asserted and tested for.
To perform the evaluation of the developed system, inspiration for a variety of mutations was gathered from the work of Offutt et al. [20]. Mutations are intentionally introduced defects in the application code that the testing tools try to detect [20]. These mutation types (Add, Change and Remove) were each included in a test case at least once. Any mutations that required interacting with the intrinsic data of the application's components were excluded from consideration, since JMScout does not have access to such data. More details about mutations are given in Section 3.4.2.
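To illustrate the kind of GUI-level mutant meant here, the sketch below shows a hypothetical Change mutation that alters a visible button label. The code and names are illustrative only (not taken from Rachota's actual source): the point is that the widget tree and component identifiers stay intact, so only a check of what is actually displayed detects the fault.

```python
# Illustrative sketch only: hypothetical GUI code, not Rachota's source.
# A "Change" mutant alters visible output while leaving the component
# structure (IDs, widget hierarchy) untouched, so a component-based
# existence check may pass while a visual check fails.

def render_button(mutant=False):
    """Return the label a GUI button would display."""
    original_label = "Save task"
    mutated_label = "Save taks"   # Change mutant: typo seeded in the label
    return mutated_label if mutant else original_label

def visual_assertion(expected_label, rendered_label):
    """A VGT-style check: compare what is actually displayed."""
    return expected_label == rendered_label

# The seeded fault is only caught when the rendered text is compared.
assert visual_assertion("Save task", render_button(mutant=False))
assert not visual_assertion("Save task", render_button(mutant=True))
```

Add and Remove mutations work analogously, inserting or deleting a visible element rather than changing one.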
The experiment will be conducted using these tools and techniques:
• JMScout
• EyeAutomate
• Manual Testing
Tests were conducted against Rachota, which is Open Source Software. The last major release of Rachota was in 2012. With these two factors in mind, it can be stated that Rachota does not necessarily represent modern state-of-the-art applications. This might somewhat negatively affect external validity and construct validity.
The data analysis was performed with Kruskal-Wallis H-tests and Wilcoxon Signed Rank tests [28]. These methods were used to evaluate the statistical significance of the research questions. The environment was made as fair as possible to make the tests comparable on a rational level.
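To give a concrete idea of the kind of analysis involved, the sketch below computes the Kruskal-Wallis H statistic for three hypothetical groups of execution times (the numbers are made up, not the thesis data, and the helper name is ours). In practice one would use a statistics package such as scipy.stats, which also returns the associated p-value; this bare version assumes no tied observations, so no tie correction is applied.

```python
# Sketch of the Kruskal-Wallis H statistic for k independent groups
# (here, hypothetical per-test execution times for three tools).
# H is computed from the ranks of the pooled observations.

def kruskal_wallis_h(*groups):
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks, no ties
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    return 12.0 / (n * (n + 1)) * sum(
        sum(rank[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

manual = [42.1, 39.5, 44.0]       # hypothetical seconds per test run
eyeautomate = [30.2, 28.7, 31.9]
jmscout = [21.4, 23.8, 20.6]

print(kruskal_wallis_h(manual, eyeautomate, jmscout))
```

The larger H is relative to the chi-squared distribution with k-1 degrees of freedom, the stronger the evidence that at least one tool's times differ from the others.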
Chapter 2
Literature Review
2.1 Method
To acquire a foundation for the main concepts discussed in this thesis, a literature review was performed, primarily by searching relevant keywords and later snowballing the references of the papers judged relevant, according to the guidelines published by Wohlin et al. [28].
The database used for locating relevant literature was the search engine Summon@BTH, a database-of-databases provided by Blekinge Institute of Technology. Supplementary searches were performed through Google Scholar as a secondary priority when no results were presented by the primary search engine. Search terms used to locate relevant literature primarily included "Visual GUI Testing", "VGT", "Augmented Testing", "EyeAutomate" and "Sikuli". As VGT is not a mature area of study, the initial literature search involved the first three terms.
Because the term VGT is also used in another area of study, it was difficult to ascertain that all results had been found. To assure that as much relevant literature as possible had been found, the VGT tools known to the authors were additionally searched for to complete the base from which to build the study. The rest of the literature was located through snowballing from the references in the papers found, sometimes multiple papers deep.
Our initial goal in finding literature was to understand Augmented Testing and the state of GUI testing. With the located information we found a gap in the subject matter and formulated our research so that it can be put in a broader perspective in terms of value. By learning about past trials and tribulations, we found where our plugin could provide a different approach.
2.2 Study
In the industry, one of the most commonly used techniques to handle GUI-interacting tests is manual testing [4, 27]. Manual techniques such as Exploratory Testing have long been an accepted approach for testing in the industry, according to Itkonen and Rautiainen [13]. A key advantage of manual techniques is finding new and unexpected errors [27].
But while manual testing is effective at finding novel faults, it is also susceptible to a few pitfalls. First and foremost, manual testing is a large expense in the budget [2, 4, 22, 27]: the technique has been known to take up to 40%, in some cases up to 50% or more, of a software project's budget [4, 22]. A second pitfall is that manual testing is susceptible to the human factor, which may make the testing error prone when it is perceived as tedious by the tester, or due to other human factors, such as tiredness, that may make a person inattentive [3, 4].
To address these concerns, the first generation of test automation was conceptualized: coordinate-based testing, which utilizes exact coordinates on the screen to interact with the SUT [2, 10]. One of the techniques that applies this concept is Record and Replay [5]. Traditionally, Record and Replay is a technique where a manual tester performs a test path while a tool records at which points on the screen the user interacted [2, 5, 10]. After a finished session, the recorded test can be replayed any number of times, which improves test frequency [2].
In the end, this method was found to be highly fragile to minor changes in the GUI, screen resolution, or even the size of the application window, and thus required a lot of costly maintenance [5, 10]. These weaknesses have led to the technique being mostly abandoned in practice as a stand-alone tool, according to Alégroth [2].
This flaw led to what is colloquially called the second generation of test automation, or structure-aware capture/replay, which is component/widget based and interacts with the SUT by hooking into its GUI libraries or toolkits [2, 24]. With this access, the second generation of test automation took the critical flaw of the first generation and attempted to solve it by recording what was clicked on instead of where the click was [17, 24].
While this model has advantages, such as a generally robust test execution environment and the ability to improve test execution time by forcing the SUT to bypass cosmetic timed events, the technique also comes with a few disadvantages [2]. Most notably, a system can only be tested if it is written in one of a subset of languages for which tools have been developed that can interpret and interact with it [2]. A point of concern is that this limitation applies not only to programming languages but also to custom components developed within the available languages [2]. Most tools have no issue parsing a language's default or commonly used libraries, for example the default AWT or Swing components if the application is written in Java [2]. If the SUT has custom components defined, there is an extra economic burden to write custom parsers for them, and to maintain those parsers if the components ever change in the future [2]. In addition, there is a form of dynamically generated GUI component that is difficult, or in some cases impossible, to test using this technique [2]. This is because such components are generated at runtime, which can make their properties completely unknown prior to execution [2].
Another flaw is the inability to evaluate how a user interface actually looks to the user [2]. While a script-based system can easily assert whether an element exists, it is difficult or impossible to assert how the element looks or whether it is even interactable by the user [2].
Finally, it has been shown that the technique is associated with script maintenance costs that can be significant, in some cases so large that they make the approach infeasible [11, 24]. One study by Grechanik et al. [11] notes that simple modifications to the GUI can result in 30% to 70% of test scripts changing. Another study by Memon and Soffa [17] notes that major changes to the GUI layout, such as a change in menu hierarchy or moving a widget from one window to another, may render in excess of 74% of test cases unusable.
This leads up to the third generation, the main topic of this thesis: Visual GUI Testing, more commonly known as VGT [2]. VGT had its inception in the early 90s with a tool called Triggers [21] and, a few years later in the late 90s, a tool called VisMap [30]. Unfortunately the technique suffered, as can be expected, from hardware limitations in the form of computational power, which made the very computationally heavy algorithms infeasible to use in practice [2].
But as time has gone on, the computational power available on the market has increased significantly and the algorithms used to perform the image recognition have been improved, which makes continued research in the field more feasible [2].
The core concept of VGT is, contrary to the second generation, to be able to interact with any application, regardless of what language it is written in or what system it runs on [3]. According to Alégroth et al. [3], this allows for a perceived higher flexibility and a certain level of robustness to some GUI changes that previous high-level GUI techniques have struggled with.
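The image recognition driving this concept can be illustrated with a minimal, naive template match: the tool holds a small reference image of a widget and scans a screen capture for the region that matches it, then derives the coordinate to click. The sketch below is a deliberately simplified stand-in (exact pixel equality on toy grids, with made-up helper names); real tools such as Sikuli or EyeAutomate use tolerance-based, fuzzy matching on actual screenshots.

```python
# Minimal sketch of the core VGT idea: locate a target image (a widget
# screenshot) inside a larger screen capture by exact template matching,
# then derive the screen coordinate to click. Real VGT tools use fuzzy,
# similarity-threshold matching; exact matching is used here for brevity.

def find_widget(screen, template):
    """Return (row, col) of the centre of `template` inside `screen`,
    or None if no exact match exists. Both are 2D lists of pixels."""
    sh, sw = len(screen), len(screen[0])
    th, tw = len(template), len(template[0])
    for top in range(sh - th + 1):
        for left in range(sw - tw + 1):
            if all(screen[top + r][left + c] == template[r][c]
                   for r in range(th) for c in range(tw)):
                return (top + th // 2, left + tw // 2)  # centre to click
    return None

# A 4x5 "screen capture" containing a 2x2 "widget" (pixel value 7).
screen = [
    [0, 0, 0, 0, 0],
    [0, 0, 7, 7, 0],
    [0, 0, 7, 7, 0],
    [0, 0, 0, 0, 0],
]
widget = [[7, 7],
          [7, 7]]

print(find_widget(screen, widget))  # -> (2, 3), the click target
```

Because only pixels are inspected, nothing about the SUT's language or toolkit is assumed, which is exactly what makes the approach system agnostic.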
Reviewing the perceived general status of VGT in practice and what the research says about the state of the available tools, we find that in a case study performed by Alégroth et al. [3] in 2015, 26 unique challenges, problems and limitations were identified in regards to VGT as a technique, its applicability in industry, and the tools used. Key among these were related to the tool Sikuli (https://sikulix.com/) [29], which was found to have image recognition volatility and signs of tool immaturity, along with a lack of integration with third-party software [3]. One study by Coppola et al. [7] notes that another VGT tool, EyeAutomate (https://eyeautomate.com/) [4] (previously known as JAutomate), has an API that allows some connection to third-party software, contrary to what was observed in the study that focused on Sikuli, which means that lack of third-party integration is not a general truth for all VGT software [3, 7].
The volatility is further examined in a comparative study by Börjesson and Feldt [6], in which two VGT tools, Sikuli and an unnamed tool with the pseudonym CommercialTool, were examined. The study observed volatility in both tools, but in different manners: Sikuli had only a 50% success rate in distinguishing the number 6 from the letter C, which CommercialTool handled flawlessly each time [6]. In contrast, in an experiment where the tool had to trace a call sign (a text string) across a multi-color radar screen, Sikuli had a 100% success rate while CommercialTool had a 0% success rate [6]. This indicates that the available tools still have a certain level of volatility and immaturity and require further development and empirical evaluation.
Moving on to contrasting development time between generations, a study performed by Dobslaw et al. [8] compared the second generation tool Selenium with EyeAutomate and observed that test implementation was significantly faster with the VGT tool: development time for EyeAutomate was ca. 20 working hours while the Selenium implementation took ca. 38 hours, close to double the total time [8]. While these time-efficiency results for EyeAutomate are impressive, this thesis aims to improve this time saving even further. In contrast, a study by Leotta et al. [16] compares Sikuli with Selenium WebDriver and observes that the DOM-based Selenium performs better in implementation time by a margin of 22% to 57%. From this we can assert that tool maturity has a large impact on the perceived and observed value of VGT. This also strengthens the necessity of our study, which will work with an R&R-like method instead of scripting.
In an evaluation of how much test cases need to be updated during the evolution of an application, Coppola et al. [7] found that a second generation testing tool, Appium (https://appium.io), only had to be updated 20% of the time, while EyeAutomate's tests had to be updated 30% of the time for the same application and versions. This indicates that VGT tests require more maintenance in the long run. Dobslaw et al. [8] agree, observing that maintenance costs were, on average, 32% higher for EyeAutomate than for the second generation tool Selenium. Furthermore, in the study by Leotta et al. [16] comparing Sikuli with Selenium WebDriver, the cost of maintaining the VGT-based scripts was often larger than for the Selenium-based test cases, but in a few instances the cost of maintaining the Selenium scripts was several times larger, so no clear conclusion can be drawn. In summary, depending on which tool, which application, and which changes are evaluated, VGT maintenance costs are likely to be higher, but this is not guaranteed under all circumstances.
In the review by Leotta et al. [16] it was also observed that applications with a stable GUI, but with a back-end that is significantly refactored between versions, give VGT techniques a strong advantage in maintenance metrics. This observation lends credence to each technique having its own advantages and disadvantages.
An observation made by Dobslaw et al. [8] was that EyeAutomate required less of a programming background to implement the tests than second generation tools such as Selenium, which touches on something very important: ease of use and lowering the barrier of entry to the field.
We intend to improve the ease of use of VGT with the aid of the research by Nass et al. [18] into Augmented Testing (AT), a technique that provides a layer between the tester and the SUT which superimposes relevant information and location data on top of the GUI. Key advantages of AT were noted to be that testers know what to test and what has been tested, in addition to requiring less manual work in total [18].
In another empirical study, Nass et al. [19] note that testing applications through the GUI is an important but time-consuming task in practice, and they attempt to mitigate its cost in effort and time by applying AT techniques in the form of the prototype tool Scout. Scout works on top of the second generation of component based testing. The evaluation in the study finds that Scout can be used to create equivalent test cases at a faster rate than two popular state-of-practice tools within the second generation [19].
As Scout is a second generation AT tool that already shows an improved rate of test generation compared to other state-of-practice tools, our theory is that bringing this time efficiency to the third generation will contribute to the field of research. Our study also intends to examine how much AT can lower the requirement of programming knowledge by foregoing scripts entirely with an R&R-like approach.
Reviewing the economic feasibility of VGT, Dobslaw et al. [8] observe that maintenance costs were, on average, 32% higher over the span of a year with weekly test runs for EyeAutomate when compared with Selenium (https://www.selenium.dev/). This figure, however, distracts from the fact that the main cost came from the initial test implementation, which accounted for 87% of the total cost of the entire test suite [8]. In total, with implementation and maintenance time included, EyeAutomate required 23 hours while Selenium required 44 hours [8]. The reason for the disparity is clearly the notable benefit, previously mentioned in this chapter, of being able to rapidly create test cases in the initial stage.
2.3 Summary
In summary, AT in combination with VGT requires academic examination of its feasibility to judge its value as a technique, as this has not been done before. The authors theorize that there will be gains in the applicability of VGT by opening the technique up to less technically experienced users through the use of AT.
The questions relating to test creation, test execution and test robustness have academic value in their empirical examination, as intended in this thesis, because these are common metrics of testing capability and this combination has not been evaluated before.
Visual GUI Testing is clearly still a research topic in its early examination phase on a more commercial scale, with a comparatively small number of available sources to compare and contrast against each other. We hope to contribute to this body of work in however minor a way our bachelor thesis can while walking amongst giants.
Chapter 3
Methodology
There were two major limiting factors in the choices made when conducting this research.
1. Resources: The resources available to the authors. Where possible, external resources and dependencies such as separate testing groups were reduced. This lowered dependence on external factors and the risk of the experiment failing to complete due to unfortunate circumstances.
2. Knowledge: Both authors have limited experience with automated tests, having only used Selenium, and have approximately the same level of experience with it. Another consideration was that learning how Scout operated had a learning curve of its own, since neither author had developed anything with the code base before, resulting in having to learn it from scratch.
Resource concerns in terms of time factored heavily into choosing a quasi-experiment. The alternatives, Case Study and Design Experiment, are discussed in Section 3.6. Additionally, we wanted to conduct this experiment with people, making it by definition a quasi-experiment [28]. Choosing ourselves as subjects was deemed adequate for addressing budget concerns and the goal of reducing external factors.
The choice to perform the test with three tools and techniques, namely Manual, EyeAutomate and JMScout, is based on which tools were available to the researchers. The requirement to provide a statistically significant number of tests had to be weighed against the ability to learn new tools and techniques within the limited time frame of the experiment. The comparison of the selected tools and techniques was judged to be sufficient.
Due to the lack of access to an industrial system, it was decided to run tests against an Open Source Software (OSS) application. OSS is often used as a proxy for industrial software in academic settings, which further motivated the choice. This also gave us the freedom to introduce mutants into the application as needed for the tests, as well as access to the entire unmodified code base.
This leads us to the study itself. Figure 3.1 presents the steps of the study as sequential phases, with their content put in context for the reader. These phases are further explained in Section 3.2.
[Figure 3.1 shows the four sequential phases of the study (Phase 1: Literature Analysis; Phase 2: Design of Study; Phase 3: Execution of Study; Phase 4: Analysis of Results) together with their contents: literature search, empirical strategy, variables, scope, design of JMScout, tools used, user stories, mutants, experiment method, descriptive and formal statistics, alternative approaches, and results.]
Figure 3.1: Methodology steps
3.1 Research questions
• RQ.1 What is the difference in effectiveness at finding GUI faults when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more effective at locating faults than the current state of the art and state of practice tools and techniques.
– Metrics: Effectiveness will be determined with mutation score, defined by Jia and Harman [14] as the ratio of the number of faults detected over the total number of seeded faults. In our case,
mutation score = x / y,
where x is the number of found mutations and y is the number of faults implemented. Only true positive results are used in this calculation. The statistical significance of the result will be examined with a Kruskal-Wallis H-test.
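To make the metric concrete, the small helper below (our own illustrative sketch, not part of JMScout) computes the mutation score for a hypothetical run where 18 of 21 seeded mutants were found:

```java
// Sketch: computing the mutation score as defined by Jia and Harman [14],
// i.e. the ratio of detected (true positive) faults to seeded faults.
public class MutationScore {

    // detected = number of true-positive mutant detections (x),
    // seeded   = number of faults implemented (y).
    static double score(int detected, int seeded) {
        if (seeded == 0) throw new IllegalArgumentException("no mutants seeded");
        return (double) detected / seeded;
    }

    public static void main(String[] args) {
        // Example: 18 of 21 seeded mutants detected, score approximately 0.857.
        System.out.println(score(18, 21));
    }
}
```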
• RQ.2 What is the difference in time efficiency in test development when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more time efficient in test development than the current state of the art and state of practice tools and techniques.
– Metrics: Development time will be compared between the following tools:
Manual vs JMScout and EyeAutomate vs JMScout. A Kruskal-Wallis H-test will be conducted to either reject or accept the null hypothesis.
• RQ.3 What is the difference in time efficiency in test execution when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more time efficient in test execution than the current state of the art and state of practice tools and techniques.
– Metrics: Execution time will be compared between the following tools:
Manual vs JMScout manual and EyeAutomate vs JMScout automated.
In this case a Wilcoxon Signed-Rank test will be conducted, and the null hypothesis will be rejected or accepted based on values gathered from the test, compared against a Wilcoxon Signed-Ranks table as seen in [23].
The motivation behind these research questions is to analyze and evaluate the developed plugin, and further to find support for VGT as a robust solution to the current state of automated testing. The expected outcome was that the tool would fare well against certain mutations, such as the complete removal of elements, while other mutations, such as slight modifications of the GUI, could be difficult for the VGT technique.
3.2 Phase 1 - Literature Analysis
3.2.1 Literature search
The search was performed primarily on the BTH Summon database, which connects a number of academic databases. Google Scholar was used as a secondary source. We used keywords and search terms related to the subject matter and then snowballed from the references according to the methods described by Wohlin et al. [28].
3.3 Phase 2 - Design of Study
3.3.1 Empirical strategy
Harris et al. [12] note that a major weakness of Quasi-Experiments is the lack of randomization, which we try to mitigate as explained in the following sections. The human factor, in conjunction with the lack of randomization, often results in confounding variables. Therefore a list of confounding variables (Table 3.1) was identified, and mitigations were planned and executed.
3.3.2 Variables
Dependent
Test case development time
The development time of a test case is a dependent variable for each tool and technique. Development time is measured for all tools and techniques when developing a new test case, as the time elapsed from the first interaction until a verified working test exists.
Test case execution time
The execution time is a dependent variable which depends on several factors, among them device performance, tester attentiveness and application performance. The execution times for the tools and techniques were measured, from the first interaction with the SUT until the test completed or failed, and compared in the analysis section of this thesis.
Mutants detected
The number of mutants detected per tool and technique will be measured and compared in the analysis section of this thesis.
Independent
Tools
The tools chosen are independent variables as they do not depend on any other factor. The tools chosen are Manual testing as a technique, EyeAutomate as a third generation testing tool and the developed plugin, JMScout.
Confounding
Hardware specifications
The hardware of the authors is a confounding variable, as it is used to execute the tests: test execution may vary due to the different hardware configurations between the authors.
The specifications are as follows:
Author 1: Intel Core i5-10210U @ 1.60 GHz for the CPU with 8 GiB of DDR4 RAM running on Windows 10 Home.
Author 2: AMD Ryzen 3600 @ 3.6 GHz for the CPU with 16 GiB DDR4 RAM running on Windows 10 Education.
Below, in Table 3.1, is a compiled collection of confounding variables predicted, and their respective mitigations and treatments applied to lessen or mitigate the impact on the study.
Name | Description | Effect | Mitigation
Knowledge bias | Difference in knowledge level with regard to the tools used. | Development time may vary due to familiarity. | Both testers have approximately the same knowledge level of all tools.
Learning bias | Gradual familiarity with user stories and tools while preparing and performing tests. | Development and execution of tests will vary. | The ordering of test development was randomized, with a number 1-3 each corresponding to a tool.
Fault knowledge bias | Bias while looking for faults due to knowledge of where faults are introduced. | Time and efficiency in finding faults will vary in an undesirable way. | Faults are introduced through user stories. Neither tester knows what faults have been introduced by the other tester, as per Sections 3.4.1 and 3.4.2.
Selection bias | Tester selects user story or mutation due to preference. | Undesirable bias in test creation and bug introduction. | User stories and mutations to apply were randomly assigned, as per Sections 3.4.1 and 3.4.2.
Performance impact | The performance of the tester's workstation may be impacted if an update, or similar, is started. | The testing time may be skewed due to outside factors. | The tester takes precautions to make sure that no system updates, or similar, are required before testing.
Human factor | Frame of mind or human error might affect results. | Difference in test development time. | The tester only tests when they are in a relatively normal frame of mind.
Table 3.1: Confounding variables
3.3.3 Scope
A total of 21 test cases were developed; multiplied by the number of tools and techniques used, a total of 63 tests were written. In execution, the tests were run a total of 315 times. This number is the product of a few factors: Manual was run once per user story (21), JMScout was run manually twice per user story (42) and then automatically six times (126), and finally EyeAutomate was also run six times per user story (126).
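The totals above follow from simple arithmetic; the sketch below (purely illustrative, our own code) reproduces them:

```java
// Illustrative check of the scope arithmetic described above.
public class ScopeCounts {
    static final int USER_STORIES = 21;

    // One test case per user story and tool/technique.
    static int testsWritten() {
        return USER_STORIES * 3; // Manual, EyeAutomate, JMScout
    }

    // Manual once, JMScout manually twice and automatically six times,
    // and EyeAutomate six times, all per user story.
    static int totalExecutions() {
        return USER_STORIES * (1 + 2 + 6 + 6);
    }

    public static void main(String[] args) {
        // prints: 63 tests, 315 executions
        System.out.println(testsWritten() + " tests, " + totalExecutions() + " executions");
    }
}
```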
3.3.4 Design of JMScout
[Figure 3.2 shows Scout's plugin architecture, with JMScout replacing the SeleniumPlugin among Scout's plugins.]
Figure 3.2: Replacing Selenium Plugin with JMScout
The goal of the study is to provide a proof-of-concept plugin that examines the feasibility of the combination of AT and VGT. With this in mind, several design decisions were made, which we present in the following sections.
Functionality
With the intention to mimic user interactions with applications, functionality to specify actions had to be implemented in the plugin. Due to resource constraints, not all actions, such as click-and-drag, could be implemented in JMScout. The prioritized list of actions was chosen based on the need to perform user-like test scenarios in the experiment. The list of actions includes:
• Left mouse button click
• Right mouse button click
• Double left mouse button click
• Type text
• Check area for match
With the ability to left click, right click and double click, a large set of common GUI actions can be performed. The added ability to type text enables, among other things, logging in to accounts and typing in registration details.
The final functionality, checking an area for a match, is the plugin's way to assert that the expected result exists without taking any action. This is useful, for example, when it has to be asserted that an emblem still exists and looks the same on a web page, or that a submitted form, created with the other functionalities in the plugin, returns the same-looking output.
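In code, the action set can be thought of as a small enumeration, where the Check variant only asserts rather than drives the application. The sketch below is our own illustration; the identifiers are hypothetical, not JMScout's real API:

```java
// Hypothetical model of JMScout's action set; names are illustrative only.
public class WidgetAction {
    enum Action { LEFT_CLICK, RIGHT_CLICK, DOUBLE_LEFT_CLICK, TYPE_TEXT, CHECK }

    // A Check only verifies that the expected image is present; every other
    // action actively drives the SUT.
    static boolean drivesSut(Action a) {
        return a != Action.CHECK;
    }

    public static void main(String[] args) {
        for (Action a : Action.values()) {
            System.out.println(a + (drivesSut(a) ? " drives the SUT" : " only asserts"));
        }
    }
}
```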
In addition to the basic functionalities listed, an integration was made with existing Scout functionalities that provides an augmented experience such as the state tree graph and widget markers seen in Fig. 3.3.
Each outline represents a Widget, which is represented in code as a class containing widget metadata, such as the action to perform or the path to an image. The image, for example, is used for matching with Image Recognition. The listed actions are mapped to re-bindable keybindings and allow the designation of the action for the widget to be inserted.
Figure 3.3: JMScout testing Rachota
Once a specific action or check has been designated as the action to perform, a widget size can be specified and the widget finally inserted, resulting in a colored square around the widget on the augmented GUI in Scout. The tree-like graph in the top right corner of Fig. 3.3 is Scout's state tree. Each node in this tree represents an appstate class, in essence a testing "step" in code, and inserting an action widget results in another node being added to this graph. This class is meant to represent a state, and it contains paths (branches) containing widgets. This is also what the developed plugin traverses when performing widgets with their specified action.
Note: in Fig. 3.3, Checks are represented with a yellow outline while Actions are represented with a blue outline. If a check is located while running a scenario, it turns green instead of yellow. If a widget at any point cannot be located, it turns grey, and the tester can click on it to see the image it was looking for. In that case, the tester is also given the option to repair the test case by performing the selection again through a simple and user-friendly interaction.
These widgets and their associated actions can be performed either manually or automatically. To manually perform an action on a widget, you simply left click it. To automatically execute all actions in a test suite, a re-bindable hotkey is used, by default CTRL+R.
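The automatic execution described above can be sketched as a traversal of a path of widgets, each located via image recognition before its action is performed. The types below are our own stand-ins, not Scout's real classes:

```java
// Illustrative sketch (our own types, not Scout's real classes) of how a
// test path of widgets could be executed automatically: each widget is
// located via image recognition, then its action would be performed.
import java.util.List;

public class PathRunner {
    enum Action { LEFT_CLICK, RIGHT_CLICK, DOUBLE_LEFT_CLICK, TYPE_TEXT, CHECK }

    static class Widget {
        final String imagePath;
        final Action action;
        Widget(String imagePath, Action action) { this.imagePath = imagePath; this.action = action; }
    }

    // Stand-in for the Eye image recognition call.
    interface Recognizer { boolean locate(String imagePath); }

    // Runs the path; returns the index of the first widget that could not be
    // located (it would be drawn grey in the GUI), or -1 on full success.
    static int run(List<Widget> path, Recognizer eye) {
        for (int i = 0; i < path.size(); i++) {
            if (!eye.locate(path.get(i).imagePath)) return i; // widget goes grey, repair offered
            // ...perform path.get(i).action here (click, type, or check)...
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Widget> path = List.of(
            new Widget("img/new_task_button.png", Action.LEFT_CLICK),
            new Widget("img/task_name_field.png", Action.TYPE_TEXT),
            new Widget("img/task_created_label.png", Action.CHECK));
        // Pretend recognizer: everything except the final label is found.
        System.out.println(run(path, p -> !p.contains("label"))); // prints 2
    }
}
```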
Design decisions
Eye
As the Scout application itself does not natively carry the capability to perform image recognition, a suitable lightweight Image Recognition tool had to be selected. The choice was the Eye tool (https://store.synteda.se/eu/product/eye-for-researchers-and-students/), which is also used in a related tool of Scout called EyeSel [15]. Eye provides great functionality relative to its perceived ease of use and a range of ready-to-use functionalities that aid research development speed. Eye also provides three different recognition modes to increase the certainty of locating the images searched for: EXACT, which provides as close to an exact match of the searched-for image as possible; COLOR, which is the same as EXACT but with a certain tolerance to color variance; and TOLERANT, the most tolerant of the algorithms, which ignores color matching almost completely and instead matches according to edges found in the sought image.
Default image size
Throughout the development of the plugin, different image sizes for comparison were used. In the end, 150 by 150 pixels was judged appropriate, as 100 by 100 pixel images were sometimes found to be too small for the image recognition to notice contrasts. The choice of size was a careful weighing, as larger sizes start to suffer performance issues.
Image Match Percent
During evaluation and testing, match percentages between drastically different, but similarly sized, sections of text on the same background turned out to be concerningly high. Small images, like the default 150 by 150 pixel images used by the application, frequently displayed match percentages in excess of 90% when looking for a matching image, and under certain circumstances as high as 95% or slightly more. Due to these findings, the default and recommended configuration of the developed software only accepts 100% matching images. Only accepting 100% matching images does make the matching more fragile, but ensures that what is matched is what is being sought.
Configurability
Due to a variety of factors, we could never be certain that our choices of default image size and default image matching percentage would suit every individual project best. As a result, we made the most essential options configurable. Most notable among these are:
• Retries to find a widget before it gives up and times out.
• The default image size on widget creation.
• Minimum image matching percentage.
• Keybindings to use for interacting with most application functionality.
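For illustration, such options could live in a simple properties file; the key names below are hypothetical and do not reflect JMScout's actual configuration identifiers:

```properties
# Hypothetical JMScout configuration; key names are illustrative only.
widget.locate.retries=3
widget.default.width=150
widget.default.height=150
image.match.minPercent=100
keybinding.runAll=CTRL+R
```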
Menu click
During the early stages of testing, a major issue with using Augmented Testing was clicking on menus: any open system menu immediately closes the second the application window loses focus, which happens when the user clicks on the JMScout window. To mitigate this, we had to create a special action to handle clicking on menus, colloquially called MENU_ACTION. This, in addition to the ability to "freeze" the image displayed in the main application window by holding CTRL+SHIFT, mitigated the issue almost completely, even though the solution is somewhat forced.
Retries with different modes
During the design phase of JMScout, we examined Eye's ability to execute in different recognition modes and concluded that while EXACT was generally the optimal algorithm with respect to both performance and the elements we wanted to match, running the other two modes when EXACT failed provided a level of robustness in locating images. For this reason, we implemented functionality with several match attempts, leading to a cascading fail-safe approach.
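The cascading fail-safe can be sketched as follows. Eye's real API differs; the Matcher interface here is our own stand-in:

```java
// Sketch of the cascading fail-safe matching described above. The Eye
// library's real API differs; Matcher here is a hypothetical stand-in.
public class CascadingMatch {
    enum Mode { EXACT, COLOR, TOLERANT }

    interface Matcher { boolean match(Mode mode, String imagePath); }

    // Try EXACT first (fastest, most precise); fall back to COLOR, then
    // TOLERANT. Returns the mode that succeeded, or null if all failed.
    static Mode locate(Matcher eye, String imagePath) {
        for (Mode mode : new Mode[]{Mode.EXACT, Mode.COLOR, Mode.TOLERANT}) {
            if (eye.match(mode, imagePath)) return mode;
        }
        return null;
    }

    public static void main(String[] args) {
        // Pretend matcher: only the edge-based TOLERANT mode finds the image.
        System.out.println(locate((m, p) -> m == Mode.TOLERANT, "img/button.png")); // prints TOLERANT
    }
}
```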
Repairing widgets
During experimentation with the implemented functionality, testing was often halted by a widget going grey, that is, being unable to be located. This led us to initially display the image the plugin was looking for, which helped the tester understand, with a visual check, why it could not be located. Later in the implementation, errors started appearing in the middle of test paths, which was severely undesirable, since any node after the erroneous node would also be deleted upon node deletion. This led to the implementation of the Repair feature, which allows the tester to substitute any ailing node. The tester can also switch the type of an action node, for example from left click to double click.
Key Listeners
Another issue noticed during development is that key listeners do not update correctly if a key is released while the main window does not have focus. This initially led to some confusion as to why CTRL no longer needed to be held down to perform some actions. The initial solution was to immediately request focus back to the main application window, but this had to be scrapped due to concerns raised in conjunction with the menu click situation. A final solution was found by utilizing the JNativeHook package (https://github.com/kwhat/jnativehook) for implementing global listeners, which resolved the issue neatly by not requiring focus requests and by being constantly aware of whether a key is pressed or released, no matter which application has focus. This listener is only used to listen for the CTRL and SHIFT press and release events.
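The benefit of a global hook is that modifier state can never go stale. The sketch below shows only the state-tracking logic (our own illustrative class); with JNativeHook, the nativeKeyPressed/nativeKeyReleased callbacks would call press()/release(), and the library wiring itself is omitted here:

```java
// Sketch of the modifier-state tracking that a global key hook enables.
// With JNativeHook, key press/release callbacks fire regardless of which
// window has focus; they would invoke press()/release() below.
public class ModifierState {
    private boolean ctrlDown, shiftDown;

    void press(String key)   { if (key.equals("CTRL")) ctrlDown = true;  if (key.equals("SHIFT")) shiftDown = true; }
    void release(String key) { if (key.equals("CTRL")) ctrlDown = false; if (key.equals("SHIFT")) shiftDown = false; }

    // "Freeze" the displayed image only while CTRL and SHIFT are both held.
    boolean freezeActive() { return ctrlDown && shiftDown; }

    public static void main(String[] args) {
        ModifierState s = new ModifierState();
        s.press("CTRL");
        s.press("SHIFT");
        System.out.println(s.freezeActive()); // prints true
        // A release seen by the global hook, even while another window has
        // focus, keeps the state from going stale.
        s.release("SHIFT");
        System.out.println(s.freezeActive()); // prints false
    }
}
```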
Architecture overview
[Figure 3.4 shows the plugin architecture: Scout hosts JMScout, which uses the Eye library for image recognition and consists of a state controller, a component manager, an action handler and a draw handler.]
Figure 3.4: Plugin Architecture
To develop JMScout, we depend on two external libraries: the Eye module, which provides image recognition, and Scout itself, the main application that the plugin is written for. The Scout application JAR file is a black box with API functions that can be accessed. JMScout was developed using these limited functions to invoke UI functionalities and store data.
Action handler
This component handles the interactions that the user initiates. It also executes the widgets’ specified action.
Draw handler
Used to draw custom sized widgets for easier user interaction.
State controller
The state controller is used to detect whether the states in Scout have changed and whether any new checks need to be performed. The state controller is also invoked when adding a new Widget.
Component manager
The component manager assures that each application widget is located in the image displayed on Scout's user interface and then drawn on the user interface.
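The division of responsibilities among the four components can be sketched as small interfaces collaborating in an update cycle. The interfaces and method names below are our own illustration, not JMScout's real code:

```java
// Illustrative decomposition of the four plugin components named above.
// Interfaces and method names are hypothetical, not JMScout's real code.
import java.util.List;

public class PluginComponents {
    interface StateController { boolean stateChanged(); }          // detects Scout state changes
    interface ComponentManager { List<String> locateWidgets(); }   // finds widgets in the displayed image
    interface DrawHandler { void draw(String widgetId); }          // draws widget outlines on the GUI
    interface ActionHandler { void perform(String widgetId); }     // executes the widget's action

    // One update cycle: on a state change, re-locate the widgets and redraw
    // them; returns how many widgets were drawn.
    static int refresh(StateController sc, ComponentManager cm, DrawHandler dh) {
        if (!sc.stateChanged()) return 0;
        List<String> widgets = cm.locateWidgets();
        widgets.forEach(dh::draw);
        return widgets.size();
    }

    public static void main(String[] args) {
        int drawn = refresh(() -> true, () -> List.of("ok_button", "name_field"), id -> {});
        System.out.println(drawn); // prints 2
    }
}
```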
Experiment subject
The experiment was conducted with an application written in Java called Rachota.
The selection of this specific application was based on three key criteria. First, its open source nature and license allow legal modification and the introduction of artificial defects (mutants) into the application for testing purposes. Second, Rachota contains a myriad of different user interface elements, making it representative of most common applications on the wider market and hence a suitable candidate for optimizing the experiment's external validity given the project's resource constraints.
Lastly, Rachota has been used in previous academic research, for instance by Alégroth et al. [1], for executing tests against, further supporting its validity as a test subject. That study and a study by Souza [26] also served as guidance when specifying what metrics our test subject had to comply with in order to be an appropriate subject.
Version: 2.4 | Windows: 10 | Classes: 53 | Lines of code: 14034
Table 3.2: Rachota metadata
Since the experiment subject has 10 windows and roughly 14k LOC, the experiment can only be generalized to other smaller applications. This limitation was planned for, given the allocated resources available.
3.3.5 Tools used
To be able to communicate and plan in an age of COVID-19 restrictions, and to perform the experiment and develop JMScout, a number of different tools were required.
First and foremost is Discord (https://www.discord.com), a communication platform with text chat, voice and video chat, and screen sharing capabilities. The tool has been key in communication between the authors, as they have been unable to meet due to the active restrictions.
Second is the web-based tool Notion (https://www.notion.so), an all-in-one tool with task boards with statuses and labels, tables for structuring literature to read, note-keeping and much more.
For application development, IntelliJ IDEA (https://www.jetbrains.com/idea/), with a license obtained through the GitHub Student pack, was used, since IntelliJ IDEA specializes in Java-based languages. On the same note, version control was handled through Git via GitHub (https://www.github.com).
Finally, for bug introduction, Apache NetBeans was used, as Rachota was originally composed using a form builder. The reason for this choice was that the authors could not get the form builder for IntelliJ to work with Rachota's source files.
3.4 Phase 3 - Execution of Study
3.4.1 User Stories
There were 21 user stories determined for use in the study. Rachota was manually explored to determine its common use cases, and from these what was deemed to be common tasks for a day planner.
During this familiarization it was noted that the application has certain functionalities that generate content outside of the application itself, such as a summarized report of time spent in an HTML file; these were unanimously excluded from consideration for inclusion in a user story.
After becoming familiar with the application's functionalities, user stories were determined with the aim of getting a good mix of complexity and number of interactions. The minimum number of interactions was set to 5, and no specific maximum was set, though it was assumed to be around 20. Simple interactions were defined as making a Check or clicking a button. Complex interactions were defined as, amongst others, typing text or making a Menu Action interaction.
An example of an included user story reads like this: 5. As a user, I want to be able to create a task and then delete it. A full list of user stories can be found in Table A.1.
3.4.2 Mutants
Mutations are according to Offutt et al [20] when the tester creates test data that causes a set amount of faults. The mutation testing technique serves two goals: it provides a test adequacy criterion and leads to detection of faults. Each mutant were introduced through mutant operators in a random, but unique, order. Specifically in this case the ADD, CHANGE and REMOVE. Mutants from the list 3.3 was used to introduce one mutant per user story. When each Mutant has been introduced once, then mutants will be randomly selected to be introduced once again. Therefore some