System Agnostic GUI Testing
Analysis of Augmented Image Recognition Testing
Martin Moberg Joel Amundberg
June 11, 2021
Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Software Engineering. The thesis is equivalent to 10 weeks of full time studies.
Contact Information:
Author(s):
Martin Moberg
E-mail: martin3127@gmail.com
Joel Amundberg
E-mail: joel.amundberg@gmail.com
University advisor:
Dr. Emil Alégroth
Department of Software Engineering
Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57
Abstract
Automated Graphical User Interface (GUI) tests have historically been known to be fragile, thus requiring a lot of maintenance. Document Object Model (DOM) tools and other component based tools solve fragility to an extent, but also have drawbacks.
Accessing components requires knowledge of widget metadata, and in many cases this is not possible due to programmatic limitations. A somewhat novel approach to this problem is Visual GUI Testing (VGT). Instead of components, VGT uses image recognition as its driver. VGT also suffers from fragility issues, but its main benefit is being system agnostic. The VGT tools currently available use scripts; these scripts have been shown to be more fragile than some DOM-based models and to require more maintenance, but they have the potential to offset this downside by improving time and cost efficiency when creating the base test cases.
Therefore, in this paper we evaluate a proof-of-concept plugin called JMScout that we developed for a tool called Scout. Scout uses a technique called Augmented Testing: a visual layer put between the SUT and the tester, through which the tester interacts. The primary goal is to integrate JMScout into Scout and attempt to improve time efficiency in test development and test execution, as well as effectiveness in finding faults. Augmented Testing in combination with VGT has never been studied before. Another major goal was to evaluate the augmented paradigm's potential to improve usability and time efficiency in test development compared to other tools.
To evaluate this plugin, a quasi-experiment was conducted in which effectiveness, time efficiency in test development, and time efficiency in test execution were analyzed. The experiment was conducted against an application called Rachota, into which the authors introduced faults, also known as mutants. Tool performance when attempting to find these faults was documented as the results. These results were analyzed with descriptive statistics using graphs, box plots and tables, in conjunction with more formal statistics using the Kruskal-Wallis H-test and the Wilcoxon Signed Rank Test.
The results indicate that manual testing was more effective at finding faults than JMScout and EyeAutomate. They also suggest that JMScout was the most time efficient in test execution compared to the other tools; however, time efficiency in test development was inconclusive.
In conclusion, this study suggests that JMScout does not solve the fragility issues that EyeAutomate, and possibly other VGT tools, possess, but it provides a way to mitigate the costs of VGT in general as much as possible. Thanks to the Augmented Testing paradigm that Scout provides, JMScout may improve efficiency in test development, and even more probably so as JMScout matures further.
Keywords: Testing, Augmented Testing, Image Recognition, VGT, Visual GUI
Testing
Acknowledgments
First and foremost we would like to express our gratitude to our supervisor Dr. Emil Alégroth for his continued support throughout the thesis, in particular his guidance in the empirical study, his intermediation of the state and value of VGT, and his continuous feedback on our work. We would also like to thank the Rachota project for providing us with an application in which we had the freedom to implement faults. Additionally, we would like to thank Michel Nass and Andreas Bauer for developing Scout.
Contents
Abstract i
Acknowledgments ii
1 Introduction 1
1.1 Background . . . . 1
1.1.1 Motivation . . . . 2
1.1.2 Contribution . . . . 3
1.1.3 Scope . . . . 3
2 Literature Review 5
2.1 Method . . . . 5
2.2 Study . . . . 5
2.3 Summary . . . . 9
3 Methodology 10
3.1 Research questions . . . . 11
3.2 Phase 1 - Literature Analysis . . . . 12
3.2.1 Literature search . . . . 12
3.3 Phase 2 - Design of Study . . . . 12
3.3.1 Empirical strategy . . . . 12
3.3.2 Variables . . . . 12
3.3.3 Scope . . . . 14
3.3.4 Design of JMScout . . . . 14
3.3.5 Tools used . . . . 18
3.4 Phase 3 - Execution of Study . . . . 19
3.4.1 User Stories . . . . 19
3.4.2 Mutants . . . . 19
3.4.3 Experiment method . . . . 21
3.4.4 Results . . . . 23
3.5 Phase 4 - Analysis of Results . . . . 23
3.5.1 Descriptive Statistics . . . . 23
3.5.2 Formal Statistics . . . . 23
3.5.3 Limitations . . . . 24
3.5.4 Outliers . . . . 24
3.6 Alternative approaches . . . . 24
3.7 Replication Pack . . . . 24
4 Results and Analysis 25
4.1 Results . . . . 25
4.1.1 Fault prediction vs Fault found . . . . 25
4.1.2 Development Times . . . . 26
4.1.3 Execution Times . . . . 27
4.2 Analysis . . . . 29
5 Discussion 33
5.1 Observations . . . . 33
5.2 Validity threats . . . . 35
5.2.1 Internal . . . . 35
5.2.2 External . . . . 35
5.2.3 Construct . . . . 35
5.2.4 Reliability . . . . 35
6 Conclusions and Future Work 36
6.1 Conclusion . . . . 36
6.2 Future work . . . . 36
6.2.1 Future development work . . . . 36
6.2.2 Future scientific examination . . . . 37
A Annexes 40
A.0.1 User story template example . . . . 40
List of Figures
1.1 Scout tool using the Selenium plugin . . . . 2
3.1 Methodology steps . . . . 10
3.2 Replacing Selenium Plugin with JMScout . . . . 14
3.3 JMScout testing Rachota . . . . 15
3.4 Plugin Architecture . . . . 17
3.5 User Story 12 - Original to the left vs mutant implemented on the right 20
3.6 Experiment Steps . . . . 21
4.1 Mutant predictions and actual outcome per tool . . . . 25
4.2 Test development Box plot . . . . 26
4.3 Box plot JMScout Manual vs Manual testing with Faults . . . . 27
4.4 Box plot EyeAutomate vs JMScout No faults . . . . 27
4.5 Box plot EyeAutomate vs JMScout Faults . . . . 28
4.6 Test development times . . . . 29
4.7 Test execution times Manual vs JMScout manual . . . . 31
4.8 Test execution times EyeAutomate vs JMScout . . . . 32
5.1 User Story 21 - Rachota exit button . . . . 34
A.1 Box plot components . . . . 42
A.2 Times for test development . . . . 42
A.3 Execution times of tests with Manual vs JMScout against Rachota with faults implemented by the authors. . . . 43
A.4 Execution times of tests with JMScout vs EyeAutomate with no faults implemented by the authors. . . . 43
A.5 Execution times of tests with JMScout vs EyeAutomate against Rachota with faults implemented by the authors. . . . 44
List of Tables
3.1 Confounding variables . . . . 13
3.2 Rachota metadata . . . . 18
3.3 Mutant operators . . . . 20
A.1 User Stories . . . . 41
Abbreviations
AT - Augmented Testing
DOM - Document Object Model
GUI - Graphical User Interface
HTML - Hyper-Text Markup Language
OSS - Open Source Software
R&R - Record & Replay
SUT - System Under Test
VGT - Visual GUI Testing
Chapter 1
Introduction
1.1 Background
Sommerville states that in the software development industry, one of the keystones of product development is software testing [25]. This practice gives the product owner and the development team confidence that the application conforms to the users' needs. In today's industry, high-level tests such as Graphical User Interface (GUI) based tests are mostly performed with manual techniques such as Exploratory Testing [3]. Exploratory methods are often tedious and costly, and if tests are repeated manually, possibly error prone as well [3].
A proposed solution to these stated issues of cost, tedium and error-proneness is Visual GUI Testing (VGT) [2]. VGT is an emerging technique in industry due to its perceived higher flexibility and robustness to certain GUI changes that other GUI automation techniques either do not apply to or are too sensitive to [3].
There are currently multiple tools that attempt to solve the posed issues. One technique among them is Record and Replay (R&R), which relies on replaying a previous user session by recording the user's clicks. This is generally called the first generation [2]. This technique has some limiting factors in regards to maintenance: the scripts are fragile to changes in layout, API and similar factors, which in the worst case can break an entire test suite for first generation testing tools [6]. The second generation of testing tools also utilizes R&R, in a model called structure-aware capture/replay, which instead of recording where a user clicked records what was clicked on [2, 24]. The downside is that it cannot assert how a GUI looks, or whether it is interactable by the user at all, in addition to being limited to a subset of programming languages it can hook into [2].
Keeping all of this in mind, the authors chose to attempt to build a more user friendly system with the aid of image recognition and Augmented Testing. Testing based on image recognition is termed the third generation, or VGT [2]. Augmented Testing is a technique that superimposes relevant information and location data between the tester and the SUT [18]. This helps keep costs down when a test suite needs maintenance, and lowers the barrier of entry to the testing field by reducing the need for a programming background. The image recognition is completely system agnostic, not interacting with any of the System Under Test's (SUT) intrinsic functionality, in contrast to the second generation of testing.
Therefore the authors chose to develop an extension for a prototype tool named Scout (https://github.com/augmented-testing/scout) [19]. Scout is an academic testing tool built in Java, currently having Selenium as its sole core driver. This driver is written as a plugin to Scout, which makes swapping out the core a simple task.
A common test case is written using actions and assertions, which this tool assists with by creating visual markers for where the test path goes next and markers for what is to be examined, noting test steps with missing or altered content. The tool is also able to autonomously run the entire test suite.
Figure 1.1: Scout tool using the Selenium plugin
1.1.1 Motivation
As mentioned in the article "Visual GUI Testing: Automating High-Level Software Testing in Industrial Practice" [2], test automation will be essential in meeting time to market while still providing quality software, by providing a test suite that can verify application functionality continuously without the added costs of manual testing [2, 6]. In the current industry, manual techniques such as Exploratory Testing are among the standards for testing a GUI [13]. This limits the ability to apply testing continuously due to both budgetary and time-based constraints. The other standard for testing a GUI is the second generation of testing, which is well suited for automation but cannot assert how a GUI looks, or even whether a component is interactable [2].
While VGT tools have been created before, they have not been made in the context of Augmented Testing (AT), which is why the tool Scout was chosen [18, 19]. Scout is arguably unique in its testing approach in that it provides an augmented GUI, visualizing test information directly on top of the application undergoing the tests.
The reason that combining the two techniques of AT and VGT is interesting is that it increases the ease of use of automated testing, lessening the need for a programming background to write scripts, since all information and utility is provided through a point-and-click GUI. VGT is also a system agnostic technique, which will greatly improve the applicability of the AT technique beyond its current DOM-based implementation in Scout.
It is unknown and untested whether this technique will have any positive effects on time efficiency or effectiveness in finding faults, which is what this work aims to examine: specifically, effectiveness in finding faults and efficiency in test development and execution.
1.1.2 Contribution
The main contribution of this work is developing a proof-of-concept Augmented Testing tool in combination with the VGT technique. This also means that the tool Scout develops its functionality towards being system agnostic. The evaluation part of the thesis analyzed the developed plugin in terms of effectiveness in finding faults and efficiency in test development and execution.
This thesis also further examines the maturity and potential of VGT and the potential of the combination of VGT and Augmented Testing.
As this is a proof-of-concept study, the results should be taken under consideration as an indication of the potential for further study and development within the area of VGT combined with AT.
1.1.3 Scope
Developing Scout with the image recognition plugin, henceforth called JMScout in this thesis, towards being a truly system agnostic tool brings both advantages and disadvantages. Because JMScout is implemented in Java version 1.8, it is portable to most modern GUI-based operating systems and can be used against emulators of various other GUI-based platforms such as Android.
In the same vein, JMScout does not interact with the SUT in any way other than a user would, and therefore has no access to intrinsic data such as component identifiers, contrary to the more structure-aware second generation tooling. JMScout can only access what is clearly displayed on the GUI. This places a clear limitation on what can be expected to be asserted and tested for.
To perform the evaluation of the developed system, inspiration for a variety of mutations was gathered from the work of Offutt et al. [20]. Mutations are intentionally introduced defects in the application code that the testing tools try to detect [20]. These mutation types (Add, Change and Remove) were each included in a test case at least once. Any mutations that required interacting with the intrinsic data of the application's components were excluded from consideration, since JMScout does not have access to such data. More details about mutations are given in Section 3.4.2.
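To illustrate the kind of GUI-level mutant meant here, the sketch below shows a hypothetical Change mutation that alters a visible button label. The code and names are illustrative only (not taken from Rachota's actual source): the point is that the widget tree and component identifiers stay intact, so only a check of what is actually displayed detects the fault.

```python
# Illustrative sketch only: hypothetical GUI code, not Rachota's source.
# A "Change" mutant alters visible output while leaving the component
# structure (IDs, widget hierarchy) untouched, so a component-based
# existence check may pass while a visual check fails.

def render_button(mutant=False):
    """Return the label a GUI button would display."""
    original_label = "Save task"
    mutated_label = "Save taks"   # Change mutant: typo seeded in the label
    return mutated_label if mutant else original_label

def visual_assertion(expected_label, rendered_label):
    """A VGT-style check: compare what is actually displayed."""
    return expected_label == rendered_label

# The seeded fault is only caught when the rendered text is compared.
assert visual_assertion("Save task", render_button(mutant=False))
assert not visual_assertion("Save task", render_button(mutant=True))
```

Add and Remove mutations work analogously, inserting or deleting a visible element rather than changing one.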
The experiment will be conducted using these tools and techniques:
• JMScout
• EyeAutomate
• Manual Testing
Tests were conducted against Rachota, which is Open Source Software. The last major release of Rachota was in 2012. With these two factors in mind, it can be stated that Rachota does not necessarily represent modern state-of-the-art applications. This might somewhat negatively affect external validity and construct validity.
The data analysis was performed with Kruskal-Wallis H-tests and Wilcoxon Signed Rank tests [28]. These methods were used to evaluate the statistical significance of the research questions. The environment was made as fair as possible to make the tests comparable on a rational level.
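To give a concrete idea of the kind of analysis involved, the sketch below computes the Kruskal-Wallis H statistic for three hypothetical groups of execution times (the numbers are made up, not the thesis data, and the helper name is ours). In practice one would use a statistics package such as scipy.stats, which also returns the associated p-value; this bare version assumes no tied observations, so no tie correction is applied.

```python
# Sketch of the Kruskal-Wallis H statistic for k independent groups
# (here, hypothetical per-test execution times for three tools).
# H is computed from the ranks of the pooled observations.

def kruskal_wallis_h(*groups):
    pooled = sorted(x for g in groups for x in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks, no ties
    n = len(pooled)
    # H = 12 / (N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)
    return 12.0 / (n * (n + 1)) * sum(
        sum(rank[x] for x in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

manual = [42.1, 39.5, 44.0]       # hypothetical seconds per test run
eyeautomate = [30.2, 28.7, 31.9]
jmscout = [21.4, 23.8, 20.6]

print(kruskal_wallis_h(manual, eyeautomate, jmscout))
```

The larger H is relative to the chi-squared distribution with k-1 degrees of freedom, the stronger the evidence that at least one tool's times differ from the others.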
Chapter 2
Literature Review
2.1 Method
To acquire a foundation for the main concepts discussed in this thesis, a literature review was performed, primarily by searching relevant keywords and later snowballing the references of the papers judged relevant, according to the guidelines published by Wohlin et al. [28].
The database used for locating relevant literature was the search engine Summon@BTH, a database-of-databases provided by Blekinge Institute of Technology. Supplementary searches were performed through Google Scholar as a secondary priority when no results were presented by the primary search engine. Search terms used to locate relevant literature primarily included "Visual GUI Testing", "VGT", "Augmented Testing", "EyeAutomate" and "Sikuli". As VGT is not a mature area of study, the initial literature search involved the first three terms.
Because the term VGT is also used in another area of study, it was difficult to ascertain that all results had been found. To assure that as much relevant literature as possible had been found, the VGT tools known to the authors were additionally searched for to complete the base from which to build the study. The rest of the literature was located through snowballing from the references in the papers found, sometimes multiple papers deep.
Our initial goal in finding literature was to understand Augmented Testing and the state of GUI testing. With the located information we found a gap in the subject matter and formulated our research so that it can be put in a broader perspective in terms of value. By learning about past trials and tribulations, we found where our plugin could provide a different approach.
2.2 Study
In the industry, one of the most commonly used techniques to handle GUI-interacting tests is manual testing [4, 27]. Manual techniques such as Exploratory Testing have long been an accepted approach for testing in the industry, according to Itkonen and Rautiainen [13]. A key advantage of manual techniques is finding new and unexpected errors [27].
But while manual testing is effective at finding novel faults, it is also susceptible to a few pitfalls. First and foremost, manual testing is a large expense in the budget [2, 4, 22, 27]: the technique has been known to take up to 40%, in some cases up to 50% or more, of a software project's budget [4, 22]. A second pitfall is that manual testing is susceptible to the human factor, which may make the testing error prone when it is perceived as tedious by the tester, or due to other human factors, such as tiredness, that may make a person inattentive [3, 4].
To address these concerns, the first generation of test automation was conceptualized: coordinate-based testing, which utilizes exact coordinates on the screen to interact with the SUT [2, 10]. One of the techniques that applies this concept is Record and Replay [5]. Traditionally, Record and Replay is a technique where a manual tester performs a test path while a tool records at which points on the screen the user interacted [2, 5, 10]. After a finished session, the recorded test can be replayed any number of times, which improves test frequency [2].
In the end, this method was found to be highly fragile to minor changes in the GUI, screen resolution, or even the size of the application window, and thus required a lot of costly maintenance [5, 10]. These weaknesses have led to the technique being mostly abandoned in practice as a stand-alone tool, according to Alégroth [2].
This flaw led to what is colloquially called the second generation of test automation, or structure-aware capture/replay, which is component/widget based and interacts with the SUT by hooking into its GUI libraries or toolkits [2, 24]. With this access, the second generation of test automation took the critical flaw of the first generation and attempted to solve it by recording what was clicked on instead of where the click was [17, 24].
While this model has advantages, such as a generally robust test execution environment and the ability to improve test execution time by forcing the SUT to bypass cosmetic timed events, the technique also comes with a few disadvantages [2]. Most notably, a system can only be tested if it is written in one of a subset of languages for which tools have been developed that can interpret and interact with it [2]. A point of concern is that this limitation applies not only to programming languages but also to custom components developed within the available languages [2]. Most tools have no issue parsing a language's default or commonly used libraries, for example the default AWT or Swing components if the application is written in Java [2]. If the SUT has custom components defined, there is an extra economic burden to write custom parsers for them, and to maintain those parsers if the components ever change in the future [2]. In addition, there is a form of dynamically generated GUI component that is difficult, or in some cases impossible, to test using this technique [2]. This is because such components are generated at runtime, which can make their properties completely unknown prior to execution [2].
Another flaw is the inability to evaluate how a user interface actually looks to the user [2]. While a script-based system can easily assert whether an element exists, it is difficult or impossible to assert how the element looks or whether it is even interactable by the user [2].
Finally, it has been shown that the technique is associated with script maintenance costs that can be significant, in some cases so large that they make the approach infeasible [11, 24]. One study by Grechanik et al. [11] notes that simple modifications to the GUI can result in 30% to 70% of test scripts changing. Another study by Memon and Soffa [17] notes that major changes to the GUI layout, such as a change in menu hierarchy or moving a widget from one window to another, may render in excess of 74% of test cases unusable.
This leads up to the third generation, the main topic of this thesis: Visual GUI Testing, more commonly known as VGT [2]. VGT had its inception in the early 90s with a tool called Triggers [21] and, a few years later in the late 90s, a tool called VisMap [30]. Unfortunately the technique suffered, as can be expected, from hardware limitations in the form of computational power, which made the very computationally heavy algorithms infeasible to use in practice [2].
But as time has gone on, the computational power available on the market has increased significantly and the algorithms used to perform the image recognition have been improved, which makes continued research in the field more feasible [2].
The core concept of VGT is, contrary to the second generation, to be able to interact with any application, regardless of what language it is written in or what system it runs on [3]. According to Alégroth et al. [3], this allows for a perceived higher flexibility and a certain level of robustness to some GUI changes that previous high-level GUI techniques have struggled with.
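The image recognition driving this concept can be illustrated with a minimal, naive template match: the tool holds a small reference image of a widget and scans a screen capture for the region that matches it, then derives the coordinate to click. The sketch below is a deliberately simplified stand-in (exact pixel equality on toy grids, with made-up helper names); real tools such as Sikuli or EyeAutomate use tolerance-based, fuzzy matching on actual screenshots.

```python
# Minimal sketch of the core VGT idea: locate a target image (a widget
# screenshot) inside a larger screen capture by exact template matching,
# then derive the screen coordinate to click. Real VGT tools use fuzzy,
# similarity-threshold matching; exact matching is used here for brevity.

def find_widget(screen, template):
    """Return (row, col) of the centre of `template` inside `screen`,
    or None if no exact match exists. Both are 2D lists of pixels."""
    sh, sw = len(screen), len(screen[0])
    th, tw = len(template), len(template[0])
    for top in range(sh - th + 1):
        for left in range(sw - tw + 1):
            if all(screen[top + r][left + c] == template[r][c]
                   for r in range(th) for c in range(tw)):
                return (top + th // 2, left + tw // 2)  # centre to click
    return None

# A 4x5 "screen capture" containing a 2x2 "widget" (pixel value 7).
screen = [
    [0, 0, 0, 0, 0],
    [0, 0, 7, 7, 0],
    [0, 0, 7, 7, 0],
    [0, 0, 0, 0, 0],
]
widget = [[7, 7],
          [7, 7]]

print(find_widget(screen, widget))  # -> (2, 3), the click target
```

Because only pixels are inspected, nothing about the SUT's language or toolkit is assumed, which is exactly what makes the approach system agnostic.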
Reviewing the perceived general status of VGT in practice and what the research says about the state of the available tools, we find that in a case study performed by Alégroth et al. [3] in 2015, 26 unique challenges, problems and limitations were identified in regards to VGT as a technique, its applicability in industry, and the tools used. Key among these were related to the tool Sikuli (https://sikulix.com/) [29], which was found to have image recognition volatility and signs of tool immaturity, along with a lack of integration with third-party software [3]. One study by Coppola et al. [7] notes that another VGT tool, EyeAutomate (https://eyeautomate.com/) [4] (previously known as JAutomate), has an API that allows some connection to third-party software, contrary to what was observed in the study that focused on Sikuli, which means that lack of third-party integration is not a general truth for all VGT software [3, 7].
The volatility is further examined in a comparative study by Börjesson and Feldt [6], in which two VGT tools, Sikuli and an unnamed tool with the pseudonym CommercialTool, were examined. The study observed volatility in both tools, but in different manners: Sikuli had only a 50% success rate in distinguishing the number 6 from the letter C, which CommercialTool handled flawlessly each time [6]. In contrast, in an experiment where the tool had to trace a call sign (a text string) across a multi-color radar screen, Sikuli had a 100% success rate while CommercialTool had a 0% success rate [6]. This indicates that the available tools still have a certain level of volatility and immaturity and require further development and empirical evaluation.
Moving on to contrasting development time between generations, a study performed by Dobslaw et al. [8] compared the second generation tool Selenium with EyeAutomate and observed that test implementation was significantly faster with the VGT tool: development time for EyeAutomate was ca. 20 working hours while the Selenium implementation took ca. 38 hours, close to double the total time [8]. While these time-efficiency results for EyeAutomate are impressive, this thesis aims to improve this time saving even further. In contrast, a study by Leotta et al. [16] compares Sikuli with Selenium WebDriver and observes that the DOM-based Selenium performs better in implementation time by a margin of 22% to 57%. From this we can assert that tool maturity has a large impact on the perceived and observed value of VGT. This also strengthens the necessity of our study, which will work with an R&R-like method instead of scripting.
In an evaluation of how much test cases need to be updated during the evolution of an application, Coppola et al. [7] found that a second generation testing tool, Appium (https://appium.io), only had to be updated 20% of the time, while EyeAutomate's tests had to be updated 30% of the time for the same application and versions. This indicates that VGT tests require more maintenance in the long run. Dobslaw et al. [8] agree, observing that maintenance costs were, on average, 32% higher for EyeAutomate than for the second generation tool Selenium. Furthermore, in the study by Leotta et al. [16] comparing Sikuli with Selenium WebDriver, the cost of maintaining the VGT-based scripts was often larger than for the Selenium-based test cases, but in a few instances the cost of maintaining the Selenium scripts was several times larger, so no clear conclusion can be drawn. In summary, depending on which tool, which application, and which changes are evaluated, VGT maintenance costs are likely to be higher, but this is not guaranteed under all circumstances.
In the review by Leotta et al. [16] it was also observed that applications with a stable GUI, but with a back-end that is significantly refactored between versions, give VGT techniques a strong advantage in maintenance metrics. This observation lends credence to each technique having its own advantages and disadvantages.
An observation made by Dobslaw et al. [8] was that EyeAutomate required less of a programming background to implement the tests than second generation tools such as Selenium, which touches on something very important: ease of use and lowering the barrier of entry to the field.
We intend to improve the ease of use of VGT with the aid of the research by Nass et al. [18] into Augmented Testing (AT), a technique that provides a layer between the tester and the SUT which superimposes relevant information and location data on top of the GUI. Key advantages of AT were noted to be that testers know what to test and what has been tested, in addition to requiring less manual work in total [18].
In another empirical study, Nass et al. [19] note that testing applications through the GUI is an important but time-consuming task in practice, and they attempt to mitigate its cost in effort and time by applying AT techniques in the form of the prototype tool Scout. Scout works on top of the second generation of component based testing. The evaluation in the study finds that Scout can be used to create equivalent test cases at a faster rate than two popular state-of-practice tools within the second generation [19].
As Scout is a second generation AT tool that already shows an improved rate of test generation compared to other state-of-practice tools, our theory is that bringing this time efficiency to the third generation will contribute to the field of research. Our study also intends to examine how much AT can lower the requirement of programming knowledge by foregoing scripts entirely with an R&R-like approach.
Reviewing the economic feasibility of VGT, Dobslaw et al. [8] observe that maintenance costs were, on average, 32% higher over the span of a year with weekly test runs for EyeAutomate when compared with Selenium (https://www.selenium.dev/). This figure, however, distracts from the fact that the main cost came from the initial test implementation, which accounted for 87% of the total cost of the entire test suite [8]. In total, with implementation and maintenance time included, EyeAutomate required 23 hours while Selenium required 44 hours [8]. The reason for the disparity is clearly the notable benefit, previously mentioned in this chapter, of being able to rapidly create test cases in the initial stage.
2.3 Summary
In summary, AT in combination with VGT requires academic examination of its feasibility to judge its value as a technique, as this has not been done before. The authors theorize that there will be gains in the applicability of VGT by opening the technique up to less technically experienced users through the use of AT.
The questions relating to test creation, test execution and test robustness have academic value in their empirical examination, as intended in this thesis, because these are common metrics of testing capability and this combination has not been evaluated before.
Visual GUI Testing is clearly still a research topic in its early examination phase on a more commercial scale, with a comparatively small number of available sources to compare and contrast against each other. We hope to contribute to this body of work in however minor a way our bachelor thesis can while walking amongst giants.
Chapter 3
Methodology
There were two major limiting factors in the choices made when conducting this research.
1. Resources: The resources available to the authors. Where possible, external resources and dependencies such as separate testing groups were reduced. This lowered dependence on external factors and the risk of the experiment failing to complete due to unfortunate circumstances.
2. Knowledge: Both authors have limited experience with automated tests, having only used Selenium, and have approximately the same level of experience with it. Another consideration was that learning how Scout operated had a learning curve of its own, since neither author had developed anything with the code base before, resulting in having to learn it from scratch.
Resource concerns in terms of time factored heavily into choosing a quasi-experiment. The alternatives, Case Study and Design Experiment, are discussed in Section 3.6. Additionally, we wanted to conduct this experiment with people, making it by definition a quasi-experiment [28]. Choosing ourselves as subjects was deemed adequate for addressing budget concerns and the goal of reducing external factors.
The choice to perform the test with three tools and techniques, namely Manual, EyeAutomate and JMScout, is based on which tools were available to the researchers. The requirement to provide a statistically significant number of tests had to be weighed against the ability to learn new tools and techniques within the limited time frame of the experiment. The comparison of the selected tools and techniques was judged to be sufficient.
Due to the lack of access to an industrial system, it was decided to run tests against an Open Source Software (OSS) application. OSS is often used as a proxy for industrial software in academic settings, which further motivated the choice. This also gave us the freedom to introduce mutants into the application as needed for the tests, as well as access to the entire unmodified code base.
This leads us to the study itself. Figure 3.1 presents the steps of the study as sequential phases, with their content put in context for the reader. These phases are further explained in Section 3.2.
[Figure 3.1 shows the four sequential phases of the study (Phase 1: Literature Analysis; Phase 2: Design of Study; Phase 3: Execution of Study; Phase 4: Analysis of Results) together with their contents: literature search, empirical strategy, variables, scope, design of JMScout, tools used, user stories, mutants, experiment method, descriptive and formal statistics, alternative approaches, and results.]
Figure 3.1: Methodology steps
3.1 Research questions
• RQ.1 What is the difference in effectiveness at finding GUI faults when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more effective at locating faults than the current state of the art and state of practice tools and techniques.
– Metrics: Effectiveness will be determined with mutation score, defined by Jia and Harman [14] as the ratio of the number of faults detected over the total number of seeded faults. In our case,
mutation score = x / y,
where x is the number of found mutations and y is the number of faults implemented. Only true positive results are used in this calculation. The statistical significance of the result will be examined with a Kruskal-Wallis H-test.
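To make the metric concrete, the small helper below (our own illustrative sketch, not part of JMScout) computes the mutation score for a hypothetical run where 18 of 21 seeded mutants were found:

```java
// Sketch: computing the mutation score as defined by Jia and Harman [14],
// i.e. the ratio of detected (true positive) faults to seeded faults.
public class MutationScore {

    // detected = number of true-positive mutant detections (x),
    // seeded   = number of faults implemented (y).
    static double score(int detected, int seeded) {
        if (seeded == 0) throw new IllegalArgumentException("no mutants seeded");
        return (double) detected / seeded;
    }

    public static void main(String[] args) {
        // Example: 18 of 21 seeded mutants detected, score approximately 0.857.
        System.out.println(score(18, 21));
    }
}
```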
• RQ.2 What is the difference in time efficiency in test development when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more time efficient in test development than the current state of the art and state of practice tools and techniques.
– Metrics: Development time will be compared between the following tools:
Manual vs JMScout and EyeAutomate vs JMScout. A Kruskal-Wallis H-test will be conducted to either reject or accept the null hypothesis.
• RQ.3 What is the difference in time efficiency in test execution when comparing JMScout to current state of the art and state of practice tools and techniques?
– Hypothesis: Scout with the newly written image recognition plugin will be more time efficient in test execution than the current state of the art and state of practice tools and techniques.
– Metrics: Execution time will be compared between the following tools:
Manual vs JMScout manual and EyeAutomate vs JMScout automated.
In this case a Wilcoxon Signed-Rank test will be conducted, and the null hypothesis will be rejected or accepted based on values gathered from the test, compared against a Wilcoxon Signed-Ranks table as seen in [23].
The motivation behind these research questions is to analyze and evaluate the developed plugin, and further to find support for VGT as a robust solution to the current state of automated testing. The expected outcome was that the tool would fare well against certain mutations, such as the complete removal of elements, while other mutations, such as slight modifications of the GUI, could be difficult for the VGT technique.
3.2 Phase 1 - Literature Analysis
3.2.1 Literature search
The search was performed primarily on the BTH Summon database, which connects a number of academic databases. Google Scholar was used as a secondary source. We used keywords and search terms related to the subject matter and then snowballed from the references according to the methods described by Wohlin et al. [28].
3.3 Phase 2 - Design of Study
3.3.1 Empirical strategy
Harris et al. [12] note that a major weakness of Quasi-Experiments is the lack of randomization, which we try to mitigate as explained in the following sections. The human factor, in conjunction with the lack of randomization, often results in confounding variables. Therefore a list of confounding variables (Table 3.1) was identified, and mitigations were planned and executed.
3.3.2 Variables
Dependent
Test case development time
The development time of a test case is a dependent variable for each tool and technique. Development time is measured for all tools and techniques when developing a new test case, as the time elapsed from the first interaction until a verified working test exists.
Test case execution time
The execution time is a dependent variable which depends on several factors, among them device performance, tester attentiveness and application performance. The execution times for the tools and techniques were measured, from the first interaction with the SUT until the test completed or failed, and compared in the analysis section of this thesis.
Mutants detected
The number of mutants detected per tool and technique will be measured and compared in the analysis section of this thesis.
Independent
Tools
The tools chosen are independent variables as they do not depend on any other factor. The tools chosen are Manual testing as a technique, EyeAutomate as a third generation testing tool and the developed plugin, JMScout.
Confounding
Hardware specifications
The hardware of the authors is a confounding variable, as it is used to execute the tests: test execution may vary due to the different hardware configurations between the authors.
The specifications are as follows:
Author 1: Intel Core i5-10210U @ 1.60 GHz for the CPU with 8 GiB of DDR4 RAM running on Windows 10 Home.
Author 2: AMD Ryzen 3600 @ 3.6 GHz for the CPU with 16 GiB DDR4 RAM running on Windows 10 Education.
Below, in Table 3.1, is a compiled collection of confounding variables predicted, and their respective mitigations and treatments applied to lessen or mitigate the impact on the study.
Name | Description | Effect | Mitigation
Knowledge bias | Difference in knowledge level with regard to the tools used. | Development time may vary due to familiarity. | Both testers have approximately the same knowledge level of all tools.
Learning bias | Gradual familiarity with user stories and tools while preparing and performing tests. | Development and execution of tests will vary. | The ordering of test development was randomized, with a number 1-3 each corresponding to a tool.
Fault knowledge bias | Bias while looking for faults due to knowledge of where faults are introduced. | Time and efficiency in finding faults will vary in an undesirable way. | Faults are introduced through user stories. Neither tester knows what faults have been introduced by the other tester, as per Sections 3.4.1 and 3.4.2.
Selection bias | Tester selects user story or mutation due to preference. | Undesirable bias in test creation and bug introduction. | User stories and mutations to apply were randomly assigned, as per Sections 3.4.1 and 3.4.2.
Performance impact | The performance of the tester's workstation may be impacted if an update, or similar, is started. | The testing time may be skewed due to outside factors. | The tester takes precautions to make sure that no system updates, or similar, are required before testing.
Human factor | Frame of mind or human error might affect results. | Difference in test development time. | The tester only tests when they are in a relatively normal frame of mind.
Table 3.1: Confounding variables
3.3.3 Scope
A total of 21 test cases were developed; multiplied by the number of tools and techniques used, a total of 63 tests were written. In execution, the tests were run a total of 315 times. This number is the product of a few factors: Manual was run once per user story (21), JMScout was run manually twice per user story (42) and then automatically six times (126), and finally EyeAutomate was also run six times per user story (126).
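The totals above follow from simple arithmetic; the sketch below (purely illustrative, our own code) reproduces them:

```java
// Illustrative check of the scope arithmetic described above.
public class ScopeCounts {
    static final int USER_STORIES = 21;

    // One test case per user story and tool/technique.
    static int testsWritten() {
        return USER_STORIES * 3; // Manual, EyeAutomate, JMScout
    }

    // Manual once, JMScout manually twice and automatically six times,
    // and EyeAutomate six times, all per user story.
    static int totalExecutions() {
        return USER_STORIES * (1 + 2 + 6 + 6);
    }

    public static void main(String[] args) {
        // prints: 63 tests, 315 executions
        System.out.println(testsWritten() + " tests, " + totalExecutions() + " executions");
    }
}
```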
3.3.4 Design of JMScout
[Figure 3.2 shows Scout's plugin architecture, with JMScout replacing the SeleniumPlugin among Scout's plugins.]
Figure 3.2: Replacing Selenium Plugin with JMScout
The goal of the study is to provide a proof-of-concept plugin that examines the feasibility of the combination of AT and VGT. With this in mind, several design decisions were made, which we present in the following sections.
Functionality
With the intention to mimic user interactions with applications, functionality to specify actions had to be implemented in the plugin. Due to resource constraints, not all actions, such as click-and-drag, could be implemented in JMScout. The prioritized list of actions was chosen based on the need to perform user-like test scenarios in the experiment. The list of actions includes:
• Left mouse button click
• Right mouse button click
• Double left mouse button click
• Type text
• Check area for match
With the ability to left click, right click and double click, a large set of common GUI actions can be performed. The added ability to type text enables, among other things, logging in to accounts and typing in registration details.
The final functionality, checking an area for a match, is the plugin's way to assert that the expected result exists without taking any action. This is useful, for example, when it has to be asserted that an emblem still exists and looks the same on a web page, or that a submitted form, created with the other functionalities in the plugin, returns the same-looking output.
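In code, the action set can be thought of as a small enumeration, where the Check variant only asserts rather than drives the application. The sketch below is our own illustration; the identifiers are hypothetical, not JMScout's real API:

```java
// Hypothetical model of JMScout's action set; names are illustrative only.
public class WidgetAction {
    enum Action { LEFT_CLICK, RIGHT_CLICK, DOUBLE_LEFT_CLICK, TYPE_TEXT, CHECK }

    // A Check only verifies that the expected image is present; every other
    // action actively drives the SUT.
    static boolean drivesSut(Action a) {
        return a != Action.CHECK;
    }

    public static void main(String[] args) {
        for (Action a : Action.values()) {
            System.out.println(a + (drivesSut(a) ? " drives the SUT" : " only asserts"));
        }
    }
}
```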
In addition to the basic functionalities listed, an integration was made with existing Scout functionalities that provides an augmented experience such as the state tree graph and widget markers seen in Fig. 3.3.
Each outline represents a Widget, which is represented in code as a class containing widget metadata, such as the action to perform or the path to an image. The image, for example, is used for matching with Image Recognition. The listed actions are mapped to re-bindable keybindings and allow the designation of the action for the widget to be inserted.
Figure 3.3: JMScout testing Rachota
Once a specific action or check has been designated as the action to perform, a widget size can be specified and the widget finally inserted, resulting in a colored square around the widget on the augmented GUI in Scout. The tree-like graph in the top right corner of Fig. 3.3 is Scout's state tree. Each node in this tree represents an appstate class, in essence a testing "step" in code, and inserting an action widget results in another node being added to this graph. This class is meant to represent a state, and it contains paths (branches) containing widgets. This is also what the developed plugin traverses when performing widgets with their specified action.
Note: in Fig. 3.3, Checks are represented with a yellow outline while Actions are represented with a blue outline. If a check is located while running a scenario, it turns green instead of yellow. If a widget at any point cannot be located, it turns grey, and the tester can click on it to see the image it was looking for. In that case, the tester is also given the option to repair the test case by performing the selection again through a simple and user-friendly interaction.
These widgets and their associated actions can be performed either manually or automatically. To manually perform an action on a widget, you simply left click it. To automatically execute all actions in a test suite, a re-bindable hotkey is used, by default CTRL+R.
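The automatic execution described above can be sketched as a traversal of a path of widgets, each located via image recognition before its action is performed. The types below are our own stand-ins, not Scout's real classes:

```java
// Illustrative sketch (our own types, not Scout's real classes) of how a
// test path of widgets could be executed automatically: each widget is
// located via image recognition, then its action would be performed.
import java.util.List;

public class PathRunner {
    enum Action { LEFT_CLICK, RIGHT_CLICK, DOUBLE_LEFT_CLICK, TYPE_TEXT, CHECK }

    static class Widget {
        final String imagePath;
        final Action action;
        Widget(String imagePath, Action action) { this.imagePath = imagePath; this.action = action; }
    }

    // Stand-in for the Eye image recognition call.
    interface Recognizer { boolean locate(String imagePath); }

    // Runs the path; returns the index of the first widget that could not be
    // located (it would be drawn grey in the GUI), or -1 on full success.
    static int run(List<Widget> path, Recognizer eye) {
        for (int i = 0; i < path.size(); i++) {
            if (!eye.locate(path.get(i).imagePath)) return i; // widget goes grey, repair offered
            // ...perform path.get(i).action here (click, type, or check)...
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Widget> path = List.of(
            new Widget("img/new_task_button.png", Action.LEFT_CLICK),
            new Widget("img/task_name_field.png", Action.TYPE_TEXT),
            new Widget("img/task_created_label.png", Action.CHECK));
        // Pretend recognizer: everything except the final label is found.
        System.out.println(run(path, p -> !p.contains("label"))); // prints 2
    }
}
```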
Design decisions
Eye
As the Scout application itself does not natively carry the capability to perform image recognition, a suitable lightweight Image Recognition tool had to be selected. The choice was the Eye tool (https://store.synteda.se/eu/product/eye-for-researchers-and-students/), which is also used in a related tool of Scout called EyeSel [15]. Eye provides great functionality relative to its perceived ease of use and a range of ready-to-use functionalities that aid research development speed. Eye also provides three different recognition modes to increase the certainty of locating the images searched for: EXACT, which provides as close to an exact match of the searched-for image as possible; COLOR, which is the same as EXACT but with a certain tolerance to color variance; and TOLERANT, the most tolerant of the algorithms, which ignores color matching almost completely and instead matches according to edges found in the sought image.
Default image size
Throughout the development of the plugin, different image sizes for comparison were used. In the end, 150 by 150 pixels was judged appropriate, as 100 by 100 pixel images were sometimes found to be too small for the image recognition to notice contrasts. The choice of size was a careful weighing, as larger sizes start to suffer performance issues.
Image Match Percent
During evaluation and testing, match percentages between drastically different, but similarly sized, sections of text on the same background turned out to be concerningly high. Small images, like the default 150 by 150 pixel images used by the application, frequently displayed match percentages in excess of 90% when looking for a matching image, and under certain circumstances as high as 95% or slightly more. Due to these findings, the default and recommended configuration of the developed software only accepts 100% matching images. Only accepting 100% matching images does make the matching more fragile, but ensures that what is matched is what is being sought.
Configurability
Due to a variety of factors, we could never be certain that our choices of default image size and default image matching percentage would suit every individual project best. As a result, we made the most essential options configurable. Most notable among these are:
• Retries to find a widget before it gives up and times out.
• The default image size on widget creation.
• Minimum image matching percentage.
• Keybindings to use for interacting with most application functionality.
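For illustration, such options could live in a simple properties file; the key names below are hypothetical and do not reflect JMScout's actual configuration identifiers:

```properties
# Hypothetical JMScout configuration; key names are illustrative only.
widget.locate.retries=3
widget.default.width=150
widget.default.height=150
image.match.minPercent=100
keybinding.runAll=CTRL+R
```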
Menu click
During the early stages of testing, a major issue with using Augmented Testing was clicking on menus: any open system menu immediately closes the second the application window loses focus, which happens when the user clicks on the JMScout window. To mitigate this, we had to create a special action to handle clicking on menus, colloquially called MENU_ACTION. This, in addition to the ability to "freeze" the image displayed in the main application window by holding CTRL+SHIFT, mitigated the issue almost completely, even though the solution is somewhat forced.
Retries with different modes
During the design phase of JMScout, we examined Eye's ability to execute in different recognition modes and concluded that while EXACT was generally the optimal algorithm with respect to both performance and the elements we wanted to match, running the other two modes when EXACT failed provided a level of robustness in locating images. For this reason, we implemented functionality with several match attempts, leading to a cascading fail-safe approach.
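The cascading fail-safe can be sketched as follows. Eye's real API differs; the Matcher interface here is our own stand-in:

```java
// Sketch of the cascading fail-safe matching described above. The Eye
// library's real API differs; Matcher here is a hypothetical stand-in.
public class CascadingMatch {
    enum Mode { EXACT, COLOR, TOLERANT }

    interface Matcher { boolean match(Mode mode, String imagePath); }

    // Try EXACT first (fastest, most precise); fall back to COLOR, then
    // TOLERANT. Returns the mode that succeeded, or null if all failed.
    static Mode locate(Matcher eye, String imagePath) {
        for (Mode mode : new Mode[]{Mode.EXACT, Mode.COLOR, Mode.TOLERANT}) {
            if (eye.match(mode, imagePath)) return mode;
        }
        return null;
    }

    public static void main(String[] args) {
        // Pretend matcher: only the edge-based TOLERANT mode finds the image.
        System.out.println(locate((m, p) -> m == Mode.TOLERANT, "img/button.png")); // prints TOLERANT
    }
}
```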
Repairing widgets
During experimentation with the implemented functionality, testing was often halted by a widget going grey, that is, being unable to be located. This led us to initially display the image the plugin was looking for, which helped the tester understand, with a visual check, why it could not be located. Later in the implementation, errors started appearing in the middle of test paths, which was severely undesirable, since any node after the erroneous node would also be deleted upon node deletion. This led to the implementation of the Repair feature, which allows the tester to substitute any ailing node. The tester can also switch the type of an action node, for example from left click to double click.
Key Listeners
Another issue noticed during development is that key listeners do not update correctly if a key is released while the main window does not have focus. This initially led to some confusion as to why CTRL no longer needed to be held down to perform some actions. The initial solution was to immediately request focus back to the main application window, but this had to be scrapped due to concerns raised in conjunction with the menu click situation. A final solution was found by utilizing the JNativeHook package (https://github.com/kwhat/jnativehook) for implementing global listeners, which resolved the issue neatly by not requiring focus requests and by being constantly aware of whether a key is pressed or released, no matter which application has focus. This listener is only used to listen for the CTRL and SHIFT press and release events.
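The benefit of a global hook is that modifier state can never go stale. The sketch below shows only the state-tracking logic (our own illustrative class); with JNativeHook, the nativeKeyPressed/nativeKeyReleased callbacks would call press()/release(), and the library wiring itself is omitted here:

```java
// Sketch of the modifier-state tracking that a global key hook enables.
// With JNativeHook, key press/release callbacks fire regardless of which
// window has focus; they would invoke press()/release() below.
public class ModifierState {
    private boolean ctrlDown, shiftDown;

    void press(String key)   { if (key.equals("CTRL")) ctrlDown = true;  if (key.equals("SHIFT")) shiftDown = true; }
    void release(String key) { if (key.equals("CTRL")) ctrlDown = false; if (key.equals("SHIFT")) shiftDown = false; }

    // "Freeze" the displayed image only while CTRL and SHIFT are both held.
    boolean freezeActive() { return ctrlDown && shiftDown; }

    public static void main(String[] args) {
        ModifierState s = new ModifierState();
        s.press("CTRL");
        s.press("SHIFT");
        System.out.println(s.freezeActive()); // prints true
        // A release seen by the global hook, even while another window has
        // focus, keeps the state from going stale.
        s.release("SHIFT");
        System.out.println(s.freezeActive()); // prints false
    }
}
```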
Architecture overview
[Figure 3.4 shows the plugin architecture: Scout hosts JMScout, which uses the Eye library for image recognition and consists of a state controller, a component manager, an action handler and a draw handler.]
Figure 3.4: Plugin Architecture
To develop JMScout, we depend on two external libraries: the Eye module, which provides image recognition, and Scout itself, the main application that the plugin is written for. The Scout application JAR file is a black box with API functions that can be accessed. JMScout was developed using these limited functions to invoke UI functionalities and store data.
Action handler
This component handles the interactions that the user initiates. It also executes the widgets’ specified action.
Draw handler
Used to draw custom sized widgets for easier user interaction.
State controller
The state controller is used to detect whether the states in Scout have changed and whether any new checks need to be performed. The state controller is also invoked when adding a new Widget.
Component manager
The component manager assures that each application widget is located in the image displayed on Scout's user interface and then drawn on the user interface.
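The division of responsibilities among the four components can be sketched as small interfaces collaborating in an update cycle. The interfaces and method names below are our own illustration, not JMScout's real code:

```java
// Illustrative decomposition of the four plugin components named above.
// Interfaces and method names are hypothetical, not JMScout's real code.
import java.util.List;

public class PluginComponents {
    interface StateController { boolean stateChanged(); }          // detects Scout state changes
    interface ComponentManager { List<String> locateWidgets(); }   // finds widgets in the displayed image
    interface DrawHandler { void draw(String widgetId); }          // draws widget outlines on the GUI
    interface ActionHandler { void perform(String widgetId); }     // executes the widget's action

    // One update cycle: on a state change, re-locate the widgets and redraw
    // them; returns how many widgets were drawn.
    static int refresh(StateController sc, ComponentManager cm, DrawHandler dh) {
        if (!sc.stateChanged()) return 0;
        List<String> widgets = cm.locateWidgets();
        widgets.forEach(dh::draw);
        return widgets.size();
    }

    public static void main(String[] args) {
        int drawn = refresh(() -> true, () -> List.of("ok_button", "name_field"), id -> {});
        System.out.println(drawn); // prints 2
    }
}
```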
Experiment subject
The experiment was conducted with an application written in Java called Rachota.
The selection of this specific application was based on three key criteria. First, its open source nature and license allow legal modification and the introduction of artificial defects (mutants) into the application for testing purposes. Second, Rachota contains a myriad of different user interface elements, making it representative of most common applications on the wider market and hence a suitable candidate for optimizing the experiment's external validity given the project's resource constraints.
Lastly, Rachota has been used in previous academic research, for instance by Alégroth et al. [1], for executing tests against, further supporting its validity as a test subject. That study and a study by Souza [26] also served as guidance when specifying what metrics our test subject had to comply with in order to be an appropriate subject.
Version: 2.4 | Windows: 10 | Classes: 53 | Lines of code: 14034
Table 3.2: Rachota metadata
Since the experiment subject has 10 windows and roughly 14k LOC, the experiment can only be generalized to other smaller applications. This limitation was planned for, given the allocated resources available.
3.3.5 Tools used
To be able to communicate and plan in an age of COVID-19 restrictions, and to perform the experiment and develop JMScout, a number of different tools were required.
First and foremost is Discord (https://www.discord.com), a communication platform with text chat, voice and video chat, and screen sharing capabilities. The tool has been key in communication between the authors, as they have been unable to meet due to the active restrictions.
Second is the web-based tool Notion (https://www.notion.so), an all-in-one tool with task boards with statuses and labels, tables for structuring literature to read, note-keeping and much more.
For application development, IntelliJ IDEA (https://www.jetbrains.com/idea/), with a license obtained through the GitHub Student pack, was used, since IntelliJ IDEA specializes in Java-based languages. On the same note, version control was handled through Git via GitHub (https://www.github.com).
Finally, for bug introduction, Apache NetBeans was used, as Rachota was originally composed using a form builder. The reason for this choice was that the authors could not get the form builder for IntelliJ to work with Rachota's source files.
3.4 Phase 3 - Execution of Study
3.4.1 User Stories
There were 21 user stories determined for use in the study. Rachota was manually explored to determine its common use cases, and from these what was deemed to be common tasks for a day planner.
During this familiarization it was noted that the application has certain functionalities that generate content outside of the application itself, such as a summarized report of time spent in an HTML file; these were unanimously excluded from consideration for inclusion in a user story.
After becoming familiar with the application's functionalities, user stories were determined with the aim of getting a good mix of complexity and number of interactions. The minimum number of interactions was set to 5, and no specific maximum was set, though it was assumed to be around 20. Simple interactions were defined as making a Check or clicking a button. Complex interactions were defined as, amongst others, typing text or making a Menu Action interaction.
An example of an included user story reads like this: 5. As a user, I want to be able to create a task and then delete it. A full list of user stories can be found in Table A.1.
3.4.2 Mutants
Mutations are according to Offutt et al [20] when the tester creates test data that causes a set amount of faults. The mutation testing technique serves two goals: it provides a test adequacy criterion and leads to detection of faults. Each mutant were introduced through mutant operators in a random, but unique, order. Specifically in this case the ADD, CHANGE and REMOVE. Mutants from the list 3.3 was used to introduce one mutant per user story. When each Mutant has been introduced once, then mutants will be randomly selected to be introduced once again. Therefore some