
Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Science and Engineering

2020 | LIU-IDA/LITH-EX-A--20/071--SE

Examining maintenance cost of automated GUI tests

An empirical study of how test script design affects the maintenance of automated visual GUI tests

En empirisk undersökning av hur testskriptdesign påverkar underhåll av automatiserade visuella grafiska användargränssnittstester

Elin Petersén

Supervisor: John Tinnerholm
Examiner: Lena Buffoni


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

GUI testing is expensive to perform manually. Software systems involving a heterogeneous set of components exclude the applicability of specific GUI testing techniques. Visual GUI Testing (VGT) is a test automation technique that combines image recognition with scripts, and it applies to almost any GUI-driven application. VGT has been shown to be cost-effective in comparison with manual testing, but it is still expensive to maintain. This study investigates whether test script design following specific guidelines positively affects maintenance costs.

A case study was conducted to identify best practices for VGT w.r.t. maintenance time. Four VGT versions were developed for each manual test case: two design versions, with and without guidelines, for each of the two VGT tools EyeAutomate and Sikuli. Data was collected using time measurements, observations, and interviews.

Results highlighted differences in initial development time and maintenance time between the two design versions. In total, 44 observations were collected: 17 related to the design versions, 17 to the VGT tools, and 10 to VGT in general, initial development, and the system under test. The interviews collected the participants' perceptions of VGT in general, of maintaining the different VGT versions, and of the guidelines.

In conclusion, the combination of the guidelines did not have a positive effect on maintenance in terms of cost and experience. However, some of the individual guidelines did. A rationale for why the guidelines did not give the desired result was identified. Future research is necessary to investigate other combinations of guidelines, such as those identified as beneficial.


Acknowledgments

Firstly, I would like to thank my supervisor John Tinnerholm for his assistance and guidance during this project. I would also like to thank my examiner Lena Buffoni for her advice and feedback. I would like to express my gratitude to Emil Alégroth, who has published many papers in the area of VGT, holds a doctoral degree with a focus on automated software testing, and was one of the founders of the VGT concept. Thank you, Emil, for your commitment, consulting, and help during the project.

I would like to thank Saab AB and everyone involved who helped me make this study possible. Thank you to all the people who participated in the study and made the data collection possible. To my colleagues and all the people who helped me at Saab AB, I would like to express my gratitude for your support and help in solving technical problems. Thank you, Raimund Hocke, who is responsible for the support and maintenance of the VGT tool Sikuli, for your quick help in solving an issue that made it possible to use the tool in the context of the project.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Abbreviations

1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Delimitations
   1.5 Thesis outline

2 Theory
   2.1 Software Testing
   2.2 Automated Software Testing
   2.3 GUI Testing
   2.4 Automated GUI-based software testing techniques and tools
   2.5 Visual GUI Testing Tools
   2.6 Software maintainability
   2.7 Maintenance of automated tests
   2.8 Software maintenance guidelines
   2.9 Empirical studies in the context of software engineering
   2.10 Related Work

3 Method
   3.1 Pre-Study
   3.2 Best practices and guidelines
   3.3 Case study

4 Results
   4.1 Development time

   5.3 The work in a wider context

6 Conclusion and Future Work
   6.1 RQ1: What coding guidelines related to maintainability of software and testware can be used to design automated visual GUI tests?
   6.2 RQ2: How do the maintenance costs differ between visual GUI tests developed with and without maintainability in focus?
   6.3 Future Work

Bibliography

A Interview guide
   A.1 Pre-study
   A.2 Case study

B Coding guidelines
   B.1 Version 1: 1-1-mapping
   B.2 Version 2: Design for maintainability

C Time measurements

D Observations
   D.1 VGT-tool
   D.2 VGT
   D.3 Initial development
   D.4 SUT


List of Figures

2.1 V-shaped life cycle model, often referred to as the V-model, for software development
2.2 System design of Sikuli
2.3 Components for template matching
2.4 How template matching identifies the matching area
2.5 System design of EyeAutomate
2.6 An example of automated test steps in EyeStudio
2.7 An example of automated test steps in Sikuli IDE
2.8 Relation between product properties (from the SIG quality model) and sub-characteristics of maintainability (defined by ISO 25010)
3.1 The communication between the applications
3.2 The different automated test case versions developed for each manual test case. DV is an abbreviation for design version.
4.1 Initial development time
4.2 Maintenance time
4.3 Comparison of manual execution time and total development times for the maintained test cases


List of Tables

2.1 Summary of properties related to the three generations of GUI testing techniques and tools
2.2 The tolerance for changes of different properties for the pixel- and vector-based image recognition
2.3 Example test case of LiU's website. Each test step contains an identification number (ID), description, and expected result.
2.4 The relation between product properties and maintainability sub-characteristics
2.5 The impact of different factors of empirical strategies
3.1 Summary of the interview participants: which project the participants belong to, and their working length and role in the project
3.2 Comparison of two different GUI testing tools, one from the second and one from the third generation, used in project B
3.3 The participants in the case study
3.4 Attributes connected to the possible test cases
3.5 The selected manual test cases
3.6 The initial development order of the automated test cases
3.7 The maintenance order for the participants
4.1 An overview of the observations
4.2 Interview results regarding advantages and disadvantages with maintaining VGT tests designed with and without guidelines


List of Abbreviations

AGT Automated GUI Testing

CAD Computer Aided Design

DV1 Design Version 1

DV2 Design Version 2

GUI Graphical User Interface

IDE Integrated Development Environment

OpenCV Open Source Computer Vision Library

ROI Return On Investment

SIG Software Improvement Group

SUT System Under Test

TC Test Case


1 Introduction

Test automation aims to improve software quality and provides benefits such as test reusability and repeatability [20]. It can increase test coverage and reduce costs, for instance the effort of test executions [49]. Limitations of test automation are often associated with high costs of automation setup, tool selection, training, and maintenance [49].

GUI systems are challenging to test due to their complexity and special characteristics [33]. Different techniques have been developed for system and acceptance tests that capture GUI components in scripts. The capture is made either on a GUI component level, using direct references to the components, or on a bitmap level, using coordinates of the component locations. The technique that relies on exact screen coordinates is not robust to platform changes [14, 56] and has been shown to be less cost-effective than manual testing [56]. The technique that relies on references to GUI components is not always applicable [5] and requires high initial costs [16]. Another approach has been developed using computer vision [14]. This approach allows the use of images to specify the GUI components to interact with and the visual feedback to be observed, and it is called visual GUI testing (VGT) [5]. VGT is flexible since it works on almost any kind of GUI-driven platform, independent of programming language or operating system [3]. Several studies have investigated the cost of introducing VGT into existing software projects [3, 4, 13], and there are studies that investigate its maintenance cost [2, 3, 7]. VGT can decrease testing costs and help to raise quality compared to manual testing [3, 7, 13]. The approach has been shown to have positive effects, but it also has challenges [4], such as high maintenance costs [1, 2, 7] and robustness problems [4, 5], and it has been shown to be unpredictable due to its uncertain image recognition outcome [3, 4]. Despite these problems, the technology has been perceived as cost-effective compared to manual testing.

Technical debt is the negative effect on software caused by previous software decisions [30]. Developing software without any consideration of design or maintainability can result in technical debt, which has a negative impact on maintenance cost [30]. Research has shown that technical debt in software also applies to testware (software developed explicitly for automated testing) [6, 9].

Maintenance of testware often becomes a major burden for testware projects, which has been shown to be connected to poorly designed testware architectures that do not consider maintainability issues [11]. Maintenance tends to have a greater impact on the total cost of testing than the initial implementation of automated tests [11]. Studies have suggested that testware projects should apply code engineering practices and patterns, as for software, for the projects to become successful and to ensure the quality of the test code [11, 20].

1.1 Motivation

This master's thesis project was produced in collaboration with the company Saab AB in Linköping, Sweden. Saab produces world-leading products, services, and solutions for military defense and civil security on the global market [42]. The work presented in this thesis belongs to the business area Aeronautics and involves software for aircraft. The software provides a graphical user interface (GUI) that is based on 3D CAD modeling. It contains several applications that interact with each other, built on multiple programming languages. Some of the applications are third-party, where the source code is not available. The GUI of these applications needs to be tested constantly, since new functions are continuously integrated. These tests are today performed manually and require a lot of time, especially during regression testing, since the number of tests is large. Many of the manual tests are repetitive, long, and/or complex. Due to limited time in the project sprints, the GUI tests cannot be run as often as desired. Therefore, there is a need to automate the tests. A previous attempt at automating the tests was made within the Saab project studied in this thesis, but not with VGT. In that attempt, a GUI testing tool using references to GUI components was applied. Several different tools had to be combined to cover whole test cases, due to the mix of programming languages and the third-party applications. The attempt required a large amount of code, learning, development, and maintenance time. The costs became too high in comparison with manual testing, and the project therefore went back to manual testing. VGT seems to be a more suitable technique for the project at the company and is thus interesting to introduce and investigate.

The support for automated GUI testing (AGT) has existed for several decades. Industrial case studies from the last decade have shown benefits and cost-effectiveness of AGT [2, 3, 4, 7, 13]. Despite this evidence of benefits, AGT is only used on a small scale [34]. Studying only the direct cost of introducing test automation is not enough to determine the cost-effectiveness of the technique, since other factors affect the cost, such as maintenance cost over time [1, 2, 7] and problems, limitations, and challenges [4]. Previous studies [2, 3, 4, 7, 13, 16] have not investigated how the test scripts were written during development; they only focused on whether the development worked, no matter how the scripts were written.

Having code standards is the norm for modern software development, but it is not as common for testing. Investing in design, such as code reuse and structure, when developing testware can result in lower maintenance costs [11, 30]. Research has shown that this also applies to GUI testing [9], but (to the author's knowledge) there is a lack of studies that investigate how this applies to VGT. This master's thesis explores the importance of investing in testware design when developing and maintaining automated visual GUI tests and how this affects maintenance costs.

1.2 Aim

The purpose of this study is to investigate whether cost (represented as time) can be reduced by developing well-designed visual GUI tests, to help developers decide whether VGT is worth applying to their projects. This study investigates whether VGT is cost-effective to introduce and maintain compared to manual testing in an existing project at Saab. There is a lack of studies (to the author's knowledge) that include the impact of testware design investment on these costs; this study aims to fill this gap.

1.3 Research questions

The following research questions will be investigated in this thesis.

RQ1 What coding guidelines related to maintainability of software and testware can be used to design automated visual GUI tests?

RQ2 How do the maintenance costs differ between visual GUI tests developed with and without maintainability in focus?

1.4 Delimitations

This study is delimited to studying the maintainability aspects of visual GUI testing scripts. Although the results may be generalizable to other automated GUI or non-GUI test techniques, this work only focuses on automated tests driven by image recognition technology. The VGT tools used are limited to two, EyeAutomate and Sikuli, even though the results may be generalizable to other VGT tools. The use of these VGT tools is also delimited to their integrated development environments (IDEs) with their default languages, although the results may be generalizable to other IDEs and languages. Additionally, the study is limited to large-scale and safety-critical software systems. Hence, although the results may be generalizable to other domains, no evidence is presented to support such a claim. The tests produced in this work involve software for applications used by Saab AB and contain sensitive information. The source code and details associated with the test cases are therefore not included in this thesis.

1.5 Thesis outline

The rest of this thesis is structured as follows:

• Chapter 2 describes the relevant theory and terminology. It focuses on different levels of testing and maintenance. It also covers related work.

• Chapter 3 describes how the study was carried out. It contains a pre-study, a description of how the guidelines were constructed, and a case study.

• Chapter 4 presents the results generated from the case study.

• Chapter 5 discusses the results, method, and work in a wider context.

• Chapter 6 draws conclusions and connects the findings to the research questions. It also presents directions for future work.


2 Theory

This chapter describes different levels of testing: software testing in general (Section 2.1), automated software testing (Section 2.2), and GUI testing (Section 2.3). Automated GUI-based software testing techniques and tools are introduced in Section 2.4, with a focus on visual GUI testing tools in Section 2.5.

Various aspects of maintenance are also described in this chapter. Section 2.6 focuses on software maintainability, Section 2.7 on maintenance of automated tests, and Section 2.8 on software maintenance guidelines.

Section 2.9 addresses empirical studies in the context of software engineering. Related work is described in Section 2.10.

2.1 Software Testing

Software testing is an important activity to ensure software quality [60]. Software testing is expensive and can account for over 50 percent of the total development cost [24, 38, 44].

Software testing and test case design are usually performed at four different levels [15]:

• Unit: the smallest executable part of the source code, e.g. a method or function [61]. Unit testing focuses on testing individual subprograms and smaller building blocks rather than the system as a whole [44]. At this level of testing, tests are created for the low-level design to verify that low-level requirements are met [43], as shown in Figure 2.1.

• Integration: individual units are put together into groups, components, or subsystems [15]. These are finally assembled into systems [15]. A system represents all software included in the product delivered to the end-users [15]. At this level of testing, tests are created for the high-level design (shown in Figure 2.1) with a focus on system design and architecture [43].

• Acceptance: requirements and customer needs are tested, usually by or together with the end-users [44].

Figure 2.1: V-shaped life cycle model, often referred to as the V-model, for software development. The life cycle starts with specifying requirements and then continues along a sequential path of process executions. At the bottom of the figure, the implementation of the software takes place. Source [43]

Regression testing is a process used to validate updated software and detect whether errors have been introduced into previously tested software [21, 25, 38]. The need for regression testing increases in software projects as software becomes more complex with shorter development cycles [17]. Regression testing is applied continuously and at several levels in most development organizations [17]. It is an expensive process and can account for up to 80% of the total testing cost [21, 25]. The process involves several issues. One issue is to identify and fix the test cases affected by the software update [25]. Another issue is the selection of test cases to rerun, since rerunning all test cases can waste test resources and be unnecessary [25]; not all test cases exercise new or modified functionality. If it is unnecessary to rerun all available test cases, it can be useful to prioritize them, e.g. those that exercise new or modified functionality and have the greatest error detection ability.
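As a minimal illustration of such a prioritization (the test cases, modules, and fault counts below are hypothetical and not taken from the studied project), test cases can be ordered so that those exercising modified functionality and with the best fault-detection history are run first:

```python
# Illustrative sketch: prioritize regression test cases (hypothetical data).
test_cases = [
    {"id": "TC1", "modules": {"login"},   "faults_found": 2},
    {"id": "TC2", "modules": {"search"},  "faults_found": 5},
    {"id": "TC3", "modules": {"billing"}, "faults_found": 0},
]
modified_modules = {"search", "billing"}   # functionality changed in the new release

def priority(tc):
    exercises_change = bool(tc["modules"] & modified_modules)
    # Sort key: changed functionality first, then fault-detection history.
    return (exercises_change, tc["faults_found"])

for tc in sorted(test_cases, key=priority, reverse=True):
    print(tc["id"])
# Expected order: TC2, TC3, TC1
```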

2.2 Automated Software Testing

Test automation aims to reduce human interaction with the system under test (SUT) during testing and to increase test frequency [22]. It provides benefits such as repeatability and efficient test execution [20]. Test automation helps shorten test phases and release cycles, and it helps to detect defects earlier [11]. It also has drawbacks: script development is error-prone, tedious, and requires a significant initial investment [20]. After the initial development, the scripts require maintenance and must be adapted to changes in the SUT. Testware has several functional difficulties [20]. One of them is to determine whether the SUT has a fault when a test case fails. Another difficulty is to make sure the test suite detects faults in the SUT.

Study [20] summarizes findings from multiple software test code engineering projects with a focus on high-quality automated test scripts. The study proposes applying guidelines and patterns to ensure the quality of the test code. In addition, developers perceive the test code quality as low when test patterns are not applied.

Return On Investment (ROI), or the so-called break-even point [50], is the saving (return) realized as a result of a one-time expenditure (investment) [10]. ROI can be used to analyze an investment's performance or to decide whether a potential investment should be made [10]. Return and investment can be quantified on scales that use the same unit, usually money or time [10]. Common metrics used to calculate the cost of automated or manual test cases are the costs of specifying, developing, and executing the test cases [50].
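As an illustration (this formulation is not taken from [10, 50]; the cost symbols are introduced here only for the example), ROI and the break-even point for test automation can be sketched as:

```latex
% Illustrative only: C_dev = development cost, C_maint = maintenance cost,
% C_exec = cost of one execution, N = number of test executions.
\[
  \mathrm{ROI} = \frac{\text{return} - \text{investment}}{\text{investment}}
\]
\[
  \text{break-even: the smallest } N \text{ such that }\;
  C_{\mathrm{dev}} + C_{\mathrm{maint}} + N \cdot C_{\mathrm{exec}}^{\mathrm{auto}}
  \;\le\; N \cdot C_{\mathrm{exec}}^{\mathrm{manual}}
\]
```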

2.3 GUI Testing

A graphical user interface (GUI) provides a graphical front-end to a software system. It accepts events in the form of inputs from the user or the system and, based on these events, generates graphical outputs. GUIs consist of widgets, e.g. text fields and buttons, with a fixed set of properties [41]. These properties contain values which make up the GUI's state and may change during execution [41].
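As a minimal sketch (the widget names and properties below are illustrative, not taken from [41]), a GUI state can be represented as the current property values of its widgets:

```python
# Illustrative sketch: a GUI state as widgets with property values.
from dataclasses import dataclass, field

@dataclass
class Widget:
    kind: str                              # e.g. "button", "text field"
    properties: dict = field(default_factory=dict)

# Two widgets; their property values make up (part of) the GUI state.
search_field = Widget("text field", {"text": "", "enabled": True})
ok_button    = Widget("button",     {"label": "OK", "visible": True})

# An event (user input) changes property values, i.e. the GUI state.
search_field.properties["text"] = "computer science"
```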

Special characteristics for GUI systems [33]:

• Large input space. GUI system inputs are user-triggered sequences of events. These events often result in a large number of permutations, depending on how complicated the GUI system is. This requires a large number of GUI states that must be tested.

• Hidden dependencies and synchronization. Dependencies and synchronizations between objects often involve different windows (connections between values in the different windows). It can be difficult to find and test all these connections.

• Many ways in and out. GUIs can provide multiple ways to interact with an application, e.g. by using keyboard shortcuts, button clicks, menu options, or clicks on other windows. This may require testing the same feature multiple times with different inputs. This is often called combinatorial testing, where the challenge is to cover all possible combinations of actions and state changes, as illustrated in the sketch after this list.
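As a small illustration (the interaction alternatives below are hypothetical), the number of event sequences for even one small feature grows multiplicatively:

```python
# Illustrative sketch: the input space grows combinatorially when the same
# feature can be reached through several kinds of interaction.
from itertools import product

open_dialog   = ["menu option", "toolbar button", "keyboard shortcut"]
fill_in_field = ["typing", "paste", "drag and drop"]
confirm       = ["OK button", "Enter key"]

sequences = list(product(open_dialog, fill_in_field, confirm))
print(len(sequences))   # 3 * 3 * 2 = 18 event sequences for one small feature
```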

GUIs are a common way of interacting with software systems since they simplify the use of the software. Traditional software testing techniques and tools, e.g. traditional coverage criteria and test oracles, are not well suited for testing GUIs, since GUIs have different characteristics [39]. Some examples of such characteristics were described above: large input space, hidden dependencies, and many ways in and out. In addition, the source code for GUI elements or events is not always available, since they may be instances of pre-compiled elements stored in a library or at a higher level of abstraction than the code [39]. GUIs therefore require a different technique, which traditional software techniques do not support [38]. Before the regression testing phase, new test cases are developed and older test cases are changed to cover testing of the modified software. Covering all changes in the older test cases can be difficult, and there is a risk that parts will be missed. Regression testing of GUIs can be challenging since changes in the GUI layout, such as button placements, affect the input-output mapping and thus previous test cases and test oracles. The output expected by the test oracle may become outdated and cause difficulties in verifying whether the GUI executes correctly [38]. Using images as test oracles may further complicate the verification process, since it can be difficult to determine which images are suitable and how to capture appropriate images for the oracles.


2.4 Automated GUI-based software testing techniques and tools

The support for automation using GUI interaction and image recognition has existed since the early 90s. Potter developed the tool Triggers, which applies pattern matching to the pixels on the computer screen to access data through keyboard macros [48]. Other early work, by Zettlemoyer and St. Amant, introduced the tool VisMap [65], which runs image processing algorithms over the contents of the screen and passes the results to a controller program using mouse and keyboard actions. These early works focused on automation in general with image recognition algorithms, not on automated testing.

2.4.1 The three generations

Alégroth et al. divide automated GUI-based test techniques and tools into three generations [5]. The generations differ in how they interact with the SUT, as presented in Table 2.1: through exact screen coordinates, GUI components, or image recognition [5].

The first generation extracts exact screen coordinates that are recorded during manual testing and then replayed in combination with scripts to automate the tests. The tools used in this type of testing are called traditional capture/replay tools [22]. It is easy to implement test cases with this technique [31, 58]; however, the technique is costly to maintain due to its sensitivity to GUI changes and its dependency on screen resolution [5, 22, 31, 58]. The smallest change in the GUI implies that the bitmaps used for the comparison must be replaced [58]. It is therefore rarely used in practice. The initial development time for the first-generation technique is often lower, but it requires more maintenance time than the second generation [31]. The cost (measured in time) of maintaining test cases over a small number of releases has been shown to quickly become lower for the second-generation approach compared to the first generation [31].

The second generation accesses the system under test directly through its GUI components, via access to GUI libraries or tools. The tools used in this type of testing are often referred to as modern capture/replay tools [22]. This approach is more robust against changes to the system under test and has high test execution performance [5]. It supports recording, GUI ripping, and automated test case generation. However, the approach restricts the tests to certain programming languages and GUI libraries or tools. It is not possible to execute tests outside or between different applications where the components are not available. The tools of the second generation also do not verify that the pictorial GUI fulfills the visual requirements of the system under test [5]. These tools do not interact with the SUT in the same way as a human, which can lead to incorrect test results compared to manual testing. Examples of second-generation tools are Selenium and GUITAR [45].

The third generation is called visual GUI testing (VGT). VGT combines image recognition with scripts to automate GUI tests. The interaction is performed through keyboard and mouse events, which makes VGT applicable to almost any GUI-driven application. Another advantage is its close resemblance to manual interaction with the GUI; VGT is therefore a more appropriate approach for emulating manual testing. However, in comparison with the tools of the second generation, VGT is less robust and has a longer test execution time [5]. The third generation and its tools are further described in Section 2.5.
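The difference in how the three generations address the SUT can be sketched as follows (an illustration only: the coordinates, page, and image name are hypothetical, and Selenium is used merely as one example of a component-based tool):

```python
# Illustrative contrast between the three generations (requires pyautogui,
# selenium and a running GUI/browser to actually execute).

def first_generation_click():
    # 1st generation: exact screen coordinates, fragile to layout changes.
    import pyautogui
    pyautogui.click(x=412, y=300)

def second_generation_click():
    # 2nd generation: a reference to the GUI component (here via Selenium).
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    driver = webdriver.Firefox()
    driver.get("https://liu.se/en")
    driver.find_element(By.LINK_TEXT, "Education").click()
    driver.quit()

def third_generation_click():
    # 3rd generation (VGT): image recognition; in a Sikuli script this is simply
    #     click("education_menu_item.png")
    pass
```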


Table 2.1: Summary of properties related to the three generations of GUI testing techniques and tools.

                | First generation | Second generation | Third generation
SUT interaction | Exact screen coordinates | GUI components | Image recognition
Advantages      | Easy and fast to develop | Robust to changes in the SUT, fast test execution | Applicable to any GUI-driven application, verifies requirements of the pictorial GUI, resembles human behavior
Disadvantages   | Sensitive to changes in the SUT and screen resolution | Restricts the tests to certain programming languages, libraries/tools, and applications | Slow test execution, has robustness problems

2.5 Visual GUI Testing Tools

VGT tools make it possible to write scripts that use images to specify which GUI components to interact with and for visual assertions [14]. Examples of VGT tools are Sikuli2, EyeAutomate3 (previously known as JAutomate), EggPlant [3], and Squish. This section focuses specifically on Sikuli and EyeAutomate, since they are used in this master's thesis. These two tools were chosen because they offer free use and their use has proven to be cost-effective compared to manual testing in several studies [2, 3, 4, 13]. Two tools (and not just one) were chosen since they differ in structure and syntax, which can affect the ability to design test scripts and which coding guidelines are appropriate. Sections 2.5.1 and 2.5.2 describe how Sikuli and EyeAutomate work. Both Sikuli and EyeAutomate also support visual character recognition from the screen using the Tesseract OCR engine.

2.5.1 Sikuli

Sikuli was created in 2009 as an open-source research project [55, 64]. In 2012, the development and support of Sikuli were taken over by Raimund Hocke, who renamed the project SikuliX [55].

Sikuli is a Java application and works on Windows, Linux, and macOS. How Sikuli works is described in Figure 2.2. A Sikuli script can be edited and executed in the Sikuli IDE, which also provides image capturing. A Sikuli script (.sikuli) contains a Python (.py) source file and image files (.png). The core of Sikuli consists of a Java library containing java.awt.Robot and a C++ engine. java.awt.Robot controls keyboard and mouse events. The C++ engine is based on OpenCV (Open Source Computer Vision Library), which provides image pattern searching on the screen. On top of the Java library, a Jython layer is provided for the end-users. This means that it is also possible to add layers for other languages running on Java, such as JRuby and JavaScript [27].

2 http://sikulix.com/

3 https://eyeautomate.com/eyeautomate/


Figure 2.2: System design of Sikuli. Source [27]

Image recognition

Sikuli uses an image recognition technique called template matching. Template matching identifies an area in an image that matches a template image (patch). Two components are necessary for the search, presented in Figure 2.3. OpenCV performs template matching by comparing the patch against the source image at each pixel location (starting from the upper left corner), as illustrated in Figure 2.4.

Six different matching methods are available [28, 59]: the square difference matching method (CV::TM_SQDIFF), the normalized square difference matching method (CV::TM_SQDIFF_NORMED), the correlation matching method (CV::TM_CCORR), the normalized cross-correlation matching method (CV::TM_CCORR_NORMED), the correlation coefficient matching method (CV::TM_CCOEFF), and the normalized correlation coefficient matching method (CV::TM_CCOEFF_NORMED). The square difference methods use the squared difference during the matching, the correlation methods use a multiplicative match, and the correlation coefficient methods perform matching between the template and the source image relative to their means [28]. When these methods have been applied, the best match (the one with the best match value) is chosen [28], since there may be several matches.

The score value for each pixel varies between 0.0 and 1.0. The probability of a match increases with a higher score value, where 1.0 indicates an exact match. By default, only score values greater than 0.7 count as a match, since this signals a high probability of a match, but the threshold can be changed [54].


Figure 2.3: Components for template matching. Source [59]

Figure 2.4: How template matching identifies the matching area. The template image is compared against the source image at every location (x,y). Source [59]
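A minimal sketch of this kind of template matching, written directly against OpenCV's Python bindings rather than through Sikuli (the image file names are hypothetical), could look as follows:

```python
# Illustrative sketch of template matching with OpenCV (cv2).
import cv2

screen = cv2.imread("screenshot.png")    # source image (e.g. the whole screen)
patch  = cv2.imread("ok_button.png")     # template image of the target widget

# Normalized correlation coefficient matching; scores close to 1.0 mean a better match.
result = cv2.matchTemplate(screen, patch, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_top_left = cv2.minMaxLoc(result)

if best_score > 0.7:                     # mirrors Sikuli's default minimum similarity
    h, w = patch.shape[:2]
    center = (best_top_left[0] + w // 2, best_top_left[1] + h // 2)
    print("Match at", center, "with score", best_score)
else:
    print("No match found")
```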

2.5.2 EyeAutomate

JAutomate was released in 2012 and was renamed EyeAutomate in 2016. How EyeAutomate works is described in Figure 2.5. EyeAutomate runs on Java and supports techniques from all three generations of GUI testing (presented in Section 2.4.1) [18]. EyeAutomate is built around a library called the Eye (Eye2 for later versions), which contains the image recognition algorithm. The Eye supports image recognition based on both pixels and vectors; it is possible to choose one of them or both. EyeAutomate scripts can be edited and executed in the IDE EyeStudio. EyeStudio provides, just like the Sikuli IDE, image capturing. An EyeAutomate script contains four different file types: scripts (.txt), images (.png), data (.csv), and widgets (.wid). In the script, the template image needs to have a target location/area specified. The target location is at the center of the image by default [18].


Figure 2.5: System design of EyeAutomate.

Image recognition

The Eye returns the best match (the match with the highest matching score), since there may be several matches. Two image recognition modes are supported: tolerant and fast [18]. The tolerant mode compares images by contrast and is used by default. The fast mode compares images by colors. The fast mode is faster than the tolerant mode but less reliable, since it makes the algorithm more sensitive to the actual color of a widget, which can change depending on the widget state [8]. The sensitivity (for the target area) of the pixel image recognition is set to 95% by default and can be modified [18]. This value determines how accurate a candidate must be to count as a match, where 100% indicates the best possible match. A comparison of pixel- and vector-based image recognition is shown in Table 2.2. The major difference between the two is that vector-based recognition is much less sensitive to zoom and rendering details [18]. The mode Either is selected by default and requires either a pixel or a vector match [18]. This mode is used during the thesis project.

Table 2.2: The tolerance for changes of different properties for the pixel- and vector-based image recognition. If a property is tolerant, it is marked with the symbol X. That a property is intolerant does not mean that it is not supported at all. Source: [18]

Property   | Pixel recognition | Vector recognition
Location   | X | X
Resolution | X | X
Color      | X | X
Shape      |   | X
Size/Zoom  |   | X
Font       |   | X


Features

An EyeAutomate script can be divided into Steps [18], which can be used to represent manual test steps and/or for code indentation. A CSV (comma-separated values) file can be used to pass parameters to the script, which enables data-driven testing [18].

EyeAutomate scripts support manual steps, since some parts can be hard to automate. A manual step is executed by showing a dialog containing the test instruction, a result field for comments, and pass/fail buttons. A manual recovery mode can be enabled, which requires a user to perform a failed automated test step manually instead of failing the test.

Additionally, encrypted texts can be inserted and automatically decrypted during the execution of an EyeAutomate script. EyeStudio generates test summary reports of the executed tests. The reports contain information such as test status (passed/failed), execution time, execution date, and screenshots for failed commands.

2.5.3 Test case example

An example of manual test steps for a test case is described in Table 2.3. Screenshots of how these manual test steps can be automated in EyeStudio and the Sikuli IDE are shown in Figures 2.6 and 2.7.

Table 2.3: Example test case of LiU's website. Each test step contains an identification number (ID), description, and expected result.

ID | Description | Expected result
1 | Visit the website https://liu.se/en | LiU's logotype is shown
2 | Hover over LiU's logotype | The tooltip Home is shown
3 |  | The menu items Education, Research, and Collaboration are shown
4 | Click on the menu item Education | The item should be selected. This should be shown with a blue/turquoise underline. The title Why study at Linköping University? should be shown.
5 | Click on the item Programmes & courses | The item should be selected. This should be shown with bold text and a black underline.
6 | Search for the programme computer science in the search field under the title What can you study at LiU? | An object with a search symbol followed by the text Computer Science, Master's programme, 120 credits should be shown. The object should have a light blue color.
7 | Hover over the search object Computer Science, Master's programme, 120 credits | The object should be selected. This should be shown with a darker blue background color and black underlined text.

Figure 2.6: An example of automated test steps in EyeStudio.

Figure 2.7: An example of automated test steps in Sikuli IDE.
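As the screenshots in Figures 2.6 and 2.7 cannot be reproduced here, the following sketch indicates how some of the test steps in Table 2.3 could look in a Sikuli (Jython) script; the .png file names are hypothetical screenshots captured in the IDE, and the script assumes a browser already opened at https://liu.se/en (step 1):

```python
# Illustrative Sikuli (Jython) sketch of test steps 1, 2 and 4 in Table 2.3.
# Runs inside the Sikuli IDE, where wait(), hover(), exists() and click()
# are built-in functions; the images are screenshots of the GUI components.

wait("liu_logotype.png", 10)             # step 1 check: LiU's logotype is shown
hover("liu_logotype.png")                # step 2: hover over the logotype
assert exists("tooltip_home.png")        # expected: the tooltip Home is shown

click("menu_item_education.png")         # step 4: click the menu item Education
assert exists("education_selected.png")  # expected: the item is selected (underlined)
```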


2.6 Software maintainability

Maintainability is important since it can have a significant business impact [62]. Efficient maintenance implies less time spent on maintenance per developer and lower maintenance costs. This allows more focus on other tasks, such as the implementation of new functionality and quality improvements, and it enables shorter release cycles for new products. Maintainability can be greatly improved by following simple guidelines [62], and it needs to be addressed from the beginning of a development project [62].

The software maintenance guidelines presented in Section 2.8 are based on the relations between the sub-characteristics and product properties described in Sections 2.6.1, 2.6.2, and 2.6.3. Section 2.6.4 describes the software maintenance types.

2.6.1 Sub-characteristics of maintainability

The international standard ISO/IEC 25010:2011 (ISO 25010) defines maintainability as one of eight characteristics of software quality [57]. Maintainability describes how easy it is to modify a system [62]. ISO 25010 divides maintainability into five sub-characteristics [57]:

• Modifiability: the efficiency with which software can be modified without introducing defects or degrading the existing quality.

• Reusability: the degree to which assets can be used in multiple systems or help create other assets.

• Analyzability: the efficiency of assessing the impact of a system change, diagnosing the cause of a failure, or identifying the parts to be modified.

• Modularity: the degree to which the system consists of discrete components such that changes in one component have minimal impact on other components.

• Testability: the efficiency with which test criteria can be established for the software and with which tests can be executed to examine whether these criteria are met.

2.6.2 Software product properties

The sub-characteristics of maintainability are connected to the following software product properties [61]:

• Duplication: identical fragments of source code that appear more than once in a software product.

• Unit complexity: the degree of complexity of the source code units.

• Unit size: the number of lines of code of a source code unit.

• Unit interfacing: the interface size (number of interface parameter declarations) of source code units.

• Module coupling: the number of incoming dependencies of the source code modules. A module is a group of related units.

• Component balance: a rating of the number of top-level components and of the component size uniformity. Component size uniformity is a rating of the size distribution of these top-level components. A component corresponds to a group of related modules. A top-level component is the first subdivision of modules into components.

• Component independence: a rating of the percentage of code in modules with no incoming dependencies from modules in other top-level components.

• Volume: the total size of the source code of the software product.


2.6.3 Relationships between sub-characteristics of maintainability and software product properties

The relationships between the sub-characteristics and the product properties are shown in Figure 2.8. The reasons why they are related are described in Table 2.4.

Figure 2.8: Relation between product properties (from the SIG quality model) and sub-characteristics of maintainability (defined by ISO 25010). Source [36, 61]


Table 2.4: The relation between product properties and maintainability sub-characteristics. A negative impact is denoted with a - symbol. A positive impact is denoted with a + symbol. Source [61]

Property | Impact | Characteristic | Reason
Duplication | - | Modifiability | Changes in duplicated code have to be made in different places.
 | | Analyzability | Parts are more time-consuming or difficult to analyze when the same parts occur multiple times in different places.
Unit size | - | Modifiability | Parts are more time-consuming or difficult to modify when there is a low number of large units instead of a high number of small units.
 | | Reusability | Large units often contain multiple functionalities, which makes it harder to reuse each sub-function.
Unit complexity | - | Modifiability | It is more time-consuming or difficult to modify units when they are complex.
 | | Testability | A complex unit requires more test cases to validate modifications of the unit.
Unit interfacing | - | Reusability | Units with many parameters are more difficult to instantiate, since each parameter requires more information about the context and the expected parameter values in order to reuse the unit.
Module coupling | - | Modifiability | Changes to modules frequently used in other parts require changes to the code fragments using these modules.
 | | Modularity | Changes to modules frequently used in other parts require changes to the code fragments using these modules.
Component balance | + | Analyzability | Balanced components contain more related functionality, which is easier to analyze.
 | | Modularity | Balanced components, or a small number of components, are an indicator of high modularity.
Component independence | + | Testability | Components with fewer connections are easier to test in isolation.
 | | Modularity | Components with fewer connections are considered more independent and easier to change separately.
Volume | - | Analyzability | Parts are more time-consuming or difficult to analyze when a large volume of code must be taken into account.
 | | Testability | A larger system requires more code to be tested.


2.6.4 Software maintenance types

Software maintenance can be divided into four types: corrective maintenance, adaptive maintenance, perfective maintenance [35], and preventive maintenance [62]. Corrective maintenance consists of modifications made due to bug fixes. Adaptive maintenance is coupled to changes in the environment, e.g. technology upgrades. Perfective maintenance is related to changed requirements of the system and often requires the most effort [35]. Changes made to increase quality are called preventive maintenance [62].

2.7 Maintenance of automated tests

Maintenance of testware is necessary due to modifications of the SUT, e.g. new releases. As described in Section 2.6.4, maintenance is usually required due to system requirement changes (perfective maintenance), which in turn require modification of the automated tests. It can be difficult to determine in advance which tests, or parts of tests, will be affected by SUT changes. This often results in broken tests that are not detected until the test execution fails [20].

In addition to the maintenance costs, costs for analyzing the test results are also incurred. As described in Section 2.2, if a test case fails, an analysis is required to determine whether the SUT has a fault. If the SUT does not have a fault, the test result is a false positive. A false positive is not a real bug; it can be a consequence of modifications made to the SUT [40] (broken tests) or of flakiness in the test. Flaky tests are tests that have non-deterministic outcomes [37]. If the SUT has not changed, an unmodified test is expected to always pass or always fail; if this does not happen, the test is considered flaky. Flakiness can be caused by, e.g., lack of synchronization [37] or failed image recognition [4]. Synchronization problems occur when the runtime of system calls varies from time to time [37]. A common reason for failed image recognition is that rendering on the screen can differ between runs or systems [1].
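A common way to reduce such synchronization-related flakiness in VGT scripts (a sketch, not a technique prescribed by the cited studies) is to poll for the expected screen state with a timeout instead of using fixed delays:

```python
# Illustrative sketch: wait for an expected screen state with a timeout
# instead of a fixed sleep, to reduce synchronization-related flakiness.
import time

def wait_for(image_found, timeout_s=10.0, poll_s=0.5):
    """image_found: callable returning True when the expected image is on screen
    (e.g. a wrapper around the VGT tool's image recognition)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if image_found():
            return True
        time.sleep(poll_s)
    return False   # let the test fail with a clear synchronization message

# Usage sketch: wait_for(lambda: screen_contains("result_table.png"))
```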

In the study [11], observations and experiences from several testware projects are summarized. One of the experiences from the study is that, depending on the number of release cycles of the SUT, the maintenance of automated tests tends to have a greater impact on the total cost of testing than the initial implementation of the automated tests. Another experience is that the cost-effectiveness of an automated test grows with the number of times the test is executed. Additionally, the maintenance of testware often becomes a major burden when the testware architecture is poorly designed without considering maintainability issues. A common mistake in test automation projects, according to the study, is that the opportunity to reuse testware is often missed. Reimplementation of identical code parts can lead to increased maintenance costs if the SUT changes, since maintenance must then be done in all these places. This requires more effort than having the common functionality gathered in one place, and it increases the risk of mistakes where modification of one or more of the duplicated parts is forgotten. Another pitfall reported in the study is poorly structured testware; it is suggested that tests should be structured to avoid duplication. In larger projects where many developers modify the automated tests, they tend to reimplement testware components if the components are not easy to locate, understand, and use. Test automation projects have problems similar to those of software development projects. The study suggests that they should apply software test code engineering practices to be successful and to decrease maintenance costs.


2.8 Software maintenance guidelines

Joost Visser et al. [62] describe the 10 most important guidelines for building maintainable software, based on experience. The first eight of these guidelines are derived from the Software Improvement Group (SIG) evaluation criteria for trusted product maintainability. The guidelines are highly language-independent and are presented in Sections 2.8.1-2.8.10. These 10 guidelines are based on the relationships between sub-characteristics of maintainability and software product properties described in Section 2.6.3.

Rules regarding naming (e.g. of classes, methods, and variables), extracted from coding guidelines at Saab, are presented in Section 2.8.11. These rules were created by several experienced developers at Saab, each with at least 5 years of experience in software development, based on their own experience of what they perceive as important for maintainable software. The naming rules are important to improve the understandability of software. The goal of the guidelines, what they are trying to achieve, is described in Section 2.8.12.

The guidelines presented in Sections 2.8.1-2.8.11 are used as a basis for the creation of the guidelines for VGT used in this thesis project. How the guidelines presented below are used is further described and motivated in the method chapter (Section 3.2).

2.8.1 Write short units of code

Short units are easier to reuse, analyze, and test than long units. Long units often mix many responsibilities, which makes them harder to test. A single responsibility is a piece of functionality provided by the software: it corresponds to a single indivisible task and does only one thing, for example getting data from a database, extracting a value from a text string, or verifying a property of a widget. A shorter unit does not accommodate that many responsibilities. Long units usually contain more specialized functionality, which makes reuse more difficult and less suitable. Code reuse helps to reduce the total code volume. Shorter units also mean less code to analyze and therefore less time to read, which makes them easier to analyze.

Units should not be longer than 15 lines of code, according to the SIG rating of unit size (described in [62]). If units become larger than 15 lines, they should be divided into several smaller units.
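Applied to a VGT script, the guideline could look as in the following sketch (the helper names and image files are hypothetical; the low-level actions are stubs standing in for VGT tool commands):

```python
# Illustrative sketch: a long test unit split into short, single-purpose units.
# The low-level actions are stubs standing in for VGT tool commands.

def click(image):    print("click", image)
def type_text(text): print("type", text)
def wait_for(image): print("wait for", image)

def open_start_page():
    click("browser_address_bar.png")
    type_text("https://liu.se/en\n")
    wait_for("liu_logotype.png")

def search_for_programme(term):
    click("search_field.png")
    type_text(term + "\n")

def verify_result_shown(result_image):
    wait_for(result_image)

def test_search_programme():
    # Each step is its own short unit, which keeps every unit well below
    # the 15-line limit and makes the steps reusable in other test cases.
    open_start_page()
    search_for_programme("computer science")
    verify_result_shown("masters_programme_entry.png")

test_search_programme()
```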

2.8.2 Write simple units of code

Units with fewer decision points are easier to analyze and test. Simple units are easier to understand and modify than units with high complexity.

Branch/decision points are statements that take different directions in the code depending on certain conditions; if and switch statements are examples of branch points. The more decision points a unit contains, the more complex it becomes and the more difficult it is to test. For example, if a unit contains eight isolated paths (control flow branches), it requires eight test cases to cover them. By decreasing the number of branch points, maintainability increases, since units become easier to test and modify.

The number of branch points in a unit should not exceed four. This is based on unit complexity as rated by SIG, which in turn is based on cyclomatic (McCabe) complexity (described in [62]). Divide complex units into simpler ones.
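A sketch of how branch points can be reduced in a test helper (hypothetical names; click() is a stub standing in for a VGT image click):

```python
# Illustrative sketch: lowering the number of branch points in a unit.
def click(image):
    print("click", image)

# Before: three if/elif decision points plus the error case in one unit.
def open_view_branching(view):
    if view == "map":
        click("map_tab.png")
    elif view == "model":
        click("model_tab.png")
    elif view == "settings":
        click("settings_tab.png")
    else:
        raise ValueError(view)

# After: the mapping is plain data and the unit has a single decision point.
VIEW_TABS = {"map": "map_tab.png", "model": "model_tab.png", "settings": "settings_tab.png"}

def open_view(view):
    if view not in VIEW_TABS:
        raise ValueError(view)
    click(VIEW_TABS[view])

open_view("model")   # prints: click model_tab.png
```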


2.8.3 Write code once

Code duplication should be avoided. Modifications in duplicated code must be made in all copies, including repeated bug fixes if the duplicated code contains a bug. This is inefficient and error-prone. Instead, write reusable and generic code.

Joost Visser et al. define duplicated or cloned code as identical code that is at least six lines long (exactly the same), since experience has shown that this gives a good balance in the number of clones identified. Whitespace and comments are not counted. This kind of identical code is called a type 1 clone. Two syntactically identical code fragments that differ, e.g., in identifier names and literals are called type 2 clones and are not considered code duplication in this guideline, because they are more difficult to detect than type 1 clones.

Reusing code by copy and paste is not a valid way to reuse code, since it does not avoid the bug duplication problem. Instead, reuse should be achieved by putting the duplicated code in a separate unit.
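A sketch of the guideline applied to testware (hypothetical image names; the low-level actions are stubs): the duplicated login steps are written once and reused:

```python
# Illustrative sketch: duplicated steps extracted into one reusable unit.
def click(image):    print("click", image)
def type_text(text): print("type", text)

# Shared unit: written once, reused by every test case that needs a login.
def log_in(user, password):
    click("username_field.png")
    type_text(user)
    click("password_field.png")
    type_text(password)
    click("login_button.png")

def test_open_project():
    log_in("tester", "secret")       # no copy-pasted login steps
    click("open_project_button.png")

def test_export_report():
    log_in("tester", "secret")
    click("export_button.png")

test_open_project()
```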

2.8.4 Keep unit interfaces small

Units with fewer parameters are easier to reuse and test. A large interface is usually not the root problem; it is a code smell that indicates a deeper maintainability problem. The root problem is rather that units with many parameters tend to have multiple responsibilities and higher complexity, which makes them harder to modify. Lowering the number of parameters improves maintainability, since units become easier to understand and reuse.

Do not use more than four parameters per unit. According to the SIG rating of unit interface size (described in [62]), quality increases as the number of parameters decreases, and it decreases greatly when more than four parameters are used. The number of parameters can be decreased by passing them through objects.
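A sketch of how a unit interface can be kept small by grouping related parameters into an object (the names are illustrative):

```python
# Illustrative sketch: shrinking a unit interface by grouping parameters.
from dataclasses import dataclass

# Before: six parameters make the unit hard to understand and reuse.
# def create_test_user(name, role, email, language, project, access_level): ...

# After: related parameters are grouped into one object.
@dataclass
class TestUser:
    name: str
    role: str
    email: str
    language: str = "en"

def create_test_user(user: TestUser, project: str, access_level: int = 1):
    print(f"creating {user.name} ({user.role}) in {project}, level {access_level}")

create_test_user(TestUser("Kim", "tester", "kim@example.com"), "project-b")
```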

2.8.5 Separate concerns in units

Keep units short and simple to avoid tight coupling between them. Loosely coupled units make the system more modular and easier to modify. They allow maintenance of isolated parts of the code, minimizing the risk of further modifications elsewhere. Loosely coupled units also improve analyzability, since units become easier to understand when they only have one task. Assign responsibilities in order to separate units, and hide implementation details behind interfaces. A unit should only have one responsibility, also known as the single responsibility principle. Split up units that have several responsibilities.

2.8.6 Couple architecture components loosely

Avoid tight coupling between levels (top and lower levels) of components. Loosely coupled system components make the system more modular and easier to modify. This guideline follows the same principles as the one described in Section 2.8.5, but applied at a higher architectural level. Minimize the amount of code within modules that receive calls from modules in other components. Independent components help to achieve isolated maintenance.

2.8.7 Keep architecture components balanced

Keep the levels of components balanced. The architecture should not have too many or too few components, and the components should be of uniform size. This makes the system more modular and simpler to maintain. The balance can be quantified with a coefficient which, in this context, is a measure of the inequality in the distribution of source code volume between the components. A number close to nine has been shown to give the best result for this coefficient, which means high equality.

2.8.8 Keep your codebase small

A smaller codebase is easier to maintain, since less code needs to be analyzed, changed, and tested. Avoid unnecessary codebase growth and reduce the codebase size where possible.

2.8.9 Automate development pipeline and tests

Automated tests should be implemented for the codebase. Automated tests enable feedback on the effectiveness of modifications. The automated tests should be started automatically and not by hand, since starting many test cases manually and repeatedly is time-consuming, and this time can be saved through automation. Automate everything that is supposed to be executed repeatedly.

2.8.10 Write clean code

Artifacts such as TODOs and dead code in the codebase decrease productivity for new team members and result in less efficient maintenance. Dead code is code that is never executed or whose output is not used anywhere.

Do not leave code smells behind when development is finished. Code smells are patterns in code that indicate a problem. Introducing, or not removing, this kind of pattern is bad practice.

When the development work is done, make sure to remove bad patterns in the code, such as the following: code smells, dead code, bad comments, commented-out code, long identifier (item) names, magic constants (number/literal values without a clear definition of their meaning), and badly handled exceptions (for example, failing to catch exceptions).

2.8.11

Use descriptive and uniform names

Always use descriptive names, e.g. for units, functions, files, classes, methods, and variables. For example, prefer the name CreationDate over CDat, CrDate, CreateDat, or similar abbreviations.

2.8.12

Summary

The objective of the guidelines presented in Sections 2.8.1-2.8.11 is to increase code cohesion, readability, and understandability. This makes the code easier to modify and build on, for existing as well as new team members, which in turn simplifies innovation and increases code quality. The guidelines also strive to minimize technical debt, i.e. having to pay back (in time) quick and poor design choices made earlier. Following them helps to avoid repeating identical code changes, fixing the same error in several places, and changes that trigger further modifications elsewhere. Together, these positive effects should result in a reduction of maintenance time per developer.


2.9

Empirical studies in the context of software engineering

Software engineering is governed by the behavior of the developers who perform it, and finding general rules or laws in software engineering can therefore be a challenge [63]. Empirical studies are beneficial when evaluating processes, human-based activities, software products, and tools [63].

Strategies in empirical studies include formal experimental setups (experiments), studying real projects in industry (case studies), and performing surveys, e.g. interviews. A survey is a system for gathering information from people in order to describe, explain, or compare their knowledge, behavior, and attitudes [19]. Surveys are performed before or after a technique or tool has been introduced, and the information is primarily collected through interviews or questionnaires [63]. Case studies are classified as observational studies and experiments as controlled studies [63]. Experiments help to evaluate human-based activities in a systematic, quantifiable, and controlled manner.

2.9.1

Interviews

Interviews are a technique for collecting qualitative data [52]. Interviews can be divided into three types: structured, semi-structured, and unstructured [51, 52, 63]. The interviewer should open the interview with a short explanation of the study [51, 52, 63]; otherwise, participants may not fully engage or may leave important information out because they do not understand the objectives of the study. How the data will be used should also be explained to the participants [63]. A guide should be used during the interviews, containing a list of questions and notes on how to direct the questions for various cases or topics [51, 52]. The interview should begin with background questions, e.g. about the subjects, followed by the main questions, which should account for the largest part of the interview [63].

In a structured interview, the interviewer asks the questions and the interviewee controls the responses [52]. The questions must be asked in the order they are planned [51, 63]. The interviewer has specific goals for the expected results of the interview, and the questions are specific [52]. In an extreme case, all answers can be quantified, e.g. as yes or no.

Semi-structured interviews are a combination of structured and unstructured interviews [51, 52, 63]. Both specific and open questions are asked, and the questions do not have to be presented in the order they were planned. This technique collects both expected and unexpected types of information, and improvisation and exploration are allowed during the interview. Semi-structured interviews are common in case studies [51, 52, 63].

In an unstructured interview, the interviewee controls both the questions and the responses [52]. This type of interview is suitable when the research scope is broad, or when the interviewer does not know much about the research area and the questions therefore must be open [51]. In an extreme case, the interviewer asks no questions at all and only mentions topics to open up discussions or elaborations from the interviewee [52].

2.9.2

Observations

Observations are usually performed to investigate how software engineers conduct certain tasks [63]. Different approaches can be used during observations: one of them is video recording, another is think-aloud protocols, also called thinking aloud [46] or think-aloud [47]. Observations provide a deep understanding of the task being studied [63]. Traditionally, thinking aloud has been used as a psychological research method [46].

Think-aloud protocols produce a large amount of data in which the relevant data is mixed with irrelevant material, since stripping out irrelevant material would interrupt the data collection process [47].

The observer often needs to remind the test subjects to think out loud if they become silent [12, 46]. The reminder can be expressed by asking questions such as What are you thinking now? and Is that what you expected would happen?. The reminder should come after a predetermined limit, for example 60 seconds, and should not draw attention to the presence of the observer. The observer should be careful not to interfere too much with the participants [12, 46]. If, for example, a participant asks Can I do that?, the observer should neither answer nor use counter-questions such as What do you think happens if you do that?. Interactions other than reminders to think aloud, such as neutral questions or comments, should be avoided as much as possible, since they might redirect the participant's attention [12, 46].

2.9.3

Case Study

The term case study has several names, such as field study [32] and observational study [63]. A case study is performed in a real-life context and studies a single phenomenon or project within a certain time limit [63]. Usually, the phenomenon is difficult to distinguish from its surroundings. The aim is typically to track a certain attribute or to establish relationships between attributes. A case study does not produce the same evidence of causal relationships as a controlled experiment, but it results in a deeper understanding of the phenomenon under study in its real context [63]. The benefits of case studies are that they are simple to plan and realistic [63], and they can capture qualities that experiments cannot, such as unpredictability and complexity. Drawbacks are that the results are hard to interpret and generalize [63].

If we, for example, want to compare two methods, the study can be defined either as a case study or as an experiment [63]. Which strategy to use depends on how many factors can be isolated, the ability to randomize, and the level of control over the context [63]. In case studies, variables are selected that represent typical situations; in experiments, variables are sampled and manipulated.

A case study can be conducted in five process steps according to Wohlin et al. [63]:

1. Design: The study is planned and its objectives are defined. A plan should describe what to achieve and what to study. It should contain theory about the frame of reference and formulate research questions. Methods should be defined for how to collect data, and strategies should be selected for how to seek data. The objective of the study can, for example, be descriptive, exploratory, explanatory, or improving. In case studies, it is common that the research questions evolve and become more specific during the study iterations.

2. Data collection preparation: Protocols and procedures are defined for data collection. Data can be collected with several different methods, such as interviews (described in Section 2.9.1), focus groups, observations with think-aloud protocols (described in Section 2.9.2), video recording, archival data, and metrics.

3. Data collection: The prepared data collection is executed on the studied phenomenon.

4. Analysis: Analysis of the collected data is performed differently for quantitative and qualitative data. For quantitative data, hypothesis testing and descriptive statistics are often included (see the sketch after this list). Hypothesis testing is used to determine whether the effect of independent variables on dependent variables is significant. Descriptive statistics, such as mean values, scatter plots, and standard deviations, help to understand the collected data. For qualitative data, the analysis should derive conclusions from the results in such a way that a reader can follow them; decisions and relevant information must be presented. The validity of a study describes the reliability of the results and to what extent the results are true and unbiased. There are four common aspects of validity: construct validity, internal validity, external validity, and reliability. Construct validity analyzes whether what is being studied really corresponds to what is being investigated according to the research questions. Internal validity concerns the relations that are examined. External validity concerns the generalization of the results and their interest to other researchers. Reliability concerns the dependence on the specific researchers, i.e. the replicability of the study. For quantitative analysis, reliability corresponds to conclusion validity.

5. Reporting: The report describes the findings of the study. It can address different target groups, such as peer researchers, research sponsors, and industry practitioners, which may require different reports for different groups.
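As a hedged sketch of the quantitative part of the analysis step (the maintenance times below are hypothetical and SciPy is assumed to be available), descriptive statistics and a simple hypothesis test could look as follows:

    from statistics import mean, stdev
    from scipy import stats

    # Hypothetical maintenance times (minutes) for scripts written with and
    # without the guidelines; not data from this study.
    with_guidelines = [12, 15, 9, 14, 11, 13]
    without_guidelines = [21, 18, 25, 19, 23, 20]

    # Descriptive statistics help to understand the collected data.
    print(mean(with_guidelines), stdev(with_guidelines))
    print(mean(without_guidelines), stdev(without_guidelines))

    # Hypothesis test: is the difference between the groups significant?
    t_statistic, p_value = stats.ttest_ind(with_guidelines, without_guidelines)
    print(t_statistic, p_value)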

2.9.4

Experiments

An experiment is a study that deliberately introduces an intervention and observes its effects [53]. Some variables are manipulated while the other variables are kept fixed [63]. The effect of the manipulation is then measured and a statistical analysis is conducted [63]. In a randomized experiment, units are randomly assigned to the alternative conditions or treatments [53]. In a quasi-experiment, units are not assigned randomly [53, 63]. The cause is manipulated before the effect is measured [53]. Randomization is not always possible or desirable, for example due to high cost [29].
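A randomized assignment of units to treatments can be sketched as follows (the subject identifiers are hypothetical):

    import random

    subjects = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
    random.shuffle(subjects)

    # Half of the units receive the treatment (e.g. developing test scripts
    # with the guidelines); the other half acts as the control group.
    half = len(subjects) // 2
    treatment_group, control_group = subjects[:half], subjects[half:]
    print(treatment_group, control_group)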

If it is possible to control who uses a method, and when and where the method is used, an experiment can be conducted [63]. Experiments can be performed in a laboratory environment under controlled conditions, or they can be executed in real-life contexts.

An experiment must be planned in order to control it and produce valuable results. The plan should describe how the experiment is conducted and can be divided into seven steps, which are described by Wohlin et al. [63].

2.9.5

Choice of empirical research strategy

The choice of empirical research strategy depends on several factors, such as execution control, measurement control, investigation cost, and replicability [63]. Table 2.5 describes the strength of these factors for the different empirical strategies. The boundaries between the strategies are not always clear; a comparative case study can, for example, be referred to as a quasi-experiment in an industrial context [63].

Table 2.5: The impact of different factors on empirical strategies, extracted from Wohlin et al. [63].

Factor                Survey   Case Study   Experiment
Execution control     low      low          high
Measurement control   low      high         high
Investigation cost    low      medium       high


2.10

Related Work

This section presents related research that forms the basis of this thesis. Studies that examine the initial development of visual GUI tests are presented in Section 2.10.1. Maintenance of visual GUI tests is described in Section 2.10.2. Mapping of technical debt between software and testware is presented in Section 2.10.3. Best practices for AGT are presented in Section 2.10.4.

2.10.1

Transition from manual to automated visual GUI tests

Börjesson and Feldt evaluated two VGT tools, one open source (Sikuli) and one commercial, in a case study [13] at the company Saab AB. Five test cases were automated. The tools' properties were compared, as well as their applicability to a system that had previously been tested manually. The metrics collected were development time, execution time, and size of the automated tests. The results showed only minor differences between the tools and that VGT is applicable in industry, with effort gains compared to manual testing. The study also shows that VGT has challenges that need to be addressed, such as robust test execution and the maintenance cost of the scripts.

Alégroth, Feldt, and Olsson evaluated the transition from three manual system tests to automated tests using VGT in a case study [3]. The VGT tool used was Sikuli. During the transition, support for taking screenshots was implemented in the test scripts, capturing the faulty states of the GUI. These screenshots were later used during the analysis of the test results and could be used to recreate bugs manually, which also facilitated explaining and presenting the faults to other team members. The screenshots helped decrease the maintenance time of the SUT. The results showed that the transition was successful: the test execution speed was improved compared to manual testing, and all faults that had previously been identified manually were found.

Alégroth, Feldt, and Ryrholm conducted a case study [4] at two companies that examined the challenges, problems, and limitations during the VGT transition. Information about the return on investment (ROI) and defect-finding ability was also collected. Three VGT tools were evaluated in the study: Sikuli, eggPlant, and Squish. Sikuli was judged to be the most suitable for the transition at the two companies. Data was collected in different contexts with different developers, and over 300 test cases were created. The results showed that all test cases could be automated and that defects could be detected using VGT. 58 different challenges, problems, and limitations were identified and divided into 26 types. Positive ROI was achieved one month after development, and the execution speed was 16 times faster than the manual tests. The defect-finding ability was equal to or better than that of the manual tests. The development time for the automated test cases (C_VGT) and the execution time for the manual test cases (C_manual) were the metrics collected to calculate the ROI, according to Equation 2.1. In the equation, ROI_VGT describes the number of VGT suite executions required until C_VGT breaks even with C_manual. Positive ROI is reached when C_manual is greater than C_VGT.

ROI_VGT = C_VGT / C_manual < 1    (2.1)
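As a worked example with hypothetical numbers (not values from the cited study): if developing the VGT suite takes C_VGT = 40 hours and one manual execution of the corresponding test cases takes 2.5 hours, the accumulated manual cost C_manual reaches 40 hours after 40 / 2.5 = 16 executions. From that point on, C_manual exceeds C_VGT, the ratio C_VGT / C_manual falls below 1, and the ROI becomes positive.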

Previous research has shown that a transition from manual to automated testing using VGT is possible and reduces execution time compared to manual testing [2, 3, 4, 13]. Similar to these studies, this study will also perform a case study and a VGT transition of manual tests. Additionally, development time and manual execution time will be collected to calculate the ROI; more specifically, Equation 2.1 will be used.

References
