

MÄLARDALEN UNIVERSITY
SCHOOL OF INNOVATION, DESIGN AND ENGINEERING
VÄSTERÅS, SWEDEN

Thesis for the Degree of Master of Science in Software Engineering

15.0 credits

EXPERIMENTAL EVALUATION OF TOOLS FOR MINING TEST EXECUTION LOGS

Edvin Parmeza

epa19002@student.mdh.se

Examiner: Daniel Sundmark

Mälardalen University, Västerås, Sweden

Supervisor: Wasif Afzal

Mälardalen University, Västerås, Sweden


Acknowledgments

This report is a summary of the Master Thesis work performed at Mälardalen University in Västerås, Sweden, which finalizes my studies for the Master of Science in Software Engineering. This work has been the most challenging, but also the best, professional and academic experience so far; it not only helped me expand my knowledge of and practice with log analysis tools significantly, but also provided strong directions and opportunities for my future career in many areas of Software Engineering.

First of all, I would like to thank my Master Thesis supervisor, Wasif Afzal, who has been very helpful, motivating and patient throughout my Master Thesis work at Mälardalen University. His continuous feedback, recommendations and encouragement, as well as the discussions in the Master Thesis meetings, have been crucial throughout the whole process.

Secondly, I would like to thank all of my professors at Mälardalen University and at my home university in Albania for everything they taught me. All the knowledge, academic growth and skills that I have gained during these years exist because of their work.

Finally, I would definitely like to thank my family and all of my friends for supporting and encouraging me while working on my Master Thesis. In such cases, moral support is also very important and much appreciated.


Abstract

Data and software analysis tools are widely regarded as a beneficial and advantageous approach in software industry environments. They are powerful tools that help to generate testing, web browsing and mail server statistics. These statistics are also known as logs or log files, and they can be produced in different formats, textual or visual, depending on the tool that processes them. Though such tools have been used in the software industry for many years, software developers and testers still do not fully understand them. Our literature study shows that related work on test execution log analysis is rather limited. Studies evaluating a subset of features related to test execution logs are missing from the existing literature; those that exist usually focus on a single-feature comparison (e.g., fault-localization algorithms). One of the reasons for this might be a lack of experience or training. Some practitioners are also not fully involved with the testing tools that their companies use, so lack of time and involvement might be another reason that there are only a few experts in this field who understand these tools very well and can find any error in a short time. This makes the need for more research on this topic even more important.

In this thesis report, we present a case study focused on the evaluation of tools that are used for analyzing test execution logs. Our work relies on three different studies:

- Literature study

- Experimental study

- Expert-based survey study

In order to get familiar with the topic, we started with the literature study. It helped us investigate the current tools and approaches that exist in the software industry. It was a very important, but also difficult, step, since it was hard to find research papers relevant to our work. Our topic was very specific, while many research papers had performed only a general investigation of different tools. That is why, in order to obtain relevant papers, our literature search had to use specific digital libraries, terms and keywords, and criteria for literature selection. In the next step, we experimented with two specific tools in order to investigate the capabilities and features they provide for analyzing execution logs. The tools we managed to work with are Splunk and Loggly. They were the only tools available to us that would conform to the demands of our thesis, containing the features we needed to make our work more complete. The last part of the study was a survey, which we sent to different experts. A total of twenty-six practitioners responded, and their answers gave us a lot of useful information to enrich our work.

The contributions of this thesis are:

1. The analysis of the findings and results derived from the three conducted studies, in order to identify the performance of the tools, the fault localization techniques they use and the test failures that occur during the test runs, and to conclude which tool is better in these terms.

2. Proposals on how to further improve our work on log analysis tools. We explain what else is needed in order to understand these tools better and to provide correct results during testing.


Table of Contents

1. Introduction ... 1

1.1 Background ... 1

2. Related Work ... 2

2.1 Existing tools for test log analysis ... 2

2.1.1 Jenkins ... 2

2.1.2 Astro ... 5

2.1.3 LogFaultFlagger and CAM ... 5

2.1.4 TITAN / TTCN – 3 ... 9

2.1.5 Splunk ... 10

2.2 Related Work on Test Logs and Clustering Algorithms ... 15

2.3 Summary of Literature Findings ... 16

3. Methodology ... 18

3.1 Literature Study ... 18

3.1.1 Literature Search ... 19

3.1.2 Literature Selection ... 19

3.2 Experimental Evaluation ... 19

3.3 Expert-Based Study ... 20

4. Experimental Evaluation ... 21

4.1 Splunk ... 21

4.2 Loggly ... 30

4.3 Summary of Experimentations ... 35

5. Survey Study ... 40

5.1 Introduction ... 40

5.2 Survey Feedback ... 40

5.2.1 General Overview ... 40

5.2.2 Fault Localization Technique ... 44

5.2.3 Difficulties with test execution logs and other root causes ... 49

6. Thesis General Summary ... 50

7. Conclusion ... 51

8. Future Work ... 53

9. References ... 54


List of Figures

1. Jenkins Main Page ... 3

2. Functionality of the CI Delivery System in Jenkins ... 4

3. CAM: Average of TF-IDF and KNN ... 6

4. SKEWCAM: CAM with EKNN ... 6

5. LOGLINER: Line-IDF and EKNN ... 7

6. LOGFAULTFLAGGER: PastFaults * Line-IDF & EKNN ... 8

7. Efficiency of tools in fault catching and log line flagging ... 8

8. TITAN block diagram ... 9

9. Archiving process describing how the data is locked for concurrency and how does it recover from failed archiving ... 12

10. Sequence diagram over how the path resolver works and what it does ... 13

11. Benchmark of archiving buckets ... 14

12. Timeline panel of events in Splunk ... 22

13. Attributes of events ... 22

14. Display of the whole event history ... 22

15. Timeline panel of one of our log files ... 23

16. Search results displaying events list of our log files ... 23

17. Field sidebar parsing our data into fields or log types ... 24

18. Events with status code 503 and 408 ... 24

19. Status field ... 25

20. Column Chart of products and purchase count ... 25

21. Pie Chart of products and purchase count and percentage ... 26

22. Timeline panel of failed passwords ... 27

23. Events list of failed passwords ... 27

24. Events division in two groups ... 28

25. Number of Splunk operations per minute ... 29

26. Page faults per operation ... 29

27. Timeline panel and events list in Loggly ... 30

28. Fields sidebar in Loggly ... 31

29. Events of failed purchases ... 31

30. Expanded event ... 32

31. Surrounding events ... 32

32. Events with status 500 ... 33

33. Dashboard of our charts ... 33

34. Apache pre-configured dashboard ... 34

35. Average performance results for Splunk ... 36

36. Average performance results for Loggly ... 37

37. Efficiency of Loggly in finding the root cause ... 37

38. Efficiency of Splunk in finding the root cause ... 38

39. Countries where survey respondents work ... 41

40. Types of domains where survey respondents work ... 41

41. Identified issues regarding test tools and logs ... 46

42. Performance of the testing tools by fault localization ... 46


List of Tables

1. Summary of tool features from literature (part 1) ... 16

2. Summary of tool features from literature (part 2) ... 16

3. Purchase count and percentage of each product ... 26

4. Comparison between our tools (Splunk and Loggly) ... 36

5. Summary of features evaluated during our experimentation ... 38

6. General background of the practitioners ... 42

7. Involvement of the practitioners with the tools ... 43


1. Introduction

1.1 Background

The importance of generating and analyzing test execution logs increases with time in different industrial settings. In order to ensure software quality at the functional, system and other levels, test cases are written and run, either manually or automatically [1-4]. Execution of test cases produces logged output. In case of a failed test case, the logs are typically inspected, mostly manually, to identify the root cause of the fault. In a continuous integration setting, or in case of rapid execution of the same set of test cases, several such logs are recorded [5][8].

Available literature can help in identifying potential tools that can perform this analysis [11-19]. For instance, the tool developed in [11] relies on the log abstraction approach [5,14] and on clustering similar logs [27-29] in order to achieve better performance in failure identification. Another tool, Jenkins [12-16], utilizes several plugins and built-in analyzers for investigating the root cause of different environment- and product-related failures [20-26].

A grouping-based strategy is proposed in [30] to improve and boost the effectiveness of many fault localization techniques. Its application is assessed over a tool called Tarantula and a neural network-based technique called Radial Basis Function. The advantage of this strategy is that it doesn’t require the technique to be modified in any way.

The paper [32] presents a study which overviews and compares the Tarantula technique with four other techniques: Set Union, Set Intersection, Nearest Neighbor, and Cause Transitions, in terms of effectiveness in fault localization. There are also studies, like the ones in [33] and [36], in which different approaches or methods are used for fault localization. One of them is called slicing, which is the computation of the set of program statements, the program slice, that may affect the values at some point of interest, referred to as a slicing criterion. Program slicing can be used in debugging to locate the source of errors more easily. It is very effective for focusing on the relevant parts of a program in case of a detected misbehavior.

Another problem that is crucial for software quality and for its development and maintenance cost is, of course, the prediction of defects, errors and bugs in software modules. The studies performed in [31], [34], [35], [37], [38] and [39] aim to evaluate and compare different approaches, models and tools that are used to improve the prediction of these defects.

However, evaluating and comparing such tools in terms of their capabilities and features is hard to achieve by relying only on existing literature. It is also important to consider the fact that it is the software tester or developer who investigates the root cause of a failed test case in a certain environment. This fact makes the failure analysis and inspection subjective and dependent on the experience and expertise level of the practitioner.

In order to fill this gap, the main contribution of this thesis work will be to evaluate several tools for mining test execution logs by relying on an experimental evaluation of such tools as well as on real expertise of testers that use them in different industrial contexts.

In Section 2, related work on tools that analyze test execution logs is summarized. The research questions formulated based on the main problem of this thesis, as well as the methodology chosen to address them, are presented in Section 3. Section 4 clarifies the expected outcome of this work and, finally, gives the results of our experimentation and compares our tools based on those results. The limitations that might exist along the way are covered in Section 5 through the survey study. Section 6 presents a brief summary of all our work. In Section 7, we give our conclusion, and in Section 8, we give suggestions on what can be improved in future work.


2. Related Work

When it comes to tools for mining test execution logs, it is possible to identify many of them in the existing literature. Besides tools that are already consolidated in a proper software industrial context, there are also studies on new tools with improved features which are still experimental.

2.1 Existing tools for test log analysis

2.1.1 Jenkins

One tool that is quite popular among software testers and CI (Continuous Integration) experts nowadays is definitely Jenkins [12-16]. During the past decade, several domains have been using Jenkins for running their tests. As a consequence, extensive work has been done in order to improve this test management tool itself and thus, make it more complex and efficient.

Jenkins is a tool that runs on its own IDE (Integrated Development Environment) [13].

As an important tool for software integration, Jenkins is compared with four other tools in the study in [13] from two different perspectives: efficiency and usability. After performing this comparative analysis, it could be noted that Jenkins is able to fix most of the critical bugs very easily and quickly compared to the other four tools: Cruise Control, Hudson, Apache's Continuum and Team City.

Jenkins is originally an open source tool. It is also a server-oriented tool meaning that it needs a container (such as Apache Tomcat) to run the tests. In Jenkins, several SCMs (Source Control Management tools) are supported. Some of them are Git, CC (Clearcase), Subversion, Mercurial, Perforce and RTC (Rational Team Concert) [13-15].

How Jenkins takes the input and executes tests can be explained in the following simplified way. All desired tests are attached to repetitive automatic jobs that are named "builds". Each build is run periodically (for example, on an hourly or daily basis). If the build is successful, it will be shown as blue in the Jenkins GUI. That means that no failure was identified in any of the tests included in that build. If there are one or more failures, then the build will be marked as red. A visualization of passed and failed builds is presented in figure 1 [13].


Figure 1: Jenkins Main Page [13]

It is, of course, possible to click on each build and get to the detailed test execution logs in order to analyze the root cause of the failure. The root cause can be related to the system, the developed product, or even the testing environment.
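Build results and their logs can also be retrieved programmatically through Jenkins' JSON API. The sketch below is only an illustration, assuming a reachable server and a hypothetical job name; the /api/json and /consoleText endpoints are standard Jenkins, but field names, plugins and authentication requirements vary between installations and versions.

```python
# Minimal sketch: read a job's status colour and its last build's result and
# console log over the Jenkins JSON API. Server URL and job name are invented.
import requests

JENKINS_URL = "http://jenkins.example.com"   # hypothetical server
JOB = "nightly-regression"                   # hypothetical job name

# Job-level metadata: "color" is blue for a passing job, red for a failing one.
job = requests.get(f"{JENKINS_URL}/job/{JOB}/api/json").json()
print("Job colour:", job.get("color"))

# Last build metadata: "result" is typically SUCCESS, UNSTABLE or FAILURE.
build = requests.get(f"{JENKINS_URL}/job/{JOB}/lastBuild/api/json").json()
print("Build", build.get("number"), "finished with", build.get("result"))

# The detailed test execution log of that build, for manual root cause analysis.
log_text = requests.get(f"{JENKINS_URL}/job/{JOB}/lastBuild/consoleText").text
suspicious = [line for line in log_text.splitlines() if "FAIL" in line]
print(f"{len(suspicious)} log lines mention FAIL")
```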

To summarize how it works with the builds, we must also mention Ansible [14], an automatic build server where the builds are triggered. After that, the test server will start with the functionality, system or other testing tasks and then the test results will be received by the Jenkins CI (Continuous Integration) server.


Figure 2: Functionality of the CI Delivery System in Jenkins [14]

The functionality layers of the CI (Continuous Integration) Delivery System in Jenkins are shown in figure 2 [14]. It includes all layers, from the system service base layer where the automatic testing is run, up to the information display where the results are visualized as test logs.

Another advantage of this tool is that it keeps track of all the results from the automatic jobs, which means that changes in the system as well as in the software version can be observed and analyzed. The Jenkins CI server is built in Java and provides around 1000 plug-ins in order to build different projects properly. Newer versions of the tool provide better support for unit testing as well [15]. Such versions follow a pipeline-as-code approach, which means that shared libraries now replace all the work of configuring jobs for deploying and testing the system that was needed in previous Jenkins versions. Unit tests are automated thanks to the Jenkins settings written in Groovy.

The Puppet and Chef plugin, built on top of the Notification plugin, makes Jenkins capable of receiving deployment notifications coming from Puppet or Chef, containing the fingerprint of the deployed artifact. Jenkins will thus be able to identify the fingerprint in the report and look up the MD5 checksum in its database: after a successful match, the deployment information (when, where) will be added to the build [16].

Thanks to this mechanism, for each produced artifact, it is now possible to trace where this was deployed, when it was deployed and also the results of tests run against it, giving the test engineer full visibility on the history of an artifact and giving them an easy way to debug a potential error.
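As a rough illustration of this fingerprint matching, the sketch below computes an artifact's MD5 checksum and looks it up in a small in-memory map; the real Puppet/Chef and Notification plugins do this inside Jenkins against its fingerprint database, and the file paths and build records here are invented.

```python
# Sketch of the fingerprint idea: hash the deployed artifact and attach the
# deployment information (where, when) to the build that produced it.
import hashlib

def md5_fingerprint(path: str, chunk_size: int = 8192) -> str:
    """Compute the MD5 checksum that identifies an artifact."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical database mapping fingerprints to the builds that produced them.
builds_by_fingerprint = {
    "9e107d9d372bb6826bd81d3542a419d6": {"job": "release", "build": 128},
}

def record_deployment(artifact_path: str, host: str, timestamp: str) -> None:
    fingerprint = md5_fingerprint(artifact_path)
    build = builds_by_fingerprint.get(fingerprint)
    if build is None:
        print(f"No build found for fingerprint {fingerprint}")
        return
    build.setdefault("deployments", []).append({"host": host, "time": timestamp})
    print(f"Build {build['build']} was deployed on {host} at {timestamp}")
```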


2.1.2 Astro

The work performed in [2] provides an overview of the ASTRO (Analysis of Software Tests Results and Outputs) tool, which presents the information from log files in multi-perspective interactive visualizations. Interactive visualization is used to help developers better understand the information extracted from log files. The study that was carried out to assess the benefits that software developers can get by using ASTRO indicates that ASTRO facilitates the analysis of log file information. This is achieved by providing three different perspectives with interactive visualizations that can support the execution of the task. It offers support to developers in finding points of failure and their causes.

The perspectives are:

- Overview, which presents to the user a quick summary of the test case executions, providing a general view of the status of each test performed.

- Callers & Callees, which aims at providing an understanding of what classes and methods were involved in an error, based on the stack trace data.

- History, which aims at organizing the information of the test cases chronologically, providing the order of execution of each test method, how long they took to execute, and the exact date and time of their execution.

The results of the informal tests that were conducted with a large amount of data indicate that the performance of the tool decreased perceptibly. The reason was the size of the generated visualizations. The bubble chart and the MOV were affected the most by this problem. The Callers & Callees perspective does not account for methods that call each other, or for the number of times one method calls another.

The objective for future work is the integration of ASTRO with Integrated Development Environments (IDEs), so that developers can use the tool in a faster and more practical way.

2.1.3 LogFaultFlagger and CAM

Another interesting fault localization technique is presented in the work by Amar and Rigby [11]. They present techniques with the goal of capturing the maximum number of product faults in test logs while flagging the minimum number of log lines per inspection. A problem addressed here is that test runs can produce many lines of logged output, making it difficult to find the cause of a fault in the logs; moreover, lines that occur in both a passing and a failing log introduce noise when attempting to find the fault in a failing log. The contribution of this work [11] is a new approach called LogFaultFlagger, which removes the lines that occur in the passing log from the failing log. The results indicated that LogFaultFlagger was able to identify 89% of the total faults while flagging less than 1% of the total failed log lines for inspection. This made LogFaultFlagger outperform previous approaches and tools such as CAM (Cause Analysis Model), which could find only about 50% of the total faults and flagged all of the failed log lines.

CAM has successfully been used at Huawei to categorize test logs. The technique is re-implemented at Ericsson and a replication is performed on its test logs. CAM is computationally expensive and takes 7 hours to process the entire dataset. The reason for this is that it runs Term Frequency-Inverse Document Frequency (TF-IDF) across the logs to determine which terms have the biggest importance, generating large term-based vectors, and then calculates the similarity between the vector of the failing log and the vectors of all the past failing logs. It also uses K Nearest Neighbors (KNN) to classify and categorize logs. Figure 3 shows the results of the direct application of CAM to the Ericsson dataset. For K values 1, 15, 30, 60 and 120, the average percentage of faults that were caught is 47.32%, the average percentage of log lines that are flagged is 4.24%, while the average execution time is 457.333 minutes.
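To make the CAM pipeline more concrete, the following is a minimal sketch of the TF-IDF plus KNN idea using scikit-learn; the example logs and fault labels are invented, and the actual CAM and SKEWCAM implementations differ in preprocessing, scale and the modified (EKNN) neighbor selection.

```python
# Sketch of CAM-style categorization: vectorize past failing logs with TF-IDF
# and classify a new failing log by its nearest neighbours.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

past_failing_logs = [
    "setup ok connect to base station failed timeout waiting for cell",
    "setup ok assert handover state mismatch test aborted",
    "license server unreachable environment not ready",
]
past_fault_labels = [1, 1, 0]   # 1 = product fault, 0 = environment issue

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(past_failing_logs)

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, past_fault_labels)

new_failing_log = "setup ok connect to base station failed retry exhausted"
prediction = knn.predict(vectorizer.transform([new_failing_log]))[0]
print("product fault" if prediction else "environment issue")
```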


Figure 3: CAM: Average of TF-IDF and KNN [11]

Ericsson's test environment involves complex hardware simulations of cellular base stations. As a result, many test failures are environmental and do not lead to a product fault. Since the data is skewed, KNN is modified. Figure 4 shows that more neighbors catch more product faults but also flag many more lines. As K increases (1 -> 30 -> 120), the percentage of faults found increases (47.13 -> 88.64 -> 90.84), but the percentage of lines flagged increases too much (4.21 -> 27.71 -> 43.65).

Figure 4: SKEWCAM: CAM with EKNN [11]

SKEWCAM can accurately identify the logs that lead to product faults; however, it flags a large number of suspicious log lines that need to be examined by testers. To identify product faults effectively while flagging as few log lines as possible, a new technique called LOGLINER is developed. LOGLINER uses the uniqueness of log lines to predict product faults. The uniqueness of a log line is calculated by computing the Inverse Document Frequency (IDF) for each log line. Line-IDF is used to generate the vectors for the current failing log and all of the past failing logs. From the results in figure 5, we can clearly see that when both K (the number of neighbors) and N (the number of lines) are increased, LOGLINER can find a high percentage of the faults while flagging a very low percentage of the log lines for inspection. So, LOGLINER outperforms both CAM and SKEWCAM.

Figure 5: LOGLINER: Line-IDF and EKNN [11]

LOGLINER flags fewer lines, but finds slightly fewer faults. LOGLINER is then extended into LOGFAULTFLAGGER, which incorporates past faults into the line-level prediction. IDF is usually weighted; instead of using a generic weight, such as term frequency, another metric is used: the number of times a log line has been associated with a product fault in the past. One is added to this frequency to ensure that the standard IDF of the line is applied if a line has never been associated with any fault. Line-IDF is then weighted with the line fault frequency (FF). Figure 6 shows that the value of N has little impact on the number of faults found. The results show that LOGFAULTFLAGGER finds the same number of faults as SKEWCAM, but it flags less than 1% of the total log lines, compared to SKEWCAM's 28%. Compared to LOGLINER, LOGFAULTFLAGGER finds 4 percentage points more faults with 2.5 percentage points fewer lines flagged.
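The line scoring behind this can be sketched in a few lines of code. The example below computes a line-IDF over past failing logs and weights it by how often each line has co-occurred with a past product fault, plus one; the logs and fault counts are invented, and the published LOGFAULTFLAGGER differs in its log abstraction and neighbor-based prediction.

```python
# Sketch of line-IDF weighted by past fault frequency, used to rank the lines
# of a new failing log for inspection.
import math
from collections import Counter

past_failing_logs = [
    ["boot ok", "link down on port 2", "retrying"],
    ["boot ok", "timer expired in handover", "retrying"],
    ["boot ok", "link down on port 2", "crash in scheduler"],
]
# How many past product faults each line has been associated with.
past_fault_count = Counter({"crash in scheduler": 3, "link down on port 2": 1})

n_logs = len(past_failing_logs)
logs_containing = Counter(line for log in past_failing_logs for line in set(log))

def line_score(line: str) -> float:
    # Rare lines get a higher IDF than lines seen in many past failing logs.
    idf = math.log(n_logs / (1 + logs_containing[line]))
    # The +1 keeps plain IDF for lines never associated with a fault before.
    return (past_fault_count[line] + 1) * idf

new_failing_log = ["boot ok", "crash in scheduler", "retrying"]
flagged = sorted(new_failing_log, key=line_score, reverse=True)[:1]
print("Flagged for inspection:", flagged)
```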



Figure 6: LOGFAULTFLAGGER: PastFaults * Line-IDF & EKNN [11]

As we can see from figure 7, LOGFAULTFLAGGER and LOGLINER are more efficient than CAM and SKEWCAM in capturing the maximum number of faults while flagging the minimum number of log lines per inspection. The average percentage of log lines that they flag is much smaller than that of CAM and SKEWCAM, so they achieve a much better ratio of faults caught per flagged log line.

Figure 7: Efficiency of tools in fault catching and log line flagging (ratio between average %FaultCaught and average %LogLineFlagged for each of the tools) [11]


2.1.4 TITAN / TTCN-3

In [3], a concept for graphical presentation of test execution is presented. The goal of this approach is to make the analysis of log traces easier and to give the opportunity to examine them on-line. Visualization of test procedures, executions and results provides a better view of particular details for test operators and users. It can be done on-line during execution or off-line after execution.

Requirements demanded by users, which are the main focus of this paper, include the challenge of providing a reliable tool working under different conditions, the need for open source interfaces and the need for the tool to run in different environments.

The steps of specifying, executing and analyzing test cases can be described in three ways:

- textual format (programming language)

- graphical format (UML)

- test language that allows multi-representations (TTCN-3)

TTCN-3 (Testing and Test Control Notation) is a flexible and powerful language applicable to protocol testing, service testing, module testing etc. and can be used for many other kinds of testing.

GFT (Graphical Presentation Format) represents graphically the behavioral aspects of TTCN-3 and GFT Notations.

The graphical visualization tool for TTCN-3 execution environments is explained better through an implementation architecture. The chosen graphical symbols and the test execution logging interface are considered and explained.

Further details on TTCN-3 as implemented in an industrial context (Ericsson Hungary) are given in [18]. This tool utilizes the test execution environment of Ericsson called TITAN. Internal details and the operation of the toolset are shown. Unique TITAN features and differences from other commercial TTCN-3 tools are also discussed in this paper.

Figure 8: TITAN block diagram [18]


The main components of TITAN, shown in the block diagram, are the following:

1. TTCN-3 and ASN.1 compiler, which includes parsing and analysis of the test suite and reporting of the syntax and semantic errors that are present in the input. The results are generated in the form of C++ program modules as part of the Executable Test Suite (ETS).

2. Base Library, which contains common and static parts of the ETS that are independent of the actual test suite.

3. Test Port API, which is a well-defined programming interface for handling incoming and outgoing messages.

4. Utility programs, which make test suite development, compilation, test execution and result analysis easier.

5. Main Controller, which performs the task of coordinating the operation of TTCN-3 test components, which are required to be run in parallel.

TITAN performs these operations:

a. Syntax and Semantic Analysis

b. Code Generation

c. Executable Test Suite Derivation

d. Encoding and Decoding

e. Graphical User Interface

TTCN-3 Executable Test Suite can communicate with the outside world via standardized interfaces. The standards describe the interfaces with programming language independent notation and give canonical C, Java and XML mappings for the data types and procedures. The interface standardization aimed at allowing users to switch from one TTCN-3 runtime environment to a tool of another vendor without changing the application specific software modules. The goal was to provide an effective test system in such a way that the test users need to develop the smallest and simplest external program modules possible.

The drawback of standardized interfaces is that their later implementation in TITAN could cause difficulties, because they assume dynamic typing and multi-threaded operation. According to the authors, this implementation would not optimize the performance, simplify the structure or make the usage more comfortable compared to the existing built-in functionalities in TITAN.

Thanks to the work performed on these tools [18], TTCN-3 and TITAN became a widely used test solution within Ericsson, and the authors contributed to the TTCN-3 standardization within ETSI (the European Telecommunications Standards Institute). TITAN is today the official TTCN-3 test tool within Ericsson. The Test Competence Center works on the deployment and support of TITAN and TITAN-based test solutions. 50 Test Ports and 100 protocol modules have been developed for TITAN [18].

2.1.5 Splunk

In [4], logs are proposed to be used as a mechanism to bridge the gap that exists between software developers and software operators. Developers and operators rarely communicate development and field knowledge to each other. The knowledge in logs has not been fully used, and the reason for this is their non-structured nature, their large scale, and the use of ad hoc analysis techniques. Hence, many case studies have been performed on large commercial and open source systems in order to demonstrate the value of logs as a tool to support developers and operators. Several techniques are proposed, like a scalable relational-algebra-based language built on top of a web mining framework (e.g., Pig), open source systems (e.g., Hadoop) and commercial systems.

Some scalable and systematic log analysis platforms worth mentioning are Splunk and IBM InfoSphere Streams. We will focus on Splunk, since we are using this tool for analyzing our logs.


Splunk is a software product that enables us to search, analyze and visualize the data gathered from the components of an IT infrastructure or business. It is used as a log-processing platform: it indexes log data and supports scalable searching for keywords in logs.

With Splunk, we have the ability to connect data outputs of various software products and monitor them from a single interface. We can abstract the format of the data and analyze it using a single search language.

As the data inside the product grows, the search times increase and there is a need for more storage. In order to address this issue, data can be archived or deleted based on total size. Splunk does not do any of the archive management itself, but it provides configuration options to control when data should be removed, as well as configuration for a script that is called just before the removal event.

The study performed in [43] aims to create a complete archiving solution for Splunk. The challenge is to design and implement an archiving feature for Splunk that satisfies attributes like reliability, scalability and manageability; to solve this problem, the authors propose a file system that suits the data to be archived.

The main advantage of using Splunk is that it does not need any database to store its data, as it makes extensive use of its indexes to store the data. Splunk uses source types to categorize the type of data being indexed. The source type is a default field that the Splunk software assigns to all incoming data. Its purpose is to format the data during indexing and to categorize it so that it can be searched easily.
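As a simplified illustration of source typing, the sketch below maps incoming file names to a source type used later for parsing and searching; Splunk's own automatic source type detection is far more sophisticated, and the patterns here are hypothetical.

```python
# Toy source type assignment: the first matching pattern decides the category.
import re

SOURCETYPE_RULES = [
    (re.compile(r"access.*\.log$"), "access_combined"),  # web server access logs
    (re.compile(r"secure"), "linux_secure"),             # authentication logs
    (re.compile(r"\.csv$"), "csv"),                      # structured exports
]

def assign_sourcetype(filename: str) -> str:
    for pattern, sourcetype in SOURCETYPE_RULES:
        if pattern.search(filename):
            return sourcetype
    return "generic_single_line"   # fallback when nothing matches

print(assign_sourcetype("www1/access_30DAY.log"))  # -> access_combined
print(assign_sourcetype("www1/secure.log"))        # -> linux_secure
```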

However, Splunk's architecture also has constraints that need to be taken into consideration when designing a solution. They are described in the paragraphs below. First, let us look at Splunk's main structure and components.

The Splunk instance that stores and indexes data is called an indexer. Splunk can be set up with a single indexer or in a cluster configuration with multiple indexers. Each indexer can contain multiple indices. Each index contains multiple buckets of data. A bucket has an id that is unique within the index it belongs to. The id is not sufficient to identify a single bucket amongst multiple indexers: there may exist other Splunk indexers that have the same index name and may therefore generate buckets with the same index and id pair. This is why configuration parameters were introduced to distinguish between instances of Splunk: the cluster name and the server name.
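The sketch below illustrates why those extra parameters matter: the (index, bucket id) pair alone can collide across indexers, so a collision-free archive path also embeds the cluster and server names. All names are invented; the real archiver in [43] resolves paths for Hadoop.

```python
# Build a unique archive path per cluster/server/index/bucket.
from dataclasses import dataclass

@dataclass(frozen=True)
class Bucket:
    cluster_name: str
    server_name: str
    index: str
    bucket_id: str

def archive_path(root: str, b: Bucket) -> str:
    # Two indexers that both have an index "main" containing a bucket "db_12"
    # still archive to different locations.
    return f"{root}/{b.cluster_name}/{b.server_name}/{b.index}/{b.bucket_id}"

b1 = Bucket("prod-cluster", "indexer-01", "main", "db_12")
b2 = Bucket("prod-cluster", "indexer-02", "main", "db_12")
assert archive_path("/archive", b1) != archive_path("/archive", b2)
```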

The archiving process is the most important step. During this process, the data is removed from one file system and stored in another one. Data loss is unacceptable, so the process must not be error prone.


Figure 9: Archiving process describing how the data is locked for concurrency and how it recovers from failed archiving [43]

First, the bucket has to be moved to safe storage as fast as possible, because after the archiving code is run, Splunk will remove the bucket that was passed to the script, even if the script contains errors. We must keep the bucket from being removed in case the transfer to the archive fails. In order to prevent data loss, a transactional transfer strategy must be performed. A transaction either changes the state of a system as requested or, in case of an error or a request to abort, leaves the system in the same state as it began with. It is built out of multiple operations with the following properties:

- Atomic

- Consistent

- Isolated

- Durable

The phases of the transaction follow these steps (a minimal code sketch is given after the list):

1. Begin a transaction/Lock the file in safe storage, and only allow a single transfer process to access the file.

2. Execute all the operations in the transaction/Transfer the file from local safe storage to the archiving file system.

3. Commit the changes to the transaction/Commit the changes by doing the atomic move (rename) operation.
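A minimal local-filesystem sketch of these three phases is shown below; the real solution in [43] transfers buckets to Hadoop, and the lock file and directory names here are simplified stand-ins.

```python
# Transactional bucket transfer: lock, copy to a temporary name, then commit
# with an atomic rename so a reader never sees a partially archived bucket.
import os
import shutil

def archive_bucket(bucket_path: str, safe_dir: str, archive_dir: str) -> str:
    name = os.path.basename(bucket_path.rstrip("/"))
    lock_path = os.path.join(safe_dir, name + ".lock")

    # Phase 1: lock, so only a single transfer process touches this bucket.
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        tmp_target = os.path.join(archive_dir, name + ".transferring")
        final_target = os.path.join(archive_dir, name)
        if os.path.exists(final_target):
            raise FileExistsError(f"bucket already archived: {final_target}")

        # Phase 2: transfer the bucket into the archiving file system.
        shutil.copytree(bucket_path, tmp_target)

        # Phase 3: commit with an atomic move (rename).
        os.rename(tmp_target, final_target)
        return final_target
    finally:
        os.close(lock_fd)
        os.remove(lock_path)
```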

Regarding scalability, the same archiving file system should be able to store big data sizes from a single data source as well as from multiple data sources. This is solved by a path resolver. Figure 10 shows how a path resolver works:


Figure 10: Sequence diagram of how the path resolver works and what it does [43]

It is very important to make Splunk work not only now, but also in the future. Performing tests and maintaining high code coverage are crucial for achieving this goal, and are also a big priority of the study performed in [43], because they make it easier for new contributors to add functionality and ensure that the system behaves as intended. The code can then be refactored, and every time it is refactored, the tests are run to make sure that the system still behaves as intended. This approach produces code that is not only workable, but also readable and understandable, which ensures the future of the product.

Figure 11 presents benchmarking results of the archiver. The y-axes show the time in seconds and the x-axes the bucket size in MB.

The results indicate that some runs generate regular times and some runs generate irregular times. This may be because of Java's garbage collection: it may start in the middle of a transfer and delay its execution.


Figure 11: Benchmark of archiving buckets [43]

This study has succeeded in designing and implementing an archiving feature for Splunk that satisfies reliability, scalability and manageability. However, there are still improvements that can be made to the system.

One is using several Hadoop instances for load balancing purposes, by letting the path resolver return paths that contain specification of which instance to use.

Another one is to support other file systems, by implementing the file system interface for another system. The archiving process is solid. However, there are execution cycles that may cause errors. Some of them include:

1. If the start script crashes before moving the bucket to the safe location, the bucket will be deleted by Splunk and the data will be lost!

2. If our archiving server crashes after the bucket was successfully transferred to Hadoop but before it was deleted from local file system, the bucket will be re-transferred the next time our start script is invoked by Hadoop. But since the bucket is already there, this will generate a FileOverwriteException.

The first problem is very unlikely to happen. It can happen when a user sends a kill signal to the application. A power failure can crash Splunk, so when the system is up and running, it will have to run the script again. We can make an improvement by letting Splunk do the moving operation to a safe path and then invoking our script. This problem can also happen when the script doesn’t start. The cause for this could be a human error while configuring the script.

The second problem does not cause any data loss. Since we use transactional transfers to the archive, we know that if a bucket is in the archive, it is a complete bucket. So, when we discover that the file is already in the archive, we will throw the FileOverwriteException. It can then be handled by deleting the local file.


2.2 Related Work on Test Logs and Clustering Algorithms

The paper [7] proposes a data clustering algorithm for mining patterns from event logs. Data mining methods help in finding patterns that characterize the normal behavior of the system and facilitate the creation of a system profile. Association rule algorithms have been proposed for mining temporal patterns from event logs, while data clustering algorithms are used for reviewing large log files. They aim at dividing the set of objects into groups or clusters, where objects in each cluster are similar to each other, while objects that do not fit well into any of the clusters form a special cluster of outliers.

A tool called Simple Logfile Clustering Tool (SLCT) is used to implement a log file clustering algorithm, which is fast because it makes only a few passes over the data and detects clusters in subspaces of the original data space. The algorithm follows three steps. The first one includes passing over the data and building a data summary. In the second step, the algorithm makes another pass to build cluster candidates, and in the third one it selects certain clusters from the candidates. The tool SLCT is developed in C. It implements the vocabulary and the candidate table through a data structure called a move-to-front hash table. This structure has proven to be very efficient and very fast. SLCT takes log files and a support threshold as input. Then, it detects a clustering of the input data and reports clusters by printing out line patterns that correspond to clusters. Because of the memory cost, SLCT cannot report the lines that do not belong to any of the detected clusters. When a log file is very large, or the support threshold value is too large or too small, there might be errors in the cluster detection. As a solution to this problem, an iterative clustering approach was suggested. After many experiments with SLCT, the researchers are satisfied with the tool and conclude that their algorithm has modest memory requirements and finds many clusters from large log files in a short amount of time.
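The core of the algorithm can be illustrated with a toy re-implementation: words that occur frequently at a given position form cluster candidates, and infrequent words become wildcards. The real SLCT is a C tool built around a move-to-front hash table; the log lines and support threshold below are invented.

```python
# Toy SLCT-style clustering in three passes over a handful of log lines.
from collections import Counter

log_lines = [
    "sshd[100]: Failed password for root from 10.0.0.1",
    "sshd[101]: Failed password for admin from 10.0.0.2",
    "sshd[102]: Failed password for root from 10.0.0.9",
    "kernel: eth0 link up",
]
support = 3  # minimum number of lines a pattern must cover

# Pass 1: summarize the data by counting (position, word) pairs.
word_counts = Counter(
    (pos, word) for line in log_lines for pos, word in enumerate(line.split())
)

# Pass 2: build one cluster candidate per line from its frequent words.
candidates = Counter()
for line in log_lines:
    pattern = tuple(
        word if word_counts[(pos, word)] >= support else "*"
        for pos, word in enumerate(line.split())
    )
    candidates[pattern] += 1

# Pass 3: select and report the clusters that reach the support threshold.
for pattern, count in candidates.items():
    if count >= support:
        print(count, " ".join(pattern))   # 3 * Failed password for * from *
```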

A similar approach to SLCT, which clusters words of similar frequency in each line and abstracts the line into an event type, is presented in paper [29]. Its implementation is evaluated on a log file with 128,636 log lines, each one abstracted to one of 727 unique event types. The results show that it is able to detect events that occur twice or more, but not those that occur only once. However, abstraction is not necessary for an event that occurs only once, since it is unique in itself. The authors suggest, for future work, calculating the precision and recall of the techniques used in this research paper on log files from different applications, to find which technique is better under the given conditions.


2.3 Summary of Literature Findings

After summarizing the literature findings, we did a preliminary analysis of the features that each tool offers. The pros and cons of each tool are highlighted in tables 1 and 2.

Nr. Tool Fault Categorization Algorithms IDE Open Source Web-based GUI
1 Jenkins X X X X
2 Astro
3 LogFaultFlagger X
4 Titan (TTCN-3) X X
5 Cruise Control X X
6 Apache's Continuum X
7 Hudson n/a
8 TeamCity
9 CAM
10 Splunk X X X
11 SLCT X

Table 1: Summary of tool features from literature (part 1)

Nr. Tool Industrial Application SCM Support Automated Test Jobs Plug-ins
1 Jenkins X X X X
2 Astro X
3 LogFaultFlagger n/a
4 Titan (TTCN-3) X X
5 Cruise Control X
6 Apache's Continuum X
7 Hudson X
8 TeamCity
9 CAM
10 Splunk X X X X
11 SLCT X X X

Table 2: Summary of tool features from literature (part 2)


As we can see from the two tables, the tools that use fault localization techniques are Jenkins, LogFaultFlagger, Titan (TTCN-3), Cruise Control, Splunk and SLCT.

Jenkins is server-oriented, which means it needs a container to run the tests. It provides the feature that allows us to attach the desired tests to automatic jobs named "builds", which are run periodically. The builds identify if there are any failures in the tests, and they also provide the detailed test execution logs that allow us to analyze the root cause of a failure. Jenkins also keeps track of all the results from the automatic jobs, so it observes and analyzes any change that occurs in the system and software version. The Jenkins CI server provides around 1000 plug-ins in order to build different projects properly. The test results in [13] indicate that Jenkins performs better and faster than Cruise Control, Apache's Continuum, Hudson and TeamCity in terms of fixing critical bugs.

LogFaultFlagger is an approach which aims at capturing the maximum number of product faults in test logs while flagging the minimum number of log lines per inspection. The technique that it uses removes the lines that occur in the passing log from the failing log. The results in [11] indicate that LogFaultFlagger performs considerably better than CAM and SKEWCAM in terms of identifying most of the total faults while flagging as few failed log lines as possible.

TTCN-3 is an approach which presents a concept of graphical presentation of test execution, with the goal of making the analysis of log traces easier. It is a flexible and powerful language applicable to protocol testing, service testing, module testing and many other kinds of testing. This tool presents the concept of standardized interfaces, which aim to help test users develop the smallest and simplest possible external program modules. However, the results in [18] indicate that their implementation could cause difficulties, because they assume dynamic typing and multi-threaded operation, so the performance would not be optimized and the structure would not be simplified.

In [43], the authors aim at designing and implementing an archiving feature for Splunk that satisfies attributes like reliability, scalability and manageability. They propose a file system that goes well with the data to be archived. After being tested, the results indicate that the archiving process is solid, but there may be some drawbacks. For example, there are execution cycles that may cause errors, like data loss or data overwrite, due to certain anomalies.

The authors suggest some improvements for these issues, like letting Splunk do the moving operation to a safe path and then invoking the script, or, in the case of overwritten data, throwing the FileOverwriteException, which is then handled by deleting the local file.

SLCT is a tool developed in C which is used to implement a log file clustering algorithm [7] that makes only a few passes over the data and detects clusters in subspaces of the original data space. Experimenting with this tool has shown that the algorithm has modest memory requirements and finds many clusters from large log files in a short amount of time. Since it is very practical, SLCT is proposed for use in the software industry.

Based on the literature findings and the highlights in tables 1 and 2, Jenkins is the most efficient and the most popular tool. The reason is that Jenkins provides more plug-ins, it runs on its own IDE, it is very efficient and easy to use, it supports several Source Control Management tools, and it creates automated test builds for the logs that it analyzes. These are just some of the reasons that several domains have been using Jenkins for running their tests. Experts have also been doing extensive work to improve this tool, making it more complex and more efficient.

While our aim is to evaluate such tools in terms of their features and capabilities, this is a hard objective to achieve by relying only on the related work. Hence, in the third section we present our methodology, which describes the research questions we formulated to capture our main problems, the studies that we will perform to help us answer these questions, and a brief description of what we intend to achieve by following this methodology.


3. Methodology

The main problems presented in this thesis work have been summarized by the following research questions:

RQ1: What features and capabilities can be identified in tools that analyze test execution logs, and how feasible is it to compare such features in different tools?

RQ2: How can we evaluate a subset of such tools for different features?

Regarding the rationale behind these research questions, RQ1 aims to identify the features and capabilities of each tool that are most used by practitioners in industrial settings. It also aims to indicate whether it is possible to compare them with each other. The performance of the same features in different tools can be hard to compare, given that it is influenced by the different hardware and software components present in the industrial environment where the tools are being used. Hence, the performance of some of the tools depends not only on the number and efficiency of the features that they provide, but also on the environment that they are working in. This challenge, of course, exists only for tools that are environment-dependent. Although there are a lot of environment-independent tools in industry today, they unfortunately do not provide many advanced features, which makes them not so commonly used in the software industry. On the other hand, the environment-dependent tools with advanced features that are used by experts do not have a free license, making it difficult or impossible for us to access them and make an evaluation by ourselves.

While RQ1 is limited to identifying such features and capabilities, RQ2 covers the possibility of evaluating them in a common and, when possible, comparative analytic perspective. We want to explore the ways we can make the evaluation and comparison between the tools, based on the features that they provide. Our aim is to get detailed information on how software developers and testers work with different log analysis tools, what benefits these tools provide to their work, what difficulties and challenges the practitioners face during testing, and what could be improved in each of the tools in order to be more efficient and obtain the expected results. We also want to experiment with some of these tools ourselves, so our study will cover different perspectives regarding the evaluation of tools for mining test execution logs.

Our thesis work relies on three different studies for answering the research questions:

1. A literature study

2. An experimental evaluation

3. An expert-based study

In order to answer RQ1, we will rely on all three studies. We will start by identifying all features and capabilities in existing literature (study I). Then, we will identify and when feasible, compare such features and capabilities in the tools that will be part of our own experimental evaluation (study II). We will use the feedback received from practitioners in study III to identify further features and capabilities in the tools that they use for mining different test execution logs.

In order to answer RQ2, evaluating a subset of such tools for different features would require an evaluation and when feasible, a comparison of such tools in performance and other features as we already explained. For this purpose, we will rely on our own experimental study (study II) and the feedback from the practitioners with different industrial background (study III).


3.1 Literature Study

A literature study has been performed [1-43] in order to get as much information as possible about the assessment of tools that are used for mining test logs. This information has been extracted by reviewing existing studies that are relevant to our topic and has helped us answer our research questions. We intended to review old and new studies on our topic, because we wanted to know how log analysis tools have evolved throughout the years, and what benefits and challenges their improvement brought. This study can help us answer our first research question. However, we will have to perform the experimental evaluation and the expert-based study in order to be able to answer our second research question. In the first two steps of our literature study, we performed a literature search and a literature selection, which helped us find research papers related to our topic.

3.1.1 Literature Search

When we started searching for references, we used some terms and keywords. The most common ones were:

 test execution logs

 tools for mining test logs

 analyse of test execution logs

 test framework for Jenkins

 finding defects by using logs

 faults in test traces

 clustering logs

 cluster-based tests

The main search engines included the digital libraries ACM Digital Library, IEEE Xplore, Scopus, ResearchGate as well as Google Scholar.

3.1.2 Literature Selection

Results from the search process, during which we used the keywords mentioned above, included around 550-600 articles, published papers and books. We chose the papers that are most relevant to our topic and discarded all the rest. The criteria included: tools used in different software testing and CI environments, techniques for identifying faults and failures during test execution, analysis of test execution logs, algorithms and models related to them, as well as useful and necessary background on these topics. We concluded that 43 papers satisfy the selection criteria.

3.2 Experimental Evaluation

After the literature study, the next step was an experimental evaluation and comparison of different tools in terms of their capabilities and features [5-9][27-39][42-43]. Some of the features include presenting the log file information in single- or multi-perspective interactive visualizations, correct or incorrect clustering of similar faults, performance, fault prediction and identification techniques, predicting bugs, fault location, etc. Each test execution log can have different failure types.

Some of the fault prediction techniques use statistical regression models and machine learning models to predict faults in software modules. We intended to find tools that are similar to, or have features similar to, the ones evaluated in our literature study. Since the majority of such tools are environment-dependent or do not provide a free license, it would be difficult to achieve this goal. Hence, we decided to go with two tools that were accessible to all users and provide basic features.

The tools that we evaluated during this work are:

- Splunk

- Loggly

The experiment helped us answer our research questions partially, and since we were limited in resources and time, there were still gaps in the results that we wanted to get. That is why we decided to add another study in order to fill these gaps, and came up with the survey, which we sent to several experts from the software development and testing industry.

3.3 Expert-Based Study

The third and final part of our methodology included a survey study, which we sent to different software testers and other practitioners (for example, in different LinkedIn groups) in order to collect more practical information. As we mentioned before, it was impossible to get a free license for environment-dependent tools, so that is one of the reasons we added a survey to our study. We wanted to get information about these tools from people who can access, work with and evaluate them. These people are the practitioners and experts that come from different industrial backgrounds, and most of them have been working with several testing tools for years. Working in an industrial setting has given these practitioners the benefit of working with tools that provide advanced features and are updated continuously. They are provided with a license from their respective organizations, which allows them to use all of the features of the corresponding tool(s). Hence, the majority of the experts know a lot about the tool(s) they work with, and the capabilities, limitations and drawbacks that they have, so they can make a more precise evaluation of them than we can.

We sent the survey to different software developers and testers from different nationalities and different industrial backgrounds. After getting several responses, we were able to get useful information from these practitioners and use their expert-based feedback on making our findings more reliable.


4. Experimental Evaluation

We decided to consider two freely accessible test tools in order to avoid issues with upgrade features in other tools, as well as legal considerations for tools that involve a specific framework and data from certain companies. For the experimental evaluation, the tools we decided to take into consideration are Splunk and Loggly. The purpose of this evaluation was to see how these tools analyze different kinds of logs and which one performs better than the other. We obtained the logs from the Internet. After analyzing them with our tools, we compared the tools based on the results that we got. We compared them on several aspects: their performance, the test results that each tool provides, the fault localization techniques that they use (if they use any) and the test failures that we experienced during the test runs.

4.1 Splunk

Let’s start with Splunk. As we mentioned before, Splunk is an enterprise platform which analyzes application logs, web server logs, sensor data, syslog, Windows events, all of this in many supported formats. It searches, indexes, and correlates log data in a searchable repository from which it can generate graphs, reports and visualizations [44].

Splunk provides a simple but powerful interface for getting insight out of contextual data, and it is fair to say that it is one of the most complete solutions for log collection, analysis and processing.

Logs are generated on computing and non-computing devices alike, and they are very important for businesses, especially online ones. Any basic system generates logs that are stored in some location or directory. The problem is that we cannot easily read those logs, at least not in a human-readable format, yet they contain details about every single transaction and operation that happens on the device. Logs are also generated massively: thousands of log lines can be produced every single minute. We therefore need a tool that can understand our logs and explain them to us in a simple manner, which is where Splunk comes in. Splunk stores data in a compressed form: whatever data comes in is compressed to one level at first and, later when it is archived, it is compressed even more.

Splunk can also be configured to raise alerts. For example, if someone accesses our network from an unreliable source, Splunk can alert us as soon as it detects a request coming from a particular IP range. Similarly, we can install Splunk on a system and monitor its CPU performance; whenever the CPU usage crosses a threshold at which the system might crash, Splunk gives us an alert and tells us that something is wrong, so we can fix the problem immediately and avoid the resulting losses.

Here are the steps to analyze logs in Splunk. After uploading them, we follow the wizard steps, which lead us to the search/query screen where we can perform a detailed analysis over the data. When data is indexed, it is divided into individual events (Figure 12). An event is a single piece of data in Splunk, similar to a record in a log file or other data input. Each event is given a timestamp, host, source and source type (Figure 13). A single event may correspond to a single line in our inputs, but some inputs (e.g., XML logs) have multiline events, and some inputs have multiple events on a single line. When we run a successful search, we get back events. Similar events can be categorized together with event types.
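As an illustration of how these default attributes can be used to restrict a search (the host value and time range below are hypothetical and chosen only for this sketch; access_combined is a common Splunk source type for Apache access logs):

sourcetype=access_combined host="webserver01" earliest=-24h | head 10

This would return at most ten events of the given source type from the given host over the last 24 hours.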


Figure 12: Timeline panel of events in Splunk

Figure 13: Attributes of events

In Figure 14, we can see the whole history of events in Splunk. Each bar shows the number of events from the first day of a month to the first day of the next month. The earliest events here fall in the interval from June 1, 2016 at 1 AM to July 1, 2016 at 1 AM.

Figure 14: Display of the whole event history

If we click on one of the bars, Splunk zooms in on that time range, and we can keep clicking until the displayed events are narrowed down to one specific second. If we click again, we will get the same number of events even during 0.01 seconds or during 0.001 seconds. The dates of the events are the dates when the logs were generated.

We performed analysis on some Apache server data from a fictional game company (Buttercup Games). The log files contain the client IP address, the timestamp (the time when each log entry was created), the product id, the HTTP status code and some other information. When we search for our log file in the Search bar, Splunk displays the timeline panel of the events (Figure 15), the search results, which include the raw events (Figure 16), and a sidebar of fields that are extracted from the events (Figure 17).
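As a sketch of how these fields can be pulled out for inspection (the field names clientip, status and productId follow the usual Splunk extraction for this kind of access-log data and are assumptions here):

sourcetype=access_* status=* productId=* | table _time, clientip, status, productId

This lists the timestamp, client IP address, HTTP status code and product id of each matching event.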

Figure 15: Timeline panel of one of our log files


Figure 17: Field sidebar parsing our data into fields or log types

If we want Splunk to display only the events with certain values, without having to go through all the events, we write a search criterion that matches our needs in the Search bar. Selecting a time range also helps narrow the search.

Limiting a search by time is key to getting results faster and is a best practice for every search. Let us assume that we want Splunk to display only the events with HTTP status code 503 or 408. We use key-value pairs, combining the case-sensitive field name with the value we want to find (status=503 or status=408) and the uppercase boolean OR between the search terms:

status=503 OR status=408


Figure 18: Events with status code 503 and 408

If we click on the status field in the field sidebar, Splunk provides links to quick reports, the values returned and statistics for those values (Figure 19).

Figure 19: Status field

As we have mentioned earlier, with Splunk we can also visualize the data that we are analyzing. By combining search terms with search commands, many more monitoring and analysis possibilities open up.

In our example below, we search for products sold in our web shopping cart, using the stats command to count them by product id, sorting them by the number sold and displaying them in a column chart and a pie chart visualization:
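A minimal sketch of such a search, assuming the field names action and productId present in this data set, might look like:

sourcetype=access_* action=purchase | stats count by productId | sort - count

Opening the Visualization tab then renders the resulting table as a column chart or a pie chart.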


Figure 21: Pie Chart of products and purchase count and percentage

productId     count    count%
WC-SH-G04     13       11.607%
WC-SH-A01     5        4.464%
WC-SH-T02     5        4.464%
CU-PG-G06     6        5.357%
FI-AG-G08     6        5.357%
MB-AG-T01     6        5.357%
PZ-SG-G05     6        5.357%
BS-AG-G09     8        7.143%
DB-SG-G01     8        7.143%
MB-AB-G07     9        8.036%
WC-SH-A02     9        8.036%
FS-SG-G03     10       8.928%
SC-MG-G10     10       8.928%
DC-SG-G02     11       9.821%

Table 3: Purchase count and percentage of each product

We also analyzed a log file which contains information about users logging in. Users are divided into valid and invalid users, and password attempts into failed and accepted passwords. As with the log file above, we use search criteria in the Search bar to make Splunk return the output we expect. We want to test this file with event types. An event type is a categorization system, or knowledge object, which helps us make sense of our data; it lets us examine huge amounts of data, find similar patterns, and create alerts and reports. It allows us to automatically classify events based on a search criterion.


If we write in the Search bar:

password fail*

the search results will show the timeline panel of the events (Figure 22). The events displayed are the ones that contain a failed password for an invalid user and the ones that contain a failed password for a valid user (Figure 23).

Figure 22: Timeline panel of failed passwords

Figure 23: Events List of failed passwords

If we want to show only one of these two groups of events, we will write:

password fail* "for invalid user"

or

password fail* NOT invalid

Now we can create an event type for each of these groups. The first one is called less_risky (for invalid users), while the second one is called risky (for valid users). After creating the event types, we can test them by writing in the search bar:

eventtype=less_risky

or

eventtype=risky

and the results will be the same as above.
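Under the hood, an event type is simply a named search; as a sketch, the two event types above could be represented by stanzas like these in Splunk's eventtypes.conf (the stanza names mirror the ones used here):

[less_risky]
search = password fail* "for invalid user"

[risky]
search = password fail* NOT invalid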

We can build reports based on this information. If we take our initial search and pipe it to the stats command, doing a count by (i.e., for each) eventtype, it tells us how many events matching our search criteria fell into the less_risky category and how many fell into the risky category (Figure 24).

password fail* | stats count by eventtype

We can also get a visual display of this:

Figure 24: Events division in two groups

So, as we can see, Splunk supports some fault localization techniques, such as parsing the log files, creating specific reports and visualizing the test data in many different ways.

Splunk also allows us to create reports, schedule them to run at regular intervals and choose the time range of events to include.
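Scheduled reports are stored as saved searches; as a sketch (the report name, schedule and time range below are illustrative assumptions), a scheduled version of the report above could look like this in savedsearches.conf:

[failed_password_report]
search = password fail* | stats count by eventtype
enableSched = 1
cron_schedule = 0 6 * * *
dispatch.earliest_time = -24h
dispatch.latest_time = now

This would run the report every day at 06:00 over the last 24 hours of events.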

Another important feature in Splunk is the option to create dashboards. A dashboard is a collection of reports compiled into a single pane of glass, allowing us quick visual access to our data.

There is also an interesting feature called the Monitoring Console, which analyzes all the activity that Splunk itself performs. For example, it gives us information about operations performed per minute (Figure 25) and page faults per operation (Figure 26).


Figure 25: Number of Splunk operations per minute

As we can see from the figure, only command and query are included in the chart, since these were the only operation types used while parsing our log files. The maximum number of operations per minute occurred at [08/27/2020 23:25], with 116.6 operations for command and 80.4 operations for query. The minimum number of operations per minute occurred at [08/27/2020 23:20], with 67.2 operations for command and 45.2 operations for query.

As we can see from Figure 26, from [08/27/2020 22:20] to [08/28/2020 00:55] there occurred 4.72 page faults per operation, and from [08/28/2020 01:00] to [08/28/2020 02:20] there occurred 4.73 page faults per operation.

4.2 Loggly

The second tool we tested for our experimental evaluation is Loggly. Loggly is a Software as a Service (SaaS) solution for log data management [45]. It makes it possible to upload any kind of text-based log data and to track activity, analyze trends and gain insight from the log data. One of the reasons we decided to use Loggly for testing our logs is its similarity to Splunk. Like Splunk, Loggly clusters our data into events, gives every event a full-text index, and lets us parse out as many field-value pairs as we can, making searches more accurate and shortening the time to root cause. Another feature that makes Loggly similar to Splunk is the use of tags to form source groups, which helps in segmenting our data and narrowing down our search results.
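As a sketch of how a tag can narrow a search (the tag name and the quoted search term are hypothetical, and the exact syntax may differ between Loggly versions):

tag:"apache" "purchase failed"

Only events sent with that tag, i.e. belonging to that source group, would be searched for the quoted term.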

Loggly can parse many types of data automatically, and it provides features like statistical analysis on value fields, faceted search and filters. In case automatic parsing is impossible due to the type of the log, we can still do a full-text search over our logs. The data we used to test Loggly contains logs about web purchases. As we can see from the figures below, the search results include the timeline panel, the raw events (Figure 27) and the sidebar of fields extracted from the events (Figure 28). So, there is a similarity with Splunk so far.

We can see from Figure 28 that the fields are clustered into different categories: some of the data is parsed as Apache, another part is parsed as JSON, and so on. This is a feature that sets Loggly apart from Splunk. It is called the Dynamic Field Explorer; it not only parses the data automatically into different fields, but also groups the fields into different categories based on their type.

Figure 28: Fields sidebar in Loggly

There are 11 events which contain data about the failed purchases, and if we expand one of the events, we can see that all the fields are automatically parsed (Figure 30). We can also see from the figure that there is an option to view surrounding events, which are the events that happened before the current one. After we click it, the results show these events (Figure 31), and just three events back there is a Java stack trace (exception, java ArrayList, 635). So, in a short amount of time, we managed to find both the failed purchases and their root cause.

Figure 30: Expanded event
