
Ett program för att upptäcka tekniska begränsningar
utifrån utdata från exekverade prestandatester

Mikael Östberg

Examensarbete, Master-nivå 30 hp
Examinator: Anne Håkansson
Handledare: Henrik Olsson

Software Engineering of Distributed Systems
ICT Skolan
KTH Kungliga Tekniska Högskolan
Stockholm, Sweden


Abstrakt

Före lansering av ny mjukvara är det viktigt att känna till mjukvarans begränsningar innan användarna gör det. Det kan vara både en tidskrävande och svår uppgift. En erfaren testare kan luta sig mot erfarenhet som pekpinne för var man kan börja titta i den stora mängden data från ett lasttest och avgöra dess begränsningar. Den här studien föreslår ett program som förenklar proceduren genom att upptäcka flaskhalsar med hjälp av schematiska regeldefinitioner som gör det möjligt att anpassa detektionsbeteendet efter domänen. Kombinerat med välkända algoritmer från signalbehandling, som lägger märke till förändringar i alla typer av rådata, kan programmet föreslå vilken typ av begränsning som finns i systemet. Pålitligheten hos programmet testas med hjälp av fyra olika experiment som använder rådata som innehåller flaskhalsar för CPU, minne eller nätverk, eller ingen flaskhals alls. Resultaten antyder att programmets pålitlighet motiverar fortsatta studier, eftersom det mycket ofta gissar rätt på vilken typ av begränsning som finns i systemet när en sådan är på plats. Dock är resultaten för det fjärde experimentet, när ingen flaskhals finns i systemet, mycket dåliga, vilket antyder att ett annat sätt att upptäcka avsaknaden av begränsningar behövs. Experimentet visar att metoden kan användas för att bygga tillägg eller funktioner som assisterar oerfarna lasttestare för mycket enkla begränsningar.


An application to detect technical limitations

by using data from performance test executions

Mikael Östberg

Master of Science Thesis

Academic Supervisor: Anne Håkansson
Industrial Supervisor: Henrik Olsson

Software Engineering of Distributed Systems
School of Information and Communication Technology

KTH Royal Institute of Technology
Stockholm, Sweden


Abstract

Before launching new software it is imperative to know the limits of the application before the users do. It can be both a time-consuming and a difficult task. A seasoned performance tester may rely on experience to know where to start looking in the great amounts of data from performance tests to detect those limits. This study implements and tests the reliability of an application by applying a model to simplify the procedure of detecting bottlenecks, with the help of a schema defining metric connections specific to a target application domain. Combined with well-known signal-processing algorithms for detecting changes in any raw data, the application can suggest what type of bottleneck is present in a system. The reliability of the application is assessed by four types of experiments carried out to detect the bottleneck from raw data containing bottlenecks of the types CPU, memory or network, or no bottleneck at all. The results suggest that the application's reliability motivates further study, since it presents a very strong ratio of correct guesses when a bottleneck is present within a system. However, the results for the fourth experiment, where no bottleneck is present in the system, are very bad, suggesting that a different model for detecting the absence of bottlenecks is needed. The experiment shows that the suggested method can be used to build add-ons or features that may assist inexperienced performance testers for very simple bottlenecks.


Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Goal
    1.4.1 Social benefits
    1.4.2 Ethics
    1.4.3 Sustainability
  1.5 The Hypothesis
  1.6 Method
    1.6.1 Investigation
    1.6.2 Development
    1.6.3 Experiment
  1.7 Delimitation
  1.8 Disposition
2 Automatic analysis of performance data
  2.1 Industrial thesis
  2.2 Performance testing
    2.2.1 Planning and preparation
    2.2.2 Setup
    2.2.3 Execution
    2.2.4 Analysis and test
  2.3 The data analysis process
    2.3.1 Average Response Times
    2.3.2 Transactions per second
    2.3.3 Server Health
  2.4 Measurements as a signal
  2.5 Signals
    2.5.1 Adaptive Filters
    2.5.2 Change Detectors
  2.6 Knowledge
  2.7 Background summary
3 Method
  3.1 The Scientific Method
  3.2 Qualitative or Quantitative research methods
  3.3 Philosophical Assumptions
  3.4 Research Question
  3.5 Research Method
  3.6 Research Approach
  3.7 Research Strategy
  3.8 Data Collection
  3.9 The Data Analysis
  3.10 Quality Assurance
4 Model Definition
  4.1 Formal Model definition
    4.1.1 Data Collection
    4.1.2 Adaptive Filtering
    4.1.3 Change Detection
    4.1.4 Relational Profiling
  4.2 Implementation of the model
  4.3 Testing the model
  4.4 Simulation Environment
  4.5 Data collection
    4.5.1 CPU Bottleneck
    4.5.2 Memory Bottleneck
    4.5.3 Network Bottleneck
    4.5.4 No bottleneck
    4.5.5 Bottlenecks in simulation
  4.6 Evaluation
5 HALP
  5.1 From raw to exposed data
  5.2 Data Model
  5.3 Filter Model
    5.3.1 Windowed Least Squares
    5.3.2 The Kalman Filter
  5.4 Detector Model
    5.4.1 CUSUM
    5.4.2 ACTIVITY
    5.4.3 PASSIVITY
  5.5 Profile Model
    5.5.1 Relations
    5.5.2 States
    5.5.3 Active profile during experiment
    5.5.4 Usage
  5.6 Summary
6 Experiment results
  6.1 Experiment
    6.1.1 Legend
    6.1.2 CPU Experiment
    6.1.3 Memory Experiment
    6.1.4 Network Experiment
    6.1.5 No Bottleneck Experiment
  6.2 Result Summary
    6.2.1 Tabular Summary
  6.3 Visual Summary
7 Conclusions
  7.1 Conclusion
  7.2 Discussion
  7.3 Result of experiment
    7.3.1 Disclaimer
  7.4 Further work
  7.5 Availability
8 References


List of Figures

2.1 Example load branching scenario
2.2 Memory bottleneck example
2.3 CPU bottleneck example
2.4 Network bottleneck example
2.5 Thermometer-example
2.6 CPU-example
2.7 Kalman adaptive filter
2.8 Relation between adaptive filters and change detectors
2.9 The bottleneck suggestion model
4.1 HALP Overview
4.2 Calculating password is an expensive operation that can simulate a CPU bottleneck
4.3 A memory-wasteful php-script simulates the memory bottleneck
4.4 Network limit configuration on VM simulates network Bottleneck
4.5 The whole experiment setup which is part of the evaluation method
4.6 Simple model of PoC
5.1 Data Model
5.2 Filter Model
5.3 Detector Model
5.4 Profile Model
5.5 Rules of Detection Example
5.6 Rules of Detection Example
6.1 Representation of the experiment setup of running one test using a predefined bottleneck X
6.2 The ratio of correct vs incorrect guesses for the CPU experiment
6.3 The ratio of correct vs incorrect guesses for the memory experiment
6.4 The ratio of correct vs incorrect guesses for the network experiment
6.5 The ratio of correct vs incorrect guesses for the baseline experiment
6.6 CPU Bottleneck results, 204 samples, CPU suggested in 89%
6.7 Memory bottleneck results, 139 samples, memory suggested in 95%
6.8 Network Bottleneck results, 190 samples, CPU suggested in 94%
6.9 No bottleneck results, 204 samples, CPU suggested in 89%
7.1 A graphical representation of a baseline-experiment-147 with focus on CPU data

List of Tables

1 Acronyms and common abbreviations
3 A measurement example
4 Legend for the profile table 5
5 Profiles used during experiment
6 HALP CPU results
7 HALP memory results
8 HALP network results
9 HALP baseline results
10 Summary of HALP experiment

List of Algorithms

1 Windowed Least Squares
2 The 1-dimensional Kalman Filter
3 The CUmulative SUM Detector (CUSUM)


Acronyms and common abbreviations

ART (Average Response Time): The average time for a request to turn into a response.

TPS (Transactions Per Second): The number of transactions passing through a system per second.

SUT (System Under Test): The system that is the receiving end of the load, used to verify some performance criteria about it.

CPU (Central Processing Unit): A common metric on the utilization of a system's resources.

RAM (Random Access Memory): Another common metric of a system's resources, in this case memory.

SWAP (Swap space memory): Another metric; swap space memory is an emergency solution for when the RAM is full, used to store application data.

PoC (Proof of Concept): A practical implementation that embodies a theory or an idea with an example.

PI (Performance Indicators): A general term for performance measurement through monitoring.

VU (Virtual Users): A virtual user passes through a set of operations of a load scenario, mimicking the behaviour of a real user.

CSV (Comma Separated Values): A common raw data format, which is commonly used to export data between applications.

VM (Virtual Machine): A virtual machine is an application that is able to run a guest operating system.

PoI (Points of Interest): A point of interest in this thesis is a set of data points or ranges that indicate some kind of conclusion.

Table 1: The acronyms in this table are commonly used throughout the study and can be kept as a reference while reading.


1 Introduction

No customer wants to wait in line for a service longer than needed, whether at the supermarket or when ordering something on the Internet. It is even worse if there is no information about how many customers are ahead in line; this can utterly ruin the feeling of good customer service. It is very unlikely that a user would even be aware that a website currently has 500 people continuously pressing F5 to update some web page. Users almost never know when a site is experiencing heavy traffic, and most of them will simply no longer bother if it is perceived to give no or slow responses [29, page 9].

A service has to be responsive. Otherwise it will hurt the business through decreased revenue and goodwill among users, and ultimately harm the company brand [20]. When it does not work, it turns into a problem for everyone involved: both for the business owners providing the service and for the users who only wish to take part in it. The web has turned into an infrastructure, and as with any other infrastructure it is expected to always work. Imagine the consequences if Google's search engine stopped responding because of a sudden and ridiculous increase in load. Only one hour of downtime could easily cost an internet company up to $500,000 [23]. Now, this would probably never happen, as Google has many countermeasures to denial-of-service attacks, such as throttling, to keep the uptime of the service at 100% [27]. With throttling, the response time increases slightly for each request, probably because the remote web server keeps the response a little longer each time as a counter-measure.

This is why there is a need for performance testing, because during development it is easy to forget functionality that limits excessive usage. Should the counter-measure not be in place, it would not be discovered in a development environment by other means than through a performance test.

Fighting for customers is always going to be an uphill battle on the web because it is so easy for the user to defect and start using another, more responsive service. The key to solving these performance problems is to make the service avoid consuming more resources than needed for the specific user that simply does not need it. Testing for these types of scenarios is the essence of the art of performance testing. A performance tester puts time into investigating an application's limits and breaking points [29]. Putting time into rewriting unnecessarily limited parts of a service, right about when it is supposed to be placed into production, requires solid motivation. Unfortunately this often happens just before launch of the software, since that is when most pieces will be in place and fully testable. But it is always better to discover any issues before the users do. The good users will tell you about any issues of a service; most of them will simply give it up and tell their friends it is a terrible service. This suggests that the most profitable way to investigate the limits of an application is during development, preferably before launch, and in the best case in the early stages of development. Performance testing should be included as a part of any iterative development process. As soon as there is anything to test, it should be tested [29].


Performance testing has a number of methods to ensure that the performance stays good and that bottlenecks are discovered in time as part of the process [20]. There are also a great number of tools available to assist in this procedure, for example HP LoadRunner [14], Apache JMeter [6], Gatling [33], LoadUI [31] and several others. These tools allow the tester to produce a simulated load scenario onto a service and to investigate how it behaves under these conditions. Some of these tools also monitor web servers, operating systems and database health in order to draw additional conclusions on what the bottleneck is when system resources reach their limits. Every system resource has breaking points and it is just a matter of which resource runs out of capacity first.

But all of these tools lack something. Currently, performance testing tools do not assist in the procedure of telling the user where the system bottleneck might be. They simply display the data to the user, which demands a certain amount of experience before being able to deduce anything useful about the data. This is something that adds to the steep learning curve of performance testing, and there are studies that suggest it is a hard endeavour [19]. It is not impossible, however; there are many methods of analysis that can be applied on large sets of data.

1.1 Background

For an academically young area of research there have been a number of studies building and refining the performance testing method, but none of them investigates the process of automating the analysis part of a test. Meier et al [20] defined the practical work a performance tester should perform from the start of an application's life-cycle to its end. Their process contains the following steps: identify the environment, identify the performance acceptance criteria, configure the environment, implement test scenarios and execute them, then analyze the data [20]. Meier's method definitions have stood the test of time rather well, with minor additions to the process from other authors such as Nivas [22]. The final step in Meier's process is result analysis, which is the task this study seeks to provide automated assistance with [20]. Rajesh et al [19] argue that when the test progress and results are monitored they also must be analyzed manually, which is known to be a time-consuming activity. However, the pressures of running a large number of tests with domain-specific output formats make it difficult to apply automated analysis tools at the time of the test.

A similar problem to automated analysis exists within the domain of software development, which today has a multitude of different code-analysis tools on the market. The demand for this type of analysis is huge, as software development is an industry with great growth and a sometimes inexperienced workforce. It then makes sense to provide tools that enforce paradigms and patterns that facilitate good design. If qualitative source code is viewed as the produced result of the developer, the produced result of the performance tester can be seen as conclusions about raw data from a load.

Such code-analysis tools provide suggestions on bad design for developers interested in avoiding the most common mistakes. None of the common load testing tools today provide help with this [6, 14, 31, 33]. Some of the tools can both produce load onto a service and consequently measure the performance of the system under test (SUT) [31]. But all of them leave it up to the tester to decide where the bottleneck might be located and especially what might be the cause. Usually it is left up to the tester to find third-party tools or manually monitor a system while the load test is running. Tools that do that and provide a meaningful connection between the monitored system and load test metrics are rare and usually very domain-specific [19]. It is also widely acknowledged by the industry that there is a need for better performance testing methods [21]. The scope of this study investigates a new method for automating the analysis of performance test data, under the condition that both system monitoring data and load test metrics are available, as in LoadUI [31]. The focus is on methods of analysis that can be automated using limited domain-specific knowledge and a common data format, such as comma separated values (CSV) [30].

1.2 Problem

Performance testing can be used in order to reach conclusions about an application's limits and weaknesses [29]. But it can be a long and repetitive process which can include reiteration of tedious parts like analysis [20]. None of the performance testing tools on the market today try to suggest where the user could start investigating the data or suggest possible causes. This is a problem that leads to a steep learning curve for the process. A method that would flatten that curve would need to be designed in adherence with the process as closely as possible. It would need to be generic enough to be applied to as many performance testing tools as possible and support input formats used by a majority of them. It should also allow an experienced performance tester with the corresponding domain knowledge to assist less experienced performance testers through the application. It should also be possible to apply a set of rules that can be customized to any specific domain being performance tested.

1.3 Purpose

This thesis presents a method to automate the analysis process of performance testing by applying adaptive filtering algorithms for pattern analysis in large data sets. The purpose of automating the analysis process is to enable indirect knowledge transfer between experienced and inexperienced performance testers. Its purpose is also to build a platform on which to construct rules for detecting potential bottlenecks. This is, in turn, useful for making the performance testing process more intuitive, appealing to new users and hopefully less repetitive.


1.4 Goal

The goal of this study is an implementation-ready, automated and configurable application with the ability to analyse raw performance testing data and find probable bottlenecks. It should also provide feedback to the user on why and where there might be a certain bottleneck. The benefits of the application are described in the following subsections.

1.4.1 Social benefits

The greater benefit of this application would be that the software industry would get more skilled performance testers. This in turn would lead to more stable online public services that do not crumble under a load that is higher than usual.

1.4.2 Ethics

There is an ethical aspect to using adaptive algorithms to suggest bottlenecks. Let's consider the case where a company adopts the software or the algorithms from this study but does not make the application configurable. The company that created the configuration would assume liability for that configuration being correct in every case marketed. An inexperienced performance tester may then take business decisions based upon a suggestion which can be a false positive. This can be avoided by enforcing that the configuration of a suggestive system must be done by an experienced performance tester that can pour domain knowledge into configuring a set of rules for the assistive tool.

1.4.3 Sustainability

The application furthers environmental sustainability indirectly by making sure that as few system resources and hardware components as possible are used to generate as much value as possible.

1.5 The Hypothesis

"Adaptive filters and change detector theory is unable to make correct assumptions on what type of bottleneck is present in a system based on its performance data. But if and only if more than 25% of the suggestions are incorrect"

1.6 Method

This study is divided into three separate phases: investigation, development and experimentation.

1.6.1 Investigation

The purpose of the investigation phase is to become familiar with the topic, find out whether it has already been investigated, and to find closely related work through a literature study. Another purpose is to find what methods of manual analysis are applied by performance testers today, in order to find what can be automated and how. Related studies are found by searching article databases such as Google Scholar [10] and the KTH Online Library [18] for keywords such as "load testing", "performance analysis" and "adaptive filters".

1.6.2 Development

The purpose of the development phase is to create an application. The proof-of-concept (PoC) application should serve suggestions about the input data simply by giving it raw data input from a real performance tool. Development within this study loosely follows an agile [8] process and makes use of a Git repository where the commit log paints the picture of the development history over the development months.

1.6.3 Experiment

A closed experiment is set up with virtual machines that can manufacture data containing symptoms, using load testing tools and inducing different kinds of bottlenecks. The output data is fed into the application, and the actual result that is produced is matched against the introduced bottlenecks. The experiment is performed with as large a sample size as possible and its success is matched against the metrics of the hypothesis.

1.7 Delimitation

Performance testing as a subject is wide and is often intertwined with many areas of development; the following limitations apply to this thesis.

• This work is aimed towards suggesting how useful data pattern recognition techniques are for producing correctly guessed points of interest in performance testing data.

• It is not work that seeks to suggest different detailed profiling techniques, but it does provide a customizable framework to implement any profiling technique wanted.

• This study is tool-agnostic and does not go in-depth into the tools on the market at this time. The performance testing tool LoadUI is used during the experiment but the tool itself is not evaluated in any way.

• The study is not a reinvention of the wheel; adaptive filters have been around for a long time, and the study does thereby not go into optimizations of any kind on the adaptive filters.

• The filters are reduced to their one-dimensional states and are not much more than their most basic implementations.


• It does, however, make suggestions on what algorithm to use for what type of profiling and strives to make suggestions on how to apply the knowledge in an enterprise testing tool.

• Change detectors can be used to restart filters to get better change detection results; this was not investigated in this study because it mostly concerns live monitoring systems.

• The implementation of data recognition is intended to be used live during an ongoing test, but this is not evaluated in this study.

1.8 Disposition

The structure of this thesis is as follows. Chapter 2, called Automatic analysis of performance data, provides a theoretical background to the study and the process it attempts to make easier. Chapter 3 describes the method; it provides an extensive description of the methodology used in this study, the hypothesis it attempts to reject and how the scientific method is applied through experiment. Chapter 4 describes the model behind the application produced in this study. Chapter 5 dives more in-depth into the architecture of the application, how it is built, the algorithms used and its usage. Chapter 6 describes an analysis which contains the result set from the experiment, discussion around the results and conclusions that can be drawn from the result set, and it summarizes by suggesting future work. Chapter 7 describes conclusions and further work.

2 Automatic analysis of performance data

This chapter serves as background information to the area of this study; it goes through performance testing as a process and how adaptive filters and change detectors can be applied for data pattern recognition.

2.1 Industrial thesis

First off, this thesis was performed at SmartBear Software, a software company in the software quality field. SmartBear has several products that are used to create both functional and performance testing scenarios with a focus on continuous integration. The performance testing tool LoadUI [31] is one of the potential products that could implement the method in this study. LoadUI primarily load tests Application Programming Interfaces (APIs) using an arrival-based virtual-user model. While SmartBear has many products that could benefit from the results of this study, the data collected could be useful for any company developing such tools.

2.2 Performance testing

The process involves a number of steps, referred to as the performance test cycle, which is explained below. It contains several subcategories of tasks, as described by Meier et al. and Rajesh et al. [19, 20], in sections 2.2.1 to 2.2.4.

2.2.1 Planning and preparation

The first stages are all about long-term planning and preparation of the performance test. This is usually performed during the early stages of development, where many components of an application have not been completed yet.

• Identify the test environment, build an understanding of the system under test (SUT) and the similarities and differences to the production environment. Investigate if there are any hardware, software or configuration differences between them. The goal is to mimic the production environment as closely as possible. [20]

• Identify performance testing criteria, in order for the performance testing to be useful and to be able to detect potential threats to the software performance acceptance criteria. This means identifying what constraints exist on the production environment, such as average response time (ART), transactions per second (TPS) and resource-utilization constraints, and then making sure the test investigates those criteria. [20]

• Plan and design tests. Identify what key scenarios exist [20]. For example, how many users will proceed through the entire transaction in one scenario and how many are predicted to proceed no further and log out in the middle in another scenario [29]. Also, what is the average load expected on the system, and how is that tested and simulated on the test environment? Establish what metrics or data types will be measured and what test data will be gathered in any test reporting. Furthermore, make sure that the load applied is as realistic to real-world usage as possible. [2, 5]

Figure 2.1: Barber's site navigation model, where a load scenario with 20 virtual users connects to a site each second and does different probable operations in sequence.

2.2.2 Setup

At this stage it’s time to create, automate and scale up the tests to represent a production environment where the application handles real users as closely as possible. The closer the better.

• Configure the test environment. Prepare the test environment to mimic the properties of the current production environment as closely as possible, or estimations of it if it does not yet exist. This mimicking is best accomplished by making sure that for each user moving through each scenario, the same user is moving through the system and is given enough realistic think-time at each step [2]. A good rule of thumb is to keep the scenarios as close to a real physical user as possible; consider the scenario in figure 2.1, heavily inspired by Barber's navigation model [20, 29]. Configure chosen tools to monitor the many system utilizations or use the ones available inside the performance testing tool. [20]

• Implement the test design and create assertions, making sure they confirm or reject the performance testing criteria properly. Remember that some tools support asserting at run-time and also contain the functionality to stop the tests when a certain number of assertions have failed. By using run-time assertions the time spent testing can be reduced. [20]

2.2.3 Execution

This is the smallest but possibly also the longest step of the performance test cycle; for each iteration the performance tester returns to this step to benchmark and test a new release.

• Execute the test. Start all of the monitoring components. Also make sure that clocks are synchronized over the entire testing suite in order to properly correlate any causality between different testing components; this is obviously much easier if only one tool is applied. If assertions are manual, then make sure to execute all relevant test scenarios and monitor the system.

2.2.4 Analysis and test

The analysis is the part where the performance tester gathers key stakeholders and development teams together to analyze and compare results.

• Analyze results. Collect all the relevant data from all testing components and analyse it both as individual data series and from different points of view, preferably by including the whole development team. This procedure could be automated by asking the involved team to provide properties that would indicate specific bottlenecks. For example, high load on login operations causes the CPU of the resource server to go up because of an experimental password-hashing function. [20]

• Report. Consolidate which tests were within the performance testing criteria and which tests were not. Use the whole team to suggest probable causes why a test was not within the criteria. Attempt to identify any resource bottlenecks. Give additional time for developers to implement improvements where such may be needed and for system operators to increase capacity where limits are detected. The part of this step where the team suggests bottlenecks can also be automated given the proper knowledge base. [20]

• Test data correlation. Some argue that performance tests should be analysed from the point of view of the previous production environment if it exists. This means that where average performance data for previous versions exists, it should always be compared to the average of the last production environment, as the metrics seldom change [22].

• Retest. Repeat the tests that failed with suggested improvements and repeat the analysis process from that point on. This step restarts the process from execution (2.2.3), and this is why it is a process with a steep learning curve, because it takes much experience to coordinate a discussion around data from many domains. [20]

2.3 The data analysis process

It is a vital part of the process and one that is still mostly performed manually [19]. The data analysis is mostly performed by analysing execution logs and correlating with invalidated performance criteria [16, 20]. Attempts have been made to automate the process with the assistance of historical performance tests which take into account all performance metrics and match them to logs or request failures [16]. Most studies focus on post-test analysis and do not discuss the possibility of live feedback on a performance test that might trigger alarms during execution. It might be possible to pre-emptively define which performance metrics are connected to each other in order to define specific types of performance bottlenecks. From this knowledge it is possible to provide live feedback during a test when performance metrics are going bad and execution logs start throwing errors and failures. The data analysis process involves a number of essential components which are discussed further in this section.

2.3.1 Average Response Times

The average response time (ART) for requests is a great indicator of the user experience (UX) of the SUT [29]. An increasing ART is in itself a good indicator that the system is currently handling an increasing number of VUs. By itself it is not necessarily a problem. In order to draw any conclusions, information is needed on connected properties of the SUT, which is achieved by monitoring the system health. This is where a tester can start to connect the dots and see what resources are breaking the pattern of normal usage. Another indicator is the number of response times that invalidate the performance validation criteria; this can and should be an assertion at the end of a test and is already used thoroughly. Sorting the response times and assembling them into bucket line charts is another good method to measure how the system is performing compared to earlier tests [22].
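
To make the bucket idea concrete, here is a minimal sketch (not part of the thesis) of sorting response times into fixed-width buckets so that two test runs can be compared; the bucket width and the sample values are illustrative assumptions.

```python
from collections import Counter

def bucket_response_times(response_times_ms, bucket_width_ms=100):
    """Count how many response times fall into each fixed-width bucket."""
    buckets = Counter()
    for rt in response_times_ms:
        lower = int(rt // bucket_width_ms) * bucket_width_ms
        buckets[(lower, lower + bucket_width_ms)] += 1
    return dict(sorted(buckets.items()))

# Two runs of the same test: the second run has noticeably more slow responses.
run_a = [120, 180, 210, 95, 450, 130, 160]
run_b = [120, 180, 900, 95, 1450, 130, 860]
print(bucket_response_times(run_a))
print(bucket_response_times(run_b))
```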

2.3.2 Transactions per second

Transactions per second (TPS) is an aggregation of the number of completed requests per second. This is probably the second most important data indicator of a SUT [21]. It indicates the capacity of a system, and a drop in TPS may coincide with errors encountered by a virtual user, causing the series of requests to abruptly fail. This usually occurs when the server gets saturated with requests and the TPS limit is getting closer [29]. It is important that after a sudden increase of virtual users, i.e., a peak load scenario, the TPS recovers from a sudden drop of successful requests. If it does not recover, this is a strong indication of a connection between the arrival rate of users and TPS, and detecting this could be automated. This is usually referred to as soak or endurance testing.

2.3.3 Server Health

Investigating ART and TPS gives a good indication of a system's performance. But when the ART goes up, it is interesting to know what the slow spots are, what systematic bottlenecks might be present, and to find the application's limits [29]. The best way to find any type of root cause is to monitor the SUT. System monitors come in many forms, from Java Management Extensions (JMX) [17] to the Simple Network Management Protocol (SNMP) [25], [4], etc. The thing they have in common is that they monitor metrics that make up performance indicators, or PIs. A metric can be CPU load, virtual memory, number of active processes or SQL query processing time. They can all be used to discern what might be the possible cause of increased ART or insufficient TPS according to the performance criteria.

The following practical examples visualize this clearly. Assume that the ART for a request to a regular web-based service goes up at a given point in time; then you would need to investigate the reasons for this. The following conditions may apply, and they could in turn resolve into different conclusions about the potential resource bottleneck.

Figure 2.2: This chart shows no axis values, but demonstrates the structure of an example memory bottleneck, where swap increases suddenly after a steady increase of memory, along with a CPU increase.

• In figure 2.2 a severe virtual memory usage increase can be spotted just before the ART increase, along with 20 percent swap-space memory utilization. CPU is also maximized during the memory increase. Something is obviously consuming memory at a rate where the swap-space memory gets utilized. This has the direct consequence that much of the computational resources are put on moving data back and forth between virtual memory and swap. This combination concludes a memory bottleneck.

Figure 2.3: This chart shows no axis values, but demonstrates the structure of an example CPU bottleneck, where neither memory nor swap increases but CPU and ART do.

• In figure 2.3 a severe CPU load is noticed, but no notable increase in either memory or swap. Obviously a very CPU-intense task is consuming so much of the CPU time that requests take a longer time to process. This combination concludes a CPU bottleneck.

Figure 2.4: This chart demonstrates the structure of an example network bottleneck, where neither memory, swap nor CPU increases but ART does; we do not even need to monitor the network performance to come to that conclusion.

• In figure 2.4 ART goes up sporadically, but there are no apparent increases in throughput, CPU or memory load. This is most likely a network bottleneck, or alternatively a connection issue. Some other process might be consuming all the resources of the network card of the system. The combination of no detected resource utilization but increased ART concludes a possible network bottleneck.

These are suggested scenarios that could be automatically translated into points of interest for the inexperienced performance tester.
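
As an illustration only, the three signatures above could be expressed as a simple rule table; the boolean flags and the rule set below are hypothetical simplifications, not the rules used later in the thesis.

```python
def suggest_bottleneck(art_rising, cpu_high, memory_rising, swap_used):
    """Map the rule-of-thumb signatures from figures 2.2-2.4 to a suggestion.
    Each argument is a boolean saying whether that metric behaves abnormally
    during the interval under investigation."""
    if art_rising and memory_rising and swap_used and cpu_high:
        return "memory bottleneck"            # figure 2.2: memory, swap and CPU follow ART
    if art_rising and cpu_high and not memory_rising and not swap_used:
        return "CPU bottleneck"               # figure 2.3: only CPU follows ART
    if art_rising and not cpu_high and not memory_rising and not swap_used:
        return "possible network bottleneck"  # figure 2.4: ART rises but no resource does
    return "no suggestion"

print(suggest_bottleneck(art_rising=True, cpu_high=True, memory_rising=False, swap_used=False))
```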


Time     9  10  11  12  13  14  15  16  17  18  19  20  21  22
Degrees  10 15  20  22  26  28  26  22  22  17  14  10  10  9

Table 2: A series of 14 measurements on a thermometer by time

2.4 Measurements as a signal

This leads to the question of how the method of data analysis and its indications can be automated. One way is to expect data to follow a pattern, and then provide means to detect where the data no longer meets that pattern. This is a very common problem in the field of signal processing, because a signal usually contains noise. In order to distinguish the real signal from the noise, filters are applied to filter out the noise and provide the expected or real signal [12, 13]. A signal can be described as a function that conveys information about the behaviour of a system or attributes of some phenomenon [13, 26]. When considering signals it is easy to get stuck with the idea of electrical impulses from measurement devices governing some input channel. But a signal is a very general term. Let's reduce the signal to a function of an independent variable, like time, yielding some measurement. Consider a thermometer for detecting the temperature, shown in figure 2.5 with the data in table 2. Each measurement is recorded at hourly intervals along with the time.

Figure 2.5: Representation of thermometer table

Compare this to a measurement of the usage of a processor core, also based on time; in this case time is measured every five seconds. See figure 2.6 and table 3.


Time  1   2   3   4    5   6   7   8   9   10  11  12  13  14
CPU   10  45  60  100  88  66  77  68  45  66  77  79  40  29

Table 3: A series of 14 measurements on the usage of a CPU core by time

Figure 2.6: Representation of CPU activity during 14 seconds

When reduced to a series of numbers based on time, there is nothing that distinguishes one measurement source from another; both are just signals. Another way to describe them is as data streams. They only reside on completely separate levels of software and hardware. This suggests that algorithms and methods that have been developed to extract information from such signals should also be applicable to measurements of potential PIs.
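
A minimal sketch of that point: both tables reduce to the same (time, value) representation and can be fed to the same estimator. The running-mean estimator below is a stand-in chosen for brevity, not an algorithm from the thesis.

```python
# Both series from tables 2 and 3 reduce to the same shape: a list of (time, value) pairs.
thermometer = list(zip(range(9, 23), [10, 15, 20, 22, 26, 28, 26, 22, 22, 17, 14, 10, 10, 9]))
cpu_core = list(zip(range(1, 15), [10, 45, 60, 100, 88, 66, 77, 68, 45, 66, 77, 79, 40, 29]))

def running_mean(series):
    """A trivial estimator that accepts any (time, value) stream, regardless of its origin."""
    total, out = 0.0, []
    for i, (_, value) in enumerate(series, start=1):
        total += value
        out.append(total / i)
    return out

print(running_mean(thermometer)[-1])  # the same code handles temperature ...
print(running_mean(cpu_core)[-1])     # ... and CPU utilization alike
```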

2.5 Signals

A signal or measurement can be viewed as a data point at a point t in time, which is a combination of a real signal $\theta_t$ and noise $e_t$, see equation 2.1.

$y_t = \theta_t + e_t \quad (2.1)$

2.5.1 Adaptive Filters

Consider a data stream where measurements or values are emitted based on a process that is time-varying. If you want a good estimate of what the process mean is at any given point in time, adaptive filters can solve that problem. The mean is estimated upon each new measurement of the process, delivering an estimation of the probable next measurement value by using the mean of the process [13]. Historically, adaptive filters have been implemented in projects such as the Apollo space program to solve trajectory estimation and control problems for sending astronauts to the Moon and back [11]. But this study is not about sending people to the Moon; it is about detecting changes in a system, which is the subcategory of adaptive filtering called adaptive estimation [12, 26]. Signal estimation is the task of following the pattern of a signal. Its purpose is to use the behaviour of previous measurements in order to predict what the next measurement should be if the pattern is followed.

Figure 2.7: The Kalman adaptive filter applied to a series of measurements on voltage

Adaptive filtering is also commonly used in surveillance systems to detect pattern changes by using the difference between the expected measurement and the actual measurement [12, 13, p. 5]. This is visualized by figure 2.7, where a one-dimensional Kalman filter is applied to a series of measurements on a voltage outlet. This voltage outlet is unstable and measurements are flaky at best, fluctuating between 500 V and 250 V, while the source emits a signal of 400 volts. It can be seen how the filter starts to adapt to the noisy environment, putting itself at an estimated average of the series, approximating quickly at first and then getting more and more stable at 400 volts as more measurements are adapted to, eventually ending up at the actual signal. The mean value is the next most likely measurement and is often used in adaptive estimation [12, p. 9].
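
For illustration, a minimal one-dimensional Kalman filter in the spirit of the voltage example might look as follows; the noise variances, the initial uncertainty and the simulated 400 V source are assumptions made for this sketch, not the thesis's implementation.

```python
import random

def kalman_1d(measurements, process_var=1e-4, measurement_var=7500.0,
              initial_estimate=0.0, initial_error=1e6):
    """Minimal one-dimensional Kalman filter estimating a (nearly) constant level
    from noisy measurements."""
    estimate, error = initial_estimate, initial_error
    estimates = []
    for z in measurements:
        # Predict: the level is modelled as constant, so only the uncertainty grows.
        error += process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        gain = error / (error + measurement_var)
        estimate += gain * (z - estimate)
        error *= (1 - gain)
        estimates.append(estimate)
    return estimates

# A flaky source that should emit 400 V but fluctuates wildly, as in the figure 2.7 example.
# measurement_var is chosen to roughly match the variance of the simulated noise.
random.seed(1)
noisy = [400 + random.uniform(-150, 150) for _ in range(200)]
print(kalman_1d(noisy)[-1])  # settles close to 400
```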

Figure 2.8: A representation of the relation between an adaptive filter and a change detector

2.5.2 Change Detectors

The method of using that residual is called change detection [12, p. 17]. The relation is made clearer by figure 2.8. Filters are also highly configurable and can be adapted to react faster or slower to the measurements to achieve more or fewer detections [12, 13, p. 24]. By analysing several change detections that happen to occur concurrently, it is possible to detect patterns that break any formalized set of rules. One detector for each type of statistic can potentially detect when the data is breaking the detection rules during a performance test. By using that data it could be possible to create points of interest based on multiple detections. The system can suggest where the rules of detection (RoD) are broken or the performance testing criteria are no longer met. From this step it is an easy measure to relate different detections with each other to start to build points of interest (PoI).
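
As a sketch of the idea, a CUSUM-style detector operating on residuals could look like this; the threshold and drift parameters are illustrative and not the values used in the study.

```python
def cusum(residuals, threshold=5.0, drift=0.5):
    """CUSUM-style change detector: accumulate positive and negative residual drift
    and raise an alarm whenever either sum exceeds the threshold."""
    pos = neg = 0.0
    alarms = []
    for i, r in enumerate(residuals):
        pos = max(0.0, pos + r - drift)
        neg = max(0.0, neg - r - drift)
        if pos > threshold or neg > threshold:
            alarms.append(i)
            pos = neg = 0.0  # restart the sums after an alarm
    return alarms

# Residuals hover around zero until the underlying signal changes level around index 5.
residuals = [0.2, -0.1, 0.3, 0.0, -0.2, 2.5, 2.8, 3.1, 2.9, 3.0]
print(cusum(residuals))  # [7]: the alarm comes a couple of samples after the change
```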

2.6 Knowledge

From these different technologies, a model for suggesting possible system bottlenecks can be constructed, relying on filtering statistic data to obtain residuals. The data is then passed through the change detector selected for that statistic type according to the RoD. The goal is then to gather all detections and pass them through the RoD, which compiles the data according to a specific bottleneck profile and produces points of interest based on it. This model is illustrated extensively by figure 2.9.
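
A compressed, hypothetical sketch of that pipeline (filter residuals, per-statistic detections, then a rule of detection combining them) is shown below; the moving-average predictor, the threshold detector and the CPU rule are simplifications chosen for the example, not the components of the actual model.

```python
def residuals(values, window=3):
    """Residual = |prediction - actual|, where the prediction is the mean of the
    last `window` values (a stand-in for an adaptive filter)."""
    out = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        prediction = sum(history) / len(history) if history else v
        out.append(abs(prediction - v))
    return out

def detections(values, threshold, window=3):
    """One boolean per timestamp: True where the residual breaks the threshold."""
    return [r > threshold for r in residuals(values, window)]

# A rule of detection (RoD) for a hypothetical CPU profile: CPU detects while memory does not.
cpu    = [20, 22, 21, 23, 90, 95, 96, 94]
memory = [40, 41, 40, 42, 41, 40, 42, 41]
cpu_det, mem_det = detections(cpu, 10), detections(memory, 10)
points_of_interest = [t for t, (c, m) in enumerate(zip(cpu_det, mem_det)) if c and not m]
print(points_of_interest)  # [4, 5, 6]
```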


2.7 Background summary

In this chapter, performance testing as a process has been discussed and linked to how it can be strengthened with the help of adaptive filters and change detectors. By doing that, a wealth of knowledge may be produced that could perhaps give some insight as to what bottleneck might be present in a system.

3 Method

This chapter is an outline of how the scientific method is applied throughout this study. It gives an overview of the scientific methods available and which are selected to test the work.

3.1 The Scientific Method

The scientific method may be reduced to four steps [32]. This study makes an effort to follow the scientific method in its entirety.

1. Formulate a question or define a problem (described in the introduction in chapter 1.2)

2. Perform background research into the problem domain and construct a hypothesis around it

3. Test the hypothesis through controlled experiment

4. Analyse the data and draw conclusions around them

3.2 Qualitative or Quantitative research methods

Qualitative research is a research method that is primarily exploratory, used for new fields. It is most commonly used to get an understanding of the reasons behind a phenomenon and motivate further work. Quantitative research is used to quantify a phenomenon by generating numerical data to gain support for a theory. The goal of this study is to find out if the suggested model is reliable and valid. In order to do that an application is built, and applications make it easy to produce automated simulations and get a large amount of data to evaluate reliability. Since it is possible to create large data sets, the research method selected is quantitative research over qualitative research.

3.3 Philosophical Assumptions

There are four major core philosophical assumptions [9]: positivism, realism, interpretivism and criticalism. Interpretivism and criticalism may be discarded as assumptions for this study as they are based on opinions and experience of users of a phenomenon. In this study we are unable to reach such a number of users to critique or interpret the application, as they simply are very few and hard to get a hold of, while it would be possible given the time. It would also conflict somewhat with the stated research method being quantitative research. We are then left with realism and positivism. Realism is an assumption that things are, without being perceived to be. A realist observes a phenomenon and gathers data from that observation to create an understanding of it. Positivism, on the other hand, states that reality is objectively given, independently of interpretation and measurements. It is primarily used in experimental settings where variables are defined and tweaked to produce an expected behavior. Both methods are particularly useful for ICT research, but this study applies positivism based on the experimental and tweaking nature of the study.

3.4 Research Question

The problem is stated around the hardships of using a load testing tool, and the study suggests a solution for how to make it an easier task. Yet this study does not quantify how much easier or more confusing the solution is to a user. It strives to find whether the solution is correct and can be used reliably. This suggests that the suggested solution should have its reliability tested instead of evaluated. Performing evaluational work against users would be an interesting topic for future work, which would be dependent on the reliability found by this study, but that is not covered by this work. Since the method suggested is new work, based on two rather disjoint fields, its validity must be tested by experimentation.

3.5 Research Method

Many research methods exist, yet those that could be relevant for this study are fewer. Applied research [9] and experimental research [9] are the only ones applicable to this study because of the lack of existing literature on the model under test and the unavailability of performance testers. Applied research seeks to answer specific questions or solve known and practical problems by developing an application that solves the problem. Experimental research, on the other hand, studies causes and effects by tweaking variables to find relationships therein. If this study were investigating exactly what kind of relational profile configuration was applicable for each scenario, then experimental research would be suitable. But since the application and its reliability in solving a problem are the focus here, the selected research method is applied research.

3.6 Research Approach

The research approach is used for how conclusions are drawn and to establish what is deemed a success or a failure. There are three commonly used approaches: the inductive, deductive and abductive approaches [9]. The inductive approach is not particularly useful in this study since it uses propositions and theories to explain patterns and results; the outcome is judged based on behavior, opinions and experiences, which again is hard to do without test subjects. The abductive approach is a combination of the two, where an incomplete set of data or observations may infer conclusions. This approach can be discarded since the dataset is not incomplete. The deductive reasoning approach is the approach of testing theories against a hypothesis to verify or falsify it. Since data can be created and success is measured against a predefined set of metrics, this approach is the best to validate whether the application can be trusted or not. The deductive approach is selected for the study.

3.7 Research Strategy

A research strategy provides guidelines for how research is performed. Again there are several to choose from, but only experimental research, ex post facto research, surveys and case studies apply to quantitative research, which is already selected for this study. Surveys and case studies assume that we have a number of test subjects to fill them out, which are not readily available. This leaves experimental research and ex post facto research to choose from. Experimental research uses an experiment strategy to control all factors of an experiment. This method is useful when the amount of data is enormous and the variables are controllable. Ex post facto is similar to the experimental strategy but without the control aspect, in that conclusions are drawn without tinkering with the experimental variables beforehand. Ex post facto is useful when reproducing an experiment to validate its results or if it is impossible to control the variables of an experimental setup. As this study is primarily about experimentation through an application created as part of applied research, the experimental strategy is used.

3.8 Data Collection

Data for this study is collected primarily from executions of an application. The data is produced by a load testing tool and injected into the application to produce a result. This result can be verified for its validity. The experiment target must be a controlled virtualized environment that is similar to a double-blind experiment in that it eliminates all kinds of human error. This has to be true both for the creation of load testing results based on the controlled environment and for the work performed by the suggested solution. In this environment the constant variable would be the application that performs the work suggested by this study, without modifying any configuration while running the experiment. The parameter under investigation would be what type of problem the solution suggests that a service experiences. It would also be imperative to know where to draw the line for a reliable solution. For this study, being the first of its kind, we put the metric for motivating further work on the topic reasonably low, at a 75% success ratio, and a total success at above a 95% success ratio. A total success would here motivate implementing the solution in a load testing tool. These metrics allow us to formulate a null hypothesis. Since the format of the experiment allows it, and since a large sample size is generally encouraged to achieve some statistical significance in the results, at least 800 experiments should be performed.


3.9 The Data Analysis

In order to test the reliability of the application as stated by the hypothesis, a comparison between the known input parameters and the actual output suggestions must be made and measured. To make sure that the suggestions are not only valid for one type of problem, the application is tested with several different types of input parameters as well as a baseline where nothing special should be detected. This is also known as a placebo test, in order to make sure the application does not provide false positives. If the suggestion is identical to the configuration in the virtualized environment, then it should be treated as a successful result. If the suggestion is not identical to the configuration, then it should be treated as a failure. If a baseline configuration provides a suggestion at all, then this should be treated as a failure. If more than 25% of the experiments end up in failure, then the experiment as a whole should be treated as a failure. If less than 5% end up in failure, then the experiment is to be treated as a success. Anything in between those two metrics should be treated as a motivation for further work on the topic. The data analysis method is thereby statistical analysis of all the experiment samples.
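
The verdict rules above can be summarised in a few lines; this sketch simply encodes the stated 25% and 5% thresholds, and the sample numbers used in the example are made up.

```python
def experiment_verdict(total_samples, failures):
    """Encode the thresholds above: >25% failures rejects the approach,
    <5% failures is a total success, anything in between motivates further work."""
    ratio = failures / total_samples
    if ratio > 0.25:
        return "failure"
    if ratio < 0.05:
        return "success"
    return "motivates further work"

print(experiment_verdict(total_samples=800, failures=88))  # 11% failures -> motivates further work
```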

3.10 Quality Assurance

This study welcomes the idea of falsification [24] and strives to be easily replicated as a form of quality assurance. It looks to reject the hypothesis by finding enough data that suggests something out of the ordinary is happening. Any results presented are left open for falsification by anyone, and the source for the suggested application is published online.

4 Model Definition

4.1 Formal Model definition

Based on the discussion in section 2.4 that measurements may be treated as a signal, we may start constructing a model definition that treats performance test data as a signal, and create a model that defines what has to be implemented. The model is assumed to work with any kind of performance testing data.

4.1.1 Data Collection

Data is collected from files or streams of metrics to produce maps called Statistics S. A statistic is a map of numbers corresponding to individual measurements performed at specific points in time, as in equations 4.1 and 4.2.

$V = [V_1, V_2, \ldots, V_c, \ldots, V_N] \quad (4.1)$

$T = [T_1, T_2, \ldots, T_c, \ldots, T_N] \quad (4.2)$

A statistic may thereby be seen as a key-value map, as in equation 4.3, where a current timestamp yields a current measured value, as in equation 4.4.

$S = [T_1 : V_1, T_2 : V_2, \ldots, T_c : V_c, \ldots, T_N : V_N] \quad (4.3)$

$S(T_c) \rightarrow V_c \quad (4.4)$

This assumes that there exists a measured value at each time T, which may not be the case. If a value is missing, the missing values are extrapolated from the last existing measurement at $T_{last}$ to the next existing measurement at $T_{next}$. Measurements are normally taken each second. A statistic is then connected to one of several statistic types $S_{type}$, as defined explicitly by the user or the application.
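
A small sketch of how such a statistic could be represented and its missing seconds filled in; the use of linear interpolation between $T_{last}$ and $T_{next}$ is an assumption about how the extrapolation is done, not a detail taken from the thesis.

```python
def fill_missing_seconds(statistic):
    """Build a complete per-second map from a sparse {timestamp: value} statistic,
    linearly interpolating between the last and next existing measurements."""
    times = sorted(statistic)
    filled = {}
    for t_last, t_next in zip(times, times[1:]):
        v_last, v_next = statistic[t_last], statistic[t_next]
        span = t_next - t_last
        for t in range(t_last, t_next):
            filled[t] = v_last + (v_next - v_last) * (t - t_last) / span
    filled[times[-1]] = statistic[times[-1]]
    return filled

# A CPU statistic sampled roughly once per second, with second 3 missing.
sparse = {1: 10.0, 2: 45.0, 4: 100.0, 5: 88.0}
print(fill_missing_seconds(sparse))  # second 3 becomes 72.5, halfway between 45 and 100
```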

4.1.2 Adaptive Filtering

Adaptive filtering is an operation performed on a Statistic that produces a collection of predictions for the next measurement based on its previous measurements, as shown by equation 4.5. The predictions P may also be queried by a timestamp T, but will yield the prediction made at the timestamp that came before. By comparing the size of the difference between the actual measurements and the set of predictions we obtain a set of residuals R. The residual at a certain point in time is the difference between the expected measurement and the actual measurement, as seen in equation 4.6.

$F(S) \rightarrow P \quad (4.5)$

$R_n = |P(T_n) - S(T_n)| \quad (4.6)$

4.1.3 Change Detection

A change detection is an operation performed on a residual R. It applies an algorithm that deduces whether a change is detected or not, based on the current residual $R_n$ and the previous residuals $R_1, R_2, \ldots, R_{n-1}$, as seen in equation 4.7.

$C(R(T_n), [R_1, R_2, \ldots, R_{n-1}]) \rightarrow D_n$ (4.7)

$f(D) \rightarrow [\{start_1, stop_1\}, \ldots, \{start_n, stop_n\}]$ (4.8)

This collection of detections is transformed into a collection of objects, as in equation 4.8, that suggest ranges of timestamps where there is a detection according to the change detector.
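To make the detection step concrete, the sketch below shows a one-sided CUSUM-style detector operating on a residual series and emitting {start, stop} ranges as in equation 4.8. The drift and threshold parameters are illustrative placeholders, not the configuration actually used.

import java.util.ArrayList;
import java.util.List;

/** Illustrative one-sided CUSUM detector over a residual series. */
public final class CusumDetector {

    public record Range(long start, long stop) {}

    public static List<Range> detect(long[] timestamps, double[] residuals,
                                     double drift, double threshold) {
        List<Range> ranges = new ArrayList<>();
        double cusum = 0;
        Long openStart = null;
        for (int i = 0; i < residuals.length; i++) {
            // Accumulate residuals above the allowed drift; never go below zero.
            cusum = Math.max(0, cusum + residuals[i] - drift);
            boolean detecting = cusum > threshold;
            if (detecting && openStart == null) {
                openStart = timestamps[i];            // a change starts here
            } else if (!detecting && openStart != null) {
                ranges.add(new Range(openStart, timestamps[i]));
                openStart = null;                     // the change has ended
            }
        }
        if (openStart != null) {                      // still detecting at the end of the data
            ranges.add(new Range(openStart, timestamps[timestamps.length - 1]));
        }
        return ranges;
    }
}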

4.1.4 Relational Profiling

A relational profile is here used to describe how different statistics relate to each other. That a single statistic detects a change or an abnormal behaviour over some range is not enough to draw any conclusions from. It is when we connect these different detections with each other that patterns emerge. Consider three statistics X, Y and Z and a range of timestamps $T = \{T_1 .. T_n\}$. We want to find out where X and Y have an overlapping detection range but Z does not, as in equation 4.9. This yields what we call points of interest, P. We have chosen to describe these relational profiles as logical statements.

$P_t \leftarrow X_t \wedge Y_t \wedge \neg Z_t$ (4.9)

Another situation could be that X and Y trigger a detection when seen together, but if Z is detected then it is always a point of interest. We call such a statistic omnitrue, because it ensures a point of interest even when none of the other statistics are detecting.

$P_t \leftarrow (X_t \wedge Y_t) \vee Z_t$ (4.10)

These statements are then applied to all timestamps, using all statistics included in a profile, to yield a set of points of interest for that profile. A profile may thereby be used to test how many of the timestamps adhere to it. After all profiles have been applied, what we end up with is a likelihood for each of the supplied profiles. The profile with the most matching timestamps becomes the suggested bottleneck.
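As an illustration, the two example profiles above could be evaluated over per-timestamp detection flags as in the sketch below. In practice the profiles are defined in a schema of rules rather than hard-coded, so the code is only meant to show the principle.

import java.util.function.IntPredicate;

/** Illustrative evaluation of the relational profiles in equations 4.9 and 4.10. */
public final class RelationalProfiles {

    /** Equation 4.9: P_t holds when X and Y detect at t but Z does not. */
    public static IntPredicate overlapWithoutZ(boolean[] x, boolean[] y, boolean[] z) {
        return t -> x[t] && y[t] && !z[t];
    }

    /** Equation 4.10: Z is "omnitrue" and yields a point of interest on its own. */
    public static IntPredicate omnitrueZ(boolean[] x, boolean[] y, boolean[] z) {
        return t -> (x[t] && y[t]) || z[t];
    }

    /** Counts matching timestamps; the profile with the most matches is suggested. */
    public static int countPointsOfInterest(IntPredicate profile, int timestamps) {
        int hits = 0;
        for (int t = 0; t < timestamps; t++) {
            if (profile.test(t)) {
                hits++;
            }
        }
        return hits;
    }
}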



4.2 Implementation of the model

To test the suggested model of combining adaptive filters, change detectors and performance test data, an implementation of the model must be created that is able to run the test several times with different profiles activated, in order to detect which profile is the likeliest. It should then be verified whether that suggestion was correct or not. The implementation should embody a number of common adaptive filters and change detectors for different purposes. For this study two adaptive filters are selected, limiting the study to the most common algorithms: the well-known Kalman filter and windowed least squares (WLS), both of which are widely used [12]. Three change detectors are selected for implementation: cumulative sum (CUSUM), which is also widely used [12], and two simple detectors that detect activity or passivity within a certain value boundary. To investigate whether the model is useful for different types of technical restrictions, four relational profiles are selected: baseline, CPU, memory and network profiling, where baseline is a control group that applies the algorithms to data without a technical bottleneck.
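Since the Kalman filter is one of the two selected filters, a minimal one-dimensional version used as a one-step-ahead predictor is sketched below. The noise parameters are illustrative placeholders and do not correspond to the tuning actually used.

/** Minimal one-dimensional Kalman filter used as a one-step-ahead predictor. */
public final class ScalarKalmanFilter {

    private double estimate = 0;            // current state estimate
    private double errorCovariance = 1;     // uncertainty of the estimate
    private final double processNoise;      // Q: how much the signal is expected to drift
    private final double measurementNoise;  // R: how noisy the measurements are

    public ScalarKalmanFilter(double processNoise, double measurementNoise) {
        this.processNoise = processNoise;
        this.measurementNoise = measurementNoise;
    }

    /** Returns the prediction for this step, then updates the estimate with the measurement. */
    public double predictAndUpdate(double measurement) {
        // Predict: with a constant model the prediction is the previous estimate.
        double prediction = estimate;
        double predictedCovariance = errorCovariance + processNoise;

        // Update: blend the prediction with the new measurement using the Kalman gain.
        double gain = predictedCovariance / (predictedCovariance + measurementNoise);
        estimate = prediction + gain * (measurement - prediction);
        errorCovariance = (1 - gain) * predictedCovariance;

        return prediction;
    }
}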

4.3 Testing the model

The hypothesis is tested by experiment, as the research into automated detection for performance tests is insufficient to perform a deductive study. The model implementation can be put into a controlled experiment that produces different behavioral patterns indicating the typical signs of a single technical limitation. By knowing beforehand what technical limitation is present in the system, we can produce a performance test using a common performance testing tool. There are several performance testing tools to choose from, but LoadUI [31] is selected since it can both monitor the target system's health and produce metric data on requests sent to the system.



Figure 4.1: Overview of the modules that HALP consists of

4.4 Simulation Environment

Since the data is not drawn from real load test projects, it has to be manufactured, but it should still resemble the results of a real-world example. To get data resembling a real performance testing project, a virtual environment was set up and tested in the same way a load test would be. The environment consists of a virtual machine running the Ubuntu Server 12.04 Linux distribution with Net-SNMP monitoring enabled [25] [4]. It also hosts a stock phpBB bulletin board installed from the built-in aptitude package manager [28]. The bulletin board has not been configured beyond adding a basic user that can log in and check out the topics. Logging in and checking out the topics make up the baseline test scenario for this board [3]. Those actions upon the board were performed using a Geb script, a Groovy scripting dialect for browser automation that some load-testing tools support [7, 31]. The server showed no particular difficulty handling a load of ten virtual users per second performing the Geb operations. Bottlenecks are introduced by malicious PHP scripts that perform different wasteful operations on the service. They are introduced in parallel with a running baseline on the service and are from now on referred to as scenarios. Upon each scenario's completion the virtual machine was forcibly rebooted in order to avoid any consequences of previous scenarios carrying over. Once the service was operational again, a new scenario was executed with another bottleneck introduced to the service.

4.5 Data collection

Data was collected and load was produced with the help of a performance testing tool, running four different scenarios against a virtual machine that also runs the baseline service. The performance testing tool selected to produce the load was LoadUI Pro 2.6.6 [31]. The load testing tool was modified to export raw data automatically through command-line execution; this was achieved with the help of the LoadUI developers, who deserve thanks for it. Each scenario corresponds to a project created for use with LoadUI. The average response time of requests to the baseline service and properties of the server were monitored during all scenarios. Each scenario runs for approximately seven minutes. The scenarios can be regarded as the following factors of control in a scientific experiment. Scenarios #1, #2 and #3 are the positive control groups: detections of bottlenecks are expected to happen when using the PoC, and the test assertion fails when no bottleneck or the wrong type of bottleneck is detected. Scenario #4 is the negative control group: the assertion ensures the opposite of the other scenarios, and the test assertion fails if any type of resource bottleneck is detected in this control group.

4.5.1 CPU Bottleneck

A computationally heavy PHP script computes several password hashes. The PoC should detect a CPU bottleneck. For the experiment, the bottleneck shown in figure 4.2 is used.

Figure 4.2: Calculating password hashes is an expensive operation that can simulate a CPU bottleneck

4.5.2 Memory Bottleneck

A wasteful PHP script stores data in the virtual machine's RAM for a few seconds. The PoC should detect a memory bottleneck. For the experiment, the bottleneck shown in figure 4.3 is used.



Figure 4.3: A memory-wasteful php-script simulates the memory bottleneck

4.5.3 Network Bottleneck

The network bandwidth of the virtual machine is reduced stepwise from 100 Mbps to 100 kbps. The PoC should detect a network bottleneck. For the experiment, the bottleneck shown in figure 4.4 is used.

Figure 4.4: Network limit configuration on VM simulates network Bottleneck

4.5.4 No bottleneck

No bottleneck is introduced to the virtual machine; ten virtual users still arrive at the system under test each second for the full two minutes.

4.5.5 Bottlenecks in simulation

Virtual bottlenecks are introduced approximately 30 seconds into the test for all scenarios except scenario #4. The stop time of the bottleneck varies, as the server takes more or less time to recover depending on the bottleneck, and may sometimes not recover at all until the VM reboots. For example, the memory bottleneck wastes so much memory that the swap space is used, and since the virtual machine is tiny this takes a long time to recover from. Each scenario was performed in 200 iterations. Each resulting dataset is a raw data file in CSV format with a filename that contains the type of scenario that created it, so that it can later be determined how many times the PoC suggested the correct type of bottleneck.

4.6 Evaluation

In order to evaluate the correctness properties of the PoC, it is run on a battery of example data generated using the different bottleneck scenarios. For each individual dataset collected, the command-line runner is executed and writes its guesses to an accumulative CSV file. The PoC matches the data against all available profiles and uses the confidence level of each profile to judge which type of bottleneck is the most likely one. With the help of the accumulated data, the number of correct guesses can be assessed by comparing each guess to the filename of the data used as input. Each guess that matches the scenario the example data was generated with is counted as a success, and every other guess is treated as a failure. Standard deviation is calculated based on the sample size of each type of scenario. Should the number of correct guesses exceed 95% of the samples, the scenario is treated as a success; otherwise it is treated as a failure or as motivation for further work, in line with the criteria stated earlier. An overview of the entire experiment can be seen in figure 4.5.
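A minimal sketch of this tallying step is shown below. It assumes a two-column accumulative CSV of the form inputFileName,guessedBottleneck, where the input file name contains the scenario that produced it; the column layout is an assumption for illustration and not the actual output format.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Illustrative tally of accumulated guesses against the scenario encoded in the filename. */
public final class GuessTally {

    public static double successRatio(Path accumulatedCsv) throws IOException {
        List<String> lines = Files.readAllLines(accumulatedCsv);
        int correct = 0;
        int total = 0;
        for (String line : lines) {
            String[] columns = line.split(",");
            if (columns.length < 2) {
                continue;                                  // skip malformed rows
            }
            String inputFile = columns[0].toLowerCase();
            String guess = columns[1].trim().toLowerCase();
            total++;
            if (inputFile.contains(guess)) {               // the filename encodes the true scenario
                correct++;
            }
        }
        return total == 0 ? 0 : (double) correct / total;
    }
}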

Figure 4.5: Overview of the entire experiment

Figure 4.6: Very simplified PoC model showing how raw data is transformed into a list of points of interest (PoI) and a guess of the most likely bottleneck


5 HALP

This chapter describes the structure of the implementation of the solution, closely connected to the model described in chapter 4. The solution to the problem defined earlier is embodied by an application that will furthermore be referred to as Heuristic Assistance for Lazy Performance-testing (HALP). Its purpose is to apply adaptive filtering and change detection algorithms to real-world data parsed from performance tests, in order to enable drawing automated conclusions about them. Those conclusions are suggestions on where the user could start looking, and are useful for initially suggesting what kind of bottleneck the user is looking at.

5.1 From raw to exposed data

HALP is split into four different components called models. They are the interaction points between the main method and the object model. Each model is named closely after what it corresponds to in the theory, for better readability. The data handling has been reduced to working only with long variables in order to handle data originating from as many types of monitors as possible. This means that for all types of measurements, all values are rounded to the closest whole number. Since the filters adapt to any level of measurement it is good to support very high values. It should be noted that this imposes the limitation of not being able to handle decimal values.

HALP is written in Java, which allows it to run on any platform. It has been run on Linux, while during development it ran mostly on Windows 7. The choice of platform is a matter of usability; Java is a large platform and has many available points of interaction with other environments. The algorithms presented in this chapter are pseudo-code and should be straightforward to implement in any programming language.

All usage of HALP walks through the four models, or stages, described in 4.1 in order to produce useful results. The data model collects data from a source, assembles it into a list of statistics and sets the type of each statistic. The filter model can create filters and apply them to a statistic; when applied, it stores the results for that statistic for later use. The detector model applies a detector algorithm to the data that a filter has accumulated from a statistic and produces detection ranges. Detection ranges consist of a start and stop time corresponding to the rows of data in the original CSV, and depend on the type of filter and detector selected. The profile model depends on a resource that describes the rules of detection (RoD), which define all relations that a profile consists of. When a profile is applied to a number of detections, points of interest (PoI) are built on the ranges where all RoD are met. These points of interest are created for each profile and can easily be compared between profiles. The profile that has the largest and most numerous detections is suggested to be the current bottleneck. At this stage it is easy to inform the user why it is a probable bottleneck, since it is already defined by the rules of detection. If the PoI cover no more than a certain portion of the whole range, by default 10%, then HALP assumes there is no bottleneck.
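The final decision step could be sketched as follows, assuming that the points of interest of each profile have already been reduced to a coverage fraction of the whole run. The names are placeholders for readability rather than the actual API.

import java.util.Map;

/** Illustrative final decision: pick the best-covered profile or fall back to "no bottleneck". */
public final class BottleneckDecision {

    private static final double NO_BOTTLENECK_THRESHOLD = 0.10;   // default 10% of the run

    public static String decide(Map<String, Double> poiCoverageByProfile) {
        String best = "no bottleneck";
        double bestCoverage = 0;
        for (Map.Entry<String, Double> entry : poiCoverageByProfile.entrySet()) {
            if (entry.getValue() > bestCoverage) {
                bestCoverage = entry.getValue();
                best = entry.getKey();
            }
        }
        // If even the best profile covers too little of the run, assume no bottleneck.
        return bestCoverage < NO_BOTTLENECK_THRESHOLD ? "no bottleneck" : best;
    }
}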



5.2 Data Model

The data model is the smallest model and has no more purpose than collecting, parsing and storing data related to each statistic, as shown in figure 5.2. It is responsible for parsing data from a source and transforming the CSV file into data structures using a data cruncher, which is the HALP name for a CSV parser. Since the application has support for different kinds of data, a cruncher must also be built to transform the data into statistic data structures of the wanted types. The CSV data is parsed and crunched into data structures called statistics. The data is then placed into maps where the name of a statistic is the key, so it is easy to fetch for the other models.
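A simplified cruncher could be sketched as follows, assuming a CSV whose first column is a timestamp and whose remaining columns are monitored metrics. The column layout is an assumption made for illustration; in practice a cruncher is written per data source.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified sketch of a data cruncher that turns a CSV into per-statistic maps. */
public final class CsvCruncher {

    public static Map<String, TreeMap<Long, Long>> crunch(Path csv) throws IOException {
        List<String> lines = Files.readAllLines(csv);
        String[] header = lines.get(0).split(",");
        Map<String, TreeMap<Long, Long>> statistics = new HashMap<>();
        for (int col = 1; col < header.length; col++) {
            statistics.put(header[col].trim(), new TreeMap<>());
        }
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",");
            long timestamp = Long.parseLong(cells[0].trim());
            for (int col = 1; col < cells.length && col < header.length; col++) {
                // All measurements are rounded to the nearest whole number (long), as noted in 5.1.
                long value = Math.round(Double.parseDouble(cells[col].trim()));
                statistics.get(header[col].trim()).put(timestamp, value);
            }
        }
        return statistics;
    }
}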

