
Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Datateknik

2020 | LIU-IDA/LITH-EX-G--2020/012--SE

Development of a prototype framework for monitoring application events

Utveckling av ett monitoreringsramverk för applikationshändelser

Edvin Persson

Supervisor: George Osipov
Examiner: Peter Jonsson

Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Software rarely comes without maintenance after it is released. There can be bugs that were not caught during development, or performance that does not meet expectations; it is therefore crucial to be able to collect data from running software so that such issues can be addressed preemptively. A common way to monitor the general health of a system is to observe it from the users' perspective, so-called "black-box" monitoring. A more sophisticated analysis of software requires code that adds no functionality to the software itself; its purpose is to create data about the software. A common way of creating such data is through logging. While logging can be used in the general case, more specific solutions can offer an easier pipeline to work with, although they are less suited for tasks such as root-cause analysis. This study briefly looks at four different frameworks, each with a different approach to collecting and structuring data. The study also covers the development of a proof-of-concept framework that creates structured events through logging, along with an SQL Server database to store the event data.

Acknowledgments

I want to thank my supervisor at Linköping University, George Osipov, who helped throughout the process of writing this thesis by providing valuable feedback and guidance. I would also like to thank my supervisor at Sectra, Mattias Pettersson, whom I could ask any question and who brought interesting discussions on the subject. Finally, I would like to thank my family for their support while I was writing the thesis.

Contents

Abstract
Acknowledgments
Contents
List of Figures

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Background
1.4 Research questions

2 Theory
2.1 Black-box versus white-box monitoring
2.2 Detail of instrumentation
2.3 Events vs Metrics
2.4 Data collection overhead
2.5 Related frameworks
2.6 Metrics-based frameworks
2.7 Event-based frameworks
2.8 Client overhead comparison
2.9 Relational databases

3 Method
3.1 Preliminary investigation
3.2 Literature study
3.3 Design and implementation of a prototype
3.4 Evaluation method

4 Results
4.1 Summary
4.2 Project design
4.3 Design Client-side
4.4 Design backend
4.5 Implementation backend
4.6 Insert and query performance

5 Discussion
5.1 Results
5.2 Method
5.3 Ethical considerations

6 Conclusion

Bibliography

List of Figures

4.1 The project design
4.2 Function to log Event data in JSON format
4.3 Entity-Relationship diagram of the database
4.4 T-SQL procedure to create an event
4.5 T-SQL procedure to parse and insert an event from a JSON string
4.6 Query duration as the time range of events increases
4.7 Query to fetch the average daily request time of an event during the month of June
4.8 The time to insert records

1 Introduction

1.1 Motivation

Production software seldom behaves exactly as expected when released. To avoid issues first being reported by clients (or not being reported at all), software monitoring is essential. The number one priority should be tracking immediately crucial information, such as whether a server is unavailable. This type of data can often be collected from the user's perspective, e.g., via requests to the server's APIs. Further, more granular analysis of issues usually requires data created by the application's source code itself. As the creation of such data is not part of the software's original functionality, the overhead it implies is a consideration a developer must make. Furthermore, data created by developers is often written for humans to read (this is typically the case when logging), which makes analysis of the data a hard data-mining problem.

A dedicated API for developers that goes smoothly from the creation of data to the analysis of that data can mean a lot to an organization, enabling development teams to take a proactive approach to solving software issues after release. While a logging framework can work in the general case, enforcing a structure on the data can simplify the task of analyzing it down the line.

1.2 Aim

This study aims to develop a proof-of-concept framework to measure, automatically collect, and store data at runtime from a C# application. The tool is intended to gather data for analytical purposes rather than to be a general logging framework. Possible use cases for the tool are collecting application usage and performance data. As per request, the system should use Microsoft SQL Server as a central database. The study also seeks to identify what challenges the development of such a framework implies, as well as what possible techniques and solutions exist. There is no formal requirements specification for the framework; the purpose of the project was to investigate in what ways the data collection pipeline could be implemented.

1.3 Background

Company

The thesis was written at Sectra AB in the department of Local Development as a Bachelor’s thesis for Linköping University. Sectra is a Swedish medical technology company founded in 1978 in Linköping. The company is active in multiple countries.

Existing company monitoring solution

Sectra’s current monitoring tool is mainly used to proactively identify and prevent system performance issues by collecting and analyzing data collected from log files and Windows performance counters.

Workflow

The development team externalizes application data mostly by logging. As mentioned, the logs can be collected by the team's existing monitoring solution. Log entries that should be collected to a central server are distinguished by regular expressions (regexes). The regexes are written by developers and can be changed at runtime without recompiling any code. The monitoring software on the client side periodically scrapes the log files with the regexes and then pushes the matched data to the central server.
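To illustrate this workflow, the C# sketch below shows how a response time embedded in a log message might be picked out with a regex. The log format, message text, and regex are invented for illustration and do not come from the company's actual configuration.

using System.Text.RegularExpressions;

class LogScrapeExample
{
    static void Main()
    {
        // Hypothetical log line; the real log format at the company is not shown in the thesis.
        string line = "2020-03-12 10:41:07 INFO GetDemographics completed in 132 ms";

        // A regex of the kind a developer might register to extract request durations.
        var pattern = new Regex(@"GetDemographics completed in (?<ms>\d+) ms");

        Match m = pattern.Match(line);
        if (m.Success)
        {
            int durationMs = int.Parse(m.Groups["ms"].Value);
            System.Console.WriteLine($"Matched duration: {durationMs} ms");
        }
    }
}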

Problems with current workflow

1. To collect data from applications, developers need not only to log the data but also to write regexes that filter which data to collect from the logs. A separate data collection library would require only one step from developers.

2. Since the data in the logs is not structured in any particular way, data embedded in the log files becomes hard to aggregate for analytical purposes at the central server.

Possible separation

Right now, the application logs are mostly used for finding the root cause of bugs in the applications, and for support and maintenance. The system is also used for automatic alerting when a high number of matched log entries is detected, and it can be used to gather information embedded in the log messages, such as response times. A possible direction for the developed framework is therefore to keep logging for finding bugs and to use a separate framework for analytical purposes.

1.4 Research questions

1. How can runtime data be extracted from source code? Are there any alternatives to logging?

2. How can the data be modeled to allow comparisons and aggregations?

2 Theory

2.1 Black-box versus white-box monitoring

Distributed monitoring solutions can generally be classified into two categories: (1) black-box monitoring and (2) white-box monitoring [3]. The first approach instruments the underlying system, and no application source code is altered to gather information. The second approach is usually API-based, where a library provides a set of interfaces to instrument the application code itself. It gives developers the ability to gather information about the system running in production. An important part of such a library is to automatically transfer the data to a central system so that the information can be compared across different dimensions. A typical set of dimensions added to the data could be the identity of a user or the version of the software from which the data was collected. There exist many commercial as well as open-source solutions for data collection, visualization, and automatic alerting. Depending on the use cases for the collected data, the data needs to arrive at the central server at different speeds.

2.2 Detail of instrumentation

Code can be instrumented at different levels of detail at runtime. One can, for example, measure the performance of every method in the code with techniques such as bytecode instrumentation; in the case of the Java programming language this can be done through the Java Virtual Machine Tool Interface. This has the benefit of not having to change the code base of the running application. While this technique can instrument code at a very detailed level, it produces too much data to collect and analyse. Often there are more specific operations that are of interest to measure and track, such as HTTP requests or navigation functions in the application.

2.3 Events vs Metrics

When externalizing data from the application, one can record every time an event of interest occurs (for example through logging). An alternative approach is to keep an aggregated state of the events in memory and scrape that state periodically from another thread. This means the system will not know, for example, the number of events of a certain type at every point in time. Instead, the number is kept in memory, and the value is only known at the scrape timestamps. This is referred to as metrics. Events are irregular data and metrics are regular data.

Use cases

Although metrics can be derived from event data, each technique is usually better suited for specific scenarios [11]. For irregular events, such as a user pressing a specific button in an application, it is better to let the running thread publish each occurrence. Keeping a counter in memory and scraping it every 5 seconds would generate unnecessary amounts of data at the backend. Metrics scraping is also not possible when the information from each individual data point is required, for example when monitoring bank transactions. A case where metrics would be better is counting each response code of a web server: it would not be necessary to write to an I/O device every time a request is made, especially if the requests are very frequent. Instead, increasing a counter in memory is more efficient.
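To make the distinction concrete, the C# sketch below contrasts the two styles: publishing a record per occurrence versus keeping an in-memory counter that a background timer scrapes periodically. All names and numbers are illustrative and not taken from any of the frameworks discussed.

using System;
using System.Collections.Concurrent;
using System.Threading;

static class EventsVersusMetrics
{
    // Event style: one record is published for every occurrence (here just printed).
    static void PublishEvent(string name) =>
        Console.WriteLine($"{DateTime.UtcNow:o} event={name}");

    // Metric style: an in-memory counter per metric name.
    static readonly ConcurrentDictionary<string, long> Counters = new ConcurrentDictionary<string, long>();

    static void IncrementCounter(string name) =>
        Counters.AddOrUpdate(name, 1, (_, v) => v + 1);

    static void Main()
    {
        // A rare, interesting event is published directly.
        PublishEvent("return_home_button");

        // A very frequent occurrence only bumps a counter in memory...
        for (int i = 0; i < 100000; i++)
            IncrementCounter("http_requests_total");

        // ...and a separate thread scrapes the aggregated state periodically.
        var scraper = new Timer(_ =>
        {
            foreach (var kvp in Counters)
                Console.WriteLine($"{DateTime.UtcNow:o} metric {kvp.Key}={kvp.Value}");
        }, null, 0, 5000);

        Thread.Sleep(6000); // let a scrape or two happen before exiting
        scraper.Dispose();
    }
}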

2.4 Data collection overhead

Since collecting event data and metrics is usually not part of an application's own logic, developers need to be wary of how the data collection affects the performance of the application and the client's machine. Apart from increasing execution time in the application's threads, some form of I/O is needed to persist the data, which can be a major factor in a system's performance [1]. In the case of logging and centralizing the data, many resources are affected, which consequently dictates what, and how much, data can be collected. The affected machine resources are:

• CPU: Execution time to store the data, especially if synchronous I/O is performed. If the I/O is asynchronous, the CPU is still affected.

• Network: Since the data should be centralised, the client's network is affected.

• Disk: Affected when a logging framework writes to disk.

• RAM: Allocation is involved, for example a temporary string when logging.

2.5 Related frameworks

There exist numerous related monitoring and data collection frameworks. Some of these offer full-stack support for collecting and analyzing data, from a front end in a language of choice, to collection, storage, and visualization/alerting.

2.6 Metrics-based frameworks

Ganglia

One of the earlier frameworks developed is Ganglia. It is a distributed monitoring system that monitors metrics such as network and CPU utilization, with automatic collection, visualization, and/or alerting. It is also possible to define new metrics to be collected from an application through a command-line tool. In the paper introducing Ganglia [15], the authors identified the key design challenges of a distributed monitoring system to be:


• Scalability: The system should scale with the number of monitored nodes (client machines).

• Robustness: As the number of nodes grows, network or node failures are inevitable and should be automatically localized.

• Extensibility: The system should allow extensions of what data is collected and of how the data is collected.

• Manageability: The system administration work needed should not scale linearly with the number of nodes.

• Portability: The system should be able to run on multiple operating systems and CPU architectures.

• Overhead: The system should incur low overhead on the nodes in terms of CPU, memory, I/O, and network bandwidth.

Ganglia data model

Ganglia data is formatted in XML. To reduce network overhead, data is transported in a binary XML format. Each metric has a name and a value, and users can choose the value type from a set of simple value types.

How Ganglia extracts runtime data

Each monitored Ganglia machine (node) keeps a time threshold for each metric, which is polled by a separate thread. For example, a metric for CPU utilization can be polled every second. The values of the metrics are only kept in a soft state in a lookup table. Each monitored machine multicasts its metrics to the other nodes in the network. This means that the full state of every node is kept in each node, improving redundancy but increasing network overhead; for example, if an unexpected shutdown happens, the last multicasted state still exists in the other nodes. Metrics other than the standard ones offered by Ganglia (network, CPU, etc.) can be pushed to the lookup table via a command-line tool. This enables developers to instrument their source code by creating metrics.

Prometheus

Another open-source metrics framework that is widely used today is Prometheus. It was initially started at SoundCloud in 2012 and was the second project incubated by the Cloud Native Computing Foundation. Due to its wide use, Prometheus has broad support: individuals and companies using the software often contribute integrations so that their software can be monitored, ranging from databases to embedded software. Prometheus also has wide support among programming languages, making it possible for developers to define custom metrics in their software.

Prometheus datamodel

A metric in the Prometheus API can be one of four standard types: a counter, a gauge, a histogram, or a summary. Creating new metric types is possible since the code is open source. Each metric has a name and optionally a set of labels. The labels enable Prometheus's dimensional data model: each version of a metric with the same name but different labels is a distinct metric [20]. An example of a metric could be a counter of the number of requests to a web server by a certain user. The metric is named http_requests_total, and the label User is set depending on who makes the request, for example user-123 or user-321. The counter for each user is kept in memory, and its value is incremented each time the server receives a request. The labels enable aggregations later on when analyzing the metrics; in this case, we can find out which user makes the most requests to the web server.
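The dimensional idea can be illustrated with a small conceptual C# sketch in which every distinct combination of metric name and label values becomes its own counter. This only illustrates the data model described above; it is not the Prometheus client library API.

using System;
using System.Collections.Concurrent;
using System.Linq;

class DimensionalCounters
{
    private readonly ConcurrentDictionary<string, long> _series = new ConcurrentDictionary<string, long>();

    public void Inc(string name, params (string Key, string Value)[] labels)
    {
        // Every (name, label values) combination is its own series, e.g. http_requests_total{User="user-123"}.
        string key = name + "{" + string.Join(",", labels.Select(l => $"{l.Key}=\"{l.Value}\"")) + "}";
        _series.AddOrUpdate(key, 1, (_, v) => v + 1);
    }

    public void Dump()
    {
        foreach (var kvp in _series)
            Console.WriteLine($"{kvp.Key} {kvp.Value}");
    }

    static void Main()
    {
        var counters = new DimensionalCounters();
        counters.Inc("http_requests_total", ("User", "user-123"));
        counters.Inc("http_requests_total", ("User", "user-123"));
        counters.Inc("http_requests_total", ("User", "user-321"));
        counters.Dump();
        // Output (order may vary):
        // http_requests_total{User="user-123"} 2
        // http_requests_total{User="user-321"} 1
    }
}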

How Prometheus extracts runtime data

Prometheus is not an event-driven monitoring solution. Instead, an aggregated state of the data is kept in memory, and its values are scraped by a separate thread. The clients expose their metrics through an HTTP endpoint, by convention the /metrics path, from which the Prometheus server pulls the metrics. Alternatively, if exposing the metrics is not possible, Prometheus clients can push metrics to the server [21].

2.7 Event-based frameworks

StatsD

StatsD is a daemon for aggregating and summarizing application metrics. It was developed and released by Etsy and has quickly gained popularity since. There exist numerous language-specific client libraries for sending data to the daemon.

StatsD datamodel

The StatsD daemon can receive event data in the form of a counter, a timing (a measurement), a gauge, or a set (StatsD keeps each unique value received). Each received value has the form:

<Name>:<value>|<type>

For example, a measurement of a user's login session duration could be sent like this:

login_duration:200|timing

Compared to Prometheus, the data does not have labels; therefore, it can be hard to make comparisons when analyzing the data (although one can embed labels in the name to a certain extent).

How StatsD extracts runtime data

Instead of keeping the data in memory, the client pushes each event's data via UDP or TCP, and the StatsD daemon then aggregates the data. If the StatsD daemon received a counter:

http_requests_total:1|Counter

the counter at the backend would increase its value by one each time such a message is received. This differs from a metrics-based library, where the state of the counter is kept in the client's memory and the data actually transferred is the counter's in-memory value at that timestamp. That is, StatsD aggregates all incoming values at the backend.
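A client-side sketch of this push model in C# could look roughly as follows. The host name and port are placeholders, and the message uses the notation from the example above.

using System.Net.Sockets;
using System.Text;

class StatsdPushExample
{
    static void Main()
    {
        // The daemon conventionally listens on UDP; the host and port here are assumptions.
        using var udp = new UdpClient();

        // One small datagram per event occurrence; the daemon aggregates at the backend.
        // (Real StatsD clients abbreviate the type, e.g. "c" for a counter.)
        byte[] payload = Encoding.ASCII.GetBytes("http_requests_total:1|Counter");
        udp.Send(payload, payload.Length, "statsd.example.local", 8125);
    }
}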


The ELK (Elasticsearch, Logstash, Kibana) stack

An alternative approach is to derive metrics from event data in unstructured log files. A popular stack of libraries for aggregating the logs and visualizing the data is the "ELK" stack. Logstash is a log aggregator that collects, transforms, and enhances log data. The logs are then indexed by Elasticsearch, providing the ability to search them. Thereafter, the indexed data can be visualized with Kibana [26].

2.8 Client overhead comparison

As mentioned, several machine resources are affected when extracting data from an application: CPU, memory, network, and, in the case of centralizing log events, disk. Brian Brazil [5] discusses some of the differences in overhead between using a metrics-based library such as Prometheus and centralizing logs. One scenario is counting the requests to a server. If 10 servers each log 1 KB of data per request and each server serves 1,000 requests per second, that amounts to roughly 10 MB/s (about 80 Mbit/s), which would fill most of a 100 Mbit/s connection. A metrics-based approach such as Prometheus, in contrast, would allow on the order of 1,000,000 metrics per server with a scraping interval of 10 seconds within the same bandwidth.

2.9 Relational databases

A relational database is a collection of data items with predefined relations between them [22]. The data items are represented in sets of tables consisting of columns and rows. A column represents a certain kind of data and a field holds an actual value; a row is a collection of related values. Each row can have a primary key that uniquely identifies it, and foreign keys relate the tables to each other. Relational databases have been studied for a long time, with the term relational database first introduced in 1970 by Edgar Codd [7].

Transactions to a relational database are single units of work that access and possibly modify existing data in the database. To ensure the consistency of the database, the database should have the ACID properties [2]. These are:

• Atomicity: Each transaction should either take place in full or not happen at all.

• Consistency: Each transaction should take the database from one consistent state to another.

• Isolation: Transactions occurring concurrently should not interfere with each other; it should be as though the transactions were executed in isolation.

• Durability: Changes made to the database by a committed transaction must be permanent.

Database indexing

A database index is a structure used to accelerate retrieval of data from a database. One way to think about database indexes is as the index of a textbook [10]: we can search the index to find a term, and the index tells us on what page the term is located. The alternative is to search the whole book for the term; the equivalent in a database is a full table scan, which is a linear search. This is acceptable for small tables but quickly becomes too slow as tables grow. Index entries are ordered, so binary search can be used on them.


Microsoft SQL Server

Microsoft SQL Server is a relational database developed by Microsoft. The language used is called T-SQL, and all communication with SQL Server is done through T-SQL commands. No implementation of SQL fully follows the SQL language specification, including T-SQL [25]; most features in standard SQL are, however, included, and T-SQL adds nonstandard features of its own. To index data in SQL Server there exist several different types of indexes. A clustered index physically sorts the database items, while a non-clustered index keeps a sorted list of index entries, each with a pointer to the actual data item. More types of indexes, such as hash indexes or unique indexes, exist. Choosing which type of index to use is not easy and might require performance studies comparing indexing techniques. A useful tool is SQL Server's execution-plan tool [18], which can be used to see how the database searches for items when data is queried; if a JOIN operation is performed between two tables, one can see whether a linear or a binary search was performed to match the keys in the JOIN.

3 Method

3.1 Preliminary investigation

Firstly, an investigation was carried out to identify and state the problem with the current workflow. This was done by interviewing developers and the supervisor at Sectra. The questions asked were what they saw as the problem with the existing workflow and how it could be improved.

3.2 Literature study

After conducting the interviews, it was not clear in what direction development should start. Therefore a study was carried out on related frameworks and workflows for collecting runtime data. The studied literature consisted of blog posts on use cases and best practices of these frameworks, as well as blog posts comparing different monitoring solutions. In the case of Ganglia, the framework had been the subject of a study, and an associated paper was found [15], so its design and implementation could be studied easily. The more modern frameworks were usually started in-house at a company and later open-sourced; for these, no associated papers describing their design in detail were found, so the documentation and source code were studied instead.

Selection of frameworks to study

There were no initial criteria for which frameworks to study other than that they should be open source and that one should be able to define custom data to be collected from source code. Some monitoring frameworks only monitor specific parts of a system, e.g., networking. Since there exist many frameworks and time was limited, popularity became an additional criterion. Prometheus, StatsD, and the ELK stack all have thousands of stars on GitHub, and they represent different ways to collect data. Ganglia was chosen since a paper describing its design was found, which was relevant for the project.

3.3 Design and implementation of a prototype

The technologies used in the project, SQL Server and C#, were studied as the framework was designed. SQL Server has very good technical documentation [9] and tutorials provided by Microsoft, which were the main sources for SQL Server. Microsoft's documentation on the C# programming language [4] was likewise the main source for best practices when coding in C#. Stack Overflow was used to answer more specific questions.

3.4 Evaluation method

Evaluation of the project was done weekly with the supervisor at Sectra. The framework was designed and implemented in an iterative process.

4 Results

4.1 Summary

The result of this project is a structured way for developers to store event data in a JSON file on the client side. The developers can provide labels for each event, which makes comparisons of the data across arbitrary dimensions possible. A server-side program that parses the JSON file and inserts the data into SQL Server was implemented in C#. To visualize the data, tools such as Grafana can be used; Grafana can automatically query the database at a set interval, updating its graphs as more data is inserted. The transmission of data from the clients to the server was not implemented due to time constraints.

4.2 Project design

The project was designed to have three layers (overview in figure 4.1). The first layer logs event data to a JSON file, the second layer transmits and inserts data into the database via an intermediate server, and the third layer fetches the data for visualization and analysis. The transmission from the client to the server was not implemented due to time constraints. A simple solution could be to periodically transmit the entire file contents and then delete that data, although in that case the time before the server receives the data would likely be too long for the framework to be used for real-time alerting or visualization.

Figure 4.1: The project design

4.3 Design Client-side

The first decision was to save runtime data for measurements and events to disk via a logging framework; the library is therefore an event-based solution. Since multiple threads can save to disk at the same time, a logging framework does the work of synchronizing writes to the file.

Datamodel

On the client side, the design first included only an event's name and a help text. Since labels (tags) are an important part of how comparisons are made later on, the option to attach a set of tags to each event was added alongside the name and help text. JSON was chosen as the data format. The data logged after a button press could be saved like this:

{
    "Name": "return_home_button",
    "Help": "The user pressed return home button",
    "Timestamp": "2019-01-01 00:00:00",
    "Tags": {"host": "device-1", "product": "product-1", "user": "user-1", "pressed from": "settings menu"}
}

Optionally, a float value can be added to an event. With the addition of a value, measurements can be attached to each event. A measurement of the request time of an external API could look like this:

{
    "Name": "get_demographics_https_call",
    "Help": "Request time duration in MS",
    "Timestamp": "2019-01-01 00:00:00",
    "Value": 100.40964394892997,
    "Tags": {"host": "device-2", "product": "product-1"}
}

Implementation Client-side

To log the data in the model described above, one function for logging an event and one for logging a measurement were implemented (figure 4.2). The function for logging a measurement is the same, except that it also takes a float value as an argument and serializes it into the data format described above. To serialize the data to JSON, using a serialization library such as Newtonsoft was first considered. Since the performance of these functions is important, the JSON serialization was instead implemented manually using a C# StringBuilder and StringWriter, which improved the performance of the function considerably [17].

Figure 4.2: Function to log Event data in JSON format.

4.4 Design backend

An ER diagram (entity-relationship diagram) was used to design the database around the JSON data model from the clients. Each event gets an integer primary key that uniquely identifies it, so that other tables can refer to a unique event through foreign keys. To reduce data size and speed up queries, the database was normalized into two tables: instead of saving the name, help text, and tags for each occurrence of an event, only the timestamp and the potential value are kept in the occurrence table. Since there may be any number of tags, the tags in the event table are kept in JSON format, which SQL Server supports.

Figure 4.3: Entity-Relationship diagram of the database. The Tags are kept in JSON format but the datatype is the same as name and help text.
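Assuming table and column names that are not spelled out in the text, the two normalized tables of figure 4.3 might be declared roughly as follows:

-- A sketch of the two tables described above; names and types are assumptions.
CREATE TABLE Event (
    EventId  INT IDENTITY(1,1) PRIMARY KEY,
    Name     NVARCHAR(200) NOT NULL,
    Help     NVARCHAR(400) NOT NULL,
    Tags     NVARCHAR(MAX) NULL          -- tag set kept as a JSON string
);

CREATE TABLE Occurrence (
    OccurrenceId BIGINT IDENTITY(1,1) PRIMARY KEY,
    EventId      INT NOT NULL REFERENCES Event(EventId),
    Timestamp    DATETIME2 NOT NULL,
    Value        FLOAT NULL               -- optional measurement value
);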

4.5 Implementation backend

A T-SQL procedure was implemented to create an event (figure 4.4). The procedure returns the unique primary key of the inserted event. It first checks whether an event with the same name and set of tags already exists; in that case, the existing event ID is returned. Otherwise an insertion is performed and the new ID is returned via the SCOPE_IDENTITY function. Each insertion first has to call the CreateEvent procedure; thereafter the timestamp and value can be inserted into the occurrence table, with the foreign key being the value returned from CreateEvent.

Figure 4.4: T-SQL procedure to create an event
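Figure 4.4 is not reproduced in this text; a minimal T-SQL sketch matching the behaviour described above, with assumed parameter and column names, could look like this:

-- Sketch only: parameter and column names are assumptions.
CREATE PROCEDURE CreateEvent
    @Name NVARCHAR(200),
    @Help NVARCHAR(400),
    @Tags NVARCHAR(MAX)
AS
BEGIN
    DECLARE @EventId INT;

    -- Reuse the event if the same name and tag set already exist.
    SELECT @EventId = EventId FROM Event WHERE Name = @Name AND Tags = @Tags;

    IF @EventId IS NULL
    BEGIN
        INSERT INTO Event (Name, Help, Tags) VALUES (@Name, @Help, @Tags);
        SET @EventId = SCOPE_IDENTITY();
    END

    RETURN @EventId;   -- unique primary key of the event
END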

Another T-SQL procedure, InsertEventJson, was implemented that receives a string in JSON format, parses the value of each JSON key in an event object, and inserts the parsed values (figure 4.5). Clients can thus call the InsertEventJson procedure instead of parsing the values themselves before inserting them into the database. The required JSON values are parsed using T-SQL's strict mode, enforcing those values to be present; if they are not found, the insert fails. The optional tags and value data are not enforced. A C# command-line program was implemented that takes a JSON file and inserts all the events via the InsertEventJson T-SQL procedure.

Figure 4.5: T-SQL procedure to parse and insert an event from a JSON string
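Figure 4.5 is likewise not reproduced; a sketch of such a procedure, assuming the JSON property names from section 4.3 and the CreateEvent sketch above, might look like this:

-- Sketch only: procedure, column, and path names are assumptions.
CREATE PROCEDURE InsertEventJson
    @Json NVARCHAR(MAX)
AS
BEGIN
    DECLARE @EventId INT;

    -- Strict paths fail the insert if a required property is missing.
    DECLARE @Name NVARCHAR(200) = JSON_VALUE(@Json, 'strict $.Name');
    DECLARE @Help NVARCHAR(400) = JSON_VALUE(@Json, 'strict $.Help');
    DECLARE @Timestamp DATETIME2 = CAST(JSON_VALUE(@Json, 'strict $.Timestamp') AS DATETIME2);

    -- Optional properties use lax paths and may be NULL.
    DECLARE @Tags NVARCHAR(MAX) = JSON_QUERY(@Json, '$.Tags');
    DECLARE @Value FLOAT = CAST(JSON_VALUE(@Json, '$.Value') AS FLOAT);

    EXEC @EventId = CreateEvent @Name, @Help, @Tags;

    INSERT INTO Occurrence (EventId, Timestamp, Value)
    VALUES (@EventId, @Timestamp, @Value);
END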

4.6 Insert and query performance

To improve the performance of fetching data from SQL Server, indexes can be introduced. Experiments were performed to check which index works best. A typical query takes the daily average of measurements across time. The index that improved the execution time of this query the most was a multi-column non-clustered index on the Occurrence table, indexed first on the Occurrence table's foreign key and second on the timestamp. The results in figure 4.6 illustrate how the query time was improved for the query taking the daily average of an event's value (figure 4.7). The time range of events was increased to see how the query time grows with more data. The events' timestamps were uniformly distributed across the year, so linearly increasing the time range linearly increases the number of events returned.

Figure 4.6: Query duration as the time range of events increases, using the query in figure 4.7. The tests were performed with 20 million records of 200 unique events.
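Assuming the table and column names used in the sketches above, the described index might be created like this:

-- Composite non-clustered index: first on the foreign key, then on the timestamp.
CREATE NONCLUSTERED INDEX IX_Occurrence_EventId_Timestamp
    ON Occurrence (EventId, Timestamp);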

The insert performance of the database was tested by inserting event data concurrently from 10 threads, each thread inserting sequentially from the JSON file via the InsertEventJson T-SQL procedure. This simulates a real-world scenario where multiple clients insert data concurrently.
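A sketch of how such a test could be driven from C# with ADO.NET is shown below; the connection string, file name, and one-event-per-line file layout are assumptions.

using System.Data;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Sketch of the insert test described above: 10 workers, each inserting its share
// of JSON event lines through the InsertEventJson procedure.
class InsertBenchmark
{
    const string ConnectionString = "Server=localhost;Database=Events;Integrated Security=true";

    static void Main()
    {
        string[] lines = File.ReadAllLines("events.json"); // assumed: one event object per line

        // Split the work into 10 chunks, one per concurrent worker.
        var chunks = Enumerable.Range(0, 10)
            .Select(worker => lines.Where((_, i) => i % 10 == worker).ToList())
            .ToList();

        Task.WaitAll(chunks.Select(chunk => Task.Run(() =>
        {
            using var connection = new SqlConnection(ConnectionString);
            connection.Open();
            foreach (string json in chunk)
            {
                using var cmd = new SqlCommand("InsertEventJson", connection)
                {
                    CommandType = CommandType.StoredProcedure
                };
                cmd.Parameters.AddWithValue("@Json", json);
                cmd.ExecuteNonQuery();
            }
        })).ToArray());
    }
}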


Figure 4.7: Query to fetch the average daily request time of an event during the month of June. The JOIN operation is hidden since the query is performed on a view, vwEventOccurance, which joins the events table with the Occurrence table.
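Figure 4.7 itself is not reproduced here; a query of the kind it describes, with assumed event and column names, might look roughly like this:

-- Daily average of an event's value during June, read from the view.
-- The event name and column names are assumptions.
SELECT CAST(Timestamp AS date) AS Day,
       AVG(Value)              AS AvgRequestTimeMs
FROM   vwEventOccurance
WHERE  Name = 'get_demographics_https_call'
  AND  Timestamp >= '2019-06-01' AND Timestamp < '2019-07-01'
GROUP BY CAST(Timestamp AS date)
ORDER BY Day;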

Figure 4.8: The time to insert records. There is a small increase in insert time when indexing is used.

5 Discussion

5.1 Results

The client-side functions for storing event data are an easy way to enable analytics on an application; the data can then be aggregated at the server to form metrics. Code to transmit the data to a central server was not implemented, but one possible solution is simply to send the entire JSON file. Another solution is to send the data the way StatsD does: each time an event occurs, send it via UDP or TCP. A negative aspect of an event-based solution is that it suffers from the same network overhead for clients as collecting data from log files. A framework such as Prometheus could possibly be used instead to monitor more frequently occurring functions and bigger networks of machines.

The insert performance of the database is acceptable for small networks of machines, although if bigger networks with more frequent inserts were to be monitored, a NoSQL database that scales horizontally would possibly be better. For specific data such as the event data in this study, a time-series database such as InfluxDB could be used. The SQL language has, however, been used for a long time and is well understood, simplifying the task of fetching and analysing the data.

5.2 Method

There are many aspects of the methodology that could have been improved in hindsight, two of which are most important for future work. The first is the selection process for which related frameworks to study. This was done rather arbitrarily; a better approach would have been a systematic review of existing frameworks followed by categorization, which would reduce the risk of missing relevant open-source frameworks and make the study more reproducible. Secondly, the interview questions asked of developers in the preliminary investigation could have been better documented by writing down each question and answer. The development process at the company, with continuous meetings with the supervisor at Sectra, worked well and is suitable for a newly started project. A development process


such as SCRUM would be better suited if there were more developers and a clear direction of the implementation from the start.

5.3 Ethical considerations

A major ethical concern for developers arises when collecting user data. For example, a company might decide to collect data to create a more personalized experience of its product. There needs to be transparency towards the user about what data is collected and why, and companies and developers need to treat the privacy of the user and the security of the data as one of their highest priorities.

6 Conclusion

The purpose of this study was to develop a proof-of-concept framework that developers can use to collect data about running software. To gain a better understanding of different ways data can be structured and collected, a study of related frameworks was conducted.

Four different frameworks were studied, and a brief comparison of an event-based versus a metrics-based approach was carried out, mainly in terms of network overhead. The development resulted in a proof-of-concept framework that logs structured event data, along with a database to store the event data. Due to time constraints, the transfer of data was not implemented, but it could be investigated and developed in future work.

Answer to research questions

1. How can runtime data be extracted from source code? Are there any alternatives to logging? The studied frameworks had different solutions for extracting runtime data. StatsD does not log any data but sends data over TCP or UDP every time an event occurs. The ELK stack extracts runtime data via logging, similar to the solution at the company. The Prometheus server collects metrics data from clients over the network, but does not do so every time an event occurs; instead, runtime data is kept aggregated in memory on the client side and is only persisted to I/O at scrape time. A metrics-based solution like Prometheus scales much better when collecting data about frequently occurring events, since it has lower overhead for client machines (and it also scales better at the backend since the data is already aggregated), so more data can be extracted from applications. What is sacrificed is the ability to inspect every event's data; instead, the values (in the form of different metric types) are only known at the scraping timestamps. The prototype developed in this thesis extracts runtime data by logging event data to disk. Although no time was available to develop the transmission of data, this could be done by sending logs in a similar fashion to the ELK stack. Alternatively, one could skip logging events to disk and instead send them directly via UDP.

2. How can the data be modeled to allow comparisons and aggregations?

One way is to log unstructured, human-readable messages and later parse out the values needed for aggregations and visualisations; the ELK stack uses this approach. A problem with this approach is that, since no tags or other data that allow comparisons are encouraged, developers are likely to omit such data when logging.

StatsD models its events as different types, such as counters or timings. Each created event has a name which makes that type of event unique; this is how data is aggregated at the backend, with events of the same name having their values aggregated. The lack of labels does, however, restrict comparisons later on, since data such as a user identity cannot be embedded (one could embed dimensions in the name, but it quickly becomes hard to manage). Prometheus can embed dimensions in the form of labels, where each metric is unique by its name and labels. The labels make it possible to compare metrics across each unique dimension; for example, they allow comparing data from each unique user. For the prototype developed in this thesis, each event was structured similarly to how metrics are structured in Prometheus, with each event having a name and a set of labels, simplifying aggregations and comparisons when analysing the data.

Bibliography

[1] A. Silberschatz, P. Galvin, and G. Gagne. Operating System Concepts, International Student Version. 9th ed. Wiley, 2013, p. 603.

[2] ACID Properties in DBMS. Jan. 2019. URL: https://www.geeksforgeeks.org/acid-properties-in-dbms/.

[3] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc., 2016.

[4] BillWagner. C# Guide. URL: https://docs.microsoft.com/en-us/dotnet/csharp/.

[5] Brian Brazil. Logs and Metrics and Graphs, Oh My! URL: https://grafana.com/blog/2016/01/05/logs-and-metrics-and-graphs-oh-my/.

[6] Brian Brazil. Pushing Events or Metrics. URL: https://www.robustperception.io/which-kind-of-push-events-or-metrics. Accessed: 2019-08-25.

[7] Edgar F. Codd. "A relational model of data for large shared data banks". In: Communications of the ACM 13.6 (1970), pp. 377–387.

[8] Course 101: Instrumentation, Software Metrics, Monitoring, Alerts. July 2019. URL: https://pandorafms.com/blog/course-101/.

[9] Craigg-Msft. SQL Server Documentation - SQL Server. URL: https://docs.microsoft.com/en-us/sql/sql-server/sql-server-technical-documentation?view=sql-server-2017.

[10] Ramez Elmasri and Sham Navathe. Fundamentals of Database Systems. 7th ed. Pearson, 2017, p. 632.

[11] Katy Farmer. Metrics vs Events. URL: https://thenewstack.io/what-is-the-difference-between-metrics-and-events/. Accessed: 2019-07-15.

[12] Ganglia GitHub page. URL: https://github.com/ganglia.

[13] Yuvaraj Loganathan. Prometheus vs StatsD for metrics collection. Aug. 2018. URL: https://medium.com/@yuvarajl/prometheus-vs-statsd-for-metrics-collection-3b107ab1f60d.

[14] Logs or Metrics - A Conceptual Decision. Aug. 2018. URL: https://logz.io/blog/logs-or-metrics/.

[15] Matthew L. Massie, Brent N. Chun, and David E. Culler. "The ganglia distributed monitoring system: design, implementation, and experience". In: Parallel Computing 30.7 (2004), pp. 817–840. ISSN: 0167-8191. DOI: https://doi.org/10.1016/j.parco.2004.04.001. URL: http://www.sciencedirect.com/science/article/pii/S0167819104000535.

[16] Metrics vs. Logs Dilemma: Selecting the Right Platform for Monitoring Your Cloud Services (part 3 of 3). Jan. 2018. URL: https://www.wavefront.com/metrics-vs-logs-dilemma-series-3-3/.

[17] NewtonSoft Json serializer performance. URL: https://stackoverflow.com/questions/23183550/newtonsoft-json-serializer-performance/23185249.

[18] Ed Pollack. Query optimization techniques in SQL Server: the basics. Dec. 2018. URL: https://www.sqlshack.com/query-optimization-techniques-in-sql-server-the-basics/.

[19] Jovan Popovic. Inserting JSON Text into SQL Server Table. Mar. 2016. URL: https://www.codeproject.com/Articles/1087995/Inserting-JSON-Text-into-SQL-Server-Table.

[20] Prometheus Data Model. URL: https://prometheus.io/docs/concepts/data_model/. Accessed: 2019-08-01.

[21] Prometheus Overview. URL: https://prometheus.io/docs/introduction/overview/.

[22] Relational databases. URL: https://aws.amazon.com/relational-database/.

[23] Rothja. Query Processing Architecture Guide - SQL Server. URL: https://docs.microsoft.com/en-us/sql/relational-databases/query-processing-architecture-guide?view=sql-server-ver15.

[24] Using Microsoft SQL Server in Grafana. URL: https://grafana.com/docs/features/datasources/mssql/.

[25] Dorota Wdzięczna. T-SQL vs. Standard SQL: What's the Difference? Feb. 2019. URL: https://academy.vertabelo.com/blog/t-sql-vs-standard-sql-whats-the-difference/. Accessed: 2019-07-01.

[26] Tal Weiss. The 7 Log Management Tools Java Developers Should Know. URL: https://blog.overops.com/the-7-log-management-tools-you-need-to-know/. Accessed: 2019-08-15.

[27] Alex Zhitnitsky. 15 Tools to Use When Deploying Code to Production. Dec. 2014. URL: https://blog.overops.com/15-tools-to-use-when-deploying-code-to-production/. Accessed: 2019-07-20.
