
UPTEC IT 16 012

Degree project, 30 credits, November 2016

Anomaly Detection in Console Logs

Jonas Samuelsson

Institutionen för informationsteknologi


Abstract

Anomaly Detection in Console Logs

Jonas Samuelsson

The overall purpose of this project was to find anomalies in unstructured console logs. The logs were generated by system components in a contact center, specifically the components involved in an email chain. An anomaly is behaviour that can be described as abnormal. Such behaviour was found by creating features from the data that could later be analysed by a data mining model. The mining model involved the use of normalisation methods together with different distance functions. The algorithms used to generate results on the prepared data were DBSCAN, Local Outlier Factor, and k-NN Global Anomaly Score. Every algorithm was combined with two different normalisation methods, namely Min-Max and Z-transformation normalisation. The six experiments yielded three data points that could be considered anomalies. Further inspection of the data showed that the anomalies could be divided into two types: system-related and user-behaviour-related. Two of the three algorithms assign an anomaly score to each data point, whereas the third assigns a binary anomaly value. All six experiments in this project had a common denominator: two data points could be classified as anomalies in all six experiments.

Subject reader: Justin Pearson. Supervisor: Jonas Västibacken.


Popular Scientific Summary

Contact centers are software platforms that companies, often larger ones, use to structure interactions between customers and agents, and to route those interactions more efficiently. Anomaly detection can be useful for determining whether the system components in a contact center behave as expected.

The goal of this project has been to study console logs in order to determine whether the system components exhibit abnormal behaviour. This has been done by preparing data from the logs and then applying popular data mining methods. The data preparation involved feature construction and the use of two different normalisation methods. After the preparation process, the data was fed into a cluster-based data mining model. The algorithms used in the model are DBSCAN, Local Outlier Factor, and k-NN Global Anomaly Score. The first generates a binary attribute for each data point, stating whether or not it is an anomaly. The latter two rank the data points according to how likely they are to be anomalies.

In total, six experiments were conducted, namely the use of three algorithms with two different normalisation methods for each algorithm. The results showed that there were at least two data points that could be classified as anomalies.

Beyond those two data points, there was also a third data point on the borderline between anomaly and non-anomaly. Further inspection of the detected data points showed that the anomalies could be divided into two categories: system-related and user-related anomalies.


Acknowledgements

This project was done in collaboration with Telia in Uppsala. The people involved in this project were from the Genesys team at Telia. I am grateful for having been a part of that team and I would like to thank everyone in it. Special thanks go to Jonas Västibacken, Jonas Wulff, and Åse Rundberg for helping me with trivial tasks during the project.

Furthermore, I would like to thank Justin Pearson at Uppsala University for guidance and supervision throughout the project.


Contents

1 Introduction
  1.1 Setting
  1.2 The aim of this project
  1.3 Scope

2 Background
  2.1 Data Mining
    2.1.1 Classification
    2.1.2 Clustering
    2.1.3 Association rules
  2.2 Anomalies in data
  2.3 Data pre-processing
    2.3.1 Data preparation
    2.3.2 Temporal aggregation
  2.4 Approaches to Mining Multiple Sources
    2.4.1 Local Pattern Analysis
    2.4.2 Sampling
  2.5 Dimensionality Reduction
  2.6 False Positives and False Negatives

3 Problem approach
  3.1 Problem challenges
  3.2 System Architecture
  3.3 Feature Creation
  3.4 Events to JSON - Logstash
  3.5 Elasticsearch - queries on JSON objects

4 Mining model
  4.1 Normalisation
  4.2 Distance functions
  4.3 Clustering algorithms
    4.3.1 DBSCAN
    4.3.2 Local Outlier Factor
    4.3.3 k-NN Global Anomaly Score

5 Experiments and Results
  5.1 DBSCAN with Min-Max Normalisation
  5.2 DBSCAN with Z-transformation Normalisation
  5.3 Local Outlier Factor with Min-Max Normalisation
  5.4 Local Outlier Factor with Z-transformation Normalisation
  5.5 k-NN Global Anomaly Score with Min-Max Normalisation
  5.6 k-NN Global Anomaly Score with Z-transformation Normalisation

6 Discussion
  6.1 Reason Behind Found Anomalies
  6.2 False negatives
  6.3 The Effect of Using Different Normalisation Methods
  6.4 Correlated Data Attributes
  6.5 Discussing Initial Questions

7 Conclusion and Future Work

8 References


1 Introduction

Enterprise companies working within the area of Information Technology (IT) often have large computer systems. It is not uncommon for large systems to consist of several components, which are responsible for different parts of the system. To be able to track the behaviour of each individual component it is of value to log events in a file or database.

1.1 Setting

Telia is a provider of contact center solutions to enterprise customers in Europe [1]. A contact center handles interactions between a company and its customers. Examples of such interactions are email, telephone, and chat. One purpose of the contact center platform is to guide a customer to the most suitable agent in the company, so that the customer can get the best help possible. For example, if the customer is from Spain, the contact center will try to find an agent who can speak Spanish.

For every interaction, data is stored to disk. Every component that is involved in the interaction has its own log file. If an error or abnormal behaviour has occurred somewhere in the system, it is not always an easy task to pinpoint exactly which component was the cause of the behaviour. Telia employs many people to search for such errors or abnormal behaviour in the log files.

To help employees at Telia in their daily work, it would be valuable to find errors without having to manually search the logs. A proposed method is to apply Data Mining (DM) theories to the data in the log files and thus find anomalies without manual searching.

The system components that Telia provides to customers are developed by a company named Genesys, which is another contact center solutions provider [2].

This means that employees at Telia are Genesys consultants and do not have complete control over what is logged by the different system components.

1.2 The aim of this project

The purpose of this thesis is to find errors or abnormal behaviour in raw data logs. The goal is to build a prototype that applies DM theories to said data.

The prototype will be evaluated by measuring how good it is at detecting trends and how well it can correlate data from the different components to each other.

To achieve these goals, the following questions will be answered:

• What DM methods and areas are important to include for the described problem?


• Is it possible to detect trends and correlation in data without knowing what to look for?

• Within the areas of DM that are relevant for this problem: What algorithms are developed for anomaly detection and is one algorithm superior to others?

1.3 Scope

The focus in this report is on exploring DM theories and algorithms in order to see how they apply to the described problem. The model will be built using laboratory data. However, the finalized model will be tested on customer data.

Errors occur frequently in customer logs, and the customer data used for model testing is expected to contain errors.

Only one of the interaction mediums will be investigated, specifically the email interaction. The model will use data from all the components that are involved in an email interaction and take them into account.

For the DM model, the open source software RapidMiner will be used as it offers implemented algorithms and operators for data processing.


2 Background

For the described problem there are several areas that are of relevance. This chapter will describe and explain different areas within DM and how the specific areas can be applied to this project.

2.1 Data Mining

The term ’Data Mining’ usually refers to properties and attributes in data, and how those can be used to draw conclusions about patterns in data.

In DM it is common to talk about classification, clustering, and association rules [3]. These terms are different approaches to how information is gathered from the raw data and then analysed.

2.1.1 Classification

Classification is the process of labeling data with a class by looking at different attributes. An example of such a process is the boolean classification problem - labelling an email as spam or non-spam. The number of words used in the email, how long sentences and words are, and frequently used words are some examples of attributes that can be used to classify the email. In order to use this method, a network needs to be trained so that it can recognize patterns.

The goal for the network is to learn a target function, which is applied to unseen data when the training process is complete. Thus, the network needs labelled data during training so that it can adjust the target function when the label for a data pattern is incorrectly guessed. Classification is usually described as supervised learning, which is a way of describing that there are target values for every data input. The target values are produced by the trainer of the system and this restricts the function in that it cannot become better at classification than its trainer.

Using Artificial Neural Networks (ANNs) is one way of doing classification.

ANNs use different layers and nodes associated with weights that determine the target function for a specific input. Every ANN has at least three layers: an input layer, where each input pattern begins, an output layer, which determines which class the data pattern belongs to, and n hidden layers. The hidden layers contain nodes associated with weights. The weights determine the importance of a specific node, or more accurately, how large an influence each specific node should have on the resulting output. There are different types of ANNs and one type is Feed Forward Neural Networks (FFNNs). In FFNNs, nodes in a layer are connected to nodes in the previous layer and data travels through the layers. After a data pattern is incorrectly labelled, the weights are updated so that a similar data point is more likely to be correctly classified the next time it shows up [4].
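To make the idea concrete, the following is a minimal sketch of feed-forward classification with scikit-learn's MLPClassifier. The spam features (word count, average word length, number of "spam words") and the tiny training set are invented for illustration and are not taken from the thesis.

```python
# Minimal sketch of feed-forward classification with scikit-learn.
# The feature vectors (word count, avg word length, "spam word" count)
# are made up for illustration; they are not part of the thesis data.
from sklearn.neural_network import MLPClassifier

X_train = [
    [120, 4.2, 0],   # ordinary email
    [800, 3.1, 9],   # spam-like email
    [95, 4.8, 1],
    [650, 2.9, 12],
]
y_train = [0, 1, 0, 1]  # 0 = non-spam, 1 = spam

# One hidden layer with 8 nodes; the weights are adjusted whenever a
# training pattern is classified incorrectly.
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)

print(clf.predict([[700, 3.0, 10]]))  # expected: [1] (spam)
```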


2.1.2 Clustering

Clustering is another area in DM. Every data point consists of several measured attributes, as in classification. However, instead of labelling the data with a specific class, as in classification, clustering methods look at similarities in the data. The similarities between data points create clusters of data. As opposed to classification, clustering methods do not use any kind of target data and are thus described as unsupervised learning. For example, separating spam emails from non-spam could possibly be done with clustering methods as well as with classification. The clustering methods could detect that there are two clusters: one cluster would contain the data points that are non-spam emails whereas the other would contain the spam emails. The algorithm cannot tell which of the clusters is spam or non-spam; this would have to be decided manually by a human.

An example of an algorithm used for clustering is the k-means algorithm. The aim for this algorithm is to divide all the training patterns into k clusters [3].

The variable k is user defined and thus requires the user to have a decent understanding of the problem and the data. In the email example it is already stated that there are two clusters, spam and non-spam. However, a user does not always have knowledge about the number of clusters that are to be expected. If this is the case, there are alternative algorithms that do not require a predefined number of clusters.
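As a small illustration of the k-means idea, the sketch below clusters a handful of synthetic two-dimensional points into k = 2 clusters with scikit-learn; the data and the choice of library are assumptions made purely for the example.

```python
# Minimal k-means sketch with scikit-learn; the two-dimensional points
# are synthetic and only illustrate the "two clusters" email example.
from sklearn.cluster import KMeans
import numpy as np

X = np.array([
    [1.0, 1.2], [0.9, 0.8], [1.1, 1.0],   # one group
    [8.0, 7.9], [7.8, 8.2], [8.1, 8.0],   # another group
])

# k is user defined; here we assume two clusters (e.g. spam / non-spam).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1] -- which cluster is "spam" must be decided manually
```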

2.1.3 Association rules

Association rules are the correlations between information within a data set.

For example, consider a store that is keeping track of what their customers usually buy. Storing each transaction allows the store to see if there is any correlation between the items that their customers buy. If the store can detect that there are items that correlate to one another, the owners of the store know what combinations of products to offer discount on and thus can increase their total sales.

An algorithm frequently used within the area of association rules is the Apriori algorithm [5]. It can be used to find frequent item sets in transactional data.

For example, it can find the information that a customer in a grocery store usually buys cereal at the same time as milk. A few terms are important for understanding the algorithm: item set, support, and confidence.

An item set is a collection of one or several items. Support is the percentage of all transactions for which a pattern is true. Confidence is a metric of how trustworthy a pattern is. A pattern is usually written X → Y, which in the example would be milk → cereal. The support in this case is the number of records that contain both milk and cereal, divided by the total number of records. The confidence is the number of records containing both milk and cereal, divided by the number of records containing milk.
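These definitions can be made concrete with a small worked example. The transactions below are invented; the script only illustrates the arithmetic for the rule milk → cereal.

```python
# A small worked example of support and confidence for the rule milk -> cereal.
# The transactions are invented for illustration only.
transactions = [
    {"milk", "cereal", "bread"},
    {"milk", "cereal"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "cereal", "butter"},
]

both = sum(1 for t in transactions if {"milk", "cereal"} <= t)
milk = sum(1 for t in transactions if "milk" in t)

support = both / len(transactions)   # fraction of all transactions with milk and cereal
confidence = both / milk             # fraction of milk transactions that also contain cereal

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.75
```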


2.2 Anomalies in data

An anomaly can be defined as something that is abnormal or does not fit into the normal behaviour [6]. Anomalies are an important part of DM because they raise awareness about abnormal behaviour in the data and can help to come to conclusions about how to respond in certain situations.

The usage of anomaly detection can be justified in various applications. To name two examples from the application domain, anomaly detection can be used for system intrusion detection and credit card fraud detection [6]. The reason it is useful in these cases is that the behaviour of someone who has stolen a credit card or breached a system usually differs considerably from the general behaviour of normal users of these systems. Because of the difference in behaviour, the anomaly can be found and the owner can be alerted.

Anomaly detection in console logs has been done before under different circumstances. Wei Xu et al. tried to detect problems by mining console logs in 2009 [7]. They claim that console logs are usually more structured than they appear to be. Their approach was to analyse the source code of an application in order to recover the built-in structure of the console logs. In this structure, important information regarding objects and variables can be found and used to create data features. The created features are then used in the mining process, which includes algorithms specifically designed for anomaly detection. When the mining process is done, detections are visualised in a decision tree. As opposed to this thesis, Xu et al. had access to source code and were thus, with some certainty, able to create features relevant to the mining process. Specifically, Principal Component Analysis (PCA) is used for the mining process. This method calculates the variance of different input attributes to determine their influence on the distribution of data points. Essentially it is used for dimensionality reduction, which is explained in Section 2.5. Their results showed that they could detect different problems with high accuracy and few false positives on logs of any size.

In [8], written by Miranskyy et al. in 2016, mining of operational logs from big data systems is explored. Specifically, large logs from industrial projects at IBM and Ericsson were analysed. The article brings up several characteristics that may make data analysis of logs problematic in industrial settings: velocity, volume, variety, veracity, and value. Velocity represents the fact that real-time data analysis is occasionally needed. Large systems may produce large amounts of data, such that the analysis process is optimally done on-site in order to avoid data transfer over the internet. However, lack of resources on-site is often a problem as it restricts the analysis process. The authors propose that a possible solution is to distribute large logs over various storage devices, although it might be expensive to do so. Volume is the amount of data that the logs actually contain. Logs can contain the most recent data, but also historical data. To reduce the volume, one solution is to purge data that is older than a specific threshold. However, the authors suggest that this approach should be used with caution, as customers sometimes rediscover old problems and old data is then useful. Variety means that the logs may be structured or unstructured, which might make the pre-processing more difficult as mining requires a unified format. There is no general solution to this problem and as a result, developers build converters for each format. Veracity is the need for data cleaning. As customer data usually is sensitive and private, cleaning it is a necessity. When an error has occurred, a data object, together with the memory dump, can be sent to the developers. If the data is cleaned, the debugging process becomes harder as the object does not contain all the information. If the data has not been cleaned, however, the data object contains sensitive information. A solution to this problem, according to the authors, is to use obfuscation and anonymisation of the data. Value means that not every log is of value for the analysis process, as some logs do not provide any useful information.

For further reading on anomaly detection, see [9].

2.3 Data pre-processing

For any DM problem, pre-processing is an important part of the analytics process. There are different ways to do this and no single recipe, as features in data usually differ greatly. In [10], written by Adhikari et al. in 2014, there are several examples of feasible methods that can be used for pre-processing.

Preparation of data, aggregation, database partitioning, and database thinning are some examples.

2.3.1 Data preparation

For many DM techniques it is necessary to have numerical data. Data is not always in such a format and some transformation has to take place. Imagine that an attribute describes whether a state is true or false. A transformation in this case could be to convert 'true' to the value 1 and 'false' to the value 0. Data transformation might be especially important if distance is somehow measured, as mentioned in Section 2.1.1. If Euclidean distance is used as a measurement between data points, problems regarding the weight of an attribute can occur when one attribute possesses large values compared to the others. Attributes with large values will dominate the others, making them less significant in deciding the total distance between the points. To avoid this problem, normalisation can be applied to all or a few specific attributes. To normalise an attribute is to rescale it into a new range. How normalisation is used in this project is explained in more detail in Section 4.1.

2.3.2 Temporal aggregation

As the heading suggests, temporal aggregation [10] is when an attribute is summarised over time and converted to a new representation. A simple example would be to summarise the amount of rainfall in an area from each specific day to the amount of rainfall during a month. The advantage of this process is that the amount of data to be analysed is reduced. However, the specifics of the data will be lost unless both the original data and the aggregated data are stored, which would defeat the advantage of this method.
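A minimal sketch of temporal aggregation, assuming pandas and invented daily rainfall values, could look as follows; the thesis itself does not use pandas.

```python
# A minimal sketch of temporal aggregation with pandas: daily rainfall
# summarised into monthly totals. The rainfall values are invented.
import pandas as pd
import numpy as np

days = pd.date_range("2016-01-01", "2016-03-31", freq="D")
daily = pd.Series(np.random.default_rng(0).uniform(0, 5, len(days)),
                  index=days, name="rainfall_mm")

monthly = daily.resample("M").sum()  # "M" = month-end frequency ("ME" in newer pandas)
print(monthly)
```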

2.4 Approaches to Mining Multiple Sources

In [10], three different approaches to mining multiple data sources are presented.

Local Pattern Analysis (LPA), Sampling, and Re-mining. However, only the first two will be explained here. The reason is that when data comes from multiple sources, Re-mining can be considered another form of LPA.

2.4.1 Local Pattern Analysis

Large amounts of data that is available in multiple places can be mined locally.

It is inefficient to send raw data to a centralised place to be mined for several reasons. Sending data over the network will take a large amount of time if the bandwidth is too low and the data too large. Raw data from multiple sources that is consolidated into a single database can be the cause of problems regarding data privacy, data consistency, data conflict, and data irrelevance [11].

According to [10], patterns in LPA can be divided into three categories. These three categories are local patterns, global patterns, and patterns that are neither local nor global. Local patterns are detected during the local mining process.

Global patterns emerge after the local patterns have been synthesised. In other words, global patterns can be detected after local patterns have been merged together into a single unit.

2.4.2 Sampling

Sampling means using a sample, a subset of the data, instead of all the data. This method can be motivated by the fact that if a specific pattern is frequently present in a large database, it is most likely frequently present in a sample of the data as well. This method might allow data to be sent over a network even if the bandwidth is constrained, because the size of the data is decreased [10]. This method can also be seen as a part of the pre-processing, discussed in Section 2.3, as it is a method for making the mining process easier and faster.


2.5 Dimensionality Reduction

In addition to the problem of data not being of the same format, attributes in the data can also contribute more or less to feature correlation. Dimensionality reduction is a method used to remove attributes that are less relevant for the mining process. For example, an ID attribute would not contribute to any correlation between data points because its values are unique. Using the ID attribute in the mining process would simply show that every data entry has a unique behaviour.

2.6 False Positives and False Negatives

In any application regarding data analysis it is common to talk about false positive and false negative errors. In the sense of anomaly detection, a false positive is a data point that has been wrongly assigned as an anomaly. Consider the example where a person is wrongly sentenced to prison for a crime that he or she did not commit. This example would be a false positive error [12].

A false negative anomaly is the opposite of a false positive. This phenomenon happens when an anomaly is not found, but should have been. In the crime example, this time the person did commit the crime but was never accused of it or sentenced for it and is thus a false negative [12].

These two terms are important for the performance of an anomaly application as the goal is to minimize the number of times they occur.


3 Problem approach

An important part of this thesis is the literature study, where much of the knowledge regarding mining algorithms and methods has been gathered. Throughout the following section, the practical work, which includes the system architecture, feature creation, and how JSON objects are treated, is presented.

3.1 Problem challenges

Challenges are present in any problem and there are several challenges in this thesis. As described in Section 2.2, from [8], one potential problem of mining console logs is that the raw data may very well be unstructured. In fact, this is true for the data in this thesis. One component can produce a varying number of event types. Some are multiline events, others are single-line events. Some events concern specific IDs whereas others do not. Different components also produce different types of log messages. The number of different log events quickly becomes large.

In order to know what information to look for, careful inspection of the logs is required. As the format varies between components, special cases for parsing the information may have to be made. For this thesis, the data to be parsed is moved from customers' computers to a local machine. This means that the data has to be moved over the Internet in order to be processed. However, in a real environment this would not be a problem as the goal would be to do all processing and mining locally. Whether local processing can be achieved or not depends on the local machine. The issue of having too little computing resources, mentioned in [8] in Section 2.2, may be a problem.

As mentioned earlier, the mining involves analysis of attributes. The process of finding such attributes in data is often called feature creation. How features are created in this project is described in detail in Section 3.3. Furthermore, there is no guarantee that the selected features will reveal anomalies.

To set up the parser that gives the attributes their values, coding is required. This setup, of course, takes time to create and there is no guarantee that it will generate any results.

3.2 System Architecture

To be able to find anomalies in data it is necessary to define what an anomaly actually is. When two logs that are produced from the same component are compared to each other it is quite clear that the two logs can be similar or dissimilar in multiple ways. So, in order to draw conclusions about whether or not the two logs are similar, things that make them similar have to be defined first.


Figure 1: Architecture that an incoming mail passes through.

Figure made by Jonas Västibacken.

The architecture of the system that an incoming email passes through is pictured in Figure 1. The email server retrieves emails from the exchange server with the POP3 protocol. The body of the email is sent to the Universal Contact Server (UCS) component. This component holds information about customers, such as all the times a customer has been in contact with the company. The meta-data of the email is sent to the Interaction Server, which is the component that controls and owns the interaction between a customer and an agent. For each email that is sent to or from the contact center, an interaction ID is created. The Interaction Server asks the Universal Routing Server (URS) to route the email. The routing is dependent on predefined logic.

An example of such logic could be 'If the sender is logic-example@email.com, the agent has to be able to speak Spanish'. The Classification Server classifies the email based on specific text strings in the subject, header and body of the email. What might not be obvious from the picture is that the Classification Server has a connection to the UCS and is thus able to look at the body of the email. With information from the Classification Server, the URS asks the Stat Server about non-busy agents that fulfil the logic requirements. If an agent is found, information about that agent is sent back to the Interaction Server, which in turn can pass the email to the found agent. When the information from the Interaction Server has been sent to the agent, the agent works as a client to the UCS and can thus retrieve the body of the email.


3.3 Feature Creation

The architecture from Figure 1 is helpful when trying to define an anomaly. As seen and explained, there are different messages passing between the components. One way of defining features that can later be used for mining is to use the interaction IDs that are created for every email. The number of times every interaction ID is mentioned in the logs can be used to find anomalies regarding every interaction. Note that a customer does not always choose to use the Classification Server. Therefore, this component has been ignored for feature creation in this project. Ignoring the Classification Server makes this project more generalised, and the proposed method of finding anomalies in these types of logs is more likely to be applicable to several customers.

As mentioned in Section 3.1, the data is not very well structured. The interaction ID in the logs is named differently depending on the type of event.

Attr itx id, InteractionId, IxnContentId, and Id are examples of how an interaction is referred to in an event. Intuitively, the varied naming of the same object is problematic, as it becomes difficult to ensure that every event with an interaction ID is included in the count. However, as mentioned in Section 2.4.2, if an interaction ID is frequent throughout the logs, it is safe to assume that the ID is frequent in samples of the logs as well. Sampling allows the feature creation to be simplified while maintaining the correlation between the logs.
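The counting idea can be sketched in a few lines of Python. The file names, the regular expression for an interaction ID, and the simple line-by-line scan are assumptions for illustration; in the thesis the actual parsing is done with Logstash and the counts are retrieved from Elasticsearch.

```python
# Minimal sketch of the feature-creation idea: count how many times each
# interaction ID occurs in each component log. File names and the ID pattern
# are assumptions; the thesis parses the real logs via Logstash.
import os
import re
from collections import Counter, defaultdict

ID_PATTERN = re.compile(r"\b[0-9a-z]{16}\b")  # assumed shape of an interaction ID

component_logs = {
    "email_server": "email_server.log",
    "interaction_server": "interaction_server.log",
    "ucs": "ucs.log",
    "urs": "urs.log",
    "stat_server": "stat_server.log",
}

counts = defaultdict(Counter)  # counts[interaction_id][component] = occurrences
for component, path in component_logs.items():
    if not os.path.exists(path):
        continue  # skip components whose log is not present
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for interaction_id in ID_PATTERN.findall(line):
                counts[interaction_id][component] += 1

# One feature vector per interaction ID, in the same order as the component list.
for interaction_id, per_component in counts.items():
    row = [per_component[c] for c in component_logs]
    print(interaction_id, row)
```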

Logstash [13] is part of the ELK stack, an acronym for Elasticsearch, Logstash, and Kibana. It is developed by Elastic, the company behind Elasticsearch, and helps a user process logs and events whose data is produced and stored by several systems. Not only does Logstash help a user centralise the analysis of the data, it also allows the data from the different components to be parsed into a common format. For this thesis, this software is highly relevant as the data logs are spread out over different places and are initially unstructured.

3.4 Events to JSON - Logstash

Every row in the file is sent as input to Logstash. For every row there is a filter that parses the input according to some predefined configuration. Defining an event is necessary in order to make sense of the data. An event can be defined as information that correlates to a specific timestamp. With this definition, an event can take place over one or multiple rows. The configuration decides which parts of the event should be parsed and produced as output. The output is a JSON object that is, in this case, sent to Elasticsearch.

The configuration includes different filters which all have their own purpose.

For example, the multiline filter is used when an event takes place over multiple rows, as seen in Figure 2. Another example is the grok filter, which uses regular expressions to match different fields in the event. One use of the grok filter is to extract the fields that are relevant for analysis. Examples of such fields are 'timestamp', 'loglevel', 'protocol', and 'message remainder'.

Figure 2: Data sample from the UCS component.

There are some fields that will appear in the resulting JSON object regardless of whether they have been parsed or not. The timestamp field is an example of such a field and has to include year, month, and day together with hour, minute, second, and possibly milliseconds. As can be seen in Figure 2, events in logs from Genesys components are not always logged with a date. In this figure, events are tagged with only a timestamp in the format 'HH:mm:ss.SSS'.

This is problematic when the events are supposed to be searchable. Luckily, Logstash has a filter named ruby which allows Ruby code to be executed from the configuration file. With Ruby code, the creation time of the file can be found and concatenated with the timestamp in the event. However, this code has to be executed for every incoming event to Logstash.
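The following Python sketch imitates what the grok and ruby filters do for a single log row: split the row into fields and complete the date-less timestamp with the log file's date. The line format, field names, and regular expression are assumptions, not the exact Genesys log format or the thesis' Logstash configuration.

```python
# Illustrative sketch of grok-like field extraction plus timestamp completion.
# The row format and field names are assumed for the example.
import re
from datetime import date

LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<loglevel>\w+)\s+(?P<message>.*)$"
)

def parse_line(line, file_date):
    """Parse one log row into a dict of fields, completing the timestamp."""
    m = LINE.match(line)
    if m is None:
        return None  # e.g. a continuation row belonging to a multiline event
    event = m.groupdict()
    # Like the ruby filter: prepend the log file's date to the bare time-of-day.
    event["timestamp"] = f"{file_date.isoformat()}T{event.pop('time')}"
    return event

print(parse_line("13:37:02.123 Dbg message remainder goes here", date(2016, 11, 1)))
# {'loglevel': 'Dbg', 'message': 'message remainder goes here',
#  'timestamp': '2016-11-01T13:37:02.123'}
```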

3.5 Elasticsearch - queries on JSON objects

Elasticsearch [14] is the first part of ELK, mentioned in Section 3.3. As opposed to an ordinary relational database, explicit relations between entries cannot be defined in Elasticsearch. Entries in Elasticsearch are in JSON format, a dictionary-like data type in which every value is associated with a key. Together with Lucene, a search library implemented in Java, Elasticsearch is a possible solution when scaling is of importance.

As soon as the Logstash output has been created and sent to Elasticsearch, it is searchable. This can be done in multiple ways; Kibana, already mentioned as part of ELK, is a visualisation tool that offers a variety of diagrams. However, the idea is to apply DM theories to the results of the searches. RapidMiner is a software package that offers operators implementing DM algorithms and data modification algorithms, clustering included.

Both Elasticsearch and RapidMiner have a Java API and thus it is convenient if the search queries are done with Java so that the result of the queries can be directly inserted into RapidMiner.

The Elasticsearch server is running on the local computer, meaning that it is accessible through localhost. There is a plugin called 'head' which allows a user to run queries directly in the browser, as long as the web page is requested on port 9200. However, for this project the result of the queries should be sent as a response to some Java program. For this interaction to work, a client that connects to the Elasticsearch server is needed. The Java client instead connects on port 9300, which Elasticsearch uses for its binary transport protocol over TCP.
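As an illustration of such a query, the sketch below counts how often an interaction ID is mentioned, using Elasticsearch's REST API on port 9200 via Python and requests. This is an assumption for illustration only: the thesis uses the Java client on port 9300, and the index name is hypothetical.

```python
# Minimal sketch of querying Elasticsearch over its REST API on port 9200.
# The index name "logstash-*" is an assumption; the thesis uses the Java client.
import requests

ES = "http://localhost:9200"

def count_id_mentions(interaction_id):
    """Count how many parsed events mention a given interaction ID."""
    query = {"query": {"query_string": {"query": f'"{interaction_id}"'}}}
    resp = requests.post(f"{ES}/logstash-*/_count", json=query, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

print(count_id_mentions("0003fabhdnk15kae"))
```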

Figure 3: Format of data after pre-processing is done.

Figure 3 depicts the format of the data before it is used for mining but after all the pre-processing is performed. This particular log contains 490 data points and every attribute is separated with a comma. The first value is the interaction ID which is created when an email is sent to an agent, or vice versa. The ID is followed by the five measured values of the ID in the component logs.


4 Mining model

The mining process is inspired by other work that relates to the area and was found during the study. Different choices regarding mining are motivated by research that was found during the literature study.

4.1 Normalisation

As can be seen in Figure 3, attributes do not have the same value ranges. For example, the interaction server has values between 20 and 60 whereas the values from the URS are below 10. If these ranges are representative of all 490 data points, the average for the interaction server will be significantly higher than for the URS. As described in Section 2.3.1, this kind of difference in average value between attributes is not optimal when the distance between data points is measured. The reason is that attributes with a higher average will have a greater weight and influence on the result when the distance is calculated. Luckily, there are solutions to this problem.

Min-Max normalisation [15] is a popular method that performs a linear transformation of an attribute's values into a new range. For an attribute $X$, the method is defined as

$$X_i' = \frac{X_i - X_{min}}{X_{max} - X_{min}}\,(X_{newMax} - X_{newMin}) + X_{newMin} \qquad (1)$$

where $X_i'$ is the transformed value, $X_i$ is the value to be transformed, $X_{min}$ and $X_{max}$ are the minimum and maximum observed values of attribute $X$, and $X_{newMin}$ and $X_{newMax}$ are the minimum and maximum values of the new range. The range of the transformed attribute is usually either $[0, 1]$ or $[-1, 1]$. Applying this method to the interaction server attribute in the second column, third row of Figure 3, with $[0, 1]$ as the new range, gives

$$X_i' = \frac{45 - 20}{53 - 20}\,(1 - 0) + 0 \approx 0.758$$

As the transformation is linear, values for an attribute will keep their relation to each other which is the main advantage of this normalisation method.

Z-transformation [15] is another normalisation method and requires the mean and standard deviation of attribute $X$. The method is useful if the minimum and maximum values are unknown, or if the attribute contains outliers that are significantly larger than the average value and would dominate a Min-Max normalisation. The equation for Z-transformation is defined as

$$X_i' = \frac{X_i - \bar{X}}{\sigma_X} \qquad (2)$$

where $\bar{X}$ is the average of $X$ and $\sigma_X$ is the standard deviation of $X$. Applying the equation to the same row and attribute as in the Min-Max example gives

$$X_i' = \frac{45 - 36.21}{11.19} \approx 0.786$$

Note that the calculation of the mean and the standard deviation is left out of this example. For this particular value, the difference between Min-Max normalisation and Z-transformation is not significant. As opposed to Min-Max normalisation, Z-transformation also produces negative values: any value smaller than the mean is mapped to a negative number, and a value equal to the mean is mapped to 0.

In this thesis, both of the described normalisation methods have been used in the experiments shown in Section 5.
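The two normalisation methods can be sketched directly from equations (1) and (2). The NumPy implementation below is for illustration only (the thesis uses RapidMiner's normalisation operators), and the small example matrix is invented, loosely echoing the values 20, 45, and 53 used in the worked examples.

```python
# Sketch of the two normalisation methods (equations 1 and 2), applied
# column-wise to an ID-count feature matrix. The matrix is invented.
import numpy as np

def min_max(X, new_min=0.0, new_max=1.0):
    """Linear rescaling of every column into [new_min, new_max] (equation 1)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

def z_transform(X):
    """Zero mean, unit standard deviation per column (equation 2)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[20.0, 2, 12, 48, 16],
              [53.0, 7, 27, 30, 16],
              [45.0, 3, 14, 40, 12]])
print(min_max(X))
print(z_transform(X))
```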

4.2 Distance functions

To be able to determine whether any data points X1 to Xn are close to or far away from each other, a measure of distance has to be selected.

There are several options regarding distance functions.

Euclidean distance [15] corresponds to a straight line between two data points. The Euclidean distance between two data points i and j of n dimensions can be calculated as

$$d_{i,j} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2} \qquad (3)$$

The distance between the third and the fourth row in Figure 3, using the attributes from the second column onwards, is then calculated as

$$d_{3,4} = \sqrt{(45-53)^2 + (2-7)^2 + (12-27)^2 + (48-30)^2 + (16-16)^2} \approx 25.259$$

Another well-known function is the Manhattan distance [15]. This distance function is also known as city-block distance, as it can be compared to how it is only possible to move in four directions at an intersection in a city block: up, down, right, and left. The equation for Manhattan distance is

$$d_{i,j} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}| \qquad (4)$$

The distance between the same two rows is then calculated as

$$d_{3,4} = |45-53| + |2-7| + |12-27| + |48-30| + |16-16| = 46$$

Both the Euclidean and the Manhattan distance fulfil the four requirements for distance functions. The four requirements are [15]

• A measured distance is always non-negative.

• The distance measured from a point to itself is always zero.

• The distance function is symmetric, meaning that d_{i,j} = d_{j,i}.

• A measured distance d_{i,j} is always smaller than or equal to d_{i,k} + d_{k,j}, meaning that going directly from i to j is the optimal choice if the distance is to be minimised, rather than taking a detour.

The difference in how distance is measured by the two described functions is pictured in Figure 4.


Figure 4: Difference between Euclidean and Manhattan distance measurements between two data points.
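Both distance functions are easy to reproduce in plain Python; the sketch below recomputes the two worked examples for rows three and four.

```python
# The two distance functions (equations 3 and 4) applied to the worked example.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

row3 = [45, 2, 12, 48, 16]
row4 = [53, 7, 27, 30, 16]

print(round(euclidean(row3, row4), 3))  # 25.259
print(manhattan(row3, row4))            # 46
```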

4.3 Clustering algorithms

Within the clustering area of DM, there are several subareas regarding how the data is processed. One of these areas is partitioning methods, which were partly described in Section 2.1.2. The idea of partitioning methods is to separate a data set into k partitions (or clusters). The clusters are created by looking at similarity and dissimilarity between data patterns. Data within the same cluster are similar, whereas two data patterns assigned to two different clusters are dissimilar. The k-means algorithm belongs to these methods and is possibly the most well-known algorithm among them. The k-means algorithm does not work well on data that form clusters of different sizes. The algorithm is also sensitive to noise and outliers, as these patterns may very well influence the mean, which the algorithm relies heavily on [15].

In hierarchical methods, data is grouped into a tree structure, and there are two forms of these methods: agglomerative and divisive. In agglomerative hierarchical clustering, all the data patterns are separate from the start. Data patterns are then assigned to clusters, and over time the formed clusters are merged together until there is only one cluster connecting all patterns. This method is also called the bottom-up approach. The divisive approach is the opposite: instead of the patterns being separate in the beginning, they all belong to the same cluster. The big cluster is divided into sub-clusters, and the process of dividing clusters continues until each data pattern forms a separate cluster. In both agglomerative and divisive hierarchical clustering, a user can choose a number of clusters as a stop condition for the algorithm. Hierarchical clustering methods generally suffer from difficulties when selecting a split or merge point, i.e. when two sub-clusters are merged into the same cluster or a cluster is split into two separate clusters. The importance of a split or merge point is high, as other clusters will depend on earlier decisions. As a result, if such points are not chosen well, clusters that do not represent the data well may be formed. Hierarchical methods do not scale well, as each merge or split requires evaluation of the current clusters. The difference between agglomerative and divisive methods is visualised in Figure 5 [15].

Figure 5: Agglomerative and divisive hierarchical clustering.

Source: TODO

Density-based methods form clusters out of dense regions in the data, where density refers to the number of data points in a specified area. As opposed to partitioning methods, density-based methods have been developed to handle clusters of different sizes and shapes [15]. Density-based analysis can be used for anomaly and outlier detection, as it often does not require user input in terms of how many clusters are supposed to be formed. Algorithms for such analysis are used in this thesis and are described in Sections 4.3.1, 4.3.2, and 4.3.3.

4.3.1 DBSCAN

DBSCAN was originally proposed in [16] and stands for Density Based Spatial Clustering of Applications with Noise. As the name suggests, the algorithm uses information regarding density and forms clusters in dense regions of the data.

The number of clusters formed from the input data is determined by the distance between data points. The algorithm uses two parameters, epsilon (ε) and MinPts, to determine whether a point belongs to a specific cluster or not. Every data point is said to be either a core point, a density-reachable point, or an outlier. A point p is a core point if the number of points within ε of p, including p itself, is greater than or equal to MinPts. A point q is density reachable from p if it is within the distance ε of a core point but is not a core point itself. An outlier is a point that is neither a core point nor a density-reachable point.


As mentioned earlier, density-based algorithms sometimes do not require a user to define the number of clusters to be formed. However, the two parameters used for the clustering process are user-set in DBSCAN. It can be said that a user of the algorithm indirectly controls the number of clusters by setting the parameters. The two parameters are also sensitive, meaning that a small increase can yield very different results. For example, an increase of ε by 0.1 could be the difference between two and ten created clusters. This sensitivity is problematic when the differences between the intended clusters are small. However, in this thesis the outliers and anomalies are of interest, and these tend to stand out more and are thus easier to distinguish. As a result, the parameter sensitivity is not as big a problem as it can be in other applications.

Figure 6 depicts how points are evaluated by the DBSCAN algorithm. In this particular picture, the MinPts parameter is equal to four. The red points are core points. The requirement for a core point is that there are at least MinPts points, itself included, within the range ε. The yellow data points, B and C, are density-reachable points. As described in this section, they do not fulfil the MinPts requirement but are within the distance ε of a core point. The data point N is neither a core point nor a density-reachable point and is therefore an outlier.

Figure 6: Evaluation of data points with DBSCAN. Picture taken from Wikipedia [17].

The authors of [16] suggest ways of estimating the two parameters so that a user does not have to set them randomly or by trial and error. MinPts can be estimated using the dimension D of the data. In this thesis there are five attributes used for finding the anomalies, namely the number of times an ID occurs in each of the five different system logs. The recommended size of MinPts is MinPts ≥ #attributes + 1, which in this thesis means that MinPts has to be at least six. The ε parameter can be estimated using a k-distance graph, which shows the distance to the k nearest neighbours for each data point, with k set to the same value as MinPts. The calculated distances are plotted in a graph, which is helpful when ε is about to be set: there will likely be a point in the graph where the measured distance starts to converge towards the minimum value, and ε should be set to the value of that point.

An implementation of the DBSCAN algorithm comes with RapidMiner Studio 7.0. The implementation is used for the DBSCAN experiments in this thesis.
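For readers without RapidMiner, the parameter estimation and clustering described above can be sketched with scikit-learn. The random matrix X is only a placeholder for the normalised 490 × 5 feature matrix, and the way the "knee" of the k-distance curve is picked here is a crude stand-in for reading it off the plot.

```python
# Sketch of DBSCAN parameter estimation and clustering with scikit-learn,
# instead of the RapidMiner operators used in the thesis. X is a placeholder.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((490, 5))           # stand-in for the normalised feature matrix

min_pts = 6                        # >= number of attributes + 1
k_dist = NearestNeighbors(n_neighbors=min_pts).fit(X).kneighbors(X)[0][:, -1]
# Plotting sorted k_dist gives the k-distance graph; eps is read off at the knee.
eps = float(np.sort(k_dist)[int(0.95 * len(k_dist))])  # crude stand-in for the knee

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
print("outliers:", np.flatnonzero(labels == -1))        # -1 marks noise/outlier points
```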

4.3.2 Local Outlier Factor

Local Outlier Factor (LOF) was originally proposed in [18] by Breunig et al. The idea of the algorithm is to give every data point an outlier score.

According to Breunig et al., the outlier score is a key feature, as it is not a binary score. A binary score is the usual approach for other outlier algorithms and is the case for DBSCAN, described in this thesis. Although the representation of an outlier differs between LOF and DBSCAN, both algorithms share some properties: the use of core points and density-reachable points for density estimation applies to both.

The outlier score in LOF is a representation of a point’s local density with respect to its neighbours. This means that a data point with low local density compared to its neighbours’ local density will receive a higher outlier score.

Figure 7: Evaluation of data points with LOF. Picture taken from Wikipedia [19].

Figure 7 shows how the outlier score is set with LOF. Data point A will have a significantly higher outlier score than its neighbours due to the difference in density.

LOF is yet another algorithm that requires user input. Although user input is an unwanted property, as it may heavily affect the result, it is hard to avoid because data sets differ from each other. In LOF, the outlier score will fluctuate depending on what MinPts is set to. To get a more reliable score for all the points, the authors of the article suggest using a range of MinPts values. Setting the range of MinPts to 10-20 worked well for almost every data set they experimented with.

The authors of the article write that a point with an outlier score of 1 is an inlier, meaning that it is a point close to a cluster. However, there are no distinct guidelines for what an outlier's score would look like. For example, in one data set an outlier score of 2 could very well belong to an inlier, whereas in another data set that same score could belong to an outlier.

This problem can also occur within a single data set because of the local density calculations and the lack of global density calculations. An example of this problem can be seen in Figure 8. It shows data points with similar outlier scores (points with scores 3.4745, 3.108, and 3.1361) despite noticeably different distances to the nearest cluster. Also worth mentioning from the picture is the data point with outlier score 7.6259, the highest score in the data space, despite it being noticeably closer to the nearest cluster than some other data points.

Figure 8: Scores for data points using the LOF algorithm on a test data set. Picture taken from Wikipedia [19].

An implementation of LOF was done in [20] by Mennatallah Amer and Markus Goldstein and is used for the LOF experiments in this thesis. The implementation is an extension to RapidMiner and follows the definition in [18]. Amer and Goldstein mention that LOF is possibly the most commonly used algorithm when local anomalies are of interest, i.e. anomalies that might be missed by a global anomaly detection algorithm such as the one described in Section 4.3.3.
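An analogous LOF computation can be sketched with scikit-learn's LocalOutlierFactor; this is not the RapidMiner extension used in the thesis, and the data matrix is again a random placeholder.

```python
# Sketch of computing LOF scores with scikit-learn (not the extension in [20]).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = rng.random((490, 5))           # stand-in for the normalised feature matrix

# n_neighbors plays the role of MinPts; the LOF authors suggest the range 10-20.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_   # ~1 for inliers, larger for outliers

top = np.argsort(scores)[::-1][:10]
print(list(zip(top, scores[top].round(3))))  # ten highest LOF scores
```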


4.3.3 k-NN Global Anomaly Score

k-NN Global Anomaly Score is another algorithm that was implemented in [20]. The algorithm also uses the k nearest neighbours in order to calculate the anomaly score. As opposed to LOF, the k-NN algorithm does not consider local densities when the anomaly score is calculated, hence the name.

The paper [21] is a performance comparison between different anomaly detection algorithms by Markus Goldstein and Seiichi Uchida. The authors propose that in the k-NN Global Anomaly Score algorithm, k should be set to 10 ≤ k ≤ 50. The study concludes that k-NN Global Anomaly Score and LOF generally perform well compared to the other algorithms in the study.

k-NN Global Anomaly Score and LOF are used when global and local anomalies, respectively, are of interest. In this thesis, and generally for any problem regarding data analysis, it is hard to know whether the anomalies are local or global. Therefore, the k-NN Global Anomaly Score algorithm implemented in [20] is also used on the data in this thesis.
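The global score itself is simple to sketch: one common variant, assumed here, scores each point by its mean distance to its k nearest neighbours ([20] gives the exact definition used by the RapidMiner operator). The data is again a random placeholder.

```python
# Sketch of a k-NN global anomaly score: mean distance to the k nearest
# neighbours (one common variant), computed with scikit-learn's NearestNeighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = rng.random((490, 5))           # stand-in for the normalised feature matrix
k = 20                             # [21] suggests 10 <= k <= 50

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
distances, _ = nn.kneighbors(X)
scores = distances[:, 1:].mean(axis=1)            # drop the zero self-distance

top = np.argsort(scores)[::-1][:10]
print(list(zip(top, scores[top].round(3))))       # ten most anomalous points
```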


5 Experiments and Results

The experiments conducted in this thesis include a comparison of the results from the clustering and outlier detection algorithms described in Sections 4.3.1, 4.3.2, and 4.3.3. To ensure that the results are fairly accurate, the experiments have been run with both normalisation methods described in Section 4.1. Only one of the distance measures described in Section 4.2 has been used, namely the Euclidean distance.

5.1 DBSCAN with Min-Max Normalisation

Figure 9 shows the k-distance graph of the data set when it has been normalised with Min-Max normalisation. The parameter k is set to 20 because MinPts is set to 20. The bend of the graph, where it starts to converge slowly towards the minimum measured distance, is marked by a black arrow.

The arrow points at roughly 0.275.

Figure 9: K-distance graph over data that is normalised with Min-Max.

After the value of ε has been found from the k-distance graph, the DBSCAN algorithm is run on the data. Figure 10 shows that out of the 490 data instances, three were assigned to cluster zero. The zero cluster contains data points that have not been assigned to any cluster, meaning that they are outliers.

Figure 10: Clustering model showing how many data points were assigned to each cluster.

Furthermore, Figure 11 shows the normalised data together with every data point's cluster assignment. From this data, it is possible to see that two of the three outlier data points have the maximum value of one in at least one of their attributes. The third of the three outliers does not have a maximum value in any attribute but generally has high attribute values.

Figure 11: Raw and Min-Max normalised data together with assigned cluster. The data is sorted so that the three outliers are shown first.

5.2 DBSCAN with Z-transformation Normalisation

As written in Section 4.1, attributes will have different values depending on the normalisation method. In this experiment, the Z-transformation normalisation method is used. Once again, the k-distance graph operator is used in RapidMiner and yields the result in Figure 12. As seen, the black arrow points at about 2.75.

After the value of the calculated epsilon is set in the DBSCAN operator the model can be run. The cluster model of this experiment is seen in Figure 13.

Once again, there are three data points that are not clustered, i.e. assigned to the zero cluster.


Figure 12: K-distance graph over data that is normalised with Z-transformation.

Figure 13: Clustering model showing how many data points were assigned to each cluster.

By inspecting the raw data after normalisation in Figure 14, it is clear that the Z-transformed values are very different from the resulting values when Min-Max normalisation is used. The figure is also sorted to show the found anomalies at the top. It is clear from the figure that three anomalies were found in this experiment as well. In fact, the anomalies found with Z-transformation as the normalisation method are the same as the ones found in the experiment with Min-Max normalisation.


Figure 14: Raw and Z-transformed normalised data together with assigned cluster. The data is sorted so that the three outliers are shown first.

5.3 Local Outlier Factor with Min-Max Normalisation

Figure 15 shows the result when the LOF algorithm is run with Min-Max normalisation. In this experiment, as in all LOF experiments in this thesis, the MinPts lower and upper bounds are set to 10 and 20 respectively. As described in Section 4.3.2, data points no longer receive a binary score, nor are they assigned to a cluster. Instead, a score for how likely each point is to be an outlier is shown in the outlier column. From the 10 data points with the highest anomaly scores shown in the figure, we can see that two of the three anomalies found in the DBSCAN experiments receive the highest anomaly scores. However, one of the anomalies found in the DBSCAN experiments, the data point with row number 485, is not among the top 10. The highest anomaly score is about twice the second highest. Furthermore, several data instances have the exact same anomaly score, i.e. they are data duplicates.


Figure 15: Outlier score for the first LOF experiment where min-max normalisation is used. The lower bound of k is set to 10 whereas the upper bound is set to 20.

5.4 Local Outlier Factor with Z-transformation Normalisation

Figure 16 shows the result of LOF when Z-transformation normalisation is used. The top two data points are the same as in the LOF experiment with Min-Max normalisation, but the result is otherwise different. We can also see that the data point with row number 485 does appear among the top 10 instances with the highest anomaly scores, as opposed to in the LOF experiment with Min-Max normalisation. The figure also shows that the highest anomaly score is about 20 times the second highest. Once again, several data instances have the same anomaly score as a result of having the same measured ID counts.

Figure 16: Outlier score for the second LOF experiment where Z-transformation normalisation is used. Once again, lower and upper bound of k is set to 10 and 20 respectively.


5.5 k-NN Global Anomaly Score with Min-Max Normalisation

Figure 17 is the result of running the k-NN Global Anomaly Score operator on the data with Min-Max normalisation. We can see that the highest placed anomalies are the same ones as those found by the DBSCAN experiments. The data point with the highest anomaly score is about 1.12 times as high as the data point with the second highest anomaly score.

Figure 17: Outlier score for the first k-NN global anomaly score experiment where Min-Max normalisation is used. The parameter k is set to 20.

5.6 k-NN Global Anomaly Score with Z-transformation Normalisation

Figure 18 is the result of running the k-NN Global Anomaly Score operator on the data with Z-transformation normalisation. Once again, the data points with the highest anomaly scores are the same ones as those found in the DBSCAN experiments. The highest anomaly score is about 2.4 times that of the data point with the second highest score.


Figure 18: Outlier score for the second k-NN global anomaly score experiment where Z-transformation normalisation is used. Once again, k is set to 20.


6 Discussion

In all the experiments conducted in this thesis, two data points, row numbers 4 and 407, keep recurring as anomalies. Furthermore, one data point, row number 485, occurs in four out of six experiments either as an anomaly or with a fairly high anomaly score. By manually inspecting the logs that contain these IDs, one can conclude that they are rightfully flagged as anomalies.
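
The manual inspection referred to here amounts to counting and reading the log lines that mention a given interaction ID. A hypothetical helper for that step could look as follows; the log directory, the file pattern, and the example ID call are placeholders rather than the actual setup used in the thesis.

    # Hypothetical helper for the manual inspection step: count how often an
    # interaction ID occurs in each console log and show a few matching lines.
    from pathlib import Path

    def inspect_id(interaction_id, log_dir="logs"):
        for log_file in sorted(Path(log_dir).glob("*.log")):
            hits = [line for line in log_file.read_text(errors="ignore").splitlines()
                    if interaction_id in line]
            print(f"{log_file.name}: {len(hits)} occurrences")
            for line in hits[:5]:                    # print a few example events
                print("   ", line.strip())

    # inspect_id("0003fabhdnk15kae")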

6.1 Reason Behind Found Anomalies

First, the data point with row number 4, ID = 0003fabhdnk15kae, is assigned to the zero cluster when DBSCAN is used and receives the highest score in all experiments with anomaly score algorithms. The manual inspection, which is not included in this report because of confidentiality, shows that this ID is mentioned about 2400 times in the email server log, which is far above the average count for IDs in that log. Two messages keep recurring among these 2400 mentions:

• ‘0003fabhdnk15kae contains attribute name with invalid character at position ... ’

• ‘0003fabhdnk15kae rejected because it contains too many errors’

The two messages tell us that something about the email makes the system unable to process the interaction. As a result, the found anomaly is an example of a system failure or a system fail-safe. In either case, the anomaly is system related.

Second, the data point with the second highest anomaly score in all experiments is the one with row number 407, ID = 0003fabhdnk15bex. This data point is also assigned to the zero cluster when DBSCAN is used. Inspection shows that, unlike the previous one, this is not a system related anomaly. For this interaction, the system works as intended; instead, the behaviour of the responding agent is the reason why the ID is flagged.

As described in Section 3.2, an interaction between agent and customer can start once the URS component has sent information to the interaction server. For the interaction to actually start, the agent has to check a ready box in a graphical user interface (GUI). If the ready box is not checked, the interaction re-enters the system. When the interaction is re-entered, the system process restarts and the ID goes through the system again.

For this particular ID, the agent failed to respond to the interaction several times. As a result, the counts of the ID in the logs increased each time the agent did not respond.


Third, the data point with row number 485, ID = 0003fabhdnk15eju, has a fairly high anomaly score and is flagged as an anomaly by DBSCAN. This anomaly is another case of the interaction re-entering the system queue. The reason why it does not receive as high an anomaly score as the other example of the same nature is that the interaction re-enters the system fewer times.

6.2 False Negatives

As written before, data analysis applications generally want to minimise both false negative and false positive errors. In this thesis, none of the points that were inspected manually could be classified as false positives, as they were rightfully classified as anomalies. However, by manually inspecting the input file to the model, three data points could be found where the number of times the ID occurred in the logs was unusually low. The three data points are shown in Figure 19.

Figure 19: Examples of data points that should have been detected as anomalies but were not detected as such in most experiments.

After Min-Max normalisation, these three data points are positioned at the origin of the data space, as they contain the lowest count value for all five attributes.

Although they deviate quite a bit from the average value for each attribute, they are not within the top ten data points with the highest anomaly score in any experiment, nor are they classified as anomalies by DBSCAN. The presence of false negatives in this application makes the performance of the model questionable. The false negatives could possibly be explained by the theory that they lie significantly closer to a cluster than the true positive anomalies do.

The existence of these three data points in the data set could be explained by the fact that some interactions start at, or shortly before, the time when the data collection for this application ended. That means that they had not yet been processed by the system.
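
One way to test the closeness hypothesis above would be to compare the distance to the k-th nearest neighbour for the three low-count points and for the true positive anomalies. The sketch below assumes the normalised feature matrix and the relevant row indices are available; the names used are placeholders for the row numbers discussed in this chapter.

    # Rough check of the hypothesis that the false negatives lie closer to a
    # cluster than the true positives; X and the row indices are placeholders.
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def kth_neighbour_distance(X, rows, k=10):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dist, _ = nn.kneighbors(X[rows])
        return dist[:, -1]          # distance to the k-th (non-self) neighbour

    # Comparing, e.g., kth_neighbour_distance(X_minmax, origin_rows) with
    # kth_neighbour_distance(X_minmax, true_positive_rows) would quantify the gap.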

6.3 The Effect of Using Different Normalisation Methods

The results show that the choice of normalisation method matters for data analysis. The impact of the normalisation method is more distinct for the two anomaly score algorithms (LOF and k-NN) than for DBSCAN. If we were to assume that the anomaly score is linear, in the sense of how likely it is for a data point to be an anomaly, a score twice as high as another would mean that the probability of the higher-scored point being an anomaly is twice as high. With this assumption in mind, it is clear that the normalisation method plays a big role in deciding how likely a data point is to be an anomaly. As an example, which was already stated in Section 5.4, the data point with ID = 0003fabhdnk15kae in Figure 16 would be about 20 times as likely to be an anomaly as the data point with ID = 0003fabhdnk15bex. In Figure 15, the factor between the former and the latter interaction ID is only about 2.

6.4 Correlated Data Attributes

From the architecture picture, Figure 1 in Section 3.2, we can see that the data flow of the system is not serial. However, because the components of the system depend on each other, there will be a correlation between the number of times an ID occurs in the different logs. For example, for two of the three found anomalies the reason was agent behaviour: the agent did not respond to the interaction, which meant that the interaction re-entered the system. In turn, the count for the interaction in the system logs went up significantly. This means that if the count of an interaction for one attribute increases, the counts for the other attributes will most likely increase as well. The correlation between attributes in this thesis is a result of the feature creation. Optimally, data analysis applications do not have correlated attributes, which tells us that the feature creation in this thesis is a drawback of the application.
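
The degree of correlation could be quantified directly on the prepared feature table, for example with a Pearson correlation matrix as in the sketch below; the file and column names are assumptions.

    # Quantify the correlation between the count attributes; file and column
    # names are placeholders for the prepared feature table.
    import pandas as pd

    features = pd.read_csv("id_counts.csv")
    print(features.drop(columns=["id"]).corr())   # Pearson correlation matrix;
    # values close to 1 indicate that the count attributes rise and fall together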

6.5 Discussing Initial Questions

What DM methods and areas are important to include for the described problem?

The purpose of this thesis was to explore how anomalies can be found in raw data. That is preferably done with clustering, because there is no labelled data. Anomalies in the data are usually far fewer than events that are not anomalies, which is one of the main reasons why clustering is preferable over classification.

This problem could possibly be treated as a binary classification problem instead of using clustering methods. However, because the data (anomalies and non-anomalies) is far from equally distributed, it is hard to create a classification model, as training requires a labelled data set with a reasonable class distribution.

There are some hybrid models of classification and clustering methods that can be used for this type of problem. In such a setup, events that are viewed as normal are labelled, whereas the events that are considered anomalies are not. However, this is not something that has been explored in this thesis.

Generally, it is hard for a person who is not well trained in what the different events in the logs mean to draw conclusions on whether an event is an anomaly or a false positive. False positives, in the sense of an event being a non-anomaly but clustered as one, are to be expected, and for a person with little knowledge about the events in the logs, determining whether such events actually are anomalies or false positives is not an easy task.

Furthermore, the papers examined in the literature study of this thesis show that clustering is the most commonly used approach when anomaly detection problems are addressed. In addition, the available anomaly detection algorithms that address this kind of problem seem to mostly be based on clustering methods.

Is it possible to detect trends and correlation in data without knowing what to look for?

The simple answer to this question is yes; the full answer, however, is not that simple. In this thesis, two different types of anomalies were found. When selecting features for the mining process, the idea was to follow an interaction as it is processed by the system.

Selecting features defines the domain of anomalies that can be found. For example, anomalies regarding component health, such as the number of processed requests per minute, will not be found by the model in this thesis, nor will unprocessed interactions.

Within the areas of DM that are relevant for this problem: What algorithms are developed for anomaly detection and is one algorithm superior to others?

Throughout this thesis, three clustering algorithms are used. Two of them (LOF and k-NN) are compared, together with other anomaly detection algorithms, in [21]. The study shows that LOF and k-NN generally perform well for anomaly detection. As they both perform well, it is hard to say whether one is superior to the other.

The performance of the algorithms is dependent on the data that is to be mined. As written earlier, LOF outperforms k-NN when there are local anomalies, whereas k-NN outperforms LOF when there are global anomalies.

Generally, there is no single correct way of doing anomaly detection; the result depends to a high extent on the shape of the data. Choosing which normalisation method, distance function, or anomaly detection algorithm to use is usually a process of testing different settings to see whether the results seem reasonable. These settings can be tested on an initial data set to see if the model yields any results at all. If the results seem reasonable, the model can be added as a new part of the system.


7 Conclusion and Future Work

It has been shown throughout this report that anomaly detection in console logs is feasible. In this thesis's specific context, the goal was to find anomalies in order to help employees at Telia with error detection. Three different algorithms were used to find anomalies. The results of DBSCAN and k-NN Global Anomaly Score were similar with regard to which anomalies were detected, whereas the result of LOF did not match the other two algorithms.

A conclusion that can be drawn from this, which is also shown in [21], is that global algorithms seem to perform better on real-world problems.

The model in this project found two types of anomalies in the data set: the anomalies were either system related or user-behaviour related.

The current model can find the ID of an anomaly. Once the anomaly has been found, a human has to manually search the logs for the given interaction ID to see what the actual problem with that interaction is. For further research, it could be useful to do some text mining on the logged events, which could potentially tell us more about the reason behind a logged anomaly. An interesting idea would be to use classification methods to classify an anomaly into different categories. The categories could be derived from what the event messages of the found anomalies say. For example, the anomalies found in this thesis could be classified as system or user related.

References
