
Unsupervised anomaly detection on log-based time series data

OSKAR GRANLUND

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Unsupervised anomaly detection on log-based time series data

OSKAR GRANLUND

Master in Computer Science
Date: September 24, 2019
Supervisor: Johan Gustavsson
Examiner: Olle Bälter

School of Electrical Engineering and Computer Science
Host company: SAAB

Swedish title: Oövervakad anomalidetektering på logbaserad tidsseriedata


Abstract

Due to a constant increase in the number of connected devices, there is an increased demand for confidentiality, availability, and integrity of applications. This thesis focused on unsupervised anomaly detection in data centers. It evaluates how suitable open-source state-of-the-art solutions are at finding abnormal trends and patterns in log-based data streams. The methods used in this work are Principal Component Analysis (PCA), LogCluster, and Hierarchical Temporal Memory (HTM). They were evaluated using F-score on a real data set from an Apache access log. The data set was carefully chosen to represent a normal state in which close to no anomalous events occurred. Afterward, 0.5% of the data points were transformed into anomalous data points, calculated based on the average frequency of log events matching a certain pattern. PCA showed the best performance, with an F-score ranging from 0.4 to 0.56. The second-best method was LogCluster, while the HTM methods did not show adequate results. The results showed that PCA can find approximately 50% of the injected anomalies, which can be used to improve the confidentiality, integrity, and availability of applications.


Sammanfattning (Swedish abstract, translated)

Since the number of connected devices has constantly increased and the demands on availability, authenticity, and integrity of applications are high, this thesis has focused on unsupervised anomaly detection in data centers. It evaluates how suitable open, modern anomaly detection methods are for finding deviating patterns and trends in log-based data streams. The methods used in this project are Principal Component Analysis, LogCluster, and Hierarchical Temporal Memory. They were evaluated with F-score on a data set from an Apache access log taken from a production environment. The data was selected to represent a normal state in which few or no abnormal events occurred. 0.5% of the data points were transformed into anomalies, based on the average occurrence of each log sequence matching a certain pattern.

Principal Component Analysis showed the best results, with an F-score from 0.4 to 0.56. Second best was LogCluster; the two methods based on Hierarchical Temporal Memory did not show good results at all. The results showed that PCA can find about 50% of the injected anomalies, which can be used to improve the confidentiality, availability, and integrity of applications.


1 Introduction 1

1.1 Problem Description . . . 2

1.2 Research Questions . . . 2

2 Background 3

2.1 Machine Learning . . . 3

2.1.1 Supervised Machine Learning . . . 4

2.1.2 Unsupervised Machine Learning . . . 4

2.1.3 Online Learning . . . 4

2.2 Time series analysis and forecasting . . . 4

2.2.1 Analysis and Forecasting . . . 5

2.2.2 Window Types . . . 5

2.2.3 Mathematical Methods . . . 5

2.3 Anomaly Detection . . . 6

2.3.1 Types of Anomalies . . . 7

2.3.2 Hierarchical Temporal Memory . . . 7

2.3.3 Log Cluster . . . 9

2.3.4 Principal Component Analysis . . . 10

2.3.5 Feature Distribution Detection . . . 11

2.4 Performance Measurement . . . 11

2.4.1 F-Score . . . 11

2.4.2 Accuracy . . . 12

2.5 Related Work . . . 13

2.5.1 Experience report . . . 13

2.5.2 Unsupervised real-time anomaly detection . . . 13

2.5.3 Network based anomaly detection . . . 14

2.5.4 Access log anomaly detection . . . 14

2.5.5 Commercial products . . . 14

2.6 Summary . . . 14



3 Method 16

3.1 Data and Environment . . . 16

3.1.1 Environment . . . 16

3.1.2 Data set . . . 17

3.2 Feature Selection and Pre-processing . . . 18

3.2.1 Features . . . 18

3.2.2 Window Type . . . 18

3.2.3 Apache log to an Event ID Vector . . . 19

3.3 Artificial Anomaly Injection . . . 19

3.4 Implementation . . . 20

3.4.1 Input Layers . . . 20

3.4.2 Training Ratio . . . 22

3.4.3 Implementation Details . . . 22

3.4.4 Hyperparameter Optimization . . . 23

3.5 Evaluation . . . 24

4 Results 25

4.1 Average Event Frequency . . . 25

4.2 Model Comparison . . . 26

4.2.1 Default Hyperparameters . . . 27

4.2.2 Optimized Hyperparameters . . . 28

4.2.3 Increased anomaly amplitude . . . 29

4.2.4 HTMAL and HTMAL Shannon Entropy graphs . . . . 30

5 Discussion 35

5.1 Results . . . 35

5.2 Applying the methodology in a broader context . . . 37

5.3 Source of Errors . . . 37

5.4 Sustainability, Ethics, and Social Aspects . . . 37

5.5 Future Work . . . 38

6 Conclusions 40

Bibliography 41


1 Introduction

In recent years, there has been a rapid growth of connected devices, especially with the rise in popularity of the Internet of Things (IoT). This increases the number of critical points of failure in terms of confidentiality, availability, and integrity in data center infrastructure. A data center consists of a large number of entities such as clients, servers, and applications. These entities leave traces both internally and externally. Internal traces could be a server's log messages or a user's bash history. External traces could be the communication between entities in the network, which can be collected by firewalls, routers, and web servers. When looking at these log messages individually, they may seem useless. However, combining the log traces from all entities in a centralized log management system gives new opportunities for deducing valuable information about the overall health of the data center. This creates a new field of system monitoring compared to more primitive strategies such as watching a single event, for instance, sending a ping request to see whether a host is online.

The log data produced in a data center tends to be very large; many gigabytes per hour is not unusual, and it is not feasible for a human to process this information in real time. The data can be gathered by installing agents on the entities in the data center. These agents collect information and send it to the centralized log system. With the advancements in the field of machine learning, a new toolset for monitoring faults and anomalies is available.

Recent incidents regarding availability and security are often detected late [1].

For example, unexpected network traffic caused by corrupted scripts or Distributed Denial of Service (DDoS) attacks can be hard to detect. It is of great interest to detect intrusions and to forecast a possible system failure by finding anomalies in real time. An anomaly can be many things, but the most common aspect may be to find outliers such as rare objects or incidents. An anomaly can also be a normal event with a sudden burst of activity [2]. For instance, a brute-force attack against a password causes the number of 401 REST responses to increase.

1.1 Problem Description

The degree project was conducted at SAAB in an organization that develops and maintains a large data center. The data center has more than 2000 computers in approximately 100 networks and more than 500 users. A log management system is in operation, which receives log data from several log-producing entities such as servers and clients. The existing data has not been labeled into classes of normal or abnormal events. The objective is to investigate how the log data can be extended further with the aid of online unsupervised machine learning to detect anomalies. Online learning means that the time series data is processed event by event instead of in a batch. An important property of online learning is that it uses continuous learning, meaning that the model adjusts after each processed event.

Log-based anomaly detection can be divided into four steps: log collection, log parsing, feature extraction, and anomaly detection. The log collection had already been done, and this project focused on the latter three steps. Prior to training the anomaly detection method, a representative normal state of the data has to be found. A normal state is a state in which the log events behave as expected.
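As an illustration of the latter three steps, the sketch below parses Apache access-log lines into event IDs and counts them per window. The regular expression, the event-ID scheme (HTTP method plus status code), and the sample lines are illustrative assumptions, not the parsing rules actually used in this project.

```python
import re
from collections import Counter

# Hypothetical Apache common-log-format pattern; the thesis's actual
# parsing rules are not specified here.
LOG_PATTERN = re.compile(r'\S+ \S+ \S+ \[.*?\] "(\w+) (\S+) \S+" (\d{3}) \S+')

def parse(line):
    """Log parsing: reduce a raw line to a structural event ID."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    method, path, status = m.groups()
    # Event ID = template of the message (method + status), ignoring variables.
    return f"{method}_{status}"

def extract_features(lines):
    """Feature extraction: count event IDs within one window."""
    return Counter(e for e in map(parse, lines) if e)

log = [
    '127.0.0.1 - - [24/Sep/2019:10:00:00 +0200] "GET /index.html HTTP/1.1" 200 512',
    '127.0.0.1 - - [24/Sep/2019:10:00:01 +0200] "POST /login HTTP/1.1" 401 64',
    '127.0.0.1 - - [24/Sep/2019:10:00:02 +0200] "POST /login HTTP/1.1" 401 64',
]
print(extract_features(log))  # Counter({'POST_401': 2, 'GET_200': 1})
```

The resulting count vectors per window are what the anomaly detection step operates on.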

1.2 Research Questions

Which of the unsupervised anomaly detection methods PCA, LogCluster, HTMAL, and HTMAL Shannon Entropy achieves the highest F-score on log-based data from Apache access logs?


2 Background

In this chapter, the theory required to understand unsupervised anomaly detection and the method is presented. General concepts of time series analysis and forecasting are presented in section 2.2. In section 2.3, the theory and concepts of anomaly detection are presented, as well as a definition of an anomalous event. The theory behind two relevant performance measurement methods is presented in section 2.4. The related work in section 2.5 is a summary of recent research in the field of log-based anomaly detection.

2.1 Machine Learning

Machine learning strategies are highly dependent on the context and quality of the data. The most common scenario in real-life situations is that the data is not labeled and therefore contains entries from all possible classes. A class or label is a tag marking an event as belonging to a category, such as an anomalous or a normal event; an example is to determine whether a photograph was taken during the day or at night. Different strategies have to be chosen depending on the structure of the data set. Many machine learning methods are computationally expensive, which has to be taken into consideration, as well as how much data has to be collected before the model becomes representative. The existing strategies can be divided into supervised, unsupervised, and semi-supervised learning. They are conceptually explained in sections 2.1.1 and 2.1.2; in section 2.1.3 the concept of online learning is presented. Online learning is a technique that can be applied to machine learning methods.


2.1.1 Supervised Machine Learning

Supervised learning requires that the training set has been labeled into specific classes [3]. The fact that the training data is categorized into different classes is used to build a model that describes the data. This is a big drawback, since labeling is a very time-consuming task that can be hard to automate. Another problem is that in the case of anomaly and normal classification, the anomaly class is often much smaller than the normal class, which creates an imbalance between the classes. When the data is manually labeled, the supervised approach learns to mimic a human expert rather than learning by itself. This is a problem because the classifier generally does not outperform a human expert, since it does not get better than the data it is trained on [4].

2.1.2 Unsupervised Machine Learning

Unsupervised learning solves the big drawback of labeling the data into classes beforehand, under the assumption that normal events happen with a much higher frequency than anomalous ones. If this is not the case, then a lot of false alarms will occur. Compared to supervised learning, the unsupervised approach can find unknown anomalies and can outperform a human expert [5].

An unsupervised machine learning algorithm does not need a labeled training set; however, it is good to approximate a model on a portion of the data set before searching for anomalies. Not to be confused with unsupervised learning is semi-supervised learning, which builds a predictive model on a training set from just one class [4].

2.1.3 Online Learning

Online machine learning means that the data is processed in a sequential manner rather than in batch [3]. This is beneficial when the data is too large to fit into memory and when the algorithm uses continuous learning. Continuous learning means that the model adjusts itself after each new event. For example, with a clustering method, each new event is classified into one of the clusters, and then that cluster's centroid has to be adjusted.

2.2 Time series analysis and forecasting

Time series analysis and forecasting are two essential concepts in most machine learning methods. A time series consists of data points ordered in chronological order, for instance, the CPU temperature over time.

This section explains the fundamental concepts of how to analyze time series data, how to make predictions based on forecasting techniques, and how time series data can be divided into windows.

2.2.1 Analysis and Forecasting

In the analysis step, the goal is to deduce information from a time series sequence. This is done by applying different techniques; one such technique is dividing the time series data into windows and then applying mathematical methods. The goal is to transform the original data into a form in which trends and characteristics can be identified. Forecasting is the act of extrapolating a model built in the analysis step on historical data to predict a future event. Forecasting time series has a wide variety of use cases, for instance, anomaly detection.

2.2.2 Window Types

Time series data often contain a large amount of data, and a technique used to find characteristics in the data is to group it into windows. There are three main approaches: fixed, sliding, and session windows. Which window type to use depends on the data and can potentially have a great influence on the final result. A fixed window fetches all messages within a period of time, for example, all messages per day or hour. A sliding window consists of a window size and a step size; for example, the window size is a day and the step size is an hour. Session windows are based on identifiers; for instance, a window can be created based on an ID instead of time [3].
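The fixed and sliding window types can be sketched as follows. For brevity, the sketch windows over counts of events rather than timestamps; that simplification is an assumption for illustration, not the configuration used in this project.

```python
def fixed_windows(events, size):
    """Fixed window: consecutive, non-overlapping chunks of `size` events."""
    return [events[i:i + size] for i in range(0, len(events), size)]

def sliding_windows(events, size, step):
    """Sliding window: windows of `size` events, advancing by `step`."""
    return [events[i:i + size] for i in range(0, len(events) - size + 1, step)]

ts = list(range(10))
print(fixed_windows(ts, 5))       # two non-overlapping windows of 5 events
print(sliding_windows(ts, 5, 2))  # three overlapping windows of 5 events
```

A time-based window would group by timestamp ranges instead of element counts, but the overlap behavior is the same.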

2.2.3 Mathematical Methods

Mathematical methods are used to describe the characteristics of the time series data. The time series can be analyzed with term weighting, which aims to determine the importance of an event based on its frequency. Similar methods can be used to normalize the data. Examples of such mathematical methods are term frequency-inverse document frequency (TF-IDF), zero mean, and sigmoid. TF-IDF is a statistical method that determines how important an event is in the data set based on how frequently the event occurs. Term frequency (TF) is simply the number of occurrences of an event. Inverse document frequency (IDF) is a measurement of how much information is revealed by an event. IDF is calculated on a logarithmic scale, by taking the inverse of the fraction of windows in which the event occurs, as in equation 2.1.

IDF(t, D) = \log \left( \frac{N}{|\{ d \in D : t \in d \}|} \right) \quad (2.1)

where N is the number of windows, t is an event, and D is the set of windows; the denominator counts the windows in which t occurs. TF-IDF combines the two by multiplying the term frequency with the IDF, as in equation 2.2 [6, 7].

TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D) \quad (2.2)
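Equations 2.1 and 2.2 can be sketched as follows. One assumption to note: N is taken here as the number of windows (the standard document-count reading of IDF), and the per-window event lists are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(windows):
    """windows: list of event-ID lists, one per time window.
    Returns one {event: weight} dict per window (equations 2.1 and 2.2)."""
    n = len(windows)
    # Document frequency: in how many windows each event appears.
    df = Counter(e for w in windows for e in set(w))
    weighted = []
    for w in windows:
        tf = Counter(w)
        weighted.append({e: tf[e] * math.log(n / df[e]) for e in tf})
    return weighted

windows = [["E1", "E1", "E2"], ["E1", "E2"], ["E1", "E3"]]
for w in tf_idf(windows):
    print(w)
```

Note how E1, which occurs in every window, gets weight zero: a ubiquitous event reveals no information, which is exactly the intent of the IDF factor.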

Zero mean transforms a vector of numerical values such that the mean over all indexes is zero. It does this by shifting the amplitude [8].

The sigmoid is a mathematical function with an S-shaped curve. It has various implementations; in machine learning, the logistic function is commonly used as an activation function, as in equation 2.3 [9].

f(x) = \frac{1}{1 + e^{-x}} \quad (2.3)

2.3 Anomaly Detection

An anomaly detection method is used to build a representative model on historical time series data. The model can be extrapolated to make predictions about future events. If a prediction deviates by more than a threshold x, it can be considered an anomaly. An anomaly is defined as a pattern that deviates from the expected pattern. Some of the challenges in anomaly detection are:

• Defining the difference between a normal and an anomalous event.

• Data sets from a real context often contain a significant amount of noise, which may be interpreted as anomalies and vice versa.

• An anomaly can be a normal event with a sudden burst or a sudden decrease of occurrences.

• An adversary will try to act as expected so as not to raise suspicion.

A general approach to anomaly detection is very hard to obtain because of the nature of the input data; anomaly detection techniques tend to solve a specific problem [4]. This section presents the types of anomalies that exist and how to detect them (section 2.3.1), and the process of encoding real-world data into a suitable form (section 2.3.2). It also gives the necessary theory about how unsupervised anomaly detection methods with different strategies operate. Hierarchical Temporal Memory (HTM) is a method based on a neural network, Principal Component Analysis (PCA) is purely based on statistics, and LogCluster is a cluster-based approach.

2.3.1 Types of Anomalies

In an anomaly detection review report from 2009, an overview of the anomaly detection area is given [4]. The review identified the problem that there was no unified notion of anomalies, and tried to define anomalies by grouping them into point anomalies, contextual anomalies, and collective anomalies. Many anomaly detection techniques are specifically designed for a certain task, and a unified notion prevents misunderstandings when talking about anomaly detection techniques. The simplest form of anomaly is the point anomaly, which occurs when an individual data point differs significantly from the rest of the data points. For instance, in a DDoS attack where the only feature is the number of unique IP addresses, a sharp increase in unique IP addresses would be considered a point anomaly. Another type is the contextual anomaly, where a data point is classified as anomalous only in a specific context. These events have contextual and behavioral attributes; in time series data, time is considered a contextual attribute. Collective anomalies occur when a group of related data points is anomalous compared to the entire data set [4].

2.3.2 Hierarchical Temporal Memory

Hierarchical Temporal Memory (HTM) is a biologically inspired machine learning method modeled on the neocortex in mammals' brains. The neocortex exists exclusively in mammals, enables higher-order brain functions, and is what builds our identity. It is the largest part of our brain and consists of six layers in a hierarchical order, formed like a pyramid. Input is taken from the real world via sensors such as the eyes and ears; the signal travels through synapses that stimulate neurons. Only a very small portion of neurons are active at the same time [10]. An HTM model in a machine learning context closely mimics the mammalian neocortex and has a wide variety of use cases. HTM models learn in an unsupervised to semi-supervised fashion and were originally described in the book "On Intelligence" by Jeff Hawkins and Sandra Blakeslee in 2004 [11]. HTM is extremely good at managing noisy environments thanks to the property of sparsity [12], which makes it suitable for many scenarios.

Sparse Distributed Representation

In the field of AI and machine learning, one of the hardest tasks is to represent the real world in a format that a computer can understand.

Sparse Distributed Representation (SDR) is an attempt to represent information received via sensors as a binary string. An SDR is a data structure that holds a bit array of ones and zeroes; however, only a small portion of the bits are allowed to be active. Each bit in the SDR corresponds to a neuron, and each bit has a meaning. Each active bit carries information about the semantics of an attribute, but the bits are not labeled. When two SDRs have similar active bits, they share semantic attributes from the real world. Two SDRs can be compared with an intersection: the number of overlapping active bits shows the similarity between the two SDRs. An SDR representation has to be very sparse; about 2% of the bits can be active at once. For example, if the array consists of 2000 bits, then there can be at most 40 active bits at a sparsity of 2%. Each of these 40 bits describes something; for example, a number can be represented with a set of active bits, and depending on the span of numbers and the level of detail required, the number of bits needed may vary.

One position could describe whether the number is even, another bit whether it is prime. The bigger the array, the more features can be stored. Even if the SDRs are very big, they can be stored efficiently in memory by exploiting the small active population: only the indexes of the active bits need to be stored, which reduces the size by 98% [13, 14].

01000011100000000000 (2.4)

00000001110000000000 (2.5)

The bit array in 2.4 could represent the number 4 and that it is even: the second bit means it is even, and the three consecutive active bits could represent the number 4. In 2.5, the active bits have moved one position to the right, representing 5, and the even bit is now inactive. The intersection between these two arrays gives an overlap of 2 bits, which shows that they are semantically similar, because 4 and 5 are close to each other.
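The overlap computation on the two example SDRs from 2.4 and 2.5 can be verified with a few lines of code:

```python
def overlap(a, b):
    """Number of positions where both SDRs have an active bit."""
    return sum(x & y for x, y in zip(a, b))

# The two example SDRs from 2.4 and 2.5.
sdr_four = [int(c) for c in "01000011100000000000"]
sdr_five = [int(c) for c in "00000001110000000000"]
print(overlap(sdr_four, sdr_five))  # 2
```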

(17)

\binom{n}{w} = \frac{n!}{w!(n - w)!} \quad (2.6)

An SDR can hold a lot of information; the formula in equation 2.6 gives the number of possible representations, where n is the size of the array and w is the cardinality (the number of active bits).
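For the earlier example of n = 2000 bits with w = 40 active bits, equation 2.6 can be evaluated directly (`math.comb` requires Python 3.8+):

```python
import math

# Number of possible SDRs with n = 2000 bits and w = 40 active bits,
# i.e. the binomial coefficient C(n, w) from equation 2.6.
n, w = 2000, 40
capacity = math.comb(n, w)
print(f"{capacity:.3e}")  # roughly 10^84 distinct patterns
```

This enormous capacity is why collisions between unrelated SDRs are, in practice, vanishingly unlikely.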

Encoding Data

The flow from a real event to an SDR is described in figure 2.1: a real event happens, then an encoder transforms the event into an SDR, which is the input layer of HTM methods.

Figure 2.1: From real event to SDR

To create an encoder, these four properties must be fulfilled:

• Semantically similar input data should result in an SDR with overlapping active bits.

• The encoder has to be deterministic, meaning the same input always produces the same SDR.

• The dimensionality of the output SDR should always be the same.

• The sparsity of the output should always be the same for all inputs and have enough active bits to efficiently deal with subsampling and noise.

An encoder can be built to fit any domain, but it has proven to be a very hard task to fulfill these properties. One of the most challenging tasks is to encode similar events from a real context in such a way that the SDRs are semantically similar, meaning they have overlapping active bits. This is the key property used when calculating anomaly likelihood in HTM algorithms [15].
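A minimal encoder satisfying the four properties is a scalar encoder in the spirit of the ones used in HTM implementations; the parameter values below are illustrative assumptions, not those of any particular library.

```python
def encode_scalar(value, min_val=0, max_val=100, n_bits=400, w=21):
    """Deterministic scalar encoder: a contiguous run of `w` active bits
    whose position depends on `value`. Nearby values get overlapping runs
    (semantic similarity); output size and sparsity are fixed for all inputs."""
    value = max(min_val, min(max_val, value))  # clamp to the encoder's range
    start = int((value - min_val) / (max_val - min_val) * (n_bits - w))
    sdr = [0] * n_bits
    for i in range(start, start + w):
        sdr[i] = 1
    return sdr

a, b = encode_scalar(50), encode_scalar(52)
print(sum(a))                            # always w = 21 active bits
print(sum(x & y for x, y in zip(a, b)))  # large overlap: 50 is close to 52
```

Determinism, fixed dimensionality, and fixed sparsity hold by construction; semantic overlap follows from the sliding position of the active run.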

2.3.3 Log Cluster

LogCluster is a tool for unsupervised anomaly detection on time series data with the aid of clustering techniques. LogCluster consists of two phases: the construction phase and the production phase [3].


Construction Phase

The construction phase consists of three steps: vectorization, clustering of logs, and lastly representative vector extraction. In the vectorization step, each log event is converted to a numerical vector. This is done by counting each event ID occurrence within a time window of length x. Each event ID frequency in the count vector is weighted by calculating the Inverse Document Frequency (IDF) values. The IDF is calculated with the formula in equation 2.7,

W_{idf}(t) = \log \left( \frac{N}{n_t} \right) \quad (2.7)

where N is the sum of all events and n_t is the number of times event t occurs.

In the log clustering step, the similarity between two log sequences is calculated and clusters are merged with a method called agglomerative hierarchical clustering. This clustering method merges clusters lying close to each other. In the beginning, every log sequence forms its own cluster, but step by step the similarity between clusters is measured and the closest clusters are merged into a new cluster. This process goes on until the distance threshold (Max Dist) is met, and we end up with a normal cluster and an anomaly cluster. For each cluster, a representative sequence is extracted by measuring which sequence is closest to the cluster's centroid [16].
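The clustering step can be sketched as a naive agglomerative procedure with a centroid-distance threshold. The vectors and the Max Dist value below are made up for illustration, and real implementations use more efficient linkage algorithms than this O(n^3) loop.

```python
import math

def agglomerative(vectors, max_dist):
    """Repeatedly merge the two closest clusters (by centroid distance)
    until the smallest inter-cluster distance exceeds max_dist."""
    clusters = [[v] for v in vectors]

    def centroid(c):
        return [sum(col) / len(c) for col in zip(*c)]

    while len(clusters) > 1:
        pairs = [(math.dist(centroid(clusters[i]), centroid(clusters[j])), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > max_dist:
            break  # Max Dist reached: stop merging
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

# Two dense groups of event-count vectors plus one outlier.
vecs = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (30, 0)]
for c in agglomerative(vecs, max_dist=3.0):
    print(c)
```

With this threshold the six vectors end up in three clusters: two dense groups and the lone outlier, which in LogCluster's terms would fall into the anomaly cluster.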

Production Phase

The production phase uses data from the actual test set, for example, real-time log events. It consists of the previously mentioned log vectorization and log clustering steps. In the online production phase, each new event count vector is compared to each cluster's representative vector, and if the distance is less than a threshold, the new event is added to the closest cluster. The representative sequence of that cluster is then updated to extend the knowledge base [3, 16].

2.3.4 Principal Component Analysis

Principal Component Analysis (PCA) is a method purely based on statistical analysis. It is especially suitable when there is a large feature space. PCA reduces the dimensionality by projecting high-dimensional data onto a new coordinate system consisting of k principal components, where k is less than the original dimensionality. PCA tries to preserve the characteristics of the original data by capturing the highest variance in the high-dimensional data.


The principal components are ranked, and the first component is considered more important than the following components. It is hard to determine how many principal components to use, and the choice highly influences the result. To detect anomalies, the distance from a new point to the normal subspace is measured; the point is considered an anomaly if the distance is higher than a threshold [3, 17, 18].
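The residual-to-normal-subspace idea can be sketched with synthetic data; the data, the choice of k, and the use of SVD below are illustrative assumptions, not the implementation used in this thesis.

```python
import numpy as np

def pca_anomaly_scores(X, k):
    """Project onto the top-k principal components and score each point
    by its squared residual distance to that 'normal' subspace."""
    Xc = X - X.mean(axis=0)
    # Principal components = right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k]                      # (k, n_features)
    residual = Xc - Xc @ P.T @ P    # part not explained by the subspace
    return (residual ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Correlated 2-D data along the direction (1, 1), plus one off-axis point.
normal = rng.normal(size=(100, 1)) @ np.array([[1.0, 1.0]])
X = np.vstack([normal, [[3.0, -3.0]]])
scores = pca_anomaly_scores(X, k=1)
print(scores.argmax())  # 100: the injected off-axis point scores highest
```

A threshold on these scores then separates normal points (near-zero residual) from anomalies.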

2.3.5 Feature Distribution Detection

Entropy exists in many variants and is commonly used to describe the distribution of the numeric values in a vector. Shannon entropy is a well-known entropy measure in anomaly detection contexts. The more similar the values are, the lower the entropy; a wide spread of values increases the entropy [19]. The formula for calculating Shannon entropy can be seen in equation 2.8. It transforms a numeric vector into a single number, which is suitable as input to many anomaly detection methods.

H(X) = -\sum_{i=0}^{N-1} p_i \log_2 p_i \quad (2.8)
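Equation 2.8 applied to the empirical distribution of a vector of values can be sketched as follows; the example values (HTTP status codes in a window) are made up for illustration.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Equation 2.8 over the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(shannon_entropy([200, 200, 200, 200]))  # 0.0: all values identical
print(shannon_entropy([200, 404, 401, 500]))  # 2.0: maximally spread
```

A sudden jump in this single number, fed into a detector such as HTMAL, can indicate that the distribution of events has changed.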

2.4 Performance Measurement

To evaluate the performance of machine learning algorithms, many different scoring techniques can be used, for instance, F-score and accuracy.

2.4.1 F-Score

The F-score is the harmonic mean of recall and precision. A score of one means a perfect test run, while a score of zero is the worst possible outcome [20]. The score is calculated as in equation 2.9.

F\text{-}score = \left( \frac{recall^{-1} + precision^{-1}}{2} \right)^{-1} \quad (2.9)

Evaluation of anomaly detection methods commonly depends on four different cases: true or false negatives and true or false positives. F-score takes three of these cases into consideration, as can be seen in figure 2.2.

(20)

Figure 2.2: Precision is calculated based on True Positives and False Positives. Recall is calculated based on True Positives and False Negatives.

The true negative case is completely ignored; this has to be taken into consideration when using F-score.

Recall

Recall is the number of correctly identified positives divided by the total number of actual positives, as seen in equation 2.10 [21].

Recall = \frac{TruePositives}{TruePositives + FalseNegatives} \quad (2.10)

Precision

Precision is the number of correctly identified positives divided by the number of all positive predictions, as seen in equation 2.11 [21].

Precision = \frac{TruePositives}{TruePositives + FalsePositives} \quad (2.11)
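Equations 2.9 to 2.11 can be combined into a small helper; the counts in the usage example are made up.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (equations 2.9 to 2.11).
    Note that true negatives do not appear anywhere in the formula."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# E.g. a detector that found 40 of 80 injected anomalies with 20 false alarms:
print(f_score(tp=40, fp=20, fn=40))  # about 0.57
```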

2.4.2 Accuracy

Accuracy is the most intuitive performance measurement. It measures the ratio between the correctly identified events (true negatives plus true positives) and the total number of events. It covers all possible cases and rewards methods for finding true negatives [22].


2.5 Related Work

There exists a lot of research in the field of log-based machine learning. It is an important research field with regard to improving the ability to monitor the system health of applications, servers, and data centers.

2.5.1 Experience report

An experience report from 2016 by Shilin et al. [3] addresses the issue that anomaly detection methods are non-trivial to use and aims to compare state-of-the-art methods. A set of supervised methods, such as logistic regression, decision trees, and support vector machines, is compared. Additionally, the unsupervised methods LogCluster, principal component analysis, and invariant mining are evaluated. The supervised methods are slightly better on average, with F-scores ranging from 0.74 to 0.85, while the unsupervised methods get F-scores ranging from 0.55 to 0.91. The best method was invariant mining, which received an F-score of 0.91 on a BGL data set. The report states that the supervised methods had much faster execution times than the unsupervised methods. However, the supervised methods are not practical to use since they require labeled data. PCA was the fastest unsupervised method, while LogCluster and invariant mining are considered computationally heavy.

The experience report also released a tool with all the methods implemented, to make it easier for the industry to use them.

2.5.2 Unsupervised real-time anomaly detection

In the paper Unsupervised real-time anomaly detection for streaming data by S. Ahmad et al. [23], a set of online unsupervised algorithms was compared on a benchmark of real-world data streams called the Numenta Anomaly Benchmark (NAB). The report compared 11 different methods, such as Etsy Skyline [24], Twitter ADVec [25], Contextual Anomaly Detector Open Source Edition (CAD OSE), and Hierarchical Temporal Memory (HTM) [26]. HTM is a neural network approach that, extended with an anomaly likelihood (AL) estimation, achieved the highest score. The methods were all tested on numeric data such as client CPU utilization, request latency, and machine system temperature. The report notes that the massive number of streams that have to be monitored to increase availability raises the demands on the anomaly detection algorithms: they have to be fully automated, since human intervention for each detected anomaly is not feasible.


2.5.3 Network based anomaly detection

Berezinski et al. [19] evaluated different entropy approaches to find network anomalies. A set of entropy methods was compared by applying them to a set of features and then sending the result to a machine learning method called Anode. An entropy function describes how a set of numerical values is distributed; it typically outputs a small numeric value that fits most input layers. Renyi and Tsallis entropy performed best.

2.5.4 Access log anomaly detection

M. Tharshini et al. compared supervised learning with unsupervised learning to find anomalies in web access logs in the paper Access Log Anomaly Detection [27]. The project's objective was to analyze web access logs to find anomalous events in order to defend against attacks. The authors chose to work with access logs since they are considered a very important indicator of intrusions, and it is important to detect potential attacks in time. The most successful method was a supervised method, the Naive Bayes Multinomial Text algorithm, which achieved an error rate of approximately 10%, compared to an unsupervised clustering method which achieved an error rate of approximately 35%.

2.5.5 Commercial products

A commercial solution named X-Pack was developed by the Elasticsearch machine learning team. The implementation is closed source but is built upon unsupervised machine learning based on time series decomposition and clustering techniques [28]. X-Pack has somewhat limited configuration properties: what can be modified is the input data and the bucket span, which divides the data into batches. A detector property can also be chosen, for example, the total number of occurrences.

2.6 Summary

Due to the nature of the problem, where a large amount of unlabeled data is constantly produced, this project focuses on unsupervised methods. Throughout the literature study, the most promising methods were HTMAL, LogCluster, and PCA; they showed good results in contexts similar to the environment this project concerns. HTMAL outperformed 10 other unsupervised anomaly


detection methods in the report described in section 2.5.2. LogCluster and PCA showed promising results as described in section 2.5.1. They can operate on any free-text log-based data by transforming it into a numerical feature vector. Distribution approaches have been studied in various settings, such as the one described in section 2.5.3; therefore Shannon entropy in combination with HTMAL was tested. Both methods have appeared in multiple reports with good results, however, they have not been used together. The methods were compared with F-score since it is a proven way to compare the performance of algorithms.


Method

In this chapter, the methodology used to answer the research question is described in detail. The environment in which the experiment is conducted and the data set are described in section 3.1. The pre-processing of raw log events and the choice of features are described in section 3.2. Implementation details of HTMAL, LogCluster, and PCA are presented in section 3.4, such that they can easily be reproduced. Finally, the model comparison strategy is presented in section 3.5.

3.1 Data and Environment

The data set originates from an Apache access log; the Apache web server acts as a reverse proxy to multiple company-critical applications, described in section 3.1.1 and figure 3.1.

3.1.1 Environment

The implementation phase was conducted in an environment consisting of an Apache web server serving as a reverse proxy for the application layer, see figure 3.1. The application layer consists of company-critical applications that are essential for the company's infrastructure. Examples of application functions are version control, task management, and documentation. These applications are not only used through human interaction, they are also used by a large number of scripts that are scheduled at different times. Some scripts run very frequently, up to every other minute, and some run once a week. These scripts have very different run times and they cause heavy fluctuations in the


data which increases the complexity of classifying new data points.

Figure 3.1: The environment in which the Apache access log data is initially created.

The reverse proxy gathers data from the application layer on behalf of the clients. A client asks the reverse proxy for a resource, and the proxy forwards the request to the corresponding application. The application replies to the reverse proxy with the requested data, and the reverse proxy sends it back to the client. This creates an illusion of the reverse proxy being the source of the information while the applications remain hidden from the clients. The process is illustrated in figure 3.1. This enables extensive logging from all applications in one single Apache access log. The clients originate from a large set of different internal networks. The data is initially stored on the host responsible for the reverse proxy, in an Apache access log. The host has a log collecting agent that constantly sends log events to a centralized log management system, in this case the Elastic stack. However, the Elastic stack has not been used during the evaluation phase of this project. For the sake of simplicity, the chosen methods were evaluated directly on the raw Apache logs.

3.1.2 Data set

The access logs provide historical logs for a period of approximately two years.

The historical data consists of a tremendous amount of log events and it was not feasible to use all available data. Therefore a baseline was extracted to represent a normal state in which no incidents were reported or identified. The


chosen baseline was extracted from four consecutive days and consisted of 2 710 948 log events sorted in chronological order. Each log event consists of the following information fields: client IP, identity, username, date, request, response code, and size of the response.

3.2 Feature Selection and Pre-processing

3.2.1 Features

For this study, a set of 20 features was used, five for each application, see table 3.1. The features were selected based on the application and the REST response code received from the application. The responses 200, 401, 403, 404, and 503 were considered most representative for describing the system's health. They stand for "Everything went as intended", "Unauthenticated", "Unauthorized/Forbidden", "Page not found", and "Service unavailable".

App (1-4)   Response   E-ID

1-4         200        E(0-3)0
1-4         401        E(0-3)1
1-4         403        E(0-3)2
1-4         404        E(0-3)3
1-4         503        E(0-3)5

Table 3.1: Table of possible event-IDs generated by E(X-1)Y where X is the application and Y is the REST response.

3.2.2 Window Type

During the evaluation of the methods, a fixed time window was used. To see how the fluctuations in the environment affect the result, a fine-grained and a coarse-grained granularity were created. The fine-grained granularity was set to 10-second windows and the coarse-grained to 60-second windows. A coarse-grained granularity could smooth out the high peaks in the traffic caused by scripts in the environment, while a fine-grained granularity could increase fluctuation and noise. This resulted in 5760 data points with the coarse-grained granularity and 34 560 data points with the fine-grained granularity.
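Grouping timestamped events into fixed windows of either width can be sketched as follows. This is a minimal illustration, assuming epoch-second timestamps; the function and variable names are illustrative and not taken from the project's code.

```python
from collections import defaultdict

def window_index(timestamp, start, width):
    """Map an epoch-second timestamp to its fixed-window index."""
    return int((timestamp - start) // width)

def group_into_windows(events, start, width):
    """events: iterable of (timestamp, event_id) pairs -> {window index: [event IDs]}."""
    windows = defaultdict(list)
    for ts, eid in events:
        windows[window_index(ts, start, width)].append(eid)
    return dict(windows)

events = [(0, "E00"), (5, "E03"), (12, "E00"), (61, "E15")]
coarse = group_into_windows(events, start=0, width=60)  # 60-second windows
fine = group_into_windows(events, start=0, width=10)    # 10-second windows
print(len(coarse), len(fine))  # 2 3
```

Note that only windows containing events appear in the result; in the full pipeline, empty windows would correspond to all-zero feature vectors.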


3.2.3 Apache log to an Event ID Vector

The process of transforming raw Apache logs to count vectors can be seen in figure 3.2. First, the raw Apache log was parsed through a regex in Python to classify each log event into an event ID. The events are then grouped into a numeric feature vector. Each index of the numeric feature vector corresponds to an event ID and holds the frequency with which it appeared during the last window.

Figure 3.2: From an Apache access log to event IDs and then to a numeric fea- ture vector. Each number corresponds to how many times each event occurred during the last window.

An example Apache access log message is structured as follows:

127.0.0.1 - kalle [9/Feb/2017:10:34:12 +0100] "GET /app/sample-image.png HTTP/2" 200 1479

From the URL field, each application's root name was used as a feature in combination with the response code. For example, app 1 with response 200 gets event ID E00 and app 2 with response 503 gets event ID E15. This results in 20 different log events, as illustrated in table 3.1.
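The parsing step above can be sketched in Python. The regex, the application-name mapping, and the status-to-digit table below are assumptions made for illustration (the project's actual regex and application names are not reproduced here); the status digits follow table 3.1.

```python
import re
from collections import Counter

# Illustrative pattern for the combined-log line shown above.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) /(?P<app>[^/\s]+)\S* \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical application root names mapped to index 0-3.
APP_INDEX = {"app1": 0, "app2": 1, "app3": 2, "app4": 3}
TRACKED = {"200": 0, "401": 1, "403": 2, "404": 3, "503": 5}  # status -> Y digit

def event_id(line):
    """Map one raw log line to an event ID such as 'E00', or None."""
    m = LOG_RE.match(line)
    if not m or m.group("app") not in APP_INDEX or m.group("status") not in TRACKED:
        return None
    return "E%d%d" % (APP_INDEX[m.group("app")], TRACKED[m.group("status")])

EVENT_IDS = ["E%d%d" % (x, y) for x in range(4) for y in (0, 1, 2, 3, 5)]

def count_vector(lines):
    """One window of raw log lines -> 20-element numeric feature vector."""
    counts = Counter(event_id(l) for l in lines)
    return [counts[i] for i in EVENT_IDS]

line = '127.0.0.1 - kalle [9/Feb/2017:10:34:12 +0100] "GET /app1/sample-image.png HTTP/2" 200 1479'
print(event_id(line))  # E00
```

Lines that do not match the pattern, or that reference an untracked application or status code, are simply not counted.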

3.3 Artificial Anomaly Injection

An assumption that the selected data was in a normal state was used to define anomalies during the anomaly creation process. Two ways of creating an anomaly were used:

• Picking a random quantifier between 10 and 150.

• Picking a random increase between 10 and 200 events.

First, the average frequency was determined by grouping the data into time windows of a chosen granularity; each of these windows corresponds to a data


point. For each time window, the average frequency of each of the 20 possible event types was calculated. Then a random event was chosen and injected X number of times into a random window, making that window, or data point, anomalous. This was repeated until 0.5% of the data points were anomalous. X was determined by multiplying the average frequency of the selected event with the randomly chosen quantifier between 10 and 150. If the multiplication resulted in a value smaller than 10, an increase was instead randomly chosen from the range between 10 and 200, and the event was appended that number of times into a random window.

This simulates attacks such as brute forcing employees' credentials, a DDoS attack, or a possible system failure. A large burst of 403 events would indicate that someone is likely trying to brute force a specific account, or it could be a faulty script. An increase in 404 errors would imply that a service is about to stop, or has stopped, working.
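The injection procedure described above can be sketched roughly as follows. The window representation (one count dict per window) and the function name are assumptions for illustration, not the thesis code.

```python
import random

def inject_anomalies(windows, avg_freq, fraction=0.005, seed=None):
    """Inject artificial anomalies into a fraction of the windows.

    windows:  list of dicts mapping event ID -> count (one dict per window)
    avg_freq: dict mapping event ID -> average count per window
    Returns the indices of the windows that were made anomalous.
    """
    rng = random.Random(seed)
    n_anomalies = max(1, int(len(windows) * fraction))
    anomalous = rng.sample(range(len(windows)), n_anomalies)
    for idx in anomalous:
        event = rng.choice(list(avg_freq))
        # First case: multiply the average frequency by a quantifier 10-150.
        extra = int(avg_freq[event] * rng.randint(10, 150))
        if extra < 10:
            # Second case: fall back to a random increase of 10-200 events.
            extra = rng.randint(10, 200)
        windows[idx][event] = windows[idx].get(event, 0) + extra
    return anomalous
```

The returned indices double as the ground-truth labels used later during evaluation.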

3.4 Implementation

This section explains how the input layers of the different anomaly detection methods are implemented and which partition of the data set was used to build the models. The implementation details of each of the methods PCA, LogCluster, HTMAL, and HTMAL Shannon entropy are described, along with how the hyperparameters are calculated.

3.4.1 Input Layers

The chosen methods LogCluster and PCA have identical input layers; they take a vector containing the number of occurrences of each event as input.

The HTM model takes two parameters, date/time and a numeric value, which are converted by an encoder into an SDR. Notice that LogCluster and PCA do not take time into account. The process of transforming the data into each method's respective input layer is described in figure 3.3.

(29)

Figure 3.3: The process of sending a numerical feature vector to each method's respective input layer.

HTMAL

HTMAL uses two features: the date in combination with a numeric value. The numeric value was based on how many accesses were made during the last window, as illustrated in figure 3.3. This was chosen to find changes in activity through the reverse proxy.

HTMAL Shannon Entropy

This method uses the same input layer as HTMAL. Shannon entropy is calculated on the numeric feature vector as illustrated in figure 3.3. It is calculated with the formula in equation 2.8; however, the formula does not handle zeroes, which was solved by removing all zeroes from the list. The numerical value is a measure of how well distributed the represented events were during the last time window.

(30)

3.4.2 Training Ratio

Throughout this project, a training ratio of 30% was used. Unsupervised methods do not need a specific training set, but to prevent heavy fluctuations a training period of 30% was used. The consequence of this is that the first 30% of the data was ignored in the results.

3.4.3 Implementation Details

This section explains the configuration and implementation details of LogCluster, PCA, HTMAL, and HTMAL Shannon entropy. In table 3.2 the default hyperparameters for each method are displayed. These default parameters were provided by the respective reports that implemented the different methods used in this project.

LogCluster

LogCluster is implemented in Python and is built upon Scipy [29] and Numpy [30]. A default implementation from a Github project called Logpai was used with the default configuration seen in an experience report [3, 31]. The default parameters are displayed in table 3.2.

PCA

PCA is implemented with the default configuration given in an experience report [3]. The report provided an implementation on Github that was used in this project [31]. The default parameters are displayed in table 3.2.

HTMAL

HTM is built on a framework called Nupic, which is created by an open source community. The implementation used in this project uses the default configuration presented in the supplementary material [32]. A default implementation from a Github project called NAB was used [26]. The default parameters are displayed in table 3.2.

HTMAL Shannon Entropy

HTMAL Shannon Entropy used the same base implementation as HTMAL and the same default configuration. It is then combined with an entropy function called Shannon entropy; these methods have not been combined in any of


the reports seen in the literature study. The HTMAL implementation takes two features, date/time and a numeric value, as described in section 3.4.1. The numeric value is based on calculating the Shannon entropy of the numeric count vector as illustrated in figure 3.3. It is calculated using a variation of Shannon entropy that describes the distribution of the represented events in the numeric feature vector. To deal with zero division, all zero indexes are removed. Shannon entropy describes the minimum number of bits needed to encode a string based on the frequency with which it occurs in a window. It is calculated with the formula shown in equation 2.8. The default parameters are displayed in table 3.2.
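A minimal sketch of that entropy computation, assuming the standard base-2 form of Shannon entropy (equation 2.8); zero counts are dropped before taking logarithms, mirroring the zero-handling described above. This is an illustration, not the project's exact code.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the event distribution in one window.

    Zero entries are dropped first to avoid log(0), mirroring the
    zero-handling described above.
    """
    nonzero = [c for c in counts if c > 0]
    total = sum(nonzero)
    if total == 0:
        return 0.0
    return sum(-c / total * math.log2(c / total) for c in nonzero)

print(shannon_entropy([8, 8, 0, 0]))   # 1.0: activity split evenly over two events
print(shannon_entropy([16, 0, 0, 0]))  # 0.0: all activity in one event type
```

Evenly spread activity yields the maximum entropy for a given number of active event types, while concentration on a single event type yields zero.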

Default Hyperparameters

In table 3.2 the default hyperparameters are displayed. These were later optimized for the data set used in this project.

Default H-Parameters   Term Weighting   Normalization   Threshold   Max Dist

PCA                    TF-IDF           Zero Mean       X           X
LogCluster             X                TF-IDF          0.30        0.30
HTMAL                  X                X               0.50        X
HTMAL S-Entropy        X                X               0.50        X

Table 3.2: Default hyperparameters, X meaning it is not used.

Validation

Each of the implementations was validated by running it on publicly known data sets. PCA and LogCluster achieved the same F-scores as in the report where they were originally evaluated [3]. HTMAL was validated by running the NAB data set and it received the same NAB score as in the original report [23].

3.4.4 Hyperparameter Optimization

The hyperparameters were estimated by putting the program in an infinite loop. For each iteration in the loop, a new set of hyperparameters was randomly chosen. The new set of hyperparameters was then evaluated by transforming 0.5% of the data points into anomalies at random places in the data set, as described in section 3.3. An average F-score was calculated based on 20 iterations for each combination of hyperparameters. If a higher F-score was found, the hyperparameters were saved.
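The search loop can be sketched roughly as follows. Here `evaluate` stands in for running one method on a freshly injected data set and returning its F-score; the parameter names and value ranges follow the descriptions in this section, but the function itself is an illustration, not the project's code.

```python
import random

def random_search(evaluate, iterations_per_config=20, seed=None):
    """Random hyperparameter search as described above.

    evaluate(params) -> F-score on one randomly injected data set.
    The project ran this in an infinite loop; bounded here for illustration.
    """
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(100):
        params = {
            "threshold": rng.uniform(0.01, 1.0),
            "max_dist": rng.uniform(0.01, 1.0),
            "normalization": rng.choice(["TF-IDF", "Zero-mean", "Sigmoid"]),
        }
        # Average the F-score over 20 injection rounds per configuration.
        score = sum(evaluate(params) for _ in range(iterations_per_config))
        score /= iterations_per_config
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Averaging over 20 injection rounds reduces the variance introduced by the random placement of the artificial anomalies.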


LogCluster

LogCluster has three hyperparameters: normalization method, max dist, and threshold. Max dist and threshold are randomly chosen between 0.01 and 1. The normalization method was randomly chosen between TF-IDF, Zero-mean, and Sigmoid.

PCA

PCA uses two non-numerical hyperparameters, term weighting and normalization. Every combination is executed by looping over the lists. The possible values for normalization are None and Zero Mean, and for term weighting Zero Mean, TF-IDF, and Sigmoid.

HTMAL and HTM Shannon Entropy

HTMAL and HTMAL Shannon Entropy use one hyperparameter, the threshold, which is randomly chosen between 0.01 and 1.

3.5 Evaluation

The test environment was set up programmatically by transforming 0.5% of the data points in the data set into anomalies, in the same way as described in section 3.3. The modified data set was then sent to each anomaly detection method. The methods were compared with F-score, which outputs a numerical value between 0 and 1. An assumption was made that the initial data set was in a normal state, which makes F-score suitable because it does not reward true negatives. F-score takes two lists: one containing the predicted results, filled with either 1's or 0's, and one list with the true values. The true values are initially filled with zeroes because of the assumption that the data is normal initially; for each injected anomaly the corresponding index was changed to 1, marking it as an anomaly. The F-score value was saved and a new iteration started with a new set of anomalies. This was repeated 20 times and the average F-score was then compared with a tolerance calculated from the standard deviation of the F-score. For further analysis, the recall, precision, number of found anomalies, missed anomalies, and false positives were saved for each run.
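The comparison metric can be sketched as a plain precision/recall/F-score computation over the 0/1 prediction and ground-truth lists described above; true negatives do not enter the score. This is the standard F1 definition, shown here for illustration.

```python
def f_score(predicted, truth):
    """F1 score of two equal-length 0/1 lists; true negatives are ignored."""
    tp = sum(p == 1 and t == 1 for p, t in zip(predicted, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(predicted, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(predicted, truth))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth     = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 0, 1, 0, 0, 0]
print(round(f_score(predicted, truth), 3))  # tp=2, fp=1, fn=1 -> 0.667
```

Because the true list is almost entirely zeroes, a method that never flags anything still scores 0 rather than being rewarded for its many true negatives.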


Results

In this chapter, the results are presented and a comparison between the models with different configurations is displayed. In section 4.1 the average event frequency per granularity is displayed. Section 4.2 shows how the different models perform with default and optimized hyperparameters, and how the performance is affected when increasing the anomaly amplitude.

4.1 Average Event Frequency

Table 4.1 and 4.2 show the average frequency of REST responses received by each application. The average frequencies showed that application 3 is most representative with a large number of 200 responses; application 1 is the next most representative, closely followed by application 2. Application 4 has a very low activity, which indicates that it is not used much. Response 200 is by far the most common; it indicates that the application functioned as intended. Low frequencies of 403 and 503 indicate that the data set represents a normal state. An increase in those fields would indicate forbidden or service unavailable, meaning that a user or script tried to access a resource they are not permitted to see or that a service is unavailable. Application 2 has a significantly high number of occurrences of REST code 401, which could lead to an unrepresentative model of a normal state. The relation between table 4.1 and 4.2 is a factor of six, since table 4.1 is based on 60-second windows and table 4.2 on 10-second windows.


Coarse-Grained Granularity

REST Code       200      401    403    404     503

Application 1   35.67    0.06   0.01   2.81    0
Application 2   23.38    8.13   0.01   0.08    0
Application 3   146.56   0.2    0      12.94   0.01
Application 4   0.03     0      0      0       0

Table 4.1: Event frequency (number of events per 60 seconds) with coarse- grained granularity showing application and REST codes based on 2 710 948 log events.

Fine-Grained Granularity

REST Code       200     401    403   404    503

Application 1   5.95    0.01   0     0.5    0
Application 2   3.9     1.4    0     0.01   0
Application 3   24.43   0.03   0     2.16   0.02
Application 4   0       0      0     0      0

Table 4.2: Event frequency (number of events per 10 seconds) with fine- grained granularity showing application and REST codes based on 2 710 948 log events.

4.2 Model Comparison

This section shows the results after the different models were compared with the default and optimized hyperparameters, and lastly with an amplified anomaly amplitude. The results can be seen in tables 4.3 - 4.9. The results have seven columns: Method, Missed Identification (MI), Correct Identification (CI), False Positives (FP), Recall, Precision, and F-score. The most important field is F-score, which is dependent on recall and precision. All values show the average based on 20 iterations; recall, precision, and F-score also display the standard deviation. The fine-grained granularity is compared with the coarse-grained granularity.


4.2.1 Default Hyperparameters

The default hyperparameters shown in table 3.2 are used, providing the results seen in tables 4.3 and 4.4. The results showed that PCA and LogCluster outperform HTMAL and HTMAL Shannon Entropy. In the coarse-grained case, LogCluster achieved a noticeably high precision, close to one, with zero false positives but low recall, and it received the best F-score. In the fine-grained case, PCA significantly improves and achieves an F-score of 0.54, which is the second-best score throughout this project; the other methods showed no significant change. With default hyperparameters, PCA achieved the best score in the fine-grained case.

Coarse-Grained Granularity

Method            MI      CI     FP      Recall        Precision     F-Score

PCA               21.10   6.90   42.60   0.25 ± 0.08   0.14 ± 0.00   0.18 ± 0.05
LogCluster        23.40   4.60   0.00    0.16 ± 0.06   0.95 ± 0.22   0.28 ± 0.09
HTMAL             27.05   0.95   21.65   0.03 ± 0.03   0.07 ± 0.08   0.04 ± 0.04
HTMAL S-Entropy   27.40   0.60   50.35   0.02 ± 0.03   0.01 ± 0.01   0.02 ± 0.02

Table 4.3: Coarse-grained granularity with default parameters on 5760 data points.

Fine-Grained Granularity

Method            MI       CI      FP       Recall        Precision     F-Score

PCA               93.55    78.45   39.70    0.46 ± 0.04   0.66 ± 0.02   0.54 ± 0.03
LogCluster        142.55   29.45   6.95     0.17 ± 0.03   0.80 ± 0.03   0.28 ± 0.05
HTMAL             168.45   3.55    146.25   0.02 ± 0.01   0.03 ± 0.01   0.02 ± 0.01
HTMAL S-Entropy   172      1.30    271.25   0.01 ± 0.01   0.01 ± 0.01   0.01 ± 0.01

Table 4.4: Fine-grained granularity with default hyperparameters on 34 560 data points.


4.2.2 Optimized Hyperparameters

These results are based on optimized hyperparameters calculated on the coarse-grained granularity data set.

The optimized hyperparameters seen in table 4.5 display the optimal values calculated based on the method described in section 3.4.4. It is very time-consuming to calculate hyperparameters on the fine-grained granularity data, therefore the same set of hyperparameters was used in both cases.

HyperParameters   Term Weighting   Normalization   Threshold   Max Dist

PCA               Zero Mean        None            X           X
LogCluster        Zero Mean        X               0.24        0.73
HTMAL             X                X               0.3         X
HTMAL S-Entropy   X                X               0.7         X

Table 4.5: Optimized hyperparameters, X meaning it is not used.

Looking at the results in tables 4.6 and 4.7, PCA significantly improves its F-score from 0.18 in table 4.3 to 0.39 in the coarse-grained case. No other significant changes in F-score were detected; however, PCA achieves an F-score of 0.56 in table 4.7, which is the best score throughout this project. LogCluster significantly reduces its precision compared to the coarse-grained case.

Coarse-Grained Granularity

Method            MI      CI     FP      Recall        Precision     F-Score

PCA               18.45   9.55   10.9    0.34 ± 0.07   0.46 ± 0.05   0.39 ± 0.07
LogCluster        23.05   4.95   4.90    0.18 ± 0.10   0.49 ± 0.10   0.26 ± 0.07
HTMAL             27.35   0.65   21.65   0.03 ± 0.03   0.07 ± 0.08   0.04 ± 0.04
HTMAL S-Entropy   27.4    0.60   50.6    0.02 ± 0.03   0.01 ± 0.01   0.02 ± 0.02

Table 4.6: Coarse-grained granularity with optimized hyperparameters on 5760 data points.


Fine-Grained Granularity

Method            MI       CI      FP       Recall        Precision     F-Score

PCA               86.35    85.65   47.80    0.50 ± 0.04   0.64 ± 0.02   0.56 ± 0.04
LogCluster        120      52      395.2    0.30 ± 0.20   0.12 ± 0.10   0.17 ± 0.02
HTMAL             168.25   3.75    142.05   0.02 ± 0.01   0.03 ± 0.02   0.02 ± 0.01
HTMAL S-Entropy   170.10   1.90    272.7    0.01 ± 0.01   0.01 ± 0.01   0.01 ± 0.01

Table 4.7: Fine-grained granularity with optimized hyperparameters on 34 560 data points.

4.2.3 Increased anomaly amplitude

This section shows the results when optimized hyperparameters were used and the anomalies were increased by a factor of 10.

The expected outcome when increasing the anomaly amplitude was an increased F-score. The results in tables 4.8 and 4.9 show a mixture of increases and decreases. In the coarse-grained case there was a significant increase in F-score for LogCluster and HTMAL compared to previous results. In the fine-grained case PCA showed a significant decrease compared to table 4.7 and HTMAL a significant increase. LogCluster further decreases in precision compared to the earlier fine-grained results in tables 4.4 and 4.7.

Coarse-Grained Granularity

Method            MI      CI     FP      Recall        Precision     F-Score

PCA               18.20   9.80   10.95   0.35 ± 0.08   0.47 ± 0.06   0.40 ± 0.08
LogCluster        21.40   6.60   4.95    0.24 ± 0.08   0.55 ± 0.14   0.33 ± 0.10
HTMAL             22.9    5.10   9.35    0.18 ± 0.04   0.38 ± 0.11   0.24 ± 0.05
HTMAL S-Entropy   26.55   1.45   64.90   0.05 ± 0.04   0.02 ± 0.02   0.03 ± 0.02

Table 4.8: Coarse-grained granularity with optimized hyperparameters on amplified anomalies and 5760 data points.


Fine-Grained Granularity

Method            MI       CI      FP      Recall        Precision     F-Score

PCA               109.3    62.7    47.75   0.35 ± 0.04   0.57 ± 0.03   0.44 ± 0.04
LogCluster        125.05   46.95   395.2   0.27 ± 0.04   0.01 ± 0.01   0.15 ± 0.02
HTMAL             159.4    12.6    82.1    0.07 ± 0.01   0.01 ± 0.03   0.09 ± 0.05
HTMAL S-Entropy   170.3    1.7     254.1   0.01 ± 0.01   0.01 ± 0.01   0.01 ± 0.01

Table 4.9: Fine-grained granularity with optimized hyperparameters on amplified anomalies and 34 560 data points.

4.2.4 HTMAL and HTMAL Shannon Entropy graphs

In this section, figures 4.1 - 4.4 visually display the data and where the HTMAL methods identified anomalies. The red squared symbols represent an identified anomalous event. The teal line represents the number of connections through the reverse proxy in the HTMAL case; in the Shannon entropy case, it represents the Shannon entropy value. A high entropy value corresponds to an even distribution among the represented events, while a low value corresponds to an uneven one.


Figure 4.1 shows that HTMAL Shannon Entropy identifies a lot of anomalies even though only 0.5% of the data points are anomalous, meaning there is an extensive number of false positives in the graph. The Shannon entropy value was frequently fluctuating between high and low values.

Figure 4.1: HTMAL Shannon Entropy showing one hour time slice with coarse-grained granularity.


Looking at the graph in figure 4.2, there were heavy fluctuations, which may increase the complexity of classifying the data. This can be one explanation for why HTMAL Shannon entropy performs poorly.

Figure 4.2: HTMAL Shannon Entropy with four days consecutive data and coarse granularity.


In figure 4.3, one hour of coarse-grained data is displayed. HTMAL flagged three anomalous points; it identifies spatial anomalies but does not identify other anomalies.

Figure 4.3: HTMAL showing one hour time slice with coarse-grained granu- larity.


In figure 4.4, HTMAL classifies 4 days of coarse-grained data. Many false positives were identified around 03-22 09. The graph shows that some high peaks are classified as anomalies, while some that probably should have been classified as anomalies were not.

Figure 4.4: HTMAL with four days consecutive data and coarse granularity.


Discussion

Throughout this project, state-of-the-art anomaly detection methods were applied and evaluated in an active data center. The company in which the work was conducted hoped to improve availability and security by finding abnormal trends in log-based data streams. Therefore the work focused on unsupervised anomaly detection methods. The chosen methods were PCA, LogCluster, and HTMAL because they have the ability of online learning, are fairly computationally light, and can quickly classify new data points. This section discusses the achieved results from chapter 4 and the difference between the two granularity cases. The approach used in this work is discussed in a broader, more general perspective, followed by potential sources of error, and lastly future work.

5.1 Results

The results show that the four methods can be divided into groups of well-performing and badly-performing. HTMAL and HTMAL Shannon Entropy did not achieve adequate results and are not sufficient for anomaly detection in this context. However, HTMAL has shown very good results in similar contexts in various articles. The implementation used in this project is validated through known data sets yet fails to perform in this context. The comparison is not completely fair considering HTMAL takes two features, date/time and a numeric value. The HTMAL implementation used in this project takes date/time and the total number of connections through the reverse proxy. The increase of events is either a quantifier or a random value in the non-amplified anomaly case. This may be too small an increase in activity for HTMAL to recognize an anomaly. However, in the case when the anomalies were amplified


with a factor of 10, HTMAL showed a significant increase in both the coarse and fine-grained case. HTMAL Shannon Entropy is an attempt to resolve the unfairness between HTMAL compared to PCA and LogCluster caused by the differences in the input layer. HTMAL Shannon Entropy uses the same numeric feature vector as PCA and LogCluster, then calculates the distribution among the represented event IDs with Shannon entropy. During the literature study, Shannon entropy was used in multiple anomaly detection articles to calculate the distribution of a numeric vector. Combining HTMAL and Shannon entropy looked promising, but it failed to achieve better F-scores than a random classifier. When looking at the graphs in section 4.2.4, there were heavy fluctuations in the Shannon entropy value; this may create an environment with too many fluctuations, making classification impossible. The implementation of Shannon entropy handles zero occurrences of an event by removing them from the feature vector to avoid zero division, which may increase the fluctuations. There exist many other entropy formulas, such as Renyi and Tsallis entropy, which performed well in [19]; they should be evaluated with HTMAL in the future. It is unclear how best to deal with zero indexes in the data and how they affect the results.

PCA and LogCluster achieved results ranging from 0.15 to 0.56, which is slightly worse than in another report where they were used [3]. However, the results in this report were not far off. There are many factors regarding the data set that can interfere with the results; the data used in this project has a lot of irregularities. For example, during certain hours the activity drops close to zero and at other hours an extensive number of scripts run, which increases the activity a lot. This creates a data set which fluctuates heavily, and against that background the results achieved by PCA and LogCluster are promising.

Speed and computational power

This project is not concerned with computational power and speed, but after observing the run times of the program it is clear PCA is fastest, followed by LogCluster, and last the HTMAL methods. The HTMAL methods build a network of neurons that is computationally heavy, while PCA purely relies on statistical methods.


5.2 Applying the methodology in a broader context

This project's methods were applied to a very specific data set, an Apache access log. The strategy used can be applied to any free-text log-based data under the assumption that the data has reoccurring events which can be parsed into a numeric feature vector. Unfortunately, this implies manual labor because the features have to be chosen manually and properly converted into event IDs. This strategy could be applied directly on top of all data streams to a centralized log management system. The challenge would be to find the right features, and a big drawback is that it would only find anomalies in the selected features. It does not find anomalies that are not actively searched for.

Before this can be used on a larger scale, the methods have to reduce the number of false positives they generate and find more true positives. As of now, the methods fail to detect half of the anomalies, which means they only provide protection against half of them, and an extensive amount of labor classifying the found anomalies has to be done manually.

5.3 Source of Errors

Unsupervised methods do not require labeled data, but to make a comparative evaluation between the methods an assumption was made which may affect the results. The results are based on data that is assumed to represent a normal state with no errors, meaning all data points are initially considered normal. Even though the incident history was looked through and the baseline inspected, some anomalies are likely present in the baseline data set. Such anomalous events in the baseline data are assumed to be normal events, which may cause false positives that decrease the F-score.

The definition of an anomalous event is derived from the average frequency of each event ID. The average value is multiplied by a factor randomly chosen from a range. This provides an amplitude increase, which only gives one side of the perspective; an anomalous event can also be a decrease in activity.

This project has not considered that case.

5.4 Sustainability, Ethics, and Social Aspects

In recent years there has been a huge increase in internet traffic and connected devices. This opens many new attack vectors and increases the load on applications. The need for improved monitoring is critical and I believe companies and governments around the world have to adapt and invest resources into further improving the use of internal and external log messages. A system failure or a hack is devastating both economically and functionally in various contexts.

We are building our societies around technical applications and systems, and the demand for availability has never been higher. It is in everyone's interest that system failures and attacks can be prevented in real time by efficiently detecting suspicious behavior in time. However, this does not come for free; an ethical dilemma may arise about how much information can be monitored without violating citizens' integrity. This can be seen at big companies: what information do companies store about their employees? For example, are the internal chats stored and accessible by the company? Monitoring logs is important, but it is easy to cross the line where employees' and citizens' integrity is violated.

5.5 Future Work

For future work I have identified four different tracks:

• Fill the gap between academia and industry.

• Automatic profiling/feature selection.

• Classifying anomalies.

• Identify real anomalies and apply these methods to them.

There is a large amount of ongoing research in the area of machine learning; this has to be connected to real cases to see how well it solves real-life problems. It would be interesting to see more state-of-the-art anomaly detection methods integrated into industry, such as in large data centers.

The methods evaluated in this report, HTMAL, LogCluster, and PCA, have an input layer far from the real world. An important topic would be to automatically perform feature extraction on a specific log stream. As of now, relevant features have to be selected manually, which is a time-consuming task.

An anomaly is not necessarily a bad thing; it is just a sign that the new data deviates from the expected data. To decrease the amount of manual labor when an anomaly is detected, it would be interesting to classify anomalies as "good" or "bad". This could be significantly improved by having a good set of features, or some kind of classifier that determines how big an impact a given anomaly has.

During this work, PCA, LogCluster, HTMAL, and HTMAL Shannon entropy were applied to artificial anomalies mimicking real incidents. A next step would be to manually identify real incidents and try to detect them, which would eliminate one of the sources of error in this project. It is unclear how artificial anomalies compare to real anomalies, and it would be of great interest to evaluate whether the methods are more suitable for some anomalies than others.


Conclusions

This project has evaluated state-of-the-art anomaly detection methods in an active data center. The task was conducted in an environment with high demands on confidentiality, availability, and integrity. The goal was to evaluate how these methods can be applied to a data center's log streams. The anomaly detection methods PCA, LogCluster, and HTMAL were applied to text-based log data. The data set originated from an Apache web server serving as a reverse proxy for the application layer of company-critical systems. A profile was built on an Apache access log to categorize the log events into event IDs. This could be applied in other contexts in the data center, but the profiling and pre-processing currently have to be done manually. The results showed that PCA achieved the best F-scores, ranging from 0.4 to 0.56 with optimal hyperparameters. LogCluster performed slightly worse, with F-scores ranging from 0.15 to 0.33, while HTMAL and HTMAL Shannon entropy performed very poorly. State-of-the-art anomaly detection methods can be applied in industry, but they require a great deal of manual labor, both for pre-processing and feature extraction and for manually handling alarms, because there are many false positives. There is a trade-off between how many hours of manual labor it is worth to perhaps find a possible intrusion or a future system failure. After all, the best method according to this study, PCA, identified half of all anomalies, which would increase security but also require manual labor.


