
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Datateknik

2021 | LIU-IDA/LITH-EX-A--2021/033--SE

Anomaly detection for automated security log analysis

Comparison of existing techniques and tools

Detektion av anomalier för automatisk analys av säkerhetsloggar

Måns F. Franzén

Nils Tyrén

Supervisor: Le Minh Ha
Examiner: Niklas Carlsson


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Logging security-related events is becoming increasingly important for companies. Log messages can be used for surveillance of a system or to assess the damage caused in the event of, for example, an infringement. Typically, large quantities of log messages are produced, making manual inspection for traces of unwanted activity quite difficult. It is therefore desirable to automate the process of analysing log messages. One way of finding suspicious behavior within log files is to set up rules that trigger alerts when certain log messages fit the criteria. However, this requires prior knowledge about the system and what kind of security issues can be expected, which means that novel attacks will not be detected with this approach. It can also be very difficult to determine what normal and abnormal behavior is. A potential solution to this problem is machine learning and anomaly-based detection. Anomaly detection is the process of finding patterns that do not conform to a defined notion of normal behavior. This thesis examines the process of going from raw log data to finding anomalies. Both existing log analysis tools and our own proof-of-concept implementation are used for the analysis. With the use of labeled log data, our implementation was able to reach a precision of 73.7% and a recall of 100%. The advantages and disadvantages of creating our own implementation as opposed to using an existing tool are presented and discussed along with several insights from the field of anomaly detection for log analysis.


Acknowledgments

We would like to thank all the people at link22 with an extra thanks to Andreas Karström and Erik Boström for the help and encouragement during this thesis. Thanks to Niklas Carlsson for the feedback and useful insights. Finally, we would like to thank our opponents Tim Hellberg and Filip Ström.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Contributions
  1.5 Delimitations

2 Background
  2.1 What is a log?
  2.2 Log structure
    2.2.1 Syslog
    2.2.2 Common event format
  2.3 Data diodes
  2.4 Security operation center
  2.5 Security information and event management
  2.6 Rule based detection
  2.7 Anomaly based detection
  2.8 Machine learning
    2.8.1 Clustering
  2.9 Forensics
  2.10 Real-time analysis vs batch analysis
  2.11 The five steps of log analysis
    2.11.1 Log collection
    2.11.2 Log parsing
    2.11.3 Feature extraction
    2.11.4 Preprocessing
    2.11.5 Anomaly detection
  2.12 Existing log analysis tools
    2.12.1 ELK stack
    2.12.2 Splunk
    2.12.3 Logz.io
    2.12.4 Anomali
    2.12.6 Spark
    2.12.7 ArcSight
    2.12.8 Loggly

3 Related works

4 Method
  4.1 Proof-of-concept
    4.1.1 System overview
    4.1.2 Dataset
    4.1.3 Log parsing
    4.1.4 Feature extraction
    4.1.5 Preprocessing
    4.1.6 Anomaly detection
    4.1.7 Evaluation
  4.2 Log analysis tools
    4.2.1 ELK
    4.2.2 Splunk
    4.2.3 Logz.io
    4.2.4 Remaining tools

5 Results
  5.1 Performance of PoC
    5.1.1 Clusters
    5.1.2 Baseline
    5.1.3 Normalized
    5.1.4 Standardized
    5.1.5 PCA
  5.2 Log analysis tools
    5.2.1 Splunk
    5.2.2 ELK
    5.2.3 Summary of log analysis tools

6 Discussion
  6.1 Results
    6.1.1 PoC implementation
    6.1.2 Log analysis tools
    6.1.3 Log parsing
    6.1.4 Feature extraction
    6.1.5 Anomaly detection
  6.2 Method
    6.2.1 PoC implementation
    6.2.2 Log analysis tools
    6.2.3 Delimitations
  6.3 Source criticism
  6.4 The work in a wider context

7 Conclusion
  7.1 Future work


List of Figures

2.1 Categories of anomalies
2.2 Example graph from elbow method
2.3 Log parsing
2.4 Example of parsing log with Grok in Logstash
4.1 Overview of PoC-implementation
4.2 Evaluation matrix
4.3 Query outlier detection ELK
5.1 Outliers compared to normal data instances


List of Tables

5.1 Baseline for 36 event templates
5.2 Baseline for 156 event templates
5.3 K-means and outlier detection with normalization and 36 event templates
5.4 K-means and outlier detection with normalization and 156 event templates
5.5 K-means and outlier detection with standardization and 36 event templates
5.6 K-means and outlier detection with standardization and 156 event templates
5.7 Results from K-means and outlier detection with PCA and 36 event templates
5.8 Results from K-means and outlier detection with PCA and 156 event templates
5.9 Time executing K-means with different dimensions

1 Introduction

With the constant digitization of society and the increasing number of internet users, IT-security is now more relevant than ever. More sensitive information is stored on the internet and cyber attacks are becoming increasingly sophisticated.

In many cases, unwanted security related activity can go unnoticed since it may not affect the functionality of the software. With security breaches it can also be difficult to determine the extent of the damage caused. This means that one of the first steps in securing a system is to have sufficient detection of unwanted behaviour and to be able to assess the potential damage if a breach has occurred. This is where log analysis becomes particularly useful [10]. The ability to analyze logs is important in many software systems to establish correct behavior and notice potential intrusions, attacks or other unwanted activity. As computer systems increase in size, so do the amount and variety of security logs. Not only is manual log analysis inefficient, but it also depends on the performer's knowledge. Well created and implemented security logs in a system may be useless if the logs are analysed without the correct expertise or, even worse, not analyzed at all. This may become a significant problem since a shortage of expertise is expected in Sweden by 2022 [27]. Therefore it is important to automate log analysis to improve and maintain system security. Currently there exist several techniques for automating log analysis. Some of these techniques include anomaly based detection, rule based detection and data mining [9, 59]. Since security logs will contain lots of varying information in different formats, it may also be necessary to structure the logs before being able to apply data mining methods for automating analysis [23, 64]. Several papers that evaluate and summarize log analysis techniques already exist. A paper by Svacina et al. [60] summarizes recent trends in the field of automatic log analysis techniques. However, there are no recent papers comparing techniques for anomaly detection in log analysis that also compare the tools that implement these techniques. In this thesis an external analysis is performed to compare different techniques for anomaly detection in automatic log analysis and the tools that could be used to implement them. A proof-of-concept (PoC) implementation will be created based on a suitable technique to prove the concept of anomaly detection in the area of log analysis.

1.1 Motivation

This project entails collecting and analyzing different methods, techniques and tools to perform automatic log analysis, and an implementation in a PoC format. It is done in collaboration with link22, a company that specializes in IT-security and helps different agencies protect their sensitive information. They offer different types of products that can be used depending on the needs of the client.

The clients of link22 have particularly high demands on security. This means that logging of security related events is crucial. Logging is not only crucial in products developed by link22 but also in the environment where the products are developed. The logs produced in the office system of link22 are extensive and only some are relevant when searching for unwanted activity. This makes manual inspection difficult. Therefore it is desirable to automate the analysis of the security logs and identify components that need to be added to make this possible in an existing system. Automated log analysis has the potential to detect previously unknown attacks and security risks while simultaneously saving time by reducing manual workload.

1.2 Aim

The aim of this thesis is to investigate current techniques and tools and compare how anomaly detection can be performed in the field of automatic log analysis. This thesis examines the necessary steps to go from raw log data to classifying anomalies or outliers. A PoC of anomaly detection is made and an external analysis of existing tools is performed. In other words, the aim of this thesis is not to develop a new general algorithm, technique or tool that will improve or solve problems regarding automated log analysis. Instead this thesis aims to find up-to-date solutions that can be used to improve log analysis in an existing system.

1.3 Research questions

To compare techniques and tools, the following questions are addressed in this thesis:

1. What are the common techniques and tools for automatic security log analysis?
2. In what ways can anomaly detection optimize log analysis?
3. What is the general process of going from raw log data to finding anomalies?
4. In which existing log analysis tool is this process best utilized?

1.4 Contributions

One of the main contributions of this thesis is the identification of the general process of finding anomalies from raw log data. The different steps are outlined and explained. This process is utilized in a PoC implementation where the results are evaluated and presented. The impact of changing different parameters and preprocessing of data is also examined and presented. The other major contribution is a comparison between existing log analysis tools and an explanation of how anomaly detection is utilized in these. From the PoC implementation and the comparison of existing tools, valuable insights have been produced and are presented in this thesis.

1.5 Delimitations

• Unable to use real logs from link22 - Since the logs from the office of link22 may contain sensitive information, it is not possible to use these logs when testing the PoC implementation or existing tools. Public log data has to be used, where the log messages are often structured differently.

• Techniques used by current log analysis tools are often not shared publicly - When investigating different log analysis tools it is interesting to know what techniques are used. This can be important when making comparisons since it allows for a deeper understanding of when certain tools are suitable.

• Difficult to know what functionality is free - To understand and compare log analysis tools it can be helpful to install some of them and try them with actual log data. However, before determining if installation is appropriate it would be helpful to know what functionality is available without having to pay. If a tool is not open-source or no free trial is available, the tool may not be installed and tested.

• Not feasible to test all functionalities - Some analysis tools have many possibilities when it comes to anomaly and outlier detection. Not all functionality can be tested due to time constraints. Tools may also offer other security related functionalities which may be as important as log analysis, but these functionalities would be too time consuming to examine.

• Limited documentation - In some cases, when investigating existing tools, the documentation available is limited. It can be difficult to get an understanding of what techniques are used and what functionality is actually available. It could be the case that many of these tools do not want to give competitors an insight into the underlying functionality.

2 Background

In this chapter, relevant background is presented in order to make the rest of the thesis understandable.

2.1 What is a log?

A log message is something that is generated by a system or device to inform that some kind of event has occurred. A log mainly contains a timestamp, source and data. The data contains information such as username, source, destination IP-address and port numbers. An example of a log message can be seen below.

2018-02-09 13:37:44 sshd[16332]: Accepted password for hyab from 10.3.10.46 port 1281 ssh2

According to Chuvakin et al. [9], a criterion for good logging is that the logs should contain enough information to answer the following questions:

• What happened?
• When did it occur?
• Where did it happen?
• Who was involved?
• Where did it come from?

Logging and log messages can be used for different purposes, such as application debug logging, operational logging and security logging [9]. In this thesis, security logging is in focus. By analyzing security logs it is possible to detect and respond to security related events such as attacks. A system can, however, produce thousands of log messages each hour, making it almost impossible to perform manual log analysis. Therefore it is desirable to use automated log analysis.

2.2 Log structure

The logs generated by devices can contain a lot of information in various formats. Due to the various formats and, in some cases, lack of important information, it is more difficult to parse and analyse the information from the logs. Logs that have a common structure are much easier to parse, extract features from and analyze. What parsing, feature extraction and analysis are will be explained in later sections. There are some standards regarding log structure and what information a log message should contain. Two of these standards are Syslog and the Common event format.

2.2.1 Syslog

Several Unix applications and the Unix kernel use Syslog to log messages. It is the most common way of recording application events on UNIX-based systems [9]. Many different software applications use Syslog and each message is labeled with a specific code to indicate what software it originated from. UDP can be used to send Syslog messages, and with more modern versions of Syslog, TCP can also be used.

2.2.2 Common event format

To make the integration between logs from different devices more efficient and to simplify log management, the Common event format (CEF) can be used [4]. CEF standardizes the format of the collected data that can later be analyzed. CEF uses Syslog for log message networking. When using Syslog, a prefix containing date and host is added to the CEF message. The CEF message includes a header and an extension. A CEF message using Syslog looks as follows:

Feb 9 13:37:01 host CEF:Version|Device Vendor|Device Product|Device Version|Device Event Class ID|Name|Severity|[Extension]

In Micro Focus Security ArcSight Common Event Format [4] the header fields are defined as follows:

Version is an integer and identifies the version of the CEF format. Event consumers use this information to determine what the following fields represent.

Device Vendor, Device Product and Device Version are strings that uniquely identify the type of sending device. No two products may use the same device-vendor and device-product pair. There is no central authority managing these pairs. Event producers must ensure that they assign unique name pairs.

Device Event Class ID is a unique identifier per event-type. This can be a string or an integer. Device Event Class ID identifies the type of event reported. In the intrusion detection system (IDS) world, each signature or rule that detects certain activity has a unique Device Event Class ID assigned. This is a requirement for other types of devices as well, and helps correlation engines process the events. Also known as Signature ID.

Name is a string representing a human-readable and understandable description of the event. The event name should not contain information that is specifically mentioned in other fields. For example: "Port scan from 10.0.0.1 targeting 20.1.1.1" is not a good event name. It should be: "Portscan". The other information is redundant and can be picked up from the other fields.

Severity is a string or integer and reflects the importance of the event. The valid string values are Unknown, Low, Medium, High, and Very-High. The valid integer values are 0-3=Low, 4-6=Medium, 7-8=High, and 9-10=Very-High.


The Extension consists of additional fields that are not mandatory. These fields are set as key-value pairs, and all the extensions and definitions can be found in the same documentation as the header definitions.
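To make the header layout concrete, the following minimal sketch in Python splits a CEF message carried over Syslog into its header fields. It is only an illustration of the format described above, not a complete CEF parser; the sample vendor, product and extension values are made up, and escaping of "|" characters inside fields is ignored.

# Minimal sketch: splitting the CEF header of a Syslog-carried message into fields.
# Illustrative only; a real parser must also handle escaped "|" and "=" characters.
def parse_cef(line: str) -> dict:
    _, _, cef = line.partition("CEF:")          # strip the Syslog prefix
    parts = cef.split("|", 7)                   # 7 header fields + optional extension
    keys = ["Version", "Device Vendor", "Device Product", "Device Version",
            "Device Event Class ID", "Name", "Severity", "Extension"]
    return dict(zip(keys, parts))

example = ("Feb 9 13:37:01 host CEF:0|ExampleVendor|ExampleProduct|1.0|100|"
           "Portscan|5|src=10.0.0.1 dst=20.1.1.1")
print(parse_cef(example))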

2.3 Data diodes

Data diodes are systems that are designed to only allow one-way communication. The most secure versions of data diodes are hardware enforced, with one sending unit and one receiving unit. The sender unit is physically incapable of receiving data and the receiving unit is physically incapable of transmitting data. This makes it impossible for hackers to influence the direction of data traffic with software [42]. One of the challenges with hardware enforced one-way communication is that several transmission protocols rely on "handshaking", meaning that acknowledgement packets are sent back to the sender to ensure that the data has reached the receiver. This means that data diode systems have to rely on protocols that do not require handshaking, such as UDP. However, these types of protocols are considered to be less reliable due to the lack of feedback to the sender. To improve the reliability, techniques such as checksums and redundant transmissions can be used. Data diodes can be used for safe storage of sensitive information such as security related logs. However, some tools use cloud services that may require two-way communication, possibly making them difficult to use if data diodes are implemented. Other techniques that rely on two-way communication could also be difficult to implement.

2.4 Security operation center

A security operation center (SOC) is a group of cyber security professionals in an organization responsible for monitoring, analyzing, and protecting against cyber attacks [40]. Log management and log analysis are a central part of the SOC.

2.5 Security information and event management

Security information and event management (SIEM) tools make it possible to analyze, visualize and report security events in real time [9]. Most of the tools analyzed in this thesis fall into this category. While SOC is about the people and the process of maintaining security, SIEM is about the technology used by the SOC.

2.6 Rule based detection

The idea of rule based detection is to set up rules that can detect known attack or threat patterns. This means that the person setting up the rules must have a good understanding of the system to know what kind of attacks could be expected. Therefore, it is also known as knowledge-based detection. Any novel attacks will not be detected or identified by using this strategy [60, 33].

Rule correlation is the idea of finding similar or dissimilar events and from this getting a better understanding that something larger and more complex is happening. This is accomplished by creating rules that can detect known vulnerable behaviors or patterns [60]. For example, if "this" happens and then "that" happens, the system may be under attack. These events may by themselves not be interesting, but together they can show that an event of interest is occurring.

2.7 Anomaly based detection

Anomaly based detection is the process of finding patterns that do not conform to a defined notion of normal behavior [9, 29]. For example, malicious user behaviour is often behaviour by a user that does not fit the normal behaviour of the same user or other users. Anomaly detection can be applied to different domains with different notions of anomaly and data. There are different existing techniques and challenges depending on what domain you want to perform anomaly detection in [8]. This thesis focuses on security related domains.

According to Chandola et al. [8] anomalies can be divided into the three following categories.

Point anomalies are events that are considered abnormal compared to the whole set of data. Examples of point anomalies can be seen in Figure 2.1 (a).

Contextual anomalies are events that are considered abnormal compared to their context but not otherwise. The context can be divided into contextual attributes and behavioral attributes. A contextual attribute could for example be the position in time-series data, and a behavioral attribute could for example be the size of a packet that has been transmitted. An example of a contextual anomaly in a time series can be seen in Figure 2.1 (b). The anomaly is not an anomaly because of its y-value but is an anomaly with respect to the time context.

Collective anomalies are collections of related events that are abnormal compared to the whole set of data. This is similar to rule correlation explained in Section 2.6. A single event may by itself not be an anomaly, but together the events create an abnormal behaviour, for example a distributed denial of service attack. Examples of collective anomalies can be seen in Figure 2.1 (c). These are anomalies since the same value occurs over a longer period of time. Each single point is not an anomaly, but together they become one. Point and collective anomalies can also be seen as contextual anomalies if they are analyzed in a context.

Anomaly based detection makes it possible to detect previously unknown threats and attacks, as opposed to rule-based approaches where the analyst has to set up rules to detect known behaviors. It can be used in log analysis when not knowing what to look for or when normal behaviour is hard to define. The idea is, as earlier mentioned, to compare new behavior with normal behavior and in that way detect anomalies.

Outlier detection is an area within anomaly detection where outliers are values that deviate from normal values. Metrics such as euclidean distance can be used, where outliers are further away from their neighbours than normal data points. A problem that could occur with anomaly based detection is a high number of false positives (false alarms). It is therefore recommended to use both anomaly based detection and rule-based detection to make the analysis more powerful [9, 29]. For anomaly based detection, machine learning is mostly used as the core component. Therefore, anomaly based detection in this thesis refers to the use of machine learning for identifying anomalies. In a paper by Liu et al. a framework for anomaly detection is proposed using a clustering algorithm known as K-prototyping, which will be discussed further in this thesis. There exist other alternatives to clustering for anomaly detection, such as deep learning. In a paper by Du et al. [63] anomaly detection is performed with DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM). Other techniques for anomaly detection will also be discussed.

There are some limitations when it comes to using anomaly detection algorithms for log analysis. One of these limitations is that most anomaly detection algorithms are designed to be used after an event has occurred, meaning that they do not analyse logs in real-time. Another disadvantage is that machine learning algorithms tend to be more resource-intensive and take longer compared to traditional rule-based approaches [60].

2.8 Machine learning

Due to the recent trends around machine learning, one may think that the use of machine learning in software security is a new topic. This is however not the case. For example, Denning [11] proposed an intrusion detection model as early as 1987.

It is important to be reminded of the fact that machine learning can not only be used by "the good side". Even attackers can use machine learning to improve and automate their attacks.


Figure 2.1: Categories of anomalies. (a) Point anomalies, (b) Contextual anomaly, (c) Collective anomalies.

The machine learning algorithms themselves can also contain security vulnerabilities that attackers can exploit. Humans must still be involved to maintain security. A machine learning algorithm will not make all the threats go away, but it can be a very useful resource [32]. There are different types of machine learning and some of these will be explained in the following paragraphs.

Supervised machine learning means that the model is trained with labeled data. In other words, there is an input with a correct output. The model can learn from its mistakes by comparing its output with the correct output and from this change its behaviour [9]. Examples of supervised learning are classification and regression. Note that the training data has to be labeled, which can be a very time consuming and difficult task since it may have to be done manually. When used for anomaly detection, the labeled data must contain instances of both the normal and abnormal (anomaly) classes.

Unsupervised machine learning means that the model is not trained with labeled data. This means that there is no correct or incorrect output; instead the goal is to find relationships and structures within a dataset [9]. Since it does not require labeled data, unsupervised machine learning is more applicable in real-world production [8]. An example of unsupervised machine learning is clustering. What clustering is and different types of clustering algorithms will be explained at the end of this section. There are also semi-supervised and reinforcement learning. These types are however not commonly used in anomaly detection and will therefore not be explained further.

2.8.1 Clustering

Clustering is a machine learning technique commonly used in log analysis. Detecting unwanted behavior by analysing log messages individually can be difficult. Therefore, log messages can be grouped together using clustering and the clusters can be compared with each other [60].

According to a paper by Landauer et al. [31] there are two different ways of clustering log messages. Log messages can be analyzed individually and clustered by the similarity of their data or message structure, which gives an overview of the events that occur in the system. This is referred to as static clustering. Another clustering method is called dynamic clustering, where sequences of log messages are clustered, which can contribute to the understanding of the underlying logic of the program. Examples of dynamic clustering include using windows for feature extraction, such as fixed windows or session windows, where sequences of log messages are clustered. This is discussed in depth in Section 2.11. There are several categories of anomalies that are detectable with clustering, where some are more related to security than others. Collective anomalies, for example, can indicate attacks that execute a series of events. Dynamic clustering is able to detect a substantially larger amount of these anomalies compared to static clustering, which mainly detects point anomalies, which can be described as single log messages that do not fit into any of the existing cluster templates.

When it comes to identifying normal clusters and anomaly candidate clusters, density of the datasets is often used [35]. Normal clusters tend to have a higher density since normal system behavior is usually more common than anomalies. The anomaly candidate clusters tend to be more sparse since anomalies are typically less common. This can be used to create a threshold that excludes the high density normal clusters so the focus can be put on the anomaly candidates.

There exist several other techniques for clustering. Neural networks, for example, can be used for pattern recognition and classification of events using natural language processing [31]. Longest common subsequence is another technique, which measures similarity between sequences of log messages or single logs.

K-means

K-means is an unsupervised machine learning algorithm that takes numerical data and creates clusters based on euclidean distance [26]. The algorithm is commonly used and is efficient for large datasets. One drawback with this algorithm is that it only takes numerical data. That means that the feature extraction, when it comes to log analysis, will have to return only numeric values. K-means forces every instance to belong to a cluster, meaning that anomalies can be assigned to a larger cluster. This means that not only sparse clusters can contain anomaly candidates. Detecting anomalies is done by assuming that normal data instances are close to their closest cluster centroid, while anomalies have a greater distance to their closest cluster centroid [8]. These kinds of anomalies, which differ from other data instances by numerical values such as distance, will be referred to as outliers in this thesis, as mentioned in Section 2.7.

The number of clusters that K-means will divide the dataset into is specified by the user. One way of determining the optimum number of clusters is to use the elbow method [41]. This method tests the K-means algorithm with different numbers of clusters, for example 1-10. For each number of clusters the Sum of Squared Errors (SSE) is calculated. SSE is the sum of the average euclidean distance from each instance to its closest centroid. As the number of clusters increases, the SSE decreases and will be zero when there are as many clusters as there are data instances. The results can be plotted as a graph, as can be seen in Figure 2.2. The optimum number of clusters is where the graph goes from steep to flat, an elbow. In the example elbow graph in Figure 2.2 the elbow is at four clusters.
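As a minimal sketch of the elbow method, assuming scikit-learn and some numeric feature matrix X (the random data below is only a placeholder for an event count matrix), the SSE reported by K-means as inertia_ can be computed for a range of cluster counts and inspected for an elbow:

# Sketch: elbow method for choosing the number of clusters (assumes scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # placeholder for an event count matrix

sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_             # sum of squared distances to the closest centroid
for k, value in sse.items():
    print(k, round(value, 1))
# Plotting sse against k and looking for where the curve flattens gives the elbow.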

K-prototyping

K-prototype is an unsupervised machine learning algorithm for clustering large sets of data. One of the main advantages of this algorithm is that it works with datasets that contain mixed attributes, meaning both numerical and categorical data [35]. This is particularly useful when it comes to log analysis since log messages often contain a mix of numerical and categorical data. K-prototype can be seen as an extension of the more well known clustering algorithm K-means.


Figure 2.2: Example graph from elbow method

K-prototype measures similarity between different records and clusters them accordingly. When it comes to numerical attributes the similarity measure is the squared Euclidean distance. The similarity measure for categorical attributes is the number of mismatches between objects and cluster prototypes [26].
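A hedged sketch of K-prototype clustering on mixed data is shown below. It assumes the third-party kmodes package, and the toy columns (an event count, a username and an action) are chosen purely for illustration:

# Sketch: K-prototypes on mixed numerical/categorical data (assumes the "kmodes" package).
import numpy as np
from kmodes.kprototypes import KPrototypes

# Toy records: [event_count, username, action]; column meanings are illustrative only.
X = np.array([
    [12, "alice", "login"],
    [15, "alice", "login"],
    [11, "bob", "login"],
    [200, "mallory", "delete"],      # intended to stand out
], dtype=object)
X[:, 0] = X[:, 0].astype(float)      # the numerical column must be numeric

kp = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kp.fit_predict(X, categorical=[1, 2])   # columns 1 and 2 are categorical
print(labels)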

2.9 Forensics

Digital forensics, or IT-forensics, can be defined as the process of collecting digital evidence from different sources in order to extract facts that could be related to unwanted activities such as attacks or security breaches [34]. It is done after an incident has occurred to recreate the events that led up to it and assess the potential damage. Log analysis is particularly useful in forensics since log files contain large amounts of information and are usually stored for a relatively long period of time. Once a log message is created it is not altered as long as the system behaves normally. This means that logs can be seen as a permanent record and can be an important complement to other information from the system that may be more likely to be altered or corrupted [9]. Logs contain time stamps, which makes it possible to reconstruct a sequence of events in chronological order, showing not only what happened but also in what order. Similar to real-time detection, the collection and analysis of security related logs for forensic evidence is highly time consuming and would benefit greatly from automation. Log analysis is only one part of the wide area of forensic analysis. Other parts could be examination of file systems and memory dumps [47]. Commonly used tools for forensic analysis are EnCase [20] and Sleuth Kit [50]. Since this study has a focus on log analysis, tools like EnCase and Sleuth Kit will not be considered. How logs can be used in the area of forensics is still relevant.

2.10 Real-time analysis vs batch analysis

Log analysis is not just about how, but also about when. When it comes to log analysis there are two main approaches: real-time analysis or batch analysis. Real-time means that logs are processed and analyzed in a streaming fashion, as they arrive at the log management system. Batch processing and analysing means that a log file or a batch of logs is processed and analyzed at the same time [64]. These two approaches are described further under Log parsing in Section 2.11.

2.11 The five steps of log analysis

From the research conducted in this thesis there seems to be a recurring series of steps that need to be performed in order to construct a complete log analysis system.

2.11.1 Log collection

Logs are routinely generated by systems. The message that a log contains can be used for different purposes and one of them is anomaly detection. Therefore log collection is the first step in detecting anomalies in logs. Collecting and storing logs is a central part of log analysis. It is however not in the general interest of this paper to examine collection and storage of logs.

2.11.2 Log parsing

To enable log analysis based on techniques such as machine learning, logs need to be transformed into structured events with the same type of fields in each log message [64]. Some parts of a log message can be fairly easy to extract, for example the timestamp, which is usually the first header in a log message. Other parts, such as free-text, can be more challenging to extract. The free-text message often contains a lot of information in various formats. These messages are often a combination of constant parts and variable values. For example, by parsing log messages, different event templates can be created based on the constant parts, and the variable values are passed along with the template. Figure 2.3 illustrates how a log message can be parsed.

Parsers are often divided into offline or online parsers. Offline means that the logs are parsed as a batch, in other words the logs are first collected and then the whole batch is parsed. Online means that logs are parsed in real time as they are collected, in a streaming fashion [13].

There exist different techniques and tools for log parsing. Some of these techniques include rule based, source code based, and data driven parsing [64]. A common approach when it comes to rule-based parsing is to use regular expressions (Regex). Using Regex does not work for general purposes since each type of log message requires a unique regular expression and domain-specific expert knowledge [13]. Source code parsing works by creating event templates from the source code of the program responsible for logging and then matching log messages to these templates to recover their structure [63]. This technique requires access to the source code, which can become complex when there are multiple systems to collect logs from. Data driven parsing works by creating templates from the log data itself. The templates will depend entirely on the structure of the logs. The advantage of data driven parsing is that it is more general and can be used for all log structures. It also does not require access to source code. For example, the parsed log in Figure 2.3 is parsed with a data-driven parser. Many tools for log parsing already exist, where some are open-source and deployed in production and some are proposed in different literature studies. Zhu et al. [64] made an evaluation of 13 data-driven log parsers to guide the usage and deployment of automated log parsers and further research. They also provide an open-source toolkit that can be used for testing 13 different log parsing methods. One of these is, for example, Spell: Streaming Parsing of System Event Logs presented by Du and Li [13].
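As a minimal illustration of rule-based parsing, assuming Python's re module and the example log line from Figure 2.3 (the pattern itself is an assumption and would have to be written per log type, which is exactly the drawback mentioned above):

# Sketch: rule-based parsing of the example SSH log line with a regular expression.
import re

LINE = ("2018-02-09 13:37:44 sshd[16332]: Accepted password for hyab "
        "from 10.3.10.46 port 1281 ssh2")

PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<component>\S+): (?P<status>\w+) password for (?P<user>\S+) "
    r"from (?P<ip>\S+) port (?P<port>\d+) ssh2"
)

match = PATTERN.match(LINE)
if match:
    fields = match.groupdict()
    # Constant parts form the event template; variable parts are extracted separately.
    template = "<*> password for <*> from <*> port <*> ssh2"
    variables = [fields["status"], fields["user"], fields["ip"], fields["port"]]
    print(template, variables)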

2.11.3 Feature extraction

In order to run anomaly detection algorithms on the log data, features that will be fed into the algorithm need to be extracted. This step can be done in a variety of ways and the best approach depends on the anomaly detection algorithm, the log dataset and what kind of anomalies need to be detected. In a paper by Liu et al. [35] twelve features are extracted, where most of them relate to user behavior. The combined features aim to answer the following four questions: who, where, when and what. In this case the data is divided into user sessions and relevant data is extracted from each session. One example of a feature is the login username. This is categorical data and not all algorithms can accept categorical data as input. In other cases one may use a time window to extract features. In a paper by He et al. [24] three different types of windows are used. These include fixed window, sliding window and session window. In the case of fixed windows and sliding windows, the occurrences of certain log messages inside each time window can be counted and a matrix can be created where each row contains information about the occurrences of a specific type of log message in that particular time window. This matrix will contain strictly numerical data and can be used as input data to certain clustering algorithms.

2018-02-09 13:37:44 sshd[16332]: Accepted password for hyab from 10.3.10.46 port 1281 ssh2

Date: 2018-02-09
Time: 13:37:44
Component: sshd[16332]:
Event template: <*> password for <*> from <*> port <*> ssh2
Variable values: ['Accepted', 'hyab', '10.3.10.46', '1281']

Figure 2.3: Log parsing

Fixed window

A fixed window uses the timestamps of log messages and classifies logs that occur in the same time window as a sequence of logs. In this case the time difference is constant throughout the entire execution of the feature extraction. The number of log messages in one window is recorded and put into an event count vector. The concept of an event count vector is explained more in depth in Section 4.1.4. One of the disadvantages of using a fixed window is that some log messages that appear in one window may actually be related to logs in another window. For example, two logs that together create an anomaly may be put in different time windows and in that way bypass detection. This can reduce the accuracy of anomaly detection [24].
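A minimal sketch of fixed-window feature extraction, assuming pandas and a parsed log with a timestamp and an event template id per message (the sample values are placeholders), could look as follows; the resulting matrix of event count vectors is what an algorithm such as K-means can take as input:

# Sketch: building an event count matrix with a fixed one-hour window (assumes pandas).
import pandas as pd

parsed_logs = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-02-09 13:37:44", "2018-02-09 13:41:02",
        "2018-02-09 14:05:13", "2018-02-09 14:59:59",
    ]),
    "event_template": ["E1", "E2", "E1", "E1"],   # template ids from the parsing step
})

# Count occurrences of each event template per one-hour window.
counts = (
    parsed_logs
    .groupby([pd.Grouper(key="timestamp", freq="1H"), "event_template"])
    .size()
    .unstack(fill_value=0)
)
print(counts)   # rows = time windows, columns = event templates (event count vectors)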

Session window

Instead of using time to determine log sequences, sessions can be used. A session window uses identifiers in the log messages to separate sequences of log messages [24]. There are a variety of identifiers that could be used when creating a session window. One example could be a user session. That means that every action taken by a specific user is put into the same session window. This method would find anomalies related to user behavior. For example, if a user performs a certain action an unusual amount of times this can be seen as an anomaly. That user session would then be labeled as anomalous and further investigation is advised.

2.11.4 Preprocessing

To make machine learning models perform better and make the data easier to work with, the features can be refactored. Three common ways to do this are standardizing, normalizing and changing the dimensions of the features.

Standardizing features

Machine learning models may perform poorly if the features are not approximately normally distributed; therefore standardization may be important for constructing an accurate model [49].


Normal distribution here means that the mean is zero and the standard deviation is one. If a feature is much larger than other features, this feature might dominate and the algorithm may be unable to interpret the other features in a correct way. A feature is therefore standardized compared to how many times this feature occurs in other event vectors. For example, a user may perform one action 200 times in a time period and another action only once in the same time period. The action that is performed 200 times will then dominate the event vector. The action that was only performed once could, however, be the abnormal behavior of that user and not the action that was performed 200 times. Each action in an event vector is therefore standardized compared to how many times this action is performed by other users. The standardization of a feature is calculated as follows:

z = \frac{x - \rho}{\sigma},

where x is the feature value with mean

\rho = \frac{1}{N} \sum_{i=1}^{N} x_i

and standard deviation

\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \rho)^2},

where N is the number of feature values.
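A minimal sketch of this standardization, assuming scikit-learn and a small event count matrix where rows are event count vectors and columns are actions (the numbers are illustrative):

# Sketch: z-score standardization of event count vectors (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[200.0, 1.0],
              [180.0, 0.0],
              [190.0, 0.0]])               # rows = users/windows, columns = actions

X_std = StandardScaler().fit_transform(X)  # per column: subtract the mean, divide by the std
print(X_std)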

Normalizing features

The process of normalizing data can be useful in many contexts of data mining and can sometimes have different meanings depending on the context. In this thesis, normalization means taking individual samples of data in the form of vectors and scaling them to unit form. This means that the length of the vector is 1, which can usually be achieved by dividing each element in the vector by the euclidean length of the vector. The normalization of a feature vector can be performed with the following formula:

x' = \frac{x}{||x||},

where x is the original vector and ||x|| is the euclidean length of the vector.

The euclidean length of the vector is calculated as the square root of the sum of each element squared:

||x|| = \sqrt{x_0^2 + x_1^2 + \dots + x_{n-1}^2 + x_n^2}.

Normalization gives the attributes in the dataset equal weights, which can decrease redundancy and prevent "noisy" samples from taking over. K-means clusters data by measuring the euclidean distance to cluster centers. This technique can be highly affected by irregularities in the dataset, which makes normalization effective [62].
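A corresponding sketch of scaling each event count vector to unit euclidean length, again assuming scikit-learn:

# Sketch: scaling each sample (row) to unit euclidean length (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

X_norm = normalize(X, norm="l2")   # each row is divided by its euclidean length
print(X_norm)                      # [[0.6, 0.8], [1.0, 0.0]]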

Principal component analysis

When parsing large log files, the number of features or templates can quickly grow, resulting in event vectors with potentially hundreds of dimensions. In order to reduce the number of dimensions and make the dataset more manageable, Principal component analysis (PCA) can be used. PCA is a technique often used in data analysis to reduce the number of dimensions without losing any important information. This is done by projecting each data point onto principal components, which can be described as direction vectors [1]. PCA is often applied to a dataset before running machine learning algorithms. For example, Arlitt et al. [6] used PCA together with clustering to reduce dimensions and enable visualization of results and relationships in the data.
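A minimal sketch of reducing the dimensionality of an event count matrix with PCA before clustering, assuming scikit-learn (the matrix shape and the number of components are illustrative; 156 templates matches the larger template set used later in the thesis):

# Sketch: reducing event count vectors to a few principal components (assumes scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(500, 156))   # e.g. 500 windows x 156 event templates

pca = PCA(n_components=10)                  # keep the 10 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, round(pca.explained_variance_ratio_.sum(), 3))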

2.11.5 Anomaly detection

In this step, when the data is parsed and features are extracted, anomaly detection can be performed. As mentioned in Section 2.7 this is the process of comparing new behavior with normal behavior to find potential outliers or anomalies.
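As a hedged sketch of this final step with the K-means approach from Section 2.8.1 (assuming scikit-learn and an already preprocessed feature matrix; the threshold choice is an assumption), instances far from their closest cluster centroid can be flagged as outlier candidates:

# Sketch: flagging outlier candidates by their distance to the closest K-means centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                 # placeholder for preprocessed feature vectors
X[:3] += 8.0                                   # a few artificial outliers

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
# Distance from each instance to the centroid of its assigned cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = dist.mean() + 3 * dist.std()       # illustrative threshold choice
outliers = np.where(dist > threshold)[0]
print(outliers)                                # indices of flagged instances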

2.12 Existing log analysis tools

In this section, the existing tools that are examined in this thesis are outlined and described in terms of how functionality such as parsing and anomaly detection is performed.

2.12.1 ELK stack

The ELK stack is a collection of open-source components: Elasticsearch, Logstash, and Kibana [18]. Elasticsearch is used to store, search and analyze data, Logstash feeds and processes data into Elasticsearch, and Kibana is a visualization tool. Together they make it possible to use data in different formats and to search, analyze, and visualize it in real time.

Elasticsearch

This section gives an overall picture of Elasticsearch. Elasticsearch is a free open-source search and analytics engine for various types of data, meaning that it can be used for more than log data analytics. Analytics of log data, and especially anomalies and outliers, is the focus when using Elasticsearch in this thesis.

With ELK it is possible to perform rule based detection [19]. These could be rules that match specific events or parts of events such as IP addresses, event correlation rules or threshold rules, as described in Section 2.6. It is also possible to use machine learning for analyzing data to find behavior patterns. ELK offers both supervised and unsupervised machine learning techniques [17]. The supervised machine learning performs two types of analysis that both require training sets: classification and regression. With unsupervised machine learning, anomaly detection and outlier detection can be performed. No training data is necessary for these techniques; anomaly detection requires time series data while outlier detection does not. The anomaly detection uses a mix of techniques such as clustering, various types of time series decomposition, Bayesian distribution modeling, and correlation analysis [17].

Besides detecting anomalies and signaling that there is a problem, Elasticsearch can also analyze the problem to identify other properties of the data, such as related users and machines, and thereby find the root cause of the problem.

To make use of all the capabilities of ELK, such as machine learning, the X-Pack extension is needed [14]. To make use of all the features of ELK a subscription is needed, and it is also possible to consume ELK via its cloud services.

Logstash

This section gives an overall picture of Logstash. Logstash is an open-source data collection engine, meaning that it is the central data flow component in the ELK stack [16]. Logstash supports various inputs, for example logs in different formats such as Syslog. Logs can be ingested into Logstash by using Filebeat, which collects logs on a server and forwards them to Logstash. As logs go through Logstash it is possible to process the data. This is where the data is prepared for further analysis in Elasticsearch. It is for example possible to structure data into a common format and identify named fields. This is done with so-called filter plugins. One of the most useful plugins is Grok. Grok matches user specified patterns in log messages, extracts the desired parts of information and saves them as key-value pairs. There exist predefined Grok patterns that can be used; otherwise it is possible to create custom Grok patterns. The better structured the log messages are, for example by using Syslog or CEF, the easier the parsing in Logstash can be performed.


Figure 2.4: Example of parsing log with Grok in Logstash

How a log can be parsed can be seen in Figure 2.4. Using Grok is fairly straightforward; it can however be quite time consuming. This is due to the variety of log messages a system or systems can produce. Every time the logging code changes, the parsing rules must be manually updated [64]. As previously mentioned, using a common format for log messages makes it possible to reuse Grok patterns for logs from different sources. Once the data is processed it can be forwarded to different output sources, in this case Elasticsearch.

Kibana

Kibana is an open-source visualization tool [15]. It can be used to visualize and interact with Elasticsearch data. It can also be used to create machine learning jobs such as anomaly detection and outlier detection. It is also possible to use Kibana Query Language (KQL) for filtering and searching Elasticsearch data.

Anomaly detection

Kibana provides some wizards making it easier to create anomaly detection jobs. These include single metric jobs, multi metric jobs, population jobs, categorization jobs and advanced jobs [17].

A single metric job contains a single detector, which defines the type of analysis and which fields to analyze. For example, it can use a low count function, which classifies events as anomalous if they fall under the expected value. This can be useful, for example, when examining the number of requests for a web page. If the number of requests drops significantly during a time when requests should be frequent, it can indicate unwanted behavior. This type of job only uses two fields in the data, time and value. Therefore, users can not send in unstructured log messages and expect anomalies to be detected. A multi-metric job contains multiple detectors and can be seen as combining several single metric jobs. Different detectors can be put on different fields in the data. Users can also choose to split the data into multiple series depending on a categorical field in the data. For example, if a keyword in this field changes, the data can be split and treated as a new series.


A population job detects unusual behaviour compared to the behaviour of the population. It uses a single detector and a population field. All values in the population field will be grouped together and anomalies will be detected by comparing values to other values within the population. For example, a field called user could be chosen as the population field. In that case, a user's behaviour will be compared to other users' behaviour or to the previous behaviour of the same user. It is most useful for data with high cardinality.

A categorization job groups log messages into categories and finds anomalies inside them. It is best suited for categorical machine generated data since human generated data would most likely result in too many categories. The model learns normal behavior for each category over time and anomalies can be detected with, for example, count functions.

An advanced job can contain multiple detectors and makes it possible to manage all configuration settings.

Since anomaly detection is performed on time series data, it is possible to divide the data into so-called bucket spans when creating single or multi metric jobs. Using bucket spans is similar to the concept of a fixed window. A bucket is a batch of data that occurs within a specified time period. For example, if the time series covers one day, the bucket size could be set to one hour. At the end of each hour, average values of the data can be calculated. Anomaly detection can be performed both on uploaded data and in real-time by receiving data from, for example, Logstash.

Outlier detection

With Kibana it is also possible to create a data frame analytics job to perform outlier detection [17]. For this job, time series data is not needed. The outlier detection is performed using the following unsupervised machine learning techniques:

• distance of the Kth nearest neighbor,
• distance of K-nearest neighbors,
• local outlier factor, and
• local distance-based outlier factor.

The results from the different algorithms are normalized and combined to give each data instance an outlier score from 0 to 1, where a higher score means a higher probability that the data instance is an outlier compared to the other instances. Besides the outlier score, a feature influence value is given, which tells which features of the instance make it an outlier.
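The list above describes Elasticsearch's internal techniques; as a hedged, generic illustration of two of them (the distance to the Kth nearest neighbor and the local outlier factor), a scikit-learn sketch could look as follows. This is not Elasticsearch's implementation, only the underlying ideas:

# Sketch: k-th nearest neighbor distance and local outlier factor as outlier scores
# (generic scikit-learn illustration, not Elasticsearch's implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), rng.normal(loc=6.0, size=(3, 3))])

k = 5
# Distance to the k-th nearest neighbor (skipping the zero distance to the point itself).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth_distance = dists[:, -1]

# Local outlier factor; larger values indicate stronger outliers.
lof = LocalOutlierFactor(n_neighbors=k).fit(X)
lof_score = -lof.negative_outlier_factor_

print(np.argsort(kth_distance)[-3:], np.argsort(lof_score)[-3:])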

2.12.2 Splunk

Splunk [56] is an existing tool used for big data analysis. It is used in a variety of fields to automate the process of analysing large quantities of data [28]. It can be used as a tool for performing automated log analysis for security purposes. It offers a web-like user interface where users can create their own triggers and rules. A user can decide to put a trigger on a specific log message or make rules to detect and alert when certain sequences of log messages appear [57]. As described in Section 2.6 this is known as rule based detection and does not require clustering or other machine learning algorithms. However, Splunk offers extension apps from Splunkbase, such as the machine learning toolkit, which uses different machine learning algorithms to perform a number of tasks.


Parsing

There are a number of automatically recognized source types, i.e. log formats, that can be parsed automatically without additional input from users [57]. However, if the log format is not automatically recognized, a source type has to be created for that particular format. This can be done in a few ways. Users can enter Regex to decide in what way the data should be parsed in terms of event breaks and timestamps. If no Regex is applied when adding the data, Splunk will, by default, separate events on new lines. That means that if there is no line break in the raw log file, the entire file will be seen as a single event. New source types can also be created by editing a file called props.conf, which contains different rules and settings for source types.

Search processing language

Splunk uses its own query language known as Splunk's search processing language (SPL) when interacting with the data [57]. There is a large variety of queries that can be used and customized to fit the needs of the users. The web interface is a way of performing SPL queries in a more user-friendly way. For example, when performing numeric outlier detection, the web interface uses an underlying SPL query to perform the correct task. The query itself can be seen when pressing the SPL button in the web interface. Using the same query in the search field of the web interface will yield the same result as performing it via the web interface. However, writing SPL queries by hand has the advantage of a more customized search. For example, in the web interface only one field can be chosen as "field to analyze", whereas by handwriting the SPL query more fields can be added.

Anomaly detection

Splunk has the ability to perform anomaly detection in a variety of ways [57]. By using the SPL command "anomalydetection", Splunk detects anomalies by calculating the probability of certain events; events with lower probability are more likely to be classified as anomalies. For categorical fields, the probability is calculated by dividing the frequency of the event by the total number of events. For numerical fields the probability is calculated in a similar way, but using a histogram of all the values in the data. The total probability is calculated by multiplying the probabilities of the individual fields. As previously mentioned, Splunk also offers an extension app called the machine learning toolkit. The toolkit has six different areas of focus, of which two are especially related to automated security log analysis: detecting numeric and categorical outliers, and clustering numeric events. Detecting outliers is useful in log analysis since the events that occur less frequently have a higher probability of being damaging or simply interesting from a security point of view [35]. Detection of numeric outliers is done using distribution statistics [57]. This means that previous values are compared with current values, and the values that differ significantly are classified as outliers. When it comes to categorical features, probabilistic measures are used, meaning that the model checks for unusual combinations of values. As discussed in Section 2.8.1, clustering is used in anomaly detection where sparse clusters should be examined more in depth. Clustering can be performed in the machine learning toolkit with several algorithms such as K-means, DBSCAN and Spectral Clustering. The machine learning toolkit contains around 30 algorithms, but it is also possible to extend it with open-source Python algorithms from, for example, the Scikit-learn, numpy, and scipy libraries. Another extension app is the Splunk User Behavior Analytics (UBA), where it is possible to find hidden threats and anomalous behavior across users, devices, and applications by using unsupervised machine learning. There are also forensics and investigation capabilities in Splunk, so that security incidents can not only be detected, but their scope and root cause can also be determined. This can help to prioritize and determine the next steps that have to be taken.
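
As a rough illustration of the frequency-based probability reasoning described above (not Splunk's exact implementation), the following Python sketch assigns each event a probability by multiplying per-field frequencies and ranks the least probable events as candidate anomalies; the events are hypothetical.

```python
from collections import Counter

# Hypothetical events with two categorical fields.
events = [
    {"user": "alice", "action": "login"},
    {"user": "alice", "action": "login"},
    {"user": "bob", "action": "login"},
    {"user": "mallory", "action": "delete_all"},
]

# Per-field frequency counts over the whole collection of events.
counts = {field: Counter(e[field] for e in events) for field in events[0]}
total = len(events)

def probability(event):
    # Approximate P(event) as the product of per-field relative frequencies.
    p = 1.0
    for field, value in event.items():
        p *= counts[field][value] / total
    return p

# Events with the lowest probability are the most likely anomalies.
for event in sorted(events, key=probability):
    print(f"{probability(event):.3f}", event)
```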


2.12.3

Logz.io

Logz.io [38] is a cloud-based service where different open-source tools, such as ELK, Prometheus and Jaeger, can be used on a single platform. Logz.io can be used in the areas of log management and analytics. Parsing of logs is done in the same way as in Logstash, with Grok; an example can be seen in Figure 2.4. Common log structures can be parsed automatically by predefined Grok patterns, but otherwise the Grok pattern must be created manually by the user. Logz.io has some machine learning capabilities through its service called Insights. This feature can detect repeating patterns or errors in log messages, and these events can be highlighted and saved for further investigation [39]. However, it is unclear exactly which machine learning algorithms are implemented, and it does not seem possible to perform traditional outlier detection on a collection of log messages.

2.12.4

Anomali

Anomali [3] is a cyber security company with intelligence-driven products that offer threat visibility and accelerated anomaly detection. Anomali has three different products useful for different circumstances: ThreatStream, Match and Lens. ThreatStream provides an automated workflow for Splunk users where potential threats can be sent to ThreatStream for further analysis. For example, logs that appear suspicious can be sent to Anomali ThreatStream, where they are scraped for indicators of compromise and presented to the user with additional information such as comments and all related raw events. Match can analyse large amounts of logs in real time and present matches depending on what the user is looking for. Lens uses natural language processing to automatically scan and detect anomalies in web content or user interfaces. This is useful in log analysis since large quantities of data can be processed in a short amount of time and anomalies can be detected. It can be used together with other tools such as Splunk by scanning log collections and presenting identified threat intelligence with a severity level and a confidence level [2]. Objects with high severity and high confidence should be prioritized.

2.12.5

LogRhythm

The purpose of LogRhythm [36] is to combine security analytics, log management and forensics/endpoint monitoring. Data is fed to LogRhythm and processed in real time, showing changes and potential threats in a highly visually focused web interface. A world map indicating where log messages originate from is used to give a quick overview of suspicious locations. When it comes to anomaly detection, LogRhythm provides functionality such as the Advanced Intelligence Engine (AI Engine) and User and Entity Behavior Analytics (UEBA) [37]. The AI Engine uses a set of 900 predefined rules to detect and respond to different security-related events, with the possibility to create custom rules. When it comes to parsing, it has been hard to determine how this is performed in LogRhythm. From the documentation there seem to be some predefined parsing rules for common log sources, and the Message Processing Engine (MPE) Rule Builder, which is based on Regex, can be used to create or modify rules for new log formats.

2.12.6

Spark

Spark is a data processing engine that can be used for different purposes, including log analysis [55]. Spark is used to process large amounts of data in a distributed computer system. This means that if datasets become too big, or arrive too fast, for a single computer to process, the work can be distributed to multiple computers. For anomaly detection it is possible to use the machine learning library (MLlib), with algorithms for classification and regression (supervised machine learning) as well as clustering (unsupervised machine learning) [54]. It is up to the developer to create models using these algorithms, and Spark supports Scala, Python (pyspark) and Java.
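
As an illustration of how MLlib can be used from Python, the following pyspark sketch clusters hypothetical per-window log features with K-means; the feature values and column names are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("log-clustering").getOrCreate()

# Hypothetical per-window features derived from logs: request count and bytes.
rows = [(12, 3400.0), (15, 2900.0), (11, 3100.0), (230, 98000.0)]
df = spark.createDataFrame(rows, ["requests", "bytes"])

# Assemble the numeric columns into the feature vector expected by MLlib.
assembled = VectorAssembler(inputCols=["requests", "bytes"],
                            outputCol="features").transform(df)

# Cluster the windows; small or sparse clusters may warrant closer inspection.
model = KMeans(k=2, seed=1).fit(assembled)
model.transform(assembled).show()
```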


It is also possible to perform real-time analysis with Spark. One of the main components of Spark is known as Resilient Distributed Datasets (RDDs). These are data structures that can hold multiple data types and can be distributed over multiple nodes/clusters; once they are created, they cannot be altered. Distributing RDDs over multiple nodes introduces parallelization, meaning that operations performed on RDDs are done in parallel, resulting in faster computations.
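
The following small pyspark sketch illustrates the RDD model: the hypothetical log lines are distributed across partitions, and the filter and count operations run in parallel without modifying the original RDD.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical log lines distributed over the cluster's partitions.
lines = sc.parallelize([
    "INFO user alice logged in",
    "ERROR failed password for root",
    "INFO backup finished",
    "ERROR failed password for root",
])

# RDDs are immutable: filter() returns a new RDD; the operations run in parallel.
errors = lines.filter(lambda line: line.startswith("ERROR"))
print(errors.count())
```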

2.12.7

ArcSight

ArcSight [21] by Micro Focus is a cyber security service that provides a multitude of different software products using big data analytics and intelligence. It is used for log management, analytics and security information and event management (SIEM). The service includes multiple software products used for different situations: ArcSight Enterprise Security Manager (ESM), ArcSight Intelligence, ArcSight Recon and ArcSight Logger. ESM is used for detecting known threats in real time by using correlation and rules that can be customized by the user. Intelligence is used for anomaly detection in log data using unsupervised machine learning. Raw events are analyzed and anomalies and violations are identified. A number of active risk entities is then presented and categorized; these could include user behavior, websites, machine logs, IP addresses and so on. Exactly which machine learning algorithms are used is not mentioned, but the detection works by identifying normal behavior and classifying events that deviate from it as anomalies [22]. Recon is a log management and security analytics solution that is used for forensic investigations. Logger is a log management solution for easier compliance, efficient log search, and secure storage. Machine learning packages have been introduced to Logger to boost performance and user experience; users can access pre-built content developed by ArcSight or build their own models with Python packages. ArcSight uses connectors to collect raw log events and process them into the Common Event Format (CEF) [5], which is explained as one of the different log structures in Section 2.2.2. No information could be found on whether parsing is automated or whether there are predefined parsing rules for common log structures. Via the community and video tutorials on YouTube it was possible to determine that parsing can be performed with Regex.
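
Since CEF has a fixed pipe-separated header, a crude parse can be sketched in a few lines of Python; the event below is hypothetical and the sketch ignores details such as escaped pipe characters that a real connector would handle.

```python
# Hypothetical CEF event; the header has seven pipe-separated fields followed
# by a key=value extension.
line = "CEF:0|SomeVendor|SomeProduct|1.0|100|Failed login|7|src=10.0.0.5 dst=10.0.0.9"

fields = line.split("|", 7)
cef_version, vendor, product, device_version, signature_id, name, severity = fields[:7]
extension = dict(pair.split("=", 1) for pair in fields[7].split())

print(name, "severity:", severity, "source:", extension["src"])
```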

2.12.8

Loggly

Loggly [51] is a cloud-based log management and analysis tool. It has several recognized log structures that can be parsed automatically [52]. For log structures that are not recognized, a custom parser can be used to parse them into JSON before they are sent to Loggly; for example, Logstash with Grok can be used. It is also possible to create custom fields from log data with custom parsing rules.

Similar to Splunk and many other tools, Loggly lets users create alerts that warn them when certain events occur. There are also anomaly detection capabilities where the frequency of occurrences can be compared with earlier frequencies. However, only one field can be chosen for analysis, and it does not seem possible to choose between different machine learning algorithms or implement your own. Therefore, it is not possible to perform user behavior analysis; the functionality resembles rule-based detection where certain statistics regarding events can be presented. When it comes to forensics, Loggly can automatically find surrounding events. This means that if a suspicious event has occurred, events related to that particular event can easily be found and presented, which can help in finding the root cause of the problem. Like many existing log analysis tools, Loggly has plenty of customizable visual aids, such as charts and dashboards, to help guide log management.


3

Related works

In a paper from 2019, Studiawan et al. [58] provide a comprehensive survey of existing literature in the area of log forensics. They compare different techniques for log analysis and event reconstruction. However, they take a more generalized approach to the topic of log forensics and cover a wider area, from log security to log retrieval and log visualization. Only a small part of the paper covers the automation of anomaly detection, and only operating system logs are covered. When it comes to log retrieval, there is a classification based on the storage type of the logs: the paper covers three types of storage methods, and the retrieval techniques differ depending on the storage method used. Different techniques for event reconstruction using OS logs are also presented and compared, as are some tools in the field of OS log forensics, such as ELK and Splunk. Even though this thesis mentions forensics and examines tools such as ELK and Splunk, the area of log forensics is investigated more in depth in the paper by Studiawan et al.

Jayathilake [28] describes features expected from a log analysis system, such as the capability of handling logs containing different information with different structure. The study of common analysis tools shows a lack of structured data extraction and processing, and a framework is provided as a solution to this problem. The paper deals with more general log types and is not specifically focused on the security aspect of log analysis, which differs from this thesis where security is the major area of focus. Different techniques used by the tools are also not investigated in depth, in contrast to this thesis where investigating different techniques is a central part.

Log file monitoring as a technique for network and system management is covered in a paper by Vaarandi et al. [61]. Here, a framework based on data mining is presented as a means of detecting anomalies in system log files. The advantage of this technique, as opposed to other techniques that do not involve pattern recognition, is the possibility of detecting previously unknown error conditions; many other techniques require human experts to define patterns for log messages that need further investigation. Other techniques using data mining are also presented and briefly compared with the model used in the paper. The use of data mining to detect anomalies is closely related to this thesis, since the PoC implementation uses unsupervised machine learning to detect previously unknown threats and anomalies.


In a paper by Cao et al. [7], machine learning is used as part of a system to detect anomalies in log files produced by a web application. The paper mentions shortcomings of traditional log analysis, which is reliant on manual inspection and matching using regular expressions. Another problem discussed in the paper is the size of the log files, which makes the traditional detection techniques less efficient. The system presented in the paper uses a decision tree algorithm and a Hidden Markov Model: the decision tree is used for classification and the Hidden Markov Model is used for modeling the data. From their experiments they report a detection accuracy of 93.54% and a false positive rate of 4.09%. Even though Cao et al. [7] use different techniques for log analysis, their work relates to this thesis through the discussion of the shortcomings of traditional and manual inspection in log analysis and the use of machine learning in the field. It is also interesting that results, in terms of detection accuracy and false positive rate, are presented, making comparisons possible.

In a survey, Khan et al. [30] present the problem of log files being created on different devices in many large organizations, making management of the logs particularly difficult. A solution to this problem is that many organizations are starting to use cloud computing for storing the log files. This means that log files created on different devices are sent to the cloud for storage, where they can be analysed collectively. The paper reviews the current status of cloud log forensics and presents some of the challenges involved in analyzing log data in the cloud. Case studies are presented and explained to point out some of the advantages of using cloud log forensics. The paper concludes that the time overhead is decreased for users and organizations that introduce cloud log management. However, there are still some challenges when it comes to managing logs in the cloud, such as ensuring that the logs are stored in a safe location and with sufficient information, since many of the logs come from different devices. Even though cloud services are not investigated in this thesis, the survey points to the potential of cloud log management as a future prospect.

Svacina et al. [60] have done a systematic literature review on recent trends in security log analysis. They summarize current research and discuss problems and possible future directions. The work done by Svacina et al. has been useful in the systematic mapping done in this thesis.

The research by Poh et al. [46] focuses on detecting anomalies in physical access security instead of cyber security, since serious damage can be caused if an outsider or insider accesses sensitive areas. The research focuses on buildings where there are many possible paths to many possible restricted areas. This is an interesting topic, showing that user behavior is not only connected to what users do on computers but also to other types of behavior. Physical access logs are also relevant for the purpose of this thesis; however, they will not be explicitly examined.

Son and Kwon [53] made a performance comparison between ELK and Splunk. They suggest using ELK as a security log analysis system for small or medium-sized enterprises. They do not discuss rule-based detection or anomaly detection, but their results can be taken into consideration when evaluating these tools.

References
