Detecting Lateral Movement in Microsoft Active Directory Log Files

Academic year: 2021

Detecting Lateral Movement in

Microsoft Active Directory Log Files

A supervised machine learning approach

Viktor Uppströmer

Henning Råberg


Security. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Authors:
Viktor Uppströmer, e-mail: viup14@student.bth.se
Henning Råberg, e-mail: herb14@student.bth.se

University advisor:
Ph.D. Anton Borg, Department of Computer Science

Faculty of Computing, Internet: www.bth.se


Cyber attacks pose a serious threat to companies and organisations worldwide. With the cost of a data breach reaching $3.86 million on average, the demand is high for solutions that detect cyber attacks as early as possible. Advanced persistent threats (APT) are sophisticated cyber attacks with long persistence inside the network. During an APT, the attacker spreads its foothold over the network. This stage, one of the most critical steps in an APT, is called lateral movement. The purpose of the thesis is to investigate lateral movement detection with a machine learning approach. Five machine learning algorithms are compared using repeated cross-validation followed by statistical testing to determine the best-performing algorithm and feature importance. Features used for training the classifiers are extracted from Active Directory log entries that relate to each other through a shared workstation, IP, or account name. These features are the basis of a semi-synthetic dataset, which constitutes a multiclass classification problem.

The experiment concludes that all five algorithms perform with an accuracy of 0.998. RF displays the highest f1-score (0.88) and recall (0.858), SVM performs best on precision (0.972), and DT has the lowest computational cost (1237 ms). Based on these results, the thesis concludes that the algorithms RF, SVM, and DT perform best in different scenarios. For instance, SVM should be used if a low number of false positives is favoured. If general, balanced performance across multiple metrics is preferred, RF performs best. The results also show that a significant number of the examined features can be disregarded in future experiments, as they do not impact the performance of any of the classifiers.

Keywords: Advanced Persistent Threat, Lateral Movement, Active Directory, Multiclass Classification, Intrusion Detection System


Cyber attacks pose a major threat to today's companies and organisations, with an average cost of a breach of approximately USD 3.86 million. To minimise the cost of an intrusion, it is important to detect it at as early a stage as possible. An advanced persistent threat (APT) is a sophisticated cyber attack with a long presence in the victim's network. After the attacker's initial intrusion, the focus of the attack shifts to gaining control over as many devices on the network as possible. This step is called lateral movement and is one of the most critical stages of an APT. The purpose of this thesis is to investigate how, and how well, lateral movement can be detected using a machine learning approach. The study compares and evaluates five machine learning algorithms using repeated cross-validation followed by statistical testing to determine which of the algorithms performs best. The study also concludes which attributes of the examined dataset are essential for detecting lateral movement. The dataset comes from an Active Directory domain controller, where the dataset's attributes are created from correlated logs using computer name, IP address, and user name. The dataset consists of a synthetic part and a real part, which together form a semi-synthetic dataset containing a multiclass classification problem.

The experiment concludes that all five algorithms classify correctly with an accuracy of 0.998. The RF algorithm achieves the highest f-measure (0.88) and recall (0.858), SVM is best in terms of precision (0.972), and DT has the lowest training time (1237 ms). Based on the results, the study indicates that the algorithms RF, SVM, and DT perform best in different scenarios. For example, SVM can be used when a low number of false positive alarms is important. If balanced performance across the different metrics matters most, RF should be used. The study also concludes that a large number of the examined attributes of the dataset can be disregarded in future experiments, as they did not affect the performance of any of the algorithms.

Keywords: Advanced Persistent Threat, Lateral Movement, Active Directory, Multiclass Classification, Intrusion Detection


We want to sincerely thank our supervisor Anton Borg from Blekinge Institute of Technology for valuable insights and guidance throughout this thesis. Moreover, we want to thank the supervisors from SecureLink for making this research project possible; David Olander for the master thesis idea and opportunity to conduct the research at SecureLink, Patrik Birgersson for supplying us with data, and Petrus Allberg for useful discussions and pointers on how to improve the thesis.


Acronyms

APT  Advanced Persistent Threat
LM   Lateral Movement
AD   Windows Active Directory
DC   Domain Controller
DS   Domain Service
RAT  Remote Access Tool
IDS  Intrusion Detection System
WFP  Windows Filtering Platform
MDA  Mean Decrease Accuracy
TP   True Positive
FP   False Positive
TN   True Negative
FN   False Negative
KNN  K-Nearest Neighbour
DT   Decision Tree
RF   Random Forest
SVM  Support Vector Machine
ANN  Artificial Neural Network


Contents

Abstract
Sammanfattning
Acknowledgments
Nomenclature

1 Introduction
  1.1 Background
  1.2 Outline
  1.3 Theory
    1.3.1 Advanced Persistent Threats
    1.3.2 Intrusion Detection Systems
    1.3.3 Windows Active Directory
    1.3.4 Supervised Machine Learning
    1.3.5 Supervised Machine Learning Algorithms

2 Related Work
  2.1 Detection and Prevention Against APT
    2.1.1 Lateral Movement Detection
  2.2 Intrusion Detection Systems
  2.3 Active Directory based IDS
  2.4 Semi-Synthetic Datasets
  2.5 Research Gap

3 Research Questions

4 Method
  4.1 Data Collection
    4.1.1 Definition of Normal Behaviour
    4.1.2 Lateral Movement Analysis
    4.1.3 Simulating Lateral Movement
    4.1.4 Pre-processing
    4.1.5 Choosing the Optimal Search Parameter
    4.1.6 Dataset
  4.2 Experiment Setup
  4.3 Evaluation and Analysis
    4.3.1 Friedman Test
    4.3.2 Nemenyi Test
    4.3.3 Feature Evaluation

5 Results
  5.1 Overall Performance
    5.1.1 Accuracy
    5.1.2 Recall
    5.1.3 Precision
    5.1.4 F1-Score
    5.1.5 Computational Cost
  5.2 Class Specific Performance
  5.3 Feature Importance

6 Analysis and Discussion
  6.1 The Classifiers' Capability of Detecting LM
    6.1.1 Choosing the Optimal Classifier
    6.1.2 Attack Specific Detection
  6.2 Features of Interest for Detecting LM
  6.3 Experiment Validity
    6.3.1 Dataset Validity
    6.3.2 Inaccuracy of the Friedman Test
    6.3.3 Class Imbalance

7 Conclusions
  7.1 Major Findings
  7.2 Contributions
  7.3 Future Work

References

A Supplemental Information
  A.1 Parsing Translations
  A.2 Parameters for Machine Learning Algorithms
    A.2.1 KNN
    A.2.2 DT
    A.2.3 SVM
    A.2.4 RF
    A.2.5 ANN
  A.3 Execution of the Activities Related to LM
    A.3.1 Admin Login With PsExec
    A.3.2 Pass-the-Hash
    A.3.3 Pass-the-Ticket
    A.4.2 Admin Using PsExec
    A.4.3 Pass-The-Hash
    A.4.4 Pass-The-Ticket
    A.4.5 User AD Enumeration

List of Figures

1.1 APT Timeline.
1.2 Classification using the KNN algorithm, with k set to five. The dataset consists of three classes, thus forming a multiclass classification problem. Since the figure has only two dimensions, each instance has two features. The unknown instance is classified as green, because three of the five nearest instances belong to the green group.
1.3 Defining the hyperplane in two dimensions using SVM. The sum of the margins m1 and m2 should be maximised. To classify an unknown instance, the hyperplane is used to determine which class the new instance belongs to. The circled instances are the support vectors of this ML model.
1.4 A DT for solving the multiclass classification problem of determining whether an animal is a bird, dog, or cat.
1.5 A random forest classifier consisting of n DTs, solving the same multiclass classification problem as the DT from figure 1.4.
1.6 Illustration of how a perceptron works. The perceptron has n inputs with dedicated weights. When the input parameters are given to the perceptron, it calculates the result by first summing the values and then passing the sum through an activation function to determine the output.
1.7 The design of a neural network using MLP with two hidden layers.
4.1 Network topology of the lab environment.
4.2 Illustration of the sliding window. The red event triggers the creation of a new dataset entry. The blue events correlate with the red event, based on IP, workstation, or account. The information of these four events is used to build the features for this dataset entry.
4.3 The PTH attack template, used as a dataset example as input for the parsing process.
4.4 A dataset example, based on figure 4.3.
4.5 Result of the test to conclude where the search parameter n is optimal for f1-score. The x-axis is n, and the y-axis is the f1-score. The graph concludes that n = 3 is the overall optimal search parameter.
5.1 Algorithm performance, measured in accuracy, macro recall, macro precision, and macro f1-score. Each bar represents the mean performance from the repeated cross-validation experiment.
5.2 Results of the post-hoc Nemenyi test. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.3 Illustrating results from the post-hoc Nemenyi test on the predictive metric recall. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.4 Illustrating results from the post-hoc Nemenyi test on the predictive metric precision. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.5 Illustrating results from the post-hoc Nemenyi test on the predictive metric precision. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.6 Nemenyi test on the metric computational cost. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.7 Feature importance calculated using MDA.
A.1 Spawning a command prompt on another machine using PsExec.
A.2 Result of the Mimikatz command extracting the NTLM hash.
A.3 Testing the privilege on a shared folder and performing the pass-the-hash attack that spawns a command prompt.
A.4 Testing the privilege on a shared folder with success after performing a pass-the-hash attack.
A.5 Listing tickets extracted with the command sekurlsa::tickets /export.
A.6 Listing current tickets in sessions, reusing a ticket, and verifying it by listing the session tickets again.
A.7 Executing the AD enumeration that stores the results in a .zip file.
A.8 Displaying the result of the AD enumeration in BloodHound's graphical interface.

List of Tables

4.1 Information about the log file provided by SecureLink containing events related to normal behaviour.
4.2 The five indications of LM that are chosen to be investigated.
4.3 Information about the generated log file.
4.4 Features present in the dataset, divided into three groups. The first group measures how each event in the sliding window is connected to the triggering event. The second contains general information about the contents of the sliding window. The last group contains information about the event that triggered the log entry.
4.5 Class distribution in the finalised dataset. Note the difference between "Entries" and "Attacks Executed": Entries specifies how many log entries are created when the attacks have been executed.
5.1 The accuracy performance of the classifiers, including the average Friedman ranks and standard deviation from the stratified ten-fold cross-validation.
5.2 The macro average of the recall performance for the classifiers, including the average Friedman ranks and standard deviation for each classifier.
5.3 The macro average of the precision performance for the classifiers, including the average Friedman ranks and standard deviation for each classifier.
5.4 The macro average of the f1-score performance for the classifiers, including the average Friedman ranks and standard deviation for each classifier.
5.5 The computational cost in milliseconds for the classifiers, including the average Friedman ranks and standard deviation for each classifier. Diagram generated with the Python framework Orange as described in section 4.3.2.
5.6 Class-specific metrics for the random forest classifier.
5.7 Top 10 features for each machine learning algorithm.
5.8 Features present in the dataset. The Event ID descriptions are based on the Microsoft documentation.
6.1 All features that impact the performance of the machine learning algorithms (measured by MDA) when solving the multiclass classification problem present in the AD DC dataset.
A.3 Translations used for keywords.
A.4 Translations used for error codes.


Introduction

Sophisticated cyber attacks that establish unauthorised access to a company's network for a substantial time are called advanced persistent threats (APT). Early detection of an APT is a crucial factor in mitigating its damage. Today, the average detection time is about 197 days, and an attack can cost the company a fortune: a data breach costs on average $3.86 million [1] and damages the company's reliability.

An APT is partly defined by its persistence inside the network; it can linger for years before detection. Its characteristics can be broken down into a timeline, as described in section 1.3.1. In short, the actions performed in the attack are separated into six phases: starting with reconnaissance, then delivering an attack which causes the initial intrusion. Once inside, the threat contacts a command and control server, which is later used to move laterally inside the network, finally fulfilling the goal by extracting the sensitive data the attacker is after. Figure 1.1 presents the timeline for an APT.

Figure 1.1: APT Timeline

Arguably the most critical stage in this chain of events is the lateral movement (LM) phase. Here, the attacker tries to map the inside of the network and gain access to as much of it as possible, all without risking detection. To do this, the attacker has to be extremely careful and will therefore utilise legitimate services to move between assets [10]. Due to the stealthy nature of the LM phase, it can be considered the hardest phase to detect. Therefore, the detection of LM is the focus of this thesis.

To detect a LM, one would need to analyse data generated by the behaviour of the attacker. Collecting this data can be done efficiently by using a directory service such as Microsoft Active Directory (AD). With this approach, data from each endpoint is obtained from a single server called the domain controller (DC), which authenticates


users across the network. An attacker will have to communicate with this server to access any assets present in the network, thus leaving traces of the attacker's activities in the log files of the AD. Because of this, this thesis presents a way of analysing DC log files in an attempt to detect indicators of LM, specifically by using machine learning algorithms to classify segments of log entries as different indication types of LM.

The approach creates a supervised multiclass classification problem, where each type of activity that indicates LM is classified accordingly by the different algorithms. Supervised, or classification, algorithms apply labels to (classify) instances of data [15, p.52 – 53]. The algorithms use features inside the raw data to detect general rules that are then used to label, or classify, the data. A typical problem that can be solved using supervised machine learning is the classification of spam emails [5]. When classifying instances that could belong to more than two categories (e.g. when classifying activities as multiple types of attacks), a multiclass classification problem occurs [15, p.81 – 91]. Such problems can be solved using the algorithms k-nearest neighbour (KNN), artificial neural networks (ANN), decision trees (DT), random forests (RF), and support vector machines (SVM), to name a few. Supervised machine learning algorithms can also be used inside intrusion detection systems [26], which is the main application in this thesis.

To train and evaluate machine learning algorithms on the presented problem, a labelled dataset containing both regular and malicious events is needed. Since no public datasets of this sort are available, one is created with a semi-synthetic approach, i.e. a dataset where one part originates from a real-world environment and the other is simulated. In this thesis, all regular events arise from a live environment and malicious activities are synthetically generated.

1.1 Background

Due to the expensive nature of APTs, it is crucial to detect them as early as possible and to provide ways to identify them in all of their stages. By finding an approach to discover an APT in the LM stage, security professionals can reduce the risk of a successful attack. Stopping the attack before its goal has been fulfilled eliminates the expensive costs of these advanced cyber attacks. When this detection is done based solely on the data present inside the log files of a single endpoint, the workload for analysts is kept to a bare minimum, thus reducing the cost of detection. Conclusively, by establishing a method that can both detect LM and reduce the workload for analysts, companies reduce the risk of successful APTs and reduce maintenance costs for APT defence.


Since activity performed in a Windows environment leaves traces in the logs of the AD, an opportunity to detect indications of LM is observed.

1.2 Outline

This thesis revolves around answering the two research questions, RQ1 and RQ2, defined in chapter 3. The questions are resolved by conducting an experiment described in detail in chapter 4. That chapter is broken down into the process of collecting the dataset and defining a multiclass classification problem, describing the evaluation of different methods for solving the problem, and how a statistical comparison of the different methods is conducted. The results of the experiment are presented in chapter 5, followed by an analysis and discussion of them to fully answer RQ1 and RQ2 in chapter 6. The conclusions and future work of the thesis are discussed in chapter 7.

1.3 Theory

This section covers the theory required to understand the results and methods of this thesis. It consists of a description of the terms advanced persistent threat and intrusion detection system (IDS) and of the technology used. Lastly, an explanation of the five machine learning techniques used follows.

1.3.1 Advanced Persistent Threats

LM is one of the phases in the cyber attack called an advanced persistent threat. Attacks with a more advanced and sophisticated method of gaining access to sensitive information usually go under the name APT. The threat typically lingers inside a network for a long time before detection and is executed by threat actors with high expertise. Because of this, the detection of APTs has become a difficult task for security analysts. There are multiple definitions of these threats. However, from here on and throughout this paper, when referring to an APT we will use the following definition released by NIST in 2011 [32]:

“An adversary that possesses sophisticated levels of expertise and significant resources which allow it to create opportunities to achieve its objectives by using multiple attack vectors (e.g., cyber, physical, and deception). These objectives typically include establishing and extending footholds within the information technology infrastructure of the targeted organizations for purposes of exfiltrating information, undermining or impeding critical aspects of a mission, program, or organization; or positioning itself to carry out these objectives in the future. The advanced persistent threat: (i) pursues its objectives repeatedly over an extended period of time; (ii) adapts to defenders’ efforts to resist it; and (iii) is determined to maintain the level of interaction needed to execute its objectives.”


The APT Life-Cycle

The life of an APT is long, and the attacks usually stay hidden inside networks for an extended time frame. For instance, operation “Ke3chang”, which targeted ministries of foreign affairs [46], operated for three years, between 2010 and 2013. Although the attacks span such time frames and have different motives and methods of reaching them, similarities common to all APTs are present. The authors Ping Chen, Lieven Desmet and Christophe Huygens identified a series of phases that can be seen in any APT [10]. The authors divide an APT into the phases Reconnaissance and Weaponization, Delivery, Initial Intrusion, Command and Control, Lateral Movement, and Data Exfiltration. These phases are visualized as a timeline in figure 1.1 and defined in detail below.

1. Reconnaissance and Weaponization - Before an attack, the attacker gathers information about the infrastructure and its usage. During this phase, the attacker collects as much technical data as possible about the infrastructure, as well as data about its employees. A typical approach in this stage is to perform a social engineering attack.

2. Delivery - During the delivery stage of an APT, the attacker delivers the exploits used to gain entry to the system. For instance, an attacker could perform a spear-phishing attack.

3. Initial Intrusion - The attacker gains the first unauthorized access to the system. The compromise is accomplished by executing the exploits delivered in the delivery phase. The exploits chosen by the attacker vary, for instance, from exploiting old vulnerabilities in unpatched systems to zero-day exploits. Another option in this stage is for the attacker to find legitimate credentials and use these to gain initial access.

4. Command and Control - The attacker seeks a way to control the intrusion. The command and control stage consists of the attacker gaining this control over the infected assets. Typically a Remote Access Tool (RAT) is installed, which allows the attacker to access and control the asset remotely.

5. Lateral Movement - Once the attacker has remote access to the network, the focus of the attack shifts from breaching the network to increasing its foothold. The attacker moves inside the network, attempting to gain access to as much of it as possible. This stage typically occurs over a long period to avoid detection.

6. Data Exfiltration - The overall goal of the APT is to transfer the sensitive data the attacker is after out of the victim's network.


1.3.2 Intrusion Detection Systems

Intrusion detection systems (IDS) are software designed to detect and alert on malicious behaviour, such as APTs or simpler types of cyber attacks. There are two main types of IDS: network intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS) [40]. A NIDS is designed to detect malicious behaviour on a larger scale, e.g. by analysing network traffic, while a HIDS is intended to detect intrusions on a single machine, e.g. by monitoring malicious modifications of OS files. There are mainly two approaches for an IDS to detect malicious behaviour [43]. First, there is the signature-based approach, which detects intrusions by comparing current activity to previously known malicious activities (signatures). Then there is the anomaly-based approach, which alerts on deviations from "normal" traffic.

This thesis will focus on the theory of NIDS using an anomaly-based approach, by solely examining log files from the Microsoft Active Directory (AD) domain controller. The approach is considered a NIDS since communication between all endpoints and the DC is correlated, which means that this method can detect compromises across the entire network.
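To make the signature-based approach concrete, here is a minimal sketch of signature matching over log lines. The signature strings and log lines are hypothetical examples, chosen to echo tools such as PsExec and Mimikatz that appear later in the appendix; a real IDS would use far richer signatures.

```python
# Toy signature-based detection: flag any log line that matches a
# known-bad pattern. Signatures and log lines are hypothetical.
signatures = ["mimikatz", "sekurlsa::", "psexec"]

def signature_alerts(log_lines):
    # Return every line containing at least one known signature.
    return [line for line in log_lines
            if any(sig in line.lower() for sig in signatures)]

log = ["User alice logged on",
       "PsExec.exe service installed on WS02",
       "Ticket request for bob"]
print(signature_alerts(log))  # → ['PsExec.exe service installed on WS02']
```

An anomaly-based detector would instead model "normal" activity and alert on deviations from it, which is closer to the machine learning approach taken in this thesis.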

1.3.3 Windows Active Directory

Windows Active Directory is a directory service (DS) developed by Microsoft2, controlled by an endpoint called the domain controller (DC). The DC is a centralised server that keeps a database of the entities that exist within a given domain. The entities consist of objects such as users, user groups, printers, shared folders, and workstations across an organisation's network. A DC also manages the security of the domain by keeping track of the permissions and relations between the objects. If an object wants to access a resource within the network domain, it first sends an authentication request to the DC [14, p.3 – 4]. This means that if an intrusion has occurred, the intruder has most likely interacted with the AD and left traces in the logs, which makes them interesting from an intrusion prevention and detection perspective.

1.3.4 Supervised Machine Learning

Machine learning comprises multiple techniques for solving problems that originate from data, such as regression, classification, clustering, or optimisation. This thesis focuses on solving multiclass classification problems using supervised machine learning. Supervised machine learning, in contrast to unsupervised, uses labelled data to train a machine learning algorithm to solve a classification task. With labelled data, the algorithms learn to identify patterns with the help of the correct answers to the classification problem. This allows the models produced by the machine learning algorithms to accurately predict unseen (unlabelled) data. The task of classifying objects as one of two types is called

2Windows AD.


binary classification [15, p.53], and the task of classifying one of more than two types is called multiclass classification [15, p.81 – 82]. The area where supervised machine learning can be applied is wide: besides network security, it is used in areas such as healthcare, finance, and e-commerce, to name a few [42]. The process of supervised machine learning can be simplified into four steps: data collection, pre-processing, training and evaluation, and application [2].

Data Collection

Data used in ML can be of any size or format. A typical approach for managing data is to structure it in a table, where each row represents one instance and each column a specific feature. Labelled data is data where each instance has a specific label attached to it; unlabelled data is not assigned to any class. As an example, a set of photos could be a dataset, with labels describing what each photo shows: animals, buildings, etc. Another example could be network traffic, where the data is divided into TCP sessions and the labels are "malicious" or "normal". A dataset refers to a collection of data, where the dimension of the dataset is defined by the number of features per instance [18].
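A minimal sketch of this tabular structure, in the spirit of the network-traffic example; all feature names and values are hypothetical.

```python
# A labelled dataset as a table: each row is one instance, the first two
# columns are features, the last column is the label. Values are hypothetical.
dataset = [
    # duration_s, bytes_sent, label
    (0.4,   512, "normal"),
    (12.7, 9216, "malicious"),
    (0.9,   768, "normal"),
]

features = [row[:2] for row in dataset]  # the 2-dimensional feature vectors
labels   = [row[2]  for row in dataset]  # the class of each instance
print(len(features), labels)  # → 3 ['normal', 'malicious', 'normal']
```

Here the dataset has dimension two, since each instance carries two features.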

Collecting data is one of the most critical stages in the process of supervised machine learning. This is where the data that the classifier will be trained and evaluated on is collected, and the quality of the collected data will be reflected in the final performance of the classifier. Worth noting is that when using a supervised machine learning approach, the collected data has to be labelled. This means that before the algorithms can learn from the data, it has to be manually categorised into the different classes.

Pre-processing

Pre-processing is the stage where the collected data gets translated into a dataset that machine learning algorithms can interpret. In this stage, the features the classifiers use to make predictions are developed. For instance, when classifying animals, the weight represented in grams could be a feature.

Training and Evaluation


Application

The final stage of the process is the application. This is when the classifier has reached the desired performance during evaluation. To get a finalised classifier, it is trained on the whole dataset, stored, and then applied to the problem it was trained to solve.
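The train-on-everything, store, and reload cycle can be sketched as follows. The MajorityClassifier is a hypothetical stand-in for a real trained model, kept trivial so the storage step stays in focus; the standard library's pickle serves as the store.

```python
import pickle
from collections import Counter

# Hypothetical stand-in "classifier": predicts the majority class seen
# during training. A real model would take its place in practice.
class MajorityClassifier:
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]

X = [[0, 1], [1, 0], [1, 1]]
y = ["normal", "normal", "malicious"]

model = MajorityClassifier().fit(X, y)  # train on the whole dataset
blob = pickle.dumps(model)              # store the finalised classifier
restored = pickle.loads(blob)           # ...and load it at application time
print(restored.predict([[0, 0]]))       # → ['normal']
```

In a real deployment the stored bytes would be written to disk and loaded by the system that applies the classifier to new, unlabelled log entries.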

1.3.5 Supervised Machine Learning Algorithms

This thesis consists of an experiment in which five different machine learning algorithms are used to solve a multiclass classification problem. The algorithms k-nearest neighbour (KNN), decision tree (DT), random forest (RF), artificial neural network (ANN) and linear support vector machine (SVM), which are used in the experiment, are described briefly in this section.

K-Nearest Neighbour

The KNN algorithm plots each instance in a Cartesian coordinate system with the same number of dimensions as there are features for the ML algorithm. The algorithm can solve both classification and regression problems; since the problem in this thesis is one of classification, KNN for classification is described from now on. When learning, the algorithm plots each instance in this coordinate system, and for classification the Euclidean distance,

$d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$,

from each learned instance is calculated. A voting process between the k nearest neighbours then establishes what class the new instance should belong to [15, p.21 – 23]. Figure 1.2 illustrates the multiclass classification of an unknown instance using the KNN algorithm with k = 5.
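The distance-and-vote procedure can be sketched directly from the definitions above; the training instances are hypothetical, mirroring the three-class example of figure 1.2.

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2), the Euclidean distance above.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train, x, k=5):
    # Sort the training instances by distance to x and vote among the k nearest.
    nearest = sorted(train, key=lambda item: euclidean(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature, three-class training data, as in figure 1.2.
train = [([0.0, 0.1], "green"), ([0.1, 0.0], "green"), ([0.2, 0.1], "green"),
         ([1.0, 1.0], "red"),   ([0.9, 1.1], "red"),
         ([2.0, 0.0], "blue"),  ([2.1, 0.2], "blue")]

print(knn_classify(train, [0.1, 0.1], k=5))  # → green
```

With k = 5, three of the five nearest neighbours are green and two are red, so the vote classifies the unknown instance as green, exactly as in the figure.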


Figure 1.3: Defining the hyperplane in two dimensions using SVM. The sum of the margins m1 and m2 should be maximised. To classify an unknown instance, the hyperplane is used to determine which class the new instance belongs to. The circled instances are the support vectors of this ML model.

Linear Support Vector Machine

SVM is a supervised machine learning model that, similar to KNN, plots each instance in a Cartesian coordinate system [15, p.21 – 23]. The algorithm can be used to solve classification problems. The goal for SVM is to find the best hyperplane that separates the data from each class. As with KNN, the coordinate system has the same number of dimensions as there are features in the dataset. A hyperplane is defined as the plane that separates the data; in two dimensions, this hyperplane is a line, f(x) = ax + b. When dividing the instances by a hyperplane, two margins appear (in two dimensions), defined by the distance from the closest instance of each class. The overall goal for an SVM is to maximise these margins, thus finding the optimal hyperplane. The training instances closest to the hyperplane are called support vectors. Figure 1.3 illustrates the hyperplane in two dimensions for a binary classification problem.

The SVM classifier is designed for binary classification problems. However, there are several ways to turn it into a classifier that can solve a multiclass classification problem, such as the one present in this thesis. C. Hsu et al. describe the processes “one-against-all”, “one-against-one”, and DAGSVM [11]. However, the multiclass method chosen for SVM in this thesis is the approach presented by K. Crammer and Y. Singer in their paper “On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines” [12].
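Scikit Learn exposes the Crammer–Singer multiclass formulation through LinearSVC. The sketch below, with made-up toy data, shows how such a multiclass linear SVM could be instantiated; it is an illustration, not the thesis's exact configuration.

```python
from sklearn.svm import LinearSVC

# Three linearly separable toy classes in two dimensions.
X_train = [[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9], [4.0, 0.0], [3.9, 0.2]]
y_train = [0, 0, 1, 1, 2, 2]

# multi_class="crammer_singer" solves one joint multiclass optimisation
# instead of decomposing the problem into one-vs-rest binary problems.
svm = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000)
svm.fit(X_train, y_train)

print(svm.predict([[2.05, 2.0], [0.1, 0.0]]))
```

The one-vs-rest decomposition ("one-against-all") is LinearSVC's default; switching the `multi_class` parameter is all that distinguishes the two strategies at the API level.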

Decision Tree


Figure 1.4: A DT for solving the multiclass classification problem of determining whether an animal is a bird, a dog, or a cat

When building the DT, the training data is repeatedly split on a given feature until a subset of the training data consists of only a single class, thus forming a homogeneous subset. An important function when building this tree is to continuously find the best split of a subset of the learning set, thus finding the next optimal feature to split on. To determine the best split, the DT can use a few different functions, for instance minority class, Gini index, or entropy. To classify a new instance, the features are evaluated by iterating the tree from the root to the appropriate leaf node [15, p.133 – 138]. Figure 1.4 illustrates a DT for a simple multiclass classification problem.

Random Forest

A model ensemble consisting of DTs is called an RF. A model ensemble combines a series of classifiers into a single classifier to increase performance [18, p.180 – 184]. RF is a supervised machine learning algorithm that can be used for both regression and classification and builds upon the method of ensemble learning, meaning that classification is done using a voting process [15, p.332] between multiple classifiers. The classifiers are decision trees that are slightly different from one another. The difference comes from the way they are trained: each decision tree is trained with a random subset of the data and attributes. This reduces the risk of the model overfitting [19, p.592]. This process of building one ensemble from the different DTs is called bagging. The output of the bagging process is a series of models, whose outputs can be voted over to create a final classification. Figure 1.5 illustrates an RF model.
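As an illustrative sketch, a single DT and an RF ensemble can be built side by side with Scikit Learn. The feature encoding of the animal example from figures 1.4 and 1.5 is invented here ([has_feathers, barks]); it is an assumption for demonstration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding of the animal example: [has_feathers, barks],
# class 0 = bird, 1 = dog, 2 = cat.
X_train = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
y_train = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# A single tree: criterion selects the split function ("gini" or "entropy"),
# and splitting continues until the leaves are homogeneous.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)

# The forest bags n_estimators slightly different trees, each trained on a
# random subset of the data and attributes, and lets them vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(tree.predict([[1, 0], [0, 1], [0, 0]]))
print(forest.predict([[1, 0], [0, 1], [0, 0]]))
```

On clean, separable toy data both models agree; the ensemble's advantage shows up on noisy data, where the vote averages out the overfitting of individual trees.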

Artificial Neural Network


Figure 1.5: A random forest classifier consisting of n DT, solving the same multiclass classification problem as the DT from figure 1.4.

neurons, or so-called perceptrons [18, p.313 – 318]. A perceptron is a single classifier that solves linearly separable problems. As input, it takes values with weights applied to them. The values are then summarised as follows:

y_{sum} = \sum_{i=0}^{n} x_i w_i

The summarised value y_sum then goes through an activation function that determines what the output should be. A simple example of an activation function could simply check whether the result is larger than 0.5: if so, the perceptron forwards a one, else it forwards a zero. Figure 1.6 illustrates how a single perceptron functions.

Figure 1.6: Illustration of how a perceptron works. The perceptron has n inputs with dedicated weights. When the input parameters are given to the perceptron, it calculates the result by first summarising the values and then putting the summarised value through an activation function to determine the output.
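The weighted sum and step activation described above can be sketched in a few lines of Python; the weights and the 0.5 threshold follow the example in the text, and the function name is our own.

```python
def perceptron(inputs, weights, threshold=0.5):
    """A single perceptron: weighted sum of the inputs, then a step activation."""
    y_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step activation: forward a one if the sum exceeds the threshold.
    return 1 if y_sum > threshold else 0

print(perceptron([1, 1], [0.4, 0.3]))  # 0.4 + 0.3 = 0.7 > 0.5 -> 1
print(perceptron([1, 0], [0.4, 0.3]))  # 0.4 <= 0.5 -> 0
```

In an MLP, many such units are chained in layers, and the step function is typically replaced by a differentiable activation so that the weights can be trained with backpropagation.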


The weights are changed and optimised using backpropagation. Figure 1.7 illustrates the design of a neural network using MLP with two hidden layers.


Related Work

As Virvilis and Gritzalis conclude, the sophistication and resources used in APTs are increasing, posing a bigger cyber threat to today’s companies and organisations than ever before [47]. This becomes evident when analysing recent malware outbreaks, one of the most famous being Stuxnet, which set back Iran’s nuclear research and development by four years. Virvilis and Gritzalis state that today’s cybersecurity products, such as NIDS and anti-malware, lack the ability to detect APTs. More detection methods and further development of approaches for detecting APTs are needed in order to mitigate the increasing threat. Luckily, the field of APTs has become popular amongst cybersecurity researchers, and there is a multitude of papers that attempt to define the characteristics of the attack [38, 10, 9]. However, the area of using AD as a basis for detecting LM is untouched to our knowledge. The research related to this thesis covers APT detection and prevention, IDS research, other proposed methods for LM detection, work on using AD as a basis for anomaly detection or classification, and research on constructing a synthetic/semi-synthetic dataset similar to the one used in the experiment for this thesis.

2.1 Detection and Prevention Against APT

P. Chen et al. propose the following countermeasures for defending against APTs [10]: security awareness training, traditional defence mechanisms (IDS, SIEM, IPS, etc.), advanced malware protection, event anomaly detection, data loss prevention, and intelligence-driven defence. Moreover, the article provides general guidelines for detecting APTs in each of the stages of an APT.

N. Virvilis and D. Gritzalis conclude that there is no single product or technique to protect against APTs, and that it is important to account for the unique details of each environment [47]. However, the authors conclude that there are still a few guidelines for securing against APTs: patch management, strong network access controls and monitoring, Internet access policies, protocol-aware security solutions (to drop unauthorised encrypted sessions), DNS monitoring to detect unusual domains, honeypots and honeynets, and HIDS.

Conclusively, the research behind general APT protection revolves around maintaining a secure infrastructure rather than implementing specific techniques. However, there are a few approaches which propose systems that can be used to detect the different stages of an APT. For instance, G. Zhao et al. propose an IDS that detects the stage at which the malware attempts to contact a command and control server [48].

2.1.1 Lateral Movement Detection

In a research paper published by the Tomas Bata University in Zlín in 2013 [22], a honeynet, which is a series of cooperating honeypots, is used to gather information about an attacker that has already breached the network. The article proposes that with this approach, a network administrator could notice access to these honeypots and then use the information from them to further analyse the incident.

A master’s thesis published by the University of Twente in 2016, written by the student Ullah, presents an anomaly-based approach that analyses the server message block (SMB) protocol [44]. This approach identifies five anomalies that would indicate LM. The paper uses the machine learning algorithm k-nearest neighbour to define thresholds that separate normal and malicious network behaviour. The Fujitsu lab has developed a similar approach: a method for detecting LM that uses a remote access tool (RAT) to exploit the functionality of the SMB protocol.

A work published by master’s students at BTH proposed a method to detect APTs in all of their stages by analysing patterns in TCP sessions in network traffic. By using features such as destination port, bytes sent, average packet size, and packet intervals, they managed to build a machine learning model that detects LM with an accuracy of over 98% and a false-positive rate (FPR) below 2%. Their study showed that traffic size and traffic flow ratio were the most heavily weighted features for their model [8].

A white paper produced by CERT-EU presents a technical walk-through of what events to look for when detecting LM in a Windows environment [29]. The article targets detection of the infamous LM techniques Pass-The-Hash (PTH) and Pass-The-Ticket (PTT). The authors propose looking at different log events depending on whether the analysis is done on the attacked host or on the network’s domain controller (DC). When looking for LM on a host, the authors propose examining logs related to the event IDs 4624 and 4625; when looking for LM on a DC, logs associated with the event IDs 4768 and 4769.

2.2 Intrusion Detection Systems


models use the public datasets DARPA 1998 [28], DARPA 1999 [27], and KDD 1999 [20], which is based on the DARPA 1998 dataset. However, some approaches use other types of data (e.g. net flows, tcpdump, shell commands) that do not originate from a public dataset. Moreover, there is a revised version of the KDD 1999 dataset called NSL-KDD, released by M. Tavallaee, which identified and solved several flaws in the original dataset [39].

S. Mukkamala et al. demonstrated the capabilities of using SVM and neural networks as an IDS based on the KDD99 dataset [30]. K. A. Taher et al. performed a similar study on the NSL-KDD dataset, concluding that ANN can outperform SVM in classifying network traffic [37].

P. Alaei and F. Noorbehbahani presented a new approach called Network Anomaly Detection using Active Learning (NADAL), which they compare to an incremental Naive Bayes IDS using the NSL-KDD dataset. The article, which was presented in 2017, concluded that the proposed NADAL approach has several advantages over the incremental Naive Bayes approach, thus making it suitable for an IDS [3].

2.3 Active Directory based IDS

A relatively untouched field for detecting cybersecurity threats is utilising the AD DC log files. A study published in 2015 claims to be the first to develop this technique [21]. The article proposes a method where a Markov model is used to learn the behaviours of different users and can detect when users diverge from these patterns, thus forming an anomaly-based intrusion detection. Although not directly aimed at detecting LM, the project claims that it can detect APTs post-compromise with a recall of 66.6% and a precision of 99.0%. The article evaluates the model on a real dataset from an organisation in Taiwan with 95 employees, consisting of two months of DC logs. The high precision and relatively low recall entail that the approach struggles with false negatives.

2.4 Semi-Synthetic Datasets


2.5 Research Gap


Research Questions

The research questions for this thesis are designed to provide guidelines for a developer implementing machine learning algorithms into an IDS whose sole goal is classifying LM based on data from AD DC logs. Furthermore, RQ1 will provide guidelines for the preferred machine learning algorithm in different scenarios. An answer to RQ2 aids the developer in extracting the correct information from the log files. Both research questions are based on experiments conducted on a dataset consisting of AD DC log entries. The research questions are defined below:

RQ1: To what extent, given the machine learning metrics of accuracy, recall, precision, f1-score and computational cost, are the machine learning algorithms KNN, DT, RF, ANN and SVM capable of detecting lateral movement in the AD DC dataset?

There is a wide range of machine learning classifiers suited for multiclass classification problems. By answering RQ1, each algorithm stated in the RQ is compared and evaluated on the specified dataset, consisting of traces of LM inside DC log entries. The evaluation is based on the machine learning metrics stated in RQ1, where computational cost refers to the learning time for a given algorithm.

RQ2: To what extent do features extracted from the AD DC dataset affect the accuracy of the machine learning algorithms KNN, DT, RF, ANN and SVM?

By answering RQ2, the experiment will highlight the features of high importance and potentially provide a basis for which features could be ignored in future research. The experiment should also separate features from classifiers to produce results that can show whether specific features perform better on a subset of classifiers. Furthermore, this creates prioritisation recommendations for the features when implementing any of the stated classifiers for classifying LM in the given AD DC log dataset.


Method

This chapter describes the experiment designed to answer RQ1 and RQ2: the process of collecting a realistic dataset and labelling it with the appropriate classes, as well as the evaluation process for both the machine learning algorithms and the selected features. The process of collecting and processing data is described in section 4.1, the setup for the experiments is explained in section 4.2, and the statistical evaluation of the experiment is described in section 4.3. Moreover, the programming language Python 2.7 [45] is used together with the machine learning library Scikit Learn [33] to construct the experiment and its components. To clarify the structure of the experiment described in this chapter, it is considered a controlled experiment, where the independent variables are the parameters of the machine learning algorithms and the dependent variables are the metrics the algorithms are measured on.

4.1 Data Collection

Finding datasets containing APTs is challenging [44] due to several factors, such as the sensitive nature of the data and the rarity of APT attacks. Consequently, no open dataset consisting of AD logs with traces of LM was found. Therefore, a dataset had to be created to answer the research questions, and a semi-synthetic approach [35] is applied to do so. An anonymised log file originating from an AD in a live environment, provided by SecureLink, constitutes the base of the dataset and defines the “normal” data. To add malicious activities, synthetic logs from five different types of LM activities are created and inserted into the dataset. To finalise the dataset, pre-processing is applied to the data, enabling machine learning algorithms to learn and predict LM indications inside the dataset. The process of creating the dataset is broken down into four subsections. Section 4.1.1 covers the log file provided by SecureLink. Section 4.1.2 describes how each attack is analysed by looking into AD log files. Section 4.1.3 describes the process of creating the semi-synthetic log file, i.e. generating LM attacks and parsing them into the log file provided by SecureLink. Lastly, section 4.1.4 covers the pre-processing of the log file that results in the finalised dataset.

4.1.1 Definition of Normal Behaviour

The AD events that are considered “normal” or “non-malicious” originate from a log extraction of an AD DC in a real-life environment provided by SecureLink. The extracted logs cover all AD-connected endpoints between 08:21:04 and 09:22:53 on a Thursday. This log file constitutes the basis of the dataset. Sensitive data is replaced with hashed versions to protect confidentiality. See table 4.1 for information about the log file. The dataset is JSON formatted to facilitate log handling in Python.

Dataset Info    Value
Format          JSON
Log source      AD Domain Controller
Log type        Windows Security Event Log
Size            418 MB
Normal Events   1460671
Attack Events   0
Total Events    1460671
Attack Ratio    0.0
Timespan        01:01:49

Table 4.1: Information about the log file provided by SecureLink containing events related to normal behaviour.

An important aspect to consider when using logs from a live environment is the risk of the logs already containing events related to LM. One can never be entirely certain that the logs are pure. However, because the environment has been monitored and continuously maintained, the risk of contaminated logs is considered low.

4.1.2 Lateral Movement Analysis

The activities related to LM that have been investigated are Pass-The-Hash (PTH), Pass-The-Ticket (PTT), usage of the tool PsExec, and AD enumeration by both a user and an admin account. All activities are summarised in table 4.2, where the abbreviation, tool, and label related to each activity are presented for clarity.

Label   Attack                            Abbreviation      Tool
1       AD enumeration by admin account   AD enum (admin)   BloodHound
2       Tool breaking user policy         PsExec            PsExec64
3       Pass-The-Hash                     PTH               Mimikatz
4       Pass-The-Ticket                   PTT               Mimikatz
5       AD enumeration by user account    AD enum (user)    BloodHound

Table 4.2: The five indications of LM that are chosen to be investigated.

PTH and PTT are both direct lateral movement attacks that can be performed with the tool Mimikatz. The result of a successful PTH or PTT is a successful authentication without the need for the victim’s clear-text password.


The tool PsExec is an administrative tool used in Windows environments. It can be seen as a substitute for remote control programs such as Telnet and SSH. With a single command, it is possible to execute processes and spawn command prompts on remote systems.

AD enumeration is a reconnaissance method used for discovering what an AD domain contains, i.e. users, administrative accounts, and workstations. An AD enumeration will also reveal the relations between the objects, i.e. which workstations a certain user can log in to. In this thesis, AD enumeration is performed with the tool BloodHound.

For an elaborate walk-through of how the activities are performed and how they function, see appendix section A.3.

Justification of Lateral Movement Activity Selection

The motivation behind the choice of the actions PTT and PTH is that they are considered evident examples of LM [29, 31] and are therefore sufficient candidates for investigation. The usage of PsExec is motivated by the fact that it is a tool present in free-to-use and infamous penetration testing frameworks such as Metasploit [23, p.84], meaning it is widely used in the context of LM. An unauthorised sign-in with PsExec could therefore be a sufficient indication of LM [24, p.43 – 50]. Lastly, AD enumeration is chosen because it is an indication that an LM is about to happen or has already happened. Before breaching further into the network, the intruder needs to gather intelligence about which accounts and machines are possible targets. To do this, an enumeration of the domain’s AD will give the intelligence needed [34, p.47]. Therefore, an unauthorised enumeration of an AD can be seen as a possible indication of a past or future LM.

Lab Environment

Each action described in table 4.2 is performed and analysed using a small lab environment. The lab has a network infrastructure consisting of two virtual local area networks (VLAN). A log management server running Splunk on CentOS 7 is used to collect the logs from an AD DC running on Windows Server 2016. Two Windows 10 workstations are connected to the DC, thus ensuring that the AD logs generated by the workstations are sent to the Splunk server. The workstations and DC are located on a separate VLAN from the Splunk server to keep them isolated. The software is chosen specifically for being up-to-date as a way to keep the system realistic. Figure 4.1 presents the network topology of the lab environment used for the attack analysis.

2 PsExec. https://docs.microsoft.com/en-us/sysinternals/downloads/psexec (Accessed on 03/19/2019).
3 BloodHound. https://github.com/BloodHoundAD/Bloodhound/wiki (Accessed on 03/13/2019).


Figure 4.1: Network topology of the lab environment.

Isolating LM Traces in AD Logs

Execution of the LM activities described in table 4.2 produces AD log entries when performed in the lab environment. Each attack is performed from one workstation to the other. By using the analytical capabilities of Splunk, it is possible to isolate each activity by its associated AD events: with a Splunk query based on an LM activity’s time of execution and information about the involved attacker and victim, the related events can be isolated. Once an activity has been successfully isolated, it is extracted from Splunk in JSON file format. This process is repeated for every LM activity, resulting in five JSON files, each containing the AD logs from one LM activity. By doing this, the appearance of each LM action inside the AD log is understood, and the synthetic logs can be generated. Appendix A.4 displays the log traces of the five JSON files defining each LM activity.

4.1.3 Simulating Lateral Movement

A script which utilises the tool Eventgen is used to integrate the attacks into the clean JSON log file. Since the tool can replay log sequences from files, it can replay full attacks based on the LM analysis in section 4.1.2. The LM attacks are replayed with altered timestamps, account names, workstations, and IP addresses to fit into the clean log file. The generated logs are then parsed into the log file based on the timestamp. Furthermore, the script performs the following actions to add the malicious behaviour into the “normal” data:


1. Extract users, workstations, and IP addresses from the “clean” log file. These will be used as a pool when generating LM-related events.

2. Extract the timespan of the “clean” log file.

3. Generate attacks with a random account, IP, and workstation from the pool. The timestamp of each attack is randomised within the extracted timespan.

4. Label the generated attacks according to table 4.2.

5. All the generated attacks are parsed into the log file according to their timestamps.

After executing the script, a new JSON file containing both normal behaviour and LM is created, with each log entry labelled correctly. This file is the dataset before pre-processing. Since the lateral movement phase tends to be performed stealthily, as explained in section 1.3.1, the proportion of lateral movement indications in the log entries is assumed to be very low. However, since the algorithms need to see lateral movement in order to learn to detect it, the dataset needs to contain a substantial number of lateral movement indicators. To resolve this, the dataset is generated with 1% of events related to the indication of LM and 99% related to normal network usage, keeping the dataset as realistic as possible while containing a sufficient amount of LM-related events. It is considered realistic because the distance between the LM activities is great enough to make them independent from one another after the pre-processing described in section 4.1.4. The five activities are randomly distributed over the 1% allocated for events related to the indication of LM. See table 4.3 for information about the generated file.

Dataset Info    Value
Format          JSON
Log source      AD Domain Controller
Log type        Windows Security Event Log
Size            423 MB
Normal Events   1460671
Attack Events   14769
Total Events    1475460
Attack Ratio    0.01
Timespan        01:01:49

Table 4.3: Information about the generated log file.
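Step 5 of the script, parsing the generated attacks into the log file by timestamp, could be sketched as follows. The event structure mirrors the JSON in figure 4.3, and the function name is our own; the thesis's actual Eventgen-based script is not reproduced here.

```python
def merge_by_timestamp(clean_events, attack_events):
    """Insert generated attack events into the clean log, keeping the
    combined list ordered by each event's timestamp."""
    merged = clean_events + attack_events
    # ISO 8601 timestamps sort correctly as plain strings.
    merged.sort(key=lambda event: event["result"]["_time"])
    return merged

clean = [{"result": {"_time": "2019-03-05T08:21:04"}},
         {"result": {"_time": "2019-03-05T09:22:53"}}]
attack = [{"result": {"_time": "2019-03-05T08:46:43"}, "label": 3}]

merged = merge_by_timestamp(clean, attack)
print([e["result"]["_time"] for e in merged])
```

Because the attack events carry their labels (`"label"` in the sketch), the merged file directly yields the labelled dataset described above.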

4.1.4 Pre-processing


Figure 4.2: Illustration of the sliding window. The red event triggers the creation of a new dataset entry. The blue events correlate with the red event based on IP, workstation, or account. Moreover, the information from these four events is used to build the features for this dataset entry.

in AD logs seen in appendix A.4, to ensure that each attack triggers at least one of these log entries. Which log entries end up inside the sliding window is determined by how they relate to the event that triggered the criteria. Moreover, the selection is the n log entries that can be tied to the same account, the same workstation, or the same IP address. The number of logs, n, is set as described in section 4.1.5. Figure 4.2 illustrates the pre-processing stage using n = 3. The information from the sliding window is used to calculate the values of the features x0, x1, . . . , x19, which are presented in table 4.4. The Event ID descriptions are based on the Microsoft documentation.

Features x0, x1, and x2 show the percentage of log entries inside the sliding window that are related to the triggered log entry. The features x3 to x15 show the percentage of log entries with a given Event ID inside the sliding window. Lastly, x16 to x19 display different attributes of the log event that triggered the entry.

Each feature has a value based on its type: “ratio” implies that the feature is a percentage represented as a number between 0 and 1, while “value” is a translation from a string to an integer according to appendix A.1. Since the labels are set in the JSON data, these are appended to each entry in the dataset as y1 and y2, where the former can be used for binary classification and the latter for multiclass classification, i.e. y1 can only take the values 0 or 1 depending on whether the entry entails malicious activity, while y2 ranges from 0 to 5 depending on which activity the entry represents. Conclusively, the format of each entry in the dataset is:

x0 x1 . . . x19 y1 y2
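A sketch of how the ratio features x0–x15 could be derived from a trigger event and its sliding window is shown below. The field names (`account`, `workstation`, `ip`, `event_id`) are simplified stand-ins for the hashed log attributes, and the function is illustrative, not the thesis's actual pre-processing code.

```python
# Event IDs in the order of features x3..x15 from table 4.4.
EVENT_IDS = ["4624", "4625", "4627", "4658", "4661", "4768", "4769",
             "4672", "4776", "4799", "5140", "5145", "5158"]

def build_features(trigger, window):
    """Compute the ratio features x0..x15 for one dataset entry; the value
    features x16..x19 are taken directly from the trigger event itself."""
    n = len(window)
    features = []
    # x0..x2: share of window events sharing account / workstation / IP.
    for key in ("account", "workstation", "ip"):
        features.append(sum(e[key] == trigger[key] for e in window) / n if n else 0.0)
    # x3..x15: share of window events carrying each Event ID.
    for event_id in EVENT_IDS:
        features.append(sum(e["event_id"] == event_id for e in window) / n if n else 0.0)
    return features

trigger = {"account": "U2", "workstation": "W2", "ip": "X1", "event_id": "4769"}
window = [{"account": "U2", "workstation": "W2", "ip": "X1", "event_id": "4768"},
          {"account": "U2", "workstation": "W2", "ip": "X1", "event_id": "4769"}]
x = build_features(trigger, window)
```

With an empty window (as for the first event in a log), all ratios fall back to 0, matching the first row of the parsing example in figure 4.4.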



Name Type Feature Description

x0 Ratio Account % of sliding window with same account

x1 Ratio Workstation % of sliding window with same workstation

x2 Ratio IP % of sliding window with same IP

x3 Ratio Event ID 4624 Successful logon

x4 Ratio Event ID 4625 Failed Logon

x5 Ratio Event ID 4627 Group Membership

x6 Ratio Event ID 4658 The handle to an object was closed

x7 Ratio Event ID 4661 The handle to an object was requested

x8 Ratio Event ID 4768 Kerberos Authentication

x9 Ratio Event ID 4769 Kerberos Authentication - For services

x10 Ratio Event ID 4672 Assignment of Administrator Rights

x11 Ratio Event ID 4776 Credential validation (NTLM)

x12 Ratio Event ID 4799 Group membership enumeration

x13 Ratio Event ID 5140 A network share object was accessed

x14 Ratio Event ID 5145 A network share object request

x15 Ratio Event ID 5158 WFP has permitted a bind to a local port

x16 Value Ticket Options Triggered event: ticket options

x17 Value Service Options Triggered event: service options

x18 Value Keyword Triggered event: keywords

x19 Value Error Code Triggered event: Error code


Feature Decisions

The overall goal when selecting these features was to choose as many relevant features as possible for a sliding window of any given n. The features in the first group (x0, x1, and x2) are selected since these are the only ways the different events in the sliding window are correlated; hence, no further (obvious) features could be extracted. In the second group (x3 to x15), the features are chosen based on their Event ID: each ID present in any event related to the LM activities seen in appendix A.4 is included. The features in group three (x16 to x19) are picked from the information present in the trigger event; every relevant value that could be extracted from these events was chosen, and no information was excluded from the triggered event.

Parsing Example

To process a small JSON dataset consisting of three AD log entries generated through a Pass-The-Hash attack, as demonstrated in figure 4.3, the pre-processing would trigger three times, as seen in figure 4.4. In the first line of that figure, all Event ID ratios are 0, since there are no previous events to base an enrichment on. For the last entry, 50% of all related events have ID 4768 and 50% have ID 4769, and the ticket option is 0x60810010. Since all events stem from an attack, each is labelled as y1 = 1 and y2 = 3. Note that this parsing example would only be accurate when n ≥ 3.

1 {"preview":false, "result":{"_time":"2019-03-05T14:46:43.000+0100", "EventCode":"4768",

"Keywords":"Audit Success", "Account_Name_Hash":"U2U2U2U2", "src_Hash":"XX1.XX1.XX1.XX1",

"Ticket_Options":"0x40810010", "Service_Name_Hash":"krbtgt", "EventCodeDescription":"A Kerberos authentication ticket (TGT) was requested"}, "classification":1, "label":3}


2 {"preview":false, "result":{"_time":"2019-03-05T14:46:43.000+0100", "EventCode":"4769",

"Failure_Code":"0x0", "Keywords":"Audit Success", "Account_Name_Hash":"U2U2U2U2",

"src_Hash":"XX1.XX1.XX1.XX1", "Ticket_Options":"0x40810000",

"Service_Name_Hash":"W2W2W2W2$", "EventCodeDescription":"A Kerberos service ticket was requested"}, "classification":1, "label":3}


3 {"preview":false, "result":{"_time":"2019-03-05T14:46:43.000+0100", "EventCode":"4769",

"Failure_Code":"0x0", "Keywords":"Audit Success", "Account_Name_Hash":"U2U2U2U2",

"src_Hash":"XX1.XX1.XX1.XX1", "Ticket_Options":"0x60810010", "Service_Name_Hash":"krbtgt",

"EventCodeDescription":"A Kerberos service ticket was requested"}, "classification":1,

"label":3}


Figure 4.3: The PTH attack template, used as dataset example as input for the parsing process.

1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 3
2 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 1 2 1 3

3 1 0.5 0 0 0 0 0 0 0 0.5 0.5 0 0 0 0 0 3 0 1 2 1 3


4.1.5 Choosing the Optimal Search Parameter

The search parameter, n, defines the number of log entries that are correlated and analysed together with the triggered event, as explained in section 4.1.4. A test is conducted to determine which n is optimal for the predictive performance of the models. This is done by generating 50 small datasets (27669 entries each) where n ranges from 1 to 50: dataset 1 is pre-processed with n = 1, dataset 2 with n = 2, and so on. All five classifiers are trained and tested on each dataset, and the macro average of the f1-score is evaluated and plotted on a graph to locate the optimal n. The test concludes that the search parameter n = 3 is optimal and is therefore used in the experiments. Figure 4.5 illustrates the test; the filled blue line is the average performance of the five algorithms.

Figure 4.5: Result of the test to conclude where the search parameter n is optimal for f1-score. The x-axis is n, and the y-axis is the f1-score. The graph concludes that n = 3 is the overall optimal search parameter.
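The sweep over the search parameter n can be sketched as below. Dataset loading is stubbed out with stand-in data, and a single RF substitutes for the five classifiers used in the thesis; function and variable names are our own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def f1_per_window_size(datasets, cv=5):
    """datasets maps each search parameter n to a pre-processed (X, y) pair;
    returns the cross-validated macro f1-score for every n."""
    scores = {}
    for n, (X, y) in datasets.items():
        model = RandomForestClassifier(random_state=0)
        scores[n] = cross_val_score(model, X, y, scoring="f1_macro", cv=cv).mean()
    return scores

# Stand-in data: two well-separated classes, so the score should be high.
X = [[float(i)] for i in range(10)] + [[float(i) + 100.0] for i in range(10)]
y = [0] * 10 + [1] * 10
print(f1_per_window_size({1: (X, y)}))
```

Plotting `scores` over n, as in figure 4.5, then reveals where the macro f1-score peaks.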

4.1.6 Dataset


Label Activity Entries Ratio Attacks Executed

0 Normal usage 369049 0.98575 N/A

1 AD enumeration by admin account 2380 0.0063 170

2 Tool breaking user policy 825 0.0022 165

3 Pass-The-Hash 528 0.0014 176

4 Pass-The-Ticket 410 0.0011 205

5 AD enumeration by user account 1190 0.0031 170

Table 4.5: Class distribution in the finalised dataset. Note the difference between “Entries” and “Attacks Executed”. Entries specifies how many log entries are created when the attacks have been executed.

4.2 Experiment Setup

This section outlines the setup of the experiment designed to evaluate how different machine learning algorithms tackle the multiclass classification problem present in the dataset from section 4.1.6. The experiment setup is divided into three subsections. Section 4.2.1 covers the implementations of the five classifiers. The metrics the classifiers are evaluated on are described in section 4.2.2. Lastly, the process of repeated stratified 10-fold cross-validation is covered in section 4.2.3.

4.2.1 Machine Learning Algorithms

Each classifier is implemented using the machine learning toolkit Scikit Learn [33] with its default parameters, as specified in appendix A.2. As the research questions state, the machine learning algorithms KNN, DT, RF, ANN and SVM are the classifiers used in this experiment. The main reason behind the choice of algorithms is that the machine learning approaches are diverse, as described in the article “Survey on Multiclass Classification Methods” [4]. The RF classifier is chosen as it is an ensemble approach [19, p.592] which combines multiple DT classifiers into one ensemble. By evaluating diverse algorithms, this thesis gives a pointer for further research on which of the five supervised machine learning approaches is best suited for the given problem.

The classifiers from each machine learning algorithm are evaluated using two approaches: one where the overall performance of each algorithm is measured, and one measuring how well each individual attack is classified by each algorithm. Despite the multiclass nature of this experiment, the former approach can be compared to a binary approach.
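A sketch of the comparison loop, using repeated stratified cross-validation and macro-averaged metrics, is shown below. The classifier set is truncated to two entries for brevity, and all names are illustrative rather than the thesis's actual code.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Default-parameter classifiers, as in the thesis (set truncated here).
CLASSIFIERS = {"KNN": KNeighborsClassifier(),
               "DT": DecisionTreeClassifier(random_state=0)}

def evaluate(classifiers, X, y, n_splits=10, n_repeats=10):
    """Mean accuracy / macro recall / macro precision / macro f1 per classifier."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=0)
    metrics = ["accuracy", "recall_macro", "precision_macro", "f1_macro"]
    results = {}
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=cv, scoring=metrics)
        results[name] = {m: scores["test_" + m].mean() for m in metrics}
    return results
```

The `fit_time` array that `cross_validate` also returns can serve as the computational-cost (learning time) measure mentioned in RQ1.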

4.2.2 Metrics


All metrics used in this experiment are derived from the confusion matrix [15, p. 53–54]. This is done by calculating the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) instances for the classification. Once this is done, it is possible to create a wide range of metrics to evaluate the models created by the algorithms. The following parts of this subsection give an overview of the metrics used in the controlled experiment, all of which are based on the confusion matrix. The metrics have been chosen because they are considered to give a good overview of a classifier’s performance when evaluating a multiclass classification problem [36].

Accuracy

Accuracy is a metric used to assess how often the machine learning model correctly classifies an instance [15, p. 57].

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Recall

Recall, sometimes going under the name sensitivity or true positive rate, assesses how many of the actual positive instances the model correctly classifies [15, p. 57].

Recall = TP / (TP + FN)

Precision

The precision metric measures how often the positive classifications of the model are correct [15, p. 57].

Precision = TP / (TP + FP)

F1-Score

By combining recall and precision, the F1-score provides an overview of how well a model classifies instances, taking both FP and FN into account. The F1-score is defined as the harmonic mean of recall and precision [15, p. 99].

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Multi-classification Metrics


Macro-averaged metrics are used in this experiment because the classes related to LM constitute only 1.4% of the dataset. To calculate the macro averages, the equation

Recall_M = ( Σ_{i=1}^{k} TP_i / (TP_i + FN_i) ) / k   and   Precision_M = ( Σ_{i=1}^{k} TP_i / (TP_i + FP_i) ) / k,

is used, where k represents the number of classes. The macro F1-score is then calculated as the harmonic mean of Recall_M and Precision_M.

If the macro average were not used, the performance measures would mostly reflect the classifiers’ ability to detect normal behaviour because of the large imbalance of the classes. By using the macro average, all classes are weighted the same, highlighting the performance of detecting the different activities related to LM as much as detecting normal activities.
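A minimal sketch of the macro averaging described above, using made-up per-class TP/FP/FN counts (these are illustrative values, not results from the thesis dataset):

```python
# Sketch: macro-averaged recall, precision and F1 over k classes,
# computed from made-up per-class (TP, FP, FN) counts.
per_class = [
    (900, 30, 20),  # majority class, e.g. normal behaviour
    (40, 5, 10),    # minority class, e.g. one LM activity
    (25, 8, 5),     # another minority class
]
k = len(per_class)

# Unweighted (macro) mean of the per-class recall and precision values:
recall_macro = sum(tp / (tp + fn) for tp, _, fn in per_class) / k
precision_macro = sum(tp / (tp + fp) for tp, fp, _ in per_class) / k
# Macro F1 as the harmonic mean of the two macro averages:
f1_macro = 2 * precision_macro * recall_macro / (precision_macro + recall_macro)

print(round(recall_macro, 3), round(precision_macro, 3), round(f1_macro, 3))
```

Because each class contributes one term regardless of its size, the minority LM classes influence the macro scores as much as the majority class does.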

Computational Cost

To evaluate how resource demanding a specific machine learning algorithm is during the learning phase, the computational cost is measured. This is done by comparing the system time before and after the learning process for any given algorithm using the time module for Python8.

4.2.3 Cross-validation

To evaluate the machine learning algorithms, stratified k-fold cross-validation is used with k set to 10, as is conventionally applied [15, p. 349] (as long as each fold has more than 30 instances). The main idea of cross-validation is to evaluate how well a machine learning algorithm performs on unseen data. This is done by randomly splitting the dataset into k folds. The classifiers are then trained on k − 1 of the folds and evaluated on the excluded fold. This is repeated for all the folds. The final result is the mean over the ten folds [25]. In this experiment, the stratified k-fold cross-validation is run ten times and the average for each fold is presented (this method is also referred to as repeated cross-validation). Each fold is made to distribute the samples for each class evenly, so that the class ratio from the original dataset is preserved. The performance metrics accuracy, recall, precision and F1-score are calculated [15, p. 99] for each of the folds. When splitting the dataset into folds, the function RepeatedStratifiedKFold() from the Python library Scikit-learn [33] is used with the random seed set to 42.
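This procedure can be sketched on a small synthetic stand-in dataset (the real features, classes, and classifier choice differ; only the cross-validation wiring is the point here):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the real dataset: 40 samples, two balanced classes.
rng = np.random.RandomState(0)
X = rng.rand(40, 5)
y = np.array([0] * 20 + [1] * 20)

# Stratified 10-fold CV repeated 10 times, seeded as in the experiment.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_validate(
    DecisionTreeClassifier(), X, y, cv=cv,
    scoring=["accuracy", "recall_macro", "precision_macro", "f1_macro"],
)
print(len(scores["test_accuracy"]))  # 100 train/evaluate rounds in total
```

Stratification keeps the class ratio of the full dataset in every fold, which matters here given how rare the LM classes are.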

The output of the cross-validation can be divided into two parts. First, the overall performance for each algorithm is presented using the average metrics. The

8 Python time module. https://docs.python.org/3/library/time.html. (Accessed on


latter part is the multiclass evaluation, where the individual performance for each algorithm and LM activity is evaluated.

4.3 Evaluation and Analysis

This section presents the statistical evaluation methods, the Friedman and Nemenyi tests, used to compare the classifiers. The choice of statistical test is based on a study [13] that concludes that the Friedman test is to be preferred when comparing multiple classifiers.

4.3.1 Friedman Test

To answer RQ1, thus finding the algorithm that is best suited for detecting LM, a Friedman test is used to determine whether there is a significant difference in performance between them. The test is conducted with the following null and alternative hypothesis:

H0: There is no difference in the performance between the classifiers DT, SVM, KNN, RF and ANN when detecting LM in the dataset.

H1: There is a difference in the performance between the classifiers DT, SVM, KNN, RF and ANN when detecting LM in the dataset.

The performance in the hypothesis is defined as the metric that is currently under investigation (either accuracy, F1-score, precision, recall or computational cost). The Friedman test can reject H0, thus supporting H1, if the Friedman statistic, χ²_F, is higher than the critical value. Given a table where each column represents a classifier and each row a dataset (or fold for cross-validation), containing the ranks for each classifier in

1, 2, ..., k,

let R_ij denote the rank of the j-th classifier on the i-th dataset (or fold). To calculate the Friedman statistic, the following three quantities are calculated [15, p. 355–377]:

1. The average rank R_avg = (k + 1)/2, where k is the number of classifiers.

2. The first sum of squared differences: n Σ_j (R_j − R_avg)², where R_j is the average rank of the j-th classifier and n is the number of datasets (or folds).

3. The second sum of squared differences: (1 / (n(k − 1))) Σ_{ij} (R_ij − R_avg)².

The Friedman statistic is then defined as the ratio between the first and second sum of squared differences. To test the hypotheses, the Friedman statistic is compared to the critical value, which is retrieved with α = 0.05 and degrees of freedom DF = k = 5, where k equals the number of classifiers being compared [41]. By looking at the chi-squared distribution table9, the critical value is set to 11.07.
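The three steps can be sketched in plain Python on a hypothetical rank table (the rank values below are made up for illustration, not results from the experiment):

```python
# Sketch: Friedman statistic from a rank table.
# Rows = folds, columns = classifiers; rank values are made up.
ranks = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 4, 5],
    [2, 1, 3, 4, 5],
    [1, 2, 4, 3, 5],
]
n, k = len(ranks), len(ranks[0])

r_avg = (k + 1) / 2                                   # step 1: average rank
col_means = [sum(row[j] for row in ranks) / n for j in range(k)]
first = n * sum((m - r_avg) ** 2 for m in col_means)  # step 2: first sum
second = sum((r - r_avg) ** 2
             for row in ranks for r in row) / (n * (k - 1))  # step 3: second sum

chi2_f = first / second
print(chi2_f, chi2_f > 11.07)  # reject H0 if above the critical value
```

For untied ranks this ratio reduces to the usual χ² form of the Friedman statistic, since the second sum then equals k(k + 1)/12.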

9 Chi-square distribution. https://www.itl.nist.gov/div898/handbook/eda/section3/eda3674.htm.
