
DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Machine Learning in Defensive IT Security: Early Detection of Novel Threats

REN DUAN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Machine Learning in Defensive IT Security: Early Detection of Novel Threats

REN DUAN

Master in Information and Network Engineering
Date: October 4, 2019

Supervisors: Nicolas Innocenti (Venor AB), Henrik Åmark (Venor AB), Philip Törner (Venor AB), Saikat Chatterjee (KTH)

Examiner: Saikat Chatterjee

School of Electrical Engineering and Computer Science
Host company: Venor AB

Swedish title: Maskininlärning för defensiv IT-säkerhet: Skyndsam upptäckt av tidigare okända hot


Abstract

The rapid development of technology has led to a rise in cybercrime, making cybersecurity of unprecedented significance, especially for businesses. Defensive and forensic IT security is a rather niche field within IT security, but it is surely going to grow. It focuses on preventing attacks through good design standards and the education of people. The typical reaction time to a computer attack currently lies in the order of hours, because this field still relies on intensive manual work by skilled experts. In this thesis, we combine defensive IT security with the most flourishing fields of the present time: Artificial Intelligence and Machine Learning. We investigate the possibility of using Machine Learning to filter out the obviously normal data and focus the attention of the experts on the important cases where experience really matters. The nature of this problem is anomaly detection; therefore, we select and test several algorithms that perform well at detecting anomalies, including Term Frequency-Inverse Document Frequency, K-Means, K-Nearest Neighbours, Isolation Forest, and Autoencoders, and apply them to the Http (KDDCUP99) dataset and our own network connection dataset collected using Carbon Black Response, an industry-leading incident response and threat hunting solution. The results show that Isolation Forest and K-Nearest Neighbours are the best traditional Machine Learning methods for the two datasets respectively, while Autoencoders, as a deep learning method, did quite well in differentiating normal and malicious events on both datasets.


Sammanfattning

The rapid and ever-increasing development of technology has led to a rise in IT-related crime, where companies and organizations are often affected with nearly unpredictable consequences. Defensive IT security and forensics focuses on detecting, stopping and mitigating attacks through various techniques, education and design. Even though organizations today often spend large parts of their budget on defensive security, the time it takes to react to attacks and intrusions is still usually measured at least in hours, since the work involves large amounts of manual effort by the field's experts. Larger incidents can take weeks or months to investigate. In this work, defensive IT security is combined with some of the most discussed fields of today: artificial intelligence and machine learning. We investigate the possibility of using these techniques to filter out the obviously normal data and focus on what is deviating and essential, so that the field's experts can spend their time where it is really needed. The core of the problem lies in detecting deviations, so the work is based on evaluating different anomaly detection algorithms and comparing how they perform against each other. We use techniques such as Term Frequency-Inverse Document Frequency, K-Means, K-Nearest Neighbours, Isolation Forest and Autoencoders on two different datasets. The first dataset is based on HTTP traffic (KDDCUP99), while the second is built from data collected from real clients via a tool called Carbon Black Response, a leading tool for performing large-scale investigations and hunting for attackers. The results show that Isolation Forest and K-Nearest Neighbours perform best for the respective datasets, and that Autoencoders, a deep learning method, achieve good results in identifying malicious activities for both datasets.


Acknowledgement

I would like to thank Venor AB for giving me the opportunity to do such an interesting project as my master thesis; I feel honored to have worked with such a good team. I would also like to thank my industry supervisors, Nicolas Innocenti, Henrik Åmark and Philip Törner, for offering all the guidance and support that I needed throughout the whole process. Thanks to my supervisor and examiner at KTH, Saikat Chatterjee, who helped me with the project and provided useful comments on the thesis report. Finally, I want to express my thanks to my family and friends, and also to thank myself for the hard work.


Contents

1 Introduction
1.1 Motivation
1.2 Technical Highlight
1.3 Related Work
1.4 Thesis Organization

2 Background
2.1 PCA
2.2 TFIDF
2.3 K-Means
2.4 KNN
2.5 iForest
2.6 Autoencoders
2.7 Model Evaluation Metrics
2.7.1 Confusion Matrix
2.7.2 Accuracy
2.7.3 Precision
2.7.4 Recall
2.7.5 F1 Score
2.7.6 AUC-ROC Curve

3 Datasets
3.1 Http (KDDCUP99) Dataset
3.2 Network Connection Dataset
3.2.1 Introduction to CB Response
3.2.2 Dataset Information
3.2.3 Data Preprocessing
3.2.4 Malware Data Collection and Insertion
3.2.5 Experimental Set-up
3.2.6 Privacy Protection of Personal Data
3.3 Comparison between Http (KDDCUP99) Dataset and Network Connection Dataset

4 Methods, Results and Analysis
4.1 Http (KDDCUP99) Dataset
4.1.1 PCA
4.1.2 K-Means
4.1.3 KNN
4.1.4 iForest
4.1.5 Autoencoders
4.2 Network Connection Dataset
4.2.1 PCA
4.2.2 TFIDF
4.2.3 K-Means
4.2.4 KNN
4.2.5 iForest
4.2.6 Autoencoders

5 Conclusion and Future Work

Bibliography

Chapter 1 Introduction

This chapter first gives an overview of the field and of how this work differs from existing techniques and methods, then discusses related work and the organization of the thesis.

1.1 Motivation

One of the biggest problems large corporations and organizations face today is cyberattacks and threats [1]. Usually, by disrupting the victim's network, the attacker looks for some type of benefit, such as confidential information, because such information can be sold at a good price or used to blackmail the victim. According to statistics, 53 percent of cyberattacks resulted in losses of half a million dollars or more.

When a cyberattack occurs, people always want to know what happened. Unfortunately, this is a difficult question to answer in a few words, not only because of the frequent appearance of new types of attack but also because attackers are good at modifying, hiding, deleting, cloaking and destroying the information that is useful in the eyes of cybersecurity experts. Computer forensics is a branch of digital forensic science concerned with evidence found in computers and digital storage media [2][3]. Its main aim is to present the criminal trail that attackers leave in a computer to court as valid litigation evidence [2][3]. Computer forensics is of great importance to a business or a corporation [1]. In view of the complexity of today's hacking methods, the belief that firewalls and routers alone can strengthen defenses enough to avoid any kind of cyberattack is mistaken. Although these solutions can to some extent provide relevant information during an attack, they lack the ability to dig deeper and tell exactly what happened [1]. Defensive and forensic IT security is a rather niche field within IT security that is surely going to grow in the years to come; deploying security mechanisms that can provide this kind of specific information should be on the agenda of businesses and corporations.

Depending on the severity of the attack, it may take a long time to determine how it happened and then take action. The typical time to react to a computer attack in large organizations currently lies in the order of hours, with a few minutes for best-in-class cases. Furthermore, if a forensic analysis is performed to investigate in detail what actually happened, it can take weeks or even months. Therefore, analyses are seldom done properly due to these time constraints. The reason for this poor situation is that the field still relies on labor-intensive ways of working, often requiring skilled experts to manually browse through a vast amount of forensic data to discover suspicious behaviors.

Artificial intelligence and machine learning have attracted more and more attention since the end of the 20th century and have been widely used and developing at a high pace since the beginning of the 21st century. With the ability to learn automatically from data and experience, machine learning algorithms are routinely used in a wide variety of applications, for example image recognition, speech recognition, natural language processing and medical diagnosis [4]. However, machine learning is not used as widely in network security as in other fields; there is no commonly used machine learning method that can handle general problems in defensive network security, because network security problems are complex and unpredictable. Moreover, attackers can find loopholes in machine learning algorithms or systems and evade them, or even use machine learning to build more advanced malware. As such, a hybrid approach, where Machine Learning is used to help and complement the human expert, is expected to work best.

This thesis is a first attempt to bring a solution to this situation. More specifically, we want to experiment with applying machine learning in defensive and forensic IT security. By analyzing “evidence” data from computers with machine learning, we hope to develop mechanisms that filter out obviously normal events beforehand and help network security experts focus on the evidence that has a higher probability of being abnormal, which will significantly lessen their workload, shorten response times and reduce financial losses.


The basic idea is to proceed from simple to complicated. In more detail, we start with network connection logs, which contain information about how and when the processes on our computers build network connections. The reason we care about this information is that the internet is the most common entry point for malware. For example, when users download data, files or software from suspicious websites, malware can get access to their computers, hide somewhere and steal personal information without being noticed. Malware can also get into our computers when we click on suspicious links on untrusted websites or in emails from unknown sources. Network connection logs are therefore a really good entry point for detecting the activities of malware. Since the problem we want to study can be generalized into an anomaly detection problem, we can draw on many mature methods used in this area, from classical clustering and classification methods to deep learning methods. Based on our specific case, the methods studied include Principal Component Analysis (PCA), Term Frequency-Inverse Document Frequency (TFIDF), K-Means clustering, K-Nearest Neighbours (KNN), Isolation Forest (iForest) and one deep learning method, Autoencoders.

1.2 Technical Highlight

Up to now, the most common way of preventing computers from being attacked by malware is to use virus protection software. There are many good virus protection products on the market, such as Total AV, PC Protect and McAfee. This software works by identifying typical sequences of bits, called signatures, in applications and comparing them to signature databases built from previously identified viruses and malware. For example, if you double-click “svchost.exe” and it is a virus disguised as a safe process, the antivirus checks the resolved bits against its database of signatures; if there is a match, it prevents the process from executing [5]. However, since new malware pops up every day, we need to update our database of signatures very often; without updates [5], our antivirus software is unable to recognize new malware and we are at risk of installing it [5].

From this, we can see that the virus protection software installed on our computers does have the ability to stop known malware from attacking them, but does not possess the ability to detect malware never observed before. However, as mentioned previously, the time between the appearance of a new type of malware and the moment people understand what exactly happened can be rather long, and


this difference gives malware time and the chance to attack our computers.

Compared with these traditional virus protection tools, the machine learning-based methods focusing on malware behavior, of which we make a first investigation in this thesis, may be able not only to recognize known malware but also to detect malicious applications never seen before, by recognizing typical features, for example in network activity.

Our model can greatly lessen workload, reduce response time and cost, and improve everyone's overall security by recognizing novel malware automatically, which helps to promote productive employment [6] and build a sustainable business [7].

1.3 Related Work

There are only a few papers about anomaly detection on log data. Most other work comes from private or industry professionals and is published on blogs and web pages; where ideas from such sources are used, they are referenced in the text.

Aarish Grover, in his master thesis [8], explored anomaly detection in application log data, using both historical anomaly detection techniques and recent advances in neural networks, and proposed a hybrid model combining an LSTM neural network and an Autoencoder which improves upon existing techniques [8].

Swapneel Mehta, Prasanth Kothuri and Daniel Lanza Garcia in [9] leveraged a streaming architecture based on ELK, Spark and Hadoop to collect, store, and analyze database connection logs in near real-time. The proposed system investigates outliers using unsupervised learning, and they also propose an approach that can be extrapolated to a generalized system for analyzing connection logs across a large infrastructure comprising thousands of individual nodes and generating hundreds of log lines per second [9].

In [10], Jakub Breier and Jana Branišová proposed a method for anomaly detection in log files, based on data mining techniques for dynamic rule creation.

Pankaj Malhotra et al. [11] used stacked LSTM networks for anomaly/fault detection in time series; the efficacy of this approach is demonstrated on four datasets, covering ECG, space shuttle, power demand, and multi-sensor engine data.


1.4 Thesis Organization

The structure of the thesis is as follows: Chapter 2 summarizes the essential background knowledge used in the thesis work. Chapter 3 gives a detailed description of the datasets that we used and how we set up the platform to run data collection and experiments. In Chapter 4, we present the different methods we have tested and their results. Finally, Chapter 5 draws conclusions and discusses plans for future work.


Chapter 2 Background

In this chapter we provide a brief discussion of the essential background needed throughout the thesis, which includes Principal Component Analysis [12], term frequency-inverse document frequency [13], the traditional Machine Learning methods K-Means [14], K-Nearest Neighbors [15] and Isolation Forest [16], the Deep Learning method Autoencoders [17], and the performance metrics we use to evaluate the models.

2.1 PCA

PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information in the large set [18].

Reducing the number of variables of a dataset naturally comes at the expense of accuracy, but the trade-off buys simplicity [18]. Because smaller datasets are easier to explore and visualize, analyzing the data becomes much easier and faster for machine learning algorithms [18].

The first step of PCA is standardization [18][12]. This step standardizes the range of the continuous initial variables so that each of them contributes equally to the analysis; otherwise, dominant variables lead to biased results [18][12]. Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each variable:

z = (value − mean) / standard deviation    (2.1)


Step 2 is covariance matrix computation; the aim of this step is to see whether there is any relationship between the variables of the input dataset [18][12], because variables are sometimes so highly correlated that they contain redundant information [18][12]. Hence, in order to identify these correlations, we compute the covariance matrix, which is a p × p symmetric matrix [18][12]. For example, for a 3-dimensional dataset with three variables x, y and z, the covariance matrix is a 3 × 3 matrix of this form:

[ Cov(x, x)  Cov(x, y)  Cov(x, z)
  Cov(y, x)  Cov(y, y)  Cov(y, z)
  Cov(z, x)  Cov(z, y)  Cov(z, z) ]    (2.2)

The next step is computing the eigenvectors and eigenvalues of the covariance matrix to identify what are called the principal components [18][12].

These principal components are new variables constructed as linear combinations or mixtures of the initial variables. The new variables are uncorrelated, and most of the information within the initial variables is squeezed or compressed into the first components [18][12]. Organizing information in principal components this way allows us to reduce dimensionality without losing much information, because we can discard the components carrying little information [18][12]. Geometrically speaking, the principal components represent the directions of the data that explain a maximal amount of variance, that is, the lines that capture most of the information in the data [18][12].

The eigenvectors of the covariance matrix give the directions of the axes with the most variance, i.e. the principal components described above [18][12], while the eigenvalues give the amount of variance carried by each principal component. By ranking the eigenvectors according to their eigenvalues, from highest to lowest, we obtain the principal components in order of significance [18][12]. Finally, we can keep all the eigenvectors or discard those of lesser significance, depending on the application and the amount of dimensionality reduction desired [18][12].
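To make these steps concrete, the following is a minimal NumPy sketch of PCA via covariance eigendecomposition (standardize, compute the covariance matrix, eigendecompose, project). It is an illustrative sketch under our own assumptions about data and variable names, not the exact implementation used in the thesis experiments.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA following the steps above: standardize, covariance, eigendecomposition."""
    # Step 1: standardize each variable (subtract the mean, divide by the standard deviation)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: compute the p x p covariance matrix of the standardized variables
    cov = np.cov(Z, rowvar=False)
    # Step 3: eigenvectors/eigenvalues; eigh is used because the covariance matrix is symmetric
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]             # rank components by explained variance
    components = eigvecs[:, order[:n_components]]
    # Project the standardized data onto the leading principal components
    return Z @ components

# Toy example: reduce a random 5-dimensional dataset to 2 components
X = np.random.rand(100, 5)
print(pca(X).shape)  # (100, 2)
```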

2.2 TFIDF

TFIDF is a numerical statistic intended to measure how important a word is to a document in a collection or corpus [13]. It is a widely used technique in Information Retrieval and Text Mining [19]. The TFIDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general [19]. As one of the most popular term-weighting schemes today, TFIDF is used by 83% of digital libraries in their text-based recommender systems [20].

The TFIDF is the product of two statistics, term frequency and inverse document frequency. There are various ways of determining the exact values of both statistics; we calculate them as follows:

tf(t, d) = 1 + log(f_{t,d})    (2.3)

idf(t, D) = log( N / (1 + |{d ∈ D : t ∈ d}|) )    (2.4)

tfidf(t, d, D) = tf(t, d) · idf(t, D)    (2.5)

where f_{t,d} is the number of times that term t occurs in document d, N = |D| is the total number of documents in the corpus, and |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears [19].
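As a concrete illustration of Equations 2.3-2.5, here is a small Python sketch; the toy corpus of process names is hypothetical and only meant to show the calculation.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """tf-idf as in Equations 2.3-2.5; doc is a list of tokens, corpus a list of docs."""
    f_td = Counter(doc)[term]                               # times the term occurs in the document
    tf = 1 + math.log(f_td) if f_td > 0 else 0.0            # Eq. 2.3
    n_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_docs_with_term))    # Eq. 2.4
    return tf * idf                                          # Eq. 2.5

# Hypothetical corpus where each "document" is the process name of one event
corpus = [["svchost.exe"], ["chrome.exe"], ["chrome.exe"], ["svchost.exe"], ["chrome.exe"]]
print(tfidf("svchost.exe", corpus[0], corpus))
```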

2.3 K-Means

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms [21][14]. The goal of K-means is to group similar data points and discover underlying patterns [21][14]. To achieve this goal, K-means seeks a fixed number (k) of clusters in the dataset [21][14]. A cluster is a collection of data points aggregated together because of certain similarities, while k refers to the number of centroids you need in the dataset [21][14].

The K-means algorithm starts with k randomly selected centroids and then allocates each data point to the nearest cluster; after that, the k centroids are updated by averaging all the data points in every cluster. These calculations are repeated to optimize the positions of the centroids [21][14]. If the centroids stop changing their values, or the defined number of iterations has been reached, the algorithm stops optimizing the clusters [21][14].

K-means is easy to understand and is an extensively used technique for data cluster analysis [21][14]. However, its performance is not as competitive as that of


other, more sophisticated clustering algorithms, since slight variations in the data may lead to high variance; in other words, the algorithm is not stable.
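To illustrate how K-Means can be turned into a simple anomaly detector, the scikit-learn sketch below flags the points that lie farthest from their assigned centroid; the toy data and the percentile threshold are our own assumptions and not necessarily the exact procedure applied later in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs of "normal" points
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Distance of every point to its assigned centroid; large distances hint at anomalies
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(dist, 99)     # e.g. flag the most distant 1% as suspicious
print("flagged as suspicious:", int(np.sum(dist > threshold)))
```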

2.4 KNN

The KNN algorithm is a simple, easy-to-implement supervised machine learning algorithm; it is mostly used to classify a data point based on how its neighbors are classified [22].

k is a user-defined constant. A test point is classified by assigning the label that is most frequent among the k training samples nearest to that test point [23]. KNN classifies based on feature similarity, and choosing the right value of k, a process called parameter tuning, is important for good accuracy [15]. One way to choose k is through cross-validation: take a small portion of the training dataset as a validation dataset, use it to evaluate different candidate values of k, and keep the value of k that gives the best performance on the validation dataset [15]. In practice, a rule of thumb is to choose k = sqrt(N), where N is the number of samples in the training dataset [15].

In the classification setting, the KNN algorithm is in essence a majority vote among the k instances most similar to a given “unseen” observation. Similarity is defined by the distance between two data points; a popular choice is the Euclidean distance, given by Equation 2.6.

d = sqrt( Σ_{i=1}^{k} (x_i − y_i)^2 )    (2.6)

In addition to being simple to implement, KNN performs well in multi-class cases and is flexible with respect to feature and distance choices. On the other hand, its computational cost is quite high, because we need to compute the distance from each query instance to all training samples. Moreover, we also need to determine a suitable value of k, which is not easy either.
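A minimal scikit-learn sketch of the k-selection idea described above (start from k = sqrt(N) and compare candidates on a held-out validation set) follows; the synthetic dataset is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Rule-of-thumb starting point k = sqrt(N), refined by comparing validation accuracy
k_rule = int(np.sqrt(len(X_train)))
for k in (3, 5, k_rule):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: validation accuracy={clf.score(X_val, y_val):.3f}")
```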

2.5 iForest

Isolation Forest is an outlier detection technique that identifies anomalies rather than normal events [16]; it is good at handling large, high-dimensional datasets [16].

Like any tree ensemble method, iForest is built from decision trees [16]. In these trees, partitions are created by first randomly selecting a feature and then selecting a split value between the minimum and maximum value of that feature [16]. In principle, outliers are less frequent than regular observations and lie further away from normal events in feature space, which is why, with such a partitioning strategy, outliers are on average isolated closer to the root of the tree [16]. As shown in Figure 2.1, more partitions are needed to isolate a normal observation [16].

For decision making, an anomaly score is required, as with other outlier detection methods [16]. In the case of iForest, the anomaly score is defined as:

s(x, n) = 2^(−E(h(x)) / c(n))    (2.7)

where h(x) is the path length of the observation x, c(n) is the average path length of an unsuccessful search in a Binary Search Tree, and n is the number of external nodes [16]. If an observation's score is close to 1, it is very likely an anomaly; if its score is much smaller than 0.5, it is normal; and if all scores are close to 0.5, there is no distinct anomaly among the instances.

Figure 2.1: Identifying normal vs. abnormal observations (image source: https://www.depends-on-the-definition.com/detecting-network-attacks-with-isolation-forests/)
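For reference, a minimal scikit-learn sketch of fitting an Isolation Forest and reading out labels and anomaly scores is shown below; the toy data and the contamination value are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, (500, 3)),    # "normal" observations
               rng.uniform(-8, 8, (5, 3))])   # a handful of injected outliers

iforest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42).fit(X)
labels = iforest.predict(X)            # +1 = normal, -1 = anomaly
scores = -iforest.score_samples(X)     # negated so that higher means more anomalous
print("flagged as anomalies:", int(np.sum(labels == -1)))
```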


2.6 Autoencoders

Autoencoders are a special type of feedforward neural network where the input is the same as the output [24]. They are trained in an unsupervised manner in order to learn low-dimensional representations of the input data [17].

An Autoencoder consists of three parts: encoder, code and decoder [24], as shown in Figure 2.2. The encoder compresses the input data and produces the code, and the decoder then reconstructs the input from the code alone [24]. To build an Autoencoder, we need three things: an encoding method, a decoding method, and a loss function that compares the output with the target, which is the input [24].

Figure 2.2: Architecture of Autoencoders (image source: https://towardsdatascience.com/generating-images-with-autoencoders-77fd3a8dd368)

Autoencoders are considered an unsupervised learning technique since they do not need explicit labels to train on; to be more precise, they are self-supervised, because the inputs also act as labels [24]. Autoencoders can be seen as a lossy compression/decompression algorithm, but they can only compress data similar to what they have been trained on, because the network only learns features of the training data [24]. Besides, because this class of compression algorithms is lossy, the output of an Autoencoder is not identical to


the input, since the objective of Autoencoders is not perfect reconstruction but learning low-dimensional representations [24].

Before training an Autoencoder, we need to set four hyperparameters: code size, number of layers, number of nodes per layer, and loss function. The code size is the number of nodes in the middle layer, which controls the degree of compression. The Autoencoder can be as deep as we like; not counting the input and output layers, the network in Figure 2.2 has 2 layers in both the encoder and the decoder. The number of nodes per layer decreases with each subsequent layer of the encoder and increases back in the decoder. Mean squared error (MSE) and binary cross-entropy are two common loss functions: if the input values are in the range [0, 1], we usually adopt cross-entropy; otherwise, we use MSE.
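The Keras sketch below shows how these four hyperparameters map onto an actual model: a code size, two encoder and two decoder layers as in Figure 2.2, and a loss chosen for inputs scaled to [0, 1]. The layer sizes and input dimension are illustrative assumptions, not the exact architecture used later in the thesis.

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 17      # assumed number of features after preprocessing (illustrative)
code_size = 4       # hyperparameter 1: number of nodes in the middle "code" layer

# Encoder 17 -> 8 -> 4 and decoder 4 -> 8 -> 17 (two layers each, as in Figure 2.2)
inputs = keras.Input(shape=(input_dim,))
x = layers.Dense(8, activation="relu")(inputs)
code = layers.Dense(code_size, activation="relu")(x)
x = layers.Dense(8, activation="relu")(code)
outputs = layers.Dense(input_dim, activation="sigmoid")(x)

autoencoder = keras.Model(inputs, outputs)
# Inputs scaled to [0, 1], so binary cross-entropy is used as the reconstruction loss
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# Training is self-supervised: the input also acts as the target
# autoencoder.fit(X_train, X_train, epochs=20, batch_size=128, validation_split=0.1)
```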

2.7 Model Evaluation Metrics

While training a model is a key step, how the model generalizes to unseen data is equally important and should be considered in every machine learning pipeline [25]. To quantify model performance, model evaluation metrics are required [25]. The choice of evaluation metrics depends on the given machine learning task [25]. Here, we focus on metrics used in classification problems: the confusion matrix, accuracy, precision, recall, F1 score and the AUC-ROC curve.

2.7.1 Confusion Matrix

A confusion matrix contains information about the actual and predicted classifications made by a classification system [26]. The performance of such systems is commonly evaluated using the data in the matrix [26]. Table 2.1 shows the confusion matrix for a two-class classifier. The confusion matrix can generate other metrics, including accuracy, precision, recall, and F1 score [26].

                  Predicted Negative     Predicted Positive
Actual Negative   True Negative (TN)     False Positive (FP)
Actual Positive   False Negative (FN)    True Positive (TP)

Table 2.1: Confusion Matrix


2.7.2 Accuracy

Accuracy is the fraction of all instances in the dataset that are correctly predicted. It can be expressed as:

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (2.8)

However, accuracy cannot reflect whether the model actually works on imbalanced data. For instance, when the dataset has a 10:1 class imbalance, simply guessing the majority class gives about 90% accuracy. Hence, high accuracy does not imply good performance: a heavily biased model can miss the relevant relations between features and target outputs while still achieving high accuracy [27].

2.7.3 Precision

Precision is, out of all the instances predicted as positive, how many are actually positive [26]. It can be mathematically expressed as:

Precision = TP / (TP + FP)    (2.9)

2.7.4 Recall

Recall indicates, out of all the actually positive instances, how many we predicted correctly [26]. It should be as high as possible [26]. The mathematical expression of recall is:

Recall = TP / (TP + FN)    (2.10)

2.7.5 F1 Score

There is a trade-off between precision and recall: if you have to recall everything, you will keep generating results that are not accurate, thereby lowering your precision [28]. In order to combine the two into one comparable number, we use the F1 score, which measures precision and recall at the same time [29]. It uses the harmonic mean in place of the arithmetic mean, punishing extreme values more, as shown in Equation 2.11 [29].

F1 score = 2 × (Precision × Recall) / (Precision + Recall)    (2.11)
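The short scikit-learn sketch below ties Equations 2.8-2.11 to code; the toy labels and predictions are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # imbalanced toy labels (1 = attack)
y_pred = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total, Eq. 2.8
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP),   Eq. 2.9
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN),   Eq. 2.10
print("f1 score :", f1_score(y_true, y_pred))          # harmonic mean,    Eq. 2.11
```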


2.7.6 AUC-ROC Curve

The AUC-ROC curve is a widely used measure of the performance of supervised classification rules [30]. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR), and the area under it is called the Area Under the ROC Curve (AUC) [31]. It tells how well the model is capable of distinguishing between classes [30]: the higher the AUC, the better the model is at identifying 0s as 0s and 1s as 1s [30]. An example ROC curve is shown in Figure 2.3.

Figure 2.3: Example of an AUC-ROC curve (image source: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5)


Chapter 3 Datasets

We mainly use two datasets in our experiments: the Http (KDDCUP99) dataset and our own network connection dataset collected with Carbon Black (CB) Response. In this chapter, we first introduce the Http dataset and then describe our own dataset in detail, including what CB Response is, what the raw data looks like, how we preprocess the raw data to make it usable by the algorithms, and privacy issues; finally, a comparison between the two datasets is given.

3.1 Http (KDDCUP99) Dataset

The Http (KDDCUP99) dataset is one of the classical datasets for outlier detection and is widely used by researchers for testing new anomaly detection algorithms. We use this dataset in the early stage to check whether our machine learning algorithms work on outlier detection problems, which lays the foundation for applying the algorithms to our own dataset. The original KDD Cup 1999 dataset from the UCI Machine Learning Repository contains information about TCP network connections; it was used for the 1999 KDD intrusion detection contest and has since been widely used for research purposes. There are 41 attributes in total, of which 34 are continuous and 7 are categorical [32], as shown in Table 3.1. The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators used by the machine learning community for the empirical analysis of machine learning algorithms [33]. A reduced version of the original KDD Cup 1999 dataset keeps only 4 attributes, because they are more general than the others: “service”, “duration”, “src_bytes” and “dst_bytes”.

“Service” is the only categorical attribute among the four; it falls into five distinct categories:


“http”, “smtp”, “ftp”, “ftp_data” and “others” [32]. As indicated by the name of the dataset, only the “http” records are kept, so the Http (KDDCUP99) dataset is 3-dimensional, with “duration”, “src_bytes” and “dst_bytes” as the retained attributes [32]. The original dataset has 3,925,651 attacks (80.1%) out of 4,898,431 records, while the Http dataset has 2,211 attacks (0.4%) out of 567,479 records. Attacks are labeled with 1 and normal events are labeled with 0 [32].

Feature name Description Type

duration length (number of seconds) of the connection continuous

protocol_type type of the protocol, e.g. tcp, udp, etc. discrete

service network service on the destination, e.g., http, telnet, etc. discrete

src_bytes number of data bytes from source to destination continuous

dst_bytes number of data bytes from destination to source continuous

flag normal or error status of the connection discrete

land 1 if connection is from/to the same host/port; 0 otherwise discrete

wrong_fragment number of “wrong” fragments continuous

urgent number of urgent packets continuous

hot number of “hot” indicators continuous

num_failed_logins number of failed login attempts continuous

logged_in 1 if successfully logged in; 0 otherwise discrete

num_compromised number of “compromised” conditions continuous

root_shell 1 if root shell is obtained; 0 otherwise discrete

su_attempted 1 if “su root” command attempted; 0 otherwise discrete

num_root number of “root” accesses continuous

num_file_creations number of file creation operations continuous

num_shells number of shell prompts continuous

num_access_files number of operations on access control files continuous

num_outbound_cmds number of outbound commands in an ftp session continuous

is_hot_login 1 if the login belongs to the “hot” list; 0 otherwise discrete

is_guest_login 1 if the login is a “guest”login; 0 otherwise discrete

count number of connections to the same host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.

serror_rate % of connections that have “SYN” errors continuous

rerror_rate % of connections that have “REJ” errors continuous

same_srv_rate % of connections to the same service continuous

diff_srv_rate % of connections to different services continuous

srv_count number of connections to the same service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.

srv_serror_rate % of connections that have “SYN” errors continuous

srv_rerror_rate % of connections that have “REJ” errors continuous

srv_diff_host_rate % of connections to different hosts continuous

dst_host_count number of connections to the same destination host as the current connection in the past two seconds continuous
Note: The following features refer to these same-host connections.

dst_host_serror_rate % of connections that have “SYN” errors continuous

dst_host_rerror_rate % of connections that have “REJ” errors continuous

dst_host_same_srv_rate % of connections to the same service continuous

dst_host_diff_srv_rate % of connections to different services continuous

dst_host_same_src_port_rate % of connections from the same source port continuous

dst_host_srv_count number of connections to the same destination host and service as the current connection in the past two seconds continuous
Note: The following features refer to these same-service connections.

dst_host_srv_serror_rate % of connections that have “SYN” errors continuous

dst_host_srv_rerror_rate % of connections that have “REJ” errors continuous

dst_host_srv_diff_host_rate % of connections to different hosts continuous

Table 3.1: Basic information of attributes in http (KDDCUP99) dataset
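For readers who want to reproduce this setting, the sketch below loads a comparable http subset through scikit-learn's built-in KDD Cup 99 loader. The thesis does not state exactly how the dataset was obtained, so the loader and the labeling convention here are assumptions.

```python
from sklearn.datasets import fetch_kddcup99

# Reduced http subset with the three retained attributes (duration, src_bytes, dst_bytes);
# this data source is an assumption, the thesis may have used another export of KDDCUP99.
data = fetch_kddcup99(subset="http", percent10=False)
X = data.data.astype(float)
y = (data.target != b"normal.").astype(int)   # 1 = attack, 0 = normal event
print(X.shape, "anomaly ratio:", y.mean())
```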

3.2 Network Connection Dataset

In this section, we give detailed information about the network connection dataset, including an introduction to the tooling used to collect the data, the names and descriptions of the attributes in the original network connection dataset, the preprocessing procedure that makes the original dataset usable by the algorithms, how we insert malware logs into the dataset, and finally the experimental set-up and privacy issues.

3.2.1 Introduction to CB Response

CB Response is an industry-leading incident response and threat hunting solution designed for security operations center (SOC) teams [34]. Visibility and context are its main advantages: unlike conventional antivirus, it continuously records and stores unfiltered endpoint data, so that security professionals can hunt threats in real time and visualize the complete attack kill chain, which has won it wide favor among top SOC teams, incident response (IR) firms and managed security service providers (MSSPs) [34]. With the help of CB Response, we get access to process data. A process is an instance of a program running on a computer, usually started when a program is initiated either by a user or by another program [35]. When a process is running, there are particular sets of associated data that we can keep track of, which makes it possible for security professionals to understand clearly what is happening [35]. In our case, we only care about process records that are network connections, since the internet is usually the door through which malware enters our computers; analyzing network connection data allows us to explore the features of an abnormal event as it tries to invade a computer.

To report data to the CB server, every user installs the CB Response application matching their operating system, with one click on the executable file. The application then runs continuously in the background and sends data from the user's computer to the remote server. To retrieve data from the CB server, we run a Python script with one command; it grabs the data belonging to a preset time range from the server and saves it locally as a CSV file.

3.2.2 Dataset Information

Table 3.2 gives the names and descriptions of all 22 attributes contained in the dataset: 1 boolean, 5 integers and 16 strings.


Column name: Introduction
process.username: username context associated with the process
process.process_name: name of the process
process.hostname: hostname of the PC the process executed on
connection.timestamp: timestamp of the connection event
connection.domain: domain name that the remote IP address points to
connection.remote_ip: remote IP address of the connection event
connection.remote_port: remote port of the connection event
connection.proto: protocol used, either UDP or TCP
connection.direction: either outbound or inbound
connection.local_ip: local IP address of the connection event
connection.local_port: local port of the connection event
connection.proxy_ip: IP address of the web proxy
connection.proxy_port: port of the web proxy
eventtype: event type of the process
process_start: start time of the process
process_end: “true” if child process started, “false” if terminated
os_type: OS type of the computer for the process
pid: internal CB process id of the process
ppid: CB process id of the parent process
process_md5: MD5 of the executable backing the process
process_path: full path of the executable backing the process
cmdline: command line of the process

Table 3.2: Basic information of attributes in the original network connection dataset

3.2.3 Data Preprocessing

Since machine learning algorithms cannot work directly on strings, we preprocess the dataset in two ways. The main difference between the two is whether the data is treated as a time series.

Table 3.3 shows the names and descriptions of the attributes contained in the dataset after preprocessing when time is taken into consideration. The values of these attributes are derived by counting inside sliding windows.

The working mechanism is shown in Figure 3.1, using a 60-second sliding window as an example. Beginning from the first sample row in the dataset, the sliding window each time includes all the sample rows within 60 seconds, with the timestamp of the first row inside the window acting as the start of the 60 seconds. We then count the number of connections, the number of unique processes and the number of unique remote IP addresses that the window includes. The red box in Figure 3.1 shows the initial position of the 60-second sliding window; at this position there are 5 connections, 1 unique process, and 1 unique remote IP address. Next, the window moves to the next position with a step size of 1, from the red box to the blue box in Figure 3.1, and the 3 attributes are counted again. We repeat this moving and counting process until the last line of the dataset. Table 3.4 gives the counts obtained at the marked positions of the 60-second window in Figure 3.1. In the real experiments, sliding windows of several different sizes are applied to the dataset, as in Table 3.3, to extract more time-related information and so detect malicious events.

Column name: Introduction
connection_300s: number of connections inside a 300-second sliding window
process_300s: number of unique processes inside a 300-second sliding window
IP_300s: number of unique IP addresses inside a 300-second sliding window
connection_600s: number of connections inside a 600-second sliding window
process_600s: number of unique processes inside a 600-second sliding window
IP_600s: number of unique IP addresses inside a 600-second sliding window
connection_1800s: number of connections inside a 1800-second sliding window
process_1800s: number of unique processes inside a 1800-second sliding window
IP_1800s: number of unique IP addresses inside a 1800-second sliding window
connection_3600s: number of connections inside a 3600-second sliding window
process_3600s: number of unique processes inside a 3600-second sliding window
IP_3600s: number of unique IP addresses inside a 3600-second sliding window

Table 3.3: Basic information of attributes in the preprocessed dataset when time is considered
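A pandas sketch of the window counting described above is shown below. It assumes the raw export is a CSV with the column names from Table 3.2; note that pandas rolling windows trail (they end at each event) while the text describes forward-looking windows, so the counts are analogous rather than identical, and the file name is hypothetical.

```python
import pandas as pd

df = pd.read_csv("netconn.csv", parse_dates=["connection.timestamp"])
df = df.sort_values("connection.timestamp").set_index("connection.timestamp")

# Encode the string columns as integer codes so rolling windows can be applied to them
df["proc_code"] = pd.factorize(df["process.process_name"])[0]
df["ip_code"] = pd.factorize(df["connection.remote_ip"])[0]

# Event counts and unique counts inside time-based windows, as illustrated in Figure 3.1
for seconds in (300, 600, 1800, 3600):
    rolled = df["proc_code"].rolling(f"{seconds}s")
    df[f"connection_{seconds}s"] = rolled.count()
    df[f"process_{seconds}s"] = rolled.apply(lambda s: s.nunique(), raw=False)
    df[f"IP_{seconds}s"] = (df["ip_code"].rolling(f"{seconds}s")
                            .apply(lambda s: s.nunique(), raw=False))
```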

When time is not taken into consideration, the rows are independent of each other. Table 3.5 gives the names and descriptions of the attributes contained in the preprocessed dataset. Details on how we obtain these attributes are given below:

• process_name_idf: Derived from the column “process.process_name” by applying TFIDF to the corpus composed of all the process names in the dataset. The IDF values indicate how frequently each process name occurs, and we assume that, compared with normal events, anomalies rarely happen.


Figure 3.1: The working mechanism of a sliding window

Index   Num of connections   Num of processes   Num of IPs
0       5                    1                  1
1       4                    1                  1
2       3                    1                  1
3       5                    2                  3

Table 3.4: Counting results derived from the marked positions of the 60-second sliding window in Figure 3.1

• path_prob: Derived from the column “process_path” by calculating, for each particular process, the percentage of its records that use each path. If the process is executed from an unusual path, that path gets a low percentage, which can flag the anomaly.

• ppid_prob: Derived from the column “ppid” by calculating, for each particular process, the percentage of its records that have each parent process ID. As above, if the process is initiated by an unusual parent, the percentage will be quite low, which may indicate an anomaly.

• cmdline_prob: Derived from the column “cmdline” by calculating, for each particular process, the percentage of its records that run with each command line. As mentioned above, if the process runs with an uncommon command line, an abnormal event may be happening.

• connection.remote_ip.Lattitude & connection.remote_ip.Longitude: We map the remote IP address to its corresponding latitude and longitude,


with the help of the IP2Location LITE database [36]. Geolocation offers more practical information than raw IP addresses: if the latitude and longitude of each remote IP address are plotted in a plane coordinate system, we get an intuitive picture of which IP addresses look more abnormal.

• process_end_False & process_end_True: Two columns obtained by applying One Hot Encoding to the column “process_end”.

• process.username_LOCAL SERVICE & process.username_NETWORK SERVICE & process.username_SYSTEM & process.username_USER: Four columns obtained by applying One Hot Encoding to the column “process.username”. Here we do not differentiate between user-level usernames, but uniformly rename them to “USER”.

• connection.proto_IPPROTO_TCP & connection.proto_IPPROTO_UDP: Two columns obtained by applying One Hot Encoding to the column “connection.proto”.

• connection.direction_Inbound & connection.direction_Outbound: Two columns obtained by applying One Hot Encoding to the column “connection.direction”.

Not all the columns in the original dataset contain useful information for detecting anomalies, and keeping useless columns can sometimes cause confusion. Hence, we drop the columns that are of no use: “process.hostname”, “connection.domain”, “connection.remote_port”, “connection.local_ip”, “connection.local_port”, “connection.proxy_ip”, “connection.proxy_port”, “eventtype”, “process_start”, “process_end”, “os_type”, “pid” and “process_md5”.
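The pandas sketch below shows one way the probability and One Hot Encoding features described above could be computed; the exact preprocessing script used in the thesis is not published, and the file name is hypothetical.

```python
import pandas as pd

df = pd.read_csv("netconn.csv")

# Within each process, the fraction of its records that use a given path / parent pid /
# command line; rarely seen values get a low probability and may indicate anomalies.
def within_process_prob(series):
    return series.map(series.value_counts(normalize=True))

for col, new in [("process_path", "path_prob"),
                 ("ppid", "ppid_prob"),
                 ("cmdline", "cmdline_prob")]:
    df[new] = df.groupby("process.process_name")[col].transform(within_process_prob)

# One Hot Encoding of the categorical columns listed in Table 3.5
df = pd.get_dummies(df, columns=["process_end", "process.username",
                                 "connection.proto", "connection.direction"])
```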

3.2.4 Malware Data Collection and Insertion

In general, the process data reported to the CB server from our computers consists almost entirely of normal events, since personal computers are less likely to be attacked than computers in large companies and organizations. If we only used the data reported by normal personal computers, we could only apply unsupervised machine learning algorithms to the dataset. However, the results of unsupervised machine learning algorithms cannot easily be verified and are less accurate than those of supervised algorithms [37].


Column name: Introduction
process_name_idf: IDF value of the process name
path_prob: occurrence probability of the path for a particular process
ppid_prob: occurrence probability of the parent process ID for a particular process
cmdline_prob: occurrence probability of the command line for a particular process
connection.remote_ip.Lattitude: latitude of the location of the remote IP address
connection.remote_ip.Longitude: longitude of the location of the remote IP address
process_end_False: if the process has come to an end, the corresponding position is filled with “1”
process_end_True: if the process is still alive, the corresponding position is filled with “1”
process.username_LOCAL SERVICE: if the username is “LOCAL SERVICE”, “1” fills the corresponding position
process.username_NETWORK SERVICE: if the username is “NETWORK SERVICE”, “1” fills the corresponding position
process.username_SYSTEM: if the username is “SYSTEM”, the corresponding position is filled with “1”
process.username_USER: if the username is “USER”, the corresponding position is filled with “1”
connection.proto_IPPROTO_TCP: if TCP is used, the corresponding position is filled with “1”
connection.proto_IPPROTO_UDP: if UDP is used, the corresponding position is filled with “1”
connection.direction_Inbound: if the connection is inbound, the corresponding position is filled with “1”
connection.direction_Outbound: if the connection is outbound, the corresponding position is filled with “1”

Table 3.5: Basic information of attributes in the preprocessed dataset when time is not considered

To carry out more precise and extensive research and analysis, we introduce some real malware: svchost.exe, brbbot.exe, getdown.exe and openme.exe. We run these malware samples one by one on a clean Windows virtual machine; although they have different characteristics, they usually build network connections to contact their remote servers soon after starting, so we can still grab network connection data for analysis.

After retrieving the data reported by the malware machine from the CB server, we extract the real malware network connections using their process names and insert them, in time order, into the dataset reported by the normal machines. Finally, we label the malware samples with 1 and the normal samples with 0, which gives us our labeled network connection dataset.
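A small pandas sketch of this extraction, labeling and time-ordered merge follows; the file names are hypothetical and the column names come from Table 3.2.

```python
import pandas as pd

MALWARE = {"svchost.exe", "brbbot.exe", "getdown.exe", "openme.exe"}

normal = pd.read_csv("normal_machines.csv")
vm = pd.read_csv("malware_vm.csv")

# Keep only the connections produced by the known malware processes on the VM
malicious = vm[vm["process.process_name"].isin(MALWARE)].copy()
normal["label"] = 0
malicious["label"] = 1

# Merge in time order and write out the labeled network connection dataset
merged = (pd.concat([normal, malicious])
          .sort_values("connection.timestamp")
          .reset_index(drop=True))
merged.to_csv("netconn_labeled.csv", index=False)
```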

3.2.5 Experimental Set-up

All the experiments are carried out in a Docker container with Python and Jupyter. For this thesis, the Docker container runs on a t3.medium instance; T3 instances are provided by Amazon EC2 and offer a balance of compute, memory and networking resources. The deployment is automated so that we can easily run it on more or larger instances.

3.2.6 Privacy Protection of Personal Data

Seven computers in total contribute to our network connection dataset: two run OS X and five run Windows. One of the Windows machines is a virtual machine running on my computer, created for running real malware and collecting the data it reports, so the seven computers are in fact owned by six users. All the users are volunteers who agreed to install the Carbon Black application and report their personal data to the server side. We also follow the applicable privacy clauses and keep the data for internal use only.

3.3 Comparison between Http (KDDCUP99) Dataset and Network Connection Dataset

Table 3.6 compares the three datasets used in this thesis work. The time-dependent version of the network connection dataset is unlabeled, while the Http (KDDCup99) dataset and the time-independent version of the network connection dataset are labeled. Both labeled datasets are highly imbalanced. Testing the algorithms on the Http (KDDCup99) dataset lays the foundation for exploring their feasibility on our own network connection datasets.

Features: Http | Netconn (time-dependent) | Netconn (time-independent)
Size: 567,479 | 100,943 | 30,488
Num of anomalies: 2,211 | 0 | 85
Proportion of anomalies: 0.4% | 0 | 0.3%
Num of attributes: 3 | 12 | 17
Content: TCP connections | network connections built by processes | network connections built by processes

Table 3.6: Comparison between the three datasets used (Notes: Http = Http (KDDCup99) Dataset, Netconn (time-dependent) = Network Connection Dataset (time-dependent), Netconn (time-independent) = Network Connection Dataset (time-independent))


Chapter 4 Methods, Results and Analysis

In this chapter, we provide a detailed discussion of the experiments and results.

For each dataset, we first use PCA to visualize the data and understand its distribution; then the traditional Machine Learning methods and a Deep Learning method are applied to the dataset one after another to detect anomalies. It is worth mentioning that TFIDF is only used on the column “process.process_name” of the original network connection dataset, and that our preprocessed time-based network connection dataset is only studied with PCA.

4.1 Http (KDDCUP99) Dataset

4.1.1 PCA

The first two and three principal components of the Http dataset are visualized in subfigures 4.1a and 4.1b, respectively. The figures show that it is difficult to set normal events and attacks apart, because they are mixed up with each other; therefore, more advanced methods are needed to differentiate them.

4.1.2 K-Means

Method Application

The elbow method, which is designed to help find an appropriate number of clusters in a dataset, is used to determine a suitable value of k for K-Means. Figure 4.2 shows that 8 is a decent value for k.


(a) Two components (b) Three components

Figure 4.1: Visualization of the http (KDDCUP99) dataset using PCA (0 represents a normal event, 1 represents an attack)

Figure 4.2: Elbow method for optimal k for http (KDDCUP99) dataset
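For reference, an elbow curve like the one in Figure 4.2 can be produced along the lines of the sketch below, which plots the within-cluster sum of squares (inertia) against k; the loader and preprocessing here are illustrative assumptions and may differ from the exact pipeline used in the thesis.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_kddcup99
from sklearn.preprocessing import StandardScaler

data = fetch_kddcup99(subset="http", percent10=True)     # assumed data source
X = StandardScaler().fit_transform(data.data.astype(float))

ks = range(1, 16)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.show()
```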

Results and Analysis

Figure 4.3 shows the confusion matrices for the clustering results. Table 4.1 provides the accuracy, precision, recall and F1 score calculated from the unnormalized confusion matrix, as well as the AUC-ROC curve. The results indicate that K-Means did quite well in detecting attacks in the Http dataset.
