
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Information-Theoretic Framework for Network Anomaly Detection

Enabling online application of statistical learning models to high-speed traffic

GABRIEL DAMOUR


Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Industrial Engineering and Management (120 credits)
KTH Royal Institute of Technology, year 2019

Supervisor at Baffin Bay Networks AB: Pär Lundgren
Supervisor at KTH: Timo Koski
Examiner at KTH: Timo Koski


TRITA-SCI-GRU 2019:088 MAT-E 2019:44

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

With the current proliferation of cyber attacks, safeguarding internet-facing assets from network intrusions is becoming a vital task in our increasingly digitalised economies. Although recent successes of machine learning (ML) models bode the dawn of a new generation of intrusion detection systems (IDS), current solutions struggle to implement these in an efficient manner, leaving many IDSs to rely on rule-based techniques. In this paper we begin by reviewing the different approaches to feature construction and attack source identification employed in such applications. We refer to these steps as the framework within which models are implemented, and use it as a prism through which we can identify the challenges different solutions face when applied in modern network traffic conditions. Specifically, we discuss how the most popular framework – the so-called flow-based approach – suffers from significant overhead introduced by its resource-heavy pre-processing step. To address these issues, we propose the Information-Theoretic Framework for Network Anomaly Detection (ITF-NAD), whose purpose is to facilitate online application of statistical learning models to high-speed network links, as well as to provide a method for identifying the sources of traffic anomalies. Its development was inspired by previous work on information-theoretic anomaly and outlier detection, and it employs modern techniques of entropy estimation over data streams. Furthermore, a case study of the framework's detection performance over 5 different types of Denial of Service (DoS) attacks is undertaken, in order to illustrate its potential use for intrusion detection and mitigation. The case study resulted in state-of-the-art performance for point-time anomaly detection of single-source as well as distributed attacks, and shows promising results regarding the framework's ability to identify the underlying sources.

Keywords: Network Security, Distributed Denial of Service, DDoS, DoS, Anomaly Detection, Intrusion Detection, Attack Source Identification, Information Theory, Statistical Learning.


Sammanfattning

Swedish title: ITF-NAD: Ett Informationsteoretiskt Ramverk för Realtidsdetektering av Nätverksanomalier.

As the number of cyber attacks grows rapidly, it becomes increasingly important for our digitalised economies to protect connected operations from network intrusions. Machine learning (ML) is portrayed as a powerful alternative to conventional rule-based solutions, and its remarkable successes herald a new generation of intrusion detection systems (IDS). Despite this development, many IDSs are still built on signature-based methods, which is explained by the major weaknesses that characterise many ML-based solutions.

In this work, we start from a review of current research on the application of ML to intrusion detection, focusing on the necessary steps that surround the implementation of the models within an IDS. By setting up a framework for how features are constructed and how attack source identification (ASI) is performed in different solutions, we can identify the bottlenecks and limitations that prevent their practical implementation. Particular weight is placed on the analysis of the popular flow-based models, whose resource-intensive processing of raw data introduces significant delays, making their use in real-time systems impossible.

To address these weaknesses, we propose a new framework – the Information-Theoretic Framework for Network Anomaly Detection (ITF-NAD) – whose purpose is to enable the direct application of ML models over high-speed network links, and which also provides a method for identifying the underlying sources of an attack. The framework builds on modern entropy estimation techniques designed for data streams, as well as an ASI method inspired by entropy-based detection of outliers in categorical spaces. In addition, a study of the framework's performance over real internet traffic is presented, featuring 5 different types of denial-of-service (DoS) attacks generated with popular DDoS tools, which in turn illustrates the framework's use with a simple semi-supervised ML model. The results show a high level of accuracy in the detection of all attack types, as well as promising performance regarding the framework's ability to identify the underlying actors.

Keywords: Network Security, Distributed Denial of Service, DDoS, DoS, Anomaly Detection, Intrusion Detection, Attack Source Identification, Information Theory, Machine Learning.


Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person or material which has to a substantial extent been accepted for the award of any other degree or diploma at any university or other institute of higher learning, except where due acknowledgement has been made in the text.

Gabriel Eric Damour, May 27, 2019


Acknowledgements

I would like to express my very great appreciation to Professor Timo Koski for his valuable input during the planning and development of this research work.

I would also like to thank the staff of Baffin Bay Networks, and especially my supervisor Pär Lundgren, for providing me with the resources and guidance required to understand the complex world of cyber security.

Finally, I wish to extend my deepest gratitude to those who have supported me during this intensive period of study: my friends and family, and last but not least my partner.


Sola dosis facit venenum


Glossary

ASI Attack Source Identification
AUC Area Under the Curve
DDoS Distributed DoS
DoS Denial of Service attack
FN False Negative
FP False Positive
FPR False Positive Rate
IDS Intrusion Detection System
IoT Internet of Things
IP Internet Protocol (often used to denote IP addresses)
ITF-NAD Information-Theoretic Framework for Network Anomaly Detection
kNN k-Nearest Neighbours
L4 Transport layer, or Layer 4, of the Open Systems Interconnection model
ML Machine Learning
MSE Mean Squared Error
PRNG Pseudo Random Number Generator
ROC Receiver Operating Characteristic
TCP Transmission Control Protocol
TN True Negative
TP True Positive
TPR True Positive Rate


Contents

1 Introduction
1.1 Background
1.2 Problem Description
1.3 Project Environment
1.4 Structure
2 Previous Work
2.1 Network Intrusion Detection
2.2 Anomaly Detection
2.2.1 Requirements
2.2.2 Review of Current IDS Frameworks
2.3 Related Work
2.3.1 Online B-PCA Method
2.3.2 Behavioural Distance
2.3.3 Information Theory
3 Research Outline
4 Theoretical Background
4.1 Entropy
4.2 Entropy Estimation
4.2.1 Naive Plugin Estimator
4.2.2 Entropy sketching
4.3 Anomaly detection
4.3.1 k-Nearest Neighbour Distance
4.3.2 Performance Measures
5 Proposed Framework
5.1 Feature Extraction
5.2 Feature Engineering
5.2.1 Entropy Tracker
5.2.2 Context Trackers
5.2.3 Complexity Analysis
5.3 Anomaly Detection
5.4 Attack Source Identification
5.4.1 Suspect Context
5.4.2 Complexity Analysis
6 Experimental Design: A Case Study
6.1 Data
6.2 Estimation Parameter Tuning
6.3 Point-time Anomaly Detection
6.4 Attack Source Identification
6.5 Analytical Tools
7 Results
7.1 Estimation Parameter Tuning
7.2 Anomaly detection
7.2.1 Normal Traffic Entropy
7.2.2 Entropy Profile
7.2.3 Parameter Tuning and Performance
7.3 Attack Source Identification
7.3.1 kNN Score Analysis
7.3.2 ASI Performance
7.4 Complexity Analysis
8 Conclusion
8.1 Summary
8.2 Disclaimer: Packet Independence Assumption
8.3 Future Work
A TCP Field Description
B Maximally Skewed α-stable Distribution


1 Introduction

1.1 Background

First encountered during the DEF CON conference of 1988, Distributed Denial of Service (DDoS) attacks – which appear to be continually growing in every regard, from frequency to size and complexity – remain a major issue for the cyber security industry. They have doubled in number over a period of three years, and a third of all organisations report being affected by this kind of intrusion [30]. More often than not they occur as part of complex attacks – e.g. used for the illicit extraction of sensitive data – which makes them particularly nefarious. They are also used to disrupt retail, technology and official sites for diverse reasons, ranging from extortion to more political motivations. In many cases, the network paralysis they can engender has major impacts on companies' revenues (especially regarding e-commerce) and damages the reputation of businesses [22]. Consequently, a lot of work has been put into the identification and mitigation of such attacks, and, as we will see, the field is constantly challenged by increasingly complex network traffic conditions.

1.2 Problem Description

A Denial of Service (DoS) attack is a network intrusion in which the perpetrator seeks to consume the resources of one or several victim hosts. This is achieved by forcing the targeted machines to process large amounts of data, either by flooding them with illegitimate requests or by setting up a strategy such that their processing capacities are mobilised by connections held open for a long (yet short enough to avoid time-outs) period of time – e.g. Slowloris attacks. More often than not, these types of intrusions are distributed over a large network of hosts – also known as a botnet – consisting of legions of robots, usually innocent machines compromised by some kind of malware. These are known as Distributed DoS (DDoS) attacks, and are carried out by an ever-growing number of ghost hosts. This development is especially worrisome given the current proliferation of Internet of Things (IoT) devices, which have proven extremely vulnerable to passive infections, usually intended for their use in DDoS attacks [15]. In the past few years, several attacks have been recorded breaking the symbolic threshold of 1 terabit per second.

Figure 1: TCP header structure

In order to further delimit the scope of this paper, a specific type of DDoS attack is considered, namely the so-called flood-based Layer 4 (L4) attacks. These are volumetric intrusions which target networks at the transport level of internet communications – a.k.a. Layer 4. Moreover, the study is further delimited to a specific protocol, namely the Transmission Control Protocol (TCP), which is the primary target of L4 DoS attacks, accounting for more than 80% of transport layer attacks [20]. Using this protocol, hosts can send and receive data over the internet by segmenting the payload into a sequence of packets. A TCP packet is composed of a header segment – containing 10 mandatory fields and an optional one, informing the receiver of which individual connection the packet is part and of its state – as well as a data segment, or payload, which contains the information being transmitted. Figure 1 illustrates the structure of a TCP packet, whose detailed description is found in Table 4 of Appendix A.
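To make the header layout concrete, the following sketch parses the mandatory fields of a raw TCP header; it is our own illustration based on the standard 20-byte layout (RFC 793), not code from the framework itself.

```python
import struct

def parse_tcp_header(raw: bytes) -> dict:
    # Unpack the 20-byte mandatory TCP header: ports, sequence and
    # acknowledgement numbers, data offset, flags, window, checksum
    # and urgent pointer (the optional options field is ignored here).
    src, dst, seq, ack, offset_reserved, flags, window, checksum, urgent = \
        struct.unpack("!HHIIBBHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "data_offset": offset_reserved >> 4,  # header length in 32-bit words
        "flags": {name: bool(flags & bit)
                  for name, bit in [("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
                                    ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]},
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }
```

The six flag bits extracted here are precisely the binary fields whose entropy is tracked in Section 4.2.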

One example of such an L4 DDoS flood is the so-called SYN-flood attack, which exploits the way TCP establishes connections, a process commonly known as the three-way handshake.

To initiate a TCP handshake, the client host begins by issuing a synchronisation (SYN) packet – i.e. a packet with the SYN flag set to 1 – to the server host, which in turn responds by sending back a synchronisation acknowledgement (SYN-ACK) packet. The client can now establish the connection by sending an acknowledgement (ACK) packet for the data stream to initiate. Data is subsequently sent via these packets, which must always be confirmed by the receiver sending an ACK response.

Figure 2: TCP Three-way handshake

In a SYN-flood, the malicious client sends a great volume of SYN packets containing falsified (a.k.a. spoofed) Internet Protocol (IP) addresses, leading the targeted host to send SYN-ACK packets into the void, awaiting nonexistent ACK responses. When enough of the server host's resources are engaged by these illicit connections, its stack overloads and fails to process requests from legitimate users – who receive denial-of-service messages, giving the attack its name.

Figure 3: SYN-flood DoS attack
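As a toy illustration of this mechanism (ours, not part of the thesis framework), the sketch below models a server whose table of half-open connections is exhausted by spoofed SYNs; the backlog size and address generation are arbitrary choices.

```python
import random

BACKLOG = 128  # arbitrary capacity for half-open (SYN_RECEIVED) connections

def syn_flood_demo(n_spoofed: int = 1000) -> None:
    half_open = set()  # connections awaiting the final ACK of the handshake
    dropped = 0
    for _ in range(n_spoofed):
        src = random.getrandbits(32)  # spoofed source IP: the ACK never comes
        if len(half_open) < BACKLOG:
            half_open.add(src)        # server replies SYN-ACK and waits
        else:
            dropped += 1
    # Once the backlog is full, a legitimate client's SYN is refused.
    print(f"half-open: {len(half_open)}, dropped: {dropped}, "
          f"legitimate client served: {len(half_open) < BACKLOG}")

syn_flood_demo()
```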

1.3 Project Environment

This project was conducted in collaboration with Baffin Bay Networks AB, a Stockholm-based cyber-security company that offers cloud-based protection solutions, including mitigation of DDoS attacks. The company was interested in investigating how to improve the efficiency of online anomaly detection models, and placed weight on the need for possible sampling of the packet trace – which, as we will see in subsequent sections, is lacking in the current research landscape. This has led to the development of the Information-Theoretic Framework for Network Anomaly Detection (ITF-NAD).

1.4 Structure

The remainder of this paper is structured as follows. Section 2 attempts to synthesise the theoretical landscape surrounding the problem of network intrusion mitigation, placing particular attention on the issue of DDoS attacks. Starting from a high-level perspective – in which the different strategies for mitigating DDoS attacks are presented – we successively narrow down the review to the current progress made in the field of anomaly detection, focusing on applications of statistical learning models to intrusion detection. In the latter, three main approaches are identified, which we analyse through what will be referred to as an Intrusion Detection System (IDS) framework. This abstract structure will be used as a lens through which we can better identify and understand the problems each approach faces in its application to online intrusion detection in high-speed networks. This will in turn lead us to a discussion of how these challenges can and should be addressed, first by summarising related work that has been conducted on the subject and finally by introducing the solution proposed in this paper.

Having identified these issues, we move on, in Section 3, to a description of the purpose and aim of this research project, as well as an account of the main contributions it is expected to make. Here, the framework developed during this research project is introduced by underlining the key goals it seeks to meet.

In Section 4, we go through the main theoretical concepts and methods employed by the proposed framework, as well as the statistical learning model used to illustrate its performance and characteristics. Particular weight is put on the concept of entropy, which is key to the framework definition presented in Section 5. In the latter, a detailed description of the different framework steps is presented, as well as an analysis of its space-time complexity.

Further on, Section 6 describes the case study that was conducted with the purpose of illustrating the framework's use and evaluating its performance under different attack scenarios.

The results of this case study are subsequently presented in Section 7, which encompasses both quantitative evaluations of the framework's output and deductive reasoning regarding best practices when using it.

Finally, a conclusion is articulated in Section 8, consisting of a brief summary of the main characteristics of the framework, a discussion of a fundamental issue in our approach, as well as suggestions for further developments of the framework.


2 Previous Work

2.1 Network Intrusion Detection

In order to prevent DDoS attacks from disrupting a network, several strategies are available – the taxonomy of which is illustrated in Figure 4 – that differ in fundamental ways. In this paper, we consider the challenging task of detecting such network intrusions, which itself can be performed using two distinct approaches: signature-based and anomaly-based detection.

Figure 4: DDoS Mitigation Methods; source: [2]

In the signature-based approach, one seeks to detect attacks by identifying patterns in the network traffic that match those of previously known intrusions. Though this is efficient for mitigating known attacks, it leaves the network vulnerable to new types of intrusions and requires continuous updating of the signature database. Anomaly-based detection, on the other hand, requires no previous knowledge of the attacks it seeks to identify. Instead, these methods provide ways of modelling normal traffic conditions and identifying large deviations as anomalies, which would suggest that some malicious activity is underway. Although this theoretically implies the possibility of detecting both common and new attack patterns, this approach faces a number of challenging issues [7], the main ones being:

• The difficulty of defining a normal region in such a way that every possible normal behaviour is included.

• The risk that malicious actors adapt their attacks so as to appear like normal traffic.

• The possibility that normal behaviour might evolve, thus requiring frequent updating of the normal region boundaries.

• The fact that available training data may contain noise similar to actual anomalies, which would make their detection more difficult.

Although these are serious weaknesses, the benefit of this approach in potentially identifying any attack behaviour makes it the preferred approach in the literature considered by this review.

2.2 Anomaly Detection

Anomaly detection is studied across several domains, from statistics to machine learning, including spectral and information theory as well as data mining [7] – it is a rather well-studied field, which finds applications in a variety of domains. In this paper, we limit ourselves to the study of statistical learning models and their application to the problem of network anomaly detection. We will, furthermore, refrain from presenting a review of the many models and their reported performance and/or deficiencies – which are covered in detail by a variety of surveys such as [19], [24] or [25]. Instead, we focus our attention on their implementation and the problems that characterise it. The literature does, however, provide some important insights regarding the advantages that methods of unsupervised learning present over supervised ones. Supervised methods require significant volumes of labelled data and have been shown to remain somewhat inefficient in detecting zero-day attacks – i.e. ones following unknown patterns or which simply are not covered by the training data. That being said, it is important to note that unsupervised methods are characterised by a trade-off between accuracy and false positive rates – which vary depending on the model used and the threshold level above which traffic is classified as anomalous. Because of the lack of large amounts of labelled data and the aforementioned requirements, this paper will focus on unsupervised – as well as semi-supervised – methods.

Rather than constraining ourselves to a specific learning model, the focus of this paper is set on the framework within which these methods are applied. Figure 5 below illustrates the necessary steps surrounding any application of statistical learning in an IDS. These include feature extraction, feature engineering and, finally, how the problem of attack source identification (ASI) is addressed.

Figure 5: IDS Anomaly Detection Framework

Analysing different solutions through this prism allows us to clearly identify the challenges that the respective solutions face when applied in modern traffic conditions. In the following section, an account of the main requirements that a state-of-the-art IDS should meet is presented, followed by a brief discussion of the strengths and weaknesses that characterise the common frameworks within which current methods of network anomaly detection operate.

2.2.1 Requirements

The requirements that a state-of-the-art IDS should meet can be summarised by the following points: 1) high levels of accuracy, 2) low false alarm ratios, 3) adaptability to new intrusion patterns – so-called zero-day attacks – and 4) early detection of threats for efficient mitigation. Furthermore, we argue that a modern IDS should be able to operate on sampled TCP data and require minimal preprocessing of packet headers, so as to allow online integration into traffic monitoring systems, thereby enabling timely responses to attacks. Finally, for it to be of any value for attack mitigation, the IDS should be able to give indications of the source of the attack (i.e. the source IP addresses that lie behind the attack) – any additional information given by the output of the IDS regarding the type of attack encountered, such as the victim host(s) IP address(es) and flag types used, would be of great value to the mitigation process.


2.2.2 Review of Current IDS Frameworks

Reviewing the many models and methods of anomaly detection that are employed in the context of network intrusions, one can roughly group these under three main approaches according to the underlying framework. In this section, we argue that each of these approaches fails to meet at least one of the aforementioned requirements, leading us to a reflection on ways to address these fundamental issues.

The first one, which will be referred to as the count-based approach, constructs features through simple aggregations of packets into counts of different network events – e.g. packet counts, field counts, byte counts etc. – over specific periods of time. These are then either modelled as stochastic processes [31], whose outcome probabilities are used to determine whether a traffic time-window is deemed too unlikely to be an outcome of normal traffic conditions, or used as features of machine learning algorithms [16]. This approach has the advantage of requiring minimal pre-processing of the TCP packet headers, resulting in scalable solutions that can work online in high-speed traffic conditions. On the other hand, count-metrics are difficult to model as stochastic processes, leading to high false-alarm rates. Finally, the aggregation of packets into count-metrics entails a loss of information regarding the source of the anomaly, which requires a separate, deductive process of attack source identification.

With the rise of machine learning and its impressive results in domains like natural language processing and image recognition, new approaches to intrusion detection have emerged. One such approach is the so-called packet-based approach, which focuses on the raw TCP packet data (a.k.a. the packet payload) to identify anomalous patterns. The idea is that malicious packets – which usually are generated in an autonomous manner – are characterised by recurrent patterns. The model performs an analysis of the semantic structure of packet payloads to find these patterns and classify packets as part of a legitimate or anomalous traffic flow. The major advantage of this approach over the aforementioned one is that the output of this type of IDS implicitly solves the subsequent attack source identification task. Given that the model identifies anomalous packets, it follows that the source IPs associated with these packets are the addresses of culprit hosts. On the other hand, this method has two significant weaknesses, which have rendered it rather obsolete for applications in modern traffic conditions. First of all, the processing of packet payloads is very resource consuming, leading to significant overhead; secondly, the analysis of these packets is restricted by the encryption of TCP communications, which is increasingly common [11].

Again, to overcome the limitations of packet-based IDS, a new approach is needed. Contrary to the former one, the flow-based approach only requires the observation of TCP header information, which it aggregates into flows, or individual connections between two hosts. Given flow records that contain statistics specific to each connection – such as its duration, the packet frequency or the number of different flags used – machine learning algorithms are applied to identify anomalous behaviour. This can be done in an unsupervised manner – either by clustering flows or by modelling normal flows to find deviations from them – but also in a supervised setting, such that the model classifies flows into different categories. To that end, there exist a number of datasets containing simulated flow records that encompass several types of network intrusion – of which the Knowledge Discovery Database Cup 1999 (KDD99) is the most popular. The most significant advantages of this approach are: 1) the high levels of accuracy that are reported for some supervised [24] as well as semi-supervised models [3] on these datasets, providing evidence of the high relevance of the information contained in flow statistics, 2) the IDS only has to process packet header information and 3) similarly to the packet-based approach, its output implicitly solves the problem of attack source identification. That being said, this approach has challenges of its own when faced with modern, high-speed network traffic. Given the high space-time complexity of the pre-processing of TCP packets into flow records and the extraction of high-level information from these flows, IDSs operating under this approach are highly inefficient and are ultimately doomed to off-line applications [29]. Furthermore, this approach requires the monitoring probes – used for packet data collection – to observe the entire traffic. Indeed, several studies point out that these flow statistics are very sensitive to sampling and that their estimation leads to significant bias [10].

2.3 Related Work

Although machine learning models have proved to be highly performant in detecting network intrusions – some reporting accuracy rates reaching 98.23% [17] – most IDSs currently in use continue to rely on rule- or expert-based methods. As G. Sourek points out in [29], although most of the works considering methods of statistical learning are based on the KDD99 dataset, few are those which address the issues associated with the flow aggregation part of the intrusion detection problem. Of the literature gathered during this research, none discusses how sampling might affect performance. Evidently, the challenges these models face lie outside their definitions, in the surrounding framework steps. In this section we discuss two of the few papers that do address these issues (albeit not explicitly) by diverging from the popular IDS frameworks discussed above.

2.3.1 Online B-PCA Method

One such "fast" method is provided in [35], where the authors propose to directly model raw network traffic as a 2D origin-destination (OD) matrix. Given a network consisting of n endpoints, an OD matrix is an n × n binary matrix representing the active flows in the network during some time period of fixed length – as illustrated in Figure 6 below.

Figure 6: Origin Destination matrices over time; source: [35]

The idea is to collect a normal set of such matrices and perform a bilateral principal component analysis (B-PCA) to calculate the principal directions of the normal OD matrices. The model is then used online, detecting anomalies by monitoring the shifts in principal directions caused by newly arriving data points, as illustrated in Figure 7. Furthermore, the authors advocate the use of oversampling in order to further accentuate these shifts. By employing a cosine distance, an anomaly score is constructed, which is used to classify large deviations of the principal components as anomalies. Furthermore, the paper provides a way to deduce which individual connections are responsible for these deviations – thereby addressing the problem of attack source identification. In order to do this efficiently, the paper provides a closed-form updating function for the principal direction, without having to rely on resource-heavy iterative calculations.


Figure 7: "Anomaly detection using B-PCA. (a) Left Subspace (add normal sample). (b) Right Subspace (add normal sample). (c) Left Subspace (add anomalous sample). (d) Right Subspace (add anomalous sample)"; source: [35]

Although this method presents weaknesses of its own – such as the problem of dealing with large OD matrices in complex networks and the restriction to the IP address information contained in the packet headers – it does illustrate how a change of approach (regarding the framework) can lead to great improvements in efficiency, without any significant loss in accuracy.
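To make the mechanism concrete, here is a minimal sketch of the core idea: monitoring, via cosine distance, the shift of a principal direction when a new OD matrix is folded in. It is a deliberate simplification of the authors' B-PCA – a single (destination-side) direction, recomputed naively rather than through their closed-form update – and all names, sizes and data are illustrative.

```python
import numpy as np

def principal_direction(od_matrices):
    # Leading right singular vector of the averaged OD matrix: a crude
    # stand-in for one of the two subspaces tracked by B-PCA.
    mean_od = np.mean(od_matrices, axis=0)
    _, _, vh = np.linalg.svd(mean_od)
    return vh[0]

def anomaly_score(baseline, od_matrices, new_od):
    # Cosine distance between the baseline direction and the direction
    # recomputed after folding in the newly arrived OD matrix.
    shifted = principal_direction(od_matrices + [new_od])
    return 1.0 - abs(baseline @ shifted)  # singular-vector sign is arbitrary

rng = np.random.default_rng(0)
normal = [(rng.random((50, 50)) < 0.05).astype(float) for _ in range(20)]
baseline = principal_direction(normal)

attack = (rng.random((50, 50)) < 0.05).astype(float)
attack[:, 7] = 1.0  # every origin opens a flow towards destination 7
print(anomaly_score(baseline, normal, normal[0]))  # folding in a normal sample
print(anomaly_score(baseline, normal, attack))     # concentration on one destination
```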

2.3.2 Behavioural Distance

Another approach – which, as we will see, is quite similar to the one taken in this project – is found in [27], where the authors propose an efficient traffic anomaly detection method which can scale to high-speed traffic. By monitoring the entropy levels of different packet header fields – referred to as entropy profiles – the model measures the distance between a baseline and newly computed estimates, which are classified as anomalous if said distance breaks a certain threshold.

Furthermore, the authors claim that these entropy profiles can be used to classify the type of anomaly detected. By applying their model to high-speed internet backbone links, they show that this method can be implemented online and that it accurately detects a number of real-world attacks, without having to rely on per-flow state information.

A major drawback of this method compared to the B-PCA one is that it does not explicitly address the problem of attack source identification. Although it gives some indication as to how this process could be performed, the paper does not provide any detailed method, let alone an evaluation of its performance and complexity.

2.3.3 Information Theory

It is worth highlighting that both of the solutions proposed in the sections above rest upon the assumption that the distributions of header fields – such as the source and destination IP addresses – contain relevant information for the detection of network intrusions. Far from being a novelty, this assumption is the foundation of another approach, which has drawn significantly less attention than the aforementioned ones, namely the so-called distribution-based approach.


To quantify changes in these distributions, information-theoretic measures (such as entropy) are used to track the distributional properties of TCP header fields. In [33], Wagner and Plattner propose an entropy-based approach to extract relevant information regarding the distributions of common categorical fields found in protocol packets, such as IP addresses and port numbers. Furthermore, they establish the relevance of such features in a real-world situation, analysing their dynamics under known, large-scale worm attacks. Similarly, Lakhina et al. show that the analysis of TCP field distributions enables both the identification and classification of a range of network anomalies, through the use of unsupervised learning [21].

This approach offers the significant advantage of not having to rely on flow-metrics – although it is often employed on this type of data [5], where entropy also provides information relevant to IDS. Furthermore, entropy-based metrics, as well as simple metrics such as packet and byte counts, prove to be fairly resistant to the effects of sampling [6], which could be employed to decrease the time-complexity of the IDS and allow it to operate in an online manner. To our knowledge, this area of network anomaly detection has been left relatively understudied compared to the amount of research devoted to improvements in monitoring and flow-aggregation efficiency.


3 Research Outline

Having established the challenges that machine learning based IDSs face with the current explosion in internet traffic speed, we can articulate a research goal that aims at bridging this apparent gap in the reviewed literature. Specifically, the purpose of this paper is to reaffirm the need to find alternative frameworks to those broadly employed by current research. In this paper, we propose a novel network anomaly detection framework, designed for processing large amounts of data in real time, and investigate its detection abilities for different types of TCP L4 DDoS-flood attacks. Choosing a simple distance-based anomaly detection algorithm (the kNN, see Section 4.3.1), we exemplify its application within the ITF-NAD through a case study, intended to provide insights on framework parameter selection and evidence of its relevance for intrusion detection. Furthermore, the framework addresses the challenging task of attack source identification (ASI), central to any efficient mitigation, without having to rely on the specific learning method used.

Although no explicit hypothesis was formulated during the project's course, a number of questions arose from the development of the ITF-NAD, which are investigated in the framework's definition and case study. These can be summarised in the following way:

• How can we estimate entropy over data streams in an efficient manner?

• Does the entropy of packet fields provide relevant information for the detection of DDoS flood attacks in TCP network traffic?

• Do these measures provide insights on the type of attack detected?

• How can these measures be used for attack source identification?

It is worth noting that these are very open-ended questions, and that the aim of this paper is not to perform statistical hypothesis-testing of these, but rather to focus our attention on key aspects which characterise the ITF-NAD. Furthermore, the proposed framework was developed with the goal of providing an alternative way of applying machine learning models to online detection of network intrusions. In doing so, we hope to make the following contributions:

• Provide an efficient framework for the online application of statistical learning models to intrusion detection.

• Provide further empirical evidence of the relevance of information-theoretic measures for intrusion detection and attack type classification.

• Provide an IDS framework which can operate on a sampled packet trace.

• Provide a reformulation of Wu and Wang's Information-Theoretic Based Single Pass (ITB-SP) algorithm for outlier detection (see Section 5), thereby extending its application to attack source identification of network intrusions.


4 Theoretical Background

Before diving into the details of the framework, we need to properly define the key mathematical concepts and methods on which it relies. Starting from the central concept of entropy and its estimation for discrete variables, we continue with an overview of the semi-supervised method chosen to exemplify the framework's use (see Section 6). Finally, we go over the performance measures used in its evaluation.

4.1 Entropy

The concept that lies at the heart of the ITF-NAD is that of entropy. In particular, we consider the so-called information entropy as described by Claude Shannon in his landmark paper A Mathematical Theory of Communication, which differs from its homonym used in statistical mechanics – describing the amount of disorder in a thermodynamic system. The Shannon entropy can instead be understood as a measure of a random variable's uncertainty [9], or conversely as a way to quantify the amount of information it contains. This conceptual interpretation can also be thought of in more concrete terms, as the minimum amount of memory required to store the variable – a metric representing the maximum lossless compression of the data [32]. In this paper we will only consider the entropy of discrete random variables, which is defined as follows:

Definition 4.1 (Empirical Shannon Entropy). Let X be a discrete random variable with domain space \mathcal{X} and probability mass function p_X(x); then the empirical Shannon entropy of X, H(X), is given by

H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log_b p_X(x),

where b is the logarithmic base used.

The choice of logarithmic base determines the unit of measure: if b = 2 the entropy represents the number of bits needed, whereas with the natural logarithm (i.e. base b = e) the unit of measure is called a nat. If not otherwise specified, the natural logarithm is used.
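As a brief worked example of why this quantity is informative for flood detection (our numbers, chosen for simplicity): over an outcome space of K values, a field spread uniformly attains the maximal entropy \log K, while a field concentrated on a single value has entropy 0,

\begin{align*}
p_X(x) = \tfrac{1}{K} \;\; \forall x \in \mathcal{X} &\implies H(X) = -\sum_{x \in \mathcal{X}} \tfrac{1}{K} \log \tfrac{1}{K} = \log K, \\
p_X(x_0) = 1 &\implies H(X) = -1 \cdot \log 1 = 0.
\end{align*}

A SYN-flood from randomly spoofed sources pushes the source-IP entropy towards the first extreme, while concentrating traffic on a single victim pushes the destination-IP entropy towards the second; both shifts are visible without inspecting payloads.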

4.2 Entropy Estimation

Having defined the concept of entropy, we now turn to the challenging task of its estimation, which amounts to the estimation of the associated probability mass function p_X(x). We distinguish two different cases when estimating the entropy of TCP header fields: the binary flag values (6 fields) on the one hand, and categorical variables with large yet finite discrete outcome spaces – i.e. 10 fields, such as IP addresses, which have outcome space \mathcal{X} of cardinality K = 2^32 – on the other. Furthermore, two distinct estimation techniques are covered below: the histogram method – which, as we will see, presents some challenges for online estimation over data streams – and the so-called sketching method, specifically designed for stream estimation.

4.2.1 Naive Plugin Estimator

Given a sample of i.i.d. observations S = {x_1, ..., x_N}, the most straightforward and intuitive method is to use the naive plug-in estimator of the p.m.f. (a.k.a. the histogram method):

\hat{p}_X(x) = \frac{|\{x_i \in S : x_i = x\}|}{N}.

This leads to the following expression for the empirical entropy:

\hat{H}(X) = -\sum_{x \in \mathcal{X}} \frac{|\{x_i \in S : x_i = x\}|}{N} \log \frac{|\{x_i \in S : x_i = x\}|}{N}.
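A minimal plug-in implementation (our illustration) makes the memory cost explicit: the counter grows with the number of distinct outcomes observed.

```python
from collections import Counter
import math

def plugin_entropy(sample) -> float:
    # Histogram estimate: count each distinct outcome, then apply
    # Definition 4.1 with the empirical p.m.f. (natural logarithm).
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())
```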


In the binary case, the entropy in Definition 4.1 simplifies to H(X) = -p \log p - (1-p) \log(1-p), where p = p_X(1). Using the naive estimator \hat{p} = n/N, where n = |\{x_i \in S : x_i = 1\}|, we get

\hat{\delta}_{bin}(n, N) = -\frac{n}{N} \log \frac{n}{N} - \frac{N-n}{N} \log \frac{N-n}{N}, \qquad (1)

which only requires us to keep track of two parameters: the number of positive occurrences in the sample, n, and the total number of observations, N.
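In code, Equation (1) amounts to maintaining two counters per binary field (our sketch, with the usual convention 0 log 0 = 0):

```python
import math

def binary_entropy(n: int, N: int) -> float:
    # Plug-in estimate of Eq. (1): only the count of set flags, n, and
    # the total number of observations, N, are kept in memory.
    if N == 0 or n == 0 or n == N:
        return 0.0
    p = n / N
    return -p * math.log(p) - (1 - p) * math.log(1 - p)
```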

For categorical variables, on the other hand, the estimation function requires us to keep track of every non-zero count n_x = |\{x_i \in S : x_i = x\}| for all x \in \mathcal{X}. This can result in quite large storage costs, as several large hash-maps with potentially more than 4 billion entries (in the case of 32-bit fields) have to be kept in memory. In the case of stream estimation, one would need either to estimate over non-overlapping windows or to keep time-stamps of the counts in the respective time slots (in order to remove them as time goes by). The updates of these data structures result in prohibitively large space-time complexity, and in most cases covered in the literature, entropy is estimated over small and non-overlapping time windows – leading to volatile time series.

4.2.2 Entropy sketching

In order to reduce the complexity of estimating entropy over streams of large categorical variables, we turn to a simple yet effective method perfected by Clifford and Cosma in [8]. They propose an unbiased and efficient estimation of entropy from a low-dimensional synopsis of the sample points, called an α-stable data sketch. This method is specifically designed to be applied to streaming data in a so-called relaxed strict-turnstile model – i.e. data can be added to and deleted from the sample of interest.

The idea is to map, or rather sketch, the outcome of a discrete r.v. X to that of a continuous one following a maximally skewed α-stable distribution – with stability, skewness, scale and location parameters respectively set to α = 1, β = −1, γ = π/2 and δ = 0, denoted F(x; α, β, γ, δ) (see Appendix B for further details) – which presents some interesting properties. Indeed, Clifford and Cosma show how the negative entropy of X can be recovered in the location parameter of the distribution followed by a linear combination of i.i.d. X_j ∼ F(x; 1, −1, π/2, 0) realisations, weighted by the respective probability masses p_j = p_X(x_j), as stated in the following lemma.

Lemma 4.1. Let X_1, ..., X_N \sim F(x; 1, -1, \pi/2, 0) i.i.d., and let p_1, ..., p_N be positive constants satisfying \sum_{j=1}^{N} p_j = 1. Then

\sum_{j=1}^{N} p_j X_j \sim F\left(x;\, 1, -1, \frac{\pi}{2}, \sum_{j=1}^{N} p_j \log p_j\right).

To recover the entropy, the authors provide a so-called log-mean estimator, \hat{\delta}_{lm}(\zeta), used to estimate the location parameter δ over an i.i.d. sample of F(x; 1, −1, π/2, δ)-distributed variables – i.e. the d linear combinations of X specified above.

Lemma 4.2. Let y_1, ..., y_d be independent samples from the F(x; 1, -1, \pi/2, \delta) distribution, and let \zeta > 0 be a constant. The log-mean estimator of \delta is

\hat{\delta}_{lm}(\zeta) = \zeta^{-1} \log\left(\zeta^{-\zeta} d^{-1} \sum_{j=1}^{d} \exp(\zeta y_j)\right).

As the sample size d increases to \infty, the estimator is asymptotically unbiased; in particular, as d \to \infty,

\sqrt{d}\,(\hat{\delta}_{lm}(\zeta) - \delta) \to \mathcal{N}\left(0, \frac{4^{\zeta} - 1}{\zeta^2}\right).

Regarding the estimation parameter, the authors explicitly recommend using the estimator with ζ = 1, so as to maximise asymptotic relative efficiency while ensuring exponentially decreasing tail bounds for the estimator.

Algorithm 1: Sample r from the maximally skewed stable distribution F(x; 1, -1, \pi/2, 0)
Result: a sample from F(x; 1, -1, \pi/2, 0)
1 Simulate U_1, U_2 \sim U(0, 1) i.i.d.;  // standard uniform distribution
2 Set V_1 = \pi(U_1 - \frac{1}{2});
3 Set V_2 = -\log(U_2);
4 return r = \tan(V_1)\left(\frac{\pi}{2} - V_1\right) + \log\left(\frac{V_2 \cos V_1}{\pi/2 - V_1}\right)

In order to sketch the categorical random variables (e.g. IP addresses), the authors advocate using the so-called seeding method. The idea is first to map each outcome x_j \in \mathcal{X} to a positive integer c_j – a task easily achieved for TCP fields by taking the numerical value of their binary representations – and then to use this value as the seed of a pseudo-random number generator (PRNG) used to simulate a 1-stable maximally skewed r.v., R(c_j), thereby deterministically mapping x_j to copies of pseudo-random variables. In order to increase the accuracy of the estimator \hat{\delta}_{lm}, we construct a d-dimensional sketch vector y over which to estimate the location parameter.

A step-by-step description of this method is found in Algorithm 2, while the procedure used to simulate pseudo-random variables from the maximally skewed distribution is described in Algorithm 1 above.

Algorithm 2: Entropy sketch
Data: collection of integers C_int, dimension of sketch d
Result: sketch vector y
1 initialise data sketch y = (y_1, ..., y_d) = 0;
2 for c_i \in C_int do
3   set PRNG seed to c_i;
4   for j \in \{1, ..., d\} do
5     sample r_j \sim F(x; 1, -1, \pi/2, 0) with Algorithm 1;
6     update y_j = y_j + r_j;
7   end
8 end
9 return y

In conclusion, this allows us to construct a d-dimensional synopsis vector y = (y_1, ..., y_d)^T, whose entries can be combined linearly to estimate the entropy over some sample of the data stream. Following the recommendation of Clifford and Cosma, we define our estimation function as the log-mean estimator with \zeta = 1,

\hat{\delta}_{lm}(y, N) = \log\left(d^{-1} \sum_{j=1}^{d} \exp\left(\frac{y_j}{N}\right)\right), \qquad (2)

where y is a sketch vector and N the total number of observations used in its construction.


In this way, we can estimate the entropy over collections of large-scale categorical variables, without having to keep said collections in memory or track counts as in the histogram method. Furthermore, the estimates can easily be updated by adding or removing sketches from the one used and by updating N.
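The whole of Section 4.2.2 fits in a few lines of code; the following is our illustrative Python rendering of Algorithms 1 and 2 together with Equation (2). Note that the location parameter in Lemma 4.1 is \sum_j p_j \log p_j = -H, so the entropy estimate is the negative of the estimated location.

```python
import math
import random

def stable_sample(rng: random.Random) -> float:
    # Algorithm 1: one draw from the maximally skewed stable law
    # F(x; 1, -1, pi/2, 0) from two independent uniforms.
    v1 = math.pi * (rng.random() - 0.5)   # uniform on (-pi/2, pi/2)
    v2 = -math.log(rng.random())          # standard exponential
    w = math.pi / 2 - v1
    return math.tan(v1) * w + math.log(v2 * math.cos(v1) / w)

def entropy_sketch(stream, d: int) -> list:
    # Algorithm 2: seeding the PRNG with the integer-coded symbol itself
    # deterministically maps each symbol to the same (r_1, ..., r_d),
    # so repeated symbols accumulate in the sketch with their counts.
    y = [0.0] * d
    for c in stream:
        rng = random.Random(c)
        for j in range(d):
            y[j] += stable_sample(rng)
    return y

def sketch_entropy(y: list, N: int) -> float:
    # Equation (2) with zeta = 1; the sign flip turns -H into H.
    return -math.log(sum(math.exp(yj / N) for yj in y) / len(y))

# Usage: entropy of a stream over 256 symbols, sketch vs. exact log(256).
data = [random.randrange(256) for _ in range(5_000)]
print(sketch_entropy(entropy_sketch(data, d=400), len(data)), math.log(256))
```

Deleting an observation is equally simple – subtract its d pseudo-random values from y and decrement N – which is what makes the sketch usable in the turnstile setting described above.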

4.3 Anomaly detection

The goal of the framework proposed in this paper is to facilitate the application of statistical learning models to streams of TCP packet headers, with the aim of identifying network anomalies and their sources. The framework supports any model which can be formulated in such a way that a measure of divergence from the normal state (or anomaly score α(x)) is defined. It is worth noting that some models are better suited to some types of anomalies than others (illustrated in Figure 8 below); in this paper we restrict the analysis to a rather simple method, designed to identify so-called global point anomalies.

Figure 8: Anomaly types; source: [13]

Figure 9: Learning models; source: [14]

The model chosen to illustrate the implementation of the ITF-NAD is derived from a semi-supervised learning algorithm, meaning that we assume there to be a dataset D_train consisting of normal observations used to train the model, as well as a dataset D_test consisting of points to be evaluated – as illustrated in Figure 9.

4.3.1 k-Nearest Neighbour Distance

The so-called k-nearest neighbours global anomaly detection score (kNN) (not to be confused with the kNN classifier) is defined as the average Euclidean distance of a point to its k nearest neighbours. In the following case study, we consider a semi-supervised version of this model, where the distance is calculated against a baseline of training samples.

This is probably one of the simplest and most intuitive anomaly detection models, yet it has not been shown to be systematically outperformed by more complex models – e.g. see Skvara's comparative evaluation of generative and distance-based models in [28]. The anomaly score function α_nn(x) is defined as follows:

\alpha_{nn}(x) = \frac{1}{k} \sum_{x_i \in N_k} d(x, x_i),

where N_k \subset D_train is the set of the k nearest neighbours of x \in D_test. The main drawback of this method is that it requires the user to arbitrarily set the number k of neighbours to consider; as a rule of thumb, a value in the range [10, 15] is used [14]. Given some threshold value τ, we classify any point x as anomalous if α_nn(x) > τ, and as normal otherwise.
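A compact version of this score and decision rule (our sketch, using brute-force distances):

```python
import numpy as np

def knn_score(x, train, k: int = 10) -> float:
    # alpha_nn(x): mean Euclidean distance from x to its k nearest
    # neighbours in the normal training sample.
    dists = np.linalg.norm(train - x, axis=1)
    return float(np.sort(dists)[:k].mean())

def is_anomalous(x, train, tau: float, k: int = 10) -> bool:
    # Classify x as anomalous iff its score exceeds the threshold tau.
    return knn_score(x, train, k) > tau
```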


4.3.2 Performance Measures

In order to estimate the performance of the above anomaly detection model, several measures can be computed, which provide information on key characteristics of binary classifiers. These are based on True Positives (TP) and False Positives (FP), representing correctly and incorrectly predicted anomalous points respectively, as well as True Negatives (TN) and False Negatives (FN), representing correctly and incorrectly predicted normal points respectively. The measures considered in this paper include:

• The Sensitivity = TP / (TP + FN), also called the True Positive Rate (TPR) or Recall: the ratio between detected anomalies and the total number of anomalies.

• The Specificity = TN / (TN + FP) = 1 − FPR (False Positive Rate): the ratio of correctly classified normal points to the total number of normal points. Conversely, the FPR is the ratio of misclassified normal points, an important indicator for IDSs given the high cost false alarms entail for web services.

• The Precision = TP / (TP + FP) = 1 − FDR (False Discovery Rate): the probability of correct classification when detecting an anomaly.

• The Accuracy = (TP + TN) / (TP + TN + FP + FN): the ratio between correctly classified points and the total number of observations considered.

Furthermore, we consider the so-called Area Under the Curve (AUC) score – presented in the majority of the articles reviewed as a measure of model quality – which is illustrated in Figure 10. The orange curve, known as the Receiver Operating Characteristic (ROC) curve, represents the relationship between the Sensitivity and Specificity of our model for different threshold values τ. The AUC is the area under the ROC curve and represents the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative one. The higher the AUC, the better our model is at distinguishing anomalous points from ones that are considered normal.

Figure 10: ROC example; source: [1]

Another popular performance metric is the so-called F1 score, which constitutes an overall measure of model accuracy combining precision and recall:

F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}.

The measure is bounded between 0 and 1, where a good F1 score (close to 1) represents a situation where the model correctly identifies anomalies while keeping false alarms to a minimum.
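For completeness, the measures above expressed as one small helper (ours):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    # Confusion-matrix summaries defined in Section 4.3.2.
    sensitivity = tp / (tp + fn)            # TPR, recall
    specificity = tn / (tn + fp)            # 1 - FPR
    precision = tp / (tp + fp)              # 1 - FDR
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}
```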
