
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Master thesis, 30 ECTS | Datateknik

2018 | LIU-IDA/LITH-EX-A--18/015--SE

Flow Classification of Encrypted Traffic Streams using Multi-fractal Features

Erik Areström

Supervisor: Niklas Carlsson
Examiner: Niklas Carlsson



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Erik Areström


Abstract

The increased usage of encrypted application layer traffic is making it harder for traditional traffic categorization methods like deep packet inspection to function. Without ways of categorizing traffic, network service providers have a hard time optimizing traffic flows, resulting in worse quality of experience for the end user. Recent solutions to this problem typically apply some statistical measurements on network flows and use the resulting values as features in a machine learning model. However, by utilizing recent advances in multi-fractal analysis, multi-fractal features can be extracted from time series via wavelet leaders and be used as features instead. In this thesis, these features are used exclusively, together with support vector machines, to build a model that categorizes encrypted network traffic into six categories that, according to a report, account for over 80% of the mobile traffic composition. The resulting model achieved an F1-score of 0.958 on synthetic traffic while only using multi-fractal features, leading to the conclusion that incorporating multi-fractal features in a traffic categorization framework implemented at a base station would benefit the traffic categorization of such a framework.


Acknowledgments

I would like to extend my thanks to the following persons who assisted me throughout the process of writing this thesis:

• Niklas Carlsson, my examiner, for all your invaluable feedback,
• Peter Alvarsson, my supervisor, for all the help received at Ericsson,
• Eric Henziger, my opponent, for all your help and feedback,
• Peter Keijser Tullstedt, my classmate, for your help in the conceptual phase, and to
• Elin Areström, my wife, for your help with proofreading.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Contributions
  1.5 Delimitations

2 Theory
  2.1 TCP
  2.2 HTTP
  2.3 TLS/SSL
  2.4 Packet classification
  2.5 Fractals
  2.6 Multi-fractal systems
  2.7 Wavelet leaders multi-fractal formalism
  2.8 Wavelets
  2.9 Principal component analysis
  2.10 Neighbourhood component analysis
  2.11 t-distributed stochastic neighbour embedding
  2.12 Machine learning classifiers

3 Method overview
  3.1 Data collection
  3.2 Baseline model and preliminary analysis
  3.3 System inputs and evaluation

4 Data collection
  4.1 Traffic categorization
  4.2 Final categories
  4.3 Data generation
  4.4 Data collection
  4.5 Data processing

5 Baseline model and preliminary analysis
  5.1 Determining sampling rate and duration
  5.2 Feature extraction
  5.3 Feature selection
  5.4 Building the model
  5.5 The initial framework

6 System inputs and evaluation
  6.1 Evaluation by F1-score
  6.2 Improving the framework by selecting wavelet
  6.3 Improving the framework for real-time performance
  6.4 Comparison of methods for building the time series
  6.5 Impact of duration
  6.6 The final framework parameters
  6.7 Evaluation of final framework
  6.8 Visualising the data
  6.9 Sensitivity analysis

7 Further system insights
  7.1 Incorporation of non-fractal features
  7.2 Video on demand versus live video

8 Discussion
  8.1 Results
  8.2 Practical implications of the result
  8.3 Method
  8.4 The work in a wider context
  8.5 Related works

9 Conclusion
  9.1 Research question 1
  9.2 Research question 2
  9.3 Research question 3
  9.4 Further work

Bibliography


List of Figures

2.1 Visualisation of the Mandelbrot set.
2.2 A mono-fractal and a multi-fractal. In the multi-fractal, the lower part's width is doubled in each iteration.
2.3 The singularity spectrum of a mono-fractal signal (in red) and a multi-fractal signal (in blue).
2.4 The structure obtained from the absolute values of the coefficients of each level.
2.5 The time neighbourhood, λ', for the wavelet coefficient marked with a blue circle.
2.6 The greatest elements in each dyadic interval λ, circled in blue.
2.7 The wavelet leaders of level 2^3. The wavelet leaders are the greatest elements in each of the level's time neighbourhoods.
2.8 Some of the wavelets in the Daubechies family. The order of the wavelet corresponds to the number of vanishing moments.
2.9 Some of the wavelets in the symlet family.
2.10 Some of the wavelets in the coiflet family.
2.11 The discrete (1-dimensional) wavelet transform.
2.12 The maximum margin separating hyperplane between two 2-dimensional data sets.
2.13 Example of a kernel function φ enabling linear separation of two data sets.
3.1 A high level overview of the method.
4.1 The set-up used for data collection.
4.2 The number of collected samples per class and each class' application composition.
4.3 An example of a time series; each timeslot is filled with the number of packets arriving during that timeslot.
5.1 The average (D(h(q=5)), h(q=5)) point for each class.
5.2 The feature weights for the features after performing NCA with optimized parameters.
5.3 The progress of the Bayesian optimization algorithm, per iteration, for 20 iterations.
5.4 The initial framework design.
6.1 Diagram of a fingerprinting method, creating an IP fingerprint.
6.2 The resulting confusion matrix from the first evaluation of the framework.
6.3 t-SNE visualisation of the chosen multi-fractal features.
7.1 The resulting confusion matrix from evaluating the framework with the addition of two non-fractal features.
7.2 The resulting confusion matrix from evaluating the model.
7.3 t-SNE visualisation of the chosen multi-fractal features.
7.4 Welch power spectral density estimate of the two classes' mean time series.


List of Tables

4.1 The chosen websites and the category of the website.
5.1 The result of summing the pairwise distance between each category's average (D(h), h(q)) and their corresponding sampling duration and sampling rate.
5.2 The features with a feature weight over 0.01.
5.3 The resulting parameters for building the model after applying the Bayesian optimization algorithm.
6.1 The resulting wavelet parameters and the resulting framework's F1-score.
6.2 F1-score, precision and recall of the framework for different methods for building the time series.
6.3 The impact of the duration t on the F1-score, while fixing the number of timeslots to 20000.
6.4 The parameters of the final framework.
6.5 The F1-scores obtained after rebuilding and evaluating the model ten times.
6.6 The average F1-score, precision, and recall, per classifier after rebuilding and evaluating the model ten times.
6.7 The resulting F1-scores of the model built with the original data when scaling the evaluation set.
6.8 The resulting F1-scores of the model built with the original data when moving each packet in the evaluation set a number of steps, according to a uniform and normal distribution dependent on the parameter n.
7.1 The number of collected samples for each class, and their composition.
7.2 The resulting parameters for building the SVM model, after applying the Bayesian optimization algorithm.


1 Introduction

1.1 Motivation

There is a trend in the IT industry that application level traffic between the end user's application and the provided service is becoming encrypted. According to the 2016 Sandvine report [25], the traffic composition of North America's mobile networks consists of 64.52% encrypted data. Traffic that is transmitted over a network and that has been encrypted with an adequate encryption algorithm cannot be decrypted unless the encryption key is known. Transport layer security (TLS) is a cryptographic protocol that encrypts the data sent between two applications so that the packet cannot be correctly read by a third party. Packets sent over the internet typically pass through many gateways before arriving at their destination. Usage of TLS ensures confidentiality of a packet's data, i.e. that unauthorized gateways relaying the packet, or persons eavesdropping on the channel, cannot learn its contents.

Mobile network traffic passes through a base station before arriving at its destination. The base station relays the sent data to its destination but needs to prioritize how much of its bandwidth to allocate. From a radio network perspective, it is necessary to classify traffic in order to provide clients with the best service available. Encrypted application streams provide a new challenge in this technological area since the information in the packets that have been sent is encrypted and thus hard for an operator to classify. Without this classification, optimizations cannot be made, which in turn can affect the client's quality of experience negatively. One example that shows the benefit of being able to identify video streams is a paper by Krishnamoorthi et al. [33]. The paper shows that, by introducing their cap-based framework, the quality of experience of users can be improved while reducing wasted bandwidth.

Recent research indicates that it is still possible to do these classifications by other means than the previously used methods that were only applicable to non-encrypted data. Muehlstein et al. [45] managed to identify the operating system, browser and application of an unknown host with an accuracy of 96%, by looking at statistical features of encrypted traffic flows. Another example is a paper by Wright et al. [80] where the authors were able to correctly identify the language used on an encrypted VoIP traffic channel. Other recent classifications of application behaviour include, for example, a paper by Krishnamoorthi et al. [34], where the authors manage to estimate the buffer conditions of an encrypted adaptive video streaming application, and a paper by Reed et al. [52], where it is shown that it is still possible to identify exactly what Netflix video you are watching, even when the data is encrypted, by analysing the size of the video segments sent. Shi et al. [58] managed to classify traffic for some popular unencrypted application layer protocols using their multi-fractal feature extraction and selection method. Since that method does not utilize application layer data, applying a variant of the method on encrypted data could enable classification.

1.2 Aim

The purpose of this thesis is to study mechanisms and algorithms for categorizing encrypted network traffic. The result of this should be used to thoroughly investigate one method and create a prototype framework that utilizes this method. The resulting prototype framework's accuracy should then be measured by comparing the classified data to a ground truth established from analysing the same dataset without encryption. To improve the time needed to perform the categorization, and the accuracy of the result, multiple possible optimizations must be investigated. The aim is that the resulting framework can serve as a proof of concept which shows that classification of encrypted traffic is possible with the chosen method. Such a framework could pave the way for an implementation of the concept at a base station, which could improve the quality of experience for the end users.

1.3 Research questions

1. How can we develop a framework for automatic classification of encrypted mobile traffic streams into categories depending on the type of application, using the multi-fractal features of encrypted traffic flows, to achieve real-time performance?

2. How does the developed framework perform, with regards to precision and recall, as measured by the achieved F1-score?

3. How do the resulting multi-fractal features, for different methods of time series generation, impact the F1-score of the framework?

1.4 Contributions

In this thesis the research questions are answered by developing a framework that labels network flows according to which of six categories they belong to, based on the multi-fractal features of the flows. The framework is then evaluated and improved.

1.5 Delimitations

In this thesis only encrypted mobile data is considered, and implementing or evaluating the actual optimizations is not within the scope of this thesis. The resulting framework only acts as a prototype, and the prospect of an actual implementation at a base station is only discussed. No measurements of the impact on quality of experience for users are made.


2 Theory

This section describes technology that is relevant to the study. It begins with describing basic internet technology that is needed to understand the more advanced theory and the study as a whole. Some techniques for packet classification are described as well as related work in which these techniques are used. There are also sections containing the basics of fractals and wavelets as well as a section explaining some theory behind the wavelet leaders multi-fractal formalism which is used in this thesis. An explanation of principal component analysis and neighbourhood component analysis follows, which is needed to understand the feature selection step and the t-SNE visualisation. The theory section ends with a basic description of some of the machine learning techniques considered in this thesis.

2.1 TCP

The Transmission Control Protocol (TCP) [72] is a highly reliable host-to-host networking protocol implemented at the transport layer which allows a reliable connection over an unreliable channel. This reliability is implemented by returning a positive acknowledgement packet (ACK) whenever a packet is received. A sequence number is attached to every packet, and the corresponding ACK packet will have the same sequence number. This means that the sender has a guarantee that the receiver has received a packet if they receive the corresponding ACK in return. TCP also includes congestion control, meaning the transmission rate adapts to the available bandwidth of the communication channel between the two hosts. The general idea of this is that the sender starts by sending one packet in a transmission. As long as the receiver does not experience packet loss, the congestion window is increased at a certain rate every transmission. As the congestion window increases, the transmission rate will reach the maximum bandwidth and packets will be lost. When this happens, the congestion window will be lowered, typically by half.
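The window dynamics described above can be illustrated with a toy simulation (a simplification that ignores slow start, retransmission timeouts and other details of real TCP implementations; the parameters are arbitrary):

def congestion_window_trace(rounds, capacity, increase=1.0):
    # Toy additive-increase/multiplicative-decrease behaviour: grow the window
    # while no loss occurs, halve it once the channel capacity is exceeded.
    cwnd, trace = 1.0, []
    for _ in range(rounds):
        trace.append(cwnd)
        if cwnd > capacity:      # transmission rate exceeds the available bandwidth: loss
            cwnd = cwnd / 2      # congestion window lowered, typically by half
        else:
            cwnd += increase     # increased at a certain rate every transmission
    return trace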

To allow for multiple connections on the same host at the same time, TCP provides a set of ports within the host. The port and network address together form a socket, which serves as a process' interface to TCP and the transport layer. Some well-known port numbers are generally only used by specific services [9]. For instance, port 22 is used for SSH, port 25 is used for the Simple Mail Transfer Protocol (SMTP), and port 80 is used for the Hyper Text Transfer Protocol (HTTP), which is described in more detail in the following section.


2.2 HTTP

The Hyper Text Transfer Protocol [21] is the application layer protocol used for the largest internet application: the World Wide Web. HTTP uses TCP as its underlying transport layer because the properties of TCP fit the requirements of the World Wide Web well. For instance, it is more valuable to ensure that a web page is served in its entirety than to serve it fast. Many services that traditionally used their own protocols, for instance e-mail, are today often using HTTP instead. Gmail and Hotmail are examples of web services that serve email to the users via HTTP instead of SMTP.

HTTP does not implement any security. However, if a secure underlying transport layer protocol is used, the HTTP communication will also be secure. This is called secure HTTP or HTTPS and is covered in the next section.

2.3 TLS/SSL

The Secure Sockets Layer (SSL) [69] protocol was created by Netscape as a secure layer on top of TCP, or another reliable transport channel. Since TCP is the de facto standard for reliable transmission over the internet, this thesis only considers TCP as the underlying transport channel for Transport Layer Security (TLS) [68]. TLS is an evolution of SSLv3.0, and is the modern equivalent of SSL.

TLS consists of two layers: the TLS Record Protocol and the TLS Handshake Protocol. The TLS Handshake Protocol allows the client and server to negotiate security parameters, such as which encryption algorithm to use and which cryptographic keys to encrypt and decrypt the message with. The TLS Record Protocol is at the lowest layer of TLS, right on top of TCP, and enforces the cryptographic properties of the transmission. In other words, it fragments the data into manageable blocks and encrypts the data.

Usage of application layer encryption nullifies the effect of some packet classification techniques. The following section describes some of the common techniques that can be used to perform packet classification.

2.4 Packet classification

Packet classification is a technique in which incoming packets are marked as belonging to a class, for example video, mail, etc. This can be done either in real time, or by first collecting data and then performing classification.

Different applications of packet classification have different requirements. Applications where the goal is to redistribute or reduce traffic are examples of applications needing real-time packet classifications, while statistical applications can afford to perform classification passively [74].

Two common ways of doing packet classification are to utilize Deep Packet Inspection, where the data part of the packet is inspected [16], or to look at statistical features of the network flows. This is typically combined with some machine learning algorithm, where a computer learns to do classifications by finding patterns in the packet's payload or in the features of the network flows [74]. Other methods use knowledge of the protocol and the ports it uses, either combined with machine learning [1], or together with another observed factor like packet size distribution [36].


Deep Packet Inspection

Deep Packet Inspection (DPI) [16] works by inspecting the data payload of the packet as opposed to looking at other factors like, e.g., the arrival time of the packet. A DPI system looks for signatures (byte strings) in the payload of a packet that match an entry in the user's predefined list of signatures. It is then up to the user to decide which action is appropriate when the DPI system finds a match.
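Conceptually, the matching step can be sketched as follows (the signature byte strings below are illustrative placeholders, not an actual DPI signature list):

def classify_packet(payload: bytes, signatures: dict) -> str:
    # Return the label of the first known signature found in the payload,
    # leaving the choice of action to the user of the DPI system.
    for label, signature in signatures.items():
        if signature in payload:
            return label
    return "unknown"

# Hypothetical signature list
signatures = {"http-get": b"GET ", "tls-handshake": b"\x16\x03\x01"}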

There are some problems associated with the usage of DPI systems for classification. One of these is that when a targeted application is updated, its signature may change, and this means that the list of signatures needs to be updated as well. This problem can be reduced by applying an automatic verification system [73], but this only helps with detecting when a signature has changed. Another problem is that DPI can be too slow for doing real-time classification [83]. The largest problem for general classification of packets with DPI systems, however, is the usage of application layer encryption. Encryption effectively nullifies the effect of DPI systems since the same cleartext signature will be transformed to different outputs every time. Because of this, other methods are needed.

Port-based classification

Port-based classification requires knowledge of the protocol and what port it uses. In a paper by Karagiannis et al. [32], the authors identify peer-to-peer (P2P) file-sharing traffic using known ports for popular P2P protocols. They then reverse engineered the protocols to find signatures of the payloads generated in the protocols. The signatures were then used for deep packet inspection, and the authors could conclude that P2P traffic did not decline (as previously believed) but rather that the P2P file-sharing protocols had changed to use arbitrary ports instead. Usage of arbitrary ports limits the usefulness of classification by knowledge of the ports used [70], which is why this method is not chosen in this thesis.

Flow based classification

Flow classification [79] is a method that measures a connection and uses properties of the data flow to classify connections into categories. Instead of looking at the contents of the packets (as in DPI), statistical features of a flow can be studied with machine learning techniques. For this to be possible there is a need to find suitable independent statistical features.

Features of traffic flows

Properties of data flows that have been used in previous work for feature extraction include bit-rate (conveyed bits per unit of time), inter-packet gap (time between packets in the same flow) and payload length. Bonfiglio et al. [10] used bit-rate, inter-packet gap and payload length to characterize traffic of the Voice-over-IP (VoIP) protocol used by Skype [59]. Chen et al. [13] also studied Skype network traffic flows, but with regard to user satisfaction. They looked at the effect of bit-rate, round-trip times (time for a packet to travel back and forth) and jitter (standard deviation of bit rate) and related those factors to user experience factors like responsiveness. The result was a model for quantifying VoIP user satisfaction based on network factors.

Pan et al. [48] studied the flows in the HTTPS handshake process to fingerprint encrypted applications. The authors used a Markov Chain probabilistic model and managed to reach a classification accuracy of 90% on their experiment data.


Shi et al. [58] used multi-fractal analysis to better represent the nonlinear traits of internet traffic, and extracted the multi-fractal features of different traffic classes. They managed an average accuracy of over 90% when classifying an unencrypted data set into categories like P2P and POP/IMAP, compared to what the DPI system categorized it as. Usage of multi-fractal features to model traffic flows is the method chosen in this thesis due to the possibility that this approach could work on encrypted streams. The following sections thus deal with the theory of fractals and multi-fractal systems.

2.5 Fractals

This section covers the basics of fractal sets and the concept of fractal dimension, which is required to understand the extension that is multi-fractal systems. To understand the definition of fractals that is used throughout the thesis, a grasp of the Hausdorff dimension is essential. The Hausdorff dimension itself is defined from something called the Hausdorff measure.

Hausdorff measure

The Hausdorff measure [40] is a generalization of measures like length, area, and volume that also manages to measure spaces whose dimension can be non-integer. The general idea is to cover a space with balls with smaller and smaller radius, until a limit is met. Mathematically, the Hausdorff measure, $\mathcal{H}^{D_T}(S)$, of a set $S$ in $\mathbb{R}^n$ is defined as:

\[
\mathcal{H}^{D_T}(S) = \lim_{\delta \to 0} \inf \sum_i \gamma(D_T)\,\delta_i^{D_T}, \tag{2.1}
\]

where $D_T$ is the topological dimension of the space formed by $S$. Then, $\delta_i$ can be seen as the radius of each ball that fills the space formed by $S$. For example, to measure a surface (topological dimension 2) the expression $\gamma(D_T)\delta_i^{D_T}$ would be equal to $\pi\delta_i^2$.

Hausdorff dimension

The Hausdorff dimension [40] $D_H$ of a set $S$ in $\mathbb{R}^n$ is the $D_T$ in the Hausdorff measure of $S$ that yields:

\[
\mathcal{H}^{D_T}(S) =
\begin{cases}
\infty, & D_T < D_H \\
0, & D_T > D_H.
\end{cases} \tag{2.2}
\]

This means that for Euclidean spaces the Hausdorff dimension is exactly the same as the topological dimension, i.e. $D_H = D_T$. However, from this definition the definition of fractals can be derived.

Fractals

A fractal is defined as a set $S$ in $\mathbb{R}^n$ for which the Hausdorff dimension $D_H$ strictly exceeds the topological dimension $D_T$:

\[
D_H > D_T. \tag{2.3}
\]
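A standard example (not taken from the thesis) is the middle-thirds Cantor set, which has topological dimension $D_T = 0$ but Hausdorff dimension

\[
D_H = \frac{\log 2}{\log 3} \approx 0.63 > D_T,
\]

so it is a fractal in the sense of (2.3).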

Fractals [40] are shapes that can be used to describe patterns, commonly found in natural phenomena, which are either too fragmented or too irregular to be described by classical geometrical shapes. This is due to nearly all fractals exhibiting scale-invariant behaviour, i.e. patterns are repeated regardless of scale. A famous example of this can be seen in Figure 2.1. The figure shows the same pattern repeating itself at different scales. Measuring the length of a shape that exhibits such behaviour is the problem that prompted the generalization of classical measurements, which led to the Hausdorff measure.


Figure 2.1: Visualisation of the Mandelbrot set.

Scale-invariant behaviour can be found in fractals that are used to describe many natural phenomena such as the length of coastlines [40] and earthquakes [61]. Fractal analysis has also been applied in fields like image analysis [8] and medicine [35]. However, when applying fractals for measuring the length of coastlines, Mandelbrot noted that "coastlines of different degree of irregularity tend to have different fractal dimensions" [40]. In some systems, e.g. network traffic [53], the fractal dimension can change over time. The next section deals with systems that exhibit such behaviour.

2.6 Multi-fractal systems

A multi-fractal system [38] is an extension of a fractal system in which the system is described by a spectrum of dimensions as opposed to a mono-fractal system's single dimension. An example that shows the difference between a mono-fractal and a multi-fractal can be seen in Figure 2.2. The figure shows a mono-fractal that on each iteration splits into two parts that each are 1/3 of the height of the previous iteration. The multi-fractal does the same except that, for the lower of the two parts, the width is doubled.


Figure 2.2: A mono-fractal and a multi-fractal. In the multi-fractal, the lower part’s width is doubled in each iteration.

To analyse a multi-fractal system, estimating the so-called singularity spectrum is essential. The singularity spectrum consists of the Hölder exponents, $h$, and the fractal dimension at those points, $D(h)$. The fractal dimension has been explained in the previous sections; the following section explains what the Hölder exponents of a signal represent.

Hölder exponents

The Hölder exponents [76] of a signal characterize the signal's regularity. The spread of Hölder exponents in a spectrum can be used to deduce whether a signal is mono-fractal or multi-fractal. Since a mono-fractal signal displays the same regularity regardless of time, the resulting width of the Hölder exponents is narrow, and the signal could instead be characterized by a single Hölder exponent. Multi-fractal signals hence have a wider range of Hölder exponents. An example of plotting the singularity spectrum of a mono-fractal and a multi-fractal signal can be seen in Figure 2.3. The time series (in blue) is an example of a multi-fractal signal, while the brown noise (in red) is an example of a mono-fractal signal. This can be seen from the spread of Hölder exponents, which is larger for the multi-fractal signal.
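For reference, the standard pointwise definition (stated here for completeness; the notation follows common usage rather than the cited source): a signal $X$ belongs to $C^{\alpha}(t_0)$ if there exist a constant $C > 0$ and a polynomial $P$ of degree less than $\alpha$ such that $|X(t) - P(t - t_0)| \leq C|t - t_0|^{\alpha}$ in a neighbourhood of $t_0$. The Hölder exponent at $t_0$ is then

\[
h(t_0) = \sup\{\alpha : X \in C^{\alpha}(t_0)\},
\]

so smaller values of $h(t_0)$ correspond to more irregular local behaviour.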


Figure 2.3: The singularity spectrum of a mono-fractal signal (in red) and a multi-fractal signal (in blue).

The singularity spectrum can be very different for signals that are similar in other representations. If a signal exhibits multi-fractal behaviour that is unique for that class of signals, the values for the multi-fractal spectrum will be distinct for that class. This would imply that those values could be used as a basis to categorize a signal. This is why estimating the singularity spectrum is important in this thesis. The following section deals with how the singularity spectrum can be estimated using the wavelet leaders multi-fractal formalism.

2.7 Wavelet leaders multi-fractal formalism

The wavelet leaders multi-fractal formalism (WLMF) [28] is a method for estimating the singularity spectrum from a time series by performing the wavelet transform on the signal. From the resulting wavelet coefficients, the set of so-called 'wavelet leaders' can be obtained. Together, the wavelet leaders for multiple levels form the structure function, which is used to obtain the scaling exponents via linear regression. The scaling exponents can, in a last step, be used to acquire the singularity spectrum. The following sections go through this process in more detail, from a top-down perspective. For a more complete theoretical explanation, and a comparison between the WLMF and other multi-fractal formalisms, the reader is referred to Jaffard et al. [28]. For a more practical approach to the WLMF, the reader is referred to Wendt et al. [77].

Estimating the singularity spectrum

The singularity spectrum $D(h(q))$ displays the distribution of scaling exponents $\zeta(q)$ for a signal, which can be seen as a measure of how the signal's local regularity varies with time. Performing the Legendre transform on the scaling exponents yields the singularity spectrum [77]:

\[
D(h) = \inf_{q \neq 0}\bigl(1 + qh - \zeta(q)\bigr). \tag{2.4}
\]

The values of $q$ determine how many points of the spectrum are estimated. However, the numerical computations become unstable for values of $q$ too far from zero [77].

Scaling exponents

The scaling exponents $\zeta(q)$ are defined as:

\[
\zeta(q) = \liminf_{2^j \to 0} \left( \frac{\log_2 S(q, j)}{j} \right), \tag{2.5}
\]

where $S(q, j)$ is the structure function, depending on the moment order $q$ and the analysis scale $j$. In practice, due to the complexity of the Legendre transform, $\zeta(q)$ can be estimated by linear regression, yielding:

\[
\zeta(q) = \sum_{j = j_1}^{j_2} w_j \log_2 S(q, j), \tag{2.6}
\]

where $w_j$ are the weights used in the regression [77].

Structure function

The structure function $S(q, j)$ is defined as:

\[
S(q, j) = \frac{1}{n_j} \sum_{k=1}^{n_j} |L_x(j, k)|^q, \tag{2.7}
\]

where $L_x(j, k)$ are the 1-D wavelet leaders by scale and $n_j$ is the number of wavelet leaders at each scale [77].

Wavelet leaders

The 1-D wavelet leaders [28] $L_x(j, k)$ are defined as:

\[
L_x(j, k) = \sup_{\lambda' \in 3\lambda_{j,k}} |d_x(j, k)|, \tag{2.8}
\]

where $d_x(j, k)$ are the discrete wavelet transform (DWT) coefficients of the signal. While $j$ represents the analysis scale and $k$ the time, together they can form a matrix of wavelet coefficients such as in Figure 2.4. The $\lambda'$ in the definition represents a time neighbourhood in such a matrix, meaning a dyadic interval and its two neighbours, $3\lambda_{j,k} = \lambda_{j,k-1} \cup \lambda_{j,k} \cup \lambda_{j,k+1}$. An example of this can be seen in Figure 2.5, which shows the time neighbourhood in grey.


The fact that the interval is dyadic means that the following holds true [28]:

• A dyadic interval's length is a power of 2.
• Every dyadic interval is contained in another dyadic interval that is twice as big.

The process is now broken down into steps with an example. First, the wavelet leaders are obtained by computing the absolute value of every wavelet coefficient $d_x(j, k)$ on each scale $2^j$. Figure 2.4 shows an example of the structure obtained by this; the values used are only an example. In the example, the finest scale used has 16 wavelet coefficients associated with it, and each coarser scale has half the number of coefficients of the finer one. In this example only the wavelet leaders for level $2^3$ are calculated.

Figure 2.4: The structure obtained from the absolute values of the coefficients of each level.

Since the process operates on time neighbourhoods, an example of such a union of dyadic intervals is displayed in Figure 2.5. The figure shows the time neighbourhood for the wavelet coefficient marked with a blue circle. On the same level ($2^3$) there is another neighbourhood centred around the wavelet coefficient 2, which should be taken into account in the final step of computing the wavelet leaders.


Figure 2.5: The time neighbourhood, λ', for the wavelet coefficient marked with a blue circle.

Then, in the next step, the greatest element of each dyadic interval is found and all other values are ignored. This can be seen in Figure 2.6, where the greatest elements of each interval λ are circled.

Figure 2.6: The greatest elements in each dyadic interval λ, circled in blue.

The last step is then finding the greatest element of each time neighbourhood λ' in that level. The set of those values are then the wavelet leaders; in our example they are {14, 14}. Figure 2.7 shows this: the greatest value in each of the two time neighbourhoods, marked by blue and yellow arrows, are the wavelet leaders for level $2^3$.

To summarize the WLMF before moving on to wavelets: the WLMF enables the estimation of the singularity spectrum and the Hölder exponents from a time series. This is done by linear regression of the structure function, a function that is defined from the wavelet leaders. The set of wavelet leaders can in turn be obtained by calculating wavelet coefficients for multiple levels. The following section contains the theory behind wavelets and how the wavelet coefficients are calculated.


Figure 2.7: The wavelet leaders of level $2^3$. The wavelet leaders are the greatest elements in each of the level's time neighbourhoods.
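To make the estimation pipeline concrete, the following Python sketch computes wavelet leaders (Eq. 2.8), the structure function (Eq. 2.7) and a scaling-exponent estimate (Eq. 2.6, with uniform regression weights) from given DWT detail coefficients. It is a naive illustration, not the thesis's implementation; in practice an existing toolbox, such as the one accompanying Wendt et al. [77], would typically be used:

import numpy as np

def wavelet_leaders(detail_coeffs):
    # detail_coeffs[j] holds the DWT detail coefficients d_x(j+1, k) for scale
    # 2^(j+1), finest scale first (each coarser scale has half as many entries).
    abs_c = [np.abs(c) for c in detail_coeffs]
    leaders = []
    for j, cj in enumerate(abs_c):
        width = 2 ** (j + 1)                            # length of a dyadic interval at this scale
        Lj = np.zeros(len(cj))
        for k in range(len(cj)):
            lo, hi = (k - 1) * width, (k + 2) * width   # 3*lambda_{j,k}: the interval and both neighbours
            best = 0.0
            for jp in range(j + 1):                     # all scales finer than or equal to 2^(j+1)
                w = 2 ** (jp + 1)
                for kp, val in enumerate(abs_c[jp]):
                    if kp * w >= lo and (kp + 1) * w <= hi:   # dyadic interval inside the neighbourhood
                        best = max(best, val)
            Lj[k] = best
        leaders.append(Lj)
    return leaders

def structure_function(leaders, q, j):
    # Eq. (2.7): average of |L_x(j, k)|^q over time at scale j (1-indexed).
    return np.mean(leaders[j - 1] ** q)

def scaling_exponent(leaders, q, j1, j2):
    # Eq. (2.6) with uniform weights: slope of log2 S(q, j) against j,
    # obtained by ordinary least-squares regression over scales j1..j2.
    js = np.arange(j1, j2 + 1)
    logS = np.log2([structure_function(leaders, q, j) for j in js])
    slope, _ = np.polyfit(js, logS, 1)
    return slope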

2.8 Wavelets

This section covers the basics of wavelets. The wavelet coefficients named in the previous section are obtained by the discrete wavelet transform of a signal. There are, however, multiple wavelets to choose from that all yield different wavelet coefficients when transformed. The next section starts with a general definition of a wavelet. For a more complete theoretical background the reader is referred to Ingrid Daubechies' "Ten lectures on wavelets" [15].

Wavelet definition

A wavelet, $\psi(t)$, is a signal that starts at an amplitude of zero, oscillates a number of times, and ends with an amplitude of zero [65]. For a wavelet with $N_\psi$ vanishing moments, $N_\psi \geq 1$, $N_\psi \in \mathbb{Z}$, the following holds true [76]:

\[
\int_{\mathbb{R}} t^k \psi(t)\, dt = 0, \quad k = 0, 1, \ldots, N_\psi - 1, \tag{2.9}
\]
\[
\int_{\mathbb{R}} t^{N_\psi} \psi(t)\, dt \neq 0. \tag{2.10}
\]
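As a worked example (not taken from the thesis), the Haar wavelet, $\psi(t) = 1$ for $0 \leq t < 1/2$, $\psi(t) = -1$ for $1/2 \leq t < 1$ and $0$ otherwise, has exactly one vanishing moment:

\[
\int_{\mathbb{R}} \psi(t)\, dt = \tfrac{1}{2} - \tfrac{1}{2} = 0,
\qquad
\int_{\mathbb{R}} t\,\psi(t)\, dt = \int_0^{1/2} t\, dt - \int_{1/2}^{1} t\, dt = \tfrac{1}{8} - \tfrac{3}{8} = -\tfrac{1}{4} \neq 0,
\]

so $N_\psi = 1$, i.e. it is the Daubechies wavelet of order 1.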

Wavelets are categorized by their number of vanishing moments, $N_\psi \geq 1$, $N_\psi \in \mathbb{Z}$, and their wavelet family. Three wavelet families which are considered in this thesis are introduced in the following sections.

Daubechies wavelets

A Daubechies wavelet [15] of order $n$ is a wavelet with $n$ vanishing moments. Some examples of wavelets in this family can be seen in Figure 2.8. Daubechies wavelets have the most vanishing moments for a given support. The wavelet $\psi(t)$ needs to be an elementary function with a compact time support for wavelet leaders to be applicable.


Figure 2.8: Some of the wavelets in the Daubechies family. The order of the wavelet corresponds to the number of vanishing moments.

Support

The support of a function $f$ is the set of elements of $f$'s domain that are not mapped to zero, i.e.:

\[
\operatorname{supp}(f) = \{x \in X \mid f(x) \neq 0\}, \tag{2.11}
\]

where $f : X \to \mathbb{R}$ [55].

Symlet

Symlet wavelets [15] are more symmetric than wavelets in the Daubechies family. Some examples of wavelets in the symlet family can be seen in Figure 2.9. As with Daubechies wavelets, the order of the wavelet corresponds to the number of vanishing moments.


Figure 2.9: Some of the wavelets in the symlet family.

Coiflet

Coiflet wavelets [15] are originally derived from the Daubechies wavelets but are almost symmetric. Some examples of wavelets belonging to the coiflet family can be seen in Figure 2.10. Coiflets are also characterized by their number of vanishing moments, which corresponds to the order of the wavelet.


Discrete 1-D wavelet transform

The discrete wavelet transform (DWT) of a signal $X[n]$ is characterized by its detail coefficients $Y_d$ (high frequency content) and by its approximation coefficients $Y_a$ (low frequency content). These coefficients are what is referred to as the wavelet coefficients $d_x(j, k)$ in Section 2.7.

$Y_d$ and $Y_a$ are obtained from:

\[
Y_d = (x * h) \downarrow 2, \tag{2.12}
\]
\[
Y_a = (x * l) \downarrow 2, \tag{2.13}
\]

where $h$ is the impulse response from a high-pass filter, $l$ the impulse response from a low-pass filter, $*$ is the convolution operator and $\downarrow$ is the subsampling operator [65].

Figure 2.11 displays a block diagram of this process. The low-pass and high-pass filters depend on the chosen mother wavelet. Due to the Nyquist–Shannon sampling theorem, downsampling by a factor of 2 is possible (half of the frequency content is removed by the filter).

Figure 2.11: The discrete (1 dimensional) wavelet transform.

The DWT preserves both frequency and time information, as opposed to the Fourier transform, which only reveals which frequency components exist in a signal, not at what times they occur.
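As a minimal illustration of Eqs. (2.12)-(2.13), one DWT level can be written as filtering followed by downsampling (the Haar filters and the input signal below are assumed example values, not taken from the thesis):

import numpy as np

def dwt_level(x, h, l):
    # One level of the 1-D DWT: convolve with the high-pass (h) and low-pass (l)
    # decomposition filters, then keep every second sample (downsampling by 2).
    detail = np.convolve(x, h)[1::2]   # Y_d = (x * h) downsampled by 2, Eq. (2.12)
    approx = np.convolve(x, l)[1::2]   # Y_a = (x * l) downsampled by 2, Eq. (2.13)
    return detail, approx

# Haar decomposition filters, used here only as an example
h = np.array([1.0, -1.0]) / np.sqrt(2.0)
l = np.array([1.0, 1.0]) / np.sqrt(2.0)
x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
detail, approx = dwt_level(x, h, l)    # 4 detail and 4 approximation coefficients

The approximation coefficients can then be fed into the next level, as in the cascade of Figure 2.11.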

2.9 Principal component analysis

This section covers principal component analysis (PCA), a method that is used in this study as a step in visualising the dataset. Singular value decomposition is the underlying method used to obtain the principal components.

Singular value decomposition

Singular value decomposition (SVD), $X = USV^\top$, of a real $n \times m$ matrix $X$, where $m$ is the number of samples and $n$ is the number of variables, is a decomposition of the original matrix into three parts. The first part, $U$, is an $n \times m$ orthonormal matrix. The second part, $S$, contains the singular values and is an $m \times m$ matrix. The third part, $V$, is an $m \times m$ orthonormal matrix.


These parts can be used to efficiently obtain the principal components of the matrix $X$ by calculating the covariance matrix $C$ of $X$ from $U$, $S$ and $V^\top$ instead:

\[
C = \frac{X^\top X}{n - 1} = \frac{V S U^\top U S V^\top}{n - 1} = \frac{V S^2 V^\top}{n - 1}. \tag{2.14}
\]

The principal components are now directly given by:

\[
XV = USV^\top V = US. \tag{2.15}
\]

Principal components

The principal components summarize the original data as a set of linearly uncorrelated variables. The $k$ principal components (the $k$ columns of $US$) that explain most of the variance in the dataset can then be chosen to represent the data instead. Principal components are used as a step in the visualization of the results. For selecting features, however, neighbourhood component analysis (NCA) was chosen over PCA due to better performance.
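A minimal sketch of this computation (using numpy; here the data matrix is assumed to be samples-by-variables, which differs slightly from the notation above):

import numpy as np

def principal_components(X, k):
    # PCA via SVD: centre each variable, decompose X = U S V^T, and return the
    # first k principal component scores, XV = US as in Eq. (2.15).
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]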

2.10 Neighbourhood component analysis

Neighbourhood component analysis (NCA) [18] is a method for feature selection that seeks to maximize the prediction accuracy of classification algorithms. The result of running the algorithm is the feature weight of each feature, i.e. how much it impacts the prediction accuracy. Irrelevant features can then be discarded. NCA depends on the regularization parameter, $\lambda$, which can be chosen by running an optimization algorithm like Limited-memory Broyden–Fletcher–Goldfarb–Shanno (Limited-memory BFGS) [37].

2.11 t-distributed stochastic neighbour embedding

t-distributed Stochastic Neighbour Embedding (t-SNE) [39] is a technique in which high-dimensional data points are embedded into a low-dimensional space depending on the multi-dimensional distance between the points. t-SNE thus enables visualisation of high-dimensional data. t-SNE is only used in this thesis so that the reader can see whether data points from the same class end up close to each other in the visualisation.

2.12 Machine learning classifiers

Classification models are built by applying a machine learning (ML) algorithm on a representative training set. This is called supervised learning, and the collected traffic data must be labelled before the ML algorithm is applied. Several ML algorithms exist that have been used for traffic flow classification [79].

Naive Bayes classifier

Naive Bayes [31] is a family of algorithms that produces a classifier by considering how much each feature contributes to the probability of an unknown object belonging to a class. Naive Bayes algorithms do not consider the fact that features can have correlation between each other, hence the ’Naive’.

Naive Bayes has been used for traffic classification before, for example by Moore et al. [44], who managed to classify internet applications with accuracy approaching 100%. However, this required tests against multiple criteria, which decreases performance, and their approach relies on a full-payload packet trace.


Support Vector Machine classifier

Support Vector Machine (SVM) [14] is a supervised learning model in which the training data is mapped to points in a space called a feature space. The feature space can be of very high dimension depending on the number of features. It is then divided in two by the hyperplane that gives the best separation between the two resulting spaces. Panchenko et al. [49] used SVM to identify web pages. The authors managed to bypass countermeasures in the form of the anonymous network Tor [71], which shows the effectiveness of SVM. The mathematical problem formulation follows:

For a set of points $X_j$, their categories $Y_j$, and a hyperplane $f(X) = X'\beta + b = 0$ of dimension $d$, where $X_j \in \mathbb{R}^d$ and $Y_j = \pm 1$, the best separating hyperplane is given by the $\beta$ and $b$ that minimize $\|\beta\|$ such that $Y_j f(X_j) \geq 1$ for all $X_j$ and $Y_j$. The support vectors are then defined as all $X_j$ for which $Y_j f(X_j) = 1$.

An example of support vectors can be seen in Figure 2.12. The figure displays two separable 2-dimensional datasets in red and in blue. The maximum margin separating hyperplane is in this case the filled line separating the two data sets. The support vectors for each set are circled.

Figure 2.12: The maximum margin separating hyperplane between two 2-dimensional data sets.

As some datasets cannot be separated by a linear function, other so-called 'kernel functions' exist.

Kernel functions

Kernel functions [57] are, in the context of SVM, functions that can be used to fit the maximum margin separating hyperplane in a transformed space instead. By transforming the space with a non-linear transformation, the two data sets can become linearly separable in the transformed space. To do this, the coordinates do not even need to be explicitly transformed. Instead, by computing the inner product between the data pairs and replacing the dot product with a kernel function, the maximum margin separating hyperplane can still be fitted. An example of a non-linear transformation, φ, that does this can be seen in Figure 2.13.


Figure 2.13: Example of a kernel function φ enabling linear separation of two data sets.

An example of a kernel function that is commonly used is the Radial Basis Function kernel (RBF) [20] that is defined as:

\[
K(x_i, x_j) = \exp\left(-\gamma \,\|x_i - x_j\|^2\right), \qquad \gamma > 0, \quad \gamma = \frac{1}{2\sigma^2}, \tag{2.16}
\]

where $\sigma$ is the standard deviation.

Bayesian optimization of support vector machines

Bayesian optimization [43] is a method that can be applied to machine learning algorithms to improve the resulting model by tuning the hyper-parameters used to build it [60]. In the case of SVM, these hyper-parameters can be, for example, the box constraint (the cost of misclassification) or the kernel scale.

Bayesian optimization can be applied in several ways; however, Bayesian optimization with the 'expected improvement' acquisition method has been shown to be able to converge near-optimally [12]. The 'expected improvement' method works by evaluating the expected amount of improvement in the objective function and only trying parameters that lower the objective.
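To illustrate how the classifier and the hyper-parameter tuning fit together, here is a hedged sketch using scikit-learn and scikit-optimize (these libraries, the parameter ranges, and the random data are assumptions for illustration; they are not the tooling or settings used in the thesis):

import numpy as np
from sklearn.svm import SVC
from skopt import BayesSearchCV
from skopt.space import Real

# Placeholder feature matrix (standing in for multi-fractal features) and class labels
X, y = np.random.rand(200, 6), np.random.randint(0, 6, 200)

# Bayesian optimization over the box constraint C and the RBF kernel scale gamma
search = BayesSearchCV(
    SVC(kernel="rbf"),
    {"C": Real(1e-3, 1e3, prior="log-uniform"),
     "gamma": Real(1e-4, 1e1, prior="log-uniform")},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)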


3 Method overview

This section contains a high-level overview of the three steps of the method. An overview of these steps and their substeps can be seen in Figure 3.1. To the right of each step there is a reference to the section that contains the details of that step.

Figure 3.1: A high level overview of the method.

3.1 Data collection

The first step in the method is finding which categories the framework should be able to differentiate between. After this is done, representative data for these categories is generated by automatic instrumentation of a smartphone, and the data is captured. The captured traffic is transformed into time series that represent when packets arrive within the first 20 seconds of each flow.
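A minimal sketch of this last transformation step (the number of timeslots here is a placeholder; the actual sampling rate and duration are determined later in the thesis):

import numpy as np

def packet_time_series(arrival_times, duration=20.0, n_slots=2000):
    # Count packets per timeslot during the first `duration` seconds of a flow.
    t = np.asarray(arrival_times) - arrival_times[0]   # time relative to the first packet
    t = t[t < duration]
    counts, _ = np.histogram(t, bins=n_slots, range=(0.0, duration))
    return counts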


3.2 Baseline model and preliminary analysis

In the second step, the singularity spectra of the time series are extracted via wavelet leaders. The mean values of the points of the spectrum are used to choose how long the time series should be and how many timeslots they should consist of. After this is decided, the singularity spectra of all classes are extracted and a feature selection algorithm is applied to remove irrelevant features. When this is done, a machine learning model is built and optimized, resulting in an initial framework.

3.3 System inputs and evaluation

The framework is improved by exploring the impact of system inputs on the evaluation score of the framework. The framework is then evaluated and the result is visualised.


4 Data collection

This chapter begins by explaining the process of finding a suitable categorization of network traffic. The methods for generating and collecting data are then described, followed by a section on how the collected data was processed.

4.1 Traffic categorization

First, there is a need to find a suitable way to categorize traffic. The following categories are chosen as a starting point for this process:

• real-time entertainment,
• web browsing,
• communication, and
• social networking.

The categories are justified by being the 4 largest traffic groups in the Sandvine report on peak period mobile traffic composition for North America [25]. According to the report, real-time entertainment accounts for 35.39% of the traffic, social networking for 23%, web browsing for 12.55%, and communication for 9.15%. These 4 categories account for over 80% of the traffic composition.

Due to the first four categories' broadness, they are expanded into the following, specialized, categories:

• video streaming,
• audio streaming,
• gaming,
• web browsing,
• bulk download,
• audio communication,
• text communication, and
• social networking.

Video and audio streaming, as well as gaming, are subsets of real-time entertainment. These subcategories are more homogeneous and are hence easier to identify. However, each of the final chosen categories needs to either have optimizations available for that category, or help improve the model by allowing it to identify more data streams.

Video streaming

Video streaming traffic includes traffic from applications like YouTube [82] and Netflix [46]. All of these applications buffer (preload) the video while playing it. If the buffer empties, the video stalls; hence, making sure that the buffer never empties by allocating enough bandwidth is an example of an available optimization. This category was kept.

Audio Streaming

Traffic from applications like Spotify [64] and Soundcloud [62] belongs to the audio streaming category. These applications' performance can also be optimized by avoiding stalls. However, since neither of these applications has implemented HTTPS for its data streams, this category was discarded.

Gaming

Gaming was discarded due to it being a small part of real-time entertainment [25] and due to difficulties in generating data that is representative enough for this category.

Web browsing

Web browsing is such a large part of the total traffic composition that identifying this category could help rule out that a traffic flow belongs to another category. This category could potentially be part of an 'other' category that the model can distinguish more important categories from.

Bulk download

Bulk download is a category for data from applications that download files to a user's local storage. This category is kept due to it being very different from the other categories, and due to the ease of collecting such data.

Audio communication

Audio communication traffic includes traffic from applications like Skype [59] and Discord [22]. Due to the International Telecommunication Union recommendation which states that there should not be delays over 150 milliseconds for audio communication [27], and since the quality of service has a big impact on the users' quality of experience [30], audio communication can be said to have real-time requirements. Due to these requirements this category is kept.

Text communication

Text communication (chat) is possible in applications like Messenger [23], Skype [59] and Discord [22]. This category is interesting due to, for example, encrypted chat applications being banned in some countries.


Social networking

Social networking traffic includes traffic from applications like Facebook [17] and Instagram [26]. No optimizations could be found; however, since it is such a large part of the total traffic composition, identifying this category could help rule out that a traffic flow belongs to another category. This category could potentially be part of an 'other' category that the model can distinguish more important categories from.

4.2 Final categories

After discarding some categories, the process was concluded with the following resulting categories:

• video streaming,
• web browsing,
• bulk download,
• audio communication,
• text communication, and
• social networking.

Each category is chosen due to its common occurrence in mobile network traffic, the distinct features of the category's members, and either the prospect of available optimizations or the possibility of it increasing the model's reliability by allowing more types of data to be identified.

4.3 Data generation

The chosen method requires multiple data sets. For each of the 6 final categories there is a need for a representative dataset to use for training. To evaluate the performance of the built framework, a mixed, representative set is also required. Manually producing data is a labour-intensive process, which is why an automated approach to generating data is preferred.

Automated data generation

There are a number of possible ways to automate data generation. Taylor et al. [67] used programs to simulate user input, which ran commands on the Android Debug Bridge (ADB) [2]. The ADB then forwarded the commands to a phone, connected in debug mode via USB. The resulting network traffic was then captured and used as a fingerprint for identifying an application. The program they used was 'The Monkey' [5], an Android UI/application exerciser which generates a random stream of user events.

Applications can have many states. The chances of visiting the interesting states are very low when just following a random stream of user events. This is the reason why a more controlled approach is needed.

Controlled application instrumentation

The Android tool called 'Monkeyrunner' [4] allows testing of an application by issuing device commands. This is done by writing a program in the Python programming language [51]. Generating the training sets requires programs that generate network traffic that is representative of the category. The evaluation set requires that the program issues commands that are representative of what a real user would do.


Monkeyrunner requires knowledge of which application to start, and which of the application's activities should be started. To list the installed packages on a device, the following command can be issued to the ADB:

$ adb shell pm list packages

Finding out the possible activities of a package can then be done by issuing:

$ adb shell dumpsys package | grep -i "PACKAGENAME" | grep Activity

The name of the activity can then be used to start the application in a specific state. Thus, one data generation script was created per application. However, the context that each application is run in is the same regardless of class.

The script context

The one thing in common between all the data generation scripts is that they all follow the same steps. First, the application is started and a time-stamp is collected. Then, category-specific data is generated for 20 seconds, as measured using the time-stamp. The last step is a 10-second-long pause before repeating the process. The following sections deal with how the category-specific data is generated.
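A minimal sketch of this shared script context, intended to be run with the monkeyrunner tool (the package name, activity, and the category-specific helper are placeholders, not code from the thesis):

import time
from com.android.monkeyrunner import MonkeyRunner

device = MonkeyRunner.waitForConnection()

def generate_category_data(device, seconds):
    # Category-specific actions (start a video, open a web page, send a chat
    # message, ...) would be implemented here in each application's script.
    MonkeyRunner.sleep(seconds)

while True:
    # Start the application in the desired state
    device.startActivity(component="com.example.app/.MainActivity")
    start = time.time()                              # collect a time-stamp
    generate_category_data(device, 20)               # category-specific data for 20 seconds
    MonkeyRunner.sleep(max(0, 20 - (time.time() - start)))
    MonkeyRunner.sleep(10)                           # 10 second pause before repeating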

Video streaming data generation

Video streaming data is generated by running the following applications: YouTube, Netflix, HBO, Twitch and Svt Play. At the start of the first 20 seconds of each cycle, a video is started. During the 10-second pause, a new video is found but not started. In the case of YouTube and Twitch, the new video is found by using the search function with a randomly selected word from a list of the 10,000 most common English words, as determined by n-gram frequency analysis of Google's Trillion Word Corpus [47]. In the case of the other services, a new video is started by playing the next suggested video.

Web browsing data generation

To generate web browsing data, the websites in Table 4.1 are selected. Websites with many sub-sites containing mainly text and pictures are chosen to reduce the similarity between this category and the social networking and video streaming categories.

Website                  Website category
http://reddit.com        Link collection website.
http://svt.se/nyheter    Local popular news website.
http://dn.se             Local popular news website.
http://di.se             Local popular news website.
http://nouw.com          Local popular blogging platform.

Table 4.1: The chosen websites and the category of the website.

Each of these sites has multiple sub-sites, which are found by automatically crawling the sites with the following command:

$ wget -m http://www.example.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\|JPG\)$' > name.txt

Once this is done, the data generation process can begin. During the first 20 seconds of each cycle, a randomly selected link from the list of sub-sites is visited. To prevent the browser from starting to search for websites while switching sites, the browser is restarted on the next website at the beginning of the cycle, as sketched below.
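The following is a possible sketch of this step, assuming the sub-site URLs were saved to name.txt by the wget command above; opening the new site through a VIEW intent is an illustration, not necessarily how the thesis scripts do it.

import random
from com.android.monkeyrunner import MonkeyRunner

# Read the crawled sub-site URLs (one per line, as produced by the wget command)
urls = [line.strip() for line in open("name.txt") if line.strip()]

device = MonkeyRunner.waitForConnection()

# Restart the browser on a randomly selected sub-site via a VIEW intent
device.startActivity(action="android.intent.action.VIEW", data=random.choice(urls))
MonkeyRunner.sleep(20)  # let the page load during the active phase of the cycle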


Bulk download data generation

Data for the bulk download class is generated by downloading applications from Google Play [24]. The next application to download is found by searching with a randomly selected word from the list of the 10,000 most common English words.

Audio communication data generation

Audio communication data is generated from the applications Skype and Discord. During the first 20 seconds of each cycle a call is made to another phone that automatically accepts the call. During the call, sound from a radio is transmitted. The call is ended during the pause.

Text communication data generation

Data for the text communication class is generated by receiving four messages and sending four messages at random points in time during the first 20 seconds of each cycle. This is done using two phones, running either Skype, Discord or Facebook Messenger [23]. Each message consists of one to three words (the count selected randomly), chosen at random from the list of the 10,000 most common English words.

Social network data generation

Social network data is generated from the applications Facebook and Instagram. During the first 20 seconds of each cycle, the screen is continuously swiped vertically for the length of three screens, triggering the loading of new content; a sketch of such a swipe follows below.
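The sketch below illustrates the vertical swipe; the screen coordinates assume a 1080x1920 display and are only illustrative.

from com.android.monkeyrunner import MonkeyRunner

device = MonkeyRunner.waitForConnection()

# Swipe vertically, roughly one screen length per drag, three times per cycle
for _ in range(3):
    device.drag((540, 1600), (540, 300), 0.3, 10)  # start, end, duration (s), steps
    MonkeyRunner.sleep(1)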

4.4 Data collection

All the data collected needs to be available both in its encrypted and unencrypted form so that a ground truth can be established. This is done by using a trusted proxy.

Trusted proxy

A trusted proxy, for example BRO [11] (open-source, also used as a network intrusion detection system) or mitmproxy [42] (also open-source), inspects all network traffic passing through it. Since the trusted proxy knows the negotiated encryption key, all encrypted traffic passing through can be logged in its unencrypted form, and the corresponding encrypted traffic can be logged as well. This enables labelling of the data so that a ground truth can be established, which can be used to evaluate the framework's performance. mitmproxy is picked due to being open-source, having extensive documentation, and being efficient. The trusted proxy approach has been used in previous research, for example by Krishnamoorthi et al. [34].
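As an illustration of how such logging could be hooked into mitmproxy, the following is a minimal addon sketch (not the thesis's actual logging script) that records the host, URL and response size of every decrypted flow; it would be started with mitmdump -s label_flows.py, where the file name is arbitrary.

from mitmproxy import http

class FlowLogger:
    """Log host, URL and response size of every flow that passes the proxy."""

    def __init__(self, logfile="flows.log"):
        self.logfile = logfile

    def response(self, flow: http.HTTPFlow) -> None:
        # Append one line per decrypted HTTP(S) flow so it can be labelled later
        with open(self.logfile, "a") as f:
            f.write("%s %s %d\n" % (flow.request.host,
                                    flow.request.pretty_url,
                                    len(flow.response.content or b"")))

addons = [FlowLogger()]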

Trusting the trusted proxy

The trusted proxy has a certificate that the connecting device needs to have installed. However, since the Android version "Nougat", Android applications no longer trust user-installed certificate authorities unless the application's developer explicitly opts in. To bypass this, each analysed application has to be decompiled and modified to opt in. This is done by first fetching the Android Package Kit (APK):

$ adb pull PATH APPLICATION.apk


The APK is then decompiled using apktool:

$ apktool decode APPLICATION.apk

After this is done, the file called "network_security_config.xml" is edited (or created) to contain the following:

<?xml version="1.0"?>
<network-security-config>
    <base-config>
        <trust-anchors>
            <certificates src="system"/>
            <certificates src="user"/>
        </trust-anchors>
    </base-config>
</network-security-config>

AndroidManifest.xml then needs to be updated to point to the edited/created file:

android:networkSecurityConfig="@xml/network_security_config"

The APK is then rebuilt by issuing:

$ apktool build -o APPLICATION.apk

The rebuilt APK must then be signed using the build tool apksigner [3] in the Android SDK. This requires setting up a keystore using keytool from the Java Development Kit [29]:

$ keytool -keystore APPLICATION.jks -genkey -keyalg RSA -keysize 2048 -validity 30 -alias APPLICATION

The APK is also aligned using zipalign from the Android SDK [6] before it is signed:

$ zipalign -f -v 4 APPLICATION.apk APPLICATION_signed.apk

$ apksigner sign --ks APPLICATION.jks APPLICATION_signed.apk

The last step is installing the application again:

$ adb install APPLICATION_signed.apk

Data collection set-up

The set-up for data collection can be seen in Figure 4.1. First, the laptop running Monkeyrunner issues commands over the ADB, and the ADB forwards the commands to the phone via Universal Serial Bus (USB). The phone, which is in USB debugging mode and runs the targeted application, receives the commands and interacts with the application accordingly. The resulting encrypted network data is sent towards its destination but first passes the trusted proxy, mitmproxy (which actually runs on the laptop), which logs the traffic before forwarding it to its destination.


Figure 4.1: The set-up used for data collection.

Collected samples

Figure 4.2 shows the number of 20-second samples generated for each class. The figure also shows the traffic composition of each class, with the number of samples per application. The total number of samples is 7,154, which means almost 60 hours of data was captured. More data is of course always better; however, enough data is collected for the scope of this thesis. Each category, except bulk download, is represented by data from at least two different applications. This is done to increase the reliability of the result.

Figure 4.2: The number of collected samples per class and each class' application composition.


4.5 Data processing

The data needs to be processed and time series need to be built before multi-fractal analysis can be performed.

Splitting the data

The .pcap files containing the logged traffic are split up by TCP session, resulting in one file per flow. A flow is traffic that shares the same 5-tuple:

⟨source_IP, source_port, destination_IP, destination_port, protocol⟩

The split is achieved by using the tool SplitCap [63], which has been used in related works, for example by Muehlstein et al. [45]. The command that does this is:

$ SplitCap -r filename.pcap -s flow
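To illustrate what the split amounts to, the following sketch (using scapy rather than SplitCap) groups the packets of a capture by their 5-tuple; the file name is a placeholder.

from collections import defaultdict
from scapy.all import rdpcap, IP, TCP

flows = defaultdict(list)
for pkt in rdpcap("filename.pcap"):
    if pkt.haslayer(IP) and pkt.haslayer(TCP):
        # The 5-tuple identifying the flow
        # (a bidirectional flow would also include the reverse of this key)
        key = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport, "TCP")
        flows[key].append(pkt)

print("%d flows found" % len(flows))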

Creating a single time series

For each file, a time series of t seconds with a sampling interval of n milliseconds, filled with zeros, is created. The empty time series is then filled by going through each packet arriving during the duration and incrementing the value of the packet's corresponding time slot. An example of a time series created by this process can be seen in Figure 4.3; a minimal sketch of the procedure follows after the figure. In reality, the time series is much longer.

Figure 4.3: An example of a time series, each timeslot is filled with the number of packets arriving during that timeslot.
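Below is a minimal sketch of this procedure; the function name and the use of arrival times given in seconds relative to the flow start are assumptions.

import numpy as np

def build_time_series(arrival_times, t=20, n=1):
    """Packet-count time series: t is the duration in seconds,
    n the sampling interval in milliseconds."""
    num_slots = int(t * 1000 / n)            # e.g. 20 s at 1 ms -> 20000 slots
    series = np.zeros(num_slots, dtype=int)
    for ts in arrival_times:                 # arrival times relative to flow start
        slot = int(ts * 1000 / n)            # map the arrival to its timeslot
        if 0 <= slot < num_slots:            # ignore packets outside the window
            series[slot] += 1
    return series

# Example: three packets within the first few milliseconds of a flow
print(build_time_series([0.0012, 0.0013, 0.0041])[:6])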

Creating data sets

Multiple data sets are created by using different combinations of the parameters t and n. The values used for t (seconds) are t = 10, t = 15 and t = 20. A lower duration means that the model will be able to predict the traffic class earlier. However, a higher duration can give a more reliable result.

The values used for n (milliseconds) are n = 1, n = 5 and n = 10. A lower sampling interval yields more data points, but this does not necessarily improve the reliability of the model. The desired trait of the data is that it yields different multi-fractal features than the other classes. Decreasing or increasing the sampling interval may both improve reliability, but this also depends on the parameter t, and hence multiple combinations need to be examined.

The procedure produces nine data sets for each class. This is done so that suitable values for the parameters t and n can be selected; the enumeration of the combinations is sketched below.
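A short sketch of the enumeration, with the per-flow time series construction left out:

from itertools import product

durations = [10, 15, 20]   # t, seconds
intervals = [1, 5, 10]     # n, milliseconds

# One (initially empty) data set per (t, n) combination; each is built for every class
datasets = {(t, n): [] for t, n in product(durations, intervals)}
print(len(datasets))  # 9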

5 Baseline model and preliminary analysis

This chapter covers how the baseline model is built. Some preliminary analysis is done to decide values for some parameters without having to use the evaluation data set. After that, the processes of extracting and selecting features are explained. With the selected features, a model can be built; this is described in section 5.4, 'Building the model'. The chapter ends with an overview of the resulting initial framework.

5.1 Determining sampling rate and duration

To reduce the complexity of the framework, sampling needs to be done at a predetermined sampling rate for a predetermined amount of time. For each combination of values for t and n, all time series are created. For all those time series, the singularity spectrum D(h(q)) and corresponding Hölder exponents h(q) are estimated for the linearly-spaced moments of the structure functions from q = -5 to q = 5 by computing the wavelet leaders, using the Daubechies wavelet of order four. The mean values of each category's D(h(q)) and h(q) are then calculated, resulting in one point, (D(h(q)), h(q)), per category. Figure 5.1 shows the mean values for each category, for q = 5. Mean values far from each other enable a greater spread of actual values without misclassification.

The pairwise distances between the classes' corresponding mean points are therefore summed; a sketch of this computation is shown below. This is repeated for all data sets, resulting in the values seen in Table 5.1. As seen in the table, the largest sum of average pairwise distances corresponds to a sampling duration t of 20 seconds and a sampling rate of one sample per millisecond. Consequently, those values are chosen.
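A sketch of the distance computation; the class names and mean points below are made up for illustration and are not the thesis's values.

import numpy as np
from itertools import combinations

def sum_pairwise_distances(class_means):
    """Sum the Euclidean distances between every pair of class mean points."""
    total = 0.0
    for a, b in combinations(class_means.values(), 2):
        total += np.linalg.norm(np.asarray(a) - np.asarray(b))
    return total

# Illustrative (made-up) mean (D(h(q)), h(q)) points for three classes
means = {"video": (0.95, 0.60), "web": (0.80, 0.45), "bulk": (0.70, 0.90)}
print(sum_pairwise_distances(means))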


Figure 5.1: The average (D(h(q=5)), h(q=5)) point for each class.

Sampling duration (seconds)   Sampling rate (samples/millisecond)   Sum
20                            1                                      2.1668
20                            5                                      2.0867
20                            10                                     1.7669
15                            1                                      1.9053
15                            5                                      1.8017
15                            10                                     1.7219
10                            1                                      1.9327
10                            5                                      1.8372
10                            10                                     1.8291

Table 5.1: The result of summing the pairwise distances between each category's average (D(h(q)), h(q)), for each sampling duration and sampling rate.

The data set that gave the best value was also the one that contained the most values: a sampling duration of 20 seconds with a sampling rate of one sample per millisecond equals 20,000 values per time series. However, most of the values are actually zero, since a slot is only non-zero if a packet arrives during that timeslot. In practice, the whole time series can thus be preallocated, and the framework then only has to find which timeslot to increment when a packet arrives. This means that the sampling rate can be modified without much of an impact on the performance of data collection. The sampling duration thus serves as a lower bound on how fast a classification can be made, which is why more analysis of the impact of the sampling duration is done in section 6.5.

Now that values for t and n have been decided, features to build the model with can be extracted.


5.2 Feature extraction

The features used throughout this thesis are the singularity spectrum and the corresponding Hölder exponents of the signal. These values are estimated for the linearly-spaced moments of the structure functions from q = -5 to q = 5 by computing the wavelet leaders, using the Daubechies wavelet of order four. This produces 20 values per time series. However, since all values might not be relevant, a feature selection method is used to remove irrelevant features; a sketch of the extraction step is shown below.
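In the sketch below, the wavelet-leader estimation itself is left as a placeholder (in MATLAB it could be done with dwtleader, for example); the assumption of ten moments, giving 10 + 10 = 20 features, follows from the stated feature count.

import numpy as np

def estimate_multifractal_spectrum(series, q_values):
    """Placeholder for the wavelet-leader based estimation of (D(h(q)), h(q));
    expected to return two arrays, one value per moment q."""
    raise NotImplementedError

def extract_features(series):
    q_values = np.linspace(-5, 5, 10)             # linearly-spaced moments
    D, h = estimate_multifractal_spectrum(series, q_values)
    return np.concatenate([D, h])                 # 10 + 10 = 20 features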

5.3 Feature selection

The feature selection (FS) method used in this thesis is Neighbourhood Component Analysis (NCA). NCA has been shown to perform well compared to other feature selection methods [81]. To obtain an optimized value for the regularization parameter λ, the limited-memory BFGS optimization algorithm is applied. Running the algorithm for 50 seconds yields a regularization parameter of λ = 2.8 × 10^-4. The result of running NCA with that λ can be seen in Figure 5.2. The figure displays the feature weights for all 20 features; a high weight means that the feature is relevant to the model.

Figure 5.2: The feature weights for the features after performing NCA with optimized parameters.

The features that were deemed irrelevant (a feature weight lower than 0.01 was used as the threshold) were removed. The kept features and their corresponding weights can be seen in Table 5.2. Eleven of the original 20 features were kept. This implies that the framework would not have to extract the full spectrum and could still achieve similar performance; the thresholding step is sketched below.
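A sketch of the thresholding step; the data and weights are randomly generated for illustration and are not the thesis's values.

import numpy as np

def select_features(X, weights, threshold=0.01):
    """Keep only the columns of X whose NCA feature weight reaches the threshold."""
    keep = np.flatnonzero(np.asarray(weights) >= threshold)
    return X[:, keep], keep

# Illustrative call with random data and made-up weights
X = np.random.rand(5, 20)
w = np.random.rand(20)
X_selected, kept = select_features(X, w)
print(kept)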

