
Classification of Video Traffic

An Evaluation of Video Traffic Classification using Random Forests and Gradient Boosted Trees

Ricky Andersson

Faculty of Health, Science and Technology

Degree Project for Master of Science in Engineering, 30 HP
Supervisor: Johan Garcia
Examiner: Leonardo Martucci


Acknowledgements

I would like to thank Procera Networks for the opportunity of working on this project. I would also like to thank Anders Waldenborg, who was my supervisor at Procera. A special thanks goes out to Johan Garcia, my supervisor at Karlstad University, for the help provided during the project and dissertation.


Abstract

Traffic classification is important for Internet providers and other organizations in order to solve some critical network management problems. The most common methods for traffic classification are Deep Packet Inspection (DPI) and port based classification. These methods are starting to become obsolete as more and more traffic is being encrypted and applications are starting to use dynamic ports and the ports of other popular applications. An alternative method for traffic classification uses Machine Learning (ML). This ML method uses statistical features of network traffic flows, which solves the fundamental problems of DPI and port based classification for encrypted flows. The data used in this study is divided into video and non-video traffic flows, and the goal of the study is to create a model which can classify video flows accurately in real-time. Previous studies found tree-based algorithms to work well in classifying network traffic. In this study random forest and gradient boosted trees are examined and compared, as they are two of the best performing tree-based classification models. Random forest was found to work best, as its classification speed was significantly faster than that of gradient boosted trees. Over 93% correctly classified flows were achieved while keeping the random forest model small enough to maintain fast classification speeds.


Contents

1 Introduction and Background
1.1 Aim of this Study
1.2 Brief Summary of the Results
1.3 Tools and Resources
1.3.1 Python
1.3.2 Pandas
1.3.3 Numpy
1.3.4 Scikit-learn
1.4 Organization

2 Machine Learning
2.1 Overview
2.2 Description of Machine Learning
2.2.1 Features
2.2.2 Unsupervised Learning
2.3 Supervised Learning
2.3.1 Feature Engineering
2.3.2 Feature Selection
2.3.3 Performance Metrics
2.3.4 Bias and Variance
2.3.5 Model Validation
2.3.6 Decision Trees
2.3.7 Ensemble Methods
2.4 Summary

3 Related Work
3.1 Summary

4 Study Preliminaries
4.1 Traffic Flow
4.2 Dataset
4.3 Features
4.4 Testing and Training Data
4.5 Classifier Evaluation
4.6 Chapter Summary

5 Random Forest Examination
5.1 Number of Trees
5.2 Max Depth
5.3 Max Features
5.4 Min Samples Split and Leaf
5.5 Parameter Tuning Results
5.6 Summary

6 Gradient Boosted Trees Examination
6.1 Number of Trees and Learning Rate
6.2 Max Depth and Min Samples Split
6.3 Min Samples Leaf and Max Features
6.4 Subsample
6.5 Retuning Learning Rate and Number of Trees
6.6 Parameter Tuning Results
6.7 Summary

7 Random Forest and Gradient Boosted Trees Comparison
7.1 Feature Importance
7.1.1 Classifier Complexity Impact
7.1.2 Statistical Features
7.1.3 Composite Features
7.2 Tuning Sensitivity
7.3 Forest Complexity and Performance Trade-off
7.3.1 Forest Complexity
7.3.2 Complexity Metric
7.3.3 Complexity Performance Comparison for RF and GBM
7.4 Summary

8 Indepth Analysis
8.1 Training Size Impact
8.2 Misclassification Examination
8.2.1 Filtering Out Video Labeled Non-Video Flows
8.3 Combined Results

9 Conclusion and Future Work
9.1 Overall Conclusions
9.2 Future Work
9.3 Own Reflections

List of Figures

2.1 Overview of Classification
2.2 The classification task of finding a way to split the two classes of data
2.3 The complete model of the data set
2.4 Left plot shows the unlabeled new observations and the right plot shows them labeled
2.5 Example of a ROC curve
2.6 Example of a Confusion Matrix
2.7 Plots illustrating bias and variance
2.8 Bull's eye diagram illustrating variance and bias
2.9 Two fold Cross-validation
2.10 Five fold Cross-validation
2.11 Decision Tree: Identification of Animals
2.12 Decision Tree: splits at different tree depths
2.13 Decision Tree: Same model fit to two different halves of the same data set
2.14 Random forest: Combination of the two decision trees from Figure 2.13
2.15 Random forest of 100 decision trees
2.16 Gradient boosted tree example [43]
5.1 Accuracy performance for increasing number of trees
5.2 ROC curves for varying numbers of trees
5.3 Accuracy performance for increasing max depth
5.4 Accuracy for increasing max feature parameter
5.5 Plots showing the performance of the min samples leaf and split parameters
5.6 ROC comparing the tuned and untuned versions of the classifiers
6.1 Learning rate comparison
6.2 Learning rate comparison
6.3 Performance comparison between different numbers of trees
6.4 Performance comparison for increasing max depth
6.5 Varying the minimum samples leaf parameter for both feature sets
6.6 ROC curve showing the performance of the tuned classifier compared to the untuned model
7.1 Feature importance of RF classifiers of different sizes
7.2 Feature importance of GBM classifiers of different sizes
7.3 Feature importance comparison between Random Forest and Gradient Boosted Trees
7.4 Feature importance comparison between Random Forest and Gradient Boosted Trees
7.5 ROC curve showing the tuning sensitivity for RF and GBM
7.6 Performance and latency for varying number of trees in RF
7.7 Performance and latency for varying number of trees in GBM
7.8 Performance and latency for varying max depth
7.9 Comparison between some RF and GBM classifiers
8.1 Performance for increasing training size
8.2 Misclassified flows - Random Forest
8.3 ROC showing the performance of a RF classifier with increasing minimum flow length
8.4 ROC showing the performance of a RF classifier with increasing total bytes down
8.5 Confusion matrix comparison between unfiltered and filtered performance
8.6 Comparison between the performance increase of the results in the previous sections

List of Tables

2.1 Example data set
3.1 Features used for classification
3.2 Applications studied
4.1 Top 10 services in the dataset
4.2 Top 4 services of video traffic in the dataset
4.3 Statistical features
4.4 Composite Features
5.1 Tuned RF Results
6.1 Tuned GBM Results
7.1 Table comparing RF and GBM of different sizes
8.1 Four metrics showing the performance of the classifier before and after improvements were made

1. Introduction and Background

In 2016 the number of Internet users grew by over 10%, and now over 50% of the world's population has access to the Internet [28]. Internet service providers (ISPs) and other organizations are in many cases dependent on traffic classification to solve network management problems. By being able to identify the traffic flowing through their networks in real-time, ISPs can properly handle the traffic. For example, by being able to identify the traffic it is possible to detect and mitigate denial of service attacks [41][49].

The most common techniques for traffic classification use Deep Packet Inspection (DPI). DPI is a technology capable of inspecting packet contents in network traffic. Previously a lot of applications used specific port numbers for their traffic, which made it possible to identify network traffic by the port numbers used. In later years applications have started using dynamic or obscure ports, which has led to port-based classification becoming less reliable [29]. Another common classification method uses DPI to inspect the data payloads of the packets. Aspects of the data can then be examined for similar aspects from well-known applications [40]. However, an increasingly large part of Internet traffic is being encrypted, and it has been predicted that over 75% of web traffic will be encrypted by 2019 [32]. This increase in encrypted traffic has reduced the effectiveness of DPI methods for classification.

Procera Networks is a networking equipment company that offers IT analytics solutions. Procera offers Deep Packet Inspection and data analytics services to network operators and vendors, which first and foremost are used to increase the quality of their services to customers. Some of the solutions provided by Procera include classification of some of the most common Internet services, in order to correctly handle the network traffic. Right now the classification by Procera is done by inspection of packets which, as mentioned above, is starting to become insufficient as more and more traffic is encrypted. This means that Procera is interested in alternative methods of classification which do not involve DPI and which also work well for encrypted traffic.

One of the newer methods for traffic classification uses statistical patterns in traffic flows. These patterns, called features, often include things like inter-arrival times, statistical variations in packet payload sizes and flow lengths. Most of these feature-based classification techniques use machine learning. Machine learning is a type of artificial intelligence where a model learns from, and can make predictions on, data. The goal of this method is typically to classify one or more specific applications or application types. This study focuses on video traffic classification using tree-based classification models.

1.1 Aim of this Study

The goal of this study is to examine interesting aspects of classification of network traffic using machine learning, as the previous DPI and port based methods are starting to become obsolete. There are several machine learning algorithms available. Previous studies found that tree based classification algorithms work particularly well for network classification, and so two tree-based classifiers will be compared: Random Forest (RF) and Gradient Boosted Machine (GBM), often called Gradient Boosted Trees. In addition to comparing RF and GBM, some additional aspects of traffic classification are studied, such as comparing different feature sets and different ways of increasing the classification performance and classification speed.

1.2 Brief Summary of the Results

The Internet traffic used in this study was divided into video and non-video traffic flows. Two different feature sets were compared; the first set uses statistical properties of the packets in each flow, such as average and max packet sizes in the flows. The second feature set is composed of features that are fast to calculate, such as inter-arrival times and number of bursts. Two machine learning algorithms, random forest and gradient boosted trees, were compared to find out which of the two models was best suited for real-time classification. Both of the models achieved similar classification performance when ignoring the classification speed of the models. Random forest was found to be the better fit, as its classification speed was significantly faster.


The data provided by Procera was classified by the associated service, so every flow to and from a specific video service was considered video traffic even if it did not include any video. By filtering out mislabeled data, as well as encrypted traffic since it could include video traffic, the performance was increased slightly. For the examined data set, over 93% correctly classified flows were achieved while keeping the random forest model small enough to maintain fast classification speeds.

1.3 Tools and Resources

1.3.1 Python

Python is a high-level, object-oriented programming language [5] for general-purpose programming. Python is one of the most popular programming languages and focuses on a simple and readable syntax.

1.3.2 Pandas

One of the most used libraries for Python is Pandas [4]. Pandas is used for data manipulation and data analysis, as it offers powerful data structures such as the Pandas DataFrame, which is a way of storing data in rectangular grids [48].

1.3.3 Numpy

NumPy is a powerful Python library used for scientific computing [3]. NumPy provides a multidimensional array object together with a large selection of functions for operations on these objects.

1.3.4 Scikit-learn

Scikit-learn [6] is a Python library that provides state-of-the-art implementations of machine learning algorithms, with an easy-to-use interface. Scikit-learn is built upon three underlying technologies: NumPy, SciPy and Cython. NumPy is a package containing the base data structure used for data and model parameters in Scikit-learn. SciPy provides efficient algorithms for linear algebra, sparse matrix representation, special functions and basic statistical functions. Cython is a language which combines C and Python to reach the performance of compiled languages such as C, with Python-like syntax and high-level operations. Scikit-learn is one of the most used machine learning libraries and is used in various industries, for example medical imaging [42].
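
As an illustration of how these libraries fit together, the following minimal sketch (not taken from the thesis code; the data and parameter values are chosen purely for illustration) builds a small synthetic dataset with NumPy and pandas, and trains and scores a classifier with scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic data: two features and a binary label.
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature_1": rng.normal(size=200),
                  "feature_2": rng.normal(size=200)})
y = (X["feature_1"] + X["feature_2"] > 0).astype(int)

# Hold out part of the data for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a classifier and report its accuracy on the held-out data.
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```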

1.4 Organization

In Chapter 2 the most relevant machine learning aspects will be introduced and the two classification algorithms, random forest and gradient boosted trees, will be explained. Some previous related work on machine learning and traffic classification is presented in Chapter 3. The most relevant information and results from the related studies are summarized.

Some of the preliminaries needed to understand the experiments are explained in Chapter 4. The data and feature sets are explained in detail, the method for evaluating the classifiers is explained, and the data sampling for the training and validation data is also explained. Random forest and gradient boosted trees are examined in Chapter 5 and Chapter 6. Both algorithms are tuned for the two sets of features used in the study.

In Chapter 7 the two classification methods, random forest and gradient boosted trees, are compared in several ways to find out which of the two models best fits real-time traffic classification. Chapter 8 presents some additional aspects of classification and of increasing the performance of the RF classifier. The performance impact of the training size is examined along with an analysis of the misclassified samples.

The conclusion of the paper is presented in Chapter 9. The study is evaluated and some possible future works are presented.


2. Machine Learning

2.1 Overview

In this chapter the most relevant machine learning aspects will be introduced and the two classification algorithms, random forest and gradient boosted trees, will be explained.

2.2 Description of Machine Learning

Machine learning (ML) is a subfield of artificial intelligence and can be thought of as a way of creating a model out of some data [46]; it is the study of programs that use past experiences to make future decisions.

There is no clear definition of what machine learning is, but a popular quote from the computer scientist Tom Mitchell says: "A program can be said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." [37]. Arthur Samuel, a pioneer in artificial intelligence, explained machine learning as "the study that gives computers the ability to learn without being explicitly programmed." [24].

An example of a machine learning problem could be a group of pictures of either cats or dogs, where the task is to sort each picture into the cat collection or the dog collection. A program could learn to perform this task by observing the pictures that are already sorted, and its performance can then be evaluated by calculating the percentage of correctly classified pictures. Another common example of machine learning is spam filtering: by observing thousands of emails labeled as either spam or non-spam, the spam filter learns which emails are spam.

Machine learning can be described as either learning with supervision or without supervision. Supervised learning is the class of machine learning where an output is predicted from some input by first learning from correctly labeled data. In simple terms, supervised learning is when a program learns from examples of the right answers. In unsupervised learning the program does not use labeled data and instead attempts to find patterns in the data. There are more types of machine learning, but supervised and unsupervised learning are the two most common types.

2.2.1 Features

In machine learning, a feature is a measurable property of the data that is being observed. Good features should be informative, discriminating and independent for creating a good model of the data.

As an example, consider the famous Iris flower data set [22], which contains 50 samples of each of three species of Iris flowers. A small part of the Iris data set with only six samples can be seen in Table 2.1. There are four features for each sample in the data set: sepal length, sepal width, petal length and petal width. The data set also contains the species of each sample, which is called the label or the ground truth of that sample. Features are often represented in a feature vector. A feature vector is an n-dimensional vector containing the features of some observed data.

Table 2.1: Example data set

  sepal_length  sepal_width  petal_length  petal_width  species
0          5.1          3.5           1.4          0.2  setosa
1          4.9          3.0           1.4          0.2  setosa
2          7.0          3.2           4.7          1.4  versicolor
3          6.4          3.2           4.5          1.5  versicolor
4          6.3          3.3           6.0          2.5  virginica
5          5.8          2.7           5.1          1.9  virginica

2.2.2 Unsupervised Learning
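
A feature table like Table 2.1 can be produced from the Iris dataset bundled with scikit-learn; the short sketch below is illustrative only and assumes a reasonably recent scikit-learn version (the as_frame option):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris data with the four features in a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.frame  # feature columns plus the integer-encoded target

# Attach the species names (the label / ground truth) and show a few rows.
df["species"] = iris.target_names[iris.target]
print(df.head(6))
```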

As mentioned in the previous section, unsupervised learning is the class of machine learning that involves creating a model of a dataset without any right answers to learn from. There are two common unsupervised machine learning tasks: clustering and dimensionality reduction.

Clustering is the task of finding patterns in data; these patterns are often called clusters. In clustering, data is assigned to groups that are similar to each other based on some set of features.

An example of clustering can be explained with the Iris data set. Assuming we know that there are three species but are missing the label of each sample, clustering could be used to try to separate the data into three clusters. Even if three clusters are found, there is no guarantee that the data is separated correctly.

Some problems in machine learning could have millions of features, which could be computationally costly to work with. Dimensionality reduction is the task of reducing the number of features of a dataset for some problem.

The reduction of features can be done by finding a better subset of the original features, which is called feature selection. Some features could introduce noise or might be irrelevant to the problem; dimensionality reduction can be used to find these features and remove them from the model. Dimensionality reduction also helps visualize data. For example, the price of a house might depend on hundreds of variables, such as location, size, distance to the sea, size of the living room and so forth. Visualizing how the features affect the price might be hard in the case of hundreds of variables; dimensionality reduction could be used to remove the least important features. Assuming the size of the house affects the price significantly more than the other features, the problem could be reduced to just two variables, which is easily visualized in a 2D plot.

Feature extraction is another type of dimensionality reduction where the higher-dimensional space of features is transformed to a lower dimensional space while keeping the qualities of the full dataset.

2.3 Supervised Learning

This paper will focus on supervised learning, which is the class of machine learning where a model learns to predict some output by first learning from already labeled data. This learning is done by modeling the relationship between the measured features of the data and some label. After this model is found it can be used to find the labels of new unknown data. Supervised learning is often divided into two types of tasks: classification and regression.

Regression tasks are the tasks of predicting the value of some continuous output, such as predicting the sales of a new product or predicting how much a stock price will fall over some period of time.


Classification tasks are the tasks where the created model learns to predict discrete values for the output from a set of features. The model can then predict the most probable class for a new observation. Predicting if stock prices will fall or rise is an example of a classification task.

Figure 2.1: Overview of Classification

An overview of classification can be seen in Figure 2.1: already labeled data is split into training data and validation data. The training data is used to train the classification model, also known as a classifier, while the validation data is used to evaluate the performance of the model. When the model has been trained and validated to make sure the performance is good, it can be used on new unlabeled data to predict the classes of the data.

Figure 2.2 shows an example of a dataset with two distinct groups; the goal of the learning here is to create a model that separates the two groups of data. An example of a learned model can be seen in Figure 2.3, where the two groups of data are separated. Once a model has been created, newly observed and unlabeled data can be assigned labels, as seen in Figure 2.4.

2.3.1 Feature Engineering

Good features come from feature engineering and feature selection. Feature engineering is the process of creating new or additional features from raw data or other already existing features to improve the predictions of the machine learning algorithm [2]. In simple terms, feature engineering is the task of taking observed data and turning it into numbers for the machine learning algorithms. Feature engineering is a fundamental part of machine learning; a simple machine learning algorithm will often provide better results with good features than a fancy algorithm with bad features.

Figure 2.2: The classification task of finding a way to split the two classes of data.
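
As a hypothetical illustration of feature engineering in this setting (the column names and the chosen features below are assumptions for the example, not the thesis's actual feature set), per-flow statistics could be derived from a packet-level table with pandas:

```python
import pandas as pd

# Hypothetical packet-level data: one row per packet, keyed by flow id.
packets = pd.DataFrame({
    "flow_id":   [1, 1, 1, 2, 2],
    "timestamp": [0.00, 0.05, 0.30, 0.00, 1.20],
    "size":      [1500, 400, 1500, 60, 60],
})

# Engineer simple per-flow features: packet count, mean/max packet size
# and mean inter-arrival time within each flow.
flows = packets.groupby("flow_id").agg(
    packet_count=("size", "count"),
    mean_size=("size", "mean"),
    max_size=("size", "max"),
    mean_iat=("timestamp", lambda t: t.diff().mean()),
)
print(flows)
```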

2.3.2 Feature Selection

Feature selection is the process of selecting a subset of the available features to reduce the dimensionality of the training problem [2]. Feature selection is often applied after feature engineering in order to remove redundant or irrelevant features.

2.3.3 Performance Metrics

There are a multitude of ways to measure the performance of a machine learning model. The best prediction metric for a machine learning model depends on the task; this can be explained by an example of classification of tumors as either malignant or benign [24]. A common metric for the performance of a machine learning model is the accuracy, which is the fraction of predictions that were classified correctly. In the case of the tumor prediction, accuracy might not be the best metric, as accuracy does not differentiate between a malignant tumor that was classified as benign and a benign tumor that was classified as malignant. In some cases this might not be a problem, but in this specific problem a malignant tumor classified as benign is a more severe error than a benign tumor classified as malignant. For problems like this it is therefore useful to consider multiple performance metrics.

Figure 2.3: The complete model of the data set.

Figure 2.4: Left plot shows the unlabeled new observations and the right plot shows them labeled.

In this specific example a malignant tumor classified correctly can be called a True Positive, while a benign tumor classified as malignant is called a False Positive. A benign tumor classified correctly would be called a True Negative and a wrongly classified malignant tumor would be called a False Negative. The amounts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) can be used to calculate three metrics: accuracy, precision and recall.

Accuracy is the fraction of correctly classified observations; in the tumor case this would correspond to the fraction of correctly classified tumors. Accuracy can be calculated using the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is the fraction of positive classifications that are correct; in this case it would be the fraction of tumors classified as malignant that are actually malignant. Precision can be calculated using the following formula:

Precision = TP / (TP + FP)

Recall is the fraction of true positives classified as true positives and would, in this exam-ple, be the fraction of malignant tumors found. Recall can be calculated using the following formula:

Recall = (T P )

(T P + F N )

As mentioned earlier, a system with high accuracy might not always be a good system if the recall and/or precision is low. A system with a high recall and a low accuracy would in this case probably be better than high accuracy and low recall. Accuracy, precision and recall are not the only performance metrics, and the right metric for a specific model depends on the problem.
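
These metrics can be computed directly from the four counts. The sketch below uses scikit-learn's metric functions on a small set of made-up labels (the values are purely illustrative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

# Illustrative ground truth and predictions (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# The confusion matrix gives the four counts used in the formulas above.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, FN, TN:", tp, fp, fn, tn)

print("accuracy: ", accuracy_score(y_true, y_pred))   # (TP + TN) / all
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
```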

ROC Curves and AUC

A receiver operating characteristic (ROC) curve is a plot depicting the relative trade-off between the true positive rate (TPR), also known as recall, and the false positive rate (FPR), 1 - specificity, for a binary classifier as some threshold value is varied. A ROC curve is often used to see how well the positive and negative examples can be separated [35]. The TPR is plotted on the Y axis while the FPR is plotted on the X axis. An example of a ROC curve can be seen in Figure 2.5.

The area under the ROC curve (AUROC or AUC) is a value often used as a metric to summarize the performance of a classifier in a single number. A good classifier has a value as close to 1 as possible while a bad classifier has a value close to 0.5. A classifier is considered better than another classifier if it is closer to the top left corner of the ROC space. Classifiers on the left side make positive classifications only with strong evidence, so they make as few false positive errors as possible, which often leads to low true positive rates too. Classifiers towards the top right of the ROC space make positive classifications as long as there is some evidence that the sample is positive, which often leads to higher false positive rates. AUC measures the discrimination, which is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

Figure 2.5: Example of a ROC curve (AUC = 0.79)

A discrete classifier is a classifier which provides one class as an output for a sample; such a classifier can be used to create one single point in ROC space [19]. Some discrete classification models can also provide a probability that a sample belongs to one class or the other. This means the classifier can be plotted as a curve in the ROC space, where the probability required for a classification to be positive is used as the threshold. This can be explained by an example: if the threshold is set so that every sample with a probability over 0.25 is classified as a positive sample, then the classifier would have a high true positive rate and a high false positive rate, and a point can be produced towards the top right of the ROC space. If, on the other hand, a probability of 0.75 is required for a sample to be positive, then the classifier would have a low TPR and a low FPR, and the point would be towards the bottom left side of the ROC space. By varying this threshold value a curve can be created. A ROC curve can also be created for a multi-class classifier, which will not be covered in this paper.
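
A minimal sketch of how a ROC curve and its AUC can be obtained from a classifier's predicted probabilities with scikit-learn; the data here is synthetic and the parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Probability of the positive class for each validation sample.
scores = clf.predict_proba(X_test)[:, 1]

# Sweeping the decision threshold yields the ROC curve and its area.
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```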

Confusion Matrices

A confusion matrix is a figure representing the performance of a classifier in an intuitive way [36]. There are four metrics shown in a confusion matrix: the number of true positives, false positives, false negatives and true negatives. These four metrics can then be used to compute many other relevant metrics for evaluating a classifier, such as accuracy, misclassification rate, precision and more.

Figure 2.6: Example of a Confusion Matrix

Figure 2.6 shows an example of a confusion matrix. This confusion matrix shows the classification results for two classes, video and non-video, with roughly 6,200 samples of each class. About 5,500 of the video samples were correctly classified as video, while about 5,750 of the non-video samples were correctly classified as non-video.

2.3.4 Bias and Variance

Prediction errors in a machine learning system are often caused by bias and variance in the system. Bias is a term for when a learner keeps learning the same wrong thing independent of the features. A high bias results in an inflexible model that underfits the data. Variance is the term for a learner's tendency to learn random things regardless of the features. High variance results in an overly flexible model that even tries to model noise in the training data, causing the model to overfit the data. Figure 2.7 shows two plots, where one of the plots has high bias and underfits the data while the other has high variance and overfits the data. The goal in every machine learning problem is to keep both bias and variance low, but a decrease in one often causes the other to increase; this is called the bias-variance trade-off [24].

Variance and bias can be visualized as darts thrown at a dartboard, as seen in Figure 2.8. Each dart represents a prediction of a model; the closer to the middle, the better the prediction. Low bias and low variance will throw the darts close to the middle. High bias and low variance will throw darts far away from the bull's eye but tightly clustered. Low bias and high variance causes the darts to be spread widely around the bull's eye. High variance and high bias will cause the darts to be far away from the bull's eye and poorly clustered.

Figure 2.7: Plots illustrating bias and variance (left: a high-bias model underfits the data; right: a high-variance model overfits the data)

Figure 2.8: Bull's eye diagram illustrating variance and bias

2.3.5 Model Validation

A common problem when evaluating a machine learning model is overfitting the training data, which means that the model describes random noise in the training data instead of creating a generalized model of the data. An overfitted classifier might be 100% accurate on the given training data but only 60% accurate on the test data. The problem here is that the training and testing of a model should be done on different data sets. A better way of evaluating a model is the use of holdout sets. Using holdout sets means that some of the data is not used when training the model and is instead used to test the model. The drawback with using holdout sets is that some of the data does not contribute anything to the training of the model, which can be a problem when the amount of available data is small.

Another way of evaluating a model, without any data being left out of the training, is cross-validation. Cross-validation works similarly to holdout sets in that the data is divided into multiple parts, or folds as they are often called. The training is then done multiple times, each time with one of the folds left out of the training and instead used for testing. The results of the tests are then combined in some way, for example by taking the mean of the results, to get a better estimate while making sure all data contributes to the training. There are multiple ways of doing cross-validation; one of the most common is k-fold cross-validation. In k-fold cross-validation the data is split into k equally big parts. The simplest form of k-fold cross-validation, 2-fold cross-validation, can be seen in Figure 2.9, and five-fold cross-validation can be seen in Figure 2.10.

Figure 2.9: Two fold Cross-validation

Figure 2.10: Five fold Cross-validation
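
A short sketch of five-fold cross-validation with scikit-learn: cross_val_score trains the model five times, each time holding out a different fold, and returns one score per fold. The data and parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the labeled flows.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```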

2.3.6 Decision Trees

A Decision Tree classifier (DT) is an example of a machine learning algorithm used in supervised learning. DTs are trees where every node, except the leaves, evaluates a feature.


The nodes are sorted by which features would best divide the training data, so the root divides the training data the most. Each branch in the tree represents a value the feature can take. The leaves in the tree are assigned the class that the observation will be classified as. Classification is done by starting at the root node, then following the branches down according to the feature values and ending up at a leaf, which corresponds to the classification of the data [44][31]. Figure 2.11 shows a simple example of a decision tree.

A decision tree splits the data at each node in the tree. Figure 2.12 shows how a decision tree splits the data at each level in the tree. As seen in the figure, the biggest and most obvious splits are done at the first levels of the tree, while smaller and less obvious decisions are made lower in the tree.

Figure 2.11: Decision Tree: Identification of Animals

Figure 2.12: Decision Tree: splits at different tree depths

There are many advantages to using decision trees for classification: they are intuitive and easy to understand, the training is often straight-forward with a small number of hyperparameters to tune, and decision trees perform extremely fast [25]. One of the big disadvantages with decision trees is their tendency to overfit the training data, which means that they create a model specific to the training data. Figure 2.13 shows the same decision tree model fit to two different halves of the same data set. As seen in the figure, the model overfits to the training data instead of making a good generalization; DTs are inconsistent in the parts where they are less certain. A couple of studies were found to have tried to reduce this problem, often by pruning the tree, which at the same time reduces the accuracy on the training data [13]. To overcome the problem with overfitting, multiple trees can be used, i.e. forests. These forests reduce the overfitting of decision trees while keeping the accuracy high [26]. The most common forest algorithm is random forest, which will be introduced in the following section.
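
A minimal decision tree sketch with scikit-learn, showing how limiting the tree depth can be used to counter the overfitting described above; the synthetic data and the depth value are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree typically fits the training data almost perfectly,
# while a depth-limited tree often generalizes better to unseen data.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for name, tree in [("unrestricted", deep), ("max_depth=4", shallow)]:
    print(name, "train:", tree.score(X_train, y_train), "test:", tree.score(X_test, y_test))
```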


2.3.7 Ensemble Methods

Ensemble methods are a class of algorithms used for classifiers in supervised machine learning. Ensemble algorithms use an ensemble of classifiers where the predictions of each classifier are combined to create a stronger classifier [18]. Using ensemble methods has been shown to increase classification accuracy significantly [15]. Two categories of ensemble methods exist: averaging and boosting. Averaging methods create a set of independent classifiers and then average their results to create a stronger estimator. Random Forest and Bagging are two examples of averaging methods. The second category of ensembles is boosting methods. In boosting algorithms multiple classifiers are built incrementally, where each new classifier is added to the previous model in order to reduce the bias and variance of the earlier classifiers. Some common boosting algorithms are Gradient Tree Boosting and AdaBoost [7].

Random Forest

Random forest is an ensemble algorithm utilizing slightly modified decision trees, and it counters the main flaw of decision trees, their tendency to overfit the data. Another advantage of random forests is the simplicity of the algorithm; there are few hyperparameters to tune, such as the number of trees and the tree depth.

The algorithm creates each tree as follows:

1. If the dataset contains N samples, then N samples are drawn at random, with replacement, from the dataset. This means the same sample can be chosen multiple times.

2. If there are M features, then a number m (m < M) of features is selected at random from the M features at each node, and the best feature from m is chosen to split the data at that node. The size of m is the same for every node in the forest.

3. Each tree is grown as large as possible.

In Figure 2.13 it is shown that two decision trees built from different parts of the same data can be drastically different in some cases. Combining the information from the two decision trees can provide a better result, as seen in Figure 2.14, and more trees give an even better result, as seen in Figure 2.15 with 100 decision trees. Random forest uses the fact that multiple overfitting estimators combined can reduce overfitting by having the errors of the different estimators cancel one another [25][26].


Figure 2.14: Random forest: Combination of the two decision trees from Figure 2.13

Figure 2.15: Random forest of 100 decision trees

When constructing a decision tree, the feature that best divides the data is taken at each node going down the tree. In a random forest, using the best overall feature at each node would create a forest of identical decision trees. Instead, at every non-leaf node the best feature from a randomly chosen subset of the features is chosen to divide the data. The randomly selected features in random forests reduce the similarities between the decision trees and reduce the overfitting by having the different errors in the decision trees cancel each other out [33].
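
A sketch of a random forest in scikit-learn with the hyperparameters examined later in the thesis (number of trees, max depth, max features); the specific values here are illustrative and are not the tuned values from the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators: number of trees in the forest,
# max_features: size of the random feature subset considered at each split,
# max_depth: how far each individual tree may grow (None = fully grown).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            max_depth=None, random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```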

Gradient Boosting Tree

Gradient Boosting Tree is another ensemble algorithm that uses decision trees. Like all boosting algorithms it is built incrementally, where each successive estimator tries to reduce the error of the previous model. Compared to random forest, gradient boosted trees make use of very small decision trees (tree stumps), often with less than 8 leaf nodes [23].

Figure 2.16 shows an example of gradient boosted trees. The ground truth is modeled by incrementally adding more and more trees where each tree tries to minimize the loss function evaluated on the previous model.

Figure 2.16: Gradient boosted tree example [43].

The boosting method in gradient boosted trees uses gradient descent. Gradient descent builds on the fact that to minimize a continuous function, one can take steps proportional to the negative gradient of the function [14]. Gradient boosted trees generally perform slightly better than random forest on classification problems without many classes [20]. There are two main disadvantages with gradient boosted trees compared to random forest. The first is that they take a lot longer to build, because each tree in the forest has to be built successively. The other disadvantage is that tuning the algorithm takes a lot of effort compared to random forest; gradient boosted trees have several important hyperparameters while random forest has fewer important ones.
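
A corresponding gradient boosted trees sketch with scikit-learn; note the shallow trees and the learning rate, which together with the number of trees are among the parameters tuned in Chapter 6. The values below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are added one after another; each new shallow tree (max_depth=3)
# is fitted to reduce the error of the ensemble built so far, scaled by
# the learning_rate. subsample < 1.0 would train each tree on a random
# fraction of the data.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, subsample=1.0, random_state=0)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))
```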

2.4 Summary

In this chapter some of the most relevant aspects of machine learning were introduced. The different types of machine learning were explained, with a focus on supervised learning as it is the method used in this study. Some different performance metrics and performance evaluation methods were explained. Random forest and gradient boosted trees were also explained, as they are the two algorithms used in the study.


3. Related Work

In the following sections some related studies on machine learning and traffic classification are summarized.

Identification of VoIP encrypted traffic, Alshammari 2014

The paper by Riyad Alshammari and A. Nur Zincir-Heywood [9] presents a machine learning approach for identifying VoIP in encrypted network traffic. In the paper they compare three different supervised learning algorithms: C5.0, AdaBoost and Genetic Programming (GP). The reasoning behind the choice of algorithms was that all three use memory very well and do not have the drawback of significant memory overhead like, for example, Support Vector Machines (SVM). The data used to compare the algorithms were bidirectional flows between two hosts, representing network traffic. A tool-set called NetMate was used to create the network traffic flows, and these flows were then used to calculate a collection of different statistical features such as arrival times, packet lengths, number of bytes in the flows and other features. Two criteria were used to measure the performance of the three algorithms: detection rate (DR) and false positive rate (FPR). The results of the study were that C5.0 achieved the best performance of the three algorithms, with as high as 99% DR and 1% FPR for Skype, while AdaBoost got 74% DR and 3% FPR and GP got 95% DR and 10% FPR on the same dataset.

Encrypted traffic classification: ssh and skype, Alshammari 2009

In another paper Riyad Alshammari and A. Nur Zincir-Heywood [8] assessed machine learning for classifying encrypted traffic. Five algorithms were compared in the paper: AdaBoost, SVM, Naïve Bayes, RIPPER and C4.5. Two different types of network traffic were considered, SSH and Skype. The ten features considered were flow-based statistical features, such as mean backward packet length, standard deviation of forward packet length and number of packets in the backward direction. C4.5 and RIPPER performed the best overall over the different datasets and classes. For Skype-based traffic C4.5 managed a 98.4% DR and less than 8% FPR, AdaBoost 86% DR with 18% FPR, Naïve Bayes 92% DR with 60% FPR, SVM 89% DR with 55% FPR and RIPPER 97% DR with 6% FPR. For SSH traffic four different datasets were used, where three of them were real network traces and the last one came from simulated network traffic. Over the real network traces C4.5 got the best results, managing 97% DR and 0.8% FPR in the best case and 83% DR and 0.5% FPR in the worst case. SVM got a higher average DR but had as much as 47% FPR in the worst case.

Fingerprinting of Smartphone Apps From Encrypted Network Traffic, Taylor 2016

In the paper by Vincent F. Taylor et al. [45] a methodology and a framework called AppScanner were created for automatic fingerprinting and real-time identification of Android applications in encrypted network traffic. Traffic flows from apps are separated into interactive and non-interactive traffic; interactive traffic is generated by user interaction while non-interactive traffic is generated without user interaction, such as when an application polls a server for updates. In the paper the methodology focuses primarily on interactive traffic. Supervised learning was used on network flows which were filtered to only include error-free TCP traffic. The network flows were used to derive statistical properties such as variance and skew of packet sizes in the different flows. Two algorithms were compared in the paper, Random Forest and SVC, because both of them were found to be particularly suited for network flow classification. These two algorithms were then used in three different approaches: the first approach was to use one classifier per network flow, the second was to use a single large classifier for all applications, and the third approach was to use one classifier per application. One drawback of these algorithms is that they would always put each flow in one of the labels given to the classifier, even if the flow is from an unknown application, which can cause an increase in false positives. The classification algorithms can give a probability of the result being right. A classification validation stage was implemented where this probability was used to filter out unknown flows; if the probability was below a set threshold then that flow was removed. Of the two algorithms, random forest had the best performance over the three approaches. In the first approach, with one classifier per flow, random forest managed 84.4% precision, 83.1% recall and 82.1% accuracy. With a large classifier for all applications random forest got 89.5% precision, 85.9% recall and 86.9% accuracy. And in the last approach 96% precision, 82.5% recall and 99.8% accuracy.

Network traffic classification in encrypted environment, Datta 2015

Jayetta Datta et al. wrote a paper on traffic classification in an encrypted environment [17]. Specifically, the paper focuses on Google Hangouts traffic. Google Hangouts is a semi peer-to-peer application which allows users to video chat and text chat online. Three algorithms were compared: Naïve Bayes, AdaBoost and the J48 decision tree. The network traffic was collected and divided into four different classes: Google Hangout, Google Plus, Gmail and Not Hangout. A total of 2.5 million packets were collected over several hours of traffic. The features used for classification were packet based instead of flow based and consisted of 7 features. Some of the features were based on the Session Traversal Utilities for NAT (STUN) protocol used by Google Hangouts. STUN is used by Google Hangouts to interact with the server, since clients will be behind NAT in most cases. The features can be seen in Table 3.1. The testing and results were split into three tests, the first being only Hangout and non-Hangout traffic, the second test also included Gmail traffic and the last test added Google Plus as well. For the first test J48 and AdaBoost both got 99.99% DR and Naïve Bayes got 97.91%. In the second test AdaBoost got 99.99% again, Naïve Bayes got 99.98% and J48 got 99.46%. And in the last test AdaBoost and J48 hit 100% and Naïve Bayes 99.98%.

Table 3.1: Features used for classification

# Feature

0 is packet to/from a conference server offered in DNS reply to clients*.Google.com

1 Layer 4 Protocol (UDP or TCP)

2 is UDP using same source port as STUN?

3 is packet to/from a conference server offered in DNS reply to stun.1.Google.com

4 destination port

5 packet length


Realtime classification for encrypted traffic, Bar-Yanai 2010

Roni Bar-Yanai et al. wrote a paper on real-time classification of encrypted network traffic [10]. In the paper they introduce a hybrid statistical algorithm which integrates the k-nearest neighbor and k-means algorithms. The algorithm is then tested on HTTP, BitTorrent, SMTP and eDonkey traffic. K-nearest neighbor is one of the simplest and most used classification algorithms, but has the disadvantage of growing linearly with the training set size and thus becomes slow for larger datasets with thousands of samples. To fix this problem the k-means algorithm is introduced. The k-means algorithm has a lower accuracy rate but requires significantly less computational power and is faster than k-nearest neighbor. Because short flows (<15 packets) have unreliable statistical properties they were removed and only longer flows were used. It is mentioned that short flows account for 87% of total flows but only 7% of the total bytes of overall bandwidth. The results showed that k-means alone only averaged an accuracy rate of 83% while k-nearest neighbor averaged 99.1%. Together, the hybrid algorithm averaged a detection rate similar to k-nearest neighbor, which means not a lot of accuracy was lost, while the time complexity problem was reduced compared to the original k-nearest neighbor algorithm.

Classification and regression by randomForest, Liaw 2002

Andy Liaw and Matthew Wiener wrote a paper on classification and regression by random forest [33]. In the paper they presented a package for classification and regression for the R programming language. They also provided some interesting information on random forest aspects, such as how the variable importance metric is estimated in random forest. The variable importance is created as each tree in the forest is created, and is generated by looking at how the prediction error increases when the data for a feature is permuted while all the rest is unchanged. They also presented some practical information for using random forest. From their findings, the mtry parameter (max features in sklearn) generally does not affect the performance dramatically, and even setting it to 1 gives good performance for some data. With a lot of features and only a few that are assumed to be good, a large mtry parameter often gives better performance. Another thing they found was that the feature importance measurement varied from run to run, but the ranking was often stable. They also found that for some classification problems where there is an extreme imbalance in the class frequencies, like 99% to 1%, it was better to change the prediction rule from the simple majority rule.

Automated traffic classification and application identification using machine learning, Zander 2005

The paper published by Sebastian Zander et al. [49] presents a method for traffic classification using unsupervised machine learning. The paper explains some of the common earlier techniques used for identification of network traffic, like deep packet inspection and port number inspection. A more reliable technique, direct analysis of session and application layer content in the flows, is also explained. The disadvantages with this technique are that it is very computationally complex and sometimes a violation of privacy legislation. The approach used in their paper is called autoclass. Autoclass is a Bayesian clustering classifier which tries to find 'natural' classes in the dataset. Their method of classification included the identification of the optimal flow features that would minimize the classification cost while maximizing the accuracy of the classifier. This feature selection was done by repeatedly selecting a subset of the features, training the classifier on that feature set and evaluating the classifier, until the best set of features was found. A mean accuracy of 86.5% was found over all applications, and the application with the highest accuracy reached as high as 95%.

A survey of methods for encrypted traffic classification and analysis, Velan 2015

Petr Velan et al. [47] presented a paper where existing approaches for classification and analysis of encrypted traffic are surveyed. In the paper they explain some of the most widespread encryption protocols on the Internet: IPsec, TLS, SSH, BitTorrent and Skype. Some of the most used payload-based classification tools are also listed: PACE, NBAR, nDPI, Libprotoident and L7-filter. They also show how the structure of a protocol can give away a lot of information for classification and analysis of the traffic. For example, the protocol used in the traffic can be identified from knowledge of the packet structure of the unencrypted parts, such as the content type and length of the packet. A couple of feature-based techniques for encrypted traffic classification are described. The problem with most of these methods is that flow-based classifiers normally perform classification only after the flow has expired, but there are now a couple of research groups focusing on real-time classification using flow features. There are no conclusive results showing which feature-based traffic classifier has the best results, even though there are a lot of them. The main reason it is so hard to find the best results is that they depend a lot on the dataset used and the configuration of the models.

Discriminators for use in flow-based classification, Moore 2005

Moore et al. wrote a paper on features for use in flow-based classification [39]. They created a couple of datasets to be publicly available for the research community to assess classification techniques on real data. The data was gathered from a high-performance network monitor. The monitor ran over 24 hours and the day trace was split into randomly selected 28 minute blocks. In the paper almost 250 flow-based features were identified, where some of the features were generated from statistics on packet timings, packet lengths and transport protocol information such as SYN and ACK counts. A lot of the other features were derived from TCP headers, packet counts and header sizes.

Markov chain fingerprinting to classify encrypted traffic, Maciej 2014

Markov chain fingerprinting has been used to classify encrypted traffic, as shown in the paper by Maciej Korczyński and Andrzej Duda [30]. The information in SSL/TLS headers was used to find statistical features of the traffic of some applications. The features were called fingerprints in this paper. A sequence of SSL/TLS messages in one direction can be modeled by a Markov chain. Traffic to and from twelve applications was studied in this paper; the applications can be seen in Table 3.2.

Some examples of the features of the applications were given in the paper; for example, 55% of new SSL/TLS sessions to Twitter are resumed from previous sessions. PayPal, on the other hand, had 92.8% of its sessions start with a "Server Hello" message. Very good results were found, and most applications had a TPR over 90%, which varied a bit between the application and the dataset.


Table 3.2: Applications studied

# Application # Application

1 PayPal 7 Dziekant

2 Twitter 8 Amazon S3

3 Dropbox 9 Amazon EC2

4 Gadu-Gadu 10 Skype

5 Mozilla 11 Poczta

6 MBank 12 PKO

Amazon S3, for example, had a true positive rate of over 97% regardless of what dataset was used for training and for validation, while Dropbox was as low as 0.69% TPR for one of the sets. All results were validated on four different datasets using 4-fold cross-validation.

An empirical comparison of supervised learning algorithms, Caruana 2006

Rich Caruana and Alexandru Niculescu-Mizil wrote a paper comparing supervised learning algorithms using different performance metrics [16]. The algorithms compared in the study were Support Vector Machines (SVM), Artificial Neural Networks (ANN), Logistic Regression (LOGREG), Naive Bayes (NB), KNN, RF, DTs, bagged trees, boosted trees and boosted stumps. Every algorithm was evaluated using nine different metrics: accuracy, squared error, cross-entropy, ROC area, F-score, precision/recall break-even point, average precision, lift and calibration. To find the best parameters, multiple settings and variations were tested for each algorithm. To get the results, every algorithm was compared on eight different classification problems from different machine learning repositories. The on-average best performing algorithms were neural networks, bagged trees and random forests. After tuning the parameters, boosted trees, random forests and neural networks performed the best.


Web-Search Ranking with Initialized Gradient Boosted Trees, Mohan 2011

Ananth Mohan et al. wrote a paper on web-search ranking with initialized gradient boosted trees [38]. As search engines have had such success, there is interest in finding better ways of automating web search rankings. These are often treated as supervised machine learning problems. In the paper they examined random forest and gradient boosted regression trees (GBRT). Random forest was seen as comparable to, or even better than, GBRT. The two algorithms were then combined into what was called Initialized Gradient Boosted Trees (iGBRT), by using random forest to create a ranking function and using that as an initialization for GBRT. iGBRT outperformed both random forest and GBRT in all the benchmark tests. All algorithms were evaluated on datasets from the Yahoo ranking competition and the Microsoft MSLR datasets; both of the datasets already contained pre-defined training, validation and test splits. When comparing the speed of the three algorithms, random forest was the fastest, GBRT was slower and iGBRT was the slowest, as it sequentially uses random forest and GBRT.

Parameter Sensitivity of Random Forest, Huang 2016

A paper by Barbara F.F. Huang and Paul C. Boutros examined the parameter sensitivity of random forests [27]. The tuning of the random forest parameters had a big impact on performance in some cases, but in most cases the default parameters gave relatively good performance that was sometimes close to optimal. The results were evaluated on two datasets commonly used in computational biology. The first dataset includes 720 training samples and 576 validation samples with 15 features. The second dataset has only 3 features on 12,135 samples with the classes "no death" and "death" as outcome; of these, only 255 samples were used for training. The performance with varying parameters ranged from 0.61 to 0.99 AUC, where the default parameters achieved an AUC of 0.972 and the average AUC was 0.893.


Traffic Classification On The Fly, Bernaille 2006

Laurent Bernaille et al. wrote a paper on traffic classification on the fly [12]. Classification is often done after a TCP flow has completed; in the paper they propose a technique which only uses the first five packets to identify which application a flow is associated with, the goal being to identify the application as early as possible. The only feature used was the size of the data packets; no control packets were used. The training and validation data were captured several months apart on the same link and included 10 different applications, and it was found that 5 packets were enough to separate the applications. Unsupervised clustering is used to find groups in the dataset, as a single application might exhibit several different behaviors (clusters) that each should be modeled.

The classifier consisted of four modules. The first module was the packet analyzer, which extracted the protocol, source/destination ports and IP addresses, and the sizes of packets going in and out of a university network. After the first P packets of a connection have been analyzed, the flow is sent to the flow conversion module, which maps the flow to a spatial representation. The flow is then sent to the cluster assignment module, which searches the cluster descriptions to find the one that best fits the flow. The last module is the application identification module, which selects the most likely application for the given flow. Over 80% accuracy was achieved for most of the applications; for example, nntp achieved the highest accuracy at 99.6%, while pop3 achieved 0% because it always belonged to clusters where pop3 was not the dominant application.

Traffic Classification - A Runtime Performance Study, Västlund 2017

In the paper by Filip Västlund [21], the runtime performance of machine learning algorithms is examined when classifying video and non-video traffic. The goal of the study was to find the fastest classification time relative to the classification accuracy by creating the models in Python and implementing them as C code. Random forest and gradient boosted trees were compared, and random forest was found to be significantly faster; initial tests suggested that it was roughly seven times faster than gradient boosted trees after compiler optimization.


3.1 Summary

In this chapter some related studies on machine learning and traffic classification were summarized. Many of the studies compared different classification approaches and algorithms, and a common result was that random forest performed very well and was often one of the top algorithms in overall performance. Classification time was often not a factor considered in those studies, while it is one of the main aspects of this study.


4. Study Preliminaries

The following chapter presents preliminary information for the experiments, such as the creation and processing of the datasets, the features used, and general information on how the classifiers are created and evaluated.

4.1 Traffic Flow

A flow is defined by the sending and receiving IP addresses, the sending and receiving port numbers, and the transport protocol used. In this study, per-packet flow information is captured during the first 60 seconds of the flow lifetime. The classification in this study is done on features calculated per flow.
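For illustration only, such a flow identifier could be represented as a simple 5-tuple; the field names in the sketch below are assumptions and not the representation used in the actual capture pipeline.

```python
from collections import namedtuple

# Illustrative only: a flow key as the classic 5-tuple
# (field names are assumptions, not the actual capture format).
FlowKey = namedtuple("FlowKey", ["src_ip", "src_port", "dst_ip", "dst_port", "protocol"])

key = FlowKey("10.0.0.5", 51342, "203.0.113.7", 443, "TCP")
print(key)
```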

4.2 Dataset

The dataset provided for this study was generated by Procera using one of their Deep Packet Inspection (DPI) devices; the data was captured in the DPI box over 24 hours. After the data was received, it was preprocessed by filtering out artifacts and uninteresting data. Encrypted traffic was filtered out, since all encrypted traffic was labeled as non-video even though it might contain video traffic. The artifacts filtered out were flows which did not start with packet index one and flows that started in the last 30 seconds of the capture. The uninteresting flows were those with 10 or fewer packets, flows containing packets larger than 1580 bytes, and flows with traffic in only one direction. Some additional pre-training filtering was also done, removing unclassified flows and flows that were not TCP or UDP. After preprocessing, the final datasets for the study could be generated; at this point the dataset included over five million network traffic flows.
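A rough idea of this filtering, expressed with pandas, is sketched below; the column names and the CSV input are purely illustrative assumptions and do not reflect the actual capture format or preprocessing code.

```python
import pandas as pd

# Minimal sketch of the pre-training filtering described above.
# Column names are illustrative assumptions, not the actual capture format.
flows = pd.read_csv("flows.csv")
capture_end = flows["start_time"].max()

flows = flows[flows["first_packet_index"] == 1]         # keep flows captured from their first packet
flows = flows[flows["start_time"] <= capture_end - 30]  # drop flows starting in the last 30 s
flows = flows[flows["num_packets"] > 10]                # drop flows with 10 or fewer packets
flows = flows[flows["max_packet_size"] <= 1580]         # drop flows with packets over 1580 bytes
flows = flows[flows["has_both_directions"]]             # drop one-directional flows
flows = flows[flows["protocol"].isin(["TCP", "UDP"])]   # keep only TCP and UDP flows
```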


Table 4.1 shows the 10 most common services in the dataset and Table 4.2 shows the top 4 most common video services in the dataset.

Table 4.1: Top 10 services in the dataset

#   Service              Flows    Fraction of total flows
1   Google               697102   0.13275
2   HTTP                 501443   0.095490
3   Facebook             451766   0.086030
4   Youtube              308264   0.058703
5   Instagram            306007   0.058273
6   Outlook.com          234836   0.044720
7   Apple                197555   0.037621
8   iCloud control data  116373   0.022161
9   iTunes Store         102828   0.025460
10  Amazon AWS            55891   0.013839

Table 4.2: Top 4 services of video traffic in the dataset

#   Service            Flows    Fraction of total flows
1   YouTube            308264   0.058703
2   HTTP media stream   39255   0.007475
3   Netflix             16555   0.003153
4   Kodi                  852   0.000162

4.3 Features

After the preprocessing of the data, the feature sets used in the study could be generated. Two different sets of features were created. The first set, shown in Table 4.3, was derived statistically from the packet-level behavior of each network flow. These "Statistical Features" try to capture statistical properties of the packet sizes in each flow.

The second set of features was discussed with Procera and can be seen in Table 4.4. These are features that can easily be calculated directly in a deep packet inspection box with simple counters, additions and subtractions; fast calculation of the features is required for real-time classification of network flows. Instead of calculating statistical properties of the flows, these "Composite Features" try to capture properties of the packets and their arrival times, such as the number of bursts, properties of the inter-arrival times, and packet sizes.


Table 4.3: Statistical features

#      Feature
0      Number of bytes in the flow
1-3    Number of packets in the flow, total/up/down
4-6    Mean packet size in the flow, total/up/down
7-9    Max packet size in the flow, total/up/down
10-12  Standard deviation of packet size, total/up/down
13-15  Variance of packet size, total/up/down
16-18  Packet size skew, total/up/down
19-21  Packet size kurtosis, total/up/down
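For illustration, the kind of statistics listed in Table 4.3 could be computed for one direction of a flow roughly as in the sketch below (using pandas; this is not the actual feature extraction code).

```python
import pandas as pd

def statistical_features(packet_sizes):
    """Compute Table 4.3 style statistics for one list of packet sizes (bytes)."""
    s = pd.Series(packet_sizes, dtype=float)
    return {
        "num_bytes": s.sum(),
        "num_packets": len(s),
        "mean_size": s.mean(),
        "max_size": s.max(),
        "std_size": s.std(),
        "var_size": s.var(),
        "skew_size": s.skew(),
        "kurtosis_size": s.kurt(),
    }

# Example: packet sizes (bytes) for the downstream direction of a flow.
print(statistical_features([1500, 1500, 1400, 52, 1500, 900]))
```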

Table 4.4: Composite features

#      Feature
1      Flow duration
2-4    # Bytes, total/up/down
5-7    # Packets, total/up/down
8-13   Mean/Max packet size, total/up/down
14-19  Mean/Min/Max IAT, up/down
20     Mean (per-packet distance from running average of IAT)
21     Mean (per-packet distance from running average of packet size)
22     # Non-full packets
23-28  # of 200/400/800/1600/3200/6400 ms silent periods
29-34  # of 100/200/400/800/1600/3200 kB bursts
35-40  # of at least 500/1000/1400 b packets, up/down
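To illustrate how counter-based features of this kind can be maintained, the sketch below counts silent periods from a list of packet arrival times; the thresholds match the feature definition above, but the code itself is only an illustrative assumption, not the DPI implementation.

```python
# Minimal sketch of counter-based silent-period features
# (thresholds in milliseconds; illustrative, not the actual DPI code).
SILENT_THRESHOLDS_MS = [200, 400, 800, 1600, 3200, 6400]

def silent_period_counts(arrival_times_ms):
    """Count inter-arrival gaps exceeding each threshold, using only simple counters."""
    counts = [0] * len(SILENT_THRESHOLDS_MS)
    for prev, curr in zip(arrival_times_ms, arrival_times_ms[1:]):
        gap = curr - prev
        for i, threshold in enumerate(SILENT_THRESHOLDS_MS):
            if gap >= threshold:
                counts[i] += 1
    return counts

print(silent_period_counts([0, 50, 300, 310, 2000]))  # -> [2, 1, 1, 1, 0, 0]
```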

4.4 Testing and Training data

To perform supervised learning, two sets of data are required. The first is the training set, which is used to create the model/classifier. The second is the validation set, which is used to evaluate the model performance. The training and validation sets were sampled at random from the dataset with equal amounts of video and non-video flows. The main reason for sampling the data is the training time: more training data leads to less bias towards the training set and less variation in model performance, but training the model on the whole dataset would take a very long time, so the training set size is a trade-off between training time and performance. After the data has been sampled into training and test sets, both sets are split into a feature vector and ground truth labels.
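A balanced sampling of this kind could look roughly like the sketch below; the column name is_video, the sample sizes and the 80/20 split are illustrative assumptions rather than the exact values used in the study.

```python
import pandas as pd

# Sketch of balanced sampling into training and validation sets.
# The "is_video" label column, sample sizes and split are assumptions.
flows = pd.read_csv("flows_with_features.csv")
n_per_class = 50_000

video = flows[flows["is_video"] == 1].sample(n=n_per_class, random_state=1)
non_video = flows[flows["is_video"] == 0].sample(n=n_per_class, random_state=1)
sampled = pd.concat([video, non_video]).sample(frac=1, random_state=1)  # shuffle

split = int(0.8 * len(sampled))
train, test = sampled.iloc[:split], sampled.iloc[split:]

# Split each set into a feature vector and the ground truth label.
X_train, y_train = train.drop(columns=["is_video"]), train["is_video"]
X_test, y_test = test.drop(columns=["is_video"]), test["is_video"]
```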

After the training set is constructed it can be used to train the classifier. This is done in scikit-learn with the classifier's fit function; the parameters for the model are provided when the classifier is created. Once the model has been trained, the feature vector of the validation set can be given to the model to get a prediction for that dataset, which is done with the predict function in scikit-learn. The prediction can then be compared to the ground truth of the validation set to get a performance metric for the classifier, either with one of the metric functions in scikit-learn, such as accuracy_score or auc_score, or by manually comparing the prediction and the ground truth.
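Put together, this train-and-evaluate workflow corresponds roughly to the following scikit-learn calls (a minimal sketch that reuses X_train, y_train, X_test and y_test from the sampling sketch above; the parameter values are placeholders, not the tuned values).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Train the classifier on the training set (parameters are given at construction).
clf = RandomForestClassifier(n_estimators=10, random_state=1)
clf.fit(X_train, y_train)

# Predict on the validation set and compare with the ground truth.
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))

# AUC uses the predicted probability of the video class (label 1).
y_score = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, y_score))
```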

4.5 Classifier Evaluation

There are multiple metrics for evaluating the performance of a classifier; some of these metrics are described in Section 2.3.3. For this study, measurements of the performance of a single classifier are needed as well as a way to easily compare classifiers. Three metrics were chosen for measuring the performance of a classifier: accuracy, precision and AUC. Accuracy was chosen because it is a very intuitive metric that gives a view of the overall performance of the classifier. AUC is not as intuitive as accuracy, but it was chosen because it also gives a single value for the overall performance of a classifier. Precision was chosen for problems where correctly classified video traffic is more important than overall performance, for example applications where the flows classified as video should be correct as often as possible.

When comparing classifiers, AUC is the metric of choice as it gives a good overall performance comparison between the classifiers, and a ROC curve for multiple classifiers can easily be plotted and examined.
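Such a comparison could be plotted roughly as in the sketch below, which assumes matplotlib is available and reuses the training and validation data from the earlier sketches; the classifiers and parameter values are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import auc, roc_curve

# Sketch: fit two classifiers on the same training data and compare ROC curves
# on the validation data (X_train, y_train, X_test, y_test from the sketches above).
models = {
    "Random forest": RandomForestClassifier(n_estimators=10, random_state=1),
    "Gradient boosted trees": GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```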

4.6 Chapter Summary

This chapter presented some of the preliminaries needed to understand the experiments. The dataset and feature sets were explained in detail, the method for evaluating the classifiers was described, and the data sampling for the training and validation data was explained.


5. Random Forest Examination

The following chapter contains the examination of the different parameters for random forest. Each parameter that affects the performance and complexity of the model is analyzed. Two aspects of random forest are examined. The first aspect is the tuning of the parameters: every relevant parameter is examined and the choice of tuning method is explained. The second aspect is an analysis of the parameters from a latency point of view, in other words, finding out whether it is possible to reduce the complexity of the forest while keeping relatively good performance.

There are three main parameters to tune in random forest: the number of trees in the forest, the max depth of each tree, and the max features to be considered at each node. Of these major parameters, two affect the classification speed, namely the number of trees and the max depth, as both of them increase the total size of the forest.

There are also some less impactful parameters, two of which are considered here because they affect how deep each tree grows: the minimum number of samples required for a split to occur and the minimum number of samples allowed in a leaf. The remaining parameters keep their default values, as they do not affect the performance or classification speed in any relevant way.
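The examined parameters correspond to the following scikit-learn constructor arguments; the values shown here are defaults or placeholders, not the tuned values found later in this chapter.

```python
from sklearn.ensemble import RandomForestClassifier

# The parameters examined in this chapter, with scikit-learn's names
# (values below are defaults/placeholders, not the tuned values).
clf = RandomForestClassifier(
    n_estimators=10,       # number of trees in the forest
    max_depth=None,        # maximum depth of each tree (None = grow until pure)
    max_features="sqrt",   # number of features considered at each split
    min_samples_split=2,   # minimum samples required to split a node
    min_samples_leaf=1,    # minimum samples required at a leaf
)
```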


5.1 Number of Trees

The first parameter that was tuned and analyzed was the number of trees in the forest. In random forest each tree is built individually. Increasing the number of trees increases the performance of the classifier, but it also increases the classification time linearly, which can be seen later in Section 7.3.1. The goal here is to find a number of trees that keeps the performance high while still keeping the number of trees relatively low for fast classification.

Statistical Features

To find a good value for the number of trees, the accuracy was plotted for an increasing number of trees, up to 40 trees. This process was repeated five times with different random samplings of the data to make sure the result was consistent.
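A sketch of this sweep is shown below; sample_training_and_validation() is a hypothetical placeholder for the sampling step in Section 4.4, and the exact loop used in the study may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Sketch of the tree-count sweep over five random samplings of the data.
# sample_training_and_validation() is a hypothetical placeholder (see Section 4.4).
tree_counts = list(range(1, 41))
accuracies = np.zeros((5, len(tree_counts)))

for run in range(5):                               # five different random samplings
    X_train, y_train, X_test, y_test = sample_training_and_validation(seed=run)
    for i, n_trees in enumerate(tree_counts):
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=run)
        clf.fit(X_train, y_train)
        accuracies[run, i] = accuracy_score(y_test, clf.predict(X_test))

print(accuracies.mean(axis=0))                     # mean accuracy per tree count
```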

[Figure 5.1: Accuracy performance for increasing number of trees. (a) Statistical Features, (b) Composite Features. Accuracy score plotted against the number of trees.]

Looking at the results in Figure 5.1a, it can be seen that the accuracy increases significantly up to a certain number of trees, after which the performance gain from each additional tree diminishes considerably. This knee of the curve was determined to be at around 8-10 trees, while below 5 trees the performance increases significantly with each additional tree. A good value for the number of trees can be found by looking at the graph and choosing a relatively low value while staying above 10 trees. 10 trees was determined to be a good value as it is past the knee of the curve by a small margin.
