
Linköping University | IDA Bachelor Thesis | Computer and Information Science, Spring 2016 | LIU-IDA/LITH-EX-G--16/074--SE

Detecting network failures using principal component analysis

Tim Lestander

Jakob Nilsson

Tutor: Niklas Carlsson


Abstract

This thesis investigates the efficiency of a methodology that first performs a Principal Component Analysis (PCA) and then applies a threshold-based algorithm with a static threshold to detect potential network degradation and network attacks. A proof of concept of an online algorithm is then presented and analyzed; it uses the same methodology except that training data is used to set the threshold. The analysis and algorithms are applied to a large crowd-sourced dataset of Internet speed measurements, in this case from the crowd-based speed test application Bredbandskollen.se.

The dataset is first analyzed on a basic level by looking at the correlations between the number of measurements and the average download speed for every day. Second, our PCA-based methodology is applied to the dataset, taking many factors into account, including the number of correlated measurements. The results from each analysis are compared and evaluated. Based on the results, we give insights into how efficient the tested methods are and what improvements can be made to them.


Acknowledgments

We would like to thank and show our gratitude to Niklas Carlsson for being a supportive, helpful and good supervisor to us while writing this thesis. We would also like to sincerely thank Rickard Dahlstrand at .SE (The Internet Infrastructure Foundation) for providing us with the dataset that made this thesis possible. The dataset made this thesis a real world case, which made it far more interesting. Finally, we would like to thank Martin Claesson and Lovisa Edholm for proof-reading and providing useful feedback to us.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Anomaly detection
  1.3 The dataset
  1.4 Problem formulation
  1.5 Contributions and thesis outline
2 Background and Related Work
  2.1 Anomaly detection
  2.2 Methods and algorithms
  2.3 Crowd sourced measurements
3 Theory
  3.1 Principal Component Analysis
  3.2 K-mean clustering
  3.3 K-means algorithm
4 Methodology
  4.1 The KDD-model
  4.2 Working with the dataset
5 Results
  5.1 Initial tests
  5.2 Principal Component Analysis
  5.3 Interpretation of principal components
  5.4 2-dimensional K-mean cluster
  5.5 Proof of concept of online algorithm
6 Discussion
7 Conclusion
  7.1 Future work

List of Figures

1.1 Contextual anomalies
1.2 Point anomalies
3.1 K-means algorithm
5.1 Number of measurements and mean speed every day during 2014
5.2 Number of measurements and mean speed for Telia users every day 2014
5.3 Number of measurements and mean speed for Tele2 users every day 2014
5.4 Number of measurements and mean speed for Telenor users every day 2014
5.5 3D Plot of mean values from correlated measurements
5.6 Scree plot from the principal component analysis
5.7 Principal Component Plot from PCA
5.8 2D plot with component 1 and 2 as axes
5.9 2D plot with component 1 and 3 as axes
5.10 2D plot with component 2 and 3 as axes
5.11 Visualization of the days identified as anomalies

List of Tables

5.1 Principal Component Matrix from PCA
5.2 Eigenvalues, total variance and cumulative variance of principal components 1-10
5.3 Table of anomalies and principal components
5.4 Summary of results from online algorithm

1 Introduction

This section introduces the reader to the subject and provides the basic knowledge needed to follow along in this thesis.

1.1 Background

In today’s society the Internet plays a very important role and is an essential part of transferring data and information. Since a lot of people rely on fast and safe Internet connections in their jobs as well as in their daily life, it is important to detect and prevent network failures and network performance degradations to make sure that information and data are not changed or lost. Crowd-sourced measurements and anomaly detection could help detect network failures or performance degradations in a network caused by an attack, a natural disaster, or some other unforeseen event.

What correlations exist between correlated network measurements (e.g. multiple measurements from the same location) and network speed? How effective are crowd-sourced measurements when it comes to detecting anomalies and network failures? Which techniques can be used for anomaly detection in large datasets? In this thesis we use a large crowd-sourced dataset from Bredbandskollen¹ to answer these and other questions regarding this subject.

1.2 Anomaly detection

Anomaly detection is about finding patterns in data that do not follow an expected behaviour and that might occur for different reasons, including, for example, malicious activity. Patterns of unexpected behaviour are also referred to as outliers, exceptions, peculiarities, surprises, etc. We refer to these patterns and behaviours as anomalies to be consistent and avoid confusion. Anomaly detection is an increasingly important subject that is being used in more and more domains. For example, anomaly detection is used for detecting bank or identity fraud, eco-system disturbances, cyber-intrusions, system breakdowns, and a lot more.

¹ Bredbandskollen is a tool where the users can test and evaluate their network connection based on upload and download speed.

Figure 1.1: Contextual anomalies

Figure 1.2: Point anomalies

While working with anomaly detection in large datasets there are challenges. First, there is a need to define what is "normal" behaviour and what is not. Since the behaviour can change and evolve over time, behaviour that at one point is considered normal might be abnormal in the future. Second, the context of the measurements also plays a major role. We need to consider what happened right before and after the moment we are analyzing, since the context is an important factor when analyzing such anomalies [4]. Figure 1.1 illustrates two different measurements where the first measurement is considered normal while the second is not. These types of anomalies are called contextual anomalies. Another common type of anomaly is the point anomaly, where one specific point is compared to the rest without considering the context. Point anomalies are illustrated in Figure 1.2, where the points in the regions N1 and N2 are considered normal while the points A1 and A2 and the points in the region A3 are anomalies.

Anomaly detection techniques are often split into three main categories: supervised anomaly detection, unsupervised anomaly detection, and semi-supervised anomaly detection [7]. When working with supervised techniques there is a need for a labeled dataset where every entry has been labeled as "normal" or "abnormal", so that the computer can be taught through machine learning what is an anomaly and what is not. Semi-supervised techniques use a normal training dataset to derive a model and then test the likelihood that a test instance was generated by that model. The third category, unsupervised anomaly detection, works with an unlabeled dataset, assumes that the majority of the data instances are normal, and then looks for instances that do not fit well with the rest of the dataset.

1.3 The dataset

In this thesis we have analyzed a large, but relatively sparse, dataset from Bredbandskollen.se, consisting of data from 41 million measurements. All of these measurements were made from cell phones or tablets without WiFi, which means that we have only been working with data from mobile networks.

Every entry in the dataset contains information about the measured event, including the geographical position (longitude and latitude) of the mobile unit at the time of the measurement, the upload speed, the download speed, the end-to-end latency, the Internet Service Provider (ISP), the timestamp, and the network technology that was used during the test. The dataset consists of 41 million entries, but we limited ourselves to the measurements gathered during the one year period between January 2014 and December 2014. The year 2014 is the most recent year in the dataset with measurements covering all 365 days, and the technical differences were therefore as small as possible compared to today. These limitations still gave us 14 million entries to work with. One advantage of reducing the span was that the technological differences were much smaller over a one year period than over the full eight year period of the original dataset.

The entries in the dataset differ in many ways. In total, the measurements were made from 3,184 different mobile phone models. The majority of the measurements were made from iPhones (38.7%) and iPads (21.7%), and the three most popular operators are Telia (32.7%), Tele2 (10.9%) and Telenor (10.3%) [2].

1.4 Problem formulation

The purpose of this thesis, and the problem that we are trying to solve, is to conclude whether there is an efficient method that can be used to detect network failures and/or network degradations. This thesis does not compare different methods but focuses on evaluating an approach based on principal component analysis followed by applying a K-means algorithm for detecting anomalies.

1.5 Contributions and thesis outline

This thesis makes three primary contributions. First, we perform a principal component analysis and discuss how effective it is when working with anomaly detection in big datasets. Second, we present a simple anomaly detection method, apply it on our dataset, and discuss and analyze whether this method is efficient and can be used for detecting network failures or performance degradations in networks. Third, we implement an online detection algorithm that is based on training data from previous measurements and compare the results between the online algorithm and the basic algorithm.


The remainder of this thesis is structured as follows: First, Chapter 2 provides some basic knowledge regarding the algorithms and methods that we are using for this analysis. Second, in Chapter 3 we apply these to our dataset to see if any interesting observations can be made, and present these. In Chapters 4 and 5 we then describe our implementation and present the results from our online anomaly detection algorithm, and compare these results with the basic detection method presented earlier. Lastly, we discuss the methods, the results, and how these can be used to detect network failures or network degradations.

2 Background and Related Work

This chapter presents papers and articles that provide good knowledge and background to our thesis. We also present some papers that are related to what we are going to do.

2.1 Anomaly detection

There are a lot of papers regarding different anomaly detection methods in large datasets. Chandola et al. present a very well written survey covering a large number of different anomaly detection methods [4], which provides a really good introduction to the subject. Their survey does not target large datasets specifically, but provides useful and relevant information on this subject.

Patcha and Park also present a very good overview of anomaly detection techniques and the problems that might occur when detecting anomalies [12]. They discuss both anomaly detection systems and hybrid intrusion detection systems that are relevant both in the present and the recent past. The paper also covers the technological trends in the anomaly detection field and the challenges that will arise when trying to detect anomalies.

2.2 Methods and algorithms

McCallum et al. present a method similar to the K-means clustering method that we are using in this thesis, which they call canopy clustering [10]. The key idea of this method is to do the clustering or partitioning in two different steps. First, a cheap and approximate similarity measure is used to group the different points into overlapping subsets. Second, the distances between points that belong to the same subsets are computed. By doing this in two steps, the number of distance computations is reduced.

To solve the K-means clustering problem we use the K-means clustering algorithm. This is not the only algorithm that solves this problem; Kanungo et al. present another algorithm called the filtering algorithm [8]. This is basically an improved version of Lloyd's K-means clustering algorithm. The algorithm uses a so-called kd-tree, which is a binary tree, to store the data points. The idea of this algorithm is to get a few candidates for the existing cluster centers and filter out the centers that do not fit. This means that there will not be any updates of the structure, unlike the algorithm we are using, where the centers are updated continuously. The paper also discusses the running time of the algorithm and how the running time differs depending on the distance between the clusters.

Ghoting et al. presented a distance-based method for detecting anomalies in high-dimensional datasets [5]. This paper is interesting and relevant since we are working with multiple dimensions as well. The authors present an algorithm they call RBRP, which is a modified version of the k-nearest neighbor method. The authors argue that methods such as clustering methods do not scale well with the number of dimensions in the dataset. This is an interesting statement and is one thing that we investigate and discuss further in this thesis.

The Principal Component Analysis is a big part of this thesis. The analysis is very well described in an interesting tutorial paper by Abdi and Williams [1]. They present the prerequisite notions and notations of the analysis as well as the goals of PCA. They also explain how to interpret the results from the analysis, what to look for, and what can be useful. Furthermore, they discuss how to decide how many components to use with the help of the so-called "elbow rule", which we use in our thesis. To help the reader understand PCA, the authors also include a well explained and well written example of a Principal Component Analysis.

2.3 Crowd sourced measurements

Odlander and Andersson have written their bachelor thesis on a subject similar to ours [11]. They used the same dataset from Bredbandskollen as we do, but investigated whether crowd-based data could be used for detecting DDoS attacks. Their thesis focuses on one particular week in December when it is known that there was an actual DDoS attack against the Swedish tele operator Telia's servers. They focused on the number of measurements and download speed and applied an adaptive threshold to identify unusual patterns in the data.

Linder et al. have written a paper about using crowd-sourced measurements for performance prediction [2]. This paper uses the same dataset as we do. They use the dataset to evaluate the prediction accuracy and achievable performance improvements when using crowd-sourced measurements as a prediction tool.

Another paper on a similar subject has been presented by Hiran et al., where they use crowd-based detection of routing anomalies [3]. This paper has a security approach and investigates whether a crowd-based method can be used to detect attacks in networks. Since the overhead cost of continuously active measurements is high, they present a passive monitoring approach combined with collaborative information sharing as a method to detect unusual behaviours that should be investigated further.

Arlitt et al. have also written a paper using passive crowd-sourced measurements, but it focuses on which conclusions can be drawn regarding the infrastructure and performance of the World Wide Web [16]. This was done by collecting data on four million transactions from 95 different proxy servers across the world. This data was then used to determine differences in throughput, distinguish content providers by their quality of service, and much more. This paper is interesting for us since its methodology, a passive crowd-sourced measurement based approach, is similar to ours.

3 Theory

In this section we are discussing and explaining the anomaly detection methods and the algorithms that are used to analyze the dataset.

3.1 Principal Component Analysis

The Principal Component Analysis (PCA) is used to identify variation and find strong patterns in big datasets [9]. These patterns are called principal components, and the goal is to capture the maximum amount of variance with the fewest number of principal components.

This is done by looking at different characteristics of the data points in the set. Since many of the characteristics measure related properties, many of them are more or less redundant. PCA solves this by summarizing each data point with fewer characteristics. PCA is often used for dimension reduction to make datasets easier to work with and more manageable [1]. It finds the characteristics that best summarize the data points by constructing new characteristics in a rotated plane from linear combinations of the different factors [6]. The newly constructed characteristics are the properties that differ the most across the data points.

To help us do the PCA, we used a software package called SPSS². SPSS creates tables and plots to help the user visualize the data that the user wants to analyze, as well as present statistical observations of the dataset.
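The thesis used SPSS for this step; purely as an illustration, here is a minimal sketch of the same dimension reduction in Python with scikit-learn. The DataFrame name `daily_features` and its layout (one row per day, one column per factor) are our assumptions, not part of the original work.

```python
# Hypothetical sketch of the PCA step described above, using scikit-learn
# instead of SPSS. `daily_features` is an assumed pandas DataFrame with one
# row per day and one column per factor (measurement counts, speeds, etc.).
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def run_pca(daily_features: pd.DataFrame, n_components: int = 3):
    # Standardize the factors so that no single unit dominates the variance.
    scaled = StandardScaler().fit_transform(daily_features.values)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)           # principal component scores per day
    loadings = pd.DataFrame(pca.components_.T,   # how each factor contributes to each PC
                            index=daily_features.columns,
                            columns=[f"PC{i + 1}" for i in range(n_components)])
    return scores, loadings, pca.explained_variance_ratio_
```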

3.2 K-mean clustering

Clustering is a process of partitioning and splitting sets of data (or other things) into smaller groups, or clusters. A daily-life example of clustering is the items in a supermarket, which are clustered into different categories: tomatoes, carrots and cucumbers are grouped into vegetables, for example. The goal of clustering is to assign a cluster to each data point in a set, where we have n data points $x_j$, $j = 1, \dots, n$ that have to be partitioned into k clusters. The K-means clustering method tries to find the positions $\mu_i$, $i = 1, \dots, k$ of the cluster centers that minimize the square of the distance between the data points and the clusters. It minimizes the following function [13]:

$$J = \sum_{i=1}^{k} \sum_{j=1}^{n} \lVert x_j - \mu_i \rVert^2. \qquad (3.1)$$

Figure 3.1: K-means algorithm

3.3 K-means algorithm

The K-means algorithm is used to solve the K-means clustering problem. The algorithm performs an iterative alternating fitting process in order to create a specified number of clusters. The method initially selects k objects that act as cluster centers; in other words, k is the desired number of clusters. The method then assigns each object in the dataset to the cluster center that the object is most similar to. After this, the means of the clusters are updated using the following formula:

$$\mu_i = \frac{1}{|k_i|} \sum_{x_j \in k_i} x_j. \qquad (3.2)$$

The objects are then again assigned to the cluster centers based on the updated cluster means, and when this is done, the cluster means are recalculated once more. This loop continues until there are no changes in the clusters [17]. The different steps are illustrated in Figure 3.1, where the red dots are the cluster centers and the green and blue squares are the data points.
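As a sketch of the iterative procedure described above (our reconstruction, not the implementation used in the thesis), the loop of assigning points to the nearest center and recomputing the means per Equations 3.1 and 3.2 can be written as follows; the random initialization and the convergence test are simplifying assumptions.

```python
# Minimal Lloyd's K-means sketch: assign each point to its closest center,
# recompute the cluster means, and repeat until the assignments stop changing.
import numpy as np

def kmeans(points: np.ndarray, k: int, max_iter: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Initialize centers with k randomly chosen data points.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: give every point the label of its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no assignment changed, so the clustering has converged
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points (Eq. 3.2).
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centers[i] = members.mean(axis=0)
    return centers, labels
```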

4 Methodology

In this section we present which methods are used and how we work with the dataset during this project.

4.1 The KDD-model

During this project we have worked according to the KDD-model, which is a process for how to work with and extract interesting data from large datasets. KDD in this case is an acronym for "knowledge discovery in databases", and the process consists of five steps for how to work with big databases or datasets [15].

1. Selection of raw data: The first step in the KDD-model is to select which raw data is interesting and should be handled. This can be a question of limiting the dataset to a time period to get a better focus on the interesting parts. In our case we have limited the time period to a single year to reduce the technological differences over the examined time period.

2. Data preprocessing: To get the best possible results it is important to prepare the selected data in a way that is easy to handle. This might mean deleting noise from the dataset or filtering out the interesting parameters so that the processing can be made more efficient later on. In our case this means that we prepared the dataset by filtering out the entries made during 2014 and keeping the interesting parameters, such as date and download speed (see the sketch after this list).

3. Data transformation: To get the best possible result it is important that the data you work with is represented in an efficient way. It is a waste of computing power and time if the data needs to be reformatted at run time; therefore, this step makes sure that the dataset is prepared in a way that is easy to read and work with.

4. Data mining: In this step one or more data mining algorithms are applied to the dataset. In our case we applied PCA and the K-means clustering method, which were presented in the theory chapter.

5. Interpretation and evaluation: After the data mining is done, the only thing left is to interpret and evaluate the results. This is usually done both visually, through graphs and diagrams, and by discussing the results in text, which is what we do later in this thesis.
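As an illustration only, a sketch of the selection and preprocessing steps (KDD steps 1-3) in Python; the file name and column names below are assumptions for the example, since the actual dataset layout is not published.

```python
# Hypothetical sketch of the selection/preprocessing steps (KDD steps 1-3).
# The file name and column names are assumed for illustration only.
import pandas as pd

def load_2014_measurements(path: str = "bredbandskollen_mobile.csv") -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["timestamp"])
    # Steps 1-2: keep only the 2014 measurements and the parameters of interest.
    df = df[(df["timestamp"] >= "2014-01-01") & (df["timestamp"] < "2015-01-01")]
    cols = ["timestamp", "latitude", "longitude", "download_speed", "operator"]
    df = df[cols].dropna()
    # Step 3: add a plain date column so later steps can aggregate per day.
    df["date"] = df["timestamp"].dt.date
    return df
```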

4.2 Working with the dataset

In this thesis we present an overview of the dataset by analyzing it on a basic level. We focus on multiple different variables when analyzing the dataset. First, we look at how many measurements were made every day. We also note the average download speed for every day. Second, we look at the volume of correlated measurements each day, as well as the average download speed of the correlated measurements. We consider measurements correlated if they are in the same location and within a time window that we vary between 10, 30 and 60 seconds. We consider the "same" location to be squares with side lengths that we vary between 10, 50 and 100 meters, in different combinations. Third, we look at the number of measurements and average download speed of the three largest operators, which are Telia, Tele2 and Telenor. This gives a total of 26 different factors and dimensions, which we try to reduce by doing a PCA and dimension reduction.
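A sketch of how the 26 daily factors could be assembled is shown below; the crude degrees-to-meters grid conversion and the column names are our assumptions, not the exact procedure used in the thesis.

```python
# Hypothetical sketch of building the daily features, including "correlated
# measurements": measurements that share a location square (bucket) and a time
# window with at least one other measurement. 2 overall + 6 per-operator +
# 18 bucket/window factors = 26 factors per day.
import pandas as pd

def daily_features(df: pd.DataFrame,
                   bucket_sizes=(10, 50, 100),   # square side length in meters
                   windows=(10, 30, 60)) -> pd.DataFrame:
    feats = df.groupby("date").agg(measures=("download_speed", "size"),
                                   mean_speed=("download_speed", "mean"))
    for op in ("Telia", "Tele2", "Telenor"):
        sub = df[df["operator"] == op].groupby("date")["download_speed"]
        feats[f"{op}_measures"] = sub.size()
        feats[f"{op}_speed"] = sub.mean()
    for side in bucket_sizes:
        for win in windows:
            cell = side / 111_000.0  # rough meters-to-degrees conversion (assumption)
            tmp = df.assign(
                lat_cell=(df["latitude"] / cell).round().astype(int),
                lon_cell=(df["longitude"] / cell).round().astype(int),
                t_cell=df["timestamp"].astype("int64") // (win * 10**9),
            )
            # Size of each (square, time window) group, aligned with the rows.
            sizes = tmp.groupby(["lat_cell", "lon_cell", "t_cell"])["download_speed"].transform("size")
            corr = df[sizes > 1]  # measurements with at least one neighbour
            g = corr.groupby("date")["download_speed"]
            feats[f"measures_{side}m_{win}s"] = g.size()
            feats[f"speed_{side}m_{win}s"] = g.mean()
    return feats.fillna(0)
```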

Once we have a better understanding of the dataset, we first try to detect anomalies by analyzing the dataset with a Principal Component Analysis followed by a cluster-based anomaly detection method called K-means clustering. The K-means clustering method helps us find days that show unusual patterns compared to the rest of the days by applying a threshold that determines whether there are any anomalies. We investigate every specific day that the method identifies as an anomaly. This thesis handles a special case of the K-means clustering method since we are only working with one cluster. We work with more than two dimensions, but we still apply the K-means algorithm pairwise in two dimensions to get a clearer view of the visualized results.
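A sketch of this basic detection step under our assumptions (the `scores` array as produced by the earlier PCA sketch, and the static threshold of 3.25 used in Section 5.4):

```python
# Sketch of the basic detection: for each pair of principal components, flag
# days whose 2D distance from the single cluster center exceeds a static threshold.
import itertools
import numpy as np

def flag_anomalies(scores: np.ndarray, threshold: float = 3.25):
    center = scores.mean(axis=0)  # the one "cluster" center in score space
    flagged = {}
    for a, b in itertools.combinations(range(scores.shape[1]), 2):
        dists = np.linalg.norm(scores[:, [a, b]] - center[[a, b]], axis=1)
        flagged[(a + 1, b + 1)] = np.where(dists > threshold)[0]
    return flagged  # day indices outside the circle, per component pair
```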

After this, we do a Principal Component Analysis followed by K-means clustering once more, only this time we try to find anomalies based on previous measurements. In other words, we look at the data from three consecutive months to help us detect anomalies in the data from the subsequent month. The three consecutive months act as a training set and the subsequent month acts as a testing set. Unlike the earlier analysis, this time we implement the K-means clustering method in three dimensions. The training set determines the threshold for what is considered an anomaly in the testing set. This is done by defining the radius of a sphere such that 99% of the data points in the training set are inside the sphere.
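A sketch of the training/testing split just described, under our assumptions (it is a reconstruction, not the code used in the thesis): PCA is fit on the three training months, the threshold becomes the radius that encloses 99% of the training days in score space, and any test-month day outside that radius is flagged.

```python
# Sketch of the sliding-window online detection described above.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def detect_month(train: pd.DataFrame, test: pd.DataFrame,
                 n_components: int = 3, coverage: float = 0.99):
    scaler = StandardScaler().fit(train.values)
    pca = PCA(n_components=n_components).fit(scaler.transform(train.values))
    train_scores = pca.transform(scaler.transform(train.values))
    test_scores = pca.transform(scaler.transform(test.values))
    # Radius of the sphere that contains `coverage` of the training days.
    center = train_scores.mean(axis=0)
    radius = np.quantile(np.linalg.norm(train_scores - center, axis=1), coverage)
    dists = np.linalg.norm(test_scores - center, axis=1)
    anomalies = list(test.index[dists > radius])
    return radius, anomalies
```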

5 Results

The following chapter shows the results from our analysis.

5.1 Initial tests

Figure 5.1(a) provides an initial overview of the number of measurements made on a daily basis. We have found that the mean value is 38,586 measurements per day. The graph shows that there are a few peaks where the number of measurements is somewhat abnormal compared to the mean value and the surrounding values. The largest number of measurements occurred on the 15th of December. This day had 70,199 measurements, which is about 82% more than average. Figure 5.1(b) visualizes the mean download speed for every day during the investigated year. One interesting observation is that when there is an unusually high number of measurements, the average download speed often has some kind of local minimum. The overall download speed also increases clearly during the year, which probably is due to constant technological development.


Figure 5.2: Number of measurements and mean speed for Telia users every day 2014

Figure 5.3: Number of measurements and mean speed for Tele2 users every day 2014

Figures 5.2-5.4 show the number of measurements and mean download speed for the top three operators (Telia, Tele2, and Telenor, respectively) in Sweden during 2014. The mean download speed looks similar for all three operators, with a slow increase in download speed during the whole year. The number of measurements has a more interesting behaviour. The graph that represents Telia seems to be relatively stable without large deviations except at the end of the year. When it comes to the number of measurements made by Tele2 and Telenor users, the results are much more inconsistent. The number of measurements made by Tele2 (Figure 5.3(a)) and Telenor (Figure 5.4(a)) users has similar shapes, but the deviations are somewhat greater and spread out across the whole year. This may partly be explained by smaller traffic volumes, but may also be due to more visible attacks against the Telia network in December 2014.

Figure 5.4: Number of measurements and mean speed for Telenor users every day 2014

Figure 5.5: 3D Plot of mean values from correlated measurements

In Figure 5.5 we show a sample from the results for the number of measurements and average download speed of correlated measurements, as seen when using different combinations of window size and bucket size. Here, we show the mean values during the whole year, providing an overview of what that data looks like. It is interesting to see that the number of correlated measurements increases very quickly when we increase the time window and bucket size just a little bit. We can also see that the average download speed is very low when the time window is set to 10 seconds, but that the bucket size does not affect the download speed very much.

These graphs and data appear later in this chapter when we present the Principal Component Analysis and which days are classified as anomalies.

5.2 Principal Component Analysis

To reduce the number of dimensions we performed a principal component analysis and dimension reduction to go from 26 dimensions to a more manageable number of dimensions. The dimensions we used are listed in Table 5.1, and how we obtained these dimensions is explained in the first paragraph of Section 4.2.

| Dimension          | Component 1 | Component 2 | Component 3 |
| Measures           | -0.080      | 0.815       | 0.481       |
| Mean speed         | 0.784       | 0.567       | -0.201      |
| Telia Measures     | 0.098       | 0.821       | 0.315       |
| Telia Speed        | 0.766       | 0.552       | -0.148      |
| Tele2 Measures     | -0.667      | 0.300       | 0.477       |
| Tele2 Speed        | 0.838       | 0.337       | -0.192      |
| Telenor Measures   | -0.643      | 0.107       | 0.582       |
| Telenor Speed      | 0.697       | 0.398       | -0.349      |
| Measures 10m 10s   | -0.497      | 0.454       | -0.499      |
| Speed 10m 10s      | 0.437       | -0.255      | 0.561       |
| Measures 10m 30s   | -0.596      | 0.624       | 0.032       |
| Speed 10m 30s      | 0.852       | 0.437       | 0.048       |
| Measures 10m 60s   | -0.552      | 0.702       | 0.386       |
| Speed 10m 60s      | 0.843       | 0.496       | -0.087      |
| Measures 50m 10s   | -0.521      | 0.516       | -0.564      |
| Speed 50m 10s      | 0.567       | -0.205      | 0.645       |
| Measures 50m 30s   | -0.611      | 0.658       | -0.060      |
| Speed 50m 30s      | 0.891       | 0.369       | 0.116       |
| Measures 50m 60s   | -0.520      | 0.751       | 0.362       |
| Speed 50m 60s      | 0.857       | 0.484       | -0.070      |
| Measures 100m 10s  | -0.522      | 0.531       | -0.553      |
| Speed 100m 10s     | 0.566       | -0.178      | 0.647       |
| Measures 100m 30s  | -0.608      | 0.672       | -0.084      |
| Speed 100m 30s     | 0.896       | 0.360       | 0.139       |
| Measures 100m 60s  | -0.505      | 0.767       | 0.357       |
| Speed 100m 60s     | 0.859       | 0.485       | -0.064      |

Table 5.1: Principal Component Matrix from PCA

Table 5.2 shows the total variance of principal components 1-10, followed by Figure 5.6, which shows a scree plot of the result. By looking at Table 5.2 we can see that the first three principal components correspond to 85.6% of the total variance, while adding a fourth principal component would only give us an additional 4.5%. In other words, we find it suitable to reduce the number of dimensions from 26 to 3. The rule of thumb usually referred to as "the elbow rule", which says that the principal components above the "elbow" in a scree plot (principal components 4-6 in our case) indicate the number of components to aim for, also confirms that we have chosen a legitimate number of principal components [1]. By looking at Figure 5.6 we can clearly see that there are three principal components placed above the "elbow".
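The same check can be made programmatically. Below is a small sketch, assuming a fitted scikit-learn PCA object with all components kept (our assumption, not the thesis's SPSS workflow), that prints the variance shares behind Table 5.2 and returns the component count reached by a cumulative-variance cut-off.

```python
# Sketch: inspect explained variance to pick the number of components,
# mirroring Table 5.2 and the scree plot. `pca` is assumed to be a fitted
# sklearn PCA object with all components kept.
import numpy as np

def summarize_variance(pca, target: float = 0.85):
    ratios = pca.explained_variance_ratio_
    cumulative = np.cumsum(ratios)
    for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
        print(f"PC{i}: {r:6.2%} of variance, cumulative {c:6.2%}")
    # Smallest number of components reaching the target share of variance
    # (three components in the thesis, at about 85.6%).
    return int(np.searchsorted(cumulative, target) + 1)
```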

5.3 Interpretation of principal components

The Principal Component Analysis returns principal components that are actually linear combinations in a rotated space [1]. A common misunderstanding is that a principal component is just one chosen column from the original matrix. Because of this it is important to try to get an overview of the contributions that the different principal components give. This can be a challenging task and is difficult to interpret perfectly, but it is necessary to at least give the results some thought to get a better understanding of the "new" dataset with fewer dimensions and less data.

| Component | Total  | % of Variance | Cumulative % |
| 1         | 11.325 | 43.556        | 43.556       |
| 2         | 7.322  | 28.162        | 71.718       |
| 3         | 3.613  | 13.896        | 85.613       |
| 4         | 1.179  | 4.536         | 90.150       |
| 5         | 0.759  | 2.920         | 93.070       |
| 6         | 0.374  | 1.438         | 94.508       |
| 7         | 0.309  | 1.187         | 95.695       |
| 8         | 0.283  | 1.087         | 96.782       |
| 9         | 0.177  | 0.681         | 97.463       |
| 10        | 0.157  | 0.602         | 98.065       |

Table 5.2: Eigenvalues, total variance and cumulative variance of principal components 1-10

Figure 5.6: Scree plot from the principal component analysis

The first thing we did was to look at the component matrix, which told us what contribution the different factors had to the three principal components. Table 5.1 shows that the first principal component seems to be speed related, with negative weights on the number of measurements. This can be seen in the first column, where most of the speed related factors are positive and relatively high, while the measurement related factors are often negative. The same argument can be applied to the third principal component, but with the result that this component seems to be related to the number of measurements. The second principal component is more difficult to interpret since it seems to be a mix of the two; the values in the matrix are similar between speed and measurement related factors.

Figure 5.7 provides a visual representation of how the different factors contribute to the different principal components. It might be difficult to get a good view and understanding of the data in three dimensions, but it is possible to distinguish three groups of dots in the figure that could be considered clusters. One cluster consists mostly of measurement related factors, whereas the other two consist mostly of speed related factors. Together with the table above, it provides an extra understanding of the principal components and their contributing factors.


Figure 5.7: Principal Component Plot from PCA

To help us evaluate the PCA further, we can calculate the principal component scores. As mentioned before, it might be difficult to get a clear picture of the data in a 3D plot. The principal component scores are calculated by first centering the original variables by subtracting the column means, and then multiplying the coefficients of the principal components that we chose above with the corresponding values in the original matrix [14]. By summing these values we obtained a new matrix with the principal component scores, which is presented later.
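As a sketch of the score computation just described (assuming the original data matrix and the loading matrix of Table 5.1 are available as NumPy arrays, which is our assumption for the example):

```python
# Sketch of the principal component score calculation described above:
# center the original variables, then multiply by the chosen loadings.
import numpy as np

def pc_scores(data: np.ndarray, loadings: np.ndarray) -> np.ndarray:
    """data: days x factors matrix; loadings: factors x components matrix."""
    centered = data - data.mean(axis=0)   # subtract the column means
    return centered @ loadings            # one score per day and component
```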

5.4 2-dimensional K-mean cluster

Figures 5.8-5.10 demonstrate the results from the 1-mean clustering algorithm when we applied a threshold of 3.25 (the circle). The first plot, which is structured by principal components 1 and 2, has three obvious dots outside the threshold circle. These represent the 3rd of January, the 26th of January and the 16th of November. Furthermore, the second plot, which is structured by principal components 1 and 3, has four dots outside the threshold. Two out of these four are the same as in the first plot: the 3rd of January and the 26th of January. The third and fourth dots represent the 25th of July and the 25th of December.

Figure 5.8: 2D plot with component 1 and 2 as axes

Finally, the third plot, which is structured by principal components 2 and 3, has four dots outside the circle. One of these is the 16th of November, which is the same as one of the dots in the first plot. The other three are the 14th of April, the 10th of December and the 15th of December. In total, this preliminary analysis identifies eight different days that are marked as anomalies. Three out of these eight days were placed outside the threshold in all three dimensions, and the others only in two dimensions. The anomalies are summarized in Table 5.3. Another interesting note is that every date that is considered an anomaly is outside the threshold of the third principal component, which is, as mentioned earlier, mostly related to the number of measurements. This means that every date that we have identified as an anomaly is somewhat abnormal when it comes to the number of measurements. Figure 5.11 shows the identified anomalies inserted into Figure 5.1. When looking at Figure 5.11, six out of eight of the abnormal dates have what could be considered a local maximum, or a spike. The two dates that are not local maxima are the 14th of April and the 25th of July. These dates both have less than the average number of measurements, with only 33,862 and 35,893 measurements respectively. Since the 25th of July has a negative value on the PC3 axis in Figure 5.9, this local minimum makes sense. The 14th, however, has a positive value on the PC3 axis in Figure 5.10, which was a bit surprising. One theory could be that it has a negative value on the PC2 axis, which affects the number of measurements to some extent. The other six dates are above average when it comes to the number of measurements, with the 15th of December as the highest measured number at 70,199.

In Figure 5.1(a) we can also see that the 3rd of January and the 26th of January are local minima when it comes to mean speed. This makes sense since both dates are outside the threshold in all three dimensions. This could also be interesting since those dates have a relatively high number of measurements. The 25th of July and the 25th of December could also be considered local minima, which seems reasonable since these dates are outside the threshold in the first principal component, which is, as mentioned earlier, mostly related to the mean speed. The last date that is outside the threshold of principal component 1 is the 16th of November. The mean speed of this date is not a clear minimum or maximum, which is a little unexpected considering that the 16th of November is also outside the threshold circle in all three dimensions.

Another anomaly date that does not have a clear minimum or maximum is the 14th of April. This is less surprising since the earlier observations concerning this date were not what we expected. Lastly, the two remaining dates, the 10th and the 15th of December, both have a local maximum. These two dates have the highest number of measurements as well as the highest mean speed of the eight anomaly dates.

5.5 Proof of concept of online algorithm

Figure 5.9: 2D plot with component 1 and 3 as axes

Figure 5.10: 2D plot with component 2 and 3 as axes

| Date   | PC1 | PC2 | PC3 |
| 3 Jan  | X   | X   | X   |
| 26 Jan | X   | X   | X   |
| 14 Apr | -   | X   | X   |
| 25 Jul | X   | -   | X   |
| 16 Nov | X   | X   | X   |
| 10 Dec | -   | X   | X   |
| 15 Dec | -   | X   | X   |
| 25 Dec | X   | -   | X   |

Table 5.3: Table of anomalies and principal components

To get more reliable and trustworthy results we decided to implement a semi-supervised anomaly detection method. This approach is similar to what we did in the initial tests, with a principal component analysis followed by applying a threshold to detect potential anomalies. The big difference here is that we use training data from the three previous months to let the computer decide what a reasonable threshold for the next month should be. By doing this, the threshold is set according to related measurements made recently and does not depend on measurements made over a long time period. By looking at the rightmost graph in Figure 5.11 we can see that the average download speed increased a lot from January to December. One of the advantages of training data from a close time period is that such differences will not affect the result of the analysis like they did in the initial test section.

As before, we calculated the principal component scores and plotted them in a scatter plot. The principal component analysis returns principal components that are normalized and in a rotated plane, which means that our confidence region resulted in a sphere whose radius is our threshold. We decided that a confidence region of 99% should give us an interesting result, since that value would mark neither too many nor too few days as anomalies. In other words, we decided our thresholds by analyzing when 99% of the measurements were inside the confidence region. Table 5.4 shows which thresholds were calculated and which days were marked as anomalies. In total this method, with these values, identified seven different days, of which only two (the 16th of November and the 15th of December) were identified in the initial test section.

Table 5.4 shows that the months with relatively low threshold values have detected anomalies, except for May, which has the fourth lowest threshold. The lowest threshold calculated is for June. This resulted in two detected anomalies that month: the 22nd and the 30th. Figure 5.12 shows that the 30th of June has a local maximum in both the number of measurements and the mean speed compared to the period March-May. The 22nd, on the other hand, has a smaller peak in the measurement graph but a local minimum in the mean speed graph. These distinct ups and downs explain why the two dates are classified as anomalies. November, which has the third lowest threshold, has, beyond the 16th, one more anomaly. This is the 11th, which has a small peak in the number of measurements but neither a clear minimum nor maximum in the mean speed graph. It is therefore somewhat odd that this date was outside the threshold, but the lack of dates with higher measurement or mean speed values in the training data could explain this to some degree.

Table 5.4 also shows that April has the second lowest threshold. This month only has one anomaly, which was the 18th. When looking at Figure 5.12, the 18th has a relatively high number of measurements compared to February and March, but January clearly has many higher measurement values. The 18th of April also has a very low mean speed value compared to February and March, but once again, all of January has lower values, which is hard to explain.


| Month | Training | PC1    | PC2    | PC3    | Total variance | Threshold | Anomalies      |
| Jan   | -        | -      | -      | -      | -              | -         | -              |
| Feb   | -        | -      | -      | -      | -              | -         | -              |
| Mar   | -        | -      | -      | -      | -              | -         | -              |
| Apr   | Jan-Mar  | 59.34% | 17.68% | 9.33%  | 86.4%          | 3.20      | 18 Apr         |
| May   | Feb-Apr  | 48.34% | 21.74% | 10.71% | 80.8%          | 3.44      | -              |
| Jun   | Mar-May  | 53.67% | 16.84% | 9.57%  | 80.0%          | 3.19      | 22 Jun, 30 Jun |
| Jul   | Apr-Jun  | 36.26% | 23.27% | 12.01% | 71.5%          | 3.73      | -              |
| Aug   | May-Jul  | 46.65% | 25.75% | 8.13%  | 80.5%          | 4.03      | -              |
| Sep   | Jun-Aug  | 47.42% | 22.18% | 13.57% | 83.2%          | 3.66      | -              |
| Oct   | Jul-Sep  | 57.75% | 14.43% | 12.07% | 84.3%          | 3.49      | 18 Oct         |
| Nov   | Aug-Oct  | 46.50% | 27.87% | 8.50%  | 82.9%          | 3.37      | 11 Nov, 16 Nov |
| Dec   | Sep-Nov  | 42.03% | 27.40% | 10.66% | 80.1%          | 3.50      | 15 Dec         |

Table 5.4: Summary of results from online algorithm

Figure 5.12: Visualization of the days identified as anomalies in online algorithm

The last anomaly is the 18th of October, with a threshold of 3.49, which makes it the fifth lowest. This date has a high measurement value compared to the training data, and Figure 5.12 shows that only the start of July has some higher values. The mean speed is also relatively high compared to the training data, and the only higher values occurred in September.

As mentioned earlier, the two dates that were identified in both tests are the 16th of November and the 15th of December. When looking at Figure 5.12, these two dates are clearly local maxima in both the number of measurements and the mean speed compared to the training data.

6 Discussion

We think that the approach based on the principal component analysis was an interesting way of tackling the problem of analyzing big datasets for detecting network failures and network degradation. In the basic algorithm we only lost about 14% of the total variance, and in the online algorithm we lost between 14-29% depending on which time period was used for training. Since we are working with a lot of different factors in the analysis, we thought that losing such a small amount of variance would not be critical and that we should get good and interesting results from the PCA.

If we look at Table 5.3 we can see that five out of eight days that were classified as anomalies only violated the threshold in two dimensions with the basic algorithm. One interesting thing is that the 15th and the 10th of December were among them. These dates are the only well known dates on which we know that an actual attack took place [11]. This time it was an attack against EA Sports' servers, which affected Telia's servers as well³,⁴,⁵. Figure 5.11 clearly shows that there was a huge increase in measurements on these days, which was one of the reasons these days were classified as anomalies. Since these days are the most interesting days during this year (as far as we know), we thought that they should violate the threshold in all three dimensions, which unfortunately did not happen. However, we are still satisfied that they were marked as anomalies, since those are the only days about which we have information of an actual attack.

It is interesting to compare the days identified by the basic method and by the online algorithm. The two methods identified mostly different days, except for the two days in November and December. Even though the results varied, we can clearly see that the days identified by the online algorithm make more sense and are easier to explain. In the initial test section there were some days marked as anomalies that were difficult to motivate (e.g. the 14th of April and the 25th of July). When looking at Figure 5.12 and Figure 5.11 and comparing the results between the online and basic algorithms, it is important to remember that the online algorithm detects contextual anomalies based on the training period. This means that the anomalies identified by the online algorithm may sometimes be intuitively difficult to understand when compared to other dates in the future or in the past.

³ http://www.dn.se/ekonomi/hackergruppen-som-sankte-telia/
⁴ http://www.idg.se/2.1085/1.603231/kapade-routrar-bakom-odleattacken
⁵ http://www.svt.se/nyheter/inrikes/hackergrupp-tar-pa-sig-telia-attack


When looking at Figure 5.12 we find it strange that the 1st of July was not identified, since that date has an extremely large peak in the number of measurements. One possible explanation could be that the principal component with the lowest percentage of variance contributes mainly to the number of measurements. In other words, a big increase in measurements will not affect the results as much as a small change in, for example, average speed will. When we tried to manually lower the threshold to see which days would be marked as anomalies, the 1st of July was the first day to be detected, but we had to lower the threshold by about 0.5. This means that the 1st of July was actually not even close to being identified as an anomaly.

Since we are doing a Principal Component Analysis and dimension reduction, there is no actual limit to how many factors can be taken into account. In this thesis we used the average mean speed, the number of measurements, the mean speed of correlated measurements and the number of correlated measurements. These are actually four main factors, but we let the boundaries for the time window and bucket size of the correlated measurements vary to be sure to get the values that contributed the most. We could (and maybe should) have thought about and selected more interesting factors to be part of the analysis.

We found it really difficult to find information about actual attacks or network malfunctions. This made it difficult for us to evaluate the results, and we cannot know for sure whether there were any attacks on the days that we identified as anomalies. The best we can do is look at the results from our processed data and analysis and see whether these days show any unusual behavior. Fortunately, we could see that almost every day that was classified as an anomaly was a local (and sometimes global) maximum in the number of measurements, and sometimes the same day was also a local minimum in mean download speed.

We performed a Principal Component Analysis, which we found provided really good insights into the results. We have discussed the correlated measurements and found it really difficult to decide which bucket and time window sizes would be suitable. By doing a PCA we could use a lot of different combinations and let the software take care of that problem for us. This helped us determine which sizes contributed the most while still taking all the different variations into account. We then applied a 1-mean clustering algorithm to our data with a threshold to identify anomalies. Since this is a special case of the K-means algorithm, it simply becomes a scatter plot with a threshold. Our initial thought was to see if we could find anything interesting with multiple clusters. When we realised that it did not provide anything interesting, we decided to stick to a single cluster.

We limited the time span of the dataset to measurements performed during 2014 to avoid any big technological differences. But if we look at Figure 5.1 we can see that the average download speed increased from about 12 Mb/s up to 20 Mb/s during this year alone. This makes it more difficult to interpret the results for mean download speed and may have affected the results from the PCA negatively. To get a better result, maybe we should have processed that data in some way before actually using it in our thesis, especially in the initial tests section. The online detection algorithm is not affected by this to the same extent, since we calculate the threshold from the three previous months. This is a big advantage of the online detection algorithm.

During this thesis we have been working with a dataset from Bredbandskollen.se that includes the locations where people were while doing the measurements. This kind of data could potentially be misused. We have therefore treated the data with respect and never shown the data to anyone other than our thesis supervisor. We also only present aggregate results that do not include any personal information.


We do not think that the basic algorithm, based on results from speed tests, can be used to distinguish network failures or network degradations from other outlier events. While it can be used to get a hint of when a problem has occurred, we find that it is not effective and precise enough to distinguish different events on its own. However, we do think that it can be a good complement to other techniques. The online algorithm is definitely better and more efficient, but it is still difficult to conclude how precise and efficient the method is when we have no "answers" to compare with. We definitely think this approach could be used as a complement to other techniques as well.

7 Conclusion

In this thesis we have performed dimension reduction with PCA and applied both an unsupervised and a semi-supervised anomaly detection method, based on K-means clustering, on a large dataset from Bredbandskollen. The unsupervised method simply set a threshold that was deemed reasonable, while the semi-supervised method based the threshold on training data. The purpose was to investigate whether this is an effective way to detect network failures and network degradations. We think this method and way of approaching the problem was interesting and provided some good insights, but it is not good and precise enough to be used as a stand-alone method to detect problems in a network. We had very little documentation of actual attacks or network failures, which made it difficult for us to interpret the results and conclude whether the days we identified were correctly identified or not. Even though we faced some difficulties, we can safely say that this approach provides some good insights into a dataset like this and could be used as a complement to other techniques.

7.1 Future work

Our implementation and results could be used to detect days on which attacks or network failures may have occurred. One interesting direction for future work could be to investigate these days more thoroughly, with the goal of identifying which operator and which area were affected. It could also be interesting to run a method like this live, to see whether it is possible to detect network problems immediately as they happen. It would also be interesting and useful to get logs or additional data from the operators to see how and when the network problems took place.

Our implementation of PCA and the K-means algorithm is just one of many ways to approach this problem. It could be interesting to see what results another approach applied to the same dataset would return. A comparison and discussion of the differences could then be really interesting. We had very little knowledge of this subject when we started working on this thesis, which meant that the choice of methods and algorithms was based on only a little research, so there may well be more suitable ways to approach this kind of problem.


Bibliography

[1] Hervé Abdi and Lynne J. Williams. "Principal component analysis". In: Wiley Interdisciplinary Reviews: Computational Statistics 2 (2010), pp. 433–459.

[2] Niklas Carlsson, Tova Linder, Pontus Persson, Jakob Danielsson, and Anton Forsberg. "On Using Crowd-sourced Network Measurements for Performance Prediction". In: Proc. IEEE/IFIP Wireless On-demand Network Systems and Services Conference (IEEE/IFIP WONS) (2016).

[3] Rahul Hiran, Niklas Carlsson, and Nahid Shahmehri. "Crowd-based Detection of Routing Anomalies on the Internet". In: Proc. IEEE Conference on Communications and Network Security (IEEE CNS) (Sept. 2015).

[4] Varun Chandola, Arindam Banerjee, and Vipin Kumar. "Anomaly Detection: A Survey". In: ACM Comput. Surv. 41.3 (July 2009), 15:1–15:58.

[5] Amol Ghoting, Srinivasan Parthasarathy, and Matthew Eric Otey. "Fast Mining of Distance-based Outliers in High-dimensional Datasets". In: Data Min. Knowl. Discov. 16.3 (June 2008), pp. 349–364.

[6] Steven M. Holland. "Principal Components Analysis (PCA)". In: Online tutorial (2002).

[7] Marina Thottan, Guanglei Liu, and Chuanyi Ji. Anomaly Detection Approaches for Communication Networks. Springer London, 2010.

[8] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. "An Efficient k-Means Clustering Algorithm: Analysis and Implementation". In: IEEE Trans. Pattern Anal. Mach. Intell. 24.7 (July 2002), pp. 881–892. ISSN: 0162-8828.

[9] Lund Research Ltd. "Principal Components Analysis (PCA) using SPSS Statistics". 2013 (accessed May 3, 2016). URL: https://statistics.laerd.com/spss-tutorials/principal-components-analysis-pca-using-spss-statistics.php.

[10] Andrew McCallum, Kamal Nigam, and Lyle H. Ungar. "Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching". In: Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2000.

[11] Marcus Odlander and Karl Andersson. Detecting a Distributed Denial-of-Service Attack Using Speed Test Data: A Case Study on an Attack with Nationwide Impact. Bachelor Thesis,

[12] Animesh Patcha and Jung-Min Park. "An Overview of Anomaly Detection Techniques: Existing Solutions and Latest Technological Trends". In: Comput. Netw. 51.12 (Aug. 2007), pp. 3448–3470.

[13] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. "An Efficient K-Means Clustering Algorithm". In: Proc. Workshop on High Performance Data Mining. 1998.

[14] Lindsay I. Smith. "A tutorial on Principal Components Analysis". In: Tech Report, Cornell University (2002).

[15] Graham J. Williams and Zhexue Huang. "Modelling the KDD Process". In: Tech Report, CSIRO (1996).

[16] Martin Arlitt, Niklas Carlsson, Carey Williamson, and Jerry Rolia. "Passive crowd-based monitoring of World Wide Web infrastructure and its performance". In: Proc. IEEE International Conference on Communications (ICC) (June 2012).

[17] Archana Singh, Avantika Yadav, and Ajay Rana. "Article: K-means with Three different Distance Metrics". In: International Journal of Computer Applications 67.10 (Apr. 2013), pp. 13–17.



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
