
Bachelor of Science in Computer Science, May 2019

Anomaly Detection in an e-Transaction System using Data Driven Machine Learning Models

An unsupervised learning approach in time-series data

Albin Ekholm
Adnan Avdic

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Authors:
Albin Ekholm, E-mail: alek16@student.bth.se
Adnan Avdic, E-mail: adav16@student.bth.se

University advisor:
Dr. Martin Boldt, Department of Computer Science and Engineering

Faculty of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden
Internet: www.bth.se | Phone: +46 455 38 50 00 | Fax: +46 455 38 50 57

Abstract

Background. Detecting anomalies in time-series data is a task that can be done with the help of data driven machine learning models. This thesis investigates if, and how well, different machine learning models, with an unsupervised approach, can detect anomalies in the e-Transaction system Ericsson Wallet Platform. The anomalies in our domain context are delays in the system.

Objectives. The objective of this thesis is to compare four different machine learning models in order to find the most relevant one. The best performing models are decided by the evaluation metric F1 score. An intersection of the best models is also evaluated, in order to decrease the number of false positives and thereby make the model more precise.

Methods. A relevant time-series data sample with 10-minute interval data points from the Ericsson Wallet Platform was investigated. A number of steps were taken, such as handling the data, pre-processing, normalization, training and evaluation. Two relevant features were trained separately as one-dimensional data sets. The two features used in this thesis, relevant for finding delays in the system, are Mean wait (ms) and Mean * N, where N is the number of calls to the system. The evaluation metrics used were True positives, True negatives, False positives, False negatives, Accuracy, Precision, Recall, F1 score and the Jaccard index. The Jaccard index is a metric which reveals how similar the algorithms are in their detections. The detection is binary: each data point in the time-series data is classified as anomalous or not.

Results. The results reveal the two best performing models with regard to the F1 score. The intersection evaluation reveals if, and how well, a combination of the two best performing models can reduce the number of false positives.

Conclusions. The conclusion of this work is that some algorithms perform better than others. It is a proof of concept that such classification algorithms can separate normal from non-normal behavior in the domain of the Ericsson Wallet Platform.

Keywords: Anomaly Detection, e-Transaction System, Machine Learning, Unsupervised Learning.


Acknowledgments

We would like to thank Maria Larsson, Johan Piculell and Dani Manih at Ericsson for the opportunity and for providing us with information, data, hardware and knowledge to use in our study. A special thanks goes out to our supervisor Dr. Martin Boldt, at BTH, for supporting us throughout our thesis.

Lastly, we would like to thank Joakim Thiel at Blue Science Park in Karlskrona for helping us get in contact with Ericsson; without him this work would not have been possible.


Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Motivation
    1.1.1 Pre-understanding
    1.1.2 The anomalies in the Ericsson Wallet Platform (EWP)
    1.1.3 Aims and objectives
    1.1.4 Research questions
2 Background
  2.1 Machine learning
  2.2 Algorithms
    2.2.1 K-means
    2.2.2 Isolation forest
    2.2.3 One-class support vector machine
    2.2.4 Elliptic envelope
3 Related Work
4 Method
  4.1 Data
    4.1.1 Features
    4.1.2 Pre-processing
  4.2 Selected learning algorithms
  4.3 Producing the results
  4.4 Evaluation
    4.4.1 Detection metrics
    4.4.2 Evaluation metrics
    4.4.3 Jaccard index
  4.5 System specifications
5 Results
  5.1 Abbreviations on the tables and figures
  5.2 Algorithm comparison
  5.3 Intersection
6 Analysis and Discussion
7 Conclusions and Future Work
References
A Supplemental Information

Chapter 1

Introduction

Our thesis is inspired by the growing interest in Artificial Intelligence and its subcategory Machine Learning. The field of Machine Learning has grown a lot in the past years [2] because of the large amounts of data produced in people's everyday lives, and the growing power of hardware. Our interest in Machine Learning began when we worked on a project that could identify individual whales just by looking at images of the shape of their tails. This was when we got an understanding of how powerful an Artificial Neural Network (ANN) [6] can be.

We also got the understanding that it is quite easy to create a Machine Learning model when using today’s modern development tools, unlike just a couple of years ago when everything had to be developed from scratch.

In this thesis we experiment on time-series data and on how to find its underlying structure and patterns in order to detect outliers.

1.1 Motivation

1.1.1 Pre-understanding

In order to dig deeper into the world of time-series data, and the anomalies that produce a lot of issues, it is hard not to mention some technical details. We try not to dig too deep, but some introductory explanation is required. Industries have been tackling anomalies for a long time: there are a lot of systems out there that produce anomalies, which could eventually result in a collapse of the system or delays in some kind of operations.

Anomaly detection can be explained as an analyst or observer looking for rare patterns that may occur. These rare events can be associated with fraud, intrusions, medical issues, delays, malfunctioning equipment, etc. [13].

1.1.2 The anomalies in the Ericsson Wallet Platform (EWP)

The EWP is an electronic wallet platform for consumers, mostly in Africa. There are a lot of people in Africa who do not own a bank account, although many own a mobile phone. This is where the EWP can help. The platform handles all sorts of requests through a simple text-based front-end, so people without a smartphone can also use the EWP. The end user can transfer

money easily with their phones, just as if they were paying with cash. This platform handles millions of users, which of course produces a lot of data traffic in and out of the EWP. This is where the issue begins.

The domain experts at the EWP today handle and find the anomalies manually by searching through the logs. This work takes a lot of effort and is very time-consuming. Therefore, the domain experts at Ericsson are looking not necessarily for a smarter way of detecting the anomalies, but for a faster way to detect them.

1.1.3 Aims and objectives

The goal of this thesis is to construct a relevant machine learning model as a proof of concept that it is possible to help the domain experts in their day-to-day operations of separating normal behavior from non-normal. To achieve this aim an experiment will be conducted. The course of action includes picking relevant algorithms, pre-processing the data, training the algorithms to produce models and then evaluating each model separately. The evaluation is crucial when picking the relevant models. An intersection between the two most promising models will be performed in order to investigate if combining both algorithms' decisions can increase the quality of the detections.

1.1.4 Research questions

• RQ:1: Which learning algorithm produces the most accurate model with regards to F1 score for detecting anomalies in the EWP platform?

• RQ:2: To what extent can the two most accurate models from RQ:1 be combined in order to decrease the number of false positives?


Chapter 2

Background

2.1 Machine learning

The common way anomalies are tackled today is by an observer who looks for strange behaviors in a system, or by a statistical approach where correlations in data streams are observed for unexpected values [11]. Now there are smarter ways to automate this process: the subject is being modernized and machine learning has come along.

Supervised Learning: A supervised approach is the most common way of training a model. A supervised training method learns from past data to build the model: it builds its model around a labeled data set. The data set contains labels, which have mostly been produced by humans, and every entry in the data set is classified. In other words, the model is being built with a supervisor.

Unsupervised Learning: An unsupervised approach is a bit more uncommon in the machine learning community. Because the data is unlabeled, it is not possible to classify a data point beforehand [17]. In other words, there is no supervision in the learning process; instead, the structure to learn is found in the data itself. Therefore it is hard to evaluate how well the model is performing, because there are no answers to test against. There are, however, some fields where an unsupervised approach can perform very well: clustering, anomaly detection, association mining and latent variable models [1].

2.2 Algorithms

During our experiment we investigated and tested four algorithms that have shown great results in time-series anomaly detection. The unsupervised approach was the most suitable for our experiment because of the unlabeled data to learn from, and especially for time-series analysis.

2.2.1 K-means

K-means is a clustering algorithm that is very popular for unsupervised learning. The principle of the algorithm is very simple: it looks for similar data points in the data set and groups them together. Data points which do not belong to any cluster are then possibly considered outliers (anomalies). The algorithm finds K clusters in a data set. Defining the number K is necessary in order for the algorithm to perform well on the given data set. Every cluster has one centroid which represents the middle of the cluster; it is defined as the real location of the cluster. Every data point is then allocated to one of the clusters [3].
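To make the clustering idea concrete, below is a minimal sketch of distance-based outlier scoring with scikit-learn's KMeans. This is not the thesis' exact implementation: the synthetic data, the single-cluster choice and the quantile thresholding are illustrative assumptions (the thesis only reports that a cluster size of 1 scored best and that an outlier fraction controls how many points are flagged).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for a one-dimensional "Mean wait (ms)" series,
# with a handful of injected delays.
data = np.concatenate([rng.normal(50, 5, 980), rng.normal(300, 30, 20)])
X = data.reshape(-1, 1)

kmeans = KMeans(n_clusters=1, n_init=10).fit(X)
# Distance from each point to its assigned centroid.
distances = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the farthest outlier_fraction of points as anomalies.
outlier_fraction = 0.035
threshold = np.quantile(distances, 1 - outlier_fraction)
anomalies = distances > threshold
print(f"Flagged {anomalies.sum()} of {len(X)} points")
```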

Figure 2.1: Visualization of data distributed with cluster sizes 2 and 5

Elbow-method

The elbow method is used to interpret and validate the consistency within cluster analysis, and is helpful when deciding how many clusters should be used.

This method is typically used in combination with the K-means algorithm.

The idea is to choose the optimal number of clusters, such that adding another cluster would not improve the results significantly.
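A compact sketch of the elbow method, under the assumption that the within-cluster sum of squares (scikit-learn's inertia_) is the quantity being inspected; one plots these values over k and picks the k where the curve bends:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X: np.ndarray, k_max: int = 10) -> list:
    """Within-cluster sum of squares (inertia) for k = 1..k_max."""
    return [KMeans(n_clusters=k, n_init=10).fit(X).inertia_
            for k in range(1, k_max + 1)]

# Usage: inspect where the decrease flattens out ("the elbow").
X = np.random.default_rng(0).normal(size=(200, 1))
print(elbow_curve(X, k_max=5))
```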

2.2.2 Isolation forest

Isolation forest is built upon decision trees. A decision tree is a set of questions that eventually gives an answer for each unit in the data [4]. E.g. is this individual old enough to buy alcohol? If the person is above the age of 20, then yes, the individual can buy alcohol.

The key thing about Isolation forest is that it does not calculate any distances, as the K-means algorithm does, to check for outliers in the data. Instead, Isolation forest tries to isolate each data point and looks for data points that are unusual compared to the rest. As mentioned, it uses the decision tree as a base estimator in the process: it builds randomized decision trees and each data point is scored by how quickly it becomes isolated when parsing through the trees. Data points that remain hard to isolate are considered normal. The anomaly score is shown below.

\[ s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \]

Here h(x) is the path length of the data point, E(h(x)) is its expected value over the trees, c(n) is the average path length and n is the total number of nodes. If a data point is given a score close to 1, it is viewed as an anomaly; a score of 0.5 or lower means that the data point is considered normal.

Figure 2.2: Example of Isolation Forest graphical visualization
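In scikit-learn, which the thesis uses, this scoring is wrapped in the IsolationForest estimator. A minimal sketch on synthetic data follows; the contamination parameter is assumed here to play the role of the thesis' outlier fraction.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 980), rng.normal(300, 30, 20)])
X = data.reshape(-1, 1)

model = IsolationForest(contamination=0.035, random_state=0).fit(X)
labels = model.predict(X)        # +1 = normal, -1 = anomaly
scores = model.score_samples(X)  # lower = more anomalous
print(f"Detected {(labels == -1).sum()} anomalies")
```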

2.2.3 One-class support vector machine

The traditional Support Vector Machine (SVM) is a two-class regression or classification algorithm. It is a binary classifier which separates the two classes with a hyperplane (called a hyperplane although it can be considered a line in a two-dimensional space). The hyperplane is basically a wide space between the two classes. In order to find the right hyperplane, it is placed so that the distance between the hyperplane and the nearest data points is maximized. The data points are then considered assigned to each class.

SVMs are supervised learning models suited to analyze data and recognize patterns, and are typically used in supervised approaches. The One-class support vector machine variant, however, is applicable when using an unsupervised approach.

In a One-class SVM, the model is trained on data with only one class, which can be referred to as the "normal" class. It learns the properties and patterns of normal behavior, and from this knowledge it can detect data points that deviate from the normal behavior. One-class SVMs are mostly used for finding outliers, which makes them great for finding anomalies.

Figure 2.3: Example of one class SVM graphical visualization
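A minimal sketch of scikit-learn's OneClassSVM on synthetic data; the nu parameter, an upper bound on the fraction of training points treated as outliers, is assumed here to stand in for the thesis' outlier fraction.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 980), rng.normal(300, 30, 20)])
X = data.reshape(-1, 1)

model = OneClassSVM(kernel="rbf", nu=0.035, gamma="scale").fit(X)
labels = model.predict(X)  # +1 = normal, -1 = anomaly
print(f"Detected {(labels == -1).sum()} anomalies")
```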

2.2.4 Elliptic envelope

Elliptic envelope is an algorithm, usable in both supervised and unsupervised settings, where data is assumed to be distributed across an ellipse, hence the name elliptic envelope. The Elliptic Envelope routine models the data as a high-dimensional Gaussian distribution with possible covariances between feature dimensions. In short, it attempts to find a boundary ellipse that contains most of the normally distributed data. Any data outside of the ellipse is considered, by the algorithm, to be an anomaly. The size and shape of the ellipse are dependent on the data set, as well as on the choice of outlier fraction.
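A minimal sketch with scikit-learn's EllipticEnvelope; as with the other estimators, contamination is assumed to correspond to the outlier fraction discussed above.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 980), rng.normal(300, 30, 20)])
X = data.reshape(-1, 1)

model = EllipticEnvelope(contamination=0.035).fit(X)
labels = model.predict(X)  # +1 = inside the ellipse, -1 = anomaly
print(f"Detected {(labels == -1).sum()} anomalies")
```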


Chapter 3

Related Work

Work related to this thesis, and that to some extent inspired our work, can be credited to Martin Boldt et al. [24], who performed anomaly detection in video-on-demand (VoD) streams. They identified event sequences that do not conform to normal behavior. They discuss that it is important for VoD providers to keep a high quality of service, which can be helped by identifying anomalies.

The techniques applied to identify the anomalies were Markov models, which are based on probabilistic methods. The work of Martin Boldt et al. [24] was evaluated in a way similar to ours. Their data is sequence based, which is similar to time-series; the data is divided into sub data sets from an original data set, called temporal resolutions. The temporal resolutions are of different sizes: one hour, one day and three days. Each temporal resolution was further divided into 10 data sets, giving a total of 30 data sets to apply the Markov models on. The temporal resolutions were then evaluated by investigating the overlaps between the data sets. The results showed that the approach can identify anomaly patterns fairly well. The best result came from the one-day resolution, although the overall scores showed that anomalies can be detected at all the temporal resolutions.

In work published by O. Carlsson et al. [22], machine learning is used to detect Advanced Persistent Threats (APT), which are usually caused by individuals trying to access the system. Their conclusion is that it is possible, with the aid of machine learning algorithms, to detect APTs. To be more specific, in their case it was K-nearest neighbor, Random forest and Support vector machine that achieved the highest scores. Their proposed method, however, uses a supervised approach.

The authors Blomquist et al. [21] propose a method to detect incorrect data inside Sida's statistics about their contributions, where contributions are the financial support a project receives. Based on a supervised approach, the goal is to determine whether a contribution is at risk of being inaccurate. Their results showed that the algorithm Adaboost had the best result overall. However, an examination of the ten most suspected anomalies showed that half of them were actually incorrectly coded. Their conclusion is that the method can still be used for quality assurance, and it could be efficient since it can provide a set of more probably anomalous components for human inspection.

The authors Lopez-Rojas et al. [23] aim to analyze the implications of using machine learning techniques for money laundering detection (also known as Anti-Money Laundering, AML) in a data set consisting of synthetic financial transactions. Their method thus relies on synthetic data, which carries a risk of generating data that does not mimic real world data. This can lead to biased results if the data is not correctly mimicked; if mimicked right, the results can be very useful.

Monitoring YouTube traffic to assess stream quality is work done by the author Abiy Biru [20]. The time-series detection of re-buffering and bitrate re-adaptation events is similar to our outlier detection approach. The author used four different learning algorithms in order to find the most relevant machine learning model. The best performing algorithm was Random forest, with F1 scores of 67 and 70 percent.

The authors Osekowska et al. [25] propose that machine learning could be used to detect non-normal behaviour in maritime vessel traffic. By using historical data from all vessels throughout the years, it is possible to establish normal routes, speeds, directions, etc. This establishes normal behaviour, from which one can detect anomalies, i.e. behaviour that differs from the standard behaviour.


Chapter 4

Method

4.1 Data

The data being gathered consists of logs from the different subsystems in the EWP. For the purpose of this thesis only the subsystem M3 is analyzed, because of the extremely large amount of data that every subsystem holds; the method and principle are, however, the same for every subsystem. The logs are statistics of all the interactions between the subsystems. Every subsystem produces wait-logs, which basically record how long an interaction took. The log time interval varies from 5 to 10 minutes; the logs that we received were in 10-minute intervals.

Inside the logs there are four main traffic measurements: incoming, internal, lock and outgoing. E.g. one log may represent the traffic inside a subsystem, in this example M3. M3 is the central processing subsystem which handles all the calculations of the whole system: transactions between customers, queries to the database, fetching balances and so on. M3 receives an incoming request, often a plain HTTP request originating from the user, indirectly or directly through another subsystem, often the front-end. This request may be a transaction between customers. The internal measurement in the log then covers the calculation of the transaction. Lastly, the outgoing traffic from M3 is in this case a query to the database to update the balance of both customers. Incoming, internal, lock and outgoing all have the same features. There are hundreds of unique tasks in the EWP that can be requested by the end user, such as a balance check or an actual transaction.

Every data set is collected in a Pandas data frame. The Pandas data frame is a common tool for handling time-series data and is very effective. The start date time is set as the index value, which is beneficial when working with time-series data; each index value is a data point in the data set covering a 10-minute interval. The data used to train and test the algorithms, in order to create the different models, comes from real data generated by the EWP platform.
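A sketch of loading such logs into a Pandas data frame with the start time as index; the file name is hypothetical, since the real EWP log files are not spelled out here, and the column names are taken from the feature list in the next subsection.

```python
import pandas as pd

# Hypothetical file name for the M3 wait-logs.
df = pd.read_csv("m3_wait_logs.csv", parse_dates=["StartDateTime"])
df = df.set_index("StartDateTime").sort_index()

# A DatetimeIndex gives convenient time-based slicing and resampling.
april = df.loc["2019-04"]                  # one month of 10-minute data points
hourly = april["Mean wait (ms)"].resample("1H").mean()
```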

4.1.1 Features

The data features:

• StartDateTime
• EndDateTime
• Task
• Success rate %
• Number of failures
• Number of errors
• Min wait (ms)
• Max wait (ms)
• Mean wait (ms)
• Tps (Transactions per second)
• Number of calls (N)
• Mean * N

In a more traditional anomaly detection scenario where anomalous behavior is to be identified, there may be one or several other features that directly or indirectly affect the measured behavior; e.g. a price may be affected by the seasons or by popularity. This is crucial to know when building a model whose job is to detect anomalous behavior, otherwise the model may flag data points as anomalies when they actually represent normal behavior.

This case is built around the Mean wait, which is the most suitable feature for finding delays in the system; the delays are the anomalies in this case. The Mean wait should not be affected by any of the other features listed, and it is therefore our chosen feature to build our models on. However, if anything puts a load on the Mean wait it would be the Number of calls: as the Number of calls increases, the Mean wait could also increase. Therefore there is one other feature that the models were trained on: Mean * N. The Mean * N is interesting since it combines the Mean wait and the Number of calls, taking into account whether the Number of calls affects the Mean wait. It is therefore interesting to investigate how well this feature matches the confirmed anomalies in the evaluation as well. This will indicate whether the Number of calls puts a load on the wait.

Chosen features:

• Mean wait (ms)
• Mean * N


4.1.2 Pre-processing

Before training each model, some pre-processing stages were needed. Since our models are trained separately on one feature (one dimension), the data array needs to be reshaped before training. Numpy offers a reshape function, e.g. data.values.reshape(-1,1), which reshapes the array into a single column as expected by the learning algorithms [19]. This was used before training every algorithm.

Normalization is crucial for every learning model. The library scikit-learn offers a standard scaler, StandardScaler(), that we used to prepare the data. The standardization transforms the data so that the data set has a mean value of 0 and a standard deviation of 1 [15]. This matters for the learning process: if a value is not on the same scale as the others, the model can interpret the other values differently and learn in an incorrect way.
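Put together, the two pre-processing steps look roughly like this (a sketch; the series values are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the "Mean wait (ms)" column of the log data frame.
series = pd.Series([48.0, 51.2, 49.7, 310.5, 50.1])

# Reshape into a single column, as scikit-learn expects 2-D input.
X = series.values.reshape(-1, 1)

# Standardize to mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(), X_scaled.std())  # ~0.0 and ~1.0
```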

4.2 Selected learning algorithms

The algorithms we judged would suit our problem were K-means, Isolation forest, One-class support vector machine and Elliptic envelope, all picked from the library scikit-learn. The main reason these algorithms were selected is that they have been used frequently for anomaly detection in other cases [5].

Algorithms

• The K-means algorithm was picked because of the clustering point of view. The algorithm clusters the data set and assigns each data point to a cluster; data points that are too far away are then considered anomalies. This technique is interesting in our case because the data points in the data set are scattered, and assigning data points to clusters may capture the normal behavior.

• Isolation forest was chosen because of its ability to view each data point separately and single out a data point if it is not an "ordinary" data point, based on decision trees. It is very useful in our situation since we are looking for non-normal data points.

• The One-class support vector machine is derived from a well-known supervised learning method, but can be applied in an unsupervised approach. The problem addressed by the One-class SVM is novelty detection: detecting rare events, i.e. events that happen so rarely that you have very few samples of them [5].

• Elliptic envelope was trained because it looks at the whole data set and finds the key parameters of the general distribution of the data [12]. It is another clustering-style algorithm that could be a potentially strong candidate for finding normal behavior.

4.3 Producing the results

The data set chosen is a month of data from the M3 sub-system in Ghana. The task used to set the scene was sp.MTNONLINEAIRTIMEVENDOR, which is used to purchase online airtime through the mobile money USSD menu or app. This task was chosen because it had anomalies confirmed by the domain experts. The time length was chosen because a smaller data set may not contain as many confirmed anomalies. This data set contains 96 confirmed deviating data points (anomalies) out of 4311 data points.

The K-means algorithm was trained on 1-5 clusters, since the elbow method recommended a clustering range of 1-5 cluster sizes. For the purpose of this demonstration, the cluster size with the best score was chosen, which was a cluster size of 1.

The results were produced by creating three models for each algorithm, since we chose to test three outlier fractions per algorithm. This resulted in 12 models per tested feature (Mean wait and Mean * N). The outlier fractions chosen were 0.01, 0.035 and 0.05. Choosing suitable outlier fractions is a difficult task because the number of anomalies found depends on the outlier fraction. This is why we tested three outlier fractions, in order to get a wider spread of the results. Increasing the outlier fractions even more would result in more false positives.

When the models were produced, we computed an intersection between the two best models, decided by the highest F1 score. The intersection produced a Jaccard score, which shows how alike the two models are. This was done in order to investigate whether combining the two best models can decrease the number of false positives.
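The intersection itself reduces to element-wise logic on the models' boolean anomaly masks; a sketch with made-up masks and labels:

```python
import numpy as np

# Hypothetical anomaly masks from the two best models, plus the
# domain experts' confirmed labels over the same data points.
pred_a = np.array([True, False, True, True, False, True])
pred_b = np.array([True, False, False, True, False, True])
labels = np.array([True, False, False, True, False, False])

both = pred_a & pred_b  # keep only points flagged by both models

fp_a = (pred_a & ~labels).sum()
fp_both = (both & ~labels).sum()
print(f"FPs: model A = {fp_a}, intersection = {fp_both}")
```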


Figure 4.1: The work flow of producing the results

Figure 4.1 displays the work flow of producing the results. By testing the features separately, we can see the effect each algorithm has on finding true positives and reducing false positives. By testing e.g. the Mean * N feature after testing the Mean wait (ms) feature separately, we can see whether the Number of calls (N) has any effect on the Mean wait (ms).


4.4 Evaluation

4.4.1 Detection metrics

The produced models are binary classifiers, i.e. they output true or false. This introduces the evaluation of binary classification, which is needed in order to establish correct scores, more specifically the F1-score, for each separate model.

• True Positive (TP): detected an anomaly that was labeled as an anomaly.
• True Negative (TN): detected a non-anomaly that was labeled as a non-anomaly.
• False Positive (FP): detected an anomaly that was labeled as a non-anomaly.
• False Negative (FN): detected a non-anomaly that was labeled as an anomaly.

4.4.2 Evaluation metrics

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{4.2} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{4.3} \]

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.4} \]

Precision is a metric of how precise the algorithm is: it measures the proportion of predicted positives that are true positives. Recall measures the proportion of all actual positives that the model predicts as true positives.

The recall metric is very relevant when predicting anomalies. What if the algorithm classifies an actual anomaly as a non-anomaly? That would be serious if this particular anomaly could cause the whole system to

crash. The recall metric is therefore a good evaluation metric in this context [18]. The F1-score seeks a balance between precision and recall. In our context of anomaly detection, the F1-score is a good measurement when there is a large number of actual negatives, since there can be an imbalance between the labelled classes.

Such an imbalance cannot be detected by accuracy alone, because the large number of true negatives inflates the accuracy score.

We use these metrics in order to evaluate how well each algorithm performs when detecting anomalies [18].
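A small helper that computes equations 4.1-4.4 from the four detection counts; as a sanity check, it reproduces the Isolation forest (0.035) row of table 5.1 in the Results chapter.

```python
def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Binary-classification metrics as defined in equations 4.1-4.4."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),  # algebraically equal to eq. 4.4
    }

# Isolation forest (0.035) on Mean wait, counts from table 5.1:
print(evaluate(tp=92, tn=4157, fp=59, fn=3))
# -> accuracy ~0.9856, precision ~0.6093, recall ~0.9684, f1 ~0.7480
# (matching table 5.1 up to rounding)
```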

4.4.3 Jaccard index

The Jaccard index, also known as Intersection over union or the Jaccard similarity coefficient, is a statistic used for determining the similarity or difference of finite sample sets [8]. It compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity for the two sets of data, with a range from 0 to 1; a score of 1 indicates that the sets are identical.

This measure will be used to decide how similar the best models produced from the experiment are to each other. The formula for the Jaccard index is:

\[ J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} = \frac{|X \cap Y|}{|X| + |Y| - |X \cap Y|} \]
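For binary detection masks over the same data points, the Jaccard index is a couple of lines (a sketch with made-up masks):

```python
import numpy as np

def jaccard_index(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard index of two boolean detection masks."""
    intersection = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return intersection / union

a = np.array([True, True, False, True, False])
b = np.array([True, False, False, True, False])
print(jaccard_index(a, b))  # 2 shared / 3 in the union ~ 0.667
```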

4.5 System specifications

The hardware was kindly provided by Ericsson. The systems were laptops with an Intel Core i7-8650U 1.90 GHz x64-based processor, 32 GB of RAM and the Windows 10 operating system: powerful laptops which made our job much easier.

In our experiment we were coding in Python 3.7.1 [14], one of the latest versions of Python at the time. The Pandas library [10] was used to store our time-series data as simply and effectively as possible. Numpy [9] provided us with tools to reshape our data and was a part of the pre-processing stages. We also used the library Matplotlib [7], which provided us with illustrations of our results. Lastly we used the powerful library scikit-learn [16], which holds a number of machine learning tools and algorithms; all algorithms were retrieved from this library.


Chapter 5

Results

In this chapter we present our results. The chapter contains the results of the individual algorithms' performance for each feature in separate tables. The intersection scores are presented in separate tables as well, one per feature. The two best performing algorithms are also presented in plot graphs, captured with the help of the library matplotlib. Both intersections are also plotted between the two best performing algorithms per feature. The best performing algorithms are decided by the F1 scores.

5.1 Abbreviations on the tables and figures

Tables 5.1 and 5.2 show the results of all the models for the two features: table 5.1 presents the results for the Mean wait feature and table 5.2 presents the results for the feature Mean * N. Tables 5.3 and 5.4 show the results of the intersection between the best performing algorithms: table 5.3 presents the result of the two best algorithms when trained on Mean wait, and table 5.4 presents the result of the two best algorithms when trained on Mean * N.

• OF = Outlier fraction
• DA = Detected anomalies
• TP = True positives
• FP = False positives
• TN = True negatives
• FN = False negatives

Figures 5.1, 5.2, 5.3 and 5.4 illustrate the whole data set that was used (the month of April 2019). The red scatter points represent the data points (10-minute intervals) the algorithm classified as anomalies; everything else is a normal data point. The x-axis is where the data points are in time, represented as numeric values. The y-axis is the Mean wait or Mean * N in milliseconds, depending on which results are being visualized.

Figures 5.5 and 5.6 also display the whole data set, but present the intersected TPs as green scatter points and the intersected FPs as red scatter points.


5.2 Algorithm comparison

Algorithm (OF)             DA    TP   FP    TN    FN   Accuracy  Precision  Recall  F1-score
K-means (0.01)             43    36   7     4209  59   0.9846    0.8372     0.3789  0.5217
K-means (0.035)            150   91   59    4157  4    0.9853    0.6066     0.9578  0.7428
K-means (0.05)             215   95   120   4096  0    0.9721    0.4418     1.0     0.6129
Isolation Forest (0.01)    44    37   7     4209  58   0.9849    0.8409     0.3894  0.5323
Isolation Forest (0.035)   151   92   59    4157  3    0.9856    0.6092     0.9684  0.7479
Isolation Forest (0.05)    216   95   121   4095  0    0.9719    0.4398     1.0     0.6109
One-class SVM (0.01)       43    12   31    4185  83   0.9735    0.2790     0.1263  0.1739
One-class SVM (0.035)      151   52   99    4117  43   0.9670    0.3443     0.5473  0.4227
One-class SVM (0.05)       215   66   149   4067  29   0.9587    0.3069     0.6947  0.4258
Elliptic Envelope (0.01)   44    37   7     4209  58   0.9849    0.8409     0.3894  0.5323
Elliptic Envelope (0.035)  151   91   60    4156  4    0.9851    0.6026     0.9578  0.7398
Elliptic Envelope (0.05)   216   95   121   4095  0    0.9719    0.4398     1.0     0.6109

Table 5.1: Results on feature Mean wait (ms)

Table 5.1 shows the results from the models trained on the Mean wait feature.

Isolation forest at the 0.035 outlier fraction resulted in one of the highest Accuracy scores, since it detected the most TPs and TNs. Its Precision was somewhat lower since the number of FPs was high, while its Recall was high since the number of FNs was low. The combination of Precision and Recall resulted in the highest F1 score, making it the best performing algorithm on the feature Mean wait.

K-means at the 0.035 outlier fraction also resulted in one of the highest Accuracy scores, since the detected TPs and TNs were high. Similar results on the Precision and Recall gave the second highest F1 score.


Figure 5.1: Visualization of K-means (0.035) result on Mean wait. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.1 is a visualization of the model trained with K-means at 0.035 outlier fraction on the feature Mean wait, the second best algorithm from table 5.1. Visualizing the result shows that it detects the delays as anomalies.


Figure 5.2: Visualization of Isolation forest (0.035) result on Mean wait. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.2 is a visualization of the model trained with Isolation forest at 0.035 outlier fraction on the feature Mean wait, the best algorithm from table 5.1. This visualization also shows that it detects the delays as anomalies.


Algorithm (OF)             DA    TP   FP    TN    FN   Accuracy  Precision  Recall  F1-score
K-means (0.01)             43    10   33    4183  85   0.9726    0.2325     0.1052  0.1449
K-means (0.035)            150   10   140   4076  85   0.9478    0.0666     0.1052  0.0816
K-means (0.05)             215   10   205   4011  85   0.9327    0.0465     0.1052  0.0645
Isolation Forest (0.01)    44    10   34    4182  85   0.9723    0.2272     0.1052  0.1438
Isolation Forest (0.035)   150   10   140   4076  85   0.9478    0.0666     0.1052  0.0816
Isolation Forest (0.05)    216   10   206   4010  85   0.9324    0.0462     0.1052  0.0643
One-class SVM (0.01)       43    9    34    4182  86   0.9721    0.2093     0.0947  0.1304
One-class SVM (0.035)      151   10   141   4075  85   0.9475    0.0662     0.1052  0.0813
One-class SVM (0.05)       215   10   205   4011  85   0.9327    0.0465     0.1052  0.0645
Elliptic Envelope (0.01)   44    10   34    4182  85   0.9723    0.2272     0.1052  0.1438
Elliptic Envelope (0.035)  151   10   141   4075  85   0.9475    0.0662     0.1052  0.0813
Elliptic Envelope (0.05)   216   10   206   4010  85   0.9324    0.0462     0.1052  0.0643

Table 5.2: Results on feature Mean * N

Table 5.2 shows the results from the models trained on the Mean * N feature. All models performed rather poorly when trained on Mean * N: although the Accuracy scores were good, the models did not detect many TPs.

The best performing algorithm is K-means at the 0.01 outlier fraction. Both its Precision and Recall were among the highest, producing the best F1 score. The second best performing model was Isolation forest at the 0.01 outlier fraction. These two algorithms were later intersected in table 5.4.


Figure 5.3: Visualization of K-means (0.01) detection on Mean * N. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.3 is a visualization of the model trained with K-means at 0.01 outlier fraction on the feature Mean * N, the best performing algorithm from table 5.2. The visualization shows that it detects the delays as anomalies.


Figure 5.4: Visualization of Isolation forest (0.01) detection on Mean * N. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.4 is a visualization of the model trained with Isolation forest at 0.01 outlier fraction on the feature Mean * N, the second best performing algorithm from table 5.2. The visualization shows that it detects the delays as anomalies.


5.3 Intersection

Algorithm (OF)             DA    TP   FP   TN    FN   Accuracy  Precision  Recall  F1-score
K-means (0.035)            150   91   59   4157  4    0.9853    0.6066     0.9578  0.7428
Isolation Forest (0.035)   151   92   59   4157  3    0.9856    0.6092     0.9684  0.7479
Intersection               148   91   57   4155  3    0.9860    0.6148     0.9680  0.7520

Table 5.3: Intersection Mean wait (ms)

Table 5.3 shows the two best performing models from table 5.1 and the result of intersecting them. The intersection gives only a minor improvement: the FPs drop from 59 to 57, which slightly increases the Precision and the F1 score. The Jaccard score was 0.9673, which shows that both models detect almost the same data points as anomalies.

Algorithm (OF)             DA   TP   FP   TN    FN   Accuracy  Precision  Recall  F1-score
K-means (0.01)             43   10   33   4183  85   0.9726    0.2325     0.1052  0.1449
Isolation Forest (0.01)    44   10   34   4182  85   0.9723    0.2272     0.1052  0.1438
Intersection               43   10   33   4183  85   0.9726    0.2325     0.1052  0.1449

Table 5.4: Intersection Mean * N

Table 5.4 shows the two best performing models from table 5.2 and the result of intersecting them. The intersection is neither an improvement nor an impairment: the FPs remain the same, and the F1 score after the intersection is the same as for K-means. There is no significant or positive change in the results. The Jaccard score was 0.9772, which shows that both models detected almost identical data points as anomalies.


Figure 5.5: Visualization of the intersection on Mean wait. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.5 is a visualization of the intersection result between the two best performing algorithms trained on the feature Mean wait. As mentioned in section 5.1, the green scatter points are the intersected TPs and the red scatter points are the intersected FPs. This shows that both algorithms detect very similarly.


Figure 5.6: Visualization of the intersection on Mean * N. Start-date 2019-04-01T00:00, end-date 2019-04-30T23:40

Figure 5.6 is a visualization of the intersection result between the two best performing algorithms trained on the feature Mean * N. This visualization illustrates the data points which are intersected TPs and FPs. The algorithms detect very similar data points as anomalies.


Chapter 6

Analysis and Discussion

The first thing to notice about the results is that the feature Mean * N does not perform very well: the precision is poor and the models do not detect close to the majority of the confirmed anomalies. But these results confirmed that the Number of calls does not have any large impact on the Mean wait. The low F1 score means that the load of users does not create any anomalies in the form of delays.

The models produced from the Mean wait, though, display high F1 scores. The best models were K-means with an outlier fraction of 0.035 and Isolation forest with the same outlier fraction. This answers RQ:1, with F1 scores of 0.7428 and 0.7479.

The intersection between the two models trained on the feature Mean wait showed an interesting result with regard to the Jaccard index of 0.9673: both algorithms detect almost the same data points as anomalies. Investigating the number of FPs, it remains almost the same, which means that intersecting the two models did not significantly affect the number of FPs; a minor improvement occurred through a drop of 2 FPs. RQ:2 can be considered answered, since the intersection did slightly decrease the number of FPs.

In table 5.1, we can see the results that each model produced with the feature Mean wait. Looking through the table, one notices that the accuracy score of each algorithm at all outlier fractions is close to 100%. The reason for this high accuracy lies in the way we calculate the accuracy score: in subsection 4.4.2, equation 4.1, the accuracy score uses both the TPs and TNs. The more typical way of calculating accuracy is to divide the number of correct predictions by the total number of predictions. In this case it would be:

\[ \frac{TP}{DA} \]

which is exactly the same as calculating the Precision:

\[ \frac{TP}{DA} = \frac{TP}{TP + FP} \]

Following this approach, the two best algorithms in table 5.1 would have an accuracy score of 0.6092 (Isolation forest) and 0.6066 (K-means). The reason that we calculate accuracy as listed in subsection 4.4.2, equation 4.1, is that a binary classification, i.e. 1/0 or true/false, is being used. Since it is a binary classification, the TNs have a great effect on the results, due to the large amount of normal data points that were also classified as normal by the algorithms.

The accuracy score, however, does not affect the F1 score; as shown in subsection 4.4.2, equation 4.4, the F1 score uses both precision and recall. The precision and recall evaluations are similar; the difference is that precision takes the FPs into account, while recall uses the FNs.

The relation between the FP and FN scores of each produced model is highly affected by the choice of outlier fraction. To clarify this, consider table 5.1 with focus on only the K-means algorithm: as the outlier fraction increases, so does the number of FPs. The reverse occurs with FNs, where a higher outlier fraction results in a lower number of FNs. This shows that the choice of outlier fraction plays a vital part in the models and in the way they are evaluated. The problem is to choose the most optimal outlier fraction without knowing the number of confirmed anomalies. The data that we performed the tests on has 4311 data points, of which 96 are confirmed anomalies; a simple division gives 96/4311 = 0.0222. One could argue that an outlier fraction close to 0.022 would give better scores than any of the other models. Having a high outlier fraction is the same as saying to the algorithm, in layman's terms, that we want it to detect more anomalies, regardless of whether they are actual anomalies.
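As a back-of-the-envelope illustration of that argument, the outlier fraction implied by the confirmed anomalies is just the ratio of the two counts reported above:

```python
n_confirmed, n_total = 96, 4311
implied_fraction = n_confirmed / n_total
print(f"{implied_fraction:.3f}")  # 0.022, between the tested 0.01 and 0.035
```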

This method could be a proof of concept in the space of anomaly detection in the EWP. It also gives these methods further relevance in the outlier detection area. The domain experts who are working on finding these anomalies may use these methods for faster detection and, to some extent, quality assurance in their own decision making. Faster detection means less work and stronger decision making. It is really hard to evaluate an unsupervised approach, and one has to remember that these results are based on the domain experts confirming the anomalies; the results thus reflect the domain experts' point of view. But considering that the results are grounded in the judgment of the best in the domain area, we are happy with them. It may also be interesting to investigate those data points that were considered anomalies even though they are not confirmed anomalies, especially taking into account that this is an unsupervised approach where the decisions are not based on a "supervisor" teaching the models. The models produce these results based on the underlying structure of the data, and may detect outliers that the domain experts would not have.


Chapter 7

Conclusions and Future Work

Something that can be taken into account when building the models in the future is to use multi-dimensional arrays when training, i.e. taking more features into account. Although, the results based on just the Mean wait are strong, and considering that feature to capture the definition of an anomaly in this domain, a one-dimensional training array does the job. A future consideration is also how to find the most appropriate outlier fraction, since the outlier fraction is a vital part of reducing the FPs. Different sub-systems may produce more anomalies than others, and then the outlier fraction needs to be adapted as well.

An additional approach to the proposed intersection method would be to intersect models with different outlier fractions, since our two best models, for both features, used the same outlier fraction. The reasoning behind using various outlier fractions in the intersection is that some models have low FPs but also low TPs, while other models have high TPs but also high FPs. An intersection between these could decrease the FPs, since they might compensate for each other's errors.

One could also go another direction to solve this issue: a reinforcement learning approach, building a machine learning model that detects anomalies via a reward system instead of finding the underlying structure of the data as an unsupervised approach does. This method does not need an outlier fraction. However, it is hard to decide when and when not to reward the model, and if humans decide whether it should be rewarded we are essentially back at a supervised approach. Still, this may be a more dynamic approach than setting an outlier fraction. Reinforcement learning is especially good at games, where rewards are given when points are scored, which is almost always good in a game; at tasks like that an AI can become even better than humans are today.

The main contribution of this thesis is that these simple models could serve as quality assurance by assisting the domain experts in their work of finding anomalies in a system like the EWP. Based on that contribution, RQ:1 can be answered: training on the feature Mean wait (ms) is the most relevant way to detect anomalies, and the two best algorithms were Isolation forest (0.035) and K-means (0.035). RQ:2 can be answered with a simple no, or to almost no extent; at least in our study, the reduction of FPs when intersecting the two best models was only a minor improvement.


References

[1] DataRobot editors, "Unsupervised machine learning". https://www.datarobot.com/wiki/unsupervised-machine-learning. Online; accessed 2019-03-27.

[2] de Jongh R., "Data science is a growing field". http://theconversation.com/data-science-is-a-growing-field-heres-how-to-train-people-to-do-it-110449. Online; accessed 2019-03-13.

[3] Garbade M. J., "Understanding K-means clustering in machine learning". https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1. Online; accessed 2019-03-27.

[4] Gupta P., "Decision trees in machine learning". https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052. Online; accessed 2019-04-18.

[5] Li S., "Time series of price anomaly detection". https://towardsdatascience.com/time-series-of-price-anomaly-detection-13586cd5ff46. Online; accessed 2019-05-07.

[6] Dormehl L., "What is an artificial neural network? Here's everything you need to know". https://www.digitaltrends.com/cool-tech/what-is-an-artificial-neural-network/. Online; accessed 2019-03-13.

[7] Matplotlib website. https://matplotlib.org/. Online; accessed 2019-05-21.

[8] Neo4j editors, "The Jaccard similarity algorithm". https://neo4j.com/docs/graph-algorithms/current/algorithms/similarity-jaccard/. Online; accessed 2019-05-05.

[9] NumPy website. https://www.numpy.org/. Online; accessed 2019-05-21.

[10] Pandas website. https://pandas.pydata.org/. Online; accessed 2019-05-21.

[11] Perera S., "Introduction to anomaly detection concepts and techniques". https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/. Online; accessed 2019-03-15.

[12] "Interpretation of Gaussian distribution". https://prateekvjoshi.com/2012/09/09/interpretation-of-gaussian-distribution/. Online; accessed 2019-05-07.

[13] Choudhary P., "Introduction to anomaly detection". https://www.datascience.com/blog/python-anomaly-detection. Online; accessed 2019-03-13.

[14] Python website. https://www.python.org/. Online; accessed 2019-05-21.

[15] scikit-learn editors, "sklearn.preprocessing.StandardScaler". https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. Online; accessed 2019-04-27.

[16] scikit-learn website. https://sklearn.org/. Online; accessed 2019-05-21.

[17] Seif G., "An easy introduction to unsupervised learning with 4 basic techniques". https://towardsdatascience.com/an-easy-introduction-to-unsupervised-learning-with-4-basic-techniques-897cb81979fd. Online; accessed 2019-03-27.

[18] Shung K. P., "Accuracy, precision, recall or F1?". https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9. Online; accessed 2019-03-06.

[19] The SciPy community, "numpy.reshape". https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html. Online; accessed 2019-05-15.

[20] Abiy Biru. Monitoring of Video Streaming Quality from Encrypted Network Traffic: The Case of YouTube Streaming. 2016. Online; accessed 2019-05-19.

[21] Hanna Blomquist and Johanna Möller. Anomaly detection with Machine learning. 2017. Online; accessed 2019-04-24.

[22] Oskar Carlsson and Daniel Nabhani. User and Entity Behavior Anomaly Detection using Network Traffic. 2017. Online; accessed 2019-03-13.

[23] Edgar Alonso Lopez-Rojas and Stefan Axelsson. Money Laundering Detection using Synthetic Data. 2017. Online; accessed 2019-04-21.

[24] Martin Boldt, Anton Borg, Selim Ickin and Jörgen Gustafsson. "Anomaly detection of event sequences using multiple temporal resolutions and Markov chains", to appear in Knowledge and Information Systems, 2019.

[25] Ewa Osekowska, Stefan Axelsson, and Bengt Carlsson. Potential fields in maritime anomaly detection. Online; accessed 2019-05-05.


Appendix A

Supplemental Information
