Unsupervised Anomaly Detection on Multi-Process Event Time Series



DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2018


Unsupervised Anomaly Detection

on Multi-Process Event Time Series

NICOLÒ VENDRAMIN

Master Thesis at Ericsson AB

Industrial Supervisors: Tomas Lundborg, Peter Björkén

Academic Supervisor: Lars Rasmusson

Examiner: Henrik Boström


Abstract

Establishing whether observed data are anomalous or not is an important task that has been widely investigated in the literature, and it becomes an even more complex problem when combined with high-dimensional representations and multiple sources independently generating the patterns to be analyzed. The work presented in this master thesis employs a data-driven pipeline for the definition of a recurrent auto-encoder architecture to analyze, in an unsupervised fashion, high-dimensional event time series generated by multiple and variable processes interacting with a system. Facing the above-mentioned problem, the work investigates whether or not it is possible to use a single model to analyze patterns produced by different sources. The analysis of log files that record events of interaction between users and the radio network infrastructure is employed as a real-world case study for the given problem. The investigation aims to verify the performance of a single machine learning model applied to the learning of multiple patterns developed through time by distinct sources. The work proposes a pipeline, to deal with the complex representation of the data source and with the definition and tuning of the anomaly detection model, that is based on no domain-specific knowledge and can thus be adapted to different problem settings. The model has been implemented in four different variants that have been evaluated over both normal and anomalous data, gathered partially from real network cells and partially from the simulation of anomalous behaviours. The empirical results show the applicability of the model for the detection of anomalous sequences and events in the described conditions, with scores reaching above 80% in terms of F1-score, varying depending on the specific threshold setting. In addition, their deeper interpretation gives insights into the differences between the variants of the model and thus their limitations and strong points.


Referat

Unsupervised anomaly detection on event time series generated by multiple processes

Determining whether observed data are anomalous or not is an important task that has been studied extensively in the literature, and the problem becomes even more complex when combined with high-dimensional representations and multiple sources that independently generate the patterns to be analyzed. The work presented in this thesis uses a data-driven pipeline for the definition of a recurrent auto-encoder architecture to analyze, in an unsupervised manner, high-dimensional event time series generated by multiple and variable processes interacting with a system. Against the background of the above problem, the work investigates whether or not it is possible to use a single model to analyze patterns produced by different sources. The analysis of log files that record events of interaction between users and the radio network infrastructure is used as a case study for the stated problem. The investigation aims to verify the performance of a single machine learning model applied to the learning of multiple patterns developed over time by distinct sources. The work proposes a pipeline for handling the complex representation of the data sources and the definition and tuning of the anomaly detection model, which is not based on domain-specific knowledge and can therefore be adapted to different problem settings. The model has been implemented in four different variants that have been evaluated on both normal and anomalous data, gathered partly from real network cells and partly from the simulation of anomalous behaviours. The empirical results show the applicability of the model for the detection of anomalous sequences and events in the proposed framework, with F1-scores above 80%, varying depending on the specific threshold setting. In addition, their deeper interpretation gives insights into the differences between the variants of the model and thus their limitations and strengths.


Acknowledgements

I would like to express my gratitude to everyone who supported me in the development of this work, both at Ericsson and at KTH. In particular, I would like to thank my industrial supervisor Tomas Lundborg and my manager Peter Björkén for their help and feedback on the work, and for their effort in making it easy for me to feel comfortable in the working environment. I would also like to mention Francesco Davide Calabrese and to thank him for the nice discussions about technical and non-technical topics, which contributed to making my stay at Ericsson even more pleasant.


Contents

List of Figures

List of Tables

1 Introduction
1.1 Background
1.2 Problem
1.3 Purpose
1.4 Objectives
1.5 Methodology
1.6 Ethics and Sustainability
1.7 Delimitations
1.8 Outline

2 Background Theory
2.1 Machine Learning Concepts
2.1.1 Inductive Learning and Generalization
2.1.2 Neural Networks
2.1.3 Recurrent Auto-encoders for Anomaly Detection
2.1.4 Architecture Definition using Genetic Algorithms
2.2 Anomaly Detection
2.2.1 Concepts for Anomaly Detection
2.2.2 Anomaly Detection on Sequences and Time Series
2.2.3 Output and Evaluation of an AD Method

3 Methodology
3.1 Choice of the Method
3.2 Problem Definition
3.3 Model Selection
3.3.1 Comparison of the Candidate Models
3.4 Evaluation
3.5 Dataset

4 Implementation
4.1 Pre-processing Phase
4.1.1 Data Extraction
4.1.2 Redundancy of the Data
4.1.3 In-homogeneity of the Data
4.1.4 Sparsity of the Data
4.2 Application of Genetic Algorithms
4.3 Comparison of the Candidate Models and Final Architecture Description

5 Experimental Results
5.1 Analysis of the Performances of the Models
5.1.1 Generalization of Learnt Patterns
5.1.2 Evaluation as Classification Technique
5.2 Additional Results
5.2.1 Error Profile within a Sequence
5.3 Discussion

6 Conclusion


Abbreviations

1SLA 1-Step Look Ahead
A-ENC Auto-ENCoder
AD Anomaly Detection
AI Artificial Intelligence
ANN Artificial Neural Network
CPU Central Processing Unit
ECG ElectroCardioGram
EP Error Profile
FN(R) False Negative (Rate)
FP(R) False Positive (Rate)
GA Genetic Algorithm
GAN Generative Adversarial Network
GRU Gated Recurrent Unit
ISS In-Sequence Swap (Anomaly Type)
JSON JavaScript Object Notation
KPI Key Performance Indicator
LSTM Long Short Term Memory
ML Machine Learning
MSE Mean Squared Error
PCA Principal Components Analysis
RNN Recurrent Neural Network
ROC Receiver Operating Characteristic
RVI Random Vector Injection (Anomaly Type)
ReLU Rectified Linear Unit
SVA Single Value Alteration (Anomaly Type)
SVI Stranger Vector Injection (Anomaly Type)
SVM Support Vector Machine
TN(R) True Negative (Rate)
TP(R) True Positive (Rate)


List of Figures

2.1 Machine learning vs. Traditional programming
2.2 Underfitting vs. Overfitting
2.3 Perceptron
2.4 Example of Activation Functions
2.5 LSTM cell
2.6 Confusion matrix
3.1 Anomaly types
3.2 Dataset: example of JSON event
3.3 Employed Hardware: lscpu
4.1 Genetic representation of a model
4.2 Genetic Algorithm: Score Optimization
4.3 Model comparison box-plot
4.4 Final model
4.5 Architecture: LSTM vs GRU
4.6 Architecture: A-ENC vs 1SLA
4.7 1SLA Models: Input and Target Extraction
5.1 Threshold selection: Sliding Interval of Training Scores
5.2 Model 1: Performances by Anomaly Class
5.3 ROC Curve for the four evaluated models
5.4 Model 1: MSE per Event Example
5.5 Model 2: MSE per Event Example
5.6 Error Profile: Random Vector Injection
5.7 Error Profile: Stranger Vector Injection
5.8 Error Profile: In-Sequence Swap
5.9 Error Profile: Single Value Alteration


List of Tables

5.1 Generalization: Train vs Test Error
5.2 Thresholds: Parameters and Values
5.3 Model 1: results
5.4 Model 2: results
5.5 Model 3: results
5.6 Model 4: results
5.7 Comparison of Model Performances


Chapter 1

Introduction

This document serves as the final report for the project carried out within the Automation team of the System and Technology department at Ericsson AB, Sweden, in fulfilment of the master thesis requirements of the EIT Digital Master in Data Science (together with the partner universities Kungliga Tekniska Högskolan and Politecnico di Milano). The report illustrates the work that has been done in the investigation of the field of anomaly detection on data collected from the radio network and, more particularly, attempts to employ log traces to identify novelties and outliers in the events describing the behavior of the users interacting with the system. The project was developed throughout the period from 15 January 2018 to 30 May 2018.

1.1 Background

Machine learning is a field of computer science that defines a set of techniques employing statistical methods in order to progressively improve performance on a specific task, exploiting data that are representative of it, without being explicitly programmed for its solution. Over the past decades machine learning has gained increasing popularity in the field of information technologies and has de facto entered our everyday life, even if often in a rather hidden form. As a matter of fact, many of the applications and services that we use every day exploit machine learning algorithms as a tool, or sometimes as their core technological asset. When we access our Spotify or Netflix accounts, interact with the virtual assistant in our phone, or even access a web page, we are often interacting with a machine learning algorithm that, based on the data available about the user, estimates which movie or song to suggest to us, recognizes words from our speech, or decides how to arrange the elements of the web interface to better fit our profile.


Machine learning techniques are traditionally grouped into four main categories:

• Supervised Learning: the input data are characterized by a set of features and an output. The learning algorithm focuses on learning the mapping function between input and output, in order to be able to predict the output for previously unseen inputs.

• Unsupervised Learning: the input data are only characterized by features, and the algorithmic techniques aim to extract clusters or patterns among them.

• Semi-supervised Learning: the training dataset contains both labelled and unlabelled data, and the training tries to exploit both in order to learn the input-output mapping.

• Reinforcement Learning: the model is trained to act in a given environment, performing actions in order to maximize the discounted reward generated by its control choices.

As the list above shows, the proposed taxonomy mainly discriminates according to the way in which the data can be, and are, used in the learning procedure, depending on whether the available training points are labelled (the output is provided for each input) or not; it is also possible to classify techniques according to alternative criteria.
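As an illustration of the first two categories, consider the same one-dimensional toy dataset used with and without its labels. This is a hypothetical sketch, not taken from the thesis; all names and values are invented:

```python
# Supervised vs unsupervised use of the same data (toy sketch).

def fit_threshold_classifier(xs, ys):
    """Supervised: the labels are used to place a decision threshold at the
    midpoint between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (mean0 + mean1) / 2
    return lambda x: (int(x > t) if mean1 > mean0 else int(x < t))

def two_means(xs, iters=20):
    """Unsupervised: split the unlabelled points into two clusters
    (a tiny 1-D k-means)."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

xs = [0.1, 0.3, 0.2, 5.0, 5.2, 4.9]
ys = [0, 0, 0, 1, 1, 1]   # labels: available only in the supervised case
clf = fit_threshold_classifier(xs, ys)
print(clf(0.15), clf(5.1))                          # -> 0 1
print(sorted(round(c, 1) for c in two_means(xs)))   # -> [0.2, 5.0]
```

In the supervised case the labels drive the decision boundary directly; in the unsupervised case the same structure is recovered from the feature values alone.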

For example, it is possible to group machine learning techniques according to their operating principles into several classes that share a similar operational paradigm. Neural networks, for instance, are a processing method inspired by the functioning of the nervous system, based on adjusting synaptic connections in order to define an input-output mapping [2]; they differ from other algorithm families such as Bayesian algorithms, decision trees, regularization techniques, clustering algorithms, association rule learning, dimensionality reduction methods, and ensemble models.

Machine learning techniques have gained extremely high popularity in the academic and industrial worlds thanks to the performance they have achieved in a wide variety of different fields.

Indeed, thanks to their intrinsically data-driven nature, most machine learning techniques find application in several different domains, given the required adjustments; research in the field of machine learning has thus crossed boundaries and overlapped with a long array of different and previously independent research areas, such as computer vision, natural language processing, or the one of specific interest for the work presented in this report: anomaly detection.

Anomaly detection (or AD), often referred to as outlier detection [3], novelty detection [4] or behavioural analysis [5], deals with the identification of abnormal behaviours in the data, and groups a wide variety of different methods and techniques that have been developed to deal, within different application domains, with the horizontal problem of spotting and signaling anomalous points or series in the available data.


Different application domains are characterized by different requirements, given for example by the type, structure and dimension of the data.

A first important point is the non-trivial definition of the concept of anomaly itself, given by Barnett and Lewis in [6] as:

"An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data."

The given definition encapsulates several of the aspects that are critical when dealing with anomaly detection. First of all, it bases the discrimination between normal and anomalous data on the rest of the known points, and not on some intrinsically good or bad property. As a matter of fact, it is important to remember that the word anomaly, in line with its very definition as "something that deviates from what is standard, normal or expected" [7], does not necessarily refer to an incorrect or dangerous behaviour, but rather to an unlikely or unexpected occurrence with respect to what is known or has been modeled of a given system. Secondly, Barnett and Lewis point out that an anomaly can refer both to a single observation and to a set of observations, reflecting an important taxonomy that classifies possible anomalies into the following three categories [8]:

• Point anomalies: also referred to as outliers; single data points that do not resemble normal data.

• Contextual anomalies: once a context for the data points is defined, these are the data points that do not follow the standard context-point associations.

• Collective anomalies: cases in which a set of data points that, considered individually, are neither point nor contextual anomalies, does not follow the patterns that characterize normal behaviour.
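As a minimal illustration of the first category, point anomalies can be flagged with a generic z-score rule; this is a generic sketch, not a method used in the thesis, and the readings and threshold are invented:

```python
# Point-anomaly (outlier) detection via z-score: flag values whose distance
# from the mean, in units of standard deviation, exceeds a threshold.
from statistics import mean, stdev

def point_anomalies(values, z_threshold=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > z_threshold]

readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 42.0]   # 42.0 is the outlier
print(point_anomalies(readings))  # -> [42.0]
```

Note that the outlier itself inflates the standard deviation, which is why a modest threshold is used here; contextual and collective anomalies need the context or the whole sub-sequence to be modelled, and cannot be caught by such a per-point rule.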

Anomalies are, by definition, rare or unlikely events, and thus it is often the case that there is no balanced data representation for the positive and negative cases. Accordingly, within the field of anomaly detection, methods are generally grouped into three main classes [9]:

• Supervised Techniques: exploit labelled data to learn which data points are anomalous and which are normal.

• Unsupervised Techniques: do not exploit any explicit knowledge about the nature of the data points in the training set, and rely on pattern recognition to sort anomalous occurrences from normal data.

• Semi-supervised Techniques: assume that the training data contain (mostly) normal instances only, model the normal behaviour, and flag deviations from it.


It is finally possible to derive a taxonomy of anomaly detection techniques also according to the systematic approach to the problem, resulting in the identification of six main families: classification-based, nearest-neighbor-based, clustering-based, statistical, information-theoretic and spectral AD techniques.

As results from the previous classification, not all anomaly detection techniques are based on machine learning, but only a subset of them, mainly within the family of classification-based AD methods, where techniques such as support vector machines [10] or neural networks [11][12][13] have proved to be competitive over the benchmarks for novelty detection.

It is often characteristic of problems in the field of anomaly detection to deal with data having a temporal structure, since in most real-world systems the behaviour must be analyzed in its dynamic evolution in order to assess its normality or abnormality. In this case we can identify four main kinds of sequential data, given by the combination of the following two dichotomies:

• Uni-variate vs multi-variate: a uni-variate sequence is a signal that varies over a single dimension. On the contrary, when the signal has multiple independent features, the problem is defined as multi-variate.

• Time series vs symbolic sequence: a time series, or continuous sequence, consists of data that vary in a continuous interval, while a symbolic sequence is represented by a string of characters belonging to a discrete alphabet.

We can get a clearer grasp of the concept by analyzing the following example regarding anomaly detection on the stock value of a given index: if we represent the sequence employing the stock price as a feature, we approach the problem as a time series, while if we only use the information formed by the sequence of UP or DOWN symbols, representing a relative increase or decrease in the value, we are employing a discrete (symbolic) representation. If the analyzed index is a composite index formed by different stock prices (or relative increase/decrease symbols), we deal with a multi-variate setting, as opposed to a uni-variate approach in which the index is only represented by a single aggregate value (or symbol).
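The stock example can be sketched in a few lines; the discretization rule and the price values below are invented (a flat step is mapped to D here):

```python
# The same uni-variate series as a continuous time series (the price list)
# and as a symbolic UP/DOWN sequence over the alphabet {U, D}.

def to_symbolic(prices):
    """Discretize a uni-variate time series into a symbol string,
    one symbol per consecutive pair of values."""
    return "".join("U" if b > a else "D" for a, b in zip(prices, prices[1:]))

prices = [100.0, 101.5, 101.0, 102.3, 102.9, 101.8]
print(to_symbolic(prices))  # -> UDUUD
```

A multi-variate version would simply carry one such feature (or symbol) per constituent stock at each time step, giving a vector-valued sequence instead of a scalar one.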

1.2 Problem


Most existing techniques, such as those referenced at [16] or deep autoencoders [17], deal with data that are produced by a known or hidden process that is the same at training and inference time.

However, in many application scenarios it is common that the generative process of the time series whose anomaly level we would like to assess changes through time or, at least, cannot be assumed to be constant.

An example of such a scenario is the detection of anomalies in a log file recording the interactions of multiple users. Different users can, in principle, interact in an arbitrary and independent way with the system, generating independent time series characterized by autonomous patterns, since only the system's contribution to the process is constant.

In many cases related to real systems, even with extremely loose constraints over the possible interactions between the users and the system, reflected in few assumptions about the common characteristics of the independent processes, it might still be the case that the normal behaviours are characterized by similar hidden patterns. Thus, the knowledge gap identified by this research work is to better understand whether, and to what extent, state-of-the-art techniques for anomaly detection on time series can be employed in a context in which the processes generating the sequences are multiple, even though possibly similar. As partially mentioned above, this research problem is reflected in an extensive set of real-world applications, such as the one from which the data employed in this research are retrieved. In most cases it is practically impossible to dedicate a model to each different process interacting with a system, and it would be of high benefit to be able to use a single entity to grasp behaviours that are common across the different agents.

With particular reference to the practical case for which it is employed at Ericsson, this research is motivated by the need for a tool for the analysis of log files in troubleshooting applications. Such files contain mixed information regarding short interaction sequences of multiple users that vary through time and interact in an independent way with the network infrastructure, thus raising the question of whether or not it is sufficient to employ a single model to extract indications about the level of abnormality of the individual user interactions.

A similar evaluation has been performed previously by the work referenced at [18], where the application of hierarchical temporal memories is evaluated over the problem of anomaly detection on the behaviour of users' requests to a website.

Although similar when considering the multi-user part of the analysis, that work presents two main differing traits distinguishing the two problems, concerning the online context of application and the nature of the analyzed sequences.


When considering, instead, the academic literature on the theme of concept drift, it is important to reference the works of Zhou et al. [19], Subutai et al. [20] and others introduced in the survey on the topic carried out by Gama [21]. The present work differs from the mentioned literature because, while the basic idea of concept drift is to adapt the algorithm or learning procedure to the variation of the stream that generates the data, the objective of this thesis is to validate the possibility of employing a single model that, without any real-time adaptation, is able to scale across patterns generated by different sources, exploiting the common sequential patterns among them. In addition, in the case under analysis the sources vary even between the different time series present in the training data, not only shifting between training and test time.

1.3 Purpose

Within the problem area described in the previous section, the work of this thesis centers on evaluating the performance obtained when state-of-the-art techniques for time series prediction are applied to a problem of anomaly detection on multi-variate sequences of non-symbolic values generated by multiple and variable processes. The research question is then formulated as follows:

What are the performances of a system employing a single machine learning model to learn multi-variate continuous patterns generated by multiple processes?

1.4 Objectives

In order to answer the research question reported above, the project is focused on the following goals to be fulfilled:

• Analysis of the previous literature and of the state of the art for time series anomaly detection.

• Identification and design of a modelling technique for detecting anomalies.

• Implementation of the anomaly detection system, including tuning of the hyper-parameters and training of the model.

• Evaluation of the performances of the implemented system over anomalous and normal data.

1.5 Methodology


The work employs data gathered from a real network cell, for which a deeper explanation is provided in the following chapters of this report.

The work has been carried out after an extensive and detailed analysis of the previous research on the subject, in order to outline the available alternatives and the important variables, identifying recurrent auto-encoders as the candidate architecture to test, as a combination of the two main characteristics of the problem, given its unsupervised and temporal nature.
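To make the reconstruction-based idea concrete, the following is a hypothetical, heavily simplified sketch: a stub stands in for the trained recurrent auto-encoder, and the mean squared reconstruction error per event serves as the anomaly score. All names, values and the threshold are invented and not taken from the thesis:

```python
# Reconstruction-error anomaly scoring (toy sketch). A real system would
# replace `reconstruct` with encoder/decoder inference of a trained
# recurrent auto-encoder over the event sequence.

def reconstruct(sequence):
    # Stub model: "reconstructs" each event vector by rounding its entries.
    return [[round(v) for v in event] for event in sequence]

def anomaly_score(sequence):
    recon = reconstruct(sequence)
    errs = [sum((a - b) ** 2 for a, b in zip(e, r)) / len(e)
            for e, r in zip(sequence, recon)]
    return sum(errs) / len(errs)          # mean per-event MSE

normal = [[1.0, 0.1], [0.9, 0.0]]         # close to what the stub reproduces
anomalous = [[0.5, 0.5], [0.4, 0.6]]      # poorly reconstructed by the stub
threshold = 0.1
print(anomaly_score(normal) < threshold <= anomaly_score(anomalous))  # -> True
```

The design point carried over from the thesis setting is that only the score computation depends on the model: sequences whose reconstruction error exceeds a threshold derived from the training scores are flagged as anomalous.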

After the study of related works, the main methodological focus has been put on the selection of the model, and on the evaluation of possible thresholds and of the respective performances.

As far as the first factor is concerned, a data-driven routine has been chosen as the methodology for model selection. In particular, an evolutionary strategy consisting of a basic genetic algorithm has been applied to produce a set of candidate models. Such models were subject to a further selection after an inspection of their performance and, more particularly, of their training and validation errors.
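The evolutionary step can be sketched as a toy genetic algorithm; the genome layout, the fitness function (which stands in for training a model and reading off its validation score) and all parameters below are invented, not taken from the thesis:

```python
# Toy genetic algorithm over integer "genes" (e.g. hidden units, latent size).
import random

random.seed(0)
TARGET = (32, 8)   # pretend-optimal architecture, for the stand-in fitness

def fitness(genome):
    # Stand-in for "train the model and return minus the validation error".
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def mutate(genome):
    i = random.randrange(len(genome))
    g = list(genome)
    g[i] = max(1, g[i] + random.choice([-4, -2, -1, 1, 2, 4]))
    return tuple(g)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

pop = [(random.randint(1, 64), random.randint(1, 64)) for _ in range(20)]
for _ in range(60):                       # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                    # truncation selection (elitism)
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=fitness)
print(best, fitness(best))
```

Keeping the top half of the population unchanged (elitism) makes the best fitness monotonically non-decreasing, so the search reliably climbs toward the optimum of the stand-in score.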

Consequently, the remaining potential architectures have been compared using a statistical test to assess their relative performances.

Different thresholds, computed starting from the scores given by the model to the training sequences, are evaluated and compared in terms of performance and through the representation of the ROC curve, which shows the variation of the ratio between true positives and false positives for different values of the threshold, or sensitivity level.
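The threshold sweep just described can be illustrated with a minimal sketch; the scores, labels and thresholds below are invented:

```python
# Sweep a decision threshold over anomaly scores and record the (FPR, TPR)
# pairs, i.e. the points of the ROC curve.

def roc_points(scores, labels, thresholds):
    pos, neg = labels.count(1), labels.count(0)
    pts = []
    for t in thresholds:
        preds = [int(s >= t) for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]   # e.g. per-sequence reconstruction MSE
labels = [0, 0, 0, 1, 1, 1]               # 1 = anomalous
print(roc_points(scores, labels, [0.0, 0.5, 1.0]))
# -> [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

Lowering the threshold moves along the curve toward (1, 1), flagging everything; raising it moves toward (0, 0), flagging nothing, and the threshold is chosen as the point with the best trade-off for the application.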

Finally, the evaluation has been carried out employing both real data coming from operating cells and simulated data in which anomalies were induced, in terms of identification percentage (accuracy) and other metrics obtained from the analysis of the confusion matrix of the classifier.

1.6 Ethics and Sustainability

The problem of identifying the anomalous behaviour of a system is strictly connected with sustainability issues. As a matter of fact, anomalies are often a cause of degradation of a system's performance, thereby generating a non-optimal utilization of resources. Automatic tools for anomaly detection can help to prevent these sustainability issues, as well as providing valuable tools for protection against hazardous or illegal activities, as in the case of their application to intrusion detection. In most fields, this automation of system monitoring does not actually replace any human labour, and even in the few domains in which this occurs, a need for different professional figures arises for the control, update and verification of the detection systems.


At the same time, it is important to consider whether such monitoring tools should be employed, or whether their usage might cross the boundaries of monitoring beyond the personal sphere of the users of a service or system. As a matter of fact, it is necessary to respect the ethical limits of privacy when dealing with monitoring applications, and to comply with the sensitivity of the entities whose data are analyzed.

1.7 Delimitations

The project looks at a particular snapshot of the problem described in the previous sections. As a matter of fact, the research is limited to the field of machine learning, and to a specific subfamily of models, leaving space for the evaluation of additional techniques, new or coming from different branches of anomaly detection.

In addition, given the considerations about the environment in which the research is developed and the availability of data, the horizon of possible techniques is limited to unsupervised learning approaches.

A second limitation stands in the fact that due to the availability of data, the model has only been tested against simulated anomalous data.

Finally, with respect to the employed software, Keras has been selected for its flexibility, ease of use, and the author's familiarity with the environment, but other choices are possible.

1.8 Outline


Chapter 2

Background Theory

The background chapter, which continues the introductory overview given in Section 1.1, is dedicated to the description of the important concepts and literature upon which the thesis work is built. The discussion is limited to the knowledge required for the understanding of the following work; it is organized in sections, separating the machine learning concepts from the anomaly detection background, and readers familiar with one or both of the two areas may skip the corresponding sections.

2.1 Machine Learning Concepts

Machine learning is an area that has gained extensive popularity in the field of information and communication technologies, above all in the past decade. Despite this recent phenomenon of public diffusion, interest in extracting and analyzing patterns in data has always been documented in the scientific world. Recalling the example employed by Bishop in [22], it was the analysis of the data gathered by Tycho Brahe about the motion of the planets that allowed Kepler to extract a model and produce his well-known astronomical laws, proving the interest in data-driven approaches already back in the 17th century.

Advanced computing systems have offered a powerful tool for pattern extraction, allowing the execution of algorithmic routines able to classify, predict or control according to models empirically learned from data, and giving birth to a new area of science concerned with the task of extracting information and knowledge from data: machine learning.


Figure 2.1: Machine learning vs. Traditional programming: schematic comparison between traditional programming and machine learning paradigm.

The fundamental idea defining machine learning can be summarized by paraphrasing the words pronounced in 1959 by Arthur Samuel [25] as:

The field of study that gives computers the ability to learn without being explicitly programmed.

or, in a more formal way, by Tom Mitchell's definition of a learning problem [23], stating that

"A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."


2.1.1 Inductive Learning and Generalization

Therefore, when referring to machine learning or data-driven approaches, especially in the supervised setting, it is always important to point out their inherently inductive character, aiming to learn general patterns starting from a limited set of examples.

As a matter of fact, regardless of the employed technique, the core concept is always related to the possibility of extracting a model from a subset of the data and being able to use it on unseen data, raising the two fundamental questions of how to extract those patterns from the data, and of how to maximize and evaluate the generalization capabilities of the resulting program.

The concept of generalization refers to the capability of a model to be applied on data that has not been used for its generation, and it is an extremely important property for every machine learning algorithm.

In fact, if a model is not capable of generalizing, it will only be able to predict accurately on the training data, for which the exact mapping is already known, making its application useless for practical purposes.

Let us consider a certain task T and call t the target function of T, representing its perfect solution. For example, the task could be to assess whether a picture represents a cat or not, and the target function t is the mapping that is always able to discriminate between the cat and no-cat classes.

Following the common terminology, we refer with the word examples to the set of data available for learning, considered to be samples extracted from t (possibly with some noise) and for which the value of the target function, also referred to as label, is known. It is important to remark that, even though the examples are in principle drawn from the distribution described by t, they only give a partial representation t̂ of the original target function. The word features identifies the set of values that are used to represent an example and that correspond to the description of one possible input value of t. In the context of the example discussed above, the features could be the numerical representation of the image as a set of pixels. The aim of the learning procedure is to employ the examples in order to define a mapping t′ that approximates t̂, and in the best case coincides with t. In order to evaluate how well t is approximated by t′, it is necessary to define a loss function L that proxies whether the properties of a training example are learned, by describing how well this is inferred by t′ with respect to the real value it assumes according to t.

The learning is driven by an optimization technique that aims to minimize the distance between inferred and expected output, using L as a metric. The set of all possible functions is referred to as the concept space, and contains all the mappings from the feature space to the output space, including the target function. A hypothesis is defined as a candidate function to approximate the target function, and every machine learning technique defines the so-called hypothesis space as the set of functions that it can generate.


Figure 2.2: Underfitting vs. Overfitting: the same data points fit with an underfit (left), appropriate (center) and overfit (right) model. Blue data points are used for fitting, pink instances are unseen points.

Depending on the number of independent parameters that the learning procedure is meant to estimate, the size of the hypothesis space can vary from reduced sets of a finite number of concepts to spaces of infinite order.

Although it might seem reasonable to always favor a more representative hypothesis space with respect to a smaller one, this is not the right approach. As a matter of fact, firstly, a more extensive hypothesis space is more difficult to explore and, in the second place, employing a representation that is of higher order than necessary can lead the model to learn patterns that are specific of t̂ but not of t, resulting in the phenomenon described as overfitting. An overfitted model is the outcome of a learning procedure that loses generalization capabilities because it learns the noise of the samples, due to the excessive representational power of the selected hypothesis space. At the same time, it is also a wrong methodological approach to always select a less representative hypothesis space, since it could lead to the opposite phenomenon of underfitting, caused by the insufficient complexity of the model, which makes it impossible to learn the patterns characterizing the examples.

An example of underfitting and overfitting representations of a set of examples is given in Figure 2.2, where a set of samples extracted from a quadratic function is fit with, respectively, a linear, quadratic and higher-order model.
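The behaviour illustrated in Figure 2.2 can be reproduced numerically. The sketch below, assuming NumPy is available and using synthetic data generated purely for illustration, fits polynomials of degree 1, 2 and 9 to noisy samples of a quadratic function and compares training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a quadratic target function (synthetic, illustration only)
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 2 * x_train**2 - x_train + rng.normal(0.0, 0.1, size=12)
x_test = np.linspace(-0.95, 0.95, 12)
y_test = 2 * x_test**2 - x_test + rng.normal(0.0, 0.1, size=12)

def fit_and_errors(degree):
    """Least-squares polynomial fit; return (training error, test error)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_err, test_err

underfit = fit_and_errors(1)   # linear model: too simple
good = fit_and_errors(2)       # quadratic: matches the target
overfit = fit_and_errors(9)    # degree 9: starts fitting the noise
```

Since the polynomial spaces are nested, the training error can only decrease with the degree, while the gap between training and test error widens for the overfit model.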

While in the case of underfitting it is rather difficult to establish a direct metric for its estimation, other than the case in which the learning procedure does not converge to a stable solution, an overfitted model can easily be identified when the error on unseen samples is consistently higher than the error on the examples used for training, meaning that the patterns learned by the model are not, or not only, general for t, but rather specific of its representation t̂ through the set of available samples.


The opposite concept of underfitting, instead, indicates that the selected hypothesis space is too limited and thus no good approximation of the target function can be reached, given the insufficient capabilities of the candidate concepts. In order to evaluate the performances of a model or of a candidate function on unseen data, it is a common practice to partition the available data into a training set, employed for the actual learning of the model parameters, and a test set on which the performances are assessed.
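A hold-out partition of this kind can be sketched in a few lines (a minimal illustration; the function name and default fraction are arbitrary choices, not taken from this work):

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Randomly partition the examples into a training and a test set."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy, so the input list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
```

The split is disjoint and covers the whole dataset, so the test-set error is computed on examples never seen during training.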

In order to cope with the problem of overfitting, and to reduce the sensitivity of the model to the noise present in the dataset, it is possible to apply some meta-algorithms that limit the effective model complexity and allow a complex concept to avoid severe overfitting. Such techniques, also known as regularization techniques, such as lasso regularization [26] and weight decay [27], reduce the degrees of freedom of the model by penalizing, during the learning procedure, the parameter sets that represent more complex models.

An excessive regularization can lead to poor performances or no convergence of the model and therefore it is important to accurately select the regularization technique and its parameters according to the problem and to the analysis of the performances. The issue of overfitting and, more in general, the need for models with good generalization capabilities, also introduces another important topic in the field of machine learning: model selection and hyper-parameter tuning.

Both are factors influencing the size of the hypothesis space, having a direct effect on the performances of the model on unseen samples, and respectively refer to the choice between the different possible algorithmic techniques and the choice of their non-learned parameters.

As a matter of fact, a ML model is often defined not only by the algorithm and by the parameters learned using some optimization technique over the training samples, but also by a set of values, called hyper-parameters, that cannot be learned at training time and that specify different settings of the learning strategy or of the architecture of the model.

Different methods are available for model selection and hyper-parameter tuning, among which grid search, genetic algorithms (discussed in more detail in the methodology) and random search [28] are recalled, all relying on the use of a subset of the training data left out, with different possible strategies such as hold-out or cross-validation, for the comparative evaluation of different combinations of models and hyper-parameters.
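Grid search with a hold-out set, for instance, can be sketched as follows (a minimal illustration; the `train_fn`/`score_fn` interfaces and the toy constant-predictor model are hypothetical, assumed only for this example):

```python
import itertools

def grid_search(train_fn, score_fn, grid, train_data, holdout_data):
    """Try every hyper-parameter combination; keep the one scoring best
    on the held-out data."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = train_fn(train_data, **params)
        score = score_fn(model, holdout_data)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy model: predict a constant, shrunk by a regularization strength `reg`.
train_fn = lambda data, reg: sum(data) / len(data) / (1 + reg)
score_fn = lambda model, holdout: -sum((model - y) ** 2 for y in holdout)

best, _ = grid_search(train_fn, score_fn, {"reg": [0.0, 0.5, 1.0]},
                      train_data=[1, 2, 3], holdout_data=[2])
```

Random search follows the same evaluation loop but samples the combinations instead of enumerating them.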

More about this problem and the available techniques can be found surveyed in [29]. Machine learning techniques are often distinguished into two broad categories:

• Online learning algorithms
• Batch learning algorithms


Online algorithms update the model one sample at a time, often in a setting in which the same sample is only observed once. On the other hand, batch algorithms update the model according to bigger sets of samples and generally the learning occurs in an offline fashion in which multiple passes over the data are allowed. Most of the techniques composing the state of the art in machine learning belong to the second class.

Although it is difficult to actually define a single state of the art in machine learning since, as proved in [30], there exists no optimization technique that is superior over all problem classes, it is still possible to identify some popular techniques that form the most important basic machine learning modeling tools, of which a few are introduced in the following list:

• Linear Regression: one of the basic building blocks in machine learning. It is a supervised technique that aims to learn a linear mapping between the feature space and the output, or, in other terms, to reconstruct the label of the samples learning the parameters of the linear combination of the input features.

• Decision Trees: a popular technique to learn in the setting in which the target function is discrete-valued. The algorithm represents the available classes as leaves of a tree and the classification results from the path that the features representing the input value form from the root of the tree to one of the leaves. Multiple decision trees can be combined together to form a random forest that chooses as output class some aggregation of the outputs of the single decision trees. Their strength lies in the ease of interpretation, since they can be easily read as if-else sequences. The work referenced at [31] contains a recent review of this class of algorithms.

• SVM: a Support Vector Machine [32] is a technique to identify a hyperplane that can be used for classification, regression or clustering purposes. Their mathematical description is rather involved, but their basic idea relies on the application of the so-called kernel trick to consider distances in arbitrarily high-dimensional spaces, enabling to find the separating hyperplane also for non-linear regions, and on the use of a quadratic optimization problem to avoid getting stuck in local minima. The basic concepts are described in the work of Fradkin referenced at [33].

• Q-learning: this technique is widely employed in the field of reinforcement learning. It can be structured under different formats but the general concept is to learn the mapping between the current state of an agent, a possible action to take and the expected reward that the agent receives from the environment, employing the previous feedback received by the agent itself.


Figure 2.3: Perceptron: simple representation. x represents the input vector with n features, w the vector of the weights and finally t′ the output.

• K-means: a clustering technique that starts with a random initialization of the cluster centers and iterates assigning the points to their closest cluster center, to finally recompute the updated cluster centers and repeat the procedure until the centers do not vary anymore.

• Neural Networks: this algorithm family includes one of the most powerful and flexible techniques in machine learning. More about this class is explained in detail in the following subsection.
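Linear regression from the list above admits a closed-form least-squares solution in the single-feature case; a minimal sketch, using synthetic data for illustration:

```python
def linear_regression(xs, ys):
    """Ordinary least squares for a single feature: y ≈ w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Exact linear data y = 2x + 1, so the fit recovers the coefficients exactly
w, b = linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```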
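The k-means iteration described in the list above can be sketched in plain Python (the data and parameter choices are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign points to nearest center, recompute centers, repeat until
    the centers no longer vary."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:      # centers stable: converged
            break
        centers = new_centers
    return centers

centers = kmeans([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], k=2)
```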

2.1.2 Neural Networks

Neural networks, or more precisely artificial neural networks (ANN), are an AI technique that takes inspiration from and mimics the functioning of the human brain. It relies on a modular system composed of interconnected basic computational units called neurons that perform a simple mathematical operation to produce an output given a set of inputs. The outputs of each neuron can either be used as output of the network, or be employed as input for other neurons, and the relationships among neurons are adjusted through weights that, together with the architecture of the network, define a learned model.

Before analyzing possible network architectures and properties, it is necessary to analyze the network's most basic component: the neuron, of which the perceptron is the most common model.


The perceptron computes the weighted sum of its input features (and the bias), and outputs the outcome of the application of an activation function to the result.

In the case of the original definition of this model, the activation function coincides with a simple step function, outputting one if the weighted sum is positive and minus one (or zero) otherwise, being, in other terms, a sign function.

The weights are randomly initialized and adjusted during the training using the following rule after observing each training sample:

1. The input sample is evaluated and the output t′ of the perceptron is computed as t′ = Σᵢ w[i] · x[i], where w[i] is the i-th weight, x[i] is the i-th input and, in particular, w[0] is the weight associated with the bias and x[0] = 1 is the bias input.

2. The error is computed by comparing the actual target value t with t′, obtaining δ = t − t′.

3. Each weight is updated, according to the strength of its corresponding input and proportionally to the error and to a value α called learning rate, that represents the importance given to each sample. In formal terms: ∆w[i] = α · δ · x[i].

Such an architecture and learning method are proven to always converge in the setting in which the decision boundary is linear, even though the training procedure might take an arbitrarily high number of steps.

The learning rate (α) controls the way in which each supervised evaluation of the model impacts the variation of the model itself, and it is a variable that requires an accurate design. As a matter of fact, a low learning rate can imply a really slow convergence of the model as well as a higher probability of getting stuck in local minima (if no techniques are employed to avoid so). On the other hand, a higher learning rate might compromise the convergence of the optimization and generate an oscillating unstable solution.
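The perceptron update rule can be sketched as follows (toy linearly separable data modelled on the AND function, with targets in {−1, +1}; all values are illustrative):

```python
def train_perceptron(samples, epochs=100, alpha=0.1):
    """Perceptron rule: w[i] += alpha * (t - t') * x[i], with x[0] = 1 as bias."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in samples:
            xb = [1.0] + list(x)                     # prepend the bias input
            t_out = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            err = t - t_out                          # delta = t - t'
            for i in range(len(w)):
                w[i] += alpha * err * xb[i]
    return w

# Label +1 only when both inputs are 1 (a linearly separable problem)
data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), 1)]
w = train_perceptron(data)
```

Because the classes are linearly separable, the convergence theorem guarantees that the rule stops making mistakes after a finite number of updates.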

Considering multiple possible activation functions such as the sigmoid function, hyperbolic tangent, linear or rectified linear (ReLU) function (see Figure 2.4), we can extend the concept of perceptron to the more general idea of neuron, as a basic component performing an input-output mapping on the weighted sum of its inputs and of the bias term. Each neuron, through the learning procedure, learns how to evaluate its input stimuli to produce a correct output.

However, a single neural unit cannot distinguish classes that are separated by non-linear decision boundaries, introducing the need to combine multiple units into networks, forming the more complex structure of an ANN.


(a) Step Function (b) Sigmoid Function (c) Hyperbolic Tangent

(d) Linear Function (e) ReLU Function

Figure 2.4: Some examples of activation functions including the most popular choices in neural networks.

The set of neurons N and the set of connections V define the architecture of the network, whose variation can affect the representative power and consequently the capability of the model, while w, the weight function or matrix, is learned through the training procedure from the data. The usage of a NN for prediction or classification purposes requires to also define the loss function L, that establishes how close the output of the network is to the target value and plays a fundamental role in the learning procedure.

Popular loss functions are the mean squared error (MSE) and root mean squared error (RMSE) for regression problems, and the cross-entropy (E_entropy) for k-class classification tasks. On a dataset of n samples, calling t the target value and t′ the output of the network, the mean squared error is defined as MSE = (1/n) Σᵢ (tᵢ − t′ᵢ)², with RMSE = √MSE, while the cross-entropy is E_entropy = −Σᵢ Σₖ tᵢ,ₖ log t′ᵢ,ₖ.
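These loss functions can be written down directly (a minimal sketch; the cross-entropy below covers a single one-hot sample, and all values are illustrative):

```python
import math

def mse(targets, outputs):
    """Mean squared error over a set of samples."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def rmse(targets, outputs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(targets, outputs))

def cross_entropy(target_dist, output_dist):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(o) for t, o in zip(target_dist, output_dist) if t > 0)

loss = mse([1.0, 2.0], [1.0, 4.0])              # (0 + 4) / 2 = 2.0
ce = cross_entropy([0, 1, 0], [0.2, 0.5, 0.3])  # -log(0.5)
```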


When referring to an ANN, the concept of layers plays an important role and corresponds to a set of neural units that share the same distance (in terms of computational steps) from the input and from the output of the model, and are not connected between them by weights.

Each layer, except for the output layer that is trained to learn the final mapping, is employed to transform the output of the previous one into a new, different feature space, being equivalent to consecutive applications of learned feature transformations.

A simple example of artificial neural network is composed of the input layer, a number of perceptron units, and finally the output layer. All the layers that are not a direct expression of either the input or output of the network are referred to as hidden layers. The number of layers composing a network is used to define the concept of depth of the network.

Layers in a neural network can be stacked, in order to learn feature transformations for the input that allow to better approximate the desired target function.

Depending on the architecture of the network, different kinds of features can be extracted, resulting in different tasks for which the neural network becomes more suitable. Examples of different classes of networks are feed forward neural networks, in which the layers are connected without loops and links only appear from neurons of a layer to neurons of the next one, convolutional neural networks and recurrent neural networks, which will be explained in further detail in the following part of this chapter. Feed forward neural networks are the basic structure that a multi-layer neural network can assume and are mathematically interpreted as the consecutive application of learned feature transformations to the input, thus being equivalent to the application of a composite function.

Each layer is equivalent to the application of an affine transformation to its input, in which the non-linearity is introduced by the activation function employed in the neurons.
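This composite-function view can be made concrete with a two-layer forward pass (the weights below are arbitrary illustrative values, not a trained model):

```python
import math

def layer(inputs, weights, biases, activation):
    """One layer: an affine transformation followed by the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
identity = lambda z: z

def forward(x):
    # Hidden layer: two sigmoid neurons; output layer: one linear neuron
    h = layer(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], sigmoid)
    return layer(h, [[1.0, 1.0]], [0.0], identity)

y = forward([1.0, 1.0])
```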

While in the case of the perceptron architecture the link between each weight and the output of the network is direct, when dealing with a layered structure this is not true, and thus a new learning algorithm is required to adjust the weights that define the model learned by the network.

In the case of feed forward neural networks, the learning is converted into a gradient-based optimization problem and the technique employed to compute the partial derivatives of the loss with respect to the different weights of the network is called back-propagation. Such a technique, which relies on the chain rule of derivatives, is explained in [36] and is used to extract the expression to update each weight according to the total error of the network.


for a generic parameter θ, applying a learning rate of α:

θ = θ − α · ∂L(θ)/∂θ .    (2.4)
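Equation 2.4 applied to a one-dimensional example (the loss function and step size are arbitrary illustrative choices):

```python
def gradient_step(theta, grad_fn, alpha=0.1):
    """One update of Equation 2.4: theta = theta - alpha * dL/dtheta."""
    return theta - alpha * grad_fn(theta)

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = 0.0
for _ in range(100):
    theta = gradient_step(theta, lambda t: 2 * (t - 3))
```

Each step contracts the distance to the minimum by a factor 1 − 2α, so the iterate converges geometrically to θ = 3.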

As all machine learning techniques, neural networks can also be affected by the problem of overfitting and thus require some regularization techniques in order to cope with it.

The most popular techniques are weight decay, adding to the loss function a component that penalizes high values of the parameters, dropout, that randomly deactivates some neurons during the training phase, or early stopping, whose basic principle is to continuously validate the network on unseen data and to stop the training as soon as the validation error starts increasing.
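The early stopping criterion, for instance, can be sketched as follows (the validation-error sequence and patience value are hypothetical, for illustration only):

```python
def early_stopping(val_errors, patience=1):
    """Return the epoch to roll back to: the best epoch, once the validation
    error has not improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch            # stop: no improvement for `patience` epochs
    return best_epoch

# Validation error decreases, then starts increasing at epoch 3
stop = early_stopping([0.9, 0.5, 0.4, 0.45, 0.6], patience=2)
```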

Recurrent Neural Networks

Recurrent Neural Networks are a strict subset of artificial neural networks, disposing of memory cells and where the neurons are connected with directed cycles. They were firstly introduced by Hopfield in 1982 [38], and representative of this class are Elman networks [39] and Jordan networks [40].

The possibility to include stored information in the computation makes these architectures particularly suitable to be applied on multiple inputs to be interpreted as a sequence.

These kinds of architectures are generally trained through a technique that is considered an adaptation of standard backpropagation and, because of that, is called backpropagation through time [41].

This kind of network has shown empirical evidence to be particularly suitable for time series forecasting thanks to the ability, deriving from its learning technique, to extend the input-output relationship to the whole sequence. The main limitation of this model is that when the network is unfolded for a large number of steps, the gradient could tend to be amplified or suppressed, causing respectively the exploding or vanishing gradient phenomena.

LSTM (Long short term memory) is a recurrent architecture that, using the words of its inventors, reported in Section 1 of [42]:

“is designed to overcome these error back-flow problems”.

As a matter of fact, this model solves the problems of RNNs and other proposed solutions applied to learn time structures "by enforcing constant error flows through constant error carousels within special units called cells", as explained in the abstract of the doctoral thesis referenced at [43].


Figure 2.5: LSTM cell: detailed schematic of a Long Short-Term Memory block as used in the hidden layers of a recurrent neural network. (Picture and caption from [45]).

An illustrative image of the internal architecture of an LSTM cell is presented in Figure 2.5.

Empirical evaluation has shown that LSTMs outperform traditional Recurrent Neural Networks in terms of capability to learn long term dependencies.

Gated recurrent units (GRUs) are an alternative implementation of an architecture inspired by LSTM cells. As a matter of fact, they also employ learned gates to store information about the sequences, but present a slightly simpler architecture with respect to LSTM. Different versions are available, they have been firstly introduced by [46], and they proved to compete with LSTM both in terms of performances and training time, varying in relative comparison depending on the specific application.

Auto-encoders

Auto-encoders are special kinds of artificial neural networks that employ a supervised learning algorithm for unsupervised tasks.

As a matter of fact, as in all other artificial neural networks, the learning of the weights is based on the error between the target and the output of the network. However, in the case of auto-encoders, the target of the network corresponds to the input of the network itself, making them an unsupervised technique. Formally speaking, auto-encoders are artificial neural networks trained in a setting in which the target function corresponds to the identity function (t(x) = x).


The training minimizes the reconstruction error. In order to achieve dimensionality reduction, after the auto-encoder has been trained, the output of one of the hidden layers corresponding to a given input is used as its embedded representation, relying on the fact that, feeding this lower dimensional vector to the following layers, it is possible to build the input back to its original space.

Multiple variations exist to adapt auto-encoders to different problems, among which we mention the more commonly used alternatives:

• Denoising auto-encoders: apply a partial corruption to the input and try to learn the mapping to the clean input (without the introduced noise).

• Sparse auto-encoders: in this variation, the number of hidden units is actually higher than the dimension of the input, but only a few of them are allowed to be simultaneously active. The sparsity of representation is generally forced either by introducing additional terms in the loss function or by pruning all the units but the k ones with the strongest activation.

• Variational auto-encoders: employ the variational approach for the learning of the latent representation thus, making strong assumptions on the distribution of the output of the compression.

Auto-encoders can in principle be composed of any network architecture, having as the only constraint that the input and output layers are equivalent in terms of dimension. The term recurrent auto-encoder is employed to define an auto-encoder that is composed of a recurrent neural network, and their main practical application is the reconstruction of sequences.

The reconstruction error for an auto-encoder is defined as the loss computed over a given input.

2.1.3 Recurrent Auto-encoders for Anomaly Detection

The analysis of the problem and the study of the previous work in the field of anomaly detection have outlined two main fundamental considerations that have acted as drivers in the identification of which machine learning algorithm to apply in the research.

The first consideration, related to the intrinsic temporal characterization of the problem in analysis that, from its very definition, relates to the detection of anomalous behaviours in multivariate sequences of data points, has oriented the author of this work towards the application of a recurrent neural network. The choice is backed up by the fact that, as explained in Chapter 2, RNNs, in particular employing LSTM or GRU cells, form the state of the art for time series prediction and classification, proving to outperform other models in most of the available benchmarks.


The second consideration relates to the absence of labelled data, forcing to only look at the available unsupervised techniques.

Among the most important techniques for unsupervised anomaly detection common choices are clustering, self-organizing maps, single class support vector machines, latent feature models and auto-encoders.

Within the available options, the last has been selected for its more natural integration with the requirement coming from the temporal characterization of the problem in analysis. As a matter of fact, as explained in Chapter 2, auto-encoders can be arbitrary neural networks and, therefore, they can also be built as a recurrent architecture.

Given the above mentioned observations, the model family has been identified in the group of recurrent auto-encoders. For the clarity of the reader, a brief recall about recurrent auto-encoders and their extension to anomaly detection is reported in the following paragraph.

A recurrent auto-encoder is a neural network employing recurrent layers in order to learn a mapping between an input sequence and the sequence itself.

Recurrent auto-encoders can be applied to unsupervised (or semi-supervised) anomaly detection problems by using the reconstruction error on a given sequence as a direct indicator of the level of anomaly.

As a matter of fact, the reconstruction error is the difference between the expected evolution of the time series and the real recorded behaviour. Since the expected behaviour is estimated according to the patterns learned at training time by the model from normal sequences, a high reconstruction error signals a marked distinction between the patterns present in the analyzed sequence and the ones that characterize normal data, and thus it can be directly used as a proxy of the anomaly score of the sequence.
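The scoring scheme can be sketched as follows. Here `reconstruct` merely stands in for a trained recurrent auto-encoder (replaced by a toy model that only reproduces flat sequences well); the sequences and the threshold are illustrative, not taken from the experiments:

```python
def anomaly_scores(sequences, reconstruct, threshold):
    """Score each sequence by its reconstruction error (mean squared error
    between the sequence and the model output) and flag anomalies."""
    flagged = []
    for seq in sequences:
        rec = reconstruct(seq)
        score = sum((a - b) ** 2 for a, b in zip(seq, rec)) / len(seq)
        flagged.append((score, score > threshold))
    return flagged

# Stand-in for a trained auto-encoder: reproduces the sequence mean
reconstruct = lambda seq: [sum(seq) / len(seq)] * len(seq)

results = anomaly_scores([[1.0, 1.0, 1.0], [0.0, 5.0, 0.0]],
                         reconstruct, threshold=1.0)
```

The flat sequence reconstructs perfectly (score 0), while the spiky one yields a large error and is flagged as anomalous.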

2.1.4 Architecture Definition using Genetic Algorithms

A common trait among almost all machine learning techniques is that they are characterized by a certain set λ of hyper-parameters, that are not directly learned during the training procedure and whose variation can affect in an important way the outcome of the obtained model [50]. In order to cope with this issue, in the following work a genetic algorithm is employed, which consists in a data-driven optimization technique that can be applied to the selection of the hyper-parameters or to the generation of a set of candidate models, as previously done in literature [51][52].

Genetic algorithms (GAs) belong to the family of evolutionary techniques for optimization and are inspired by Darwin's theory of evolution and natural selection [53]. As a matter of fact, such techniques mimic the biological process in which stronger individuals are likely to be dominant in a competing environment and thus to carry on their characteristics, through their offspring, for multiple generations of a given population.


The topic is covered by an extensive body of literature [55][56][57][58], together with several reports and practical examples and implementations.

The optimization heuristic is based on the five fundamental concepts of gene, fitness function, selection, crossover and mutation.

In the context of GAs, a gene is the representation of a possible solution, as a numerical embedding of its characteristics, often also referred to as individual. This representation should contain all the features that characterize the solution of the problem for which the optimization routine is being used. The genes, that are formed by a set of chromosomes or loci defining different characteristics, can assume several structures to represent the solution, among which we recall:

• Binary representation: the gene is represented as an array of bits that can only assume value of 0 or 1.

• Real value representation: each of the loci of the gene is represented by a floating point (or integer) value as an approximation of the real value to be chosen.

• Discrete representation: each of the chromosomes is a discrete variable that can only assume some given values.

• Hybrid representation: it is also possible to employ more than a single kind of chromosome in the representation of the gene, embedding some characteristics of the solution into real valued loci and some others in a binary or discrete representation.

A set of genes is defined as a population, that evolves generation by generation with the mechanism described later, and we will hereafter refer with the term offspring to the result of a recombination applied to a couple of individuals, in similarity with the biological phenomenon of reproduction.

Furthermore, a fitness function must be defined as a mapping between a gene and its "goodness" [59], in order to rank genes according to the quality of the solution that they represent. Depending on the specific problem for which the routine is employed, it is possible to define different types of fitness functions, for example defining a function favouring low-cost solutions or attributing higher values to genes proving better performances.

The definition of the genetic representation and of the fitness function represents the basis to employ a genetic approach for the optimization of a problem. They are both extremely domain specific tasks, since they require knowledge both of the fundamental variables over which it is desired to run the optimization procedure and of valid performance indicators for the given setting.

Once the setup of the problem has been identified and both the genetic embedding and the fitness function are well defined, the evolutionary components of the algorithm have to be designed.


The selection determines which individuals of a population participate in the generation of the offspring and are thus propagated in the new population. Therefore, the selection basically consists in a function that extracts from a given population the individuals that, according to their fitness level, survive the competitive environment and participate in the recombination. As common in all search problems, genetic algorithms also need to cope with the trade-off between exploration and exploitation [60], meaning that they need to be adjusted in order to be able to grant at the same time enough probing of the solution space and an efficient utilization of the information about good and bad solutions. This issue, that is explored in literature work such as [61], [62], must be taken into account in the design of the evolutionary parameters of the algorithm, with particular respect to the selection, crossover and mutation strategy.

Given that the selection should identify k individuals out of a population of size n for the generation of the offspring, possible choices for the selection routine are:

• Greedy selection: the top k fittest individuals are selected to produce the offspring.

• Proportionate roulette wheel selection: first introduced by Holland in [54], is a simple paradigm for which k extractions are performed in which every gene has a probability to be selected proportional to its fitness value.

• Exponential selection: follows the same principle as the previous mechanism, but the selection probabilities for the different genes are not linearly related to the fitness score; they are exponentially weighted instead. It basically consists in applying a softmax function to the fitness values and performing a random extraction of k elements according to the resulting probabilities.

• Tournament selection: it consists of randomly extracting t < n individuals from the population and selecting the best among them. t is called tournament size and is commonly set to 2, a case referred to as binary tournament selection. The tournament is repeated until k winners are extracted and can also be structured in multiple iterative phases.

A more detailed explanation of the above mentioned techniques can be found in the work referenced at [63].
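Tournament selection, for instance, can be sketched in a few lines (integer genes with an identity fitness function, chosen purely for illustration):

```python
import random

def tournament_selection(population, fitness, k, t=2, seed=0):
    """Binary tournament (t=2): repeatedly draw t random individuals and
    keep the fittest, until k winners are extracted."""
    rng = random.Random(seed)
    winners = []
    for _ in range(k):
        contestants = rng.sample(population, t)
        winners.append(max(contestants, key=fitness))
    return winners

pop = list(range(10))                  # genes encoded as integers
chosen = tournament_selection(pop, fitness=lambda g: g, k=5)
```

Each winner is the fitter of its random pair, so the selection pressure stays mild while low-fitness genes retain a non-zero chance of surviving.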

The next fundamental concept to be defined is the crossover function, employed to determine how to combine two selected individuals and produce their offspring. Possible cases of crossover functions are reported below, following their explanation provided in literature [64].


• Single point crossover: two genes a and b are recombined, producing as offspring the genes c and d, by extracting a random integer t ≤ l and making sure that for each index i ≤ l, if i < t, c[i] = a[i] and d[i] = b[i], while if i ≥ t, c[i] = b[i] and d[i] = a[i].

• Double point crossover: two genes a and b are recombined, producing as offspring the genes c and d, by extracting two random integers t ≤ y ≤ l and making sure that for each chromosome i ≤ l, if i < t or i ≥ y, c[i] = a[i] and d[i] = b[i], while if i ≥ t and i < y, c[i] = b[i] and d[i] = a[i].

• Uniform crossover: two genes a and b produce as offspring c and d where, for each locus i ≤ l, either c[i] = a[i] and d[i] = b[i] or c[i] = b[i] and d[i] = a[i], with the offspring inheriting from either parent with uniform probability.

• Flat crossover: also known as linear recombination crossover, it is a method that produces the offspring c and d as linear combinations of the vectors a and b.
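The single and double point variants can be sketched as follows (binary genes and cut points chosen for illustration):

```python
def single_point_crossover(a, b, t):
    """Swap the tails of two genes after position t."""
    return a[:t] + b[t:], b[:t] + a[t:]

def double_point_crossover(a, b, t, y):
    """Swap the middle segment [t, y) of two genes."""
    return a[:t] + b[t:y] + a[y:], b[:t] + a[t:y] + b[y:]

c, d = single_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], t=2)
e, f = double_point_crossover([0, 0, 0, 0], [1, 1, 1, 1], t=1, y=3)
```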

The crossover function depends highly on the representation of the solution and it is important to take this into account during its design. As a matter of fact, not all the possible crossover functions are suitable for any genetic encoding, as in the case of a linear recombination crossover that cannot be coupled with a representation of the gene as a discrete array.

Finally, we define the concept of mutation as a function that applies a probabilistic alteration to one or more loci of a gene. The design of this function and of its operational parameters also has a high impact on the trade-off between exploitation and exploration mentioned above. As a matter of fact, by increasing the mutation probability we introduce a bias in favour of a more explorative approach while, on the opposite, by setting a low mutation likelihood the method tends to be less randomized. Also in the case of the mutation function, it is required to take into account the genetic representation of the solution, in order to make sure that the mutation applies meaningful alterations to the solution.
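A per-locus mutation over a discrete alphabet can be sketched as below; the function name, the probability p and the alphabet are illustrative assumptions.

```python
import random

def mutate(gene, p=0.01, alphabet=(0, 1)):
    """Replace each locus with a random symbol of the alphabet,
    independently and with probability p."""
    return [random.choice(alphabet) if random.random() < p else locus
            for locus in gene]

g = [0] * 10
mutated = mutate(g, p=0.5)
```

Here p directly controls the exploration bias discussed above: p = 0 leaves the gene untouched, while p close to 1 approaches a random restart of the individual.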

Clearly, as in every search strategy, it is also necessary to define a criterion to establish when a solution has been found, or when the computation has to terminate anyway, retrieving the optimal (or sub-optimal) individual that has been identified during the execution.


Algorithm 1 Genetic algorithm - skeleton

 1: procedure EvolutionaryProcedure
 2:     population ← PopulationGenerator(string)
 3:     loop:
 4:         fitness ← FitnessFunction(string)
 5:         population ← Crossover(population)
 6:         population ← Mutation(population)
 7:         if Terminate(population) == false then
 8:             population ← Selection(population)
 9:             goto loop.
10:     close;
11:     return population.
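To make the skeleton concrete, a minimal end-to-end GA on the classic OneMax problem (maximize the number of ones in a bit string) can be written as follows. This is a self-contained sketch, not the thesis implementation: all parameter values are illustrative, and crossover, mutation and selection are the simple variants described earlier in this section.

```python
import random

def run_ga(fitness, length=20, pop_size=30, generations=60,
           mutation_rate=0.02, seed=0):
    """Minimal GA following the skeleton: generate a population, then
    iterate crossover -> mutation -> selection until termination."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Crossover: random parent pairs, single point recombination.
        offspring = []
        for _ in range(pop_size // 2):
            a, b = rng.sample(population, 2)
            t = rng.randrange(1, length)
            offspring += [a[:t] + b[t:], b[:t] + a[t:]]
        # Mutation: independent per-locus bit flip.
        offspring = [[1 - x if rng.random() < mutation_rate else x
                      for x in g] for g in offspring]
        # Selection: binary tournaments over parents and offspring.
        pool = population + offspring
        population = [max(rng.sample(pool, 2), key=fitness)
                      for _ in range(pop_size)]
    return max(population, key=fitness)

best = run_ga(fitness=sum)  # OneMax: fitness is the number of ones
```

Even this naive configuration reliably pushes the population towards the all-ones string, which is what makes a GA a reasonable exploratory search tool when no analytical guidance on the solution space is available.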

For this reason, in the context of this research, a GA has been selected for the generation of a set of models to be evaluated, rather than with the expectation of automatically selecting the best possible performing model.

When considering the approach analyzed in this work, no architecture is established in the literature as the dominating method, leaving wide space for the definition of potentially infinite models, even just remaining within the family of recurrent auto-encoders, and justifying the need for an exploratory search technique to obtain acceptable guesses about possibly good alternatives.

2.2 Anomaly Detection

The work presented in this report is located in the broad field of study that is generally referred to as anomaly detection. Some basic concepts for this field have already been defined in the introductory chapter, such as the definition of anomaly, the basic classification of anomalies and a brief outline of the methods for anomaly detection. This section will be focused on a deeper overview of the concepts and terminology in anomaly detection, together with a discussion of the related works.

2.2.1 Concepts for Anomaly Detection

Recalling the introduction presented in Section 1.1, an anomaly detection problem can generally be formulated as the assessment of whether a data instance, or a set of data instances, fits with what is considered to be the normal distribution of the data.

For this reason, the problem often lies in the definition of what the normal behaviour is, for which multiple approaches exist, generally divided into two categories: expert systems and data-driven approaches.


The former category is formed by techniques designed through the study of the specific properties of a given domain, in order to assess the characteristic features that identify correct or incorrect instances.

An example of this kind is a fire alarm set to detect anomalous concentrations of smoke by comparing the measured level with a threshold defined by the designer, given their knowledge of the application environment. On the other hand, the latter group is formed by techniques that model the normal distribution of the data starting from a sample drawn from it, employing different methods to extract characteristic patterns from representative examples.

The work presented in this master thesis is located within the second group which, overall, is gaining popularity thanks to the high availability of data, the increased performance of predictive modelling, and the high level of complexity of many real-world applications, which makes it hard to design a handcrafted model. For this reason, the discussion of the related work will be limited to available techniques and methods belonging to the second category.

As already mentioned in the introductory chapter, [8] proposes a classification of state-of-the-art anomaly detection techniques, according to their operating principles, into five major classes discussed in the following paragraphs. The attention is mainly on the first three groups, which can be identified as machine learning based techniques, for their closer connection with the scope of this project.

Classification Based Techniques

Classification based techniques assume that it is possible to learn a decision boundary that differentiates the normal and anomalous classes, employing different classification techniques. A common approach is based on the application of support vector machines to the one-class classification problem [66], in order to learn the boundaries within which the normal data instances are contained. Examples of different variants of this method can be found in [67], [68] and [69].

Neural networks can also be employed for anomaly detection, both as standard multi-class classifiers and in the single-class setting. Different architectures have been used, such as multi-layer perceptrons [70], radial basis networks [71] or replicator neural networks [72], which employ the reconstruction error to assign the anomaly score, similarly to the technique that has been employed in this work. The latter concept relies on reducing the problem to a regression problem, in which the network is asked to replicate the given input. The closer the replication is to the real input value, the more likely it is that the analyzed input resembles the ones used for the training of the network. More about this paradigm is explained in Section 2.1.3.
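The reconstruction-error idea can be demonstrated without a neural network at all. The sketch below, which is only a toy stand-in for a replicator network, "reconstructs" points by projecting them onto the leading principal direction of the training data (estimated with plain power iteration) and uses the squared reconstruction error as the anomaly score: points resembling the training distribution reconstruct well, off-distribution points do not. All names and the synthetic data are illustrative assumptions.

```python
def reconstruction_scores(train, test, steps=50):
    """Score test points by the squared error of reconstructing them
    from their projection onto the top principal direction of train."""
    n, d = len(train), len(train[0])
    mean = [sum(x[j] for x in train) / n for j in range(d)]
    centred = [[x[j] - mean[j] for j in range(d)] for x in train]
    # Power iteration on the covariance: v <- C v / ||C v||.
    v = [1.0] * d
    for _ in range(steps):
        cv = [0.0] * d
        for x in centred:
            s = sum(xi * vi for xi, vi in zip(x, v))
            for j in range(d):
                cv[j] += s * x[j]
        norm = sum(c * c for c in cv) ** 0.5
        v = [c / norm for c in cv]
    scores = []
    for x in test:
        xc = [x[j] - mean[j] for j in range(d)]
        s = sum(xi * vi for xi, vi in zip(xc, v))
        recon = [s * vi for vi in v]
        scores.append(sum((a - b) ** 2 for a, b in zip(xc, recon)))
    return scores

# Normal data lies near the line y = x; the off-line point scores higher.
train = [(t, t + 0.01 * ((-1) ** i)) for i, t in enumerate(range(10))]
normal_score, anomaly_score = reconstruction_scores(
    train, [(4.0, 4.0), (4.0, -4.0)])
```

A trained auto-encoder plays the same role as the linear projection here, but with a learned non-linear encoding, which is what makes the paradigm applicable to high-dimensional sequence data.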


Nearest Neighbor Based Techniques

Another interesting approach to the AD problem is based on exploiting the information related to the neighborhood of an instance in order to assess its degree of normality. The methods belonging to this class, as also recalled by [8], require a distance to be defined which, although it does not need to be a metric nor to satisfy the triangle inequality, is generally required to be positive-definite and symmetric. The distance is employed to compare data points when computing the anomaly score.

A first approach, followed by [76], is to employ the distance from the kth nearest neighbor as anomaly score, or to compute the cumulative distance from the k nearest neighbours [77][78][79].
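Both scoring rules fit in a single function; the following is a minimal illustrative sketch (brute-force search, Euclidean distance, hypothetical names and data).

```python
import math

def knn_anomaly_score(x, data, k=3, cumulative=False):
    """Score a point by its distance to the k-th nearest neighbour,
    or by the cumulative distance to its k nearest neighbours."""
    dists = sorted(math.dist(x, p) for p in data)
    return sum(dists[:k]) if cumulative else dists[k - 1]

normal = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
outlier_score = knn_anomaly_score((5.0, 5.0), normal, k=2)
inlier_score = knn_anomaly_score((0.05, 0.05), normal, k=2)
```

An outlier far from the reference sample receives a much larger score than a point inside it; in practice a threshold on this score separates normal from anomalous instances.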

Alternatively, other methods employ metrics that measure the density of the neighborhood, such as the local outlier factor [80][81], the connectivity-based outlier factor [82] or the multi-granularity deviation factor [83], adapting to different geometrical properties of the data under consideration.

Of particular interest is the work of [84], where an anomaly detection method based on neighborhood analysis is proposed, employing a metric that adapts to a mixture of categorical and continuous data.

Clustering Based Techniques

Clustering based techniques, differently from the approaches described in the previous paragraph, evaluate each data instance according to the cluster it belongs to, rather than to its local neighborhood. Thus, they rely on both the selected distance metric and the specific clustering technique employed.

The simplest approach is based on applying standard clustering to the data and computing the anomaly score of each instance as the distance from its closest centroid [85].
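Given centroids obtained from any standard clustering of the normal data, the scoring step reduces to a nearest-centroid distance; the snippet below is an illustrative sketch with assumed centroids rather than an actual clustering run.

```python
import math

def centroid_anomaly_score(x, centroids):
    """Anomaly score of x as its distance to the closest cluster centroid,
    for centroids obtained from clustering the normal data."""
    return min(math.dist(x, c) for c in centroids)

# Hypothetical centroids, e.g. the output of k-means on normal data.
centroids = [(0.0, 0.0), (10.0, 10.0)]
score_near = centroid_anomaly_score((0.2, 0.1), centroids)
score_far = centroid_anomaly_score((5.0, 5.0), centroids)
```

Unlike the neighborhood-based scores, this rule depends only on the cluster summaries, so scoring new instances is cheap once the clustering has been computed.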

It is also possible to classify each cluster as normal or anomalous according to its sparsity and size, and to assign to each data instance the label of the cluster it belongs to [86][87][88][89].

If a clustering technique that does not necessarily assign all the points to a cluster is used, such as DBSCAN [90], points that are not assigned to any cluster can be directly defined as anomalous instances.

Statistical Based Techniques
