DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Anomaly Detection in Streaming Data from a Sensor Network

EGILL VIGNISSON


Degree Projects in Mathematical Statistics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, year 2019

Supervisors at Scania: Paola Maggino, Nikhil Thakrar
Supervisor at KTH: Jimmy Olsson

TRITA-SCI-GRU 2019:319
MAT-E 2019:76

KTH Royal Institute of Technology
School of Engineering Sciences

Abstract

In this thesis, unsupervised and semi-supervised machine learning techniques are analyzed as potential tools for anomaly detection in Scania truck sensor networks. The thesis investigates the need for both point and contextual anomaly detection in this setting. For point anomaly detection the method of Isolation forest was applied, and for contextual anomaly detection two different recurrent neural network architectures using Long Short-Term Memory units were used. One model was simply a many-to-one regression model trained to predict a certain signal, while the other was an encoder-decoder network trained to reconstruct a sequence. Both models were trained in a semi-supervised manner, i.e. on data that only depicts normal behaviour, which theoretically should lead to a performance drop on abnormal sequences, resulting in higher error terms. In both settings the parameters of a Gaussian distribution were estimated using these error terms, which allowed for a convenient way of defining a threshold deciding whether an observation should be flagged as anomalous or not. Additional experiments using an exponential weighted moving average over a number of past observations to filter the signal were also conducted. The methods' performance on this particular task differed greatly, but the regression model showed a lot of promise, especially when combined with a filtering preprocessing step to reduce the noise in the data. However, model selection will always be governed by the nature of the particular task at hand, so the other methods might perform better in other settings.


Anomaly Detection in Streaming Data from Sensor Networks

Summary

In this thesis, the use of unsupervised and semi-supervised machine learning is analyzed as a potential tool for detecting anomalies in the sensor network of Scania trucks. The study analyses the need for detecting both point anomalies and contextual anomalies in this environment. For point anomalies the method of Isolation forest was used, and for contextual anomalies two recurrent neural network architectures were used. One of the models was simply a many-to-one regression model trained to predict a certain signal, while the other was an encoder-decoder network trained to reconstruct a sequence. Both models were trained in a semi-supervised manner, i.e. on data that only exhibited normal behaviour, which theoretically should lead to reduced performance on abnormal sequences and thus larger error terms. In both cases the parameters of a Gaussian distribution were estimated from these error terms, which gave a convenient way of defining a threshold for whether an observation should be flagged as an anomaly or not. Furthermore, an exponentially weighted moving average over a certain number of past observations was applied to filter the signal. The models showed varied performance on this task; however, the regression model showed a lot of promise, especially when combined with a filtering preprocessing step to reduce the noise in the data. Still, the choice of model will always be governed by the nature of the task, and other methods could give better performance in other settings.


Contents

1 Introduction
1.1 Technical Background
1.2 Problem Description
1.3 Research Questions
1.4 The Data
1.5 Outline

2 Theory
2.1 An Anomaly
2.2 Sensor Networks
2.3 Machine Learning
2.4 Dimensionality Reduction
2.5 Methods for Point Anomaly Detection
2.6 Contextual Anomaly Detection
2.7 Artificial Neural Networks

3 Method
3.1 Data acquisition
3.2 Data exploration
3.3 Data preprocessing
3.4 Anomaly Scoring
3.5 Architecture Implementation
3.6 Training procedure
3.7 Model Evaluation

4 Results
4.1 Scania data
4.2 Synthesized data

5 Conclusion

6 Discussion, Delimitation and Future work

1 Introduction

Technological development in the transportation sector has been dominated by solutions that push for further electrification and autonomy. This development has resulted in an increasing level of complexity in the tasks these solutions are capable of carrying out and in the overall communication between electronic components. Society's call for sustainable solutions that run on renewable energy sources will only accelerate this development even further. The purpose of this thesis is to develop and compare different methods, based on machine learning and data analytics, that seek to simplify the process of detecting faults in the electronic system in real time during the vehicle's testing procedure, with special focus given to semi-supervised and unsupervised methods.

1.1 Technical Background

As this project works within topics concerning the automotive industry, a brief technical background is needed in order to get an understanding of this setting and the intricacies of a vehicle's electrical system.

1.1.1 Electronic Control Units

An electronic control unit (ECU) is an electronic device meant to control a certain process, which might include the conversion of signals from sensors, data transmission and diagnostic support (Controller Area Network (CAN) Overview 2019). Therefore a positive relationship exists between the number of ECUs and the number of sensors and actuators that relay information.

1.1.2 Controller Area Network

A Controller Area Network (CAN) is a message-based bus over which the ECUs communicate. Messages are broadcast on the network, and each ECU can ignore a message if it is not needed for that ECU's function (Controller Area Network (CAN) Overview 2019). Furthermore, all communication on the network is prioritized by ID, meaning that some signals are of higher importance than others. This can result in varying time intervals between signals, especially those of low importance, which can lead to irregularity in the sequences, a well known problem in time series analysis (Beygelzimer et al. 2005).

1.2 Problem Description

This trend towards electrification is very visible in Scania's products: over a period of a few years, the number of Electronic Control Units (ECUs) needed in their vehicles has increased dramatically as more and more sensors and other electric components that require control have been added. Naturally, this increase in the complexity of the vehicle's electronic system has resulted in a large increase in the communication and data that flow in the vehicle network. This in turn makes the process of ensuring that all components are serving their intended purpose significantly more tedious and resource demanding. Identifying that something is not functioning as it should is, however, only one part of the problem; identifying exactly which component is causing the problem often requires quite extensive data analysis and testing. An improved ability to catch deviations in the working conditions of electronic components in vehicles is therefore imperative.

During the vehicle's testing procedure it is the driver's responsibility to mark down any experienced abnormal behaviour. This is accomplished by pressing a button, which creates a timestamp that can be used as a reference for the post-testing analytics, where the test engineers and developers seek to identify the root cause of the problem. This high dependency on the test driver is an aspect that Scania wishes to remove from its testing procedure and replace with a solution capable of carrying out real time analysis on the data obtained from the ECUs, which will alert the driver of potential problems and establish a basis for further analysis by marking them down in the time series, rather than relying on the faults being noticed by the driver.

This thesis looks to investigate the use of advanced analytics and predictive modelling as potential tools to combat this problem, with the objective of creating an efficient testing procedure for the vehicles.

1.3 Research Questions

By dissecting the problem description certain criteria that the model needs to meet in order to function in this environment can be identified. This description therefore lays the ground for the different questions that will be addressed in this thesis project, i.e.

• What modelling options are able to detect anomalous behaviour?

• What approaches can be used to combat the detection of false positives?

• What is the models' potential to generalize to different applications?

• What model performs best overall?

1.4 The Data

The data set relied on during this project was composed of logged CAN signals acquired from a test vehicle during a two week period. The logs are made up both of sequences that are meant to represent normal behavior and of logs where certain faults had been injected by making mechanical or electronic adjustments to the vehicles. The data set consists of approximately 1500 separate signals. Due to confidentiality reasons no true variable names will be included, and a clear description of the process of injecting the faults will not be given in this thesis, as it relates to other research projects being conducted at Scania. Each variable will therefore only be referred to as $S_i$, where i is some positive integer.

1.5 Outline

2 Theory

This section lays the necessary theoretical foundation which includes definitions of the different types of anomalous behaviour followed by short descriptions of the different areas of machine learning that were used in this project.

2.1 An Anomaly

An important aspect to take into consideration when it comes to anomaly detection is that there are certain characteristics that make a data sample anomalous. These characteristics can be grouped as point, contextual and collective anomalies, where each group has been researched significantly (Chandola et al. 2009). The different methods developed for anomaly detection were developed with specific groups in mind, and therefore it is crucial to take that into account when choosing and comparing methods.

Point anomaly

The simplest form of an anomaly is the point anomaly. This group has the longest research history and its definition is very close to Hawkins' definition of an outlier, i.e. that it is an observation that deviates so significantly from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins 1980). This kind of anomaly detection is experimented with in this project, but the temporal dependencies in the data make it an unappealing option.

Contextual anomaly

An individual observation can show no anomalous characteristics, but given the surrounding structure or context in which it appears, it has to be considered anomalous. Two sets of attributes define each observation, i.e. contextual attributes, which define the context of the observation (such as time), and behavioural attributes, which define its non-contextual characteristics (Chandola et al. 2009).

Collective anomaly

Similar to contextual anomalies, collective anomalies refer to observations that individually are not anomalous. The collection of these observations however is anomalous with respect to the entire data set (Chandola et al. 2009). This set of anomalies will not be considered individually in this project.

2.2 Sensor Networks

Anomaly detection in data generated by a sensor network is a challenging endeavour. The faults can stem from different sources, such as faulty sensors, as well as from abnormal events, e.g. intrusion attacks (Chandola et al. 2009). High levels of noise in the data also tend to add to the complexity, as they make the distinction between noise and genuine abnormality hard to draw, leading to a high rate of false positives. In this setting the data is generated by a sensor network, resulting in a multivariate time series made up of 1415 signals. However, the majority of the signals have a lot of missing samples as well as other quality issues, most likely due to limitations in the communication capabilities of the ECUs that collect and transmit these signals. Limitations like these called for extensive preprocessing before any modelling could be seriously considered, and for methods that allow us to detect and uncover patterns in data; that is where we enter the world of statistics and machine learning.

2.3 Machine Learning

Machine learning (ML) offers a set of methods with a probabilistic foundation intended for the purpose of automated data analysis and pattern recognition (Murphy 2012, p 1-2). Machine learning methods are commonly divided into two categories, i.e. supervised and unsupervised learning. In the context of this thesis it is fitting to include a third category, namely semi-supervised learning.

2.3.1 Supervised Learning

Supervised learning is the task of learning a mapping from an input x, made up of a number of features, to a response variable y, also known as a label, by feeding the learning algorithm a training set, or more specifically a number of input-output pairs $\{(x_i, y_i)\}_{i=1}^{N}$ (Murphy 2012, p 3). Here, the output $y_i$ can be either continuous or categorical, and based on that the specific problem at hand is referred to as a regression problem in the continuous case and a classification problem in the categorical case. The dependency on labeled data can be very limiting in certain areas, and it certainly is in the context of anomaly detection in time series, where the amounts of data are often enormous but labeled states are scarce or nonexistent in most cases (Chandola et al. 2009). Therefore no fully supervised learning methods were analyzed in this project.

2.3.2 Unsupervised Learning

Unsupervised learning is completely free from the dependency on labeled data described in the supervised setting. Here the only data fed into the algorithms is the input $\{x_i\}_{i=1}^{N}$, which is manipulated in different ways to extract new knowledge. What makes this setting challenging is that it is often more troublesome to define a problem in an unsupervised manner, as it lacks concrete ways of evaluating the results, as opposed to prediction accuracy in the supervised setting (Murphy 2012, p 9-10). Here one seeks to construct models of the form $p(x_i|\theta)$.

2.3.3 Semi-supervised Learning

Semi-supervised learning is a different paradigm that is harder to define. The most classical representation of semi-supervised learning is when the power of supervised and unsupervised learning methods is combined into a single model. However, it is also argued that the setting where models are used to estimate the density of data, under the assumption that the underlying data generation is of a certain nature, such as depicting normal working conditions of the sensors in the network, should be considered semi-supervised (Chandola et al. 2009) (Chalapathy & Chawla 2019). Here the data is not annotated in any way, but one could argue that making that assumption and using the data as a representation of normal behaviour is itself a form of annotation. This setting was prioritized in this project, as data depicting normal behaviour was available.

2.4 Dimensionality Reduction

An area that is important to keep in mind when working with any kind of statistical modelling is how the models handle increases in dimensionality. A phenomenon known as the curse of dimensionality is a well established problem when working with sequential data (Verleysen & François 2005). In this project a procedure known as Principal Component Analysis was applied to reduce the dimensionality.

2.4.1 Principal Component Analysis

Hotelling (1933) described Principal Component Analysis (PCA) as the orthogonal projection of the data into a linear lower dimensional subspace, such that the variability of the projected data is maximized. It is a technique that allows for the extraction of a lower dimensional representation of the data that contains as much as possible of the total variance. Given data where each observation has p features, PCA assembles these dimensions, or principal components, through linear combinations of the p features found in the original data.

Given data $x_1, \dots, x_N$, the sample mean and covariance matrix can be defined in the following way:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad (1)$$

$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T. \qquad (2)$$

This allows the variance of the projection of x onto the first principal direction $\phi_1$ to be written as

$$\frac{1}{N}\sum_{i=1}^{N} \left(\phi_1^T x_i - \phi_1^T \bar{x}\right)\left(\phi_1^T x_i - \phi_1^T \bar{x}\right)^T = \phi_1^T \Sigma \phi_1. \qquad (3)$$

This projected variance can then be maximized with respect to $\phi$, subject to the constraint that the sum of squares of $\phi$ equals 1, because increasing the magnitude of $\phi$ leads to increasing variance. This results in the following optimization problem:

$$\max_{\phi} \; \phi^T \Sigma \phi \quad \text{s.t.} \quad \phi^T \phi = 1. \qquad (4)$$

This is solved by rewriting the problem as an unconstrained optimization problem using a Lagrange multiplier $\lambda$, which yields

$$\phi^T \Sigma \phi + \lambda(1 - \phi^T \phi). \qquad (5)$$

By setting the derivative of eq. 5 with respect to $\phi$ equal to zero one obtains

$$\Sigma \phi = \lambda \phi, \qquad (6)$$

which can be written as

$$\phi^T \Sigma \phi = \lambda \qquad (7)$$

using the fact that $\phi^T \phi = 1$.

Therefore the variance will be maximized when $\phi$ is equal to the eigenvector corresponding to the largest eigenvalue $\lambda$ (Bishop 2006, p 561-569). This procedure can then be repeated to extract further orthogonal principal components that explain the maximum of the remaining variance in the original data.
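As a concrete illustration, the following is a minimal numpy sketch of this eigendecomposition procedure; the function name and interface are illustrative, not taken from the thesis code.

```python
import numpy as np

def pca_project(X, k):
    """Project X (N x p) onto its first k principal components."""
    X_c = X - X.mean(axis=0)                # center the data (eq. 1)
    cov = X_c.T @ X_c / X.shape[0]          # sample covariance matrix (eq. 2)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenpairs of the symmetric matrix
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    components = eigvecs[:, order[:k]]      # directions of maximal variance
    return X_c @ components                 # lower dimensional representation
```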

2.5 Methods for Point Anomaly Detection

Given the nature of the sequences generated by the sensor network, the author finds it unlikely that methods for point anomaly detection would be effective. The use of these kinds of methods was analyzed anyway, by the use of a well known outlier detection method known as Isolation forest.

2.5.1 Isolation Forest

Liu et al. presented a method that approaches the subject through the measure of isolation (Liu et al. 2008). The method is based on the use of an isolation tree, which is a proper binary tree where each node is either a leaf with no children or an internal node that includes a test where an attribute q is split by a value p such that q < p; this partitions the feature space into exactly two child nodes $T_l$ and $T_r$.

When an isolation tree is constructed using a data set $X = \{x_1, x_2, ..., x_n\}$, made up of n observations, each of which is d-dimensional, the feature space is repeatedly partitioned by selecting a feature q and a value p in the range of that feature, until a specified tree height limit is reached or the child node only includes homogeneous observations (Liu et al. 2008).

This method then utilizes characteristics of point anomalies, i.e. that the feature values of anomalies are likely to be substantially different from the rest and that the number of anomalies is low. In the process of randomly partitioning the feature space when building random trees, this appears as a smaller number of partitions between the root and an anomaly, resulting in a shorter path, and the distinguishable nature of an anomaly results in it likely being separated early in the process. When an ensemble of these trees collectively flags observations as having considerably shorter path lengths, this is a strong indication of anomalous behaviour (Liu et al. 2008). Let us denote the path length of an observation $x_i$, which corresponds to the number of edges between the root of a tree and the leaf $x_i$ belongs to, by $h(x_i)$ for $i \in \{1, \dots, n\}$.

An anomaly scoring system is then used to quantify the abnormality of the observation. This system is based on the theory behind an unsuccessful search in a randomized binary search tree. It defines the average path length of an unsuccessful search in a binary search tree as

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \qquad (8)$$

where $H(i)$ represents the harmonic number. The anomaly score is then defined using the average path length $\bar{h}(x)$ over all the trees, normalized by $c(n)$ in the following manner:

$$s(x, n) = 2^{-\frac{\bar{h}(x)}{c(n)}}. \qquad (9)$$

This score can then be used to make assessments, as higher scores indicate anomalous behaviour.

This method is not designed with temporal dependencies in mind but the use of the method in combination with a sliding window technique has been researched and been shown to be useful (Ding 2013).
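To make the scoring concrete, the following sketch evaluates eqs. 8 and 9 given the path lengths collected from an ensemble of trees; it uses the standard approximation of the harmonic number, $H(i) \approx \ln(i) + \gamma$, and is not the scikit-learn implementation used later in this project.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search (eq. 8)."""
    H = np.log(n - 1) + np.euler_gamma  # harmonic number approximation
    return 2 * H - 2 * (n - 1) / n

def isolation_score(path_lengths, n):
    """Anomaly score of eq. 9; values close to 1 indicate anomalies."""
    h_bar = np.mean(path_lengths)       # average path length over the trees
    return 2 ** (-h_bar / c(n))
```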

2.6 Contextual Anomaly Detection

As a research topic, unsupervised contextual anomaly detection makes up a large portion of modern anomaly detection research (Chandola et al. 2009). Many different methods have been designed and investigated, and they approach the problem in different ways. A popular option is to use well established regression techniques or statistical modeling to construct predictive anomaly detectors, while others seek to construct a profile or representation of normal behaviour and use it to judge the abnormality of the signals. Due to this difference, the techniques will be split into predictive and non-predictive anomaly detection moving forward.

2.6.1 Predictive anomaly detection

The idea is to train a model that is highly accurate in its predictions under normal conditions and will therefore have a low prediction error, but in cases of anomalous behaviour the model should not be as accurate, leading to higher error margins. This is obtained by training the models in a semi-supervised manner, i.e. by only training on data that is representative of normal behaviour. This variability in the predictive errors can be modeled using e.g. thresholding techniques that allow for the construction of a binary classification problem, where observations are classified as either normal or anomalous. For this setting essentially any regression technique designed for sequential learning could be used. In this project only artificial neural network architectures were considered, with special emphasis placed on recurrent neural networks with long short term memory units, as presented in (Malhotra et al. 2015); for further theoretical details see sections 2.7.2 and 2.7.3.

2.6.2 Non-predictive anomaly detection

The set of models that fall into this group are in general unsupervised learning techniques that construct a profile of sorts using training data depicting normal behaviour. This set includes methods best known for representation learning, such as Principal Component Analysis (Viswanath et al. 2014) and encoder-decoder networks (Malhotra et al. 2016), as well as generative methods such as Generative Adversarial Networks (Li et al. 2019).

In this project the recurrent neural network encoder-decoder using Long Short-Term Memory (LSTM) units was chosen, as it is well suited to handle the sequential nature of the data. The background is laid out in sections 2.7.3 and 2.7.4.

2.7 Artificial Neural Networks

2.7.1 Feed-Forward Neural Networks

The feed-forward neural network is essentially a composition of functions of the form

$$f_\theta(x) = f^{(m)}_{\theta_m}\big(\dots f^{(1)}_{\theta_1}(x)\big). \qquad (10)$$

This function is meant to serve as a mapping from an input x to a response variable y (Goodfellow et al. 2016, p 163-166). Each function $f^{(i)}_{\theta_i}$ can be interpreted as the transition between hidden layers and has the following form:

$$f^{(i)}_{\theta_i}\big(h^{(i-1)}\big) = g^{(i)}\big(W^{(i)} h^{(i-1)} + b^{(i)}\big). \qquad (11)$$

Here $g^{(i)}$ represents an activation function, most often a non-linear function, and $h^{(i-1)}$ represents the value at the (i-1)-th layer. The parameter set at the i-th layer is denoted by $\theta_i = (W^{(i)}, b^{(i)})$; in general $W^{(i)}$ is referred to as a weight matrix and $b^{(i)}$ as a bias.

In this project only the tanh activation function was relied on; it is shown in eq. 12:

$$g_{\tanh}(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^z - e^{-z}}{e^z + e^{-z}}. \qquad (12)$$

In order for a neural network to be of any use, the mapping has to be fitted to the problem at hand. The most commonly used procedure is to apply the backpropagation algorithm, where the chain rule on a directed graph is applied to compute the gradients needed to drive the stochastic gradient descent algorithm that minimizes a loss function. The general outline of the algorithm can be described in the following manner (Hult 2018). For a subset L of the nodes in a computational graph 1,...,n, with $Pa(L)$ defined as the set of nodes that are parents of nodes in L, i.e. $Pa(L) = \cup_{i \in L} Pa(i)$:

1. Make a forward pass by computing the values at the hidden layers given an input x.

2. Set $\frac{\partial h_n}{\partial h_n} = 1$.

3. For each $j \in Pa(L)$, compute $\frac{\partial h_n}{\partial h_j} = \sum_{i:\, j \in Pa(i)} \frac{\partial h_n}{\partial h_i} \frac{\partial h_i}{\partial h_j}$.

4. Update $L = Pa(L)$.

5. Repeat steps 3 and 4 until $L = \emptyset$.

2.7.2 Recurrent Neural Networks

A recurrent neural network (RNN) is an ANN designed for sequential modeling and therefore very convenient for anomaly detection in sequential data. One way of describing its mechanics is from the point of view of a hidden Markov model (HMM) (Hult 2018).

Consider an unobserved hidden Markov chain $\{h_t\}$ with transition probability $p(h_t|h_{t-1})$ and observations $\{y_t\}$ from $p(y_t|h_t)$, with $y_1, ..., y_T$ being conditionally independent given the corresponding hidden states. It is assumed that the two distributions have their own sets of parameters, denoted by $\theta$ and $\phi$, and the system is extended further to include an observed input $\{x_t\}$ in the dynamics behind the hidden Markov chain $\{h_t\}$. This modification yields the following system:

$$h_t \,|\, h_{t-1}, x_t \sim p(h_t \,|\, h_{t-1}, x_t, \theta),$$
$$y_t \,|\, h_t \sim p(y_t \,|\, h_t, \phi). \qquad (13)$$

This setup does not allow for maximum likelihood estimation of the parameter sets $\theta$ and $\phi$, due to the fact that the likelihood needs to be marginalized over the hidden states, i.e.

$$p(y_1, \dots, y_T \,|\, x_1, \dots, x_T, \theta, \phi) = \int p(y_1, \dots, y_T \,|\, h_1, \dots, h_T, \phi)\, p(h_1, \dots, h_T \,|\, x_1, \dots, x_T, \theta)\, dh_1 \cdots dh_T. \qquad (14)$$

Proceeding in this manner would be troublesome, and the utilization of advanced computational methods such as Markov Chain Monte Carlo methods would be needed. The RNN, however, works under the assumption that the transition from one hidden state to another is determined by a deterministic function f. This simplifies the system given in eq. 13 to

$$h_t = f(h_{t-1}, x_t; \theta),$$
$$y_t \,|\, h_t \sim p(y_t \,|\, h_t, \phi). \qquad (15)$$

Making the hidden states a function of the input allows the likelihood to be written as

$$p(y_1, \dots, y_T \,|\, x_1, \dots, x_T, \theta, \phi) = p(y_1, \dots, y_T \,|\, x_1, \dots, x_T, h_1, \dots, h_T, \theta, \phi) = \prod_{t=1}^{T} p(y_t \,|\, h_t, \phi). \qquad (16)$$

Now the process of maximizing the likelihood is equivalent to minimizing the negative log-likelihood.

Using a function to determine the transitions between hidden states allows the model to store a lot of historical information, and its ability to apply non-linear dynamics yields high modelling flexibility. When working with sequential data, the well known backpropagation algorithm (Rumelhart et al. 1986) requires a slight adjustment to produce what is known as backpropagation through time (BPTT). Given a loss function

$$L(\{x_t, y_t\}; \theta, \phi) = \sum_{t=1}^{T} L_t(y_t, g(h_t; \phi)), \qquad (17)$$

the partial derivative with respect to $\theta$ is

$$\frac{\partial}{\partial \theta} L(\{x_t, y_t\}; \theta, \phi) = \sum_{t=1}^{T} \frac{\partial L_t(y_t, g(h_t; \phi))}{\partial g} \frac{\partial g(h_t; \phi)}{\partial h_t} \frac{\partial h_t}{\partial \theta}, \quad \text{where} \quad \frac{\partial h_t}{\partial \theta} = \left.\frac{\partial f(h_{t-1}, x_t; \theta_t)}{\partial \theta_t}\right|_{\theta_t = \theta} + \frac{\partial f(h_{t-1}, x_t; \theta)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial \theta}. \qquad (18)$$

Here $\theta_t$ serves as a dummy variable, since $h_t$ depends on $\theta$ both through the function $f(h_{t-1}, x_t; \theta)$ and through the past hidden state $h_{t-1}$. This recursion is iterated backwards from $t = T, \dots, 1$ in order to compute all derivatives $\frac{\partial h_t}{\partial \theta}$. The second partial derivative is somewhat simpler, i.e.

$$\frac{\partial}{\partial \phi} L(\{x_t, y_t\}; \theta, \phi) = \sum_{t=1}^{T} \frac{\partial L_t(y_t, g(h_t; \phi))}{\partial g} \frac{\partial g(h_t; \phi)}{\partial \phi}. \qquad (19)$$

Shortcomings of the standard RNN are the phenomena known as vanishing and exploding gradients. This is a problem that appears frequently in practice when long term dependencies are involved (Bengio et al. 1994). These phenomena have been researched significantly and are quite complex in nature. However, an intuitive explanation can be established by simplifying the recurrence in the RNN to a function composition as in equation 20 (Goodfellow et al. 2016, p 390-392):

$$h_t = W^T h_{t-1} = (W^t)^T h_0. \qquad (20)$$

Given that W can be eigendecomposed as $W = Q \Lambda Q^T$, the recurrence can be written as

$$h_t = Q^T \Lambda^t Q h_0. \qquad (21)$$

Now as t increases, the compounding effect is highly influenced by the eigenvalues, i.e. values of magnitude lower than one will decay to zero while values of magnitude higher than one will tend to infinity. This compounding pushes the gradients to zero or to infinity, making the training procedure very troublesome.

2.7.3 Long Short-Term Memory

The Long Short-Term Memory (LSTM) cell was introduced as a way to combat the problem of vanishing gradients by setting up paths based on internal recurrence that allow for better flow of gradients (Hochreiter & Schmidhuber 1997). The LSTM accomplishes this by incorporating a more complex system of gating units that control information flow. There are variants of the LSTM unit, so for transparency's sake the following explanation is based on the works of (Goodfellow et al. 2016, p 397-400). Here the key component is the internal state $s^t$, which is meant to capture dependencies over a longer time interval. Information from the internal state is combined with information from the previous hidden state $h^{t-1}$ and the new input $x^t$ through four different gates to produce the output of the hidden layer $h^t$. This procedure is visualized in figure 2.1.

Figure 2.1: LSTM cell (Goodfellow et al. 2016, p 398)

Each gate applies a sigmoid activation function and individual bias terms b, input weights U , and recurrent weights W .

The input gate $g^t$ serves the purpose of determining what new information should be stored in the cell state and is denoted by

$$g_i^t = \mathrm{sigm}\Big(b_i^g + \sum_j U_{ij}^g x_j^t + \sum_j W_{ij}^g h_j^{t-1}\Big). \qquad (22)$$

The forget gate works to prioritize past information and is meant to determine how much of the prior context is needed, 0 for no context and 1 for full context, and is defined as

$$f_i^t = \mathrm{sigm}\Big(b_i^f + \sum_j U_{ij}^f x_j^t + \sum_j W_{ij}^f h_j^{t-1}\Big). \qquad (23)$$

The output gate works similarly to the input gate but determines what portion of the information should be fed out as output, as opposed to what information should be stored; it can be written as

$$q_i^t = \mathrm{sigm}\Big(b_i^o + \sum_j U_{ij}^o x_j^t + \sum_j W_{ij}^o h_j^{t-1}\Big). \qquad (24)$$

The output of the hidden layer is then the internal state passed through a tanh activation, multiplied by the output of the output gate, i.e.

$$h_i^t = \tanh(s_i^t)\, q_i^t. \qquad (25)$$

The internal cell state is then updated using

$$s_i^t = f_i^t s_i^{t-1} + g_i^t\, \mathrm{sigm}\Big(b_i + \sum_j U_{ij} x_j^t + \sum_j W_{ij} h_j^{t-1}\Big). \qquad (26)$$

2.7.4 Encoder-Decoder Network

An RNN encoder-decoder network is a useful tool for sequence-to-sequence modelling (Goodfellow et al. 2016, p 385-386). Given a time series $X = \{x_1, x_2, ..., x_T\}$, where each sample $x_i \in \mathbb{R}^D$ is a D-dimensional vector of signal readings, a subsequence $X^{in} = \{x_1, x_2, ..., x_L\}$ of a fixed length L is fed to the network, as shown in figure 2.2 for a sequence of length L = 3. The network uses the side referred to as the encoder to create a representation as the output of the last hidden layer of the encoder. This representation corresponds to the encoded data layer of figure 2.2, which can then be reconstructed using the decoder, a mirror of the encoder architecture, to obtain a reconstructed sequence. The neurons and the dynamics are still as described in sections 2.7.2 and 2.7.3 if LSTM units are used. The largest difference between the architectures is the manner in which the network is trained. In this setting the model is trained to minimize the mean square error between the input and the output of the model, i.e.

$$\frac{1}{L} \sum_{i=1}^{L} \left(x_i^{in} - x_i^{out}\right)^2. \qquad (27)$$

3 Method

This section is meant as a detailed explanation of the strategy that was assembled in order to answer the research questions that were established in section 1.3.

3.1 Data acquisition

As described in section 1.4, two data sources were used to evaluate the models' performance, mostly due to the fact that faults can appear differently depending on the setting. The second data set was therefore experimented on to get a more thorough analysis of how the different models behave.

3.1.1 Scania data

The data obtained from Scania was logged for the purpose of creating a basis for unsupervised anomaly detection research. Over a two week period a test vehicle was driven and CAN signals were logged. A team of mechanics and electricians was assembled, whose role was both to make sure the vehicle was working under normal conditions for a major part of the two week period and to make mechanical and/or electrical adjustments to the vehicle in order to introduce injected faults into the system, one at a time. In total 10 different injected faults were logged for analysis; however, they all share the characteristic that no information exists apart from the fact that an injected fault can be found in a particular data file. This means that no record of anomalous activity on a per-observation basis is available, and it can therefore not be used for obtaining summary statistics such as precision and recall. The experiments were therefore conducted under the following assumptions.

• The normal drives contain no anomalous activity.

• The faulty files only contain the injected fault and no other anomalous activity.


Figure 3.1: (Left) Example of a time series from a normal drive. (Right) Example of a time series with an injected fault.

Figure 3.1 illustrates examples of the difference in behaviour between a normal and a faulty drive. There are clear subsequences in the faulty data where the signal is constant.

3.1.2 Synthesized data

To get a better idea of the capabilities of the models and their performance, experiments using synthesized data were also conducted. The purpose of these experiments was to investigate the capabilities of the encoder-decoder network to detect faults of different nature than the ones that existed in this particular Scania setting.

Noisy Sinusoidal Wave

A sinusoidal wave was synthesized and each observation was perturbed with random noise generated by a Gaussian distribution with zero mean and standard deviation σ = 0.1. The signal s(t) was generated by the equation

$$s(t) = A \sin(2\pi f t + \phi). \qquad (28)$$

Figure 3.2: Example of the synthesized sinusoidal wave
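A sketch of how such a signal can be generated is given below; only the noise level σ = 0.1 is stated in the text, so the amplitude, frequency and phase are illustrative choices.

```python
import numpy as np

np.random.seed(0)                                   # reproducible noise
t = np.arange(0, 10, 0.01)                          # time axis (assumed resolution)
A, f, phi = 1.0, 1.0, 0.0                           # assumed wave parameters
s = A * np.sin(2 * np.pi * f * t + phi)             # clean wave (eq. 28)
s_noisy = s + np.random.normal(0.0, 0.1, t.shape)   # additive Gaussian noise
```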

3.2 Data exploration

During the first weeks the emphasis was placed upon preliminary data exploration, with special focus given to removing irrelevant sequences that were not of importance for this project. To start with, all sequences that were constant over the entire collection period were removed. Additionally, signals from sensors such as radars and cameras were removed, as they were unrelated to the injected faults that were of interest. This brought the dimensionality down to 264 sequences from the 1415 found in the original data set. Ultimately, after conducting correlation analysis, the subset of signals selected was taken as the signal S1, which depicts the faulty behaviour in the clearest manner, together with the nine highest rated signals according to the importance score generated from fitting a regression model, made up of 5000 random trees with a maximum length of 5000, to those 264 sequences. This model was trained to best predict S1. The resulting signals are listed in table 3.1.

Table 3.1: Random Forest Importance Score for the selected signals

Signal   RF Importance Score
S2       0.6696729292914351
S3       0.20870335815822644
S4       0.08991379993038735
S5       0.017335697225209827
S6       0.009949318219779535
S7       0.001381075421863678
S8       0.0004251155893856264
S9       0.0003240885292837287
S10      0.0002600217598883752

Random forest is a well known ensemble technique, very commonly used in machine learning (Breiman 2001). It constructs multiple binary decision trees which all have a say in the final output of the model. A very handy tool that the random forest method offers is the feature importance scoring used to obtain the set of 10 signals described above. This is accomplished using a measure called mean decrease impurity where, in the regression setting, the variance is used as the measure of impurity ((Louppe et al. 2013), (Ronaghan 2018)). The impurity decrease is computed for each node j in a tree as

$$\Delta I_j = w_j I_j - w_{left,j} I_{left,j} - w_{right,j} I_{right,j}. \qquad (29)$$

Here $w_j$ represents the weighted number of samples in node j, $I_j$ is the impurity value of node j, and the left and right children of node j are denoted by left,j and right,j. The feature importance for each tree is then computed and normalized as

$$Fi_i = \frac{\sum_{j:\, \text{node } j \text{ is split by feature } i} \Delta I_j}{\sum_{k \in \text{all nodes}} \Delta I_k}, \qquad (30)$$

$$Fi_i^{norm} = \frac{Fi_i}{\sum_j Fi_j}. \qquad (31)$$

Finally, the average feature importance score over the entire forest, made up of T trees, is given as

$$RFIS_i = \frac{\sum_{j=1}^{T} Fi_{i,j}^{norm}}{T}. \qquad (32)$$

3.3 Data preprocessing

The logged data is far from the quality needed for the modelling options explored in this project; therefore extensive preprocessing had to be conducted. The art of processing irregular time series that are noisy and/or have missing values is a fairly big research area in its own right (de Carvalho et al. 2007) (Lepot et al. 2017).

3.3.1 Resampling and interpolation

The first thing that had to be considered was that the CAN signals are not logged at the same rates, i.e. the signals are logged as they arrive, and therefore we are working with an irregular time series. In order to get full observations of the signals with a fixed sampling interval, interpolation techniques had to be considered. Due to the fact that the end product is meant for real time analysis, the techniques can only rely on past observations. Additionally, the sampling rate is fairly high, i.e. a signal is measured approximately every 0.01 seconds.

Figure 3.3: (Left) Time series prior to downsampling. (Right) Time series post downsampling to a per second basis
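A sketch of this downsampling step is shown below, assuming df is a pandas DataFrame of raw CAN signals indexed by (irregular) timestamps; forward filling only relies on past observations, which respects the real time constraint:

```python
import pandas as pd

# Downsample the irregular series to a per-second basis and fill any
# remaining gaps using only the last past observation.
regular = df.resample('1S').mean().ffill()
```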

3.3.2 Normalization and Standardization

Before training both standardization and normalization techniques were applied and their effects analyzed.

Normalization

Normalization refers to the process of scaling the data, most commonly to the range [0, 1]. This is accomplished by applying equation 33:

$$\tilde{X}_i = \frac{X_i - \min(X)}{\max(X) - \min(X)}. \qquad (33)$$

Standardization

To standardize a sample is to replace each observation with its z-score. Equation 34 shows how the z-score is computed for an observation $X_i$, using the sample mean $\mu_X$ and the sample standard deviation $\sigma_X$ as estimates for the population mean and the population standard deviation:

$$z_i = \frac{X_i - \mu_X}{\sigma_X}. \qquad (34)$$

3.3.3 Filtering

Filtering options were explored in the later stages of experimentation with the Scania data, in the hope that they would have a positive effect on the models.

Exponential Weighted Moving Average

The setting of real time analysis limits the filtering options. The exponential weighted moving average (EWMA) was explored (pandas.DataFrame.ewm n.d.). It is a rolling mean technique that gives higher importance to recently observed values, as opposed to the conventional moving average, which weighs all observations in the time window equally. An EWMA is obtained using the following recursion:

$$y_0 = x_0,$$
$$y_t = (1 - \alpha)\, y_{t-1} + \alpha\, x_t,$$

where the value of $\alpha$ is determined by $\alpha = \frac{2}{s+1}$. Here s is the number of observations in the span used to compute the average.

Figure 3.4: Var1 before and after being filtered using EMA over 10 and 100 observational spans
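With pandas, the recursion above is obtained via ewm; passing adjust=False gives exactly $y_t = (1-\alpha)y_{t-1} + \alpha x_t$ with $\alpha = 2/(s+1)$ for a given span s. A sketch, assuming signal is a pandas Series holding one CAN signal:

```python
filtered_10 = signal.ewm(span=10, adjust=False).mean()    # short span: follows the signal closely
filtered_100 = signal.ewm(span=100, adjust=False).mean()  # long span: smoother but more lag
```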

3.4 Anomaly Scoring

In all instances except for the Isolation forest, a procedure for scoring the abnormality of an observation had to be designed. In all cases the same kind of approach was applied, although it had to be slightly modified to fit the encoder-decoder network due to the difference in the shape of the output. For the predictive setting the prediction error is computed for every observation as

$$e_i^{pred} = x_i - x_i^{pred}, \qquad (35)$$

where $x_i$ is a single observation of the actual signal while $x_i^{pred}$ is the corresponding predicted observation. This allows the prediction errors to be modeled by a univariate Gaussian distribution, fitted using maximum likelihood estimation, which can be established in the following manner. First, the likelihood function for a single error is defined as

$$L(\mu, \sigma \,|\, e_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_i - \mu)^2}{2\sigma^2}}. \qquad (36)$$

Assuming independent errors, the joint likelihood is the product of the individual likelihoods as written in eq. 36, which yields

$$L(\mu, \sigma \,|\, e_1, e_2, ..., e_n) = L(\mu, \sigma \,|\, e_1)\, L(\mu, \sigma \,|\, e_2) \cdots L(\mu, \sigma \,|\, e_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_i - \mu)^2}{2\sigma^2}}. \qquad (37)$$

For convenience the natural logarithm is applied, which simplifies later stages of the computations. This is justifiable since the logarithm is a monotonically increasing function, so the log-likelihood attains its maximum at the same parameter values.

$$l(\mu, \sigma \,|\, e_1, e_2, ..., e_n) = \ln\Bigg(\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_i - \mu)^2}{2\sigma^2}}\Bigg) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \sum_{i=1}^{n} \frac{(e_i - \mu)^2}{2\sigma^2}. \qquad (38)$$

Now computing the derivatives with respect to $\mu$ and $\sigma$ becomes simpler; they are

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\Bigg(\Big(\sum_{i=1}^{n} e_i\Big) - n\mu\Bigg), \qquad \frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (e_i - \mu)^2. \qquad (39)$$

Setting the partial derivatives to zero and solving for $\mu$ and $\sigma$, one obtains the maximum likelihood estimates

$$\mu_e = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad \sigma_e = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (e_i - \mu_e)^2}. \qquad (40)$$

Figure 3.5: Prediction errors from a validation set and an estimated Gaussian curve

This Gaussian distribution can be utilised as a handy tool to establish a threshold $\tau$ that serves as a decision boundary determining whether anomalous activity is present.

Figure 3.6: Anomaly score threshold fitted to validation set prediction errors

The anomaly score is shown in eq. 41 and is computed so that, independently of the setting, an observation is flagged as anomalous if its anomaly score is above one:

$$AnomalyScore_i = \frac{|e_i - \mu_e|}{\tau\, \sigma_e}. \qquad (41)$$

Here $e_i$ is an error term, $\mu_e$ is the estimated error sample mean, $\sigma_e$ is the estimated error sample standard deviation and $\tau$ is a tuning parameter. In the case of the LSTM encoder-decoder network, the output is a reconstructed sequence of the same length as the input, so an adjustment to the error computations had to be made in order to get a one dimensional error vector. The network was fed a one dimensional subsequence of length n, denoted by s, and the reconstruction error for each observation in the input sequence was computed as

$$e_{i,j} = |x_{i,j} - \hat{x}_{i,j}|, \qquad (42)$$

where $\hat{x}_{i,j}$ is the reconstructed observation j of subsequence i. The sum over every observation in a subsequence then yields a single error term per subsequence, producing a one dimensional error vector.

This vector of errors can now be treated in the same manner as described before, i.e. it can be used to fit a univariate Gaussian distribution which in turn allows for a handy way to establish a threshold that serves as a decision boundary so that the errors that surpass it are flagged as anomalous.
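The whole scoring pipeline can be sketched in a few lines; val_errors and test_errors are assumed to be prepared error vectors, and τ = 5 is one of the values used in section 4:

```python
import numpy as np

mu_e = np.mean(val_errors)    # ML estimate of the error mean (eq. 40)
sigma_e = np.std(val_errors)  # ML estimate of the error std (1/n version, eq. 40)

tau = 5.0                     # tuning parameter for the threshold
scores = np.abs(test_errors - mu_e) / (tau * sigma_e)  # anomaly score (eq. 41)
is_anomaly = scores > 1       # flag observations whose score exceeds one
```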

3.5 Architecture Implementation

Different methods that approach the topic of anomaly detection in sequential data in different ways were experimented with. The main emphasis was put on three types of approaches: point anomaly detection, non-predictive contextual anomaly detection and predictive contextual anomaly detection.

3.5.1 Setup

For all implementations, the Python programming language and the Python packages in table 3.2 were used.

Table 3.2: Python packages used for implementation

Package        Version
Python         3.7.3
Numpy          1.16.3
Pandas         0.24.2
Matplotlib     3.0.3
Scikit Learn   0.20.3
Tensorflow     1.13.1
Keras          2.2.4

3.5.2 Point anomaly detection

A profile of normal behaviour is constructed from the training data, and new observations are ranked based on how well they fit that profile.

Isolation Forest

The algorithm was initialized with the maximum number of samples set to 1000 and the contamination rate set to 0.0001, and it was fitted to the normal drive training data.
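In scikit-learn this corresponds to something like the following sketch, where X_train (normal drive data) and X_test are assumed to be prepared arrays:

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(max_samples=1000, contamination=0.0001)
iso.fit(X_train)              # build the isolation trees on normal behaviour
labels = iso.predict(X_test)  # -1 for observations judged anomalous, 1 otherwise
```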

3.5.3 Non-predictive contextual anomaly detection

An encoder-decoder network as described in section 2.7.4 was constructed. A linear transformation, based on applying Principal Component Analysis (PCA) (see 2.4.1) to the normal drive training data set, was used to reduce the dimension to a single sequence. The encoder is fed a sequence of 10 observations and utilizes a hidden layer made up of 16 LSTM units with a tanh activation function. That layer's output is then fed as input into the decoder, which mirrors the encoder in architecture, resulting in an output of the same shape and size as the input. This network is trained to minimize the mean square error between the input and the output in an unsupervised manner.
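A minimal Keras sketch of such an architecture is given below; the RepeatVector/TimeDistributed construction is one common way to express an LSTM encoder-decoder and is not necessarily the exact implementation used in the thesis. Input windows are assumed to have shape (10 timesteps, 1 feature) after the PCA reduction.

```python
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential([
    LSTM(16, activation='tanh', input_shape=(10, 1)),    # encoder: 16 LSTM units
    RepeatVector(10),                                    # repeat the encoding for each output step
    LSTM(16, activation='tanh', return_sequences=True),  # decoder mirrors the encoder
    TimeDistributed(Dense(1)),                           # one reconstructed value per timestep
])
model.compile(optimizer='adam', loss='mse')              # minimize the MSE of eq. 27
```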

3.5.4 Predictive contextual anomaly detection

A many-to-one regression model was constructed using a recurrent neural network which takes past observations of the 10 signals as input and predicts S1, using a single hidden layer of 16 LSTM units with a tanh activation function (see eq. 12), followed by a dense layer with a linear activation function.
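A corresponding Keras sketch of the predictor is shown below; the window length of past observations is an assumption, as the thesis only states the layer sizes:

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

window = 10                                                 # assumed length of the input window
model = Sequential([
    LSTM(16, activation='tanh', input_shape=(window, 10)),  # 16 LSTM units over the 10 signals
    Dense(1, activation='linear'),                          # dense output layer predicting S1
])
model.compile(optimizer='adam', loss='mse')
```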

3.6 Training procedure

The models were trained on data depicting normal behaviour and tested using a data file of 729 observations from a normal drive and another file holding 605 observations from an injected fault.

All neural networks were trained using the ADAM (adaptive moment estimation) optimization algorithm, a well known extension of the stochastic gradient descent algorithm (Kingma & Ba 2014). This algorithm was chosen simply because it is often recommended as the default optimization algorithm in deep learning (Ruder 2016).

Adam uses exponentially weighted moving averages to compute estimates of the first and second moments $(m_t, v_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t,$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad (44)$$

where $g_t^2$ represents the element-wise square if $g_t$ is a vector or a matrix. Here $\beta_1$ and $\beta_2$ are hyperparameters, set at the default values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ for this project. These estimates are biased and need to be bias-corrected, yielding the unbiased estimates

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \qquad (45)$$

This results in the following update step:

$$\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}. \qquad (46)$$

Here $\eta$ is the learning rate, set at 0.001 during the experimentation, $\theta$ are the model's parameters and $\epsilon$ is a hyperparameter set to the value of $10^{-8}$.
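For illustration, a single Adam update can be sketched as follows; this is a plain numpy transcription of eqs. 44-46, not the Keras internals used for training:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                       # first moment estimate (eq. 44)
    v = beta2 * v + (1 - beta2) * g ** 2                  # second moment estimate (eq. 44)
    m_hat = m / (1 - beta1 ** t)                          # bias corrections (eq. 45)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # update step (eq. 46)
    return theta, m, v
```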

3.7 Model Evaluation

3.7.1 Scania Data

Due to the lack of informative annotation in the Scania data, the models' evaluation is limited to the analysis of the detection rate and the false positive rate. The work was carried out under the assumption that the normal drives depicted nothing but normal behaviour; the models should therefore preferably not flag any anomalous behaviour in those cases, while still capturing the injected faults. This leads to a somewhat heuristic approach, where the detection rate and the rate of false positives were analysed.

3.7.2 Synthesised data

4 Results

This section focuses on the experimental results obtained from the implemented models using both data sets. The evaluation based on the data set obtained from Scania is more focused on showing the models' ability to detect anomalous behaviour and the rate of false positives. The other data set is then meant to give better summary statistics, since it has actual annotated anomalies.

4.1 Scania data

In this section the performance of different methods is analysed using the Scania data. As described in section 1.4, the nature of the data restricts the analysis to looking at the detection rate of the injected fault and the false positive rate of the normal drive test sequences.

4.1.1 Isolation Forest


Figure 4.1: Isolation forest anomaly score from a normal drive test set with threshold set at 1.12

Figure 4.2: Isolation forest anomaly score from a faulty drive test set with threshold set at 1.12

4.1.2 RNN - encoder-decoder network


Without Filtering

The first post-training step for every model is to compute the error margins between a signal and the output of each model. Those errors are then modeled by a Gaussian distribution as laid out in section 3.4. The estimated mean and standard deviation can be seen in table 4.1. This allows for a convenient way of defining a threshold according to the number of standard deviations from the mean.

Table 4.1: ED Gaussian coefficients without filtering

Mean                 0.027867447268757194
Standard Deviation   0.010066136635592152

The reconstructed signals from the faulty and normal drives, both with and without filtering, are not particularly interesting, as there is little visible difference between the reconstructed signal and the original signal in figures 4.3, 4.4, 4.7 and 4.8.

Figure 4.3: The original and the reconstructed normal drive test set without filtering

The detection rate is low when the faulty drive is tested, and the anomaly flags only appear in highly volatile contexts and not in areas where the signal levels off.

Figure 4.4: The original and the reconstructed faulty Drive test set without filtering


Figure 4.6: Encoder-Decoder anomaly score of a faulty Drive test set without filtering using τ = 5

Table 4.2 shows the overall detection rates, and based on these results the model does not show a lot of promise for anomaly detection on this particular signal.

Table 4.2: Detection rate of the Encoder-Decoder network for both the Normal and Faulty test drives without filtering using τ = 5

Drive    Detection Rate
Normal   0.56%
Faulty   0.84%

With Filtering

The data was filtered using an exponential weighted moving average over a window of 100 observations. The errors were modeled in the same way as before and the estimated coefficients are depicted in table 4.3.

Table 4.3: ED Gaussian coefficients with filtering

Mean                 0.017008433576124825
Standard Deviation   0.011817158698356237

The exponential weighted moving average (EWMA) does not have a positive effect on this model, as it can be seen that one of the two spikes in the anomaly scoring has dropped below the threshold in figure 4.10.

Figure 4.7: The original and the reconstructed normal drive test set with filtering


Figure 4.9: Encoder-Decoder anomaly score of a normal drive test set with filtering using τ = 4.9

Figure 4.10: Encoder-Decoder anomaly score of a faulty drive test set with filtering using τ = 4.9


Table 4.4: Detection rate of the Encoder-Decoder network for both the Normal and Faulty test drives with filtering using τ = 4.9

Drive    Detection Rate
Normal   0.00%
Faulty   0.34%

Sufficient information about the data collection procedure is lacking to fully determine whether this model is capturing some other effect of the injected fault. It is the author's belief, however, that the linear transformation through PCA that condenses the information to a single principal component, as described in 2.4.1, might play a major role in this instance, and different dimensionality reduction techniques should be experimented with. Additional experimentation using S1 as the only signal, skipping the dimensionality reduction step using PCA, did not lead to any noticeable improvements, which builds support for the case that this network architecture might not be suitable for this type of combination of signal and fault, where the abnormality is a constant signal rather than erratic behaviour. This was the main driver for further experimentation using synthesized data.

4.1.3 RNN - predictor

The results obtained from the RNN-prediction network are illustrated in this section using both the original data as well as the filtered signals obtained by applying an exponential weighted moving average over a window of the past 100 samples.

Without Filtering

The results of the Gaussian modelling for the RNN-regression model are found in table 4.5.

Table 4.5: RNN predictor Gaussian coefficients without filtering

Mean   -0.00017610776

The errors are simply the differences between the actual values of the observations and the corresponding predicted values of S1. Both signals are visualised using a test sequence put aside from a normal drive, seen in figure 4.11, and using a faulty test set in figure 4.12. In both cases the red curve represents the actual sequence while the blue is the predicted sequence.

Figure 4.11: The original and the predicted normal drive test set without filtering

The model does not know how to react in the subsequences of abnormal behaviour, and the majority of the model's predictions are far from the actual values. This is a very promising result, but there is a concern when figure 4.11 is analysed: there are instances where the model's prediction is too far off for a normal drive. This concern becomes even more apparent when figures 4.13 and 4.14 are analysed, as a fair amount of observations are flagged as anomalous in the normal drive test set, making them false positives according to the assumptions laid down before; table 4.6 depicts the ratio of flagged anomalies for both the normal and the faulty drives. However, the high detection rate for the faulty drive shows a lot of promise, and this combination of a high false positive rate and a high detection rate in the faulty drive was the main reason why the additional experiments using filtering techniques such as the exponential weighted moving average were conducted, in the hope of reducing the false positive rate.


Figure 4.14: RNN predictor anomaly score of a faulty Drive test set without filtering using τ = 5.5

Table 4.6: Detection rate of the RNN predictor for both the Normal and Faulty test drives without filtering using τ = 5.5

Drive    Detection Rate
Normal   1.37%
Faulty   21.52%

With Filtering

The results of the Gaussian modelling for the RNN-regression model are found in table 4.7.

Table 4.7: RNN predictor Gaussian coefficients with filtering

Mean                 0.004098794
Standard Deviation   -0.0017545223


Figure 4.15: The original and the predicted normal drive test set with filtering

Figure 4.16: The original and the predicted faulty drive test set with filtering

Figures 4.17 and 4.18 show a clear improvement, as the presence of false positives has been removed completely.

Figure 4.17: RNN predictor anomaly score of a normal drive test set with filtering using τ = 4.3


Table 4.8: Detection rate of the RNN predictor for both the Normal and Faulty test drives with filtering using τ = 4.3

Drive    Detection Rate
Normal   0.0%
Faulty   15.9%

4.2 Synthesized data

For the synthesized data the focus will be restricted to the encoder-decoder network. A sinusoidal wave was generated as described in section 3.1.2, and the resulting signal is visualized in figure 3.2.

4.2.1 LSTM - encoder-decoder network

Figure 4.19: Original and reconstructed signal from the validation set


Figure 4.20: Anomaly score from the validation set of the sinusoidal wave

Spikes were injected into the wave by adding a constant c = 0.1 to the signal at every 100th observation to create a test set. Figure 4.21 illustrates the results from testing the model, and it can clearly be observed that the model is able to capture these faults in all cases. This shows that the model is very dependent on how the faults appear in the signals; on a more periodic signal it shows much more promising performance.

5 Conclusion

In this project three different modelling options were analysed for the purpose of anomaly detection in a vehicle's sensor network. The data relied on had been collected explicitly for anomaly detection research and was made up of data depicting normal behaviour as well as several faults that were injected into the system by making mechanical or electrical adjustments. This allowed for the use of semi-supervised techniques, where different recurrent neural networks (RNNs) were trained using only data representing normal behaviour. The results show that these techniques hold great promise for this particular setting of anomaly detection in vehicles and could serve as a great tool for automated fault detection. Furthermore, the detection of anomalous behaviour is the foundation for carrying out near real time root cause analysis of the faults, which is the ultimate goal. In section 1.3, a set of questions was established that were to be answered at the end of the project. The following subsections discuss each question and the answers based on the results obtained in section 4.

What modelling options are able to detect anomalous behaviour?


What approaches can be used to combat the detection of false positives?

When the results from the RNN-predictor with and without filtering are compared, there is no denying that filtering is a key component in work of this nature. The high rate of false positives seen in figure 4.13 was the reason filtering options were analysed, and figure 4.17 shows that the false positives were non-existent post filtering. This reduction strengthens the author's belief that this could become a truly useful tool once fully developed.

What is the models' potential to generalize to different applications?

The RNN-predictor is very dependent on the availability of meaningful input features. These are not always available, and this needs to be analysed case by case. The encoder-decoder network, however, can be fitted to any signal without much preprocessing and analysis. In either case a thorough analysis would have to be carried out in order to determine whether the model performs up to standard, i.e. has a high detection rate and a low rate of false positives, as depicted in figures 4.17 and 4.18. Therefore it should always be analysed case by case.

What model performs best overall?


6 Discussion, Delimitation and Future work

A basis for further development has now been established, but numerous steps need to be taken in order to introduce this idea of anomaly/fault detection into Scania's vehicles.

Additional modelling options were experimented with, but those not included in this report all approached the problem in the same way as some of the models presented and did not perform as well. These were mainly the One-Class SVM and a different neural network architecture called the Temporal Convolutional Network. The One-Class SVM approaches the problem similarly to the Isolation Forest, by constructing a profile of normal behaviour that is used as a measuring stick but does not account for any temporal dependencies (Manevitz & Yousef 2001). Temporal Convolutional Networks, on the other hand, are a neural network architecture based on the well-established Convolutional Neural Networks, applying convolutions along the time axis and extracting features in the data using filters (Bai et al. 2018). This architecture has been shown to outperform RNNs on a few tasks and was therefore tested: a many-to-one regression model using this architecture was trained just like the RNN-predictor but did not perform nearly as well, which might simply mean that the architecture and the available resources need to be developed further.
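
For reference, the defining building block of such an architecture is a causal, dilated one-dimensional convolution, sketched below with assumed layer parameters; this is not the exact network that was tested:

    from tensorflow.keras import layers

    # One TCN-style building block: "causal" padding keeps the convolution from
    # looking into the future, and dilation widens the receptive field in time
    tcn_layer = layers.Conv1D(filters=32, kernel_size=3, padding="causal",
                              dilation_rate=2, activation="relu")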

Alternative approaches to the use of Principal Component Analysis (PCA) for dimensionality reduction before feeding data to the encoder-decoder network could be worth exploring. PCA only allows for linear transformations and does not take temporal dependencies into consideration. It might therefore be more suitable to use auto-encoders based on recurrent neural networks to better capture the non-linearity in the data as well as the contextual information of the time series.
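
For context, the current preprocessing amounts to a linear, per-time-step projection of the kind sketched below; the matrix dimensions and number of retained components are placeholders:

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.randn(1000, 50)     # placeholder for a (time steps x signals) matrix

    pca = PCA(n_components=10)        # number of retained components assumed
    X_reduced = pca.fit_transform(X)  # linear, per-time-step projection; no temporal context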


technology that will most likely replace the CAN network moving forward.

A clear delimitation of the work conducted in this project is that the data was logged from only one vehicle, while the modelling should be generic enough to function for all manufactured vehicles during testing. It would therefore be very interesting to obtain new data from other vehicles for testing purposes, to get a better idea of the transferability of the models. Federated learning might offer a promising way of obtaining models that are generic enough to be transferred from vehicle to vehicle, by aggregating models trained on data from different vehicles.
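
In its simplest form, such aggregation could amount to averaging the weights of per-vehicle models, as in the hypothetical FedAvg-style sketch below; equal weighting across vehicles is assumed:

    import numpy as np

    def federated_average(per_vehicle_weights):
        # FedAvg-style aggregation with equal weighting assumed: average each
        # weight tensor across models trained on data from different vehicles
        return [np.mean(tensors, axis=0) for tensors in zip(*per_vehicle_weights)]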

Due to the performance boost obtained from filtering the signal, other methods of real-time preprocessing should be explored. Some very basic experiments were conducted using Kalman filters during the filtering step, and these should be analysed further.
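
As a starting point for such an analysis, a minimal scalar Kalman filter under a random-walk state model is sketched below; the process and measurement noise variances q and r are placeholders:

    def kalman_1d(z, q=1e-4, r=1e-2):
        # Minimal scalar Kalman filter assuming a random-walk state model;
        # q (process noise) and r (measurement noise) are placeholder variances
        x, p = z[0], 1.0                # initial state estimate and its variance
        filtered = []
        for meas in z:
            p = p + q                   # predict: variance grows by process noise
            k = p / (p + r)             # Kalman gain
            x = x + k * (meas - x)      # correct with the new measurement
            p = (1.0 - k) * p
            filtered.append(x)
        return filtered

Like the moving-average filter, this runs strictly on past observations and could therefore be applied online to the streaming signal.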


References

Bai, S., Kolter, J. Z. & Koltun, V. (2018), ‘An empirical evaluation of generic convolutional and recurrent networks for sequence modeling’, CoRR abs/1803.01271.

URL: http://arxiv.org/abs/1803.01271

Bengio, Y., Simard, P. & Frasconi, P. (1994), ‘Learning long-term dependencies with gradient descent is difficult’, IEEE Transactions on Neural Networks 5(2), 157–166.

Beygelzimer, A., Erdogan, E., Ma, S. & Rish, I. (2005), Statistical models for unequally spaced time series.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer.

Breiman, L. (2001), ‘Random forests’, Machine Learning 45(1), 5–32.

URL: https://doi.org/10.1023/A:1010933404324

Chalapathy, R. & Chawla, S. (2019), ‘Deep learning for anomaly detection: A survey’, CoRR abs/1901.03407.

URL: http://arxiv.org/abs/1901.03407

Chandola, V., Banerjee, A. & Kumar, V. (2009), ‘Anomaly detection: A survey’, ACM Computing Surveys 41(3).

Controller Area Network (CAN) Overview (2019).

URL: https://www.ni.com/sv-se/innovations/white-papers/06/controller-area-network–can–overview.html

de Carvalho, O. A., Guimaraes, R. F., Trancoso Gomes, R. A. & da Silva, N. C. (2007), Time series interpolation, in ‘2007 IEEE International Geoscience and Remote Sensing Symposium’, pp. 1959–1961.

Ding, Z. (2013), An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window, pp. 12–17.


Hawkins, D. (1980), Identification of Outliers, Monographs on applied probability and statistics, Chapman and Hall.

URL: https://books.google.se/books?id=fb0OAAAAQAAJ

Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, Neural Computation 9(8), 1735–1780.

Hult, H. (2018), ‘Lecture notes in SF2957 Statistical Machine Learning’.

Kingma, D. & Ba, J. (2014), ‘Adam: A method for stochastic optimization’, International Conference on Learning Representations.

Lepot, M., Aubin, J.-B. & Clemens, F. (2017), ‘Interpolation in time series: An introductive overview of existing methods, their performance criteria and uncertainty assessment’, Water 9(10), 796.

URL: http://dx.doi.org/10.3390/w9100796

Li, D., Chen, D., Shi, L., Jin, B., Goh, J. & Ng, S. (2019), ‘MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks’, CoRR abs/1901.04997.

URL: http://arxiv.org/abs/1901.04997

Liu, F. T., Ting, K. M. & Zhou, Z. (2008), Isolation forest, in ‘2008 Eighth IEEE International Conference on Data Mining’, pp. 413–422.

Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. (2013), Understanding variable importances in forests of randomized trees, in C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani & K. Q. Weinberger, eds, ‘Advances in Neural Information Processing Systems 26’, Curran Associates, Inc., pp. 431–439.

URL: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P. & Shroff, G. (2016), ‘Lstm-based encoder-decoder for multi-sensor anomaly detection’, CoRR abs/1607.00148.
