DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS

*STOCKHOLM, SWEDEN 2019*

**Anomaly Detection in Streaming Data from a Sensor Network**

**EGILL VIGNISSON**


Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Applied and Computational Mathematics (120 credits)

KTH Royal Institute of Technology, year 2019

Supervisors at Scania: Paola Maggino, Nikhil Thakrar

Supervisor at KTH: Jimmy Olsson

*TRITA-SCI-GRU 2019:319*
*MAT-E 2019:76*

Royal Institute of Technology

*School of Engineering Sciences*

**KTH SCI**

**Abstract**

In this thesis, unsupervised and semi-supervised machine learning techniques are analyzed as potential tools for anomaly detection in Scania truck sensor networks. The thesis investigates the need for both point and contextual anomaly detection in this setting. For point anomaly detection the Isolation forest method was applied, and for contextual anomaly detection two different recurrent neural network architectures using Long Short-Term Memory units were used. One model was a many-to-one regression model trained to predict a certain signal, while the other was an encoder-decoder network trained to reconstruct a sequence. Both models were trained in a semi-supervised manner, i.e. on data that only depicts normal behaviour, which theoretically should lead to a performance drop on abnormal sequences and thus to higher error terms. In both settings the parameters of a Gaussian distribution were estimated from these error terms, which gave a convenient way of defining a threshold that decides whether an observation is flagged as anomalous. Additional experiments were conducted in which an exponentially weighted moving average over a number of past observations was used to filter the signal. The methods’ performance on this particular task differed considerably, but the regression model showed a lot of promise, especially when combined with a filtering preprocessing step to reduce the noise in the data. However, model selection will always be governed by the nature of the particular task at hand, so the other methods might perform better in other settings.

**Keywords**

**Anomaly Detection in Streaming Data from a Sensor Network (Anomalidetektion i strömmande data från sensornätverk)**

**Summary (Sammanfattning)**

This thesis analyzes the use of unsupervised and semi-supervised machine learning as a possible tool for detecting anomalies in the sensor network of Scania trucks. The study analyzes the need for detecting both point anomalies and contextual anomalies in this environment. For point anomalies the Isolation forest method was used, and for contextual anomalies two recurrent neural network architectures were used. One of the models was simply a many-to-one regression model trained to predict a certain signal, while the other was an encoder-decoder network trained to reconstruct a sequence.

Both models were trained in a semi-supervised manner, i.e. on data that only exhibited normal behaviour, which in theory should lead to reduced performance on abnormal sequences and larger error terms. In both cases the parameters of a Gaussian distribution were estimated from these error terms, which gave a convenient way of defining a threshold for whether an observation should be flagged as an anomaly or not. Furthermore, an exponentially weighted moving average over a number of past observations was applied to filter the signal. The models showed varied performance on this task. However, the regression model showed great promise, especially when combined with a filtering preprocessing step to reduce the noise in the data. Nevertheless, model selection will always be governed by the nature of the task, and other methods could perform better in other environments.

**Keywords (Nyckelord)**

**Acknowledgements**

**Contents**

**1** **Introduction**
1.1 Technical Background
1.2 Problem Description
1.3 Research Questions
1.4 The Data
1.5 Outline
**2** **Theory**
2.1 An Anomaly
2.2 Sensor Networks
2.3 Machine Learning
2.4 Dimensionality Reduction
2.5 Methods for Point Anomaly Detection
2.6 Contextual Anomaly Detection
2.7 Artificial Neural Networks
**3** **Method**
3.1 Data acquisition
3.2 Data exploration
3.3 Data preprocessing
3.4 Anomaly Scoring
3.5 Architecture Implementation
3.6 Training procedure
3.7 Model Evaluation
**4** **Results**
4.1 Scania data
4.2 Synthesized data
**5** **Conclusion**
**6** **Discussion, Delimitation and Future work**

**1** **Introduction**

Technological development in the transportation sector has been dominated by solutions that push for further electrification and autonomy. This development has resulted in an increasing level of complexity in the tasks these solutions are capable of carrying out and the overall communication between electronic components. Society’s call for sustainable solutions that run on renewable energy sources will only accelerate this development even further. The purpose of this thesis is to develop and compare different methods based on machine learning and data analytics that seek to simplify the process of detecting faults in the electronic system in real-time during the vehicle’s testing procedure with special focus given to semi-supervised and unsupervised methods.

**1.1** **Technical Background**

As this project works within topics concerning the automotive industry, a brief technical background is needed in order to understand this setting and the intricacies of a vehicle’s electrical system.

**1.1.1** **Electronic Control Units**

An electronic control unit (ECU) is an electronic device meant to control a certain process that might include the conversion of signals from sensors, data transmission and diagnostic support (*Controller Area Network (CAN) Overview* 2019). Therefore a positive relationship exists between the number of ECUs and the number of sensors and actuators that relay information.

**1.1.2** **Controller Area Network**

it if it is not needed for that ECU’s function (*Controller Area Network (CAN) Overview* 2019). Furthermore, all communication on the network is prioritized by ID, meaning that some signals are of more importance than others. This can result in varying time intervals between signals, especially those of low importance, and thus in irregularly sampled sequences, which is a well known problem in time series analysis (Beygelzimer et al. 2005).

**1.2** **Problem Description**

This trend towards electrification is very visible in Scania’s products: over a period of a few years, the number of Electronic Control Units (ECUs) needed in their vehicles has increased dramatically as more and more sensors and other electric components that require control have been added. Naturally, this increase in the complexity of the vehicle’s electronic system has resulted in a large increase in the communication and data that flow in the vehicle network. This in turn makes the process of ensuring that all the components are serving their intended purpose significantly more tedious and resource demanding. Identifying that something is not functioning as it should is, however, only one part of the problem; identifying exactly which component is causing the problem often requires quite extensive data analysis and testing. That is why an improved ability to catch deviations in the working conditions of electronic components in vehicles is imperative.

During the vehicle’s testing procedure it is the driver’s responsibility to mark down any abnormal behaviour experienced. This is accomplished by pressing a button that creates a timestamp, which can be used as a reference in the post-testing analytics where the test engineers and developers seek to identify the root cause of the problem. This high dependency on the test driver is an aspect that Scania wishes to remove from its testing procedure and replace with a solution capable of carrying out real-time analysis on the data obtained from the ECUs, which will alert the driver of potential problems and establish a basis for further analysis by marking them down in the time.

by the driver.

This thesis looks to investigate the use of advanced analytics and predictive modelling as potential tools to combat this problem with the objective of creating an efficient testing procedure of the vehicles.

**1.3** **Research Questions**

By dissecting the problem description, certain criteria that the model needs to meet in order to function in this environment can be identified. This description therefore lays the ground for the questions that will be addressed in this thesis project, i.e.

• What modelling options are able to detect anomalous behaviour?

• What approaches can be used to combat the detection of false positives?

• What is the models’ potential to generalize to different applications?

• Which model performs best overall?

**1.4** **The Data**

The data set relied on during this project was composed of logged CAN signals acquired from a test vehicle during a two week period. The logs are made up both of sequences that are meant to represent normal behavior and of sequences where certain faults had been injected by making mechanical or electronic adjustments to the vehicle. The data set consists of approximately 1500 separate signals. For confidentiality reasons no true variable names will be included, and a clear description of the process of injecting the faults will not be given in this thesis as it relates to other research projects being conducted at Scania. Therefore each variable will only be referred to as $S_i$, where $i$ is some positive integer. Due to

**1.5** **Outline**

**2** **Theory**

This section lays the necessary theoretical foundation which includes definitions of the different types of anomalous behaviour followed by short descriptions of the different areas of machine learning that were used in this project.

**2.1** **An Anomaly**

An important aspect to take into consideration when it comes to anomaly detection is that there are certain characteristics that make a data sample anomalous. These characteristics can be grouped as point, contextual and collective anomalies, where each group has been researched significantly (Chandola et al. 2009). The different methods for anomaly detection were developed with specific groups in mind, and it is therefore crucial to take that into account when choosing and comparing methods.

**Point anomaly**

The simplest form of an anomaly is the point anomaly. This group has the longest research history and its definition is very close to Hawkins’ definition of an outlier, i.e. an observation that deviates so significantly from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins 1980). This kind of anomaly detection is experimented with in this project, but the temporal dependencies in the data make it an unappealing option.

**Contextual anomaly**

An individual observation can show no anomalous characteristics but, given the surrounding structure or context in which it appears, it has to be considered anomalous. Two sets of attributes define each observation, i.e. contextual

**Collective anomaly**

Similar to contextual anomalies, collective anomalies refer to observations that individually are not anomalous. The collection of these observations however is anomalous with respect to the entire data set (Chandola et al. 2009). This set of anomalies will not be considered individually in this project.

**2.2** **Sensor Networks**

Anomaly detection in data generated by a sensor network is a challenging endeavour. Faults can stem from different sources, such as faulty sensors as well as abnormal events, e.g. intrusion attacks (Chandola et al. 2009). High levels of noise in the data also add to the complexity, as noise is hard to distinguish from abnormality, leading to a high rate of false positives. In this setting the data is generated by a sensor network, resulting in a multivariate time series made up of 1415 signals. However, the majority of the signals have a lot of missing samples as well as other quality issues. This is most likely due to limitations in the communication capabilities of the ECUs that collect and transmit these signals. Limitations like these called for extensive preprocessing before any modelling could be seriously considered. This calls for methods that allow us to detect and uncover patterns in data, and that is where we enter the world of statistics and machine learning.

**2.3** **Machine Learning**

Machine learning (ML) offers a set of methods with a probabilistic foundation intended for the purpose of automated data analysis and pattern recognition (Murphy 2012, p 1-2). Machine learning methods are commonly further divided into two categories, i.e. supervised and unsupervised learning. In the context of this thesis it is fitting to include a third category, namely semi-supervised learning.

**2.3.1** **Supervised Learning**

$\mathbf{x}$, made up of a number of features, to a response variable $y$, also known as a label, by feeding the learning algorithm a training set, or more specifically a number of input-output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ (Murphy 2012, p 3). Here, the output $y_i$ can be either continuous or categorical and, based on that, the specific problem at hand is referred to as a regression problem in the continuous case and a classification problem in the categorical case. The dependency on labeled data can be very limiting in certain areas, and it certainly is in the context of anomaly detection in time series, where the amounts of data are often enormous but labeled states are scarce or nonexistent in most cases (Chandola et al. 2009). Therefore no fully supervised learning methods were analyzed in this project.

**2.3.2** **Unsupervised Learning**

Unsupervised learning is completely free from the dependency on labeled data described in the supervised setting. Here the only data fed into the algorithms is the input $\{\mathbf{x}_i\}_{i=1}^{N}$, which is manipulated in different ways to extract new knowledge.

What makes this setting challenging is that it is often more troublesome to define a problem in an unsupervised manner, as it lacks concrete ways of evaluating the results, as opposed to prediction accuracy in the supervised setting (Murphy 2012, p 9-10). Here one seeks to construct models of the form $p(\mathbf{x}_i \mid \theta)$ and therefore

**2.3.3** **Semi-supervised Learning**

Semi-supervised learning is a different paradigm that is harder to define. The most classical representation of semi-supervised learning is when the power of supervised and unsupervised learning methods is combined into a single model. However, it is also argued that the setting where models are used to estimate the density of data, under the assumption that the underlying data generation is of a certain nature such as depicting normal working conditions of the sensors in the network, should be considered semi-supervised (Chandola et al. 2009; Chalapathy & Chawla 2019). Here the data is not annotated in any way, but one could still say that making that assumption and using the data as a representation of normal behaviour is in effect annotation. This setting was prioritized in this project as data depicting normal behaviour was available.

**2.4** **Dimensionality Reduction**

An area that is important to keep in mind when working with any kind of statistical modelling is how the models handle increases in dimensionality. A phenomenon known as the curse of dimensionality is a well established problem when working with sequential data (Verleysen & François 2005). In this project a procedure known as Principal Component Analysis was applied to reduce the dimensionality.

**2.4.1** **Principal Component Analysis**

Hotelling (1933) described Principal Component Analysis (PCA) as the orthogonal projection of the data onto a lower dimensional linear subspace, such that the variability of the projected data is maximized. It is a technique that allows for the extraction of a lower dimensional representation of the data that contains as much as possible of the total variance. Given data where each observation has $p$ features, PCA assembles these dimensions, or principal components, through linear combinations of the $p$ features found in the original data.

mean and covariance matrix can be defined in the following way

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \tag{1}$$

$$\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T. \tag{2}$$

This allows the variance of the projection of $x$ to be written as

$$\frac{1}{N}\sum_{i=1}^{N} (\phi_1^T x_i - \phi_1^T \bar{x})(\phi_1^T x_i - \phi_1^T \bar{x})^T = \phi_1^T \Sigma \phi_1. \tag{3}$$

This projected variance can then be maximized with respect to $\phi$, subject to the constraint that the sum of squares of $\phi$ is equal to 1, because otherwise increasing the magnitude of $\phi$ would increase the variance without bound. This results in the following optimization problem.

$$\max_{\phi}\ \phi^T \Sigma \phi \quad \text{s.t.} \quad \phi^T \phi = 1 \tag{4}$$

This is solved by rewriting the problem as an unconstrained optimization problem using a Lagrange multiplier $\lambda$, which yields

$$\phi^T \Sigma \phi + \lambda (1 - \phi^T \phi). \tag{5}$$

By setting the derivative of eq. 5 with respect to $\phi$ equal to zero one obtains

$$\Sigma \phi = \lambda \phi, \tag{6}$$

which, after left-multiplying by $\phi^T$, can be written as

$$\phi^T \Sigma \phi = \lambda \tag{7}$$

using the fact that $\phi^T \phi = 1$.

Therefore the variance will be maximized when $\phi$ is equal to the eigenvector corresponding to the largest eigenvalue $\lambda$ (Bishop 2006, p 561-569). This

that explain the maximum of the remaining variance in the original data.
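The eigenvector characterisation in eq. 6 and eq. 7 can be illustrated with a small sketch. It is not the implementation used in this project: it uses plain Python and power iteration (a technique chosen here only for brevity) to find the leading eigenvector of the covariance matrix of eq. 2. The data `X`, the function names and the iteration count are all hypothetical.

```python
import math
import random


def covariance_matrix(X):
    """Sample covariance as in eq. 2 (normalised by N)."""
    n, p = len(X), len(X[0])
    x_bar = [sum(row[k] for row in X) / n for k in range(p)]  # eq. 1
    centred = [[row[k] - x_bar[k] for k in range(p)] for row in X]
    return [[sum(r[a] * r[b] for r in centred) / n for b in range(p)]
            for a in range(p)]


def matvec(S, v):
    """Matrix-vector product for lists of lists."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]


def first_principal_component(sigma, iterations=200, seed=0):
    """Power iteration; returns (phi, lam) with sigma phi ~= lam phi."""
    rng = random.Random(seed)
    phi = [rng.gauss(0.0, 1.0) for _ in sigma]  # random start vector
    for _ in range(iterations):
        phi = matvec(sigma, phi)
        norm = math.sqrt(sum(v * v for v in phi))
        phi = [v / norm for v in phi]  # enforce phi^T phi = 1
    lam = sum(a * b for a, b in zip(phi, matvec(sigma, phi)))  # eq. 7
    return phi, lam


# Hypothetical two-dimensional data with strongly correlated features.
X = [[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]]
sigma = covariance_matrix(X)
phi, lam = first_principal_component(sigma)
print(phi, lam)
```

Here `lam` is the variance of the data projected onto `phi`, so it is at least as large as the variance along either original axis. In practice a library eigendecomposition would normally replace the power iteration.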

**2.5** **Methods for Point Anomaly Detection**

Given the nature of the sequences generated by the sensor network, the author finds it unlikely that methods for point anomaly detection would be effective. The use of these kinds of methods was nevertheless analyzed by means of a well known outlier detection method known as Isolation forest.

**2.5.1** **Isolation Forest**

Liu et al. presented a method that approaches the subject through the measure of isolation (Liu et al. 2008). The method is based on the use of an isolation tree, which is a proper binary tree where each node is either a leaf with no children or an internal node that includes a test in which an attribute $q$ is compared with a split value $p$, such that the test $q < p$ partitions the observations at the node between exactly two child nodes $T_l$ and $T_r$.

When an isolation tree is constructed using a data set $X = \{x_1, x_2, ..., x_n\}$ made up of $n$ observations, each of which is $d$-dimensional, the feature space is repeatedly partitioned by selecting a feature $q$ and a value $p$ in the range of that feature until a specified tree height limit is reached or the child node only includes homogeneous observations (Liu et al. 2008).

This method then utilizes two characteristics of point anomalies, i.e. the feature values of anomalies are likely to be substantially different from the rest and the number of anomalies is low. When the feature space is randomly partitioned while building random trees, this appears as a smaller number of partitions between the root and an anomaly, resulting in a shorter path, since the distinguishable nature of an anomaly makes it likely to be separated early in the process. When an ensemble of these trees collectively flags observations as having considerably shorter path lengths, this is a strong indication of anomalous behaviour (Liu et al. 2008). Let us denote the path length of an observation $x_i$, which corresponds to the number of edges between the root of a tree and the leaf $x_i$ belongs to, by $h(x_i)$ for $i \in \{1, ..., n\}$.

abnormality of the observation. This system is based on the theory behind an unsuccessful search in a randomized binary search tree. It defines the average path length of an unsuccessful search in a binary search tree as

$$c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \tag{8}$$

where $H(i)$ represents the harmonic number. The anomaly score is then defined as the average path length $\bar{h}(x_i)$ over all the trees, normalized using $c(n)$ in the following manner,

$$s(x, n) = 2^{-\frac{\bar{h}(x)}{c(n)}}. \tag{9}$$

This score can then be used to make assessments, as higher scores indicate anomalous behaviour.

This method is not designed with temporal dependencies in mind but the use of the method in combination with a sliding window technique has been researched and been shown to be useful (Ding 2013).
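To make the scoring concrete, the following is a minimal sketch of an isolation forest in plain Python. It is not the implementation used in this project. The subsample size, the number of trees, the height limit of $\lceil \log_2 \psi \rceil$ and the logarithmic approximation of the harmonic number follow common practice from Liu et al. (2008) and are assumptions here, as are all function names and the synthetic data.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful BST search (eq. 8)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value, since H(1) = 1
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(X, height, limit, rng):
    """Recursively partition X with random (feature, value) splits."""
    if height >= limit or len(X) <= 1:
        return {"size": len(X)}
    q = rng.randrange(len(X[0]))
    lo, hi = min(x[q] for x in X), max(x[q] for x in X)
    if lo == hi:  # no split possible on this feature
        return {"size": len(X)}
    p = rng.uniform(lo, hi)
    return {
        "q": q,
        "p": p,
        "left": build_tree([x for x in X if x[q] < p], height + 1, limit, rng),
        "right": build_tree([x for x in X if x[q] >= p], height + 1, limit, rng),
    }


def path_length(x, node, depth=0):
    """Edges from the root to the leaf of x, plus c(size) for unsplit leaves."""
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(X, points, n_trees=100, sample_size=64, seed=0):
    """Score 2^(-mean path length / c(psi)); assumes at least two samples."""
    rng = random.Random(seed)
    psi = min(sample_size, len(X))
    limit = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(X, psi), 0, limit, rng)
             for _ in range(n_trees)]
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * c(psi)))
            for x in points]


# Hypothetical data: a Gaussian cloud plus one distant point.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(256)]
X.append([8.0, 8.0])
scores = anomaly_scores(X, [[0.0, 0.0], [8.0, 8.0]])
print(scores)
```

The distant point is isolated after only a few splits in most trees, so its score is higher than that of the point in the centre of the cloud.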

**2.6** **Contextual Anomaly Detection**

As a research topic, unsupervised contextual anomaly detection makes up a large portion of modern anomaly detection research (Chandola et al. 2009). Many different methods have been designed and investigated, and they approach this problem in different ways. A popular option is to use well established regression techniques or statistical modeling to construct predictive anomaly detectors, while others seek to construct a profile or representation of normal behaviour and use it to judge the abnormality of the signals. Due to this difference the techniques will be split into predictive and non-predictive anomaly detection moving forward.

**2.6.1** **Predictive anomaly detection**

is highly accurate in its predictions under normal conditions and will therefore have a low prediction error, but in cases of anomalous behaviour the model should not be as accurate, leading to larger errors. This is obtained by training the models in a semi-supervised manner, i.e. by only training on data that is representative of normal behaviour. This variability in the prediction errors can be modeled using e.g. thresholding techniques that allow for the construction of a binary classification problem where observations are classified as either normal or anomalous. For this setting essentially any regression technique that is designed for sequential learning could be used. In this project only artificial neural network architectures were considered, with special emphasis placed on recurrent neural networks with long short-term memory units as presented in Malhotra et al. (2015); for further theoretical details see sections 2.7.2 and 2.7.3.
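One simple way of turning prediction errors into a binary decision is sketched below. A Gaussian is fitted to errors collected on normal data, and a new error is flagged when it lies more than $k$ standard deviations from the mean. The rule, the value $k = 3$ and the example numbers are hypothetical; the scoring actually used in this project is the subject of section 3.4 and may differ.

```python
import statistics


def fit_error_model(normal_errors):
    """Estimate the Gaussian parameters (mean, std) from errors on normal data."""
    mu = statistics.fmean(normal_errors)
    sigma = statistics.pstdev(normal_errors)
    return mu, sigma


def flag_anomalies(errors, mu, sigma, k=3.0):
    """Flag errors lying more than k standard deviations from the mean."""
    return [abs(e - mu) > k * sigma for e in errors]


# Hypothetical prediction errors from a model trained on normal behaviour.
normal_errors = [0.10, -0.05, 0.02, -0.08, 0.04, 0.00, -0.03, 0.07]
mu, sigma = fit_error_model(normal_errors)

new_errors = [0.01, -0.06, 0.90]
print(flag_anomalies(new_errors, mu, sigma))  # [False, False, True]
```

A larger $k$ trades fewer false positives for a higher risk of missed anomalies.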

**2.6.2** **Non-predictive anomaly detection**

The models that fall into this group are in general unsupervised learning techniques that construct a profile of sorts using training data depicting normal behaviour. This set of methods includes methods best known for representation learning, such as Principal Component Analysis (Viswanath et al. 2014) and encoder-decoder networks (Malhotra et al. 2016), as well as generative methods such as Generative Adversarial Networks (Li et al. 2019).

In this project the encoder-decoder recurrent neural network using Long Short-Term Memory (LSTM) units was chosen, as it is well suited to handle the sequential nature of the data. The background is laid out in sections 2.7.3 and 2.7.4.

**2.7** **Artificial Neural Networks**

**2.7.1** **Feed-Forward Neural Networks**

The feed forward neural network is essentially a composition of functions of the form

$$f_\theta(x) = f^{(m)}_{\theta_m}(\dots f^{(1)}_{\theta_1}(x)). \tag{10}$$

This function is meant to serve as a mapping from an input $x$ to a response variable $y$ (Goodfellow et al. 2016, p 163-166). Each function $f^{(i)}_{\theta_i}$ can be interpreted as the transition between hidden layers and has the following form

$$f^{(i)}_{\theta_i}(h^{(i-1)}) = g^{(i)}(W^{(i)} h^{(i-1)} + b^{(i)}). \tag{11}$$

Here $g^{(i)}$ represents an activation function, most often a non-linear function, and $h^{(i-1)}$ represents the value at the $(i-1)$-th layer. The parameter set at the $i$-th layer is denoted by $\theta_i = (W^{(i)}, b^{(i)})$; in general $W^{(i)}$ is referred to as a weight matrix and $b^{(i)}$ is referred to as a bias.

In this project only the tanh activation function was relied on; it is shown in eq. 12.

$$g_{\tanh}(z) = \frac{\sinh(z)}{\cosh(z)} = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \tag{12}$$
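The composition in eq. 10 with the layer transition of eq. 11 and the tanh activation of eq. 12 can be sketched in a few lines of plain Python. The layer sizes and weight values below are hypothetical and only illustrate the mechanics.

```python
import math


def tanh(z):
    """Eq. 12 written out with exponentials (fine for moderate |z|)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def layer(W, b, h_prev):
    """One transition h = tanh(W h_prev + b), as in eq. 11."""
    return [tanh(sum(w * h for w, h in zip(row, h_prev)) + bias)
            for row, bias in zip(W, b)]


def forward(params, x):
    """Composition of layers as in eq. 10; params is a list of (W, b)."""
    h = x
    for W, b in params:
        h = layer(W, b, h)
    return h


# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
y = forward(params, [1.0, 2.0])
print(y)
```

Because every layer ends in a tanh, each output component lies strictly between -1 and 1.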

In order for a neural network to be of any use the mapping has to be fitted to the problem at hand. The most commonly used procedure is to apply the backpropagation algorithm, where the chain rule on a directed graph is applied to compute the gradients needed to drive the stochastic gradient descent algorithm that minimizes a loss function. The general outline of the algorithm can be described in the following manner (Hult 2018). Consider a subset $L$ of the nodes $1, ..., n$ in a computational graph, with $Pa(L)$ defined as the set of nodes that are parent nodes to $L$, i.e. $Pa(L) = \cup_{i \in L} Pa(i)$.

1. Make a forward pass by computing the values at the hidden layers given an input $x$.

2. Set $\frac{\partial h_n}{\partial h_n} = 1$ and $L = \{n\}$.

3. For each $j \in Pa(L)$, compute

$$\frac{\partial h_n}{\partial h_j} = \sum_{i : j \in Pa(i)} \frac{\partial h_n}{\partial h_i} \frac{\partial h_i}{\partial h_j}.$$

4. Update $L = Pa(L)$.

5. Repeat steps 3 and 4 until $L = \emptyset$.
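The steps above can be sketched on a tiny computational graph. The graph, its node functions and all names are hypothetical. As a simplification, the sketch assumes the nodes are numbered in topological order and visits them in reverse order instead of tracking the set $L$ explicitly, which evaluates the same chain-rule sum for every node. The starting value $\partial h_n / \partial h_n = 1$ is also an assumption of the sketch.

```python
import math

# Hypothetical graph: h1 = x, h2 = tanh(h1), h3 = h1 * h2, h4 = h3 + h2.
parents = {1: [], 2: [1], 3: [1, 2], 4: [2, 3]}


def forward(x):
    """Forward pass: compute the value at every node."""
    h = {1: x}
    h[2] = math.tanh(h[1])
    h[3] = h[1] * h[2]
    h[4] = h[3] + h[2]
    return h


def local_partial(i, j, h):
    """Partial derivative of h_i with respect to its parent h_j."""
    if (i, j) == (2, 1):
        return 1.0 - h[2] ** 2
    if (i, j) == (3, 1):
        return h[2]
    if (i, j) == (3, 2):
        return h[1]
    if (i, j) in ((4, 2), (4, 3)):
        return 1.0
    raise KeyError((i, j))


def backward(h, n=4):
    """Reverse sweep: grad[j] holds the derivative of h_n with respect to h_j."""
    grad = {n: 1.0}
    for j in range(n - 1, 0, -1):
        # Sum over all children i of j, i.e. all i with j in Pa(i).
        grad[j] = sum(grad[i] * local_partial(i, j, h)
                      for i in parents if j in parents[i])
    return grad


h = forward(0.7)
grad = backward(h)
print(grad[1])
```

The result can be checked against a finite-difference approximation of the derivative of `forward(x)[4]` with respect to `x`.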

**2.7.2** **Recurrent Neural Networks**

A recurrent neural network (RNN) is an ANN designed for sequential modeling and is therefore very convenient for anomaly detection in sequential data. One way of describing the mechanics is from the point of view of a hidden Markov model (HMM) (Hult 2018).

Consider an unobserved hidden Markov chain $\{h_t\}$ with transition probability $p(h_t \mid h_{t-1})$ and observations $\{y_t\}$ from $p(y_t \mid h_t)$, with $y_1, ..., y_T$ being conditionally independent given the corresponding hidden states. It is assumed that the two distributions have their own sets of parameters, denoted by $\theta$ and $\phi$, and the system is extended further to include an observed input $\{x_t\}$ in the dynamics behind the hidden Markov chain $\{h_t\}$. This modification yields the following system

$$\begin{aligned} h_t \mid h_{t-1}, x_t &\sim p(h_t \mid h_{t-1}, x_t, \theta), \\ y_t \mid h_t &\sim p(y_t \mid h_t, \phi). \end{aligned} \tag{13}$$

This setup does not allow for use of maximum likelihood estimation of the sets of parameters $\theta$ and $\phi$ due to the fact that the likelihood needs to be marginalized over the hidden states, i.e.

Proceeding in this manner will be troublesome and the utilization of advanced computational methods such as Markov Chain Monte Carlo methods would be needed. The RNN however works under the assumption that the transition from one hidden state to another is determined by a deterministic function f. This simplifies the system given in eq. 13 to

$$\begin{aligned} h_t &= f(h_{t-1}, x_t; \theta), \\ y_t \mid h_t &\sim p(y_t \mid h_t, \phi). \end{aligned} \tag{15}$$

Making the hidden states a function of the input allows for the likelihood to be written as

$$p(y_1, ..., y_T \mid x_1, ..., x_T, \theta, \phi) = p(y_1, ..., y_T \mid x_1, ..., x_T, h_1, ..., h_T, \theta, \phi) = \prod_{t=1}^{T} p(y_t \mid h_t, \phi). \tag{16}$$

Now the process of maximizing the likelihood is equivalent to minimizing the negative log-likelihood.

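Using a function to determine the transitions between hidden states allows the model to store a lot of historical information, and the model’s ability to apply non-linear dynamics yields high modelling flexibility. When working with sequential data the well known backpropagation algorithm (Rumelhart et al. 1986) requires a slight adjustment to produce what is known as backpropagation through time (BPTT). Given a loss function

$$L(\{x_t, y_t\}; \theta, \phi) = \sum_{t=1}^{T} L_t(y_t, g(h_t; \phi)), \tag{17}$$

the partial derivative with respect to $\theta$ is

$$\begin{aligned} \frac{\partial}{\partial \theta} L(\{x_t, y_t\}; \theta, \phi) &= \sum_{t=1}^{T} \frac{\partial}{\partial \theta} L_t(y_t, g(h_t; \phi)) \\ &= \sum_{t=1}^{T} \frac{\partial L_t(y_t, g(h_t; \phi))}{\partial g} \frac{\partial g(h_t; \phi)}{\partial h_t} \frac{\partial h_t}{\partial \theta} \\ &= \sum_{t=1}^{T} \frac{\partial L_t(y_t, g(h_t; \phi))}{\partial g} \frac{\partial g(h_t; \phi)}{\partial h_t} \left( \left. \frac{\partial f(h_{t-1}, x_t; \theta_t)}{\partial \theta_t} \right|_{\theta_t = \theta} + \frac{\partial f(h_{t-1}, x_t; \theta)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial \theta} \right). \end{aligned} \tag{18}$$

Here $\theta_t$ serves as a dummy variable, since $h_t$ depends on $\theta$ both directly through the function $f(h_{t-1}, x_t; \theta)$ and through the past hidden state $h_{t-1}$. This last term is iterated through backwards from $t = T, ..., 1$ in order to compute all derivatives $\frac{\partial h_t}{\partial \theta}$. The second partial derivative is somewhat simpler, i.e.

$$\frac{\partial}{\partial \phi} L(\{x_t, y_t\}; \theta, \phi) = \sum_{t=1}^{T} \frac{\partial}{\partial \phi} L_t(y_t, g(h_t; \phi)) = \sum_{t=1}^{T} \frac{\partial L_t(y_t, g(h_t; \phi))}{\partial g} \frac{\partial g(h_t; \phi)}{\partial \phi}. \tag{19}$$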
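The recursion for $\partial h_t / \partial \theta$ in eq. 18 can be checked numerically on a scalar RNN. The sketch below assumes a hypothetical model $h_t = \tanh(w h_{t-1} + u x_t)$ with prediction $g(h_t; v) = v h_t$ and squared error loss, and differentiates only with respect to the transition weight $w$. For brevity it accumulates $\partial h_t / \partial w$ forward in time alongside the forward pass; a backward sweep over $t = T, ..., 1$ yields the same gradient.

```python
import math


def rnn_loss_and_grad(w, u, v, xs, ys, h0=0.0):
    """Return the summed squared error and its derivative with respect to w."""
    h, dh_dw = h0, 0.0
    loss, grad = 0.0, 0.0
    for x, y in zip(xs, ys):
        h_prev, dh_prev = h, dh_dw
        h = math.tanh(w * h_prev + u * x)
        # Direct dependence on w plus dependence through h_{t-1} (eq. 18).
        dh_dw = (1.0 - h * h) * (h_prev + w * dh_prev)
        err = v * h - y
        loss += err * err
        # dL_t/dg = 2 * err and dg/dh_t = v.
        grad += 2.0 * err * v * dh_dw
    return loss, grad


# Hypothetical parameters and a short input/target sequence.
w, u, v = 0.8, 0.5, 1.2
xs = [0.5, -0.1, 0.3, 0.9]
ys = [0.2, 0.1, 0.0, 0.4]
loss, grad = rnn_loss_and_grad(w, u, v, xs, ys)
print(loss, grad)
```

Comparing `grad` with a central finite difference of the loss in `w` is a quick sanity check that both terms of the recursion are accounted for.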

Shortcomings of the standard RNN are the phenomena known as vanishing and exploding gradients. This is a problem that appears frequently in practice when long term dependencies are involved (Bengio et al. 1994). These phenomena have been researched significantly and are quite complex in nature. However, an intuitive explanation can be established by simplifying the recurrence in the RNN as a function composition as in equation 20 (Goodfellow et al. 2016, p 390-392).

$$h_t = W^T h_{t-1} = (W^t)^T h_0 \tag{20}$$

Given that $W$ can be eigendecomposed as $W = Q \Lambda Q^T$ with orthogonal $Q$, the recurrence can be written as

$$h_t = Q \Lambda^t Q^T h_0. \tag{21}$$

Now as $t$ increases the compounding effect is highly influenced by the eigenvalues, i.e. eigenvalues with magnitude lower than one will decay to zero while eigenvalues with magnitude higher than one will tend to infinity.

This compounding pushes the gradients to zero or to infinity, making the training procedure very troublesome.
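The effect of the eigenvalues can be seen in a small numerical sketch. A symmetric $2 \times 2$ matrix is built from chosen eigenvalues and a rotation, and the linear recurrence is applied repeatedly. Because the matrix is symmetric, $W^T = W$. The eigenvalues, the rotation angle and the number of steps are hypothetical.

```python
import math


def matvec(W, h):
    """Matrix-vector product for lists of lists."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def symmetric_from_eigen(lam1, lam2, angle):
    """W = Q diag(lam1, lam2) Q^T with Q a rotation by `angle`."""
    c, s = math.cos(angle), math.sin(angle)
    off_diagonal = (lam1 - lam2) * c * s
    return [[lam1 * c * c + lam2 * s * s, off_diagonal],
            [off_diagonal, lam1 * s * s + lam2 * c * c]]


def norm_after(W, h0, t):
    """Euclidean norm of h_t after t applications of h <- W h."""
    h = h0
    for _ in range(t):
        h = matvec(W, h)
    return math.hypot(*h)


shrinking = symmetric_from_eigen(0.9, 0.5, 0.3)  # all eigenvalues below 1
growing = symmetric_from_eigen(1.1, 0.5, 0.3)    # one eigenvalue above 1

print(norm_after(shrinking, [1.0, 1.0], 50))
print(norm_after(growing, [1.0, 1.0], 50))
```

After 50 steps the first norm has almost vanished while the second has grown by roughly two orders of magnitude, even though the largest eigenvalues differ only by 0.2.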

**2.7.3** **Long Short-Term Memory**

The Long Short-Term Memory (LSTM) cell was introduced as a way to combat the problem of vanishing gradients by setting up paths based on internal recurrence that allow for better flow of gradients (Hochreiter & Schmidhuber 1997). The LSTM accomplishes this by incorporating a more complex system of gating units that control information flow. There are variants of the LSTM unit, so for transparency’s sake the following explanation is based on Goodfellow et al. (2016, p 397-400). Here the key component is the internal state $s^t$, which is meant to capture dependencies over a longer time interval. Information from the internal state is combined with information from the previous hidden state $h_i^{t-1}$ and the new input $x_i^t$ through four different gates to produce the output of the hidden layer $h_i^t$. This procedure is visualized in figure 2.1.

Figure 2.1: LSTM cell (Goodfellow et al. 2016, p 398)

Each gate applies a sigmoid activation function and has its own bias terms $b$, input weights $U$, and recurrent weights $W$.

The input gate $g^t$ serves the purpose of determining what new information should be stored in the cell state and is denoted by

$$g_i^t = \mathrm{sigm}\Big(b_i^g + \sum_j U_{ij}^g x_j^t + \sum_j W_{ij}^g h_j^{t-1}\Big). \tag{22}$$

The forget gate works to prioritize past information and is meant to determine how much of the prior context is needed, 0 for no context and 1 for full context, and is defined as

$$f_i^t = \mathrm{sigm}\Big(b_i^f + \sum_j U_{ij}^f x_j^t + \sum_j W_{ij}^f h_j^{t-1}\Big). \tag{23}$$

The output gate works similarly to the input gate but determines what portion of the information should be fed out as output, as opposed to what information should be stored. It can be written as

$$q_i^t = \mathrm{sigm}\Big(b_i^o + \sum_j U_{ij}^o x_j^t + \sum_j W_{ij}^o h_j^{t-1}\Big). \tag{24}$$

The output of the hidden layer is then the internal state passed through a tanh activation multiplied by the output of the output gate, i.e.

$$h_i^t = \tanh(s_i^t)\, q_i^t. \tag{25}$$

The internal cell state is then updated using

$$s_i^t = f_i^t s_i^{t-1} + g_i^t\, \mathrm{sigm}\Big(b_i + \sum_j U_{ij} x_j^t + \sum_j W_{ij} h_j^{t-1}\Big). \tag{26}$$
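A single time step of the cell described by eq. 22 to eq. 26 can be sketched in plain Python as follows. The parameter layout (a dictionary mapping each gate to a `(b, U, W)` triple), the layer sizes and the numeric values are hypothetical and are not taken from the models trained in this project.

```python
import math


def sigm(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(b, U, W, x, h_prev):
    """b_i + sum_j U_ij x_j + sum_j W_ij h_j for every unit i."""
    return [b_i
            + sum(u * x_j for u, x_j in zip(U_i, x))
            + sum(w * h_j for w, h_j in zip(W_i, h_prev))
            for b_i, U_i, W_i in zip(b, U, W)]


def lstm_step(params, x, h_prev, s_prev):
    """One LSTM time step; returns the new hidden state and internal state."""
    g = [sigm(z) for z in affine(*params["g"], x, h_prev)]  # input gate, eq. 22
    f = [sigm(z) for z in affine(*params["f"], x, h_prev)]  # forget gate, eq. 23
    q = [sigm(z) for z in affine(*params["o"], x, h_prev)]  # output gate, eq. 24
    # Candidate input; eq. 26 applies a sigmoid to it.
    cand = [sigm(z) for z in affine(*params["s"], x, h_prev)]
    s = [f_i * s_i + g_i * c_i
         for f_i, s_i, g_i, c_i in zip(f, s_prev, g, cand)]  # eq. 26
    h = [math.tanh(s_i) * q_i for s_i, q_i in zip(s, q)]      # eq. 25
    return h, s


# Hypothetical cell with one input signal and two hidden units.
params = {
    "g": ([0.0, 0.1], [[0.5], [-0.3]], [[0.1, 0.0], [0.0, 0.1]]),
    "f": ([1.0, 1.0], [[0.2], [0.4]], [[0.0, 0.1], [0.1, 0.0]]),
    "o": ([0.0, 0.0], [[0.3], [0.3]], [[0.2, 0.0], [0.0, 0.2]]),
    "s": ([0.0, 0.0], [[1.0], [-1.0]], [[0.0, 0.0], [0.0, 0.0]]),
}
h, s = [0.0, 0.0], [0.0, 0.0]
for x_t in [0.5, -1.0, 0.25]:
    h, s = lstm_step(params, [x_t], h, s)
print(h, s)
```

With a forget gate close to 1 and an input gate close to 0 the internal state is carried over almost unchanged, which is the mechanism that lets information and gradients persist over many time steps.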

**2.7.4** **Encoder-Decoder Network**

An RNN encoder-decoder network is a useful tool for sequence to sequence modelling (Goodfellow et al. 2016, p 385-386). Consider a time series $X = \{x_1, x_2, ..., x_T\}$ where each sample $x_i \in \mathbb{R}^D$ is a $D$-dimensional vector of signal readings. A subsequence $X^{in} = \{x_1, x_2, ..., x_L\}$ of a fixed length $L$ is fed to the network as shown in figure 2.2 for a sequence of length $L = 3$. The network uses the side that is referred to as the encoder to create a representation as the output of the last hidden layer of the encoder. This representation corresponds to the encoded data layer of figure 2.2, which can then be reconstructed using the decoder, which is a mirror of the encoder architecture, to obtain a reconstructed sequence. The neurons and the dynamics are still as described in sections 2.7.2 and 2.7.3 if LSTM units are used. The largest difference between the architectures is the manner in which they are trained. In this setting the model is trained to minimize the mean square error between the input and the output of the model, i.e.

$$\frac{1}{L}\sum_{i=1}^{L} (x_i^{in} - x_i^{out})^2. \tag{27}$$

_{i}**3**

**Method**

This section is meant as a detailed explanation of the strategy that was assembled in order to answer the research questions that were established in section 1.3.

**3.1**

**Data acquisition**

As described in section 1.4, two data sources were used to evaluate the models' performance. This is mostly due to the fact that faults can appear differently depending on the setting. Therefore the other data set was experimented on to get a more thorough analysis of how the different models behave.

**3.1.1** **Scania data**

The data obtained from Scania was logged for the purpose of creating a basis for unsupervised anomaly detection research. Over a two week period a test vehicle was driven and CAN signals were logged. A team of mechanics and electricians was assembled. Their role was both to make sure the vehicle was working under normal conditions for a major part of the two week period and to make mechanical and/or electrical adjustments to the vehicle in order to introduce injected faults into the system, one at a time. In total 10 different injected faults were logged for analysis. However, they all share the characteristic that no information exists apart from the fact that an injected fault can be found in a particular data file. This means that no record of anomalous activity on a per-observation basis is available, and therefore the data cannot be used for obtaining summary statistics such as precision and recall. Therefore the experiments were conducted under the following assumptions.

• The normal drives contain no anomalous activity.

• The faulty files only contain the injected fault and no other anomalous activity.

Figure 3.1: (Left) Example of a time series from a normal drive. (Right) Example of a time series with an injected fault.

Figure 3.1 illustrates the difference in behaviour between a normal and a faulty drive. There are clear subsequences in the faulty data where the signal is constant.

**3.1.2** **Synthesized data**

To get a better idea of the capabilities of the models and their performance, experiments using synthesized data were also conducted. The purpose of these experiments was to investigate the capabilities of the encoder-decoder network to detect faults of different nature than the ones that existed in this particular Scania setting.

**Noisy Sinusoidal Wave**

A sinusoidal wave was synthesized and random noise, generated by a Gaussian distribution with zero mean and a standard deviation of $\sigma = 0.1$, was added to each observation. The underlying signal $s(t)$ was generated by the equation

$$s(t) = A \sin(2\pi f t + \phi) \tag{28}$$
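A signal of this kind can be generated in a few lines. The sketch below follows eq. 28 with additive Gaussian noise of standard deviation 0.1. The amplitude, frequency, phase and length are not stated in the text, so the values used here are assumptions chosen purely for illustration.

```python
import numpy as np

def noisy_sine(n, A=1.0, f=0.01, phi=0.0, sigma=0.1, seed=0):
    """Return (noisy, clean) samples of s(t) = A sin(2 pi f t + phi).

    A, f, phi and n are hypothetical defaults; only sigma = 0.1 is
    taken from the text.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    clean = A * np.sin(2.0 * np.pi * f * t + phi)
    noisy = clean + rng.normal(loc=0.0, scale=sigma, size=n)
    return noisy, clean

signal, clean = noisy_sine(5000)
```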

Figure 3.2: Example of the synthesized sinusoidal wave

**3.2**

**Data exploration**

During the first weeks the emphasis was placed on preliminary data exploration, with special focus on removing irrelevant sequences that were not of importance for this project. To start off with, all sequences that were constant over the entire collection period were removed. Additionally, signals from sensors such as radars and cameras were removed as they were unrelated to the injected faults that were of interest. This brought the dimensionality down to 264 sequences from the 1415 found in the original data set. Ultimately, after conducting correlation analysis, the subset of signals selected consisted of the signal $S_1$, which depicts the faulty behaviour in the clearest manner, and the nine highest rated signals according to the importance score generated from fitting a regression model made up of 5000 random trees with a maximum length of 5000 to those 264 sequences. This model was trained to best predict $S_1$. The resulting signals are listed in table 3.1.

Table 3.1: Random Forest Importance Score for the selected signals

**Signal** **RF Importance Score**

$S_2$ 0.6696729292914351
$S_3$ 0.20870335815822644
$S_4$ 0.08991379993038735
$S_5$ 0.017335697225209827
$S_6$ 0.009949318219779535
$S_7$ 0.001381075421863678
$S_8$ 0.0004251155893856264
$S_9$ 0.0003240885292837287
$S_{10}$ 0.0002600217598883752

Random forest is a well known ensemble technique that is very commonly used in machine learning (Breiman 2001). It constructs multiple binary decision trees which all have a say in the final output of the model. A very handy tool that the Random Forest method offers is the feature importance scoring used to obtain the set of 10 signals described above. This is accomplished by using a measure called mean decrease impurity where, for the regression setting, the variance is used as a measure of impurity (Louppe et al. 2013, Ronaghan 2018). The impurity decrease is computed for each node $j$ in a tree, i.e.

$$\Delta I_j = w_j I_j - w_{left,j} I_{left,j} - w_{right,j} I_{right,j}. \tag{29}$$

Here $w_j$ represents the weighted number of samples in node $j$, $I_j$ is the impurity value of node $j$, and the left and right children of node $j$ are denoted by $left,j$ and $right,j$. The feature importance for each tree is then computed and normalized as

$$Fi_i = \frac{\sum_j \mathbf{1}_{\{\text{node } j \text{ is split by feature } i\}}\, \Delta I_j}{\sum_{k \in \text{all nodes}} \Delta I_k}, \tag{30}$$

$$Fi_i^{norm} = \frac{Fi_i}{\sum_j Fi_j}. \tag{31}$$

Finally the average feature importance score over the entire forest made up of $T$ trees is given as the importance score

$$RFIS_i = \frac{\sum_{j=1}^{T} Fi_{i,j}^{norm}}{T}. \tag{32}$$
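As an illustration of how such a ranking can be obtained in practice, the sketch below fits a scikit-learn random forest regressor to synthetic data and reads its impurity-based `feature_importances_`. The data, the number of candidate signals and the forest size are assumptions made to keep the example small; the project itself used 5000 trees on 264 sequences.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))  # five synthetic candidate signals
# Synthetic stand-in for S_1: depends strongly on column 0, weakly on column 1.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# 100 trees keeps the example fast; this is not the setting used in the thesis.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # impurity-based, sums to one
ranking = np.argsort(importances)[::-1]    # most important feature first
```

The highest ranked columns would then be kept as input signals, in the same spirit as table 3.1.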

**3.3**

**Data preprocessing**

The logged data is far from being of the quality needed for the modelling options explored in this project, and therefore extensive preprocessing had to be conducted. Processing irregular time series that are noisy and/or have missing values is a fairly big research area in its own right (de Carvalho et al. 2007, Lepot et al. 2017).

**3.3.1** **Resampling and interpolation**

The first thing that had to be considered was that the CAN signals are not logged at the same rates, i.e. the signals are logged as they arrive, and therefore we are working with an irregular time series. In order to get a full observation of the signals at a fixed sampling interval, interpolation techniques had to be considered. Because the end product is meant for real time analysis, the techniques can only rely on past observations. Additionally, the sampling rate is fairly high, i.e. a signal is measured approximately every 0.01 seconds.
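One way to obtain a regular grid while relying only on past observations is to take the last reading in each interval and carry it forward when an interval is empty. The pandas sketch below illustrates this. The text does not state which interpolation scheme was finally chosen, so this particular choice, the signal names and the timestamps are all assumptions.

```python
import pandas as pd

# Two hypothetical CAN signals logged at irregular and different instants.
t0 = pd.Timestamp("2019-01-01")
a = pd.Series([1.0, 2.0, 3.0],
              index=t0 + pd.to_timedelta([0, 400, 2100], unit="ms"), name="sig_a")
b = pd.Series([10.0, 20.0],
              index=t0 + pd.to_timedelta([250, 1500], unit="ms"), name="sig_b")

raw = pd.concat([a, b], axis=1).sort_index()  # irregular grid with gaps (NaN)

# Right-closed, right-labelled one second bins: the row stamped t only uses
# readings that arrived at or before t. Empty bins are forward filled with
# the most recent past value, so no future information is used.
regular = raw.resample("1s", closed="right", label="right").last().ffill()
```

A signal that has not been observed yet stays missing at the start of the series, which has to be handled before training.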

Figure 3.3: (Left) Time series prior to downsampling. (Right) Time series post downsampling to a per second basis

**3.3.2** **Normalization and Standardization**

Before training both standardization and normalization techniques were applied and their effects analyzed.

**Normalization**

Normalization refers to the process of scaling the data, most commonly to the range $[0, 1]$. This is accomplished by applying equation (33).

$$\tilde{X}_i = \frac{X_i - MIN(X)}{MAX(X) - MIN(X)} \tag{33}$$

**Standardization**

To standardize a sample is to replace each observation with its z-score. Equation 34 shows how the z-score is computed for an observation $X_i$ using the sample mean $\mu_X$ and the sample standard deviation $\sigma_X$ as estimates for the population mean and the population standard deviation.

$$z_i = \frac{X_i - \mu_X}{\sigma_X} \tag{34}$$
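Both scalings are straightforward to express in NumPy. The sketch below passes the statistics in explicitly, so that values estimated on training data can be reused on new observations. Reusing the training statistics in this way is an assumption about the workflow rather than something stated in the text, and the toy values are illustrative.

```python
import numpy as np

def min_max_normalize(x, x_min, x_max):
    """Scale x to [0, 1] given the minimum and maximum, eq. (33)."""
    return (x - x_min) / (x_max - x_min)

def standardize(x, mu, sigma):
    """Replace each observation with its z-score, eq. (34)."""
    return (x - mu) / sigma

train = np.array([2.0, 4.0, 6.0, 8.0])  # toy training sample
normalized = min_max_normalize(train, train.min(), train.max())
# np.std uses the 1/n estimator by default; pass ddof=1 for the 1/(n-1) variant.
z = standardize(train, train.mean(), train.std())
```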

**3.3.3** **Filtering**

Filtering options were explored in the later stages of experimentation with the Scania data in the hope that they would have a positive effect on the models.

**Exponential Weighted Moving Average**

The setting of real time analysis limits the filtering options. Exponential weighted moving average (EWMA) was explored (pandas.DataFrame.ewm n.d.). This is a rolling mean technique that gives higher importance to recently observed values, as opposed to the conventional moving average that weights all observations in the time window used to compute it equally. An EWMA is obtained using the following equations

$$y_0 = x_0, \qquad y_t = (1 - \alpha) y_{t-1} + \alpha x_t,$$

where the value for $\alpha$ is determined by $\alpha = \frac{2}{s+1}$. Here $s$ is the number of observations in the span (window) of the average.
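The recursion above can be written out directly and checked against pandas. The sketch below assumes that the pandas call corresponds to `ewm(span=s, adjust=False)`, which implements exactly this recursive form. The test signal is synthetic.

```python
import numpy as np
import pandas as pd

def ewma(x, span):
    """Exponential weighted moving average with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    y = np.empty(len(x), dtype=float)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = (1.0 - alpha) * y[t - 1] + alpha * x[t]
    return y

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 6.0 * np.pi, 300)) + rng.normal(scale=0.3, size=300)

smoothed = ewma(x, span=10)
# pandas equivalent of the recursion (adjust=False selects the recursive form).
reference = pd.Series(x).ewm(span=10, adjust=False).mean().to_numpy()
```

Since each filtered value depends only on the current and earlier observations, the filter can be applied to streaming data.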

Figure 3.4: Var1 before and after being filtered using EWMA over spans of 10 and 100 observations

**3.4**

**Anomaly Scoring**

In all instances except for the Isolation forest, a procedure for scoring the abnormality of an observation had to be designed. In all cases the same kind of approach was applied although it had to be slightly modified to fit the encoder-decoder network due to difference in shape of the output. For the predictive setting the prediction error is computed for every observation as

$$e_i^{pred} = x_i - x_i^{pred} \tag{35}$$

where $x_i$ is a single observation of the actual signal while $x_i^{pred}$ is the corresponding predicted observation. This allows the prediction error to be modeled as a univariate Gaussian distribution, using maximum likelihood estimation.

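This can be established in the following manner. First the likelihood function for a single error term is defined as

$$L(\mu, \sigma \mid e_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_i - \mu)^2}{2\sigma^2}}. \tag{36}$$

Assuming independent error terms, the joint likelihood is the product of the individual likelihoods as written in eq. 36, which yields

$$L(\mu, \sigma \mid e_1, e_2, ..., e_n) = L(\mu, \sigma \mid e_1) L(\mu, \sigma \mid e_2) \cdots L(\mu, \sigma \mid e_n) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_1 - \mu)^2}{2\sigma^2}}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_2 - \mu)^2}{2\sigma^2}} \cdots \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_n - \mu)^2}{2\sigma^2}}. \tag{37}$$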

For convenience the natural logarithm is applied, which simplifies later stages of the computations. This is justifiable since the likelihood function is positive and the logarithm is a strictly increasing function, so the log-likelihood function attains its maximum at the same parameter values.

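$$l(\mu, \sigma \mid e_1, e_2, ..., e_n) = \ln\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_1 - \mu)^2}{2\sigma^2}}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_2 - \mu)^2}{2\sigma^2}} \cdots \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(e_n - \mu)^2}{2\sigma^2}}\right) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{(e_1 - \mu)^2}{2\sigma^2} - \frac{(e_2 - \mu)^2}{2\sigma^2} - \cdots - \frac{(e_n - \mu)^2}{2\sigma^2} \tag{38}$$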

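Now computing the derivatives with respect to $\mu$ and $\sigma$ becomes simpler. They are

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\left(\left(\sum_{i=1}^{n} e_i\right) - n\mu\right), \qquad \frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left(\sum_{i=1}^{n} (e_i - \mu)^2\right). \tag{39}$$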

Setting the partial derivatives to zero and solving for $\mu$ and $\sigma$ in the corresponding partial derivative, one obtains the maximum likelihood estimates

$$\mu_e = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad \sigma_e = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (e_i - \mu_e)^2} \tag{40}$$
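The estimates in eq. 40 amount to the sample mean and the standard deviation computed with a $1/n$ factor. A minimal sketch is given below. The error values are synthetic and only serve to show that the estimator recovers the parameters they were drawn from.

```python
import numpy as np

def fit_gaussian_mle(errors):
    """Maximum likelihood estimates (mu_e, sigma_e) of eq. (40)."""
    errors = np.asarray(errors, dtype=float)
    mu = errors.mean()
    sigma = np.sqrt(np.mean((errors - mu) ** 2))  # 1/n, not 1/(n-1)
    return mu, sigma

rng = np.random.default_rng(0)
errors = rng.normal(loc=0.02, scale=0.01, size=10000)  # synthetic error terms
mu_e, sigma_e = fit_gaussian_mle(errors)
```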

Figure 3.5: Prediction errors from a validation set and an estimated Gaussian curve

This Gaussian distribution can be utilised as a handy tool to establish a threshold $\tau$ that can be used as a decision boundary which determines if anomalous activity

Figure 3.6: Anomaly score threshold fitted to validation set prediction errors

The anomaly score is shown in eq. 41 and is computed so that independently of the setting if the anomaly score is above one then the corresponding observation is flagged as anomalous.

$$\mathrm{AnomalyScore}_i = \frac{|e_i - \mu_e|}{\tau \sigma_e} \tag{41}$$

Here $e_i$ is an error term, $\mu_e$ is the estimated error sample mean, $\sigma_e$ is the estimated error sample standard deviation and $\tau$ is a tuning parameter. In the case of the LSTM encoder-decoder network, the output is a reconstructed sequence of the same length as the input. Therefore an adjustment to the error computations had to be made in order to get a one dimensional error vector. The network was fed a one dimensional subsequence of length $n$, which will be denoted by $s$, and the reconstruction error for each observation $x_i$ in the input sequence was computed as

$$e_{i,j} = |x_{i,j} - \hat{x}_{i,j}| \tag{42}$$

where $\hat{x}_{i,j}$ is the reconstructed observation $j$ of subsequence $i$. The sum over every

This vector of errors can now be treated in the same manner as described before, i.e. it can be used to fit a univariate Gaussian distribution which in turn allows for a handy way to establish a threshold that serves as a decision boundary so that the errors that surpass it are flagged as anomalous.
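The scoring procedure can be sketched as follows. The first part applies eq. 41 to a vector of prediction errors. The second part computes the absolute reconstruction errors of eq. 42 and collapses them to one value per subsequence by summation, which is an assumption about the aggregation step. All numbers are made up for illustration.

```python
import numpy as np

def anomaly_score(errors, mu_e, sigma_e, tau):
    """Anomaly score of eq. (41); values above one are flagged."""
    return np.abs(np.asarray(errors, dtype=float) - mu_e) / (tau * sigma_e)

# Predictive setting: one error per observation.
val_errors = np.array([-0.1, 0.0, 0.1, 0.05, -0.05])  # toy validation errors
mu_e, sigma_e = val_errors.mean(), val_errors.std()
test_errors = np.array([0.02, -0.9, 0.1])
flags = anomaly_score(test_errors, mu_e, sigma_e, tau=5.0) > 1.0

# Reconstruction setting: rows are subsequences i, columns are positions j.
x = np.array([[0.0, 0.1, 0.2], [0.2, 0.3, 0.4]])
x_hat = np.array([[0.0, 0.1, 0.1], [0.2, 0.2, 0.2]])
e_ij = np.abs(x - x_hat)  # eq. (42)
e_i = e_ij.sum(axis=1)    # ASSUMPTION: one summed error per subsequence
```

With these toy values only the second test error exceeds $\tau\sigma_e$ and is flagged.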

**3.5**

**Architecture Implementation**

Different methods that approach the topic of anomaly detection in sequential data in different ways were experimented with. The main emphasis was put on three different types of approaches, point anomaly detectors, non-predictive contextual anomaly detection and predictive contextual anomaly detection.

**3.5.1** **Setup**

For all implementations the Python programming language and the Python packages in table 3.2 were used.

Table 3.2: Python packages used for implementation

**Package** **Version**
Python 3.7.3
Numpy 1.16.3
Pandas 0.24.2
Matplotlib 3.0.3
Scikit Learn 0.20.3
Tensorflow 1.13.1
Keras 2.2.4

**3.5.2** **Point anomaly detection**

and new observations are ranked based on how they fit that profile.

**Isolation Forest**

The algorithm was initialized with the maximum number of samples set to 1000 and the contamination rate set to 0.0001, since the algorithm was fit to the normal drive training data.
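With scikit-learn this configuration could look roughly like the sketch below. The two hyperparameter values are taken from the text, while the synthetic training data, the fixed random seed and the test points are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_train = rng.normal(size=(5000, 10))  # stand-in for normal drive data

detector = IsolationForest(max_samples=1000, contamination=0.0001, random_state=0)
detector.fit(normal_train)

# One typical observation and one far away from the training data.
new_obs = np.vstack([rng.normal(size=(1, 10)), np.full((1, 10), 8.0)])
scores = detector.score_samples(new_obs)  # lower means more abnormal
labels = detector.predict(new_obs)        # -1 for anomaly, +1 for normal
train_scores = detector.score_samples(normal_train)
```

The low contamination rate places the decision threshold near the most extreme scores seen during training, which matches the assumption that the training data contains essentially no anomalies.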

**3.5.3** **Non-predictive contextual anomaly detection**

An Encoder-Decoder network as described in section 2.7.4 was constructed. A linear transformation, based on applying a Principal Component Analysis (PCA) (see section 2.4.1) to the normal drive training data set, was used to reduce the dimension to a single sequence. The encoder is fed a sequence of 10 observations and utilizes a hidden layer made up of 16 hidden LSTM units with a tanh activation function. That layer's output is then fed as input into the decoder, which mirrors the encoder in architecture, resulting in an output of the same shape and size as the input. This network is trained to minimize the mean square error between the input and the output in an unsupervised manner.
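The data preparation for this network can be sketched as below: a PCA is fitted on normal data only, every data set is projected onto the first principal component, and the resulting series is cut into subsequences of 10 observations. Whether the subsequences overlapped is not stated, so the non-overlapping split is an assumption, as are the synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal_train = rng.normal(size=(1000, 10))  # stand-in for normal drive signals
test = rng.normal(size=(200, 10))           # stand-in for a test drive

pca = PCA(n_components=1).fit(normal_train)  # fitted on normal data only
train_1d = pca.transform(normal_train)[:, 0]
test_1d = pca.transform(test)[:, 0]

def subsequences(series, length=10):
    """Cut a 1-D series into non-overlapping subsequences (assumed split)."""
    n = len(series) // length
    # Shape (samples, time steps, features), as recurrent layers expect.
    return series[: n * length].reshape(n, length, 1)

train_windows = subsequences(train_1d)
test_windows = subsequences(test_1d)
```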

**3.5.4** **Predictive contextual anomaly detection**

A many to one regression model was constructed using a recurrent neural network which takes past observations of the 10 signals as input and predicts $S_1$ using a single hidden layer of 16 LSTM units with a tanh activation function (see eq. 12), which is followed by a dense layer with a linear activation function.
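For a many to one model the training pairs consist of a window of past observations and a single target value. A possible construction is sketched below. The window length of 20, the position of $S_1$ in column 0 and the choice of predicting the observation directly after the window are assumptions, since the text does not specify them.

```python
import numpy as np

def make_many_to_one(signals, target_col=0, window=20):
    """Build (X, y) pairs for a many to one regression model.

    X[k] holds `window` consecutive rows of all signals and y[k] is the
    target signal at the step directly after that window (assumed setup).
    """
    n = len(signals) - window
    X = np.stack([signals[k : k + window] for k in range(n)])
    y = signals[window:, target_col]
    return X, y

# Toy data: 50 time steps of 10 signals, with S_1 assumed to be column 0.
signals = np.arange(50 * 10, dtype=float).reshape(50, 10)
X, y = make_many_to_one(signals)
```

The resulting array has the shape (samples, time steps, signals), which is the layout recurrent layers typically expect.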

**3.6**

**Training procedure**

a data file of 729 observations from a normal drive and another file holding 605 observations from an injected fault.

All neural networks were trained using the Adam (adaptive moment estimation) optimization algorithm, which is a well known extension of the stochastic gradient descent algorithm (Kingma & Ba 2014). This algorithm was chosen simply because it is often recommended as the default optimization algorithm in deep learning (Ruder 2016).

Adam uses exponentially moving averages of the gradient $g_t$ to compute estimates of the first and second moments $(m_t, v_t)$

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{44}$$

where $g_t^2$ represents the element-wise square if $g_t$ is a vector or a matrix. Here $\beta_1$ and $\beta_2$ are hyperparameters set at the default values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ for this project. These estimates are biased and need to be bias-corrected, yielding unbiased estimates

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{45}$$

This results in the following update step

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t. \tag{46}$$

Here $\eta$ is the learning rate, set at 0.001 during the experimentation, $\theta$ are the model's parameters and $\epsilon$ is a hyperparameter set to the value of $10^{-8}$.
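The update rule can be illustrated on a toy problem. The sketch below implements the moment estimates, the bias correction and the parameter update with the hyperparameter values quoted above and minimizes $f(\theta) = \sum_k \theta_k^2$. The objective, the starting point and the number of steps are assumptions chosen for illustration and have nothing to do with the networks trained in this project.

```python
import numpy as np

def adam_minimize(grad, theta0, steps, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function with Adam, given a callable returning its gradient."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # first moment estimate
    v = np.zeros_like(theta)  # second moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2  # element-wise square
        m_hat = m / (1.0 - beta1 ** t)          # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = sum(theta^2) with gradient 2 * theta.
theta_final = adam_minimize(lambda th: 2.0 * th, theta0=[0.5, -0.3], steps=3000)
```

Because the step is normalized by $\sqrt{\hat{v}_t}$, the first updates have a magnitude close to $\eta$ regardless of the scale of the gradient.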

**3.7**

**Model Evaluation**

**3.7.1** **Scania Data**

Due to the lack of informative annotation in the Scania data, the evaluation of the models is limited to the analysis of the detection rate and the false positive rate. The work was carried out under the assumption that the normal drive depicted nothing but normal behaviour, and therefore the models should preferably not flag any anomalous behaviour in those cases, while still capturing the injected faults. This leads to a somewhat heuristic approach in which the detection rate and the rate of false positives were analysed.

**3.7.2** **Synthesised data**

**4**

**Results**

This section will focus on the experimental results obtained from the implemented models using both data sets. The evaluation based on the data set obtained from Scania is more focused on showing the models' ability to detect anomalous behaviour and the rate of false positives. The other data set is then meant to give better summary statistics since it has actual annotated anomalies.

**4.1**

**Scania data**

In this section the performance of different methods is analysed using the Scania data. As described in section 1.4, the nature of the data restricts the analysis to looking at the detection rate of the injected fault and the false positive rate of the normal drive test sequences.

**4.1.1** **Isolation Forest**

Figure 4.1: Isolation forest anomaly score from a normal drive test set with threshold set at 1.12

Figure 4.2: Isolation forest anomaly score from a faulty drive test set with threshold set at 1.12

**4.1.2** **RNN - encoder-decoder network**

**Without Filtering**

The first post-training step for every model is to compute the error margins between a signal and the output of each model. Those errors are then modeled by a Gaussian distribution as laid out in section 3.4. The estimated mean and standard deviation can be seen in table 4.1. This allows for a convenient way of defining a threshold according to the number of standard deviations from the mean.

Table 4.1: ED Gaussian coefficients without filtering

**Mean** 0.027867447268757194

**Standard Deviation** 0.010066136635592152

The reconstructed signals from the faulty and the normal drive, with and without filtering, are not particularly interesting as there is little visible difference between the reconstructed signal and the original signal in figures 4.3, 4.4, 4.7 and 4.8.

Figure 4.3: The original and the reconstructed normal drive test set without filtering

detection rate is low when the faulty drive is tested and the anomaly flags only appear in highly volatile context and not in areas where the signal levels off.

Figure 4.4: The original and the reconstructed faulty Drive test set without filtering

Figure 4.6: Encoder-Decoder anomaly score of a faulty drive test set without filtering using $\tau = 5$

Table 4.2 shows the ratio of the overall detection rate and based on these results the model does not show a lot of promise for anomaly detection on this particular signal.

Table 4.2: Detection rate of the Encoder-Decoder network for both the normal and faulty test drives without filtering using $\tau = 5$

**Drive** **Detection Rate**

Normal 0.56%

Faulty 0.84%

**With Filtering**

The data was filtered using an exponential weighted moving average over a window of 100 observations. The errors were modeled in the same way as before and the estimated coefficients are depicted in table 4.3.

Table 4.3: ED Gaussian coefficients with filtering

**Mean** 0.017008433576124825

**Standard Deviation** 0.011817158698356237

moving average (EWMA) does not have a positive effect on this model as it can be seen that one of the two spikes in the anomaly scoring has dropped below the threshold in figure 4.10.

Figure 4.7: The original and the reconstructed normal drive test set with filtering

Figure 4.9: Encoder-Decoder anomaly score of a normal drive test set with filtering using $\tau = 4.9$

Figure 4.10: Encoder-Decoder anomaly score of a faulty drive test set with filtering using $\tau = 4.9$

Table 4.4: Detection rate of the Encoder-Decoder network for both the normal and faulty test drives with filtering using $\tau = 4.9$

**Drive** **Detection Rate**

Normal 0.00%

Faulty 0.34 %

Sufficient information is lacking about the nature of the data collection procedure to be able to fully determine whether this model is capturing some other effect of the injected fault or not. It is the author's belief, however, that the linear transformation through PCA, which condenses the information to a single principal component as described in section 2.4.1, might play a major role in this instance, and different dimensionality reduction techniques should be experimented with. Additional experimentation using $S_1$ as the only signal and skipping the dimensionality reduction step using PCA did not lead to any noticeable improvements. This builds support for the case that this network architecture might not be suitable for this combination of signal and error, where the abnormality is a constant signal and not erratic behaviour. This was the main driver for further experimentation using synthesized data.

**4.1.3** **RNN - predictor**

The results obtained from the RNN-prediction network are illustrated in this section using both the original data as well as the filtered signals obtained by applying an exponential weighted moving average over a window of the past 100 samples.

**Without Filtering**

The results of the Gaussian modelling for the RNN-regression model are found in table 4.5.

Table 4.5: RNN predictor Gaussian coefficients without filtering

**Mean** -0.00017610776

The errors are simply the difference between the actual value of an observation and the corresponding predicted value of $S_1$. Both signals are visualised using a test sequence set aside from a normal drive, seen in figure 4.11, and using a faulty test set in figure 4.12. In both cases the red curve represents the actual sequence while the blue curve is the predicted sequence.

Figure 4.11: The original and the predicted normal drive test set without filtering

does not know how to react in the subsequences of abnormal behaviour and the majority of the model's predictions are far from the actual value. This is a very promising result, but there is a concern when figure 4.11 is analysed: there are instances where the model's prediction is too far off for a normal drive. This concern becomes even more apparent when figures 4.13 and 4.14 are analysed, as a fair amount of observations are flagged as anomalous in the normal drive test set, making them false positives according to the assumptions laid down before. Table 4.6 depicts the ratio of flagged anomalies for both the normal and the faulty drives. However, the high detection rate for the faulty drive shows a lot of promise, and this combination of a high false positive rate and a high detection rate in the faulty drive was the main reason why the additional experiments using filtering techniques such as exponential weighted moving average were conducted, in the hope of reducing the false positive rate.

Figure 4.14: RNN predictor anomaly score of a faulty drive test set without filtering using $\tau = 5.5$

Table 4.6: Detection rate of the RNN predictor for both the normal and faulty test drives without filtering using $\tau = 5.5$

**Drive** **Detection Rate**

Normal 1.37 %

Faulty 21.52 %

**With Filtering**

The results of the Gaussian modelling for the RNN-regression model are found in table 4.7.

Table 4.7: RNN predictor Gaussian coefficients with filtering

**Mean** 0.004098794

**Standard Deviation** -0.0017545223

Figure 4.15: The original and the predicted normal drive test set with filtering

Figure 4.16: The original and the predicted faulty drive test set with filtering

as the presence of false positives has been removed completely.

Figure 4.17: RNN predictor anomaly score of a normal drive test set with filtering using $\tau = 4.3$

Table 4.8: Detection rate of the RNN predictor for both the normal and faulty test drives with filtering using $\tau = 4.3$

**Drive** **Detection Rate**

Normal 0.0%

Faulty 15.9%

**4.2**

**Synthesized data**

For the synthesized data the focus will be restricted to the encoder-decoder network. A sinusoidal wave was generated as described in section 3.1.2 and the resulting signal is visualized in figure 3.2.

**4.2.1** **LSTM - encoder-decoder network**

Figure 4.19: Original and reconstructed signal from the validation set

Figure 4.20: Anomaly score from the validation set of the sinusoidal wave

Spikes were injected into the wave by adding a constant $c = 0.1$ to the signal at every 100th observation to create a test set. Figure 4.21 illustrates the results from testing the model, and it can clearly be observed that the model is able to capture these faults in all cases. This shows that the model seems very dependent on how the faults appear in the signals. In a more periodic signal this model shows much more promising performance.
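The construction of such a test set can be sketched as follows. The constant $c = 0.1$ and the spacing of 100 observations come from the text, while the exact indices that count as every 100th observation, the clean wave and its length are assumptions.

```python
import numpy as np

def inject_spikes(signal, c=0.1, every=100):
    """Add a constant c to every `every`-th observation and return labels."""
    corrupted = np.array(signal, dtype=float, copy=True)
    idx = np.arange(every - 1, len(corrupted), every)  # observations 100, 200, ...
    corrupted[idx] += c
    labels = np.zeros(len(corrupted), dtype=bool)
    labels[idx] = True  # annotation of the injected anomalies
    return corrupted, labels

t = np.arange(1000)
clean_wave = np.sin(2.0 * np.pi * 0.01 * t)  # hypothetical clean wave
test_signal, is_spike = inject_spikes(clean_wave)
```

The boolean labels give the per-observation annotation that the Scania data lacks, which is what makes summary statistics possible in this setting.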

**5**

**Conclusion**

In this project three different modelling options were analysed for the purpose of anomaly detection in a vehicle’s sensor network. The data which was relied on had been collected explicitly for anomaly detection research and was made up of data depicting normal behaviour as well as several faults that were injected into the system by making mechanical or electrical adjustments. This allowed for the use of semi-supervised techniques where different recurrent neural networks (RNN) were trained only using data representing normal behaviour. The results show that these techniques hold great promise for this particular setting of anomaly detection in vehicles and could serve as a great tool for automated fault detection. Furthermore, the detection of anomalous behaviour is the foundation for carrying out a near real time root cause analysis of the faults, which is the ultimate goal. In section 1.3, a set of questions was established that were to be answered at the end of the project. The following subsections discuss each question and the answers based on the results obtained in section 4.

**What modelling options are able to detect anomalous behaviour?**

**What approaches can be used to combat the detection of false**
**positives?**

When the results from the RNN-predictor, with and without filtering, are compared there is no denying that filtering is a key component in work of this nature. The high rate of false positives seen in figure 4.13 was the reason filtering options were analysed. Figure 4.17 shows that the false positives were non-existent post filtering. This reduction strengthens the author’s belief that this could become a truly useful tool, once fully developed.

**What is the models' potential to generalize to different applications?**

The RNN-predictor is very dependent on the availability of meaningful input features. These are not always available, which needs to be analysed case by case. The encoder-decoder network, however, can be fitted on any signal without much preprocessing and analysis. In either case a thorough analysis would have to be carried out in order to determine if the model performs up to standards, i.e. has a high detection rate and a low rate of false positives as depicted in figures 4.17 and 4.18.

**What model performs best overall?**

**6**

**Discussion, Delimitation and Future work**

A basis for further development has now been established, but there are numerous steps that need to be taken in order to introduce this idea of anomaly/fault detection into Scania's vehicles.

Additional modelling options were experimented with, but the ones that were not included in this report all approached the problem in the same way as some of the models presented and did not perform as well. These were mainly One Class SVM and a different neural network architecture called Temporal Convolutional Neural Networks. The One Class SVM approaches the problem similarly to the Isolation Forest by constructing a profile for normal behaviour that is used as a measuring stick and does not account for any temporal dependencies (Manevitz & Yousef 2001). Temporal Convolutional Neural Networks, on the other hand, is a neural network architecture based on the well established Convolutional Neural Networks that applies convolution along the time axis and extracts features in the data using filters (Bai et al. 2018). This architecture has been shown to outperform RNNs on a few tasks and was therefore tested. A many to one regression model using this architecture was trained just as the RNN-predictor but did not perform nearly as well. This might just mean that the architecture and the available sources need to be developed further.

Alternative approaches to the use of Principal Component Analysis (PCA) for dimensionality reduction before feeding data to the encoder-decoder network could be worth exploring. PCA only allows for linear transformations and does not take temporal dependencies into consideration. Therefore, it might be more suitable to use auto-encoders based on recurrent neural networks to better capture the non-linearity in the data as well as the contextual information of the time series.

technology that will most likely replace the CAN network moving forward.

A clear delimitation of the work conducted in this project is that the data is only logged from one vehicle while the modelling should be generic enough to function for all manufactured vehicles during testing. Therefore it would be really interesting to obtain new data from other vehicles for testing purposes, just to get a better idea of the transferability of the models. Federated learning might offer a promising way of obtaining models that are generic enough to allow them to be transferred from vehicle to vehicle by aggregating models trained on data from different vehicles.

Due to the performance boost obtained from filtering the signal, other methods of real time preprocessing should be explored. Some very basic experiments were conducted using Kalman filters during the filtering step and that should be analyzed further.

**References**

Bai, S., Kolter, J. Z. & Koltun, V. (2018), ‘An empirical evaluation of generic convolutional and recurrent networks for sequence modeling’, *CoRR* **abs/1803.01271**.
**URL:** *http://arxiv.org/abs/1803.01271*

Bengio, Y., Simard, P. & Frasconi, P. (1994), ‘Learning long-term dependencies with gradient descent is difficult’, *IEEE Transactions on Neural Networks* **5**(2), 157–166.

Beygelzimer, A., Erdogan, E., Ma, S. & Rish, I. (2005), Statistical models for unequally spaced time series.

Bishop, C. M. (2006), *Pattern Recognition and Machine Learning*, Springer.

Breiman, L. (2001), ‘Random forests’, *Machine Learning* **45**(1), 5–32.
**URL:** *https://doi.org/10.1023/A:1010933404324*

Chalapathy, R. & Chawla, S. (2019), ‘Deep learning for anomaly detection: A survey’, *CoRR* **abs/1901.03407**.
**URL:** *http://arxiv.org/abs/1901.03407*

Chandola, V., Banerjee, A. & Kumar, V. (2009), ‘Anomaly detection: A survey’, *ACM Comput. Surv.* **41**.

*Controller Area Network (CAN) Overview* (2019).
**URL:** *https://www.ni.com/sv-se/innovations/white-papers/06/controller-area-network–can–overview.html*

de Carvalho, O. A., Guimaraes, R. F., Trancoso Gomes, R. A. & da Silva, N. C. (2007), Time series interpolation, in ‘2007 IEEE International Geoscience and Remote Sensing Symposium’, pp. 1959–1961.

Ding, Z. (2013), An anomaly detection approach based on isolation forest algorithm for streaming data using sliding window, pp. 12–17.

Hawkins, D. (1980), *Identification of Outliers*, Monographs on applied probability and statistics, Chapman and Hall.
**URL:** *https://books.google.se/books?id=fb0OAAAAQAAJ*

Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term memory’, *Neural computation* **9**, 1735–80.

Hult, H. (2018), ‘Lecture notes in sf2957 statistical machine learning’.

Kingma, D. & Ba, J. (2014), ‘Adam: A method for stochastic optimization’, *International Conference on Learning Representations*.

Lepot, M., Aubin, J.-B. & Clemens, F. (2017), ‘Interpolation in time series: An introductive overview of existing methods, their performance criteria and uncertainty assessment’, *Water* **9**(10), 796.
**URL:** *http://dx.doi.org/10.3390/w9100796*

Li, D., Chen, D., Shi, L., Jin, B., Goh, J. & Ng, S. (2019), ‘MAD-GAN: multivariate anomaly detection for time series data with generative adversarial networks’, *CoRR* **abs/1901.04997**.
**URL:** *http://arxiv.org/abs/1901.04997*

Liu, F. T., Ting, K. M. & Zhou, Z. (2008), Isolation forest, in ‘2008 Eighth IEEE International Conference on Data Mining’, pp. 413–422.

Louppe, G., Wehenkel, L., Sutera, A. & Geurts, P. (2013), Understanding variable importances in forests of randomized trees, in C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani & K. Q. Weinberger, eds, ‘Advances in Neural Information Processing Systems 26’, Curran Associates, Inc., pp. 431–439.
**URL:** *http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf*

Malhotra, P., Ramakrishnan, A., Anand, G., Vig, L., Agarwal, P. & Shroff, G. (2016), ‘Lstm-based encoder-decoder for multi-sensor anomaly detection’, *CoRR* **abs/1607.00148**.