Unsupervised anomaly detection in time series with recurrent neural networks


JOSEF HADDAD CARL PIEHL

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bachelor in Computer Science

Date: June 7, 2019

Supervisor: Pawel Herman

Examiner: Örjan Ekeberg

School of Electrical Engineering and Computer Science

Swedish title: Oövervakad avvikelsedetektion i tidsserier med neurala nätverk


Abstract

Artificial neural networks (ANN) have been successfully applied to a wide range of problems. However, most of the ANN-based models do not attempt to model the brain in detail, but there are still some models that do. An example of a biologically constrained ANN is Hierarchical Temporal Memory (HTM).

This study applies HTM and Long Short-Term Memory (LSTM) to anomaly detection problems in time series in order to compare their performance on this task. The anomalies are restricted to point anomalies and the time series are univariate. Pre-existing implementations that utilise these networks for unsupervised anomaly detection in time series are used in this study. We primarily use our own synthetic data sets in order to examine the networks' robustness to noise and how they compare to each other on time series with different characteristics. Our results show that both networks can handle noisy time series and that the difference in performance regarding noise robustness is not significant for the time series used in the study. LSTM outperforms HTM in detecting point anomalies on our synthetic time series with a sine curve trend, but which of the two networks performs best overall remains inconclusive.


This study applies HTM and Long Short-Term Memory (LSTM) to anomaly detection problems in time series in order to investigate their strengths and weaknesses on this problem. The anomalies in this study are limited to point anomalies and the time series are univariate. Pre-existing implementations that use these networks for unsupervised anomaly detection in time series are used in this study. We primarily use our own synthetic time series to investigate how the networks handle noise and how they handle different characteristics that a time series can have. Our results show that both networks can handle noise, and the difference in performance regarding noise robustness was not large enough to distinguish the models. LSTM performed better than HTM at detecting point anomalies in our synthetic time series that follow a sine curve, but which network performs best overall is still undecided.


1 Introduction
    1.1 Aims and Research Question
    1.2 Scope
    1.3 Outline
2 Background
    2.1 Time Series and Anomalies
    2.2 HTM
        2.2.1 HTM Neuron
        2.2.2 HTM Network
        2.2.3 HTM activation and learning
    2.3 ANNs
        2.3.1 Training
        2.3.2 RNNs
        2.3.3 LSTM
    2.4 Related Work
3 Method
    3.1 HTM Configuration
        3.1.1 Network structure
        3.1.2 Training
        3.1.3 Anomaly labelling
    3.2 LSTM configuration
        3.2.1 Network structure
        3.2.2 Training
        3.2.3 Anomaly labelling
    3.3 Differences between the used models
    3.4 Evaluation and Performance metrics
    3.5 Data sets used
    4.2 Trend/characteristics results
        4.2.1 Statistical hypothesis testing
    4.3 Results from real world data sets
        4.3.1 Occupancy t4013
        4.3.2 Ec2_request_latency_system_failure
5 Discussion
    5.1 Noise robustness
    5.2 Time series and anomaly characteristics
    5.3 Real world time series performance
    5.4 General comparison
    5.5 Limitations
    5.6 Biological approach
    5.7 Future Work
6 Conclusions
Bibliography
A Time series graphs
    A.1 Noise robustness graphs
    A.2 Synthetic time series graphs


ANN Artificial Neural Network.

BPTT Back-Propagation Through Time.

HTM Hierarchical Temporal Memory.

LSTM Long Short-Term Memory.

NAB Numenta Anomaly Benchmark.

RNN Recurrent Neural Network.

SDR Sparse Distributed Representation.


1 Introduction

Advances in neuroscience have allowed for a greatly increased understanding of the structure and function of different parts of the brain. Simultaneously, great advancements have been made in the field of machine learning. Artificial Neural Networks (ANN), in particular, have been of great interest in the research community and have been successfully applied to a wide range of problems, from medical diagnosis [1] to playing games [2]. However, most ANN-based models do not attempt to model the brain in any detail [3]. These advancements are mostly driven by mathematically derived models devised to perform specific tasks, which do not utilise our increased understanding of the brain. They often also require extensive training in order to perform these tasks and cannot easily be generalised to perform other tasks. An alternative approach is to use biologically inspired models that try to mimic the way the human brain processes information. There are a few examples of brain-inspired methods; one of the more detailed ones is Hierarchical Temporal Memory (HTM).

HTM is an evolving attempt to model the structure and function of the neocortex, first introduced by Hawkins [4]. The model is based on Mountcastle's [5] proposal that all the regions of the neocortex, which makes up roughly 80% of the brain and is responsible for higher-order functions such as cognition and language, follow a similar neuroanatomical design [5]. It is hypothesised that the difference in functionality between regions mainly arises from different inputs. This implies, in theory, that a faithful model of the neocortex could be trained to perform multifarious functions that would be considered the backbone of intelligent behaviour. Hawkins claims that the neocortex achieves this by memorising patterns and constantly making predictions based on those memories. This way, the neocortex can learn spatial and


terest in fields such as engineering, economics and medicine [6]. Patterns in time series data that deviate from expected or normal behaviour are considered to be anomalies [6]. This means that a tool which can accurately predict the future values in a time series can also be used as an anomaly detector. Time series anomaly detection can be useful in many areas such as sleep monitoring [6], jet engine operation [7] and intrusion detection for computer networks [8].

Anomalies can be difficult to detect because it can be hard to determine whether a pattern in a time series is normal or not, since an anomaly in one process can be considered normal behaviour in another [6]. Anomalous behaviour can be identified with simple threshold heuristics [9], but those often require knowledge of the data sets and implementation by a human with deep domain knowledge. The major challenge lies in capturing dependencies among multiple variables as well as identifying long-term and short-term repeating patterns. This is where traditional approaches, such as autoregressive methods, can fall short [10]. An alternative approach, which has seen increasing popularity as of late, is using ANNs as anomaly detectors.

Different types of ANNs have been applied to a wide range of time series problems, such as predicting flour prices and modelling the amount of littering in the North Sea [6].

1.1 Aims and Research Question

The aim of this thesis is to evaluate an HTM network in the task of detecting point anomalies in time series data. The strengths and weaknesses of the network are compared to a state-of-the-art ANN-based approach. The selected ANN is the Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN). RNNs are well suited for time series data since they can represent information from an arbitrarily long context window [11]. However, they have traditionally been difficult to train and perform worse with very long-term temporal dependencies [12]. Adding LSTM units to these networks has been shown to remedy some of these issues, allowing the networks to achieve state-of-the-art performance in time series anomaly detection tasks [6][13].


The research question for this thesis is:

How does HTM compare to LSTM in time series point anomaly detection tasks?

1.2 Scope

HTM and LSTM are compared when performing unsupervised anomaly detection in single variable streaming data. In particular, this study focuses on robustness to noise and ability to recognise anomalies in time series with different characteristics.

The type of anomaly is limited to point anomalies in this study. Point anomalies are data points which differ from the rest of the data points. This can for example be extreme values or points that are not abnormal for the entire data set but are anomalies in their context [14]. Both networks are utilised in models designed for the task and the configuration of them is kept constant for all time series in this study. The models are primarily tested on synthetic data sets, but also on two real world data sets.

1.3 Outline

The following chapter provides a theoretical background for time series, HTM and LSTM. It also mentions other studies where the two models have been applied to similar problems. The third chapter describes the data sets used in this study, the methodology of applying the models to the data sets and the metrics used to measure performance. The fourth chapter presents the results obtained. In chapter 5 the implications and validity of the results are discussed and possibilities for future research are explored. Finally, in chapter 6, conclusions are drawn.


Following Malhotra et al. [13], a time series can formally be described as $X = \{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(n)}\}$, where $x^{(t)} \in \mathbb{R}^m$, $m$ is the dimension and $t$ is the timestamp. Each $x^{(t)}$ is represented as an $m$-dimensional vector $x^{(t)} = \{x_1, x_2, \ldots, x_m\}$. In a time series with only one variable, $m = 1$.

Anomalies can be categorised into different types. One type of anomaly is the point anomaly. Point anomalies are individual instances of data points that differ too much from the rest of the data point instances. An example of this in a univariate time series can be seen in the upper leftmost graph in figure 2.1 [15]. Data points which are normal, with regards to having similar features as other instances in the data set, but differ from the normal instances in the context in which they appear, are referred to as contextual anomalies [16].

A contextual anomaly which only consists of a single data point is called a contextual point anomaly and can be seen on the bottom left part of figure 2.1.

The two graphs on the right side of figure 2.1 illustrate collective anomalies, which are anomalies that are built up by multiple data instances [15].


Figure 2.1: Illustration of different types of anomalies in a univariate time series. [15]

2.2 HTM

HTM is a biologically constrained theory of intelligence based on the neocortex. It is not considered to be deep learning or machine learning technology, but rather a machine intelligence framework [17]. The general structure of an HTM network can be seen in figure 2.2A. The network is arranged into cellular layers where each layer consists of neurons, or cells, arranged into columns. These cellular layers learn sequences through connections to other cells in the same layer. The structure of HTM makes it robust to noise and suited for prediction, anomaly detection and classification of sequential streaming data [18].


Figure 2.2: An image overview of the HTM network and individual HTM neurons. [18]

2.2.1 HTM Neuron

A representation of the HTM neuron can be seen in the left part of figure 2.2B.

The feed-forward input in figure 2.2B is the input that enters the network, and the context input represents the connections from other neurons in the same layer. The feedback input represents connections between that neuron and neurons positioned in a layer at a higher level. As seen in figure 2.2B (left), the context and feedback inputs consist of multiple segments. These segments are called dendritic segments. They contain synaptic connections to other neurons, and a segment activates when enough of its connections are active.

An active segment causes the neuron to enter a depolarised state, which can be called a "predictive state". The input into the neuron determines which of the three possible states it is in (active, predictive or non-active), but the neuron output is always binary. A depolarised state alone is not enough to make the neuron activate; feedforward input is needed as well [18].


2.2.2 HTM Network

Figure 2.2A shows how the HTM neurons are arranged into columns in the HTM network. The network holds information about high-order sequences it has seen before using two different sparse representations: the previous sequence context and the current feedforward input. The current feedforward representation is on column level while the previous sequence context is on cell level, where a cell is the same as an HTM neuron [18].

The feedforward input originates from the input data, which is encoded into a sparse distributed representation (SDR). These representations encode how much input the cells in each column should receive, and a threshold is set for what percentage of the columns with the most active feedforward inputs should be activated [18]. Figure 2.3 illustrates the transformation which the input data undergoes before being used in the HTM network. The configuration of these encoders is crucial for the model to function effectively, and the model's robustness to noise depends on the SDR [19].
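To make the idea of a sparse encoding concrete, the following is a minimal sketch of a bucket-style scalar encoder. It is an illustration only and not the NuPIC encoder: the function names `encode_scalar` and `overlap`, and the parameters `n_bits` and `active_bits`, are assumptions chosen for this example.

```python
def encode_scalar(value, min_val, max_val, n_bits=100, active_bits=11):
    """Encode a scalar as a binary list with one contiguous run of ones.

    All names and default sizes here are hypothetical, for illustration.
    """
    # Clamp so out-of-range values map to the first or last bucket.
    value = max(min_val, min(max_val, value))
    n_buckets = n_bits - active_bits + 1
    fraction = (value - min_val) / (max_val - min_val)
    start = int(round(fraction * (n_buckets - 1)))
    return [1 if start <= i < start + active_bits else 0 for i in range(n_bits)]


def overlap(a, b):
    """Count the positions where both encodings have an active bit."""
    return sum(x & y for x, y in zip(a, b))
```

In this sketch, nearby values share most of their active bits while distant values share none, which gives an intuition for why a small perturbation of the input changes the encoded representation only slightly.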

The other sparse representation, which is on cell level, consists of the connections between cells in the same layer. These are the connections described as the context input in section 2.2.1 about the HTM neuron. This representation is what makes a cell enter a predictive state, which happens when enough active cells are connected to it. This is illustrated in the first matrix in figure 2.2D, where 3 active cells are connected to a red cell, which makes it enter a predictive state. These types of connections exist between the neurons in the HTM layer and together make this sparse representation possible [18].

If an active column contains cells that are in a predictive state, these cells will in turn activate other cells in the network, predicting the next feedforward input. If an active column does not contain any cells in a predictive state, all of the cells in that column will activate [18].


Figure 2.3: Data is encoded into a sparse distributed representation before being used as input into the HTM network.

2.2.3 HTM activation and learning

In the following section, we denote the activation state with a binary matrix $A^t$ with dimensions $M \times N$, where $N$ is the number of columns, $M$ is the number of cells per column and $t$ is the time step. The activation state of the cell at column $j$ and cell index $i$ is $a^t_{ij}$. Cells in a predictive state at time step $t$ are represented in a similar fashion by a binary matrix $\Pi^t$ with the same dimensions, where the predictive state of the cell at column $j$ and cell index $i$ is $\pi^t_{ij}$.

Each synapse in the HTM network (a connection between two neurons) is represented by a scalar value. An $M \times N$ matrix $D^d_{ij}$ holds the scalar values of the synapses between the $d$'th segment of the cell at column $j$ and cell index $i$ and the rest of the HTM cells in the same layer. Each value in that matrix is between 0 and 1. If a value is above a specific threshold, the synapse is considered to be connected. The connected synapses are represented in the binary matrix $\tilde{D}^d_{ij}$. The depolarisation of a cell can be calculated using equation 2.1, where $\theta$ is the segment activation threshold and $\circ$ is element-wise multiplication [18].

$$
\pi^t_{ij} =
\begin{cases}
1 & \text{if } \exists_d \, \lVert \tilde{D}^d_{ij} \circ A^t \rVert > \theta \\
0 & \text{otherwise}
\end{cases}
\tag{2.1}
$$

In order to know which cells are activated, we need to know which columns are active. We denote the active columns at time step t with the set W^t. The active states at time step t are then calculated using equation 2.2. Equation 2.2 tells us that a cell is activated at time step t if the column it is in is activated and the cell was in a predictive state at time step t - 1. It also tells us that a cell is activated if its column is activated and no cell in the column was in a predictive state at time step t - 1 [18].

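$$
a^t_{ij} =
\begin{cases}
1 & \text{if } j \in W^t \text{ and } \pi^{t-1}_{ij} = 1 \\
1 & \text{if } j \in W^t \text{ and } \sum_i \pi^{t-1}_{ij} = 0 \\
0 & \text{otherwise}
\end{cases}
\tag{2.2}
$$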
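The two rules in equations 2.1 and 2.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the NuPIC implementation: cells are represented as `(column, cell_index)` tuples, each segment is a set of the cells it has connected synapses to, and the function names are hypothetical. With binary matrices, the norm of the element-wise product in equation 2.1 counts the connected synapses that point at active cells, which is what the set intersection computes here.

```python
def predictive_cells(segments, active_cells, theta):
    """Equation 2.1: a cell is predictive if any of its segments has more
    than `theta` connected synapses to currently active cells.

    `segments` maps a cell (column, index) to a list of segments, where
    each segment is a set of presynaptic cells (assumed representation).
    """
    predicted = set()
    for cell, cell_segments in segments.items():
        if any(len(segment & active_cells) > theta for segment in cell_segments):
            predicted.add(cell)
    return predicted


def next_active_cells(active_columns, previously_predicted, cells_per_column):
    """Equation 2.2: in each active column, activate the cells that were
    predicted at the previous step, or every cell if none was predicted."""
    active = set()
    for j in active_columns:
        column_cells = {(j, i) for i in range(cells_per_column)}
        predicted_here = column_cells & previously_predicted
        active |= predicted_here if predicted_here else column_cells
    return active
```

A column whose input was anticipated therefore activates only a few cells, while an unanticipated column activates all of its cells.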

Depolarised cells that get activated will cause the dendritic segment that made the cell depolarised to be reinforced. Columns where all cells are active due to no cells in a predictive state will cause the most active segment among the cells in the column to be reinforced [18].

An illustration of the network before and after learning the sequence A-B-C-D-X-B-C-Y can be seen in figure 2.2C and 2.2D. All cells in the column represented by the current input are activated before learning, but only specific cells are activated after learning this sequence due to strengthening/weakening of the connections between the cells.

2.3 ANNs

ANNs have their origin in the 1940s as an attempt to model nervous activity [20]. An ANN is a network of computational units called nodes or neurons connected by directed edges, as seen in figure 2.4. Each edge has an associated weight, and the output of a node along a certain edge is multiplied by the corresponding edge weight. The input of a node is the sum of these weighted values over all its incoming edges. This sum is then passed through an activation function, usually nonlinear, to determine the output of the node. The nodes are typically organised in layers. The set of nodes that receive the input to the network are referred to as the input layer and the network's output is delivered by the output layer.

The remaining nodes are part of hidden layers. Note that the nodes in the input layer do not perform any computations on their input; they only feed it to the next layer.
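As a small illustration of the computation described above, the following sketch runs a forward pass through a network with three input nodes, four hidden nodes and two output nodes. The use of `tanh` in the hidden layer, the linear output layer, the absence of bias terms and the function names are assumptions made for this example.

```python
import math


def layer(inputs, weights, activation):
    """Each row of `weights` holds the incoming edge weights of one node.
    A node sums its weighted inputs and applies the activation function."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(inputs, hidden_weights, output_weights):
    """Forward pass: input layer -> hidden layer (tanh) -> output layer (linear).
    The input layer only passes its values on without any computation."""
    hidden = layer(inputs, hidden_weights, math.tanh)
    return layer(hidden, output_weights, lambda z: z)
```

For a 3-4-2 network, `hidden_weights` would be a list of four rows with three weights each and `output_weights` a list of two rows with four weights each.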


Figure 2.4: A basic feed forward neural network with three nodes in the input layer, four nodes in the hidden layer and two nodes in the output layer. [14]

2.3.1 Training

In order to determine the optimal weights for the edges, the network has to be trained on training data. The goal is to find the set of weights that minimise the error in the network output. This is done by calculating the gradient of the error function with respect to the edge weights. The most common learning algorithm is called back-propagation [14]. This method uses the chain rule for derivatives in order to determine the derivative of the error function with respect to each individual edge weight. The edge weights are then updated in the opposite direction of the derivative. This is called gradient descent.
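The update rule can be illustrated on the smallest possible case: a single weight `w` in the model `y = w * x` with a squared error. This is a toy sketch, not a full back-propagation implementation; the function name, the learning rate and the number of epochs are assumptions for this example.

```python
def train_single_weight(samples, w=0.0, learning_rate=0.05, epochs=100):
    """Fit y = w * x by gradient descent on the squared error (w*x - y)**2."""
    for _ in range(epochs):
        for x, y in samples:
            error = w * x - y
            # Chain rule: d/dw (w*x - y)**2 = 2 * (w*x - y) * x
            gradient = 2 * error * x
            # Move the weight in the opposite direction of the derivative.
            w -= learning_rate * gradient
    return w
```

With samples generated from `y = 2 * x`, the returned weight approaches 2. In a multi-layer network the same update is applied to every edge weight, with the chain rule carrying the error derivative backwards through the layers.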

2.3.2 RNNs

RNNs are a special kind of ANN which are able to retain information about previous inputs and use this when interpreting new inputs. This is done by maintaining an internal state vector which acts as a memory of previous information.

The input sequence is fed into the network one element at a time. At each time step the network receives the current element as well as the previous state vector through a feedback connection, which is seen in figure 2.5. This means that RNNs can process inputs of variable length, whereas a regular feed forward network can only process inputs of a fixed length. RNNs are trained using a modified back-propagation algorithm called back-propagation through time (BPTT). This is done by unfolding the network and creating a copy of the model for each time step. BPTT suffers from the vanishing gradient problem [12], where the error gradients become very small when calculated over several time steps. This makes it difficult for the network to learn longer range temporal dependencies in the data.
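The recurrence can be sketched with scalar weights. The weight names U, W and V follow the usual notation for an unfolded RNN (U weights the input, W the previous state and V the output); the `tanh` activation and the function name are assumptions for this illustration.

```python
import math


def rnn_forward(inputs, U, W, V, initial_state=0.0):
    """Process a sequence one element at a time, carrying the state forward."""
    state = initial_state
    outputs = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        state = math.tanh(U * x + W * state)
        outputs.append(V * state)
    return outputs
```

Because the same weights are reused at every step, the function accepts sequences of any length. Unfolding this loop over many steps multiplies many derivatives together during training, which is where the vanishing gradient problem arises.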

Figure 2.5: RNN being unfolded in time. x is the input, o is the output and s is the hidden state. U, W and V are weights, where U and W are used to calculate the next state. [14]

2.3.3 LSTM

The vanishing gradient problem can be mitigated by using LSTM networks, which modify the structure of the nodes. In addition to the input and output, LSTM units have three gates: an input gate, an output gate and a forget gate, as seen in figure 2.6. These gates are used to regulate information flow. The forget gate is used to determine which information should be forgotten. The input gate is responsible for deciding which information should be stored in the cell state. The output gate decides which information should be transferred to the next state. For each time step, the cell state is calculated by removing the information forgotten by the cell and adding the information let in through the input gate. LSTM units allow for networks with greatly extended memory of important information, making it possible to utilise them on time series with very long temporal dependencies [21].
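The gate mechanism can be sketched for a single LSTM unit with scalar weights. The sketch follows the commonly used LSTM formulation rather than anything specific to this study; the parameter dictionary `p` and its key names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM unit.

    `p` maps names such as "wf_x", "wf_h" and "bf" to weights and biases
    for the forget (f), input (i) and output (o) gates and the candidate (g).
    """
    def pre_activation(gate):
        return p["w" + gate + "_x"] * x + p["w" + gate + "_h"] * h_prev + p["b" + gate]

    forget = sigmoid(pre_activation("f"))      # how much old cell state to keep
    write = sigmoid(pre_activation("i"))       # how much new information to store
    output = sigmoid(pre_activation("o"))      # how much cell state to expose
    candidate = math.tanh(pre_activation("g"))

    c = forget * c_prev + write * candidate    # updated cell state
    h = output * math.tanh(c)                  # updated hidden state
    return h, c
```

The cell state is updated additively, which lets information persist across many time steps when the forget gate stays close to 1.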


Figure 2.6: An LSTM unit where the input at time t is denoted $x_t$ and the hidden state vector $h_t$. The horizontal line at the top is the cell state which contains all the information currently available. This information gets modified by the three gates, which are (from left to right) the input gate, the forget gate and the output gate. [21]

2.4 Related Work

In Cui, Ahmad and Hawkins' study on Continuous Online Sequence Learning [18], HTM is applied to sequence learning and prediction problems. It is compared to statistical methods and deep learning methods, performing comparably with the state-of-the-art methods. Notably, HTM could handle continuous online learning and showed robustness to spatial noise, which can be crucial in many real-world applications of anomaly detection methods. HTM also showed some limitations: it took longer for it to learn long-term temporal dependencies compared to methods with access to larger history buffers, and it was less robust to temporal noise than some of the other methods. In the study, HTM is only used for low-dimensional data streams. In [9], HTM's suitability for time series anomaly detection is evaluated by comparing its performance to popular statistical methods. The study compares HTM's prediction accuracy to Autoregressive Integrated Moving Average (ARIMA) on several synthetic data sets, and HTM's anomaly detection accuracy to Etsy/Skyline and Twitter's anomaly detection algorithm on two real data sets. In both cases HTM outperforms the statistical methods and shows the ability to quickly adapt to the data.

In [22], HTM is evaluated using the Numenta Anomaly Benchmark (NAB). NAB is an open source framework which provides real and synthetic data sets as well as a scoring algorithm for real-time anomaly detection methods [23]. The scoring algorithm is designed to reward early detection of anomalies. In the study, HTM received the highest score when compared to several common methods including Multinomial Relative Entropy, EXPoSE and Etsy/Skyline.

HTM has also been applied to anomaly detection in breathing patterns [24], micro services infrastructure [25] and use of web services [26].

RNNs have been extensively researched within time series analysis. Within time series classification, some examples are pattern detection in medical data [27] and human activity recognition in mobile usage data [28]. Within time series forecasting, a study [29] showed that RNNs had better results than traditional neural networks when predicting repairable system failures. Adding LSTM units can further improve the performance of RNNs. In a study on anomaly detection in time series [13], LSTM networks show significantly improved performance, compared to a traditional RNN, for short-term as well as long-term temporal dependencies. Another study [6] showed that using LSTM resulted in a significant improvement in detection of long-term dependencies in comparison with simpler RNNs.

In [21] anomaly detection for portfolio risk management was tested using LSTM and HTM. Their performance was measured on different types of anomalies in synthetic data and LSTM showed a significantly higher detection rate across all tested types of anomalies (additive level outlier, level shift outlier, transient change outlier and local trend outlier). The study also showed that LSTM performed better in real-world data compared to HTM.


task of detecting anomalies in time series. The Keras framework was used for the implementation of the LSTM network in our LSTM model and NAB was used for testing the HTM network. NAB was chosen as representative for the HTM network in our study because it is a benchmark tool for anomaly detection in time series and can be used with our own data sets. These two models were then used for detecting anomalies in our own synthetic time series and real world time series.

3.1 HTM Configuration

The NAB includes multiple anomaly detectors, one of which is called numenta and is based on the Numenta Platform for Intelligent Computing (NuPIC) [30], an implementation of HTM theory that is available on GitHub [31]. The HTM architecture is used in NAB by providing it with a stream of sparse vector encoded values as input [30][22]. At each time step the HTM network makes a prediction about a future value, and the network is trained continuously at the same time. In our study, the sparse representation of the predicted value was compared to the true sparse representation of the current value in order to produce a value representing how probable it is that a data point is an anomaly.


3.1.1 Network structure

The HTM network required initial hyper-parameter configuration. See table 3.1 for the complete list of the hyper-parameter values used in our model. These parameter values were taken from an earlier study [22], in which the authors used an HTM network for unsupervised anomaly detection. Since their model managed to detect anomalies, we used the same configuration in this study.

| Parameter name | Value |
|---|---|
| Time of day encoder width | 21 |
| Time of day encoder radius | 9.49 |
| Numeric value encoder number of buckets | 130 |
| Number of columns | 2048 |
| Number of active columns per step | 40 |
| Spatial Pooler connection threshold | 0.2 |
| Spatial Pooler permanence increment | 0.003 |
| Spatial Pooler permanence decrement | 0.0005 |
| Number of cells per column | 32 |
| Dendritic segment activation threshold | 13 |
| Maximum number of segments per cell | 128 |
| Maximum number of synapses per segment | 32 |
| Maximum number of new synapses added at each step | 20 |
| Temporal Memory initial synaptic permanence | 0.21 |
| Temporal Memory permanence increment | 0.1 |
| Temporal Memory permanence decrement | 0.1 |
| Spatial value tolerance | 0.05 |

Table 3.1: Hyper-parameter values used for the HTM configuration.

3.1.2 Training

The network was trained continuously as it received the input values from the time series. The input values were provided in sequential order for every data set, and each data set was initially exposed to an untrained network. Each data point was only used as input once.


all non-zero synapse inputs were activated, regardless of threshold. The exact calculation for the anomalyScore can be seen in equation 3.1, where $P_{t-1}$ is the set of columns predicted at time $t-1$ and $A_t$ is the set of active columns at time $t$. The anomalyScore is equal to the ratio between the number of active columns that were not predicted and the total number of active columns at time step $t$. This yields a value between 0 and 1, where a larger value indicates a higher probability of a data point being an anomaly, since the columns that the sparse encoding of the current input represents were not predicted by the HTM network in the previous time step.

$$
\text{anomalyScore} = 1 - \frac{|P_{t-1} \cap A_t|}{|A_t|} \tag{3.1}
$$

The anomalyScore was then used to label anomalies, by selecting a threshold value where the data points with anomalyScores above a specific value would be labelled as anomalies. We chose a threshold of 0.5 after testing different threshold values on some of our synthetic time series and chose one large enough to not cause too frequent false positives.
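The score and the thresholding step can be sketched as follows. Columns are represented as sets of column indices; the function names, and the choice to return 0 when no column is active, are assumptions made for this illustration.

```python
def anomaly_score(predicted_columns, active_columns):
    """Fraction of the active columns at time t that were not predicted
    at time t - 1 (equation 3.1)."""
    active = set(active_columns)
    if not active:
        return 0.0  # assumption: no active columns means nothing to flag
    predicted_and_active = set(predicted_columns) & active
    return 1.0 - len(predicted_and_active) / len(active)


def label_anomalies(scores, threshold=0.5):
    """Label a data point as an anomaly if its score is above the threshold."""
    return [score > threshold for score in scores]
```

For example, if four columns are active and two of them were predicted, the score is 0.5, which is not above the threshold of 0.5 and is therefore not labelled as an anomaly.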

3.2 LSTM configuration

The LSTM configuration in our study is similar to the one used in [14] where an LSTM network is trained to predict the value one step ahead of the current value in a time series. This prediction about time step t is compared to the true value at time step t where a too large difference resulted in it being labelled as an anomaly. This approach was used in our study because [14] concluded from the results of using this approach that LSTMs are good anomaly detectors in time series. The time series were split into a training set, a validation set and a test set. The training set was used for training, using BPTT, and the validation set was used for preventing overfitting by utilising early stopping. The test set was used for testing the trained network by labelling and calculating the scores used for evaluating the performance.


3.2.1 Network structure

The chosen hyper-parameters for the LSTM model can be seen in table 3.2.

They were chosen by analysing the study [14] that our implementation is based on. We selected similar initial parameter values but modified them slightly, since that study concluded that hyper-parameter tuning is important for good time series forecasting performance [14]. The hyper-parameters were manually tuned until they performed well overall for time series forecasting on the synthetic time series used in this study. We also changed the values of some of the hyper-parameters in order to decrease the training time, since anomaly detection for time series in real life might be urgent to implement in some scenarios. This could for example be anomaly detection for sensor readings on a helicopter at take-off in a previously untested environment.

| Parameter name | Value |
|---|---|
| Input layer nodes | 1 |
| Output layer nodes | 1 |
| Hidden layers | 1 |
| Nodes in hidden layer 1 | 120 |
| Number of epochs | 10 |
| Batch size | 265 |
| Learning rate | 0.02 |
| Dropout | 0.1 |

Table 3.2: Column 1 consists of the hyper-parameters used for the LSTM network and column 2 holds the corresponding values.

3.2.2 Training

The task for the network was to predict the value of the data point one step ahead of the latest input data point. Our network was trained using BPTT, but the unfolding of the network was limited to a maximum of 12 time steps. This is because a time series can consist of thousands of values, and unfolding the network across the whole time series involves many computations, making training inefficient [14]. Each sample in the batches consisted of a data point in the time series together with the sequence of the 12 previous data points. These samples were assumed to be independent, and the state of the network was only maintained for the 12 previous data points.
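One possible reading of this sample construction is sketched below: each sample pairs a window of the 12 preceding values with the value that follows it as the prediction target. The function name `make_windows` and the exact input/target layout are assumptions; the original implementation may arrange the samples differently.

```python
def make_windows(series, lookback=12):
    """Split a univariate series into (window, target) samples.

    Each window holds `lookback` consecutive values and the target is the
    value immediately after the window.
    """
    windows, targets = [], []
    for t in range(lookback, len(series)):
        windows.append(series[t - lookback:t])
        targets.append(series[t])
    return windows, targets
```

A series of n values yields n - 12 samples in this layout, and treating the samples as independent means the network state does not need to be carried over from one sample to the next.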


ing past a few steps on a set of tested real world time series. The activation function for the output layer was linear since this is a regression problem where the values are unbounded and the loss was calculated using the mean square error.

A portion of the time series was used as a validation set for preventing overfitting to the training data, by stopping the training if the loss on the validation set increased during the training. We did not make any attempt to remove anomalies that were in the training or validation set, but they were not labelled either, which means that the network trained on the anomalies as if they were normal data points. The motivation behind not removing anomalies from the training set is that the only available time series in a real world scenario could be a time series full of anomalies. Removing them manually would require deep knowledge about the time series and the anomalies. We also assumed that the anomalies were infrequent enough to not cause the network to consider them as normal, due to our attempts to prevent overfitting by stopping early and utilising dropout. For each time series in our study, a previously untrained network was trained.

3.2.3 Anomaly labelling

Anomalies were labelled by using the difference between the prediction made by the network for time step t at time step t − 1 and the true value at time step t. After training the network, the training set was used again for modelling a Gaussian distribution of the prediction errors. The mean and variance were calculated using maximum likelihood estimation. This Gaussian distribution was later used on the individual prediction errors made by the network for the data points in the test set, by calculating the probability density function value.

A lower probability density function value indicated a higher probability of it being an anomaly, since it was more unlikely to have occurred in the time series given the previous data points, if the LSTM network made a good prediction.

The log of these values was taken to make it easier to separate anomalies from normal values with a threshold. A threshold of -45 was set in order to differentiate the anomalies from the normal data points, and the data points that got a value below that threshold were labelled as anomalies. We do this because we assume that the anomalies differ from the predicted values. The specific threshold of -45 was chosen after running the model on some of our synthetic time series and seeing that this threshold was low enough that normal values were not labelled as anomalies. A constant threshold, and the same hyper-parameters, were used for all tests because an ideal algorithm should achieve acceptable results on different problem characteristics without hyper-parameter tuning [18].
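The labelling procedure in this section can be sketched as below. This is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the natural logarithm is assumed, and the variance of the training errors is assumed to be greater than zero.

```python
import math


def fit_gaussian(errors):
    """Maximum likelihood estimates of mean and variance (divides by n)."""
    n = len(errors)
    mean = sum(errors) / n
    variance = sum((e - mean) ** 2 for e in errors) / n
    return mean, variance


def log_pdf(error, mean, variance):
    """Natural log of the Gaussian density at `error` (variance must be > 0)."""
    return -0.5 * math.log(2 * math.pi * variance) - (error - mean) ** 2 / (2 * variance)


def label_anomalies(test_errors, mean, variance, threshold=-45.0):
    """Label a point as an anomaly when its log density is below the threshold."""
    return [log_pdf(e, mean, variance) < threshold for e in test_errors]


# Prediction errors on the training set define the distribution ...
mean, variance = fit_gaussian([-0.1, 0.1, -0.1, 0.1])
# ... which is then applied to prediction errors on the test set.
labels = label_anomalies([0.0, 0.5, 1.0], mean, variance)
```

With these toy numbers only the largest error falls below the threshold, which shows how a fixed threshold interacts with the spread of the training errors: the smaller the variance, the smaller the error needed to be flagged.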

3.3 Differences between the used models

The major difference between the two models is that the HTM model in NAB utilises continuous learning [22] while the LSTM model was trained on a portion of the time series before it was utilised as an anomaly detector, which means that it does not continuously adapt the network to new trends in the time series. Continuous learning is needed in order to adapt to changes in the trend [18]. This was taken into consideration when creating our synthetic time series, since it limited us to focusing on time series where the general trend does not change over time so that the LSTM network gets the chance to train on the first part of the time series before being used to classify anomalies.

3.4 Evaluation and Performance metrics

Three metrics were used in order to evaluate the performance: precision, recall and F1 score. Precision measures the proportion of correctly identified anomalies out of the total number of reported anomalies. Recall measures the proportion of found anomalies out of all the anomalies present. The F1 score combines the recall and precision scores and was used to make comparison easier where the ratio of precision to recall differed between the models. The exact calculations for these metrics can be found in equations 3.2, 3.3 and 3.4 below, where true positives is the number of correctly labelled anomalies, false positives is the number of normal values that were incorrectly labelled as anomalies and false negatives is the number of anomalies that were incorrectly labelled as normal values.

\[ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \tag{3.2} \]
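The three metrics can be computed from the counts as in the following sketch. It assumes the standard definitions of recall and of the F1 score as the harmonic mean of precision and recall. The function name is hypothetical, and returning 0 when a denominator is zero is an assumption made here for convenience.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) computed from the three counts."""
    reported = true_positives + false_positives   # everything labelled as an anomaly
    present = true_positives + false_negatives    # every anomaly in the data
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / present if present else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of three anomalies found and no false positives.
precision, recall, f1 = precision_recall_f1(2, 0, 1)
```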


belled the anomalies were used to visually see the characteristics of the anomalies found in terms of size, position and context in the time series. This was necessary for some time series in order to get a better understanding of the labelling decisions made by the two models.

The non-parametric Mann–Whitney U test was used to evaluate whether the null hypothesis H₀, that there is no difference between the models’ overall performance, could be rejected in favour of H₁, that they perform differently. This specific test was chosen because we did not know the distribution of the scores.

3.5 Data sets used

We tested the performance of the two models by creating our own artificial data sets and by using real world pre-existing time series with labelled anomalies. Our own synthetic data sets were used in order to determine how well the models perform for classifying point anomalies on data sets with different characteristics while the real-world data sets were used for checking how well the models perform in real world scenarios. The decision to create our own time series was made because of the difficulties of finding real world time series with labelled anomalies which are within the scope of this study. This enabled us to make our own choices about what the time series and the anomalies should look like in our experiments.

3.5.1 Data sets for testing noise robustness

We generated multiple univariate time series with different amounts of noise for testing the models’ noise robustness. The time series tested for noise robustness were exclusively time series with a horizontal trend where the true function was f(x) = 2 for all timestamps in the series. Gaussian distributed noise was added to these time series, where the mean of the noise was 0 and the standard deviation of the noise ranged from 0.00 to 0.32 depending on the time series. Each time series had constant values for the mean and variance of the noise; the severity of the noise did not change depending on the index in the time series. A total of 6 different levels of noise were used, which gave us 6 different time series to use for benchmarking the two models’ ability to handle noise. See table 3.3 for each noise level and corresponding standard deviation and see figure 3.1A for a graph of the time series with noise level 3. The remaining graphs for the different noise levels can be seen in appendix A.1. Gaussian noise was chosen because of the assumption that real world time series can have Gaussian noise originating from, for example, inaccurate sensors or unwanted background signals.

Each time series consisted of 22695 data points where the first data point in the series had Unix timestamp 1386015300. Every following data point was 5 minutes ahead of the previous data point.

Noise level 1 2 3 4 5 6

Standard deviation 0.00 0.02 0.04 0.08 0.16 0.32

Table 3.3: Noise level and corresponding standard deviation applied to the horizontal time series used for noise robustness testing.

Point anomalies were added to the time series by iterating through each data point and adding an anomaly with a probability of 0.1%. The sizes of the anomalies were drawn uniformly at random from the range -1.5 to 1.5. Every anomaly was labelled with a window consisting of two timestamps, one indicating where the anomaly started and the other marking its end. The window for a point anomaly runs from the timestamp preceding the anomaly to the timestamp following it. This means that our models could detect an anomaly 5 minutes after it occurred and still be credited with a successful detection on our synthetic time series. We allowed this since time series can have even denser readings, making a small delay less critical than not detecting the anomaly at all in a real world scenario.
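The generation procedure described in this section can be sketched as follows. This is a hedged illustration rather than the script used in the study: the function name, the `seed` parameter and the choice to add the anomaly on top of the noisy value are assumptions.

```python
import random

START_TIMESTAMP = 1386015300  # Unix timestamp of the first data point
STEP_SECONDS = 5 * 60         # every data point is 5 minutes after the previous one


def generate_flat_series(n_points=22695, noise_std=0.04, anomaly_rate=0.001,
                         max_anomaly=1.5, seed=0):
    """Generate f(x) = 2 with Gaussian noise and random point anomalies.

    Returns (timestamps, values, windows), where each window is a
    (start, end) pair spanning the timestamps before and after an anomaly.
    """
    rng = random.Random(seed)
    timestamps, values, windows = [], [], []
    for index in range(n_points):
        timestamp = START_TIMESTAMP + index * STEP_SECONDS
        value = 2.0 + rng.gauss(0.0, noise_std)
        if rng.random() < anomaly_rate:
            # Assumption: the anomaly offset is added to the noisy value.
            value += rng.uniform(-max_anomaly, max_anomaly)
            windows.append((timestamp - STEP_SECONDS, timestamp + STEP_SECONDS))
        timestamps.append(timestamp)
        values.append(value)
    return timestamps, values, windows
```

Because the anomaly size is drawn from the full range -1.5 to 1.5, some anomalies generated this way are arbitrarily close to zero and can be smaller than the noise.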

The HTM model was served the whole time series and continuously learned the pattern by successively reading each data point and labelling them as they came. The time series was split into a training, validation, and test set for the LSTM model. The first 4929 data points were used for training, the following 1440 data points were used as validation set and the last 14886 data points were used as test set. The first portion of the time series was used as training and validation set because these are the data points that are first available in a time series in a real world scenario with streaming data and can be


Figure 3.1: Visualisation of some of the synthetic time series. The x-axis is the timestamp in increasing order and the y-axis is the corresponding value. A depicts the time series where the true function is f(x) = 2 with noise level 3, B depicts time series Linear 0, C depicts Sinus 1 and D depicts Sawtooth 0.

3.5.2 Data sets for testing time series characteristics

Our experiments included tests which were created in order to find out how the models handle time series with special characteristics. We specifically wanted to see how the models handle forever increasing time series, periodic time series and time series with periodic sudden changes in value. Our synthetic time series for these tests were created and used in a similar fashion to our synthetic time series for testing noise robustness in section 3.5.1. The difference is that the time series trends are different and that the values of the anomalies are different. All time series in this section have Gaussian noise with standard deviation 0.04. Every synthetic time series in this section is visualised in appendix A.2.

We tested the models’ ability to detect point anomalies in time series with forever increasing values by creating 3 time series where the value follows a linear trend with a positive slope. The true function and anomaly magnitude can be viewed in table 3.4. A graph of the time series called Linear 0 can be seen in figure 3.1B. The x in the function is the position of the timestamp in the time series, ranging from 0 to 22695. Anomalies were added to each point with a probability of 0.1%, just as for the time series in section 3.5.1.

Time series   Function             Anomaly range
Linear 0      f(x) = 0.0001 * x    (-1.5, -0.5], [0.5, 1.5)
Linear 1      f(x) = 0.001 * x     (-4, -0.5], [0.5, 4)
Linear 2      f(x) = 0.001 * x     (-14, -0.5], [0.5, 14)

Table 3.4: The characteristics of the synthetic time series with linear trend. The function represents the value for data point with index x in the time series, where the data points are ordered by time.

Two different slopes were used in order to test how the steepness of the slope influences the performance, and noise was added to the time series in order to make them more similar to real world time series. Noise robustness is also a criterion for a good sequence learning algorithm [18], which makes it logical to also make it a criterion for a good anomaly detection algorithm in time series.

We created five time series which follow a sine curve trend for testing the models on periodic trends and for testing them on contextual and global point anomalies. The pattern in periodic time series repeats itself and is interesting to test since many real world measurements repeat themselves over regular time periods. The true function and anomaly sizes can be viewed in table 3.5, and time series Sinus 1 can be seen in figure 3.1C.

The difference between the time series with sine trends is their period length, because we assume that periodic trends in the real world do not all have the same period length. Sinus 3 and Sinus 4 have a period length of exactly 3 hours and 3 days respectively. This was done to see whether it makes any difference to the anomaly detection models, since some real time series might repeat themselves every 3 hours or every 3 days, and not only after an arbitrary number of minutes, which is what is tested in time series Sinus 0, 1 and 2.

Table 3.5: The characteristics of the synthetic time series with sinus trend. The function represents the value for data point with index x in the time series, where the data points are ordered by time.

Another type of periodic time series was also tested. These resembled a sawtooth wave and were piecewise linear with a positive slope but were reset every 1000th data point, which can be seen in figure 3.1D. Two time series of this type were generated and tested. The major difference between those two time series is the steepness of the slopes, which results in a larger difference in value when the value is reset every 1000th step. We tested this characteristic in order to evaluate whether the models can handle recurring sudden changes in value without labelling them as anomalies. The true function and the magnitude of the point anomalies can be viewed in table 3.6.

Time series   Function                       Anomaly range
Sawtooth 0    f(x) = (x % 1000) * 0.0005     (-1.5, -0.5], [0.5, 1.5)
Sawtooth 1    f(x) = (x % 1000) * 0.002      (-1.5, -0.5], [0.5, 1.5)

Table 3.6: The characteristics of the synthetic time series with sawtooth trend. The function represents the value for data point with position x in the time series sequence, where the data points are ordered by time.
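The linear and sawtooth trend functions, together with an anomaly offset drawn from a two-sided range such as (-1.5, -0.5], [0.5, 1.5), can be sketched as below. The function names are hypothetical, and drawing a magnitude first and then a random sign with equal probability is an assumption about how the two-sided range was sampled.

```python
import random


def linear_trend(x, slope):
    """Value of a linear time series at index x, e.g. slope 0.0001 or 0.001."""
    return slope * x


def sawtooth_trend(x, slope, period=1000):
    """Value of a sawtooth time series that resets every `period` data points."""
    return (x % period) * slope


def sample_anomaly_offset(rng, low=0.5, high=1.5):
    """Draw an offset whose absolute value lies between `low` and `high`.

    Assumption: the magnitude is uniform and the sign is chosen with equal
    probability, so small offsets near zero are never produced.
    """
    magnitude = rng.uniform(low, high)
    return magnitude if rng.random() < 0.5 else -magnitude


rng = random.Random(0)
offsets = [sample_anomaly_offset(rng) for _ in range(1000)]
```

With a slope of 0.002 the sawtooth drops from about 2.0 to 0 at every reset, which is larger than the largest anomaly offset of 1.5, while a slope of 0.0005 gives a drop of about 0.5.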

3.5.3 Real-world time series used

We tested the two models on two real world time series with labelled anomalies. The time series are called Occupancy_t4013 and Ec2_request_latency_system_failure. They were selected to be used in our experiments because they had labelled point anomalies in them. Both time series were found in the NAB, and information about them and how we split the data for the LSTM model is listed below.

1. Occupancy_t4013

Occupancy_t4013 consists of real time traffic occupancy readings from the Twin Cities Metro area in Minnesota. This time series was collected by the Minnesota Department of Transportation according to the NAB GitHub repo [30]. The time series was split into different sets for the LSTM model. The training data consisted of the first 641 data points and the next 240 data points were used as validation set. The last 1260 data points in the time series were used as test set. 359 data points located between the validation set and the test set in the time series were not used for anything in the LSTM model.

2. Ec2_request_latency_system_failure

Ec2_request_latency_system_failure consists of CPU usage readings from a server. This time series ends with a complete system failure [30]. The time series was split into different sets for the LSTM model. The training data consisted of the first 820 data points and the next 576 data points were used as validation set. The last 288 data points in the time series were used as test set. 2348 data points located between the validation set and the test set in the time series were not used for anything in the LSTM model.


neural networks performed in time series with different characteristics and the third section displays the results from applying the models to the real world time series that were tested in our experiments.

This section presents two scores for the HTM model where the first one, which we denote as HTM score, includes labels from the whole tested time series, while the other one, denoted as HTM* score, only includes labels from the portion of the time series which was used as test set in the LSTM model.

The scores made by the LSTM model are denoted as LSTM score where the score calculations only include labels from the part of the time series which was used as test set.

4.1 Noise robustness

Performance of the two models on the time series with different amounts of noise is presented in this section. The results are from the synthesised time series where the true function is a constant value. The noise levels ranged from level 0 to 5, where the time series with noise level 0 had no noise and the remaining levels had Gaussian distributed noise with standard deviation ranging from 0.02 to 0.32. Only one time series per noise level was used in the experiments.


Figure 4.1: The recall score for the different models on the horizontal linear time series. HTM includes scores for the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

Figure 4.2: The precision score for the different models on the horizontal linear time series. HTM includes scores for the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

Figure 4.3: The F1 score for the different models on the horizontal linear time series. HTM includes scores for the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

The recall score of the two models when applied to the horizontal linear time series can be seen in figure 4.1. The HTM* recall score was higher than the LSTM recall score on the four time series with the most noise. The LSTM model got a higher recall score than the HTM model when no noise was present in the time series. Both the LSTM model and the HTM model got the same recall score of 0.6 on noise level 1 when only labels from the test portion of the time series were included in the scores. A declining trend is present in the recall scores of all models as the noise increases past noise level 2. The HTM recall score is higher than the HTM* recall score on all noise levels except for noise level 0, where no noise was applied to the time series.

The precision score of the two models can be seen in figure 4.2. Both the LSTM and HTM* precision scores were 1.00 for the four time series with the lowest amount of noise applied to them. The HTM precision score was slightly lower for those four time series, ranging from 0.74 to 0.86. The LSTM precision score declined by 100% between noise levels 3 and 4, while the HTM and HTM* scores declined by 62% and 56% respectively. Both models got a recall score below 0.05 for noise level 5.

The F1 scores from the experiments regarding noise robustness can be found in figure 4.3. The HTM and HTM* scores are similar for all noise levels. The LSTM score is slightly higher than the HTM and HTM* scores for the time series with no noise. With increasing noise, the LSTM score declined and reached 0 faster than the HTM score. The HTM and HTM* F1 scores did not reach 0 for any of the time series with a constant trend, but both were below 0.06 on noise level 5.

4.1.1 Statistical hypothesis testing

A Mann–Whitney U test on the F1 scores of HTM* and LSTM on the 6 time series presented in this section resulted in a U value of 15, while the critical value of U at p < 0.05 is 5 (two-tailed). The result is not significant at p < 0.05 and the null hypothesis H₀, that they overall perform the same on noisy data, cannot be rejected.
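For readers unfamiliar with the test, the U statistic and the decision rule used with a table of critical values can be sketched as follows. This is a generic illustration and not the tool used to obtain the values above: the function names are hypothetical, and ties are counted as 0.5, which is the usual convention.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return the smaller of the two U statistics for two samples.

    For every pair (a, b), a counts 1 if it is larger than b and 0.5 if
    the two values are tied.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return min(u_a, u_b)


def is_significant(u_value, critical_value):
    """With tabulated critical values, H0 is rejected when U <= critical value."""
    return u_value <= critical_value
```

Under this rule a U value of 15 against a critical value of 5 is not significant, since 15 is greater than 5.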

4.2 Trend/characteristics results

In this section we present the results from applying the models on our synthetic time series which follow specific trends, and describe what we found out about the two models regarding their performance. These are the time series that are described in section 3.5.2 and can also be found in appendix A.2 in the shape of graphs. Each time series was tested once for both our models.

The recall scores can be seen in figure 4.4, where we can see that the LSTM model scored higher than the HTM model on all time series and achieved a score above 0.5 on all the time series in the graph except for the time series called Sawtooth 1. This time series had periodic sudden changes just as Sawtooth 0, but they were more severe than the sudden changes in Sawtooth 0. The HTM and HTM* scores were fairly similar across all time series except for time series Sinus 4, where the HTM* score was 80% higher than the HTM score. The only time series where HTM* got a score above 0.5 were Linear 1, Sinus 4 and Sawtooth 0.


Figure 4.4: The recall score for the different models on some of the tested time series, where each model is represented by a bar for each time series. HTM includes labels from the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

Figure 4.5 depicts the precision scores for the time series which were used for analysing how our models perform on time series with special characteristics. The precision score for HTM* was 1.00 on all time series included in figure 4.5 except for time series Sinus 2 and Linear 1, where it got a score of 0.67 and 0 respectively. The LSTM model scored below 0.01 on time series Linear 0, Linear 1 and Linear 2, while it got a perfect score of 1 on all time series with a sine curve trend, regardless of the period length of those curves. The LSTM precision score was 0.89 on Sawtooth 0 and Sawtooth 1, where the major difference between those time series was the magnitude of the periodic sudden changes, which were larger in Sawtooth 1.


Figure 4.5: The precision score for the different models on some of the tested time series, where each model is represented by a bar for each time series. HTM includes labels from the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

Figure 4.6 illustrates the F1 scores for the models, which shows that the LSTM model achieved an F1 score equal to or above 0.75 for all time series with a sine curve trend and for time series Sawtooth 0. The LSTM F1 score was below 0.1 for the rest of the time series in the figure. HTM* F1 scores above 0.5 were present on four of the 10 time series in the figure: Linear 0, Linear 2, Sinus 4 and Sawtooth 0. The HTM* score for Linear 1 was 0. The difference between Linear 0 and Linear 1 was that Linear 1 had a steeper slope, and the difference between Linear 1 and Linear 2 was that Linear 2 had point anomalies which differ more from the general trend in the series.


Figure 4.6: The F1 score for the different models on some of the tested time series, where each model is represented by a bar for each time series. HTM includes labels from the whole time series. HTM* and LSTM include scores using labels only from the part of the time series that was used as test data for LSTM.

Below are three figures depicting the labelling made by the models on two of the time series. Figure 4.7 displays a graph of time series Sinus 1 and the labelling done by the HTM model. It shows us that all of the correctly labelled anomalies are global point anomalies. Figure 4.8 displays the labelling made by the LSTM model on the same time series, where it shows us that the set of correctly labelled anomalies consists of both contextual and global point anomalies. Figure 4.9 depicts the labelling made by the LSTM model on time series Sawtooth 1 and shows that the model falsely labelled the periodic drops in value as anomalies.


Figure 4.7: Data point anomaly labelling done by HTM on time series Sinus 1. Green three pointed stars indicate true positives, red crosses indicate false negatives and magenta plus signs indicate false positives.

Figure 4.8: Data point anomaly labelling done by the LSTM model on time series Sinus 1. Green three pointed stars indicate true positives, red crosses indicate false negatives and magenta plus signs indicate false positives. The data points to the right of the vertical cyan line are the points used in the test data set.


Figure 4.9: Data point anomaly labelling done by the LSTM model on time series Sawtooth 1. Green three pointed stars indicate true positives, red crosses indicate false negatives and magenta plus signs indicate false positives. The data points to the right of the vertical cyan line are the points used in the test data set.

4.2.1 Statistical hypothesis testing

A Mann–Whitney U test on the F1 scores of HTM* and LSTM on all of our synthetic time series, including the ones used for testing noise robustness, results in a U value of 117.5, while the critical value of U at p < 0.05 is 75 (two-tailed). Therefore, the result is not significant on these 16 time series at p < 0.05 and the null hypothesis H₀, that they perform the same overall on our synthetic time series, cannot be rejected.

The Mann–Whitney U test on the F1 scores of HTM* and LSTM on only the 5 time series with sine curve trend gives a U value of 1, and the critical value of U at p < 0.05 is 2 (two-tailed). Therefore, the null hypothesis can be rejected at p < 0.05 for these 5 time series in favour of H₁, that they perform differently on these sine curve time series.


4.3 Results from real world data sets

In this section we present the results from applying HTM and LSTM to real world data. Each time series is presented separately with information about how the models labelled the time series.

4.3.1 Occupancy_t4013

Applying the models to the real world time series Occupancy_t4013 from the NAB showed us that the HTM model correctly labelled both anomalies in the time series but falsely labelled multiple normal data points as anomalies, as seen in figure 4.10, where multiple false positives are so close to each other that they overlap in the chart. The majority of the false positives are located early in the time series. The second to last false positive marker in figure 4.10 is in fact two false positives that are one data point apart from each other. Our LSTM model did not falsely label any normal data points as anomalies. It did however miss the first anomaly in the time series, while correctly labelling the next, slightly more severe, anomaly, as seen in figure 4.11. The precision and recall scores for the two networks can be seen in table 4.1. It tells us that HTM has a precision of only 0.06 for the whole time series, giving it a low F1 score of only 0.11, but a precision of 0.4 when excluding the first portion of the time series.

Network   Recall   Precision   F1 score
HTM       1.00     0.06        0.11
HTM*      1.00     0.40        0.57
LSTM      0.50     1.00        0.67

Table 4.1: Performance metrics for the different models on the Occupancy_t4013 data set. HTM represents the score for the whole time series and HTM* only includes the score for the part of the time series which was used as test data in the LSTM model.
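As a check on the table, the F1 scores can be recomputed from the listed recall and precision values, assuming the F1 score is the harmonic mean of precision and recall. The rounded table values are used, so the last digit is approximate.

```latex
\begin{align*}
F_1^{\text{HTM}}  &= \frac{2 \cdot 0.06 \cdot 1.00}{0.06 + 1.00} = \frac{0.12}{1.06} \approx 0.11 \\
F_1^{\text{HTM*}} &= \frac{2 \cdot 0.40 \cdot 1.00}{0.40 + 1.00} = \frac{0.80}{1.40} \approx 0.57 \\
F_1^{\text{LSTM}} &= \frac{2 \cdot 1.00 \cdot 0.50}{1.00 + 0.50} = \frac{1.00}{1.50} \approx 0.67
\end{align*}
```

The LSTM model ends up with the highest F1 score on this data set despite finding only one of the two anomalies, because the harmonic mean penalises the low precision of the HTM model more heavily.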


Figure 4.10: Data point anomaly labelling done by HTM on data set Occupancy_t4013. Green three pointed stars indicate true positives and magenta plus signs indicate false positives.

Figure 4.11: Data point anomaly labelling done by LSTM on data set Occupancy_t4013. Green three pointed stars indicate true positives, red areas indicate false negatives and magenta plus signs indicate false positives.


4.3.2 Ec2_request_latency_system_failure

Applying the two models to the Ec2_request_latency_system_failure time series showed us that the HTM model was able to correctly label all three anomalies, as seen in figure 4.12, and the only false positives are located at the beginning of the time series. The LSTM model was able to correctly label two of the three anomalies in the time series, as seen in figure 4.13. The correctly labelled anomalies were the two most severe ones out of the three anomalies present in the time series. We can see in table 4.2 that the HTM model got a perfect F1 score of 1.0 if we exclude the false positives at the very beginning of the time series.

Model   Recall   Precision   F1 score
HTM     1.00     0.43        0.60
HTM*    1.00     1.00        1.00
LSTM    0.67     1.00        0.80

Table 4.2: Performance metrics for the different models on the Ec2_request_latency_system_failure data set. HTM represents the score for the whole time series and HTM* only includes the score for the part of the time series which was used as test data in the LSTM model.


Figure 4.12: Data point anomaly labelling done by HTM on data set Ec2_request_latency_system_failure. Green three pointed stars indicate true positives, red crosses indicate false negatives and magenta plus signs indicate false positives.

Figure 4.13: Data point anomaly labelling done by LSTM on data set Ec2_request_latency_system_failure. Green three pointed stars indicate true positives and red areas indicate false negatives.


Discussion

The experiments revealed some insights about the noise robustness of the two networks and how they handle time series with different characteristics. The tests where HTM and LSTM were utilised for detecting anomalies showed that both models performed similarly regarding noise robustness on our synthetic time series. Experiments on the time series with different characteristics showed that the LSTM model performed better than the HTM model for both contextual and global point anomalies but that the LSTM model falsely labelled normal sudden changes as anomalies while the HTM model did not.

The HTM model also performed better at finding anomalies in the real world time series which were included in the experiments, but at the cost of more false positives. However, only 2 real world time series were tested and they had few anomalies, which makes it difficult to make any statistical comparisons on the real world time series. The null hypothesis that the models overall perform the same when including all our synthetic time series could not be rejected at p < 0.05.

5.1 Noise robustness

Our experiments showed that the overall score for both models declined as the noise increased. This was expected since we did not have a lower limit for how small the anomalies could be in the time series where noise robustness was tested. Since the anomalies were randomly created, it was expected that some of them would eventually be so small that they would disappear in the noise. Therefore, we do not draw any conclusions about noise robustness at the noise levels with severe noise. We still believe that the models can be considered capable of handling noisy data well since they got high F1 scores


normalisation.

Both models failed to detect some point anomalies in the noise-free time series where the true value was a constant. This was unexpected, since any point which differs from the trend in a noise-free time series with a constant trend is very unlikely to be a normal value. A possible explanation is that the smallest point anomalies were ignored because the frequency of anomalies was too high in combination with a too high threshold for anomaly selection.

Our results regarding noise robustness for the HTM model are consistent with the previous findings about noise robustness in [18], where both artificial and real world time series are used.

5.2 Time series and anomaly characteristics

Our synthetic time series had trends with different characteristics and anomalies which were both contextual and global. The overall high F1 scores for the LSTM model on the sine curves suggest that the LSTM network can be utilised to detect contextual point anomalies. This was also evident when looking at time series Sinus 1, where the performance of the HTM model was poor on contextual point anomalies. We do not believe that the severity of the contextual point anomalies and a high threshold were the cause of the poor performance of the HTM model, since the model correctly labelled global point anomalies which differed less from the general trend than some of the contextual point anomalies. The underlying cause for ignored anomalies in our HTM model was probably that the predictions made in the time step before the anomalies were too close to the anomalous values, suggesting that the HTM network has learned that the values of the contextual point anomalies are normal enough to be predicted from the previous sequence even though they have not been seen in that context before. The null hypothesis that the two models perform the same on sine trend time series with both contextual and global point anomalies could be rejected at p < 0.05, which supports the conclusion that the LSTM model is better for time series with a sine trend that contain both contextual and global point anomalies.


The LSTM model did not seem to handle sudden changes in value on the sawtooth time series despite them being periodic, since it falsely labelled the sudden changes as anomalies in Sawtooth 1. This however only occurred on the time series where the sudden changes in value were severe enough, making us believe that it was caused by both a bad prediction by the LSTM network in the previous time step and a too low threshold for anomaly selection. The high HTM precision scores on the time series with sudden changes reveal that the HTM model could handle the sudden changes and performed well when the anomalies were large relative to the sudden changes in the time series trend. The LSTM model performed similarly on the time series where the sudden changes were small enough not to be labelled as anomalies by the model, but this was probably due to a high threshold rather than the network making a good prediction.

Considering the linear time series with a positive slope, it was not unexpected that the LSTM model could not handle this characteristic despite the simplicity of the linear trend. The network was trained on another portion of the data and had never seen the new input values, which probably led it to make poor predictions for future time steps, and these predictions were used to detect anomalies.

However, the HTM model handled these time series fairly well, which is probably due to the continuous learning of the HTM network. It was nevertheless surprising that the model did not work as well when the slope was steep relative to the anomaly size in Linear 1, when it worked with the same steepness in Linear 2 and in Linear 0, which had the same anomaly sizes.

5.3 Real world time series performance

The number of real world time series tested in our study is limited, mainly due to the difficulty of finding labelled real world time series consisting only of point anomalies. This makes it difficult to draw any conclusions about how the neural networks perform on real world time series. The difference between the models’ performance in this study was not large enough to determine which one works best for real world time series, but both models were decent anomaly detectors for the limited number of real world time series in this study. Related work in [21] showed that LSTM outperformed HTM on both synthetic and real world time series, but that study focused on portfolio risk management, which involves a different type of time series from the ones tested in this study.

The interesting part was that the HTM found all of the anomalies at the cost of more false positives while the LSTM model did not find all anomalies

differently depending on the characteristics of the time series they are exposed to. Choosing which one to use is easier with knowledge about the time series.

HTM appears to be a good choice if all point anomalies are global, while LSTM is the better choice if the anomalies are contextual and the underlying time series has no large, sudden, erratic variations that are not anomalies themselves. We prefer HTM over LSTM on time series with only global point anomalies, even though the LSTM model could detect them as well, because erratic normal changes are more likely to be ignored by the HTM model. The problem is that one does not always know what the anomalies look like if they have not been seen before.

After our experiments, we do not think that either of these two models should be declared superior as a universal anomaly detector. This is supported by our results, since the null hypothesis could not be rejected when comparing the results from all our artificial time series. The superior model depends on the application. Both networks require some initial decisions about the hyperparameters of the network as well as threshold selection, which might lead to different performance. This means that neither model should be used as a “plug and play” anomaly detector to which any time series can be fed. If analysing a time series is necessary in order to decide which model to use, a better decision might be to resort to a simpler model for anomaly detection.
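For readers unfamiliar with comparing two detectors on the same set of time series, the following sketch shows one possible paired test, an exact sign-flip permutation test on per-series score differences. It is not necessarily the test used in this study, and the F1 scores in the example are invented purely for illustration; they are not our results.

```python
from itertools import product

def paired_permutation_p_value(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test for paired scores.

    Under the null hypothesis that both models perform the same, the sign
    of each per-series difference is arbitrary, so every sign assignment
    is enumerated and compared with the observed total difference.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        statistic = abs(sum(s * d for s, d in zip(signs, diffs)))
        if statistic >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Hypothetical F1 scores of two models on six time series (made-up values).
model_a = [0.90, 0.80, 0.85, 0.95, 0.70, 0.90]
model_b = [0.60, 0.50, 0.70, 0.60, 0.50, 0.65]

p_value = paired_permutation_p_value(model_a, model_b)
print(p_value)
```

The enumeration grows as 2^n with the number of time series, so this exact form is only practical for small collections such as the one considered here.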

5.5 Limitations

This study has two major limitations. The first is that, due to the limited number of tested data sets, we could not perform statistical hypothesis testing on all characteristics. This renders some of our observations inconclusive. The second major limitation is that we did not vary the frequency of the anomalies. We do not know whether our anomalies are too frequent, nor do we know how the two networks perform on time series where the frequency of the anomalies varies. Testing the frequency of occurrence is important because having too many anomalies will make them become part of the noise.
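The claim that overly frequent anomalies become part of the noise can be made concrete with a small sketch. It uses a global z-score, which is a much simpler detector than the two networks studied here, and the series, spike size and spacings are hypothetical. The point is only that the same spike stands out less the more often it occurs.

```python
from statistics import mean, pstdev

def spike_z_score(n, spacing, spike=5.0, noise=0.1):
    """Z-score of a spike when one spike is injected every `spacing` steps.

    The base series alternates between +noise and -noise as a deterministic
    stand-in for noise (an assumption made to keep the sketch reproducible).
    """
    series = [noise if t % 2 == 0 else -noise for t in range(n)]
    for t in range(0, n, spacing):
        series[t] = spike
    return (spike - mean(series)) / pstdev(series)

rare = spike_z_score(1000, spacing=100)    # 1 % of the points are spikes
frequent = spike_z_score(1000, spacing=4)  # 25 % of the points are spikes
print(round(rare, 2), round(frequent, 2))
```

Under a cut-off such as three standard deviations, the rare spikes would be flagged while the frequent ones would not, even though the spike value is identical in both cases.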