
UPTEC IT 17009
Degree project, 30 credits (Examensarbete 30 hp)
June 2017

Evaluating Anomaly Detection Algorithms in Power Consumption Data


Abstract

Evaluating Anomaly Detection Algorithms in Power Consumption Data

Marcus Windmark

The quality of data is an important aspect when performing data scientific tasks. Having a clean ground truth dataset is critical to be able to derive analytical results from experiments.

In this thesis, an automated method of checking the correctness of new data against a defined ground truth dataset is evaluated. With the use of machine learning algorithms, anomaly detection was applied to separate normal and abnormal measurements of power consumption data, collected as time series from real world household appliances. Due to high variance in the energy data, the problem of detecting anomalies was solved with generative models, using the reconstruction error as anomaly score. Extensive experiments were performed using a range of parameters with three different models: simple regression with a Multilayer Perceptron (MLP), Long Short-Term Memory (LSTM) and Dilated Causal Convolutional Neural Network (DC-CNN).

The results of the experiments show promising performance of using generative models for anomaly detection. All three models managed to learn the general power signature of the household appliance measurements, with a varying degree of success relative to the complexity of the signature. Out of the three models, the experiments identified the DC-CNN to be the best performing. Compared with the other models, the DC-CNN had both a higher success rate at classifying anomalous sequences as well as a faster computational speed.

Finally, this thesis concludes that fine-tuning the parameters of the models to the specific task is required to achieve good performance. Finding a good combination of model and parameter values is especially important in the case of handling measurements from household appliances, due to the complexity of the data.


Popular Science Summary (Populärvetenskaplig sammanfattning)

Big data is a popular term today and, with the explosive growth in the amount of data being produced, it is also becoming increasingly important. This requires not only a stable digital infrastructure that can handle the load, but there are also many important aspects when it comes to making use of the data. Data analysis is a field that benefits from large amounts of data, but for the results to be trustworthy, the data must be guaranteed to be of high quality. As the amount of data grows, this becomes increasingly difficult and manual inspection is no longer sufficient.

This thesis project was carried out in collaboration with the company Watty, which provided measurements of the energy consumption of a large number of appliances from real households. The problem at Watty was that it was not possible to guarantee that the measurements performed on the household appliances corresponded to normal circumstances. Fridge doors that might have been left open and washing machines that had broken down were two of many examples of unforeseen events that could have affected the measurements.

The goal of the study was to apply machine learning and neural networks to automatically distinguish behaviour patterns in the energy consumption that were considered abnormal from a larger amount of normal data. Three different machine learning algorithms were evaluated with respect to their ability to reconstruct a signal resembling something they had seen before. By modelling only the normal case, the hypothesis was that the algorithms would make larger errors when attempting to reconstruct signals that deviated from it.

Through experimental evaluation, the study showed that all three algorithms were able to distinguish basic abnormal behaviour patterns in the energy consumption. In cases where the signal patterns were highly complex, this was also reflected in how advanced an algorithm was required to obtain good results.


Acknowledgements

I want to thank all the incredible people at Watty for letting me conduct this thesis and for welcoming me to your office. A special thank you goes to my supervisor Anders Huss, who provided highly valuable feedback throughout the thesis.


Contents

List of Figures
List of Acronyms
1 Introduction
1.1 Scope of Thesis
1.1.1 Delimitations
1.2 Overview of Anomaly Detection
2 Related Work
2.1 History of Anomaly Detection
2.2 Anomaly Detection in Time Series
3 Theory
3.1 Time Series
3.2 Regression Analysis
3.3 Artificial Neural Networks
3.3.1 Training an Artificial Neural Network
3.4 Recurrent Neural Networks
3.4.1 Long Short-term Memory
3.5 Convolutional Neural Networks
3.5.1 Architecture
3.5.2 Convolutional Neural Networks in Time Series
4 Anomaly Detection in Power Consumption Data
4.1 Time Series Data
4.1.1 Datasets
4.2 Anomaly Detection Algorithm
4.3 Evaluated Modelling Algorithms
5.2.2 DC-CNN Filter Count
5.3 Experiments of Look Back
5.4 Experiments of Predict Ahead
6 Discussion
6.1 Experiment Observations
6.1.1 LSTM Net Size
6.1.2 DC-CNN Filter Count
6.1.3 Look Back
6.1.4 Predict Ahead
6.1.5 Overall Performance
6.2 Data Handling
6.3 Algorithm Design
7 Conclusion
7.1 Future Work
References
Appendix A Results of Hyperparameter Experiments
A.1 LSTM Net Size


List of Figures

1.1 Two examples of normal data in combination with abnormal behaviour.
3.1 A simple example of linear regression, where X is presumed to be linearly dependent on Y.
3.2 A basic artificial neuron, showing inputs x1 and x2, each paired with respective weight. The activation function node processes a linear combination of x and w, outputting a value based on the function f.
3.3 An Artificial Neural Network of feedforward type consisting of an input layer, two hidden layers and a single node as output layer.
3.4 A Recurrent Neural Network contains loops and is shown in the leftmost structure, with an input x being processed to an output h and passing along the current state s. The recurrent part can be visualized as unrolling the network.
3.5 In the unrolled visualization of a Recurrent Neural Network (RNN), each layer unit has an activation function, in this case the hyperbolic tangent function (tanh). Figure reproduced with permission from C. Olah. [27]
3.6 Each gate of the Long Short-Term Memory (LSTM) resembles the structure of the RNN, seen in figure 3.5, with the important distinction of the cell state C being passed along between the sequence steps. Figure reproduced with additional clarifying notation, with permission from C. Olah. [27]
3.8 The convolution process of a Convolutional Neural Network (CNN), here shown transforming a 7x7 space into 3x3, processes the image by sliding a filter across the input space. Each filter transformation involves calculating an image convolution, turning each filter matrix into a single value. In this example, the filter has a size of 3x3 with a stride of 1.
3.9 The filter weights affect the image, enhancing certain features. The effect of three example filters are shown, with their weights, relative to the original image having the identity function as filter.
3.10 Pooling works with different operations and in this example, the left operation shows a max-pooling on the data in an example filter, while the right shows mean-pooling.
3.11 An illustration of a stack of dilated causal convolution layers, with an exponentially increasing dilation rate.
4.1 Examples showing measurements of three events of a freezer (left) and a single washing machine event (right), having a sampling rate of 1 second. Note the different scales in the graphs.
4.2 Measurements of the same type of appliance do not have to look the same. Here, three washing machines from different brands are shown with as many different power signatures.
4.3 The first dataset, wm_dw, consisted of normal washing machine training data and anomalous test data where some events had been replaced by dishwasher events.
4.4 The second dataset, dw_wm, consisted of normal dishwasher training data and anomalous test data where some events had been replaced by washing machines.
4.5 The third dataset, single_comp_add, had an intact measurement of single compressor appliances as training data. In the anomalous test data, a square signal with an additive effect had been introduced randomly on top of a regular measurement.
4.6 The fourth dataset, single_comp_erased, consisted of single compressor appliances as training data, with anomalies of a number of erased periods in the test data.
4.7 Two examples of Gaussian distributions, where the blue curve has a mean of 0 with 1 variance and the green curve has a mean of 3 with variance 0.6. The idea was that the Gaussian distribution of the error when predicting an abnormal sequence (green) would not overlap the distribution of the training sequences (blue).
4.8 An example of the overlapping format of the input and target data, here shown with a look back of 2, with the target value set to predict 2 steps ahead.
4.9 The original signal (above) was transformed using one-hot encoding, mapping each value to one of 50 bins (below).
5.1 Comparisons of the precision and recall, when running the LSTM with 64 filters, for the four different datasets.
5.2 Comparisons of the precision and recall, when running the Dilated Causal Convolutional Neural Network (DC-CNN) with 16 filters, for the four different datasets.
5.3 Comparisons of the precision and recall, when classifying the wm_dw dataset using the three different modeling algorithms and their best performing look back parameter.
5.4 Comparisons of the precision and recall, when classifying the dw_wm dataset using the three different modeling algorithms and their best performing look back parameter.
5.5 Comparisons of the precision and recall, when classifying the single_comp_add dataset using the three different modeling algorithms and their best performing look back parameter.
5.6 Comparisons of the precision and recall, when classifying the single_comp_erased dataset using the three different modeling algorithms and their best performing look back parameter.
5.7 Comparisons of the precision and recall, when classifying the wm_dw dataset using the three different modeling algorithms. Each algorithm uses its best performing predict ahead parameter for that model and dataset.
5.8 Comparisons of the precision and recall, when classifying the dw_wm dataset using the three different modeling algorithms. Each algorithm uses its best performing predict ahead parameter for that model and dataset.
5.9 Comparisons of the precision and recall, when classifying the single_comp_add dataset using the three different modeling algorithms. Each algorithm uses its best performing predict ahead parameter for that model and dataset.
5.10 Comparisons of the precision and recall, when classifying the single_comp_erased dataset using the three different modeling algorithms. Each algorithm uses its best performing predict ahead parameter for that model and dataset.
5.11 A sample sequence from the single_comp_erased dataset being classified by the DC-CNN model. The lower the log Probability Density Function (PDF), the higher the probability of it being an anomalous point. Note that the global minima did not occur during, but after, an anomalous region.
6.1 The LSTM performance when having a net size of only 8 was one of the worst performing experiments and significantly worse than with a net size of 64, as was shown in figure 5.1.
6.2 The precision-recall plot of the different look back parameters when applying the LSTM on the wm_dw dataset.
6.3 The left figure shows the error histogram of predicting a sample sequence of the single_comp_add dataset. The right shows the same sequence as the left side, but made anomalous by erasing some periods. Notice the difference in how the two histograms fit the green curve, which is the fitted Gaussian distribution.
6.4 Comparison of the F1-scores between the predict ahead parameters 1 and 5 of the LSTM performance, when classifying the single_comp_erased dataset.
6.5 The normal data of the washing machines in the wm_dw dataset had a large variance. In this sequence, the drop in the log PDF in the middle was related to an unnatural form of event. The rest of the dataset showed similar characteristics.
A.1 The performance of LSTM on the dataset WM-DW with a changing net size.
A.2 The performance of LSTM on the dataset DW-WM with a changing net size.
A.3 The performance of LSTM on the dataset SC-ADD with a changing net size.
A.4 The performance of LSTM on the dataset SC-ERASED with a changing net size.
A.5 The performance of DC-CNN on the dataset WM-DW with a changing number of filters.
A.6 The performance of DC-CNN on the dataset DW-WM with a changing number of filters.
A.7 The performance of DC-CNN on the dataset SC-ADD with a changing number of filters.
A.8 The performance of DC-CNN on the dataset SC-ERASED with a changing number of filters.
B.1 The performance of the DC-CNN on the dataset single_comp_add with a changing number of filters.
B.2 The performance of the DC-CNN on the dataset single_comp_erased with a changing number of filters.
C.1 Comparison of the negative log PDF when running the DC-CNN with 8 and 256 filters, respectively, on the dw_wm dataset. Normal data is blue, abnormal is red. Notice the many more high peaks of normal data in the case of 256, than in 8. This contributes to the better performance when using only 8 filters.
C.2 The prediction of a normal dw_wm sequence using the LSTM.
C.3 The prediction of an abnormal dw_wm sequence using the LSTM.

List of Acronyms

AI Artificial Intelligence
ANN Artificial Neural Network
ARIMA Autoregressive Integrated Moving Average
ARMA Autoregressive Moving Average
AUC Area Under the Curve
BPTT Backpropagation Through Time
CCE Categorical Cross Entropy
CNN Convolutional Neural Network
DC-CNN Dilated Causal Convolutional Neural Network
FNN Feedforward Neural Network
LSTM Long Short-Term Memory
ML Machine Learning
MLP Multilayer Perceptron
MSE Mean Squared Error
NN Neural Network
PDF Probability Density Function
ReLU Rectified Linear Unit


1 Introduction

Our homes have so far been relatively free from intelligence, but this is about to change. With the introduction of Artificial Intelligence (AI) and Machine Learning (ML), not only are the appliances in the household made intelligent with Internet connectivity and smart controls, but a whole new layer of AI is going to get integrated into the home itself. This thesis was done in collaboration with Watty, which is a company that develops a product making the electricity meter intelligent. Without the need for specific meters for the household appliances, this enables real-time monitoring of the power consumption and time stamps of usage for each separate appliance. Algorithms using ML are accomplishing this feat by continuously analysing the measured signals against patterns in previously collected data.

Data collection is an essential part of developing well performing ML models and, in the case of Watty, volunteering households have had their appliances measured over a time range of several years. However, an often overlooked element of the data collection process is the assurance of the data quality. This plays a crucial part in applying the ML algorithms to household appliances, since real-life usage can include irregularities in the power consumption. Fridge doors being left open and washing machines breaking down are just two examples. Currently, such abnormal behaviour is removed by a manual process at Watty. This includes a person checking the correctness of the data, since excluding such events is important to not have incorrect patterns present in the dataset.


1.1 Scope of Thesis

The primary goal of this thesis was to identify and evaluate different methods of applying anomaly detection in the case of power consumption data from household appliances. The study was focused on the following questions:

• What anomaly detection algorithms are most suitable for solving the problem?

– Which are the most important parameters of these algorithms?

• What is the best method of calculating an anomaly score?

• How well can this type of anomaly detection generalize?

– How are natural deviations handled, such as cold weather leading to higher heating usage?

– What is the best way to also detect smaller deviations as anomalies?

1.1.1 Delimitations

The focus of the evaluation of the anomaly detection algorithms was to study the application in a batch processing way. Since the use case at Watty was to sanitize already collected data, there was no need to investigate the possibility of extending to online, real-time, processing. Furthermore, the evaluation was applied to study the characteristics of the household appliances themselves, rather than their usage pattern.

1.2 Overview of Anomaly Detection

Anomaly detection has attracted attention for a long time due to its many applications. Examples of application domains are fraud detection for credit cards, network intrusion detection, mechanical diagnostics and health monitoring. [18]


An anomaly is commonly defined as a sample being further away, using a certain metric, from the majority of the samples. In figure 1.1, two examples of anomalies are shown, with behaviour that clearly differs from the normal. Anomaly detection is the field of finding abnormal behaviour in data to enable analysis and further measurements. [18]

Figure 1.1: Two examples of normal data in combination with abnormal behaviour.

Noise is a term closely related to anomalies, and this thesis uses the definition by Chandola et al. [6]: "Noise can be defined as a phenomenon in data that is not of interest to the analyst, but acts as a hindrance to data analysis". The field of anomaly detection shares many methods with the field of noise removal, with the clear distinction that noise is considered unwanted data, whereas anomalies enable further data analysis.

There are several factors that make anomaly detection a complex task and the most difficult is the wide range of types of data in different fields. For example, both cases seen in figure 1.1 show abnormal behaviours, but the data representation clearly differs. Since an anomaly is defined by its behaviour of not conforming to the normality, it becomes very difficult to define generalized normal and abnormal cases spanning multiple application domains. As much as the anomalies themselves are very dependent on the type of data, so are the methods of anomaly detection. This has resulted in many ad-hoc solutions being created for very particular fields and problems. [6]

Apart from there being many ways to represent data, there are also multiple types of anomalies. Listed below are three common types. [6]

• Point anomalies: The most basic type, where a single data sample on its own deviates from the rest of the data.

• Contextual anomalies: If data samples are considered anomalous only in a specific context they are called contextual anomalies. An example is data representing the temperature during a year, where minus degrees in July is a contextual anomaly, whereas the same measurement in December is normal.

• Collective anomalies: The definition of collective anomalies is that data samples are considered anomalous only when a group of correlated samples is anomalous relative to the entire dataset, even when the samples within the collective would not be considered anomalies on their own. For example, it is a normal occurrence to have a day without snow during the winter, but a collective anomaly if there were several weeks without.

Building upon the three common types of anomalies, a wide range of methods solving the anomaly detection problem has been researched. All methods can generally be put into one of four categories, seen below.

• Association rule mining based: Rule based algorithms generate rules, which represent a sequence of actions inducing a new action, from the data during training. During testing, if a new sequence of actions is found having a frequency below a set threshold, it is deemed an anomaly. [6]

• Classification based: In classification based anomaly detection, the assumption is made that a classifier can separate normal and anomalous classes of the data. In both the single- and multi-class cases, a classifier is trained to learn the characteristics of each class. A test sample is considered anomalous if the classifier can not put it into any of the classes with a high enough confidence. [2]

• Clustering based: By dividing the data into clusters according to similarity, clustering algorithms can, without any prior knowledge, detect anomalies if samples are deviating too much from nearby clusters. K-Means is an example, which forms K clusters during training and compares the distance to each cluster's centroid during testing. [2]


2 Related Work

This chapter introduces a brief history of how the field of anomaly detection has been shaped by research dating back to the 1700's. In addition to this, a summary is given about the current state of the art methods of applying anomaly detection to time series data.

2.1 History of Anomaly Detection

Anomaly detection and the definition of discordant observations have a history dating back to at least the 1700's, when Bernoulli [4] studied the approach of removing anomalies. The research continued with the definition of Peirce's criterion in the 1800's, which was a statistical method for removing anomalies iteratively by comparing the probability of the data to the probability with a subset of the data removed. [28] The early methods of anomaly detection were simple and based only on statistical approaches, commonly assuming a normal distribution to be able to compare every potential outlier's mean and standard deviation to the normal data. [16]

Up until the end of the 1900's, the computational power was a real limiting factor to the type of algorithm and the size of datasets that were reasonable to model. With the recent explosion in available resources, the use of data driven approaches has opened up many possibilities regarding the scale of data that can now be modeled. Machine learning is a discipline for which the interest has increased manifold and, while the same categories of anomaly detection exist in this field, there are new algorithms utilizing the possibility of automatically inferring features to model with the increasing amount of data. [26]

2.2 Anomaly Detection in Time Series


Compared to other types of data, time series introduce sequential dependencies between data points. This has resulted in the problem not being as well understood in the case of handling time series, and the number of available methods is therefore more limited. [7]

A common technique of analysing signals is to apply spectral domain transformations and this is also a method used in anomaly detection in time series. Wrinch et al. [32] performed a study in 2012, using the Fourier transform on the energy demand of building systems to create representable features of the time series data. The goal was to assess the energy consumption of buildings over time and use anomaly detection to find devices not functioning properly. An important aspect of using a frequency transformation such as the Fourier transform is that it assumes a constant periodicity in the data, which in the case of Wrinch et al. was fairly true with constant office hours every day. Applying these types of methods on data with non periodic elements would however result in a high false positive rate. [11]

In a study by Fontugne et al. [11], a new approach to the problem of analysing the energy consumption of buildings was introduced. Compared to the work by Wrinch et al. [32], Fontugne et al. used the method of Ensemble Empirical Mode Decomposition, which is a method to separate non-stationary signals into oscillatory functions, enabling more intricate features than the Fourier transform. The ensemble method relied on estimating correlations between data from multiple devices in the building. The result of the study was a method that was able to accomplish the anomaly detection even with no prior knowledge of the data. [11]

Another approach to anomaly detection is to use regression based models, which are methods that have been studied comprehensively in combination with time series. Two early studies using regression were Fox [12] in 1972 and Abraham and Chuang [1] in 1989. Both of these papers used a form of regression, where a model was fitted on the data and, for each test sample, the error between the predicted value and the target was used as an anomaly score. The method used by Abraham and Chuang furthermore utilized the process of Autoregressive Moving Average (ARMA), which combined regression with a moving average model.


et al. used a Support Vector Machine (SVM) to train a model on only normal data, where new samples were tested against the model.

The method of training a model only on normal data was continued in a study by Malhotra et al. [25], who analysed the performance of Long Short-Term Memory networks (LSTM) compared to Recurrent Neural Networks (RNN). The authors used a stack architecture of LSTMs to model the datasets of time series, such as space shuttle valves and electrocardiograms (ECG). The anomaly detection was performed by fitting a Gaussian distribution to the prediction error occurring when trying to predict a normal sequence and comparing it to the error on newly introduced samples. The results showed that LSTMs clearly performed better than RNNs on all examined datasets. The procedure was mostly unsupervised, but the use of abnormal data when setting thresholds made it semi-supervised.


3 Theory

3.1 Time Series

A univariate time series is defined as a sequence of data points, measured successively over time at uniform intervals. [34]

The distinction between a univariate and multivariate time series is that each measurement of the former is a single variable, whereas the latter incorporates multiple variables for each time step. Multivariate time series are used to model correlations between the variables over time. [34] A time series differs from an event sequence, since the inter-arrival time between consecutive events is allowed to be uneven, while a time series is constrained to uniform intervals. [6]

A time series has four distinct characteristics, described below. [13]

• Cyclic: A cyclical component is a long-term variation, which can be irregular, in the time series that repeats in cycles.

• Irregular: Irregularities in time series are unpredictable variations, making it similar to a random variable.

• Periodic: A periodic component is a regular, repetitive, fluctuation in the time series.

• Trend: A trend is a long-term change, either positive or negative.

When comparing samples of multiple time series, the two key characteristics to compare are periodicity and synchronicity. The definition of two synchronous time series is that they are temporally aligned, that is starting from the same time instance. As an example, two time series are synchronous if they both contain data from Monday. Combining these, there are four possible outcomes, listed below. [7]

• Periodic and Synchronous: The simplest case, where each time series is periodic and they are all relatively temporally aligned.

• Aperiodic and Synchronous: The time series lack a steady period, but are temporally aligned.

• Periodic and Asynchronous: Each time series has a steady period, but they are not temporally aligned.

• Aperiodic and Asynchronous: The time series both lack periodicity and are not temporally aligned.

3.2 Regression Analysis

Regression analysis is a common statistical method that dates back more than two hundred years, to when Legendre published the least squares method in 1805. [23] Since then, the field of regression analysis has grown to incorporate many different types of methods, but in its basic form, it is a method of studying the relationship between two or more variables. There are many different types of regression, but two of the most essential are linear regression and non-linear regression.

The linearity of linear regression comes from the requirement that all parameters of the model have to be scalar dependent. That is, given a dependent variable y and an independent variable x, find the relationship so that equation 3.2.1 is satisfied. In this case, when a single independent variable is used, the method is often called simple linear regression. The general case, using one or more independent variables, is shown in equation 3.2.2. The error term E is typically assumed to be normally distributed, with a mean of 0 and a constant variance. [33] The case of simple linear regression is illustrated in figure 3.1, showing a linear model describing the cluster of (x, y) data points.

$y = \beta_0 + \beta_1 x + E$   (3.2.1)

$y = \beta_0 + \beta_1 x_1 + \dots + \beta_N x_N + E$   (3.2.2)


Figure 3.1: A simple example of linear regression, where X is presumed to be linearly dependent on Y .

In regression based anomaly detection, a model is first fitted on the training data. Then, for each test sequence, the difference between the predicted value of the model and the observed value is used as an anomaly score. [6]
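As a concrete illustration of this regression based scoring, the following is a minimal Python sketch using numpy. The data values, the least squares fit and the use of the absolute error as score are illustrative assumptions and not taken from the thesis.

import numpy as np

# Toy training data roughly following y = 2x (values are illustrative only).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

# Least-squares fit of y = b0 + b1 * x via the design matrix [1, x].
X = np.column_stack([np.ones_like(x_train), x_train])
b0, b1 = np.linalg.lstsq(X, y_train, rcond=None)[0]

def anomaly_score(x, y):
    # Absolute prediction error of the fitted model, used as the anomaly score.
    return abs(y - (b0 + b1 * x))

print(anomaly_score(5.0, 10.0))  # close to the learned trend -> low score
print(anomaly_score(5.0, 25.0))  # far from the trend -> high score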

3.3 Artificial Neural Networks

The field of Artificial Neural Network (ANN) has its origin in neurobiology. The human brain consists of a complex network of approximately 100 billion nerve cells, or neurons, being connected by synapses. In this biological scenario, neurons communicate over the synapses with electrical impulses and a single neuron typically receives many thousands of signals from other neurons. The voltage of an impulse depends on the strength of the actual synapse connection. The total strength of all signals to a neuron can be regarded as the sum of all impulses and each neuron has a threshold mechanism, where signals exceeding it will result in the neuron generating its own voltage impulse. [17]


An artificial neuron mimics this biological behaviour, with each input paired with a weight factor. The neuron calculates a weighted sum based on all its inputs, resulting in a value that is used in the activation function. The activation function, sometimes also called transfer function, acts as a threshold and there are many different types of functions depending on the desired outcome.


Figure 3.2: A basic artificial neuron, showing inputs x1 and x2, each paired with respective weight. The activation function node processes a linear combination of x and w, outputting a value based on the function f.

Each hidden unit h calculates a weighted sum $a_h$ of its n inputs and each respective weight $w_{ih}$. The activation function is then applied to $a_h$, to calculate the actual output from each unit:

$a_h = \sum_{i=1}^{n} w_{ih} x_i$   (3.3.1)

Two of the most common activation functions used in ANNs are the hyperbolic tangent function,

$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$   (3.3.2)

limiting all values to [-1, 1], and the logistic sigmoid function $\sigma(x)$ with a range of [0, 1]. [15]

$\sigma(x) = \frac{1}{1 + e^{-x}}$   (3.3.3)

A third activation function that has become popular in the last few years is the Rectified Linear Unit (ReLU) function,

$ReLU(x) = \max(0, x)$   (3.3.4)

which simply thresholds at zero and has been found to accelerate convergence when training ANNs. [21]

In the last layer of the neural network, the output nodes calculate the resulting value of the whole network in the same way as the nodes earlier. However, it is not necessary for the output nodes to use the same activation function and that choice depends on the task being solved. In the case of a multiclass classification with K classes, a common approach is to apply the softmax function, seen in equation 3.3.5. The function ensures that the sum of all the outputs is one. [10]

$f(z)_j = \frac{e^{z_j}}{\sum_{n=1}^{K} e^{z_n}}$ for $j = 1, ..., K$   (3.3.5)

Multiple artificial neurons create a Neural Network (NN) and the nodes are commonly structured in layers, as can be seen in figure 3.3. The layers in the network are arranged with an input and an output layer, with a number of hidden layers in-between. The structure of the connections between layers depends on the type of network and one significant difference is if there are connections forming cycles or not. The Feedforward Neural Network (FNN) is an acyclic network and the most widely used type is the Multilayer Perceptron (MLP), which is the type shown in figure 3.3. ANNs consisting of cycles are called recurrent, or feedback, neural networks, and are discussed further in section 3.4. [15]

Figure 3.3: An Artificial Neural Network of feedforward type, consisting of an input layer, two hidden layers and a single node as output layer.


3.3.1 Training an Artificial Neural Network

Training an ANN is performed by exposing the network to typical data and adjusting the weights, such that the correct output can be reproduced given a specific input. The most commonly used procedure is performed in two steps, containing a forward pass and a backwards pass. The forward pass consists of processing the input data, as seen in figure 3.2, in each neuron of each layer in the network.

The goal of training the network is to minimize the error between the calculated output $\hat{Y}$ and the target output $Y$. A commonly used error function is Mean Squared Error (MSE),

$MSE(Y, \hat{Y}) = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)^2$   (3.3.6)

which is used in cases where the outputs are numerical values. There are cases where the predictions from the model are distributions instead of numerical values, which is the case of the softmax function in equation 3.3.5. In this case, Categorical Cross Entropy (CCE) is a frequently used error function. CCE is an error function between two distributions $Y$ and $\hat{Y}$, where $Y$ is the true case and $\hat{Y}$ is an approximation of $Y$. Each distribution consists of a number of probability values, where 0 represents definitely false and 1 definitely true. This type of metric heavily punishes a wrong prediction having a high probability. [3] Categorical Cross Entropy is defined as $CCE(Y, \hat{Y})$, where each distribution has $N$ classes.

$CCE(Y, \hat{Y}) = -\sum_{i=1}^{N} Y_i \cdot \log(\hat{Y}_i)$   (3.3.7)
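The two error functions can be sketched in a few lines of numpy. The epsilon clipping below is an added numerical safeguard and not something described in the thesis.

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error, as in equation 3.3.6.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # Categorical Cross Entropy, as in equation 3.3.7. y_true is a one-hot target
    # distribution, y_pred a predicted distribution (e.g. the output of a softmax layer).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)
    return -np.sum(np.asarray(y_true, dtype=float) * np.log(y_pred))

print(mse([1.0, 2.0], [1.1, 1.8]))                            # small error
print(categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))  # confident and correct -> low
print(categorical_cross_entropy([0, 1, 0], [0.8, 0.1, 0.1]))  # confident and wrong -> high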


The error function is often extended with a regularization term, added as an additional term to counteract overfitting. The combination of an error function and regularization is often called the objective function, which is the term used in this thesis. [14]

There are several optimization techniques to minimize the objective function and one of the most fundamental is gradient descent. The idea of gradient descent is to use the derivative of the objective function, relative to the weights of the network, and adjust the weights with a fixed step size in the negative direction. [15]
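In symbols, a single gradient descent step updates each weight $w_{ij}$ as follows, where $\eta$ denotes the fixed step size (a symbol introduced here only for illustration, not used in the thesis):

$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial O}{\partial w_{ij}}$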

Backpropagation is a method of calculating the gradient and it is basically just a repeated application of the chain rule, as seen in equation 3.3.8, working backwards from the output through the hidden layers. The notation in the equations is as follows: $O$ is the objective function, $a$ is the calculated output (as seen in equation 3.3.1) and $y$ is the expected output.

$\frac{\partial O}{\partial a} = \frac{\partial O}{\partial y} \frac{\partial y}{\partial a}$   (3.3.8)

Equation 3.3.9 shows the calculation of the derivative relative to a weight $w_{ij}$, which is what is used in gradient descent. [15]

$\frac{\partial O}{\partial w_{ij}} = \frac{\partial O}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}$   (3.3.9)

3.4 Recurrent Neural Networks

In regular feedforward neural networks, as discussed in section 3.3, connections between neurons are never allowed to form cycles. This limits the network's ability to make assumptions about relations between data samples, because the state of the network is lost after each sample has been processed. These networks are therefore not as suitable for processing tasks with data sequences related in time or space, such as words in sentences and time series. [24] Taking the example of wanting to predict the next word in a sentence of written text, it is advantageous for the network to consider words that are much earlier in the sentence for a more accurate prediction.


While regular feedforward networks are limited to only mapping an input to a corresponding output given a set of weights, the RNN can model whole sequences of dependent items in regards to both input and output. This in turn means that an RNN, theoretically, can model the entire history of previous inputs and outputs. [15] Comparing this to the fixed context window that a regular neural network handles, the strength of the RNN starts to show. An RNN works particularly well with modelling any type of sequential data and it is commonly used in applications of word prediction and machine translation. Another big use case is in image and video processing, since even inherently non-sequential data, such as a single image, can be represented as a sequence using transformations. [24]

Figure 3.4: A Recurrent Neural Network contains loops and is shown in the leftmost structure, with an input x being processed to an output h and passing along the current state s. The recurrent part can be visualized as unrolling the network.

The recurrent part of the RNN comes from the network performing the same operation for every element of a sequence, having the output from one element as extra input to the next. As can be seen in figure 3.4, a way of visualizing this is to unroll the loop and more clearly show that the network processes the input of each step in the sequence. A sequence containing five words would in this way be shown as a 5-layer network, one layer for each word.

Inspecting a single layer unit, as shown in figure 3.5, shows a single activation function combining the current input and the output from the previous sequence step. The same activation functions can be used in the RNN as in MLP, and an often used function is the hyperbolic tangent function (tanh), which is shown in equation 3.3.2.


Figure 3.5: In the unrolled visualization of an RNN, each layer unit has an activation function, in this case the hyperbolic tangent function (tanh). Figure reproduced with permission from C. Olah. [27]

Training an RNN is performed with a modified version of backpropagation. The algorithm is called Backpropagation Through Time (BPTT) and is essentially the same as regular backpropagation, with the important distinction that the gradients are summed at each step t of the sequence, see equation 3.4.1. This is relevant in the case of an RNN, since the network passes along parameters across sequence steps, in contrast to a regular ANN. [31]

$\frac{\partial O}{\partial w} = \sum_t \frac{\partial O_t}{\partial w}$   (3.4.1)

A negative aspect of the RNN is that, while it can model the dependencies between items in a sequence, it suffers from the difficulty of learning long-range dependencies. This poses a problem for example when modeling language, since the meaning of a sentence often relates to words that are not close. As an example, in the sentence "The man who wore a wig on his head went inside", the meaning is about the man going inside and not the wig. [5] The underlying problem is called vanishing gradient and relates to the workings of backpropagation, explained in equations 3.3.9 and 3.4.1. Due to the way the propagation is a multiplicative operation with the gradients, the contribution of an input at time t will be multiplied with an increasingly smaller factor. This results in the gradient shrinking exponentially fast. The problem can also be the opposite, depending on the activation functions, with an exploding gradient: a gradient so much larger in the earlier layers that the others have no effect at all. It is worth noting that neither of these problems are exclusive to the RNN, but they are more apparent compared to a regular FNN due to the design of an RNN being as deep as the sequence length. [24][5]

3.4.1 Long Short-term Memory

Long Short-Term Memory networks (LSTM) were introduced by Hochreiter and Schmidhuber [20] in 1997 as a special type of RNN, aiming to solve the problem of the vanishing gradient. Having been specifically designed to handle long-term dependencies, LSTMs quickly became popular as an alternative to the RNN in applications such as natural language processing. [27]

Figure 3.6: Each gate of the LSTM resembles the structure of the RNN, seen in figure 3.5, with the important distinction of the cell state C being passed along between the sequence steps. Figure reproduced with additional clarifying notation, with permission from C. Olah. [27]

As seen in figure 3.6, the structure of the LSTM can be visualised in an unfolded manner similarly to the RNN in figure 3.5. Where the RNN functions with a single layer, the LSTM has four co-operating layers. The main addition introduced in the LSTM was the cell state C, which acts as a memory channel. This means that, instead of a single output, two outputs, $C_t$ and $h_t$, are calculated per step. These are affected by the four layers in different ways, as described below. [27]

The first sigmoid ($\sigma$) layer acts as a "forget gate", taking both the previous output $h_{t-1}$ and the current input $x_t$ into consideration when deciding how much of the cell state $C_{t-1}$ should be remembered. The result is $f_t$, as shown in equation 3.4.2, where $f_t = 1$ keeps $C_{t-1}$ as it is and $f_t = 0$ completely disregards it.

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$   (3.4.2)

The next two layers act together deciding the information that will be added to the cell state. The first part is a sigmoid layer, which calculates a vector using equation 3.4.3 deciding how much of each state value should be updated. The second layer is a tanh layer, which bears resemblance to the single layer of the RNN as seen in figure 3.5. As shown in equation 3.4.4, this layer calculates values that potentially could be important to store in the cell state.

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$   (3.4.3)

$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$   (3.4.4)

The actual update of the cell state is performed with an addition operation between the candidate values calculated in $i_t \cdot \tilde{C}_t$ and the current cell state, as shown in equation 3.4.5. Performing this as an addition means that new information can unobstructedly be added to the cell state.

$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$   (3.4.5)

The final part of the LSTM block calculates the output of this step t. As shown in equation 3.4.6, this is performed in two steps. A sigmoid layer is yet again used as a masking vector $o_t$, using information in the input to decide what parts of the cell state are going to be outputted. The cell state is used together with a basic activation function tanh and then combined with $o_t$, resulting in the output $h_t$ consisting only of the parts that are calculated to be significant.

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \cdot \tanh(C_t)$   (3.4.6)
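To make the gate equations concrete, the following is a minimal Keras sketch of an LSTM used for next-value prediction on a univariate sequence. It is not the architecture evaluated in the thesis; the window length, layer size and the sine-wave toy data are illustrative assumptions.

import numpy as np
import tensorflow as tf

look_back = 32  # illustrative window length

# LSTM regressor mapping a window of past values to the next value.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(look_back, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Toy data: sliding windows over a sine wave stand in for a power signature.
signal = np.sin(np.linspace(0, 60, 2000)).astype("float32")
X = np.stack([signal[i:i + look_back] for i in range(len(signal) - look_back)])[..., None]
y = signal[look_back:]
model.fit(X, y, epochs=2, batch_size=64, verbose=0)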


3.5 Convolutional Neural Networks

Convolutional Neural Networks (CNN) share many similarities with regular Neural Networks (NN), discussed in section 3.3, since both consist of layers of neurons connected by edges with learnable weights.

The main difference introduced with CNNs is the assumption that the input of the network consists of images with a three dimensional feature space, having a width, height and depth. In contrast to the typical single dimensional input of an NN, the architecture of the CNN was designed to take advantage of this concept. [21] An advantage of the CNN, due to the concept around images, is that it scales very well to "raw" inputs, such as regular RGB pixel values of a photo. While a regular NN can do this as well, the number of weights explodes quickly, considering that an RGB image of only 32x32 pixels results in 3072 weights. A CNN can efficiently skip having hand-designed features and instead rely on the network itself turning into a feature extractor. [21][22]

3.5.1 Architecture

There are three main types of layers when constructing a CNN: the convolutional layer, the pooling layer and the fully connected layer. In figure 3.7, the architecture of a simple CNN is shown, showing the interaction between the data and the layer operations.

Convolutional Layer

The most essential concept of the CNN is the convolutional layer, which is where the network has gotten its name from.

The convolutional operation transforms the input space from a higher dimension into a lower one by having a set of filters convolve (or slide) across the input space. A filter is the CNN's equivalent to the basic neuron in a regular NN. It is defined by a parameterized width and height, but always has the same depth as the input space. The volume of the filter is called the receptive field. A filter is, similarly to the neurons in an NN, defined by a set of learnable weights, of the same size as the receptive field. The number of filters each convolutional layer has is a parameter of the network. [9]

Figure 3.8: The convolution process of a CNN, here shown transforming a 7x7 space into 3x3, processes the image by sliding a filter across the input space. Each filter transformation involves calculating an image convolution, turning each filter matrix into a single value. In this example, the filter has a size of 3x3 with a stride of 1.


For an input space with depth D, each filter convolution results in 1xD values. After the filter has convolved all possible locations of the input, the end result is a volume called a feature map. In the example in figure 3.8, the filter with size 3x3 transforms the 7x7 space into a feature map of size 3x3, since the filter can fit in 9 different positions. Note that this example has a depth of 1, but the concept extends to any depth.

Conceptually, each filter of the convolutional layer can be regarded as a feature identifier. During training, the weights of the filters will learn to detect simple features, such as edges, colors and curves, in the input data. An example of different filters applied to the same input image is shown in figure 3.9. [22]

Figure 3.9: The filter weights affect the image, enhancing certain features. The effect of three example filters are shown, with their weights, relative to the original image having the identity function as filter.

Pooling Layer


Figure 3.10: Pooling works with different operations and in this example, the left operation shows a max-pooling on the data in an example filter, while the right shows mean-pooling.
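The example in figure 3.10 can be reproduced directly with numpy; the 3x3 window values below are the ones shown in the figure, and the snippet is only an illustration of the two pooling operations.

import numpy as np

# The 3x3 pooling window from figure 3.10.
window = np.array([[9, 2, 5],
                   [7, 4, 1],
                   [8, 3, 6]])

print(window.max())   # max-pooling  -> 9
print(window.mean())  # mean-pooling -> 5.0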

Fully Connected Layer

The fully connected layer in CNNs works the same way as in a regular NN. All nodes in this layer are connected to all outputs in the previous layer, which is what the term fully connected refers to. The actual calculation in this layer and in a convolutional layer share many similarities, apart from the fact that the latter only works on the local regions of the filters. Fully connected layers are typically used at the end of the network, calculating the last output of the CNN. [9]

Dilated Causal Layer

The dilated layer also goes by the name of à trous, which is French for "with holes". Dilation is a process of convolution where the filter, instead of being applied to every input value, skips values with a step size larger than one. The term dilation refers to the operation being equivalent to a regular convolution with a larger filter size, but filled with zeros in-between. [30] The step size that is taken when values are skipped is called the dilation rate. Dilation is similar to pooling, as described in section 3.5.1, but upsampling is performed instead of downsampling. [8] A network consisting only of dilated causal layers will henceforth be called a Dilated Causal Convolutional Neural Network (DC-CNN) in this thesis.

The causality of this layer means that a prediction at time t only depends on data at time s < t, in other words not allowing dependencies on future time steps. Causality is often enforced in time series modelling to ensure that only past behaviour is used when predicting the next time step.


Stacking dilated causal layers with an exponentially increasing dilation rate, as illustrated in figure 3.11 with rates 1, 2 and 4, makes for a receptive field of $2^3 = 8$, since every output depends on the eight previous inputs. In comparison, the same network without dilation would have a receptive field of only 4.

The major advantage of using dilation in a CNN comes from the fact that it is computationally effective, enabling very deep networks and therefore a large receptive field, while still preserving the spatial dimensions of the data. [8]

Figure 3.11: An illustration of a stack of dilated causal convolution layers, with an exponentially increasing dilation rate (1, 2, 4).
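A stack like the one in figure 3.11 can be sketched with Keras 1D convolutions, using causal padding and exponentially increasing dilation rates. The filter count, kernel size and the per-step regression head are illustrative assumptions, not the exact DC-CNN configuration evaluated in the thesis.

import tensorflow as tf

def dilated_causal_stack(look_back, filters=16, kernel_size=2):
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(look_back, 1))])
    for dilation_rate in (1, 2, 4):  # exponentially increasing dilation, as in figure 3.11
        model.add(tf.keras.layers.Conv1D(
            filters, kernel_size,
            dilation_rate=dilation_rate,
            padding="causal",        # no dependence on future time steps
            activation="relu"))
    model.add(tf.keras.layers.Dense(1))  # per-time-step prediction head
    return model

model = dilated_causal_stack(look_back=32)
model.summary()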

3.5.2 Convolutional Neural Networks in Time Series

While the CNN assumes that the input data has the format of a three dimensional image, it can still operate on lower dimensions. One example of an application with a lower dimension, where CNNs are applicable as well, is time series. Time series are in their basic univariate form two dimensional, with one measurement per time step, as discussed in section 3.1.


4 Anomaly Detection in Power Consumption Data

The goal of this thesis was to analyse the capability of anomaly detection to be used as a tool to sanitize collected energy consumption data of household appliances. The following chapter has four main parts. The first discusses the details of how the datasets were set up. In the second part, the overall anomaly detection algorithm is explained in detail. Part three covers the different evaluated modelling algorithms that were used in conjunction with the overall system. In the last section, the specifics of the experimental design are discussed.

4.1 Time Series Data

The data used in this thesis was collected by Watty, the company where this thesis was conducted. As stated in section 1.1.1 about delimitations, the problem was limited to analysing anomalies in data from separate household appliances. These were real measurements from volunteering households, where each stationary appliance (i.e. excluding appliances only plugged in when used, such as chargers) had a separate meter. The total dataset of Watty had a large number of appliances, but a small subset with different characteristics was selected for this thesis.

Each measurement was categorized by household and type of appliance. The data was formatted as time series, where the power consumption in Watt (W) had been measured with a sampling resolution of 1 second. Different household appliances vary in behaviour and this has a distinct effect on the way they consume energy. This type of distinct variance in the energy consumption is referred to as their power signature and will be an important factor in the anomaly detection process, explained further in section 4.2.


Figure 4.1: Examples showing measurements of three events of a freezer (left) and a single washing machine event (right), having a sampling rate of 1 second. Note the different scales in the graphs.

The freezer shows a periodic power signature, whereas the signature of the washing machine depends on the user, since it follows a set of predefined washing programs.

The variance occurred both between different types of appliances, such as in the example shown in figure 4.1, and within each appliance group. The variance between three washing machines is shown in figure 4.2. These measurements come from three physically different washing machines and their signatures are very different. There are also similarities, such as all three signatures showing a high usage for a number of periods at the start and a similar looking tail.

The data collected by Watty had been cleaned, to remove occurrences of the easily detected anomalies in the measurements. However, due to the large quantity of data, there was still a possibility that real anomalies could have been missed and were still present in the datasets used in this thesis.


Figure 4.2: Measurements of the same type of appliance do not have to look the same. Here, three washing machines from different brands are shown with as many different power signatures.

4.1.1 Datasets

Due to the large variety of anomalies occurring naturally and the lack of documented cases of their behaviour, it was deemed too unpredictable to study real anomalies in this study. Therefore, the assumption was made that the data was clean and without anomalies, despite the small possibility of a few still being present. Even if such occurrences were included, they were considered to be so few and far between as to be negligible for the evaluation.


Instead, artificial anomalies were introduced into the test data, for example by replacing events with those of another appliance or by distorting the values in a certain probabilistic way.

To analyse the performance of the anomaly detection algorithms, four datasets were created. The aim was to have datasets reflecting possible real life scenarios, so that the behaviour of the algorithms could generalize beyond the experimental setting. The purpose was also to study the performance of the algorithms with data having different characteristics, such as periodicity. Two distinct types of datasets were created, event based and periodical, explained in detail below.

Dataset 1 and 2: Event Based Data

The purpose with the event based dataset was to create a controlled environment to specifically analyse the algorithms’ ability to model a certain type of event. For this thesis, event based appliances were defined as devices being used sporadically without a period.

Two appliances that fit that requirement were the washing machine and dishwasher. Both have similar use cases, with the device mostly being turned off and periods stretching several hours when they are in use. Another similarity is the use of a set of predefined programs that the appliance follows, with defined start and end signatures.

Since the focus was on the actual events and not the usage patterns of the washing machine and dishwasher, pre-processing was performed to densify the data. The following steps were applied for each separate appliance in the dataset (a sketch of these steps in code follows the list). After the filtering, the resulting events were concatenated, creating artificial sequences of clearly divided events, without unnecessary holes in between.

• The least amount of power consumption was set to 20 W, to prevent noise in the data.

• An event had to draw zero power for at least 1 hour for it to have fully ended. This prevented programs with short periods of requiring no power from being split, but still divided separate events.

• The events were filtered using the criteria that the shortest on

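The listed pre-processing steps could look roughly as follows in Python. The 20 W noise floor and the 1 hour gap come from the criteria above; the array handling, and the omission of the (truncated) minimum-length filter, are assumptions of this sketch.

import numpy as np

def extract_events(power, min_power=20.0, min_gap=3600):
    # `power` is assumed to be a 1 Hz numpy array of readings in W for one appliance.
    power = np.where(power < min_power, 0.0, power)  # treat readings below 20 W as off
    events, current, zeros = [], [], 0
    for value in power:
        if value > 0:
            current.append(value)
            zeros = 0
        elif current:
            current.append(0.0)
            zeros += 1
            if zeros >= min_gap:  # at least 1 hour of zero power ends the event
                events.append(np.array(current[:-zeros]))
                current, zeros = [], 0
    if current:
        events.append(np.array(current[:len(current) - zeros]))
    # Concatenate the events into a dense artificial sequence without holes in between.
    return np.concatenate(events) if events else np.array([])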


Figure 4.3: The first dataset, wm_dw, consisted of normal washing machine training data and anomalous test data where some events had been replaced by dishwasher events.

The first dataset was called wm_dw and consisted of washing machine samples, with some events replaced with a dishwasher. An example can be seen in figure 4.3. The second, called dw_wm, was the reverse case, with normal dishwasher data and inserted washing machine events, which is shown in figure 4.4.

The replacement was an attempt to emulate the possibility of the data having been mixed up in the data collection process. The reasoning for creating two datasets with replacements going both ways was to enable easy comparisons between the algorithms' performances when classifying the data.

Dataset 3 and 4: Periodical Data

The second type of dataset was created with appliances having a periodical pattern in the power signature. A typical example of appliances following such a regular behaviour is those having a single compressor, such as fridges and freezers. The reason for choosing single compressors was their simple power consumption behaviour, compared to the more complex dual compressors.

The aim of the datasets with periodical data was to study the algorithms' abilities to model implicit periodicity. The pre-processing was therefore simpler than in the case of the event based datasets, leaving the events and the off periods in between intact. The only pre-processing that was performed was due to shortcomings in the actual data collection process. During the measurements, some sensors had difficulties sampling at a steady rate of 1 Hz and the raw data therefore included holes without any measured power consumption at all. These missing values were interpolated in the pre-processing.
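A sketch of this interpolation step, assuming the raw readings are loaded into a pandas Series indexed by timestamp (the file and column names are hypothetical, not taken from the thesis):

import pandas as pd

raw = pd.read_csv("single_compressor.csv", parse_dates=["timestamp"])  # hypothetical file
series = raw.set_index("timestamp")["power_w"]                         # hypothetical column

regular = series.resample("1s").mean()        # enforce a steady 1 Hz sampling grid
regular = regular.interpolate(method="time")  # fill the holes left by the sensors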

In the third dataset, single_comp_add, the periodical pattern of the single compressor appliances was combined with a square wave added on top of the real signal. This was an attempt to emulate the scenario of having accidentally measured more than the intended appliance. An example is seen in figure 4.5 and, since the square wave could occur anywhere in the measurement, this introduced some intervening periods that never reached a zero power consumption.

In the fourth and last dataset, called single_comp_erased, the goal was to analyse the behaviour of the modelling when periods in such periodical data were missing. Figure 4.6 shows that this could occur both with single periods as well as with many periods in a row. This anomaly could very well happen in real life, due to either a defective appliance or a faulty sensor.
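To make the construction of the anomalous test data more concrete, the sketch below shows one way the two kinds of anomalies could be injected into a clean signal; amplitudes, lengths and function names are illustrative and not taken from the thesis.

```python
import numpy as np

def add_square_wave(signal, amplitude=60.0, length=180, rng=None):
    """single_comp_add: add a square pulse of `length` samples on top of the
    real measurement at a random position (emulates measuring an extra device)."""
    rng = rng if rng is not None else np.random.default_rng()
    start = rng.integers(0, len(signal) - length)
    anomalous = signal.copy()
    anomalous[start:start + length] += amplitude
    return anomalous

def erase_periods(signal, start, stop):
    """single_comp_erased: zero out one or several compressor periods
    (emulates a defective appliance or a faulty sensor)."""
    anomalous = signal.copy()
    anomalous[start:stop] = 0.0
    return anomalous
```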

4.2 Anomaly Detection Algorithm


Figure 4.5: The third dataset, single_comp_add, had an intact measurement of single compressor appliances as training data. In the anomalous test data, a square signal with an additive effect had been introduced randomly on top of a regular measurement.

In figure 4.1, the significant difference in behaviour between a washing machine and a freezer was shown. The same was found to be true for all different household appliances, with vastly different energy patterns depending on the type of appliance. This was a reason to model each appliance type separately, which also corresponded well with using the anomaly detection to sanitize collected data for a certain appliance. Despite only modelling one appliance at a time, the task was still not trivial, since there could be large variance even within each appliance group, as figure 4.2 showed.


Figure 4.6: The fourth dataset, single_comp_erased, consisted of single compressor appliances as training data, with anomalies in the form of a number of erased periods in the test data.

The diversity of the anomalies was also an argument for not hand crafting features, which would incorrectly limit the detected anomalies to a defined feature space. Furthermore, since the dataset largely consisted of aperiodic and asynchronous time series, the anomaly detection method had to be robust against a high variation also between normal samples. The anomaly detection system was therefore set up around a prediction based, generative, model. The idea was to train the prediction capabilities of the model on a training set and then let it predict new data, with both steps working only with normal data. Using the prediction errors on data without anomalies as a baseline, the assumption was that the model should do worse when reconstructing anomalies. The prediction error during training was observed to fit a Gaussian distribution N(µ, σ²), and the probability of a data point being anomalous therefore corresponded to its Probability Density Function (PDF) value. As illustrated in figure 4.7, the goal was to have the Gaussian distribution of the prediction error of an abnormal sequence not match the distribution of the training sequences.


Figure 4.7: Two examples of Gaussian distributions, where the blue curve has mean 0 and variance 1, and the green curve has mean 3 and variance 0.6. The idea was that the Gaussian distribution of the error when predicting an abnormal sequence (green) would not overlap the distribution of the training sequences (blue).

Rather than scoring each data point in isolation, the contextual aspects of the anomalies were captured as well. A sequence in this context meant a time series spanning 48 hours with a resolution of 1 minute, as shown in section 4.1.1. To enhance the effect of the anomalous regions across the sequence, a rolling mean with a 2-hour window was applied to the error before fitting the Gaussian. In this step, the error was also standardized to make different PDF values easier to compare.
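A sketch of this error post-processing, assuming the per-timestep prediction errors of one 48-hour sequence are available as a NumPy array at 1-minute resolution (so the 2-hour window corresponds to 120 steps); the edge handling of the rolling mean is an arbitrary choice here:

```python
import numpy as np

def smooth_and_standardize(errors, window=120):
    """Rolling mean over a 2-hour window (120 one-minute steps),
    followed by standardization to zero mean and unit variance."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(errors, kernel, mode="same")
    return (smoothed - smoothed.mean()) / smoothed.std()
```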

The full training procedure of the system consisted of a number of steps, described below.

1. Pre-process the data according to set parameters, described further in section 4.4.1.

2. Split the normal data into n1, n2, n3, each consisting of a number of time series sequences.

3. Create the prediction model and train on the n1 data. During training, 20% of n1 was used as a validation set with early stopping.


4. Let the model try to predict all sequences in n2 and calculate the categorical cross entropy error err2 (equation 3.3.7) between the predicted and target values.

5. Apply a rolling mean of 2 hours, as well as a standardization, on err2.

6. Fit the error err2 to a Gaussian distribution gauss.

7. Attempt to predict all sequences in n3 separately and process the error as in step 5. Using this error, calculate the PDF pdf3 based on the already fitted gauss.

8. Using pdf3, calculate a threshold thold based on the misclassification error and a fixed accepted error ratio.

The process of classifying a new sequence, using the trained model together with the calculated threshold, is described below.

9. Using the trained model, attempt to predict the new sequence test, and follow the same procedure as steps 4 to 7, resulting in pdftest.

10. Compare every point in pdftest to thold. If any point is over the threshold, the whole sequence test is classified as abnormal, otherwise normal.
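Steps 4 to 10 can be summarised roughly as in the sketch below. It assumes that err2, err3 and err_test are arrays of per-timestep prediction errors computed elsewhere, reuses the smooth_and_standardize helper sketched earlier, and uses an illustrative accepted error ratio; the comparison direction assumes that a low PDF value marks an unlikely, and therefore anomalous, point.

```python
import numpy as np
from scipy.stats import norm

def fit_error_distribution(err2):
    """Steps 5-6: smooth/standardize the n2 errors and fit the Gaussian."""
    mu, sigma = norm.fit(smooth_and_standardize(err2))
    return mu, sigma

def calibrate_threshold(err3, mu, sigma, accepted_error_ratio=0.01):
    """Steps 7-8: PDF values of the normal n3 errors, then a threshold that
    misclassifies at most the accepted ratio of normal points."""
    pdf3 = norm.pdf(smooth_and_standardize(err3), loc=mu, scale=sigma)
    return np.quantile(pdf3, accepted_error_ratio)

def classify_sequence(err_test, mu, sigma, thold):
    """Steps 9-10: one point beyond the threshold flags the whole sequence."""
    pdf_test = norm.pdf(smooth_and_standardize(err_test), loc=mu, scale=sigma)
    return "abnormal" if np.any(pdf_test < thold) else "normal"
```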

4.3 Evaluated Modelling Algorithms

In combination with the four datasets described in section 4.1.1, the evaluated modelling algorithms of this thesis consisted of a baseline together with two more sophisticated NNs. Neural Networks have in recent years had an upswing in popularity and the goal of this thesis was to compare the recurrent performance of the LSTM with the computationally efficient DC-CNN. The theory behind these three algorithms was explained in sections 3.2, 3.4.1 and 3.5.

• Multilayer Perceptron (MLP) Regression: Implemented with a simple MLP network, regression acted as a baseline comparison of the performance.

• Long Short-Term Memory (LSTM): The recurrent candidate, chosen for its ability to model temporal dependencies across long sequences (see section 3.4.1).


• Dilated Causal Convolutional Neural Network (DC-CNN): With the promising results shown by van den Oord et al. [30], the DC-CNN stood out as the cutting edge method of applying CNNs on univariate time series. It was therefore an interesting comparison to the LSTM.
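Since the exact layer configurations are not listed here, the sketch below only illustrates how the three kinds of models could be set up in Keras, with the output predicting a distribution over the 50 one-hot bins to match the categorical cross entropy error of section 4.2; layer sizes and the number of dilation levels are illustrative.

```python
from tensorflow.keras import layers, models

N_BINS = 50        # one-hot encoded power levels (section 4.4.1)
LOOK_BACK = 32     # illustrative value from the look back grid

def build_mlp():
    # Baseline: plain feed-forward regression on the last LOOK_BACK one-hot steps.
    return models.Sequential([
        layers.Flatten(input_shape=(LOOK_BACK, N_BINS)),
        layers.Dense(64, activation="relu"),              # illustrative hidden size
        layers.Dense(N_BINS, activation="softmax"),
    ])

def build_lstm(net_size=64):
    return models.Sequential([
        layers.LSTM(net_size, input_shape=(LOOK_BACK, N_BINS)),
        layers.Dense(N_BINS, activation="softmax"),
    ])

def build_dc_cnn(n_filters=64, n_levels=5):
    # Stacked dilated causal convolutions with filter length 2; the dilation
    # rate doubles per layer, so the receptive field grows exponentially.
    net = models.Sequential()
    net.add(layers.Conv1D(n_filters, kernel_size=2, padding="causal",
                          dilation_rate=1, activation="relu",
                          input_shape=(LOOK_BACK, N_BINS)))
    for level in range(1, n_levels):
        net.add(layers.Conv1D(n_filters, kernel_size=2, padding="causal",
                              dilation_rate=2 ** level, activation="relu"))
    net.add(layers.Lambda(lambda x: x[:, -1, :]))         # keep only the last time step
    net.add(layers.Dense(N_BINS, activation="softmax"))
    return net

model = build_dc_cnn()
model.compile(optimizer="adam", loss="categorical_crossentropy")
```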

4.4 Experimental Design

The main method of this thesis was empirical, with experiments playing a significant role in the analysis. All modelling algorithms, described in section 4.3, were set up around a common interface with a shared data format. When evaluating the performance of the modelling algorithms, a set of parameters constituted the base of the experiments.

The first type of parameter consisted of how many of the previous time steps should be used as input to predict the next. Due to the different inner workings of the algorithms, these values differed depending on the type, as can be seen in table 4.1. The simple model of the MLP relied only on the specified last values, whereas the LSTM had these values as input to the state of the current time step, in combination with its theoretically infinite history from previous time steps (see section 3.4.1). The dilated structure of the DC-CNN made it much more computationally efficient to use a longer look back, as explained in section 3.5.1.

Algorithm   Look Back Values
MLP         1, 5, 10, 30, 50
LSTM        1, 5, 10, 30, 50
DC-CNN      8, 16, 32, 64, 128, 256, 512

Table 4.1: The look back parameter differed slightly depending on which modelling algorithm was used, ranging from [t-1, t-1] to [t-512, t-1].


The predict ahead parameter was an experiment to study the effect of letting the model not rely on the immediate past values to forecast the next, but rather forcing it to learn the temporal structure of the time series. While the look back parameter differed depending on the algorithm, the value to predict ahead of time was the same for all of them, as seen in table 4.2. This was so that the performance of the algorithms could be evaluated fairly against each other.

Algorithm            Predict Ahead Values
MLP, LSTM, DC-CNN    1, 5, 10, 15, 20

Table 4.2: The predict ahead parameter decided which value to predict next, ranging from [t+1] to [t+20] for all algorithms.

Regarding hyperparameters of the modelling algorithms, the two adjustable hyperparameters for the LSTM and DC-CNN were Net Size and N Filters, respectively. The former defined the size of the internal projections in the LSTM, whereas the latter specified the number of filters per convolutional layer. The filter count was the same for all layers in the DC-CNN. The values of the hyperparameters can be seen in table 4.3. The MLP algorithm was a simple regression and did not require any further hyperparameters.

Algorithm   Hyperparameter   Values
LSTM        Net Size         8, 16, 32, 64, 128, 256
DC-CNN      Filter Count     8, 16, 32, 64, 128, 256

Table 4.3: The hyperparameters for the LSTM and DC-CNN.

In addition to the parameters shown in tables 4.1 to 4.3, there were a few settings that were kept fixed throughout all experiments, listed below.

• The filter length of the DC-CNN was set to 2, a parameter that worked well for van den Oord et al. in the WaveNet paper [30].

• The optimization technique Adaptive Moment Estimation (Adam) was used as an alternative to Stochastic Gradient Descent (SGD), and has been shown to work well in practice [29].
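Put together, the search space of tables 4.1 to 4.3 and the fixed settings above can be captured as a simple grid; the layout below is only an illustration, and in practice the search was performed in the two stages described next rather than as one full Cartesian product.

```python
from itertools import product

PARAM_GRID = {
    "MLP":    {"look_back": [1, 5, 10, 30, 50], "hyper": [None]},
    "LSTM":   {"look_back": [1, 5, 10, 30, 50], "hyper": [8, 16, 32, 64, 128, 256]},
    "DC-CNN": {"look_back": [8, 16, 32, 64, 128, 256, 512], "hyper": [8, 16, 32, 64, 128, 256]},
}
PREDICT_AHEAD = [1, 5, 10, 15, 20]                 # the same for all algorithms (table 4.2)
FIXED = {"filter_length": 2, "optimizer": "adam"}  # held constant in every experiment

# Enumerate every configuration of algorithm, look back, hyperparameter and predict ahead.
configs = [
    {"algorithm": algo, "look_back": lb, "hyper": h, "predict_ahead": pa, **FIXED}
    for algo, grid in PARAM_GRID.items()
    for lb, h, pa in product(grid["look_back"], grid["hyper"], PREDICT_AHEAD)
]
```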


The experiments were set up as a form of parameter search in two stages. The first part consisted of finding well performing hyperparameters for the LSTM and DC-CNN models, as was shown in table 4.3. The best parameter of each model was then used in the subsequent experiments. In the second part of the experiments, the actual parameters of the anomaly detection were tested. For all of the tested parameters, the experiments included running all of the four datasets described in section 4.1.1.

To prevent overfitting when training the modelling algorithms, early stopping was set up according to the parameters in table 4.4. By reserving 20% of the training data as a specific validation set, the error of each epoch was continuously calculated on both the training and validation sets. The objective was to minimize the validation error, but in the case of it not improving over a period of 10 epochs, the training stopped early. The maximum number of epochs during training was set to 500, which was a parameter found to work well during initial testing.

Early Stopping Parameter   Value
Validation Split           20%
Minimum Delta              0.0
Patience                   10
Maximum Epochs             500

Table 4.4: The early stopping was set up with a number of parameters, deciding the amount of data to validate against and when to stop the training of the models.
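The settings in table 4.4 map closely onto the standard Keras early stopping callback; a minimal sketch, assuming a compiled Keras model and one-hot formatted training arrays x_train and y_train:

```python
from tensorflow.keras.callbacks import EarlyStopping

def train_with_early_stopping(model, x_train, y_train):
    early_stop = EarlyStopping(monitor="val_loss",  # minimize the validation error
                               min_delta=0.0,       # Minimum Delta
                               patience=10)         # stop after 10 epochs without improvement
    return model.fit(x_train, y_train,
                     validation_split=0.2,          # 20% reserved as validation set
                     epochs=500,                    # Maximum Epochs
                     callbacks=[early_stop])
```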

4.4.1 Data Formatting

The parameters of look back and predict ahead, as seen in tables 4.1 and 4.2, required special data formatting to fit the models. Figure 4.8 shows an example of how a time series in the form of a vector is formatted into an input and output, with an overlapping method. The formatting of the input was only necessary in the MLP and LSTM, since the DC-CNN performed an equivalent operation built into the network.


Original Vector: 0 1 2 3 4 5

Input    Target
0 1      3
1 2      4
2 3      5

Figure 4.8: An example of the overlapping format of the input and target data, here shown with a look back of 2, with the target value set to predict 2 steps ahead.

Training and testing consisted of individually predicting a large number of such sequences.
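A sketch of this overlapping formatting, as used for the MLP and LSTM inputs, reproducing the example in figure 4.8 with a look back of 2 and a predict ahead of 2; the function name is illustrative.

```python
import numpy as np

def make_windows(series, look_back, predict_ahead):
    """Turn a 1D time series into overlapping (input, target) pairs, where each
    input holds `look_back` consecutive values and the target is the value
    `predict_ahead` steps after the last input value."""
    inputs, targets = [], []
    for i in range(len(series) - look_back - predict_ahead + 1):
        inputs.append(series[i:i + look_back])
        targets.append(series[i + look_back + predict_ahead - 1])
    return np.array(inputs), np.array(targets)

# Reproduces figure 4.8: inputs [[0,1], [1,2], [2,3]], targets [3, 4, 5].
x, y = make_windows(np.arange(6), look_back=2, predict_ahead=2)
```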

To minimize the risk of the models being confused by intricate minor patterns in the data during training, the decision was made to apply One-hot Encoding. This is a form of normalization, where each power measurement value maps to one of N bins, thus reducing the continuous value to an array of zeros with a single 1 at the index of the bin the value was mapped to. In the context of applying anomaly detection to the appliance measurements, it was considered most important for the models to learn the state of the appliance and not the exact value. In this thesis, N = 50 was observed to work well, with the bins evenly spaced between the minimum and maximum values of the dataset. An example can be seen in figure 4.9, showing both the original and the encoded signal.


Figure 4.9: The original signal (above) was transformed using one-hot encoding, mapping each value to one of 50 bins (below).
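A sketch of the binning and one-hot encoding step, assuming N = 50 evenly spaced bins between the minimum and maximum of the dataset; the function name is illustrative.

```python
import numpy as np

def one_hot_encode(values, n_bins=50, vmin=None, vmax=None):
    """Map each power value to one of `n_bins` evenly spaced bins and return
    the corresponding one-hot vectors (shape: len(values) x n_bins)."""
    vmin = values.min() if vmin is None else vmin
    vmax = values.max() if vmax is None else vmax
    edges = np.linspace(vmin, vmax, n_bins + 1)
    bins = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    return np.eye(n_bins)[bins]
```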

4.4.2 Performance Metrics

To calculate the performance of the anomaly detection classifier, a set of thresholds was compiled automatically based on the range of values of the PDF of the error. Based on this range, 40 evenly spaced thresholds between the minimum and maximum were calculated, and the performance was computed for each. For evaluation purposes, this range of thresholds replaced the calculation of the dynamic threshold during training, as described in step 8 in section 4.2.

The two major performance metrics used in this thesis were precision and recall, calculated according to equations 4.4.1 and 4.4.2, respectively.

Precision = TP / (TP + FP)    (4.4.1)

Recall = TP / (TP + FN)    (4.4.2)
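For illustration, the sketch below computes precision and recall for each of the 40 candidate thresholds, assuming pdf_values holds the per-point PDF scores of the test sequences (one row per sequence), labels marks which sequences are truly anomalous, and that a point falling below the threshold flags the sequence as abnormal.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """y_true, y_pred: boolean arrays, True = sequence (classified as) abnormal."""
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 40 evenly spaced thresholds over the range of per-point PDF values.
thresholds = np.linspace(pdf_values.min(), pdf_values.max(), 40)
results = [precision_recall(labels, (pdf_values < t).any(axis=1)) for t in thresholds]
```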

