
MASTER THESIS

Master's Program in Embedded and Intelligent Systems, 120 credits

Interactive Anomaly Detection With Reduced Expert Effort

Lingyun Cheng and Sadhana Sunadresh

Halmstad University, July 7, 2020–version 1.0


Lingyun Cheng and Sadhana Sunadresh: Interactive Anomaly Detection With Reduced Expert Effort, © December 2019

Supervisors: Mohamed-Rafik Bouguelia and Onur Dikmen

Examiner: Slawomir Nowaczyk

Location: Halmstad City

Time frame: December 2019


ABSTRACT

In several applications, when anomalies are detected, human experts have to investigate or verify them one by one. As they investigate, they unwittingly produce a label - true positive (TP) or false positive (FP). In this thesis, we propose two methods (PAD and the clustering-based OMD-Clustering/OJRank-Clustering) that exploit this label feedback to minimize the FP rate and detect more relevant anomalies, while minimizing the expert effort required to investigate them. These methods iteratively suggest the top-1 anomalous instance to a human expert and receive feedback. Before suggesting the next anomaly, the methods re-rank instances so that the top anomalous instances are similar to the TP instances and dissimilar to the FP instances. This is achieved by learning to score anomalies differently in various regions of the feature space (OMD-Clustering) and by learning to score anomalies based on the distance to the real anomalies (PAD). An experimental evaluation on several real-world datasets is conducted. The results show that OMD-Clustering achieves a statistically significant improvement in both detection precision and expert effort compared to state-of-the-art interactive anomaly detection methods. PAD reduces expert effort, but there was no improvement in detection precision compared to state-of-the-art methods. We submitted a paper based on the work presented in this thesis to the ECML/PKDD Workshop on "IoT Stream for Data Driven Predictive Maintenance".


CONTENTS

1 Introduction
2 Literature Review
3 Proposed Method
   3.1 Preliminaries and goals
   3.2 OMD-Clustering (extension of OMD [13])
   3.3 OJRank-Clustering (extension of OJRank [6])
   3.4 PAD (Probability-based anomaly detection)
   3.5 Preprocessing time series data
      3.5.1 Why preprocess the time series data
      3.5.2 Concept of window and overlap size
   3.6 Explanation for the anomalies in time series data
4 Result and Evaluation
   4.1 Datasets
      4.1.1 Time-independent datasets
      4.1.2 Time series datasets
   4.2 Baselines
   4.3 Evaluation metrics
      4.3.1 Precision at the budget
      4.3.2 Expert effort
      4.3.3 Area under the curve (AUC)
      4.3.4 Reason-based effort
   4.4 Results on time-independent datasets
   4.5 Sensitivity analysis
      4.5.1 Sensitivity analysis of PAD
      4.5.2 Sensitivity analysis of OMD-Clustering and OJRank-Clustering
   4.6 Case study
      4.6.1 Preprocessing
      4.6.2 Evaluation results on time series datasets
      4.6.3 Explanation
5 Conclusion
6 Future Work
7 Appendix A: the paper based on this thesis
   7.1 Introduction
   7.2 Related work
   7.3 Proposed method
      7.3.1 Preliminaries and goals
      7.3.2 OMD-Clustering
   7.4 Evaluation
   7.5 Conclusion and future work
Bibliography


LIST OF FIGURES

Figure 1: Data from a real heat-pump system, where the goal is to detect compressor failures. Several anomalies are irrelevant as they are not related to compressor failure. These are just atypical (but reasonable) events. Nevertheless, they appear as abnormal.

Figure 2: Illustration of the process of splitting the space into various regions and the construction of the Z scores matrix.

Figure 3: Simple artificial datasets: toy (with a single anomalous region), and toy2 (with two different anomalous regions).

Figure 4: Evaluation of the precision and expert effort according to various values of the budget, on the two synthetic datasets: toy and toy2.

Figure 5: Evaluation of the precision and expert effort according to various values of the budget, on two real-world datasets: cardiotocography and mammography_sub.

Figure 6: Overall evaluation results: area under the precision and the expert effort curves for all methods and datasets.

Figure 7: PAD achieves better precision as k increases, but starts to degrade after k reaches 40 on 5 datasets. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 8: PAD has stable performance on expert effort upon varying k. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 9: AUC of precision upon varying clustering for OMD-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 10: AUC of effort upon varying clustering for OMD-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 11: AUC of precision upon varying clustering for OJRank-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.


Figure 12: AUC of effort upon varying clustering for OJRank-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 13: Implementation of windows and overlap concepts on heat pump data.

Figure 14: Percentage of anomalies in the heat pump datasets after processing by different overlap windows.

Figure 15: Precision of all methods on heat pump datasets after processing by 12-hour overlap windows.

Figure 16: Expert effort of all methods on heat pump datasets after processing by 12-hour overlap windows.

Figure 17: Reason-based effort of all methods on all window-sliced datasets with 12-hour overlap.

Figure 18: Data from a real heat-pump system, where the goal is to detect compressor failures. Several anomalies are irrelevant as they are not related to compressor failure. These are just atypical (but reasonable) events. Nevertheless, they appear as abnormal.

Figure 19: Illustration of the process of splitting the space into various regions and the construction of the Z scores matrix.

Figure 20: Simple artificial datasets: toy (with a single anomalous region), and toy2 (with two different anomalous regions).

Figure 21: Evaluation of the precision and expert effort according to various values of the budget, on the two synthetic datasets: toy and toy2.

Figure 22: Evaluation of the precision and expert effort according to various values of the budget, on three real-world datasets: cardiotocography, abalone, and mammography.

Figure 23: Overall evaluation results: area under the precision and the expert effort curves for all methods and datasets.


LIST OF TABLES

Table 1: Qualitative comparison of anomaly detection methods on data sets without inter-dependencies.

Table 2: Qualitative comparison of anomaly detection methods on data sets with inter-dependencies.

Table 3: Details of datasets.

Table 4: Details of heat pump datasets.

Table 5: New features of heat pump datasets after processing by window sliding.

Table 6: Summaries of anomalies after preprocessing heat pump datasets with different overlap sizes.

Table 7: Details of the benchmark datasets.


1 INTRODUCTION

Anomaly detection allows us to find instances that deviate significantly from the majority of the data, indicating e.g. a system fault. Usual unsupervised anomaly detection methods are purely data-driven and do not benefit from valuable expert knowledge. However, many of the anomalies that real-world data exhibit are irrelevant to the user, as they represent atypical but normal events. For example, as illustrated in Fig. 1, in domestic hot water heat-pump systems, the water reaches abnormally high temperatures once in a while to kill potential Legionella bacteria; this is an atypical but normal event. Moreover, anomalies are often subjective and depend on the application purpose and on what the user considers abnormal.

Figure 1: Data from a real heat-pump system, where the goal is to detect compressor failures. Several anomalies are irrelevant as they are not related to compressor failure. These are just atypical (but reasonable) events. Nevertheless, they appear as abnormal.

In order to distinguish between relevant and irrelevant anomalies, this thesis proposes an interactive anomaly detection algorithm that proactively communicates with an expert user to leverage her/his feedback and learn to suggest more relevant anomalies. The objective here is two-fold: (i) maximizing the precision on the instances verified by the expert (i.e., ideally, only relevant anomalies are presented to


the user), and (ii) minimizing the effort spent by the expert to verify these instances.

Recently, several methods such as AAD [22], OMD [13], and OJRank [6] have been proposed to incorporate user feedback into anomaly detectors to achieve objective (i). All these methods learn to combine anomaly scores from members of an ensemble of anomaly detectors (e.g., trees in an Isolation Forest [4]). The method proposed in this thesis differs from the existing interactive anomaly detection methods in two ways. First, instead of only considering an ensemble of anomaly detectors, the proposed method aims to learn regions of the feature space where relevant anomalies are present. Second, existing methods focus on achieving a good precision (objective (i)) with less attention to minimizing the expert effort required to investigate anomalies (objective (ii)). The energy and time of the expert user are often limited, and every opportunity to interact with her/him should be fully utilized. Our proposed methods aim to achieve both objectives by learning to score anomalies differently in various regions of the feature space.

The remainder of this thesis is organized as follows. In Chapter 2, we present the literature review and discuss the similarities and differences between the proposed methods and existing ones. In Chapter 3, we formalize our goals and describe our proposed methods. In Chapter 4, we present the experimental evaluation, where the proposed methods are compared against state-of-the-art methods on several real-world datasets. In Chapter 5 and Chapter 6, we conclude and discuss future work, respectively. Chapter 7 introduces a paper based on the work presented in this thesis, which has been submitted to the ECML/PKDD Workshop on "IoT Stream for Data Driven Predictive Maintenance".

2 LITERATURE REVIEW

Since the work proposed in this report involves interactions with a human expert, it is related to, but different from, active learning (which selects informative instances to be labeled, according to a query strategy). It is also closely related to interactive learning methods designed to learn from the top-1 feedback (which selects the top instance to be labeled, without a query strategy). In this section, we discuss how our work relates to and differs from state-of-the-art methods in these categories.


Active learning methods

Usual active learning (AL) techniques such as [3, 8, 12, 10, 24] aim to minimize the labeling cost required to train a high-performance classification model. This is achieved by sampling a small number of informative instances (according to a query strategy) which are presented to an expert for labeling. We refer the reader to [20] for a survey of AL strategies for classification. AL techniques have also been used for anomaly and novelty detection in [15, 16, 17]. In all these AL methods, the goal is to minimize the final error of the model (after querying ends) on new unseen instances. This is in contrast to our proposed method, which aims to minimize the number of irrelevant anomalies presented to the expert during querying (i.e., while she/he is investigating them). In this case, each query is about the most anomalous yet-unlabeled instance.

In [5], a method for detecting errors in insurance claims was proposed. The method aims at reducing the expert effort required to verify the insurance claim errors. It assumes that a classifier has been trained to predict errors in claims and uses it to score new unlabelled claims. The top-scoring claims are then clustered, and the clusters are ranked. Insurance claims from the top cluster are presented to the expert for investigation. Presenting instances from the same cluster avoids switching between contexts and therefore reduces the expert effort. The method moves to the next cluster when the precision for the current cluster falls below a threshold. However, this method does not update the model based on user feedback. In contrast, our proposed method incorporates each user feedback so that the next suggested anomaly is more likely to be relevant.

Interactive learning based on the top-1 feedback

In contrast to the above-mentioned methods, there have been a limited number of interactive anomaly detection methods based on the top-1 feedback. Here the goal is to maximize the number of true/relevant anomalies presented to the expert user. These methods can be summarized according to the general process described in Algorithm 1. The method we propose in this thesis falls under this category.

In [22], a method called AAD (Active Anomaly Detection) was proposed to maximize the number of correct anomalies presented to the user. At each iteration, the instance having the highest anomaly score is presented to the expert, and label feedback is obtained. The set of all instances labeled so far is used to solve an optimization problem that combines anomaly scores from an ensemble of anomaly detectors. AAD initially used an ensemble of one-dimensional histogram density estimators (LODA) [18], and was extended later in [2] to use a tree-based ensemble such as Isolation Forest [4]. Some drawbacks


Algorithm 1: Interactive Anomaly Detection from the Top-1 Feedback

Input: raw dataset, budget of b queries
Initialize an anomaly detection model h
for t ← 1 to b do
    Rank instances in descending order of their anomaly scores
    Present the top-1 (most anomalous) instance x to the expert
    Get a feedback label y ∈ {+1, −1} (i.e., TP or FP)
    Update the model h based on (x, y) (or on all instances labelled so far)
end

of AAD are: (i) it does not care about minimizing the expert effort, and (ii) as the number of labeled instances grows with each new feedback, the optimization problem takes more and more time to solve (i.e., it is not updated in an online fashion). Later, the authors of AAD suggested a method referred to as FSSN (feature space suppression network) [21]. The method uses an ensemble of global anomaly detectors and learns their local relevance to specific data instances, using a neural network trained on all labeled instances.

FSSN improves the precision over AAD, but suffers from the same drawbacks as AAD.

Most recently, two methods, OMD [13] and OJRank [6], have been proposed to learn to score anomalies from the top-1 feedback. Both learn to combine scores of an ensemble of anomaly detectors and optimize a loss function in an online fashion based on each received feedback. In OMD, a convex point-wise loss function is optimized using online mirror descent, while in OJRank, a pair-wise loss function is optimized using stochastic gradient descent. Both OMD and OJRank aim to maximize the number of correct anomalies presented to the expert. However, only OJRank emphasizes minimizing expert effort. Nonetheless, OMD depends on far fewer hyper-parameters than OJRank (i.e., only one hyper-parameter representing the learning rate). This is an interesting property, as there is usually no way to fine-tune hyper-parameter values when labeled data is scarce.

The state-of-the-art methods mentioned above are compared with respect to several properties, as shown in Table 1.

Based on these observations, we propose extensions of OMD (called OMD-Clustering) and OJRank (called OJRank-Clustering), which split the feature space into various regions (using several clusterings) and learn (online, from each feedback) to score anomalies differently in these different regions of the feature space. We show that the proposed extensions improve the precision (i.e., detect more relevant anomalies within a budget) and significantly reduce the effort spent


Property            AL[3]  RCD[17]  MLT[5]  GLAD[21]  AAD[22]  OMD[13]  OJRank[6]
Ensemble              X       X       X        X         X        X         X
Top-1 Feedback        X       X       X        X         X        X         X
Pre-classified        X       X       X        X         X        X         X
Expert Effort         X       X       X        X         X        X         X
Online                X       X       X        X         X        X         X

Table 1: Qualitative comparison of anomaly detection methods on data sets without inter-dependencies. Ensemble: collection of homogeneous/heterogeneous anomaly detection algorithms. Top-1 Feedback: the most anomalous instance is presented to the user. Pre-classified: a classifier trained on a labelled dataset and tested on an unlabeled dataset. Expert Effort: similar anomalous instances are presented. Online: the weight update happens on the presented instance rather than the whole dataset.

by the expert to verify these instances. Since these two methods rely on anomaly scores rather than the original features, they might ignore information contained in the original space. We therefore propose a method (called PAD) that classifies the data points in the original feature space and learns to predict, for each instance, the probability that it is an anomaly. We applied this method to the benchmark datasets, and the results show that PAD minimizes the expert effort significantly, but the precision does not improve.

Since our case study is time-series data collected from heat pumps, inter-dependencies between instances may affect anomaly detection.

Therefore, we review some related work on performing anomaly detection on interdependent data sets. This type of data can be captured as an attributed network, in which the goal is to find nodes whose patterns or behavior differ significantly from most reference nodes, such as small


groups, nodes with unusual neighborhoods, or unusual subgraphs.

The general algorithm is shown below:

Algorithm 2: Anomaly detection on attributed networks

Input: an attributed network G, a budget of T queries
Result: set of anomalous nodes
Initialization: group nodes using clustering methods
for t ← 1 to T do
    for each node do
        Observe the feature vectors of this node and its neighbors
    end
    Present one suspicious node (along with the side information)
    Query the human expert to identify whether it is anomalous
    Update the anomaly selection strategy based on the feedback
    Update the set of detected anomalous nodes
end

Property                  LinUCB[11]  LinTS[1]  Radar[7]  GraphUCB[9]
Multi-armed bandit[11]        X           X         X          X
Linear                        X           X         X          X
Contextual                    X           X         X          X
Attribute information         X           X         X          X

Table 2: Qualitative comparison of anomaly detection methods on data sets with inter-dependencies. Multi-armed bandit: makes a trade-off between obtaining new knowledge and using that knowledge to improve the learning performance. Linear: the classifier is a linear model. Contextual: side information is considered. Attribute information: the attribute information of nodes is modelled.

We compare GraphUCB with related methods in terms of the properties defined in Table 2.

Graph-based: GraphUCB [9] aims to maximize the number of true anomalies presented to the human expert within a given budget.

GraphUCB is a contextual bandit [23] based framework, extending conventional multi-armed bandit approaches such as LinUCB [11] and LinTS [1], as well as the state-of-the-art anomaly detection method on attributed networks, Radar [7]. It applies clustering methods to generate clusters and takes each cluster as an arm. The abnormality of a node is computed by leveraging both the nodal attributes and the node dependencies, so it addresses the exploration-exploitation dilemma. The node selection strategy is updated with the human expert's feedback over time, which enhances the detection performance. However, GraphUCB treats all links equally, so it cannot fully capture the


node dependencies and it measures the node abnormality based on its local neighborhood structure.

3 PROPOSED METHOD

3.1 Preliminaries and goals

We are given an unlabeled dataset X ∈ R^{n×d} of n instances in a d-dimensional space, as well as an unsupervised anomaly detection model A that scores instances according to their abnormality, i.e., A(X) = {s_1, s_2, ..., s_n}. The proposed method can use any base anomaly detection model A; however, to be consistent with the methods presented in Chapter 2, Isolation Forest [4] is used as the base model in this report.
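For concreteness, the following is a minimal Python sketch of how such base anomaly scores could be obtained, assuming scikit-learn's IsolationForest; the function name and parameter values are illustrative and not taken from the thesis code.

import numpy as np
from sklearn.ensemble import IsolationForest

def base_anomaly_scores(X: np.ndarray, random_state: int = 0) -> np.ndarray:
    """Return scores s_1..s_n where larger means more anomalous."""
    model = IsolationForest(n_estimators=100, random_state=random_state).fit(X)
    # score_samples returns higher values for normal points, so negate it
    # to obtain an abnormality score consistent with the text.
    return -model.score_samples(X)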

We consider a limited budget of b, which is the number of instances the expert can verify (i.e., the total number of feedbacks). For example, this could correspond to the number of faulty systems that experts can diagnose within a period or the number of potentially erroneous invoices that could be manually analyzed within a day of work.

Following the general process presented in Algorithm 1, the proposed method proceeds iteratively. At each iteration 1...b, instances are scored, the instance with the highest anomaly score is presented to the expert for verification, expert feedback is obtained, and the model is updated based on the obtained feedback. The goal here is to update the model such that:

1. The precision at the given budget b is maximized: this corresponds to the total number of genuinely anomalous instances (i.e., relevant anomalies) verified by the expert within the budget b. Ideally, we want the expert to verify only relevant anomalies. This metric is defined as:

\[
\text{precision@}b = \frac{TP_b}{a}, \tag{1}
\]

where TP_b is the number of true positives (relevant anomalies) among the b verified instances, and a is the total number of true anomalies in the dataset (a constant).

2. The overall expert effort is minimized: this corresponds to the cost or effort that the expert spends (due to switching context) when verifying consecutive instances. If consecutive instances presented to the expert are very different (resp. similar), she/he would switch context more often (resp. less often). We use the


same definition of expert effort as in [6]; it is the sum of distances between consecutive instances presented to the expert:

\[
\text{expert effort} = \sum_{t=1}^{b-1}\left(1 - \operatorname{cosim}(x_t, x_{t+1})\right), \tag{2}
\]

where cosim(·, ·) denotes the cosine similarity and x_t denotes the instance presented to the expert at the t-th iteration.

3.2 OMD-Clustering (extension of OMD [13])

As shown in Fig. 1, instances that are anomalous for the same reason (e.g., compressor failure) are usually located in a similar region of the feature space. Therefore, instead of learning to combine anomaly scores given by members of an ensemble (i.e., trees of the Isolation Forest) as in [22, 21, 13, 6], the proposed method learns to score anomalies differently in different regions of the feature space. To do this, we first split the feature space into diverse and potentially overlapping regions, as illustrated in Fig. 2(a). Such regions are obtained by applying clustering several times with various numbers of clusters, various combinations of features, and various initial clustering conditions. A total of m overlapping clusters resulting from the different clusterings are obtained.

Next, a sparse scores matrix Z ∈ R^{n×m} is defined, where the n rows correspond to the instances and the m columns correspond to the clusters (i.e., regions). Each entry Z_{i,j} is set to s_i if x_i belongs to cluster c_j, and to 0 otherwise, i.e.,

\[
Z_{i,j} = \begin{cases} s_i & \text{if } x_i \in c_j \\ 0 & \text{otherwise} \end{cases} \tag{3}
\]

where x_i is the i-th instance in the dataset X, s_i is the score assigned by the anomaly detection model A to instance x_i (i.e., s_i = A(x_i)), and c_j is the j-th cluster. The process of clustering and the construction of the Z scores matrix are described in Algorithm 3. Note that the Z matrix only needs to be constructed once at the beginning (i.e., before starting the interactions with the expert).

Now that Z ∈ R^{n×m} is defined, the remainder of this section explains how to incorporate the expert feedback at each iteration of the interaction loop.

Let \vec{1} be a vector of m ones. It is worth noting that \frac{1}{m} Z \vec{1} \in \mathbb{R}^n corresponds exactly to the original anomaly scores obtained with the unsupervised anomaly detector, A(X) = {s_1, s_2, ..., s_n}. When \frac{1}{m} Z \vec{1} is used to compute the final scores, the same weight (i.e., 1/m) is applied to all m regions of the feature space. Instead, we propose to assign different weights to the m regions and to define the final anomaly scores as a weighted sum of the region scores, as given in eq. 4 below.


Figure 2: Illustration of the process of splitting the space into various regions and the construction of the Z scores matrix.

Algorithm 3: Construction of the Z scores matrix

Input: dataset X, anomaly scores A(X) = {s_1, ..., s_n}, number of clusterings C to perform
m ← 0   // total number of clusters
for i ← 1 to C do
    Pick a random subset of p features and a random number of clusters k
    Perform k-means clustering on the subset of X induced by the p features
    m ← m + k
end
Construct a sparse scores matrix Z ∈ R^{n×m} according to eq. 3
return Z

That is, we replace the sum \frac{1}{m} Z \vec{1} with

\[
\text{scores} = \frac{1}{m} Z w \in \mathbb{R}^n, \tag{4}
\]

and we learn w ∈ R^m based on subsequent expert feedback. This formulation of the anomaly scores allows us to assign lower weights


to the regions of the feature space containing nominal instances or irrelevant anomalies, and higher weights to the regions containing true/relevant anomalies. As a result, successive anomalies presented to the expert will more likely be from the same truly anomalous region, hence minimizing the expert effort and increasing the precision at the given budget.

In order to learn w, we use the same optimization procedure (online mirror descent) and minimize the same simple linear loss function as in OMD [13]. Consider that the current iteration is t. Let x_t be the top-1 anomalous instance in X based on the anomaly scores given by the current weights w (according to eq. 4); let y_t ∈ {+1, −1} be the feedback label provided by the expert for instance x_t (with +1 = relevant anomaly and −1 = nominal or irrelevant anomaly); and let z_t be the row vector of Z corresponding to the instance x_t. Then, the (convex) loss function is simply defined as follows:

\[
f(w) = -y_t \, (z_t \cdot w) \tag{5}
\]

Note that when y_t = +1, the function in eq. 5 gives a smaller loss than when y_t = −1, since the instance presented to the expert (scored based on weights w) was a relevant anomaly. Finally, the main interactive anomaly detection algorithm, which minimizes this loss based on each feedback, is provided in Algorithm 4.

Algorithm 4: OMD-Clustering

Input: dataset X ∈ R^{n×d}, model A, budget b, learning rate η
Construct the scores matrix Z ∈ R^{n×m} using Algorithm 3
θ ← \vec{1}   // initialize weights to ones, θ ∈ R^m
for t ← 1 to b do
    w ← argmin_{ŵ ∈ R^m_+} ||ŵ − θ||   // constrain the weights to be positive
    scores ← (1/m) Z w   // scores ∈ R^n, see eq. 4
    Let x_t be the instance with the maximum score in scores
    Get feedback y_t ∈ {+1, −1} for x_t from the expert
    X ← X \ {x_t}
    θ ← θ − η ∂f(w)
end

3.3 OJRank-Clustering (extension of OJRank [6])

Similar to OMD-Clustering, OJRank-Clustering first applies clustering several times to get the sparse score matrix Z, but the learning strategy used to update w is different. OJRank-Clustering uses the same optimization procedure (on-the-job re-ranking) and adopts the same pairwise cross-entropy loss function as OJRank [6]. As


described in [6], OJRank pairs the top-1 instance u with each previously labeled instance ν that has the opposite label and computes the desired probability that u ranks above ν. It then finds the weight vector that minimizes the loss function over all pairs. The gradient update equation is defined as follows:

\[
w_t = w_{t-1} - \alpha \sum_{(u,\nu) \in P} (\hat{p} - p)(z_u - z_\nu) \tag{6}
\]

OJRank-Clustering is the same algorithm as the one presented in [6]; the only difference is that we use our own Z matrix as defined in Algorithm 3. Finally, the main interactive anomaly detection algorithm, which extends OJRank [6], is provided in Algorithm 5.

Algorithm 5: OJRank-Clustering

Input: dataset X ∈ R^{n×d}, model A, budget b, scale factor δ, number of pairs to sample k
Construct the scores matrix Z ∈ R^{n×m} using Algorithm 3
w ← \vec{1}   // initialize weights to ones
for t ← 1 to b do
    scores ← (1/m) Z w   // scores ∈ R^n, see eq. 4
    Let u be the instance with the maximum score in scores
    Get feedback y_u ∈ {+1, −1} for u from the expert
    if y_u = +1 then
        for ν ∈ H_N (set of labeled nominals) do add (u, ν, p̂_hl) to P_H end
        for ν ∈ sample(Z, y_u, (k − |H_N|)^+) do add (u, ν, (1 + δ) p̂) to P_S end
    else
        for ν ∈ H_A (set of labeled anomalies) do add (u, ν, 1 − p̂_hl) to P_H end
        for ν ∈ sample(Z, y_u, (k − |H_A|)^+) do add (u, ν, (1 − δ) p̂) to P_S end
    end
    Update w ← w_t according to eq. 6 over the pairs in P_H ∪ P_S, with default learning rate α = 0.1
end
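The pairwise step in eq. 6 can be sketched as follows. Modelling the predicted probability p̂ that u ranks above ν with a sigmoid over the score difference is a RankNet-style assumption; the exact target probabilities are those of OJRank [6] and are only approximated here by a single target_p argument.

import numpy as np

def pairwise_update(w, z_u, opposite_label_rows, target_p, alpha=0.1):
    """One gradient step of eq. 6 over the pairs (u, v), v having the opposite label."""
    grad = np.zeros_like(w)
    for z_v in opposite_label_rows:
        p_hat = 1.0 / (1.0 + np.exp(-(z_u - z_v) @ w))  # predicted P(u ranks above v)
        grad += (p_hat - target_p) * (z_u - z_v)
    return w - alpha * grad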


3.4 PAD (Probability-based anomaly detection)

PAD was proposed with the motivation to minimize expert effort by presenting an instance that is similar to the previously queried instance. Here the similarity is measured by the Minkowski distance.

We assume that data points that are close to each other are highly likely to share the same label (anomalous or not).

Since the anomaly score (S) used in this method is the prediction probability obtained from a random forest classifier, rather than the ensemble scores (Z), the method is named probability-based anomaly detection.

Algorithm 6: PAD

Input: dataset X ∈ R^{n×d}; ensemble (iForest) scores Z ∈ R^{n×1}; pseudo labels L ∈ {−1, 1}; number of neighbours k; a budget of T queries
Initialization: L_i = 1 if Z_i is in the top 10%, otherwise L_i = −1; anomaly scores S ∈ R^{n×1}, S = 0
for t ← 1 to T do
    Train a classifier: clf ← RandomForest(X, L)
    S ← predicted probability of clf on X for the class 'anomaly'
    X_u ← argmax(S), the top instance
    Compute distances to find the k nearest neighbours of X_u
    y_u ← true label for X_u after querying the human expert
    for x_i ∈ k nearest neighbours of X_u do
        if y_u = −1 and s_i is in the lowest 10% then L_i ← −1
        else if y_u = +1 and s_i is in the top 10% then L_i ← +1
    end
end

Algorithm 6 presents PAD. First, an unsupervised anomaly detection method is employed to get the anomaly scores. Based on these scores, data points are divided into two classes, normal or abnormal (i.e., each instance gets a pseudo label). Then a classifier (RandomForest) is trained on the original dataset along with the pseudo labels. Using this classification model, we predict the confidence of each instance being an anomaly, and this probability can be seen as an anomaly score. In each iteration, we present the instance having the highest anomaly score to the expert. After receiving the label feedback, we update the pseudo labels of the k nearest neighbors of the selected instance and retrain the classifier using the updated pseudo labels to get new anomaly scores. A threshold on the anomaly score is set to define whether a data point is normal or abnormal. Here we set the threshold at 10%


("normal"), then we only update the pseudo label of the neighbors having a high (low) anomaly score. PAD can actually minimize the expert effort (as shown in Section 4.5), but it is sensitive to the choice of k and the threshold.

3.5 Preprocessing time series data

3.5.1 Why preprocess the time series data

In this section, we explain why it is essential to preprocess the time series datasets. As discussed earlier, a point dataset is one in which the data do not depend on the time at which they were obtained. In other datasets, the time aspect plays an essential role in deciding whether a given point is normal or anomalous; these are grouped under the category of time series data. We highlight the necessity of preprocessing time series data with the following simple example.

Let us consider 15, 8, -3, 11 as four temperature records of Sweden in °C, corresponding to summer, autumn, winter, and spring, respectively. If we consider these records individually, -3 °C seems to be an anomaly. However, if we analyze these records with the perspective of time, i.e., summer, autumn, winter, and spring, the evaluation is fair and the analysis effort is reduced. This example shows that the data should be represented as a collection rather than as individual points.

There are different ways of presenting the data as a collection, but we have adopted the following preprocessing. We use the concept of windows with an overlap, explained in the following section, to re-sample the time series and replace the original signal by the mean and standard deviation values of each window. Since a single data point in a time series dataset does not carry much information, presenting the data as a collection of data points over a specific interval of time helps to determine whether an anomaly happens in this period. We also use overlapping when processing the signal in order not to lose information between windows.

3.5.2 Concept of Window and Overlap size

Here we explain the concept of window frame and overlap size with a simple example. From the section above, we have seen that data instances cannot be presented individually to the expert, as that does not convey the right information. On the other hand, we cannot present the whole time series to the expert, as it represents too much data and vital information may be missed during analysis. The window size represents the amount of data to be presented to the expert, and the overlap size indicates the number of repeated data points from the previous window frame. It is necessary to have overlaps with the previous window frame, as they serve as connecting dots to picture the whole scenario. The example below explains the concept of window frame and overlap size in detail.

Let us consider our example time series dataset to be [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], taken at ten periodic intervals of time. As we have seen, all these points cannot be presented at once. So we choose our window frame size to be 5 (we present 5 data points at a time) with overlap size 3 (three data points are repeated from the previous window frame).

Dataset: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Our window size is 5

Window frame1: [1, 2, 3, 4, 5]

Since we have taken overlap size as three, only two new elements will be added to the new window frame compared to the previous window frame.

Window frame2: [3, 4, 5, 6, 7]

Window frame3: [5, 6, 7, 8, 9]

Window frame4: [7, 8, 9, 10]
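A minimal Python sketch of this resampling, computing the per-window mean and standard deviation used later as features (Table 5); window and overlap sizes are given in number of samples, and the function name is illustrative.

import numpy as np

def sliding_windows(series, window_size=24, overlap=12):
    """Return an array of (mean, std) features, one row per overlapping window."""
    step = window_size - overlap
    features = []
    for start in range(0, len(series) - 1, step):
        frame = series[start:start + window_size]
        features.append((np.mean(frame), np.std(frame)))
        if start + window_size >= len(series):
            break
    return np.array(features)

# With window_size=5 and overlap=3 on [1..10], the frames are
# [1..5], [3..7], [5..9], [7..10], as in the worked example above.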

3.6 Explanation for the anomalies in time series data

The anomalies in the heat pump datasets can have many causes, and we only want to detect the anomalies specific to one of them. We therefore propose the algorithm 'FindReason' to explain which features cause each anomaly. Here, a reason for an anomaly is defined as the feature or set of features that causes the anomaly by being higher or lower than expected.

As shown in Algorithm 7, each feature is normalized to [0, 1] and then aggregated into segments based on its periodic trend. We then calculate the mean value of each segment and compute the difference between a data point and the mean of the segment it belongs to. The feature with the largest absolute difference is the main reason for the label of this data point. If the differences of other features exceed half of the ranges of those features, they are secondary reasons.

Algorithm 7: FindReason

Input: dataset X ∈ R^{n×d}; trend parameter θ; aggregation segment size m
for i ← 1 to d do
    Check the trend and seasonality of feature X_i using the Dickey-Fuller test
    Eliminate the trend using an exponentially weighted moving average:
        trend: T_i[t] ← (1 − θ) · X_i[t] + θ · T_i[t−1]
        detrended series: EX_i ← X_i[t] − T_i[t]
    Normalize the detrended series: NX_i ← norm(EX_i)
    if NX_i has a periodic trend then aggregate using piecewise aggregate approximation:
        t0 ← time tick since the last aggregated point; t ← current time tick
        AX_i ← mean(NX_i[t0:t]) if t − t0 ≥ m
    else (NX_i is not periodic):
        AX_i ← mean(NX_i)
    Difference: DX_i[t] ← NX_i[t] − AX_i[t]
    The main reason for label y[t] is the feature i with the largest |DX_i[t]|;
    DX_i[t] < 0 means 'low' and DX_i[t] > 0 means 'high'
    Set the threshold MDX_i as the mean of DX_i;
    other features above their MDX_i can be secondary causes
end
Output: the feature or set of features that are lower/higher than expected

The FindReason algorithm cannot be applied to time-independent datasets, since it aggregates instances based on the time dependency between them. If we could identify contextual attributes and improve the algorithm by aggregating instances along with this contextual information, we could explain the reasons for anomalies in any type of dataset.
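A simplified, hedged sketch of the FindReason idea follows; it keeps only the EWMA detrending, normalization, and segment-mean comparison, and omits the Dickey-Fuller test and the periodicity handling of Algorithm 7. Parameter values and names are illustrative.

import numpy as np
import pandas as pd

def find_reason(X: pd.DataFrame, t: int, theta: float = 0.9, segment: int = 24):
    """Return the main 'High <feature>' / 'Low <feature>' reason for the point at index t."""
    deviations = {}
    for col in X.columns:
        x = X[col].to_numpy(dtype=float)
        trend = pd.Series(x).ewm(alpha=1 - theta).mean().to_numpy()   # EWMA trend
        detrended = x - trend
        rng = detrended.max() - detrended.min()
        normalised = (detrended - detrended.min()) / rng if rng > 0 else detrended
        seg_start = (t // segment) * segment                          # segment the point belongs to
        seg_mean = normalised[seg_start:seg_start + segment].mean()
        deviations[col] = normalised[t] - seg_mean
    main = max(deviations, key=lambda c: abs(deviations[c]))
    return [("High " if deviations[main] > 0 else "Low ") + main]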

4 RESULT AND EVALUATION

In this chapter, we assess the performance of the proposed methods OMD-Clustering, OJRank-Clustering, and PAD on several real-world datasets. We compare the proposed methods to two state-of-the-art interactive anomaly detection methods, OJRank [6] and OMD [13], as well as to an unsupervised anomaly detector (Isolation Forest [4]) as a reference. The evaluation is done on a set of 13 datasets with available ground truth (true labels consisting of "nominal" and "anomaly"). The


performance of the methods is evaluated in terms of three metrics and presented in figures.

4.1 Datasets

4.1.1 Time-independent Dataset

We evaluate performance over a set of 13 real-world datasets with ground truth, where ground truth refers to the true labels. These benchmark datasets are publicly available and have been widely used in the papers mentioned in the literature review. Of the 13 datasets, 11 were taken from the UCI repository [19], and the two datasets toy and toy2 were synthetically generated.

Our implementation of Isolation Forest requires an n×n kernel matrix, so subsets of the larger datasets such as Mammography, shuttle, KDD cup, and covtype have been used. All the mentioned datasets have two classes, nominal and anomaly, and the latter comprises approximately 2% of the whole dataset. The subset of each larger dataset was obtained by sub-sampling 2000 data instances while keeping the ratio of anomalies to nominals the same as in the original dataset. These datasets were preferred as benchmarks because they were widely used in previous work. Another reason to prefer them is that the ratio of anomalies to nominals is almost the same across them, which makes performance comparison easier.
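A minimal sketch of this stratified sub-sampling, assuming labels encoded as 1 for anomalies and 0 for nominals (an illustrative encoding, not the thesis code):

import numpy as np

def stratified_subsample(X, y, n_total=2000, seed=0):
    """Draw n_total instances while preserving the anomaly-to-nominal ratio."""
    rng = np.random.default_rng(seed)
    idx_anom = np.flatnonzero(y == 1)
    idx_nom = np.flatnonzero(y == 0)
    n_anom = int(round(n_total * len(idx_anom) / len(y)))
    keep = np.concatenate([
        rng.choice(idx_anom, size=n_anom, replace=False),
        rng.choice(idx_nom, size=n_total - n_anom, replace=False),
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]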

Details of the datasets are summarized in Table 3.

Name Size Feature Nbr. of Anom. Anom. %

Abalone 1920 9 29 1.51

ann_thyroid_1v3 3251 21 73 2.25

cardiotocography 1700 21 45 2.65

covtype_sub 2000 54 19 0.95

kddcup_sub 2000 91 77 3.85

Mammography 11183 6 260 2.32

Mammography_sub 2000 6 46 2.30

shuttle 12345 9 867 7.02

shuttle_sub 2000 9 140 7.00

toy 485 2 20 4.12

toy2 485 2 35 7.22

weather 13117 8 656 5.00

yeast 1191 8 55 4.62

Table 3: Details of Datasets


4.1.2 Time Series Dataset

Time series datasets are those in which the time aspect is involved; they are generated by collecting data at different points in time. In this project we use a set of 6 time series datasets collected from more than 200 heat pumps of various types and configurations. Each heat pump sent signals with a timestamp, and the raw data was sampled at irregular intervals. The time period during which a compressor failure occurs is considered anomalous. The heat pump datasets we obtained are the averaged signals from a 24-hour window slid by one hour, along with anomaly labels. Details of the heat pump datasets are given in Table 4. The four attributes of the six heat pump datasets are 'compressor temperature', 'additional heat', 'hot water temperature', and 'outside temperature'. These heat pump datasets have to be evaluated differently from the time-independent datasets.

Name         Dataset Duration              Anomaly Duration
Heatpump-1   3 years, 5 months, 16 days    5 days
Heatpump-2   1 year, 11 months, 1 day      15 days
Heatpump-3   4 months, 27 days             4 days
Heatpump-4   9 months, 25 days             9 days
Heatpump-5   2 years, 3 months, 20 days    4 days
Heatpump-6   5 months, 12 days             5 days

Table 4: Details of heatpump datasets

From Table 4, we can see that anomalous behavior is observed for a relatively shorter duration compared to the total duration of the dataset.

4.2 Baselines

We compare the performance of our methods (PAD, OMD-Clustering, and OJRank-Clustering) with the baselines OMD, OJRank, and Unsupervised, which are described in the literature review. All the methods re-rank anomalies online from top-1 feedback, as discussed in the literature review.

4.3 Evaluation metrics

The results obtained with the proposed methods are compared with respect to precision and expert effort at a given budget, and the area under the curve (AUC).


For time series datasets, we also define one more evaluation metric, named 'reason-based effort'.

4.3.1 Precision at the budget

The precision at a given budget (precision@b) is described in eq. 1. We produce curves that show how the precision of the different methods changes for various values of the budget b.

4.3.2 Expert Effort

One of our aims is to minimize expert effort. To achieve this, subsequent queries have to be similar, so that the human expert analyzing the anomalous instances need not invest too much time in each presented anomaly. To evaluate whether subsequent queries presented to the expert are similar, the cosine similarity measure is employed. Cosine similarity only captures the similarity between the presented instances; similar instances need not always have the same reason for being considered anomalous. To evaluate this, a reason-based metric is also proposed in Section 4.3.4.

The expert effort is described in eq. 2. Here, we also produce curves to show how the expert effort changes for various values of the budget b.

4.3.3 Area under the curve (AUC)

Since the precision and expert effort curves change with the budget, comparing model performance based on these two curves directly is not very effective. We therefore calculate the area under the precision curve and under the expert effort curve versus the used budget as additional measures (AUC_prec and AUC_effort, respectively).

4.3.4 Reason based Effort

We define a list of strings indicating the features that were higher/lower than expected in the time series datasets as the probable reasons for the anomalies. Since similar queries (i.e., anomalous for similar reasons) presented consecutively to the expert result in less labeling effort, we also evaluate the expert effort based on the reasons.

The new expert-effort metric is calculated as the sum of one minus the Jaccard similarity between successively queried instances, based on their reasons. The Jaccard similarity between two sets is the number of common elements divided by the number of unique elements in the two sets.

Given a sequence of queried instances q_i ∈ Q within the budget b, where i indexes the re-ranking rounds, let q[reason]_i denote the set of strings indicating the features that were higher/lower than expected for instance q_i. Then:

\[
\text{Reason-based Effort} = \sum_{i=2}^{b} \left(1 - \frac{|q[\text{reason}]_{i-1} \cap q[\text{reason}]_i|}{|q[\text{reason}]_{i-1} \cup q[\text{reason}]_i|}\right)
\]
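A minimal Python sketch of this metric, assuming each queried instance comes with its set of reason strings:

def reason_based_effort(reason_sets):
    """reason_sets: list of sets of reason strings, in query order."""
    effort = 0.0
    for prev, cur in zip(reason_sets[:-1], reason_sets[1:]):
        union = prev | cur
        jaccard = len(prev & cur) / len(union) if union else 1.0
        effort += 1.0 - jaccard
    return effort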

4.4 Results on time-independent datasets

To explore how our methods incorporate expert feedback, we plot the precision and effort versus the feedback iterations. The precision and expert effort curves obtained by the six methods (OJRank, OMD, OMD-Clustering, OJRank-Clustering, PAD, and Unsupervised) are shown in Fig. 4(a-d) and Fig. 5(a-d). Fig. 4(a-d) shows the results on the two artificial datasets illustrated in Fig. 3, while Fig. 5(a-d) shows the results on two of the real-world datasets (cardiotocography and mammography_sub).

Figure 3: Simple artificial datasets: toy (with a single anomalous region), and toy2 (with two different anomalous regions).

As expected, we can see from these figures that, in general, all the interactive anomaly detection methods achieve a higher precision at budget than the unsupervised anomaly detection method, which confirms that interacting with the expert helps to find more relevant anomalies. The proposed methods OMD-Clustering and OJRank-Clustering achieve a precision that is higher than or equal to the other interactive methods, while always resulting in a significantly smaller expert effort. This indicates that clustering helps to aggregate similar instances from which relevant anomalies can be more easily detected. Moreover, OMD-Clustering mostly outperforms OJRank-Clustering, possibly because OJRank-Clustering has


Figure 4: Evaluation of the precision and expert effort according to various values of the budget, on the two synthetic datasets: toy and toy2. Panels: (a) toy precision, (b) toy effort, (c) toy2 precision, (d) toy2 effort.

more hyper-parameters, and there is no good way to fine-tune them when labeled data is scarce. PAD achieves a high precision on some datasets but performs poorly on others, which indicates that the classification approach does not adapt well to anomaly detection and that its parameters are sensitive to the data distribution. Even though PAD minimizes the expert effort, we do not use this method to detect anomalies on the heat pump datasets (our case-study data).

The results on all the remaining datasets are summarized more compactly in Fig. 6(a-b). Fig. 6(a) shows bar plots of the area under the precision curves (AUC_prec), and Fig. 6(b) shows bar plots of the area under the expert effort curves (AUC_effort). The same observations can be made from these figures.

Once again, we can see that all the interactive methods outperform the unsupervised one, highlighting the importance of interacting with a human expert. Moreover, the proposed OMD-Clustering method usually achieves a higher precision with a lower effort than the other methods on most of the datasets.

4.5 Sensitivity analysis

This section explains the need for sensitivity analysis and the results of sensitivity analysis performed in our work. By sensitivity analysis,


Figure 5: Evaluation of the precision and expert effort according to various values of the budget, on two real-world datasets. Panels (a, b): cardiotocography; panels (c, d): mammography_sub.

we try to examine whether a model's performance is affected by varying a particular parameter. If the performance of the model changes when the parameter is varied, we say that the model is sensitive to that parameter, and an optimal value is obtained by studying the trend. If the model's performance does not change, we say that the model is insensitive to that parameter, and the smallest possible value of the parameter is preferred.

In our work, we have done a sensitivity analysis to find the optimal parameters for the following:

• Number of clusterings in clustering-based methods

• Number of nearest neighbors in PAD

A sensitivity analysis was also performed on the hyper-parameters of OJRank, which were reported to be sensitive parameters for the OJRank method in the literature review; we studied their effect in our method as well. The plots shown below summarize the results of the various sensitivity analyses performed as part of our work.


(a) Area under the precision curve (AUCprec) for all methods on each dataset.

(b) Area under the expert effort curve (AUCeffort) for all methods on each dataset.

Figure 6: Overall evaluation results: area under the precision and the expert effort curves for all methods and datasets.

4.5.1 Sensitivity Analysis of Method PAD

The proposed method PAD is sensitive to the parameter k (number of neighbors). We try different odd values of k, from 3 up to the square root of the number of data instances in each dataset, to find the optimal k for each dataset.


Figure 7: PAD achieves better precision as k increases, but starts to degrade after k reaches 40 on 5 datasets. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

As shown in Fig. 7 and Fig. 8, we use and recommend k = 39.

4.5.2 Sensitivity Analysis of OMD-Clustering and OJRank-Clustering

OMD-Clustering and OJRank-Clustering are sensitive to the number of clusterings used to obtain the scores matrix Z, as described in Algorithm 3. The hyper-parameter values of OJRank are set to the defaults suggested in the original paper [6]: number of pairs to sample k = 20, scale factor δ = 0.1.

As shown in Fig. 9 and Fig. 11, the precision of OMD-Clustering and OJRank-Clustering increases with the number of clusterings on 5 datasets and remains stable once it is larger than 30.


Figure 8: PAD has stable performance on Expert effort upon varying k. Each line corresponds to one of 13 datasets mentioned in section 4.1

Figure 9: AUC of precision upon varying clustering for OMD-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

From Fig. 10 and Fig. 12, the expert effort of OMD-Clustering and OJRank-Clustering gets lower with an increasing number of clusterings and becomes relatively


Figure 10: AUC of effort upon varying clustering for OMD-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

stable when it is larger than 20. We therefore apply 30 clusterings on the original data space.

4.6 Case study

4.6.1 Preprocessing

The provided heat pump datasets contain data collected over a specific time period at one-hour intervals. We therefore use a fixed window size of 24 hours for all 6 heat pump datasets, as the number of anomalies becomes very small if the window size is reduced further.

A whole window frame is labeled anomalous if it contains an anomaly. Fig. 13 gives a visual representation of the window frame and overlap concept applied to one of the heat pump datasets. Table 5 lists the new set of features obtained after applying this preprocessing.

Different overlap sizes are tested for this 24-hour window. Table 6 summarizes the overlap sizes used, the number of anomalous windows, and the budget for the process (it shows the details of one heat pump dataset, but results are obtained for all heat pumps). The budget for each round is calculated as min(2 × number of anomalies, 100).

From Table 6, it can be seen that the number of anomalies becomes larger as the overlap size increases, and the budget is highly dependent on the number of anomalies.


Figure 11: AUC of precision upon varying clustering for OJRank-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Figure 12: AUC of effort upon varying clustering for OJRank-Clustering. Each line corresponds to one of the 13 datasets mentioned in Section 4.1.

Since overlapping windows generate more anomalous windows, which could make the precision measurement unfair, we checked the percentage of anomalies in the heat pump datasets after processing with different overlap sizes.


Figure 13: Implementation of windows and overlap concepts on heat pump data

Feature name          Feature meaning
Compressor_mean       mean compressor temperature per window
Compressor_std        standard deviation of compressor temperature per window
AdditionalHeat_mean   mean additional heating value per window
AdditionalHeat_std    standard deviation of additional heating value per window
HotWater_mean         mean hot water temperature per window
HotWater_std          standard deviation of hot water temperature per window

Table 5: New features of the heat pump datasets after processing by window sliding

Figure 14: Percentage of anomalies in the heat pump datasets after processing by different overlap windows

As shown in Fig. 14, the overlapping windows did not affect the ratio of anomalous windows to total windows.

The evaluation metric expert effort we used on benchmark datasets is the sum of the cosine distance between consecutive queries.


Overlap Size Nbr. of Anomalous Windows Budget

0 6 12

3 6 12

4 7 14

6 8 16

8 9 18

9 10 20

12 12 24

15 16 32

16 18 36

18 24 48

20 36 72

21 48 96

22 72 100

Table 6: Summaries of anomalies after preprocessing heat pump datasets with different overlap sizes

From Table 6 it can be seen that at a low overlap the budget is around 20, whereas at a high overlap it is around 100. From this we can infer that even if the instances are dissimilar, the expert effort will be small when the budget is small; conversely, even if the instances are similar, the expert effort will be higher when the budget is high.

4.6.2 Evaluation Results on Time series Datasets

Since PAD achieved very low precision on the heat pump datasets, we applied all methods except PAD to the six heat pump datasets.

From Fig. 15 and Fig. 16, we can see that OMD, OJRank, OMD-Clustering, and OJRank-Clustering outperform the unsupervised anomaly detection method, illustrating the significance of interacting with experts. OMD-Clustering and OJRank-Clustering achieve higher precision with lower effort compared to OMD and OJRank, respectively, indicating that clustering aggregates similar instances from which anomalies can be easily detected. Moreover, OMD-Clustering significantly outperforms all baselines, showing its ability to learn from expert feedback on time series datasets.


Figure 15: Precision of all methods on heat pump datasets after processing by 12-hour overlap windows

Figure 16: Expert effort of all methods on heat pump datasets after processing by 12-hour overlap windows

4.6.3 Explanation

This section explains the reasons for the anomalies and how they can be used as a new evaluation metric to verify whether similar relevant anomalies are being presented to the expert.

4.6.3.1 What is a reason

To find the relevant anomalies, we implement the algorithm 'FindReason' to explain which feature(s) cause each anomaly.

After running FindReason, we obtain the reason for each instance, e.g. ['High Compressor'], ['High Compressor', 'High HotWater'], ['Low HotWater', 'Low Compressor'], ['Low AdditionalHeat'], ['High HotWater'], ['Low Compressor', 'Low HotWater'].

4.6.3.2 New expert effort measure based on the reasons for anomalies

Previously, we evaluated the expert effort based on the cosine distance: the smaller the distance between two anomalies, the greater their similarity, which results in a smaller labelling effort. However, on the heat pump datasets, distance similarity cannot ensure that two anomalies are caused by the same reason. We therefore introduce the Jaccard similarity between two anomalies based on their reasons. It is reasonable that similar queries (i.e., anomalous for similar reasons) presented consecutively to the expert result in less labelling effort, as the user stays within the same context. The new expert-effort metric, 'reason-based effort', is defined in Section 4.3.4.

Jaccard similarity is in the range [0, 1], so the reason based effort is also in the range [0, 1]. The new effort measurement can also be applied to other time series datasets.
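For illustration, the sketch below shows how such a reason based effort could be computed from the reason lists of consecutively queried anomalies; averaging the dissimilarities over the query sequence is an assumption of this example, the exact definition being the one given in Section 4.3.4:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of reasons."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def reason_based_effort(reason_lists):
    """Average (1 - Jaccard similarity) over consecutive queried anomalies.

    reason_lists: reason lists in query order, e.g.
    [['High Compressor'], ['High Compressor', 'High HotWater'], ...].
    """
    if len(reason_lists) < 2:
        return 0.0
    dissim = [1.0 - jaccard(reason_lists[i], reason_lists[i + 1])
              for i in range(len(reason_lists) - 1)]
    return sum(dissim) / len(dissim)
```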

We then show the reason based effort of each model on all window-sliced heat pump datasets with a 12-hour overlap.

Figure 17: Reason based effort of all methods on all window-sliced datasets with 12-hour overlap.

From Fig. 17, we can see that the OMD-Clustering and OJRank-Clustering methods achieve lower reason based effort than the original OMD and OJRank, respectively, with OJRank-Clustering performing best. Combined with Fig. 15 and Fig. 16, this shows that OJRank-Clustering achieves the highest precision while also reducing the reason based effort on the heat pump datasets processed with 12-hour overlap windows.


5 Conclusion

In this work, we developed interactive anomaly detection methods in which a human expert can provide feedback while verifying/investigating anomalies. The proposed methods OMD-Clustering and OJRank-Clustering (extensions of OMD and OJRank) incorporate each feedback in an online fashion and learn to assign weights to various regions of the feature space, while the proposed method PAD learns to score anomalies based on their distance to the real anomalies. OMD-Clustering and OJRank-Clustering give a higher weight to regions of the feature space containing relevant anomalies, so that these regions contribute more to the final anomaly score than regions where irrelevant anomalies are present. PAD minimizes the expert effort by presenting instances that are more similar to real anomalies. The proposed methods were evaluated on various publicly available real-world datasets (point datasets) as well as on different heat pump datasets (time-series datasets) and compared to state-of-the-art interactive anomaly detection methods. The results show that the proposed clustering-based methods are more precise at detecting relevant anomalies within a budget, while PAD significantly reduces the expert effort on the publicly available point datasets. An effort was made to combine the clustering-based methods and PAD; however, this did not give good results, as PAD is very sensitive to the number of nearest neighbors on time-series datasets. We extracted features over time windows with various overlaps in order to achieve both good precision and reduced expert effort on time-series data. This approach gave better performance compared to state-of-the-art methods, along with presenting more relevant anomalies to the user on time-series datasets. We submitted a paper based on the work presented in this thesis to the ECML/PKDD Workshop on "IoT Stream for Data Driven Predictive Maintenance". The full paper is shown in Chapter 7.

6 Future Work

As part of our future work, we would like to work on the following directions:

1. Introducing a parameter α that balances precision and expert effort. Precision and expert effort are the two evaluation metrics considered, and it is quite challenging to achieve both high precision and low expert effort at the same time. A parameter α, provided as input by the user (expert) as a value in the range zero to one, would decide which of the two metrics is foremost depending on the situation.

2. Identifying a subset of features that contribute to anomalies. In most cases, real-world datasets can be seen as a combination of attributes that directly (indicative) and indirectly (contextual) contribute to anomalous instances, which is subjective. A recent paper proposed a method, CONOUT [14], to demarcate data attributes into contextual and indicative attributes, but the setup in that paper was not interactive. We will adopt the idea of CONOUT to identify the right subset of indicative and contextual attributes, which can help in taking preventive measures. This idea can be incorporated in both time series and time-independent datasets to find the reasons for anomalous instances based on contextual features.


7 Appendix A: The Paper Based on This Thesis

Interactive Anomaly Detection Based on Clustering and Online Mirror Descent

Lingyun Cheng, Sadhana Sundaresh, Mohamed-Rafik Bouguelia, Onur Dikmen

July 7, 2020

Abstract. In several applications, when anomalies are detected, human experts have to investigate or verify them one by one. As they investigate, they unwittingly produce a label - true positive (TP) or false positive (FP). In this paper, we propose a method (called OMD-Clustering) that exploits this label feedback to minimize the FP rate and detect more relevant anomalies, while minimizing the expert effort required to investigate them. The OMD-Clustering method iteratively suggests the top-1 anomalous instance to a human expert and receives feedback. Before suggesting the next anomaly, the method re-ranks instances so that the top anomalous instances are similar to the TP instances and dissimilar to the FP instances. This is achieved by learning to score anomalies differently in various regions of the feature space. An experimental evaluation on several real-world datasets is conducted. The results show that OMD-Clustering achieves statistically significant improvement in both detection precision and expert effort compared to state-of-the-art interactive anomaly detection methods.

keywords: Interactive Anomaly Detection, Outlier Detection, User Feedback, Expert Effort


7.1 Introduction

Anomaly detection allows us to find instances that deviate significantly from the majority of data, indicating e.g., a system fault. Usual unsupervised anomaly detection methods are purely data-driven and do not benefit from valuable expert knowledge. However, many of the anomalies that real-world data exhibits are irrelevant to the user as they represent atypical but normal events. For example, as illustrated in Fig. 18, in domestic hot water heat-pump systems, the water reaches abnormally high temperatures once in a while to kill potential Legionella bacteria; this is an atypical but normal event. Moreover, anomalies are often subjective and depend on the application purpose and what the user considers as abnormal. For example, an abnormal train delay, which is due to a passenger who blocked the door, is not interesting for a diagnosis purpose. However, it can be interesting for planning purposes.

Figure 18: Data from a real heat-pump system, where the goal is to detect compressor failures. Several anomalies are irrelevant as they are not related to compressor failure. These are just atypical (but reasonable) events. Nevertheless, they appear as abnormal.

In order to distinguish between relevant and irrelevant anomalies, this paper proposes an interactive anomaly detection algorithm that proactively communicates with an expert user to leverage her/his feedback and learn to suggest more relevant anomalies. The objective here is two-fold: (i) maximizing the precision on the instances verified by the expert (i.e., ideally, only relevant anomalies are presented to the expert), and (ii) minimizing the effort required from the expert to investigate these instances.
