
Spatio-temporal outlier detection in streaming

trajectory data

MÁTÉ SZEKÉR

Master’s Thesis at Computer Vision and Active Perception Lab

Supervisors: Carl Henrik Ek, Tove Gustavi


Abstract


Referat (Swedish abstract)

Time-dependent anomaly detection in trajectory data streams


Acknowledgements

I am very grateful to my supervisors Carl Henrik Ek and Tove Gustavi without whom this thesis could not have been accomplished. I would also like to thank my family for all the support even though they were over 5000 km away, 6 timezones behind.


Contents

1 Introduction
  1.1 Anomalies
  1.2 Background and motivation
  1.3 Nature of data
  1.4 Problem description
  1.5 Outlier detection and Classification
  1.6 Contribution
  1.7 Organization

2 Group discovery
  2.1 Association rule mining
    2.1.1 Apriori algorithm
  2.2 Finding groups

3 Background on trajectory analysis
  3.1 Dimensionality reduction
  3.2 Normalization
  3.3 Trajectory dissimilarities
    3.3.1 Dynamic Time Warping
    3.3.2 Hausdorff distance
    3.3.3 Fréchet distance

4 Non-parametric trajectory analysis
  4.1 Introduction to SVM
    4.1.1 Feature mapping
    4.1.2 Linearly non-separable datasets
    4.1.3 Dual formulation of the SVM
    4.1.4 Kernelization
    4.1.5 Multi-Class SVM
  4.2 Trajectory kernels
  4.3 Preliminary results
    4.3.1 Gaussian RBF kernel
  4.4 SVM applied to data streams

5 Parametric trajectory analysis
  5.1 Introduction to Markov Chains
  5.2 Extensible Markov Model
    5.2.1 Updating the Markov Chain
    5.2.2 State clustering
  5.3 Estimating the parameters of the EMM
    5.3.1 Clustering threshold
    5.3.2 Forgetting factor
  5.4 Classification
  5.5 Anomaly detection
  5.6 Preliminary results
  5.7 Database of individuals with same instantaneous behaviour

6 Dataset
  6.1 MIT Realitycommons Badge dataset
  6.2 Proprietary FOI dataset

7 Results
  7.1 Organization
  7.2 Evaluation method
  7.3 Parametric analysis
    7.3.1 Estimation of λ
    7.3.2 Classification
    7.3.3 Anomaly detection
    7.3.4 Detecting anomalous groups
    7.3.5 Group discovery
  7.4 Non-parametric analysis

8 Discussion and Future Work
  8.1 Discussion and Conclusion
    8.1.1 Parametric analysis
    8.1.2 Non-parametric analysis
  8.2 Conclusions
  8.3 Future work
    8.3.1 Parametric analysis
    8.3.2 Non-parametric analysis

Bibliography

Appendices

A Source code
  A.1 SVM preliminary results
  A.2 Algorithm 1 for estimating λ
  A.3 Algorithm 2 for estimating λ


List of Figures

1.1 Typical trajectory of an employee during a sliding window. The axes correspond to the x and y coordinates in the office environment.
1.2 Types of trajectory anomalies.
2.1 Graph of frequent rules containing single itemsets. This graph contains two groups/equivalence classes: (A, B, C) and (D).
3.1 Example of a mapping of points between two curves of different length.
3.2 Two trajectories with a small Hausdorff distance but with a large Fréchet distance. [1]
3.3 Graphical representation of the Fréchet distance calculation.
4.1 A possible way to pre-process data is to change the coordinate system from Cartesian to polar.
4.2 Example of a linear classifier applied to a linearly non-separable dataset in R².
4.3 Different approaches to multi-class SVM.
4.4 Test dataset for distance substitution kernel based SVM classification.
4.5 Classification results using the Fréchet kernel.
4.6 Classification results using the Hausdorff kernel.
4.7 Schematic view of the proposed online SVM model.
5.1 Schematic view of the EMM construction and maintenance.
5.2 Left: Graph of the rate of change of behaviour in a hypothetical dataset. The red line is used as a threshold to identify coherent high-activity periods. Right: During such periods, the value of λ is set inversely proportional to the period length. Outside these periods, λ is set to a small constant value. (Graphs not to scale.)
5.3 Example of a not irreducible Markov Chain. The thickness of the edges represents the relative probabilities of the transitions.
5.4 Test dataset. Notice that both the normative and anomalous behaviour change with time. The colors indicate the type of intervals.
5.6 Results using C = 3, occurrence threshold = 0.1, transition threshold = 0.05 and λ = 0.5. An anomaly score of 1 represents an anomaly while 0 represents the normal data points.
5.7 Anomaly detection results on the dynamic test dataset. An anomaly score of 1 represents an anomaly while 0 represents the normal behaviour.
6.1 Floor plan of the MIT Reality Commons Badge dataset. [2]
7.1 Overview of the evaluation method.
7.2 F1 score vs clustering threshold C.
7.3 Locations of the EMM states, each identified by a unique ID.
7.4 Dissimilarities between the behaviours at different states during two subsequent sliding windows during a complete day.
7.5 Dissimilarities between average behaviours during two subsequent sliding windows during a complete day.
7.6 λ estimated using Algorithm 1.
7.7 λ estimated using Algorithm 2.
7.8 Classification performance of the department group classifications during a day.
7.9 Outlier detection results without fading.
7.10 Outlier detection results without fading; notice that the normative behaviour has changed.
7.11 Outlier detection results with fading using Algorithm 1.
7.12 Outlier detection results without fading; notice that the normative behaviour has changed.
7.13 Outlier detection results with fading using Algorithm 2.
7.14 Outlier detection results without fading; notice that the normative behaviour has changed.
7.15 Percentual increase due to fading. First week in the dataset.
7.16 Percentual increase due to fading. Second week in the dataset.
7.17 Percentual increase due to fading. Third week in the dataset.
7.18 Non-parametric classification, training window: 2007/03/28 10:02 - 2007/03/28 10:04.
7.19 Non-parametric classification, training window: 2007/03/28 10:04 - 2007/03/28 10:06.
8.1 Transition probability matrices for the novice and senior subgroups in the Configuration group. The colors correspond to the intensity of the transitions.
8.2 Example of a situation where a first order Markov Chain may be


List of Tables

2.1 Example database.
2.2 Frequent rules with support greater than 0.2 and confidence greater than 0.5. Singletons are omitted.
6.1 Department groups in the MIT Badge dataset.
7.1 F1 scores of classification of the department groups. First week in the dataset.
7.2 F1 scores of classification of the department groups. Second week in the dataset.
7.3 F1 scores of classification of the department groups. Third week in the dataset.
7.4 Percentual change in F1 scores of classification with respect to the classifier without fading. First week in the dataset.
7.5 Percentual change in F1 scores of classification with respect to the classifier without fading. Second week in the dataset.
7.6 Percentual change in F1 scores of classification with respect to the classifier without fading. Third week in the dataset.
7.7 Percentual change in F1 scores of classification with respect to the classifier without fading. First week in the dataset.
7.8 Percentual change in F1 scores of classification with respect to the classifier without fading. Second week in the dataset.
7.9 Percentual change in F1 scores of classification with respect to the classifier without fading. Third week in the dataset.
7.10 First week in the dataset. The values correspond to the precision of the outlier detector.
7.11 Second week in the dataset. The values correspond to the precision of the outlier detector.
7.12 Third week in the dataset. The values correspond to the precision of the outlier detector.
7.13 Subgroups of the Configuration group.
7.14 Precision of the novice subgroup outlier detector. The last column represents the precision of a random outlier detector. First week in the dataset.
7.15 Precision of the novice subgroup outlier detector. The last column represents the precision of a random outlier detector. Second week in the dataset.
7.16 Precision of the novice subgroup outlier detector. The last column represents the precision of a random outlier detector. Third week in the dataset.
7.17 Discovered groups with clustering threshold C = 8, min supp = 0.1 and min conf = 0.1. The algorithm identified 27 groups, all of which contained only one badge ID.
7.18 Discovered groups with clustering threshold C = 8, min supp = 0.2 and min conf = 0.2. The algorithm identified 27 groups, all of which contained only one badge ID.
7.19 Discovered groups with clustering threshold C = 9, min supp = 0.1 and min conf = 0.3.
7.20 Discovered groups with clustering threshold C = 9, min supp = 0.1 and min conf = 0.5.
7.21 Discovered groups with clustering threshold C = 9, min supp = 0.1 and min conf = 0.5.
7.22 Discovered groups with clustering threshold C = 10, min supp = 0.3 and min conf = 0.5.
7.23 Discovered groups with clustering threshold C = 11, min supp = 0.2 and min conf = 0.4.
7.24 F1 scores of the department group classifications on 2007/03/28.
8.1 The number of measurements of the supposed novice group varies


Chapter 1

Introduction

Anomaly detection is performed by analyzing and modelling interactions between entities or groups based on some behavioural data. Behaviour is often defined as a set of attributes of the participating individuals that can be measured objectively [3], such as physical proximity, or the duration and frequency of telephone calls to others. It is important to be able to automatically detect deviations from the normative activity pattern, because in many applications manual analysis is infeasible due to complexity, information overload and fatigue for human operators [4].

Examples of applications include:

• Intrusion detection systems: Card access control systems are increasingly being used to enhance the security of buildings and strategic areas by policy regulations for employees entering security doors. It is however relatively easy to bypass such systems, either by simply following an authorized person through a door or by stealing another person’s card.

• Fraud detection: In many cases, credit card fraud can be detected by monitoring the patterns in the geographic locations of the transactions.

Detecting anomalous activities requires either knowing the normal course of events [5] or how the deviations manifest themselves. For example, in the context of traffic analysis, anomalies could be characterized by the speed, direction or the type of vehicles (car, motorcycle, pedestrian etc.). Usually there are two means for obtaining this type of information: consulting with a human domain expert to set up predefined rules, or learning from historical data [4]. It is often argued that the use of predefined rules from a domain expert is insufficient due to the complexity of the underlying dynamical system and the lack of expert knowledge covering all possible situations [6]. On the other hand, there may not be sufficient data available for the automatic set-up of rules to filter rare incidents.

1.1 Anomalies

Anomalies or outliers can be understood as deviations from some normative pattern. Unfortunately, there is no unified formal definition of what characterizes a "deviation". In statistics, the following definition is used:

"An outlying observation, or 'outlier', is one that appears to deviate markedly from other members of the sample in which it occurs." [7]

By narrowing down the abstract notion of other members in the definition above, anomalies can be split into the following categories [8]:

• Point anomalies: Single data points that deviate significantly from all other observations in the dataset.

• Contextual anomalies: Point anomalies that deviate significantly from all other data points in a neighborhood of the data point itself. For example, an outside temperature of 25°C in Stockholm sometime in January is a contextual anomaly, since 25°C is considered normal during summer but not during winter. The context in this example is the season.

• Collective anomalies: A set of neighbouring point or contextual anomalies.

It is important to observe that a good outlier detector should discover contextual rather than point anomalies, since the dataset may have a dynamic nature in which an anomalous pattern at some time t may not necessarily be anomalous at t + ∆t. Furthermore, the quantity of collective anomalies should be relatively low, since the outlier detector should ideally be able to "forget" obsolete observations in order to accommodate itself to new patterns [9].

1.2 Background and motivation

The European Union funded Security UPgrade for PORTs (SUPPORT) project aims to increase the security at European port facilities by upgrading legacy surveillance systems. One problem in today's surveillance systems is the high false-positive alarm rate in some subsystems. The project has explored different ways of reducing this false-positive alarm rate.

In order to understand what is causing these alarms, it is hypothesised that the majority of the alarms can be linked to different types of anomalous behaviours. The main goal of the thesis is thus to look for anomalous patterns in a behavioural dataset. These behaviours can eventually be incorporated into the surveillance systems to prevent the triggering of false alarms.

The dataset provided by the project is confidential and has restricted access. As an alternative, a publicly available dataset with supposedly similar characteristics was identified. At the beginning, both the confidential and the publicly available datasets were used in parallel. Due to the constrained access to the confidential dataset, it was decided that the publicly available dataset would be used for the thesis' purposes.

1.3 Nature of data

This section provides a brief overview of the datasets used in the thesis. For a more detailed description, the Reader is referred to Chapter 6.

• The confidential dataset contains time stamped logs from a card access control system at a harbour facility. Each data point represents the textual identifier of the harbour area in which an employee, identified by his/her unique ID, currently resides. It is transaction based in the sense that data is delivered only if there is movement between different areas. The format of this dataset is

<Day, Time, Site Number, Location, Card number, Employee ID, Area>

• The publicly available dataset, on the other hand, contains time-stamped geographic locations sampled at a predefined sampling period for employees in an office environment. Each data point thus represents the momentarily estimated position of an employee, identified by his/her unique ID, with respect to some fixed Cartesian coordinate system. The dataset contains position data collected from 39 employees over a period of one month [2]. The number of employees working in the office varies significantly during a typical day. In the mornings, fewer than 10 people are active, while later in the afternoon the number of employees is higher. A typical trajectory from this dataset is shown in Figure 1.1. The format of this dataset is

<Day, Time, Employee ID, X, Y>

Despite the obvious differences between the two datasets, the confidential dataset may be transformed into geometric trajectories by introducing an abstract coordinate system. We now formalize the notion of geometric trajectories in Definition 1.

Definition 1. Trajectory: Consider a sequence of a finite number of points $P = \langle p_1, p_2, \ldots, p_n \rangle$. The piecewise linear curve $C$ connecting the segments $\overrightarrow{p_1 p_2}, \overrightarrow{p_2 p_3}, \ldots, \overrightarrow{p_{n-1} p_n}$ is called a trajectory.
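As a concrete illustration of Definition 1, the sketch below groups position records of the publicly available dataset into per-employee trajectories within a sliding window. It is a hypothetical Python example: the column names, timestamp format and the two-minute window length are assumptions for illustration, not details taken from the thesis.

# Hypothetical sketch: build per-employee trajectories from <Day, Time, Employee ID, X, Y>
# records. Column names and the window length are illustrative assumptions.
import pandas as pd

def trajectories_in_window(df, start, length):
    """Return {employee_id: [(x, y), ...]} for all samples in [start, start + length)."""
    window = df[(df["timestamp"] >= start) & (df["timestamp"] < start + length)]
    trajectories = {}
    for emp_id, group in window.sort_values("timestamp").groupby("employee_id"):
        # The ordered (x, y) samples are the vertices p1, p2, ..., pn of Definition 1.
        trajectories[emp_id] = list(zip(group["x"], group["y"]))
    return trajectories

# Example usage on a dataframe with columns: timestamp, employee_id, x, y
# trajs = trajectories_in_window(df, pd.Timestamp("2007-03-28 10:02"), pd.Timedelta(minutes=2))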

Figure 1.1: Typical trajectory of an employee during a sliding window. The axes correspond to the x and y coordinates in the office environment.

1.4 Problem description

The purpose of the work in this thesis is to detect contextually anomalous patterns in a two dimensional trajectory dataset. Due to the complex nature of human behaviour, such anomalies can arise in multiple ways. For example, employees may steal another person's card or piggyback on an authorized person through security doors. In this thesis, we focus on the following types of outliers in the aforementioned trajectory datasets:

• Anomalous trajectories with anomalous positions: This corresponds to the situation where the trajectory is outside the region to which trajectories are normally confined. A possible interpretation of this is when an individual visits a spatial region he/she rarely visits or has limited access to. An example of such a situation is shown in Figure 1.2a.

• Anomalous trajectories with normal positions: Even though the positions themselves lie in the normal region, the ordering of the sample points may be anomalous; as Figure 1.2b illustrates, this type of anomaly also includes outliers in speed, acceleration etc.

(a) Example of a trajectory marked as an outlier because of its anomalous segments. This may correspond to the following situation: an individual visits a room/spatial region that he/she normally does not visit.

(b) Example of a trajectory marked as an outlier because of its anomalous ordering of sample points. Notice that this type of anomaly also includes outliers in speed, acceleration etc.

Figure 1.2: Types of trajectory anomalies.

So far in the thesis, the focus has been on sub-trajectory outliers (anomalous positions, instantaneous velocities etc.). It is however important to define the requirements for a whole trajectory (within a data frame) to be considered anomalous. Unless stated otherwise, it will be assumed that if a trajectory contains at least one anomaly of any kind, then the whole trajectory will be considered as anomalous.

We have now formalized the notion of trajectory outlier but have been relatively vague about the context in which we expect to find them: intra-personal (behaviour of the same person) or interpersonal (behaviour of a group)? It is assumed that interpersonal behaviour is more interesting to model than that of individuals on their own, for two reasons. First, according to [10], employees tend to socialize more often with others with the same role/expertise, since their cubicles are often situated close by. In other words, there is a natural grouping of employees whose behaviour is assumed to be easier to model than that of individuals. Second, the modelling of groups can at any time be reverted back to individuals.

1.5 Outlier detection and Classification

In this section, we explore the similarities and differences between outlier detection and classification.


With this in mind, outlier detection can be understood as a binary (two classes: normal or anomalous) unsupervised classification problem. The underlying assumption for data characterization is that normal data points are far more frequent than the anomalous data points. Thus, the problem of finding algorithms for outlier detection in trajectory data is closely related to trajectory classification.

1.6 Contribution

The thesis has reached the following main results:

• The main result of the thesis is an improved trajectory outlier detection algorithm that can be used to more accurately detect contextual anomalies in datasets with dynamic behaviour.

• The thesis work also resulted in the creation of a new approach to trajectory classification which does not require feature selection.

• An algorithm is proposed to discover the normative behaviour of a dataset. This may be required for the anomaly detector if there is no such a priori knowledge in the dataset.

1.7 Organization

The rest of the thesis is organized as follows. The first four chapters summarize the results of my literature investigation and my customization of the algorithms found. The theoretical foundation is necessary to understand, use and possibly further develop the source code I wrote, which can be found in the Appendix. References to these programs are linked from the results they produced. Chapter 2 introduces the concept of group discovery that is necessary for outlier detection.

Chapter 3 gives the necessary background in trajectory analysis and discusses its challenges.

Chapter 4 introduces the concept of non-parametric classification and describes an algorithm to classify trajectories based on their geometric properties alone. Chapter 5 describes an algorithm to classify trajectories based on their semantic information and introduces the Extensible Markov model.

In Chapter 6, the dataset used in the project is described in detail. In Chapter 7, the results of the non-parametric and parametric analyses are presented with explanations.


Chapter 2

Group discovery

As described in Chapter 1, the problem of outlier detection necessitates either a priori knowledge of the normal course of events or some subset of a dataset in which the normal events are far more frequent than the outliers. Since the aim of the thesis is to identify anomalous activity in groups, the primary task is thus to identify groups of individuals with similar long-term behaviour. Before moving on, it is important to define what is meant by a group in this application. A group can be thought of as an equivalence class having the following three properties:

• Reflexivity: A has the same property as A.

• Symmetry: If A and B have the same property, then so do B and A.

• Transitivity: If A and B, and B and C have the same property, then so do A and C.

Throughout this Chapter, it is assumed that there is an existing database containing distinct sets of individuals with the same instantaneous behaviour patterns at different times. Each record in the database - called a transaction - can thus be understood as a combination of individuals with the same behaviour at time t. One possible way in which such databases can be constructed is discussed in Chapter 6. The problem of finding the underlying long-term structure of such a database leads over to the concepts of data mining and association rule mining.

2.1 Association rule mining

Definition 2. Association rule [12]: Statements of the form X, Y → Z that are used to represent the predicted occurrence of an item Z based on the occurrence of other, possibly multiple, items X and Y, where X, Y and Z are disjoint items.

The validity of an association rule is usually determined by its support and confidence values. The support of an association rule X → Y measures how often the rule is correct with respect to some dataset, that is, how often the two items X and Y occur in the same transaction in the database. The confidence measures how often the itemset Y appears in transactions containing X. In other words, the confidence can be understood as the conditional probability P(Y|X), which measures the validity of the inference X → Y. It is important to understand the difference between support and confidence: support measures the undirected relation, while confidence measures the directed relation between the items. Given a database D, the goal of association rule mining is to find all rules R that satisfy the following conditions:

$$\mathrm{support}(R) > \mathrm{min\_support}, \qquad \mathrm{confidence}(R) > \mathrm{min\_confidence}$$

where min_support and min_confidence are user-set thresholds. Frequent association rules are characterized by high support and confidence values.

2.1.1 Apriori algorithm

In this section, the Apriori algorithm [13] for extracting frequent association rules from databases is presented. The algorithm recursively finds association rules by extending previous rules according to the Apriori principle.

Definition 3. Apriori principle [13]: An itemset containing k items is frequent only if all of its (k − 1)-item subsets are frequent.

Algorithm 1 Apriori algorithm

procedure Apriori(transactions, minsup)
    L1 ← [frequent 1-itemsets in transactions]
    k ← 2
    while Lk−1 ≠ ∅ do
        Ck ← candidates generated from Lk−1
        for transaction in transactions do
            Ct ← candidates in Ck that are contained in transaction
            for candidate in Ct do
                count[candidate] ← count[candidate] + 1
            end for
        end for
        Lk ← candidates in Ck with counts greater than minsup
        k ← k + 1
    end while
    return ∪k Lk
end procedure
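For concreteness, here is a minimal, self-contained Python sketch of the frequent-itemset part of the Apriori procedure above. It is an illustration only, not the implementation used in the thesis; the toy transactions mirror Table 2.1.

# Minimal Apriori sketch (illustration only): return all itemsets whose support
# exceeds min_support, following the candidate-generate-and-count loop above.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {item for t in transactions for item in t}
    # Frequent 1-itemsets.
    current = {frozenset([item]) for item in items
               if sum(item in t for t in transactions) / n > min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c for c, cnt in counts.items() if cnt / n > min_support}
        frequent |= current
        k += 1
    return frequent

# Toy database resembling Table 2.1.
db = [{"A", "B", "C"}, {"B", "A"}, {"C", "A"}, {"B", "C"}, {"A", "B", "D"}]
print(apriori(db, min_support=0.2))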

2.2 Finding groups

Table 2.1 shows an example of a transactional database containing equivalence classes over the alphabet A, B, C, D. Table 2.2 shows the frequent association rules of these transactions that satisfy minsup > 0.2 and minconf > 0.5. It is now proposed to transform this set of rules into a directed graph by letting the edges represent the relations between the items. The problem of finding equivalence classes can then be transformed into finding the cycles in this graph. A cycle is a directed sequence of vertices starting and ending at the same vertex.

Table 2.1: Example database.

ID   Transaction
1    A B C
2    B A
3    C A
4    B C
5    A B D

Table 2.2: Frequent rules with support greater than 0.2 and confidence greater than 0.5. Singletons are omitted.

ID   Frequent pattern
1    D → A
2    D → B
3    C → B
4    B → C
5    C → A
6    A → C
7    B → A
8    A → B
9    {B,D} → A
10   {A,D} → B
11   {B,C} → A
12   {A,C} → B

Proof: By the Apriori principle in Definition 3, any itemset longer than 2 items can be decomposed into a set of 2-itemsets, since a subset always has support greater than or equal to that of its supersets. Conversely, the supersets of an infrequent itemset are also infrequent.

Figure 2.1 shows the graph constructed from the frequent rules 1-8 of Table 2.2. Complex rules, such as {A,C} → B, may be recovered by following the edges A → C and C → B.

Figure 2.1: Graph of frequent rules containing single itemsets. This graph contains two groups/equivalence classes: (A, B, C) and (D).

The problem of finding such cycles has been studied thoroughly, and several algorithms have been proposed, such as Tarjan’s strongly connected components algorithm [14] and the Sharir algorithm [15]. The description of these algorithms is beyond the scope of the thesis and the Reader is referred to the publications for more details.

Finally, it is worth mentioning that the proposed group discovery algorithm described above can also be used as a primitive outlier detector. Suppose that, in the database from the office environment, only employee A works outside of normal working hours. This situation corresponds to the specific individual mostly occurring by itself in the database containing the combinations of individuals with the same behaviour.
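To illustrate the graph-based grouping step, the sketch below builds a directed graph from pairwise rules and extracts its strongly connected components (Tarjan-style, via the networkx library). The rule list is a toy example modelled on Table 2.2, not output from the thesis experiments.

# Sketch: recover equivalence classes (groups) as strongly connected components of the
# directed graph induced by the frequent pairwise rules. Toy rules modelled on Table 2.2.
import networkx as nx

rules = [("D", "A"), ("D", "B"), ("C", "B"), ("B", "C"),
         ("C", "A"), ("A", "C"), ("B", "A"), ("A", "B")]

graph = nx.DiGraph()
graph.add_edges_from(rules)

# Each strongly connected component is a set of mutually reachable items,
# i.e. one equivalence class in the sense of this section.
groups = list(nx.strongly_connected_components(graph))
print(groups)  # e.g. [{'A', 'B', 'C'}, {'D'}]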

Chapter 3

Background on trajectory analysis

This chapter introduces the necessary concepts and methods for trajectory analysis, which will be used in Chapter 4.

With the wide availability of wireless positioning systems such as GPS, RFID, WiFi, ZigBee etc., huge amounts of location based data may be recorded with high temporal and spatial resolution. The main difficulty in analysing such datasets is their temporal nature, which may lead to sequences of different lengths.

In the literature, two approaches to trajectory analysis can be found: the first maps the sequences into a fixed dimensional space and uses the usual dissimilarity measures. The other approach is to use special dissimilarity measures that do not require fixed dimensions. In this Chapter, several preprocessing techniques are discussed to overcome the problem of dimensionality, and finally a number of special dissimilarity measures are introduced.

3.1 Dimensionality reduction

The idea of dimensionality reduction is to represent a trajectory in a time window with a small fixed number of parameters [16]. Clearly, the success of this approach depends to a large extent on the choice of these parameters. A natural way of mapping a trajectory into a fixed dimensional space would be to compute the position, velocity and acceleration of an object at regular time increments. However, the choice of parameters is often dependent on the specific application. In a more structured fashion, such parameters can be determined by principal component analysis (PCA) [17].

3.2 Normalization

known dynamics of a trajectory is used to estimate extra hypothetical trajectory points. When re-sampling, linear interpolation is used to reduce the number of points in the longest sequences. After normalization, usual metrics such as the Euclidean distance can be used to measure the dissimilarity. However, this approach is found to perform poorly in practice [16].

3.3 Trajectory dissimilarities

In the literature the following three dissimilarities are suggested to compare two trajectories:

• Dynamic Time Warping
• Hausdorff distance
• Fréchet distance

3.3.1 Dynamic Time Warping

Dynamic Time Warping (DTW) is one of the most widely used dissimilarity measures for sequences of different lengths [18] and was chosen as the starting point for the thesis. Simply put, the objective of DTW is to either compress or stretch the time axis in some places of the sequences so that the individual points of any two sequences can be mapped to each other in an optimal way.

So far, this is reminiscent of the re-sampling technique discussed in Section 3.2. However, instead of having a predetermined strategy for the re-sampling, DTW aims to find a dynamic re-sampling strategy which gives the smallest distance between the sequences. A mapping function $\phi = (\phi_P, \phi_Q)$ of length $n$ maps individual points between two sequences P and Q so that $P(\phi_P(i))$ is matched to $Q(\phi_Q(i))$ for $i \leq n$. The dissimilarity between the sequences is defined as the sum of distances between the two trajectories' corresponding points:

$$d_{DTW}(P, Q) = \sum_{i=1}^{n} d\big(P(\phi_P(i)), Q(\phi_Q(i))\big) \tag{3.1}$$

Figure 3.1: Example of a mapping of points between two curves of different length.

Normally, there are several possible mappings that can be used to minimize the distance between two sequences. Therefore, it is necessary to impose certain restrictions on the mapping [18]:

• It must preserve the continuity of the trajectories. Consider two adjacent points P(k − 1) and P(k) on the blue curve and another point Q(k − 1) on the red curve. If the point P(k − 1) is mapped to the point Q(k − 1), then the point P(k) may not be matched to any point that precedes Q(k − 1). Notice that this does not mean that each point on the first curve must be mapped to unique points on the other curve: it is possible that the same point on the blue curve is mapped to several points on the red curve.

• The start and end points of the curves are always matched to each other. This ensures that the entire curves are matched.

Unfortunately, it can be shown that DTW fails the triangle inequality [19] and is thus not a valid metric. As we shall see in Chapter 4, the metric property will play a central role in our quest for a trajectory classifier/outlier detector.
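A compact dynamic-programming sketch of DTW for two-dimensional trajectories is given below. It uses Euclidean point distances and enforces the boundary and continuity restrictions described above; it is an illustrative implementation, not the code used in the thesis.

# Illustrative DTW between two 2-D trajectories P and Q given as lists of (x, y) points.
# Enforces the boundary condition (start/end matched) and the monotone continuity constraint.
import math

def dtw(P, Q):
    n, m = len(P), len(Q)
    INF = float("inf")
    # D[i][j] = minimal cumulative cost of aligning the first i points of P with the first j of Q.
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(P[i - 1], Q[j - 1])   # distance between matched points
            D[i][j] = cost + min(D[i - 1][j],      # stretch P
                                 D[i][j - 1],      # stretch Q
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

# Example: two short trajectories of different length.
print(dtw([(0, 0), (1, 0), (2, 0)], [(0, 0), (2, 0)]))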

3.3.2 Hausdorff distance

The Hausdorff distance between two curves P and Q is the smallest distance d such that every vertex of each curve lies within distance d of some vertex of the other curve.

Definition 4. Directed Hausdorff distance [20]: Let P and Q be any two trajectories in $\mathbb{R}^2$ and let $\|\cdot\|$ denote the Euclidean norm. The directed Hausdorff distance is defined as

$$\delta_H(P, Q) = \max_{q \in V_Q} \min_{p \in V_P} \|p - q\|$$

where $V_P$ and $V_Q$ represent the sets of vertices in P and Q, respectively.

Figure 3.2: Two trajectories with a small Hausdorff distance but with a large Fréchet distance. [1]

This means that the Hausdorff distance can only measure geographic proximity between trajectories. Additional information, such as speed and direction, is not taken into consideration. Furthermore, it should be noted that the directed Hausdorff distance is asymmetric and is therefore not a metric. A more general version of the Hausdorff distance is given in Definition 5.

Definition 5. Undirected Hausdorff distance [20]: Let P and Q be any two trajectories in $\mathbb{R}^2$ and let $\delta_H$ denote the directed Hausdorff distance. The undirected Hausdorff distance is defined as

$$d_H(P, Q) = \max\big(\delta_H(P, Q), \delta_H(Q, P)\big)$$

According to [21], the undirected Hausdorff distance is a metric. Even though the Hausdorff distance does not take the continuity of the trajectories into consideration, it is assumed in the thesis that it may be appropriate for differentiating between groups in an office environment. Under normal office conditions, it is plausible to assume that the cubicles of employees who have the same role are situated closer to each other than to other employees. Thus, there may be sufficient information available in the geographic positions alone to be able to classify trajectories to some degree.
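The undirected Hausdorff distance of Definition 5 can be computed directly from the vertex sets. The numpy sketch below is a minimal illustration of Definitions 4 and 5 on toy data.

# Illustrative undirected Hausdorff distance between two trajectories,
# given as (n, 2) and (m, 2) arrays of vertices.
import numpy as np

def directed_hausdorff(P, Q):
    # For every vertex q of Q, the distance to the nearest vertex of P; take the worst case.
    diffs = P[None, :, :] - Q[:, None, :]   # shape (m, n, 2)
    dists = np.linalg.norm(diffs, axis=2)   # pairwise Euclidean distances
    return dists.min(axis=1).max()          # max over q of min over p

def hausdorff(P, Q):
    return max(directed_hausdorff(P, Q), directed_hausdorff(Q, P))

P = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
Q = np.array([[0.0, 1.0], [2.0, 1.0]])
print(hausdorff(P, Q))  # about 1.41 for this toy example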

3.3.3 Fréchet distance

The Fréchet distance is another curve dissimilarity measure that takes the continuity of the curves into account and is therefore believed to be better suited for capturing contour dissimilarities than the Hausdorff distance. First, we define the monotone parameterization of a trajectory.

Definition 6. Monotone parameterization: A monotone parameterization of a trajectory P of length N is a monotonic function f(t) : [0, 1] → [0, N] such that f(0) = 0 and f(1) = N.

The Fréchet distance is often explained as follows [1]: suppose a man is walking his dog on a leash, and both of them are constrained to follow their own trajectory. Both the man and his dog can control their speed on their own, but they are not allowed to go backwards on their path. The Fréchet distance is then the minimum length of the leash that is needed so that both of them can travel through their polygonal path from start to end. The formal definition is given in Definition 7.

Definition 7. Discrete Fréchet distance [1]: Let P and Q be any two planar curves in $\mathbb{R}^2$ with lengths N and M, let α and β be any monotone reparametrizations of P and Q, respectively, and let d be a distance function in $\mathbb{R}^2$. The Fréchet distance is defined as

$$d_F(P, Q) = \min_{\alpha,\beta}\, \max_{t \in [0,1]} d\big(P(\alpha(t)), Q(\beta(t))\big) \tag{3.2}$$

In contrast to the previously introduced Hausdorff metric, the Fréchet distance is able to summarize multiple aspects of a trajectory curve in a concise way. Clearly, the minimal length of the leash is closely related to the relative position and orientation of the two trajectories. When two trajectories are close and aligned with each other, the minimum leash required is shorter than for two curves that are orthogonal and far from each other. Trajectories with different speeds typically require a longer leash to map the start and end points to each other. According to [1], the Fréchet distance is a metric.

Computing the discrete Fréchet distance

The optimization problem in Eq. 3.2 cannot be minimized directly for all possible choices of two trajectories in $\mathbb{R}^2$. Instead, an iterative method is used to determine the minimal leash length ε. The algorithm is initialized with a guess of the minimal leash length ε = ε₀: if it is long enough to couple the two trajectories, then ε is decreased, otherwise it is increased.

Given a guess of the minimal leash length ε = ε₀, we define the Free Space diagram as the set of all pairs of points on the trajectories P and Q whose distance to each other is at most ε.

Definition 8. Free Space diagram [1]: Let α(t) and β(t) be any monotone parametrizations of the two trajectories P and Q, respectively. Further, let ε ≥ 0 and a distance function d be given. The corresponding Free Space diagram is then the set

$$F_\varepsilon = \{(\alpha(t), \beta(t)) \mid d\big(P(\alpha(t)), Q(\beta(t))\big) \leq \varepsilon\} \tag{3.3}$$

The Free Space diagram is then used to verify that the guess ε = ε₀ is long enough to couple the two trajectories. It can be shown that if there exists a path from the origin to the top right corner of the Free Space diagram which is monotone in both coordinates, then ε solves the coupling problem $d_F \leq \varepsilon$. An example is shown in Figure 3.3.

(a) Example of two trajectories. (b) Corresponding Free Space diagram with an example of a curve, monotone in both coordinates, from (0, 0) to (4, 5).

Figure 3.3: Graphical representation of the Fréchet distance calculation.

Each of the axes in Figure 3.3b represents the index of the line segments of one of the two trajectories in Figure 3.3a. The blue trajectory contains five segments, which are numbered on the vertical axis in the Free Space diagram in Figure 3.3b. The red trajectory contains four segments, which are represented by the horizontal axis.
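As a complement to the free-space search outlined above, the sketch below computes the discrete Fréchet distance between two vertex sequences with the standard dynamic-programming recurrence. It is a generic illustration of the coupling idea under these assumptions, not the thesis implementation.

# Illustrative discrete Fréchet distance between vertex sequences P and Q.
# c(i, j) is the minimal leash length needed to couple P[:i+1] with Q[:j+1]
# while respecting the monotone (no-backtracking) constraint.
import math
from functools import lru_cache

def discrete_frechet(P, Q):
    @lru_cache(maxsize=None)
    def c(i, j):
        d = math.dist(P[i], Q[j])
        if i == 0 and j == 0:
            return d
        if i == 0:
            return max(c(0, j - 1), d)
        if j == 0:
            return max(c(i - 1, 0), d)
        # Either the man, the dog, or both advance one vertex.
        return max(min(c(i - 1, j), c(i, j - 1), c(i - 1, j - 1)), d)

    return c(len(P) - 1, len(Q) - 1)

P = [(0, 0), (1, 0), (2, 0), (3, 0)]
Q = [(0, 1), (1, 1), (3, 1)]
print(discrete_frechet(P, Q))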

Chapter 4

Non-parametric trajectory analysis

In this Chapter, the class of non-parametric outlier detection and classification algorithms is introduced. Non-parametric models are characterized by the fact that they do not rely on any particular assumption about the nature of the data, nor do they require a predetermined model structure. In the following, popular non-parametric analysis tools are discussed.

• Nearest neighbour clustering assumes that normal data points are closely packed while anomalies are located far from other points or in regions with low density. It is well suited for unsupervised learning, but the results are strongly dependent on the number of available data points. If normal points do not have a significantly higher number of neighbours than outliers, the classification may fail completely [22]. As the public dataset only contains a limited number of employees, nearest neighbour clustering will not be used.

• Support Vector Machine (SVM) [23] is another commonly used data classification and prediction technique. In contrast to nearest neighbour clustering, SVM is completely supervised but does not require a large number of data points. Hence, regular SVM is more appropriate in situations where the anomalous and regular activities are known a priori. In this section, we will assume that such knowledge is available to us and will describe an SVM based trajectory classifier.

4.1 Introduction to SVM

In this section we describe binary classification. The goal is to train a linear classifier on some linearly separable data D,

$$D = \{(\mathbf{x}_i, s_i) \mid i = 1, \ldots, n\}$$

where $\mathbf{x}_i$ are the individual data points (vectors) and $s_i \in \{-1, 1\}$ is the class label variable. In other words, the task is to find a hyperplane $f(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i + b$, where $\mathbf{w}$ and $b$ are parameters such that

$$f(\mathbf{x}_i) = \begin{cases} \mathbf{w}^T\mathbf{x}_i + b \geq 1, & \text{if } s_i = 1, \\ \mathbf{w}^T\mathbf{x}_i + b \leq -1, & \text{if } s_i = -1. \end{cases} \tag{4.1}$$

There might be several possible sets of parameters ⟨w, b⟩ that satisfy Eq. 4.1, but we would like to determine which one is best at separating the data. The problem is that the underlying true probability distribution is normally unknown; all we have is D, which is a sample from that distribution. Notice, however, that if we could somehow repeat the sampling process, we would get another sample D′ whose general structure would be similar to that of D. If we were to train a new hyperplane on D′ and compare it to our original hyperplane trained on D, we would therefore expect them to resemble each other. With this in mind, we want to choose the hyperplane that is the most stable with respect to changes in the dataset. In other words, we want to select the hyperplane whose orthogonal distances to the nearest data points are as large as possible. Mathematically, this distance can be written as

$$d_i = \frac{s_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|} \tag{4.2}$$

Namely, let $\mathbf{x}_i'$ be the orthogonal projection of $\mathbf{x}_i$ on the hyperplane. By definition, $\mathbf{w}^T\mathbf{x}_i' + b = 0$ since $\mathbf{x}_i'$ is on the hyperplane. To get from $\mathbf{x}_i'$ to $\mathbf{x}_i$ we need to move $d_i s_i$ along the normal vector of the hyperplane:

$$\mathbf{x}_i = \mathbf{x}_i' + d_i s_i \frac{\mathbf{w}}{\|\mathbf{w}\|} \;\Rightarrow\; \mathbf{w}^T\mathbf{x}_i = \mathbf{w}^T\mathbf{x}_i' + d_i s_i \frac{\mathbf{w}^T\mathbf{w}}{\|\mathbf{w}\|} = -b + d_i s_i \|\mathbf{w}\| \;\Rightarrow\; d_i = \frac{s_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|}$$

where we used $s_i = \frac{1}{s_i}$. To summarize, the problem of finding the best separating hyperplane is

$$\max_{\mathbf{w},b} \left\{ \min_i \frac{s_i(\mathbf{w}^T\mathbf{x}_i + b)}{\|\mathbf{w}\|} \right\} \tag{4.3}$$

This is a non-convex optimization problem. According to [24], it can be shown that the problem can be transformed into the following convex optimization problem:

$$\begin{aligned} \underset{\mathbf{w},b}{\text{minimize}} \quad & \tfrac{1}{2}\|\mathbf{w}\|^2 \\ \text{subject to} \quad & s_i(\mathbf{w}^T\mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, n. \end{aligned} \tag{4.4}$$

4.1.1 Feature mapping

Until now, we have restricted ourselves to linearly separable datasets and assumed that the separating hyperplane depends linearly on the input data: $f(\mathbf{x}_i) = \mathbf{w}^T\mathbf{x}_i + b$. However, for some datasets which are not initially linearly separable, it may be fruitful to pre-process the inputs before training the separating hyperplane. This is called feature mapping.

Definition 9. Feature mapping: Preprocessing of the input data using a feature map $\Phi : \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbb{R}^n$ and $\mathbb{R}^m$ are called the input and feature space, respectively.

Instead of training directly on the input space as in Figure 4.1, it is useful to first map the data points so that they become linearly separable.

Figure 4.1: A possible way to pre-process data is to change the coordinate system from Cartesian to polar.

4.1.2 Linearly non-separable datasets

The linear classifier described in the previous section has one major shortcoming: it is limited to datasets that are linearly separable. If we were to apply the classifier to some non-separable data, the feasible set would be empty and the optimization problem in Eq. 4.4 would be infeasible. In this section we relax this assumption and allow some data points to be misclassified, in the hope of being able to classify more interesting datasets.

The idea is to modify Eq. 4.4 so that the misclassified data points are penalized by a factor proportional to their classification error.

$$\begin{aligned} \underset{\mathbf{w},b}{\text{minimize}} \quad & \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \\ \text{subject to} \quad & s_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \\ & \xi_i \geq 0, \quad i = 1, \ldots, n. \end{aligned} \tag{4.5}$$

where $\xi_i$ is a so-called slack variable measuring the classification error of the corresponding data point.

Figure 4.2: Example of a linear classifier applied to a linearly non-separable dataset in $\mathbb{R}^2$.

To see how this works, consider the following cases in Figure 4.2.

• If $\xi_i = 0$, then $s_i f(\mathbf{x}_i) \geq 1$, which means that the feature vector is correctly classified. Note that in this case, the feature vector does not contribute to the objective function and has no influence on the choice of the hyperplane.

• If $1 \geq \xi_i \geq 0$, then $s_i f(\mathbf{x}_i) \geq 0$, which means that the feature vector is correctly classified but lies in the marginal area of the hyperplane. Note that even though it is correctly classified, the feature vector contributes to the objective function; removing this point would change the hyperplane. Points like these are usually referred to as support vectors.

• If $\xi_i \geq 1$, then $s_i f(\mathbf{x}_i) \leq 0$, which means that the feature vector is misclassified. This point has the same type of effect as a correctly classified feature vector in the margin area; note, however, that the effect of misclassified points is larger than that of correctly classified points in the margin area.

In summary, the sum $\sum_{i=1}^{n}\xi_i$ can be seen as an upper bound on the number of training errors. The larger the value of C, the larger the penalties for misclassification. Observe that by setting C = ∞ we get back the hard classifier from the previous section.
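To make the role of the penalty factor C tangible, here is a small scikit-learn sketch that fits soft-margin linear SVMs with different C values on synthetic data; the data and the chosen values are purely illustrative.

# Illustrative effect of the misclassification penalty C on a soft-margin SVM.
# Larger C penalizes slack more heavily and approaches the hard-margin classifier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
s = np.array([-1] * 50 + [+1] * 50)   # class labels s_i in {-1, +1}

for C in (0.1, 10.0, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, s)
    print(C, "support vectors:", len(clf.support_), "training accuracy:", clf.score(X, s))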

4.1.3 Dual formulation of the SVM

Instead of solving the optimization problem in Eq. 4.5 directly, it is usually transformed into another, equivalent formulation which can be solved more efficiently. This is the so-called dual problem, which is obtained by expressing the weights w as a linear combination of the feature vectors:

$$\mathbf{w} = \sum_i \alpha_i \phi(\mathbf{x}_i) \tag{4.6}$$

Theorem 4.1.1. Representer theorem [23]: Given a weight vector $\mathbf{w} \in \mathbb{R}^n$, the component $\mathbf{w}_\perp$ perpendicular to the space spanned by $\phi(\mathbf{x})$ has no direct effect on the classification performance and only adds extra cost to the objective function in Eq. 4.5.

Proof: We can write w as the sum of two components $\mathbf{w} = \mathbf{w}_\parallel + \mathbf{w}_\perp$, where $\mathbf{w}_\parallel = \Phi^T\boldsymbol{\alpha}$ for some $\boldsymbol{\alpha} \in \mathbb{R}^n$ and $\Phi\mathbf{w}_\perp = \mathbf{0}$. The equation for the separating hyperplane can thus be written as

$$f(\mathbf{x}') = \mathbf{w}^T\Phi(\mathbf{x}') + b = (\mathbf{w}_\parallel + \mathbf{w}_\perp)^T\Phi(\mathbf{x}') + b = \mathbf{w}_\parallel^T\Phi(\mathbf{x}') + 0 + b \tag{4.7}$$

The intuition behind Eq. 4.7 is that the contribution $\mathbf{w}_\perp$ cannot help in decreasing the classification error in Eq. 4.5. Observe, however, that

$$\|\mathbf{w}\|^2 = \|\mathbf{w}_\parallel\|^2 + \|\mathbf{w}_\perp\|^2 \geq \|\mathbf{w}_\parallel\|^2 \tag{4.8}$$

One major consequence of Theorem 4.1.1 is that instead of minimizing over w, we may as well optimize over α without any loss of optimality. This leads over to the concept of kernelization.

4.1.4 Kernelization

By substituting Eq. 4.6 into Eq. 4.7, we get

$$f(\mathbf{x}') = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)^T\phi(\mathbf{x}') + b = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i, \mathbf{x}') + b \tag{4.9}$$

where we define the kernel function

$$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}') = \langle\phi(\mathbf{x}), \phi(\mathbf{x}')\rangle \tag{4.10}$$

where ⟨·, ·⟩ represents the dot product. In other words, given the kernel function we never need to compute the underlying feature mapping in order to construct the separating hyperplane. This fact enables us to use an arbitrary number of features without actually computing them explicitly. The intuition behind kernelization is that we transform the linearly non-separable data points into a higher dimensional vector space in the hope that they become linearly separable. Choosing the kernel function implicitly determines which features will be used. It is therefore important to identify what kind of information we are expecting to extract from the data.

Definition 10. Mercer's condition for positive definite kernels [25]: Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a real-valued symmetric function. Consider a collection of n inputs $X = \{\mathbf{x}_i \in \mathbb{R}^d,\ i = 1, \ldots, n\}$ and define K as the Gram matrix of k,

$$K = \begin{pmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & k(\mathbf{x}_1, \mathbf{x}_2) & \cdots \\ k(\mathbf{x}_2, \mathbf{x}_1) & k(\mathbf{x}_2, \mathbf{x}_2) & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \tag{4.11}$$

If K is positive definite, then k is a positive definite kernel.

For example, even though the Euclidean distance $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, $d(\mathbf{x}, \mathbf{x}') = \sqrt{(\mathbf{x} - \mathbf{x}')^T(\mathbf{x} - \mathbf{x}')}$, is symmetric and positive definite, $k(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^T(\mathbf{x}_i - \mathbf{x}_j)}$ does not give a positive definite Gram matrix, i.e. it is not a valid kernel. To understand why, we will show that valid kernel functions must measure the similarity rather than the dissimilarity between features. Using the definition in Eq. 4.10, it can be shown that the following holds:

$$k(\mathbf{x}, \mathbf{x}') = \frac{\|\phi(\mathbf{x})\|^2 + \|\phi(\mathbf{x}')\|^2 - d(\phi(\mathbf{x}), \phi(\mathbf{x}'))^2}{2} \tag{4.12}$$

where $d(\phi(\mathbf{x}), \phi(\mathbf{x}'))^2 = \|\phi(\mathbf{x}) - \phi(\mathbf{x}')\|^2$ measures the distance between the mappings of the data points in the feature space. Equation 4.12 shows that while d measures the dissimilarity, the kernel function measures the similarity [26].

In summary, in order to use distances as ingredients in kernel functions, they must first be transformed to measure the similarity rather than the dissimilarity between points. This is normally accomplished by applying a positive definite, monotonically decreasing function to the distance function [27].

Example 1. Typical kernel functions

• Linear:

$$k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T\mathbf{x}_j + C \tag{4.13}$$

• Radial basis functions (RBF) are a group of functions whose value depends only on the distance between two elements x and x′, so that $k(\mathbf{x}, \mathbf{x}') = k(d(\mathbf{x}, \mathbf{x}'), 0)$, where d is a metric. Examples of RBF kernels include

– Gaussian RBF:

$$k(\mathbf{x}, \mathbf{x}') = e^{-\alpha\|\mathbf{x} - \mathbf{x}'\|^2} \tag{4.16}$$

where α ≥ 0. The value of α determines the spread of the kernel. If overestimated, the exponential will behave almost linearly. If underestimated, the exponential will be overly sensitive to noise. [28]

Example 1 gives some examples of commonly used kernels in SVM. All of these except the Multiquadratic RBF are examples of positive definite kernels. Before continuing, it is important to notice that an ideal kernel function should map the input data to a set of normalized feature vectors. If not, features with large values may dominate and cause the SVM to neglect features with small values. In other words, normalization in the feature space is necessary in order for the SVM to consider all features equally important. By Eq. 4.10, it can be seen that a sufficient condition for this is that $k(\mathbf{x}, \mathbf{x}) = \langle\phi(\mathbf{x}), \phi(\mathbf{x})\rangle = 1$. Such normalized kernel functions $\tilde{k}$ are usually formed by [29]

$$\tilde{k}(\mathbf{x}, \mathbf{x}') = \frac{k(\mathbf{x}, \mathbf{x}')}{\sqrt{k(\mathbf{x}, \mathbf{x})\,k(\mathbf{x}', \mathbf{x}')}} \tag{4.17}$$

It is easy to verify that the class of RBF kernels satisfies this requirement.

4.1.5 Multi-Class SVM

So far, the focus has been on binary classification. To be able to classify datasets with more classes, a combination of several binary SVMs is usually used [30]. Popular methods for this are one-versus-the-rest and one-versus-one.

(a) One vs. all classification. The gray regions correspond to situations that cannot be decided. The other colors correspond to different classes.

(b) One vs. one classification. The gray regions correspond to situations that cannot be decided. The other colors correspond to different classes.

Figure 4.3: Different approaches to multi-class SVM.

In some of the gray regions, all binary SVMs reject the features, while in the gray areas between the classes, more than one SVM accepts the features. Since the 1-vs-1 method has a smaller number of gray areas, it will be used in the thesis for classifying multiple classes.

4.2 Trajectory kernels

In this section, we introduce two new SVM kernels for trajectory classification. These kernels are formed by modifying the RBF kernels, replacing their distance measures with the Hausdorff and Fréchet distances.

To see how this can be done, recall from Section 3.3 that the undirected Hausdorff and Fréchet distances are true metrics. Since the Mercer condition is equivalent to the triangle inequality for distance functions [31], the Gram matrices for both of these distance measures are positive definite.

We now define the distance substitution kernel [32] $k_d$ of an RBF kernel k by

$$k_d(\mathbf{x}, \mathbf{x}') = k(d(\mathbf{x}, \mathbf{x}'), 0) \tag{4.18}$$

where d is either the Fréchet distance $d_F$ or the undirected Hausdorff distance $d_H$.
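One practical way to use such distance substitution kernels is to precompute the Gram matrix and pass it to an off-the-shelf SVM solver. The sketch below does this with scikit-learn's precomputed-kernel interface; the trajectory distance function (for example the hausdorff sketch above), the data variables and α = 2 are assumptions for illustration, not the thesis implementation.

# Sketch: Gaussian RBF distance substitution kernel k_d(x, x') = exp(-alpha * d(x, x')^2)
# used through a precomputed Gram matrix. 'hausdorff', the trajectory lists and the
# labels are assumed to exist; alpha = 2 mirrors the preliminary experiments below.
import numpy as np
from sklearn.svm import SVC

def gram_matrix(trajs_a, trajs_b, dist, alpha=2.0):
    G = np.zeros((len(trajs_a), len(trajs_b)))
    for i, ta in enumerate(trajs_a):
        for j, tb in enumerate(trajs_b):
            G[i, j] = np.exp(-alpha * dist(ta, tb) ** 2)
    return G

# clf = SVC(kernel="precomputed", C=10)
# clf.fit(gram_matrix(train_trajs, train_trajs, hausdorff), train_labels)
# predictions = clf.predict(gram_matrix(test_trajs, train_trajs, hausdorff))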

4.3 Preliminary results

In this section, we evaluate the classification performance of the proposed distance substitution kernels on a test dataset. These results are called preliminary because they are the outcomes of running the classification algorithm on a test dataset which has very obvious differences between the classes. It basically shows that the algorithm is capable of sorting the input dataset into two obviously different groups.

(a) Training dataset; the colors indicate the predetermined classes. (b) Evaluation dataset; the colors indicate the true class labels based on the average bearing of each trajectory.

Figure 4.4: Test dataset for distance substitution kernel based SVM classification.

4.3.1 Gaussian RBF kernel

Throughout this section, the misclassification penalty factor C in Eq. 4.5 was set to C = 10.

Fréchet distance

(a) Gaussian RBF-Fréchet kernel, α = 2; the colors indicate the predicted classes. (b) Classification results in the Fréchet kernel space with α = 2; the colors indicate the true classes.

Figure 4.5: Classification results using the Fréchet kernel.

Hausdorff distance

(a) Gaussian RBF-Hausdorff kernel, α = 2; the colors indicate the predicted classes. (b) Classification results in the Hausdorff kernel space with α = 2; the colors indicate the true classes.

Figure 4.6: Classification results using the Hausdorff kernel.

Figures 4.5 and 4.6 confirm our intuition from the previous sections: since the trajectories in the evaluation dataset are confined to the same region, the Hausdorff distance has a hard time classifying them. On the other hand, the Fréchet distance based kernel correctly classifies the trajectories based on their average bearing.

4.4 SVM applied to data streams

Training SVMs is normally limited to predefined, static datasets. As described in Chapter 1, in order to handle data streams whose characteristics change over time, the model has to be able to forget old observations. Here, a simple extension to the regular SVM formulation is proposed that satisfies this requirement; it is shown in Figure 4.7.


By introducing a middle layer consisting of a non-circular buffer with n ≥ 1 elements, the proposed model keeps only the most recent n data frames for the training of the SVM. In this way, the SVM is able to forget data that is more than n observations old. The length of the buffer is not fixed and can be changed at any time, either by adding empty elements after the rightmost element or by removing elements from the right.
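A minimal sketch of this buffered retraining scheme is shown below, using a bounded deque as the middle layer. The buffer length, retraining frequency and kernel choice are illustrative assumptions rather than details fixed in the thesis.

# Sketch of the proposed online SVM layer: keep only the latest n data frames
# and retrain the classifier on the buffered frames whenever a new frame arrives.
from collections import deque
import numpy as np
from sklearn.svm import SVC

class BufferedSVM:
    def __init__(self, n_frames=5, C=10.0):
        self.buffer = deque(maxlen=n_frames)   # oldest frames are dropped automatically
        self.clf = SVC(kernel="rbf", C=C)

    def add_frame(self, features, labels):
        """features: (m, d) array, labels: (m,) array for one data frame."""
        self.buffer.append((features, labels))
        X = np.vstack([f for f, _ in self.buffer])
        y = np.concatenate([l for _, l in self.buffer])
        self.clf.fit(X, y)                     # retrain on the buffered window only

    def predict(self, features):
        return self.clf.predict(features)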

Chapter 5

Parametric trajectory analysis

In addition to the usual relative geometric properties of trajectories, such as speed, direction, etc., additional semantic information can be obtained by also considering the context of the trajectories [33].

One possible way of doing this is to identify interesting spatial regions in the office environment, such as the kitchen or the cubicles of different departments, and map the trajectories to these regions. In other words, a trajectory can also be described as a sequence of regions that it visits. The idea behind this chapter is to model the transitions between such geographic regions. It is assumed that different groups of employees have different transition probabilities between such regions. Markov Chains are readily available to describe and model such state transitions.

Dynamical systems are processes that evolve over time according to some underlying probabilistic law. The Markov Chain (MC) and its variations, such as the Hidden Markov Model (HMM), are among the most powerful yet simple tools for analyzing and modelling stochastic processes [34]. In its most simple formulation, however, the MC is restricted to stationary systems and therefore has key issues when applied to real-life dynamical systems:

• The model may be incomplete at the time of construction and thus sensitive to the initial selection of parameters. Having a predetermined structure and number of parameters may lead to the model describing the inherent noise in the data rather than the underlying dynamics. In statistics, this is often called overfitting.

• Changes in dynamical system behaviour over time should be reflected in the model. Keeping outdated data may negatively impact the ability of the model to classify and identify patterns.

5.1 Introduction to Markov Chains

A discrete-time Markov Chain is a special case of a stochastic process in discrete time and discrete space. At each time instant $t_i$, it is characterized by a sequence of random variables $X(t) = \langle X_1(t), X_2(t), X_3(t), \ldots \rangle$ taking values in some finite set.

Each random variable $X_n$ can be thought of as the state at time $t_n$ of some system which is governed by a set of probabilistic laws. The finite set of values which the random variables assume is called the state space of the stochastic process and is denoted by $S = \langle x_1, x_2, x_3, \ldots \rangle$. The process X(t) is called a simple Markov Chain if it satisfies the first order Markov property.

Definition 11. First order Markov property: Given the present state, the past contains no additional information about the future evolution of the system. Formally, the future state is conditionally independent of the past, given the present state of the system:

$$P(X_{k+1} = x_{k+1} \mid X_1 = x_1, \ldots, X_k = x_k) = P(X_{k+1} = x_{k+1} \mid X_k = x_k) \tag{5.1}$$

Given a sequence of states $s = x_1 x_2 \ldots x_n$ over some state space S, we can characterize the dynamics of the underlying system by counting the number of occurrences of every state transition in the sequence.

Definition 12. Transition count and probability matrices: Let $s_0 = x_1 x_2 \ldots x_n$ be a sequence over some state space S. The transition count and probability matrices $T_C$ and $T_P$ are defined as

$$T_C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \qquad T_P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & p_{22} & \cdots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nn} \end{pmatrix} \tag{5.2}$$

where each $c_{ij}$ represents the number of state transitions $x_i \to x_j$ in the sequence $s_0$. The state transition probabilities are estimated from the transition counts using the Maximum Likelihood method [36]:

$$p_{ij} = \frac{c_{ij}}{\sum_{k=1}^{m} c_{ik}} \tag{5.3}$$

This probability measures how many times the transition $x_i \to x_j$ has occurred, normalized by the number of times the state $x_i$ actually occurred.
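The maximum-likelihood estimate in Eq. 5.3 is straightforward to compute from an observed state sequence. The numpy sketch below illustrates it on a made-up sequence of region labels and also propagates an initial distribution a few steps ahead.

# Illustrative ML estimation of the transition matrices (Eqs. 5.2-5.3) from an
# observed state sequence, plus a k-step prediction of the state distribution.
import numpy as np

states = ["kitchen", "cubicle", "meeting"]                     # made-up region labels
sequence = ["cubicle", "cubicle", "kitchen", "cubicle",
            "meeting", "cubicle", "cubicle"]

index = {s: i for i, s in enumerate(states)}
n = len(states)

T_C = np.zeros((n, n))
for a, b in zip(sequence[:-1], sequence[1:]):
    T_C[index[a], index[b]] += 1                               # count transitions x_i -> x_j

row_sums = T_C.sum(axis=1, keepdims=True)
T_P = np.divide(T_C, row_sums, out=np.zeros_like(T_C), where=row_sums > 0)

p0 = np.array([0.0, 1.0, 0.0])                                 # start in "cubicle"
p3 = p0 @ np.linalg.matrix_power(T_P, 3)                       # distribution after 3 steps
print(T_P)
print(p3)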

Suppose we are given a state transition probability matrix TP and an initial state distribution p of some system. We can think of p as a probability vector whose ith component represents the probability that the system starts in state xi. By the definition of conditional probabilities, the next state distribution after one time step is given by p(xj) = Σxi∈S p(xj|xi) p(xi) ∀xj ∈ S, or in matrix notation pn+1 = pn TP. More generally, by using the definition of the simple Markov model in Eq. 5.1 it can be shown that the state probability distribution after k time steps can be written as

$$p_{n+k} = p_n T_P^{\,k} \quad (5.4)$$


We can thus predict the future behaviour of some group based on its current dynamics. Before introducing the Extensible Markov Model, there are some further concepts that need to be defined.
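A short sketch of this prediction step (with a made-up transition matrix, not data from the thesis) could look as follows; the initial distribution is propagated k steps by repeated multiplication with TP:

import numpy as np

# Hypothetical transition probability matrix and initial state distribution.
T_P = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.6, 0.1],
                [0.2, 0.2, 0.6]])
p0 = np.array([1.0, 0.0, 0.0])   # the system starts in the first state

# State distribution after k time steps: p_{n+k} = p_n * T_P^k (Eq. 5.4).
k = 5
p_k = p0 @ np.linalg.matrix_power(T_P, k)
print(p_k)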

Definition 13. Time homogeneous Markov Chain: The probability of the transitions is independent of the time instant t:

$$P(X_{t+1} = x \mid X_t = y) = P(X_t = x \mid X_{t-1} = y)$$

Time homogeneous Markov Chains are often represented as directed graphs, in which the vertices correspond to the states and the directed edges to the probabilities of going from one state to another.

Another important aspect is whether the Markov Chain is irreducible. We will return to this in the discussion of the EMM.

Definition 14. Irreducible Markov Chain: A Markov Chain in which it is possible to get to any state from any other state in a finite number of transitions.

5.2 Extensible Markov Model

The Extensible Markov Model (EMM) is a combination of an adaptive, first order Markov Chain and a clustering algorithm. The overall structure is shown in Figure 5.1. During a data frame Dk from the sliding window (tk, tk+1), the EMM is characterized by a time homogeneous Markov Chain. The model's ability to adapt to changes is achieved by updating the underlying Markov Chain between subsequent data frames. During these updates the EMM is no longer stationary, but the discontinuities are ignored under the assumption that the computational time needed to build and update the EMM is negligible with respect to the data frame durations.

Each state of the EMM corresponds to a unique cluster. The clustering may be off-line or on-line, meaning that the clusters are either predetermined or dependent on the data.


5.2.1 Updating the Markov Chain

In this section, the update procedures for the EMM are described. To achieve this we define the cumulative and delta transition matrices.

Definition 15. Cumulative transition count matrix: The transition count matrix Ct is based on all previous data frames up to time t.

Definition 16. Delta transition count matrix: The transition count matrix ∆Ct is based solely on the current data frame t.

Adding/removing states

As described previously, the EMM may be incomplete at the time of construction. In order to adapt to possible new behaviour, we need to be able to add or remove states.

To add a new state to the model we augment the transition count matrix and initialize the new row and column to zeros. The new state is given a unique ID to keep track of which state corresponds to which cluster.

Similarly, to remove a state xi, we simply remove its row and column in the transition count matrix [37].
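One way these two operations could be realized (an illustrative sketch, not the thesis implementation) is to pad or delete rows and columns of the count matrix while keeping a list of state IDs in sync:

import numpy as np

def add_state(T_C, state_ids, new_id):
    # Augment the count matrix with a zero row and column for the new state.
    T_C = np.pad(T_C, ((0, 1), (0, 1)), mode="constant")
    state_ids.append(new_id)
    return T_C, state_ids

def remove_state(T_C, state_ids, state_id):
    # Delete the row and column belonging to the state being removed.
    i = state_ids.index(state_id)
    T_C = np.delete(np.delete(T_C, i, axis=0), i, axis=1)
    state_ids.pop(i)
    return T_C, state_ids

T_C = np.zeros((2, 2))
ids = ["kitchen", "desk"]
T_C, ids = add_state(T_C, ids, "conference")
T_C, ids = remove_state(T_C, ids, "kitchen")
print(T_C.shape, ids)   # (2, 2) ['desk', 'conference']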

Fading the model

The primary purpose of the fading step is to reduce the importance of old and possibly obsolete transitions, since current behaviours may not rely on all previous data. Formally, this is accomplished by fading the previous EMM from the open time interval (tk−1, tk) before incorporating the new measurements from the interval (tk, tk+1):

$$T_{k+1}^{C} = T_k^{C} \cdot 2^{-\lambda} \quad (5.5)$$

where λ ≥ 0 is the fading parameter. A negative λ would correspond to a situation where the transition matrix from the previous time interval is more similar to the current behaviour than the current transition matrix is. Such a situation is clearly unrealistic. The effect of fading on the state transition probabilities is given by

$$p_{xy}^{t+1} = \frac{c_{xy}^{t}\, 2^{-\lambda} + \Delta c_{xy}^{t+1}}{\sum_{k} \left( c_{xk}^{t}\, 2^{-\lambda} + \Delta c_{xk}^{t+1} \right)} \quad (5.6)$$

In the formula above, ∆c is the delta transition count matrix that is based solely on the new data frame, and p^{t+1} represents the target cumulative probability transition matrix that describes the dynamics of the system based on both current and historical observations.
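A minimal sketch of this update (illustrative Python with made-up count matrices) fades the cumulative counts by 2^(−λ), adds the delta counts of the new frame and renormalizes the rows, as in Eq. 5.6:

import numpy as np

def fade_and_update(C_cum, C_delta, lam):
    # Fade the historical counts (Eq. 5.5) and add the counts of the new frame.
    C_cum = C_cum * 2.0 ** (-lam) + C_delta
    # Row-wise normalization gives the updated transition probabilities (Eq. 5.6).
    rows = C_cum.sum(axis=1, keepdims=True)
    T_P = np.divide(C_cum, rows, out=np.zeros_like(C_cum), where=rows > 0)
    return C_cum, T_P

C_cum = np.array([[8.0, 2.0], [1.0, 9.0]])      # cumulative counts up to time t
C_delta = np.array([[0.0, 3.0], [2.0, 1.0]])    # counts from the current data frame
C_cum, T_P = fade_and_update(C_cum, C_delta, lam=0.5)
print(T_P)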

Incorporating new sequences

During the current data frame, the new measurements are clustered and the resulting state transitions are counted in the delta transition count matrix ∆C. These counts are then added to the faded cumulative counts, as expressed in Eq. 5.6.


5.2.2 State clustering

The Markov Chain of the EMM makes it possible to model temporal relationships. To extend the EMM to the spatio-temporal domain, a clustering algorithm is needed. Regular clustering techniques such as k-means [38] are not suitable for handling data streams since they assume that the data points are independent and identically distributed. In the case of streaming datasets, however, there is often latent correlation between subsequent data points [39]. In this section we describe the threshold nearest neighbour (tNN) clustering algorithm.

In tNN, the data points are clustered using the nearest neighbour principle. Each cluster is represented by a central vector. For each new data point, the dissimilarities with respect to the existing clusters are computed and the best matching cluster is selected. However, instead of always assigning a new data point to the best matching cluster, a new cluster may be created if the dissimilarity to the best matching cluster is greater than a specified threshold. For spatial clustering, the dissimilarity measure is normally the Euclidean distance.
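A bare-bones version of tNN might look like the sketch below (illustrative only; updating the cluster centre after each assignment, e.g. with a running mean, is omitted for brevity):

import numpy as np

def tnn_assign(point, centres, threshold):
    # Assign the point to the nearest existing cluster, or create a new
    # cluster if the Euclidean dissimilarity exceeds the threshold.
    point = np.asarray(point, dtype=float)
    if centres:
        dists = [np.linalg.norm(point - c) for c in centres]
        best = int(np.argmin(dists))
        if dists[best] <= threshold:
            return best
    centres.append(point)
    return len(centres) - 1

centres = []
stream = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
labels = [tnn_assign(p, centres, threshold=1.0) for p in stream]
print(labels)   # e.g. [0, 0, 1, 1]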

5.3 Estimating the parameters of the EMM

As described in the previous sections, the EMM has two free parameters: the clustering threshold C and the forgetting factor λ.

5.3.1 Clustering threshold

The effect of the clustering threshold on the EMM can be understood as follows. If C is underestimated, the EMM is forced to create a large number of states with relatively small occurrence and transition counts. Eventually, this renders the Markov Chain isotropic, i.e. all states become equally probable. If C is overestimated, on the other hand, the EMM eventually collapses into a single state and loses all of its modelling capability.

In addition to the online tNN clustering described above, there is an alternative technique that also takes semantic information into account when mapping trajectories onto discrete states. Examples of such semantics are stops, which correspond to regions in which the trajectories remain for a certain amount of time, and moves, which correspond to regions in which the average velocity is greater than a threshold [33]. In an office environment, such semantic regions could for example be the cubicles of different departments, the kitchen area, etc.
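As an illustration of such a semantic mapping (a sketch with made-up region names and coordinates, not taken from the thesis), trajectory points can be mapped to predefined regions, yielding the discrete state sequence used by the EMM:

# Hypothetical axis-aligned regions of interest: (x_min, y_min, x_max, y_max).
REGIONS = {
    "kitchen":    (0.0, 0.0, 3.0, 3.0),
    "conference": (0.0, 3.0, 3.0, 6.0),
    "department": (3.0, 0.0, 10.0, 6.0),
}

def region_of(x, y):
    # Return the name of the first region containing the point, or "other".
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return "other"

trajectory = [(1.0, 1.0), (2.5, 2.0), (4.0, 1.0), (4.5, 5.0)]
states = [region_of(x, y) for x, y in trajectory]
print(states)   # ['kitchen', 'kitchen', 'department', 'department']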

5.3.2 Forgetting factor


Proposed Algorithm 1

The idea is to link the relative value of λ to the rate of change in the behaviour of some group over time. It is supposed that in an office environment there exist certain time periods during which the general behaviour within a group of employees changes significantly. Intuitively, such periods could exist for example during morning hours, lunchtime and late afternoons, when employees arrive at or leave the office and thereby change their general behaviour.

During such time periods, the importance of old patterns should be decreased in order to accommodate the EMM to the new patterns of behaviour.

How do we measure the behaviour of a group and assess the similarity of two behaviours? Suppose we are given a dataset containing the geometric trajectories of a group of employees during a time period. By identifying certain static regions of interest in the office environment, the trajectories can be clustered and thus mapped to discrete states that each correspond to a unique cluster. The transitions between the states can in turn be modelled using a Markov Chain. In this way, we can describe the behaviour of a group of employees by either a transition count or a transition probability matrix.

By using count matrices, Eq. 5.5 could be used to estimate λ by simply inverting TkC. However, such estimates of λ are typically noisy and not guaranteed to be greater than zero. Unless stated otherwise, in this algorithm we will use the probability matrix to represent the behaviour because it is by definition always normalized.
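For concreteness, the element-wise inversion this refers to would read (my rendering of the idea, not an equation from the thesis)

$$T_{k+1}^{C} = T_{k}^{C}\, 2^{-\lambda} \;\Longrightarrow\; \lambda = -\log_2\!\left(\frac{T_{k+1}^{C}}{T_{k}^{C}}\right) \;\text{(element-wise)},$$

and whenever a transition count grows between the two frames the ratio exceeds one and the estimate becomes negative, which illustrates why such count-based estimates are noisy and unconstrained in sign.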

To measure the difference in behaviour over time, we use the probability distance between the transition probability matrices of a group during two subsequent time intervals. Two distance functions are presented in Definitions 17 and 18. It is important to notice that in order to use the probability matrices to compare behaviours, the individual rows of the two transition matrices must model the probability distributions of exactly the same spatial regions. This requires predetermined spatial regions that are interesting in the sense that they are good at discriminating between behaviours. In an office environment, such regions could for example be the communal space, department areas, etc.

A probability distance function compares individual probability distributions; in this case, however, we need to compare sets of distributions:

$$P = [p_1(x), p_2(x), \ldots, p_n(x)]^T, \quad p_i(x) \in \mathbb{R}^{1 \times n} \quad (5.7)$$

Under the assumption that the predetermined regions are all equally interesting, the distance between two sets of distributions is computed as the sum of all pointwise distances between the corresponding distributions in the two sets:

$$d(P(x), Q(x)) = \sum_{i=1}^{n} \delta(p_i(x), q_i(x)) \quad (5.8)$$


Definition 17. Total variation distance: Let P and Q be two discrete probability distributions over some finite set Ω. The total variation distance is [40]

$$\delta(P, Q) = \frac{1}{2} \sum_{x} |P(x) - Q(x)| \quad (5.9)$$

Definition 18. Kullback-Leibler divergence: Let P and Q be two discrete probability distributions over some finite set Ω. The Kullback-Leibler divergence is [41]

$$\delta(P, Q) = \sum_{x} \ln\!\left(\frac{P(x)}{Q(x)}\right) P(x) \quad (5.10)$$

Notice that the KL-divergence is not symmetric and thus not a true metric. In the following, we will often use the average symmetric KL-divergence

$$\tilde{\delta}(P, Q) = \frac{\delta(P, Q) + \delta(Q, P)}{2} \quad (5.11)$$
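Both distance functions, together with the set distance of Eq. 5.8, can be sketched in a few lines (illustrative Python; the small constant added before the KL computation to avoid division by zero is an implementation choice, not part of the definitions):

import numpy as np

def total_variation(p, q):
    # Eq. 5.9
    return 0.5 * np.abs(p - q).sum()

def sym_kl(p, q, eps=1e-12):
    # Average symmetric KL divergence (Eqs. 5.10 and 5.11); eps guards log(0).
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def set_distance(P, Q, delta=total_variation):
    # Sum of pointwise distances between corresponding rows (Eq. 5.8).
    return sum(delta(p, q) for p, q in zip(P, Q))

P = np.array([[0.8, 0.2], [0.4, 0.6]])   # behaviour during the previous interval
Q = np.array([[0.5, 0.5], [0.3, 0.7]])   # behaviour during the current interval
print(set_distance(P, Q), set_distance(P, Q, sym_kl))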

Suppose now that we have computed the change of behaviour of the same group over multiple time intervals. There are two possible ways of linking λ to the difference in behaviour:

• Setting λ proportional to the rate of change directly. Although intuitive, this option leads to λ > 0 during the whole experiment which may be too aggressive and lead to overfitting as described in Section 5.6.

• Identifying high-activity regions to set λ. By thresholding the rate of change, we identify coherent time intervals during which the rate of change is greater than a threshold value. This threshold will be referred to as HIGH_ACTIVITY_CONST. The idea is to set λ proportional to the length of such high-activity periods.

Besides being softer than the first option, this also has an interesting interpretation: by Eq. 5.5, λ is inversely proportional to the half life of the elements in the transition count matrix. Thus, the mean life time τ of a transition count that was registered during a high-activity period is inversely proportional to λ:

$$\tau = \frac{c}{\lambda} \quad (5.12)$$


Figure 5.2: Left: Graph of the rate of change of behaviour in a hypothetical dataset. The red line is used as a threshold to identify coherent high-activity periods. Right: During such periods, the value of λ is set inversely proportional to the period length. Outside these periods, λ is set to a small constant value. (Graphs not to scale.)
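A simplified sketch of this thresholding step (following the convention of Figure 5.2, i.e. λ set inversely proportional to the period length; HIGH_ACTIVITY_CONST, the base value and the proportionality constant are assumed parameters, and the rate-of-change series is made up) could look as follows; it is not the Appendix A.2 implementation:

import numpy as np

HIGH_ACTIVITY_CONST = 0.5   # threshold on the rate of change of behaviour
LAMBDA_BASE = 0.01          # small constant lambda outside high-activity periods
C = 1.0                     # proportionality constant in lambda = C / period length

def lambda_schedule(rate_of_change):
    # Mark high-activity time steps and set lambda per coherent period.
    rate = np.asarray(rate_of_change, dtype=float)
    active = rate > HIGH_ACTIVITY_CONST
    lam = np.full(rate.shape, LAMBDA_BASE)
    i = 0
    while i < len(active):
        if active[i]:
            j = i
            while j < len(active) and active[j]:
                j += 1
            lam[i:j] = C / (j - i)   # inversely proportional to the period length
            i = j
        else:
            i += 1
    return lam

rates = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.1]
print(lambda_schedule(rates))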

The implementation of the proposed algorithm is presented in A.2.

Proposed Algorithm 2

The second proposed algorithm is based on the fading equation for the EMM which is repeated here for clarity:

$$p_{xy}^{t+1} = \frac{c_{xy}^{t}\, 2^{-\lambda} + \Delta c_{xy}^{t+1}}{\sum_{k} \left( c_{xk}^{t}\, 2^{-\lambda} + \Delta c_{xk}^{t+1} \right)} \quad (5.13)$$

We will re-use the notion of behaviour from Algorithm 1 above. Clearly, if we were given the target cumulative transition probability matrix p^{t+1}, the previous cumulative transition count matrix c^t and the current delta transition count matrix ∆c^{t+1}, then we could easily estimate λ as in Algorithm 1 by measuring the dissimilarity between sets of distributions and trying to minimize it.

However, all we have are the delta count matrices ∆c^t and ∆c^{t+1}, estimated from the previous and current time intervals, respectively. These matrices correspond to the instantaneous behaviours during the previous and current time windows. The target cumulative transition probability distribution is typically unknown because we have no estimate of λ. Neither do we have behavioural domain knowledge to say, for example, that group A should be in the conference room or at their desks most of the time during the time period 09:00-09:10. Thus, we are left with a chicken-and-egg problem.

The idea is to estimate the target distribution p^{t+1} by computing the delta probability matrix at time t + 2. Notice that this is not exactly what we are looking for, since the time indices differ from those in Eq. 5.13. However, if the sliding windows are sufficiently short, we will assume that the behaviour in two subsequent data batches is approximately the same, that is p^{t+1} ≈ p^{t+2}.
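Under this approximation, λ can be estimated numerically by minimizing the distance between the faded-and-updated transition probabilities of Eq. 5.13 and the delta probability matrix of the next window. A possible sketch (illustrative only, using scipy's bounded scalar minimizer, a total variation set distance and made-up count matrices; not the implementation used in the thesis) is:

import numpy as np
from scipy.optimize import minimize_scalar

def to_probabilities(C):
    rows = C.sum(axis=1, keepdims=True)
    return np.divide(C, rows, out=np.zeros_like(C), where=rows > 0)

def set_tv_distance(P, Q):
    # Sum of row-wise total variation distances (Eqs. 5.8 and 5.9).
    return 0.5 * np.abs(P - Q).sum()

def estimate_lambda(C_cum_t, dC_t1, dC_t2):
    # Approximate the unknown target p^{t+1} by the delta probabilities at t+2.
    target = to_probabilities(dC_t2)
    def objective(lam):
        faded = C_cum_t * 2.0 ** (-lam) + dC_t1   # counts entering Eq. 5.13
        return set_tv_distance(to_probabilities(faded), target)
    result = minimize_scalar(objective, bounds=(0.0, 10.0), method="bounded")
    return result.x

C_cum = np.array([[10.0, 2.0], [3.0, 9.0]])   # cumulative counts up to time t
dC_t1 = np.array([[1.0, 4.0], [2.0, 1.0]])    # delta counts in window t+1
dC_t2 = np.array([[1.0, 5.0], [2.0, 2.0]])    # delta counts in window t+2
print(estimate_lambda(C_cum, dC_t1, dC_t2))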
