Feature Extraction for Anomaly Detection in Maritime Trajectories

(1)

Feature Extraction for Anomaly Detection in

Maritime Trajectories

JOEL SUNDHOLM

Master’s Thesis at CSC Supervisor: Hedvig Kjellström Examiner: Danica Kragic Jensfelt

Project Provider: Saab AB

Supervisor at Project Provider: Rikard Laxhammar

(2)

(3)

Abstract

The operators of a maritime surveillance system are hard pressed to make complete use of the near real-time information flow available today. To assist them in this matter there has been an increasing amount of interest in automated systems for the detection of anomalous trajectories.

Specifically, it has been proposed that the framework of conformal anomaly detection can be used, as it provides the key property of a well-tuned alarm rate. However, in order to get an acceptable precision there is a need to care- fully tailor the nonconformity measure used to determine if a trajectory is anomalous. This also applies to the features that are used by the measure. To contribute to a better understanding of what features are feasible and how the choice of feature space relates to the types of anomalies that can be found we have evaluated a number of features on real maritime trajectory data with simulated anomalies. It is found that none of the tested feature spaces was best for detecting all anomaly types in the test set. While one feature space might be best for detecting one kind of anomaly, another feature space might be better for other anomalies.

There are indications that the best possible nonconformity measure should capture both absolute anomalies, such as an anomalous position, as well as relative anomalies, such as strange turns or stops.

(4)

Extrahering av egenskaper för

anomalidetektion i maritima banor

Referat

För operatörerna av ett marint övervakningssystem är det en omöjlighet att göra en genomgripande analys av all realtidsdata som finns tillgänglig idag. För att assistera dem i deras arbete har det på senare tid utvecklats ett intres- se för system som per automatik kan upptäcka anomaliska skeppsbanor. Specifikt har det föreslagits att ramverket conformal anomaly detection kan användas på grund av att det tillhandahåller en välkalibrerad larmfrekvens vilket är en nyckelegenskap för ett anomalidetektionssystem. För att åstadkomma en acceptabel precision finns det dock ett be- hov att omsorgsfullt skräddarsy den funktion, eller noncon- formity measure, som används för att avgöra om en bana är anomalisk eller ej. Av vikt är också att skräddarsy de features som matas in i funktionen. För att bidra till en bättre förståelse för vilka features som är användbara och hur valet av dem påverkar vilka typer av anomalier vi kan upptäcka har en rad features utvärderats på riktig inspelad sjötrafiksdata med simulerade anomalier. Slutsatserna som kan dras från våra experiment är att ingen feature-rymd är bäst på att upptäcka alla sorters anomalier. Det finns indikationer på att den bästa möjliga feature-rymden bör kunna upptäcka både absoluta anomalier, så som en anomalisk position, såväl som relativa anomalier, det vill säga anomaliska rörelser.

(5)

Acknowledgements

I would like to take this opportunity to thank everyone who has been helpful during the writing of this thesis at Saab Security and Defence Solutions, and CSC, Royal Institute of Technology (KTH). Christopher Brax and Saab Electronic De- fence Systems are thanked for having prepared the dataset that has been used, without which this project would have been much more difficult. Thanks also go to Hedvig Kjellström, my supervisor at KTH, who have been most supportive during this work.

Special thanks go to my supervisor at Saab, Rikard Laxhammar, and all the other members of the datafusion group who made this time both fun and inspiring.

(8)

(9)

Chapter 1

Introduction

Maritime surveillance is an important source of information that can be used to disrupt and prevent dangerous or illegal activity occurring at sea. Examples of such activities are sea traffic violations, commanding a vessel under the influence of alcohol or other drugs, smuggling, and piracy, to name a few. For the detection of such events there is an increasing amount of near real-time information available from radar and the Automatic Identification System (AIS). The use of the latter is now required for all ships of 300 gross tonnage or more engaged on international voyages, cargo ships of 500 gross tonnage or more not engaged on international voyages, and passenger ships irrespective of size [19]. It is also required for all fishing boats greater than 15 meters in length sailing in the jurisdiction of a EU member state [18]. Initially conceived for collision avoidance AIS contains information about location and velocity but also other things that may be of interest in a security application such as identity, destination, cargo, and size. The messages are sent at a variable refresh rate of between 2 seconds and 2 minutes depending on vessel behaviour such that the refresh rate is high when the ship is travelling at high speeds and low when the ship is at anchor [20].

In order for the available information to be useful it must be analysed in a timely manner, meaning that a warning should come early enough as to allow for appropriate actions to be taken. In the case of a ship moving toward shallow water the warning should come soon enough that the ship might be contacted and a grounding averted. This is a challenge for the human operators of the surveillance system who would be much helped by an automated system. There are two general approaches for constructing such an automated system; rule-based and statistical.

While the rule-based approach is conceptually simple and easy to implement, it will only be able to detect events for which a rule has been defined. This does not only have the disadvantage that a great many rules need to be defined and kept up to date through changes in season and legislation or procedure; it will still only be able to cover the events of interest that have first been conceived by a human operator. The statistical or data driven approach only requires historical information to determine if an event deviates significantly from the normal pattern.

(10)

CHAPTER 1. INTRODUCTION

The process is also known as anomaly detection and it is what we will focus on for the rest of the thesis.

While the information available from AIS go beyond the exclusively spatio- temporal data, we will only consider the detection of anomalous trajectories. To succeed at this task we need to break it down. A trajectory is in our case a two- dimensional vessel path on the ocean surface that includes velocity and time data, but this does not tell us what makes a trajectory normal or anomalous. A more thorough discussion on this can be found in Chapters 2 and 3 but in principle what constitutes an anomaly depends on the nonconformity measures we use. The problem is thus to reach an agreement between what we as humans consider an anomaly to be, e.g. a ship about to hit ground, and what the methods do. Furthermore, there are indications that the nonconformity measures by which we find the anomalies may be less important for what we find than the features that we use with them [15]. As will hopefully become clear in Chapter 3, the possible feature spaces are not as explored and evaluated as the possible nonconformity measures, and therefore it is on the feature spaces that the thesis will concentrate.

1.1 Contribution

The contribution of this thesis is a better understanding of what feature space or spaces are best suited for automatic detection of anomalous ship trajectories from AIS data.

1.2 Outline

The report is organized in the following way. Chapter 2 covers the basics required to understand the framework within which the thesis will be done. Chapter 3 covers the contemporary research in maritime trajectory analysis but also related fields such as video surveillance applications and general time-series analysis. In Chapter 4 the developed methods are presented as well as the evaluation framework and the dataset on which the methods are run. Chapter 5 gives a summary and visual representation of the results. It also features a discussion of those results organized per anomaly type we evaluate on. Chapter 6 concludes the report and highlights the most important results and gives a discussion on what work remains to be done in the future. For the interested reader the detailed results of all experiments are provided in Appendix A.

(11)

Chapter 2

Background

This chapter will cover the basic concepts and methods needed to give a context to the remainder of the thesis. It begins by introducing the concept of anomaly detection and mentions some of the methods that have been used for this purpose.

It then goes on to a brief overview of the framework of conformal prediction, which is a prerequisite for the summary of conformal anomaly detection that concludes the chapter.

2.1 Anomaly Detection

Anomaly or outlier detection is the process of finding elements in data that somehow deviate from what is normal or expected. The purpose can be to exclude the outliers from the dataset if for example the assumption is that they have arisen through measurement noise. In the data mining community, however, anomaly detection is often used when the overwhelming majority of data is uninteresting and only a few samples require further attention. Examples of such cases where anomaly detection has seen widespread use are financial transactions, network intrusion detection systems, fault detection systems, and military surveillance [4].

If we want to understand how to detect an anomaly we must first decide what it is we want to detect. While there are several possible definitions, they are for the most part very similar, and the methods used to discover them are usually the same [4]. For the purpose of this thesis we will therefore simply choose to use the following definition which suits our needs. It was given by Hawkins [10] in 1980:

[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.

The definition fits well into the scope of anomaly detection because it does not imply that we know what to expect from an anomaly other than that it should deviate from previous observations. It also stresses the fact that an anomaly should be suspected

(12)

CHAPTER 2. BACKGROUND

to have arisen through a different mechanism than the other observations. In other words, the anomaly is, in some sense of the word, interesting.

It should now apparent that in order to construct an anomaly detector we need to be able to measure how much an observation deviates from other observations.

It might not sound complicated but a large part of the field of machine learning is devoted to solving this task. A multitude of different methods have been developed and they measure everything from similarity to distance to likelihood. Exactly what the method measures is not always important and almost any of these methods could potentially be used for anomaly detection. What is most important for any distance or similarity function within the scope of anomaly detection is that it should adequately capture the differences that we want to detect. What that is can of course vary between applications and even within applications. An example of the latter in maritime surveillance is that an interesting anomaly could be a ship that visits a number of ports in an unusual sequence as this might give rise to suspicions of smuggling. Another interesting anomaly could be a ship that anchors in a forbidden zone. Both of these scenarios are potentially interesting anomalies but it is not clear if they could both be detected using the same method. In the first scenario it might be best to simply look at the sequence of ports that the ship visits, and in the second scenario it might be best to look at a combination of position and velocity vectors. Perhaps the same method of measuring difference can be used, e.g. k-nearest neighbours according to Euclidean distance, but in this case the very features that we choose to consider are different. Other popular methods include different types of clustering and One Class Support Vector Machines [4].

Furthermore our previously stated requirement that anomalies should be detected in a timely manner implies that we need an algorithm that can be used on incomplete trajectories [15].

2.2 Conformal Prediction

A weakness of most machine learning algorithms is that they do not produce any reliable measures of confidence for their predictions [8]. One attempt to rectify this is the framework of conformal prediction which is a way of producing levels of confidence in new predictions while making use of existing algorithms for classification or regression [24, 27]. It was introduced by Vovk et al. who refer to the addition of confidence as hedging the predictions. It is different from other machine learning techniques in that it does not produce a prediction but rather a set of pre- dictions, usually denoted Γ, which contains the true label with probability at least 1 − ∈ (0, 1). The only assumption is that the data is independent and identically distributed [27].

An alternative way of expressing the above would be that if the assumption is true, conformal predictors guarantee that the probability of the true label y_l+1 not being in the prediction set never exceeds , i.e.:

P (y_l+1 ∈ Γ/ ) ≤  (2.1)

(13)

2.2. CONFORMAL PREDICTION

The way that we use existing machine learning algorithms within the framework of conformal prediction is as so called nonconformity measures which essentially are functions that measure a distance or nonconformity score between a new example and training data. This function could for instance measure the Euclidean distance between the new example and its nearest neighbour in the training set but in principle it could be any useful prediction technique from machine learning [8, 27]. If a suitable nonconformity measure has been chosen the conformal prediction algorithm itself is fairly straightforward:

Algorithm 1: The Conformal Predictor

Input: Nonconformity measure A, significance level , training set (z₁, . . . , z_l), set of possible labels Y, and new example x_l+1

Output: Prediction set Γ_l+1 for the true label y_l+1 corresponding to the new example x_l+1

Γ_l+1 = ∅

foreach Y ∈ Y do z_l+1 = (x_l+1, Y ) for i = 1 to l + 1 do

Bi = {z₁, . . . , zl+1} \ z_i α_i= A(B_i, z_i)

end

p^Y = |{i=1,...,l+1:αi≥α_l+1}|

l+1

if p_Y > then Γ_l+1 = Γ_l+1∪ Y end

end

Put in words this means that we estimate the p-value p^Y ∈ (0, 1) for each possible label Y ∈ Y of the new example z_l+1 as the proportion of α’s that are larger or equal to the last α_l+1. We then include in the prediction set Γ_l+1only the labels with p^Y > .

If used as stated above the framework of conformal prediction will give a measure of confidence for the predictions made by existing machine learning techniques but it does not come without a cost. It is required that we calculate the nonconformity scores for each possible label and example every time a new classification is made. If the training set is large and the underlying nonconformity measure is computationally expensive this can be infeasible.

To deal with the high computational complexity of conformal prediction an alternative called inductive conformal prediction has been proposed [8, 27]. Their main drawback is that they have to potentially sacrifice some prediction power in order to reduce the computational complexity but the end result is that most of the computations have to be done only once which means it may well be worth it.

(14)

CHAPTER 2. BACKGROUND

2.3 Conformal Prediction in Anomaly Detection

The framework of conformal prediction has been adopted by Laxhammar [15] in the domain of anomaly detection with the realization that the method’s single parameter

 can be used to tune the alarm rate. In fact, if the method is run in an online mode, then it is guaranteed that the alarm rate will be no greater than , up to statistical fluctuations (see [15] for details). Laxhammar argues that an intuitive way to tune the alarm rate is one of the most important features of an anomaly detector and proposes the following algorithm for a conformal anomaly detector:

Algorithm 2: The Conformal Anomaly Detector

Input: Nonconformity measure A, anomaly threshold , training set (z₁, . . . , zl) and new example z_l+1

Output: Indicator variable Anom_l+1 ∈ {0, 1}

for i = 1 to l + 1 do

αi = A ({z₁, . . . , zl+1} \ z_i, zi}) end

pl+1 = |{i=1,...,l+1:αi≥α_l+1}|

l+1

if p_l+1< then Anom_l+1= 1 else

Anom_l+1= 0 end

Anom_l+1 takes the value 1 if the new example is an anomaly and 0 otherwise.

The nonconformity measure is a function that measures how much an example deviates from the training set, just like in Algorithm 1 in the previous section.

As previously mentioned, Algorithm 1 is computationally expensive, and Algo- rithm 2 is its anomaly detection counterpart. However, just like in Algorithm 1, most of the calculations in Algorithm 2 can be omitted without great loss to classification performance. The modified algorithm is called the inductive conformal anomaly detector [15].

The nonconformity measure is the only design parameter in conformal anomaly detection [15]. Therefore, any potential for improved classification performance within the framework exists in this function and the features that are plugged into it. We will come back to this in Chapter 4.

(15)

Chapter 3

Related Work

There has been an increasing interest for anomaly detection within the maritime domain in recent years. Some authors like Ekman and Holst[6] have interested themselves in higher level abstractions of trajectories, such as representing a trajectory as a series of turns and straight segments, instead of the more traditional way of treating a trajectory as a series of points in space. There is still interest in this approach as Avram[1] demonstrates when he extends the concept to include even more complex behaviour like U-turning and shuttling.

When comparing two trajectories it is common not to compare them in their entirety but rather some subsequence or subtrajectory. In the extreme such an approach will only compare one sample point from the trajectories at a time. Put in other words this means that we will consider states instead of trajectories. If appropriate the state space may be extended by adding velocity and acceleration which may not be present in the raw data but are deducible from a series of position data with timestamps. This way of treating trajectory data will be referred to as point-based in the rest of this thesis. The weakness of such a representation of trajectory data is that it will certainly fail to model more complex vessel behaviour and the upside is that the data will be of comparatively low dimensionality.

This representation has been used by Pallotta et al. [20] who built models of sea lanes using a point-based kernel method. The upside with their approach is that it not only provides the possibility to detect anomalies, but also to predict future motion of vessels that move in a normal pattern. This methodology is described by Gammerman, Vovk and Vapnik as an inductive-deductive method because it first constructs a general rule or model (induction) and later makes a prediction or classification (deduction) [8, 27, 9]. As an alternative to induction-deduction they propose transduction, in which a classification of an example is made directly from previous examples without the middle step of creating a model or rule. Examples of transduction are nearest neighbour based classifiers.

Other time-series methods not directly designed for trajectories include the Dis- cords algorithm proposed by Keogh et al. [12]. In short, the algorithm takes time series data and divides it into all possible subsequences of fixed length m. Each

(16)

CHAPTER 3. RELATED WORK

subsequence is then treated as a feature vector in a high dimensional space and the distance to the nearest non-overlapping neighbour is determined. Keogh et al.

[12] used the Euclidean distance but in principle other functions could be used.

This might not sound very novel but their main contribution is a way of increasing the computational efficiency using heuristics. It is a very intuitive and easy to implement method that manages to produce competitive results on trajectories and might therefore be used for benchmarking and evaluation purposes.

In the field of video surveillance Piciarelli and Foresti [21] believe that while there are countless works on low-level image analysis there remains much to be done in terms of higher level interpretation of scenes. They have proposed an online trajectory clustering method in which the clusters are organized in a tree structure that describes the relations between the clusters. The tree branches have associated probabilities which makes it possible to model future behaviour of the trajectories. The method was tested on real video surveillance data with interesting results.

The most related work to the thesis has been done by Laxhammar whose work will be the foundation upon which we will build. In some of his early work [14] he used a point-based approach with Gaussian Mixture Models and the Expectation- Maximization algorithm. He has since then pointed out some of the weaknesses of parametric solutions in the scope of trajectory anomaly detection [13]. In particular he shows how Gaussian Mixture Models will fail to accurately model sea lanes because the data is far from normally distributed. To then make the assumption that the data is normally distributed on some level by using a Gaussian Mixture Model will cause a rather severe inconsistency. Surprisingly, Laxhammar et al. were unable to show that it would negatively affect the classification performance when compared to kernel based methods which are more capable of capturing the shape of the true distribution. More recently he has identified six key properties that an algorithm for anomaly detection in surveillance applications should possess. Ac- cording to Laxhammar, an algorithm should be: Non-parametric, Parameter-light, Well-calibrated, Sequential detection capable, Unsupervised online learning capable, Multi-scale.

He has proposed two non-parametric nonconformity measures that can operate within the framework of conformal anomaly detection in the maritime domain, the directed Hausdorff distance k-nearest neighbours, and the sequential sub-trajectory local outlier nonconformity measures [15]. He then goes on to claim that the latter fulfils all of the requirements stated above. It is however clear from his evaluations that the methods do not prove to have significantly better classification performance than the Discords algorithm. In addition, if the Discords algorithm was to be adopted as a nonconformity measure in the framework of conformal anomaly detection, it too would meet all the requirements.

Chen et al. [5] has developed a novel distance function which they call Edit Dis- tance on Real sequence (EDR). They assume that only the shape of the trajectory is important and ignore the time components of the data. They claim that their approach is more resistant to noise and outliers than Euclidean distance and dynamic

(17)

time warping (DTW). They argue that this follows from the fact that they quantize the distance between two trajectory sample points to 0 if the Chebyshev distance between them is less than , and 1 otherwise. In addition, they claim that EDR is more accurate than the longest common subsequence (LCSS), a method used and promoted by Vlachos et al. [26] for its good performance on noisy datasets.

Finally, they present a number of pruning techniques that can be used to increase the performance of a k-NN search in a database.

In fact, most authors that deal with time-series data have not interested themselves in the time component. There are, however, exceptions such as van Kreveld and Luo [25] who stress the importance of the time dimension in the computation of similarity between two trajectories. This stands in sharp contrast to the many elastic measures that instead try to make the data time invariant by allowing various degrees of stretching in the time dimension. Van Kreveld and Luo have developed methods for minimizing the Euclidean distance between two trajectories over the time dimension. They do however only find exact solutions to the problem in the case of discrete time shifts.

Kang et al. [11] argue that the division of space into cells often leads to a more natural representation of a trajectory. Their motivation for this is essentially that for a person moving in a building, a room number might be a more useful representation of the person’s position than the (x,y,z) coordinates. To deal with such a representation they propose an extension to the LCSS similarity measure which includes the time spent in each cell. This extended feature space should make it possible to capture some aspects of a trajectory that is overlooked by a standard LCSS implementation. The idea of dividing space into cells has also been used by Rhodes et al. [23] who also discretized the velocity space. They point out that the use of discrete features gives rise to the problem of choosing an appropriate scale and that the optimal scale may vary between regions.

Wang et al. [28] have addressed the fact that in recent years many new similarity measures have been proposed for time series data and that these new methods are often accompanied by specific claims about their usefulness that generally are sup- ported by quantitative experimental observations. However they argue that often these experiments are too narrow in scope by only focusing on demonstrating their newly proposed method’s superiority over some other method on one or a few data sets. To make a more comprehensive experimental comparison they have applied 9 different similarity measures on 38 publicly available datasets. The evaluated methods include Euclidean and Manhattan distance, as well as elastic measures like DTW and LCSS. Their results show that while the best elastic methods perform better than Euclidean distance for small datasets this difference disappears when the amount of data grows larger. In addition to this they also show that despite claims in the literature some of the newer elastic methods fail to perform significantly better than older and simpler ones like DTW.

That the Euclidean distance metric is hard to beat has been further confirmed by Zhang et al. [30], especially when combined with principal component analysis.

However they seem to find DTW too computationally expensive, something that

(18)

CHAPTER 3. RELATED WORK

is often claimed but also disputed in the literature [22]. With various speed-up techniques Wang et al. [28] and Rakthanmanon et al. [22] have shown that DTW can be made computationally feasible.

Zheng et al. [29] have compared three kinds of nonconformity measures for a conformal predictor. These were the k-nearest neighbour, support vector machine, and logistic regression. Their findings agree with several others that have been covered in this chapter. While they were able to show that kNN and SVM performed better than logistic on one dataset, they were unable to show any significant difference between kNN and SVM.

3.1 Remarks on Related Work

The choice of features and nonconformity measure is likely to affect the character of the anomalies we detect [15]. One reason for this can be that a trajectory can be anomalous at one scale but not at another. In that case it may be possible to construct a set of parallel anomaly detectors, where one or more is concerned with relatively local aspects of the trajectories and the others with more global properties, like what ports the vessels visit, in what order, and how well this corresponds to what is declared via AIS. It was however shown in [15] that even if we restrict ourselves to one scale different features will still produce very varying results on the anomalies considered.

The two feature spaces evaluated by [15] were the spatial positions in the trajectory and the velocities. This is fairly uncommon, of all the related works cited in this report very few have explicitly used the time or velocity parts of the trajectory or time series data. Indeed, the elastic methods mentioned are trying to make the data somewhat invariant in the time dimension. However, according to the results of [15] the spatial features seem to be more effective at catching anomalies related to position such as a ship grounding, while the velocity features seem better suited to catch behaviours such as unexpected stopping.

It should be noted that in the case of [15] the feature spaces were evaluated separately and never together. However, for a stop to be unexpected it should occur in a location where it is unusual for ships to stop. If a ship stops in a port or harbour this should be considered normal behaviour while it should be classified as an anomaly if it occurs in the middle of a sea lane. It can therefore be suspected that a combination of spatial and velocity feature spaces can be more effective at catching such anomalies. While it seems unlikely that one specific set of features would clearly outperform all others for all kinds of anomalies, there seems to be a need to explore possible feature spaces further.

(19)

Chapter 4

Feature Extraction from Maritime

Trajectories

As has been mentioned in the previous chapters, the choice of feature space is one of the most important design decisions made in the creation of an anomaly detector. This chapter presents a number of feature spaces developed for use in the nonconformity measure function that was put in its context in Algorithm 1 and more importantly in Algorithm 2. The feature spaces are rescaled and then evaluated on a dataset consisting of real data with synthetic anomalies. In addition to the new feature spaces intended for use in maritime trajectories we evaluate the previously used spatial and velocity feature spaces for comparative purposes. We also evaluate the components of the new features individually to demonstrate their respective information content.

4.1 Feature Spaces

This section describes how a number of feature vectors can be extracted from an AIS trajectory. First of all the trajectory is resampled. It is then divided into sub- trajectories with the sliding window technique. Lastly the various feature spaces are created with simple operations.

4.1.1 Sliding Window and Resampling

To calculate the distance between two feature vectors many functions, such as the Euclidean distance, requires the vectors to be of equal dimensionality. In the case of trajectories this simply means that in order be able to compare two instances they need to have the same number of sample points. However, there are situations in which there is a variable sampling frequency or missing samples and then that has to be dealt with. As described in Chapter 1, AIS data is subject to both these complications. Each feature vector we want to compare must therefore be processed to be of both the same length and the same duration. To achieve this, the data

(20)

CHAPTER 4. FEATURE EXTRACTION FROM MARITIME TRAJECTORIES

is linearly interpolated and then resampled with a constant sampling frequency _dt¹. To make sure that all trajectories we want to compare are of length w we will do the following. Consider a time series

T S = (x₁, x₂, ..., x_N) (4.1) where N is the number of sample points in the series. It can be broken down into several feature vectors of length w : w < N

f₁ =(x₁, x₂, . . . , x_w) f2 =(x₂, x3, . . . , xw+1)

...

f_{N −w} =(x_{N −w}, x_{N −w+1}, . . . , x_N)

(4.2)

This is the sliding window technique, and w is the window size. This approach will allow us to make maximum use of the available data as each sample point will be used w times. In our case the resulting overlapping sub-trajectories will ensure that the trajectory can be re-evaluated once each time step dt. This is a requirement for the timely manner criterion mentioned in Chapters 1 and 2. Because the sub- trajectories within a trajectory are not independent it is important to avoid self matching. In an online setting this can be achieved by adding a trajectory to the training set only after it has terminated. Though it might not be trivial to decide when that is, the following experiments evaluate the features in an offline setting which lets us avoid the problem altogether.

The benefits provided by the methods described above come at a price. We now have introduced two parameters; window size w and time step dt. We will evaluate our feature spaces at five different parameter settings, which will be the exact same as in [15].

4.1.2 Spatial, Velocity, Speed, and Combination Features

Most of the data mining on trajectories seems to be performed with purely spatial feature vectors

fspa = (x₁, y1, x2, y2, . . . , xw, yw) (4.3) where each pair x_i, yi represents a position sample from the trajectory of length w. For reference this approach will be evaluated also in this work. While this is the dominant approach in the literature some authors have investigated the use of very high level features such as turning and shuttling behaviour and some have investigated more low level features than only spatial ones.

Specifically it was found by Laxhammar [15] that anomaly detectors can perform very differently when using velocity

f_vel= (v_x₁, v_y₁, v_x₂, v_y₂, . . . , v_x_w, v_y_w) (4.4)

(21)

4.1. FEATURE SPACES

f

1

f

2

f

3

f

4

f

5

f

6

Figure 4.1. Image illustrating how the velocity and combination features were extracted. The first element of the trajectory is excluded (red cross) and the average velocity between two samples (green arrow) is associated with the position of the last of the two. fiis thus all the features associated with index i, i.e. xi, yi, vx_i, vy_i, where vx_i is really the average x component of the speed between i − 1 and i.

instead of spatial features. The latter performed better on some anomalies and the previous on others. The velocities were extracted from the spatial data by calculating the mean velocity between two subsequent positions. This reduces the length of the entire trajectory by one sample point.

To explore this further and try to find a feature space that captures the best from both spatial and velocity spaces they will be combined. As has been previously mentioned, this might prove an effective way to discover for instance unexpected stops. This new feature space

f_spavel= (x₁, y1, vx1, vy1, x2, y2, vx2, vy2, . . . , xw, yw, vxw, vyw) (4.5) will have double the dimensionality of its two components which makes it four dimensional per sample point. Like before, each velocity was extracted from the spatial data by calculating the mean velocity between two subsequent positions. It was then associated with the latter of the two positions. The procedure is illustrated in Figure 4.1.

To add to the understanding of the importance of the speed and course components of the velocity feature, they will be evaluated separately. That is, we will

(22)

construct a feature vector f_spe= ((v_x²

1 + v_y²

1)^1/2, (v_x²₂ + v_y²

2)^1/2, . . . , (v_x²_w + v_y²

w)^1/2) (4.6) containing the speed only. For the course we will do something similar, but this is covered in the next section.

In the case of the speed, it will also be used in combination with the spatial features to form a feature space of three dimensions per sample point. That is, the feature space will be:

fspaspe= (x₁, y1, (v_x²₁ + v_y²₁)^1/2, x2, y2, (v_x²₂+ v_y²₂)^1/2, . . . , xw, yw, (v²_x_w+ v²_y_w)^1/2).

(4.7) The last feature space we will consider is the same as the spatial, but with the mean position subtracted from it. This creates a relative feature space which is conceptually similar to the velocity space but may perform differently. The last feature space is thus:

f_relspa= (x₁− ¯x, y₁− ¯y, x₂− ¯x, y₂− ¯y, . . . , x_w− ¯x, y_w− ¯y, ) (4.8) where

x =¯ x₁+ x₂+ · · · + x_w

w , ¯y = y₁+ y₂+ · · · + y_w

w .

4.1.3 Course Feature

Along with the feature spaces previously mentioned a course only space will be evaluated. While there may be few real world situations in which this feature by itself is able to distinguish an anomalous trajectory it is included for the sake of completeness. To avoid having to deal with the peculiarities concerning the periodicity of the course during resampling the velocities of the craft will be used for internal representation and the distance function will be altered instead. The feature vectors will thus be the same as f_vel from equation 4.4.

There is more than one distance function that can be used for the purpose of measuring angular distance between vectors. One example is the cosine distance which can be defined as one minus the cosine similarity:

d1(f₁, f2) = (1 − similarity) = 1 − cos(θ) = 1 − f1· f₂

||f₁||||f₂|| (4.9) where θ is the angle between the vectors f₁ and f₂. This distance measure will produce a result of 0 for identical vectors and 2 for anti parallel vectors.

The obvious drawback of the above measure is that it is not totally invariant in speed. However this is not a big problem as it can be fixed with a small modification after which we arrive at the following form:

d2(f₁, f2) = w −

w

X

i=1

cos(θ_i) = w −

w

X

i=1

ϕ_1i· ϕ_2i

||ϕ_1i||||ϕ_2i|| (4.10)

(23)

4.2. FEATURE SCALING

where i is the index of each pair of corresponding Cartesian velocity samples in the compared trajectories, i.e. ϕ_1icorresponds to the i:th pair in f₁, a two dimensional vector on the form (v_x_i, vyi). This distance measure will produce a result of 0 for identical vectors and 2w for anti parallel vectors.

Another possibility is to simply add up the angle between each sample as follows:

d3(f₁, f2) =

w

X

i=1

θi. (4.11)

The last possibility considered in this work is to transform all velocity vectors to unit length and take the Euclidean distance between them. The upside with this approach is that it should work well with common indexing methods for fast nearest neighbour searching.

4.2 Feature Scaling

When we create a feature space of mixed physical quantities, such as a combination of position and velocity, it is important that we apply some sort of normalization.

If we skip this step it is likely that one quantity will dominate the others to such a degree that they might as well be omitted entirely. In the dataset used for this work the standard deviations in position space as measured in meters is about three orders of magnitude larger than the standard deviations in velocity space as measured in meters per second. For the purpose of this work the chosen normalization technique is standardization or z-normalization which essentially is the calculation of standard scores or z-values. In short, this is a measure of how far from the mean each data point lies in terms of standard deviations. The standard score of a value x is

z = x − µ

σ (4.12)

where µ and σ are the mean and standard deviation of the population, respec- tively. This is a signed and dimensionless quantity. A negative score thus indicates that a value lies below the mean of the population. When the entire dataset has been standardized it will have unit variance owing to the division by the standard deviation.

Note that in our case the mean and standard deviation of the population should be understood as the mean and standard deviation of the training set. This may differ significantly from the mean and standard deviation of the true distribution if the training set size is small, such as it initially may be if an anomaly detector is started in an online learning mode in a new location. The error will decrease with growing training set size but it is important to remember to keep the mean and standard deviation up to date.

(24)

4.3 Evaluation

To evaluate the different feature spaces the evaluation framework in [15] will be used. The obvious benefit is that because the same evaluation has been performed on the same dataset, the results will be comparable.

4.3.1 Algorithm and Implementation

The Discords algorithm, originally proposed by Keogh et al. [12], was used by Laxhammar [15] for comparison with his own nonconformity measures. It was found to have competitive performance on the dataset used. Owing to this and the relative simplicity of the algorithm, it will be used as our nonconformity measure for evaluating the different feature spaces from Section 4.1. The implementation of the Discords algorithm is basically the same as in [15]. To a large extent the same code was used but some functionality was added to allow for the usage of all the feature spaces described in Section 4.1. As in [15] the nearest neighbour searches were implemented using a kD-tree to speed up the search time. The naïve implementation of the Discords algorithm is outlined below:

Algorithm 3: The Discords Nonconformity Measure

Input: Training set (z₁, . . . , z_l), new example z_l+1, and window length w.

Output: Nonconformity score α.

α = −∞

for k = 1 to length(z_l+1) − w do a = (z_i,k, z_i,k+1, . . . , z_i,k+w)

nearestN eighbourDistance = ∞ for i = 1 to l + 1 do for j = 1 to length(z_i) − w do

b = (z_i,j, z_i,j+1, . . . , z_i,j+w) tmp = distE(a, b)

if tmp < nearestN eighbourDistance then nearestN eighbourDistance = tmp end

end end

if nearestN eighbourDistance > α then α = nearestN eighbourDistance end

end

where dist_E is the Euclidean distance function. In words this means that we find the subtrajectory of the new example z_l+1 that has the longest distance to its nearest subtrajectory in the training set (z₁, . . . , zl). This distance is then used as the nonconformity score α of z_l+1.

(25)

4.3. EVALUATION

4.3.2 Performance Measures

The performance measures chosen by Laxhammar are the area under receiver operat- ing characteristics (ROC) curve (AUC), the partial area under ROC curve (pAUC), as well as the sensitivity, also known as true positive rate (tpr), at the false positive rate (f pr) of 0.01.

The tpr is equal to the number of correctly classified positive examples divided by the total number of positive examples. In our case that means that the tpr is the number of correctly classified anomalies divided by the total number of anomalies.

The f pr is the number of incorrectly classified negative examples, i.e. false alarms, divided by the total number of negative examples.

The ROC curve is conceptually a plot of the tpr against the f pr for all possible classification threshold settings [7]. It is calculated by first determining the nonconformity scores of a list of labelled examples. The list is then sorted according to the nonconformity scores. Finally, the list is iterated through, counting the number of true and false positives at the threshold corresponding to the current nonconfor- mity score. From those numbers N pairs of tpr and f pr can be determined, where N is the number of labelled examples in the list. The greatest advantage of the algorithm outlined above is that it runs in linear time because it does not, like the naïve approach, require the tpr and f pr to be calculated at every possible threshold setting. Examples of ROC curves can be seen in Figure 4.2. One advantage of this measure is that it circumvents the need to choose a threshold to evaluate at. It may even be used to find a suitable threshold setting.

A disadvantage of the ROC curve is that it is not possible to directly and quantitatively compare two curves to each other to determine which is best. The AUC is an attempt to make this possible by assigning each ROC curve a scalar number, namely the area under it. Because the tpr and f pr both lie on the interval [0, 1], the AUC will also be in the interval [0, 1]. As can be seen in Figure 4.2, a random classifier will have a ROC curve the shape of a straight line with slope 1 which cuts through the origin. A consequence of this is that the AUC of a random classifier will be 0.5. The value of the AUC is determined by numerical integration.

In cases where it is of paramount importance that the f pr be kept low, a large portion of the AUC curve is of little interest. Because an anomaly per definition is an unusual occurrence it belongs to this category. If f pr is too large the precision, or ratio of correctly classified examples will become very low. To relate to the real world it would mean that a large fraction of the alarms given by the surveillance system would be false or unwanted. This could easily cause the operator to stop caring about the alarms or turn off the system altogether.

The pAUC has been proposed to better capture the characteristic of the curve in the interesting range, e.g. f pr ∈ (0, 0.05) [16, 17]. A situation in which pAUC might be preferable to AUC is illustrated in Figure 4.2. In this work, as in [15], the pAUC is rescaled to the interval [0, 1]. This should not be confused with the standardized pAUC that is sometimes mentioned in the literature [16, 17], that from the AUC retains the property that a random classifier has a value of 0.5. For the

(26)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

fpr

tpr So lid RO C C urve

Do tted RO C C urve fpr = 0 .0 5

Ra ndo m Perfo rm a nce

Figure 4.2. Illustration of how a ROC curve (solid) can be superior to another (dotted) in the interesting region, here defined as fpr ∈ (0, 0.05), even though it has a lower AUC value.

pAUC used in this work and [15], the value for a random classifier will vary with the chosen interval. Specifically, for f pr = 0.01, it will take on a value of 0.005.

4.4 Dataset

For the evaluation of the feature spaces a dataset produced by Brax [3] will be used.

It was created for the evaluation of anomaly detectors in the maritime domain and the same dataset has since been used by Laxhammar [15] for the same purpose.

Here follows a review of the dataset and the preprocessing that has been applied to it by the two authors mentioned above. For a more detailed description of the preprocessing steps the reader is referred to the original sources.

4.4.1 Preprocessing Performed by Brax and Laxhammar

The dataset is derived from approximately 57 million raw AIS reports that were obtained through 50 days of continuous recording from the 2nd February 2010 to the 6th April 2010. A number of preprocessing steps have been applied to

(27)

4.4. DATASET

Figure 4.3. Area in which the data was recorded. The image was exported from Google Earth by Brax [3]( ©2010 Tele Atlas, ©2010 Europa Technologies, ©2010 Google, Data SIO, NOAA, U.S. Navy, NGA, GEBCO). Reproduced from Brax [3]

with permission.

the raw data. An area well within AIS coverage was selected and all the reports originating from outside this area were discarded, as were duplicates and reports from vessels belonging to ship types other than cargo, tanker and passenger. The area includes the port of Gothenburg with surrounding waters and can be seen in Figure 4.3. Reports belonging to the same vessel were grouped together using the unique identification number present in each report. To reduce the size of the data each trajectory was resampled every 100 meters in case the speed was greater than 0.01 knots or every 60 seconds if the speed was less than 0.01 knots. In the case of two consecutive AIS reports having a time difference greater than 30 minutes the trajectory was split in two.

The resulting dataset consists of 8867 trajectories that were split into a training set of size 7980 and an evaluation set size of 887. Two labelled test sets, normal and abnormal, each of size 604 were created using the evaluation set. In this work only a subset of the original abnormal set will be used. The reason for this is that the other

(28)

anomaly types of the original set involves multiclass anomaly detection. This falls outside the scope of the evaluation of feature spaces. The total size of the abnormal set is therefore reduced to 304 trajectories. The normal test set was obtained by random sampling without replacement form the evaluation set. The abnormal test set and the normal test set can be combined to form five different anomaly test sets, each containing 604 normal trajectories, and either 2 or 100 abnormal trajectories.

In other words, we put one anomaly type in each anomaly test set.

The abnormal test set trajectories were created in collaboration with subject experts in order to make the trajectories both interesting from an operators per- spective as well as realistic. A short description of the five anomaly types present in the abnormal test set can be found below.

4.4.2 Anomaly Types

This is a list with short descriptions of the anomaly types used in the evaluation of the methods described in Section 4.1. Figures 4.4 - 4.8 show the anomalous trajectories plotted against a background of normal traffic trajectories.

1. The circle and land anomaly is based on a real incident in which a ship ran in circles before it eventually ran aground. Thinkable underlying mechanisms for this anomaly include sleeping at the helm and malfunctioning rudder. The anomalies were created by hand, which is a very tedious process, and therefore only two instances were created.

2. The missed turn anomaly is an interpretation of the case of when a ship is following a shipping lane and then fails to turn and is heading towards shallow water. Like the above this anomaly class was manually constructed and only two instances exist. They are located in risk areas where, according to domain experts, ships may miss a turn.

3. The unexpected stop anomaly reflects a situation where a ship comes to a stop for approximately 50 minutes before resuming its route. These anomalies were automatically generated and 100 instances exist.

4. The unusual speed anomaly reflects the situation where a ship suddenly starts to move very slowly or fast. The anomalies were created by randomly selecting a sub-trajectory of length 50 and increase or decrease the speed within the sub-trajectory by 90 % with equal probability. 100 instances of this anomaly type were automatically generated.

5. The strange manoeuvres anomaly corresponds to a sudden streak of erratic course changes before the ship resumes its normal route. It was created by translating each point of a randomly selected sub-trajectory a distance ∆d perpendicular to its current course. ∆d was uniformly sampled from the interval [−500, 500]. The durations of the anomalies were generated by an

(29)

4.4. DATASET

unknown unimodal distribution with mean 11 minutes and standard deviation 3.7.

(30)

−5 −4 −3 −2 −1 0 1

x 10⁴ 6.36

6.37 6.38 6.39 6.4 6.41 6.42x 10⁶

−8000 −6000 −4000 −2000 0 2000

6.386 6.388 6.39 6.392 6.394 6.396

x 10⁶

Figure 4.4. Overview (top) and zoomed-in view (bottom) of the two circle and land anomalies, where red and green correspond to the abnormal and normal segments, respectively, and the blue trajectories correspond to the training set. The lengths of the top and bottom abnormal segments are 14 and 38 minutes, respectively. Figure with caption reproduced from Laxhammar [15] with permission.

(31)

4.4. DATASET

−5 −4 −3 −2 −1 0 1

x 10⁴ 6.36

6.37 6.38 6.39 6.4 6.41 6.42x 10⁶

−10000 −8000 −6000 −4000 −2000 0 2000

6.382 6.384 6.386 6.388 6.39 6.392 6.394 6.396

x 10⁶

Figure 4.5. Overview (top) and zoomed-in view (bottom) of the two missed turn anomalies, where red and green correspond to the abnormal and normal segments, respectively, and the blue trajectories correspond to the training set. The lengths of the left and right abnormal segments are 8 and 15 minutes respectively. Figure with caption reproduced from Laxhammar [15] with permission.

(32)

−5 −4 −3 −2 −1 0 1

x 10⁴ 6.36

6.37 6.38 6.39 6.4 6.41 6.42x 10⁶

Figure 4.6. Plot of the 100 anomalous trajectories involved in unexpected stops. red and green correspond to the unexpected stops and the normal segments, respectively, and the blue trajectories correspond to the full training set. Figure with caption reproduced from Laxhammar [15] with permission.

(33)

4.4. DATASET

−5 −4 −3 −2 −1 0 1

x 10⁴ 6.36

6.37 6.38 6.39 6.4 6.41 6.42x 10⁶

−2.4 −2.2 −2 −1.8 −1.6 −1.4

x 10⁴ 6.361

6.362 6.363 6.364 6.365 6.366 6.367 6.368 6.369 6.37 6.371

x 10⁶

Figure 4.7. Overview (top) and zoomed-in view (bottom) of two arbitrarily selected trajectories from the set of 100 unusual speed anomalies. Red and green correspond to the abnormal and normal segments, respectively, and the blue trajectories correspond to the full training set. In order to visualise the relative speed of the normal and abnormal segments, the two trajectories have been uniformly sampled in time and the velocity vectors between successive points plotted. Note that the left and right abnormal segments correspond to high-speed and low-speed segments, respectively, which is indicated by the length of the velocity vectors. Figure with caption reproduced from Laxhammar [15] with permission.

(34)

−5 −4 −3 −2 −1 0 1

x 10⁴ 6.36

6.37 6.38 6.39 6.4 6.41 6.42x 10⁶

−2.4 −2.2 −2 −1.8 −1.6 −1.4

x 10⁴ 6.362

6.364 6.366 6.368 6.37 6.372 6.374

x 10⁶

Figure 4.8. Overview (top) and zoomed-in view (bottom) of an arbitrarily selected trajectory from the set of 100 strange manoeuvre anomalies. Red and green correspond to the abnormal and normal segments, respectively, and the blue trajectories correspond to the full training set. Figure with caption reproduced from Laxhammar [15] with permission.

(35)

Chapter 5

Results and Discussion

This chapter will present and interpret the results that were obtained through the experiments described in the previous chapter. It begins with a summary of the most important results along with a few illustrative figures and then goes on to a short discussion for each anomaly type.

5.1 Summary of Results

This section is a summary of the results from the evaluation of the proposed feature spaces (Section 4.1) on the dataset described in Section 4.4. In Tables 5.1-5.4 the results of all feature spaces averaged over all anomalies are presented. For a more exhaustive presentation of the results over each individual anomaly type, the reader is referred to Appendix A.

In the case of the course only feature all the distance functions mentioned in Section 4.1.3 were evaluated with very similar results. Because they did not prove to be competitive only the results obtained with the distance function given in equation 4.10 are presented.

In Figure 5.1, the AUC measured when the Discords algorithm was run on the five different anomaly test sets is presented in a visually more appealing way. The results for pAUC is presented in an identical fashion in Figure 5.2. The results in the figures are for window length w = 11, and time step dt = 30. What motivated that particular parameter choice was that it performed best, averaged over all anomalies and feature spaces in terms of both AUC and pAUC, although closely followed by w = 3, dt = 30.

Finally, all feature spaces have been evaluated with the parameter setting with which they on average performed best. The results are presented in Figure 5.3 and the exact results and parameter settings can be found in Table 5.5.

(36)

CHAPTER 5. RESULTS AND DISCUSSION

Table 5.1. Mean AUC

spa vel spavel spaspe spe dir relspa w = 3, dt = 30 0.836 0.928 0.960 0.961 0.901 0.721 0.926 w = 11, dt = 30 0.879 0.937 0.959 0.927 0.890 0.721 0.922 w = 11, dt = 60 0.863 0.919 0.952 0.905 0.874 0.751 0.907 w = 11, dt = 120 0.843 0.893 0.927 0.885 0.884 0.740 0.866 w = 11, dt = 180 0.799 0.814 0.829 0.824 0.801 0.688 0.876

Table 5.2. Mean Partial AUC for fpr ∈ (0, 0.01)

Table 5.3. Mean Sensitivity (tpr) for fpr = 0.01

Table 5.4. Average Mean detection delay (seconds) for fpr = 0.01

spa vel spavel spaspe spe dir relspa

w = 3, dt = 30 338 188 90 277 314 1561 247

w = 11, dt = 30 2076 151 287 225 211 -888 180 w = 11, dt = 60 1332 407 340 382 234 454 549 w = 11, dt = 120 1569 767 697 471 485 8611 817 w = 11, dt = 180 1456 1340 1481 621 559 1120 1265

5.2 Anomaly Types Revisited

This section will go through each anomaly type and highlight the most interesting parts of the results that were presented in the previous section. Not only the best results are considered interesting, but also those that do not agree well with what was expected.

Feature Extraction for Anomaly Detection in Maritime Trajectories