Identification of Fundamental Driving Scenarios Using Unsupervised Machine Learning


DEEPIKA ANANTHA PADMANABAN

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master’s Programme, Machine Learning, 120 credits

Date: December 4, 2020

Supervisor: Florian Pokorny

Examiner: Pawel Herman

Industry Supervisors: Majid Khorsand Vakilzadeh & Ghazaleh Panahandeh

School of Electrical Engineering and Computer Science

Host company: Zenseact AB

Swedish title: Identifiering av grundläggande körscenarier med icke-guidad maskininlärning


Abstract

A challenge in releasing autonomous vehicles onto public roads is the safety verification of the developed features. Verifying safety through test driving alone is not practically feasible, as the acceptance criterion is driving at least 2.1 billion kilometers [1]. An alternative to this distance-based testing is the scenario-based approach, where the intelligent vehicles are exposed to known scenarios. Identifying such scenarios from driving data is crucial for this validation. The aim of this thesis is to investigate the possibility of unsupervised identification of driving scenarios from driving data.

The task is performed in two major parts. The first is the segmentation of the time series driving data by detecting changepoints, followed by the clustering of the previously obtained segments. Time-series segmentation is approached using a Deep Learning method, while the second task is performed using time series clustering. The work also includes a visual approach for validating the time-series segmentation, followed by a quantitative measure of the performance. The approach is also qualitatively compared against a Bayesian Nonparametric approach to identify the usefulness of the proposed method. Based on the analysis of results, there is a discussion about the usefulness and drawbacks of the method, followed by the scope for future research.

Keywords: Time-series Segmentation, Time-series Clustering, Stacked Sparse Autoencoders, Unsupervised Learning, Autonomous Driving, Feature Extraction

Summary (Sammanfattning)

A challenge in releasing autonomous vehicles onto public roads is the safety verification of the developed functions. Safety testing of vehicles is not practically feasible, since the acceptance criterion is driving at least 2.1 billion kilometers [1]. An alternative to this distance-based testing is the scenario-based approach, in which intelligent vehicles are exposed to known scenarios. Identifying such scenarios from driving data is crucial for this validation. The purpose of this thesis is to investigate the possibility of unsupervised identification of driving scenarios from driving data.

The task is carried out in two main parts. The first is the segmentation of the time-series driving data by detecting changepoints, followed by clustering of the previously obtained segments. Time-series segmentation is approached with a Deep Learning method, while the second task is performed using time-series clustering. The work also includes a visual approach for validating the time-series segmentation, followed by a quantitative measure of performance. The approach is also compared with a Bayesian nonparametric method to identify the usefulness of the proposed method. Based on the analysis of the results, the usefulness and drawbacks of the method are discussed, followed by the scope for future research.

Keywords: Time-series segmentation, Time-series clustering, Stacked autoencoders

Acknowledgments

This thesis work was carried out at Zenseact AB, Gothenburg. I would like to thank my industrial supervisors Majid Khorsand Vakilzadeh and Ghazaleh Panahandeh for the weekly discussions and their guidance throughout the project. I would also like to thank the team at Zenseact AB for all the support related to data and GPU access. I extend my thanks to my academic supervisor Dr. Florian Pokorny and his research group for the weekly group discussions and feedback throughout the work.

Stockholm, December 2020

Deepika Anantha Padmanaban

Contents

1 Introduction
    1.1 Problem
    1.2 Research Relevance
    1.3 Approach
    1.4 Delimitation
    1.5 Thesis Outline
2 Background
    2.1 Time-series Analysis
        2.1.1 Time-series Segmentation
        2.1.2 Time-series Clustering
    2.2 Unsupervised Learning
        2.2.1 Feature Extraction
    2.3 Autoencoders
        2.3.1 Traditional Autoencoders
        2.3.2 Sparse Autoencoders (SAE)
        2.3.3 Stacked Sparse Autoencoders (SSAE)
    2.4 Clustering
        2.4.1 k-means
        2.4.2 Clustering Metrics
            2.4.2.1 Inertia
            2.4.2.2 Silhouette Coefficient
            2.4.2.3 Calinski-Harabasz Index
            2.4.2.4 Davies-Bouldin Index
    2.5 Related Work
        2.5.1 Rule-based Approach
        2.5.2 Bayesian Nonparametric Approaches
        2.5.3 Segmentation and Clustering
        2.5.4 Automatic Feature Learning
3 Method
    3.1 Data Acquisition
        3.1.1 Data Source
        3.1.2 Feature Selection
    3.2 Data Preprocessing
    3.3 Time-series Segmentation
        3.3.1 Data Normalization
        3.3.2 Data Windowing
        3.3.3 Data Loaders
        3.3.4 Feature Extraction using SSAE
        3.3.5 Changepoint Detection & Segment Extraction
        3.3.6 Evaluation Methods and Metrics
    3.4 Time-series Clustering
        3.4.1 Statistical Feature based Clustering
        3.4.2 Evaluation Methods and Metrics
    3.5 Bayesian Nonparametric Approach
4 Results and Analysis
    4.1 Time-series Segmentation
        4.1.1 SSAE Training
        4.1.2 SSAE Reconstruction
        4.1.3 Segmentation
            4.1.3.1 Visual Approach - Qualitative Results
            4.1.3.2 Changepoints and False Alarms
        4.1.4 Quantitative Results
    4.2 Time-series Clustering
        4.2.1 k-means on Segment Features
            4.2.1.1 Visual Approach - Qualitative Analysis
            4.2.1.2 Clustered Data-points in Input Space
            4.2.1.3 Profiles of Fundamental Driving Patterns
            4.2.1.4 Stability of Identified Clusters
            4.2.1.5 Identified Sub-scenarios
            4.2.1.6 Quantification of Cluster Labelling
    4.3 Bayesian Nonparametric Approach
    4.4 Qualitative Comparative Analysis
5 Discussion
    5.1 Segmentation
    5.3 Qualitative Comparison with Hierarchical Dirichlet Process - Hidden semi-Markov model (HDP-HSMM)
    5.4 Reflections
        5.4.1 Industrial Reflection
        5.4.2 Ethics, Sustainability and Social Impact
6 Conclusion
References
A Additional Results
    A.1 Time-series Segmentation

List of acronyms and abbreviations

ADS         Automated Driving System
BCE         Binary Cross Entropy
CAN         Controller Area Network
DSAE        Deep Stacked Autoencoders
FNR         False Negative Rate
FPR         False Positive Rate
GPS         Global Positioning System
GRU         Gated Recurrent Unit
HDP         Hierarchical Dirichlet Process
HDP-HSMM    Hierarchical Dirichlet Process - Hidden semi-Markov model
HMM         hidden Markov model
HSMM        hidden semi-Markov model
KL          Kullback-Leibler
LiDAR       Light Detection And Ranging
LSTM        Long Short Term Memory
MSE         Mean Squared Error
SAE         Sparse Autoencoders
SGD         Stochastic Gradient Descent
SSAE        Stacked Sparse Autoencoders
tanh        hyperbolic tangent

Introduction

1.1 Problem

One of the most interesting challenges in releasing autonomous vehicles onto public roads is the safety verification of their features. For an autonomous vehicle to be considered safe, it must perform on par with or better than a human driver. Quantitatively, this means test driving at least 2.1 billion km before reporting any fatality [1]. This criterion makes it practically infeasible to validate higher levels of automation using distance-based validation. A sensible way of approaching this problem is to simulate test scenarios to which the vehicles are exposed, and to verify their behavior against the expected behavior. Such a test methodology is called scenario-based testing [2].

The fundamental requirement for a scenario-based test is the identification of test scenarios. Rule-based segmentation of time-series driving data is the current industry standard for scenario extraction [3]. Although the rule-based approach works in a lower-level setup, such as basic scenario extraction from the ego vehicle, its downside becomes evident when segmenting data from a complex traffic environment, since it becomes difficult to set the rules for segmentation. Previous works identify fundamental scenarios in motorway driving by defining combinatory vehicle actions in a multi-vehicle interactive environment [3]. This restricts the scenarios identified to human-defined actions that have an obvious physical interpretation. The disadvantage of this approach is the possibility of missing scenarios that do not fit any of these human-identified rules. Such scenarios can be crucial for testing the safety of highly automated vehicles, which makes it necessary to exhaustively find all possible driving scenarios.

1.2 Research Relevance

The main aim of this work is to reduce human intervention in the process of scenario-based testing. A rule-based approach relies largely on rules defined by human experts, but there is a limit to human comprehension of possible driving scenarios. This limitation may lead to certain scenarios not being identified from the driving data. Rule formulation is a real problem in a multi-vehicle interactive traffic situation due to the complex interactions among the vehicles, especially in an urban or rural driving environment. Currently, rule-based scenario identification is prevalent for identifying scenarios in a motorway driving environment [3]. However, that study does not investigate scenario identification in an urban or rural environment, where the increasing complexity of the data means that more rules must be formulated to cover all possible combinations of scenarios. Furthermore, these sets of rules are limited to the experts’ knowledge and driving experience, so there is always a danger that experts cannot come up with a comprehensive set of rules, which in turn hinders the identification of a subset of scenarios from the driving data.

To overcome these problems with rule-based scenario extraction, this thesis tries to automate the task of scenario identification by detecting changepoints in the data. Since the analysis of a multi-vehicle traffic environment is a complex task due to the multi-dimensional actions involved, this thesis is a precursor to that main investigation: it restricts its scope to a primary analysis involving only the test vehicle (or ego vehicle), where the results can be validated easily. This thesis aims to study the capability of a feature-extraction-based unsupervised time-series segmentation method to identify the fundamental driving scenarios, and to compare the results qualitatively with HDP-HSMM, an existing Bayesian nonparametric approach.

1.3 Approach

Driving behaviour is usually measured in vehicles using various sensors, such as the Controller Area Network (CAN), the Global Positioning System (GPS), etc., from which signals such as velocity, acceleration, longitudinal and latitudinal positions, revolutions per minute, steering angle, yaw, pitch, roll, throttle opening rate, brake cylinder pressure, etc. are measured. Driving data is hence usually a multivariate time series.

the driving data collected from the ego vehicle, the approach was based on the well-known research topic of multivariate time-series segmentation. The fundamental idea is to segment the available time-series data by detecting the changepoints. Changepoint detection is the process of finding distinct sequences of values associated with states that are not directly observable. Traditionally, segmentation of time-series data is performed using Bayesian techniques [4] that involve uncovering the distribution that generates the time series. However, this technique has a predetermined notion of the distribution that may have generated the data. This drawback shows in cases where the predetermined model of the generative process fails to identify the actual changepoints. Another downside of using a Bayesian model [4] is frequent state switching, which often requires post-processing to remove redundant switches. In contrast to this previous unsupervised work, this thesis explores the possibility of using features extracted from the time-series data to determine the changepoints, following [5]. To extract features from the time-series data, we use a self-supervised autoencoder model, specifically an SSAE [6], which can extract highly representative features from the data. The distance between the extracted features in feature space is then used to identify the changepoints [5].
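The idea of flagging changepoints from the distance between extracted features can be illustrated with a minimal sketch. It is not the decision rule of [5]: the function name `detect_changepoints`, the fixed `threshold` and the toy feature vectors are assumptions made purely for illustration.

```python
import math

def detect_changepoints(features, threshold):
    """Return the indices t at which the Euclidean distance between the
    feature vectors of windows t-1 and t exceeds `threshold`.
    The fixed-threshold rule is a hypothetical simplification."""
    changepoints = []
    for t in range(1, len(features)):
        if math.dist(features[t - 1], features[t]) > threshold:
            changepoints.append(t)
    return changepoints

# Toy features: two stable regimes with a jump between index 2 and index 3.
toy_features = [(0.0, 0.1), (0.1, 0.1), (0.0, 0.0), (2.0, 2.1), (2.1, 2.0)]
changepoints = detect_changepoints(toy_features, threshold=1.0)
```

In this toy input only the jump between the two regimes exceeds the threshold, so a single changepoint is reported. In practice the threshold would have to be chosen from the data.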

The segmentation is followed by clustering the segmented input data into fundamental scenarios. Clustering of time series can be performed either directly on the data, using measures such as Dynamic Time Warping (DTW) [7] or Euclidean distance, or by projecting each time series to a point, using its statistical features as the explanatory variables [4, 8, 9]. The clusters thus obtained are expected to represent the fundamental patterns in driving. This is the basis for this thesis, Identification of Fundamental Driving Scenarios. The performance of the proposed method is also compared with an existing Bayesian nonparametric approach to identify its advantages and shortcomings.
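Projecting each segment to a point through its statistical features can be sketched as below. The particular statistics (mean, standard deviation, minimum, maximum) and the name `segment_features` are assumptions for illustration; the features actually used in the thesis may differ.

```python
import statistics

def segment_features(segment):
    """Map a univariate segment of any length to a fixed-length vector
    of summary statistics (an assumed, illustrative feature set)."""
    return (
        statistics.fmean(segment),
        statistics.pstdev(segment),
        min(segment),
        max(segment),
    )

# Segments of different lengths become points of the same dimension,
# which is what makes them usable by a point-based clustering algorithm.
segments = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0, 5.0]]
points = [segment_features(s) for s in segments]
```

The design benefit is that segments of unequal duration become directly comparable, at the cost of discarding the temporal ordering within each segment.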

1.4 Delimitation

The scope of this thesis is to build a machine learning model that helps identify scenarios from the driving data of the ego vehicle, without involving inputs about its interaction with the surrounding traffic objects. Verifying the applicability of the model outside the autonomous driving setup is outside the scope of this thesis. Also, the thesis involves only a basic hyper-parameter search for the model and not an exhaustive one.

1.5 Thesis Outline

This report has six chapters. The current chapter is followed by the chapters listed below, a list of references and an appendix.

• Chapter 2: Background. This chapter provides a detailed background on the Deep Learning network used and the various clustering methods applied. It then gives an account of previous research that has approached the question at hand, and discusses the main references used as the basis for the thesis.

• Chapter 3: Method. This chapter presents the reasoning behind the selection of the data source and the feature crafting involved in the thesis, followed by the architecture of the model used and a detailed explanation of the training setup.

• Chapter 4: Results and Analysis. This chapter presents the results of the experiments performed as part of this thesis, followed by a qualitative and quantitative analysis of the obtained results.

• Chapter 5: Discussion. This chapter discusses the results analyzed in the previous chapter, with a detailed analysis of the strengths and drawbacks of the approach used in this thesis and an account of the ethical, social and sustainability aspects of Automated Driving Systems (ADS).

• Chapter 6: Conclusion. This chapter gives an overview of the thesis, with some remarks on directions worth further exploration.

Background

2.1 Time-series Analysis

Time-series data is a sequence of datapoints, indexed in time order, that monitors a particular process. Time-series data is usually dynamic because its feature values change as a function of time, i.e. the values of each point of a time series are one or more observations made chronologically [10]. Many real-world problems involve time series. Examples of time-series data include stock prices, weather data, driving data, speech data and electrocardiograms. Time-series analysis can be used to extract useful details about the process generating the time series, such as how the variable changes over time, where the peaks, trends and seasonality in the data lie, and how future data may evolve. There are multiple research problems involving time-series data, such as time-series forecasting, segmentation, clustering and classification.

2.1.1 Time-series Segmentation

Segmentation of time-series data is an important research area in the field of time-series analysis. The task involves dividing an input time series into a sequence of discrete segments based on the underlying physical property that generated them. It is an important research topic in fields such as audio signal processing, driving data analysis, finance, medicine and manufacturing. The main aim of the task is to identify the changepoints that occur in the time series. Time-series segmentation is usually performed using methods such as sliding-window, bottom-up and top-down approaches [11].

2.1.2 Time-series Clustering

Clustering is a technique in data mining where similar data are placed together to form homogeneous groups without any prior knowledge about their definition. The basic idea of time-series clustering is the same as that of clustering ordinary data, except that the data being clustered are time series. The method involves partitioning the time-series data into groups based on measures of similarity such as distance, alignment or overlap. The goal is to form homogeneous groups of time series with low inter-cluster and high intra-cluster similarity. Time-series clustering is thus defined as follows.

Given a dataset of n time series $D = \{D_1, D_2, \dots, D_n\}$, the unsupervised partitioning of $D$ into $C = \{C_1, C_2, \dots, C_k\}$ in such a way that homogeneous time series are grouped together based on a certain similarity measure is called time-series clustering [10]. Here, $C_i$ is called a cluster, where

$$D = \bigcup_{i=1}^{k} C_i, \qquad C_i \cap C_j = \emptyset \quad \text{for } i \neq j$$

In this thesis, we aim at discovering patterns from the time-series driving segments obtained as the outputs of the segmentation algorithm to identify the fundamental scenarios in driving.

2.2 Unsupervised Learning

Unsupervised Learning is a machine learning technique that does not need supervision, where the model is not fed with an explicit label output that it has to learn to match, as in supervised machine learning models. Instead, the model is built to discover useful information all by itself from the data that is fed to it. In the context of this thesis, both the time-series segmentation and fundamental pattern identification are treated as unsupervised learning tasks.

2.2.1 Feature Extraction

In machine learning and pattern recognition, a feature is an individual measurable characteristic of a process that cannot be observed directly but can be characterized through observations of it [12]. Identifying the discriminating features aims to improve aspects such as estimated accuracy, visualization and comprehensibility of the learned knowledge. Transforming the input space into a lower-dimensional subspace that preserves most of the relevant information is called feature extraction [13].

Feature extraction algorithms are valuable tools, which prepare data for other learning methods or tasks [14]. This is usually seen as a process that reduces the dimensionality of the data which helps in representing the raw data into a more manageable format, because it gets rid of the redundancy involved in the data [13]. Reducing dimensionality is also helpful as the high-level representations can help people better understand the intrinsic structure of the data [15]. The reduction in dimensionality happens by combining the explanatory variables in the data in such a way that the original dataset can still be completely and accurately described. For the context of this thesis, we will focus only on unsupervised feature extraction using Deep Learning.

2.3 Autoencoders

Since the task of this thesis is completely unsupervised, it is necessary to have a network that performs feature extraction without the need for labels. Autoencoders are one such type of network: they extract useful features of the data by trying to reconstruct the input data [16]. Since the network is trained to reconstruct its input as its output, this is a self-supervised learning task that can be used for reducing the dimensionality of, and extracting features from, the input data.

2.3.1 Traditional Autoencoders

An autoencoder [16] is composed of two units, the first unit is called an encoder that tries to constrict the input into a bottleneck, thus inducing feature extraction. The next unit is called a decoder, where the extracted feature is used to reconstruct the input data. The autoencoders can be built using any kind of layers like convolution layers, Long Short Term Memory (LSTM), Gated Recurrent Unit (GRU) layers or just fully connected layers.

Encoder. An input $x$ fed into an autoencoder first undergoes a series of transformations, usually aimed at reducing the dimensionality of the input, using an encoder with a function $g_\theta$ that leads to a bottleneck feature representation of the input. This feature reduction decides which aspects of the observed data are relevant for an effective reconstruction.

$$g_\theta(x) = s(Wx + b)$$

where $\theta = \{W, b\}$ is the parameter set, $W$ is the weight matrix and $b$ is the bias vector. The function $s$ represents the non-linear activation.

Decoder. The feature representation $z$ thus obtained is then fed into a decoder network with a function $f_\phi$ that aims at reconstructing the input as its output $y$. Usually, the decoder architecture is the mirror of the encoder.

$$f_\phi(z) = s(W'z + b')$$

where the parameter set is $\phi = \{W', b'\}$, $W'$ is the weight matrix and $b'$ is the bias vector.

Figure 2.1: Autoencoder architecture showing a bottleneck

The common choice of loss function for a deep learning model depends on the output. For a real-valued output, the loss function is usually a squared error objective such as the Mean Squared Error (MSE), while for a binary-valued output, Binary Cross Entropy (BCE) is used as the loss function. The autoencoder is trained to minimize this reconstruction loss, which in turn maximizes a lower bound on the mutual information between the input and its reconstructed output.
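The encoder, decoder and reconstruction loss described above can be traced on a tiny example. This is only a sketch: the 3-2-3 layer sizes, the hand-picked weights and the choice of tanh for the activation $s$ are assumptions, and no training is performed.

```python
import math

def affine_tanh(W, b, v):
    """Compute s(Wv + b) with s = tanh, where W is a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, v)) + bias)
            for row, bias in zip(W, b)]

def mse(x, y):
    """Mean squared error between an input and its reconstruction."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)

# Hypothetical encoder parameters theta = {W, b}: 3 inputs -> 2 features.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
# Hypothetical decoder parameters phi = {W', b'}: 2 features -> 3 outputs.
W_dec = [[0.4, 0.3], [-0.6, 0.9], [0.2, -0.7]]
b_dec = [0.0, 0.0, 0.0]

x = [0.2, -0.4, 0.6]
z = affine_tanh(W, b, x)          # bottleneck features, z = g_theta(x)
y = affine_tanh(W_dec, b_dec, z)  # reconstruction,      y = f_phi(z)
loss = mse(x, y)                  # quantity minimized during training
```

Training would adjust all four parameter arrays to reduce `loss` over the dataset; the vector `z` is what is later used as the extracted feature.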

2.3.2 Sparse Autoencoders (SAE)

The main aim of using autoencoders in this thesis is to obtain a feature representation that captures a high-level interpretation of the input. The problem with using a deep autoencoder, however, is the possibility of the autoencoder overfitting the data and producing a trivial identity mapping. Although this allows the original input data to be reconstructed perfectly, the autoencoder does not learn any meaningful features. Also, a dense representation is highly dependent on the input, because any change in the input changes most of the features in the representation vector. To overcome this issue, we introduce sparsity in the network. Sparse representations are usually invariant to small changes in the input, which makes them robust. However, forcing too much sparsity may also lead to poor representations. Thus, a sparsity penalty is introduced in the hidden layer such that a concise and efficient encoding of the input is obtained from the encoder [14]. Sparsity can be introduced in the network in many ways. One way is to add to the objective function a Kullback-Leibler (KL) divergence term between an expected average activation and the actual average activation; another is to add an L2 penalty on the weights of the network.

KL divergence sparsity. The idea of introducing sparsity is to impose a constraint on the activation of neurons. This can be achieved by pushing the average activation of each hidden neuron toward a constant $\rho$ that is close to zero. The average activation of the $j$-th hidden neuron is

$$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} a_j^{(2)}(x_i)$$

where $a_j^{(2)}(x_i)$ is the activation of the $j$-th neuron for the $i$-th input. Thus, $\hat{\rho}_j$ is averaged over all $N$ inputs. If the average activation $\hat{\rho}_j$ is forced to approach a value of $\rho$ close to 0 (say, 0.05), this is achieved at the cost of making most of the hidden units' activations close to zero, leading to only a few representative neurons firing. Mathematically, this is achieved using the KL divergence between the constant $\rho$ and the average activation $\hat{\rho}_j$:

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$$

This KL divergence term is added as an extra penalty term to the optimization objective, penalizing $\hat{\rho}_j$ for deviating significantly from $\rho$. Thus, the new objective function is

$$J_{sparse}(W, b) = J(W, b) + \mu \sum_{j=1}^{M} KL(\rho \,\|\, \hat{\rho}_j)$$

where $M$ is the number of neurons in the hidden layer and $\mu$ is the weighting factor of the divergence, which decides the strength of the sparsity.
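The sparsity penalty can be evaluated numerically as in the sketch below. It assumes activations in the open interval (0, 1), for instance from a sigmoid, because the formula takes logarithms of $\hat{\rho}_j$ and $1 - \hat{\rho}_j$; the function names and the toy activations are illustrative only.

```python
import math

def average_activations(activations):
    """activations[i][j] is the activation of hidden neuron j for input i.
    Returns rho_hat, the per-neuron average over all inputs."""
    n = len(activations)
    return [sum(column) / n for column in zip(*activations)]

def kl_sparsity_penalty(rho, rho_hat):
    """Sum over hidden neurons of KL(rho || rho_hat_j)."""
    return sum(
        rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
        for r in rho_hat
    )

# Toy activations for 2 inputs and 2 hidden neurons (assumed to lie in (0, 1)).
acts = [[0.05, 0.9], [0.05, 0.7]]
rho_hat = average_activations(acts)   # approximately [0.05, 0.8]
penalty = kl_sparsity_penalty(0.05, rho_hat)
```

The first neuron already matches the target $\rho = 0.05$ and contributes nothing, so the whole penalty comes from the second, frequently active neuron.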

L2 regularization. To prevent overfitting, an L2 regularization term, or weight attenuation term, is added to the objective function optimized by the network. Although this term does not make the weights exactly zero, it pushes them closer to zero depending on the strength of the regularization parameter $\lambda$ [16].

$$J_{sparse}(W, b) = J(W, b) + \mu \sum_{j=1}^{M} KL(\rho \,\|\, \hat{\rho}_j) + \frac{\lambda}{2} \left( \|W_e\|_2^2 + \|W_d\|_2^2 \right)$$

where $\lambda$ is the attenuation coefficient of the weights, and $W_e$ and $W_d$ are the encoder and decoder weights respectively.
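Why this term attenuates the weights without making them exactly zero can be seen from its gradient. As a brief sketch, assume plain gradient descent with a learning rate $\eta$ (a symbol not used elsewhere in this text) and write $J_0$ for the objective without the L2 term:

```latex
\frac{\partial}{\partial W_e} \, \frac{\lambda}{2} \|W_e\|_2^2 = \lambda W_e
\qquad\Longrightarrow\qquad
W_e \leftarrow W_e - \eta \left( \frac{\partial J_0}{\partial W_e} + \lambda W_e \right)
     = (1 - \eta \lambda) \, W_e - \eta \, \frac{\partial J_0}{\partial W_e}
```

Each update therefore shrinks the weights by the factor $(1 - \eta\lambda)$ before applying the usual gradient step. Because the shrinkage is proportional to the weight itself, it becomes vanishingly small as a weight approaches zero, which is why the weights are pushed toward zero but generally do not reach it. The same argument applies to $W_d$.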

2.3.3 Stacked Sparse Autoencoders (SSAE)

SSAEs, also known as Deep Stacked Autoencoders (DSAE), are obtained by stacking multiple SAEs on top of one another [17, 18, 19], similar to stacking Restricted Boltzmann Machines (RBM) in Deep Belief Networks (DBN). The autoencoders are stacked by using the encoded representation of the previous autoencoder as the input to the next autoencoder. Figure 2.2 shows a pictorial representation of the SSAE.

Training an SSAE. An SSAE consists of multiple SAEs connected by using the hidden representation of the previous SAE as the input to the next SAE. This ensures that a high-level feature representation of the data is extracted. The SSAE is trained by greedy layer-wise pre-training, in which the individual SAEs are trained sequentially to obtain the connection weights of the entire SSAE network. The SSAE is then fine-tuned from these weights by backpropagating the gradients through the whole network, which refines both the representation and the weights.

Figure 2.2: Architecture of an SSAE

2.4 Clustering

Clustering is an unsupervised learning method in which similar objects are grouped together based on the similarities between them; it therefore works without the need for any labels, although labels are traditionally used to verify the performance of a clustering algorithm. There are many types of clustering algorithms, which differ in the type of data they handle, in what constitutes a cluster and in how clusters can be efficiently separated. The aim is to reveal the class structure in the data by partitioning it.

2.4.1 k-means

k-means [20] clustering is a classical unsupervised learning algorithm. k-means uses every point in the data to evolve the clustering over a series of iterations. Each iteration reduces the within-cluster sum of squared distances, so the algorithm converges to a local minimum of this objective. The algorithm works as given below.

Algorithm:

• k central points or means are picked at random and are fed to the algorithm as the starting centroids of the clusters to be formed.

• All the points in the dataset are then assigned to the closest centroid, based on a similarity metric such as the Manhattan, Euclidean or Mahalanobis distance, or dynamic time warping (for time-series data).

• Once all the points are assigned to the clusters, the next step is to find the new centroids of the clusters. The new centroid is obtained by averaging all the points in each cluster. These two steps constitute one iteration of the algorithm.

• Multiple such iterations are performed until the assignment of points to the clusters does not change. Once the point assignments stop changing, the algorithm is said to have converged.

We end up with k different clusters, in which every point is closer to the centroid of its own cluster than to any other centroid. One of the major problems of the k-means algorithm is that it may produce empty clusters, depending on the initial centroids chosen.
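The steps listed above can be written out as a short, dependency-free sketch. It assumes the Euclidean distance and keeps the previous centroid when a cluster becomes empty; both choices, as well as the function name and the toy data, are illustrative rather than taken from the thesis.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance on a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: random starting centroids
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:           # step 4: assignments are stable
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                    # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centroids = kmeans(data, k=2)
```

On this toy input the two well-separated groups of three points end up in different clusters. With less clearly separated data the result depends on the random initial centroids, which is why the algorithm is often run several times.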

2.4.2 Clustering Metrics

2.4.2.1 Inertia

One of the requirements of a good clustering is that the datapoints belonging to the same cluster are as similar as possible. This is quantified by a metric called inertia [21]. The inertia of each cluster is obtained as the sum of the distances of all the points within the cluster from the centroid of that cluster. These per-cluster values are summed to obtain the final inertia over all the clusters. Thus, inertia is the metric that measures the intra-cluster distances of the obtained clusters.

$$I_j = \sum_{i=1}^{n_j} d(x_i, \mu_j)$$

where $d$ is the distance metric, $x_i$ is the $i$-th datapoint in cluster $j$, $n_j$ is the number of datapoints in that cluster and $\mu_j$ is the center of cluster $j \in C$. The distance metric $d$ is usually the squared Euclidean distance, and hence inertia is also called the sum of squared errors, which the clustering algorithm aims to minimize by choosing the best possible centroids. The lower the inertia, the denser the clusters.

$$I_{obj} = \sum_{i=1}^{n} \min_{\mu_j \in C} \left( \|x_i - \mu_j\|^2 \right)$$
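As a small worked example of the formulas above, the sketch below computes inertia with the squared Euclidean distance for a given assignment. The function name and the toy points are assumptions for illustration.

```python
import math

def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances of points to their assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
ctr = [(1.0, 0.0), (11.0, 0.0)]

# Every point lies exactly 1 unit from its own centroid.
total = inertia(pts, [0, 0, 1, 1], ctr)

# Assigning points to the farther centroid increases the inertia.
worse = inertia(pts, [0, 1, 0, 1], ctr)
```

Comparing `total` with `worse` shows how inertia rewards compact assignments. Note that inertia always decreases as k grows, so on its own it cannot be used to select the number of clusters.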

2.4.2.2 Silhouette Coefficient

Along with the datapoints within a cluster being similar, another requirement for a good clustering is consistency within the clusters. The silhouette coefficient [22] is a measure of this: it evaluates how similar a datapoint is to its own cluster compared to the other clusters, i.e. it measures cohesion within a cluster and separation from the other clusters. Silhouette scores range from -1 to 1, and the higher the coefficient, the better. Silhouette plots are a good visualization of the clustering performance; together with the cluster populations, they provide a way of checking the consistency of the clusters. The silhouette coefficient for each sample is given as

$$s = \frac{b - a}{\max(a, b)}$$

where $b$ is the mean distance between the sample and all the points in the nearest cluster that it is not assigned to, and $a$ is the mean distance between the sample and all the other points within the cluster it is assigned to.
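The per-sample formula can be checked by hand on a tiny dataset. The sketch below assumes the Euclidean distance and at least two clusters, and uses the common convention that a point alone in its cluster gets a score of 0; the function name and data are illustrative.

```python
import math

def silhouette_sample(i, points, labels):
    """Silhouette coefficient of point i (requires at least two clusters)."""
    own = labels[i]
    same = [math.dist(points[i], p)
            for j, p in enumerate(points) if labels[j] == own and j != i]
    if not same:                 # singleton cluster: score defined as 0
        return 0.0
    a = sum(same) / len(same)    # mean distance within the own cluster
    b = math.inf                 # mean distance to the nearest other cluster
    for other in set(labels) - {own}:
        dists = [math.dist(points[i], p)
                 for p, l in zip(points, labels) if l == other]
        b = min(b, sum(dists) / len(dists))
    return (b - a) / max(a, b)

pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
lbl = [0, 0, 1, 1]
scores = [silhouette_sample(i, pts, lbl) for i in range(len(pts))]
```

For the first point, a = 1 and b = (10 + 11) / 2 = 10.5, giving s = 9.5 / 10.5, which is close to 1 as expected for well-separated clusters.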

2.4.2.3 Calinski-Harabasz Index

Another metric that can be used to measure the performance of a clustering algorithm is the Calinski-Harabasz index [23], also known as the Variance Ratio Criterion. It is given as the ratio of the between-cluster dispersion to the within-cluster dispersion, scaled by the degrees of freedom. The higher the index, the better the clustering performance. For a dataset with $N$ datapoints that is clustered into $k$ clusters, the Calinski-Harabasz score is defined mathematically as

$$s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{N - k}{k - 1}$$

where $\mathrm{tr}(B_k)$ and $\mathrm{tr}(W_k)$ are the traces of the between-cluster and within-cluster dispersion matrices respectively:

$$W_k = \sum_{j=1}^{k} \sum_{x \in C_j} (x - c_j)(x - c_j)^T, \qquad B_k = \sum_{j=1}^{k} n_j (c_j - c_o)(c_j - c_o)^T$$

Here, $c_j$ is the center of cluster $C_j$, $n_j$ is the number of points in $C_j$ and $c_o$ is the center of the whole dataset.
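Since only the traces of the two dispersion matrices are needed, the index can be computed from squared distances alone. The sketch below relies on this identity; the function names and toy data are assumptions for illustration, and it requires at least two clusters and a non-zero within-cluster dispersion.

```python
import math

def mean_point(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def calinski_harabasz(points, labels):
    n, clusters = len(points), sorted(set(labels))
    k = len(clusters)
    c_o = mean_point(points)                  # center of the whole dataset
    tr_b = tr_w = 0.0
    for j in clusters:
        members = [p for p, l in zip(points, labels) if l == j]
        c_j = mean_point(members)
        tr_b += len(members) * math.dist(c_j, c_o) ** 2       # tr(B_k) term
        tr_w += sum(math.dist(p, c_j) ** 2 for p in members)  # tr(W_k) term
    return (tr_b / tr_w) * (n - k) / (k - 1)

pts = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
score = calinski_harabasz(pts, [0, 0, 1, 1])
```

Here tr(B_k) = 2 * 25 + 2 * 25 = 100 and tr(W_k) = 4, so the score is (100 / 4) * (4 - 2) / (2 - 1) = 50.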

(26)

2.4.2.4 Davies Bouldin Index

The Davies-Bouldin Index is another metric that signifies the average similarity between clusters [24]. It is measured by comparing the distance between clusters with the size of the clusters. Thus, the lower the index, the better the separation between the clusters. Mathematically, similarity between two clusters i and j is given as;

$$R_{ij} = \frac{s_i + s_j}{d_{ij}}, \qquad DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}$$

where, $s_x$ is the diameter of cluster x and $d_{ij}$ is the distance between the centroids of i and j.
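A small sketch of the Davies-Bouldin index follows. One assumption is made explicit here: the cluster "diameter" $s_x$ is taken to be the mean distance of the cluster's points to its centroid, which is a common choice but is not specified in the text. The function name is hypothetical.

```python
import math


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of tuples)."""
    def centroid(pts):
        return tuple(sum(coords) / len(pts) for coords in zip(*pts))

    centroids = [centroid(c) for c in clusters]
    # Assumption: s_i is the mean distance of the points of cluster i to its centroid.
    spreads = [
        sum(math.dist(x, c_i) for x in cluster) / len(cluster)
        for cluster, c_i in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case (largest) similarity R_ij of cluster i to any other cluster j.
        total += max(
            (spreads[i] + spreads[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        )
    return total / k


example_index = davies_bouldin([[(0.0,), (2.0,)], [(10.0,), (12.0,)]])
```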

2.5 Related Work

The following sections are aimed at providing a background about the notable research work in the same context as the thesis.

2.5.1 Rule-based Approach

An interesting problem in time-series analysis is segmenting a multivariate time-series into a sequence of segments. This problem commonly arises in applications such as speaker diarization, brain activity analysis, electrocardiogram analysis, financial data analysis and industrial process reconstruction. Traditionally, the problem is approached by discovering local patterns in multivariate time-series data through rules whose conditions identify patterns in sequential data [3]. The rule framework is usually application dependent. In this thesis, we focus on segmenting driving data and labelling the segments. The traditional rules applied for the segmentation of the time-series obtained from the ego-vehicle are shown in table 2.1. The disadvantages of this approach include:

• Thresholds/Rules are predetermined, leading to segments that usually meet initial assumptions, as opposed to allowing the data itself to reveal the most meaningful segments.

• It is difficult to perform segmentation in more than two dimensions, leading to challenges in formulating rules for very complex data.


| Expected scenario | Rule-based condition |
|---|---|
| Acceleration | acceleration > 0 |
| Deceleration | acceleration < 0 |
| Left turn | yaw rate > 0 |
| Right turn | yaw rate < 0 |
| Constant velocity | acceleration = 0, yaw rate = 0, velocity = constant |
| Standstill | acceleration = 0, yaw rate = 0, velocity = 0 |

Table 2.1: Rule-based framework for segmentation and labelling of driving scenarios
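To make the rule framework of table 2.1 concrete, the sketch below labels a single sample of ego-vehicle signals. The function name and the tolerance `eps` are assumptions: measured signals are rarely exactly zero, so a small dead band replaces the strict "= 0" conditions of the table.

```python
def label_sample(acceleration, yaw_rate, velocity, eps=1e-3):
    """Return the rule-based scenario labels for one sample (see table 2.1).

    eps is a hypothetical tolerance standing in for the exact "= 0" rules.
    """
    labels = []
    if acceleration > eps:
        labels.append("acceleration")
    elif acceleration < -eps:
        labels.append("deceleration")
    if yaw_rate > eps:
        labels.append("left turn")
    elif yaw_rate < -eps:
        labels.append("right turn")
    if not labels:
        # Neither accelerating nor turning: velocity decides the label.
        labels.append("standstill" if abs(velocity) <= eps else "constant velocity")
    return labels


# A sample can satisfy several rules at once, e.g. braking into a left turn.
print(label_sample(-0.4, 0.2, 8.0))
```

The fixed thresholds in this sketch are exactly the kind of predetermined assumption listed among the disadvantages above.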

2.5.2 Bayesian Nonparametric Approaches

Bayesian nonparametric approaches were used by Wang et al. [4] as part of a driving style analysis framework based on primitive driving patterns found in naturalistic data. As part of the analysis, the work extracted primary driving patterns from the time-series driving data without any prior knowledge about the number of patterns. That work was based on a car-following scenario, whereas our case concerns pattern extraction from a single driver's driving behavior. The base model of that work is a hidden semi-Markov model (HSMM), where a Hierarchical Dirichlet Process (HDP) is used to find the unknown modes of the HSMM.

HSMM A hidden Markov model (HMM) is a statistical model that follows the Markov process to identify the hidden states from a set of observations that are generated by them. The Markov process describes a sequence of consecutive events where the current event depends only on the immediately preceding event. One of the main drawbacks of the HMM is the mapping of each observation to a hidden state, which leads to a constant probability of changing state given survival in the state up to that time [25]. To overcome this, the HSMM was proposed, which places a distribution over the duration of every state, thus turning the Markov process into a semi-Markov one, where the probability of a state change depends on the time elapsed since entry into that state.

HDP The problem with a traditional Markov model is that the number of states must be known a priori. But in real-world scenarios such as driving data, it is difficult to come up with the exact number of states in the process. Thus, it becomes important to derive the number of states from the data, leading to a Bayesian approach. A traditional Dirichlet Process is a probability distribution over discrete probability distributions [26]. Conventionally, it is useful for describing our prior knowledge about the distribution of random variables in a process. It is mathematically expressed in terms of a base distribution H, which is the expected value of the process, and a concentration parameter α, which is a positive real number. The Dirichlet process draws a discrete distribution with the base distribution as the mean, while the concentration parameter decides the extent of discretization, with α → ∞ leading to continuous realizations, while for α → 0 all the realizations are at a single value. Mathematically,

if $X \sim DP(H, \alpha)$,

then, $(X(B_1), \ldots, X(B_n)) \sim \mathrm{Dir}(\alpha H(B_1), \ldots, \alpha H(B_n))$

where, Dir denotes the Dirichlet distribution and $\{B_i\}_{i=1}^{n}$ is a finite partition of the parameter space S. The Dirichlet Process is realized using a stick-breaking procedure, where the probability mass function of the sampled discrete distribution is given as;

$$G_0 = \sum_{k=1}^{\infty} \beta_k \, \delta_{\theta_k}(\theta)$$

where $\delta_{\theta_k}$ is the indicator function that takes the value zero everywhere except at $\theta_k$, where it becomes 1, and the probabilities $\beta_k$ are given by breaking a stick of unit length;

$$\beta_k = \beta'_k \prod_{i=1}^{k-1} (1 - \beta'_i),$$

where, $\beta'_k \sim \mathrm{Beta}(1, \alpha)$. The above equation resembles a stick-breaking procedure where, starting with a unit stick, at each step $\beta_k$ is assigned a piece that is cut off the remaining stick according to $\beta'_k$.

The Bayesian extension to a traditional Markov model is the addition of an HDP [27] to the traditional HMM or HSMM. In general, an HDP is a Bayesian statistical approach that allows different groups of observations to share statistical strength. In the case of a Markov model (or semi-Markov model), the HDP is used to allow groups of observations to share states, i.e. it accounts for the repetition of states in the sequence. This is achieved by drawing the base distribution of the Dirichlet process itself from a Dirichlet process.
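The stick-breaking construction described above can be sketched in a few lines. Since the infinite sum cannot be sampled directly, the sketch truncates it after `n_atoms` pieces and returns the leftover stick length; the function name, the truncation and the fixed seed are assumptions made for illustration.

```python
import random


def stick_breaking_weights(alpha, n_atoms, seed=0):
    """Sample the first n_atoms stick-breaking weights beta_k.

    Each beta'_k ~ Beta(1, alpha) is the fraction broken off the remaining stick.
    Returns (weights, remaining_stick_length).
    """
    rng = random.Random(seed)
    remaining = 1.0
    weights = []
    for _ in range(n_atoms):
        beta_prime = rng.betavariate(1.0, alpha)
        weights.append(beta_prime * remaining)   # beta_k = beta'_k * prod_{i<k} (1 - beta'_i)
        remaining *= 1.0 - beta_prime
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, n_atoms=50)
```

A small α tends to put most of the mass on the first few pieces, while a large α spreads the mass over many pieces, which matches the role of the concentration parameter described above.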


HDP-HSMM A regular HMM or HSMM would require the number of states in the process to be known in advance. Since this is unknown in the current scenario, the paper [4] uses an HDP to define a prior over the transition probabilities, allowing the data to determine the meaningful states.

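$$
\begin{aligned}
G_0 &= \sum_{i=1}^{\infty} \beta_i \delta_{\theta_i}, & \beta \mid \gamma &\sim \mathrm{GEM}(\gamma) \\
G_j &= \sum_{i=1}^{\infty} \pi_{ji} \delta_{\theta_i}, & \pi_j \mid \alpha, \beta &\sim DP(\alpha, \beta) \\
\theta_i \mid H &\sim H, & \omega &\sim G \\
z_s &\sim \tilde{\pi}_{z_{s-1}}, & d_s &\sim g(\omega_{z_s})
\end{aligned}
$$

where, $\pi_j = [\pi_{j1}, \pi_{j2}, \ldots]$ and $G_j$ is a discrete measure that is a variation on the global discrete measure $G_0$; $\theta$ and $\omega$ are parameters of the model, while the current state $z_s = x_{t_s^1 : t_s^{d_s+1}}$ is the presence of the hidden state $x_s$ for a duration $d_s$, which is also learnt from the data, and $y_{t_s^1 : t_s^{d_s+1}} = F(\theta_{x_t})$ is the observation sequence for a particular state s.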

Method An HDP-HSMM model is developed by placing a Gamma prior on the hyperparameters γ, α, κ, while the observations are modelled as Gaussian emissions; hence, the hyperparameter θ is drawn from an Inverse-Wishart prior (the conjugate prior of a Gaussian distribution) with a hyperparameter γ. The work is available as a fully developed package, pyhsmm [28]. The main aim of the work is to identify the hidden state sequence from the observation sequence, which is possible by modelling the HDP-HSMM. This is done by modelling the posterior using the priors and the data. The algorithm uses Gibbs sampling [28] to sample from the posterior and update the parameters in each iteration, until the posterior converges. The parameters of the converged posterior are used to identify the states and their durations in the model. The use of HDP-HSMM for the task at hand has been well researched in the area of driving behavior research. However, the main problems that we found with the package relate to the complexity of the model involved. Modelling the data involves assumptions about the distribution that the data comes from. In addition, uncovering the posterior using Gibbs sampling is very costly in terms of time and computing resources.


2.5.3 Segmentation and Clustering

Identification of driving patterns has been successfully explored in a car-following scenario [9]. The algorithm is structured in two steps: segmentation followed by clustering. The filtering technique used in that paper is rule-based and extracts car-following segments from the driving data. The only condition for segment identification is the range (the distance between the lead and the following vehicle). The segments thus formed were of varying lengths. The work also includes a segment optimization algorithm, which optimizes the segment length by minimizing the variance within the segment using a Metropolis algorithm [29] with an initial condition of 3 seconds. Clustering was then applied to the car-following segments to find similar segments or behaviors in them. The work uses k-means clustering with only the segment centroid representing each segment. It also includes a sensitivity analysis that evaluates different numbers of clusters, using the sum of the squared distances of clustered points from the cluster means as the metric.

2.5.4 Automatic Feature Learning

A deep learning based unsupervised approach has been successful in learning changepoints (also known as breakpoints) with subtle boundaries [5]. This algorithm has been successfully tested in various contexts, such as human activity, eye state and speech recognition, but has not been applied to driving time-series data. The problems with the Bayesian techniques are their assumptions on the generative process and their ability to detect only statistical and abrupt changepoints. Several deep learning based feature extraction methods have been successfully attempted previously. Deep autoencoders are used to extract features from the time-series data without making any assumptions about the generative process of the data. The time-series data is pre-processed into windows that can be input to the autoencoders. The autoencoder module used in the work is a basic encoder-decoder network with tied weights and sigmoid activation, trained using stochastic gradient descent to produce a representation in the feature space. The breakpoints are detected using the distance between consecutive features: the computed distances form a distance vector whose peaks are detected as changepoints. The robustness of the method was evaluated for different hyper-parameter settings. This approach was not tested in the context of driving data and involves only the segmentation of time-series data, without any labeling of segments to extract scenarios.


Method

3.1 Data Acquisition

The first step to building any machine learning model is to acquire data. Since the thesis involves working on scenario extraction for driving behaviors, the driving data available from hundreds of thousands of kilometers of driving by drivers from Zenseact forms the basis for the problem. The scenario extraction is an unsupervised learning task where the exact number of changepoints for segmenting the time-series driving data is unknown and the fundamental scenarios are not explicitly labelled. This data from Zenseact serves as the main source of input for the thesis, where the goal is to obtain a satisfactory performance on this real-world driving data.

3.1.1 Data Source

Driving is a complicated task where the behavior of the driver is usually exhibited through various sensor signals. The sensor signals are usually a function of the driver behavior, the surrounding traffic conditions etc. Since the aim of the thesis is to identify fundamental driving patterns, the idea was to drop the complexity of the traffic scenario and consider only the ego-vehicle.

There are many signals that are used to sense the current state of the vehicle and its surroundings, such as position, velocity, braking pressure, gas pedal pressure, traffic images etc. There are multiple sources for obtaining these signals, such as GPS, Light Detection And Ranging (LiDAR), CAN etc. Due to the wide range of signal availability, the accuracy of the sensor measurements and previous research works [19, 30, 31], the CAN bus was chosen as the source for obtaining the training data. Images from LiDAR may be useful in terms of scenario identification, but for the purpose of this thesis, we have used only CAN bus signals.

| Expected scenario | Sensor signals of interest |
|---|---|
| Acceleration, Deceleration | lateral, longitudinal and z-acceleration |
| Constant velocity, Standstill | lateral, longitudinal and z-velocity |
| Left turn, Right turn | yaw rate |

Table 3.1: CAN bus signals of interest depending on our expected scenarios

CAN bus data The backbone of a car's signals is the in-vehicle network, which connects the Electronic Control Units (ECUs). The ECUs are associated with sensors that monitor a wide variety of systems that govern the control of the vehicle. A typical vehicle has 50-70 such units that communicate over a series of buses. CAN bus [32] is one protocol that is used in such networks.

3.1.2 Feature Selection

The features that we considered for this thesis were identified on the basis of the available CAN bus signals and an analysis of the signals that would have an effect on fundamental driving patterns. The high-level patterns that we aim to find using the available data are acceleration, deceleration, constant velocity, standstill, brakes, left and right turns. The CAN bus signals that are of interest for this purpose, based on previous works [30, 31], are given in table 3.1. Yaw rate is useful for the identification of turns. The 3-D velocity and acceleration were combined to form 1-D velocity and acceleration respectively.

$$v = \sqrt{v_x^2 + v_y^2 + v_z^2}, \qquad a = \sqrt{a_x^2 + a_y^2 + a_z^2}$$

3.2 Data Preprocessing

Though the data from the CAN bus sensors are highly reliable and noise-resistant to a large extent, it is necessary to inspect the data and make it compatible with the learning task at hand. As the first step, it was ensured that the signals obtained from the CAN bus did not have any outliers and were consistent in time.

Downsampling and Curation As discussed previously, the CAN bus data was used as the source. The CAN bus signal is sampled at 40 Hz and was downsampled to 10 Hz to make the model memory and computation efficient. The data was also checked for imbalance. Driving data usually contains many standstill measurements, or states where there is no variation in any signal. This may cause problems while training an autoencoder model because the observations of the standstill sequence are usually easy to learn, making it difficult for the model to learn the other sequences because it overfits to the standstill state. In such a situation, the main task is to remove redundant standstill samples from the data, such that there is an almost equal representation of the signals from each of the states considered.
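The text does not state how the downsampling or the removal of standstill samples was implemented, so the following is only one possible sketch. It assumes plain decimation (keeping every fourth sample, with no anti-aliasing filter) and a hypothetical rule that keeps at most `max_run` consecutive standstill samples; both function names and the tolerance `eps` are illustrative.

```python
def downsample(samples, source_hz=40, target_hz=10):
    """Keep every (source_hz // target_hz)-th sample (simple decimation)."""
    step = source_hz // target_hz
    return samples[::step]


def cap_standstill_runs(samples, max_run, eps=1e-3):
    """Keep at most max_run consecutive standstill samples.

    samples -- list of (velocity, acceleration, yaw_rate) tuples
    A sample counts as standstill when all three signals are within eps of zero
    (an assumed criterion).
    """
    kept = []
    run_length = 0
    for sample in samples:
        if all(abs(value) < eps for value in sample):
            run_length += 1
            if run_length > max_run:
                continue            # drop the redundant standstill sample
        else:
            run_length = 0
        kept.append(sample)
    return kept


raw = [(0.0, 0.0, 0.0)] * 4 + [(5.0, 1.0, 0.0)] + [(0.0, 0.0, 0.0)] * 3
balanced = cap_standstill_runs(raw, max_run=2)
```

Dropping samples from the middle of a recording shortens the time axis, so in practice the cap would have to be chosen with the later windowing step in mind.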

Dataset The data that we have used in this thesis consists of the CAN bus signals of velocity, acceleration and yaw rate from a driver who has driven over an entire day. This signal is downsampled to a frequency of 10 Hz and some standstill samples are removed, leading to a total of 850,000 datapoints, each of which is multivariate with the three selected features. This data was used as the training data, while a similar process was used to form a validation dataset of 240,000 datapoints from another drive by the same driver. This was done because a single recording could not be split into training and validation sets while keeping the time aspect of the data intact and keeping the data balanced in both sets.


3.3 Time-series Segmentation

The time-series segmentation task is defined as: given a multivariate time-series signal over a period of time, find the changepoints in the signal where there is a distinct variation in the generative parameters of the sequence. This may also be stated as finding the state transitions in the unobserved generative process. The method used in this thesis closely follows the work in [5], where these changepoints are defined as breakpoints, since the method handles both subtle and abrupt variations in the data.

3.3.1 Data Normalization

The data that is fed to the model has three features, each of which belongs to a different range of values. Thus, to make the features comparable, we normalize each feature to the range (-1, 1) using the maximum and minimum values of its dimension, since we use a tanh activation in the model, which is discussed in detail in the later sections [5, 19, 33]. For data at timestep t, $d_t = (d_{t,1}, d_{t,2}, \ldots, d_{t,N_c}) \in \mathbb{R}^{N_c}$, where $N_c$ is the number of features. The normalized data at timestep t is given as $n_t = (n_{t,1}, n_{t,2}, \ldots, n_{t,N_c})$, where $n_{t,m}$ is given as;

$$n_{t,m} = 2 \times \frac{d_{t,m} - \min(d_m)}{\max(d_m) - \min(d_m)} - 1 \qquad (3.1)$$

where $d_{t,m}$ is the m-th feature of the data measured at timestep t, and $\max(d_m)$ and $\min(d_m)$ are the maximum and minimum values along dimension m of the data D.
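Equation 3.1 can be illustrated with a short sketch that scales each feature column independently. The function name is hypothetical, and mapping a constant column to 0 is an assumption added to avoid division by zero.

```python
def normalize_features(data):
    """Min-max scale each feature (column) of `data` to the range [-1, 1].

    data -- list of rows, one row per timestep, one value per feature
    """
    columns = list(zip(*data))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for row in data:
        normalized.append([
            # Equation 3.1; a constant feature is mapped to 0 (assumption).
            2.0 * (value - low) / (high - low) - 1.0 if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return normalized


scaled = normalize_features([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
```

The minimum and maximum would normally be computed on the training data and reused for the validation data, so that both sets are scaled identically.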

3.3.2 Data Windowing

The preprocessed data obtained with a frequency of 10 Hz was prepared to be fed to the model after normalizing to the range (-1, 1) as explained in Section 3.3.1. Sliding windows, a prevalent technique in handling time-series data, are used to ensure that the time aspect of the data is not lost in the process [5, 19, 30]. The sliding window provides the model with a memory of the past time-series data.

Figure 3.2: Evolution of sliding window over two consecutive slides of 10 samples. (a) The first data window of 30 seconds. (b) The second window of 30 seconds created by sliding the window over 10 samples. The black window shows the previous data window while the red window shows the current data window.

For a sliding window size $N_w$ and a sliding value $N_s$, the number of data windows is given as $T_w = T/N_s$, where T is the number of timesteps. A data window is formed by sequentially considering data of the size of the window and flattening the features to form a 1D vector. Hence, each data window is given as $s_t \in \mathbb{R}^{N_c N_w \times 1}$. The next window of data is constructed by moving the window by $N_s$ samples and creating an overlapping set of the next $N_w$ samples as described above. The windowed dataset is formed by stacking $T_w$ such data windows, $S = [s_1, s_2, s_3, \ldots, s_{T_w}]$. Thus,

$$S \in \mathbb{R}^{N_c N_w \times T_w} \qquad (3.2)$$

Thus, for the parameters we have considered, with a sliding window size $N_w = 30$, sliding value $N_s = 10$ and number of features $N_c = 3$, each window has a feature size of 90.
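The windowing step can be sketched as follows. The function name and the flattening order (timestep by timestep, features within each timestep) are assumptions. The sketch keeps only complete windows, so it produces floor((T - N_w) / N_s) + 1 windows, which is close to T / N_s for long recordings.

```python
def make_windows(data, window_size=30, stride=10):
    """Cut a multivariate series into overlapping, flattened windows.

    data -- list of per-timestep feature tuples (N_c values each)
    Returns a list of windows, each a flat list of window_size * N_c values.
    """
    windows = []
    for start in range(0, len(data) - window_size + 1, stride):
        chunk = data[start:start + window_size]
        windows.append([value for row in chunk for value in row])
    return windows


# 100 timesteps with 3 features each (toy data).
series = [(float(t), t + 0.5, -float(t)) for t in range(100)]
windows = make_windows(series)
```

With the default values of this sketch, each window holds 30 x 3 = 90 values, matching the feature size stated above.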

3.3.3 Data Loaders

The normalized and windowed dataset thus obtained is fed to the model in mini-batches to overcome the memory constraint of the hardware in use. The batch size was set to a size of 6000 datapoints (data windows, used interchangeably). The dataloaders were created to ensure that the data windows came in batches of user-defined size where the data is not shuffled like in the usual dataloaders, because the main aspect of this data is the time evolution which has to be kept intact.
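A minimal sketch of such an order-preserving loader is shown below. It is a plain generator rather than the loader class of any particular framework, and the function name is hypothetical.

```python
def sequential_batches(windows, batch_size=6000):
    """Yield consecutive mini-batches of data windows without shuffling."""
    for start in range(0, len(windows), batch_size):
        yield windows[start:start + batch_size]


# Toy usage: ten "windows" split into ordered batches of four.
batches = list(sequential_batches(list(range(10)), batch_size=4))
```

Because the batches are consecutive slices, the temporal order of the windows is the same inside and across batches.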

3.3.4 Feature Extraction using SSAE

The model used for feature extraction from the windowed time-series data is a SSAE. The idea is to obtain a high level representation of the data in the feature space that can be used for feature comparison and is based on the previous works [6, 19, 33].

Architecture The SSAE is obtained by greedily training layers of autoencoders with a sparse loss term, where the hidden representation of each layer is fed as input to the next autoencoder layer. This is followed by end-to-end finetuning. The aim is to create a feature representation of the data in a 3-dimensional space, so that the features can be visualized in an RGB color palette [19]. The sliding window size for the model is obtained using a basic hyper-parameter search based on the loss of the trained model on the training data [34]. Figure 3.3 shows an elbow for the log loss value at a window size of 30; hence, the window size was chosen to be 30. The number of autoencoder layers used in the SSAE is based on previous research works on driving data, where the number of layers is recommended to be greater than 3; thus we have used a network with four hidden layers [19]. The number of neurons in each layer is based on the geometric pyramid rule [35]. If $L_{out}$ is the number of neurons expected in the last layer, $L_{in}$ is the number of neurons in the input layer and $N_L$ is the number of layers in the network, then the geometric factor β is;

$$
\begin{aligned}
\beta &= (L_{in} / L_{out})^{1/N_L} \\
L_{h1} &= L_{out} \times \beta^{(N_L - 1)} \\
L_{h2} &= L_{out} \times \beta^{(N_L - 2)} \\
L_{h3} &= L_{out} \times \beta^{(N_L - 3)} \\
L_{h4} &= L_{out} \times \beta^{(N_L - 4)} \\
L_{out} &= 3
\end{aligned}
$$

Figure 3.3: Training loss as a function of sliding window size ($N_w$) to obtain an optimal sliding window size. The elbow in the training loss is at $N_w = 30$.

In our case, $L_{in} = 90$, $L_{out} = 3$ and $N_L = 5$, thus, $\beta \approx 2$, $L_{h1} = 48$, $L_{h2} = 24$, $L_{h3} = 12$ and $L_{h4} = 6$. Each single autoencoder of layer $l \in \{1, \ldots, N_L\}$ is formed using an encoder and decoder with as many neurons in the hidden layer as calculated above. The layerwise architecture of the SAEs used for building the SSAE model is shown in table 3.2. The linear layers were followed by a hyperbolic tangent (tanh) activation function that was used to add non-linearity to the representation at each layer.

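| SAE | Layer type | Layer | Output shape | Params |
|---|---|---|---|---|
| Layer 1 | encoder | Linear-1 | [6000, 1, 48] | 4,368 |
| | encoder | Tanh-2 | [6000, 1, 48] | 0 |
| | decoder | Linear-3 | [6000, 1, 90] | 4,410 |
| | decoder | Tanh-4 | [6000, 1, 90] | 0 |
| Layer 2 | encoder | Linear-1 | [6000, 1, 24] | 1,176 |
| | encoder | Tanh-2 | [6000, 1, 24] | 0 |
| | decoder | Linear-3 | [6000, 1, 48] | 1,200 |
| | decoder | Tanh-4 | [6000, 1, 48] | 0 |
| Layer 3 | encoder | Linear-1 | [6000, 1, 12] | 300 |
| | encoder | Tanh-2 | [6000, 1, 12] | 0 |
| | decoder | Linear-3 | [6000, 1, 24] | 312 |
| | decoder | Tanh-4 | [6000, 1, 24] | 0 |
| Layer 4 | encoder | Linear-1 | [6000, 1, 6] | 78 |
| | encoder | Tanh-2 | [6000, 1, 6] | 0 |
| | decoder | Linear-3 | [6000, 1, 12] | 84 |
| | decoder | Tanh-4 | [6000, 1, 12] | 0 |
| Layer 5 | encoder | Linear-1 | [6000, 1, 3] | 21 |
| | encoder | Tanh-2 | [6000, 1, 3] | 0 |
| | decoder | Linear-3 | [6000, 1, 6] | 24 |
| | decoder | Tanh-4 | [6000, 1, 6] | 0 |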

Table 3.2: Layerwise Architecture of SAE trained to form a SSAE model
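The layer sizes from the geometric pyramid rule and the parameter counts in table 3.2 can be checked with a short calculation. In this sketch the exact factor (90/3)^(1/5) is about 1.97 and is rounded to 2, as in the text; the parameter count of a linear layer is taken as weights plus biases. The function names are hypothetical.

```python
def pyramid_layer_sizes(n_in=90, n_out=3, n_layers=5):
    """Hidden layer sizes from the geometric pyramid rule."""
    beta = (n_in / n_out) ** (1.0 / n_layers)   # exact geometric factor
    factor = round(beta)                        # rounded factor, as used in the text
    hidden = [n_out * factor ** (n_layers - i) for i in range(1, n_layers)]
    return beta, hidden


def linear_layer_params(n_inputs, n_outputs):
    """Number of weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


beta, hidden = pyramid_layer_sizes()
sizes = [90] + hidden + [3]
encoder_params = [linear_layer_params(a, b) for a, b in zip(sizes, sizes[1:])]
decoder_params = [linear_layer_params(b, a) for a, b in zip(sizes, sizes[1:])]
print(hidden, encoder_params, decoder_params)
```

The resulting counts agree with the Linear rows of table 3.2, for example 90 x 48 + 48 = 4,368 for the first encoder.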

Training details Each layer of the SAE was trained using back-propagation on a loss function comprising two terms. The first term is a reconstruction loss, the mean squared error between the expected output (the original input, in this case) and the reconstructed output of the network, which is the common loss function used in autoencoders. The second and most important term of the SAE loss function corresponds to the sparsity of the network. This sparse term is given as the KL divergence between a required sparsity value and the activations of the layers, as explained in section 2.


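$$J(W, b) = \frac{1}{2 N_v} \sum_{t=1}^{N_v} \left\| r(v_t^l) - v_t^l \right\|_2^2 + \mu \, \mathrm{KL}\left(\theta \,\|\, \bar{h}_i^l\right) \qquad (3.3)$$

where, $N_v$ is the number of visible inputs, $v_t^l$ is the visible input of layer l, $r(v_t^l)$ is the reconstructed output of the visible input of layer l, $\bar{h}_i^l$ is the mean activation of hidden unit i of layer l, θ is the target sparsity and µ controls the strength of sparsity.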
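The loss in equation 3.3 can be sketched in plain Python under two assumptions that the text does not spell out: the KL term is summed over the hidden units, and tanh activations in (-1, 1) are rescaled to (0, 1) before being compared with the target sparsity. The function names and the clipping constant `eps` are illustrative.

```python
import math


def kl_bernoulli(target, actual):
    """KL divergence between Bernoulli(target) and Bernoulli(actual)."""
    return (target * math.log(target / actual)
            + (1.0 - target) * math.log((1.0 - target) / (1.0 - actual)))


def sparse_autoencoder_loss(inputs, reconstructions, hidden,
                            target_sparsity=0.5, sparsity_weight=0.7, eps=1e-8):
    """Reconstruction error plus weighted sparsity penalty for one batch.

    inputs, reconstructions -- lists of equal-length vectors
    hidden                  -- list of hidden activation vectors (tanh outputs)
    """
    n_v = len(inputs)
    reconstruction = sum(
        sum((r - v) ** 2 for r, v in zip(rec, inp))
        for rec, inp in zip(reconstructions, inputs)
    ) / (2.0 * n_v)

    penalty = 0.0
    for i in range(len(hidden[0])):
        # Assumption: rescale tanh activations from (-1, 1) to (0, 1).
        mean_activation = sum((h[i] + 1.0) / 2.0 for h in hidden) / len(hidden)
        mean_activation = min(max(mean_activation, eps), 1.0 - eps)
        penalty += kl_bernoulli(target_sparsity, mean_activation)

    return reconstruction + sparsity_weight * penalty


loss = sparse_autoencoder_loss([[0.0, 0.0]], [[1.0, 1.0]], [[0.0]])
```

The penalty vanishes when the mean activation of every hidden unit equals the target sparsity and grows as the mean activations move away from it.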

Each layer of the SAE was trained for a maximum of 1000 epochs, with early stopping if the validation loss increased continuously over 5 epochs or the loss reduction between consecutive epochs was less than a threshold, set to 0.0001 [36]. The SAEs were built in such a way that the weights of the encoder and decoder were tied. The hyper-parameters for training were set based on the previous research work [19]: the term that controls the strength of the sparse loss term was set to µ = 0.7 and the target sparsity for the hidden layers to θ = 0.5. The SAE layers were trained with a learning rate of 0.001 using Stochastic Gradient Descent (SGD) [37] as the optimizer. The momentum was set to 0.9, while regularization was handled using a dropout of 0.2. The trained SAE layers were then used to build a SSAE that was finetuned using the same learning rate and optimizer.
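The early-stopping rule can be sketched as a small helper that is called after every epoch. The text leaves some details open, so this sketch assumes that both criteria are evaluated on the validation loss and that the "small reduction" criterion only applies when the loss did not increase. The function name is hypothetical.

```python
def should_stop(val_losses, patience=5, min_delta=1e-4):
    """Decide whether to stop training, given the validation loss history.

    Stops when the loss has increased over `patience` consecutive epochs, or
    when the latest reduction is non-negative but smaller than `min_delta`.
    """
    if len(val_losses) < 2:
        return False
    reduction = val_losses[-2] - val_losses[-1]
    if 0.0 <= reduction < min_delta:
        return True                      # loss has effectively stopped improving
    if len(val_losses) > patience:
        recent = val_losses[-(patience + 1):]
        return all(later > earlier for earlier, later in zip(recent, recent[1:]))
    return False


history = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.99]
stop_now = should_stop(history)
```

In a training loop this check would sit next to the cap of 1000 epochs, whichever condition is met first.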

3.3.5 Changepoint Detection & Segment Extraction

Each three-dimensional feature vector extracted by the SSAE corresponds to a single window, and every successive window is the input window delayed by one second. Hence, to identify the changepoints, we compute the distance between the features of successive windows, thus forming a distance vector. The maxima in the distance vector then correspond to the timesteps where the change in the extracted feature is very high compared to the adjacent timesteps. Such points are considered as the changepoints [5]. The metric used to find the distance between the features of successive time windows is the normalized Euclidean distance in three dimensions. If the current feature is $f_t$ and the previous feature is $f_{t-1}$, then;

$$\mathrm{Distance}_t = \frac{\| f_t - f_{t-1} \|_2}{\sqrt{\| f_t \|_2 \times \| f_{t-1} \|_2}} \qquad (3.4)$$

The input data between consecutive changepoints are extracted to form the segments that will be used for further processing.
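A minimal sketch of this step computes the distance of equation 3.4 between consecutive feature vectors and reports local maxima of the distance vector as changepoints. The simple neighbour comparison and the optional `min_height` threshold are assumptions; the peak detector actually used is not described here. Feature vectors are assumed to be non-zero.

```python
import math


def normalized_distance(f_t, f_prev):
    """Equation 3.4: Euclidean distance scaled by the geometric mean of the norms."""
    return math.dist(f_t, f_prev) / math.sqrt(math.hypot(*f_t) * math.hypot(*f_prev))


def detect_changepoints(features, min_height=0.0):
    """Return (distances, changepoint_indices) for a sequence of feature vectors.

    distances[i] compares features[i + 1] with features[i]; a changepoint index
    refers to the window at which the new behaviour starts.
    """
    distances = [
        normalized_distance(features[t], features[t - 1])
        for t in range(1, len(features))
    ]
    changepoints = []
    for i in range(1, len(distances) - 1):
        is_peak = distances[i] > distances[i - 1] and distances[i] >= distances[i + 1]
        if is_peak and distances[i] > min_height:
            changepoints.append(i + 1)
    return distances, changepoints


features = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
distances, changepoints = detect_changepoints(features)
```

Since consecutive windows are one second apart, a changepoint index in window units maps to a time in seconds in the original recording.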

(40)

3.3.6 Evaluation Methods and Metrics

The results of time-series segmentation obtained by the changepoint detection algorithm are evaluated using two methods: qualitative and quantitative analysis.

Qualitative Analysis This analysis is performed by visualizing the results computed by the algorithm (detected changepoints) along with the inputs of the time-series data and the groundtruth changepoints, in one view [5]. The visualization can be used to compare the location of each detected changepoint against the groundtruth changepoints, as well as to find the possible variation in the input that has led to the changepoint being detected.

Quantitative Analysis In the strictest evaluation, the changepoints detected by the segmentation algorithm are counted as true positives only if they are detected exactly at the location of the ground truth. This approach does not account for temporal adjacency. Therefore, a segmentation point p(n) = S at a time n of the time-series is considered a good-enough hit if the target segmentation point is located in the immediate neighborhood, t(n ± τ) = S, where τ is a tolerated deviation; the changepoints that fall in this vicinity of the groundtruth changepoints are also considered to be true positives [38]. The tolerance value τ is set to 1 sec, as the sensitivity of a detected changepoint is 1 sec (or 10 input samples) and the minimum time distance between two meaningful segments is, in practice, 2 secs. Points that are detected as changepoints without any explanatory change in the inputs or a groundtruth changepoint are considered false positive alarms [5]. The results are quantified in terms of the True Positive Rate (TPR), False Negative Rate (FNR) and False Positive Rate (FPR). The TPR is the number of groundtruth changepoints that are identified correctly by the algorithm out of the total number of groundtruth changepoints, the FNR is the number of actual changepoints that are not identified by the algorithm out of the total number of changepoints, while the FPR is the number of false alarms out of the total number of changepoints detected by the algorithm.

$$TPR = \frac{N_{CR}}{N_{GT}}, \qquad FNR = 1 - TPR, \qquad FPR = \frac{N_{AL} - N_{CR}}{N_{AL}}$$

where $N_{CR}$ is the number of groundtruth changepoints that are identified correctly, $N_{GT}$ is the number of groundtruth changepoints and $N_{AL}$ is the total number of changepoint alarms detected by the algorithm.
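The three rates can be computed with a small matching routine, sketched below. Positions are given in input samples, so the 1 sec tolerance corresponds to 10 samples at 10 Hz. The greedy one-to-one matching of each groundtruth changepoint to its nearest unused detection is an assumption about how hits are counted, and the function name is hypothetical.

```python
def changepoint_rates(detected, ground_truth, tolerance=10):
    """Return (TPR, FNR, FPR) for detected vs. groundtruth changepoints.

    A groundtruth changepoint counts as correctly detected if an unused
    detection lies within `tolerance` samples of it.
    """
    unused = sorted(detected)
    n_correct = 0
    for truth in sorted(ground_truth):
        candidates = [d for d in unused if abs(d - truth) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda d: abs(d - truth))
            unused.remove(best)          # each detection can explain one changepoint
            n_correct += 1
    n_gt, n_al = len(ground_truth), len(detected)
    tpr = n_correct / n_gt
    fnr = 1.0 - tpr
    fpr = (n_al - n_correct) / n_al
    return tpr, fnr, fpr


tpr, fnr, fpr = changepoint_rates([100, 205, 400], [100, 200, 300, 500])
```

In this toy example two of the four groundtruth changepoints are found and one of the three alarms is unexplained.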

3.4 Time-series Clustering

The segments extracted by the time-series segmentation algorithm are used as the input for the clustering algorithm. The idea was to identify driving patterns from the set of segments that were obtained. Since the data considered is only from the ego vehicle, the patterns that we aim to obtain do not involve any complex scenarios, but only simple everyday patterns such as acceleration, deceleration, braking, left turns, right turns and constant velocity. But since the segments are not mapped to their patterns, we perform a clustering on the segments to see if the algorithm can identify some of these meaningful patterns and extract other patterns of interest. In the upcoming sections, we try to identify patterns that constitute fundamental driving scenarios.

3.4.1 Statistical Feature based Clustering

Time-series clustering with tslearn using Dynamic Time Warping or Euclidean distance has the major disadvantage of requiring too many computations, because of the distance matrix calculation and the interpolation to a common length respectively, making even a single iteration too time consuming. To overcome this issue with multivariate time-series data, each time-series segment is projected onto a single point based on its statistical features [4, 6, 8], thus transforming the multivariate time-series data into statistical data points.

Statistical Feature Extraction The statistical features that were considered for the clustering algorithm were based on the patterns that were of interest. In our case, since we were looking for acceleration, deceleration, constant velocity, brakes and turns, we included the coefficient of variation and the slope of each segment for the three input features: acceleration, velocity and yaw rate. To differentiate between left and right turns, and between constant velocity and braking states, we also include the centroids of yaw rate and velocity as additional features. Thus, each time-series segment is represented as a vector with the eight features described in table 3.3. For the task of clustering, we consider multiple drives of at least two hours each to ensure stability of the clustering algorithm in use.


| Expected pattern / fundamental scenario | Features of interest |
|---|---|
| Acceleration, Deceleration | coefficient of variation and slope of acceleration |
| Constant velocity, Brakes | coefficient of variation, slope and centroid of velocity |
| Turns | coefficient of variation and slope of yaw rate |
| Left / Right turn | centroid of yaw rate |

Table 3.3: Statistical features corresponding to each expected pattern
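The projection of one segment onto its eight statistical features can be sketched as follows. Several details are assumptions: the centroid is taken to be the mean of the signal, the slope is the least-squares slope against the sample index, and a small `eps` guards the coefficient of variation against a zero mean. The function names and the feature order are illustrative.

```python
from statistics import mean, pstdev


def slope(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    t_mean = (n - 1) / 2.0
    v_mean = mean(values)
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    return numerator / denominator


def coefficient_of_variation(values, eps=1e-8):
    """Standard deviation relative to the absolute mean (eps avoids division by zero)."""
    return pstdev(values) / (abs(mean(values)) + eps)


def segment_features(acceleration, velocity, yaw_rate):
    """Eight-dimensional feature vector of one segment (see table 3.3)."""
    return [
        coefficient_of_variation(acceleration), slope(acceleration),
        coefficient_of_variation(velocity), slope(velocity), mean(velocity),
        coefficient_of_variation(yaw_rate), slope(yaw_rate), mean(yaw_rate),
    ]


features = segment_features([0.0, 0.5, 1.0, 1.5], [10.0] * 4, [0.1] * 4)
```

Each segment, whatever its length, becomes one point in this eight-dimensional space, which is what makes ordinary k-means applicable.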

k-means clustering of projected points The points obtained by projecting the time-series segments are used as input to the regular k-means clustering algorithm. For this, we use the k-means clustering package from scikit-learn [39]. The k-means algorithm was run for multiple initializations of the centroids with a tolerance of 0.001. The metric used for the clustering algorithm is the Euclidean distance between the points. The best result among the different initializations is selected based on the inertia of the model. The same is done for different values of k, i.e., for different numbers of clusters. The best clustering is chosen by comparing the silhouette coefficient, Calinski-Harabasz and Davies-Bouldin indices of each model.
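The procedure of restarting k-means and keeping the run with the lowest inertia can be illustrated with a dependency-free sketch. The thesis uses the scikit-learn implementation; the code below is a simplified stand-in, with random initial centroids drawn from the data and a stopping test on the largest centroid shift, both of which are assumptions of this sketch.

```python
import math
import random


def kmeans_once(points, k, rng, tol=1e-3, max_iter=300):
    """One run of Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])   # keep the centroid of an empty cluster
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
    inertia = sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
    return labels, centroids, inertia


def kmeans_best_of(points, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, inertia = kmeans_best_of(points, k=2)
```

With scikit-learn, the corresponding settings are the `n_clusters`, `n_init` and `tol` parameters of `KMeans`, and the selected run's inertia is available as the `inertia_` attribute.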

3.4.2 Evaluation Methods and Metrics

The performance of the clustering algorithm is evaluated both qualitatively and quantitatively. The clusters that are obtained are analyzed at various levels to map them to their corresponding fundamental scenarios.

Qualitative Analysis The segmentation and scenario labelling results obtained by the algorithm can be evaluated by visualizing the results along with the time-series data in one view [40]. The visualization shows the input time-series data segmented and colored according to the scenarios obtained through clustering, so that segments showing similar behavior share the color of their scenario. The clustering of the time-series datapoints is also visualized in the acceleration vs. yaw rate input space, where the datapoints belonging to a segment are colored according to its scenario. This helps the viewer locate clusters and identify scenarios based on the location of the clusters in the acceleration vs. yaw rate space. The final visualization is in terms of the cluster profiles, where the segments belonging to each cluster are separated and their inputs (acceleration, velocity and yaw rate) are plotted separately to obtain the input profiles of each cluster. The existence of a trend in the inputs of the clustered segments, according to the rules of the corresponding scenario, is considered a successful clustering.

Quantitative Analysis The algorithm-detected scenario of each segment is evaluated against its corresponding groundtruth scenario. Since the expert-identified scenarios for the ego-vehicle driving data are limited and the proposed algorithm is not bound by rules, the algorithm may find scenarios that are not defined by rules. In such a case, the comparison with the groundtruth becomes difficult. So, we do an analysis at two levels;

• Phase I: The scenario labels are evaluated without the new clusters, comparing the expected scenarios against the algorithm-detected scenarios. In this case, we remove the entries where our approach has assigned the segment to a scenario that is undefined by rules.

• Phase II: The scenario labels are evaluated with the newly identified scenarios included, each of which is merged into the expected scenario closest to it.
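The two phases can be illustrated with a short sketch. Several details here are assumptions not fixed by the text: agreement is measured as plain accuracy, "closest" is taken to mean the smallest Euclidean distance between cluster centroids, and the scenario names and centroid values are hypothetical.

```python
import numpy as np


def phase_one_accuracy(truth, predicted, defined):
    """Phase I: drop segments assigned to clusters outside the rule-defined set."""
    pairs = [(t, p) for t, p in zip(truth, predicted) if p in defined]
    if not pairs:
        return float("nan")
    return sum(t == p for t, p in pairs) / len(pairs)


def phase_two_accuracy(truth, predicted, centroids, defined):
    """Phase II: fold each new cluster into the closest rule-defined one."""
    merged = []
    for p in predicted:
        if p not in defined:
            # Assumption: closeness is the Euclidean distance between centroids.
            p = min(defined, key=lambda d: np.linalg.norm(centroids[p] - centroids[d]))
        merged.append(p)
    return sum(t == m for t, m in zip(truth, merged)) / len(truth)


# Hypothetical scenario names and centroids in a 2-D feature space.
defined = ["cruise", "brake", "turn"]
centroids = {
    "cruise": np.array([0.0, 0.0]),
    "brake": np.array([-2.0, 0.0]),
    "turn": np.array([0.0, 2.0]),
    "new_0": np.array([-1.5, 0.1]),  # cluster with no rule-based definition
}
truth = ["cruise", "brake", "turn", "brake"]
predicted = ["cruise", "new_0", "turn", "cruise"]

acc_phase_one = phase_one_accuracy(truth, predicted, defined)
acc_phase_two = phase_two_accuracy(truth, predicted, centroids, defined)
```

In this toy case the segment labelled `new_0` is excluded in Phase I and counted as `brake` in Phase II, so the two phases report different scores for the same clustering.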

3.5 Bayesian Nonparametric Approach

HDP-HSMM (Section 2.5.2) is used as our benchmark segmentation method. This section provides an intuition of this existing method, which was applied to the same data to perform the task of time-series segmentation. Modelling the data using Bayesian nonparametrics involves many assumptions about the data [4]. The main assumption for computing the posterior with this approach is that the state emissions come from a Gaussian distribution. Using the Inverse-Wishart distribution as the prior makes the posterior distribution tractable, due to its conjugacy with the Gaussian distribution. Consequently, the covariance matrices of the states are assumed to have the following prior:

Σi | n0, κ0 ∼ IW(n0, κ0)

where n0 is the number of degrees of freedom. The covariance of the emissions in each state was assumed to be lower than the total covariance of all the states together (Σ̄). κ0 was initialized to 0.5 Σ̄, where κ0 is the expected covariance and µi =


Table 3.4: Parameters for HDP-HSMM [41]

Parameter    Description            Value / Default value
(αα, βα)     α gamma prior          (1, 0.25)
(αγ, βγ)     γ gamma prior          (1, 0.25)
n0           IW degree of freedom   N + 2
κ0           IW prior scale         0.5 Σ̄ / Σ̄
(αg, βg)     g gamma prior          (12, 0.5) / -
t            truncation level       150 / -
L            HDP weak-limit         25 / -

state duration was modelled using a Poisson distribution (g) with a gamma prior. Gamma priors are also placed on the HDP parameters. An upper limit is placed on the state duration as well as on the number of states, which decreases the computation time of the Gibbs sampling. Table 3.4 shows the final parameter and hyperparameter choices, with the corresponding default parameters [41]. The HDP-HSMM was modelled using the pyhsmm library [28]. The process of choosing the best parameters for the HDP-HSMM was carried out in two steps. Firstly, the model is trained on a set of randomly chosen trips from different drivers, where an iteration of Gibbs sampling goes through all of the trips in the training set and then updates the parameters. The model parameters were inferred by Gibbs sampling, with 30 resamples for each set of trips. The segmentation model chosen was based on the previous work at Zenseact [41]. The results of this method are presented qualitatively, as a scatterplot showing the colored state assignments of the sample datapoints in the input space and as a segmentation of the time-series inputs. The means of the Gaussian emission distributions of the states discovered by the model are presented as quantitative results.
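As a short consistency check on the Inverse-Wishart hyperparameters in Table 3.4, the expected covariance under the prior can be written out. The sketch below assumes that N denotes the dimension of the emission vector and that IW(n0, κ0) is parameterised by degrees of freedom and scale matrix; neither is stated explicitly in the text.

```latex
\mathbb{E}[\Sigma_i] = \frac{\kappa_0}{n_0 - N - 1}, \qquad n_0 > N + 1
\quad\Longrightarrow\quad
n_0 = N + 2:\;\; \mathbb{E}[\Sigma_i] = \kappa_0 = 0.5\,\bar{\Sigma}
```

Under these assumptions, choosing n0 = N + 2 makes the prior scale equal to the prior mean, which would agree with the description of κ0 as the expected covariance.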


Results and Analysis

4.1 Time-series Segmentation

4.1.1 SSAE Training

The SSAE module was trained according to the specifications presented in Section 3.3.4. The loss evolution for the SSAE is shown in Figure 4.1. Based on the early stopping criterion applied to the validation dataset, the loss appears to have converged.

[Plot: evolution of SSAE log loss over epochs. x-axis: Epochs (0 to 80); y-axis: log loss of SSAE; curves: training loss and validation loss.]

Figure 4.1: Log Loss evolution of finetuned SSAE
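For readers unfamiliar with early stopping, a patience-based rule is sketched below. This is one common formulation and not necessarily the exact criterion used for the SSAE; the `patience` and `min_delta` parameters and the example loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience-based early-stopping rule.

    Training stops at the first epoch for which the validation loss has not
    improved by more than min_delta during the last `patience` epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The rule never triggered: training ran to the last recorded epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation curve: improves, then plateaus.
val_losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.37, 0.36, 0.38]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

The weights from `best_epoch` are typically the ones kept, so a flat validation curve at the stopping point is what one would expect to see in a plot such as Figure 4.1.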


4.1.2 SSAE Reconstruction

The basic function of an autoencoder is to find a representation from which the data can be retrieved with minimal loss. The representation in the bottleneck layer is what allows the reconstructed input to be close to the original input. The features extracted from the bottleneck layer of the SSAE are three-dimensional for each input window. Since the idea of using an SSAE was to extract only the high-level features within the time-series data window, we do not expect a perfect match between the reconstructed and original input, but rather a reconstruction that serves as a rough estimate of the original input. Figure 4.2 shows the reconstruction of one time window from the validation data. The reconstructed input is close in value to the original data for all three features and thus serves as a rough estimate of the input. Hence, we conclude that the trained SSAE model has learned from the data it has seen.

Figure 4.2: High level reconstruction of an input window of 30 samples of driving data
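The visual comparison in Figure 4.2 could be complemented by a per-feature error over the window. The sketch below computes a root-mean-square error for each feature; the choice of RMSE and the synthetic window are assumptions, and no such figure is reported in the text.

```python
import numpy as np


def per_feature_rmse(original, reconstructed):
    """RMSE of each feature column over one window of shape (samples, features)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    if original.shape != reconstructed.shape:
        raise ValueError("windows must have the same shape")
    return np.sqrt(np.mean((original - reconstructed) ** 2, axis=0))


# Hypothetical 30-sample window with three features, and an offset copy
# standing in for the SSAE output.
rng = np.random.default_rng(0)
window = rng.standard_normal((30, 3))
reconstruction = window + 0.1
rmse = per_feature_rmse(window, reconstruction)
```

Reporting one value per feature keeps the acceleration, velocity and yawrate errors separate, which matters when the features are on different scales.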