Product Usage Data collection and Analysis in Lawn-mowers

Master of Science in Computer Science October 2020

Sarath Chandra Damineni
Sai Manikanta Munukoti

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Master of Science in Computer Science.

The thesis is equivalent to 20 weeks of full time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:

Author(s):

Sarath Chandra Damineni
E-mail: sadi18@student.bth.se

Sai Manikanta Munukoti
E-mail: samk18@student.bth.se

University advisor:

Abbas Cheddad (Senior lecturer/Associate professor)
Department of Computer Science

Industrial advisor:

David S Hellström (Senior Systems Engineer)
Husqvarna Group

Faculty of Computing
Blekinge Institute of Technology
SE–371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background. As demand for modern-day comforts rises, lawn-mowers have evolved considerably. This evolution has led companies to produce a range of lawn-mowers (commercial, household) for different kinds of use. Despite this evolution and the size of the market, to the best of our knowledge no effort has been made to understand customer usage by analysing real-time usage data from lawn-mowers. This research attempts to analyse the real-time usage of lawn-mowers using machine learning techniques.

Objectives. The main objective of the thesis is to understand customer usage of lawn-mowers by analysing real-time usage data with machine learning algorithms. To achieve this, we first reviewed several studies to identify the different ways (scenarios) in which customer usage can be understood. After discussing these scenarios with the stakeholders at the company, we selected a suitable scenario for lawn-mowers. Finally, we achieved the primary objective by clustering the usage of lawn-mowers, analysing real-world time-series data from the Controller Area Network (CAN) bus based on driving patterns.

Methods. A systematic literature review (SLR) was performed to identify the different ways to understand customer usage by analysing usage data with machine learning algorithms. A second SLR was performed to gain detailed knowledge of the machine learning algorithms that could be applied to the real-world data. Finally, an experiment was performed in which the machine learning algorithms were applied to the CAN bus time-series data to group the usage of lawn-mowers into clusters; the experiment also involved comparing the algorithms applied to the data and selecting among them.

Results. The SLR yielded different scenarios for understanding customer behaviour by analysing usage data. After the most suitable scenario for lawn-mowers was formulated, the second SLR suggested the most suitable machine learning algorithms to apply to the data for that scenario. Applying these algorithms after the necessary pre-processing steps produced clusters of lawn-mower usage for every driving pattern selected. We also obtained clusters for different features of the driving patterns, which indicate characteristics such as the change of intensity in usage and the rate of change in usage.

Conclusions. This study identified customer behaviours by clustering their usage data. Moreover, clustering the CAN bus time-series data from lawn-mowers gave fresh insights into human behaviour and interaction with the lawn-mowers. The clusters offer considerable scope for classifying users and developing an individual strategy for each cluster. Further, the clusters can also be useful for identifying outlying behaviour of users and/or individual components.

Keywords: Usage data analysis, lawn-mowers, CAN bus, clustering, driving patterns.

Acknowledgments

We want to express our deep gratitude for the continuous support and guidance given by our supervisor, Abbas Cheddad. We are also thankful to David S Hellström at the Husqvarna Group for his motivation, support, and sharing of knowledge during the research.

We would like to thank our parents, family, colleagues, and friends for their support and encouragement.

Contents

Abstract
Acknowledgments
1 Introduction
  1.1 Context and Motivation
  1.2 Classification of lawn-mowers
  1.3 Problem Statement
  1.4 Aims and Objectives
  1.5 Research Questions
  1.6 Thesis Structure
2 Background
  2.1 Machine Learning
  2.2 Unsupervised learning
    2.2.1 Clustering
  2.3 Performance metric
    2.3.1 V-measure
    2.3.2 Silhouette Coefficient
  2.4 Box plot
  2.5 CAN bus
3 Related Work
4 Method
  4.1 Systematic Literature Review
    4.1.1 Investigation of Primary Studies
    4.1.2 Criteria for the selection of research
    4.1.3 Assessment of Quality
    4.1.4 Extraction of Data
  4.2 Experiment
    4.2.1 Tools
    4.2.2 Software Environment
    4.2.3 Collection of Data
    4.2.4 Dataset Description
    4.2.5 Experiment Implementation
    4.2.6 Performance metrics
    4.2.7 Significance test
    4.2.8 Selection of Algorithms
5 Results and Analysis
  5.1 Systematic Literature Review
    5.1.1 Synthesis of data from SLR of RQ-1
    5.1.2 Scenarios collected from existing literature (Results of RQ-1)
    5.1.3 Formulation of Scenario
    5.1.4 Synthesis of data from SLR of RQ-2
    5.1.5 Selection of Algorithms (Results of RQ-2)
  5.2 Experiment
    5.2.1 Formulation of Features
    5.2.2 Visualising and removing outliers
    5.2.3 Histogram Formulation
    5.2.4 Clustering the data
  5.3 Comparison of Algorithms
    5.3.1 Significance test
  5.4 Selection of Algorithm based on V-measure
  5.5 Clusters formulated
6 Discussion
  6.1 Answers to research questions
  6.2 Validity Threats
    6.2.1 Internal Validity
    6.2.2 External Validity
    6.2.3 Construct Validity
  6.3 Selection of Signals
    6.3.1 Selection of Algorithm
7 Conclusions and Future Work
References
A Selected Papers


List of Figures

1.1 Manual push lawn-mowers
1.2 Robotic lawn-mowers by Husqvarna
1.3 Walk-behind lawn-mowers by Husqvarna
1.4 Heavy-duty lawn-mowers by Husqvarna
1.5 Flow of work
2.1 Working of K-means
2.2 Working of DBSCAN
2.3 Working of Agglomerative
2.4 Distribution of data using box plot
2.5 Communication of various ECUs through CAN bus
4.1 CAN bus time series with required signals
4.2 Formulation of individual sessions from CAN bus data
4.3 Architecture of data computed after formulation of features
4.4 Vector V_f formulation from vectors of individual sessions (v_{f,s})
5.1 A sample session data
5.2 Features of the first session of the signal ‘FRON_TXPDO1
5.3 Sample histogram data


List of Tables

4.1 Search string for the SLR-1
4.2 Search string for the SLR-2
4.3 Inclusion and exclusion criteria for SLR-1,2
4.4 Quality assessment for RQ-1
4.5 Quality assessment for RQ-2
4.6 Data Extraction
4.7 Description of signals taken from the CAN bus time series data
4.8 Description of additional two signals taken from the CAN bus time series data
5.1 Data Extraction from SLR of RQ-1
5.2 Scenarios derived from the previous studies
5.3 Data Extraction from SLR of RQ-2
5.4 Each Session data
5.5 Box plots for all the features for the signals ’FRONT_TXPDO1
5.6 Box plots for all the features of the signals ’WCU_TXPD
5.7 Histograms for all the features of the signals ’FRONT_TX
5.8 Histograms for all the features of the signals ’WCU_TXPDO1
5.9 Comparison of algorithms for all the features of the signals ’FRO
5.10 Comparison of algorithms for all the features of the signals ’WCU
5.11 Results of Significance test
5.12 Results of significance test
5.13 Selection of Algorithm for all features of the signals ’FRONT_TXP
5.14 Selection of Algorithms for all features of the signals ’WCU_TXP
5.15 Clusters of all the features for the signals ’FRONT_TXPDO
5.16 Clusters for all the features for the signals ’WCU_TXPDO
5.17 Cluster distribution of the feature ’signal value’


List Of Abbreviations

API: Application Programming Interface.

Arms: Ampere root mean square.

CAN: Controller Area Network.

CPU: Central Processing Unit.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise.

ECU: Electronic Control Unit.

GHz: Gigahertz.

GPS: Global Positioning System.

IDE: Integrated Development Environment.

IoT: Internet of Things.

ms: Millisecond.

PCA: Principal Component Analysis.

RAM: Random Access Memory.

RPM: Revolutions Per Minute.

SOC: State of Charge.

SSD: Solid State Drive.

Chapter 1 Introduction

1.1 Context and Motivation

The major challenge facing present-day industries is to reduce the production cost of developing products while maintaining high quality, performance, and customer satisfaction [1]. In addition, a company's global competitiveness depends on meeting customer expectations at optimal prices [1]. These challenges require companies to evolve continually in both technology and performance. To survive in competitive markets over the long term, companies must focus on re-inventing and re-developing their products more smartly by understanding customer usage [2]. A development process based on customer usage data rests on facts rather than assumptions and can generate better results [3].

Further, analysing customer usage data with machine learning algorithms yields fresh insights about customers and reveals hidden patterns of product usage [4].

In the same way, the competitive market for lawn-mowers is also evolving very fast [5]. Lawn-mowers have evolved from manual push lawn-mowers (as shown in Figure 1.1) to present-day engine-powered lawn-mowers (as shown in Figures 1.2, 1.3, and 1.4). As part of this evolution, automatic household lawn-mowers have already become familiar. This evolution continues in various directions as people grow more accustomed to modern-day comforts and conveniences [5]. To anticipate this evolution and the future needs of customers, it is essential to understand the customer usage of lawn-mowers [6]. Further, modern tools such as machine learning and data analysis help to understand customer usage in different scenarios, for example:

1. Understanding human driving behaviours in order to design and develop new products and to design patterns for driving automation [7] [8].

2. Clustering and classifying the users into groups to treat each group separately to plan service cycles and future development [9].

3. Identifying the outlier behaviour of various components to detect the faults and defects [10].

4. Assessing and improving the efficiency, performance, and capability of the vehicles [7].

Figure 1.1: Manual push lawn-mowers

1.2 Classification of lawn-mowers

As a result of this evolution and of customer requirements, lawn-mowers are classified into different classes. For example, as one of the leading manufacturers of lawn-mowers, Husqvarna produces the following three types of lawn-mowers:

• Robotic lawn-mowers for household purposes with small lawns (as shown in Figure 1.2).

• Walk-behind lawn-mowers for medium-scale lawns (as shown in Figure 1.3).

• Heavy-duty lawn-mowers for commercial use, such as stadiums, airports, municipalities, and gated communities (as shown in Figure 1.4).

Figure 1.2: Robotic lawn-mowers by Husqvarna

Figure 1.3: Walk-behind lawn-mowers by Husqvarna

Figure 1.4: Heavy-duty lawn-mowers by Husqvarna

Among the different classes of lawn-mowers, this research focuses on understanding customer usage of heavy-duty lawn-mowers. As a first step, we performed a systematic literature review to identify the different scenarios for understanding customer usage by analysing real-time usage data with machine learning algorithms. As there is very little previous work on lawn-mowers, studies conducted to understand the customers of vehicles (mainly cars) by analysing real-time usage data with machine learning algorithms were taken into consideration to identify scenarios. Although studies on vehicles other than lawn-mowers were considered, all the collected scenarios were discussed with the stakeholders at Husqvarna to formulate a best-fit scenario for lawn-mowers in terms of scope, suitability, and future significance.

After the scenario for lawn-mowers was formulated, another SLR was performed to determine the most suitable machine learning algorithms to apply to the data for that scenario. Finally, as a third step, an experiment was performed in which the selected algorithms were applied to the data provided by the company after the necessary preprocessing and normalisation steps. Figure 1.5 shows the flow of work of the research.

As discussed, from the many scenarios collected in the first systematic literature review (SLR-1), and after a brainstorming session with stakeholders at the company, this thesis attempts to understand customer usage by adopting the scenario 'Cluster the customer usage of lawn-mowers using Controller Area Network (CAN) bus time series data based on the driving patterns'. From the many machine learning algorithms available, SLR-2 suggests the K-means and Agglomerative clustering algorithms for clustering the usage data of lawn-mowers. Finally, the method chapter explains how the algorithms are applied to the data after the necessary operations. As a result of the experiment, we formulated different clusters of lawn-mower usage and compared the two algorithms using a metric called the V-measure.

Figure 1.5: Flow of work

1.3 Problem Statement

Although the lawn-mower market has great scope and value, there are no previous studies that use machine learning algorithms to understand customers' real-time usage of lawn-mowers.

After collecting the scenarios from SLR-1, and anticipating the great scope and significance of clustering the data in the field of lawn-mowers (as described in Section 5.1.3), we formulated the scenario 'cluster the customer usage of lawn-mowers using CAN bus time series data based on the driving patterns' based on suitability and the recommendation of the company. This research therefore focuses on developing a new model for clustering real-time usage data using the CAN bus time series data of lawn-mowers. We stress that all of the existing literature we reviewed in SLR-1 pertains to cars, not to lawn-mowers. Thus, this thesis attempts to understand the customer usage of lawn-mowers by adopting the best-suited car scenario, which clusters the CAN bus data.

1.4 Aims and Objectives

The thesis aims to understand customer usage of lawn-mowers in real time. This is achieved by clustering the customer usage of lawn-mowers using CAN bus time-series data based on driving patterns. To achieve this aim, the following objectives were set:

Objective 1: To identify the different scenarios for understanding the customer usage of lawn-mowers by analysing real-time usage data with machine learning algorithms.

Objective 2: To formulate, from the scenarios obtained, the scenario that best suits heavy-duty lawn-mowers and offers good scope for understanding customer usage.

Objective 3: To select suitable machine learning algorithms that can be applied to the data to evaluate the scenario selected to understand the customer.

Objective 4: To obtain the clusters of usage from the CAN bus time series data based on the driving patterns.

1.5 Research Questions

RQ 1: What are the different scenarios from which customer usage of lawn-mowers can be understood by analysing real-time usage data of lawn-mowers using machine learning algorithms?

Motivation: This question identifies the different scenarios in which customer usage of lawn-mowers can be evaluated by analysing lawn-mower usage data with machine learning algorithms. As described, there is little previous work in the field of lawn-mowers, so similar studies performed on cars are considered.

These scenarios are discussed with the stakeholders of the company to check their suitability and scope for lawn-mowers. Finally, a useful and well-fitting scenario is formulated as the output of this research question.

RQ 2: What are the suitable machine learning algorithms that can be applied to the usage data collected for the scenario from RQ1?

Motivation: Because no single machine learning algorithm is the best fit for all types of data [11], an SLR is performed to select suitable machine learning algorithms for the data obtained from the company in order to evaluate the scenario from RQ1.

RQ 3: How do these different scenarios (from RQ1) coupled with machine learning perform in analysing lawn-mower usage?

Motivation: The machine learning algorithms from RQ2 are applied to the data obtained for the scenario from RQ1 in order to understand customers' usage of lawn-mowers.

1.6 Thesis Structure

The thesis is divided into seven chapters:

Chapter 1: Describes the context and motivation of the project, the research questions, and the aims and objectives of the project.

Chapter 2: Presents the background concepts required for the thesis work.

Chapter 3: Reviews the related work performed to understand customers by analysing the usage data of vehicles.

Chapter 4: Describes the implementation of the SLR for RQ1 and RQ2 and of the experiment for RQ3.

Chapter 5: Presents the results and analysis of the SLR and the experiment carried out for the RQs.

Chapter 6: Discusses the validity threats and how the RQs are answered.

Chapter 7: Presents the conclusions and future work.

Chapter 2 Background

2.1 Machine Learning

Machine learning is an application of artificial intelligence in which systems learn and improve from experience automatically, without being explicitly programmed. The main focus of machine learning is therefore to develop computer programs that can learn from data on their own [12]. In general, simply looking at the data does not reveal much. In spam detection, for example, the occurrence of a single word is not informative, but the co-occurrence of certain words, together with the length of the mail and many other factors, gives a good picture of whether an email is spam or not. Machine learning techniques can extract such information from data. This is the main reason companies use machine learning to improve their business decisions, productivity, profits, and more [13].

Machine learning problems are divided into the following categories based on the learning process:

• Supervised learning: In this type of learning, algorithms learn from existing labelled data. During training, the model is given both the inputs and the corresponding outputs. After gaining knowledge during the training process, the model can be applied to new data [14].

• Unsupervised learning: The data has no labels. Generally, unsupervised algorithms are used to find hidden patterns or structures in the data. Unsupervised learning is commonly used to cluster the data into different groups [14].

• Reinforcement learning: Reinforcement learning closely resembles the way humans learn new things. The algorithms in this approach learn by trial and error, so they improve as they learn [15].

2.2 Unsupervised learning

As discussed, in unsupervised learning there is no desired output to achieve; instead, these algorithms are used to gain knowledge from the data. They are generally used for the transformation of data and for clustering. Transformation converts the data either to gain knowledge from it or to bring it into a form suitable for other machine learning algorithms. Clustering is used to identify hidden patterns and divide the data into groups [13].

2.2.1 Clustering

Clustering is one of the most widely used machine learning applications; it extracts knowledge from raw data by dividing it into different clusters. In clustering, data points that are similar to each other are placed in the same group. The similarity between data points in the same cluster or in different clusters is measured using distance metrics such as Euclidean distance, Manhattan distance, and correlation distance [14]. Clustering techniques are used in various fields such as pattern recognition, outlier detection, and image processing. Popular machine learning algorithms for clustering are:

• K-means

• Agglomerative

• DBSCAN

K-means

K-means is a simple, fast, non-deterministic, unsupervised machine learning algorithm that is most commonly used for clustering [16]. K-means partitions all the data points into a pre-defined number of clusters (K) [17]. It is an iterative algorithm that starts with K randomly allocated points as the centroids of the clusters and iteratively optimises these centroids to form the best clusters [16].

Working of the algorithm:

Figure 2.1: Working of K-means [18]

1. Choose K random centroids.

2. Form the clusters by assigning each data point to the nearest centroid according to a distance metric.

3. Compute the mean of the data points in each cluster to obtain the new centroid of that cluster.

4. Repeat steps 2 and 3 until the centroid of each cluster remains unchanged.
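
The four steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiment: the function name `k_means`, the squared Euclidean distance, the fixed random seed, and the one-dimensional sample values are all hypothetical choices made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical one-dimensional feature values forming two obvious groups.
sample = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels, centroids = k_means(sample, k=2)
```

Because the starting centroids are random, the numbering of the clusters can differ between runs even when the grouping itself is the same, which is one reason K-means is described as non-deterministic.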

Advantages

• Simple Implementation.

• Applicable for large data sets.

• Relatively fast in execution when compared to other algorithms.

Disadvantages

• Explicit selection of the number of clusters.

• Complexity in handling the outliers.

DBSCAN

DBSCAN stands for density-based spatial clustering of applications with noise. As the name indicates, this algorithm identifies clusters based on density and works well with noisy data, finding outliers effectively [19]. It assumes that clusters are regions with a high density of data points separated by regions with a low density of data points [20].

The two main parameters for the DBSCAN algorithm are

• min-points: The minimum number of points that need to be present in the region to identify it as a cluster.

• eps(ε): The distance from the point to locate the neighbouring points. For a particular point, all the points present in the radius of ε are considered as neighbouring points.

Important terms of the algorithm:

• Core point: A point that has at least min-points points within its surrounding radius of ε is called a core point.

• Border point: A point that has fewer than min-points points within its radius of ε but has at least one core point in that region is called a border point.

• Noise point: A point that is neither a core point nor a border point, that is, it has fewer than min-points points within its radius of ε and no core point among them, is called a noise point.
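
The three kinds of point can be told apart with a short sketch. It only labels points as core, border, or noise and does not build the clusters themselves. The function name `classify_points` and the sample coordinates are hypothetical, and the sketch assumes that a point counts as its own neighbour, which is one common convention.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify_points(points, eps, min_points):
    """Label each point as 'core', 'border' or 'noise'."""
    # Neighbourhood of each point: indices of all points within radius eps
    # (assumption: the point itself is included in its neighbourhood).
    neighbours = [
        [j for j, q in enumerate(points) if euclidean(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_points}
    kinds = []
    for i, nb in enumerate(neighbours):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in nb):
            kinds.append("border")  # too sparse itself, but next to a core point
        else:
            kinds.append("noise")
    return kinds


# Hypothetical 2-D points: a dense square, one point on its edge, one far away.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
kinds = classify_points(sample, eps=1.1, min_points=3)
```

With these values the four corners of the square are core points, the point at (2, 1) is a border point because its only neighbour (1, 1) is a core point, and (8, 8) is noise.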

Figure 2.2: Working of DBSCAN [13]

Advantages

• Unlike K-means, the explicit declaration of the number of clusters is not required.

• Outliers can be identified easily.

Disadvantages

• Needs an explicit declaration of values like min-points and ε.

• Cannot handle the data with high dimensions and varying density.

Agglomerative

The agglomerative clustering algorithm is one of the most common hierarchical clustering algorithms and works in a bottom-up manner [21]. The algorithm starts by treating each data point as a leaf node (an individual cluster) of a tree. It then repeatedly merges the closest clusters, as measured by a distance metric, to form clusters containing multiple data points. In this way, the algorithm reaches the required number of clusters [22].

This procedure creates a tree-like structure called a 'dendrogram' [18]. Figure 2.3 illustrates the working of the agglomerative clustering algorithm with a sample dendrogram.

Distance metric: In hierarchical clustering, the distance between two clusters is most commonly defined by a linkage criterion [21]. The formula below, which corresponds to single linkage, is

$$D(X, Y) = \min_{x \in X,\; y \in Y} d(x, y) \tag{2.1}$$

where x can be any data point in the cluster X, y can be any data point in the cluster Y, and d(x, y) is the distance between the two points.
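
To make the bottom-up merging concrete, the following naive sketch starts with one cluster per point and repeatedly merges the two clusters with the smallest single-linkage distance D(X, Y) until the requested number of clusters remains. The function names and the sample points are hypothetical, and the quadratic pair search is chosen for readability rather than speed.

```python
import math


def euclidean(p, q):
    """Euclidean distance d(x, y) between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(cluster_x, cluster_y):
    """D(X, Y): smallest distance between any x in X and any y in Y."""
    return min(euclidean(x, y) for x in cluster_x for y in cluster_y)


def agglomerative(points, n_clusters):
    # Every data point starts as its own cluster (a leaf of the dendrogram).
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i and drop the now redundant cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D points: two close pairs and one isolated point.
sample = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters = agglomerative(sample, n_clusters=3)
```

Recording the order and distance of each merge would give the information needed to draw the dendrogram.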

Advantages

• Easy to implement.

• Provides better visualisation of the formation of clusters.

Disadvantages

• Cannot handle the missing data.

• Requires more time for execution.

Figure 2.3: Working of Agglomerative [18]

2.3 Performance metric

Although it is hard to assess the performance of unsupervised clustering algorithms, as there is no ground truth to rely on, metrics such as the V-measure and the Silhouette score indicate the quality of the clusters formed. These metrics are typically used when selecting the number of clusters or choosing between algorithms.

2.3.1 V-measure

V-measure is the harmonic mean of homogeneity and completeness. It therefore compares the similarity of two clusterings based on these two factors. The value of the V-measure ranges from 0 to 1, where 1 indicates that the two clusterings are identical and 0 indicates that they are completely dissimilar [23].

• Homogeneity: Homogeneity measures whether each formulated cluster contains data points from a single class only. The value of homogeneity ranges from 0 to 1. It is based on the class entropy

$$H(C) = -\sum_{c=1}^{|C|} \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \log \frac{\sum_{k=1}^{|K|} a_{ck}}{N} \tag{2.2}$$

where |K| is the number of clusters, |C| is the number of classes, N is the total number of data points, a_{ck} is the number of data points of class c that are assigned to cluster k, and c and k index the classes and clusters respectively [24]. Homogeneity is then obtained as h = 1 − H(C|K)/H(C), where H(C|K) is the conditional entropy of the classes given the clusters.

• Completeness: Completeness checks whether all the data points from the same class are assigned to a single cluster. The value of completeness ranges from 0 to 1. It is based on the cluster entropy

$$H(K) = -\sum_{k=1}^{|K|} \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \log \frac{\sum_{c=1}^{|C|} a_{ck}}{N} \tag{2.3}$$

where |K| is the number of clusters, |C| is the number of classes, N is the total number of data points, a_{ck} is the number of data points of class c that are assigned to cluster k, and c and k index the classes and clusters respectively [24]. Completeness is then obtained as c = 1 − H(K|C)/H(K), where H(K|C) is the conditional entropy of the clusters given the classes.

The V-measure is represented as

$$V_\beta = \frac{(1 + \beta)\, h \cdot c}{\beta \cdot h + c} \tag{2.4}$$

where h indicates homogeneity and c indicates completeness. As with the F-measure, if the value of β is greater than 1, completeness is weighted more strongly in the calculation, and if β is less than 1, homogeneity is weighted more strongly [23].
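
As a worked illustration, the sketch below computes homogeneity, completeness, and V_β from two label lists using the entropy-based definitions, h = 1 − H(C|K)/H(C) and c = 1 − H(K|C)/H(K). The function names and the toy labels are hypothetical, and the natural logarithm is assumed (the base cancels in the ratios).

```python
import math
from collections import Counter


def entropy(labels):
    """H(labels) = -sum_l p(l) * log p(l)."""
    n = len(labels)
    return -sum((cnt / n) * math.log(cnt / n) for cnt in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from the joint counts a_ck."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (cnt / n) * math.log(cnt / given_counts[g]) for (g, _), cnt in joint.items()
    )


def v_measure(classes, clusters, beta=1.0):
    """Return (homogeneity, completeness, V_beta) for two labelings."""
    h_c = entropy(classes)
    h_k = entropy(clusters)
    # A labeling with a single label has zero entropy; the score is then 1.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
    return homogeneity, completeness, v


# Same grouping with swapped label names: the two clusterings agree perfectly.
same = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# Everything placed in one cluster: complete, but not homogeneous.
merged = v_measure([0, 0, 1, 1], [0, 0, 0, 0])
```

The first call shows that the score does not depend on how the clusters are numbered; the second shows how a single all-inclusive cluster drives homogeneity, and with it the V-measure, to zero.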

2.3.2 Silhouette Coefficient

Silhouette coefficient measures how well a data point fits its assigned cluster.

The value of the silhouette coefficient ranges from -1 to +1, where higher values indicate that the clusters assigned by the algorithm are suitable for the data points and lower values indicate that the data points are not assigned to well-suited clusters [25]. It is computed from the average similarity of a data point to its assigned cluster and its average dissimilarity to the other clusters [26]. The silhouette coefficient works relatively better for well-separated clusters based on density [25].

The formula for the Silhouette Coefficient of a data point is

$$\text{Silhouette Coefficient} = \frac{b - a}{\max(a, b)} \tag{2.5}$$

where a is the average distance between the data point and the other points in its own cluster (the intra-cluster distance), which indicates how similar the point is to its cluster, and b is the average distance between the data point and the points of the nearest other cluster (the inter-cluster distance), which indicates how dissimilar the point is to the other clusters.
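
The coefficient can be computed per data point as in the following sketch, where a is the mean distance to the other members of the point's own cluster and b is the mean distance to the members of the nearest other cluster. The function name, the Euclidean distance, the score of 0 for single-member clusters, and the sample values are assumptions made for illustration; at least two clusters are required.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_scores(points, labels):
    """Silhouette coefficient (b - a) / max(a, b) for every data point."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other points of the same cluster.
        own = [
            euclidean(p, q)
            for j, q in enumerate(points)
            if labels[j] == labels[i] and j != i
        ]
        if not own:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        a = sum(own) / len(own)
        # b: mean distance to the points of the nearest other cluster.
        other_means = []
        for other in set(labels) - {labels[i]}:
            d = [euclidean(p, q) for q, l in zip(points, labels) if l == other]
            other_means.append(sum(d) / len(d))
        b = min(other_means)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Hypothetical one-dimensional values with a sensible and a poor labeling.
sample = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(sample, [0, 0, 1, 1])
bad = silhouette_scores(sample, [0, 1, 0, 1])
```

For the first point, the sensible labeling gives a = 1 and b = 10.5, so the score is 9.5/10.5; the poor labeling gives a = 10 and b = 6, so the score is negative.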

2.4 Box plot

A box plot, also called a box-and-whisker plot, is a graphical representation of the distribution of numerical data. The box plot is mainly used to visualise outliers. It summarises the data using the quartiles, the interquartile range, the median, and the upper and lower limits. Outliers, which are regarded as abnormal behaviour, are the extreme data points that lie below the lower limit or above the upper limit [27].

The key terms in the box plot are:

• Median (Q2): The middle value of the data set, which divides it into two equal halves.

• Lower quartile or first quartile (Q1): The median of the first half of the data, that is, the median of the values between the smallest value and the median (Q2).

• Upper quartile or third quartile (Q3): The median of the second half of the data, that is, the median of the values between the median (Q2) and the largest value.

• Interquartile range (IQR): The region between the first quartile (Q1) and the third quartile (Q3). This region contains 50% of the data. The IQR can be expressed as

$$\mathrm{IQR} = Q_3 - Q_1 \tag{2.6}$$

• Lower limit or minimum: The value obtained by subtracting 1.5 times the IQR from the lower quartile, that is, Q1 − 1.5 × IQR. This is considered the minimum value in the data after removing the outliers.

• Upper limit or maximum: The value obtained by adding 1.5 times the IQR to the upper quartile, that is, Q3 + 1.5 × IQR. This is considered the maximum value in the data after removing the outliers.

• Outlier: The values in the data set that lie below the lower limit or above the upper limit are called outliers. For normally distributed data, about 0.35% of the data lies below the lower limit and about 0.35% lies above the upper limit.
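
The quantities above can be computed as in the following sketch, which takes the quartiles as medians of the lower and upper halves of the sorted data and uses the common convention of Q1 − 1.5 × IQR and Q3 + 1.5 × IQR for the limits. The function name and the sample values are hypothetical; plotting libraries may interpolate quartiles slightly differently.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, limits and outliers as used in a box plot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)
    # Q1 and Q3 are the medians of the lower and upper halves; for an odd
    # number of values the overall median is left out of both halves.
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[half + n % 2:])
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_limit or v > upper_limit]
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "outliers": outliers,
    }


# Hypothetical signal values with one extreme reading.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

Here Q1 = 2.5 and Q3 = 7.5, so the IQR is 5 and the limits are −5 and 15; the reading of 100 lies above the upper limit and is reported as an outlier.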

Figure 2.4: distribution of data using box plot

2.5 CAN bus

CAN bus stands for controller area network bus. It provides robust communication between the various electronic control units (ECUs) in a vehicle. The primary purpose of the CAN bus is to avoid point-to-point wiring between the ECUs and to provide a communication channel between them [28]. As the number of ECUs in present-day applications has increased, the CAN bus has become popular not only in cars but also in trucks, buses, agricultural equipment, railways, elevators, and even medical equipment and instruments [28].

The CAN bus can be treated as the central nervous system of the vehicle: it offers simple broadcasting between all the ECUs over two wires (CAN high and CAN low). Whenever an ECU wants to communicate with the others, it prepares the signal and broadcasts it on the CAN bus over these two wires. All the ECUs connected to the CAN bus receive the signal and check whether to accept or ignore it [29].

The CAN bus also prioritises communication with the help of IDs, so that a higher-priority ECU event gets access to the CAN bus immediately.
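The ID-based prioritisation can be illustrated with a small simulation. In standard CAN arbitration the frame with the numerically lowest identifier wins the bus when several ECUs start transmitting at the same time. The sketch below only models that rule; the `Frame` class, the ECU names and the identifier values are hypothetical and are not taken from the Husqvarna system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    can_id: int     # identifier; a lower value means a higher priority
    sender: str     # hypothetical ECU name, for illustration only
    payload: bytes


def arbitrate(pending):
    """Return the frame that wins the bus among simultaneously sent frames."""
    return min(pending, key=lambda frame: frame.can_id)


def transmit_order(pending):
    """Return the order in which the pending frames would get the bus."""
    return sorted(pending, key=lambda frame: frame.can_id)


# Three ECUs try to send at the same moment (made-up identifiers).
pending = [
    Frame(0x300, "lights", b"\x01"),
    Frame(0x100, "battery_management", b"\x46"),
    Frame(0x200, "motor_control", b"\x10\x27"),
]

winner = arbitrate(pending)
print(winner.sender)
print([hex(frame.can_id) for frame in transmit_order(pending)])
```

The frames that lose arbitration are not discarded; they are retried once the bus is free, which is why the sketch returns a full transmission order.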

Communication between ECUs via the CAN bus also provides scope for on-board diagnostics, where the signals exchanged between the ECUs can be extracted from the CAN bus and decoded. Moreover, owing to advances in data mining and data analysis, it has even become possible to collect a large number of communication signals between ECUs and use them to understand driving behaviour [30][31][32]. Figure 2.5 shows the communication of different ECUs through the CAN bus.

Figure 2.5: Communication of various ECUs through the CAN bus in Husqvarna lawn-mowers

Chapter 3 Related Work

In this chapter, we review studies that aim to understand customers' usage of vehicles by analysing usage data with machine learning, in order to identify scenarios applicable to lawn-mowers. We also describe studies that use CAN bus data to analyse and understand the usage of vehicles.

The authors of the study [33] analysed real-world driving data to identify the needs of users adapting to hybrid electric vehicles. Chiara, et al. followed two approaches to assess the suitability of fully electric and plug-in hybrid cars for users of current gasoline-powered cars. In the first approach, all trips are classified based on key factors such as trip duration, trip distance, and idle time. The results of the first approach are then used in the second approach to view all the trips in terms of daily usage. Further, the drivers are classified into two groups based on their usage. The steps performed in this study to implement the two approaches showed us a way to obtain the effective working sessions in this thesis work.

The study [7] shows how sensor data from vehicles can be utilised in many ways. He, et al. described many studies that use different types of sensor data from vehicles, such as location-based sensor data, surrounding traffic data, light detection and ranging data, and CAN data. The authors also explained how this data is used for driving behaviour analysis, detecting traffic operations, identifying traffic congestion, and making decisions. This study gives a brief idea of how to utilise sensor data from vehicles when collecting scenarios to understand customer usage.

The studies [34][35][36][37][38] focus on identifying and formulating driving patterns to understand the real-time usage of vehicles. These studies help to understand how driving patterns can be identified from usage data in different ways, and how they can be analysed, with techniques such as machine learning, to draw information about customer usage.

The authors of the study [35] formulated new driving cycles to understand real-world driving in China. The driving cycles are formulated based on the driving patterns of vehicles. Further, the study compares the newly formulated driving cycles for China with the existing driving cycles of Europe and the USA to understand the difference in usage of vehicles in different regions. The authors gave a new way to draw driving cycles, based on eleven driving parameters, to understand the usage of vehicles.

Lee, et al. in the study [36] explained how driving patterns help to understand and characterise the driving behaviour of electric cars. This study formulates the driving patterns by applying unsupervised machine learning algorithms to characterise driving behaviours. The driving patterns are obtained from the consumption of battery power, in order to support the battery management system, which is the main concern in developing electric cars. Though this study mainly focuses on electric cars, the authors described how to combine machine learning algorithms with real-time vehicle data to evaluate driving patterns and understand usage. This is of great significance for evaluating the driving patterns in this thesis work.

Analysis of driving styles has a great impact on understanding usage and designing intelligent vehicles. Unlike many studies, the authors of [37] extracted the driving patterns based on the dynamic steps and activities performed by the vehicle. Moreover, this study works on primitive driving patterns, which are considered smaller and simpler units of driving behaviour. The authors achieved this by applying a non-parametric Bayesian approach together with unsupervised clustering algorithms such as K-means to the real-time data. This gave a new idea for identifying driving patterns from the dynamic nature of the users.

Driving patterns, which indicate the driving behaviour of the users, greatly influence the energy potential of battery electric vehicles (BEVs) [38]. The study [38] examines whether the driving patterns formulated for traditional gasoline-powered vehicles are suitable for BEVs. Further, since a large number of parameters is required to describe the driving patterns, the study proposed a technique called 'Exploratory data analysis' to reduce the number of factors. Finally, the authors succeeded in identifying the relationship between the energy consumption of electric vehicles and their usage with the help of real-world usage data.

Qi, et al. in [34] analysed the differences in driving behaviour between the real-world environment and an indoor simulation environment. The authors collected data from vehicle sensors to draw the driving patterns in both environments, in order to understand the driving behaviours and identify the differences between them. Though the main focus of the study is to find the differences in driving behaviours, it clearly describes an approach for identifying driving behaviours from vehicle sensor data.

The studies [39][40][41] describe the way to understand the customer usage of the vehicles by clustering data using machine learning algorithms.

Li, et al. in [39] described the selection of driving patterns that describe the real-time usage of electric vehicles. This study developed a two-level clustering technique that provides a deep understanding of the usage of electric vehicles in order to identify the driving patterns. Of the two levels of clustering, the first level focuses on the analysis of time-series data, and the second level relies on the results of the first level to identify the driving patterns. Further, the preprocessing steps performed before the application of the clustering algorithm gave clear insights to the thesis work for dividing the entire data into session-based data.

Clustering the usage of vehicles and developing independent strategies for each cluster serves to reduce the operating cost associated with the vehicles [40]. The authors of the study [40] clustered the usage data of different vehicles such as cars, vans, and trucks by formulating a new approach called Artificial Bee Colony clustering. The clustering is achieved by grouping the vehicles with similarities in features such as fuel consumption, fuel economy, frequency of usage, frequency of refuelling, and value of refuelling. The study described the formulation of features such as vehicle value and an index for fuel consumption from the existing data.

Crozier, et al. in the study [41] identified five clusters to obtain the usage profiles of electric cars. The study aims to analyse the driving profiles from the real-time data of electric cars to predict vehicle use and also to compare the usage behaviours of electric cars and conventional cars. The distance covered and the average speed during trips are considered as the driving patterns in this study to understand the usage of vehicles. This research formulated the five clusters and also explained the usage of each cluster in detail.

The studies [42][43] focus on identifying outliers by clustering the usage data. These outliers are useful for distinguishing between the usual and unusual behaviours of users with vehicles.

Yun, et al. in [42] proposed a new framework called 'Monitoring Vehicle Outliers based on Clustering (MVOC)' to identify the outliers due to the complex states of the vehicle. Unlike several studies that derived clusters from the lifetime of components, this study identifies the outliers from various correlated components in the vehicle. The generated clusters with outliers are further analysed to obtain more information about the cause of the outliers. Moreover, the developed framework achieved better performance than traditional algorithms for obtaining the clusters. This study explained a new approach for obtaining clusters from usage data and identifying outliers by considering the correlation of data from different components.

As driving has become a daily activity for many people, Owsly, et al. in [44] described that driving is associated with quality of life. So, there is a great need to identify reckless driving behaviours that cause fatal accidents. The study [43] identified the outlier behaviours that deviate from usual driving behaviour by analysing driving data from smartphones. Unlike intelligent transportation systems and IoT, this study uses smartphones to collect the data. The authors compared various clustering algorithms and evaluated the most suitable algorithm for the data from smartphones.

The studies [31], [32] use CAN bus time series data for analysis. These studies describe how to preprocess the CAN data and how to convert the time series data into session-based data suitable for machine learning algorithms.

Fugiglando, et al. in [31] obtained clusters of human driving behaviour from CAN bus data. This study explains the steps for formulating the time series data into the format required by machine learning algorithms. The authors described how to obtain unlabelled data from the time series data in order to apply clustering algorithms. The study uses CAN bus data from Audi cars where no user was instructed to use the car in a particular way, so the entire experiment was conducted in an uncontrolled environment. Moreover, the study also identifies the best method of sampling the data to determine the minimum amount of driving data required to obtain consistent results.

The driving behaviour of a driver does not depend on a single parameter; it is a combination of multiple parameters and components. So Fugiglando, et al. in [32] derived a 'driving DNA' to understand drivers' behaviours by analysing the data from the CAN bus, which provides data from the various ECUs present in the vehicle. The authors achieved this based on four dimensions: braking, speeding, energy efficiency, and turning angle. Each driver is assigned a synthetic score based on these four factors. The analysis of CAN bus data based on the four factors is related to the risk of accident, driving comfort, and fuel efficiency.

Chapter 4 Method

A systematic literature review (SLR) is chosen as the research method for RQ1 and RQ2. As the primary motivation of these research questions is to gain knowledge from the existing literature, research methods such as a literature review or an SLR suit them better than other research methods. Further, compared with a literature review, an SLR offers a complete, comprehensive and valid picture of the existing evidence by documenting the results for further access, with proper quality assessment of the selected studies [45].

An experiment is chosen as the research method for RQ3 because, unlike descriptive methods such as literature reviews, surveys, interviews and case studies, an experiment deals well with quantitative data [46] and offers a cause-and-effect relationship between dependent and independent variables [47].

Other research methods, such as surveys and case studies, are excluded. A survey mainly focuses on collecting individual opinions [46]; since personal views from a group of people will not serve the purpose of the study, we excluded the survey.

A case study is a qualitative approach for exploring a process in detail and in depth [46]. A case study could be useful for the selection of scenarios, but it requires a long time to get results, so we excluded the case study.

4.1 Systematic Literature Review

The Systematic Literature Review in this thesis is performed according to the guidelines provided by Kitchenham in [45]. The process is performed in five steps, as follows

• Investigation of primary studies

• Criteria for the selection of research

• Assessment of quality

• Extraction of data

• Synthesis of data

After identifying the initial primary studies from the electronic libraries (after applying the inclusion, exclusion and quality assessment criteria), further studies were identified from the references of the selected primary studies using the technique called snowballing [48]. All the studies collected are analysed according to the three-pass approach described in [49].

For the initial search of primary studies, articles and journals, search engines such as BTH Summon, ResearchGate and Scopus are used. These search engines return results from digital libraries such as IEEE, ResearchGate, ScienceDirect, etc. The Systematic Literature Review will answer the following research questions

• RQ 1: What are the different scenarios from which customer usage of lawn-mowers can be understood by analysing real-time usage data of lawn-mowers using machine learning algorithms?

• RQ 2: What are the suitable machine learning algorithms that can be applied to the usage data collected for the scenario from RQ1?

So, a Systematic Literature Review is used to

• Identify diﬀerent scenarios to understand the customer usage of lawn-mowers by analysing the real-time data using machine learning algorithms.

• Formulate the most suitable scenario in the case of lawn-mowers from the scenarios collected.

• Identify the suitable machine learning algorithms to apply to the data for the formulated scenario.

4.1.1 Investigation of Primary Studies

Initially, for the search of primary studies to answer RQ-1, the following combinations of keywords are used

No | Search String
1 | ((data analysis) AND ((vehicle) OR (sensor)) AND PublicationYear > 2012)
2 | ((customer behaviour) AND (vehicle) AND PublicationYear > 2012)
3 | ((usage data) AND (analysis) AND (vehicles) AND PublicationYear > 2012)

Table 4.1: Search string for the SLR-1

The results generated from these search strings are used to formulate the scenario that best ﬁts in the case of lawn-mowers.

After the formulation of the scenario, the machine learning algorithms suitable for the data of the formulated scenario are identified (to answer RQ2). The primary studies are found using the following combinations of keywords

No | Search String
1 | ((machine learning) and algorithms)
2 | ((clustering) and (algorithms) and (machine learning))
3 | ((unsupervised) and (machine learning))

Table 4.2: Search string for the SLR-2

The results generated from these combinations of keywords are used to study different machine learning algorithms, their advantages and disadvantages, and their suitability for the data collected for the formulated scenario. This gives a set of machine learning algorithms that can be applied to the scenario to understand the customer usage of lawn-mowers by analysing the real-time usage.

4.1.2 Criteria for the selection of research

After the selection of the studies from digital libraries with the help of search strings, the inclusion and exclusion criteria of studies for both of the literature reviews are done as follows

 | RQ-1 | RQ-2
Inclusion criteria | 1. Studies in the English language | 1. Studies in the English language
 | 2. Studies that are available in full text | 2. Studies that are available in full text
 | 3. Studies that are published after the year 2012 |
Exclusion criteria | 1. Studies that are not in the English language | 1. Studies that are not in the English language
 | 2. Studies that are not available in full text | 2. Studies that are not available in full text
 | 3. Studies that were published before the year 2012 |

Table 4.3: Inclusion and exclusion criteria for SLR-1,2

4.1.3 Assessment of Quality

After the selection of the studies based on the inclusion and exclusion criteria, a quality assessment of the selected studies is performed according to Table 4.4.

No | Criteria | Result
1 | Whether the title of the study is related to the thesis? | Yes/No
2 | Whether the study focuses on how the vehicle is being used in the real-world environment based on real-world data? | Yes/No
3 | Whether the selected study is related to the data analysis of the data from vehicles? | Yes/No
4 | Whether the selected study draws the conclusion after analysing the vehicle data? | Yes/No
5 | Whether the selected study focuses on the data from various sensors connected to the vehicle? | Yes/No

Table 4.4: Quality assessment for RQ-1

If a study receives 'Yes' for more than two of these criteria, it is considered useful for the thesis.

In the same way, the quality of the studies selected for choosing suitable machine learning algorithms is assessed according to Table 4.5.

No | Criteria | Result
1 | Whether the study explained the different machine learning algorithms? | Yes/No
2 | Whether the study described the working implementation of machine learning algorithms? | Yes/No
3 | Whether the study summarised the advantages and disadvantages of each machine learning algorithm based on type of data? | Yes/No

Table 4.5: Quality assessment for RQ-2

If a study receives 'Yes' for more than one of the criteria above, it is considered useful for the thesis and is taken forward.

4.1.4 Extraction of Data

The following data is extracted from the studies obtained after assessing the quality

RQ-1 | RQ-2
1. Aim of the study. | 1. Title of the study.
2. Summary of the data collection and analysis process performed. | 2. Working implementation of machine learning algorithms along with advantages and disadvantages.
3. Scenario selected by the study to understand the customer behaviour. |

Table 4.6: Data Extraction

The studies selected for the extraction of data for RQ-1 and RQ-2 are presented in Appendix A. The synthesis of data (the final step in performing the SLR) for both RQ-1 and RQ-2 is described in section 5.1.

4.2 Experiment

The experiment is chosen as a research method for the RQ 3. The goal of the experiment is to ‘Cluster the customer usage of lawn-mowers using CAN bus time series data based on the driving patterns’. This is obtained as a result of the SLR.

As clustering of the data is an exploratory task, it is hard to identify the dependent and independent variables. However, when the entire experiment is summarised, the independent variable is the customer usage of the lawn-mowers, since it does not depend on anything else. In the same way, the dependent variables of the experiment are the clusters formulated and the performance of the clustering algorithms, as they rely on the customer usage data.

4.2.1 Tools

Software

• Python 3.7.2 with Py-Spark API

• Jupyter IDE

• Windows 10 Operating System

Hardware

• i5-6200U CPU @ 2.30 GHz 2.40 GHz

• 64-bit operating system

• 8GB RAM - 464SSD Disk

4.2.2 Software Environment

Python and tools like Py-Spark are used to import, analyse and visualise the data. Python provides an easy and understandable way to handle the data, and Py-Spark provides an efficient way to deal with big time-series data [13].

Python is a simple and easily implementable programming language, and its straightforward syntax has made it one of the most used programming languages. Python is also used extensively for developing machine learning models [50]. Various Python libraries such as pandas, numpy, sklearn, seaborn, matplotlib and plotly are used to manage, process and visualise the data effectively.

4.2.3 Collection of Data

Every lawn-mower is provided with a CAN bus that handles the communication among all the electronic control units (ECUs), such as the battery management system, the motor control units and the vehicle control unit, which is in turn connected to the front and rear lights, the speedometer and all other accessories. This communication takes place in the form of data frames that contain the information. These data frames are not in a human-understandable format. A device called the ‘System Logger’, connected to the CAN bus, collects all these frames from the CAN bus and converts the data into understandable information.

Apart from converting the data frames into information, the System Logger also uploads this information to cloud storage. This process is performed continuously, so data about all the components in the lawn-mower is collected at regular intervals (500 milliseconds in this case) whenever the ignition key of the lawn-mower is switched on.

4.2.4 Dataset Description

Out of a total of 2148 signals (details about each component) collected from the CAN bus, only six signals are considered for the analysis. The reason for selecting these six signals is explained in section 6.3. The six signals are

No | Signal name | Description | Values
1 | FRONT_TXPDO1.ActualSpeed (RPM) | Front axle speed measured in RPM | Continuous values within the range -1500 to 6000. A negative value indicates movement of the lawn-mower in the reverse direction, and a positive value indicates movement in the forward direction.
2 | MotorStatusExt_G4.BoardTemp (deg C) | Temperature of hydraulic pump in Celsius scale | Continuous values with a maximum value of 80
3 | BMS_A_SOC.Max (%) | State of charge in the battery | Continuous values within the range of 0 to 70
4 | WCU_TXPDO1.pedalPos (%) | Pedal position of the machine | Continuous values within the range -100 to 100, where a negative value indicates movement in the reverse direction and a positive value indicates movement in the forward direction
5 | WCU_TXPDO1.turnAngle (deg) | Turn angle of steering in degrees | Continuous values within the range -47 to 47, where negative values indicate a turn towards the left and positive values indicate a turn towards the right
6 | GEN_TXPDO1.ActualTorque (Arms) | Torque generated by the traction engines | Continuous values within the range 0 to 100

Table 4.7: Description of signals taken from the CAN bus time series data

Apart from these six signals, two additional signals, ‘Time (ms)’ and ‘WCU_TXPDO2.irisPTOOn ()’, are also considered for data pre-processing. The feature ‘Time (ms)’ indicates the time stamp, in milliseconds, associated with the collection of data from the CAN bus, and the feature ‘WCU_TXPDO2.irisPTOOn ()’ indicates whether the lawn-mower is working with an external device engaged.

No | Column (Signal name) | Description | Values
1 | Time(ms) | Timestamp at intervals of 500 milliseconds | Continuous values in multiples of 500, starting from 0
2 | WCU_TXPDO2.irisPTOOn () | Indicates whether the engaged external equipment is switched on or off. This external equipment can be a grass cutter, sweeper, snow thrower or snow blades. | Discrete values 0 and 1. One indicates that the connected external equipment is working, and zero indicates that it is not.

Table 4.8: Description of additional two signals taken from the CAN bus time series data

All the signals described in Tables 4.7 and 4.8 are collected at every 500 ms interval. So the data with eight columns (two + six) looks like Figure 4.1.

4.2.5 Experiment Implementation

The steps implemented in the experiment are according to guidelines provided in [31] to cluster the CAN bus time series data. The experiment is implemented in six steps as follows

1. Division of data into individual sessions

2. Extraction of features from CAN bus time-series data

3. Outlier removal

4. Data normalisation

5. Reducing the dimensions using Principal Component Analysis

6. Application of Unsupervised learning algorithms

Figure 4.1: CAN bus time series with required signals

Division of Data into Sessions

The entire CAN bus time series data is divided into individual sessions based on the feature ‘WCU_TXPDO2.irisPTOOn ()’. As previously discussed, the data contains eight (2+6) columns sampled at time intervals of 500 milliseconds (as shown in Figure 4.1). The column ‘WCU_TXPDO2.irisPTOOn ()’ in each instance of data contains the value 0 or 1. A run of consecutive instances with value 1 in this column shows the lawn-mower working with some external equipment connected, and each such continuous series of 1s indicates an individual session of a lawn-mower. Only these sessions are considered for the analysis, as they indicate purposeful work with the lawn-mower. The formulation of sessions from CAN bus time series data is shown in Figure 4.2. Further, of all the sessions obtained, only the sessions whose running time is greater than 1 minute are considered, i.e. sessions with more than 120 instances of data, since each instance is collected at an interval of 500 milliseconds and 120 instances are therefore collected per minute.
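The session division described above can be sketched in a few lines of pandas. This is an illustrative sketch under stated assumptions, not the code used in the thesis: the function name `split_into_sessions` and the toy data are made up, and only the column names ‘Time (ms)’ and ‘WCU_TXPDO2.irisPTOOn ()’ and the threshold of more than 120 instances come from the text.

```python
import pandas as pd

PTO_COL = "WCU_TXPDO2.irisPTOOn ()"
MIN_INSTANCES = 120  # 120 samples x 500 ms = 1 minute


def split_into_sessions(df, pto_col=PTO_COL, min_instances=MIN_INSTANCES):
    """Return one DataFrame per run of consecutive rows with pto_col == 1.

    Only runs with more than min_instances rows are kept.
    """
    engaged = df[pto_col] == 1
    # A new run starts whenever the engaged flag changes value.
    run_id = (engaged != engaged.shift()).cumsum()
    sessions = []
    for _, run in df[engaged].groupby(run_id[engaged]):
        if len(run) > min_instances:
            sessions.append(run.reset_index(drop=True))
    return sessions


# Toy data: three runs of 1s with 150, 60 and 121 instances.
flags = [0] * 10 + [1] * 150 + [0] * 5 + [1] * 60 + [0] * 5 + [1] * 121
toy = pd.DataFrame({
    "Time (ms)": [i * 500 for i in range(len(flags))],
    PTO_COL: flags,
})

sessions = split_into_sessions(toy)
print([len(s) for s in sessions])
```

The run of 60 instances lasts only 30 seconds and is therefore dropped, while the runs of 150 and 121 instances are kept as sessions.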


Figure 4.2: Formulation of individual sessions from CAN bus data

Features Extraction

Further, for each signal (each column) in each individual session obtained from the CAN bus time series data, seven features are calculated. For each individual signal, assume the data collected is in the format $(d_i, t_i)$, where $d_i$ indicates the data for the signal from the CAN bus, $t_i$ indicates the time stamp associated with the data, and $i \in \mathbb{N}$ (natural numbers).

1. Feature 1: Data collected ($d_i$) by the CAN bus at the time stamp $t_i$ [31].

2. Feature 2: Quotient ($q_i$) of the difference with the next value, collected 500 milliseconds later. This measures the intensity of change of the signal over each 500 milliseconds interval [31].

$q_i = (d_{i+1} - d_i)/(t_{i+1} - t_i)$ (4.1)

3. Feature 3: Time interval ($ti_i$): the difference between the time stamp of the current value and the time stamp of the local maximum identified just before the current value. This gives the frequency of occurrence of peak points [31].

$ti_i = t_i - t_{\text{local-max}}$ (4.2)

4. Feature 4: Local max ($lm_i$): The value of the local maximum identified before the current value [31].

5. Feature 5: Mean ($m_i$): The average of the signal values in the one-minute interval. Within one minute, data is collected from the CAN bus 120 times (once every 500 milliseconds). So the range contains the indices $R_i = \{i, i+1, \ldots, i+120\}$ [31].

$m_i = \sum_{x \in R_i} d_x / 120$ (4.3)

6. Feature 6: Median ($md_i$): The median of the signal values in the one-minute interval [31].

$md_i = \mathrm{median}(\{d_i, d_{i+1}, d_{i+2}, \ldots, d_{i+120}\})$ (4.4)

7. Feature 7: Standard deviation ($sd_i$): The standard deviation of the signal values in the one-minute interval [31].

$sd_i = \mathrm{standard\_deviation}(\{d_i, d_{i+1}, d_{i+2}, \ldots, d_{i+120}\})$ (4.5)
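The seven features listed above can be sketched for a single signal of a single session as follows. This is a hedged illustration, not the thesis implementation: the function name `extract_features` is made up, and several details the text leaves open are assumptions, namely that a local maximum is a sample higher than its predecessor and not lower than its successor, that the one-minute statistics use a forward window of 120 samples that shrinks at the end of the session, and that the population standard deviation is used.

```python
import numpy as np
import pandas as pd


def extract_features(d, t, window=120):
    """Compute the seven per-sample features for one signal of one session."""
    d = pd.Series(np.asarray(d, dtype=float))
    t = pd.Series(np.asarray(t, dtype=float))

    # Feature 2: forward difference quotient; undefined for the last sample.
    q = (d.shift(-1) - d) / (t.shift(-1) - t)

    # Assumed peak definition: higher than the previous sample and
    # not lower than the next one.
    is_peak = (d > d.shift(1)) & (d >= d.shift(-1))
    # Features 3 and 4: time since, and value of, the most recent local
    # maximum strictly before the current sample (NaN until a peak occurs).
    lm = d.where(is_peak).ffill().shift(1)
    ti = t - t.where(is_peak).ffill().shift(1)

    # Features 5-7: statistics over the window starting at the current
    # sample, computed as a trailing window on the reversed series.
    roll = d.iloc[::-1].rolling(window, min_periods=1)
    m = roll.mean().iloc[::-1]
    md = roll.median().iloc[::-1]
    sd = roll.std(ddof=0).iloc[::-1]

    return pd.DataFrame(
        {"d": d, "q": q, "ti": ti, "lm": lm, "m": m, "md": md, "sd": sd}
    )


# Tiny made-up signal sampled every 500 ms; a window of 3 keeps it readable.
d_example = [0, 2, 1, 3, 5, 4]
t_example = [0, 500, 1000, 1500, 2000, 2500]
features = extract_features(d_example, t_example, window=3)
print(features)
```

Applying the function to each of the six signals of a session would give the seven feature vectors per signal that the later steps operate on.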

After the formulation of features, the architecture of the data looks as shown in Figure 4.3.

Figure 4.3: Architecture of data computed after formulation of features for each signal

Outlier Removal

Let $v_{f,s}$ be the vector of any feature $f$ of any session $s$, formed by calculating the functions described in feature extraction. So, for every signal in each session, there will be seven vectors (one vector per feature).

Then, from the vectors $v_{f,s}$ of all sessions, another vector $V_f$ is formed for each feature of the signal in the following way.

$V_f = \bigcup_{s \in S} v_{f,s}$ (4.6)

Here $S$ is the set of all sessions. $V_f$ is formed by combining (performing the union operation on) the values of each feature from each individual session, as shown in Figure 4.4.

Figure 4.4: Formulation of vector $V_f$ from the vectors of individual sessions ($v_{f,s}$)

In the next step, in order to visualise the outliers, a box plot is drawn for each vector $V_f$. From the box plots obtained (shown in Tables 5.5 and 5.6), the distribution of outliers, i.e. the values above the upper limit and below the lower limit of the box plot, is visualised. Further, these outliers are removed from the vector at the suggestion of stakeholders at the company, as they are mainly caused by malfunctions and errors in the sensors, and analysis including them may deviate from the actual purpose and motivation of the work.
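The pooling of Equation (4.6) and the subsequent outlier removal can be sketched as follows. This is a minimal illustration under assumptions: the function names and the toy values are made up, the union is interpreted as concatenating the per-session vectors (so repeated values are kept), and the box-plot limits are taken as Q1 − 1.5 × IQR and Q3 + 1.5 × IQR with NumPy's default quartile interpolation.

```python
import numpy as np


def pool_feature(session_vectors):
    """Concatenate the per-session vectors v_{f,s} into one vector V_f."""
    return np.concatenate(
        [np.asarray(v, dtype=float) for v in session_vectors]
    )


def remove_outliers(pooled):
    """Drop values outside the box-plot limits; return kept values and limits."""
    q1, q3 = np.percentile(pooled, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    keep = (pooled >= lower) & (pooled <= upper)
    return pooled[keep], (float(lower), float(upper))


# Made-up values of one feature in three sessions, with two sensor glitches.
session_vectors = [
    [10, 12, 11],
    [13, 250, 12],
    [11, -90, 10],
]

V_f = pool_feature(session_vectors)
cleaned, limits = remove_outliers(V_f)
print(limits)
print(cleaned)
```

Because the limits are computed on the pooled vector rather than per session, a value is judged against the behaviour of all sessions together, which matches the description of applying the box plot to each $V_f$.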

Data Normalisation

Each vector ($V_f$) of each feature, formed after the removal of the outliers, is used to form a histogram ($HG_f$) with ten bins; let these be $H_1, H_2, H_3, H_4, H_5, H_6, H_7, H_8, H_9, H_{10}$, and let the ranges of the ten bins be $Hr_0, Hr_1, Hr_2, Hr_3, Hr_4, Hr_5, Hr_6, Hr_7, Hr_8$,


Abstract

Conclusions. This study identified customer behaviours by clustering their usage data. Moreover, clustering the CAN bus time-series data from lawn-mowers gave fresh insights into human behaviour and interaction with the lawn-mowers. The clusters formed offer considerable scope for classification and for developing an individual strategy for each cluster. Further, the clusters can also be useful for identifying the outlying behaviour of users and/or individual components.

Keywords: Usage data analysis, lawn-mowers, CAN bus, clustering, driving patterns.

Acknowledgments

We express our sincere gratitude to our supervisor, Abbas Cheddad, for his continuous support and guidance. We are also thankful to David S Hellström at the Husqvarna Group for his motivation, support, and for sharing his knowledge during the research.

We would like to thank our parents, family, colleagues, and friends for their support and encouragement.

Contents

Abstract
Acknowledgments

1 Introduction
1.1 Context and Motivation
1.2 Classification of lawn-mowers
1.3 Problem Statement
1.4 Aims and Objectives
1.5 Research Questions
1.6 Thesis Structure

2 Background
2.1 Machine Learning
2.2 Unsupervised learning
2.2.1 Clustering
2.3 Performance metric
2.3.1 V-measure
2.3.2 Silhouette Coefficient
2.4 Box plot
2.5 CAN bus

3 Related Work

4 Method
4.1 Systematic Literature Review
4.1.1 Investigation of Primary Studies
4.1.2 Criteria for the selection of research
4.1.3 Assessment of Quality
4.1.4 Extraction of Data
4.2 Experiment
4.2.1 Tools
4.2.2 Software Environment
4.2.3 Collection of Data
4.2.4 Dataset Description
4.2.5 Experiment Implementation
4.2.6 Performance metrics
4.2.7 Significance test
4.2.8 Selection of Algorithms

5 Results and Analysis
5.1 Systematic Literature Review
5.1.1 Synthesis of data from SLR of RQ-1
5.1.2 Scenarios collected from existing literature (Results of RQ-1)
5.1.3 Formulation of Scenario
5.1.4 Synthesis of data from SLR of RQ-2
5.1.5 Selection of Algorithms (Results of RQ-2)
5.2 Experiment
5.2.1 Formulation of Features
5.2.2 Visualising and removing outliers
5.2.3 Histogram Formulation
5.2.4 Clustering the data
5.3 Comparison of Algorithms
5.3.1 Significance test
5.4 Selection of Algorithm based on V-measure
5.5 Clusters formulated

6 Discussion
6.1 Answers to research questions