Exploring unsupervised anomaly detection in Bill of Materials structures.

Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Data Science

2019 | LIU-IDA/LITH-EX-G--19/024--SE

Exploring unsupervised anomaly detection in Bill of Materials structures.

Utforskande av oövervakad anomalidetektering i stycklistestrukturer.

Niklas Allard, Erik Lindgren

Supervisor: George Osipov
Examiner: Ola Leifler


Copyright (Upphovsrätt)

This document is made available on the Internet - or its future replacement - for a period of 25 years from the date of publication, provided that no exceptional circumstances arise.

Access to the document implies permission for anyone to read, download and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. A later transfer of the copyright cannot revoke this permission. All other use of the document requires the copyright owner's consent. To guarantee authenticity, security and accessibility, solutions of a technical and administrative nature are in place.

The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in a form or context that is offensive to the author's literary or artistic reputation or character.

For additional information about Linköping University Electronic Press, see the publisher's home page http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Siemens produces a variety of products that provide innovative solutions within areas such as electrification, automation and digitalization, some of which are turbine machines. During the process of creating or modifying a machine, it is vital that the documentation used as reference is trustworthy and complete. If the documentation is incomplete during the process, the risk of delivering faulty machines to customers increases drastically, causing potential harm to Siemens. This thesis aims to explore the possibility of finding anomalies in Bill of Materials structures, in order to determine the completeness of a given machine structure. A prototype that determines the completeness of a given machine structure by utilizing anomaly detection was created. Three different anomaly detection algorithms were tested in the prototype: DBSCAN, LOF and Isolation Forest. From the tests, we could see indications of DBSCAN generally performing the best, making it the algorithm of choice for the prototype. In order to achieve more accurate results, more tests need to be performed.


Acknowledgments

We would first like to thank Drake Analytics for providing us with this assignment and for all the help received, especially Per Malmlöv, Marta Ruiz Chaparro and Per Söderberg for all assistance regarding the project.

We would also like to give our thanks to our supervisor George Osipov and examiner Ola Leifler for all constructive feedback and help with our thesis.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
1.4 Delimitations
1.5 Background
1.5.1 Bill of Materials structure
1.5.2 Assumptions

2 Theory
2.1 Machine Learning
2.2 Data pre-processing
2.2.1 Missing data
2.3 Anomaly detection
2.4 Anomaly detection algorithms
2.4.1 Density-Based Spatial Clustering of Applications with Noise
2.4.2 Local Outlier Factor
2.4.3 Isolation forest
2.4.4 Cluster validation
2.4.4.1 Silhouette index
2.4.4.2 Adjusted Rand index

3 Method
3.1 Pre-study
3.2 Data description
3.3 Feature extraction for anomaly detection
3.4 Data preprocessing and summarization
3.5 Prototype
3.5.1 General approach of the prototype
3.5.2 Quality-measurement and completeness-measurement
3.5.3 Reference library
3.5.4 Generate reference
3.6 Evaluation
3.6.1 Data sample
3.6.2 Tukey's rule
3.6.3 Internal indices
3.6.4 External indices
3.7 Tools and libraries

4 Results
4.1 Algorithms and parameters
4.1.1 Internal indices evaluation
4.1.2 External indices evaluation
4.1.3 Prototype evaluation

5 Discussion
5.1 General
5.2 Results
5.3 Method
5.3.1 Algorithms
5.3.2 Prototype
5.3.3 Evaluation
5.3.4 Source discussion
5.4 The work in a wider context
5.4.1 Social aspects
5.4.2 Ethical aspects

6 Conclusion
6.1 Future work

Bibliography

A Appendix
A.1 DBSCAN Parameters SI


List of Figures

1.1 Example of a BOM structure, with connected BOM items
2.1 Two clusters with four neighbouring points demonstrating local and global anomalies in a two-dimensional dataset with variables 1 and 2. The points x1 and x4 represent global anomalies and the points x2 and x3 represent local anomalies. Based on [8, Fig. 2].
2.2 An example of how DBSCAN discovers anomalies.
2.3 An example of how LOF finds anomalies using densities. A has a lower density compared to the other points and will therefore be defined as an anomaly.
3.1 Data from table 3.1 represented in a BOM structure. Each sub-table represents a node and a sub-tree represents the construction of a component.
3.2 Bar diagram representing the distribution of the 30 most occurring numbers of materials for Component 1. Logarithmic scale on the y-axis.
3.3 Bar diagram representing the distribution of the 30 most occurring numbers of documents for Component 1. Logarithmic scale on the y-axis.
3.4 Bar diagram representing the distribution of BOM item features encoded as described in section 3.3 for Component 1.
3.5 Bar diagram representing the distribution of the number of children for Component 1.
A.1 DBSCAN parameters which achieved the highest SI.
A.2 DBSCAN parameters which achieved the highest ARI.


List of Tables

3.1 Example data that shows how the data sets are structured.
3.2 Example of a summarization of data from table 3.1.
4.1 With default parameters.
4.2 With a custom set of parameters.
4.3 LOF parameters within a custom set of parameters which achieved the highest silhouette score.
4.4 Some of the DBSCAN parameters within a custom set of parameters which achieved the highest silhouette score; for more parameters see fig. A.1.
4.5 Maximum adjusted Rand index achieved using a custom set of parameters. Values in range [-1 : 1], higher is better.
4.6 LOF parameters within a custom set of parameters which achieved the highest adjusted Rand index.
4.7 Some of the DBSCAN parameters within a custom set of parameters which achieved the highest adjusted Rand index; for more parameters see fig. A.2.
4.8 IF parameters within a custom set of parameters which achieved the highest adjusted Rand index.
4.9 Final parameters for DBSCAN.
4.10 The total QM and CM generated with DBSCAN using default and final parameters.


1 Introduction

This chapter gives an introduction to the thesis, defining preconditions as well as research questions. Section 1.1 motivates the research problem. Section 1.2 presents the aim of the thesis. Section 1.3 defines the research questions. Section 1.4 defines the delimitations. Lastly, section 1.5 gives a more detailed background description, introducing the reader to the given case as well as some pre-assumptions.

1.1 Motivation

Siemens is a large global company that was founded in 1847. They provide innovative solutions within different areas such as electrification, automation and digitalization, which means that they produce a variety of products, some of which are turbine machines. Each of these machines is composed of descriptions, different components and configurations.

One of the most vital pieces of information for Siemens' business is the description of which components each machine consists of and how each machine is configured. The oldest machines still in operation are from 1930. Many machines still have paper-based descriptions, while some are in a transition between paper-based and digitally maintained descriptions. Therefore, some descriptions can be faulty or incomplete.

During the process of creating or modifying a machine, the description is used as a reference. If the description is incomplete during the process, the risk of delivering faulty machines to customers increases drastically, causing potential harm to Siemens. Thus, it is critical that the description is trustworthy and complete.

Today, this control is done manually, a task that consumes a lot of time and resources due to the complexity of the machine structure. In order to reduce this time-consuming task, an application which automatically evaluates the completeness of a given machine structure is proposed.


1.2 Aim

This thesis aims to explore the possibility of finding anomalies in Bill of Materials (BOM) structures representing turbine machines. In order to analyze these BOM structures, an application that identifies and compares similar components will be created. With this application we hope to find to what degree a machine is documented, as well as which components deviate and need further investigation.

1.3 Research questions

1. Can anomalies be found in a BOM structure using anomaly detection algorithms?
2. Which existing algorithms for anomaly detection may be used?

1.4 Delimitations

Due to time constraints, the following delimitations have been set:

• Only three anomaly detection algorithms are applied in the thesis: DBSCAN, LOF and Isolation Forest.

• This thesis focuses only on finding non-contextual anomalies.

1.5 Background

In this section, a more detailed description of the BOM structure is provided to ease the understanding of the thesis. Assumptions regarding the dataset, which affect the choice of algorithms, are also provided.

1.5.1 Bill of Materials structure

Every machine and its corresponding description at Siemens is represented as a BOM structure. A BOM structure is constructed as a tree where each component is a node of its own. The structure of a component is represented by the subtree rooted at the node corresponding to that component. In the BOM structure, each node is represented by a description of a component. The description contains items such as texts, documents and materials, which are called BOM items. These BOM items are associated with the component the node represents. The highest level represents the finished product, and the leaves represent individual components. Each node contains information about that specific component, and this information can differ for the same component type since there are different manufacturers and configurations.


Figure 1.1: Example of a BOM structure, with connected BOM items.
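As a concrete illustration, the tree structure described above can be sketched as a small data type, where each node holds its BOM items and child components. This is a minimal sketch; the class, field names and toy data are invented for illustration and are not taken from the Siemens data.

```python
from dataclasses import dataclass, field

@dataclass
class BomNode:
    """One component in a BOM tree; holds its BOM items and sub-components."""
    equipment_number: str        # unique identifier of the node
    bom_items: list              # e.g. texts, documents, materials of this node
    children: list = field(default_factory=list)

    def item_count(self) -> int:
        """Total number of BOM items in the subtree rooted at this node."""
        return len(self.bom_items) + sum(c.item_count() for c in self.children)

# A toy machine: the root is the finished product; leaves are components.
machine = BomNode("EQ-1", ["text", "document"], [
    BomNode("EQ-2", ["material", "material"]),
    BomNode("EQ-3", ["document"], [BomNode("EQ-4", ["material"])]),
])
print(machine.item_count())  # 6
```

Traversing such a subtree is what later allows features (number of materials, documents, children) to be extracted per component.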

1.5.2 Assumptions

There is a plethora of algorithms within the research area of anomaly detection. Depending on what kind of anomalies are to be detected, as well as other defined preconditions, different algorithms are suitable. The choice of algorithms to compare is therefore based on a couple of assumptions:

• Similar components have a small variance in features, whereas extreme values imply an anomaly.

• There is no definition of what makes a component or machine have a complete description.

• The feature space will be very small, although larger than one feature.


2 Theory

In this chapter, all theory used in the method is described. Section 2.1 gives a general description of what machine learning is, when it is used, and three common types of machine learning algorithms. Section 2.2 describes what data pre-processing is and how it is used. Section 2.3 describes anomaly detection. Section 2.4 describes different methods used for anomaly detection and the chosen anomaly detection algorithms.

2.1 Machine Learning

Machine learning is a broad area within artificial intelligence, which utilizes statistics and statistical models to learn some function or pattern from given input data. Murphy defines machine learning as:

“A set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or perform other kinds of decision making under uncertainty” [14]

Within the area of machine learning there are several different methods depending on the task at hand, such as classification, regression and anomaly detection. Classification is the task of assigning data points to a number of groups. Regression is the task of assigning a numerical value to some input data by creating a function with estimated weights for each feature in the data set. Another common task is to find abnormalities in the input data; this is done by looking at a set of data and flagging objects with protruding values. [9] There are mainly three types of learning algorithms within machine learning: supervised, unsupervised and reinforcement learning.

Supervised algorithms map input data to some output given labelled data. Usually the available data is split into training and test sets in order to evaluate the accuracy of the model. [14]

Unsupervised algorithms, on the other hand, take some unlabelled input and then try to identify some underlying pattern. Because these algorithms use unlabelled data, it is


difficult to evaluate the output in an objective way. Instead, evaluation of these algorithms mostly focuses on the execution and not on the outcome of a model. [14]

Reinforcement learning algorithms optimize some function or behaviour depending on reward and punishment functions. [14]

2.2 Data pre-processing

Data pre-processing is a way to resolve problems that occur with real-world data, such as incomplete, faulty and inconsistent data. To ensure that the provided data does not contain these problems, is of acceptable quality and can be used as effectively as possible, data pre-processing is a necessary step. It is commonly used in machine learning, since it is an important step before feeding the data to an algorithm: data of poor quality can produce inaccurate and poor results.

2.2.1 Missing data

One of the big problems when using real-world data is that it often contains missing values. This often occurs because of human errors, such as forgetting to complete certain data. The data could also be missing because the source lacks the specific information. Depending on the reason why the data is incomplete, different methods can be applied. [10] Some of these methods are:

• Removing the source containing the missing data. This can be especially effective if the source has multiple missing values, but it may lead to the loss of important data that is then not considered in the analysis.

• Replacing the data with a mean value based on the values of other sources within the same label. This is best used when only a few values are missing and a mean value is appropriate for the label. The problem with this method is that it could bias the result if the dependent variable depends strongly on the variable with missing data.

• Categorizing the missing value as a specific value, suitable if there are multiple instances of missing data with the same label in the dataset.

2.3 Anomaly detection

Anomaly detection is a wide subject which recurs in several different areas such as cybersecurity, healthcare and production. Chandola et al. [4] describe anomalies as: “Patterns in data that do not conform to a well-defined notion of normal behaviour”.

An anomaly is not to be confused with noise. The concepts "noise" and "anomaly" are very similar, but there is a difference. Anomalies are deviations that are often of interest to find, such as a deviating network connection, which could indicate an attack on a network. Noise, on the other hand, can be described as data interference, such as an unclean sound signal that can be improved by removing the noise. Anomalies are of interest to find, in contrast to noise, which complicates the interpretation of data. [4]

There are three different kinds of anomaly types: point anomalies, contextual anomalies and collective anomalies. Point anomalies can be described as individual data points that deviate from the rest of the data. [4]

Contextual anomalies are anomalies that deviate with regard to some context [4], e.g. 30 °C in Sweden would not be a deviation during the summer months, but in the winter months


it would be a serious deviation.

Lastly, collective anomalies can be described as several data points that together deviate from the majority of the data points. [4]

Figure 2.1: Two clusters with four neighbouring points demonstrating local and global anomalies in a two-dimensional dataset with variables 1 and 2. The points x1 and x4 represent global anomalies and the points x2 and x3 represent local anomalies. Based on [8, Fig. 2].

Goldstein and Uchida [8] mention two other distinctions within anomaly detection: global and local anomalies. As seen in figure 2.1, x1 and x4 deviate heavily from the two clusters; these are seen as global anomalies. The points x2 and x3, on the other hand, deviate from their respective cluster, but could also be regarded as part of their neighbouring cluster. Points that deviate from a nearby cluster can be regarded as local anomalies.

2.4 Anomaly detection algorithms

Finding anomalies is a difficult task since it depends heavily on the problem at hand. Each anomaly detection method is suitable for a specific kind of problem.

The case for the problem at hand needs to be defined in order to choose a suitable anomaly detection method. Chandola et al. [4] state that a case is based on several factors, such as the type of anomaly in the problem, the nature of the given data, the labeling of the data, and what output is expected.

A data set can contain different attributes, also referred to as variables or features. These can be of types such as binary, categorical and continuous. Each data entry can have either multiple attributes or only one, and these attributes can be of different types.

Depending on how the data is labeled, different scenarios arise. If definitions for both the normal and the anomaly type exist, it is a supervised scenario. In a semi-supervised scenario, only the normal type is defined. If both the normal and the anomaly type are undefined, it is an unsupervised scenario. For each scenario, different algorithms are more or less suitable.


An anomaly detection algorithm can produce one of two types of output. One type is a score, where each data point is given a score depending on the degree to which it is considered an anomaly. The other type assigns a label, anomaly or normal, to each data point.

Depending on the combination of factors the case consists of, different anomaly detection algorithms can be applied. Therefore, it is important to define the case beforehand and then choose a suitable method that fits the specific case.

Unsupervised anomaly detection methods are based on the assumption that there will be far more data defined as normal than as anomalies. These methods do not require any labeled training data and can therefore be used in a variety of cases where labels are not available. [4] Two common groups of unsupervised anomaly detection algorithms are neighbor-based and clustering-based. These are mainly based on distance and density between points and do not rely on labeled relations in the data set, and are thus suitable for unsupervised cases. [4]

In a neighbor-based method, whether a point is an anomaly is determined by the characteristics of its neighbors. If a data point is far away from its neighbors and has a low density compared to other data points, it will be considered an anomaly. A data set can often contain varying densities between data points. To handle this problem, a common neighbor-based method, Local Outlier Factor (LOF), can be used. There are different variations of LOF, but these need more specific information about the case; plain LOF is used in the more general case. [4]

Clustering-based methods use clustering to define anomalies. Normal data points will belong to a cluster while anomalies will not. Clustering was not originally created for finding anomalies; this is rather a side effect. One common and straightforward method that is suitable for finding anomalies is Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DBSCAN also has different variations that are used in specific situations; for a general case, regular DBSCAN is used. [4]

Another group of unsupervised anomaly detection algorithms is decision-tree-based algorithms, such as Isolation Forest.

2.4.1 Density-Based Spatial Clustering of Applications with Noise

DBSCAN is a method designed to discover arbitrarily shaped clusters and identify outliers in a dataset using density-based clustering; an arbitrary shape here means any shape at all. DBSCAN uses two parameters, epsilon and minimum points, to discover clusters. Epsilon is the radius within which the algorithm searches for neighbours, and minimum points is the minimum number of neighbours a point needs in order to create a cluster. DBSCAN starts at an arbitrary point and searches for neighbouring points within the epsilon area. If the number of neighbours found is greater than or equal to minimum points, a cluster is created from the starting point and its neighbours; otherwise the point is marked as an outlier. The starting point is then marked as visited and the process is repeated recursively for each neighbour. When all points in the cluster have been visited, the whole process is repeated for the remaining unvisited points. The standard distance metric for DBSCAN is the Euclidean distance, the straight-line distance between two points. [7] For example, in figure 2.2 the yellow point will be determined to be an anomaly since it has not fulfilled the requirement of 4 neighbours inside its epsilon area.


Epsilon and minimum points are global parameters, which means that they are static during the clustering process. These parameters need to be chosen carefully since they depend heavily on the density and the sample size of the data, which greatly affects the result. This can lead to problems with data sets that have multiple clusters of varying densities, since each cluster would need a different set of parameters. [7]

The advantages of DBSCAN are that there is no need to specify the number of clusters and that it can find arbitrarily shaped clusters. DBSCAN can also explicitly identify outliers. [7]

Figure 2.2: An example of how DBSCAN discovers anomalies.
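The procedure above can be sketched with the scikit-learn implementation of DBSCAN (the package the pre-study names as the thesis's tool); the toy data here is invented purely for illustration.

```python
from sklearn.cluster import DBSCAN

# Five tightly packed points and one far-away point.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
     [10.0, 10.0]]

# eps is the epsilon radius; min_samples is the "minimum points" parameter.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # the five dense points share one cluster label; -1 marks noise
```

scikit-learn encodes the explicitly identified outliers with the label -1, matching the "marked as an outlier" step described above.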

2.4.2 Local Outlier Factor

The Local Outlier Factor uses a density-based scoring system to detect anomalies. The score is based on the similarity between the density around a point and the density around its neighbours; the density around an anomaly will differ from that of its neighbours. To estimate the local density around a point, the local reachability density (LRD) is computed for each neighbour. LRD uses the distance to each neighbouring point and the radius of each neighbour. LOF has one parameter, the number of neighbours k, which is the number of points needed to define a radius for a point. The neighbours and the radius for a point are determined by the k-nearest-neighbour method, which creates a radius containing the k neighbouring points of a specific point. The LRDs of the k neighbours are compared and averaged to determine a LOF score for the point. [2] For example, in figure 2.3 point A has a lower density (reachability) compared to the other points, and is therefore determined to be an anomaly.

The closer the LOF score is to 1, the higher the similarity between the local densities. A normal point with a density similar to its neighbours will have a score around 1, while an anomaly will have a larger score. The LOF score depends greatly on the number of neighbours k. A small k focuses more on local points, while a large k has a broader focus. Different k values need to be tested to guarantee that all outliers are found. [2]


Figure 2.3: An example of how LOF finds anomalies using densities. A has a lower density compared to the other points and will therefore be defined as an anomaly.
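A minimal sketch of the scoring behaviour with scikit-learn's `LocalOutlierFactor`; the data is invented, and k is set via `n_neighbors`.

```python
from sklearn.neighbors import LocalOutlierFactor

# A dense cluster plus one isolated point appended last.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
     [10.0, 10.0]]

lof = LocalOutlierFactor(n_neighbors=3)   # k = 3
pred = lof.fit_predict(X)                 # -1 = anomaly, 1 = normal
scores = -lof.negative_outlier_factor_    # LOF scores; close to 1 for normal points
print(pred)
print(scores.round(2))  # the isolated point's score is far above 1
```

Note the sign convention: scikit-learn stores the negated LOF score, so it is flipped here to match the "around 1 means normal, larger means anomaly" description above.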

2.4.3 Isolation forest

Isolation forest is a model-based method proposed by Liu, Ting et al. [12]. In contrast to other methods used within anomaly detection, which attempt to define normal sets with the anomalies as a byproduct, this method attempts to find anomalies directly by using isolation techniques; it is designed with the explicit purpose of finding and narrowing down anomalies. [12]

Isolation forest is designed with two assumptions about anomalies in mind. The first is that anomalies are often a small part of the dataset. The second is that the attributes of the deviating data points differ from those of the majority of data points. [12]

Isolation forest makes use of decision trees. Decision trees are built as trees where the nodes represent attributes and the edges between the nodes represent a choice for that attribute. Generally, methods such as Isolation forest create several decision trees, choosing different features at random as the split target. The decision trees are built by looking at subsets of the given dataset; the number of trees, as well as the size of the subsets, is defined by the parameter number of estimators. A point close to other points requires more traversing in the decision tree to isolate than a point which is separated from other points. The root node requires minimal traversing to access, and therefore deviating data points are found close to the root node. In order to find deviating data points, the average path length to the root is calculated. [12]
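A small sketch with scikit-learn's `IsolationForest` on invented data: the isolated point should be the easiest to isolate and therefore receive the lowest anomaly score. The score ordering is asserted rather than a hard label, since hard labels depend on a contamination threshold.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Twenty points near the origin and one isolated point appended last (index 20).
X = np.vstack([rng.normal(0, 0.1, size=(20, 2)), [[10.0, 10.0]]])

# n_estimators is the "number of estimators" parameter described above.
clf = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = clf.score_samples(X)  # lower score = shorter average path = more anomalous
print(int(np.argmin(scores)))  # 20
```

Short average path lengths correspond to points isolated close to the root, which is exactly the criterion described in the paragraph above.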

2.4.4 Cluster validation

When evaluating clustering algorithms, there are two factors to take into account. One part of the evaluation concerns the algorithm's ability to cluster. This is measured with internal indices, using methods such as the Silhouette index, described in section 2.4.4.1. Internal indices measure how well a clustering algorithm clusters without any prelabeled information. The other


part of the evaluation concerns the output of the clustering algorithm. This is measured with external indices, using methods such as the Adjusted Rand index. External indices measure how well an algorithm clusters with regard to some ground truth. [6]

2.4.4.1 Silhouette index

Silhouette index (SI) is a method used to decide how well data points have been clustered; it ranges from 1 to -1. A value of 1 indicates that the data point is far away from other nearby clusters, 0 indicates that the data point lies between clusters, and -1 indicates that it may be wrongly classified. The index can indicate whether a clustering algorithm is correctly configured or not: if there are many low or negative values, the algorithm likely needs to be tweaked; if most data points are instead close to 1, well-defined clusters exist. [16]
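For instance, with scikit-learn's `silhouette_score` on invented toy data, two well-separated clusters should score close to 1:

```python
from sklearn.metrics import silhouette_score

# Two well-separated clusters and their cluster labels.
X = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.1],
     [10.0, 10.0], [10.1, 10.1], [10.0, 10.1]]
labels = [0, 0, 0, 1, 1, 1]

score = silhouette_score(X, labels)
print(round(float(score), 2))  # close to 1 for well-defined clusters
```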

2.4.4.2 Adjusted Rand index

Adjusted Rand index (ARI) measures the agreement between two clusterings. A value close to 1 means that the compared clusterings are equal in their construction. As the index moves towards zero, the two clusterings agree no better than random. Negative values indicate that the two clusterings are, to different degrees, complementary. [17]
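A quick illustration with scikit-learn's `adjusted_rand_score` on made-up labelings; note that ARI only compares groupings, so renaming the cluster labels does not change the score:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1]

# The same grouping with swapped label names still gives a perfect score.
perfect = adjusted_rand_score(truth, [1, 1, 0, 0])
print(perfect)  # 1.0

# A partially agreeing clustering scores somewhere between 0 and 1.
partial = adjusted_rand_score(truth, [0, 0, 1, 2])
print(0.0 < partial < 1.0)  # True
```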


3 Method

In this chapter, the workflow and approach of the project are described. Section 3.1 describes the pre-study that was conducted, as well as the motivation for it. Section 3.2 presents the data structure. Section 3.3 describes the feature extraction. Section 3.4 describes the pre-processing steps performed and how the data set is summarized. Section 3.5 gives a detailed description of the proposed prototype. Section 3.6 describes the different evaluation steps. Lastly, the tools and libraries are described in section 3.7.

3.1 Pre-study

In the initial phase of the project, we conducted a literature study with the goal of finding different methods for identifying anomalies within unlabelled data. During this process, we narrowed our search by looking for methods suitable for our scenario and the data provided. This was done by studying related work within anomaly detection and machine learning, as well as other methods for finding differences between tree structures.

In a paper by Leung and Leckie [11], the problem of network intrusion is discussed and a clustering algorithm is proposed with the goal of finding anomalies in network data. In the paper, the authors define assumptions very similar to those defined in this thesis. The proposed algorithm is a density- and grid-based algorithm designed with unsupervised training in mind. In this phase, we also gathered information on different methods to evaluate clustering and unsupervised machine learning algorithms.

In their article, Goldstein and Uchida [8] execute a comparative evaluation of different anomaly detection algorithms for an unsupervised scenario. We used this article to find suggestions for machine learning algorithms that could be applied in our prototype. The paper summarizes each algorithm's functionality, and based on this we found LOF, which we chose because it is implemented in the scikit-learn Python package and worked well with our assumptions and data. According to Goldstein and Uchida [8], neighbor-based algorithms such as LOF perform better than clustering-based ones in most unsupervised scenarios, which is something we want to evaluate. Most of the algorithms listed in the article could not be used in our case since they were not implemented in scikit-learn, and

(20)

3.2. Data description

due to time constraint we did not implement them ourselves.

To compare the performance of neighbor-based and cluster-based algorithms, a cluster-based algorithm is required. Mumtaz and Duraiswamy [13] analyse and explain three different clustering-based algorithms. One of these is DBSCAN, which has the ability to find anomalies. Another advantage of DBSCAN is that it can also find arbitrarily shaped clusters, which could be useful in our case given the lack of information regarding the data. Because of these advantages, and because DBSCAN is implemented in Scikit-learn, we chose DBSCAN as our cluster-based algorithm.

The publications by Liu, Ting et al. [12] and Domingues, Filippone et al. [5] both show promising performance for the Isolation Forest algorithm. Liu, Ting et al. [12] mention advantages of Isolation Forest compared to other cluster-based and neighbour-based algorithms, such as no computational cost for distance calculation and the ability to handle high-dimensional problems with a large number of irrelevant attributes. Evaluating whether these advantages apply to our assumptions and data would be interesting, and we therefore chose to use Isolation Forest as well.

Within the scope of the pre-study, methods for comparing tree structures were also explored. Romanowski and Nagi [15] proposed in their paper a distance or similarity measure between two unordered trees, with a focus on bill of material structures. There is also a survey by Bille [1] on the problems of comparing labeled trees. Because of time constraints, we chose to delimit the work to machine learning algorithms, with a focus on clustering. The comparison of tree structures is instead left for further studies.

3.2

Data description

In this thesis, several data sets were used. Every data set had the same structure and represented the description of a machine, constructed as a BOM structure. Each row in a data set represents a BOM item, such as a text, a document or a material. In turn, each row is associated with a particular node in the BOM structure through the Equipment number, the unique identification of each node. Since a node represents the description of a component, a node can be associated with multiple BOM items. In addition to the Equipment number, each row has several other attributes:

• Article number, which defines the type of component associated with the row.

• Depth, which is the depth of the node in the BOM structure.

• Parent, which is the Equipment number of the node’s parent.

• Number of item, which is the amount of the specific type of BOM item the row represents.

For example, rows one, two and three in table 3.1 represent a single node, identified as 101 in the BOM structure, as seen in figure 3.1. Node 101 has two children, three documents and three materials, and the type of this component is 111-A. As seen in rows eight, nine and ten in table 3.1, multiple nodes can have the same Article number. This can also be seen in figure 3.4, where the Car has two wheels.


Equipment No   Article No.   BOM Item   Parent   Depth   No of item
101            111-A         Text       1        2       1
101            111-A         Document   1        2       3
101            111-A         Material   1        2       3
102            121-C         Document   1        2       1
102            121-C         Material   1        2       2
103            130-C         Material   102      3       1
103            130-C         Document   102      3       2
104            200-A         Material   101      3       1
105            200-A         Material   101      3       1
106            200-A         Material   103      4       1

Table 3.1: Example data that shows how the data sets are structured.

Figure 3.1: Data from table 3.1 represented in a BOM structure. Each sub-table represents a node and each sub-tree represents the construction of a component.

3.3

Feature extraction for anomaly detection

To determine the completeness of a machine description, each component in the machine needs to be analysed and compared with other components that have an identical Article number. The anomaly detection algorithms require a set of features to perform this analysis of the components. Therefore, four features were extracted from the rows sharing an identical Equipment number in the data set. The extracted features were:

• BOM item, an encoding of which combination of the BOM item types text, document and material the node has.

• No. of children, the number of children the node has.

• No. of documents, the number of documents the node has.

• No. of materials, the number of materials the node has.

To create the BOM item feature, material, document and text were coded as 1, 2 and 4, and the sum of the values present on a node was computed. These specific numbers were chosen because they make every combination of the three BOM item types yield a unique sum. The other three features were based on the Number of item attribute for each respective BOM item.
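The encoding described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the thesis code; the function name and input format are our own:

```python
# Hypothetical sketch of the BOM item encoding described above:
# material = 1, document = 2, text = 4, so every combination of the
# three item types maps to a unique sum between 0 and 7.
BOM_ITEM_CODES = {"Material": 1, "Document": 2, "Text": 4}

def encode_bom_items(item_types):
    """Encode the set of BOM item types present on a node as one integer."""
    return sum(BOM_ITEM_CODES[t] for t in set(item_types))

# Node 101 in table 3.1 has Text, Document and Material rows:
print(encode_bom_items(["Text", "Document", "Material"]))  # -> 7
# Node 102 has only Document and Material rows:
print(encode_bom_items(["Document", "Material"]))  # -> 3
```

The powers of two make the encoding a bitmask, so any combination of the three item types can later be decoded unambiguously.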

3.4

Data preprocessing and summarization

To be able to iterate through each component in a machine description, the rows sharing the same Equipment number needed to be summarized into a single row of data. During the summarization, each of the features mentioned in section 3.3 was calculated and added to the created row. Before the given data could be summarized, it needed to be complete and usable, which was achieved through data pre-processing: the data set was iterated through to find missing values, and if a value was missing, the row containing it was removed from the data set. Missing values were not replaced with other values, such as a mean value, since that can bias the result.

Equipment No   Article No.   BOM Item   Children   Depth   Documents   Materials
101            111-A         7          2          2       3           3
102            121-C         3          2          2       1           2
103            130-C         7          1          3       2           1
104            200-A         1          0          3       0           1
105            200-A         1          0          3       0           1
106            200-A         1          0          4       0           1

Table 3.2: Example of a summarization of the data from table 3.1.
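The summarization step can be sketched with pandas. This is an illustrative reconstruction under our own column names (following table 3.1), not the thesis implementation:

```python
import pandas as pd

# Sketch of summarizing rows that share an Equipment number into one row per
# component, as in tables 3.1 -> 3.2. Column and helper names are our own.
rows = pd.DataFrame({
    "EquipmentNo": [101, 101, 101, 102, 102],
    "ArticleNo":   ["111-A"] * 3 + ["121-C"] * 2,
    "BOMItem":     ["Text", "Document", "Material", "Document", "Material"],
    "NoOfItem":    [1, 3, 3, 1, 2],
})
rows = rows.dropna()  # drop rows with missing values instead of imputing

codes = {"Material": 1, "Document": 2, "Text": 4}

def summarize(group):
    # One output row per Equipment number, with the four extracted features.
    return pd.Series({
        "ArticleNo": group["ArticleNo"].iloc[0],
        "BOMItem": sum(codes[t] for t in set(group["BOMItem"])),
        "Documents": group.loc[group["BOMItem"] == "Document", "NoOfItem"].sum(),
        "Materials": group.loc[group["BOMItem"] == "Material", "NoOfItem"].sum(),
    })

summary = rows.groupby("EquipmentNo").apply(summarize)
print(summary)
```

Each group of rows with the same Equipment number collapses into a single feature row, matching the shape of table 3.2.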

3.5

Prototype

The prototype was created to demonstrate the possibility of using anomaly detection algorithms to identify anomalies in a BOM structure. The purpose of the prototype is to determine the completeness of a machine description with the help of anomaly detection algorithms. The prototype builds a reference library that is used to store an allowed range for a set of features for each specific component type.

3.5.1

General approach of the prototype

The completeness of a machine’s description is defined by all the components it consists of, implying that the description of each component in the structure needs to be analysed to determine whether the machine is complete. The prototype summarizes the data set as described in section 3.4 to represent the components in the structure. This summarized data set is defined as List of components in algorithm (2) and contains information for each unique Equipment number. In this summarized data set, each unique Equipment number represents a component, and the prototype iterates over them to determine whether the description of each component in the structure is complete, as algorithm (2) shows.


In order to determine whether a component is complete, the prototype needs a set of values to compare a component’s features against. Since such references did not exist, the prototype generates them by using the algorithms mentioned in section 1.4 to analyse and compare components with identical Article number. These components with identical Article number are represented as the Component pool in algorithm (2). Each time a reference is generated or updated, the current component’s features are compared to the reference’s allowed range in the reference library. Based on the number of features that match their corresponding allowed range, a quality-measurement is created. This quality-measurement can then be used to calculate how complete a component description is. This process was repeated for each component extracted from the machine data set. When every component in the structure had been iterated over, equation (3.2) was used to calculate the percentage describing how complete the description of that machine was.

3.5.2

Quality-measurement and completeness-measurement

The quality-measurement (QM) was used to determine the quality of the nodes in the machine structure, which represented the completeness of a component’s description. The measurement was based on the number of features in the node that matched their corresponding reference values. Depending on whether a feature matched its corresponding reference value, a true or false value was set for each feature, represented as QM_doc, QM_mat, QM_bom and QM_child.

Equation 3.1 sums QM_doc, QM_mat, QM_bom and QM_child and divides the sum by 4, the number of features. If all features contain valid values, the QM is one and the quality of the node is 100%. The prototype iterated through all nodes, except the node representing the whole structure, and calculated the QM for each individual node using the following equation:

QM = (QM_doc + QM_mat + QM_bom + QM_child) / 4    (3.1)

The completeness-measurement (CM) is a measurement of how complete a sub-tree in the structure is, and is also used to calculate how complete a machine’s description is. To calculate the CM for a machine, the prototype first iterated through all nodes in the tree structure, summed the QMs of the nodes, and then divided this sum by the total number of nodes in the tree, as in equation 3.2. The CM represents the percentage of complete nodes in the tree structure. The prototype also calculated the CM for each component, which could be useful when troubleshooting a machine.

CM = (Σ_{i=1}^{n} QM_i) / n    (3.2)
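Equations 3.1 and 3.2 translate directly into code. The sketch below is our own; the reference format (a (min, max) pair per feature) is assumed from the description of the reference library:

```python
# Minimal sketch of equations 3.1 and 3.2. The reference format
# (min/max per feature) is an assumption based on section 3.5.3.
def quality_measurement(features, reference):
    """QM: fraction of the four features that fall inside the reference range."""
    matches = [
        reference[name][0] <= value <= reference[name][1]
        for name, value in features.items()
    ]
    return sum(matches) / len(matches)

def completeness_measurement(qms):
    """CM: mean QM over the n nodes of the machine (equation 3.2)."""
    return sum(qms) / len(qms)

# Illustrative reference and node for one component type:
reference = {"documents": (0, 3), "materials": (1, 3), "bom": (1, 7), "children": (0, 2)}
node = {"documents": 3, "materials": 3, "bom": 7, "children": 2}
print(quality_measurement(node, reference))  # -> 1.0
```

A node with every feature inside its allowed range scores a QM of one, and the CM is simply the average QM over all nodes.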

3.5.3

Reference library

The prototype used a reference library to store references, where each reference represents the interval within which the description is deemed complete for a specific Article number. A reference contains the min, max and mean values of the four features: BOM item, number of children, total amount of material and total amount of documents. The library also stores the Article number, which represents the component type, the date when the reference was created, and the number of components used in the latest analysis. The library is stored in an SQLite database for easy access.
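One possible shape of such a table is sketched below. The column names and example values are our own guesses based on the description above, not the thesis schema:

```python
import sqlite3

# Hypothetical reference-library table: min/max/mean per feature, plus the
# Article number, creation date and component count described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reference_library (
        article_no     TEXT PRIMARY KEY,
        bom_min        REAL, bom_max REAL, bom_mean REAL,
        children_min   REAL, children_max REAL, children_mean REAL,
        materials_min  REAL, materials_max REAL, materials_mean REAL,
        documents_min  REAL, documents_max REAL, documents_mean REAL,
        created        TEXT,
        n_components   INTEGER
    )
""")
conn.execute(
    "INSERT INTO reference_library VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)",
    ("111-A", 1, 7, 4.2, 0, 2, 1.1, 0, 3, 1.5, 0, 3, 1.2, "2019-05-01", 42),
)
row = conn.execute(
    "SELECT bom_min, bom_max FROM reference_library WHERE article_no = ?", ("111-A",)
).fetchone()
print(row)  # -> (1.0, 7.0)
```

Keying the table on the Article number makes the lookup in algorithm (2) a single indexed query.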


3.5.4

Generate reference

A reference for a component was generated when there was no available reference for the specific component type in the reference library. The prototype retrieves all components with identical Article number from the provided component pool, defined as Similar components in algorithm (1). If more than one similar component was available, the reference was generated based on the result from the chosen anomaly detection algorithm. The algorithm analysed the available components with identical Article number, together with the current component’s features, to find components whose features were defined as anomalies. Based on the similar components, the algorithm produces a predicted result, i.e. the components predicted as anomalies. The prototype used the components deemed normal, derived from the predicted result, to create the max, min and mean values of each feature, which together represent the reference for the description of a certain component type. The reference was then stored in the reference library for easy access.

The reference value was only updated if more similar components were available during the analysis than the number of components used in the latest analysis. As mentioned in section 3.5.3, the library stored the number of components used in the latest analysis in order to reduce the number of unnecessary reference updates.

Algorithm 1: Generate reference

Input: Component pool
Input: Component

1  Similar components ← all components in Component pool with identical Article number
2  if |Similar components| > number of components used in the latest analysis then
3      Predicted result ← algorithm analysis of Similar components
4      Reference ← create reference based on Predicted result
5  else
6      Reference ← create reference based on the Component's features

Algorithm 2: General flow of the prototype

Input: Machine data set
Output: Completeness-measurement of the machine

1  List of components ← all components in the machine
2  Quality-measurement ← total quality-measurement of the machine
3  for component in List of components do
4      Component pool ← retrieve all available similar components
5      if component in reference library then
6          Reference ← retrieve existing reference from the reference library
7          QM ← (QM_doc + QM_mat + QM_bom + QM_child) / 4 (based on Reference)
8          Quality-measurement += QM
9      else
10         Reference ← generate new reference using algorithm (1)
11         QM ← (QM_doc + QM_mat + QM_bom + QM_child) / 4 (based on Reference)
12         Quality-measurement += QM
13 n ← number of components in the machine
14 Completeness-measurement ← (Σ_{i=1}^{n} QM_i) / n

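Algorithm 1 can be sketched in Python with DBSCAN as the anomaly detector (the thesis also evaluates LOF and Isolation Forest). The feature order, parameter values and the fallback when everything is flagged are our own illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sketch of algorithm 1 with DBSCAN as the anomaly detector.
def generate_reference(similar_components, eps=8.0, min_samples=5):
    """Return (min, max, mean) per feature over components not flagged as noise."""
    X = np.asarray(similar_components, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    normal = X[labels != -1]          # DBSCAN marks noise points with label -1
    if len(normal) == 0:              # everything flagged: fall back to all points
        normal = X
    return normal.min(axis=0), normal.max(axis=0), normal.mean(axis=0)

# 20 similar components plus one clear outlier
# (feature order assumed: BOM item, children, documents, materials).
pool = [[7, 2, 3, 3]] * 20 + [[7, 2, 50, 3]]
lo, hi, mean = generate_reference(pool)
print(lo, hi)
```

The outlier is labelled −1 and excluded, so the stored min/max range reflects only the components the algorithm considers normal.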

3.6

Evaluation

The quality-measurement and completeness-measurement calculated by the prototype are heavily dependent on the anomaly detection algorithm used in the analysis of the components. Thus, to ensure that the prototype produced an accurate estimation of the completeness of a given machine structure, the most suitable algorithm needed to be applied. Each of the chosen algorithms has scenarios in which it excels, and the algorithms needed to be tested with different parameters in order to find the most suitable configuration for the prototype.

3.6.1

Data sample

When performing the evaluations, there was a need for uniformity between the different measurements. In order to compare the results of the measurements, the input needed to be the same for the different tests. Another important aspect of the evaluations was the size of the inputs: the number of available article numbers can vary greatly, making it important that the algorithms were tuned for inputs of different sizes. To fulfil these criteria, data samples were created by extracting all components with a specific article number from the available components.

The data samples were created by looking at one type of component, referred to as Component 1. Component 1 has 6414 occurrences, with noticeable variation in the different features. The variation of the different features for Component 1 can be seen in figures 3.2, 3.3, 3.4 and 3.5. The same component is used throughout all the evaluations.

Figure 3.2: Bar diagram representing the distribution of the 30 most occurring numbers of materials for Component 1. Logarithmic scale on the y-axis.


Figure 3.3: Bar diagram representing the distribution of the 30 most occurring numbers of documents for Component 1. Logarithmic scale on the y-axis.

Figure 3.4: Bar diagram representing the distribution of the BOM item feature, encoded as described in section 3.3, for Component 1.


Figure 3.5: Bar diagram representing the distribution of number of children for Component 1.

Figures 3.2, 3.3, 3.4 and 3.5 show the distribution of Component 1. Figure 3.2 shows that there is a large span of different values for the material feature, with 0 as the most frequent value. The same trend can be seen in figures 3.3, 3.4 and 3.5, which show a large share of the components having zero documents, zero BOM items and zero children connected to them. If the zero values are ignored, one can still see well-defined boundaries for the material, document and children features. For the BOM item feature, the values are more spread over the different variations of the feature, and it will therefore not be used to define anomaly boundaries in the data samples, due to this inconsistency.

3.6.2

Tukey's rule

In order to evaluate the algorithms’ outcomes, i.e. whether each data point is an anomaly, a ground truth is needed. One way to define anomalies is to use Tukey’s rule, described in equations 3.3 and 3.4. Tukey’s rule makes use of the interquartile range (IQR), which describes the spread of the data around the median. The IQR is calculated as the difference between the upper and lower quartile (Q_high − Q_low). The IQR was calculated for each feature in the data sample and used to compute the boundaries for the different features. The following equations, following Tukey’s rule, were used:

LowerBound = Q_low − k · IQR    (3.3)

UpperBound = Q_high + k · IQR    (3.4)

where the lower and upper bounds are the boundaries for a feature, and Q_low and Q_high are the lower and upper quartiles. The constant k is a predefined value which, according to Carling [3], is often set to 1.5 in order to label about 0.7% of the data points as anomalies. For Tukey’s rule to work, the data set needs to be ordered.

Based on Tukey’s rule, we changed Q_low and Q_high to the bottom-5 and top-5 percentiles and used these as boundaries. The constant was left unchanged. The reason for not using the k-value to widen the boundaries for correct values was that the IQR became zero, due to the high occurrence of zero values. When the IQR is zero, the scalar does not work as intended for increasing the allowed range of values.

With the calculated boundaries, the data points in the samples whose feature values fell outside the boundaries were labelled as anomalies; these labels constitute the ground truth. It is important that the data samples have a similar distribution of anomalies and non-anomalies, to ensure a fair comparison between the different sample sizes. To achieve this, the fraction of anomalies in the data samples was calculated by dividing the number of anomalies by the total number of components. The fraction was then used to create samples of sizes 10, 100 and 1000 data points, which contained the same ratio of anomalies to non-anomalies. All the evaluations were done on these different-sized data samples.
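The ground-truth labelling can be sketched as follows: Tukey's rule (equations 3.3 and 3.4) with the quartiles replaced by the 5th and 95th percentiles, and k left at 1.5. The data values below are illustrative, not taken from the thesis:

```python
import numpy as np

# Sketch of the ground-truth labelling: Tukey's rule with Q_low/Q_high
# replaced by the 5th and 95th percentiles, as described above.
def tukey_bounds(values, low_pct=5, high_pct=95, k=1.5):
    q_low, q_high = np.percentile(values, [low_pct, high_pct])
    iqr = q_high - q_low
    return q_low - k * iqr, q_high + k * iqr

def label_anomalies(values, **kwargs):
    """True for each value outside the computed boundaries."""
    lo, hi = tukey_bounds(values, **kwargs)
    return [(v < lo or v > hi) for v in values]

# Illustrative feature column: mostly zeros, a few small values, two extremes.
materials = [0] * 90 + [1, 1, 2, 2, 3, 3, 4, 4, 120, 150]
print(sum(label_anomalies(materials)))  # -> 2
```

Note that with a majority of zeros the classic quartiles would both be zero, collapsing the IQR; the wider percentiles keep the bounds meaningful, which mirrors the motivation given in the text.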

3.6.3

Internal indices

To find the best configurations of the algorithms, the Silhouette index was calculated based on the previously created samples. The evaluation was done by iterating through different parameter intervals, with the goal of finding the highest Silhouette index for each sample size.
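Such a parameter sweep can be sketched with Scikit-learn. The data, the epsilon range and the fixed min_samples below are illustrative choices of ours, not the thesis configuration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Sketch of the parameter search: iterate over an epsilon range for DBSCAN
# and keep the configuration with the highest Silhouette index.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # two synthetic clusters
               rng.normal(5, 0.3, (50, 2))])

best = (None, -1.0)
for eps in np.arange(0.1, 2.1, 0.1):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    if len(set(labels)) < 2:          # silhouette needs at least two labels
        continue
    score = silhouette_score(X, labels)
    if score > best[1]:
        best = (eps, score)
print(best)
```

The same loop structure extends to LOF's neighbour count or to a two-dimensional grid over DBSCAN's epsilon and min_samples.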

3.6.4

External indices

The samples were also used to evaluate the outcome of the algorithms. The outcome was evaluated based on the adjusted Rand index, which uses the ground truth. The same algorithm configurations were used here as in the Silhouette index calculations.
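A minimal illustration of this external evaluation, with made-up labels (DBSCAN-style output marks anomalies with −1):

```python
from sklearn.metrics import adjusted_rand_score

# Compare predicted labels against Tukey-based ground-truth labels.
# ARI ignores label names, so -1 vs 1 does not matter; only the grouping does.
truth     = [0, 0, 0, 0, 1, 1]    # 1 = anomaly in the ground truth
predicted = [0, 0, 0, 0, -1, -1]  # same partition under different label names
print(adjusted_rand_score(truth, predicted))  # -> 1.0
```

Identical partitions score exactly 1.0, random labellings score around 0, and strongly disagreeing partitions can go negative, which matches the [-1, 1] range reported in the result tables.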

3.7

Tools and libraries

Within the area of data science, there is a multitude of tools and libraries to choose from, with different advantages and disadvantages. Python1 is a common programming language, often used for prototyping and analysis because of its simplicity and relatively good performance, which can be further enhanced through libraries such as Cython2. Another reason to choose Python as the development language was the multitude of common data science libraries that support it.

The libraries used were mainly Pandas3 and Scikit-learn4. Pandas was used to provide data containers and to manipulate the data in them in an efficient manner. Scikit-learn provided the different models as well as the evaluation methods. To enable visual comparison of the clusters created by the models, the library Matplotlib5 was used.

The SQLite6 engine was used to provide a simple way to store data. SQLite is an easy-to-use and fast database engine that is not client-server based; instead, it is embedded in the program with a database file, which reduces complexity.

1 Python, https://www.python.org/
2 Cython, https://cython.org/
3 Pandas, https://pandas.pydata.org/index.html
4 Scikit-learn, https://scikit-learn.org/stable/
5 Matplotlib, https://matplotlib.org/
6 SQLite, https://www.sqlite.org/index.html


4

Results

In this chapter, the results of the thesis are presented. Section 4.1 describes the component chosen for the samples. Section 4.2 presents the results of the evaluations done with internal and external indices on the different algorithms, as well as the result from the prototype using the chosen algorithm.

4.1

Algorithms and parameters

The prototype needed an algorithm suitable for the given data in order to achieve an accurate estimation of the completeness-measurement (CM) and the quality-measurement (QM) of a machine (see section 3.5.2). To determine the most suitable algorithm, the Silhouette index (SI) and the adjusted Rand index (ARI) were calculated for each of the chosen algorithms. Since Isolation Forest (IF) is not a clustering technique, it was not included in the calculations of the SI, which measures how well an algorithm clusters. The SI was calculated using the different samples described in section 3.6, with default parameters as well as with a set of custom parameters for each algorithm. The same samples and parameters were used in the calculations of the ARI, in order to make a fair judgment of the evaluated algorithms.

LOF has one parameter, the number of neighbours, which is 20 by default. The custom set of parameters is the range from 5 to the sample size minus 1, with a step of 1. DBSCAN has two parameters, epsilon and minimum samples; epsilon is 0.5 by default and minimum samples is 5 by default. The custom set of parameters is the range from 0.1 to 10 with a step of 0.1 for epsilon, and the range from 5 to 100 for minimum samples.

IF has one parameter, the number of estimators, which is 100 by default. The custom set of parameters is the range from 10 to 1000, with a step of 10.
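The custom parameter grids described above can be written out explicitly. This is our own construction mirroring the stated ranges, not the thesis code:

```python
# Sketch of the custom parameter grids described above.
def lof_grid(sample_size):
    # LOF n_neighbors: 5 .. sample size - 1, step 1
    return list(range(5, sample_size))

# DBSCAN: epsilon 0.1 .. 10.0 in steps of 0.1, min_samples 5 .. 100
dbscan_grid = [
    (e / 10, min_samples)
    for e in range(1, 101)
    for min_samples in range(5, 101)
]

# Isolation Forest: n_estimators 10 .. 1000 in steps of 10
iforest_grid = list(range(10, 1001, 10))

print(len(lof_grid(100)), len(dbscan_grid), len(iforest_grid))
```

Building epsilon from integers (`e / 10`) avoids the floating-point drift that `arange(0.1, 10.1, 0.1)` can introduce at the range endpoints.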


4.1.1

Internal indices evaluation

Maximum silhouette score achieved. Values in range [-1, 1]; higher is better.

Sample size   LOF     DBSCAN
10            0.04    0
100           0.84    0.68
1000          -0.15   0.63
Average       0.24    0.44

Table 4.1: With default parameters.

Sample size   LOF     DBSCAN
10            0.48    0.48
100           0.85    0.92
1000          0.75    0.84
Average       0.69    0.75

Table 4.2: With a custom set of parameters.

In table 4.1, the highest SI was calculated using the default parameters stated at the beginning of section 4.1 for each algorithm. With default parameters, DBSCAN achieved the highest score only for sample size 1000, while LOF achieved the highest score for the other sizes; for size 1000, LOF achieved a drastically lower score. Because of this low score, DBSCAN achieved the highest average SI when default parameters were used, as can be seen in table 4.1. As seen in table 4.2, DBSCAN achieved a higher SI than LOF for sample sizes 100 and 1000, resulting in DBSCAN also having the highest average SI over all sample sizes when custom parameters were used.

Sample size   No. of neighbors     Silhouette score
10            5-8                  0.48
100           34-89                0.85
1000          523-905, 943-950     0.74-0.75

Table 4.3: LOF parameters within the custom set which achieved the highest silhouette score.

Sample size   Epsilon   Minimum samples   Silhouette score
10            8-9.8     6-8               0.48
100           9.6-10    5-7               0.92
1000          9-9.9     16-28             0.84

Table 4.4: Some of the DBSCAN parameters within the custom set which achieved the highest silhouette score; for more parameters, see fig. A.1.

4.1.2

External indices evaluation

In table 4.5, the highest ARI was calculated using the custom parameters stated at the beginning of section 4.1 for each algorithm, comparing the predicted labels produced by the algorithms with the true labels derived from the samples. As seen in table 4.5, DBSCAN achieved the highest ARI for all sizes except size 10, where IF achieved the highest ARI. Due to its generally high indices, DBSCAN had the highest average. The algorithm which found the most anomalies and achieved the highest average ARI was DBSCAN.

(31)

4.1. Algorithms and parameters

Sample size   LOF     DBSCAN   IF
10            -0.11   0.52     1
100           0.61    0.77     0.61
1000          0.64    0.96     0.53
Average       0.38    0.75     0.71

Table 4.5: Maximum adjusted Rand index achieved using a custom set of parameters. Values in range [-1, 1]; higher is better.

Sample size   No. of neighbors     Adjusted Rand score
10            5-8                  -0.11
100           34-89, 94            0.61
1000          264-502, 916-920     0.63-0.64

Table 4.6: LOF parameters within the custom set which achieved the highest adjusted Rand index.

Sample size   Epsilon   Minimum samples   Adjusted Rand score
10            6-7.9     5-8               0.52
100           10        5-9               0.77
1000          8.6-8.8   25-29             0.96

Table 4.7: Some of the DBSCAN parameters within the custom set which achieved the highest adjusted Rand index; for more parameters, see fig. A.2.

Sample size   No. of estimators                                          Adjusted Rand score
10            10-20, 40-990                                              1
100           10, 20, 40, 90, 250, 360, 430, 480, 830, 870, 880, 900     0.61
1000          110, 20                                                    0.51, 0.53

Table 4.8: IF parameters within the custom set which achieved the highest adjusted Rand index.

4.1.3

Prototype evaluation

Based on the results in section 4.1, an algorithm and a set of parameters were chosen for the prototype. The algorithm which achieved the highest Silhouette score and ARI was determined to be the most suitable algorithm for the prototype. As seen in tables 4.1, 4.2 and 4.5, DBSCAN was the best-performing algorithm in all three cases; therefore, DBSCAN was used in the prototype.


Data size       Epsilon   Minimum samples
n <= 99         7.9       5
99 < n <= 999   10        5
n > 999         8.6       25

Table 4.9: Final parameters for DBSCAN.

To evaluate the chosen algorithm, it was implemented and used in the prototype with two different configurations. The first configuration was DBSCAN with its default parameters, as stated at the beginning of section 4.1, and the second was DBSCAN with the custom parameters listed in table 4.9. The parameters in table 4.9 are derived from table 4.7, i.e. the parameters used to achieve the highest ARI for each sample size. The parameters were arbitrarily picked from table 4.7, since all the listed parameters achieved the same result for their respective sample sizes.

Parameters   No. of components   Total QM   CM
Default      240                 240        1
Final        240                 239.75     0.998958333

Table 4.10: The total QM and CM generated with DBSCAN using the default and final parameters.


5

Discussion

In this chapter, a final discussion is held regarding the choice of methods as well as the results of the performed tests. Section 5.1 contains a discussion with regard to chapter 4. Section 5.2 discusses the chosen method as well as the sources on which this thesis is based. Section 5.3 discusses the work in a wider context, covering the social aspects in 5.3.1 and the ethical aspects in 5.3.2.

5.1

General

In this thesis, several anomaly detection algorithms have been tested and evaluated on data sets representing BOM structures. The goal has been to investigate the possibilities of finding anomalies in these structures using the previously mentioned types of algorithms. After the performed tests and evaluations, we can see that it is possible to find anomalies in the data set. As defined in section 2.4, an anomaly can be both a data point deviating from the majority of data points and a point deviating with regard to some context. Since we have too little information to conclude whether data points deviate contextually, we have only looked at values deviating from the majority in terms of distance between points and clusters. Since there is no ground truth, we cannot know whether points with deviating distances should be classified as real anomalies.

We also wanted to find out which algorithms could be used to find anomalies in BOM structures. As seen in section 4.1, all of the chosen algorithms can find anomalies, but DBSCAN performed the best in general, both when evaluating internal and external indices. We therefore concluded that DBSCAN appears to be the best algorithm for finding anomalies in BOM structures.

5.2

Results

As defined in chapter 1.4, the focus of this thesis was to find and extract non-contextual anomalies. There are parts of the result which can be questioned and which touch upon contextual deviations. One such part is the high number of occurrences of zero-valued components. Looking at the analysed component in the result, the majority of its occurrences had zero materials and documents. For the features with non-zero values, there was some tendency towards a range of values, but the spread was still very high. This becomes a problem when tweaking the parameters of the algorithms as well as when trying to evaluate the output. Since there exists no definition or context for the components in general, it becomes difficult to define correct references for a component type.

One hypothesis which laid the base of our proposed prototype was that most occurrences of a component are similar in structure, with few deviating values. This seems not to be the case, as can be seen in figures 3.2, 3.3, 3.4 and 3.5. Because of this behaviour in the data, we need more information about a component in order to draw conclusions about whether the proposed boundaries are correct or not.

Because of time constraints, only one sort of component was used in the evaluations. The component was picked based on two factors: it had a very high number of occurrences in the data set, and it had deviating values, making it a node rather than a leaf. In order to make correct comparisons of the results of the different evaluations, we used the same component and the same data points. By analysing only one component, we got an indication of how the data is structured as well as of how the different algorithms should be tweaked. In order to draw more accurate conclusions, more tests should be performed on other kinds of components.

When deciding which of these algorithms to choose for the prototype, several measurements were performed, with regard to both internal and external indices. The performance of the algorithms depends heavily on the choice of parameters, as shown in tables 4.1 and 4.2. As we can see in these tables, the algorithms perform differently depending on the configuration as well as the size of the input data. Only certain sets of parameters achieved the highest score, as seen in tables 4.3, 4.4, 4.6, 4.7 and 4.8. The default parameters stated at the beginning of section 4.1 are derived from the defaults in Scikit-learn's documentation1. The custom sets of parameters, also stated at the beginning of section 4.1, are estimations of parameters that will yield different outputs, based on a few tests. For LOF, the number of neighbours cannot be higher than the sample size, and it is therefore dynamic with respect to the sample size. The DBSCAN and IF parameters need further testing to ensure that there are no other significant parameters.

When choosing an algorithm for the prototype, it is important that the algorithm performs equally well over different-sized inputs, as the amount of available data for each component in the machines varies to a great degree. With this in mind, we can see in tables 4.1 and 4.2 an indication that DBSCAN performs the best in general when looking at internal indices.

Looking at external indices, we can see from table 4.5 that the algorithm with the highest ARI was DBSCAN, with IF following closely. The table also shows that DBSCAN performed better as the input data grew in size, while the reverse applied to IF. Since the input size for the different components will probably increase with time, DBSCAN is to be preferred.

From figure A.1 we can see that there is a large number of different configuration variations of DBSCAN which achieve a high ARI. From these figures we can also see some parameters which overlap between the different-sized data sets. As mentioned earlier, one important aspect when configuring the algorithm is to make it perform equally well on different-sized inputs. It is therefore suitable to choose overlapping parameters to as large a degree as possible. Because of the low number of tests performed, it is hard to draw any conclusion regarding the performance of the algorithms. When only a few measurements have been made for each sample, we can only claim to show an indication of better performance. Even though the structures of the components are similar, we cannot say for certain that the chosen component is representative of all the other Siemens components.

Evaluating the impact of DBSCAN in the prototype is a hard task, due to the lack of information regarding the machines and the components. The data received had no true labels which could be used to evaluate how well DBSCAN performs when used in the prototype. Figure 4.10 shows the QM and CM that the prototype generates when using DBSCAN with the default and final parameters. In the prototype, different parameters will yield different reference values for a component, which are then used to determine whether a component in the machine has complete documentation. As seen in figure 4.10, when DBSCAN was used with custom parameters, it produced a lower CM than with the default parameters, since it created a reference for a component sort that a component could not pass. This can be interpreted either as the custom parameters yielding a more accurate reference, which a component fails to pass, or as the custom parameters yielding a faulty reference. This is where labeled data could be used to determine whether the custom parameters affected the result positively or negatively.

5.3 Method

5.3.1 Algorithms

In chapter 1.4 a second delimitation is defined regarding the number and choice of algorithms to evaluate. As mentioned before, there exist many methods and techniques for finding anomalies in data sets. In order for the work to be manageable within the given time frame, we had to limit ourselves to a subset of these methods and techniques. Because of the assumption that most components of the same type deviate little from each other, this thesis has focused on clustering algorithms.

Clustering algorithms are designed to divide similar data points into groups. Finding anomalies using these algorithms is therefore possible, but anomaly detection was not a design goal when these algorithms were built. Therefore, in addition to the two clustering algorithms, one additional algorithm was included. The third algorithm, Isolation Forest, is built on decision trees and is specifically designed to find anomalies.

When deciding which algorithms to use, the major deciding factor was how commonly they are used. LOF and DBSCAN are commonly used clustering algorithms, and IF is a fairly commonly used algorithm for finding anomalies. As mentioned, IF is designed to find anomalies, which is the reason it was chosen. The reason we chose LOF was its simplicity, both in its clustering technique and in its parameters. DBSCAN is the opposite of LOF: it uses a more complex technique that can even find arbitrarily shaped clusters. It was interesting to investigate whether this would have an effect in the prototype. Due to the more complex technique, the number and complexity of the parameters increase.
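The contrast between the two approaches can be illustrated on synthetic data (the data, seed, and parameter values below are assumptions for the sketch): IF isolates anomalies with randomized trees, while DBSCAN flags anomalies indirectly, as points belonging to no dense region.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),  # one dense cluster
               [[5.0, 5.0]]])                      # one clear outlier

# IF isolates points with randomized trees; fit_predict returns -1 for anomalies.
if_labels = IsolationForest(random_state=0).fit_predict(X)

# DBSCAN assigns points that belong to no dense region the noise label -1.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

print(if_labels[-1], db_labels[-1])  # both methods flag the distant point: -1 -1
```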

5.3.2 Prototype

When running the prototype, we discovered a problem regarding the varying amounts of available components of a given sort. When creating reference values for components with
