Unsupervised anomaly detection for structured data - Finding similarities between retail products


Master Thesis

HALMSTAD UNIVERSITY

Degree of Master of Science in Engineering, Computer Science and Engineering, 300 credits


Computer Science and Engineering, 30 credits

Halmstad 2021-06-16

Jonas Fockstedt, Ema Krcic


Jonas Fockstedt & Ema Krcic: Unsupervised anomaly detection for structured data - Finding similarities between retail products, Master of Science in Computer Science and Engineering, © June 16, 2021


ABSTRACT

Data is one of the most important factors in modern business operations. Bad data can therefore lead to tremendous losses, both financially and in customer experience. This thesis seeks to find anomalies in real-world, complex, structured data, which cause an international enterprise to miss out on income and potentially lose customers. Using graph theory and similarity analysis, the findings suggest that certain countries contribute to the discrepancies more than others. This is believed to be an effect of countries customizing their products to match the needs of their markets. This thesis only scratches the surface of the analysis of the data, and the opportunities for future work are therefore many.


ACKNOWLEDGEMENTS

We want to express our sincerest gratitude to our supervisor, Zahra Taghiyarrenani, for always finding the time to help and guide us through this thesis. She provided us with constructive feedback that improved our work, and her enthusiasm motivated us through difficult times. Despite our baffling emotions, she always gave wise advice with great kindness and warmth.

We want to specifically thank Slawomir Nowaczyk, who showed great enthusiasm for our project. Although he is a very busy man, he set aside time for meetings with us, where he gave advice and helped us improve our work. His curiosity and his brilliant way of explaining and questioning problems from different perspectives made us more critical and observant of our work.

A warm thank you towards Jayway in Halmstad, who made our work possible by letting us study the database, use their office, and get counseled by the industry’s most brilliant minds. We would like to specifically thank Vilhelm Persson, who gave us incredible insights into the complex database. Despite his busy schedule, he set aside time to answer all of our confusing questions. We would also like to thank Nils Persson for supplying us with the necessary material and handling the communication with third parties. Additional thanks to Nils-Olof Bankell for welcoming us into the company and allowing us to work with such an exciting subject.

And finally, a big thanks to our families for the constant love and support. As the great Michael J. Fox once said ”Family is not an important thing, it’s everything”. We could not agree more. We love you, and we are blessed to have you in our lives!

Yours truly, Ema & Jonas.


CONTENTS

1 Introduction
1.1 Problem definition
1.2 Problem Approach
1.3 Purpose & goals
1.4 Thesis structure
2 Theoretical background
2.1 Graph theory
2.2 Similarity analysis
2.3 Complexity of finding unknown anomalies
2.4 Literature review
3 Dataset
3.1 Data statistics
3.2 Tables
3.3 Product Structure
3.4 Known anomalies
3.5 Current measures of discrepancies in the data
4 Methodology
4.1 Amazon Web Services
4.2 Data preprocessing
4.3 Graph construction
4.4 Structural anomaly detection
4.5 Similarity analysis algorithm
4.5.1 Attribute-based anomaly detection
5 Results
5.1 Visualization of graphs
5.2 Anomalies
5.3 Statistics
5.3.1 Attribute-related anomalies
5.4 Analysis on non-anomalous products
5.5 Validation
6 Discussion
6.1 Additional anomalies
6.2 Global and local anomalies
6.2.1 Global and local anomalies among non-structurally anomalous products
6.3 Limitations
6.4 Future work
7 Conclusions
A Appendix
Bibliography

LIST OF FIGURES

Figure 1 Example of global, local, and child products.
Figure 2 Representation of how a typical product is structured. It consists of a global product which is referenced by local products sold in a given country C. All products in turn have child products.
Figure 3 Representation of the structure of a typical product.
Figure 4 Representation of the structure of a product which has one country selling multiple instances of it.
Figure 5 Abstract view of what a constructed graph can look like for a given global product.
Figure 6 Demonstration of what the known anomaly case looks like as a graph representation of the data.
Figure 7 Demonstration of how the anomalous case in Figure 6 could be solved.
Figure 8 Visualization of a possible scenario of the relation between products. A global child product is directly connected to a local child product.
Figure 9 Visualization of global and local structures within a graph.
Figure 10 Visualization of a case where the local and global structures do not coincide with one another.
Figure 11 Visualization of attribute comparison between global and local structure.
Figure 12 Visualization of a product which was found to have the same global and local structures.
Figure 13 Visualization of a product which was found to have different global and local structures.
Figure 14 Visualization of a product which was known beforehand to be anomalous, with multiple local products for multiple countries.
Figure 15 Distribution of how large a portion of the anomalies each country contributes. Countries with less than 1.7% share are grouped into "other".
Figure 16 Distribution of how many local products have a given number of local child products. This is shown for all local products in the dataset.
Figure 17 Distribution of how many local products have a given number of local child products. This is shown for all structurally anomalous local products.
Figure 18 Pie charts of products which are structurally anomalous. The charts show how large a portion of the local anomalies found belong to a certain country.
Figure 19 Pie charts of products which are structurally anomalous. The charts show how large a portion of the global anomalies found belong to a certain country.
Figure 20 Pie charts of products which are not structurally anomalous. The charts show how large a portion of the local anomalies found belong to a certain country.
Figure 21 Comparison between GCPs and LCPs in Korea for products which were found to be structurally anomalous.
Figure 22 Comparison between GCPs and LCPs in China for products which were found to be structurally anomalous.
Figure 23 Comparison between GCPs and LCPs in Singapore for products which were found to be structurally anomalous.
Figure 24 Comparison between GCPs and LCPs in the US for products which were found to be structurally anomalous.
Figure 25 Comparison between GCPs and LCPs in India for products which were found to not be structurally anomalous.
Figure 26 Anomaly distribution among countries. The x-axis shows the country code for each country with at least one anomaly present, and the y-axis shows the number of anomalies in the corresponding country.
Figure 27 Distribution of how many local products have a given number of local child products. This is shown for local products which did not have structural anomalies.

LIST OF TABLES

Table 1 Definition of the terminology used for products in the data.
Table 2 Description of the different tables present in the database.
Table 3 Similarity matrix between 5 GCPs and 4 LCPs in 2 different countries.
Table 4 Similarity matrix between 2 GCPs and LCPs from 27 different countries, where each country has a varying number of local products. The matrix has been shortened for easier interpretation.
Table 5 Number of local anomalies found for different thresholds for structurally anomalous products. The bottom row shows the total number of LCPs which were analyzed when searching for local anomalies.
Table 6 Number of global anomalies found for different thresholds for structurally anomalous products. The bottom row shows the total number of GCPs which were analyzed when searching for global anomalies.
Table 7 Number of local anomalies found for different thresholds among structurally normal products. The bottom row shows the total number of LCPs which were analyzed when searching for local anomalies.
Table 8 Similarity matrix between 3 global products and 4 local products in the US.
Table 9 Similarity scores between LCPs and GCPs which were found to have no structural discrepancy between them.

ACRONYMS

GCP Global Child Product

LCP Local Child Product


1 INTRODUCTION

Most businesses today rely on data as a result of their digitization initiatives. If businesses do not make sure to fix bad data, it could lead to significant financial losses. Data is bad in the sense that it may contain noise and that not all data points contain correct information. IBM estimated that bad data cost the US $3.1 trillion in 2016¹. It is therefore of the highest interest for an enterprise to keep its data clean so that its operations can focus on other aspects of the business.

1.1 Problem definition

This thesis is an attempt to find discrepancies among products in a real-world dataset provided by an international enterprise. A discrepancy arises when the data is assumed to have one structure but in reality has another. Because of the discrepancies in the data, the enterprise that supplied the data for this thesis suffers significant financial losses, since its daily operations are large in volume. The discrepancies can also affect customer experience, since customers cannot find the product they are looking for, leading to a loss of customers. Furthermore, the ambition of this thesis is to find anomalies that were unknown to the enterprise prior to this work. This is possible because the structure of the data is already known. By knowing the structure, new anomalies can be detected. Finding anomalies translates to finding discrepancies between what is believed about the data and what it actually looks like, making the discrepancies more concrete.

The data consists of furniture products from an international enterprise. The products can be divided into three main categories: global, local, and child products. The definitions of the different types of products are given in Table 1.

The problem is that it is sometimes difficult to refer to the right product in the software responsible for displaying the appropriate products to the end-user. Currently, there is no system in place that can detect and correct these discrepancies among products. This means that a discrepancy is only detected when a data analyst manually spots the irregularity in the database, or when a customer using the enterprise's services cannot find the product they are looking for.

1 https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year


Product   Description

Global    A product that can be referred to from all countries the enterprise is selling their products in. Before a country can begin selling a product, there must first exist a global product as global products are used as references. Global products do not belong to a specific country.

Local     A product which a given country is selling, with a global product as a reference. It can therefore be seen as a "copy" of a global product which in turn can be modified. The product ID of a local product can be the same as a global product, or it could be a country specific ID. A local product can look identical to a global product.

Child     What global and local products are made of. There are both global and local child products. Global products consist of global child products, and local products consist of local child products. Global and local products can be seen as "packages" of child products.

Table 1: Definition of the terminology used for products in the data.

1.2 Problem approach

The data in question is relational. Traditional approaches to processing and analyzing data are therefore not directly applicable, since those methods usually assume non-relational data that is already in tabular or vectorial form. It is thus necessary to represent the relational data in another manner while preserving its structure. Given the tree-like structure of the data, a feasible method is to represent it as a graph. A graph is chosen rather than a tree because, although the structure of the data resembles a tree, some patterns in the data make it impossible to represent as a tree. This scenario is described in more detail in Section 4.3. Representing the data as a graph also keeps the relations between the data points while making the data more interpretable. Since a global product has no relation to other global products, one graph is constructed for each global product. Each graph in turn contains the local and child products that correspond to its global product.


The problem is hence represented by multiple graphs, one for each global product. For each graph, the first task is to find structural anomalies, which means comparing the structure of a global product with that of a local product. A structural anomaly is detected when a global product has more or fewer child products than a local product. Finding these anomalies reveals which products cause discrepancies in the dataset and in which countries. After the structural anomalies have been found, a search for attribute-based anomalies is made on the products with structural anomalies as well as on a sample of products without structural anomalies. This step consists of comparing the attribute values of global and local child products.
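The structural comparison described above can be illustrated with a minimal Python sketch. It assumes that the child products of a global product and of each country's local product are already available as plain lists. The function name, the country codes, and the product IDs are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List


def structural_anomalies(
    global_children: List[str],
    local_children_by_country: Dict[str, List[str]],
) -> Dict[str, int]:
    """Return {country: difference in child count} for anomalous countries.

    A country is reported when its local product has more or fewer
    child products than the global reference product.
    """
    reference = len(global_children)
    anomalies = {}
    for country, local_children in local_children_by_country.items():
        difference = len(local_children) - reference
        if difference != 0:
            anomalies[country] = difference
    return anomalies


# Hypothetical global product with 3 global child products (GCPs).
example = structural_anomalies(
    ["gcp-1", "gcp-2", "gcp-3"],
    {
        "SE": ["lcp-1", "lcp-2", "lcp-3"],           # same structure
        "US": ["lcp-1", "lcp-2"],                    # one child fewer
        "KR": ["lcp-1", "lcp-2", "lcp-3", "lcp-4"],  # one child more
    },
)
print(example)
```

A negative value means that the local product has fewer child products than the global product, and a positive value means that it has more.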

Based on the research conducted for this thesis, no corresponding study has been found in which structural differences between products are first used to find anomalies and similarities are then sought among the anomalous products. On this basis, this thesis brings a novel approach to interpreting products from different vendors as similar. Furthermore, based on the results achieved, this thesis contributes a system that an enterprise can use to keep better track of its products and improve its operations.

1.3 Purpose & goals

The purpose of this thesis is to find new, previously undiscovered anomalies in relational, real-world data. With it, the goal is to develop a system that represents the data in a more interpretable way and to find new anomalies in it.

The main contributions which this thesis brings are:

• Visualization of complex, real-world data in graph format.

• A novel analysis on finding previously unknown anomalies in relational data by comparing structures in the data and using similarity analysis for comparing products.

• Distribution of the system and its findings to an enterprise.

1.4 Thesis structure

The thesis is structured as follows. Chapter 2 describes graph theory, similarity analysis, and the complexity of finding unknown anomalies, followed by a literature review. Data characteristics and statistics, product structure, already known anomalies in the data, and current measures for them are described in Chapter 3. Chapter 4 walks through how the topics from Chapter 2 are put into practice to represent the data as a graph structure and to find structural differences as well as similarities among products. The results are presented in Chapter 5 and discussed in Chapter 6, together with future work based on the findings. Lastly, the thesis is concluded in Chapter 7.

2 THEORETICAL BACKGROUND

This chapter walks through the necessary theoretical background for this thesis. First, graph theory is covered in Section 2.1. Then, similarity analysis is introduced in Section 2.2. After that, the complexity of finding unknown anomalies in data is discussed in Section 2.3. Lastly, related work in the literature is addressed in Section 2.4.

2.1 Graph theory

One way of representing relational data is by using graphs. The field of graph theory has been getting a lot of attention recently with applications within social networks[1][2], communication networks[3][4], path finding[5][6], medicine[7] and many more[8][9][10].

A graph $G$ consists of a set of nodes $N$ and a set of edges $E$, written $G = (N, E)$. A node $n_i \in N$ represents a data point and an edge $e_i \in E$ represents a relation between data points. This thesis seeks to construct a graph $G_i$ for each global product $p^g_i$, with $p^g_i \in G_i$, giving the collection of graphs $\{G_1, G_2, \ldots, G_i\}$. The other nodes in graph $G_i$ are the local products, $p^l_i$, and the child products, $p^c_i$, of $p^g_i$.
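As an illustration of building one graph per global product, the following Python sketch stores each graph as an undirected adjacency mapping. It is only a sketch under assumptions: the input format (pairs of local and global IDs, and pairs of product and child IDs), the function name, and all IDs are hypothetical, and this is not the construction used in the thesis.

```python
from collections import defaultdict


def build_product_graphs(parent_rows, child_rows):
    """Build one undirected graph per global product.

    parent_rows: iterable of (local_id, global_id) pairs (assumed format).
    child_rows:  iterable of (product_id, child_id) pairs (assumed format).
    Returns {global_id: {node: set_of_neighbors}}.
    """
    graphs = defaultdict(lambda: defaultdict(set))
    owner = {}  # maps a local product to the global product it references

    def add_edge(global_id, u, v):
        graphs[global_id][u].add(v)
        graphs[global_id][v].add(u)

    for local_id, global_id in parent_rows:
        owner[local_id] = global_id
        add_edge(global_id, global_id, local_id)

    for product_id, child_id in child_rows:
        # A product without a recorded parent is treated as a global product.
        global_id = owner.get(product_id, product_id)
        add_edge(global_id, product_id, child_id)

    return graphs


graphs = build_product_graphs(
    parent_rows=[("L-SE", "G1"), ("L-US", "G1")],
    child_rows=[
        ("G1", "gc1"), ("G1", "gc2"),
        ("L-SE", "lc1"), ("L-SE", "lc2"),
        ("L-US", "lc3"),
    ],
)
print(sorted(graphs["G1"]["G1"]))
```

Each graph contains the global product, the local products that reference it, and the child products of both.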

There are also different types of graphs. The two most frequently used types are undirected and directed graphs, which differ in the information the edges hold. In undirected graphs, the edges work as a two-way communication network, where one can travel from a node to a neighboring node and back. Directed graphs, however, only allow one-way travel between nodes. For example, given two nodes $n_1$ and $n_2$ joined by a directed edge from $n_1$ to $n_2$, the path $n_1 \to n_2$ is possible but $n_2 \to n_1$ is not, whereas both are possible in an undirected graph. Directed graphs are usually drawn with arrows pointing in the direction of the information flow between nodes, and undirected graphs with plain lines between nodes. Within the scope of this thesis, only undirected graphs are treated.
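The difference between the two graph types can be sketched in a few lines of Python. The adjacency-dictionary representation and the helper name `add_edge` are assumptions chosen for illustration.

```python
def add_edge(adjacency, source, target, directed):
    """Insert an edge into an adjacency dictionary of neighbor sets."""
    adjacency.setdefault(source, set()).add(target)
    adjacency.setdefault(target, set())
    if not directed:
        # An undirected edge can be traversed in both directions.
        adjacency[target].add(source)


directed_graph, undirected_graph = {}, {}
add_edge(directed_graph, "n1", "n2", directed=True)
add_edge(undirected_graph, "n1", "n2", directed=False)

print("n1" in directed_graph["n2"])    # travel n2 -> n1 in the directed graph
print("n1" in undirected_graph["n2"])  # travel n2 -> n1 in the undirected graph
```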

The structure of graph data is not compatible with traditional machine learning (ML) algorithms. New algorithms designed to find patterns in graph structures have hence emerged. Some of these are the intuitively named graph neural networks (GNNs)[11][12], graph convolutional networks (GCNs)[13], and more[14]. In situations where traditional ML algorithms are still an attractive option, there are graph embedding techniques, which facilitate downstream tasks such as clustering and classification.


Graph embedding is the practice of representing graphs in a latent vector space. The purpose is usually to make the graph compatible with traditional ML algorithms. A popular embedding technique for graphs is node2vec[15], which uses biased random walks to map similar nodes close to one another in the vector space. Another popular graph embedding technique is graph2vec[16], which makes vectorial representations of whole graphs by using a principle similar to the one found in node2vec, except that it traverses subgraphs instead of nodes.
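To give an intuition for the biased random walks mentioned above, the sketch below generates a single walk on an unweighted, undirected graph in the style of node2vec. It is a simplified illustration, not the reference implementation: the return parameter `p` and the in-out parameter `q` follow the usual node2vec formulation, while the function name, the example graph, and the fixed seed are assumptions. Learning the actual embedding from the walks (for example with a skip-gram model) is not shown.

```python
import random


def biased_walk(adjacency, start, length, p=1.0, q=1.0, seed=0):
    """Generate one second-order biased random walk of at most `length` nodes."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adjacency[current])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


adjacency = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
walk = biased_walk(adjacency, "a", length=6, p=2.0, q=0.5)
print(walk)
```

A large `p` discourages stepping back, and a small `q` encourages the walk to explore nodes further from where it came from.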

2.2 Similarity analysis

As the name suggests, similarity analysis is the practice of measuring how similar or different entities are to one another. Similarity analysis can be used for many applications, such as recommendation systems [17][18]. As Ma et al. [19] claim, graph similarity analysis can be used in computational chemistry and biology, neuroscience, computer security, and computer vision.

Popular similarity measures for graphs include SimRank[20], the Jaccard coefficient[21], and graph edit distance[22]. SimRank was originally intended for ranking similarity among web pages. The intuition behind it is that two pages are similar if the same types of pages reference them. This definition is also applicable to graphs, where two nodes can be seen as similar if they, in turn, are connected to similar nodes. However, as pointed out by Fogaras et al.[23], SimRank fails when two nodes have the same neighboring nodes but the neighbors themselves are not classified as similar. The Jaccard coefficient instead considers the number of common neighbors over the total number of neighbors of the two nodes. Hamedani et al.[24] combined the SimRank and Jaccard coefficient metrics, which resulted in an algorithm that outperformed the other algorithms it was compared against.
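The Jaccard coefficient for two nodes can be written compactly. The sketch below assumes that the graph is given as a dictionary mapping each node to its set of neighbors. The node names are hypothetical.

```python
def jaccard_coefficient(adjacency, u, v):
    """Common neighbors of u and v divided by all neighbors of u and v."""
    neighbors_u, neighbors_v = adjacency[u], adjacency[v]
    union = neighbors_u | neighbors_v
    if not union:
        return 0.0  # convention chosen here for two isolated nodes
    return len(neighbors_u & neighbors_v) / len(union)


adjacency = {
    "x": {"a", "b", "c"},
    "y": {"b", "c", "d"},
}
score = jaccard_coefficient(adjacency, "x", "y")
print(score)  # 2 common neighbors out of 4 in total
```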

The graph edit distance is one of the most basic similarity measures for comparing graphs. Edit distance refers to the number of operations needed to make one graph structurally equivalent to another graph. These operations usually consist of adding or removing an edge or node in the graph, where each operation has a cost. The goal is to use as few operations as possible to make the first graph have the same structure as the second.
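The idea of counting edit operations can be illustrated with a deliberately simplified sketch. It assumes that nodes carry fixed identifiers, so that the same identifier in both graphs denotes the same node, and that every insertion or deletion of a node or an edge costs 1. General graph edit distance additionally searches over all possible node correspondences, which this sketch does not attempt. The function name and the example graphs are hypothetical.

```python
def edit_distance_fixed_nodes(nodes_a, edges_a, nodes_b, edges_b):
    """Unit-cost edit distance between two undirected graphs whose
    node identifiers are shared (no search over node mappings)."""
    # Undirected edges are normalized so that (u, v) equals (v, u).
    normalized_a = {frozenset(edge) for edge in edges_a}
    normalized_b = {frozenset(edge) for edge in edges_b}
    node_edits = len(set(nodes_a) ^ set(nodes_b))
    edge_edits = len(normalized_a ^ normalized_b)
    return node_edits + edge_edits


# A product with three child products versus one with two.
distance = edit_distance_fixed_nodes(
    nodes_a={"p", "c1", "c2", "c3"},
    edges_a=[("p", "c1"), ("p", "c2"), ("p", "c3")],
    nodes_b={"p", "c1", "c2"},
    edges_b=[("c1", "p"), ("p", "c2")],
)
print(distance)  # remove node c3 and edge p-c3
```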

Similarity analysis can also be performed with neural networks that are explicitly implemented to work on graphs, such as the aforementioned GNNs and GCNs, which can be used in Siamese networks [25][26].

2.3 Complexity of finding unknown anomalies

Usually, in anomaly detection tasks, the types of anomalies to be searched for are known beforehand. This could either be because a certain type of anomaly is known to exist in the dataset, or because there is a high probability that such cases exist but no one has checked, so their presence is not guaranteed. The latter case can be tricky, depending on the data volume and its structure.

As described in Section 1.1, the aim of this thesis is to find new anomalies that were not previously known to be present in the dataset. This makes the task at hand more difficult, since any anomalies found also need to be validated.

2.4 Literature review

Anomaly detection is the practice of finding sequences or instances in data which do not follow the expected behavior or pattern. One use case of anomaly detection is detecting malfunctions in production lines before production comes to a halt due to a machine failure. Anomaly detection comes in three main forms: supervised, unsupervised, and semi-supervised. This thesis focuses on unsupervised anomaly detection, since the anomalies to be detected are new and no examples of them therefore exist.

The field of anomaly detection is broad but also well-researched. For instance, anomaly detection has become quite popular within network intrusion analysis[27][28][29] and fault detection in production lines[29][30], among others[31][32][33].

Relational data consists of data points that have relations to one another. Relational data differs from non-relational data in that non-relational data is sporadically collected, and each data point does not have to contain the same sort of data as the other data points. Finding anomalies in relational data usually implies that the anomalies themselves have relations to one another, according to Maervoet et al.[34]. There have also been studies in which anomaly detection is used in relational databases for security purposes[35][36].

In a case study of fashion products, Fallah et al.[37] mention that many factors could determine whether two products are classified as the same. Their output parameter was the number of sales for each product, with each product having the attributes color, price, number of colors, and size of the photo in the catalog. They proposed a modified sequential k-means clustering method that improved on the conventional k-means method by making the clusters denser. The products in each cluster were more related to one another than those in the clusters of conventional k-means. A related study from 2020 by Akritidis et al.[38] presented an unsupervised algorithm for matching product titles, which matched products by their titles with better performance than the other algorithms it was compared against.

Boobalan et al.[39] constructed a method for clustering nodes in graphs that is, as far as the research for this thesis is concerned, unique. The method aims at finding structural similarity by using k-neighborhood and attribute similarity. The k-neighborhood is used to determine which nodes belong to the same cluster based on distance, and attribute similarity is used for clustering nodes based on their attributes. The results yield clusters with high intracluster structural similarity as well as low intercluster similarity. Ruan et al.[40] took another approach by using a deep autoencoder to map graphs into a lower-dimensional space. Once in this space, clustering was performed based on the topology and the node attributes. Finally, He et al.[41] developed a model which took into account both the topology and the attributes of nodes in undirected, unweighted graphs. The model first separated the nodes based on their communities and their attribute values and then sought the best connection between the two groupings. The model showed superior results compared to the other models it was compared against.

Sometimes it is not enough to know where there are anomalies; information about what may have caused the anomaly in the first place may be just as important. This principle is called root cause analysis (RCA). As the name suggests, the method aims to find the cause of a particular effect. Mueller et al.[42] showed how this principle could be applied to ML models to automate the process. Another paper researched how to use anomaly detection and RCA to find faulty products and the cause of the fault[43]. This study focused on finding products in a production line that did not fulfill the requirements, such as having the correct dimensions. After detecting these anomalies, RCA was performed to find out what caused each product to be classified as an anomaly. There were two products in the experiments, where angle-based outlier detection performed best on the first product and k-nearest neighbors on the second. Another study by Forsberg[44] utilized anomaly detection and RCA to detect anomalies in microservice clusters and what may have caused them. The study used the BIRCH algorithm to define clusters and anomalies, and a probability table was used to determine what might have caused each anomaly.

Some methods have considered transforming the relational database into a non-relational one altogether[45][46]. This may be due to an increased need for performance or scalability. Another reason could be to prepare the data for traditional ML algorithms, which usually take non-relational data as input. For those cases, Goulon et al.[47] reported in their review of numerical ML for relational data that the data structure can be learned instead of being hard-coded, showing a similar intuition to convolutional neural networks.

It was previously mentioned that the data has a tree-like structure but does not meet all the requirements of a tree. For data that is tree-structured, however, there is Tree2Vector[48], an algorithm for learning a vectorial representation of tree-structured data. The algorithm learns the representation by recursively working from the leaf nodes up to the root node. For each depth level, the features of the nodes at that level are clustered together and concatenated with the upper-level features. The clustering process is then repeated until the root node is reached, representing the whole tree as a single vector.

Theissler[49] produced a model which used ensemble methods to detect known and unknown anomalies in automotive systems, using both two-class and one-class classifiers. The intuition is that the one-class classifiers learn the expected behavior of the data in the training set. Since there are no anomalies in the training data, the classifiers can detect unknown anomalies in the test data. The experiments showed promising results, where the model detected both known and unknown anomalies. Cao et al.[50] applied detection of unknown anomalies to web services, where new types of intrusion attacks appear daily. These results were achieved by using a decision tree model.

3 DATASET

This chapter first presents some basic statistics for the data in Section 3.1. The different tables of the data are then walked through in Section 3.2. After that, Section 3.3 gives a more detailed view of the different components of a product. Section 3.4 describes the already known anomalous cases in the data. Lastly, current measures for some existing anomalies in the data are explained in Section 3.5.

3.1 Data statistics

For all data-driven projects, it is vital to know the properties of the data at hand. The knowledge gained from analyzing the data may be of great value for interpreting whatever results one achieves with it.

The data at hand for this thesis is relational SQL data with information spread across 7 different tables. These tables hold data for 239,849 global and local products and 219,044 child products. In total, there are around 1.1 million records, or rows, of product data.

This thesis focuses on products which are currently being sold, which leaves a total of 40,504 global products. These products are treated as the blueprint for the corresponding local products in this thesis. This is explained more thoroughly in Section 3.3. Each global product can be sold in up to 66 different countries.

3.2 Tables

The 7 different tables in the database are shown in Table 2. All tables use the ID of a product as a primary key. The first table, "Product", holds a product's essential information, such as its name and when it was last updated. The "Product Attribute" table contains more detailed information than the first table, such as who designed the product and whether it is a global or local product. The third table, "Product Range", describes which countries sell the product and during which periods. The fourth table, "Product Specific", holds the local notation of a given product in a given country. This table is used to convert a global ID to a local ID for a given country. The fifth table, "Product Structure", tells which category a product belongs to, for example different types of kitchens or sofas. The sixth table, "Product Parent", holds the global ID for a given local ID. This table only accepts local IDs. The last table, "Product Child", stores which child products a given global or local product has. This table only accepts IDs that belong to global or local products, not child products, since child products cannot have children themselves.
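How the ID-related tables fit together can be sketched with in-memory dictionaries standing in for the SQL tables. Everything in this sketch is hypothetical: the column layout, the variable and function names, and all IDs are assumptions made for illustration, since the actual schema is not shown here.

```python
# "Product Specific": (global ID, country) -> local ID in that country.
product_specific = {
    ("G100", "US"): "US-7001",
    ("G100", "KR"): "KR-0042",
}

# "Product Parent": local ID -> global ID (only local IDs are stored).
product_parent = {
    "US-7001": "G100",
    "KR-0042": "G100",
}

# "Product Child": global or local product ID -> its child product IDs.
product_child = {
    "G100": ["C1", "C2", "C3"],
    "US-7001": ["C1", "C2"],
    "KR-0042": ["C1", "C2", "C3"],
}


def local_id_for(global_id, country):
    """Convert a global ID to the local ID used in a country, if any."""
    return product_specific.get((global_id, country))


def global_id_for(local_id):
    """Look up the global product that a local product references."""
    return product_parent.get(local_id)


def children_of(product_id):
    """Return the child products of a global or local product."""
    return product_child.get(product_id, [])


local_id = local_id_for("G100", "US")
print(local_id, global_id_for(local_id), children_of(local_id))
```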

3.3 Product structure

A global or local product consists of child products, which are the actual products being sold. Both global and local products can be seen as packages that are assemblies of child products. A package could have been created because a certain combination of child products is often bought together, or because the product designer intended the package to consist of these child products, among other reasons. It is therefore usually the packages that are displayed to customers, but it is also possible to purchase child products separately. However, when a customer purchases a local product, it is, logistically speaking, the child products that are sold. An example is shown in Figure 1, where Figure 1a illustrates a local product (it could also be a global product, but only local products are sold) and its child products are shown in Figure 1c and Figure 1d. The package being sold consists of both the chairs and the table, but it is also possible to purchase each of them separately. In addition, a child product can be used in other local products. In Figure 1b, the chair shown in Figure 1c is used in another local product. That product has a different table from the one shown in Figure 1a, making it another local or global product (package of child products).
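The package view, in which one child product can appear in several packages, can be sketched as follows. The package and child names are hypothetical and only loosely mirror the chair and table example from Figure 1.

```python
from collections import defaultdict

# Two hypothetical packages that share the same chair but use different tables.
packages = {
    "dining-set-a": ["chair", "table-a"],
    "dining-set-b": ["chair", "table-b"],
}


def packages_by_child(packages):
    """Invert {package: children} into {child: set of packages using it}."""
    index = defaultdict(set)
    for package_id, children in packages.items():
        for child_id in children:
            index[child_id].add(package_id)
    return index


index = packages_by_child(packages)
shared = sorted(child for child, owners in index.items() if len(owners) > 1)
print(shared)
```

Child products that appear in more than one package are one reason why the data resembles a tree without strictly being one.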

International products consist of a global and a local structure, where the global structure consists of a global product and its related global child products (GCPs). The local structure, in turn, consists of a local product with its corresponding local child products (LCPs). The local product can use either the same notation as the global product or a country-specific notation. For example, a global product that has 3 GCPs is seen as the reference structure for a local product, so the local product, in turn, has 3 LCPs. However, for market-specific reasons, the country that sells the local product may want to add or remove an LCP, resulting in a change of the local structure.

There could also be national laws that demand a special notation only used in that country. This does not necessarily mean that the structure of the product itself changes.

In Figure 2, the product structure is visualized. The global product is positioned in the middle. Inside it, there are 3 GCPs. The global product is related to local products, which are sold in a given country C. In turn, these local products have 3 LCPs, since the global product has 3 child products. Each product is a package of child products presented to the customer, but in reality, it is the child products that are being sold.

| Table | Description |
|---|---|
| Product | Holds general information about products, such as the name, whether the product is still being sold, and when it was last updated. |
| Product Attribute | Various information about the product, more detailed than the "Product" table, such as the designer, why there is a local article number (if there is one), and when the product is planned to go on sale. |
| Product Range | Describes which countries are selling the given product and during which period it is being sold. |
| Product Specific | Stores information on which country a given product is sold in. It also stores the notation of that product in the country or countries it is being sold in. Can be seen as a table which maps a global product into a local product in a given country. |
| Product Structure | Products are assigned to specific groups or categories. This table describes which category a given product belongs to and for which country. |
| Product Parent | Tells the relationship between the local and global article number of a product. Can be seen as the reverse of the "Product Specific" table, mapping a local article number into the global article number. Only viable for local products, since there is no relation between multiple global products in the data. |
| Product Child | Holds information on the child products of a given global or local product. This table only holds information for global or local products, since child products cannot have child products of their own. |

Table 2: List of the different tables in the dataset and a brief description of each table. All tables are assumed to hold information for global, local, and child products unless stated otherwise. All tables use article numbers as their primary key.

All types of products (global, local, LCP, and GCP) contain data found in the different tables described in Section 3.2. This data describes whether the product is a global or local product, which child products it has, where it is sold, and more. Each such description of a product is referred to as an attribute. There are 52 different attributes for each product, and all of them will be processed when performing the attribute-based anomaly detection described in Section 4.5.1.

(a) Example of a global or local product.

(b) Example of a global or local product which uses the same child product (Figure 1c) as Figure 1a.

(c) Example of global or local child product 1.

(d) Example of global or local child product 2.

Figure 1: Illustration of the distinction between global, local, and child products.


Figure 2: Representation of how a typical product is structured. It consists of a global product which is referenced by local products sold in a given country C. All products in turn have child products.

3.4 Known anomalies

Before describing the known anomalies, it is essential to point out the ideal format of the data structure. It is expected that for a global product, there are local products and global child products. The local products, in turn, also have local child products and are sold in a given country. Here it is expected that there is one local product per country.

Before the start of the thesis, one anomaly case in the data was already known. This case consists of products that have multiple local products in a given country. An example of a product with this type of anomaly is shown in Figure 3, where country C1 has three local products that are all related to the same global product.

This causes problems because the software responsible for displaying the appropriate products to the customer does not know which product is the right one to display. As a result, the software might have to be explicitly programmed to handle certain cases so that it picks the appropriate product.
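The known anomaly amounts to a simple count per country. The following is a minimal sketch, assuming the local products of one global product are available as a hypothetical list of (local ID, country) pairs; the function name and the example IDs are illustrative and not taken from the enterprise's data.

```python
from collections import defaultdict


def countries_with_multiple_locals(local_products):
    """Return {country: [local IDs]} for every country that sells more
    than one local product of the same global product.

    `local_products` is a hypothetical list of (local_id, country) pairs.
    """
    by_country = defaultdict(list)
    for local_id, country in local_products:
        by_country[country].append(local_id)
    # Only countries with more than one local product are anomalous.
    return {c: ids for c, ids in by_country.items() if len(ids) > 1}


# Illustrative data mirroring Figure 3: country C1 has three local products.
example = [("L1", "C1"), ("L2", "C1"), ("L3", "C1"), ("L4", "C2")]
flagged = countries_with_multiple_locals(example)
print(flagged)
```

In this example only C1 is flagged, since C2 has exactly one local product.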

3.5 Current measures of discrepancies in the data

The enterprise has a tool called the planner. The planner's role is to show customers the different ways they can configure their product (e.g., if they want different legs for their sofa). However, the planner follows pre-defined rules, which hides anomalous products from the tool. Today, these scenarios are handled by implementing specific rules that let the planner make exceptions. This is not sustainable, since there are already many special rules in the system and the data is growing by the day. Additionally, these measures are only applied at the edge, just before the end customer. Hence, a better way to fix this is to make sure that the data in the database itself has the correct relations between products.

Figure 3: Representation of how a typical product is structured. It consists of a global product which is referenced by local products sold in a given country C. All products in turn have child products. This particular product has one of the already known anomalies, where one country has multiple local products linked to the same global product.

A scenario is illustrated in Figure 4, where the end customer is using the planner on the website. When the user selects a product, the planner requests this product from the development team, which asks for it in the database. The request is made on a global product, and the development team has adjusted the database query to ask for the equivalent product sold in the country the planner sent its request from. In this case, there are two local instances of the same product, where one variant is newer than the other. Both products are returned to the development team, but the planner can only receive one product. The newer product is returned to the planner, since the development team assumes that only one product is returned from the database. Ideally, the old product would instead be sold out before the newer version goes on sale. As it stands, the old product occupies precious warehouse space and causes a loss in profits, both because that space cannot be used to store other products and because the older products may never be sold. Another scenario is when a given country sells two local products, one of which uses the same notation as the global product. In this case, the local product with the same notation as the global one is prioritized.

Figure 4: An illustration of how a special rule is handled in the system. The user interacts with the enterprise’s website, where the website asks the development team for a specific product, which asks for it in the database. There are two such products, and the newer product is returned.


4 Methodology

This chapter presents the methods used to obtain the results of this thesis. The chapter starts by describing the role of AWS in Section 4.1. After that, the necessary data preprocessing is described in Section 4.2. Then the procedure of constructing graphs from the relational data is walked through in Section 4.3, with some visual examples. The identification of structural differences in the constructed graphs is described and visualized in Section 4.4. Lastly, Section 4.5 presents the similarity analysis algorithm used for finding similarities between products.

4.1 Amazon Web Services

Amazon Web Services, or AWS, has quickly become one of the most popular cloud platforms for developing applications. For this project, the source code will be run on AWS, since large volumes of data need to be processed. Running on AWS provides on-demand computing power when necessary and takes the load off the local machine, which would otherwise be a laptop whose performance tends to drop under sustained heavy load. This is of great benefit, since the laptop would otherwise have to act both as a local database server and as the place to run the code.

The AWS configuration was set up as follows. The SQL database was configured using the Relational Database Service (RDS) in a serverless configuration. For running the code, an Elastic Compute Cloud (EC2) instance of type t2.large was configured. This instance provides 2 virtual CPU cores and 8 GB of memory. A storage volume of 12 GB was chosen for it.

4.2 Data preprocessing

There is almost no scenario where the data is ready to be analyzed without preprocessing it first. As previously mentioned, the SQL-formatted data resides in 7 different tables. An overview and short description of each table was given in Table 2. To prepare the data for graph construction, which is further described in Section 4.3, the data will first be reformatted into a dictionary. The data types of the features consist of numbers, strings, dates, and null values. The string values consist of single words and complete sentences describing the name, dimensions, and color of the product. The data available for this thesis consists of 52 unique features and a total of 1.1 million records spread across the 7 data tables. Of these records, 40,504 are, as previously mentioned, global products, which are used for reference within operational countries. When a global product is on sale in a given country, the country may refer to it with the same ID as the global one, a country-specific ID, or multiple country-specific IDs.

Since a global product is not connected to other global products, the data will not be represented as one large graph. Instead, one graph is constructed for each global product. The architecture of the graphs is explained in Section 4.3. Because there will be one graph for each global product, the necessary data for each graph has to be fetched and structured appropriately. This process consists of several steps, since the data is spread across 7 different tables and the relationships between products are complex. The steps for collecting the necessary data are shown in Algorithm 1. Each time data is fetched for a given product, all of its corresponding 52 features are collected.

Algorithm 1: Collect product data

function Get data for all products
    global_products ← list of all global product IDs
    for global product in global_products do
        get data for global product
        get child products of global product
        get the local_products associated with global product
        for local product in local_products do
            get child products for local product
    return products
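The steps of Algorithm 1 could be sketched in Python as follows. This is only an illustration under stated assumptions: the three dictionaries stand in for the "Product", "Product Specific", and "Product Child" tables, and every ID, field name, and helper (`fetch`, `collect_product_data`) is hypothetical. The real implementation queries the SQL database and collects all 52 features.

```python
# Hypothetical in-memory stand-ins for three of the database tables.
# All IDs and field names are illustrative, not taken from the dataset.
PRODUCT = {
    "G1": {"name": "table set"},
    "L1": {"name": "table set"},
    "c1": {"name": "chair"},
    "c2": {"name": "table"},
}
PRODUCT_SPECIFIC = {"G1": [("L1", "C1")]}  # global ID -> [(local ID, country)]
PRODUCT_CHILD = {"G1": ["c1", "c2"], "L1": ["c1", "c2"]}  # parent -> children


def fetch(product_id):
    """Return a copy of the attributes stored for one product."""
    return dict(PRODUCT.get(product_id, {}))


def collect_product_data(global_ids):
    """Build one nested dictionary entry per global product."""
    products = {}
    for gid in global_ids:
        entry = {
            "data": fetch(gid),
            "children": {c: fetch(c) for c in PRODUCT_CHILD.get(gid, [])},
            "locals": [],
        }
        # Each local product carries its country and its own child products.
        for lid, country in PRODUCT_SPECIFIC.get(gid, []):
            entry["locals"].append({
                "id": lid,
                "country": country,
                "data": fetch(lid),
                "children": {c: fetch(c) for c in PRODUCT_CHILD.get(lid, [])},
            })
        products[gid] = entry
    return products


products = collect_product_data(["G1"])
```

Local products are kept in a list rather than keyed by ID, since several countries may use the same notation for their local product.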

4.3 Graph construction

As mentioned in Section 4.2, one graph will be constructed for each global product. The graphs will only feature products which are currently being sold (the dataset also contains some products which are no longer being sold). The reason for only considering currently sold products is to make the graphs correctly represent how the enterprise's system sees the products. All graphs will be constructed using the NetworkX¹ library in Python. All nodes will additionally carry the 52 features which reside in the 7 different tables described in Section 3.2. An abstract view of what a graph for a given global product can look like is shown in Figure 5. Products are represented as nodes, and edges tell which country, e.g., C1 or C2, sells the product.

The global product is connected to two different types of products: global child products and local products. A local product refers to what the global product is denoted as in a given country. The same goes for the local child products, which are the different components a global or local product consists of.

1 https://networkx.org/

Figure 5: Abstract view of what a constructed graph can look like for a given global product. The global product is seen as the root node, where it has its child products. For each country the product is sold in, there are edges connected to a local product in a given country C. In turn, the local products also have their child products.
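A graph of this shape could be built with NetworkX roughly as follows. This is a minimal sketch rather than the thesis implementation: the function name, the tuple-based node keys, and the attribute names `kind` and `country` are assumptions made for illustration. Tuple keys are used so that a local product that reuses the global notation, or a child product shared between packages, still gets its own node.

```python
import networkx as nx


def build_product_graph(global_id, gcp_ids, local_products):
    """Build the graph of one global product.

    gcp_ids: IDs of the global child products (GCPs).
    local_products: list of (local_id, country, lcp_ids) tuples.
    """
    graph = nx.Graph()
    root = ("global", global_id)
    graph.add_node(root, kind="global")
    for cid in gcp_ids:
        node = ("gcp", cid)
        graph.add_node(node, kind="gcp")
        graph.add_edge(root, node)
    for local_id, country, lcp_ids in local_products:
        local = ("local", country, local_id)
        graph.add_node(local, kind="local", country=country)
        # The edge attribute records which country sells the local product.
        graph.add_edge(root, local, country=country)
        for cid in lcp_ids:
            child = ("lcp", country, local_id, cid)
            graph.add_node(child, kind="lcp", country=country)
            graph.add_edge(local, child)
    return graph


# Illustrative data: C2 reuses the global notation and lacks one child.
g = build_product_graph(
    "G1",
    ["c1", "c2", "c3"],
    [("L1", "C1", ["c1", "c2", "c3"]), ("G1", "C2", ["c1", "c2"])],
)
print(g.number_of_nodes(), g.number_of_edges())
```

In the real graphs, each node would additionally carry the 52 features as node attributes.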

Not all product graphs will look the same. For instance, there are 66 different countries in which a product can be sold, and a global product must always exist for there to be a local product. The previously mentioned known anomaly occurs when a country sells multiple local products of one global product. This scenario is demonstrated in Figure 6, where country C1 has two local products that refer to the same global product. The dotted circle indicates the anomaly. Countries C2 and C3 only have one local product each connected to the given global product. This anomalous case for country C1 is not ideal. The ideal scenario is instead demonstrated in Figure 7, where a new global product has been created for the second local product of country C1. However, this thesis does not focus on how to fix these anomalies; its sole purpose is to identify new anomalies. Creating a new global product is merely one possible solution to this type of anomaly.

There is an additional way of representing the data. This alternative representation is the main reason why the data is not classified as having a tree structure. The scenario is shown in Figure 8, where a dotted edge indicates a possible connection between a global and a local child product. The connection is possible because, regardless of whether a product is a global, local, or child product, there are still relations between global and local (child) products, meaning that a GCP can have relationships to LCPs. Findings in this matter will be brought up in Section 5.2. This relation is what eliminates the possibility of representing products as trees.

Figure 6: Demonstration of what the known anomaly case looks like as a graph representation of the data.

Figure 7: Demonstration of how the anomalous case in Figure 6 could be solved.

4.4 Structural anomaly detection

Given the ideal product structure, which was described in Section 4.3, it is known what the baseline of a product graph should look like. Therefore, to discover unknown anomalies in the data, the first natural step is to look at what is known: the data structure. Finding structural anomalies is a part of discovering which countries contribute to discrepancies in the data. Finding these anomalies consists of, for a given graph, comparing the global structure with all local structures. A global structure is defined as a global product and its GCPs, and a local structure is defined as a local product and its corresponding LCPs. This comparison is demonstrated in Figure 9, where the global structure is highlighted in green and the local structures in red. When a local structure does not coincide with the global structure, as shown in Figure 10, where the local structure with fewer LCPs is shown in blue, it will be classified as a structural anomaly. It will also be classified as a structural anomaly if there are more LCPs than GCPs. These structures will be saved as subgraphs for later analysis, which is described in Section 4.5.
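The comparison can be sketched as a count check. The sketch below assumes that "does not coincide" means that the number of LCPs differs from the number of GCPs, which is how the two cases above are described; the function name and the tuple layout of `local_structures` are hypothetical.

```python
def structural_anomalies(n_gcps, local_structures):
    """Return the local structures whose child count differs from the
    number of global child products (GCPs).

    local_structures: list of (local_id, country, lcp_ids) tuples.
    Each result is (local_id, country, "fewer" or "more").
    """
    anomalies = []
    for local_id, country, lcp_ids in local_structures:
        if len(lcp_ids) < n_gcps:
            anomalies.append((local_id, country, "fewer"))
        elif len(lcp_ids) > n_gcps:
            anomalies.append((local_id, country, "more"))
    return anomalies


# Illustrative data: a global structure with 3 GCPs and three local structures.
local_structures = [
    ("L1", "C1", ["c1", "c2", "c3"]),        # coincides with the global structure
    ("L2", "C2", ["c1", "c2"]),              # one LCP missing
    ("L3", "C3", ["c1", "c2", "c3", "c4"]),  # one extra LCP
]
found = structural_anomalies(3, local_structures)
print(found)
```

A count check cannot tell which child product is missing or extra; that question is what the similarity analysis in Section 4.5 addresses.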


Figure 8: Visualization of a possible scenario of the relation between products. A global child product is directly connected to a local child product.

4.5 Similarity analysis algorithm

After structural anomalies have been found, similarity analysis is performed on the products which were found to have structural anomalies, in order to find the specific products which are causing them. The algorithm for finding similarities between nodes with respect to their attribute values was inspired by Boobalan et al. [39]. Their algorithm works by, for each node pair, iterating through all attributes and comparing the values of feature x between node i and node j. If i_x contains the same value as j_x, a similarity score of 1 is set, and 0 if the values differ. This is done for all features, and the scores are summed and finally normalized by dividing by the number of features, giving a value in the range [0, 1]. However, this algorithm appears to be designed for attributes that can be compared by strict equality and does not consider continuous values. Since the relevant data for this project also has continuous values, in the form of dates, among the node attributes, adjustments have to be made. The similarity algorithm used for this work is shown in Algorithm 2.
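Based on the description above, the baseline score of Boobalan et al. for two nodes i and j with n features could be written as follows. The notation is ours and is not taken from [39].

```latex
\mathrm{sim}(i, j) = \frac{1}{n} \sum_{x=1}^{n} \mathbb{1}\left[\, i_x = j_x \,\right],
\qquad \mathrm{sim}(i, j) \in [0, 1]
```

As an illustrative example, two nodes that agree on 39 of their 52 features would receive a score of 39/52 = 0.75.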

The algorithm shows the adjustments made to better suit the data of this thesis. It considers two subgraphs, where one subgraph is the global structure and the other is a local structure. Neither the global nor the local products are considered for the similarity measures, only the child products. Figure 11 shows how the attributes of each GCP are compared with the attributes of each LCP. The local structure is found in the left portion of the figure (red), while the global structure is found on the right (green). The similarity score between two nodes (a GCP and an LCP) is initially set to 0. The algorithm then traverses the different attributes of the nodes. A check is made to see if the attributes are dates. If they are, the date values are converted into float values, and the difference between the two is added to the similarity score after being normalized to the range [0, 1]. If either attribute does not have any value stored in it (e.g., None), no comparison is made. This is because there are many empty values in the data, and they do not tell whether two products are more similar to one another.

If none of these conditions hold, a strict comparison between the attribute values is made. The similarity score is increased by 1 if the values are the same and by 0 if they are not. Finally, the similarity score is normalized to the interval [0, 1]. The scores between all child products are then summarized in a similarity matrix.

Figure 9: Visualization of global and local structures within a graph. Highlighted in green is the global structure, and the local structure in red.

4.5.1 Attribute-based anomaly detection

The anomaly detection consists of two main parts. The first part investigates whether a given GCP has a corresponding LCP in a given country, referred to as finding global anomalies. The second part determines whether an LCP in a given country has a corresponding GCP, referred to as finding local anomalies. Both procedures are performed on the similarity matrices produced by Algorithm 2.

To find global anomalies, the similarity matrix constructed for a given product must have more GCPs than LCPs for each country, for instance, 3 GCPs and 2 LCPs per country. To find the anomalies, a column-wise search is made in the matrix for each country. For each column, the maximum value is compared with a pre-defined threshold value. If the maximum value of a given column is below the threshold, then that GCP is classified as a global anomaly. The intuition behind this methodology is that the maximum value in a given column states that "this GCP is the most similar to this LCP in this country": if the given GCP has a corresponding LCP, then this would be indicated by the maximum value of that column.

Figure 10: Visualization of a case where the local and global structures do not coincide with one another. The global structure is indicated in green, the anomalous local structure in blue, and the local structures without structural anomalies in red.

The process for finding local anomalies is similar to the process of finding global anomalies, except that the rows are now considered instead of the columns. First, the similarity matrix has to have more LCPs per country than there are GCPs. Then, for each country, the similarity scores between a given LCP and the different GCPs are considered, and the maximum value is compared with the threshold value. The same intuition holds: "if this LCP has a corresponding GCP, then it should be indicated by the greatest similarity score between the two". So again, if the greatest similarity score is lower than the threshold value, the LCP is considered an anomalous product, a local anomaly.
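Both searches can be sketched on a single country's similarity matrix. The sketch assumes the matrix is a list of rows with one row per LCP and one column per GCP, which matches the column-wise and row-wise searches described above; the function names and the scores are illustrative. Checking which of the two procedures applies (more GCPs than LCPs, or the reverse) is left to the caller.

```python
def global_anomalies(sim, threshold):
    """Indices of GCPs (columns) whose best-matching LCP scores below threshold."""
    n_cols = len(sim[0])
    return [c for c in range(n_cols)
            if max(row[c] for row in sim) < threshold]


def local_anomalies(sim, threshold):
    """Indices of LCPs (rows) whose best-matching GCP scores below threshold."""
    return [r for r, row in enumerate(sim) if max(row) < threshold]


# Hypothetical scores for one country: 2 LCPs (rows) and 3 GCPs (columns).
sim = [
    [0.90, 0.20, 0.10],
    [0.15, 0.85, 0.30],
]
print(global_anomalies(sim, threshold=0.5))
print(local_anomalies(sim, threshold=0.5))
```

With these scores, the third GCP has no LCP scoring at least 0.5 and is reported as a global anomaly, while both LCPs have a close GCP and no local anomaly is reported.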


Algorithm 2: Similarity algorithm

function Calculate attribute similarity
    sim_score ← 0
    for attribute in node i and node j do
        if type of attributes are dates then
            convert to float value, compare the two values
            sim_score ← sim_score + time_difference
        else if type of either attribute is empty then
            do not compare attributes
        else
            if attributes have the same value then
                sim_score ← sim_score + 1
            else
                sim_score ← sim_score + 0
    normalize sim_score in range [0, 1]
    return sim_score
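Algorithm 2 could be realized in Python along the following lines. This is a hedged sketch, not the thesis code. Two details are not specified in the text and are assumptions here: dates contribute 1 minus their gap divided by a hypothetical `max_date_gap_days`, so that closer dates score higher, and the final score is divided by the number of attributes that were actually compared. Dates are converted to day ordinals instead of floats.

```python
from datetime import date


def attribute_similarity(node_i, node_j, max_date_gap_days=3650):
    """Similarity in [0, 1] between two attribute dictionaries.

    Assumptions: date gaps are scaled by `max_date_gap_days` and clipped,
    and the sum is normalized by the number of compared attributes.
    """
    score = 0.0
    compared = 0
    for key in node_i.keys() & node_j.keys():
        a, b = node_i[key], node_j[key]
        if a is None or b is None:
            continue  # empty values say nothing about similarity
        if isinstance(a, date) and isinstance(b, date):
            gap = abs(a.toordinal() - b.toordinal())
            score += 1.0 - min(gap / max_date_gap_days, 1.0)
        else:
            score += 1.0 if a == b else 0.0
        compared += 1
    return score / compared if compared else 0.0


# Illustrative attributes of one GCP and one LCP.
gcp = {"name": "chair", "color": "white",
       "updated": date(2021, 1, 1), "designer": None}
lcp = {"name": "chair", "color": "black",
       "updated": date(2021, 1, 1), "designer": "A"}
score = attribute_similarity(gcp, lcp)
print(round(score, 3))
```

Here the name and the date match, the color differs, and the designer is skipped because one value is empty, so two of three compared attributes agree. Calling the function for every GCP and LCP pair of a country would fill the similarity matrix.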


Figure 11: Visualization of the similarity measure between nodes. The left subgraph (red) is a local subgraph (local structure), and the right subgraph (green) is a global subgraph (global structure). The attributes of the GCPs of the global subgraph are compared against the attributes of the LCPs of the local subgraph.


5 Results

This chapter presents the results found during the work of this thesis. First, a few representative products are visualized in graph format in Section 5.1: one product with no structural anomalies, one structurally anomalous product, and one product which was known beforehand to have multiple local products sold per country. After that, the anomalies found are described in Section 5.2. Section 5.3 then presents a statistical analysis of the different types of anomalies that were found. In Section 5.4, the procedure used for finding anomalies among structurally anomalous products is applied to products with no structural anomalies. Lastly, the results are validated in Section 5.5.

5.1 Visualization of graphs

As mentioned in Section 3.1, the dataset contains roughly 1.1 million records, or 1.1 million products. Of these, 40,504 are global products, meaning that 40,504 graphs have been constructed from the SQL-formatted data. A graph representation of a structurally normal product is shown in Figure 12. In the center of the graph is the global product (dark blue node), and close to it, in the same color, are its corresponding global child products. Each country selling the product, no matter the local notation, has an edge connection to the global product. These edges are labeled with the code of the country in which the local product is sold. All these local products are then, like the global product, connected to their related local child products. Furthermore, to make the graph more interpretable, the nodes have been color-coded so that the products belonging to a specific country have the same color. A local product and its local child products therefore share a color. Moreover, if a given country sells more than one local product, all those local products share the same color. The described layout is kept across all products, no matter the type of product.

In Figure 13, a structurally anomalous product is visualized. The global structure has 4 child products, whereas all local products have 5 child products, which is the reason it is classified as a structural anomaly.

Another, larger anomalous product (although not structurally anomalous) is shown in Figure 14. This product was known beforehand to have the type of anomaly described in Section 3.4 (where a country has multiple local products connected to the same global product). The figure illustrates how that type of anomaly appears in a graph. Comparing the graph with Figure 12, it is quickly spotted that there are more than 66 local products in the graph, indicating that some countries have more than one local product (since the enterprise only sells products in 66 different countries). It should be noted that some colors may seem identical, but in some cases they differ slightly in shade. Prior to this graph representation, this anomaly type could only be detected by looking into the database itself.

Figure 12: Graph of a structurally normal product. The global product and its corresponding child products are positioned in the middle and shown in dark blue. The edges from the global product go out to local products and are labeled with the country code of the local product. The local products are in turn connected to their local child products. Each country has a unique color. The global and local structures of the graph have the same number of child products.

5.2 Anomalies

This project aimed to find anomalies among the global and local structures. Of the 40,504 global products analyzed, 555 were found to have discrepancies (structural anomalies) between the two structures. For each anomalous global product, it is highly likely that more than one local product does not coincide with the global structure. In this case, a total of 1,454 local products were classified as anomalies. Before these anomalies were found, another type of anomaly was discovered in the process. These (new) anomalies consist of countries with multiple local child products sold at the same time (the scenario described in Figure 8 in Section 4.3). These findings are further discussed in Section 6.1.


Figure 13: Graph of a structurally anomalous product. The global product and its corresponding child products are positioned in the middle and shown in dark blue. The edges from the global product go out to local products and are labeled with the country code of the local product. The local products are in turn connected to their local child products. Each country has a unique color. The global structure has 4 child products, whereas all local structures have 5 child products.

5.3 Statistics

Further statistical analysis was performed on the structurally anomalous products, for instance on the distribution of the anomalies across countries. This could be of great importance when trying to find patterns in the anomalies. The distribution is shown as a pie chart in Figure 15. In the chart, it is quickly spotted that Great Britain, Ireland, and the United States hold the greatest number of anomalies. Countries with less than 1.7% of the total number of anomalies have been labeled "other". This group consists of 543 structural anomalies in total. For a complete overview of how many anomalies each country has, see Figure 26 in Appendix A.

Another way of looking at the data is to consider how many local products have a given number of local child products. This distribution is shown for the whole dataset in Figure 16. One can clearly see that the most common number of local child products for a given local product is 4, with 3 and 2 as the second and third most common, respectively. One also sees a general trend that, for 5 or more local child products, the number of local products decreases as the number of local child products increases.

The same statistics for only the structurally anomalous local products are shown in Figure 17. The distribution differs from the one shown in Figure 16. The figure indicates that, among the anomalous local products, 6 local child products is the most common case, with 5 and 7 child products as the second and third most common cases, respectively. The observant reader will notice that the number of local child products in this figure only goes up to 16, whereas in Figure 16 the number of local child products went up to 26. This indicates that having more local child products does not necessarily lead to more anomalies. For the distribution among products that did not have any structural anomalies, please refer to Figure 27 in Appendix A.

Figure 14: Graph representation of a product where multiple countries have more than one local product. In the middle of the graph, colored in dark blue, is the global product and its related child products. The edges go out from the global product to local products in a given country and are labeled with the country code. In turn, the local products are connected to their local child products. Each color in the graph represents a country.

5.3.1 Attribute-related anomalies

After the structural anomalies had been found, the next step consisted of comparing the attributes of the local child products (which were found to be anomalous) to their corresponding global child products.

First, similarity matrices were constructed in order to get a measure of how similar local and global products are to one another. Table 3 shows a similarity matrix between 5 GCPs and 4 LCPs in 2 different countries. It is important to note that the similarity matrices only contain GCPs and LCPs and not regular global and local products. However, the GCPs are the child products of a global product, and the LCPs in a given country are child products of a local product (which is connected to the global product).

There are two main procedures for checking for attribute-related anomalies. The first investigates whether, for a given GCP, there exists a corresponding LCP in a given country. This will be referred to as finding global anomalies. The second investigates whether, for a given LCP in a given country, there exists a corresponding GCP. This will be referred to as finding local anomalies. The observant reader will have noticed that in Table 3 there is a mismatch between the number of GCPs and the number of LCPs in a given country (5 GCPs and only 4 LCPs per country). This indicates that the first attribute-anomaly procedure shall be performed (check for global anomalies).

Figure 15: Share of the anomalies contributed by each country. Countries with less than a 1.7% share are grouped into "other".

The opposite scenario can be found in Table 4, where there are only 2 GCPs but at least 3 LCPs in each country. Singapore stands out with 4 LCPs. This scenario indicates that at least one of the LCPs does not have a corresponding GCP. In this matrix, it so happens that all LCPs across the different countries (China, Japan, ...) go under the same product ID, which indicates that they have the same attributes, hence the recurring similarity scores across the different countries.

The results of performing attribute-based anomaly detection on anomalous products are shown in Figure 18. Keep in mind that the color codes are not consistent across the different charts. For a given threshold, the different pie charts show how many of the local anomalies found belong to a given country. The threshold value was in the interval [0.1, 0.55] with a step of 0.05. The greater the threshold, the more similar products have to be to one another before being tagged