
Graph Based Machine Learning approaches and Clustering in a Customer Relationship Management Setting

Johan Delissen

Computer Science and Engineering, master's level (120 credits) 2020

Luleå University of Technology

Department of Computer Science, Electrical and Space Engineering


Abstract

This master thesis investigates the utilisation of various graph based machine learning models for solving a customer segmentation problem, a task coupled to Customer Relationship Management, where the objective is to divide customers into different groups based on similar attributes. More specifically, a customer segmentation problem is solved via an unsupervised machine learning technique named clustering, using the k-means clustering algorithm. Three different representations of customers as a vector of attributes are created and then utilised by the k-means algorithm to divide users into different clusters. The first representation uses an elementary feature vector, and the other two approaches use feature vectors produced by graph based machine learning models.

Results show that similar groupings are found, but that results vary depending on what data is included in the instantiation and training of the various approaches and their corresponding models.


Preface

This is the master thesis ”Graph Based Machine Learning approaches and Clustering in a Customer Relationship Management Setting”. This master thesis is the final step in completing a Master Programme in Computer Science and Engineering at Luleå University of Technology.

The master thesis was undertaken at the company Future Ordering, where I had previously been part of a research project and where I am now also employed. The problem scenario was created by CEO Andreas Stormvinge. Solving the problem scenario has been challenging but sincerely worth it, since I have learned a great deal more about topics within data science & machine learning.

I would like to thank Johan Kristiansson for his help and assistance during this master thesis period. It was a great experience and ever so inspiring to discuss ideas and concepts together with him in an effort to solve the problem scenario. I also wish to thank my supervisor Professor Peter Parnes for all the assistance he has given during this period. Furthermore, I would also like to thank my fellow master thesis students at Future Ordering, Adam Sawert and David Johansson, for all the good discussions we have had, helping each other and sharing knowledge and insight.

Ultimately I wish to make a small comment in regard to the strange times we find ourselves in. As of writing we currently find ourselves in the midst of the Covid pandemic. It has been a turbulent period including huge changes to all of us in our daily lives. I wish to say well done to all who are taking on their part in being responsible and helpful during these times. Thanks especially to those on the front line. I wish that the person reading this master thesis finds themselves and their loved ones in good health.

I hope you find this master thesis interesting.

Johan Delissen

Luleå, August 15, 2020


Contents

Chapter 1 – Introduction
1.1 Background
1.2 Motivation
1.3 Problem Definition
1.4 Delimitations
1.5 Thesis Structure

Chapter 2 – Related Work

Chapter 3 – Theory
3.1 Customer Segmentation
3.2 Clustering
3.2.1 Clustering Algorithms
3.3 Visualization - TSNE
3.4 Graph Based Machine Learning
3.4.1 Knowledge Graph
3.4.2 ComplEx Model
3.4.3 Training & Evaluation
3.5 MetaPath2Vec
3.5.1 Word2vec
3.5.2 Random Walk
3.5.3 Meta Paths

Chapter 4 – Implementation
4.1 Data Pre-processing
4.2 Three Approaches
4.2.1 Approach: Standard Feature Vectors
4.2.2 Approach: MetaPath2Vec Feature Vectors
4.2.3 Approach: Knowledge Graph Embeddings
4.3 Clustering & Visualization
4.4 Technical Tools

Chapter 5 – Results
5.1 Results
5.1.3 Results MetaPath2Vec
5.2 Evaluation
5.2.1 Evaluation of Feature Vector Results
5.2.2 Evaluation of ComplEx Knowledge Graph Results
5.2.3 Evaluation of MetaPath2Vec Results
5.2.4 Comparison of approaches
5.2.5 Clustering
5.2.6 Data filtering

Chapter 6 – Discussion, Conclusion & Future Work
6.1 Discussion
6.2 Conclusion
6.3 Future Work


Chapter 1 – Introduction

1.1 Background

Future Ordering is a company that offers cloud based kiosk-, app- and online-ordering solutions for Food & Beverage companies. The solution allows Food & Beverage companies to extend their customer base, either through the efficiency of on-premise queue-eliminating self-ordering kiosks or through the flexibility of being able to order from any location on any device. Future Ordering is in the process of digitizing one of the few industries yet untouched by technology. The technical nature of Future Ordering's solutions has allowed Future Ordering to accumulate data from different channels. From data, insights can be drawn. In the case of Future Ordering, having insight regarding restaurants, or their customers, can help dictate what direction of development is relevant for Future Ordering.

By creating the correct models for a certain problem scenario, the aim is to offer insights regarding, for example, kitchen efficiency, smarter sales insight or Customer Relationship Management. In this master thesis data will be utilised to create models relevant to Customer Relationship Management.

1.2 Motivation

Modern day technology has led to data becoming available in abundance, and the question arises what to do with it. A natural inclination is to explore this data to try and draw useful insights. A difficulty then lies in how to explore data, especially taking into regard the sheer amount, and also how to conclude what is useful information. With the emergence of new technologies such as Artificial Intelligence and Machine Learning, society has been introduced to several new techniques that help gain insights or create intelligent problem-solving agents by utilizing the large amount of data now available. These techniques are being used in both economic and societal settings, and the use cases for them are seemingly limitless. Future Ordering, like many other businesses and companies, wishes to incorporate these new technologies into its technological arsenal.

The reason for this master thesis is to take the Future Ordering data and put it to use in a Customer Relationship Management setting. Using various machine learning techniques, the aim is to find correlations regarding users and their behaviour. If it is possible to identify these correlations then this can be used for several purposes. For example, it can be useful in the context of item recommendation if one can learn the specific preferences of users. It is also possible to identify specific correlations where certain attributes of a user usually coincide with a visit to a specific restaurant, the order of a specific item, or the time of a visit. Another use case would be to identify groups of users that adhere to a specific type of behaviour. With the identified groups at hand, one could target certain groups with specific deals or advertisements, with the goal of enhancing the relations between companies and their customers.

The main reason for extracting insights and using them in a Customer Relationship Management setting is to enhance the user experience when ordering food at a restaurant using the Future Ordering infrastructure. If users feel satisfied with their experience in visiting a restaurant, the chance of customer retention increases. This in turn increases revenue for the restaurant. Ultimately this strengthens the relationship between Future Ordering and its clients, encouraging further collaboration. If Future Ordering can offer not only technological infrastructure for ordering food, but also more data-driven services, then Future Ordering can benefit more entities within the Food & Beverage industry, consequently leaving more users satisfied with their ordering experience. Therefore the motivation for this master thesis is to create the outline for a specific type of data-driven service that in turn can enhance the relationship and satisfaction between Future Ordering, its clients and their customers.

1.3 Problem Definition

Customer segmentation is a key aspect of customer relationship management and involves grouping customers into various groups based on similarity. Segmenting customers into different groups gives insight about a customer base. The attempt to solve a customer segmentation problem is here made by utilising a machine learning technique called clustering. Clustering aims to categorize or group data points with similar features or attributes. Clustering algorithms take data points in the form of vector representations as input and then calculate, based on those input vectors, how similar the data points are to each other. The notion of clustering comes from the fact that the data points fed to the algorithm are categorized into different clusters. In this master thesis the representations that are fed into the clustering algorithm are generated using different approaches. One approach uses an elementary feature vector. The second approach creates a vector representation by using a knowledge graph built of facts in the form of nodes and connections. The third approach creates a representation by traversing a graph with nodes and edges and generating a vector from this. A more in-depth explanation of these various approaches is given in chapter 3 of the master thesis.

The problem addressed in this master thesis is to segment customers into different groups depending on their behaviour. The users and their behaviours are captured in different representations produced using different approaches. By feeding the aforementioned representations into an unsupervised clustering algorithm, it can be examined whether the generated representations can be used for clustering users into various groups. The aim of the master thesis is to answer the following questions:

• Can we find similar groups of customers based on their behavioural attributes using unsupervised clustering?

• Is it possible to group similar users using graph based machine learning models to produce the representations used for clustering?

• How do the different approaches for identifying various groupings of customers compare to each other?

• How does the size of the feature vectors representing user behaviour produced by the models affect the quality of the clusters?

When aiming to solve the tasks listed above, the specific approach taken in this master thesis is to utilize graph based machine learning models. The aim is to examine graph based learning models and find out if they are pertinent to solving this problem scenario. The graph based models will be put under scrutiny to see if they can be utilized for customer segmentation in the form of a clustering problem.

Given the various tasks that graph based machine learning models can be utilized for, such as link prediction, node classification and recommender systems, the question arises whether it makes sense to use graph based solutions in a Food & Beverage (also denoted F & B) setting. Can graph based machine learning be utilized for customer segmentation, and what are the results? The main subject of interest is to determine if the entity-relation structure appurtenant to graphs can assist in deriving clusters of users. Furthermore, the graph based machine learning approaches will also be compared to more classical clustering approaches. Ultimately it will also be reviewed whether the clusters produced by the graph based approaches can actually give insight and be of relevance for customer segmentation.

1.4 Delimitations

Since the aim of this master thesis is to apply customer segmentation by utilizing graph based machine learning methods in combination with clustering, the ultimate goal is that the results show distinct cluster formation. In this master thesis it will only be examined whether clusters are formed; no in-depth analysis of what the clusters consist of will be performed. Given that several graph based machine learning models, in combination with more elementary approaches, will be evaluated, the focus lies in comparing these approaches against each other. Furthermore, the analysis of the results is done in a purely metric sense, not including any business or marketing assessment, that being a different field in which no expertise is possessed.

Moreover, the business and marketing fields are outside the scope of the relevant scientific area. Another delimitation is the investigation made into the various clustering techniques applied to this master thesis problem scenario. Clustering is such a broad field that an inquiry into the subject would suffice for a separate master thesis. Instead a commonly used approach is utilised, which is enough to answer the relevant questions included in this master thesis problem scenario.

1.5 Thesis Structure

The master thesis is organized as follows. In chapter 2 related work is presented, describing solutions similar to the solution presented in the master thesis. In chapter 3 the necessary theory is presented, explaining the various concepts utilized in the master thesis. This part includes theory apropos clustering, graph based machine learning and visualization techniques. In chapter 4 the implementation is presented. This includes the entire pipeline from data cleaning, model preparation, model training, clustering and visualization for each of the three different approaches. In chapter 5 the results for each approach are presented, in the form of clustering metrics as well as clustering visualizations. In chapter 6 the methodology and the interpretation of the results are discussed, and conclusions are stated, including how well the problem scenario has been solved and what can be improved and studied in future work.

Chapter 2 – Related Work

There are many different problem scenarios within customer relationship management. As stated in section 1.3, the scope of customer relationship management in this master thesis includes customer segmentation by utilizing clustering techniques. There are several data mining techniques used for customer segmentation and identification.

There are several instances where k-means is used for customer segmentation purposes [4], [1], [26]. K-means is a specific clustering technique used to cluster data into k different groupings. Different from the approach in this master thesis, the articles [4], [1] and [26] categorize users based on the Frequency, Recency and Monetary value of a customer instead of output embeddings from graph based machine learning models. Moreover, [1] uses the silhouette score in the evaluation of clustering results. The silhouette score is used to evaluate the quality of produced clusters in terms of how similar the data points in a cluster are to each other and also how far the clusters are from one another. In this master thesis the silhouette score is also used in the evaluation of cluster results. Although k-means is a prevalent clustering method, other approaches have been used in customer segmentation: for example, a soft-clustering approach has been utilised for customer segmentation [40], and there is an instance where Hierarchical Agglomerative Clustering has been used [10].

There are several graph based machine learning models such as GraphSage [15], Graph Attention Networks [36] and Graph Convolutional Networks [21]. Included in graph based machine learning models are knowledge graph models such as DistMult [41] and ConvE [8].

Models can be used for several purposes, such as node representation learning, where a node in a graph is translated into a vector embedding [15]. They can also be used for node classification, as in [36], [13]. Graph neural networks have also been used for link prediction, with models utilized for link prediction on both homogeneous graphs [14], [13] and heterogeneous graphs [43]. Furthermore, graph neural networks can also be utilised in clustering scenarios [43]. Graph neural networks, specifically graph convolutional neural networks, have been utilised in a customer relationship management setting where embeddings are trained that can be used for solving a recommendation problem [42]. Graph neural networks have also been used in recommendation system settings taking both social (user-user) and preference (user-item) relations into account [12].

Graph based learning models in the form of knowledge graphs can be utilized for an array of tasks such as link prediction, entity resolution, triple classification, entity classification, relation extraction, question answering and recommender systems [37]. Several of the approaches presented in [37] generate embeddings, and it is such embeddings that are utilized for clustering in this master thesis. No example was found utilising knowledge graph based approaches for specifically solving a clustering problem.

Graph based learning models have several use cases, as explained above. This master thesis examines if graph based learning models can be utilized for customer segmentation in an F & B setting. To the best of my knowledge, after examining relevant scientific publications, this specific approach has not been appraised for this specific setting.

For this master thesis there are several graph based machine learning models that would be interesting to investigate but, due to time constraints, this is not possible. Such models include HinSAGE (a heterogeneous graph based version of GraphSage [15] offered by the StellarGraph [7] Python library), Heterogeneous Graph Attention Networks [38] and Graph Attention Networks [36]. Another reason for some of the aforementioned models not being eligible for inclusion in the master thesis is the disadvantage of only being applicable to homogeneous graphs, which does not translate well to this problem scenario.

Chapter 3 – Theory

The steps in solving the master thesis problem are as follows. First we create representations of users and their behaviour using three different approaches. The first approach is a simple feature vector based on various attributes of a user. How this feature vector is created is explained in chapter 4, and no in-depth theory is needed to understand the procedure. Two more separate representations of a user are also derived, each from a different graph based machine learning model. To understand how the representations are created using graph based machine learning, it is required to understand how these models intrinsically operate. The necessary theory for this is explained in this section. Furthermore, to accomplish customer segmentation we will be using a machine learning approach called clustering. What clustering is and how it works is also explained in this section. Once the clustering has been performed using the various representations, the clustering results will be evaluated using the silhouette score. The results will also be visualized to gain further insight into the clustering results. The necessary theory to understand the aforementioned concepts is explained in detail in this section.

3.1 Customer Segmentation

Customer segmentation is used to divide a set of customers into separate groups, where each group is internally similar by some notion defined by a particular criterion. One of the objectives of customer segmentation is to be able to direct relevant marketing strategies specifically tailored to a specific group according to the qualities of an identified group [35]. Customer segmentation plays an important role in Customer Relationship Management (CRM in the following).

As the name implies, CRM aims to better the relationship between the customer and the company by utilizing various techniques. In modern day society many of the approaches are deployed in a technological setting, attempting to better relationships by targeted marketing among other things [5]. CRM and its applications have attracted curiosity amongst many companies due to the potential advantage that CRM can offer [20]. The authors of [20] state that examples of potential benefits are: ”(1) Increased customer retention and loyalty, (2) Higher Customer profitability, (3) Creation value for the customer, (4) Customization of products and services, (5) Lower process, higher quality products and services.” The reason customer segmentation plays an important role in customer relationship management is that it can help provide a company with insight in regards to customer loyalty, spending, churning and more. For a business, being aware of what category a customer belongs to enhances the business's ability to make the correct decisions in how to handle marketing and other operational processes to strengthen relations and ultimately increase profit.

3.2 Clustering

Clustering is the attempt to group or categorize data points into groups that in some notion are similar to one another, where a group containing similar data points is referred to as a cluster. Clustering originated from a mathematical and statistical background. In a newer approach utilising machine learning, clusters are found as hidden patterns that are detected via unsupervised learning [2]. The notion of ”similarity” is most often defined as how close two or more data points are in a space, with a distance function used to measure distances between data points [17]. Clustering is often mentioned in the same context as data mining. A good definition of data mining, presented by Jiawei Han [16], is as follows: ”Data mining is the process of discovering knowledge or patterns from massive amounts of data”.

3.2.1 Clustering Algorithms

There are many clustering algorithms available, such as K-means, incremental DBSCAN [3] and hierarchical agglomerative clustering [10], to name a few from the great collection available today, plenty of which have specific variations depending on the use case. Wong et al. in their survey on clustering algorithms [39] point out that there are several clustering paradigms, such as ”Partitional Clustering, Hierarchical Clustering, Density-based Clustering, Grid-based Clustering, Correlation Clustering, Spectral Clustering, Gravitational Clustering, Herd Clustering, and Others”. It is easily noticeable how large the research field of clustering is. For some the field has grown so much that they argue that the sheer amount of clustering algorithms is a consequence of the definition of ”cluster” being vague and not precise enough, concluding ”...that clusters are, in large part, in the eye of the beholder” [11].

K-means. K-means is one of the most commonly used clustering algorithms and is a form of partitional clustering [39].


Anil K. Jain, in his article on clustering [18], defines the K-means algorithm in a clear and concise manner as follows: ”Let $X = \{x_i\}$, $i = 1, \ldots, n$ be the set of $n$ $d$-dimensional points to be clustered into a set of $K$ clusters, $C = \{c_k\}$, $k = 1, \ldots, K$. The K-means algorithm finds a partition such that the squared error between the empirical mean of a cluster and the points in the cluster is minimized. Let $\mu_k$ be the mean of cluster $c_k$. The squared error between $\mu_k$ and the points in cluster $c_k$ is defined as

$$J(c_k) = \sum_{x_i \in c_k} \|x_i - \mu_k\|^2. \tag{3.1}$$

The goal of K-means is to minimize the sum of the squared error over all $K$ clusters,

$$J(C) = \sum_{k=1}^{K} \sum_{x_i \in c_k} \|x_i - \mu_k\|^2.” \tag{3.2}$$

Because K-means depends on the mean value of a cluster, the K-means algorithm can be greatly affected by strong outliers [3]. This is worth taking into consideration during the cluster analysis.
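To make the algorithm concrete, the following is a minimal sketch of running K-means with scikit-learn on toy data; the data and parameter values are illustrative, not taken from the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups in a 2-D feature space.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.2], [7.9, 8.1], [8.1, 7.9]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each data point
print(kmeans.cluster_centers_)  # the empirical cluster means mu_k
```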

Silhouette Value. The silhouette value is used to understand the quality of clusters generated from some arbitrary clustering algorithm. The silhouette value can be used to assess resulting clusters as well as to choose the number of clusters to specify for clustering algorithms like k-means. The notion of quality in this regard implies how well a data point lies in correspondence to its assigned cluster. To calculate the silhouette value, the clusters need to be defined (preferably more than one), and it must be possible to measure the distance between all the data points.

The following definitions are taken from the original paper on the silhouette value, written by Peter J. Rousseeuw [31], and utilise the definitions therein. The silhouette is defined as:

$$s(i) = \begin{cases} 1 - a(i)/b(i) & \text{if } a(i) < b(i), \\ 0 & \text{if } a(i) = b(i), \\ b(i)/a(i) - 1 & \text{if } a(i) > b(i), \end{cases} \tag{3.3}$$

giving

$$-1 \le s(i) \le 1.$$

Here $a(i)$ is the ”average dissimilarity of $i$ to all other objects of cluster $A$” [31]. Moreover, $b(i)$ relates to the ”neighbour of object $i$” [31], which is the ”second best choice” for object $i$; that is to say, if $i$ does not lie within cluster $A$, the best cluster $B$ for $i$ to lie within. $B$ is obtained by finding the minimum of $d(i, B)$, where the function $d$ is explained by the following example: $d(i, C)$ is the ”average dissimilarity of $i$ to all objects within cluster $C$. This is tested for all clusters” [31].


If $s(i)$ is close to one, this implies that $a(i)$ is a great deal smaller than $b(i)$. Recalling that $a(i)$ denotes dissimilarity within cluster $A$, and $b(i)$ the dissimilarity to the ”second best choice” cluster $B$, this entails that if $a(i)$ is a great deal smaller than $b(i)$, then $i$ is correctly placed within cluster $A$. The inverted logic applies for $s(i)$ close to minus one: if $b(i)$ is a great deal smaller than $a(i)$ then $i$ probably should have belonged to cluster $B$, but since $b(i)$ denotes the ”second best choice”, this implies that $i$ resides in cluster $A$, which most probably is not the optimal cluster assignment [31].
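As a sketch of how the silhouette value can guide the choice of k (assuming scikit-learn and placeholder data, not the thesis data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.RandomState(0).rand(200, 8)  # placeholder feature vectors

# Cluster with several values of k and report the mean silhouette value;
# higher values indicate more cohesive, better-separated clusters.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```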

3.3 Visualization - TSNE

Visualizing high dimensional data can be difficult. The problem lies in reducing the high dimensional data to an observable two or three dimensions without losing too much information. If too much information is lost, then the visualization can fail to represent the true correlations or patterns that are present in the data. Also, without the correct technique, visualizing high dimensional data in low dimensions can lead to results resembling noise. A popular approach to visualizing high dimensional data in low dimensions is the technique t-distributed Stochastic Neighbour Embedding (t-SNE in the following). t-SNE is a technique that reduces the number of dimensions (dimensionality reduction) yet tries to capture and keep the inherent underlying composition throughout the reduction. Although t-SNE is used throughout the scope of the master thesis, the fundamental technicalities of t-SNE are not relevant here. For a deeper understanding, refer to [23].
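In practice the reduction is a one-line call; the following is a minimal sketch with scikit-learn, where the 128-dimensional input vectors are placeholders standing in for the embeddings produced later in the thesis:

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.RandomState(0).rand(100, 128)  # placeholder vectors

# Reduce to 2 dimensions for plotting; each row of coords is one data point.
coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2); can be scatter-plotted, coloured by cluster
```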

3.4 Graph Based Machine Learning

Below, theory is presented explaining the inner workings of two different graph based machine learning models, each with specific characteristics. The first is the ComplEx model, which belongs to a group within the graph based machine learning domain called knowledge graphs. The second model is named MetaPath2Vec and is a model that learns by traversing a heterogeneous graph. It is from these two models that representations of users will be derived in the form of vectors, otherwise named embeddings. The embeddings are output from the models after training. How the notion of training is established for the respective models is also described below.

3.4.1 Knowledge Graph

A Knowledge Graph (KG in the following) is a graph constructed from entities in the form of nodes and relations between the entities in the form of edges. Because the edges can be of various types depending on the relationship between two entities, the graph is multi-relational. A graph is composed by connecting nodes with edges, making up facts, also called triplets, structured with a head entity, relation and tail entity. This notation makes it simple to represent relational data, and with the help of knowledge graph embeddings one can perform various downstream tasks [37]. Producing embeddings from the entities and relations in a KG usually comprises three steps: ”(i) Representing entities and relations, (ii) defining a scoring function, (iii) learning entity and relation representation”, according to Wang et al. [37].
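As a small illustration of the triplet structure, a KG for this domain could be stored as a list of (head, relation, tail) facts; the entity and relation names below are hypothetical, chosen only to mirror the entities used later in the thesis:

```python
# Each tuple is one fact: (head entity, relation, tail entity).
triples = [
    ("user_1", "ORDERED",   "item_42"),
    ("user_1", "VISITED",   "store_7"),
    ("user_2", "VISITED",   "store_7"),
    ("user_2", "PAID_WITH", "credit_card"),
    ("user_2", "USED_OS",   "ios"),
]
```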

3.4.2 ComplEx Model

The ComplEx model is the specific knowledge graph model that will be used throughout this master thesis. The first thing to observe regarding the ComplEx model is how to model a relation between two entities. An explanation of how the ComplEx model works is presented in the following, in accordance with the definitions described by Trouillon et al. in the article presenting the ComplEx model [34]. $E$ is the set of entities with $|E| = n$. To describe whether there is a relationship between two entities, a sign matrix $Y$ is declared, where the existence of a relationship between two entities is indicated with $-1$ or $1$; formally $Y_{so} \in \{-1, 1\}$. The aim of the knowledge graph is to correctly predict links between entities via a matrix $X \in \mathbb{R}^{n \times n}$, where each entry is a score dependent on the relationship between two entities. Using the logistic (sigmoid) function, the probability of a link between two entities is obtained by:

$$P(Y_{so} = 1) = \sigma(X_{so}). \tag{3.4}$$

Here $s$ and $o$ are entities from the previously mentioned set of entities $E$. Because the ComplEx model handles the scenario where an entity can take the role of both the subject and the object of a fact, the embedding for a specific entity must represent that entity in both roles. To achieve this, eigenvalue decomposition is utilised. In this case the eigenvalue decomposition can be written as

$$X = E W \bar{E}^T \tag{3.5}$$

”... where $W \in \mathbb{C}^{n \times n}$ is the diagonal matrix of eigenvalues (with decreasing modulus) and $E \in \mathbb{C}^{n \times n}$ is a unitary matrix of eigenvectors ...” [34]. Since the score matrix is purely real, only the real part is taken into consideration, as

$$X = \mathrm{Re}(E W \bar{E}^T). \tag{3.6}$$

In the case with several types of relations, each relation $r$ receives a corresponding embedding $w_r \in \mathbb{C}^K$. For each relation $r \in R$ the aim is to retrieve the scores of $X_r$. Now,

$$P(Y_{rso} = 1) = \sigma(\phi(r, s, o; \Theta)) \tag{3.7}$$

holds, where $s$ and $o$ are entities of $E$ and $\phi(r, s, o; \Theta)$ is the scoring function

$$\phi(r, s, o; \Theta) = \mathrm{Re}(\langle w_r, e_s, \bar{e}_o \rangle) = \mathrm{Re}\Big(\sum_{k=1}^{K} w_{rk} e_{sk} \bar{e}_{ok}\Big) \tag{3.8}$$

and $\sigma(x)$ is

$$\sigma(x) = \frac{1}{1 + e^{-x}}. \tag{3.9}$$

It is this scoring function that is later utilized in the training process for the knowledge graph embedding generation [34].

3.4.3 Training & Evaluation

Training. The notion of training implies that there is room for improvement, always aiming to become better at a certain task. This is true for both humans and machines. The notion of training a machine learning model in general can be ambiguous, although a common notion of training consists of minimizing a loss function. The loss function is a way of measuring or indicating how well a model is solving a problem. For our knowledge graph we will be training the model to produce embeddings that generate high scores for true facts and low scores for false facts using a specific scoring function [6].

For our model the multiclass negative log-likelihood loss is used as the loss function, defined as

$$\mathcal{L}(X) = -\sum_{x \in X} \log p(e_2 \mid e_1, r_k) - \sum_{x \in X} \log p(e_1 \mid r_k, e_2). \tag{3.10}$$

In this equation $X$ is the set of triples $x$, where $x = (e_1, r_k, e_2)$ denotes two entities and the relation between them. Here the probabilities $p(e_2 \mid r_k, e_1)$ are calculated using the scoring function (3.8) defined by the ComplEx model, denoted $s(e_1, r_k, e_2)$, in the softmax function:

$$p(e_2 \mid r_k, e_1) = \frac{\exp(s(e_1, r_k, e_2))}{\sum_{\bar{e}_2 \in E} \exp(s(e_1, r_k, \bar{e}_2))}. \tag{3.11}$$

Here $\bar{e}_2$ are entities that, in combination with $e_1$ and $r_k$, can be part of facts of the form $(e_1, r_k, ?)$, and $\exp()$ is the exponential function [33], [19], [6].

There is also a concept called the optimizer that is often included in the training of a machine learning model. The optimizer is used in the training process: it utilises the output of the loss function and changes the internal state of the model with the aim of decreasing the loss function output. The efficiency of the optimizer also contributes to the time it takes to train a given model. For the knowledge graph based model an optimizer commonly referred to as Adam is used.


The knowledge graph based model also uses regularization. Regularization is often used in neural networks to decrease overfitting of a model. Overfitting is when the model performs well on training data but not on testing data. For the knowledge graph model, a regularization method named LP regularization is used with the ComplEx model [22].
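Putting loss, optimizer and regularizer together, the following is a hedged sketch of how a ComplEx model might be instantiated and trained with the AmpliGraph library cited as [6] (version 1.x API assumed; all hyperparameter values and triples are illustrative, not the thesis's actual configuration):

```python
import numpy as np
from ampligraph.latent_features import ComplEx

# Toy training triples; the real data uses the entities from chapter 4.
X_train = np.array([["user_1", "ORDERED", "item_42"],
                    ["user_1", "VISITED", "store_7"],
                    ["user_2", "VISITED", "store_7"]])

model = ComplEx(k=100,                  # embedding dimension K
                epochs=200,
                loss="multiclass_nll",  # the loss in equation (3.10)
                regularizer="LP",       # LP regularization
                optimizer="adam")       # the Adam optimizer
model.fit(X_train)

# Entity embeddings, later used as feature vectors for clustering.
user_emb = model.get_embeddings(np.array(["user_1", "user_2"]),
                                embedding_type="entity")
```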

Evaluation. After training a model it is necessary to assess how well the final version of the model solves the problem at hand. This evaluation can look different depending on the design of a machine learning model. For the knowledge graph model the following metrics are used to assess how well the model performs:

1. Mean Rank (MR): The formal definition of MR from the AmpliGraph [6] documentation is the following:

$$MR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} rank_{(s,p,o)_i} \tag{3.12}$$

Here $Q$ is a set of triples and $(s, p, o)$ is a triple $\in Q$. Each triple is queried by the model, and the model returns a collection of facts, both true and false, each with an accompanying score, completing the queried fact. This collection is sorted by score, with the highest score having rank 1. The rank of the true fact is then noted, and the mean rank is calculated over the ranks of the true statements for all queries. A low value of MR implies good performance of the model.

2. Mean Reciprocal Rank (MRR): The formal definition of MRR from the AmpliGraph [6] documentation is the following:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_{(s,p,o)_i}} \tag{3.13}$$

Mean Reciprocal Rank is defined similarly to Mean Rank, apart from the mean of the reciprocal ranks being calculated.

3. Hits@N: The formal definition of Hits@N from the AmpliGraph [6] documentation is the following:

$$Hits@N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} 1 \text{ if } rank_{(s,p,o)_i} \le N \tag{3.14}$$

Again $Q$ is a set of triples and $(s, p, o)$ is a triple $\in Q$. This evaluation score states the percentage of times that the correct or true fact is placed within the top $N$ results of a query of a fact. A high score on this metric implies good performance of the model.
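All three metrics derive from the same list of ranks, so they can be computed directly in a few lines of plain NumPy (the ranks below are illustrative):

```python
import numpy as np

def mr_mrr_hits(ranks, n=10):
    """Compute MR (3.12), MRR (3.13) and Hits@N (3.14) from ranks of true facts."""
    ranks = np.asarray(ranks, dtype=float)
    return ranks.mean(), (1.0 / ranks).mean(), (ranks <= n).mean()

# Example: five queries whose true facts were ranked 1, 3, 2, 50 and 1.
print(mr_mrr_hits([1, 3, 2, 50, 1], n=10))  # (11.4, approx. 0.571, 0.8)
```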


3.5 MetaPath2Vec

MetaPath2Vec is the name of a model used for representation learning in heterogeneous networks. Using meta paths, random walks and the Word2Vec skip-gram model, MetaPath2Vec aims to capture structure in a network with multiple types of nodes and edges [9]. DeepWalk [29] and Node2vec [13] are also approaches to representation learning of nodes using random walks together with the Word2Vec skip-gram model ([25, 24]); however, these approaches only apply to homogeneous graphs. There are several concepts that need to be visited here, namely meta paths, the Word2Vec skip-gram model and random walks. These concepts are explained in the following.

3.5.1 Word2vec

Word2Vec has two model architectures, the Continuous Bag-of-Words model and the Continuous Skip-gram model, which both are natural language processing techniques used for learning word vector embeddings from text repositories [24]. In this case the focus is mainly on the skip-gram model, since this is the model used in this master thesis. The skip-gram model is trained by predicting the surrounding words in a sentence, given a focus word in said sentence and a window including the surrounding words, resulting in word embeddings trained for this task. Formally defined by the authors in [25], the skip-gram model aims to maximize the average log probability of a sequence of training words $w_1, w_2, w_3, \ldots, w_T$ via

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t). \tag{3.15}$$

Here $c$ is the size of the training context. Words that are often found in the same context, that is to say are similar by some notion, are inclined to have more similar vector representations.

The skip-gram model will play an important role in the MetaPath2Vec model, since it will be the mechanism that creates representations for nodes in a graph. It can be seen as follows: instead of a word predicting the surrounding words in a specific window context, it will be a node used to predict the neighbouring nodes in a graph. How a word in a sentence translates to a node in a graph is explained in the next section.
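For reference, a minimal gensim sketch of training skip-gram embeddings on a toy corpus is shown below (gensim 3.x parameter names, matching the tables in chapter 4; gensim version 4 and later renames size to vector_size and iter to epochs):

```python
from gensim.models import Word2Vec

# Toy corpus; in the MetaPath2Vec setting each "sentence" is instead
# a sequence of node IDs produced by the random walks.
sentences = [["the", "quick", "brown", "fox"],
             ["the", "lazy", "brown", "dog"]]

# sg=1 selects the skip-gram architecture rather than CBOW.
model = Word2Vec(sentences, size=32, window=2, min_count=1, sg=1, iter=50)
print(model.wv["fox"])  # the learned 32-dimensional embedding for "fox"
```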

3.5.2 Random Walk

DeepWalk [29] uses random walks to generate network neighbourhoods of nodes. The main concept is to generate a sequence of nodes that belong to a root node $v_i$'s neighbourhood. This is done by defining a walk length $t$ and traversing the neighbourhood of $v_i$ with a random walk, which chooses to traverse from one node to another until the number of steps taken from the original root node $v_i$ reaches the walk length $t$. The random walk has then generated a sequence of nodes that represents the neighbourhood of the root node $v_i$. It is this sequence that can be fed into the skip-gram model to generate representations for the nodes. Basically, instead of supplying the skip-gram model with a set of sequences of words, the skip-gram model is fed a set of sequences of nodes. The same mechanism that would generate embeddings for the words here generates embeddings for the nodes; the only difference is the distribution from which the sequences of tokens are generated.
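A uniform random walk is only a few lines of code; the following sketch on a hypothetical adjacency list produces one such node sequence, ready to be fed to a skip-gram model:

```python
import random

# Toy undirected graph as an adjacency list; node names are illustrative.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}

def random_walk(root, t):
    """Walk t steps from root, choosing each next node uniformly at random."""
    walk = [root]
    while len(walk) < t:
        walk.append(random.choice(graph[walk[-1]]))
    return walk

print(random_walk("a", 10))  # one "sentence" of node IDs
```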

3.5.3 Meta Paths

Meta paths are used to assist in traversing a heterogeneous graph/network. Here a heterogeneous network is defined by the authors of [9] as:

Definition 3.5.1. ”Heterogeneous Network is defined as a graph $G = (V, E, T)$ in which each node $v$ and each link $e$ are associated with their mapping functions $\phi(v): V \to T_V$ and $\varphi(e): E \to T_E$, respectively. $T_V$ and $T_E$ denote the sets of object and relation types, where $|T_V| + |T_E| > 2$.”

The creators of MetaPath2Vec have designed meta-path based random walks to produce paths that capture the diverse structure of a network including different types of nodes. The formal definition of a meta path, defined by the authors of [32], is the following:

Definition 3.5.2. ”A meta path $\mathcal{P}$ is a path defined on the graph of network schema $T_G = (A, R)$, and is denoted in the form of $A_1 \xrightarrow{R_1} A_2 \xrightarrow{R_2} \cdots \xrightarrow{R_l} A_{l+1}$, which defines a composite relation $R = R_1 \circ R_2 \circ \cdots \circ R_l$ between types $A_1$ and $A_{l+1}$, where $\circ$ denotes the composition operator on relations.”

For example, in this use case a meta path could be denoted user-restaurant-user, denoting two users who have visited the same restaurant. The authors Dong et al., in their paper describing MetaPath2Vec [9], explain the meta-path random walk traversal protocol as follows:

”Given a heterogeneous network $G = (V, E, T)$ and a meta-path scheme $\mathcal{P}: V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l$, the transition probability at step $i$ is defined as follows:

$$p(v^{i+1} \mid v_t^i, \mathcal{P}) = \begin{cases} \frac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\ 0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \ne t+1 \\ 0 & (v^{i+1}, v_t^i) \notin E \end{cases} \tag{3.16}$$

where $v_t^i \in V_t$ and $N_{t+1}(v_t^i)$ denote the $V_{t+1}$ type of neighbourhood of node $v_t^i$.” [9]

By adhering to the meta-path traversal convention, one can uphold the semantic relationships in a heterogeneous graph, capturing the intrinsic contrasts and variations present in the graph [9].

To summarize the various steps of the MetaPath2Vec model: first the graph is traversed according to the meta-path convention. This generates sequences of nodes composed of a node and its neighbouring nodes, thereby including the diversity of different node types within the sequences. This collection of node sequences is then fed into the skip-gram model that, for a window of nodes and a given node $n$, tries to predict the remaining nodes within said window. By doing so, the skip-gram model produces embeddings for each node, with the main goal that nodes with similar node neighbourhoods should receive similar embeddings. It is important to note that even though there are different types of nodes in the heterogeneous graph, the same embedding space is shared by the various types of nodes. It is these embeddings that will later be used for clustering purposes.

Chapter 4 – Implementation

4.1 Data Pre-processing

Initially the data set includes historically collected order data, consisting of orders that customers have placed on their phones. For several reasons explicated in the following, only a subset of this data is chosen and used throughout. Over a prolonged period, the menus are updated and changed continuously. Due to the dynamic nature of the menu, the behaviour of customers can potentially vary remarkably during such a period. For the sake of capturing user behaviour, a smaller subset of the data is chosen, since using a smaller time scale decreases the variability of purchasing behaviour and captures behaviour that is less prone to have changed. Another important design choice in regards to capturing customer behaviour is to only include users that have ordered 20 or more times. The reason for this decision is again the intention of capturing behaviour: if a customer has only ordered using the service a few times, then this is not enough information to capture behaviour specific to the user. Another reason for using a smaller subset of the total data is the time constraints of training the models. Because the following models have to process, traverse and train using the data, having a smaller data set reduces the amount of time needed.
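The activity filter can be expressed as a simple pandas group-by; the following sketch uses a toy order log with assumed column names, not the actual Future Ordering schema:

```python
import pandas as pd

# Toy order log; the real column names in the order data are assumptions.
orders = pd.DataFrame({
    "user_id":  ["u1"] * 25 + ["u2"] * 3,
    "order_id": range(28),
})

# Keep only users that have placed 20 or more orders.
counts = orders.groupby("user_id")["order_id"].nunique()
active = counts[counts >= 20].index
orders = orders[orders["user_id"].isin(active)]
print(orders["user_id"].unique())  # ['u1']
```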

Furthermore, because menus are constantly changing and being updated, there exist more product ids than actual products. This is because a product can have an id for a specific time frame, have been part of a campaign, or because of architectural changes in the Future Ordering system, etc. It is therefore necessary to group the ids that point to the same product. By utilising string manipulation in combination with a set of rules, a smaller master collection of ids is produced. All the items in the orders, with their corresponding item ids, are then mapped to an id from this smaller master collection.


Due to the intrinsic nature of how graphs are represented by entities and the relations between them, this limits the data that can be included. Therefore no time attributes can be part of the graph structure. The following entities have been chosen to be taken into account for graph construction:

• Products

• Store

• Phone Operating System

• Payment Method

Three different approaches are utilized to produce clusters based on the behaviour of the customers. This will include what the user has ordered, what restaurant the user has visited, what phone operating system was used to place the order and what payment method was used. Since the data is not initially in the form of a graph with entities and relations, a graph structural representation is created for each model. This is a crucial step in the process because the structure of the graph greatly influences the results of the model. The aim is to include identical relations between entities in all of the models. The entities in the graphs are represented by users, product items, stores, phone operating systems and payment methods. The edges of the graph, the relations, are connections between these entities representing a purchase, visit or usage. Figure 4.1 includes an example of how the graphs are constructed utilizing the previously mentioned entities and the relations coupling them.

Figure 4.1: Example of graph structure


In figure 4.1 one can observe how the graph captures users visiting the same restaurant (Store 1) or purchasing the same item (Item 2).

4.2 Three Approaches

In this master thesis it is attempted to identify various user groups using clustering. The clustering is done by feeding different representations, in the form of feature vectors, into the k-means clustering algorithm. In the following, three different approaches are explained, each producing a certain representation of users in the form of vectors. These vectors are later used in the actual clustering procedure.

4.2.1 Approach: Standard Feature Vectors

In this approach basic feature vectors are created capturing the behaviour of the user. Initially four different feature vectors are produced, representing the users in four different categories. The first vector $P_i = [p_1, p_2, \ldots, p_k, \ldots, p_n]$ represents the purchase behaviour, where $n$ is the number of different products and $p_k$ is the number of times user $i$ has ordered product $k$. The second vector follows the same pattern as the first, with $R_i = [r_1, r_2, \ldots, r_k, \ldots, r_m]$, where $m$ is the number of available restaurants and $r_k$ is the number of times user $i$ has visited restaurant $k$. The third vector is defined as $O_i = [o_1, o_2, \ldots, o_k, \ldots, o_j]$, where $j$ is the number of operating systems available and $o_k$ is the number of times user $i$ has ordered with operating system $k$. The fourth and final vector is represented as $B_i = [b_1, b_2, \ldots, b_k, \ldots, b_l]$, where $l$ is the number of different payment methods available and $b_k$ is the number of times user $i$ has used payment method $k$. It is these feature vectors that are used for clustering.

The feature vectors are generated for all users. Five different instances of clustering are established. Four of the clustering instances are produced using each of the vectors $P_i$, $R_i$, $O_i$ and $B_i$ respectively for each user $i$. The fifth is created as a combination of all the feature vectors via concatenation.

The reason for choosing this specific feature vector representation was to use the same data used within the two graph based approaches. Since payment method, phone operating system, item and restaurant are entities in the graphs used in the other two approaches, the intention was to utilize these entities but in vector form. When using feature vectors to cluster users or customers there are other approaches that can be used, such as recency, frequency or monetary value, among other things. Although such a method can in itself give way to great insight, it would in this problem scenario not be comparable to the other two approaches presented in this master thesis.
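The count vectors and their concatenation can be built directly from the order log; the sketch below uses pandas crosstab on a toy log with assumed column names (only two of the four vectors are shown, the other two are analogous):

```python
import pandas as pd

# Toy order log; column names are illustrative.
orders = pd.DataFrame({
    "user":    ["u1", "u1", "u2", "u2"],
    "product": ["p1", "p2", "p1", "p1"],
    "store":   ["s1", "s1", "s2", "s2"],
})

# P_i: per-user product counts; R_i: per-user store counts.
P = pd.crosstab(orders["user"], orders["product"])
R = pd.crosstab(orders["user"], orders["store"])

# The fifth clustering instance concatenates all vectors per user.
combined = pd.concat([P, R], axis=1).to_numpy()
print(combined)  # one row of counts per user
```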


4.2.2 Approach: MetaPath2Vec Feature Vectors

The first step in this approach is to generate the graph that will be traversed. Using the StellarGraph Python package [7], a StellarGraph object is created representing the graph that is managed. The data in table 4.1 is iterated over for each user, and triplets are created that represent an entity (node) connected to another entity with an edge of a certain type.

Table 4.1: Form of data iterated over to create graph

Users     Basket    Store     Payment Method   Operating System
User_1    F_{1,1}   F_{1,2}   F_{1,3}          F_{1,4}
          F_{2,1}   F_{2,2}   F_{2,3}          F_{2,4}
          F_{3,1}   F_{3,2}   F_{3,3}          F_{3,4}
...       ...       ...       ...              ...
User_n    F_{1,1}   F_{1,2}   F_{1,3}          F_{1,4}
          F_{2,1}   F_{2,2}   F_{2,3}          F_{2,4}
          F_{3,1}   F_{3,2}   F_{3,3}          F_{3,4}

From table 4.1 the nodes are connected depending on what a user has ordered, what store the user has visited, and what payment method and phone operating system the user has used. The edge represents the action, and the entities are either a user, item, store, payment method or phone operating system. The StellarGraph object is created by passing the collection of triplets as well as all the entities in separate pandas DataFrames. Once the StellarGraph object is created, the following procedure is divided into two main parts. First the uniform random walk, which samples and creates sequences of node IDs following the composition of the defined meta paths, is performed. This is handled by the UniformRandomMetaPathWalk class from the StellarGraph package [7].

The parameters are the following:

Table 4.2: Meta-path based walk object parameters

UniformRandomMetaPathWalk
Parameter    Type
Graph        StellarGraph
N            Int
Length       Int
Metapaths    List of lists
Seed         Int


The parameters imply the following:

• Graph: This is the aforementioned graph that is traversed and sampled from.

• N: Denotes the number of random walks made per source node.

• Length: Denotes the maximum length of each random walk starting from the source node.

• Metapaths: The collection of permitted traversal schemas defined in a list. An example of a meta path in this use case is [User, Item, User], implying that a traversal captures a relation where two different users have purchased the same item.

• Seed: Random seed.

The UniformRandomMetaPathWalk class returns a list of lists of node IDs. The ensuing step is to generate the embedding for each of the nodes. This is done by feeding the output from the UniformRandomMetaPathWalk class into the Word2vec skip-gram model. The Word2vec skip-gram model is utilised via the Gensim Python library [30]. The parameters of the skip-gram model are the following:


Table 4.3: Word2vec skip-gram model parameters

Word2vec Model
Parameter   Type
Sentences   Iterable of iterables
Size        Int
Window      Int
Min-count   Int
Sg          Int
Workers     Int
Iter        Int

The parameters imply the following:

• Sentences: When using Word2vec for its original purpose of generating word embeddings, a list of sentences would be supplied via this parameter. In this case, instead of sentences, the sequences of node IDs are passed to the model.

• Size: Defines the dimensionality of the node embeddings produced by the model.

• Window: The maximum distance from the current word to the word to predict, within a given sentence (sequence of node IDs).

• Min-count: All nodes that do not reach this frequency are excluded from the model.

• Sg: If set to 1, the Word2vec model will train using skip-gram. If set to 0, it will use the CBOW model.

• Workers: The number of threads utilised to train the model.

• Iter: How many epochs to iterate and train.

The final pipeline for the MetaPath2Vec approach looks as follows:
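As an illustration of the pipeline just described, the following is a hedged end-to-end sketch using the StellarGraph 1.x and gensim 3.x APIs; all node IDs, meta paths and parameter values are stand-ins, not the thesis's actual configuration:

```python
import pandas as pd
from stellargraph import StellarGraph
from stellargraph.data import UniformRandomMetaPathWalk
from gensim.models import Word2Vec

# Toy typed nodes and edges; IDs and shapes are illustrative only.
users  = pd.DataFrame(index=["u1", "u2"])
items  = pd.DataFrame(index=["i1", "i2"])
stores = pd.DataFrame(index=["s1"])
edges = pd.DataFrame({"source": ["u1", "u1", "u2", "u2"],
                      "target": ["i1", "s1", "i2", "s1"]})

graph = StellarGraph(nodes={"user": users, "item": items, "store": stores},
                     edges=edges)

# Meta-path based random walks (table 4.2): N walks of bounded length
# per source node, restricted to the permitted traversal schemas.
walker = UniformRandomMetaPathWalk(graph)
walks = walker.run(nodes=list(graph.nodes()),
                   n=10,
                   length=50,
                   metapaths=[["user", "item", "user"],
                              ["user", "store", "user"]],
                   seed=42)

# Skip-gram over the node-ID sequences (table 4.3); gensim 3.x names.
model = Word2Vec(walks, size=128, window=5, min_count=0, sg=1,
                 workers=4, iter=5)

# Per-user embeddings, later fed to k-means for clustering.
user_embeddings = {u: model.wv[u] for u in ["u1", "u2"]}
```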
