
Master thesis, 30 credits

Master of Science in Industrial Engineering and Management, 300 credits

Spring term 2020

CLUSTER ANALYSIS OF MIXED DATA TYPES IN CREDIT RISK

A study of clustering algorithms to detect customer segments


Copyright © 2020 Cecilia Apitzsch and Josefin Ryeng. All rights reserved.

CLUSTER ANALYSIS OF MIXED DATA TYPES IN CREDIT RISK
A study of clustering algorithms to detect customer segments

Department of Mathematics and Mathematical Statistics
Umeå University

Umeå, Sweden

Supervisors:

Per Arnqvist, Umeå University
Wilhelm Back, Klarna

Examiner: Olow Sande


Abstract

This thesis was written in an underwriting department that works with evaluating the financial risk that consumers pose in credit payments. The risk has to do with whether the consumer will repay the credit as agreed or default, i.e. not be able to repay in time. Underwriting involves the process of conducting research and assessing each consumer's degree of risk before granting a credit payment. One way to do that is by grouping consumers into different risk groups, which can be used as a tool to minimize a company's exposure to credit risk. The aim of the thesis was to identify whether such risk groups could be found within the data provided. This was done through implementation of the unsupervised machine learning algorithms PAM and hierarchical clustering. It was desirable that credit applicants in the same group, i.e. cluster, shared similar features, in order to facilitate credit decisions.

Based on the available data, three different data sets were constructed, two dichotomous and one of mixed character. The dichotomous data sets were constructed using two different binning techniques: manual binning, where the "bins" or intervals were determined by hand, and equal frequency binning, where an equal number of observations were forced into each of a predetermined number of "bins" or intervals. To perform clustering, an appropriate distance measure is needed that captures the similarities and dissimilarities of different observations. We have examined four different distance measures: the simple matching coefficient, the Jaccard index, the Dice coefficient (suitable for dichotomous data) and Gower's dissimilarity (suitable for mixed data).

The implementation and evaluation of the different clustering algorithms, using the four different distance measures, implied that both PAM and hierarchical clustering produced identifiable groups with unique features within the data. These insights can be used as a value-adding tool in the underwriting process. The obtained clustering results can further be used for building different models that, for example, predict the probability of a consumer not being able to make the minimum amount of the agreed monthly repayment for three consecutive months, referred to as a default.

Key words: Cluster analysis, Customer segmentation, Credit risk, Dichotomous data, Mixed data, Distance measures.


Sammanfattning

"Cluster Analysis of Mixed Data Types in Credit Risk"

This master's thesis was written at an underwriting department that handles the risk assessment process of granting credit to a consumer. The risk concerns whether or not a consumer will be able to pay according to the agreed repayment plan. An underwriting department conducts research and assesses each consumer's risk level before a credit application is approved. One way to perform the credit assessments is to group consumers into different risk groups.

The aim of this master's thesis was to investigate whether identifiable risk groups existed within the given data. This was done by implementing the machine learning algorithms PAM and hierarchical clustering. It was desirable that credit customers within the same group shared homogeneous characteristics, in order to identify the risk level of the different groups and thereby facilitate credit decisions.

The methods used were chosen based on the structure of the data set, and the thesis covers the handling of both mixed and categorical data. Two categorical data sets were created by implementing two different binning techniques. The manually binned data set was created by assigning the variables "bins" or intervals by hand, and the equal frequency binned data set was created by forcing an equal number of observations into a predetermined number of "bins" or intervals. To perform clustering, suitable distance measures are needed that can calculate similarities and dissimilarities between observations. We have examined four different distance measures: the simple matching coefficient, the Jaccard index, the Dice coefficient (suitable for categorical data) and Gower's dissimilarity (suitable for mixed data).

The implementation and evaluation of the clustering algorithms with the four different distance measures indicated that both PAM and hierarchical clustering identified groups with unique characteristics within the available data. These insights can be used as a valuable tool in the underwriting process. The obtained clustering results can further be used to build models that, for example, predict the probability of a consumer not being able to pay the monthly minimum amount of the agreed credit.

Keywords: Cluster analysis, Customer segmentation, Credit risk, Dichotomous data, Mixed data, Distance measures.


Acknowledgements

We would like to thank our supervisor at Klarna, Wilhelm Back, for believing in us and giving us the opportunity to write our master's thesis at Klarna. We would also like to thank our supervisor at Umeå University, Per Arnqvist, for all the guidance within the area of cluster analysis and the good advice along the journey.

Our studies at Umeå University are soon coming to an end, and we would like to thank our families and friends for all their encouragement over the years.


Contents

1 Introduction
1.1 Consumer Credit
1.2 Background Description
1.2.1 Underwriting and credit risks
1.3 About Klarna Bank AB
1.4 Problem Description
1.5 Aim of the thesis
1.6 Delimitations

2 Theory
2.1 Machine learning in credit risk management
2.1.1 Machine learning
2.1.2 Unsupervised machine learning
2.1.3 Customer segmentation and cluster analysis
2.2 Data Preprocessing
2.2.1 Dimensionality reduction
2.2.2 Missing data
2.2.3 Types of data
2.2.4 One-hot encoding
2.2.5 Collinearity
2.2.6 Multicollinearity
2.2.7 Standardization and normalization
2.2.8 Skewness
2.2.9 Data binning
2.2.10 Sampling theory
2.3 Dissimilarity Measures
2.3.1 Distance measure for numerical data
2.3.2 Distance measures for dichotomous data
2.3.3 Distance measure for mixed data
2.4 Cluster Tendency
2.4.1 Hopkins statistic
2.5 Clustering Algorithms
2.5.1 Partitioning Around Medoid
2.5.2 Hierarchical
2.5.3 T-distributed stochastic neighbor embedding
2.6 Determining the Optimal Number of Clusters
2.6.1 Silhouette analysis
2.7 Cluster Validation
2.7.1 Dunn index
2.7.2 Davies-Bouldin index

3 Method
3.1 Data Description
3.2 Data Collection
3.3 Data Preprocessing
3.4 Model Implementation
3.4.1 Data set 1: Manually binned data
3.4.2 Data set 2: Equal frequency binned data
3.4.3 Data set 3: Unbinned data
3.4.4 Compare and assess the best data set
3.5 Software Used

4 Results
4.1 Results using PAM algorithm
4.1.1 Cluster tendency
4.1.2 PAM clustering: manually binned data
4.1.3 PAM clustering: equal frequency binned data
4.1.4 PAM clustering: unbinned data
4.1.5 Comparing statistics
4.1.6 PAM result: equal frequency binned data
4.2 Results using hierarchical algorithm
4.2.1 Cluster tendency
4.2.2 Hierarchical clustering: manually binned data
4.2.3 Hierarchical clustering: equal frequency binned data
4.2.4 Hierarchical clustering: unbinned data
4.2.5 Comparing statistics
4.2.6 Hierarchical result: equal frequency binned data

5 Analysis
5.1 PAM Clustering
5.1.1 Manually binned data
5.1.2 Equal frequency binned data
5.1.3 Unbinned data
5.1.4 Determine the optimal data set
5.1.5 Analysis of the result from PAM
5.1.6 Cluster personas
5.2 Hierarchical Clustering
5.2.1 Manually binned data
5.2.2 Equal frequency binned data
5.2.3 Unbinned data
5.2.4 Determine the optimal data set
5.2.5 Analysis of the result from hierarchical
5.2.6 Cluster personas

6 Discussion and Conclusion
6.1 Further Work

Appendices


Acronyms

DB   Davies-Bouldin index
DC   Dice Coefficient
GD   Gower's Dissimilarity
IIF  Institute of International Finance
JI   Jaccard Index
ML   Machine Learning
PAM  Partitioning Around Medoid
SMC  Simple Matching Coefficient


Glossary

Consumer: A private person using Klarna for financing a product

Customer: Includes both merchants and consumers, i.e. Klarna's customers

Dichotomous: Refers to something that is divided into two distinct parts

Inter-cluster: Refers to the distance between observations in different clusters

Intra-cluster: Refers to the distance between observations within a cluster

Merchant: A company that has Klarna as their payment solution provider

Outlier: A data point that differs significantly from other observations within a set of data

Personas: Fictional characters, created in order to represent different groups within a research study


List of Figures

1 Visualization of the PAM algorithm
2 Visualization of agglomerative and divisive clustering in a dendrogram
3 Visualization of the initial steps in the methodology process
4 Visualization of the age variable for three different data sets: unbinned data, manually binned data and equal frequency binned data
5 Average silhouette values using JI in manually binned data where the number of clusters, k, ranges between 2 and 10
6 Average silhouette values using DC in equal frequency binned data where the number of clusters, k, ranges between 2 and 10
7 Average silhouette values using GD in the unbinned data where the number of clusters, k, ranges between 2 and 10
8 t-SNE visualization of PAM clustering, using DC in equal frequency binned data, where k = 4 (number of clusters), with each cluster medoid displayed in a darker color
9 The distribution of clusters within the selected variables 'age variable 1' and 'age variable 2', using the PAM algorithm on equal frequency data
10 The distribution of clusters within the selected variables 'amount' and 'balance', using the PAM algorithm on equal frequency data
11 The distribution of clusters within the selected variables 'risk measure' and 'active transactions', using the PAM algorithm on equal frequency data
12 The distribution of clusters within the selected variables 'account state variable' and 'charge variable', using the PAM algorithm on equal frequency data
13 Dendrogram visualization of agglomerative clustering using DC in equal frequency binned data, where k = 5 (number of clusters), with each cluster in a different color
14 The distribution of clusters within the selected variables 'age variable 1' and 'age variable 2', using the hierarchical algorithm on equal frequency data
15 The distribution of clusters within the selected variables 'amount' and 'balance', using the hierarchical algorithm on equal frequency data
16 The distribution of clusters within the selected variables 'risk measure' and 'active transactions', using the hierarchical algorithm on equal frequency data
17 The distribution of clusters within the selected variables 'account state variable' and 'charge variable', using the hierarchical algorithm on equal frequency data
18 t-SNE visualization of clusters from the PAM algorithm with k = 2 (number of clusters) using DC, with each cluster medoid displayed in a darker color


List of Tables

1 Example table with dichotomous data
2 Table for constructing distances for observations A and B
3 Counts of response combinations for observations A and B to construct distances
4 Hopkins statistic using 1,000 randomly chosen observations
5 Average silhouette values using SMC, JI and DC in manually binned data where the number of clusters, k, ranges between 2 and 8
6 Cluster validation measures, using SMC, JI and DC in manually binned data
7 Average silhouette values using SMC, JI and DC in equal frequency binned data where the number of clusters, k, ranges between 2 and 8
8 Cluster validation measures, using SMC, JI and DC in equal frequency binned data
9 Validation measures from the PAM clustering algorithm using the three data sets with their respective distance measures
10 Selected variables for the medoids using DC in equal frequency binned data; business sensitive variables will be hidden
11 Hopkins statistic using 100 randomly chosen observations
12 Ac values for the different linkage methods in the agglomerative clustering using manually binned data
13 Validation measures for divisive clustering in manually binned data where the number of clusters, k, ranges between 2 and 7
14 Validation measures for agglomerative clustering in manually binned data where the number of clusters, k, ranges between 2 and 7
15 Ac values for the different linkage methods in the agglomerative clustering using equal frequency data
16 Validation measures for divisive clustering in equal frequency data where the number of clusters, k, ranges between 2 and 7
17 Validation measures for agglomerative clustering in equal frequency data where the number of clusters, k, ranges between 2 and 7
18 Ac values for the different linkage methods in the agglomerative clustering using unbinned data
19 Validation measures for divisive clustering in unbinned data where the number of clusters, k, ranges between 2 and 7
20 Validation measures for agglomerative clustering in unbinned data where the number of clusters, k, ranges between 2 and 7
21 Validation measures from the hierarchical clustering algorithm using the three data sets with their respective distance measures
22 Selected variables for the clusters using DC in equal frequency binned data; business sensitive variables will be hidden
23 Essential features from the medoids in equal frequency binned data


1 Introduction

In the following chapter, the subject of the thesis will be introduced. A brief background for the thesis will be given, including a short introduction to machine learning (ML), underwriting and credit risk. This will be followed by a general introduction of the company Klarna Bank AB. Finally, the purpose, objectives and delimitations will be explained to clarify the main focus of the thesis.

1.1 Consumer Credit

Lending and credit decisions involve a careful evaluation of a borrower's risk, which consists of assessing the capacity and motivation of the borrower to repay the loan and the lender's protection against losses if the borrower defaults. The idea of consumer credit is that the lender will only gain from it if the borrower does not default. That is why financial institutions invest a lot of time and money in evaluating a client's history, habits and likelihood of repaying the debt. To perform this kind of analysis, banks and financial institutions have historically relied on statistical models, for example different scoring models that rank consumers by how likely they are to pay their credit as agreed (Alto, 2019). However, in recent years there has been a significant increase in the adoption of ML in credit risk modeling among financial institutions (section 2.1.1). More companies are looking into the possibility of implementing ML models to improve credit assessment (IIF, 2019).

1.2 Background Description

The project owner, Klarna Bank AB, has a large amount of data regarding consumers and their purchase history when using Klarna as a payment method. Handling this massive amount of data is difficult since it can be hard to interpret and challenging to extract the essential information. In this context, clustering algorithms can be very useful. Klarna's goal with this project is to gain further insights into the purchase behavior of their consumers, in order to make more informed decisions about whether a consumer should be granted a credit request.

Klarna is growing at a rapid pace and is constantly entering new markets. A difficulty for Klarna is to understand the behavior of consumers in different markets. Some markets have complicated systems to obtain information, such as personal details, which leads to a challenging process in deciding whether a consumer will be approved for a credit request. By grouping consumers into categories based on their similarities, companies can obtain a better understanding of consumer behavior, optimized marketing strategies and increased profits (Chen & Li, 2009). Patterns revealed from a customer segmentation can provide key measurable data points for improved credit risk management. Representative groups make it possible to maximize performance in all customer segments, even in the seemingly risky segments (Panek, 2019).

Klarna offers three different payment solutions to their consumers: pay now, pay later and financing (Klarna, 2020). This thesis is written for the department responsible for the financing solution, i.e. consumer credit products.

The credit risk variables used in the thesis are of a business sensitive character, meaning that some relevant features for assigning the risk profile of a consumer group will not be displayed.

1.2.1 Underwriting and credit risks

Underwriting is the process whereby an individual or institution takes on financial risk for a fee. Most typically the risk involves loans, insurance or investments. Since this thesis deals with credit, the main focus will be on the risk arising from loans. In the case of a loan, the risk has to do with whether the borrower will repay the loan as agreed or default. Underwriting involves the process of conducting research and assessing each applicant's degree of risk before assuming that risk. If the risk is deemed too high, the underwriter may refuse coverage.

1.3 About Klarna Bank AB

Klarna Bank AB is a Swedish bank that provides financial services, for example, different payment solutions for online shopping. During the last 15 years, Klarna has been working towards the objective of making online shopping easier and smoother for customers. Today, Klarna has more than 85 million consumers and 205,000 merchants in 17 countries. In Sweden, where Klarna was founded, almost every other purchase online is made with Klarna. Their prime focus is to provide the simplest, safest and smoothest payment solution on the market (Klarna, 2020).

1.4 Problem Description

In Sweden the consumers are well mapped and there is a lot of information available for analyzing consumer behavior. The accessible amount of information varies among different markets, depending on their regulatory framework. This makes it even more essential to get insights and understand the consumer behavior in the current market. By gathering more information about Klarna's consumers, it is possible to make more informed decisions about whether a consumer should be allowed a credit or not.

The thesis will address the problem of clustering both mixed and dichotomous data sets, using three distance measures suitable for dichotomous data (simple matching coefficient, Jaccard index and Dice coefficient) and one for the mixed data set (Gower’s dissimilarity).

1.5 Aim of the thesis

The aim of the thesis is to make use of unsupervised ML techniques to segment consumers based on their similarities, in order to obtain deeper knowledge of the consumers' purchase behavior that can be used in the underwriting process. The following questions will be addressed:


• What methods should be used in the cluster analysis, and which trade-offs have to be made to achieve an easily explained result of a customer segmentation?

• What variables are needed to facilitate credit decisions, and is it possible to distinguish new variables that would add value to the model?

1.6 Delimitations

The delimitations depend mainly on the amount of time available for the project, since a lot of time will be devoted to data preprocessing and construction of the data sets, resulting in less time for the modeling process. To get interpretable and easily explained results we have reduced the initial data set through data reduction techniques in order to keep the most essential features and get the best possible result.

To avoid problems with computational complexity each of the models will be devel-oped using a random sample of 1,000 or 10,000 observations from the initial data set of 100,000 observations collected between the 1st of January and the 1st of March 2020.


2 Theory

In this section, the underlying theory is given. A brief introduction to ML and customer segmentation is followed by theory regarding data cleaning techniques. Underlying theory regarding clustering algorithms, distance measures and cluster validation will be presented. Unless otherwise stated, the theory is based on ”An Introduction to Statistical Learning” by Hastie et al., 2013.

2.1 Machine learning in credit risk management

The use of ML in credit risk management is an active area of research and development. The Institute of International Finance (IIF) recently reported that there has been a significant increase in the number of financial institutions adopting ML in credit risk modeling (IIF, 2019). Another recent study found that the number of organizations using artificial intelligence more than doubled between 2017 and 2018, and that 40% of financial services firms are applying it to risk assessment (Bajaj, 2019). The IIF report also pinpoints several benefits from implementing ML models, including improved model accuracy and discovery of new risk segments. However, the introduction of a new technology comes with new challenges. According to the IIF study, the main struggle is centered around supervisory understanding of, or consent to use, new processes. The "black box" nature of ML is a problem regarding traceability and complexity, especially among larger regulated entities, but also a concern for colleagues, auditors and supervisors who need more transparency of the process and a deterministic explanation of results (IIF, 2019). In this paper we follow an approach recently proposed by Valentina Alto in her article "Credit risk: unsupervised clients clustering". The idea is to apply an ML technique called unsupervised learning (section 2.1.2) to the problem of assessing whether a consumer is creditworthy or not. This is done by segmenting consumers into homogeneous clusters and seeing if it is possible to gain relevant information about their creditworthiness. Alto concludes that, by using clustering techniques, banks and financial institutions are able to produce and access relevant results in a few seconds, whereas analyzing every client's creditworthiness manually would require a lot of time (Alto, 2019).

2.1.1 Machine learning

ML refers to data analysis techniques that teach computers to learn from experience. ML includes several learning algorithms that use computational methods to extract information from data sets in an analytical way, similar to human analytical thinking. ML is defined as a set of automated methods that can detect patterns, reveal valuable information and make predictions in large data sets (Murphy, 2012). ML is divided into two main areas, supervised and unsupervised learning (MathWorks, 2020). Since this thesis focuses on unsupervised ML techniques, the theory regarding supervised learning is left out.


2.1.2 Unsupervised machine learning

Unsupervised ML uses data without preexisting labels to draw inferences and find patterns or valuable structures within the data. Cluster analysis is the most commonly used method in unsupervised ML; it aims to find homogeneous subsets within a data set by grouping observations based on their similarities. A challenge with unsupervised learning is that there is no information about the expected result. This makes it difficult to assess the obtained results, since there is no universally accepted method for validating them.

2.1.3 Customer segmentation and cluster analysis

Customer segmentation is the process of distributing a company's customers into different groups having similar characteristics. By segmenting customers, a company can target a group in order to maximize the value of each customer segment, as well as evaluate the segments separately, as they may behave differently (Optimove, 2020). To distinguish between different customer segments within data sets one can use statistical techniques broadly called clustering techniques. A central aspect of cluster analysis is how to define "similarities" or "dissimilarities" between observations. Based on these dissimilarities, it is possible to find different segmentation solutions. Choosing dissimilarity measures requires careful consideration and knowledge regarding what kind of data is being handled (section 2.3). Clustering techniques are useful in, for example, identifying customer characteristics and purchase behavior (Mirzazadeh & Hanafizadeh, 2010).

2.2 Data Preprocessing

Data preprocessing is a technique that transforms raw data into a machine readable format. Raw data is often insufficient and consists of different data types that cannot be sent through an ML model. That is why preprocessing is an essential step before implementing ML models. Data preprocessing involves activities such as handling missing values and duplicates in a suitable way, encoding categorical values and standardizing or normalizing the data set (Rajaratne, 2018).

2.2.1 Dimensionality reduction

A multidimensional data set can increase the complexity of a model. A solution would be to compile the data into a smaller set. Dimensionality reduction can be divided into two areas, variable selection where the most important variables are selected, and feature extraction where new features are created by combinations of the original variables. Even for small size problems, dimension reduction techniques have been advocated to facilitate the interpretation of clustering results.

2.2.2 Missing data

Missing data occurs when no data value is stored for a feature in an observation, often indicated by N/A. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can cause problems such as reduced statistical power, reduced representativeness of the sample, fewer applicable methods and a more complicated analysis of the study.

2.2.3 Types of data

The data domain can, on a high level, be divided into qualitative and quantitative data. Qualitative data, also called categorical data, can be divided into nominal and ordinal data. Most ML models cannot handle categorical variables. Therefore, they need to be converted or ”encoded” to numerical values. One-hot encoding (section 2.2.4) is one of the most widespread approaches to encode categorical data. Quantitative data, also called numerical data, can be divided into two groups, discrete and continuous data (Donges, 2018).

Nominal data Nominal data is used for naming or labelling features. In nominal data there does not exist any specific order of the data, for example, names of cities or gender. Nominal data can be represented by numbers, for example binary data (where a variable can only take two possible values) usually represented by 0 and 1, but these values do not represent any order of the data.

Ordinal data Ordinal data consists of ranked values, for example low, medium and high. Ordinal variables can also be numbers; however, the numbers are not mathematically measured but merely assigned as labels for the options.

Discrete data A discrete variable can take a countable number of values and can be obtained by counting. It could, for example, be the number of heads in a series of coin flips. Discrete values are integers.

Continuous data A continuous variable can take an infinite number of values and can be obtained, for example, by measuring heights. Continuous data cannot be counted since decimal numbers are allowed.

2.2.4 One-hot encoding

One-hot encoding is a method for converting categorical data to numerical data. The encoded variable is removed and a new binary variable (assigned 1 or 0) is added for each unique label, where 1 and 0 represent presence and absence, respectively. A disadvantage of one-hot encoding is that if there are many unique values within a categorical variable, the number of columns can expand considerably (Yadav, 2019).
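A minimal sketch of this in R, using base R's model.matrix on a made-up factor variable (the variable name and levels are purely illustrative):

# Hypothetical categorical variable with three levels
df <- data.frame(payment_type = factor(c("pay_now", "pay_later", "financing", "pay_now")))

# "- 1" drops the intercept, giving one 0/1 indicator column per level
one_hot <- model.matrix(~ payment_type - 1, data = df)
one_hot   # columns: payment_typefinancing, payment_typepay_later, payment_typepay_now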

2.2.5 Collinearity

Collinearity implies that two variables are nearly a perfect linear combination of one another, i.e. strongly dependent variables. To detect large pairwise collinearity, it is possible to look at the correlation matrix, whose element in row i and column j is the correlation between variables i and j.

Elements with a large absolute value in the correlation matrix indicate a highly correlated pair of variables, hence collinearity exists. There are two ways of eliminating collinearity: the first is to drop the problematic variables and the second is to construct a single feature from the collinear variables.

2.2.6 Multicollinearity

Multicollinearity is the case of having three or more collinear variables (section 2.2.5). Multicollinearity exists in a data set when the independent variables are linearly related to each other. Equation 1 is an example of perfect multicollinearity, where the independent variables induce a perfect linear dependence for the variable x_i,

c_1 x_1 + c_2 x_2 + ... + c_n x_n = x_i     (1)

2.2.7 Standardization and normalization

Variables having very different ranges or scales can often create problems since the results may be dominated by a few large values. To avoid such issues, it is possible to standardize the data by transforming the initial raw variables to have, for example, mean 0 and standard deviation 1, or by scaling the observations to fall into the range [0, 1].

2.2.8 Skewness

Skewness refers to the asymmetry of a distribution and can be defined as the degree of distortion from the symmetrical bell curve of a probability distribution. Accordingly, a normally distributed curve has zero skewness while, for example, a lognormal distribution shows a tendency of right-skewness (Chen, 2019).

2.2.9 Data binning

Data binning is a preprocessing technique where the original values are put into a number of given intervals, so called "bins". Binning numerical data can be summarized as distributing continuous data into different intervals, i.e. transforming them into bins of discrete variables (section 2.2.3). A mixed data set of numerical and categorical data would then be represented solely by categorical data. A challenge in working with continuous numerical data (section 2.2.3) is that the distribution of the variables often shows a tendency of skewness (section 2.2.8). This leads to some values being frequently represented while others are more rarely represented.

Data binning can be done through different methods, for example, equal frequency binning and manual binning.

Equal frequency binning The equal frequency, or quantiles, method is an easy and robust binning method. The observations are distributed evenly into the quantiles (often four or six), giving the bins the same weight, and the intervals are adjusted in order to obtain the same number of observations in each quantile.
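As an illustration, equal frequency binning of a numeric variable can be sketched in R with quantile break points; the age vector below is made up and only serves as an example:

set.seed(1)
age <- round(runif(1000, 18, 80))              # hypothetical numeric variable

# Quartile break points give four bins with (roughly) equal counts;
# unique() guards against duplicated break points in skewed variables
breaks <- quantile(age, probs = seq(0, 1, by = 0.25))
age_binned <- cut(age, breaks = unique(breaks), include.lowest = TRUE)

table(age_binned)                              # each bin holds about 250 observations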

Manual binning The manual binning method refers to manually putting data into predefined categories or bins, with the purpose of transforming numerical data into categorical (section 2.2.3). The numerical variables are analyzed and assigned bins suitable for their distribution. The value range within the data can differ greatly and it is important that the bins are constructed in a way that preserves the variation within the data (Ahlemeyer-Stubbe & Coleman, 2014).

2.2.10 Sampling theory

A sample is a statistical subset, constructed to represent the same variation as the original data set, but having a smaller dimension. A data set can be too large to run different functions, which can be solved through the use of a sample.

Simple random sample A simple random sample means that the observations have an equal probability of being chosen. In most cases this creates a balanced subset that represents the data in an unbiased manner. It is possible that a sampling error may occur if a sample does not entirely represent the whole data set (Hayes, 2019).

2.3 Dissimilarity Measures

The goal of clustering and segmentation (section 2.1.3) is to group observations based on how similar they are. It is therefore crucial to have a good understanding of what makes two observations ”similar”. There are plenty of different distance calculation methods available that suit different types of data. In distance-based clustering, dissimilarity measures are the core algorithm components, their efficiency directly influences the performance of the clustering algorithm. In addition to a careful dissimilarity measure selection, one must also consider whether or not the variables should be scaled to have standard deviation one (section 2.2.7) before the dissimilarity between the observations is computed (Chatterjee, 2019). The distance measures we will consider can be grouped based on what type of data they are applicable to:

1. Distance measures for numerical data

• Manhattan distance

2. Distance measures for dichotomous data

• Simple matching coefficient
• Jaccard index
• Dice coefficient

3. Distance measures for mixed data

• Gower's dissimilarity


2.3.1 Distance measure for numerical data

The dissimilarity between two observations of numerical data can be interpreted as the distance between two points in a high dimensional space. There are several methods for calculating such a distance but we will be using the so called Manhattan distance.

Manhattan distance In two-dimensional space, the Manhattan distance measures the distance between two points as the sum of the total horizontal and vertical distances between the points on a grid. The name is based on the grid-like street pattern on the island of Manhattan in New York. For example, the length of the shortest path a taxi could take between two intersections in Manhattan is equal to the Manhattan distance (also called taxicab geometry). If x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n) are two points in n-dimensional space, the distance d is the sum of the absolute differences in each dimension, given by:

d(x, y) = sum_{i=1}^{n} |x_i - y_i|     (2)
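As a quick example, the Manhattan distance between two points can be computed in R either with the built-in dist function or directly from Equation 2 (the vectors are made up):

x <- c(1, 4, 2)
y <- c(3, 1, 5)

dist(rbind(x, y), method = "manhattan")   # city block metric: |1-3| + |4-1| + |2-5| = 8
sum(abs(x - y))                           # the explicit sum from Equation 2, also 8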

2.3.2 Distance measures for dichotomous data

In order to explain how the simple matching coefficient, the Jaccard index and the Dice coefficient work, an example is provided. Consider the two observations A and B (where A ≠ B) in Table 1; each consists of n binary variables that can be either 0 or 1 (indicating absence or presence).

Table 1: Example table with dichotomous data

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
A    0  1  1  1  0  0  1  1  0  0
B    1  0  1  1  0  0  0  1  1  1
C    1  1  1  0  1  0  0  1  0  1
D    1  1  1  0  1  0  0  1  0  1
E    0  0  0  0  1  1  1  1  1  0

For each variable there are four possible combinations of the values in observation A and B. Counting the number of such combinations we get (see Table 2):


M11 represents the number of cases where both A and B have the value 1.
M01 represents the number of cases where observation A is 0 and observation B is 1.
M10 represents the number of cases where observation A is 1 and observation B is 0.
M00 represents the number of cases where both A and B have the value 0.

Since each case must fall into one of these four categories, we have that

M11 + M01 + M10 + M00 = n

Table 2: Table for constructing distances for observations A and B

             Obs. A
              1      0
Obs. B   1    M11    M01
         0    M10    M00

Counting the number of cases for all four combinations of observation A and B, for the example given in Table 1, we get the values in Table 3.

Table 3: Counts of response combinations for observations A and B to construct distances

             Obs. A
              1      0
Obs. B   1    3      3
         0    2      2

By using Table 3, it is possible to compute the dissimilarity between observation A and B.

Simple matching coefficient The simple matching coefficient (SMC) represents the simplest way of measuring similarity between observations of dichotomous data, without imposing any weights. SMC counts both mutual presence (M11) and mutual absence (M00) among the four combinations (proposed in Table 2). The SMC is preferable in situations where 0 and 1 hold equivalent information, for example a binary gender variable where 0 (indicating male) and 1 (indicating female) should have equal impact on the similarity.

The simple matching coefficient distance is given by,

proportion of similarity = (M11 + M00) / (M11 + M01 + M10 + M00)     (3)

proportion of dissimilarity = (M01 + M10) / (M11 + M01 + M10 + M00) = 1 - (M11 + M00) / (M11 + M01 + M10 + M00)     (4)

Jaccard index The Jaccard index (JI) is a distance measure that calculates the dissimilarity between dichotomous variables. Unlike SMC, JI only counts mutual presence (M11) and excludes mutual absence (M00). This approach is preferable when the variables are asymmetric binary, i.e. one state (0 or 1) is determined to be more informative than the other. When we have symmetric binary variables, i.e. 0 and 1 have equal importance, it is always possible to create asymmetric binary variables through, for example, one-hot encoding (section 2.2.4), with the drawback of adding computational complexity. The JI ranges between [0, 1], where 1 indicates strong similarity. The JI is given by,

proportion of similarity = M11 / (M11 + M01 + M10)     (5)

proportion of dissimilarity = (M01 + M10) / (M11 + M01 + M10) = 1 - M11 / (M11 + M01 + M10)     (6)

Dice coefficient The Dice coefficient (DC) is a distance measure that is very similar to JI, the only difference being that it doubles the count of mutual presence, M11. Like JI, the DC ranges between [0, 1], where 1 indicates strong similarity. The DC is given by,

proportion of similarity = 2 M11 / (2 M11 + M01 + M10)     (7)

proportion of dissimilarity = (M01 + M10) / (2 M11 + M01 + M10) = 1 - 2 M11 / (2 M11 + M01 + M10)     (8)
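The three dichotomous measures can be checked against observations A and B from Table 1. The helper below is a small sketch (not taken from the thesis) that counts M11, M01, M10 and M00 and returns the dissimilarities of Equations 4, 6 and 8:

A <- c(0, 1, 1, 1, 0, 0, 1, 1, 0, 0)
B <- c(1, 0, 1, 1, 0, 0, 0, 1, 1, 1)

binary_dissimilarity <- function(a, b) {
  m11 <- sum(a == 1 & b == 1)   # mutual presence
  m00 <- sum(a == 0 & b == 0)   # mutual absence
  m10 <- sum(a == 1 & b == 0)
  m01 <- sum(a == 0 & b == 1)
  c(SMC     = (m01 + m10) / (m11 + m01 + m10 + m00),
    Jaccard = (m01 + m10) / (m11 + m01 + m10),
    Dice    = (m01 + m10) / (2 * m11 + m01 + m10))
}

binary_dissimilarity(A, B)
# SMC = 5/10 = 0.500, Jaccard = 5/8 = 0.625, Dice = 5/11 (approximately 0.455)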

2.3.3 Distance measure for mixed data

Real data rarely consists of strictly numerical or categorical data; most often it contains a mix of different data types. Gower's dissimilarity is a method, or distance measure, that can handle a data set of various data types and allows missing values (section 2.2.2).


Gower's dissimilarity Clustering algorithms are based on distance measures to define whether objects are considered similar or not, so distances need to be defined between two objects in order to use such algorithms. A problem with defining distances can occur when a data set consists of mixed data, for instance numeric, binary, nominal and ordinal data (section 2.2.3). For example, how do you measure the similarity between a red car that weighs 1400 kg and a blue car that weighs 1200 kg? A solution is to use Gower's dissimilarity measure (GD), which can calculate the distance between two entities whose attributes have a mix of categorical and numerical values. The dissimilarity between two observations is the weighted mean of the contributions of each variable, given by,

d_ij = d(i, j) = sum_{k=1}^{p} w_ijk d_ijk / sum_{k=1}^{p} w_ijk

where w_ijk is the weight for variable k between observations i and j, and d_ijk is the distance between observations i and j on variable k. Each partial dissimilarity d_ijk ranges between [0, 1] and depends on the type of variable being calculated, so d_ijk does not apply the same formula to all variables. For numeric variables d_ijk is calculated using the Manhattan distance (section 2.3.1) and for binary/nominal variables d_ijk is calculated using DC (section 2.3.2). The weight w_ijk becomes zero when variable k is missing in one or both of the rows (i, j), or when the variable is asymmetric binary and both values are zero (Ahmad, 2018).
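In R, a Gower dissimilarity matrix for a mixed data frame can be obtained with daisy from the cluster package. The toy car example above might be coded roughly as follows; note that daisy's standard Gower implementation handles nominal variables by simple matching, so this is only an approximation of the GD/DC combination described in the text:

library(cluster)

cars <- data.frame(
  colour = factor(c("red", "blue", "red")),   # nominal variable
  weight = c(1400, 1200, 1250)                # numeric variable
)

d_gower <- daisy(cars, metric = "gower")      # pairwise dissimilarities on the [0, 1] scale
d_gower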

2.4 Cluster Tendency

Cluster tendency assessment determines whether a given data set contains meaningful clusters. This can be useful to examine prior to the clustering. There are several techniques to assess cluster tendency; here we will focus on Hopkins statistic.

2.4.1 Hopkins statistic

Hopkins statistic is a measurement for assessing the clustering tendency within a data set. It tests the spatial randomness of the data, in other words it tests whether the given data set is uniformly distributed. In equation 9, (p_1, ..., p_n) represents a random sample of n points from the data set D. x_i denotes the distance between each p_i in D and its nearest neighbor p_j. y_i denotes the distance between each q_i in randomD and its nearest neighbor q_j in D, where randomD represents a simulated data set with n points, (q_1, ..., q_n), drawn from a random uniform distribution with the same variation as the original data set D. Let D be a real data set; then Hopkins statistic, H, is given by,

H = 1 - ( sum_{i=1}^{n} y_i ) / ( sum_{i=1}^{n} x_i + sum_{i=1}^{n} y_i )     (9)

The null and alternative hypotheses are stated as,

H0: The data set D is uniformly distributed (i.e., no meaningful clusters)
H1: The data set D is not uniformly distributed (i.e., contains meaningful clusters)

If the value of Hopkins statistic is close to zero, this indicates that the sum of the nearest neighbor distances within the real data set D is negligible compared to the sum of nearest neighbor distances between the random data set and the real data set. Hence, we can reject the null hypothesis and conclude that the data set D is significantly clusterable. A value of H around 0.5 means that sum_{i=1}^{n} y_i and sum_{i=1}^{n} x_i are close to each other, indicating that the data D is uniformly distributed (Prasad, 2016).
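A minimal base-R sketch of Hopkins statistic, written directly from the description above (Euclidean distances on a numeric matrix are assumed for simplicity; the function name is made up):

hopkins_statistic <- function(D, n = 100) {
  D  <- as.matrix(D)
  lo <- apply(D, 2, min)
  hi <- apply(D, 2, max)
  idx <- sample(nrow(D), n)

  # x_i: distance from a sampled real point to its nearest neighbour in D
  x <- sapply(idx, function(i)
    min(sqrt(colSums((t(D[-i, , drop = FALSE]) - D[i, ])^2))))

  # y_i: distance from a uniformly simulated point to its nearest neighbour in D
  y <- sapply(seq_len(n), function(j) {
    q <- runif(ncol(D), min = lo, max = hi)
    min(sqrt(colSums((t(D) - q)^2)))
  })

  1 - sum(y) / (sum(x) + sum(y))   # Equation 9: values near 0 suggest clusterable data
}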

2.5 Clustering Algorithms

Clustering algorithms form the basis of how customers are segmented into different clusters. There exist many different clustering algorithms; in this thesis we focus on the PAM and hierarchical clustering algorithms.

2.5.1 Partitioning Around Medoid

Partitioning Around Medoid (PAM), also called K-medoids, is a clustering algorithm related to the very commonly used K-means algorithm. K-means clustering aims to minimize the total sum of squares, whilst PAM attempts to minimize the sum of dissimilarities between objects (section 2.3). A medoid is defined as the observation in each cluster whose sum of dissimilarities to all other observations in the cluster is minimal. This is an advantage since the medoids (i.e. actual observations) can be interpreted as representative observations for each cluster (Kassambara, 2018b). The PAM algorithm is visualized in Figure 1.


Algorithm 1 PAM clustering

1. Randomly select k of the n data points as the medoids.

2. Associate each data point o to the closest medoid m (where closest is defined by the chosen distance measure).

3. For each medoid m and each non-medoid o, swap m and o and compute the total cost of the configuration (that is, the average dissimilarity of o to all the data points associated to m).

4. Select the configuration with the lowest cost.

5. Repeat steps 2 to 4 until there is no change in the medoids.
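PAM is implemented in the R cluster package. A hedged sketch of running it on a precomputed dissimilarity object d (for example one of the binary or Gower dissimilarities from section 2.3) could look as follows:

library(cluster)

# d is assumed to be a dissimilarity ("dist") object computed earlier,
# e.g. d <- daisy(data, metric = "gower")
pam_fit <- pam(d, k = 4, diss = TRUE)

pam_fit$id.med              # indices of the representative observations (medoids)
pam_fit$clustering          # cluster assignment for every observation
pam_fit$silinfo$avg.width   # average silhouette width for this choice of k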

2.5.2 Hierarchical

Hierarchical clustering is a method which seeks to build a hierarchy of clusters. Two commonly used strategies for hierarchical clustering are agglomerative (bottom-up) and divisive (top-down), illustrated in Figure 2. In agglomerative clustering, each cluster starts as just one object, and pairs of clusters are recursively merged until there is only one cluster. In divisive clustering, all objects start in one cluster, and each cluster is split recursively until it only contains one object.

Figure 2: Visualization of agglomerative and divisive clustering in a dendrogram

The choice of dissimilarity measure is an important step in hierarchical clustering (section 2.3).

After selecting a suitable distance measure, it is necessary to determine from which points in the clusters the distances should be computed. In agglomerative clustering there are three main methods for combining clusters: the distance can be computed between the two most similar points of the clusters (single linkage), the two least similar points (complete linkage), or between the centers of the clusters (average linkage) (Chen et al., 2018). Each level of the hierarchy represents a particular segmentation of the data. It is up to the user to decide which level represents the data in a reasonable way.


Agglomerative clustering Let G and H represent two clusters. The dissimilarity d(G, H) between G and H is computed from the set of pairwise observation dissimilarities d_ii', where i belongs to G and i' belongs to H.

Single linkage (SL) takes the intergroup dissimilarity to be that of the closest (least dissimilar) pair,

d_SL(G, H) = min_{i in G, i' in H} d_ii'     (10)

Complete linkage (CL) takes the intergroup dissimilarity to be that of the furthest (most dissimilar) pair,

d_CL(G, H) = max_{i in G, i' in H} d_ii'     (11)

Average linkage (AL) takes the average dissimilarity between the groups,

d_AL(G, H) = (1 / (N_G N_H)) sum_{i in G} sum_{i' in H} d_ii'     (12)

where N_G and N_H are the respective number of observations in each group.

From the agglomerative clustering one can obtain the agglomerative coefficient, ac, which is a measure of the strength of the clustering structure. For each observation i, m(i) is the ratio of the dissimilarity at which i is first merged into a cluster to the dissimilarity of the final merger in the algorithm. The agglomerative coefficient is the average of 1 - m(i) across all observations. Values closer to 1 suggest a more balanced and strong clustering structure.

The agglomerative coefficient increases with the number of observations, i.e. the measure is not suitable for comparing data sets of significantly different sizes.

Algorithm 2 Agglomerative hierarchical clustering

1. Make each data point a single-point cluster, N clusters.

2. Take the two closest data points and merge them into one cluster, giving N-1 clusters.

3. Take the two closest clusters and merge them into one cluster, giving N-2 clusters.

4. Repeat step 3 until only one big cluster remains.

Divisive clustering Divisive clustering is the inverse of agglomerative clustering (section 2.5.2). Compared to agglomerative clustering, divisive clustering is a more complex, accurate and efficient method. Divisive clustering uses the global distribution of the data, while agglomerative clustering initially only uses the local patterns within the data.


Algorithm 3 Divisive hierarchical clustering

1. Start with one large cluster containing all N observations.

2. Select the cluster with the largest dissimilarity and split it into two, increasing the number of clusters by one.

3. Repeat step 2 until every observation represents a single-point cluster.

From the divisive clustering one can obtain the divisive coefficient, dc, which is a measure of the clustering structure. For each observation i, d(i) is the ratio between the diameter of the last cluster to which i belongs (i.e. before it becomes a single-point cluster) and the diameter of the complete data set. The divisive coefficient is the average of all 1 - d(i). A value of dc close to 1 indicates stronger group distinctions (Boehmke & Greenwell, 2020).
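Both strategies, and their coefficients, are available in the R cluster package. The sketch below assumes a precomputed dissimilarity object d and shows where the agglomerative (ac) and divisive (dc) coefficients come from:

library(cluster)

ag <- agnes(d, diss = TRUE, method = "average")   # agglomerative clustering, average linkage
ag$ac                                             # agglomerative coefficient, closer to 1 = stronger structure

dv <- diana(d, diss = TRUE)                       # divisive clustering on the same dissimilarities
dv$dc                                             # divisive coefficient

clusters <- cutree(as.hclust(ag), k = 4)          # cut the agglomerative tree into k = 4 clusters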

2.5.3 T-distributed stochastic neighbor embedding

T-distributed stochastic neighbor embedding (t-SNE) is a non-linear dimensionality reduction algorithm (section 2.2.1), which is suitable and efficient for embedding high-dimensional data into a lower-dimensional space for visualization. The t-SNE algorithm focuses on preserving local structure, and represents similar objects by nearby points and dissimilar objects by distant points. The resulting 2D or 3D points can be visualized in a scatter plot that reveals the underlying structure of the objects, such as the presence of clusters.

Given a set of N objects x_1, ..., x_N in a high-dimensional space, t-SNE first computes probabilities that are proportional to the similarity of objects x_i and x_j, as,

p_{j|i} = exp( -||x_i - x_j||^2 / (2 sigma_i^2) ) / sum_{k != i} exp( -||x_i - x_k||^2 / (2 sigma_i^2) )     (13)

where p_{j|i} is the conditional probability that point x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i. sigma_i is the width of the Gaussian kernel for each object, and is chosen so that the conditional probability p_{j|i} has a fixed perplexity, meaning that a fixed number of points fall in the mode of this Gaussian. As a result, the width is adapted to the density of the data: smaller values of sigma_i are used in the denser parts of the data space.

The final similarity score is symmetrized using

p_ij = ( p_{j|i} + p_{i|j} ) / (2N)     (14)

Probabilities where i = j are set to zero, p_ii = 0.

Each point in the high-dimensional space will now be represented by a point in the low-dimensional space. t-SNE aims to learn a d-dimensional map y_1, ..., y_N that reflects the similarities p_ij as well as possible. For the low-dimensional space a Cauchy distribution (t-distribution with one degree of freedom) is used to measure the similarities, q_ij, between two points, given by,

q_ij = (1 + ||y_i - y_j||^2)^(-1) / sum_{k != l} (1 + ||y_k - y_l||^2)^(-1)     (15)

Probabilities where i = j are set to zero, q_ii = 0. The t-distribution is chosen because it allows dissimilar points to be modeled as far apart in the map, hence dissimilar objects are more spread out in the new representation.

The locations of the points y_i in the map are determined by changing the positions of the points in the embedding to minimize the Kullback-Leibler divergence between the two distributions p_ij and q_ij,

KL(P || Q) = sum_{i != j} p_ij log( p_ij / q_ij )     (16)

The minimization of the Kullback-Leibler divergence with respect to the points y_i is performed using gradient descent, an iterative optimization algorithm for finding a local minimum of a differentiable function.

For large data sets, the Barnes-Hut algorithm can be used to reduce the computational complexity at the cost of some accuracy. The computational complexity is O(n log n) with Barnes-Hut compared to O(n^2) without it (Jaju, 2017).
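In R, Barnes-Hut t-SNE is available through the Rtsne package. The sketch below assumes a numeric matrix X of observations and a vector clusters of cluster labels (both hypothetical here) and plots the 2D embedding:

library(Rtsne)

set.seed(42)
# theta > 0 switches on the Barnes-Hut approximation; theta = 0 gives exact t-SNE
tsne_fit <- Rtsne(X, dims = 2, perplexity = 30, theta = 0.5, check_duplicates = FALSE)

plot(tsne_fit$Y, col = clusters, pch = 19,
     xlab = "t-SNE 1", ylab = "t-SNE 2")   # scatter plot of the embedding, coloured by cluster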

2.6 Determining the Optimal Number of Clusters

Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, because the number of clusters needs to be chosen before the clustering can be performed. There exist plenty of different methods to determine the number of clusters; we will be using silhouette analysis.

2.6.1 Silhouette analysis

A silhouette analysis is a multi-functional tool that can be used to determine the number of clusters, k, prior to implementation of clustering algorithms (which is required for some algorithms). Another application of silhouette analysis is to assess the level of cluster separation, by measuring how close each point in a cluster is to the points in its neighbor cluster.

The silhouette S(i) can be calculated as,

S(i) = ( b(i) - a(i) ) / max( a(i), b(i) )

where a(i) is the average dissimilarity between observation i and all other points of the cluster to which i belongs, and b(i) is the average dissimilarity between i and its closest neighbor cluster, defined as b(i) = min_C d(i, C), where d(i, C) represents the dissimilarity between i and C (over all clusters C to which i does not belong). When using silhouette analysis to assess the cluster separation, then for S(i) in [-1, 1]:

• S(i) large: well clustered
• S(i) small: badly clustered

• S(i) negative: assigned to wrong cluster

A rule of thumb is that a cluster average S over 0.5 is acceptable (Kassambara, 2018a), meaning that the average intra-cluster dissimilarity (a(i)) is at most half the size of the inter-cluster dissimilarity (b(i)).

To determine the optimal number of clusters k, the average silhouette method can be used. The optimal number of clusters k, is the one that maximizes the average silhouette score over a range of possible values for k.

An advantage in using the silhouette value for determining the number of clusters is that the silhouette can be calculated with any distance measure, and does not need access to the original data points.
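A sketch of the average silhouette method in R, looping PAM over a range of k values on a precomputed dissimilarity object d (assumed to exist, e.g. from daisy):

library(cluster)

ks <- 2:10
avg_sil <- sapply(ks, function(k) {
  fit <- pam(d, k = k, diss = TRUE)
  sil <- silhouette(fit$clustering, d)
  mean(sil[, "sil_width"])          # average silhouette width for this k
})

ks[which.max(avg_sil)]              # the k that maximizes the average silhouette width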

2.7 Cluster Validation

Cluster validation is done to avoid finding patterns in random data, and be able to conclude that the clusters found really represent true subgroups. Cluster validation is also a valuable tool in comparing different clustering algorithms. There are three different cluster validation techniques; internal, external and relative. An internal evaluation score means that the score is based on the cluster itself and not on any external information. External cluster validation is evaluated based on external knowledge such as class labels. Relative cluster validation is evaluated by varying different parameters for the same algorithm (changing the number of clusters) (Kassambara, 2018a).

2.7.1 Dunn index

The Dunn index is an internal validation method where the aim is to distinguish well separated clusters, i.e. clusters with a small within-cluster variance and a large between-cluster variance. The higher the Dunn index, the better the separation between clusters (Kassambara, 2018a).

Algorithm 4 Dunn index

1. For each cluster, compute the distance between each of the observations in the cluster and all observations in the other clusters.

2. The minimum of these pairwise distances (obtained in step 1) will represent the inter-cluster separation (min.separation).

3. For each cluster, compute the distance between the observations in the same cluster.

4. The maximum of these pairwise distances (obtained in step 3) will represent the intra-cluster separation (max.diameter).

5. Calculate the Dunn index (D) as follows:

D = min.separation / max.diameter     (17)

2.7.2 Davies-Bouldin index

The Davies-Bouldin (DB) index is an internal evaluation measure for validating the quality of the clustering. A DB index close to zero indicates a good partitioning. The DB index for k clusters is given by,

DB index = (1/k) sum_{i=1}^{k} max_{i != j} [ ( Delta(X_i) + Delta(X_j) ) / delta(X_i, X_j) ]     (18)

where delta(X_i, X_j) is the inter-cluster distance, i.e. the distance between clusters X_i and X_j, and Delta(X_k) is the intra-cluster distance of cluster X_k, i.e. the distance within cluster X_k.
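Both indices can be computed directly from a distance matrix and a vector of cluster labels. The base-R sketch below follows Algorithm 4 and Equation 18, taking the intra-cluster distance Delta as the average pairwise within-cluster distance and the inter-cluster distance delta as the average between-cluster distance, which is one common choice and not necessarily the exact variant used in the thesis:

dunn_db <- function(dmat, clusters) {
  dmat <- as.matrix(dmat)
  ids  <- sort(unique(clusters))

  # Intra-cluster summaries: diameter (max pairwise distance) and average pairwise distance
  diam  <- sapply(ids, function(cl) max(dmat[clusters == cl, clusters == cl]))
  intra <- sapply(ids, function(cl) {
    m <- dmat[clusters == cl, clusters == cl, drop = FALSE]
    if (nrow(m) > 1) mean(m[upper.tri(m)]) else 0
  })

  # Inter-cluster separations: minimum and average distance between every pair of clusters
  sep_min <- combn(ids, 2, function(p) min(dmat[clusters == p[1], clusters == p[2]]))
  sep_avg <- outer(seq_along(ids), seq_along(ids), Vectorize(function(i, j)
    if (i == j) NA else mean(dmat[clusters == ids[i], clusters == ids[j]])))

  dunn <- min(sep_min) / max(diam)                        # Equation 17

  db <- mean(sapply(seq_along(ids), function(i)           # Equation 18
    max((intra[i] + intra[-i]) / sep_avg[i, -i], na.rm = TRUE)))

  c(Dunn = dunn, DaviesBouldin = db)
}

# Example use with objects from the earlier sketches:
# dunn_db(as.matrix(d), pam_fit$clustering)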

3 Method

This chapter describes the method used, building on the previous theory chapter. To make the method easier to follow, the initial steps are shown in Figure 3.

Figure 3: Visualization of the initial steps in the methodology process

3.1 Data Description

The data used in the thesis contains both purchase history and personal information of the customers. Each row in the data set represents a "credit request" (i.e. a financing product) and contains information regarding, for example, what kind of product is being requested, the amount of the requested product and variables indicating earlier credit behavior. The variables come both from the internal database and from external credit bureaus.

3.2 Data Collection

The data is limited to 100,000 observations between 1st of January and 1st of March 2020. The observations are randomly distributed over the chosen interval.

Relevant features from different data tables were combined into one complete data set, containing 415 variables.

3.3 Data Preprocessing

The programming language R was used throughout the preprocessing phase (section 2.2). The process of cleaning the data was divided into three steps.


Step one The first step was to read and obtain the data. During this step the data set was read into R and missing values were identified. Some variables had large negative numerical values to indicate a missing value whilst others were just empty. Hence, all negative and empty values were assigned N/A. During this phase, data types were determined (section 2.2.3).
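A minimal R sketch of this step is given below; the file name is a placeholder and the exact sentinel rule for missing values is an assumption based on the description above.

# Read the raw data (file name is hypothetical).
raw <- read.csv("credit_requests.csv", stringsAsFactors = FALSE)

# Recode missing-value indicators: empty strings and negative sentinel
# values are both assigned NA.
raw[] <- lapply(raw, function(v) {
  if (is.character(v)) v[which(v == "")] <- NA
  if (is.numeric(v))   v[which(v < 0)]  <- NA
  v
})

# Determine data types: remaining character columns are treated as categorical.
raw[] <- lapply(raw, function(v) if (is.character(v)) factor(v) else v)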

Step two The second step involved the actual cleaning of the data. To get rid of non-informative variables that could reduce the statistical power and make the analysis more complex, all variables with more than 50% missing values were removed (section 2.2.2). The qualitative variables (section 2.2.3) with more than 30 unique categories, and those with fewer than 3 (i.e. variables that contain only one category and missing values), were removed. The data set was split into two parts, one containing the numerical variables and the other the categorical variables. A correlation matrix was constructed using the numerical data. From this correlation matrix, all pairs of variables with a correlation of more than 90% were extracted (section 2.2.5). For those highly correlated variable pairs, the variable with the most information, i.e. the one with the fewest N/A's, was kept. Since a variable can correlate with several other variables, it is possible for it to exist both in the subset of variables to keep and in the subset of variables to drop. To guard the data set against such variables, all variables in the ”drop” subset were removed.

The categorical variables that possibly would induce correlation were converted to numerical form and then used to construct a correlation matrix. Since this is a selection procedure and not a determination of correlation, this solution was deemed adequate. During the analysis of the data set that remained after the reduction by correlation, 25 variables were judged to be non-value-adding and hence removed. The data set was further reduced by taking the mean of numeric variables that were represented over different time intervals; for example, if a variable was available for the intervals 0-3 months, 0-6 months and 0-12 months, these were merged into one mean variable over the intervals.
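The sketch below outlines the reductions described in step two under the stated thresholds; `raw` is the data frame from step one and the correlation handling is simplified (the thesis additionally keeps, from each correlated pair, the variable with the fewest N/A's).

# Drop variables with more than 50 % missing values.
na_share <- colMeans(is.na(raw))
dat <- raw[, na_share <= 0.5, drop = FALSE]

# Drop categorical variables with more than 30, or fewer than 3, unique categories.
is_cat <- sapply(dat, is.factor)
n_lev  <- sapply(dat[is_cat], nlevels)
dat    <- dat[, !(names(dat) %in% names(n_lev)[n_lev > 30 | n_lev < 3]), drop = FALSE]

# Correlation filter on the numeric part: flag one variable from every pair
# with |correlation| > 0.9 and remove the whole "drop" subset.
num     <- dat[sapply(dat, is.numeric)]
cor_mat <- cor(num, use = "pairwise.complete.obs")
high    <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
drop    <- unique(colnames(cor_mat)[high[, "col"]])
dat     <- dat[, !(names(dat) %in% drop), drop = FALSE]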

Step three The last step was to prepare the data for analysis. In this phase, three different data sets were constructed by applying different methods. To illustrate how a variable can be distributed in the different data sets, the age variable is visualized for each data set in Figure 4.

• Data set 1: manually binned data The numerical variables were assigned manually determined bins (section 2.2.9). The bins were determined by looking at the structure of each variable; since the bins or intervals were set by hand, there is some arbitrariness in the choice. The binned numerical data set was then combined with the categorical variables and one-hot encoding was applied to the combined data set (section 2.2.4).

• Data set 2: equal frequency binned data The numerical variables were binned by applying the equal frequency method, where each variable was divided into four bins (section 2.2.9). Since the data has some skewness (section 2.2.8), it is not consistently possible to divide all variables into four quantiles. To handle this, variables that could only be divided into two or fewer quantiles were removed. The data set was then combined with the categorical data and one-hot encoded (section 2.2.4).

• Data set 3: unbinned data In this method, a random sample was extracted from the unbinned data set, without replacement (section 2.2.10). GD was applied as a distance measure for this data set (section 2.3.3).

Figure 4: Visualization of the age variable for three different data sets, unbinned data, manually binned data and equal frequency binned data

Since each one-hot vector sums to one, this directly induces perfect multicollinearity in our data set (section 2.2.6). To avoid multicollinearity, the first column of the encoded features was dropped.
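As a hedged sketch of the three constructions for a single variable (here age, with illustrative break points; `dat` is the cleaned data frame from the preprocessing steps): manual binning uses hand-picked breaks, equal frequency binning uses quantile-based breaks, and the binned and categorical variables are one-hot encoded with the first dummy dropped.

library(caret)

# Data set 1: manually determined bins (break points are illustrative only).
age_manual <- cut(dat$age, breaks = c(18, 26, 33, 44, Inf), include.lowest = TRUE)

# Data set 2: equal frequency bins via quartiles; unique() guards against ties,
# and variables collapsing to two or fewer bins were removed in the thesis.
age_eqfreq <- cut(dat$age,
                  breaks = unique(quantile(dat$age, probs = seq(0, 1, 0.25), na.rm = TRUE)),
                  include.lowest = TRUE)

# Data set 3 keeps the variable unbinned.

# One-hot encoding of the binned/categorical variables; fullRank = TRUE drops
# the first dummy of each variable, avoiding perfect multicollinearity.
enc <- dummyVars(~ ., data = data.frame(age = age_manual), fullRank = TRUE)
bin <- predict(enc, newdata = data.frame(age = age_manual))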

3.4 Model Implementation

The data preprocessing phase resulted in three different data sets, two of binary character and one with the original structure. Since clustering largely depends on the type of dissimilarity measure used, three different distance measures suitable for binary data, SMC, JI and DC, were applied (section 2.3) in order to assess which produced the best clustering result.
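The following sketch shows how the four dissimilarities can be obtained in R; `bin` is assumed to be the 0/1 encoded matrix and `dat` the unbinned mixed-type data frame. The SMC dissimilarity is the share of mismatching positions, the Jaccard dissimilarity is available through dist(method = "binary"), the Dice dissimilarity follows from the Jaccard one via d_Dice = d_J / (2 - d_J), and Gower's dissimilarity is computed with cluster::daisy.

library(cluster)

p        <- ncol(bin)
smc_dist <- dist(bin, method = "manhattan") / p   # SMC dissimilarity: proportion of mismatches

jac_dist  <- dist(bin, method = "binary")         # Jaccard dissimilarity
jm        <- as.matrix(jac_dist)
dice_dist <- as.dist(jm / (2 - jm))               # Dice dissimilarity derived from Jaccard

gower_dist <- daisy(dat, metric = "gower")        # Gower's dissimilarity for mixed data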

In this section, the clustering methodology for each of the three data sets, manually binned data, equal frequency binned data and unbinned data, will be explained.

3.4.1 Data set 1: Manually binned data

Steps 1-6 apply PAM clustering. A condensed R sketch of the full procedure (steps 1-11) is given after this list.

1. A random sample (section 2.2.10) of 10,000 observations from the original data set (100,000 observations) was extracted in order to obtain dissimilarity matrices of feasible dimensions.

2. The cluster tendency was determined by calculating Hopkins statistic using 1,000 observations (10%) from the sample data in step 1 (section 2.4).

3. Dissimilarity matrices were created (section 2.3):


- Simple Matching Coefficient (SMC)

- Jaccard index (JI)

- Dice Coefficient (DC)

Steps 4-6 apply to all three distance measures, SMC, JI and DC.

4. The optimal number of clusters, k, was determined through inspection of the average silhouette width, using silhouette analysis (section 2.6.1).

5. Clustering using the PAM algorithm was implemented (section 2.5.1), with the optimal number of clusters, k, obtained in step 4.

(a) The results from the PAM clustering were validated to decide which dissimilarity measure provided the best result (section 2.7).

6. The clustering results, using the best dissimilarity measure, were summarized and visualized in two dimensions.

(a) The medoids from the clustering were identified and summarized.

(b) The result from the clustering was visualized using t-SNE (section 2.5.3), displaying each cluster's medoid.

Steps 7-11 apply hierarchical clustering, using the best dissimilarity measure obtained in step 5a.

7. A random sample (section 2.2.10) of 1,000 observations was extracted from the original data set (100,000 observations) in order to keep the dimensions feasible for the hierarchical clustering.

8. The cluster tendency was determined by calculating Hopkins statistic using 100 observations (10%) from the sample data in step 7 (section 2.4).

9. The linkage method to be used in the agglomerative clustering was identified by choosing the method with the highest agglomerative coefficient (ac) (section 2.5.2).

10. Agglomerative and divisive clustering were performed (section 2.5.2).

(a) The clustering results were validated (section 2.7).

(b) The clustering algorithm that produced the best results along with the optimal number of clusters was chosen for further analysis.

11. The clustering algorithm obtained in step 10b was summarized and visualized.

(a) The clustering was evaluated and the results were summarized.

(b) The clustering result was visualized in a dendrogram displaying the chosen number of clusters k.
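The sketch below condenses steps 1-11 into hedged R code; sample sizes and k = 4 follow the text, while object names, the Hopkins implementation (factoextra::get_clust_tendency) and the plotting details are assumptions.

library(cluster)
library(fpc)
library(factoextra)
library(Rtsne)

set.seed(1)

# Steps 1-2: sample 10,000 observations and assess cluster tendency (Hopkins).
samp <- bin[sample(nrow(bin), 10000), ]
hop  <- get_clust_tendency(samp, n = 1000, graph = FALSE)$hopkins_stat

# Steps 3-5: dissimilarity matrix, PAM with the chosen k, internal validation.
d_samp <- dist(samp, method = "binary")           # plug in the chosen measure (SMC/JI/DC) here
fit    <- pam(d_samp, k = 4, diss = TRUE)
stats  <- cluster.stats(d_samp, fit$clustering)   # Dunn, avg.silwidth, wb.ratio, ...

# Step 6: medoid summary and t-SNE visualization (memory-heavy for 10,000 obs.).
medoids <- samp[fit$id.med, ]
tsne    <- Rtsne(as.matrix(d_samp), is_distance = TRUE)
plot(tsne$Y, col = fit$clustering, pch = 19, cex = 0.4)

# Steps 7-9: smaller sample and linkage choice via the agglomerative coefficient (ac).
small   <- bin[sample(nrow(bin), 1000), ]
d_small <- dist(small, method = "binary")
ac      <- sapply(c("average", "single", "complete", "ward"),
                  function(m) agnes(d_small, diss = TRUE, method = m)$ac)

# Steps 10-11: agglomerative and divisive clustering, cut and visualize.
agg    <- agnes(d_small, diss = TRUE, method = names(which.max(ac)))
div    <- diana(d_small, diss = TRUE)
groups <- cutree(as.hclust(agg), k = 4)
plot(as.hclust(agg), labels = FALSE); rect.hclust(as.hclust(agg), k = 4)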

3.4.2 Data set 2: Equal frequency binned data

Following the same procedure as for Data set 1 (section 3.4.1).

3.4.3 Data set 3: Unbinned data

Following the same procedure as for Data set 1 (section 3.4.1), except that the distance measure applied was GD.

3.4.4 Compare and assess the best data set

The results obtained from PAM and hierarchical clustering were analyzed in order to determine the best data set for each of the two algorithms. The different validation measures constituted the basis for assessing which data set produced the best clustering.

The number of clusters was desired to be more than two, since this would be more value-adding for the project owners. Hence, four clusters were chosen for the manually binned data and the equal frequency binned data, since their silhouette values for four clusters were similar to their optimal silhouette values (which occurred when the number of clusters was two). When the best data set for PAM and hierarchical clustering had been determined, the analysis of the identified clusters began, in order to inspect and potentially detect homogeneous features for the different segments.

3.5 Software Used

For this thesis the programming language R was chosen (version 3.6.2). The data has been collected using Datagrip (version 2019.2.6), which is a database IDE that supports a broad spectrum of databases (Jetbrains, n.d.). The tool has been developed by JetBrains to suit the specific needs of Structured Query Language (SQL) developers. Datagrip is operated using SQL, which is a standard language for accessing and managing databases (W3Schools, n.d.). The data is, if nothing else is stated, collected from AWS, a cloud data warehouse product that is integrated in the data lake of Klarna.


4 Results

In this section, the results achieved from the applied methods are presented. The results will be analyzed in section 5. The currency is business sensitive and will be left out.

4.1 Results using PAM algorithm

The results using PAM clustering algorithm will be presented for three different sample data sets, manually binned data, equal frequency binned data and unbinned data, each containing 10,000 observations.

4.1.1 Cluster tendency

The cluster tendency results, for the three different data sets, are presented in Table 4.

Table 4: Hopkins statistic using 1,000 randomly chosen observations

                      Manually binned   Equal frequency   Unbinned data
Hopkins statistic H   0.3565381         0.357435          0.1254268

4.1.2 PAM clustering: manually binned data

Choosing optimal number of clusters The silhouette analysis results, using the dissimilarity matrices, are presented in Table 5.

Table 5: Average silhouette values using SMC, JI and DC in manually binned data where the number of clusters, k, ranges between 2 and 8

Number of clusters:      2        3        4        5        6        7        8
SMC                   0.0305   0.0321   0.0237   0.0217   0.0219   0.0173   0.0200
JI                    0.0258   0.0232   0.0242   0.0226   0.0203   0.0196   0.0183
DC                    0.0389   0.0344   0.0356   0.0330   0.0296   0.0266   0.0263


Choosing optimal distance measure The cluster validation results, using the dissimilarity matrices, are presented in Table 6.

Table 6: Cluster validation measures, using SMC, JI and DC in manually binned data

                      SMC                    JI              DC
Number of clusters    3                      2               2
Cluster size          3,179; 3,121; 3,700    5,489; 4,511    5,489; 4,511
average.within        0.45                   0.83            0.73
average.between       0.47                   0.85            0.76
wb.ratio              0.96                   0.98            0.96
avg.silwidth          0.032                  0.026           0.039
Dunn index            0.223                  0.255           0.186
DB index              2.02                   1.97            1.97

Visualizing average silhouette values The silhouette plot, using JI, is visual-ized in Figure 5.

Figure 5: Average silhouette values using JI in manually binned data where the number of clusters, k, ranges between 2 and 10


4.1.3 PAM clustering: equal frequency binned data

Choosing optimal number of clusters The silhouette analysis results, using the dissimilarity matrices, are presented in Table 7.

Table 7: Average silhouette values using SMC, JI and DC in equal frequency binned data where the number of clusters, k, ranges between 2 and 8

Number of clusters:      2        3        4        5        6        7        8
SMC                   0.1002   0.1061   0.0275   0.0291   0.0253   0.0120   0.0132
JI                    0.0780   0.0642   0.0631   0.0315   0.0232   0.0226   0.0229
DC                    0.1216   0.0948   0.0927   0.0455   0.0435   0.0324   0.0328

Choosing optimal distance measure The cluster validation results, using the dissimilarity matrices, are presented in Table 8.

Table 8: Cluster validation measures, using SMC, JI and DC in equal frequency binned data

                      SMC                    JI              DC
Number of clusters    3                      2               2
Cluster size          5,472; 2,723; 1,805    4,670; 5,330    4,594; 5,406
average.within        0.48                   0.84            0.74
average.between       0.55                   0.91            0.84
wb.ratio              0.87                   0.92            0.88
avg.silwidth          0.106                  0.078           0.122
Dunn index            0.173                  0.458           0.289
DB index              1.94                   1.68            1.63


Visualizing average silhouette values The silhouette plot, using DC, is visu-alized in Figure 6.

Figure 6: Average silhouette values using DC in equal frequency binned data where the number of clusters, k, ranges between 2 and 10

4.1.4 PAM clustering: unbinned data

Choosing optimal number of clusters The silhouette analysis results, using GD, in unbinned data is visualized in Figure 7.

Figure 7: Average silhouette values using GD in the unbinned data where the number of clusters, k, ranges between 2 and 10


4.1.5 Comparing statistics

Choosing the best data set The validation measures for the different data sets, using PAM clustering algorithm, are presented in Table 9. The one that provides the best overall validation measures will be considered for the analysis.

Table 9: Validation measures from PAM clustering algorithm using the three data sets with their respective distance measure

                     Manually binned (JI)   Equal frequency (DC)   Unbinned data (GD)
Number of clusters   4                      4                      3
n                    10,000                 10,000                 10,000
average.within       0.82                   0.71                   0.14
average.between      0.85                   0.83                   0.17
wb.ratio             0.96                   0.86                   0.83
Dunn index           1.02                   1.05                   0.95
avg.silwidth         0.02                   0.09                   0.10
Cluster 1 size       2,749                  3,653                  4,920
Cluster 2 size       3,910                  1,775                  1,205
Cluster 3 size       1,662                  1,990                  3,875
Cluster 4 size       1,679                  2,582                  -


4.1.6 PAM result: equal frequency binned data

Cluster results A summary of selected variables for each medoid is presented in Table 10. The result using two clusters can be found in Appendix A.

Table 10: Selected variables for the medoids using DC in equal frequency binned data, business sensitive variables will be hidden

Variable                    Medoid cluster 1    Medoid cluster 2          Medoid cluster 3    Medoid cluster 4
Number of obs. in cluster   3,653               1,775                     1,990               2,582
age                         18-26               34-44                     27-33               34-44
age variable 2              0-341               1,031-Inf                 789-1,030           789-1,030
time of day                 11:00-15:00         16:00-19:00               00:00-10:00         00:00-10:00
sf score                    0-8,390             9,360-9,690               9,690-9,980         9,360-9,690
day of week                 Wed-Thu             Mon-Tue                   Wed-Thu             Wed-Thu
sum fees account mean       N/A                 32.7-2,610                32.7-2,610          0-1.66
amount                      0.1-40              41-78                     79-179              0.1-40
balance                     0.12-158            1,200-Inf                 1,200-Inf           0.12-158
internal pd score           0.087-0.709         0.0155-0.0405             0.0155-0.0405       0.00123-0.0155
de external score           0.138-0.866         0.0128-0.0437             0.0000893-0.0128    0.0000893-0.0128
active transactions         0-2                 10-Inf                    10-Inf              0-2
days since last payment     N/A                 19-46                     0-7                 47-Inf
estore group                Clothing & Shoes    Leisure, Sport & Hobby    Clothing & Shoes    Clothing & Shoes
product name                Product 1           Product 1                 Product 2           Product 2
decision                    REJECT              REJECT                    ACCEPT              ACCEPT

Visualization of PAM in a 2-dimensional space The 2-dimensional visualization of the clusters, using DC in equal frequency binned data, is presented in Figure 8. The corresponding figure for two clusters can be found in Appendix A.

Figure 8: t-SNE visualization of PAM clustering, using DC in equal frequency binned data, where k = 4 (number of clusters), with each cluster's medoid displayed in a darker color


Distribution of clusters within variables using PAM algorithm The distribution of clusters, within a subset of selected variables, is presented in Figures 9, 10, 11 and 12.

(a) Age variable 1 (b) Age variable 2

Figure 9: The distribution of clusters, within the selected variables ’age variable 1’ and ’age variable 2’, using PAM algorithm in equal frequency data

(a) Amount (b) Balance

Figure 10: The distribution of clusters, within the selected variables ’amount’ and ’balance’, using PAM algorithm in equal frequency data

References
