
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Predicting Attrition in Financial Data with Machine Learning Algorithms

JOHAN DARNALD

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Predicting Attrition in Financial Data with Machine Learning Algorithms

JOHAN DARNALD

Master in Computer Science
Date: April 9, 2018
Supervisor: Hedvig Kjellström
Examiner: Joakim Gustafson
Swedish title: Förutsäga kundförluster i finansdata med maskininlärningstekniker
School of Computer Science and Communication

Abstract

For most businesses there are costs involved when acquiring new customers, and longer relationships with customers are therefore often more profitable. Predicting whether an individual is prone to leave the business is then a useful tool to help any company take actions to mitigate this cost. The event when a person ends their relationship with a business is called attrition or churn. Predicting people's actions is however hard, and many different factors can affect their choices. This paper investigates different machine learning methods for predicting attrition in the customer base of a bank. Four different methods are chosen based on the results they have shown in previous research, and these are then tested and compared to find which works best for predicting these events. Four different datasets from two different products and with two different applications are created from real-world data from a European bank. All methods are trained and tested on each dataset. The results of the tests are then evaluated and compared to find what works best. The methods found in previous research to most reliably achieve good results in predicting churn in banking customers are the Support Vector Machine, Neural Network, Balanced Random Forest, and the Weighted Random Forest. The results show that the Balanced Random Forest achieves the best results, with an average AUC of 0.698 and an average F-score of 0.376. The accuracy and precision of the model are concluded to not be enough to make definite decisions, but can be used together with other factors, such as profitability estimations, to improve the effectiveness of any actions taken to prevent the negative effects of churn.


Sammanfattning

För de flesta företag finns det en kostnad involverad i att skaffa nya kunder. Längre relationer med kunder är därför ofta mer lönsamma. Att kunna förutsäga om en kund är nära att lämna företaget är därför ett användbart verktyg för att kunna utföra åtgärder för att minska denna kostnad. Händelsen när en kund avslutar sin relation med ett företag kallas härefter kundförlust. Att förutsäga människors handlingar är däremot svårt och många olika faktorer kan påverka deras val. Denna avhandling undersöker olika maskininlärningsmetoder för att förutsäga kundförluster hos en bank. Fyra metoder väljs baserat på tidigare forskning och dessa testas och jämförs sedan för att hitta vilken som fungerar bäst för att förutsäga dessa händelser. Fyra dataset från två olika produkter och med två olika användningsområden skapas från verklig data ifrån en europeisk bank. Alla metoder tränas och testas på varje dataset. Resultaten från dessa test utvärderas och jämförs sedan för att få reda på vilken metod som fungerar bäst. Metoderna som enligt tidigare forskning ger de mest pålitliga och bästa resultaten för att förutsäga kundförluster hos banker är stödvektormaskin, neurala nätverk, balanserad slumpmässig skog och vägd slumpmässig skog. Resultatet av testerna visar att en balanserad slumpmässig skog får bäst resultat med en genomsnittlig AUC på 0.698 och ett F-värde på 0.376. Träffsäkerheten och det positiva prediktiva värdet på metoden är inte tillräckligt för att ta definitiva handlingar med, men kan användas med andra faktorer så som lönsamhetsuträkningar för att förbättra effektiviteten av handlingar som tas för att minska de negativa effekterna av kundförluster.


Contents

1 Introduction
  1.1 Problem background
    1.1.1 Customer relationship management
    1.1.2 Customer attrition
  1.2 Project introduction
  1.3 Project aim
  1.4 Research question
  1.5 Project scope
  1.6 Ethical, societal and sustainability aspects
  1.7 Report outline

2 Related work
  2.1 Data balancing
  2.2 Data preprocessing
  2.3 Incorporating balancing and preprocessing
  2.4 MTPD and STPD
  2.5 Time windows
  2.6 Classification methods
    2.6.1 Support vector machines (SVM)
    2.6.2 Random Forests (RF)
    2.6.3 Neural Networks (NN)

3 Theory
  3.1 Categorical data
  3.2 Data balancing using SMOTE
  3.3 Support Vector Machine
  3.4 Neural network
  3.5 Random Forests
    3.5.1 Weighted Random Forests
    3.5.2 Balanced Random Forests
  3.6 Evaluation criteria
    3.6.1 Accuracy
    3.6.2 Precision
    3.6.3 F-Score
    3.6.4 Receiver operating characteristic curve
    3.6.5 Precision-recall curve

4 Data Set
  4.1 Data Noise
  4.2 Response variable
  4.3 Time Frames

5 Training and Evaluation Framework
  5.1 Data preparation
    5.1.1 Categorical features
    5.1.2 Preprocessing
    5.1.3 Splitting training and test data
    5.1.4 Balancing
  5.2 Model training
    5.2.1 Overall framework
    5.2.2 Support Vector Machine training
    5.2.3 Neural network training
    5.2.4 Random Forests training
    5.2.5 Evaluation

6 Experiments
  6.1 Balanced Random Forest
  6.2 Weighted Random Forest
  6.3 Support Vector Machine
  6.4 Neural network

7 Discussion
  7.1 Comparison of models
    7.1.1 Scores
    7.1.2 Parameters
    7.1.3 Time
    7.1.4 Previous research
  7.2 Datasets
  7.3 Conclusion
  7.4 Future work

Bibliography


Chapter 1 Introduction

1.1 Problem background

1.1.1 Customer relationship management

Customer relationship management (CRM) is an essential part of many companies' business strategies today. CRM is used as a tool, sometimes a philosophy, to help a company understand its customer base and develop methods for gaining new customers, retaining existing ones and improving the customer relationships to increase profitability. In CRM the specific actions a company can undertake to work towards these goals are based on data analysis of what information is available on customers and the business in general. According to Payne and Frow (2005) [16], “It [CRM] is often used to describe technology-based customer solutions” (p. 167). Customer retention is an important area within CRM, that is, keeping customers from ending the relationship with the business and thereby creating longer and more profitable relationships with the customers [16]. It has been shown that it can be up to 5 times more expensive to acquire a new customer compared to retaining an existing one [12].

1.1.2 Customer attrition

Customer attrition, also called churn, is the event when a customer decides to leave the company earlier than expected. Depending on the business, this can be a canceled month-to-month subscription, ending relations with a specific retail store, or ending a contract before the time stipulated by some agreement. There are also different reasons why a customer might churn: for example, a customer might relocate and the service is not available in their new location, they might switch to a competitor, or they might simply not need the service anymore.

1.2 Project introduction

Within the banking market today, customers have different choices of where to take their business. There are companies, so-called loan brokers, that specialize in helping customers find loans with the best possible terms. If a person chooses to find a bank through a loan broker, the broker requires a fee from the bank that accepts the loan. Thus, if customers go through loan brokers the banks will have an additional cost when acquiring new customers.

From the customer’s point of view, this does not have to be a new loan. If they already have a loan at a bank, they can transfer their loan to another bank. Customers might switch banks if they want to take a bigger loan than their current bank is willing to accept or if they find a bank that has better terms.

Being able to predict if and when a customer is prone to leave a company for any reason can, therefore, be a powerful tool to help businesses know when and how to deploy strategies to encourage the customer to stay. If retention is not possible, it can still be used as a planning tool to help mitigate any negative effects from such an event.

Considering the cost of acquiring new customers, businesses can also be more selective in which customers they choose to acquire based on how long that customer is likely to stay, since a longer relationship with the customer is in most cases more profitable than a short one. Having a customer for a short time can even translate to a loss in profits in some cases.

Some of the reasons for churn might not be predictable from a company viewpoint. Relocation because of work might be one such unpredictable reason if the data available to the company does not have any indication of this event.

1.3 Project aim

Given the importance of customer retention, being able to accurately predict whether a customer intends to end their relationship with the company is of great importance. This is what this project aims to achieve. By comparing four different methods that have been proven to achieve good results in the area of attrition prediction, the hope is to find the model that can most accurately predict attrition in customers for the given scenario. The data used in this project comes from the customer base of a bank operating in Europe. The aim is to, given the available data on a customer, classify that customer as either a "churner" or a "non-churner" given a time frame into the future during which the customer might leave or stay. The focus of the project is on finding suitable models, creating the datasets, and evaluating and comparing the performance of the selected methods on the created datasets.

1.4 Research question

This thesis sets out to answer two related questions: 'What are the four currently most reliable churn prediction machine learning methods?' and 'How accurate are the most promising prediction methods for predicting churn in a given population of loan and credit holders from a bank?'.

The first question is answered by a literature study where related work and current research are surveyed and compared to find models that work well on the problem at hand. The second question is answered by implementing, testing and comparing the selected models on different datasets. The datasets will be created and defined as part of the project. All data will come from a bank's databases and customer records.

1.5 Project scope

The project is limited to a comparative study of the four most promising machine learning algorithms for predicting attrition in a population of credit and loan holders. The tests on each model are conducted on four different datasets, all created during the project from a larger collection of bank-customer events. To create the datasets, definitions of churn that provide useful information to the bank and the predictive models are created. An implementation of each of the chosen frameworks is created and used for training and testing.

Investigating the best actions to take after a prediction has been made is outside the scope of this project.

1.6 Ethical, societal and sustainability aspects

When creating models that will make predictions of people's behavior, it is important that the creator ensures that the models do not discriminate based on other factors than the ones relevant to the task. Different factors might give the models such a behavior. For instance, the selection of biased features that do not represent the individual but instead generalize too much is one such factor. An example of such a feature, which is excluded from this project for this reason, is gender. Keeping features like these might improve performance short term at the cost of reinforcing social biases. As an example of this, Zhao et al. [22] found that multilabel object classification and visual semantic role labeling datasets contained significant gender biases and that the models created from this data increased the bias even more.

It is also important that the privacy of the customers is taken seriously. To take this into account during this project, all data that could connect an individual with an observation is removed from the databases before the project starts. Even with this sanitation of the data, only data that was given by the customers and that the customers have accepted to be used for such purposes is used in the project. No input data or any sensitive output data is published with this project due to legal and privacy concerns. This further protects the customers' anonymity and privacy. While the privacy of customers is the most important aspect to keep in mind, the results of this thesis need to be reproducible. To facilitate this, all experiments are detailed and kept general enough to be repeatable, given the dataset, and specific enough to achieve interesting results while not disclosing too much information.

It is important to consider both the negative and positive impact of any research done. While something might be positive from a business standpoint, it might affect society in a negative way. No such negative effects have been identified as a result of this project. Being able to better understand their customers' churn behavior helps the company retain the customers, and since the customer always has the option to leave, this retention can only be done through better service or terms that incentivize the customer to stay.

There is a concern that research like this, which aims to imitate or improve human decisions, could potentially replace human labor and, in turn, increase unemployment rates. However, better decision making and a better understanding of the world with the help of machine learning and research will help mitigate some of the negative effects of replacing human labor with automatic processes. This is a complex issue and currently we cannot know exactly what impact increased automation will have.

1.7 Report outline

Chapter 2 introduces previous work and research related to this project and the subject of churn. Chapter 3 gives some background to the theory needed for this project. Chapter 4 presents the datasets used and how they were created. Chapter 5 introduces the framework used during the training and testing of the models. Chapter 6 describes how each experiment is conducted and presents the results. In Chapter 7, conclusions are drawn from the results of the experiments in Chapter 6 and recommendations for future work are given.

Chapter 2

Related work

Much work has been done in the area of churn prediction. Many different methods have been tested and compared in different areas. The most common area for research in churn prediction is the telecommunications industry [20]. Banking is another area of interest for research. The variation in the results of previous research implies that different data sets show different results, and there does not seem to be a consensus on which framework performs best for churn prediction. The methods used in this project are the ones indicated by previous research to reliably perform best. The models chosen are Support Vector Machines, Neural Networks, and Random Forests. More motivation for these choices is given in the later sections of this chapter.

No study has been found during the literature study that compares these methods with the proposed preprocessing steps. The dataset used for this project is also one that has never been used before and was created from real-world data specifically for this task. This adds information that either strengthens or shows variation in the results of previous work.

2.1 Data balancing

Within churn prediction, datasets are often imbalanced due to churn being a relatively rare event in most industries. Predicting churn is in this project a two-class classification problem where customers are labeled as either a churner or a non-churner. Imbalance of the data, in this case, means that there are significantly more non-churners than churners in the datasets. Having few cases of one class makes it hard for statistical learning methods to find useful patterns in the data [1]. Some techniques for counteracting the negative effect of such imbalance in the datasets have been used and compared in previous research. Gür Ali & Arıtürk [1] use the Synthetic Minority Over-sampling Technique (SMOTE) with five neighbors in their research, over-sampling the minority class while under-sampling the majority class to achieve a balance of one to one between churners and non-churners. This was shown to improve results. Chen et al. [7] compare different techniques for prediction on unbalanced data and find that SMOTEBoost seems to increase the accuracy of the models used more than regular SMOTE on the datasets used. Overall, research seems to imply that SMOTE improves accuracy more than other, less sophisticated techniques for preprocessing the data before training.

2.2 Data preprocessing

Coussement et al. [8] compare different methods for data preparation on churn datasets for use in logistic regression models. They find that remapping categorical variables with decision trees, binning continuous variables, and then converting both cases into a weight-of-evidence score works best for their use case. They also found that data preparation improved the area under the curve (AUC) results by 14.5%, and by up to 35% using top decile lift.

2.3 Incorporating balancing and preprocessing

Some machine learning models try to incorporate the balancing or preprocessing of the data in the training and prediction phase to remove the need for balancing or preprocessing the training data beforehand. Two techniques that have been shown to improve results over the standard method combined with preprocessing are Balanced Random Forests and Weighted Random Forests. Chen et al. [7] introduce Weighted Random Forests and Balanced Random Forests and compare them to conventional techniques for prediction on imbalanced data. They find that both Random Forest techniques perform better than the others. Y. Xie et al. [21] try to improve on this and use a novel technique that they call Improved Balanced Random Forests (IBRF) for predicting churn in customers of a Chinese bank. They find that IBRF has a higher accuracy than other standard churn prediction techniques. They compare this novel technique with both Weighted Random Forests and Balanced Random Forests and find that it outperforms both, even if it shows only slightly better results than Balanced Random Forests. IBRF achieved an accuracy rate of 93.2%, while the compared non-Random Forest methods, DT, ANN and CWC-SVM, achieved 62.0%, 78.1%, and 87.2% respectively. What this article shows is that testing some variant of Random Forests in this project is warranted, given that all three methods achieved relatively high accuracy. No other study was found to use IBRF.

Spanoudes & Nguyen [18] test the possibility of using deep neural networks to skip the feature engineering step commonly used to improve the predictive performance of churn prediction models. The hypothesis is that the deep nature of the network should generate its own abstractions of features within the hidden layers. Given the imbalance of the churn prediction data, they use random under-sampling on the dataset to improve the accuracy of the network. The result they find is that their method performs better than the framework currently used for churn prediction at a company specialized in data analysis. This compared framework is based on Balanced Random Forests. It was not clear if and how they performed the feature generation step for the Balanced Random Forests in their comparisons.

De Bock and Van Der Poel [4] test a novel technique based on generative models and ensemble learning called GAMens-plus. They compare this technique with other benchmark churn prediction techniques on 5 different datasets, one of which is from a bank. The results show that GAMens-plus generally performs best on the different datasets, except for the bank dataset where Random Forests performed better on all metrics used.

2.4 MTPD and STPD

Gür Ali & Arıtürk [1] also introduce a churn prediction framework called Multiple period training data (MTPD) for multiple observations per customer and compare it to what they call "the standard framework", Single period training data (STPD), which only uses one observation per customer. They conclude that MTPD increases prediction accuracy over STPD, and theorize that this is because it increases the density and size of the training data and also allows the model to generalize to different time periods.

2.5 Time windows

Another approach to improving results without changing modeling techniques is changing how much data is used. Ballings and Van den Poel [2] examine the effect of different time windows for churn prediction in newspaper subscribers. They find that for the specific company tested, there was no significant improvement to predictive results when using more than 5 years of data. The difference in the results over time was concluded to follow a logarithmic scale, so the increase in performance over time was already low after the second year. The models used in the tests were logistic regression, decision trees and decision trees with bagging. In this research, they only used one subscription-based company, so the improvement in accuracy over different time windows in other areas is not clear.

2.6 Classification methods

2.6.1 Support vector machines (SVM)

A large variety of techniques for churn prediction has been researched and some stand out more than others. Support Vector Machines are one of them. Vafeiadis et al. [20] compare different techniques for predicting churn within telecommunication customers with boosting. They use 17 features with each model to predict churn. The classification methods they used were: Artificial Neural Network, Support Vector Machines, Decision tree learning, Naïve Bayes and Logistic regression analysis. The boosting was done with AdaBoost.M1 with Monte Carlo simulation for parameter optimization. They found that the boosted SVM with a polynomial kernel performed the best, with the SVM with RBF kernel as a close second. In their tests, boosting improved accuracy by 1-4% on all classifiers where boosting was applicable.

2.6.2 Random Forests (RF)

Random Forests are tested in many different papers and consistently achieve good results. Miloševic et al. [15] test different methods for predicting and preventing early churn in mobile games. They compare 5 different models for churn prediction: Logistic regression, Decision tree, Naive Bayes, Random Forest and Gradient boosting. They find that for predicting churn, Gradient boosting performs best, with an AUC and F-score of 0.83 and 0.76 respectively. The Random Forest model achieves the second best result with an AUC and F-score of 0.80 and 0.74. The worst model was a single decision tree with both an AUC and F-score of 0.67. As described in Section 2.3 above, Random Forests were used by both Xie et al. and De Bock and Van Der Poel, and the model performed well in both articles.

2.6.3 Neural Networks (NN)

A. Keramati et al. [13] compare Neural Networks, Support Vector Machines, Decision Trees and K-Nearest Neighbors models for predicting churn in the telecommunication sector. They compare the F-scores of all models on 5 different blocks of data. The neural network model performs best on 4 out of 5 blocks and the SVM outperforms the NN slightly on one block. Vafeiadis et al. [20] find that a back-propagation neural network is the best performing model of the unboosted models that they tested and that it is slightly better than both the radial basis and polynomial Support Vector Machines.

Chapter 3 Theory

3.1 Categorical data

Some machine learning techniques can only handle numerical input, so when a dataset contains categorical values this input needs to be transformed into a format that these models can use. Categorical values can, for example, be gender, civil status or geographical locations. Some categorical values have an intrinsic ordering to them and can then be turned into numerical values representing their place in that ordering. But for variables without this feature, some other transformation is needed.

The most common method for dealing with these variables is called one-hot encoding. In one-hot encoding each level of a categorical variable is transformed into its own feature in the input of the model, the feature being 1 if the corresponding level is present and 0 otherwise. Some implementations of one-hot encoding use -1 instead of 0. The downside of this approach is that if there are many levels of the variable, many new input features are created and the dimension of the input increases.

The curse of dimensionality is a problem for many machine learning techniques where the number and complexity of the calculations for training and classifying with the model increase exponentially with the size of the input dimension. Therefore, one-hot encoding often cannot be used for categorical variables with many levels.
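As a minimal illustration (the 'civil_status' variable and its levels below are invented for this example, not taken from the bank's data), one-hot encoding can be sketched in base R as follows:

```r
# Toy data frame with one categorical variable (hypothetical levels).
customers <- data.frame(
  civil_status = factor(c("single", "married", "single", "divorced"))
)

# model.matrix with "~ variable - 1" creates one 0/1 indicator column per level.
one_hot <- model.matrix(~ civil_status - 1, data = customers)
one_hot
# Each row has a 1 in the column of its level and 0 everywhere else.
```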


3.2 Data balancing using SMOTE

Synthetic Minority Over-sampling Technique (SMOTE) is a technique for balancing datasets with a categorical response variable [5]. The standard technique for balancing data simply involves under-sampling the majority class by removing a certain percentage of random cases and then copying cases with replacement from the minority class until the data is as balanced as required. What differs in the SMOTE technique is that the minority class is not directly copied from to increase the number of samples; instead, the k nearest neighbors of the sample picked for copying are found. Depending on how much over-sampling is needed, j of the k nearest neighbors are randomly picked and j new points are added somewhere on the line segments between the copied point and the j chosen neighbors. This procedure is repeated for every data point until the minority class is over-sampled to some percentage. A value of j = 1 would then mean 100% over-sampling and j = 2 translates to 200% over-sampling. After the over-sampling is done, the majority class is under-sampled until the dataset is as balanced as needed.
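The interpolation step at the heart of SMOTE can be sketched as follows; this is an illustrative toy function, not the DMwR implementation used later in the project:

```r
# Create one synthetic minority point from an observation `x` and a matrix
# `neighbours` whose rows are its k nearest minority-class neighbours.
smote_one_point <- function(x, neighbours) {
  nb  <- neighbours[sample(nrow(neighbours), 1), ]  # pick one neighbour at random
  gap <- runif(1)                                   # random position on the segment
  x + gap * (nb - x)                                # point between x and the neighbour
}

# Example: a minority point with 5 (made-up) nearest minority neighbours.
x  <- c(1.0, 2.0)
nn <- matrix(c(1.1, 2.2,
               0.9, 1.8,
               1.3, 2.1,
               0.8, 2.4,
               1.2, 1.9), ncol = 2, byrow = TRUE)
set.seed(1)
smote_one_point(x, nn)
```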

3.3 Support Vector Machine

Support Vector Machines (SVM) use a hyperplane in an extended feature space to classify input data. Since a hyperplane can only separate points linearly in the original feature space, the SVM uses kernels that transform input data into a larger transformed feature space and is thereby able to classify points using non-linear decision boundaries. The kernel functions themselves can be seen as a measure of the similarity of two feature vectors [3, 11]. The SVM tries to maximize a margin to the points nearest the classification boundary. Some violations of this boundary and the margin are allowed, and a cost hyper-parameter 'C' controls the magnitude of these constraint violations. SVMs have been shown to have favorable performance in churn prediction tasks [20, 10, 9]. Of the most common kernels used in Support Vector Machines, the polynomial and the radial basis function (RBF) kernels have shown the most promise for churn prediction [20, 10]. The RBF kernel is the one used for classification in this project for this reason.

The RBF kernel is defined as

$$K(x, x') = \exp\left(-\gamma \,\lVert x - x' \rVert^{2}\right)$$

where x and x' are feature vectors and γ is a positive hyper-parameter [11]. The factor ‖x − x'‖ is the Euclidean distance, so if two feature vectors are far away from each other, K(x, x') will end up small and thus that feature vector will have a small impact on the classification of the new point.
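Written out in R, the kernel is a one-line function; a quick sketch:

```r
# RBF (Gaussian) kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
rbf_kernel <- function(x, x_prime, gamma) {
  exp(-gamma * sum((x - x_prime)^2))
}

rbf_kernel(c(0, 0), c(0.1, 0.1), gamma = 1)  # nearby points  -> close to 1 (~0.98)
rbf_kernel(c(0, 0), c(3, 3),     gamma = 1)  # distant points -> close to 0 (~1.5e-08)
```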

3.4 Neural network

The most common model of a neural network is based on nodes that represent the neurons in the brain; these are sometimes also referred to as artificial neurons. These neurons are structured in layers and each neuron in a layer is connected to all the neurons in the next layer. This structure of a neural network is called a Multilayer Perceptron (MLP). The first layer is the input layer and there is one input neuron per feature in the modeled dataset. These input nodes propagate the feature information to the next layer through the connections. Each connection has a weight which scales the propagated value. The last layer of an MLP is the output layer and contains the result of the MLP after the input has propagated through the network. The layers in between the input layer and the output layer are called hidden layers. The neurons in the hidden layers sum the result from each connection from the previous layer and run this sum through a non-linear activation function. This function is usually a sigmoid function which outputs a result between 0 and 1. This result is then propagated to the next layer and this procedure is repeated until the information reaches the output layer. A neural network using this procedure is called a feed-forward network. Each hidden layer l then has nodes whose output x_i^l is based on a linear combination of the outputs from the previous layer (l − 1) run through a non-linear activation function A.

$$x_i^{l} = A\left(\sum_{j=1}^{N} w_{ij}\, x_j^{l-1} + w_{i0}\right)$$

The training of a feed-forward network is based on error back-propagation, where the error in the output layer on the training data is propagated backward through the network. Gradient descent is used for deciding the change of the connection weights to accommodate for this error.

Figure 3.1: Example of the structure of a neural network with 2 input nodes, 4 hidden nodes and 2 output nodes.
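A compact sketch of this forward pass for the network in Figure 3.1 (2 input, 4 hidden and 2 output nodes), with random placeholder weights rather than a trained model:

```r
# Sigmoid activation, mapping any real number into (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# One fully connected layer: out_i = A( sum_j w_ij * in_j + w_i0 )
layer_forward <- function(input, weights, bias) {
  sigmoid(weights %*% input + bias)
}

set.seed(42)
x  <- c(0.5, -1.2)                    # 2 input features
W1 <- matrix(rnorm(4 * 2), nrow = 4)  # weights into the 4 hidden nodes
b1 <- rnorm(4)
W2 <- matrix(rnorm(2 * 4), nrow = 2)  # weights into the 2 output nodes
b2 <- rnorm(2)

hidden <- layer_forward(x, W1, b1)       # hidden layer activations
output <- layer_forward(hidden, W2, b2)  # output layer activations
output
```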

3.5 Random Forests

Random Forests (RF) is an ensemble method based on decision trees. Each decision tree in a Random Forest is trained on a bootstrap sample of the training data. The trees are grown deep and not pruned, to promote variance in the classifiers. Ensemble methods like Random Forests combine the results of multiple classifiers. Aggregating the results of many different classifiers reduces the variance of the output and increases the bias when compared to the individual classifiers. This is why variance is promoted in the decision trees of the Random Forest.

Unlike standard bagging with decision trees, Random Forests do not test the entire feature space when deciding how to split the tree. Instead, only a random subset of the feature space is considered at each split. The size of this subset is often chosen to be $\sqrt{|x|}$, where |x| is the size of the feature vector x. At each split, a new subset is chosen for consideration. This is done to reduce the correlation between the trees in the forest, since bagging trees which are highly correlated will yield a smaller reduction in the variance than if the trees were not correlated [11].

3.5.1 Weighted Random Forests

Weighted Random Forests (WRF) are an extension of Random Forests that can be used to help mitigate the issues that can be caused by highly imbalanced data [7]. In WRF, each class label is assigned a weight representing the importance of that class in the training data. When dealing with imbalanced data the minority class is given a higher weight, depending on the balance of the data. The weights are considered when deciding how to split the tree, giving more importance to the class with the higher weight.

When classifying with a Random Forest, each tree gets a vote on the result and the final classification is the class which most of the decision trees voted for. In Weighted Random Forests the process is similar; however, each vote now has a weight. When the votes are aggregated, each vote is multiplied by the weight of the class that it represents and the result is the weighted sum of all votes.

3.5.2 Balanced Random Forests

Balanced Random Forest (BRF) is another Random Forest method for dealing with imbalanced data. In this method, instead of drawing random samples from all the observations to create each tree, a bootstrap sample is first drawn from the minority class, then a sample of the same size is drawn from the majority class with replacement. This ensures that each tree in the forest is trained on some minority samples while allowing all samples from the majority class to be used in the forest [7].
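The balanced bootstrap used for a single tree can be sketched as below; the function and class labels are illustrative only:

```r
# Draw one balanced bootstrap sample of row indices for a single tree:
# a bootstrap of the minority class plus an equally sized sample
# (with replacement) from the majority class.
balanced_bootstrap <- function(response, positive = "churn") {
  minority <- which(response == positive)
  majority <- which(response != positive)
  n <- length(minority)
  c(sample(minority, n, replace = TRUE),
    sample(majority, n, replace = TRUE))
}

# Example with a heavily imbalanced response vector.
y   <- factor(c(rep("no_churn", 95), rep("churn", 5)))
idx <- balanced_bootstrap(y)
table(y[idx])   # 5 draws from each class (with repeats)
```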

3.6 Evaluation criteria

Four main methods for evaluation are used on the trained models to be able to compare them: precision, balanced accuracy, F-score, and AUROC.

3.6.1 Accuracy

The overall accuracy of predictions is the most common way to evaluate machine learning methods. In classification, accuracy is defined as

$$\text{Accuracy} = \frac{TP + TN}{O}$$

where TP is the number of true positive predictions, TN is the number of true negative predictions, and O is the total number of predictions. In a case like this one, where the data is highly imbalanced, such an accuracy score can be misleading. As an example, if a dataset has a ratio of 99% negative cases, a model that always predicts every observation as the negative class would end up with an accuracy of 99%. For such a score to be useful one would need to know the ratio of the negative and positive observations.

A more easily interpretable score is balanced accuracy. This is defined as

$$\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right)$$

where TP and TN are as above, P is the total number of positive observations, and N is the total number of negative observations. The example classifier above would then receive a balanced accuracy of 50%, since it misclassifies all positive observations.

3.6.2 Precision

In a binary classification task where an action is only performed on the positive predictions, even balanced accuracy might not give all the information needed. In the case of customer attrition, for example, for a negative prediction one might not want to perform any action, since this customer is predicted to stay with the company under the current circumstances. On the other hand, for a positive prediction, some action needs to be taken, otherwise the customer is likely to leave.

Precision is the rate of true positive (TP) predictions to total positive predictions (P):

$$\text{Precision} = \frac{TP}{P}$$

Precision can then be used to find how accurate the positive predictions are. Continuing the previous example, a precision of 50% would mean that 50% of the predicted positive observations are actually true positive observations and the other 50% are negative observations misclassified as positive. This means that the model has extracted a subset from the original population that now has a ratio of 50% positive to negative cases, as opposed to the original population with a ratio of 1%.

3.6.3 F-Score

While precision is useful for evaluating binary classifiers, it leaves out some information and thus can also be misleading. A classifier could, in theory, achieve a high precision if it only ever predicted observations as positive if it deemed it very likely for them to be true positives. This method could then result in very few positive predictions. Depending on what actions are taken on a positive prediction and the effects of taking these actions on a misclassified negative observation, such a classifier might be favorable. Recall is the fraction of true positive predictions (TP) over total positive observations (P′) in the dataset:

$$\text{Recall} = \frac{TP}{P'}$$

The example classifier then has a low recall, that is, it misclassifies a large proportion of the positive cases.

The F-score is defined as

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F-score is the harmonic mean of precision and recall, so it takes both recall and precision into consideration and can therefore be a better evaluator for binary classifiers than precision alone.
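The measures from Sections 3.6.1–3.6.3 can be collected in a small helper function; a sketch assuming binary labels with "churn" as the positive class:

```r
# Balanced accuracy, precision, recall and F-score from predicted and true labels.
binary_scores <- function(pred, truth, positive = "churn") {
  tp <- sum(pred == positive & truth == positive)
  tn <- sum(pred != positive & truth != positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)

  precision <- tp / (tp + fp)                 # TP over all positive predictions
  recall    <- tp / (tp + fn)                 # TP over all positive observations
  bal_acc   <- (recall + tn / (tn + fp)) / 2
  f_score   <- 2 * precision * recall / (precision + recall)

  c(balanced_accuracy = bal_acc, precision = precision,
    recall = recall, f_score = f_score)
}

# Example: 100 observations, 4 churners; 2 are found, with 1 false alarm.
truth <- factor(c(rep("no_churn", 96), rep("churn", 4)))
pred  <- truth
pred[c(96, 97, 98)] <- c("churn", "no_churn", "no_churn")
binary_scores(pred, truth)  # precision 2/3, recall 0.5, bal. accuracy ~0.74, F-score ~0.57
```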


3.6.4 Receiver operating characteristic curve

The Receiver operating characteristic (ROC) curve is a visualization of the true positive rate against the false positive rate at different thresholds. The threshold is a value of how certain a model is that an observation is from one class or another. This value can represent different things for different models. What is common for all thresholds is that at its extremes, either all observations are classified as one class or the other. The true positive rate is the fraction of observations classified as positive among all the positive observations. The false positive rate is the fraction of observations classified as positive among all the negative observations. The area under the curve (AUC) is often used as a measurement of how well a classification model performs. This metric gives an indication of what the performance is over all possible choices of threshold values.
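For completeness, a sketch of how an ROC curve and its AUC can be computed in R with the 'pROC' package; the thesis does not state which package it used for this step, and the data below is simulated:

```r
library(pROC)

# Simulated true classes and predicted churn probabilities (placeholders).
set.seed(1)
truth <- factor(sample(c("churn", "no_churn"), 500, replace = TRUE, prob = c(0.1, 0.9)))
churn_prob <- ifelse(truth == "churn", rbeta(500, 3, 2), rbeta(500, 2, 3))

roc_obj <- roc(response = truth, predictor = churn_prob,
               levels = c("no_churn", "churn"))  # controls, cases
auc(roc_obj)    # area under the ROC curve
plot(roc_obj)   # the curve: true positive rate against false positive rate
```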

3.6.5 Precision-recall curve

The precision-recall curve plots the precision of a classifier against the recall achieved when the threshold value is changed. The variance of this curve will often be larger at low recall values, since fewer values are used to calculate the precision.

Chapter 4 Data Set

The data used for this project consists of customer data from a European bank. There are four different data sets that the models will be evaluated on. Below is a short description of each dataset.

Credit history - contains observations from existing credit accounts.

Credit application - contains observations from credit applications.

Loan history - contains observations from existing loan accounts.

Loan application - contains observations from loan applications.

Each dataset is created by extracting data from the bank's databases. Some feature engineering is also needed for the data to become compatible with the models. An example of this is converting date variables into ETAs or categorical values. Two of the datasets are from loan customers and the other two are from credit customers. The credit data has a denser historical record since customer interactions with this product are more frequent in most cases. When training each model to detect attrition in applicants, each observation in the dataset consists of the relevant data that the bank gets when a person applies for a loan or credit account. Examples of data in the applications include the loan or credit amount and information about the customer's financial situation. When predicting attrition in existing customers, data from up to 6 months back is aggregated into a constant number of explanatory variables that are fed into each model. In the case of existing customers, more than one observation per customer may exist depending on how long they have been in the records; how this is done is explained in the 'Time Frames' section. For attrition prediction in both potential new and existing customers, data from applications is always used.

4.1 Data Noise

The most common source of noise in the dataset is missing records for some variables. Since simply not using these records would remove a large proportion of the available training data, most of the missing records are inferred. Some of these records were inferred manually by replacing them with the most logical value based on the underlying meaning of the variable. For continuous variables that could not be added by this manual method, the median of all other records was used instead. For missing categorical variables a new "Missing" category was added, to allow for the possibility that the reason behind a missing value is related to the attrition rate of that customer.

4.2 Response variable

A lot of work went into defining what churn is in the different datasets and then extracting this definition from the databases. If the definition of churn is too general, the predictions that the model makes will contain too little useful information. An example of such a definition in the case of an existing loan customer is "any customer that is no longer a customer in 6 months is a churner". It might not be interesting to know if a customer that has paid 99% of their loan will pay off the remaining 1% within the next 6 months. The definition should not be too specific either; otherwise, there might be too little data on churners for even balanced techniques to train on.

For this reason, the 'Loan history' dataset defines churn as a repayment of more than 50% of the loan within 6 months that leaves the current balance below 5%.

For the 'Loan application' dataset, the definition is simply paying back the loan within 6 months. No loan is small enough that it can be repaid within 6 months without paying off a larger sum at some point.

Both credit datasets, 'Credit history' and 'Credit application', have churn defined as either an end of the credit account within 6 months or a period of inactivity of 6 months or more. This is because a credit account has no predetermined end date, and churn can then be seen as any end of use of the product.
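Purely as an illustration of the 'Loan history' rule above, the label could be computed as follows; the column names and the exact reading of "repayment" and "balance" are assumptions, not the bank's actual schema:

```r
# Churn label for a loan-history observation: more than 50% of the loan repaid
# within the 6-month window, leaving the remaining balance below 5%.
label_loan_history_churn <- function(original_amount, balance_now, balance_6m_later) {
  repaid_share    <- (balance_now - balance_6m_later) / original_amount
  remaining_share <- balance_6m_later / original_amount
  repaid_share > 0.5 & remaining_share < 0.05
}

label_loan_history_churn(original_amount  = 10000,
                         balance_now      = 8000,
                         balance_6m_later = 300)   # TRUE: ~77% repaid, 3% remaining
```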

4.3 Time Frames

When creating the dataset for churn prediction on existing customers, different time windows are sampled from to produce the observations.

For each time window and each customer present at that time, one new observation is created. The response variable for these time frames is created by looking 6 months into the future from the start of the time frame; if the customer has left unexpectedly, a churn event has happened and the observation is registered as such. The same observation has events from 6 months back in time aggregated into explanatory variables. This information and the response variable are combined with the data from the application to form the final observation. Each observation thus covers a time window of 6 + 6 months. Some overlap in the time windows is allowed. This is partly because the 6 months forward in time otherwise would never be used as explanatory variables, and also to not miss any interesting data that might be on the border between the historical records.

This method of creating multiple observations from every customer will force the churn event to become even rarer than it is in the original dataset. Depending on the overlap of the time windows, only a few churn events are created for a customer who has churned, while, depending on the time this person has been on the records, many more non-churn observations can be created. This will be slightly balanced out by the fact that customers that were not present in the first time frame may exist in others.


Chapter 5

Training and Evaluation Framework

5.1 Data preparation

5.1.1 Categorical features

Most of the models used in this project cannot handle categorical variables, so to keep things equal, all categorical variables are transformed to be one-hot encoded [19].

The transformation is done using the R 'caret' package function 'dummyVars'. The parameter 'fullRank' is set to true so that each category gets its own variable; this is done to avoid any dependencies between the variables and to simplify the manual analysis of the variables.

5.1.2 Preprocessing

Using the R 'caret' package function 'preProcess', all training and test data is centered so that the individual mean of each feature is equal to 0, and then the data is scaled so that it fits into a -1 to 1 range. This is done after the one-hot encoding, as suggested by Tibshirani in [19].
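A sketch of these two caret steps on a toy data frame (column names invented). Note that in caret, fullRank = TRUE drops one reference level per factor, which is what removes the linear dependencies mentioned above; the preProcess call below shows centering and scaling as one plausible reading of the description, since the exact -1 to 1 scaling may have used different settings:

```r
library(caret)

train <- data.frame(amount = c(1000, 5000, 2500, 800),
                    region = factor(c("north", "south", "south", "west")),
                    churn  = factor(c("no", "no", "yes", "no")))

# One-hot encode the categorical predictors (the response 'churn' is excluded).
dv      <- dummyVars(churn ~ ., data = train, fullRank = TRUE)
train_x <- predict(dv, newdata = train)

# Center each feature to mean 0 and scale it (one plausible reading of the
# preprocessing described above).
pp       <- preProcess(train_x, method = c("center", "scale"))
train_pp <- predict(pp, train_x)
train_pp
```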

5.1.3 Splitting training and test data

For the datasets where multiple time frames are used, the training and test data have to be split per customer, as opposed to just splitting the observations randomly. Having data from the same customer in both the test and training set can lead to the model over-fitting and thereby performing better in the tests than it would in a real-world application. This is because, if only a short time span is used between the different time frames, the data from these time frames will most likely not have changed much. The model will then perform better on points that are similar to ones it has already seen in the training. Even if a longer time period is used, some data might never change. Examples of such non-changing data might be living conditions, occupation, and social status. Splitting up the data per customer has to be done every time the dataset is split into testing and training parts, as is done in the cross-validation loop. A tag is therefore added to the datasets that have multiple time frames, representing which account the observation came from. After the data has been split, the account tag is removed from the dataset since it has no value for the predictions and could not be recreated for new observations.
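A minimal sketch of such a per-customer split, with 'account' standing in for the tag described above:

```r
# Split a dataset into training and test parts per account,
# so that no account appears on both sides of the split.
split_by_account <- function(data, account_id, test_fraction = 0.3, seed = 1) {
  set.seed(seed)
  accounts      <- unique(account_id)
  test_accounts <- sample(accounts, size = round(test_fraction * length(accounts)))
  list(train = data[!(account_id %in% test_accounts), , drop = FALSE],
       test  = data[account_id %in% test_accounts, , drop = FALSE])
}

# Example: several time-frame observations per account.
obs   <- data.frame(account = c("A", "A", "A", "B", "B", "C"), x = 1:6)
parts <- split_by_account(obs, obs$account)
table(parts$train$account)
table(parts$test$account)
```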

5.1.4 Balancing

SMOTE is used on the training set before training models that do not have any built-in handling of class imbalance. The R package "DMwR v0.4.1" implementation of SMOTE is used in this project. This implementation allows three different variables to be tuned: the amount of over-sampling, the number of nearest neighbors to consider when creating new points, and the amount of under-sampling of the majority class. The under-sampling of the majority class is redefined from the default implementation in DMwR to be dependent on the ratio of minority versus majority cases. The new points are always created using the k = 5 nearest neighbors as a sampling space. The models that use SMOTE as a data balancing step in this project are the Neural network and the Support Vector Machine. The Support Vector Machine has variations of the default model that try to compensate for class imbalance, but no evidence that these variations would improve performance on churn predictions has been found in related work, so SMOTE is used instead.
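A sketch of the corresponding DMwR call on a toy data frame; the feature columns are placeholders and the percentages shown are arbitrary, not the tuned values:

```r
library(DMwR)

# Toy imbalanced training set (x1, x2 stand in for the real features).
set.seed(1)
train <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                    churn = factor(c(rep("yes", 10), rep("no", 90))))

# perc.over = 200 creates two synthetic minority points per original one (j = 2),
# k = 5 nearest neighbours form the sampling space, and perc.under controls how
# much of the majority class is kept relative to the newly created minority points.
balanced_train <- SMOTE(churn ~ ., data = train,
                        perc.over = 200, k = 5, perc.under = 150)

table(train$churn)            # original, imbalanced class counts
table(balanced_train$churn)   # class counts after SMOTE balancing
```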


5.2 Model training

For creating the models and conducting the predictions, the R programming language is used. It is chosen for two main reasons: it is the main language used for data analytics at the company providing the project and the data, and it is also very well suited for statistics, data analysis, and machine learning. It also provides most of the algorithms needed for the research in easily available packages.

5.2.1 Overall framework

Since hyper-parameters are found during the cross-validation, the estimates of the performance of the model might be overestimated due to over-fitting. Therefore, from the initial dataset, a subset is first extracted to be used for final testing as an out-of-sample evaluation of the trained model. This out-of-sample set consists of 30% of the total observations. For the history datasets that use time frames to increase the number of observations, it is ensured that observations from one person appear in only one of the evaluation and training datasets.

Cross-validation

Cross-validation with grid search is used to determine the best parameters for each model. The cross-validation consists of k = 7 folds on the training data. This number of folds is chosen since it is deemed to be a good balance between variance in results and training time. For each fold, k − 1 folds are used for training and the remaining fold is used for testing. For each such partition of the folds, every parameter combination of the grid search is used to train one model and test it on the test set of the fold.

Once the cross-validation is done, the average score of each hyper-parameter combination is calculated over the k folds, and the parameters that give the best score are selected for the out-of-sample evaluation.
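Schematically, the 7-fold grid search can be written as below. This is a simplified sketch with generic fit and score functions; the per-customer fold splitting from Section 5.1.3 and the thesis's actual scoring metric are omitted:

```r
# Cross-validated grid search. `grid` is a data frame of hyper-parameter
# combinations, `fit_fun(train, params)` returns a fitted model and
# `score_fun(model, test)` returns a score where higher is better.
cv_grid_search <- function(data, grid, fit_fun, score_fun, k = 7, seed = 1) {
  set.seed(seed)
  fold_id <- sample(rep(1:k, length.out = nrow(data)))

  grid$score <- sapply(seq_len(nrow(grid)), function(i) {
    params <- grid[i, , drop = FALSE]
    fold_scores <- sapply(1:k, function(fold) {
      train <- data[fold_id != fold, , drop = FALSE]
      test  <- data[fold_id == fold, , drop = FALSE]
      score_fun(fit_fun(train, params), test)
    })
    mean(fold_scores)               # average over the k folds
  })

  grid[which.max(grid$score), ]     # best hyper-parameter combination
}

# Example grid, matching the Random Forest parameters tuned in Chapter 6.
grid <- expand.grid(mtry = c(2, 4, 8, 32, 61),
                    nodesize = c(0.01, 0.02, 0.05, 0.1, 0.2))
```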

Out of sample evaluation

The model with the hyper-parameters that showed the best score during cross-validation is finally trained on the whole training set used to create the folds during the cross-validation. This model, trained on the bigger training set, is then used to predict the outcome of the previously extracted out-of-sample set. It is these predictions that are the basis for the final evaluation of the models.

5.2.2 Support Vector Machine training

For the SVM the implementation from the R package ’e1071’ is used.

This package allows some different hyper-parameters to be set: 'C', the cost of misclassifying an observation; 'γ', which is a parameter in the RBF kernel; and weights for each class of the response variable. For more information on the 'e1071' package, see [14].
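A sketch of the corresponding e1071 call, reusing the SMOTE-balanced toy data from Section 5.1.4; the cost and gamma values are arbitrary stand-ins for the grid-searched ones, and class weights could be passed through 'class.weights' instead of using SMOTE:

```r
library(e1071)

# RBF-kernel SVM; 'churn' is the factor response, all other columns are features.
svm_fit <- svm(churn ~ ., data = balanced_train,
               kernel = "radial",
               cost  = 1,      # 'C': penalty for margin/constraint violations
               gamma = 0.1,    # RBF kernel parameter
               probability = TRUE)

# Class predictions (with attached class probabilities) for new observations.
svm_pred <- predict(svm_fit, newdata = balanced_train, probability = TRUE)
head(svm_pred)
```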

5.2.3 Neural network training

The neural network has three parameters to tune in the cross-validation process. The first tunable hyper-parameter, 'hidden', is the number of hidden nodes to use; the second is the 'decay' parameter that reduces the size of the weights after each iteration; and the third is the 'reltol' parameter that defines when the network should stop training. For more information on these parameters, see [17].
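The thesis does not name the package behind these parameters, so as an assumption only, here is what the setup could look like with the 'nnet' package, where 'size', 'decay' and 'reltol' roughly correspond to the 'hidden', 'decay' and 'reltol' parameters described above:

```r
library(nnet)

# Single-hidden-layer network on the SMOTE-balanced toy data from Section 5.1.4.
# size   ~ number of hidden nodes ('hidden' above, assumed mapping)
# decay  ~ weight decay applied to the weights during training
# reltol ~ relative convergence tolerance used as the stopping criterion
nn_fit <- nnet(churn ~ ., data = balanced_train,
               size = 10, decay = 0.01, reltol = 1e-8,
               maxit = 500, trace = FALSE)

nn_prob <- predict(nn_fit, newdata = balanced_train, type = "raw")  # churn probabilities
head(nn_prob)
```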

5.2.4 Random Forests training

In the case of training the balanced and weighted Random Forests, SMOTE is not used. This is because the balanced methods showed superior performance over SMOTE with Random Forests in previous research. The implementation of Random Forests in the package 'randomForest' is used for both balanced and weighted Random Forests. Two parameters are tuned for both models: the 'mtry' parameter, which governs how many features are used to decide each split in the trees, and the 'nodesize' parameter, which limits the minimum size of the leaves of the decision trees. For this implementation, 'nodesize' is altered to be the minimum proportion of the total number of training observations that a leaf node is allowed to contain. For more information on the parameters in the 'randomForest' package, see [6].

5.2.5 Evaluation

Each model is first evaluated independently to see how well it fits unseen data. F-score, AUC and balanced accuracy are used to determine how well the model performs on the out-of-sample data.


Chapter 6

Experiments

While presenting data in this chapter, both the Data Protection Act and secrecy clauses involving the data used had to be taken into account. As much data as possible under these limitations is presented for the results to be reproducible. The procedure is the same for each model on the different datasets. To avoid clutter and repetition, only the experiments on one dataset, the credit history dataset, are detailed in depth, while the others are summarized to only present the most important data. This dataset was chosen because it has more observations and because initial tests showed it to be the easiest for the models to predict on.

6.1 Balanced Random Forest

Before sampling observations for each tree in the BRF model, all observations were stratified based on the response variable. An equal number of observations was then picked from each stratum to be included in the training of the tree. The sample size from each stratum was equal to the size of the minority class and the sampling was done with replacement. In the 'randomForest' function this corresponded to setting the 'strata' parameter to the response column and the 'sampsize' parameter to a vector with the number of minority observations repeated once for each class.
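A sketch of that call, reusing the toy 'train' data frame from Section 5.1.4; the parameter values are arbitrary picks from the grid, and translating the proportional 'nodesize' to a count is an assumption about how the thesis's redefinition was implemented:

```r
library(randomForest)

n_min <- min(table(train$churn))   # size of the minority class

# Balanced Random Forest: each tree is grown on a bootstrap sample of size
# n_min from every class, via the 'strata' and 'sampsize' parameters.
brf_fit <- randomForest(churn ~ ., data = train,
                        ntree    = 1000,
                        mtry     = 2,
                        nodesize = ceiling(0.02 * nrow(train)),  # proportion -> count
                        strata   = train$churn,
                        sampsize = rep(n_min, nlevels(train$churn)),
                        replace  = TRUE)
```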

Two different hyper-parameters, the number of features to consider at each split of the trees, named 'mtry', and the minimum node size, named 'nodesize', were tuned for this model. The default value for these parameters in the 'randomForest' package was the square root of the number of features in the dataset for the 'mtry' variable and 1 for the 'nodesize' variable. The number of trees in each forest was set to 1000 due to time constraints. Table 6.1 shows the values chosen for the grid search. Values for 'nodesize' outside of this range were tested but not included in this grid search since they all showed lower accuracy.

Table 6.1: Values of the hyper-parameters 'mtry' and 'nodesize' used for the Balanced Random Forest in the first grid search on the Credit history dataset.

    Hyper-parameter   Values
    mtry              2, 4, 8, 32, 61
    nodesize          0.01, 0.02, 0.05, 0.1, 0.2

Table 6.2: Values of the hyper-parameters 'mtry' and 'nodesize' used for the Balanced Random Forest in the second grid search on the Credit history dataset.

    Hyper-parameter   Values
    mtry              8, 16, 32, 48, 61
    nodesize          0.02, 0.03, 0.05, 0.075, 0.01

The result from this grid search is shown in Figure 6.1. The balanced accuracies for the different parameters in the grid search varied from 65.7% at worst to 71% at best.

For the first grid search, a wide range of values with low resolution was tested to make sure that at least one local optimum for the values was captured somewhere within the range. Given these results, another, more narrow grid search was then done around the optimal values for 'mtry' and 'nodesize' to see if the accuracy could be improved. This two-step solution reduced the time taken by the grid search since fewer values were tested overall.

The average balanced accuracies of this narrower grid search are displayed in Figure 6.2 and the values for the hyper-parameters are shown in Table 6.2. The results varied from 69% at worst to 71% at best in this grid search. The improvement in accuracy from the best parameters of the first grid search to the second was 0.09068%, so searching further was deemed unnecessary.

The model was then run on the out-of-sample dataset with the best hyper-parameters. Figure 6.3 shows the ROC curve of this final run.

Figure 6.1: Balanced Random Forest results from the first grid search for the 'mtry' and 'nodesize' hyper-parameters on the Credit history dataset.

Figure 6.2: Balanced Random Forest results from the second grid search for the 'mtry' and 'nodesize' hyper-parameters on the Credit history dataset.


Figure 6.3: The ROC curve for the Balanced Random Forest using the best parameters from the grid searches on the Credit history dataset.

The model achieved a balanced accuracy of 71.7% and an AUC of 78.1% on the new observations. The threshold of the Random Forest model is based on the ratio of votes for each class and is by default set to 50%. The ROC curve was created by changing this threshold from 0% to 100%. Figure 6.4 shows the precision of the predicted churn class against the recall when changing the threshold. At the default threshold, the model achieved a recall, or sensitivity, of 65% and a precision of 35%. By changing the threshold to classify fewer observations as churners, the model could, for example, achieve a precision of 78% and a recall of 22%.

Figure 6.4: The precision-recall curve for the Balanced Random Forest using the best parameters from the grid searches on the Credit history dataset.

(42)

CHAPTER 6. EXPERIMENTS 33

Table 6.3: Final Balanced Random Forest hyper-parameters and results on all datasets with a classification threshold of 50%.

Dataset             Bal. accuracy  Precision  F-score  AUC    Nodesize  Mtry
Credit history      71%            40%        0.484    76.4%  0.02      32
Credit application  66.7%          22.6%      0.336    71.7%  0.0005    2
Loan history        63%            33.1%      0.366    67.5%  0.01      8
Loan application    60.2%          22.7%      0.320    63.7%  0.003     2

The same procedure was done for the other datasets and the final results for all datasets are presented in Table 6.3.
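For reference, the sketch below shows how the figures reported in Table 6.3 follow from a confusion matrix, assuming the standard definitions of balanced accuracy, precision and F-score; ’pred’ and ’actual’ are the logical vectors from the previous sketch.

    tp <- sum(pred & actual);   fp <- sum(pred & !actual)
    fn <- sum(!pred & actual);  tn <- sum(!pred & !actual)

    sensitivity  <- tp / (tp + fn)                    # recall on the churn class
    specificity  <- tn / (tn + fp)
    bal_accuracy <- (sensitivity + specificity) / 2
    precision    <- tp / (tp + fp)
    f_score      <- 2 * precision * sensitivity / (precision + sensitivity)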

6.2 Weighted Random Forest

The weighted Random Forest experiment was similar to the BRF experiment, with some differences. The first is that all observations were used when creating each decision tree. The second and most significant difference is of course what makes this Random Forest weighted: the ’classwt’ parameter in the ’randomForest’ function was used to set the priors for the two classes, where each class gets a prior representing the inverse prevalence of that class in the dataset.

The two hyper-parameters tuned in the grid search are ’nodesize’ and ’mtry’. The values for these parameters, shown in Table 6.4, were chosen to cover most of the range of possible values. Values for ’mtry’ were chosen to be denser on smaller values, where a small change in the value might have a bigger impact on the model.
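A minimal sketch of this weighted setup with the ’randomForest’ package is given below; ’train’ and ’churn’ are again hypothetical names, and the exact scaling of the priors is an assumption, since ’classwt’ does not need to sum to one.

    library(randomForest)

    prior <- 1 / prop.table(table(train$churn))   # inverse prevalence of each class

    wrf <- randomForest(churn ~ ., data = train,
                        ntree    = 1000,
                        mtry     = 25,                          # tuned value for the Credit history dataset (Table 6.6)
                        nodesize = round(0.01 * nrow(train)),
                        classwt  = as.numeric(prior))           # priors per class; all observations used for every tree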

From Figure 6.5 it can be seen that the balanced accuracy for the different parameters ranged from 52% to 65%. The AUC achieved by the best parameter combination was 69.7%.

A new range of hyper-parameters was then tested around the best values from the first search. These new parameters are presented in Table 6.5.

Table 6.4: Values of the hyper-parameters ’mtry’ and ’nodesize’ used for the weighted Random Forest in the first grid search on the Credit history dataset.

Hyper-Parameter Values

mtry 2, 4, 8, 32, 61

nodesize 0.01, 0.02, 0.05, 0.1, 0.2


Figure 6.5: Weighted Random Forest results from the first grid search for the ’mtry’ and ’nodesize’ hyper-parameters on the Credit history dataset.


Table 6.5: Values of the hyper-parameters ’mtry’ and ’nodesize’ used for the weighted Random Forest in the second grid search on the Credit history dataset.

Hyper-parameter Values

mtry 6, 7, 8, 15, 25

nodesize 0.001, 0.005, 0.01, 0.015

Table 6.6: Final weighted Random Forest hyper-parameters and results on all datasets with a classification threshold of 50%.

Dataset             Bal. accuracy  Precision  F-score  AUC    Nodesize  Mtry
Credit history      63%            63.4%      0.412    74.9%  0.01      25
Credit application  50%            33.3%      0.008    69%    0.01      45
Loan history        57.5%          63.9%      0.251    57.5%  0.1       14
Loan application    50%            0%         0        58.5%  0.06      40

All other parameters were set to be the same.

In the second grid search, the balanced accuracy ranged from 62% to 64.5%.

The best parameters found were then tested on a validation set not used during any of the grid searches. The final result on this test set was an AUC of 74.9%, a balanced accuracy of 63% and an F-score of 0.412.

The final results of the tuned WRF models on all datasets are presented in Table 6.6. We see that the model performed best on the Credit history dataset and worst on the Loan history dataset when considering the AUC. On the Loan application and Credit application datasets, the model only achieved a balanced accuracy of 50%. On these datasets, the model did not predict any observations to be churn when the threshold was at the default 50%. Since the AUC is not 50% or lower, it is clear that the model will start to predict observations as churners if the threshold is changed.


Figure 6.6: Weighted Random Forest results from the second grid search for the ’mtry’ and ’nodesize’ hyper-parameters on the Credit history dataset.


Figure 6.7: The ROC curve for the Weighted Random Forest using the best parameters from the grid searches on the Credit history dataset.


Figure 6.8: The precision-recall curve for the Weighted Random Forest using the best parameters from the grid searches on the Credit history dataset.


6.3 Support Vector Machine

For the Support Vector Machine model, SMOTE was used to balance the classes in the dataset. Because of the long training times involved when using this model, no grid search was done to find the best parameters for the SMOTE algorithm; instead, the minority class was oversampled until it was equal in size to the majority class.
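The thesis does not name the SMOTE implementation used, so the sketch below uses the ’smotefamily’ package purely as an illustration; it assumes all features are numeric, and ’train’ and ’churn’ remain hypothetical names.

    library(smotefamily)

    X <- train[, setdiff(names(train), "churn")]
    y <- train$churn

    counts   <- table(y)
    dup_size <- ceiling(max(counts) / min(counts)) - 1   # synthetic cases per original minority case

    sm       <- SMOTE(X, y, K = 5, dup_size = dup_size)
    balanced <- sm$data                                  # original data plus synthetic minority observations
    names(balanced)[names(balanced) == "class"] <- "churn"   # smotefamily calls the label column "class"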

Multiple values of the hyper-parameters ’gamma’ and ’cost’ were tested in the cross-validated grid search. The values tested can be found in Table 6.7. The default value for the gamma parameter is 1/d, where d is the data dimension, which in the case of the Credit history dataset was 0.023. The values for gamma displayed in Table 6.7 were deemed to be a good starting range since lower values increase the number of support vectors and in turn the training time. The number of time frames also had to be decreased to speed up training.
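One way the cross-validated grid search could look with the ’e1071’ package is sketched below; note that ’tune.svm’ ranks parameters by plain classification error rather than the balanced accuracy used here, so this is only an approximation, and ’balanced’ is the hypothetical SMOTE output from the previous sketch.

    library(e1071)

    tuned <- tune.svm(churn ~ ., data = balanced,
                      kernel      = "radial",
                      gamma       = c(0.001, 0.01, 0.1),      # first grid, Table 6.7
                      cost        = c(0.01, 1, 10),
                      tunecontrol = tune.control(cross = 5))  # 5-fold cross-validation is an assumption

    tuned$best.parameters   # combination carried into the narrower grid in Table 6.8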

In Figure 6.9 it can be seen that the balanced accuracy for the different parameters ranged from 62.4% to 67.2%.

A new, smaller range of hyper-parameters was then tested around the parameters that led to the best score from the first search. These new parameters are presented in Table 6.8. All other parameters were set to be the same.

In the second grid search, the balanced accuracy ranged from 66.3% to 67.7%. Even though the best parameters were found at the edge of the grid search, where the cost parameter was 3 and gamma was 0.007, the increase in accuracy over the surrounding parameters was so small that no further grid search outside of these values was deemed necessary.

The best parameters found were then tested on a validation set not used during any of the grid searches. The final result on this test set was a balanced accuracy of 69.6%, an AUC of 74.6% and an F-score of 0.514. The ROC curve is displayed in Figure 6.11 and the precision-recall curve in Figure 6.12.

Table 6.7: Values of the hyper-parameters ’gamma’ and ’cost’ used for the Support Vector Machine in the first grid search on the Credit history dataset.

Hyper-parameter  Values
gamma            0.001, 0.01, 0.1
cost             0.01, 1, 10


Figure 6.9: Support Vector Machine results from the first grid search for the ’gamma’ and ’cost’ hyper-parameters on the Credit history dataset.

Table 6.8: Values of the hyper-parameters ’gamma’ and ’cost’ used for the Support Vector Machine in the second grid search on the Credit history dataset.

Hyper-parameter  Values
gamma            0.007, 0.01, 0.03
cost             0.7, 1, 3


Figure 6.10: Support Vector Machine results from the second grid search for the ’gamma’ and ’cost’ hyper-parameters on the Credit history dataset.


Figure 6.11: The ROC curve for the Support Vector Machine using the best parameters from the grid searches on the Credit history dataset.

The threshold for the Support Vector Machine model was based on the probability values for predictions that the prediction function of the ’e1071’ package can return.
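A sketch of how such probabilities can be obtained and thresholded is shown below; ’balanced’ and ’test’ are the hypothetical datasets from the earlier sketches, and the gamma and cost values are the tuned ones from Table 6.9.

    library(e1071)

    final_svm <- svm(churn ~ ., data = balanced,
                     kernel = "radial", gamma = 0.01, cost = 3,
                     probability = TRUE)                      # required to get class probabilities from predict()

    pred  <- predict(final_svm, newdata = test, probability = TRUE)
    probs <- attr(pred, "probabilities")[, "churn"]

    churn_hat <- probs >= 0.5   # varying this threshold traces out the curves in Figures 6.11 and 6.12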

The results of the tuned SVM model on all datasets are presented in Table 6.9.


Figure 6.12: The precision-recall curve for the Support Vector Machine using the best parameters from the grid searches on the Credit history dataset.

Table 6.9: Final Support Vector Machine hyper-parameters and results on all datasets with a classification threshold of 50%.

Dataset             Bal. accuracy  Precision  F-score  AUC    Cost  Gamma
Credit history      69.6%          40.6%      0.514    74.6%  3     0.01
Credit application  61.3%          15.3%      0.237    63.8%  0.1   0.001
Loan history        57.5%          18.7%      0.267    61.8%  20    0.005
Loan application    56.7%          18.7%      0.291    61.9%  5     0.0001
