DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Loss Given Default Estimation with Machine Learning Ensemble Methods

ELINA VELKA


Degree Projects in Financial Mathematics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics
KTH Royal Institute of Technology, 2020

Supervisor at Hoist Finance: Daniel Boström
Supervisor at KTH: Anja Janssen


TRITA-SCI-GRU 2020:305 MAT-E 2020:077

Royal Institute of Technology

School of Engineering Sciences

KTH SCI


Abstract


Sammanfattning

Estimering av förlust vid fallissemang med ensemblemetoder inom maskininlärning
(Estimation of loss given default with ensemble methods in machine learning)


Acknowledgements

First and foremost I would like to express my sincere gratitude to my supervisor at KTH, Anja Janssen. It's only with your patience, guidance and time that I have been able to complete this project. I would also like to thank my supervisor Daniel Boström and the risk team at Hoist Finance for giving me the opportunity to realise this project and constantly teaching me new things. Working with this team has been a great privilege and I could not ask for better colleagues. My heartfelt gratitude to Gianluca Puccio for taking the time to patiently and pedagogically explain every question, big or small, about the Italian data. Finally my deepest thanks to Camilla and Anna for all the encouragement and proofreading and a special thanks to Erik for supporting me through all the ups and downs during my years at KTH.

This work is dedicated to you, Elina Velka


Acronyms

BCBS Basel Committee on Banking Supervision

CRD Capital Requirements Directive

CRR Capital Requirements Regulation

EAD Exposure at Default

EBA European Banking Authority

EL Expected Loss

EU European Union

GBM Gradient Boosting Machine

IRB Internal Ratings-Based

LGD Loss Given Default

MAE Mean Absolute Error

ML Machine Learning

NPLs Non-Performing Loans

PD Probability of Default

RMSE Root Mean Square Error

RR Recovery Rate

R-Squared Coefficient of Determination

RSS Residual Sum of Squares

RWA Risk-Weighted Assets

SA Standardized Approach

SFSA Swedish Financial Supervisory Authority

SVM Support Vector Machine

TSS Total Sum of Squares


Contents

1 Introduction
   1.1 Methodology
   1.2 Research Question
   1.3 Scope
   1.4 Outline

2 Background
   2.1 Non-Performing Loans
   2.2 Hoist Finance
   2.3 Quantities of Interest
   2.4 Related Work

3 Methods
   3.1 Machine Learning
   3.2 Generalized Linear Models
      3.2.1 Linear Models
      3.2.2 Logistic Regression
   3.3 Tree Models
      3.3.1 Decision Tree
      3.3.2 Random Forest
      3.3.3 Boosting
   3.4 Model Complexity
   3.5 Cross-validation
   3.6 Metrics of Interest
      3.6.1 R-Squared and Adjusted R-Squared
      3.6.2 Root Mean Square Error
      3.6.3 Mean Absolute Error
      3.6.4 Pearson's Correlation Coefficient
      3.6.5 Root Node Error

4 Data and Model Preparation
   4.1 Data Preparation
   4.2 Model Preparation
      4.2.1 Decision Tree Model
      4.2.2 Random Forest Model
      4.2.3 Boosted Model

5 Results
   5.1 Impact of Variables
      5.1.1 Variables Used
      5.1.2 Variables Versus Loss Given Default
      5.1.3 Variable Importance Assigned by Models

List of Tables

5.1.1 Description of the 20 variables used in this study
5.2.1 Decision Tree Model Performance, 20 variables versus 13 variables
5.2.2 Random Forest Model Performance, 20 variables versus 13 variables
5.2.3 Boosted Model Performance, 20 variables versus 13 variables
5.2.4 Results of Model Assessment using 20 variables
5.2.5 Results of Model Assessment using 13 variables
5.2.6 Results of Model Assessment using 20 variables after removing observations where Loss Given Default (LGD) = 1 from the Test set
5.2.7 Results of Model Assessment using 13 variables after removing observations where LGD = 1 from the Test set

Chapter 1

Introduction

In the aftermath of the financial crisis of 2008-2009, financial institutions were required to comply with more prudential regulations than before. The global standard setter and issuer of these regulations is the Basel Committee on Banking Supervision (BCBS). Basel III is an internationally agreed upon set of measures developed by the BCBS. The overarching goal of the Basel III agreement and its implementing acts in Europe, the Capital Requirements Regulation (CRR) and Capital Requirements Directive (CRD), is to strengthen the resilience of the banking sector across the European Union (EU) so it would be better placed to absorb economic shocks while ensuring that banks continue to finance economic activity and growth [1]. Like all BCBS standards, Basel III standards are minimum requirements which apply to internationally active banks. Members are committed to implementing and applying standards in their jurisdictions within the time frame established by the Committee [2]. Since Basel III was implemented, the European Banking Authority (EBA) has been conducting work to harmonize practices across European financial institutions. This has led to a substantial number of regulatory requirements and guidelines to create a level playing field across Europe.

One of the approaches to capital treatment for measuring credit risk is the Internal Ratings-Based (IRB) approach, which allows banks to use their internal rating systems for measuring credit risk. Banks that have received supervisory approval to use the IRB approach may rely on their own internal estimates of risk components in determining the capital requirement for a given exposure, expressing this in terms of an estimated Expected Loss (EL) and Risk-Weighted Assets (RWA). The total sum of RWA is calculated using formulas specified by the BCBS.

EL is calculated by the following formula:

$$\mathrm{EL} = \mathrm{PD} \cdot \mathrm{LGD} \cdot \mathrm{EAD}, \qquad (1.1)$$

where PD is the probability of default, LGD is the loss given default and EAD is the exposure at default, representing the expected size of the exposure at the time of default. In practice the term LGD describes the proportion of the exposure that is actually lost in the event of default. Its complement is the recovery, which describes the amount of the exposure that can be recovered through debt reconstruction and asset sales. The use of internally estimated PD and LGD allows for increased risk sensitivity in the IRB capital charges compared with the Standardized Approach (SA) [3].
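As a minimal illustration, the expected-loss formula (1.1) can be evaluated directly in code; the function name and the numbers below are purely illustrative and not taken from the thesis data:

```python
def expected_loss(pd_, lgd, ead):
    """Expected Loss as in (1.1): EL = PD * LGD * EAD.

    pd_ - probability of default, in [0, 1]
    lgd - loss given default, the fraction of the exposure lost, in [0, 1]
    ead - exposure at default, in currency units
    """
    for name, value in (("PD", pd_), ("LGD", lgd)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return pd_ * lgd * ead

# Illustrative numbers only; for a defaulted exposure PD is taken as 100%.
print(expected_loss(pd_=1.0, lgd=0.45, ead=10_000.0))
```

Since LGD enters (1.1) linearly, any change in the LGD estimate moves EL by the same factor, which is why accurate LGD models matter.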

Beyond the purpose of calculating regulatory capital, these three parameters have wide-ranging uses for banks, serving as inputs into economic capital models, stress testing, impairment forecasting, pricing and informing portfolio management [4]. LGD enters the capital requirement formulas in a linear way, unlike PD, which therefore has less of a direct effect on minimum capital. Hence, any changes in the LGD estimates produced by models have a strong bearing on the capital of a financial institution, and thus on its long-term strategy as well. It is therefore crucial to have models that estimate LGD as accurately as possible [5].

1.1 Methodology

This thesis applies ML ensemble methods to predict the risk parameter introduced above, LGD. ML methods are reported to outperform most of the physical and statistical methods in predictive modeling in terms of accuracy, robustness, uncertainty analysis, data efficiency, simplicity, and computation cost. Thus, ML methods for credit risk modelling have gained massive popularity during the past few years [6]. Additional research demonstrates that moving beyond single models to ensembles of machine learning models provides additional performance benefits in terms of prediction accuracy [7]. The ensemble model development process could also be turned into a fully automated process that would greatly benefit organizations that need to build a large number of models [8].

1.2 Research Question

Can Loss Given Default (LGD) be well predicted by ML ensemble methods? What is the optimal selection of variables used in models predicting LGD?

1.3 Scope

Hoist has a vast database consisting of data from 11 different countries. The data has been acquired from different sources and therefore has to be consolidated and cleansed before the models can be applied. Hence, in this thesis work only the data from Italy has been cleansed and used for modelling.

As a part of the IRB pre-study project at Hoist, another master thesis estimating LGD with zero-and-one inflated beta regression type models has been conducted by Ljung and Svedberg [9]. In their study only closed unsecured non-performing loans were used. Therefore, also in this thesis work only closed unsecured non-performing loans are evaluated.

This also enables comparison in future studies, in case the models would be improved to two-stage models as suggested by Yao et al. [10] and other literature.

Data cleansing, model building and graphical presentation of the results in this thesis work were performed in Alteryx, a data analytics platform; R, a programming language for statistical computing; and Microsoft SQL Server Management Studio, an integrated environment for managing SQL infrastructure.

1.4 Outline


Chapter 2

Background

This chapter starts with an introduction to Non-Performing Loans (NPLs). It then explains why Hoist Finance is interested in the prediction of LGD, and it concludes by exploring some of the related work in the field of credit risk prediction.

2.1 Non-Performing Loans

According to the EBA, non-performing loans or exposures are those that satisfy either of the following criteria:

• material exposures that are more than 90 days past due;

• the debtor is assessed as unlikely to pay its credit obligations in full without realisation of collateral, regardless of the existence of any past due amount or of the number of days past due.


As of June 2019, the weighted average NPL ratio, i.e. the ratio between the NPL total and the total amount of outstanding loans in the bank's portfolio, stood at 3%, compared with 6% in June 2015. This is the lowest since the EBA introduced a harmonised definition of NPLs across European countries in 2014. On average, the NPL ratio has improved by 75 basis points each year. Reductions in the NPL volume, the numerator of the ratio, mostly drove the improvement. The total volume of NPLs as of June 2019 stood at EUR 636 billion, which is almost half the NPL volume recorded in June 2015 (EUR 1152 billion). The ratio further improved as a result of increasing total loans. Loan volumes as of June 2019 stood at EUR 21.2 trillion, an increase of 10% compared with June 2015 (EUR 19.2 trillion) [12].

2.2 Hoist Finance


2.3 Quantities of Interest

As mentioned before, the amount of capital that a financial institution is required to hold against credit risk depends on Expected Loss (EL) and Unexpected Loss (UL). For defaulted exposures, UL is measured as the difference between the LGD in an economic downturn and the average LGD. Since the loans are defaulted, the PD is assumed to be 100%. Thus the parameter of interest in the calculation of EL is LGD.

For the purpose of convenience, in this study LGD is calculated at a given point in time: 24 months after the debt startup date [10]. If the debt is totally recovered during the first 24 months, LGD is set to 0; otherwise the following formula is applied:

$$\mathrm{LGD} = \frac{\text{Outstanding Debt}_{t=24} - \sum_{t=25}^{T} \text{Collections}_t}{\text{Outstanding Debt}_{t=24}}, \qquad (2.1)$$

where $\text{Outstanding Debt}_{t=24} = \text{Outstanding Debt}_{t=0} - \sum_{t=0}^{24} \text{Collections}_t$ and $T$ is the month the debt is closed. Outstanding Debt at $t = 0$ is the original claim amount when the debt was started. Collections from $t = 0$ to $t = 24$ are the cash transactions received from the customer during the first 24 months.
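As an illustration of formula (2.1), the following sketch computes LGD from a hypothetical payment history; all names and numbers are invented for the example:

```python
def lgd_at_24_months(original_claim, collections):
    """LGD as defined in (2.1), evaluated 24 months after debt startup.

    original_claim - Outstanding Debt at t = 0
    collections    - cash received per month, indexed t = 0 .. T
    """
    outstanding_24 = original_claim - sum(collections[:25])  # months 0..24
    if outstanding_24 <= 0:          # fully recovered in the first 24 months
        return 0.0
    later_collections = sum(collections[25:])                # months 25..T
    return (outstanding_24 - later_collections) / outstanding_24

# Hypothetical debt: claim of 1000, the customer pays 10 per month for 60 months.
print(round(lgd_at_24_months(1000.0, [10.0] * 60), 3))  # 0.533
```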

2.4 Related Work

PD modelling has been the main focus of credit research for several decades, and in recent years LGD models, with their challenging bimodal distributions, have also become a subject of interest in research [4], [5]. The bimodal distribution can be explained by the fact that most of the customers in default either pay back a full proportion of the outstanding debt or do not pay back at all [13].


Two years later, in a study by Leow, Mues and Thomas [14], it was found that macroeconomic variables improved the models if the underlying data was collected from secured mortgage loans. However, for personal loans most of the macroeconomic variables did not improve the prediction models, and they suggest that unsecured personal loans might be less affected by macroeconomic conditions than mortgage loans [10].

A large-scale LGD benchmarking study has been conducted by Loterman et al. [5], where 24 regression models were evaluated on six real-life datasets obtained from major international banking institutions. They found that non-linear techniques outperformed the traditional linear models. This suggests the presence of non-linear relationships between the independent variables and LGD.

Another large-scale update on benchmarking classification algorithms for credit scoring was done by Lessmann, Baesens, Seow and Thomas [15]. This study provides valuable insight into advanced classifiers that, without human intervention, predict significantly more accurately than simpler alternatives.

In the study performed by Yao, Crook and Andreeva [10], prediction of the Recovery Rate (RR), where RR = 1 - LGD, with Support Vector Machine (SVM) techniques is evaluated. The kernel-based least squares SVM models were applied in two ways. Firstly, by directly applying SVM regression to the test data they demonstrated that SVM models were better in prediction than the other linear or generalized linear models. Secondly, two-stage models were evaluated, where the first stage discriminated the cases with RR equal to 0 and 1 using an SVM classification algorithm and the second stage performed an SVM regression algorithm on the cases where RR ∈ (0, 1). They suggest that in the two-stage model the choice of regression methods is less influential in the prediction of RRs than the choice of classification methods in the first step.


Chapter 3

Methods

In this chapter, the theoretical aspects behind several machine learning models are discussed. Three regression methods (decision trees, random forests and boosted methods) and one classification method (logistic regression) are described in detail. The metrics chosen to evaluate the models are also provided in this chapter.

3.1 Machine Learning

Machine learning is the science of programming models that learn from data. Machine learning can be categorized into three sub-fields: supervised, unsupervised and reinforcement learning, see Figure 3.1.1. The overview in this chapter is based on the book by Hastie, Tibshirani and Friedman [16].

Supervised Learning


Figure 3.1.1: Overview of Machine Learning

• regression, if the result is a continuous value, such as a temperature or, as in the case of this study, LGD ∈ [0, 1];

• classification, if the result is a categorical value, such as "true" or "false". Classification can be separated into binary classification, distinguishing between two classes (for example whether LGD = 1 or not), and multi-class classification, distinguishing more than two classes.

One can say that the aim of classification is to predict a class label from a predefined list of possibilities and the aim of regression is to predict a continuous real number. Some of the most important supervised learning algorithms are linear regression, logistic regression, k-nearest neighbors, decision trees and random forests, boosting, support vector machines and neural networks. To comply with regulatory requirements, only supervised learning regression models will be explored in this thesis work.

Unsupervised Learning

When a small amount of labeled data is combined with a large amount of unlabeled data for supervised learning, such algorithms are called semi-supervised learning.

Reinforcement Learning

The third type of machine learning is reinforcement learning, which evaluates the performance of the model and gives feedback for improvement (reward or penalty). The system must then learn the best strategy, i.e. a policy, to get the most reward. Reinforcement learning utilizes supervised and unsupervised learning methods such as regression, classification and clustering.

The regulatory requirements for credit risk models emphasise that the models ought to be interpretable and well explained [17]. Some of the more complex ML methods stated above lack this interpretive ability and are therefore inappropriate for credit risk prediction [8]. Hence, only three supervised learning regression models will be evaluated in this thesis work: decision trees, random forests and boosting models. Logistic regression is mathematically explained but left for further exploration in case a two-stage model would be implemented in future works [6], [10].

3.2 Generalized Linear Models

3.2.1 Linear Models

Linear models use a linear function of the input variables to predict the outcome. The general prediction formula for regression can be written as follows:

$$\hat{y} = w_1 \cdot x_1 + \dots + w_p \cdot x_p + b. \qquad (3.1)$$

Here $X = (x_1, \dots, x_p)$ denotes the input parameters, and $W = (w_1, \dots, w_p)$ and $b$ are the parameters learned by the model. The predicted parameter $\hat{y}$ is the outcome. Linear models for regression are characterized by the prediction being described by a line if there is only one input parameter, by a plane if two parameters are used and by a hyperplane if higher-dimensional parameters are used. There are different linear models for regression, and the difference lies in how the model parameters $w$ and $b$ are learned from the training data and how the model complexity can be controlled.

3.2.2 Logistic Regression

Logistic regression is used in applications where binary classes occur; however, this model can be extended to deal with multiclass classification problems.

Logistic regression models use functions of $X$ that ensure that the outcome $P(X)$, which is interpreted as a probability, lies between 0 and 1 [16]. The logistic function that meets this criterion is

$$P(X) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}, \qquad (3.2)$$

or, rearranged,

$$\log\!\left(\frac{P(X)}{1 - P(X)}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p. \qquad (3.3)$$

This is the logit function, and it is linear in $X$ as seen on the right-hand side. The coefficients $\beta_0, \dots, \beta_p$ are unknown and have to be estimated from the training data. Logistic regression models are usually fitted by maximum likelihood, with likelihood function

$$\ell(\beta_0, \dots, \beta_p) = \prod_{i:\, y_i = 1} P(x_i) \prod_{i':\, y_{i'} = 0} \big(1 - P(x_{i'})\big), \qquad (3.4)$$

where $P(x_i)$ estimates the probability for class 1. The estimates $\hat{\beta}_0, \dots, \hat{\beta}_p$ are chosen to maximize this likelihood function [18].


Figure 3.2.1: 3-class classification problem split into 3 binary classifica-tion problems.

In this study, a "One-vs-Rest" logistic regression model loses part of the information when the continuous value of LGD is divided into buckets, and it therefore cannot compare in performance to the other models discussed later in this work. However, the binary logistic model could be used in a two-step LGD prediction model as suggested in [10], where LGD = 1 or LGD = 0 is predicted by the logistic regression as the first step. Thereafter the prediction of values where LGD ∈ (0, 1) could be performed by, for example, any of the other models discussed later in this work. This is left for future studies.
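As a hedged sketch of how such a first-stage classifier could look, the following fits a one-feature logistic regression by plain gradient ascent on the log-likelihood corresponding to (3.4); the synthetic data, learning rate and iteration count are arbitrary choices for illustration, not part of the thesis setup:

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient ascent
    on the log-likelihood of (3.4)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += (y - p)          # d log-likelihood / d b0
            g1 += (y - p) * x      # d log-likelihood / d b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Synthetic example: a large x makes y = 1 (think "LGD = 1") more likely.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [0,   0,   0,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
p_low = 1.0 / (1.0 + math.exp(-(b0 + b1 * 0.5)))
p_high = 1.0 / (1.0 + math.exp(-(b0 + b1 * 3.0)))
print(p_low < 0.5 < p_high)  # the fitted curve separates the two groups
```

With perfectly separated data the likelihood has no finite maximizer, so the coefficients keep growing; the fixed iteration count acts as a crude stopping rule here.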

3.3 Tree Models

3.3.1 Decision Tree

Consider a regression problem with continuous response $Y$ and inputs $X_1$ and $X_2$. First the space is split into two regions, and the response is modelled by the mean of $Y$ in each region. Then one or both of these regions are split into two more regions, and this process is continued until a stopping rule is applied, as shown in Figure 3.3.1.

Figure 3.3.1: The left figure shows an example of partition of a two-dimensional feature space by recursive binary splitting. The right figure shows the tree corresponding to the partition in the left figure.

For example, in the left part of Figure 3.3.1, the first split is done at $X_1 = t_1$. Then the region $X_1 \le t_1$ is split at $X_2 = t_2$ and the region $X_1 > t_1$ is split at $X_1 = t_3$. Finally, the region $X_1 > t_3$ is split at $X_2 = t_4$. The result of this process is a partition into the five regions $R_1, R_2, \dots, R_5$ shown in the figure. The corresponding regression model predicts $Y$ with a constant $c_m$ in region $R_m$, that is,

$$\hat{f}(X) = \sum_{m=1}^{5} c_m \, I\{(X_1, X_2) \in R_m\}. \qquad (3.5)$$

The model can be represented by the binary tree in the right part of Figure 3.3.1. The partition starts at the top node with the full data set. Observations satisfying the condition at each node are assigned to the left branch, and the others to the right branch. The terminal nodes, or leaves, of the tree correspond to the regions $R_1, R_2, \dots, R_5$.

In general, if $c_m$ is modelled as the response in each of the regions $R_1, R_2, \dots, R_M$, where $M$ is the number of regions, then

$$\hat{f}(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m). \qquad (3.6)$$

Minimization of the sum of squares $\sum_i (y_i - f(x_i))^2$ for given regions $R_m$ gives

$$\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m), \qquad (3.7)$$

where $\hat{c}_m$ is the average of $y_i$ in region $R_m$. Thereafter a greedy algorithm is applied to the whole data. The pair of half-planes is defined as

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}, \qquad (3.8)$$

where the splitting variable $j$ and split point $s$ are chosen in such a way that they solve

$$\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]. \qquad (3.9)$$

For any choice of $j$ and $s$, the inner minimization is solved by

$$\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s)). \qquad (3.10)$$

For each splitting variable $j$, the split point $s$ is calculated, and thereafter by running through all pairs the optimal pair $(j, s)$ is found. The splitting process is then repeated on the resulting regions until a pre-selected threshold is reached, such as the minimum number of records needed to allow a split or the minimum number of records allowed in a terminal node [16].
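The greedy search over (j, s) in (3.8)-(3.10) can be sketched as follows; `best_split` and the toy data are illustrative names, not code from the thesis:

```python
def best_split(X, y):
    """Find the pair (j, s) minimizing (3.9) for data X (list of feature
    tuples) and responses y. Returns (j, s, c1_hat, c2_hat)."""
    best = None  # (sse, j, s, c1, c2)
    n_features = len(X[0])
    for j in range(n_features):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:        # skip degenerate splits
                continue
            c1 = sum(left) / len(left)       # c1_hat = ave(y_i | x_i in R1)
            c2 = sum(right) / len(right)     # c2_hat = ave(y_i | x_i in R2)
            sse = sum((yi - c1) ** 2 for yi in left) + \
                  sum((yi - c2) ** 2 for yi in right)
            if best is None or sse < best[0]:
                best = (sse, j, s, c1, c2)
    _, j, s, c1, c2 = best
    return j, s, c1, c2

# Toy data: the response jumps when the first feature exceeds 1.
X = [(0.0, 5.0), (1.0, 4.0), (2.0, 3.0), (3.0, 2.0)]
y = [0.0, 0.1, 0.9, 1.0]
print(best_split(X, y))
```

Growing a tree then amounts to calling this search recursively on each resulting region until a stopping threshold is reached.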

3.3.2 Random Forest

The random forest algorithm can be explained as growing trees and letting them vote for the most popular class; in order to grow these ensembles, random vectors are generated that govern the growth of each tree in the ensemble.

Random forest can be used for both classification and regression problems. In regression problems the random forest is formed by growing trees depending on a random vector θ such that the tree predictor h(x, θ) takes on numerical values as opposed to class labels.

The output values are numerical and we assume that the training set is independently drawn from the distribution of the random vector $(X, Y)$. The mean-squared generalization error for any numerical predictor $h(x)$ is

$$E_{X,Y}\big(Y - h(X)\big)^2. \qquad (3.11)$$

The random forest algorithm reduces the risk of overfitting that is common for a single decision tree, and by increasing the number of trees in the forest it becomes more accurate in generalization. Random forest is a very stable algorithm: if a new data point is introduced into the dataset, the overall model is not affected much. The disadvantage of random forest models is that they are more complex and computationally expensive than decision tree models. Due to their complexity, they require more time to train than other comparable models.
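A toy sketch of the bagging idea behind random forests, assuming one-split trees ("stumps") as base learners and one randomly chosen feature per tree; this is a simplification of Breiman's algorithm, which samples candidate features at every split:

```python
import random

def fit_stump(X, y, j):
    """One-split regression tree on feature j: pick the split s minimizing
    the within-region sum of squares, predict the region means."""
    best = None
    for s in sorted({row[j] for row in X})[:-1]:
        left = [yi for row, yi in zip(X, y) if row[j] <= s]
        right = [yi for row, yi in zip(X, y) if row[j] > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - c1) ** 2 for v in left) + sum((v - c2) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, s, c1, c2)
    if best is None:                 # bootstrap sample had one unique value
        c = sum(y) / len(y)
        return lambda row: c
    _, s, c1, c2 = best
    return lambda row: c1 if row[j] <= s else c2

def random_forest(X, y, n_trees=50, seed=0):
    """Bagged stumps: each tree sees a bootstrap sample and one randomly
    chosen feature; the forest predicts the average of the tree outputs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        j = rng.randrange(p)                         # random feature choice
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], j))
    return lambda row: sum(t(row) for t in trees) / n_trees

X = [(0.0, 1.0), (1.0, 0.9), (2.0, 0.2), (3.0, 0.1)]
y = [0.0, 0.1, 0.9, 1.0]
predict = random_forest(X, y)
print(predict((0.5, 0.95)), predict((2.5, 0.15)))
```

Averaging over many trees grown on perturbed copies of the data is what stabilizes the prediction compared with a single tree.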

3.3.3 Boosting

Just as decision trees and random forests, boosting is applicable to both classification and regression problems. Boosting is an additive model that can be explained as combining several simple tree models, or base learners, that are serially chain-linked together into a more complex function. Each successive tree in the chain is optimized to predict more accurately the records from the previous link.


A GBM uses gradient descent as the method of minimizing the errors from the previous tree in the chain. Gradient descent is a first-order iterative optimization algorithm for finding a local minimum, here in function space. Similar to gradient descent in parameter space, at the $m$-th iteration the direction of steepest descent is given by the negative gradient of a loss function $L$ (see (3.18) for examples) [20]:

$$-g_m(x) = -\left[\frac{\delta L(Y, f(x))}{\delta f(x)}\right]_{f(x) = f_{m-1}(x)}. \qquad (3.12)$$

At each iteration, a regression tree model $\phi_m$ is fitted to predict the negative gradient. A squared error is used as a surrogate loss:

$$\phi_m = \arg\min_{\phi} \sum_{i=1}^{n} \big[(-g_m(x_i)) - \phi(x_i)\big]^2, \qquad (3.13)$$

where $\arg\min$ stands for the argument of the minimum, i.e. the value at which the objective attains its minimum. The step length in the direction of the negative gradient is determined by $\rho_m$:

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{n} L\big(y_i, f_{m-1}(x_i) + \rho\,\phi_m(x_i)\big). \qquad (3.14)$$

Multiplying the step length by a shrinkage factor or learning rate $\eta \in (0, 1)$ enhances the model's performance [20]. This gives the $m$-th prediction:

$$f_m(x) = \eta\,\rho_m \phi_m(x) + f_{m-1}(x), \qquad (3.15)$$

and the resulting model can be written as:

$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \eta\,\rho_m \phi_m(x), \qquad (3.16)$$

where $f_0$ is initialized using a constant:

$$f_0(x) = \hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} L(y_i, \theta). \qquad (3.17)$$
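The boosting loop (3.12)-(3.17) can be sketched for the squared-error loss, in which case the negative gradient is simply the residual and the line search (3.14) is absorbed into the fitted region means; all names and data below are illustrative:

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (the base learner phi_m) to the
    current negative gradient, as in (3.13)."""
    best = None
    for s in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= s]
        right = [r for x, r in zip(xs, residuals) if x > s]
        c1, c2 = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - c1) ** 2 for r in left) + sum((r - c2) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, s, c1, c2)
    _, s, c1, c2 = best
    return lambda x: c1 if x <= s else c2

def gbm(xs, ys, n_rounds=100, eta=0.1):
    """Gradient boosting with squared-error loss: the negative gradient
    (3.12) is the residual y - f(x); shrinkage eta as in (3.15)."""
    f0 = sum(ys) / len(ys)                 # constant initializer (3.17)
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # -g_m(x_i)
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + eta * sum(t(x) for t in trees)  # (3.16)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.1, 0.9, 1.0, 1.0]
f = gbm(xs, ys)
print(round(f(0.5), 2), round(f(4.5), 2))
```

Each round corrects what the ensemble so far gets wrong, while the small learning rate keeps any single tree from dominating.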


3.4 Model Complexity

The purpose of an ML model is to give sensible output when a new dataset is introduced. This is called generalization. The performance of the model relies on its ability to generalize, and there are techniques to evaluate this performance.

Figure 3.4.1: An example of test and training error as a function of model complexity [16].

Figure 3.4.1 illustrates the importance of assessing the ability of a learning method to generalize. In the case of a quantitative response, the loss function $L(Y, \hat{f}(X))$ for measuring errors between $Y$, a target variable, and $\hat{f}(X)$, a prediction model with vector inputs $X$, is given by:

$$L(Y, \hat{f}(X)) = \begin{cases} (Y - \hat{f}(X))^2 & \text{(squared error)} \\ |Y - \hat{f}(X)| & \text{(absolute error).} \end{cases} \qquad (3.18)$$

For a categorical response modelled by a probability $P_{\theta(X)}(Y)$ that depends on the predictor $X$, the loss function is

$$L(Y, \theta(X)) = -2 \cdot \log P_{\theta(X)}(Y). \qquad (3.19)$$

The factor $-2$ in the definition makes the log-likelihood loss for the Gaussian distribution match squared-error loss [16].

The average of the test error, or generalization error (dark line in Figure 3.4.1), is the prediction error over an independent test sample where $X$ and $Y$ are drawn randomly from their population. As the model becomes more complex, it uses the training data more and is able to adapt to more complicated underlying structures. Unfortunately, training error is not a good estimate of the test error, since it constantly decreases with model complexity; a model with zero training error is said to be overfit to the training data [16].

Overfitting, or high variance, is the case where the overall error is small but the generalization of the model is unreliable. This is due to the model learning from noise and inaccurate data entries, i.e. learning "too much" from the training data set. In general, nonparametric models such as decision trees have a tendency to overfit. In contrast, underfitting, i.e. high bias, is the case where the model has made too restrictive assumptions on a function in order to make it easier to learn. In this case the model cannot capture the underlying trend of the data; it has not "learned enough" from the training data, resulting in low generalization and unreliable predictions. In parametric models, where a model is described by a finite number of parameters (for example linear models), the risk of underfitting is higher. One can say that generalization is bounded by the two undesirable outcomes, high bias and high variance. Depending on the model, a performance that balances underfitting and overfitting is desired; this is called the bias-variance trade-off.

Model building consists of two related steps:

• Model selection - estimating the performance of different models in order to choose the best one.

• Model assessment - having chosen a final model, estimating its prediction error on new data.

3.5 Cross-validation

One of the methods for estimating prediction error is cross-validation. K-fold cross-validation uses one part of the available data to fit the model and a different part to test it. The data is randomly split into K equal-sized samples. Often K is chosen between 5 and 10. If the value of K is too low (below 5), the model can become biased. If the value of K is large, the bias will be low but the variance will increase.

For the models above the number of folds is chosen to be 5. Five equal-sized samples are generated using the 80/20 approach: 4 of these samples are used as training data and the remaining sample is used as validation data. This process is repeated so that each sample is used as validation data exactly once.

Figure 3.5.1: K-Fold Cross-Validation, k=5.

Alteryx offers an option to choose the number of trials. This option allows the user to choose the number of times the cross-validation procedure is repeated, in case the first random split of data is skewed in the folds. The folds are selected differently in each trial and the overall results are averaged across all the trials. In the models used in this study, the cross-validation procedure is repeated 3 times.
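The repeated K-fold scheme described above (5 folds, with the folds selected differently in each trial) might be sketched as follows; this is a plain, non-stratified illustration with invented names:

```python
import random

def kfold_indices(n, k=5, trials=3, seed=42):
    """Yield (trial, fold, train_idx, val_idx): the data is reshuffled in
    every trial and split into k folds; each fold serves once as
    validation data, the rest as training data."""
    for trial in range(trials):
        rng = random.Random(seed + trial)   # different folds per trial
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in range(k):
            val = folds[fold]
            train = [i for f, fs in enumerate(folds) if f != fold for i in fs]
            yield trial, fold, train, val

splits = list(kfold_indices(n=100))
print(len(splits))  # 5 folds x 3 trials = 15 splits
```

Averaging the validation metric over all 15 splits gives the repeated cross-validation estimate of the prediction error.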

Stratified cross-validation keeps the distribution of the target variable similar across folds, so that each fold is a good representative of the whole data set. It is recommended when the target variable is imbalanced, and it is used in the models above.

A seed is selected for every model; changing the seed changes the composition of the folds, while fixing it allows reproducibility of the results.

3.6 Metrics of Interest

Coefficient of Determination (R-Squared), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are based on two sums of squares: the Total Sum of Squares (TSS),

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad (3.20)$$

and the Residual Sum of Squares (RSS),

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - f(x_i))^2. \qquad (3.21)$$

Here $\bar{y}$ is the mean of the observed values $y_i$ and $f(x_i)$ is the fitted value for observation $x_i$.

TSS measures how far the data are from the mean, and RSS measures how far the data are from the model's predicted values. The difference between them is the improvement in prediction from the regression model, compared to the mean model.

3.6.1 R-Squared and Adjusted R-Squared

R-Squared is the proportional improvement in prediction from the regression model, compared to the mean model. It indicates the goodness of fit of the model. R-Squared is defined as:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}. \qquad (3.22)$$

R-Squared for linear regressions ranges from zero to one, with zero indicating that the proposed model does not improve prediction over the mean model, and one indicating perfect prediction. Improvement in the regression model results in proportional increases in R-Squared. One pitfall of R-Squared is that it can only increase as predictors are added to the regression model. This increase is artificial when the predictors are not actually improving the model's fit.

Adjusted R-Squared incorporates the model's degrees of freedom. Adjusted R-Squared will decrease as predictors are added if the increase in model fit does not make up for the loss of degrees of freedom. Likewise, it will increase as predictors are added if the increase in model fit is worthwhile. Adjusted R-Squared is used in this work, since it should always be used for models with more than one predictor variable [21]. It is interpreted as the proportion of total variance that is explained by the model. The relationship between R-Squared and Adjusted R-Squared is:

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - q}, \qquad (3.23)$$

where $n$ is the number of observations and $q$ is the number of coefficients in the model.

3.6.2

Root Mean Square Error

The Root Mean Square Error (RMSE) is a measure of the differences between predicted and observed values. RMSE is a non-negative value; in general a lower RMSE implies a more accurate fit.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - f(x_i))^2}{n}} \qquad (3.24)$$


3.6.3

Mean Absolute Error

The Mean Absolute Error (MAE) is a measure of errors between predicted and observed values and is calculated as:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - f(x_i)|}{n}, \qquad (3.25)$$

where $y_i$ is the observed value and $f(x_i)$ is the predicted value.

3.6.4

Pearson’s Correlation Coefficient

Pearson's Correlation Coefficient is represented by $r_{(y_i, f(x_i))}$ and may be referred to as the sample correlation coefficient. Given paired data $(y_1, f(x_1)), \ldots, (y_n, f(x_n))$ consisting of $n$ pairs, $r_{(y_i, f(x_i))}$ is defined as:

$$r_{(y_i, f(x_i))} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(f(x_i) - \bar{f})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (f(x_i) - \bar{f})^2}}, \qquad (3.26)$$

where $\bar{y}$ and $\bar{f}$ are the sample means of the observed and the predicted values.
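The evaluation metrics of this section can be sketched in a few lines. The following is an illustrative Python/NumPy implementation of equations (3.20)–(3.26), not the code used in the thesis (the study itself was carried out with R packages); the function name `evaluation_metrics` is an assumption for illustration.

```python
import numpy as np

def evaluation_metrics(y, y_hat, q):
    """Metrics of Section 3.6 for observed values y and predictions y_hat.

    q is the number of coefficients in the model, used by adjusted
    R-squared in eq. (3.23).
    """
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)                # Total Sum of Squares, eq. (3.20)
    rss = np.sum((y - y_hat) ** 2)                   # Residual Sum of Squares, eq. (3.21)
    r2 = 1.0 - rss / tss                             # R-squared, eq. (3.22)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - q)    # Adjusted R-squared, eq. (3.23)
    rmse = np.sqrt(rss / n)                          # Root Mean Square Error, eq. (3.24)
    mae = np.mean(np.abs(y - y_hat))                 # Mean Absolute Error, eq. (3.25)
    r = np.corrcoef(y, y_hat)[0, 1]                  # Pearson's r, eq. (3.26)
    return {"R2": r2, "R2_adj": r2_adj, "RMSE": rmse, "MAE": mae, "r": r}
```

Note that Adjusted R-Squared is always at most equal to R-Squared, and both are negative when the model fits worse than the mean model, which is visible in Tables 5.2.6 and 5.2.7.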

3.6.5

Root Node Error

The Root Node Error is the error of the trivial model that assigns the mean response to every observation, i.e. the resubstitution error at the root of the tree before any split is made. In rpart's output the errors of the fitted tree are reported relative to this quantity [23].

Chapter 4

Data and Model Preparation

4.1

Data Preparation

The primary data set consists of accounts collected in Italy during the time period 1994-2020. In this study only the observations of the 590 000 closed debts are used. The major part of these NPLs are defaulted debts with 100% LGD (LGD = 1). To reduce this imbalance, the dataset is split in a semi-random way.

Figure 4.1.1: Flowchart of Data Separation. N = 50% observations with LGD<1, R = proportion coefficient for ratio LGD = 1 vs. LGD < 1


CHAPTER 4. DATA AND MODEL PREPARATION

A share of the observations, containing debts with both LGD<1 and LGD=1, is selected for the test set. The remaining 80% of observations are left for the training set. Thereafter the training set is divided into two subsets. The first one is the balanced training set, where the number of observations with LGD<1 (N in Figure 4.1.1) is equal to the number of observations with LGD=1. The second one is the unbalanced training set, where the ratio (R in Figure 4.1.1) of observations with LGD<1 to LGD=1 is the same as in the "real life" data.

The selection of observations for the test set as well as for the two training subsets was done randomly without replacement with different random seeds. The difference in the mean square errors of the results was in the range of 10^-4 to 10^-6.
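The split described above can be illustrated as follows. This is a schematic Python/pandas sketch, not the exact procedure used on Hoist's data; the column name `lgd` and the function name `split_data` are assumptions for illustration.

```python
import pandas as pd

def split_data(df, test_frac=0.2, seed=42):
    """Split debts into a test set, a balanced and an unbalanced training set.

    The test set and the unbalanced training set keep the 'real life' ratio
    of LGD=1 to LGD<1 debts; the balanced training set downsamples the
    LGD=1 group so that both groups are equally large. All sampling is
    without replacement, as in the text.
    """
    full = df[df["lgd"] == 1.0]      # no recovery at all
    partial = df[df["lgd"] < 1.0]    # partial or full recovery

    # Stratified test set: the same fraction from both groups preserves R.
    test_full = full.sample(frac=test_frac, random_state=seed)
    test_partial = partial.sample(frac=test_frac, random_state=seed)
    test = pd.concat([test_full, test_partial])

    train_full = full.drop(test_full.index)
    train_partial = partial.drop(test_partial.index)

    # Unbalanced training set: all remaining debts, ratio R preserved.
    train_unbalanced = pd.concat([train_full, train_partial])

    # Balanced training set: as many LGD=1 debts as LGD<1 debts (N vs N).
    train_balanced = pd.concat([
        train_full.sample(n=len(train_partial), random_state=seed),
        train_partial,
    ])
    return test, train_balanced, train_unbalanced
```

Running the function with different seeds reproduces the sensitivity check mentioned above: different random draws of the subsets should change the resulting errors only marginally.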

Related studies have demonstrated that another important step in data pre-processing is variable selection [22]. The prediction performance of credit risk models can be significantly improved by choosing a subset of only relevant input attributes. It reduces the dimensionality of the feature space, which can improve model accuracy and reduce computational as well as acquisition costs [6].

4.2

Model Preparation

4.2.1

Decision Tree Model

Figure 4.2.1: The root node and first 4 nodes of the Decision Tree model.


[24]. The hyper-parameters used in the model customization are: the minimum number of records needed to allow a split (20), the minimum number of records allowed in a terminal node (7), the number of folds used in cross-validation to prune the tree (10), the maximum allowed depth of any node in the final tree (20), the number of cross-validation folds (5) and the number of cross-validation trials (3).

The number of folds used in cross-validation to prune the tree selects the number of groups the data is divided into for testing the model. A higher number improves accuracy but increases the run time. The maximum allowed depth of any node in the final tree sets the maximum number of levels of branches allowed from the root node to the most distant node; this limits the overall size of the tree. The root node is assigned depth = 0.
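The thesis fits the trees with R's rpart [23]. A roughly equivalent sketch using scikit-learn's `DecisionTreeRegressor` is shown below; the mapping of rpart's settings onto sklearn parameters is an assumption, and rpart's cross-validated cost-complexity pruning is approximated here by cross-validating over the tree's own pruning path.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the LGD training data (the real data is private).
X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

# Hyper-parameters mirroring the rpart settings in the text:
# min 20 records to allow a split, min 7 records per terminal node, max depth 20.
tree = DecisionTreeRegressor(
    min_samples_split=20,
    min_samples_leaf=7,
    max_depth=20,
    random_state=0,
)

# rpart prunes via cross-validated cost-complexity; approximate this by
# searching over the candidate ccp_alpha values of the full tree.
path = tree.cost_complexity_pruning_path(X, y)
search = GridSearchCV(
    tree,
    param_grid={"ccp_alpha": np.unique(np.clip(path.ccp_alphas, 0.0, None))},
    cv=5,  # 5 cross-validation folds as in the text
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
pruned_tree = search.best_estimator_
```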

4.2.2

Random Forest Model

The hyper-parameters used in the Random Forest models are the following: the number of trees (200) and the minimum number of records allowed in a tree node (5); the records for the creation of each model were selected with replacement and the percentage of the data sampled to create each tree was 50%.

Figure 4.2.2 shows the decrease of the loss function. Notice the "elbow": the decrease in the loss function plateaus after about 50 trees, so the number of trees could be reduced to increase the speed of the model.
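The random forest configuration above can be sketched with scikit-learn's `RandomForestRegressor` as follows; this is a Python translation under assumed parameter correspondence, not the R code used in the thesis.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the LGD training data.
X, y = make_regression(n_samples=400, n_features=10, noise=0.5, random_state=0)

# Hyper-parameters mirroring the text: 200 trees, at least 5 records per
# node, bootstrap sampling with replacement of 50% of the data per tree.
forest = RandomForestRegressor(
    n_estimators=200,
    min_samples_leaf=5,
    bootstrap=True,
    max_samples=0.5,
    random_state=0,
)
forest.fit(X, y)

# Impurity-based variable importance, the analogue of the Mean Decrease
# Gini / IncNodePurity reported in Figure 5.1.8.
importances = forest.feature_importances_
```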

4.2.3

Boosted Model

The model is customized as follows:

The loss function used in the models is the Gaussian, or Squared Error, loss. Since the error is squared, this loss function penalizes large errors more heavily.

Figure 4.2.2: Percentage Error for Different Number of Trees in Random Forest model.

The total number of trees used in the boosted models is 50000. The number of trees selected by cross-validation could be lower than 50000, which would decrease the running time.

5-fold cross-validation is chosen as the method to determine the final number of trees in the model. Cross-validation prevents overfitting, but the downside is that it takes a lot of processing power.

The Gradient Boosting Machine uses gradient descent and therefore the learning rate, or "shrinkage factor", has to be set. As a general rule, the smaller the learning rate, the more accurate the model will be; on the other hand, the smaller the learning rate, the longer it takes to reach the optimal model. The shrinkage factor in this study is set to 0.0020.

Another parameter is the interaction depth. The interaction depth, or the maximum number of nodes per tree, is the number of splits that is performed on a tree, starting from a single node, and it determines how big each tree will be. Each split increases the total number of nodes by 3 and the number of terminal nodes by 2. The models in this study have interaction depth = 1, called an additive model, so the total number of nodes in each tree will be 4 and the number of terminal nodes will be 3.

The last hyper-parameter is the minimum number of training records that each terminal node should have for the tree to be included in the model. When a tree is created it is first split into right and left nodes, and each split should have at least this minimum number of records before the node is approved by the model. A higher value reduces the number of tree splits and possibly the number of trees; it generalizes the model better and prevents it from overfitting. The minimum required number of records used in the models is 10.

For more details see the work of Friedman J. H. [20].


Chapter 5

Results

In this chapter, the results of the three models described in Chapter 4 are presented and analyzed.

5.1

Impact of Variables

5.1.1

Variables Used

Variable                  | Description
AgeAtStart                | Customer's age when the debt was started
Client                    | The vendor from which Hoist has acquired the portfolio
DebtStartupYear           | The year the debt was started
DebtTypeDebt              | Type of the debt on debt level
DebtTypePortfolio         | Type of the debt on portfolio level
Freq24M                   | Number of payments made during the first 24 months
Gender                    | Customer's gender
IsComment                 | A flag if the debt has been commented by an employee
IsIncomingCallEmail       | A flag if the customer has contacted Hoist by phone or email
IsInfoConfirmed           | A flag if information about the customer is confirmed
IsItalian                 | A flag if the customer's nationality is Italian
IsLegal                   | A flag if the debt is involved in any legal action
IsPayer                   | A flag if the customer has paid at least once during the first 24 months
IsPaymentPlan             | A flag if a payment plan has been created
IsSettlementAuthorization | An agreement proposed to the customer that describes the way the agreed amount will be paid
IsToBeContacted           | A flag if Hoist should contact the customer
IsUnpaidUtility           | A flag if the debt is an unpaid utility
OriginalTotalClaimAmount  | The total claim amount when the debt was started
Paid24M                   | The total amount paid during the first 24 months
PaymentPlanType           | Type of the payment plan


CHAPTER 5. RESULTS

Figure 5.1.1: Colour scale of LGD values

The following 13 variables were selected for the reduced model: AgeAtStart, Client, DebtStartupYear, DebtTypeDebt, DebtTypePortfolio, Freq24M, IsLegal, IsPayer, IsPaymentPlan, IsSettlementAuthorization, OriginalTotalClaimAmount, Paid24M and PaymentPlanType.

5.1.2

Variables Versus Loss Given Default

In this section the relation between the independent variables during the first 24 months and the LGD calculated after the first 24 months is presented. The bar charts in this section show the ratio between customers that after 24 months have LGD = 0 (purple) and LGD = 1 (yellow); for the colour scale see Figure 5.1.1. There is a difference in the ratio depending on whether the customer has paid at least once during the first 24 months, see Figure 5.1.2 (a) and Figure 5.1.2 (b). Whether Hoist has confirmed information about the customer, Figure 5.1.3 (a), and whether the customer has agreed to proceed with a payment plan, Figure 5.1.3 (b), are two variables that show a significant decrease in LGD. Also the type of the debt, Figure 5.1.4 (a), Is To Be Contacted, Figure 5.1.4 (b), Is Legal, Figure 5.1.4 (c), and Is Settlement Agreement, Figure 5.1.4 (d), show an impact on LGD.


Figure 5.1.2: Variables versus LGD24. (a) IsPayer versus LGD; (b) Freq24M versus LGD. No = 0, Yes = 1.


(a) IsToBeContacted versus LGD; (b) IsLegal versus LGD


(a) IsUnpaidUtility versus LGD; (b) IsComment versus LGD


Figure 5.1.6: Debt Startup Year versus LGD.

Types of payment plans and the vendors from whom Hoist is buying portfolios are important variables but are not presented in the figures. Parameters with continuous outcomes, such as the amount paid and the original total claim amount, are also important but not presented in the figures.

The variables that do not show a considerable difference in LGD are IsUnpaidUtility, IsComment, IsItalian and Gender, see Figure 5.1.5.

5.1.3

Variable Importance Assigned by Models


Figure 5.1.7: Variable importance in percent as assigned by the Decision Tree Model. The left panel shows the Balanced Model; the right panel shows the Unbalanced Model.

The variable importance assigned by the models also shows that the parameters that do not significantly influence the outcome in this study are gender, the age of the customer, whether the nationality of the customer is Italian and whether the debt is an unpaid utility.

5.2

Model Evaluation

Balanced models were evaluated on the balanced training set, and unbalanced models were evaluated on the unbalanced training set, where the proportion of observations with LGD=1 and LGD<1 is preserved.

5.2.1

Model Performance

Decision Tree Model


Figure 5.1.8: Mean Decrease Gini (IncNodePurity) of variables as assigned by the Random Forest Model. The left panel shows the Balanced Model; the right panel shows the Unbalanced Model.

The root node error is lower for the unbalanced models and is not impacted by the number of variables used in the model construction.

                               Balanced 20   Unbalanced 20   Balanced 13   Unbalanced 13
Number of variables            20            20              13            13
Variables actually used
in tree construction           15            16              11            12
Root Node Error                0.23491       0.0588          0.23491       0.0588
Correlation                    0.934927      0.919105        0.930666      0.918914
Adjusted R-Squared             0.8740        0.8447          0.8661        0.8443
RMSE                           0.171981      0.095543        0.177326      0.095651
MAE                            0.072596      0.020077        0.076437      0.020078

Table 5.2.1: Decision Tree Model Performance, 20 variables versus 13 variables.

Random Forest Model

Table 5.2.2 shows that the model with the lowest RMSE and MAE scores is the unbalanced model with 20 variables. As above, the correlation between the predicted and actual values and the adjusted R-squared value are higher for the balanced model with 20 variables.

                               Balanced 20   Unbalanced 20   Balanced 13   Unbalanced 13
Number of variables            20            20              13            13
Number of variables tried
at each split                  6             6               4             4

Table 5.2.2: Random Forest Model Performance, 20 variables versus 13 variables.

Boosted Model

The results of the model performance of the boosted model, presented in Table 5.2.3, show that the model with both the lowest RMSE and MAE is the unbalanced model with 20 variables. Reducing the number of variables improves the model performance in terms of correlation and adjusted R-squared

                               Balanced 20   Unbalanced 20   Balanced 13   Unbalanced 13
Number of variables            20            20              13            13
Total number of trees used     50000         50000           50000         50000
Best number of trees based
on 5-fold cross-validation     49986         49998           49994         49998
Correlation                    0.9187        0.876317        0.919188      0.875177
Adjusted R-Squared             0.84394       0.767835        0.844850      0.765822
RMSE                           0.1911        0.116834        0.190873      0.11732
MAE                            0.101902      0.034983        0.101935      0.035189

Table 5.2.3: Boosted Model Performance, 20 variables versus 13 variables.

value. They are higher for the balanced boosted model with 13 variables.

5.2.2

Generalization

General Test Set

To find the model that generalizes best, the 6 models with 20 variables were run on the test set and compared. Table 5.2.4 presents the results of the generalization. As one can see, all metrics, i.e. the correlation, adjusted R-squared, RMSE and MAE, show that the Random Forest Unbalanced model outperforms the other models.

Model                      Correlation   Adjusted R-Squared   RMSE       MAE
Decision Tree Balanced     0.864219      0.708047             0.130842   0.026643
Decision Tree Unbalanced   0.909486      0.827059             0.100702   0.021151
Random Forest Balanced     0.886672      0.763079             0.117867   0.02469
Random Forest Unbalanced   0.920981      0.847925             0.094432   0.020783
Boosted Model Balanced     0.835789      0.640546             0.145182   0.053788
Boosted Model Unbalanced   0.874866      0.765289             0.117316   0.035216

Table 5.2.4: Results of Model Assessment using 20 variables.

The generalization of the models with 13 variables also shows that the random forest unbalanced model performs better in all aspects compared to the other models, see Table 5.2.5:


Model                      Correlation   Adjusted R-Squared   RMSE       MAE
Decision Tree Balanced     0.857259      0.692292             0.1326     0.027767
Decision Tree Unbalanced   0.908847      0.825855             0.101052   0.021228
Random Forest Balanced     0.883053      0.754982             0.119864   0.02544
Random Forest Unbalanced   0.917971      0.842404             0.096131   0.021513
Boosted Model Balanced     0.833299      0.634582             0.146381   0.054343
Boosted Model Unbalanced   0.874161      0.764019             0.117633   0.03542

Table 5.2.5: Results of Model Assessment using 13 variables.

Figure 5.2.1: The left panel shows predicted values versus actual values in the Decision Tree Balanced Model; the right panel shows predicted values versus actual values in the Decision Tree Unbalanced Model. 20 variables.

The figures below show the predicted values plotted against the actual values. Figure 5.2.1 shows the results of the decision tree models, Figure 5.2.2 of the random forest models and Figure 5.2.3 of the boosted models.

Observe that the colour coding shows that the point (1,1) is overpopulated compared to the rest of the field and therefore also well predicted. Notice also that the boosted model predicts values of LGD outside the range of the training domain. This could be because each subsequent tree after the first iteration is based on predicting the error of the previous tree; while the initial tree is restricted to the training domain, the summing across gradient boosted trees is not.


Figure 5.2.2: The left panel shows predicted values versus actual values in the Random Forest Balanced Model; the right panel shows predicted values versus actual values in the Random Forest Unbalanced Model. 20 variables.


Model                      Correlation   Adjusted R-Squared   RMSE       MAE
Decision Tree Balanced     0.520093      0.259037             0.221671   0.32687
Decision Tree Unbalanced   0.38409       -0.225125            0.285037   0.193715
Random Forest Balanced     0.579994      0.328764             0.210984   0.126213
Random Forest Unbalanced   0.443318      -0.102156            0.270354   0.190938
Boosted Model Balanced     0.392069      0.144517             0.238186   0.161884
Boosted Model Unbalanced   0.210636      -0.694954            0.335266   0.273061

Table 5.2.6: Results of Model Assessment using 20 variables after removing observations where LGD = 1 from the Test set.

Model                      Correlation   Adjusted R-Squared   RMSE       MAE
Decision Tree Balanced     0.479363      0.218559             0.227646   0.139583
Decision Tree Unbalanced   0.377171      -0.226741            0.285225   0.192948
Random Forest Balanced     0.551189      0.295670             0.216122   0.12988
Random Forest Unbalanced   0.413158      -0.152702            0.276484   0.196281
Boosted Model Balanced     0.39162       0.144157             0.238236   0.162272
Boosted Model Unbalanced   0.213478      -0.688879            0.334665   0.274975

Table 5.2.7: Results of Model Assessment using 13 variables after removing observations where LGD = 1 from the Test set.

Such adjustments would not change the overall results of the model comparison and therefore the results with no adjustments were kept.

Reduced Test Set

To observe how well the models predict the LGD values that are not equal to 1, i.e. when the customer is paying something, the observations with actual LGD equal to 1 were removed from the Test set. The models were run again and the results are presented in Table 5.2.6. Notice that all models show a decrease in performance and that the model that now generalizes best is the random forest balanced model. The same pattern is observed when comparing the models with 13 variables, see Table 5.2.7.


Figure 5.2.4: After removing test observations where actual LGD = 1. The left panel shows predicted values versus actual values in the Decision Tree Balanced Model; the right panel shows predicted values versus actual values in the Decision Tree Unbalanced Model. 20 variables.


Chapter 6

Conclusions

6.1

Discussion

Credit risk modelling uses empirical models to support decision making. Modelling EAD and LGD holds a particular interest for credit market participants, as these parameters are used to calculate capital requirements and RWA. In addition, both parameters serve as important inputs into several models, including stress testing and economic capital models [6]. Predicting LGD is challenging because it does not follow the normal distribution: most of the defaulted loans are either not recovered at all or fully recovered [5].

Pre-Analysis of Data

Hoist's database has a vast number of variables available for each debt observation. Therefore, choosing the variables for the models was first done manually using the in-house expertise. To further cut down the number of variables to be used in this thesis work, each variable's impact on LGD was analyzed; some of these variables were presented graphically in Chapter 5. The models evaluated in this thesis work are tree-based models and they all have a built-in variable selection algorithm. It is interesting to mention some "manual" findings and compare them to the results of the variable selection algorithms.


CHAPTER 6. CONCLUSIONS

The most important variable according to the models was whether the customer has paid at least once since the debt was started at Hoist. Other important parameters were found to be the vendor of the portfolio (Client), the total amount paid during the first 24 months (Paid24M) and the frequency of payments during the first 24 months (Freq24M). The model-assigned importance also showed that the parameters that did not significantly influence the outcome in this study were gender, the age of the customer, whether the nationality of the customer is Italian and whether the debt is an unpaid utility.

Most of the variables that were found to be significant in the graphical pre-modelling analysis of the data were also found to be significant when the importance was assigned by the models.

Model Performance

In this thesis work three machine learning methods were evaluated and compared: decision trees, random forests and boosted models.

The first model evaluated was the decision tree model. The correlation and the adjusted R-squared were slightly higher for the model with 20 variables used on a balanced training dataset than for all other models. The RMSE and MAE scores showed that the model with 13 variables built on an unbalanced training set outperformed the balanced models. The root node error was also lower for the unbalanced models.

The second model evaluated was the random forest model. Just as in the decision tree case, the correlation and the adjusted R-squared attained their highest values when the model with 20 variables was used with the balanced training dataset, while the RMSE and MAE scores were lower for the model with 20 variables run on an unbalanced training dataset.

The third model evaluated was the boosted model. This model showed the highest correlation and adjusted R-squared when 13 variables were used on a balanced training dataset and lower RMSE and MAE values when the model with 20 variables was trained on an unbalanced training dataset.

In general, the balanced models performed better in terms of correlation and adjusted R-Squared, and the unbalanced models in terms of RMSE and MAE.

Model Assessment

Since the data is highly unbalanced and LGD = 1 (no recovery) is dominant, two test sets were created: one where the ratio of LGD = 1 to LGD < 1 is proportional to Hoist's data, and one where the observations with LGD = 1 were removed. All models were run on these two sets.

The random forest unbalanced models showed higher correlation and adjusted R-squared scores as well as lower RMSE and MAE scores when the models were run on the test data with all observations, i.e. where LGD = 1 was included.

When the observations with LGD = 1 were removed, the random forest balanced model outperformed the other five models. Using this test set the performance of all models decreased significantly. Still, the figures of predicted LGD values versus actual values show that predictions of LGD = 0 are accurate.

In this study the boosted models showed the weakest performance in terms of correlation, adjusted R-Squared, RMSE and MAE. This could be explained by the boosted algorithm's additive characteristics, where it successively fits regression trees to the residuals of the previous stage: while the initial tree is restricted to the training domain, the summing across gradient boosted trees is not.

Therefore a conclusion can be drawn that the random forest models predict LGD = 1 very well, i.e. if the customer debt will not be recovered, and LGD = 0, i.e. if the customer debt will be fully recovered, and that the models could be improved for the LGD values in the interval 0 < LGD < 1, i.e. for the customers that might recover a part of the debt.

Another aspect worth considering is computation time: the computation time for the random forest models was significantly longer (20:1) than the time it took to run the decision tree models. Therefore one should consider using decision tree models, since the computation cost will decrease.

Interpretability is also considered an important factor when creating machine learning models, and it is an important requirement by regulators such as the EBA [17]. This means that, both from a management and a regulatory perspective, simple decision tree models would be better suited than ensemble methods like random forests and boosted methods for the prediction of LGD values.

6.2

Future Work

As mentioned before, all three models evaluated in this thesis performed well in predicting the values of LGD that were equal to 0 or 1. The models could be improved, or other ML models could be evaluated as suggested in [10], to increase the performance in the interval 0 < LGD < 1.

Default values of the hyper-parameters were used in all ML models evaluated in this study. Further investigation using several hyper-parameter settings could be done to see whether the performance of the models would increase.

Further studies could be conducted by building two-stage models as suggested by [5], [6] and [10], since the two-stage models developed in the literature have the advantage of addressing the problem of how to model the extreme cases concentrating on the boundaries at 0 and 1. In the two-stage model, first a classification algorithm is used to discriminate cases with LGD rates equal to 0 and 1. A study by Lessmann, Baesens, Seow and Thomas [15], where 41 classifiers are compared, could be a benchmark for this work. Thereafter a supervised regression model is applied to predict the values LGD ∈ (0, 1). Here the work of

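The two-stage idea sketched above can be illustrated as follows. This is a minimal, hypothetical Python sketch on synthetic data, assuming a binary first stage that flags full-loss debts (LGD = 1) and a regression second stage for the remaining debts; it is not the method of [5], [6] or [10] in detail, only an outline of the structure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: most debts end at LGD = 1, the rest lie in [0, 1).
n = 1000
X = rng.normal(size=(n, 5))
is_full_loss = (X[:, 0] + rng.normal(scale=0.5, size=n)) > 0
lgd = np.where(is_full_loss, 1.0, np.clip(0.5 + 0.3 * X[:, 1], 0.0, 0.99))

# Stage 1: classify whether a debt ends at the boundary LGD = 1.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, lgd == 1.0)

# Stage 2: regression model trained only on debts with LGD < 1.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X[lgd < 1.0], lgd[lgd < 1.0])

def predict_two_stage(X_new):
    """Predict LGD: 1 where stage 1 flags a full loss, else the stage-2 estimate."""
    full_loss = clf.predict(X_new)
    estimate = reg.predict(X_new)
    return np.where(full_loss, 1.0, estimate)

preds = predict_two_stage(X)
```

The design choice is that the heavily overpopulated boundary case LGD = 1 never contaminates the regression stage, which addresses exactly the interval 0 < LGD < 1 where the models in this thesis were weakest.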

Bibliography

[1] Implementing Basel III in Europe. 2020. URL: https://eba.europa.eu/regulation-and-policy/implementing-basel-iii-europe (accessed: 30.07.2020).
[2] Bank for International Settlements. 2020. URL: https://www.bis.org/about (accessed: 01.06.2020).
[3] McNeil A., Rüdiger F., and Embrechts P. Quantitative Risk Management. Princeton Series in Finance. Princeton University Press, 2015. ISBN: 9780691166278.
[4] Tong E., Mues Ch., and Brown I. "Exposure at default models with and without the credit conversion factor". In: European Journal of Operational Research 252 (2016), pp. 910–920. DOI: http://dx.doi.org/10.1016/j.ejor.2016.01.054.
[5] Loterman G., Brown I., Martens D., Mues Ch., and Baesens B. "Benchmarking regression algorithms for loss given default modelling". In: International Journal of Forecasting 28 (1 2011), pp. 161–170. DOI: https://doi.org/10.1016/j.ijforecast.2011.01.006.
[6] Papouskova M. and Hajek P. "Two-stage consumer credit risk modelling using heterogeneous ensemble learning". In: Decision Support Systems 322 (2019), pp. 33–45. DOI: https://doi.org/10.1016/j.dss.2019.01.002.
[7] Abellán J. and Mantas C. J. "Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring". In: Expert Systems with Applications (2014), pp. 3825–.
[8] Vanderheyden B. and Priestley J. Logistic Ensemble Models. 2018. URL: https://arxiv.org/abs/1806.04555v1 (accessed: 01.06.2020).
[9] Estimation of Loss Given Default Distributions for Non-Performing Loans Using Zero-and-One Inflated Beta Regression Type Models. 2020. URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-273593.
[10] Yao X., Crook J., and Andreeva G. "Enhancing two-stage modelling methodology for loss given default with support vector machines". In: European Journal of Operational Research (2017), pp. 679–689. DOI: http://dx.doi.org/10.1016/j.ejor.2017.05.017.
[11] EBA Report on NPLs 2019 Interactive Tool. 2019. URL: https://tools.eba.europa.eu/interactive-tools/2019/powerbi/npl19_visualisation_page.html (accessed: 30.07.2020).
[12] EBA report on NPLs, Progress Made and Challenges Ahead. 2019. URL: https://eba.europa.eu/risk-analysis-and-data/risk-assessment-reports (accessed: 27.07.2020).
[13] Belotti T. and Crook J. "Loss given default models incorporating macroeconomic variables for credit cards". In: International Journal of Forecasting 28 (1 2012), pp. 171–182. DOI: https://doi.org/10.1016/j.ijforecast.2010.08.005.
[14] Leow M., Mues C., and Thomas L. "The economy and loss given default: evidence from two UK retail lending data sets". In: Journal of Operational Research 65 (3 2014), pp. 363–375. URL: https://www-jstor-org.focus.lib.kth.se/stable/24502085.
[15] Lessmann S., Baesens B., Seow H., and Thomas L. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research". In: European Journal of Operational Research 247 (2015), pp. 124–136. DOI: https://doi.org/10.1016/j.ejor.2015.05.030.
[16] Hastie T., Tibshirani R., and Friedman J. The Elements of Statistical Learning. Springer Series in Statistics. Springer Science + Business Media.
[17] Guidelines on PD estimation, LGD estimation and treatment of defaulted assets. 2020. URL: https://eba.europa.eu/regulation-and-policy/model-validation/guidelines-on-pd-lgd-estimation-and-treatment-of-defaulted-assets (accessed: 01.06.2020).
[18] James G., Witten D., Hastie T., and Tibshirani R. An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics. Springer Science + Business Media, 2013. ISBN: 9781461471370.
[19] Breiman L. "Random Forests". In: Machine Learning 45 (2001), pp. 5–32. DOI: https://doi.org/10.1023/A:1010933404324.
[20] Friedman J. H. Greedy Function Approximation: A Gradient Boosting Machine. URL: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf (accessed: 30.07.2020).
[21] Grace-Martin K. Assessing the Fit of Regression Models. URL: https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/ (accessed: 13.07.2020).
[22] Alaraj M. and Abbod M. "A new hybrid ensemble credit scoring model based on classifiers consensus system approach". In: Expert Systems with Applications 64 (2016), pp. 36–55. DOI: https://doi.org/10.1016/j.eswa.2016.07.017.
[23] Therneau T., Atkinson B., and Ripley R. Package 'rpart'. URL: https://cran.r-project.org/web/packages/rpart/rpart.pdf (accessed: 27.07.2020).
[24] Breiman L., Friedman J., Olshen R., and Stone C. Classification and Regression Trees. Wadsworth statistics/probability series.
