
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Statistical Methods for Analysis of the Homeowner's Impact on Property Valuation and Its Relation to the Mortgage Portfolio

CLARA HAMELL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Statistical Methods for Analysis of the Homeowner's Impact on Property Valuation and Its Relation to the Mortgage Portfolio

CLARA HAMELL

Degree Projects in Financial Mathematics (30 ECTS credits)
Degree Programme in Applied and Computational Mathematics
KTH Royal Institute of Technology, year 2020

Supervisor at Swedbank: Felix Bogren
Supervisor at KTH: Boualem Djehiche
Examiner at KTH: Boualem Djehiche


TRITA-SCI-GRU 2020:380
MAT-E 2020:087

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

The current method for house valuations in mortgage portfolio models corresponds to applying a residential property price index (RPPI) to the purchasing price (or last known valuation). This thesis introduces an alternative house valuation method, which combines the current one with the bank's customer data. This approach shows that the gap between the actual house value and the current estimated house value can to some extent be explained by customer attributes, especially for houses where the homeowner is a defaulted customer. The inclusion of customer attributes can both reduce false overestimation and predict whether the current valuation is an overestimation or an underestimation. This property is of interest in credit risk, as false overestimations can have negative impacts on the mortgage portfolio.

The statistical methods that were used in this thesis were the data mining techniques regression and clustering.


Sammanfattning

Title: Statistical Methods for Analysis of the Homeowner's Impact on the House Value and Its Relation to the Mortgage Portfolio

The models and approaches currently used for house valuation in the mortgage portfolio are based on house price indexation and the purchase price. This study introduces an alternative way of estimating the house value, by combining the current method with the bank's own customer data. This approach shows that the gap between the actual and the estimated house value can to some extent be explained by customer data, especially where the homeowner is a defaulted customer. Including customer data can both reduce the current overestimation and predict whether the current estimate is an overestimation or an underestimation. For defaulted customers, the alternative house valuation gave a more truthful estimate of the sell price than the traditional method. This property is of interest in credit risk, as a false overestimation can have negative consequences for the mortgage portfolio, especially for defaulted customers. The statistical tools used in this study were various regression methods and cluster analysis.


Acknowledgments

I would like to express my deepest gratitude to the Group Risk unit at Swedbank for giving me the opportunity to write this master's thesis. In particular, I would like to thank Felix Bogren, whose support and guidance have been invaluable throughout the entire process.

Moreover, I would like to thank Boualem Djehiche, my supervisor at KTH, for his feedback and advice.

Finally, my sincerest appreciation to my family and my friends, who have continuously supported me in my studies at KTH and École Centrale Paris.

Stockholm, January 2021
Clara Hamell


Contents

1 Introduction
  1.1 General background
  1.2 Purpose
  1.3 Research questions
  1.4 Delimitations
  1.5 Structure of the thesis

2 Background and Literature studies
  2.1 RPPI methods
  2.2 SPAR
  2.3 Mortgage portfolio and LTV
  2.4 Literature studies
    2.4.1 Mortgage portfolio and house valuations
    2.4.2 Evaluation of SPAR

3 Mathematical Methods
  3.1 Regression techniques
    3.1.1 OLS Multiple Linear Regression
    3.1.2 Multiple Linear Regression with Shrinkage
    3.1.3 Regression Tree
    3.1.4 Model Validation for Prediction Methods
    3.1.5 Logistic Regression
    3.1.6 Discriminant Analysis
    3.1.7 Model Validation Classifiers
  3.2 Clustering: K means
    3.2.1 Finding the k value

4 Data collection and construction
  4.1 House prices
  4.2 Customer data
    4.2.1 Time dependent attributes
  4.3 Data treatment
    4.3.1 Data sets: for training and testing

5 Model Design

6 Results and Analysis
  6.1 Prediction models for Ycorr
    6.1.1 Model: Multiple Linear Regression OLS
    6.1.2 Model: Multiple Linear Regression, Shrinkage
    6.1.3 Model 3: Random Forests
    6.1.4 Model 4: Clustering with OLS
    6.1.5 Comparative analysis: Predictors
  6.2 Classification models for Yind
    6.2.1 Model: Logistic regression
    6.2.2 Model: Linear Discriminant Analysis
    6.2.3 Comparative analysis: Classifiers

7 Discussion and conclusion
  7.1 Discussion
  7.2 Conclusion

References

A Variable inspection
B Multiple linear regression OLS
C Random Forest
D K-means, OLS


List of Figures

2.1 $\beta_s$ as a function of the property value and the customer's exposure.
3.1 Regression tree with three regions or leaves and two splits or internal nodes.
4.1 Visualisation of data storing.
4.2 Construction of the data set containing house prices.
4.3 Time dependent attributes. The period considered for each customer is the time period the customer resided in the house.
4.4 Density plot of the entire house price population. The filled area represents 98% of the population.
4.5 Density plot of the defaulted house price population. The filled area represents 98% of the population.
5.1 Empiric distribution of the ratio sell price to valuation.
6.1 The density plots of the OLS predicted house prices, the actual house prices and the RPPI valuation for the test set.
6.2 $I \cdot P / S$ (current) and $I \cdot P \cdot Y_{OLS} / S$ (OLS).
6.3 Optimal tuning parameter for the shrinkage method. The optimal value is indicated to be in between the dotted vertical lines.
6.4 The density plots of the predicted prices and actual prices using the Ridge model for the test sets.
6.5 $I \cdot P / S$ compared to $I \cdot P \cdot Y_{Ridge} / S$.
6.6 The density plots of the predicted prices and actual prices using the Lasso method for the test sets.
6.7 $I \cdot P / S$ compared to $I \cdot P \cdot Y_{lasso} / S$.
6.8 Training error with respect to the number of trees in the forest.
6.9 The density plot of the predicted house prices: Random Forests, RPPI and actual sell price.
6.10 The relation $I \cdot P \cdot Y_{tree} / S$ for the two test sets compared to $I \cdot P / S$.
6.11 Two methods to determine the appropriate value for k.
6.12 $I \cdot P / S$ compared to $I \cdot P \cdot Y_{Cluster1} / S$ for Cluster 1.
6.13 $I \cdot P / S$ compared to $I \cdot P \cdot Y_{Cluster2} / S$ for Cluster 2.
6.14 $I \cdot P / S$ compared to $I \cdot P \cdot Y_{Cluster3} / S$ for Cluster 3.
6.15 Model error and false prediction error for overestimation, logistic model.
6.16 Confusion matrices displaying the predicted outcomes and the actual outcomes for the logistic model.
6.17 Model error and false prediction error for overestimation, LDA model.
6.18 Confusion matrices displaying the predicted outcomes for the LDA model and the actual outcomes.
6.19 Model error and false prediction error for slight overestimation and extreme overestimation, LDA model.
6.20 Confusion matrices displaying the predicted outcomes for the LDA model and the actual outcomes, with three response classes.
6.21 Model error and false prediction error for slight overestimation and extreme overestimation, QDA model.
6.22 Confusion matrix displaying the predicted outcomes for the QDA model and the actual outcomes, with three response classes.

A.1 Purchase
A.2 Age and Time
A.3 Debt
A.4 Rate of Change, Debt
A.5 Income
A.6 Rate of Change, Income
A.7 Delayed Payments
A.8 Metrics for Delayed Payments
A.9 Account Balance
A.10 Rate of Change, Account Balance
A.11 D/I
A.12 Rate of Change, D/I

B.1 Residual analysis for linear model with original data set
B.2 Residual analysis for linear model with all predictors, with transformed data
B.3 Number of optimal variables to be selected
B.4 Variable selection

C.1 Caption

D.1 Residuals Cluster 1
D.2 Residuals Cluster 2
D.3 Residuals Cluster 3
D.4 Variables Cluster 1
D.5 Variables Cluster 2
D.6 Cluster 1 and Cluster 2
D.7 Cluster 1 and Cluster 2


Chapter 1

Introduction

This chapter will give an overview of the problem that was studied in this thesis.

The general background to the problem is described and followed by the problem formulation, purpose, research questions and delimitations. At the end of the introduction, the rest of the structure of the thesis is described.

1.1 General background

The Swedish households’ total debt is currently worth 4 244 billion SEK and mortgages account for 82%[1] of the debt. Swedbank has for many years had approximately 24-27% [2] of the Swedish mortgage market, making it the bank with the largest part of the market. This position comes with a risk as the households’

debts correspond to an exposure of the bank’s assets. Defaulted customers constitute the risk associated with mortgage loans. The most common definition of a defaulted customer is a customer that is more than 90 days late with their payment on their credit obligations [3]. This means that they have difficulties repaying their debt to the bank. However, there are other scenarios that can cause a customer to default such as bankruptcy or debt restructuring. As a last resort, a defaulted customer can be forced to sell their property, either voluntarily or by foreclosure. It is during a sale, particularly one under distress, that the bank’s exposure can become a realised loss.

The bank uses the measure "Loan to Value" (LTV) in order to calculate its exposure for each customer. The LTV is simply the ratio between the size of the mortgage and the price or value of the property [4]. The current LTV is based on an estimate of the property's sell price at the present time. The bank estimates the value of the property by applying a residential property price index (RPPI) to the original purchase


price [4],

$$\mathrm{LTV} = \frac{\text{mortgage}}{\text{estimated sell price}} = \frac{\text{mortgage}}{\mathrm{RPPI} \cdot \text{purchase price}}. \quad (1.1)$$

There are several different ways to construct an RPPI and these will be presented in Chapter 2. However, none of these standard indices takes into account the homeowner's impact on the property value [5].

Given a correct LTV, the bank can model and predict the losses of the mortgage portfolio and act thereafter. The bank's problem is that the LTV for defaulted customers is seldom accurate, as the underlying estimate of the property value often overvalues the true sell price when the sale is conducted under distress. Previous studies from the United States show, for instance, a 27% discount relative to the estimated sell price for distressed sales [6]. This leads to a faulty LTV, which in turn affects the computation of Loss Given Default and Expected Loss. The bank is therefore interested in a closer study of the collateral valuations, market value indexations and the corresponding LTV.

The Expected Loss (EL) of a mortgage portfolio is calculated as [7]

$$\mathrm{EL} = \mathrm{PD} \cdot \mathrm{EAD} \cdot \mathrm{LGD}, \quad (1.2)$$

where PD is the Probability of Default, EAD is the Exposure At Default and LGD is the Loss Given Default. The bank includes the LTV, and thus the estimated house price, in the LGD calculations. Every bank needs to safeguard itself against this loss by allocating capital, and it is furthermore required by law to identify, calculate and control its risks [8].
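To make the relations in (1.1) and (1.2) concrete, the minimal sketch below computes an indexed valuation, the LTV and the expected loss for a single illustrative mortgage. All figures and function names are invented for illustration and do not represent the bank's actual models.

```python
# A minimal illustration of equations (1.1) and (1.2); all figures are invented.

def ltv(mortgage: float, purchase_price: float, rppi: float) -> float:
    """Loan to Value with the RPPI-indexed purchase price as the estimated sell price."""
    estimated_sell_price = rppi * purchase_price
    return mortgage / estimated_sell_price

def expected_loss(pd_: float, ead: float, lgd: float) -> float:
    """Expected Loss as the product PD * EAD * LGD."""
    return pd_ * ead * lgd

if __name__ == "__main__":
    mortgage = 2_000_000        # SEK, outstanding loan
    purchase_price = 2_500_000  # SEK, price paid at the start of the agreement
    rppi = 1.15                 # index level relative to the purchase date

    current_ltv = ltv(mortgage, purchase_price, rppi)
    print(f"LTV = {current_ltv:.2%}")

    # Hypothetical risk parameters; in practice the LGD is itself a function of the LTV.
    el = expected_loss(pd_=0.02, ead=mortgage, lgd=0.25)
    print(f"EL  = {el:,.0f} SEK")
```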

1.2 Purpose

As can be understood from the background, there is an uncertainty regarding the estimated sell price and its impact on the mortgage portfolio, particularly concerning defaulted customers' properties, which the current RPPI is unable to capture and describe in a satisfactory manner. The existing methods evaluate the properties' characteristics and the geographical area in various different ways [5]. However, it is not an established practice to include factors representing the homeowner's impact on the property value when constructing an RPPI [5]. The question of to what extent the homeowner affects the value of the property, and whether or not statistical methods are able to capture this impact, has yet to be answered. The


purpose of this thesis is to investigate if the accuracy of the property valuations can be improved by including customer data and thereby increase the accuracy of the LTV. The thesis also aims to evaluate different statistical methods and their ability to capture the homeowner’s impact on the property value.

1.3 Research questions

The goal of this project is to investigate if a more accurate property valuation can be obtained by combining the current RPPI with the bank’s customer data using the data mining techniques regression and clustering. This has been divided into the following two research questions:

1. Can a combination of RPPI and the homeowner’s attributes yield a more accurate house valuation?

2. Which statistical methods are able (or unable) to construct the house valuation described in 1?

1.4 Delimitations

The properties that will be examined are single family homes, which means that apartments, farming properties, industrial properties etc. are excluded from this thesis. The reason is that more than 70% of the bank's defaulted customers own single family homes. In addition, the RPPI used behaves differently for these different property types. Thus, in order to limit the scope, the thesis will focus on one property type.

The type of RPPI that will be examined is constructed with the SPAR method, that is, the Sale Price Appraisal Ratio, as this is the RPPI used by the bank. The use of other RPPI methods might therefore yield a different result.

This thesis does not aim to construct a new RPPI from scratch but to make use of the existing RPPI and investigate if it can be improved by additional data and statistical methods.

Limitations in data will be addressed in Chapter 4.


1.5 Structure of the thesis

The rest of the thesis is divided into six chapters. Chapter 2 presents the currently used RPPI methods as well as the current state of knowledge of RPPI performance and the effect on the mortgage portfolio. Chapter 3 gives detailed descriptions of the used statistical methods as well as the reasoning behind the chosen methods.

The construction of the data sets and the premises for the models are described in Chapter 4. Chapter 5 presents the model designs, and the models' results are presented and analysed in Chapter 6. Lastly, the discussion and conclusion are given in Chapter 7.


Chapter 2

Background and Literature studies

This chapter will provide an overview of existing RPPIs and an in-depth description of the index currently used by the bank, the Sale Price Appraisal Ratio (SPAR). The chapter will also present the house valuation's role in the mortgage portfolio as reported by the bank. Furthermore, it will disclose the existing research concerning the SPAR index as well as highlight the studies that have been conducted with regard to the RPPI's effect on the mortgage portfolio.

2.1 RPPI methods

There are mainly four different ways to construct an RPPI according to the Handbook on Residential Property Price Indices [5]. These four indices are described below.

Stratification

Stratification uses the median or mean of the house prices and evaluates how these develop over time. In order to diminish the errors that arise from the fact that the characteristics, location and quality of the houses differ vastly, the population is divided into sub-groups. The RPPI is then calculated as a weighted average for each sub-group.

Repeat Sales

The method’s name, repeat sales, gives the concept. The method builds upon the reselling of the same property and tracking the price development from the base period to the period that is of interest. The first sell date represents the base period.


Hedonic Regression

The Hedonic Regression constructs an index using the house characteristics and regression techniques in order to determine the attributes’ impact on the sell price.

Sale Price Appraisal Ratio

The SPAR method uses the appraisals that are gathered for tax purposes. The appraised value of the house is then matched with its sell price for the same period (as a ratio).

Which RPPI is available and used depends on the country, the market and the data that are available. All of the methods have advantages and disadvantages, and these will also change depending on the accessible data. All the methods listed above try to describe the entire housing stock while only relying on the houses that are actively traded. There is no guarantee that the houses that are being sold represent the housing stock adequately, which gives rise to an unknown bias [5].

2.2 SPAR

For the Swedish housing market, the SPAR method is one of the RPPIs used, as the appraisals are readily available and collected for tax purposes [9]. The SPAR method is somewhat similar to the repeat sales method in that it uses the concept of comparing the sell price to a base price. The difference is that the price in the base period is here represented by the appraised value and thus always exists. The SPAR index can be expressed as [10]

$$I_{SPAR} = \frac{\sum_{j=1}^{n_t} P_{jt} \big/ \sum_{j=1}^{n_t} A_{j0}}{\sum_{i=1}^{n_0} P_{i0} \big/ \sum_{i=1}^{n_0} A_{i0}}. \quad (2.1)$$

The price $P_{jt}$ for property $j$ at time period $t$ is matched with its appraised value $A_{j0}$ from the base period, represented by $0$. This is then compared to the transaction price $P_{i0}$ of property $i$ at the base period, which in turn is matched with its appraised value $A_{i0}$. The SPAR index and its accuracy rely on the number of houses that are traded in the base period, $n_0$, as well as in the time period of interest, $n_t$, in the specific areas. Moreover, it relies heavily on the accuracy of the appraisals as well [5].
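As an illustration of equation (2.1), the sketch below computes a SPAR index level from a small, invented table of transactions in which every traded house carries both its sale price and its base-period appraisal. Column names and numbers are assumptions; a production index would be computed per region on far larger samples.

```python
import pandas as pd

# Invented transactions: sale price P and base-period appraisal A for each traded house.
sales = pd.DataFrame({
    "period":    [0, 0, 0, "t", "t", "t"],
    "price":     [2_000_000, 3_100_000, 1_500_000, 2_600_000, 3_800_000, 1_900_000],
    "appraisal": [1_700_000, 2_500_000, 1_300_000, 2_100_000, 3_000_000, 1_500_000],
})

def spar_ratio(df: pd.DataFrame) -> float:
    # Sum of sale prices divided by sum of base-period appraisals, as in equation (2.1).
    return df["price"].sum() / df["appraisal"].sum()

base = sales[sales["period"] == 0]
current = sales[sales["period"] == "t"]

i_spar = spar_ratio(current) / spar_ratio(base)
print(f"SPAR index at period t (base period = 1): {i_spar:.3f}")
```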

The RPPI used by the bank is supplied by an external property valuation company [4] which uses the SPAR method to construct the index for different geographical


regions. This means that the populations that are taken into consideration stem from different regions in Sweden and that the index will therefore differ with respect to time and region, $I_{t,r}$. The SPAR index constructed by the external property valuation company, for single family homes, divides Sweden into eleven different regions.

These regions are presented in the table below and are defined by the external property valuation company.

Stockholm: The city of Stockholm
Malmö: The city of Malmö
Göteborg: The city of Göteborg
Larger cities: Population of 40 000-200 000 inhabitants
Commuting municipalities: Minimum 40% commutes to another municipality
Suburbs: Minimum 50% commutes to a nearby, larger city
Industrialised municipalities: Minimum 30% is working in production, construction, the energy sector etc.
Large sized municipalities: Population of more than 25 000 inhabitants
Medium sized municipalities: Population of 12 500-25 000 inhabitants
Small sized municipalities: Population of less than 12 500 inhabitants
Sparsely populated municipalities: More than 45 minutes by car to another community of more than 3 000 inhabitants

Table 2.1: Regions

Depending on the quality of the data, the index's accuracy can vary as the number of traded houses varies. As an example, there are more trades in densely populated areas such as larger cities than in sparsely populated municipalities. This can in turn lead to outliers having a greater effect on the index in the less populated municipalities.

As mentioned previously, the SPAR method is also affected by the appraised property value. The Swedish Tax Agency (Skatteverket) collects the appraised values every third year [11]. The taxation value combines the sales that have occurred in the area of the property together with property qualities and characteristics.

The attributes that are taken into account are the following [12]:


• Area and value of the plot

• Type of building and its rights

• Geographical location and distance to water

• Accessibility to potable water

• Type and number of toilet facilities

• Year of completed construction

• Area of house

• Garage

• Fire place

• Kitchen standard

• Electricity and heating consumption

• Type of windows and facade

• Indication of larger renovations of kitchen or toilet facilities

These attributes are given points which are used for the overall valuation of the house. The attributes and points are preassigned by the Tax Agency, and the homeowners receive a form in which they can either approve or challenge the Tax Agency's information [12].

2.3 Mortgage portfolio and LTV

The LTV’s involvement in the mortgage portfolio is introduced in Chapter 1 as a part of the LGD function in the computation of EL, see equation (1.2). The LGD function is more specifically defined by the bank as

$$\mathrm{LGD} = \alpha \cdot \beta \cdot \gamma. \quad (2.2)$$

Each parameter, $\alpha$, $\beta$ and $\gamma$, represents a different step in the default process, where $\alpha$ is a binary parameter computed as the probability of realised default and thus represents the very beginning of the default process. This parameter is computed by logistic regression, where the LTV value is one of the independent variables. The


parameter $\gamma$, on the other hand, represents the end of the process as it is the remainder of the debt owed to the bank after several years. The parameter of interest with respect to the LTV is thus $\beta$, which corresponds to the step in the middle, where the defaulted customer either can partly write off the debt or sell the underlying property. The parameter $\beta$ is an expression of loss depending on the two possible outcomes, written as

$$\beta = \beta_w \pi + (1 - \pi)\beta_s. \quad (2.3)$$

In (2.3) above, $\beta_w$ is the parameter for the write-off and $\pi$ is the probability of such an arrangement. The parameter $\beta_s$ is thus the parameter associated with selling the property, which can be expressed as

$$\beta_s = 1 - \min\left(\frac{\text{valuation}}{\text{exposure}},\, 1\right), \quad (2.4)$$

where the exposure is the value of the mortgage/debt. The relation in (2.4) is shown in figure 2.1.

Figure 2.1: $\beta_s$ as a function of the property value and the customer's exposure.

Figure 2.1 illustrates, in addition to the relation (2.4), the impact of a faulty property valuation. An overvaluation can result in an inaccurately low $\beta_s$, which in turn yields an overall $\beta$ that is not true to the actual, realised loss.
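The sketch below evaluates equations (2.3) and (2.4) for an invented defaulted customer, showing how an overvalued collateral pushes $\beta_s$, and thereby $\beta$, downwards. All parameter values are hypothetical.

```python
def beta_s(valuation: float, exposure: float) -> float:
    """Loss fraction from selling the collateral, equation (2.4)."""
    return 1.0 - min(valuation / exposure, 1.0)

def beta(beta_w: float, pi: float, valuation: float, exposure: float) -> float:
    """Combined loss parameter, equation (2.3)."""
    return beta_w * pi + (1.0 - pi) * beta_s(valuation, exposure)

exposure = 2_000_000          # SEK, remaining debt
true_value = 1_500_000        # SEK, what a distressed sale actually yields
inflated_value = 2_200_000    # SEK, an RPPI valuation that overshoots

print(beta_s(true_value, exposure))      # 0.25 -> a real expected loss share
print(beta_s(inflated_value, exposure))  # 0.00 -> the overvaluation hides the loss

# With a 30% probability of a write-off arrangement (beta_w = 0.4, both invented):
print(beta(0.4, 0.3, true_value, exposure))
```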

2.4 Literature studies

In this section, the current knowledge of the RPPI's impact on the mortgage portfolio is presented, as well as the SPAR index's performance compared to the other indices.


2.4.1 Mortgage portfolio and house valuations

Andersson and Mayock investigated the LTV's effect on the LGD in their study Loss severities on residential real estate debt during the Great Recession [6], where they constructed an RPPI based on the repeat sales method. Their study accounted for the presence of the homeowner by including a dummy variable in the RPPI that assumes the value 1 if the house is occupied and 0 otherwise. Their idea was that the occupancy status can be translated into the owner's incentive to sell the house. Their result indicated that the presence of an owner marginally affects the house valuation. They also identified the LTV measure as the most important factor when determining the loss severities of a mortgage portfolio. Since the actual LTV measure is unknown unless the property is traded, Andersson and Mayock argued that the LTV included in a loss severity model, regardless of the method used for estimating the house value, will include noise and errors. They identified two major sources of the introduced errors:

1. The RPPI is incapable of capturing the value of distressed trades.
2. There are too few defaulted properties in the population samples.

Their results implied that the risk models are, in turn, likely to severely underestimate the loss during economic downturns.

Moreover, Bogin, Doerner and Larson wrote a working paper for the US Federal Housing Finance Agency concerning the current LTV, house pricing and their impact on the portfolio. Their study argued that the current LTV is the "key to estimating a borrower's likelihood to prepay or default" for the entirety of the loan's life [13]. They focused on the repeat sales RPPI and its accuracy, finding that there is an interest in combining different geographical repeat sales RPPIs in order to obtain more accurate property valuations. That is, for large cities the different postal codes have different impacts on the house value. This stands in contrast to sparsely populated areas, where only the municipality or county would be sufficient information for the house valuation [13].

Glennon, Kiefer and Mayock raised the issue of bias in the repeat sales RPPI in their study Measurement error in residential property valuation: An application of forecast combination [14]. They did not attempt to build a new RPPI from scratch, but constructed an improved RPPI from existing indices by combining them using techniques such as multiple linear regression and time series analysis. They identified different origins of potential bias such as aggregation bias, renovation bias, hedonic bias, trading-frequency bias and sample selection bias. Moreover, they


stressed the fact that properties traded under distress will not be sold at their predicted market value [14]. Glennon, Kiefer and Mayock highlighted arguments that have been made in other literature studies, namely that the discount originates from two sources,

"neglected maintenance...and motivated sellers – such as financial institutions and homeowners that have defaulted on their mortgage" [14].

2.4.2 Evaluation of SPAR

The lack of studies of the SPAR method’s accuracy and impact on the mortgage portfolio is not particularly odd as the SPAR method is mainly used in Sweden, New Zealand, the Netherlands and Denmark [5]. However, there are studies which have evaluated and compared the SPAR method to both the repeat sales method and hedonic regression.

The Reserve Bank of New Zealand launched an investigation into the three RPPI methods repeat sales, hedonic regression and SPAR [15], acknowledging that the most suitable RPPI cannot be identified by theoretical studies alone, but rather by empirical studies. The three methods were evaluated empirically and compared to the simplest form of stratification RPPI, the median and the stratified median. The three methods in question all outperformed the median when compared to the actual sell price. Moreover, their study found that the SPAR method outperforms the other two methods in the New Zealand housing market, as it is less noisy, less volatile and more accurate in predicting the sell price. This is in line with what a group of statisticians from Delft University of Technology in the Netherlands found. In their research paper A House Price Index Based on the SPAR Method, they concluded that the SPAR method performs as well as, and sometimes better than, the more commonly used indices. However, it is also very sensitive to flaws in the taxation value [10].


Chapter 3

Mathematical Methods

The statistical methods that are used to construct a model for the house valuation that includes customer data are the data mining techniques regression and clustering.

Regression is a useful tool when the outcome of interest is of a predictive nature.

However, there might be subgroups within the data that the regression techniques are unable to discover, introducing the need to include unsupervised statistical learning in the form of clustering, such as K-means. Moreover, classification methods such as logistic regression and discriminant analysis are more appropriate for predicting the probability of a qualitative response variable. The methods mentioned in this paragraph are presented and discussed in this chapter.

3.1 Regression techniques

Regression techniques are a subcategory of supervised statistical learning, meaning that regression is a predictive statistical model which yields an output from a set of input data [16]. There is a range of different regression techniques, from simple linear models to advanced non-linear models, from predictors to classifiers.

3.1.1 OLS Multiple Linear Regression

The linear regression serves as a logical starting point for most predictive models as it is a building block to more advanced regression techniques [16]. Multiple linear regression is defined as [16]

$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \epsilon, \quad (3.1)$$


where $\beta_j$ represents the relation or impact that the predictor $X_j$ has on the response variable $Y$, with $j = 1, 2, \ldots, p$. The independent variables $X_j$ will represent the customer attributes in this thesis, whereas the response variable $Y$ will be a function of the sell price and the RPPI.

Estimating $\hat{\beta}_0, \ldots, \hat{\beta}_p$ in (3.1) is done by minimising the error of the predicted response $\hat{y}$ with respect to the real observation $y$. The Ordinary Least Squares (OLS) approach does this by minimising the sum of the squared residuals (RSS) [16], such as

$$\mathrm{RSS} = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \Big(y_i - \hat{\beta}_0 - \sum_{j=1}^{p} \hat{\beta}_j x_{ij}\Big)^2. \quad (3.2)$$

The estimated coefficients, $\hat{\beta}_j$, are thus obtained when the expression (3.2) is minimised, yielding the following estimators,

$$\hat{\beta} = (X^T X)^{-1} X^T y. \quad (3.3)$$

The relevance of each customer attribute is evaluated by performing a t-statistic calculation to test the null hypothesis [16],

$$H_0 : \beta_j = 0 \qquad H_a : \exists\, \beta_j \neq 0.$$

The t-statistic is given by

$$t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)}. \quad (3.4)$$

In (3.4), $SE(\hat{\beta}_j)$ is the standard error of $\hat{\beta}_j$. The p-value corresponds to the probability of obtaining a number $\geq |t|$ under $H_0$. A small p-value rejects the null hypothesis and validates the presence of the attribute [16]. The next step is thus to find the relevant attributes, and this is done by variable selection, which is discussed later in this section.
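As a small illustration of (3.1)-(3.4), the following sketch fits an OLS model on simulated data and reports coefficients, t-statistics and p-values. The predictors are synthetic stand-ins for customer attributes, not the bank's data; statsmodels is used only because it reports the t-tests directly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 500, 3

# Synthetic "customer attributes" and a response built from known coefficients plus noise.
X = rng.normal(size=(n, p))
true_beta = np.array([0.8, 0.0, -0.5])           # the middle predictor is irrelevant
y = 1.0 + X @ true_beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS estimate, equation (3.3).
X1 = sm.add_constant(X)                          # prepend a column of ones for beta_0
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
print("beta_hat:", np.round(beta_hat, 3))

# The same fit via statsmodels, which also reports the t-statistics and p-values of (3.4).
results = sm.OLS(y, X1).fit()
print(results.summary())
```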

Assumptions

The multiple regression model builds upon the assumption that the relationship between the response variable and the predictor variables is of a linear nature.

Moreover, the correlation of the error terms needs to be examined as the standard errors for the coefficients rely on the assumption that the error terms are uncorrelated.


These assumptions need to be confirmed or treated [16].

Non-linearity can be identified using residual plots, plotting the error terms $\epsilon$ against the predicted values $\hat{y}$. A pattern in the residual plot indicates a non-linear relationship. In order to solve this, the predictor can be transformed to obtain the desired linear relationship. These transformations can be $\log(X)$, $\sqrt{X}$ or of a polynomial nature. The linear model can in this sense be extended into a general linear model depending on the transformation of the predictors. Another option is to group similarly behaved values into sub-groups in order to obtain linearity [16].

Correlation among the error terms yields an inaccurate standard error, which results in too narrow confidence intervals. The presence of heteroscedasticity, or non-constant variance, can be addressed by transforming the response variable $Y$ to $\log(Y)$, $\sqrt{Y}$ or $1/Y$ [16].

Analysis of variables

The variables included in the regression model need to be further examined with regard to outliers, leverage points and correlated predictors [16].

Outliers will be detected by calculating Cook's Distance, $D_i$ [17]. The distance $D_i$, for all points $i = 1, 2, \ldots, n$, is computed as

$$D_i = \frac{(y_i - \hat{y}_i)^2}{(p + 1)\,MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}, \quad (3.5)$$

where $p$ is the number of predictors, $MSE$ the mean square error and $h_{ii}$ the leverage that originates from the hat matrix $H = X(X^T X)^{-1} X^T$. The $MSE$ [16] is defined as

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \quad (3.6)$$

High leverage points, on the other hand, are points that have unusual predictor values, $x_i$. By calculating the CovRatio, leverage points can be identified as

$$CovRatio_i = \frac{(S_i^2)^p}{MSE^{\,p}} \cdot \frac{1}{1 - h_{ii}}, \quad (3.7)$$

where $S_i$ is the variance estimate.


Multicollinearity can either be structural or data based and occurs between the predictors. The first one is caused by the creation of additional predictors that originate from already existing predictors and the second type comes from the data.

The measure Variance Inflation Factors (VIF) is used as an indicator as it measures how much the variance is inflated [16]. The idea is that standard error increases with multicollinearity which in turn gives rise to an increase in variance. The VIF is defined for each predictor j as

$$VIF_j = \frac{1}{1 - R_j^2}. \quad (3.8)$$

There are three different threshold values to keep in mind when studying the individual VIF [16]:

$VIF = 1$: No multicollinearity
$VIF > 4$: A closer study is needed
$VIF > 10$: Definitely multicollinear, needs to be treated.
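The VIF in (3.8) can be computed by regressing each predictor on all the others. The sketch below does this for a small synthetic design matrix in which two columns are deliberately close to collinear; the data are invented and only illustrate the diagnostic.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance Inflation Factor for each column of X, equation (3.8)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        Z1 = np.column_stack([np.ones(n), Z])       # regress X_j on the other predictors
        beta = np.linalg.lstsq(Z1, y, rcond=None)[0]
        resid = y - Z1 @ beta
        r2 = 1.0 - resid.var() / y.var()            # R_j^2 of that auxiliary regression
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)                # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

print(np.round(vif(X), 1))                          # x1 and x3 should show large VIFs
```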

Variable selection for least squares

Ideally, investigating all possible combinations of attributes should in the end give a model containing only the attributes which best fit the data. However, this would result in $2^p$ combinations [16]. The following measures will therefore be used for selection purposes: RSS (see (3.2)), adjusted $R^2$, BIC, AIC and Mallows' $C_p$.

The $R^2$ (or adjusted $R^2$) takes a value between 0 and 1 [16], where a larger value corresponds to a more accurate model. The $R^2$ is defined as

$$R^2 = 1 - \frac{RSS}{\sum_{i=1}^{n}(y_i - \bar{y})^2}. \quad (3.9)$$

The BIC uses log-likelihood functions and includes a penalty term for the inclusion of many predictors [18]. The BIC is defined as

$$BIC = -2\log L(\hat{\beta}) + p \log n, \quad (3.10)$$

where $L(\hat{\beta})$ is the maximised value of the likelihood function. The smaller the BIC value, the better the model.

Mallow’s Cp also addresses overfitting as it penalises the inclusion of unnecessary


predictors, and as with BIC, the smaller the value of $C_p$, the better [19]. It is defined as

$$C_p = \frac{RSS}{MSE} - n + 2p. \quad (3.11)$$

Lastly, the AIC is yet another variable selection criterion, where the model with the lowest AIC value is preferred. The AIC is defined as [16]

$$AIC = 2p - 2\log L(\hat{\beta}). \quad (3.12)$$

3.1.2 Multiple Linear Regression with Shrinkage

An alternative to OLS for estimating the coefficients is the use of shrinkage methods such as ridge regression and the lasso.

Ridge regression

Ridge regression is a useful tool when the predictors have a high variance [16]. The estimated coefficients in ridge regression are defined as the $\hat{\beta}^R$ which minimise

$$\min_{\beta} \Big\{ RSS + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}, \quad (3.13)$$

where $\lambda \geq 0$ is the so-called tuning parameter [16]. As $\lambda$ grows larger, the coefficients shrink towards zero and the variance decreases. However, the price for this is an increase in bias. The tuning parameter should be chosen so that it reduces the variance without introducing too large a bias. Moreover, ridge regression is sensitive to scale, so the predictors should be standardised. The predictors can be standardised as [16]

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}}. \quad (3.14)$$

The final model will however include all the predictors. Ridge regression does not offer any method to perform variable selection [16].

Lasso

The lasso regression is very similar to ridge regression; however, there is a slight difference regarding the computation. The lasso estimates of the coefficients, $\hat{\beta}^L$, are


obtained by minimising the following expression [16]

$$\min_{\beta} \Big\{ RSS + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}. \quad (3.15)$$

Since the lasso has an $\ell_1$ norm for the coefficients $\beta_j$, the method forces some of the coefficients to become exactly zero and thereby performs a variable selection [16].

The major difference between these two shrinkage models is the variable selection.

The ridge regression will outperform the lasso if all the predictors are valuable. On the other hand, if there are many available predictors but few of importance, the lasso will outperform the ridge regression. Both should thus be considered when there is a high variance in the input data [16].

Leave One Out Cross Validation

The tuning parameter $\lambda$ can be chosen randomly or via "trial and error", but it can also be determined using cross validation. Leave One Out Cross Validation splits the data set into two parts. One group contains $n - 1$ observations, whereas the other group has only a single observation. The larger group can be thought of as the training set. The lasso and ridge regressions can be trained on the larger subgroup and then tested on the one sample left out, calculating the mean square error (MSE) as the test error [16]:

$$MSE = (y_1 - \hat{y}_1)^2. \quad (3.16)$$

In order to reduce the variability, this procedure is repeated n times. Meanwhile, the tuning parameter λ assumes different values (pre-assigned). The λ with the smallest cross validation error will be used.
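A sketch of how the shrinkage fits of this section and the leave-one-out choice of $\lambda$ could be carried out with scikit-learn is shown below. The synthetic data and the $\lambda$ grid are assumptions; the point is only the mechanics of standardising the predictors, scanning $\lambda$ and keeping the value with the smallest cross validation error.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 - X[:, 1] * 1.5 + rng.normal(scale=0.5, size=n)   # only two useful predictors

lambdas = np.logspace(-3, 2, 20)                   # candidate tuning parameters

for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    # Standardise (equation (3.14)) and fit for every lambda, scoring by leave-one-out MSE.
    pipe = make_pipeline(StandardScaler(), model)
    search = GridSearchCV(
        pipe,
        param_grid={f"{name}__alpha": lambdas},
        cv=LeaveOneOut(),
        scoring="neg_mean_squared_error",
    )
    search.fit(X, y)
    print(name, "best lambda:", search.best_params_, "CV MSE:", -search.best_score_)
```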

3.1.3 Regression Tree

The idea of grouping the variables, as mentioned in section 3.1.1, can be extended and developed further into a tree-based method which relies on segmentation altogether. Regression trees can be used as a predictive decision tree, where specific predictor values serve as conditions at each split or branch [16]. This is illustrated in figure 3.1.


Figure 3.1: Regression tree with three regions or leaves and two splits or internal nodes.

In a regression tree, all observations that are grouped together in a region will be assigned the same predicted value [16]. This predicted value is the mean response value of the observations in that region, which is illustrated in figure 3.1.

A basic regression tree starts off by splitting $X_j$ at different values and assigning the observations to $G$ different regions, $R_1, \ldots, R_G$. These regions are obtained by minimising the sum of RSS [16],

$$\min_{G} \Big\{ \sum_{g=1}^{G} \sum_{i \in R_g} (y_i - \hat{y}_{R_g})^2 \Big\}, \quad (3.17)$$

with $\hat{y}_{R_g}$ as the mean response value for region $g$. This is computed in a top-down manner while considering the best split at each interior node, as it is not possible to consider all combinations. The best split is the one that minimises the expression (3.17) at that specific split; it does not evaluate future splits. At each split, all predictors are considered and the best $X_j$, with respect to (3.17), is then chosen. Consider the split into two regions [16]:

$$R_1(g, c) = \{X \mid X_g < c\}, \qquad R_2(g, c) = \{X \mid X_g \geq c\}. \quad (3.18)$$

The values $g$ and $c$ are computed by

$$\min_{g,\, c} \Big\{ \sum_{i:\, x_i \in R_1(g,c)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(g,c)} (y_i - \hat{y}_{R_2})^2 \Big\}, \quad (3.19)$$

where $c$ is the cut value. This process is then repeated until a split no longer yields


sufficient improvement [16].

Random Forests

A basic regression tree constructed as described above usually does not accurately describe the data, as the algorithm only builds one single tree [16]. This can be improved by constructing several different trees on bootstrapped data sets and for every split, 1/3 of the predictors are randomly selected and considered. The restriction forces the algorithm to consider all predictors, not only the ones that are deemed to be locally the best fit according to (3.17) [16].
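The sketch below fits a random forest regressor in the spirit of this section, using the 70/30 split of Section 3.1.4 and roughly one third of the predictors at each split. The data are simulated and the hyperparameter values are assumptions, not the settings used in the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 2_000, 9
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)  # non-linear signal

# 70% training, 30% test, as in the validation set approach of Section 3.1.4.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestRegressor(
    n_estimators=500,       # number of bootstrapped trees
    max_features=1 / 3,     # roughly a third of the predictors considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)

test_mse = mean_squared_error(y_test, forest.predict(X_test))
print(f"test MSE: {test_mse:.3f}")
print("feature importances:", np.round(forest.feature_importances_, 2))
```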

3.1.4 Model Validation for Prediction Methods

The data set will be divided into two sets: 70% will be used for training and the remaining 30% for validation. This facilitates validation of the models by using the unseen test set. The models will be evaluated using a so-called Validation Set Approach, which is a simple but effective way of testing the fitted models [16]. The test error is then defined as the mean square error, MSE, defined in (3.6), which is obtained when comparing the predicted values for the test set to the actual observations [16].

3.1.5 Logistic Regression

Logistic regression differs from the techniques introduced so far in the chapter as it models the probability of binary outcomes [16]. For instance, the probability that the SPAR index overvalues (rather than undervalues) the actual sell price can be expressed as $Pr(SPAR = \text{Overvalue} \mid X)$, where $X$ is a vector corresponding to customer attributes. Let us denote $Pr(SPAR = \text{Overvalue} \mid X)$ by $p(x)$. This probability can be modelled as [16]

$$p(x) = Pr(SPAR = \text{Overvalue} \mid X) = \frac{e^{\beta_0 + \sum_{j=1}^{p}\beta_j X_j}}{1 + e^{\beta_0 + \sum_{j=1}^{p}\beta_j X_j}}. \quad (3.20)$$

The coefficients $\beta_j$ are estimated using maximum likelihood. The likelihood function is derived from the joint probability distribution of the dependent variable, $y$, which has a binomial distribution [16]. The likelihood function thus becomes

$$L = \prod_{i=1}^{n} p(x_i)^{y_i}(1 - p(x_i))^{1-y_i}. \quad (3.21)$$


By taking the logarithm of (3.21) in combination with (3.20), the log-likelihood is obtained as

$$\ell = \sum_{i=1}^{n} -\log\Big(1 + e^{\beta_0 + \sum_{j=1}^{p}\beta_j X_j}\Big) + \sum_{i=1}^{n} y_i \Big(\beta_0 + \sum_{j=1}^{p}\beta_j X_j\Big). \quad (3.22)$$

βjXj). (3.22) The β estimates are obtained by deriving (3.22) with respect to β and setting the derivatives to zero. The relevance of the parameters Xj are evaluated in the similar way as for the OLS estimated parameters in section 3.1.1 with respect to a p- value. The difference being that logistic regression uses z − statistics instead of t − statistics for the hypothesis testing. This means that BIC and AIC can be used for variable selection [16].

3.1.6 Discriminant Analysis

The probability of, for instance, the SPAR index overvaluing the property can be estimated in an alternative way to logistic regression. Bayes' theorem is used in Linear and Quadratic Discriminant Analysis (LDA and QDA) in order to make use of the distribution of each predictor $X_j$ in each response class, $Y_1, Y_2$ [16]. Bayes' theorem can be expressed as

$$Pr(Y = k \mid x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}, \quad (3.23)$$

where $Y$ can be thought of as $SPAR$ and $k$ as Overvalue. Moreover, $K$ is the number of different response classes (e.g. Overvalue, Undervalue), $\pi_k$ is the probability of an observation belonging to outcome $k$ based on the training data, and $f_k(x)$ is the density function of the predictor $X$ with respect to outcome $k$. The density function $f_k(x)$ needs to be estimated in order to solve (3.23), and this is what distinguishes the LDA from the QDA [16].

The Linear Discriminant Analysis assumes that the observations for each outcome come from a multivariate Normal distribution, with a common covariance matrix for all the outcomes, $X \sim N(\mu_k, \Sigma)$. Bayes' theorem in (3.23) can thus be used to compute $Pr(Y = k \mid x)$ by using the density function for a multivariate Normal distribution with estimates of $\mu_k$ and $\Sigma$ from the population [16].

The Quadratic Discriminant Analysis is similar to the LDA in the assumption that the observations originate from a multivariate Normal distribution, the difference being that each class has its own covariance matrix $\Sigma_k$. This leads to discriminant functions that are quadratic functions of $x$, instead of linear as in the previous case [16].


However, note that both LDA and QDA can have more than two response classes, in contrast to logistic regression.

3.1.7 Model Validation Classifiers

The classifiers logistic regression, LDA and QDA cannot use the MSE as the other regression techniques do, as the outcome is qualitative. The test error is instead given by computing the number of wrongly predicted outcomes, $m_f$, with respect to the test set [16]. This can be expressed as a percentage of all the predicted outcomes, $m_{tot}$,

$$Error_{test} = \frac{m_f}{m_{tot}}. \quad (3.24)$$

If there are better or worse outcomes of $Y$, then there is also an interest in minimising the wrong classification of a certain outcome. In the context of this thesis, wrongly classifying an Overvaluation as an Undervaluation is more harmful than the other way around. Therefore, the percentage of false Undervaluations (or, more generally, of a false $Y_k$) is another metric that should be computed when validating the model [16].

This metric can be taken into account when designing the model, as it is a direct result of interpreting the posterior probability. By adjusting the threshold value for the posterior probability, false (and true) predictions can be improved (or worsened) to a certain extent [16].
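To illustrate the last point, the sketch below compares confusion matrices for two different posterior-probability thresholds on a toy set of predicted probabilities. Lowering the threshold for predicting Overvaluation reduces the false Undervaluations at the cost of more false Overvaluations; all numbers are invented.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true labels (1 = Overvaluation, 0 = Undervaluation) and posterior probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0])
p_over = np.array([0.9, 0.6, 0.4, 0.2, 0.5, 0.35, 0.1, 0.8, 0.45, 0.3, 0.55, 0.05])

for threshold in (0.5, 0.3):
    y_pred = (p_over >= threshold).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
    false_under = cm[0, 1] / cm[0].sum()   # overvaluations wrongly predicted as undervaluations
    print(f"threshold {threshold}: confusion matrix\n{cm}\n"
          f"share of false Undervaluations: {false_under:.2f}\n")
```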

3.2 Clustering: K means

The regression techniques will give one single model for all customers, evaluating the attributes and assigning different factors to them depending on their impact on the house value. But what if there are sub-groups within the customer population that cannot conform to one single model? If there exist different classes of customers, then one general model will not be suitable for all of them. However, by applying unsupervised statistical methods, such as clustering, these sub-groups might be detected.

Clustering cannot help with prediction, but it aims to divide the data into separate groups with respect to a condition based on similarity [16]. Once these groups are obtained, the regression techniques described in section 3.1 can be applied to each individual group [20]. Pre-processing data sets using K means has been shown in previous studies to improve the prediction accuracy [20].


K means is a clustering technique that builds on dividing a data set into $K$ distinct clusters; no observation can belong to two different clusters simultaneously. It is constructed around the idea of minimising the within-cluster variation, that is [16]

$$\min_{C_1, \ldots, C_K} \Big\{ \sum_{k=1}^{K} WSS(C_k) \Big\}. \quad (3.25)$$

In (3.25) above, $C_k$ is the cluster and $WSS$ represents the variation within the cluster.

The within-cluster variation is generally defined as the squared Euclidean distance [16]

$$WSS(C_k) = \frac{1}{n_k} \sum_{i,\, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2, \quad (3.26)$$

with $n_k$ as the number of observations in cluster $k$. By randomly dividing the observations into $K$ different clusters and computing the mean of each feature in each cluster (the centroid), the observations can be reassigned to the cluster whose centroid they are closest to. Since the within-cluster variation presented calculates distances, only quantitative data can be treated. Moreover, the data should be scaled or standardised to ensure an optimal outcome [16].

K means tends to find local optima; it is therefore important to perform multiple different clusterings in order to explore different optimal solutions [16].

3.2.1 Finding the k value

The K means method is, as can be understood, dependent on the choice of $k$. An appropriate value can be identified by iterating over a range of $k$ values until the within-cluster sum of squares in (3.26) becomes sufficiently small [21]. Another approach is to compute a so-called Silhouette coefficient [22]. The coefficient $s$ is defined as

$$s = \frac{b - a}{\max\{a, b\}}. \quad (3.27)$$

It uses the notion of similarity and dissimilarity where a is the average distance to the other objects in the same cluster and b is the minimum average distance to the objects in the nearby clusters. The silhouette value varies from -1 to 1. A k value that yields a coefficient close to 1 is therefore desirable [22].
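A sketch of the K means workflow described above, including standardisation of the quantitative attributes, repeated initialisations and the silhouette coefficient of (3.27) over a range of k values, is given below on simulated data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Three synthetic customer groups in two quantitative attributes.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.7, size=(150, 2)),
    rng.normal(loc=(4, 1), scale=0.7, size=(150, 2)),
    rng.normal(loc=(1, 5), scale=0.7, size=(150, 2)),
])
X_std = StandardScaler().fit_transform(X)          # distances require comparable scales

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=20, random_state=0)   # restarts to avoid poor local optima
    labels = km.fit_predict(X_std)
    print(f"k = {k}: silhouette = {silhouette_score(X_std, labels):.3f}")
# The k with the silhouette closest to 1 (here expected to be k = 3) is the preferred choice.
```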


Chapter 4

Data collection and construction

To evaluate the currently used RPPI as well as construct an alternative model to estimate the house valuation, a data set containing purchase and sell prices of houses, along with their purchase and sell dates, was constructed. This data set also has to contain an identification key in order to link the property to a customer of the bank.

Hence, both the data set of the house prices and the data set containing customer data were constructed from the bank’s database.

4.1 House prices

The sell prices of properties are not public records, nor is it information that the bank otherwise has access to. However, since the bank has held around a quarter of the mortgage market in Sweden for most of the 21st century, a data set containing information on purchase and sell prices could be constructed from mortgage agreements.

When a customer takes a mortgage with the bank, information about the property and the customer is stored in the bank’s database as can be visualised by the schematic below.

Figure 4.1: Visualisation of data storing.


The StartDate in figure 4.1 represents the purchase date and corresponds to the start date of the mortgage agreement. When the customer sells the property, the sell price is not registered unless the customer is unable to repay the mortgage. Therefore, the database will only be updated to include the end date of the mortgage agreement, as seen in figure 4.2. However, the sell price can be identified if the property is sold to another customer within the bank, as it will be registered as the purchase price of that customer. This is shown in the second schematic of the construction of the data set.

Figure 4.2: Construction of the data set containing house prices.

The reasoning displayed in figure 4.2 is a simplified version of the reality. There might be a time gap between customer A’s mortgage agreement end date and customer B’s mortgage agreement start date due to administrative delays. To account for this, the data set allowed for a difference of ± 1 month between these dates.

The next step was to distinguish between the houses that are traded under normal conditions on the market and the houses that undergo a distressed sale due to a defaulted mortgage agreement. The status of every mortgage agreement is continuously updated in the database, indicating whether it is healthy (normal) or defaulted. However, just because an agreement is defaulted does not mean that the customer has to sell the house; there are several aids and steps to ensure that the agreement becomes cured. In order to identify the trades that are caused by defaulted agreements, the following definition is applied: the agreement must be marked as defaulted at the time of the trade, and it must have been defaulted for more than one month.

By the reasoning described above, a data set of house prices was constructed and will be referred to as HousePrice for the rest of the thesis.


4.2 Customer data

The customer data available differ widely from customer to customer, as they depend on the customer's agreements and relationship with the bank. Some customers have only their mortgage with the bank and use other banks for their credit cards, savings, loans etc.

Others keep all their financial assets and agreements within one and the same bank.

The data set that contains the attributes and characteristics of the customers was limited to the customers that are found in HousePrice. The customer data set will be referred to as Attributes. It was not computationally feasible to include or examine all the different customer features that are stored in the bank's databases. The data set Attributes was therefore limited to a selection of the available and relevant information. The variables were selected on the basis of their potential effect on the house value.

Financial attributes such as income, debt, transactions, assets and savings can give an insight into the customer’s disposable resources to maintain or improve the property.

The cost of maintaining a house differs depending on the house, the region and the customer’s knowledge and capability. The average maintenance cost in Sweden for a single family home was estimated to be 368 SEK per square meter and year [23], an additional cost to the operating cost. The operating cost for a house is typically 40 000 - 50 000 SEK a year [24] or 129 - 300 SEK per square meter and year [23].

Thus, for a single family home of 140 m², the overall cost is around 90 000 SEK per year.

Other customer characteristics such as age and gender were included in Attributes as well. These might not seem relevant to the house value at first glance, but in contrast to the financial data, this type of data is available for all customers who have a mortgage with the bank. These characteristics can help describe and classify the customers. The table below summarises the customer characteristics that were included in the data set Attributes.


Customer Attributes (definition and unit)

1. Purchase: The purchase price that the customer bought the house for. (SEK)
2. Time: The amount of time the customer has resided in the house. (Years)
3. Age: The age of the customer when the property was sold. (Years)
4. Gender: The customer's gender. (class)
5. Region: The type of region the house resides in; the descriptions are found in table 2.1. (class)
6. Income: The customer's income after tax reductions. (SEK)
7. Debt: The customer's debt to the bank per month. (SEK)
8. Account Balance: The customer's average monthly account balance. (SEK)
9. Delayed Payments: The number of days a payment to the bank on any commitment has been late. (days)
10*. D/I: Debt to Income ratio. (-)

Table 4.1: Customer attributes

The first five attributes in table 4.1 are constants or represented by a single value. These are available for every customer with a mortgage and did not impose any further limitations when the data set was constructed. Since two of these attributes, Gender and Region, are qualitative data, dummy variables will be created for them in the models. This works well for Gender as it is binary. However, it becomes quite tedious for Region as it has eleven different classifications. Due to this, Region will be classified into three new categories, which depend on the performance of the currently used RPPI. These three new categories are

1. The top three regions that the current RPPI overestimates.

2. The top three regions that the current RPPI underestimates.

3. The rest of the regions.

Attribute 10* D/I is a constructed parameter that originates from the two other


attributes Debt and Income as their relation was deemed relevant to the financial character of the customer.

4.2.1 Time dependent attributes

The financial attributes 6-10 change over time and are updated once a month, if there is available information. For example, the Account Balance and Debt are updated regularly, whereas the update frequency of Income can vary from months to years.

As stated previously, the amount of available information varies depending on the type of customer.

In order to avoid the problems that data sets with missing values entail, only the customers for whom the financial attributes 6-10 were continuously updated during the time span of the mortgage agreement were included. Continuously updated means in this case that there are at least four updates per year. The time span varies depending on the customer, as it is the time from the purchase date to the sell date for each individual house. However, due to the long time horizon, every observation of every time dependent variable for each customer cannot be taken into consideration. This would generate an enormous number of data points, which is in the worst case computationally infeasible and in the best case time consuming. The time dependent data can therefore be treated by parameters describing the behaviour over time, such as:

1. Compute and use the mean and standard deviation of each attribute.

2. Extract a specified, limited amount of observations over time.

The simplest method would be to only use the mean and standard deviation of each attribute. This would result in only two observations for each attribute per customer. The standard deviation can be considered a measure of volatility and gives an indication of the behaviour over time, but it does not capture the complete behaviour. The account balance and the debt for three customers (A, B and C) are illustrated in figure 4.3.


(a) Account balance over time for customers A, B and C.

(b) Exposure over time for customers A, B and C.

Figure 4.3: Time dependent attributes. The period considered for each customer is the time period the customer resided in the house.

In figure 4.3, each customer’s account balance and exposure have been been scaled with respect to their individual initial value, which is set to one, in order to facilitate the visualisation. The account balance for the three customers behaves differently for each customer, a behaviour that cannot be captured by only using mean and standard deviation.

One idea was to use the data from the latest years and treat them as predictors in the model, meaning that each attribute would have several different measurements.

However, there is a high risk of multicollinearity among these predictors, something that should be avoided. Another proposal was to treat the attributes as individual time series; however, this was not a possibility as the behaviour differed radically from customer to customer and, moreover, did not meet the criteria for time series analysis. An alternative was to consider the rate of change, $R_t$, of each attribute, taking several periods into consideration in order to spot a short-term trend and a long-term trend. This is a technical indicator that is used for analysing stock markets [25].

The rate of change is simply the difference between attribute values in percentage [25]:

$$R_t = \frac{\text{balance}_t - \text{balance}_{t-n}}{\text{balance}_{t-n}}, \quad (4.1)$$

where $t$ is the sell date and $n$ is the length of the comparison period, with $t > t - n$. By constructing two different rates of change, one for a shorter time period, $R_t^s$, and one for a longer, $R_t^l$, one can capture some of the changes seen in figure 4.3.
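A sketch of how the short- and long-horizon rates of change in (4.1) could be derived from a monthly attribute series is shown below. The series, the horizons (3 and 24 months) and the variable names are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
months = pd.date_range("2015-01-31", periods=60, freq="M")
balance = pd.Series(50_000 + np.cumsum(rng.normal(0, 2_000, size=60)), index=months)

def rate_of_change(series: pd.Series, horizon_months: int) -> float:
    """Equation (4.1) evaluated at the last observation (the sell date)."""
    latest, earlier = series.iloc[-1], series.iloc[-1 - horizon_months]
    return (latest - earlier) / earlier

features = {
    "balance_roc_short": rate_of_change(balance, 3),    # short-term trend
    "balance_roc_long": rate_of_change(balance, 24),    # long-term trend
}
print(features)
```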

The attribute that was exempted from the above treatment was Delayed Payments


as this attribute behaves differently. For the majority of the population, this attribute is either zero or oscillates around zero. However, there is an interest in knowing when this attribute changes from hovering around zero to increasing in value, since delayed payments are a source of default [3]. Delayed Payments is therefore not measured by rate of change, but rather by how often it exceeds a threshold value, as well as by its value at the sell date. The threshold value is taken to be 30 days in this thesis, as it can indicate a non-performing exposure [26].

4.3 Data treatment

For modelling purposes, the data sets need to be treated further. The time horizon is set between 01/01/2005 and 30/06/2020. The long horizon is due to the fact that homeowners tend to live in the same house for a long period of time, on average around 10 years [27]. However, too long a horizon can result in noisy data, as the housing market and administrative policies change over time. Furthermore, disturbing outliers or observations that do not represent a healthy housing market need to be excluded as well. Examples of such data points are houses that are bought or sold for 1 SEK. In order to determine which points to exclude, the sell prices in the data set HousePrice were plotted.

As can be seen from the figures below, the sell prices for both the entire population and the defaulted population resemble normal distributions with a long, heavy right tail.

Figure 4.4: Density plot of the entire house price population. The filled area represents 98% of the population.

Figure 4.5: Density plot of the defaulted house price population. The filled area represents 98% of the population.

The data points that lie below the 1st and above the 99th percentile are considered


to be damaging outliers, in accordance with Glennon and Kiefer [14], and were excluded for modelling purposes. The filled areas in figure 4.4 and figure 4.5 represent the data points between these percentiles. However, these thresholds differ depending on which population is considered. If the entire house price population is considered, prices below 190 000 SEK and above 7 800 000 SEK should be removed. However, with a lower limit of 190 000 SEK, the 10th percentile of the defaulted house prices is dismissed. In order to avoid bias due to the exclusion of defaulted houses [6], the cut-off values from the defaulted population will be applied, that is, sell prices above 36 540 SEK and below 8 282 000 SEK.

Lastly, observations corresponding to sell prices that are over three times the estimated value were also treated as extreme points and thus removed.
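The exclusion rules of this section can be expressed as a simple filter. The sketch below shows one way to do so with pandas, using percentile cut-offs computed from the defaulted sub-population as described above; the data frame and its column names are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
prices = pd.DataFrame({
    "sell_price": rng.lognormal(mean=14.6, sigma=0.5, size=10_000),
    "rppi_valuation": rng.lognormal(mean=14.6, sigma=0.4, size=10_000),
    "defaulted": rng.random(10_000) < 0.02,
})

# Percentile cut-offs taken from the defaulted population to avoid excluding distressed sales.
defaulted = prices.loc[prices["defaulted"], "sell_price"]
lower, upper = defaulted.quantile([0.01, 0.99])

kept = prices[
    prices["sell_price"].between(lower, upper)
    # drop sales that exceed three times the estimated (RPPI) value
    & (prices["sell_price"] <= 3 * prices["rppi_valuation"])
]
print(f"kept {len(kept)} of {len(prices)} observations "
      f"(cut-offs: {lower:,.0f} SEK to {upper:,.0f} SEK)")
```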

4.3.1 Data sets: for training and testing

Imposing the restrictions from the customer data, the time horizon and the elimination of extreme values yielded a data set containing almost 45 000 observations, where approximately 2% correspond to defaulted customers. From this data set, 70% will be used for modelling purposes and 30% for testing. Moreover, a test data set containing only the defaulted trades is created from the full test set and is called the defaulted test set.
