

DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Bankruptcy Distributions and Modelling for Swedish Companies Using Logistic Regression

JACOB EWERTZH


Bankruptcy Distributions and Modelling for Swedish Companies Using Logistic Regression

JACOB EWERTZH

Degree Projects in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics (120 credits)
KTH Royal Institute of Technology, 2019

Supervisor at Sveriges Riksbank: Jesper Lindé
Supervisor at KTH: Jimmy Olsson
Examiner at KTH: Jimmy Olsson


TRITA-SCI-GRU 2019:075
MAT-E 2019:31

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This thesis discusses the concept of bankruptcy, or default, for Swedish companies. The actual default distribution over time is considered both on an aggregate level and within different industries. Several models are constructed to describe the default frequency as accurately as possible. Logistic regression models are mainly designed for this purpose, but various other models are considered, some of them for comparison and some with the ambition of producing the most accurate model possible. A large data set of nearly 30 million quarterly observations, covering both micro- and macroeconomic data, is used in the analysis. The derived models cover different time periods, consider different variables and display varying levels of accuracy.

The most exact model is a logistic regression model considering both micro and macro data. It is tested both in sample and out of sample and performs very well in both areas. This model is first estimated on a subset of the data set, to allow comparison with a real scenario, and an equivalent model is then estimated on the whole data set to describe future scenarios as well as possible. Here Vector Auto-Regressive (VAR) models, and empirical models constructed by OLS regression estimating the firm values, are used in combination with the logistic regression model to predict the future. All three models are used to describe the most likely scenarios, as well as the worst case scenarios. From the worst case scenarios risk measures, such as the empirical Value at Risk, can be derived. From all this analysis the most significant results are compiled.

Namely, the logistic regression model performs remarkably well both in-sample and out-of-sample when macro variables are taken into account. The future results are harder to interpret, yet the analysis supports the accuracy of the predictions and indicates a continued low default frequency within the next year.


Sammanfattning

Svensk titel: Konkursfördelning och Modellering för Svenska Företag Genom Användning av Logistisk Regression

This thesis treats the concept of bankruptcy for Swedish companies. The actual default distribution over time is analyzed, both at an aggregate level and within different industries. Several models are constructed with the aim of describing the default distribution as well as possible. Logistic regression models are mainly designed for this purpose, but other types of models are included in the analysis. Some of these models are created for comparison, but also in order to produce as exact a model as possible. A large data set of nearly 30 million quarterly observations is used in the analysis, including both micro- and macroeconomic factors. The derived models cover different time periods between 1990 and 2018, take different factors into account and display varying levels of accuracy. The model with the highest explanatory power is a logistic regression model that takes both micro- and macroeconomic factors into account. It is analyzed both in and out of its sample period, with good results in both areas. The model is first estimated on a subset of the time period, so that the predicted distribution can be compared with an actual one. An equivalent model is then estimated on the whole interval, to predict future scenarios as well as possible. For this purpose the logistic regression model is combined with Vector Auto-Regressive (VAR) models that predict the macroeconomic factors, and empirical regression models that predict the microeconomic factors. All three model types are used to describe the most likely scenario, as well as the worst-case scenarios, from which risk measures such as the empirical Value at Risk can be derived. The most important results are that the logistic regression model that takes macroeconomic factors into account performs well both in and out of the sample period, and that, although the simulated future results are harder to interpret, the analysis supports the accuracy of the predictions and points to a likely future scenario with a continued low default frequency within the next year.


Acknowledgements

I would like to address a big thank you to Riksbanken, and a special thanks to my supervisor there, Jesper Lindé, for introducing me to the topic, accepting me to do this research, and providing all necessary data and a work station, along with additional thoughts when needed. Moreover, a big thank you to his department, the department of research, and for all the valuable discussions with colleagues there, particularly Anna Schekina for introducing me to the data and always trying to help out with any questions.

Further, I would like to thank my supervisor at KTH, Jimmy Olsson, for relevant feedback and discussion on the project and especially on the report. A thank you to Edvin Hedblom and Rasmus Åkerblom for great discussions surrounding the project and report, and for relevant feedback on the writing. Lastly, I would like to thank my family for supporting me during my full time at KTH.


Contents

Abstract
Sammanfattning
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Outline
  1.4 Limitations

2 Data
  2.1 General data information and definitions
  2.2 Micro data
  2.3 Macro data

3 The Logit Model
  3.1 Model selection motivation
  3.2 Logistic regression
  3.3 The models produced
    3.3.1 Models for 1990-2009 with macro variables
    3.3.2 Models for 1990-2009 without macro variables
    3.3.3 Models for 1990-2018 with macro variables
    3.3.4 Models for 1990-2018 without macro variables
    3.3.5 Ranking of coefficients in models
    3.3.6 Discussions regarding the coefficients derived
    3.3.7 Error in data affecting the models
  3.4 Comparison and validation
    3.4.1 Comparison of the models derived
    3.4.2 Comparison to OLS models without dummies
    3.4.3 Comparison to Logit models without dummy and macro variables
    3.4.4 Comparison to OLS models for macro variables
    3.4.5 Testing alternative to macro variable inclusion
    3.4.6 Motivation behind selection of macro variables

4 Default modelling
  4.1 Comparison with real scenario
    4.1.1 Further motivation behind selection of macro variables
    4.1.2 Results for Logit models 1990-2009
  4.2 Evaluation of prediction
  4.3 For future analysis
    4.3.1 The VAR models
    4.3.2 Shock analysis
    4.3.3 Models for future firm values
  4.4 Main outcomes from default modelling

5 Results from future analysis
  5.1 Forecasting default frequency
  5.2 Risk measures
  5.3 Main outcomes from future analysis

6 Conclusions

References

7 Appendix A: Tables for evaluation or statistics


List of Figures

2.1 The cumulative distribution of the key financial ratios selected (smooth lines), plotted against the default rates (jagged lines) computed as a moving average of plus/minus 5000 adjacent observations. Using data from 1990-2009. Figure from analysis by Lindé et al. [2]. The default rate is in percent.
2.2 The macro data used is presented, from 1990 Quarter 1, to 2018 Quarter 2.
3.1 Examples of the logistic response function, from Montgomery et al. [13].
3.2 Continuously updated firm specific ratios over the years 1990-2009.
3.3 Continuously updated firm specific ratios over the years 1990-2018 quarter 2.
4.1 The actual and projected default frequency, a comparison between using the output gap or GDP growth as a macro variable.
4.2 A comparison between the actual and projected default frequency. Upper plot using the output gap and lower plot the GDP growth.
4.3 A comparison between the actual and projected default frequency, with in-sample period 1990-2009 and out-of-sample period 2010-2018 quarter 2.
4.4 A comparison between the actual and projected default frequency, with in-sample period 1990-2009 and out-of-sample period 2010-2018 quarter 2. For all industries and economy-wide model.
4.5 A figure including four subplots of estimated default percentiles versus actual default percentiles and a 45-degree slope indicating the perfect prediction.
4.6 The comparison of bootstrapping procedure for foreign macro variables, here the GDP growth is on display.
4.7 The comparison of bootstrapping procedure for domestic macro variables, here the GDP growth is on display.
4.8 A shock analysis on a sudden increase of the aggregate default frequency and impulse response for the domestic macro variables.


List of Tables

3.1 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2009.
3.2 A table presenting the constants derived for the Logit model of the industry specific analysis over the years 1990-2009.
3.3 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2009, without macro variables in the analysis.
3.4 A table presenting the constants derived for the Logit model of the industry specific analysis over the years 1990-2009, without any macro variables considered in the analysis.
3.5 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2018 quarter 2.
3.6 A table presenting the constants derived for the Logit model of the industry specific analysis over the years 1990-2018 quarter 2.
3.7 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2018 quarter 2, without macro variables in the analysis.
3.8 A table presenting the constants derived for the Logit model of the industry specific analysis over the years 1990-2018 quarter 2, without any macro variables considered.
3.9 A table describing the ranking of regressors using t-statistics and marginal effects of the data from 1990-2009 in the economy-wide model.
3.10 A table containing the OLS parameters of the data from 1990-2009 in the economy-wide model.
3.11 A table containing the OLS parameters of the data from 1990-2018 quarter 2 in the economy-wide model.
3.12 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2009, without macro variables and dummy variables in the analysis.
3.13 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2018 quarter 2, without macro variables and dummy variables in the analysis.
3.14 A table containing the OLS parameters of the data from 1990-2009 in the economy-wide model, using macro variables and intercept only.
3.15 A table containing the OLS parameters of the data from 1990-2018 quarter 2 in the economy-wide model, using macro variables and intercept only.
4.1 A table presenting the RMSE out-of-sample for six different time series which try to predict the aggregated default frequency within each industry and in total.
4.2 A table presenting the Diebold-Mariano test statistics for five different quotient, based on six different time series which try to predict the aggregated default frequency within each industry and in total.
4.3 A table presenting the RMSE-quotient out-of-sample for five different quotient, based on six different time series which try to predict the aggregated default frequency within each industry and in total.
5.1 The forecasted aggregate default frequency, j quarters ahead in time from 2018 quarter 2. Presented in percent.
5.2 The forecasted aggregate default frequency, j quarters ahead in time from 1998 quarter 1. Presented in percent.
5.3 The forecasted aggregate default frequency, j quarters ahead in time from 1991 quarter 1. Presented in percent.
5.4 The absolute and relative value at risk, presented as the aggregate default frequency in percent. Performed j quarters ahead in time from 2018 quarter 2.
5.5 The absolute and relative value at risk, presented as the aggregate default frequency in percent. Performed j quarters ahead in time from 1998 quarter 1.
5.6 The absolute and relative value at risk, presented as the aggregate default frequency, in percent. Performed j quarters ahead in time from 1991 quarter 1.
7.1 A table presenting the constants derived for the Logit model of the economy-wide analysis over the years 1990-2009. Comparing annual and quarterly data.
7.2 A table presenting the statistics for the firm specific coefficients of the Logit models over the years 1990-2009. This is for non-defaulted firms.
7.3 A table presenting the statistics for the firm specific coefficients of the Logit models over the years 1990-2009. This is for defaulted firms.


1 Introduction

In this section a brief background on the phenomenon of bankruptcy, or default, of a company and its consequences is given. A motivation of the model type and variables is also given, with previous work as a foundation, followed by a justification of the need for research in this area. Then an outline of the subsequent sections is presented, followed by the limitations of this thesis.

1.1 Background

Business default is a frequent occurrence in Sweden and in the world. It is of great importance in economics and risk management, since it impacts not only the concerned company and its employees, but also the financial sector, the banking industry and possibly the whole economic state of a country. Considerable previous work exists on the subject ([1], [2], [5], [4], [6], etc.), but the analysis is nonetheless not complete. In certain cases it is easy to predict whether a company will default or not, but studying every company in an economy makes it almost impossible to maintain an accurate prediction. This can be explained by the time perspective, where economic situations vary with time both locally and globally, and the current state might be unique when compared to former states. Since economic states often have a cyclical behaviour, the suggestion of a unique economic situation might be considered false: it is more likely than not that the present rates, supply and demand, and recession or boom have appeared previously. The formulation of a unique economic situation instead refers to broader economic conditions combined with detailed company specific variables, and in this case the situation is more likely to be unique. The reason for this is, among other things, the number of companies existing at a certain time and the industries in which these companies operate, providing a set of variables not likely seen before. Therefore, updates of the business default models and distribution are of great importance. The models must be updated for the present state, including current economic conditions as well as company classifications and variable selections, to best explain the present economic time.

Recent economic times have been stable with a steady, positive trend since the last economic crash and the following disturbance in 2008-2009. Is this well reflected in the bankruptcy distribution? Regardless of whether this is the case or not, a model describing this bankruptcy distribution, ideally a precise one, is to be derived. The model will most likely have factors depending on the economic state in Sweden, leading one to believe that good economic times should be reflected in the derived bankruptcy distribution.

Both macroeconomic factors and individual firms' specific coefficients are going to be used throughout the analysis, to provide the best possible modelling of default. In several previous studies macro variables have proven to be of high importance (Lindé et al. [1], Korol [18], Macerinskiene [19]) and should therefore be included in the analysis. In addition, the influence of macro variables and firm-specific coefficients will be investigated, to better understand the dynamics of business defaults. Next the focus is turned to what types of models can or can not be used for the necessary analysis.

In section 2 the data set will be described completely. It consists of information on both privately held and listed firms that fulfill certain criteria, which in total constitute nearly all incorporated Swedish firms.

For the analysis this means that the Merton model and similar models (Merton [20], [21] and [22]) can not be used. These approaches model the company's equity as a call option on its assets, through the Black-Scholes option pricing model, providing a structural relationship between the assets of a firm and its probability of default. Thus, they can not be implemented here, since they rely on equity price information and hence are limited to listed firms. However, the inclusion of privately held firms makes the analysis legitimate on a wider economic basis, not least because such firms usually constitute over half of the gross domestic product in mature economies, a fact proven in several studies, e.g. by Asker et al. [23] and by Durand et al. [24]. Therefore the Merton-like models are insufficient in this analysis to provide an accurate model for the probability of default. As Lindé, Jacobsson and Roszbach [2] mention, both the aggregate and individual default frequencies show co-movement with macro variables and financial variables. This has been shown to hold over time, and it is therefore important to process both types of factors in an accurate model.

Bernanke, Gertler and Gilchrist present fundamental groundwork [3]. The authors show that both aggregate shocks and idiosyncratic shocks have an impact on firm default. An aggregate shock in this situation is a shock affecting the economy as a whole, most likely through a macro variable, while an idiosyncratic shock is a smaller, more firm or industry specific shock, only affecting a certain type of business or firm. This suggests that macroeconomic factors and firm specific factors both influence an individual firm's default probability, which is a strong motivation for including both sets of factors in an empirical default model. This inclusion has been shown to matter several times since then, as in the work by Hackbarth, Miao and Morellec [4], where it is shown that not only idiosyncratic shocks but also aggregate shocks have an impact on the debt level of a firm. Moreover, the default boundary is affected by both types of shocks, meaning that firms can default as a consequence of either, and it is furthermore shown that several firms can default simultaneously if the defaults are generated by an aggregate shock. Thereby the theory is clear, but the analysis also has to be advanced to empirical studies.

The proposition that macroeconomic factors have a vital impact on default modelling, rather than relying solely on firm specific factors, had to be tested. Such studies barely existed before 2005. However, since then more empirical confirmation has been found, including the work of Tang and Yan [6], where it is clear that the macroeconomic factors play a large role in explaining default. It is shown that using only firm specific factors entails a large loss of information. In that study default risk is measured by credit spreads and default probabilities, and the difference in explanatory power when including macroeconomic factors is significant.

From the previous paragraphs it is clear which sets of factors are necessary for the analysis.

Further, the question is how to model the default probability with the inclusion of these factors. As previously stated, no Merton-like model can be used since privately held firms will be included. Moreover, a popular method used in the past was single period classification models, or as Shumway [5] refers to them, static models. These are now outdated, mostly because they neglect that firms change over time, and therefore introduce biased and inconsistent evaluations of bankruptcy probability.

The explanation for this is that bankruptcy takes place infrequently, so to forecast it one must gather information over a long time period, while the characteristics can only take on one value in static models.

These values have to be selected by the investigator, leading to a biased model. Also, different researchers might pick different characteristic values, leading to inconsistency. Shumway solves this by developing a discrete-time hazard model that uses all available data. Under moderate restraints this is shown to be equivalent to a multi-period logistic regression model, or Logit model, which is the model implemented in this thesis.

More details will be presented in section 3, named the Logit model, but like the hazard model it avoids the bias and inconsistency previously seen in static models. In the same section more modern approaches will be discussed as well.
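For reference, a minimal sketch of the multi-period logistic regression specification referred to here; the notation is illustrative, and the exact regressors and estimation details are those given in section 3. With y_{it} the default indicator of firm i in quarter t and x_{it} the vector of firm specific and macroeconomic regressors,

P(y_{it} = 1 \mid x_{it}) = \frac{\exp(\beta' x_{it})}{1 + \exp(\beta' x_{it})}

so that the log-odds of default are linear in the regressors.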

The data set used in the analysis in this thesis is based on real world data, meaning that the results from the analysis will be applicable to the real world too. With an accurate model produced, it will be possible to model the actual bankruptcy distribution in the near future. This provides results which can be used to better understand the economic situation that lies ahead, and more knowledge about the near future can be crucial in economics and risk management. One advantage is the possibility to develop a damage minimization strategy, especially in the event of an economic crash. The more knowledge about the default rate in general, and the bankruptcy risk for specific companies, the higher the chances of predicting bankruptcy and possibly preventing its negative effects. To provide more information about the future state and further aid the risk minimization procedure, risk measures can be derived. In this thesis one of the most common risk measures will be used, the Value at Risk, which contributes information about the worst case scenarios. In this context, a high default rate is categorized as a worst case scenario. The Value at Risk can be derived for the general model, but is perhaps even more advantageous for the industry specific models. If the risk measure derived indicates a possible future disaster, then changes must be implemented as soon as possible to avoid this scenario.
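As a minimal illustration of the empirical Value at Risk used later, the sketch below computes it as an upper quantile of simulated aggregate default frequencies; the simulated array and the confidence level are placeholders, not the thesis's actual forecasts.

```python
import numpy as np

# Hypothetical simulated aggregate default frequencies (in percent); in the
# thesis these would come from the combined Logit/VAR simulations.
rng = np.random.default_rng(0)
simulated_default_freq = rng.normal(loc=0.4, scale=0.1, size=10_000)

def empirical_var(samples: np.ndarray, level: float = 0.95) -> float:
    """Empirical VaR: the default frequency exceeded in only (1 - level) of the scenarios."""
    return float(np.quantile(samples, level))

print(f"95% empirical VaR: {empirical_var(simulated_default_freq):.3f} percent")
```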


1.2 Purpose

As previously discussed, economic behaviour is cyclical. However, when considering the classification and inclusion of different companies, where firm specific variables are taken into account, the full set of variables might be new in its existence. Therefore, the analysis of default is in need of updates to stay accurate, which makes this thesis highly relevant. Since there exists previous work that has proved to perform well, it is wise to use it as a starting point. However, it is not expected that previous work can be replicated; it is still vital to update the factors to the present time to well reflect the economy of today.

Nevertheless, the methods used are most likely still suitable. The last similar analysis performed in Sweden was done in 2013 and handled data from 1990 to 2009, by Lindé et al. [2]. Since then the market has changed, macroeconomic factors as well as company specific factors are not the same, and many new companies have entered the market while some have defaulted. Consequently, it is not likely that the model derived then is still the best model available. Moreover, comparable analyses have been performed in other countries. These could also provide valuable information for comparison, although it is not probable that the derived model will be the same. To summarize, an update of the bankruptcy distribution in Sweden and a model for future bankruptcy estimation is needed.

The purpose of this thesis is not only to analyze the bankruptcy distribution over time for Swedish companies and then derive a model which best describes this distribution. As discussed earlier, the data sets used contain real world information and will therefore be applicable for actual future predictions. Moreover, the data sets are extensive both in the number of firms analyzed and in the time perspective: over 29 million data points are used, spanning nearly 30 years. This provides the possibility to perform an industry specific default analysis in addition to the general one. Here a similar model can be derived, but with different coefficients, to predict bankruptcy, allowing for more detailed predictions. If the accuracy of the produced industry specific models is high, they can be more useful for forecasting companies' risk of defaulting within certain sectors. This classification also makes it easier to detect whether certain areas of business are more in danger of default than others, which could then be flagged and perhaps prevented or mitigated.

Another purpose of this thesis is to include the Value at Risk in the analysis. This adds a risk management perspective which has not been applied before to similar data sets in Sweden. The intention of including a risk measure is to examine whether it can provide more understanding about the current economic situation, and principally whether it can help to mitigate or even prevent a future crash. The risk measure will be computed for the general situation and for specific industries. Here the two most interesting categories are the banking, insurance and finance sector and the real estate sector, two areas highly dependent on the economic situation.

This analysis is relevant since it has not been performed to this extent before and since it provides a lot of valuable information for risk management.

To summarize, the main purpose of this thesis is to produce a model which describes the aggregate default distribution for Swedish companies as well as possible. The analysis is performed on a large data set, mainly using logistic regression, and an update of previous models is necessary from an economic time perspective.

The main studies in this work will be:

• A thorough analysis of logistic regression models on an aggregate and industry specific level, where in-sample and out-of-sample performance is studied to determine the best performing model.

• Analysis regarding the importance of regressors in the logistic regression model, with main focus on the set of macroeconomic variables, but also performed on the firm specific ratios and dummy variables.

• Additional models developed for future forecasting and Value at Risk calculations, which have not been performed in previous work, are derived and combined with the best Logit model to perform future analysis.

1.3 Outline

The upcoming sections are structured as follows. First the data set, both micro and macro data, is described in detail in section 2. In section 3 the models used are presented: first some background on the types of models to be used is given, together with a thorough explanation of the models, then the empirical models produced are presented, and lastly they are justified by error testing and comparison.

Section 4 regards the default modelling. Here the out of sample analysis takes place, both for validation reasons and mainly for future predictions. Next is section 5, regarding the results from the future analysis, which includes forecasting and risk measures. The employed measure, the empirical Value at Risk, is described and computed for the produced models. Lastly the conclusions are given in section 6, consisting of the main results and a discussion around them, together with suggestions for further research in the area.

1.4 Limitations

The main reason for limitations in this thesis is, as in most master theses, lack of time. The data set is very large, which is good for validation purposes, but has the disadvantage of long processing time. Combined with extensive coding, a great time effort had to be devoted to the code. This also meant that testing of several model types was not possible within the given time frame. To some extent, different types of models could be derived, yet these were simpler models for comparison purposes. The majority of the time was spent on the main model used, the Logit model, which was chosen because it had proved to be accurate in previous work, although it would have been interesting to compare it with several other strategies. Also, if time were not an issue, one could perform a more detailed analysis of the ratios and factors used, and perhaps extend this list even further to generate an even more exact model.

Another limitation concerns the risk measures. This part could also have been extended with more time, but time was not the only reason for the restriction: the fact that this is an empirical study complicates the risk calculations. The Value at Risk is perhaps the most common risk measure, but it does not provide a full description of the risk. It would be interesting to add further measures, such as the Expected Shortfall and utility functions, for more detailed descriptions, but this is complicated without a distribution function, which is missing in these empirical studies.


2 Data

Here all the data used in the thesis is presented. The firm specific data is given in the micro data subsection and the macroeconomic factors are presented in the macro data subsection, but first some general information about the data and definitions is introduced. The data is processed at Riksbanken (the central bank of Sweden) and contains classified information. Therefore it was encrypted before the analysis could begin, as a safety procedure so that no firm specific information could be leaked, which at the same time removes any risk of subjectivity. Upplysningscentralen AB, the leading credit bureau in Sweden, provided the data to Riksbanken. This bureau is independently operated but jointly owned by the Swedish commercial banks. All data given to Riksbanken by Upplysningscentralen originates from two sources. The first is the Swedish Companies Registration Office, Bolagsverket, to which annual reports are required by law to be submitted; this data consists of these reports, including balance sheets and income statements.

The time period for these reports is 1989 to 2018, and the format follows EU standards.

2.1 General data information and definitions

Regarding the annual reports, linear interpolation is performed to obtain quarterly data. This means that an assumption of constant variables within a report period is made. This is done because quarterly observations give a more detailed forecast than annual observations, and allow for more interesting interpretations. The only concern with this transformation is that accounting variables might be underestimated in the analysis. To assure that this is not the case a robustness check is performed, where variables are estimated from quarterly data and from annual data. The results are very similar, leaving no concerns with this transformation. The results comparing annual and quarterly observations are found in Appendix A, Table 7.1, where the derived coefficients all have the same signs and approximately the same magnitude, and therefore the same impact. If anything, the accounting variables are smaller relative to the other variables in the annual case, meaning that no concerns about the transformation to quarterly data remain.
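As a minimal sketch of this expansion to quarterly frequency, assuming a hypothetical single-firm table of annual reports with illustrative column names (the actual variable names and interpolation details are those of the thesis's data set):

```python
import pandas as pd

# Hypothetical annual report values for one firm, indexed by report date.
annual = pd.DataFrame(
    {"total_assets": [1_000.0, 1_200.0], "total_liabilities": [600.0, 660.0]},
    index=pd.to_datetime(["2016-12-31", "2017-12-31"]),
)

# Expand to quarterly observations, holding each report's values constant
# within the report period (as assumed above); .interpolate() could be used
# instead of .ffill() for a strictly linear interpolation between reports.
quarterly = annual.resample("QE").ffill()  # "QE" is the quarter-end alias (pandas >= 2.2)
print(quarterly)
```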

The second source is the credit bureau, Upplysningscentralen, itself. It consists of information regarding firms' payment behaviour from all proper sources, including the Swedish tax authorities, Swedish retail banks, and institutions handling the legal formalities of firms' bankruptcy processes. The gathered information consists of more than 60 different payment remarks. A remark regards credit and tax related matters, together with documents of numerous steps in the legal procedures preceding a formal bankruptcy. Examples of such remarks are delays in tax payments, seizure of property, restructuring of loans, or actual bankruptcy. The advantages of these payment remarks are firstly that they have proven to be a powerful predictor of default in previous reports by Riksbanken [1] and [2], and secondly that they are applicable in real time. The disadvantage of the remarks is that they are quite unique: generally this information does not exist outside of Sweden, at least not to the same extent. The effect of this is that it weakens comparability with studies outside of Sweden, and that it is probably not possible to extend the analysis outside of Sweden.

In the analysis a total of 29,128,400 observations is used. It is composed of quarterly observations on Swedish aktiebolag, the equivalent of US corporations or UK limited companies. By Swedish law, an aktiebolag must have a minimum of 100,000 SEK of equity to qualify for registration at Bolagsverket.

Moreover, to qualify as active by Riksbanken a firm must have reported at least 1,000 SEK in both total sales and total assets every quarter. Thereby a definition of firms to be included in the data set can be stated.

They must satisfy the condition of being classified as active by Riksbanken and issue a financial report covering the current quarter. The data covers the first quarter of 1990 to the second quarter of 2018; observations after this quarter were very few at the start of the analysis and were therefore left out to keep the accuracy of the predictions. The number of firms included was about 175,000 in 1990 and rose steadily to around 375,000 at the end of the time period.

Even though a clear definition of whether a firm should be included in the analysis is stated, some problems occur. Typically, firms in financial distress tend to miss or neglect to file a financial report. By excluding these firms a lot of potential defaults would be omitted. Therefore, in order not to miss these firms and distort the statistics, firms that are classified as defaulted according to payment remarks are included. Again the payment remark documents provide vital information; without them a lot of defaulted firms would be missed. Since some firms choose not to file a financial statement in the year of, or the years before, a default, the payment remark archives are the only records of these firms.

The next definition to be declared is the one regarding default. A firm is classified as defaulted if any one or several of the following statements are fulfilled: the firm is declared legally bankrupt, the firm has negotiated a debt composition settlement, the firm is undergoing a reconstruction, the firm has suspended payments, or the firm has lost assets via distress. This rather complex definition is a combination of variables provided by Upplysningscentralen, where remark types and major distress events are registered and stored. If a firm has any or several of the indicator variables for default satisfied in a quarter, it is classified as defaulted in that quarter, and leaves the data set in the next quarter, unless the unlikely event of a bankruptcy cancelled by court occurs. If that is the case, which happens a few times in the data set, then the firm is not classified as defaulted and is kept for the remainder of the quarters, or until another default is indicated. The most common reason for a registered default is by far being declared legally bankrupt, yet the other statements are still included in the definition of default to capture all defaults.
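A minimal sketch of this classification rule, assuming a hypothetical firm-quarter table with one boolean column per distress indicator; the column names are illustrative only and the court-cancellation handling is simplified to the rule described above:

```python
import pandas as pd

# Hypothetical indicator columns corresponding to the default definition above.
INDICATORS = ["legally_bankrupt", "debt_composition_settlement",
              "reconstruction", "suspended_payments", "assets_lost_via_distress"]

def classify_default(panel: pd.DataFrame) -> pd.Series:
    """Flag a firm-quarter as defaulted if any indicator is set,
    unless the bankruptcy was later cancelled by court."""
    any_indicator = panel[INDICATORS].any(axis=1)
    return any_indicator & ~panel["bankruptcy_cancelled_by_court"]
```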

Before turning to the micro and macro data, one more definition is necessary. It was previously stated that the analysis will be performed on the whole data set, referred to as the economy-wide model, and on an industry specific level. Here a definition of industries is needed. A division into 10 categories is made, in agreement with the data obtained from Upplysningscentralen. A firm will belong to one of the businesses: "Agriculture", "Manufacturing", "Construction", "Retail", "Hotel and Restaurant", "Transport", "Bank, Finance and Insurance", "Real Estate", "Consulting and Retail" or "Not Classified" as a residual industry.

Consequently the industry classification is set. This is in line with the Standard Industrial Classification (SIC) system, but using a one digit code instead of four, leading to a more general description without details in each industry group.

2.2 Micro data

A selection of firm specific data is an important part of the default analysis. Key financial ratios from the firm specific data are to be chosen; these constitute the micro data, namely key financial ratios for the firms included in the data set. The analysis is based on the selection of ratios in the work by Lindé, Jacobsson and Roszbach [2], who present a thorough analysis of which financial ratios are decisive for default predictions. In their work, the authors first evaluate financial ratios used in previous studies, from 1968 to 2008. To qualify for the analysis, a ratio must have been used repeatedly for bankruptcy risk or default calculations. The measures cover liquidity, efficiency and profitability, leverage and solvency, occasionally a size variable, and finally a residual category. To qualify for selection a measure must pass two tests. The first is a comparison of the univariate relationship between the ratio and default risk.

This is done by plotting the cumulative distribution of each ratio against the default rate of an observation, computed as a moving average over plus/minus 5000 adjacent observations in the empirical distribution of the ratio, as seen in Figure 2.1. From this, ratios uncorrelated with the default rate could be eliminated from the analysis, and the correlated ones kept. The second test regards explanatory power: the selected ratios are inserted one at a time into the model to predict bankruptcy, and a ratio that does not bring enough explanatory power is dropped. Variables like total sales and age were dropped in this way in the previous work.
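A minimal sketch of this first screening step, assuming hypothetical arrays of ratio values and default indicators; the window of plus/minus 5000 observations matches the description above, everything else is illustrative:

```python
import numpy as np

def default_rate_by_ratio(ratio: np.ndarray, defaulted: np.ndarray,
                          half_window: int = 5000) -> np.ndarray:
    """Default rate (in percent) as a centered moving average over
    +/- half_window adjacent observations, ordered by the ratio value."""
    order = np.argsort(ratio)
    d = defaulted[order].astype(float)
    window = 2 * half_window + 1
    kernel = np.ones(window) / window
    return 100.0 * np.convolve(d, kernel, mode="same")
```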

There are two main reasons for using the ratios selected by Lindé et al. [2]. Firstly, this thesis is based on an extension of the data used in their analysis, which suggests that the same ratios should have explanatory power; their data set consists of Swedish firms over the time period 1990-2009, while this analysis covers 1990-2018. The second reason is a limitation: due to time restrictions and the extensive coding needed for a new ratio analysis, which would likely produce very similar results to the existing one, no new selection of ratios is made. The six selected key financial ratios are listed below (a brief computation sketch follows the list):

• The earnings ratio, or earnings before interest, taxes, depreciation and amortization over total assets, EBITDA/TA
• The leverage ratio, or total liabilities over total assets, TL/TA
• The quick ratio, or liquid assets over total liabilities, LA/TL
• The inventory turnover ratio, or inventories over total sales, I/TS
• The debt ratio, or the natural logarithm of total liabilities over total average sales, ln(TL/TS)
• The interest coverage ratio, or interest payments over the sum of interest payments and EBITDA, IP/(IP + EBITDA) = CR
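The sketch below computes these six ratios from a hypothetical table of balance-sheet and income-statement items; all column names are illustrative and not the actual variable names in the data set (in particular, total sales stands in for the period-average total sales used in the debt ratio above):

```python
import numpy as np
import pandas as pd

# Hypothetical firm-quarter accounting data (illustrative values).
firms = pd.DataFrame({
    "ebitda": [120.0], "total_assets": [1_000.0], "total_liabilities": [600.0],
    "liquid_assets": [150.0], "inventories": [80.0], "total_sales": [900.0],
    "interest_payments": [30.0],
})

ratios = pd.DataFrame({
    "earnings_ratio":  firms["ebitda"] / firms["total_assets"],              # EBITDA/TA
    "leverage_ratio":  firms["total_liabilities"] / firms["total_assets"],   # TL/TA
    "quick_ratio":     firms["liquid_assets"] / firms["total_liabilities"],  # LA/TL
    "inventory_ratio": firms["inventories"] / firms["total_sales"],          # I/TS
    "debt_ratio":      np.log(firms["total_liabilities"] / firms["total_sales"]),  # ln(TL/TS)
    "coverage_ratio":  firms["interest_payments"]
                       / (firms["interest_payments"] + firms["ebitda"]),     # IP/(IP+EBITDA)
})
print(ratios.round(3))
```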

All chosen ratios except the debt ratio are presented in Figure 2.1. The debt ratio is not found in the figure simply because it was noted that the natural log-level of debt contains predictive power for default; therefore the total liabilities are included, but scaled by the average of total sales within each time period, to obtain the ratio TL/TS. In the figure it is seen that the default distribution is related to all ratios, either linearly or non-linearly. The leverage ratio and the inventory turnover ratio have a linear, positive relationship with default. The interest coverage ratio has a positive, linear relationship with default between the values of 0 and 1; outside this region the relationship appears nonlinear. The earnings ratio and the quick ratio have a negative relationship with default, non-linear for the earnings ratio and linear for the quick ratio. Finally, the ratio TL/TS has a nonlinear relationship with default.

To discard the financial ratios with a nonlinear relationship, because they are used linearly in the Logit model, would be to reject valuable information, since there is significant covariation between the ratios in the cross section. This means that each variable contributes information to the joint empirical model used to predict default. Moreover, the ratios' relationships with default are natural. General conclusions can be drawn, with some variation: overall, the higher the earnings, the lower the risk of default, and the same holds for liquid assets over total liabilities, while default should increase if liabilities increase in relation to assets or sales, as seen in the two subplots for these ratios. The positive connection also holds for inventories in relation to total sales. Lastly, the interest coverage ratio can be explained intuitively: if EBITDA is negative with an absolute value larger than IP, the default risk should be increased, which holds; if IP is much higher than EBITDA and both are positive, a higher default rate should also be present, which holds as well; and in between, the default rate is lower. No direct conclusion about the interest coverage ratio can be drawn, but the behaviour between 0 and 1 seems to be linearly and positively correlated with the default rate. For this reason, and since the non-linearity might have an impact, this ratio is kept in the analysis.

Figure 2.1: The cumulative distribution of the key financial ratios selected (smooth lines), plotted against the default rates (jagged lines) computed as a moving average of plus/minus 5000 adjacent observations. Using data from 1990-2009. Figure from analysis by Lindé et al. [2]. The default rate is in percent.

In addition to the key ratios, produced from balance sheet variables, some key dummy variables are constructed. The first one is a dummy variable which contains information on whether a firm has paid out dividend or not. It is named PAYDIV and equals 1 if a firm has paid out dividend within the current quarter and 0 otherwise. A complication described and dealt with earlier is that firms may fail to hand in a financial report every period, which gives missing observations. To handle this, it was decided to keep firms with missing values and replace the missing values by imputation. This is done to obtain an aggregate default series which closely resembles the one from official statistics in Sweden, and to obtain unbiased macro coefficients in the model. The imputation is a multi-step procedure where first a backward search and thereafter a forward search for the closest available value within the firm is made. If no such value is found, the value is replaced by a random draw from a pool of values within the same industry and default status. The imputation is very successful, with no missing values left afterwards. This is done since the focus lies on the aggregate default frequency rather than the individual one.
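A minimal sketch of this imputation order, assuming a hypothetical firm-quarter panel with illustrative column names; the backward/forward search and the random draw within industry and default status follow the description above, while all other details are simplified:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def impute(panel: pd.DataFrame, value_cols: list[str]) -> pd.DataFrame:
    panel = panel.sort_values(["firm_id", "quarter"]).copy()
    # Backward search (nearest earlier value), then forward search, within each firm.
    panel[value_cols] = panel.groupby("firm_id")[value_cols].transform(
        lambda s: s.ffill().bfill()
    )
    # Remaining gaps: random draw from observed values with the same industry and default status.
    for col in value_cols:
        def fill_from_pool(s: pd.Series) -> pd.Series:
            pool = s.dropna().to_numpy()
            if pool.size == 0 or not s.isna().any():
                return s
            s = s.copy()
            s[s.isna()] = rng.choice(pool, size=s.isna().sum())
            return s
        panel[col] = panel.groupby(["industry", "defaulted"])[col].transform(fill_from_pool)
    return panel
```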

To catch the suspected connection between failure to hand in a financial statement and default, a new dummy variable is introduced. It is called TTLFS, short for time to last financial statement, and equals 1 if a firm has not handed in a financial statement in the prior six quarters. The choice of six quarters matches the actual time lag related to this information: a financial statement for year t is typically available in quarter 3 of year t + 1, so by looking back six quarters the real time delay is accounted for.

Of course, if there exists a financial statement within the latest six quarters, then TTLFS equals 0. Additionally it equals 0 if a firm is new in the data set, to avoid a miscalculation in the third quarter of the first year of a firm's existence. Also, if a firm has never handed in a financial statement and defaults, TTLFS is set to 0. This is done to lessen rather than exaggerate the impact of the variable TTLFS, so that the impact of other ratios and variables is not neglected. TTLFS, intended to catch the connection between willingly not handing in a mandatory financial report and default, proves to be a vital variable for default prediction. This is seen in a robustness check: if one removes all observations with missing values, and thereby TTLFS, the explanatory power of the default model shrinks significantly.

There are two more dummy variables used in the analysis, but before presenting them a truncation procedure is explained. Since the division into industries does not split the data set evenly across categories, the number of observations for each variable varies with industry classification. This gives micro data sets of varying size, and therefore an outlier study needs to be undertaken. To avoid distortion of the estimated results, which would happen if severe outliers were included in the Logit model, a truncation must be performed. Here a Winsorization of the top and bottom 1% is done within each industry. This is essentially equivalent to removing the 1% of accounting data that is either too large or too small to be included in the domain. Performing this within each industry instead of on the economy-wide data set allows for dispersion and different variable statistics for each industry class, and therefore it is done on each industry data set. This truncation is an outlier treatment which sets all values of financial ratios in the highest or lowest 1% to the value at the 99th or 1st percentile. This means that no firms are erased from the data set, but some firms within each industry have their financial ratios altered to remove the largest outliers, and therefore possible errors, in the modelling.
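A minimal sketch of this per-industry Winsorization, assuming a hypothetical panel with an `industry` column and a list of ratio columns; names are illustrative only:

```python
import pandas as pd

def winsorize_by_industry(panel: pd.DataFrame, ratio_cols: list[str]) -> pd.DataFrame:
    """Within each industry, cap every ratio at its 1st and 99th percentiles."""
    panel = panel.copy()
    grouped = panel.groupby("industry")[ratio_cols]
    lower = grouped.transform(lambda s: s.quantile(0.01))
    upper = grouped.transform(lambda s: s.quantile(0.99))
    panel[ratio_cols] = panel[ratio_cols].clip(lower=lower, upper=upper)
    return panel
```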

The last two dummy variables were created using these statistics for the truncated data. The aim is to find payment remarks that lead to default as often as possible and are strongly correlated with default. Analyzing all the different payment remarks leads to the conclusion that the correlation between default and a remark is either irrelevant or simultaneous. Again the testing results from Lindé et al. [2] are used. From this, two dummy variables are constructed which show a strong association with predicting defaults. The first one is called PAYREMARK and equals 1 if one or several of the following remarks existed for a firm within the last year: "A bankruptcy petition", "The issuance of a court order - because of absence during the court hearing - to pay debt", "Having a non-performing loan", and/or "The seizure of property". If none of these remarks were present in the last year, then PAYREMARK equals 0. The variable TAXARREARS is much more straightforward and simply equals 1 if the firm was in diverse tax arrears within the last year and 0 if not. In Appendix A, Tables 7.3 and 7.2, the statistics of each firm variable are presented for both defaulted and non-defaulted firms. From these results it is clear that the correlation between the dummy variables and default exists: the mean and standard deviation of the dividend dummy are much higher for non-defaulted firms, while the other three dummies have much higher values for defaulted firms.

2.3 Macro data

Four different aggregate variables will be used in this thesis. The first regards the Gross Domestic Product (GDP) and is either the output gap, namely the deviation of the Swedish GDP from its trend value, or the GDP growth in Sweden, measured as the yearly growth for every quarter within the sample period; which of the two is used is determined in the subsequent sections. Second is the Repo rate, a short term nominal interest rate in Sweden set by Riksbanken. Third is the yearly inflation rate, calculated as the fourth difference of the GDP deflator, which quantifies the price level of all new, domestically produced, final goods and services in Sweden over the last year. Fourth and last is the real exchange rate. All time series are plotted in Figure 2.2. Noticeable is that the Repo rate was extremely high during quarter 3 of 1992, because Riksbanken raised the so called marginal interest rate to 500%, for a short time, to defend the fixed exchange rate during an economic downturn in Sweden. This is a unique event with a large effect on the Repo rate, which needs to be adjusted for.

Otherwise the importance of financial costs would be underestimated in the default analysis. The Repo rate is therefore adjusted by adding a dummy variable for the concerned quarter in the regression analysis, leading to a value of 9.8% instead of 38%. This can be seen as a form of outlier treatment, and is performed to obtain a smoother time series for the macro variable.

The output gap is calculated with Hodrick-Prescott (HP) filtering of the gross domestic product. The filter separates the trend of the series from its cyclical component, and the output gap is the deviation of GDP from this smooth trend. The sensitivity of the trend to short-term fluctuations is controlled by the coefficient λ, which here is set to its standard value for quarterly data, namely 1600. The GDP growth is simply the difference between the GDP today and one year ago, divided by the former value and multiplied by 100, to obtain the growth in percent over the last year. Moreover, the real exchange rate is measured as the nominal TCW-weighted exchange rate, multiplied by the TCW-weighted foreign CPI deflators, over the domestic CPI deflator.

Here TCW stands for trade competitive weights and CPI for consumer price index. Note that a higher real exchange rate entails depreciation. Consequently, a negative estimated coefficient for this variable, with all other variables kept constant, indicates that a reduction of the real exchange rate lowers the risk of default. During the time period examined here, the exchange rate is characterized by an upward trend.

This reflects gradual depreciation, and therefore detrending is performed; again, HP filtering is applied to obtain a stationary time series. The time series seen in Figure 2.2 are the adjusted ones. Examining them, it is clear that the output gap reflects past economic conditions well, with a recession in the early 1990s followed by an upturn and the tech boom, then some weaker times in the early 2000s, succeeded by a boom and then the crash in 2008-2009, and since then a steady rise up until today.
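A minimal sketch of the output-gap and GDP-growth construction, assuming a hypothetical quarterly GDP level series; the smoothing parameter λ = 1600 is the standard quarterly value named above, while the synthetic series and everything else are illustrative only:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

# Hypothetical quarterly GDP level series, 1990 Q1 to 2018 Q2.
quarters = pd.period_range("1990Q1", "2018Q2", freq="Q")
gdp = pd.Series(100.0 * 1.005 ** np.arange(len(quarters)), index=quarters)

# HP filtering with the standard quarterly smoothing parameter; the cyclical
# component is the deviation of GDP from its smooth trend, i.e. the output gap.
cycle, trend = hpfilter(gdp, lamb=1600)
output_gap = 100.0 * cycle / trend                 # in percent of trend

# Year-on-year GDP growth for every quarter, in percent.
gdp_growth = 100.0 * (gdp / gdp.shift(4) - 1.0)
print(output_gap.tail(), gdp_growth.tail(), sep="\n")
```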

[Five time series panels, 1990 Q1 to 2018 Q2: Output gap, Nominal interest rate, Inflation rate, Real exchange rate, and GDP growth.]

Figure 2.2: The macro data used is presented, from 1990 Quarter 1, to 2018 Quarter 2.

To justify the detrending procedure applied to the real exchange rate, a robustness check can be performed.


Here the results from the method used, the HP-filtering procedure, are compared to making no stationarity adjustment at all, by simply omitting the real exchange rate, and to a version where the real exchange rate is included but calculated as the percentage deviation around a constant mean. The mean is computed over the time period in question, either the whole period from 1990 quarter 1 to 2018 quarter 2, or the period 1990-2009. The mean is calculated as

\bar{Q} = \frac{1}{T} \sum_{t=1}^{T} Q_t \qquad (2.1)

and the real exchange rate, q_t, is then computed as the percentage deviation around this mean:

q_t = \frac{Q_t - \bar{Q}}{\bar{Q}} \qquad (2.2)

The differences in the default model between these procedures are very small. Comparing the coefficients for the key financial ratios over the time period 1990-2009 shows differences in the third or fourth decimal for the firm specific ratios, at most in the second decimal for the firm specific dummies (which are much larger, so a larger deviation is natural), and in the second decimal for the other three macro variables, which has a minimal impact. Therefore the conclusion is that the HP filtering does not have a large effect on the final results and is preferred over the unfiltered version, because using the filtered data results in a balanced data set for the logistic regression model, which is preferred in this analysis to avoid changed intercepts and thereby skewed probabilities of default. Another test, where the real exchange rate was removed from the analysis, was also performed. At first this seemed unproblematic, since no large changes in the coefficients of the other variables were found, and no decrease in the firm level coefficient of determination. However, the p-value in a null hypothesis test showed that the real exchange rate is not a superfluous variable and should be included in the analysis. In addition, the aggregate coefficient of determination was higher with the real exchange rate included, and it was therefore kept in the analysis.

To summarize, three sets of data are used in the analysis. The first set is the firm specific ratios: the earnings ratio EBITDA/TA, the leverage ratio TL/TA, the quick ratio LA/TL, the inventory turnover ratio I/TS, the debt ratio ln(TL/TS) and the interest coverage ratio CR. The second set is the firm specific dummy variables PAYDIV, TTLFS, PAYREMARK and TAXARREARS. The third and last set is the macroeconomic variables, including the GDP growth, the output gap, the Repo interest rate, the yearly inflation rate and the real exchange rate; only one of GDP growth and output gap is included, a choice made later in the analysis. The motivation behind the four macro variables selected is to have a set which explains the macroeconomic state in Sweden. The data is selected based on the information provided by Riksbanken and the analysis made by Lindé et al. [2]. Furthermore, the inclusion of each variable is tested on an individual basis in subsequent sections, providing additional arguments for the inclusion of all selected factors.
