
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Modeling risk and price of all risk insurances with General Linear Models

ELLINOR DRAKENWARD EMELIE ZHAO

KTH


Modeling risk and price of all risk insurances with General Linear Models

Ellinor Drakenward Emelie Zhao


Degree Projects in Applied Mathematics and Industrial Economics (15 hp)
Degree Programme in Industrial Engineering and Management (300 hp)
KTH Royal Institute of Technology, year 2020

Supervisor at KTH: Daniel Berglund
Examiner at KTH: Sigrid Källblad Nordin


TRITA-SCI-GRU 2020:121
MAT-K 2020:022

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

This bachelor thesis lies within the field of mathematical statistics. In collaboration with the insurance company Hedvig, it explores a new way of handling Hedvig's insurance data by building a pricing model for all risk insurance with Generalized Linear Models. Two Generalized Linear Models were built: the first predicts the frequency of a claim and the second predicts its severity. The original data was divided into 9 explanatory variables. Both models included five explanatory variables at the start and were then reduced; the reduction left four of the five variables significant in the frequency model and only one of the five significant in the severity model. Each model yields relative risks for the levels of its explanatory variables, and these combine into a total risk for each level. Multiplying a created base level by the combination of risk factors for a chosen customer gives that customer's premium.

Keywords: Bachelor Thesis, Mathematical statistics, Generalized Linear Model, Multiplicative GLM, Regression analysis, Insurance Pricing, Claims, Tariff


Sammanfattning

This bachelor's degree project lies within the field of mathematical statistics. In collaboration with the insurance company Hedvig, the thesis investigates a new way of handling Hedvig's insurance data by building a pricing model for all risk insurance using generalized linear models. Two models were created, the first of which predicts the frequency of a claim and the second its size. The original data was divided into 9 explanatory variables. Both models initially contained five explanatory variables, which were then reduced to four and one variable, respectively, in the corresponding models. From each model the relative risks could then be obtained for every category of the explanatory variables. Together these form the total risk for all groups. By multiplying a given combination of total risks with a created base level, the premium for a given customer can be obtained.

Keywords: Bachelor thesis, Mathematical statistics, Generalized linear model, Insurance claims, Multiplicative GLM, Regression analysis, Insurance, Pricing


Acknowledgements

We would like to express our deepest gratitude towards Hedvig for the amazing opportunity to write our bachelor thesis with them. A special thanks goes to our supervisor John Ardelius for sharing his advice and knowledge, as well as for trusting us with this task. Special thanks also go to Alexandra Hotti for her continuous support and feedback throughout this thesis.

We would also like to express our gratitude towards our thesis supervisor Daniel Berglund. Daniel has tirelessly been providing valuable advice and feedback which have been extremely helpful and appreciated. Lastly we want to thank Julia Liljegren at the School of Industrial Engineering and Management (KTH) as well as everyone participating in our qualitative study for their feedback and support in writing this Bachelor’s Thesis.


Authors

Ellinor Drakenward <edr@kth.se> and Emelie Zhao <ezhao@kth.se>

Industrial Engineering and Management

KTH Royal Institute of Technology Stockholm, Sweden

Supervisors

Supervisor at Hedvig: John Ardelius

Supervisor at KTH Royal Institute of Technology: Daniel Berglund


Contents

1 Introduction
1.1 Background
1.2 Scope
1.3 Problem

2 Theoretical Background
2.1 Literature Review
2.2 Insurance Theory
2.2.1 Key Terms
2.2.2 The Insurance Business Model
2.2.3 Principal-agent problems in insurance
2.2.4 Price sensitivity of demand
2.3 Mathematical Theory
2.3.1 Linear Regression
2.3.2 Generalized Linear Models (GLM)
2.3.3 Modeling Claim Frequency
2.3.4 Modeling Claim Severity
2.3.5 Model Validation
2.3.6 Multicollinearity

3 Methodology
3.1 Data
3.1.1 Collection of Data
3.1.2 Grouping of Variables
3.1.3 Aggregation of Variables
3.2 Model development and analysis
3.2.1 Current Model
3.2.2 New GLM model
3.2.3 Modeling the Frequency of a Claim
3.2.4 Modeling the severity of a claim
3.2.5 The final risk factors
3.2.6 Creating the Base level
3.3 Qualitative study

4 Results
4.1 Claim Frequency Model
4.1.1 Goodness of Fit Diagnostics for the full model
4.1.2 Finding the Reduced Model
4.1.3 Goodness of Fit Diagnostics for the reduced model
4.1.4 Significance of Variables in the reduced model
4.1.5 Final Model risk factors
4.1.6 Diagnostics of Multicollinearity
4.1.7 Residuals
4.1.8 Confidence intervals of the risks
4.2 Claim Severity Model
4.2.1 Goodness of Fit Diagnostics for the full model
4.2.2 Finding the Reduced Model
4.2.3 Goodness of Fit Diagnostics for the reduced model
4.2.4 Significance of Variables in the reduced model
4.2.5 Final Model Risk Factors
4.2.6 Residuals
4.2.7 Confidence intervals of the risks
4.3 The final risk factors
4.4 The base level
4.5 Qualitative study

5 Discussion
5.1 Frequency model adequacy
5.2 Severity model adequacy
5.3 Base level adequacy
5.4 Uncertainties
5.5 Interpretation and Application
5.6 Extension and future research
5.7 Qualitative study result

6 Conclusion


1 Introduction

1.1 Background

Insurance is a means of protection from financial loss, a form of risk management primarily used to hedge against the risk of a contingent or uncertain loss. An entity which provides insurance is known as an insurance company, and an entity which buys insurance is known as a policyholder. The insurance is bought by paying a price known as a premium, thereby transferring the economic risk to the insurance company.

The insurance company Hedvig is a pioneer in the Swedish insurtech market. Insurtech refers to the use of technological innovations in order to increase savings and efficiency, thus differing from the traditional insurance industry model.

More specifically, unlike traditional insurance companies, Hedvig uses AI to handle data and interpret received claims, which then forms the basis for the compensation paid out to their customers. This automatic process is based on mathematical calculations in which the algorithms use historical data from similar insurance cases to determine the likelihood of the situation that has occurred. The decision is completely data driven, which results in a highly efficient process with several economic benefits compared to traditional insurance companies. At the same time, the process ensures that all customers are treated equally.

Today, Hedvig offers four different types of home insurance based on type of accommodation and occupancy (Renters, Student, House & Villa and Homeowners). The current tariff was developed in collaboration with another company with the help of external data. Over the course of their business, Hedvig has gathered large quantities of their own data and now wants to align the tariff with this data.

1.2 Scope

We chose to use regression analysis, a data-driven approach grounded in mathematical statistics, to carry out this project. More specifically, we apply the theory of generalized linear models, hereafter denoted GLM. GLMs have proved useful in insurance pricing for several decades and are considered vital for this kind of study.

This report focuses on all risk insurance. All risk insurance is a type of insurance that covers and protects against all risks or perils that could damage the home, belongings and other personal property, unless the risks are specifically excluded in the policy wording. Further, the purpose of this project is to help Hedvig evolve their price model with the help of their own data. Since all risk insurance applies to all four types of home insurance that the company provides, a copious amount of data exists within this segment (over 60 % of all claims that Hedvig receives fall under all risk insurance), which lays a solid foundation for analysis.

We will also do a qualitative study on price sensitivity in relation to age for insurance customers. This will later be used to analyze how price sensitive insurance customers in different age groups are.

1.3 Problem

Hedvig has over the course of the years gathered large quantities of their own data. Now, they want to make their pricing model align with this data. The problem with their current pricing model is the inability to add new variables to the model. Hedvig has, based on their extensively gathered data, some potential new variables whose fit into the pricing model they would like to evaluate.

The below issues will be evaluated:

1- Based on Hedvig’s data, what variables are most relevant to consider in their pricing model?

2- Considering the results above, is it possible for us to build a new, consistent pricing model with the help of Generalized Linear Models?

We will also do a case study on price sensitivity in relation to insurances in order to evaluate the below:

3- How does price sensitivity differ between customers in the insurance business, based on age?


2 Theoretical Background

2.1 Literature Review

In addition to the theory below, the previous bachelor thesis Predicting Large Claims within Non-Life Insurances by Jacob Barnholdt and Josefin Grafford (2018) provided inspiration for the mathematical methods and approaches used in this thesis.

2.2 Insurance Theory

2.2.1 Key Terms

In this section, we have compiled an introduction to terms frequently used in the insurance industry that are vital to our project.

Claim: the report of an accident or damage made by a policyholder to an insurance company in order to invoke their insurance.

Claim cost: the cost associated with a claim that the insurance company pays out to the policyholder in question.

Claim frequency: number of claims with respect to duration. The duration is the number of days that a policyholder has had an insurance with the insurance company in question.

Claim severity: total claim amount divided by number of claims, i.e. the average cost per claim.

Co-insurance: when the customer of an insurance company has insured more people than just him- or herself.

Premium: the continuous amount (monthly or yearly) that the customer pays to the insurance company in order to have the insurance coverage.

Tariff: fixed price lists that determine the premium rates which insurance companies can charge consumers for insurance products sold by them.


2.2.2 The Insurance Business Model

Insurance pricing is the art of setting the price of an insurance policy while taking various properties of the insured object, as well as the policyholder, into consideration. Pricing is complex, since one is trying to set a fixed price while every policyholder, i.e. customer, carries an unknown risk. Many economic models study markets for trading risk, but those risks are designed to be traded; insurance instead deals with personal risk. Personal risk is a type of risk where the customers themselves can try to modify the risk through methods such as prevention. Alternatively, the customer could try to pool risk together with other consumers (but organizing such a group would create problems of its own) [10, 176-178].

Pricing is strongly linked to an insurance company's profitability, and price analyses are therefore important. The main source on which to base the pricing decision is each insurance company's own data, and there is no single correct pricing model, which is why pricing models differ between insurance companies. There is, however, always room for improvement, and an insurance company's price model is therefore normally a corporate secret [3, 1-2].

Assume we have a consumer with monetary wealth W, a probability p of losing an amount L, and an insurance that will pay an amount q in the event that the consumer incurs this loss. Further, the amount of money that the consumer has to pay for q units of insurance coverage is π·q, where π equals the insurance premium per unit of coverage [5, 180].

For each policy, the insurance premium is determined by the values of several chosen variables, and a statistical model is used to estimate this relationship. The chosen variables usually belong to one of the following categories [3, 2-3]:

1- Properties of the policyholder. These can be personal properties if the policyholder is a private person and corporate properties if the policyholder is a company; for instance age or number of co-insured.

2- Properties of the insured objects. For instance: age or model of a phone, type of building.


3- Properties of the geographic region. For instance: the population or average income of the policyholder’s residential area.

The utility maximization problem

$$\max_q \; p\,u(W - L - \pi q + q) + (1 - p)\,u(W - \pi q)$$

describes how much coverage the consumer will purchase. Differentiating with respect to q and setting the derivative equal to zero gives the first-order condition [5, 181]:

$$p\,u'(W - L + (1 - \pi)q)(1 - \pi) - (1 - p)\,u'(W - \pi q)\pi = 0$$

which is equivalent to

$$\frac{u'(W - L + (1 - \pi)q)}{u'(W - \pi q)} = \frac{(1 - p)\pi}{p(1 - \pi)} \qquad (1)$$

(1) indicates that if the event of losing L occurs, the insurance company receives an amount of πq − q. If the event does not occur, the insurance company receives an amount of πq. Hence, the expected profit of the company is (1 − p)πq − p(1 − π)q.
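As a numerical illustration (the figures below are invented for exposition and are not taken from Hedvig's data), suppose p = 0.1, L = q = 10 000 and π = 0.12. The expected profit of the company is then

$$(1 - p)\pi q - p(1 - \pi)q = 0.9 \cdot 0.12 \cdot 10\,000 - 0.1 \cdot 0.88 \cdot 10\,000 = 1\,080 - 880 = 200.$$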

2.2.3 Principal-agent problems in insurance

A principal-agent problem is a conflict in priorities between a person or group and the representative authorized to act on their behalf [6]. An agent may act in a way that is contrary to the best interests of the principal. In the insurance business case, the customer acts as the agent while the insurance company corresponds to the principal [12, 57-74]. In insurance, there are two types of principal-agent problems: hidden action and hidden information [5, 456-457].

Hidden action problem: A hidden action problem, also called a moral hazard problem, refers to the impact of insurance coverage in distorting incentives. It occurs when a principal bears the risk of what the agent is doing but at the same time cannot fully observe, or condition on, the agent's actions. In the context of an insurance market, this constitutes a hidden action [5, 456-457].

Naturally, this problem can be divided into two sides: ex ante moral hazard and ex post moral hazard[10, 205-206].

Ex ante (before the event) refers, in this case, to an individual facing the risk of an accident, for instance a theft, a home fire or a car accident. Without insurance, the costs and benefits of accident avoidance (or precaution) would be internal to the individual, and thus the incentives for avoidance would be optimal. With insurance, on the other hand, the insurer covers some of the accident costs. This means that the insured individual bears all of the costs of accident avoidance but receives only some of the benefits, and thus will under-invest in precaution [13, 74-91]. Ex post (after the event) would, for instance, be after a need for medical care has occurred, and refers to the possibility that an individual will spend more resources on care if a portion of those expenses is covered [14, 10-26].

Hidden information problem: Hidden information occurs in the trade between principal and agent, where agents select offers from the principal based on private information. Private information, in regard to the insurance business, means that the information about the utility or cost function of the agent is not observable [5, 456-457].

2.2.4 Price sensitivity of demand

Demand describes the behavior of buyers: it is the relationship between price and quantity. The quantity demanded of any good is the amount of the good that buyers are willing and able to purchase. The law of demand states that when the price of a good rises, the quantity demanded falls, and vice versa: the quantity demanded rises when the price falls. Demand is elastic if the quantity demanded responds substantially to changes in price; otherwise it is labeled inelastic [3, 90-91].

Price sensitivity refers to how much buyers respond to changes in price. It measures how willing consumers are to buy less of a good as its price rises, i.e. how the buyer's demand changes.

The following are the rules of thumb about what influences the price elasticity of demand [3, 90]:

1. Availability of close substitutes: goods with close substitutes have more elastic demand.

2. Necessities vs. luxuries: necessities have inelastic demand and luxuries have elastic demand. Whether a good is a necessity or a luxury depends on the preferences of the buyer, but by definition everything indispensable is a necessity.

3. Definition of the market: a narrowly defined market has more elastic demand.

4. Time horizon: goods tend to have more elastic demand over longer time horizons.

The price elasticity of demand is calculated as follows:

$$\text{price elasticity of demand} = \frac{\text{percentage change in quantity demanded}}{\text{percentage change in price}}$$
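As a hypothetical numerical example (figures invented for exposition): if a 10 % price increase leads to a 20 % drop in the quantity demanded, then

$$\text{price elasticity of demand} = \frac{-20\,\%}{+10\,\%} = -2,$$

and since the magnitude exceeds 1, demand is elastic.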

2.3 Mathematical Theory

Regression analysis is a statistical technique used to model the relationship between variables. It is applied in numerous fields, including economics and insurance theory [1, 1]. Regression models can be used for data description, parameter estimation, prediction and control [1, 9].

2.3.1 Linear Regression

In the simplest case, the model is built so that the response depends on one explanatory variable, i.e. one regressor. The model is then called a simple linear regression model and is of the form:

$$y = \beta_0 + \beta_1 x + \epsilon \qquad (2)$$

Here, β0 is the intercept, β1 is the slope and ϵ is the error term [1, 2]. The errors are assumed to be normally distributed with mean 0 and variance σ2, and to be uncorrelated. The mean of the response y is a linear function of x, β0 + β1x, and the variance of y is the same as the variance of the errors, σ2, and thus independent of the value of x [1, 13].

More commonly, the response variable y depends on more than one regressor. The model is then a multiple linear regression model, built with k regressors x1, x2, ..., xk:

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon \qquad (3)$$


Here, the term linear refers to the model being a linear function of the unknown coefficients β0, ..., βk. For a unit change in a regressor xj, when all of the remaining regressors are held constant, the corresponding βj represents the expected change in y [1, 67-68]. The errors in the multiple linear model are assumed to be normally distributed with mean 0 and variance σ2, and uncorrelated.

2.3.2 Generalized Linear Models (GLM)

In ordinary linear models, we assume normally distributed errors with mean 0 and variance σ2, and therefore a normally distributed response variable. When these assumptions are not appropriate, generalized linear models are an alternative approach. GLM include both linear and nonlinear models, and thus allow the response variable to have a non-normal distribution [1, 421-450]. The only requirement on the distribution of the response variable is that it be a member of the exponential family, which includes for example the Poisson and gamma distributions. A distribution in the exponential family has the general form:

$$f(y_i, \theta_i, \phi) = \exp\!\left[\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + h(y_i, \phi)\right] \qquad (4)$$

θi is the natural location parameter and ϕ is the scale parameter [1, 451]. In GLM, the idea is to build a linear model for a function of the expectation of the response variable. Let ηi be defined by:

$$\eta_i = g[E(y_i)] = g(\mu_i) = x_i\beta \qquad (5)$$

Equation (5) yields the following expected response:

$$E(y_i) = g^{-1}(\eta_i) = g^{-1}(x_i\beta) \qquad (6)$$

where g is the link function. There are several choices of link function, for example the log link for the Poisson distribution or the reciprocal link for the gamma distribution.

Maximum likelihood is a basis for parameter estimation in GLM. Once the coefficient estimates β̂ are obtained, the general fitted model is:

$$\hat{y}_i = g^{-1}(x_i\hat{\beta}) \qquad (7)$$

2.3.3 Modeling Claim Frequency

With the Poisson distribution, the response variable can be modeled so that it consists of a few, relatively rare events, such as the number of insurance claims per user. It is then desired to model the relationship between the number of claims and the predictor variables. If the response variable yi is a count, so that the observations are yi = 0, 1, 2, ..., the probability model is often taken to be Poisson distributed:

$$f(y) = \frac{e^{-\mu}\mu^y}{y!}, \quad y = 0, 1, \dots \qquad (8)$$

where µ > 0. In this probability distribution, the mean and the variance are both equal to µ, which implies that they are related [1, 444].

One common link function for the Poisson distribution is the log link which is defined by:

$$\mu_i = g^{-1}(x_i\beta) = e^{x_i\beta} \qquad (9)$$

This link ensures that the predicted values will be non-negative, which is desirable when modeling the number of claims. In Poisson regression, the parameters are estimated with maximum likelihood.

The method of maximum likelihood is used when the distribution of the errors is known. In the case of the Poisson distribution, the parameter of interest to be estimated is λ, the parameter that indicates the average number of events in a given interval. The Poisson probability mass function is:

$$P(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{x!} \qquad (10)$$

In equation (10), x is the value for which we calculate the probability.

A sequence X_n of n independent observations is drawn, each with the probability mass function given by equation (10). The probability of observing the sequence is then the product of the individual probabilities:

$$L(\lambda; x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \qquad (11)$$

The log-likelihood function is then:

$$\ell(\lambda; x_1, \dots, x_n) = \sum_{i=1}^{n} \left[-\lambda - \log(x_i!) + x_i\log(\lambda)\right] \qquad (12)$$

Differentiating the above equation with respect to λ and setting it equal to 0 gives the λ value that maximizes the function:

$$\frac{d\,\ell(\lambda; x_1, \dots, x_n)}{d\lambda} = 0 \qquad (13)$$

Solving for λ, the maximum likelihood estimator is obtained:

$$\hat{\lambda} = \frac{\sum_{i=1}^{n} x_i}{n} \qquad (14)$$

In the case where it is desired to estimate the coefficients βi, i = 0, 1, 2, ..., n, β is used instead of λ [2,1].

With estimated parameters β̂, the fitted model with the log link is:

$$\hat{y}_i = g^{-1}(x_i\hat{\beta}) = e^{x_i\hat{\beta}} \qquad (15)$$

2.3.4 Modeling Claim Severity

The exposure, i.e. the number of claims, can be considered a weight, denoted w. The average claim cost is obtained by weighting the total claim cost X by the weight w; the claim severity Y = X/w is then obtained. For the cost of an individual claim, which is assumed to be gamma distributed, w = 1. The gamma distribution implies that the standard deviation is proportional to µ, which means that the coefficient of variation is constant [4, 32]. One parameterization of the distribution is given by the frequency function below [4, 21]:


$$G(\alpha, \beta): \quad f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad x > 0 \qquad (16)$$

where the so-called index parameter α > 0 and the scale parameter β > 0.

Therefore, if X is the sum of w independent gamma distributed random variables, we conclude that X ∼ G(wα, β). The frequency function for Y = X/w is, with the re-parametrization µ = α/β and ϕ = 1/α:

$$f_Y(y) = f_Y(y; \mu, \phi) = \frac{1}{\Gamma(w/\phi)}\left(\frac{w}{\mu\phi}\right)^{w/\phi} y^{w/\phi - 1}\, e^{-wy/(\mu\phi)} = \exp\!\left(\frac{-y/\mu - \log(\mu)}{\phi/w} + c(y, \phi, w)\right), \quad y > 0 \qquad (17)$$

where $c(y, \phi, w) = \frac{w}{\phi}\log(wy/\phi) - \log(y) - \log\Gamma(w/\phi)$, which results in the desired $E(Y) = \mu$ and $\mathrm{Var}(Y) = \phi\mu^2/w$.

To show that this gamma distribution is a member of the exponential family, the parameter µ is changed to θ = −1/µ, θ < 0. Introducing the index i, the frequency function of the claim severity Y_i is

$$f_{Y_i}(y_i; \theta_i, \phi) = \exp\!\left(\frac{y_i\theta_i + \log(-\theta_i)}{\phi/w_i} + c(y_i, \phi, w_i)\right) \qquad (18)$$

which shows that the distribution in question is a member of the exponential family, with b(θ_i) = −log(−θ_i). Hence, it can be used in a GLM [4, 32].

With estimated parameters β̂, the log-gamma GLM results in the fitted model:

$$\hat{y}_i = g^{-1}(x_i\hat{\beta}) = e^{x_i\hat{\beta}} \qquad (19)$$

2.3.5 Model Validation

Akaike information criterion (AIC): AIC is an estimator of out-of-sample prediction error as well as of the relative quality of statistical models for a given set of data. It is based on maximizing the expected entropy of the model, where entropy is a measure of the expected information. AIC also provides a means for model selection by estimating the quality of each model relative to each of the other models. Essentially, the AIC is a penalized log-likelihood measure:

$$\mathrm{AIC} = -2\ln(L) + 2p \qquad (20)$$

where p denotes the number of parameters in the model [1, 369].

Bayesian information criterion (BIC): BIC is a Bayesian analogue of AIC. There are several BICs, but the one used in this report is the Schwarz criterion. Compared to AIC, this criterion places a greater penalty on adding regressors to the model as the sample size increases [1, 369]:

$$\mathrm{BIC} = -2\ln(L) + p\ln(n) \qquad (21)$$

Here, p denotes the number of parameters in the model and n the sample size.
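As a minimal illustration of equations (20) and (21) in R (the fitted model object fit is hypothetical; R's built-in AIC() and BIC() should give the same values):

# Hedged sketch: AIC and BIC computed from the log-likelihood of a fitted GLM.
# 'fit' is a hypothetical glm object.
ll <- as.numeric(logLik(fit))      # ln(L)
p  <- attr(logLik(fit), "df")      # number of estimated parameters
n  <- nobs(fit)                    # sample size
aic <- -2 * ll + 2 * p             # equation (20)
bic <- -2 * ll + p * log(n)        # equation (21)
c(aic = aic, bic = bic)            # should match AIC(fit) and BIC(fit)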

Residual analysis: Residual analysis is very important when fitting a GLM. It can provide guidance concerning the overall adequacy of the model, assist in verifying assumptions and help determine the appropriateness of the model [1, 456-459]. The ordinary residuals from a GLM are defined as:

$$e_i = y_i - \hat{y}_i = y_i - \hat{\mu}_i \qquad (22)$$

It is, however, generally recommended that residual analysis in GLM be performed using deviance residuals, defined (for the Poisson model above) as:

$$d_i = \pm\sqrt{2\left[\,y_i\ln\!\left(y_i/e^{x_i\hat{\beta}}\right) - \left(y_i - e^{x_i\hat{\beta}}\right)\right]}, \quad i = 1, 2, \dots, n \qquad (23)$$

where the sign is that of the original residual. As the observed response y_i and the predicted value ŷ_i become closer to each other, the deviance residual approaches zero. When plotting deviance residuals, it is often useful to transform the fitted values to a constant-information scale. For different responses, different transformations should be used:

1- For normal responses, use ŷ_i.
2- For binomial responses, use 2 sin⁻¹(√π̂_i).
3- For Poisson responses, use 2 √ŷ_i.
4- For gamma responses, use 2 ln(ŷ_i).
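In R, such a residual plot could be produced along the following lines (a sketch only; freq_fit is a hypothetical fitted Poisson GLM object, and the constant-information scale for a Poisson response is used):

# Hedged sketch: deviance residuals plotted against the constant-information
# scale 2*sqrt(fitted value) for a Poisson GLM called 'freq_fit' (hypothetical).
dres <- residuals(freq_fit, type = "deviance")
plot(2 * sqrt(fitted(freq_fit)), dres,
     xlab = "2 * sqrt(fitted value)", ylab = "Deviance residual")
abline(h = 0, lty = 2)   # residuals should scatter evenly around zero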

Likelihood ratio test: The likelihood ratio test is used to compare a full model with a reduced model of interest. The technique resembles an "extra sum of squares" approach: twice the logarithm of the likelihood of the full model (FM) is compared to twice the logarithm of the likelihood of the reduced model (RM), yielding a test statistic LR:

$$LR = 2\ln\!\frac{L(FM)}{L(RM)} = 2\left[\ln L(FM) - \ln L(RM)\right] \qquad (24)$$

For large samples, and when the reduced model is correct, LR follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters between FM and RM. Thus, if LR exceeds the upper α percentage point of this chi-square distribution, the claim that the reduced model is appropriate is rejected [1, 448-449].
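A sketch of how equation (24) could be evaluated in R for two nested GLMs (the model objects full_fit and reduced_fit are hypothetical):

# Hedged sketch: likelihood ratio test of a reduced model against the full model.
LR <- 2 * (as.numeric(logLik(full_fit)) - as.numeric(logLik(reduced_fit)))
df <- attr(logLik(full_fit), "df") - attr(logLik(reduced_fit), "df")
pchisq(LR, df = df, lower.tail = FALSE)        # p-value of the test
anova(reduced_fit, full_fit, test = "Chisq")   # built-in equivalent for nested GLMs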

2.3.6 Multicollinearity

When there is no linear relationship between the regressors, the regressors are said to be orthogonal. In practice, however, it is common that regressors have a near-linear relationship. This can make the model unreliable and should therefore be avoided. Multicollinearity is thus the problem of correlation between regressors.

There are several methods to detect which regressors have high correlations with others and therefore which regressors that contribute to the multicollinearity of the model [1,285].

Examination of the Correlation Matrix: One can examine the off-diagonal elements r_ij, where i and j are the indices of the regressors, of the matrix X′X in correlation form, where X is the model matrix of the regression. If two regressors x_i and x_j are nearly linearly dependent, |r_ij| will be close to unity. This method is simple and effective for detecting correlation between pairs of regressors. However, it does not cover the situation where more than two variables are involved in a near-linear dependence [1, 293-294].

Examination of the Variance Inflation Factors: Another method of detecting multicollinearity in the model is to analyze the variance inflation factors. They are defined as the diagonal elements C_jj = (1 − R_j²)⁻¹ of the inverse matrix C = (X′X)⁻¹. R_j² is the coefficient of determination obtained when the regressor x_j is regressed on the remaining p − 1 regressors. If x_j is nearly orthogonal to the other regressors, the value of C_jj will be close to unity, since R_j² will be small. However, if x_j is nearly linearly dependent on some subset of the other regressors, R_j² will instead be near unity, which implies a large C_jj. The variance of the coefficient of the j-th regressor is C_jj σ², and C_jj can therefore be viewed as the factor by which the variance of that coefficient is increased due to the near-linear dependencies. The formal definition of the variance inflation factor is:

$$VIF_j = C_{jj} = (1 - R_j^2)^{-1} \qquad (25)$$

The VIF value measures the combined effect of the near-linear relationships among the regressors on the variance of the corresponding term. If one or more VIF values are large, multicollinearity exists in the model. What counts as a large VIF value depends on the size of the data: for a smaller data set the VIF values should not exceed 5, and for a larger set they should not exceed 10 [1, 296].
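Since the VIFs are the diagonal elements of the inverse correlation matrix of the regressors, a rough check can be done in R as follows (a sketch; X is a hypothetical numeric matrix of regressor columns):

# Hedged sketch: variance inflation factors per equation (25).
Rmat <- cor(X)                # correlation matrix of the regressors
vif  <- diag(solve(Rmat))     # VIF_j = (1 - R_j^2)^(-1)
vif[vif > 5]                  # flag values that are large for a small data set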


3 Methodology

The two main tools Excel and R were used in this project. Excel was used to sort the data, while R was used for the model building and analysis.

3.1 Data

Data was received from Hedvig. The complete data consisted of 9 different fields: user id, postal code, number of people co-insured, square meters of housing, number of days the individual has been insured, number of claims the individual has made, cost of the claim(s), number of referrals from the individual, and birth date. Referrals refers to the customers that the individual has brought to Hedvig through recommendation. The total number of users was 15461 at the start.

3.1.1 Collection of Data

The initial variables included were birth date, from date, to date, postal code, number of people co-insured, square meters of housing, number of referrals and claim cost. The initial data had to be cleaned due to skewed observations and the fact that not all included variables were numerical. For instance, the number of days insured was divided into two columns called from date (the start date of the insurance) and to date (the termination date of the insurance).

Assumptions that had to be made in order to sort the data:

• Skewed observations are not accurate and therefore must be deleted. Those included:

– Observations with birth dates equal to 1900

– Observations with postal codes consisting of less or more than 5 digits, i.e. not Swedish postal codes

– Observations missing a from date and a to date

– Observations with a from date that has not yet occurred

• People with no to date are still insured. The to date, for all observations that did not have one, was therefore set to the date when the data was collected: 25 March 2020. (An illustrative R version of these cleaning rules is sketched below.)
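The thesis performed this sorting in Excel; purely as an illustration, an equivalent filtering in R could look like the sketch below (the data frame raw and its column names are hypothetical):

# Hedged sketch of the cleaning rules listed above.
library(dplyr)
collected <- as.Date("2020-03-25")
cleaned <- raw %>%
  filter(format(birth_date, "%Y") != "1900",          # drop placeholder birth years
         nchar(gsub(" ", "", postal_code)) == 5,      # keep 5-digit Swedish postal codes
         !(is.na(from_date) & is.na(to_date)),        # drop rows missing both dates
         is.na(from_date) | from_date <= collected) %>%   # drop start dates in the future
  mutate(to_date = if_else(is.na(to_date), collected, to_date),
         days_insured = as.numeric(to_date - from_date))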

3.1.2 Grouping of Variables

The first step of the analysis was to group the variables that covered a wide range.

Otherwise it would, with the large set of data, be impossible to analyze each individual type of observation. When grouping the variables, two aspects had to be taken into consideration.

1. Each group has to be risk homogeneous, so that the risk does not vary much within the group

2. All the groups have to contain ”enough” data in order to obtain a stable GLM analysis

The variables that were initially considered appropriate for grouping were postal code, birth date and square meters. It was not trivial which groups were risk homogeneous, so finding groups with estimated equal risk was an iterative process. The iteration consisted of plotting the variables against the total number of claims, in order to visualize whether certain intervals of the variable under analysis had similar risk tendencies. There was a certain visible correlation between birth date (i.e. age) and number of claims, whereas for postal code it was less obvious whether a correlation existed. Several models with different groupings were built until satisfactory groups were reached. While finding risk homogeneous groups, it simultaneously had to be ensured that the groups contained enough observations. This was achieved by adapting the group intervals after visualizing the group distributions in histograms.

Initially, number of people co-insured and referrals were not grouped, since they had few natural categories. However, during the building process problems arose, and the conclusion was that those also had to be grouped, because some of the groups contained insufficient amounts of data. Referrals amounted in total to only 100 observations out of more than 15 000 data points. Among these 100 observations there were 14 levels, with some levels containing as few as 1 observation. Since there was no obvious intuition as to whether different numbers of referrals meant different risks, it was assumed that the risk differs only between having referrals or not. Consequently, referrals was transformed into a binary variable so that all 100 observations could be gathered into one group.

People insured was likewise grouped so that the levels with insufficient data were merged. Although there were considerably more observations in groups 1 and 2 of people insured, groups 3 and 4 were not merged, since their risk is not considered homogeneous and both groups contain more than 500 observations, which is considered enough.

The final groups are listed below (an illustrative R sketch of the grouping follows the tables):

Table 3.1: Birth dates
Group 1: 1979 or older
Group 2: 1980-1986
Group 3: 1987-1991
Group 4: 1992-1994
Group 5: 1995-1997
Group 6: 1998 or younger

Table 3.2: Square meters
Group 1: less than 25
Group 2: 25-34
Group 3: 35-44
Group 4: 45-64
Group 5: 65 or more

Table 3.3: Postal codes
Group 1: 10627-11599 (Stockholm)
Group 2: 11600-13499 (Stockholm)
Group 3: 13500-19999 (Stockholm county)
Group 4: 20000-25999 (Skåne)
Group 5: 26000-41899 (between Skåne and Gothenburg)
Group 6: 41900-69999 (Västra Götaland)
Group 7: 70000-98331 (the rest of Sweden)

Table 3.4: People insured
Group 1: 0
Group 2: 1
Group 3: 2
Group 4: 3 or more

Table 3.5: Referrals
Group 1: 0
Group 2: 1
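Purely as an illustration of how the groupings in Tables 3.1-3.5 could be created in R (a sketch; cleaned and its column names are hypothetical, and postal codes would be handled analogously):

# Hedged sketch: recreating the variable groups with cut().
cleaned$birth_group    <- cut(cleaned$birth_year,
                              breaks = c(-Inf, 1979, 1986, 1991, 1994, 1997, Inf),
                              labels = 1:6)                            # Table 3.1
cleaned$sqm_group      <- cut(cleaned$square_meters,
                              breaks = c(-Inf, 24, 34, 44, 64, Inf),
                              labels = 1:5)                            # Table 3.2
cleaned$people_group   <- cut(cleaned$people_insured,
                              breaks = c(-Inf, 0, 1, 2, Inf),
                              labels = 1:4)                            # Table 3.4
cleaned$referral_group <- factor(ifelse(cleaned$referrals > 0, 2, 1))  # Table 3.5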

3.1.3 Aggregation of Variables

To obtain a row for each existing combination of the variables, instead of having one row per user, an aggregated data set was created. The variables days insured, number of claims and total claim cost were aggregated by birth date, square meters, referrals, people insured and postal code. By aggregation, the categorical variables were converted to dummy variables, which enables each category to receive a coefficient based on the impact it has on the model relative to the baseline of its variable group. Each row thus has a unique combination of the explanatory variables.
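A sketch of this aggregation step in R (column names hypothetical, continuing the earlier sketches):

# Hedged sketch: one row per combination of variable groups, with duration,
# number of claims and total claim cost summed within each combination.
agg <- aggregate(cbind(days_insured, n_claims, claim_cost) ~
                   birth_group + sqm_group + postal_group +
                   people_group + referral_group,
                 data = cleaned, FUN = sum)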


3.2 Model development and analysis

3.2.1 Current Model

The current tariff is of the form:

$$\gamma_0 \prod_{j=1}^{4} \gamma_j + Premium_{basis} \qquad (26)$$

where γ_0 = 129 and the γ_j are the relative risks from the variables age, area, postal code and total people insured. The ages range from 16 to 91 years, the area from 20 to 250 square meters, people insured from 1 to 6 people, and the postal codes are divided into 14 areas.

3.2.2 New GLM model

The model to be created will look like:

$$Price = \gamma_0 \prod_{k=1}^{N} \gamma_{k,i} \qquad (27)$$

where γ_0 is the base level and γ_{k,i}, k = 1, ..., N, are the risk factors corresponding to variable number k and variable group i. All groups within each variable have different risks. Unlike the current model, this tariff is built with generalized linear models.

3.2.3 Modeling the Frequency of a Claim

The Model: The initial step in the development of the model was to identify the response variable, which in this case was the total number of claims. The data set used for the model is in aggregated form, so that each unique combination of the groups of the included variables results in a certain total number of claims; the total number of claims ranges from 0 to 17. The response was modeled with a Poisson distribution with a log link function, and initially all variables were included. Duration was set as an offset variable. This allows the model to describe a rate, i.e. claim count per unit of exposure, instead of just a count, without distorting the distribution of the data; in effect, the beta coefficient of the (log) duration is fixed to one [15, 1].
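A minimal sketch of this step in R, assuming the aggregated data frame agg and the column names from the sketches above (the actual code used in the thesis is not reproduced here):

# Hedged sketch: frequency model as a Poisson GLM with log link and
# log(duration) as offset, so that the model describes claims per day insured.
freq_full <- glm(n_claims ~ birth_group + sqm_group + postal_group +
                   people_group + referral_group,
                 family = poisson(link = "log"),
                 offset = log(days_insured),
                 data   = agg)
summary(freq_full)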

(32)

Initially, a likelihood ratio test was performed to determine which of the variables were significant in the model, at a rejection level of p = 0.05. To find the reduced model, the R function bestglm (from the package of the same name) was used to perform variable selection based on the information criteria AIC and BIC. The five best models, and the ultimately best model according to the chosen criteria, were then determined.
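A sketch of how the significance testing and selection could be carried out in R (hypothetical object names; bestglm() expects a data frame with the response in the last column, and handling of the offset is omitted here for brevity):

# Hedged sketch: per-variable likelihood ratio tests, then best-subset selection.
drop1(freq_full, test = "LRT")          # LR test of each variable at p = 0.05
library(bestglm)
Xy <- data.frame(agg[, c("birth_group", "sqm_group", "postal_group",
                         "people_group", "referral_group")],
                 y = agg$n_claims)
bestglm(Xy, family = poisson, IC = "AIC", TopModels = 5)
bestglm(Xy, family = poisson, IC = "BIC", TopModels = 5)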

The risk factors: By maximum likelihood, R finds an estimate of all the beta coefficients for each group of each variable. The intercept was separated from the rest of the variables, and the risks were obtained from the definition of the fitted Poisson model with log link:

$$\hat{y}_i = e^{x_i\hat{\beta}} \qquad (28)$$

The calculated risk coefficients are:

$$risk_{f,ki} = e^{x_i\hat{\beta}_i}, \quad x_i \in \{0, 1\} \qquad (29)$$

That is, for each group i the risk equals the exponential of its beta coefficient times x_i. The indicator x_i can take two values: x_i = 1 if the variable group is present in the model and x_i = 0 if the variable group is not included. For the variable groups that were not included in the model, the exponential is simply 1, which indicates that the group has no impact on the final model.

One of the variable groups of each variable was set to one to indicate the baseline; the baseline was assigned to the group with the highest duration rate. The relative risks were then set relative to this baseline.
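A sketch of how the risk factors could be extracted in R (hypothetical object names; the baseline level of each factor would be chosen with relevel() before fitting):

# Hedged sketch: risk factors as exponentiated coefficients of the reduced
# frequency model, per equation (29); the baseline level of each factor gets risk 1.
risk_f <- exp(coef(freq_reduced))
risk_f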

3.2.4 Modeling the severity of a claim

The Model: In this case the response variable was the average claim cost, calculated as the claim cost divided by the number of claims. In the modeling of the claim severity, the GLM therefore had to be weighted by the number of claims. The response was modeled with a gamma distribution with a log link function. The same procedure as for the frequency model was followed: a likelihood ratio test was performed, followed by variable selection with the same function bestglm with respect to the criteria AIC and BIC, and the result was finally also compared with the ANOVA significance table.
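A minimal sketch of the severity model in R under the same hypothetical naming as above:

# Hedged sketch: severity model as a gamma GLM with log link, weighted by
# the number of claims; only rows with at least one claim are used.
claims <- subset(agg, n_claims > 0)
claims$avg_cost <- claims$claim_cost / claims$n_claims
sev_full <- glm(avg_cost ~ birth_group + sqm_group + postal_group +
                  people_group + referral_group,
                family  = Gamma(link = "log"),
                weights = n_claims,
                data    = claims)
summary(sev_full)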

The risk factors: By maximum likelihood, R finds an estimate of all the beta coefficients for each group of each variable in the severity model. Since the fitted log-gamma model is of the same form as the frequency model:

$$\hat{y}_i = e^{x_i\hat{\beta}} \qquad (30)$$

the risk coefficients were calculated as in the previous model:

$$risk_{s,ki} = e^{x_i\hat{\beta}_i}, \quad x_i \in \{0, 1\} \qquad (31)$$

where x_i = 1 if the variable group is present in the model and x_i = 0 if it is not included. The baseline was determined in the same way as in the frequency model.

3.2.5 The final risk factors

The final risk factors were obtained by simple multiplication of the relative risks of the separate models for each variable group.

$$risk_{T,ki} = risk_{f,ki} \cdot risk_{s,ki} = \gamma_{k,i} \qquad (32)$$

3.2.6 Creating the Base level

The base level is calculated such that the total claim cost for one year is covered by the sum of the prices of all insurances on a full-year basis. The first step was to estimate the average claim cost for one year. This was done by analyzing the data from the time span from when the insurances started until today, i.e. data from 2018-01-01 to 2020-03-25. Assuming that all risk insurance is not season based, i.e. that the claims are evenly distributed throughout the year, the claim cost per year was calculated from the total claim cost in the data set. The time span is 783 days. Dividing the total number of days by the number of days in one year gives a factor by which the total claim cost can be divided to obtain the average claim cost for a year.

$$\text{Average claim cost} \cdot \text{Average number of claims in a year} = \frac{\text{Claim cost 2018-2020}}{\text{Claims 2018-2020}} \cdot \frac{\text{Days in one year}}{\text{Total days}} \cdot \text{Claims 2018-2020}$$

$$= \text{Claim cost 2018-2020} \cdot \frac{\text{Days in one year}}{\text{Total days}} = \text{Claim cost 2020} \qquad (33)\text{-}(37)$$

The claim cost is based on the customers insured today, which means that in the next step the calculations were based on the customers insured during the last 365 days.

The next step was to calculate the total sum of the insurance premiums, with a target ratio between the estimated claim cost and the total premium of 90 %. The total premium of the insurances could then be calculated as:

$$\text{Total premium} = \frac{\text{Expected claim cost}}{0.9} \qquad (38)$$

Then the total risk factor for each insurance, i.e. the product of all risk factors γ_{k,i} for that insurance, was calculated. The sum over all insurances during 2019 was set equal to the total premium, so that the base level could be obtained:

$$\gamma_0 = \frac{\text{Total premium 2020}}{\sum_{j=1}^{m}\left(\prod_{k=1}^{4} \gamma_{k,i}\right)} \qquad (39)$$

where m is the number of insurances during the year 2019, k corresponds to the variable number and i to the variable group number, whose range varies depending on the variable.
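A sketch of the base-level calculation in R, using the quantities defined above (all object names are hypothetical; risk_matrix is assumed to hold one row per insurance active during the year and one column per final risk factor γ_{k,i}):

# Hedged sketch of equations (33)-(39).
yearly_claim_cost <- total_claim_cost * 365 / 783     # scale 2018-2020 cost to one year
total_premium     <- yearly_claim_cost / 0.9          # 90 % target loss ratio, eq. (38)
gamma0            <- total_premium / sum(apply(risk_matrix, 1, prod))   # eq. (39)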


Hedvig is growing and expanding, which implies that the total claim cost and the total number of customers will most likely increase during 2020. However, since those numbers are difficult to estimate, the assumption is that the ratio between claim costs and people insured, calculated up until today, will look similar in the future.

This way, the base level will be calculated to approximately the same value.

3.3 Qualitative study

We did a qualitative study on the price of home insurance in order to analyze price sensitivity in different age groups. Price sensitivity is the degree to which the price of a product affects consumers' purchasing behavior; in other words, the extent to which demand changes when the cost of a product or service changes.

The case study was performed digitally with the help of a survey. The 90 respondents were divided into three groups, A, B and C. The division was made with respect to the emergence of the internet as well as the acceptance of digital methods in everyday life.

Figure 3.1: Qualitative study

Generation Z, individuals born in or after 1995, is defined as growing up in a highly diverse environment with high levels of technology, i.e. in the context of mobility, social media and digital natives. Individuals of Generation Z were divided into two subgroups in the study in order to later be able to compare the older and younger individuals of the same generation, i.e. those in their late 20s compared to teenagers and individuals in their early 20s [7].

Generation Y, Generation X and Baby Boomers, altogether representing a span of individuals born on or between the years of 1940–1994, are those growing up w
