
MASTER THESIS, 30 CREDITS

M.SC. INDUSTRIAL ENGINEERING & MANAGEMENT, INDUSTRIAL STATISTICS, 300 CREDITS

PREDICTING RISK EXPOSURE IN THE INSURANCE SECTOR

Application of Statistical Tools to Enhance Price Optimization at Trygg-Hansa

Daniel Dunbäck & Lars Mattsson


Abstract

Knowledge about future customer flow can be very important when trying to optimize a business, especially for an insurance company like Trygg-Hansa, since the customer flow is connected to the company's risk exposure. This thesis shows how customer volume for given time periods can be estimated using stratification of data and univariate time series models. From this, a simulated customer flow can be created using stratified sampling from the historical population. Two different stratification approaches were tested: an expert-driven approach using visualization to partition the population into smaller subsets, and a data-driven approach using a regression tree. Both approaches were able to capture seasonal effects and trends and delivered better results than the method currently used by the company. However, since neither method outperformed the other, it is not possible to determine which of them is the better one and should be implemented.

It is therefore recommended that both methods be investigated further. It was also found that, with respect to the effect on the company's risk exposure, the variation in the population mattered less than the customer volume.

Sammanfattning

Kunskap om framtida kundflöde kan vara viktigt för att optimera ett företags verksamhet, särskilt för ett försäkringsbolag som Trygg-Hansa eftersom kundflödet är kopplat till företagets riskexponering. I detta arbete visas det att antalet kunder för specifika tidsperioder kan predikteras genom stratifiering av data och tidsseriemodellering. Utifrån detta kan ett simulerat kundflöde genereras genom stratifierat urval från en historisk kundpopulation. Två olika metoder för stratifiering har undersökts: en expertbaserad metod som partitionerar populationen i subpopulationer baserat på datavisualisering, och en datadriven metod som partitionerar populationen baserat på resultatet från ett regressionsträd. Båda metoderna lyckades identifiera säsongsbeteenden och trender samt resulterade i bättre prediktioner av antalet kunder än den metod som används av företaget idag. Eftersom ingen av metoderna presterade bättre än den andra är det dock inte möjligt att avgöra vilken av metoderna som bör implementeras. Det rekommenderas istället att båda metoderna undersöks ytterligare. I arbetet upptäcktes det även att variation i populationen hade mindre inverkan än antalet kunder, med avseende på företagets riskexponering.

Titel: Prediktion av Riskexponering inom Försäkringssektorn


Acknowledgements

First, we would like to thank our university supervisor, Konrad Abramowicz, for being constantly available throughout this project, for having a positive attitude when we were gloomy, and for being hard on us when it was needed. Thank you for sharing your knowledge and for steering us in the right direction when we went astray. Your expertise and assistance made it possible for us to move forward with the project and we could not have done this without you.

We would also like to thank our company supervisor, Johan Leifland, for all your support and positivity throughout the project, and for making this project possible.

We would like to thank Johanna Rosenvinge, Head of Customer Analytics, and the rest of the Customer Analytics department for your help and for making us feel welcome at Trygg-Hansa. We hope that we will meet again in the future. A special thanks to Henrik Rosén and Erik Wallentin at Trygg-Hansa for all your time and assistance. It has been a pleasure working with you.

Finally, we would like to thank our family and friends for all your love and support throughout this project. Thank you for being understanding with our absence, absentmindedness and general dwelling on the project these past months.


The best prophet of the future is the past.

- GEORGE BYRON


Contents

Abstract

Acknowledgements

1 Introduction
1.1 Background
1.2 Problem Formulation
1.3 Purpose & Goal
1.4 Previous Work
1.5 Idea of Method
1.6 Limitations
1.7 Structure of Report

2 Data
2.1 Data Storage
2.2 Data Introduction
2.3 Data Pre-processing

3 Theory
3.1 Time Series
3.1.1 Discrete Time Series
3.1.2 Stationarity
3.1.3 White Noise
3.1.4 Auto-Covariance Function & Auto-Correlation Function
3.1.5 Backshift Operator
3.2 Model Estimation
3.2.1 AR
3.2.2 MA
3.2.3 ARMA
3.2.4 ARIMA
3.2.5 SARIMA
3.2.6 Canova-Hansen Test
3.2.7 KPSS Test
3.2.8 Innovations Algorithm
3.2.9 Maximum Likelihood Estimation
3.2.10 AIC
3.2.11 Hyndman-Khandakar Algorithm
3.3 Model Forecasting
3.3.1 Forecasting ARIMA
3.3.2 Forecasting SARIMA
3.3.3 Prediction Intervals of Forecasts
3.4 Model Validation
3.4.1 Residuals
3.4.2 Residual Diagnostics
3.4.3 Ljung-Box Test
3.4.4 Shapiro-Wilks Test
3.4.5 Bonferroni Correction
3.5 Model Evaluation
3.5.1 Absolute Error & Mean Absolute Error
3.5.2 Total Bias Error
3.6 Regression Tree
3.7 Sampling
3.7.1 Sampling with Replacement
3.7.2 Non-parametric Bootstrap

4 Method
4.1 Data Visualization
4.2 Customer Volume Prediction & Data Generation
4.2.1 Different Methods of Partitioning
4.2.2 Regression Tree
4.2.3 Time Series Modelling
4.2.4 Time Series Model Validation
4.2.5 Sampling & Evaluation
4.2.6 Predicted Risk Distribution
4.3 Programs & Software

5 Results
5.1 Expert-driven Approach
5.1.1 Data Visualization & Partitioning
5.1.2 Time Series Modelling
5.1.3 Evaluation of Simulated Sample Distribution
5.1.4 Predicted Risk Distribution
5.2 Data-driven Approach
5.2.1 Regression Tree
5.2.2 Time Series Modelling
5.2.3 Evaluation of Simulated Sample Distribution
5.2.4 Predicted Risk Distribution
5.3 Method Comparison
5.3.1 Time Series Modelling
5.3.2 Risk Distributions
5.4 Time Series Model Validation
5.4.1 Method 1
5.4.2 Method 2

6 Conclusion

7 Discussion & Future Improvements
7.1 Discussion
7.2 Alternative Methods
7.2.1 Linear Models & GAM
7.2.2 Neural Networks Time Series
7.2.3 Multivariate Time Series
7.2.4 Alternative Univariate Time Series
7.2.5 Variable Selection & Data Partition
7.2.6 Expert- vs. Data-driven Approach
7.2.7 Stratified Sampling
7.3 Future Improvements
7.3.1 Data Collection
7.3.2 Modelling & Forecasting
7.3.3 Customer Simulation

8 References

A Appendix
A.1 Time Series Models - Method 1
A.2 Time Series Models - Method 2

List of Figures

2.1 Total number of observations for each communication channel over time.
3.1 Representation of a regression tree.
4.1 Flow chart of the risk distribution procedure.
5.1 Total number of observations for each communication channel over the entire time period of the cleaned data set.
5.2 Top: Mean centered volume of observations for counties 1, 2 and 3. Bottom: Mean centered volume of observations for counties 4, 5 and 6. Both panels concern the entire time period of the cleaned data set.
5.3 Total predicted volume for Method 1 and the current method, as well as observed volume, for four three-month time periods. The line represents the 95% prediction interval for the total volume predicted by Method 1.
5.4 Distribution of total risk value using two different techniques for determining population volume, for time period 1. The narrow histogram shows the risk distribution using a fixed population volume, and the wider histogram shows the risk distribution when using varying population volume. The distributions are visualized together with the point prediction, 95% confidence interval, the true value for the time period, and the value estimated by the current method.
5.5 Mean centered customer volume for the four partitions based on Engine output and Drivers license age, over the entire time period of the cleaned data set.
5.6 Total predicted volume for Method 2 and the current method, as well as observed volume, for four different three-month time periods. The line represents the 95% prediction interval for the total volume predicted by Method 2.
5.7 Distribution of total risk value using two different techniques for determining population volume, for time period 1. The narrow histogram shows the risk distribution using a fixed population volume, and the wider histogram shows the risk distribution when using varying population volume. The distributions are visualized together with the point prediction, 95% confidence interval, the true value for the time period, and the value estimated by the current method.
5.8 Total predicted volume for Method 1, Method 2 and the current method, as well as observed volume, for four different three-month time periods.
5.9 Point predictions of Method 1 and Method 2, displayed against the true value of time period 1. Shown against the risk distribution of each respective method to indicate the variation of the time period.
5.10 Residual plot, ACF plot and histogram of residuals for the time series forecast of County 12 and Channel 1, concerning time period 1.
5.11 Residual plot, ACF plot and histogram of residuals for the three-month time series forecast of Partition 2, concerning period 1.

List of Tables

2.1 List and description of the created variables.
2.2 List and description of the selected variables.
5.1 Resulting errors from Method 1 for the four three-month time periods 1-4.
5.2 Resulting errors from Method 2 for the four three-month time periods 1-4.
5.3 Results from comparing Method 1 and Method 2 for the three-month time periods 1-4.
A.1 Resulting models 1-10 of Method 1 when forecasting period 1, showing partition of data, estimated model, and p-values of the Shapiro-Wilks and Ljung-Box hypothesis tests.
A.2 Resulting models 10-22 of Method 1 when forecasting period 1, showing partition of data, estimated model, and p-values of the Shapiro-Wilks and Ljung-Box hypothesis tests.
A.3 Resulting models 23-25 of Method 1 when forecasting period 1, showing partition of data, estimated model, and p-values of the Shapiro-Wilks and Ljung-Box hypothesis tests.
A.4 Resulting models of Method 2 when forecasting period 1, showing partition, estimated model, and p-values of the Shapiro-Wilks and Ljung-Box hypothesis tests.

1 Introduction

The purpose of this section is to provide the reader with the necessary information regarding the project. This includes background about the company and the project, the purpose of the project, as well as the idea of the method and the limitations that were considered.

1.1 Background

Trygg-Hansa is one of the largest Nordic non-life insurance companies and a part of the larger Danish insurance company Codan Forsikring. Trygg-Hansa was founded in 1828 and has, to date, approximately 1300 employees distributed across 20 offices around Sweden. They cater to both businesses and private consumers, with vehicle, home and personal insurance being some of the fields in their scope (Trygg-Hansa 2021).

At Trygg-Hansa, each customer's insurance carries a risk premium based on the calculated risk of the customer. The risk premium is derived from customer data, provided by the customer in an insurance quote, where each answer corresponds to a variable. Each insurance quote can contain anything from 150 to 300 predictive variables, depending on which market the insurance policy resides in. Each premium is based on the predictive variables and calibrated using a price optimization model.

Today, the price optimizer is calibrated using historical data, where Trygg-Hansa assumes that the customer flow from the previous three months reflects the customer flow for the following three months. However, due to factors like seasonal effects and changing customer behavior over time, this assumption does not hold, which affects the price optimization negatively.

1.2 Problem Formulation

How can customer flow be predicted in order to enhance price optimization?


1.3 Purpose & Goal

The aim of this project is to use customer data to detect seasonal effects and trends in customer flow, and to establish a significant relationship between these effects and the flow, so that predictions can be made of future customer flow.

The goal is to provide Trygg-Hansa with a method for modelling expected customer flow from historical customer data. This method could then be used both in unison with, and as a comparison tool for, the current estimation process. The company's request is a generalized approach that can be used interchangeably across all of its markets/sectors. The method should be clearly structured and documented so that it is easily and properly used, since misuse could have a critical impact on the price optimization process. The method should be dynamic and deployable at any time to predict a future time period and create a pseudo data set representing the customer flow of that time period.

The impact goal is that Trygg-Hansa can use the pseudo data set to calibrate the price optimizer, which will result in better price adjustments and the possibility to adjust profit and volume of the efficient frontier to desired levels (Leifland 2021).

1.4 Previous Work

As of today, an effort to adjust for these external factors using a time series analysis on the communication channels has been made, which has improved the predictive capabilities of the price optimization to some extent. However, the company feels that further improvement is needed and therefore wants to find an explainable relation between customer flow and some documented, but untested, external factors such as seasonal effects (Leifland 2021).

1.5 Idea of Method

A time series model will be evaluated using a structure similar to the one the company uses today, in order to make comparisons between predictions. An alternative time series model will also be investigated, using more area-specific data than is used today. The most promising model will serve as a benchmark in the testing and validation of predictions.

Bootstrap sampling will then be used to create the pseudo data set mentioned in Section 1.3.

1.6 Limitations

The aforementioned price optimizer is not investigated further than to say that it is the final destination of the simulated data set. Its inner workings are beyond the scope of this task, which intends to provide Trygg-Hansa with a predicted customer flow from which price optimization can be done.

This project focuses on data from the vehicular sector, but the intent is that the results can be generalized to other sectors/markets as well. Prediction of customer flow is also limited to three-month periods.

1.7 Structure of Report

The data used for this project is presented in Section 2. Section 3 covers the theory behind the method used for the project, which in turn is covered in Section 4. The results from the project are presented in Section 5, followed by Section 6, which deals with conclusions regarding the results. The report is wrapped up by Section 7, concerning the discussion and future improvements.


2 Data

The purpose of this section is to provide the reader with the necessary information about the data used during this project, as well as present the actions taken to clean and prepare the data for analysis.

2.1 Data Storage

Trygg-Hansa stores information regarding all the quotes they receive. This data contains information about the customer and about the insurance item, but also information regarding the result of the quote. This includes whether the customer ended up buying the insurance or not, exactly which insurance the quote referred to, and the estimated insurance risk.

2.2 Data Introduction

The data gathered and presented by Trygg-Hansa was initially analysed through joint visual inspection with the company supervisor. A technical introduction and overview of each variable was conducted by a third party at Trygg-Hansa and attended by both authors. This was done in order to better select the most important variables to be used in the continued work, since the data set used throughout the project consisted of over 270 variables.

Noteworthy is that the original data set contained some duplicate variables, as well as some character variables describing the value of their preceding columns. These were removed to create a lighter and more manageable data set.

The preliminary data set was reduced to 19 variables and approximately 4,500,000 observations. Some original observations were removed due to the nature of the channel from which the observation originated. One example is observations connected to internet search robots, which in this case only contribute generalized information and never any actual purchase- or customer-specific information. Other reductions of observations can be attributed to non-available (NA) values and to typing errors.

2.3 Data Pre-processing

All 19 of the selected variables were parsed in order to detect wrongful entries that could tamper with the data analysis. Through visual inspection it was discovered that some days exhibited strange behaviour that was traceable throughout several variables. After consulting with Trygg-Hansa, it was determined that these days contained clerical errors or wrongfully collected data points, in the form of an incorrect number of observations. There were two cases of wrongful entries: dates that contained too many observations, visible as spikes in the data (see Figure 2.1), and dates that did not contain any values at all. There were six days during 2019 that held too many entries and approximately eight days that did not have any entries at all.


Figure 2.1 – Total number of observations for each communication channel over time.

The dates that held too many entries were removed from the data set and replaced by days that were sampled from data from that same month. To determine the volume of observations used to replace these days, the number of observations for the same weekdays in that specific month was considered. The weekly pattern in the number of observations for that month was visualized, both for that specific year and for the adjacent years. Through visualization, it was possible to see that the weeks during a month followed a certain pattern.

With this relationship established, the volume for such a day was taken as the mean value for the same weekdays of that month, and a corresponding volume of customers was then sampled from the historical data of that same month. In the same manner, the missing days were replaced, so as to fill gaps in the data that could otherwise have a negative impact on the modelling results.
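A minimal sketch of the volume-determination step described above is given below (illustrative only; the column layout, the helper name impute_daily_volume and the set of flagged dates are assumptions, not taken from the thesis):

```python
# Sketch: replace a corrupted date's volume with the mean volume of the same
# weekday in the same month, computed from the remaining (clean) days.
import pandas as pd

def impute_daily_volume(daily: pd.Series, bad_dates: list) -> pd.Series:
    """daily: volume per calendar day (DatetimeIndex); bad_dates: dates to replace."""
    daily = daily.copy()
    clean = daily.drop(pd.to_datetime(bad_dates))
    for d in pd.to_datetime(bad_dates):
        same = clean[(clean.index.year == d.year)
                     & (clean.index.month == d.month)
                     & (clean.index.weekday == d.weekday())]
        daily.loc[d] = same.mean()   # mean of the same weekdays that month
    return daily
```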

Also, due to a data collecting system change in the second quarter of 2018, the structure of the data changed drastically (see Figure 2.1). This led to the decision to remove all data prior to June 2018, since it did not contain the same information in its variables, with many columns being strictly NA or zero. There is also an argument to be made that behavioural characteristics in the early years (2015-2017) are probably not representative of 2021 and onwards.

A few additional variables were created from the original data and added to the data set. These variables were all related to time, for example a variable for year and one for month. The reason these variables were created was to make the subsequent data analysis easier. One of the main objectives of the project is to investigate changes in customer behavior, which meant looking at the data over time, and especially investigating yearly and monthly changes in order to identify seasonal behavior or trends over time. Creating and adding these variables to the data set, instead of extracting the information from the original variables every time it is needed, made coding and visualization easier.

Due to different ways of storing the data over the years, some variables have been subject to change regarding how they are stored and formatted. One example is the channel variable, where the channel is written with uppercase letters for some observations and with lowercase letters for others. From a coding point of view, these formatting differences result in uppercase entries and lowercase entries being treated as different categories, even though they actually belong to the same category.

To avoid such categorization problems, a few factor levels were renamed for some variables in order to obtain more consistent variable formatting over the entire time period.
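As a minimal sketch (the column name and the placeholder values are assumptions, not taken from the thesis), this kind of harmonization can be done in one line per variable:

```python
# Sketch: harmonizing inconsistently cased factor levels for one variable.
import pandas as pd

df = pd.DataFrame({"Channel": ["WEB", "web", "Phone", "PHONE"]})  # placeholder data
df["Channel"] = df["Channel"].str.lower()  # one category per channel, regardless of case
print(df["Channel"].value_counts())
```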

The final data set included 23 variables. Out of these, 19 variables were selected from the original data set and four were created from the original variables. The four created variables are listed and described in Table 2.1 and the 19 selected variables are listed and described in Table 2.2.

Table 2.1 – List and description of the created variables.

Variable name | Type | Description
Year | Numeric | Year for the observation.
Month | Categorical | Month for the observation.
Week | Categorical | Week for the observation.
Day | Categorical | Day for the observation.


Table 2.2 – List and description of the selected variables.

Variable name | Type | Description
Date | Character | Date of quote.
Term date | Character | Date when the quote was processed.
Channel | Categorical | Channel of communication. 4 levels.
County | Categorical | County code. 21 levels.
Apartment | Numeric | Number of apartment insurances the person has.
Vehicle | Numeric | Number of vehicle insurances the person has.
House | Numeric | Number of house insurances the person has.
Personal | Numeric | Number of personal insurances the person has.
Customer TH | Binary | Describes if the person is a customer at Trygg-Hansa or not.
Customer age | Categorical | Age in years for physical or legal person. 76 levels.
Acquisition days | Categorical | Number of days since the vehicle was acquired. 52 levels.
Vehicle age | Numeric | Age of vehicle in years based on model year.
Insurance age | Categorical | Age of insurance in days. 6 levels.
Drivers license age | Numeric | Number of years with driving license.
Bonus level | Categorical | Discount level for already existing customers. 5 levels.
Car brand | Categorical | Car brand. 256 levels.
Bought | Binary | Describes if a person has purchased an insurance or not.
Users | Numeric | Number of previous users of the car.
Registered | Binary | Describes if the car is actively registered or not.


3 Theory

In this section, the theory concerning the methods used in the project is presented.

3.1 Time Series

3.1.1 Discrete Time Series

A discrete time series is a set of random variables $\{X_t\}$ collected at distinct points in time $t = 1, 2, 3, \ldots$. The mean function of $\{X_t\}$ can be expressed as
$$\mu_X(t) = E(X_t)$$
and the covariance function of $\{X_t\}$ can be expressed as
$$\gamma_X(r, s) = \mathrm{Cov}(X_r, X_s) = E[(X_r - \mu_X(r))(X_s - \mu_X(s))]$$
for all integers $r$ and $s$ (Brockwell & Davis 2002, 1).

3.1.2 Stationarity

A time series $\{X_t\}$ is regarded as stationary if it has similar properties to the "time-shifted" series $\{X_{t+h}\}$ for each integer $h$. Usually this refers to weak stationarity of $\{X_t\}$, namely that

1. $\mu_X(t)$ is independent of $t$

2. $\gamma_X(t + h, t)$ is independent of $t$ for each $h$

A strictly stationary series - where all observations of $\{X_t\}$ and $\{X_{t+h}\}$ have the same joint distribution - is also weakly stationary (Brockwell & Davis 2002, 15).

3.1.3 White Noise

A sequence of uncorrelated random variables $\{X_t\}$ with zero mean and variance $\sigma^2$ is stationary and has the same covariance matrix as a sequence of i.i.d. random variables. Such a sequence is referred to as a white noise process and denoted
$$\{X_t\} \sim \mathrm{WN}(0, \sigma^2)$$
If $\{X_t\}$ in addition is Gaussian, meaning that each element of $\{X_t\}$ is itself a Gaussian random variable, then $\{X_t\}$ is a Gaussian white noise process,
$$\{X_t\} \sim \mathrm{GWN}(0, \sigma^2)$$
(Brockwell & Davis 2002, 16).


3.1.4 Auto-Covariance Function & Auto-Correlation Function

If $\{X_t\}$ is a stationary time series, then the auto-covariance function of $\{X_t\}$ at a fixed amount of passing time (lag) $h$ can be formulated as
$$\gamma_X(h) = \mathrm{Cov}(X_{t+h}, X_t)$$
The auto-correlation function (ACF) of $\{X_t\}$ at lag $h$ can then be defined as
$$\rho_X(h) \equiv \frac{\gamma_X(h)}{\gamma_X(0)} = \mathrm{Cor}(X_{t+h}, X_t)$$
(Brockwell & Davis 2002, 16).

3.1.5 Backshift Operator

The backshift operator $B$ is a notation used for mathematically expressing lags when looking back on previous time periods. The trivial case of $B$ can be defined as
$$B X_t = X_{t-1}$$
that is, looking back one period to the previous observation of $X$. The main use of $B$ is to express differences between current and previous time periods. A general notation for the $d$-th order difference is
$$(1 - B)^d X_t$$
(Hyndman & Athanasopoulos 2018, 8.2).
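As a short worked example (added for illustration, not from the source), the second-order difference expands to
$$(1 - B)^2 X_t = (1 - 2B + B^2) X_t = X_t - 2X_{t-1} + X_{t-2},$$
i.e. the difference of two consecutive first differences.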

3.2 Model Estimation

3.2.1 AR

An autoregressive (AR) model is a time series model where the value of the target variable is a linear combination of the variable's previous values. An AR model of order $p$ can be written as
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + \varepsilon_t$$
where $X_t$ is the target variable, $\phi_1, \ldots, \phi_p$ are model parameters, $\varepsilon_t$ is white noise (Section 3.1.3), $p$ is the number of lags (previous values) used for prediction and $X_{t-i}$ is the value of the variable at time $t - i$ for $i = 1, \ldots, p$. Using backshift notation, this can be formulated as
$$X_t - c = \phi(B) X_t + \varepsilon_t$$
where $c$ is a constant and $\phi(z)$ is the $p$-th degree polynomial defined as
$$\phi(z) = \phi_1 z + \cdots + \phi_p z^p$$
(Hyndman & Athanasopoulos 2018, 8.3).


3.2.2 MA

A moving average (MA) model is a time series model that uses the past forecast errors when performing regression. A moving average process of order $q$ is written as MA($q$) and is defined as
$$X_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}$$
where $c$ is a constant, $\varepsilon_t$ is white noise (Section 3.1.3), $\theta_1, \ldots, \theta_q$ are the parameters of the model and $q$ is the number of lags or previous values. Using backshift notation, this can be formulated as
$$X_t = c + \theta(B) \varepsilon_t$$
where $\theta(z)$ is the $q$-th degree polynomial defined as
$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$$
(Hyndman & Athanasopoulos 2018, 8.4).

3.2.3 ARMA

An autoregressive moving average process, or ARMA process, is generally denoted by ARMA($p, q$) and can, if $\{X_t\}$ is stationary (Section 3.1.2), be defined using backshift notation as
$$\phi(B) X_t = \theta(B) \varepsilon_t$$
The time series $\{X_t\}$ is considered an AR process of order $p$ if $\theta(z) \equiv 1$, and an MA process of order $q$ if $\phi(z) \equiv 1$ (Brockwell & Davis 2002, 83-84).

3.2.4 ARIMA

An ARIMA process (autoregressive integrated moving average) can be denoted by ARIMA($p, d, q$), where $p$ is the order of the autoregressive part (number of time lags), $d$ is the order of differencing (the number of subtractions between an observation and its previous time step) and $q$ is the order of the moving average part (dependency between an observation and earlier errors). A time series $\{X_t\}$ is an ARIMA process if it fulfills the difference equation
$$\phi(B)(1 - B)^d X_t = c + \theta(B)\varepsilon_t, \qquad \varepsilon_t \sim \mathrm{WN}(0, \sigma^2) \tag{1}$$
where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$, $B$ is the backshift operator (Section 3.1.5) and $\varepsilon_t$ is white noise at time $t$. To maintain invertibility and causality, it is assumed that $\phi(z)$ and $\theta(z)$ have no roots for $|z| < 1$ (Hyndman & Khandakar 2008, 9).


3.2.5 SARIMA

A SARIMA, or seasonal autoregressive integrated moving average, model is an extension of a regular ARIMA model that includes seasonal terms, and can be denoted
$$\mathrm{ARIMA}(p, d, q)(P, D, Q)_m$$
where $p$, $d$ and $q$ are the non-seasonal factors in the model, $P$, $D$ and $Q$ are the seasonal equivalents of the non-seasonal factors and $m$ is the number of time steps in one seasonal period (Hyndman & Athanasopoulos 2018, 8.9).

A time series $\{X_t\}$ is a SARIMA process if it fulfills the difference equation
$$\Phi(B^m)\,\phi(B)\,(1 - B^m)^D (1 - B)^d X_t = c + \Theta(B^m)\,\theta(B)\,\varepsilon_t$$
where $\Phi(z)$ and $\Theta(z)$ are polynomials of orders $P$ and $Q$, $B$ is the backshift operator and $\varepsilon_t$ is white noise at time $t$ (Hyndman & Khandakar 2008, 9).
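As an aside (an assumed implementation, not stated in the thesis), this notation maps directly onto the SARIMAX class of the Python package statsmodels, where order corresponds to $(p, d, q)$ and seasonal_order to $(P, D, Q, m)$; the series and orders below are placeholders:

```python
# Minimal sketch: specifying an ARIMA(p,d,q)(P,D,Q)_m model with statsmodels.
# The monthly series `y` and the chosen orders are illustrative only.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(1)
y = pd.Series(
    100 + 10 * np.sin(2 * np.pi * np.arange(60) / 12) + rng.normal(0, 3, 60),
    index=pd.date_range("2016-01-31", periods=60, freq="M"),
)

model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(0, 1, 1, 12))  # m = 12 months
fit = model.fit(disp=False)
print(fit.aic)
```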

3.2.6 Canova-Hansen Test

The polynomials $\phi(z)$ and $\theta(z)$ of an ARIMA process, mentioned in Section 3.2.4, can both have unit roots. For an autoregressive model, a root value close to 1 indicates that the data should be differenced before a model can be fitted, whilst a moving-average root value close to 1 indicates that the data has been over-differenced (Brockwell & Davis 2002, 193-194).

The Canova-Hansen test investigates whether a time series has a stable seasonal pattern or not. This is done by investigating unit roots. The hypotheses are

$H_0$: No unit roots at seasonal frequencies
$H_1$: One or multiple unit roots at one or multiple seasonal frequencies

(Canova & Hansen 1995, 237). If the null hypothesis cannot be rejected, a stable seasonal pattern is assumed. If the seasonal pattern changes sufficiently over time, the null hypothesis is rejected. The Canova-Hansen test is used to select the order of seasonal differencing $D$ (Hyndman & Khandakar 2008, 10).

3.2.7 KPSS Test

The KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test investigates whether a time series is stationary around a deterministic trend or not. The hypotheses are

$H_0$: The time series is trend-stationary
$H_1$: The time series has a unit root

(Kwiatkowski et al. 1992, 159-160). Repeated KPSS tests are used to determine the order of non-seasonal differencing $d$ (Hyndman & Khandakar 2008, 10).
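For reference (an illustration under assumptions, not part of the thesis), the Python package pmdarima exposes both tests through helpers that return the differencing orders directly; the monthly series below is synthetic:

```python
# Sketch: choosing d via repeated KPSS tests and D via the Canova-Hansen test.
import numpy as np
from pmdarima.arima import ndiffs, nsdiffs

rng = np.random.default_rng(0)
t = np.arange(72)
y = 100 + 0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, t.size)  # synthetic

d = ndiffs(y, test="kpss")       # non-seasonal differencing order d
D = nsdiffs(y, m=12, test="ch")  # seasonal differencing order D, m = 12 months
print(f"d = {d}, D = {D}")
```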


3.2.8 Innovations Algorithm

The innovations algorithm is a recursive algorithm applicable to all finite second-order series, whether they are stationary or not. The algorithm is mainly used for forecasting ARIMA/SARIMA, but can also be used to estimate parameters.

For a zero-mean series $\{X_t\}$ with $E|X_t|^2 < \infty$ for each $t$ and covariance $E(X_i X_j) = \kappa(i, j)$, the one-step-ahead predictors can be described as
$$\hat{X}_n = \begin{cases} 0, & \text{if } n \le 1, \\ P_{n-1} X_n, & \text{if } n \ge 2, \end{cases}$$
with mean squared error $v_n = E(X_{n+1} - P_n X_{n+1})^2$.

Start by introducing the innovations, which can be interpreted as the one-step prediction errors
$$U_n = X_n - \hat{X}_n$$
obtained as one predictor is added at a time. The innovations can also be collected in vectors $\mathbf{U}_n = (U_1, \ldots, U_n)^T$ and $\mathbf{X}_n = (X_1, \ldots, X_n)^T$ to get
$$\mathbf{U}_n = A_n \mathbf{X}_n$$
where $A_n$ is a lower triangular matrix containing the coefficients of the best linear predictors for each $t$. If $\{X_t\}$ is stationary then $A_n$ is invertible. The vector of one-step predictors $\hat{\mathbf{X}}_n$ can then be written as
$$\hat{\mathbf{X}}_n = \mathbf{X}_n - \mathbf{U}_n = A_n^{-1}\mathbf{U}_n - \mathbf{U}_n = \left(A_n^{-1} - I_n\right)\left(\mathbf{X}_n - \hat{\mathbf{X}}_n\right) = \Theta_n\left(\mathbf{X}_n - \hat{\mathbf{X}}_n\right)$$
where $\Theta_n = A_n^{-1} - I_n$, and $\mathbf{X}_n$ satisfies the relationship $\mathbf{X}_n = A_n^{-1}\left(\mathbf{X}_n - \hat{\mathbf{X}}_n\right)$. The one-step predictions, from which the predictors $\hat{X}_1, \hat{X}_2, \ldots$ can be determined, are then formulated as
$$\hat{X}_{n+1} = \begin{cases} 0, & \text{if } n = 0, \\ \sum_{j=1}^{n} \theta_{nj}\left(X_{n+1-j} - \hat{X}_{n+1-j}\right), & \text{if } n \ge 1. \end{cases}$$
By then stating that
$$\kappa(i, j) = \mathrm{Cov}(X_i, X_j)$$
and starting at position $i = j = 1$, the coefficients $\theta_{ij}$ and mean squared errors $v_i = E(X_{i+1} - \hat{X}_{i+1})^2$ can be found through recursive iteration of the innovations algorithm:

1. $v_0 = \kappa(1, 1)$

2. $\theta_{n,n-k} = v_k^{-1}\left(\kappa(n+1, k+1) - \sum_{j=0}^{k-1} \theta_{k,k-j}\,\theta_{n,n-j}\, v_j\right), \quad 0 \le k < n$

3. $v_n = \kappa(n+1, n+1) - \sum_{j=0}^{n-1} \theta_{n,n-j}^2\, v_j$

By first solving for $v_0$ and then consecutively for $(\theta_{11}, v_1;\ \theta_{22}, \theta_{21}, v_2;\ \theta_{33}, \theta_{32}, \theta_{31}, v_3;\ \ldots)$, all coefficients and their respective mean squared errors can be determined (Brockwell & Davis 2002, 71-73).


3.2.9 Maximum Likelihood Estimation

The Gaussian likelihood for an ARMA process can be written as
$$L(\phi, \theta, \sigma) = \frac{1}{\sqrt{(2\pi\sigma^2)^n\, v_0 \cdots v_{n-1}}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n}\frac{\left(X_j - \hat{X}_j\right)^2}{v_{j-1}}\right\}$$
where $(X_j - \hat{X}_j)$ are the one-step prediction errors and $v_{j-1}$ their respective variances for $j = 1, \ldots, n$, found through recursion of the innovations algorithm. The maximum likelihood estimators $\hat{\phi}$, $\hat{\theta}$ and $\hat{\sigma}$ of the unknown parameters $\phi$, $\theta$ and $\sigma$ can be found by differentiating $\ln(L(\phi, \theta, \sigma))$ with respect to $\sigma^2$, taking into consideration that the innovations and their variances are independent of $\sigma^2$. The maximum likelihood estimators can then be defined as
$$\hat{\sigma}^2 = n^{-1} S\left(\hat{\phi}, \hat{\theta}\right)$$
where
$$S\left(\hat{\phi}, \hat{\theta}\right) = \sum_{j=1}^{n}\frac{\left(X_j - \hat{X}_j\right)^2}{v_{j-1}} \tag{2}$$
and $\hat{\phi}$ and $\hat{\theta}$ are the values of $\phi$ and $\theta$ which minimize
$$\ell(\phi, \theta) = \ln\left(S(\phi, \theta)\right) + n^{-1}\sum_{j=1}^{n}\ln(r_{j-1})$$

For models such as ARIMA, maximizing the likelihood is comparable to minimizing the least squares estimates (Hyndman & Athanasopoulos 2018, 8.6). The least squares estimates $\tilde{\phi}$ and $\tilde{\theta}$ can be obtained by minimizing the function $S$ in Equation (2), giving the least squares estimate of $\sigma^2$:
$$\tilde{\sigma}^2 = \frac{S\left(\tilde{\phi}, \tilde{\theta}\right)}{n - p - q}$$
where $n$ is the number of observations, $p$ is the autoregressive order and $q$ is the moving average order (Brockwell & Davis 2002, 160).

3.2.10 AIC

AIC, or the Akaike Information Criterion, is a criterion-based procedure for model selection. The AIC for ARIMA/SARIMA can be written as
$$\mathrm{AIC} = -2\log(L) + 2(p + q + P + Q + k)$$
where $k = 1$ if $c \neq 0$ and $k = 0$ otherwise, $p$, $q$, $P$ and $Q$ are the order parameters of the specific model and $L$ is the likelihood of the model (Hyndman & Khandakar 2008, 9).


3.2.11 Hyndman-Khandakar Algorithm

The Hyndman-Khandakar algorithm (Hyndman & Khandakar 2008, 10-11) is a method for identifying the best ARIMA/SARIMA model for a time series, based on the AIC value (Section 3.2.10). For seasonal time series, $m > 1$, the algorithm uses the following steps (a code sketch follows the list):

1. Choose $D$ and $d$ using the Canova-Hansen test (Section 3.2.6) and repeated KPSS tests (Section 3.2.7).

2. Fit four possible initial models:

• ARIMA$(2, d, 2)(1, D, 1)$

• ARIMA$(0, d, 0)(0, D, 0)$

• ARIMA$(1, d, 0)(1, D, 0)$

• ARIMA$(0, d, 1)(0, D, 1)$

3. Select the model with the lowest AIC value.

4. For the selected model, allow different combinations of $p$, $q$, $P$ and $Q$ to vary by $\pm 1$ and allow $c$ to be included/excluded. This tries a total of 13 variations of the selected model.

5. If any of the variations offers a lower AIC value, select that variation as the new selected model.

6. Repeat Steps 4 and 5 until no variation of the selected model offers a lower AIC value than the selected model.
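As an illustration only (the thesis lists its actual software in Section 4.3), the stepwise search above is implemented in the Python package pmdarima as auto_arima; the monthly series below is synthetic and stands in for one partition's customer volume:

```python
# Sketch: stepwise Hyndman-Khandakar search with pmdarima's auto_arima.
import numpy as np
import pmdarima as pm

rng = np.random.default_rng(2)
t = np.arange(72)
y = 200 + 2 * t + 25 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 8, t.size)  # synthetic

model = pm.auto_arima(
    y,
    seasonal=True, m=12,             # monthly seasonality
    stepwise=True,                   # the stepwise variation search described above
    information_criterion="aic",
    suppress_warnings=True,
)
print(model.order, model.seasonal_order, model.aic())
```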

3.3 Model Forecasting

3.3.1 Forecasting ARIMA

If $d \ge 1$, Equation (1) cannot be used to determine a linear predictor for $\{X_t\}$ without making further assumptions. If $\{X_t\}$ is a process that satisfies the difference equation
$$(1 - B)^d X_t = Y_t, \qquad t = 1, 2, \ldots \tag{3}$$
where $\{Y_t\}$ is a causal ARMA($p, q$) process (Section 3.2.3), and if $(X_{1-d}, \ldots, X_0)$ is a random vector uncorrelated with $\{Y_t, t > 0\}$, Equation (3) can, for $d = 1$, be rewritten as
$$X_t = X_0 + \sum_{j=1}^{t} Y_j, \qquad t = 1, 2, \ldots$$
denoting that $\{X_t, t \ge 1\}$ is an ARIMA($p, 1, q$) process with mean $E X_t = E X_0$ and autocovariance $E(X_{t+h} X_t) - (E X_0)^2$. The best linear predictor ($P_n$) of $X_{n+1}$ given the set $\{1, X_0, X_1, \ldots, X_n\}$ can then be considered as the linear predictor based on $\{1, X_0, Y_1, \ldots, Y_n\}$. Due to the linearity of $P_n$, the linear predictor can therefore be formulated as
$$P_n X_{n+1} = P_n(X_0 + Y_1 + \cdots + Y_{n+1}) = P_n(X_n + Y_{n+1}) = X_n + P_n Y_{n+1}$$
where $P_n Y_{n+1}$ can be evaluated using $E(X_0 Y_j)$, $j = 1, \ldots, n + 1$, and $E X_0^2$. The accuracy of $P_n X_{n+1}$ expressed through $\{X_j\}$ is given by the mean squared error of the predictions.

The mean squared error of the $h$-step predictor can be found through
$$\sigma_n^2(h) = E\left(X_{n+h} - P_n X_{n+h}\right)^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j} \chi_r\, \theta_{n+h-r-1,\, j-r}\right)^2 v_{n+h-j-1},$$
where $\theta_{nj}$ and $v_n$ are found by applying the innovations algorithm to the differenced series $\{Y_t\}$ (see Section 3.2.8). Then, through application of
$$\chi(z) = \sum_{r=0}^{\infty} \chi_r z^r = \left(1 - \phi^*_1 z - \cdots - \phi^*_{p+d} z^{p+d}\right)^{-1} \tag{4}$$
the forecasting coefficients $\chi_j$ are found recursively from the coefficients $\phi^*_j$. For large values of $n$ - given an invertible coefficient polynomial $\theta(\cdot)$ - the mean squared error can be approximated as
$$\sigma_n^2(h) = \sum_{j=0}^{h-1} \psi_j^2\, \sigma^2$$
with
$$\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \left(\phi^*(z)\right)^{-1}\theta(z)$$
where $\psi$ approximates $\chi$ through the non-seasonal polynomials ($\theta$ and $\phi$) for $|z| < 1$ (Brockwell & Davis 2002, 198-200).

3.3.2 Forecasting SARIMA

Forecasting for SARIMA processes can be performed in the same way as in Section 3.3.1 by reformulating the operator as $(1 - B)^d (1 - B^s)^D$ to get the difference equation
$$(1 - B)^d (1 - B^s)^D X_t = Y_t$$
The best linear predictor, with $t = n + h$, can be found through
$$X_{n+h} = Y_{n+h} + \sum_{j=1}^{d+Ds} a_j X_{n+h-j}$$
where the $a_j$ are weights for $X_{-d-Ds+1}, \ldots, X_0$, which are uncorrelated with $Y_t$, $t \ge 1$. The best linear predictor $P_n X_{n+h}$ of $X_{n+h}$ based on $\{1, X_{-d-Ds+1}, \ldots, X_n\}$ is then formulated as
$$P_n X_{n+h} = P_n Y_{n+h} + \sum_{j=1}^{d+Ds} a_j P_n X_{n+h-j} \tag{5}$$
where $P_n Y_{n+h}$ is the best linear predictor of the ARMA process $\{Y_t\}$ based on $\{1, Y_1, \ldots, Y_n\}$, and the predictors $P_n X_{n+h}$ can be found for $h = 1, 2, \ldots$ through recursion of Equation (5), setting $P_n X_{n+1-j} = X_{n+1-j}$ for $j \ge 1$. Similarly to Section 3.3.1, the prediction mean squared error of SARIMA processes can be formulated as
$$\sigma_n^2(h) = E\left(X_{n+h} - P_n X_{n+h}\right)^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j} \chi_r\, \theta_{n+h-r-1,\, j-r}\right)^2 v_{n+h-j-1}, \tag{6}$$
where $\theta_{nj}$ and $v_n$ are found by applying the innovations algorithm to the differenced series $\{Y_t\}$. Then, through application of
$$\chi(z) = \sum_{r=0}^{\infty}\chi_r z^r = \left[\phi(z)\,\Phi(z^s)\,(1 - z)^d (1 - z^s)^D\right]^{-1}, \qquad |z| < 1$$
Equation (6) can be approximated through
$$\sigma_n^2(h) = \sum_{j=0}^{h-1} \psi_j^2\, \sigma^2$$
with
$$\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = \frac{\theta(z)\,\Theta(z^s)}{\phi(z)\,\Phi(z^s)\,(1 - z)^d (1 - z^s)^D}, \qquad |z| < 1$$
where $\psi$ approximates $\chi$ through the non-seasonal polynomials ($\theta$ and $\phi$) and the seasonal polynomials ($\Theta$ and $\Phi$), with $\theta(z)\Theta(z^s)$ being non-zero for $|z| < 1$ (Brockwell & Davis 2002, 208-209).

3.3.3 Prediction Intervals of Forecasts

The first-step $(1-\alpha)100\%$ prediction interval of a forecast can be calculated as
$$\hat{X}_{T+1|T} \pm z_{\alpha/2}\,\hat{\sigma}$$
where $z_{\alpha/2}$ is the quantile of the standard normal distribution for the desired confidence level $\alpha$ and $\hat{\sigma}$ is the standard deviation of the residuals. For multi-step prediction, the interval depends on
$$X_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$$
having the estimated variance
$$\hat{\sigma}_h^2 = \hat{\sigma}^2\left[1 + \sum_{i=1}^{h-1} \hat{\theta}_i^2\right]$$
and then a $(1-\alpha)100\%$ prediction interval can be formulated as
$$\hat{X}_{T+h|T} \pm z_{\alpha/2}\,\hat{\sigma}_h$$
where $\hat{\sigma}_h$ is the standard deviation of the $h$-step residuals and $z_{\alpha/2}$ refers to the corresponding quantile of the standard normal distribution (Hyndman & Athanasopoulos 2018, 8.8).
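As a side note (an assumed implementation, not taken from the thesis), these point forecasts and prediction intervals are what a fitted SARIMA model returns through, for example, statsmodels; the series below is synthetic:

```python
# Sketch: three-step-ahead forecasts with 95% prediction intervals.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(3)
y = pd.Series(
    150 + 20 * np.sin(2 * np.pi * np.arange(60) / 12) + rng.normal(0, 5, 60),
    index=pd.date_range("2016-01-31", periods=60, freq="M"),
)

fit = SARIMAX(y, order=(1, 0, 0), seasonal_order=(0, 1, 1, 12)).fit(disp=False)
fc = fit.get_forecast(steps=3)   # h = 1, 2, 3
print(fc.predicted_mean)         # point forecasts X_hat_{T+h|T}
print(fc.conf_int(alpha=0.05))   # 95% prediction intervals
```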

3.4 Model Validation

3.4.1 Residuals

Residuals are the differences between the observed values and the fitted values. The residual for a time series observation can be written as
$$e_t = y_t - \hat{y}_t$$
where $y_t$ is the observed value at time $t$ and $\hat{y}_t$ is the fitted value at time $t$ (Hyndman & Athanasopoulos 2018, 3.3).

3.4.2 Residual Diagnostics

A time series model's ability to make good forecasts can be evaluated using the residuals (Section 3.4.1). A model can be considered good if the residuals

1. are uncorrelated,

2. have zero mean,

3. have constant variance,

4. are normally distributed.

If a forecasting model does not satisfy the first and second conditions, it can be modified and improved upon. If a forecasting model does not satisfy the third and fourth conditions, calculation of prediction intervals may become more difficult (Hyndman & Athanasopoulos 2018, 3.3).


3.4.3 Ljung-Box Test

The Ljung-Box test is a portmanteau test that checks whether there is auto-correlation in a sample of data at a certain number of joint lags ($h$). The hypotheses can be formulated as

$H_0$: The data has no auto-correlation in the first $h$ joint lags
$H_a$: The data has auto-correlation in the first $h$ joint lags

The null hypothesis ($H_0$) is rejected in favour of the alternative hypothesis ($H_a$) at significance level $\alpha = 0.05$ (Hyndman & Athanasopoulos 2018, 3.3).

3.4.4 Shapiro-Wilks Test

The Shapiro-Wilks test is a hypothesis test of the normality assumption, i.e. that a sample of data comes from a normal distribution. The hypotheses can be formulated as

$H_0$: The data comes from a normal distribution.
$H_1$: The data does not come from a normal distribution.

The null hypothesis ($H_0$) is rejected in favour of the alternative hypothesis ($H_1$) at significance level $\alpha = 0.05$, indicating that the data does not come from a normal distribution (Faraway 2014, 80-81).

3.4.5 Bonferroni Correction

When performing hypothesis tests such as those mentioned in Sections 3.4.3 and 3.4.4, there is a critical value $\alpha$, or threshold, chosen to accept or reject the null hypothesis of each individual test. However, when performing multiple hypothesis tests on sub-groups of a data set, the probability of false positives increases if the same threshold is maintained. The $\alpha$-level should therefore be corrected to make sure that too many false positives do not arise.

The Bonferroni correction can be considered a restriction on the test level $\alpha$ for such hypothesis tests, meaning that if a level $\alpha$ is required for the overall test, $\alpha/n$ should be used for each of the individual tests of the $n$ subgroups (Faraway 2014, 87).
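To make the two tests and the correction concrete, the following is an illustrative sketch (not from the thesis; the residuals are synthetic and the number of models is an example):

```python
# Sketch: Ljung-Box and Shapiro-Wilks tests on model residuals, with a
# Bonferroni-corrected significance level.
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import acorr_ljungbox

rng = np.random.default_rng(4)
residuals = rng.normal(0, 1, 60)   # stand-in for one model's residuals

alpha, n_models = 0.05, 25         # example: 25 partition models in one method
alpha_adj = alpha / n_models       # Bonferroni-corrected level

lb_pvalue = float(acorr_ljungbox(residuals, lags=[12], return_df=True)["lb_pvalue"].iloc[0])
sw_stat, sw_pvalue = shapiro(residuals)

print("Reject H0 of no auto-correlation?", lb_pvalue < alpha_adj)
print("Reject H0 of normality?", sw_pvalue < alpha_adj)
```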

3.5 Model Evaluation

3.5.1 Absolute Error & Mean Absolute Error

Absolute error, or AE, is an error measure that can be used to evaluate prediction accuracy. It is defined as
$$\mathrm{AE} = \sum_{j=1}^{n} |y_j - \hat{y}_j|$$
where $y_j$ is the observed value for observation $j$ and $\hat{y}_j$ is the predicted value for observation $j$. AE corresponds to the total amount of error in the measurements.

Mean absolute error, or MAE, is an extension of absolute error. It is defined as
$$\mathrm{MAE} = \frac{1}{n}\sum_{j=1}^{n} |y_j - \hat{y}_j|$$
where $y_j$ is the observed value for observation $j$, $\hat{y}_j$ is the predicted value for observation $j$ and $n$ is the number of predictions. MAE corresponds to the mean error over all predictions (Hyndman & Athanasopoulos 2018, 3.4).

For easier interpretation when comparing methods, MAE can be expressed as a percentage by dividing by the mean of the observed values. This is formulated as
$$\mathrm{MAE}\% = \frac{\frac{1}{n}\sum_{j=1}^{n}|y_j - \hat{y}_j|}{\frac{1}{n}\sum_{j=1}^{n}|y_j|} = \frac{\sum_{j=1}^{n}|y_j - \hat{y}_j|}{\sum_{j=1}^{n}|y_j|}$$

3.5.2 Total Bias Error

Total bias error is an error measure that, unlike absolute error, allows cancellation between over-prediction and under-prediction. Total bias error is defined as
$$\text{Total bias error} = \sum_{j=1}^{n} (y_j - \hat{y}_j)$$
where $y_j$ is the observed value for observation $j$ and $\hat{y}_j$ is the predicted value for observation $j$ (Jiang 2009, 1461).
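For illustration (not part of the thesis), the three measures translate directly into a few lines of NumPy; the observed and predicted volumes below are made-up numbers:

```python
# Sketch: AE, MAE, MAE% and total bias error for a set of volume predictions.
import numpy as np

y = np.array([1200.0, 980.0, 1100.0, 1050.0])       # observed volumes (placeholders)
y_hat = np.array([1150.0, 1020.0, 1080.0, 1110.0])  # predicted volumes (placeholders)

ae = np.sum(np.abs(y - y_hat))      # absolute error
mae = np.mean(np.abs(y - y_hat))    # mean absolute error
mae_pct = ae / np.sum(np.abs(y))    # MAE expressed as a percentage of mean volume
total_bias = np.sum(y - y_hat)      # over- and under-prediction may cancel

print(f"AE={ae:.1f}, MAE={mae:.1f}, MAE%={100 * mae_pct:.2f}%, bias={total_bias:.1f}")
```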

3.6 Regression Tree

A regression tree is a regression model that partitions the predictor space into a number of regions and then makes predictions for new observations based on the region each observation belongs to. Assume a data set with $n$ observations and $p$ variables, with predictors $V_j$ for $j = 1, \ldots, p$, where each observation corresponds to a value $y_i$, $i = 1, \ldots, n$. A regression tree can then be created using recursive binary splitting:

1. Consider all possible predictors $V_j$ and cut-points $s$ for splitting the predictor space into two regions
$$R_1(j, s) = \{V \mid V_j < s\} \quad \text{and} \quad R_2(j, s) = \{V \mid V_j \ge s\}$$

2. Select the predictor and the cut-point that minimize the residual sum of squares
$$\sum_{i:\, v_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, v_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$
where $\hat{y}_{R_1}$ is the fitted value for the observations in $R_1(j, s)$ and $\hat{y}_{R_2}$ is the fitted value for the observations in $R_2(j, s)$.

3. Repeat Steps 1 and 2 for each region until a stopping criterion is reached, which is when the terminal nodes are too small or too few to be split (Ripley 2019).

A visual representation of a regression tree is shown in Figure 3.1.

Figure 3.1 – Representation of a regression tree.

(James et al. 2013, 306-307).
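As an illustrative sketch (the thesis does not prescribe this implementation), recursive binary splitting on an individual-risk response can be performed with scikit-learn's DecisionTreeRegressor; the predictors and risk values below are synthetic stand-ins:

```python
# Sketch: a shallow regression tree on an individual-risk response, whose
# leaves define sub-populations (partitions). All data here is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
n = 2000
X = np.column_stack([
    rng.uniform(50, 300, n),  # stand-in for a predictor such as engine output
    rng.uniform(0, 50, n),    # stand-in for years with a driving licence
])
risk = 0.2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 10, n)  # synthetic risk values

# A limited depth keeps the leaves few enough to act as interpretable partitions.
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=100).fit(X, risk)
partition = tree.apply(X)  # leaf index = partition label for each observation
print(np.unique(partition, return_counts=True))
```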

3.7 Sampling

3.7.1 Sampling with Replacement

Sampling with replacement is a method where each observation can be sampled more than once. When an observation is randomly selected from the population, it is returned to the population before the next observation is randomly sampled (James et al. 2013, 189).

3.7.2 Non-parametric Bootstrap

For a population of interest with cumulative distribution function (cdf) $F$ and probability density function (pdf) $f$, a data set $D$ with $n$ elements, which is a random sample from $F$ with empirical distribution function $\hat{F}$, is assumed to be a representative sample of the population. The population parameter of interest, $\theta$, is unknown but can represent any desired parameter, and is estimated by calculating $\hat{\theta}$ from the sample data. The sample mean is considered an unbiased estimator of the population mean, while any other estimates can be found by using $\hat{F}$ and the estimator $\hat{\theta}$ in place of the population parameter $\theta$ and $F$. This is commonly known as the plug-in principle of the bootstrap, and such estimates are called bootstrap estimates, denoted by an additional $*$.

The general non-parametric bootstrap algorithm can be formulated as:

1. From $D$, estimate $\theta$ by calculating $\hat{\theta}$.

2. From $D$, generate a bootstrap set $S$ by uniformly random sampling of $m$ observations, each with probability $1/n$ of being selected.

3. Repeat Step 2 $a$ times to form the bootstrap sets $S_1, S_2, \ldots, S_a$.

4. Calculate $\hat{\theta}^*_i$ from the $i$-th set, $S_i$.

5. Repeat Step 4 for all $a$ sets.

From the algorithm, a bootstrap cdf ($\hat{F}^*$) is created from all $\hat{\theta}^*_i$, with mean and variance defined as
$$\bar{\hat{\theta}}^* = \frac{1}{a}\sum_{i=1}^{a}\hat{\theta}^*_i, \qquad \hat{\sigma}^2_{\hat{\theta}^*} = \frac{1}{a - 1}\sum_{i=1}^{a}\left(\hat{\theta}^*_i - \bar{\hat{\theta}}^*\right)^2$$
The mean $\bar{\hat{\theta}}^*$ can be considered an estimate of the expected value of the estimator. The $(1 - \alpha)100\%$ bootstrap percentile confidence interval can be determined by
$$\left[\hat{\theta}^*_{\left(a\frac{\alpha}{2}\right)},\ \hat{\theta}^*_{\left(a\left(1-\frac{\alpha}{2}\right)\right)}\right]$$
where $\hat{\theta}^*_{(i)}$ is the $i$-th smallest value of the sample $\hat{\theta}^*_{(1)}, \ldots, \hat{\theta}^*_{(a)}$ (Berrar 2019, 766-770).
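As a purely illustrative sketch (an assumed implementation, not taken from the source), the algorithm and the percentile interval can be written in a few lines of NumPy, here with the mean of synthetic risk values as the statistic of interest:

```python
# Sketch: non-parametric bootstrap of a sample statistic (here the mean risk)
# with a 95% percentile confidence interval. Data is synthetic.
import numpy as np

rng = np.random.default_rng(6)
risk = rng.gamma(shape=2.0, scale=500.0, size=1000)  # stand-in for individual risks

a, alpha = 2000, 0.05
theta_hat = risk.mean()

# Steps 2-4: resample with replacement a times and recompute the statistic.
boot = np.array([rng.choice(risk, size=risk.size, replace=True).mean()
                 for _ in range(a)])

lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
print(f"estimate = {theta_hat:.1f}, 95% percentile CI = ({lower:.1f}, {upper:.1f})")
```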


4 Method

The purpose of this section is to provide the reader with information regarding how the project was performed and which methods were used.

4.1 Data Visualization

The data was initially analysed by checking each variable's behaviour over each year and month to investigate and identify noticeable trends or seasonality in the number of observations. Visualizations were made of total values, mean centered values or proportions of the total volume. This was done in order to get a good sense of the fluctuations around the mean, without getting lost in changes of absolute values, and to more easily distinguish trends over time. Through this type of data visualization, it was possible to determine which variables showed the most promise for further analysis.

4.2 Customer Volume Prediction & Data Generation

In order to solve the problem at hand, the project was divided into two parts. The first part is to find models that represent and fit the data, from which predictions/forecasts can be made. The second part is to draw samples from historical data to create new "simulated" data sets of the population.

4.2.1 Different Methods of Partitioning

In an attempt to create homogeneous sub-populations, on which modelling, predictions and sampling can be performed, the historical data is divided into smaller partitions.

Two types of approaches for the data partitioning were chosen: expert-driven and data-driven.

• Expert-driven: meaning that partitioning is performed based on the variables considered high priority by the company, and where significant trends and changes are identified through data analysis and visual inspection.

• Data-driven: meaning that machine learning methods are used to partition the observations through classification or clustering into smaller subsets or groups.

In this case, the chosen data-driven partitioning method was a regression tree, using the individual risk of each observation as the response.

4.2.2 Regression Tree

A regression tree (Section 3.6) is fitted on the full data set - not just the pre-selected 19 variables - as a way to extract hidden information and find new ways to partition the data. The tree partitions the given data into classes given certain cut-points. In this case, it means dividing the whole population into sub-populations based on the variables the algorithm identified as most significant for the individual risk of a customer.


4.2.3 Time Series Modelling

The data is first partitioned into smaller subsets, according to the expert-driven approach, and aggregated together as the total volume of customers per month and year for each individual partition. The data is then converted to a time series (Section 3.1.1) format and model estimation is performed using the Hyndman-Khandakar algorithm (Section 3.2.11) to fit the most suitable ARIMA/SARIMA model (Section 3.2.4 and 3.2.5) for each partition. A forecast of the volume for each partition of the data is then performed, according to Section 3.3.1 and 3.3.2, for each month for a given time period and finally aggregated together to form a total volume.
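The loop described above can be summarized in a short sketch (illustrative only; the column names, resampling frequency and search settings are assumptions, and pmdarima's auto_arima stands in for the Hyndman-Khandakar step):

```python
# Sketch of the per-partition forecasting loop: aggregate monthly volumes,
# fit one ARIMA/SARIMA model per partition, forecast, and sum the forecasts.
import pandas as pd
import pmdarima as pm

def forecast_total_volume(quotes: pd.DataFrame, horizon: int = 3) -> float:
    """quotes: one row per quote, with a datetime 'Date' and a 'partition' label."""
    total = 0.0
    for _, part in quotes.groupby("partition"):
        monthly = part.set_index("Date").resample("MS").size().astype(float)
        model = pm.auto_arima(monthly, seasonal=True, m=12,
                              suppress_warnings=True, error_action="ignore")
        total += float(model.predict(n_periods=horizon).sum())
    return total

# Usage (assuming `quotes` is such a DataFrame):
# print(forecast_total_volume(quotes, horizon=3))
```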

This procedure is repeated for the data-driven approach, where the data is partitioned into smaller subsets based on individual risk, using the regression tree described in Section 3.6. The two different partitioning approaches are then evaluated using the absolute error, mean absolute error (Section 3.5.1) and total bias error (Section 3.5.2) of the predicted volume for a number of different time periods. Total bias error is used because it allows comparisons between methods with different numbers of models or forecasts.

For each approach, four different three-month time periods are forecasted to evaluate the predictive capabilities of the approach over time. A separate set of models is constructed for each time period, where the number of models depends on the approach used.

The four time periods considered as forecasting horizons are:

• Period 1: November 2020 - January 2021

• Period 2: December 2020 - February 2021

• Period 3: January 2021 - March 2021

• Period 4: February 2021 - April 2021

4.2.4 Time Series Model Validation

Each time series model is subjected to visual inspection of residual plots, auto-correlation plots and an auto-correlation test (the Ljung-Box test, Section 3.4.3), to evaluate the overall fit of the model to the data (Section 3.4.2). The Ljung-Box test is used as the final arbiter of whether a model contains auto-correlation or not (Section 3.1.4). To assess whether the residuals are normally distributed, a Shapiro-Wilks test is performed on each model (Section 3.4.4).

The significance levels of the tests were subject to Bonferroni correction (Section 3.4.5).

4.2.5 Sampling & Evaluation

Based on the forecasted volume of each partition, sampling with replacement (Section 3.7.1) is performed on the partition to which the forecast corresponds, more specifically
