
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

Exploring a personal property pricing method in insurance context using multiple regression analysis

RASMUS GUTERSTAM

VIDAR TROJENBORG


Exploring a personal property pricing method in insurance context using multiple regression analysis

RASMUS GUTERSTAM
VIDAR TROJENBORG

Degree Projects in Applied Mathematics and Industrial Economics (15 hp)
Degree Programme in Industrial Engineering and Management (300 hp)
KTH Royal Institute of Technology, 2019

Supervisor at Hedvig AB: John Ardelius


TRITA-SCI-GRU 2019:168
MAT-K 2019:24

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

In general, insurance companies, and especially their clients, face long and complicated claims processes where payments rarely, and almost reluctantly, are made the same day. A part of this slow-moving procedure is the fact that in some cases the insurer has to value the personal property themselves, which can be a tedious process. In conjunction with the insurance company Hedvig, this project addresses this issue by examining a pricing model for a specific personal property: smartphones, one of the most commonly occurring claim types in the insurance context.

Using multiple linear regression with data provided by PriceRunner, 10 key characteristics out of 91 were found to have significant explanatory power in predicting the market price of a smartphone. The model successfully simulates this market price with an explained variance of 90%. Furthermore, this thesis illustrates an intuitive example of pricing models for personal property of other sorts, identifying the key limiting components to be data availability and product complexity.

Keywords: Applied Mathematics, Multiple Linear Regression, Insurance, Pricing, Valuation, Claims, Smartphone, Actual Cash Value


Sammanfattning

Today, insurance companies and their clients all too often face long and complicated claims processes, where payments as a rule never are made the same day. One part of this slow and drawn-out payout process is the fact that the insurance company must estimate the value of the property on its own, which can be a very complicated process.

In cooperation with the insurance company Hedvig, this report examines a valuation model for one of the most common personal property claim types, namely smartphones.

Using multiple linear regression with data provided by PriceRunner, 10 of 91 key factors were identified as having significant explanatory power when modelling the market value of a smartphone. The resulting model successfully simulates the market value, explaining 90% of the variance. Furthermore, this report illustrates intuitive guidelines for valuation modelling of other types of personal property, while identifying limiting key aspects such as the availability of data and the inherent complexity of the property.

Keywords: Applied Mathematics, Multiple Linear Regression, Insurance, Pricing, Valuation, Claims, Smartphone, Actual Cash Value


Acknowledgements

We wish to express the greatest gratitude to our project supervisor Boualem Djehiche. Furthermore, this work would not have been possible without the support from the wonderful people at Hedvig, and especially our supervisor John Ardelius. Lastly, we want to thank the Department of Mathematics (KTH), the School of Industrial Engineering and Management (KTH) and our peers for their feedback and support in writing this Bachelor's Thesis.


Contents

1 Introduction
  1.1 Background
  1.2 Scope
  1.3 Problem Description

2 Economic Theory
  2.1 Insurance Basics
  2.2 Choice of Market Assumptions
  2.3 Empirical Interpretation
  2.4 Parameterized Valuation

3 Mathematical Theory
  3.1 Multiple Linear Regression
  3.2 Residual Analysis
    3.2.1 Standardized Residuals
    3.2.2 Studentized Residuals
    3.2.3 PRESS Residuals
  3.3 Detection and Treatment of Influential Observations
    3.3.1 Outliers
    3.3.2 Leverage and Influential Observations
    3.3.3 Cook's Distance
  3.4 Variable Transformations
    3.4.1 Box-Cox Method
  3.5 Variable Selection
    3.5.1 Multicollinearity
    3.5.2 Variance Inflation Factors
    3.5.3 Adjusted R2
    3.5.4 Mallow's Cp
    3.5.5 Akaike Information Criteria and Bayesian Analogues
    3.5.6 Cross-Validation

4 Methodology
  4.1 Data
    4.1.1 Data Collection
    4.1.2 Choice of Variables
  4.2 Model Creation and Analysis
    4.2.1 Largest Model Possible
    4.2.2 Model Refinement

5 Result
  5.1 Final Model
  5.2 Diagnostics and Evaluation

6 Discussion
  6.1 Variable Interpretation
  6.2 Model Adequacy
  6.3 Application
  6.4 Extension and Future Research

7 Conclusion

References

1 Introduction

1.1 Background

The fundamental function of insurance companies is to provide certainty of payment in the face of uncertainty of loss. The insurer is to function as a safety net, so that insurees can live their lives knowing that they, their loved ones and their personal property are, to some extent, covered if something should happen. In the case where something gets lost, stolen or broken, the insuree should feel confident that they will be reimbursed an amount roughly equivalent to what they lost, at a reality-based valuation. It is the insurer's task to decide what the correct valuation of the property is, and this task is often manual and complicated, and therefore inaccurate, inconsistent and also costly for the insurer.

To increase accuracy in the valuation of personal property, looking at the wide variety of suppliers, their offers and the characteristics of products in the relevant market would arguably give the most correct valuation motivated by the market. Furthermore, once data is collected about the products, the relationships between and features of the products can be used to explain the price development through mathematical analysis.

The impact of this project can be found both on the side of the insurance providers and on the side of the insurance takers. The project aims to provide better conditions for both parties, resulting in an overall fairer insurance agreement. Since this project may be of interest to insurance companies willing to improve their methods of valuation, the project is carried out in cooperation with the insurance start-up Hedvig¹. The cooperation provides helpful guidelines when looking for data and a better understanding of the world of insurance. For Hedvig, the long-term goal of pricing personal property extends far beyond the scope of smartphone models. The pricing algorithm has the potential to be extended to other items and properties, and can in turn be used to project future valuation development and cash flows.

1.2 Scope

For the scope of this project, attention will be devoted exclusively to smartphones as the personal property of choice. The motivation is that smartphones are easy to categorize and parameterize, while at the same time enabling the data collection process to be narrowed down to a conceivable level. Furthermore, the smartphone market will be limited to Scandinavia, which thus represents the scope of the data. To be able to successfully carry out the project, high-quality data is needed in a time series format. To this end, a data-scraping tool was developed in conjunction with Pricerunner², which in turn provides the required data on a daily basis.

¹ Hedvig: https://www.hedvig.com

Since smartphones are a very common personal property today, they are consequently one of the most frequently occurring claim types (fig. 1) in the insurance context, which further motivates the choice of property.

Figure 1: Claims statistics from Customers of Hedvig (since launch 2016)

Using regression analysis, a modern data-driven approach, and mathematical statistics, the process of valuing personal property can be implemented in an algorithmic manner, with the ambition of giving an improved valuation.

² Pricerunner: https://www.pricerunner.se


1.3 Problem Description

As mentioned in previous sections, the problem has been formulated in consultation with the insurance company Hedvig. In general, insurance companies, and especially their clients, face long and complicated claims processes where payments rarely, and almost reluctantly, are made the same day. A part of this slow-moving procedure is the fact that the insurer has to value the personal property, which can be a tedious process. This project addresses this issue by examining possible pricing models, which could potentially be used to streamline the claim-to-payout process. The research questions that this report aims to address are:

• Is it possible to identify explanatory variables in the context of smartphone pricing and develop a predictive pricing model?

• Can a statistically based pricing framework provide objectivity and help resolve conflicts of interest between insurers and insurees?


2 Economic Theory

2.1 Insurance Basics

To understand the relevance of the research questions presented in this project, it is necessary to cover some basic theory from the insurance industry. Though a very complex business, the crucial parts to understand are the concept of depreciation, claims processes, and the problems these pose for both insurers and insurees.

In general, a claims process can be described as follows. The insuree submits an application stating that, for example, their smartphone has been stolen, and attaches information regarding the approximate age, value and replacement cost (preferably affirmed by receipts) of the smartphone. Once received, the insurer takes these parameters into consideration and depreciates the value accordingly. This value is referred to as the Actual Cash Value (ACV) [13], which can be explained as the value the item could be sold for just before the accident occurred. Depending on the insurance policy, this adjusted value may be the final reimbursement. However, another common approach is to also take the actual replacement costs into consideration.
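As a purely illustrative sketch of the depreciation step (the straight-line schedule and its rates below are hypothetical assumptions, not Hedvig's or any other insurer's actual policy), an ACV computation might look as follows:

```python
def actual_cash_value(replacement_cost, age_years,
                      annual_depreciation=0.20, floor_fraction=0.10):
    """Hypothetical straight-line depreciation to an Actual Cash Value.

    The item loses `annual_depreciation` of its replacement cost per year,
    but the ACV never drops below `floor_fraction` of the replacement cost.
    (Illustrative schedule only; not any insurer's actual policy.)
    """
    depreciated = replacement_cost * (1 - annual_depreciation * age_years)
    return max(depreciated, replacement_cost * floor_fraction)

# A two-year-old smartphone that would cost 8000 SEK to replace:
print(actual_cash_value(8000, age_years=2))  # 4800.0
```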

The process described may sound reasonable and structured, but it comes with several problems, and in addition, a conflict of interest arises. To start with, the process is rarely smooth and efficient in practice. An insurance application is complicated: the insurer has to identify potential fraud, process the attached information, and take the risk profile of the insuree into consideration as well. All of this, in combination with the fact that the insurer, by the nature of a conventional for-profit company, is looking to maximize a profit, leads to a serious conflict of interest against the insuree. Considering the scope of this project, it is easy to understand that an insurer has many reasons to make the lowest possible estimation of the ACV for a claim, and no incentive to make it accurate and fair, whereas the consumer on the other hand is seeking to maximize the ACV. It is also important to note that depreciation policies are subjective, and a negotiation is therefore always possible between the two parties. However, with the two parties normally consisting of an individual as the insuree and an insurance company as the insurer, there is obviously an uneven distribution of bargaining power.

Analyzing these issues, it can be concluded that transparency and the lack of common denominators between the insurer and insuree are key factors. Once a valuation has been proposed, the insuree rarely has any insight into how and why the depreciation has been conducted, and even if they did, it is most likely based on the subjectivity of the insurer [13]. On the other hand, the insurer has no reason to believe the valuation proposed by the insuree.

Instead, the two parties and the process itself would benefit from an objective valuation independent of any potential interests. The valuation should be transparent and act as a third party in the claims process, providing empirically based estimations.

2.2 Choice of Market Assumptions

To perform an analysis of the smartphone market in Scandinavia, some assumptions regarding the market, the different actors, and the incentives of pricing need to be stated. Without delving too deep into the different schools of economics, the fundamental assumptions in this report are rooted in the concept of modern capitalism through mainstream Keynesian economic theory and the Austrian school of economic thought.

When trying to find the value of a personal property, its value is considered to be heavily dependent on the subjective preferences of the buyer and seller. When the valuations are aggregated for a group of consumers, or even whole markets on a macro level, the subjective preferences lose their granularity and become homogenized into a universal market valuation with a market-wide preference [4]. When looking at an aggregated market-wide valuation through this macro perspective, the idea is that the preferences can be broken down into a few core aspects. When analyzing properties of a materialistic sort, such as smartphones, cars, computers etc., we classify these aspects as the specifications or characteristics of the property.

2.3 Empirical Interpretation

Mathematically, the formulation of the market valuation of a property can be interpreted in many different ways. More importantly, the pricing of a personal property depends on the type of property. This can be displayed in the following example, where the value of a property in the form of a company is conventionally given, in the financial formulation, by its assets, or the sum of its equity and liabilities. This is shown by the well known expression

$$\text{Value}_{\text{company}} = \text{Assets} = \text{Equity} + \text{Liabilities} \tag{1}$$

On the contrary, a property of a much smaller magnitude that lacks the traits of an actual company can be classified as a personal property, belonging or commodity. An asset of this kind does not fit the concept of having assets and liabilities. Moreover, the market value of a personal property is more heavily influenced by personal preference than that of a company. Assuming that a personal property of smaller scale is more dependent on the subjective perception of its owner, the subjective perception itself can be identified as the main valuation mechanism [5]. The personal perception is in turn influenced by external factors further affecting the valuation. The personal valuation can be seen as a function P(x), dependent on the property-specific information x available to and perceived by the person.

Examples of such information include the rarity of the property, the availability of complements or substitutes, market-specific currency developments, and so on. The list of potentially perceivable information quickly becomes larger than the amount of information that can physically be processed by a consumer. Arguing that the perceived value of a property exists detached from the objective valuation, the property itself is always objectively the same property, even though consumers value it differently. This phenomenon shows that an objective valuation of a property may exist, but that it is always influenced by the personal perception of each consumer. This can be expressed by putting the personal valuation function P(x) in relation to the theoretical objective market value, resulting in the highly conceptual formula

$$\text{Value}_{\text{perceived}} = P(x) \cdot \text{Value}_{\text{market}}, \qquad x = \begin{bmatrix} x_1 = \text{Complements} \\ x_2 = \text{Rarity} \\ x_3 = \text{Trendiness} \\ \vdots \\ x_m = \dots \end{bmatrix} \tag{2}$$

It is obvious that the complexity quickly gets out of hand when trying to theorize a comprehensive formula explaining every aspect of the valuation. Since the world and the markets therein are dynamic, decisions today affect the outcomes of tomorrow with very low predictability. As the world economy becomes more globalized, markets, consumers and vendors become increasingly intertwined, resulting in a higher grade of international economic dependency. Sudden changes in a market on one side of the world may very well affect the supply chain of a good on the geographically opposite side, resulting in a change in valuation and skyrocketing prices.

From another perspective, when buying equity in a company or purchasing a personal property, the consumer is in reality buying a part of the expected future profits, or expected future utility. The market then rewards the goods that consumers demand more with a higher price, and one can certainly make the argument that future profits are conditional on how the property is valued by the consumer. The market valuation is where the subjective preferences of the buyer meet the seller's motivations to sell [4].

2.4 Parameterized Valuation

As it seems extremely difficult to predict a consumer's valuation of a property explicitly, by looking at the consumer-specific unique set of values and information, it might be easier to approach the valuation process in reverse and find the valuation implicitly.

This would mean that instead of trying to find out how different consumers think, and then aggregating all consumers into a singular market valuation, the attributes of the property would be the subject of analysis and compared against the prices it is being traded or sold for. By finding relationships between property-specific characteristics and price, the idea is that, at the aggregated market level, the personal perception is equalized and the market valuation can thus be extracted from the homogenized market value through its characteristics.

This would mean that the valuation is dependent on the property-specific characteristics y, where all of the characteristics are measured and contained in vector format. The combination of them all gives the total market value of the property through a function C(y). Thus the valuation would follow the conceptual formula

$$\text{Value}_{\text{market}} = C(y), \qquad y = \begin{bmatrix} y_1 = \text{Characteristic}_1 \\ y_2 = \text{Characteristic}_2 \\ y_3 = \text{Characteristic}_3 \\ \vdots \\ y_m = \text{Characteristic}_m \end{bmatrix} \tag{3}$$

In this project the function will be assumed to be linear, which will enable multiple linear regression to be used to find linear relationships between the characteristics and market price.

(20)

3 Mathematical Theory

3.1 Multiple Linear Regression

Given a multiple linear regression setting, the largest model possible can, in matrix notation, be formulated with respect to our data as

$$y = X\beta + \epsilon \tag{4}$$

where the matrices are given by

$$y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} & \dots & x_{1m} \\ 1 & x_{21} & x_{22} & x_{23} & \dots & x_{2m} \\ 1 & x_{31} & x_{32} & x_{33} & \dots & x_{3m} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & x_{n3} & \dots & x_{nm} \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix} \tag{5}$$

The coefficients β are unknown and will therefore be estimated using the ordinary least-squares estimator:

$$\hat{\beta} = (X'X)^{-1}X'y \tag{6}$$
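As a minimal illustration of Eq. (6), the estimator can be computed directly with NumPy (which the project already relies on); the data here is synthetic, made up only for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                              # observations, regressors
X = np.column_stack([np.ones(n),           # intercept column of ones
                     rng.normal(size=(n, m))])
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y; lstsq solves the same normal equations
# in a numerically stable way.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                            # close to beta_true
```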

However, the regression model requires some assumptions to hold:

1. The error terms are normally and independently distributed.

2. The error terms have zero mean and constant variance σ².

3. The relation between the response variable and its regressors is approximately linear.

To analyze and verify these assumptions, a thorough analysis of the model and more specifically its residuals must be conducted.

3.2 Residual Analysis

The formal definition of a residual is the difference between an observed value and its corresponding fitted value, that is

$$e_i = y_i - \hat{y}_i \tag{7}$$

Assuming normality among the residuals, one can estimate their distribution as

$$e_i \sim N(0, MS_{Res}) \tag{8}$$

where the mean squared error, or the unbiased estimate of the error variance, is defined as

$$MS_{Res} = \frac{SS_{Res}}{n-p} = \frac{e'e}{n-p} \tag{9}$$

A residual may therefore be viewed as the deviation between the data and the fit. When performing an analysis of the residuals, it is possible to look at the ordinary residuals defined above to identify possible outliers. However, an efficient complement is to also analyze scaled versions of them, to investigate further characteristics, as presented below.

3.2.1 Standardized Residuals

The easiest way to scale residuals is to standardize them, i.e. to divide each residual by the square root of the estimated error variance MS_Res. One can use the fact that standardized residuals have approximately unit variance and zero mean, which makes a large standardized residual a potential outlier [3]. Formally, the standardized residuals are computed as

$$d_i = \frac{e_i}{\sqrt{MS_{Res}}} \tag{10}$$

In the process of standardizing the data, it is practical to categorize what makes an outlier. It is fairly reasonable to investigate any point that is more than 2 standard deviations off, and especially any point that is more than 3 off.

3.2.2 Studentized Residuals

Another approach to scaling residuals is to studentize them. The problem with standardized residuals is that MS_Res is only an approximation of the variance of the ith residual. An improvement is therefore to scale each e_i by the exact standard deviation of the ith residual.

The vector of residuals can be written using the hat matrix H and the corresponding identity matrix I:

$$e = (I - H)y, \qquad H = X(X'X)^{-1}X' \tag{11}$$

It is also known that both H and (I − H) are idempotent and symmetric. Therefore the residuals are the same linear transformation of the observations y as of the errors ε. Using the fact that (I − H) is idempotent, the corresponding covariance matrix is

$$\text{Var}(e) = \text{Var}[(I - H)y] = \sigma^2(I - H) \tag{12}$$

Element-wise, the variance of the ith residual e_i is Var(e_i) = σ²(1 − h_ii), which combined with MS_Res as our estimator of σ² finally gives the ith studentized residual:

$$r_i = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}} \tag{13}$$


What this result actually tells us is that the value of a studentized residual is related to the value of h_ii, which in turn can be interpreted as a measure of distance in x-space. It can therefore be very helpful to use these residuals to detect influential points, since r_i will increase if h_ii does. One should also note that the difference between the studentized and the standardized residuals will be very small when analysing a large data set, since the estimated variance converges to the true variance as the amount of data increases.

3.2.3 PRESS Residuals

The PRESS residual is another way to detect outliers. It is based on the idea that, for each observation i = 1, 2, ..., n, the ith observation is deleted, the regression model is fitted to the remaining observations, and the predicted value ŷ_(i) of y_i is calculated. The prediction error, or PRESS residual, is then

$$e_{(i)} = y_i - \hat{y}_{(i)} \tag{14}$$

However, it is shown by Montgomery et al. [3] that each PRESS residual can be computed as

$$e_{(i)} = \frac{e_i}{1 - h_{ii}} \tag{15}$$

In a similar way as with the studentized residuals, the hat matrix gives us information regarding the distance in x-space and scales the ordinary residuals accordingly. That is, a large value of h_ii signals a potential influential point.
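Since all three scaled residual types above are functions of the ordinary residuals and the hat matrix, they can be computed together; a sketch, assuming a design matrix X (with an intercept column) and response y:

```python
import numpy as np

def residual_diagnostics(X, y):
    """Ordinary, standardized, studentized and PRESS residuals,
    all derived from the hat matrix of Eq. (11)."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # leverages h_ii
    e = y - H @ y                           # ordinary residuals
    ms_res = e @ e / (n - p)                # MS_Res, Eq. (9)
    d = e / np.sqrt(ms_res)                 # standardized, Eq. (10)
    r = e / np.sqrt(ms_res * (1 - h))       # studentized, Eq. (13)
    press = e / (1 - h)                     # PRESS, Eq. (15)
    return d, r, press
```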

3.3 Detection and Treatment of Influential Observations

3.3.1 Outliers

Observations that differ considerably in y-space from the rest of the data are called outliers. As mentioned in section 3.2 regarding residual analysis, they are often recognized by unusually large residuals. The approach used to determine what constitutes a large residual can vary; comparing the standardized residuals can give some insight by locating observations more than two standard deviations from the rest of the sample.

3.3.2 Leverage and Influential Observations

The location of observations in x-space also plays an important role in determining the regression coefficients. Such points are called leverage points and are distinct from outliers. Leverage points can seriously disturb the model, since the outlying observations may weigh in differently, causing a significantly different outcome than without those values.

As noted earlier in Eq. 12, H determines the variances and covariances of ŷ and e. Furthermore, the elements h_ij of the matrix H may be interpreted as the amount of leverage exerted by the ith observation y_i on the jth fitted value ŷ_j.

• Small h_ij ⇒ y_i plays a small role in ŷ_j
• Large h_ij ⇒ y_i plays a large role in ŷ_j

To define what a "large" h_ii is, it can be compared to the average size of the diagonal elements of H, whose sum is

$$\text{Tr}(H) \overset{\text{def}}{=} \sum_{i=1}^{n} h_{ii} = p \tag{16}$$

Traditionally, any observation i whose h_ii exceeds twice the average diagonal element of H, which by Eq. (16) gives

$$h_{ii} > 2\bar{h} = \frac{2}{n}\sum_{i=1}^{n} h_{ii} = \frac{2p}{n} \tag{17}$$

is considered a leverage point and should therefore be investigated [3].

3.3.3 Cook’s Distance

It is desirable to consider both the location of the observation in x-space and the response variable when measuring influence. A commonly used method to estimate the influence of a data point uses a measure of the squared distance between the least-squares estimate based on all n points, β̂, and the estimate obtained by deleting the ith point, β̂_(i) [10]. This distance measure can be expressed in the general form

$$D_i(M, c) = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\, M\, (\hat{\beta}_{(i)} - \hat{\beta})}{c}, \quad i = 1, 2, \dots, n \tag{18}$$

The usual choices are M = X'X and c = p·MS_Res, so that Eq. 18 becomes

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\, X'X\, (\hat{\beta}_{(i)} - \hat{\beta})}{p \, MS_{Res}}, \quad i = 1, 2, \dots, n \tag{19}$$

Points with large values of D_i have considerable influence on the least-squares estimates β̂; points for which D_i > 1 are usually considered influential. The equation for D_i may be rewritten as

$$D_i = \frac{r_i^2}{p} \cdot \frac{\text{Var}(\hat{y}_i)}{\text{Var}(e_i)} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, 2, \dots, n \tag{20}$$

Thus, apart from the constant p, D_i is the product of the squared studentized residual and the leverage ratio h_ii/(1 − h_ii). There are several cut-off values to use for spotting highly influential points; a very common and simple operational guideline is D_i > 1 [7]. Finally, to summarize leverage and influential points, a combined influence/leverage plot can be useful to give a coherent view of the data.
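In practice these measures need not be computed by hand. A sketch using statsmodels (an assumption; the thesis itself works in R), flagging leverage points by the 2p/n rule of Eq. (17) and influential points by the D_i > 1 guideline, given arrays X and y:

```python
import statsmodels.api as sm

# X (with intercept column) and y are assumed to hold the data.
model = sm.OLS(y, X).fit()
infl = model.get_influence()

h = infl.hat_matrix_diag                # leverages h_ii
cooks_d, _ = infl.cooks_distance        # Cook's distance D_i
n, p = X.shape

leverage_points = h > 2 * p / n         # Eq. (17): h_ii > 2p/n
influential_points = cooks_d > 1        # operational guideline D_i > 1
```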

3.4 Variable Transformations

3.4.1 Box-Cox Method

If the model shows non-normality and/or non-constant variance, a transformation can be used to adjust the data accordingly. A power transformation by Box and Cox [9] can be applied, where y is raised to a power λ, which is a parameter to be determined. The parameters of the regression model and the parameter λ can be estimated simultaneously using the method of maximum likelihood. This is implemented by maximizing

$$L(\lambda) = -\frac{1}{2}\, n \ln\left[SS_{Res}(\lambda)\right] \tag{21}$$

or, equivalently, by minimizing the residual-sum-of-squares function SS_Res(λ). An approximate 100(1 − α) percent confidence interval for λ consists of the values of λ that satisfy

$$L(\hat{\lambda}) - L(\lambda) \le \frac{1}{2}\chi^2_{\alpha,1} \tag{22}$$

Now, the confidence interval can be constructed by plotting L(λ) against λ and placing a horizontal line at height

$$L(\hat{\lambda}) - \frac{1}{2}\chi^2_{\alpha,1} \tag{23}$$
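A sketch of this maximum-likelihood search, implemented as a grid search over Eq. (21); the grid range is an assumption, and the geometric-mean scaling is the standard device for making SS_Res comparable across different λ:

```python
import numpy as np

def boxcox_lambda(X, y, lambdas=np.linspace(-1, 2, 301)):
    """Grid search for the lambda maximizing Eq. (21), i.e. minimizing
    SS_Res of the regression of the transformed response on X.
    Uses the standard geometric-mean scaling so that SS_Res values
    are comparable across different lambdas."""
    n = len(y)
    gm = np.exp(np.log(y).mean())           # geometric mean of y
    best_lam, best_ll = None, -np.inf
    for lam in lambdas:
        if abs(lam) < 1e-8:                 # limiting case: log transform
            z = gm * np.log(y)
        else:
            z = (y**lam - 1) / (lam * gm**(lam - 1))
        e = z - X @ np.linalg.lstsq(X, z, rcond=None)[0]
        ll = -0.5 * n * np.log(e @ e)       # Eq. (21)
        if ll > best_ll:
            best_lam, best_ll = lam, ll
    return best_lam
```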

3.5 Variable Selection

3.5.1 Multicollinearity

When regression variables have a near-linear dependence, this is called multicollinearity. The actual problem with linearly dependent regressors is the redundancy of information, i.e. they do not contribute anything new to the model. If one does not take these risks into consideration, the result may be a flawed model. To identify and analyze potential multicollinearity, a couple of methods are available.

3.5.2 Variance Inflation Factors

The VIF, or Variance Inflation Factor, is a measure of how much the variance of a given regression coefficient is inflated by its dependence on other variables. A large VIF therefore indicates potential multicollinearity and, in turn, poor model estimators. To compute the VIF for a given regression coefficient β_j, the diagonal elements of C, the inverse of the correlation matrix of the regressors, are used:

$$VIF_j = C_{jj} = (1 - R_j^2)^{-1} \tag{24}$$

According to Montgomery et al. [3], practical experience has shown that VIFs exceeding 5 or 10 are a strong indication of multicollinearity.
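A sketch of Eq. (24): the VIFs are simply the diagonal of the inverse of the regressors' correlation matrix (the input X_reg, the regressor matrix without the intercept column, is an assumption):

```python
import numpy as np

def vif(X_reg):
    """VIFs per Eq. (24): the diagonal of the inverse of the
    correlation matrix of the regressors (no intercept column)."""
    C = np.linalg.inv(np.corrcoef(X_reg, rowvar=False))
    return np.diag(C)

# Rule of thumb from Montgomery et al. [3]: investigate VIFs above 5-10.
```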

3.5.3 Adjusted R2

A simple way to analyze a model and its variables is to use the adjusted R². The original R² is a way to measure how well the regression line fits the data, but it comes with some drawbacks. For example, every time a predictor is added to a model, R² increases even if the improvement is due to chance; it is blind to overfitting. The adjusted R² addresses this problem by taking the number of predictors into account, increasing only if the new term contributes more to the model than would be expected by chance. For a p-term equation, the adjusted R² is given by:

$$R^2_{\text{adj},p} = 1 - \left(\frac{n-1}{n-p}\right)\left(1 - R^2_p\right) \tag{25}$$

3.5.4 Mallow’s Cp

Developed by Colin Lingwood Mallows, the Mallow's Cp statistic also provides a way to assess the fit of a regression model, related to the mean square error of a fitted value. It is defined as:

$$C_p = \frac{SS_{Res}(p)}{\hat{\sigma}^2} - n + 2p \tag{26}$$

3.5.5 Akaike Information Criteria and Bayesian Analogues

Another means for model selection is the Akaike Information Criterion. Given a set of data, it acts as an estimator of the relative quality of statistical models. It is based on maximizing the expected entropy [3], or expected information, of the model, and it is similar to the adjusted R² in the sense that it is a measure that penalizes abundant regressors. Given a log-likelihood function L and a model with p parameters, the AIC is defined as:

$$AIC = -2\ln(L) + 2p \tag{27}$$

There is also a Bayesian extension of the AIC, where the main difference is the penalty on the number of parameters: in AIC the penalty term is 2p, whereas BIC instead penalizes with p ln(n), so that

$$BIC = -2\ln(L) + p\ln(n)$$
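For a Gaussian linear model, −2 ln(L) equals n ln(SS_Res/n) up to an additive constant, so both criteria reduce to closed forms; a sketch assuming SS_Res, n and p are taken from a fitted model:

```python
import numpy as np

def aic_bic(ss_res, n, p):
    """AIC and BIC for a Gaussian linear model with p parameters.
    Up to an additive constant, -2 ln(L) = n ln(SS_Res / n)."""
    neg2_loglik = n * np.log(ss_res / n)
    aic = neg2_loglik + 2 * p               # Eq. (27)
    bic = neg2_loglik + p * np.log(n)       # BIC: penalty p ln(n)
    return aic, bic
```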


3.5.6 Cross-Validation

An important aspect of a model is obviously how well it predicts new data, and this is where cross-validation comes in. The purpose is to estimate how well a predictive model performs, by evaluating the model on new data that was not part of building it. There are several ways to implement cross-validation, as it is a concept rather than a specific technique. In this project, a 10-fold cross-validation was implemented in R following An Introduction to Statistical Learning [2].
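The thesis implements this step in R; an equivalent sketch in Python with scikit-learn (the arrays X and y, regressor matrix and transformed response, are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# X, y: regressor matrix and (transformed) response, as assumed above.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # average held-out MSE over the 10 folds
```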


4 Methodology

4.1 Data

4.1.1 Data Collection

The data used consists mainly of daily price quotes for commonly used smartphones in the Scandinavian market, limiting the number of models to approximately 200, each with unique corresponding hardware and software specifications. To retrieve these data sets, a custom web crawler has been developed, which daily pulls everything necessary from PriceRunner.

To work with the data, R and Python have been used, along with analytical frameworks including Pandas and NumPy.

4.1.2 Choice of Variables

To avoid multicollinearity it is necessary to actively consider the trade-off between different variables. Adding or removing a regressor can significantly change the coefficients of the other regressors. In practice this is usually not explicitly tested, but in our case, where multicollinearity invariably is present, it is likely best to drop the offending regressors. A model with less explanatory power is certainly better than a model with incorrect (unstable) explanatory power.

As the original data set consists of 91 different attributes for each and every smartphone model, an evaluation was required. One aspect is of course the actual quality and sufficiency of each specification, but it is also necessary to do some variable selection even at this stage.

As an example, without conducting any form of analysis, it is reasonable to assume that a dummy covariate for whether a smartphone has a GPS or not is redundant these days when it comes to explanatory power over the price. The same argument goes for e.g. the available colors, or whether the device can be connected to a USB port or not. With this first filtering conducted, the variables of interest are the following:


Name               | Type      | Description
-------------------|-----------|---------------------------------------------
category iOS       | Dummy     | Device runs iOS
headphone jack     | Dummy     | Device has 3.5 mm headphone jack
usb type C         | Dummy     | Device has USB type-C
dual sim           | Dummy     | Device has room for dual SIM cards
ram                | Numerical | Amount of RAM in gigabytes
cpu freq           | Numerical | Frequency of processor in GHz
cpu cores hex      | Dummy     | Processor is running 6 cores
cpu cores qua      | Dummy     | Processor is running 4 cores
n cameras          | Numerical | Amount of total cameras in device
mp camera          | Numerical | Amount of megapixels of rear camera
mp front camera    | Numerical | Amount of megapixels of front camera
auto focus         | Dummy     | Device rear camera has auto-focus
max video res 4k   | Dummy     | Device camera(s) can capture video in 4k
img stabilization  | Dummy     | Device camera(s) have image stabilization
bluetooth ver 5.0  | Dummy     | Device has Bluetooth 5.0
IP 68              | Dummy     | Device has IP classification 68
IP 67              | Dummy     | Device has IP classification 67
home btn           | Dummy     | Device has physical home button or not
fp sensor baksida  | Dummy     | Device has fingerprint sensor on back
fp sensor framsida | Dummy     | Device has fingerprint sensor on front
fp sensor display  | Dummy     | Device has fingerprint sensor inside display
screen size        | Numerical | Screen size measured diagonally in inches
pixel density      | Numerical | Amount of pixels per inch
screen type OLED   | Dummy     | Whether screen type is OLED
storage log2       | Numerical | Storage in gigabytes (logarithmic, base 2)
battery capacity   | Numerical | Battery capacity in milliampere-hours
wireless charging  | Dummy     | Device compatible with wireless charging
depth              | Numerical | Device depth measured in millimeters
weight             | Numerical | Device weight measured in grams
price (response)   | Numerical | Device price in Swedish krona (SEK)

In the context of technology, and more specifically hardware, component types can be categorized into two groups: the most recent edition, and everything else. That is, whether a smartphone has a dual or single core is irrelevant when predicting prices; instead, the importance lies in whether it is the latest or not. The actual implication for the analysis is the introduction of dummy variables. This means that the information on whether the device contains the latest technology is carried within the dummy variable, and the base case consequently means that the technology is of the more basic typology. For example, the absence of both dummy variables regarding CPU cores would mean that the device has either one or two cores.

Before conducting analysis based on these variables, it was necessary to transform the storage variable, motivated by the fact that storage is almost always increased by a factor of 2 (8 GB, 16 GB, 32 GB, 64 GB etc.). To retrieve a potential linear relationship, the variable is therefore transformed using a logarithm of base 2 (hence referred to as storage log2).
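With Pandas, which the project already uses, the transformation and the CPU-core dummy coding described above might look like the following sketch (the toy data frame and its column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with a few of the variables above.
phones = pd.DataFrame({
    "storage": [32, 64, 128, 256],         # gigabytes
    "cpu_cores": [2, 4, 6, 4],
    "price": [2490, 4990, 8990, 7490],     # SEK
})

# Storage grows by factors of 2, so log2 linearizes it: each unit step
# of storage_log2 corresponds to one doubling of storage.
phones["storage_log2"] = np.log2(phones["storage"])

# "Latest technology or not" dummies; the base case (one or two cores)
# is carried by the absence of both dummies.
phones["cpu_cores_qua"] = (phones["cpu_cores"] == 4).astype(int)
phones["cpu_cores_hex"] = (phones["cpu_cores"] == 6).astype(int)
```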

4.2 Model Creation and Analysis

4.2.1 Largest Model Possible

Once the data preparation is done, the Largest Model Possible can be created. This model includes all available observations and variables, and explains their relationship according to multiple linear regression analysis. The output of the model is described in the following figure.

Figure 2: Summary of the linear model as Largest Model Possible

The different model output plots show substantial residual spread, outlying data and leverage that may influence the model drastically. From the right tail in the quantile-quantile (Normal Q-Q) plot, it can be observed that the observations deviate from the straight line, indicating that the distribution is skewed; this can be confirmed by looking at the distribution of the data in the following plot.

Figure 3: Distribution of prices in all observations

Since the multiple linear model assumes normality (and symmetry) in the distribution of the data, this is dealt with through a transformation in section 4.2.2.

Furthermore, looking at the values of the Variance Inflation Factors (VIF) strongly suggests the presence of multicollinearity.

Figure 4: Variance Inflation Factors for Largest Model Possible


4.2.2 Model Refinement

Outlier Removal

Below, the different types of residuals for the initial model can be found:

Figure 5: Residuals before model adjustment

The different residual plots clearly show that the largest model possible generates some fitted values that diverge from the observed values, and vice versa. Following theoretical directives regarding residual treatment [3], the observations labeled in red in fig. 5 are removed from the data set.

Note that the observation names (indexes) do not change for the rest of the data when some observations are removed.

The next step is to examine the diagonal of the hat matrix, corresponding to the leverage of each observation.

Figure 6: Leverage measured as diagonal of hat matrix

The red line indicates the theoretical threshold 2h̄, and the observations with red labels are considered to have considerable leverage. The leverage of an observation does not by itself imply influence on the model, but it should be taken into consideration when further analyzing the model. The marked observations are kept in consideration for the next section.


Figure 7: Cook’s Distance

Cook's Distance, illustrated in fig. 7, shows that there are no observations even close to exceeding the theoretical threshold D_i > 1. The observations with the highest values of D_i, marked with red labels, can be compared with the observations marked in the leverage plot in fig. 6. Some of them are indeed also marked as high-leverage observations, which is natural, since high leverage makes them good candidates for a high D_i. Since the values of D_i are so low, no further action is taken, but the differing observations are noted.


Figure 8: Influence plot

The influence plot gathers several important factors of the data distribution in relation to the model. The observations numbered in fig. 8 are a few observations that differ substantially from the model. A qualitative investigation of these observations shows that they are devices with very specific user profiles: smartphones made for rugged conditions, extreme battery life or abnormally high memory capacity. Such devices do not align with the norms of overall market pricing. At the same time, the deviation is not considered substantial enough, and the observations are therefore included in the model.

Transformations

As briefly discussed, the price data is confirmed to be right skewed, as seen in fig. 3. Therefore, a power transformation of the data can be used, where the response variable y is raised to a power λ. A Box-Cox normality plot is created to find the value of λ.


Figure 9: Box-Cox normality plot

The plot suggests a power transformation with λ ≈ 0.222. This transformation is performed, resulting in the price data being distributed according to the following plot.

Figure 10: Distribution of prices in all observations

The price distribution now shows the characteristics of a normal distribution, thus improving the model's fit to the data. The transformation will be used in the final model.


Variable Selection

Given the cleansed data set, several different techniques to assist in the variable selection process are applied to the model. These include adjusted R², the Bayesian Information Criterion and Mallow's Cp, all of which have been illustrated for several potential subsets of variables, as presented below. Note that variable names have been excluded for readability; see fig. 12 for details.

Figure 11: Variable selection using all possible regressions

It is easy to see that all of these produce the exact same results. Using the proposed variables from the first row, the corresponding model performs with an adjusted R² of 0.89, which is considered a reasonable result.


Figure 12: Variable selection

However, as a result of the only marginal decrease in adjusted R² when introducing bluetooth ver 5.0 and fp sensor baksida, the final variable selection consists of these two combined with the suggested variables from the first row. Furthermore, a 10-fold cross-validation is conducted, which suggests that the selection is reasonable.

Name              | Description
------------------|---------------------------------------------
category iOS      | Device runs iOS
dual sim          | Device has room for dual SIM cards
ram               | Amount of RAM in gigabytes
cpu freq          | Frequency of processor in GHz
bluetooth ver 5.0 | Device has Bluetooth 5.0
IP 68             | Device has IP classification 68
fp sensor baksida | Device has fingerprint sensor on back
pixel density     | Amount of pixels per inch
screen type OLED  | Whether screen type is OLED
storage log2      | Storage in gigabytes (logarithmic, base 2)


Figure 13: 10-fold Cross-Validation

Finally, the proposed model is analysed using Variance Inflation Factors (VIF), where Montgomery et al. [3] suggest that VIFs exceeding 5 or 10 are a strong indication of multicollinearity. As can be seen in figure 14, the selected variables fulfill these requirements and therefore do not indicate any severe case of multicollinearity.

Figure 14: VIF of selected variables


5 Result

Because of the dynamic nature of the market, which constantly shifts as the preferences of consumers meet the offers from producers, the prices on the market cannot be modelled dynamically. In the end, the model created gives a momentary snapshot of the current market. The important property of the model is that it describes what actually drives the price: it was reduced from containing almost every measurable characteristic of the devices to only the core influential traits that the market takes into consideration when valuing a device, which is what it was intended to do.

5.1 Final Model

The final proposed model is based on the following variables with corresponding coefficients:

Name              | Coefficient | CI Lower (95%) | CI Upper (95%)
------------------|-------------|----------------|---------------
(Intercept)       | 3.29152     | 3.093          | 3.508
category iOS      | 1.16315     | 1.021          | 1.287
dual sim          | -0.19774    | -0.2747        | -0.1326
ram               | 0.09540     | 0.0558         | 0.1290
cpu freq          | 0.31656     | 0.1891         | 0.4390
bluetooth ver 5.0 | 0.21100     | 0.1262         | 0.3143
IP 68             | 0.18900     | 0.0953         | 0.2755
fp sensor baksida | -0.17784    | -0.2541        | -0.1140
pixel density     | 0.00176     | 0.0013         | 0.0022
screen type OLED  | 0.31315     | 0.2314         | 0.3948
storage log2      | 0.19557     | 0.1503         | 0.2384

Note that the bootstrapped confidence intervals support the relevance of the coefficients, as none of them contain 0. With respect to the previously discussed transformation (λ = 0.2222), the model itself can be compactly specified in matrix notation as

$$\hat{y} = X\hat{\beta}$$

or, with the prediction in its original format,

$$\hat{y} = (X\hat{\beta})^{1/\lambda}$$
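A sketch of producing a price prediction in SEK from the reported coefficients and the back-transform above; the example device's feature vector is hypothetical, so the resulting figure is illustrative only:

```python
import numpy as np

lam = 0.2222
beta_hat = np.array([3.29152, 1.16315, -0.19774, 0.09540, 0.31656,
                     0.21100, 0.18900, -0.17784, 0.00176, 0.31315,
                     0.19557])

# Hypothetical device: intercept, runs iOS, no dual SIM, 4 GB RAM,
# 2.5 GHz CPU, Bluetooth 5.0, IP68, no rear fingerprint sensor,
# 460 ppi, OLED screen, 64 GB storage (log2(64) = 6).
x = np.array([1, 1, 0, 4, 2.5, 1, 1, 0, 460, 1, np.log2(64)])

y_transformed = x @ beta_hat            # prediction on the Box-Cox scale
price_sek = y_transformed ** (1 / lam)  # back-transform: (X beta)^(1/lambda)
print(round(price_sek))                 # roughly 14 000 SEK for this example
```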


5.2 Diagnostics and Evaluation

Figure 15: Linear model diagnostics

Figure 16: Prediction with transformation


Figure 17: Prediction without transformation (in SEK)

Figure 18: Model summary


6 Discussion

The model that has been created reveals the most important characteristics determining the valuation of a modern smartphone. Even though many characteristics are available for each device, a few of them explain most of the valuation level. The model has proven to be reliable, and the relationships between the price and the characteristics are studied and interpreted here. Furthermore, the implications of the model are discussed in the context of insurance claims processes.

6.1 Variable Interpretation

Each characteristic and its corresponding coefficient will now be interpreted with respect to the reason behind the relationship.

category iOS

According to the model, whether or not a smartphone runs iOS is strongly correlated with its price. This result seems reasonable considering the common conception of an exclusive brand, but also in the sense that an Apple-branded smartphone encapsulates valuable components beyond the actual operating system. Even though Android devices may definitely be comparable to an iPhone with respect to their characteristics, the internal spread amongst Android devices is substantially larger, with models ranging from budget to flagship.

dual sim

The possibility of using a second SIM card in a device has proven to be negatively correlated with the price according to the model. This relationship seems counterintuitive, since it directly contradicts the previously used hypothesis that more technology equals a higher price. Instead, the correlation is negative, which needs some other explanation. One theory originates from the usage profiles of dual-SIM devices. The primary smartphone markets for dual-SIM are countries such as India, Nigeria and Brazil [11]. In these markets, the purchasing power is much lower than in Scandinavia, making cheaper devices more popular. Since the dual-SIM feature is only popular in markets where lower priced devices are bought, the feature becomes associated with a lower price in our model. One could argue that the data used to develop this model only encapsulates the Scandinavian market and its characteristics. While this is true and definitely worth noting, it is necessary to distinguish between cause and effect.

Even though the Indian, Nigerian and Brazilian markets are not covered explicitly in this project, these areas generate a demand for cheap dual-SIM compatible devices, which the global market in turn supplies. These models may be produced mainly to target those specific regions, but since the world is becoming increasingly globalized, western markets such as Scandinavia will be affected by these trends as well.

ram

The number of gigabytes of RAM is positively related to the price, which aligns well with intuition. Simply put, more RAM corresponds to more advanced technology and thus a faster phone, which reasonably explains an increase in price.

CPU freq

Specified in units of gigahertz, the CPU frequency of the device is positively correlated with the price. Given the large coefficient and the fact that the majority of models range between 1 and 3 GHz, this regressor seems to play a substantial role in pricing.

bluetooth ver 5.0

As this variable is formulated as a dummy variable, the absence of Bluetooth version 5.0 means that the device has some version of Bluetooth below or above 5.0, or none at all. Bluetooth 5.0 compatibility is positively related to the price, which is reasonable since it is associated with more advanced technology. However, in the rare cases where the device has a version of Bluetooth above 5.0, the model would suggest a lower price. The idea is that this surplus value either is encapsulated in other regressors, such as category iOS, or simply results in decreased precision when predicting high-end devices. This is considered tolerable for this data set, since the occurrences are very few; however, once the newer technology is established on the market, this would inevitably lead to a flawed model.

IP 68

Constructed as a dummy variable, whether a model is classified with an IP code of IP68 or not seems to be positively correlated with the price. However, this classification comes with problems similar to those mentioned in the discussion on Bluetooth 5.0. There are some instances of IP69 in the data set which, by definition, is better than IP68. The problem is that these are not taken into consideration in the current model, since they are too rare. Once established as a new standard, however, this would lead to an outdated model.

fp sensor baksida

The placement of the fingerprint sensor is an important factor for the overall design and feel of a device. While most devices currently have a fingerprint sensor acting as a security mechanism, the placement can vary: in a button on the front, on the side, on the back, or even embedded in the front screen of the device. The placement type that the model suggests to be most significant is on the back, and if this variable is satisfied, the correlation is negative. The explanation could be that placement on the back is a suitable solution that does not compromise the screen size of the device, but instead compromises its aesthetics in some ways. Most premium phones have the fingerprint sensor embedded in the screen, or even use some other type of identification technology. In contrast, lower priced models still use fingerprint sensors, but place them on the back to optimize other characteristics.

pixel density

The model suggests a positive correlation between pixel density and price, which aligns well with intuition: a better screen resolution should reasonably lead to an increased price, since it requires more advanced technology.

screen type OLED

This variable means that the screen type is either OLED or something else. OLED is one of the most modern types of screen technology, compared to the more conventional LCD or LED screens. The model suggests that having an OLED screen correlates with a higher price, which corresponds to a more technically advanced device and consequently a higher price.

storage log2

The logarithmized storage variable relates positively to the price, which aligns well with the common conception of different pricing for different storage configurations of a phone model. It is worth noting that this variable, as a result of its transformation, actually measures how many times the storage has been doubled; that is, an increase from 8 GB to 16 GB of storage is equivalent to an increase from 512 GB to 1024 GB.

6.2 Model Adequacy

The model provides some key measurements of accuracy, mainly presented in fig. 18, and the mathematical accuracy produced by the model can be considered quite satisfactory. Studying fig. 17, we can verify that, in absolute SEK, the model does broadly follow the actual market pricing of smartphones. From the data in fig. 17, the average deviation between the observed value and the value predicted by the model is calculated to be about 724 SEK. This might seem like a lot, but comparing it to the average price spread for all smartphones (the difference between the highest and lowest price), which is approximately 2278 SEK, it can be seen that the model produces prices within the approximate spread, which indicates that the model suggests prices converging toward the theoretical market value.

New technology is constantly being developed, and with it the technologically dependent smartphone market. At the same time, the form factor of smartphones has constantly been pushed to be thinner and thinner; as of now, the thickness of a device rarely exceeds 1 cm. This naturally causes trade-off problems when trying to fit increasingly many technological features into a device while maintaining the form factor as well as a decent battery life standard. An important insight from this is that the incremental effort needed to further improve an already very high-end device is much higher than the effort needed to improve a low-end device. This causes high-end devices to have much larger incremental effects on price when some characteristic is improved, compared to the price increase of a lower-end device seeing the same incremental change. This can be seen in the model, where for many high-end devices an increase in memory capacity from 256 GB to 512 GB corresponds to an increase of over 1000 SEK in price, whereas the same increase for a low-end phone corresponds to barely 500 SEK. Thus, the regression model gives differently steep increments, in absolute terms, in price per increase in device characteristics.
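This asymmetry follows directly from the back-transform: one storage doubling adds a fixed 0.19557 on the transformed scale, and since prices are recovered as ŷ = (Xβ̂)^(1/λ) with 1/λ ≈ 4.5, which is convex, the same additive step is worth more SEK at a higher price level. A sketch with hypothetical baseline levels on the transformed scale:

```python
lam, b_storage = 0.2222, 0.19557        # values from the final model

def price(t):
    """Back-transform from the Box-Cox scale to SEK."""
    return t ** (1 / lam)

for t0 in (6.0, 8.0):                   # hypothetical transformed levels
    gain = price(t0 + b_storage) - price(t0)
    print(f"level {t0}: one storage doubling adds about {gain:.0f} SEK")
# Prints roughly 500 SEK at the lower level and roughly 1300 SEK at the
# higher level, mirroring the asymmetry described above.
```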

6.3 Application

Given this proposed model, it is relevant to once again consider the fundamental conflict of interest between insurers and insurees. As covered in section 2.1, a claims process might at first sight seem fair and reasonable. However, as presented, the insuree is looking to maximize the Actual Cash Value (ACV) of their personal property, whereas the insurer, looking to maximize its profits, wants to minimize the payout by applying excessive depreciation.

Since the model presented in this project originates solely from the specifications, characteristics and retail quotes of several smartphones, it successfully estimates a valuation independent of subjectivity. Instead of keeping the insuree uninformed and hiding how and why depreciation has been applied, it is possible to transparently derive a fair valuation using the objective characteristics of the model. This approach could be advantageous to both parties, and it is worth noting that even though the estimations are not perfect (which, as discussed in section 2.3, is nearly impossible), they may at least act as a solid starting point in negotiations. Consequently, it copes with the uneven distribution of bargaining power between the insurer and the insuree by eliminating a substantial part of the subjectivity.

Though not necessarily related to insurance, another insight that can be derived from the model is whether a device is under- or overvalued on the market. Referring again to fig. 17, all instances where the prices are located above the drawn line (the line representing where the price suggested by the model equals the observed price) are devices considered overvalued according to the model. On the other hand, if the observed price is below the model, the device is undervalued. Using this reasoning, it can be found that, for instance, most Apple devices, and also the latest high-end Samsung devices, are overvalued according to their price in relation to their specifications. What the model does not take into account are device characteristics such as the branding, design and software features (other than OS) of these devices. These factors are definitely present in the actual valuation of the devices, which explains to some degree why the model suggests them to be overvalued. On the other end, devices can also be considered undervalued. After a closer study, it is found that most of the devices that are undervalued according to the model often have extreme characteristics, such as a powerful battery, high pixel density or large storage. The intuition behind the very low actual price may be that while the battery is very good, the form factor or design may lack significant priority. Again, these are characteristics that are not directly included in the model, causing the device to be considered undervalued.

6.4 Extension and Future Research

One of the main purposes of this thesis was to explore a method for personal property pricing, and how it can be used within insurance. From the results it can be seen that it indeed is possible to create a statistical model that finds the market price for an item given its attributes. Based on these findings, other extensions and possibilities appear for enhancing and expanding the model.

First of all, if the explanatory variables in general are known and a data set consisting of historical price quotes is retrieved, it is possible to extend the model to include time as a parameter as well. In practice, this would be relevant when studying price time series of different models and how these could potentially be explained by certain configurations.

Perhaps some models exhibit similar price development given similar configurations, thus enabling increased precision by cross-referencing these related models.

Another potential extension of the problem specification is the applicability of the methods presented in this report to other personal property. In the context of insurance, everything that could possibly be purchased is of interest, which makes it relevant to investigate the potential of using similar data from PriceRunner to develop similar pricing models. An obvious requirement is the availability of data, which is relatively simple in the context of technology, since it is a well documented area based on strict manufacturing standards.

However, estimating retail prices of clothing, jewelry and the like may pose entirely new questions concerning qualitative variables, adjusted approaches and other variations of statistical modelling.

As a result of the nature of the model, the data set and especially the industry, it is difficult, if not impossible, to guarantee the precision of the proposed results in a future scenario. With new technology and new standards constantly emerging, phone prices continually adjust accordingly. The model may perform sufficiently on the provided data set at the time of writing, but a suggestion for future research is to consider the validity and change of explanatory variables under the prevailing technological paradigm, and whether any general pricing patterns can be identified.

In addition, it is also relevant to conduct further research on more potential qualitative variables. Several questions regarding unofficial classifications of smartphone models have arisen throughout the analysis; examples include whether a phone is a niche model (a model with some extreme configuration, selling at a disproportionate price) or whether the brand is Samsung or not.
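Such qualitative variables would enter the model as ordinary dummy variables. The Python sketch below illustrates one possible encoding; both the brand list and the threshold defining a niche configuration are arbitrary assumptions for illustration.

    import pandas as pd

    # Illustrative device list; neither the brands nor the battery
    # capacities are taken from the thesis data set.
    df = pd.DataFrame({
        "brand":       ["Samsung", "Apple", "OnePlus", "Energizer"],
        "battery_mah": [3400, 3100, 3700, 18000],
    })

    df["is_samsung"] = (df["brand"] == "Samsung").astype(int)
    # Crude "niche" flag: an extreme configuration relative to the market,
    # here proxied by an unusually large battery (threshold is arbitrary).
    df["is_niche"] = (df["battery_mah"] > 6000).astype(int)
    print(df)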

Finally, the geographical location of the market is also a highly relevant area to study, since the data in this study is mainly based on Scandinavian price quotes. Several macro variables may therefore be of interest, such as the market-specific price index, or how the characteristics of the technological paradigm in a specific region affect demand for certain smartphone configurations.


7 Conclusion

The main purpose of this project has been to determine which smartphone configurations have explanatory power over a device's market valuation, and how much, and furthermore to use these findings to develop a predictive pricing model.

Starting with a raw data set consisting of 91 different characteristics for each device, collected from PriceRunner, several steps of thorough analysis, evaluation and model creation finally left only 10 variables considered relevant, with sufficient explanatory power.

The resulting model consists of the following variables:

Name                 Description
-----------------------------------------------------------------
(Intercept)          (n/a)
category iOS         Device runs iOS
dual sim             Device has room for dual SIM cards
ram                  Amount of RAM in gigabytes
cpu freq             Frequency of processor in GHz
bluetooth ver 5.0    Device has Bluetooth 5.0
IP 68                Device has IP classification 68
fp sensor baksida    Device has fingerprint sensor on back
pixel density        Amount of pixels per inch
screen type OLED     Whether screen type is OLED
storage log2         Storage in gigabytes (logarithmic, base 2)

This model successfully explains 90.07% (adjusted R2) of the variance in the response variable, price.
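For illustration, a model of this form could be fitted as follows. The Python sketch uses statsmodels on synthetic data, since the original data set is not reproduced here; the column names mirror the table above but their encoding is an assumption, and the coefficients are not those of the thesis.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in for the PriceRunner data set.
    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "category_iOS":      rng.integers(0, 2, n),
        "dual_sim":          rng.integers(0, 2, n),
        "ram":               rng.choice([2, 3, 4, 6, 8], n),
        "cpu_freq":          rng.uniform(1.4, 2.8, n),
        "bluetooth_ver_5_0": rng.integers(0, 2, n),
        "ip_68":             rng.integers(0, 2, n),
        "fp_sensor_back":    rng.integers(0, 2, n),
        "pixel_density":     rng.uniform(250, 550, n),
        "screen_type_OLED":  rng.integers(0, 2, n),
        "storage_log2":      np.log2(rng.choice([32, 64, 128, 256], n)),
    })
    # Fabricated response so the example is self-contained and runnable.
    df["price"] = (1500 + 4000 * df["category_iOS"] + 800 * df["ram"]
                   + 900 * df["storage_log2"] + rng.normal(0, 500, n))

    formula = ("price ~ category_iOS + dual_sim + ram + cpu_freq"
               " + bluetooth_ver_5_0 + ip_68 + fp_sensor_back"
               " + pixel_density + screen_type_OLED + storage_log2")
    fit = smf.ols(formula, data=df).fit()
    print(fit.rsquared_adj)  # the thesis reports 90.07% on the real data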

The ambition was also to explore the possibility of finding the theoretical market value through the model, and the model has proven to provide valuations that lie well inside the overall market spread. This, in combination with the key metrics of the model, shows that the model makes a decent attempt at finding the objective market value. However, as valuations shift constantly along with the market itself, the model only gives a snapshot of the objective market value at a specific point in time. The important point argued in this thesis is that, by extension, modelling prices from data consisting of product characteristics can give a viable idea of that objective market value.

Finally, though the model is sufficient and has good explanatory ability on the given data set, it comes with some limitations. In general, constructing dummy variables in the context of technology makes it possible to explain existing technological standards, but may cause the model to derail in the future due to technological advancements.

